HCLS AI Factory — Deployment and Configuration Guide for NVIDIA DGX Spark¶
Precision Medicine Accelerator on NVIDIA DGX Spark
The HCLS AI Factory is a three-stage precision medicine pipeline that transforms raw genomic sequencing data (FASTQ) into actionable drug discovery candidates. Stage 1 performs GPU-accelerated genomics alignment and variant calling with NVIDIA Parabricks. Stage 2 annotates variants against clinical databases, embeds them into a Milvus vector store, and provides a RAG-powered conversational interface using Anthropic Claude. Stage 3 leverages NVIDIA BioNeMo NIM microservices for structure-aware molecule generation and molecular docking, producing ranked drug candidates with composite scores. The entire platform runs on a single NVIDIA DGX Spark desktop workstation — a $3,999 system powered by the GB10 Grace Blackwell Superchip with 128 GB unified LPDDR5x memory. This guide covers the public fork: everything you need to clone, configure, and deploy the full stack using Docker Compose.
License: Apache 2.0 | Date: February 2026
Table of Contents¶
- Introduction
- Architecture Overview
- Prerequisites
- Environment Preparation
- Repository Setup
- Reference Data Preparation
- Docker Compose Configuration
- Deploy Genomic Foundation Engine (Stage 1)
- Deploy Precision Intelligence Network (Stage 2)
- Deploy Therapeutic Discovery Engine (Stage 3)
- Nextflow Orchestration
- Service Startup and Health
- Monitoring and Observability
- Security Configuration
- Data Management
- Performance Tuning
- Troubleshooting Guide
- VCP/FTD Demo Walkthrough
- Scaling Beyond DGX Spark
- Appendix A: Complete Configuration Reference
- Appendix B: API Reference
- Appendix C: Schema Definitions
- Appendix D: Docker Image Reference
- Appendix E: Validation Checklists
- Appendix F: Glossary
1. Introduction¶
1.1 Purpose¶
This document provides step-by-step instructions for deploying the HCLS AI Factory on an NVIDIA DGX Spark workstation. It covers all three pipeline stages — genomics, RAG-powered variant intelligence, and AI-driven drug discovery — using publicly available components.
1.2 Scope¶
The guide addresses hardware validation, software installation, container deployment, data preparation, pipeline execution, monitoring, security, and troubleshooting. It targets the HCLS AI Factory that runs entirely on Docker Compose without requiring Kubernetes or multi-node infrastructure.
1.3 Audience¶
- Bioinformatics Engineers deploying genomics pipelines on DGX Spark
- ML/AI Engineers integrating RAG and BioNeMo NIM microservices
- DevOps Engineers managing containerized service stacks
- Researchers forking the project for their own precision medicine workflows
1.4 Document Conventions¶
| Convention | Meaning |
|---|---|
| `monospace` | Commands, file paths, code |
| **Bold** | UI elements, key terms |
| *Italic* | Variable values to be replaced |
| `$VARIABLE` | Environment variable |
| `<placeholder>` | User-supplied value |
1.5 Genomics and Drug Discovery Primer¶
This section provides essential background for engineers who may not have a biology or chemistry background.
1.5.1 DNA Sequencing¶
DNA sequencing reads the order of nucleotide bases (A, T, C, G) in an organism's genome. Modern short-read sequencers (e.g., Illumina) produce paired-end reads — two sequences from opposite ends of a DNA fragment. The standard demo sample HG002 is a 30x whole-genome sequencing (WGS) dataset with 2x250 bp paired-end reads, producing approximately 200 GB of FASTQ data.
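Each read in a FASTQ file is a four-line record: an `@`-prefixed identifier, the base sequence, a `+` separator, and a per-base quality string. A minimal illustrative parser (not part of the pipeline) makes the layout concrete:

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ lines.

    Illustrative only — real pipelines use gzip streaming and tools
    like FastQC; this just shows the 4-line record structure."""
    it = iter(lines)
    for header in it:
        seq = next(it)          # base calls (A, T, C, G, N)
        next(it)                # '+' separator line
        qual = next(it)         # per-base Phred quality string
        yield header.lstrip("@").strip(), seq.strip(), qual.strip()

# One paired-end mate as it might appear in HG002_R1.fastq.gz
# (read ID and bases are made up for illustration)
record = [
    "@SRR000001.1/1",
    "GATTACAGATTACA",
    "+",
    "IIIIIIIIIIIIII",
]
read_id, seq, qual = next(parse_fastq(record))
```

Note that the quality string is always the same length as the sequence — one Phred score per base.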
1.5.2 Genomic Foundation Engine Stages¶
| Stage | Input | Tool | Output | Description |
|---|---|---|---|---|
| Quality Control | FASTQ | FastQC | QC Report | Assess read quality and adapter contamination |
| Alignment | FASTQ + Reference | BWA-MEM2 (fq2bam) | BAM | Map reads to GRCh38 reference genome |
| Variant Calling | BAM | DeepVariant | VCF | Identify SNPs and indels vs. reference |
| Annotation | VCF | VEP + ClinVar + AlphaMissense | Annotated VCF | Add functional, clinical, and pathogenicity data |
| Embedding | Annotated VCF | BGE-small-en-v1.5 | Vectors (384-dim) | Convert variant evidence to dense embeddings |
1.5.3 Variant Annotation¶
Variants are annotated from multiple sources:
- VEP (Variant Effect Predictor): Assigns functional consequences and impact levels — HIGH, MODERATE, LOW, or MODIFIER.
- ClinVar: NCBI database of 4.1 million clinical variant interpretations (Pathogenic, Likely Pathogenic, Benign, etc.).
- AlphaMissense: DeepMind model with 71,697,560 missense variant pathogenicity predictions. Thresholds: pathogenic (>0.564), ambiguous (0.34-0.564), benign (<0.34).
1.5.4 Vector Embeddings and RAG¶
Annotated variants are converted to 384-dimensional dense vectors using the BGE-small-en-v1.5 embedding model and stored in Milvus. Retrieval-Augmented Generation (RAG) queries Milvus for relevant genomic evidence, then passes the results as context to Anthropic Claude for natural-language clinical interpretation.
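The retrieve-then-generate flow can be sketched as below. This is a minimal illustration, not the shipped RAG API: the collection name, embedding model, Claude model, and temperature come from this guide, while `build_prompt` and `rag_answer` are hypothetical helpers — the actual prompt template used by the RAG API may differ.

```python
def build_prompt(question, evidence_rows):
    """Assemble the context block passed to Claude (hypothetical format)."""
    context = "\n".join(f"- {row}" for row in evidence_rows)
    return (
        "You are a clinical genomics assistant. Using only the evidence "
        "below, answer the question.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )

def rag_answer(question, top_k=5):
    """Retrieve from Milvus, then generate with Claude (requires live services)."""
    from sentence_transformers import SentenceTransformer
    from pymilvus import connections, Collection
    import anthropic

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")   # 384-dim embeddings
    connections.connect(host="localhost", port="19530")
    collection = Collection("genomic_evidence")
    hits = collection.search(
        data=[model.encode(question).tolist()],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"nprobe": 16}},
        limit=top_k,
        output_fields=["text_summary"],
    )
    evidence = [h.entity.get("text_summary") for h in hits[0]]
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        temperature=0.3,
        messages=[{"role": "user", "content": build_prompt(question, evidence)}],
    )
    return msg.content[0].text

prompt = build_prompt(
    "Is the VCP R155H variant pathogenic?",
    ["chr9:35065263 G>A VCP missense, ClinVar: Pathogenic"],
)
```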
1.5.5 Drug Discovery Pipeline¶
The 10-stage drug discovery pipeline transforms a genomic target into ranked drug candidates:
| Stage | Name | Description |
|---|---|---|
| 1 | Initialize | Load configuration, validate target gene and variant |
| 2 | Normalize Target | Map gene symbol to UniProt ID and canonical name |
| 3 | Structure Discovery | Query RCSB PDB for 3D protein structures, score by resolution and method |
| 4 | Structure Preparation | Download PDB files, extract binding site coordinates |
| 5 | Molecule Generation | Generate SMILES candidates via MolMIM NIM (Port 8001) using seed molecule |
| 6 | Chemistry QC | Filter by Lipinski Rule of Five (MW<=500, LogP<=5, HBD<=5, HBA<=10) |
| 7 | Conformer Generation | Generate 3D conformers with RDKit for docking input |
| 8 | Molecular Docking | Score binding affinity via DiffDock NIM (Port 8002) |
| 9 | Composite Ranking | Rank candidates: 30% generation + 40% docking + 30% QED |
| 10 | Reporting | Generate PDF report with structures, scores, and recommendations |
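The Stage 9 weighting (30% generation + 40% docking + 30% QED) can be sketched as follows. Assumptions: each component is min-max normalized to [0, 1] before weighting, and docking scores (binding affinities, where more negative is better) are inverted during normalization; the shipped `ranking.py` may normalize differently.

```python
def minmax(values, invert=False):
    """Min-max normalize to [0, 1]; invert when lower raw values are better."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [((hi - v) if invert else (v - lo)) / span for v in values]

def composite_rank(candidates):
    """Rank candidates by 0.3*generation + 0.4*docking + 0.3*QED."""
    gen  = minmax([c["gen_score"] for c in candidates])
    dock = minmax([c["dock_score"] for c in candidates], invert=True)
    qed  = minmax([c["qed"] for c in candidates])
    for c, g, d, q in zip(candidates, gen, dock, qed):
        c["composite"] = 0.3 * g + 0.4 * d + 0.3 * q
    return sorted(candidates, key=lambda c: c["composite"], reverse=True)

# Toy example: first candidate wins on all three components
ranked = composite_rank([
    {"smiles": "CCO", "gen_score": 0.9, "dock_score": -8.2, "qed": 0.7},
    {"smiles": "CCN", "gen_score": 0.5, "dock_score": -6.1, "qed": 0.4},
])
```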
1.5.6 End-to-End Data Flow Summary¶
FASTQ (200 GB) ─► Parabricks fq2bam ─► BAM (100 GB) ─► DeepVariant ─► VCF (11.7M variants)
─► Annotation (ClinVar + AlphaMissense + VEP) ─► Milvus (384-dim vectors)
─► Claude RAG (variant interpretation) ─► Target Hypothesis
─► PDB Structure Retrieval ─► MolMIM (molecule generation)
─► DiffDock (molecular docking) ─► Composite Ranking ─► PDF Report
2. Architecture Overview¶
2.1 System Components¶
The HCLS AI Factory comprises three application pipeline stages running on a single DGX Spark:
| Stage | Engine Name | Function |
|---|---|---|
| Stage 1 | Genomic Foundation Engine | FASTQ alignment and variant calling with GPU-accelerated Parabricks |
| Stage 2 | Precision Intelligence Network | Variant annotation, vector embedding, Claude-powered conversational AI, and 11 intelligence agents |
| Stage 3 | Therapeutic Discovery Engine | Structure-aware molecule generation, docking, and composite ranking |
2.2 Technology Stack¶
| Layer | Technology | Version / Details |
|---|---|---|
| Hardware | NVIDIA DGX Spark | GB10 Grace Blackwell Superchip, 128 GB unified LPDDR5x, 20 Arm cores |
| OS | DGX OS | Ubuntu-based, ARM64 (aarch64) |
| Container Runtime | Docker + NVIDIA Container Toolkit | nvidia-docker runtime |
| Orchestration | Docker Compose | Multi-service deployment |
| Pipeline Orchestration | Nextflow | DSL2, multiple profiles |
| Genomic Foundation Engine | NVIDIA Parabricks | 4.6.0-1 |
| Vector Database | Milvus | 2.4 (with etcd + MinIO) |
| Embedding Model | BGE-small-en-v1.5 | 384 dimensions |
| LLM | Anthropic Claude | claude-sonnet-4-20250514 |
| Molecule Generation | BioNeMo MolMIM NIM | 1.0 |
| Molecular Docking | BioNeMo DiffDock NIM | 1.0 |
| Cheminformatics | RDKit | Python library |
| Monitoring | Grafana + Prometheus | 11.0.0 / v2.52.0 |
| GPU Monitoring | DCGM Exporter | Port 9400 |
| Language | Python | 3.10+ |
2.3 Service Architecture¶
The platform deploys 13 services across 13 ports:
| # | Service | Port | Protocol | Description |
|---|---|---|---|---|
| 1 | Landing Page | 8080 | HTTP | Platform entry point and service directory |
| 2 | Genomics Portal | 5000 | HTTP | Genomics pipeline UI and results viewer |
| 3 | RAG API | 5001 | HTTP | REST API for variant queries and RAG |
| 4 | Milvus | 19530 | gRPC | Vector database for genomic evidence |
| 5 | Streamlit Chat | 8501 | HTTP | Conversational AI interface for variant analysis |
| 6 | MolMIM NIM | 8001 | HTTP | BioNeMo molecule generation microservice |
| 7 | DiffDock NIM | 8002 | HTTP | BioNeMo molecular docking microservice |
| 8 | Discovery UI | 8505 | HTTP | Drug discovery pipeline interface |
| 9 | Discovery Portal | 8510 | HTTP | Drug discovery results and reporting portal |
| 10 | Grafana | 3000 | HTTP | Monitoring dashboards |
| 11 | Prometheus | 9099 | HTTP | Metrics collection and storage |
| 12 | Node Exporter | 9100 | HTTP | Host system metrics |
| 13 | DCGM Exporter | 9400 | HTTP | NVIDIA GPU metrics |
Infrastructure services (not externally exposed):
| Service | Port | Purpose |
|---|---|---|
| etcd | 2379 | Milvus metadata store |
| MinIO | 9000 | Milvus object storage |
Note: Services 1–4 and 6–13 are launched via `docker compose up`. The Streamlit UIs (Chat on 8501, Discovery UI on 8505, Discovery Portal on 8510) and RAG API (5001) are started via `./start-services.sh`, which handles both Docker and non-Docker services. For the simplest deployment, run `./start-services.sh start`, which orchestrates everything.
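After startup, a quick liveness sweep of the 13 ports confirms the stack is up. This is an illustrative helper, not a repository script — the port map mirrors the table above:

```python
import socket

SERVICES = {                       # from the Section 2.3 service table
    "Landing Page": 8080, "Genomics Portal": 5000, "RAG API": 5001,
    "Milvus": 19530, "Streamlit Chat": 8501, "MolMIM NIM": 8001,
    "DiffDock NIM": 8002, "Discovery UI": 8505, "Discovery Portal": 8510,
    "Grafana": 3000, "Prometheus": 9099, "Node Exporter": 9100,
    "DCGM Exporter": 9400,
}

def port_open(port, host="localhost", timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_stack():
    """Print up/down status for all 13 service ports."""
    for name, port in sorted(SERVICES.items(), key=lambda kv: kv[1]):
        print(f"{'UP  ' if port_open(port) else 'DOWN'}  {name} ({port})")
```

A TCP connect only proves a listener exists; per-service health endpoints (e.g. `/health` on the Genomics Portal, Section 8.5) remain the authoritative check.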
2.4 Data Flow¶
┌────────────────────────────────────────────────────────────────────────┐
│ HCLS AI Factory — Data Flow │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Stage 1 — Genomic Foundation Engine │
│ FASTQ ──► Parabricks fq2bam ──► BAM ──► DeepVariant ──► VCF │
│ (200 GB) (20-45 min) (100 GB) (10-35 min) (11.7M) │
│ │
│ Stage 2 — Precision Intelligence Network │
│ VCF ──► ClinVar + AlphaMissense ──► VEP ──► Annotated VCF │
│ Annotated ──► BGE-small ──► Milvus (384-dim, COSINE) │
│ Milvus ──► Claude (sonnet-4, temp=0.3) ──► Target Hypothesis │
│ │
│ Stage 3 — Therapeutic Discovery Engine │
│ Target ──► PDB Structures ──► MolMIM ──► Chemistry QC │
│ Conformers ──► DiffDock ──► Composite Ranking ──► PDF Report │
│ (0.3*gen + 0.4*dock + 0.3*QED) │
│ │
└────────────────────────────────────────────────────────────────────────┘
3. Prerequisites¶
3.1 Hardware Requirements¶
| Component | Specification |
|---|---|
| System | NVIDIA DGX Spark |
| GPU | GB10 Grace Blackwell Superchip |
| Memory | 128 GB unified LPDDR5x |
| CPU | 20 Arm cores (10 Cortex-X925 + 10 Cortex-A725) |
| Architecture | aarch64 (ARM64) |
| Price | $3,999 |
Storage requirements:
| Dataset / Component | Size |
|---|---|
| GRCh38 Reference Genome | 3.1 GB |
| FASTQ Input (HG002 30x WGS) | ~200 GB |
| BAM Output (intermediate) | ~100 GB |
| ClinVar Database | ~1.2 GB |
| AlphaMissense Predictions | ~4 GB |
| Milvus Index Data | ~2 GB |
| BioNeMo Model Cache | ~10 GB |
| Total Minimum | ~320 GB |
| Recommended | 1 TB NVMe |
3.2 Software Requirements¶
| Software | Minimum Version | Notes |
|---|---|---|
| DGX OS | Latest | Ubuntu-based ARM64 |
| Docker Engine | 24.0+ | With Compose V2 |
| NVIDIA Container Toolkit | Latest | nvidia-docker runtime |
| CUDA Toolkit | 12.x | Included with DGX OS |
| Python | 3.10+ | For pipeline scripts |
| Nextflow | 23.04+ | DSL2 support required |
| Git | 2.30+ | For repository clone |
| NGC CLI | Latest | For BioNeMo container pulls |
3.3 Network Requirements¶
- Internet access for initial setup (container pulls, data downloads)
- Outbound HTTPS to `api.anthropic.com` for Claude API calls
- Outbound HTTPS to `nvcr.io` for the NGC container registry
- Outbound HTTPS to NCBI and RCSB PDB for reference data downloads
- All service ports (listed in Section 2.3) accessible on localhost
3.4 Access Credentials¶
| Credential | Purpose | How to Obtain |
|---|---|---|
| `ANTHROPIC_API_KEY` | Claude API access | https://console.anthropic.com |
| `NGC_API_KEY` | NVIDIA NGC container registry | https://ngc.nvidia.com |
4. Environment Preparation¶
4.1 DGX Spark Initial Setup¶
Verify the system is a DGX Spark with the expected hardware:
# Verify ARM64 architecture
uname -m
# Expected: aarch64
# Verify CPU cores
nproc
# Expected: 20 (10 Cortex-X925 + 10 Cortex-A725)
# Verify total memory (128 GB)
free -h | grep Mem
# Expected: ~128 GB total
# Verify GPU is detected
nvidia-smi
# Expected: GB10 GPU listed with driver version
4.2 NVIDIA Driver and CUDA Verification¶
# Check NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Expected: 550.x or later
# Check CUDA version
nvcc --version
# Expected: CUDA 12.x
# Verify GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Run a quick GPU test
nvidia-smi -q | head -30
4.3 Docker Installation and Configuration¶
# Verify Docker is installed
docker --version
# Expected: Docker version 24.0+
# Verify Docker Compose V2
docker compose version
# Expected: Docker Compose version v2.x
# Verify NVIDIA runtime is available
docker info | grep -i runtime
# Expected: nvidia runtime listed
# Test GPU access from a container
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Configure Docker daemon for NVIDIA runtime as default:
sudo tee /etc/docker/daemon.json <<'EOF'
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-address-pools": [
{"base": "172.20.0.0/16", "size": 24}
],
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "3"
}
}
EOF
sudo systemctl restart docker
4.4 Python Environment Setup¶
# Verify Python version
python3 --version
# Expected: Python 3.10+
# Create virtual environment
python3 -m venv ~/hcls-env
source ~/hcls-env/bin/activate
# Install core dependencies
pip install --upgrade pip
pip install \
    anthropic \
    pymilvus \
    sentence-transformers \
    rdkit \
    pydantic \
    streamlit \
    fastapi \
    uvicorn \
    requests \
    pandas \
    numpy \
    reportlab \
    biopython
# Nextflow is a Java application, not a pip package — install it separately
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
4.5 NGC CLI Installation¶
# Download NGC CLI for ARM64
wget -O ngc-cli.zip https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/latest/files/ngccli_arm64.zip
# Extract and install
unzip ngc-cli.zip -d ~/ngc-cli
chmod +x ~/ngc-cli/ngc-cli/ngc
export PATH=$PATH:~/ngc-cli/ngc-cli
# Configure NGC CLI
ngc config set
# Enter your NGC API key when prompted
# Verify authentication
ngc registry image list --format_type csv | head -5
5. Repository Setup¶
5.1 Fork and Clone¶
# Fork the repository on GitHub, then clone your fork
git clone https://github.com/<your-username>/hcls-ai-factory.git
cd hcls-ai-factory
# Verify repository structure
ls -la
5.2 Repository Layout¶
hcls-ai-factory/
├── docker-compose.yml # All 13 services + infrastructure
├── .env.example # Template environment configuration
├── nextflow.config # Nextflow pipeline configuration
├── main.nf # Nextflow DSL2 pipeline definition
├── start-services.sh # Service startup script
├── requirements.txt # Python dependencies
│
├── genomics/ # Stage 1: Genomic Foundation Engine
│ ├── parabricks/ # Parabricks configs and scripts
│ │ ├── fq2bam.sh # BWA-MEM2 alignment wrapper
│ │ └── deepvariant.sh # DeepVariant variant calling wrapper
│ ├── portal/ # Genomics Portal (Port 5000)
│ │ └── app.py
│ └── data/ # Input/output data directory
│ ├── reference/ # GRCh38 reference genome
│ ├── fastq/ # Input FASTQ files
│ ├── bam/ # Alignment output
│ └── vcf/ # Variant call output
│
├── rag/ # Stage 2: Precision Intelligence Network
│ ├── api/ # RAG API (Port 5001)
│ │ └── app.py
│ ├── chat/ # Streamlit Chat (Port 8501)
│ │ └── app.py
│ ├── embeddings/ # BGE embedding pipeline
│ │ └── embed_variants.py
│ ├── annotation/ # Variant annotation pipeline
│ │ ├── clinvar.py
│ │ ├── alphamissense.py
│ │ └── vep.py
│ ├── knowledge/ # Gene knowledge base
│ │ └── genes.json # 201 genes, 13 therapeutic areas
│ └── data/
│ ├── clinvar/ # ClinVar database
│ └── alphamissense/ # AlphaMissense predictions
│
├── discovery/ # Stage 3: Drug Discovery Pipeline
│ ├── pipeline/ # 10-stage discovery pipeline
│ │ ├── __init__.py
│ │ ├── initialize.py # Stage 1: Initialize
│ │ ├── normalize.py # Stage 2: Normalize Target
│ │ ├── structure_discovery.py # Stage 3: Structure Discovery
│ │ ├── structure_prep.py # Stage 4: Structure Preparation
│ │ ├── molecule_gen.py # Stage 5: Molecule Generation
│ │ ├── chemistry_qc.py # Stage 6: Chemistry QC
│ │ ├── conformer_gen.py # Stage 7: Conformer Generation
│ │ ├── docking.py # Stage 8: Molecular Docking
│ │ ├── ranking.py # Stage 9: Composite Ranking
│ │ └── reporting.py # Stage 10: Reporting
│ ├── ui/ # Discovery UI (Port 8505)
│ │ └── app.py
│ ├── portal/ # Discovery Portal (Port 8510)
│ │ └── app.py
│ └── models/ # Pydantic data models
│ └── schemas.py
│
├── monitoring/ # Monitoring stack
│ ├── grafana/
│ │ ├── provisioning/
│ │ └── dashboards/
│ ├── prometheus/
│ │ └── prometheus.yml
│ └── exporters/
│
├── landing/ # Landing Page (Port 8080)
│ └── index.html
│
├── scripts/ # Utility scripts
│ ├── run_pipeline.py # Pipeline launcher
│ ├── download_references.sh # Reference data downloader
│ └── validate_deployment.sh # Deployment validator
│
└── docs/ # Documentation
└── ...
5.3 Environment Configuration¶
# Copy the example environment file
cp .env.example .env
# Edit with your credentials and paths
nano .env
The .env file should contain:
# === API Keys ===
ANTHROPIC_API_KEY=sk-ant-api03-XXXXXXXXXXXX
NGC_API_KEY=XXXXXXXXXXXX
# === Model Configuration ===
CLAUDE_MODEL=claude-sonnet-4-20250514
CLAUDE_TEMPERATURE=0.3
# === Reference Data ===
REFERENCE_GENOME=/data/reference/GRCh38.fa
# === Milvus Configuration ===
MILVUS_HOST=localhost
MILVUS_PORT=19530
# === BioNeMo NIM URLs ===
MOLMIM_URL=http://localhost:8001
DIFFDOCK_URL=http://localhost:8002
# === Pipeline Configuration ===
PIPELINE_MODE=full
NUM_CANDIDATES=100
MIN_QED=0.3
MIN_DOCK_SCORE=-6.0
# === Monitoring ===
GRAFANA_USER=admin
GRAFANA_PASSWORD=changeme
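Before starting services, it is worth confirming the required keys are present and non-empty. A minimal sketch — `parse_env` and `missing_keys` are illustrative helpers, not repository scripts, and the required-variable list mirrors the template above:

```python
REQUIRED = ["ANTHROPIC_API_KEY", "NGC_API_KEY", "CLAUDE_MODEL",
            "MILVUS_HOST", "MILVUS_PORT", "MOLMIM_URL", "DIFFDOCK_URL"]

def parse_env(text):
    """Parse KEY=value lines from .env-style text, skipping comments/blanks."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env):
    """Return required keys that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]

# Quick self-check on a fragment of the template
sample = "# comment\nANTHROPIC_API_KEY=sk-ant-xxx\nMILVUS_PORT=19530\n"
env = parse_env(sample)
```

Run it against the real file with `missing_keys(parse_env(open(".env").read()))` from the repository root; an empty list means all required keys are set.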
5.4 Directory Structure for Data¶
# Create data directories
mkdir -p genomics/data/{reference,fastq,bam,vcf}
mkdir -p rag/data/{clinvar,alphamissense}
mkdir -p discovery/data/{structures,molecules,reports}
mkdir -p monitoring/data/{grafana,prometheus}
6. Stage 0: Data Acquisition¶
Automated setup: The `setup-data.sh` script handles all data downloads with automatic retry, checksum verification, and progress tracking. Run `./setup-data.sh --all` from the repository root to download everything. This is a one-time step (~500 GB total). See Stage 0: Data Acquisition for complete troubleshooting.
# Recommended: Automated download of all data (~500 GB)
./setup-data.sh --all
# Or download by stage
./setup-data.sh --stage1 # Genomics: FASTQ + reference (~300 GB)
./setup-data.sh --stage2 # RAG/Chat: ClinVar + AlphaMissense (~2 GB)
./setup-data.sh --stage3 # Drug Discovery: PDB cache (optional)
# Check status
./setup-data.sh --status
The sections below document the manual process for reference. For most deployments, `setup-data.sh` is the recommended approach.
6.1 GRCh38 Reference Genome¶
# Automated (recommended):
./setup-data.sh --stage1 # Downloads reference as part of Stage 1
# Manual alternative:
cd genomics/data/reference
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
# Decompress
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_no_alt_analysis_set.fna GRCh38.fa
# Index the reference (required by Parabricks)
samtools faidx GRCh38.fa
# Verify
ls -lh GRCh38.fa*
# Expected: GRCh38.fa (~3.1 GB), GRCh38.fa.fai
6.2 ClinVar Database¶
# Automated (recommended):
./setup-data.sh --stage2 # Downloads ClinVar + AlphaMissense with verification
# Manual alternative:
cd rag/data/clinvar
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi
# Verify record count (~4.1M clinical variants)
zcat clinvar.vcf.gz | grep -v '^#' | wc -l
# Expected: ~4,100,000
echo "ClinVar download complete"
6.3 AlphaMissense Database¶
# Automated (recommended):
./setup-data.sh --stage2 # Downloads AlphaMissense with gzip integrity check
# Manual alternative:
cd rag/data/alphamissense
wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
# Verify record count (~71.7M predictions)
zcat AlphaMissense_hg38.tsv.gz | tail -n +5 | wc -l
# Expected: ~71,697,560
echo "AlphaMissense download complete"
AlphaMissense pathogenicity thresholds:
| Classification | Score Range | Description |
|---|---|---|
| Pathogenic | > 0.564 | Likely damaging to protein function |
| Ambiguous | 0.34 - 0.564 | Uncertain significance |
| Benign | < 0.34 | Likely tolerated |
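Applied in code, the thresholds map to a simple classifier (illustrative; mirrors the table above, with 0.564 itself falling into the ambiguous band):

```python
def am_class(score):
    """Classify an AlphaMissense pathogenicity score (0.0-1.0).

    Thresholds from the table above: pathogenic > 0.564,
    ambiguous 0.34-0.564, benign < 0.34."""
    if score > 0.564:
        return "pathogenic"
    if score >= 0.34:
        return "ambiguous"
    return "benign"
```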
6.4 HG002 Sample Data¶
# Automated (recommended):
./setup-data.sh --stage1
# Downloads all 68 FASTQ files with MD5 verification, auto-retry,
# and merges into HG002_R1.fastq.gz + HG002_R2.fastq.gz
# Manual alternative (not recommended — no checksum retry):
cd genomics/data/fastq
# GIAB HG002 30x WGS, 2x250 bp paired-end
# Note: These are large files — ensure ~350 GB free space
wget ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.GRCh38.2x250.fastq.gz
echo "HG002 download complete — verify file sizes match expected ~200 GB"
ls -lh *.fastq.gz
Troubleshooting FASTQ downloads: FASTQ files are the most failure-prone download (~200 GB across 68 files from NCBI FTP). If downloads fail checksum verification, the `setup-data.sh` script automatically retries with progressively more conservative settings. See DATA_SETUP.md for detailed troubleshooting.
7. Docker Compose Configuration¶
7.1 Service Definition Overview¶
The docker-compose.yml defines all 13 application services plus 2 infrastructure services (etcd, MinIO) for Milvus. Services are organized into three groups matching the pipeline stages, plus monitoring.
7.2 docker-compose.yml Structure¶
version: '3.8'

services:
  # ─── Infrastructure ───────────────────────────────────────
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    command: etcd -advertise-client-urls=http://etcd:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    ports:
      - "2379:2379"
    volumes:
      - etcd_data:/etcd
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"
    volumes:
      - minio_data:/data
    command: server /data
    restart: unless-stopped

  # ─── Milvus Vector Database ──────────────────────────────
  milvus:
    image: milvusdb/milvus:v2.4-latest
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    depends_on:
      - etcd
      - minio
    volumes:
      - milvus_data:/var/lib/milvus
    restart: unless-stopped

  # ─── Stage 1: Genomics ──────────────────────────────────
  genomics-portal:
    build: ./genomics/portal
    ports:
      - "5000:5000"
    volumes:
      - ./genomics/data:/data
    environment:
      - REFERENCE_GENOME=/data/reference/GRCh38.fa
    restart: unless-stopped

  # ─── Stage 2: RAG Chat ──────────────────────────────────
  rag-api:
    build: ./rag/api
    ports:
      - "5001:5001"
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CLAUDE_MODEL=${CLAUDE_MODEL}
      - CLAUDE_TEMPERATURE=${CLAUDE_TEMPERATURE}
      - MILVUS_HOST=milvus
      - MILVUS_PORT=19530
    depends_on:
      - milvus
    restart: unless-stopped

  streamlit-chat:
    build: ./rag/chat
    ports:
      - "8501:8501"
    environment:
      - RAG_API_URL=http://rag-api:5001
    depends_on:
      - rag-api
    restart: unless-stopped

  # ─── Stage 3: Drug Discovery ────────────────────────────
  molmim:
    image: nvcr.io/nvidia/clara/bionemo-molmim:1.0
    ports:
      - "8001:8001"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  diffdock:
    image: nvcr.io/nvidia/clara/diffdock:1.0
    ports:
      - "8002:8002"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  discovery-ui:
    build: ./discovery/ui
    ports:
      - "8505:8505"
    environment:
      - MOLMIM_URL=http://molmim:8001
      - DIFFDOCK_URL=http://diffdock:8002
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - NUM_CANDIDATES=${NUM_CANDIDATES}
      - MIN_QED=${MIN_QED}
      - MIN_DOCK_SCORE=${MIN_DOCK_SCORE}
    depends_on:
      - molmim
      - diffdock
    restart: unless-stopped

  discovery-portal:
    build: ./discovery/portal
    ports:
      - "8510:8510"
    depends_on:
      - discovery-ui
    restart: unless-stopped

  # ─── Landing Page ───────────────────────────────────────
  landing-page:
    build: ./landing
    ports:
      - "8080:8080"
    restart: unless-stopped

  # ─── Monitoring ─────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9099:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana-oss:11.0.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_USER}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    restart: unless-stopped

  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:latest
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  etcd_data:
  minio_data:
  milvus_data:
  prometheus_data:
  grafana_data:
7.3 Infrastructure Services¶
Milvus 2.4 requires two backend services:
| Service | Image | Port | Purpose |
|---|---|---|---|
| etcd | quay.io/coreos/etcd:v3.5.5 | 2379 | Metadata storage for Milvus |
| MinIO | minio/minio:latest | 9000 | Object storage for Milvus segments |
| Milvus | milvusdb/milvus:v2.4-latest | 19530 | Vector database |
7.4 Volume Mounts and Data Paths¶
| Volume | Container Path | Purpose |
|---|---|---|
| `./genomics/data` | `/data` | Reference genome, FASTQ, BAM, VCF |
| `./rag/data` | `/data` | ClinVar, AlphaMissense databases |
| `etcd_data` | `/etcd` | Milvus metadata persistence |
| `minio_data` | `/data` | Milvus segment persistence |
| `milvus_data` | `/var/lib/milvus` | Milvus index persistence |
| `prometheus_data` | `/prometheus` | Prometheus TSDB |
| `grafana_data` | `/var/lib/grafana` | Grafana state and dashboards |
7.5 GPU Resource Allocation¶
The GB10 GPU is shared across GPU-consuming services. Only one GPU-heavy workload should run at a time:
| Service | GPU Usage | Peak Memory | Typical Duration |
|---|---|---|---|
| Parabricks fq2bam | 70-90% GPU | ~40 GB | 20-45 min |
| Parabricks DeepVariant | 80-95% GPU | ~60 GB | 10-35 min |
| MolMIM NIM | Moderate | ~8 GB | Always running |
| DiffDock NIM | Moderate | ~8 GB | Always running |
| DCGM Exporter | Minimal | Minimal | Always running |
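Since only one GPU-heavy workload should run at a time, a launcher can check free GPU memory before starting a job. Illustrative sketch — `safe_to_launch` is a hypothetical helper and the 4 GB headroom is an assumption, not a repository default:

```python
import subprocess

def gpu_memory_mb(smi_output=None):
    """Return (used_mb, total_mb) parsed from nvidia-smi CSV output."""
    if smi_output is None:
        smi_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"], text=True)
    used, total = smi_output.strip().splitlines()[0].split(",")
    return int(used), int(total)

def safe_to_launch(required_mb, smi_output=None, headroom_mb=4096):
    """True if the GPU has required_mb free after a safety headroom."""
    used, total = gpu_memory_mb(smi_output)
    return (total - used - headroom_mb) >= required_mb

# Example against a canned nvidia-smi line (8 GB used of 128 GB):
sample = "8192, 131072\n"
```

On a live system, a check like `safe_to_launch(60 * 1024)` before starting DeepVariant (peak ~60 GB per the table above) would guard against contention with the always-running NIM services.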
8. Deploy Genomic Foundation Engine (Stage 1)¶
8.1 Parabricks Container Setup¶
# Pull Parabricks container for ARM64
docker pull nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1
# Verify the image
docker images | grep parabricks
# Expected: clara-parabricks 4.6.0-1
8.2 BWA-MEM2 Alignment (fq2bam)¶
The fq2bam tool performs GPU-accelerated read alignment using BWA-MEM2 and produces a sorted, duplicate-marked BAM file.
docker run --rm --gpus all \
-v $(pwd)/genomics/data:/data \
nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
pbrun fq2bam \
--ref /data/reference/GRCh38.fa \
--in-fq /data/fastq/HG002_R1.fastq.gz /data/fastq/HG002_R2.fastq.gz \
--out-bam /data/bam/HG002.bam \
--num-gpus 1
Expected performance:
| Metric | Value |
|---|---|
| Runtime | 20-45 minutes |
| GPU Utilization | 70-90% |
| Peak GPU Memory | ~40 GB |
| Output | Sorted, duplicate-marked BAM (~100 GB) |
8.3 DeepVariant Variant Calling¶
docker run --rm --gpus all \
-v $(pwd)/genomics/data:/data \
nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
pbrun deepvariant \
--ref /data/reference/GRCh38.fa \
--in-bam /data/bam/HG002.bam \
--out-variants /data/vcf/HG002.vcf.gz \
--num-gpus 1
Expected performance:
| Metric | Value |
|---|---|
| Runtime | 10-35 minutes |
| GPU Utilization | 80-95% |
| Peak GPU Memory | ~60 GB |
| Output | Compressed VCF (gzipped) |
8.4 VCF Output Verification¶
# Count total variants
zcat genomics/data/vcf/HG002.vcf.gz | grep -v '^#' | wc -l
# Expected: ~11,700,000 (11.7M variants)
# Count PASS variants with QUAL > 30
zcat genomics/data/vcf/HG002.vcf.gz | grep -v '^#' | \
awk '$7 == "PASS" && $6 > 30' | wc -l
# Expected: ~3,561,170 (3.56M)
# Count SNPs vs Indels
zcat genomics/data/vcf/HG002.vcf.gz | grep -v '^#' | \
awk '{if(length($4)==1 && length($5)==1) print "SNP"; else print "INDEL"}' | \
sort | uniq -c
# Expected: ~4,200,000 SNPs, ~1,000,000 indels
VCF output summary:
| Metric | Expected Value |
|---|---|
| Total variants | ~11.7M |
| PASS variants (QUAL > 30) | ~3.56M |
| SNPs | ~4.2M |
| Indels | ~1.0M |
| Coding region variants | ~35,000 |
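The SNP/indel rule used by the awk one-liner above can be mirrored in Python for programmatic QC (illustrative helper, not a repository script):

```python
import gzip
from collections import Counter

def variant_type(ref, alt):
    """Mirror the awk rule: single-base REF and ALT -> SNP, else INDEL."""
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "INDEL"

def count_types(vcf_path):
    """Tally SNPs vs indels in a gzipped VCF (same logic as the shell pipe)."""
    counts = Counter()
    with gzip.open(vcf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):      # skip header lines
                continue
            cols = line.rstrip("\n").split("\t")
            counts[variant_type(cols[3], cols[4])] += 1   # REF, ALT columns
    return counts
```

Like the awk version, this classifies multi-allelic ALT fields (e.g. `A,T`) as INDEL; a stricter QC pass would split them first.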
8.5 Genomics Portal (Port 5000)¶
After genomics processing, start the portal:
docker compose up -d genomics-portal
# Verify
curl -s http://localhost:5000/health
# Expected: {"status": "healthy"}
Access the Genomics Portal at http://<dgx-spark-ip>:5000 to browse VCF results.
8.6 Performance Benchmarks¶
| Step | Wall Time | GPU Util | Peak Memory | Output Size |
|---|---|---|---|---|
| fq2bam (alignment) | 20-45 min | 70-90% | ~40 GB | ~100 GB BAM |
| DeepVariant (calling) | 10-35 min | 80-95% | ~60 GB | ~1 GB VCF.gz |
| Total Stage 1 | 30-80 min | — | — | — |
9. Deploy Precision Intelligence Network (Stage 2)¶
9.1 Milvus Vector Database Setup¶
# Start Milvus and its dependencies
docker compose up -d etcd minio milvus
# Wait for Milvus to be ready (30-60 seconds)
sleep 30
# Verify Milvus is running
curl -s http://localhost:19530/v1/health/ready
# Expected: {"status":"ok"}
9.2 Collection Schema¶
Create the genomic_evidence collection with 17 fields:
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, utility
# Connect to Milvus
connections.connect(host="localhost", port=19530)
# Define schema with 17 fields
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
FieldSchema(name="chrom", dtype=DataType.VARCHAR, max_length=10),
FieldSchema(name="pos", dtype=DataType.INT64),
FieldSchema(name="ref", dtype=DataType.VARCHAR, max_length=500),
FieldSchema(name="alt", dtype=DataType.VARCHAR, max_length=500),
FieldSchema(name="qual", dtype=DataType.FLOAT),
FieldSchema(name="gene", dtype=DataType.VARCHAR, max_length=100),
FieldSchema(name="consequence", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="impact", dtype=DataType.VARCHAR, max_length=20),
FieldSchema(name="genotype", dtype=DataType.VARCHAR, max_length=10),
FieldSchema(name="text_summary", dtype=DataType.VARCHAR, max_length=5000),
FieldSchema(name="clinical_significance", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="rsid", dtype=DataType.VARCHAR, max_length=20),
FieldSchema(name="disease_associations", dtype=DataType.VARCHAR, max_length=2000),
FieldSchema(name="am_pathogenicity", dtype=DataType.FLOAT),
FieldSchema(name="am_class", dtype=DataType.VARCHAR, max_length=20),
]
schema = CollectionSchema(fields, description="Genomic evidence for RAG")
collection = Collection("genomic_evidence", schema)
# Create IVF_FLAT index on embedding field
index_params = {
"metric_type": "COSINE",
"index_type": "IVF_FLAT",
"params": {"nlist": 1024}
}
collection.create_index("embedding", index_params)
# Load collection into memory
collection.load()
print(f"Collection created: {collection.name}")
print(f"Schema fields: {len(fields)}")
Collection schema reference:
| # | Field | Type | Details |
|---|---|---|---|
| 1 | id | INT64 | Primary key, auto-generated |
| 2 | embedding | FLOAT_VECTOR | 384 dimensions (BGE-small-en-v1.5) |
| 3 | chrom | VARCHAR(10) | Chromosome (chr1-22, chrX, chrY) |
| 4 | pos | INT64 | Genomic position |
| 5 | ref | VARCHAR(500) | Reference allele |
| 6 | alt | VARCHAR(500) | Alternate allele |
| 7 | qual | FLOAT | Variant quality score |
| 8 | gene | VARCHAR(100) | Gene symbol |
| 9 | consequence | VARCHAR(200) | VEP functional consequence |
| 10 | impact | VARCHAR(20) | HIGH, MODERATE, LOW, MODIFIER |
| 11 | genotype | VARCHAR(10) | Sample genotype (e.g., 0/1, 1/1) |
| 12 | text_summary | VARCHAR(5000) | Natural-language variant summary |
| 13 | clinical_significance | VARCHAR(200) | ClinVar classification |
| 14 | rsid | VARCHAR(20) | dbSNP identifier |
| 15 | disease_associations | VARCHAR(2000) | Associated diseases/conditions |
| 16 | am_pathogenicity | FLOAT | AlphaMissense score (0.0-1.0) |
| 17 | am_class | VARCHAR(20) | pathogenic, ambiguous, or benign |
9.3 Variant Annotation Pipeline¶
The annotation pipeline enriches VCF variants with data from three sources:
# Run the annotation pipeline
python3 rag/annotation/clinvar.py \
--vcf genomics/data/vcf/HG002.vcf.gz \
--clinvar rag/data/clinvar/clinvar.vcf.gz \
--output rag/data/annotated_clinvar.tsv
python3 rag/annotation/alphamissense.py \
--vcf genomics/data/vcf/HG002.vcf.gz \
--am rag/data/alphamissense/AlphaMissense_hg38.tsv.gz \
--output rag/data/annotated_am.tsv
python3 rag/annotation/vep.py \
--vcf genomics/data/vcf/HG002.vcf.gz \
--output rag/data/annotated_vep.tsv
Expected annotation matches:
| Source | Total Records | Patient Matches |
|---|---|---|
| ClinVar | 4,100,000 | ~35,616 |
| AlphaMissense | 71,697,560 | ~6,831 (ClinVar-matched with predictions) |
| VEP | Per-variant | All coding variants |
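The matching step shared by these scripts amounts to a key join on (chrom, pos, ref, alt). A minimal sketch of that join; the function and field names here are illustrative assumptions, not the repository's actual implementation:

```python
def match_clinvar(patient_variants, clinvar_records):
    """Join patient VCF variants to ClinVar records on (chrom, pos, ref, alt).

    Illustrative sketch only — record field names are assumed, not the repo's.
    """
    # Build a lookup keyed by the exact allele change
    index = {(r["chrom"], r["pos"], r["ref"], r["alt"]): r for r in clinvar_records}
    annotated = []
    for v in patient_variants:
        hit = index.get((v["chrom"], v["pos"], v["ref"], v["alt"]))
        if hit is not None:
            annotated.append({**v, "clinical_significance": hit["clnsig"]})
    return annotated
```

This exact-match join is why only ~35,616 of the ~4.1M ClinVar records annotate a given patient: most ClinVar entries describe alleles the patient does not carry.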
9.4 BGE Embedding and Indexing¶
from sentence_transformers import SentenceTransformer
from pymilvus import connections, Collection
# Load embedding model
model = SentenceTransformer('BAAI/bge-small-en-v1.5') # 384 dimensions
# Connect to Milvus
connections.connect(host="localhost", port=19530)
collection = Collection("genomic_evidence")
# Example: embed and insert a variant
text = "chr9:35065263 G>A in VCP gene. ClinVar: Pathogenic. AlphaMissense: 0.87 (pathogenic). Consequence: missense_variant. Impact: MODERATE."
embedding = model.encode(text).tolist() # 384-dim vector
# Insert into Milvus
data = [{
    "embedding": embedding,
    "chrom": "chr9",
    "pos": 35065263,
    "ref": "G",
    "alt": "A",
    "qual": 99.0,
    "gene": "VCP",
    "consequence": "missense_variant",
    "impact": "MODERATE",
    "genotype": "0/1",
    "text_summary": text,
    "clinical_significance": "Pathogenic",
    "rsid": "rs188935092",
    "disease_associations": "Inclusion body myopathy with Paget disease and frontotemporal dementia",
    "am_pathogenicity": 0.87,
    "am_class": "pathogenic"
}]
collection.insert(data)
collection.flush()
Milvus index configuration:
| Parameter | Value |
|---|---|
| Embedding Model | BGE-small-en-v1.5 |
| Dimensions | 384 |
| Index Type | IVF_FLAT |
| Metric Type | COSINE |
| nlist | 1024 |
| nprobe (search) | 16 |
9.5 Anthropic Claude Integration¶
import os
import anthropic
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
def query_claude(question: str, context: str) -> str:
    """Send RAG query to Claude with retrieved genomic context."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        temperature=0.3,
        messages=[{
            "role": "user",
            "content": f"""You are a genomics expert. Answer the question using the provided genomic evidence.
Context:
{context}
Question: {question}"""
        }]
    )
    return response.content[0].text
Claude configuration:
| Parameter | Value |
|---|---|
| Model | claude-sonnet-4-20250514 |
| Temperature | 0.3 |
| Max Tokens | 4096 |
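Between retrieval (Section 9.4) and generation (Section 9.5), the retrieved hits must be assembled into the context string passed to `query_claude`. A minimal sketch, assuming each hit carries the `text_summary` field from the Milvus schema; the function name `build_context` and the character budget are illustrative, not from the repository:

```python
def build_context(hits, max_chars=8000):
    """Concatenate retrieved text_summary fields into a numbered context block,
    stopping once the character budget would be exceeded."""
    parts, total = [], 0
    for i, hit in enumerate(hits, 1):
        entry = f"[{i}] {hit['text_summary']}"
        if total + len(entry) > max_chars:
            break  # keep the prompt within budget; hits are ranked, so keep the best
        parts.append(entry)
        total += len(entry)
    return "\n".join(parts)
```

Numbering the snippets lets Claude cite specific pieces of evidence in its answer.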
9.6 Knowledge Base¶
The platform includes a curated knowledge base of 201 genes across 13 therapeutic areas, with 171 genes (85%) classified as druggable.
| Metric | Value |
|---|---|
| Total genes | 201 |
| Therapeutic areas | 13 |
| Druggable genes | 171 (85%) |
9.7 RAG API and Streamlit Chat¶
# Start RAG API and Chat services
docker compose up -d rag-api streamlit-chat
# Verify RAG API
curl -s http://localhost:5001/health
# Expected: {"status": "healthy"}
# Verify Streamlit Chat
curl -s -o /dev/null -w "%{http_code}" http://localhost:8501
# Expected: 200
Access the Streamlit Chat at http://<dgx-spark-ip>:8501 for conversational variant analysis.
10. Deploy Therapeutic Discovery Engine (Stage 3)¶
10.1 BioNeMo NIM Services¶
# Pull BioNeMo containers (requires NGC authentication)
docker pull nvcr.io/nvidia/clara/bionemo-molmim:1.0
docker pull nvcr.io/nvidia/clara/diffdock:1.0
# Start NIM services
docker compose up -d molmim diffdock
# Wait for models to load (may take 2-5 minutes)
sleep 120
# Verify MolMIM
curl -s http://localhost:8001/v1/health/ready
# Expected: {"status": "ready"}
# Verify DiffDock
curl -s http://localhost:8002/v1/health/ready
# Expected: {"status": "ready"}
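Instead of the fixed `sleep 120` above, a readiness poll avoids both premature failures and wasted wait time. A standard-library sketch against the documented readiness endpoints (the helper name and timeouts are illustrative):

```python
import time
import urllib.error
import urllib.request

def wait_ready(url: str, timeout: float = 600.0, interval: float = 10.0) -> bool:
    """Poll a NIM readiness endpoint until it returns HTTP 200 or timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; retry after the interval
        time.sleep(interval)
    return False

# Usage, e.g.:
# wait_ready("http://localhost:8001/v1/health/ready")
# wait_ready("http://localhost:8002/v1/health/ready")
```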
10.2 10-Stage Pipeline Detail¶
| Stage | Name | Input | Output | Key Operations |
|---|---|---|---|---|
| 1 | Initialize | Config + target gene | PipelineConfig | Validate parameters, create run ID |
| 2 | Normalize Target | Gene symbol | Normalized target | Map to UniProt, canonical name |
| 3 | Structure Discovery | UniProt ID | PDB structure list | Query RCSB PDB, score by resolution |
| 4 | Structure Preparation | PDB IDs | Prepared structures | Download PDB, extract binding sites |
| 5 | Molecule Generation | Seed SMILES + protein | Generated SMILES | MolMIM NIM (Port 8001) |
| 6 | Chemistry QC | SMILES list | Filtered SMILES | Lipinski, QED, TPSA checks |
| 7 | Conformer Generation | Filtered SMILES | 3D conformers (SDF) | RDKit conformer embedding |
| 8 | Molecular Docking | Conformers + protein | Docking scores | DiffDock NIM (Port 8002) |
| 9 | Composite Ranking | All scores | Ranked candidates | Weighted composite formula |
| 10 | Reporting | Ranked candidates | PDF report | Visualizations, recommendations |
10.3 Structure Retrieval and Scoring¶
import requests
def search_pdb_structures(uniprot_id: str) -> list:
    """Search RCSB PDB for protein structures by UniProt ID."""
    url = "https://search.rcsb.org/rcsbsearch/v2/query"
    query = {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession",
                "operator": "exact_match",
                "value": uniprot_id
            }
        },
        "return_type": "entry"
    }
    response = requests.post(url, json=query, timeout=30)
    response.raise_for_status()
    return response.json().get("result_set", [])
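Scoring by resolution (Stage 3 of the pipeline table) can then be a simple sort over the retrieved entries. This sketch assumes each entry dict carries a `resolution` key in Å, which the search response above does not guarantee; in practice a follow-up call to the RCSB data API is typically needed to populate it:

```python
def rank_by_resolution(entries):
    """Rank PDB entries best-first by resolution (lower Å is better).

    Entries without a resolution value (e.g. cryo-EM maps not yet annotated)
    sort last. Field name "resolution" is an assumption for illustration.
    """
    return sorted(entries, key=lambda e: e.get("resolution", float("inf")))
```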
10.4 Molecule Generation (MolMIM)¶
import requests
def generate_molecules(seed_smiles: str, num_candidates: int = 100) -> list:
    """Generate molecule candidates using MolMIM NIM."""
    response = requests.post(
        "http://localhost:8001/generate",
        json={
            "smiles": seed_smiles,
            "num_molecules": num_candidates,
            "algorithm": "CMA-ES",
            "property_name": "QED",
            "min_similarity": 0.3,
            "particles": 30,
            "iterations": 10
        },
        timeout=300
    )
    response.raise_for_status()
    return response.json()["generated_molecules"]
10.5 Molecular Docking (DiffDock)¶
import requests
def dock_molecule(protein_pdb: str, ligand_sdf: str) -> dict:
    """Score binding affinity using DiffDock NIM."""
    response = requests.post(
        "http://localhost:8002/molecular-docking/diffdock/generate",
        json={
            "protein": protein_pdb,
            "ligand": ligand_sdf,
            "num_poses": 10
        },
        timeout=600
    )
    response.raise_for_status()
    return response.json()
10.6 Drug-Likeness Scoring¶
Drug-likeness is assessed using three criteria:
Lipinski Rule of Five:
| Property | Threshold | Description |
|---|---|---|
| Molecular Weight | <= 500 Da | Size constraint |
| LogP | <= 5 | Lipophilicity |
| H-Bond Donors (HBD) | <= 5 | Polar surface groups |
| H-Bond Acceptors (HBA) | <= 10 | Polar surface groups |
Additional thresholds:
| Metric | Threshold | Interpretation |
|---|---|---|
| QED | > 0.67 | Drug-like |
| TPSA | < 140 Å² | Good oral bioavailability |
from rdkit import Chem
from rdkit.Chem import Descriptors, QED
def assess_drug_likeness(smiles: str) -> dict:
    """Evaluate drug-likeness using Lipinski, QED, and TPSA."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {"valid": False}
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)
    tpsa = Descriptors.TPSA(mol)
    qed_score = QED.qed(mol)
    lipinski_pass = (mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10)
    return {
        "valid": True,
        "mw": mw,
        "logp": logp,
        "hbd": hbd,
        "hba": hba,
        "tpsa": tpsa,
        "qed": qed_score,
        "lipinski_pass": lipinski_pass,
        "drug_like": qed_score > 0.67,
        "oral_bioavail": tpsa < 140
    }
10.7 Composite Ranking Formula¶
Candidates are ranked using a weighted composite score:
Docking score normalization:
def normalize_docking_score(dock_score: float) -> float:
    """Normalize docking score to the [0, 1] range.
    More negative = better binding = lower normalized score."""
    return max(0.0, min(1.0, (10.0 + dock_score) / 20.0))
| Raw Docking Score | Normalized Score | Interpretation |
|---|---|---|
| -10.0 kcal/mol | 0.00 | Excellent binding |
| -8.0 kcal/mol | 0.10 | Strong binding |
| -6.0 kcal/mol | 0.20 | Moderate binding |
| 0.0 kcal/mol | 0.50 | Weak binding |
| +10.0 kcal/mol | 1.00 | No binding |
Note: The normalization maps more negative (better) raw docking scores to lower normalized values, so a lower normalized score indicates stronger binding. The composite formula accordingly rewards lower normalized docking values.
Composite score weights:
| Component | Weight | Source |
|---|---|---|
| Generation Score | 30% | MolMIM similarity/property score |
| Docking Score (normalized) | 40% | DiffDock binding affinity |
| QED Score | 30% | RDKit quantitative drug-likeness |
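One plausible reading of the weights table and the docking note is the sketch below. The exact formula in the repository may differ; in particular, inverting the docking term (so that lower normalized values raise the composite) is an assumption drawn from the note above, not confirmed code:

```python
def composite_score(generation: float, dock_norm: float, qed: float) -> float:
    """Weighted composite: 30% generation + 40% docking + 30% QED.

    dock_norm is the output of normalize_docking_score(); because lower
    normalized values mean stronger binding, the docking term is inverted
    here (assumption based on the note above).
    """
    return 0.30 * generation + 0.40 * (1.0 - dock_norm) + 0.30 * qed
```

For example, a candidate with generation score 0.85, raw docking −9.7 kcal/mol (normalized 0.015), and QED 0.82 lands near the top of the demo's 0.68–0.89 composite range.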
10.8 Discovery UI and Portal¶
# Start Discovery services
docker compose up -d discovery-ui discovery-portal
# Verify Discovery UI
curl -s -o /dev/null -w "%{http_code}" http://localhost:8505
# Expected: 200
# Verify Discovery Portal
curl -s -o /dev/null -w "%{http_code}" http://localhost:8510
# Expected: 200
- Discovery UI (Port 8505): Interactive pipeline execution interface
- Discovery Portal (Port 8510): Results browser and reporting portal
10.9 PDF Report Generation¶
The final pipeline stage generates a PDF report containing:
- Target gene and variant summary
- PDB structure details with binding site analysis
- Top-ranked candidates with SMILES, scores, and 2D depictions
- Docking poses and binding affinity plots
- Lipinski and QED compliance table
- Composite score ranking
11. Nextflow Orchestration¶
11.1 DSL2 Pipeline Architecture¶
The HCLS AI Factory uses Nextflow DSL2 for pipeline orchestration. Each pipeline stage is defined as a separate process, with channels connecting inputs and outputs.
11.2 Pipeline Modes¶
| Mode | Description | Stages Executed |
|---|---|---|
| `full` | Complete end-to-end pipeline | 1 + 2 + 3 (all stages) |
| `target` | Start from target gene (skip genomics) | 2 + 3 |
| `drug` | Drug discovery only (pre-existing target) | 3 only |
| `demo` | VCP demo with pre-loaded data | 1 + 2 + 3 (demo subset) |
| `genomics_only` | Genomics pipeline only | 1 only |
11.3 Execution Profiles¶
| Profile | Description | Use Case |
|---|---|---|
| `standard` | Local execution, default settings | Development |
| `docker` | Docker container execution | Standard deployment |
| `singularity` | Singularity container execution | HPC environments |
| `dgx_spark` | Optimized for DGX Spark hardware | Production on DGX Spark |
| `slurm` | SLURM workload manager | Multi-node clusters |
| `test` | Minimal test data, fast execution | CI/CD testing |
11.4 Pipeline Launcher¶
# Run with the pipeline launcher script
python3 scripts/run_pipeline.py \
--mode full \
--profile dgx_spark \
--fastq genomics/data/fastq/ \
--reference genomics/data/reference/GRCh38.fa
# Or run directly with Nextflow
nextflow run main.nf \
-profile dgx_spark \
--mode full \
--fastq_dir genomics/data/fastq/ \
--reference genomics/data/reference/GRCh38.fa \
--outdir results/
11.5 Pipeline Configuration¶
// nextflow.config
params {
    // Pipeline mode
    mode = 'full'
    // Input paths
    fastq_dir = 'genomics/data/fastq'
    reference = 'genomics/data/reference/GRCh38.fa'
    outdir = 'results'
    // Service endpoints
    milvus_host = 'localhost'
    milvus_port = 19530
    molmim_url = 'http://localhost:8001'
    diffdock_url = 'http://localhost:8002'
    // Drug discovery parameters
    num_candidates = 100
    min_qed = 0.67
    min_dock_score = -6.0
}
profiles {
    dgx_spark {
        docker.enabled = true
        docker.runOptions = '--gpus all'
        process {
            executor = 'local'
            memory = '120 GB'
            cpus = 128
        }
    }
    test {
        params.mode = 'demo'
        process {
            memory = '16 GB'
            cpus = 4
        }
    }
}
12. Service Startup and Health¶
12.1 start-services.sh Startup Order¶
Services should be started in dependency order:
#!/bin/bash
# start-services.sh — Start all HCLS AI Factory services
set -e
echo "Starting infrastructure services..."
docker compose up -d etcd minio
sleep 10
echo "Starting Milvus..."
docker compose up -d milvus
sleep 30
echo "Starting BioNeMo NIM services..."
docker compose up -d molmim diffdock
sleep 120
echo "Starting application services..."
docker compose up -d genomics-portal rag-api streamlit-chat discovery-ui discovery-portal landing-page
echo "Starting monitoring..."
docker compose up -d prometheus grafana node-exporter dcgm-exporter
echo "All services started. Running health checks..."
sleep 10
bash scripts/validate_deployment.sh
12.2 Landing Page (Port 8080)¶
The landing page at http://<dgx-spark-ip>:8080 provides a directory of all services with links and status indicators.
12.3 Health Check Endpoints¶
| Service | Port | Health Endpoint | Expected Response |
|---|---|---|---|
| Genomics Portal | 5000 | `/health` | `{"status": "healthy"}` |
| RAG API | 5001 | `/health` | `{"status": "healthy"}` |
| Milvus | 19530 | `/v1/health/ready` | `{"status": "ok"}` |
| Streamlit Chat | 8501 | `/healthz` | HTTP 200 |
| MolMIM NIM | 8001 | `/v1/health/ready` | `{"status": "ready"}` |
| DiffDock NIM | 8002 | `/v1/health/ready` | `{"status": "ready"}` |
| Discovery UI | 8505 | `/health` | `{"status": "healthy"}` |
| Discovery Portal | 8510 | `/health` | `{"status": "healthy"}` |
| Grafana | 3000 | `/api/health` | `{"status": "ok"}` |
| Prometheus | 9099 | `/-/healthy` | HTTP 200 |
| Node Exporter | 9100 | `/metrics` | Metrics text |
| DCGM Exporter | 9400 | `/metrics` | Metrics text |
12.4 Verifying All Services¶
#!/bin/bash
# validate_deployment.sh — Verify all services are running
declare -A SERVICES=(
["Landing Page"]="http://localhost:8080"
["Genomics Portal"]="http://localhost:5000/health"
["RAG API"]="http://localhost:5001/health"
["Milvus"]="http://localhost:19530/v1/health/ready"
["Streamlit Chat"]="http://localhost:8501/healthz"
["MolMIM"]="http://localhost:8001/v1/health/ready"
["DiffDock"]="http://localhost:8002/v1/health/ready"
["Discovery UI"]="http://localhost:8505/health"
["Discovery Portal"]="http://localhost:8510/health"
["Grafana"]="http://localhost:3000/api/health"
["Prometheus"]="http://localhost:9099/-/healthy"
["Node Exporter"]="http://localhost:9100/metrics"
["DCGM Exporter"]="http://localhost:9400/metrics"
)
echo "=== HCLS AI Factory Health Check ==="
for service in "${!SERVICES[@]}"; do
    url="${SERVICES[$service]}"
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url" 2>/dev/null || echo "ERR")
    if [ "$status" == "200" ]; then
        echo "[OK] $service ($url)"
    else
        echo "[FAIL] $service ($url) — HTTP $status"
    fi
done
13. Monitoring and Observability¶
13.1 Grafana Setup (Port 3000)¶
# Start Grafana
docker compose up -d grafana
# Access at http://<dgx-spark-ip>:3000
# Default credentials: admin / changeme
Default Grafana credentials:
| Parameter | Value |
|---|---|
| Username | admin |
| Password | changeme |
13.2 Prometheus Configuration (Port 9099)¶
# monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'dcgm-exporter'
    static_configs:
      - targets: ['dcgm-exporter:9400']
  - job_name: 'rag-api'
    static_configs:
      - targets: ['rag-api:5001']
    metrics_path: /metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
13.3 DCGM Exporter (Port 9400)¶
Key GPU metrics exposed by the DCGM Exporter:
| Metric | Description |
|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | GPU utilization percentage |
| `DCGM_FI_DEV_FB_USED` | GPU framebuffer memory used (MB) |
| `DCGM_FI_DEV_FB_FREE` | GPU framebuffer memory free (MB) |
| `DCGM_FI_DEV_GPU_TEMP` | GPU temperature (Celsius) |
| `DCGM_FI_DEV_POWER_USAGE` | Power consumption (Watts) |
| `DCGM_FI_DEV_SM_CLOCK` | Streaming multiprocessor clock (MHz) |
| `DCGM_FI_DEV_MEM_CLOCK` | Memory clock (MHz) |
13.4 Node Exporter (Port 9100)¶
The Node Exporter provides host system metrics — CPU, memory, disk, and network utilization — critical for monitoring the DGX Spark ARM64 system.
13.5 Key Dashboard Panels¶
Recommended Grafana dashboard panels:
| Panel | Data Source | Purpose |
|---|---|---|
| GPU Utilization | DCGM | Track fq2bam and DeepVariant GPU usage |
| GPU Memory | DCGM | Monitor peak memory during genomics |
| CPU Utilization | Node Exporter | Core usage across all ARM64 cores |
| Memory Usage | Node Exporter | Unified 128 GB LPDDR5x utilization |
| Disk I/O | Node Exporter | NVMe throughput for FASTQ/BAM processing |
| Network I/O | Node Exporter | API call throughput |
| Container Status | Docker | Service health overview |
13.6 Alert Configuration¶
# Example alert rules for Prometheus
groups:
  - name: hcls-alerts
    rules:
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory usage above 95%"
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
14. Security Configuration¶
14.1 API Key Management¶
# Store API keys in .env file (not committed to git)
echo ".env" >> .gitignore
# Set restrictive permissions
chmod 600 .env
# Verify .env is in .gitignore (fixed-string, whole-line match)
grep -qxF '.env' .gitignore && echo "OK: .env is gitignored"
Never commit API keys to version control. Use environment variables exclusively:
| Variable | Sensitivity | Storage |
|---|---|---|
| `ANTHROPIC_API_KEY` | High | `.env` file, `chmod 600` |
| `NGC_API_KEY` | High | `.env` file, `chmod 600` |
| `GRAFANA_PASSWORD` | Medium | `.env` file |
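A startup check lets services fail fast when required secrets are missing instead of erroring mid-pipeline. A minimal sketch; the variable names come from the table above, but the helper itself is illustrative, not part of the repository:

```python
import os

# Required secrets per the table above
REQUIRED_SECRETS = ["ANTHROPIC_API_KEY", "NGC_API_KEY"]

def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]

# Typical use at service startup:
# missing = missing_secrets()
# if missing:
#     raise SystemExit(f"Missing required environment variables: {missing}")
```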
14.2 Docker Network Isolation¶
Docker Compose creates an isolated bridge network. Only explicitly exposed ports are accessible from the host:
# Verify network isolation
docker network ls | grep hcls
docker network inspect hcls-ai-factory_default
14.3 Container Security¶
Best practices applied to the deployment:
- Run application containers as non-root users where possible
- Use read-only filesystem mounts for reference data
- Limit container capabilities with `--cap-drop ALL`
- Pin container image versions (no `latest` tags in production)
14.4 Data Access Controls¶
# Set appropriate permissions on data directories
chmod -R 750 genomics/data/
chmod -R 750 rag/data/
chmod -R 750 discovery/data/
# Ensure only the deployment user can access sensitive data
chown -R $(whoami):$(whoami) genomics/data/ rag/data/ discovery/data/
15. Data Management¶
15.1 Storage Layout¶
| Directory | Contents | Size | Persistence |
|---|---|---|---|
| `genomics/data/reference/` | GRCh38 genome | 3.1 GB | Permanent |
| `genomics/data/fastq/` | Input FASTQ files | ~200 GB | Keep until processed |
| `genomics/data/bam/` | Alignment output | ~100 GB | Delete after VCF |
| `genomics/data/vcf/` | Variant calls | ~1 GB | Permanent |
| `rag/data/clinvar/` | ClinVar database | ~1.2 GB | Permanent |
| `rag/data/alphamissense/` | AlphaMissense DB | ~4 GB | Permanent |
| `milvus_data` (Docker volume) | Vector index | ~2 GB | Permanent |
| `discovery/data/` | Structures, molecules | Variable | Per-run |
15.2 Intermediate File Cleanup¶
BAM files are the largest intermediate output (~100 GB). Once the VCF has been verified, BAM files can be deleted to reclaim storage:
# Verify VCF is complete before deleting BAM
zcat genomics/data/vcf/HG002.vcf.gz | grep -v '^#' | wc -l
# Confirm ~11.7M variants
# Delete intermediate BAM
rm -f genomics/data/bam/HG002.bam genomics/data/bam/HG002.bam.bai
echo "Reclaimed ~100 GB"
15.3 Milvus Data Persistence¶
Milvus data is stored in Docker volumes. To back up:
# Stop Milvus for consistent backup
docker compose stop milvus
# Back up volumes
docker run --rm \
-v hcls-ai-factory_milvus_data:/data \
-v $(pwd)/backups:/backup \
alpine tar czf /backup/milvus_data_$(date +%Y%m%d).tar.gz /data
# Restart
docker compose start milvus
15.4 Backup Procedures¶
#!/bin/bash
# Full backup script
BACKUP_DIR=./backups/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"
# Back up VCF results
cp -r genomics/data/vcf/ "$BACKUP_DIR/vcf/"
# Back up environment config (without secrets)
grep -vE 'API_KEY|PASSWORD' .env > "$BACKUP_DIR/env_sanitized.txt"
# Back up Milvus volumes
docker compose stop milvus
for vol in milvus_data etcd_data minio_data; do
    docker run --rm \
        -v hcls-ai-factory_${vol}:/data \
        -v "$(pwd)/$BACKUP_DIR:/backup" \
        alpine tar czf /backup/${vol}.tar.gz /data
done
docker compose start milvus
echo "Backup complete: $BACKUP_DIR"
16. Performance Tuning¶
16.1 GPU Memory Management¶
The DGX Spark uses 128 GB unified LPDDR5x memory shared between CPU and GPU. Key considerations:
- Parabricks DeepVariant peaks at ~60 GB GPU memory — ensure other GPU services are idle during genomics processing
- MolMIM and DiffDock each require ~8 GB — they can co-exist during drug discovery
- Monitor with `nvidia-smi` and DCGM metrics during pipeline runs
# Monitor GPU memory in real-time
watch -n 1 nvidia-smi
# Check unified memory allocation
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
16.2 Milvus Index Tuning¶
| Parameter | Default | Tuning Guidance |
|---|---|---|
| `nlist` | 1024 | Increase for larger collections (trade build time for search quality) |
| `nprobe` | 16 | Increase for higher recall (trade latency for accuracy) |
| `metric_type` | COSINE | Use COSINE for normalized BGE embeddings |
# Search with tuned parameters
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16}
}
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=10,
    output_fields=["gene", "clinical_significance", "text_summary"]
)
16.3 Docker Resource Limits¶
# Example resource limits in docker-compose.yml
services:
  rag-api:
    deploy:
      resources:
        limits:
          memory: 16G
          cpus: '16'
        reservations:
          memory: 4G
          cpus: '4'
16.4 NVMe I/O Optimization¶
For FASTQ and BAM processing, I/O throughput is critical:
# Check NVMe sequential read performance (creates 1 GB test files in the target directory)
fio --name=seqread --rw=read --bs=1M --size=1G --numjobs=4 --runtime=10 \
    --group_reporting --directory=genomics/data
# Ensure data directories are on NVMe
df -h genomics/data/
16.5 Pipeline Concurrency Settings¶
The Nextflow pipeline supports controlled concurrency:
// nextflow.config — concurrency settings
process {
    maxForks = 4         // Maximum parallel processes
    maxRetries = 2       // Retry failed processes
    errorStrategy = 'retry'
}
executor {
    queueSize = 8        // Maximum queued tasks
    pollInterval = '5 sec'
}
17. Troubleshooting Guide¶
17.1 Service Not Starting¶
# Check service logs
docker compose logs <service-name> --tail 50
# Check if port is already in use
ss -tlnp | grep <port>
# Restart a specific service
docker compose restart <service-name>
17.2 GPU Out of Memory¶
# Check current GPU memory usage
nvidia-smi
# Identify orphaned GPU processes (add -k to kill them)
sudo fuser -v /dev/nvidia*
# Reduce Parabricks memory by limiting GPU threads
# Add --gpu-mem-limit flag if available
# Ensure NIM services are stopped during genomics
docker compose stop molmim diffdock
17.3 Milvus Connection Issues¶
# Verify Milvus dependencies are running
docker compose ps etcd minio milvus
# Check Milvus logs for errors
docker compose logs milvus --tail 100
# Test connectivity
curl -s http://localhost:19530/v1/health/ready
# Reset Milvus if corrupted
docker compose down milvus etcd minio
docker volume rm hcls-ai-factory_milvus_data hcls-ai-factory_etcd_data hcls-ai-factory_minio_data
docker compose up -d etcd minio milvus
17.4 BioNeMo NIM Not Ready¶
# NIM services may take 2-5 minutes to load models
# Check logs for model loading progress
docker compose logs molmim --tail 50
docker compose logs diffdock --tail 50
# Verify GPU is available for NIM
nvidia-smi | grep -i "molmim\|diffdock"
# Restart if stuck
docker compose restart molmim diffdock
17.5 Parabricks Failures¶
| Error | Cause | Resolution |
|---|---|---|
| `CUDA out of memory` | Insufficient GPU memory | Stop other GPU services first |
| `Reference index not found` | Missing `.fai` file | Run `samtools faidx GRCh38.fa` |
| `Input file not found` | Wrong FASTQ path | Check volume mount paths |
| `Unsupported GPU` | Driver mismatch | Update NVIDIA driver |
17.6 Claude API Errors¶
| Error | Cause | Resolution |
|---|---|---|
| `401 Unauthorized` | Invalid API key | Verify `ANTHROPIC_API_KEY` in `.env` |
| `429 Rate Limited` | Too many requests | Implement exponential backoff |
| `500 Server Error` | Anthropic service issue | Retry after 30 seconds |
| `Connection refused` | No internet | Check network connectivity |
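The exponential backoff recommended for 429 errors can be wrapped around any API call. A generic sketch (not the repository's actual retry code; the helper name and delay constants are illustrative):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry call() with exponential backoff plus jitter.

    Delays double each attempt (capped at max_delay); the final
    failure is re-raised to the caller.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herd

# Usage, e.g.:
# answer = with_backoff(lambda: query_claude(question, context))
```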
17.7 Docker Issues¶
# Docker daemon not running
sudo systemctl start docker
sudo systemctl enable docker
# Disk space full
docker system prune -a --volumes
df -h /var/lib/docker
# Permission denied
sudo usermod -aG docker $USER
newgrp docker
17.8 Common Error Messages Table¶
| Error Message | Service | Resolution |
|---|---|---|
| `Connection refused on port 19530` | Milvus | Start etcd + MinIO first, then Milvus |
| `NVIDIA driver not found` | Docker | Install NVIDIA Container Toolkit |
| `Model not loaded` | MolMIM/DiffDock | Wait 2-5 minutes for model loading |
| `Collection not found` | Milvus | Run schema creation script (Section 9.2) |
| `API key not set` | RAG API | Set `ANTHROPIC_API_KEY` in `.env` |
| `Out of disk space` | Parabricks | Clean BAM intermediates, expand storage |
| `Permission denied: /data` | Any | Check volume mount permissions |
18. VCP/FTD Demo Walkthrough¶
18.1 Demo Overview¶
The VCP (Valosin-Containing Protein) / FTD (Frontotemporal Dementia) demo showcases the full three-stage pipeline using a known pathogenic variant:
| Parameter | Value |
|---|---|
| Variant | rs188935092 |
| Location | chr9:35065263 G>A |
| Gene | VCP |
| ClinVar Classification | Pathogenic |
| AlphaMissense Score | 0.87 (pathogenic, threshold >0.564) |
| Disease | Inclusion body myopathy with Paget disease and FTD |
| Seed Molecule | CB-5083 (VCP/p97 inhibitor) |
| PDB Structures | 8OOI, 9DIL, 7K56, 5FTK |
| Binding Domain | D2 ATPase domain, ~450 Å³ |
| Druggability Score | 0.92 |
18.2 Pre-Demo Setup¶
# Ensure all services are running
bash scripts/validate_deployment.sh
# Verify Milvus has the VCP variant loaded
python3 -c "
from pymilvus import connections, Collection
connections.connect(host='localhost', port=19530)
col = Collection('genomic_evidence')
col.load()
results = col.query('gene == \"VCP\"', output_fields=['rsid', 'clinical_significance', 'am_pathogenicity'])
print(f'VCP variants found: {len(results)}')
for r in results[:3]:
    print(r)
"
18.3 Running the Demo¶
# Run the demo pipeline mode
python3 scripts/run_pipeline.py --mode demo
# Or via Nextflow
nextflow run main.nf -profile dgx_spark --mode demo
Step-by-step execution:
- Stage 1 (Genomics): Process demo FASTQ subset through Parabricks fq2bam and DeepVariant
- Stage 2 (RAG): Annotate VCP variant with ClinVar (Pathogenic) and AlphaMissense (0.87), embed into Milvus, query Claude for clinical interpretation
- Stage 3 (Drug Discovery): Retrieve PDB structures (8OOI, 9DIL, 7K56, 5FTK), generate molecules from CB-5083 seed via MolMIM, dock with DiffDock, rank by composite score
18.4 Expected Results¶
| Metric | Expected Value |
|---|---|
| Candidates generated | 100 |
| Pass Lipinski Rule of Five | 87 |
| QED > 0.67 (drug-like) | 72 |
| Top docking scores | -8.2 to -11.4 kcal/mol |
| Composite score range | 0.68 - 0.89 |
Top candidate characteristics:
| Property | Range |
|---|---|
| Molecular Weight | 300 - 500 Da |
| LogP | 1.5 - 4.5 |
| QED | 0.67 - 0.92 |
| TPSA | 40 - 130 Å² |
| Docking Score | -8.2 to -11.4 kcal/mol |
| Composite Score | 0.68 - 0.89 |
19. Scaling Beyond DGX Spark¶
19.1 Phase 1 to Phase 3 Roadmap¶
| Phase | Hardware | Scale | Use Case |
|---|---|---|---|
| Phase 1 | DGX Spark | Single workstation | Development, demos, single-patient analysis |
| Phase 2 | DGX B200 | Single server, multi-GPU | Production cohort analysis |
| Phase 3 | DGX SuperPOD | Multi-node cluster | Population-scale genomics |
19.2 Kubernetes Migration Path¶
For Phase 2 and beyond, migrate from Docker Compose to Kubernetes:
- Replace `docker-compose.yml` with Helm charts
- Use NVIDIA GPU Operator for GPU scheduling
- Deploy Milvus Cluster mode (distributed) instead of standalone
- Use persistent volume claims (PVCs) for data storage
- Implement horizontal pod autoscaling for RAG API
19.3 Multi-GPU Considerations¶
- Parabricks supports `--num-gpus` for multi-GPU parallelism
- MolMIM and DiffDock can be replicated across GPUs
- Milvus supports distributed deployment with multiple query nodes
19.4 NVIDIA FLARE for Federated Learning¶
For multi-institutional deployments, NVIDIA FLARE enables federated learning across DGX Spark nodes without sharing raw patient data.
19.5 Intelligence Agent Deployment¶
The Precision Intelligence Network includes 11 intelligence agents. Each agent provides a Streamlit frontend and a FastAPI backend:
| # | Agent | Streamlit Port | API Port | Domain |
|---|---|---|---|---|
| 1 | Precision Oncology | 8503 | 8103 | Molecular tumor board decision support |
| 2 | Precision Biomarker | 8502 | 8102 | Genotype-aware biomarker interpretation |
| 3 | CAR-T Intelligence | 8504 | 8104 | Cellular immunotherapy intelligence |
| 4 | Imaging Intelligence | 8524 | 8105 | Medical imaging AI (CT, MRI, X-ray) |
| 5 | Precision Autoimmune | 8506 | 8106 | Autoimmune and immune-mediated conditions |
| 6 | Pharmacogenomics | 8507 | 8107 | Drug-gene interaction and dosing |
| 7 | Cardiology Intelligence | 8527 | 8126 | Cardiovascular clinical decision support |
| 8 | Clinical Trial Intelligence | 8538 | 8128 | Trial matching, eligibility, enrollment optimization |
| 9 | Rare Disease Diagnostic | 8544 | 8134 | Rare disease differential diagnosis and gene panel analysis |
| 10 | Neurology Intelligence | 8528 | 8529 | Neurological condition assessment and treatment planning |
| 11 | Single-Cell Intelligence | 8540 | 8130 | Single-cell transcriptomics and cell-type analysis |
# Deploy all 11 intelligence agents
docker compose up -d \
oncology-agent biomarker-agent cart-agent imaging-agent \
autoimmune-agent pharmacogenomics-agent cardiology-agent \
clinical-trial-agent rare-disease-agent neurology-agent single-cell-agent
# Verify new agents
curl -s http://localhost:8538/health # Clinical Trial Intelligence
curl -s http://localhost:8134/health # Rare Disease Diagnostic
curl -s http://localhost:8528/health # Neurology Intelligence
curl -s http://localhost:8540/health # Single-Cell Intelligence
20. Appendix A: Complete Configuration Reference¶
20.1 All Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | (required) | Anthropic API key for Claude |
| `NGC_API_KEY` | (required) | NVIDIA NGC API key |
| `REFERENCE_GENOME` | `/data/reference/GRCh38.fa` | Path to reference genome |
| `MILVUS_HOST` | `localhost` | Milvus server hostname |
| `MILVUS_PORT` | `19530` | Milvus server port |
| `MOLMIM_URL` | `http://localhost:8001` | MolMIM NIM endpoint |
| `DIFFDOCK_URL` | `http://localhost:8002` | DiffDock NIM endpoint |
| `CLAUDE_MODEL` | `claude-sonnet-4-20250514` | Claude model identifier |
| `CLAUDE_TEMPERATURE` | `0.3` | Claude sampling temperature |
| `PIPELINE_MODE` | `full` | Pipeline execution mode |
| `NUM_CANDIDATES` | `100` | Number of molecules to generate |
| `MIN_QED` | `0.3` | Minimum QED threshold |
| `MIN_DOCK_SCORE` | `-6.0` | Minimum docking score (kcal/mol) |
| `GRAFANA_USER` | `admin` | Grafana admin username |
| `GRAFANA_PASSWORD` | `changeme` | Grafana admin password |
20.2 AlphaMissense Thresholds¶
| Classification | Score Range |
|---|---|
| Pathogenic | > 0.564 |
| Ambiguous | 0.34 - 0.564 |
| Benign | < 0.34 |
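These thresholds map directly to the `am_class` field in the Milvus schema. A one-function sketch (the function name is illustrative; boundary handling at exactly 0.564 follows the table's ambiguous range):

```python
def alphamissense_class(score: float) -> str:
    """Map an AlphaMissense pathogenicity score to its class label
    using the thresholds in the table above."""
    if score > 0.564:
        return "pathogenic"
    if score >= 0.34:
        return "ambiguous"
    return "benign"
```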
20.3 Scoring Weights¶
| Component | Weight |
|---|---|
| Generation Score | 0.30 (30%) |
| Docking Score (normalized) | 0.40 (40%) |
| QED Score | 0.30 (30%) |
20.4 Drug-Likeness Thresholds¶
| Property | Threshold | Rule |
|---|---|---|
| Molecular Weight | <= 500 Da | Lipinski |
| LogP | <= 5 | Lipinski |
| H-Bond Donors | <= 5 | Lipinski |
| H-Bond Acceptors | <= 10 | Lipinski |
| QED | > 0.67 | Drug-likeness |
| TPSA | < 140 squared angstroms | Oral bioavailability |
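The Lipinski and drug-likeness checks can be expressed as simple predicates over the properties in the table; a sketch (function names are illustrative), matching the `lipinski_pass` flag on the molecule models:

```python
def lipinski_pass(mw: float, logp: float, hbd: int, hba: int) -> bool:
    """Rule-of-five check using the thresholds above:
    MW <= 500 Da, LogP <= 5, H-bond donors <= 5, acceptors <= 10."""
    return mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10

def is_drug_like(qed: float, tpsa: float) -> bool:
    """Additional filters: QED > 0.67 and TPSA < 140 A^2."""
    return qed > 0.67 and tpsa < 140
```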
20.5 Docking Score Interpretation¶
| Score (kcal/mol) | Binding Affinity | Assessment |
|---|---|---|
| < -10.0 | Excellent | Strong candidate |
| -8.0 to -10.0 | Strong | Viable candidate |
| -6.0 to -8.0 | Moderate | Marginal candidate |
| > -6.0 | Weak | Poor candidate |
Normalization formula:
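The exact normalization used by the ranking code is not reproduced here. One formulation consistent with the interpretation table above (and with the `dock_score_normalized` field on `RankedCandidate`) is a clamped linear map from the -6.0 kcal/mol cutoff to an "excellent" anchor around -12.0 — treat both endpoints as assumptions, not confirmed values:

```python
def normalize_dock_score(score: float,
                         worst: float = -6.0,
                         best: float = -12.0) -> float:
    """Map a docking score in kcal/mol (more negative = stronger binding)
    onto [0, 1]. The worst/best endpoints are illustrative assumptions."""
    frac = (score - worst) / (best - worst)  # 0 at cutoff, 1 at anchor
    return max(0.0, min(1.0, frac))
```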
21. Appendix B: API Reference¶
21.1 MolMIM API (Port 8001)¶
Generate Molecules:
// POST http://localhost:8001/generate
// Request:
{
"smiles": "CC1=CC=C(C=C1)C(=O)NC2=CC=CC=C2",
"num_molecules": 100,
"algorithm": "CMA-ES",
"property_name": "QED",
"min_similarity": 0.3,
"particles": 30,
"iterations": 10
}
// Response:
{
"generated_molecules": [
{
"smiles": "CC1=CC=C(C=C1)C(=O)NC2=CC=C(F)C=C2",
"score": 0.85,
"similarity": 0.78
}
]
}
Health Check:
curl -s http://localhost:8001/v1/health/ready
# Expected: {"status":"ready"}
21.2 DiffDock API (Port 8002)¶
Molecular Docking:
// POST http://localhost:8002/molecular-docking/diffdock/generate
// Request:
{
"protein": "<PDB file content>",
"ligand": "<SDF file content>",
"num_poses": 10
}
// Response:
{
"poses": [
{
"pose_id": 0,
"confidence": 0.95,
"score": -9.7,
"ligand_sdf": "<docked SDF content>"
}
]
}
Health Check:
curl -s http://localhost:8002/v1/health/ready
# Expected: {"status":"ready"}
21.3 RAG API Endpoints (Port 5001)¶
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Service health check |
| POST | `/query` | RAG query with context retrieval |
| POST | `/search` | Vector similarity search |
| GET | `/collections` | List Milvus collections |
| GET | `/stats` | Collection statistics |
RAG Query Example:
// POST http://localhost:5001/query
// Request:
{
"question": "What pathogenic variants are found in the VCP gene?",
"top_k": 10,
"filters": {
"gene": "VCP",
"impact": "HIGH"
}
}
// Response:
{
"answer": "The VCP gene contains the variant rs188935092...",
"sources": [
{
"gene": "VCP",
"rsid": "rs188935092",
"clinical_significance": "Pathogenic",
"am_pathogenicity": 0.87,
"similarity_score": 0.94
}
],
"model": "claude-sonnet-4-20250514",
"tokens_used": 1847
}
21.4 Health Check Endpoints Summary¶
| Service | Endpoint | Method |
|---|---|---|
| Genomics Portal | `/health` | GET |
| RAG API | `/health` | GET |
| Milvus | `/v1/health/ready` | GET |
| Streamlit Chat | `/healthz` | GET |
| MolMIM | `/v1/health/ready` | GET |
| DiffDock | `/v1/health/ready` | GET |
| Discovery UI | `/health` | GET |
| Discovery Portal | `/health` | GET |
| Grafana | `/api/health` | GET |
| Prometheus | `/-/healthy` | GET |
| Node Exporter | `/metrics` | GET |
| DCGM Exporter | `/metrics` | GET |
22. Appendix C: Schema Definitions¶
22.1 Milvus Collection Schema¶
Collection: genomic_evidence
| # | Field | Data Type | Constraints | Description |
|---|---|---|---|---|
| 1 | `id` | INT64 | Primary Key, Auto ID | Unique record identifier |
| 2 | `embedding` | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 embedding |
| 3 | `chrom` | VARCHAR | max_length=10 | Chromosome (chr1-22, chrX, chrY) |
| 4 | `pos` | INT64 | — | Genomic position (1-based) |
| 5 | `ref` | VARCHAR | max_length=500 | Reference allele |
| 6 | `alt` | VARCHAR | max_length=500 | Alternate allele |
| 7 | `qual` | FLOAT | — | Variant quality score |
| 8 | `gene` | VARCHAR | max_length=100 | HGNC gene symbol |
| 9 | `consequence` | VARCHAR | max_length=200 | VEP consequence term |
| 10 | `impact` | VARCHAR | max_length=20 | HIGH/MODERATE/LOW/MODIFIER |
| 11 | `genotype` | VARCHAR | max_length=10 | Sample genotype (0/1, 1/1) |
| 12 | `text_summary` | VARCHAR | max_length=5000 | Natural-language summary |
| 13 | `clinical_significance` | VARCHAR | max_length=200 | ClinVar classification |
| 14 | `rsid` | VARCHAR | max_length=20 | dbSNP RS identifier |
| 15 | `disease_associations` | VARCHAR | max_length=2000 | Associated diseases |
| 16 | `am_pathogenicity` | FLOAT | 0.0-1.0 | AlphaMissense pathogenicity |
| 17 | `am_class` | VARCHAR | max_length=20 | pathogenic/ambiguous/benign |
Index configuration:
| Parameter | Value |
|---|---|
| Index Type | IVF_FLAT |
| Metric Type | COSINE |
| nlist | 1024 |
| nprobe (search) | 16 |
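With the COSINE metric, Milvus ranks hits by the cosine of the angle between a 384-dimensional query embedding and each stored embedding (scores near 1.0 mean near-identical directions, as in the `similarity_score` values returned by `/query`). A minimal, dependency-free sketch of what that metric computes:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|).
    Milvus computes this over the 384-dim BGE embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

At search time, `nprobe=16` of the `nlist=1024` IVF clusters are scanned, trading a little recall for much lower latency than an exhaustive scan.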
22.2 Pydantic Data Models¶
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum
class TargetHypothesis(BaseModel):
"""Genomic target identified from variant analysis."""
gene: str
variant_id: str
rsid: Optional[str]
clinical_significance: str
am_pathogenicity: Optional[float]
am_class: Optional[str]
therapeutic_area: str
druggability_score: float
rationale: str
class StructureInfo(BaseModel):
"""PDB structure information for a target protein."""
pdb_id: str
resolution: float
method: str
chain: str
binding_site_volume: Optional[float]
class StructureManifest(BaseModel):
"""Collection of structures for a target."""
target_gene: str
uniprot_id: str
structures: List[StructureInfo]
selected_structure: str
class MoleculeProperties(BaseModel):
"""Chemical properties of a generated molecule."""
molecular_weight: float
logp: float
hbd: int
hba: int
tpsa: float
qed: float
lipinski_pass: bool
class GeneratedMolecule(BaseModel):
"""Molecule generated by MolMIM."""
smiles: str
generation_score: float
similarity_to_seed: float
properties: MoleculeProperties
class DockingResult(BaseModel):
"""Molecular docking result from DiffDock."""
smiles: str
dock_score: float # kcal/mol (negative = better)
confidence: float
pose_sdf: str
class RankedCandidate(BaseModel):
"""Final ranked drug candidate with composite score."""
rank: int
smiles: str
generation_score: float
dock_score: float
dock_score_normalized: float
qed: float
composite_score: float # 0.3*gen + 0.4*dock + 0.3*qed
lipinski_pass: bool
properties: MoleculeProperties
class PipelineConfig(BaseModel):
"""Configuration for a pipeline run."""
mode: str = "full"
target_gene: Optional[str]
seed_smiles: Optional[str]
num_candidates: int = 100
min_qed: float = 0.67
min_dock_score: float = -6.0
molmim_url: str = "http://localhost:8001"
diffdock_url: str = "http://localhost:8002"
claude_model: str = "claude-sonnet-4-20250514"
claude_temperature: float = 0.3
class PipelineRun(BaseModel):
"""Record of a pipeline execution."""
run_id: str
config: PipelineConfig
target: Optional[TargetHypothesis]
structures: Optional[StructureManifest]
candidates_generated: int = 0
candidates_passed_lipinski: int = 0
candidates_drug_like: int = 0
top_composite_score: float = 0.0
status: str = "initialized"
23. Appendix D: Docker Image Reference¶
23.1 All Container Images¶
| Service | Image | Tag | Architecture |
|---|---|---|---|
| Parabricks | nvcr.io/nvidia/clara/clara-parabricks | 4.6.0-1 | ARM64 (aarch64) |
| Milvus | milvusdb/milvus | v2.4-latest | ARM64 |
| MolMIM | nvcr.io/nvidia/clara/bionemo-molmim | 1.0 | ARM64 |
| DiffDock | nvcr.io/nvidia/clara/diffdock | 1.0 | ARM64 |
| Grafana | grafana/grafana-oss | 11.0.0 | ARM64 |
| Prometheus | prom/prometheus | v2.52.0 | ARM64 |
| Node Exporter | prom/node-exporter | v1.8.0 | ARM64 |
| DCGM Exporter | nvcr.io/nvidia/k8s/dcgm-exporter | latest | ARM64 |
| etcd | quay.io/coreos/etcd | v3.5.5 | ARM64 |
| MinIO | minio/minio | latest | ARM64 |
23.2 ARM64 Compatibility Notes¶
The DGX Spark uses an ARM64 (aarch64) processor. All container images must be ARM64-compatible:
- NVIDIA NGC images for Parabricks, BioNeMo, and DCGM include ARM64 variants
- Community images (Grafana, Prometheus, MinIO, etcd) provide multi-arch manifests
- Custom application images must be built with `--platform linux/arm64`
- If building locally, ensure the base image supports ARM64
# Verify image architecture
docker inspect --format='{{.Architecture}}' <image-name>
# Expected: arm64
# Build for ARM64 explicitly
docker build --platform linux/arm64 -t my-service:latest ./my-service/
24. Appendix E: Validation Checklists¶
24.1 Pre-Deployment Checklist¶
| # | Item | Command / Check | Expected |
|---|---|---|---|
| 1 | DGX Spark hardware | `uname -m` | aarch64 |
| 2 | GPU detected | `nvidia-smi` | GB10 GPU listed |
| 3 | Docker installed | `docker --version` | 24.0+ |
| 4 | Docker Compose V2 | `docker compose version` | v2.x |
| 5 | NVIDIA runtime | `docker info \| grep nvidia` | nvidia listed |
| 6 | Python version | `python3 --version` | 3.10+ |
| 7 | Disk space | `df -h /` | >= 320 GB free |
| 8 | Reference genome | `ls genomics/data/reference/GRCh38.fa` | File exists, ~3.1 GB |
| 9 | ClinVar data | `ls rag/data/clinvar/clinvar.vcf.gz` | File exists, ~1.2 GB |
| 10 | AlphaMissense data | `ls rag/data/alphamissense/AlphaMissense_hg38.tsv.gz` | File exists, ~4 GB |
| 11 | API keys configured | `grep ANTHROPIC_API_KEY .env` | Key set (not empty) |
| 12 | NGC key configured | `grep NGC_API_KEY .env` | Key set (not empty) |
| 13 | `.env` permissions | `stat -c %a .env` | 600 |
| 14 | `.env` in `.gitignore` | `grep .env .gitignore` | Present |
24.2 Post-Deployment Checklist¶
| # | Item | Command / Check | Expected |
|---|---|---|---|
| 1 | All containers running | `docker compose ps` | 14+ services "Up" |
| 2 | Landing Page | `curl http://localhost:8080` | HTTP 200 |
| 3 | Genomics Portal | `curl http://localhost:5000/health` | `{"status":"healthy"}` |
| 4 | RAG API | `curl http://localhost:5001/health` | `{"status":"healthy"}` |
| 5 | Milvus ready | `curl http://localhost:19530/v1/health/ready` | `{"status":"ok"}` |
| 6 | Streamlit Chat | `curl -o /dev/null -w "%{http_code}" http://localhost:8501` | 200 |
| 7 | MolMIM ready | `curl http://localhost:8001/v1/health/ready` | `{"status":"ready"}` |
| 8 | DiffDock ready | `curl http://localhost:8002/v1/health/ready` | `{"status":"ready"}` |
| 9 | Discovery UI | `curl http://localhost:8505/health` | `{"status":"healthy"}` |
| 10 | Discovery Portal | `curl http://localhost:8510/health` | `{"status":"healthy"}` |
| 11 | Grafana | `curl http://localhost:3000/api/health` | `{"status":"ok"}` |
| 12 | Prometheus | `curl http://localhost:9099/-/healthy` | HTTP 200 |
| 13 | DCGM metrics | `curl http://localhost:9400/metrics` | Metrics text |
| 14 | Milvus collection | Python: `Collection("genomic_evidence").num_entities` | > 0 |
24.3 Demo Readiness Checklist¶
| # | Item | Check | Expected |
|---|---|---|---|
| 1 | All services healthy | Run validate_deployment.sh |
All [OK] |
| 2 | VCP variant in Milvus | Query gene="VCP" | rs188935092 found |
| 3 | ClinVar annotation | VCP classification | Pathogenic |
| 4 | AlphaMissense score | VCP am_pathogenicity | 0.87 |
| 5 | PDB structures accessible | Query RCSB for VCP | 8OOI, 9DIL, 7K56, 5FTK |
| 6 | MolMIM generates | Test generation from CB-5083 | Molecules returned |
| 7 | DiffDock docks | Test docking against VCP structure | Scores returned |
| 8 | Claude responds | Test RAG query about VCP | Coherent response |
| 9 | Grafana dashboards | Login at port 3000 | Dashboards visible |
| 10 | GPU metrics flowing | Check DCGM in Grafana | GPU util, memory shown |
25. Appendix F: Glossary¶
25.1 Genomics Terms¶
| Term | Definition |
|---|---|
| FASTQ | Text-based format for storing nucleotide sequences and quality scores |
| BAM | Binary Alignment Map — compressed format for aligned sequencing reads |
| VCF | Variant Call Format — standard format for genomic variants |
| SNP | Single Nucleotide Polymorphism — single base-pair variant |
| Indel | Insertion or deletion of nucleotides in the genome |
| WGS | Whole Genome Sequencing — sequencing of entire genome |
| GRCh38 | Genome Reference Consortium Human Build 38 — current reference genome |
| GIAB | Genome in a Bottle — NIST benchmark samples (e.g., HG002) |
| ClinVar | NCBI database of clinically relevant genomic variants |
| VEP | Variant Effect Predictor — functional annotation tool |
| AlphaMissense | DeepMind model predicting missense variant pathogenicity |
| Paired-end | Sequencing both ends of a DNA fragment for improved alignment |
| Coverage (30x) | Average number of reads covering each position in the genome |
25.2 ML/AI Terms¶
| Term | Definition |
|---|---|
| RAG | Retrieval-Augmented Generation — combining search with LLM generation |
| Embedding | Dense vector representation of text or data |
| BGE | BAAI General Embedding — sentence transformer model family |
| IVF_FLAT | Inverted File Index — approximate nearest neighbor search method |
| COSINE | Cosine similarity — metric for comparing vector directions |
| NIM | NVIDIA Inference Microservice — containerized model serving |
| LLM | Large Language Model — e.g., Claude |
| Vector Database | Database optimized for similarity search on dense vectors |
| nlist | Number of clusters in IVF index (build-time parameter) |
| nprobe | Number of clusters to search at query time (recall vs. latency) |
25.3 Drug Discovery Terms¶
| Term | Definition |
|---|---|
| SMILES | Simplified Molecular Input Line Entry System — text notation for molecules |
| PDB | Protein Data Bank — repository of 3D protein structures |
| Molecular Docking | Computational prediction of ligand-protein binding pose and affinity |
| QED | Quantitative Estimate of Drug-likeness — composite drug-likeness score (0-1) |
| Lipinski Rule of Five | Empirical rules predicting oral bioavailability |
| TPSA | Topological Polar Surface Area — predictor of membrane permeability |
| LogP | Partition coefficient — measure of lipophilicity |
| HBD / HBA | Hydrogen Bond Donors / Acceptors |
| Conformer | 3D spatial arrangement of a molecule's atoms |
| Binding Affinity | Strength of interaction between a drug molecule and its target protein |
| kcal/mol | Kilocalories per mole — unit for binding energy (more negative = stronger) |
| MolMIM | Molecule generation model from NVIDIA BioNeMo |
| DiffDock | Diffusion-based molecular docking model |
| Druggability | Assessment of whether a protein target can be modulated by a small molecule |
| CB-5083 | VCP/p97 inhibitor used as seed molecule in the VCP demo |
| RDKit | Open-source cheminformatics toolkit for molecular analysis |
This deployment guide is maintained as part of the HCLS AI Factory open project. For updates, issues, and contributions, visit the project repository on GitHub.
Copyright 2026 Adam Jones. Licensed under the Apache License, Version 2.0.