HCLS AI Factory — Project Bible
Purpose: Complete implementation reference for building the HCLS AI Factory on NVIDIA DGX Spark. This platform transforms patient DNA into novel drug candidates in under 5 hours across three GPU-accelerated stages: Secondary Genomics, RAG-Grounded Target Identification, and AI-Driven Drug Discovery. Import this document into a Claude Code session as context for implementation.
License: Apache 2.0 | Author: Adam Jones | Date: February 2026
Table of Contents
- Project Overview & Goals
- DGX Spark Hardware Reference
- Repository Layout
- Docker Compose Services
- Stage 1: Genomics Pipeline
- Stage 2: RAG/Chat Pipeline
- Milvus Vector Database Schema
- Variant Annotation Pipeline
- Knowledge Base — 201 Genes, 13 Therapeutic Areas
- Anthropic Claude LLM Integration
- Stage 3: Drug Discovery Pipeline
- BioNeMo NIM Services
- Drug-Likeness Scoring
- Cryo-EM Structure Evidence
- VCP/FTD Demo Walkthrough
- Pydantic Data Models
- Nextflow DSL2 Orchestration
- Landing Page & Service Health
- Monitoring Stack
- Cross-Modal Integration
- Configuration Reference
- Deployment Roadmap
- Testing Strategy
- Implementation Sequence
1. Project Overview & Goals
The HCLS AI Factory is an end-to-end precision medicine platform that takes a patient's raw DNA sequencing data (FASTQ) and produces ranked novel drug candidates — all on a single NVIDIA DGX Spark desktop workstation. Three GPU-accelerated stages execute sequentially: variant calling, RAG-grounded target identification, and generative drug discovery.
Three-Stage Pipeline
| Stage |
Function |
Duration |
Key Output |
| 1 — Genomics |
BWA-MEM2 alignment + DeepVariant calling |
120-240 min |
VCF (~11.7M variants) |
| 2 — RAG/Chat |
Annotation → Embedding → LLM reasoning |
Interactive |
Target gene + evidence |
| 3 — Drug Discovery |
MolMIM generation → DiffDock docking → RDKit scoring |
8-16 min |
100 ranked drug candidates |
End-to-End Flow
Patient DNA → Illumina Sequencer → FASTQ (~200 GB)
→ Parabricks fq2bam → BAM
→ DeepVariant → VCF (11.7M variants)
→ ClinVar + AlphaMissense + VEP annotation
→ Milvus vector indexing (3.5M embeddings)
→ Claude RAG reasoning → Target hypothesis (gene + evidence)
→ RCSB PDB structure retrieval
→ MolMIM molecule generation
→ DiffDock molecular docking
→ RDKit drug-likeness scoring
→ 100 ranked novel drug candidates + PDF report
Design Principles
- GPU-first: Every compute-intensive step runs on the GB10 GPU
- Clinically grounded: ClinVar, AlphaMissense, and VEP provide evidence-based annotation
- Reproducible: Nextflow DSL2 orchestration with containerized processes
- Open: Apache 2.0 license, open-source tools, public reference databases
- Desktop-scale: Runs entirely on a $3,999 DGX Spark
2. DGX Spark Hardware Reference
Specifications
| Parameter |
Value |
| CPU |
NVIDIA Grace (ARM64 / aarch64), 144 cores |
| GPU |
NVIDIA GB10, 1 GPU |
| Memory |
128 GB unified LPDDR5x (CPU + GPU shared pool) |
| System RAM |
512 GB |
| Storage |
NVMe, high-throughput I/O |
| Storage Access |
GPUDirect Storage (zero-copy GPU access) |
| Price |
$3,999 |
| OS |
Ubuntu-based (NVIDIA DGX OS) |
Critical: ARM64 Architecture
ALL containers must be ARM64-compatible. The Grace CPU is aarch64, not x86_64. This affects:
- Base Docker images (must use ARM64 variants)
- Python wheel availability (most scientific packages have ARM64 wheels)
- NVIDIA container images (use NGC ARM64 variants)
- Any compiled C/C++ extensions (RDKit, BioPython)
Unified Memory Model
The 128 GB LPDDR5x is shared between CPU and GPU — there is no separate GPU VRAM. This means:
- No explicit CPU→GPU data transfers needed for many operations
- Memory pressure from CPU workloads reduces GPU-available memory
- Monitor total system memory, not just "GPU memory"
- Parabricks fq2bam peaks at ~40 GB, DeepVariant peaks at ~60 GB
Storage Requirements
| Dataset |
Size |
Notes |
| GRCh38 reference |
3.1 GB |
Pre-indexed for BWA-MEM2 |
| FASTQ input (30× WGS) |
~200 GB |
HG002 paired-end |
| BAM intermediate |
~100 GB |
Temporary, deleted after VCF |
| ClinVar database |
~1.2 GB |
4.1M clinical variants |
| AlphaMissense database |
~4 GB |
71M predictions |
| Milvus index |
~2 GB |
3.5M × 384-dim vectors |
| BioNeMo model cache |
~10 GB |
MolMIM + DiffDock weights |
| Total minimum |
~320 GB |
Plus OS and Docker layers |
3. Repository Layout
hcls-ai-factory-public/
├── README.md # Project overview
├── LICENSE # Apache 2.0
├── docker-compose.yml # All services
├── start-services.sh # Service startup orchestration
├── .env.example # Environment variable template
│
├── hls-orchestrator/ # Nextflow pipeline orchestration
│ ├── main.nf # DSL2 entry point
│ ├── nextflow.config # Profiles and parameters
│ ├── run_pipeline.py # Python CLI launcher
│ ├── modules/
│ │ ├── genomics.nf # Stage 1 processes
│ │ ├── rag_chat.nf # Stage 2 processes
│ │ ├── drug_discovery.nf # Stage 3 processes
│ │ └── reporting.nf # Report generation
│ └── tests/
│
├── genomics-pipeline/ # Stage 1: Parabricks
│ ├── README.md # Genomics documentation (48 KB)
│ ├── Dockerfile
│ ├── src/
│ │ ├── run_parabricks.py # fq2bam + DeepVariant launcher
│ │ ├── vcf_stats.py # VCF quality statistics
│ │ └── web_portal.py # Flask portal (:5000)
│ ├── config/
│ │ └── parabricks.yaml # GPU resource allocation
│ └── tests/
│
├── rag-chat-pipeline/ # Stage 2: RAG + Claude
│ ├── README.md # RAG documentation (51 KB)
│ ├── Dockerfile
│ ├── src/
│ │ ├── rag_engine.py # Core RAG orchestration (23 KB)
│ │ ├── milvus_client.py # Milvus vector DB client (13 KB)
│ │ ├── annotator.py # ClinVar + AlphaMissense + VEP (23 KB)
│ │ ├── knowledge.py # 201 genes, 13 areas (88 KB)
│ │ ├── streamlit_chat.py # Chat UI (:8501)
│ │ └── api.py # REST API (:5001)
│ ├── config/
│ │ └── milvus.yaml # Vector DB configuration
│ └── tests/
│
├── drug-discovery-pipeline/ # Stage 3: BioNeMo + RDKit
│ ├── README.md # Drug discovery documentation (56 KB)
│ ├── Dockerfile
│ ├── src/
│ │ ├── pipeline.py # 10-stage orchestration (18 KB)
│ │ ├── nim_clients.py # MolMIM + DiffDock clients (15 KB)
│ │ ├── molecule_generator.py # SMILES generation (11 KB)
│ │ ├── cryoem_evidence.py # Cryo-EM structure scoring (6 KB)
│ │ ├── models.py # Pydantic data models (8 KB)
│ │ ├── streamlit_discovery.py # Discovery UI (:8505)
│ │ └── portal.py # Discovery portal (:8510)
│ ├── config/
│ │ └── discovery.yaml # Pipeline parameters
│ └── tests/
│
├── landing-page/ # HCLS AI Factory entry point
│ ├── Dockerfile
│ └── src/
│ └── landing.py # Flask landing page (:8080)
│
├── monitoring/ # Observability stack
│ ├── prometheus.yml # Scrape configuration
│ └── grafana/
│ └── dashboards/
│ └── hcls-factory.json # GPU + pipeline dashboard
│
└── docs/ # Documentation
├── PRODUCT_DOCUMENTATION.txt # Full product docs (122 KB)
├── ARCHITECTURE_MINDMAP.md # Architecture reference
└── PIPELINE_REPORT.md # Pipeline analysis (29 KB)
4. Docker Compose Services
Port Allocation
| Service |
Port |
Protocol |
Stage |
| Landing Page |
8080 |
HTTP (Flask) |
Orchestration |
| Genomics Portal |
5000 |
HTTP (Flask) |
Stage 1 |
| RAG REST API |
5001 |
HTTP REST |
Stage 2 |
| Milvus Vector DB |
19530 |
gRPC |
Stage 2 |
| Attu (Milvus UI) |
8000 |
HTTP |
Stage 2 |
| Streamlit Chat |
8501 |
HTTP |
Stage 2 |
| MolMIM NIM |
8001 |
HTTP REST |
Stage 3 |
| DiffDock NIM |
8002 |
HTTP REST |
Stage 3 |
| Discovery UI |
8505 |
HTTP (Streamlit) |
Stage 3 |
| Discovery Portal |
8510 |
HTTP |
Stage 3 |
| Grafana |
3000 |
HTTP |
Monitoring |
| Prometheus |
9099 |
HTTP |
Monitoring |
| Node Exporter |
9100 |
HTTP |
Monitoring |
| DCGM Exporter |
9400 |
HTTP |
Monitoring |
Key Container Images
| Service |
Image |
Notes |
| Parabricks |
nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 |
GPU-accelerated genomics |
| Milvus |
milvusdb/milvus:v2.4-latest |
Vector database |
| MolMIM |
nvcr.io/nvidia/clara/bionemo-molmim:1.0 |
Molecule generation NIM |
| DiffDock |
nvcr.io/nvidia/clara/diffdock:1.0 |
Molecular docking NIM |
| Grafana |
grafana/grafana:10.2.2 |
Dashboards |
| Prometheus |
prom/prometheus:v2.48.0 |
Metrics TSDB |
Service Startup Order
The start-services.sh script orchestrates startup in dependency order:
# 1. Infrastructure (Milvus, monitoring)
# 2. Stage 1 services (Parabricks, genomics portal)
# 3. Stage 2 services (RAG engine, Streamlit chat)
# 4. Stage 3 services (BioNeMo NIMs, discovery UI)
# 5. Landing page (health monitor for all 10 services)
Health Monitoring
The landing page at port 8080 monitors 10 services with periodic health checks:
| Service |
Health Endpoint |
Check Interval |
| Parabricks |
Port 5000 /health |
30s |
| Milvus |
Port 19530 gRPC ping |
30s |
| RAG API |
Port 5001 /health |
30s |
| Chat UI |
Port 8501 /healthz |
30s |
| MolMIM NIM |
Port 8001 /v1/health/ready |
30s |
| DiffDock NIM |
Port 8002 /v1/health/ready |
30s |
| Discovery UI |
Port 8505 /healthz |
30s |
| Grafana |
Port 3000 /api/health |
30s |
| Prometheus |
Port 9099 /-/healthy |
30s |
| DCGM Exporter |
Port 9400 /metrics |
30s |
5. Stage 1: Genomics Pipeline
Overview
Stage 1 takes raw FASTQ files from an Illumina sequencer and produces a Variant Call Format (VCF) file using NVIDIA Parabricks — a GPU-accelerated implementation of industry-standard bioinformatics tools.
| Parameter |
Value |
| Sample |
HG002 (GIAB reference standard) |
| Coverage |
30× whole-genome sequencing (WGS) |
| Read Length |
2×250 bp paired-end |
| File Size |
~200 GB (FASTQ pair) |
| Reference Genome |
GRCh38 (3.1 GB, pre-indexed) |
| Format |
FASTQ (gzip-compressed) |
Pipeline Steps
Step 1: BWA-MEM2 Alignment (fq2bam)
pbrun fq2bam \
--ref /reference/GRCh38.fa \
--in-fq /data/HG002_R1.fastq.gz /data/HG002_R2.fastq.gz \
--out-bam /output/HG002.bam \
--num-gpus 1
| Metric |
Value |
| Duration |
20-45 minutes |
| GPU Utilization |
70-90% |
| Peak Memory |
~40 GB |
| Output |
Sorted BAM + BAI index |
| Algorithm |
BWA-MEM2 (GPU-accelerated) |
Step 2: DeepVariant Variant Calling
pbrun deepvariant \
--ref /reference/GRCh38.fa \
--in-bam /output/HG002.bam \
--out-variants /output/HG002.vcf.gz \
--num-gpus 1
| Metric |
Value |
| Duration |
10-35 minutes |
| GPU Utilization |
80-95% |
| Peak Memory |
~60 GB |
| Output |
VCF (gzip-compressed + tabix index) |
| Algorithm |
Google DeepVariant (CNN-based, >99% accuracy) |
Output: VCF Statistics
| Metric |
Count |
| Total Variants |
~11.7M |
| High-Quality (QUAL>30) |
~3.5M |
| SNPs |
~4.2M |
| Indels |
~1.0M |
| Coding Region Variants |
~35,000 |
| Multi-allelic Sites |
~150,000 |
Parabricks Container
Image: nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1
GPU: Required (CUDA)
Volumes: /reference, /data, /output
Port: 5000 (Flask web portal for run status)
Genomics Portal (Port 5000)
The Flask-based portal provides:
- Real-time pipeline progress monitoring
- BAM quality statistics (mapping rate, duplication, coverage)
- VCF summary statistics and variant type distributions
- Run history and configuration logs
6. Stage 2: RAG/Chat Pipeline
Overview
Stage 2 annotates the VCF variants with clinical and functional databases, indexes them in a Milvus vector database, and uses Anthropic Claude with RAG to identify druggable gene targets supported by evidence.
Architecture
VCF (11.7M variants)
→ Quality filter (QUAL>30) → 3.5M variants
→ ClinVar annotation → clinical significance
→ AlphaMissense annotation → pathogenicity prediction
→ VEP annotation → functional consequences
→ BGE-small-en-v1.5 embedding → 384-dim vectors
→ Milvus IVF_FLAT indexing → 3.5M searchable embeddings
→ Claude RAG query → target hypothesis with evidence chain
Annotation Funnel
| Stage |
Variant Count |
Filter |
| Raw VCF |
~11.7M |
— |
| Quality filter |
~3.5M |
QUAL > 30 |
| ClinVar match |
~35,616 |
Clinical significance annotated |
| AlphaMissense match |
~6,831 |
AI pathogenicity predicted |
| Coding + pathogenic |
~2,400 |
Actionable subset |
Embedding Model
| Parameter |
Value |
| Model |
BGE-small-en-v1.5 |
| Dimensions |
384 |
| Input |
Text summary of annotated variant |
| Index Type |
IVF_FLAT |
| Index Params |
nlist=1024 |
| Search Params |
nprobe=16 |
| Distance Metric |
COSINE |
| Total Embeddings |
~3.5M |
Query Flow
- User asks a natural language question in the Streamlit chat
- Query is expanded using 10 therapeutic area keyword maps
- BGE-small-en-v1.5 embeds the expanded query
- Milvus performs approximate nearest-neighbor search (top_k=20)
- Retrieved variant contexts are assembled into a RAG prompt
- Claude processes the prompt with knowledge base grounding
- Response includes gene target, evidence chain, and confidence assessment
7. Milvus Vector Database Schema
Collection: genomic_evidence
| Field |
Type |
Description |
id |
INT64 (PK, auto) |
Primary key |
embedding |
FLOAT_VECTOR(384) |
BGE-small-en-v1.5 embedding |
chrom |
VARCHAR(10) |
Chromosome (chr1-chr22, chrX, chrY) |
pos |
INT64 |
Genomic position |
ref |
VARCHAR(1000) |
Reference allele |
alt |
VARCHAR(1000) |
Alternate allele |
qual |
FLOAT |
Variant quality score |
gene |
VARCHAR(100) |
Gene symbol (e.g., VCP, EGFR) |
consequence |
VARCHAR(200) |
Functional consequence (e.g., missense_variant) |
impact |
VARCHAR(20) |
Impact level (HIGH, MODERATE, LOW, MODIFIER) |
genotype |
VARCHAR(10) |
Sample genotype (0/1, 1/1, etc.) |
text_summary |
VARCHAR(2000) |
Human-readable variant description |
clinical_significance |
VARCHAR(200) |
ClinVar classification |
rsid |
VARCHAR(20) |
dbSNP identifier (e.g., rs188935092) |
disease_associations |
VARCHAR(2000) |
Associated diseases/conditions |
am_pathogenicity |
FLOAT |
AlphaMissense pathogenicity score (0-1) |
am_class |
VARCHAR(20) |
AlphaMissense class (pathogenic/ambiguous/benign) |
Total: 17 fields
Index Configuration
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "COSINE",
"params": {"nlist": 1024}
}
search_params = {
"metric_type": "COSINE",
"params": {"nprobe": 16}
}
Milvus Infrastructure
| Component |
Port |
Purpose |
| Milvus standalone |
19530 |
gRPC vector operations |
| Attu UI |
8000 |
Web-based Milvus management |
| etcd |
2379 |
Metadata storage |
| MinIO |
9000 |
Object storage for indexes |
8. Variant Annotation Pipeline
ClinVar Integration
| Parameter |
Value |
| Database |
ClinVar (NCBI) |
| Total Variants |
4.1M clinical variants |
| Match Rate |
~35,616 / 3.5M variants (1.0%) |
| Classifications |
Pathogenic, Likely pathogenic, VUS, Likely benign, Benign |
| Update Frequency |
Monthly releases |
| Key Fields |
Clinical significance, disease associations, review status |
AlphaMissense Integration
| Parameter |
Value |
| Database |
AlphaMissense (DeepMind) |
| Total Predictions |
71,697,560 missense variant predictions |
| Match Rate |
~6,831 / 35,616 ClinVar variants (19.2%) |
| Model |
AlphaFold-derived protein structure features |
| Output |
Pathogenicity score (0.0-1.0) |
AlphaMissense Thresholds:
| Class |
Score Range |
Interpretation |
| Pathogenic |
> 0.564 |
Likely disease-causing |
| Ambiguous |
0.34 - 0.564 |
Uncertain significance |
| Benign |
< 0.34 |
Likely neutral |
Ensembl VEP Integration
| Parameter |
Value |
| Tool |
Ensembl Variant Effect Predictor (VEP) |
| Purpose |
Functional consequence annotation |
| Output |
Gene, transcript, consequence type, impact level |
| Impact Levels |
HIGH, MODERATE, LOW, MODIFIER |
| Key Consequences |
missense_variant, stop_gained, frameshift_variant, splice_donor_variant |
Annotation Pipeline Code Pattern
# From annotator.py — three-database annotation pipeline
def annotate_variants(vcf_path: str) -> List[AnnotatedVariant]:
"""
Pipeline: VCF → ClinVar → AlphaMissense → VEP → Annotated variants
"""
variants = parse_vcf(vcf_path, min_qual=30) # ~3.5M pass filter
variants = annotate_clinvar(variants) # Clinical significance
variants = annotate_alphamissense(variants) # AI pathogenicity
variants = annotate_vep(variants) # Functional consequences
return variants
9. Knowledge Base — 201 Genes, 13 Therapeutic Areas
Gene Distribution
| Therapeutic Area |
Gene Count |
Example Genes |
| Neurology |
36 |
VCP, APP, PSEN1, MAPT, SOD1, FUS, C9orf72 |
| Oncology |
27 |
EGFR, BRAF, KRAS, TP53, BRCA1, BRCA2, PIK3CA |
| Metabolic |
22 |
GCK, PPARG, SLC2A2, ABCA1, PCSK9 |
| Infectious Disease |
21 |
ACE2, CCR5, IFITM3, TLR4, TMPRSS2 |
| Respiratory |
13 |
CFTR, SERPINA1, MUC5B, TERT |
| Rare Disease |
12 |
VCP, HTT, SMN1, DMD, CFTR |
| Hematology |
12 |
HBB, HBA1, F5, JAK2, CALR |
| GI/Hepatology |
12 |
HFE, ATP7B, NOD2, SERPINA1 |
| Pharmacogenomics |
11 |
CYP2D6, CYP2C19, CYP3A4, DPYD, TPMT |
| Ophthalmology |
11 |
RHO, RPE65, RS1, ABCA4 |
| Cardiovascular |
10 |
LDLR, PCSK9, SCN5A, MYBPC3, KCNQ1 |
| Immunology |
9 |
HLA-B, TNF, IL6, JAK1, CTLA4 |
| Dermatology |
9 |
FLG, MC1R, TYR, KRT14 |
| Total |
201 |
171 druggable (85% druggability) |
Knowledge Base Structure
Each gene entry in knowledge.py (88 KB) contains:
{
"gene": "VCP",
"uniprot": "P55072",
"therapeutic_area": "Neurology",
"diseases": ["Frontotemporal Dementia", "ALS", "IBMPFD"],
"druggability": "High",
"drug_targets": ["D2 ATPase domain", "N-D1 interface"],
"known_inhibitors": ["CB-5083", "NMS-873"],
"variant_hotspots": ["R155H", "R191Q", "A232E"],
"pathway": "Ubiquitin-proteasome system",
"mechanism": "AAA+ ATPase, protein homeostasis",
}
Query Expansion Maps
10 therapeutic area query expansion maps enrich user queries with domain-specific terminology:
QUERY_EXPANSION = {
"oncology": ["tumor", "cancer", "neoplasm", "carcinoma", "mutation driver",
"somatic", "germline", "tumor suppressor", "oncogene"],
"neurology": ["neurodegeneration", "dementia", "ALS", "Parkinson",
"Alzheimer", "frontotemporal", "motor neuron"],
# ... 8 more therapeutic areas
}
10. Anthropic Claude LLM Integration
Configuration
| Parameter |
Value |
| Model |
claude-sonnet-4-20250514 |
| Temperature |
0.3 |
| Max Tokens |
4096 |
| API |
Anthropic Messages API |
| Role |
RAG-grounded clinical reasoning |
RAG Prompt Structure
system_prompt = """You are a clinical genomics specialist analyzing patient
variant data. Ground all responses in the retrieved variant evidence and
knowledge base. Cite specific variants, genes, and clinical classifications.
When recommending drug targets, explain the evidence chain from variant
to disease mechanism to druggability assessment."""
user_prompt = f"""
## Retrieved Variant Evidence (top {top_k} matches)
{formatted_variants}
## Knowledge Base Context
{knowledge_context}
## User Question
{user_question}
"""
Claude generates structured target hypotheses:
{
"target_gene": "VCP",
"confidence": "high",
"evidence_chain": [
"rs188935092 (chr9:35065263 G>A) — ClinVar: Pathogenic",
"AlphaMissense: 0.87 (pathogenic, >0.564 threshold)",
"VEP: missense_variant, HIGH impact",
"Known drug target: CB-5083 (Phase I VCP inhibitor)",
"Druggability: 0.92 (D2 ATPase domain, ~450ų pocket)"
],
"therapeutic_area": "Neurology",
"diseases": ["Frontotemporal Dementia", "ALS", "IBMPFD"],
"recommended_action": "Proceed to drug discovery with VCP as primary target"
}
11. Stage 3: Drug Discovery Pipeline
Overview
Stage 3 takes a target gene hypothesis from Stage 2 and produces 100 ranked novel drug candidates using BioNeMo generative chemistry, molecular docking, and drug-likeness scoring.
10-Stage Pipeline
| Stage |
Process |
Description |
| 1 |
Initialize |
Load target hypothesis, validate inputs |
| 2 |
Normalize Target |
Map gene → UniProt ID → PDB structures |
| 3 |
Structure Discovery |
Query RCSB PDB for Cryo-EM/X-ray structures |
| 4 |
Structure Preparation |
Score and rank structures, select best binding site |
| 5 |
Molecule Generation |
MolMIM generates novel SMILES from seed compound |
| 6 |
Chemistry QC |
RDKit validates chemical feasibility |
| 7 |
Conformer Generation |
RDKit 3D conformer embedding (ETKDG) |
| 8 |
Molecular Docking |
DiffDock predicts binding poses and affinities |
| 9 |
Composite Ranking |
Weighted scoring: 30% gen + 40% dock + 30% QED |
| 10 |
Reporting |
PDF report generation (ReportLab) |
Pipeline Configuration
# From pipeline.py
PIPELINE_CONFIG = {
"num_candidates": 100,
"molmim_endpoint": "http://localhost:8001/v1/generate",
"diffdock_endpoint": "http://localhost:8002/v1/dock",
"min_qed": 0.3,
"min_dock_score": -6.0, # kcal/mol
"scoring_weights": {
"generation": 0.30,
"docking": 0.40,
"qed": 0.30
}
}
UniProt Mappings
| Gene |
UniProt ID |
Function |
| VCP |
P55072 |
AAA+ ATPase, protein homeostasis |
| EGFR |
P00533 |
Receptor tyrosine kinase |
| BRAF |
P15056 |
Serine/threonine kinase |
| KRAS |
P01116 |
GTPase signaling |
12. BioNeMo NIM Services
MolMIM (Port 8001) — Molecule Generation
| Parameter |
Value |
| Endpoint |
POST http://localhost:8001/v1/generate |
| Model |
MolMIM (Molecular Masked Inverse Model) |
| Input |
Seed SMILES string |
| Output |
Novel SMILES candidates |
| Method |
Masked language model on molecular tokens |
| Container |
nvcr.io/nvidia/clara/bionemo-molmim:1.0 |
Request Format:
{
"smiles": "CC(=O)Nc1ccc(O)cc1",
"num_molecules": 100,
"temperature": 0.7,
"top_k": 50
}
Response Format:
{
"molecules": [
{"smiles": "CC(=O)Nc1ccc(O)c(F)c1", "score": 0.85},
{"smiles": "CC(=O)Nc1ccc(O)c(Cl)c1", "score": 0.82}
]
}
DiffDock (Port 8002) — Molecular Docking
| Parameter |
Value |
| Endpoint |
POST http://localhost:8002/v1/dock |
| Model |
DiffDock (diffusion-based docking) |
| Input |
Ligand SMILES + protein PDB structure |
| Output |
Binding pose + affinity score (kcal/mol) |
| Method |
Score-based generative diffusion model |
| Container |
nvcr.io/nvidia/clara/diffdock:1.0 |
Request Format:
{
"ligand_smiles": "CC(=O)Nc1ccc(O)c(F)c1",
"protein_pdb": "<PDB file content or path>",
"num_poses": 5
}
Response Format:
{
"poses": [
{"score": -8.7, "confidence": 0.92, "pose_pdb": "..."},
{"score": -7.3, "confidence": 0.84, "pose_pdb": "..."}
]
}
Docking Score Interpretation
| Score (kcal/mol) |
Interpretation |
| -12 to -8 |
Excellent binding affinity |
| -8 to -6 |
Good binding affinity |
| -6 to -4 |
Moderate binding affinity |
| > -4 |
Weak binding affinity |
13. Drug-Likeness Scoring
Lipinski's Rule of Five
| Rule |
Threshold |
Description |
| Molecular Weight |
≤ 500 Da |
Oral absorption limit |
| LogP |
≤ 5 |
Lipophilicity |
| H-Bond Donors |
≤ 5 |
NH + OH groups |
| H-Bond Acceptors |
≤ 10 |
N + O atoms |
QED (Quantitative Estimate of Drug-likeness)
| Range |
Interpretation |
| > 0.67 |
Drug-like (favorable properties) |
| 0.49 - 0.67 |
Moderate drug-likeness |
| < 0.49 |
Less drug-like |
TPSA (Topological Polar Surface Area)
| Range (Ų) |
Interpretation |
| < 140 |
Good oral bioavailability |
| 60-90 |
Optimal range |
| > 140 |
Poor oral absorption |
def compute_composite_score(gen_score, dock_score, qed_score):
"""
Weighted composite: 30% generation + 40% docking + 30% QED
Docking normalization: scale raw kcal/mol to 0-1 range
dock_normalized = max(0, min(1, (10 + dock_score) / 20))
Example: dock_score = -8.5 → (10 + (-8.5)) / 20 = 0.075 → normalized
"""
dock_normalized = max(0.0, min(1.0, (10.0 + dock_score) / 20.0))
composite = (
0.30 * gen_score +
0.40 * dock_normalized +
0.30 * qed_score
)
return composite
RDKit Property Calculation
from rdkit import Chem
from rdkit.Chem import Descriptors, QED
def calculate_properties(smiles: str) -> dict:
mol = Chem.MolFromSmiles(smiles)
return {
"molecular_weight": Descriptors.MolWt(mol),
"logp": Descriptors.MolLogP(mol),
"hbd": Descriptors.NumHDonors(mol),
"hba": Descriptors.NumHAcceptors(mol),
"tpsa": Descriptors.TPSA(mol),
"qed": QED.qed(mol),
"rotatable_bonds": Descriptors.NumRotatableBonds(mol),
"lipinski_pass": all([
Descriptors.MolWt(mol) <= 500,
Descriptors.MolLogP(mol) <= 5,
Descriptors.NumHDonors(mol) <= 5,
Descriptors.NumHAcceptors(mol) <= 10,
])
}
14. Cryo-EM Structure Evidence
Structure Scoring Algorithm
The pipeline automatically retrieves and scores protein structures from RCSB PDB:
def score_structure(structure: StructureInfo) -> float:
"""
Score a PDB structure for suitability in drug discovery.
Factors:
- Resolution: lower is better (max 5 Å cutoff)
- Inhibitor-bound: +3 bonus (binding site already defined)
- Druggable pockets: +0.5 per pocket
- Cryo-EM method: +0.5 (modern technique bonus)
"""
score += max(0, 5.0 - resolution) # Resolution: 0-5 scale
if has_inhibitor_bound:
score += 3.0
score += num_druggable_pockets * 0.5
if 'Cryo-EM' in method:
score += 0.5
return score
VCP Structures (Demo)
| PDB ID |
Resolution |
Method |
Description |
Score |
| 8OOI |
2.9 Å |
Cryo-EM |
WT VCP hexamer |
High |
| 9DIL |
3.2 Å |
Cryo-EM |
Mutant VCP |
High |
| 7K56 |
2.5 Å |
Cryo-EM |
VCP complex |
Highest |
| 5FTK |
2.3 Å |
X-ray |
VCP + CB-5083 inhibitor |
Highest (inhibitor-bound) |
VCP Binding Site
| Parameter |
Value |
| Domain |
D2 ATPase domain |
| Mechanism |
ATP-competitive inhibition |
| Pocket Volume |
~450 ų |
| Druggability Score |
0.92 |
| Key Residues |
ALA464, GLY479, ASP320, GLY215 |
15. VCP/FTD Demo Walkthrough
Demo Target: Valosin-Containing Protein (VCP/p97)
| Parameter |
Value |
| Gene |
VCP |
| Protein |
p97 / Valosin-Containing Protein |
| UniProt |
P55072 |
| Function |
AAA+ ATPase, ubiquitin-proteasome pathway |
| Diseases |
Frontotemporal Dementia (FTD), ALS, IBMPFD |
| Variant |
rs188935092 (chr9:35065263 G>A) |
| ClinVar |
Pathogenic |
| AlphaMissense |
0.87 (pathogenic, >0.564 threshold) |
| Seed Compound |
CB-5083 (Phase I clinical VCP inhibitor) |
Demo Flow
Stage 1 — Genomics (Demo Mode: ~20 min):
1. Load pre-processed HG002 FASTQ subset
2. Run Parabricks fq2bam alignment
3. Run DeepVariant variant calling
4. Output VCF with ~11.7M variants including rs188935092
Stage 2 — RAG/Chat (Interactive):
1. VCF annotated: ClinVar flags rs188935092 as pathogenic in VCP
2. AlphaMissense scores the missense variant at 0.87 (pathogenic)
3. 3.5M variants embedded and indexed in Milvus
4. User queries: "What are the most promising drug targets in this patient's genome?"
5. Claude identifies VCP with full evidence chain
6. Target hypothesis: VCP → FTD → druggable D2 ATPase domain
Stage 3 — Drug Discovery (~10 min):
1. VCP → UniProt P55072 → PDB structure retrieval
2. Cryo-EM structures scored: 8OOI, 9DIL, 7K56, 5FTK
3. 5FTK selected (inhibitor-bound, highest score)
4. CB-5083 seed SMILES → MolMIM generates 100 novel analogs
5. RDKit validates Lipinski, QED, TPSA
6. DiffDock docks each candidate against VCP D2 domain
7. Composite ranking: 30% generation + 40% docking + 30% QED
8. Top candidates: novel VCP inhibitors with improved drug-likeness
9. PDF report generated via ReportLab
Expected Demo Output
Pipeline: HCLS AI Factory — VCP/FTD Demo
Target: VCP (P55072) — Frontotemporal Dementia
Seed: CB-5083 (ATP-competitive VCP inhibitor)
Structure: 5FTK (2.3 Å, X-ray, inhibitor-bound)
Results:
- 100 novel VCP inhibitor candidates generated
- 87 pass Lipinski's Rule of Five
- 72 have QED > 0.67 (drug-like)
- Top 10 show docking scores -8.2 to -11.4 kcal/mol
- Composite scores range 0.68-0.89
16. Pydantic Data Models
Core Models (from models.py)
from pydantic import BaseModel, Field
from typing import List, Optional
from enum import Enum
class TargetHypothesis(BaseModel):
"""Output from Stage 2 — RAG-identified drug target"""
gene: str # e.g., "VCP"
uniprot_id: str # e.g., "P55072"
confidence: str # high, medium, low
evidence_chain: List[str] # Supporting evidence items
therapeutic_area: str # e.g., "Neurology"
diseases: List[str] # Associated conditions
druggability_score: float # 0-1 scale
class StructureInfo(BaseModel):
"""PDB structure metadata"""
pdb_id: str # e.g., "8OOI"
resolution: float # Angstroms
method: str # ELECTRON MICROSCOPY, X-RAY DIFFRACTION
title: str
has_inhibitor: bool
num_pockets: int
score: float # Computed suitability score
class StructureManifest(BaseModel):
"""Collection of scored PDB structures for a target"""
target_gene: str
uniprot_id: str
structures: List[StructureInfo]
best_structure: str # PDB ID of highest-scored
class MoleculeProperties(BaseModel):
"""RDKit-computed molecular properties"""
smiles: str
molecular_weight: float
logp: float
hbd: int # H-bond donors
hba: int # H-bond acceptors
tpsa: float # Topological polar surface area
qed: float # Quantitative drug-likeness
rotatable_bonds: int
lipinski_pass: bool
class GeneratedMolecule(BaseModel):
"""MolMIM output — a novel molecule candidate"""
smiles: str
generation_score: float # MolMIM confidence
properties: Optional[MoleculeProperties]
class DockingResult(BaseModel):
"""DiffDock output — binding prediction"""
ligand_smiles: str
dock_score: float # kcal/mol (negative = better)
confidence: float # 0-1 model confidence
pose_pdb: Optional[str] # PDB-format binding pose
class RankedCandidate(BaseModel):
"""Final ranked drug candidate with composite score"""
rank: int
smiles: str
generation_score: float
dock_score: float
qed: float
composite_score: float # 30% gen + 40% dock + 30% QED
lipinski_pass: bool
molecular_weight: float
logp: float
class PipelineConfig(BaseModel):
"""Pipeline execution configuration"""
mode: str # full, target, drug, demo, genomics_only
num_candidates: int = 100
min_qed: float = 0.3
min_dock_score: float = -6.0
molmim_url: str = "http://localhost:8001/v1/generate"
diffdock_url: str = "http://localhost:8002/v1/dock"
class PipelineRun(BaseModel):
"""Complete pipeline execution record"""
run_id: str
mode: str
target: Optional[TargetHypothesis]
structures: Optional[StructureManifest]
candidates: List[RankedCandidate]
total_generated: int
total_passed_qc: int
total_docked: int
duration_seconds: float
status: str # running, completed, failed
17. Nextflow DSL2 Orchestration
Pipeline Modes
| Mode |
Stages |
Description |
full |
1 → 2 → 3 |
Complete end-to-end pipeline |
target |
2 → 3 |
Skip genomics, use existing VCF |
drug |
3 only |
Skip to drug discovery with known target |
demo |
1 → 2 → 3 |
Pre-configured VCP/FTD demonstration |
genomics_only |
1 only |
Run only variant calling |
Main Pipeline Entry (main.nf)
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
include { GENOMICS_PIPELINE } from './modules/genomics'
include { RAG_CHAT_PIPELINE } from './modules/rag_chat'
include { DRUG_DISCOVERY_PIPELINE } from './modules/drug_discovery'
include { REPORTING } from './modules/reporting'
workflow {
if (params.mode in ['full', 'demo', 'genomics_only']) {
GENOMICS_PIPELINE(
params.fastq_r1,
params.fastq_r2,
params.reference
)
}
if (params.mode in ['full', 'demo', 'target']) {
RAG_CHAT_PIPELINE(
params.mode == 'target' ? params.vcf : GENOMICS_PIPELINE.out.vcf
)
}
if (params.mode in ['full', 'demo', 'target', 'drug']) {
DRUG_DISCOVERY_PIPELINE(
params.mode == 'drug' ? params.target_gene : RAG_CHAT_PIPELINE.out.target
)
}
REPORTING(
DRUG_DISCOVERY_PIPELINE.out.candidates
)
}
Nextflow Profiles
| Profile |
Description |
standard |
Default local execution |
docker |
Docker container execution |
singularity |
Singularity container execution |
dgx_spark |
DGX Spark optimized (GPU resources) |
slurm |
HPC cluster submission |
test |
Minimal test data |
Pipeline Launcher (run_pipeline.py)
# Full pipeline
python run_pipeline.py --mode full \
--fastq-r1 /data/HG002_R1.fastq.gz \
--fastq-r2 /data/HG002_R2.fastq.gz \
--reference /reference/GRCh38.fa
# Demo mode (pre-configured VCP/FTD)
python run_pipeline.py --mode demo
# Drug discovery only (known target)
python run_pipeline.py --mode drug --target-gene VCP --seed-smiles "CC(=O)..."
18. Landing Page & Service Health
Landing Page (Port 8080)
The Flask-based landing page serves as the entry point for the HCLS AI Factory:
- URL:
http://localhost:8080
- Framework: Flask
- Features:
- 10-service health status dashboard
- Pipeline mode selector (full, target, drug, demo)
- Quick-start links to all service UIs
- Real-time service status with green/red indicators
- Pipeline execution history
Service Health Check Implementation
SERVICES = [
{"name": "Parabricks Portal", "url": "http://localhost:5000/health", "port": 5000},
{"name": "Milvus Vector DB", "url": "http://localhost:19530", "port": 19530},
{"name": "RAG API", "url": "http://localhost:5001/health", "port": 5001},
{"name": "Streamlit Chat", "url": "http://localhost:8501/healthz", "port": 8501},
{"name": "MolMIM NIM", "url": "http://localhost:8001/v1/health/ready", "port": 8001},
{"name": "DiffDock NIM", "url": "http://localhost:8002/v1/health/ready", "port": 8002},
{"name": "Discovery UI", "url": "http://localhost:8505/healthz", "port": 8505},
{"name": "Grafana", "url": "http://localhost:3000/api/health", "port": 3000},
{"name": "Prometheus", "url": "http://localhost:9099/-/healthy", "port": 9099},
{"name": "DCGM Exporter", "url": "http://localhost:9400/metrics", "port": 9400},
]
19. Monitoring Stack
Grafana (Port 3000)
| Parameter |
Value |
| Image |
grafana/grafana:10.2.2 |
| Default User |
admin / changeme |
| Dashboards |
HCLS AI Factory (GPU, pipeline, services) |
| Data Source |
Prometheus |
Prometheus (Port 9099)
| Parameter |
Value |
| Image |
prom/prometheus:v2.48.0 |
| Internal Port |
9090 → External 9099 |
| Retention |
30 days |
| Scrape Targets |
Node Exporter, DCGM Exporter, service metrics |
Node Exporter (Port 9100)
| Metric Category |
Examples |
| CPU |
Usage %, load average, core temperatures |
| Memory |
Used/free/cached, swap usage |
| Disk |
I/O throughput, NVMe utilization, space |
| Network |
Bandwidth, packet rates, error rates |
DCGM Exporter (Port 9400)
| Metric |
Description |
DCGM_FI_DEV_GPU_UTIL |
GPU utilization percentage |
DCGM_FI_DEV_FB_USED |
GPU memory used (bytes) |
DCGM_FI_DEV_FB_FREE |
GPU memory free (bytes) |
DCGM_FI_DEV_GPU_TEMP |
GPU temperature (°C) |
DCGM_FI_DEV_POWER_USAGE |
GPU power draw (watts) |
DCGM_FI_DEV_SM_CLOCK |
SM clock frequency (MHz) |
Key Dashboard Panels
- GPU Utilization Timeline — fq2bam (70-90%) → DeepVariant (80-95%) → idle → MolMIM/DiffDock bursts
- Pipeline Stage Progress — Stage 1/2/3 completion with timing
- Memory Pressure — Unified memory usage across CPU + GPU workloads
- Service Health Grid — Green/red status for all 10 services
- Variant Processing Rate — Variants annotated per second
- Drug Discovery Throughput — Molecules generated/docked per minute
20. Cross-Modal Integration
HCLS AI Factory Ecosystem
The genomics-to-drug-discovery pipeline integrates with the broader HCLS AI Factory:
Imaging Intelligence Agent (CT/MRI/X-Ray)
│
├── Lung-RADS 4B+ finding
│ ↓
│ FHIR ServiceRequest
│ ↓
├── Trigger genomics analysis (Parabricks)
│ ↓
│ Tumor gene profiling
│ ↓
└── Drug candidates → Combined imaging + genomics report
Cross-Modal Triggers
| Trigger |
Source |
Target |
Action |
| Lung-RADS 4B+ |
Imaging Agent |
Genomics Pipeline |
Initiate tumor profiling |
| Pathogenic Variant |
Genomics Pipeline |
Drug Discovery |
Generate targeted therapies |
| Drug Candidates |
Drug Discovery |
Imaging Agent |
Combined clinical report |
NVIDIA FLARE — Federated Learning
For multi-site deployments (Phase 3), NVIDIA FLARE enables federated model training:
- Models train locally at each site
- Only model updates (not patient data) are shared
- Aggregation server combines updates
- Privacy-preserving: raw genomic data never leaves the institution
21. Configuration Reference
Environment Variables
| Variable |
Default |
Description |
ANTHROPIC_API_KEY |
(required) |
Anthropic API key for Claude |
NGC_API_KEY |
(required) |
NVIDIA NGC key for BioNeMo NIMs |
REFERENCE_GENOME |
/reference/GRCh38.fa |
Path to GRCh38 reference |
MILVUS_HOST |
localhost |
Milvus server hostname |
MILVUS_PORT |
19530 |
Milvus gRPC port |
MOLMIM_URL |
http://localhost:8001 |
MolMIM NIM endpoint |
DIFFDOCK_URL |
http://localhost:8002 |
DiffDock NIM endpoint |
CLAUDE_MODEL |
claude-sonnet-4-20250514 |
Claude model identifier |
CLAUDE_TEMPERATURE |
0.3 |
LLM temperature |
PIPELINE_MODE |
full |
Pipeline execution mode |
NUM_CANDIDATES |
100 |
Number of drug candidates to generate |
MIN_QED |
0.3 |
Minimum QED threshold |
MIN_DOCK_SCORE |
-6.0 |
Minimum docking score (kcal/mol) |
GRAFANA_USER |
admin |
Grafana admin username |
GRAFANA_PASSWORD |
changeme |
Grafana admin password |
AlphaMissense Thresholds
AM_PATHOGENIC_THRESHOLD = 0.564
AM_AMBIGUOUS_LOWER = 0.34
AM_AMBIGUOUS_UPPER = 0.564
AM_BENIGN_THRESHOLD = 0.34
Scoring Weights
SCORING_WEIGHTS = {
"generation": 0.30, # MolMIM generation confidence
"docking": 0.40, # DiffDock binding affinity
"qed": 0.30 # RDKit drug-likeness
}
Drug-Likeness Thresholds
LIPINSKI = {
"max_mw": 500, # Molecular weight (Da)
"max_logp": 5, # Partition coefficient
"max_hbd": 5, # H-bond donors
"max_hba": 10 # H-bond acceptors
}
QED_THRESHOLDS = {
"drug_like": 0.67, # QED > 0.67
"moderate": 0.49, # 0.49 < QED < 0.67
"less_drug_like": 0 # QED < 0.49
}
DOCKING_THRESHOLDS = {
"excellent": -8.0, # kcal/mol
"good": -6.0,
"moderate": -4.0,
"minimum": -6.0 # Pipeline cutoff
}
22. Deployment Roadmap
Phase 1: Proof Build
| Parameter |
Value |
| Hardware |
NVIDIA DGX Spark ($3,999) |
| Orchestration |
Docker Compose |
| Scale |
Single patient, sequential processing |
| Timeline |
Proof of concept |
| GPU |
1× GB10 |
| Memory |
128 GB unified |
Phase 2: Departmental
| Parameter |
Value |
| Hardware |
1-2× DGX B200 |
| Orchestration |
Kubernetes |
| Scale |
Multiple concurrent patients |
| GPU |
8× B200 per node |
| Memory |
1-2 TB HBM3e |
| Networking |
InfiniBand |
Phase 3: Enterprise / Multi-Site
| Parameter |
Value |
| Hardware |
DGX SuperPOD |
| Orchestration |
Kubernetes + NVIDIA FLARE |
| Scale |
Thousands of concurrent patients |
| GPU |
Hundreds of B200 GPUs |
| Networking |
InfiniBand fabric |
| Privacy |
Federated learning (data stays local) |
Scaling Considerations
| Bottleneck |
Phase 1 Solution |
Phase 2+ Solution |
| Genomics throughput |
Sequential (1 sample) |
Parallel Parabricks instances |
| Milvus query latency |
Single-node Milvus |
Milvus cluster with sharding |
| BioNeMo inference |
Single NIM per model |
Multiple NIM replicas |
| Storage I/O |
NVMe direct |
GPUDirect Storage + RAID |
23. Testing Strategy
Unit Tests
| Component |
Test Focus |
| VCF Parser |
Variant extraction, quality filtering |
| Annotator |
ClinVar/AlphaMissense/VEP lookup accuracy |
| Milvus Client |
Index creation, search recall |
| MolMIM Client |
SMILES generation, request format |
| DiffDock Client |
Docking request/response parsing |
| RDKit Scoring |
Lipinski, QED, TPSA calculations |
| Composite Scorer |
Weight application, normalization |
Integration Tests
| Test |
Validates |
| VCF → Annotation → Milvus |
End-to-end Stage 2 pipeline |
| Target → PDB → MolMIM → DiffDock |
End-to-end Stage 3 pipeline |
| Health check endpoints |
All 10 services responding |
| Nextflow modes |
full, target, drug, demo execution |
Demo Mode Validation
The demo pipeline mode uses pre-configured inputs to validate the complete pipeline:
- Input: HG002 FASTQ subset (smaller dataset for faster execution)
- Expected: VCP identified as target with rs188935092 evidence
- Expected: 100 novel VCP inhibitor candidates ranked
- Validation: Top candidates show improved QED vs CB-5083 seed
24. Implementation Sequence
Recommended Build Order
- Infrastructure: Docker Compose, Milvus, monitoring stack
- Stage 1 — Genomics: Parabricks container, fq2bam, DeepVariant, VCF output
- Stage 2 — Annotation: ClinVar + AlphaMissense + VEP pipeline
- Stage 2 — Vector DB: Milvus schema, BGE embedding, IVF_FLAT index
- Stage 2 — RAG: Claude integration, knowledge base, query expansion
- Stage 2 — Chat UI: Streamlit interface, REST API
- Stage 3 — Structure: RCSB PDB retrieval, Cryo-EM scoring
- Stage 3 — Generation: MolMIM NIM, molecule generation
- Stage 3 — Docking: DiffDock NIM, binding prediction
- Stage 3 — Scoring: RDKit properties, composite ranking
- Stage 3 — Reporting: PDF generation, Discovery UI
- Orchestration: Nextflow DSL2, pipeline modes, landing page
- Testing: Unit tests, integration tests, demo mode validation
- Monitoring: Grafana dashboards, alerting rules
Key Dependencies
GRCh38 reference → BWA-MEM2 index → fq2bam alignment
ClinVar + AlphaMissense databases → Annotation pipeline
Milvus running → Embedding indexing → RAG queries
BioNeMo NIMs running → Molecule generation + docking
All services healthy → Landing page green status
This Project Bible is the authoritative technical reference for the HCLS AI Factory. All other documentation assets (White Paper, Demo Guide, Intelligence Report, Learning Guides) derive their technical details from this source.