HCLS AI Factory — Learning Guide: Advanced¶
Level: Graduate / Professional — Assumes familiarity with molecular biology, bioinformatics, and machine learning concepts.
What You'll Learn: Deep technical analysis of the HCLS AI Factory architecture, from BWA-MEM2 seed-and-extend algorithms through diffusion-based molecular docking, with emphasis on algorithmic design decisions, scaling bottlenecks, and clinical translation barriers.
License: Apache 2.0 | Author: Adam Jones | Date: February 2026
Chapter 1: Computational Genomics — From FASTQ to VCF¶
1.1 Sequencing Data Characteristics¶
The HCLS AI Factory processes Illumina short-read data: 2×250 bp paired-end reads from 30× whole-genome sequencing of HG002 (NA24385), a GIAB Ashkenazi Jewish reference standard. The FASTQ files total approximately 200 GB and contain ~800 million read pairs.
Why HG002? The Genome in a Bottle (GIAB) Consortium provides extensively validated truth sets for HG002, enabling rigorous benchmarking. The high-confidence regions cover >95% of the GRCh38 reference, with variant calls validated by multiple orthogonal technologies (PacBio HiFi, Oxford Nanopore, Hi-C, optical mapping).
1.2 GPU-Accelerated Alignment: BWA-MEM2 on Parabricks¶
NVIDIA Parabricks 4.6.0-1 (container: nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1) provides a GPU-accelerated implementation of BWA-MEM2.
Algorithm overview: BWA-MEM2 uses a seed-and-extend approach:
1. Seeding: Find super-maximal exact matches (SMEMs) between the query read and the reference using the FM-index
2. Chaining: Group collinear seeds into chains representing candidate alignment locations
3. Extension: Perform Smith-Waterman local alignment around each chain to produce the final alignment
4. Scoring: Select the best alignment and assign a MAPQ (mapping quality) score
GPU acceleration strategy: Parabricks parallelizes the computationally intensive Smith-Waterman extension step across GPU cores. The FM-index lookup (seeding) remains CPU-bound but constitutes a small fraction of total compute. The fq2bam command also integrates coordinate sorting and duplicate marking, eliminating separate samtools sort and picard MarkDuplicates steps.
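The Smith-Waterman extension step can be illustrated with a minimal pure-Python sketch. This is a toy dynamic-programming version with illustrative scoring parameters, not the vectorized kernel Parabricks runs on GPU:

```python
def smith_waterman(query, ref, match=2, mismatch=-2, gap=-3):
    """Toy Smith-Waterman local alignment: returns the best local score.
    In the real aligner this DP is run only in a band around each chained
    seed, which is what makes GPU parallelization across reads effective."""
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == ref[j - 1] else mismatch)
            # Local alignment: scores never drop below zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A 7 bp read with one mismatch against the reference: 6 matches, 1 mismatch
print(smith_waterman("ACGTTGC", "ACGATGC"))  # → 10
```

Each cell depends only on its three upper-left neighbors, which is why thousands of such small alignments can be batched across GPU cores.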
Performance on DGX Spark (GB10):
| Metric | Value |
|---|---|
| Wall time | 20-45 minutes |
| GPU utilization | 70-90% |
| Peak memory | ~40 GB (of 128 GB unified) |
| Output | Sorted BAM + BAI index |
| Mapping rate | >99.5% |
| Duplicate rate | ~8-12% |
1.3 Deep Learning Variant Calling: DeepVariant¶
Google DeepVariant reframes variant calling as an image classification problem. For each candidate variant site, it constructs a pileup image — a visual representation of aligned reads at that position — and classifies it using a convolutional neural network (CNN).
Architecture details:
- Input: Pileup image (channels: read bases, base qualities, mapping qualities, strand, etc.)
- Network: Inception-v3 CNN architecture
- Output: Three-class softmax (homozygous reference, heterozygous variant, homozygous variant)
- Training: Supervised on GIAB truth sets, with data augmentation and hard example mining
Why DeepVariant outperforms GATK HaplotypeCaller:
1. The CNN learns complex error patterns that statistical models cannot capture
2. No explicit error model required — the network learns directly from data
3. Better performance on indels and complex variants
4. Transferable across sequencing platforms (Illumina, PacBio, ONT)
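The base channel of a pileup can be sketched as allele fractions at a candidate site. This is a hypothetical helper for intuition only; DeepVariant's actual input stacks several channels (bases, qualities, strand) into a multi-dimensional tensor:

```python
from collections import Counter

def pileup_column(aligned_bases):
    """Toy pileup: fraction of read support per allele at one candidate site.
    Only the base channel is modeled here; the real pileup image also
    encodes base quality, mapping quality, and strand per read."""
    counts = Counter(aligned_bases)
    depth = len(aligned_bases)
    return {base: n / depth for base, n in counts.items()}

# A 30x site with a heterozygous A>G variant: roughly half the reads
# support each allele, the signature the CNN learns to classify
print(pileup_column("A" * 16 + "G" * 14))
```

A homozygous variant would show ~100% alternate support, and sequencing errors show low-fraction noise: the three patterns behind the three-class softmax.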
Performance:
| Metric | Value |
|---|---|
| Wall time | 10-35 minutes (GPU-accelerated via Parabricks) |
| GPU utilization | 80-95% |
| Peak memory | ~60 GB |
| SNP F1 | >99.7% on HG002 |
| Indel F1 | >99.4% on HG002 |
| Total variants | ~11.7M (unfiltered) |
| QUAL>30 variants | ~3.5M |
1.4 VCF Quality Metrics¶
| Metric | Expected Range | Interpretation |
|---|---|---|
| Ti/Tv ratio | 2.0-2.1 | Transition/transversion ratio; deviation suggests systematic error |
| Het/Hom ratio | 1.5-2.0 | Heterozygous/homozygous ratio; population-dependent |
| SNP count | ~4.2M | Consistent with Ashkenazi ancestry |
| Indel count | ~1.0M | Normal range for WGS |
| Novel variant rate | <5% | Variants not in dbSNP; higher rates suggest error |
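The Ti/Tv metric in the table can be computed directly from REF/ALT pairs. A minimal sketch with illustrative counts:

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snps):
    """Compute the transition/transversion ratio from (ref, alt) SNP pairs."""
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# Illustrative callset: a healthy WGS sample should land near 2.0-2.1
snps = ([("A", "G")] * 2000 + [("C", "T")] * 2100
        + [("A", "C")] * 1000 + [("G", "T")] * 1000)
print(ti_tv_ratio(snps))  # → 2.05
```

A ratio drifting toward 0.5 (the random expectation, since there are twice as many possible transversions) is the classic sign of systematic error.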
Chapter 2: Variant Annotation — Multi-Database Integration¶
2.1 ClinVar: Clinical Variant Classification¶
ClinVar (NCBI) is a freely accessible archive of relationships between human variants and phenotypes. The HCLS AI Factory integrates the February 2026 release containing 4.1 million variant-condition records.
Classification system (ACMG/AMP):
- Pathogenic (P): Strong evidence of disease causation
- Likely Pathogenic (LP): Moderate evidence
- Variant of Uncertain Significance (VUS): Insufficient evidence
- Likely Benign (LB): Moderate evidence against pathogenicity
- Benign (B): Strong evidence against pathogenicity
Review status tiers: ClinVar classifies assertion confidence using star ratings (0-4 stars). The pipeline weights variants with ≥2 stars (multiple submitters with concordant interpretations) more heavily.
Annotation performance: Of ~3.5M QUAL>30 variants, approximately 35,616 (1.0%) match ClinVar entries. The low match rate reflects that most variants in a healthy individual are common polymorphisms not represented in a clinical database focused on rare disease.
2.2 AlphaMissense: AI Pathogenicity Prediction¶
AlphaMissense (Cheng et al., Science 2023) predicts the pathogenicity of all possible human missense variants using features derived from AlphaFold protein structure predictions and evolutionary conservation.
Model architecture:
- Input features: amino acid sequence context, evolutionary conservation (from MSA), and structural features from AlphaFold
- Output: pathogenicity score (0-1, continuous)
- Total predictions: 71,697,560 unique missense variants
Calibrated thresholds:
- Pathogenic: >0.564 (90% precision on ClinVar pathogenic set)
- Ambiguous: 0.34-0.564
- Benign: <0.34 (90% precision on ClinVar benign set)
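These thresholds translate directly into a classification helper. A sketch using the published cutoffs; the function name is hypothetical, not from the pipeline:

```python
def classify_alphamissense(score):
    """Map an AlphaMissense pathogenicity score (0-1) to its calibrated class.
    Cutoffs follow the published calibration: >0.564 likely pathogenic,
    <0.34 likely benign, ambiguous in between."""
    if score > 0.564:
        return "likely_pathogenic"
    if score < 0.34:
        return "likely_benign"
    return "ambiguous"

print(classify_alphamissense(0.91))  # → likely_pathogenic
print(classify_alphamissense(0.45))  # → ambiguous
print(classify_alphamissense(0.10))  # → likely_benign
```

Note that a score exactly at a cutoff falls into the ambiguous zone, which is the conservative choice for clinical triage.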
Critical limitation: AlphaMissense only predicts missense variant effects. Stop-gain, frameshift, splice site, and non-coding variants require other prediction tools. The pipeline uses VEP for functional consequence annotation to complement AlphaMissense.
2.3 Ensembl VEP: Functional Consequence Prediction¶
The Variant Effect Predictor maps variants to genes, transcripts, and regulatory regions, annotating each with standardized Sequence Ontology (SO) terms.
Impact classification:
| Impact Level | Example Consequences | Typical Action |
|---|---|---|
| HIGH | stop_gained, frameshift_variant, splice_donor_variant | Likely loss of function |
| MODERATE | missense_variant, inframe_deletion | Protein function may change |
| LOW | synonymous_variant, splice_region_variant | Unlikely to affect protein |
| MODIFIER | intron_variant, upstream_gene_variant | Non-coding effects |
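Impact-level filtering can be sketched as a rank lookup over the four VEP tiers. The variant record shape here is a hypothetical simplification:

```python
# VEP impact tiers, ordered from most to least severe
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}

def filter_by_impact(variants, min_impact="MODERATE"):
    """Keep variants at or above a minimum VEP impact level.
    `variants` is a list of dicts with an 'impact' key (illustrative shape)."""
    threshold = IMPACT_RANK[min_impact]
    return [v for v in variants if IMPACT_RANK[v["impact"]] >= threshold]

calls = [
    {"gene": "VCP", "impact": "HIGH"},
    {"gene": "GRN", "impact": "MODERATE"},
    {"gene": "MAPT", "impact": "MODIFIER"},
]
print(filter_by_impact(calls))  # keeps VCP and GRN, drops the intronic MAPT call
```

The same rank ordering underlies scalar filters like impact=HIGH in the retrieval stage.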
2.4 The Annotation Pipeline Architecture¶
The three annotation databases are applied sequentially in annotator.py (23 KB):
VCF (11.7M variants)
→ parse_vcf(min_qual=30) → 3.5M variants
→ annotate_clinvar() → Clinical significance
→ annotate_alphamissense() → AI pathogenicity scores
→ annotate_vep() → Functional consequences
→ generate_text_summary() → Natural language descriptions
→ embed_variants() → 384-dim BGE embeddings
→ index_in_milvus() → Searchable vector database
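The chain above can be sketched as a single pass over parsed variants. This is a simplified stand-in for annotator.py; the database shapes, keys, and field names are assumptions, not the module's actual API:

```python
def annotate_variants(variants, clinvar, alphamissense):
    """Sketch of the sequential annotation chain: quality filter, then
    database lookups, then a text summary ready for embedding.
    Databases are modeled as dicts keyed by (chrom, pos, ref, alt)."""
    out = []
    for v in variants:
        if v["qual"] <= 30:                               # parse_vcf(min_qual=30)
            continue
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        v["clinical_significance"] = clinvar.get(key)     # annotate_clinvar()
        v["am_pathogenicity"] = alphamissense.get(key)    # annotate_alphamissense()
        v["text_summary"] = (                             # generate_text_summary()
            f"{v['ref']}>{v['alt']} at {v['chrom']}:{v['pos']} "
            f"({v['clinical_significance'] or 'not in ClinVar'})"
        )
        out.append(v)
    return out

# Illustrative variant and ClinVar entry (coordinates are made up)
variants = [{"chrom": "chr9", "pos": 35065263, "ref": "G", "alt": "A", "qual": 50.0}]
clinvar = {("chr9", 35065263, "G", "A"): "Pathogenic"}
print(annotate_variants(variants, clinvar, {}))
```

The resulting text_summary strings are what get embedded into 384-dim vectors and indexed in Milvus.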
Chapter 3: Vector Database Architecture — Milvus and RAG¶
3.1 Milvus Schema Design¶
The genomic_evidence collection in Milvus 2.4 uses a 17-field schema designed to support both vector similarity search and scalar filtering:
| Field | Type | Rationale |
|---|---|---|
| id | INT64 (PK, auto) | Milvus-managed primary key |
| embedding | FLOAT_VECTOR(384) | Semantic search vector |
| chrom | VARCHAR(10) | Genomic coordinate filtering |
| pos | INT64 | Positional queries |
| ref/alt | VARCHAR(1000) | Allele matching |
| qual | FLOAT | Quality score filtering |
| gene | VARCHAR(100) | Gene-level queries |
| consequence | VARCHAR(200) | Functional filtering (e.g., missense only) |
| impact | VARCHAR(20) | Impact level filtering |
| genotype | VARCHAR(10) | Zygosity queries |
| text_summary | VARCHAR(2000) | Human-readable context for RAG |
| clinical_significance | VARCHAR(200) | ClinVar classification |
| rsid | VARCHAR(20) | dbSNP lookup |
| disease_associations | VARCHAR(2000) | Disease context for RAG |
| am_pathogenicity | FLOAT | AlphaMissense score filtering |
| am_class | VARCHAR(20) | Pathogenicity class filtering |
3.2 Index Configuration and Performance¶
Index type: IVF_FLAT (Inverted File with Flat vectors)
- Why IVF_FLAT? At 3.5M vectors with 384 dimensions, IVF_FLAT provides the best recall-latency tradeoff. HNSW would use more memory; IVF_PQ would sacrifice recall.
- nlist=1024: Partitions vectors into 1024 clusters. Each query searches the 16 nearest clusters (nprobe=16), examining ~55K vectors.
- Metric: COSINE similarity (normalized dot product)
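The nprobe arithmetic can be checked directly. The INDEX_PARAMS dict below mirrors the shape pymilvus expects and is included for reference only; the calculation itself needs no Milvus installation:

```python
# Index parameters in the dict shape used by pymilvus create_index calls
INDEX_PARAMS = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 1024},
}

def vectors_examined(n_vectors, nlist, nprobe):
    """Approximate vectors scanned per IVF_FLAT query, assuming clusters
    of roughly equal size: (n_vectors / nlist) * nprobe."""
    return n_vectors // nlist * nprobe

# ~55K of 3.5M vectors scanned per query, matching the figure quoted above
print(vectors_examined(3_500_000, 1024, 16))  # → 54672
```

Raising nprobe trades latency for recall linearly, which is why the recall@20 figure below is quoted at a fixed nprobe=16.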
Search performance:
| Metric | Value |
|---|---|
| Index build time | ~8 minutes (3.5M × 384-dim) |
| Index memory | ~2 GB |
| Search latency (nprobe=16) | 8-15 ms |
| Recall@20 | >95% |
3.3 RAG Architecture with Claude¶
The RAG pipeline in rag_engine.py (23 KB) implements a multi-stage retrieval strategy:
1. Query expansion: User queries are enriched using 10 therapeutic area keyword maps. For example, a query about "neurodegeneration" is expanded with terms like "frontotemporal dementia," "ALS," "motor neuron," "tau protein."
2. Hybrid retrieval: The expanded query is embedded and used for vector search (top_k=20). Results are optionally filtered by scalar fields (e.g., impact=HIGH, am_class=pathogenic).
3. Context assembly: Retrieved variants are formatted into structured context.
4. Claude inference: The assembled context + knowledge base + user query are sent to claude-sonnet-4-20250514 (temperature=0.3, max_tokens=4096).
Why temperature=0.3? Lower temperature produces more deterministic, factual responses. For clinical genomics, hallucination is dangerous — the model should report only what the evidence supports.
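The query-expansion stage can be sketched as a keyword-map lookup. The map contents here are illustrative guesses at the shape of rag_engine.py's tables, not their actual entries:

```python
# Hypothetical slice of the 10 therapeutic-area keyword maps
THERAPEUTIC_AREAS = {
    "neurodegeneration": [
        "frontotemporal dementia", "ALS", "motor neuron", "tau protein",
    ],
}

def expand_query(query):
    """Append mapped keywords when a therapeutic-area term appears in the
    query; the expanded string is what gets embedded for vector search."""
    terms = [query]
    for area, keywords in THERAPEUTIC_AREAS.items():
        if area in query.lower():
            terms.extend(keywords)
    return " ".join(terms)

print(expand_query("variants linked to neurodegeneration"))
print(expand_query("BRCA1 variants"))  # no map hit: query passes through unchanged
```

Expansion happens before embedding, so the extra terms shift the query vector toward the disease-specific region of the embedding space.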
Chapter 4: Drug Discovery Pipeline — Deep Dive¶
4.1 The 10-Stage Architecture¶
The drug discovery pipeline in pipeline.py (18 KB) implements a sequential 10-stage workflow:
| Stage | Module | Key Algorithm |
|---|---|---|
| 1. Initialize | pipeline.py | Pydantic model validation |
| 2. Normalize Target | pipeline.py | Gene → UniProt → PDB mapping |
| 3. Structure Discovery | cryoem_evidence.py | RCSB PDB REST API query |
| 4. Structure Preparation | cryoem_evidence.py | Multi-factor scoring |
| 5. Molecule Generation | nim_clients.py | MolMIM latent-space generation |
| 6. Chemistry QC | molecule_generator.py | RDKit valence/kekulization |
| 7. Conformer Generation | molecule_generator.py | RDKit ETKDG algorithm |
| 8. Molecular Docking | nim_clients.py | DiffDock diffusion inference |
| 9. Composite Ranking | pipeline.py | Weighted multi-objective |
| 10. Reporting | pipeline.py | ReportLab PDF generation |
4.2 Cryo-EM Structure Scoring¶
The cryoem_evidence.py (6 KB) module implements a multi-factor structure scoring algorithm:
score += max(0, 5.0 - resolution) # Resolution: 0-5 scale
if has_inhibitor_bound: score += 3.0 # Binding site defined
score += num_druggable_pockets * 0.5 # Pocket count bonus
if 'Cryo-EM' in method: score += 0.5 # Method bonus
Design rationale:
- Resolution is the primary factor (0-5 scale, where lower resolution values yield higher scores). The 5 Å cutoff excludes low-resolution structures unsuitable for docking.
- Inhibitor-bound structures receive a +3 bonus because they provide a pre-defined binding site and reference ligand geometry.
- The Cryo-EM method bonus reflects the growing prevalence and quality of Cryo-EM structures for drug targets.
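The scoring snippet above, restated as one runnable function (parameter names are illustrative, matching the fields referenced in the snippet rather than the module's exact signature):

```python
def score_structure(resolution, has_inhibitor_bound, num_druggable_pockets, method):
    """Multi-factor structure score, as sketched above.
    resolution in Angstroms; method is the experimental method string."""
    score = max(0.0, 5.0 - resolution)        # resolution term, 0-5 scale
    if has_inhibitor_bound:
        score += 3.0                          # pre-defined binding site
    score += num_druggable_pockets * 0.5      # pocket count bonus
    if "Cryo-EM" in method:
        score += 0.5                          # method bonus
    return score

# A 2.3 Å inhibitor-bound Cryo-EM structure with two druggable pockets
print(score_structure(2.3, True, 2, "Cryo-EM"))   # → 7.2
# A 6 Å apo X-ray structure scores zero: unusable for docking
print(score_structure(6.0, False, 0, "X-ray"))    # → 0.0
```

Because the resolution term is clamped at zero, a poor structure cannot be rescued by pocket or method bonuses alone unless an inhibitor is bound.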
4.3 MolMIM: Latent-Space Molecular Generation¶
MolMIM (NVIDIA BioNeMo) is a probabilistic autoencoder built on the Mutual Information Machine (MIM) framework, trained on molecular SMILES strings. Given a seed molecule, it:
- Encodes the seed SMILES into a fixed-length latent vector
- Samples or perturbs points in the latent neighborhood of the seed, optionally steering toward property objectives (e.g., QED) with CMA-ES
- Decodes each sampled latent point back into a SMILES string
- The decoded strings are novel molecular structures related to the seed
Critical considerations:
- MolMIM generates SMILES strings, not 3D structures. Chemical validity must be verified by RDKit.
- The generation process is stochastic — different random seeds produce different molecules.
- Temperature controls diversity: higher temperature = more diverse but potentially less valid molecules.
4.4 DiffDock: Diffusion-Based Molecular Docking¶
DiffDock (Corso et al., ICLR 2023) models molecular docking as a generative diffusion process over the product space of rotations, translations, and torsion angles.
Key innovation: Unlike grid-based docking methods (AutoDock Vina, Glide), DiffDock does not require a pre-defined search box around a binding site. It learns to predict binding poses directly from protein-ligand pairs, making it suitable for blind docking.
Score interpretation:
- The confidence score (0-1) indicates the model's certainty about the predicted pose
- The binding affinity score (kcal/mol) estimates the free energy of binding
- More negative affinity scores indicate stronger binding
Limitations:
- DiffDock was trained primarily on crystal structures; performance may degrade on Cryo-EM structures with lower resolution
- The model predicts pose and affinity but not binding kinetics (on/off rates)
- Protein flexibility is not modeled — the protein is treated as rigid
4.5 Composite Scoring and Normalization¶
The composite scoring formula balances three objectives:
dock_normalized = max(0.0, min(1.0, (10.0 + dock_score) / 20.0))
composite = 0.30 * gen_score + 0.40 * dock_normalized + 0.30 * qed_score
Normalization rationale:
- Docking scores typically fall between ~-15 and ~0 kcal/mol. The formula (10 + dock) / 20 maps -10 kcal/mol to 0.0 and +10 kcal/mol to 1.0; the min/max guards clamp anything outside that window, so scores below -10 kcal/mol all normalize to 0.0.
- Generation scores are already 0-1 (MolMIM confidence).
- QED scores are inherently 0-1.
Weight rationale:
- Docking receives the highest weight (40%) because binding affinity is the most direct predictor of therapeutic activity
- Generation and QED each receive 30% — balancing novelty against practical drug-likeness
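The normalization and weighting above can be restated as one runnable function:

```python
def composite_score(gen_score, dock_score, qed_score):
    """Composite ranking: normalize the docking score from roughly
    [-10, +10] kcal/mol into [0, 1] (clamped), then take the
    30/40/30 weighted sum. gen_score and qed_score are already in [0, 1]."""
    dock_normalized = max(0.0, min(1.0, (10.0 + dock_score) / 20.0))
    return 0.30 * gen_score + 0.40 * dock_normalized + 0.30 * qed_score

# A strong binder (-8 kcal/mol) with good generation and QED scores:
# dock_normalized = (10 - 8) / 20 = 0.1
print(composite_score(0.8, -8.0, 0.7))  # → 0.49
```

Note the asymmetry: because dock_normalized for a realistic binder sits near 0.0-0.25, the 40% docking weight in practice contributes less to the composite than the 30% generation and QED terms, a property worth keeping in mind when interpreting rankings.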
Chapter 5: Nextflow DSL2 Pipeline Architecture¶
5.1 Module Design¶
The pipeline uses Nextflow DSL2's module system for composable workflow design:
hls-orchestrator/
├── main.nf # Entry point, mode routing
├── nextflow.config # Profiles, parameters
├── run_pipeline.py # Python CLI launcher
└── modules/
├── genomics.nf # Stage 1 processes
├── rag_chat.nf # Stage 2 processes
├── drug_discovery.nf # Stage 3 processes
└── reporting.nf # Report generation
5.2 Execution Modes and Data Flow¶
| Mode | Data Flow | Use Case |
|---|---|---|
| full | FASTQ → VCF → Target → Candidates | Complete pipeline |
| target | VCF → Target → Candidates | Pre-existing VCF |
| drug | Target → Candidates | Known gene target |
| demo | Pre-configured FASTQ → Candidates | VCP/FTD demonstration |
| genomics_only | FASTQ → VCF | Variant calling only |
5.3 Profile Configuration¶
The nextflow.config defines execution profiles optimized for different environments, including:
- dgx_spark: GPU resource requests, memory limits tuned for 128 GB unified memory
- docker: Docker container execution with GPU passthrough
- singularity: Singularity containers for HPC environments without Docker
- slurm: SLURM scheduler integration for cluster execution
Chapter 6: Clinical Translation and Limitations¶
6.1 From Computational Hits to Drug Leads¶
The HCLS AI Factory generates computational drug candidates — not approved medications. The path from computational hit to clinical drug requires:
- In vitro validation: Test top candidates in biochemical assays (e.g., VCP ATPase activity inhibition)
- Cell-based assays: Confirm activity in relevant cell lines
- ADMET profiling: Absorption, Distribution, Metabolism, Excretion, and Toxicity studies
- Lead optimization: Iterative cycles of design, synthesis, and testing
- Preclinical studies: Animal models for efficacy and safety
- Clinical trials: Phase I (safety), Phase II (efficacy), Phase III (large-scale)
Estimated timeline: 10-15 years from computational hit to approved drug. The HCLS AI Factory accelerates the earliest stage — computational lead generation — from months to minutes.
6.2 Limitations and Caveats¶
Genomics:
- DeepVariant accuracy varies by variant type (SNPs > indels > structural variants)
- Short-read WGS has limited sensitivity for structural variants and repeat expansions
- Population-specific biases in GRCh38 may affect variant calling in non-European ancestries

RAG/Annotation:
- ClinVar has known biases toward well-studied genes and European ancestry variants
- AlphaMissense is limited to missense variants; non-coding variants are not scored
- The 201-gene knowledge base covers common drug targets but not the full druggable genome

Drug Discovery:
- MolMIM-generated molecules have not been synthesized or tested
- DiffDock docking scores are predictions, not experimental measurements
- Protein flexibility is not modeled; induced-fit effects are ignored
- The composite scoring weights (30/40/30) are heuristic, not optimized on clinical outcomes
6.3 Ethical Considerations¶
- Informed consent: Patient genomic data requires explicit consent for research use
- Data sovereignty: NVIDIA FLARE federated learning keeps data local; essential for HIPAA/GDPR compliance
- Return of results: Incidental findings (e.g., BRCA1 pathogenic variants) may require clinical reporting
- Equity: Pipeline performance should be validated across diverse ancestries to avoid exacerbating health disparities
Chapter 7: Scaling Analysis¶
7.1 DGX Spark Bottleneck Analysis¶
| Component | Bottleneck | Phase 1 Impact |
|---|---|---|
| Parabricks (fq2bam) | GPU compute | 20-45 min, acceptable |
| DeepVariant | GPU memory (60 GB peak) | Leaves 68 GB for other tasks |
| Milvus indexing | CPU + I/O | 24 min for 3.5M vectors |
| MolMIM inference | GPU compute | 2 min for 100 molecules |
| DiffDock inference | GPU compute + memory | 8 min for 98 candidates |
| Sequential total | GPU time-sharing | ~4 hours end-to-end |
7.2 Phase 2: DGX B200 Scaling¶
With 8× B200 GPUs and 1-2 TB HBM3e:
- Parallel Parabricks: 4-8 simultaneous samples
- Dedicated Milvus GPU: GPU-accelerated vector search (sub-millisecond)
- NIM replicas: 2-4 MolMIM + 2-4 DiffDock instances
- Estimated throughput: 10-20 patients per day
7.3 Phase 3: DGX SuperPOD¶
- Hundreds of B200 GPUs with NVLink and InfiniBand
- Distributed Milvus cluster: Billions of variants across institutions
- NVIDIA FLARE: Federated model training without data sharing
- Estimated throughput: Thousands of patients per day
Chapter 8: Advanced Topics and Extensions¶
8.1 Alternative Embedding Strategies¶
BGE-small-en-v1.5 (384-dim) was chosen for its balance of quality and efficiency. Alternatives:
| Model | Dimensions | Size | Trade-off |
|---|---|---|---|
| BGE-small-en-v1.5 | 384 | 33M params | Current choice: fast, efficient |
| BGE-base-en-v1.5 | 768 | 109M params | Better recall, 2× memory |
| BGE-large-en-v1.5 | 1024 | 335M params | Best recall, 3× memory |
| BiomedBERT | 768 | 109M params | Domain-specific, biomedical text |
| PubMedBERT | 768 | 109M params | PubMed-trained, clinical text |
8.2 Multi-Objective Optimization¶
The current composite scoring uses fixed weights (30/40/30). Advanced approaches:
- Pareto optimization: Identify the Pareto frontier of generation, docking, and QED
- Bayesian optimization: Learn optimal weights from experimental feedback
- Active learning: Prioritize candidates that reduce uncertainty in the scoring model
8.3 Long-Read Sequencing Integration¶
Oxford Nanopore and PacBio long-read technologies can detect structural variants (SVs) and repeat expansions that short-read WGS misses. Future extensions could:
- Add ONT/PacBio alignment with minimap2
- Detect SVs with Sniffles2 or PEPPER-Margin-DeepVariant
- Phase haplotypes for compound heterozygosity detection
8.4 Pharmacogenomics Integration¶
The knowledge base includes 11 pharmacogenomics genes (CYP2D6, CYP2C19, CYP3A4, DPYD, TPMT, etc.). Future extensions could add:
- Star allele calling with PharmCAT
- Drug-drug interaction prediction
- Dosing recommendations based on metabolizer status
Review Questions¶
1. Explain why DeepVariant reframes variant calling as an image classification problem. What advantages does this provide over statistical models like GATK HaplotypeCaller?
2. Describe the IVF_FLAT index configuration (nlist=1024, nprobe=16) and calculate the approximate number of vectors examined per query from a collection of 3.5M vectors.
3. Why does the RAG pipeline use temperature=0.3 for Claude? What are the trade-offs of lower vs. higher temperature in clinical genomics applications?
4. Explain the docking score normalization formula max(0, min(1, (10 + dock_score) / 20)). What docking score maps to 0.5? Why is this mapping appropriate?
5. Compare the AlphaMissense thresholds (pathogenic >0.564, benign <0.34) with ClinVar classifications. What does the "ambiguous" zone represent, and why is it clinically significant?
6. Describe three limitations of DiffDock that could affect the reliability of the drug candidate rankings.
7. Explain why inhibitor-bound PDB structures (like 5FTK) receive a +3 bonus in the structure scoring algorithm. What information does an inhibitor-bound structure provide that an apo structure does not?
8. Design a validation experiment to test the top 5 drug candidates from the VCP/FTD demo pipeline. What assays would you use, and what would constitute a positive result?
9. Calculate the approximate memory budget for the Milvus vector index: 3.5M vectors × 384 dimensions × 4 bytes per float. How does this compare to the available memory on DGX Spark?
10. The composite scoring weights (30% generation, 40% docking, 30% QED) are heuristic. Propose an approach to optimize these weights using experimental feedback from in vitro screening.