
HCLS AI Factory White Paper — Open Architecture on NVIDIA DGX Spark

From Patient DNA to Novel Drug Candidates in Under Five Hours

License: Apache 2.0 | Author: Adam Jones | Date: February 2026


1. Executive Summary

The HCLS AI Factory follows a reusable pattern: identify a canonical artifact, build a persistent data model around it, and let agentic workflows operate continuously on that model. In genomics, the canonical artifact is the VCF (Variant Call Format file) — a structured record of every genomic variant identified in a patient's DNA. This white paper describes an end-to-end platform that processes raw DNA sequencing data through GPU-accelerated variant calling, RAG-grounded clinical reasoning, and AI-driven drug discovery — all on a single NVIDIA DGX Spark desktop workstation.

The platform transforms patient FASTQ files (~200 GB of raw sequencing data from a 30× whole-genome study) into 100 ranked novel drug candidates in under 5 hours. Three stages execute sequentially: NVIDIA Parabricks performs GPU-accelerated alignment and variant calling (120-240 min), producing ~11.7 million variants. A RAG pipeline annotates variants with ClinVar, AlphaMissense, and VEP, indexes 3.5 million high-quality variants in a Milvus vector database, and uses Anthropic Claude to identify druggable gene targets. Finally, BioNeMo NIM services (MolMIM and DiffDock) generate novel molecules, predict binding affinities, and rank candidates by a composite drug-likeness score.

The architecture is designed to run end-to-end on a $3,999 DGX Spark for proof builds and scale to DGX SuperPOD for enterprise deployments. Every component is either open-source or NVIDIA-licensed, and the HCLS AI Factory platform code itself is released under Apache 2.0.


2. The Precision Medicine Data Challenge

Precision medicine promises therapies tailored to an individual's genetic profile. A single 30× whole-genome sequencing (WGS) run produces approximately 200 GB of raw data and 11.7 million genomic variants. The challenge is not generating this data — modern sequencers produce it reliably — but transforming it into actionable therapeutic hypotheses within a clinically relevant timeframe.

The Limits of Traditional Bioinformatics

Today's genomic analysis pipelines assemble disconnected components: CPU-based alignment tools that take 12-24 hours, separate variant callers, annotation databases accessed through web APIs, and manual literature review for target identification. Each step must be independently provisioned, and the gap between "variants identified" and "drug target nominated" is filled by months of manual curation.

This sequential, manual approach introduces three structural problems:

  • Compute bottleneck. CPU-based BWA-MEM alignment of a 30× WGS sample takes 12-24 hours on a 32-core server. DeepVariant on CPU adds another 8-12 hours. The genomics stage alone consumes 1-2 days of wall time.
  • Annotation fragmentation. Clinical variant databases (ClinVar), AI pathogenicity predictors (AlphaMissense), and functional annotation tools (VEP) exist as separate resources. Integrating them requires bespoke ETL pipelines that break with each database update.
  • Target-to-drug gap. Even after identifying a pathogenic variant in a druggable gene, the path to a lead compound requires separate molecular modeling tools, docking servers, and medicinal chemistry expertise — typically a months-long process.

The GPU-Accelerated Opportunity

NVIDIA DGX Spark collapses the compute bottleneck. Its GB10 GPU accelerates BWA-MEM2 alignment from hours to 20-45 minutes and DeepVariant variant calling from hours to 10-35 minutes — a 10-20× speedup that transforms genomics from an overnight batch job into an interactive workflow.

More importantly, the same GPU that runs genomics can run vector similarity search (Milvus), molecular generation (MolMIM), and molecular docking (DiffDock). A single $3,999 desktop workstation handles the entire pipeline.


3. Architecture Overview

Three-Stage Pipeline

The HCLS AI Factory processes data through three sequential stages:

| Stage | Technology | Duration | Input | Output |
| --- | --- | --- | --- | --- |
| 1 — Genomics | Parabricks 4.6 (BWA-MEM2 + DeepVariant) | 120-240 min | FASTQ (~200 GB) | VCF (~11.7M variants) |
| 2 — RAG/Chat | Milvus + BGE + Claude | Interactive | VCF | Target gene + evidence |
| 3 — Drug Discovery | MolMIM + DiffDock + RDKit | 8-16 min | Target gene | 100 ranked drug candidates |

Technology Stack

| Layer | Components |
| --- | --- |
| Compute | NVIDIA DGX Spark (GB10 Grace Blackwell Superchip: Blackwell GPU, 20-core Arm CPU, 128 GB unified memory) |
| Genomics | NVIDIA Parabricks 4.6.0-1, GRCh38 reference genome |
| Annotation | ClinVar (4.1M variants), AlphaMissense (71M predictions), Ensembl VEP |
| Vector DB | Milvus 2.4, BGE-small-en-v1.5 (384-dim), IVF_FLAT index |
| LLM | Anthropic Claude (claude-sonnet-4-20250514, temperature=0.3) |
| Drug Discovery | BioNeMo MolMIM, BioNeMo DiffDock, RDKit |
| Orchestration | Nextflow DSL2, Docker Compose |
| Monitoring | Grafana, Prometheus, DCGM Exporter |

Service Architecture

The platform runs 14 services across 14 ports:

  • Orchestration: Landing page (8080) with 10-service health monitor
  • Stage 1: Genomics portal (5000)
  • Stage 2: Milvus (19530), Attu UI (8000), RAG API (5001), Streamlit Chat (8501)
  • Stage 3: MolMIM NIM (8001), DiffDock NIM (8002), Discovery UI (8505), Portal (8510)
  • Monitoring: Grafana (3000), Prometheus (9099), Node Exporter (9100), DCGM (9400)
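
For quick verification that the stack is up, a minimal health-check sketch follows. It assumes each service answers plain HTTP on localhost at the ports listed above; the specific health-check paths are not documented here, so the script only probes the root path.

```python
import requests

# Port map taken from the service list above (a representative subset).
SERVICES = {
    "landing_page": 8080,
    "genomics_portal": 5000,
    "attu_ui": 8000,
    "rag_api": 5001,
    "chat_ui": 8501,
    "molmim_nim": 8001,
    "diffdock_nim": 8002,
    "discovery_ui": 8505,
    "grafana": 3000,
    "prometheus": 9099,
}

def check(port: int) -> str:
    """Return 'up' if anything answers HTTP on the port, else 'down'."""
    try:
        requests.get(f"http://localhost:{port}/", timeout=2)
        return "up"
    except requests.RequestException:
        return "down"

if __name__ == "__main__":
    for name, port in SERVICES.items():
        print(f"{name:16} :{port:<6} {check(port)}")
```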

4. Stage 1 — GPU-Accelerated Genomics

NVIDIA Parabricks 4.6

Parabricks provides GPU-accelerated implementations of standard bioinformatics tools. On DGX Spark's GB10 GPU, it delivers 10-20× speedup over CPU implementations.

BWA-MEM2 Alignment (fq2bam): Aligns paired-end reads against the GRCh38 reference genome. GPU-accelerated implementation achieves 70-90% GPU utilization, producing a sorted BAM file with index in 20-45 minutes.

Google DeepVariant: A CNN-based variant caller that achieves >99% accuracy on the GIAB truth set. GPU-accelerated implementation achieves 80-95% GPU utilization, calling variants in 10-35 minutes. The deep learning approach outperforms traditional statistical callers (GATK HaplotypeCaller) on both SNPs and indels.
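
In practice the two steps reduce to two `pbrun` invocations. The sketch below wraps them in Python via subprocess; file paths and output names are illustrative, and tuning flags (threads, GPU selection, known-sites files) are omitted.

```python
import subprocess

# Illustrative paths; in the real pipeline these come from Nextflow parameters.
REF = "refs/GRCh38.fa"
FQ1, FQ2 = "HG002_R1.fastq.gz", "HG002_R2.fastq.gz"

def run(cmd: list[str]) -> None:
    """Run one Parabricks step and fail loudly on a non-zero exit code."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 1: GPU-accelerated BWA-MEM2 alignment -> sorted, indexed BAM.
run(["pbrun", "fq2bam",
     "--ref", REF,
     "--in-fq", FQ1, FQ2,
     "--out-bam", "HG002.bam"])

# Step 2: GPU-accelerated DeepVariant -> VCF of called variants.
run(["pbrun", "deepvariant",
     "--ref", REF,
     "--in-bam", "HG002.bam",
     "--out-variants", "HG002.vcf"])
```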

Input: HG002 Reference Standard

The platform uses HG002 from the Genome in a Bottle (GIAB) Consortium — an Ashkenazi Jewish male sample with extensively validated truth sets. This enables rigorous accuracy benchmarking: every called variant can be compared against the ground truth.

  • Sample: HG002 (NA24385)
  • Coverage: 30× WGS
  • Read length: 2×250 bp paired-end
  • File size: ~200 GB (FASTQ pair)

Output: Variant Call Format (VCF)

The VCF contains ~11.7 million variant records, including ~4.2 million SNPs, ~1.0 million indels, and ~150,000 multi-allelic sites. Of these, ~3.5 million pass the quality filter (QUAL > 30), with ~35,000 in coding regions.
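
To illustrate the quality filter, here is a minimal pysam sketch that tallies records passing QUAL > 30 (the input path is illustrative; any Stage 1 VCF works):

```python
import pysam

# Illustrative path to the Stage 1 output VCF.
vcf = pysam.VariantFile("HG002.vcf")

total = passing = 0
for rec in vcf:
    total += 1
    # rec.qual is the Phred-scaled QUAL column (None when absent).
    if rec.qual is not None and rec.qual > 30:
        passing += 1

print(f"{passing:,} of {total:,} variants pass QUAL > 30")
```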


5. Stage 2 — RAG-Grounded Target Identification

Variant Annotation

Stage 2 begins by annotating the 3.5 million high-quality variants with three complementary databases:

ClinVar (NCBI): 4.1 million clinical variant records mapping genomic positions to clinical significance classifications (Pathogenic, Likely pathogenic, VUS, Likely benign, Benign). In the HG002 demonstration run, 35,616 patient variants match ClinVar entries.

AlphaMissense (DeepMind): 71,697,560 AI-predicted pathogenicity scores for missense variants, derived from AlphaFold protein structure features. Thresholds: pathogenic >0.564, ambiguous 0.34-0.564, benign <0.34. Of the ClinVar-matched variants, 6,831 also carry AlphaMissense predictions.

Ensembl VEP: Functional consequence annotation mapping variants to genes, transcripts, and impact levels (HIGH, MODERATE, LOW, MODIFIER). Identifies missense variants, stop gains, frameshift variants, and splice site disruptions.
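
Conceptually, the three sources are joined on the variant key (chromosome, position, ref, alt). Below is a minimal sketch, with illustrative file layouts and column names (the production loaders handle the real ClinVar/AlphaMissense release formats):

```python
import csv

Key = tuple[str, int, str, str]  # (chrom, pos, ref, alt)

def load_clinvar(path: str) -> dict[Key, str]:
    """Map variant key -> clinical significance (column names are illustrative)."""
    table = {}
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
            table[key] = row["clinical_significance"]
    return table

def annotate(variants: list[dict], clinvar: dict[Key, str],
             alphamissense: dict[Key, float]) -> list[dict]:
    """Attach ClinVar significance and AlphaMissense scores where keys match."""
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        v["clinvar_significance"] = clinvar.get(key)
        v["alphamissense_score"] = alphamissense.get(key)
    return variants
```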

Vector Embedding and Indexing

Each annotated variant is transformed into a text summary and embedded using BGE-small-en-v1.5 (384 dimensions). The 3.5 million embeddings are indexed in Milvus 2.4 using IVF_FLAT (nlist=1024, COSINE metric) with 17 structured fields per record.
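
A minimal indexing sketch using sentence-transformers and the pymilvus client is shown below. It keeps only the text summary and its embedding; the production collection carries the full 17-field schema and an explicit IVF_FLAT (nlist=1024) index rather than the quick-setup defaults used here.

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

client = MilvusClient(uri="http://localhost:19530")
model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # 384-dim embeddings

# Quick-setup collection with COSINE similarity (simplified vs. production).
client.create_collection(
    collection_name="variant_summaries",
    dimension=384,
    metric_type="COSINE",
)

# One illustrative variant summary (wording of the summaries is illustrative).
summaries = [
    "VCP chr9:35065263 G>A missense; ClinVar Pathogenic; AlphaMissense 0.87",
]
vectors = model.encode(summaries, normalize_embeddings=True)

client.insert(
    collection_name="variant_summaries",
    data=[{"id": i, "vector": vec.tolist(), "text": text}
          for i, (vec, text) in enumerate(zip(vectors, summaries))],
)
```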

RAG-Grounded Reasoning with Claude

User queries are expanded using 10 therapeutic area keyword maps, embedded, and used for approximate nearest-neighbor search in Milvus (top_k=20). Retrieved variant contexts are assembled into a RAG prompt and processed by Anthropic Claude (claude-sonnet-4-20250514, temperature=0.3).

Claude generates structured target hypotheses: gene name, confidence level, evidence chain (from variant to clinical significance to druggability), therapeutic area, and recommended action.
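
The retrieval-plus-reasoning loop can be sketched in a few lines with pymilvus and the Anthropic SDK; the collection name, field names, and prompt wording below are illustrative, while the model name and temperature follow the configuration above.

```python
from anthropic import Anthropic
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
milvus = MilvusClient(uri="http://localhost:19530")
claude = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

question = "Which druggable genes are supported by pathogenic neurodegeneration variants?"

# Retrieve the 20 nearest variant summaries from Milvus.
hits = milvus.search(
    collection_name="variant_summaries",
    data=[model.encode(question, normalize_embeddings=True).tolist()],
    limit=20,
    output_fields=["text"],
)
context = "\n".join(hit["entity"]["text"] for hit in hits[0])

# Ground Claude's answer in the retrieved variant evidence.
response = claude.messages.create(
    model="claude-sonnet-4-20250514",
    temperature=0.3,
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (f"Variant evidence:\n{context}\n\n"
                    f"Question: {question}\n"
                    "Propose a target gene with confidence, evidence chain, "
                    "therapeutic area, and recommended action."),
    }],
)
print(response.content[0].text)
```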

Knowledge Base: 201 Genes Across 13 Therapeutic Areas

The RAG pipeline is grounded by a curated knowledge base of 201 genes spanning 13 therapeutic areas: Neurology (36), Oncology (27), Metabolic (22), Infectious Disease (21), Respiratory (13), Rare Disease (12), Hematology (12), GI/Hepatology (12), Pharmacogenomics (11), Ophthalmology (11), Cardiovascular (10), Immunology (9), and Dermatology (9). Of these, 171 (85%) are classified as druggable targets.


6. Stage 3 — AI-Driven Drug Discovery

10-Stage Drug Discovery Pipeline

Stage 3 transforms a target gene hypothesis into 100 ranked novel drug candidates through a 10-stage automated pipeline:

  1. Initialize — Load target hypothesis, validate inputs
  2. Normalize Target — Map gene → UniProt ID → PDB structures
  3. Structure Discovery — Query RCSB PDB for Cryo-EM and X-ray structures
  4. Structure Preparation — Score structures by resolution, inhibitor presence, pocket count
  5. Molecule Generation — MolMIM NIM generates novel SMILES from seed compound
  6. Chemistry QC — RDKit validates chemical feasibility
  7. Conformer Generation — RDKit 3D conformer embedding (ETKDG)
  8. Molecular Docking — DiffDock NIM predicts binding poses and affinities
  9. Composite Ranking — Weighted scoring: 30% generation + 40% docking + 30% QED
  10. Reporting — PDF report generation via ReportLab

BioNeMo NIM Services

MolMIM (Port 8001): A masked language model for molecular generation. Given a seed compound's SMILES string, it generates structurally novel analogs by masking and regenerating molecular tokens. The model explores chemical space around the seed while maintaining chemical validity.

DiffDock (Port 8002): A score-based generative diffusion model for molecular docking. It predicts the 3D binding pose and affinity of a ligand in a protein binding site without requiring pre-defined binding pockets — a significant advantage over traditional grid-based docking methods.
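
Both NIMs are consumed over HTTP by the pipeline. The sketch below shows the general shape of a MolMIM generation request; the endpoint path and payload field names are assumptions made for illustration and should be checked against the deployed NIM's API reference.

```python
import requests

MOLMIM_URL = "http://localhost:8001/generate"  # hypothetical endpoint path

payload = {
    # Field names are illustrative, not the documented NIM schema.
    "smi": "CC(=O)Oc1ccccc1C(=O)O",  # placeholder seed; the demo seeds with CB-5083
    "num_molecules": 20,
}

resp = requests.post(MOLMIM_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # novel SMILES candidates to pass on to QC and docking
```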

Drug-Likeness Scoring

Each candidate is evaluated against Lipinski's Rule of Five (MW≤500, LogP≤5, HBD≤5, HBA≤10), Quantitative Estimate of Drug-likeness (QED >0.67 = drug-like), and Topological Polar Surface Area (TPSA <140 Ų for oral bioavailability).
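
These criteria map directly onto standard RDKit descriptors. A minimal sketch of the checks as described above:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, QED

def drug_likeness(smiles: str) -> dict:
    """Evaluate Lipinski's Rule of Five, QED, and TPSA for one candidate."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return {
        "lipinski_pass": (Descriptors.MolWt(mol) <= 500
                          and Crippen.MolLogP(mol) <= 5
                          and Lipinski.NumHDonors(mol) <= 5
                          and Lipinski.NumHAcceptors(mol) <= 10),
        "qed": QED.qed(mol),                     # >0.67 treated as drug-like
        "tpsa_ok": Descriptors.TPSA(mol) < 140,  # oral bioavailability heuristic
    }

print(drug_likeness("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a sanity check
```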

Composite Scoring

The final ranking uses a weighted composite: 30% MolMIM generation confidence, 40% DiffDock binding affinity (normalized: max(0, min(1, (10 + dock_score) / 20))), and 30% QED score. This balances novelty, binding prediction, and drug-likeness.
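
Written out, the composite score is a straightforward weighted sum (argument names are illustrative):

```python
def composite_score(gen_confidence: float, dock_score: float, qed: float) -> float:
    """Weighted composite: 30% generation confidence, 40% docking, 30% QED.

    dock_score is the raw DiffDock score; it is rescaled to [0, 1] with the
    normalization given above before weighting.
    """
    dock_norm = max(0.0, min(1.0, (10.0 + dock_score) / 20.0))
    return 0.30 * gen_confidence + 0.40 * dock_norm + 0.30 * qed

# Example usage for one candidate.
score = composite_score(gen_confidence=0.8, dock_score=-9.5, qed=0.72)
```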


7. VCP/FTD Demonstration

Target: Valosin-Containing Protein (VCP/p97)

The platform ships with a pre-configured demonstration targeting VCP — an AAA+ ATPase involved in the ubiquitin-proteasome pathway. Pathogenic VCP mutations cause frontotemporal dementia (FTD), ALS, and IBMPFD (inclusion body myopathy with Paget's disease of bone and frontotemporal dementia).

| Parameter | Value |
| --- | --- |
| Gene | VCP (UniProt P55072) |
| Variant | rs188935092 (chr9:35065263 G>A) |
| ClinVar | Pathogenic |
| AlphaMissense | 0.87 (pathogenic) |
| Seed Compound | CB-5083 (Phase I clinical VCP inhibitor) |
| PDB Structures | 8OOI, 9DIL, 7K56, 5FTK |
| Binding Site | D2 ATPase domain (~450 ų, druggability 0.92) |

Demo Results

The VCP/FTD demo produces 100 novel VCP inhibitor candidates. Typical results: 87 pass Lipinski's Rule of Five, 72 have QED >0.67, top 10 show docking scores from -8.2 to -11.4 kcal/mol, with composite scores ranging 0.68-0.89.


8. Cryo-EM Structure Evidence

Automated Structure Retrieval and Scoring

The drug discovery pipeline automatically queries RCSB PDB for protein structures and scores them by resolution (lower is better, 5 Å maximum), inhibitor presence (+3 bonus), druggable pocket count (+0.5 each), and experimental method (Cryo-EM +0.5 bonus).
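
A minimal sketch of that scoring rule follows. The bonuses come straight from the description above; the exact weighting of the resolution term (here simply 5 minus the resolution in Å) is an assumption for illustration.

```python
def score_structure(resolution_a: float, has_inhibitor: bool,
                    pocket_count: int, method: str) -> float:
    """Score a candidate PDB structure: lower resolution is better (5 Å cap),
    with bonuses for a bound inhibitor, druggable pockets, and Cryo-EM."""
    if resolution_a > 5.0:
        return 0.0                       # reject structures worse than 5 Å
    score = 5.0 - resolution_a           # assumed resolution weighting
    if has_inhibitor:
        score += 3.0                     # inhibitor-bound bonus
    score += 0.5 * pocket_count          # +0.5 per druggable pocket
    if method.lower() == "cryo-em":
        score += 0.5                     # Cryo-EM bonus
    return score

# 5FTK (2.3 Å X-ray, CB-5083 bound) vs. 8OOI (2.9 Å Cryo-EM, apo hexamer)
print(score_structure(2.3, True, 1, "X-ray"), score_structure(2.9, False, 1, "Cryo-EM"))
```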

For VCP, four structures are retrieved and scored:

| PDB | Resolution | Method | Key Feature |
| --- | --- | --- | --- |
| 5FTK | 2.3 Å | X-ray | CB-5083 inhibitor-bound (highest score) |
| 7K56 | 2.5 Å | Cryo-EM | VCP complex |
| 8OOI | 2.9 Å | Cryo-EM | WT VCP hexamer |
| 9DIL | 3.2 Å | Cryo-EM | Mutant VCP |

The inhibitor-bound structure (5FTK) is preferred because it provides a pre-defined binding site and a reference ligand for molecular generation.


9. Orchestration and Monitoring

Nextflow DSL2 Orchestration

The pipeline is orchestrated by Nextflow DSL2, supporting five execution modes:

  • full: End-to-end (Stages 1→2→3)
  • target: From existing VCF (Stages 2→3)
  • drug: Known target to drug candidates (Stage 3 only)
  • demo: Pre-configured VCP/FTD demonstration
  • genomics_only: Variant calling only (Stage 1)

Six execution profiles (standard, docker, singularity, dgx_spark, slurm, test) adapt the pipeline to different infrastructure.

Monitoring Stack

Grafana dashboards (port 3000) visualize GPU utilization, memory pressure, pipeline progress, and service health. Prometheus (port 9099) collects metrics from DCGM Exporter (port 9400) for GPU telemetry and Node Exporter (port 9100) for system resources. The landing page (port 8080) provides a 10-service health grid with real-time status indicators.
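
GPU telemetry can also be pulled directly from Prometheus's HTTP API using the DCGM exporter's utilization metric; a minimal sketch (the label set on each series depends on the exporter configuration):

```python
import requests

PROM_URL = "http://localhost:9099/api/v1/query"

# DCGM_FI_DEV_GPU_UTIL is the DCGM exporter's GPU-utilization gauge.
resp = requests.get(PROM_URL, params={"query": "DCGM_FI_DEV_GPU_UTIL"}, timeout=5)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    gpu = series["metric"].get("gpu", "0")
    _, value = series["value"]
    print(f"GPU {gpu}: {value}% utilization")
```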


10. Cross-Modal Integration

Imaging Intelligence Agent Integration

The HCLS AI Factory ecosystem includes an Imaging Intelligence Agent for CT, MRI, and X-ray analysis. Cross-modal triggers connect imaging findings to genomic analysis:

  • Imaging → Genomics: A Lung-RADS 4B+ finding triggers a FHIR ServiceRequest for tumor gene profiling via Parabricks
  • Genomics → Drug Discovery: Pathogenic variants in druggable genes trigger targeted molecule generation
  • Drug Discovery → Imaging: Drug candidates are combined with imaging findings in integrated clinical reports

NVIDIA FLARE for Federated Learning

Phase 3 deployments use NVIDIA FLARE for federated learning across institutions. Models train locally; only gradient updates are shared. Patient genomic data never leaves the originating institution.


11. Deployment Roadmap

Three-Phase Scaling

| Phase | Hardware | Orchestration | Scale |
| --- | --- | --- | --- |
| 1 — Proof Build | DGX Spark ($3,999) | Docker Compose | Single patient, sequential |
| 2 — Departmental | 1-2× DGX B200 ($500K-$1M) | Kubernetes | Multiple concurrent patients |
| 3 — Enterprise | DGX SuperPOD ($7M-$60M+) | Kubernetes + FLARE | Thousands concurrent, federated |

Phase 1: DGX Spark Proof Build

A single DGX Spark runs the complete pipeline: GB10 GPU handles Parabricks, Milvus, MolMIM, and DiffDock sequentially. Docker Compose manages all 14 services. The 128 GB unified memory accommodates all stages without swapping. Total cost: $3,999 hardware + API keys (Anthropic, NGC).

Phase 2: Departmental Scale

DGX B200 systems (8× B200 GPUs, 1-2 TB HBM3e) enable parallel processing of multiple patients, GPU-dedicated Milvus instances for sub-millisecond search, and multiple BioNeMo NIM replicas for parallel drug discovery. Kubernetes replaces Docker Compose for production orchestration.

Phase 3: Enterprise / Multi-Site

DGX SuperPOD deployments with InfiniBand fabric, NVIDIA FLARE for federated learning, and institutional-scale variant databases. Thousands of patients processed concurrently, with models improving through cross-institutional collaboration while maintaining data sovereignty.


12. Conclusion

The HCLS AI Factory demonstrates that the full precision medicine pipeline — from raw DNA to novel drug candidates — can run on a single desktop workstation. GPU acceleration collapses genomics from days to hours. Vector databases and LLM reasoning transform annotation from manual curation to interactive exploration. Generative chemistry and molecular docking automate the target-to-lead transition that traditionally takes months.

The three-stage architecture (Genomics → RAG/Chat → Drug Discovery) provides a reproducible, auditable, and scalable framework. The same Nextflow pipelines that run on a $3,999 DGX Spark scale to DGX SuperPOD for enterprise deployments. Every component is open-source or NVIDIA-licensed, and the platform code is released under Apache 2.0.

This is precision medicine as a continuous, computable workflow — not a disconnected collection of tools, but an integrated factory that transforms patient data into therapeutic hypotheses in a single session.


HCLS AI Factory — Apache 2.0 | Author: Adam Jones | February 2026