Pipeline Report
Executive Summary
The HCLS AI Factory is an end-to-end precision medicine platform that transforms raw patient DNA sequencing data into novel drug candidates in about 5 hours. Built on NVIDIA's accelerated computing stack and powered by advanced AI, it reduces what traditionally takes months of manual analysis to a streamlined, GPU-accelerated workflow.
Key Achievements
| Metric | Value | Impact |
| --- | --- | --- |
| Processing Time | ~5 hours | 99% reduction from traditional methods |
| Lines of Code | ~36,000 | Production-grade implementation |
| Variant Coverage | 3.56M | Comprehensive genomic database |
| Target Genes | 201 | Clinically validated targets |
| Druggability Rate | 85% | High therapeutic potential |
               PRECISION MEDICINE TO DRUG DISCOVERY
════════════════════════════════════════════════════════════════
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│     STAGE 1     │────▶│     STAGE 2     │────▶│     STAGE 3     │
│                 │     │                 │     │                 │
│    GENOMICS     │     │    RAG/CHAT     │     │      DRUG       │
│    PIPELINE     │     │    PIPELINE     │     │    DISCOVERY    │
│                 │     │                 │     │                 │
│  ┌───────────┐  │     │  ┌───────────┐  │     │  ┌───────────┐  │
│  │   FASTQ   │  │     │  │    VCF    │  │     │  │  TARGET   │  │
│  │     ↓     │  │     │  │     ↓     │  │     │  │     ↓     │  │
│  │    VCF    │  │     │  │  TARGET   │  │     │  │ MOLECULES │  │
│  └───────────┘  │     │  └───────────┘  │     │  └───────────┘  │
│                 │     │                 │     │                 │
│   120-240 min   │     │   Interactive   │     │     Minutes     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┴───────────────────────┘
                                 │
                 ┌───────────────────────────────┐
                 │       NVIDIA DGX SPARK        │
                 │  128GB unified LPDDR5x | 144  │
                 └───────────────────────────────┘
Stage 1: Genomics Pipeline
Overview
The Genomics Pipeline transforms raw DNA sequencing data (FASTQ) into variant calls (VCF) using NVIDIA Parabricks, achieving 10-50x acceleration over traditional CPU-based methods.
Technical Specifications
| Component | Technology | Performance |
| --- | --- | --- |
| Alignment | BWA-MEM2 (GPU-accelerated) | 20-45 minutes |
| Sorting | Coordinate sorting with deduplication | Included |
| Indexing | samtools index + flagstat | 2-5 minutes |
| Variant Calling | Google DeepVariant (GPU) | 10-35 minutes |
Data Flow
┌──────────────────────────────────────────────────────────────────────┐
│                          GENOMICS PIPELINE                           │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│      INPUT                  PROCESSING                  OUTPUT       │
│      ─────                  ──────────                  ──────       │
│                                                                      │
│  ┌───────────┐    ┌─────────────────────────────┐    ┌───────────┐   │
│  │   FASTQ   │    │                             │    │    BAM    │   │
│  │  R1 / R2  │───▶│      NVIDIA PARABRICKS      │───▶│   File    │   │
│  │  ~200 GB  │    │    fq2bam + DeepVariant     │    │  ~100 GB  │   │
│  └───────────┘    │                             │    └─────┬─────┘   │
│                   │  ┌───────────────────────┐  │          │         │
│  ┌───────────┐    │  │   GRCh38 Reference    │  │    ┌─────▼─────┐   │
│  │ Reference │───▶│  │    Genome (3.1 GB)    │  │    │    VCF    │   │
│  │  GRCh38   │    │  └───────────────────────┘  │    │   File    │   │
│  └───────────┘    │                             │    │  ~11.7M   │   │
│                   └─────────────────────────────┘    │ variants  │   │
│                                                      └───────────┘   │
│                                                                      │
│  TIMING: 120-240 minutes (vs. 24-48 hours on CPU)                    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
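The two Parabricks steps named in the diagram (fq2bam, then DeepVariant) can be sketched as a pair of `pbrun` command lines. This is a minimal illustration rather than the project's actual launcher: the file names and the `build_parabricks_commands` helper are hypothetical, and only the basic `--ref`, `--in-fq`, `--out-bam`, `--in-bam`, and `--out-variants` options are used.

```python
import shlex
from pathlib import Path


def build_parabricks_commands(ref, fastq_r1, fastq_r2, out_dir):
    """Return the pbrun command lines for alignment, then variant calling.

    Hypothetical helper: output file names are placeholders.
    """
    out = Path(out_dir)
    bam = out / "sample.bam"   # assumed output name
    vcf = out / "sample.vcf"   # assumed output name

    # Step 1: align, sort, and mark duplicates (FASTQ -> BAM).
    fq2bam = [
        "pbrun", "fq2bam",
        "--ref", str(ref),
        "--in-fq", str(fastq_r1), str(fastq_r2),
        "--out-bam", str(bam),
    ]
    # Step 2: call variants on the BAM from step 1 (BAM -> VCF).
    deepvariant = [
        "pbrun", "deepvariant",
        "--ref", str(ref),
        "--in-bam", str(bam),
        "--out-variants", str(vcf),
    ]
    return [fq2bam, deepvariant]


if __name__ == "__main__":
    commands = build_parabricks_commands(
        "GRCh38.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "out"
    )
    for cmd in commands:
        # Print instead of executing; pass each list to subprocess.run
        # on a machine where Parabricks is installed.
        print(shlex.join(cmd))
```

Building the commands as lists keeps them safe to hand to `subprocess.run` without shell quoting, and makes the BAM hand-off between the two steps explicit.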
| Metric | Value |
| --- | --- |
| Input Size | ~200 GB (paired-end FASTQ) |
| Output Variants | ~11.7 million |
| Processing Time | 120-240 minutes |
| GPU Utilization | 85-95% |
| Accuracy | >99% (DeepVariant) |
Stage 2: RAG/Chat Pipeline
Overview
The RAG (Retrieval-Augmented Generation) Pipeline enables natural language queries across millions of genomic variants, synthesizing AI-powered answers grounded in clinical evidence.
Technical Specifications
| Component | Technology | Capacity |
| --- | --- | --- |
| Vector Database | Milvus | 3.56M embeddings |
| Embedding Model | BGE-small-en-v1.5 | 384 dimensions |
| Knowledge Base | Clinker | 201 genes, 150+ diseases |
| LLM | Claude (Anthropic) | claude-sonnet-4 |
| Clinical Data | ClinVar | 4.1M variants |
| AI Predictions | AlphaMissense | 71M predictions |
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                          RAG/CHAT PIPELINE                           │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                         ┌─────────────────┐                          │
│                         │   User Query    │                          │
│                         │ "What variants  │                          │
│                         │  affect VCP?"   │                          │
│                         └────────┬────────┘                          │
│                                  │                                   │
│                                  ▼                                   │
│   ┌─────────────────────────────────────────────────────────────┐    │
│   │                       EMBEDDING LAYER                       │    │
│   │                      BGE-small-en-v1.5                      │    │
│   │                       384 Dimensions                        │    │
│   └──────────────────────────────┬──────────────────────────────┘    │
│                                  │                                   │
│                                  ▼                                   │
│   ┌─────────────────────────────────────────────────────────────┐    │
│   │                      MILVUS VECTOR DB                       │    │
│   │  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │    │
│   │  │    ClinVar    │   │ AlphaMissense │   │    Clinker    │  │    │
│   │  │   4.1M vars   │   │  71M scores   │   │   201 genes   │  │    │
│   │  └───────────────┘   └───────────────┘   └───────────────┘  │    │
│   └──────────────────────────────┬──────────────────────────────┘    │
│                                  │                                   │
│                                  ▼                                   │
│   ┌─────────────────────────────────────────────────────────────┐    │
│   │                         CLAUDE LLM                          │    │
│   │               Evidence Synthesis & Reasoning                │    │
│   │                    Grounded in Citations                    │    │
│   └──────────────────────────────┬──────────────────────────────┘    │
│                                  │                                   │
│                                  ▼                                   │
│                         ┌─────────────────┐                          │
│                         │   AI Response   │                          │
│                         │   + Citations   │                          │
│                         │   + Evidence    │                          │
│                         └─────────────────┘                          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
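The retrieve-then-generate flow in the diagram can be sketched in plain Python. This is only an illustration of the idea, not the platform's code: a brute-force cosine search stands in for Milvus, toy 3-dimensional vectors stand in for the 384-dimensional BGE embeddings, and the record texts, `top_k`, and `build_prompt` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, records, k=2):
    """Return the k records most similar to the query vector."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble an LLM prompt that numbers each piece of evidence for citation."""
    evidence = "\n".join(
        f"[{i}] {hit['text']} (source: {hit['source']})"
        for i, hit in enumerate(hits, start=1)
    )
    return (
        "Answer using only the evidence below and cite sources by number.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )


# Illustrative records only; real vectors would come from the embedding model.
RECORDS = [
    {"text": "VCP R155H is classified as pathogenic", "source": "ClinVar",
     "vector": [0.9, 0.1, 0.0]},
    {"text": "VCP R155H pathogenicity score 0.94", "source": "AlphaMissense",
     "vector": [0.8, 0.2, 0.1]},
    {"text": "Unrelated record", "source": "ClinVar",
     "vector": [0.0, 0.1, 0.9]},
]

hits = top_k([1.0, 0.0, 0.0], RECORDS, k=2)
prompt = build_prompt("What variants affect VCP?", hits)
print(prompt)
```

Numbering the evidence in the prompt is one simple way to let the model return citations that map back to specific retrieved records.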
Knowledge Base Coverage
| Therapeutic Area | Target Genes | Key Diseases |
| --- | --- | --- |
| Oncology | 45 | Breast, Lung, Colorectal Cancers |
| Neurology | 38 | ALS, FTD, Parkinson's, Alzheimer's |
| Rare Disease | 52 | Inherited Metabolic Disorders |
| Cardiovascular | 28 | Cardiomyopathy, Arrhythmia |
| Immunology | 22 | Autoimmune Disorders |
| Ophthalmology | 16 | Retinal Dystrophies |
Sample Queries
Query: "What pathogenic variants are associated with VCP?"
Response: Based on clinical evidence from ClinVar and AlphaMissense:
• VCP R155H (rs121909331) - Pathogenic
  - Associated with IBMPFD (Inclusion Body Myopathy with Paget's Disease)
  - AlphaMissense Score: 0.94 (Likely Pathogenic)
• VCP R191Q (rs121909332) - Pathogenic
  - Causes familial ALS and FTD
  - 85% druggability confidence
• VCP A232E (rs121909333) - Pathogenic
  - Multi-system proteinopathy
  - Structure available: PDB 8OOI
Target Hypothesis: VCP is a validated therapeutic target for
neurodegenerative disease with known inhibitors in clinical development.
Stage 3: Drug Discovery Pipeline
Overview
The Drug Discovery Pipeline leverages NVIDIA BioNeMo NIM microservices to generate novel drug-like molecules from validated protein targets, complete with binding pose predictions and drug-likeness scoring.
Technical Specifications
| Component | Technology | Function |
| --- | --- | --- |
| Structure Retrieval | RCSB PDB API | Cryo-EM/X-ray structures |
| Molecule Generation | BioNeMo MolMIM | Novel analog creation |
| Molecular Docking | BioNeMo DiffDock | Binding pose prediction |
| 3D Conformers | RDKit | Energy minimization |
| Drug Scoring | QED + Lipinski | Drug-likeness evaluation |
| Report Generation | ReportLab | Professional PDF output |
Pipeline Flow
┌──────────────────────────────────────────────────────────────────────┐
│                       DRUG DISCOVERY PIPELINE                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  PHASE 1: STRUCTURE RETRIEVAL                                        │
│  ────────────────────────────                                        │
│  ┌─────────────┐    ┌─────────────────────────────────────────┐      │
│  │   Target    │───▶│              RCSB PDB API               │      │
│  │    (VCP)    │    │  8OOI: WT Hexamer (2.9Å, Cryo-EM)       │      │
│  └─────────────┘    │  9DIL: Mutant (3.2Å)                    │      │
│                     │  5FTK: +CB-5083 Inhibitor (2.3Å)        │      │
│                     └─────────────────────────────────────────┘      │
│                                  │                                   │
│                                  ▼                                   │
│  PHASE 2: MOLECULE GENERATION                                        │
│  ────────────────────────────                                        │
│  ┌─────────────┐    ┌─────────────────────────────────────────┐      │
│  │  Seed Mol   │───▶│             BioNeMo MolMIM              │      │
│  │  (CB-5083)  │    │       Mutual Information Machine        │      │
│  │   SMILES    │    │         Generate Novel Analogs          │      │
│  └─────────────┘    └─────────────────────────────────────────┘      │
│                                  │                                   │
│                                  ▼                                   │
│  PHASE 3: MOLECULAR DOCKING                                          │
│  ──────────────────────────                                          │
│   ┌─────────────────────────────────────────────────────────────┐    │
│   │                      BioNeMo DiffDock                       │    │
│   │             Diffusion-Based Docking Predictions             │    │
│   │                   Binding Pose Generation                   │    │
│   └─────────────────────────────────────────────────────────────┘    │
│                                  │                                   │
│                                  ▼                                   │
│  PHASE 4: SCORING & RANKING                                          │
│  ──────────────────────────                                          │
│    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │
│    │    Lipinski    │    │      QED       │    │     ADMET      │    │
│    │   Rule of 5    │    │     Score      │    │   Properties   │    │
│    │    MW ≤ 500    │    │    0.0-1.0     │    │   Absorption   │    │
│    │    LogP ≤ 5    │    │ Drug-likeness  │    │   Metabolism   │    │
│    └────────────────┘    └────────────────┘    └────────────────┘    │
│                                  │                                   │
│                                  ▼                                   │
│  PHASE 5: REPORT GENERATION                                          │
│  ──────────────────────────                                          │
│   ┌─────────────────────────────────────────────────────────────┐    │
│   │                VCP_Drug_Candidate_Report.pdf                │    │
│   │      • Executive Summary         • Ranked Candidates        │    │
│   │      • Structure Analysis        • Scoring Details          │    │
│   │      • Binding Site Maps         • Next Steps               │    │
│   └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
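Phase 1 lists three candidate VCP structures, so a downstream docking step has to pick one. The sketch below shows one possible selection policy, assumed here purely for illustration: prefer an entry listed with a bound inhibitor, then the lowest (best) resolution. The `Structure` class, the `has_ligand` flags, and the policy itself are hypothetical; the IDs and resolutions are the ones shown in the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    pdb_id: str
    resolution_angstrom: float
    has_ligand: bool  # True only where the diagram lists a bound inhibitor


# Entries and resolutions as listed in Phase 1 of the diagram.
VCP_STRUCTURES = [
    Structure("8OOI", 2.9, False),
    Structure("9DIL", 3.2, False),
    Structure("5FTK", 2.3, True),  # listed with the CB-5083 inhibitor
]


def pick_docking_structure(structures):
    """Prefer ligand-bound entries, then the lowest resolution value.

    Assumed policy for illustration; the pipeline's real rule is not documented here.
    """
    return min(structures, key=lambda s: (not s.has_ligand, s.resolution_angstrom))


def entry_url(pdb_id):
    """RCSB Data API URL for an entry's metadata (JSON)."""
    return f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"


best = pick_docking_structure(VCP_STRUCTURES)
print(best.pdb_id, entry_url(best.pdb_id))
```

A ligand-bound structure is often a reasonable starting point for docking because the binding pocket is already in a ligand-compatible conformation, but other policies (for example, always using the wild-type hexamer) are equally defensible.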
Drug-Likeness Criteria
| Rule | Criteria | Purpose |
| --- | --- | --- |
| Lipinski Rule 1 | MW ≤ 500 Da | Oral bioavailability |
| Lipinski Rule 2 | LogP ≤ 5 | Membrane permeability |
| Lipinski Rule 3 | H-Bond Donors ≤ 5 | Absorption |
| Lipinski Rule 4 | H-Bond Acceptors ≤ 10 | Solubility |
| QED Score | 0.0-1.0 | Overall drug-likeness |
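The criteria above translate directly into a small filter. The following is a sketch under stated assumptions, not the pipeline's scoring code: descriptor values are passed in precomputed (in practice RDKit could supply them), the candidate names and numbers are made up, and requiring zero violations before ranking by QED is an illustrative choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    mol_weight: float  # molecular weight in Da
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors
    qed: float         # QED score, 0.0-1.0


def lipinski_violations(d):
    """Return the names of the Lipinski rules that the molecule breaks."""
    rules = [
        ("MW <= 500 Da", d.mol_weight <= 500),
        ("LogP <= 5", d.logp <= 5),
        ("H-Bond Donors <= 5", d.h_donors <= 5),
        ("H-Bond Acceptors <= 10", d.h_acceptors <= 10),
    ]
    return [name for name, passed in rules if not passed]


def rank_candidates(candidates):
    """Keep candidates with no violations, ordered by QED, highest first."""
    passing = {name: d for name, d in candidates.items() if not lipinski_violations(d)}
    return sorted(passing, key=lambda name: passing[name].qed, reverse=True)


# Hypothetical descriptor values for three generated analogs.
CANDIDATES = {
    "analog_a": Descriptors(350.0, 2.1, 2, 5, 0.71),
    "analog_b": Descriptors(620.0, 6.3, 3, 8, 0.35),
    "analog_c": Descriptors(410.0, 3.0, 1, 6, 0.82),
}

print(rank_candidates(CANDIDATES))
print(lipinski_violations(CANDIDATES["analog_b"]))
```

Returning the list of broken rules, rather than a single pass/fail flag, makes it easy to relax the filter later (the classic rule of five tolerates one violation).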
Infrastructure & Monitoring
NVIDIA DGX Spark Specifications
| Component | Specification |
| --- | --- |
| GPU | NVIDIA GB10 (Blackwell) |
| Unified Memory | 128 GB LPDDR5x (shared CPU/GPU) |
| CPU | ARM64 Cores |
| Storage | 2+ TB NVMe |
| Network | 100 GbE |
Service Architecture
| Port | Service | Status |
| --- | --- | --- |
| 8080 | Landing Page | Active |
| 5000 | Genomics Portal | Active |
| 5001 | RAG/Chat API | Active |
| 8501 | RAG Chat Interface | Active |
| 8505 | Drug Discovery UI | Active |
| 8510 | Discovery Portal | Active |
| 19530 | Milvus Vector DB | Active |
| 3000 | Grafana | Active |
| 9099 | Prometheus | Active |
| 9100 | Node Exporter | Active |
| 9400 | DCGM Exporter | Active |
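The "Active" status of each service can be spot-checked with a plain TCP connection test. This is a minimal sketch using only the standard library; the port-to-service mapping is taken from the table above, while `is_listening`, `service_report`, and the `localhost` default are assumptions for illustration. A successful connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Port-to-service mapping as listed in the Service Architecture table.
SERVICES = {
    8080: "Landing Page",
    5000: "Genomics Portal",
    5001: "RAG/Chat API",
    8501: "RAG Chat Interface",
    8505: "Drug Discovery UI",
    8510: "Discovery Portal",
    19530: "Milvus Vector DB",
    3000: "Grafana",
    9099: "Prometheus",
    9100: "Node Exporter",
    9400: "DCGM Exporter",
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_report(host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for port, name in SERVICES.items()}


if __name__ == "__main__":
    for name, up in service_report().items():
        print(f"{name:20s} {'Active' if up else 'Down'}")
```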
Monitoring Dashboard
┌──────────────────────────────────────────────────────────────────────┐
│                   NVIDIA DGX SPARK GPU MONITORING                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │
│    │ GPU Utilization│    │ GPU Temperature│    │   GPU Power    │    │
│    │      85%       │    │      62°C      │    │      320W      │    │
│    │   ████████░░   │    │   ██████░░░░   │    │   ████████░░   │    │
│    └────────────────┘    └────────────────┘    └────────────────┘    │
│                                                                      │
│    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │
│    │ CPU Utilization│    │Memory Bandwidth│    │ NVMe Throughput│    │
│    │      45%       │    │    450 GB/s    │    │    2.8 GB/s    │    │
│    │   ████░░░░░░   │    │   ███████░░░   │    │   ███████░░░   │    │
│    └────────────────┘    └────────────────┘    └────────────────┘    │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
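The GPU panels in a dashboard like this are typically backed by DCGM Exporter metrics scraped by Prometheus. The sketch below builds instant-query URLs for the Prometheus HTTP API and parses the JSON response. It assumes Prometheus is reachable at `http://localhost:9099` (the port listed for it in this report) and that the standard DCGM field names are exported; `query_url` and `first_value` are hypothetical helpers, and the panel-to-metric mapping is a guess at how the dashboard is wired.

```python
import json
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9099"  # assumed base URL

# Assumed mapping from dashboard panels to DCGM Exporter metric names.
GPU_METRICS = {
    "GPU Utilization (%)": "DCGM_FI_DEV_GPU_UTIL",
    "GPU Temperature (C)": "DCGM_FI_DEV_GPU_TEMP",
    "GPU Power (W)": "DCGM_FI_DEV_POWER_USAGE",
}


def query_url(metric, base=PROMETHEUS):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': metric})}"


def first_value(response_text):
    """Extract the first sample value from an instant-query response, or None."""
    payload = json.loads(response_text)
    results = payload["data"]["result"]
    if not results:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][1])


if __name__ == "__main__":
    for panel, metric in GPU_METRICS.items():
        # Fetch each URL with urllib.request.urlopen on a host that can
        # reach Prometheus, then pass the body to first_value.
        print(panel, query_url(metric))
```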
Codebase Metrics
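| Pipeline | Python | JavaScript | Shell | Markdown | CSS/HTML | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Genomics | 1,839 | 1,344 | 2,636 | 2,510 | 1,671 | 10,000 |
| RAG/Chat | 9,409 | 716 | 349 | 1,797 | 975 | 13,246 |
| Drug Discovery | 5,332 | - | 55 | 1,102 | 234 | 6,723 |
| Landing Page | 400 | 750 | 100 | 450 | 708 | 2,408 |
| Documentation | - | - | - | 3,540 | - | 3,540 |
| Total | 16,980 | 2,810 | 3,140 | 9,399 | 3,588 | ~36,000 |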
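The row and column totals can be recomputed from the per-language counts as a quick consistency check. This sketch simply re-enters the counts from the table (with `-` treated as 0); the variable names are illustrative.

```python
LANGUAGES = ["Python", "JavaScript", "Shell", "Markdown", "CSS/HTML"]

# Per-language line counts from the Codebase Metrics table ("-" entered as 0).
LINES = {
    "Genomics":       [1839, 1344, 2636, 2510, 1671],
    "RAG/Chat":       [9409, 716, 349, 1797, 975],
    "Drug Discovery": [5332, 0, 55, 1102, 234],
    "Landing Page":   [400, 750, 100, 450, 708],
    "Documentation":  [0, 0, 0, 3540, 0],
}

# Sum across languages for each pipeline, and across pipelines for each language.
row_totals = {name: sum(counts) for name, counts in LINES.items()}
col_totals = dict(zip(LANGUAGES, (sum(col) for col in zip(*LINES.values()))))
grand_total = sum(row_totals.values())

for name, total in row_totals.items():
    print(f"{name:15s} {total:>7,}")
print(f"{'Grand total':15s} {grand_total:>7,}")  # 35,917
```

The grand total of 35,917 lines is what the table rounds to ~36,000.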
Data Assets
| Asset | Count | Source |
| --- | --- | --- |
| Variant Embeddings | 3,561,170 | VCF + Annotations |
| ClinVar Variants | 4,100,000 | NCBI ClinVar |
| AlphaMissense Predictions | 71,000,000 | DeepMind |
| Target Genes | 201 | Clinker Knowledge Base |
| Disease Associations | 150+ | Curated Database |
| Therapeutic Areas | 13 | Clinical Classification |
| Druggable Targets | 171 | Druggability Analysis |
Technology Stack
Core Technologies
| Layer | Technology | Purpose |
| --- | --- | --- |
| Compute | NVIDIA DGX Spark | GPU-accelerated processing |
| Genomics | NVIDIA Parabricks 4.6 | Variant calling pipeline |
| AI/ML | NVIDIA BioNeMo NIM | Drug discovery models |
| LLM | Claude (Anthropic) | Natural language reasoning |
| Vector DB | Milvus | Similarity search |
| Embeddings | BGE-small-en-v1.5 | Semantic encoding |
| Web | Flask + Streamlit | User interfaces |
| Monitoring | Grafana + Prometheus | System observability |
| Container | Docker + NVIDIA Runtime | Service orchestration |
AI Models
| Model | Provider | Application |
| --- | --- | --- |
| DeepVariant | Google | Variant calling (>99% accuracy) |
| BGE-small-en-v1.5 | BAAI | Semantic embeddings |
| Claude Sonnet 4 | Anthropic | Evidence synthesis |
| MolMIM | NVIDIA BioNeMo | Molecule generation |
| DiffDock | NVIDIA BioNeMo | Molecular docking |
| AlphaMissense | DeepMind | Pathogenicity prediction |
Business Value
Time Savings
| Process | Traditional | AI Factory | Improvement |
| --- | --- | --- | --- |
| FASTQ to VCF | 24-48 hours | 120-240 min | ~12x faster |
| Variant Interpretation | 2-4 weeks | Minutes | 1000x faster |
| Target Identification | 1-3 months | Hours | 100x faster |
| Lead Generation | 6-12 months | Hours | 1000x faster |
| Total Pipeline | 12-18 months | ~5 hours | >99% reduction |
Cost Efficiency
| Factor | Impact |
| --- | --- |
| Compute Time Reduction | 50-100x lower GPU hours |
| Manual Analysis Reduction | 90% fewer specialist hours |
| Iteration Speed | 10x faster hypothesis testing |
| Error Reduction | AI-validated annotations |
Conclusion
The HCLS AI Factory delivers a production-ready platform for precision medicine to drug discovery, demonstrating the transformative potential of GPU-accelerated computing and AI in healthcare.
Key Differentiators
End-to-End Integration: Single platform from DNA to drug candidates
GPU Acceleration: NVIDIA Parabricks and BioNeMo for 10-100x speedups
AI-Powered Insights: Claude LLM for evidence synthesis
Clinical Grounding: 4.1M ClinVar variants with 71M AI predictions
Production Ready: ~36,000 lines of tested code