Genomics to Drug Discovery Pipeline - Complete Phases Overview
Pipeline Architecture
Phase 1-3: Genomics Pipeline (Data Acquisition & Processing)
↓
Phase 4: RAG Chat (Evidence Search & Query)
↓
Phase 5: Target Selection (Hypothesis & Structure)
↓
Phase 6: Molecule Generation (Drug Candidates)
Phase 1: Environment Setup & Prerequisites
Objective: Prepare the compute environment for GPU-accelerated genomics processing.
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Prerequisites │
├─────────────────────────────────────────────────────────────┤
│ Script: 00-setup-check.sh │
│ │
│ ├── Docker daemon verification │
│ ├── NVIDIA Container Runtime check │
│ ├── GPU detection and CUDA validation │
│ └── Disk space verification (500GB+) │
└─────────────────────────────────────────────────────────────┘
Deliverables
Component        Output
──────────────   ────────────────────────────────
Docker           Container runtime operational
NVIDIA Runtime   GPU passthrough enabled
GPU              CUDA-capable device detected
Storage          500GB+ available space confirmed
Commands
./run.sh check # Run prerequisites check
./run.sh status # View system status
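The contents of 00-setup-check.sh are not shown here, so the following is only a minimal Python sketch of the same kind of check. It tests whether the `docker` and `nvidia-smi` executables are on the PATH and whether the 500GB disk threshold is met; it does not verify that the Docker daemon is running or that the NVIDIA Container Runtime is configured. The function names are hypothetical.

```python
import shutil

MIN_FREE_GB = 500  # disk threshold from the Phase 1 checklist


def free_gb(path: str = ".") -> float:
    """Return free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def check_prerequisites(path: str = ".", min_free_gb: float = MIN_FREE_GB) -> dict:
    """Report which prerequisites appear to be satisfied on this machine.

    Only the presence of the CLIs is checked, not whether they work.
    """
    return {
        "docker_cli": shutil.which("docker") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "disk_space": free_gb(path) >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:12s} {'OK' if ok else 'MISSING'}")
```

Pass the data directory as `path` so the disk check applies to the volume that will actually hold the FASTQ and BAM files.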
Phase 2: Authentication & Container Setup
Objective: Authenticate with NVIDIA NGC and pull the Parabricks container.
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Authentication │
├─────────────────────────────────────────────────────────────┤
│ Script: 01-ngc-login.sh │
│ │
│ ├── NGC API key authentication │
│ ├── Docker login to nvcr.io │
│ └── Pull clara-parabricks:4.6.0-1 (~15GB) │
└─────────────────────────────────────────────────────────────┘
Deliverables
Component          Output
────────────────   ───────────────────────────────────────
NGC Auth           Valid API token configured
Docker Registry    Authenticated to nvcr.io
Parabricks Image   clara-parabricks:4.6.0-1 pulled locally
Commands
./run.sh login # NGC authentication + container pull
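As an illustration of what the login step involves, here is a hedged Python sketch that builds the two Docker commands. NGC registry logins use the literal username `$oauthtoken` with the API key as the password. The full image path below is an assumption (the text only gives the tag clara-parabricks:4.6.0-1), and the function names are hypothetical rather than taken from 01-ngc-login.sh.

```python
import subprocess

NGC_REGISTRY = "nvcr.io"
# Assumed full registry path; the source only names clara-parabricks:4.6.0-1.
PARABRICKS_IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"


def login_command(registry: str = NGC_REGISTRY) -> list:
    """Build a `docker login` command that reads the API key from stdin."""
    # "$oauthtoken" is a literal username, not a shell variable.
    return ["docker", "login", "--username", "$oauthtoken",
            "--password-stdin", registry]


def pull_command(image: str = PARABRICKS_IMAGE) -> list:
    """Build the `docker pull` command for the Parabricks image."""
    return ["docker", "pull", image]


def login_and_pull(api_key: str) -> None:
    """Run both steps; raises CalledProcessError if either one fails."""
    subprocess.run(login_command(), input=api_key, text=True, check=True)
    subprocess.run(pull_command(), check=True)
```

Passing the key on stdin keeps it out of the process list and shell history.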
Phase 3: Data Acquisition & Genome Processing
Objective: Download reference data, acquire sample data, and run GPU-accelerated variant calling.
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: Data & Processing │
├─────────────────────────────────────────────────────────────┤
│ Scripts: 02-download-data.sh │
│ 03-setup-reference.sh │
│ 04-run-chr20-test.sh │
│ 05-run-full-genome.sh │
│ │
│ ├── Download GIAB HG002 FASTQ (~200GB) │
│ ├── Setup GRCh38 reference genome │
│ ├── GPU-accelerated alignment (BWA-MEM2) │
│ ├── Coordinate sorting & duplicate marking │
│ ├── DeepVariant CNN variant calling │
│ └── VCF generation with quality scores │
└─────────────────────────────────────────────────────────────┘
Processing Pipeline
FASTQ (Raw Reads)
↓
┌───────────────────────────────────┐
│ fq2bam (GPU) │
│ ├── BWA-MEM2 alignment │
│ ├── Coordinate sorting │
│ └── Duplicate marking │
└───────────────────────────────────┘
↓
BAM (Aligned Reads)
↓
┌───────────────────────────────────┐
│ DeepVariant (GPU) │
│ ├── Pileup image generation │
│ ├── CNN inference │
│ └── Genotype classification │
└───────────────────────────────────┘
↓
VCF (Variant Calls)
Deliverables
Component    Output
──────────   ─────────────────────────────────────────────
FASTQ Data   HG002_R1.fastq.gz, HG002_R2.fastq.gz (~200GB)
Reference    GRCh38.fa with BWA index files
BAM          HG002.genome.bam (~100GB aligned reads)
VCF          HG002.genome.vcf.gz (~4M variants)
Commands
./run.sh download # Download GIAB HG002 data
./run.sh reference # Setup reference genome
./run.sh test # Run chr20 test (~20 min)
./run.sh full # Full genome processing (~2-3 hrs)
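The two processing stages map onto the Parabricks `pbrun fq2bam` and `pbrun deepvariant` tools. The sketch below builds the corresponding `docker run` command lines in Python. It is an assumption about what the run scripts do, not a copy of them: the full image path, the /workdir mount point, and the function names are illustrative, and the file names come from the deliverables list.

```python
# Assumed full registry path; the source only names clara-parabricks:4.6.0-1.
PARABRICKS_IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"


def docker_prefix(workdir: str) -> list:
    """Common `docker run` prefix: all GPUs, data directory mounted at /workdir."""
    return ["docker", "run", "--rm", "--gpus", "all",
            "--volume", f"{workdir}:/workdir", "--workdir", "/workdir",
            PARABRICKS_IMAGE]


def fq2bam_command(workdir, ref, fq1, fq2, out_bam) -> list:
    """FASTQ -> BAM: alignment, coordinate sorting, duplicate marking."""
    return docker_prefix(workdir) + [
        "pbrun", "fq2bam", "--ref", ref, "--in-fq", fq1, fq2,
        "--out-bam", out_bam]


def deepvariant_command(workdir, ref, in_bam, out_vcf) -> list:
    """BAM -> VCF: DeepVariant variant calling."""
    return docker_prefix(workdir) + [
        "pbrun", "deepvariant", "--ref", ref, "--in-bam", in_bam,
        "--out-variants", out_vcf]
```

For example, `fq2bam_command("/data", "GRCh38.fa", "HG002_R1.fastq.gz", "HG002_R2.fastq.gz", "HG002.genome.bam")` returns a list that can be passed to `subprocess.run(..., check=True)`.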
Phase 4: RAG Chat System
Objective: Build a retrieval-augmented generation system for querying genomic evidence.
┌─────────────────────────────────────────────────────────────┐
│ Phase 4: RAG Chat │
├─────────────────────────────────────────────────────────────┤
│ Components: │
│ │
│ ├── Milvus vector database (Docker) │
│ ├── VCF → Evidence object extraction │
│ ├── Sentence-transformer embeddings │
│ ├── Streamlit chat interface │
│ └── LLM backend (vLLM or API) │
└─────────────────────────────────────────────────────────────┘
Architecture
User Query: "What EGFR variants are in HG002?"
↓
┌───────────────────────────────────┐
│ Streamlit Chat UI │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Embedding (bge-small-en) │
│ Query → 384-dim vector │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Milvus Vector Search │
│ Top-K similar evidence │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ LLM Response │
│ Context + Query → Answer │
└───────────────────────────────────┘
↓
Evidence-backed response to user
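To make the vector-search step concrete, here is a small pure-Python stand-in for it. This is not the Milvus API: it does a brute-force top-K search by cosine similarity over toy 3-dimensional vectors, whereas the pipeline uses 384-dimensional bge-small-en embeddings stored in Milvus. The similarity metric is an assumption, since the text does not say which one the collection uses.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, evidence, k: int = 3) -> list:
    """Return the k most similar (score, text) pairs, best first.

    `evidence` is a list of (summary_text, embedding) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in evidence]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, which is what makes search over millions of variants practical.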
Evidence Object Structure
{
  "variant_key": "chr7:55249071:G:A",
  "chrom": "chr7",
  "pos": 55249071,
  "ref": "G",
  "alt": "A",
  "qual": 45.2,
  "gene": "EGFR",
  "tags": ["giab", "vcf", "variant"],
  "source": "GIAB WGS VCF (GRCh38)",
  "summary_text": "Variant at chr7:55249071 G>A in EGFR gene..."
}
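A minimal sketch of how one VCF data line could be turned into an evidence object with this shape. It is not the code in ingest_evidence.py. It assumes a single ALT allele per line, and `gene_lookup` is a hypothetical callable standing in for whatever gene annotation the pipeline uses, since a plain VCF line carries no gene name.

```python
def vcf_line_to_evidence(line: str, gene_lookup=None) -> dict:
    """Convert one tab-separated VCF data line into an evidence dict.

    Uses the first six VCF columns: CHROM, POS, ID, REF, ALT, QUAL.
    """
    chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    pos = int(pos)
    # Hypothetical annotation hook: maps (chrom, pos) to a gene symbol or None.
    gene = gene_lookup(chrom, pos) if gene_lookup else None
    summary = f"Variant at {chrom}:{pos} {ref}>{alt}"
    if gene:
        summary += f" in {gene} gene"
    return {
        "variant_key": f"{chrom}:{pos}:{ref}:{alt}",
        "chrom": chrom,
        "pos": pos,
        "ref": ref,
        "alt": alt,
        "qual": None if qual == "." else float(qual),  # "." means missing
        "gene": gene,
        "tags": ["giab", "vcf", "variant"],
        "source": "GIAB WGS VCF (GRCh38)",
        "summary_text": summary,
    }
```

The `summary_text` field is the part that gets embedded, so it should carry every detail a user might search for.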
Deliverables
Component             Output
───────────────────   ────────────────────────────────────
Milvus                Running vector database (port 19530)
Evidence Collection   Indexed VCF variants with embeddings
Chat Interface        Streamlit app (port 8501)
LLM Integration       Query-response pipeline
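The "Context + Query → Answer" step amounts to assembling retrieved evidence into a prompt. A hedged sketch of one way to do that follows; the prompt wording and the `build_prompt` name are illustrative assumptions, not the contents of llm_backend.py.

```python
def build_prompt(query: str, evidence: list, max_items: int = 5) -> str:
    """Assemble an LLM prompt from retrieved evidence objects and the user query.

    Each evidence item is a dict with at least `variant_key` and `summary_text`.
    """
    lines = [
        f"[{i}] {ev['variant_key']}: {ev['summary_text']}"
        for i, ev in enumerate(evidence[:max_items], start=1)
    ]
    context = "\n".join(lines) if lines else "(no evidence retrieved)"
    return (
        "Answer the question using only the evidence below. "
        "Cite evidence by its bracketed number.\n\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the evidence lets the model cite specific variants, which keeps the answer traceable back to the VCF.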
Project Structure
~/rag-chat-pipeline/
├── docker-compose.yml # Milvus standalone
├── ingest_evidence.py # VCF → embeddings → Milvus
├── chat_app.py # Streamlit interface
├── llm_backend.py # vLLM or API wrapper
└── requirements.txt
Phase 5: Target Selection
Objective: Form a drug target hypothesis from RAG evidence and integrate structural biology data.
┌─────────────────────────────────────────────────────────────┐
│ Phase 5: Target Selection │
├─────────────────────────────────────────────────────────────┤
│ Components: │
│ │
│ ├── Hypothesis formation from RAG queries │
│ ├── Cryo-EM structure fetch (PDB:7SYE) │
│ └── Binding site identification │
└─────────────────────────────────────────────────────────────┘
Workflow
RAG Chat Query
"What variants affect EGFR in this sample?"
↓
┌───────────────────────────────────┐
│ Evidence Retrieval │
│ Returns: EGFR variants list │
│ chr7:55249071, chr7:55259515... │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Hypothesis Formation │
│ "EGFR is a viable drug target │
│ based on variant evidence" │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Cryo-EM Structure Fetch │
│ PDB: 7SYE (EGFR kinase domain) │
│ Resolution: 2.8 Å │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Binding Site Identification │
│ ATP binding pocket coordinates │
│ Key residues: K745, E762, M793 │
└───────────────────────────────────┘
↓
Target profile ready for molecule generation
Deliverables
Component      Output
────────────   ────────────────────────────────────────────
RAG Query      "EGFR variants in HG002" → evidence list
PDB Fetch      7SYE structure file (~3MB)
Binding Site   Coordinates for molecule docking constraints
Key Data
Target: EGFR (Epidermal Growth Factor Receptor)
PDB ID: 7SYE
Structure: Cryo-EM, 2.8 Å resolution
Binding Pocket: ATP-competitive site
Key Residues:
- K745 (catalytic lysine)
- E762 (salt bridge with K745)
- M793 (hinge region)
- T790 (gatekeeper; T790M resistance mutation site)
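One simple way to turn the key residues into docking coordinates is to take the centroid of their atoms. The sketch below parses fixed-column PDB-format ATOM records with the standard library only. It assumes the structure is available in legacy PDB format and that its residue numbering matches the numbers listed above; neither is confirmed by the text, and the function names are hypothetical.

```python
KEY_RESIDUES = {745, 762, 790, 793}  # residue numbers from the Key Data list


def parse_atoms(pdb_text: str):
    """Yield (chain, residue_number, residue_name, (x, y, z)) per ATOM record.

    Relies on the fixed column layout of the PDB file format.
    """
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        yield line[21], int(line[22:26]), line[17:20].strip(), xyz


def pocket_centroid(pdb_text: str, residues=KEY_RESIDUES, chain=None) -> tuple:
    """Mean coordinate of all atoms belonging to the selected residues."""
    coords = [
        xyz
        for ch, resseq, _name, xyz in parse_atoms(pdb_text)
        if resseq in residues and (chain is None or ch == chain)
    ]
    if not coords:
        raise ValueError("no atoms found for the requested residues")
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))
```

The centroid can serve as the center of a docking box. Pass `chain` when the structure contains more than one copy of the protein, otherwise atoms from all copies are averaged together.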
Phase 6: Molecule Generation
Objective: Generate drug candidate molecules using NVIDIA BioNeMo and MolMIM.
┌─────────────────────────────────────────────────────────────┐
│ Phase 6: Molecule Generation │
├─────────────────────────────────────────────────────────────┤
│ Components: │
│ │
│ ├── BioNeMo MolMIM model │
│ ├── Constraint-based molecule filtering │
│ └── SMILES/SDF export for wet lab │
└─────────────────────────────────────────────────────────────┘
Architecture
Target Profile (from Phase 5)
↓
┌───────────────────────────────────┐
│ MolMIM │
│ Transformer-based generation │
│ SMILES string output │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Constraint Filtering │
│ ├── Molecular weight (< 500 Da) │
│ ├── LogP (drug-likeness) │
│ ├── Binding affinity score │
│ └── Toxicity prediction │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Ranked Candidates │
│ Top molecules by combined score │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Export │
│ ├── SMILES strings │
│ ├── SDF 3D structures │
│ └── CSV with properties │
└───────────────────────────────────┘
↓
Ready for wet lab synthesis/testing
Deliverables
Component   Output
─────────   ─────────────────────────────────────────
MolMIM      Candidate molecules (SMILES strings)
Filtering   Ranked molecules by binding affinity
Export      CSV/SDF for wet lab or further simulation
Example Output
Rank   SMILES                                       MW (Da)   LogP   Affinity
────   ──────────────────────────────────────────   ───────   ────   ────────
1      Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1           325.4     3.2    -9.8
2      COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC         347.8     3.5    -9.4
3      Cn1cnc2cc3c(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)    421.9     4.1    -9.1
...
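The filtering, ranking, and CSV export steps can be sketched with the standard library alone, taking each candidate's properties as already computed. Two assumptions are made here: the LogP cutoff of 5 is a Lipinski-style value (the text gives no threshold), and a more negative affinity score is treated as better. The function names are hypothetical, and toxicity filtering is left out because the text does not describe how it is scored.

```python
import csv
import io

MAX_MW = 500.0   # Da, from the constraint-filtering step
MAX_LOGP = 5.0   # assumed Lipinski-style cutoff; the source gives no value


def filter_and_rank(candidates, max_mw=MAX_MW, max_logp=MAX_LOGP) -> list:
    """Drop candidates that fail the property limits, then sort best-first.

    Assumes a more negative affinity score indicates stronger predicted binding.
    """
    kept = [c for c in candidates if c["mw"] < max_mw and c["logp"] <= max_logp]
    return sorted(kept, key=lambda c: c["affinity"])


def to_csv(ranked) -> str:
    """Serialize ranked candidates to CSV text with a leading rank column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["rank", "smiles", "mw", "logp", "affinity"])
    for rank, c in enumerate(ranked, start=1):
        writer.writerow([rank, c["smiles"], c["mw"], c["logp"], c["affinity"]])
    return buf.getvalue()
```

Each candidate is a dict with `smiles`, `mw`, `logp`, and `affinity` keys, for example `{"smiles": "Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1", "mw": 325.4, "logp": 3.2, "affinity": -9.8}` for the first row of the example output.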
Complete Pipeline Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ GENOMICS TO DRUG DISCOVERY PIPELINE │
└─────────────────────────────────────────────────────────────────────────┘
Phase 1: Prerequisites Phase 2: Authentication
┌─────────────────────┐ ┌─────────────────────┐
│ Docker + GPU Check │ ──────► │ NGC Login + Pull │
└─────────────────────┘ └─────────────────────┘
│
▼
Phase 3: Processing
┌─────────────────────────────────┐
│ FASTQ → BAM → VCF │
│ (GPU-accelerated Parabricks) │
└─────────────────────────────────┘
│
▼
Phase 4: RAG Chat
┌─────────────────────────────────┐
│ Milvus + Evidence + Streamlit │
│ Query: "EGFR variants?" │
└─────────────────────────────────┘
│
▼
Phase 5: Target Selection
┌─────────────────────────────────┐
│ Hypothesis + Cryo-EM (7SYE) │
│ Binding site identification │
└─────────────────────────────────┘
│
▼
Phase 6: Molecule Generation
┌─────────────────────────────────┐
│ BioNeMo MolMIM │
│ SMILES → Wet Lab │
└─────────────────────────────────┘
═══════════════════════════════════════════════════════════════════════════
Input: Raw FASTQ reads (~200GB)
Output: Ranked drug candidate molecules (SMILES/SDF)
═══════════════════════════════════════════════════════════════════════════
Technology Stack
Phase   Primary Technologies
─────   ──────────────────────────────────────────────────────
1-3     Docker, NVIDIA Parabricks, CUDA, BWA-MEM2, DeepVariant
4       Milvus, sentence-transformers, Streamlit, vLLM/OpenAI
5       PDB/RCSB API, PyMOL/ChimeraX, BioPython
6       NVIDIA BioNeMo, MolMIM, RDKit
Hardware Requirements
Phase   GPU                                   RAM     Storage
─────   ───────────────────────────────────   ─────   ───────────
1-3     NVIDIA GPU (16GB+ VRAM)               64GB+   500GB+ NVMe
4       Optional (CPU embeddings OK)          32GB+   50GB
5       Optional                              16GB+   10GB
6       NVIDIA GPU (24GB+ VRAM recommended)   64GB+   100GB
Status Checklist