Genomics to Drug Discovery Pipeline - Complete Phases Overview
Pipeline Architecture
Phases 1-3: Genomics Pipeline (Data Acquisition & Processing)
↓
Phase 4: RAG Chat (Evidence Search & Query)
↓
Phase 5: Target Selection (Hypothesis & Structure)
↓
Phase 6: Molecule Generation (Drug Candidates)
Phase 1: Environment Setup & Prerequisites
Objective: Prepare the compute environment for GPU-accelerated genomics processing.
┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Prerequisites │
├─────────────────────────────────────────────────────────────┤
│ Script: 00-setup-check.sh │
│ │
│ ├── Docker daemon verification │
│ ├── NVIDIA Container Runtime check │
│ ├── GPU detection and CUDA validation │
│ └── Disk space verification (500GB+) │
└─────────────────────────────────────────────────────────────┘
Deliverables
| Component | Output |
|---|---|
| Docker | Container runtime operational |
| NVIDIA Runtime | GPU passthrough enabled |
| GPU | CUDA-capable device detected |
| Storage | 500GB+ available space confirmed |
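The checks listed above can be approximated from Python using only the standard library. This is a minimal sketch, not the contents of 00-setup-check.sh: the function names are hypothetical, and finding the `docker` and `nvidia-smi` executables on PATH only suggests (it does not prove) that the daemon and the NVIDIA runtime are working. The 500 GB threshold comes from the table above.

```python
import shutil

MIN_FREE_GB = 500  # storage threshold from the deliverables table


def free_space_gb(path: str = ".") -> float:
    """Return free disk space at `path` in gigabytes (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def check_prerequisites(path: str = ".", min_free_gb: float = MIN_FREE_GB) -> dict:
    """Report which prerequisites appear to be present on this host."""
    return {
        # Presence of the CLI tools is a proxy; it does not test the daemon.
        "docker_cli": shutil.which("docker") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "disk_ok": free_space_gb(path) >= min_free_gb,
    }
```

Calling `check_prerequisites("/data")` (the path is an example) returns a dictionary of booleans that a wrapper script could print or use to abort early.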
Commands
Phase 2: Authentication & Container Setup
Objective: Authenticate with NVIDIA NGC and pull the Parabricks container.
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Authentication │
├─────────────────────────────────────────────────────────────┤
│ Script: 01-ngc-login.sh │
│ │
│ ├── NGC API key authentication │
│ ├── Docker login to nvcr.io │
│ └── Pull clara-parabricks:4.6.0-1 (~15GB) │
└─────────────────────────────────────────────────────────────┘
Deliverables
| Component | Output |
|---|---|
| NGC Auth | Valid API token configured |
| Docker Registry | Authenticated to nvcr.io |
| Parabricks Image | clara-parabricks:4.6.0-1 pulled locally |
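The login and pull steps can be sketched as two Docker invocations driven from Python. This is an illustration under stated assumptions, not the contents of 01-ngc-login.sh: the full image path `nvcr.io/nvidia/clara/clara-parabricks` is assumed (the source names only the image and tag), the function names are hypothetical, and the API key is passed on stdin so it does not appear in the process list. NGC registry logins use the literal username `$oauthtoken`.

```python
import subprocess

REGISTRY = "nvcr.io"
# Assumed full registry path; the source gives only "clara-parabricks:4.6.0-1".
IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"


def login_command(registry: str = REGISTRY) -> list[str]:
    """Build a docker login command that reads the API key from stdin."""
    return ["docker", "login", registry,
            "--username", "$oauthtoken", "--password-stdin"]


def pull_command(image: str = IMAGE) -> list[str]:
    """Build the docker pull command for the Parabricks image."""
    return ["docker", "pull", image]


def login_and_pull(api_key: str) -> None:
    """Authenticate to the registry, then pull the image (about 15 GB)."""
    subprocess.run(login_command(), input=api_key.encode(), check=True)
    subprocess.run(pull_command(), check=True)
```

A caller would typically read the key from an environment variable of its own choosing and pass it to `login_and_pull`.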
Commands
Phase 3: Data Acquisition & Genome Processing
Objective: Download reference data, acquire sample data, and run GPU-accelerated variant calling.
┌─────────────────────────────────────────────────────────────┐
│ Phase 3: Data & Processing │
├─────────────────────────────────────────────────────────────┤
│ Scripts: 02-download-data.sh │
│ 03-setup-reference.sh │
│ 04-run-chr20-test.sh │
│ 05-run-full-genome.sh │
│ │
│ ├── Download GIAB HG002 FASTQ (~200GB) │
│ ├── Setup GRCh38 reference genome │
│ ├── GPU-accelerated alignment (BWA-MEM) │
│ ├── Coordinate sorting & duplicate marking │
│ ├── DeepVariant CNN variant calling │
│ └── VCF generation with quality scores │
└─────────────────────────────────────────────────────────────┘
Processing Pipeline
FASTQ (Raw Reads)
↓
┌───────────────────────────────────┐
│ fq2bam (GPU) │
│ ├── BWA-MEM alignment │
│ ├── Coordinate sorting │
│ └── Duplicate marking │
└───────────────────────────────────┘
↓
BAM (Aligned Reads)
↓
┌───────────────────────────────────┐
│ DeepVariant (GPU) │
│ ├── Pileup image generation │
│ ├── CNN inference │
│ └── Genotype classification │
└───────────────────────────────────┘
↓
VCF (Variant Calls)
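The two stages above map onto the Parabricks `pbrun fq2bam` and `pbrun deepvariant` tools. The sketch below only builds the command lines; it is not taken from the pipeline's own scripts. The registry path, the `/workdir` mount point, and the helper names are assumptions, while the file names follow the deliverables listed for this phase.

```python
# Assumed registry path; the source names only "clara-parabricks:4.6.0-1".
DEFAULT_IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"


def docker_prefix(workdir: str, image: str = DEFAULT_IMAGE) -> list[str]:
    """Run a container with all GPUs and the data directory mounted at /workdir."""
    return ["docker", "run", "--rm", "--gpus", "all",
            "-v", f"{workdir}:/workdir", "-w", "/workdir", image]


def fq2bam_command(workdir: str, ref: str, fq1: str, fq2: str, out_bam: str) -> list[str]:
    """FASTQ pair -> aligned, sorted, duplicate-marked BAM."""
    return docker_prefix(workdir) + [
        "pbrun", "fq2bam", "--ref", ref,
        "--in-fq", fq1, fq2, "--out-bam", out_bam]


def deepvariant_command(workdir: str, ref: str, in_bam: str, out_vcf: str) -> list[str]:
    """BAM -> VCF using GPU-accelerated DeepVariant."""
    return docker_prefix(workdir) + [
        "pbrun", "deepvariant", "--ref", ref,
        "--in-bam", in_bam, "--out-variants", out_vcf]


# Example: paths are relative to the mounted data directory.
align = fq2bam_command("/data", "GRCh38.fa", "HG002_R1.fastq.gz",
                       "HG002_R2.fastq.gz", "HG002.genome.bam")
call = deepvariant_command("/data", "GRCh38.fa", "HG002.genome.bam",
                           "HG002.genome.vcf.gz")
```

Each list can be passed to `subprocess.run(..., check=True)`; keeping the command builders separate from execution makes them easy to log and test.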
Deliverables
| Component | Output |
|---|---|
| FASTQ Data | HG002_R1.fastq.gz, HG002_R2.fastq.gz (~200GB) |
| Reference | GRCh38.fa with BWA index files |
| BAM | HG002.genome.bam (~100GB aligned reads) |
| VCF | HG002.genome.vcf.gz (~4M variants) |
Commands
./run.sh download # Download GIAB HG002 data
./run.sh reference # Setup reference genome
./run.sh test # Run chr20 test (~20 min)
./run.sh full # Full genome processing (~2-3 hrs)
Phase 4: RAG Chat System
Objective: Build a retrieval-augmented generation system for querying genomic evidence.
┌─────────────────────────────────────────────────────────────┐
│ Phase 4: RAG Chat │
├─────────────────────────────────────────────────────────────┤
│ Components: │
│ │
│ ├── Milvus vector database (Docker) │
│ ├── VCF → Evidence object extraction │
│ ├── Sentence-transformer embeddings │
│ ├── Streamlit chat interface │
│ └── LLM backend (vLLM or API) │
└─────────────────────────────────────────────────────────────┘
Architecture
User Query: "What EGFR variants are in HG002?"
↓
┌───────────────────────────────────┐
│ Streamlit Chat UI │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Embedding (bge-small-en) │
│ Query → 384-dim vector │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Milvus Vector Search │
│ Top-K similar evidence │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ LLM Response │
│ Context + Query → Answer │
└───────────────────────────────────┘
↓
Evidence-backed response to user
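The retrieval step in the flow above can be illustrated without any external services. The sketch below is a stand-in, not the pipeline's implementation: an in-memory list replaces Milvus, tiny hand-written vectors replace the 384-dimensional bge-small-en embeddings, and the function names and prompt wording are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, evidence, k=3):
    """evidence: list of (summary_text, vector). Return the k most similar texts."""
    ranked = sorted(evidence, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, contexts):
    """Combine retrieved evidence and the user query into one LLM prompt."""
    bullets = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this evidence:\n{bullets}\n\nQuestion: {query}"


# Toy 3-dimensional vectors standing in for real embeddings.
store = [
    ("Variant in EGFR", [1.0, 0.0, 0.0]),
    ("Variant in BRCA1", [0.0, 1.0, 0.0]),
    ("Variant near EGFR", [0.9, 0.1, 0.0]),
]
hits = top_k([1.0, 0.0, 0.0], store, k=2)
prompt = build_prompt("What EGFR variants are in HG002?", hits)
```

In a deployed system the sort would be replaced by a vector database search, which avoids scanning every stored embedding.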
Evidence Object Structure
{
"variant_key": "chr7:55249071:G:A",
"chrom": "chr7",
"pos": 55249071,
"ref": "G",
"alt": "A",
"qual": 45.2,
"gene": "EGFR",
"tags": ["giab", "vcf", "variant"],
"source": "GIAB WGS VCF (GRCh38)",
"summary_text": "Variant at chr7:55249071 G>A in EGFR gene..."
}
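One way to produce an object of this shape from a VCF data line is sketched below. It is an assumption-laden illustration rather than the code in ingest_evidence.py: it handles a single ALT allele, reads only the first six VCF columns, and looks up the gene in a caller-supplied interval table whose coordinates here are placeholders, not a real annotation.

```python
def gene_for(chrom, pos, gene_intervals):
    """Return the first gene whose (chrom, start, end) interval contains pos."""
    for gene, (g_chrom, start, end) in gene_intervals.items():
        if chrom == g_chrom and start <= pos <= end:
            return gene
    return None


def vcf_line_to_evidence(line, gene_intervals, source="GIAB WGS VCF (GRCh38)"):
    """Convert one tab-separated VCF data line into an evidence dictionary."""
    chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    pos = int(pos)
    gene = gene_for(chrom, pos, gene_intervals)
    where = f" in {gene} gene" if gene else ""
    return {
        "variant_key": f"{chrom}:{pos}:{ref}:{alt}",
        "chrom": chrom,
        "pos": pos,
        "ref": ref,
        "alt": alt,
        "qual": float(qual) if qual != "." else None,  # "." means missing in VCF
        "gene": gene,
        "tags": ["giab", "vcf", "variant"],
        "source": source,
        "summary_text": f"Variant at {chrom}:{pos} {ref}>{alt}{where}",
    }


# Placeholder interval for illustration only; use a real annotation in practice.
intervals = {"EGFR": ("chr7", 55_000_000, 55_300_000)}
record = vcf_line_to_evidence("chr7\t55249071\t.\tG\tA\t45.2\tPASS\t.\tGT\t0/1", intervals)
```

The `summary_text` field is what would be embedded, so keeping it short and self-describing helps retrieval.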
Deliverables
| Component | Output |
|---|---|
| Milvus | Running vector database (port 19530) |
| Evidence Collection | Indexed VCF variants with embeddings |
| Chat Interface | Streamlit app (port 8501) |
| LLM Integration | Query-response pipeline |
Project Structure
~/rag-chat-pipeline/
├── docker-compose.yml # Milvus standalone
├── ingest_evidence.py # VCF → embeddings → Milvus
├── chat_app.py # Streamlit interface
├── llm_backend.py # vLLM or API wrapper
└── requirements.txt
Phase 5: Target Selection
Objective: Form a drug target hypothesis from RAG evidence and integrate structural biology data.
┌─────────────────────────────────────────────────────────────┐
│ Phase 5: Target Selection │
├─────────────────────────────────────────────────────────────┤
│ Components: │
│ │
│ ├── Hypothesis formation from RAG queries │
│ ├── Cryo-EM structure fetch (PDB:7SYE) │
│ └── Binding site identification │
└─────────────────────────────────────────────────────────────┘
Workflow
RAG Chat Query
"What variants affect EGFR in this sample?"
↓
┌───────────────────────────────────┐
│ Evidence Retrieval │
│ Returns: EGFR variants list │
│ chr7:55249071, chr7:55259515... │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Hypothesis Formation │
│ "EGFR is a viable drug target │
│ based on variant evidence" │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Cryo-EM Structure Fetch │
│ PDB:7SYE (EGFR kinase domain) │
│ Resolution: 2.8Å │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Binding Site Identification │
│ ATP binding pocket coordinates │
│ Key residues: K745, E762, M793 │
└───────────────────────────────────┘
↓
Target profile ready for molecule generation
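The structure fetch in this workflow can be sketched with the standard library alone. The example assumes the RCSB file download endpoint at files.rcsb.org and the mmCIF format; the function names are hypothetical, and the pipeline may well use BioPython or another client instead.

```python
import urllib.request


def rcsb_url(pdb_id: str, fmt: str = "cif") -> str:
    """Build the RCSB download URL for a structure entry."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.{fmt}"


def fetch_structure(pdb_id: str, dest: str, fmt: str = "cif") -> str:
    """Download the structure file to `dest` and return the path."""
    urllib.request.urlretrieve(rcsb_url(pdb_id, fmt), dest)
    return dest
```

For the entry named above, `fetch_structure("7SYE", "7SYE.cif")` would save the file to the current directory, given network access.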
Deliverables
| Component | Output |
|---|---|
| RAG Query | "EGFR variants in HG002" → evidence list |
| PDB Fetch | 7SYE structure file (~3MB) |
| Binding Site | Coordinates for molecule docking constraints |
Key Data
Target: EGFR (Epidermal Growth Factor Receptor)
PDB ID: 7SYE
Structure: Cryo-EM, 2.8Å resolution
Binding Pocket: ATP-competitive site
Key Residues:
- K745 (catalytic lysine)
- E762 (salt bridge)
- M793 (hinge region)
- T790 (gatekeeper; resistance mutation site)
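A simple way to turn the residue list into a docking constraint is to take the centroid of their C-alpha atoms as the pocket center. The sketch below assumes the coordinates have already been parsed from the structure file into `(residue_number, x, y, z)` tuples; the function name and the toy coordinates are hypothetical and do not come from 7SYE.

```python
KEY_RESIDUES = {745, 762, 790, 793}  # residue numbers from the list above


def pocket_centroid(ca_atoms, residues=KEY_RESIDUES):
    """Average the C-alpha coordinates of the selected residues.

    ca_atoms: iterable of (residue_number, x, y, z).
    """
    picked = [(x, y, z) for num, x, y, z in ca_atoms if num in residues]
    if not picked:
        raise ValueError("none of the key residues were found")
    n = len(picked)
    return tuple(sum(axis) / n for axis in zip(*picked))


# Toy coordinates for illustration only.
atoms = [
    (745, 0.0, 0.0, 0.0),
    (762, 2.0, 0.0, 0.0),
    (790, 0.0, 2.0, 0.0),
    (793, 2.0, 2.0, 4.0),
    (100, 50.0, 50.0, 50.0),  # ignored: not a key residue
]
center = pocket_centroid(atoms)
```

A docking tool would usually take this center plus a box size; choosing the box size is outside the scope of this sketch.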
Phase 6: Molecule Generation
Objective: Generate drug candidate molecules using the MolMIM model in NVIDIA BioNeMo.
┌─────────────────────────────────────────────────────────────┐
│ Phase 6: Molecule Generation │
├─────────────────────────────────────────────────────────────┤
│ Components: │
│ │
│ ├── BioNeMo MolMIM model │
│ ├── Constraint-based molecule filtering │
│ └── SMILES/SDF export for wet lab │
└─────────────────────────────────────────────────────────────┘
Architecture
Target Profile (from Phase 5)
↓
┌───────────────────────────────────┐
│ MolMIM │
│ Transformer-based generation │
│ SMILES string output │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Constraint Filtering │
│ ├── Molecular weight (< 500 Da) │
│ ├── LogP (drug-likeness) │
│ ├── Binding affinity score │
│ └── Toxicity prediction │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Ranked Candidates │
│ Top molecules by combined score │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ Export │
│ ├── SMILES strings │
│ ├── SDF 3D structures │
│ └── CSV with properties │
└───────────────────────────────────┘
↓
Ready for wet lab synthesis/testing
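The filtering and ranking stages can be sketched over precomputed properties. This is an illustration under assumptions: the molecular weight cutoff of 500 Da is from the diagram, the LogP cutoff of 5 is an assumed Lipinski-style threshold (the source gives none), a more negative affinity score is assumed to be better, and toxicity prediction is left out. In practice the properties would come from a toolkit such as RDKit and a docking step.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    smiles: str
    mw: float        # molecular weight in Da
    logp: float      # octanol-water partition coefficient
    affinity: float  # docking-style score; lower is assumed better


def filter_and_rank(candidates, max_mw=500.0, max_logp=5.0):
    """Drop candidates outside the property limits, then sort best-first."""
    kept = [c for c in candidates if c.mw < max_mw and c.logp <= max_logp]
    return sorted(kept, key=lambda c: c.affinity)


pool = [
    Candidate("COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC", 347.8, 3.5, -9.4),
    Candidate("Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1", 325.4, 3.2, -9.8),
    Candidate("CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC", 535.0, 19.0, -10.5),  # hypothetical reject
]
ranked = filter_and_rank(pool)
```

The third entry has the best raw score but fails both property limits, which is the case the filter exists to catch.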
Deliverables
| Component | Output |
|---|---|
| MolMIM | Candidate molecules (SMILES strings) |
| Filtering | Molecules ranked by binding affinity |
| Export | CSV/SDF for wet lab or further simulation |
Example Output
Rank SMILES MW LogP Affinity
──── ────────────────────────────────────────── ────── ───── ────────
1 Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1 325.4 3.2 -9.8
2 COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC 347.8 3.5 -9.4
3 Cn1cnc2cc3c(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4) 421.9 4.1 -9.1
...
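A table like the one above can be written to CSV with the standard library. The sketch below is a minimal example with hypothetical column names, using the first row of the example output as input.

```python
import csv
import io


def to_csv(rows, columns=("rank", "smiles", "mw", "logp", "affinity")):
    """Serialize a list of row dictionaries to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    {"rank": 1, "smiles": "Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1",
     "mw": 325.4, "logp": 3.2, "affinity": -9.8},
]
csv_text = to_csv(rows)
```

Writing `csv_text` to a file gives a table that spreadsheet tools and downstream scripts can read directly.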
Complete Pipeline Flow
┌─────────────────────────────────────────────────────────────────────────┐
│ GENOMICS TO DRUG DISCOVERY PIPELINE │
└─────────────────────────────────────────────────────────────────────────┘
Phase 1: Prerequisites Phase 2: Authentication
┌─────────────────────┐ ┌─────────────────────┐
│ Docker + GPU Check │ ──────► │ NGC Login + Pull │
└─────────────────────┘ └─────────────────────┘
│
▼
Phase 3: Processing
┌─────────────────────────────────┐
│ FASTQ → BAM → VCF │
│ (GPU-accelerated Parabricks) │
└─────────────────────────────────┘
│
▼
Phase 4: RAG Chat
┌─────────────────────────────────┐
│ Milvus + Evidence + Streamlit │
│ Query: "EGFR variants?" │
└─────────────────────────────────┘
│
▼
Phase 5: Target Selection
┌─────────────────────────────────┐
│ Hypothesis + Cryo-EM (7SYE) │
│ Binding site identification │
└─────────────────────────────────┘
│
▼
Phase 6: Molecule Generation
┌─────────────────────────────────┐
│ BioNeMo MolMIM │
│ SMILES → Wet Lab │
└─────────────────────────────────┘
═══════════════════════════════════════════════════════════════════════════
Input: Raw FASTQ reads (~200GB)
Output: Ranked drug candidate molecules (SMILES/SDF)
═══════════════════════════════════════════════════════════════════════════
Technology Stack
| Phase | Primary Technologies |
|---|---|
| 1-3 | Docker, NVIDIA Parabricks, CUDA, BWA-MEM, DeepVariant |
| 4 | Milvus, sentence-transformers, Streamlit, vLLM/OpenAI |
| 5 | PDB/RCSB API, PyMOL/ChimeraX, BioPython |
| 6 | NVIDIA BioNeMo, MolMIM, RDKit |
Hardware Requirements
| Phase | GPU | RAM | Storage |
|---|---|---|---|
| 1-3 | NVIDIA GPU (16GB+ VRAM) | 64GB+ | 500GB+ NVMe |
| 4 | Optional (CPU embeddings OK) | 32GB+ | 50GB |
| 5 | Optional | 16GB+ | 10GB |
| 6 | NVIDIA GPU (24GB+ VRAM recommended) | 64GB+ | 100GB |
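The GPU rows of this table can be checked against a machine by querying `nvidia-smi` for total memory. The sketch below is an assumption-based helper, not part of the pipeline: the function names are hypothetical, the thresholds are copied from the table, and reported memory is rounded to the nearest GB because cards marketed as 24 GB often report slightly less.

```python
import subprocess

# Minimum VRAM in GB per phase group, from the table (24 GB is "recommended").
MIN_VRAM_GB = {"1-3": 16, "6": 24}


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mib(out)


def phases_supported(vram_mib: list[int]) -> list[str]:
    """Return the phase groups whose VRAM threshold the largest GPU meets."""
    best_gb = round(max(vram_mib, default=0) / 1024)
    return [phase for phase, need in MIN_VRAM_GB.items() if best_gb >= need]
```

On a machine with the NVIDIA driver installed, `phases_supported(query_vram_mib())` lists which GPU-bound phase groups the hardware should handle.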
Status Checklist
- Phase 1: Prerequisites - Complete
- Phase 2: Authentication - Complete
- Phase 3: Data & Processing - Complete
- Phase 4: RAG Chat - Pending
- Phase 5: Target Selection - Pending
- Phase 6: Molecule Generation - Pending
Clinical Decision Support Disclaimer
This pipeline is a research tool. It is not FDA-cleared and is not intended as a standalone diagnostic device. All recommendations should be reviewed by qualified healthcare professionals. Apache 2.0 License.