Genomics to Drug Discovery Pipeline - Complete Phases Overview

Pipeline Architecture

Phase 1-3: Genomics Pipeline (Data Acquisition & Processing)
Phase 4:   RAG Chat (Evidence Search & Query)
Phase 5:   Target Selection (Hypothesis & Structure)
Phase 6:   Molecule Generation (Drug Candidates)

Phase 1: Environment Setup & Prerequisites

Objective: Prepare the compute environment for GPU-accelerated genomics processing.

┌─────────────────────────────────────────────────────────────┐
│                    Phase 1: Prerequisites                    │
├─────────────────────────────────────────────────────────────┤
│  Script: 00-setup-check.sh                                  │
│                                                             │
│  ├── Docker daemon verification                             │
│  ├── NVIDIA Container Runtime check                         │
│  ├── GPU detection and CUDA validation                      │
│  └── Disk space verification (500GB+)                       │
└─────────────────────────────────────────────────────────────┘

Deliverables

Component       Output
──────────────  ────────────────────────────────
Docker          Container runtime operational
NVIDIA Runtime  GPU passthrough enabled
GPU             CUDA-capable device detected
Storage         500GB+ available space confirmed

Commands

./run.sh check        # Run prerequisites check
./run.sh status       # View system status
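
The checks listed for 00-setup-check.sh can be approximated in a few lines of Python. This is a minimal sketch, not the script itself: the function names are hypothetical, and it only tests that the `docker` and `nvidia-smi` executables are on PATH and that enough disk space is free. It does not confirm that the Docker daemon is running or that CUDA works inside a container.

```python
import shutil

MIN_FREE_GB = 500  # threshold taken from the Phase 1 checklist


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def free_space_gb(path: str = ".") -> float:
    """Free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def run_checks(path: str = ".", min_free_gb: float = MIN_FREE_GB) -> dict:
    """Collect a pass/fail flag for each prerequisite (hypothetical helper)."""
    return {
        "docker": tool_available("docker"),
        "nvidia-smi": tool_available("nvidia-smi"),
        "disk": free_space_gb(path) >= min_free_gb,
    }
```

Calling `run_checks("/data")` returns a dict of booleans that a wrapper could print or turn into an exit code; the `/data` path is only an example.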

Phase 2: Authentication & Container Setup

Objective: Authenticate with NVIDIA NGC and pull the Parabricks container.

┌─────────────────────────────────────────────────────────────┐
│                   Phase 2: Authentication                    │
├─────────────────────────────────────────────────────────────┤
│  Script: 01-ngc-login.sh                                    │
│                                                             │
│  ├── NGC API key authentication                             │
│  ├── Docker login to nvcr.io                                │
│  └── Pull clara-parabricks:4.6.0-1 (~15GB)                  │
└─────────────────────────────────────────────────────────────┘

Deliverables

Component         Output
────────────────  ───────────────────────────────────────
NGC Auth          Valid API token configured
Docker Registry   Authenticated to nvcr.io
Parabricks Image  clara-parabricks:4.6.0-1 pulled locally

Commands

./run.sh login        # NGC authentication + container pull
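
The login and pull steps can be sketched as two Docker invocations driven from Python. The assumptions here are mine, not taken from 01-ngc-login.sh: the full image path `nvcr.io/nvidia/clara/clara-parabricks` (the source only names the tag), the literal `$oauthtoken` username that NGC uses for API-key logins, and the helper names.

```python
import subprocess

REGISTRY = "nvcr.io"
# Assumed full repository path; the source only gives "clara-parabricks:4.6.0-1".
IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"
NGC_USER = "$oauthtoken"  # literal username string used with an NGC API key


def login_command() -> list:
    """docker login that reads the API key from stdin instead of argv."""
    return ["docker", "login", REGISTRY,
            "--username", NGC_USER, "--password-stdin"]


def pull_command(image: str = IMAGE) -> list:
    """docker pull for the Parabricks image."""
    return ["docker", "pull", image]


def login_and_pull(api_key: str) -> None:
    """Run both steps; raises CalledProcessError if either one fails."""
    subprocess.run(login_command(), input=api_key.encode(), check=True)
    subprocess.run(pull_command(), check=True)
```

Passing the key on stdin keeps it out of the process list and shell history, which is why `--password-stdin` is used rather than `--password`.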

Phase 3: Data Acquisition & Genome Processing

Objective: Download reference data, acquire sample data, and run GPU-accelerated variant calling.

┌─────────────────────────────────────────────────────────────┐
│                 Phase 3: Data & Processing                  │
├─────────────────────────────────────────────────────────────┤
│  Scripts: 02-download-data.sh                               │
│           03-setup-reference.sh                             │
│           04-run-chr20-test.sh                              │
│           05-run-full-genome.sh                             │
│                                                             │
│  ├── Download GIAB HG002 FASTQ (~200GB)                     │
│  ├── Setup GRCh38 reference genome                          │
│  ├── GPU-accelerated alignment (BWA-MEM2)                   │
│  ├── Coordinate sorting & duplicate marking                 │
│  ├── DeepVariant CNN variant calling                        │
│  └── VCF generation with quality scores                     │
└─────────────────────────────────────────────────────────────┘

Processing Pipeline

FASTQ (Raw Reads)
    ▼
┌───────────────────────────────────┐
│           fq2bam (GPU)            │
│  ├── BWA-MEM2 alignment           │
│  ├── Coordinate sorting           │
│  └── Duplicate marking            │
└───────────────────────────────────┘
    ▼
BAM (Aligned Reads)
    ▼
┌───────────────────────────────────┐
│         DeepVariant (GPU)         │
│  ├── Pileup image generation      │
│  ├── CNN inference                │
│  └── Genotype classification      │
└───────────────────────────────────┘
    ▼
VCF (Variant Calls)
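
The two GPU stages in the diagram map onto two `pbrun` invocations inside the Parabricks container. The sketch below only builds the command lines; it is an illustration under assumptions, not the contents of 04-run-chr20-test.sh or 05-run-full-genome.sh. The image path, the `/work` mount point, and the helper names are hypothetical, and compressing the VCF to `.vcf.gz` is left out.

```python
from pathlib import Path

# Assumed full repository path for the image named in Phase 2.
IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"


def docker_prefix(workdir: Path) -> list:
    """docker run prefix exposing all GPUs and mounting workdir at /work."""
    return ["docker", "run", "--rm", "--gpus", "all",
            "-v", f"{workdir}:/work", "-w", "/work", IMAGE]


def fq2bam_cmd(workdir: Path, ref: str, r1: str, r2: str, out_bam: str) -> list:
    """FASTQ pair -> sorted, duplicate-marked BAM."""
    return docker_prefix(workdir) + [
        "pbrun", "fq2bam",
        "--ref", ref,
        "--in-fq", r1, r2,
        "--out-bam", out_bam,
    ]


def deepvariant_cmd(workdir: Path, ref: str, in_bam: str, out_vcf: str) -> list:
    """BAM -> VCF with DeepVariant."""
    return docker_prefix(workdir) + [
        "pbrun", "deepvariant",
        "--ref", ref,
        "--in-bam", in_bam,
        "--out-variants", out_vcf,
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)`; running the second command only after the first succeeds preserves the FASTQ to BAM to VCF ordering.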

Deliverables

Component   Output
──────────  ─────────────────────────────────────────────────
FASTQ Data  HG002_R1.fastq.gz, HG002_R2.fastq.gz (~200GB)
Reference   GRCh38.fa with BWA index files
BAM         HG002.genome.bam (~100GB aligned reads)
VCF         HG002.genome.vcf.gz (~4M variants)

Commands

./run.sh download     # Download GIAB HG002 data
./run.sh reference    # Setup reference genome
./run.sh test         # Run chr20 test (~20 min)
./run.sh full         # Full genome processing (~2-3 hrs)

Phase 4: RAG Chat System

Objective: Build a retrieval-augmented generation system for querying genomic evidence.

┌─────────────────────────────────────────────────────────────┐
│                    Phase 4: RAG Chat                         │
├─────────────────────────────────────────────────────────────┤
│  Components:                                                │
│                                                             │
│  ├── Milvus vector database (Docker)                        │
│  ├── VCF → Evidence object extraction                       │
│  ├── Sentence-transformer embeddings                        │
│  ├── Streamlit chat interface                               │
│  └── LLM backend (vLLM or API)                              │
└─────────────────────────────────────────────────────────────┘

Architecture

User Query: "What EGFR variants are in HG002?"
    ▼
┌───────────────────────────────────┐
│         Streamlit Chat UI         │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│     Embedding (bge-small-en)      │
│      Query → 384-dim vector       │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│       Milvus Vector Search        │
│      Top-K similar evidence       │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│           LLM Response            │
│     Context + Query → Answer      │
└───────────────────────────────────┘
    ▼
Evidence-backed response to user
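
The retrieval and prompt-assembly steps in this flow can be illustrated without any external service. The sketch below is a stand-in, not the pipeline's implementation: it replaces Milvus with a brute-force cosine-similarity ranking over an in-memory list, uses toy 3-dimensional vectors where the real system would use 384-dimensional embeddings, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec, records, k=3):
    """records is a list of (summary_text, vector); return the k best texts."""
    scored = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [text for text, _ in scored[:k]]


def build_prompt(question, evidence):
    """Place retrieved evidence ahead of the question for the LLM."""
    context = "\n".join(f"- {e}" for e in evidence)
    return (
        "Answer using only the evidence below.\n\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}"
    )
```

A vector database performs the same ranking with an approximate index so that it scales to millions of variants; the prompt template wording here is only an example.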

Evidence Object Structure

{
    "variant_key": "chr7:55249071:G:A",
    "chrom": "chr7",
    "pos": 55249071,
    "ref": "G",
    "alt": "A",
    "qual": 45.2,
    "gene": "EGFR",
    "tags": ["giab", "vcf", "variant"],
    "source": "GIAB WGS VCF (GRCh38)",
    "summary_text": "Variant at chr7:55249071 G>A in EGFR gene..."
}
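
One way to produce an object of this shape from a single VCF data line is sketched below. It is an assumption-laden illustration rather than the logic of ingest_evidence.py: the `GENE_REGIONS` table is a hypothetical placeholder whose interval is chosen only to cover the example position and is not an authoritative gene coordinate, multi-allelic ALT fields are not split, and the summary text is a simplified template.

```python
# Hypothetical lookup table; the interval is a placeholder, not real annotation.
GENE_REGIONS = {
    "EGFR": ("chr7", 55_000_000, 55_300_000),
}


def gene_for(chrom, pos):
    """Return the first gene whose placeholder interval contains the position."""
    for gene, (c, start, end) in GENE_REGIONS.items():
        if c == chrom and start <= pos <= end:
            return gene
    return None


def vcf_line_to_evidence(line):
    """Convert one tab-separated VCF data line into an evidence dict."""
    chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    pos = int(pos)
    gene = gene_for(chrom, pos)
    where = f" in {gene} gene" if gene else ""
    return {
        "variant_key": f"{chrom}:{pos}:{ref}:{alt}",
        "chrom": chrom,
        "pos": pos,
        "ref": ref,
        "alt": alt,
        "qual": float(qual) if qual != "." else None,  # "." means missing QUAL
        "gene": gene,
        "tags": ["giab", "vcf", "variant"],
        "source": "GIAB WGS VCF (GRCh38)",
        "summary_text": f"Variant at {chrom}:{pos} {ref}>{alt}{where}",
    }
```

The `summary_text` field is the part that would be embedded, so anything a user might search for (gene name, position, alleles) should appear in it.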

Deliverables

Component            Output
───────────────────  ────────────────────────────────────
Milvus               Running vector database (port 19530)
Evidence Collection  Indexed VCF variants with embeddings
Chat Interface       Streamlit app (port 8501)
LLM Integration      Query-response pipeline

Project Structure

~/rag-chat-pipeline/
├── docker-compose.yml          # Milvus standalone
├── ingest_evidence.py          # VCF → embeddings → Milvus
├── chat_app.py                 # Streamlit interface
├── llm_backend.py              # vLLM or API wrapper
└── requirements.txt

Phase 5: Target Selection

Objective: Form a drug-target hypothesis from RAG evidence and integrate structural biology data.

┌─────────────────────────────────────────────────────────────┐
│                  Phase 5: Target Selection                   │
├─────────────────────────────────────────────────────────────┤
│  Components:                                                │
│                                                             │
│  ├── Hypothesis formation from RAG queries                  │
│  ├── Cryo-EM structure fetch (PDB:7SYE)                     │
│  └── Binding site identification                            │
└─────────────────────────────────────────────────────────────┘

Workflow

RAG Chat Query
"What variants affect EGFR in this sample?"
    ▼
┌───────────────────────────────────┐
│        Evidence Retrieval         │
│  Returns: EGFR variants list      │
│  chr7:55249071, chr7:55259515...  │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│       Hypothesis Formation        │
│  "EGFR is a viable drug target    │
│   based on variant evidence"      │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│      Cryo-EM Structure Fetch      │
│  PDB:7SYE (EGFR kinase domain)    │
│  Resolution: 2.8Å                 │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│    Binding Site Identification    │
│  ATP binding pocket coordinates   │
│  Key residues: K745, E762, M793   │
└───────────────────────────────────┘
    ▼
Target profile ready for molecule generation

Deliverables

Component     Output
────────────  ────────────────────────────────────────────
RAG Query     "EGFR variants in HG002" → evidence list
PDB Fetch     7SYE structure file (~3MB)
Binding Site  Coordinates for molecule docking constraints

Key Data

Target: EGFR (Epidermal Growth Factor Receptor)
PDB ID: 7SYE
Structure: Cryo-EM, 2.8Å resolution
Binding Pocket: ATP-competitive site
Key Residues:
  - K745 (catalytic lysine)
  - E762 (salt bridge)
  - M793 (hinge region)
  - T790 (gatekeeper; resistance mutation site)
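
A simple way to turn the key residues into docking coordinates is to average their C-alpha positions. The sketch below assumes a PDB-format file with standard fixed-column ATOM records, chain ID "A", and residue numbering that matches the list above; none of these has been checked against the 7SYE entry, and whether these residues are resolved in that entry should be verified first. The helper names are hypothetical, and the URL follows the public RCSB download pattern.

```python
KEY_RESIDUES = {745, 762, 790, 793}  # residue numbers from the Key Residues list


def rcsb_url(pdb_id):
    """Download URL for a PDB-format entry on the RCSB file server."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def parse_ca_atoms(pdb_text, chain="A"):
    """Return {residue_number: (x, y, z)} for C-alpha atoms of one chain."""
    coords = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        # Fixed columns: atom name 13-16, chain 22, resSeq 23-26, x/y/z 31-54.
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        res_seq = int(line[22:26])
        coords[res_seq] = (
            float(line[30:38]),
            float(line[38:46]),
            float(line[46:54]),
        )
    return coords


def pocket_centroid(coords, residues=KEY_RESIDUES):
    """Mean C-alpha position of the requested residues that are present."""
    pts = [coords[r] for r in sorted(residues) if r in coords]
    if not pts:
        raise ValueError("none of the requested residues were found")
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))
```

The centroid gives a box center for docking; a box size still has to be chosen separately, for example from the spread of the same residues.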

Phase 6: Molecule Generation

Objective: Generate drug candidate molecules with MolMIM, a generative model from NVIDIA BioNeMo.

┌─────────────────────────────────────────────────────────────┐
│                Phase 6: Molecule Generation                 │
├─────────────────────────────────────────────────────────────┤
│  Components:                                                │
│                                                             │
│  ├── BioNeMo MolMIM model                                   │
│  ├── Constraint-based molecule filtering                    │
│  └── SMILES/SDF export for wet lab                          │
└─────────────────────────────────────────────────────────────┘

Architecture

Target Profile (from Phase 5)
    ▼
┌───────────────────────────────────┐
│              MolMIM               │
│  Transformer-based generation     │
│  SMILES string output             │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│       Constraint Filtering        │
│  ├── Molecular weight (< 500 Da)  │
│  ├── LogP (drug-likeness)         │
│  ├── Binding affinity score       │
│  └── Toxicity prediction          │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│         Ranked Candidates         │
│  Top molecules by combined score  │
└───────────────────────────────────┘
    ▼
┌───────────────────────────────────┐
│              Export               │
│  ├── SMILES strings               │
│  ├── SDF 3D structures            │
│  └── CSV with properties          │
└───────────────────────────────────┘
    ▼
Ready for wet lab synthesis/testing
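
The filter, rank, and export stages can be sketched with precomputed properties. This is a minimal illustration under assumptions: the properties are taken as already computed elsewhere (for example by RDKit and a docking tool), the LogP cutoff of 5 is my choice since the source only says "drug-likeness", ranking uses affinity alone rather than a combined score, and the class and function names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Candidate:
    smiles: str
    mw: float        # molecular weight in Da
    logp: float
    affinity: float  # docking score; more negative means stronger predicted binding
    toxic: bool = False


def passes_filters(c, max_mw=500.0, max_logp=5.0):
    """Keep molecules under the weight limit, within the assumed LogP cutoff, not flagged toxic."""
    return c.mw < max_mw and c.logp <= max_logp and not c.toxic


def rank(cands):
    """Filter, then sort so the most negative affinity comes first."""
    kept = [c for c in cands if passes_filters(c)]
    return sorted(kept, key=lambda c: c.affinity)


def to_csv(cands):
    """Serialize ranked candidates to CSV text with a 1-based rank column."""
    buf = io.StringIO()
    fields = ["rank", "smiles", "mw", "logp", "affinity"]
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    for i, c in enumerate(cands, 1):
        writer.writerow({"rank": i, "smiles": c.smiles, "mw": c.mw,
                         "logp": c.logp, "affinity": c.affinity})
    return buf.getvalue()
```

Writing the CSV from the ranked list keeps the rank column consistent with the sort order; SDF export needs 3D coordinates and is outside this sketch.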

Deliverables

Component  Output
─────────  ─────────────────────────────────────────
MolMIM     Candidate molecules (SMILES strings)
Filtering  Ranked molecules by binding affinity
Export     CSV/SDF for wet lab or further simulation

Example Output

Rank  SMILES                                      MW      LogP   Affinity
────  ──────────────────────────────────────────  ──────  ─────  ────────
1     Cc1cccc(Nc2ncnc3cc(OC)c(OC)cc23)c1          295.3   3.2    -9.8
2     COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC        333.7   3.5    -9.4
3     Cn1cnc2cc3c(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)   421.9   4.1    -9.1
...

Complete Pipeline Flow

┌─────────────────────────────────────────────────────────────────────────┐
│                   GENOMICS TO DRUG DISCOVERY PIPELINE                   │
└─────────────────────────────────────────────────────────────────────────┘

Phase 1: Prerequisites          Phase 2: Authentication
┌─────────────────────┐         ┌─────────────────────┐
│ Docker + GPU Check  │ ──────► │ NGC Login + Pull    │
└─────────────────────┘         └─────────────────────┘
                                          │
                                          ▼
                                Phase 3: Processing
                        ┌─────────────────────────────────┐
                        │  FASTQ → BAM → VCF              │
                        │  (GPU-accelerated Parabricks)   │
                        └─────────────────────────────────┘
                                          │
                                          ▼
                                  Phase 4: RAG Chat
                        ┌─────────────────────────────────┐
                        │  Milvus + Evidence + Streamlit  │
                        │  Query: "EGFR variants?"        │
                        └─────────────────────────────────┘
                                          │
                                          ▼
                              Phase 5: Target Selection
                        ┌─────────────────────────────────┐
                        │  Hypothesis + Cryo-EM (7SYE)    │
                        │  Binding site identification    │
                        └─────────────────────────────────┘
                                          │
                                          ▼
                            Phase 6: Molecule Generation
                        ┌─────────────────────────────────┐
                        │  BioNeMo MolMIM                 │
                        │  SMILES → Wet Lab               │
                        └─────────────────────────────────┘

═══════════════════════════════════════════════════════════════════════════
 Input: Raw FASTQ reads (~200GB)
 Output: Ranked drug candidate molecules (SMILES/SDF)
═══════════════════════════════════════════════════════════════════════════

Technology Stack

Phase  Primary Technologies
─────  ──────────────────────────────────────────────────────
1-3    Docker, NVIDIA Parabricks, CUDA, BWA-MEM2, DeepVariant
4      Milvus, sentence-transformers, Streamlit, vLLM/OpenAI
5      PDB/RCSB API, PyMOL/ChimeraX, BioPython
6      NVIDIA BioNeMo, MolMIM, RDKit

Hardware Requirements

Phase  GPU                                  RAM    Storage
─────  ───────────────────────────────────  ─────  ───────────
1-3    NVIDIA GPU (16GB+ VRAM)              64GB+  500GB+ NVMe
4      Optional (CPU embeddings OK)         32GB+  50GB
5      Optional                             16GB+  10GB
6      NVIDIA GPU (24GB+ VRAM recommended)  64GB+  100GB

Status Checklist

  • Phase 1: Prerequisites - Complete
  • Phase 2: Authentication - Complete
  • Phase 3: Data & Processing - Complete
  • Phase 4: RAG Chat - Pending
  • Phase 5: Target Selection - Pending
  • Phase 6: Molecule Generation - Pending