Drug Discovery Pipeline¶

Stage 3 of the Precision Medicine to Drug Discovery AI Factory

Structure-based drug design pipeline that transforms validated therapeutic targets into novel drug candidates using NVIDIA BioNeMo NIM microservices, Cryo-EM structural evidence, and AI-powered molecule generation.

┌──────────────────────────────────────────────────────────────────────────────────────┐
│                    PRECISION MEDICINE TO DRUG DISCOVERY AI FACTORY                   │
├──────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │  GENOMICS   │    │  RAG/CHAT   │    │   CRYO-EM   │    │ MOLECULE GENERATION │   │
│  │  PIPELINE   │───▶│  PIPELINE   │───▶│  EVIDENCE   │───▶│     (BioNeMo)       │   │
│  │             │    │             │    │             │    │    (This Repo)      │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────────────┘   │
│    FASTQ→VCF         VCF→Target        Target→Structure    Structure→Molecules      │
│    Parabricks        Milvus+Claude     PDB/EMDB            MolMIM+DiffDock          │
│                                                                                      │
└──────────────────────────────────────────────────────────────────────────────────────┘

Table of Contents¶

Overview
From Genomics to Drug Candidates
Key Features
Architecture
The VCP/FTD Demo Workflow
Cryo-EM Structure Evidence
Molecule Generation with BioNeMo
Drug-Likeness Scoring
PDF Report Generation
Quick Start
Installation
Usage
Configuration
Directory Structure
Services
Monitoring Stack
Performance Benchmarks
Troubleshooting
Related Pipelines
References

Overview¶

This pipeline is the final stage of the Precision Medicine to Drug Discovery AI Factory. It takes validated therapeutic targets from Stage 2 (RAG/Chat Pipeline) and transforms them into actionable drug candidates through:

Cryo-EM Structure Retrieval: Fetch high-resolution protein structures from RCSB PDB
Known Inhibitor Analysis: Analyze existing drugs as generation seeds
AI-Powered Molecule Generation: Generate novel candidates using NVIDIA BioNeMo MolMIM
Molecular Docking: Predict binding poses with DiffDock
Drug-Likeness Scoring: Rank candidates by Lipinski, QED, and ADMET properties
Professional PDF Reports: Generate executive-ready reports with visualizations

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STAGE 3: DRUG DISCOVERY PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Target Hypothesis (from RAG/Chat)                                         │
│   "VCP is a druggable target for Frontotemporal Dementia"                  │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │              PHASE 5: STRUCTURAL EVIDENCE                     │        │
│   │                                                               │        │
│   │   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ │        │
│   │   │  8OOI    │   │  9DIL    │   │  7K56    │   │  5FTK    │ │        │
│   │   │WT Hexamer│   │ Mutant   │   │ Complex  │   │+CB-5083  │ │        │
│   │   │ 2.9 Å    │   │ 3.2 Å    │   │ 2.5 Å    │   │ 2.3 Å    │ │        │
│   │   └──────────┘   └──────────┘   └──────────┘   └──────────┘ │        │
│   │                                                               │        │
│   │   Binding Site Analysis: D2 ATPase domain, ATP-competitive   │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │              PHASE 6: MOLECULE GENERATION                     │        │
│   │                                                               │        │
│   │   Seed: CB-5083 (Phase I VCP inhibitor)                      │        │
│   │        │                                                      │        │
│   │        ▼                                                      │        │
│   │   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │        │
│   │   │   MolMIM    │   │  DiffDock   │   │   RDKit     │        │        │
│   │   │ Generation  │──▶│  Docking    │──▶│  Scoring    │        │        │
│   │   │  (BioNeMo)  │   │  (BioNeMo)  │   │  (QED/Lip)  │        │        │
│   │   └─────────────┘   └─────────────┘   └─────────────┘        │        │
│   │        │                  │                  │                │        │
│   │        ▼                  ▼                  ▼                │        │
│   │   100 Analogues    Binding Poses     Ranked Candidates       │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   Drug Candidate Report (PDF) → Ready for medicinal chemistry              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

From Genomics to Drug Candidates¶

The Complete Journey¶

What traditionally takes pharmaceutical companies months to years can now be explored in hours:

┌─────────────────────────────────────────────────────────────────────────────┐
│                     END-TO-END DRUG DISCOVERY TIMELINE                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   TRADITIONAL                          AI FACTORY                           │
│   ────────────                         ──────────                           │
│                                                                             │
│   Sequencing: 2-4 weeks                Parabricks: 120-240 min               │
│   Variant Analysis: 2-4 weeks          RAG/Chat: Interactive               │
│   Target ID: 3-6 months                Clinker: Instant                     │
│   Structure Analysis: 1-2 months       PDB Fetch: < 1 min                  │
│   Lead Discovery: 6-12 months          MolMIM: 2-5 min                     │
│   Lead Optimization: 1-2 years         DiffDock: 5-10 min                  │
│                                                                             │
│   TOTAL: 2-3 years                     TOTAL: < 5 hours                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

The Demo Narrative¶

"Genomics tells us what changed, Cryo-EM shows us how it changed, and generative AI helps us design what could fix it."

This pipeline demonstrates how NVIDIA's accelerated computing transforms drug discovery from an art into a systematic, data-driven science.

Key Features¶

Cryo-EM Structure Integration¶

PDB/EMDB Access: Automatic retrieval of high-resolution protein structures
Structure Gallery: Visual display of available conformations
Binding Site Analysis: ATP-competitive pocket characterization
Resolution Tracking: Quality metrics for each structure (2.3-3.2 Å)

NVIDIA BioNeMo NIM Microservices¶

MolMIM: Generative AI for molecule design (latent-variable model trained with Mutual Information Machine learning)
DiffDock: Diffusion-based molecular docking
GPU Acceleration: 10-100x faster than CPU-based methods
REST API: Easy integration via HTTP endpoints

Known Inhibitor Analysis¶

CB-5083: Clinical-stage VCP inhibitor as generation seed
Structure-Activity Relationships: Analyze what makes inhibitors work
Multi-Seed Support: Generate from multiple reference compounds

Drug-Likeness Scoring¶

Lipinski's Rule of Five: MW, LogP, HBD, HBA compliance
QED Score: Quantitative estimate of drug-likeness (0-1)
Synthetic Accessibility: Ease of chemical synthesis
Tanimoto Similarity: Structural similarity to seed compounds (0-1)

Professional PDF Reports¶

Executive-Ready: Designed for VP-level presentations
Cryo-EM Visuals: Embedded structure images from PDB
Molecule Graphics: 2D structure renderings from PubChem
NVIDIA Branding: Professional color scheme matching DGX Spark

Real-Time Monitoring¶

Grafana Dashboards: GPU utilization, memory, power
Prometheus Metrics: Time-series data collection
DCGM Exporter: NVIDIA-specific GPU metrics

Architecture¶

System Architecture¶

┌────────────────────────────────────────────────────────────────────────────────┐
│                          DRUG DISCOVERY PIPELINE                                │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│   ┌──────────────────────────────────────────────────────────────────────┐    │
│   │                    STREAMLIT UI (Port 8505)                           │    │
│   │                                                                       │    │
│   │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │    │
│   │   │   Target     │  │  Structure   │  │  Molecule    │              │    │
│   │   │  Hypothesis  │  │   Gallery    │  │  Generation  │              │    │
│   │   └──────────────┘  └──────────────┘  └──────────────┘              │    │
│   │                                                                       │    │
│   │   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │    │
│   │   │   Docking    │  │   Scoring    │  │    Report    │              │    │
│   │   │   Results    │  │   & Ranking  │  │  Generation  │              │    │
│   │   └──────────────┘  └──────────────┘  └──────────────┘              │    │
│   │                                                                       │    │
│   └───────────────────────────────┬───────────────────────────────────────┘    │
│                                   │                                            │
│   ┌───────────────────────────────▼───────────────────────────────────────┐    │
│   │                        PIPELINE CORE                                   │    │
│   │                                                                       │    │
│   │   ┌────────────────────────────────────────────────────────────────┐ │    │
│   │   │                    src/pipeline.py                              │ │    │
│   │   │                                                                 │ │    │
│   │   │  Target Import → Structure Fetch → Generation → Docking → Score│ │    │
│   │   │                                                                 │ │    │
│   │   └────────────────────────────────────────────────────────────────┘ │    │
│   │                                                                       │    │
│   └───────────┬───────────────────┬───────────────────┬───────────────────┘    │
│               │                   │                   │                        │
│               ▼                   ▼                   ▼                        │
│   ┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐           │
│   │    RCSB PDB       │ │   BioNeMo NIMs    │ │      RDKit        │           │
│   │                   │ │                   │ │                   │           │
│   │  Structure Data   │ │  MolMIM (8001)    │ │  Cheminformatics  │           │
│   │  8OOI, 5FTK, etc  │ │  DiffDock (8002)  │ │  QED, Lipinski    │           │
│   │                   │ │                   │ │  SMILES parsing   │           │
│   └───────────────────┘ └───────────────────┘ └───────────────────┘           │
│                                                                                │
│   ┌──────────────────────────────────────────────────────────────────────┐    │
│   │                      MONITORING STACK                                 │    │
│   │                                                                       │    │
│   │   Grafana (3000) ←── Prometheus (9099) ←── DCGM Exporter (9400)     │    │
│   │                                                                       │    │
│   └──────────────────────────────────────────────────────────────────────┘    │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Data Flow¶

┌─────────────────────────────────────────────────────────────────────────────┐
│                              DATA FLOW                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   RAG/Chat Pipeline Output                                                  │
│   targets_for_phase5.json                                                  │
│        │                                                                    │
│        │  {                                                                 │
│        │    "gene": "VCP",                                                 │
│        │    "variant": "rs188935092",                                      │
│        │    "disease": "Frontotemporal Dementia",                          │
│        │    "druggability": "HIGH"                                         │
│        │  }                                                                 │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │ 1. TARGET IMPORT                                              │        │
│   │    • Load hypothesis from RAG/Chat                           │        │
│   │    • Validate gene symbol and variant                        │        │
│   │    • Check druggability annotation                           │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │ 2. STRUCTURE RETRIEVAL                                        │        │
│   │    • Query RCSB PDB for gene (VCP → p97)                     │        │
│   │    • Fetch structure metadata (resolution, method)           │        │
│   │    • Download structure images for visualization             │        │
│   │    • Identify inhibitor-bound structures (5FTK + CB-5083)   │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │ 3. MOLECULE GENERATION (MolMIM)                               │        │
│   │    • Extract seed SMILES from known inhibitor                │        │
│   │    • Call BioNeMo MolMIM API                                 │        │
│   │    • Generate 100 structural analogues                       │        │
│   │    • Filter by similarity threshold (0.5-0.8)               │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │ 4. MOLECULAR DOCKING (DiffDock)                               │        │
│   │    • Load protein structure (5FTK)                           │        │
│   │    • Dock each candidate molecule                            │        │
│   │    • Predict binding pose and affinity                       │        │
│   │    • Score: -8 to -12 kcal/mol (good binders)               │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   ┌──────────────────────────────────────────────────────────────┐        │
│   │ 5. SCORING & RANKING                                          │        │
│   │    • Calculate QED score (drug-likeness)                     │        │
│   │    • Check Lipinski Rule of Five                             │        │
│   │    • Compute Tanimoto similarity to seed                     │        │
│   │    • Rank by composite score                                  │        │
│   └──────────────────────────────────────────────────────────────┘        │
│        │                                                                    │
│        ▼                                                                    │
│   Output: Ranked Drug Candidates                                           │
│   VCP_Drug_Candidate_Report.pdf                                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
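The target import step (step 1 above) can be sketched as a small validation pass over one record of targets_for_phase5.json. This is a minimal sketch, not the code in src/target_import.py: the names `REQUIRED_FIELDS`, `RSID_PATTERN`, and `validate_target` are illustrative, and it assumes every variant is given as a dbSNP rsID.

```python
import json
import re

# Field names taken from the example record in the data flow diagram.
REQUIRED_FIELDS = ("gene", "variant", "disease", "druggability")
# Assumption: variants are always dbSNP rsIDs such as rs188935092.
RSID_PATTERN = re.compile(r"^rs\d+$")


def validate_target(record: dict) -> list[str]:
    """Return a list of problems found in one target record (empty if valid)."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)
    ]
    variant = record.get("variant", "")
    if variant and not RSID_PATTERN.match(variant):
        problems.append(f"variant is not an rsID: {variant!r}")
    return problems


raw = """
{
  "gene": "VCP",
  "variant": "rs188935092",
  "disease": "Frontotemporal Dementia",
  "druggability": "HIGH"
}
"""
target = json.loads(raw)
print(validate_target(target))  # [] when every check passes
```

Returning a list of problems rather than raising on the first one lets a UI show every issue with a hypothesis at once.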

The VCP/FTD Demo Workflow¶

Complete Patient-to-Molecule Journey¶

This demo shows the full AI Factory workflow using a real genomic variant:

Step 1: Genomic Discovery (Stage 1)¶

Patient: HG002 (GIAB reference genome)
Variant: rs188935092 (chr9:35065263 G>A)
Gene: VCP (Valosin-Containing Protein)
Impact: Missense mutation
AlphaMissense: 0.89 (Likely Pathogenic)

Step 2: Target Validation (Stage 2)¶

Query: "What variants are associated with frontotemporal dementia?"
Result: VCP identified as druggable target
Evidence: 13 VCP variants in genome
Diseases: FTD, ALS, IBMPFD (Inclusion Body Myopathy)
Druggability: HIGH (ATP-competitive site validated)

Step 3: Structure-Based Design (This Pipeline)¶

Structures: 4 VCP structures from PDB
Seed: CB-5083 (clinical VCP inhibitor)
Generated: 100 novel analogues
Top Candidates: 4 with QED > 0.35, Lipinski compliant
Best Docking: -10.95 kcal/mol

Why VCP?¶

VCP (also known as p97) is a AAA+ ATPase that plays critical roles in:

Protein quality control: Extracts misfolded proteins for degradation
Autophagy: Clears damaged organelles
DNA repair: Removes proteins from chromatin

Mutations in VCP cause:

Frontotemporal Dementia (FTD): Progressive brain degeneration
ALS: Motor neuron disease
IBMPFD: Muscle, bone, and brain disorder

VCP is an ideal drug target because:

Well-characterized ATP-binding pocket
Multiple high-resolution structures available
Clinical-stage inhibitors exist (CB-5083)
Clear disease mechanism (loss of protein homeostasis)

Cryo-EM Structure Evidence¶

Available VCP Structures¶

PDB ID	Description	Method	Resolution	Key Feature
8OOI	VCP/p97 wild-type hexamer	Cryo-EM	2.9 Å	Native conformation
9DIL	VCP with disease mutation	Cryo-EM	3.2 Å	Mutant structure
7K56	VCP-cofactor complex	Cryo-EM	2.5 Å	Binding mechanism
5FTK	VCP + CB-5083 inhibitor	X-ray	2.3 Å	Drug binding site
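One way to turn this table into a docking template choice is to prefer inhibitor-bound entries and then the best resolution. The sketch below is an assumption about how such a rule could look, not the pipeline's actual logic; the `Structure` class and `pick_docking_template` are hypothetical names, and the values are transcribed from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    pdb_id: str
    method: str
    resolution_angstrom: float
    has_inhibitor: bool


# Values transcribed from the table above.
STRUCTURES = [
    Structure("8OOI", "Cryo-EM", 2.9, False),
    Structure("9DIL", "Cryo-EM", 3.2, False),
    Structure("7K56", "Cryo-EM", 2.5, False),
    Structure("5FTK", "X-ray", 2.3, True),
]


def pick_docking_template(structures):
    """Prefer inhibitor-bound entries, then the lowest (best) resolution value."""
    return min(structures, key=lambda s: (not s.has_inhibitor, s.resolution_angstrom))


print(pick_docking_template(STRUCTURES).pdb_id)  # 5FTK
```

With this rule the result matches the structure that the docking step loads in the data flow (5FTK).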

Structure Analysis¶

┌─────────────────────────────────────────────────────────────────────────────┐
│                        VCP STRUCTURE GALLERY                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐          │
│   │      8OOI       │   │      5FTK       │   │      7K56       │          │
│   │                 │   │                 │   │                 │          │
│   │   [Hexamer]     │   │   [+CB-5083]    │   │   [Complex]     │          │
│   │                 │   │                 │   │                 │          │
│   │   Wild-type     │   │   Inhibitor     │   │   Cofactor      │          │
│   │   2.9 Å         │   │   2.3 Å         │   │   2.5 Å         │          │
│   └─────────────────┘   └─────────────────┘   └─────────────────┘          │
│                                                                             │
│   Binding Site: D2 ATPase Domain                                           │
│   Mode: ATP-competitive                                                     │
│   Key Residues: ALA464, GLY479, ASP320, GLY215                            │
│   Pocket Volume: ~450 Å³                                                   │
│   Druggability Score: 0.92                                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Image Caching¶

The pipeline automatically downloads and caches structure images:

from src.cryoem_evidence import CryoEMEvidence

evidence = CryoEMEvidence()
structures = evidence.get_structures("VCP")

# Images cached in data/structures/image_cache/
# - 8ooi_structure.jpeg
# - 5ftk_structure.jpeg
# - 7k56_structure.jpeg
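The caching behaviour can be sketched as a download-on-miss helper. This is illustrative only and not the implementation inside `CryoEMEvidence`: `cached_image` is a hypothetical function, and the download itself is passed in as a `fetch` callable so that no image URL has to be assumed. The file naming follows the cached names listed above.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_image(pdb_id: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the cached image path for a PDB ID, fetching only on a cache miss."""
    path = cache_dir / f"{pdb_id.lower()}_structure.jpeg"
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(pdb_id))
    return path


calls = []


def fake_fetch(pdb_id: str) -> bytes:
    """Stand-in for a real download; records each call and returns dummy bytes."""
    calls.append(pdb_id)
    return b"placeholder image bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_image("8OOI", Path(tmp), fake_fetch)
    second = cached_image("8OOI", Path(tmp), fake_fetch)  # served from the cache
    print(first.name, len(calls))  # 8ooi_structure.jpeg 1
```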

Molecule Generation with BioNeMo¶

MolMIM: Latent-Space Molecule Generation¶

NVIDIA BioNeMo's MolMIM is a latent-variable model trained with Mutual Information Machine (MIM) learning. It encodes a seed SMILES string into a latent code, samples nearby codes, and decodes them into novel molecules:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           MolMIM GENERATION                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   Input: CB-5083 SMILES                                                    │
│   CC(C)C1=C(C=C(C=C1)NC2=NC3=C(C=N2)N(C=C3)C)C(=O)NC4=CC=C(C=C4)CN5CCOCC5  │
│                                                                             │
│   Process:                                                                  │
│   1. Encode seed SMILES to latent space                                    │
│   2. Perturb the seed's latent code                                        │
│   3. Decode codes back to SMILES                                           │
│   4. Gather the diverse analogues                                          │
│   5. Filter by validity and novelty                                        │
│                                                                             │
│   Output: 100 Novel Analogues                                              │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐ │
│   │ VCP-AI-001: CC(C)C1=C(C=C(C=C1)NC2=NC3=...  Sim: 0.98  QED: 0.387 │ │
│   │ VCP-AI-002: CC(N)c1ccc(Nc2ncc3c(ccn3C)n2)... Sim: 0.85  QED: 0.365 │ │
│   │ VCP-AI-003: Cc1ccc(NC2=NC3=C(C=N2)N(C=C3)... Sim: 0.72  QED: 0.454 │ │
│   │ VCP-AI-004: CC(C)c1ccc(NC2=NC3=C(C=N2)N(... Sim: 0.75  QED: 0.387 │ │
│   └─────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
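The Sim values in the output above are Tanimoto similarities to the seed: the number of fingerprint bits two molecules share, divided by the number of bits set in either. A minimal sketch, assuming each fingerprint is represented as a set of on-bit indices. The fingerprints below are toy values rather than real molecules, and `tanimoto` and `filter_by_similarity` are illustrative names, not pipeline functions.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def filter_by_similarity(
    fingerprints: dict[str, set[int]], seed_bits: set[int], threshold: float = 0.7
) -> list[str]:
    """Keep the IDs whose similarity to the seed meets the threshold."""
    return [
        name
        for name, bits in fingerprints.items()
        if tanimoto(bits, seed_bits) >= threshold
    ]


seed_bits = {1, 2, 3, 8}
toy = {
    "close": {1, 2, 3, 8, 13},  # 4 shared / 5 total = 0.8
    "distant": {1, 21, 34},     # 1 shared / 6 total, about 0.17
}
print(filter_by_similarity(toy, seed_bits))  # ['close']
```

In practice the bit sets would come from a cheminformatics fingerprint (for example one computed with RDKit) rather than being written by hand.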

DiffDock: Diffusion-Based Docking¶

DiffDock predicts how generated molecules bind to the target:

from src.molecule_generator import DiffDockClient

client = DiffDockClient(endpoint="http://localhost:8002")
poses = client.dock(
    protein_pdb="data/structures/5ftk.pdb",
    ligand_smiles="CC(C)C1=C(C=C(C=C1)..."
)

# Returns:
# - Binding pose (PDB coordinates)
# - Predicted affinity (kcal/mol)
# - Confidence score

API Endpoints¶

Service	Port	Endpoint	Description
MolMIM	8001	POST /generate	Generate molecules from seed
DiffDock	8002	POST /dock	Predict binding poses

Example API Call¶

# Generate molecules with MolMIM
curl -X POST http://localhost:8001/generate \
  -H "Content-Type: application/json" \
  -d '{
    "seed_smiles": "CC(C)C1=C(C=C(C=C1)NC2=NC3=C(C=N2)N(C=C3)C)C(=O)NC4=CC=C(C=C4)CN5CCOCC5",
    "num_samples": 50,
    "temperature": 0.8,
    "similarity_threshold": 0.7
  }'
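The same call can be prepared from Python with only the standard library. This sketch mirrors the curl example field for field; `build_generate_request` is a hypothetical helper, and because the response schema is not documented here, the code stops at building the request.

```python
import json
import urllib.request

SEED_SMILES = "CC(C)C1=C(C=C(C=C1)NC2=NC3=C(C=N2)N(C=C3)C)C(=O)NC4=CC=C(C=C4)CN5CCOCC5"


def build_generate_request(
    seed_smiles: str,
    num_samples: int = 50,
    temperature: float = 0.8,
    similarity_threshold: float = 0.7,
    base_url: str = "http://localhost:8001",
) -> urllib.request.Request:
    """Build the POST /generate request shown in the curl example."""
    payload = {
        "seed_smiles": seed_smiles,
        "num_samples": num_samples,
        "temperature": temperature,
        "similarity_threshold": similarity_threshold,
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request(SEED_SMILES)
print(request.get_method(), request.full_url)  # POST http://localhost:8001/generate
# To send it once the service is running:
#   with urllib.request.urlopen(request, timeout=60) as response:
#       body = json.load(response)
```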

Drug-Likeness Scoring¶

Lipinski's Rule of Five¶

A rule-of-thumb screen for oral bioavailability based on four molecular properties:

Rule	Threshold	VCP-AI-001	Status
Molecular Weight	≤ 500 Da	484.6 Da	✓ PASS
LogP	≤ 5	4.92	✓ PASS
H-Bond Donors	≤ 5	2	✓ PASS
H-Bond Acceptors	≤ 10	6	✓ PASS
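The four thresholds in the table translate directly into a violation count. A minimal sketch that takes precomputed descriptor values as input (in the pipeline these would presumably come from RDKit); `lipinski_violations` is an illustrative function name.

```python
def lipinski_violations(mw: float, logp: float, hbd: int, hba: int) -> int:
    """Count how many of the four Rule of Five thresholds are exceeded."""
    checks = (
        mw > 500,   # molecular weight in Da
        logp > 5,   # octanol-water partition coefficient
        hbd > 5,    # hydrogen-bond donors
        hba > 10,   # hydrogen-bond acceptors
    )
    return sum(checks)


# Descriptor values for VCP-AI-001 from the table above.
print(lipinski_violations(mw=484.6, logp=4.92, hbd=2, hba=6))  # 0
```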

QED Score¶

Quantitative Estimate of Drug-likeness (0-1 scale):

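from rdkit import Chem
from rdkit.Chem.QED import qed

# Parse the seed SMILES used throughout this document
smiles = "CC(C)C1=C(C=C(C=C1)NC2=NC3=C(C=N2)N(C=C3)C)C(=O)NC4=CC=C(C=C4)CN5CCOCC5"
mol = Chem.MolFromSmiles(smiles)
if mol is None:
    raise ValueError(f"Invalid SMILES: {smiles}")

# Calculate QED
qed_score = qed(mol)

# Interpretation:
# > 0.67: Drug-like
# 0.49-0.67: Moderately drug-like
# < 0.49: Less drug-like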
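The interpretation bands can be applied programmatically. A small sketch using the thresholds listed above; `qed_category` is a hypothetical helper, and treating a score of exactly 0.67 as "moderately drug-like" is a choice made here, since the bands above do not say which side the boundary belongs to.

```python
def qed_category(score: float) -> str:
    """Map a QED score (0-1) onto the interpretation bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"QED must lie in [0, 1], got {score}")
    if score > 0.67:
        return "drug-like"
    if score >= 0.49:
        return "moderately drug-like"
    return "less drug-like"


print(qed_category(0.72))  # drug-like
print(qed_category(0.55))  # moderately drug-like
```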

Composite Scoring¶

Candidates are ranked by a weighted combination:

score = (
    0.3 * qed_score +
    0.3 * (1 - lipinski_violations / 4) +
    0.2 * tanimoto_similarity +
    0.2 * normalized_docking_score
)
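The formula leaves `normalized_docking_score` undefined. The sketch below fills that gap with one possible normalization, which is an assumption rather than the pipeline's documented behaviour: docking scores are mapped linearly so that -6 kcal/mol becomes 0 and -12 kcal/mol becomes 1, then clamped. The function names and the input values are illustrative.

```python
def normalize_docking(
    score_kcal_mol: float, weakest: float = -6.0, strongest: float = -12.0
) -> float:
    """Map a docking score onto 0-1 (assumed linear scale, clamped at both ends)."""
    fraction = (weakest - score_kcal_mol) / (weakest - strongest)
    return min(1.0, max(0.0, fraction))


def composite_score(
    qed_score: float,
    lipinski_violations: int,
    tanimoto_similarity: float,
    docking_kcal_mol: float,
) -> float:
    """Weighted combination using the weights from the formula above."""
    return (
        0.3 * qed_score
        + 0.3 * (1 - lipinski_violations / 4)
        + 0.2 * tanimoto_similarity
        + 0.2 * normalize_docking(docking_kcal_mol)
    )


# Illustrative inputs: QED 0.5, no violations, similarity 0.5, docking -9.0 kcal/mol.
print(round(composite_score(0.5, 0, 0.5, -9.0), 3))  # 0.65
```

Because more negative docking scores indicate stronger predicted binding, the normalization inverts the sign so that every term in the sum rewards better candidates.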

Top Candidates¶

Rank	ID	Docking (kcal/mol)	QED	MW (Da)	LogP	Score
1	VCP-AI-001	-8.62	0.387	484.6	4.92	0.444
2	VCP-AI-002	-8.26	0.365	485.6	3.82	0.399
3	VCP-AI-003	-9.86	0.454	456.5	4.10	0.364
4	VCP-AI-004	-10.95	0.387	484.6	4.92	0.356

PDF Report Generation¶

Professional Executive Reports¶

The pipeline generates PDF reports formatted for VP-level presentations:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    VCP DRUG CANDIDATE REPORT                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   PRECISION MEDICINE TO DRUG                                                │
│   DISCOVERY                                                                  │
│   AI Factory Pipeline Report                                                │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐ │
│   │ Target: VCP | Patient: HG002 | Generated: January 14, 2026         │ │
│   └─────────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐                     │
│   │GENOMICS │─▶│RAG/CHAT │─▶│STRUCTURE│─▶│MOLECULES│                     │
│   │Phase 1-3│  │Phase 4  │  │Phase 5  │  │Phase 6  │                     │
│   └─────────┘  └─────────┘  └─────────┘  └─────────┘                     │
│                                                                             │
│   1. GENOMIC VARIANT DETECTION                                             │
│      VCP missense variant rs188935092                                      │
│      AlphaMissense: 0.89 (LIKELY PATHOGENIC)                              │
│                                                                             │
│   2. RAG/CHAT TARGET HYPOTHESIS                                            │
│      VCP/p97 confirmed as high-priority target                            │
│                                                                             │
│   3. STRUCTURAL EVIDENCE                                                    │
│      ┌───────────────┐  ┌───────────────┐                                 │
│      │ [8OOI Image]  │  │ [5FTK Image]  │                                 │
│      │ 2.9 Å Cryo-EM │  │ 2.3 Å X-ray   │                                 │
│      └───────────────┘  └───────────────┘                                 │
│                                                                             │
│   4. GENERATED DRUG CANDIDATES                                             │
│      ┌─────────────────────────────────────────────────────────────────┐ │
│      │ Rank │ ID        │ Docking  │ QED   │ Score │                   │ │
│      │──────│───────────│──────────│───────│───────│                   │ │
│      │ #1   │ VCP-AI-001│ -8.62    │ 0.387 │ 0.444 │                   │ │
│      │ #2   │ VCP-AI-002│ -8.26    │ 0.365 │ 0.399 │                   │ │
│      └─────────────────────────────────────────────────────────────────┘ │
│                                                                             │
│   5. EXECUTIVE SUMMARY                                                      │
│      • Pathogenic variant identified                                       │
│      • Validated drug target                                               │
│      • 4 novel candidates generated                                        │
│                                                                             │
│   ─────────────────────────────────────────────────────────────────────── │
│   HCLS AI Factory | Powered by NVIDIA Accelerated Computing               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Report Features¶

Cryo-EM Structure Images: Downloaded from RCSB PDB
Molecule Graphics: 2D structures from PubChem
KeepTogether Tables: No awkward page breaks
NVIDIA Color Scheme: #76B900 green accents
Professional Typography: Clean, readable fonts

Generate Report¶

from generate_vcp_report_enhanced import VCPReportGeneratorEnhanced

generator = VCPReportGeneratorEnhanced(
    output_path="outputs/VCP_Drug_Candidate_Report.pdf"
)
generator.generate()

Quick Start¶

Prerequisites¶

Python 3.10+
NVIDIA GPU with 16GB+ VRAM (for NIM services)
Docker (for BioNeMo NIMs and monitoring)
NGC API Key (for pulling NIM containers)
Completed RAG/Chat Pipeline with target hypothesis

Installation¶

# Clone the repository
git clone https://github.com/ajones1923/drug-discovery-pipeline.git
cd drug-discovery-pipeline

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
nano .env  # Add your NGC_API_KEY

Start the Pipeline¶

# Option 1: Streamlit Discovery UI (Recommended)
streamlit run app/discovery_ui.py --server.port 8505

# Option 2: Pipeline Portal
streamlit run portal/app.py --server.port 8510

# Option 3: Generate PDF Report
python generate_vcp_report_enhanced.py

Access the UI at: http://localhost:8505

Installation¶

Step 1: Clone the Repository¶

git clone https://github.com/ajones1923/drug-discovery-pipeline.git
cd drug-discovery-pipeline

Step 2: Create Virtual Environment¶

python -m venv venv
source venv/bin/activate

Step 3: Install Dependencies¶

pip install -r requirements.txt

Required packages:

streamlit: Web UI framework
rdkit: Cheminformatics toolkit
reportlab: PDF generation
requests: HTTP client for NIMs
plotly: Interactive visualizations

Step 4: Configure Environment¶

cp .env.example .env

Edit .env:

# NGC API Key (for BioNeMo NIMs)
NGC_API_KEY=your_key_here

# NIM Endpoints
MOLMIM_URL=http://localhost:8001
DIFFDOCK_URL=http://localhost:8002

# Mock Mode (set to true if NIMs unavailable)
NIM_ALLOW_MOCK_FALLBACK=true
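The mock fallback setting suggests a try-then-fall-back pattern around each NIM call. The following is a sketch of that pattern under stated assumptions, not the pipeline's code: `generate_with_fallback` and both callables are hypothetical, and only connection-level failures trigger the fallback.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def generate_with_fallback(
    call_nim: Callable[[], T], call_mock: Callable[[], T], allow_mock: bool
) -> T:
    """Try the real service first; use the mock only if that is explicitly allowed."""
    try:
        return call_nim()
    except OSError:
        # ConnectionError and urllib.error.URLError are both OSError subclasses.
        if not allow_mock:
            raise
        return call_mock()


def unreachable_nim() -> list[str]:
    """Simulates a MolMIM container that is not running."""
    raise ConnectionRefusedError("MolMIM is not listening on port 8001")


def mock_generator() -> list[str]:
    """Stand-in output used when the real service is unavailable."""
    return ["mock-molecule"]


result = generate_with_fallback(unreachable_nim, mock_generator, allow_mock=True)
print(result)  # ['mock-molecule']
```

Re-raising when the flag is off keeps a misconfigured deployment from silently reporting mock molecules as real results.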

Step 5: Start BioNeMo NIMs (Optional)¶

docker-compose up -d molmim diffdock

Step 6: Start Monitoring (Optional)¶

cd monitoring
docker-compose up -d

Usage¶

Streamlit Discovery UI¶

The main interface for interactive drug discovery:

streamlit run app/discovery_ui.py --server.port 8505

Features:

Step-by-step guided workflow
Interactive 3D structure viewer
Real-time molecule generation
Export to PDF report

Pipeline Portal¶

Dashboard for managing multiple targets:

streamlit run portal/app.py --server.port 8510

Features:

- Target hypothesis management
- Batch molecule generation
- Pipeline orchestration
- Results comparison

PDF Report Generation¶

Generate executive-ready reports:

python generate_vcp_report_enhanced.py

Output: outputs/VCP_Drug_Candidate_Report.pdf

Python API¶

from src.pipeline import DrugDiscoveryPipeline

pipeline = DrugDiscoveryPipeline()

# Run complete workflow
results = pipeline.run(
    gene="VCP",
    disease="Frontotemporal Dementia",
    seed_smiles="CC(C)C1=C(C=C(C=C1)..."
)

# Access results
print(results['candidates'])
print(results['top_score'])

Configuration¶

Environment Variables¶

# API Keys
NGC_API_KEY=nvapi-...

# NIM Endpoints
MOLMIM_URL=http://localhost:8001
DIFFDOCK_URL=http://localhost:8002

# Generation Parameters
NUM_MOLECULES=100
SIMILARITY_THRESHOLD=0.7
TEMPERATURE=0.8

# Scoring Weights
DOCKING_WEIGHT=0.3
QED_WEIGHT=0.3
SIMILARITY_WEIGHT=0.2
LIPINSKI_WEIGHT=0.2

# Mock Mode
NIM_ALLOW_MOCK_FALLBACK=true
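The four scoring weights sum to 1.0, which suggests a weighted-sum composite. The sketch below shows one way such weights could be loaded and applied; it assumes each component score has already been normalized to the range 0 to 1, and the function names are illustrative rather than taken from `src/scoring.py`.

```python
import os

# Defaults mirror the values listed under "Scoring Weights".
DEFAULT_WEIGHTS = {
    "docking": 0.3,
    "qed": 0.3,
    "similarity": 0.2,
    "lipinski": 0.2,
}


def load_weights(env=None) -> dict:
    """Read <NAME>_WEIGHT variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    weights = {
        name: float(env.get(f"{name.upper()}_WEIGHT", default))
        for name, default in DEFAULT_WEIGHTS.items()
    }
    total = sum(weights.values())
    # Assumption: weights are expected to form a convex combination.
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"scoring weights must sum to 1.0, got {total}")
    return weights


def composite_score(components: dict, weights: dict) -> float:
    """Weighted sum of component scores, each assumed normalized to [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)
```

Validating the sum at load time catches a mistyped weight before it silently skews the candidate ranking.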

Molecule Generation Parameters¶

Parameter	Default	Description
`NUM_MOLECULES`	100	Candidates to generate
`TEMPERATURE`	0.8	Generation diversity (0-1)
`SIMILARITY_THRESHOLD`	0.7	Minimum Tanimoto similarity to the seed molecule
`MAX_MW`	550	Maximum molecular weight
`MAX_LOGP`	5.5	Maximum LogP
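To make the filtering parameters concrete, here is a minimal sketch of how generated candidates might be screened against them. It assumes each candidate already carries a binary fingerprint (represented as a set of on-bit indices), a molecular weight, and a LogP value; in practice these would typically come from RDKit. The `Candidate` class and `filter_candidates` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one generated molecule."""
    smiles: str
    fingerprint: frozenset  # indices of "on" bits in a binary fingerprint
    mol_weight: float
    logp: float


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: |A intersect B| / |A union B| over on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_candidates(seed_fp, candidates,
                      similarity_threshold=0.7, max_mw=550.0, max_logp=5.5):
    """Keep candidates similar enough to the seed and within MW/LogP limits."""
    kept = []
    for c in candidates:
        similar = tanimoto(seed_fp, c.fingerprint) >= similarity_threshold
        if similar and c.mol_weight <= max_mw and c.logp <= max_logp:
            kept.append(c)
    return kept
```

The defaults match the table above, so calling `filter_candidates(seed_fp, candidates)` with no overrides applies the documented limits.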

Docking Parameters¶

Parameter	Default	Description
`NUM_POSES`	10	Poses per molecule
`EXHAUSTIVENESS`	8	Search thoroughness
`BINDING_THRESHOLD`	-6.0	Min binding affinity
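The docking parameters can be read as a two-step selection: take the best pose per molecule, then keep molecules that clear the threshold. The sketch below illustrates that reading under two assumptions not stated in this README: that scores behave like binding energies (more negative is better), and that a molecule passes when its best score is at or below `BINDING_THRESHOLD`. The function names are illustrative.

```python
def best_pose_scores(poses_by_molecule: dict) -> dict:
    """Return the lowest (most favorable) score among each molecule's poses."""
    return {mol: min(scores) for mol, scores in poses_by_molecule.items() if scores}


def passes_binding_threshold(poses_by_molecule: dict,
                             binding_threshold: float = -6.0,
                             num_poses: int = 10) -> list:
    """Rank molecules whose best of the first `num_poses` scores meets the threshold."""
    trimmed = {mol: scores[:num_poses] for mol, scores in poses_by_molecule.items()}
    best = best_pose_scores(trimmed)
    # Sort ascending so the most favorable (most negative) score comes first.
    return sorted((m for m, s in best.items() if s <= binding_threshold), key=best.get)
```

Molecules with no returned poses are dropped rather than treated as failures, which keeps a partial docking run from raising an error.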

Directory Structure¶

drug-discovery-pipeline/
├── app/
│   └── discovery_ui.py              # Main Streamlit UI (Port 8505)
├── portal/
│   └── app.py                       # Pipeline portal (Port 8510)
├── src/
│   ├── pipeline.py                  # Main pipeline orchestrator
│   ├── target_import.py             # Import from RAG/Chat
│   ├── cryoem_evidence.py           # Cryo-EM structure handling
│   ├── structure_viewer.py          # 3D visualization
│   ├── molecule_generator.py        # BioNeMo MolMIM integration
│   ├── nim_clients.py               # NIM API clients
│   ├── scoring.py                   # Drug-likeness scoring
│   └── models.py                    # Data models (Pydantic)
├── generate_vcp_report_enhanced.py  # PDF report generator
├── data/
│   ├── structures/                  # Structure metadata
│   │   ├── vcp_structures.json      # VCP PDB entries
│   │   └── image_cache/             # Cached structure images
│   ├── targets/                     # Imported target hypotheses
│   └── molecules/                   # Reference molecules
├── outputs/
│   ├── VCP_Drug_Candidate_Report.pdf  # Generated reports
│   ├── candidates/                    # Ranked candidates
│   └── docking/                       # DiffDock results
├── monitoring/
│   ├── docker-compose.yml           # Prometheus + Grafana
│   ├── grafana/
│   │   └── dashboards/              # GPU monitoring dashboards
│   └── prometheus/
│       └── prometheus.yml           # Scrape configuration
├── docker-compose.yml               # BioNeMo NIM services
├── requirements.txt                 # Python dependencies
├── .env.example                     # Environment template
└── README.md                        # This documentation

Services¶

Service	Port	Description	URL
Discovery UI	8505	Main Streamlit interface	http://localhost:8505
Pipeline Portal	8510	Target management portal	http://localhost:8510
MolMIM NIM	8001	Molecule generation	http://localhost:8001
DiffDock NIM	8002	Molecular docking	http://localhost:8002
Grafana	3000	GPU monitoring dashboards	http://localhost:3000
Prometheus	9099	Metrics collection	http://localhost:9099
DCGM Exporter	9400	NVIDIA GPU metrics	http://localhost:9400
Node Exporter	9100	System metrics	http://localhost:9100
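A quick way to see which of these services are reachable is a plain TCP port check. The sketch below uses only the Python standard library; the `SERVICES` mapping copies the ports from the table, and the helper names are illustrative. An open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports copied from the Services table.
SERVICES = {
    "Discovery UI": 8505,
    "Pipeline Portal": 8510,
    "MolMIM NIM": 8001,
    "DiffDock NIM": 8002,
    "Grafana": 3000,
    "Prometheus": 9099,
    "DCGM Exporter": 9400,
    "Node Exporter": 9100,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost", services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:16s} {'up' if up else 'DOWN'}")
```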

Monitoring Stack¶

Start Monitoring¶

cd monitoring
docker-compose up -d

Grafana Dashboard¶

Access GPU monitoring:

- URL: http://localhost:3000
- Username: admin
- Password: dgxspark

Available Metrics¶

Metric	Description
`DCGM_FI_DEV_GPU_UTIL`	GPU compute utilization
`DCGM_FI_DEV_MEM_COPY_UTIL`	Memory bandwidth utilization
`DCGM_FI_DEV_FB_USED`	GPU memory used
`DCGM_FI_DEV_POWER_USAGE`	Power consumption (W)
`DCGM_FI_DEV_SM_CLOCK`	SM clock speed
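These metrics can also be read programmatically through the Prometheus HTTP API instead of Grafana. The following sketch queries the instant-query endpoint (`/api/v1/query`) on the Prometheus port listed under Services and extracts one value per GPU. It assumes the DCGM exporter attaches a `gpu` label to each series; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9099"  # port from the Services table


def build_query_url(metric: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for a single metric name or PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": metric})


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' 'gpu' label (assumed present) to its latest value."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown')}")
    values = {}
    for series in payload["data"]["result"]:
        gpu = series["metric"].get("gpu", "unknown")
        _timestamp, value = series["value"]  # value arrives as a string
        values[gpu] = float(value)
    return values


def fetch_metric(metric: str, base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Query Prometheus and return {gpu_label: value} for the metric."""
    with urllib.request.urlopen(build_query_url(metric, base_url), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

With the monitoring stack running, `fetch_metric("DCGM_FI_DEV_GPU_UTIL")` would return the current utilization per GPU.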

Performance Benchmarks¶

Expected Timings (DGX Spark GB10)¶

Step	Time	Notes
Target Import	< 1 sec	Load from JSON
Structure Fetch	5-10 sec	PDB API + image download
MolMIM Generation	2-5 min	100 molecules
DiffDock Docking	5-10 min	10 poses each
Scoring	< 30 sec	RDKit calculations
PDF Report	10-30 sec	With image embedding
Total	8-16 min	End-to-end

GPU Utilization¶

Phase	GPU Util	Memory	Power
MolMIM	70-85%	8-12 GB	200-300W
DiffDock	80-95%	12-16 GB	250-350W
Scoring	5-10%	2 GB	50-100W

Comparison to Traditional Methods¶

Method	Time	Hardware
GPU Pipeline (This)	8-16 min	DGX Spark
CPU Pipeline	2-4 hours	32-core Xeon
Manual Discovery	6-12 months	Chemistry lab

Troubleshooting¶

BioNeMo NIM Not Responding¶

# Check container status
docker ps | grep bionemo

# View logs
docker logs dd-molmim

# Restart NIMs
docker-compose restart molmim diffdock

GPU Out of Memory¶

# Reduce batch size
python src/cli.py --batch-size 10

# Check GPU memory
nvidia-smi

# Reset the GPU as a last resort (requires root; fails while any process is using the GPU)
nvidia-smi --gpu-reset

PDF Report Generation Failed¶

# Check ReportLab installation
pip install --upgrade reportlab

# Verify image cache
ls -la data/structures/image_cache/

# Clear cache and regenerate
rm -rf data/structures/image_cache/*
python generate_vcp_report_enhanced.py

Structure Images Not Loading¶

# Check RCSB PDB connectivity
curl https://cdn.rcsb.org/images/structures/8o/8ooi/8ooi_assembly-1.jpeg

# Clear and refetch images
rm -rf data/structures/image_cache/*
python -c "from src.cryoem_evidence import CryoEMEvidence; e = CryoEMEvidence(); e.get_structures('VCP')"

Monitoring Not Showing Data¶

# Check Prometheus targets
curl http://localhost:9099/api/v1/targets

# Verify DCGM exporter
curl http://localhost:9400/metrics | grep DCGM

# Restart monitoring stack
cd monitoring && docker-compose down && docker-compose up -d
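The JSON returned by the `/api/v1/targets` call above can be long, so it may help to filter it down to the scrape targets that are not healthy. The sketch below does that with the standard library; the function names are illustrative, and the base URL assumes the Prometheus port used elsewhere in this document.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list:
    """Return (job, scrape URL, last error) for active targets not reporting 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (
            t.get("labels", {}).get("job", "?"),
            t.get("scrapeUrl", "?"),
            t.get("lastError", ""),
        )
        for t in targets
        if t.get("health") != "up"
    ]


def fetch_unhealthy(base_url: str = "http://localhost:9099", timeout: float = 5.0) -> list:
    """Fetch the targets list from Prometheus and keep only unhealthy entries."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return unhealthy_targets(json.load(resp))


if __name__ == "__main__":
    for job, url, error in fetch_unhealthy():
        print(f"{job}: {url} -> {error or 'no error reported'}")
```

An empty result means every configured target is being scraped successfully, which points the investigation toward Grafana's data source settings instead.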

Related Pipelines¶

Stage	Pipeline	Description
1	Genomics Pipeline	FASTQ → VCF with Parabricks
2	RAG/Chat Pipeline	VCF → Target Hypothesis
3	Drug Discovery Pipeline (This repo)	Target → Molecule Candidates

Integration Flow¶

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    GENOMICS     │     │    RAG/CHAT     │     │ DRUG DISCOVERY  │
│    PIPELINE     │────▶│    PIPELINE     │────▶│    PIPELINE     │
├─────────────────┤     ├─────────────────┤     ├─────────────────┤
│ Input:          │     │ Input:          │     │ Input:          │
│  FASTQ files    │     │  VCF file       │     │  Target JSON    │
│                 │     │                 │     │                 │
│ Process:        │     │ Process:        │     │ Process:        │
│  Parabricks     │     │  Milvus + Claude│     │  BioNeMo NIMs   │
│  DeepVariant    │     │  Clinker        │     │  RDKit          │
│                 │     │                 │     │                 │
│ Output:         │     │ Output:         │     │ Output:         │
│  HG002.vcf.gz   │     │  VCP target     │     │  Candidates PDF │
│  (11.7M vars)   │     │  hypothesis     │     │  (100 molecules)│
└─────────────────┘     └─────────────────┘     └─────────────────┘

License¶

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments¶

NVIDIA for BioNeMo NIMs, DGX Spark, and accelerated computing platform
RCSB PDB for open structural biology data
RDKit for open-source cheminformatics toolkit
Google DeepMind for AlphaMissense pathogenicity predictions
Streamlit for the interactive web framework
Anthropic for Claude AI reasoning capabilities

Status: Production Ready | Optimized for NVIDIA DGX Spark | FTD → Drug Discovery Demo
