Stage 3: Drug Discovery¶

03

Designing new medicines with generative AI

8 – 16 minutes per target

What This Stage Does¶

Stage 2 identified a druggable target — a disease-causing gene that scientists know how to modulate with drugs. Stage 3 designs new molecules that might treat the disease.

This is generative drug discovery:

Structure Retrieval — Fetch the 3D protein structure from PDB
Seed Compound — Identify an existing drug as a starting point
Molecule Generation — AI creates 100 novel molecular designs
Binding Prediction — Simulate how each molecule binds to the target
Drug-Likeness Scoring — Filter for molecules that could work as medicines

The Process¶

1. Get the Target Structure¶

For VCP (the target from Stage 2), the pipeline queries the Protein Data Bank:

PDB ID: 5FTK
Resolution: 2.4 Å
Key insight: Structure shows VCP bound to an existing inhibitor — this reveals exactly where drugs should attach

2. Identify a Seed Compound¶

The existing drug CB-5083 serves as our starting point:

VCP inhibitor that reached Phase I trials
Discontinued due to off-target effects
Provides a molecular scaffold to improve upon

3. Generate Novel Molecules¶

BioNeMo MolMIM generates 100 new molecular structures:

Uses the same AI architecture as text generators
But writes molecular structures (SMILES notation) instead of sentences
Each molecule is a variation designed to improve on the seed

4. Predict Binding¶

BioNeMo DiffDock simulates molecular docking:

Predicts the 3D pose of each molecule in the binding pocket
Calculates binding affinity (docking score)
Score < -8.0 kcal/mol = excellent binding

5. Score Drug-Likeness¶

RDKit evaluates whether molecules could work as oral drugs:

Lipinski's Rule of Five — Size, solubility, permeability
QED Score — Quantitative Estimate of Drug-likeness (0–1)
Threshold: QED > 0.67 indicates good drug potential

Results¶

From 100 generated molecules:

Metric	Value
Valid structures	98
Pass Lipinski's rules	87 (89%)
QED > 0.67	72 (73%)
Docking score < -8.0	45 (46%)
Top candidate QED	0.81
Top candidate docking	-11.4 kcal/mol

Top Candidate vs. Original Drug¶

Metric	CB-5083 (Original)	AI Candidate
Binding strength	-8.1 kcal/mol	-11.4 kcal/mol
Drug-likeness (QED)	0.62	0.81
Composite score	0.64	0.89
Improvement	—	+39%

Technology Stack¶

- **NVIDIA BioNeMo MolMIM** — Generative AI for molecules - **NVIDIA BioNeMo DiffDock** — AI-powered docking prediction - **RDKit** — Cheminformatics and drug-likeness scoring - **Protein Data Bank** — 3D protein structures - **RCSB API** — Programmatic structure retrieval

How Scoring Works¶

Each candidate is ranked by a composite formula:

Score = (0.30 × Generation Confidence) +
        (0.40 × Binding Strength) +
        (0.30 × Drug-Likeness)

Generation Confidence — How well the AI generated the structure
Binding Strength — Predicted affinity to the target (docking score)
Drug-Likeness — QED score from RDKit

Important Context¶

These are computational predictions — promising starting points for laboratory testing, not finished medicines.

Real drug development requires:

Synthesis — Actually making the molecules
In vitro testing — Cell-based assays
In vivo testing — Animal models
Clinical trials — Human safety and efficacy
Regulatory approval — FDA or equivalent

What this pipeline does is collapse the first step — identifying targets and generating candidates — from months of work into hours.

The Output¶

Stage 3 produces a ranked list of 100 drug candidates:

Rank  SMILES                          Docking   QED    Score
----  ------------------------------  --------  -----  -----
1     CC1=CC=C(C=C1)C2=CN=C(N2)...   -11.4     0.81   0.89
2     COC1=CC=C(C=C1)N2C=NC3=C2...   -10.8     0.78   0.86
3     CC(C)NC(=O)C1=CC=C(C=C1)...    -10.2     0.79   0.84
...

Each candidate includes:

SMILES notation — Machine-readable molecular structure
2D structure image — Visual representation
Docking score — Predicted binding affinity
QED score — Drug-likeness
Lipinski violations — Rule-of-five compliance
Composite rank — Overall score

Full Pipeline Summary¶

Patient DNA ──▶ Stage 1 ──▶ Stage 2 ──▶ Stage 3 ──▶ Drug Candidates
   (FASTQ)      (VCF)      (Target)    (Molecules)

   200 GB       11.7M       VCP         100 ranked
   raw data     variants    gene        candidates

               2-4 hrs     minutes     8-16 min

Total time: Under 5 hours from raw DNA to drug candidates.

Learn More¶

- [**Technical Documentation →**](drug-discovery-pipeline/README.md) — Full pipeline details, parameters, and API reference - [**Back to Stage 2 →**](stage-2-rag.md) — Evidence RAG - [**Back to Stage 1 →**](stage-1-genomics.md) — GPU Genomics - [**Return to Overview →**](index.md) — Homepage