Stage 2: Evidence RAG
Finding the needle in 11.7 million variants
Interactive queries in seconds
What This Stage Does
Stage 1 produced 11.7 million variants. Most are harmless — they're just what makes you, you. The challenge is finding the ones that actually cause disease and can be targeted with drugs.
Stage 2 uses Retrieval-Augmented Generation (RAG) to:
- Annotate — Cross-reference every variant against clinical databases
- Embed — Convert variants into searchable vector representations
- Reason — Use AI to interpret evidence and identify targets
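The first two steps can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the `Variant` type, the `CLINVAR` dictionary, and the `annotate` and `to_text` helpers are hypothetical names, and the single record is the VCP variant from the example query later on this page.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical stand-in for a ClinVar extract, keyed by exact coordinates.
CLINVAR = {
    Variant("chr9", 35065263, "G", "A"): "Pathogenic",
}


def annotate(variant: Variant) -> Optional[str]:
    """Return the clinical significance for an exact match, else None."""
    return CLINVAR.get(variant)


def to_text(variant: Variant, significance: Optional[str]) -> str:
    """Render a variant as a short string that an embedding model could index."""
    label = significance or "no ClinVar record"
    return f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alt} ({label})"
```

Keying the lookup on the full `(chrom, pos, ref, alt)` tuple means a variant at the same position with a different alternate allele is not matched by accident.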
The Funnel
This is the heart of what the platform does — narrowing millions of variants to actionable targets:
flowchart TD
A["11,700,000 variants discovered"]
B["3,561,170 pass quality filters"]
C["35,616 match ClinVar records"]
D["6,831 have pathogenicity predictions"]
E["2,400 high-impact, disease-causing"]
F["847 in druggable genes"]
A -->|"Quality filter (QUAL>30)"| B
B -->|"ClinVar annotation"| C
C -->|"AlphaMissense scoring"| D
D -->|"Impact classification"| E
E -->|"Druggability filter"| F
style A fill:#B3E5FC,stroke:#0288D1
style B fill:#81D4FA,stroke:#0288D1
style C fill:#4FC3F7,stroke:#0277BD
style D fill:#29B6F6,stroke:#0277BD,color:#fff
style E fill:#039BE5,stroke:#01579B,color:#fff
style F fill:#00B4D8,stroke:#0077B6,color:#fff
From 11.7 million down to 847: each stage keeps only the variants with stronger clinical evidence, removing more than 99.99% of the starting set.
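The funnel can be read as a chain of filters applied in order, with a survivor count recorded after each one. The sketch below assumes a simplified `Record` type and a one-gene `DRUGGABLE_GENES` set, both hypothetical; only the `QUAL > 30` cutoff comes from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Record:
    qual: float
    clinvar: Optional[str]     # ClinVar significance, None if unmatched
    am_score: Optional[float]  # AlphaMissense score, None if unscored
    high_impact: bool
    gene: str


DRUGGABLE_GENES = {"VCP"}  # hypothetical subset of the druggable gene list

# Stage order mirrors the diagram: each predicate keeps a record or drops it.
STAGES: List[Tuple[str, Callable[[Record], bool]]] = [
    ("quality", lambda r: r.qual > 30),
    ("clinvar", lambda r: r.clinvar is not None),
    ("alphamissense", lambda r: r.am_score is not None),
    ("impact", lambda r: r.high_impact),
    ("druggable", lambda r: r.gene in DRUGGABLE_GENES),
]


def run_funnel(records: List[Record]) -> Tuple[List[Record], Dict[str, int]]:
    """Apply every stage in order and record how many records survive each."""
    counts = {"input": len(records)}
    survivors = list(records)
    for name, keep in STAGES:
        survivors = [r for r in survivors if keep(r)]
        counts[name] = len(survivors)
    return survivors, counts
```

Keeping the per-stage counts alongside the survivors makes it easy to reproduce a table like the one in the diagram for any input.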
Knowledge Base
The RAG system is grounded in curated biomedical knowledge:
| Source | Coverage |
|---|---|
| ClinVar | 4.1 million clinically studied variants |
| AlphaMissense | DeepMind's pathogenicity predictions |
| Gene Panel | 201 genes across 13 therapeutic areas |
| Druggability | 171 genes (85%) with known drug targets |
Therapeutic Areas Covered
Neurology · Oncology · Cardiovascular · Metabolic · Immunology · Rare Disease · Ophthalmology · Dermatology · Respiratory · Hematology · Musculoskeletal · Endocrine · Infectious Disease
How RAG Works
Retrieval-Augmented Generation combines search with AI reasoning:
flowchart LR
Q["Natural Language\nQuery"]
E["BGE Embedding\n384 dimensions"]
M["Milvus Vector Search\n3.56M variants"]
R["Retrieved Evidence\nClinVar + AlphaMissense"]
C["Claude AI\nGrounded Reasoning"]
A["Grounded Answer\nwith citations"]
Q --> E --> M --> R --> C --> A
style Q fill:#E1BEE7,stroke:#7B1FA2
style E fill:#B3E5FC,stroke:#0288D1
style M fill:#00B4D8,stroke:#0077B6,color:#fff
style R fill:#B2DFDB,stroke:#00796B
style C fill:#FFE082,stroke:#FFA000
style A fill:#C8E6C9,stroke:#388E3C
- You ask a question in natural language
- Vector search finds the most relevant variants and annotations
- Evidence is retrieved from the knowledge base
- Claude reasons over the evidence to provide grounded answers
The model is instructed to cite only variants and evidence retrieved from this patient's data, so every claim in an answer can be traced back to a real genomic record rather than to the model's general knowledge.
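The retrieve-then-reason flow described above can be approximated with the standard library alone. In this sketch, `embed` is a toy hashed bag-of-words vector standing in for a real BGE model, `search` is a brute-force cosine scan standing in for a Milvus query, and `build_prompt` shows one possible way to constrain the model to numbered evidence. All three function names and the prompt wording are assumptions for illustration.

```python
import hashlib
import math
import re

DIM = 384  # matches the embedding width quoted in the diagram


def embed(text):
    """Toy hashed bag-of-words vector; a stand-in for a real BGE model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def search(query, documents, top_k=2):
    """Brute-force cosine search; a stand-in for a vector database query."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def build_prompt(question, evidence):
    """Number each evidence line so the model can cite it as [n]."""
    lines = [f"[{i}] {e}" for i, e in enumerate(evidence, start=1)]
    return (
        "Answer using only the evidence below and cite it by number.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )
```

Because the prompt contains only retrieved records, an answer that cites `[1]` or `[2]` can be checked against the exact evidence lines that were supplied.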
Example Query
Question: "What pathogenic variants does this patient have in genes associated with neurodegeneration?"
Response: The system identifies a VCP variant (chr9:35,065,263 G>A) with:
- ClinVar classification: Pathogenic
- AlphaMissense score: 0.87 (threshold: 0.564)
- Associated disease: Frontotemporal Dementia
- Druggability: Yes — VCP is a known therapeutic target
This variant becomes the input for Stage 3: Drug Discovery.
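The score comparison in the example can be written as a small classifier. The 0.564 cutoff is the threshold quoted above; the 0.34 lower cutoff and the three class labels are assumptions taken from the published AlphaMissense classification scheme rather than from this page, and `classify_alphamissense` is a hypothetical helper name.

```python
LIKELY_PATHOGENIC_MIN = 0.564  # threshold quoted in the example above
LIKELY_BENIGN_MAX = 0.34       # assumption: lower cutoff, not stated on this page


def classify_alphamissense(score: float) -> str:
    """Map an AlphaMissense score in [0, 1] to a coarse class label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score > LIKELY_PATHOGENIC_MIN:
        return "likely_pathogenic"
    if score < LIKELY_BENIGN_MAX:
        return "likely_benign"
    return "ambiguous"
```

Under these cutoffs the VCP variant's score of 0.87 falls in the likely-pathogenic class, which agrees with its ClinVar classification.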
Technology Stack
By the Numbers
| Metric | Value |
|---|---|
| Variants indexed | 3.56 million |
| Embedding dimensions | 384 |
| Query latency | < 2 seconds |
| Knowledge sources | 4 (ClinVar, AlphaMissense, Gene Panel, PDB) |
| Genes covered | 201 |
| Druggable targets | 171 (85%) |
The Precision Intelligence Engine: 11 Specialized Agents
Stage 2 is more than a search engine. It powers the Precision Intelligence Engine — a constellation of 11 intelligence agents, each specialized for a clinical domain. All agents share read-only access to the 3.56 million annotated variant vectors in the genomic_evidence collection and follow a common five-phase reasoning loop: plan, search, evaluate, synthesize, report.
| # | Agent | Port | Domain |
|---|---|---|---|
| 1 | Precision Oncology | 8503 | Molecular tumor board decision support |
| 2 | Precision Biomarker | 8502 | Biomarker discovery and analysis, biological age estimation |
| 3 | CAR-T Intelligence | 8504 | Cellular immunotherapy, response biomarker tracking |
| 4 | Imaging Intelligence | 8505 | Medical imaging AI (CT, MRI, X-ray) with NVIDIA NIMs |
| 5 | Precision Autoimmune | 8506 | 13 autoimmune conditions, flare prediction |
| 6 | Pharmacogenomics | 8507 | Drug-gene interactions, dosing algorithms |
| 7 | Cardiology Intelligence | 8527 | 6 risk calculators (ASCVD, HEART, CHA2DS2-VASc) |
| 8 | Clinical Trial Intelligence | 8538 | Trial matching and enrollment optimization |
| 9 | Rare Disease Diagnostic | 8544 | 88 rare diseases, 23 ACMG criteria |
| 10 | Neurology Intelligence | 8528 | Neurodegeneration pathways, treatment planning |
| 11 | Single-Cell Intelligence | 8540 | 57 cell types, expression profiling |
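The common five-phase loop (plan, search, evaluate, synthesize, report) might look like the following in its simplest form. This is a sketch under stated assumptions: `run_agent` is a hypothetical function, the planner naively splits the question on " and ", and `search` is any callable that returns a list of evidence strings.

```python
def run_agent(question, search):
    """Run one pass of plan -> search -> evaluate -> synthesize -> report."""
    phases = []

    # 1. Plan: break the question into sub-queries (naive split, for illustration).
    sub_queries = [p.strip() for p in question.split(" and ") if p.strip()]
    phases.append("plan")

    # 2. Search: collect evidence for every sub-query.
    hits = [hit for q in sub_queries for hit in search(q)]
    phases.append("search")

    # 3. Evaluate: drop duplicate evidence while preserving order.
    evidence = list(dict.fromkeys(hits))
    phases.append("evaluate")

    # 4. Synthesize: combine the surviving evidence into one summary.
    summary = "; ".join(evidence) if evidence else "no supporting evidence found"
    phases.append("synthesize")

    # 5. Report: return the answer together with its supporting evidence.
    phases.append("report")
    return {"question": question, "summary": summary,
            "evidence": evidence, "phases": phases}
```

Passing `search` in as a parameter reflects the shared read-only access described above: each agent can reuse the same loop while pointing it at its own collections.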
Each agent adds 10 to 15 domain-specific collections on top of the shared genomic evidence, bringing the total to more than 80 specialized collections across the platform. Agents also communicate via cross-modal triggers. For example, a suspicious lung nodule detected by the Imaging Agent can automatically initiate genomic analysis through the Oncology Agent.
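One way to picture a cross-modal trigger is a small publish/subscribe registry, where one agent emits an event and another has registered a handler for it. The event name, the `on` and `emit` helpers, and the handler below are all hypothetical; the page does not describe how the agents are actually wired together.

```python
from collections import defaultdict

# Maps an event name to the handlers registered for it.
_handlers = defaultdict(list)


def on(event_type):
    """Decorator that registers a handler for the given event name."""
    def register(fn):
        _handlers[event_type].append(fn)
        return fn
    return register


def emit(event_type, payload):
    """Call every handler registered for the event and collect the results."""
    return [fn(payload) for fn in _handlers[event_type]]


# Hypothetical handler on the oncology side, reacting to an imaging finding.
@on("imaging.suspicious_lung_nodule")
def start_genomic_analysis(payload):
    return f"oncology: genomic analysis queued for patient {payload['patient_id']}"
```

With this wiring, `emit("imaging.suspicious_lung_nodule", {"patient_id": "P001"})` would invoke the oncology handler without the imaging side needing to know which agents are listening.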
Why This Matters
Traditional variant interpretation requires:
- Manual literature review — Hours per variant
- Expert curation — Scarce clinical geneticists
- Fragmented tools — Separate databases, no unified view
The RAG pipeline delivers:
- Instant answers — Seconds, not hours
- Grounded reasoning — Every claim traceable to evidence
- Unified view — All annotations in one place
- Natural language — No bioinformatics expertise required
- 11 domain-specialized agents — Expert reasoning across oncology, cardiology, rare disease, and more
Clinical Decision Support Disclaimer
The Precision Intelligence Engine and its 11 intelligence agents are clinical decision support research tools. They are not FDA-cleared and are not intended as standalone diagnostic devices. All recommendations should be reviewed by qualified healthcare professionals. Apache 2.0 License.