Precision Oncology Agent
Source: github.com/ajones1923/precision-oncology-agent
Part of the Precision Intelligence Engine, one of 11 specialized agents sharing a common molecular foundation within the HCLS AI Factory.
Closed-loop precision oncology clinical decision support, from paired tumor-normal genomics to Molecular Tumor Board packets.
Overview
The Precision Oncology Agent transforms raw genomic data (VCF files) into actionable clinical intelligence. It combines variant annotation, evidence retrieval, therapy ranking, trial matching, and outcomes learning into a closed-loop system that generates Molecular Tumor Board (MTB) packets for precision cancer treatment decisions.
| Collection | Seed Records | Source |
|---|---|---|
| Variants | 130 | CIViC, curated actionable variants |
| Literature | 60 | PubMed oncology research |
| Therapies | 94 | FDA-approved targeted/immuno/chemo |
| Guidelines | 45 | NCCN, ASCO, ESMO guidelines |
| Trials | 55 | ClinicalTrials.gov oncology trials |
| Biomarkers | 50 | TMB, MSI-H, PD-L1, HRD, fusion panels |
| Resistance | 50 | Resistance mechanisms and bypasses |
| Pathways | 45 | Oncogenic signaling pathways |
| Outcomes | 40 | Treatment outcomes (synthetic) |
| Cases | 40 | Case snapshots (synthetic) |
| Genomic Evidence | (read-only) | Shared from Stage 2 RAG (3.5M vectors) |
| Total | 609 vectors | 11 collections (10 owned + 1 shared) |
Note: Seed data provides 609 demo-ready vectors. Running the full ingest pipelines (PubMed, ClinicalTrials.gov, CIViC) expands the knowledge base to roughly 1,700 vectors or more.
Example Queries
"What therapies target BRAF V600E in melanoma?"
"Compare EGFR TKI generations for NSCLC"
"Resistance mechanisms to osimertinib"
"NCCN recommendations for HER2+ breast cancer"
"Match clinical trials for MSI-H colorectal cancer"
"What is the role of TMB as a predictive biomarker for immunotherapy?"
All queries return grounded, cross-collection answers with clickable PubMed and ClinicalTrials.gov citations.
Comparative Analysis Mode
Comparative queries are auto-detected and produce structured side-by-side analysis with markdown tables, efficacy data, safety profiles, and clinical context.
"Compare osimertinib vs erlotinib for EGFR-mutant NSCLC" -> TKI generation comparison
"BRAF+MEK inhibition vs immunotherapy for melanoma" -> Modality comparison
"Pembrolizumab versus nivolumab" -> Product vs product
"Compare PARP inhibitors for BRCA-mutant ovarian cancer" -> Drug class comparison
How it works: The engine detects "vs/versus/compare/difference between" keywords, parses two entities, resolves each against the knowledge graph (~40 actionable targets, ~30 therapy mappings, ~20 resistance mechanisms), runs dual retrievals with per-entity filtering, identifies shared/head-to-head evidence, and builds a comparative prompt instructing Claude to produce structured tables with efficacy, safety, biomarker, and guideline data.
| Feature | Detail |
|---|---|
| Entity types | Genes, drugs, drug classes, biomarkers, cancer types, pathways |
| Entity resolution | ~40 gene targets + ~30 therapy mappings + 50+ aliases |
| Dual retrieval | ~400 ms for 46 results (24 for one entity, 22 for the other) |
| Shared evidence | Head-to-head trials identified and highlighted separately |
| Structured output | Comparison table, MoA differences, efficacy, safety, guideline recs |
| Fallback | Unrecognized entities gracefully fall back to normal query path |
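The keyword detection and two-entity parsing described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the regular expressions, the `parse_comparison` name, and the rule that strips a trailing "for ..." or "in ..." context clause are all assumptions. Queries that name only one entity or class (for example "Compare PARP inhibitors for ...") return `None` here, which corresponds to the fallback to the normal query path.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns; the agent's real keyword list and parsing rules may differ.
_PATTERNS = [
    re.compile(r"^\s*compare\s+(?P<a>.+?)\s+(?:vs\.?|versus|and|with)\s+(?P<b>.+?)\s*$", re.I),
    re.compile(r"^\s*difference\s+between\s+(?P<a>.+?)\s+and\s+(?P<b>.+?)\s*$", re.I),
    re.compile(r"^\s*(?P<a>.+?)\s+(?:vs\.?|versus)\s+(?P<b>.+?)\s*$", re.I),
]

# Trailing clinical context such as "for EGFR-mutant NSCLC" is not part of entity B.
_CONTEXT = re.compile(r"\s+(?:for|in)\s+.+$", re.I)


def parse_comparison(query: str) -> Optional[Tuple[str, str]]:
    """Return (entity_a, entity_b) for a two-entity comparative query, else None."""
    for pattern in _PATTERNS:
        match = pattern.match(query)
        if match:
            entity_a = match.group("a").strip()
            entity_b = _CONTEXT.sub("", match.group("b")).strip()
            return entity_a, entity_b
    return None  # caller falls back to the normal (non-comparative) query path
```

Each returned entity would then be resolved against the knowledge graph before the dual retrieval runs.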
MTB Packet Generation
The Precision Oncology Agent generates structured Molecular Tumor Board packets from patient data:
- VCF Upload -- raw VCF text is parsed, extracting PASS variants with gene, consequence, and position
- Variant Annotation -- each variant is classified against ~40 actionable targets using AMP/ASCO/CAP evidence tiers (A-D)
- Evidence Lookup -- RAG retrieval for each actionable variant across literature, therapies, and guidelines
- Therapy Ranking -- evidence-level-sorted therapy recommendations with resistance flags and contraindication checks
- Trial Matching -- hybrid deterministic + semantic search against oncology clinical trials
- Open Questions -- VUS variants, missing biomarkers, and evidence gaps flagged for MTB discussion
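As an illustration of the VCF Upload step, the sketch below keeps only records whose FILTER column is PASS and pulls gene and consequence from the INFO field. It assumes annotations follow the SnpEff `ANN` layout (`Allele|Annotation|Impact|Gene_Name|...`); the source does not say which annotation format the agent expects, and the `Variant` class and `parse_pass_variants` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    consequence: str


def parse_pass_variants(vcf_text: str) -> List[Variant]:
    """Extract PASS variants from raw VCF text (assumes SnpEff-style ANN annotations)."""
    variants: List[Variant] = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines, and the column header
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt, _qual, filt, info = fields[:8]
        if filt != "PASS":
            continue
        gene, consequence = "", ""
        for entry in info.split(";"):
            if entry.startswith("ANN="):
                # Use the first annotation only: Allele|Annotation|Impact|Gene_Name|...
                parts = entry[4:].split(",")[0].split("|")
                if len(parts) >= 4:
                    consequence, gene = parts[1], parts[3]
        variants.append(Variant(chrom, int(pos), ref, alt, gene, consequence))
    return variants
```

Records without an `ANN` entry are kept with empty gene and consequence so that downstream steps can flag them rather than silently drop them.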
The resulting MTB packet is exported as Markdown, JSON, PDF, or FHIR R4 DiagnosticReport Bundle.
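For orientation, a FHIR R4 export of this kind could look roughly like the sketch below. It is not the agent's exporter: the `build_diagnostic_report_bundle` function, the `collection` bundle type, and the choice of LOINC 51969-4 ("Genetic analysis report") as the report code are assumptions made for illustration. Only the fields that FHIR R4 requires on a DiagnosticReport (`status` and `code`) plus a subject and conclusion are populated.

```python
import json
from typing import Any, Dict


def build_diagnostic_report_bundle(patient_id: str, conclusion: str) -> Dict[str, Any]:
    """Wrap a minimal DiagnosticReport in a FHIR R4 Bundle (illustrative only)."""
    report = {
        "resourceType": "DiagnosticReport",
        "status": "final",
        "code": {
            "coding": [
                {
                    # Assumed code; the real exporter may use a different LOINC term.
                    "system": "http://loinc.org",
                    "code": "51969-4",
                    "display": "Genetic analysis report",
                }
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "conclusion": conclusion,
    }
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": report}],
    }


if __name__ == "__main__":
    bundle = build_diagnostic_report_bundle("example-001", "BRAF V600E detected")
    print(json.dumps(bundle, indent=2))
```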
Architecture
VCF / Patient Data
    |
    v
[Case Manager] ---- VCF parsing, variant extraction, actionability classification
    |
    v
[Knowledge Graph Lookup]
    (~40 actionable targets, ~30 therapies, ~20 resistance, ~10 pathways, ~15 biomarkers)
    |
    v
[Parallel 11-Collection RAG Search] --- BGE-small-en-v1.5 (384-dim)
    |             |              |             |             |
    v             v              v             v             v
 Variants     Literature     Therapies     Guidelines     Trials
   130            60             94            45            55
    |             |              |             |             |
Biomarkers    Resistance     Pathways      Outcomes       Cases
    50            50             45            40            40
    |                                                        |
    +-------------- genomic_evidence (3.5M, read-only) ----+
    |
    v
[Query Expansion] (12 maps, ~120 keywords -> ~700 terms)
    |
    v
[Evidence Synthesis]
    |
    +--- [Therapy Ranker] -- evidence-level sort, resistance check, contraindication
    +--- [Trial Matcher] --- deterministic filter + semantic search + composite scoring
    +--- [Cross-Modal] ----- variant severity -> imaging, variant actionability -> drug discovery
    |
    v
[Claude Sonnet 4.6 LLM] -> Grounded response with citations
    |
    v
[MTB Packet / Clinical Report]
    |
    +--- Markdown report
    +--- JSON export
    +--- PDF (NVIDIA-themed, ReportLab)
    +--- FHIR R4 DiagnosticReport Bundle (SNOMED CT, LOINC coded)
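The Therapy Ranker stage in the diagram (evidence-level sort plus resistance check) can be sketched as below. The `TherapyCandidate` class, the `rank_therapies` function, and the policy of sorting resistance-flagged therapies after unflagged ones are assumptions for illustration; the contraindication check is not modeled, and the agent's actual scoring may differ.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterable, List, Set

# Evidence levels A-D, strongest first; unknown levels sort last.
LEVEL_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3}


@dataclass(frozen=True)
class TherapyCandidate:
    drug: str
    evidence_level: str
    # Variants known to confer resistance to this drug (hypothetical field).
    resistance_markers: FrozenSet[str] = frozenset()
    resistance_flag: bool = False


def rank_therapies(
    candidates: Iterable[TherapyCandidate], patient_variants: Set[str]
) -> List[TherapyCandidate]:
    """Flag therapies with a matching resistance variant, then sort by evidence level."""
    flagged = [
        replace(c, resistance_flag=bool(c.resistance_markers & patient_variants))
        for c in candidates
    ]
    return sorted(
        flagged,
        key=lambda c: (c.resistance_flag, LEVEL_ORDER.get(c.evidence_level, 99), c.drug),
    )
```

With a patient carrying EGFR T790M, a candidate whose `resistance_markers` contains "EGFR T790M" is kept in the output but flagged and moved below unflagged options, so the tumor board still sees it together with the reason it was demoted.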
Built on the HCLS AI Factory platform:
- Vector DB: Milvus 2.4 with IVF_FLAT/COSINE indexes (nlist=1024, nprobe=16)
- Embeddings: BGE-small-en-v1.5 (384-dim)
- LLM: Claude Sonnet 4.6 (Anthropic API)
- UI: Streamlit MTB Workbench (port 8526)
- API: FastAPI REST server (port 8527)
- Hardware target: NVIDIA DGX Spark ($4,699)
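The Milvus index settings listed above translate into two parameter dictionaries, sketched here in plain Python. The constant names and the `probed_fraction` helper are illustrative; with pymilvus, dictionaries of this shape are what `Collection.create_index(field_name=..., index_params=...)` and `Collection.search(..., param=...)` accept, but the embedding field name and the agent's actual wiring are not given in the source.

```python
from typing import Any, Dict

# Index built once per collection: IVF_FLAT partitions vectors into nlist clusters.
INDEX_PARAMS: Dict[str, Any] = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 1024},
}

# Per-query setting: how many of those clusters are scanned for each search.
SEARCH_PARAMS: Dict[str, Any] = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16},
}


def probed_fraction(index_params: Dict[str, Any], search_params: Dict[str, Any]) -> float:
    """Fraction of IVF clusters scanned per query (nprobe / nlist)."""
    return search_params["params"]["nprobe"] / index_params["params"]["nlist"]


if __name__ == "__main__":
    print(f"{probed_fraction(INDEX_PARAMS, SEARCH_PARAMS):.2%} of clusters probed per query")
```

With nlist=1024 and nprobe=16, each query scans 16 of 1,024 clusters. Raising nprobe trades latency for recall.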
Setup
Prerequisites
- Python 3.10+
- Milvus 2.4 running on localhost:19530
- ANTHROPIC_API_KEY environment variable (or in rag-chat-pipeline/.env)
Install
1. Create Collections and Seed Data
This creates 11 Milvus collections (10 owned + 1 read-only) with IVF_FLAT indexes and seeds actionable variants, therapies, guidelines, resistance mechanisms, pathways, and biomarker panels.
2. Ingest PubMed Literature (~15 min)
Fetches oncology abstracts via NCBI E-utilities, classifies by cancer type and gene, embeds with BGE-small, and stores in onco_literature.
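To make the E-utilities step concrete, the sketch below builds an ESearch request URL for PubMed without sending it. The endpoint and the `db`, `term`, `retmax`, `retmode`, and `api_key` parameters are part of the public NCBI E-utilities interface; the `build_esearch_url` function and the example search term are hypothetical and are not taken from the agent's `pubmed_client.py`.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 100, api_key: str = "") -> str:
    """Build a PubMed ESearch URL that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        # An NCBI API key is optional; it raises the allowed request rate.
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("BRAF V600E melanoma", retmax=5))
```

The returned PMIDs would then be passed to EFetch to retrieve abstracts before classification and embedding.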
3. Ingest Clinical Trials (~3 min)
Fetches oncology trials via ClinicalTrials.gov API v2, extracts biomarker criteria, embeds, and stores in onco_trials.
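A request against the ClinicalTrials.gov API v2 studies endpoint could be assembled as in the sketch below, which builds the URL without sending it. The `query.cond`, `query.term`, `filter.overallStatus`, `pageSize`, and `format` parameters belong to the public v2 API; the `build_trials_url` function, its defaults, and the recruiting-only filter are assumptions rather than the behavior of the agent's `clinical_trials_parser.py`.

```python
from urllib.parse import urlencode

STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_trials_url(
    condition: str,
    term: str = "",
    page_size: int = 20,
    recruiting_only: bool = True,
) -> str:
    """Build a ClinicalTrials.gov v2 studies query URL."""
    params = {"query.cond": condition, "pageSize": page_size, "format": "json"}
    if term:
        # Free-text term, e.g. a biomarker mentioned in eligibility criteria.
        params["query.term"] = term
    if recruiting_only:
        params["filter.overallStatus"] = "RECRUITING"
    return f"{STUDIES_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_trials_url("colorectal cancer", term="MSI-H"))
```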
4. Ingest CIViC Variants (~2 min)
Fetches clinically actionable variants from the CIViC database, maps evidence levels to AMP/ASCO/CAP tiers, and stores in onco_variants.
5. Validate
Runs end-to-end tests: collection stats, single-collection search, multi-collection search_all(), filtered search, and demo queries.
6. Launch UI
7. Launch API
Project Structure
precision_oncology_agent/agent/
├── src/
│   ├── models.py  # Pydantic data models (538 lines)
│   ├── collections.py  # 11 Milvus collection schemas + manager (606 lines)
│   ├── knowledge.py  # Knowledge graph: targets, therapies, resistance (1,662 lines)
│   ├── query_expansion.py  # 12 expansion maps (812 lines)
│   ├── rag_engine.py  # Multi-collection RAG + comparative analysis (899 lines)
│   ├── agent.py  # Plan-search-synthesize pipeline (553 lines)
│   ├── case_manager.py  # VCF parsing + MTB packet generation (509 lines)
│   ├── trial_matcher.py  # Hybrid deterministic + semantic matching (513 lines)
│   ├── therapy_ranker.py  # Evidence-based therapy ranking (748 lines)
│   ├── cross_modal.py  # Cross-modal triggers to imaging + drug discovery (383 lines)
│   ├── export.py  # Markdown, JSON, PDF, FHIR R4 export (1,055 lines)
│   ├── metrics.py  # Prometheus metrics (362 lines)
│   ├── scheduler.py  # Data ingestion scheduler (263 lines)
│   ├── ingest/
│   │   ├── base.py  # Base ingest pipeline (249 lines)
│   │   ├── civic_parser.py  # CIViC actionable variant ingest (340 lines)
│   │   ├── oncokb_parser.py  # OncoKB data parser (104 lines)
│   │   ├── literature_parser.py  # PubMed E-utilities ingest (248 lines)
│   │   ├── clinical_trials_parser.py  # ClinicalTrials.gov API v2 (279 lines)
│   │   ├── guideline_parser.py  # NCCN/ASCO/ESMO guideline parser (168 lines)
│   │   ├── pathway_parser.py  # Signaling pathway parser (121 lines)
│   │   ├── resistance_parser.py  # Resistance mechanism parser (125 lines)
│   │   └── outcome_parser.py  # Outcome record parser (158 lines)
│   └── utils/
│       ├── vcf_parser.py  # VCF file parsing utilities (361 lines)
│       └── pubmed_client.py  # NCBI E-utilities HTTP client (296 lines)
├── app/
│   └── oncology_ui.py  # Streamlit MTB Workbench (758 lines)
├── api/
│   ├── main.py  # FastAPI REST server (393 lines)
│   └── routes/
│       ├── meta_agent.py  # /api/ask, /api/deep-research (169 lines)
│       ├── cases.py  # /api/cases, /api/cases/{id}/mtb (234 lines)
│       ├── trials.py  # /api/trials/match (153 lines)
│       ├── reports.py  # /api/reports/{format} (236 lines)
│       └── events.py  # /api/events, cross-modal triggers (89 lines)
├── config/
│   └── settings.py  # Pydantic BaseSettings (134 lines)
├── tests/
│   └── conftest.py  # Test fixtures (214 lines)
├── requirements.txt
└── LICENSE  # Apache 2.0
64 Python files | ~20,000 lines of code | Apache 2.0
Knowledge Graph
| Category | Count | Examples |
|---|---|---|
| Actionable Targets | ~40 | BRAF, EGFR, ALK, ROS1, KRAS G12C, HER2, NTRK, RET, MET, FGFR, PIK3CA, BRCA1/2, IDH1/2, ESR1, TP53, PTEN... |
| Therapy Mappings | ~30 | vemurafenib, osimertinib, pembrolizumab, sotorasib, lorlatinib, olaparib, trastuzumab deruxtecan... |
| Resistance Mechanisms | ~20 | EGFR T790M, EGFR C797S, MET amplification bypass, BRAF amplification, NRAS activation, BRCA reversion... |
| Signaling Pathways | ~10 | MAPK, PI3K/AKT/mTOR, DDR, cell cycle, apoptosis, Wnt, Notch, Hedgehog, JAK/STAT, angiogenesis |
| Biomarker Panels | ~15 | TMB-H, MSI-H/dMMR, PD-L1 TPS/CPS, HRD, NTRK fusion, ALK rearrangement, ROS1 fusion, ctDNA, FGFR |
| Entity Aliases | 50+ | Cancer type aliases (lung->NSCLC, CRC->COLORECTAL), drug brand names, gene synonyms |
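Alias resolution of the kind listed in the last row can be sketched as a case-insensitive lookup. The two cancer-type entries come from the table above; the brand-name and gene-synonym entries, the `ALIASES` name, and the `resolve_entity` function are illustrative assumptions, not the contents of the agent's `knowledge.py`.

```python
from typing import Dict

# Keys are lowercased aliases; values are the canonical names used for retrieval.
ALIASES: Dict[str, str] = {
    "lung": "NSCLC",             # cancer type alias (from the table above)
    "crc": "COLORECTAL",         # cancer type alias (from the table above)
    "tagrisso": "osimertinib",   # drug brand name (illustrative entry)
    "keytruda": "pembrolizumab", # drug brand name (illustrative entry)
    "erbb2": "HER2",             # gene synonym (illustrative entry)
}


def resolve_entity(text: str) -> str:
    """Map an alias to its canonical name; unknown inputs are returned unchanged."""
    cleaned = text.strip()
    return ALIASES.get(cleaned.lower(), cleaned)
```

Returning unknown inputs unchanged keeps the lookup safe to call on every parsed entity, leaving the decision about unrecognized names to the caller.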
Performance
Measured on NVIDIA DGX Spark (GB10 GPU, 128 GB unified LPDDR5x memory, 20 ARM cores):
| Metric | Value |
|---|---|
| Single-collection search | < 200 ms |
| Cross-collection RAG query (11 collections) | < 5 s |
| MTB packet generation (full workflow) | < 30 s |
| Trial matching (deterministic + semantic) | < 10 s |
| Therapy ranking (variant + biomarker driven) | < 5 s |
| Comparative dual retrieval | ~400 ms |
| Full RAG query (search + Claude) | ~24 s |
| Cosine similarity scores | 0.72 - 0.92 |
| Embedding dimension | 384 (BGE-small-en-v1.5) |
Service Ports
| Service | Port | Protocol |
|---|---|---|
| Streamlit MTB Workbench | 8526 | HTTP |
| FastAPI REST Server | 8527 | HTTP |
| Milvus gRPC | 19530 | gRPC |
Credits
- Adam Jones
- Apache 2.0 License
Clinical Decision Support Disclaimer
This agent is a clinical decision support research tool. It is not FDA-cleared and is not intended as a standalone diagnostic device. All recommendations should be reviewed by qualified healthcare professionals.
