Precision Oncology Agent¶
Part of the Precision Intelligence Network — one of 11 specialized agents sharing a common molecular foundation within the HCLS AI Factory.
Closed-loop precision oncology clinical decision support -- from paired tumor-normal genomics to Molecular Tumor Board packets.
Source: github.com/ajones1923/precision-oncology-agent
Overview¶
The Precision Oncology Agent transforms raw genomic data (VCF files) into actionable clinical intelligence. It combines variant annotation, evidence retrieval, therapy ranking, trial matching, and outcomes learning into a closed-loop system that generates Molecular Tumor Board (MTB) packets for precision cancer treatment decisions.
| Collection | Seed Records | Source |
|---|---|---|
| Variants | 130 | CIViC, curated actionable variants |
| Literature | 60 | PubMed oncology research |
| Therapies | 94 | FDA-approved targeted/immuno/chemo |
| Guidelines | 45 | NCCN, ASCO, ESMO guidelines |
| Trials | 55 | ClinicalTrials.gov oncology trials |
| Biomarkers | 50 | TMB, MSI-H, PD-L1, HRD, fusion panels |
| Resistance | 50 | Resistance mechanisms and bypasses |
| Pathways | 45 | Oncogenic signaling pathways |
| Outcomes | 40 | Treatment outcomes (synthetic) |
| Cases | 40 | Case snapshots (synthetic) |
| Genomic Evidence | (read-only) | Shared from Stage 2 RAG (3.5M vectors) |
| Total | 609 vectors | 11 collections (10 owned + 1 shared) |
Note: Seed data provides 609 demo-ready vectors. Running the full ingest pipelines (PubMed, ClinicalTrials.gov, CIViC) expands the knowledge base to ~1,700+ vectors.
Example Queries¶
"What therapies target BRAF V600E in melanoma?"
"Compare EGFR TKI generations for NSCLC"
"Resistance mechanisms to osimertinib"
"NCCN recommendations for HER2+ breast cancer"
"Match clinical trials for MSI-H colorectal cancer"
"What is the role of TMB as a predictive biomarker for immunotherapy?"
All queries return grounded, cross-collection answers with clickable PubMed and ClinicalTrials.gov citations.
Comparative Analysis Mode¶
Comparative queries are auto-detected and produce structured side-by-side analysis with markdown tables, efficacy data, safety profiles, and clinical context.
"Compare osimertinib vs erlotinib for EGFR-mutant NSCLC" -> TKI generation comparison
"BRAF+MEK inhibition vs immunotherapy for melanoma" -> Modality comparison
"Pembrolizumab versus nivolumab" -> Product vs product
"Compare PARP inhibitors for BRCA-mutant ovarian cancer" -> Drug class comparison
How it works: The engine detects "vs/versus/compare/difference between" keywords, parses two entities, resolves each against the knowledge graph (~40 actionable targets, ~30 therapy mappings, ~20 resistance mechanisms), runs dual retrievals with per-entity filtering, identifies shared/head-to-head evidence, and builds a comparative prompt instructing Claude to produce structured tables with efficacy, safety, biomarker, and guideline data.
| Feature | Detail |
|---|---|
| Entity types | Genes, drugs, drug classes, biomarkers, cancer types, pathways |
| Entity resolution | ~40 gene targets + ~30 therapy mappings + 50+ aliases |
| Dual retrieval | ~400 ms for 46 results (24 + 22 per entity) |
| Shared evidence | Head-to-head trials identified and highlighted separately |
| Structured output | Comparison table, MoA differences, efficacy, safety, guideline recs |
| Fallback | Unrecognized entities gracefully fall back to normal query path |
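The detection step described above can be sketched in a few lines. This is an illustrative approximation, not the engine's actual code: the patterns and the `detect_comparison` name are assumptions, and the real engine additionally resolves each parsed entity against the knowledge graph before running dual retrievals.

```python
import re

# Hypothetical sketch of comparative-query detection: spot
# "vs/versus/compare/difference between" phrasing and split out
# the two entities to compare.
COMPARE_PATTERNS = [
    re.compile(r"compare\s+(.+?)\s+(?:vs\.?|versus|and|with|to)\s+(.+)", re.I),
    re.compile(r"(.+?)\s+(?:vs\.?|versus)\s+(.+)", re.I),
    re.compile(r"difference between\s+(.+?)\s+and\s+(.+)", re.I),
]

def detect_comparison(query: str):
    """Return (entity_a, entity_b) if the query looks comparative, else None."""
    for pattern in COMPARE_PATTERNS:
        match = pattern.search(query)
        if match:
            a, b = (part.strip(" ?.") for part in match.groups())
            return a, b
    return None  # fall back to the normal single-entity query path
```

Unrecognized or non-comparative queries return `None`, which mirrors the graceful fallback noted in the table.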
MTB Packet Generation¶
The Precision Oncology Agent generates structured Molecular Tumor Board packets from patient data:
- VCF Upload -- raw VCF text is parsed, extracting PASS variants with gene, consequence, and position
- Variant Annotation -- each variant is classified against ~40 actionable targets using AMP/ASCO/CAP evidence tiers (A-D)
- Evidence Lookup -- RAG retrieval for each actionable variant across literature, therapies, and guidelines
- Therapy Ranking -- evidence-level-sorted therapy recommendations with resistance flags and contraindication checks
- Trial Matching -- hybrid deterministic + semantic search against oncology clinical trials
- Open Questions -- VUS variants, missing biomarkers, and evidence gaps flagged for MTB discussion
The resulting MTB packet is exported as Markdown, JSON, PDF, or FHIR R4 DiagnosticReport Bundle.
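The "VCF Upload" step above can be sketched as follows. The field layout follows the VCF 4.x specification (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the gene lookup from the INFO column is simplified, and the `GENE` annotation key is an assumption rather than the pipeline's actual schema.

```python
# Minimal sketch of PASS-variant extraction from raw VCF text.
def extract_pass_variants(vcf_text: str):
    variants = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip meta headers and the column-definition line
        fields = line.split("\t")
        if len(fields) < 8 or fields[6] != "PASS":
            continue  # keep only variants that passed all filters
        chrom, pos, _, ref, alt = fields[:5]
        # Parse key=value pairs from the INFO column.
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        variants.append({
            "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "gene": info.get("GENE"),  # assumed annotation key
        })
    return variants
```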
Architecture¶
VCF / Patient Data
|
v
[Case Manager] ---- VCF parsing, variant extraction, actionability classification
|
v
[Knowledge Graph Lookup]
(~40 actionable targets, ~30 therapies, ~20 resistance, ~10 pathways, ~15 biomarkers)
|
v
[Parallel 11-Collection RAG Search] --- BGE-small-en-v1.5 (384-dim)
| | | | |
v v v v v
Variants Literature Therapies Guidelines Trials
130 60 94 45 55
| | | | |
Biomarkers Resistance Pathways Outcomes Cases
50 50 45 40 40
| |
+-------------- genomic_evidence (3.5M, read-only) ----+
|
v
[Query Expansion] (12 maps, ~120 keywords -> ~700 terms)
|
v
[Evidence Synthesis]
|
+--- [Therapy Ranker] -- evidence-level sort, resistance check, contraindication
+--- [Trial Matcher] --- deterministic filter + semantic search + composite scoring
+--- [Cross-Modal] ----- variant severity -> imaging, variant actionability -> drug discovery
|
v
[Claude Sonnet 4.6 LLM] -> Grounded response with citations
|
v
[MTB Packet / Clinical Report]
|
+--- Markdown report
+--- JSON export
+--- PDF (NVIDIA-themed, ReportLab)
+--- FHIR R4 DiagnosticReport Bundle (SNOMED CT, LOINC coded)
Built on the HCLS AI Factory platform:
- Vector DB: Milvus 2.4 with IVF_FLAT/COSINE indexes (nlist=1024, nprobe=16)
- Embeddings: BGE-small-en-v1.5 (384-dim)
- LLM: Claude Sonnet 4.6 (Anthropic API)
- UI: Streamlit MTB Workbench (port 8526)
- API: FastAPI REST server (port 8527)
- Hardware target: NVIDIA DGX Spark ($3,999)
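The index and search settings listed above translate into the dict form pymilvus expects. This is a sketch of the parameters only; the collection and field names in the comments are illustrative.

```python
# Milvus 2.4 index/search settings from the platform stack above.
EMBEDDING_DIM = 384  # BGE-small-en-v1.5

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": 1024},  # number of coarse clusters built at index time
}

search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16},  # clusters probed per query (recall/latency trade-off)
}

# With pymilvus these would be applied roughly as:
#   collection.create_index("embedding", index_params)
#   collection.search(vectors, "embedding", search_params, limit=10)
```

Raising `nprobe` improves recall at the cost of latency; the sub-200 ms single-collection figure in the Performance section reflects the `nprobe=16` setting.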
Setup¶
Prerequisites¶
- Python 3.10+
- Milvus 2.4 running on `localhost:19530`
- `ANTHROPIC_API_KEY` environment variable (or in `rag-chat-pipeline/.env`)
Install¶
1. Create Collections and Seed Data¶
This creates 11 Milvus collections (10 owned + 1 read-only) with IVF_FLAT indexes and seeds actionable variants, therapies, guidelines, resistance mechanisms, pathways, and biomarker panels.
2. Ingest PubMed Literature (~15 min)¶
Fetches oncology abstracts via NCBI E-utilities, classifies by cancer type and gene, embeds with BGE-small, and stores in onco_literature.
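The E-utilities flow behind this step pairs `esearch` (find PMIDs for a query) with `efetch` (pull the abstracts). A minimal URL-building sketch, assuming the standard NCBI endpoints; the query term shown is an example, not the pipeline's own:

```python
from urllib.parse import urlencode

# NCBI E-utilities base; esearch.fcgi and efetch.fcgi are the
# standard endpoints for PMID search and record retrieval.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term: str, retmax: int = 50) -> str:
    """Build an esearch URL returning up to retmax PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def efetch_url(pmids: list[str]) -> str:
    """Build an efetch URL returning abstract XML for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids),
              "rettype": "abstract", "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```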
3. Ingest Clinical Trials (~3 min)¶
Fetches oncology trials via ClinicalTrials.gov API v2, extracts biomarker criteria, embeds, and stores in onco_trials.
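The biomarker-criteria extraction in this step amounts to scanning a trial's free-text eligibility criteria for known marker mentions. A sketch with a small illustrative subset of patterns (the real pipeline's marker list and matching rules may differ):

```python
import re

# Illustrative biomarker patterns; the actual pipeline covers far more.
BIOMARKER_PATTERNS = {
    "EGFR": r"\bEGFR\b",
    "MSI-H": r"\bMSI[- ]?H(igh)?\b|\bmicrosatellite instability[- ]high\b",
    "PD-L1": r"\bPD[- ]?L1\b",
    "TMB-H": r"\bTMB\b|\btumor mutational burden\b",
}

def extract_biomarkers(eligibility_text: str):
    """Return the sorted list of biomarkers mentioned in the criteria text."""
    found = [name for name, pat in BIOMARKER_PATTERNS.items()
             if re.search(pat, eligibility_text, re.I)]
    return sorted(found)
```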
4. Ingest CIViC Variants (~2 min)¶
Fetches clinically actionable variants from the CIViC database, maps evidence levels to AMP/ASCO/CAP tiers, and stores in onco_variants.
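The evidence-level mapping in this step can be sketched as a lookup. CIViC rates evidence A through E; grouping A/B into Tier I and C/D into Tier II is a common convention, but the pipeline's exact mapping is an assumption here:

```python
# Hypothetical CIViC evidence level -> AMP/ASCO/CAP tier mapping.
CIVIC_TO_AMP_TIER = {
    "A": "Tier I",    # validated clinical evidence
    "B": "Tier I",    # clinical evidence
    "C": "Tier II",   # case study
    "D": "Tier II",   # preclinical
    "E": "Tier III",  # inferential evidence
}

def amp_tier(civic_level: str) -> str:
    """Map a CIViC evidence level to an AMP tier, defaulting to Tier III."""
    return CIVIC_TO_AMP_TIER.get(civic_level.upper(), "Tier III")
```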
5. Validate¶
Runs end-to-end tests: collection stats, single-collection search, multi-collection search_all(), filtered search, and demo queries.
6. Launch UI¶
7. Launch API¶
Project Structure¶
precision_oncology_agent/agent/
├── src/
│ ├── models.py # Pydantic data models (538 lines)
│ ├── collections.py # 11 Milvus collection schemas + manager (606 lines)
│ ├── knowledge.py # Knowledge graph: targets, therapies, resistance (1,662 lines)
│ ├── query_expansion.py # 12 expansion maps (812 lines)
│ ├── rag_engine.py # Multi-collection RAG + comparative analysis (899 lines)
│ ├── agent.py # Plan-search-synthesize pipeline (553 lines)
│ ├── case_manager.py # VCF parsing + MTB packet generation (509 lines)
│ ├── trial_matcher.py # Hybrid deterministic + semantic matching (513 lines)
│ ├── therapy_ranker.py # Evidence-based therapy ranking (748 lines)
│ ├── cross_modal.py # Cross-modal triggers to imaging + drug discovery (383 lines)
│ ├── export.py # Markdown, JSON, PDF, FHIR R4 export (1,055 lines)
│ ├── metrics.py # Prometheus metrics (362 lines)
│ ├── scheduler.py # Data ingestion scheduler (263 lines)
│ ├── ingest/
│ │ ├── base.py # Base ingest pipeline (249 lines)
│ │ ├── civic_parser.py # CIViC actionable variant ingest (340 lines)
│ │ ├── oncokb_parser.py # OncoKB data parser (104 lines)
│ │ ├── literature_parser.py # PubMed E-utilities ingest (248 lines)
│ │ ├── clinical_trials_parser.py # ClinicalTrials.gov API v2 (279 lines)
│ │ ├── guideline_parser.py # NCCN/ASCO/ESMO guideline parser (168 lines)
│ │ ├── pathway_parser.py # Signaling pathway parser (121 lines)
│ │ ├── resistance_parser.py # Resistance mechanism parser (125 lines)
│ │ └── outcome_parser.py # Outcome record parser (158 lines)
│ └── utils/
│ ├── vcf_parser.py # VCF file parsing utilities (361 lines)
│ └── pubmed_client.py # NCBI E-utilities HTTP client (296 lines)
├── app/
│ └── oncology_ui.py # Streamlit MTB Workbench (758 lines)
├── api/
│ ├── main.py # FastAPI REST server (393 lines)
│ └── routes/
│ ├── meta_agent.py # /api/ask, /api/deep-research (169 lines)
│ ├── cases.py # /api/cases, /api/cases/{id}/mtb (234 lines)
│ ├── trials.py # /api/trials/match (153 lines)
│ ├── reports.py # /api/reports/{format} (236 lines)
│ └── events.py # /api/events, cross-modal triggers (89 lines)
├── config/
│ └── settings.py # Pydantic BaseSettings (134 lines)
├── tests/
│ └── conftest.py # Test fixtures (214 lines)
├── requirements.txt
└── LICENSE # Apache 2.0
64 Python files | ~20,000 lines of code | Apache 2.0
Knowledge Graph¶
| Category | Count | Examples |
|---|---|---|
| Actionable Targets | ~40 | BRAF, EGFR, ALK, ROS1, KRAS G12C, HER2, NTRK, RET, MET, FGFR, PIK3CA, BRCA1/2, IDH1/2, ESR1, TP53, PTEN... |
| Therapy Mappings | ~30 | vemurafenib, osimertinib, pembrolizumab, sotorasib, lorlatinib, olaparib, trastuzumab deruxtecan... |
| Resistance Mechanisms | ~20 | EGFR T790M, EGFR C797S, MET amplification bypass, BRAF amplification, NRAS activation, BRCA reversion... |
| Signaling Pathways | ~10 | MAPK, PI3K/AKT/mTOR, DDR, cell cycle, apoptosis, Wnt, Notch, Hedgehog, JAK/STAT, angiogenesis |
| Biomarker Panels | ~15 | TMB-H, MSI-H/dMMR, PD-L1 TPS/CPS, HRD, NTRK fusion, ALK rearrangement, ROS1 fusion, ctDNA, FGFR |
| Entity Aliases | ~50+ | Cancer type aliases (lung->NSCLC, CRC->COLORECTAL), drug brand names, gene synonyms |
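The alias resolution in the Entity Aliases row can be sketched as a normalization lookup. These few entries are illustrative; the real graph holds 50+ aliases spanning cancer types, brand names, and gene synonyms.

```python
# Illustrative subset of the entity alias table.
ALIASES = {
    "lung": "NSCLC",
    "lung cancer": "NSCLC",
    "crc": "COLORECTAL",
    "tagrisso": "osimertinib",   # brand name -> generic
    "keytruda": "pembrolizumab",
    "her2": "ERBB2",             # common name -> gene symbol
}

def resolve_entity(mention: str) -> str:
    """Normalize a free-text mention to its canonical entity name."""
    key = mention.strip().lower()
    return ALIASES.get(key, mention)  # unknown mentions pass through unchanged
```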
Performance¶
Measured on NVIDIA DGX Spark (GB10 GPU, 128GB unified LPDDR5x memory, 20 ARM cores):
| Metric | Value |
|---|---|
| Single-collection search | < 200 ms |
| Cross-collection RAG query (11 collections) | < 5 s |
| MTB packet generation (full workflow) | < 30 s |
| Trial matching (deterministic + semantic) | < 10 s |
| Therapy ranking (variant + biomarker driven) | < 5 s |
| Comparative dual retrieval | ~400 ms |
| Full RAG query (search + Claude) | ~24 s |
| Cosine similarity scores | 0.72 - 0.92 |
| Embedding dimension | 384 (BGE-small-en-v1.5) |
Service Ports¶
| Service | Port | Protocol |
|---|---|---|
| Streamlit MTB Workbench | 8526 | HTTP |
| FastAPI REST Server | 8527 | HTTP |
| Milvus gRPC | 19530 | gRPC |
Credits¶
- Adam Jones
- Apache 2.0 License
Clinical Decision Support Disclaimer
This agent is a clinical decision support research tool. It is not FDA-cleared and is not intended as a standalone diagnostic device. All recommendations should be reviewed by qualified healthcare professionals. Apache 2.0 License.