Precision Oncology Agent¶

Source: github.com/ajones1923/precision-oncology-agent

Part of the Precision Intelligence Engine — one of 11 specialized agents sharing a common molecular foundation within the HCLS AI Factory.

Closed-loop precision oncology clinical decision support -- from paired tumor-normal genomics to Molecular Tumor Board packets. Part of the HCLS AI Factory.

Overview¶

The Precision Oncology Agent transforms raw genomic data (VCF files) into actionable clinical intelligence. It combines variant annotation, evidence retrieval, therapy ranking, trial matching, and outcomes learning into a closed-loop system that generates Molecular Tumor Board (MTB) packets for precision cancer treatment decisions.

Collection	Seed Records	Source
Variants	130	CIViC, curated actionable variants
Literature	60	PubMed oncology research
Therapies	94	FDA-approved targeted/immuno/chemo
Guidelines	45	NCCN, ASCO, ESMO guidelines
Trials	55	ClinicalTrials.gov oncology trials
Biomarkers	50	TMB, MSI-H, PD-L1, HRD, fusion panels
Resistance	50	Resistance mechanisms and bypasses
Pathways	45	Oncogenic signaling pathways
Outcomes	40	Treatment outcomes (synthetic)
Cases	40	Case snapshots (synthetic)
Genomic Evidence	(read-only)	Shared from Stage 2 RAG (3.5M vectors)
Total	609 vectors	11 collections (10 owned + 1 shared)

Note: Seed data provides 609 demo-ready vectors. Running the full ingest pipelines (PubMed, ClinicalTrials.gov, CIViC) expands the knowledge base to ~1,700+ vectors.

Example Queries¶

"What therapies target BRAF V600E in melanoma?"
"Compare EGFR TKI generations for NSCLC"
"Resistance mechanisms to osimertinib"
"NCCN recommendations for HER2+ breast cancer"
"Match clinical trials for MSI-H colorectal cancer"
"What is the role of TMB as a predictive biomarker for immunotherapy?"

All queries return grounded, cross-collection answers with clickable PubMed and ClinicalTrials.gov citations.

Comparative Analysis Mode¶

Comparative queries are auto-detected and produce structured side-by-side analysis with markdown tables, efficacy data, safety profiles, and clinical context.

"Compare osimertinib vs erlotinib for EGFR-mutant NSCLC"     -> TKI generation comparison
"BRAF+MEK inhibition vs immunotherapy for melanoma"           -> Modality comparison
"Pembrolizumab versus nivolumab"                              -> Product vs product
"Compare PARP inhibitors for BRCA-mutant ovarian cancer"      -> Drug class comparison

How it works: The engine detects "vs/versus/compare/difference between" keywords, parses two entities, resolves each against the knowledge graph (~40 actionable targets, ~30 therapy mappings, ~20 resistance mechanisms), runs dual retrievals with per-entity filtering, identifies shared/head-to-head evidence, and builds a comparative prompt instructing Claude to produce structured tables with efficacy, safety, biomarker, and guideline data.

Feature	Detail
Entity types	Genes, drugs, drug classes, biomarkers, cancer types, pathways
Entity resolution	~40 gene targets + ~30 therapy mappings + 50+ aliases
Dual retrieval	~400 ms for 46 results (24 + 22 per entity)
Shared evidence	Head-to-head trials identified and highlighted separately
Structured output	Comparison table, MoA differences, efficacy, safety, guideline recs
Fallback	Unrecognized entities gracefully fall back to normal query path

MTB Packet Generation¶

The Precision Oncology Agent generates structured Molecular Tumor Board packets from patient data:

VCF Upload -- raw VCF text is parsed, extracting PASS variants with gene, consequence, and position
Variant Annotation -- each variant is classified against ~40 actionable targets using AMP/ASCO/CAP evidence tiers (A-D)
Evidence Lookup -- RAG retrieval for each actionable variant across literature, therapies, and guidelines
Therapy Ranking -- evidence-level-sorted therapy recommendations with resistance flags and contraindication checks
Trial Matching -- hybrid deterministic + semantic search against oncology clinical trials
Open Questions -- VUS variants, missing biomarkers, and evidence gaps flagged for MTB discussion

The resulting MTB packet is exported as Markdown, JSON, PDF, or FHIR R4 DiagnosticReport Bundle.

Architecture¶

VCF / Patient Data
    |
    v
[Case Manager] ---- VCF parsing, variant extraction, actionability classification
    |
    v
[Knowledge Graph Lookup]
(~40 actionable targets, ~30 therapies, ~20 resistance, ~10 pathways, ~15 biomarkers)
    |
    v
[Parallel 11-Collection RAG Search] --- BGE-small-en-v1.5 (384-dim)
    |               |              |           |           |
    v               v              v           v           v
 Variants      Literature     Therapies   Guidelines    Trials
  130            60             94           45          55
    |               |              |           |           |
 Biomarkers    Resistance     Pathways    Outcomes      Cases
   50             50            45           40          40
    |                                                      |
    +-------------- genomic_evidence (3.5M, read-only) ----+
    |
    v
[Query Expansion] (12 maps, ~120 keywords -> ~700 terms)
    |
    v
[Evidence Synthesis]
    |
    +--- [Therapy Ranker] -- evidence-level sort, resistance check, contraindication
    +--- [Trial Matcher] --- deterministic filter + semantic search + composite scoring
    +--- [Cross-Modal] ----- variant severity -> imaging, variant actionability -> drug discovery
    |
    v
[Claude Sonnet 4.6 LLM] -> Grounded response with citations
    |
    v
[MTB Packet / Clinical Report]
    |
    +--- Markdown report
    +--- JSON export
    +--- PDF (NVIDIA-themed, ReportLab)
    +--- FHIR R4 DiagnosticReport Bundle (SNOMED CT, LOINC coded)

Built on the HCLS AI Factory platform:

Vector DB: Milvus 2.4 with IVF_FLAT/COSINE indexes (nlist=1024, nprobe=16)
Embeddings: BGE-small-en-v1.5 (384-dim)
LLM: Claude Sonnet 4.6 (Anthropic API)
UI: Streamlit MTB Workbench (port 8526)
API: FastAPI REST server (port 8527)
Hardware target: NVIDIA DGX Spark ($4,699)

Setup¶

Prerequisites¶

Python 3.10+
Milvus 2.4 running on localhost:19530
ANTHROPIC_API_KEY environment variable (or in rag-chat-pipeline/.env)

Install¶

cd ai_agent_adds/precision_oncology_agent
pip install -r requirements.txt

1. Create Collections and Seed Data¶

python3 scripts/setup_collections.py --seed

This creates 11 Milvus collections (10 owned + 1 read-only) with IVF_FLAT indexes and seeds actionable variants, therapies, guidelines, resistance mechanisms, pathways, and biomarker panels.

2. Ingest PubMed Literature (~15 min)¶

python3 scripts/ingest_pubmed.py --max-results 5000

Fetches oncology abstracts via NCBI E-utilities, classifies by cancer type and gene, embeds with BGE-small, and stores in onco_literature.

3. Ingest Clinical Trials (~3 min)¶

python3 scripts/ingest_clinical_trials.py --max-results 1500

Fetches oncology trials via ClinicalTrials.gov API v2, extracts biomarker criteria, embeds, and stores in onco_trials.

4. Ingest CIViC Variants (~2 min)¶

python3 scripts/ingest_civic.py

Fetches clinically actionable variants from the CIViC database, maps evidence levels to AMP/ASCO/CAP tiers, and stores in onco_variants.

5. Validate¶

python3 scripts/validate_e2e.py

Runs end-to-end tests: collection stats, single-collection search, multi-collection search_all(), filtered search, and demo queries.

6. Launch UI¶

streamlit run app/oncology_ui.py --server.port 8526

7. Launch API¶

uvicorn api.main:app --host 0.0.0.0 --port 8527

Project Structure¶

precision_oncology_agent/agent/
├── src/
│   ├── models.py                  # Pydantic data models (538 lines)
│   ├── collections.py             # 11 Milvus collection schemas + manager (606 lines)
│   ├── knowledge.py               # Knowledge graph: targets, therapies, resistance (1,662 lines)
│   ├── query_expansion.py         # 12 expansion maps (812 lines)
│   ├── rag_engine.py              # Multi-collection RAG + comparative analysis (899 lines)
│   ├── agent.py                   # Plan-search-synthesize pipeline (553 lines)
│   ├── case_manager.py            # VCF parsing + MTB packet generation (509 lines)
│   ├── trial_matcher.py           # Hybrid deterministic + semantic matching (513 lines)
│   ├── therapy_ranker.py          # Evidence-based therapy ranking (748 lines)
│   ├── cross_modal.py             # Cross-modal triggers to imaging + drug discovery (383 lines)
│   ├── export.py                  # Markdown, JSON, PDF, FHIR R4 export (1,055 lines)
│   ├── metrics.py                 # Prometheus metrics (362 lines)
│   ├── scheduler.py               # Data ingestion scheduler (263 lines)
│   ├── ingest/
│   │   ├── base.py                # Base ingest pipeline (249 lines)
│   │   ├── civic_parser.py        # CIViC actionable variant ingest (340 lines)
│   │   ├── oncokb_parser.py       # OncoKB data parser (104 lines)
│   │   ├── literature_parser.py   # PubMed E-utilities ingest (248 lines)
│   │   ├── clinical_trials_parser.py  # ClinicalTrials.gov API v2 (279 lines)
│   │   ├── guideline_parser.py    # NCCN/ASCO/ESMO guideline parser (168 lines)
│   │   ├── pathway_parser.py      # Signaling pathway parser (121 lines)
│   │   ├── resistance_parser.py   # Resistance mechanism parser (125 lines)
│   │   └── outcome_parser.py      # Outcome record parser (158 lines)
│   └── utils/
│       ├── vcf_parser.py          # VCF file parsing utilities (361 lines)
│       └── pubmed_client.py       # NCBI E-utilities HTTP client (296 lines)
├── app/
│   └── oncology_ui.py             # Streamlit MTB Workbench (758 lines)
├── api/
│   ├── main.py                    # FastAPI REST server (393 lines)
│   └── routes/
│       ├── meta_agent.py          # /api/ask, /api/deep-research (169 lines)
│       ├── cases.py               # /api/cases, /api/cases/{id}/mtb (234 lines)
│       ├── trials.py              # /api/trials/match (153 lines)
│       ├── reports.py             # /api/reports/{format} (236 lines)
│       └── events.py              # /api/events, cross-modal triggers (89 lines)
├── config/
│   └── settings.py                # Pydantic BaseSettings (134 lines)
├── tests/
│   └── conftest.py                # Test fixtures (214 lines)
├── requirements.txt
└── LICENSE                        # Apache 2.0

64 Python files | ~20,000 lines of code | Apache 2.0

Knowledge Graph¶

Category	Count	Examples
Actionable Targets	~40	BRAF, EGFR, ALK, ROS1, KRAS G12C, HER2, NTRK, RET, MET, FGFR, PIK3CA, BRCA1/2, IDH1/2, ESR1, TP53, PTEN...
Therapy Mappings	~30	vemurafenib, osimertinib, pembrolizumab, sotorasib, lorlatinib, olaparib, trastuzumab deruxtecan...
Resistance Mechanisms	~20	EGFR T790M, EGFR C797S, MET amplification bypass, BRAF amplification, NRAS activation, BRCA reversion...
Signaling Pathways	~10	MAPK, PI3K/AKT/mTOR, DDR, cell cycle, apoptosis, Wnt, Notch, Hedgehog, JAK/STAT, angiogenesis
Biomarker Panels	~15	TMB-H, MSI-H/dMMR, PD-L1 TPS/CPS, HRD, NTRK fusion, ALK rearrangement, ROS1 fusion, ctDNA, FGFR
Entity Aliases	~50+	Cancer type aliases (lung->NSCLC, CRC->COLORECTAL), drug brand names, gene synonyms

Performance¶

Measured on NVIDIA DGX Spark (GB10 GPU, 128GB unified LPDDR5x memory, 20 ARM cores):

Metric	Value
Single-collection search	< 200 ms
Cross-collection RAG query (11 collections)	< 5 s
MTB packet generation (full workflow)	< 30 s
Trial matching (deterministic + semantic)	< 10 s
Therapy ranking (variant + biomarker driven)	< 5 s
Comparative dual retrieval	~400 ms
Full RAG query (search + Claude)	~24 s
Cosine similarity scores	0.72 - 0.92
Embedding dimension	384 (BGE-small-en-v1.5)

Service Ports¶

Service	Port	Protocol
Streamlit MTB Workbench	8526	HTTP
FastAPI REST Server	8527	HTTP
Milvus gRPC	19530	gRPC

Credits¶

Adam Jones
Apache 2.0 License

Clinical Decision Support Disclaimer

This agent is a clinical decision support research tool. It is not FDA-cleared and is not intended as a standalone diagnostic device. All recommendations should be reviewed by qualified healthcare professionals. Apache 2.0 License.