Precision Oncology Intelligence Agent -- Advanced Learning Guide¶

Audience: Experienced developers contributing to or extending the agent. Prerequisite reading: LEARNING_GUIDE.md (introductory guide). Codebase snapshot: March 2026 -- ~14,000 lines across src/, api/, app/, tests/, scripts/.

Prerequisites¶

Before diving in you should be comfortable with:

Python 3.10+ -- dataclasses, async generators, concurrent.futures, Pydantic v2.
Vector databases -- IVF indexing, cosine similarity, embedding pipelines. Specifically Milvus (pymilvus SDK, collection schemas, filter expressions).
Clinical genomics vocabulary -- VCF format, SnpEff / VEP annotations, somatic vs. germline, AMP/ASCO/CAP tiering, CIViC evidence levels.
RAG architecture -- query embedding, multi-collection retrieval, weighted re-ranking, prompt construction with grounded citations.
FastAPI + Streamlit -- lifespan management, dependency injection, Pydantic request/response models, Streamlit session state.

Codebase Map¶

agent/
├── config/
│   └── settings.py               OncoSettings (Pydantic BaseSettings, ONCO_ env prefix)
├── src/                           11,440 lines across 13+ modules
│   ├── agent.py          (553)    OncoIntelligenceAgent -- plan/search/evaluate/synthesize
│   ├── case_manager.py   (516)    VCF parsing, case lifecycle, MTB packet generation
│   ├── collections.py    (665)    11 collection schemas, OncoCollectionManager
│   ├── cross_modal.py    (383)    Cross-agent triggers (genomic + imaging enrichment)
│   ├── export.py       (1,055)    Markdown / JSON / PDF / FHIR R4 export
│   ├── knowledge.py    (1,662)    Knowledge graphs (targets, therapies, resistance, pathways)
│   ├── metrics.py        (362)    Prometheus instrumentation
│   ├── models.py         (538)    14 Pydantic models, 13 enums
│   ├── query_expansion.py(812)    Domain-aware query rewriting
│   ├── rag_engine.py     (908)    Multi-collection RAG with comparative retrieval
│   ├── scheduler.py               Background task scheduling
│   ├── therapy_ranker.py (748)    Evidence-based therapy ranking
│   ├── trial_matcher.py  (513)    Hybrid deterministic + semantic trial matching
│   ├── ingest/         (1,793)    9 parsers + abstract base class
│   │   ├── base.py                BaseIngestPipeline (fetch -> parse -> embed_and_store)
│   │   ├── civic_parser.py        CIViC variant evidence
│   │   ├── oncokb_parser.py       OncoKB annotations
│   │   ├── literature_parser.py   PubMed / PMC abstracts
│   │   ├── clinical_trials_parser.py  ClinicalTrials.gov v2 API
│   │   ├── guideline_parser.py    NCCN / ESMO / ASCO guidelines
│   │   ├── resistance_parser.py   Resistance mechanism seeds
│   │   ├── pathway_parser.py      Signaling pathway seeds
│   │   └── outcome_parser.py      Real-world outcome records
│   └── utils/            (668)    VCF parser, PubMed client
├── api/                 (1,300)
│   ├── main.py          (410)     FastAPI lifespan, CORS, health endpoint
│   └── routes/                    5 routers (meta_agent, cases, trials, reports, events)
├── app/                   (758)
│   └── oncology_ui.py             Streamlit 5-tab workbench
├── tests/               (4,370)   556 tests across 10 files + conftest
│   ├── test_models.py   (644)     Parametrized enum validation
│   ├── test_integration.py(785)   4 patient profiles, full pipeline
│   ├── test_knowledge.py(603)     Knowledge graph integrity
│   ├── test_export.py   (439)     All 4 export formats
│   ├── test_therapy_ranker.py(363) Ranking logic
│   ├── test_trial_matcher.py(363) Matching logic
│   ├── test_case_manager.py(332)  VCF and case lifecycle
│   ├── test_rag_engine.py(301)    Retrieval pipeline
│   ├── test_collections.py(276)   Schema validation
│   └── test_agent.py    (264)     Planning and evaluation
└── scripts/             (2,273)   17 scripts (ingest, seed, validate, benchmark)

Data Models Overview (src/models.py)¶

The agent defines 13 enums and 14 Pydantic models that enforce type safety across all modules:

Enums (13 total)¶

Enum	Values
CancerType	25 types: NSCLC, SCLC, BREAST, COLORECTAL, ...
VariantType	SNV, INDEL, CNV_AMP, CNV_DEL, FUSION, REARRANGEMENT, SV
EvidenceLevel	A (FDA-approved), B (clinical), C (case), D (preclinical), E (inferential)
TherapyCategory	TARGETED, IMMUNOTHERAPY, CHEMOTHERAPY, HORMONAL, COMBINATION, RADIOTHERAPY, CELL_THERAPY, ADC, BISPECIFIC
TrialPhase	Early Phase 1, Phase 1, Phase 1/2, Phase 2, Phase 2/3, Phase 3, Phase 4, N/A
TrialStatus	9 values: Recruiting, Active not recruiting, Completed, Terminated, ...
ResponseCategory	CR (complete), PR (partial), SD (stable), PD (progressive), NE (not evaluable)
BiomarkerType	PREDICTIVE, PROGNOSTIC, DIAGNOSTIC, MONITORING, RESISTANCE, PHARMACODYNAMIC, SCREENING, THERAPEUTIC_SELECTION
PathwayName	13 pathways: MAPK, PI3K_AKT_MTOR, DDR, CELL_CYCLE, APOPTOSIS, WNT, NOTCH, HEDGEHOG, JAK_STAT, ANGIOGENESIS, HIPPO, NF_KB, TGF_BETA
GuidelineOrg	NCCN, ESMO, ASCO, WHO, CAP_AMP, FDA, EMA, AACR
SourceType	PUBMED, PMC, PREPRINT, MANUAL

Domain Models (14 total)¶

Each model implements to_embedding_text() to generate the text string that gets embedded for vector storage. This method concatenates the most semantically relevant fields with pipe separators:

Model	Key Fields
OncologyLiterature	id, title, text_chunk, source_type, year, gene
OncologyTrial	id (NCT), title, phase, status, biomarker_criteria
OncologyVariant	id, gene, variant_name, evidence_level, drugs
OncologyBiomarker	id, name, biomarker_type, testing_method, cutoff
OncologyTherapy	id, drug_name, category, targets, mechanism_of_action
OncologyPathway	id, name, key_genes, therapeutic_targets, cross_talk
OncologyGuideline	id, org, cancer_type, version, year, recommendations
ResistanceMechanism	id, primary_therapy, gene, mechanism, alternatives
OutcomeRecord	id, case_id, therapy, response, duration_months
CaseSnapshot	case_id, patient_id, cancer_type, variants, biomarkers
MTBPacket	case_id, variant_table, therapy_ranking, trial_matches
AgentQuery	question, gene, cancer_type, filters
SearchHit	collection, id, score, text, metadata, label
CrossCollectionResult	query, hits, total_collections_searched
AgentResponse	question, answer, evidence, plan, report

Example: OncologyVariant.to_embedding_text()¶

def to_embedding_text(self) -> str:
    parts = [
        f"{self.gene} {self.variant_name}",
        self.text_summary,
        f"Type: {self.variant_type.value}",
        f"Evidence: {self.evidence_level.value}",
    ]
    if self.cancer_type:
        parts.append(f"Cancer: {self.cancer_type.value}")
    if self.drugs:
        parts.append(f"Drugs: {', '.join(self.drugs)}")
    if self.clinical_significance:
        parts.append(f"Significance: {self.clinical_significance}")
    return " | ".join(parts)

Chapter 1: Deep Dive into the RAG Engine¶

File: src/rag_engine.py (908 lines)

The OncoRAGEngine is the central retrieval component. Every query passes through it -- whether from the agent, the API, or the Streamlit UI.

1.1 Collection Configuration¶

The engine searches 11 Milvus collections in parallel. Each collection has a configurable weight that scales raw cosine-similarity scores before merging:

COLLECTION_CONFIG: Dict[str, Dict[str, Any]] = {
    "onco_variants":   {"weight": settings.WEIGHT_VARIANTS,   "label": "Variant",    "filter_field": "gene",  "year_field": None},
    "onco_literature": {"weight": settings.WEIGHT_LITERATURE,  "label": "Literature", "filter_field": "gene",  "year_field": "year"},
    "onco_therapies":  {"weight": settings.WEIGHT_THERAPIES,   "label": "Therapy",    "filter_field": None,    "year_field": None},
    "onco_guidelines": {"weight": settings.WEIGHT_GUIDELINES,  "label": "Guideline",  "filter_field": None,    "year_field": "year"},
    "onco_trials":     {"weight": settings.WEIGHT_TRIALS,      "label": "Trial",      "filter_field": None,    "year_field": "start_year"},
    "onco_biomarkers": {"weight": settings.WEIGHT_BIOMARKERS,  "label": "Biomarker",  "filter_field": None,    "year_field": None},
    "onco_resistance": {"weight": settings.WEIGHT_RESISTANCE,  "label": "Resistance", "filter_field": "gene",  "year_field": None},
    "onco_pathways":   {"weight": settings.WEIGHT_PATHWAYS,    "label": "Pathway",    "filter_field": None,    "year_field": None},
    "onco_outcomes":   {"weight": settings.WEIGHT_OUTCOMES,    "label": "Outcome",    "filter_field": None,    "year_field": None},
    "onco_cases":      {"weight": settings.WEIGHT_CASES,       "label": "Case",       "filter_field": None,    "year_field": None},
    "genomic_evidence":{"weight": settings.WEIGHT_GENOMIC,     "label": "Genomic",    "filter_field": None,    "year_field": None},
}

Default weights (from config/settings.py) sum to 1.0:

Collection	Default Weight	Purpose
onco_variants	0.18	CIViC / OncoKB variant evidence
onco_literature	0.16	PubMed / PMC literature chunks
onco_therapies	0.14	Approved & investigational drugs
onco_guidelines	0.12	NCCN / ESMO / ASCO recommendations
onco_trials	0.10	ClinicalTrials.gov summaries
onco_biomarkers	0.08	Predictive / prognostic biomarkers
onco_resistance	0.07	Resistance mechanisms
onco_pathways	0.06	Signaling pathway context
onco_outcomes	0.04	Real-world treatment outcomes
genomic_evidence	0.03	VCF-derived evidence (Stage 1)
onco_cases	0.02	De-identified patient snapshots

Key insight: filter_field and year_field allow the engine to apply Milvus metadata filters (gene-level narrowing, publication-year ranges) on a per-collection basis. Collections without a filter_field rely purely on vector similarity.

1.2 The retrieve() Pipeline¶

The retrieve() method is the heart of the engine. Here is the step-by-step execution flow:

retrieve(query, top_k, collections_filter, year_min, year_max, conversation_context)
  │
  ├─ 1. Embed query text
  │     └─ _embed_query(): prepends BGE instruction prefix
  │        "Represent this sentence for searching relevant passages: <query>"
  │
  ├─ 2. _search_all_collections() -- parallel ThreadPoolExecutor
  │     ├─ For each target collection (up to 8 workers):
  │     │   ├─ Build Milvus filters (gene, year range)
  │     │   ├─ collection_manager.search(collection, vector, top_k, filters)
  │     │   ├─ Scale raw _distance by collection weight
  │     │   ├─ Wrap in SearchHit with label, citation, relevance
  │     │   └─ Return hits
  │     └─ Merge all hits into a flat list
  │
  ├─ 3. Expanded search (if query_expander provided)
  │     ├─ query_expander(query.text) -> expansion terms
  │     ├─ Concatenate: "{query} {expansion_terms}"
  │     ├─ Re-embed expanded text
  │     └─ _search_all_collections() with top_k // 2
  │
  ├─ 4. _merge_and_rank()
  │     ├─ Deduplicate by record_id (first-seen wins)
  │     ├─ Sort descending by weighted score
  │     └─ Cap at _MAX_EVIDENCE = 30
  │
  └─ 5. Return CrossCollectionResult(query, hits, total_collections_searched)

1.3 Relevance Scoring¶

After weighting, each hit receives a human-readable relevance label:

@staticmethod
def _score_relevance(score: float) -> str:
    if score >= 0.85:
        return "high"
    if score >= 0.65:
        return "medium"
    return "low"

These labels appear in the prompt sent to the LLM so it can calibrate confidence in individual evidence items.

1.4 Citation Formatting¶

The engine automatically generates clickable citation links:

PubMed IDs (PMID:12345) -> [PubMed 12345](https://pubmed.ncbi.nlm.nih.gov/12345/)
NCT IDs (NCT01234567) -> [NCT01234567](https://clinicaltrials.gov/study/NCT01234567)
Everything else -> [Label: record_id] using the collection label

1.5 Comparative Retrieval¶

When the query contains comparative language (vs, versus, compare, difference between, head to head), the engine switches to a dual-entity retrieval path:

_is_comparative(question)
  └─ regex: r"\b(compare|vs\.?|versus|difference between|head.to.head)\b"

retrieve_comparative(question, ...)
  ├─ _parse_comparison_entities() -> (entity_a, entity_b)
  │   Handles: "A vs B", "compare A and B", "difference between A and B"
  ├─ retrieve(entity_a) -> hits_a
  ├─ retrieve(entity_b) -> hits_b
  ├─ Compute shared_hits = intersection by record_id
  └─ Return {entity_a, entity_b, hits_a, hits_b, shared_hits}

The comparative prompt template instructs the LLM to structure its answer across 8 comparison dimensions: mechanism of action, efficacy data, safety profile, biomarker considerations, resistance mechanisms, guideline recommendations, clinical trial evidence, and summary recommendation.

1.6 Knowledge Context Injection¶

Before building the LLM prompt, _get_knowledge_context() queries the knowledge module for five types of domain knowledge:

Gene mentions -> knowledge.lookup_gene()
Therapy mentions -> knowledge.lookup_therapy()
Resistance mentions -> knowledge.lookup_resistance()
Pathway mentions -> knowledge.lookup_pathway()
Biomarker mentions -> knowledge.lookup_biomarker()

Each successful lookup is tagged with a section header ([Gene Knowledge], [Therapy Knowledge], etc.) and prepended to the evidence in the prompt.

1.7 Prompt Construction¶

The final prompt has this structure:

=== Domain Knowledge ===
[Gene Knowledge] ...
[Therapy Knowledge] ...

=== Retrieved Evidence ===
1. Variant [high] (score 0.892) -- [PubMed 12345](...)
   EGFR L858R confers sensitivity to osimertinib...
2. Literature [medium] (score 0.734) -- [Variant: civic-123]
   ...

=== Question ===
<original question>

Using the evidence above, provide a thorough, well-cited answer...

1.8 System Prompt¶

The system prompt establishes 8 core competency areas:

Molecular profiling
Variant interpretation (CIViC/OncoKB evidence levels, AMP/ASCO/CAP)
Therapy selection (NCCN/ESMO guideline-concordant)
Clinical trial matching
Resistance mechanisms
Biomarker assessment
Outcomes monitoring (RECIST, MRD, ctDNA)
Cross-modal integration (imaging, drug discovery pipelines)

Five behavioral instructions enforce citation, cross-functional reasoning, resistance/contraindication awareness, guideline references, and uncertainty acknowledgment.

Chapter 2: The OncoIntelligenceAgent¶

File: src/agent.py (553 lines)

The agent implements the plan-search-evaluate-synthesize loop -- the highest-level orchestration pattern in the system.

2.1 SearchPlan Dataclass¶

Every query starts with a structured plan:

@dataclass
class SearchPlan:
    question: str
    identified_topics: List[str] = field(default_factory=list)
    target_genes: List[str] = field(default_factory=list)
    relevant_cancer_types: List[str] = field(default_factory=list)
    search_strategy: str = "broad"   # "broad" | "targeted" | "comparative"
    sub_questions: List[str] = field(default_factory=list)

2.2 Gene and Cancer-Type Recognition¶

The planner uses two static vocabularies for entity extraction:

KNOWN_GENES (30 entries): BRAF, EGFR, ALK, ROS1, KRAS, HER2, NTRK, RET, MET, FGFR, PIK3CA, IDH1, IDH2, BRCA, BRCA1, BRCA2, TP53, PTEN, CDKN2A, STK11, ESR1, ERBB2, NRAS, APC, VHL, KIT, PDGFRA, FLT3, NPM1, DNMT3A
KNOWN_CANCER_TYPES (25 entries): NSCLC, BREAST, MELANOMA, COLORECTAL, PANCREATIC, OVARIAN, PROSTATE, GLIOMA, GLIOBLASTOMA, AML, CML, CLL, DLBCL, BLADDER, RENAL, HEPATOCELLULAR, GASTRIC, ESOPHAGEAL, THYROID, ENDOMETRIAL, CERVICAL, HEAD_AND_NECK, SARCOMA, CHOLANGIOCARCINOMA, MESOTHELIOMA
_CANCER_ALIASES (50+ entries): maps natural language like "lung cancer" -> "NSCLC", "triple negative breast" -> "BREAST", "gbm" -> "GLIOBLASTOMA", "crpc" -> "PROSTATE", etc.

Gene detection is case-insensitive uppercase matching against q_upper. Cancer type detection checks both canonical names and aliases.

2.3 Topic Detection¶

The planner scans for 20+ topic keywords and maps them to clinical concepts:

Keyword	Topic
`resistance`	therapeutic resistance
`biomarker`	biomarker identification
`immunotherapy`	immunotherapy response
`combination`	combination therapy
`tmb`	TMB
`msi`	MSI / microsatellite instab.
`ctdna`	liquid biopsy / ctDNA
`pdl1`, `pd-l1`	PD-L1 / immune checkpoint
`fusion`	gene fusion
`methylation`	epigenetic regulation

2.4 Strategy Selection¶

if any(sig in q_lower for sig in comparative_signals):
    search_strategy = "comparative"
elif target_genes and relevant_cancer_types:
    search_strategy = "targeted"
else:
    search_strategy = "broad"

comparative: triggered by compare, vs, versus, difference between, head to head
targeted: both gene(s) and cancer type(s) identified
broad: fallback when context is ambiguous

2.5 Question Decomposition¶

Complex queries are broken into focused sub-questions:

Multiple genes -> one sub-question per gene (e.g., "What is the role of EGFR in NSCLC?")
Multiple cancer types -> one sub-question per type (e.g., "BRAF therapeutic landscape in melanoma")
Topic-driven sub-questions for resistance, trials, biomarkers, combinations

2.6 The run() Pipeline¶

def run(self, question: str, **kwargs) -> AgentResponse:
    # 1. Plan
    plan = self.search_plan(question)

    # 2. Search with adaptive retry
    for attempt in range(1, MAX_RETRIES + 2):     # up to 3 attempts
        for q in [plan.question] + plan.sub_questions:
            results = self.rag_engine.cross_collection_search(AgentQuery(question=q))
            all_evidence.extend(results)

        # 3. Evaluate
        verdict = self.evaluate_evidence(all_evidence)
        if verdict == "sufficient" or attempt > MAX_RETRIES:
            break
        # Broaden: targeted -> broad, generate fallback queries
        queries_to_run = self._generate_fallback_queries(plan)

    # 4. Synthesize
    response = self.rag_engine.synthesize(question, all_evidence, plan)
    response.report = self.generate_report(response)
    return response

2.7 Evidence Evaluation¶

The evaluate_evidence() method classifies evidence adequacy as one of three verdicts:

Verdict	Criteria
`sufficient`	>= 3 hits AND >= 2 collections represented
`partial`	> 0 hits but insufficient diversity or count
`insufficient`	0 usable hits (all below MIN_SIMILARITY_SCORE = 0.30)

Evidence items with scores below 0.30 are filtered out before evaluation. An average score >= 0.50 is preferred but not required for sufficient.

2.8 Adaptive Retry¶

When evidence is insufficient:

Switch strategy from targeted to broad
Generate fallback queries:
Per gene: "{gene} oncology therapeutic implications", "{gene} mutation clinical significance"
Per cancer type: "{ct} precision medicine current landscape"
Default: "{question} precision oncology"

Maximum retries: 2 (total 3 attempts including the initial search).

2.9 Report Generation¶

The agent produces a structured Markdown report with sections: - Query (original question) - Analysis (strategy, genes, cancer types, topics, sub-questions) - Evidence Sources (grouped by collection, top 10 per collection) - Knowledge Graph (if attached) - Synthesis (LLM-generated answer)

Chapter 3: Knowledge Graph Architecture¶

File: src/knowledge.py (1,662 lines)

The knowledge module is a curated, code-embedded domain knowledge graph providing instant lookup without vector search latency.

3.1 ACTIONABLE_TARGETS (~40 genes)¶

Each entry has a consistent structure:

ACTIONABLE_TARGETS["EGFR"] = {
    "gene": "EGFR",
    "full_name": "Epidermal Growth Factor Receptor",
    "cancer_types": ["NSCLC", "head and neck", "colorectal", "glioblastoma"],
    "key_variants": ["L858R", "exon 19 deletion", "T790M", "C797S",
                     "exon 20 insertion", "S768I", "L861Q", "G719X"],
    "targeted_therapies": ["osimertinib", "erlotinib", "gefitinib",
                           "afatinib", "dacomitinib", "amivantamab"],
    "combination_therapies": ["osimertinib + chemotherapy",
                              "amivantamab + lazertinib"],
    "resistance_mutations": ["T790M", "C797S", "MET amplification",
                             "HER2 amplification", "small cell transformation",
                             "BRAF V600E", "PIK3CA mutations"],
    "pathway": "MAPK",
    "evidence_level": "A",
    "description": "EGFR mutations occur in ~15-20% of NSCLC ..."
}

Covered genes include: BRAF, EGFR, ALK, ROS1, KRAS, HER2, NTRK, RET, MET, FGFR, PIK3CA, IDH1, IDH2, and many more -- each with full_name, cancer_types, key_variants, targeted_therapies, combination_therapies, resistance_mutations, pathway, evidence_level, and description.

3.2 THERAPY_MAP¶

Maps drug names (lowercase keys) to structured therapy records:

THERAPY_MAP["osimertinib"] = {
    "drug_name": "osimertinib",
    "brand_name": "Tagrisso",
    "category": "targeted therapy",
    "targets": ["EGFR"],
    "approved_indications": ["EGFR-mutant NSCLC (first-line)", ...],
    "mechanism": "3rd-gen EGFR TKI, active against T790M",
    "key_trials": ["FLAURA", "ADAURA", "FLAURA2"],
}

3.3 RESISTANCE_MAP¶

Maps primary therapies to their known resistance mechanisms:

RESISTANCE_MAP["osimertinib"] = {
    "primary_therapy": "osimertinib",
    "resistance_triggers": [...],
    "mechanism": "C797S mutation, MET amplification, ...",
    "alternatives": ["amivantamab + lazertinib", ...],
}

The therapy ranker and case manager both consult this map to flag resistance concerns and suggest next-line options.

3.4 PATHWAY_MAP¶

Maps signaling pathway names to their constituent genes, druggable targets, and cross-talk connections:

MAPK: BRAF, KRAS, NRAS, MEK1/2, ERK1/2
PI3K/AKT/mTOR: PIK3CA, PTEN, AKT1, mTOR
DNA Damage Repair: BRCA1, BRCA2, ATM, ATR, PALB2
Cell Cycle: CDK4/6, CDKN2A, RB1, CCND1

Each pathway entry includes cross_talk describing how pathways interact (e.g., MAPK <-> PI3K bypass signaling).

3.5 BIOMARKER_PANELS¶

Defines clinically validated biomarker panels with testing methods, cutoffs, and associated therapies:

Biomarker	Type	Testing Method	Clinical Cutoff	Evidence
TMB	Predictive	WGS / WES / Panel	>= 10 mut/Mb	A
MSI	Predictive	IHC / PCR / NGS	MSI-H	A
PD-L1 TPS	Predictive	IHC (22C3/SP263)	>= 50% (first-line)	A
HRD	Predictive	Myriad MyChoice	HRD score >= 42	A

3.6 ENTITY_ALIASES¶

Provides 30+ alias mappings for entity resolution in natural language queries:

ENTITY_ALIASES = {
    "keytruda": "pembrolizumab",
    "opdivo": "nivolumab",
    "tagrisso": "osimertinib",
    "herceptin": "trastuzumab",
    ...
}

3.7 Lookup Functions¶

The module exposes helper functions consumed by the RAG engine:

lookup_gene(query) -- searches ACTIONABLE_TARGETS for gene mentions
lookup_therapy(query) -- searches THERAPY_MAP
lookup_resistance(query) -- searches RESISTANCE_MAP
lookup_pathway(query) -- searches PATHWAY_MAP
lookup_biomarker(query) -- searches BIOMARKER_PANELS
get_target_context(gene) -- formatted context string for a gene
classify_variant_actionability(gene, variant) -- returns evidence level (A/B/C/D/VUS)

Chapter 4: Query Expansion and Rewriting¶

File: src/query_expansion.py (812 lines)

4.1 Expansion Categories¶

The module contains 12 domain-specific expansion dictionaries:

CANCER_TYPE_EXPANSIONS -- maps abbreviations to full names and subtypes (e.g., "NSCLC" -> ["non-small cell lung cancer", "lung adenocarcinoma","lung squamous cell", "EGFR-mutant lung", "ALK-positive lung"])
GENE_EXPANSIONS -- maps gene symbols to full names and common variants (e.g., "EGFR" -> ["epidermal growth factor receptor", "EGFR L858R","EGFR exon 19 deletion", "EGFR T790M", "EGFR C797S"])
THERAPY_EXPANSIONS -- maps drug names to brand names, mechanisms, and related compounds
BIOMARKER_EXPANSIONS -- maps biomarker abbreviations to full terms
PATHWAY_EXPANSIONS -- signaling pathway synonyms
RESISTANCE_EXPANSIONS -- resistance mechanism terminology
CLINICAL_TERM_EXPANSIONS -- clinical outcomes and staging terms
TRIAL_EXPANSIONS -- trial-related terminology
IMMUNOTHERAPY_EXPANSIONS -- checkpoint inhibitor vocabulary
SURGERY_RADIATION_EXPANSIONS -- procedural terms
TOXICITY_EXPANSIONS -- adverse event terminology
GENOMICS_EXPANSIONS -- sequencing and variant calling terms

4.2 How Expansion Works¶

The RAG engine calls the expander as a callable:

expansion_terms = self.query_expander(query.text)
expanded_text = f"{query.text} {' '.join(expansion_terms)}"
expanded_vector = self._embed_query(expanded_text)

The expander scans the input query for keywords matching any expansion dictionary. For each match, it appends the associated expansion terms. This widens the embedding to capture semantically related documents that might use different terminology (e.g., a query mentioning "osimertinib" also captures "Tagrisso" and "3rd-generation EGFR TKI").

4.3 Expansion Coverage¶

Category	Key Count	Example Key	Example Expansions
Cancer types	16	`NSCLC`	non-small cell lung cancer, lung adenocarcinoma
Genes	12	`KRAS`	KRAS G12C, KRAS G12D, Kirsten rat sarcoma
Therapies	~15	`osimertinib`	Tagrisso, 3rd-gen EGFR TKI, T790M active
Biomarkers	~8	`TMB`	tumor mutational burden, TMB-high, TMB >= 10
Pathways	~6	`MAPK`	RAS/RAF/MEK/ERK, MAP kinase cascade
Resistance	~8	`T790M`	gatekeeper mutation, acquired resistance EGFR
Clinical terms	~10	`PFS`	progression-free survival, time to progression
Trials	~6	`Phase 3`	pivotal trial, registration study
Immunotherapy	~8	`checkpoint`	PD-1, PD-L1, CTLA-4, immune checkpoint

Chapter 5: Collection Schemas and Indexing¶

File: src/collections.py (665 lines)

5.1 Shared Index Configuration¶

All 11 collections use identical vector index parameters:

EMBEDDING_DIM = 384  # BGE-small-en-v1.5

INDEX_PARAMS = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024},
}

SEARCH_PARAMS = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16},
}

IVF_FLAT partitions the vector space into 1024 Voronoi cells. At query time, nprobe=16 cells are searched (1.6% of partitions), balancing speed and recall. For larger datasets (>1M vectors per collection), consider increasing nprobe to 32-64 or switching to IVF_PQ for memory savings.

5.2 Collection Schema Reference¶

Each collection follows a pattern: VARCHAR primary key id, a 384-dim FLOAT_VECTOR embedding field, and typed metadata fields.

onco_variants¶

Field	Type	Max Length	Notes
id	VARCHAR (PK)	100
embedding	FLOAT_VECTOR	dim=384
gene	VARCHAR	50	Filterable
variant_name	VARCHAR	100
variant_type	VARCHAR	30
cancer_type	VARCHAR	50
evidence_level	VARCHAR	20	A/B/C/D/E
drugs	VARCHAR	500
civic_id	VARCHAR	20
vrs_id	VARCHAR	100	VRS identifier
text_summary	VARCHAR	3000
clinical_significance	VARCHAR	200
allele_frequency	FLOAT

onco_literature¶

Field	Type	Max Length	Notes
id	VARCHAR (PK)	100
embedding	FLOAT_VECTOR	dim=384
title	VARCHAR	500
text_chunk	VARCHAR	3000
source_type	VARCHAR	20	pubmed/pmc
year	INT64		Filterable
cancer_type	VARCHAR	50
gene	VARCHAR	50	Filterable
variant	VARCHAR	100
keywords	VARCHAR	1000
journal	VARCHAR	200

onco_trials¶

Field	Type	Max Length	Notes
id	VARCHAR (PK)	20	NCT ID
embedding	FLOAT_VECTOR	dim=384
title	VARCHAR	500
text_summary	VARCHAR	3000
phase	VARCHAR	30
status	VARCHAR	30
sponsor	VARCHAR	200
cancer_types	VARCHAR	200
biomarker_criteria	VARCHAR	500
enrollment	INT64
start_year	INT64		Filterable
outcome_summary	VARCHAR	2000

genomic_evidence (read-only)¶

Field	Type	Max Length	Notes
id	VARCHAR (PK)	200
embedding	FLOAT_VECTOR	dim=384
chrom	VARCHAR	10
pos	INT64
ref	VARCHAR	500
alt	VARCHAR	500
qual	FLOAT
gene	VARCHAR	50
consequence	VARCHAR	100
impact	VARCHAR	20	HIGH/MODERATE/LOW/MOD
genotype	VARCHAR	10	0/1, 1/1, etc.
text_summary	VARCHAR	2000
clinical_significance	VARCHAR	200
rsid	VARCHAR	20
disease_associations	VARCHAR	500
am_pathogenicity	FLOAT		AlphaMissense score
am_class	VARCHAR	30	likely_pathogenic, etc.

5.3 Schema and Model Registries¶

Two registries map collection names to schemas and Pydantic models:

COLLECTION_SCHEMAS: Dict[str, CollectionSchema] = {
    "onco_literature": ONCO_LITERATURE_SCHEMA,
    "onco_trials": ONCO_TRIALS_SCHEMA,
    # ... all 11 collections
}

COLLECTION_MODELS: Dict[str, Optional[Type]] = {
    "onco_literature": OncologyLiterature,
    "onco_trials": OncologyTrial,
    # ...
    "genomic_evidence": None,  # read-only, populated by Stage 1
}

5.4 OncoCollectionManager¶

The manager wraps pymilvus operations:

connect() / disconnect() -- Milvus connection lifecycle
create_collection(name) -- creates from COLLECTION_SCHEMAS registry
get_collection(name) -- returns cached Collection handle
get_collection_count(name) -- entity count after flush
insert(collection_name, data) -- single-record or batch insert
search(collection_name, query_vector, top_k, filters) -- ANN search
query(collection_name, filter_expr, output_fields, limit) -- filter query

Parallel search across all collections uses ThreadPoolExecutor with max_workers = min(len(collections), 8).

Chapter 6: Therapy Ranking Engine¶

File: src/therapy_ranker.py (748 lines)

6.1 Evidence Level Ordering¶

EVIDENCE_LEVEL_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "VUS": 5}

Lower ordinal = stronger evidence. Level A (FDA-approved / companion diagnostic) always ranks above Level B (clinical evidence), and so on.

6.2 The rank_therapies() Pipeline¶

rank_therapies(cancer_type, variants, biomarkers, prior_therapies)
  │
  ├─ Step 1: Identify variant-driven therapies
  │   └─ For each variant: _identify_variant_therapies(gene, variant, cancer_type)
  │       ├─ Look up gene in ACTIONABLE_TARGETS
  │       ├─ Check if variant matches key_variants
  │       ├─ Get drug list from targeted_therapies
  │       └─ Return therapy dict with evidence_level
  │
  ├─ Step 2: Identify biomarker-driven therapies
  │   └─ _identify_biomarker_therapies(biomarkers, cancer_type)
  │       ├─ MSI-H/dMMR -> pembrolizumab, nivolumab, dostarlimab (Level A)
  │       ├─ TMB >= 10 -> pembrolizumab (Level A), atezolizumab (Level B)
  │       ├─ HRD/BRCA -> olaparib, rucaparib, niraparib, talazoparib
  │       ├─ PTEN loss -> alpelisib (Level C)
  │       ├─ PD-L1 TPS >= 50% -> pembrolizumab first-line (Level A)
  │       ├─ NTRK fusion -> larotrectinib, entrectinib (Level A)
  │       └─ BIOMARKER_PANELS for additional mappings
  │
  ├─ Step 3: Deduplicate (keep strongest evidence per drug)
  │   └─ Sort by EVIDENCE_LEVEL_ORDER
  │
  ├─ Step 4: Check resistance via _check_resistance()
  │   ├─ Mutation-level: RESISTANCE_MAP lookup
  │   └─ Class-level: _DRUG_CLASS_GROUPS cross-resistance
  │
  ├─ Step 5: Check contraindications via _check_contraindication()
  │   ├─ Direct match: same drug used before
  │   └─ Same drug class via THERAPY_MAP category matching
  │
  ├─ Step 6: Retrieve supporting evidence from Milvus
  │   └─ Search onco_therapies + onco_literature per drug
  │
  ├─ Step 6.5: Identify combination regimens
  │   └─ _COMBO_REGIMENS: 6 FDA-approved combos
  │       dabrafenib+trametinib, encorafenib+binimetinib,
  │       encorafenib+cetuximab, ipilimumab+nivolumab,
  │       lenvatinib+pembrolizumab, trastuzumab+pertuzumab
  │
  └─ Step 7: _assign_final_ranks()
      ├─ Partition: clean vs. flagged (resistance/contraindication)
      ├─ Sort each group by evidence level
      ├─ Combine: clean first, then flagged
      └─ Assign rank 1..N

6.3 Drug Class Groups¶

The ranker defines 10 drug class groups for cross-resistance detection:

_DRUG_CLASS_GROUPS = {
    "egfr_tki_1g":     ["erlotinib", "gefitinib"],
    "egfr_tki_2g":     ["afatinib", "dacomitinib"],
    "egfr_tki_3g":     ["osimertinib"],
    "alk_tki":         ["crizotinib", "ceritinib", "alectinib",
                        "brigatinib", "lorlatinib"],
    "braf_inhibitor":  ["vemurafenib", "dabrafenib", "encorafenib"],
    "mek_inhibitor":   ["trametinib", "cobimetinib", "binimetinib"],
    "anti_pd1":        ["pembrolizumab", "nivolumab", "dostarlimab",
                        "cemiplimab"],
    "anti_pdl1":       ["atezolizumab", "durvalumab", "avelumab"],
    "parp_inhibitor":  ["olaparib", "rucaparib", "niraparib",
                        "talazoparib"],
    "kras_g12c":       ["sotorasib", "adagrasib"],
}

If a patient previously received erlotinib (1st-gen EGFR TKI), gefitinib is flagged for cross-resistance because they share the same drug class.

6.4 Resistance Check Logic¶

Two layers of resistance detection:

Mutation-level (RESISTANCE_MAP): Checks if the candidate drug has documented resistance mutations that overlap with prior therapy history. Returns mechanism details and suggested alternatives.
Class-level (_DRUG_CLASS_GROUPS): If the candidate belongs to the same drug class as a prior therapy, flags likely cross-resistance even without mutation-specific data.

6.5 Convenience API¶

def rank_for_case(self, case: CaseSnapshot) -> List[Dict]:
    """Rank therapies directly from a CaseSnapshot."""
    return self.rank_therapies(
        cancer_type=case.cancer_type,
        variants=case.variants,
        biomarkers=case.biomarkers or {},
        prior_therapies=case.prior_therapies or [],
    )

Chapter 7: Clinical Trial Matching¶

File: src/trial_matcher.py (513 lines)

7.1 Two-Stage Matching Strategy¶

Stage 1: Deterministic Filter¶

_deterministic_search(cancer_type, top_k=30)

Resolve cancer type to aliases via _CANCER_ALIASES (18 cancer type groups)
For each alias x each open status: build Milvus filter expression
Filter on: cancer_type == "{alias}" and status == "{status}"
Open statuses: Recruiting, Active, not recruiting, Enrolling by invitation, Not yet recruiting
Input validation via _SAFE_FILTER_RE = r"^[A-Za-z0-9 _.\-/,]+$" to prevent Milvus injection

Stage 2: Semantic Search¶

_semantic_search(query_text, top_k=30)

Build natural-language query: "{cancer_type} clinical trial stage {stage} {marker1} {value1} {marker2} {value2}"
Embed and search onco_trials collection

Merge¶

Results are merged by trial_id (union). If a trial appears in both sets, the semantic score is preserved as _semantic_score.

7.2 Composite Scoring¶

composite = (
    0.40 * biomarker_score      # fraction of patient biomarkers in criteria
  + 0.25 * semantic_score       # vector similarity
  + 0.20 * phase_weight         # trial phase
  + 0.15 * status_weight        # recruitment status
) * age_penalty                 # 1.0 or 0.5 if age outside range

Phase Weights¶

Phase	Weight
Phase 3	1.0
Phase 2/3	0.9
Phase 2	0.8
Phase 1/2	0.7
Phase 1	0.6
Phase 4	0.5

Status Weights¶

Status	Weight
Recruiting	1.0
Enrolling by invitation	0.8
Active, not recruiting	0.6
Not yet recruiting	0.4

7.3 Biomarker Matching¶

_score_biomarker_match() performs case-insensitive fuzzy matching of each patient biomarker key and value against the combined trial criteria text. Returns fraction matched (0.0 to 1.0).

7.4 Age Penalty¶

_compute_age_penalty() parses age eligibility from criteria text using regex patterns:

"Age >= 18" / "minimum age: 18"
"Age <= 75" / "maximum age: 75"
"18-75 years" / "18 to 75 years"

Returns 1.0 (no penalty) if age is within range or unspecified; 0.5 if out of range.

7.5 Match Explanation¶

Each matched trial gets a structured explanation:

{
    "trial_id": "NCT04185831",
    "title": "Phase 3 Study of Osimertinib + Chemo in EGFR-mutant NSCLC",
    "phase": "Phase 3",
    "status": "Recruiting",
    "sponsor": "AstraZeneca",
    "match_score": 0.8234,
    "matched_criteria": ["Cancer type: NSCLC", "EGFR=L858R", "Age 62"],
    "unmatched_criteria": ["TMB=8.5 (not explicitly listed)"],
    "explanation": "Matched: Cancer type: NSCLC, EGFR=L858R, Age 62. ..."
}

Chapter 8: Case Management and VCF Parsing¶

File: src/case_manager.py (516 lines)

8.1 Case Lifecycle¶

create_case(patient_id, cancer_type, stage, vcf_or_variants, biomarkers, prior_therapies)
  ├─ Parse VCF text (if string) via _parse_vcf_text()
  │   └─ Delegates to src/utils/vcf_parser.py:
  │       parse_vcf_text() -> filter_pass_variants() -> extract gene/consequence
  ├─ Classify actionability per variant
  │   └─ classify_variant_actionability(gene, variant) -> A/B/C/D/VUS
  ├─ Generate case_id (UUID)
  ├─ Build text_summary for embedding
  ├─ Create CaseSnapshot
  └─ _store_case() -> embed summary -> insert into onco_cases collection

8.2 VCF Parsing Details¶

The VCF parser (src/utils/vcf_parser.py) handles three annotation formats:

SnpEff ANN field: ANN=A|missense_variant|MODERATE|EGFR|...
VEP CSQ field: CSQ=A|missense_variant|MODERATE|EGFR|...
GENE / GENEINFO fields: GENE=EGFR or GENEINFO=EGFR:1956

Parsing pipeline: - parse_vcf_text() -- split lines, skip headers, extract CHROM/POS/REF/ALT/FILTER/INFO - filter_pass_variants() -- keep only FILTER == "PASS" or "." - extract_gene_from_info() -- try ANN, then CSQ, then GENE/GENEINFO - extract_consequence_from_info() -- extract SnpEff/VEP consequence term

8.3 Variant Actionability Classification¶

def _classify_variant_actionability(self, gene: str, variant: str) -> str:
    return classify_variant_actionability(gene, variant)
    # from src.knowledge:
    # 1. Is gene in ACTIONABLE_TARGETS? No -> "VUS"
    # 2. Does variant match any key_variants? Yes -> evidence_level (A/B/C)
    # 3. Gene-level actionable? -> default_evidence_level
    # 4. Fallback -> "VUS"

8.4 MTB Packet Generation¶

generate_mtb_packet(case_id_or_snapshot) assembles 5 sections:

variant_table: all variants with actionability classification and associated drugs from ACTIONABLE_TARGETS
evidence_table: RAG-retrieved evidence for each actionable variant. Queries: "{gene} {variant} {cancer_type} targeted therapy clinical evidence" across onco_literature and onco_therapies.
therapy_ranking: delegates to the TherapyRanker (Chapter 6)
trial_matches: delegates to the TrialMatcher (Chapter 7)
open_questions: identifies gaps:
VUS variants that may need reclassification
Missing biomarker results (TMB, MSI, PD-L1 if not provided)
Uncertain evidence items needing tumor board discussion

8.5 Case Storage¶

Cases are stored in onco_cases with embedded text summaries:

self.collection_manager.insert(
    collection_name="onco_cases",
    data={
        "id": str(snapshot.case_id)[:100],
        "patient_id": str(snapshot.patient_id)[:100],
        "cancer_type": str(snapshot.cancer_type)[:50],
        "stage": str(snapshot.stage or "")[:20],
        "variants": variants_str[:1000],      # serialized to CSV string
        "biomarkers": biomarkers_str[:1000],   # serialized to CSV string
        "prior_therapies": therapies_str[:500],
        "embedding": embedding,
        "text_summary": summary_text[:3000],
    },
)

Note the explicit length truncation -- Milvus VARCHAR fields enforce max_length limits and will reject oversized values.

Chapter 9: Export System¶

File: src/export.py (1,055 lines)

9.1 Input Normalization¶

All four export functions accept MTBPacket, dict, or str via _normalise_input():

def _normalise_input(mtb_packet_or_response):
    if isinstance(..., dict): return it
    if isinstance(..., str):  try json.loads, else wrap in {"raw_text": ...}
    # Pydantic: try .model_dump(), .dict(), .__dict__

9.2 Markdown Export¶

export_markdown(mtb_packet_or_response, title=None) -> str

Sections generated: - Header with timestamp, pipeline name, patient ID, cancer type - Clinical Summary - Somatic Variant Profile (Markdown table: Gene | Variant | Type | VAF | Consequence | Tier) - Biomarker Summary (TMB, MSI, PD-L1 + any additional) - Evidence Summary (per-gene, per-level, with source citations) - Therapy Ranking (table: Rank | Therapy | Targets | Evidence | Line | Notes) - Clinical Trial Matches (NCT ID, title, phase, status, match rationale) - Pathway Context - Known Resistance Mechanisms - Open Questions / Follow-Up - Disclaimer

9.3 JSON Export¶

export_json(mtb_packet_or_response) -> dict

Returns a structured dictionary with:

{
    "meta": {
        "format": "hcls-ai-factory-oncology-report",
        "version": "1.0.0",
        "generated_at": "2026-02-15T...",
        "pipeline": "Oncology Intelligence Agent",
        "author": "HCLS AI Factory",
    },
    "patient_id": ...,
    "cancer_type": ...,
    "variants": [...],
    "biomarkers": {...},
    "evidence": [...],
    "therapy_ranking": [...],
    "clinical_trials": [...],
    "pathways": [...],
    "resistance_mechanisms": [...],
    "open_questions": [...],
}

9.4 PDF Export¶

export_pdf(mtb_packet_or_response, output_path) -> str

Requires ReportLab. Features:

NVIDIA branding: header bar in RGB (118, 185, 0) with white title
Custom styles: NVTitle (20pt white), NVHeading (14pt dark), NVBody (10pt), NVDisclaimer (7pt gray)
Page layout: letter size, 40pt margins
Structured tables for variants, therapies, trials using reportlab.platypus.Table with TableStyle
Disclaimer footer on every page

Brand color is overridable via ONCO_PDF_BRAND_COLOR_R/G/B environment variables.

9.5 FHIR R4 Export¶

export_fhir_r4(mtb_packet_or_response) -> dict

Generates a FHIR R4 Bundle resource containing:

FHIR Resource	Content	Coding System
Patient	Patient demographics
DiagnosticReport	Genomic report	SNOMED
Observation	Genetic variant assessment	LOINC 69548-6
Observation	Tumor mutation burden	LOINC 94076-7
Observation	Microsatellite instability	LOINC 81695-9

LOINC Codes Used¶

FHIR_LOINC_CODES = {
    "genomic_report":          "81247-9",
    "gene_studied":            "48018-6",
    "variant":                 "69548-6",
    "therapeutic_implication":  "51969-4",
    "tumor_mutation_burden":   "94076-7",
    "microsatellite_instability": "81695-9",
}

SNOMED Cancer Codes¶

22 cancer types are mapped to SNOMED CT codes:

FHIR_SNOMED_CANCER_CODES = {
    "nsclc": ("254637007", "Non-small cell lung cancer"),
    "breast": ("254837009", "Malignant neoplasm of breast"),
    "colorectal": ("363406005", "Malignant tumor of colon"),
    "melanoma": ("372244006", "Malignant melanoma"),
    # ... 18 more
}

9.6 FHIR R4 Bundle Structure¶

The generated FHIR Bundle has type: "collection" and contains:

{
  "resourceType": "Bundle",
  "id": "<uuid>",
  "type": "collection",
  "timestamp": "2026-02-15T10:30:00Z",
  "entry": [
    {
      "fullUrl": "urn:uuid:<patient-uuid>",
      "resource": {
        "resourceType": "Patient",
        "id": "<patient-uuid>",
        "identifier": [{"system": "urn:hcls-ai-factory:patient", "value": "<patient_id>"}],
        "active": true
      }
    },
    {
      "fullUrl": "urn:uuid:<obs-uuid>",
      "resource": {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": "69548-6"}]},
        "subject": {"reference": "urn:uuid:<patient-uuid>"},
        "valueCodeableConcept": {
          "coding": [{"system": "http://varnomen.hgvs.org", "code": "EGFR L858R"}]
        },
        "component": [
          {"code": {"text": "Gene"}, "valueString": "EGFR"},
          {"code": {"text": "Consequence"}, "valueString": "missense_variant"},
          {"code": {"text": "VAF"}, "valueQuantity": {"value": 0.35}}
        ]
      }
    },
    {
      "resource": {
        "resourceType": "DiagnosticReport",
        "code": {"coding": [{"system": "http://snomed.info/sct", "code": "254637007"}]},
        "result": [{"reference": "urn:uuid:<obs-uuid>"}]
      }
    }
  ]
}

TMB and MSI Observations¶

When biomarkers include TMB or MSI values, dedicated Observation resources are created:

TMB: LOINC 94076-7, valueQuantity with unit 1/1000000{Base}
MSI: LOINC 81695-9, valueCodeableConcept with MSI-H/MSS/MSI-L

9.7 Export Format Comparison¶

Feature	Markdown	JSON	PDF	FHIR R4
Human-readable	Yes	No	Yes	No
Machine-parseable	Partial	Yes	No	Yes
Interoperable	No	No	No	Yes
Print-ready	No	No	Yes	No
Branding	No	No	NVIDIA	No
ReportLab required	No	No	Yes	No
SNOMED/LOINC coded	No	No	No	Yes

Chapter 10: Ingest Pipeline Architecture¶

Directory: src/ingest/ (1,793 lines, 9 parsers + base)

10.1 BaseIngestPipeline¶

All parsers inherit from BaseIngestPipeline which provides the standard three-step orchestration:

class BaseIngestPipeline(ABC):
    def __init__(self, collection_manager, embedder, collection_name, batch_size=50):
        ...

    def run(self, query=None, max_results=None) -> int:
        raw_data = self.fetch(**kwargs)            # Step 1: Fetch
        parsed_records = self.parse(raw_data)      # Step 2: Parse
        count = self.embed_and_store(parsed_records) # Step 3: Embed & Store
        return count

    @abstractmethod
    def fetch(self, **kwargs) -> List[Dict]: ...

    @abstractmethod
    def parse(self, raw_data: List[Dict]) -> List[Dict]: ...

    def embed_and_store(self, records: List[Dict]) -> int:
        # Batch embed text fields using self.embedder
        # Insert into Milvus via self.collection_manager
        # Returns total records inserted

10.2 Parser Inventory¶

Parser	Collection	Data Source
`civic_parser.py`	onco_variants	CIViC GraphQL API
`oncokb_parser.py`	onco_variants	OncoKB annotation files
`literature_parser.py`	onco_literature	PubMed E-utilities API
`clinical_trials_parser.py`	onco_trials	ClinicalTrials.gov v2 API
`guideline_parser.py`	onco_guidelines	Curated guideline seed files
`resistance_parser.py`	onco_resistance	Curated resistance seed files
`pathway_parser.py`	onco_pathways	Curated pathway seed files
`outcome_parser.py`	onco_outcomes	Curated outcome records

10.3 Running Ingest¶

Each parser can be invoked independently:

from src.ingest.literature_parser import LiteratureParser

parser = LiteratureParser(
    collection_manager=collection_manager,
    embedder=embedder,
    collection_name="onco_literature",
    batch_size=50,
)
count = parser.run(query="EGFR NSCLC targeted therapy", max_results=500)

The scripts/ directory contains 17 scripts for automated ingestion, seeding, validation, and benchmarking of all collections.

10.4 Embedding Strategy¶

All parsers use the same BGE-small-en-v1.5 model (384 dimensions). Each domain model has a to_embedding_text() method that concatenates the most semantically relevant fields:

class OncologyLiterature(BaseModel):
    def to_embedding_text(self) -> str:
        parts = [self.title, self.text_chunk]
        if self.gene:    parts.append(f"Gene: {self.gene}")
        if self.variant: parts.append(f"Variant: {self.variant}")
        if self.cancer_type:
            parts.append(f"Cancer: {self.cancer_type.value}")
        if self.keywords:
            parts.append(f"Keywords: {', '.join(self.keywords)}")
        return " | ".join(parts)

File: src/cross_modal.py (383 lines)

The OncoCrossModalTrigger fires when a case contains variants with evidence level A or B in ACTIONABLE_TARGETS. It enriches the clinical context by querying across modalities.

11.2 CrossModalResult Dataclass¶

@dataclass
class CrossModalResult:
    trigger_reason: str               # why the trigger fired
    actionable_variants: List[Dict]   # A/B-level variants
    genomic_context: List[Dict]       # genomic evidence hits
    imaging_context: List[Dict]       # imaging findings (if available)
    genomic_hit_count: int
    imaging_hit_count: int
    enrichment_summary: str           # human-readable summary

11.3 Evaluation Flow¶

evaluate(case_or_variants)
  ├─ Extract variants from input
  ├─ Filter to actionability A or B
  │   (re-classify via ACTIONABLE_TARGETS if not pre-computed)
  ├─ If no actionable variants: return None (trigger not fired)
  │
  ├─ Build queries per actionable variant:
  │   Genomic: "{gene} {variant} targeted therapy evidence"
  │            "{gene} mutation clinical significance"
  │   Imaging: "{gene} mutation {cancer_type} imaging findings"
  │
  ├─ _query_genomics(queries) -> genomic hits
  │   Collection: "genomic_evidence"
  │   Threshold: DEFAULT_THRESHOLD = 0.40
  │   Top-K: genomic_top_k = 5
  │
  ├─ _query_imaging(queries) -> imaging hits
  │   Collection prefix: "imaging_"
  │   Graceful failure if imaging collections don't exist
  │
  └─ Build enrichment_summary and return CrossModalResult

11.4 Graceful Degradation¶

The imaging query is wrapped in try/except -- if the Imaging Intelligence Agent is not deployed (no imaging_* collections in Milvus), the trigger still returns genomic context without imaging data. This allows the oncology agent to function standalone or in a multi-agent deployment.

Chapter 12: Testing Strategy¶

Directory: tests/ (4,370 lines, 556 tests)

12.1 Test File Map¶

File	Lines	Tests	Focus
test_models.py	644	~120	Enum membership, model fields
test_integration.py	785	~80	4 patient profiles, end-to-end
test_knowledge.py	603	~70	Knowledge graph integrity
test_export.py	439	~60	All 4 export formats
test_therapy_ranker.py	363	~50	Ranking, resistance, combos
test_trial_matcher.py	363	~50	Matching, scoring, explanations
test_case_manager.py	332	~40	VCF parsing, case lifecycle
test_rag_engine.py	301	~35	Retrieval pipeline
test_collections.py	276	~30	Schema validation
test_agent.py	264	~21	Planning, evaluation

12.2 Test Patterns¶

Parametrized Enum Tests (test_models.py)¶

@pytest.mark.parametrize("cancer_type", [
    CancerType.NSCLC, CancerType.BREAST, CancerType.MELANOMA, ...
])
def test_cancer_type_values(cancer_type):
    assert cancer_type.value == cancer_type.name.lower() or ...

Integration Test Patient Profiles (test_integration.py)¶

Four synthetic patient profiles exercise the full pipeline:

NSCLC with EGFR L858R -- targeted therapy path
Melanoma with BRAF V600E -- combination therapy path
CRC with MSI-H -- biomarker-driven immunotherapy path
Breast with BRCA2 mutation -- PARP inhibitor path

Each profile tests: case creation -> VCF parsing -> actionability classification -> therapy ranking -> trial matching -> export generation.

Knowledge Graph Integrity (test_knowledge.py)¶

Validates structural consistency across all knowledge dictionaries:

Every ACTIONABLE_TARGETS entry has required keys
Every therapy in targeted_therapies appears in THERAPY_MAP
Every resistance_mutations entry has corresponding RESISTANCE_MAP entries
Pathway references are valid

Mock Fixtures (conftest.py)¶

The test suite uses pytest fixtures for: - mock_collection_manager -- in-memory Milvus simulation - mock_embedder -- returns fixed-dimension zero vectors - mock_llm_client -- returns canned responses - sample_vcf_text -- synthetic VCF content with SnpEff annotations

Export Format Tests (test_export.py)¶

Tests validate all four export formats against the same input data:

def test_export_markdown_contains_sections(sample_mtb_packet):
    result = export_markdown(sample_mtb_packet)
    assert "# " in result                     # has heading
    assert "Somatic Variant Profile" in result # has variant section
    assert "Therapy Ranking" in result         # has therapy section
    assert "research use only" in result       # has disclaimer

def test_export_fhir_r4_valid_bundle(sample_mtb_packet):
    bundle = export_fhir_r4(sample_mtb_packet, patient_id="P001")
    assert bundle["resourceType"] == "Bundle"
    assert bundle["type"] == "collection"
    resources = [e["resource"]["resourceType"] for e in bundle["entry"]]
    assert "Patient" in resources
    assert "Observation" in resources
    assert "DiagnosticReport" in resources

Therapy Ranker Tests (test_therapy_ranker.py)¶

Tests cover the full ranking pipeline including edge cases:

Variant-driven therapy identification for each major gene
Biomarker-driven therapy identification (MSI-H, TMB-H, HRD, PD-L1)
Resistance flagging from prior therapy history
Contraindication detection for same drug class
Combination regimen identification
Final rank ordering (clean before flagged)

Trial Matcher Tests (test_trial_matcher.py)¶

Tests validate:

Cancer type alias resolution
Deterministic filter construction
Biomarker matching scoring
Phase and status weighting
Age penalty computation
Composite score calculation
Match explanation generation

12.3 Running Tests¶

# All tests
pytest tests/ -v

# Specific module
pytest tests/test_therapy_ranker.py -v

# Integration tests only
pytest tests/test_integration.py -v -k "integration"

# With coverage
pytest tests/ --cov=src --cov-report=html

Chapter 13: Performance Tuning¶

13.1 Collection Weight Tuning¶

Weights control how much each collection influences final ranking. To tune:

Run a set of benchmark queries with known-good answers
Adjust weights in config/settings.py (or via ONCO_WEIGHT_* env vars)
Measure retrieval quality (NDCG, MRR, or manual relevance assessment)

Default rationale: - Variants (0.18) and literature (0.16) weighted highest because they carry the most direct clinical evidence - Guidelines (0.12) and therapies (0.14) provide actionability context - Cases (0.02) and genomic evidence (0.03) are supplementary

13.2 Milvus Index Tuning¶

Current defaults:

Parameter	Value	Effect
`nlist`	1024	Number of IVF partitions
`nprobe`	16	Partitions searched per query
`metric`	COSINE	Similarity measure

Tuning guidelines:

Scenario	Recommendation
< 100K vectors/collection	nlist=256, nprobe=16
100K-1M vectors/collection	nlist=1024, nprobe=16-32
> 1M vectors/collection	nlist=2048, nprobe=32-64
Memory constrained	Switch to IVF_PQ (lossy)
Maximum recall needed	Use FLAT index (brute-force)

13.3 Parallel Search Workers¶

The RAG engine caps ThreadPoolExecutor at min(len(collections), 8). On systems with more cores, increase this ceiling. On memory-constrained systems, reduce to 4 to limit concurrent Milvus connections.

13.4 Evidence Cap¶

_MAX_EVIDENCE = 30 limits the number of evidence items sent to the LLM. This balances context window utilization against prompt cost:

Increase to 50 for models with large context windows (Claude Opus)
Decrease to 15-20 for faster response times

13.5 Embedding Batch Size¶

EMBEDDING_BATCH_SIZE = 32 (from settings). During ingest, texts are batched before calling embedder.encode(). Increase to 64-128 on GPU systems for throughput. During query time, only single texts are embedded.

13.6 Query Expansion Impact¶

Expanded searches use top_k // 2 to avoid overwhelming the result set. If expansion adds too much noise, consider:

Reducing expansion terms per category
Increasing the score threshold for expanded hits
Disabling expansion for targeted strategies

13.7 Trial Matcher Optimization¶

The deterministic search issues len(aliases) * len(statuses) Milvus queries per cancer type (typically 5 aliases x 4 statuses = 20 queries). For latency-sensitive deployments:

Pre-filter the alias list to the top 2-3 most relevant
Cache deterministic results with a short TTL
Increase top_k multiplier for semantic search, reduce for deterministic

Chapter 14: Extending the Agent¶

14.1 Adding a New Collection¶

Define schema in src/collections.py:

MY_NEW_FIELDS = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=100, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM),
    # ... your metadata fields
    FieldSchema(name="text_summary", dtype=DataType.VARCHAR, max_length=3000),
]

MY_NEW_SCHEMA = CollectionSchema(
    fields=MY_NEW_FIELDS,
    description="Your collection description",
)

Register in COLLECTION_SCHEMAS and COLLECTION_MODELS
Add weight in config/settings.py:

WEIGHT_MY_NEW: float = 0.05  # adjust other weights to maintain sum ~1.0

Add config entry in src/rag_engine.py COLLECTION_CONFIG
Create ingest parser by subclassing BaseIngestPipeline
Add Pydantic model in src/models.py if needed
Write tests in a new test_my_new.py file

14.2 Adding a New Biomarker Rule¶

In src/therapy_ranker.py, add to _identify_biomarker_therapies():

# Example: TROP2 overexpression -> sacituzumab govitecan
trop2 = biomarkers.get("TROP2", "").upper()
if trop2 in ("POSITIVE", "OVEREXPRESSION", "HIGH"):
    therapies.append({
        "drug_name": "sacituzumab govitecan",
        "brand_name": "Trodelvy",
        "category": "ADC",
        "targets": ["TROP2"],
        "evidence_level": "A",
        "guideline_recommendation": "FDA-approved for TROP2+ TNBC and urothelial.",
        "source": "biomarker",
        "source_biomarker": "TROP2",
    })

Then add corresponding entries to: - BIOMARKER_PANELS in src/knowledge.py - Query expansion terms in src/query_expansion.py - Test cases in tests/test_therapy_ranker.py

14.3 Adding a New Drug Class Group¶

In src/therapy_ranker.py, extend _DRUG_CLASS_GROUPS:

_DRUG_CLASS_GROUPS = {
    ...
    "her2_adc": ["trastuzumab deruxtecan", "trastuzumab emtansine"],
    "trop2_adc": ["sacituzumab govitecan", "datopotamab deruxtecan"],
}

14.4 Adding a New Ingest Parser¶

Create src/ingest/my_source_parser.py:

from src.ingest.base import BaseIngestPipeline

class MySourceParser(BaseIngestPipeline):
    def fetch(self, query=None, max_results=None):
        # Call external API or read seed files
        return [{"raw_field": "value", ...}, ...]

    def parse(self, raw_data):
        # Normalize to collection schema fields
        records = []
        for item in raw_data:
            records.append({
                "id": ...,
                "text_summary": ...,  # this gets embedded
                # ... other schema fields
            })
        return records

Register in src/ingest/__init__.py
Add a script in scripts/ to run the parser

14.5 Adding a New Export Format¶

Follow the pattern in src/export.py:

def export_my_format(mtb_packet_or_response: Any, **kwargs) -> Any:
    data = _normalise_input(mtb_packet_or_response)
    # Transform data into your target format
    return result

The _normalise_input() helper handles MTBPacket, dict, and str inputs.

14.6 Adding a New API Route¶

Create api/routes/my_router.py:

from fastapi import APIRouter, Depends
router = APIRouter(prefix="/my-endpoint", tags=["my-feature"])

@router.post("/action")
async def my_action(request: MyRequest):
    state = get_state()
    # Use state["rag_engine"], state["collection_manager"], etc.
    return {"result": ...}

Import and include in api/main.py:

from api.routes import my_router
app.include_router(my_router.router)

Extend OncoCrossModalTrigger.evaluate() in src/cross_modal.py to query additional collection prefixes. The pattern is:

Define the new collection prefix (e.g., DRUG_DISCOVERY_PREFIX = "drug_")
Build domain-specific queries for actionable variants
Query the collections with graceful failure handling
Add results to the CrossModalResult

14.8 Modifying the System Prompt¶

The system prompt in src/rag_engine.py (ONCO_SYSTEM_PROMPT) defines the agent's persona and behavioral constraints. To modify:

Edit the ONCO_SYSTEM_PROMPT string in src/rag_engine.py
Keep the 8 competency areas unless deliberately removing capability
Preserve the 5 behavioral instructions (cite evidence, cross-functional thinking, resistance/contraindications, guideline references, uncertainty)
Test with test_rag_engine.py to verify prompt construction still works

14.9 Changing the Embedding Model¶

To switch from BGE-small-en-v1.5 to a different model:

Update EMBEDDING_MODEL and EMBEDDING_DIM in config/settings.py
Update EMBEDDING_DIM constant in src/collections.py
All existing collections must be dropped and recreated (dimension mismatch will cause Milvus errors)
Re-run all ingest pipelines to re-embed existing data
Update the BGE instruction prefix _BGE_INSTRUCTION in src/rag_engine.py (different models may use different instruction prefixes or none at all)

14.10 Working with the Metrics Module¶

The src/metrics.py module (362 lines) provides Prometheus instrumentation. Key metrics tracked:

onco_query_total -- total queries processed (counter)
onco_query_duration_seconds -- query latency (histogram)
onco_evidence_hits -- evidence items retrieved per query (histogram)
onco_collection_search_duration -- per-collection search time (histogram)
onco_therapy_candidates -- therapies identified per ranking (histogram)
onco_trial_matches -- trials matched per patient (histogram)

Metrics are enabled by default (METRICS_ENABLED=True). Disable in development with ONCO_METRICS_ENABLED=false.

Appendix A: Complete API Reference¶

FastAPI Application (`api/main.py`)¶

Endpoint	Method	Router	Description
`/health`	GET	main	Liveness check with component status
`/api/v1/query`	POST	meta_agent	Full agent query (plan/search/synth)
`/api/v1/search`	POST	meta_agent	Evidence-only search (no LLM)
`/api/v1/compare`	POST	meta_agent	Comparative retrieval
`/api/v1/cases`	POST	cases	Create case from VCF/variants
`/api/v1/cases/{id}`	GET	cases	Retrieve case by ID
`/api/v1/cases/{id}/mtb`	GET	cases	Generate MTB packet
`/api/v1/trials`	POST	trials	Match trials for patient profile
`/api/v1/reports`	POST	reports	Export report (markdown/json/pdf/fhir)
`/api/v1/events`	POST	events	Cross-modal trigger evaluation

Startup Configuration¶

The lifespan() function initializes all components in order:

1. Load OncoSettings (ONCO_ env prefix, .env file)
2. Connect OncoCollectionManager to Milvus
3. Load SentenceTransformer (BGE-small-en-v1.5) via EmbedderWrapper
4. Initialize OncoRAGEngine
5. Initialize OncoIntelligenceAgent
6. Initialize OncologyCaseManager
7. Initialize TrialMatcher
8. Initialize TherapyRanker
9. Initialize OncoCrossModalTrigger
10. Store all in _state dict for dependency injection

EmbedderWrapper¶

The EmbedderWrapper class in api/main.py adapts SentenceTransformer to expose both .encode() and .embed() APIs:

class EmbedderWrapper:
    def __init__(self, model: SentenceTransformer):
        self._model = model

    def encode(self, texts):
        """SentenceTransformer native API -- accepts str or list."""
        return self._model.encode(texts)

    def embed(self, text) -> list:
        """Single-text convenience. Returns list of floats (384-dim)."""
        if isinstance(text, str):
            return self._model.encode([text])[0].tolist()
        return self._model.encode(text).tolist()

The RAG engine uses .encode() (inherited from SentenceTransformer). The therapy ranker, trial matcher, and case manager use .embed() for single-text embedding. Both methods produce identical 384-dimensional vectors.

Request/Response Models¶

The API uses Pydantic models for request validation:

# Query endpoint
class QueryRequest(BaseModel):
    question: str
    top_k: int = 5
    collections_filter: Optional[List[str]] = None
    year_min: Optional[int] = None
    year_max: Optional[int] = None

# Case creation endpoint
class CreateCaseRequest(BaseModel):
    patient_id: str
    cancer_type: str
    stage: str
    vcf_content: Optional[str] = None
    variants: Optional[List[Dict]] = None
    biomarkers: Optional[Dict[str, Any]] = None
    prior_therapies: Optional[List[str]] = None

# Trial matching endpoint
class TrialMatchRequest(BaseModel):
    cancer_type: str
    biomarkers: Dict[str, Any]
    stage: str
    age: Optional[int] = None
    top_k: int = 10

CORS Configuration¶

The FastAPI app enables CORS for development:

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

For production deployments, restrict allow_origins to the Streamlit UI host and any frontend applications.

Appendix B: Collection Schema Reference¶

Summary Table¶

Collection	PK Length	Text Field	Text Max	Typed Metadata Fields
onco_variants	100	text_summary	3000	gene, variant_name, evidence_level, drugs, civic_id
onco_literature	100	text_chunk	3000	title, source_type, year, gene, journal
onco_trials	20	text_summary	3000	phase, status, sponsor, cancer_types, enrollment
onco_biomarkers	100	text_summary	3000	name, biomarker_type, predictive_value, testing_method
onco_therapies	100	text_summary	3000	drug_name, category, targets, mechanism_of_action
onco_pathways	100	text_summary	3000	name, key_genes, therapeutic_targets, cross_talk
onco_guidelines	100	text_summary	3000	org, cancer_type, version, year, evidence_level
onco_resistance	100	text_summary	3000	primary_therapy, gene, mechanism, bypass_pathway
onco_outcomes	100	text_summary	3000	case_id, therapy, response, duration_months
onco_cases	100	text_summary	3000	patient_id, cancer_type, stage, variants, biomarkers
genomic_evidence	200	text_summary	2000	chrom, pos, gene, consequence, impact, am_pathogenicity

Common Index Configuration¶

All collections share: - Embedding: FLOAT_VECTOR, dim=384 (BGE-small-en-v1.5) - Index: IVF_FLAT, nlist=1024 - Metric: COSINE - Search: nprobe=16

Appendix C: Configuration Parameter Reference¶

All parameters use the ONCO_ environment variable prefix (via Pydantic BaseSettings). Override any value by setting ONCO_<PARAM_NAME> in your environment or .env file.

Paths¶

Parameter	Default	Description
PROJECT_ROOT	`<auto-detected>`	Repository root
DATA_DIR	`{PROJECT_ROOT}/data`	Data file directory
CACHE_DIR	`{PROJECT_ROOT}/cache`	Cache directory
REFERENCE_DIR	`{PROJECT_ROOT}/reference`	Reference file directory
RAG_PIPELINE_ROOT	`<parent>/rag-chat-pipeline`	Shared RAG pipeline root

Milvus¶

Parameter	Default	Description
MILVUS_HOST	`localhost`	Milvus server hostname
MILVUS_PORT	`19530`	Milvus server port

Embeddings¶

Parameter	Default	Description
EMBEDDING_MODEL	`BAAI/bge-small-en-v1.5`	HuggingFace model ID
EMBEDDING_DIM	`384`	Vector dimensionality
EMBEDDING_BATCH_SIZE	`32`	Batch size for ingest

LLM¶

Parameter	Default	Description
LLM_PROVIDER	`anthropic`	LLM provider
LLM_MODEL	`claude-sonnet-4-6`	Model identifier
ANTHROPIC_API_KEY	`None`	Anthropic API key

RAG Search¶

Parameter	Default	Description
TOP_K	`5`	Per-collection hit limit
SCORE_THRESHOLD	`0.4`	Minimum similarity score

Collection Weights¶

Parameter	Default	Collection
WEIGHT_VARIANTS	0.18	onco_variants
WEIGHT_LITERATURE	0.16	onco_literature
WEIGHT_THERAPIES	0.14	onco_therapies
WEIGHT_GUIDELINES	0.12	onco_guidelines
WEIGHT_TRIALS	0.10	onco_trials
WEIGHT_BIOMARKERS	0.08	onco_biomarkers
WEIGHT_RESISTANCE	0.07	onco_resistance
WEIGHT_PATHWAYS	0.06	onco_pathways
WEIGHT_OUTCOMES	0.04	onco_outcomes
WEIGHT_CASES	0.02	onco_cases
WEIGHT_GENOMIC	0.03	genomic_evidence

External APIs¶

Parameter	Default	Description
NCBI_API_KEY	`None`	NCBI E-utilities API key
PUBMED_MAX_RESULTS	`5000`	Max PubMed fetch results
CT_GOV_BASE_URL	`https://clinicaltrials.gov/api/v2`	ClinicalTrials.gov API
CIVIC_BASE_URL	`https://civicdb.org/api`	CIViC API endpoint

Server¶

Parameter	Default	Description
API_HOST	`0.0.0.0`	FastAPI bind address
API_PORT	`8527`	FastAPI port
STREAMLIT_PORT	`8526`	Streamlit UI port

Operational¶

Parameter	Default	Description
METRICS_ENABLED	`True`	Enable Prometheus metrics
SCHEDULER_INTERVAL	`168h`	Background task interval (7 days)
CONVERSATION_MEMORY_DEPTH	`3`	Conversation turns to retain

Scheduler Configuration¶

Parameter	Default	Description
SCHEDULER_INTERVAL	`168h`	Ingest re-run interval (7 days)

The scheduler (src/scheduler.py) runs background ingest tasks at the configured interval. Each run refreshes collection data from external sources (PubMed, ClinicalTrials.gov, CIViC) without requiring manual intervention.

Evidence Level Labels (for Export)¶

The export module maps internal evidence levels to human-readable labels:

EVIDENCE_LEVEL_LABELS = {
    "level_1": "Level 1 -- FDA-approved / Standard of Care",
    "level_2": "Level 2 -- Clinical Evidence / Consensus",
    "level_3": "Level 3 -- Case Reports / Early Trials",
    "level_4": "Level 4 -- Preclinical / Biological Rationale",
    "level_R": "Level R -- Resistance Evidence",
}

Environment Variable Quick Reference¶

Set any parameter by prefixing with ONCO_:

# Example .env file
ONCO_MILVUS_HOST=milvus-server.internal
ONCO_MILVUS_PORT=19530
ONCO_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
ONCO_LLM_MODEL=claude-sonnet-4-6
ONCO_ANTHROPIC_API_KEY=sk-ant-...
ONCO_WEIGHT_VARIANTS=0.20
ONCO_WEIGHT_LITERATURE=0.18
ONCO_TOP_K=10
ONCO_SCORE_THRESHOLD=0.35
ONCO_API_PORT=8527
ONCO_METRICS_ENABLED=true
ONCO_CONVERSATION_MEMORY_DEPTH=5

This guide reflects the codebase as of March 2026. For introductory material, see LEARNING_GUIDE.md. For deployment instructions, see DEPLOYMENT.md.