Precision Oncology Intelligence Agent: Foundations Learning Guide¶
An educational primer for clinicians, bioinformaticians, and software engineers who want to understand how this agent works -- no prior oncology or AI background required.
Welcome¶
This guide explains the ideas behind the Precision Oncology Intelligence Agent, a software system that helps clinicians make treatment decisions for cancer patients by combining genomic data, clinical evidence, and artificial intelligence.
If you have ever wondered how a computer can read a patient's DNA, find the right clinical trials, rank therapies, and package everything into a report that a tumor board can act on -- this document walks through every step in plain language with diagrams and examples.
Who This Guide Is For¶
This guide serves three distinct audiences. Each chapter is written so that all three can follow along, but look for the persona tags when a section speaks directly to one group.
Persona 1 -- The Oncologist¶
You treat cancer patients every day. You attend molecular tumor boards. You want to know what this agent does, what evidence it relies on, and how to interpret its output. You do not need to write code.
Persona 2 -- The Bioinformatician¶
You work with genomic data -- VCF files, variant annotations, gene panels. You want to understand how the agent ingests and searches that data, and how to extend it with new collections or annotation sources.
Persona 3 -- The Software Engineer¶
You build and deploy applications. You want to understand the architecture -- FastAPI, Streamlit, Milvus, Docker, Pydantic models -- and how to run, modify, or contribute to the codebase.
What You Will Learn¶
By the end of this guide you will be able to:
- Explain what precision oncology is and why it matters.
- Describe the role of a molecular tumor board (MTB) and how the agent supports it.
- List the major genomic variant types (SNV, indel, CNV, fusion, rearrangement) and explain why they matter for treatment.
- Distinguish five evidence levels (A through E) and map them to the AMP/ASCO/CAP classification framework.
- Name the 11 Milvus vector collections the agent searches and explain the purpose of each one.
- Explain how RAG (Retrieval-Augmented Generation) combines a vector database with a large language model.
- Trace the agent workflow from a question through planning, searching, evaluating, and synthesizing an answer.
- Describe how therapies are ranked with resistance awareness.
- Explain how clinical trial matching combines deterministic filters and semantic search with composite scoring.
- Identify key oncology biomarkers (TMB, MSI, PD-L1, HRD) and the therapies they predict.
- Run a first query against the agent and read the resulting report.
- Read an exported MTB packet in Markdown, JSON, PDF, or FHIR R4 format.
Prerequisites¶
- Oncologists: A working web browser. That is all.
- Bioinformaticians: Familiarity with VCF files and command-line tools.
- Software Engineers: Python 3.10+, Docker, and a text editor.
No special hardware is needed to read this guide. To run the agent itself, the reference platform is an NVIDIA DGX Spark ($3,999) with CUDA 12.x, but any machine with Docker and a CPU will work for testing.
Chapter 1: What Is Precision Oncology?¶
The Old Approach¶
For decades, cancer treatment was based primarily on where the tumor was located. Lung cancer patients got lung cancer drugs. Breast cancer patients got breast cancer drugs. The same chemotherapy cocktail was given to everyone with the same tumor site, even though response rates were often below 30%.
Think of it like prescribing the same pair of eyeglasses to every patient in an optometrist's office -- some will see better, some will see worse, and some will not benefit at all.
The New Approach¶
Precision oncology flips this model. Instead of asking "where is the tumor?" it asks "what is driving the tumor?" The answer comes from the tumor's DNA.
Here is the core idea:
Traditional oncology Precision oncology
===================== =====================
Tumor site --> Drug Tumor DNA --> Driver mutation --> Matched drug
(one-size-fits-all) (tailored to each patient)
A patient with a BRAF V600E mutation in melanoma receives the same class of drug (a BRAF inhibitor like dabrafenib) as a patient with BRAF V600E in colorectal cancer -- because the molecular driver is the same, even though the tumor sites are different.
Why It Works¶
Cancer is fundamentally a disease of the genome. Tumors grow because specific genes acquire mutations that tell cells to divide without stopping. If you can identify the exact mutation driving growth, you can select a drug designed to block that specific signal.
Key statistics that illustrate the shift:
- A single tumor can harbor tens of thousands of mutations, but typically only a handful are "drivers" that fuel growth.
- Over 40 genes now have FDA-approved targeted therapies.
- Some drugs (like larotrectinib for NTRK fusions) are approved regardless of tumor site -- these are called "tissue-agnostic" approvals.
The Data Problem¶
The promise of precision oncology runs headlong into a data problem. To make a good treatment decision, a clinician needs to synthesize information from at least five different domains:
- Genomic data -- the patient's tumor and germline variants.
- Clinical evidence -- what published studies say about those variants.
- Treatment guidelines -- what NCCN, ESMO, and ASCO recommend.
- Clinical trials -- what experimental options are available.
- Resistance data -- which drugs might fail because of co-mutations.
No single database holds all of this. The information is scattered across CIViC, OncoKB, ClinicalTrials.gov, PubMed, and institutional guidelines. Reading and cross-referencing these sources manually takes hours per patient.
That is the problem this agent solves.
Chapter 2: The Molecular Tumor Board¶
What Is an MTB?¶
A molecular tumor board (MTB) is a regular meeting where oncologists, pathologists, geneticists, pharmacists, and other specialists review a cancer patient's genomic test results and decide on a treatment plan.
Think of it as a panel of experts gathered around a conference table, each bringing a different piece of the puzzle:
+-----------+ +--------------+ +-------------+
| Oncologist| | Pathologist | | Geneticist |
| | | | | |
| "What drug| | "The tumor is| | "The VCF |
| do we | | grade III | | shows EGFR |
| give?" | | with high | | L858R and |
| | | mitotic | | TP53 R248W"|
| | | index" | | |
+-----------+ +--------------+ +-------------+
| | |
+----------- MTB Decision -----------+
|
Treatment Plan
The Bottleneck¶
An MTB typically reviews 10-20 cases per session. For each case, someone must prepare a packet that includes:
- A summary of the patient's variants and their clinical significance.
- Relevant published evidence for each actionable variant.
- A ranked list of therapy options.
- Matching clinical trials.
- Known resistance mechanisms for proposed treatments.
Preparing one packet manually can take 30-60 minutes. Multiply that by 15 cases and you have a full day of preparation for a single MTB session.
How the Agent Helps¶
The Precision Oncology Intelligence Agent automates packet preparation. It
takes in a patient's genomic profile and produces a structured MTB packet
(the MTBPacket data model in the codebase) that includes:
- Variant table -- each variant with gene, type, evidence level, and associated drugs.
- Evidence table -- retrieved literature, guidelines, and outcomes.
- Therapy ranking -- prioritized treatments sorted by evidence level, with resistance flags.
- Trial matches -- clinical trials the patient may qualify for.
- Open questions -- uncertainties the board should discuss.
- Citations -- links to PubMed, ClinicalTrials.gov, and guidelines.
The packet can be exported to Markdown, JSON, PDF, or FHIR R4 format.
This does not replace the MTB. It gives the board better-prepared input so clinicians can spend their time on judgment calls rather than data gathering.
What an MTB Packet Contains¶
The MTBPacket model (src/models.py) fields include: case_id,
patient_summary, patient_id, cancer_type, stage, variant_table,
evidence_table, therapy_ranking, trial_matches, open_questions,
generated_at, and citations. The packet is generated automatically by
calling agent.run() and typically completes in under 5 seconds.
Chapter 3: Genomic Variants and Actionability¶
What Is a Genomic Variant?¶
Every cell in your body contains DNA -- a long string of chemical "letters" (A, T, C, G) that encodes instructions for building proteins. A genomic variant is any place where a patient's DNA differs from the standard reference sequence.
Most variants are harmless. Some cause disease. A small number drive cancer.
Types of Variants¶
The agent recognizes seven variant types, defined in the VariantType enum
in src/models.py:
Variant Type What It Means Example
================== ================================ ====================
SNV Single letter change BRAF V600E (T>A)
Indel Small insertion or deletion EGFR exon 19 del
CNV amplification Extra copies of a gene HER2 amplification
CNV deletion Lost copies of a gene CDKN2A homozygous del
Fusion Two genes joined together EML4-ALK fusion
Rearrangement Part of a chromosome rearranged RET rearrangement
Structural variant Large-scale chromosomal change Complex SV
Analogy: The Cookbook¶
Imagine DNA as a cookbook with 20,000 recipes (genes). Each recipe tells the cell how to build a specific protein.
- An SNV is a single typo in a recipe -- "add 1 cup of sugar" becomes "add 1 cup of salt." Sometimes the dish still works; sometimes it is ruined.
- An indel is a few words inserted or removed -- "bake at 350 degrees for 30 minutes" becomes "bake at 350 degrees." Missing information.
- A CNV amplification is like photocopying one recipe 50 times. The cell makes far too much of that protein.
- A fusion is like taping two recipes together so the cell reads them as one, producing a hybrid protein that was never meant to exist.
What Makes a Variant "Actionable"?¶
A variant is actionable when there is clinical evidence that it predicts response (or resistance) to a specific therapy. Not every mutation is actionable. In fact, most are not.
The agent's knowledge base (src/knowledge.py) tracks approximately 40
actionable gene targets. Here are ten of the most important:
| Gene | Key Variant(s) | Cancer Type(s) | Matched Therapy |
|---|---|---|---|
| EGFR | L858R, ex19 del | NSCLC | Osimertinib |
| BRAF | V600E | Melanoma, CRC | Dabrafenib+trametinib |
| ALK | EML4-ALK fusion | NSCLC | Alectinib |
| KRAS | G12C | NSCLC | Sotorasib, adagrasib |
| HER2 | Amplification | Breast, gastric | Trastuzumab deruxtecan |
| NTRK | NTRK fusions | Tissue-agnostic | Larotrectinib |
| RET | Fusions, mutations | NSCLC, thyroid | Selpercatinib |
| ROS1 | CD74-ROS1 fusion | NSCLC | Entrectinib |
| BRCA1/2 | Pathogenic variants | Ovarian, breast | Olaparib (PARP inh.) |
| PIK3CA | H1047R, E545K | Breast | Alpelisib |
Variant Classification: AMP/ASCO/CAP¶
The Association for Molecular Pathology (AMP), the American Society of Clinical Oncology (ASCO), and the College of American Pathologists (CAP) jointly published a four-tier system for classifying somatic variants:
Tier I -- Strong clinical significance
Variants with FDA-approved therapies or included in
professional guidelines (NCCN, ESMO).
Tier II -- Potential clinical significance
Variants with clinical evidence from well-powered studies,
clinical trials, or strong biological rationale.
Tier III -- Unknown significance
Variants not observed in existing databases or with
conflicting evidence.
Tier IV -- Benign or likely benign
Observed at high frequency in the general population;
no disease association.
The agent's five-level evidence system (A through E) maps to a more granular version of this framework. See Chapter 7 for the full mapping.
Chapter 4: The Data Challenge¶
Five Silos, One Patient¶
To make a treatment decision for a single patient, a clinician ideally consults at least five major data sources:
+-------------------+ +-------------------+ +-------------------+
| CIViC / OncoKB | | PubMed / PMC | | ClinicalTrials.gov|
| | | | | |
| Curated variant- | | 35M+ biomedical | | 400,000+ trial |
| drug associations | | research articles | | registrations |
+-------------------+ +-------------------+ +-------------------+
| | |
+----------+--------------+--------------+----------+
| |
+-------------------+ +-------------------+
| NCCN / ESMO / | | Institutional |
| ASCO Guidelines | | Tumor Boards |
| | | |
| Standard-of-care | | Case histories, |
| recommendations | | outcome records |
+-------------------+ +-------------------+
The Problems¶
- Different formats. CIViC uses a REST API. PubMed returns XML. ClinicalTrials.gov has its own JSON schema. Guidelines are often PDFs.
- Different vocabularies. One source calls it "ERBB2"; another calls it "HER2." One says "non-small cell lung cancer"; another says "NSCLC."
- Constant change. New papers are published daily. Trial statuses change. Guidelines are updated annually. A search done last month may already be outdated.
- Volume. There are over 4.1 million records in ClinVar alone. Reading even a fraction manually is impossible.
- Cross-referencing. The real insight comes from connecting variant data to therapy data to trial data to resistance data. No single source makes those connections.
The Human Cost¶
A recent study found that preparing a single MTB case requires consulting an average of 3-5 databases and takes 30-60 minutes of a specialist's time. With 10-20 cases per session, that is a full day of preparation -- time that could be spent with patients.
Chapter 5: How RAG Solves the Data Challenge¶
What Is RAG?¶
RAG stands for Retrieval-Augmented Generation. It is a technique that combines two technologies:
- A search engine that finds relevant documents from a knowledge base.
- A large language model (LLM) that reads those documents and writes a coherent, cited answer.
Analogy: The Research Librarian¶
Imagine you walk into a medical library and ask the librarian a question:
"What targeted therapies are available for KRAS G12C-mutant NSCLC, and what resistance mechanisms should I watch for?"
The librarian does two things:
- Retrieves relevant books, journals, and guidelines from the shelves (the "R" in RAG).
- Generates a summary answer in plain language, citing the sources (the "G" in RAG).
RAG works the same way, except the "library" is a vector database (Milvus) and the "librarian" is an LLM (Claude Sonnet 4.6).
How Vector Search Works¶
Traditional search engines match keywords. If you search for "lung cancer EGFR," they find documents containing those exact words. But what about a document that says "non-small cell pulmonary adenocarcinoma with epidermal growth factor receptor mutations"? That document is highly relevant but shares no keywords with your query.
Vector search solves this by converting text into numbers. Here is how:
Step 1: Embedding¶
An embedding model reads a piece of text and converts it into a list of numbers (a "vector"). The agent uses the BAAI/bge-small-en-v1.5 model, which produces a vector of 384 numbers for any input text.
"EGFR L858R mutation in NSCLC"
|
| Embedding model (BGE-small-en-v1.5)
v
[0.023, -0.112, 0.087, ..., 0.045] <-- 384 numbers
Texts with similar meaning produce vectors that are close together in 384-dimensional space. Texts with different meanings produce vectors that are far apart.
Step 2: Indexing¶
When the agent ingests data, it embeds every document and stores the resulting vector alongside the original text in Milvus. The agent uses IVF_FLAT indexing with COSINE similarity -- a method that groups similar vectors into clusters so searches run fast even over millions of records.
Step 3: Searching¶
When a user asks a question, the agent:
- Embeds the question into a 384-dimensional vector.
- Searches Milvus for the vectors closest to the question vector.
- Returns the original documents attached to those vectors.
User question Vector database (Milvus)
============== ========================
"EGFR resistance +------+------+------+
mechanisms in NSCLC" -----> | Lit | Var | Res | ...11 collections
embed +------+------+------+
& search |
v
Top-K most similar documents
|
v
Claude Sonnet 4.6 (LLM)
|
v
Cited, structured answer
Step 4: Weighted Scoring¶
Not all collections are equally important. The agent assigns a weight to
each collection so that variant data and literature carry more influence
than older case records. The weights (from config/settings.py) are:
| Collection | Weight | Purpose |
|---|---|---|
| onco_variants | 0.18 | Actionable variant evidence |
| onco_literature | 0.16 | Published research |
| onco_therapies | 0.14 | Drug mechanisms and indications |
| onco_guidelines | 0.12 | NCCN/ESMO/ASCO recommendations |
| onco_trials | 0.10 | Clinical trial summaries |
| onco_biomarkers | 0.08 | Predictive/prognostic biomarkers |
| onco_resistance | 0.07 | Resistance mechanisms |
| onco_pathways | 0.06 | Signaling pathway context |
| onco_outcomes | 0.04 | Real-world treatment outcomes |
| genomic_evidence | 0.03 | VCF-derived genomic evidence |
| onco_cases | 0.02 | De-identified patient snapshots |
These weights sum to 1.00 and can be tuned via environment variables with
the ONCO_ prefix.
Step 5: Synthesis¶
The top search results (up to 30 evidence items) are passed as context to Claude Sonnet 4.6, which reads them and produces a structured, cited answer.
Chapter 6: The 11 Knowledge Collections¶
The agent organizes its knowledge into 11 Milvus vector collections. Each collection stores a different type of oncology information. All collections use the same embedding model (BGE-small-en-v1.5, 384 dimensions) and the same index type (IVF_FLAT with COSINE similarity).
Think of these collections as 11 specialized filing cabinets in the medical library, each containing a different kind of document.
Collection 1: onco_variants¶
Purpose: Clinically annotated genomic variants from CIViC and OncoKB.
What it stores: For each variant -- the gene name, variant name (e.g., "V600E"), variant type (SNV, fusion, etc.), cancer type, evidence level, associated drugs, CIViC ID, clinical significance, and a text summary.
Example record:
Gene: EGFR
Variant: L858R
Type: SNV
Cancer: NSCLC
Evidence: A (FDA-approved)
Drugs: osimertinib, erlotinib, gefitinib, afatinib
Summary: "EGFR L858R is a sensitizing mutation found in ~40%
of EGFR-mutant NSCLC cases..."
Who cares: Oncologists looking up whether a patient's variant has an approved therapy. Bioinformaticians curating variant databases.
Collection 2: onco_literature¶
Purpose: Chunks of published research from PubMed, PMC, and preprints.
What it stores: Title, text chunk, source type, year, cancer type, gene, variant, keywords, and journal name.
Why chunks? A single paper may be 10 pages long. Embedding and searching the entire paper at once would dilute the signal. Instead, the agent splits each paper into smaller chunks (passages) so that a search can pinpoint the most relevant paragraph.
Who cares: Everyone. Literature is the backbone of evidence-based medicine.
Collection 3: onco_therapies¶
Purpose: Approved and investigational therapies with mechanism of action data.
What it stores: Drug name, therapy category (targeted, immunotherapy, chemotherapy, ADC, bispecific, etc.), molecular targets, approved indications, known resistance mechanisms, evidence level, and a text summary.
Example record:
Drug: trastuzumab deruxtecan (T-DXd / Enhertu)
Category: ADC (antibody-drug conjugate)
Targets: HER2
Indications: HER2+ breast, HER2+ gastric, HER2-mutant NSCLC
Resistance: HER2 loss, drug efflux, payload resistance
Evidence: A
Who cares: Oncologists selecting treatments. Pharmacists checking mechanisms of action.
Collection 4: onco_guidelines¶
Purpose: Recommendations from NCCN, ASCO, ESMO, and other guideline-issuing organizations.
What it stores: Issuing organization, cancer type, guideline version, year, key recommendations, text summary, and evidence level.
Who cares: Oncologists who need to confirm that a proposed treatment is guideline-concordant.
Collection 5: onco_trials¶
Purpose: Clinical trial summaries from ClinicalTrials.gov.
What it stores: NCT ID, title, text summary, phase, recruitment status, sponsor, cancer types, biomarker eligibility criteria, enrollment count, start year, and outcome summary.
Example record:
ID: NCT04136756
Title: "Phase III study of sotorasib vs. docetaxel in KRAS
G12C-mutant NSCLC"
Phase: Phase 3
Status: Recruiting
Biomarkers: KRAS G12C
Cancer: NSCLC
Who cares: Oncologists matching patients to trials. Trial coordinators evaluating eligibility.
Collection 6: onco_biomarkers¶
Purpose: Predictive and prognostic biomarkers with testing context.
What it stores: Biomarker name, functional type (predictive, prognostic, diagnostic, etc.), cancer types, predictive value, testing method, clinical cutoff, text summary, and evidence level.
Who cares: Pathologists ordering tests. Oncologists interpreting results.
Collection 7: onco_resistance¶
Purpose: Documented mechanisms of therapeutic resistance.
What it stores: Primary therapy, gene, resistance mechanism, bypass pathway, alternative therapies, and a text summary.
Example record:
Therapy: osimertinib
Gene: EGFR
Mechanism: C797S mutation in cis with T790M
Bypass: MET amplification
Alternatives: amivantamab + lazertinib, platinum-based chemo
Who cares: Oncologists planning second-line therapy. Researchers studying resistance biology.
Collection 8: onco_pathways¶
Purpose: Oncogenic signaling pathways with therapeutic context.
What it stores: Pathway name, key genes, therapeutic targets, cross-talk relationships, and a text summary.
The agent tracks 13 pathways: MAPK, PI3K/AKT/mTOR, DNA Damage Repair, Cell Cycle, Apoptosis, WNT, NOTCH, Hedgehog, JAK/STAT, Angiogenesis, Hippo, NF-kB, and TGF-beta.
Why pathways matter: Cancer cells often escape targeted therapy by activating a parallel signaling pathway ("bypass resistance"). Knowing which pathways cross-talk helps predict which escape routes a tumor might take.
Who cares: Bioinformaticians building pathway models. Oncologists understanding why a drug stopped working.
Collection 9: onco_outcomes¶
Purpose: Real-world and trial-derived treatment outcome records.
What it stores: Case ID, therapy given, cancer type, response category (RECIST criteria: CR, PR, SD, PD, NE), duration in months, toxicities, baseline biomarkers, and a text summary.
Who cares: Oncologists comparing expected outcomes. Researchers analyzing real-world evidence.
Collection 10: onco_cases¶
Purpose: De-identified patient case snapshots for similarity search.
What it stores: Patient ID, cancer type, stage, variants, biomarkers, prior therapies, and a text summary.
How it is used: When a new patient comes in, the agent can search this collection to find similar historical cases -- patients with the same cancer type, similar mutations, and similar biomarker profiles -- and see what treatments were tried and how they responded.
Who cares: Oncologists who want to learn from prior cases.
Collection 11: genomic_evidence (Read-Only)¶
Purpose: VCF-derived genomic evidence from the upstream genomics pipeline (Stage 1 of the HCLS AI Factory).
What it stores: Chromosome, position, reference/alternate alleles, quality score, gene, consequence, impact, genotype, clinical significance, rsID, disease associations, AlphaMissense pathogenicity score and class, and a text summary.
Why read-only: This collection is populated by the genomics pipeline (Parabricks + DeepVariant). The oncology agent reads it but never writes to it. This separation ensures that the genomics pipeline is the single source of truth for variant calls.
Who cares: Bioinformaticians who manage the variant calling pipeline. Engineers maintaining data flow between stages.
All 11 at a Glance¶
+--------------------+ +--------------------+ +--------------------+
| 1. onco_variants | | 2. onco_literature | | 3. onco_therapies |
| Actionable somatic | | PubMed / PMC / | | Drugs, MOA, |
| & germline variants| | preprint chunks | | indications |
+--------------------+ +--------------------+ +--------------------+
+--------------------+ +--------------------+ +--------------------+
| 4. onco_guidelines | | 5. onco_trials | | 6. onco_biomarkers |
| NCCN / ESMO / ASCO | | ClinicalTrials.gov | | TMB, MSI, PD-L1, |
| recommendations | | summaries | | HRD, assay details |
+--------------------+ +--------------------+ +--------------------+
+--------------------+ +--------------------+ +--------------------+
| 7. onco_resistance | | 8. onco_pathways | | 9. onco_outcomes |
| Resistance mechs & | | MAPK, PI3K, DDR, | | RECIST responses, |
| bypass strategies | | 13 pathways total | | duration, toxicity |
+--------------------+ +--------------------+ +--------------------+
+--------------------+ +--------------------+
| 10. onco_cases | | 11. genomic_ |
| De-identified | | evidence |
| patient snapshots | | (read-only, VCF) |
+--------------------+ +--------------------+
Chapter 7: Evidence Levels and Variant Classification¶
The Five Evidence Levels¶
The agent uses a five-level evidence system, defined in the EvidenceLevel
enum in src/models.py. Each level indicates how strong the clinical
evidence is for a particular variant-drug association.
Level Name What It Means
====== ===================== ==========================================
A FDA-Approved The variant-drug link has been validated
by the FDA. There is a companion
diagnostic. This is standard of care.
B Clinical Evidence Well-powered clinical studies (phase II/III)
show a significant association. May be
included in guidelines but not yet FDA-
approved for this specific indication.
C Case Reports Evidence comes from case reports, small
series, or early-phase trials. Promising
but not yet confirmed in large studies.
D Preclinical Evidence comes from cell lines, animal
models, or in-vitro experiments. Not yet
tested in patients.
E Computational Evidence comes from computational
prediction (e.g., AlphaMissense, protein
structure modeling). No wet-lab or
clinical validation yet.
Analogy: The Confidence Thermometer¶
Think of evidence levels as a thermometer measuring confidence:
HOT |===| A FDA says "this works"
|===| B Large studies confirm it
|===| C A few reports suggest it
|===| D Works in a lab dish
COLD |===| E A computer predicted it
Level A evidence means you can prescribe with high confidence. Level E evidence means the association is speculative and should only be considered when all other options are exhausted.
Mapping to AMP/ASCO/CAP Tiers¶
The agent's evidence levels align with the AMP/ASCO/CAP classification:
Agent Level AMP/ASCO/CAP Tier Clinical Action
============ ================== ================================
A Tier I-A FDA-approved therapy or guideline
B Tier I-B / II-C Clinical trial evidence, consensus
C Tier II-D Case reports, small studies
D Tier III Unknown significance, preclinical
E Tier III Computational prediction only
Variants classified as Tier IV (benign/likely benign) in the AMP/ASCO/CAP system are filtered out before they reach the agent -- they are not clinically actionable.
VCF Files: Where Variant Data Comes From¶
Genomic variants reach the agent through VCF (Variant Call Format) files. A VCF file is a standard text format produced by variant calling tools like GATK, DeepVariant, Mutect2, and Parabricks.
What a VCF Line Looks Like¶
#CHROM POS ID REF ALT QUAL FILTER INFO
chr7 55181378 rs121913529 T A 99 PASS ANN=A|missense|HIGH|EGFR|...
Each line represents one variant. The columns tell you:
- CHROM/POS -- where in the genome the variant is (chromosome 7, position 55,181,378).
- REF/ALT -- what the reference base is (T) and what the patient has instead (A).
- QUAL -- confidence score for the variant call.
- FILTER -- whether the variant passed quality filters.
- INFO -- annotations added by tools like SnpEff, VEP, or GENEINFO.
How the Agent Parses VCF Files¶
The VCF parser (src/utils/vcf_parser.py) extracts gene names and variant
consequences from the INFO field using three annotation formats:
- SnpEff (ANN= field) -- extracts gene name and functional impact from the standard SnpEff annotation format.
- VEP (CSQ= field) -- extracts gene symbol from the Ensembl Variant Effect Predictor annotation.
- GENEINFO -- extracts gene name from the NCBI GENEINFO field.
The parser handles all common VCF 4.x formats produced by major variant calling pipelines.
How Evidence Level Affects Therapy Ranking¶
When the TherapyRanker scores candidate therapies (Chapter 8), evidence
level is the primary sorting criterion. The ranking order in the codebase
(src/therapy_ranker.py) is:
A therapy backed by Level A evidence will always rank above a therapy backed by Level C evidence, regardless of other factors.
Chapter 8: Therapy Ranking and Resistance¶
How the Therapy Ranker Works¶
The TherapyRanker class (src/therapy_ranker.py) produces a prioritized
list of candidate therapies for a given patient. It uses a six-step process:
Patient Profile
(cancer type, variants, biomarkers, prior therapies)
|
v
Step 1: Identify variant-driven therapies
| Look up each variant in ACTIONABLE_TARGETS
v
Step 2: Identify biomarker-driven therapies
| Check MSI-H, TMB-H, HRD, PD-L1 status
v
Step 3: Rank by evidence level (A > B > C > D > E)
|
v
Step 4: Flag resistance
| Check if prior therapies predict resistance
v
Step 5: Flag contraindications
| Same drug class as a previously failed therapy
v
Step 6: Attach supporting evidence
| Retrieve from onco_therapies and onco_literature
v
Ranked Therapy List with Citations
Example Walkthrough¶
Consider a patient with: - Cancer: NSCLC, Stage IV - Variants: EGFR L858R (SNV), TP53 R248W (SNV) - Biomarkers: PD-L1 TPS 80%, TMB 8.2 mut/Mb - Prior therapies: erlotinib (progressed after 11 months)
Step 1: EGFR L858R maps to osimertinib, erlotinib, gefitinib, afatinib.
Step 2: PD-L1 >= 50% maps to pembrolizumab. TMB 8.2 is below the TMB-High threshold (typically >= 10 mut/Mb), so no TMB-driven therapy.
Step 3: All EGFR therapies are Level A. Pembrolizumab for PD-L1 is also Level A.
Step 4: The patient previously received erlotinib and progressed. This raises a resistance flag for erlotinib-class drugs. Known resistance mechanisms include T790M mutation and MET amplification. The ranker checks for these.
Step 5: Erlotinib is flagged as a contraindication (same drug, prior failure). Gefitinib and afatinib (same generation TKIs) are downranked.
Step 6: Osimertinib rises to the top because it is a third-generation TKI effective against T790M resistance.
Final ranking:
Rank Therapy Evidence Notes
==== ================= ======== ============================
1 Osimertinib A Overcomes erlotinib resistance
2 Pembrolizumab A PD-L1 >= 50%
3 Amivantamab + B EGFR-directed bispecific +
lazertinib 3rd-gen TKI combination
4 Erlotinib A CONTRAINDICATED (prior failure)
Resistance Mechanisms¶
Resistance is one of the biggest challenges in precision oncology. A drug that works brilliantly for 12 months can suddenly stop working because the tumor evolves.
The agent tracks three categories of resistance:
1. On-Target Resistance¶
The drug's target gene acquires a new mutation that prevents the drug from binding.
Example: EGFR T790M
- Patient on erlotinib (1st-gen EGFR TKI)
- Tumor acquires T790M "gatekeeper" mutation
- Erlotinib can no longer bind EGFR
- Solution: Switch to osimertinib (3rd-gen, binds despite T790M)
2. Bypass Resistance¶
The tumor activates a parallel signaling pathway that circumvents the blocked target.
Example: MET amplification during EGFR therapy
- Patient on osimertinib
- Tumor amplifies MET gene, activating PI3K/AKT pathway
- EGFR is still inhibited, but the cell survives via MET
- Solution: Add MET inhibitor (capmatinib, tepotinib)
3. Lineage Plasticity¶
The tumor transforms into a different cell type entirely, changing from (for example) an adenocarcinoma to a small cell carcinoma.
Example: Small cell transformation in EGFR-mutant NSCLC
- ~5-15% of EGFR-mutant NSCLC patients
- Tumor loses EGFR dependency entirely
- Targeted therapy no longer effective
- Solution: Switch to small cell chemo (platinum + etoposide)
The onco_resistance collection stores all of these mechanisms so the
agent can proactively warn clinicians about potential resistance.
Chapter 9: Clinical Trial Matching¶
Why Trial Matching Matters¶
Clinical trials offer access to new therapies before they are FDA-approved. For patients who have exhausted standard options, a trial may be their best chance. But finding the right trial is hard:
- There are over 400,000 registered trials on ClinicalTrials.gov.
- Eligibility criteria are complex and written in free text.
- Trial statuses change frequently (recruiting, closed, suspended).
- Patients may qualify for a trial they never hear about.
How the Trial Matcher Works¶
The TrialMatcher class (src/trial_matcher.py) uses a hybrid approach
that combines rule-based filtering with semantic search.
Patient Profile
(cancer type, variants, biomarkers, stage, age)
|
v
Step 1: Deterministic filter
| Match cancer type (with fuzzy aliases)
| Filter to open statuses (Recruiting, Active, etc.)
v
Step 2: Semantic search
| Embed patient profile as a query vector
| Search onco_trials for similar trial descriptions
v
Step 3: Composite scoring
| Score = w1*biomarker_match + w2*semantic_sim
| + w3*phase_weight + w4*status_weight
v
Step 4: Age-based eligibility filtering
|
v
Step 5: Structured explanation for each match
|
v
Ranked Trial List
Composite Scoring Weights¶
The trial match score is a weighted combination of four factors, configured
in config/settings.py:
Factor Weight What It Measures
================ ====== =================================
Biomarker match 0.40 Does the trial require this patient's
specific biomarkers?
Semantic sim. 0.25 How similar is the trial description
to the patient's profile?
Phase weight 0.20 Higher phases (III > II > I) get
higher weight.
Status weight 0.15 Actively recruiting trials rank
above "not yet recruiting."
Phase Weights¶
The matcher assigns weights to trial phases to prefer later-phase trials, which are typically closer to producing actionable results:
Phase Weight
============ ======
Phase 3 1.0
Phase 2/3 0.9
Phase 2 0.8
Phase 1/2 0.7
Phase 1 0.6
Phase 4 0.5
Status Weights¶
Trials that are actively enrolling patients are more useful than those that are closed:
Status Weight
========================== ======
Recruiting 1.0
Enrolling by invitation 0.8
Active, not recruiting 0.6
Not yet recruiting 0.4
Trials with statuses like Completed, Terminated, Suspended, or Withdrawn are excluded from matching entirely (they are not in the OPEN_STATUSES set).
Chapter 10: Biomarkers in Oncology¶
What Is a Biomarker?¶
A biomarker is a measurable characteristic that tells you something about a patient's disease or their likely response to a treatment. In oncology, biomarkers are used to select drugs, predict outcomes, and monitor treatment response.
Analogy: The Dashboard Gauges¶
Think of biomarkers as gauges on a car's dashboard. The speedometer tells you how fast you are going. The fuel gauge tells you how far you can go. The temperature gauge warns you of overheating. Similarly:
- TMB tells you how many mutations the tumor has (the "mutation load gauge").
- MSI tells you if the tumor's DNA repair machinery is broken (the "repair status gauge").
- PD-L1 tells you if the tumor is hiding from the immune system (the "immune evasion gauge").
- HRD tells you if the tumor cannot repair DNA breaks properly (the "DNA repair gauge").
The Eight Biomarker Types¶
The agent classifies biomarkers into eight functional categories, defined
in the BiomarkerType enum in src/models.py:
Type What It Tells You
===================== ===============================================
Predictive Predicts response to a specific therapy.
Example: EGFR mutation predicts response to
osimertinib.
Prognostic Predicts disease outcome regardless of therapy.
Example: TP53 mutation is associated with worse
prognosis across many cancer types.
Diagnostic Helps confirm what type of cancer it is.
Example: PSA (prostate-specific antigen) for
prostate cancer screening.
Monitoring Tracks disease status over time.
Example: ctDNA (circulating tumor DNA) levels
to detect recurrence.
Resistance Indicates likely resistance to a therapy.
Example: EGFR T790M predicts resistance to
erlotinib.
Pharmacodynamic Measures whether the drug is hitting its target.
Example: phospho-ERK reduction during BRAF
inhibitor therapy.
Screening Used in population-level screening before
diagnosis. Example: BRCA1/2 testing in high-
risk individuals.
Therapeutic Selection Directly determines which therapy to use.
Example: HER2 status determines whether
trastuzumab is indicated.
Key Biomarker-Therapy Mappings¶
The agent maintains a curated set of biomarker-to-therapy mappings. These are among the most well-established in precision oncology:
Biomarker Cutoff / Status Therapy Evidence
================ ================== ====================== ========
MSI-H Microsatellite Pembrolizumab A
instability-high (tissue-agnostic)
TMB-H >= 10 mut/Mb Pembrolizumab A
(tissue-agnostic)
HRD Homologous PARP inhibitors A
recombination (olaparib, niraparib,
deficiency rucaparib, talazoparib)
PD-L1 >= 50% TPS >= 50% Pembrolizumab A
(first-line NSCLC)
NTRK fusion Any NTRK1/2/3 Larotrectinib or A
fusion entrectinib
(tissue-agnostic)
EGFR mutation L858R, ex19 del Osimertinib A
BRAF V600E V600E mutation Dabrafenib + A
trametinib
ALK fusion EML4-ALK or other Alectinib A
ALK fusion
KRAS G12C G12C mutation Sotorasib or A
adagrasib
RET fusion/mut Fusions or Selpercatinib A
activating muts
Chapter 11: The Agent Workflow¶
The Four-Phase Loop¶
The OncoIntelligenceAgent class (src/agent.py) implements a four-phase
workflow: Plan, Search, Evaluate, and Synthesize. If the evidence is
insufficient, the agent retries with broader queries (up to 2 retries).
User Question
|
v
+-------------------+
| Phase 1: PLAN |
| |
| - Parse question |
| - Identify genes |
| - Identify cancer |
| types |
| - Choose strategy |
| - Decompose into |
| sub-questions |
+-------------------+
|
v
+-------------------+
| Phase 2: SEARCH |
| |
| - Embed query |
| - Search 11 |
| collections |
| - Apply weighted |
| scoring |
| - Collect hits |
+-------------------+
|
v
+-------------------+
| Phase 3: EVALUATE |
| |
| - Count hits |
| - Check diversity |
| (collections) |
| - Check quality |
| (avg score) |
| |
| sufficient? --+ |
| partial? --+ |
| insufficient -+ |
+-------------------+
| |
sufficient| |insufficient
| | (retry with
v | broader queries,
+-------------------+ up to 2x)
| Phase 4:SYNTHESIZE|<-----+
| |
| - Pass top-30 |
| evidence items |
| to Claude LLM |
| - Generate cited |
| answer |
| - Format report |
+-------------------+
|
v
AgentResponse
(answer + evidence + report)
Phase 1: Plan¶
The agent analyzes the user's question to extract structured information:
- Target genes: Recognized from a set of 30+ known oncology genes (BRAF, EGFR, ALK, KRAS, HER2, NTRK, RET, MET, FGFR, PIK3CA, BRCA1, BRCA2, TP53, PTEN, and more).
- Cancer types: Recognized from 25+ canonical names and 60+ aliases (e.g., "lung cancer" maps to "NSCLC", "colon" maps to "COLORECTAL").
- Topics: Identified by keyword matching (resistance, biomarker, immunotherapy, clinical trial, combination therapy, etc.).
- Search strategy: One of three options:
broad-- no specific gene or cancer type identified; search everything.targeted-- both a gene and cancer type are identified; focus the search.comparative-- the question asks to compare two entities ("vs", "compare", "difference between").
Complex questions are decomposed into sub-questions. For example:
"What are the resistance mechanisms to EGFR TKIs in NSCLC and what clinical trials are available?"
Becomes: 1. "Mechanisms of resistance to EGFR inhibitors" 2. "Active clinical trials targeting EGFR in NSCLC"
Phase 2: Search¶
The RAG engine (src/rag_engine.py) embeds each query and searches all 11
collections in parallel using a thread pool (up to 6 workers). Results are
ranked by weighted cosine similarity.
The search is cross-collection: the same question is sent to all 11 collections, and results from every collection are merged into a single ranked list.
Phase 3: Evaluate¶
The agent assesses whether the retrieved evidence is sufficient:
- Sufficient: At least 3 hits from at least 2 different collections.
- Partial: Some hits but not enough diversity.
- Insufficient: No relevant hits at all.
If the evidence is insufficient or partial and retries remain, the agent broadens the search. If the strategy was "targeted," it switches to "broad." It also generates fallback queries (e.g., "{gene} oncology therapeutic implications").
The minimum similarity score threshold is 0.30 -- hits below this cutoff are discarded as irrelevant.
Phase 4: Synthesize¶
The top evidence items (capped at 30) are passed to Claude Sonnet 4.6 along with a specialized oncology system prompt. The LLM reads the evidence and generates a structured, cited answer.
The system prompt instructs the LLM to: 1. Cite evidence with PubMed or ClinicalTrials.gov links. 2. Connect genomic variants to therapy options, trials, and resistance. 3. Proactively highlight resistance mechanisms and safety concerns. 4. Reference NCCN, ESMO, or ASCO guidelines. 5. Acknowledge uncertainty and evidence gaps.
The final output is an AgentResponse containing the answer text, the
evidence used, the search plan, and a formatted Markdown report.
Chapter 12: Running Your First Query¶
Option 1: The Streamlit UI (Recommended for Oncologists)¶
The agent includes a Streamlit web interface (app/oncology_ui.py) that
provides a user-friendly chat-style interaction.
Step 1: Start the services (Docker required):
Step 2: Open your browser to http://localhost:8526.
Step 3: Type a question in the chat box. For example:
"What targeted therapies are available for BRAF V600E melanoma, and what resistance mechanisms should I monitor?"
Step 4: The agent will: 1. Plan the search (identify BRAF, melanoma, resistance topic). 2. Search all 11 collections. 3. Evaluate the evidence. 4. Synthesize a cited answer.
Step 5: Review the report. It will include: - A summary of BRAF V600E in melanoma. - Ranked therapies (dabrafenib + trametinib at #1). - Known resistance mechanisms (MEK1/2 mutations, NRAS activation, BRAF amplification). - Relevant clinical trials. - Citations to published literature.
Option 2: The FastAPI Endpoint (For Engineers and Integrations)¶
The agent exposes a REST API on port 8527.
Query endpoint:
curl -X POST http://localhost:8527/api/query \
-H "Content-Type: application/json" \
-d '{
"question": "What is the significance of KRAS G12C in NSCLC?",
"cancer_type": "nsclc",
"gene": "KRAS"
}'
The response is a JSON object matching the AgentResponse model:
{
"question": "What is the significance of KRAS G12C in NSCLC?",
"answer": "KRAS G12C is found in approximately 13% of NSCLC cases...",
"evidence": {
"query": "...",
"hits": [ ... ],
"total_collections_searched": 11,
"search_time_ms": 342.5
},
"knowledge_used": ["KRAS", "NSCLC", "sotorasib", "adagrasib"],
"timestamp": "2026-03-12T14:22:01Z",
"report": "# Oncology Intelligence Report\n..."
}
Option 3: The Python API (For Bioinformaticians and Developers)¶
from src.agent import OncoIntelligenceAgent
from src.rag_engine import OncoRAGEngine
from src.collections import OncoCollectionManager
# Connect to Milvus
manager = OncoCollectionManager(host="localhost", port=19530)
manager.connect()
# Initialize the RAG engine
rag = OncoRAGEngine(collection_manager=manager)
# Create the agent
agent = OncoIntelligenceAgent(rag_engine=rag)
# Run a query
response = agent.run("What ALK inhibitors are effective after crizotinib resistance?")
# Print the report
print(response.report)
Interpreting the Output¶
Every agent response includes:
| Field | What It Contains |
|---|---|
question |
Your original question. |
answer |
The synthesized, cited answer text. |
evidence |
The CrossCollectionResult with all search hits. |
knowledge_used |
Genes, drugs, and concepts referenced in the answer. |
plan |
The SearchPlan showing strategy and sub-questions. |
report |
A formatted Markdown report ready for review. |
timestamp |
When the response was generated (UTC). |
Exporting Results¶
The export module (src/export.py) supports four output formats:
| Format | Use Case |
|---|---|
| Markdown | Quick review, pasting into documents |
| JSON | Integration with other systems |
| Formal reports for the medical record | |
| FHIR R4 | Clinical interoperability (HL7 standard) |
FHIR R4: Why It Matters¶
FHIR (Fast Healthcare Interoperability Resources) is the international standard for exchanging healthcare data electronically. When the agent exports a report in FHIR R4 format, it produces a structured document that any FHIR-compliant electronic health record (EHR) system can import.
The export uses standard LOINC codes for genomic observations:
Code Meaning
========= =====================================
81247-9 Master HL7 genomic report
48018-6 Gene studied
69548-6 Genetic variant assessment
This means an oncologist can generate a report in the agent and have it automatically appear in the patient's medical record at their hospital -- no manual data entry required.
PDF Reports¶
PDF exports include branded headers (NVIDIA green by default, configurable
via ONCO_PDF_BRAND_COLOR_R/G/B environment variables), formatted variant
tables, therapy rankings, and trial match summaries.
Understanding the Technical Stack¶
For software engineers new to the project, here is how the major components fit together:
+-------------------+ +-------------------+ +-------------------+
| Streamlit UI | | FastAPI REST | | Python API |
| (port 8526) | | (port 8527) | | (direct import) |
| | | | | |
| Browser-based | | JSON endpoints | | Programmatic |
| chat interface | | for integrations | | access from code |
+--------+----------+ +--------+----------+ +--------+----------+
| | |
+------------+------------+------------+------------+
|
+--------v----------+
| OncoIntelligence |
| Agent |
| (src/agent.py) |
+--------+----------+
|
+--------v----------+
| OncoRAGEngine |
| (src/rag_engine) |
+--------+----------+
|
+------------+------------+
| |
+--------v----------+ +--------v----------+
| BGE-small-en-v1.5 | | Claude Sonnet 4.6 |
| (embedding model) | | (LLM synthesis) |
| 384-dim vectors | | Anthropic API |
+--------+----------+ +-------------------+
|
+--------v----------+
| Milvus |
| (port 19530) |
| 11 collections |
| IVF_FLAT / COSINE |
+-------------------+
Pydantic Models¶
Every data structure in the agent is a Pydantic model (src/models.py).
Pydantic provides automatic type validation, JSON serialization, and clear
error messages when data does not match the expected schema.
Docker Containerization¶
The entire stack (Milvus, etcd, MinIO, the agent, the UI, monitoring)
runs in Docker containers via docker-compose.dgx-spark.yml. Start
everything with docker compose up -d; tear down with docker compose down.
Appendix A: Glossary¶
Below are 60 terms used throughout this guide and the codebase, defined in plain language.
| # | Term | Definition |
|---|---|---|
| 1 | Actionable variant | A genomic mutation for which there is a clinically relevant therapy or clinical trial. |
| 2 | ADC (antibody-drug conjugate) | A drug that links an antibody (which finds the tumor cell) to a toxic payload (which kills it). Example: trastuzumab deruxtecan. |
| 3 | Allele | One of two copies of a gene at a given position in the genome. |
| 4 | AlphaMissense | A machine learning model from DeepMind that predicts whether a missense variant is pathogenic. |
| 5 | AMP/ASCO/CAP | The Association for Molecular Pathology, American Society of Clinical Oncology, and College of American Pathologists -- jointly published the standard variant classification tiers. |
| 6 | Biomarker | A measurable characteristic (e.g., TMB, MSI, PD-L1) used to guide treatment decisions. |
| 7 | Bispecific antibody | An antibody engineered to bind two different targets simultaneously. Example: amivantamab (EGFR + MET). |
| 8 | BRAF | A gene that encodes a protein in the MAPK signaling pathway. BRAF V600E is the most common oncogenic BRAF mutation. |
| 9 | Bypass resistance | When a tumor activates an alternative signaling pathway to circumvent a blocked target. |
| 10 | CIViC | Clinical Interpretation of Variants in Cancer -- an open-source database of curated variant-drug associations. |
| 11 | Claude | An LLM (large language model) developed by Anthropic. This agent uses Claude Sonnet 4.6 for synthesis. |
| 12 | Clinical trial | A research study that tests a new treatment in human patients. Trials progress through phases I-IV. |
| 13 | ClinVar | An NCBI database of genomic variants and their clinical significance (over 4.1 million records). |
| 14 | CNV (copy number variation) | A change in the number of copies of a gene. Amplification means extra copies; deletion means lost copies. |
| 15 | Companion diagnostic | An FDA-approved test required before prescribing a specific drug (e.g., EGFR mutation test before osimertinib). |
| 16 | Cosine similarity | A mathematical measure of how similar two vectors are. Values range from -1 (opposite) to 1 (identical). |
| 17 | ctDNA | Circulating tumor DNA -- fragments of tumor DNA found in the bloodstream. Used for liquid biopsy. |
| 18 | Docker | A platform that packages software into containers so it runs consistently on any machine. |
| 19 | Driver mutation | A mutation that actively contributes to cancer growth (as opposed to a passenger mutation). |
| 20 | Embedding | A fixed-length numerical representation of text produced by a machine learning model. |
| 21 | EGFR | Epidermal Growth Factor Receptor -- a gene commonly mutated in NSCLC. |
| 22 | ESMO | European Society for Medical Oncology -- publishes treatment guidelines. |
| 23 | Evidence level | A rating (A through E) indicating the strength of clinical evidence for a variant-drug link. |
| 24 | Exon | A segment of a gene that encodes protein. Mutations in exons often affect protein function. |
| 25 | FastAPI | A modern Python web framework used by the agent's REST API (port 8527). |
| 26 | FHIR R4 | Fast Healthcare Interoperability Resources, Release 4 -- an HL7 standard for exchanging healthcare data electronically. |
| 27 | Fusion | A genomic event where two separate genes become joined, producing a hybrid protein. Example: EML4-ALK. |
| 28 | Gatekeeper mutation | A mutation at a specific position in a kinase that blocks drug binding. Example: EGFR T790M. |
| 29 | Germline variant | A variant inherited from a parent, present in every cell of the body. Example: BRCA1 mutation. |
| 30 | HRD (homologous recombination deficiency) | A defect in DNA repair that makes tumors vulnerable to PARP inhibitors. |
| 31 | Immunotherapy | Treatment that helps the immune system recognize and attack cancer cells. Example: pembrolizumab (anti-PD-1). |
| 32 | Indel | A small insertion or deletion of nucleotides in the DNA sequence. |
| 33 | IVF_FLAT | A vector index type in Milvus that partitions vectors into clusters for fast approximate nearest-neighbor search. |
| 34 | KRAS | A gene in the RAS/MAPK pathway. KRAS G12C is targetable with sotorasib and adagrasib. |
| 35 | LLM (large language model) | An AI model trained on large amounts of text that can read, summarize, and generate natural language. |
| 36 | Lineage plasticity | When a tumor transforms from one cell type to another (e.g., adenocarcinoma to small cell carcinoma). |
| 37 | MAPK pathway | A signaling cascade (RAS-RAF-MEK-ERK) that controls cell growth. Frequently mutated in cancer. |
| 38 | Milvus | An open-source vector database used to store and search embeddings. |
| 39 | Missense variant | An SNV that changes one amino acid to another in the resulting protein. |
| 40 | MSI (microsatellite instability) | A condition where the DNA mismatch repair system is defective, leading to hypermutation. MSI-H tumors respond well to immunotherapy. |
| 41 | MTB (molecular tumor board) | A meeting where specialists review a patient's genomic data and decide on treatment. |
| 42 | NCCN | National Comprehensive Cancer Network -- publishes widely used treatment guidelines. |
| 43 | NSCLC | Non-small cell lung cancer -- the most common type of lung cancer (~85% of cases). |
| 44 | OncoKB | A precision oncology knowledge base maintained by Memorial Sloan Kettering Cancer Center. |
| 45 | PARP inhibitor | A drug that blocks the PARP enzyme, preventing DNA repair in HRD-positive tumors. Examples: olaparib, niraparib. |
| 46 | Passenger mutation | A mutation that does not contribute to cancer growth. Present by chance. |
| 47 | PD-L1 | Programmed Death-Ligand 1 -- a protein that tumors use to hide from the immune system. High PD-L1 expression predicts response to anti-PD-1 immunotherapy. |
| 48 | Pydantic | A Python library for data validation using type annotations. The agent uses Pydantic models for all data structures. |
| 49 | RAG (retrieval-augmented generation) | A technique that combines document retrieval with LLM text generation to produce grounded, cited answers. |
| 50 | RECIST | Response Evaluation Criteria In Solid Tumors -- a standard for measuring tumor response: CR (complete response), PR (partial response), SD (stable disease), PD (progressive disease), NE (not evaluable). |
| 51 | Resistance mechanism | A biological change in the tumor that causes a previously effective drug to stop working. |
| 52 | SNV (single nucleotide variant) | A change of a single DNA letter. The simplest type of mutation. |
| 53 | Somatic variant | A mutation acquired during a person's lifetime, present only in tumor cells. |
| 54 | Streamlit | A Python framework for building interactive web applications. The agent's UI runs on port 8526. |
| 55 | Targeted therapy | A drug designed to block a specific molecular target (e.g., osimertinib blocks EGFR). |
| 56 | TMB (tumor mutational burden) | The number of mutations per megabase of DNA in a tumor. TMB-H (high) tumors may respond to immunotherapy. |
| 57 | TKI (tyrosine kinase inhibitor) | A class of targeted therapy that blocks tyrosine kinase enzymes. Examples: osimertinib, alectinib, sotorasib. |
| 58 | Tissue-agnostic | An FDA approval that applies regardless of tumor location, based on a molecular biomarker (e.g., NTRK fusion, MSI-H). |
| 59 | VCF (Variant Call Format) | A standard file format for storing genomic variant data. The agent parses VCF files annotated by SnpEff, VEP, or GENEINFO. |
| 60 | Vector | A list of numbers representing a piece of text in high-dimensional space. Used for similarity search. |
Appendix B: Key Statistics¶
Agent Codebase¶
| Metric | Value |
|---|---|
| Python source files | 66 |
| Lines of code | ~20,490 |
| Test count | 556 |
| Milvus collections | 11 |
| Actionable gene targets | ~40 |
| Supported cancer types | 26 |
| Evidence levels | 5 (A through E) |
| Therapy categories | 9 |
| Biomarker types | 8 |
| Oncogenic pathways | 13 |
| Response categories | 5 (CR, PR, SD, PD, NE) |
| Embedding model | BAAI/bge-small-en-v1.5 |
| Embedding dimensions | 384 |
| LLM | Claude Sonnet 4.6 |
| Vector index type | IVF_FLAT (COSINE) |
| Max evidence to LLM | 30 items |
| Search parallelism | 6 threads |
| Retry attempts | Up to 2 |
| Min. similarity score | 0.30 |
| Min. sufficient hits | 3 (from 2+ collections) |
Platform¶
| Component | Value |
|---|---|
| Reference hardware | NVIDIA DGX Spark ($3,999) |
| GPU/CUDA | CUDA 12.x |
| Embedding batch size | 32 |
| API port | 8527 (FastAPI) |
| UI port | 8526 (Streamlit) |
| Milvus port | 19530 |
| Export formats | Markdown, JSON, PDF, FHIR R4 |
Upstream Pipeline (HCLS AI Factory Context)¶
| Metric | Value |
|---|---|
| Demo genome variants | 11.7 million |
| Searchable vectors | 3.56 million |
| ClinVar records | 4.1 million |
| AlphaMissense preds. | 71 million |
| Gene coverage | 201 genes |
| Therapeutic areas | 13 |
| Druggable targets | 171 |
| GPU variant calling | 120-240 min (vs 24-48 hrs on CPU) |
| Query response time | < 5 seconds |
Collection Weight Distribution¶
onco_variants |================== 0.18
onco_literature |================ 0.16
onco_therapies |============== 0.14
onco_guidelines |============ 0.12
onco_trials |========== 0.10
onco_biomarkers |======== 0.08
onco_resistance |======= 0.07
onco_pathways |====== 0.06
onco_outcomes |==== 0.04
genomic_evidence |=== 0.03
onco_cases |== 0.02
+----+----+----+----+
0 0.05 0.10 0.15 0.20
Enum Quick Reference¶
CancerType (26 values)¶
nsclc, sclc, breast, colorectal, melanoma, pancreatic,
ovarian, prostate, renal, bladder, head_neck, hepatocellular,
gastric, glioblastoma, aml, cml, all, cll, dlbcl,
multiple_myeloma, cholangiocarcinoma, endometrial, thyroid,
mesothelioma, sarcoma, other
EvidenceLevel (5 values)¶
A (FDA-approved), B (clinical evidence), C (case reports),
D (preclinical), E (computational)
TherapyCategory (9 values)¶
targeted, immunotherapy, chemotherapy, hormonal, combination,
radiotherapy, cell_therapy, adc, bispecific
BiomarkerType (8 values)¶
predictive, prognostic, diagnostic, monitoring, resistance,
pharmacodynamic, screening, therapeutic_selection
PathwayName (13 values)¶
mapk, pi3k_akt_mtor, dna_damage_repair, cell_cycle, apoptosis,
wnt, notch, hedgehog, jak_stat, angiogenesis, hippo,
nf_kb, tgf_beta
ResponseCategory (5 values -- RECIST criteria)¶
complete_response (CR), partial_response (PR), stable_disease (SD),
progressive_disease (PD), not_evaluable (NE)
VariantType (7 values)¶
snv, indel, cnv_amplification, cnv_deletion, fusion,
rearrangement, structural_variant
TrialPhase (8 values)¶
Early Phase 1, Phase 1, Phase 1/Phase 2, Phase 2,
Phase 2/Phase 3, Phase 3, Phase 4, N/A
GuidelineOrg (8 values)¶
NCCN, ESMO, ASCO, WHO, CAP/AMP, FDA, EMA, AACR
End of Learning Guide Foundations
For questions or contributions, see the project repository or contact the development team.