Precision Biomarker Intelligence Agent -- Learning Guide (Advanced)
Author: Adam Jones
Date: March 2026
Version: 1.0.0
Audience: Engineers extending the agent, adding collections, writing new analysis modules, or deploying to production.
Table of Contents
- Prerequisites
- Deep Dive into the RAG Engine
- Vector Search Internals
- Adding a New Collection
- The Pharmacogenomics Engine Deep Dive
- Biological Age Algorithms
- Disease Trajectory Prediction
- Genotype-Based Reference Ranges
- Clinical Intelligence Modules
- Export System Deep Dive
- Testing Strategies
- The Autonomous Agent Pipeline
- Production Deployment
- Future Architecture
Appendices:
- A. Complete API Reference
- B. Configuration Reference
- C. Collection Schema Reference
Chapter 1: Prerequisites¶
1.1 Required Knowledge¶
Before working with this codebase you should be comfortable with:
- Python 3.10+ -- Pydantic v2, dataclasses, type hints, async/await.
- Vector databases -- Milvus, approximate nearest-neighbor search, IVF indices.
- Embeddings -- Sentence Transformers, BAAI/bge-small-en-v1.5, cosine similarity.
- Clinical genomics -- Star alleles, CPIC guidelines, pharmacogenomics, VCF format.
- FastAPI -- Dependency injection, lifespan events, middleware, Pydantic schemas.
- Docker -- Multi-stage builds, compose networking, health checks.
1.2 Codebase Map¶
The agent lives at ai_agent_adds/precision_biomarker_agent/ within the HCLS AI Factory monorepo. Every source file is listed below with its line count.
Source Modules (src/)
| File | Lines | Purpose |
|---|---|---|
| pharmacogenomics.py | 1,503 | Star allele to metabolizer phenotype mapping (CPIC) |
| disease_trajectory.py | 1,421 | Pre-symptomatic disease trajectory prediction |
| collections.py | 1,391 | Milvus collection schemas and manager |
| knowledge.py | 1,326 | Static knowledge graph (domains, PGx, PhenoAge) |
| export.py | 1,392 | Markdown, JSON, PDF, CSV, FHIR R4 export |
| genotype_adjustment.py | 1,225 | Genotype- and age-stratified reference ranges |
| report_generator.py | 993 | 12-section clinical report generation |
| models.py | 786 | Pydantic models and enums for all data structures |
| agent.py | 610 | Autonomous agent pipeline (plan/analyze/search) |
| rag_engine.py | 573 | Multi-collection RAG engine |
| biological_age.py | 408 | PhenoAge and GrimAge surrogate biological age calculation |
| discordance_detector.py | 299 | Cross-biomarker discordance detection |
| lab_range_interpreter.py | 221 | Standard vs optimal range interpretation |
| translation.py | 217 | Multi-language report translation |
| critical_values.py | 179 | Critical value threshold checking |
| audit.py | 83 | Audit logging for PHI access |
| __init__.py | 1 | Package marker |
| Total | 12,628 | |
API Layer (api/)
| File | Lines | Purpose |
|---|---|---|
| main.py | 465 | FastAPI app, lifespan, middleware, core endpoints |
| routes/analysis.py | ~300 | /v1/analyze, /v1/biological-age, /v1/pgx, /v1/query |
| routes/reports.py | ~250 | /v1/report/generate, /v1/report/{id}/pdf, FHIR export |
| routes/events.py | ~200 | Cross-modal event ingestion and alert dispatch |
Application Layer (app/)
| File | Lines | Purpose |
|---|---|---|
| biomarker_ui.py | 1,863 | Streamlit UI (port 8528) |
| patient_360.py | 670 | Patient 360-degree dashboard |
| protein_viewer.py | 168 | 3D protein structure viewer |
Configuration (config/)
| File | Lines | Purpose |
|---|---|---|
| settings.py | 139 | Pydantic BaseSettings with BIOMARKER_ prefix |
Tests (tests/)
| File | Lines | Test count |
|---|---|---|
| test_edge_cases.py | 972 | 69 |
| test_api.py | 1,080 | 59 |
| test_disease_trajectory.py | 509 | 48 |
| test_export.py | 453 | 46 |
| test_ui.py | 610 | 39 |
| test_models.py | 585 | 39 |
| test_lab_range_interpreter.py | 460 | 37 |
| test_biological_age.py | 406 | 30 |
| test_critical_values.py | 390 | 28 |
| test_pharmacogenomics.py | 380 | 27 |
| test_genotype_adjustment.py | 332 | 26 |
| test_discordance_detector.py | 378 | 25 |
| test_collections.py | 279 | 22 |
| test_report_generator.py | 348 | 21 |
| test_rag_engine.py | 273 | 21 |
| test_integration.py | 540 | 21 |
| test_longitudinal.py | 162 | 18 |
| test_agent.py | 307 | 16 |
| conftest.py | 307 | - |
| Total | 8,772 | 709 |
1.3 Key Dependencies¶
sentence-transformers # BAAI/bge-small-en-v1.5, 384-dim embeddings
pymilvus # Milvus Python SDK
anthropic # Claude API client
fastapi / uvicorn # REST API server
streamlit # Interactive UI
pydantic / pydantic-settings # Configuration and data models
reportlab # PDF report generation (Platypus engine)
loguru # Structured logging
1.4 Port Assignments¶
| Service | Port |
|---|---|
| Streamlit UI | 8528 |
| FastAPI API | 8529 |
| Milvus | 19530 |
Chapter 2: Deep Dive into the RAG Engine¶
File: src/rag_engine.py (573 lines)
2.1 Architecture Overview¶
The BiomarkerRAGEngine class implements a multi-collection Retrieval-Augmented Generation pipeline. It searches across all 14 Milvus collections simultaneously using a ThreadPoolExecutor (delegated to the collection manager), merges results with knowledge graph context, and generates grounded LLM responses via Claude.
User Question
|
v
[1] Embed query (BGE-small-en-v1.5, 384 dims)
|
v
[2] Determine collections to search (14 total, or filtered subset)
|
v
[3] Build per-collection filter expressions
| - Disease area filter (diabetes, cardiovascular, liver, ...)
| - Year range filter (clinical evidence only)
|
v
[4] Parallel search across all collections (ThreadPoolExecutor)
|
v
[5] Deduplicate + Citation scoring + Rank by weighted score
|
v
[6] Knowledge graph augmentation (domains, PGx, PhenoAge, biomarkers)
|
v
CrossCollectionResult (max 30 merged hits)
|
v
[7] Build prompt with evidence, knowledge context, patient profile
|
v
[8] LLM generation (Claude, max_tokens=2048, temperature=0.7)
2.2 The retrieve() Method¶
This is the core retrieval method. It accepts an AgentQuery and returns a CrossCollectionResult:
def retrieve(self, query: AgentQuery,
             top_k_per_collection: int = None,
             collections_filter: List[str] = None,
             year_min: int = None,
             year_max: int = None,
             conversation_context: str = None) -> CrossCollectionResult:
Key parameters:
- top_k_per_collection: Max results per collection. Default: settings.TOP_K_PER_COLLECTION (5).
- collections_filter: Optional list of collection names. If None, searches all 14.
- year_min / year_max: Applied only to biomarker_clinical_evidence via the year field.
- conversation_context: For multi-turn queries; limited to 2,000 chars, prepended to search text.
Step-by-step flow:
- Embed query -- Calls _embed_query(), which prepends the BGE instruction prefix "Represent this sentence for searching relevant passages: " to the question text, then calls embedder.embed_text().
- Build filters -- For collections with has_disease_area: True, detects disease area keywords in the question using _detect_disease_area(). Filter expressions use Milvus boolean syntax (e.g., disease_area == "cardiovascular"). Input is validated with a safe-character regex to prevent injection.
- Parallel search -- Delegates to collections.search_all(), which uses ThreadPoolExecutor. Each collection is searched independently with its own filter expression.
- Merge and rank -- Deduplicates by ID and text prefix (first 200 chars), sorts by weighted score descending, caps at MAX_MERGED_RESULTS = 30.
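The merge-and-rank step can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the Hit dataclass is a hypothetical stand-in for SearchHit, and keeping the highest-scoring copy of a duplicate is an assumption (the real engine may keep the first one it sees).

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_MERGED_RESULTS = 30   # cap quoted in the text
TEXT_PREFIX_CHARS = 200   # dedup prefix length quoted in the text


@dataclass
class Hit:
    """Hypothetical stand-in for the engine's SearchHit."""
    id: str
    collection: str
    text: str
    score: float  # weighted score


def merge_and_rank(hits: Iterable[Hit], limit: int = MAX_MERGED_RESULTS) -> List[Hit]:
    """Drop duplicate IDs and duplicate text prefixes, then sort and cap."""
    seen_ids, seen_prefixes = set(), set()
    unique: List[Hit] = []
    # Visit the best-scoring hits first so the strongest copy of a duplicate survives
    # (an assumption of this sketch).
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        prefix = hit.text[:TEXT_PREFIX_CHARS]
        if hit.id in seen_ids or prefix in seen_prefixes:
            continue
        seen_ids.add(hit.id)
        seen_prefixes.add(prefix)
        unique.append(hit)
    return unique[:limit]


hits = [
    Hit("a1", "biomarker_reference", "Albumin is a liver-synthesized protein.", 0.81),
    Hit("a1", "biomarker_reference", "Albumin is a liver-synthesized protein.", 0.79),
    Hit("b7", "aging_markers", "Albumin is a liver-synthesized protein.", 0.90),
    Hit("c3", "pgx_rules", "CYP2D6 poor metabolizers should avoid codeine.", 0.66),
]
merged = merge_and_rank(hits)
```

Here the repeated ID and the repeated text prefix collapse into a single hit, so only two results remain.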
2.3 Score Weighting Math
Every search hit receives a weighted score that combines the raw cosine similarity with the collection's importance weight:
$$\text{weighted\_score} = \min\bigl(\text{raw\_score} \times (1 + \text{weight}),\ 1.0\bigr)$$
Here weight is the collection-specific weight from settings. The formula gives a bounded boost: a collection with weight 0.12 raises a score by at most 12%, and the min(..., 1.0) clamp keeps the result from exceeding 1.0.
Collection weights (must sum to ~1.0):
| Collection | Weight | Label |
|---|---|---|
| biomarker_reference | 0.12 | BiomarkerRef |
| genetic_variants | 0.11 | GeneticVariant |
| pgx_rules | 0.10 | PGxRule |
| disease_trajectories | 0.10 | DiseaseTrajectory |
| clinical_evidence | 0.09 | ClinicalEvidence |
| genomic_evidence | 0.08 | Genomic |
| drug_interactions | 0.07 | DrugInteraction |
| aging_markers | 0.07 | AgingMarker |
| nutrition | 0.05 | Nutrition |
| genotype_adjustments | 0.05 | GenotypeAdj |
| monitoring | 0.05 | Monitoring |
| critical_values | 0.04 | CriticalValue |
| discordance_rules | 0.04 | DiscordanceRule |
| aj_carrier_screening | 0.03 | AJCarrierScreen |
| Sum | 1.00 | |
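As a small worked example of the weighting, the sketch below applies the boost to two raw scores. It assumes the boost is multiplicative, raw_score * (1 + weight), which is what the "up to 12%" description and the min(..., 1.0) clamp imply; the function name weighted_score is illustrative and not taken from the codebase.

```python
# Weights copied from the table above.
COLLECTION_WEIGHTS = {
    "biomarker_reference": 0.12, "genetic_variants": 0.11, "pgx_rules": 0.10,
    "disease_trajectories": 0.10, "clinical_evidence": 0.09, "genomic_evidence": 0.08,
    "drug_interactions": 0.07, "aging_markers": 0.07, "nutrition": 0.05,
    "genotype_adjustments": 0.05, "monitoring": 0.05, "critical_values": 0.04,
    "discordance_rules": 0.04, "aj_carrier_screening": 0.03,
}


def weighted_score(raw_score: float, weight: float) -> float:
    """Boost a raw similarity by the collection weight, clamped to 1.0 (assumed form)."""
    return min(raw_score * (1.0 + weight), 1.0)


total_weight = sum(COLLECTION_WEIGHTS.values())
boosted = weighted_score(0.80, COLLECTION_WEIGHTS["biomarker_reference"])  # 0.80 * 1.12
clamped = weighted_score(0.95, COLLECTION_WEIGHTS["biomarker_reference"])  # would exceed 1.0
```

A raw score of 0.80 from biomarker_reference becomes 0.896, while 0.95 would exceed 1.0 and is clamped.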
2.4 Citation Relevance Scoring¶
Each hit is tagged with a relevance level based on the raw similarity score before weighting:
if raw_score >= settings.CITATION_HIGH_THRESHOLD:      # 0.75
    relevance = "high"
elif raw_score >= settings.CITATION_MEDIUM_THRESHOLD:  # 0.60
    relevance = "medium"
else:
    relevance = "low"
The relevance tag is injected into the LLM prompt as [high relevance], [medium relevance], or [low relevance] next to each citation. The system prompt instructs the LLM to "prioritize [high relevance] citations."
2.5 The System Prompt¶
The system prompt (BIOMARKER_SYSTEM_PROMPT) is a 40-line instruction set that defines the agent's nine expertise domains:
- Biological Aging (PhenoAge, GrimAge, epigenetic clocks)
- Pre-Symptomatic Disease Detection (trajectories, timelines)
- Pharmacogenomic Drug-Gene Interactions (CPIC, star alleles)
- Genotype-Adjusted Reference Ranges (MTHFR, APOE, PNPLA3, etc.)
- Nutritional Genomics (MTHFR/methylfolate, FADS1/omega-3, VDR/vitamin D)
- Cardiovascular Risk Stratification (Lp(a), ApoB, APOE, PCSK9)
- Liver Health Assessment (PNPLA3 I148M, TM6SF2, FIB-4)
- Iron Metabolism (HFE C282Y/H63D, hemochromatosis)
- Ashkenazi Jewish Carrier Screening (10-gene AJ panel)
It instructs the LLM to cite evidence using collection labels, specify units, provide genotype-specific interpretation, highlight critical findings, and flag cross-modal triggers.
2.6 Prompt Construction¶
The _build_prompt() method assembles the final prompt from four sections:
## Retrieved Evidence
### Evidence from BiomarkerRef
1. [BiomarkerRef:albumin] [high relevance] (score=0.892) ...
### Evidence from ClinicalEvidence
1. [ClinicalEvidence:PMID 29676998](https://pubmed.ncbi.nlm.nih.gov/29676998/) ...
### Knowledge Graph Context
PhenoAge Clock Context: ...
### Patient Profile Context
Age: 45, Sex: M
Biomarkers: albumin: 4.1, creatinine: 0.9, ...
Genotypes: rs1801133: CT, ...
Star Alleles: CYP2D6: *1/*4, ...
---
## Question
What does my HbA1c of 5.8% mean given my TCF7L2 CT genotype?
Please provide a comprehensive answer grounded in the evidence above. ...
Clinical evidence citations include clickable PubMed URLs: [ClinicalEvidence:PMID 29676998](https://pubmed.ncbi.nlm.nih.gov/29676998/).
2.7 Cross-Collection Entity Linking¶
The find_related() method enables cross-collection entity discovery:
engine.find_related("MTHFR")
# Returns: {
# "biomarker_genetic_variants": [SearchHit(...)],
# "biomarker_nutrition": [SearchHit(...)],
# "biomarker_genotype_adjustments": [SearchHit(...)],
# }
This powers queries like "show me everything about MTHFR" or "find all CYP2D6 drug interactions" spanning all 14 collections.
Chapter 3: Vector Search Internals¶
3.1 Index Type: IVF_FLAT¶
All 14 collections use IVF_FLAT (an inverted-file index that stores the raw, unquantized vectors) as the index type. This partitions the vector space into clusters using k-means, then performs exhaustive search within the selected clusters.
- nlist=128: Number of Voronoi cells (clusters). At ingest time, each vector is assigned to the nearest of 128 centroids.
- nprobe=16: At query time, the 16 nearest clusters are searched. Higher nprobe means better recall at the cost of latency.
The recall/latency tradeoff: with nprobe=16 out of nlist=128, roughly 12.5% of the index is scanned. For biomarker collections (hundreds to low thousands of records), this provides near-perfect recall with sub-millisecond search times.
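For reference, these settings would typically be expressed as the parameter dictionaries that pymilvus accepts for index creation and search. The sketch below only builds the dictionaries and computes the scanned fraction; the constant names are illustrative, and the commented calls show where the dictionaries would normally be passed rather than how the collection manager actually does it.

```python
NLIST = 128   # clusters built at index time
NPROBE = 16   # clusters probed per query

# Dict shapes follow the pymilvus convention for index and search parameters.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",
    "params": {"nlist": NLIST},
}
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": NPROBE},
}

# Typical usage (not executed here, requires a running Milvus):
#   collection.create_index(field_name="embedding", index_params=index_params)
#   collection.search(data=[query_vec], anns_field="embedding",
#                     param=search_params, limit=5)

scan_fraction = NPROBE / NLIST
print(f"Fraction of clusters scanned per query: {scan_fraction:.1%}")
```

Raising nprobe toward nlist approaches an exhaustive scan, which for collections this small costs little.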
3.2 Distance Metrics: COSINE vs L2 vs IP¶
The agent uses COSINE similarity as its distance metric:
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| COSINE | cos(A, B) = A . B / (‖A‖ ‖B‖) | [-1, 1], higher is more similar | Normalized embeddings (BGE) |
| L2 | ‖A - B‖_2 | [0, inf), lower is more similar | Raw distance, sensitive to magnitude |
| IP | A . B | (-inf, inf), higher is more similar | Maximizes dot product |
Why COSINE? BGE-small-en-v1.5 produces L2-normalized embeddings, so COSINE and IP are mathematically equivalent. COSINE is chosen because Milvus returns the cosine similarity itself (higher is better), which for these text embeddings falls in practice within [0, 1] and maps naturally to the citation relevance thresholds (0.75 high, 0.60 medium).
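The equivalence claimed above is easy to check numerically. The sketch below uses plain Python and two arbitrary toy vectors (not real BGE embeddings) to show that, once vectors are L2-normalized, cosine similarity equals the inner product and squared L2 distance is a monotone function of it.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 4-dimensional vectors standing in for 384-dimensional embeddings.
a = l2_normalize([0.2, -1.3, 0.7, 2.1])
b = l2_normalize([1.0, 0.4, -0.5, 1.8])

cos_ab = cosine(a, b)
ip_ab = dot(a, b)                                  # inner product
l2_sq = sum((x - y) ** 2 for x, y in zip(a, b))    # squared L2 distance
```

For unit vectors, l2_sq equals 2 - 2 * ip_ab, so all three metrics produce the same ranking.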
3.3 BGE Embedding Model¶
The agent uses BAAI/bge-small-en-v1.5:
- Dimensions: 384
- Model size: ~33M parameters (~130MB)
- Sequence length: 512 tokens max
- Instruction-tuned: Uses the prefix "Represent this sentence for searching relevant passages: " for queries (but not for documents).
# Query embedding (with instruction prefix)
prefix = "Represent this sentence for searching relevant passages: "
query_vec = embedder.embed_text(prefix + "What affects CYP2D6 metabolism?")
# Document embedding (no prefix)
doc_vec = embedder.embed_text("CYP2D6 is a cytochrome P450 enzyme that...")
3.4 Search Parameters¶
The SCORE_THRESHOLD setting (default 0.4) filters out hits below minimum relevance. Any hit with score < 0.4 is discarded before ranking. This prevents low-quality noise from reaching the LLM prompt.
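Taken together with the citation thresholds from section 2.4, the filtering can be sketched as below. The constants mirror the defaults quoted in this guide; the function names are illustrative and not taken from the codebase.

```python
from typing import List, Tuple

SCORE_THRESHOLD = 0.4              # default quoted above
CITATION_HIGH_THRESHOLD = 0.75     # from section 2.4
CITATION_MEDIUM_THRESHOLD = 0.60   # from section 2.4


def tag_relevance(raw_score: float) -> str:
    """Map a raw similarity score to a citation relevance label."""
    if raw_score >= CITATION_HIGH_THRESHOLD:
        return "high"
    if raw_score >= CITATION_MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def filter_and_tag(raw_scores: List[float]) -> List[Tuple[float, str]]:
    """Discard scores below SCORE_THRESHOLD, then label the survivors."""
    return [(s, tag_relevance(s)) for s in raw_scores if s >= SCORE_THRESHOLD]


tagged = filter_and_tag([0.82, 0.61, 0.45, 0.39])
```

The 0.39 hit is dropped before ranking; the 0.45 hit survives but is labelled low relevance.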
3.5 Embedding Pipeline¶
Input Text
|
v
SentenceTransformer("BAAI/bge-small-en-v1.5")
|
v
model.encode(text) --> numpy array (384,)
|
v
.tolist() --> List[float] (384 elements)
|
v
Milvus insert / search
At API startup, the model is loaded once and shared across requests:
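from typing import List
from sentence_transformers import SentenceTransformer

class _Embedder:
    def __init__(self):
        self.model = SentenceTransformer(settings.EMBEDDING_MODEL)  # loaded once

    def embed_text(self, text: str) -> List[float]:
        # encode() returns a (384,) numpy array; Milvus expects a plain list
        return self.model.encode(text).tolist()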
Chapter 4: Adding a New Collection¶
This chapter walks through adding a hypothetical biomarker_microbiome collection in 10 steps.
Step 1: Define the Pydantic Model¶
Add to src/models.py:
class MicrobiomeMarker(BaseModel):
    """Microbiome-biomarker interaction -- maps to biomarker_microbiome collection."""
    id: str = Field(..., max_length=100, description="Unique marker identifier")
    organism: str = Field("", max_length=100, description="Bacterial species/genus")
    biomarker_affected: str = Field("", max_length=100, description="Biomarker name")
    mechanism: str = Field("", max_length=2000, description="Mechanism of action")
    text_chunk: str = Field(..., max_length=3000, description="Text for embedding")
    disease_area: str = Field("", max_length=50, description="Disease area tag")
Step 2: Define the Milvus Schema¶
Add to src/collections.py:
MICROBIOME_FIELDS = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=100),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM),
    FieldSchema(name="organism", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="biomarker_affected", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="mechanism", dtype=DataType.VARCHAR, max_length=2000),
    FieldSchema(name="text_chunk", dtype=DataType.VARCHAR, max_length=3000),
    FieldSchema(name="disease_area", dtype=DataType.VARCHAR, max_length=50),
]
MICROBIOME_SCHEMA = CollectionSchema(
    fields=MICROBIOME_FIELDS,
    description="Microbiome-biomarker interactions",
)
Step 3: Register in BiomarkerCollectionManager¶
In collections.py, add the collection to the schema registry dict, following the existing pattern for _COLLECTION_SCHEMAS.
Then add the collection name to __init__ where collections are listed, and add it to ensure_collections().
Step 4: Add the Weight Setting¶
In config/settings.py, add a WEIGHT_MICROBIOME setting for the new collection.
Then adjust the other weights so the total still sums to ~1.0, and run the _validate_settings model validator to confirm.
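One way to keep the total at ~1.0 is to scale the existing weights down proportionally when the new one is added. The sketch below is only an illustration of that idea: add_weight, validate_weights, the 0.04 weight, and the 0.05 tolerance are all assumptions, and the real _validate_settings validator may check the sum differently.

```python
import math
from typing import Dict

# Current weights as listed in section 2.3 (they sum to 1.00).
WEIGHTS: Dict[str, float] = {
    "biomarker_reference": 0.12, "genetic_variants": 0.11, "pgx_rules": 0.10,
    "disease_trajectories": 0.10, "clinical_evidence": 0.09, "genomic_evidence": 0.08,
    "drug_interactions": 0.07, "aging_markers": 0.07, "nutrition": 0.05,
    "genotype_adjustments": 0.05, "monitoring": 0.05, "critical_values": 0.04,
    "discordance_rules": 0.04, "aj_carrier_screening": 0.03,
}


def add_weight(weights: Dict[str, float], name: str, new_weight: float) -> Dict[str, float]:
    """Return a copy with `name` added and the existing weights scaled down
    proportionally so the overall total is unchanged."""
    if not 0.0 < new_weight < 1.0:
        raise ValueError("new_weight must be between 0 and 1")
    total = sum(weights.values())
    scale = (total - new_weight) / total
    rebalanced = {key: value * scale for key, value in weights.items()}
    rebalanced[name] = new_weight
    return rebalanced


def validate_weights(weights: Dict[str, float], tolerance: float = 0.05) -> None:
    """Raise if the weights do not sum to ~1.0 (tolerance is a hypothetical value)."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"collection weights sum to {total:.3f}, expected ~1.0")


updated = add_weight(WEIGHTS, "biomarker_microbiome", 0.04)
validate_weights(updated)
```

Proportional scaling preserves the relative ordering of the existing collections; hand-tuning individual weights is equally valid as long as the sum check still passes.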
Step 5: Register in COLLECTION_CONFIG¶
In src/rag_engine.py:
"biomarker_microbiome": {
"weight": settings.WEIGHT_MICROBIOME,
"label": "Microbiome",
"has_disease_area": True,
"year_field": None,
},
Step 6: Add the Setting to env_prefix¶
The env var is automatically named BIOMARKER_WEIGHT_MICROBIOME thanks to the env_prefix="BIOMARKER_" in PrecisionBiomarkerSettings.model_config.
Step 7: Create a Seed Script¶
Create scripts/seed_microbiome.py that reads source data, embeds text chunks, and inserts into Milvus:
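from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections
# Connect before touching any collection (host is an assumption; 19530 is the Milvus port from section 1.4)
connections.connect(alias="default", host="localhost", port="19530")
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
records = load_microbiome_data()  # Your data loading function
# Documents are embedded without the query instruction prefix
embeddings = model.encode([r["text_chunk"] for r in records])
collection = Collection("biomarker_microbiome")
# Column order must match the schema field order: id, embedding, organism, ...
collection.insert([
    [r["id"] for r in records],
    embeddings.tolist(),
    [r["organism"] for r in records],
    # ... remaining fields
])
collection.flush()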
Step 8: Update conftest.py¶
Add the new collection to the mock collection manager's collection_names list:
Step 9: Write Tests¶
Create tests/test_microbiome.py following the existing test patterns. Test at minimum:
- Schema creation
- Insert and search round-trip
- Weight application in RAG engine
- Disease area filtering
Step 10: Verify End-to-End¶
# Start Milvus
docker compose up -d milvus-standalone
# Seed the new collection
python scripts/seed_microbiome.py
# Run tests
pytest tests/test_microbiome.py -v
# Verify via API
curl http://localhost:8529/collections | python -m json.tool
Chapter 5: The Pharmacogenomics Engine Deep Dive¶
File: src/pharmacogenomics.py (1,503 lines)
5.1 Architecture¶
The PharmacogenomicMapper class implements a pure-computation engine that maps star allele diplotypes to metabolizer phenotypes and drug-specific dosing recommendations. It requires no LLM calls or database queries -- all knowledge is embedded in the PGX_GENE_CONFIGS dictionary.
5.2 The Fourteen Pharmacogenes¶
| Gene | Role | CPIC Level | Key Drugs |
|---|---|---|---|
| CYP2D6 | Metabolizes ~25% of drugs | 1A | Codeine, tramadol, tamoxifen |
| CYP2C19 | Proton pump inhibitors, antiplatelets | 1A | Clopidogrel, omeprazole, voriconazole |
| CYP2C9 | NSAIDs, warfarin metabolism | 1A | Warfarin, celecoxib, phenytoin |
| CYP3A5 | Immunosuppressant metabolism | 1A | Tacrolimus |
| SLCO1B1 | Hepatic drug transporter | 1A | Simvastatin, atorvastatin |
| VKORC1 | Warfarin target sensitivity | 1A | Warfarin |
| MTHFR | Folate metabolism enzyme | Info | Methotrexate (adjunctive) |
| TPMT | Thiopurine metabolism | 1A | Azathioprine, 6-mercaptopurine |
| DPYD | Fluoropyrimidine metabolism | 1A | 5-FU, capecitabine |
5.3 Star Allele to Phenotype Mapping¶
Each gene has an allele_to_phenotype dictionary that maps diplotype strings to phenotype labels:
PGX_GENE_CONFIGS = {
    "CYP2D6": {
        "display_name": "CYP2D6",
        "description": "Cytochrome P450 2D6 -- metabolizes ~25% of drugs",
        "allele_to_phenotype": {
            "*1/*1": "Normal Metabolizer",
            "*1/*4": "Intermediate Metabolizer",
            "*4/*4": "Poor Metabolizer",
            "*1/*1xN": "Ultra-rapid Metabolizer",
            # ... 16 diplotype combinations
        },
        "drug_recommendations": { ... },
    },
    # ... 13 more genes
}
5.4 Metabolizer Phenotype Classification¶
The MetabolizerPhenotype enum defines four standard CPIC categories:
| Phenotype | Enum Value | Clinical Meaning |
|---|---|---|
| Ultra-rapid | ultra_rapid | Excess enzyme activity; rapid drug clearance |
| Normal | normal | Standard enzyme activity; use standard dosing |
| Intermediate | intermediate | Reduced activity; consider dose adjustment |
| Poor | poor | Minimal/no activity; avoid or reduce dose |
Non-CYP genes use specialized terminology:
- SLCO1B1: Normal Function / Intermediate Function / Poor Function (transporter activity)
- VKORC1: Normal Sensitivity / Intermediate Sensitivity / High Sensitivity (drug target)
- MTHFR: Normal Activity / Intermediate Activity / Reduced Activity (enzyme activity)
5.5 Drug-Specific Dosing Recommendations¶
Each drug entry maps every possible phenotype to a structured recommendation:
"codeine": {
    "Poor Metabolizer": {
        "recommendation": "AVOID codeine -- no conversion to morphine, will be ineffective.",
        "action": "AVOID",
        "alert_level": "CRITICAL",
    },
    "Ultra-rapid Metabolizer": {
        "recommendation": "AVOID codeine -- excess conversion to morphine, "
                          "risk of fatal respiratory depression.",
        "action": "AVOID",
        "alert_level": "CRITICAL",
    },
}
Action categories:
- STANDARD_DOSING -- No change needed
- DOSE_REDUCTION -- Reduce dose per recommendation
- DOSE_ADJUSTMENT -- Adjust dose (up or down)
- CONSIDER_ALTERNATIVE -- Current drug may work but alternative preferred
- AVOID -- Do not use this drug
- CONTRAINDICATED -- Absolute contraindication (FDA/EMA mandated)
Alert levels: INFO, WARNING, CRITICAL
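The two-level lookup described in sections 5.3 and 5.5 (diplotype to phenotype, then phenotype to drug recommendation) can be sketched as below. The config is trimmed to the entries quoted in this guide; the helper functions are hypothetical, and retrying with the alleles swapped is an assumption about how unordered diplotypes might be handled, not documented behaviour of PharmacogenomicMapper.

```python
from typing import Optional

# Trimmed copy of the structure shown above, limited to entries quoted in this guide.
PGX_GENE_CONFIGS = {
    "CYP2D6": {
        "allele_to_phenotype": {
            "*1/*1": "Normal Metabolizer",
            "*1/*4": "Intermediate Metabolizer",
            "*4/*4": "Poor Metabolizer",
            "*1/*1xN": "Ultra-rapid Metabolizer",
        },
        "drug_recommendations": {
            "codeine": {
                "Poor Metabolizer": {"action": "AVOID", "alert_level": "CRITICAL"},
                "Ultra-rapid Metabolizer": {"action": "AVOID", "alert_level": "CRITICAL"},
            },
        },
    },
}


def lookup_phenotype(gene: str, diplotype: str) -> Optional[str]:
    """Return the phenotype label, trying the reversed allele order as a fallback."""
    table = PGX_GENE_CONFIGS.get(gene, {}).get("allele_to_phenotype", {})
    if diplotype in table:
        return table[diplotype]
    left, sep, right = diplotype.partition("/")
    return table.get(f"{right}/{left}") if sep else None


def lookup_recommendation(gene: str, diplotype: str, drug: str) -> Optional[dict]:
    """Return the drug entry for the patient's phenotype, or None if nothing is listed."""
    phenotype = lookup_phenotype(gene, diplotype)
    if phenotype is None:
        return None
    drugs = PGX_GENE_CONFIGS[gene]["drug_recommendations"]
    return drugs.get(drug, {}).get(phenotype)


rec = lookup_recommendation("CYP2D6", "*4/*4", "codeine")
```

Returning None for unknown diplotypes keeps the caller in charge of how to report an unmapped result.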
5.6 CPIC Level Evidence¶
Every gene entry includes version tracking:
CPIC_GUIDELINE_VERSIONS = {
"CYP2D6": {"version": "2019", "pmid": "33387367", "update": "2020-12", "level": "1A"},
"CYP2C19": {"version": "2022", "pmid": "34697867", "update": "2022-12", "level": "1A"},
# ...
}
5.7 The map_all() Method¶
pgx_mapper = PharmacogenomicMapper()
results = pgx_mapper.map_all(
    star_alleles={"CYP2D6": "*4/*4", "CYP2C19": "*1/*2"},
    genotypes={"rs1801133": "CT"},  # MTHFR
)
# Returns: {
#   "gene_results": [
#     {"gene": "CYP2D6", "star_alleles": "*4/*4",
#      "phenotype": "Poor Metabolizer", "affected_drugs": [...]},
#     {"gene": "CYP2C19", "star_alleles": "*1/*2",
#      "phenotype": "Intermediate Metabolizer", "affected_drugs": [...]},
#     {"gene": "MTHFR", "genotype": "CT",
#      "phenotype": "Intermediate Activity", "affected_drugs": [...]},
#   ]
# }
5.8 Adding a New Gene¶
To add a new pharmacogene (e.g., NAT2):
- Add CPIC version info to CPIC_GUIDELINE_VERSIONS.
- Add the full gene config to PGX_GENE_CONFIGS with allele_to_phenotype and drug_recommendations.
- Add test cases to tests/test_pharmacogenomics.py.
- The gene is automatically picked up by map_all().
Chapter 6: Biological Age Algorithms¶
File: src/biological_age.py (408 lines)
6.1 PhenoAge (Levine 2018)¶
PhenoAge estimates biological age from 9 routine blood biomarkers using a Gompertz mortality model trained on NHANES III data.
Reference: Levine et al., "An epigenetic biomarker of aging for lifespan and healthspan", Aging 2018; 10(4):573-591. PMID: 29676998.
6.2 The Nine Biomarkers and Coefficients¶
| Biomarker | Coefficient | Direction | Units (input) | Units (SI) |
|---|---|---|---|---|
| Albumin | -0.0336 | Protective | g/dL | g/L |
| Creatinine | 0.0095 | Aging | mg/dL | umol/L |
| Glucose | 0.1953 | Aging | mg/dL | mmol/L |
| ln(CRP) | 0.0954 | Aging | mg/L (ln) | ln(mg/L) |
| Lymphocyte % | -0.0120 | Protective | % | % |
| MCV | 0.0268 | Aging | fL | fL |
| RDW | 0.3306 | Aging | % | % |
| Alkaline Phosphatase | 0.0019 | Aging | U/L | U/L |
| WBC | 0.0554 | Aging | 10^3/uL | 10^3/uL |
Intercept: -19.9067
Chronological age coefficient: 0.0804
6.3 Unit Conversion¶
The module accepts standard US clinical units and converts internally:
UNIT_CONVERSIONS = {
"albumin": 10.0, # g/dL -> g/L (multiply by 10)
"creatinine": 88.4, # mg/dL -> umol/L (multiply by 88.4)
"glucose": 1 / 18.016, # mg/dL -> mmol/L (divide by 18.016)
}
Other biomarkers (lymphocyte %, MCV, RDW, alkaline phosphatase, WBC) use the same units in US and SI systems.
6.4 The PhenoAge Formula¶
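Step 1: Compute the linear predictor (xb)
$$xb = -19.9067 + \sum_{i=1}^{9} \beta_i x_i + 0.0804 \times \text{chronological\_age}$$
Here each beta_i is a coefficient from the table in 6.2 and x_i is the corresponding biomarker value after unit conversion.
Step 2: Compute mortality score via Gompertz model
$$\text{mortality\_score} = 1 - \exp\!\left(\frac{\text{MORT\_NUMERATOR} \cdot e^{xb}}{\text{MORT\_DENOMINATOR}}\right)$$
Where:
- MORT_NUMERATOR = -1.51714 (derived from -(exp(120 * gamma) - 1))
- MORT_DENOMINATOR = 0.007692696 (Gompertz shape parameter gamma)
Step 3: Convert mortality score to biological age
$$\text{biological\_age} = \frac{\ln\bigl(\text{BA\_NUMERATOR} \cdot \ln(1 - \text{mortality\_score})\bigr)}{\text{BA\_DENOMINATOR}} + \text{BA\_INTERCEPT}$$
Where:
- BA_NUMERATOR = -0.0055305
- BA_DENOMINATOR = 0.09165
- BA_INTERCEPT = 141.50225
Step 4: Age acceleration
$$\text{age\_acceleration} = \text{biological\_age} - \text{chronological\_age}$$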
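The four steps can be sketched end to end using only the coefficients and constants listed in this chapter. This is a minimal illustration rather than the module's implementation: the dictionary keys and the phenoage function are assumptions, the mortality step assumes the standard Gompertz form implied by the two MORT constants, and CRP is log-transformed in mg/L as stated in the coefficient table.

```python
import math
from typing import Dict

INTERCEPT = -19.9067
AGE_COEF = 0.0804
COEFFICIENTS = {
    "albumin": -0.0336, "creatinine": 0.0095, "glucose": 0.1953, "ln_crp": 0.0954,
    "lymphocyte_pct": -0.0120, "mcv": 0.0268, "rdw": 0.3306,
    "alkaline_phosphatase": 0.0019, "wbc": 0.0554,
}
UNIT_CONVERSIONS = {"albumin": 10.0, "creatinine": 88.4, "glucose": 1 / 18.016}
MORT_NUMERATOR, MORT_DENOMINATOR = -1.51714, 0.007692696
BA_NUMERATOR, BA_DENOMINATOR, BA_INTERCEPT = -0.0055305, 0.09165, 141.50225


def phenoage(chronological_age: float, labs: Dict[str, float]) -> Dict[str, float]:
    """Compute PhenoAge from US-unit labs (hypothetical helper, all nine markers required)."""
    # Convert US units to the units the coefficients expect; CRP is log-transformed.
    values = {k: v * UNIT_CONVERSIONS.get(k, 1.0) for k, v in labs.items() if k != "hs_crp"}
    values["ln_crp"] = math.log(labs["hs_crp"])
    # Step 1: linear predictor
    xb = INTERCEPT + AGE_COEF * chronological_age
    xb += sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)
    # Step 2: Gompertz mortality score
    mortality = 1.0 - math.exp(MORT_NUMERATOR * math.exp(xb) / MORT_DENOMINATOR)
    # Step 3: biological age
    bio_age = math.log(BA_NUMERATOR * math.log(1.0 - mortality)) / BA_DENOMINATOR + BA_INTERCEPT
    # Step 4: age acceleration
    return {
        "biological_age": bio_age,
        "age_acceleration": bio_age - chronological_age,
        "mortality_score": mortality,
    }


LABS = {
    "albumin": 4.1, "creatinine": 0.9, "glucose": 95, "hs_crp": 1.2,
    "lymphocyte_pct": 30, "mcv": 89, "rdw": 13.5,
    "alkaline_phosphatase": 65, "wbc": 6.5,
}
result = phenoage(45, LABS)
```

For extreme inputs the mortality score can round to 1.0 and make the inner logarithm undefined, so a production version would need to guard that case.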
6.5 Confidence Intervals¶
Standard error depends on biomarker completeness:
- All 9 biomarkers available: SE = 4.9 years (from NHANES III validation)
- Fewer than 9 biomarkers: SE = 6.5 years (increased uncertainty)
95% CI: biological_age +/- 1.96 * SE
6.6 Risk Classification¶
| Age Acceleration | Risk Level | Meaning |
|---|---|---|
| > +5 years | HIGH | Significantly accelerated aging |
| > +2 to +5 years | MODERATE | Mildly accelerated aging |
| -2 to +2 years | NORMAL | Aging at expected rate |
| < -2 years | LOW | Aging slower than expected |
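The confidence interval from 6.5 and the classification table above can be combined in a few lines. The function names are illustrative, and the treatment of the exact boundaries (+2, +5, and -2 years) follows a literal reading of the table, which is an assumption about the module's behaviour.

```python
from typing import Tuple

SE_COMPLETE = 4.9   # all 9 biomarkers available
SE_PARTIAL = 6.5    # fewer than 9 biomarkers


def confidence_interval(biological_age: float, n_biomarkers: int) -> Tuple[float, float]:
    """95% CI: biological_age +/- 1.96 * SE."""
    se = SE_COMPLETE if n_biomarkers >= 9 else SE_PARTIAL
    margin = 1.96 * se
    return (biological_age - margin, biological_age + margin)


def classify_acceleration(age_acceleration: float) -> str:
    """Map age acceleration (years) to the risk levels in the table."""
    if age_acceleration > 5:
        return "HIGH"
    if age_acceleration > 2:
        return "MODERATE"
    if age_acceleration < -2:
        return "LOW"
    return "NORMAL"


low, high = confidence_interval(50.0, n_biomarkers=9)
level = classify_acceleration(3.4)
```

With all nine biomarkers the interval spans roughly 9.6 years on either side, which is wider than every risk band; the classification is best read as a coarse signal.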
6.7 GrimAge Surrogate Estimation¶
True GrimAge requires DNA methylation data. This module provides a surrogate estimate using plasma proteins that correlate with DNAm GrimAge components (r-squared = 0.72, Hillary et al. 2020, PMID: 32941527).
Six plasma protein markers:
| Marker | Weight | Unit | Ref Max |
|---|---|---|---|
| GDF-15 | 0.15 | pg/mL | 1,200 |
| Cystatin C | 0.12 | mg/L | 1.0 |
| PAI-1 | 0.10 | ng/mL | 43.0 |
| ADM | 0.11 | pmol/L | 50.0 |
| TIMP-1 | 0.09 | ng/mL | 250.0 |
| Leptin | 0.08 | ng/mL | 15.0 |
Surrogate formula:
deviation_i = (value_i - ref_max_i) / ref_max_i
weighted_deviation = SUM(weight_i * deviation_i) / SUM(weight_i)
estimated_acceleration = weighted_deviation * 10.0 # empirical scale factor
grimage_score = chronological_age + estimated_acceleration
Validation: SE = 5.8 years, from Lothian Birth Cohort 1936 (n=906).
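A direct transcription of the surrogate formula might look like the sketch below. The weights and reference maxima come from the table above; the keys gdf15 and cystatin_c match the usage example in 6.8, while pai1, adm, timp1, and leptin are assumed key names. Normalizing by the weights of only the markers that are present is also an assumption about how missing markers are handled.

```python
from typing import Dict, Optional

# marker key -> (weight, reference maximum)
GRIMAGE_MARKERS = {
    "gdf15": (0.15, 1200.0),
    "cystatin_c": (0.12, 1.0),
    "pai1": (0.10, 43.0),
    "adm": (0.11, 50.0),
    "timp1": (0.09, 250.0),
    "leptin": (0.08, 15.0),
}
SCALE_FACTOR = 10.0  # empirical scale factor from the formula above


def grimage_surrogate(chronological_age: float,
                      markers: Dict[str, float]) -> Optional[Dict[str, float]]:
    """Estimate GrimAge from whichever surrogate markers are present."""
    available = {k: v for k, v in markers.items() if k in GRIMAGE_MARKERS}
    if not available:
        return None
    weight_sum = sum(GRIMAGE_MARKERS[k][0] for k in available)
    weighted_deviation = sum(
        GRIMAGE_MARKERS[k][0] * (v - GRIMAGE_MARKERS[k][1]) / GRIMAGE_MARKERS[k][1]
        for k, v in available.items()
    ) / weight_sum
    acceleration = weighted_deviation * SCALE_FACTOR
    return {
        "estimated_acceleration": acceleration,
        "grimage_score": chronological_age + acceleration,
    }


estimate = grimage_surrogate(45, {"gdf15": 800, "cystatin_c": 0.85})
```

Both markers in this example sit below their reference maxima, so the estimated acceleration is negative and the surrogate score falls below the chronological age.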
6.8 Code Example: Full Calculation¶
from src.biological_age import BiologicalAgeCalculator
calc = BiologicalAgeCalculator()
result = calc.calculate(
    chronological_age=45,
    biomarkers={
        "albumin": 4.1,              # g/dL
        "creatinine": 0.9,           # mg/dL
        "glucose": 95,               # mg/dL
        "hs_crp": 1.2,               # mg/L (auto-converted to ln_crp)
        "lymphocyte_pct": 30,        # %
        "mcv": 89,                   # fL
        "rdw": 13.5,                 # %
        "alkaline_phosphatase": 65,  # U/L
        "wbc": 6.5,                  # 10^3/uL
        # GrimAge surrogate markers
        "gdf15": 800,                # pg/mL
        "cystatin_c": 0.85,          # mg/L
    },
)
print(f"PhenoAge: {result['biological_age']}")
print(f"Acceleration: {result['age_acceleration']:+.1f} years")
print(f"GrimAge: {result['grimage']['grimage_score']}")
Chapter 7: Disease Trajectory Prediction¶
File: src/disease_trajectory.py (1,421 lines)
7.1 Overview¶
The DiseaseTrajectoryAnalyzer detects pre-symptomatic disease trajectories across 9 disease categories using genotype-stratified biomarker thresholds. It identifies patients on a trajectory toward clinical disease years before conventional diagnosis, enabling early intervention.
7.2 The Nine Disease Categories¶
| Category | Display Name | Key Biomarkers |
|---|---|---|
| type2_diabetes | Type 2 Diabetes | HbA1c, fasting glucose, fasting insulin, HOMA-IR |
| cardiovascular | Cardiovascular Disease | Lp(a), LDL-C, ApoB, hs-CRP, TC, HDL-C, TG |
| liver | Liver Disease (NAFLD/Fibrosis) | ALT, AST, GGT, ferritin, platelets, albumin |
| thyroid | Thyroid Dysfunction | TSH, free T4, free T3 |
| iron | Iron Metabolism Disorder | Ferritin, transferrin saturation, serum iron, TIBC |
| nutritional | Nutritional Deficiency | Omega-3 index, vitamin D, B12, folate, Mg, Zn, Se |
| kidney | Kidney Disease | Creatinine, eGFR, BUN, albumin, cystatin C |
| bone_health | Bone Health | Vitamin D, calcium, PTH, phosphorus |
| cognitive | Cognitive Decline | Homocysteine, B12, folate, hs-CRP, HbA1c |
7.3 Genetic Modifiers¶
Each disease category includes genetic modifiers that shift risk thresholds:
"type2_diabetes": {
"genetic_modifiers": {
"TCF7L2_rs7903146": {"risk_allele": "T", "effect": "beta_cell_dysfunction"},
"PPARG_rs1801282": {"risk_allele": "C", "effect": "insulin_sensitivity"},
"SLC30A8_rs13266634": {"risk_allele": "C", "effect": "zinc_transport"},
"KCNJ11_rs5219": {"risk_allele": "T", "effect": "potassium_channel"},
"GCKR_rs780094": {"risk_allele": "T", "effect": "glucokinase_regulation"},
},
}
When a patient carries a risk allele, the biomarker thresholds shift -- for example, an HbA1c of 5.7% might be classified as "pre-diabetic" for a TCF7L2 TT carrier but "early metabolic shift" for a CC carrier.
7.4 Progression Staging¶
Each disease has defined stages representing the trajectory from healthy to clinical disease:
Type 2 Diabetes: normal -> early_metabolic_shift -> insulin_resistance -> pre_diabetic -> diabetic
Cardiovascular: optimal -> borderline -> elevated_risk -> high_risk
Liver: normal -> steatosis_risk -> early_fibrosis -> advanced_fibrosis
Thyroid: euthyroid -> subclinical -> overt_dysfunction
Iron: normal -> early_accumulation -> iron_overload
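Because each trajectory is an ordered list, stage handling reduces to list indexing. The sketch below encodes the five progressions listed above; PROGRESSION_STAGES, stage_position, and next_stage are illustrative names, not necessarily how disease_trajectory.py organizes this data.

```python
from typing import Dict, List, Optional, Tuple

PROGRESSION_STAGES: Dict[str, List[str]] = {
    "type2_diabetes": ["normal", "early_metabolic_shift", "insulin_resistance",
                       "pre_diabetic", "diabetic"],
    "cardiovascular": ["optimal", "borderline", "elevated_risk", "high_risk"],
    "liver": ["normal", "steatosis_risk", "early_fibrosis", "advanced_fibrosis"],
    "thyroid": ["euthyroid", "subclinical", "overt_dysfunction"],
    "iron": ["normal", "early_accumulation", "iron_overload"],
}


def stage_position(disease: str, stage: str) -> Tuple[int, int]:
    """Return (zero-based index of the stage, total number of stages)."""
    stages = PROGRESSION_STAGES[disease]
    return stages.index(stage), len(stages)


def next_stage(disease: str, stage: str) -> Optional[str]:
    """Return the stage that follows `stage`, or None at the end of the trajectory."""
    index, total = stage_position(disease, stage)
    return PROGRESSION_STAGES[disease][index + 1] if index + 1 < total else None


upcoming = next_stage("type2_diabetes", "early_metabolic_shift")
```

An unknown disease or stage raises KeyError or ValueError here, which is usually preferable to silently returning a default in clinical code.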
7.5 Risk Score Formula¶
The disease trajectory engine computes a composite risk score for each disease category:
- Biomarker deviation score: For each relevant biomarker, calculate deviation from normal range, weighted by clinical importance.
- Genetic risk multiplier: If the patient carries risk alleles, multiply the base risk by a gene-specific factor (typically 1.2x to 2.0x per risk allele).
- Age/sex adjustment: Age and sex modifiers shift thresholds based on epidemiological data.
- Composite score: Weighted combination mapped to risk levels (NORMAL, LOW, MODERATE, HIGH, CRITICAL).
7.6 The analyze_all() Method¶
analyzer = DiseaseTrajectoryAnalyzer()
trajectories = analyzer.analyze_all(
    biomarkers={"hba1c": 5.8, "fasting_glucose": 105, "fasting_insulin": 12},
    genotypes={"TCF7L2_rs7903146": "CT"},
    age=45,
    sex="M",
)
# Returns: [
#   {
#     "disease": "type2_diabetes",
#     "risk_level": "MODERATE",
#     "current_stage": "early_metabolic_shift",
#     "current_markers": {"hba1c": 5.8, "fasting_glucose": 105},
#     "genetic_risk_factors": [
#       {"gene": "TCF7L2", "genotype": "CT", "effect": "beta_cell_dysfunction"}
#     ],
#     "years_to_onset_estimate": 8.5,
#     "recommendations": ["Monitor HbA1c every 3 months", "Consider metformin discussion"],
#   },
#   # ... results for other disease categories
# ]
7.7 Years-to-Onset Estimation¶
The engine estimates time to clinical onset based on current biomarker levels, rate of change (if longitudinal data available), and genetic risk factors. This is a rough estimate intended to motivate preventive action, not a precise prediction.
Chapter 8: Genotype-Based Reference Ranges¶
File: src/genotype_adjustment.py (1,225 lines)
8.1 Why Genotype-Adjusted Ranges?¶
Standard laboratory reference ranges are population averages. Genetic variants can significantly alter what is "normal" for an individual. For example:
- MTHFR C677T (rs1801133): Homozygous TT carriers have about 70% reduced enzyme activity, leading to elevated homocysteine. A homocysteine of 12 umol/L is "normal" by standard ranges but may be pathological for a TT carrier.
- APOE E4: Carriers have naturally higher LDL-C and respond differently to statin therapy.
- PNPLA3 I148M: GG homozygotes have 3x higher risk of NAFLD; their ALT reference range should be tighter.
8.2 Core Architecture¶
The GenotypeAdjuster class:
- Looks up the patient's genotype for each biomarker-gene pair in GENOTYPE_THRESHOLDS (from knowledge.py).
- Applies a genotype-specific multiplier or offset to the standard reference range.
- Returns both the standard and adjusted ranges for comparison.
8.3 Ancestry-Specific Adjustments¶
The apply_ancestry_adjustments() method modifies reference ranges based on reported ancestry using data from NHANES III, UK Biobank, and MESA studies. For example:
- eGFR uses the CKD-EPI 2021 equation without race adjustment (PMID: 34554658)
- Vitamin D reference ranges differ by latitude and melanin-mediated synthesis
- Hemoglobin/hematocrit have ancestry-specific normal ranges
8.4 Age-Stratified Reference Ranges¶
Five age brackets with sex-specific ranges:
| Age Range (years) | Bracket | Source Studies |
|---|---|---|
| 0-17 | Pediatric | Pediatric guidelines |
| 18-39 | Young adult | NHANES III, Framingham |
| 40-59 | Middle age | NHANES III, Framingham |
| 60-79 | Older adult | KDIGO 2012, ATA 2017, ACC/AHA 2019 |
| 80+ | Elderly | Geriatric-specific guidelines |
Example for creatinine:
"creatinine": {
"18-39": {
"M": {"low": 0.7, "high": 1.2, "note": "Standard adult male range."},
"F": {"low": 0.5, "high": 1.0, "note": "Standard adult female range."},
},
"60-79": {
"M": {"low": 0.8, "high": 1.4, "note": "Higher normal; age-related GFR decline."},
"F": {"low": 0.6, "high": 1.2, "note": "Higher normal; age-related GFR decline."},
},
}
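Selecting the right range is a two-step lookup: map the age to a bracket, then index by sex. The sketch below uses only the creatinine entries quoted above; age_bracket and creatinine_range are hypothetical helpers, and returning None for brackets not shown in this excerpt is a choice made for the example.

```python
from typing import Dict, Optional

# Only the two brackets quoted in this guide are included.
CREATININE_RANGES: Dict[str, Dict[str, Dict[str, float]]] = {
    "18-39": {"M": {"low": 0.7, "high": 1.2}, "F": {"low": 0.5, "high": 1.0}},
    "60-79": {"M": {"low": 0.8, "high": 1.4}, "F": {"low": 0.6, "high": 1.2}},
}


def age_bracket(age: float) -> str:
    """Map an age in years to one of the five bracket labels."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 60:
        return "40-59"
    if age < 80:
        return "60-79"
    return "80+"


def creatinine_range(age: float, sex: str) -> Optional[Dict[str, float]]:
    """Return the age- and sex-specific range, or None if it is not in this excerpt."""
    return CREATININE_RANGES.get(age_bracket(age), {}).get(sex)


older_female = creatinine_range(65, "F")
```

Comparing against upper bounds rather than inclusive integer ranges means fractional ages such as 17.5 still land in a bracket.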
8.5 Carrier Screening Integration¶
For Ashkenazi Jewish patients, the adjuster integrates carrier screening results for compound risk assessment. For example, GBA heterozygous carriers with APOE E4 have a synergistic increase in Parkinson's disease risk.
8.6 The adjust_all() Method¶
adjuster = GenotypeAdjuster()
result = adjuster.adjust_all(
    biomarkers={"homocysteine": 12.0, "ldl_c": 145},
    genotypes={"rs1801133": "TT", "APOE": "E3/E4"},
)
# Returns: {
#   "adjustments": [
#     {
#       "biomarker": "homocysteine",
#       "standard_range": {"lower": 5.0, "upper": 15.0},
#       "adjusted_range": {"lower": 5.0, "upper": 10.0},
#       "unit": "umol/L",
#       "gene_display_name": "MTHFR",
#       "genotype_value": "TT",
#       "rationale": "MTHFR 677TT reduces enzyme activity by ~70%; ..."
#     },
#   ]
# }
Chapter 9: Clinical Intelligence Modules¶
Three small, focused modules that provide clinical decision support.
9.1 Critical Values Engine¶
File: src/critical_values.py (179 lines)
The CriticalValueEngine checks biomarker values against life-threatening thresholds that require immediate clinical action. These are distinct from standard reference ranges -- a critical value means "call the physician now."
engine = CriticalValueEngine()
alerts = engine.check({
"potassium": 6.5, # Critical high (normal: 3.5-5.0)
"glucose": 35, # Critical low (hypoglycemia)
"sodium": 118, # Critical low (severe hyponatremia)
})
# Returns: [
# CriticalValueAlert(biomarker="potassium", value=6.5,
# threshold_type="HIGH", message="CRITICAL: Potassium 6.5 mEq/L ...")
# ]
9.2 Discordance Detector¶
File: src/discordance_detector.py (299 lines)
The DiscordanceDetector identifies contradictions between related biomarkers that suggest a hidden condition or lab error. It implements clinically validated discordance patterns.
Example discordance patterns:
| Pattern | Biomarkers | Clinical Implication |
|---|---|---|
| LDL/ApoB discordance | LDL-C low, ApoB high | Small dense LDL particles; higher risk |
| Ferritin/iron discordance | Ferritin high, iron low | Inflammation masking iron deficiency |
| TSH/T4 discordance | TSH normal, T4 low | Central hypothyroidism |
| AST/ALT ratio | AST >> ALT | Alcoholic vs non-alcoholic liver disease |
detector = DiscordanceDetector()
discordances = detector.check({
    "ldl_c": 95,   # Appears normal
    "apob": 130,   # Elevated (discordant with the normal-looking LDL-C)
    "lpa": 85,     # Elevated Lp(a)
})
# Returns discordance alerts highlighting the LDL/ApoB mismatch
9.3 Lab Range Interpreter¶
File: src/lab_range_interpreter.py (221 lines)
The LabRangeInterpreter distinguishes between standard reference ranges (what labs report as "normal") and optimal ranges (what evidence suggests is ideal for health). Many biomarkers have a significant gap between "not flagged by the lab" and "truly optimal."
Example:
| Biomarker | Standard Range | Optimal Range | Gap |
|---|---|---|---|
| Vitamin D | 30-100 ng/mL | 40-60 ng/mL | 30-39 is "normal" but suboptimal |
| Ferritin (M) | 12-300 ng/mL | 40-150 ng/mL | 12-39 is "normal" but low-optimal |
| TSH | 0.45-4.5 mIU/L | 1.0-2.5 mIU/L | 2.5-4.5 is subclinical territory |
interpreter = LabRangeInterpreter()
discrepancies = interpreter.get_discrepancies(
biomarkers={"vitamin_d": 32, "tsh": 3.8},
sex="F",
)
# Returns comparisons showing that both values are within standard range
# but outside optimal range, with interpretation context.
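The standard-versus-optimal comparison reduces to two interval checks. The sketch below uses the ranges from the table above; classify and its three labels are hypothetical, and the real get_discrepancies() return structure is not shown in this guide.

```python
from typing import Tuple

Range = Tuple[float, float]


def classify(value: float, standard: Range, optimal: Range) -> str:
    """Label a value relative to the lab's standard range and the optimal range."""
    s_low, s_high = standard
    o_low, o_high = optimal
    if value < s_low or value > s_high:
        return "out_of_range"          # The lab itself would flag this value
    if o_low <= value <= o_high:
        return "optimal"
    return "in_range_suboptimal"       # Not flagged by the lab, but outside optimal


# Values from the example call above
print(classify(32, (30, 100), (40, 60)))        # vitamin_d
print(classify(3.8, (0.45, 4.5), (1.0, 2.5)))   # tsh
```

Both example values land in the "in_range_suboptimal" bucket, which is the gap the interpreter is designed to surface.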
Chapter 10: Export System Deep Dive¶
Files: src/export.py (1,392 lines) + src/report_generator.py (993 lines)
10.1 Export Formats¶
The export system produces five output formats from the same analysis result:
| Format | Function | Use Case |
|---|---|---|
| Markdown | export_markdown() | Human-readable reports |
| JSON | export_json() | Machine-readable structured data |
| PDF | export_pdf() | Clinical reports via ReportLab |
| CSV | export_csv() | Spreadsheet analysis |
| FHIR R4 | export_fhir_diagnostic_report() | EHR integration |
10.2 The 12-Section Report¶
The ReportGenerator class produces a structured clinical report:
| Section | Title | Content |
|---|---|---|
| 1 | Biological Age Assessment | PhenoAge, GrimAge, acceleration, drivers |
| 2 | Executive Findings | Top 5 critical/high-priority findings |
| 3 | Biomarker-Gene Correlation Map | Which genes affect which biomarkers |
| 4 | Disease Trajectory Analysis | Risk for all 9 disease categories |
| 5 | Pharmacogenomic Profile | All PGx results with drug recommendations |
| 6 | Nutritional Analysis | Genotype-aware nutrition assessment |
| 7 | Interconnected Pathways | Cross-domain pathway connections |
| 8 | Prioritized Action Plan | Ranked interventions by urgency |
| 9 | Monitoring Schedule | Follow-up testing timeline |
| 10 | Supplement Protocol Summary | Genotype-guided supplement suggestions |
| 11 | Clinical Summary for MD | Concise physician-oriented summary |
| 12 | References | CPIC, PMID citations, data sources |
10.3 PDF Generation via ReportLab¶
PDF reports use ReportLab's Platypus layout engine:
import io

from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer, Table

def export_pdf(query, response_text, evidence=None, analysis=None):
    # Render into an in-memory buffer so the caller receives raw PDF bytes
    buffer = io.BytesIO()
    doc = SimpleDocTemplate(buffer, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []
    story.append(Paragraph("Biomarker Intelligence Report", styles["Title"]))
    # ... build story elements for each section
    doc.build(story)
    return buffer.getvalue()
10.4 FHIR R4 DiagnosticReport¶
The export_fhir_diagnostic_report() function produces a FHIR R4 Bundle containing:
- DiagnosticReport -- The overall report resource with status, code, and conclusion.
- Observation resources -- One per biomarker result, with value, unit, reference range, and interpretation code.
- Bundle wrapper -- Transaction bundle for EHR submission.
fhir_bundle = export_fhir_diagnostic_report(
patient_id="patient-001",
analysis=analysis_result,
practitioner_id="dr-smith-001",
)
# Returns: {
# "resourceType": "Bundle",
# "type": "transaction",
# "entry": [
# {"resource": {"resourceType": "DiagnosticReport", ...}},
# {"resource": {"resourceType": "Observation", ...}},
# ...
# ]
# }
10.5 Timestamped Filenames¶
Exported filenames combine a timestamp with a UUID suffix to prevent collisions.
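One way to build such a name is sketched below. The exact pattern used by export.py is not shown in this guide, so the function name, the timestamp format, and the 8-character suffix length are all assumptions.

```python
import uuid
from datetime import datetime, timezone


def timestamped_filename(prefix: str, extension: str) -> str:
    """Build '<prefix>_<UTC timestamp>_<short uuid>.<extension>' (hypothetical format)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]  # Random suffix separates exports made in the same second
    return f"{prefix}_{stamp}_{suffix}.{extension}"


print(timestamped_filename("biomarker-report", "pdf"))
```

The timestamp keeps files sortable by creation time, while the random suffix covers two exports that land within the same second.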
Chapter 11: Testing Strategies¶
11.1 Test Suite Overview¶
The test suite contains 18 test files with 709 tests total. All tests run without external dependencies (Milvus, Claude API) thanks to comprehensive mocking.
Test distribution by file:
| File | Tests | Focus |
|---|---|---|
| test_edge_cases.py | 69 | Boundary values, malformed inputs, overflow |
| test_api.py | 59 | FastAPI endpoint testing via TestClient |
| test_disease_trajectory.py | 48 | All 9 disease categories, staging |
| test_export.py | 46 | All 5 export formats, content validation |
| test_ui.py | 39 | Streamlit component rendering |
| test_models.py | 39 | Pydantic model validation, serialization |
| test_lab_range_interpreter.py | 37 | Standard vs optimal range comparisons |
| test_biological_age.py | 30 | PhenoAge formula, GrimAge, edge cases |
| test_critical_values.py | 28 | Critical threshold alerts |
| test_pharmacogenomics.py | 27 | Star allele mapping, drug recommendations |
| test_genotype_adjustment.py | 26 | Genotype and age adjustments |
| test_discordance_detector.py | 25 | Biomarker discordance patterns |
| test_collections.py | 22 | Schema creation, insert, search |
| test_report_generator.py | 21 | 12-section report structure |
| test_rag_engine.py | 21 | RAG pipeline, scoring, prompt building |
| test_integration.py | 21 | End-to-end agent pipeline |
| test_longitudinal.py | 18 | Longitudinal biomarker tracking |
| test_agent.py | 16 | Agent planning, analysis, synthesis |
11.2 Mock Patterns from conftest.py¶
The conftest.py provides three core fixtures used across all tests:
Mock Embedder:
from unittest.mock import MagicMock

import pytest

@pytest.fixture
def mock_embedder():
    """Return a mock embedder that produces 384-dim zero vectors."""
    embedder = MagicMock()
    embedder.embed_text.return_value = [0.0] * 384
    return embedder
Mock LLM Client:
@pytest.fixture
def mock_llm_client():
    """Return a mock LLM client that always responds with 'Mock response'."""
    client = MagicMock()
    client.generate.return_value = "Mock response"
    client.generate_stream.return_value = iter(["Mock ", "response"])
    return client
Mock Collection Manager:
@pytest.fixture
def mock_collection_manager():
    # collection_names is assumed to be defined elsewhere in conftest.py
    # (the list of collection names); it is not shown in this excerpt.
    manager = MagicMock()
    manager.search_all.return_value = {name: [] for name in collection_names}
    manager.get_collection_stats.return_value = {name: 42 for name in collection_names}
    return manager
All 14 collections are present in the mock to ensure COLLECTION_CONFIG lookups succeed.
11.3 Sample Patient Profile Fixture¶
@pytest.fixture
def sample_patient():
    return PatientProfile(
        patient_id="TEST-001",
        age=45,
        sex="M",
        biomarkers={
            "albumin": 4.1, "creatinine": 0.9, "glucose": 95,
            "hs_crp": 1.2, "lymphocyte_pct": 30, "mcv": 89,
            "rdw": 13.5, "alkaline_phosphatase": 65, "wbc": 6.5,
        },
        genotypes={"rs1801133": "CT", "APOE": "E3/E4"},
        star_alleles={"CYP2D6": "*1/*4", "CYP2C19": "*1/*2"},
    )
11.4 Testing Pure Computation Modules¶
Modules like biological_age.py, pharmacogenomics.py, disease_trajectory.py, and genotype_adjustment.py are pure computation -- no I/O, no mocking needed:
def test_phenoage_known_values():
    calc = BiologicalAgeCalculator()
    result = calc.calculate_phenoage(
        chronological_age=50,
        biomarkers={
            "albumin": 4.0, "creatinine": 1.0, "glucose": 100,
            "hs_crp": 2.0, "lymphocyte_pct": 28, "mcv": 90,
            "rdw": 14.0, "alkaline_phosphatase": 70, "wbc": 7.0,
        },
    )
    assert 40 < result["biological_age"] < 70
    assert "mortality_score" in result
    assert len(result["top_aging_drivers"]) <= 5
11.5 Testing the API¶
API tests use FastAPI's TestClient:
from fastapi.testclient import TestClient
from api.main import app

client = TestClient(app)

def test_health_endpoint():
    response = client.get("/health")
    assert response.status_code == 200
    data = response.json()
    assert "status" in data
    assert "collections" in data
11.6 Running Tests¶
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=src --cov-report=html
# Run a specific module
pytest tests/test_biological_age.py -v
# Run tests matching a keyword
pytest tests/ -k "phenoage" -v
Chapter 12: The Autonomous Agent Pipeline¶
File: src/agent.py (610 lines)
12.1 Agent Architecture¶
The PrecisionBiomarkerAgent implements the plan -> analyze -> search -> synthesize -> report pattern. It wraps the multi-collection RAG engine with the four core analysis modules, the three clinical intelligence modules from Chapter 9, and reasoning capabilities.
Question + PatientProfile
|
v
[Phase 1] analyze_patient() -- Run all 7 analysis modules (4 core + 3 clinical intelligence)
| - BiologicalAgeCalculator
| - DiseaseTrajectoryAnalyzer
| - PharmacogenomicMapper
| - GenotypeAdjuster
| - CriticalValueEngine
| - DiscordanceDetector
| - LabRangeInterpreter
|
v
[Phase 2] search_plan() -- Determine search strategy
|
v
[Phase 3] rag_engine.retrieve() -- Multi-collection vector search
|
v
[Phase 4] evaluate_evidence() -- Quality check
|
v
[Phase 5] Sub-question expansion (if evidence insufficient)
|
v
[Phase 6] _build_enhanced_prompt() -- Combine evidence + analysis
|
v
[Phase 7] LLM generation -- Claude response
|
v
AgentResponse (answer, evidence, analysis, alerts)
12.2 The SearchPlan Dataclass¶
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchPlan:
    question: str
    identified_topics: List[str] = field(default_factory=list)
    disease_areas: List[str] = field(default_factory=list)
    relevant_modules: List[str] = field(default_factory=list)
    search_strategy: str = "broad"  # broad, targeted, domain-specific
    sub_questions: List[str] = field(default_factory=list)
12.3 Strategy Selection¶
The agent selects a search strategy based on the question content:
| Strategy | Condition | Behavior |
|---|---|---|
| domain-specific | Single disease area, 0-1 analysis modules | Focused collection subset |
| targeted | Specific analysis modules identified | Module-guided search |
| broad | No specific domain or module detected | Search all 14 collections |
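The table can be read as a small decision function. The sketch below is one interpretation: select_strategy is a hypothetical name, and checking the domain-specific condition before the targeted one is an assumption about precedence that the table does not state.

```python
from typing import List


def select_strategy(disease_areas: List[str], relevant_modules: List[str]) -> str:
    """Pick a search strategy from the topics detected in the question."""
    # Single disease area with at most one module: narrow to that domain's collections.
    if len(disease_areas) == 1 and len(relevant_modules) <= 1:
        return "domain-specific"
    # Otherwise, any identified module guides the search.
    if relevant_modules:
        return "targeted"
    # Nothing specific detected: search every collection.
    return "broad"


print(select_strategy(["cardiovascular"], []))
print(select_strategy([], ["pharmacogenomics", "biological_age"]))
print(select_strategy([], []))
```

The result would populate SearchPlan.search_strategy, whose default of "broad" matches the fall-through case here.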
12.4 Sub-Question Decomposition¶
Complex questions are decomposed into sub-questions:
- "Why is X elevated?" generates:
  - "What genetic variants cause elevated biomarker levels?"
  - "What lifestyle factors contribute to elevated biomarker levels?"
  - "What medications affect biomarker levels?"
- "Compare X vs Y" generates:
  - "What are the differences in clinical interpretation?"
  - "What are the genotype-specific considerations?"
- "What supplements/treatments for X?" generates:
  - "What are the evidence-based interventions for this condition?"
  - "What genetic factors affect treatment response?"
12.5 Evidence Quality Evaluation¶
def evaluate_evidence(self, evidence: CrossCollectionResult) -> str:
    if evidence.hit_count == 0:
        return "insufficient"
    collections_with_hits = len(evidence.hits_by_collection())
    if collections_with_hits >= 3 and evidence.hit_count >= 10:
        return "sufficient"
    elif collections_with_hits >= 2 and evidence.hit_count >= 5:
        return "partial"
    else:
        return "insufficient"
When evidence is "insufficient" and sub-questions exist, the agent runs up to 2 additional retrieval passes with decomposed sub-questions and merges the results.
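The expansion loop described above can be sketched in isolation. Evidence is a simplified, hypothetical stand-in for CrossCollectionResult, retrieve is any callable that returns evidence for a question, and the merge-by-hit-id logic is an assumption; only the thresholds and the limit of two extra passes come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_EXTRA_PASSES = 2  # "up to 2 additional retrieval passes"


@dataclass
class Evidence:
    """Simplified stand-in for CrossCollectionResult: collection name -> hit ids."""
    hits: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def hit_count(self) -> int:
        return sum(len(ids) for ids in self.hits.values())

    def merge(self, other: "Evidence") -> None:
        # Append hits from another pass, skipping ids already present.
        for name, ids in other.hits.items():
            bucket = self.hits.setdefault(name, [])
            for hit_id in ids:
                if hit_id not in bucket:
                    bucket.append(hit_id)


def evaluate(evidence: Evidence) -> str:
    collections_with_hits = sum(1 for ids in evidence.hits.values() if ids)
    if collections_with_hits >= 3 and evidence.hit_count >= 10:
        return "sufficient"
    if collections_with_hits >= 2 and evidence.hit_count >= 5:
        return "partial"
    return "insufficient"


def retrieve_with_expansion(question: str, sub_questions: List[str],
                            retrieve: Callable[[str], Evidence]) -> Evidence:
    evidence = retrieve(question)
    for sub_q in sub_questions[:MAX_EXTRA_PASSES]:
        if evaluate(evidence) != "insufficient":
            break  # Expansion only runs while evidence is still insufficient
        evidence.merge(retrieve(sub_q))
    return evidence
```

Stopping as soon as the evidence is no longer "insufficient" keeps the extra latency bounded to the cases that need it.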
12.6 Critical Alert Extraction¶
The agent extracts critical alerts from analysis results:
- Biological age acceleration > 5 years: "CRITICAL: Biological age acceleration of X years..."
- Disease trajectory at HIGH/CRITICAL: "HIGH RISK: cardiovascular trajectory at high level..."
- DPYD poor/intermediate metabolizer: "CRITICAL PGx: DPYD -- fluoropyrimidine toxicity risk..."
- CYP2D6 ultra-rapid: "PGx ALERT: CYP2D6 -- avoid codeine/tramadol..."
- CYP2C19 poor/intermediate: "PGx ALERT: CYP2C19 -- clopidogrel may be ineffective..."
- Critical value thresholds: From CriticalValueEngine
- Biomarker discordances: From DiscordanceDetector
- Optimization opportunities: From LabRangeInterpreter
- Age-adjusted flags: From GenotypeAdjuster.apply_age_adjustments()
12.7 Full Usage Example¶
from src.agent import PrecisionBiomarkerAgent
from src.models import PatientProfile

# `engine` is an already-initialized BiomarkerRAGEngine (see the startup sequence in 13.6)
agent = PrecisionBiomarkerAgent(rag_engine=engine)
profile = PatientProfile(
    patient_id="PAT-001",
    age=52,
    sex="M",
    biomarkers={
        "albumin": 3.8, "creatinine": 1.1, "glucose": 112,
        "hs_crp": 3.5, "hba1c": 5.9, "ldl_c": 155,
        "apob": 135, "lpa": 85,
    },
    genotypes={
        "TCF7L2_rs7903146": "CT",
        "APOE": "E3/E4",
        "rs1801133": "CT",
    },
    star_alleles={
        "CYP2D6": "*1/*4",
        "CYP2C19": "*1/*2",
    },
)
response = agent.run(
    question="Assess my cardiovascular and metabolic risk profile",
    patient_profile=profile,
)
print(response.answer)
print(f"Critical alerts: {len(response.critical_alerts)}")
print(f"PGx results: {len(response.pgx_results)}")
print(f"Bio age: {response.biological_age.biological_age:.1f}")
Chapter 13: Production Deployment¶
13.1 Docker Multi-Stage Build¶
The Dockerfile uses a two-stage build to minimize image size:
Stage 1 (builder): Installs build tools (gcc, g++) and compiles Python dependencies into a virtual environment at /opt/venv.
Stage 2 (runtime): Copies only the compiled venv and application source. Runs as non-root user biomarkeruser.
# Stage 1: Build dependencies
FROM python:3.10-slim AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y build-essential gcc g++ ...
COPY requirements.txt .
RUN python -m venv /opt/venv && /opt/venv/bin/pip install -r requirements.txt
# Stage 2: Runtime
FROM python:3.10-slim
COPY --from=builder /opt/venv /opt/venv
COPY src/ api/ app/ config/ scripts/ data/ /app/
RUN useradd -r -s /bin/false biomarkeruser
USER biomarkeruser
EXPOSE 8528 8529
13.2 Compose Topology¶
The agent runs alongside the HCLS AI Factory services in docker-compose.dgx-spark.yml:
biomarker-agent:
build: ./ai_agent_adds/precision_biomarker_agent
ports:
- "8528:8528" # Streamlit UI
- "8529:8529" # FastAPI API
environment:
- BIOMARKER_MILVUS_HOST=milvus-standalone
- BIOMARKER_MILVUS_PORT=19530
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
- milvus-standalone
- etcd
- minio
healthcheck:
test: ["CMD", "python", "-c",
"import urllib.request; urllib.request.urlopen('http://localhost:8528/health')"]
interval: 30s
timeout: 10s
start_period: 60s
retries: 3
13.3 Health Checks¶
The API provides health checks at two levels:
GET /health -- Returns collection count, total vector count, and agent readiness.
Docker HEALTHCHECK -- Uses Python's urllib (no curl dependency) to probe the Streamlit health endpoint every 30 seconds.
13.4 Prometheus Monitoring¶
The GET /metrics endpoint exposes Prometheus-compatible counters:
biomarker_api_requests_total 1234
biomarker_api_query_requests_total 567
biomarker_api_analyze_requests_total 89
biomarker_api_errors_total 3
biomarker_collection_vectors{collection="biomarker_reference"} 150
biomarker_collection_vectors{collection="biomarker_genetic_variants"} 320
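Output in this shape can be produced with plain string formatting. The sketch below is an illustration only: render_metrics is a hypothetical helper, and the guide does not say whether the API builds this text by hand or through a Prometheus client library.

```python
from typing import Dict


def render_metrics(counters: Dict[str, int], collection_vectors: Dict[str, int]) -> str:
    """Render counters and per-collection gauges as Prometheus-style plain text."""
    lines = [f"{name} {value}" for name, value in counters.items()]
    for collection, count in collection_vectors.items():
        # One labeled sample per collection
        lines.append(f'biomarker_collection_vectors{{collection="{collection}"}} {count}')
    return "\n".join(lines) + "\n"


print(render_metrics(
    {"biomarker_api_requests_total": 1234, "biomarker_api_errors_total": 3},
    {"biomarker_reference": 150},
))
```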
13.5 Security Considerations¶
- API Key Authentication: When BIOMARKER_API_KEY is set, all endpoints (except /health and /metrics) require an X-API-Key header.
- Request Size Limiting: Middleware rejects requests exceeding BIOMARKER_MAX_REQUEST_SIZE_MB (default 10 MB).
- CORS: Restricted to configured origins (default: localhost ports 8080, 8528, 8529).
- Non-root container: Runtime user is biomarkeruser with no shell access.
- Input sanitization: Milvus filter expressions are validated with a safe-character regex (^[A-Za-z0-9 _\-]+$) to prevent injection.
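The input-sanitization rule can be illustrated with the regex quoted above. build_filter is a hypothetical helper; how the codebase actually assembles its Milvus filter expressions is not shown in this guide.

```python
import re

# Safe-character pattern quoted in the security list above
SAFE_FILTER_VALUE = re.compile(r"^[A-Za-z0-9 _\-]+$")


def build_filter(field_name: str, value: str) -> str:
    """Return a Milvus-style equality filter, rejecting values with unsafe characters."""
    # fullmatch also rejects a trailing newline, which `$` alone would tolerate.
    if not SAFE_FILTER_VALUE.fullmatch(value):
        raise ValueError(f"unsafe filter value: {value!r}")
    return f'{field_name} == "{value}"'


print(build_filter("disease_area", "cardiovascular"))

try:
    build_filter("disease_area", 'x" or id != "')
except ValueError as exc:
    print(exc)
```

Because quotes, parentheses, and operators are outside the allowed character set, a user-supplied value cannot close the string literal and append its own conditions.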
13.6 Startup Sequence¶
1. Connect to Milvus (host:port from settings)
2. Load SentenceTransformer model (BAAI/bge-small-en-v1.5)
3. Initialize Anthropic Claude client
4. Load knowledge module (static knowledge graph)
5. Initialize analysis modules (BiologicalAgeCalculator, etc.)
6. Build BiomarkerRAGEngine
7. Build PrecisionBiomarkerAgent
8. Store references on app.state for route access
9. Start accepting requests
13.7 Graceful Shutdown¶
On SIGTERM/SIGINT, the lifespan context manager disconnects from Milvus:
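from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # ... startup code ...
    yield
    # Shutdown: release the Milvus connection if startup created one
    if _manager:
        _manager.disconnect()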
Chapter 14: Future Architecture¶
14.1 Multi-Agent Coordination¶
The cross-modal event system (api/routes/events.py) is the foundation for multi-agent communication:
- Biomarker -> Imaging Agent: Elevated Lp(a) triggers coronary calcium scoring recommendation.
- Biomarker -> CAR-T/Oncology Agent: DPYD poor metabolizer PGx alert forwarded to oncology pipeline.
- Imaging Agent -> Biomarker Agent: Imaging findings trigger biomarker panel recommendations.
- Biomarker -> Genomics Pipeline: Unexpected biomarker patterns trigger VCF re-analysis.
Current implementation uses in-memory event stores. Production would use a message bus (NATS, Kafka, or Redis Streams).
14.2 Streaming Biomarker Ingestion¶
Real-time biomarker ingestion from wearables and continuous monitors:
- CGM (continuous glucose monitoring) data -> real-time trajectory updates
- Wearable HRV and resting heart rate -> cardiovascular risk refinement
- Event-driven re-analysis when new data arrives
14.3 Fine-Tuned Embeddings¶
The current BGE-small-en-v1.5 model is general-purpose. Domain-specific fine-tuning opportunities:
- Fine-tune on ClinVar/PharmGKB/CPIC corpus for better biomedical retrieval
- Contrastive learning on biomarker-gene-drug triplets
- Matryoshka Representation Learning for variable-dimension embeddings (128/256/384)
14.4 Longitudinal Analysis¶
Extending the agent to track biomarker trajectories over time:
- Trend detection (improving/worsening/stable) across multiple lab draws
- Velocity-based risk prediction (rate of change matters more than absolute value)
- Intervention effectiveness monitoring (did the supplement protocol work?)
14.5 Federated Learning¶
Privacy-preserving model improvement across institutions:
- Differential privacy for PhenoAge coefficient refinement
- Federated fine-tuning of the embedding model
- Secure aggregation of trajectory risk models
Appendix A: Complete API Reference¶
Root Endpoints¶
GET /¶
Returns service info. No authentication required.
Response:
GET /health¶
Response (200):
Response (503): Milvus unavailable.
GET /collections¶
Response (200):
{
"collections": [
{"name": "biomarker_reference", "record_count": 150},
{"name": "biomarker_genetic_variants", "record_count": 320}
],
"total": 14
}
GET /knowledge/stats¶
Response (200):
{
"disease_domains": 6,
"total_biomarkers": 45,
"total_genetic_modifiers": 28,
"pharmacogenes": 14,
"pgx_drug_interactions": 35,
"phenoage_markers": 9,
"cross_modal_links": 12
}
GET /metrics¶
Returns Prometheus-formatted plain text with counters and gauges.
Analysis Endpoints (/v1)¶
POST /v1/analyze¶
Full patient analysis (all modules).
Request:
{
"patient_id": "PAT-001",
"age": 45,
"sex": "M",
"biomarkers": {"albumin": 4.1, "creatinine": 0.9, "glucose": 95},
"genotypes": {"rs1801133": "CT"},
"star_alleles": {"CYP2D6": "*1/*4"}
}
Response (200):
{
"biological_age": {"chronological_age": 45, "biological_age": 43.2, "age_acceleration": -1.8},
"disease_trajectories": [{"disease": "diabetes", "risk_level": "low", "current_stage": "normal"}],
"pgx_results": [{"gene": "CYP2D6", "phenotype": "intermediate", "drugs_affected": [...]}],
"genotype_adjustments": [{"biomarker": "homocysteine", "standard_range": "5-15", "adjusted_range": "5-12"}],
"critical_alerts": []
}
POST /v1/biological-age¶
Biological age calculation only.
Request:
{
"age": 50,
"biomarkers": {
"albumin": 4.0, "creatinine": 1.0, "glucose": 100,
"hs_crp": 2.0, "lymphocyte_pct": 28, "mcv": 90,
"rdw": 14.0, "alkaline_phosphatase": 70, "wbc": 7.0
}
}
Response (200):
{
"chronological_age": 50,
"biological_age": 48.3,
"age_acceleration": -1.7,
"mortality_score": 0.023456,
"mortality_risk": "NORMAL",
"confidence_interval": {"lower": 38.7, "upper": 57.9},
"top_aging_drivers": [...]
}
POST /v1/disease-risk¶
Disease trajectory analysis.
Request:
{
"age": 45,
"sex": "M",
"biomarkers": {"hba1c": 5.8, "fasting_glucose": 105},
"genotypes": {"TCF7L2_rs7903146": "CT"}
}
Response (200): List of disease trajectory results across all 9 categories.
POST /v1/pgx¶
Pharmacogenomic mapping.
Request:
Response (200):
{
"gene_results": [
{
"gene": "CYP2D6",
"star_alleles": "*4/*4",
"phenotype": "Poor Metabolizer",
"affected_drugs": [
{"drug": "codeine", "recommendation": "AVOID codeine...", "action": "AVOID", "alert_level": "CRITICAL"}
]
}
]
}
POST /v1/query¶
RAG Q&A query with optional patient profile.
Request:
{
"question": "What does my HbA1c of 5.8% mean?",
"patient_profile": {
"patient_id": "PAT-001",
"age": 45,
"sex": "M",
"biomarkers": {"hba1c": 5.8},
"genotypes": {"TCF7L2_rs7903146": "CT"},
"star_alleles": {}
}
}
Response (200):
{
"answer": "Based on the evidence...",
"evidence": {"query": "...", "hits": [...], "total_collections_searched": 14},
"search_time_ms": 234.5
}
GET /v1/health¶
V1-specific health check.
Report Endpoints (/v1/report)¶
POST /v1/report/generate¶
Generate a full 12-section patient report.
Request: Same as /v1/analyze.
Response (200):
{
"report_id": "rpt-a1b2c3d4",
"generated_at": "2026-03-11T14:30:25Z",
"markdown": "# Biomarker Intelligence Report\n\n...",
"analysis_summary": {...}
}
GET /v1/report/{report_id}/pdf¶
Download a previously generated report as PDF.
Response (200): application/pdf binary stream.
POST /v1/report/fhir¶
Export analysis as FHIR R4 DiagnosticReport Bundle.
Response (200): FHIR R4 JSON Bundle.
Event Endpoints (/v1/events)¶
POST /v1/events/inbound¶
Receive cross-modal event from another agent.
Request:
{
"source_agent": "imaging_intelligence_agent",
"event_type": "imaging_finding",
"payload": {"finding": "coronary calcification", "severity": "moderate"},
"patient_id": "PAT-001"
}
GET /v1/events/outbound¶
Retrieve pending outbound alerts for other agents.
POST /v1/events/alert¶
Send a biomarker alert to the platform event bus.
Appendix B: Configuration Reference¶
All settings use the BIOMARKER_ prefix and are defined in config/settings.py via Pydantic BaseSettings. They can be set via environment variables or .env file.
Path Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_DATA_DIR | Path | <project_root>/data | Data directory |
| BIOMARKER_CACHE_DIR | Path | <project_root>/data/cache | Cache directory |
| BIOMARKER_REFERENCE_DIR | Path | <project_root>/data/reference | Reference data directory |
Milvus Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_MILVUS_HOST | str | localhost | Milvus server hostname |
| BIOMARKER_MILVUS_PORT | int | 19530 | Milvus server port |
| BIOMARKER_MILVUS_TIMEOUT_SECONDS | int | 10 | Milvus operation timeout |
Embedding Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_EMBEDDING_MODEL | str | BAAI/bge-small-en-v1.5 | Sentence Transformer model |
| BIOMARKER_EMBEDDING_DIMENSION | int | 384 | Embedding vector size |
| BIOMARKER_EMBEDDING_BATCH_SIZE | int | 32 | Batch size for encoding |
LLM Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_LLM_PROVIDER | str | anthropic | LLM provider name |
| BIOMARKER_LLM_MODEL | str | claude-sonnet-4-6 | Model ID |
| BIOMARKER_ANTHROPIC_API_KEY | str | None | Anthropic API key |
| BIOMARKER_LLM_MAX_RETRIES | int | 3 | Max retry attempts |
RAG Search Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_TOP_K_PER_COLLECTION | int | 5 | Max results per collection |
| BIOMARKER_SCORE_THRESHOLD | float | 0.4 | Minimum similarity score |
| BIOMARKER_CITATION_HIGH_THRESHOLD | float | 0.75 | Score threshold for "high" relevance |
| BIOMARKER_CITATION_MEDIUM_THRESHOLD | float | 0.60 | Score threshold for "medium" relevance |
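The three thresholds partition similarity scores into bands. The sketch below uses the defaults from the table; the function names, the "low" label for scores under the medium threshold, and treating each threshold as inclusive are assumptions.

```python
from typing import List, Tuple

SCORE_THRESHOLD = 0.4             # BIOMARKER_SCORE_THRESHOLD default
CITATION_HIGH_THRESHOLD = 0.75    # BIOMARKER_CITATION_HIGH_THRESHOLD default
CITATION_MEDIUM_THRESHOLD = 0.60  # BIOMARKER_CITATION_MEDIUM_THRESHOLD default


def relevance_label(score: float) -> str:
    """Map a similarity score to a citation relevance band."""
    if score >= CITATION_HIGH_THRESHOLD:
        return "high"
    if score >= CITATION_MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def label_hits(scores: List[float]) -> List[Tuple[float, str]]:
    """Drop hits below the minimum score, then label the rest."""
    return [(s, relevance_label(s)) for s in scores if s >= SCORE_THRESHOLD]


print(label_hits([0.82, 0.61, 0.45, 0.30]))
```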
Collection Weight Settings¶
| Env Var | Type | Default | Collection |
|---|---|---|---|
| BIOMARKER_WEIGHT_BIOMARKER_REF | float | 0.12 | biomarker_reference |
| BIOMARKER_WEIGHT_GENETIC_VARIANTS | float | 0.11 | biomarker_genetic_variants |
| BIOMARKER_WEIGHT_PGX_RULES | float | 0.10 | biomarker_pgx_rules |
| BIOMARKER_WEIGHT_DISEASE_TRAJECTORIES | float | 0.10 | biomarker_disease_trajectories |
| BIOMARKER_WEIGHT_CLINICAL_EVIDENCE | float | 0.09 | biomarker_clinical_evidence |
| BIOMARKER_WEIGHT_GENOMIC_EVIDENCE | float | 0.08 | genomic_evidence |
| BIOMARKER_WEIGHT_DRUG_INTERACTIONS | float | 0.07 | biomarker_drug_interactions |
| BIOMARKER_WEIGHT_AGING_MARKERS | float | 0.07 | biomarker_aging_markers |
| BIOMARKER_WEIGHT_NUTRITION | float | 0.05 | biomarker_nutrition |
| BIOMARKER_WEIGHT_GENOTYPE_ADJUSTMENTS | float | 0.05 | biomarker_genotype_adjustments |
| BIOMARKER_WEIGHT_MONITORING | float | 0.05 | biomarker_monitoring |
| BIOMARKER_WEIGHT_CRITICAL_VALUES | float | 0.04 | biomarker_critical_values |
| BIOMARKER_WEIGHT_DISCORDANCE_RULES | float | 0.04 | biomarker_discordance_rules |
| BIOMARKER_WEIGHT_AJ_CARRIER_SCREENING | float | 0.03 | biomarker_aj_carrier_screening |
Weights are validated at startup to sum to ~1.0 (tolerance: +/- 0.05).
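The startup check can be approximated as below. The default weights and the tolerance come from this appendix; validate_weights is a hypothetical name, and the actual validator in config/settings.py may be structured differently (for example as a Pydantic validator).

```python
import math
from typing import Dict

# Default weights from the table above, keyed by collection name
DEFAULT_WEIGHTS: Dict[str, float] = {
    "biomarker_reference": 0.12,
    "biomarker_genetic_variants": 0.11,
    "biomarker_pgx_rules": 0.10,
    "biomarker_disease_trajectories": 0.10,
    "biomarker_clinical_evidence": 0.09,
    "genomic_evidence": 0.08,
    "biomarker_drug_interactions": 0.07,
    "biomarker_aging_markers": 0.07,
    "biomarker_nutrition": 0.05,
    "biomarker_genotype_adjustments": 0.05,
    "biomarker_monitoring": 0.05,
    "biomarker_critical_values": 0.04,
    "biomarker_discordance_rules": 0.04,
    "biomarker_aj_carrier_screening": 0.03,
}


def validate_weights(weights: Dict[str, float], tolerance: float = 0.05) -> float:
    """Return the weight total, raising if it strays from 1.0 by more than the tolerance."""
    total = math.fsum(weights.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"collection weights sum to {total:.3f}, expected 1.0 +/- {tolerance}")
    return total


print(round(validate_weights(DEFAULT_WEIGHTS), 6))
```

The tolerance means that overriding a single weight by a small amount through an environment variable does not block startup, while a large imbalance fails fast.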
Server Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_API_HOST | str | 0.0.0.0 | API bind address |
| BIOMARKER_API_PORT | int | 8529 | API port |
| BIOMARKER_STREAMLIT_PORT | int | 8528 | Streamlit UI port |
| BIOMARKER_METRICS_ENABLED | bool | true | Enable Prometheus metrics |
| BIOMARKER_CORS_ORIGINS | str | http://localhost:8080,... | Comma-separated CORS origins |
| BIOMARKER_MAX_REQUEST_SIZE_MB | int | 10 | Max request body size (MB) |
| BIOMARKER_REQUEST_TIMEOUT_SECONDS | int | 60 | Request timeout |
Authentication Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_API_KEY | str | "" | API key; empty disables auth |
Conversation Settings¶
| Env Var | Type | Default | Description |
|---|---|---|---|
| BIOMARKER_MAX_CONVERSATION_CONTEXT | int | 3 | Max conversation turns in memory |
Appendix C: Collection Schema Reference¶
All collections use IVF_FLAT index with COSINE metric and 384-dimensional FLOAT_VECTOR embeddings from BAAI/bge-small-en-v1.5.
1. biomarker_reference¶
Reference biomarker definitions, ranges, and clinical significance.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique biomarker identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| name | VARCHAR | 100 | Biomarker display name |
| unit | VARCHAR | 20 | Measurement unit (e.g., mg/dL) |
| category | VARCHAR | 30 | CBC, CMP, Lipids, Thyroid, etc. |
| ref_range_min | FLOAT | -- | Standard reference range lower bound |
| ref_range_max | FLOAT | -- | Standard reference range upper bound |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
| clinical_significance | VARCHAR | 2000 | Clinical interpretation |
| epigenetic_clock | VARCHAR | 50 | PhenoAge/GrimAge coefficient if applicable |
| genetic_modifiers | VARCHAR | 500 | Comma-separated modifier genes |
2. biomarker_genetic_variants¶
Genetic variants affecting biomarker levels and disease risk.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique variant identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| gene | VARCHAR | 50 | Gene symbol (e.g., MTHFR) |
| rs_id | VARCHAR | 20 | dbSNP rsID (e.g., rs1801133) |
| risk_allele | VARCHAR | 20 | Risk allele |
| protective_allele | VARCHAR | 5 | Protective allele |
| effect_size | VARCHAR | 250 | Effect size description |
| mechanism | VARCHAR | 2000 | Molecular mechanism |
| disease_associations | VARCHAR | 1000 | Comma-separated disease associations |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
3. biomarker_pgx_rules¶
Pharmacogenomic dosing rules following CPIC guidelines.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique PGx rule identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| gene | VARCHAR | 50 | Pharmacogene (e.g., CYP2D6) |
| star_alleles | VARCHAR | 100 | Star allele combination (e.g., *1/*2) |
| drug | VARCHAR | 100 | Drug name |
| phenotype | VARCHAR | 30 | Metabolizer phenotype |
| cpic_level | VARCHAR | 10 | CPIC evidence level (1A, 1B, 2A, etc.) |
| recommendation | VARCHAR | 2000 | Clinical recommendation text |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
4. biomarker_disease_trajectories¶
Disease progression trajectory definitions and staging criteria.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique trajectory identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| disease | VARCHAR | 50 | Disease category |
| disease_area | VARCHAR | 50 | Disease area for filtering |
| stage | VARCHAR | 50 | Progression stage |
| biomarker_pattern | VARCHAR | 2000 | Biomarker criteria for this stage |
| genetic_modifiers | VARCHAR | 500 | Genetic modifiers affecting trajectory |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
5. biomarker_clinical_evidence¶
Published clinical evidence with PubMed linkage.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique evidence identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| title | VARCHAR | 500 | Publication title |
| authors | VARCHAR | 500 | Author list |
| year | INT64 | -- | Publication year (used for date filters) |
| pmid | VARCHAR | 20 | PubMed ID |
| disease_area | VARCHAR | 50 | Disease area for filtering |
| evidence_level | VARCHAR | 20 | Evidence level classification |
| text_chunk | VARCHAR | 3000 | Abstract/summary for embedding |
| text_summary | VARCHAR | 2000 | Concise summary |
6. biomarker_nutrition¶
Genotype-aware nutritional guidance.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique guideline identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| nutrient | VARCHAR | 100 | Nutrient name |
| gene | VARCHAR | 50 | Relevant gene |
| genotype | VARCHAR | 20 | Genotype that modifies recommendation |
| recommendation | VARCHAR | 2000 | Nutritional recommendation |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
7. biomarker_drug_interactions¶
Gene-drug interaction records.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique interaction identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| drug_name | VARCHAR | 100 | Drug name |
| gene | VARCHAR | 50 | Interacting gene |
| interaction_type | VARCHAR | 50 | Type of interaction |
| severity | VARCHAR | 20 | Severity level |
| recommendation | VARCHAR | 2000 | Clinical recommendation |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
8. biomarker_aging_markers¶
Epigenetic aging clock markers and correlations.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique marker identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| marker_name | VARCHAR | 100 | Aging marker name |
| clock_type | VARCHAR | 50 | PhenoAge, GrimAge, etc. |
| coefficient | FLOAT | -- | Clock coefficient value |
| direction | VARCHAR | 20 | Aging or protective |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
9. biomarker_genotype_adjustments
Genotype-based reference range adjustments.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique adjustment identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| biomarker | VARCHAR | 100 | Biomarker being adjusted |
| gene | VARCHAR | 50 | Gene causing adjustment |
| genotype | VARCHAR | 20 | Specific genotype |
| standard_range | VARCHAR | 50 | Standard reference range |
| adjusted_range | VARCHAR | 50 | Genotype-adjusted range |
| rationale | VARCHAR | 2000 | Clinical rationale |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
10. biomarker_monitoring
Condition-specific monitoring protocols.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique protocol identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| condition | VARCHAR | 100 | Condition being monitored |
| biomarker | VARCHAR | 100 | Biomarker to monitor |
| frequency | VARCHAR | 50 | Monitoring frequency |
| rationale | VARCHAR | 2000 | Why this monitoring is needed |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
11. biomarker_critical_values
Critical value thresholds requiring immediate clinical action.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique critical value identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| biomarker | VARCHAR | 100 | Biomarker name |
| threshold_high | FLOAT | -- | Critical high threshold |
| threshold_low | FLOAT | -- | Critical low threshold |
| unit | VARCHAR | 20 | Unit of measurement |
| action | VARCHAR | 2000 | Required clinical action |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
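
As a rough illustration of how a record from this collection might be applied, the sketch below compares a measured value against `threshold_high` and `threshold_low`. The `CriticalValueRule` dataclass and `check_critical` function are hypothetical, the inclusive comparisons are an assumption, and the numbers are placeholders rather than clinical thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalValueRule:
    """Mirrors the scalar fields of a biomarker_critical_values record."""
    biomarker: str
    threshold_high: float
    threshold_low: float
    unit: str
    action: str


def check_critical(value: float, rule: CriticalValueRule) -> str:
    """Classify a measurement as 'critical_high', 'critical_low', or 'ok'."""
    # Assumption: a value exactly on a threshold counts as critical.
    if value >= rule.threshold_high:
        return "critical_high"
    if value <= rule.threshold_low:
        return "critical_low"
    return "ok"


# Placeholder rule: numbers are illustrative only, not clinical thresholds.
rule = CriticalValueRule(
    biomarker="example_marker",
    threshold_high=8.0,
    threshold_low=2.0,
    unit="units",
    action="placeholder action text",
)
for value in (1.5, 5.0, 9.2):
    print(value, check_critical(value, rule))
```

A caller would typically surface `rule.action` whenever the result is not `"ok"`; unit conversion is out of scope for this sketch and would need to happen before the comparison.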
12. biomarker_discordance_rules
Cross-biomarker discordance detection rules.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique rule identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| biomarker_a | VARCHAR | 100 | First biomarker in pair |
| biomarker_b | VARCHAR | 100 | Second biomarker in pair |
| pattern | VARCHAR | 500 | Expected vs discordant pattern |
| clinical_meaning | VARCHAR | 2000 | Clinical interpretation |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
13. biomarker_aj_carrier_screening
Ashkenazi Jewish genetic carrier screening panel.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Unique screening entry identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| gene | VARCHAR | 50 | Gene name (BRCA1, HEXA, GBA, etc.) |
| condition | VARCHAR | 200 | Associated condition |
| carrier_frequency | VARCHAR | 50 | Population carrier frequency |
| inheritance | VARCHAR | 50 | Inheritance pattern |
| compound_risks | VARCHAR | 1000 | Compound risk interactions |
| text_chunk | VARCHAR | 3000 | Text chunk used for embedding |
14. genomic_evidence (read-only, shared)
Shared genomic variant evidence collection from the VCF-derived pipeline. Read-only for the biomarker agent; written by the genomics pipeline.
| Field | Type | Max Length | Description |
|---|---|---|---|
| id | VARCHAR (PK) | 100 | Variant identifier |
| embedding | FLOAT_VECTOR | dim=384 | BGE-small-en-v1.5 text embedding |
| chrom | VARCHAR | 10 | Chromosome |
| pos | INT64 | -- | Genomic position |
| ref | VARCHAR | 500 | Reference allele |
| alt | VARCHAR | 500 | Alternate allele |
| gene | VARCHAR | 50 | Gene symbol |
| consequence | VARCHAR | 100 | Variant consequence (missense, etc.) |
| clinvar_significance | VARCHAR | 100 | ClinVar clinical significance |
| text_chunk | VARCHAR | 3000 | Text summary for embedding |
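
The `chrom`, `pos`, `ref`, and `alt` fields correspond to the first fixed columns of a VCF data line. This section does not describe how the genomics pipeline actually populates the collection, so the sketch below is only an assumption about that mapping. `vcf_line_to_fields` is a hypothetical name. `gene`, `consequence`, and `clinvar_significance` would come from annotation and are not derived here.

```python
def vcf_line_to_fields(line: str) -> dict:
    """Extract chrom, pos, ref, and alt from one tab-separated VCF data line."""
    if line.startswith("#"):
        raise ValueError("header lines carry no variant record")
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("a VCF data line has at least 8 fixed columns")
    # Fixed VCF column order: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    chrom, pos, _variant_id, ref, alt = columns[:5]
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}


# Synthetic line for illustration; it does not describe a real variant.
line = "chr1\t12345\t.\tA\tG\t50\tPASS\t."
print(vcf_line_to_fields(line))
```

Multi-allelic records, where ALT holds several comma-separated alleles, are returned as-is; splitting them into one row per allele would be a separate design decision.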
End of Learning Guide (Advanced) -- Precision Biomarker Intelligence Agent
Total codebase: 12,628 lines source + 8,772 lines tests = 21,400 lines across 36 files.