Skip to content

Precision Biomarker Intelligence Agent — Architecture Design Document

Author: Adam Jones Date: March 2026 Version: 1.0.0 License: Apache 2.0


1. Executive Summary

The Precision Biomarker Intelligence Agent extends the HCLS AI Factory platform to deliver genomics-informed biomarker interpretation. Unlike standard lab reports that compare values against population-wide reference ranges, this agent integrates patient genotype data, pharmacogenomic variants, age/sex-stratified thresholds, and multi-lab reference comparisons to produce personalized clinical intelligence.

The agent combines 9 deterministic clinical analysis engines with a 14-collection RAG pipeline to answer questions like "Interpret my LDL of 138 given ApoE E3/E4 genotype" — simultaneously searching biomarker reference data, genetic variant databases, pharmacogenomic guidelines, and clinical evidence, then synthesizing a grounded response through Claude.

Key Results

Metric Value
Total vectors indexed 652 across 14 Milvus collections (13 owned + 1 read-only)
Clinical analysis engines 9 deterministic engines (biological age, disease trajectory, PGx, etc.)
Disease domains covered 9 (cardiovascular, diabetes, liver, thyroid, iron, nutritional, kidney, bone, cognitive)
PGx genes mapped 13 (CYP2D6, CYP2C19, CYP2C9, VKORC1, SLCO1B1, TPMT, DPYD, MTHFR, HLA-B57:01, G6PD, HLA-B58:01, CYP3A5, UGT1A1)
Critical value rules 21 with severity-ordered alerting (critical > urgent > warning)
Unit tests passing 709
Demo validation checks 65/65
Export formats 4 (FHIR R4, PDF, Markdown, CSV)

2. Architecture Overview

2.1 Mapping to VAST AI OS

VAST AI OS Component Biomarker Agent Role
DataStore Raw reference JSON files: biomarker definitions, genetic variants, PGx rules, disease trajectories
DataEngine Seed pipeline: JSON → BGE-small embedding → Milvus vector insert
DataBase 14 Milvus collections (13 owned + 1 read-only) + 2 sample patients
InsightEngine 9 clinical analysis engines + BGE-small embedding + multi-collection RAG
AgentEngine PrecisionBiomarkerAgent orchestrator + Streamlit UI (8 tabs) + FastAPI REST

2.2 System Diagram

                        ┌─────────────────────────────────┐
                            Streamlit UI (8528)            
                            8 tabs: Analysis | Bio Age |   
                            Disease Risk | PGx | Evidence  
                            | Reports | Patient 360 | Long  
                        └──────────────┬──────────────────┘
                                       
                        ┌──────────────▼──────────────────┐
                          PrecisionBiomarkerAgent         
                          Orchestrates 9 analysis engines 
                          + RAG pipeline + export          
                        └──────────────┬──────────────────┘
                                       
            ┌──────────────────────────┼───────────────────────────┐
                                                                 
  ┌─────────▼──────────┐   ┌──────────▼──────────┐   ┌──────────▼──────────┐
   Deterministic           RAG Pipeline             Export               
   Analysis Engines                                                      
                           BGE-small-en-v1.5        FHIR R4 Bundle       
   BiologicalAge           (384-dim embedding)      PDF (reportlab)      
   DiseaseTrajectory                               Markdown             
   Pharmacogenomics                                CSV                  
   GenotypeAdjuster        Parallel Search                               
   CriticalValues          14 Milvus Collections    + FHIR Validation    
   Discordance             (ThreadPoolExecutor)                          
   LabRangeInterp                                                       
   AJ Carrier Screen                                                    
   Age-Stratified          Claude Sonnet 4.6                             
  └─────────────────────┘   └──────────────────────┘   └──────────────────────┘
                                      
  ┌─────────▼──────────────────────────▼──────────────────────────────┐
                    Milvus 2.4  14 Collections                      
                                                                      
    biomarker_reference (208)    biomarker_genetic_variants (42)      
    biomarker_pgx_rules (29)    biomarker_disease_trajectories (39)  
    biomarker_clinical_evidence (80)  biomarker_nutrition (50)       
    biomarker_drug_interactions (51)  biomarker_aging_markers (20)   
    biomarker_genotype_adjustments (30)  biomarker_monitoring (30)   
    biomarker_critical_values (21)  biomarker_discordance_rules (12) 
    biomarker_aj_carrier_screening (10)  genomic_evidence (30) [RO]  
  └───────────────────────────────────────────────────────────────────┘

3. Data Collections — Actual State

3.1 biomarker_reference — 208 records

Primary biomarker definitions with clinical significance, units, reference ranges, and category metadata.

Field Type Description
id VARCHAR(64) Primary key
embedding FLOAT_VECTOR(384) BGE-small-en-v1.5
name VARCHAR(200) Display name (e.g., "LDL Cholesterol")
category VARCHAR(100) Domain category (lipid, metabolic, thyroid, etc.)
unit VARCHAR(50) Measurement unit
reference_low FLOAT Standard reference range lower bound
reference_high FLOAT Standard reference range upper bound
clinical_significance VARCHAR(2000) Clinical interpretation text
epigenetic_clock VARCHAR(500) PhenoAge/GrimAge relevance

3.2 biomarker_genetic_variants — 42 records

Clinically actionable genetic variants (SNPs) with risk alleles and effect sizes.

Field Type Description
id VARCHAR(64) Primary key (e.g., "var_apoe_e4")
gene VARCHAR(50) Gene name (APOE, MTHFR, TCF7L2, etc.)
rsid VARCHAR(20) dbSNP identifier
risk_allele VARCHAR(20) Risk allele designation
effect_size VARCHAR(250) Quantified effect description
affected_biomarkers VARCHAR(1000) Biomarkers influenced by this variant

3.3 biomarker_pgx_rules — 29 records

CPIC Level 1A pharmacogenomic guidelines mapping genotype to drug recommendations.

Field Type Description
id VARCHAR(64) Primary key
gene VARCHAR(50) Pharmacogene (CYP2D6, CYP2C19, etc.)
phenotype VARCHAR(200) Metabolizer status
drugs_affected VARCHAR(1000) Affected medications
recommendation VARCHAR(2000) CPIC dosing recommendation
evidence_level VARCHAR(20) CPIC evidence level (1A, 1B, etc.)

3.4 biomarker_disease_trajectories — 39 records

Multi-biomarker risk patterns for 9 disease domains with genotype-specific thresholds.

3.5 biomarker_clinical_evidence — 80 records

Published clinical evidence supporting biomarker interpretation with study citations.

3.6 biomarker_nutrition — 50 records

Nutrient-biomarker interactions and dietary recommendations.

3.7 biomarker_drug_interactions — 51 records

Medication effects on biomarker levels (e.g., statin effects on LDL, CoQ10, liver enzymes).

3.8 biomarker_aging_markers — 20 records

PhenoAge and GrimAge epigenetic clock biomarker coefficients and interpretation data.

3.9 biomarker_genotype_adjustments — 30 records

Genotype-specific reference range modifications (e.g., ApoE E4 carriers need different LDL thresholds).

3.10 biomarker_monitoring — 30 records

Follow-up testing schedules and monitoring protocols.

3.11 biomarker_critical_values — 21 records

Critical/urgent/warning threshold rules with escalation targets.

Field Type Description
id VARCHAR(64) Primary key
biomarker VARCHAR(200) Biomarker name
critical_high / critical_low FLOAT Critical threshold
urgent_high / urgent_low FLOAT Urgent threshold
warning_high / warning_low FLOAT Warning threshold
severity VARCHAR(20) "critical", "urgent", or "warning"
escalation_target VARCHAR(200) Routing destination
clinical_action VARCHAR(2000) Required clinical action

3.12 biomarker_discordance_rules — 12 records

Cross-biomarker discordance detection patterns (e.g., normal LDL + elevated ApoB).

3.13 biomarker_aj_carrier_screening — 10 records

Ashkenazi Jewish population-specific genetic carrier screening panel.

3.14 Index Configuration (all collections)

Algorithm:  IVF_FLAT
Metric:     COSINE
nlist:      1024
nprobe:     16
Dimension:  384 (BGE-small-en-v1.5)

4. Clinical Analysis Engines

4.1 BiologicalAgeCalculator

Implements two validated epigenetic clock algorithms:

PhenoAge (Levine 2018): - 9 clinical biomarkers: albumin, creatinine, glucose, CRP, lymphocyte %, MCV, RDW, alkaline phosphatase, WBC - Unit conversion: US clinical → SI units - Gompertz mortality model with chronological age coefficient (0.0804) - Outputs: biological age, age acceleration, mortality risk (LOW/NORMAL/MODERATE/HIGH), 95% CI, top aging drivers

GrimAge (Lu 2019): - 6 plasma protein surrogates: GDF-15, Cystatin C, Leptin, PAI-1, TIMP-1, Adrenomedullin - Returns None when no plasma markers are available - Correlation with true GrimAge: r = 0.72 (validation cohort)

4.2 DiseaseTrajectoryAnalyzer

Analyzes patient biomarkers and genotypes across 9 independent disease domains:

Domain Key Biomarkers Key Genotypes
Type 2 Diabetes HbA1c, fasting glucose, fasting insulin, HOMA-IR TCF7L2 rs7903146
Cardiovascular LDL-C, HDL-C, triglycerides, hs-CRP, Lp(a) APOE, PCSK9
Liver ALT, AST, GGT, albumin PNPLA3 rs738409
Thyroid TSH, free T4, free T3 DIO2 rs225014
Iron Ferritin, transferrin saturation, TIBC HFE rs1800562
Nutritional Vitamin D, B12, folate, omega-3 index, magnesium, zinc MTHFR rs1801133
Kidney Creatinine, eGFR, BUN, cystatin C
Bone Health Calcium, alkaline phosphatase, vitamin D, PTH
Cognitive ApoE genotype, homocysteine, omega-3 index, hs-CRP APOE E4

Returns 9 results sorted by risk severity (CRITICAL > HIGH > MODERATE > LOW).

4.3 PharmacogenomicMapper

Maps star alleles and genotypes to drug recommendations following CPIC Level 1A guidelines.

13 Supported Genes:

Gene Input Type Example
CYP2D6 Star alleles *1/*4 → Intermediate Metabolizer
CYP2C19 Star alleles *1/*2 → Intermediate Metabolizer
CYP2C9 Star alleles *1/*3 → Intermediate Metabolizer
VKORC1 Genotype (rs9923231) AG → Intermediate sensitivity
SLCO1B1 Genotype (rs4149056) TC → Intermediate function
TPMT Star alleles *1/*1 → Normal Metabolizer
DPYD Star alleles *1/*2A → Intermediate Metabolizer
MTHFR Genotype (rs1801133) CT → Heterozygous (reduced function)
HLA-B*57:01 Genotype Positive → Abacavir contraindicated
G6PD Genotype Deficient → Multiple drug contraindications
HLA-B*58:01 Genotype Positive → Allopurinol contraindicated
CYP3A5 Star alleles *1/*3 → Intermediate Metabolizer
UGT1A1 Star alleles *1/*28 → Intermediate Metabolizer

Includes drug-drug interaction detection across PGx recommendations.

4.4 CriticalValueEngine

Real-time threshold detection against 21 rules with three severity tiers:

Severity Ordering:  CRITICAL  >  URGENT  >  WARNING
                    (immediate)  (within 4h) (next visit)

Alert Structure:
  - biomarker: which value triggered
  - value: measured result
  - threshold: exceeded threshold
  - direction: "high" or "low"
  - severity: "critical" | "urgent" | "warning"
  - escalation_target: routing destination
  - clinical_action: required next step
  - cross_checks: related biomarkers to verify

4.5 DiscordanceDetector

Identifies clinically discordant biomarker patterns (12 rules). Examples:

  • Normal LDL + elevated ApoB → particle number discordance
  • Normal TSH + low free T3 → subclinical conversion issue
  • Low ferritin + normal hemoglobin → early iron depletion before anemia

4.6 LabRangeInterpreter

Three-way comparison for each biomarker against:

  1. Quest Diagnostics — Standard clinical reference ranges
  2. LabCorp — Standard clinical reference ranges
  3. Function Health — Optimal/functional medicine ranges

Sex-specific lookup: tries "{biomarker} ({sex})" first, then falls back to "{biomarker}".

4.7 GenotypeAdjuster + Age-Stratified Adjustments

Genotype adjustments: Modifies reference ranges based on patient genotype (e.g., ApoE E4 carriers need LDL < 100 instead of < 130).

Age-stratified adjustments: 8 biomarkers with sex-stratified ranges across 5 age brackets:

Biomarker Age Brackets Sex-Stratified
Creatinine 0-17, 18-39, 40-59, 60-79, 80+ Yes
eGFR 0-17, 18-39, 40-59, 60-79, 80+ Yes
TSH 0-17, 18-39, 40-59, 60-79, 80+ No
Fasting Glucose 0-17, 18-39, 40-59, 60-79, 80+ No
Total Cholesterol 0-17, 18-39, 40-59, 60-79, 80+ No
Alkaline Phosphatase 0-17, 18-39, 40-59, 60-79, 80+ Yes
Ferritin 0-17, 18-39, 40-59, 60-79, 80+ Yes
PSA 40-59, 60-79, 80+ Male only

5. Multi-Collection RAG Engine

5.1 Search Flow

Query Text
BGE-small-en-v1.5 Embedding (384-dim)
ThreadPoolExecutor: Parallel search across 14 collections
Weighted merge (configurable per-collection weights)
Knowledge graph augmentation
Claude Sonnet 4.6 prompt with patient context
Grounded response with citations

5.2 Collection Weights

Collection Weight Rationale
biomarker_reference 0.12 Primary biomarker definitions
genetic_variants 0.11 Genotype-specific interpretation
pgx_rules 0.10 Pharmacogenomic guidelines
disease_trajectories 0.10 Risk stratification patterns
clinical_evidence 0.09 Published study evidence
genomic_evidence 0.08 Shared genomic context
drug_interactions 0.07 Medication effects
aging_markers 0.07 Epigenetic clock data
nutrition 0.05 Dietary recommendations
genotype_adjustments 0.05 Reference range modifications
monitoring 0.05 Follow-up protocols
critical_values 0.04 Threshold rules
discordance_rules 0.04 Pattern detection
aj_carrier_screening 0.03 Population-specific screening
Total 1.00

5.3 Embedding Strategy

Model: BGE-small-en-v1.5 (BAAI)

  • Dimension: 384
  • Metric: Cosine similarity
  • Query prefix: "Represent this sentence for searching relevant passages: "
  • Document embedding: Raw text (no prefix)

to_embedding_text() pattern: Each Pydantic model implements this method to produce an optimal embedding string combining key fields.

5.4 Citation Scoring

Level Threshold Display
High confidence >= 0.75 Full citation with source link
Medium confidence >= 0.60 Citation with caveat
Below threshold < 0.40 Filtered out

6. Export Pipeline

6.1 FHIR R4 DiagnosticReport

Produces a FHIR R4 Bundle containing:

  • Patient resource with identifier, gender, birth year
  • DiagnosticReport resource (main report)
  • Observation resources for each biomarker analysis
  • Reference integrity validation (all references resolve within the bundle)

Structural validation checks: 1. Bundle resourceType and entry list 2. DiagnosticReport required fields (status, code, subject, effectiveDateTime) 3. Observation required fields (status, code, subject, valueQuantity) 4. Patient identifier presence 5. Reference integrity across all resources

6.2 PDF Export

Uses reportlab for clinical-grade PDF reports. Graceful degradation when reportlab is not installed (warning displayed in UI).

6.3 Markdown and CSV

Plain-text formats for integration with downstream systems and clinical notes.


7. Sample Patient Data

Two fully specified sample patients for demo and testing:

Patient 1: HCLS-BIO-2026-00001 (Male, 45)

Attribute Value
Sex/Age Male, 45
BMI 23.7
Ethnicity Ashkenazi Jewish
Genome HG002 (NA24385)
ApoE E3/E4
MTHFR C677T CT (Heterozygous)
CYP2D6 1/4 (Intermediate Metabolizer)
CYP2C19 1/2 (Intermediate Metabolizer)
Medications 7 (Atorvastatin, Lisinopril, L-Methylfolate, Fish Oil, Vitamin D3, Methylcobalamin, CoQ10)
Family History Father MI at 58, Mother T2DM at 52, Paternal GM Alzheimer's at 74
Biomarker Count 31
Genotype Count 6 (APOE, MTHFR, TCF7L2, PNPLA3, HFE, DIO2)

Patient 2: HCLS-BIO-2026-00002 (Female, 38)

Attribute Value
Sex/Age Female, 38
BMI 22.6
Ethnicity Ashkenazi Jewish
BRCA1 Status NOT YET TESTED — URGENT
Preconception ACTIVE — 12-18 months
Medications 5 (OCP, L-Methylfolate, Vitamin D3, Iron bisglycinate, Prenatal DHA)
Family History Mother BRCA1+ breast cancer at 48, Maternal Aunt ovarian cancer at 55
Biomarker Count 30
Genotype Count 4 (MTHFR, TCF7L2, PNPLA3, DIO2)

8. Performance Benchmarks

8.1 Seed Performance

Operation Duration
Seed all 14 collections (652 vectors) ~2 min
BGE-small embedding per batch (32 records) ~1.5 sec
Milvus insert per collection <500 ms

8.2 Clinical Engine Performance

Engine Latency
CriticalValueEngine.check() — 21 rules <10 ms
DiscordanceDetector.check() — 12 rules <10 ms
LabRangeInterpreter.interpret() — all biomarkers <20 ms
BiologicalAgeCalculator.calculate() — PhenoAge + GrimAge <100 ms
DiseaseTrajectoryAnalyzer.analyze_all() — 9 domains <200 ms
PharmacogenomicMapper.map_all() — all genes <50 ms
GenotypeAdjuster.apply_age_adjustments() <20 ms
All 9 engines combined <400 ms

8.3 RAG Pipeline Performance

Operation Latency
BGE-small query embedding ~5 ms
14-collection parallel search (top-5 each) 10-18 ms
Claude Sonnet 4.6 generation 15-25 sec
Full RAG query end-to-end ~20-30 sec

8.4 Export Performance

Format Latency
FHIR R4 Bundle + validation <500 ms
PDF (reportlab) <2 sec
Markdown <100 ms
CSV <100 ms

9. Infrastructure

9.1 Technology Stack

Component Technology
Language Python 3.10+
Vector DB Milvus 2.4
Embeddings BGE-small-en-v1.5 (BAAI) — 384-dim
LLM Claude Sonnet 4.6 (Anthropic API)
Web UI Streamlit
REST API FastAPI + Uvicorn
Configuration Pydantic BaseSettings
Testing pytest
Export FHIR R4, reportlab (PDF), Markdown, CSV
Containerization Docker + Docker Compose

9.2 Service Ports

Service Port
Streamlit UI 8528
FastAPI REST API 8529
Milvus (shared) 19530

9.3 Dependencies on HCLS AI Factory

Dependency Type
Milvus 2.4 Shared vector database (port 19530)
genomic_evidence collection Read-only shared collection from Stage 2 RAG pipeline
BGE-small-en-v1.5 Shared embedding model
Claude API key Shared Anthropic API key

10. Demo Scenarios

10.1 Validated Demo Queries

Scenario 1 — Genotype-Informed Lipid Interpretation:

Patient: Male, 45, ApoE E3/E4
LDL-C: 138 mg/dL
Question: "Is my LDL safe given my ApoE status?"
Expected: ApoE E4 carriers need LDL < 100 (not standard < 130). Flag as elevated.

Scenario 2 — MTHFR and Methylation:

Patient: Male, 45, MTHFR C677T CT
Homocysteine: 12.5 umol/L
Question: "How does MTHFR affect my folate metabolism?"
Expected: CT heterozygous = ~35% reduced enzyme activity. L-Methylfolate preferred over folic acid.

Scenario 3 — Pharmacogenomic Alert:

Patient: Male, 45, CYP2D6 *1/*4
Question: "Which medications should I be cautious with?"
Expected: Intermediate metabolizer. Codeine reduced efficacy. Tramadol dose adjustment.

Scenario 4 — Pre-Diabetes Risk with Genetic Context:

Patient: Male, 45, TCF7L2 rs7903146 CT
HbA1c: 5.6%, Fasting Glucose: 98, HOMA-IR: 1.98
Question: "What is my diabetes risk?"
Expected: TCF7L2 CT = 1 risk allele. Pre-diabetes range. MODERATE risk trajectory.

Scenario 5 — Female Preconception Assessment:

Patient: Female, 38, BRCA1 untested, Ferritin 28
Question: "What should I address before pregnancy?"
Expected: Ferritin critically low for preconception (target > 50). BRCA1 testing URGENT.


11. File Structure (Actual)

precision_biomarker_agent/
├── src/                          # 12 core modules (~22,000 lines)
   ├── models.py                 # 20+ Pydantic data models
   ├── collections.py            # 14 Milvus collection schemas + manager
   ├── rag_engine.py             # Multi-collection RAG engine
   ├── agent.py                  # Orchestrator (9 engines + RAG)
   ├── biological_age.py         # PhenoAge + GrimAge calculators
   ├── disease_trajectory.py     # 9-domain risk analyzer
   ├── pharmacogenomics.py       # CPIC PGx mapper (13 genes)
   ├── genotype_adjustment.py    # Genotype + age-stratified adjustments
   ├── critical_values.py        # Critical value threshold engine
   ├── discordance_detector.py   # Cross-biomarker discordance
   ├── lab_range_interpreter.py  # Quest vs LabCorp vs optimal
   └── export.py                 # FHIR R4 + PDF + Markdown + CSV
├── app/
   └── biomarker_ui.py           # Streamlit (8 tabs, ~1,700 lines)
├── api/
   └── main.py                   # FastAPI REST server
├── config/
   └── settings.py               # PrecisionBiomarkerSettings
├── data/reference/               # 16 JSON reference files (652+ records)
├── scripts/                      # seed_all, validate_e2e, demo_validation
├── tests/                        # 709 tests (8 test files + conftest)
├── docker-compose.yml
├── Dockerfile
└── requirements.txt

57 Python files | ~29,000 lines | Apache 2.0


12. Implementation Status

  • Phase 1 (Architecture) — Complete. All data models, collection schemas, 9 clinical engines, RAG engine, and agent orchestrator implemented.
  • Phase 2 (Data) — Complete. 652 vectors across 14 Milvus collections. 2 fully specified sample patients.
  • Phase 3 (Integration) — Complete. Full RAG pipeline with Claude. FHIR R4 export with structural validation.
  • Phase 4 (Testing) — Complete. 709 unit tests. 65-check demo validation. End-to-end data validation.
  • Phase 5 (Demo Ready) — Complete. Production-quality Streamlit UI. All 8 tabs validated. Both sample patients tested.

Remaining Work

  • Longitudinal tracking (multi-time-point biomarker trending) — planned for Phase 6
  • Additional population-specific carrier screening panels (Sephardic, Finnish, French-Canadian)
  • Integration with HCLS AI Factory landing page health monitoring

13. Relationship to HCLS AI Factory

The Precision Biomarker Intelligence Agent is the fourth intelligence agent in the HCLS AI Factory platform, joining:

  1. CAR-T Intelligence Agent — Cross-functional CAR-T cell therapy intelligence
  2. Imaging Intelligence Agent — Medical imaging detection, segmentation, and triage
  3. Precision Oncology Agent — Tumor-specific treatment selection and clinical trial matching
  4. Precision Biomarker Agent — Genomics-informed biomarker interpretation (this agent)

All agents share the same infrastructure (Milvus, BGE-small embeddings, Claude API) and can cross-reference the shared genomic_evidence collection.


14. Credits

  • Adam Jones
  • Apache 2.0 License