Rare Disease Diagnostic Agent — Architecture Design Document

Author: Adam Jones | Date: March 2026 | License: Apache 2.0


1. Executive Summary

The Rare Disease Diagnostic Agent extends the HCLS AI Factory platform to accelerate the diagnostic odyssey for patients with suspected rare diseases. The average rare disease diagnosis takes 5-7 years across 7+ specialists — this agent compresses that timeline by integrating HPO-based phenotype matching, ACMG variant classification, differential diagnosis generation, gene therapy eligibility screening, and natural history data into a single RAG-powered intelligence platform.

The system searches 14 Milvus vector collections containing phenotype ontologies, disease definitions, gene annotations, variant databases, case reports, clinical trials, therapy records, guidelines, and pathway data. It covers 88 rare diseases across 13 disease categories, implements 23 ACMG criteria for variant classification, catalogs 45+ disease-associated genes, tracks 12 gene therapies (approved and pipeline), and runs 9 diagnostic algorithms for systematic differential diagnosis.

The platform enables queries like "6-year-old with progressive ataxia, nystagmus, and elevated alpha-fetoprotein — differential diagnosis?" that simultaneously search phenotype databases, disease definitions, genetic associations, and case reports — returning a ranked differential with supporting evidence and recommended genetic testing.

Key Results

| Metric | Value |
|---|---|
| Lines of code | 21,935 |
| Milvus collections | 14 (13 owned + 1 shared genomic_evidence) |
| Disease categories | 13 |
| Diseases cataloged | 88 (97+ including subtypes) |
| Genes cataloged | 45+ disease-associated |
| Gene therapies tracked | 12 (approved + pipeline) |
| ACMG criteria implemented | 23 (pathogenic: 5 PVS/PS/PM/PP; benign: 5 BA/BS/BP) |
| HPO top-level terms | 23 phenotype categories |
| Diagnostic algorithms | 9 systematic approaches |
| Diagnostic workflows | 10 |
| Decision support engines | 6 (HPO matcher, ACMG classifier, drug matcher, family analyzer, natural history, diagnostic algorithm) |
| Entity aliases | 120+ |
| API endpoints | 20 |
| Test suite | 193 tests (100% pass, 0.16s runtime) |

2. Architecture Overview

2.1 Mapping to VAST AI OS

| VAST AI OS Component | Rare Disease Agent Role |
|---|---|
| DataStore | Raw files: OMIM entries, Orphanet data, HPO ontology, ClinVar variants, gene therapy records |
| DataEngine | Ingest pipelines for phenotypes, diseases, genes, variants, literature, trials, therapies, case reports, guidelines, pathways, registries, natural history, newborn screening |
| DataBase | 14 Milvus collections (13 owned + 1 read-only) + knowledge base (13 categories, 88+ diseases, 12 gene therapies, 23 ACMG criteria) |
| InsightEngine | BGE-small embedding + multi-collection RAG + 6 decision support engines + 9 diagnostic algorithms |
| AgentEngine | RareDiseaseAgent orchestrator + Streamlit UI (5 tabs) + FastAPI REST |

2.2 System Diagram

+==============================================================+
|                    RARE DISEASE DIAGNOSTIC AGENT               |
+==============================================================+
|                                                                |
|  +------------------+    +-----------------------------+       |
|  |  Streamlit UI    |    |     FastAPI REST API        |       |
|  |  Port 8544       +--->+     Port 8134               |       |
|  |  5 Tabs          |    |     20 Endpoints            |       |
|  |  NVIDIA Theme    |    |     Auth + CORS + Metrics   |       |
|  +------------------+    +----+--------+--------+------+       |
|                               |        |        |              |
|            +------------------+        |        +--------+     |
|            |                           |                 |     |
|  +---------v---------+  +-------------v----+  +---------v--+  |
|  | Workflow Engine   |  | Decision Support |  | RAG Engine |  |
|  | 10 Workflows      |  | 6 Engines        |  | Embed+Search|  |
|  | Template Method   |  | HPO Matcher      |  | 14 Colls   |  |
|  +--------+----------+  | ACMG Classifier  |  +------+-----+  |
|           |              | Drug Matcher      |         |        |
|           |              | Family Analyzer   |         |        |
|           |              | Natural History   |         |        |
|           |              | Diagnostic Algo   |         |        |
|           |              +--------+---------+         |        |
|           |                       |                   |        |
|  +--------v-----------+-----------v---+    +----------v-----+ |
|  |    Knowledge Base                  |    |    Milvus      | |
|  |    13 disease categories           |    |    14 collections| |
|  |    88+ diseases, 45+ genes         |    |    384-dim BGE | |
|  |    12 gene therapies               |    |    IVF_FLAT    | |
|  |    23 ACMG criteria                |    |    COSINE      | |
|  +------------------------------------+    +----------------+ |
+================================================================+

3. Data Collections — Actual State

3.1 Collection Catalog

| # | Collection | Est. Records | Weight | Primary Use |
|---|---|---|---|---|
| 1 | rd_phenotypes | 18,000 | 0.12 | HPO-based phenotype matching |
| 2 | rd_diseases | 10,000 | 0.11 | Disease definitions, OMIM/Orphanet |
| 3 | rd_genes | 5,000 | 0.10 | Gene annotations, constraint scores |
| 4 | rd_variants | 500,000 | 0.10 | ClinVar variants, ACMG classifications |
| 5 | rd_literature | 50,000 | 0.08 | PubMed rare disease literature |
| 6 | rd_trials | 8,000 | 0.06 | Rare disease clinical trials |
| 7 | rd_therapies | 2,000 | 0.07 | Gene therapies, enzyme replacement, substrate reduction |
| 8 | rd_case_reports | 20,000 | 0.07 | Published case reports with phenotype-genotype correlations |
| 9 | rd_guidelines | 3,000 | 0.06 | Diagnostic and management guidelines |
| 10 | rd_pathways | 2,000 | 0.06 | Metabolic pathways, disease mechanisms |
| 11 | rd_registries | 1,500 | 0.04 | Patient registries, natural history studies |
| 12 | rd_natural_history | 5,000 | 0.05 | Disease milestones, progression data |
| 13 | rd_newborn_screening | 80 | 0.05 | NBS conditions, analytes, cutoffs |
| 14 | genomic_evidence | 3,560,000 | 0.03 | Shared genomic variant context |
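
The document does not say how the per-collection weights enter ranking. One plausible reading, sketched below, is to scale each hit's similarity by its collection weight before merging results. The `Hit` type and `merge_hits` function are illustrative names, not part of the codebase described here.

```python
from dataclasses import dataclass

# Weights copied from the collection catalog; they sum to 1.0.
COLLECTION_WEIGHTS = {
    "rd_phenotypes": 0.12, "rd_diseases": 0.11, "rd_genes": 0.10,
    "rd_variants": 0.10, "rd_literature": 0.08, "rd_trials": 0.06,
    "rd_therapies": 0.07, "rd_case_reports": 0.07, "rd_guidelines": 0.06,
    "rd_pathways": 0.06, "rd_registries": 0.04, "rd_natural_history": 0.05,
    "rd_newborn_screening": 0.05, "genomic_evidence": 0.03,
}


@dataclass(frozen=True)
class Hit:
    collection: str
    doc_id: str
    score: float  # cosine similarity returned by the vector search


def merge_hits(hits, top_k=10):
    """Rank hits from all collections by weight * similarity (assumed scheme)."""
    ranked = sorted(
        hits,
        key=lambda h: COLLECTION_WEIGHTS[h.collection] * h.score,
        reverse=True,
    )
    return ranked[:top_k]
```

Under this scheme a strong hit in a low-weight collection such as genomic_evidence ranks below a moderately strong hit in rd_phenotypes or rd_diseases.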

3.2 Index Configuration

| Parameter | Value |
|---|---|
| Index type | IVF_FLAT |
| Metric | COSINE |
| nlist | 1024 (variants, literature), 256 (diseases, trials), 128 (others) |
| nprobe | 16 |
| Embedding dim | 384 (BGE-small-en-v1.5) |
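
As a rough illustration, the table above could be expressed as parameter dictionaries in the shape that Milvus clients accept for index creation and search. The mapping from the table's short labels ("variants", "literature", and so on) to full collection names is an assumption, and the helper names are hypothetical.

```python
# Assumed mapping of the table's labels to collection names.
NLIST_BY_COLLECTION = {
    "rd_variants": 1024,
    "rd_literature": 1024,
    "rd_diseases": 256,
    "rd_trials": 256,
}
DEFAULT_NLIST = 128  # "others" in the table


def index_params(collection: str) -> dict:
    """Build IVF_FLAT / COSINE index parameters for one owned collection."""
    nlist = NLIST_BY_COLLECTION.get(collection, DEFAULT_NLIST)
    return {
        "index_type": "IVF_FLAT",
        "metric_type": "COSINE",
        "params": {"nlist": nlist},
    }


def search_params() -> dict:
    """Search-time parameters: probe 16 of the nlist clusters."""
    return {"metric_type": "COSINE", "params": {"nprobe": 16}}
```

With nprobe fixed at 16, a search scans 16 of 1024 clusters in the largest collections and 16 of 128 in the smallest, so the small collections are searched far more exhaustively.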

4. Decision Support Engines

4.1 HPO Phenotype Matcher

Matches patient phenotype descriptions to HPO (Human Phenotype Ontology) terms and computes phenotype-based disease similarity scores.

| Feature | Detail |
|---|---|
| HPO term lookup | Free-text to HPO ID resolution via semantic search |
| IC scoring | Information content-based term weighting (rarer phenotypes weighted higher) |
| Disease matching | Phenotype overlap scoring against 88+ disease phenotype profiles |
| Negated phenotypes | Supports "absent" phenotypes that reduce disease probability |
| Output | Ranked differential diagnosis with match scores and key distinguishing features |
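
A minimal sketch of IC-weighted overlap scoring follows. It assumes IC is the negative log of a term's annotation frequency and that a negated phenotype subtracts its IC when the disease profile contains it. The exact formula used by the HPO matcher is not given in this document, and the function names are illustrative.

```python
import math


def information_content(term_frequency: float) -> float:
    """IC = -ln(p), where p is the fraction of diseases annotated with the term."""
    return -math.log(term_frequency)


def match_score(present, absent, profile, term_freq):
    """IC-weighted overlap in [0, 1] between a patient and one disease profile.

    present/absent: HPO IDs observed or explicitly excluded in the patient.
    profile: set of HPO IDs annotated to the disease.
    term_freq: HPO ID -> annotation frequency (caller-supplied).
    """
    total = sum(information_content(term_freq[t]) for t in present)
    if total == 0:
        return 0.0
    matched = sum(information_content(term_freq[t]) for t in present if t in profile)
    penalty = sum(information_content(term_freq[t]) for t in absent if t in profile)
    return max(0.0, (matched - penalty) / total)
```

Because rare terms carry more IC, matching a rare finding such as elevated AFP moves the score more than matching a common one such as nystagmus.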

4.2 ACMG Variant Classifier

Implements the ACMG/AMP 2015 framework for variant classification with 23 criteria:

| Category | Criteria | Strength | Examples |
|---|---|---|---|
| Pathogenic | PVS1 | Very Strong | Null variant in gene where LOF is known disease mechanism |
| Pathogenic | PS1-PS4 | Strong | Same amino acid change as established pathogenic, de novo (confirmed), functional studies, prevalence in affected individuals |
| Pathogenic | PM1-PM6 | Moderate | Hot spot or functional domain, absent from controls, in trans with a pathogenic variant (recessive), protein length change, novel missense at a known pathogenic residue, assumed de novo |
| Pathogenic | PP1-PP5 | Supporting | Cosegregation, missense in gene with low benign missense rate, in silico, patient phenotype, reputable source |
| Benign | BA1 | Stand-alone | Allele frequency >5% in any population |
| Benign | BS1-BS4 | Strong | Greater frequency than expected, healthy adults, functional no effect, non-cosegregation |
| Benign | BP1-BP7 | Supporting | Missense in gene where only truncating variants cause disease, observed in trans with a dominant pathogenic variant, in silico benign, alternate cause found, synonymous |

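Classification logic: Pathogenic requires PVS1 plus one of (>= 1 PS, >= 2 PM, 1 PM + 1 PP, >= 2 PP), or >= 2 PS, or 1 PS plus one of (>= 3 PM, 2 PM + >= 2 PP, 1 PM + >= 4 PP). Likely Pathogenic, Uncertain Significance, Likely Benign, and Benign follow the remaining ACMG/AMP 2015 combining rules.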
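
The combining rules can be sketched as a small function over counts of met criteria per strength level. This follows the published ACMG/AMP 2015 rules and is not the agent's own classifier code. It ignores strength modifications, and it treats any mix of a pathogenic-side and a benign-side call as Uncertain Significance. The function name is illustrative.

```python
import re
from collections import Counter


def classify(criteria):
    """Apply ACMG/AMP 2015 combining rules to met criteria codes, e.g. ["PVS1", "PM2"]."""
    # Strip the trailing number: "PVS1" -> "PVS", "PM2" -> "PM".
    n = Counter(re.match(r"[A-Z]+", code).group() for code in criteria)
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Conflicting evidence defaults to uncertain significance.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain Significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely Pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely Benign"
    return "Uncertain Significance"
```

For example, `classify(["PVS1", "PM2", "PP3"])` returns "Pathogenic" via the PVS1 + 1 PM + 1 PP rule.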

4.3 Gene Therapy Eligibility Screener

Evaluates patient eligibility for 12 tracked gene therapies:

| Therapy | Disease | Status | Key Criteria |
|---|---|---|---|
| Zolgensma (onasemnogene) | SMA Type 1 | FDA Approved | Age < 2y, SMN1 biallelic deletion, < 3 copies SMN2 |
| Luxturna (voretigene) | RPE65 retinal dystrophy | FDA Approved | Biallelic RPE65 mutations, sufficient viable retinal cells |
| Hemgenix (etranacogene) | Hemophilia B | FDA Approved | Severe/moderate hemophilia B, no FIX inhibitors |
| Casgevy (exagamglogene) | SCD / Beta-thalassemia | FDA Approved | Age >= 12, severe SCD or transfusion-dependent thal |
| Skysona (elivaldogene) | CALD | FDA Approved | Males 4-17, early active CALD, matched donor unavailable |
| Elevidys (delandistrogene) | DMD | FDA Approved (accelerated) | Age 4-5, ambulatory, confirmed DMD |
| + 6 pipeline therapies | Various | Phase I-III | Disease-specific criteria |
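
To make the screening idea concrete, here is a sketch that checks the three Zolgensma criteria listed in the table and reports which ones are unmet. The `SmaPatient` fields and the function name are hypothetical. A real screener would also cover the clinical checks mentioned elsewhere in this document, such as anti-AAV9 titer and liver function.

```python
from dataclasses import dataclass


@dataclass
class SmaPatient:
    age_years: float
    smn1_biallelic_deletion: bool
    smn2_copies: int


def zolgensma_unmet_criteria(patient: SmaPatient) -> list:
    """Return the table's key criteria that the patient does not meet."""
    unmet = []
    if not patient.age_years < 2:
        unmet.append("Age < 2y")
    if not patient.smn1_biallelic_deletion:
        unmet.append("SMN1 biallelic deletion")
    if not patient.smn2_copies < 3:
        unmet.append("< 3 copies SMN2")
    return unmet
```

Returning the list of unmet criteria, instead of a bare boolean, lets the caller explain why a patient was screened out.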

4.4 Family Segregation Analyzer

Evaluates variant co-segregation with disease phenotype across family pedigrees, supporting autosomal dominant, autosomal recessive, X-linked, and mitochondrial inheritance patterns.
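
As an illustration for the autosomal dominant case only, a segregation check can flag affected relatives who lack the variant (evidence against co-segregation) and unaffected carriers (possible reduced penetrance). This is a simplified sketch with hypothetical names. It does not model the other inheritance patterns and does not compute LOD scores.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    affected: bool
    carrier: bool  # carries the variant under evaluation


def segregates_dominant(pedigree):
    """Summarize co-segregation for a dominant model across genotyped relatives."""
    affected_noncarriers = [m.name for m in pedigree if m.affected and not m.carrier]
    unaffected_carriers = [m.name for m in pedigree if not m.affected and m.carrier]
    return {
        # Consistent only if every affected relative carries the variant.
        "consistent": not affected_noncarriers,
        "affected_noncarriers": affected_noncarriers,
        "unaffected_carriers": unaffected_carriers,
    }
```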

4.5 Natural History Engine

Provides disease-specific milestone timelines, age-of-onset distributions, progression rates, and life expectancy data for counseling and clinical trial eligibility assessment.

4.6 Diagnostic Algorithm Engine

Implements 9 systematic diagnostic approaches:

| Algorithm | Trigger | Steps |
|---|---|---|
| Metabolic screen | Neonatal/infantile presentation | NBS → plasma amino acids → urine organic acids → acylcarnitine → enzyme assay |
| Neurogenetic panel | Progressive neurodegeneration | Gene panel → WES → WGS → RNA-seq |
| Connective tissue | Marfanoid habitus, hypermobility | Ghent criteria → FBN1/COL3A1/TGFBR → echo + vascular imaging |
| Skeletal dysplasia | Short stature, bone abnormalities | Skeletal survey → FGFR3/COL2A1 → WES |
| Immunodeficiency | Recurrent infections | Flow cytometry → TREC/KREC → gene panel |
| Mitochondrial | Multi-system, lactic acidosis | Lactate → muscle biopsy → mtDNA → nuclear mito genes |
| Lysosomal storage | Organomegaly, regression | Enzyme assay → biomarkers → gene confirmation |
| RASopathy | Facial gestalt, cardiac, growth | Noonan panel (PTPN11, RAF1, RIT1, SOS1, KRAS) |
| Ciliopathy | Renal cysts, retinal dystrophy | Renal US → ophtho → ciliopathy gene panel |
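
Each algorithm in the table is an ordered test cascade, so one simple representation is a list of steps plus a helper that returns the first step not yet completed. The sketch below encodes three of the nine algorithms. The dictionary keys and the `next_step` function are illustrative, not the engine's real interface.

```python
# Ordered steps copied from the table for three of the nine algorithms.
ALGORITHMS = {
    "metabolic_screen": [
        "NBS", "plasma amino acids", "urine organic acids",
        "acylcarnitine", "enzyme assay",
    ],
    "neurogenetic_panel": ["gene panel", "WES", "WGS", "RNA-seq"],
    "mitochondrial": ["lactate", "muscle biopsy", "mtDNA", "nuclear mito genes"],
}


def next_step(algorithm: str, completed: set):
    """Return the first step in the cascade not yet completed, or None if done."""
    for step in ALGORITHMS[algorithm]:
        if step not in completed:
            return step
    return None
```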

5. Clinical Workflows

| # | Workflow | Clinical Question |
|---|---|---|
| 1 | Phenotype-Driven Differential | "Generate a ranked differential from these HPO terms" |
| 2 | Variant Interpretation | "Classify this variant using ACMG criteria" |
| 3 | Gene Therapy Eligibility | "Is this patient eligible for any available gene therapy?" |
| 4 | Diagnostic Algorithm | "What systematic workup is appropriate for this presentation?" |
| 5 | Family Segregation | "Does this variant co-segregate with disease in the family?" |
| 6 | Natural History | "What is the expected disease course and milestones?" |
| 7 | Treatment Options | "What therapeutic options exist for this diagnosis?" |
| 8 | Newborn Screening | "Interpret this NBS result in context of clinical findings" |
| 9 | Undiagnosed Disease | "This patient has been undiagnosed for 5 years; next steps?" |
| 10 | Genetic Counseling | "Recurrence risk and reproductive options for this family" |
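
The system diagram labels the workflow engine "Template Method". A minimal sketch of that pattern follows: a base class fixes the preprocess, execute, postprocess sequence and each workflow overrides only `execute`. The class names, hook names, and the placeholder step list are hypothetical and do not reflect the contents of clinical_workflows.py.

```python
from abc import ABC, abstractmethod


class BaseWorkflow(ABC):
    """Template Method: run() fixes the step order; subclasses fill in execute()."""

    def run(self, query: str) -> dict:
        inputs = self.preprocess(query)
        findings = self.execute(inputs)
        return self.postprocess(findings)

    def preprocess(self, query: str) -> dict:
        return {"query": query.strip()}

    @abstractmethod
    def execute(self, inputs: dict) -> dict:
        """Workflow-specific logic."""

    def postprocess(self, findings: dict) -> dict:
        return {"workflow": type(self).__name__, **findings}


class GeneticCounselingWorkflow(BaseWorkflow):
    def execute(self, inputs: dict) -> dict:
        # Placeholder steps for illustration only.
        return {"question": inputs["query"], "steps": ["pedigree review", "recurrence risk"]}
```

Keeping the sequence in the base class means all ten workflows share input cleanup and output shaping, while each subclass stays focused on its clinical question.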

6. Multi-Collection RAG Engine

6.1 Search Flow

User Query: "6yo with progressive ataxia, nystagmus, elevated AFP — differential?"
    
    ├── 1. HPO Matcher: Phenotype resolution                     [< 50 ms]
    │       "progressive ataxia" → HP:0002073
    │       "nystagmus" → HP:0000639
    │       "elevated AFP" → HP:0006254
    │
    ├── 2. HPO disease matching: IC-weighted overlap              [< 100 ms]
    │       #1 Ataxia-telangiectasia (ATM), score 0.92
    │       #2 Ataxia with oculomotor apraxia type 2, score 0.78
    │       #3 Friedreich ataxia, score 0.65
    │
    ├── 3. Embed query with BGE asymmetric prefix                 [< 5 ms]
    │
    ├── 4. Parallel search across 14 collections (top-5 each)    [12-18 ms]
    │   ├── rd_phenotypes:    Ataxia + nystagmus + AFP HPO data  (score: 0.85-0.92)
    │   ├── rd_diseases:      AT, AOA2, FA disease profiles      (score: 0.82-0.90)
    │   ├── rd_genes:         ATM, SETX, FXN gene annotations    (score: 0.78-0.86)
    │   ├── rd_case_reports:  Pediatric ataxia case series       (score: 0.75-0.84)
    │   └── rd_therapies:     AT management, gene therapy trials (score: 0.70-0.78)
    │
    ├── 5. ACMG classification (if variant provided)              [< 20 ms]
    │
    ├── 6. Gene therapy eligibility check                         [< 10 ms]
    │
    └── 7. Stream Claude Sonnet 4.6 response                     [~22-26 sec]
           Ranked differential with supporting evidence,
           recommended genetic testing (ATM sequencing),
           and natural history context

Total: ~26 sec (HPO matching ~150 ms; retrieval ~25 ms; LLM ~25 sec)
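
Steps 3 and 4 of the flow can be sketched as follows. The prefix is the query instruction published for the English BGE v1.5 models. The parallel fan-out uses a thread pool, and `search_fn` is a caller-supplied stand-in for the real Milvus search call, which this document does not show. Both function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Query-side instruction for BGE English v1.5 models; passages are embedded without it.
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_query(text: str) -> str:
    """Apply the asymmetric query prefix before embedding."""
    return BGE_QUERY_PREFIX + text.strip()


def search_all(collections, search_fn, query, top_k=5):
    """Run search_fn(collection, query, top_k) for every collection in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(collections))) as pool:
        futures = {
            name: pool.submit(search_fn, name, query, top_k)
            for name in collections
        }
        # Collect results per collection; any search exception is re-raised here.
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this workload because each search is an I/O-bound network call, so the total latency approaches that of the slowest single collection.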


7. Performance Benchmarks

Measured on NVIDIA DGX Spark (GB10 GPU, 128GB unified LPDDR5x memory, 20 ARM cores).

7.1 Decision Engine Performance

| Engine | Latency |
|---|---|
| HPO Matcher (phenotype resolution + disease matching) | <150 ms |
| ACMG Classifier (23 criteria evaluation) | <20 ms |
| Gene Therapy Eligibility (12 therapies) | <10 ms |
| Family Segregation (pedigree analysis) | <50 ms |
| Natural History (milestone lookup) | <10 ms |
| Diagnostic Algorithm (workflow selection) | <10 ms |
| All 6 engines combined | <250 ms |

7.2 RAG Query Performance

| Operation | Latency |
|---|---|
| Full query (retrieve + engines + Claude generate) | ~26 sec |
| Streaming query (time to first token) | ~3 sec |
| 14-collection parallel search | 12-18 ms |
| Response length | 1000-2500 chars |

8. Infrastructure

8.1 Technology Stack

| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| Vector DB | Milvus 2.4, localhost:19530 |
| Embeddings | BGE-small-en-v1.5 (BAAI), 384-dim |
| LLM | Claude Sonnet 4.6 (Anthropic API) |
| Web UI | Streamlit (port 8544, NVIDIA black/green theme) |
| REST API | FastAPI + Uvicorn (port 8134) |
| Configuration | Pydantic BaseSettings |
| Testing | pytest (193 tests) |
| Hardware target | NVIDIA DGX Spark (GB10 GPU, 128GB unified, $4,699) |

8.2 Service Ports

| Port | Service |
|---|---|
| 8134 | FastAPI REST API |
| 8544 | Streamlit Chat UI |
| 19530 | Milvus vector database (shared) |

8.3 Dependencies on HCLS AI Factory

| Dependency | Usage |
|---|---|
| Milvus 2.4 instance | Shared vector database; adds 13 owned collections alongside existing genomic_evidence (3.56M vectors, read-only) |
| ANTHROPIC_API_KEY | Shared Anthropic API key |
| BGE-small-en-v1.5 | Same embedding model as main RAG pipeline |

9. Disease Categories

9.1 Coverage by Category

| # | Category | Diseases | Key Genes | Example Conditions |
|---|---|---|---|---|
| 1 | Neurogenetic | 12 | ATM, FXN, SMN1, DMD, HTT | Ataxia-telangiectasia, Friedreich ataxia, SMA, Huntington |
| 2 | Metabolic / IEM | 11 | PAH, GAA, GLA, HEXA, IDUA | PKU, Pompe, Fabry, Tay-Sachs, Hurler |
| 3 | Immunodeficiency | 8 | BTK, RAG1, RAG2, IL2RG, ADA | X-SCID, ADA-SCID, XLA, CGD |
| 4 | Hematologic | 7 | HBB, F8, F9, SERPINC1 | Sickle cell, hemophilia A/B, thalassemia |
| 5 | Connective tissue | 7 | FBN1, COL3A1, COL1A1, TGFBR | Marfan, vEDS, OI, Loeys-Dietz |
| 6 | Skeletal dysplasia | 6 | FGFR3, COL2A1, COMP | Achondroplasia, SED, pseudoachondroplasia |
| 7 | Neuromuscular | 6 | DMD, SMN1, DMPK, CNBP | Duchenne, SMA, myotonic dystrophy |
| 8 | Retinal dystrophy | 5 | RPE65, ABCA4, USH2A, RHO | LCA, Stargardt, retinitis pigmentosa |
| 9 | RASopathy | 5 | PTPN11, RAF1, BRAF, HRAS | Noonan, cardiofaciocutaneous, Costello |
| 10 | Ciliopathy | 5 | PKD1, PKD2, BBS1, NPHP1 | ADPKD, Bardet-Biedl, nephronophthisis |
| 11 | Mitochondrial | 6 | MT-ND, POLG, SURF1, MT-TL1 | MELAS, Leigh syndrome, CPEO |
| 12 | Lysosomal storage | 6 | GLA, GAA, IDUA, GBA, HEXA | Fabry, Pompe, Gaucher, MPS I |
| 13 | Other | 4 | CFTR, DNAAF, FMR1, TSC1/2 | CF, PCD, Fragile X, TSC |

10. Demo Scenarios

10.1 Validated Demo Queries

1. "6-year-old with progressive cerebellar ataxia, oculomotor apraxia, and elevated AFP"
  • HPO Matcher: HP:0002073 + HP:0000657 + HP:0006254
  • Differential: #1 Ataxia-telangiectasia (ATM), #2 AOA2 (SETX), #3 AOA1 (APTX)
  • Testing: ATM gene sequencing, immunoglobulin levels, lymphocyte subsets

2. "Classify this variant: NM_000051.4(ATM):c.8977C>T (p.Arg2993Ter)"
  • ACMG: PVS1 (null variant in LOF gene) + PM2 (absent from controls) + PP3 (in silico pathogenic)
  • Classification: Pathogenic
  • Disease association: Ataxia-telangiectasia

3. "Is this SMA Type 1 infant eligible for Zolgensma?"
  • Gene Therapy Screener: SMN1 biallelic deletion confirmed, age < 2 years, SMN2 copy number
  • Eligibility: Check anti-AAV9 antibody titer, liver function, platelet count
  • Timeline: Treatment window is critical; earlier administration correlates with better outcomes

4. "Progressive hepatosplenomegaly and pancytopenia in Ashkenazi Jewish child"
  • Diagnostic Algorithm: Lysosomal storage screen
  • Differential: Gaucher disease (GBA), Niemann-Pick type B (SMPD1)
  • Testing: Glucocerebrosidase enzyme activity, chitotriosidase, GBA sequencing

5. "Family with autosomal dominant polycystic kidney disease: genetic counseling"
  • Family Analyzer: AD inheritance, 50% recurrence risk per child
  • Genes: PKD1 (85%), PKD2 (15%)
  • Natural History: Renal cysts by age 30, ESRD by age 55 (PKD1) vs. 75 (PKD2)
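
The 50% per-child figure in the ADPKD scenario extends to families with several children by simple arithmetic. Assuming independent transmissions and full penetrance, the chance that at least one of n children inherits the variant is 1 - (1 - r)^n. The helper below is an illustrative calculation, not part of the agent.

```python
def prob_at_least_one_affected(n_children: int, per_child_risk: float = 0.5) -> float:
    """P(at least one child inherits the variant), assuming independent transmissions."""
    return 1 - (1 - per_child_risk) ** n_children
```

For two children this gives 0.75 and for three it gives 0.875, while the risk for each individual child remains 50%.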


11. File Structure (Actual)

rare_disease_diagnostic_agent/
├── src/
│   ├── agent.py                     # RareDiseaseAgent orchestrator
│   ├── models.py                    # Pydantic data models + enums
│   ├── collections.py               # 14 Milvus collection schemas
│   ├── rag_engine.py                # Multi-collection RAG engine
│   ├── clinical_workflows.py        # 10 diagnostic workflows
│   ├── decision_support.py          # 6 decision engines
│   ├── hpo_matcher.py               # HPO phenotype matching + IC scoring
│   ├── acmg_classifier.py           # 23-criteria ACMG variant classification
│   ├── gene_therapy.py              # Gene therapy eligibility screening
│   ├── knowledge.py                 # Disease taxonomy, gene catalog, therapy database
│   ├── query_expansion.py           # 120+ entity aliases
│   ├── cross_modal.py               # Cross-agent integration
│   ├── metrics.py                   # Prometheus metrics
│   └── export.py                    # Report generation
├── app/
│   └── rare_disease_ui.py           # Streamlit (5 tabs, NVIDIA theme)
├── api/
│   └── main.py                      # FastAPI REST server (20 endpoints)
├── config/
│   └── settings.py                  # Pydantic BaseSettings
├── data/
│   └── reference/                   # Disease catalogs, HPO mappings, therapy records
├── tests/                           # 193 tests
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md

48 Python files | ~21,935 lines of code | Apache 2.0


12. Implementation Status

| Phase | Status | Details |
|---|---|---|
| Phase 1: Architecture | Complete | 14 collections, 6 decision engines, 10 workflows, knowledge base, HPO matcher, ACMG classifier |
| Phase 2: Data | Complete | 88+ diseases, 45+ genes, 12 gene therapies, 23 ACMG criteria, 13 disease categories |
| Phase 3: RAG Integration | Complete | Multi-collection parallel search, Claude Sonnet 4.6 streaming |
| Phase 4: Testing | Complete | 193 tests, 100% pass, 0.16s runtime |
| Phase 5: UI + Demo | Complete | 5-tab Streamlit UI, NVIDIA theme, 5 demo scenarios validated |

Remaining Work

| Item | Priority | Effort |
|---|---|---|
| OMIM API integration for real-time disease data | Medium | 2-3 days |
| Matchmaker Exchange (MME) API integration | Medium | 1 week |
| Expanded newborn screening panel (RUSP 2026 update) | Low | 1-2 days |
| Integration with HCLS AI Factory landing page | Low | 1 hour |

13. Relationship to HCLS AI Factory

The Rare Disease Diagnostic Agent demonstrates the diagnostic extension of the HCLS AI Factory architecture. While the core platform focuses on common cancer genomics, rare disease requires a fundamentally different approach: phenotype-driven diagnosis, ACMG variant classification, and gene therapy eligibility screening.

  • Same Milvus instance — 13 new owned collections alongside existing genomic_evidence (3.56M vectors, read-only)
  • Same embedding model — BGE-small-en-v1.5 (384-dim)
  • Same LLM — Claude via Anthropic API
  • Same hardware — NVIDIA DGX Spark ($4,699)
  • Same patterns — Pydantic models, BaseIngestPipeline, knowledge graph, query expansion

The critical integration: Stage 1 (genomics pipeline) produces VCF files from whole-genome or exome sequencing; this agent's ACMG classifier interprets those variants in the context of rare disease phenotypes, connecting molecular findings to diagnostic conclusions and therapeutic options including gene therapy.
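
The hand-off from the genomics pipeline can be pictured as parsing VCF data lines into variant records that the classifier then evaluates. The sketch below reads the eight fixed VCF columns using only the standard library. The `Variant` type and `parse_vcf_line` are hypothetical names, and a production pipeline would more likely use a dedicated VCF library.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    info: dict = field(default_factory=dict)


def parse_vcf_line(line: str) -> list:
    """Parse one tab-separated VCF data line into one Variant per ALT allele."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    info_map = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            # Flag-style INFO keys (no "=") are stored as True.
            info_map[key] = value if value else True
    return [Variant(chrom, int(pos), ref, allele, info_map) for allele in alt.split(",")]
```

Splitting multi-allelic records into one `Variant` per ALT allele matters because ACMG classification applies to each allele separately.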


14. Credits

  • Adam Jones
  • Apache 2.0 License

Clinical Decision Support Disclaimer

The Rare Disease Diagnostic Agent is a clinical decision support research tool for rare disease evaluation. It is not FDA-cleared and is not intended as a standalone diagnostic device. Variant classifications should be reviewed by certified molecular geneticists. Gene therapy eligibility determinations require specialist evaluation. All recommendations should be reviewed by qualified clinical geneticists and genetic counselors. Apache 2.0 License.