From Phenotype to Diagnosis: A Multi-Collection RAG Architecture for Rare Disease Diagnostic Intelligence
Author: Adam Jones | Date: March 2026 | Version: 0.1.0 (Pre-Implementation) | License: Apache 2.0
Part of the HCLS AI Factory -- an end-to-end precision medicine platform. https://github.com/ajones1923/hcls-ai-factory
Abstract
Rare diseases collectively affect over 300 million people worldwide -- approximately 1 in 17 individuals -- yet the average patient endures a 5-7 year "diagnostic odyssey" involving 7+ specialists, 2-3 misdiagnoses, and catastrophic financial and psychological burden before receiving a correct diagnosis. Among the estimated 7,000-10,000 known rare diseases, approximately 80% have a genetic origin, 50% of patients are children, and 30% of affected children die before age 5. Despite this enormous disease burden, 95% of rare diseases have no FDA-approved treatment, only 5% have been studied in clinical trials, and the critical knowledge needed for diagnosis remains fragmented across OMIM, Orphanet, GARD, ClinVar, HPO, GeneReviews, and thousands of individual publications -- an information desert that no single clinician can navigate.
This paper presents the Rare Disease Diagnostic Agent, an AI-powered clinical decision support system built on the HCLS AI Factory's multi-collection Retrieval-Augmented Generation (RAG) architecture. Named "Diagnostic" rather than "Intelligence" because diagnosis IS the primary clinical value proposition in rare disease, the agent integrates 14 specialized Milvus vector collections -- rd_phenotypes (HPO-coded phenotype database), rd_diseases (OMIM/Orphanet disease catalog), rd_genes (known disease-gene associations), rd_variants (pathogenicity database from ClinVar and gnomAD), rd_literature (PubMed rare disease publications), rd_trials (rare disease clinical trials), rd_therapies (orphan drugs, gene therapies, enzyme replacements), rd_case_reports (published diagnostic cases), rd_guidelines (diagnostic algorithms, ACMG criteria), rd_pathways (metabolic pathways, molecular mechanisms), rd_registries (patient registry data), rd_natural_history (disease progression data), rd_newborn_screening (expanded screening panels), and the shared genomic_evidence collection (3.56 million variant vectors).
Through 10 clinical workflows -- phenotype-driven diagnostic workup, whole exome/genome interpretation, metabolic disease screening, dysmorphology assessment, neurogenetic evaluation, cardiac genetics, connective tissue disorders, inborn errors of metabolism, gene therapy eligibility assessment, and undiagnosed disease program support -- the agent transforms fragmented clinical observations into ranked diagnostic hypotheses with evidence-graded confidence. Six clinical decision support engines -- Phenotype-to-Gene Matcher, ACMG Variant Classifier, Orphan Drug Matcher, Diagnostic Algorithm Recommender, Family Segregation Analyzer, and Natural History Predictor -- provide computational reasoning across the diagnostic-to-therapeutic continuum.
Deployed on the NVIDIA DGX Spark ($3,999) at ports 8544 (UI) and 8134 (API), the agent operates entirely on-premises for HIPAA compliance, processes whole-exome and whole-genome sequencing data through the existing genomics pipeline, and generates structured diagnostic reports in PDF and FHIR R4 formats. By reducing diagnostic odyssey timelines from years to weeks and connecting patients with emerging gene therapies (nusinersen, onasemnogene, Casgevy, Luxturna, Hemgenix) and active clinical trials, this agent addresses one of medicine's most consequential unmet needs: ensuring that no patient remains undiagnosed simply because their disease is rare.
Table of Contents
- Introduction
- The Diagnostic Odyssey Crisis
- Clinical Landscape and Market Analysis
- Existing HCLS AI Factory Architecture
- Rare Disease Diagnostic Agent Architecture
- Clinical Document and Genomic Ingestion Pipeline
- HPO (Human Phenotype Ontology) Integration
- Clinical Workflows
- Cross-Modal Integration and Genomic Correlation
- NIM Integration Strategy
- Knowledge Graph Design
- Query Expansion and Retrieval Strategy
- API and UI Design
- Clinical Decision Support Engines
- Reporting and Interoperability
- Product Requirements Document
- Data Acquisition Strategy
- Validation and Testing Strategy
- Regulatory Considerations
- DGX Compute Progression
- Implementation Roadmap
- Risk Analysis
- Competitive Landscape
- Discussion
- Conclusion
- References
1. Introduction
1.1 The Scale of Rare Disease
A 4-year-old girl in rural Tennessee has been hospitalized 11 times in two years. Her symptoms -- episodic muscle weakness, recurrent vomiting, developmental regression after febrile illness -- have generated referrals to pediatric neurology, gastroenterology, genetics, and metabolic disease, yielding diagnoses of "cyclic vomiting syndrome," "conversion disorder," and "failure to thrive, unspecified." Her parents have driven over 6,000 miles to five different children's hospitals. Her medical record spans 847 pages across 14 providers in 3 electronic health record systems. The answer -- medium-chain acyl-CoA dehydrogenase (MCAD) deficiency, a treatable fatty acid oxidation disorder detectable on newborn screening -- was missed because she was born in a state that did not yet include MCAD on its screening panel.
This case, drawn from composites of real diagnostic odysseys reported in medical literature, illustrates a systemic crisis affecting hundreds of millions of people worldwide. Despite the name "rare," rare diseases are collectively common -- staggeringly so:
| Metric | Value | Source |
|---|---|---|
| Known rare diseases | 7,000-10,000 | GARD/Orphanet |
| Global patients affected | 300+ million (1 in 17 people) | Rare Diseases International |
| US patients affected | 25-30 million | NCATS |
| Genetic origin | ~80% | Nguengang Wakap et al. 2020 |
| Pediatric onset | ~50% | Ferreira 2019 |
| Childhood mortality (< 5 years) | ~30% | NORD |
| Diseases with FDA-approved treatment | < 5% (~600) | FDA Orphan Drug Act data |
| Diseases studied in clinical trials | ~5% | Global Genes |
| Average time to diagnosis | 5-7 years | EURORDIS |
| Average specialists consulted | 7.3 | Rare Disease UK |
| Average misdiagnoses before correct dx | 2.6 | Shire 2013 |
| Economic burden (US, annual) | $966 billion | EveryLife Foundation |
| Out-of-pocket cost per family | $50,000+ | NORD survey data |
The paradox is stark: while any single rare disease affects fewer than 200,000 people in the US (the statutory definition under the Orphan Drug Act), the aggregate US population of 25-30 million is on the same order as that of diabetes (37.3 million) and exceeds that of cancer (18.1 million). Yet research funding, clinical infrastructure, and diagnostic tooling remain fragmented across thousands of individual conditions, each with its own advocacy organization, clinical registry, and expert community.
1.2 The Information Desert
The information needed to diagnose most rare diseases exists -- scattered across at least 16 distinct data ecosystems:
- OMIM (Online Mendelian Inheritance in Man): 7,000+ disease entries with gene associations
- Orphanet: 6,000+ disease profiles with prevalence, inheritance, clinical features
- GARD (Genetic and Rare Diseases Information Center, NIH): 7,000+ conditions with patient-facing information
- ClinVar: 2.4 million+ variant submissions with pathogenicity classifications
- HPO (Human Phenotype Ontology): 18,000+ standardized phenotypic abnormality terms
- GeneReviews: 850+ expert-authored disease summaries with diagnostic criteria
- PubMed: 36 million+ articles, ~12,000 rare disease-tagged publications per year
- ClinicalTrials.gov: ~5,800 active rare disease trials
- gnomAD: 807,000 exomes/genomes with population variant frequencies
- ClinGen: 2,000+ gene-disease validity assessments
- HGMD: 300,000+ disease-causing mutations (commercial)
- PanelApp: 300+ curated gene panels
- KEGG/Reactome: Metabolic and signaling pathway databases
- Patient registries: 800+ disease-specific registries worldwide
- Newborn screening panels: 37 core + 26 secondary RUSP conditions, varying by state
- Gene therapy pipeline: 12+ approved therapies, 1,400+ investigational programs
No clinician can maintain awareness of this landscape. No existing system integrates across these sources. The result is an information desert -- not because knowledge is absent, but because it is inaccessible at the point of care.
1.3 Why AI Is Uniquely Suited for Rare Disease Diagnosis
Rare disease diagnosis represents perhaps the single most compelling use case for clinical AI, for reasons that directly mirror AI's core strengths:
Pattern recognition across ultra-rare phenotypes: A pediatrician may encounter a child with Angelman syndrome once in an entire career. An AI system trained on 7,000+ disease phenotype profiles can recognize the characteristic pattern -- happy demeanor, hand-flapping, seizures, microcephaly, absent speech -- instantaneously, regardless of how many cases it has "seen."
Exhaustive differential generation: Where a human clinician generates 2-3 diagnostic hypotheses based on a mental library of ~500 conditions, an AI system can simultaneously evaluate a patient's phenotype profile against all 7,000+ known rare diseases, identifying matches that would require decades of subspecialty experience to recognize.
Longitudinal synthesis across fragmented records: The average rare disease patient generates records across 7.3 providers in 2.4 health systems using 1.8 different EHR platforms. AI can ingest, normalize, and cross-correlate years of clinical data that no individual clinician has time to review.
Knowledge currency: With 250-300 new gene-disease associations published annually, ~50,000 new ClinVar variant submissions per year, and thousands of VUS reclassifications, only an automated system can ensure that diagnostic knowledge reflects the current state of science.
Equity amplification: There are approximately 1,200 board-certified clinical geneticists in the US serving 30 million rare disease patients -- a ratio of 1:25,000. The average wait for genetics consultation is 6-18 months, and 40% of US counties have no genetics provider within 50 miles. AI can bring diagnostic capability to settings where genetic expertise is absent.
1.4 Our Contribution
This paper presents the complete architectural blueprint and product requirements for the Rare Disease Diagnostic Agent, the seventh domain-specific agent in the HCLS AI Factory platform. The agent is named "Diagnostic" rather than "Intelligence" because in the rare disease domain, diagnosis IS the primary clinical intervention -- for a patient who has waited 5-7 years without answers, a correct diagnosis is itself transformative, even before treatment begins. Our contributions include:
- A 14-collection Milvus vector schema designed specifically for rare disease knowledge retrieval, spanning phenotypes, diseases, genes, variants, literature, trials, therapies, case reports, guidelines, pathways, registries, natural history, and newborn screening
- Ten reference clinical workflows covering phenotype-driven diagnosis, WES/WGS interpretation, metabolic screening, dysmorphology assessment, neurogenetic evaluation, cardiac genetics, connective tissue disorders, inborn errors of metabolism, gene therapy eligibility, and undiagnosed disease program support
- Six clinical decision support engines implementing phenotype-to-gene matching, ACMG variant classification, orphan drug matching, diagnostic algorithm recommendation, family segregation analysis, and natural history prediction
- Deep HPO integration enabling computational phenotype matching with semantic similarity scoring
- Deployment on a single NVIDIA DGX Spark ($3,999) at ports 8544 (UI) and 8134 (API), maintaining the platform's commitment to accessible AI
- Open-source licensing (Apache 2.0), enabling deployment by academic institutions, patient advocacy organizations, and resource-limited settings worldwide
2. The Diagnostic Odyssey Crisis
2.1 Why Diagnosis Takes 5-7 Years
The diagnostic odyssey is not primarily a failure of medical knowledge -- the information needed to diagnose most rare diseases exists in published literature, genetic databases, and expert clinical experience. It is a failure of information retrieval and pattern synthesis at the point of care. Understanding the anatomy of this failure is essential to designing a system that addresses it.
Stage 1: Initial Presentation (Months 0-6). The patient presents to a primary care physician with symptoms that, individually, appear common: fatigue, developmental delay, recurrent infections, feeding difficulties, joint hypermobility, or seizures. The physician applies the appropriate heuristic -- "when you hear hoofbeats, think horses, not zebras" -- and pursues common diagnoses first. This is rational medicine for the 98.3% of patients who do not have a rare disease. For the 1.7% who do, it begins a cascade of delays.
Stage 2: Subspecialty Referral Cascade (Months 6-24). When initial workup is unrevealing, the patient enters the referral cascade. Each subspecialist evaluates the patient through the lens of their own domain: the neurologist considers neurological conditions, the rheumatologist considers autoimmune diseases, the gastroenterologist considers GI disorders. No single specialist synthesizes findings across domains. Each generates organ-specific diagnoses that may be accurate descriptions of symptoms (e.g., "seizure disorder," "failure to thrive," "hepatomegaly") but miss the unifying diagnosis (e.g., Niemann-Pick disease type C, which manifests across all three systems).
Stage 3: Misdiagnosis and Misdirected Treatment (Months 12-48). The average rare disease patient receives 2.6 incorrect diagnoses before the correct one. Each misdiagnosis triggers treatment for the wrong condition -- immunosuppressants for suspected autoimmune disease when the patient has a primary immunodeficiency, antiepileptic drugs for seizures caused by a metabolic disorder requiring dietary intervention, psychiatric medications for "behavioral symptoms" that are actually manifestations of a neurogenetic condition. These misdirected treatments waste resources, cause iatrogenic harm, and further delay correct diagnosis.
Stage 4: Diagnostic Plateau (Months 24-60). After multiple subspecialty evaluations, inconclusive testing, and failed treatments, the diagnostic workup stalls. The patient is labeled with a non-specific diagnosis -- "undifferentiated connective tissue disease," "unspecified neurodevelopmental disorder," "idiopathic cardiomyopathy" -- and managed symptomatically. The urgency of the diagnostic quest fades as clinical attention shifts to symptom management. New clinical findings that emerge over time, which would refine the differential, are attributed to the existing non-specific diagnosis rather than triggering diagnostic reconsideration.
Stage 5: Eventual Diagnosis (Months 36-84+). Diagnosis ultimately occurs through one of four mechanisms: (1) a specialist with specific rare disease expertise encounters the case, (2) exome/genome sequencing identifies a pathogenic variant, (3) the patient or family conducts their own research and requests specific testing, or (4) the condition progresses to a stage where the diagnosis becomes clinically obvious but treatment opportunities have been missed.
2.2 Phenotypic Overlap Between Rare Diseases
A fundamental challenge in rare disease diagnosis is that many conditions share overlapping clinical features. The same constellation of developmental delay, seizures, and hypotonia can result from hundreds of different genetic conditions. This phenotypic overlap creates a combinatorial explosion that overwhelms human pattern recognition:
| Phenotype Combination | Number of Possible Rare Diseases |
|---|---|
| Seizures + Intellectual disability | 800+ |
| Cardiomyopathy + Skeletal myopathy | 200+ |
| Progressive ataxia + Peripheral neuropathy | 150+ |
| Hepatosplenomegaly + Developmental delay | 250+ |
| Short stature + Skeletal anomalies | 400+ |
| Recurrent infections + Failure to thrive | 300+ |
As each additional phenotype is added, the differential narrows -- but only if the diagnostician is aware of all candidate conditions. For diseases affecting fewer than 1 in 100,000 people, the probability that any individual clinician has encountered a case approaches zero.
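The narrowing effect, and its dependence on the completeness of the candidate list, can be illustrated with a toy set-intersection sketch. The disease-to-phenotype map below is a hand-written, heavily simplified subset chosen for illustration, not data drawn from HPO or Orphanet, and the function name `candidates` is an assumption.

```python
from __future__ import annotations

# Hypothetical, simplified disease -> phenotype annotations (illustration only).
DISEASE_PHENOTYPES = {
    "Niemann-Pick disease type C": {"seizures", "hepatosplenomegaly", "developmental delay", "ataxia"},
    "Dravet syndrome": {"seizures", "developmental delay"},
    "Gaucher disease": {"hepatosplenomegaly", "bone pain"},
    "Friedreich ataxia": {"ataxia", "cardiomyopathy"},
    "Rett syndrome": {"seizures", "developmental delay", "hand stereotypies"},
}


def candidates(observed: set[str]) -> list[str]:
    """Return diseases whose annotations include every observed phenotype."""
    return sorted(
        disease
        for disease, features in DISEASE_PHENOTYPES.items()
        if observed <= features  # subset test: all observed findings are annotated
    )


if __name__ == "__main__":
    observed: set[str] = set()
    for finding in ["seizures", "developmental delay", "hepatosplenomegaly"]:
        observed.add(finding)
        print(f"{len(observed)} finding(s): {candidates(observed)}")
```

A disease that is missing from the map can never be returned, however many findings are supplied, which is the computational analogue of a clinician who has never encountered the condition.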
2.3 The "Horses Not Zebras" Bias
Medical education systematically trains against rare disease recognition. The aphorism "when you hear hoofbeats, think horses, not zebras" -- attributed to Theodore Woodward at the University of Maryland -- has become so embedded in clinical reasoning that it functions as a cognitive bias rather than a heuristic. The bias manifests in three ways:
- Anchoring: Once a common diagnosis is considered, it anchors subsequent reasoning even when evidence accumulates against it
- Premature closure: The diagnostic process terminates when a "good enough" common diagnosis is reached, without considering that an uncommon diagnosis might better explain the full clinical picture
- Attribution bias: New symptoms in a patient with an existing diagnosis are attributed to the known condition rather than triggering reconsideration of the underlying diagnosis
For rare disease patients, this bias creates a systematic disadvantage: the rarer the condition, the less likely any individual clinician is to consider it, regardless of how well the clinical features match.
2.4 Geographic Disparities
Rare disease diagnostic expertise is concentrated in fewer than 50 academic medical centers globally. The geography of diagnosis creates profound inequities:
- United States: 1,200 board-certified clinical geneticists serving 30 million rare disease patients (1:25,000 ratio)
- Wait times: Average 6-18 months for genetics consultation in the US; 2+ years in many developing nations
- Rural access: 40% of US counties have no genetics provider within 50 miles
- Global disparities: Sub-Saharan Africa has fewer than 50 clinical geneticists serving a population of 1.2 billion
- Center concentration: ~70% of rare disease diagnoses made by exome/genome sequencing occur at 20 academic centers
The result is that a child born in Boston with access to Boston Children's Hospital and the Broad Institute may receive a genomic diagnosis within weeks, while a child with the identical condition born in rural Mississippi may wait years -- or never receive a diagnosis at all.
2.5 The Psychological and Financial Burden
The diagnostic odyssey exacts a devastating toll on families:
Financial impact:
- Average out-of-pocket diagnostic costs: $50,000-$100,000+ per family (NORD)
- Total economic burden of rare diseases in the US: $966 billion annually (EveryLife Foundation)
- 40% of families report significant financial hardship or bankruptcy related to the diagnostic odyssey
- Insurance coverage gaps for genetic testing, out-of-network specialists, and travel to academic centers

Psychological impact:
- 75% of rare disease caregivers report clinical depression or anxiety (NORD survey)
- "Diagnostic limbo" -- the psychological distress of watching a child deteriorate without understanding why -- is associated with post-traumatic stress symptoms
- Parent-reported guilt ("Did I cause this?") and marital strain are nearly universal
- Siblings of affected children experience secondary psychological impacts from family stress and reduced parental attention
2.6 The Undiagnosed Population
Even after the diagnostic odyssey, a substantial population remains without answers:
- Estimated 25-30 million patients worldwide have undergone extensive workup without reaching a diagnosis
- The NIH Undiagnosed Diseases Program (UDP), established in 2008, achieves a 35% diagnostic yield -- meaning 65% of the most extensively evaluated patients in the world remain undiagnosed
- The 100,000 Genomes Project (Genomics England) reports a 25% diagnostic rate for rare diseases using whole-genome sequencing
- The All of Us Research Program (NIH) is generating genomic data for 1 million+ participants, creating new opportunities for rare variant identification
These undiagnosed patients represent a population for whom current diagnostic paradigms have failed. They need a fundamentally different approach -- one that can synthesize fragmented data, apply continuously updated knowledge, and identify patterns too subtle or too rare for human recognition.
3. Clinical Landscape and Market Analysis
3.1 Rare Disease Diagnostics Market
The global rare disease diagnostics market is experiencing rapid growth driven by genomic sequencing cost reductions, newborn screening expansion, and gene therapy development:
| Segment | 2024 Value | 2030 Projected | CAGR |
|---|---|---|---|
| Rare Disease Diagnostics (Global) | $48.2B | $89.1B | 10.8% |
| Genetic Testing Services | $15.8B | $31.2B | 12.0% |
| Rare Disease AI/Decision Support | $1.2B | $5.8B | 30.1% |
| Gene Therapy Market | $7.9B | $35.7B | 28.4% |
| Orphan Drug Market | $217B | $381B | 9.8% |
| Newborn Screening (Global) | $1.8B | $3.4B | 11.2% |
Key market drivers:
- Whole-exome/genome sequencing costs below $500/$1,000 respectively, making genomic-first diagnosis economically viable
- FDA approval of 12+ gene therapies (2017-2025) creating urgency for genetic diagnosis as treatment prerequisite
- RUSP expansion (3 new conditions added 2022-2025) driving NBS infrastructure investment
- NIH Undiagnosed Diseases Program demonstrating 35% diagnostic yield with systematic multi-modal analysis
- Patient advocacy organizations (NORD, Global Genes, Rare Disease UK, Rare Diseases International) driving policy and funding
3.2 Key Disease Categories
The agent must span the full breadth of rare disease, covering thousands of conditions across major categories:
Metabolic Diseases: PKU (phenylketonuria), Gaucher disease (types I-III), Fabry disease, Pompe disease, mucopolysaccharidoses (MPS I-VII), galactosemia, maple syrup urine disease (MSUD), medium-chain acyl-CoA dehydrogenase (MCAD) deficiency, isovaleric acidemia, propionic acidemia, methylmalonic acidemia, urea cycle disorders (OTC deficiency, citrullinemia, argininosuccinic aciduria), Niemann-Pick disease (types A/B/C), Tay-Sachs disease, Krabbe disease, metachromatic leukodystrophy
Neurological Diseases: Spinal muscular atrophy (SMA types I-IV), Duchenne/Becker muscular dystrophy (DMD/BMD), Rett syndrome, Angelman syndrome, Prader-Willi syndrome, Huntington disease, Friedreich ataxia, Charcot-Marie-Tooth disease (CMT types 1-4), tuberous sclerosis, neurofibromatosis (NF1/NF2), ataxia-telangiectasia, Dravet syndrome, SCN1A-related epilepsies
Hematologic Diseases: Sickle cell disease (HbSS, HbSC, HbS-beta-thal), alpha- and beta-thalassemia, hemophilia A (Factor VIII) and B (Factor IX), von Willebrand disease, Factor V Leiden thrombophilia, hereditary spherocytosis, Diamond-Blackfan anemia, Fanconi anemia, severe congenital neutropenia
Connective Tissue Disorders: Marfan syndrome, Ehlers-Danlos syndromes (13 recognized types including classical, hypermobile, vascular), osteogenesis imperfecta (types I-IV+), Loeys-Dietz syndrome, Stickler syndrome, pseudoxanthoma elasticum, cutis laxa
Immunologic Diseases: Severe combined immunodeficiency (SCID -- T-B+NK+, T-B-NK+, T-B+NK- subtypes), chronic granulomatous disease (CGD), hyper-IgE syndrome (STAT3, DOCK8), common variable immunodeficiency (CVID), complement deficiencies (C1-C9), Wiskott-Aldrich syndrome, X-linked agammaglobulinemia
Endocrine Diseases: Congenital adrenal hyperplasia (CAH -- 21-hydroxylase, 11-beta-hydroxylase), Turner syndrome, Klinefelter syndrome, multiple endocrine neoplasia (MEN1, MEN2A/2B), congenital hypothyroidism, familial hypocalciuric hypercalcemia
Cardiac Diseases: Long QT syndrome (LQT1-LQT15), Brugada syndrome, hypertrophic cardiomyopathy (genetic -- sarcomeric), transthyretin amyloid cardiomyopathy (ATTR), catecholaminergic polymorphic ventricular tachycardia (CPVT), arrhythmogenic right ventricular cardiomyopathy (ARVC), familial dilated cardiomyopathy
Cancer Predisposition Syndromes: Li-Fraumeni syndrome (TP53), Lynch syndrome (MLH1, MSH2, MSH6, PMS2), BRCA1/BRCA2 hereditary breast and ovarian cancer, familial adenomatous polyposis (APC), hereditary diffuse gastric cancer (CDH1), retinoblastoma (RB1), multiple endocrine neoplasia (RET, MEN1), von Hippel-Lindau disease (VHL)
3.3 Target Users
| Persona | Use Case | Key Need |
|---|---|---|
| Clinical Geneticist | Systematic variant interpretation with phenotype correlation | Reduce interpretation time from 40+ hours to < 2 hours per case |
| Pediatrician / PCP | Early recognition of rare disease red flags | Pattern alerts before referral bottleneck |
| Genetic Counselor | Family cascade screening and risk communication | Automated pedigree analysis and variant segregation |
| NBS Follow-up Coordinator | Confirmatory workup for abnormal newborn screens | ACT sheet integration with genomic correlation |
| Undiagnosed Disease Program | Systematic multi-modal analysis for diagnostic-odyssey patients | Comprehensive evidence synthesis across all data modalities |
| Rare Disease Researcher | Cohort identification and genotype-phenotype correlation | Population-level analytics and natural history data |
| Gene Therapy Coordinator | Patient eligibility assessment for approved/investigational therapies | Real-time matching against therapy-specific genetic criteria |
| Patient / Family Advocate | Understanding diagnosis, prognosis, and available treatments | Accessible reports with evidence-graded recommendations |
4. Existing HCLS AI Factory Architecture
4.1 Three-Stage Pipeline
The HCLS AI Factory processes patient genomic data through three integrated stages:
Stage 1: Genomics Pipeline -- FASTQ to VCF via NVIDIA Parabricks (BWA-MEM2, DeepVariant), producing annotated variant call files in 2-4 hours on DGX Spark (vs. 24-48 hours on CPU).
Stage 2: RAG/Chat Pipeline -- VCF variants embedded into Milvus vector database with BGE-small-en-v1.5, enabling semantic search across ClinVar, AlphaMissense, and domain-specific knowledge collections. Claude AI provides natural-language interpretation.
Stage 3: Drug Discovery Pipeline -- BioNeMo MolMIM generates novel molecular candidates for identified targets; DiffDock performs binding pose prediction; RDKit computes ADMET properties.
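The hand-off between Stage 1 and Stage 2 can be sketched as follows: a VCF data line is parsed into its fixed columns and rendered as a short text string, which is the kind of input an embedding model such as BGE-small-en-v1.5 would receive. This is a minimal sketch under stated assumptions: the function names, the text template, the `GENE` INFO key, and the example variant are all hypothetical, and the paper does not specify the template the platform actually uses. No embedding or Milvus call is made here.

```python
from __future__ import annotations

# Illustrative tab-separated VCF data line; the variant and INFO values are made up.
EXAMPLE_LINE = "chr1\t12345\t.\tA\tG\t50\tPASS\tGENE=EXAMPLE1;CLNSIG=Pathogenic;DB"


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed VCF columns (CHROM..INFO) of one data line."""
    chrom, pos, vid, ref, alt, _qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_map: dict[str, object] = {}
    for item in info.split(";"):
        if item and item != ".":
            key, sep, value = item.partition("=")
            # Flag-style INFO entries (no "=") are stored as True.
            info_map[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "filter": flt,
        "info": info_map,
    }


def variant_to_text(variant: dict) -> str:
    """Render a variant as plain text for embedding (hypothetical template)."""
    info = variant["info"]
    return (
        f"{variant['chrom']}:{variant['pos']} {variant['ref']}>{','.join(variant['alt'])} "
        f"gene={info.get('GENE', 'unknown')} significance={info.get('CLNSIG', 'unknown')}"
    )


if __name__ == "__main__":
    print(variant_to_text(parse_vcf_line(EXAMPLE_LINE)))
```

Rendering variants as text keeps them in the same embedding space as the literature and guideline collections, which is what allows a single query vector to search across all of them.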
4.2 Existing Intelligence Agents
| # | Agent | Ports (UI/API) | Collections | Domain |
|---|---|---|---|---|
| 1 | Precision Biomarker | 8502/8102 | 10 + shared | Genotype-aware biomarker interpretation |
| 2 | Precision Oncology | 8503/8103 | 10 + shared | Molecular tumor board decision support |
| 3 | CAR-T Intelligence | 8504/8104 | 11 + shared | CAR-T cell therapy intelligence |
| 4 | Imaging Intelligence | 8505/8105 | 10 + shared | Medical imaging AI with NVIDIA NIM |
| 5 | Precision Autoimmune | 8506/8106 | 14 + shared | Autoimmune diagnostic odyssey analysis |
| 6 | Pharmacogenomics | 8507/8107 | 14 + shared | Drug-gene interaction and dosing |
| 7 | Rare Disease Diagnostic | 8544/8134 | 14 + shared | Diagnostic odyssey resolution |
4.3 Relationship to Existing Modules
The Rare Disease Diagnostic Agent builds on and extends several existing HCLS AI Factory capabilities:
- Genomics Pipeline: Consumes VCF output directly; extends variant annotation with rare disease-specific databases (OMIM, HGMD, LOVD) beyond the standard ClinVar/AlphaMissense annotations
- Precision Biomarker Agent: Shares the genomic_evidence collection; extends biomarker interpretation with metabolic rare disease profiles (acylcarnitines, organic acids, amino acids)
- Precision Autoimmune Agent: Shares longitudinal document analysis patterns; extends to non-autoimmune rare diseases while leveraging the same clinical document ingestion pipeline
- Cardiology Intelligence Agent: Cross-references inherited arrhythmias (Long QT, Brugada, HCM) and cardiomyopathies -- the cardiac genetics workflow (Workflow 8.6) connects directly to the Cardiology Agent for comprehensive cardiac-genomic evaluation
- Pharmacogenomics Agent: Once a rare disease is diagnosed and treatment initiated, the PGx Agent ensures medication safety, particularly important given that many rare disease patients are on complex multi-drug regimens
- Imaging Intelligence Agent: Cross-references imaging findings with rare disease phenotypes (skeletal dysplasias, neuroimaging patterns, organ-specific structural anomalies)
5. Rare Disease Diagnostic Agent Architecture
5.1 System Design
+---------------------------------------------------------------------+
| RARE DISEASE DIAGNOSTIC AGENT |
| Streamlit UI (:8544) |
+----------+----------+----------+----------+----------+--------------+
| Phenotype| Variant | Temporal | Trial | Gene Tx | Family |
| Matcher | Interp. | Pattern | Matcher | Eligib. | Cascade |
| Engine | Engine | Engine | Engine | Engine | Engine |
+----------+----------+----------+----------+----------+--------------+
| FastAPI Backend (:8134) |
| +-------------+ +--------------+ +--------------+ +------------+ |
| | HPO Extract | | ACMG Scorer | | VUS Monitor | | NBS Engine | |
| | (NLP->HPO) | | (28 criteria)| | (ClinVar d) | | (ACT->Dx) | |
| +-------------+ +--------------+ +--------------+ +------------+ |
+---------------------------------------------------------------------+
| Multi-Collection RAG Engine |
| +-----------+ +-----------+ +-----------+ +----------+ |
| |rd_diseases| |rd_phenotyp| |rd_genes | |rd_variant| |
| +-----------+ +-----------+ +-----------+ +----------+ |
| +-----------+ +-----------+ +-----------+ +----------+ |
| |rd_literatu| |rd_trials | |rd_therapi | |rd_case_rp| |
| +-----------+ +-----------+ +-----------+ +----------+ |
| +-----------+ +-----------+ +-----------+ +----------+ |
| |rd_guidelin| |rd_pathways| |rd_registri| |rd_natural| |
| +-----------+ +-----------+ +-----------+ +----------+ |
| +-----------+ +-------------------------------------------+ |
| |rd_newborn | | genomic_evidence (shared, read-only) | |
| +-----------+ +-------------------------------------------+ |
+---------------------------------------------------------------------+
| Milvus 19530 | BGE-small-en-v1.5 (384d) | Claude LLM Fallback |
+---------------------------------------------------------------------+
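The Multi-Collection RAG Engine layer in the diagram implies a fan-out pattern: one query vector is searched against several collections and the hits are merged into a single ranked list. The sketch below shows that pattern with in-memory stand-ins for the Milvus collections and 3-dimensional toy vectors instead of 384-dimensional embeddings. The function names, record identifiers, and the optional per-collection weights are assumptions for illustration; the paper does not define the merge strategy.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# In-memory stand-ins for vector collections: name -> [(record_id, vector)].
COLLECTIONS = {
    "rd_diseases": [("disease-1", [1.0, 0.0, 0.0]), ("disease-2", [0.0, 1.0, 0.0])],
    "rd_phenotypes": [("phenotype-1", [0.9, 0.1, 0.0])],
    "rd_genes": [("gene-1", [0.0, 0.0, 1.0])],
}


def search_all(
    query: list[float],
    collections: dict[str, list[tuple[str, list[float]]]],
    top_k: int = 2,
    weights: dict[str, float] | None = None,
) -> list[dict]:
    """Search every collection, keep top_k hits from each, then merge by weighted score."""
    weights = weights or {}
    hits: list[dict] = []
    for name, records in collections.items():
        scored = sorted(((cosine(query, vec), rid) for rid, vec in records), reverse=True)
        weight = weights.get(name, 1.0)  # hypothetical per-collection boost
        for score, rid in scored[:top_k]:
            hits.append({"collection": name, "id": rid, "score": score * weight})
    return sorted(hits, key=lambda h: h["score"], reverse=True)


if __name__ == "__main__":
    for hit in search_all([1.0, 0.0, 0.0], COLLECTIONS):
        print(hit)
```

Taking a per-collection top-k before merging keeps one large collection from crowding out smaller ones, which matters when a 3.56-million-vector shared collection sits alongside collections of a few thousand records.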
5.2 Naming Convention: "Diagnostic" vs. "Intelligence"
The agent is deliberately named "Rare Disease Diagnostic Agent" rather than following the "Intelligence" naming pattern used by other HCLS AI Factory agents (CAR-T Intelligence, Imaging Intelligence, etc.). This naming choice reflects a fundamental truth about rare disease medicine: diagnosis IS the primary clinical value.
For a patient with cancer, diagnosis is the beginning of a well-charted treatment journey. For a patient with a rare disease, diagnosis may be the most significant clinical event in their life:
- Diagnosis ends years of uncertainty and self-doubt ("Am I imagining this?")
- Diagnosis enables access to disease-specific management and surveillance
- Diagnosis qualifies patients for orphan drug access, clinical trials, and gene therapies
- Diagnosis connects families with disease-specific support communities and advocacy organizations
- Diagnosis enables genetic counseling and family cascade screening
- Diagnosis, even for conditions without treatment, provides psychological closure and prognostic information
The word "Diagnostic" foregrounds this clinical reality. The agent's primary mission is to shorten the diagnostic odyssey -- everything else follows from accurate, timely diagnosis.
5.3 Milvus Collection Design: 14 Collections
| # | Collection Name | Est. Records | Purpose |
|---|---|---|---|
| 1 | rd_phenotypes | 18,000 | HPO-coded phenotype database with definitions, IC scores, disease annotations |
| 2 | rd_diseases | 8,500 | OMIM/Orphanet disease catalog with inheritance, prevalence, clinical features |
| 3 | rd_genes | 12,000 | Known disease-gene associations with ClinGen evidence levels |
| 4 | rd_variants | 25,000 | Pathogenicity database -- ClinVar rare disease variants, gnomAD frequencies |
| 5 | rd_literature | 15,000 | PubMed rare disease publications, functional studies, reviews |
| 6 | rd_trials | 5,800 | Rare disease clinical trials -- often Phase I/II, with eligibility criteria |
| 7 | rd_therapies | 3,500 | Orphan drugs, gene therapies, enzyme replacement therapies, substrate reduction |
| 8 | rd_case_reports | 8,000 | Published diagnostic cases with phenotype-genotype correlations |
| 9 | rd_guidelines | 3,500 | Diagnostic algorithms, ACMG criteria, GeneReviews management protocols |
| 10 | rd_pathways | 5,500 | Metabolic pathways, molecular mechanisms, enzyme deficiency maps |
| 11 | rd_registries | 3,000 | Patient registry data, cohort demographics, prevalence estimates |
| 12 | rd_natural_history | 4,000 | Disease progression data, survival curves, milestone timelines |
| 13 | rd_newborn_screening | 1,200 | Expanded screening panels, ACT sheets, confirmatory testing protocols |
| 14 | genomic_evidence (shared) | 3,560,000 | ClinVar + AlphaMissense annotations (read-only, shared with all agents) |
Total domain-specific records: ~113,000 + 3.56M shared
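As a quick consistency check on the table, the estimated record counts can be held in a small registry and summed. This is only a sketch: the dictionary and function names are illustrative and not part of the agent's codebase; the counts are copied from the table above.

```python
# Estimated record counts copied from the collection table (names are illustrative).
DOMAIN_COLLECTIONS: dict[str, int] = {
    "rd_phenotypes": 18_000,
    "rd_diseases": 8_500,
    "rd_genes": 12_000,
    "rd_variants": 25_000,
    "rd_literature": 15_000,
    "rd_trials": 5_800,
    "rd_therapies": 3_500,
    "rd_case_reports": 8_000,
    "rd_guidelines": 3_500,
    "rd_pathways": 5_500,
    "rd_registries": 3_000,
    "rd_natural_history": 4_000,
    "rd_newborn_screening": 1_200,
}

# The shared collection is read-only and counted separately.
SHARED_COLLECTIONS: dict[str, int] = {"genomic_evidence": 3_560_000}


def total_records(collections: dict[str, int]) -> int:
    """Sum the estimated record counts of a group of collections."""
    return sum(collections.values())
```

Summing the 13 domain-specific entries reproduces the ~113,000 figure quoted above.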
5.4 Port Allocation¶
| Service | Port | Protocol |
|---|---|---|
| Streamlit UI | 8544 | HTTP |
| FastAPI API | 8134 | HTTP/REST |
| Webhook Listener (VUS alerts) | 8545 | HTTP |
| Milvus (shared) | 19530 | gRPC |
| etcd (shared) | 2379 | gRPC |
| MinIO (shared) | 9000 | HTTP |
5.5 Core Processing Modules¶
1. HPO Extraction Engine (NLP to HPO)
Transforms free-text clinical descriptions into structured HPO terms using a three-stage pipeline:
- Stage 1: Named entity recognition for clinical findings (negation-aware)
- Stage 2: Semantic similarity matching against HPO term descriptions (cosine similarity, threshold 0.82)
- Stage 3: Ontology traversal to identify implied parent/child phenotype terms
- Output: Ranked list of HPO terms with confidence scores and source document citations
2. ACMG Variant Classification Engine
Automates the 28-criteria ACMG/AMP variant classification framework:
- Pathogenic criteria: PVS1, PS1-PS4, PM1-PM6, PP1-PP5
- Benign criteria: BA1, BS1-BS4, BP1-BP7
- Evidence aggregation from ClinVar, gnomAD frequency, computational predictors (REVEL, CADD, SpliceAI), functional studies, segregation data
- Output: 5-tier classification (Pathogenic / Likely Pathogenic / VUS / Likely Benign / Benign) with criterion-by-criterion evidence
3. VUS Surveillance Monitor
Continuous monitoring for patients with reported variants of uncertain significance:
- Weekly ClinVar delta ingestion (new submissions, reclassifications)
- PubMed literature monitoring for functional studies on VUS-harboring genes
- ClinGen Variant Curation Expert Panel (VCEP) decision tracking
- Alert generation when cumulative evidence crosses classification threshold
- Retroactive patient notification pipeline with clinical summary
4. Temporal Pattern Engine
Longitudinal analysis across fragmented clinical records:
- Document ingestion from multiple EHR exports (C-CDA, PDF, FHIR)
- Timeline reconstruction with event extraction and temporal ordering
- Progressive phenotype accumulation scoring
- Episodic pattern detection (cyclical symptoms, trigger-response patterns)
- Developmental regression identification (loss of milestones in pediatric patients)
5. Matchmaker Engine
Integration with Matchmaker Exchange network for undiagnosed patients:
- Automated phenotype/genotype profile submission (with consent)
- Cross-institutional matching for patients sharing rare variants and overlapping phenotypes
- Privacy-preserving federated queries using GA4GH Beacon protocol
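To make the final step of the ACMG Variant Classification Engine (module 2) concrete, the sketch below combines a set of met criteria into the 5-tier call using the published 2015 ACMG/AMP combining rules. The function name is hypothetical, and the sketch assumes plain criterion codes such as "PM2"; strength-modified codes (for example "PM2_Supporting") and point-based scoring are not handled.

```python
from collections import Counter


def classify_acmg(criteria: list[str]) -> str:
    """Combine met ACMG/AMP criterion codes (e.g. ["PVS1", "PM2"]) into a 5-tier call."""
    # Evidence strength is encoded in the prefix: PVS, PS, PM, PP, BA, BS, BP.
    n = Counter(code.rstrip("0123456789") for code in set(criteria))
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs == 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Contradictory pathogenic and benign evidence defaults to uncertain significance.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "VUS"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely Pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely Benign"
    return "VUS"
```

For example, `classify_acmg(["PVS1", "PM2"])` yields "Likely Pathogenic", while a single moderate criterion on its own remains a VUS.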
6. Clinical Document and Genomic Ingestion Pipeline¶
6.1 Multi-Source Document Ingestion¶
The rare disease diagnostic pipeline must ingest clinical data from heterogeneous sources spanning years of a patient's diagnostic odyssey:
Input Sources Processing Pipeline Output
----- --- ------
Clinical PDFs --------+
(progress notes, labs, | +------------------+
imaging, pathology) +-------->| PDF Parser |
| (pdfplumber + |
C-CDA / HL7 Documents --------->| layout engine) |
(EHR exports, HIE) +--------+---------+
|
FHIR R4 Bundles ----------> +-----------v--------+ +----------------+
(interop feeds) | Document Normalizer | | HPO Extractor |
| (-> unified JSON) |--->| (NER + sim) |
Genetic Test Reports ------->+--------+------------+ +--------+-------+
(VCF, clinical reports, | |
panel results) | +--------v-------+
| | Phenotype |
Family History --------> +----------v---------+ | Profile Builder|
(pedigree, carrier) | Entity Extraction | +--------+-------+
| (dates, labs, meds,| |
| dx, procedures) | |
+----------+---------+ |
| |
+----------v--------------------------v---+
| Embedding Pipeline |
| BGE-small-en-v1.5 (384-dim) |
| Chunk: 512 tokens, 64 overlap |
+-------------------+---------------------+
|
+--------v--------+
| Milvus Insert |
| (14 collections)|
+-----------------+
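The chunking parameters in the embedding stage (512 tokens with a 64-token overlap) correspond to a sliding window with a stride of 448 tokens. A minimal sketch follows; it assumes the text has already been tokenized into a list, whereas the real pipeline would presumably count tokens with the BGE-small-en-v1.5 tokenizer. The function name is illustrative.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # stride between window starts (448 by default)
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

The overlap means that a finding split across a window boundary still appears intact in at least one chunk, at the cost of embedding some tokens twice.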
6.2 VCF Integration Pipeline¶
For patients with genomic sequencing data, the agent integrates directly with the HCLS AI Factory genomics pipeline:
- VCF ingestion -- Annotated VCF from Parabricks/DeepVariant pipeline
- Variant filtering -- Quality filters (QUAL >= 30, DP >= 10, GQ >= 20), population frequency filters (gnomAD AF < 0.01 for dominant, < 0.05 for recessive)
- Gene panel matching -- Variants in genes associated with patient's HPO-derived differential diagnosis
- ACMG classification -- Automated 28-criteria scoring with evidence aggregation
- Phenotype correlation -- Variant-harboring gene's disease associations compared against patient's observed phenotypes
- Inheritance pattern validation -- Zygosity check against expected inheritance (heterozygous for AD, homozygous/compound het for AR, hemizygous for XL)
- VUS flagging -- Variants classified as VUS entered into surveillance pipeline
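The filtering and inheritance-validation steps above can be sketched as two small predicates. The thresholds are the ones listed in this section; the `Variant` class, its field names, and the function names are assumptions made for illustration, not the pipeline's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    qual: float       # VCF QUAL
    dp: int           # read depth (DP)
    gq: int           # genotype quality (GQ)
    gnomad_af: float  # gnomAD allele frequency
    zygosity: str     # "het", "hom", "compound_het", or "hemi"


# Population-frequency cutoffs by assumed inheritance model (from the list above).
AF_CUTOFF = {"dominant": 0.01, "recessive": 0.05}

# Zygosity expected for each inheritance pattern (from the list above).
EXPECTED_ZYGOSITY = {"AD": {"het"}, "AR": {"hom", "compound_het"}, "XL": {"hemi"}}


def passes_filters(v: Variant, model: str) -> bool:
    """Apply the quality filters and the gnomAD frequency filter for one model."""
    quality_ok = v.qual >= 30 and v.dp >= 10 and v.gq >= 20
    return quality_ok and v.gnomad_af < AF_CUTOFF[model]


def zygosity_matches(v: Variant, inheritance: str) -> bool:
    """Check observed zygosity against the expected inheritance pattern."""
    return v.zygosity in EXPECTED_ZYGOSITY[inheritance]
```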
7. HPO (Human Phenotype Ontology) Integration¶
7.1 What HPO Is¶
The Human Phenotype Ontology (HPO) is a standardized, hierarchically structured vocabulary of 18,000+ phenotypic abnormalities observed in human disease. Developed by the Monarch Initiative and maintained by an international consortium, HPO provides the computational bridge between clinical observation and disease ontology matching that makes automated rare disease diagnosis possible.
Each HPO term represents a specific clinical finding with:
- A unique identifier (e.g., HP:0001250 for "Seizures")
- A precise definition distinguishing it from related terms
- Hierarchical relationships (parent/child terms in a directed acyclic graph)
- Disease annotations (which diseases are associated with this phenotype, and at what frequency)
- An information content (IC) score reflecting diagnostic specificity (rarer phenotypes have higher IC)
The ontology is organized under major organ system categories:
- Abnormality of the nervous system (HP:0000707) -- 4,000+ terms
- Abnormality of the musculoskeletal system (HP:0033127) -- 2,500+ terms
- Abnormality of the eye (HP:0000478) -- 1,200+ terms
- Abnormality of the cardiovascular system (HP:0001626) -- 800+ terms
- Abnormality of the immune system (HP:0002715) -- 600+ terms
- Abnormality of metabolism/homeostasis (HP:0001939) -- 1,500+ terms
- Growth abnormality (HP:0001507) -- 400+ terms
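The information content score mentioned above is conventionally defined as the negative log of the fraction of annotated diseases that carry a term (counting annotations to its descendants as well). The sketch below assumes that definition with a natural logarithm; the IC values quoted elsewhere in this paper may come from a different log base or annotation corpus, and the function name is illustrative.

```python
import math


def information_content(n_annotated: int, n_total: int) -> float:
    """IC of a term: -ln(fraction of diseases annotated with the term or a descendant)."""
    if not 0 < n_annotated <= n_total:
        raise ValueError("n_annotated must be in the range 1..n_total")
    return -math.log(n_annotated / n_total)
```

Under this definition a term annotated to every disease has an IC of zero, and a term annotated to a single disease out of 8,500 has the maximum IC for that corpus.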
7.2 How HPO Enables Computational Phenotype Matching¶
HPO transforms rare disease diagnosis from a recognition task (requiring the clinician to recall a specific disease) into a computation task (matching a patient's phenotype profile against all known disease profiles):
Step 1: Clinical Observation to HPO Coding The clinician's observations are mapped to HPO terms, either through manual entry (HPO term search) or NLP extraction from clinical documents:
Clinical observation: "The child has seizures, intellectual disability,
and a small head circumference"
HPO coding: HP:0001250 Seizures
HP:0001249 Intellectual disability
HP:0000252 Microcephaly
Step 2: Phenotype Profile to Disease Matching The patient's HPO term set is compared against disease phenotype profiles using semantic similarity scoring. Each disease in OMIM/Orphanet has an annotated set of HPO terms with frequency modifiers (obligate, very frequent, frequent, occasional, very rare).
Step 3: Ranked Differential Diagnosis Diseases are ranked by phenotype overlap score, with adjustments for:
- Information content weighting (specific phenotypes weighted more heavily)
- Frequency compatibility (obligate features weighted more than occasional)
- Age-of-onset compatibility
- Inheritance pattern compatibility (if family history available)
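A deliberately simplified sketch of the ranking step: each disease is scored by the IC-weighted share of the patient's terms that it explains, scaled by the annotated frequency. It uses exact term matching rather than semantic similarity, and the numeric frequency weights, IC values, and disease names are all illustrative assumptions rather than values from this paper.

```python
# Assumed numeric weights for the HPO frequency modifiers (illustrative only).
FREQ_WEIGHT = {
    "obligate": 1.0,
    "very frequent": 0.9,
    "frequent": 0.55,
    "occasional": 0.17,
    "very rare": 0.02,
}


def overlap_score(patient_ic: dict[str, float], disease_profile: dict[str, str]) -> float:
    """IC- and frequency-weighted fraction of the patient's terms found in a disease profile."""
    total = sum(patient_ic.values())
    if total == 0:
        return 0.0
    matched = sum(
        ic * FREQ_WEIGHT[disease_profile[term]]
        for term, ic in patient_ic.items()
        if term in disease_profile
    )
    return matched / total


def rank_diseases(
    patient_ic: dict[str, float], profiles: dict[str, dict[str, str]]
) -> list[tuple[str, float]]:
    """Return (disease, score) pairs sorted from best to worst match."""
    scored = [(name, overlap_score(patient_ic, prof)) for name, prof in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Age-of-onset and inheritance adjustments would be applied as further multipliers or filters on top of this base score.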
7.3 HPO-to-Disease Scoring: Phenomizer, LIRICAL, Exomiser¶
Three established tools provide validated approaches to HPO-based disease matching, each of which the Rare Disease Diagnostic Agent incorporates:
Phenomizer -- Developed by the HPO consortium, uses semantic similarity between patient HPO terms and disease annotations with p-value calculation for statistical significance of phenotype overlap. The agent replicates this approach for phenotype-only queries.
LIRICAL (LIkelihood Ratio Interpretation of Clinical AbnormaLities) -- Extends phenotype matching with genomic data integration, computing a composite likelihood ratio for each candidate disease based on both phenotype match and variant evidence. The agent's combined phenotype-genotype workflow (Workflow 8.2) implements this methodology.
Exomiser -- The most comprehensive tool, combining HPO phenotype matching, variant pathogenicity scoring, protein interaction network analysis, and cross-species phenotype data (model organisms). The agent's architecture supports Exomiser-like analysis through its multi-collection retrieval strategy.
7.4 HPO Integration Example¶
A practical example demonstrating how HPO coding drives differential diagnosis:
Patient presentation:
10-year-old boy with progressive difficulty walking (onset age 7),
calf pseudohypertrophy, Gowers sign positive, elevated CK (15,000 U/L),
mild intellectual disability
HPO extraction:
HP:0002355 Difficulty walking (IC: 3.1)
HP:0003693 Calf pseudohypertrophy (IC: 8.2) ** highly specific **
HP:0003391 Gowers sign (IC: 9.1) ** highly specific **
HP:0003236 Elevated circulating CK (IC: 4.5)
HP:0001249 Intellectual disability (IC: 2.8)
HP:0003677 Slowly progressive (IC: 2.1)
HP:0003621 Juvenile onset (IC: 2.4)
Differential diagnosis (ranked by phenotype overlap):
Rank Disease Score Matched/Total Key Discriminator
1 Duchenne muscular dystrophy 0.94 7/7 Classic presentation
2 Becker muscular dystrophy 0.89 6/7 Later onset, milder
3 Limb-girdle MD type 2I 0.72 5/7 No calf pseudohypertrophy expected
4 Emery-Dreifuss MD 0.58 4/7 Missing cardiac features
5 SMA type III (Kugelberg-Welander) 0.51 3/7 Different weakness pattern
Recommended next step:
-> Dystrophin gene (DMD) deletion/duplication analysis
-> If negative: DMD sequencing for point mutations
-> If negative: LGMD gene panel
This example illustrates the power of HPO-specific phenotypes: "calf pseudohypertrophy" (HP:0003693) and "Gowers sign" (HP:0003391) have high information content scores because they are associated with very few diseases, making them highly discriminating. General phenotypes like "difficulty walking" (HP:0002355) contribute less to differential narrowing because they are associated with hundreds of conditions.
8. Clinical Workflows¶
8.1 Workflow 1: Phenotype-Driven Diagnostic Workup¶
The primary workflow for undiagnosed patients presenting with a constellation of clinical findings. This workflow implements HPO coding followed by differential diagnosis ranking.
Inputs:
- Patient HPO terms (manually entered or NLP-extracted from clinical documents)
- Age of onset, sex, ethnicity
- Family history (consanguinity, affected relatives, inheritance pattern)
- Previously excluded diagnoses
Processing Logic:
- Semantic phenotype matching: Each patient HPO term is embedded and searched against rd_phenotypes and rd_diseases collections simultaneously
- Information content weighting: Common, non-specific phenotypes (e.g., HP:0011968 "Feeding difficulties", IC=2.1) are weighted lower than highly discriminating phenotypes (e.g., HP:0000528 "Anophthalmia", IC=8.9)
- Phenotype profile similarity: Resnik semantic similarity between patient's HPO term set and each candidate disease's known phenotype profile, incorporating ontology hierarchy traversal
- Age-dependent filtering: Candidate diseases filtered by compatibility with patient's current age and symptom onset timeline
- Inheritance pattern scoring: If family history suggests specific inheritance (e.g., affected males, unaffected carrier mothers -> X-linked), candidates weighted accordingly
Output:
- Top 20 ranked differential diagnoses with similarity score, matched/unmatched phenotype breakdown, expected features not yet observed, recommended confirmatory testing, and evidence citations
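Resnik similarity, used in the phenotype profile step, scores two terms by the information content of their most informative common ancestor. The sketch below runs on a toy ontology with made-up term names and IC values; the set-level aggregation (best match per patient term, averaged) is one common choice and is an assumption about how the agent would combine term-level scores.

```python
def ancestors(term: str, parents: dict[str, list[str]]) -> set[str]:
    """Return the term plus all of its ancestors in the ontology DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1: str, t2: str, parents: dict[str, list[str]], ic: dict[str, float]) -> float:
    """IC of the most informative common ancestor of two terms."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic[t] for t in common), default=0.0)


def profile_similarity(
    patient: list[str], disease: list[str], parents: dict[str, list[str]], ic: dict[str, float]
) -> float:
    """Average, over patient terms, of the best Resnik match in the disease profile."""
    if not patient or not disease:
        return 0.0
    best = [max(resnik(p, d, parents, ic) for d in disease) for p in patient]
    return sum(best) / len(best)
```

Because the score comes from shared ancestors, a patient term such as a specific seizure subtype still earns partial credit against a disease annotated only with the broader seizure term.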
8.2 Workflow 2: Whole Exome/Genome Interpretation¶
Genomic-first diagnostic workflow for patients with WES/WGS data, integrating variant filtering, ACMG classification, and candidate gene prioritization.
Inputs:
- Annotated VCF file (from HCLS AI Factory genomics pipeline or external)
- Patient phenotype profile (HPO terms, if available)
- Proband and family member sequencing (trio analysis preferred)
Processing Logic:
- Variant pre-filtering: Retain variants with gnomAD AF < 0.01 (dominant) or < 0.05 (recessive); CADD >= 15 or REVEL >= 0.5; all ClinVar Pathogenic/Likely Pathogenic; LOF variants in constrained genes (pLI > 0.9)
- Gene-disease association lookup: Each variant's gene searched against rd_genes collection for known disease associations with evidence level
- ACMG automated classification: 28-criteria scoring (PVS1, PS1-PS4, PM1-PM6, PP1-PP5, BA1, BS1-BS4, BP1-BP7)
- Phenotype-genotype correlation: Candidate variants ranked by overlap between gene-associated disease phenotypes and patient's observed phenotypes
- Trio analysis (when available): De novo variant detection, compound heterozygosity phasing, X-linked hemizygosity confirmation, segregation analysis (PP1 criterion)
- Structural variant integration: CNV calls correlated with known microdeletion/microduplication syndromes
Output:
- Tiered variant list: Tier 1 (Pathogenic/LP in definitive genes matching phenotype), Tier 2 (VUS in strong candidate genes), Tier 3 (VUS in plausible genes)
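The trio analysis step can be illustrated with two checks on genotype calls: de novo detection and parent-of-origin phasing for compound heterozygosity. The sketch assumes normalized, unphased VCF-style genotype strings ("0/0", "0/1", "1/1") on autosomes; the dictionary keys, variant IDs, and function names are hypothetical.

```python
from itertools import combinations
from typing import Optional


def is_de_novo(child: str, mother: str, father: str) -> bool:
    """Child carries the alternate allele while both parents are homozygous reference."""
    return child in {"0/1", "1/1"} and mother == "0/0" and father == "0/0"


def origin(variant: dict) -> Optional[str]:
    """Parent of origin for a heterozygous child variant, if it can be determined."""
    if variant["child"] != "0/1":
        return None
    from_mother = variant["mother"] != "0/0"
    from_father = variant["father"] != "0/0"
    if from_mother and not from_father:
        return "maternal"
    if from_father and not from_mother:
        return "paternal"
    return None  # both or neither parent carries it: phase is ambiguous here


def compound_het_pairs(variants: list[dict]) -> list[tuple[str, str]]:
    """Pairs of variants in the same gene inherited from opposite parents."""
    pairs = []
    for a, b in combinations(variants, 2):
        if a["gene"] == b["gene"] and {origin(a), origin(b)} == {"maternal", "paternal"}:
            pairs.append((a["id"], b["id"]))
    return pairs
```

A production implementation would also need to handle phased genotypes, missing calls, sex chromosomes, and parental mosaicism.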
8.3 Workflow 3: Metabolic Disease Screening¶
Specialized workflow for newborn screening follow-up and metabolic pathway analysis for inborn errors of metabolism.
Inputs:
- Abnormal NBS analyte(s) and values
- Biochemical test results (amino acids, organic acids, acylcarnitines, enzyme activities)
- Clinical presentation (feeding difficulties, lethargy, seizures, metabolic acidosis)
Processing Logic:
- Analyte-to-disease mapping: Abnormal analytes mapped against rd_pathways and rd_newborn_screening collections to generate condition-specific differentials
- Metabolic pathway analysis: Enzyme deficiency localized within metabolic pathway (e.g., elevated phenylalanine -> phenylalanine hydroxylase -> PAH gene -> PKU vs. BH4 deficiency)
- Confirmatory test recommendation: Ordered sequence of biochemical, enzymatic, and molecular tests based on analyte pattern
- Emergency protocol activation: For time-critical metabolic emergencies (MSUD, galactosemia, urea cycle disorders), immediate management protocols with neonatal dosing
Key conditions covered: PKU, MSUD, galactosemia, MCAD deficiency, isovaleric acidemia, propionic acidemia, methylmalonic acidemia, OTC deficiency, Gaucher, Fabry, Pompe, MPS I-VII
8.4 Workflow 4: Dysmorphology Assessment¶
Workflow for evaluating patients with syndromic features, facial dysmorphism, growth abnormalities, and skeletal anomalies.
Inputs:
- Dysmorphic features (HPO-coded facial, skeletal, growth parameters)
- Growth measurements (height, weight, head circumference with Z-scores)
- Skeletal survey findings (if available)
- Clinical photographs (with consent, for pattern matching)
Processing Logic:
- Facial feature HPO coding: Specific dysmorphic features coded (e.g., HP:0000316 hypertelorism, HP:0000278 retrognathia, HP:0000431 wide nasal bridge)
- Growth parameter analysis: Z-scores calculated and HPO-coded (e.g., HP:0004322 short stature, HP:0000256 macrocephaly)
- Syndrome matching: Combined facial, growth, and skeletal HPO profile matched against syndromic disease entries in rd_diseases
- Skeletal survey correlation: Radiographic findings (if available) correlated with skeletal dysplasia profiles
Key conditions covered: Down syndrome, Turner syndrome, Noonan syndrome, Williams syndrome, 22q11.2 deletion, Cornelia de Lange, Rubinstein-Taybi, Smith-Lemli-Opitz, achondroplasia, skeletal dysplasias
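The growth parameter analysis in this workflow reduces to computing a Z-score against an age- and sex-matched reference and mapping extreme values to HPO terms. The sketch below assumes the common clinical convention of a ±2 SD cutoff, which this paper does not state; the reference mean and SD would come from growth charts, and the function names are illustrative.

```python
def z_score(value: float, ref_mean: float, ref_sd: float) -> float:
    """Standard score of a measurement against an age/sex-matched reference."""
    return (value - ref_mean) / ref_sd


def code_growth(height_z: float, head_circumference_z: float) -> list[str]:
    """Map growth Z-scores to HPO terms, assuming a +/-2 SD cutoff."""
    terms = []
    if height_z < -2:
        terms.append("HP:0004322")  # Short stature
    if head_circumference_z > 2:
        terms.append("HP:0000256")  # Macrocephaly
    if head_circumference_z < -2:
        terms.append("HP:0000252")  # Microcephaly
    return terms
```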
8.5 Workflow 5: Neurogenetic Evaluation¶
Specialized workflow for developmental delay, epilepsy, intellectual disability, and movement disorders with suspected genetic etiology.
Inputs:
- Developmental milestones (achieved and delayed/absent)
- Seizure semiology and EEG findings
- Neuroimaging findings (MRI brain)
- Movement disorder characterization
- Regression history (if applicable)
Processing Logic:
- Developmental trajectory analysis: Milestone timeline compared against normal developmental curves; pattern classified as static delay, progressive decline, or episodic regression
- Epilepsy gene panel matching: Seizure type and EEG pattern mapped against known epilepsy genes (SCN1A, CDKL5, STXBP1, KCNQ2, etc.)
- Neuroimaging-genotype correlation: MRI findings (white matter abnormalities, cerebellar atrophy, basal ganglia changes) correlated with specific neurogenetic conditions
- Regression pattern analysis: Developmental regression patterns matched against neurodegenerative conditions (Rett, Angelman, neuronal ceroid lipofuscinoses, mitochondrial diseases)
Key conditions covered: SMA (types I-IV), DMD/BMD, Rett syndrome, Angelman syndrome, Prader-Willi, Dravet syndrome, tuberous sclerosis, Friedreich ataxia, Huntington disease, CMT, ataxia-telangiectasia, neuronal ceroid lipofuscinoses
8.6 Workflow 6: Cardiac Genetics¶
Workflow bridging rare disease genetics and cardiovascular medicine for inherited arrhythmias and cardiomyopathies, connecting directly to the Cardiology Intelligence Agent.
Inputs:
- ECG/Holter findings (QTc interval, Brugada pattern, epsilon waves)
- Echocardiographic measurements (wall thickness, chamber dimensions, systolic function)
- Cardiac MRI findings (fibrosis, infiltrative pattern)
- Family history of sudden cardiac death or cardiomyopathy
- Syncope/presyncope history
Processing Logic:
- Arrhythmia gene panel matching: ECG phenotype mapped to candidate genes (KCNQ1/KCNH2/SCN5A for Long QT, SCN5A for Brugada, RYR2 for CPVT)
- Cardiomyopathy classification: Echocardiographic and MRI findings classified (HCM -> sarcomeric genes; DCM -> TTN, LMNA, etc.; ARVC -> desmosomal genes; restrictive -> ATTR, Fabry)
- Sudden death risk stratification: Family history, syncope, ECG markers integrated for risk assessment
- Cross-agent referral: Identified genetic cardiac conditions referred to Cardiology Intelligence Agent for comprehensive cardiac management
Key conditions covered: Long QT syndrome (LQT1-15), Brugada syndrome, HCM (MYH7, MYBPC3), ATTR amyloidosis (TTR), CPVT (RYR2), ARVC (PKP2, DSP), familial dilated cardiomyopathy
8.7 Workflow 7: Connective Tissue Disorders¶
Workflow for evaluating patients with joint hypermobility, vascular fragility, skeletal features, and skin involvement suggestive of heritable connective tissue disorders.
Inputs:
- Beighton hypermobility score
- Skin features (hyperextensibility, fragility, scarring, translucency)
- Vascular history (aneurysm, dissection, varicose veins, easy bruising)
- Skeletal features (scoliosis, pectus deformity, arachnodactyly, tall stature)
- Ocular findings (lens subluxation, myopia, retinal detachment)
- Family history
Processing Logic:
- Clinical scoring system application: Ghent criteria (Marfan), 2017 EDS criteria (13 types), Sillence classification (OI), Loeys-Dietz clinical features
- Phenotype-to-subtype mapping: Feature constellation mapped to specific subtypes (e.g., vascular EDS vs. classical EDS vs. hypermobile EDS have different genetic etiologies and prognoses)
- Gene panel recommendation: Based on clinical scoring, targeted gene testing vs. comprehensive connective tissue panel recommended
- Vascular surveillance protocol: For conditions with aortic/arterial risk (Marfan, vascular EDS, Loeys-Dietz), imaging surveillance schedule generated
Key conditions covered: Marfan syndrome (FBN1), Ehlers-Danlos syndrome (13 types -- COL5A1/2 classical, COL3A1 vascular, TNXB classical-like, PLOD1 kyphoscoliotic, etc.; hypermobile EDS has no confirmed causative gene and is diagnosed clinically), osteogenesis imperfecta (COL1A1/2 types I-IV), Loeys-Dietz (TGFBR1/2, SMAD3, TGFB2/3), Stickler syndrome
8.8 Workflow 8: Inborn Errors of Metabolism¶
Comprehensive workflow for suspected metabolic diseases beyond newborn screening, covering enzyme assays, biomarker patterns, and dietary management.
Inputs:
- Biochemical profile (amino acids, organic acids, acylcarnitines, very long chain fatty acids, lysosomal enzyme panel)
- Metabolic crisis history (triggers, severity, frequency)
- Dietary history and response to dietary interventions
- Organ involvement pattern (liver, brain, heart, muscle, bone)
Processing Logic:
- Metabolic biomarker pattern recognition: Analyte patterns matched against metabolic disease signatures in rd_pathways (e.g., elevated C5-carnitine with isovalerylglycine -> isovaleric acidemia)
- Enzyme activity interpretation: Reduced enzyme activity correlated with specific deficiency states and residual activity-phenotype correlations
- Dietary management protocol: Disease-specific dietary restrictions and supplement recommendations (e.g., BCAA-restricted diet for MSUD, phenylalanine-restricted diet for PKU, galactose-free diet for galactosemia)
- Enzyme replacement therapy matching: For lysosomal storage disorders, ERT eligibility and monitoring protocols generated
- Substrate reduction therapy consideration: For eligible conditions (Gaucher type 1 -> miglustat/eliglustat; Niemann-Pick C -> miglustat)
Key conditions covered: Gaucher (imiglucerase, velaglucerase, taliglucerase), Fabry (agalsidase alfa/beta, migalastat), Pompe (alglucosidase alfa, avalglucosidase alfa), MPS I (laronidase), MPS II (idursulfase), MPS IVA (elosulfase alfa), MPS VI (galsulfase), Niemann-Pick C (miglustat)
8.9 Workflow 9: Gene Therapy Eligibility Assessment¶
Workflow for matching diagnosed patients with approved and investigational gene therapies, representing a rapidly expanding treatment landscape.
Inputs:
- Confirmed genetic diagnosis (disease, gene, specific variant(s))
- Patient demographics (age, weight)
- Prior treatment history
- Anti-AAV antibody status (if known)
- Insurance/access considerations
Processing Logic:
- Approved therapy matching: Patient's disease and genotype searched against rd_therapies collection for FDA/EMA-approved therapies:
Nusinersen (Spinraza) for SMA: Intrathecal ASO targeting SMN2 splicing; all SMA types; no age limit; requires lumbar puncture access
Onasemnogene abeparvovec (Zolgensma) for SMA: AAV9 gene replacement; age < 2 years (label); anti-AAV9 titer < 1:50; weight < 21 kg; $2.125M
Risdiplam (Evrysdi) for SMA: Oral SMN2 splicing modifier; all SMA types; pediatric and adult patients (FDA label extended in 2022 to include infants under 2 months); no AAV antibody concern
Voretigene neparvovec (Luxturna) for RPE65 retinal dystrophy: Subretinal AAV2 injection; biallelic RPE65 mutations; sufficient viable retinal cells; $850K
Etranacogene dezaparvovec (Hemgenix) for hemophilia B: AAV5 gene therapy; Factor IX < 2%; anti-AAV5 titer negative; age >= 18; $3.5M
Exagamglogene autotemcel (Casgevy) for sickle cell/beta-thalassemia: CRISPR-edited autologous HSCs; age >= 12; SCD with recurrent VOC or TDT; $2.2M
Lovotibeglogene autotemcel (Lyfgenia) for sickle cell: Lentiviral gene addition; age >= 12; SCD with recurrent VOC
Delandistrogene moxeparvovec (Elevidys) for DMD: AAV-based micro-dystrophin; ambulatory DMD; age 4-5 (label)
- Eligibility criteria evaluation: Criterion-by-criterion assessment of patient against therapy requirements
- Investigational therapy search: Active gene therapy trials from rd_trials with eligibility pre-screening
- Compassionate use / expanded access: If no approved therapy and no open trial, expanded access programs identified
- Pre-treatment workup: Required baseline assessments generated (cardiac evaluation, hepatic function, immunological screening, anti-AAV antibodies)
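The criterion-by-criterion evaluation can be sketched as a list of labeled predicates applied to a patient record, with missing data reported as "unknown" rather than as a failure. The three criteria below are the onasemnogene abeparvovec values listed in this section; the field names, the titer encoding (reciprocal dilution, so 1:50 becomes 50), and all function names are assumptions made for illustration.

```python
from typing import Callable, Optional

Evaluator = Callable[[dict], Optional[bool]]


def check(field: str, predicate: Callable[[float], bool]) -> Evaluator:
    """Build an evaluator that returns None when the field is missing."""
    def evaluate(patient: dict) -> Optional[bool]:
        value = patient.get(field)
        return None if value is None else predicate(value)
    return evaluate


# Criteria as listed above for onasemnogene abeparvovec (field names are hypothetical).
ZOLGENSMA_CRITERIA: list[tuple[str, Evaluator]] = [
    ("age < 2 years", check("age_years", lambda v: v < 2)),
    ("anti-AAV9 titer < 1:50", check("anti_aav9_titer", lambda v: v < 50)),
    ("weight < 21 kg", check("weight_kg", lambda v: v < 21)),
]


def evaluate_eligibility(patient: dict, criteria: list[tuple[str, Evaluator]]) -> dict[str, str]:
    """Return 'met', 'not met', or 'unknown' for each criterion."""
    results = {}
    for label, evaluate in criteria:
        outcome = evaluate(patient)
        results[label] = "unknown" if outcome is None else ("met" if outcome else "not met")
    return results
```

Keeping "unknown" distinct from "not met" lets the pre-treatment workup step request the missing test (here, the anti-AAV9 titer) instead of ruling the therapy out.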
8.10 Workflow 10: Undiagnosed Disease Program Support¶
Comprehensive multi-modal analysis for patients who have exhausted standard diagnostic pathways, modeled on the NIH Undiagnosed Diseases Program (UDP) and Undiagnosed Diseases Network (UDN) methodology.
Inputs:
- Complete clinical record corpus (all available documents from diagnostic odyssey)
- Genomic data (WES/WGS VCF if available; gene panels if not)
- Family history and pedigree
- Prior diagnostic hypotheses and testing results
Processing Logic:
- Document ingestion and timeline reconstruction: All clinical documents processed through NLP pipeline; events extracted and ordered chronologically across years and institutions
- Comprehensive HPO extraction: Every clinical finding extracted from every document, with temporal annotation (when first noted, progression, resolution)
- Phenotype trajectory analysis: Progressive phenotype accumulation mapped against known disease trajectories in rd_natural_history
- Exhaustive differential generation: Full phenotype profile matched against all 8,500 diseases in rd_diseases (not limited to top 20)
- Previously excluded disease re-evaluation: Diseases previously ruled out re-evaluated against updated diagnostic criteria and new clinical features
- Genomic re-analysis (if VCF available): Variant re-interpretation against updated ClinVar, new gene-disease associations published since original analysis
- Matchmaker Exchange query: Anonymized phenotype/genotype profile submitted to international matching network (Matchmaker Exchange -- a federated network of 7 databases including DECIPHER, GeneMatcher, PhenomeCentral, MyGene2)
- Novel gene-disease hypothesis generation: Variants in genes without established disease association evaluated for biological plausibility (protein function, expression patterns, animal models, constraint scores)
Output:
- Comprehensive diagnostic summary report (20-40 pages)
- Ranked diagnostic hypotheses with confidence tiers
- Candidate novel gene-disease associations for research follow-up
- Matchmaker Exchange results (if matches found)
9. Cross-Modal Integration and Genomic Correlation¶
9.1 Multi-Omics Convergence Architecture¶
The Rare Disease Diagnostic Agent integrates evidence across multiple data modalities to achieve diagnostic convergence -- a principle that no single data type is sufficient for rare disease diagnosis. The cross-modal integration engine operates on a Bayesian framework that updates diagnostic probabilities as new evidence from each modality is incorporated.
Supported Data Modalities:
| Modality | Data Type | Collection(s) | Evidence Weight |
|---|---|---|---|
| Clinical Phenotype | HPO terms from clinical notes | rd_phenotypes, rd_diseases | Baseline prior |
| Whole Exome/Genome Sequencing | VCF with annotated variants | rd_variants, rd_genes | Primary genetic |
| Gene Expression (RNA-seq) | Transcript abundance | rd_genes, rd_pathways | Functional validation |
| Metabolomics | Biomarker panels, NBS results | rd_newborn_screening | Biochemical confirmation |
| Imaging | MRI, CT, ultrasound features | rd_phenotypes | Structural phenotype |
| Family Segregation | Pedigree + co-segregation data | rd_variants, rd_genes | Inheritance validation |
| Literature | Case reports, cohort studies | rd_literature, rd_case_reports | Knowledge-base evidence |
9.2 Genomic Correlation Engine¶
The genomic correlation engine connects phenotypic observations to genomic findings by traversing the phenotype-gene-variant-disease knowledge graph. For each patient, the engine performs the following operations:
1. Phenotype-to-Gene Mapping: Patient HPO terms are mapped to candidate genes using the HPO gene-phenotype annotation database (currently 5,400+ genes with HPO annotations). Each gene receives a phenotype match score (PMS) based on the information content of shared HPO terms using Resnik semantic similarity.
2. Variant Prioritization: Variants in candidate genes are prioritized using a composite score incorporating:
   - ACMG/AMP classification (pathogenic, likely pathogenic, VUS, likely benign, benign)
   - Population allele frequency from gnomAD (v4.1, 807,162 individuals across exomes and genomes)
   - In silico predictions: CADD (>20), REVEL (>0.5), AlphaMissense (>0.564 pathogenic threshold)
   - Splice prediction: SpliceAI (delta score >0.2)
   - Conservation: PhyloP, GERP++, phastCons
3. Inheritance Pattern Matching: Candidate variant-disease pairs are filtered by inheritance compatibility. The engine evaluates autosomal dominant (heterozygous), autosomal recessive (homozygous or compound heterozygous), X-linked (hemizygous in males, heterozygous in females), and mitochondrial inheritance against the patient's genotype and family structure.
4. Cross-Modal Evidence Aggregation: Evidence from clinical, genomic, biochemical, and literature sources is aggregated using a weighted Bayesian scoring model:
P(disease | evidence) = P(phenotype_match) * P(variant_pathogenicity) * P(inheritance_fit) * P(literature_support) * P(functional_evidence) / P(evidence)
The final diagnostic confidence is categorized as:
- Definitive (>0.95): Strong pathogenic variant in established disease gene with phenotype match
- Strong (0.80-0.95): Likely pathogenic variant with good phenotype overlap
- Moderate (0.50-0.80): VUS in candidate gene with partial phenotype match
- Suggestive (<0.50): Possible association requiring additional evidence
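One way to read the scoring model is as a product of per-modality scores for each candidate disease, normalized over the candidate set so that the denominator P(evidence) becomes the sum of the unnormalized products. The sketch below takes that reading, which treats the evidence sources as independent and is an interpretation rather than the agent's confirmed implementation; the function names are illustrative, and the tier boundaries are those listed above.

```python
import math

COMPONENTS = (
    "phenotype_match",
    "variant_pathogenicity",
    "inheritance_fit",
    "literature_support",
    "functional_evidence",
)


def posterior(candidates: dict[str, dict[str, float]]) -> dict[str, float]:
    """Normalize the product of component scores across all candidate diseases."""
    unnormalized = {
        name: math.prod(scores[c] for c in COMPONENTS)
        for name, scores in candidates.items()
    }
    total = sum(unnormalized.values())  # plays the role of P(evidence)
    if total == 0:
        return {name: 0.0 for name in candidates}
    return {name: value / total for name, value in unnormalized.items()}


def confidence_tier(p: float) -> str:
    """Map a posterior probability to the confidence categories listed above."""
    if p > 0.95:
        return "Definitive"
    if p >= 0.80:
        return "Strong"
    if p >= 0.50:
        return "Moderate"
    return "Suggestive"
```

Note that normalizing over candidates makes the posterior relative: a lone weak candidate would still score 1.0, so an explicit "none of the above" hypothesis or an absolute evidence floor would be needed in practice.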
9.3 Phenotype-Genotype Discordance Resolution¶
When phenotypic and genomic evidence conflict, the discordance resolution module activates. Common scenarios include:
- Phenotype expansion: Patient has features not previously associated with the candidate gene -- the system queries rd_literature and rd_case_reports for emerging phenotype-genotype associations
- Incomplete penetrance: Strong genetic finding without full phenotypic expression -- the system retrieves penetrance data from rd_natural_history and age-dependent expression patterns
- Digenic/oligogenic inheritance: No single gene explains the full phenotype -- the system evaluates gene-gene interaction networks from rd_pathways for synergistic effects
- Phenocopies: Clinical presentation mimics a genetic condition but has a non-genetic etiology -- the system flags environmental, autoimmune, or acquired differential diagnoses
9.4 Reanalysis Triggers¶
The system monitors for reanalysis triggers that may change diagnostic interpretation:
- New gene-disease associations published in OMIM or ClinVar
- Variant reclassification events (VUS upgraded to likely pathogenic)
- Updated gnomAD population frequency data
- New functional studies validating gene function
- Patient phenotype evolution (new symptoms or symptom resolution)
When a trigger is detected, affected cases are automatically flagged for reanalysis, and the cross-modal integration is re-executed with updated evidence.
10. NIM Integration Strategy¶
10.1 NVIDIA NIM Microservice Architecture¶
The Rare Disease Diagnostic Agent leverages NVIDIA Inference Microservices (NIMs) deployed on DGX infrastructure to accelerate computationally intensive genomic and AI operations. NIM containers provide GPU-optimized, containerized inference endpoints that can be composed into diagnostic pipelines.
NIM Deployment Configuration:
| NIM Service | Model | GPU Memory | Purpose |
|---|---|---|---|
| Parabricks Germline | BWA-MEM2 + DeepVariant | 24 GB | FASTQ-to-VCF alignment and variant calling |
| BioNeMo ESM-2 | ESM-2 (650M params) | 8 GB | Protein structure impact prediction |
| BioNeMo MolMIM | MolMIM | 8 GB | Molecular interaction modeling for drug candidates |
| LLM Embedding | BGE-small-en-v1.5 | 4 GB | Document and phenotype embedding generation |
| LLM Inference | Claude API (external) | N/A | Clinical reasoning and report generation |
10.2 Genomic NIM Pipeline¶
The genomic processing pipeline chains Parabricks NIMs for accelerated variant analysis:
- Alignment (BWA-MEM2 GPU): Raw FASTQ reads aligned to GRCh38 reference genome. GPU acceleration reduces alignment time from 6-8 hours (CPU) to 15-25 minutes on DGX Spark.
- Variant Calling (DeepVariant GPU): Deep learning-based variant caller identifies SNVs and small indels with 99.7% accuracy on Genome-in-a-Bottle truth sets. Processing time: 20-30 minutes (GPU) vs 4-6 hours (CPU).
- Structural Variant Calling: Long-read data processed through pbsv or Sniffles2 for structural variant detection (deletions, duplications, inversions, translocations >50bp).
- Annotation: VCF annotated with ClinVar, gnomAD, OMIM, HPO gene associations, AlphaMissense predictions, and SpliceAI scores via custom annotation pipeline.
10.3 Protein Structure NIM Integration¶
For variants of uncertain significance (VUS), the protein structure analysis NIM provides functional impact prediction:
- ESM-2 Variant Effect Prediction: Zero-shot variant effect scores computed using evolutionary scale modeling. Variants with ESM-2 log-likelihood ratio < -7.5 flagged as likely deleterious.
- AlphaFold Structure Mapping: Patient variants mapped to predicted protein structures to assess location relative to:
  - Active sites and catalytic residues
  - Protein-protein interaction interfaces
  - Transmembrane domains
  - Post-translational modification sites
- Molecular Dynamics Impact: For high-priority VUS in therapeutic target genes, MolMIM NIM evaluates structural perturbation and potential impact on drug binding.
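The ESM-2 thresholding rule above can be sketched as follows. This is a minimal illustration that assumes per-residue log-probabilities have already been obtained from the model; the function names and the example values are hypothetical, and only the -7.5 cutoff comes from the text.

```python
from typing import Mapping

LLR_DELETERIOUS_THRESHOLD = -7.5  # cutoff quoted in the text


def esm_llr(log_probs_at_position: Mapping[str, float], wild_type: str, mutant: str) -> float:
    """Log-likelihood ratio at one position: log p(mutant) - log p(wild type)."""
    return log_probs_at_position[mutant] - log_probs_at_position[wild_type]


def is_likely_deleterious(llr: float, threshold: float = LLR_DELETERIOUS_THRESHOLD) -> bool:
    """Flag a variant when its LLR falls strictly below the threshold."""
    return llr < threshold


# Hypothetical log-probabilities for a single residue position (not real model output).
log_probs = {"R": -0.2, "W": -9.1, "K": -2.3}

for mutant in ("W", "K"):
    llr = esm_llr(log_probs, wild_type="R", mutant=mutant)
    print(f"R->{mutant}: LLR={llr:.1f}, flagged={is_likely_deleterious(llr)}")
```

A strongly negative ratio means the model finds the substituted residue far less plausible than the reference residue in that sequence context.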
10.4 NIM Orchestration and Scaling¶
NIM services are orchestrated through the HCLS AI Factory Nextflow DSL2 pipeline with the following scaling strategy:
- Single-patient mode: Sequential NIM invocation for individual diagnostic workups (typical clinical use)
- Batch mode: Parallel NIM execution for cohort analysis (newborn screening programs, undiagnosed disease cohorts)
- Priority queue: Urgent diagnostic cases (NICU, acute presentations) receive priority GPU allocation
Resource allocation is managed through Kubernetes with NVIDIA GPU Operator, enabling dynamic scaling based on queue depth and urgency classification. The DGX Spark provides 128 GB unified memory allowing simultaneous execution of alignment, variant calling, and embedding generation pipelines.
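The priority-queue behaviour described above could look roughly like the sketch below. It uses only the Python standard library; the urgency labels and class names are hypothetical, and a real deployment would delegate scheduling to Kubernetes rather than an in-process heap.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical urgency labels; a lower rank is served first.
URGENCY_RANK = {"nicu": 0, "acute": 1, "routine": 2}


@dataclass(order=True)
class QueuedCase:
    priority: int
    sequence: int                       # tie-breaker: first come, first served
    patient_id: str = field(compare=False)


class DiagnosticQueue:
    """Min-heap of cases ordered by urgency, then by arrival order."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()

    def submit(self, patient_id: str, urgency: str = "routine") -> None:
        case = QueuedCase(URGENCY_RANK[urgency], next(self._counter), patient_id)
        heapq.heappush(self._heap, case)

    def next_case(self) -> str:
        return heapq.heappop(self._heap).patient_id

    def __len__(self) -> int:
        return len(self._heap)
```

The sequence counter keeps ordering stable within an urgency class, so routine cases are not starved by later routine submissions.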
11. Knowledge Graph Design¶
11.1 Rare Disease Knowledge Graph Schema¶
The knowledge graph underpinning the diagnostic agent models the complex relationships between phenotypes, genes, variants, diseases, and therapies. The graph is implemented as a combination of Milvus vector collections (for semantic search) and an in-memory graph structure (for traversal queries).
Node Types:
| Node Type | Count | Primary Source | Key Properties |
|---|---|---|---|
| Disease | ~8,500 | OMIM, Orphanet | OMIM ID, ORPHA code, prevalence, inheritance |
| Gene | ~22,000 | HGNC, OMIM | Symbol, Ensembl ID, chromosome, constraint scores |
| Phenotype (HPO) | 16,600+ | HPO Ontology | HPO ID, name, definition, synonyms |
| Variant | ~4.1M | ClinVar | rsID, HGVS, classification, review status |
| Therapy | ~600 | FDA, EMA, Orphanet | Drug name, approval status, mechanism, cost |
| Clinical Trial | ~3,200 | ClinicalTrials.gov | NCT ID, phase, status, eligibility |
| Pathway | ~1,800 | Reactome, KEGG | Pathway ID, name, gene members |
| Publication | ~45,000 | PubMed, GeneReviews | PMID, title, abstract, MeSH terms |
Edge Types:
- CAUSES (Gene -> Disease): Gene-disease association with evidence level (definitive, strong, moderate, limited, disputed)
- HAS_PHENOTYPE (Disease -> Phenotype): Disease-phenotype association with frequency annotation (obligate, very frequent, frequent, occasional, very rare)
- VARIANT_IN (Variant -> Gene): Variant location within gene
- TREATS (Therapy -> Disease): Therapeutic indication
- ASSOCIATED_WITH (Phenotype -> Gene): Phenotype-gene annotation from HPO
- PARTICIPATES_IN (Gene -> Pathway): Gene-pathway membership
- INTERACTS_WITH (Gene -> Gene): Protein-protein interaction
- CITED_IN (Disease/Gene/Variant -> Publication): Literature evidence
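As an illustration of the in-memory side of this schema, the sketch below stores typed edges in an adjacency list and walks a two-hop Gene -> Disease -> Phenotype path. The class and method names are assumptions, and the gene and disease identifiers are placeholders rather than real records.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    relation: str
    target: str
    attrs: dict


class KnowledgeGraph:
    """Directed, typed multigraph kept as an adjacency list."""

    def __init__(self) -> None:
        self._out = defaultdict(list)

    def add_edge(self, source: str, relation: str, target: str, **attrs) -> None:
        self._out[source].append(Edge(source, relation, target, attrs))

    def neighbors(self, node: str, relation: str) -> list:
        return [e.target for e in self._out.get(node, []) if e.relation == relation]

    def phenotypes_for_gene(self, gene: str) -> set:
        """Two-hop traversal: Gene -CAUSES-> Disease -HAS_PHENOTYPE-> Phenotype."""
        found = set()
        for disease in self.neighbors(gene, "CAUSES"):
            found.update(self.neighbors(disease, "HAS_PHENOTYPE"))
        return found


# Placeholder identifiers for illustration only.
kg = KnowledgeGraph()
kg.add_edge("GENE_A", "CAUSES", "DISEASE_1", evidence="definitive")
kg.add_edge("DISEASE_1", "HAS_PHENOTYPE", "HP:0001252", frequency="very frequent")
kg.add_edge("DISEASE_1", "HAS_PHENOTYPE", "HP:0011968", frequency="frequent")
print(sorted(kg.phenotypes_for_gene("GENE_A")))
```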
11.2 Graph Construction Pipeline¶
The knowledge graph is constructed through automated ingestion from primary sources:
- OMIM Morbid Map: Gene-disease associations with phenotype MIM numbers (updated monthly)
- HPO Annotations: Disease-phenotype associations with frequency qualifiers (updated quarterly)
- ClinGen Gene-Disease Validity: Evidence-based gene-disease relationship classifications
- Orphanet Rare Disease Ontology: Disease hierarchy, prevalence, inheritance patterns
- ClinVar Variant Submissions: Variant-disease associations with review status
- Reactome/KEGG: Metabolic and signaling pathway membership
- STRING: Protein-protein interaction networks (confidence >0.7)
11.3 Graph Traversal Algorithms¶
The diagnostic engine uses specialized graph traversal algorithms:
- Phenotype Propagation: HPO terms are propagated up the ontology hierarchy to find diseases matching at higher granularity when specific terms have no matches
- Gene Network Expansion: Candidate genes are expanded through protein interaction networks to identify functionally related genes that may explain the phenotype
- Pathway Enrichment: When multiple candidate genes converge on a single pathway, the pathway-level signal boosts confidence in that diagnostic hypothesis
- Disease Clustering: Related diseases in the ontology hierarchy are clustered for differential diagnosis presentation
12. Query Expansion and Retrieval Strategy¶
12.1 Multi-Stage Retrieval Architecture¶
The retrieval pipeline implements a multi-stage approach to maximize recall for rare disease queries while maintaining precision:
Stage 1: Query Understanding and Expansion
Clinical queries are processed through a phenotype-aware NLP pipeline:
- HPO Term Extraction: Free-text clinical descriptions are mapped to HPO terms using the HPO text-mining pipeline. For example, "floppy baby with feeding difficulties" maps to HP:0001252 (Muscular hypotonia), HP:0011968 (Feeding difficulties).
- Synonym Expansion: Each HPO term is expanded with its full synonym set. HP:0001252 (Muscular hypotonia) also searches for "hypotonia", "poor muscle tone", "decreased muscle tone", "floppy infant".
- Hierarchical Expansion: HPO terms are expanded both up (more general) and down (more specific) the ontology tree. Depth-limited expansion (2 levels up, 1 level down) balances recall and precision.
- Negation Handling: Explicitly absent phenotypes are tracked to exclude diseases where those features are obligate or very frequent.
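The depth-limited hierarchical expansion step can be sketched as a bounded breadth-first walk over the ontology. The example below is a minimal version under stated assumptions: the ontology is given as a child-to-parents mapping, the function names are hypothetical, and the term identifiers are toy placeholders rather than real HPO terms.

```python
from collections import defaultdict


def build_children(parents: dict) -> dict:
    """Invert a child -> parents mapping into parent -> children."""
    children = defaultdict(set)
    for child, parent_terms in parents.items():
        for parent in parent_terms:
            children[parent].add(child)
    return children


def _walk(start: str, edges: dict, max_depth: int) -> set:
    """Collect every term reachable from start within max_depth hops."""
    reached, frontier = set(), {start}
    for _ in range(max_depth):
        frontier = {nxt for term in frontier for nxt in edges.get(term, ())} - reached - {start}
        if not frontier:
            break
        reached |= frontier
    return reached


def expand_term(term: str, parents: dict, up: int = 2, down: int = 1) -> set:
    """Return the term plus ancestors within `up` levels and descendants within `down` levels."""
    children = build_children(parents)
    return {term} | _walk(term, parents, up) | _walk(term, children, down)


# Toy ontology (placeholder identifiers): root <- top <- mid <- leaf <- leafchild <- grandchild
toy_parents = {
    "T:root": set(),
    "T:top": {"T:root"},
    "T:mid": {"T:top"},
    "T:leaf": {"T:mid"},
    "T:leafchild": {"T:leaf"},
    "T:grandchild": {"T:leafchild"},
}
print(sorted(expand_term("T:leaf", toy_parents)))
```

With the defaults from the text (2 up, 1 down), the root term three levels above and the grandchild two levels below are left out of the expansion.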
Stage 2: Multi-Collection Parallel Retrieval
The expanded query is dispatched simultaneously to relevant Milvus collections:
```python
import asyncio

# ExpandedQuery, RetrievalResults, search_collection (an async coroutine), and
# merge_and_rank are assumed to be defined elsewhere in the agent.
async def parallel_retrieval(expanded_query: "ExpandedQuery") -> "RetrievalResults":
    # One search coroutine per collection, each with its own top_k budget.
    tasks = [
        search_collection("rd_phenotypes", expanded_query.hpo_embeddings, top_k=50),
        search_collection("rd_diseases", expanded_query.disease_embeddings, top_k=30),
        search_collection("rd_genes", expanded_query.gene_embeddings, top_k=30),
        search_collection("rd_variants", expanded_query.variant_filter, top_k=100),
        search_collection("rd_literature", expanded_query.text_embeddings, top_k=20),
        search_collection("rd_case_reports", expanded_query.phenotype_embeddings, top_k=15),
        search_collection("rd_therapies", expanded_query.therapy_embeddings, top_k=10),
        search_collection("rd_guidelines", expanded_query.guideline_embeddings, top_k=10),
    ]
    # Run all searches concurrently; results come back in the same order as tasks.
    results = await asyncio.gather(*tasks)
    return merge_and_rank(results)
```
Stage 3: Cross-Collection Fusion and Reranking
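Results from multiple collections are fused using Reciprocal Rank Fusion (RRF):
$$\mathrm{RRF}(d) = \sum_{c \in C_d} \frac{1}{k + \mathrm{rank}_c(d)}$$
where $C_d$ is the set of collections that returned document $d$, $\mathrm{rank}_c(d)$ is the 1-based rank of $d$ in collection $c$, and k=60 (standard RRF constant). Documents appearing in multiple collections receive boosted scores, implementing the principle that cross-modal evidence convergence increases diagnostic confidence.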
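A minimal sketch of Reciprocal Rank Fusion is shown below: each collection contributes 1 / (k + rank) for every document it returns, and contributions are summed per document. It assumes that results from different collections share a common identifier (for example a disease ID); the function name and the example IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: dict, k: int = 60) -> list:
    """Fuse per-collection rankings into one list of (doc_id, score), best first.

    rankings maps a collection name to its result IDs ordered from best to worst.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder result IDs for illustration.
rankings = {
    "rd_diseases": ["D1", "D2", "D3"],
    "rd_case_reports": ["D2", "D4"],
    "rd_literature": ["D2", "D1"],
}
for doc_id, score in reciprocal_rank_fusion(rankings):
    print(doc_id, round(score, 4))
```

In this example D2 ranks first because it appears in all three lists, even though it is never ranked first in rd_diseases, which is the convergence effect the text describes.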
12.2 Rare Disease-Specific Retrieval Challenges¶
Several challenges are unique to rare disease retrieval:
- Extreme class imbalance: Some diseases have <10 known cases worldwide. The retrieval system uses case report-level indexing to capture even single-patient observations.
- Phenotypic heterogeneity: The same genetic variant can produce vastly different clinical presentations (variable expressivity). The system indexes phenotype frequency annotations to weight common vs rare presentations.
- Evolving nomenclature: Disease names change frequently (e.g., "Charcot-Marie-Tooth" encompasses 90+ subtypes with multiple naming conventions). The system maintains a comprehensive synonym and cross-reference index.
- Multilingual evidence: Critical case reports may be published in non-English journals. Translation-aware embeddings capture cross-lingual semantic similarity.
12.3 Context Window Optimization¶
Given the complexity of rare disease cases, context window management is critical for LLM-based reasoning:
- Hierarchical summarization: Long documents are pre-summarized at multiple levels (abstract, key findings, full text) and the appropriate level is selected based on query relevance score
- Evidence prioritization: The most diagnostically relevant evidence is placed first in the context window, with supporting evidence appended in decreasing relevance order
- Token budget allocation: The context window is partitioned across evidence types (40% phenotype-gene evidence, 25% variant data, 20% literature, 15% therapeutic options)
- Dynamic retrieval: If initial retrieval does not provide sufficient evidence for confident diagnosis, iterative retrieval rounds expand the search to lower-ranked candidates
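The token budget split above can be made concrete with a small helper. This is a sketch under assumptions: the category keys, the `reserved_tokens` parameter (space held back for the prompt and the model's answer), and the rule for assigning the rounding remainder are all illustrative choices, with only the 40/25/20/15 split taken from the text.

```python
# Percent shares quoted in the text, kept as integers so the arithmetic is exact.
BUDGET_SHARES = {
    "phenotype_gene": 40,
    "variant": 25,
    "literature": 20,
    "therapy": 15,
}


def allocate_token_budget(context_tokens: int, reserved_tokens: int = 0,
                          shares: dict = BUDGET_SHARES) -> dict:
    """Split the usable context window across evidence types."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    available = context_tokens - reserved_tokens
    if available <= 0:
        raise ValueError("no tokens left after the reservation")
    budget = {name: available * pct // 100 for name, pct in shares.items()}
    # Integer division can leave a few tokens over; give them to the largest share.
    largest = max(shares, key=shares.get)
    budget[largest] += available - sum(budget.values())
    return budget


print(allocate_token_budget(100_000, reserved_tokens=20_000))
```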
13. API and UI Design¶
13.1 RESTful API Architecture¶
The Rare Disease Diagnostic Agent exposes a RESTful API on port 8134 (configurable) for integration with clinical systems:
Core Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/diagnose | POST | Submit patient phenotype/genotype for diagnostic analysis |
| /api/v1/variants/interpret | POST | ACMG-compliant variant interpretation |
| /api/v1/phenotype/match | POST | HPO-to-disease matching |
| /api/v1/therapy/search | POST | Therapeutic option identification |
| /api/v1/trial/match | POST | Clinical trial eligibility matching |
| /api/v1/report/generate | POST | Generate comprehensive diagnostic report |
| /api/v1/case/{id}/reanalyze | PUT | Trigger case reanalysis with updated evidence |
| /api/v1/health | GET | Service health and collection status |
| /api/v1/collections/status | GET | Milvus collection statistics |
Request Schema (Diagnostic Analysis):
```json
{
  "patient_id": "RD-2026-001",
  "phenotypes": [
    {"hpo_id": "HP:0001252", "onset": "congenital", "severity": "severe"},
    {"hpo_id": "HP:0001263", "onset": "infantile"},
    {"hpo_id": "HP:0002015", "onset": "infantile"}
  ],
  "negated_phenotypes": ["HP:0001249"],
  "vcf_path": "/data/patients/RD-2026-001/variants.vcf.gz",
  "family_history": {
    "consanguinity": false,
    "affected_relatives": [],
    "inheritance_pattern_suspected": "autosomal_recessive"
  },
  "prior_testing": ["normal karyotype", "negative Prader-Willi methylation"],
  "urgency": "routine"
}
```
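Before a request like this reaches the diagnostic engine, the service would probably want to reject malformed payloads early. The sketch below shows one way to do that with the standard library only; the validation rules, the error messages, and the set of allowed urgency values are assumptions, since the text specifies only the field names and the "routine" value.

```python
import json
import re

HPO_ID = re.compile(r"HP:\d{7}")          # HPO identifiers: "HP:" plus seven digits
ALLOWED_URGENCY = {"routine", "urgent"}   # hypothetical value set; only "routine" appears in the text


def validate_diagnose_request(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means the payload passed."""
    errors = []
    if not payload.get("patient_id"):
        errors.append("patient_id is required")

    phenotypes = payload.get("phenotypes", [])
    negated = payload.get("negated_phenotypes", [])
    if not phenotypes and not payload.get("vcf_path"):
        errors.append("provide at least one phenotype or a vcf_path")

    for i, item in enumerate(phenotypes):
        if not HPO_ID.fullmatch(str(item.get("hpo_id", ""))):
            errors.append(f"phenotypes[{i}].hpo_id is not a valid HPO identifier")
    for i, term in enumerate(negated):
        if not HPO_ID.fullmatch(str(term)):
            errors.append(f"negated_phenotypes[{i}] is not a valid HPO identifier")

    # A term cannot be both observed and explicitly absent.
    for term in sorted({p.get("hpo_id") for p in phenotypes if p.get("hpo_id")} & set(negated)):
        errors.append(f"{term} is listed as both present and negated")

    if payload.get("urgency", "routine") not in ALLOWED_URGENCY:
        errors.append("urgency has an unsupported value")
    return errors


example = json.loads("""
{
  "patient_id": "RD-2026-001",
  "phenotypes": [{"hpo_id": "HP:0001252", "onset": "congenital", "severity": "severe"}],
  "negated_phenotypes": ["HP:0001249"],
  "urgency": "routine"
}
""")
print(validate_diagnose_request(example))
```

Returning a list of problems rather than raising on the first one lets the API report every issue in a single response.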
13.2 Streamlit Clinical Interface¶
The clinical user interface is built with Streamlit, providing an interactive diagnostic workbench:
- Patient Intake Panel: Structured HPO term entry with autocomplete, free-text clinical description with automatic HPO extraction, VCF upload, family history capture
- Diagnostic Dashboard: Real-time differential diagnosis list with confidence scores, evidence provenance for each candidate, interactive knowledge graph visualization
- Variant Review Panel: Filterable variant table with ACMG classification, protein structure visualization, population frequency plots, ClinVar submission history
- Therapeutic Options Panel: Approved therapies, clinical trials, gene therapy eligibility, expanded access programs
- Report Generator: One-click generation of comprehensive diagnostic reports in PDF/FHIR format
13.3 FHIR Interoperability Layer¶
The agent implements HL7 FHIR R4 resources for clinical system integration:
- DiagnosticReport: Complete diagnostic workup results
- Condition: Identified or suspected rare disease diagnoses
- Observation: Individual phenotypic findings (HPO-coded)
- MolecularSequence: Genomic variant data
- MedicationRequest: Recommended therapeutic interventions
- ResearchStudy: Matched clinical trials
FHIR resources are serialized as NDJSON for bulk export and individual JSON for point queries, enabling integration with Epic, Cerner, and other EHR systems through SMART on FHIR applications.
14. Clinical Decision Support Engines¶
14.1 HPO-to-Gene Matcher¶
The HPO-to-Gene Matcher implements a semantic similarity-based approach to identify candidate genes from patient phenotypes. Using the Information Content (IC) of each HPO term -- derived from the frequency of term annotation across all diseases -- the matcher computes pairwise similarity between patient phenotype profiles and gene-associated phenotype profiles.
Algorithm:
- For each patient HPO term, compute IC: IC(t) = -log2(p(t)), where p(t) is the fraction of diseases annotated with term t or its descendants
- For each candidate gene, retrieve all HPO-annotated diseases caused by that gene
- Compute Best Match Average (BMA) similarity between patient profile and each disease profile
- Rank genes by maximum BMA score across all associated diseases
- Apply Bayesian likelihood ratio adjustment for negated phenotypes
The matcher processes 22,000 genes in <3 seconds using pre-computed IC values and cached similarity matrices stored in the rd_phenotypes Milvus collection.
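The IC and BMA steps of the algorithm can be sketched as follows. The text does not name the term-level similarity measure, so this example assumes Resnik similarity (the IC of the most informative common ancestor), which is a common pairing with BMA; the function names, the toy ontology, and the annotation counts are all illustrative.

```python
import math


def information_content(term_disease_counts: dict, total_diseases: int) -> dict:
    """IC(t) = -log2(p(t)), with p(t) the fraction of diseases annotated with t or a descendant."""
    return {t: -math.log2(n / total_diseases) for t, n in term_disease_counts.items() if n > 0}


def ancestors_inclusive(term: str, parents: dict) -> set:
    """The term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(a: str, b: str, parents: dict, ic: dict) -> float:
    """Assumed term similarity: IC of the most informative common ancestor."""
    common = ancestors_inclusive(a, parents) & ancestors_inclusive(b, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


def best_match_average(patient: list, disease: list, parents: dict, ic: dict) -> float:
    """Average of the two directional mean best-match scores."""
    if not patient or not disease:
        return 0.0
    p2d = sum(max(resnik(p, d, parents, ic) for d in disease) for p in patient) / len(patient)
    d2p = sum(max(resnik(p, d, parents, ic) for p in patient) for d in disease) / len(disease)
    return (p2d + d2p) / 2


# Toy ontology and counts (8 diseases in total); not real HPO data.
parents = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
ic = information_content({"root": 8, "A": 4, "B": 2, "A1": 1, "A2": 2}, total_diseases=8)
print(best_match_average(["A1", "B"], ["A2", "B"], parents, ic))
```

Rare terms carry more weight: in the toy data a match on A1 (annotated to 1 of 8 diseases) contributes an IC of 3 bits, while a match that only reaches the shared parent A contributes 1 bit.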
14.2 ACMG Variant Classifier¶
Automated ACMG/AMP variant classification following the 2015 Standards and Guidelines with rare disease-specific adaptations:
Evidence Categories Evaluated:
- PVS1: Null variant in a gene where LOF is a known mechanism of disease (curated from ClinGen)
- PS1-PS4: Same amino acid change as established pathogenic, de novo in patient, well-established functional studies, prevalence in affected vs controls
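- PM1-PM6: Located in mutational hot spot, absent from controls, detected in trans with a pathogenic variant (recessive disorders), protein length change from in-frame deletion/insertion in a non-repeat region or stop-loss, novel missense at a position where a different pathogenic missense has been observed, assumed de novo without parental confirmation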
- PP1-PP5: Co-segregation with disease in multiple family members, missense in gene with low rate of benign missense, multiple computational evidence, patient phenotype highly specific for gene, reputable source reports pathogenic
The classifier outputs a five-tier classification with full evidence justification, enabling clinical geneticists to review and modify individual evidence criteria before finalizing classification.
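The final step, turning a set of met criteria into one of the five tiers, follows the combining rules published with the 2015 ACMG/AMP guidelines. The sketch below encodes those rules for criteria at their default strength; the function name is hypothetical, and strength-modified codes (for example a supporting criterion upgraded to strong) are not handled here and would need to be counted under their adjusted category.

```python
from collections import Counter


def classify_acmg(criteria) -> str:
    """Combine met ACMG/AMP criteria codes (e.g. "PVS1", "PM2") into a five-tier class."""
    # Strip the trailing number to get the strength category: PVS, PS, PM, PP, BA, BS, BP.
    n = Counter(code.rstrip("0123456789") for code in criteria)
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Contradictory pathogenic and benign evidence defaults to uncertain significance.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"


print(classify_acmg(["PVS1", "PM2"]))
```

Keeping the rule table in one pure function makes it straightforward to re-run the classification after a geneticist adds or removes an individual criterion during review.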
14.3 Orphan Drug Matcher¶
The Orphan Drug Matcher cross-references patient diagnoses against the FDA Orphan Drug Product Database and EMA Orphan Designation Registry:
- Exact disease match: Patient's confirmed or suspected diagnosis matched to approved orphan drug indications
- Pathway-based match: When no direct therapy exists, the matcher identifies drugs targeting the same biological pathway (via rd_pathways collection)
- Repurposing candidates: FDA-approved drugs for related conditions that share molecular mechanisms, identified through semantic similarity in rd_therapies
- Compassionate use: For diseases with no approved therapy, the matcher identifies manufacturers with expanded access programs
14.4 Diagnostic Algorithm Recommender¶
Based on the presenting phenotype profile, the system recommends the optimal diagnostic algorithm:
- Neurodevelopmental presentation: Chromosomal microarray -> WES -> WGS -> RNA-seq
- Metabolic crisis: Targeted metabolic panel -> Acylcarnitine/amino acids -> WES with metabolic gene focus
- Skeletal dysplasia: Skeletal survey -> Targeted gene panel -> WES
- Cardiac presentation: Targeted cardiac gene panel -> WES -> WGS for structural variants
- Immunodeficiency: Flow cytometry + TREC/KREC -> Targeted panel -> WES
- Connective tissue: Targeted panel (FBN1, COL genes) -> WES if negative
Each recommendation includes estimated cost, turnaround time, diagnostic yield based on published literature, and insurance coverage likelihood.
14.5 Family Segregation Analyzer¶
The Family Segregation Analyzer evaluates whether candidate variants co-segregate with disease status in the family:
- Pedigree parsing: Family structure encoded from pedigree input (PED format or interactive entry)
- Genotype assignment: Available family member genotypes extracted from VCF or entered manually
- LOD score calculation: Logarithm of odds (LOD) score computed for each candidate variant under the suspected inheritance model
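- Segregation evidence classification: Co-segregation is ACMG criterion PP1, applied at a strength that scales with the LOD score: LOD > 3.0 (strong, PP1_Strong), LOD 1.5-3.0 (moderate, PP1_Moderate), LOD 0.6-1.5 (supporting, PP1), LOD < 0.6 (insufficient)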
- De novo assessment: For trio sequencing, de novo variants identified with confirmation of parental genotypes and maternity/paternity verification
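The LOD thresholds above map directly onto a small lookup function. The sketch also includes a deliberately simplified LOD estimate, n * log10(2) for n informative meioses, which holds only under full penetrance, no phenocopies, and zero recombination between variant and disease locus; a production analyzer would use a proper linkage model. Function names are illustrative.

```python
import math


def simplified_lod(informative_meioses: int) -> float:
    """Simplified LOD: each informative meiosis halves the chance of co-segregation by luck.

    Assumes full penetrance, no phenocopies, and recombination fraction 0.
    """
    return informative_meioses * math.log10(2)


def segregation_evidence(lod: float) -> str:
    """Map a LOD score to the evidence strength bands given in the text."""
    if lod > 3.0:
        return "strong"
    if lod >= 1.5:
        return "moderate"
    if lod >= 0.6:
        return "supporting"
    return "insufficient"


for meioses in (1, 2, 5, 10):
    lod = simplified_lod(meioses)
    print(meioses, round(lod, 2), segregation_evidence(lod))
```

Under this simplification each informative meiosis adds about 0.30 to the LOD score, so two meioses reach the supporting band and ten are needed to exceed 3.0.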
14.6 Natural History Predictor¶
The Natural History Predictor leverages longitudinal data from rd_natural_history and rd_registries to project disease trajectory:
- Disease-specific timelines: Age-dependent probability of developing specific complications (e.g., cardiomyopathy onset in DMD, scoliosis progression in Marfan syndrome)
- Genotype-phenotype correlation: Specific variant types associated with milder or more severe disease courses (e.g., in-frame vs out-of-frame deletions in DMD)
- Surveillance recommendations: Evidence-based monitoring schedules generated from published management guidelines in rd_guidelines
- Milestone predictions: Expected developmental, functional, and organ-specific outcomes with confidence intervals based on natural history registry data
15. Reporting and Interoperability¶
15.1 Diagnostic Report Generation¶
The Rare Disease Diagnostic Agent generates comprehensive clinical-grade diagnostic reports suitable for inclusion in patient medical records. Reports follow the ACMG/AMP reporting standards for clinical genomic sequencing.
Report Sections:
- Patient Demographics and Indication: De-identified patient information, referring provider, clinical indication for testing
- Methodology Summary: Data sources analyzed, bioinformatics pipeline versions, knowledge base versions and dates
- Phenotype Summary: Patient HPO profile with semantic grouping, onset/severity annotations
- Differential Diagnosis: Ranked list of candidate diagnoses with confidence scores, evidence summaries, and distinguishing features
- Variant Findings: Classified variants in tabular format (gene, variant, zygosity, ACMG classification, associated disease, evidence summary)
- Therapeutic Implications: Available treatments, clinical trials, gene therapy eligibility, prognostic implications
- Recommendations: Suggested confirmatory testing, specialist referrals, surveillance schedule, family screening recommendations
- Evidence Appendix: Full retrieval provenance with source citations, embedding similarity scores, and knowledge graph traversal paths
15.2 Output Formats¶
Reports are generated in multiple formats for different consumption needs:
- PDF: Clinical-grade formatted report for medical records and patient communication
- FHIR Bundle: Complete diagnostic workup as interoperable FHIR R4 resources
- HL7 v2 ORU: Observation Result message for legacy laboratory information system integration
- JSON: Structured data export for downstream computational analysis
- GA4GH Phenopacket: Standardized phenotypic and genomic data exchange format for research collaboration and Matchmaker Exchange submission
15.3 Audit Trail and Provenance¶
Every diagnostic conclusion includes a complete audit trail:
- Timestamp and version of each knowledge base consulted
- Specific documents retrieved with similarity scores
- LLM prompt and response pairs used in reasoning
- Graph traversal paths taken during evidence aggregation
- User modifications to automated classifications
- Reanalysis history with differential changes between analyses
This provenance chain supports regulatory compliance (21 CFR Part 11 for electronic records), clinical quality assurance, and enables retrospective analysis of diagnostic performance.
16. Product Requirements Document¶
16.1 Problem Statement¶
Clinical geneticists and rare disease specialists spend an average of 40-60 minutes per case manually searching OMIM, Orphanet, ClinVar, PubMed, and GeneReviews to assemble evidence for diagnostic evaluation. For complex cases, this process may span multiple sessions over days or weeks. The Rare Disease Diagnostic Agent aims to reduce this evidence assembly time to under 5 minutes while improving diagnostic accuracy.
16.2 Target Users¶
| User Persona | Primary Need | Usage Frequency |
|---|---|---|
| Clinical Geneticist | Variant interpretation, differential diagnosis | 10-20 cases/week |
| Genetic Counselor | Patient education, testing recommendations | 15-25 cases/week |
| Pediatric Subspecialist | Rare disease screening in complex patients | 5-10 cases/week |
| Laboratory Geneticist | Variant classification, report generation | 20-40 variants/day |
| Rare Disease Researcher | Genotype-phenotype correlation, cohort analysis | Batch analysis |
| Undiagnosed Disease Program | Comprehensive multi-modal evaluation | 2-5 cases/week |
16.3 Functional Requirements¶
Must Have (P0):
- HPO-to-disease matching with ranked differential diagnosis (accuracy >90% for top-10 list)
- ACMG-compliant variant classification with evidence justification
- Integration with ClinVar, OMIM, Orphanet, HPO, gnomAD knowledge bases
- VCF ingestion and variant annotation pipeline
- Clinical report generation in PDF format
- FHIR R4 resource output for EHR integration
- Secure, HIPAA-compliant data handling with audit logging
- Sub-5-second query response time for phenotype matching
Should Have (P1):
- Gene therapy eligibility assessment
- Clinical trial matching via ClinicalTrials.gov integration
- Family segregation analysis with LOD score calculation
- Natural history prediction from registry data
- Newborn screening integration
- Matchmaker Exchange connectivity
Nice to Have (P2):
- RNA-seq expression outlier analysis
- Automated metabolic pathway visualization
- Multi-language report generation
- Patient-facing educational content generation
- Telemedicine integration for specialist consultation
16.4 Non-Functional Requirements¶
- Performance: Phenotype matching <5s, variant interpretation <10s, full diagnostic workup <5 min
- Availability: 99.5% uptime during clinical hours (6 AM - 10 PM local time)
- Scalability: Support 100 concurrent users, 500 cases/day
- Security: HIPAA compliance, SOC 2 Type II, encryption at rest (AES-256) and in transit (TLS 1.3)
- Data Currency: Knowledge bases updated within 30 days of source release
- Auditability: Complete provenance trail for every diagnostic conclusion
17. Data Acquisition Strategy¶
17.1 Primary Knowledge Sources¶
| Source | Records | Update Frequency | Access Method | License |
|---|---|---|---|---|
| OMIM | 16,800+ entries | Weekly | API (academic) | Academic license |
| Orphanet | 6,400+ diseases | Quarterly | RD-CODE downloads | CC BY 4.0 |
| GARD (NIH) | 7,600+ diseases | Monthly | API | Public domain |
| HPO | 16,600+ terms | Quarterly | GitHub release | Custom open |
| ClinVar | 4.1M+ submissions | Monthly | FTP download | Public domain |
| gnomAD (v4.1) | 807,162 individuals (exomes + genomes) | Major releases | Cloud/download | ODC-By 1.0 |
| GeneReviews | 870+ disease entries | Continuous | NCBI Bookshelf | Fair use |
| ClinicalTrials.gov | 3,200+ rare disease trials | Daily | API v2 | Public domain |
| Reactome | 2,600+ pathways | Quarterly | Download | CC0 1.0 |
| KEGG | 550+ pathways | Monthly | API | Academic license |
| AlphaMissense | 71M predictions | Major releases | Download | CC BY 4.0 |
| PubMed | 45,000+ rare disease | Daily | E-utilities API | Public domain |
17.2 Data Ingestion Pipeline¶
The data acquisition pipeline runs as a scheduled Nextflow workflow:
- Download: Source-specific downloaders fetch latest releases from APIs, FTP servers, and GitHub repositories
- Validation: Schema validation, record count verification, and integrity checks against previous versions
- Transformation: Source-specific parsers normalize data to internal schema (OMIM format, Orphanet XML, ClinVar VCV XML, HPO OBO)
- Embedding Generation: Text fields processed through BGE-small-en-v1.5 to generate 384-dimensional dense vectors
- Milvus Upsert: Transformed records with embeddings upserted into appropriate Milvus collections with version tracking
- Verification: Post-load queries verify collection integrity, vector index quality, and search result relevance
- Changelog: Automated changelog generation documenting added, updated, and removed records
17.3 Milvus Collection Specifications¶
The agent maintains 14 specialized Milvus collections:
| Collection | Records | Dimensions | Index Type | Metric | Purpose |
|---|---|---|---|---|---|
| rd_phenotypes | 16,600 | 384 | IVF_FLAT | COSINE | HPO term embeddings with hierarchy |
| rd_diseases | 8,500 | 384 | IVF_FLAT | COSINE | Disease descriptions and criteria |
| rd_genes | 22,000 | 384 | IVF_FLAT | COSINE | Gene function and constraint data |
| rd_variants | 4,100,000 | 384 | IVF_SQ8 | COSINE | ClinVar variant annotations |
| rd_literature | 45,000 | 384 | IVF_SQ8 | COSINE | PubMed abstracts and GeneReviews |
| rd_trials | 3,200 | 384 | IVF_FLAT | COSINE | Clinical trial descriptions |
| rd_therapies | 600 | 384 | IVF_FLAT | COSINE | Approved and investigational drugs |
| rd_case_reports | 12,000 | 384 | IVF_FLAT | COSINE | Individual case descriptions |
| rd_guidelines | 870 | 384 | IVF_FLAT | COSINE | Management guidelines (GeneReviews) |
| rd_pathways | 3,150 | 384 | IVF_FLAT | COSINE | Biological pathway descriptions |
| rd_registries | 1,500 | 384 | IVF_FLAT | COSINE | Patient registry metadata |
| rd_natural_history | 2,800 | 384 | IVF_FLAT | COSINE | Longitudinal disease progression data |
| rd_newborn_screening | 85 | 384 | IVF_FLAT | COSINE | NBS condition panels and cutoffs |
| genomic_evidence | 500,000 | 384 | IVF_SQ8 | COSINE | Functional genomic annotations |
Total estimated storage: ~28 GB vectors + ~45 GB metadata.
18. Validation and Testing Strategy¶
18.1 Clinical Validation Framework¶
The diagnostic agent must be validated against established benchmarks before clinical deployment:
Benchmark Datasets:
- Deciphering Developmental Disorders (DDD): 13,612 families with developmental disorders. The agent is evaluated on its ability to identify the causative variant in previously solved cases using only phenotype + VCF input.
- ClinVar Clinical Significance Concordance: 50,000 randomly sampled variants classified by expert panels (3-star and above). The agent's ACMG classifier is compared against expert consensus.
- Orphanet Diagnostic Test Accuracy: 500 simulated cases with known diagnoses. For each case, a subset of phenotypic features (mimicking partial presentation) is provided and the agent's top-10 differential is evaluated for diagnostic accuracy.
- LIRICAL Benchmarking Suite: Standardized phenotype-to-diagnosis benchmarks from the LIRICAL tool publication, enabling direct comparison against Exomiser, Phen2Gene, and AMELIE.
18.2 Validation Metrics¶
| Metric | Target | Measurement |
|---|---|---|
| Top-1 Diagnostic Accuracy | >60% | Correct diagnosis ranked first |
| Top-5 Diagnostic Accuracy | >85% | Correct diagnosis in top 5 |
| Top-10 Diagnostic Accuracy | >92% | Correct diagnosis in top 10 |
| ACMG Classification Concordance | >90% | Agreement with expert panel |
| Variant Sensitivity | >99% | Pathogenic variants not missed |
| Variant Specificity | >95% | Benign variants correctly excluded |
| Mean Reciprocal Rank (MRR) | >0.65 | Average reciprocal rank of correct diagnosis |
| Time to Diagnosis | <5 min | From input to ranked differential |
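The ranking metrics in this table can be computed from a list of benchmark cases, each pairing the agent's ranked differential with the known diagnosis. The following is a minimal sketch; the function names and the toy cases are illustrative, not benchmark results.

```python
def reciprocal_rank(ranked: list, truth: str) -> float:
    """1 / rank of the correct diagnosis, or 0.0 if it is absent from the list."""
    for position, diagnosis in enumerate(ranked, start=1):
        if diagnosis == truth:
            return 1.0 / position
    return 0.0


def top_k_accuracy(cases: list, k: int) -> float:
    """Fraction of cases whose correct diagnosis appears in the first k candidates."""
    hits = sum(1 for ranked, truth in cases if truth in ranked[:k])
    return hits / len(cases)


def mean_reciprocal_rank(cases: list) -> float:
    """Average reciprocal rank of the correct diagnosis across all cases."""
    return sum(reciprocal_rank(ranked, truth) for ranked, truth in cases) / len(cases)


# Toy cases: (ranked differential, known diagnosis). Placeholder IDs only.
cases = [
    (["D1", "D2", "D3"], "D1"),   # rank 1
    (["D4", "D5", "D6"], "D5"),   # rank 2
    (["D7", "D8", "D9"], "D0"),   # missed
    (["D2", "D3", "D1"], "D1"),   # rank 3
]
print(top_k_accuracy(cases, 1), top_k_accuracy(cases, 5), round(mean_reciprocal_rank(cases), 4))
```

A missed diagnosis contributes zero to MRR, so the metric penalizes both low rankings and outright misses, whereas top-k accuracy only distinguishes inside from outside the cutoff.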
18.3 Testing Levels¶
- Unit Tests: Individual component testing (HPO matcher, ACMG classifier, retrieval modules) -- target 90% code coverage
- Integration Tests: End-to-end pipeline testing from patient input to diagnostic report output
- Clinical Simulation Tests: Board-certified geneticist review of 100 agent-generated diagnostic reports for clinical acceptability
- Adversarial Testing: Edge cases including incomplete phenotype data, novel variants, ultra-rare diseases (<10 known cases), and deliberately misleading inputs
- Regression Testing: Every knowledge base update triggers regression testing against the benchmark suite to detect performance degradation
18.4 Continuous Monitoring¶
Post-deployment, the system monitors diagnostic performance through:
- Clinician feedback loop: Accept/reject/modify tracking on diagnostic suggestions
- Diagnostic concordance: Comparison of agent suggestions against final clinical diagnosis (captured at 3, 6, and 12 month follow-up)
- Knowledge base freshness: Alerts when source databases are more than 30 days stale
- Retrieval quality: Automated monitoring of embedding similarity distributions and retrieval diversity metrics
19. Regulatory Considerations¶
19.1 FDA Classification¶
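The Rare Disease Diagnostic Agent functions as a Clinical Decision Support (CDS) tool. Under Section 3060 of the 21st Century Cures Act (section 520(o)(1)(E) of the FD&C Act), CDS software is excluded from the device definition only if it satisfies all four non-device CDS criteria, including Criterion 4 (the clinician can independently review the basis for each recommendation). The system is designed toward that exemption through the following properties: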
- Not intended to replace clinical judgment: The agent provides ranked diagnostic hypotheses and evidence summaries, but does not make autonomous diagnostic decisions
- Displays underlying evidence: All diagnostic suggestions include full evidence provenance (source documents, similarity scores, graph traversal paths)
- Intended for qualified professionals: The system is designed for use by board-certified clinical geneticists, genetic counselors, and laboratory geneticists
- Enables independent review: Clinicians can review, modify, and override any automated classification or recommendation
If the system is deployed as a standalone diagnostic tool (without clinician review), it would fall outside the CDS exemption and would likely be classified as a Class II Software as a Medical Device (SaMD) requiring 510(k) clearance or De Novo classification.
19.2 HIPAA Compliance¶
The agent implements the following HIPAA safeguards:
Technical Safeguards:
- AES-256 encryption for all patient data at rest
- TLS 1.3 for all data in transit
- Role-based access control (RBAC) with minimum necessary access
- Automated session timeout (15 minutes idle)
- Complete audit logging of all data access events
- Secure key management through HashiCorp Vault or AWS KMS
Administrative Safeguards:
- Business Associate Agreements (BAAs) with all cloud service providers
- Annual security risk assessments
- Workforce training on PHI handling
- Incident response plan with 72-hour breach notification
Physical Safeguards:
- DGX Spark deployment in physically secured facility
- Encrypted backup media
- Facility access controls and monitoring
19.3 International Regulatory Landscape¶
| Jurisdiction | Regulation | Classification | Requirements |
|---|---|---|---|
| USA | FDA 21st Century Cures | CDS Criterion 4 (exempt) or Class II SaMD | Evidence of clinical validity |
| EU | EU MDR 2017/745 | Class IIa (Rule 11) | CE marking, notified body assessment |
| UK | MHRA | Class IIa | UKCA marking post-Brexit |
| Canada | Health Canada | Class II | Medical Device License |
| Australia | TGA | Class IIa | ARTG listing |
| Japan | PMDA | Class II | Shonin approval |
19.4 IVDR Considerations¶
For the genomic variant interpretation component, the EU In Vitro Diagnostic Regulation (IVDR 2017/746) may apply if the system is used to generate variant classifications that directly inform clinical decisions. Under IVDR, the variant classifier would be classified as Class C, requiring conformity assessment by a notified body and compliance with Common Specifications for companion diagnostics.
20. DGX Compute Progression¶
20.1 Deployment Tiers¶
The Rare Disease Diagnostic Agent is designed to scale across NVIDIA DGX hardware tiers:
Tier 1: DGX Spark (Entry / Clinical Lab)
| Resource | Specification | Utilization |
|---|---|---|
| GPU | NVIDIA Grace Blackwell, 128 GB unified | Embedding generation, variant annotation |
| CPU | 20 ARM cores | API serving, data preprocessing |
| Memory | 128 GB unified CPU+GPU | Milvus collections in-memory |
| Storage | 4 TB NVMe | Knowledge bases, patient data |
| Throughput | 5-10 cases/hour | Single clinical laboratory |
| Power | 500W | Desktop deployment |
Tier 2: DGX Station (Department / Hospital)
| Resource | Specification | Utilization |
|---|---|---|
| GPU | 1x A100 80GB or H100 | Concurrent embedding + genomic alignment |
| CPU | 64 cores | Parallel case processing |
| Memory | 512 GB | Full collection loading + batch processing |
| Storage | 15 TB NVMe | Extended knowledge bases, local VCF storage |
| Throughput | 30-50 cases/hour | Hospital genetics department |
| Power | 1,500W | Under-desk/rack deployment |
Tier 3: DGX SuperPOD (National Program / Research)
| Resource | Specification | Utilization |
|---|---|---|
| GPU | 8-32x H100/B200 | Population-scale genomic analysis |
| CPU | 512+ cores | Cohort analysis, model training |
| Memory | 2-8 TB | Full gnomAD, population databases |
| Storage | 100+ TB NVMe | National genomics program data |
| Throughput | 500+ cases/hour | National screening program |
| Power | 40+ kW | Data center deployment |
20.2 Scaling Architecture¶
The application is designed for horizontal scaling:
- Milvus: Scales from standalone (DGX Spark) to distributed cluster (SuperPOD) with automatic sharding
- API Layer: Stateless FastAPI instances behind load balancer, scaled by replica count
- Genomic Pipeline: Parabricks NIM instances scaled per-GPU, with queue-based workload distribution
- Embedding Pipeline: BGE model replicated across available GPU memory for parallel document processing
20.3 Cost-Performance Analysis¶
| Tier | Hardware Cost | Annual TCO | Cost per Case | Target Market |
|---|---|---|---|---|
| DGX Spark | ~$4,999 | ~$8,000 | $1.60 | Community hospital, clinic |
| DGX Station | ~$70,000 | ~$95,000 | $0.52 | Academic medical center |
| DGX SuperPOD | ~$2M+ | ~$3M+ | $0.08 | National health system |
Compared with manual diagnostic workup costs ($2,000-5,000 per case including specialist time), all tiers are projected to deliver positive ROI within the first year of deployment.
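The cost-per-case column can be sanity-checked with a short calculation. The sketch below assumes that cost per case equals annual TCO divided by annual case volume; the paper does not state the case volumes it used, so the volumes computed here are implied values, not figures from the source.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation

# Annual TCO and cost per case as listed in the table above.
tiers = {
    "DGX Spark": {"annual_tco": 8_000, "cost_per_case": 1.60},
    "DGX Station": {"annual_tco": 95_000, "cost_per_case": 0.52},
    "DGX SuperPOD": {"annual_tco": 3_000_000, "cost_per_case": 0.08},
}


def implied_annual_cases(annual_tco: float, cost_per_case: float) -> float:
    """Case volume implied by cost_per_case = annual_tco / annual_cases."""
    return annual_tco / cost_per_case


for name, tier in tiers.items():
    cases = implied_annual_cases(tier["annual_tco"], tier["cost_per_case"])
    per_hour = cases / HOURS_PER_YEAR  # average rate if run around the clock
    print(f"{name}: ~{cases:,.0f} cases/year, ~{per_hour:,.1f} cases/hour sustained")
```

Under this assumption the implied volumes are roughly 5,000, 183,000, and 37.5 million cases per year. The SuperPOD figure corresponds to a sustained rate of several thousand cases per hour, well above the 500 cases/hour floor listed for that tier, so the per-case cost depends heavily on the utilization actually achieved.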
21. Implementation Roadmap¶
21.1 Phase 1: Foundation (Months 1-3)¶
Milestone 1.1: Core Infrastructure
- Deploy Milvus standalone on DGX Spark with initial collection schema
- Implement FastAPI service framework with authentication and audit logging
- Establish CI/CD pipeline with GitHub Actions
- Configure monitoring with Prometheus and Grafana dashboards

Milestone 1.2: Knowledge Base Ingestion
- Implement HPO ontology parser and embedding pipeline (16,600 terms)
- Ingest OMIM gene-disease associations (6,500+ relationships)
- Load ClinVar variants with ACMG classifications (4.1M records)
- Ingest Orphanet disease descriptions and prevalence data (6,400 diseases)

Milestone 1.3: Core Matching Engine
- Implement HPO-to-gene semantic similarity matcher
- Build phenotype-to-disease ranking with IC-based scoring
- Develop basic ACMG variant classifier (PVS1, PS1-4, PM1-6, PP1-5)
- Integration with existing HCLS AI Factory genomics pipeline
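The IC-based phenotype-to-disease ranking named in Milestone 1.3 can be sketched as follows. This is a deliberately simplified illustration, not the planned implementation: it defines information content as IC(t) = -log2(fraction of diseases annotated with t) and scores a disease by summing the IC of exactly shared terms, ignoring the ontology hierarchy that a production matcher would exploit. The disease names and annotation sets are invented toy data.

```python
import math

# Toy annotations (hypothetical diseases). HPO terms used:
# HP:0001250 Seizure, HP:0001263 Global developmental delay, HP:0000252 Microcephaly
annotations = {
    "disease_A": {"HP:0001250", "HP:0001263", "HP:0000252"},
    "disease_B": {"HP:0001250", "HP:0001263"},
    "disease_C": {"HP:0001250"},
    "disease_D": {"HP:0001263"},
}


def information_content(annotations):
    """IC(t) = -log2(share of diseases annotated with t); rarer terms score higher."""
    n = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log2(count / n) for term, count in counts.items()}


def rank_diseases(patient_terms, annotations):
    """Rank diseases by the summed IC of terms shared with the patient."""
    ic = information_content(annotations)
    scores = {
        disease: sum(ic[t] for t in patient_terms & terms)
        for disease, terms in annotations.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_diseases({"HP:0001250", "HP:0000252"}, annotations)
```

In this toy set, microcephaly annotates only one of four diseases, so it carries far more weight than seizure, which annotates three. That weighting is the point of IC-based scoring: specific findings discriminate better than common ones.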
21.2 Phase 2: Clinical Workflows (Months 4-6)¶
Milestone 2.1: Diagnostic Workflows
- Implement 10 clinical workflows (phenotype-driven, WES/WGS, metabolic, etc.)
- Build gene therapy eligibility assessment engine
- Develop clinical trial matching integration
- Create family segregation analysis module

Milestone 2.2: Reporting
- Design and implement PDF diagnostic report templates
- Build FHIR R4 resource generators
- Implement GA4GH Phenopacket export
- Develop audit trail and provenance tracking

Milestone 2.3: UI Development
- Build Streamlit clinical interface with HPO autocomplete
- Implement interactive differential diagnosis dashboard
- Create variant review and classification panel
- Develop knowledge graph visualization
21.3 Phase 3: Validation (Months 7-9)¶
Milestone 3.1: Benchmark Validation
- Run DDD benchmark (13,612 families) and measure diagnostic accuracy
- Validate ACMG classifier against ClinVar expert panel consensus
- Execute Orphanet diagnostic test accuracy evaluation (500 simulated cases)
- Compare performance against LIRICAL, Exomiser, Phen2Gene benchmarks

Milestone 3.2: Clinical Pilot
- Partner with 2-3 academic medical centers for pilot deployment
- Process 50 retrospective solved cases per site to validate accuracy
- Collect clinician feedback on usability, report quality, and diagnostic utility
- Iterate on UI/UX based on clinical workflow observations
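Benchmark comparisons against ranking tools are commonly reported as top-k accuracy: the fraction of solved cases in which the confirmed diagnosis appears among the first k candidates. The sketch below shows that metric on invented toy data; the paper does not specify which metric the validation phase will use, so this is an assumption.

```python
def top_k_accuracy(ranked_lists, truths, k):
    """Fraction of cases whose true diagnosis is within the top k ranked candidates."""
    if len(ranked_lists) != len(truths) or not truths:
        raise ValueError("need one ranked list per non-empty set of truths")
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


# Toy retrospective cases (hypothetical disease identifiers).
ranked = [
    ["disease_A", "disease_B", "disease_C"],
    ["disease_B", "disease_A", "disease_C"],
    ["disease_C", "disease_B", "disease_A"],
    ["disease_B", "disease_C", "disease_D"],
]
truth = ["disease_A", "disease_A", "disease_A", "disease_A"]

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

Reporting several values of k (for example top-1, top-5, top-10) makes results comparable with the way gene and disease prioritization tools are typically evaluated.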
21.4 Phase 4: Production (Months 10-12)¶
Milestone 4.1: Production Hardening
- Performance optimization for sub-5-second response targets
- Security audit and penetration testing
- HIPAA compliance validation
- Disaster recovery and backup procedures

Milestone 4.2: Launch
- General availability release (open-source, Apache 2.0)
- Documentation and clinical user guides
- Training materials for clinical geneticists and genetic counselors
- Integration guides for EHR vendors
22. Risk Analysis¶
22.1 Technical Risks¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Embedding model insufficient for rare disease terminology | Medium | High | Fine-tune BGE on rare disease corpus; evaluate domain-specific embeddings (BioLORD, PubMedBERT) |
| Milvus collection size exceeds DGX Spark memory | Low | High | Implement tiered storage with IVF_SQ8 quantization; archive low-frequency records |
| LLM hallucination in diagnostic reasoning | High | Critical | Strict RAG grounding with citation verification; confidence scoring; human-in-the-loop review |
| VCF processing pipeline failures on non-standard formats | Medium | Medium | Comprehensive VCF validation; support for multi-sample VCF, gVCF, and structural variant formats |
| Knowledge base update breaks existing functionality | Medium | Medium | Automated regression testing against benchmark suite on every knowledge base update |
| API response time exceeds clinical usability threshold | Low | High | Pre-compute embeddings for common queries; implement caching layer for frequent phenotype combinations |
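The memory mitigation in the table (IVF_SQ8 quantization) can be put in rough numbers. The sketch below estimates raw vector storage for the 3.56 million-vector genomic_evidence collection at 4 bytes per component (float32) versus 1 byte per component (8-bit scalar quantization). The embedding dimension of 384 is an assumption, since this section does not state it, and index overhead such as centroids and metadata is ignored.

```python
NUM_VECTORS = 3_560_000  # genomic_evidence size quoted in the abstract
DIM = 384                # ASSUMPTION: embedding dimension, not stated in this section


def vector_bytes(num_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Raw storage for the vectors alone, excluding index structures and metadata."""
    return num_vectors * dim * bytes_per_component


float32_bytes = vector_bytes(NUM_VECTORS, DIM, 4)  # uncompressed float32
sq8_bytes = vector_bytes(NUM_VECTORS, DIM, 1)      # 8-bit scalar quantization

print(f"float32: {float32_bytes / 2**30:.2f} GiB")
print(f"SQ8:     {sq8_bytes / 2**30:.2f} GiB")
```

Under these assumptions the vectors shrink from roughly 5.1 GiB to roughly 1.3 GiB, a fourfold reduction, at the cost of some recall that would need to be measured against the benchmark suite.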
22.2 Clinical Risks¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Missed diagnosis due to incomplete knowledge base | Medium | Critical | Multiple overlapping data sources; flag cases with low retrieval confidence for manual review |
| Incorrect variant classification | Low | Critical | Conservative classification defaults (favor VUS over benign); expert review workflow for pathogenic calls |
| Over-reliance on AI by clinicians | Medium | High | Clear labeling as decision support (not diagnostic); mandatory clinician sign-off; education program |
| Bias toward well-studied populations | High | High | Monitor diagnostic accuracy stratified by ancestry; flag underrepresented population variants |
| Gene therapy eligibility false positive | Low | Critical | Conservative eligibility criteria; multi-step verification; specialist referral requirement |
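The population-bias mitigation, monitoring diagnostic accuracy stratified by ancestry, reduces to a grouped accuracy calculation. The sketch below uses invented outcome records and illustrative ancestry labels; how ancestry is assigned and which groups are tracked are left open by the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute per-group accuracy from (group_label, was_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, was_correct in records:
        totals[group] += 1
        correct[group] += int(was_correct)
    return {group: correct[group] / totals[group] for group in totals}


# Toy outcomes (hypothetical): ancestry label, whether the top diagnosis was correct.
records = [
    ("EUR", True), ("EUR", True), ("EUR", True), ("EUR", False),
    ("AFR", True), ("AFR", False),
    ("EAS", True), ("EAS", False), ("EAS", False), ("EAS", True),
]

stratified = accuracy_by_group(records)
```

Because group sizes will be unequal and often small, per-group accuracies should be reported alongside their case counts so that gaps are not over-interpreted.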
22.3 Organizational Risks¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Insufficient clinical validation data | Medium | High | Partner with UDN, 100K Genomes, DDD for validation datasets |
| Regulatory classification change | Low | High | Design for Class II SaMD compliance even under CDS exemption; maintain QMS documentation |
| Data source licensing changes | Medium | Medium | Diversify sources; maintain local cached copies; contribute to open-access alternatives |
| Key personnel dependency | Medium | Medium | Comprehensive documentation; modular architecture enabling independent development |
23. Competitive Landscape¶
23.1 Existing Diagnostic Support Tools¶
| Tool | Developer | Approach | Strengths | Limitations |
|---|---|---|---|---|
| Exomiser | Monarch Initiative | Phenotype-variant prioritization | Well-validated, open-source, cross-species phenotype data | No RAG, limited literature integration, no treatment recommendations |
| LIRICAL | Monarch Initiative | Likelihood ratio-based phenotype matching | Rigorous statistical framework, HPO-native | No variant interpretation, no therapeutic module |
| Phen2Gene | CHOP | Phenotype-to-gene ranking | Fast, HPO-based, API available | Gene-level only (no variant), no literature context |
| AMELIE | Stanford | Literature-based variant prioritization | PubMed full-text mining, strong literature coverage | Literature-only evidence, no phenotype matching |
| Face2Gene | FDNA | Facial gestalt analysis + phenotype matching | Unique dysmorphology capability, large training set | Facial-photo dependent, proprietary, narrow focus |
| Fabric GEM | Fabric Genomics | AI-driven variant prioritization | Clinical-grade, CLIA-validated | Proprietary, expensive, no phenotype-first workflow |
| Franklin | Genoox | Variant classification platform | Strong ACMG automation, community data | Variant-focused only, limited phenotype integration |
| Mastermind | Genomenon | Genomic literature search | Comprehensive literature indexing | Literature search only, no integrated diagnosis |
23.2 Competitive Differentiators¶
The HCLS AI Factory Rare Disease Diagnostic Agent differentiates through:
- Multi-modal RAG architecture: Unlike single-approach tools, the agent integrates phenotype matching, variant interpretation, literature mining, and therapeutic search in a unified retrieval framework
- 14 specialized Milvus collections: Purpose-built vector stores for each data type, enabling domain-optimized retrieval rather than generic search
- End-to-end pipeline: From raw FASTQ to diagnostic report with therapeutic recommendations -- no tool-switching required
- Open-source with on-premises deployment: Full data sovereignty on DGX hardware, unlike cloud-dependent proprietary solutions
- LLM-powered clinical reasoning: Claude-based synthesis of multi-source evidence into coherent diagnostic narratives, not just ranked lists
- Gene therapy integration: Unique capability to assess emerging gene therapy eligibility alongside traditional diagnostic workup
- HCLS AI Factory ecosystem: Native integration with existing genomics, RAG/chat, and drug discovery pipelines
- Cost-effective scaling: DGX Spark entry point at ~$5,000 vs $50,000-200,000/year for enterprise SaaS solutions
23.3 Market Positioning¶
The agent targets the underserved intersection of AI-assisted diagnosis and rare disease:
- Academic medical centers: Replace manual OMIM/Orphanet searching with integrated AI-assisted workup
- Community hospitals: Bring rare disease diagnostic expertise to facilities without genetics departments
- Newborn screening programs: Second-tier analysis for positive screening results
- Pharmaceutical companies: Patient identification for rare disease clinical trials
- National health systems: Population-scale rare disease screening and surveillance (UK 100K Genomes, All of Us)
24. Discussion¶
24.1 Addressing the Diagnostic Odyssey¶
The Rare Disease Diagnostic Agent represents a shift from the current paradigm of sequential, manual diagnostic investigation to a parallel, AI-augmented evidence synthesis approach. By simultaneously querying 14 specialized knowledge collections and applying graph-based reasoning across phenotype, genotype, and therapeutic dimensions, the system is designed to compress what traditionally takes weeks of specialist time into minutes of computation.
The diagnostic odyssey -- averaging 5-7 years and 7+ specialist consultations -- persists not because the knowledge to diagnose most rare diseases does not exist, but because that knowledge is fragmented across dozens of databases, thousands of publications, and hundreds of subspecialties. No single clinician can maintain current awareness of 8,500+ rare diseases, 22,000 genes, 4.1 million classified variants, and 600+ therapeutic options. The Rare Disease Diagnostic Agent addresses this fundamental information asymmetry through comprehensive, real-time knowledge retrieval and synthesis.
24.2 Clinical Impact Projections¶
Based on published diagnostic yields from similar AI-assisted tools (Exomiser: 80% improvement in variant prioritization; AMELIE: 72% of causal genes ranked in top 10), we project the following clinical impacts:
- Diagnostic yield improvement: 15-25% increase in diagnoses from existing genomic data through automated reanalysis and updated knowledge bases
- Time to diagnosis reduction: From 5-7 years to <1 year for patients entering the system at initial presentation
- Cost reduction: $15,000-25,000 savings per patient through reduced unnecessary testing and specialist consultations
- Therapeutic impact: 30-40% of newly diagnosed patients identified as eligible for existing therapies, gene therapies, or clinical trials
24.3 Limitations and Challenges¶
Several important limitations must be acknowledged:
- Knowledge base completeness: Despite integrating 12+ data sources, rare disease knowledge remains incomplete. Approximately 50% of suspected genetic diseases have no identified causative gene. The agent cannot diagnose diseases that have not yet been characterized.
- Population bias: Existing genomic databases (gnomAD, ClinVar) are heavily biased toward European-ancestry populations. Variant interpretation accuracy is lower for underrepresented populations, potentially exacerbating health disparities.
- Phenotype capture quality: The system's diagnostic accuracy is directly dependent on the quality and completeness of phenotype input. Incomplete or inaccurate HPO coding degrades performance.
- LLM reasoning limitations: While Claude provides sophisticated clinical reasoning, LLMs can produce plausible but incorrect conclusions (hallucination). The strict RAG grounding and evidence provenance requirements mitigate but do not eliminate this risk.
- Validation at scale: Clinical validation of rare disease diagnostics is inherently challenging due to the rarity of each condition. Achieving statistically significant accuracy measurements for individual diseases with <100 known cases requires novel validation frameworks.
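The validation-at-scale limitation can be made concrete with a confidence interval. The sketch below computes the standard Wilson score interval for a binomial proportion and compares a disease with 20 evaluable cases against a hypothetical pooled set of 2,000; the case counts are invented for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed accuracy (90%), very different certainty.
small_lo, small_hi = wilson_interval(18, 20)        # single rare disease
large_lo, large_hi = wilson_interval(1800, 2000)    # hypothetical pooled benchmark
```

With 18 of 20 cases correct, the interval spans roughly 0.70 to 0.97, which is too wide to distinguish an excellent classifier from a mediocre one. This is why per-disease accuracy claims are hard to support and why pooled or hierarchical evaluation designs are usually needed.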
24.4 Ethical Considerations¶
The deployment of AI in rare disease diagnosis raises important ethical questions:
- Equity of access: Will AI-assisted diagnosis widen or narrow the gap between well-resourced academic centers and underserved communities? The open-source, on-premises deployment model is designed to democratize access, but hardware costs and technical expertise remain barriers.
- Incidental findings: Comprehensive genomic analysis may reveal incidental findings (e.g., cancer predisposition variants) unrelated to the presenting complaint. The system must implement ACMG SF v3.2 guidelines for reportable secondary findings.
- Data sovereignty: Rare disease patients are an identifiable population even with de-identification. Strict data governance and consent frameworks are essential.
- Therapeutic hope: Identifying a diagnosis does not guarantee treatment availability. The system must communicate realistic therapeutic expectations, particularly for the 95% of rare diseases without approved therapies.
24.5 Future Directions¶
The Rare Disease Diagnostic Agent architecture enables several future capabilities:
- Federated learning: Multi-institutional model improvement without data sharing, enabling rare disease centers to collectively improve diagnostic accuracy while maintaining data sovereignty
- Longitudinal phenotyping: Continuous phenotype capture from EHR data to detect evolving disease presentations and trigger re-evaluation
- Pharmacogenomic integration: Variant-based drug metabolism prediction for prescribed therapies, ensuring rare disease patients receive optimally dosed treatments
- Patient-reported outcomes: Direct patient/caregiver input of symptoms and functional status to supplement clinical phenotyping
- Global Matchmaker Network: Deep integration with Matchmaker Exchange to identify phenotypically similar patients across international networks for novel gene-disease association discovery
25. Conclusion¶
The HCLS AI Factory Rare Disease Diagnostic Agent presents a comprehensive, multi-collection RAG architecture purpose-built for the unique challenges of rare disease diagnosis. By unifying 14 specialized Milvus vector collections spanning phenotypes, diseases, genes, variants, literature, clinical trials, therapies, case reports, guidelines, pathways, registries, natural history data, newborn screening, and genomic evidence, the system creates an integrated knowledge substrate that no manual search process can replicate.
The architecture addresses the core challenges of the diagnostic odyssey: knowledge fragmentation, phenotypic heterogeneity, extreme class imbalance, and the combinatorial complexity of genotype-phenotype correlation across thousands of diseases. Through 10 specialized clinical workflows -- from phenotype-driven diagnosis and WES/WGS interpretation to gene therapy eligibility assessment and undiagnosed disease program support -- the agent provides structured diagnostic pathways for the full spectrum of rare disease evaluation.
Six clinical decision support engines -- the HPO-to-Gene Matcher, ACMG Variant Classifier, Orphan Drug Matcher, Diagnostic Algorithm Recommender, Family Segregation Analyzer, and Natural History Predictor -- provide the computational intelligence to transform raw clinical and genomic data into actionable diagnostic insights. These engines operate on the principle of evidence convergence: diagnostic confidence increases as independent evidence streams (phenotypic, genomic, biochemical, literature) align on a common hypothesis.
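One way to formalize evidence convergence is to treat each evidence stream as an independent likelihood ratio and multiply them into the prior odds, in the spirit of the likelihood-ratio paradigm used by tools such as LIRICAL. The paper does not specify how the engines combine evidence, so the rule below, the independence assumption, and all numeric values are illustrative assumptions only.

```python
def posterior_probability(prior: float, likelihood_ratios) -> float:
    """Update a prior probability with independent likelihood ratios (naive Bayes)."""
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr  # each supporting stream multiplies the odds
    return odds / (1 + odds)


PRIOR = 0.01  # hypothetical pre-test probability for one candidate disease

one_stream = posterior_probability(PRIOR, [10])            # phenotype only
three_streams = posterior_probability(PRIOR, [10, 10, 10])  # phenotype + genomic + biochemical
```

With these toy values a single supporting stream raises the probability from 1% to about 9%, while three concordant streams raise it to about 91%. Correlated evidence streams would violate the independence assumption and overstate confidence, so any real combination rule would need calibration.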
The system is designed for deployment across the NVIDIA DGX compute continuum, from the $4,999 DGX Spark enabling community hospital deployment to DGX SuperPOD configurations supporting national genomic medicine programs. The open-source, Apache 2.0 licensing ensures that this diagnostic capability is accessible to the global rare disease community, not restricted to well-funded academic centers.
Key diseases targeted include phenylketonuria (PKU), Gaucher disease, Fabry disease, Pompe disease, spinal muscular atrophy (SMA), Duchenne muscular dystrophy (DMD), Rett syndrome, sickle cell disease, Marfan syndrome, Ehlers-Danlos syndrome (EDS), severe combined immunodeficiency (SCID), hemophilia, Li-Fraumeni syndrome, and Lynch syndrome -- representing the breadth from metabolic disorders to connective tissue diseases to cancer predisposition syndromes. The gene therapy eligibility module tracks the rapidly expanding landscape of disease-modifying and potentially curative therapies, including the antisense oligonucleotide nusinersen, the small-molecule splicing modifier risdiplam, and the gene and gene-edited cell therapies onasemnogene abeparvovec, voretigene neparvovec (Luxturna), etranacogene dezaparvovec (Hemgenix), and exagamglogene autotemcel (Casgevy).
For the estimated 300 million people worldwide living with a rare disease -- half of them children -- the difference between a 5-year diagnostic odyssey and a 5-minute AI-assisted diagnostic workup is not an incremental improvement. It is the difference between years of suffering and misdiagnosis and the possibility of timely, targeted treatment. The Rare Disease Diagnostic Agent, built on the HCLS AI Factory platform, aims to make that possibility a clinical reality.
26. References¶
- Nguengang Wakap, S., et al. (2020). Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. European Journal of Human Genetics, 28(2), 165-173. doi:10.1038/s41431-019-0508-0
- Global Genes. (2023). RARE Facts. Retrieved from https://globalgenes.org/rare-disease-facts/
- Ferreira, C.R. (2019). The burden of rare diseases. American Journal of Medical Genetics Part A, 179(6), 885-892.
- Kohler, S., et al. (2021). The Human Phenotype Ontology in 2021. Nucleic Acids Research, 49(D1), D1207-D1217. doi:10.1093/nar/gkaa1043
- Richards, S., et al. (2015). Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the ACMG and AMP. Genetics in Medicine, 17(5), 405-424. doi:10.1038/gim.2015.30
- Landrum, M.J., et al. (2024). ClinVar: improvements to accessing data. Nucleic Acids Research, 52(D1), D1265-D1273. doi:10.1093/nar/gkad1105
- Karczewski, K.J., et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809), 434-443. doi:10.1038/s41586-020-2308-7
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381(6664), eadg7492. doi:10.1126/science.adg7492
- Smedley, D., et al. (2015). Next-generation diagnostics and disease-gene discovery with the Exomiser. Nature Protocols, 10(12), 2004-2015. doi:10.1038/nprot.2015.124
- Robinson, P.N., et al. (2020). Interpretable Clinical Genomics with a Likelihood Ratio Paradigm. American Journal of Human Genetics, 107(3), 403-417. doi:10.1016/j.ajhg.2020.06.021
- Zhao, M., et al. (2020). Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genomics and Bioinformatics, 2(2), lqaa032. doi:10.1093/nargab/lqaa032
- Birgmeier, J., et al. (2020). AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Science Translational Medicine, 12(544), eaau9113.
- Hamosh, A., et al. (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research, 33(Database issue), D514-D517.
- Rath, A., et al. (2012). Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Human Mutation, 33(5), 803-808.
- Philippakis, A.A., et al. (2015). The Matchmaker Exchange: a platform for rare disease gene discovery. Human Mutation, 36(10), 915-921.
- Splinter, K., et al. (2018). Effect of genetic diagnosis on patients with previously undiagnosed disease. New England Journal of Medicine, 379(22), 2131-2139. doi:10.1056/NEJMoa1714458
- Mendell, J.R., et al. (2017). Single-dose gene-replacement therapy for spinal muscular atrophy. New England Journal of Medicine, 377(18), 1713-1722. doi:10.1056/NEJMoa1706198
- Russell, S., et al. (2017). Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with RPE65-mediated inherited retinal dystrophy. The Lancet, 390(10097), 849-860.
- Frangoul, H., et al. (2021). CRISPR-Cas9 gene editing for sickle cell disease and beta-thalassemia. New England Journal of Medicine, 384(3), 252-260. doi:10.1056/NEJMoa2031054
- Finkel, R.S., et al. (2017). Nusinersen versus sham control in infantile-onset spinal muscular atrophy. New England Journal of Medicine, 377(18), 1723-1732. doi:10.1056/NEJMoa1702752
- Turnbull, C., et al. (2018). The 100 000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ, 361, k1687. doi:10.1136/bmj.k1687
- Boycott, K.M., et al. (2017). International cooperation to enable the diagnosis of all rare genetic diseases. American Journal of Human Genetics, 100(5), 695-705. doi:10.1016/j.ajhg.2017.04.003
- Jagadeesh, K.A., et al. (2019). Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic gene prioritization. Genetics in Medicine, 21(2), 464-470.
- NVIDIA. (2025). NVIDIA DGX Spark Technical Specifications. Retrieved from https://www.nvidia.com/en-us/data-center/dgx-spark/
- Xiao, S., et al. (2024). BGE-M3: Multilingual Embedding Model for Multi-Granularity Retrieval. arXiv preprint arXiv:2402.03216.
- Jaganathan, K., et al. (2019). Predicting splicing from primary sequence with deep learning. Cell, 176(3), 535-548. doi:10.1016/j.cell.2018.12.015
- Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130. doi:10.1126/science.ade2574
- Rehm, H.L., et al. (2015). ClinGen -- the Clinical Genome Resource. New England Journal of Medicine, 372(23), 2235-2242. doi:10.1056/NEJMsr1406261
- Gainotti, S., et al. (2018). The RD-Connect Registry & Biobank Finder: a tool for sharing aggregated data on rare disease patients and biobanks. European Journal of Human Genetics, 26(5), 631-643.
- Austin, C.P., et al. (2018). Future of rare diseases research 2017-2027: an IRDiRC perspective. Clinical and Translational Science, 11(1), 21-27. doi:10.1111/cts.12500
This research paper describes a pre-implementation architecture for the Rare Disease Diagnostic Agent within the HCLS AI Factory platform. All performance projections are based on published benchmarks from comparable systems and will be validated during the clinical pilot phase. The system is intended as a clinical decision support tool and does not replace professional medical judgment.