Rare Disease Diagnostic Agent -- Learning Guide: Advanced Topics¶
ACMG 28-criteria deep dive, phenotype matching algorithms, family segregation analysis, natural history prediction, newborn screening expansion, and gene therapy pipelines.
Version: 1.0.0 Date: March 22, 2026 Author: Adam Jones Platform: NVIDIA DGX Spark -- HCLS AI Factory
Table of Contents¶
- ACMG 28-Criteria Deep Dive
- Phenotype Matching Algorithms
- Family Segregation Analysis
- Natural History Prediction
- Newborn Screening Expansion
- Gene Therapy Pipelines
- Diagnostic Yield Optimization
- Multi-Omics Integration
- Undiagnosed Disease Programs
- Pharmacogenomics in Rare Disease
- Ethical Considerations
- The Future of Rare Disease Diagnosis
1. ACMG 28-Criteria Deep Dive¶
1.1 Pathogenic Criteria -- Very Strong¶
PVS1: Null Variant in LOF-Intolerant Gene¶
What it means: A variant that completely disrupts gene function (loss-of-function) in a gene where haploinsufficiency is a known disease mechanism.
Qualifying variant types: - Nonsense (premature stop codon) - Frameshift (reading frame disruption) - Canonical splice site (+/-1,2 positions) - Initiation codon (start loss) - Single or multi-exon deletion
LOF intolerance assessment: - gnomAD pLI score > 0.9 indicates intolerance to LOF - LOEUF (Loss of Expected Upper Fraction) < 0.35 is the newer metric - Gene must have established LOF mechanism for the disease in question
Caveats: - Does NOT apply to genes where disease is caused by gain-of-function (e.g., some SCN5A variants) - Reduced strength for: last exon truncation, alternative transcripts, rescue by nonsense-mediated decay escape - The ClinGen Sequence Variant Interpretation (SVI) workgroup provides detailed decision trees for PVS1 application
Agent implementation: The system maintains a curated list of 20 LOF-intolerant genes including SCN1A, MECP2, KCNQ1, KCNH2, SCN5A, MYH7, MYBPC3, FBN1, COL1A1, COL1A2, HTT, NSD1, NIPBL, KMT2D, CHD7, PTPN11, RAF1, BRAF, FGFR3, and TCOF1.
1.2 Pathogenic Criteria -- Strong (PS1-PS4)¶
PS1: Same Amino Acid Change as Established Pathogenic¶
The same amino acid substitution has been previously established as pathogenic, regardless of the nucleotide change. Example: If p.Arg412Gly in FBN1 is pathogenic, then any nucleotide change producing p.Arg412Gly is PS1-eligible.
Caveat: Must verify the nucleotide change does not alter splicing differently.
PS2: De Novo (Both Maternity and Paternity Confirmed)¶
A de novo variant in a patient with the disease and no family history. Requires confirmation through testing of both biological parents.
Strength modifiers: - Confirmed de novo (both parents tested): Full PS2 strength - Assumed de novo (parents not available): Downgraded to PM6
PS3: Well-Established In Vitro or In Vivo Functional Studies¶
Functional studies demonstrate a damaging effect on the gene product. Examples: - Enzyme activity assay showing < 10% residual activity - Electrophysiology studies showing altered channel function - Mouse model with disease phenotype when carrying the variant
The challenge: Not all functional studies are equally reliable. ClinGen provides gene-specific guidance on which assays qualify for PS3.
PS4: Prevalence in Affected Individuals Significantly Increased vs Controls¶
Statistical evidence that the variant occurs significantly more often in affected individuals than in matched controls. Requires case-control data with adequate sample sizes and population matching.
1.3 Pathogenic Criteria -- Moderate (PM1-PM6)¶
PM1: Mutational Hot Spot / Critical Functional Domain¶
The variant is located in a well-established functional domain or known mutational hot spot. The agent maintains hot spot coordinates for: - FGFR3: positions 370-380 (transmembrane domain -- achondroplasia) - BRAF: positions 594-601 (kinase domain) - KRAS: positions 10-15, 58-63 - TP53: positions 125-300 (DNA-binding domain)
PM2: Absent from Controls¶
The variant is absent from or at extremely low frequency (< 0.0001) in gnomAD and other population databases. This criterion has been refined by ClinGen SVI to use population-specific frequency thresholds that account for disease prevalence and penetrance.
PM3: In Trans with a Known Pathogenic Variant (Recessive Disorders)¶
For autosomal recessive conditions, detecting the variant in trans (on the opposite chromosome) from a known pathogenic variant supports pathogenicity. Ideally confirmed by phasing through parental testing.
PM4: Protein Length Change in Non-Repeat Region¶
In-frame deletions or insertions that alter protein length, provided they are NOT in a repetitive region (which would trigger BP3 instead).
PM5: Novel Missense at an Established Pathogenic Position¶
A different amino acid substitution at the same position where another missense has been established as pathogenic. Example: If p.Arg412Gly is pathogenic, then p.Arg412Trp at the same position qualifies for PM5.
PM6: Assumed De Novo Without Parental Confirmation¶
Used when paternity and/or maternity is assumed but not confirmed through testing. Weaker than PS2.
1.4 Pathogenic Criteria -- Supporting (PP1-PP5)¶
PP1: Cosegregation with Disease in Multiple Affected Family Members¶
The variant tracks with the disease through the family pedigree. Strength is proportional to the number of informative meioses (see Section 3 on LOD scores).
PP2: Missense in a Gene with Low Rate of Benign Missense Variation¶
Some genes are intolerant to missense variation -- nearly all observed missense variants in these genes are pathogenic. The gene's missense Z-score in gnomAD quantifies this constraint.
PP3: Multiple Lines of Computational Evidence Support Deleterious Effect¶
In silico prediction tools (REVEL, CADD, PolyPhen-2, SIFT, MutationTaster) concordantly predict the variant is damaging. The agent evaluates the computational_prediction field.
PP4: Patient Phenotype Highly Specific for a Disease with a Single Genetic Etiology¶
When the clinical presentation is pathognomonic for a specific genetic disease. Example: ectopia lentis + tall stature + aortic root dilation is highly specific for Marfan syndrome (FBN1).
PP5: Reputable Source Recently Reports Variant as Pathogenic¶
A ClinVar submission from a reputable laboratory classifies the variant as pathogenic. Note: ClinGen SVI has recommended retiring PP5 in favor of directly evaluating the underlying evidence.
1.5 Benign Criteria (BA1, BS1-BS2, BP1-BP7)¶
BA1: Population Frequency > 5% (Standalone Benign)¶
Any variant with allele frequency > 5% in gnomAD or other population databases is classified as benign immediately, regardless of other evidence. This is the only standalone criterion in the ACMG framework.
Exception: Some conditions (e.g., hereditary hemochromatosis, HFE C282Y at 5-10% in Northern Europeans) have specific BA1 exemptions.
BS1: Greater Than Expected Frequency for Disorder¶
Allele frequency is higher than expected for the disease given its prevalence and penetrance. Calculated using the Hardy-Weinberg equation and disease-specific parameters.
BS2: Observed in a Healthy Adult with Full Penetrance¶
The variant is observed in a healthy individual who has passed the age of expected disease onset, for a condition with complete penetrance. Strongest evidence for conditions like Huntington disease where penetrance is ~100%.
BP1-BP7: Supporting Benign Evidence¶
- BP1: Missense in gene where only truncating variants cause disease
- BP3: In-frame indel in repetitive region without known function
- BP4: Multiple computational predictors suggest no functional impact
- BP6: Reputable source reports variant as benign
- BP7: Synonymous variant with no predicted splice effect
2. Phenotype Matching Algorithms¶
2.1 Information Content (IC) Scoring¶
IC quantifies the discriminating power of a phenotype term:
Where p(t) is the fraction of diseases annotated with term t in HPO.
Properties: - IC increases as term frequency decreases (rarer = more informative) - Root terms (e.g., "Phenotypic abnormality") have IC near 0 - Leaf terms (e.g., "Genu recurvatum") have high IC (6-10) - IC is additive: total information from independent terms sums
2.2 Best-Match-Average (BMA) Similarity¶
BMA is the standard algorithm for phenotype profile comparison in rare disease:
BMA(P, D) = 0.5 * (avg_p + avg_d)
avg_p = (1/|P|) * SUM over p in P of: max over d in D of sim(p, d)
avg_d = (1/|D|) * SUM over d in D of: max over p in P of sim(d, p)
Where sim(a, b) is the IC of the most informative common ancestor (MICA) of terms a and b.
Advantages: - Bidirectional: penalizes both missing and extra phenotypes - IC-weighted: specific matches count more than generic ones - Handles partial matches gracefully
2.3 Resnik Similarity¶
The Resnik similarity between two HPO terms is the IC of their Most Informative Common Ancestor (MICA):
For exact matches, sim_Resnik equals the IC of the term itself. For related but non-identical terms, the similarity is the IC of their shared ancestor.
2.4 Lin Similarity¶
Lin similarity normalizes Resnik similarity:
This produces values in [0, 1], making comparison across term pairs more interpretable.
2.5 Agent Implementation¶
The Rare Disease Diagnostic Agent uses a combined scoring approach:
Where freq_weight incorporates the frequency of each matched phenotype within the specific gene-disease association. A phenotype that is present in 95% of patients with a disease contributes more than one present in only 30%.
2.6 Ranking Algorithm¶
- Compute BMA similarity for each candidate gene
- Weight by phenotype frequency in gene-disease association
- Combine: 70% BMA + 30% frequency weight
- Normalize to [0, 1]
- Sort descending, return top_k candidates
3. Family Segregation Analysis¶
3.1 LOD Scores¶
The LOD (Logarithm of Odds) score is the standard measure for evaluating genetic linkage and variant-disease cosegregation:
Where: - theta = 0: complete linkage (variant and disease always co-occur) - theta = 0.5: no linkage (random association) - L: likelihood of observed data under each model
3.2 ACMG Segregation Evidence Levels¶
ClinGen has calibrated LOD scores to ACMG evidence strength:
| LOD Score | ACMG Evidence | Description |
|---|---|---|
| >= 3.0 | PS (Strong) | Equivalent to p < 0.001 |
| 1.5 - 2.99 | PM (Moderate) | Significant but not conclusive |
| 0.6 - 1.49 | PP (Supporting) | Suggestive cosegregation |
| < 0.6 | Insufficient | Not enough informative meioses |
3.3 Required Family Size¶
| Inheritance | Informative Meioses for PS | Typical Family Size |
|---|---|---|
| Autosomal Dominant | 7+ affected carriers | ~3 generation family |
| Autosomal Recessive | 5+ (including obligate carriers) | Multiple siblings |
| X-Linked Recessive | 5+ | Multiple generations through carrier females |
3.4 Agent Implementation¶
The Family Segregation Analyzer implements simplified LOD scoring: - Each concordant meiosis (affected carries variant, unaffected does not): +0.3 - Each discordant meiosis (affected lacks variant, or unaffected carries it): -1.0 - Supports AD, AR, XLD, XLR inheritance models - Outputs LOD score, ACMG evidence level, concordant/discordant counts
3.5 Pitfalls¶
- Reduced penetrance: Non-penetrant carriers appear discordant but are not truly so
- Phenocopies: Unrelated causes of similar phenotypes inflate discordance
- Small families: Single nuclear family provides limited evidence (LOD typically < 1.0)
- De novo variants: Family segregation is uninformative for de novo events
4. Natural History Prediction¶
4.1 What Is Natural History?¶
Natural history describes the typical disease course over a patient's lifetime: age of onset, progression rate, milestone events, complications, and survival. Understanding natural history is essential for: - Prognostic counseling for families - Anticipatory management (preventing complications before they occur) - Treatment timing decisions - Clinical trial endpoint selection
4.2 Data Sources¶
| Source | Type | Strength |
|---|---|---|
| Patient registries | Prospective cohort | Large sample, standardized data |
| Published cohort studies | Retrospective | Variable quality, publication bias |
| Case reports | Individual | Anecdotal, biased toward unusual presentations |
| Clinical trials (natural history arms) | Prospective | Well-characterized but small |
4.3 Agent Implementation¶
The Natural History Predictor covers 6 well-characterized diseases:
SMA Type 1: - Symptom onset: 0-6 months (median 3) - Loss of head control: 3-12 months - Ventilatory support: 6-24 months - Mortality without treatment: 12-48 months - Modifier: SMN2 copy number (3 copies = milder)
Duchenne Muscular Dystrophy: - Gait abnormality onset: 24-60 months (median 36) - Loss of ambulation: 84-156 months (median 120) - Cardiomyopathy onset: 120-216 months - Ventilatory support: 180-264 months - Modifier: Exon-skippable mutations (therapy-eligible)
Cystic Fibrosis: - NBS diagnosis: 0-6 months - First Pseudomonas: 6-120 months - FEV1 decline: 72-216 months - Modifier: CFTR modulators dramatically extend survival
PKU: - NBS detection: birth - Diet initiation: within 3 weeks - Normal IQ if treated early; IQ < 50 if untreated - Modifier: BH4-responsive mutations allow diet relaxation
Marfan Syndrome: - Diagnosis: 24-240 months (highly variable) - Aortic root dilation: 60-360 months - Aortic dissection risk: 240-600 months (untreated) - Modifier: Haploinsufficiency variants generally more severe
Dravet Syndrome: - First febrile seizure: 4-12 months - Afebrile seizures: 8-24 months - Developmental plateau: 18-36 months - Modifier: Truncating SCN1A variants = more severe
4.4 Genotype-Phenotype Correlation¶
Disease course often depends on the specific genotype:
| Disease | Modifier | Effect |
|---|---|---|
| SMA | SMN2 copies | More copies = milder disease |
| CF | F508del homozygous | Trikafta-eligible = dramatically improved |
| DMD | Exon-skippable mutation | Eligible for exon-skipping therapy |
| Dravet | Truncating vs missense SCN1A | Truncating = more severe |
| Marfan | Haploinsufficiency vs dominant-negative | Haploinsufficiency = more severe cardiovascular |
| PKU | BH4-responsive PAH mutations | Sapropterin may allow diet relaxation |
5. Newborn Screening Expansion¶
5.1 Current State of NBS¶
The US Recommended Uniform Screening Panel (RUSP) has expanded from 4 conditions in 1960 to 37 core conditions and 26 secondary conditions in 2024. Recent additions include: - SMA (2018) -- detectable by DNA-based testing - Pompe disease (2015) -- enzyme activity assay - MPS I (2016) -- enzyme activity assay - X-ALD (2016) -- C26:0-lysophosphatidylcholine - MPS II (2022) -- enzyme activity assay
5.2 Genomic NBS (gNBS)¶
Pilot programs are evaluating whole-genome sequencing of all newborns: - BabySeq (US): Demonstrated 9.4% of healthy newborns carry actionable genetic findings - Genomics England NBS (UK): Screening for ~200 conditions - BeginNGS (US): Screening for actionable conditions with available treatments
5.3 Expansion Criteria¶
A condition should be added to NBS if: 1. It can be detected reliably in the newborn period 2. Early detection improves outcomes compared to clinical diagnosis 3. An effective treatment or management strategy exists 4. A suitable screening test is available at acceptable cost 5. A follow-up system can handle positive results
5.4 Challenges¶
- False positives: NBS generates many false positives, causing family anxiety
- VUS in genomic NBS: WGS identifies many VUS that create clinical uncertainty
- Equity: Not all states/countries screen for the same conditions
- Incidental findings: Genomic NBS may reveal untreatable conditions or adult-onset risks
- Infrastructure: Follow-up systems struggle to handle increased volume
6. Gene Therapy Pipelines¶
6.1 AAV Gene Replacement Pipeline¶
1. Identify target gene and disease
2. Design AAV transgene cassette
- Promoter selection (tissue-specific or ubiquitous)
- Codon optimization
- Regulatory elements (enhancer, polyA)
3. Select AAV serotype
- AAV9: crosses blood-brain barrier (SMA, neurological)
- AAV2: retinal tropism (LCA)
- AAV5: hepatic tropism (hemophilia)
- AAVrh74: muscle tropism (DMD)
4. Vector production (HEK293 cells)
5. Preclinical testing (mouse -> NHP)
6. IND filing and clinical trial phases
7. BLA filing and FDA review
8. Manufacturing scale-up
9. Commercial launch and patient access
6.2 CRISPR Gene Editing Pipeline¶
1. Design guide RNA (gRNA) targeting pathogenic locus
2. Select editing strategy:
- Gene knockout (disrupt deleterious allele)
- Gene correction (HDR to fix point mutation)
- Base editing (single nucleotide change without DSB)
- Prime editing (precise insertion/deletion)
3. Ex vivo workflow:
a. Harvest patient stem cells (HSCs or T cells)
b. Electroporate CRISPR components
c. Select edited cells
d. Myeloablative conditioning
e. Reinfuse edited cells
4. In vivo workflow (future):
a. Lipid nanoparticle (LNP) delivery to target organ
b. Direct administration
6.3 Lentiviral Gene Addition Pipeline¶
1. Design lentiviral vector with therapeutic transgene
2. Self-inactivating (SIN) vector design for safety
3. Ex vivo workflow:
a. Mobilize and harvest patient HSCs
b. Transduce with lentiviral vector
c. Quality control (VCN, sterility, identity)
d. Myeloablative conditioning (busulfan)
e. Reinfuse transduced cells
4. Engraftment monitoring
5. Long-term follow-up (integration site analysis)
6.4 Eligibility Assessment Framework¶
The agent evaluates gene therapy eligibility across multiple dimensions:
| Dimension | Assessment |
|---|---|
| Disease confirmation | Molecular diagnosis with confirmed pathogenic variant |
| Genotype compatibility | Variant type compatible with therapy mechanism |
| Age eligibility | Within approved age window |
| Organ function | Sufficient baseline function for benefit |
| Anti-AAV antibodies | Seropositive patients may be ineligible for AAV therapies |
| Prior gene therapy | Most AAV therapies are one-time only |
| Geographic access | Therapy availability by region |
| Insurance coverage | Authorization and funding pathway |
7. Diagnostic Yield Optimization¶
7.1 Testing Strategy by Phenotype¶
| Clinical Scenario | Optimal First Test | Expected Yield |
|---|---|---|
| Isolated ID/DD | CMA then WES | 15-20% + 25-40% |
| ID/DD + dysmorphism | CMA then WES | 20% + 30-50% |
| Epilepsy + DD | Epilepsy gene panel or WES | 20-50% |
| Cardiomyopathy | Cardiomyopathy gene panel | 20-40% |
| Skeletal dysplasia | Skeletal survey + gene panel | 30-50% |
| Connective tissue disorder | Clinical criteria + targeted gene | 70-93% (Marfan) |
7.2 Trio vs Singleton Analysis¶
| Approach | Diagnostic Yield | Advantages |
|---|---|---|
| Singleton WES | 25-35% | Lower cost |
| Trio WES (proband + parents) | 35-50% | De novo detection, phasing, reduced VUS |
| Trio WGS | 40-60% | Non-coding variants, structural variants |
| Quad/family WGS | Higher | Complex inheritance patterns |
7.3 Reanalysis¶
Periodic reanalysis of previously negative WES/WGS data yields additional diagnoses: - 10-15% additional yield on reanalysis at 1-2 years - New gene-disease associations discovered monthly - Updated bioinformatics pipelines improve variant calling
8. Multi-Omics Integration¶
8.1 Beyond DNA Sequencing¶
| Omics Layer | Technology | Added Value |
|---|---|---|
| Transcriptomics (RNA-seq) | RNA sequencing | Detect splicing defects, expression changes |
| Metabolomics | Mass spectrometry | Identify metabolic pathway disruptions |
| Proteomics | Mass spectrometry | Protein expression and modification |
| Epigenomics | Methylation arrays | Imprinting disorders, episignatures |
8.2 RNA-seq for Unsolved Cases¶
RNA-seq from patient tissue (blood, fibroblasts, muscle) can detect: - Aberrant splicing caused by deep intronic variants - Allele-specific expression (monoallelic expression in AR disease) - Expression outliers suggesting regulatory variants
Diagnostic yield: 10-35% additional diagnoses in previously unsolved cases.
8.3 Episignatures¶
DNA methylation patterns (episignatures) can diagnose specific genetic syndromes: - Unique methylation profiles for 50+ syndromes - Can distinguish overlapping clinical presentations - Useful for VUS reclassification
9. Undiagnosed Disease Programs¶
9.1 Major Programs¶
| Program | Country | Approach |
|---|---|---|
| UDP (Undiagnosed Diseases Program) | USA (NIH) | Multi-omics, expert evaluation |
| UDNI (Undiagnosed Diseases Network International) | Global | Coordinated cross-border evaluation |
| SOLVE-RD | EU | WGS reanalysis across ERNs |
| 100,000 Genomes Project | UK | WGS with phenotype linkage |
9.2 Solving the Unsolvable¶
Strategies for patients who remain undiagnosed after standard genetic testing: 1. WGS reanalysis with updated annotations 2. RNA-seq from relevant tissue 3. Metabolomics profiling 4. Functional validation of candidate variants 5. International matchmaking (GeneMatcher, DECIPHER) 6. Animal modeling of novel gene candidates
10. Pharmacogenomics in Rare Disease¶
10.1 PGx Relevance¶
Pharmacogenomics is particularly relevant for rare disease patients because: - Many rare disease therapies have narrow therapeutic windows - Enzyme replacement therapies may interact with metabolic pathways - Anti-seizure medications require careful PGx-guided dosing - Gene therapy immunosuppression regimens benefit from PGx profiling
10.2 Key PGx Interactions¶
| Drug Class | Gene | Rare Disease Context |
|---|---|---|
| Antiepileptics | HLA-B, CYP2C19 | Dravet syndrome, epilepsy management |
| Immunosuppressants | TPMT, NUDT15 | Post-gene therapy conditioning |
| Cardiac drugs | CYP2D6, SLCO1B1 | Cardiomyopathy, channelopathy management |
| Enzyme replacement | NAT2 | Metabolic disease management |
11. Ethical Considerations¶
11.1 Key Ethical Issues¶
| Issue | Challenge |
|---|---|
| Diagnostic uncertainty | VUS results create anxiety without clear clinical action |
| Incidental findings | Genomic testing may reveal unrelated health risks |
| Reproductive implications | Carrier status affects family planning |
| Pediatric testing | Testing children for adult-onset conditions |
| Data privacy | Genomic data requires stringent protection |
| Equity | Access to genetic testing and gene therapy varies by geography and wealth |
| Cost of gene therapy | $1-3.5 million per treatment raises justice questions |
11.2 Return of Results¶
Current consensus for return of genomic results: - Report pathogenic and likely pathogenic variants in disease-related genes - Offer return of ACMG secondary findings (SF v3.2: 81 genes) - Do NOT report VUS for clinical action (but may report for monitoring) - Respect patient preferences for incidental findings
12. The Future of Rare Disease Diagnosis¶
12.1 Emerging Technologies¶
| Technology | Impact | Timeline |
|---|---|---|
| Long-read sequencing (PacBio, ONT) | Structural variants, repeat expansions | Now |
| Optical genome mapping | Large structural variants | Now |
| Single-cell RNA-seq | Cell-type-specific expression | 2-3 years |
| AI-powered facial recognition | Dysmorphology phenotyping | Now |
| Protein structure prediction (AlphaFold) | Missense variant impact | Now |
| In vivo CRISPR gene editing | One-time curative therapy | 3-5 years |
| mRNA therapeutics | Protein replacement | 2-4 years |
12.2 The Path to Zero Diagnostic Odyssey¶
A convergence of technologies may eventually eliminate the diagnostic odyssey: 1. Universal genomic NBS: WGS at birth for all newborns 2. AI-powered phenotyping: Automatic HPO extraction from EHR and imaging 3. Real-time variant interpretation: Automated ACMG classification with full evidence integration 4. Global data sharing: International rare disease databases with millions of genotype-phenotype records 5. Precision therapy matching: AI-guided therapy selection including gene therapy eligibility
The HCLS AI Factory Rare Disease Diagnostic Agent represents a step toward this vision -- demonstrating that integrated AI systems can deliver comprehensive diagnostic support across the full rare disease workflow.
Apache 2.0 License -- HCLS AI Factory