Skip to content

Rare Disease Diagnostic Agent -- Learning Guide: Advanced Topics

ACMG 28-criteria deep dive, phenotype matching algorithms, family segregation analysis, natural history prediction, newborn screening expansion, and gene therapy pipelines.

Version: 1.0.0 Date: March 22, 2026 Author: Adam Jones Platform: NVIDIA DGX Spark -- HCLS AI Factory


Table of Contents

  1. ACMG 28-Criteria Deep Dive
  2. Phenotype Matching Algorithms
  3. Family Segregation Analysis
  4. Natural History Prediction
  5. Newborn Screening Expansion
  6. Gene Therapy Pipelines
  7. Diagnostic Yield Optimization
  8. Multi-Omics Integration
  9. Undiagnosed Disease Programs
  10. Pharmacogenomics in Rare Disease
  11. Ethical Considerations
  12. The Future of Rare Disease Diagnosis

1. ACMG 28-Criteria Deep Dive

1.1 Pathogenic Criteria -- Very Strong

PVS1: Null Variant in LOF-Intolerant Gene

What it means: A variant that completely disrupts gene function (loss-of-function) in a gene where haploinsufficiency is a known disease mechanism.

Qualifying variant types: - Nonsense (premature stop codon) - Frameshift (reading frame disruption) - Canonical splice site (+/-1,2 positions) - Initiation codon (start loss) - Single or multi-exon deletion

LOF intolerance assessment: - gnomAD pLI score > 0.9 indicates intolerance to LOF - LOEUF (Loss of Expected Upper Fraction) < 0.35 is the newer metric - Gene must have established LOF mechanism for the disease in question

Caveats: - Does NOT apply to genes where disease is caused by gain-of-function (e.g., some SCN5A variants) - Reduced strength for: last exon truncation, alternative transcripts, rescue by nonsense-mediated decay escape - The ClinGen Sequence Variant Interpretation (SVI) workgroup provides detailed decision trees for PVS1 application

Agent implementation: The system maintains a curated list of 20 LOF-intolerant genes including SCN1A, MECP2, KCNQ1, KCNH2, SCN5A, MYH7, MYBPC3, FBN1, COL1A1, COL1A2, HTT, NSD1, NIPBL, KMT2D, CHD7, PTPN11, RAF1, BRAF, FGFR3, and TCOF1.

1.2 Pathogenic Criteria -- Strong (PS1-PS4)

PS1: Same Amino Acid Change as Established Pathogenic

The same amino acid substitution has been previously established as pathogenic, regardless of the nucleotide change. Example: If p.Arg412Gly in FBN1 is pathogenic, then any nucleotide change producing p.Arg412Gly is PS1-eligible.

Caveat: Must verify the nucleotide change does not alter splicing differently.

PS2: De Novo (Both Maternity and Paternity Confirmed)

A de novo variant in a patient with the disease and no family history. Requires confirmation through testing of both biological parents.

Strength modifiers: - Confirmed de novo (both parents tested): Full PS2 strength - Assumed de novo (parents not available): Downgraded to PM6

PS3: Well-Established In Vitro or In Vivo Functional Studies

Functional studies demonstrate a damaging effect on the gene product. Examples: - Enzyme activity assay showing < 10% residual activity - Electrophysiology studies showing altered channel function - Mouse model with disease phenotype when carrying the variant

The challenge: Not all functional studies are equally reliable. ClinGen provides gene-specific guidance on which assays qualify for PS3.

PS4: Prevalence in Affected Individuals Significantly Increased vs Controls

Statistical evidence that the variant occurs significantly more often in affected individuals than in matched controls. Requires case-control data with adequate sample sizes and population matching.

1.3 Pathogenic Criteria -- Moderate (PM1-PM6)

PM1: Mutational Hot Spot / Critical Functional Domain

The variant is located in a well-established functional domain or known mutational hot spot. The agent maintains hot spot coordinates for: - FGFR3: positions 370-380 (transmembrane domain -- achondroplasia) - BRAF: positions 594-601 (kinase domain) - KRAS: positions 10-15, 58-63 - TP53: positions 125-300 (DNA-binding domain)

PM2: Absent from Controls

The variant is absent from or at extremely low frequency (< 0.0001) in gnomAD and other population databases. This criterion has been refined by ClinGen SVI to use population-specific frequency thresholds that account for disease prevalence and penetrance.

PM3: In Trans with a Known Pathogenic Variant (Recessive Disorders)

For autosomal recessive conditions, detecting the variant in trans (on the opposite chromosome) from a known pathogenic variant supports pathogenicity. Ideally confirmed by phasing through parental testing.

PM4: Protein Length Change in Non-Repeat Region

In-frame deletions or insertions that alter protein length, provided they are NOT in a repetitive region (which would trigger BP3 instead).

PM5: Novel Missense at an Established Pathogenic Position

A different amino acid substitution at the same position where another missense has been established as pathogenic. Example: If p.Arg412Gly is pathogenic, then p.Arg412Trp at the same position qualifies for PM5.

PM6: Assumed De Novo Without Parental Confirmation

Used when paternity and/or maternity is assumed but not confirmed through testing. Weaker than PS2.

1.4 Pathogenic Criteria -- Supporting (PP1-PP5)

PP1: Cosegregation with Disease in Multiple Affected Family Members

The variant tracks with the disease through the family pedigree. Strength is proportional to the number of informative meioses (see Section 3 on LOD scores).

PP2: Missense in a Gene with Low Rate of Benign Missense Variation

Some genes are intolerant to missense variation -- nearly all observed missense variants in these genes are pathogenic. The gene's missense Z-score in gnomAD quantifies this constraint.

PP3: Multiple Lines of Computational Evidence Support Deleterious Effect

In silico prediction tools (REVEL, CADD, PolyPhen-2, SIFT, MutationTaster) concordantly predict the variant is damaging. The agent evaluates the computational_prediction field.

PP4: Patient Phenotype Highly Specific for a Disease with a Single Genetic Etiology

When the clinical presentation is pathognomonic for a specific genetic disease. Example: ectopia lentis + tall stature + aortic root dilation is highly specific for Marfan syndrome (FBN1).

PP5: Reputable Source Recently Reports Variant as Pathogenic

A ClinVar submission from a reputable laboratory classifies the variant as pathogenic. Note: ClinGen SVI has recommended retiring PP5 in favor of directly evaluating the underlying evidence.

1.5 Benign Criteria (BA1, BS1-BS2, BP1-BP7)

BA1: Population Frequency > 5% (Standalone Benign)

Any variant with allele frequency > 5% in gnomAD or other population databases is classified as benign immediately, regardless of other evidence. This is the only standalone criterion in the ACMG framework.

Exception: Some conditions (e.g., hereditary hemochromatosis, HFE C282Y at 5-10% in Northern Europeans) have specific BA1 exemptions.

BS1: Greater Than Expected Frequency for Disorder

Allele frequency is higher than expected for the disease given its prevalence and penetrance. Calculated using the Hardy-Weinberg equation and disease-specific parameters.

BS2: Observed in a Healthy Adult with Full Penetrance

The variant is observed in a healthy individual who has passed the age of expected disease onset, for a condition with complete penetrance. Strongest evidence for conditions like Huntington disease where penetrance is ~100%.

BP1-BP7: Supporting Benign Evidence

  • BP1: Missense in gene where only truncating variants cause disease
  • BP3: In-frame indel in repetitive region without known function
  • BP4: Multiple computational predictors suggest no functional impact
  • BP6: Reputable source reports variant as benign
  • BP7: Synonymous variant with no predicted splice effect

2. Phenotype Matching Algorithms

2.1 Information Content (IC) Scoring

IC quantifies the discriminating power of a phenotype term:

IC(t) = -log2(p(t))

Where p(t) is the fraction of diseases annotated with term t in HPO.

Properties: - IC increases as term frequency decreases (rarer = more informative) - Root terms (e.g., "Phenotypic abnormality") have IC near 0 - Leaf terms (e.g., "Genu recurvatum") have high IC (6-10) - IC is additive: total information from independent terms sums

2.2 Best-Match-Average (BMA) Similarity

BMA is the standard algorithm for phenotype profile comparison in rare disease:

BMA(P, D) = 0.5 * (avg_p + avg_d)

avg_p = (1/|P|) * SUM over p in P of: max over d in D of sim(p, d)
avg_d = (1/|D|) * SUM over d in D of: max over p in P of sim(d, p)

Where sim(a, b) is the IC of the most informative common ancestor (MICA) of terms a and b.

Advantages: - Bidirectional: penalizes both missing and extra phenotypes - IC-weighted: specific matches count more than generic ones - Handles partial matches gracefully

2.3 Resnik Similarity

The Resnik similarity between two HPO terms is the IC of their Most Informative Common Ancestor (MICA):

sim_Resnik(t1, t2) = IC(MICA(t1, t2))

For exact matches, sim_Resnik equals the IC of the term itself. For related but non-identical terms, the similarity is the IC of their shared ancestor.

2.4 Lin Similarity

Lin similarity normalizes Resnik similarity:

sim_Lin(t1, t2) = (2 * IC(MICA(t1, t2))) / (IC(t1) + IC(t2))

This produces values in [0, 1], making comparison across term pairs more interpretable.

2.5 Agent Implementation

The Rare Disease Diagnostic Agent uses a combined scoring approach:

combined = BMA * 0.7 + freq_weight * 0.3

Where freq_weight incorporates the frequency of each matched phenotype within the specific gene-disease association. A phenotype that is present in 95% of patients with a disease contributes more than one present in only 30%.

2.6 Ranking Algorithm

  1. Compute BMA similarity for each candidate gene
  2. Weight by phenotype frequency in gene-disease association
  3. Combine: 70% BMA + 30% frequency weight
  4. Normalize to [0, 1]
  5. Sort descending, return top_k candidates

3. Family Segregation Analysis

3.1 LOD Scores

The LOD (Logarithm of Odds) score is the standard measure for evaluating genetic linkage and variant-disease cosegregation:

LOD = log10(L(theta=0) / L(theta=0.5))

Where: - theta = 0: complete linkage (variant and disease always co-occur) - theta = 0.5: no linkage (random association) - L: likelihood of observed data under each model

3.2 ACMG Segregation Evidence Levels

ClinGen has calibrated LOD scores to ACMG evidence strength:

LOD Score ACMG Evidence Description
>= 3.0 PS (Strong) Equivalent to p < 0.001
1.5 - 2.99 PM (Moderate) Significant but not conclusive
0.6 - 1.49 PP (Supporting) Suggestive cosegregation
< 0.6 Insufficient Not enough informative meioses

3.3 Required Family Size

Inheritance Informative Meioses for PS Typical Family Size
Autosomal Dominant 7+ affected carriers ~3 generation family
Autosomal Recessive 5+ (including obligate carriers) Multiple siblings
X-Linked Recessive 5+ Multiple generations through carrier females

3.4 Agent Implementation

The Family Segregation Analyzer implements simplified LOD scoring: - Each concordant meiosis (affected carries variant, unaffected does not): +0.3 - Each discordant meiosis (affected lacks variant, or unaffected carries it): -1.0 - Supports AD, AR, XLD, XLR inheritance models - Outputs LOD score, ACMG evidence level, concordant/discordant counts

3.5 Pitfalls

  • Reduced penetrance: Non-penetrant carriers appear discordant but are not truly so
  • Phenocopies: Unrelated causes of similar phenotypes inflate discordance
  • Small families: Single nuclear family provides limited evidence (LOD typically < 1.0)
  • De novo variants: Family segregation is uninformative for de novo events

4. Natural History Prediction

4.1 What Is Natural History?

Natural history describes the typical disease course over a patient's lifetime: age of onset, progression rate, milestone events, complications, and survival. Understanding natural history is essential for: - Prognostic counseling for families - Anticipatory management (preventing complications before they occur) - Treatment timing decisions - Clinical trial endpoint selection

4.2 Data Sources

Source Type Strength
Patient registries Prospective cohort Large sample, standardized data
Published cohort studies Retrospective Variable quality, publication bias
Case reports Individual Anecdotal, biased toward unusual presentations
Clinical trials (natural history arms) Prospective Well-characterized but small

4.3 Agent Implementation

The Natural History Predictor covers 6 well-characterized diseases:

SMA Type 1: - Symptom onset: 0-6 months (median 3) - Loss of head control: 3-12 months - Ventilatory support: 6-24 months - Mortality without treatment: 12-48 months - Modifier: SMN2 copy number (3 copies = milder)

Duchenne Muscular Dystrophy: - Gait abnormality onset: 24-60 months (median 36) - Loss of ambulation: 84-156 months (median 120) - Cardiomyopathy onset: 120-216 months - Ventilatory support: 180-264 months - Modifier: Exon-skippable mutations (therapy-eligible)

Cystic Fibrosis: - NBS diagnosis: 0-6 months - First Pseudomonas: 6-120 months - FEV1 decline: 72-216 months - Modifier: CFTR modulators dramatically extend survival

PKU: - NBS detection: birth - Diet initiation: within 3 weeks - Normal IQ if treated early; IQ < 50 if untreated - Modifier: BH4-responsive mutations allow diet relaxation

Marfan Syndrome: - Diagnosis: 24-240 months (highly variable) - Aortic root dilation: 60-360 months - Aortic dissection risk: 240-600 months (untreated) - Modifier: Haploinsufficiency variants generally more severe

Dravet Syndrome: - First febrile seizure: 4-12 months - Afebrile seizures: 8-24 months - Developmental plateau: 18-36 months - Modifier: Truncating SCN1A variants = more severe

4.4 Genotype-Phenotype Correlation

Disease course often depends on the specific genotype:

Disease Modifier Effect
SMA SMN2 copies More copies = milder disease
CF F508del homozygous Trikafta-eligible = dramatically improved
DMD Exon-skippable mutation Eligible for exon-skipping therapy
Dravet Truncating vs missense SCN1A Truncating = more severe
Marfan Haploinsufficiency vs dominant-negative Haploinsufficiency = more severe cardiovascular
PKU BH4-responsive PAH mutations Sapropterin may allow diet relaxation

5. Newborn Screening Expansion

5.1 Current State of NBS

The US Recommended Uniform Screening Panel (RUSP) has expanded from 4 conditions in 1960 to 37 core conditions and 26 secondary conditions in 2024. Recent additions include: - SMA (2018) -- detectable by DNA-based testing - Pompe disease (2015) -- enzyme activity assay - MPS I (2016) -- enzyme activity assay - X-ALD (2016) -- C26:0-lysophosphatidylcholine - MPS II (2022) -- enzyme activity assay

5.2 Genomic NBS (gNBS)

Pilot programs are evaluating whole-genome sequencing of all newborns: - BabySeq (US): Demonstrated 9.4% of healthy newborns carry actionable genetic findings - Genomics England NBS (UK): Screening for ~200 conditions - BeginNGS (US): Screening for actionable conditions with available treatments

5.3 Expansion Criteria

A condition should be added to NBS if: 1. It can be detected reliably in the newborn period 2. Early detection improves outcomes compared to clinical diagnosis 3. An effective treatment or management strategy exists 4. A suitable screening test is available at acceptable cost 5. A follow-up system can handle positive results

5.4 Challenges

  • False positives: NBS generates many false positives, causing family anxiety
  • VUS in genomic NBS: WGS identifies many VUS that create clinical uncertainty
  • Equity: Not all states/countries screen for the same conditions
  • Incidental findings: Genomic NBS may reveal untreatable conditions or adult-onset risks
  • Infrastructure: Follow-up systems struggle to handle increased volume

6. Gene Therapy Pipelines

6.1 AAV Gene Replacement Pipeline

1. Identify target gene and disease
2. Design AAV transgene cassette
   - Promoter selection (tissue-specific or ubiquitous)
   - Codon optimization
   - Regulatory elements (enhancer, polyA)
3. Select AAV serotype
   - AAV9: crosses blood-brain barrier (SMA, neurological)
   - AAV2: retinal tropism (LCA)
   - AAV5: hepatic tropism (hemophilia)
   - AAVrh74: muscle tropism (DMD)
4. Vector production (HEK293 cells)
5. Preclinical testing (mouse -> NHP)
6. IND filing and clinical trial phases
7. BLA filing and FDA review
8. Manufacturing scale-up
9. Commercial launch and patient access

6.2 CRISPR Gene Editing Pipeline

1. Design guide RNA (gRNA) targeting pathogenic locus
2. Select editing strategy:
   - Gene knockout (disrupt deleterious allele)
   - Gene correction (HDR to fix point mutation)
   - Base editing (single nucleotide change without DSB)
   - Prime editing (precise insertion/deletion)
3. Ex vivo workflow:
   a. Harvest patient stem cells (HSCs or T cells)
   b. Electroporate CRISPR components
   c. Select edited cells
   d. Myeloablative conditioning
   e. Reinfuse edited cells
4. In vivo workflow (future):
   a. Lipid nanoparticle (LNP) delivery to target organ
   b. Direct administration

6.3 Lentiviral Gene Addition Pipeline

1. Design lentiviral vector with therapeutic transgene
2. Self-inactivating (SIN) vector design for safety
3. Ex vivo workflow:
   a. Mobilize and harvest patient HSCs
   b. Transduce with lentiviral vector
   c. Quality control (VCN, sterility, identity)
   d. Myeloablative conditioning (busulfan)
   e. Reinfuse transduced cells
4. Engraftment monitoring
5. Long-term follow-up (integration site analysis)

6.4 Eligibility Assessment Framework

The agent evaluates gene therapy eligibility across multiple dimensions:

Dimension Assessment
Disease confirmation Molecular diagnosis with confirmed pathogenic variant
Genotype compatibility Variant type compatible with therapy mechanism
Age eligibility Within approved age window
Organ function Sufficient baseline function for benefit
Anti-AAV antibodies Seropositive patients may be ineligible for AAV therapies
Prior gene therapy Most AAV therapies are one-time only
Geographic access Therapy availability by region
Insurance coverage Authorization and funding pathway

7. Diagnostic Yield Optimization

7.1 Testing Strategy by Phenotype

Clinical Scenario Optimal First Test Expected Yield
Isolated ID/DD CMA then WES 15-20% + 25-40%
ID/DD + dysmorphism CMA then WES 20% + 30-50%
Epilepsy + DD Epilepsy gene panel or WES 20-50%
Cardiomyopathy Cardiomyopathy gene panel 20-40%
Skeletal dysplasia Skeletal survey + gene panel 30-50%
Connective tissue disorder Clinical criteria + targeted gene 70-93% (Marfan)

7.2 Trio vs Singleton Analysis

Approach Diagnostic Yield Advantages
Singleton WES 25-35% Lower cost
Trio WES (proband + parents) 35-50% De novo detection, phasing, reduced VUS
Trio WGS 40-60% Non-coding variants, structural variants
Quad/family WGS Higher Complex inheritance patterns

7.3 Reanalysis

Periodic reanalysis of previously negative WES/WGS data yields additional diagnoses: - 10-15% additional yield on reanalysis at 1-2 years - New gene-disease associations discovered monthly - Updated bioinformatics pipelines improve variant calling


8. Multi-Omics Integration

8.1 Beyond DNA Sequencing

Omics Layer Technology Added Value
Transcriptomics (RNA-seq) RNA sequencing Detect splicing defects, expression changes
Metabolomics Mass spectrometry Identify metabolic pathway disruptions
Proteomics Mass spectrometry Protein expression and modification
Epigenomics Methylation arrays Imprinting disorders, episignatures

8.2 RNA-seq for Unsolved Cases

RNA-seq from patient tissue (blood, fibroblasts, muscle) can detect: - Aberrant splicing caused by deep intronic variants - Allele-specific expression (monoallelic expression in AR disease) - Expression outliers suggesting regulatory variants

Diagnostic yield: 10-35% additional diagnoses in previously unsolved cases.

8.3 Episignatures

DNA methylation patterns (episignatures) can diagnose specific genetic syndromes: - Unique methylation profiles for 50+ syndromes - Can distinguish overlapping clinical presentations - Useful for VUS reclassification


9. Undiagnosed Disease Programs

9.1 Major Programs

Program Country Approach
UDP (Undiagnosed Diseases Program) USA (NIH) Multi-omics, expert evaluation
UDNI (Undiagnosed Diseases Network International) Global Coordinated cross-border evaluation
SOLVE-RD EU WGS reanalysis across ERNs
100,000 Genomes Project UK WGS with phenotype linkage

9.2 Solving the Unsolvable

Strategies for patients who remain undiagnosed after standard genetic testing: 1. WGS reanalysis with updated annotations 2. RNA-seq from relevant tissue 3. Metabolomics profiling 4. Functional validation of candidate variants 5. International matchmaking (GeneMatcher, DECIPHER) 6. Animal modeling of novel gene candidates


10. Pharmacogenomics in Rare Disease

10.1 PGx Relevance

Pharmacogenomics is particularly relevant for rare disease patients because: - Many rare disease therapies have narrow therapeutic windows - Enzyme replacement therapies may interact with metabolic pathways - Anti-seizure medications require careful PGx-guided dosing - Gene therapy immunosuppression regimens benefit from PGx profiling

10.2 Key PGx Interactions

Drug Class Gene Rare Disease Context
Antiepileptics HLA-B, CYP2C19 Dravet syndrome, epilepsy management
Immunosuppressants TPMT, NUDT15 Post-gene therapy conditioning
Cardiac drugs CYP2D6, SLCO1B1 Cardiomyopathy, channelopathy management
Enzyme replacement NAT2 Metabolic disease management

11. Ethical Considerations

11.1 Key Ethical Issues

Issue Challenge
Diagnostic uncertainty VUS results create anxiety without clear clinical action
Incidental findings Genomic testing may reveal unrelated health risks
Reproductive implications Carrier status affects family planning
Pediatric testing Testing children for adult-onset conditions
Data privacy Genomic data requires stringent protection
Equity Access to genetic testing and gene therapy varies by geography and wealth
Cost of gene therapy $1-3.5 million per treatment raises justice questions

11.2 Return of Results

Current consensus for return of genomic results: - Report pathogenic and likely pathogenic variants in disease-related genes - Offer return of ACMG secondary findings (SF v3.2: 81 genes) - Do NOT report VUS for clinical action (but may report for monitoring) - Respect patient preferences for incidental findings


12. The Future of Rare Disease Diagnosis

12.1 Emerging Technologies

Technology Impact Timeline
Long-read sequencing (PacBio, ONT) Structural variants, repeat expansions Now
Optical genome mapping Large structural variants Now
Single-cell RNA-seq Cell-type-specific expression 2-3 years
AI-powered facial recognition Dysmorphology phenotyping Now
Protein structure prediction (AlphaFold) Missense variant impact Now
In vivo CRISPR gene editing One-time curative therapy 3-5 years
mRNA therapeutics Protein replacement 2-4 years

12.2 The Path to Zero Diagnostic Odyssey

A convergence of technologies may eventually eliminate the diagnostic odyssey: 1. Universal genomic NBS: WGS at birth for all newborns 2. AI-powered phenotyping: Automatic HPO extraction from EHR and imaging 3. Real-time variant interpretation: Automated ACMG classification with full evidence integration 4. Global data sharing: International rare disease databases with millions of genotype-phenotype records 5. Precision therapy matching: AI-guided therapy selection including gene therapy eligibility

The HCLS AI Factory Rare Disease Diagnostic Agent represents a step toward this vision -- demonstrating that integrated AI systems can deliver comprehensive diagnostic support across the full rare disease workflow.


Apache 2.0 License -- HCLS AI Factory