Rare Disease Diagnostic Agent -- Learning Guide: Advanced Topics¶

ACMG 28-criteria deep dive, phenotype matching algorithms, family segregation analysis, natural history prediction, newborn screening expansion, and gene therapy pipelines.

Version: 1.0.0 Date: March 22, 2026 Author: Adam Jones Platform: NVIDIA DGX Spark -- HCLS AI Factory

1. ACMG 28-Criteria Deep Dive¶

1.1 Pathogenic Criteria -- Very Strong¶

PVS1: Null Variant in LOF-Intolerant Gene¶

What it means: A variant that completely disrupts gene function (loss-of-function) in a gene where haploinsufficiency is a known disease mechanism.

Qualifying variant types: - Nonsense (premature stop codon) - Frameshift (reading frame disruption) - Canonical splice site (+/-1,2 positions) - Initiation codon (start loss) - Single or multi-exon deletion

LOF intolerance assessment: - gnomAD pLI score > 0.9 indicates intolerance to LOF - LOEUF (Loss of Expected Upper Fraction) < 0.35 is the newer metric - Gene must have established LOF mechanism for the disease in question

Caveats: - Does NOT apply to genes where disease is caused by gain-of-function (e.g., some SCN5A variants) - Reduced strength for: last exon truncation, alternative transcripts, rescue by nonsense-mediated decay escape - The ClinGen Sequence Variant Interpretation (SVI) workgroup provides detailed decision trees for PVS1 application

Agent implementation: The system maintains a curated list of 20 LOF-intolerant genes including SCN1A, MECP2, KCNQ1, KCNH2, SCN5A, MYH7, MYBPC3, FBN1, COL1A1, COL1A2, HTT, NSD1, NIPBL, KMT2D, CHD7, PTPN11, RAF1, BRAF, FGFR3, and TCOF1.

1.2 Pathogenic Criteria -- Strong (PS1-PS4)¶

PS1: Same Amino Acid Change as Established Pathogenic¶

The same amino acid substitution has been previously established as pathogenic, regardless of the nucleotide change. Example: If p.Arg412Gly in FBN1 is pathogenic, then any nucleotide change producing p.Arg412Gly is PS1-eligible.

Caveat: Must verify the nucleotide change does not alter splicing differently.

PS2: De Novo (Both Maternity and Paternity Confirmed)¶

A de novo variant in a patient with the disease and no family history. Requires confirmation through testing of both biological parents.

Strength modifiers: - Confirmed de novo (both parents tested): Full PS2 strength - Assumed de novo (parents not available): Downgraded to PM6

PS3: Well-Established In Vitro or In Vivo Functional Studies¶

Functional studies demonstrate a damaging effect on the gene product. Examples: - Enzyme activity assay showing < 10% residual activity - Electrophysiology studies showing altered channel function - Mouse model with disease phenotype when carrying the variant

The challenge: Not all functional studies are equally reliable. ClinGen provides gene-specific guidance on which assays qualify for PS3.

PS4: Prevalence in Affected Individuals Significantly Increased vs Controls¶

Statistical evidence that the variant occurs significantly more often in affected individuals than in matched controls. Requires case-control data with adequate sample sizes and population matching.

1.3 Pathogenic Criteria -- Moderate (PM1-PM6)¶

PM1: Mutational Hot Spot / Critical Functional Domain¶

The variant is located in a well-established functional domain or known mutational hot spot. The agent maintains hot spot coordinates for: - FGFR3: positions 370-380 (transmembrane domain -- achondroplasia) - BRAF: positions 594-601 (kinase domain) - KRAS: positions 10-15, 58-63 - TP53: positions 125-300 (DNA-binding domain)

PM2: Absent from Controls¶

The variant is absent from or at extremely low frequency (< 0.0001) in gnomAD and other population databases. This criterion has been refined by ClinGen SVI to use population-specific frequency thresholds that account for disease prevalence and penetrance.

PM3: In Trans with a Known Pathogenic Variant (Recessive Disorders)¶

For autosomal recessive conditions, detecting the variant in trans (on the opposite chromosome) from a known pathogenic variant supports pathogenicity. Ideally confirmed by phasing through parental testing.

PM4: Protein Length Change in Non-Repeat Region¶

In-frame deletions or insertions that alter protein length, provided they are NOT in a repetitive region (which would trigger BP3 instead).

PM5: Novel Missense at an Established Pathogenic Position¶

A different amino acid substitution at the same position where another missense has been established as pathogenic. Example: If p.Arg412Gly is pathogenic, then p.Arg412Trp at the same position qualifies for PM5.

PM6: Assumed De Novo Without Parental Confirmation¶

Used when paternity and/or maternity is assumed but not confirmed through testing. Weaker than PS2.

1.4 Pathogenic Criteria -- Supporting (PP1-PP5)¶

PP1: Cosegregation with Disease in Multiple Affected Family Members¶

The variant tracks with the disease through the family pedigree. Strength is proportional to the number of informative meioses (see Section 3 on LOD scores).

PP2: Missense in a Gene with Low Rate of Benign Missense Variation¶

Some genes are intolerant to missense variation -- nearly all observed missense variants in these genes are pathogenic. The gene's missense Z-score in gnomAD quantifies this constraint.

PP3: Multiple Lines of Computational Evidence Support Deleterious Effect¶

In silico prediction tools (REVEL, CADD, PolyPhen-2, SIFT, MutationTaster) concordantly predict the variant is damaging. The agent evaluates the computational_prediction field.

PP4: Patient Phenotype Highly Specific for a Disease with a Single Genetic Etiology¶

When the clinical presentation is pathognomonic for a specific genetic disease. Example: ectopia lentis + tall stature + aortic root dilation is highly specific for Marfan syndrome (FBN1).

PP5: Reputable Source Recently Reports Variant as Pathogenic¶

A ClinVar submission from a reputable laboratory classifies the variant as pathogenic. Note: ClinGen SVI has recommended retiring PP5 in favor of directly evaluating the underlying evidence.

1.5 Benign Criteria (BA1, BS1-BS2, BP1-BP7)¶

BA1: Population Frequency > 5% (Standalone Benign)¶

Any variant with allele frequency > 5% in gnomAD or other population databases is classified as benign immediately, regardless of other evidence. This is the only standalone criterion in the ACMG framework.

Exception: Some conditions (e.g., hereditary hemochromatosis, HFE C282Y at 5-10% in Northern Europeans) have specific BA1 exemptions.

BS1: Greater Than Expected Frequency for Disorder¶

Allele frequency is higher than expected for the disease given its prevalence and penetrance. Calculated using the Hardy-Weinberg equation and disease-specific parameters.

BS2: Observed in a Healthy Adult with Full Penetrance¶

The variant is observed in a healthy individual who has passed the age of expected disease onset, for a condition with complete penetrance. Strongest evidence for conditions like Huntington disease where penetrance is ~100%.

BP1-BP7: Supporting Benign Evidence¶

BP1: Missense in gene where only truncating variants cause disease
BP3: In-frame indel in repetitive region without known function
BP4: Multiple computational predictors suggest no functional impact
BP6: Reputable source reports variant as benign
BP7: Synonymous variant with no predicted splice effect

2. Phenotype Matching Algorithms¶

2.1 Information Content (IC) Scoring¶

IC quantifies the discriminating power of a phenotype term:

IC(t) = -log2(p(t))

Where p(t) is the fraction of diseases annotated with term t in HPO.

Properties: - IC increases as term frequency decreases (rarer = more informative) - Root terms (e.g., "Phenotypic abnormality") have IC near 0 - Leaf terms (e.g., "Genu recurvatum") have high IC (6-10) - IC is additive: total information from independent terms sums

2.2 Best-Match-Average (BMA) Similarity¶

BMA is the standard algorithm for phenotype profile comparison in rare disease:

BMA(P, D) = 0.5 * (avg_p + avg_d)

avg_p = (1/|P|) * SUM over p in P of: max over d in D of sim(p, d)
avg_d = (1/|D|) * SUM over d in D of: max over p in P of sim(d, p)

Where sim(a, b) is the IC of the most informative common ancestor (MICA) of terms a and b.

Advantages: - Bidirectional: penalizes both missing and extra phenotypes - IC-weighted: specific matches count more than generic ones - Handles partial matches gracefully

2.3 Resnik Similarity¶

The Resnik similarity between two HPO terms is the IC of their Most Informative Common Ancestor (MICA):

sim_Resnik(t1, t2) = IC(MICA(t1, t2))

For exact matches, sim_Resnik equals the IC of the term itself. For related but non-identical terms, the similarity is the IC of their shared ancestor.

2.4 Lin Similarity¶

Lin similarity normalizes Resnik similarity:

sim_Lin(t1, t2) = (2 * IC(MICA(t1, t2))) / (IC(t1) + IC(t2))

This produces values in [0, 1], making comparison across term pairs more interpretable.

2.5 Agent Implementation¶

The Rare Disease Diagnostic Agent uses a combined scoring approach:

combined = BMA * 0.7 + freq_weight * 0.3

Where freq_weight incorporates the frequency of each matched phenotype within the specific gene-disease association. A phenotype that is present in 95% of patients with a disease contributes more than one present in only 30%.

2.6 Ranking Algorithm¶

Compute BMA similarity for each candidate gene
Weight by phenotype frequency in gene-disease association
Combine: 70% BMA + 30% frequency weight
Normalize to [0, 1]
Sort descending, return top_k candidates

3. Family Segregation Analysis¶

3.1 LOD Scores¶

The LOD (Logarithm of Odds) score is the standard measure for evaluating genetic linkage and variant-disease cosegregation:

LOD = log10(L(theta=0) / L(theta=0.5))

Where: - theta = 0: complete linkage (variant and disease always co-occur) - theta = 0.5: no linkage (random association) - L: likelihood of observed data under each model

3.2 ACMG Segregation Evidence Levels¶

ClinGen has calibrated LOD scores to ACMG evidence strength:

LOD Score	ACMG Evidence	Description
>= 3.0	PS (Strong)	Equivalent to p < 0.001
1.5 - 2.99	PM (Moderate)	Significant but not conclusive
0.6 - 1.49	PP (Supporting)	Suggestive cosegregation
< 0.6	Insufficient	Not enough informative meioses

3.3 Required Family Size¶

Inheritance	Informative Meioses for PS	Typical Family Size
Autosomal Dominant	7+ affected carriers	~3 generation family
Autosomal Recessive	5+ (including obligate carriers)	Multiple siblings
X-Linked Recessive	5+	Multiple generations through carrier females

3.4 Agent Implementation¶

The Family Segregation Analyzer implements simplified LOD scoring: - Each concordant meiosis (affected carries variant, unaffected does not): +0.3 - Each discordant meiosis (affected lacks variant, or unaffected carries it): -1.0 - Supports AD, AR, XLD, XLR inheritance models - Outputs LOD score, ACMG evidence level, concordant/discordant counts

3.5 Pitfalls¶

Reduced penetrance: Non-penetrant carriers appear discordant but are not truly so
Phenocopies: Unrelated causes of similar phenotypes inflate discordance
Small families: Single nuclear family provides limited evidence (LOD typically < 1.0)
De novo variants: Family segregation is uninformative for de novo events

4. Natural History Prediction¶

4.1 What Is Natural History?¶

Natural history describes the typical disease course over a patient's lifetime: age of onset, progression rate, milestone events, complications, and survival. Understanding natural history is essential for: - Prognostic counseling for families - Anticipatory management (preventing complications before they occur) - Treatment timing decisions - Clinical trial endpoint selection

4.2 Data Sources¶

Source	Type	Strength
Patient registries	Prospective cohort	Large sample, standardized data
Published cohort studies	Retrospective	Variable quality, publication bias
Case reports	Individual	Anecdotal, biased toward unusual presentations
Clinical trials (natural history arms)	Prospective	Well-characterized but small

4.3 Agent Implementation¶

The Natural History Predictor covers 6 well-characterized diseases:

SMA Type 1: - Symptom onset: 0-6 months (median 3) - Loss of head control: 3-12 months - Ventilatory support: 6-24 months - Mortality without treatment: 12-48 months - Modifier: SMN2 copy number (3 copies = milder)

Duchenne Muscular Dystrophy: - Gait abnormality onset: 24-60 months (median 36) - Loss of ambulation: 84-156 months (median 120) - Cardiomyopathy onset: 120-216 months - Ventilatory support: 180-264 months - Modifier: Exon-skippable mutations (therapy-eligible)

Cystic Fibrosis: - NBS diagnosis: 0-6 months - First Pseudomonas: 6-120 months - FEV1 decline: 72-216 months - Modifier: CFTR modulators dramatically extend survival

PKU: - NBS detection: birth - Diet initiation: within 3 weeks - Normal IQ if treated early; IQ < 50 if untreated - Modifier: BH4-responsive mutations allow diet relaxation

Marfan Syndrome: - Diagnosis: 24-240 months (highly variable) - Aortic root dilation: 60-360 months - Aortic dissection risk: 240-600 months (untreated) - Modifier: Haploinsufficiency variants generally more severe

Dravet Syndrome: - First febrile seizure: 4-12 months - Afebrile seizures: 8-24 months - Developmental plateau: 18-36 months - Modifier: Truncating SCN1A variants = more severe

4.4 Genotype-Phenotype Correlation¶

Disease course often depends on the specific genotype:

Disease	Modifier	Effect
SMA	SMN2 copies	More copies = milder disease
CF	F508del homozygous	Trikafta-eligible = dramatically improved
DMD	Exon-skippable mutation	Eligible for exon-skipping therapy
Dravet	Truncating vs missense SCN1A	Truncating = more severe
Marfan	Haploinsufficiency vs dominant-negative	Haploinsufficiency = more severe cardiovascular
PKU	BH4-responsive PAH mutations	Sapropterin may allow diet relaxation

5. Newborn Screening Expansion¶

5.1 Current State of NBS¶

The US Recommended Uniform Screening Panel (RUSP) has expanded from 4 conditions in 1960 to 37 core conditions and 26 secondary conditions in 2024. Recent additions include: - SMA (2018) -- detectable by DNA-based testing - Pompe disease (2015) -- enzyme activity assay - MPS I (2016) -- enzyme activity assay - X-ALD (2016) -- C26:0-lysophosphatidylcholine - MPS II (2022) -- enzyme activity assay

5.2 Genomic NBS (gNBS)¶

Pilot programs are evaluating whole-genome sequencing of all newborns: - BabySeq (US): Demonstrated 9.4% of healthy newborns carry actionable genetic findings - Genomics England NBS (UK): Screening for ~200 conditions - BeginNGS (US): Screening for actionable conditions with available treatments

5.3 Expansion Criteria¶

A condition should be added to NBS if: 1. It can be detected reliably in the newborn period 2. Early detection improves outcomes compared to clinical diagnosis 3. An effective treatment or management strategy exists 4. A suitable screening test is available at acceptable cost 5. A follow-up system can handle positive results

5.4 Challenges¶

False positives: NBS generates many false positives, causing family anxiety
VUS in genomic NBS: WGS identifies many VUS that create clinical uncertainty
Equity: Not all states/countries screen for the same conditions
Incidental findings: Genomic NBS may reveal untreatable conditions or adult-onset risks
Infrastructure: Follow-up systems struggle to handle increased volume

6. Gene Therapy Pipelines¶

6.1 AAV Gene Replacement Pipeline¶

1. Identify target gene and disease
2. Design AAV transgene cassette
   - Promoter selection (tissue-specific or ubiquitous)
   - Codon optimization
   - Regulatory elements (enhancer, polyA)
3. Select AAV serotype
   - AAV9: crosses blood-brain barrier (SMA, neurological)
   - AAV2: retinal tropism (LCA)
   - AAV5: hepatic tropism (hemophilia)
   - AAVrh74: muscle tropism (DMD)
4. Vector production (HEK293 cells)
5. Preclinical testing (mouse -> NHP)
6. IND filing and clinical trial phases
7. BLA filing and FDA review
8. Manufacturing scale-up
9. Commercial launch and patient access

6.2 CRISPR Gene Editing Pipeline¶

1. Design guide RNA (gRNA) targeting pathogenic locus
2. Select editing strategy:
   - Gene knockout (disrupt deleterious allele)
   - Gene correction (HDR to fix point mutation)
   - Base editing (single nucleotide change without DSB)
   - Prime editing (precise insertion/deletion)
3. Ex vivo workflow:
   a. Harvest patient stem cells (HSCs or T cells)
   b. Electroporate CRISPR components
   c. Select edited cells
   d. Myeloablative conditioning
   e. Reinfuse edited cells
4. In vivo workflow (future):
   a. Lipid nanoparticle (LNP) delivery to target organ
   b. Direct administration

6.3 Lentiviral Gene Addition Pipeline¶

1. Design lentiviral vector with therapeutic transgene
2. Self-inactivating (SIN) vector design for safety
3. Ex vivo workflow:
   a. Mobilize and harvest patient HSCs
   b. Transduce with lentiviral vector
   c. Quality control (VCN, sterility, identity)
   d. Myeloablative conditioning (busulfan)
   e. Reinfuse transduced cells
4. Engraftment monitoring
5. Long-term follow-up (integration site analysis)

6.4 Eligibility Assessment Framework¶

The agent evaluates gene therapy eligibility across multiple dimensions:

Dimension	Assessment
Disease confirmation	Molecular diagnosis with confirmed pathogenic variant
Genotype compatibility	Variant type compatible with therapy mechanism
Age eligibility	Within approved age window
Organ function	Sufficient baseline function for benefit
Anti-AAV antibodies	Seropositive patients may be ineligible for AAV therapies
Prior gene therapy	Most AAV therapies are one-time only
Geographic access	Therapy availability by region
Insurance coverage	Authorization and funding pathway

7. Diagnostic Yield Optimization¶

7.1 Testing Strategy by Phenotype¶

Clinical Scenario	Optimal First Test	Expected Yield
Isolated ID/DD	CMA then WES	15-20% + 25-40%
ID/DD + dysmorphism	CMA then WES	20% + 30-50%
Epilepsy + DD	Epilepsy gene panel or WES	20-50%
Cardiomyopathy	Cardiomyopathy gene panel	20-40%
Skeletal dysplasia	Skeletal survey + gene panel	30-50%
Connective tissue disorder	Clinical criteria + targeted gene	70-93% (Marfan)

7.2 Trio vs Singleton Analysis¶

Approach	Diagnostic Yield	Advantages
Singleton WES	25-35%	Lower cost
Trio WES (proband + parents)	35-50%	De novo detection, phasing, reduced VUS
Trio WGS	40-60%	Non-coding variants, structural variants
Quad/family WGS	Higher	Complex inheritance patterns

7.3 Reanalysis¶

Periodic reanalysis of previously negative WES/WGS data yields additional diagnoses: - 10-15% additional yield on reanalysis at 1-2 years - New gene-disease associations discovered monthly - Updated bioinformatics pipelines improve variant calling

8. Multi-Omics Integration¶

8.1 Beyond DNA Sequencing¶

Omics Layer	Technology	Added Value
Transcriptomics (RNA-seq)	RNA sequencing	Detect splicing defects, expression changes
Metabolomics	Mass spectrometry	Identify metabolic pathway disruptions
Proteomics	Mass spectrometry	Protein expression and modification
Epigenomics	Methylation arrays	Imprinting disorders, episignatures

8.2 RNA-seq for Unsolved Cases¶

RNA-seq from patient tissue (blood, fibroblasts, muscle) can detect: - Aberrant splicing caused by deep intronic variants - Allele-specific expression (monoallelic expression in AR disease) - Expression outliers suggesting regulatory variants

Diagnostic yield: 10-35% additional diagnoses in previously unsolved cases.

8.3 Episignatures¶

DNA methylation patterns (episignatures) can diagnose specific genetic syndromes: - Unique methylation profiles for 50+ syndromes - Can distinguish overlapping clinical presentations - Useful for VUS reclassification

9. Undiagnosed Disease Programs¶

9.1 Major Programs¶

Program	Country	Approach
UDP (Undiagnosed Diseases Program)	USA (NIH)	Multi-omics, expert evaluation
UDNI (Undiagnosed Diseases Network International)	Global	Coordinated cross-border evaluation
SOLVE-RD	EU	WGS reanalysis across ERNs
100,000 Genomes Project	UK	WGS with phenotype linkage

9.2 Solving the Unsolvable¶

Strategies for patients who remain undiagnosed after standard genetic testing: 1. WGS reanalysis with updated annotations 2. RNA-seq from relevant tissue 3. Metabolomics profiling 4. Functional validation of candidate variants 5. International matchmaking (GeneMatcher, DECIPHER) 6. Animal modeling of novel gene candidates

10. Pharmacogenomics in Rare Disease¶

10.1 PGx Relevance¶

Pharmacogenomics is particularly relevant for rare disease patients because: - Many rare disease therapies have narrow therapeutic windows - Enzyme replacement therapies may interact with metabolic pathways - Anti-seizure medications require careful PGx-guided dosing - Gene therapy immunosuppression regimens benefit from PGx profiling

10.2 Key PGx Interactions¶

Drug Class	Gene	Rare Disease Context
Antiepileptics	HLA-B, CYP2C19	Dravet syndrome, epilepsy management
Immunosuppressants	TPMT, NUDT15	Post-gene therapy conditioning
Cardiac drugs	CYP2D6, SLCO1B1	Cardiomyopathy, channelopathy management
Enzyme replacement	NAT2	Metabolic disease management

11. Ethical Considerations¶

11.1 Key Ethical Issues¶

Issue	Challenge
Diagnostic uncertainty	VUS results create anxiety without clear clinical action
Incidental findings	Genomic testing may reveal unrelated health risks
Reproductive implications	Carrier status affects family planning
Pediatric testing	Testing children for adult-onset conditions
Data privacy	Genomic data requires stringent protection
Equity	Access to genetic testing and gene therapy varies by geography and wealth
Cost of gene therapy	$1-3.5 million per treatment raises justice questions

11.2 Return of Results¶

Current consensus for return of genomic results: - Report pathogenic and likely pathogenic variants in disease-related genes - Offer return of ACMG secondary findings (SF v3.2: 81 genes) - Do NOT report VUS for clinical action (but may report for monitoring) - Respect patient preferences for incidental findings

12. The Future of Rare Disease Diagnosis¶

12.1 Emerging Technologies¶

Technology	Impact	Timeline
Long-read sequencing (PacBio, ONT)	Structural variants, repeat expansions	Now
Optical genome mapping	Large structural variants	Now
Single-cell RNA-seq	Cell-type-specific expression	2-3 years
AI-powered facial recognition	Dysmorphology phenotyping	Now
Protein structure prediction (AlphaFold)	Missense variant impact	Now
In vivo CRISPR gene editing	One-time curative therapy	3-5 years
mRNA therapeutics	Protein replacement	2-4 years

12.2 The Path to Zero Diagnostic Odyssey¶

A convergence of technologies may eventually eliminate the diagnostic odyssey: 1. Universal genomic NBS: WGS at birth for all newborns 2. AI-powered phenotyping: Automatic HPO extraction from EHR and imaging 3. Real-time variant interpretation: Automated ACMG classification with full evidence integration 4. Global data sharing: International rare disease databases with millions of genotype-phenotype records 5. Precision therapy matching: AI-guided therapy selection including gene therapy eligibility

The HCLS AI Factory Rare Disease Diagnostic Agent represents a step toward this vision -- demonstrating that integrated AI systems can deliver comprehensive diagnostic support across the full rare disease workflow.

Apache 2.0 License -- HCLS AI Factory