Single-Cell Intelligence Agent -- Learning Guide: Advanced Topics¶

Version: 1.0.0
Date: 2026-03-22
Author: Adam Jones

1. TME Classification: Deep Dive¶

1.1 The Four Immunophenotypes¶

Tumor microenvironment classification is central to immunotherapy patient selection. The Single-Cell Intelligence Agent works with four phenotypes, each with distinct cellular composition, spatial organization, and therapeutic implications.

Hot-Inflamed TME¶

Cellular hallmarks:
- CD8+ T cells > 15% of total cells
- Active cytotoxic gene program (GZMB+, PRF1+, IFNG+)
- PD-L1 expression on tumor and immune cells
- Tertiary lymphoid structures may be present

Molecular signatures:
- Interferon-gamma signaling: STAT1, IRF1, CXCL9, CXCL10, CXCL11
- Cytotoxic effector program: GZMA, GZMB, PRF1, GNLY, NKG7
- Antigen presentation: HLA-A, HLA-B, HLA-C, B2M, TAP1, TAP2

Clinical implication: Strong candidate for checkpoint inhibitor monotherapy. Response rates: 30-50% for anti-PD-1/PD-L1.

Cold-Desert TME¶

Cellular hallmarks:
- Total immune infiltrate < 10% of cells
- Minimal T cell presence (< 2% CD8+)
- Low neoantigen load (low TMB)
- No tertiary lymphoid structures

Molecular signatures:
- Absent IFN-gamma signaling
- Low MHC class I expression
- Active Wnt/beta-catenin signaling (immune exclusion)
- PTEN loss (PI3K pathway activation)

Clinical implication: Checkpoint inhibitors alone are ineffective. Requires immune priming:
- Oncolytic virus (T-VEC) to induce immunogenic cell death
- STING agonist to activate innate immunity
- Radiation therapy for abscopal effect
- Bispecific T-cell engagers (BiTEs) to bypass recruitment defect

Excluded TME¶

Cellular hallmarks:
- Immune cells present at tumor margin but excluded from core
- Dense stromal barrier (fibroblasts, CAFs, myofibroblasts)
- Immune cells "stuck" at the invasive front
- Angiogenic vasculature without immune migration signals

Molecular signatures:
- TGF-beta signaling: TGFB1, TGFB2, SMAD2, SMAD3
- Stromal activation: COL1A1, COL1A2, FN1, POSTN
- CXCL12/CXCR4 axis (immune cell trapping)
- VEGF-driven angiogenesis without CXCL9/10 chemokines

Clinical implication: Target the stromal barrier:
- Anti-TGF-beta (bintrafusp alfa) to reduce fibrosis
- Anti-VEGF + anti-PD-L1 combination (atezolizumab + bevacizumab)
- FAK inhibitors to disrupt stromal architecture
- Anti-CXCR4 to release trapped immune cells

Immunosuppressive TME¶

Cellular hallmarks:
- Immune cells present and infiltrating but functionally suppressed
- High regulatory T cell (Treg) fraction (> 10%)
- M2-polarized macrophages dominant
- Myeloid-derived suppressor cells (MDSCs) present

Molecular signatures:
- Immunosuppressive cytokines: IL-10, TGF-beta, IL-35
- Metabolic suppression: IDO1, ARG1, NOS2 (tryptophan/arginine depletion)
- Exhaustion markers on T cells: LAG3, TIM3/HAVCR2, TIGIT, TOX
- Checkpoint overexpression: CTLA-4, PD-1 on T cells; PD-L1, PD-L2 on myeloid

Clinical implication: Multi-pronged approach:
- Dual checkpoint (anti-PD-1 + anti-CTLA-4)
- Treg depletion (anti-CCR8, low-dose cyclophosphamide)
- Macrophage reprogramming (CSF1R inhibitor)
- MDSC differentiation (ATRA, HDAC inhibitor)

1.2 Classification Algorithm¶

The Single-Cell Intelligence Agent's TMEClassifier implements a hierarchical decision tree:

Step 1: Spatial override
  - "absent" + immune < 5%  -->  COLD_DESERT
  - "margin" + immune > 5%  -->  EXCLUDED

Step 2: Hot-inflamed check
  - CD8 >= 15% AND immune >= 25%
    - Suppressive > 0.4  -->  IMMUNOSUPPRESSIVE
    - Otherwise           -->  HOT_INFLAMED

Step 3: Excluded check
  - Immune >= 10% AND stromal > 20%  -->  EXCLUDED

Step 4: Immunosuppressive check
  - Suppressive > 0.3 AND immune >= 10%  -->  IMMUNOSUPPRESSIVE

Step 5: Cold check
  - Immune < 10%  -->  COLD_DESERT

Step 6: PD-L1 rescue
  - PD-L1 high AND CD8 >= 5%  -->  HOT_INFLAMED

Default: COLD_DESERT

The suppressive score is a weighted combination:
- 50%: suppressive cell fraction (Treg + MDSC + M2 macrophage) / 0.2
- 50%: suppressive gene score (IDO1, TGFB1, IL10, VEGFA, ARG1, NOS2)
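The decision tree and suppressive score above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the agent's actual TMEClassifier: the function names, argument names, and the `TME` enum are hypothetical, all cell fractions are expressed as values in [0, 1], and clipping both score terms to [0, 1] is an assumption (the guide only states the division by 0.2 and the equal weights).

```python
from enum import Enum
from typing import Optional


class TME(Enum):
    HOT_INFLAMED = "hot_inflamed"
    COLD_DESERT = "cold_desert"
    EXCLUDED = "excluded"
    IMMUNOSUPPRESSIVE = "immunosuppressive"


def suppressive_score(suppressive_cell_fraction: float, gene_score: float) -> float:
    """Equal-weight blend of the normalized cell fraction and the gene score.

    Assumption: both terms are clipped to [0, 1].
    """
    cell_term = min(max(suppressive_cell_fraction / 0.2, 0.0), 1.0)
    gene_term = min(max(gene_score, 0.0), 1.0)
    return 0.5 * cell_term + 0.5 * gene_term


def classify_tme(
    cd8: float,
    immune: float,
    stromal: float,
    suppressive: float,
    spatial: Optional[str] = None,
    pdl1_high: bool = False,
) -> TME:
    """Apply the six steps in order; the first matching rule wins."""
    # Step 1: spatial override
    if spatial == "absent" and immune < 0.05:
        return TME.COLD_DESERT
    if spatial == "margin" and immune > 0.05:
        return TME.EXCLUDED
    # Step 2: hot-inflamed check, split by suppressive score
    if cd8 >= 0.15 and immune >= 0.25:
        return TME.IMMUNOSUPPRESSIVE if suppressive > 0.4 else TME.HOT_INFLAMED
    # Step 3: excluded check
    if immune >= 0.10 and stromal > 0.20:
        return TME.EXCLUDED
    # Step 4: immunosuppressive check
    if suppressive > 0.3 and immune >= 0.10:
        return TME.IMMUNOSUPPRESSIVE
    # Step 5: cold check
    if immune < 0.10:
        return TME.COLD_DESERT
    # Step 6: PD-L1 rescue
    if pdl1_high and cd8 >= 0.05:
        return TME.HOT_INFLAMED
    return TME.COLD_DESERT


score = suppressive_score(suppressive_cell_fraction=0.10, gene_score=0.4)
print(classify_tme(cd8=0.20, immune=0.30, stromal=0.05, suppressive=score))
```

Because the rules are evaluated top to bottom, a sample with a dense stroma and a high suppressive score is labeled EXCLUDED (Step 3) before the immunosuppressive check in Step 4 is reached.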

1.3 Evidence Levels for TME Classification¶

Evidence Available	Level	Confidence
Spatial context + PD-L1 TPS + scRNA-seq	STRONG	High
PD-L1 TPS + scRNA-seq (no spatial)	MODERATE	Medium
scRNA-seq only (no PD-L1, no spatial)	LIMITED	Low

2. Subclonal Architecture and Clonal Dynamics¶

2.1 Why Subclones Matter¶

Cancer is not a monolithic disease. A single tumor contains multiple subclonal populations, each with distinct:
- Somatic mutation profiles (driver and passenger)
- Copy number aberrations (gains, losses, LOH)
- Transcriptomic programs (proliferation, invasion, immune evasion)
- Drug sensitivity profiles

Under therapeutic selective pressure, resistant subclones expand:

Before Treatment:
  Clone A (80%): Drug-sensitive, antigen+
  Clone B (15%): Moderate sensitivity, antigen+
  Clone C (5%):  Resistant, antigen-negative

After 8 Weeks of CAR-T:
  Clone A (5%):  Depleted by CAR-T
  Clone B (20%): Partially depleted
  Clone C (75%): Expanded (antigen escape)

2.2 Single-Cell Subclonal Detection¶

Methods for inferring subclonal architecture from scRNA-seq:

Method	Input	Output	Mechanism
inferCNV	scRNA-seq expression	Clone-specific CNV profiles	Expression deviation from normal reference
CopyKAT	scRNA-seq expression	Aneuploid/diploid classification	Bayesian segmentation
Numbat	scRNA-seq + genotype	Haplotype-aware CNV + clone tree	Allele-specific expression
clonealign	scRNA-seq + scDNA-seq	Clone-to-transcriptome mapping	Statistical alignment

2.3 Escape Risk Scoring¶

The SubclonalRiskScorer evaluates four risk factors per clone:

Factor	Weight	Threshold
Antigen-negative (expression < 0.1)	+0.4	Binary flag
Clone expanding	+0.2	Boolean (serial samples)
High proliferation index	up to +0.2	Proportional to MKI67/TOP2A
Resistance genes present	+0.05/gene (max +0.2)	Count of resistance-associated genes

Overall risk classification:
- HIGH: antigen-negative fraction > 10%
- MEDIUM: antigen-negative > 3% or any individual clone at HIGH risk
- LOW: all clones below thresholds
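The per-clone weights and the overall classification above can be sketched as follows. This is an illustration, not the SubclonalRiskScorer itself: the function names are hypothetical, the proliferation index is assumed to be pre-scaled to [0, 1], and because the guide does not state the per-clone score at which a clone counts as HIGH risk, that decision is passed in as a flag rather than guessed.

```python
def clone_risk_score(
    antigen_expression: float,
    expanding: bool,
    proliferation_index: float,
    n_resistance_genes: int,
) -> float:
    """Sum the four risk factors for one clone (maximum 1.0)."""
    score = 0.0
    if antigen_expression < 0.1:  # antigen-negative flag
        score += 0.4
    if expanding:  # observed across serial samples
        score += 0.2
    # Assumption: proliferation index is already scaled to [0, 1]
    score += 0.2 * min(max(proliferation_index, 0.0), 1.0)
    # +0.05 per resistance-associated gene, capped at +0.2
    score += min(0.05 * n_resistance_genes, 0.2)
    return score


def overall_risk(antigen_negative_fraction: float, any_clone_high: bool = False) -> str:
    """Map the antigen-negative fraction to HIGH / MEDIUM / LOW."""
    if antigen_negative_fraction > 0.10:
        return "HIGH"
    if antigen_negative_fraction > 0.03 or any_clone_high:
        return "MEDIUM"
    return "LOW"


print(clone_risk_score(0.05, True, 0.5, 3))
print(overall_risk(0.05))
```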

Timeline estimation: assuming the antigen-negative fraction grows exponentially, the time to reach 50% dominance is t = T_doubling * log2(0.5 / current_fraction).

Example: if the antigen-negative fraction is 5% and the tumor doubling time is 14 days, then t = 14 * log2(0.5 / 0.05) = 14 * 3.32 = 46.5 days to reach 50% dominance.
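The timeline formula is easy to check numerically. The helper below is a hypothetical sketch (the function name and the `target_fraction` parameter are not part of the agent); it treats the clone's fraction as doubling once per doubling time, which is a simplification because it ignores how the rest of the tumor changes.

```python
import math


def days_to_dominance(
    current_fraction: float,
    doubling_time_days: float,
    target_fraction: float = 0.5,
) -> float:
    """Days until the clone reaches target_fraction under exponential growth."""
    if not 0.0 < current_fraction < 1.0:
        raise ValueError("current_fraction must be strictly between 0 and 1")
    if current_fraction >= target_fraction:
        return 0.0  # already at or past the target
    return doubling_time_days * math.log2(target_fraction / current_fraction)


# Worked example from the text: 5% antigen-negative, 14-day doubling time
print(round(days_to_dominance(0.05, 14), 1))  # 46.5
```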

3. Spatial Transcriptomics¶

3.1 Technology Landscape¶

Spatial transcriptomics preserves the physical location of gene expression measurements within tissue:

Visium (10x Genomics)¶

Resolution: 55-micron spots (5-10 cells per spot)
Coverage: Whole transcriptome (~20,000 genes)
Tissue: Fresh-frozen or FFPE
Workflow: Tissue on barcoded slide -> permeabilization -> mRNA capture -> sequencing
Analysis: Requires computational deconvolution (cell2location, RCTD) to resolve cell types within spots

MERFISH (Vizgen)¶

Resolution: Subcellular (individual transcripts)
Coverage: 100-500 gene panel (custom design)
Tissue: Fresh-frozen
Workflow: Tissue on slide -> sequential rounds of hybridization + imaging
Analysis: Direct cell segmentation and gene assignment

Xenium (10x Genomics)¶

Resolution: Subcellular
Coverage: 100-5,000 gene panel (expanding)
Tissue: Fresh-frozen or FFPE
Workflow: In situ padlock probe hybridization + rolling circle amplification
Analysis: Cell segmentation -> direct transcript counting per cell

CODEX (Akoya)¶

Resolution: Single cell
Coverage: 40-60 proteins (antibody panel)
Tissue: FFPE or fresh-frozen
Workflow: Sequential antibody staining + fluorescence imaging
Analysis: Protein co-expression -> cell typing

3.2 Spatial Analysis Methods¶

Analysis	Method	What It Reveals
Spatial autocorrelation	Moran's I	Genes with spatially structured expression
Niche identification	Cell neighborhood analysis	Co-occurring cell type combinations
Cell-cell proximity	Pairwise distance analysis	Which cell types are physically adjacent
Spatial deconvolution	cell2location, RCTD	Cell type composition of Visium spots
Tissue segmentation	Histological features + expression	Tumor vs. stroma vs. necrosis regions
Spatial communication	MISTy, SpaTalk	Location-aware ligand-receptor analysis
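As a concrete illustration of the first row, Moran's I can be computed directly from its definition, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights. The sketch below is standard-library Python with hypothetical helper names; it uses a toy one-dimensional chain of spots, whereas a real analysis would build the weight matrix from spot coordinates (for example with a k-nearest-neighbor graph).

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Global Moran's I for one gene given a spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    w_total = sum(sum(row) for row in weights)
    if denom == 0 or w_total == 0:
        raise ValueError("Moran's I is undefined for constant values or empty weights")
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_total) * (num / denom)


def chain_weights(n: int) -> List[List[float]]:
    """Toy weight matrix: each spot neighbors the spots directly beside it."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


w = chain_weights(8)
clustered = [0, 0, 0, 0, 1, 1, 1, 1]    # expression confined to one region
alternating = [0, 1, 0, 1, 0, 1, 0, 1]  # neighbors always differ
print(morans_i(clustered, w), morans_i(alternating, w))
```

Values near +1 indicate spatially clustered expression, values near 0 indicate no spatial structure, and negative values indicate that neighboring spots tend to differ.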

3.3 Spatial Niches in Oncology¶

Clinically relevant spatial patterns:

Spatial Niche	Cell Types	Clinical Significance
Tumor-immune interface	CD8+ T, tumor, DC	Active immune surveillance, checkpoint response
Tertiary lymphoid structure	B cell, T cell, FDC	Positive prognosis, improved immunotherapy response
Fibrotic barrier	CAF, myofibroblast	Immune exclusion, anti-TGFb target
Hypoxic core	Tumor, few immune	Radioresistance, angiogenesis driver
Perivascular niche	Endothelial, pericyte, tumor	Metastatic dissemination route
Necrotic zone	Dead/dying cells	Antigen release, DAMP signaling

4. Trajectory Inference¶

4.1 What Are Cellular Trajectories?¶

Single-cell snapshots capture cells at different stages of continuous processes (differentiation, activation, exhaustion). Trajectory inference algorithms order cells along these continuous paths in "pseudotime."

4.2 Trajectory Types¶

Type	Start State	End State	Clinical Relevance
Differentiation	Progenitor/stem	Mature cell	HSC transplant engraftment
Activation	Naive T cell	Effector T cell	Immune response quality
Exhaustion	Effector T cell	Exhausted T cell (TOX+)	Checkpoint inhibitor response
EMT	Epithelial	Mesenchymal	Metastatic potential
Stemness	Differentiated tumor	Cancer stem cell	Treatment resistance
Cell cycle	G1	G2/M	Proliferation rate, chemo sensitivity

4.3 Trajectory Inference Methods¶

Method	Approach	Strengths	Key Paper
Monocle3	Principal graph	Handles branching, scalable	Cao et al., Nature 2019
PAGA	Partition-based	Robust, preserves topology	Wolf et al., Genome Biology 2019
RNA velocity (scVelo)	Spliced/unspliced ratios	Directionality without time series	Bergen et al., Nature Biotech 2020
Palantir	Diffusion maps	Probabilistic fate assignment	Setty et al., Nature Biotech 2019
CytoTRACE	Number of expressed genes as proxy for differentiation potential	Simple, no starting cell or marker genes required	Gulati et al., Science 2020

4.4 RNA Velocity¶

RNA velocity infers the direction and speed of gene expression change by comparing unspliced (nascent) and spliced (mature) mRNA:

Positive velocity (unspliced above the steady-state expectation): Gene is being upregulated
Negative velocity (unspliced below the steady-state expectation): Gene is being downregulated
Zero velocity (unspliced equals the steady-state expectation): Gene is at steady state

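import scvelo as scv

# Load an AnnData object that contains "spliced" and "unspliced" layers
adata = scv.read("sample.h5ad")

# Filter genes, normalize, and compute first- and second-order moments
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# The dynamical model needs fitted splicing kinetics before velocity is computed
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)

# Visualize on UMAP (assumes adata.obsm["X_umap"] already exists)
scv.pl.velocity_embedding_stream(adata, basis="umap")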
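To make the sign convention concrete, the simplest (steady-state) formulation defines velocity for one gene as v = u - gamma * s, where u and s are unspliced and spliced abundances and gamma is the steady-state ratio. The sketch below is a deliberately simplified illustration with hypothetical function names: it fits gamma by least squares through the origin on all cells, whereas published steady-state implementations fit on extreme-quantile cells and the dynamical model used above fits full kinetics.

```python
from typing import List, Sequence


def fit_gamma(unspliced: Sequence[float], spliced: Sequence[float]) -> float:
    """Least-squares slope through the origin for u ~ gamma * s."""
    num = sum(u * s for u, s in zip(unspliced, spliced))
    den = sum(s * s for s in spliced)
    if den == 0:
        raise ValueError("spliced counts are all zero")
    return num / den


def velocities(
    unspliced: Sequence[float], spliced: Sequence[float], gamma: float
) -> List[float]:
    """Residual of each cell from the steady-state line: v = u - gamma * s."""
    return [u - gamma * s for u, s in zip(unspliced, spliced)]


# Reference cells lying exactly on the steady-state line u = 0.5 * s
gamma = fit_gamma([0.5, 1.0, 1.5, 2.0], [1.0, 2.0, 3.0, 4.0])

# Two query cells with the same spliced level but different unspliced levels
v_up, v_down = velocities([2.0, 0.5], [2.0, 2.0], gamma)
print(gamma, v_up, v_down)  # v_up > 0 (induction), v_down < 0 (repression)
```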

5. Foundation Models for Single-Cell Biology¶

5.1 scGPT¶

Architecture: Transformer-based generative pre-trained model for single-cell data.

Pre-training: 33 million cells from CellxGene, trained on gene expression prediction using masked token modeling.

Capabilities:
- Zero-shot cell type annotation
- Gene expression imputation
- Perturbation response prediction
- Multi-batch integration
- Gene regulatory network inference

Performance benchmarks (from Cui et al., Nature Methods 2024):
- Cell type annotation: 93.5% accuracy (zero-shot on held-out datasets)
- Batch integration: superior to scVI on 6/8 benchmarks
- Perturbation prediction: R=0.85 correlation with observed perturbation effects

5.2 Geneformer¶

Architecture: BERT-style transformer trained on gene expression rank order.

Pre-training: about 30 million cells from public data (Genecorpus-30M), with a masked-gene prediction objective.

Key innovation: Represents cells as ordered sequences of genes (ranked by expression), enabling transfer learning across tissues and species.
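The rank-order representation can be illustrated with a short sketch. This is a simplified stand-in, not Geneformer's tokenizer: the function name and the toy median values are hypothetical, and the per-cell total-count normalization used in practice is omitted. Each gene's count is divided by that gene's typical (nonzero median) expression across the corpus, so ubiquitously high genes are pushed down the ranking, and the cell becomes the list of gene names sorted by that scaled value.

```python
from typing import Dict, List


def rank_value_encode(
    expression: Dict[str, float],
    gene_medians: Dict[str, float],
    max_len: int = 2048,
) -> List[str]:
    """Return the cell's genes ordered by median-scaled expression, highest first."""
    scaled = {
        gene: count / gene_medians[gene]
        for gene, count in expression.items()
        if count > 0 and gene_medians.get(gene, 0) > 0  # drop undetected genes
    }
    # Sort by descending scaled value; break ties alphabetically for determinism
    ranked = sorted(scaled, key=lambda g: (-scaled[g], g))
    return ranked[:max_len]


cell = {"ACTB": 100, "TOX": 4, "GZMB": 6, "MKI67": 0}
medians = {"ACTB": 200, "TOX": 1, "GZMB": 3, "MKI67": 2}  # toy corpus medians
print(rank_value_encode(cell, medians))  # ['TOX', 'GZMB', 'ACTB']
```

In this toy cell ACTB has the highest raw count but ranks last, because its count is low relative to its usual level, while TOX ranks first.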

Capabilities:
- Context-aware gene function prediction
- Disease state classification
- Therapeutic target nomination
- Dosage sensitivity prediction

Performance (from Theodoris et al., Nature 2023):
- Transfer learning accuracy: 85-95% across tissue types
- Network biology prediction: improved over expression-based methods
- Chromatin dynamics prediction: validated experimentally

5.3 scFoundation¶

Architecture: Large-scale pre-trained model (100M+ parameters) for cell representation learning.

Pre-training: 50 million+ cells from diverse tissues and species.

Capabilities:
- Universal cell embeddings for cross-dataset integration
- Drug response prediction
- Cell fate prediction

5.4 Integration with the Single-Cell Intelligence Agent¶

Foundation models can serve as:
1. Embedding backbone: Replace BGE-small-en-v1.5 with scGPT cell embeddings for cell-level vector search
2. Annotation engine: Zero-shot cell type prediction via scGPT
3. Perturbation simulator: Predict drug response at single-cell resolution
4. Integration layer: Cross-dataset harmonization via Geneformer embeddings

The agent's knowledge base documents these models and their capabilities. NIM endpoint integration is planned for v2.0.

6. GPU Benchmarks for Single-Cell Analysis¶

6.1 RAPIDS vs. CPU Benchmarks¶

Operation	Dataset Size	CPU (seconds)	GPU (seconds)	Speedup
PCA (50 comps)	50K cells	45	0.9	50x
PCA (50 comps)	500K cells	480	4.2	114x
UMAP	50K cells	120	2.4	50x
UMAP	500K cells	1,800	12	150x
kNN (k=30)	50K cells	90	0.8	112x
kNN (k=30)	500K cells	960	3.5	274x
Leiden (res=0.5)	50K cells	30	1.0	30x
Leiden (res=0.5)	500K cells	350	5.0	70x
Full pipeline	50K cells	345	7.5	46x
Full pipeline	500K cells	3,590	24.7	145x

Benchmarks on NVIDIA A100 80GB. CPU benchmarks on AMD EPYC 7742 64-core.

6.2 Memory Requirements¶

Dataset Size	CPU RAM	GPU VRAM
10K cells	2 GB	1 GB
50K cells	8 GB	4 GB
100K cells	16 GB	8 GB
500K cells	64 GB	32 GB
1M cells	128 GB	64 GB

6.3 rapids-singlecell¶

The rapids-singlecell package provides GPU-accelerated Scanpy-compatible functions:

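import scanpy as sc
import rapids_singlecell as rsc

adata = sc.read_h5ad("sample.h5ad")

# Move the expression matrix to GPU memory before calling rsc functions
rsc.get.anndata_to_GPU(adata)

# GPU-accelerated preprocessing
rsc.pp.normalize_total(adata)
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata)
rsc.pp.pca(adata)

# GPU-accelerated analysis
rsc.pp.neighbors(adata)
rsc.tl.leiden(adata)
rsc.tl.umap(adata)

# Results closely track Scanpy but are not bit-identical (UMAP and Leiden are
# stochastic); see section 6.1 for the measured speedups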

7. Cell-Cell Communication Analysis¶

7.1 Ligand-Receptor Databases¶

Database	Interactions	Source
CellPhoneDB	2,500+	Curated from literature
CellTalkDB	3,000+	Curated + predicted
NicheNet	6,000+	Ligand-target predicted
CellChatDB	2,000+	Curated with pathway context
LIANA	Meta-database	Consensus of multiple databases

7.2 Analysis Methods¶

Method	Approach	Output
CellPhoneDB	Permutation test on L-R co-expression	P-values per L-R pair per cell type pair
CellChat	Quantitative mass-action model	Interaction strength, pathway activity
NicheNet	Ligand activity prediction from target genes	Ligand prioritization by downstream effect
LIANA	Consensus of multiple methods	Aggregated interaction scores
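The permutation approach in the first row can be sketched in plain Python. This is a simplified illustration in the spirit of CellPhoneDB, not its implementation: the function names are hypothetical, the interaction score is the mean of the sender's average ligand expression and the receiver's average receptor expression, and the null distribution comes from shuffling cell-type labels. The "+1" in the p-value is a common small-sample correction chosen here; expression-fraction filters used by real tools are omitted.

```python
import random
from typing import List, Sequence, Tuple


def lr_score(
    ligand: Sequence[float],
    receptor: Sequence[float],
    labels: Sequence[str],
    sender: str,
    receiver: str,
) -> float:
    """Mean of (average ligand in sender cells, average receptor in receiver cells)."""
    lig = [x for x, c in zip(ligand, labels) if c == sender]
    rec = [x for x, c in zip(receptor, labels) if c == receiver]
    return (sum(lig) / len(lig) + sum(rec) / len(rec)) / 2


def lr_permutation_pvalue(
    ligand: Sequence[float],
    receptor: Sequence[float],
    labels: Sequence[str],
    sender: str,
    receiver: str,
    n_perm: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed score, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = lr_score(ligand, receptor, labels, sender, receiver)
    shuffled: List[str] = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the link between cell type and expression
        if lr_score(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: CD274 (PD-L1) high on tumor cells, PDCD1 (PD-1) high on T cells
labels = ["T"] * 20 + ["Mac"] * 20 + ["Tumor"] * 20
cd274 = [0.1] * 40 + [5.0] * 20
pdcd1 = [5.0] * 20 + [0.1] * 40
print(lr_permutation_pvalue(cd274, pdcd1, labels, sender="Tumor", receiver="T"))
```

A small p-value means the ligand-receptor pair is expressed by this specific sender-receiver combination more strongly than expected if cell-type labels were assigned at random.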

7.3 The Single-Cell Intelligence Agent's L-R Knowledge¶

The agent curates 25 ligand-receptor pairs across clinically actionable pathways. Ten of them are listed below:

Pathway	Ligand	Receptor	Clinical Relevance
Checkpoint	CD274 (PD-L1)	PDCD1 (PD-1)	Checkpoint inhibitor target
Checkpoint	CD80	CTLA4	Ipilimumab target
Chemokine	CXCL12	CXCR4	Immune cell migration/trapping
Chemokine	CCL2	CCR2	Monocyte/macrophage recruitment
Growth factor	EGF	EGFR	TKI target (erlotinib, osimertinib)
Growth factor	HGF	MET	MET inhibitor target
Notch	DLL1	NOTCH1	Cancer stem cell maintenance
Wnt	WNT5A	FZD5	Immune exclusion, beta-catenin
Hedgehog	SHH	PTCH1	Stromal activation
Angiogenesis	VEGFA	KDR (VEGFR2)	Anti-VEGF target (bevacizumab)

8. Biomarker Discovery at Single-Cell Resolution¶

8.1 Advantages Over Bulk Discovery¶

Feature	Bulk Discovery	Single-Cell Discovery
Specificity	Tissue-level	Cell-type-specific (AUROC > 0.9)
Confounders	Cell composition changes confound DE	Direct cell-type DE
Sensitivity	Rare cell markers diluted	Detectable at 0.1% frequency
Actionability	Unknown cellular source	Known cell type enables targeted therapy

8.2 Discovery Workflow¶

scRNA-seq data (disease vs. control)
         |
         v
Cell type annotation (57 cell types)
         |
         v
Per-cell-type differential expression
         |
    +----+----+----+
    |    |    |    |
    v    v    v    v
  CD8  Treg  Mac  ...
  DE   DE    DE
    |
    v
Specificity scoring (AUROC per gene per cell type)
    |
    v
Surface protein filter (is_surface = True)
    |
    v
Clinical validation check (existing assay, clinical trial)
    |
    v
BiomarkerCandidate output
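The specificity-scoring and surface-filter steps of this workflow can be sketched as follows. The function names and toy data are hypothetical, and the 0.9 cutoff is taken from the AUROC threshold quoted in section 8.1. AUROC is computed from its pairwise definition (the probability that a cell of the target type has higher expression than a cell of any other type, counting ties as one half), which is quadratic in cell count and only suitable for illustration.

```python
from typing import Dict, List, Sequence


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Probability that a positive value exceeds a negative one (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def specificity_scores(
    expr_by_gene: Dict[str, Sequence[float]],
    labels: Sequence[str],
    cell_type: str,
) -> Dict[str, float]:
    """AUROC of each gene for separating cell_type from all other cells."""
    scores = {}
    for gene, values in expr_by_gene.items():
        pos = [v for v, c in zip(values, labels) if c == cell_type]
        neg = [v for v, c in zip(values, labels) if c != cell_type]
        scores[gene] = auroc(pos, neg)
    return scores


def surface_candidates(
    scores: Dict[str, float], is_surface: Dict[str, bool], min_auroc: float = 0.9
) -> List[str]:
    """Keep genes that are both specific and annotated as surface proteins."""
    return sorted(g for g, s in scores.items() if s > min_auroc and is_surface.get(g, False))


labels = ["CD8", "CD8", "Treg", "Mac"]
expr = {"CD8A": [6, 5, 0, 0], "GZMB": [5, 4, 0, 1], "ACTB": [3, 3, 3, 3]}
scores = specificity_scores(expr, labels, "CD8")
print(scores)
print(surface_candidates(scores, {"CD8A": True, "GZMB": False, "ACTB": False}))
```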

8.3 Biomarker Types¶

Type	Definition	Example
Diagnostic	Distinguishes disease from normal	CD19 for B-ALL detection
Prognostic	Predicts outcome regardless of treatment	TOX+ exhausted CD8 fraction predicts poor OS
Predictive	Predicts treatment response	PD-L1 on tumor cells predicts anti-PD-1 response
Pharmacodynamic	Measures treatment effect	CD8/Treg ratio change under immunotherapy

9. CAR-T Target Validation¶

9.1 The Ideal CAR-T Target¶

Property	Ideal	Acceptable	Unacceptable
On-tumor coverage	> 95%	> 70%	< 50%
Off-tumor vital organs	0 hits	Low-level (< 0.5 TPM)	High in heart, brain, lung
Therapeutic index	> 10	> 3	< 3
Heterogeneity	Low (uniform expression)	Moderate	High (bimodal)
Escape risk	Low (essential gene)	Medium	High (dispensable antigen)

9.2 The Agent's Target Validation Pipeline¶

Target Gene (e.g., CD19, MSLN, HER2)
         |
    +----+----+
    |         |
    v         v
On-Tumor    Off-Tumor
Analysis    Safety Check
    |         |
    v         v
Coverage    8 vital organs:
percentage  brain, heart, lung,
Mean expr.  liver, kidney, pancreas,
            bone_marrow, intestine
    |         |
    +----+----+
         |
         v
Therapeutic Index = mean_on_tumor / (max_off_tumor + 0.01)
         |
         v
    +----+----+----+
    |         |    |
    v         v    v
FAVORABLE  COND.  UNFAVORABLE
(safe +    (risk   (safety or
 effective) mitig.) efficacy fail)
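The therapeutic index in the diagram can be computed as shown below. This is a sketch with hypothetical function names, not the agent's pipeline: it implements only the formula mean_on_tumor / (max_off_tumor + 0.01) over the eight listed organs, and the banding function reuses the therapeutic-index thresholds from the table in section 9.1 (treating a value of exactly 3 as unacceptable is an assumption). How the agent combines the index with coverage and safety checks into FAVORABLE, CONDITIONAL, or UNFAVORABLE is not specified here and is not modeled.

```python
from typing import Dict

VITAL_ORGANS = (
    "brain", "heart", "lung", "liver",
    "kidney", "pancreas", "bone_marrow", "intestine",
)


def therapeutic_index(mean_on_tumor: float, off_tumor_expression: Dict[str, float]) -> float:
    """mean_on_tumor / (max_off_tumor + 0.01), using the highest vital-organ value."""
    missing = [organ for organ in VITAL_ORGANS if organ not in off_tumor_expression]
    if missing:
        raise ValueError(f"missing off-tumor expression for: {missing}")
    max_off_tumor = max(off_tumor_expression[organ] for organ in VITAL_ORGANS)
    # The 0.01 pseudocount avoids division by zero when no organ expresses the target
    return mean_on_tumor / (max_off_tumor + 0.01)


def ti_band(ti: float) -> str:
    """Band a therapeutic index using the thresholds from section 9.1."""
    if ti > 10:
        return "ideal"
    if ti > 3:
        return "acceptable"
    return "unacceptable"  # assumption: exactly 3 falls in this band


off_tumor = {organ: 0.0 for organ in VITAL_ORGANS}
off_tumor["lung"] = 0.99  # toy value: low-level expression in one organ
ti = therapeutic_index(mean_on_tumor=50.0, off_tumor_expression=off_tumor)
print(round(ti, 2), ti_band(ti))
```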

9.3 Safety Switch Integration¶

For CONDITIONAL targets, the agent recommends:
- iCasp9 (inducible caspase 9): Dimerizer-activated suicide switch
- EGFRt: Truncated EGFR enabling cetuximab-mediated depletion
- Affinity-tuned CAR: Reduced scFv affinity discriminates high-expression tumor from low-expression normal tissue

10. Advanced Study Resources¶

10.1 Key Papers¶

Year	Paper	Impact
2016	Ståhl et al., "Visualization and analysis of gene expression in tissue sections by spatial transcriptomics"	Spatial Transcriptomics method underlying Visium
2017	Zheng et al., "Massively parallel digital transcriptional profiling"	10x Chromium technology paper
2018	Wolf et al., "SCANPY: large-scale single-cell gene expression data analysis"	Standard Python toolkit
2019	Stuart et al., "Comprehensive Integration of Single-Cell Data"	Seurat v3, integration methods
2020	Bergen et al., "Generalizing RNA velocity"	RNA velocity dynamical model
2023	Theodoris et al., "Transfer learning enables predictions in network biology"	Geneformer foundation model
2024	Cui et al., "scGPT: toward building a foundation model for single-cell multi-omics"	scGPT foundation model

10.2 Online Courses¶

Single Cell Genomics (Wellcome Sanger Institute) -- Comprehensive bioinformatics training
Analysis of Single Cell RNA-seq Data (Cambridge University) -- Scanpy/Seurat tutorials
NVIDIA RAPIDS for Single-Cell -- GPU acceleration training

10.3 Practice Datasets¶

Dataset	Cells	Tissue	Modality	Access
PBMC 3K	2,700	Blood	scRNA-seq	10x Genomics
PBMC 68K	68,000	Blood	scRNA-seq	10x Genomics
Tabula Sapiens	500,000	Multi-tissue	scRNA-seq	CellxGene
Human Lung Cell Atlas	580,000	Lung	scRNA-seq + spatial	CellxGene
TISCH2 pan-cancer collection	1M+	Multi-cancer	scRNA-seq	TISCH2

HCLS AI Factory -- Single-Cell Intelligence Agent Learning Guide: Advanced Topics v1.0.0