Single-Cell Intelligence Agent -- Learning Guide: Advanced Topics
Version: 1.0.0 Date: 2026-03-22 Author: Adam Jones
1. TME Classification: Deep Dive
1.1 The Four Immunophenotypes
Tumor microenvironment (TME) classification is central to patient selection for immunotherapy. This guide uses four widely described phenotypes, each with a distinct cellular composition, spatial organization, and set of therapeutic implications.
Hot-Inflamed TME
Cellular hallmarks:
- CD8+ T cells > 15% of total cells
- Active cytotoxic gene program (GZMB+, PRF1+, IFNG+)
- PD-L1 expression on tumor and immune cells
- Tertiary lymphoid structures may be present

Molecular signatures:
- Interferon-gamma signaling: STAT1, IRF1, CXCL9, CXCL10, CXCL11
- Cytotoxic effector program: GZMA, GZMB, PRF1, GNLY, NKG7
- Antigen presentation: HLA-A, HLA-B, HLA-C, B2M, TAP1, TAP2

Clinical implication: Strong candidate for checkpoint inhibitor monotherapy. Response rates: 30-50% for anti-PD-1/PD-L1.
Cold-Desert TME
Cellular hallmarks:
- Total immune infiltrate < 10% of cells
- Minimal T cell presence (< 2% CD8+)
- Low neoantigen load (low TMB)
- No tertiary lymphoid structures

Molecular signatures:
- Absent IFN-gamma signaling
- Low MHC class I expression
- Active Wnt/beta-catenin signaling (immune exclusion)
- PTEN loss (PI3K pathway activation)

Clinical implication: Checkpoint inhibitors alone are ineffective. Requires immune priming:
- Oncolytic virus (T-VEC) to induce immunogenic cell death
- STING agonist to activate innate immunity
- Radiation therapy for abscopal effect
- Bispecific T-cell engagers (BiTEs) to bypass recruitment defect

Excluded TME
Cellular hallmarks:
- Immune cells present at tumor margin but excluded from core
- Dense stromal barrier (fibroblasts, CAFs, myofibroblasts)
- Immune cells "stuck" at the invasive front
- Angiogenic vasculature without immune migration signals

Molecular signatures:
- TGF-beta signaling: TGFB1, TGFB2, SMAD2, SMAD3
- Stromal activation: COL1A1, COL1A2, FN1, POSTN
- CXCL12/CXCR4 axis (immune cell trapping)
- VEGF-driven angiogenesis without CXCL9/10 chemokines

Clinical implication: Target the stromal barrier:
- Anti-TGF-beta (bintrafusp alfa) to reduce fibrosis
- Anti-VEGF + anti-PD-L1 combination (atezolizumab + bevacizumab)
- FAK inhibitors to disrupt stromal architecture
- Anti-CXCR4 to release trapped immune cells

Immunosuppressive TME
Cellular hallmarks:
- Immune cells present and infiltrating but functionally suppressed
- High regulatory T cell (Treg) fraction (> 10%)
- M2-polarized macrophages dominant
- Myeloid-derived suppressor cells (MDSCs) present

Molecular signatures:
- Immunosuppressive cytokines: IL-10, TGF-beta, IL-35
- Metabolic suppression: IDO1, ARG1, NOS2 (tryptophan/arginine depletion)
- Exhaustion markers on T cells: LAG3, TIM3/HAVCR2, TIGIT, TOX
- Checkpoint overexpression: CTLA-4, PD-1 on T cells; PD-L1, PD-L2 on myeloid

Clinical implication: Multi-pronged approach:
- Dual checkpoint (anti-PD-1 + anti-CTLA-4)
- Treg depletion (anti-CCR8, low-dose cyclophosphamide)
- Macrophage reprogramming (CSF1R inhibitor)
- MDSC differentiation (ATRA, HDAC inhibitor)

1.2 Classification Algorithm
The Single-Cell Intelligence Agent's TMEClassifier implements a hierarchical decision tree:
Step 1: Spatial override
- "absent" + immune < 0.05 --> COLD_DESERT
- "margin" + immune > 0.05 --> EXCLUDED
Step 2: Hot-inflamed check
- If CD8 >= 15% AND immune >= 25%:
  - Suppressive > 0.4 --> IMMUNOSUPPRESSIVE
  - Otherwise --> HOT_INFLAMED
Step 3: Excluded check
- Immune >= 10% AND stromal > 20% --> EXCLUDED
Step 4: Immunosuppressive check
- Suppressive > 0.3 AND immune >= 10% --> IMMUNOSUPPRESSIVE
Step 5: Cold check
- Immune < 10% --> COLD_DESERT
Step 6: PD-L1 rescue
- PD-L1 high AND CD8 >= 5% --> HOT_INFLAMED
Default: COLD_DESERT
The suppressive score is a weighted combination:
- 50%: suppressive cell fraction (Treg + MDSC + M2 macrophage) / 0.2
- 50%: suppressive gene score (IDO1, TGFB1, IL10, VEGFA, ARG1, NOS2)

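The decision tree and suppressive score above can be sketched in Python as follows. This is a minimal illustration, not the agent's actual `TMEClassifier` code: the `TMEInput` record, the function names, the 0-1 scaling of the gene score, and the cap on the cell-fraction term are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TMEInput:
    """Hypothetical input record; all fractions are on a 0-1 scale."""
    immune_frac: float             # immune cells / all cells
    cd8_frac: float                # CD8+ T cells / all cells
    stromal_frac: float            # stromal cells / all cells
    suppressive_cell_frac: float   # (Treg + MDSC + M2 macrophage) / all cells
    suppressive_gene_score: float  # assumed to be pre-scaled to 0-1
    pdl1_high: bool = False
    spatial_context: Optional[str] = None  # "absent", "margin", or None


def suppressive_score(cell_frac: float, gene_score: float) -> float:
    # 50% cell term (fraction / 0.2) plus 50% gene term.
    # Capping the cell term at 1.0 is an assumption, not stated in the guide.
    cell_term = min(cell_frac / 0.2, 1.0)
    return 0.5 * cell_term + 0.5 * gene_score


def classify_tme(x: TMEInput) -> str:
    supp = suppressive_score(x.suppressive_cell_frac, x.suppressive_gene_score)

    # Step 1: spatial override
    if x.spatial_context == "absent" and x.immune_frac < 0.05:
        return "COLD_DESERT"
    if x.spatial_context == "margin" and x.immune_frac > 0.05:
        return "EXCLUDED"
    # Step 2: hot-inflamed check, with a nested suppressive test
    if x.cd8_frac >= 0.15 and x.immune_frac >= 0.25:
        return "IMMUNOSUPPRESSIVE" if supp > 0.4 else "HOT_INFLAMED"
    # Step 3: excluded check
    if x.immune_frac >= 0.10 and x.stromal_frac > 0.20:
        return "EXCLUDED"
    # Step 4: immunosuppressive check
    if supp > 0.3 and x.immune_frac >= 0.10:
        return "IMMUNOSUPPRESSIVE"
    # Step 5: cold check
    if x.immune_frac < 0.10:
        return "COLD_DESERT"
    # Step 6: PD-L1 rescue
    if x.pdl1_high and x.cd8_frac >= 0.05:
        return "HOT_INFLAMED"
    return "COLD_DESERT"
```

Because the checks run in order, a sample that satisfies the Step 2 thresholds is never evaluated against the stromal rule in Step 3; reordering the steps would change the classification of borderline samples.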
1.3 Evidence Levels for TME Classification
| Evidence Available | Level | Confidence |
|---|---|---|
| Spatial context + PD-L1 TPS + scRNA-seq | STRONG | High |
| PD-L1 TPS + scRNA-seq (no spatial) | MODERATE | Medium |
| scRNA-seq only (no PD-L1, no spatial) | LIMITED | Low |
2. Subclonal Architecture and Clonal Dynamics
2.1 Why Subclones Matter
Cancer is not a monolithic disease. A single tumor contains multiple subclonal populations, each with distinct:
- Somatic mutation profiles (driver and passenger)
- Copy number aberrations (gains, losses, LOH)
- Transcriptomic programs (proliferation, invasion, immune evasion)
- Drug sensitivity profiles

Under therapeutic selective pressure, resistant subclones expand:
Before Treatment:
Clone A (80%): Drug-sensitive, antigen+
Clone B (15%): Moderate sensitivity, antigen+
Clone C (5%): Resistant, antigen-negative
After 8 Weeks of CAR-T:
Clone A (5%): Depleted by CAR-T
Clone B (20%): Partially depleted
Clone C (75%): Expanded (antigen escape)
2.2 Single-Cell Subclonal Detection
Methods for inferring subclonal architecture from scRNA-seq:
| Method | Input | Output | Mechanism |
|---|---|---|---|
| inferCNV | scRNA-seq expression | Clone-specific CNV profiles | Expression deviation from normal reference |
| CopyKAT | scRNA-seq expression | Aneuploid/diploid classification | Bayesian segmentation |
| Numbat | scRNA-seq + genotype | Haplotype-aware CNV + clone tree | Allele-specific expression |
| clonealign | scRNA-seq + scDNA-seq | Clone-to-transcriptome mapping | Statistical alignment |
2.3 Escape Risk Scoring
The SubclonalRiskScorer evaluates four risk factors per clone:
| Factor | Weight | Threshold |
|---|---|---|
| Antigen-negative (expression < 0.1) | +0.4 | Binary flag |
| Clone expanding | +0.2 | Boolean (serial samples) |
| High proliferation index | up to +0.2 | Proportional to MKI67/TOP2A |
| Resistance genes present | +0.05/gene (max +0.2) | Count of resistance-associated genes |
Overall risk classification:
- HIGH: antigen-negative fraction > 10%
- MEDIUM: antigen-negative > 3% or any individual clone at HIGH risk
- LOW: all clones below thresholds

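The per-clone weights and the overall classification above could be combined roughly as in the sketch below. It is illustrative only and does not reproduce the agent's `SubclonalRiskScorer`: the `Clone` record, the function names, the 0-1 scaling of the proliferation index, and the per-clone HIGH cutoff of 0.6 are assumptions, and the resistance gene names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: the guide does not state the score at which a single clone is HIGH risk.
CLONE_HIGH_CUTOFF = 0.6


@dataclass
class Clone:
    """Hypothetical per-clone summary."""
    name: str
    fraction: float                   # share of tumor cells, 0-1
    antigen_expression: float         # normalized target antigen expression
    expanding: bool = False           # from serial samples
    proliferation_index: float = 0.0  # assumed scaled to 0-1 (MKI67/TOP2A based)
    resistance_genes: List[str] = field(default_factory=list)


def clone_risk(c: Clone) -> float:
    score = 0.0
    if c.antigen_expression < 0.1:                            # antigen-negative flag
        score += 0.4
    if c.expanding:                                           # clone expanding
        score += 0.2
    score += 0.2 * min(max(c.proliferation_index, 0.0), 1.0)  # up to +0.2
    score += min(0.05 * len(c.resistance_genes), 0.2)         # +0.05 per gene, max +0.2
    return score


def overall_risk(clones: List[Clone]) -> str:
    antigen_negative = sum(c.fraction for c in clones if c.antigen_expression < 0.1)
    if antigen_negative > 0.10:
        return "HIGH"
    if antigen_negative > 0.03 or any(clone_risk(c) >= CLONE_HIGH_CUTOFF for c in clones):
        return "MEDIUM"
    return "LOW"


# Pre-treatment clones from Section 2.1 (expression values are illustrative)
clones = [
    Clone("A", 0.80, antigen_expression=0.9),
    Clone("B", 0.15, antigen_expression=0.6),
    Clone("C", 0.05, antigen_expression=0.02, expanding=True,
          proliferation_index=0.5, resistance_genes=["gene_a", "gene_b"]),
]
```

With these inputs, clone C accumulates 0.4 + 0.2 + 0.1 + 0.1 = 0.8, and the 5% antigen-negative fraction places the tumor in the MEDIUM band.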
Timeline estimation:
Using exponential growth: t = T_doubling * log2(0.5 / current_fraction)
Example: If antigen-negative fraction is 5% and tumor doubling time is 14 days:
- t = 14 * log2(0.5 / 0.05) = 14 * 3.32 = 46.5 days to reach 50% dominance
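The timeline formula is small enough to check directly. The sketch below uses a hypothetical helper name and treats the antigen-negative fraction itself as doubling every `T_doubling` days, which is a first-order approximation that ignores how the other clones change over the same period.

```python
import math


def days_to_dominance(current_fraction: float,
                      doubling_time_days: float,
                      target_fraction: float = 0.5) -> float:
    """t = T_doubling * log2(target_fraction / current_fraction)."""
    if not 0.0 < current_fraction < 1.0:
        raise ValueError("current_fraction must be between 0 and 1")
    if current_fraction >= target_fraction:
        return 0.0  # the clone is already at or above the target
    return doubling_time_days * math.log2(target_fraction / current_fraction)


# Worked example from the text: 5% antigen-negative cells, 14-day doubling time
estimate = days_to_dominance(0.05, 14.0)
```

Here `estimate` comes out at about 46.5 days, matching the hand calculation above.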
3. Spatial Transcriptomics
3.1 Technology Landscape
Spatial transcriptomics preserves the physical location of gene expression measurements within tissue:
Visium (10x Genomics)
- Resolution: 55-micron spots (5-10 cells per spot)
- Coverage: Whole transcriptome (~20,000 genes)
- Tissue: Fresh-frozen or FFPE
- Workflow: Tissue on barcoded slide -> permeabilization -> mRNA capture -> sequencing
- Analysis: Requires computational deconvolution (cell2location, RCTD) to resolve cell types within spots
MERFISH (Vizgen)
- Resolution: Subcellular (individual transcripts)
- Coverage: 100-500 gene panel (custom design)
- Tissue: Fresh-frozen
- Workflow: Tissue on slide -> sequential rounds of hybridization + imaging
- Analysis: Direct cell segmentation and gene assignment
Xenium (10x Genomics)
- Resolution: Subcellular
- Coverage: 100-5,000 gene panel (expanding)
- Tissue: Fresh-frozen or FFPE
- Workflow: In situ padlock probe hybridization + rolling circle amplification
- Analysis: Cell segmentation -> direct transcript counting per cell
CODEX (Akoya)
- Resolution: Single cell
- Coverage: 40-60 proteins (antibody panel)
- Tissue: FFPE or fresh-frozen
- Workflow: Sequential antibody staining + fluorescence imaging
- Analysis: Protein co-expression -> cell typing
3.2 Spatial Analysis Methods
| Analysis | Method | What It Reveals |
|---|---|---|
| Spatial autocorrelation | Moran's I | Genes with spatially structured expression |
| Niche identification | Cell neighborhood analysis | Co-occurring cell type combinations |
| Cell-cell proximity | Pairwise distance analysis | Which cell types are physically adjacent |
| Spatial deconvolution | cell2location, RCTD | Cell type composition of Visium spots |
| Tissue segmentation | Histological features + expression | Tumor vs. stroma vs. necrosis regions |
| Spatial communication | MISTy, SpaTalk | Location-aware ligand-receptor analysis |
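As a concrete illustration of the first row of the table, Moran's I for one gene can be computed with NumPy alone. This is a minimal sketch under stated assumptions: binary k-nearest-neighbour weights without row standardization, a dense distance matrix (fine for small examples, not for whole slides), and hypothetical function names. Dedicated spatial libraries offer more complete implementations.

```python
import numpy as np


def knn_weights(coords: np.ndarray, k: int = 4) -> np.ndarray:
    """Binary spatial weights: w[i, j] = 1 if j is one of i's k nearest spots."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)              # a spot is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    weights = np.zeros((n, n))
    weights[np.arange(n)[:, None], nearest] = 1.0
    return weights


def morans_i(values: np.ndarray, weights: np.ndarray) -> float:
    """I = (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    denom = float((z ** 2).sum())
    if denom == 0.0:
        return float("nan")                     # undefined for a constant gene
    return float(len(z) / weights.sum() * (z @ weights @ z) / denom)


# Toy 10 x 10 grid of spots (synthetic, not real data)
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
w = knn_weights(coords, k=4)

gradient_gene = coords[:, 0]                        # smooth left-to-right gradient
checker_gene = ((xs + ys) % 2).ravel().astype(float)  # alternating pattern

i_gradient = morans_i(gradient_gene, w)
i_checker = morans_i(checker_gene, w)
```

The smooth gradient yields a strongly positive value (neighbouring spots look alike), while the alternating pattern yields a strongly negative one; values near zero indicate no spatial structure.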
3.3 Spatial Niches in Oncology
Clinically relevant spatial patterns:
| Spatial Niche | Cell Types | Clinical Significance |
|---|---|---|
| Tumor-immune interface | CD8+ T, tumor, DC | Active immune surveillance, checkpoint response |
| Tertiary lymphoid structure | B cell, T cell, FDC | Positive prognosis, improved immunotherapy response |
| Fibrotic barrier | CAF, myofibroblast | Immune exclusion, anti-TGFb target |
| Hypoxic core | Tumor, few immune | Radioresistance, angiogenesis driver |
| Perivascular niche | Endothelial, pericyte, tumor | Metastatic dissemination route |
| Necrotic zone | Dead/dying cells | Antigen release, DAMP signaling |
4. Trajectory Inference
4.1 What Are Cellular Trajectories?
Single-cell snapshots capture cells at different stages of continuous processes (differentiation, activation, exhaustion). Trajectory inference algorithms order cells along these continuous paths in "pseudotime."
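To make the idea of pseudotime concrete, the toy example below orders simulated cells along their first principal component. This is deliberately the simplest possible ordering and is not how Monocle3, PAGA, or Palantir work; the data are synthetic, the function name is hypothetical, and a single linear process with a known root cell is assumed.

```python
import numpy as np


def pca_pseudotime(expr: np.ndarray, root_cell: int) -> np.ndarray:
    """Order cells along the first principal component, scaled to [0, 1]."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]
    # The sign of a principal component is arbitrary, so orient it
    # such that the chosen root cell sits at the start of the ordering.
    if pc1[root_cell] > 0:
        pc1 = -pc1
    return (pc1 - pc1.min()) / (pc1.max() - pc1.min())


# Simulated data (not real measurements): 200 cells, 20 genes.
# Ten genes rise and ten genes fall along a hidden "true" time axis.
rng = np.random.default_rng(0)
true_time = np.linspace(0.0, 1.0, 200)
direction = np.array([1.0] * 10 + [-1.0] * 10)
expr = np.outer(true_time, direction) + rng.normal(0.0, 0.1, size=(200, 20))

pseudotime = pca_pseudotime(expr, root_cell=0)
```

In this simulation the recovered ordering tracks `true_time` closely. Real trajectories branch and curve, which is why the graph-based and diffusion-based methods in Section 4.3 exist.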
4.2 Trajectory Types
| Type | Start State | End State | Clinical Relevance |
|---|---|---|---|
| Differentiation | Progenitor/stem | Mature cell | HSC transplant engraftment |
| Activation | Naive T cell | Effector T cell | Immune response quality |
| Exhaustion | Effector T cell | Exhausted T cell (TOX+) | Checkpoint inhibitor response |
| EMT | Epithelial | Mesenchymal | Metastatic potential |
| Stemness | Differentiated tumor | Cancer stem cell | Treatment resistance |
| Cell cycle | G1 | G2/M | Proliferation rate, chemo sensitivity |
4.3 Trajectory Inference Methods
| Method | Approach | Strengths | Key Paper |
|---|---|---|---|
| Monocle3 | Principal graph | Handles branching, scalable | Cao et al., Nature 2019 |
| PAGA | Partition-based | Robust, preserves topology | Wolf et al., Genome Biology 2019 |
| RNA velocity (scVelo) | Spliced/unspliced ratios | Directionality without time series | Bergen et al., Nature Biotech 2020 |
| Palantir | Diffusion maps | Probabilistic fate assignment | Setty et al., Nature Biotech 2019 |
| CytoTRACE | Gene counts as proxy | Simple; needs no root cell or prior annotation | Gulati et al., Science 2020 |
4.4 RNA Velocity
RNA velocity infers the direction and speed of gene expression change by comparing unspliced (nascent) and spliced (mature) mRNA:
- Positive velocity (unspliced > expected): Gene is being upregulated
- Negative velocity (unspliced < expected): Gene is being downregulated
- Zero velocity (equilibrium): Gene is at steady state
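import scvelo as scv
# Load an AnnData file that contains "spliced" and "unspliced" layers
adata = scv.read("sample.h5ad")
# Filter and normalize genes, then compute first- and second-order moments
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
# The dynamical model needs per-gene kinetics fitted before velocity estimation
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)
# Visualize on UMAP (requires adata.obsm["X_umap"], e.g. from scv.tl.umap)
scv.pl.velocity_embedding_stream(adata, basis="umap")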
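The sign convention listed above can be illustrated without scVelo using the steady-state formulation, where the expected unspliced level is `gamma * spliced` and velocity is the residual. The sketch below is a simplification: the function names are hypothetical, `gamma` is fitted by an ordinary least-squares slope through the origin rather than the quantile-based fit used in practice, and the counts are made up.

```python
import numpy as np


def fit_gamma(unspliced, spliced) -> float:
    """Least-squares slope through the origin for u ~ gamma * s (simplified)."""
    u = np.asarray(unspliced, dtype=float)
    s = np.asarray(spliced, dtype=float)
    return float(u @ s / (s @ s))


def steady_state_velocity(unspliced, spliced, gamma: float) -> np.ndarray:
    """Velocity = observed unspliced - expected unspliced (gamma * spliced)."""
    return np.asarray(unspliced, dtype=float) - gamma * np.asarray(spliced, dtype=float)


def direction(v: float, tol: float = 1e-9) -> str:
    if v > tol:
        return "upregulated"
    if v < -tol:
        return "downregulated"
    return "steady state"


# One gene in three cells with the same spliced level but different unspliced levels
velocities = steady_state_velocity([2.0, 1.0, 0.5], [2.0, 2.0, 2.0], gamma=0.5)
labels = [direction(v) for v in velocities]
```

With `gamma = 0.5` the expected unspliced level is 1.0 in every cell, so the three cells are labelled upregulated, steady state, and downregulated respectively.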
5. Foundation Models for Single-Cell Biology
5.1 scGPT
Architecture: Transformer-based generative pre-trained model for single-cell data.
Pre-training: 33 million cells from CellxGene, trained on gene expression prediction using masked token modeling.
Capabilities:
- Zero-shot cell type annotation
- Gene expression imputation
- Perturbation response prediction
- Multi-batch integration
- Gene regulatory network inference

Performance benchmarks (from Cui et al., Nature Methods 2024):
- Cell type annotation: 93.5% accuracy (zero-shot on held-out datasets)
- Batch integration: superior to scVI on 6/8 benchmarks
- Perturbation prediction: R=0.85 correlation with observed perturbation effects

5.2 Geneformer
Architecture: BERT-style transformer trained on gene expression rank order.
Pre-training: about 30 million cells from public data (Genecorpus-30M), using a masked-gene prediction objective.
Key innovation: Represents cells as ordered sequences of genes (ranked by expression), enabling transfer learning across tissues and species.
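The rank-order representation described above can be sketched in a few lines. This is a simplified illustration, not Geneformer's tokenizer: the function name and the `max_len` truncation parameter are hypothetical, and the published method also normalizes each gene by its corpus-wide expression before ranking, a step omitted here.

```python
import numpy as np


def rank_value_encode(expression, gene_names, max_len: int = 2048):
    """Return the cell's expressed genes, ordered from highest to lowest expression."""
    expr = np.asarray(expression, dtype=float)
    expressed = np.flatnonzero(expr > 0)                       # drop undetected genes
    order = expressed[np.argsort(-expr[expressed], kind="stable")]
    return [gene_names[i] for i in order[:max_len]]


# One toy cell with four genes (values are illustrative)
tokens = rank_value_encode([0.0, 5.0, 2.0, 9.0], ["CD19", "CD3E", "MKI67", "PTPRC"])
```

The result is `["PTPRC", "CD3E", "MKI67"]`: the cell becomes a sequence whose order, rather than absolute counts, carries the expression information.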
Capabilities:
- Context-aware gene function prediction
- Disease state classification
- Therapeutic target nomination
- Dosage sensitivity prediction

Performance (from Theodoris et al., Nature 2023):
- Transfer learning accuracy: 85-95% across tissue types
- Network biology prediction: improved over expression-based methods
- Chromatin dynamics prediction: validated experimentally

5.3 scFoundation
Architecture: Large-scale pre-trained model (100M+ parameters) for cell representation learning.
Pre-training: 50 million+ cells from diverse tissues and species.
Capabilities:
- Universal cell embeddings for cross-dataset integration
- Drug response prediction
- Cell fate prediction

5.4 Integration with the Single-Cell Intelligence Agent
Foundation models can serve as:
1. Embedding backbone: Replace BGE-small-en-v1.5 with scGPT cell embeddings for cell-level vector search
2. Annotation engine: Zero-shot cell type prediction via scGPT
3. Perturbation simulator: Predict drug response at single-cell resolution
4. Integration layer: Cross-dataset harmonization via Geneformer embeddings

The agent's knowledge base documents these models and their capabilities. NIM endpoint integration is planned for v2.0.
6. GPU Benchmarks for Single-Cell Analysis
6.1 RAPIDS vs. CPU Benchmarks
| Operation | Dataset Size | CPU (seconds) | GPU (seconds) | Speedup |
|---|---|---|---|---|
| PCA (50 comps) | 50K cells | 45 | 0.9 | 50x |
| PCA (50 comps) | 500K cells | 480 | 4.2 | 114x |
| UMAP | 50K cells | 120 | 2.4 | 50x |
| UMAP | 500K cells | 1,800 | 12 | 150x |
| kNN (k=30) | 50K cells | 90 | 0.8 | 112x |
| kNN (k=30) | 500K cells | 960 | 3.5 | 274x |
| Leiden (res=0.5) | 50K cells | 30 | 1.0 | 30x |
| Leiden (res=0.5) | 500K cells | 350 | 5.0 | 70x |
| Full pipeline | 50K cells | 345 | 7.5 | 46x |
| Full pipeline | 500K cells | 3,590 | 24.7 | 145x |
Benchmarks on NVIDIA A100 80GB. CPU benchmarks on AMD EPYC 7742 64-core.
6.2 Memory Requirements
| Dataset Size | CPU RAM | GPU VRAM |
|---|---|---|
| 10K cells | 2 GB | 1 GB |
| 50K cells | 8 GB | 4 GB |
| 100K cells | 16 GB | 8 GB |
| 500K cells | 64 GB | 32 GB |
| 1M cells | 128 GB | 64 GB |
6.3 rapids-singlecell
The rapids-singlecell package provides GPU-accelerated Scanpy-compatible functions:
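import scanpy as sc
import rapids_singlecell as rsc
# Load data and move it to GPU memory (helper available in recent rapids-singlecell releases)
adata = sc.read_h5ad("sample.h5ad")
rsc.get.anndata_to_GPU(adata)
# GPU-accelerated preprocessing
rsc.pp.normalize_total(adata)
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata)
rsc.pp.pca(adata)
# GPU-accelerated analysis
rsc.pp.neighbors(adata)
rsc.tl.leiden(adata)
rsc.tl.umap(adata)
# Results are comparable to Scanpy but not bit-identical (UMAP and Leiden are stochastic).
# Speedups depend on hardware and dataset size; see the benchmark table in Section 6.1.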
7. Cell-Cell Communication Analysis
7.1 Ligand-Receptor Databases
| Database | Interactions | Source |
|---|---|---|
| CellPhoneDB | 2,500+ | Curated from literature |
| CellTalkDB | 3,000+ | Curated + predicted |
| NicheNet | 6,000+ | Ligand-target predicted |
| CellChatDB | 2,000+ | Curated with pathway context |
| LIANA | Meta-database | Consensus of multiple databases |
7.2 Analysis Methods
| Method | Approach | Output |
|---|---|---|
| CellPhoneDB | Permutation test on L-R co-expression | P-values per L-R pair per cell type pair |
| CellChat | Quantitative mass-action model | Interaction strength, pathway activity |
| NicheNet | Ligand activity prediction from target genes | Ligand prioritization by downstream effect |
| LIANA | Consensus of multiple methods | Aggregated interaction scores |
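The permutation approach in the first row of the table can be illustrated with a small NumPy sketch. It follows the general idea (score a ligand-receptor pair by the mean ligand expression in the sender cluster and the mean receptor expression in the receiver cluster, then compare against label-shuffled nulls) but is not CellPhoneDB's implementation: the function name is hypothetical, expression-fraction filters and multi-subunit complexes are omitted, and the data are synthetic.

```python
import numpy as np


def lr_permutation_test(ligand, receptor, labels, sender, receiver,
                        n_perm: int = 1000, seed: int = 0):
    """Return (observed score, empirical p-value) for one L-R pair and cluster pair."""
    ligand = np.asarray(ligand, dtype=float)
    receptor = np.asarray(receptor, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    def score(lab):
        # Mean of the sender's ligand mean and the receiver's receptor mean
        return 0.5 * (ligand[lab == sender].mean() + receptor[lab == receiver].mean())

    observed = score(labels)
    # Shuffling cluster labels breaks any link between cell identity and expression
    null = np.array([score(rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value


# Synthetic example: 50 tumor cells, 50 T cells, 50 fibroblasts
labels = np.array(["tumor"] * 50 + ["T"] * 50 + ["fibro"] * 50)
ligand = np.where(labels == "tumor", 5.0, 0.1)    # e.g. CD274 high on tumor cells
receptor = np.where(labels == "T", 4.0, 0.1)      # e.g. PDCD1 high on T cells

observed, p_value = lr_permutation_test(ligand, receptor, labels, "tumor", "T")
```

Here the observed score is 4.5 and essentially no shuffled labelling reaches it, so the pair is called significant for tumor-to-T-cell signalling. The `+ 1` terms keep the empirical p-value from ever being exactly zero.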
7.3 The Single-Cell Intelligence Agent's L-R Knowledge
The agent curates 25 ligand-receptor pairs across clinically actionable pathways; ten of them are listed below:
| Pathway | Ligand | Receptor | Clinical Relevance |
|---|---|---|---|
| Checkpoint | CD274 (PD-L1) | PDCD1 (PD-1) | Checkpoint inhibitor target |
| Checkpoint | CD80 | CTLA4 | Ipilimumab target |
| Chemokine | CXCL12 | CXCR4 | Immune cell migration/trapping |
| Chemokine | CCL2 | CCR2 | Monocyte/macrophage recruitment |
| Growth factor | EGF | EGFR | TKI target (erlotinib, osimertinib) |
| Growth factor | HGF | MET | MET inhibitor target |
| Notch | DLL1 | NOTCH1 | Cancer stem cell maintenance |
| Wnt | WNT5A | FZD5 | Immune exclusion, beta-catenin |
| Hedgehog | SHH | PTCH1 | Stromal activation |
| Angiogenesis | VEGFA | KDR (VEGFR2) | Anti-VEGF target (bevacizumab) |
8. Biomarker Discovery at Single-Cell Resolution
8.1 Advantages Over Bulk Discovery
| Feature | Bulk Discovery | Single-Cell Discovery |
|---|---|---|
| Specificity | Tissue-level | Cell-type-specific (AUROC > 0.9) |
| Confounders | Cell composition changes confound DE | Direct cell-type DE |
| Sensitivity | Rare cell markers diluted | Detectable at 0.1% frequency |
| Actionability | Unknown cellular source | Known cell type enables targeted therapy |
8.2 Discovery Workflow
scRNA-seq data (disease vs. control)
|
v
Cell type annotation (57 cell types)
|
v
Per-cell-type differential expression
|
+----+----+----+
| | | |
v v v v
CD8 Treg Mac ...
DE DE DE
|
v
Specificity scoring (AUROC per gene per cell type)
|
v
Surface protein filter (is_surface = True)
|
v
Clinical validation check (existing assay, clinical trial)
|
v
BiomarkerCandidate output
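The specificity-scoring and surface-filter steps of the workflow above might look like the sketch below. It is an illustration under assumptions: the function names and input shapes are hypothetical, AUROC is computed with a plain rank-based (Mann-Whitney) formula, the 0.9 cutoff is taken from the table in Section 8.1, and the expression values are invented.

```python
import numpy as np


def auroc(scores, is_target) -> float:
    """Rank-based AUROC: probability a target cell outranks a non-target cell."""
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    n_pos, n_neg = int(is_target.sum()), int((~is_target).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one target and one non-target cell")
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):                      # assign average ranks to ties
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    u_stat = ranks[is_target].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def select_candidates(expr_by_gene, is_target, surface_genes, min_auroc: float = 0.9):
    """Keep genes that are specific to the target cell type and on the cell surface."""
    hits = []
    for gene, expr in expr_by_gene.items():
        score = auroc(expr, is_target)
        if score > min_auroc and gene in surface_genes:
            hits.append((gene, score))
    return sorted(hits, key=lambda item: -item[1])


# Six cells, the first three belong to the cell type of interest (toy values)
is_target = [True, True, True, False, False, False]
expr_by_gene = {
    "GENE_SURFACE": [5.0, 4.0, 6.0, 0.0, 1.0, 0.5],   # specific, surface protein
    "GENE_NUCLEAR": [3.0, 3.5, 4.0, 0.2, 0.1, 0.0],   # specific, not on the surface
    "GENE_BROAD":   [1.0, 2.0, 1.5, 1.2, 2.1, 1.4],   # not specific
}
candidates = select_candidates(expr_by_gene, is_target, surface_genes={"GENE_SURFACE", "GENE_BROAD"})
```

Only `GENE_SURFACE` survives both filters. The clinical validation check that follows in the workflow is a lookup against external resources and is not modelled here.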
8.3 Biomarker Types
| Type | Definition | Example |
|---|---|---|
| Diagnostic | Distinguishes disease from normal | CD19 for B-ALL detection |
| Prognostic | Predicts outcome regardless of treatment | TOX+ exhausted CD8 fraction predicts poor OS |
| Predictive | Predicts treatment response | PD-L1 on tumor cells predicts anti-PD-1 response |
| Pharmacodynamic | Measures treatment effect | CD8/Treg ratio change under immunotherapy |
9. CAR-T Target Validation
9.1 The Ideal CAR-T Target
| Property | Ideal | Acceptable | Unacceptable |
|---|---|---|---|
| On-tumor coverage | > 95% | > 70% | < 50% |
| Off-tumor vital organs | 0 hits | Low-level (< 0.5 TPM) | High in heart, brain, lung |
| Therapeutic index | > 10 | > 3 | < 3 |
| Heterogeneity | Low (uniform expression) | Moderate | High (bimodal) |
| Escape risk | Low (essential gene) | Medium | High (dispensable antigen) |
9.2 The Agent's Target Validation Pipeline
Target Gene (e.g., CD19, MSLN, HER2)
|
+----+----+
| |
v v
On-Tumor Off-Tumor
Analysis Safety Check
| |
v v
Coverage 8 vital organs:
percentage brain, heart, lung,
Mean expr. liver, kidney, pancreas,
bone_marrow, intestine
| |
+----+----+
|
v
Therapeutic Index = mean_on_tumor / (max_off_tumor + 0.01)
|
v
+----+----+----+
| | |
v v v
FAVORABLE COND. UNFAVORABLE
(safe + (risk (safety or
effective) mitig.) efficacy fail)
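The therapeutic index step in the pipeline above can be sketched as follows. The formula and the list of eight vital organs come from the diagram; everything else is an assumption for illustration, including the function names and the way the Section 9.1 cutoffs for coverage and therapeutic index are mapped onto the three verdicts.

```python
from typing import Dict

VITAL_ORGANS = ("brain", "heart", "lung", "liver",
                "kidney", "pancreas", "bone_marrow", "intestine")


def therapeutic_index(mean_on_tumor: float, off_tumor_expr: Dict[str, float]) -> float:
    """TI = mean_on_tumor / (max_off_tumor + 0.01), over the eight vital organs."""
    missing = [organ for organ in VITAL_ORGANS if organ not in off_tumor_expr]
    if missing:
        raise ValueError(f"missing off-tumor expression for: {missing}")
    max_off_tumor = max(off_tumor_expr[organ] for organ in VITAL_ORGANS)
    return mean_on_tumor / (max_off_tumor + 0.01)  # 0.01 avoids division by zero


def verdict(coverage: float, ti: float) -> str:
    # Cutoffs from the table in Section 9.1; combining them like this is an assumption.
    if coverage > 0.95 and ti > 10:
        return "FAVORABLE"
    if coverage > 0.70 and ti > 3:
        return "CONDITIONAL"
    return "UNFAVORABLE"


# Illustrative values only: a target with low-level expression in lung
off_tumor = {organ: 0.0 for organ in VITAL_ORGANS}
off_tumor["lung"] = 0.09
ti = therapeutic_index(mean_on_tumor=5.0, off_tumor_expr=off_tumor)
```

With these numbers the index is about 50, so a target with 97% on-tumor coverage would be rated FAVORABLE under the assumed mapping. Using the maximum rather than the mean off-tumor expression makes the index sensitive to a single high-expressing vital organ.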
9.3 Safety Switch Integration
For CONDITIONAL targets, the agent recommends:
- iCasp9 (inducible caspase 9): Dimerizer-activated suicide switch
- EGFRt: Truncated EGFR enabling cetuximab-mediated depletion
- Affinity-tuned CAR: Reduced scFv affinity discriminates high-expression tumor from low-expression normal tissue

10. Advanced Study Resources
10.1 Key Papers
| Year | Paper | Impact |
|---|---|---|
| 2016 | Ståhl et al., "Visualization and analysis of gene expression in tissue sections by spatial transcriptomics" | Original Spatial Transcriptomics method, the basis of Visium |
| 2017 | Zheng et al., "Massively parallel digital transcriptional profiling" | 10x Chromium technology paper |
| 2018 | Wolf et al., "SCANPY: large-scale single-cell gene expression data analysis" | Standard Python toolkit |
| 2019 | Stuart et al., "Comprehensive Integration of Single-Cell Data" | Seurat v3, integration methods |
| 2020 | Bergen et al., "Generalizing RNA velocity" | RNA velocity dynamical model |
| 2023 | Theodoris et al., "Transfer learning enables predictions in network biology" | Geneformer foundation model |
| 2024 | Cui et al., "scGPT: toward building a foundation model for single-cell multi-omics" | scGPT foundation model |
10.2 Online Courses
- Single Cell Genomics (Wellcome Sanger Institute) -- Comprehensive bioinformatics training
- Analysis of Single Cell RNA-seq Data (Cambridge University) -- Scanpy/Seurat tutorials
- NVIDIA RAPIDS for Single-Cell -- GPU acceleration training
10.3 Practice Datasets
| Dataset | Cells | Tissue | Modality | Access |
|---|---|---|---|---|
| PBMC 3K | 2,700 | Blood | scRNA-seq | 10x Genomics |
| PBMC 68K | 68,000 | Blood | scRNA-seq | 10x Genomics |
| Tabula Sapiens | 500,000 | Multi-tissue | scRNA-seq | CellxGene |
| Human Lung Cell Atlas | 580,000 | Lung | scRNA-seq + spatial | CellxGene |
| TISCH2 pan-cancer scRNA-seq collection | 1M+ | Multi-cancer | scRNA-seq | TISCH2 |
HCLS AI Factory -- Single-Cell Intelligence Agent Learning Guide: Advanced Topics v1.0.0