Single-Cell Intelligence Agent: GPU-Accelerated Single-Cell Multi-Omics Analysis for Precision Oncology¶

A Component of the HCLS AI Factory Precision Medicine Platform

Authors: HCLS AI Factory Development Team

Date: March 2026

Version: 1.0

Platform: NVIDIA DGX Spark | CUDA 12.x | RAPIDS AI

Abstract¶

Bulk sequencing has been the workhorse of genomic medicine for over two decades, yet its inherent averaging across millions of cells obscures the tumor heterogeneity that drives 30-40% of targeted therapy failures. Resistant subclones--invisible to bulk assays--persist through treatment, expand under selective pressure, and ultimately cause relapse. Single-cell RNA sequencing (scRNA-seq) resolves individual transcriptomes across tens of thousands to hundreds of thousands of cells per sample, revealing the cellular diversity that bulk methods miss. However, the computational demands are formidable: a typical experiment generates expression matrices spanning 10,000-500,000 cells across 20,000+ genes, producing billions of data points that overwhelm traditional CPU-based analysis pipelines.

We present the Single-Cell Intelligence Agent, a GPU-accelerated analysis platform integrated into the HCLS AI Factory precision medicine pipeline. Built on NVIDIA RAPIDS and deployed on DGX Spark infrastructure, the agent achieves 50-100x speedups over CPU-based tools for core operations including dimensionality reduction (UMAP in <10 seconds for 100K cells), graph-based clustering (<30 seconds), and differential expression analysis. The system maintains 12 specialized Milvus vector collections encompassing cell type references, spatial transcriptomics data, tumor microenvironment signatures, drug response profiles, and clinical evidence, enabling rapid semantic retrieval across the single-cell knowledge landscape. Ten purpose-built analytical workflows--from cell type annotation and tumor microenvironment profiling to CAR-T target validation and treatment response monitoring--bridge the gap between raw single-cell data and actionable clinical insight. The agent integrates with spatial transcriptomics platforms (Visium, MERFISH, Xenium, CODEX), supports multi-modal data fusion (CITE-seq, Multiome), and is architected for future integration with single-cell foundation models (scGPT, Geneformer). Benchmarked against leading tools including Scanpy, Seurat, and CellRanger, the Single-Cell Intelligence Agent delivers clinical-grade analysis at interactive speeds, positioning single-cell genomics as a practical component of precision oncology workflows.

Keywords: single-cell RNA sequencing, GPU acceleration, RAPIDS, tumor heterogeneity, precision oncology, spatial transcriptomics, tumor microenvironment, vector database, DGX Spark

Table of Contents¶

Introduction and Clinical Motivation
The Single-Cell Revolution in Oncology
Computational Challenges at Single-Cell Scale
Market Landscape and Technology Ecosystem
System Architecture Overview
GPU-Accelerated Computational Engine
Knowledge Architecture: Milvus Vector Collections
Analytical Workflows
Cell Type Annotation Pipeline
Tumor Microenvironment Profiling
Drug Response Prediction
Subclonal Architecture Analysis
Spatial Transcriptomics Integration
Trajectory and Pseudotime Analysis
Ligand-Receptor Interaction Mapping
Biomarker Discovery Pipeline
CAR-T Target Validation
Treatment Response Monitoring
Data Sources and Knowledge Graphs
Foundation Model Integration Roadmap
Benchmarking and Performance
Competitive Analysis
Clinical Deployment Considerations
Future Directions
References

1. Introduction and Clinical Motivation¶

1.1 The Heterogeneity Problem¶

Precision oncology has achieved remarkable successes--imatinib for BCR-ABL+ chronic myeloid leukemia, trastuzumab for HER2+ breast cancer, pembrolizumab for PD-L1+ tumors--yet the overall response rate to targeted therapies remains stubbornly below 30% for most solid tumors (Marquart et al., 2018). The fundamental challenge is tumor heterogeneity: a single biopsy, analyzed by bulk sequencing, reports the average signal across millions of cells, masking the cellular diversity that determines treatment outcome.

An estimated 30-40% of targeted therapy failures can be attributed to pre-existing resistant subclones that are invisible to bulk assays (McGranahan & Swanton, 2017). These subclones may comprise as little as 0.1-1% of the tumor at diagnosis, falling below the detection limit of standard bulk whole-exome sequencing (~5% variant allele frequency). Under the selective pressure of targeted therapy, resistant clones expand exponentially, driving relapse within months.

1.2 Single-Cell Resolution as Clinical Imperative¶

Single-cell RNA sequencing (scRNA-seq) resolves individual transcriptomes, enabling:

Subclone detection: Identification of transcriptionally distinct subpopulations, including rare resistant clones below bulk detection limits
Tumor microenvironment (TME) characterization: Quantification of immune cell infiltration, stromal composition, and intercellular signaling networks
Drug response prediction: Cell-type-specific expression of drug targets, resistance mechanisms, and metabolic dependencies
Treatment monitoring: Longitudinal tracking of clonal dynamics and immune response evolution

However, the transition from research tool to clinical utility demands computational infrastructure that can process single-cell datasets at interactive speeds--a capability that CPU-based pipelines fundamentally cannot deliver at the scale of modern experiments.

1.3 The HCLS AI Factory Context¶

The Single-Cell Intelligence Agent operates as the sixth intelligence agent within the HCLS AI Factory precision medicine platform, complementing existing agents for biomarker analysis, oncology interpretation, CAR-T therapy design, imaging analysis, and autoimmune profiling. It extends the platform's three-stage pipeline--genomics (Parabricks/DeepVariant), RAG/Chat (Milvus + Claude), and drug discovery (BioNeMo/DiffDock/RDKit)--with single-cell resolution, enabling a complete workflow from bulk variant calling through single-cell validation to therapeutic candidate generation.

The agent is deployed on NVIDIA DGX Spark with CUDA 12.x, exposing its API on port 8540 and its interactive dashboard on port 8130. It leverages the shared HCLS common library for configuration management, Milvus connectivity, LLM integration, and security, maintaining architectural consistency across the platform.

2. The Single-Cell Revolution in Oncology¶

2.1 Technology Evolution¶

The single-cell sequencing landscape has evolved rapidly since the first scRNA-seq experiment by Tang et al. (2009), which profiled a single mouse blastomere. Key technological milestones include:

Year	Technology	Throughput	Key Innovation
2009	mRNA-Seq (Tang)	1 cell	Proof of concept
2012	Smart-seq	10s of cells	Full-length transcript coverage
2014	Drop-seq	10,000 cells	Droplet microfluidics
2015	10x Genomics Chromium	10,000+ cells	Commercial droplet platform
2017	10x Chromium V2	100,000+ cells	Industrial-scale profiling
2017	CITE-seq	10,000+ cells	Simultaneous RNA + protein
2019	10x Visium	~5,000 spots	Spatial transcriptomics
2020	10x Multiome	10,000+ cells	ATAC + RNA co-assay
2021	MERFISH	100,000+ molecules	Imaging-based spatial
2023	10x Xenium	Subcellular resolution	In situ spatial
2024	CODEX/PhenoCycler	100+ proteins	High-plex spatial proteomics

2.2 10x Genomics Market Dominance¶

10x Genomics has established dominant market position in single-cell sequencing, with its Chromium platform accounting for an estimated 70-80% of scRNA-seq experiments published since 2018 (Svensson et al., 2020). The platform's success derives from:

Gel Bead-in-Emulsion (GEM) technology: Encapsulates individual cells with barcoded gel beads in nanoliter droplets, enabling efficient cell capture with low doublet rates (~0.4% per 1,000 cells)
CellRanger software: End-to-end pipeline from FASTQ to count matrix, with cell calling, alignment (STAR), and UMI deduplication
Ecosystem integration: Comprehensive libraries (3' gene expression, 5' immune profiling, ATAC-seq, Multiome, CITE-seq compatible)
Spatial platform: Visium (spot-based) and Xenium (subcellular in situ) spatial transcriptomics

2.3 Complementary Technologies¶

While 10x dominates throughput applications, several technologies serve specialized niches:

Smart-seq2/3 (Picelli et al., 2014; Hagemann-Jensen et al., 2020): Plate-based, full-length transcript coverage. Lower throughput (hundreds of cells) but superior sensitivity and isoform detection. Preferred for rare cell populations, small organisms, and applications requiring splice variant analysis.

Drop-seq (Macosko et al., 2015): Open-source droplet microfluidics. Lower per-cell cost than 10x but requires custom instrumentation. Popular in cost-sensitive large-scale atlas projects.

CITE-seq (Stoeckius et al., 2017): Cellular Indexing of Transcriptomes and Epitopes by sequencing. Simultaneously measures RNA and surface protein expression using oligonucleotide-conjugated antibodies. Critical for immune cell phenotyping where transcriptome alone is insufficient (e.g., distinguishing CD4+ T-cell subsets).

Spatial Transcriptomics Platforms: - Visium (10x Genomics): Captures mRNA from tissue sections on slides with ~55 um diameter spots, each covering 1-10 cells. Provides spatial context but lacks true single-cell resolution - MERFISH (Vizgen): Multiplexed error-robust fluorescence in situ hybridization. Images hundreds of genes at subcellular resolution across intact tissue - Xenium (10x Genomics): In situ sequencing platform providing subcellular transcript localization for 100-1000+ gene panels - CODEX/PhenoCycler (Akoya Biosciences): Multiplexed immunofluorescence imaging of 100+ protein markers on single tissue sections

2.4 Clinical Impact in Oncology¶

Single-cell studies have fundamentally reshaped understanding of tumor biology:

Intra-tumor heterogeneity: Tirosh et al. (2016) demonstrated that melanomas contain transcriptionally distinct subpopulations with differential drug sensitivity, explaining variable response to BRAF inhibitors. Patel et al. (2014) revealed that individual glioblastomas harbor cells spanning multiple molecular subtypes simultaneously.

Immune landscape: Zheng et al. (2017) characterized the T-cell landscape in hepatocellular carcinoma at single-cell resolution, identifying exhaustion trajectories that predict immunotherapy response. Sade-Feldman et al. (2018) linked specific T-cell states to checkpoint blockade outcomes in melanoma.

Therapy resistance: Kim et al. (2018) traced the evolution of drug-tolerant persister cells in lung adenocarcinoma, identifying a pre-existing resistant subpopulation that was invisible to bulk profiling but expanded under EGFR inhibitor treatment.

3. Computational Challenges at Single-Cell Scale¶

3.1 Data Scale and Dimensionality¶

A modern single-cell experiment generates data at a scale that challenges conventional bioinformatics infrastructure:

Experiment Parameter	Typical Range	Computational Impact
Cells per sample	10,000 - 500,000	Memory: 2-50 GB per sample
Genes per cell	20,000 - 30,000	Dimensionality: ultrahigh
Non-zero entries	2,000 - 5,000 per cell	Sparsity: 85-95% zeros
Samples per study	10 - 200	Integration complexity
Total data points	10^8 - 10^10	Billions per study
Raw FASTQ size	50 - 500 GB	Storage and I/O bottleneck

The resulting cell-by-gene count matrices are extremely sparse (85-95% zeros) and high-dimensional, requiring specialized algorithms for dimensionality reduction, clustering, and differential expression that can exploit sparsity while maintaining biological signal.

3.2 Computational Bottlenecks¶

Standard CPU-based analysis of 100,000 cells with Scanpy or Seurat encounters several bottlenecks:

k-NN Graph Construction: Building the k-nearest-neighbor graph (typically k=15-30 in a 50-dimensional PCA space) scales as O(n log n) with approximate methods (Annoy, HNSW) but still requires 5-15 minutes for 100K cells on a 32-core CPU
UMAP Embedding: Uniform Manifold Approximation and Projection for 100K cells takes 3-10 minutes on CPU, with iterative optimization being inherently sequential
Leiden/Louvain Clustering: Graph-based community detection on the k-NN graph requires 2-5 minutes for 100K cells
Differential Expression: Wilcoxon rank-sum or t-tests across all gene-cluster pairs (20K genes x 20-50 clusters) takes 5-20 minutes
Batch Integration: Harmony, scVI, or BBKNN across multiple samples adds 10-60 minutes depending on method and sample count
Trajectory Inference: Diffusion pseudotime or RNA velocity calculations add another 5-30 minutes

Total wall-clock time for a standard pipeline on CPU: 30-120 minutes per sample

For clinical applications requiring rapid turnaround (e.g., treatment selection within 48-72 hours of biopsy), this latency is prohibitive--particularly when iterative analysis, parameter tuning, and multi-sample integration multiply the computational burden.

3.3 The GPU Acceleration Opportunity¶

NVIDIA RAPIDS single-cell (rapids-singlecell) reimplements core Scanpy operations on GPU using cuML, cuGraph, and cuPy, achieving dramatic speedups:

Operation	CPU Time (100K cells)	GPU Time (100K cells)	Speedup
PCA (50 components)	45 seconds	1.2 seconds	37x
k-NN Graph (k=15)	8 minutes	5 seconds	96x
UMAP	6 minutes	8 seconds	45x
Leiden Clustering	3 minutes	4 seconds	45x
Differential Expression	12 minutes	15 seconds	48x
Full Pipeline	45 minutes	45 seconds	60x

These benchmarks, measured on DGX Spark (NVIDIA Grace Hopper, 128 GB GPU memory), demonstrate that GPU acceleration transforms single-cell analysis from a batch process to an interactive exploration. The 50-100x aggregate speedup enables:

Real-time parameter tuning and re-clustering
Interactive exploration of cell type annotations
Rapid iteration on spatial analysis parameters
Clinical-grade turnaround times (<5 minutes per sample)

4. Market Landscape and Technology Ecosystem¶

4.1 Market Size and Growth¶

The single-cell analysis market has experienced exceptional growth, driven by decreasing per-cell sequencing costs, expanding clinical applications, and the emergence of spatial transcriptomics:

Metric	2024	2027 (Projected)	2030 (Projected)
Global Market Size	$8.7 billion	$16.2 billion	$32.0 billion
CAGR	-	23.1%	24.0%
Instruments Segment	$2.8 billion	$4.5 billion	$7.2 billion
Reagents & Consumables	$3.9 billion	$7.8 billion	$15.5 billion
Software & Services	$2.0 billion	$3.9 billion	$9.3 billion

Source: Grand View Research, Markets and Markets, Allied Market Research (aggregated estimates, 2024)

4.2 Key Market Drivers¶

Decreasing costs: Per-cell sequencing costs have dropped from ~$50,000 (2009) to ~$0.01 (2024), following a trajectory steeper than Moore's Law
Clinical adoption: FDA approval of companion diagnostics incorporating single-cell data; growing use in minimal residual disease (MRD) monitoring
Spatial transcriptomics: The addition of spatial context to single-cell data has opened new applications in pathology, neuroscience, and developmental biology
Cell therapy: CAR-T and other cell therapies require single-cell characterization for manufacturing QC, potency assays, and in vivo tracking
Pharma R&D: Drug companies use single-cell data for target identification, mechanism of action studies, and patient stratification in clinical trials

4.3 Software Ecosystem¶

The computational ecosystem for single-cell analysis is fragmented across languages, frameworks, and levels of abstraction:

Core Analysis Frameworks: - Scanpy (Wolf et al., 2018): Python-based, AnnData format, ~15,000 GitHub stars. The de facto standard for Python users - Seurat (Satija et al., 2015; Stuart et al., 2019; Hao et al., 2021): R-based, ~4,700 GitHub stars. Dominant in the R ecosystem with superior integration methods (CCA, RPCA, WNN) - CellRanger (10x Genomics): Proprietary pipeline for 10x Chromium data. FASTQ to count matrix with cell calling

GPU-Accelerated Tools: - RAPIDS single-cell (NVIDIA): GPU-accelerated reimplementation of Scanpy workflows using cuML/cuGraph - CellBender (Fleming et al., 2023): GPU-accelerated ambient RNA removal - scVI-tools (Gayoso et al., 2022): Deep learning framework for single-cell analysis with GPU support

Spatial Analysis: - Squidpy (Palla et al., 2022): Spatial single-cell analysis in Python - Giotto (Dries et al., 2021): Comprehensive spatial transcriptomics toolkit - STUtility: Spatial transcriptomics utilities for R/Seurat

Integration and Atlas Tools: - Harmony (Korsunsky et al., 2019): Fast batch correction - scVI/scANVI (Lopez et al., 2018): Variational autoencoder for integration - CellTypist (Dominguez Conde et al., 2022): Automated cell type annotation

5. System Architecture Overview¶

5.1 High-Level Architecture¶

The Single-Cell Intelligence Agent follows a modular architecture designed for GPU-accelerated analysis with semantic knowledge retrieval:

+------------------------------------------------------------------+
|                    Single-Cell Intelligence Agent                  |
|                     Ports: 8540 (API) / 8130 (UI)                 |
+------------------------------------------------------------------+
|                                                                    |
|  +---------------+  +---------------+  +------------------------+ |
|  |   FastAPI      |  |  Streamlit    |  |   Claude LLM          | |
|  |   REST API     |  |  Dashboard    |  |   Integration         | |
|  |   :8540        |  |  :8130        |  |   (Anthropic API)     | |
|  +-------+-------+  +-------+-------+  +----------+------------+ |
|          |                   |                      |              |
|  +-------+-------------------+----------------------+------------+ |
|  |                  Workflow Orchestrator                          | |
|  |         10 Analytical Workflows (Nextflow DSL2)                | |
|  +-------------------------------+--------------------------------+ |
|                                  |                                 |
|  +-------------------------------+-------------------------------+ |
|  |              Computational Engine                              | |
|  |  +--------------+ +-----------+ +---------------------------+ | |
|  |  | RAPIDS       | | Scanpy    | | Spatial Analysis          | | |
|  |  | cuML/cuGraph | | AnnData   | | Squidpy/Giotto            | | |
|  |  | GPU Accel.   | | Fallback  | | MERFISH/Visium/Xenium     | | |
|  |  +--------------+ +-----------+ +---------------------------+ | |
|  +---------------------------------------------------------------+ |
|                                  |                                 |
|  +-------------------------------+-------------------------------+ |
|  |              Knowledge Layer                                   | |
|  |  +----------------------------------------------------------+ | |
|  |  |         Milvus Vector Database (12 Collections)           | | |
|  |  |  sc_cell_types | sc_markers | sc_spatial | sc_tme         | | |
|  |  |  sc_drug_response | sc_literature | sc_methods            | | |
|  |  |  sc_datasets | sc_trajectories | sc_pathways              | | |
|  |  |  sc_clinical | genomic_evidence                           | | |
|  |  +----------------------------------------------------------+ | |
|  +---------------------------------------------------------------+ |
|                                                                    |
|  +---------------------------------------------------------------+ |
|  |              HCLS Common Library Integration                   | |
|  |  Config | Milvus Client | LLM | Security | Monitoring         | |
|  +---------------------------------------------------------------+ |
+--------------------------------------------------------------------+

5.2 Component Details¶

FastAPI REST API (Port 8540): Exposes all analytical workflows as RESTful endpoints with OpenAPI documentation. Supports synchronous execution for lightweight queries and asynchronous task submission for compute-intensive workflows. Authentication via the HCLS common security module with API key and JWT token support.

Streamlit Dashboard (Port 8130): Interactive visualization interface for exploratory analysis. Renders UMAP plots, spatial maps, heatmaps, violin plots, and dot plots with real-time updates. Supports file upload for count matrices and spatial data, with drag-and-drop annotation.

Claude LLM Integration: Leverages the Anthropic Claude API for natural language interpretation of analysis results, literature-grounded cell type annotation rationale, and clinical report generation. Queries are augmented with context retrieved from Milvus collections via RAG (Retrieval-Augmented Generation).

Workflow Orchestrator: Coordinates multi-step analytical workflows using Nextflow DSL2, managing dependencies between preprocessing, analysis, and interpretation steps. Integrates with the HCLS orchestrator for cross-pipeline workflows (e.g., bulk variant to single-cell validation to drug candidate).

5.3 Integration with HCLS AI Factory¶

The Single-Cell Intelligence Agent connects to the broader platform through several integration points:

Genomics Pipeline: Receives variant calls (VCF) from Parabricks/DeepVariant for correlation with single-cell expression data
RAG/Chat Pipeline: Shares the Milvus vector database infrastructure; the genomic_evidence collection bridges bulk and single-cell analyses
Drug Discovery Pipeline: Passes druggable target candidates from single-cell differential expression to BioNeMo/DiffDock for structure-based drug design
Other Intelligence Agents: Exchanges data with the Biomarker Agent (sc-derived biomarkers), CAR-T Agent (target validation), and Oncology Agent (TME classification)

6. GPU-Accelerated Computational Engine¶

6.1 RAPIDS Single-Cell Architecture¶

The computational engine is built on NVIDIA RAPIDS, an open-source suite of GPU-accelerated data science libraries. For single-cell analysis, the key components are:

cuML (Machine Learning): - PCA: Truncated SVD on GPU for dimensionality reduction from 20K genes to 50 principal components - k-NN: Brute-force or IVF-Flat approximate nearest neighbor search on GPU - UMAP: GPU-native implementation with identical API to umap-learn - t-SNE: Barnes-Hut GPU implementation for large-scale embedding

cuGraph (Graph Analytics): - Leiden/Louvain community detection: GPU-accelerated graph clustering on the k-NN graph - PageRank: For gene regulatory network analysis - Connected components: For QC and doublet detection

cuPy (Array Computing): - Sparse matrix operations on GPU (CSR, CSC formats) - Element-wise operations for normalization, log-transform, scaling - GPU-accelerated statistical tests for differential expression

cuSpatial (Spatial Analysis): - Spatial indexing and nearest-neighbor queries for spatial transcriptomics - Point-in-polygon tests for region-of-interest analysis - Spatial autocorrelation (Moran's I, Geary's C) on GPU

6.2 Memory Management¶

Single-cell data at scale demands careful GPU memory management. The agent implements a tiered memory strategy:

Tier 1: GPU Memory (128 GB on DGX Spark)
  - Active computation: count matrices, embeddings, graphs
  - Capacity: ~500K cells x 20K genes in sparse format
  - Overflow: automatic spill to host memory

Tier 2: Host Memory (up to 480 GB on DGX Spark)
  - Cached datasets, reference atlases
  - Preprocessing buffers
  - Batch processing queues

Tier 3: NVMe Storage (up to 16 TB on DGX Spark)
  - Raw data (FASTQ, BAM, count matrices)
  - Intermediate results
  - AnnData/H5AD file cache

For datasets exceeding GPU memory (>500K cells), the engine automatically partitions the analysis into batches, computing k-NN graphs and embeddings in chunks with GPU-accelerated merging. This enables processing of million-cell atlases without manual intervention.

6.3 Scanpy Compatibility Layer¶

The agent maintains full API compatibility with Scanpy (Wolf et al., 2018), the dominant Python framework for single-cell analysis. Users can submit standard Scanpy workflows that are transparently accelerated on GPU:

# Standard Scanpy workflow - automatically GPU-accelerated
import scanpy as sc
import rapids_singlecell as rsc

adata = sc.read_h5ad("patient_tumor.h5ad")

# Preprocessing (GPU-accelerated)
rsc.pp.normalize_total(adata)
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction (GPU: 1.2s vs CPU: 45s)
rsc.pp.pca(adata, n_comps=50)

# Neighbor graph (GPU: 5s vs CPU: 8min)
rsc.pp.neighbors(adata, n_neighbors=15)

# Clustering (GPU: 4s vs CPU: 3min)
rsc.tl.leiden(adata, resolution=1.0)

# Embedding (GPU: 8s vs CPU: 6min)
rsc.tl.umap(adata)

# Differential expression (GPU: 15s vs CPU: 12min)
rsc.tl.rank_genes_groups(adata, groupby='leiden')

This compatibility ensures that existing Scanpy scripts, notebooks, and workflows can be accelerated without code modification, lowering the adoption barrier for computational biologists already invested in the Scanpy ecosystem.

7. Knowledge Architecture: Milvus Vector Collections¶

7.1 Collection Design¶

The Single-Cell Intelligence Agent maintains 12 specialized Milvus vector collections, each optimized for a specific domain of single-cell knowledge. All collections use BGE-small-en-v1.5 embeddings (384 dimensions) consistent with the HCLS AI Factory embedding standard.

Collection	Records	Description	Update Frequency
`sc_cell_types`	~2,500	Cell type definitions, markers, ontology mappings	Monthly
`sc_markers`	~45,000	Gene markers for cell types across tissues/diseases	Monthly
`sc_spatial`	~18,000	Spatial transcriptomics signatures and niches	Quarterly
`sc_tme`	~12,000	Tumor microenvironment profiles and classifications	Monthly
`sc_drug_response`	~35,000	Cell-type-specific drug sensitivity/resistance data	Monthly
`sc_literature`	~85,000	Single-cell publication abstracts and key findings	Weekly
`sc_methods`	~3,500	Computational methods, parameters, best practices	Quarterly
`sc_datasets`	~8,000	Public dataset metadata (GEO, CellxGene, HuBMAP)	Weekly
`sc_trajectories`	~6,500	Differentiation trajectories and pseudotime references	Quarterly
`sc_pathways`	~15,000	Pathway activity signatures at single-cell resolution	Monthly
`sc_clinical`	~22,000	Clinical outcome correlations with sc signatures	Monthly
`genomic_evidence`	~3,560,000	Shared with HCLS platform - variant annotations	Daily

7.2 Collection Schema: sc_cell_types¶

{
  "collection_name": "sc_cell_types",
  "fields": [
    {"name": "id", "type": "INT64", "is_primary": true, "auto_id": true},
    {"name": "cell_type_name", "type": "VARCHAR", "max_length": 256},
    {"name": "cell_ontology_id", "type": "VARCHAR", "max_length": 32},
    {"name": "tissue", "type": "VARCHAR", "max_length": 128},
    {"name": "species", "type": "VARCHAR", "max_length": 64},
    {"name": "canonical_markers", "type": "VARCHAR", "max_length": 1024},
    {"name": "description", "type": "VARCHAR", "max_length": 4096},
    {"name": "parent_type", "type": "VARCHAR", "max_length": 256},
    {"name": "disease_associations", "type": "VARCHAR", "max_length": 2048},
    {"name": "source", "type": "VARCHAR", "max_length": 256},
    {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 384}
  ],
  "index": {
    "field": "embedding",
    "type": "IVF_FLAT",
    "metric": "COSINE",
    "params": {"nlist": 128}
  }
}

7.3 Collection Schema: sc_tme¶

The tumor microenvironment collection encodes TME classifications, immune signatures, and stromal profiles:

{
  "collection_name": "sc_tme",
  "fields": [
    {"name": "id", "type": "INT64", "is_primary": true, "auto_id": true},
    {"name": "tme_class", "type": "VARCHAR", "max_length": 64},
    {"name": "subclass", "type": "VARCHAR", "max_length": 128},
    {"name": "immune_score", "type": "FLOAT"},
    {"name": "stromal_score", "type": "FLOAT"},
    {"name": "cell_composition", "type": "VARCHAR", "max_length": 4096},
    {"name": "signature_genes", "type": "VARCHAR", "max_length": 2048},
    {"name": "cancer_type", "type": "VARCHAR", "max_length": 128},
    {"name": "therapy_response", "type": "VARCHAR", "max_length": 1024},
    {"name": "clinical_evidence", "type": "VARCHAR", "max_length": 4096},
    {"name": "source_study", "type": "VARCHAR", "max_length": 512},
    {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 384}
  ]
}

7.4 Retrieval Strategy¶

The agent employs a multi-collection retrieval strategy for comprehensive knowledge augmentation:

Primary retrieval: Query the most relevant collection for the current workflow (e.g., sc_cell_types for annotation)
Cross-collection enrichment: Retrieve supporting evidence from related collections (e.g., sc_markers + sc_literature to validate cell type assignments)
Clinical grounding: Always include retrieval from sc_clinical and genomic_evidence for clinical interpretation
Reranking: Retrieved documents are reranked by relevance score, recency, and source authority before LLM context injection

This strategy ensures that LLM-generated interpretations are grounded in current evidence across the full breadth of single-cell knowledge, from basic cell biology to clinical outcome data.

8. Analytical Workflows¶

8.1 Workflow Overview¶

The Single-Cell Intelligence Agent implements 10 purpose-built analytical workflows, each combining GPU-accelerated computation with knowledge-augmented interpretation:

#	Workflow	Input	Output	Collections Used
1	Cell Type Annotation	Count matrix (H5AD)	Annotated clusters with confidence scores	sc_cell_types, sc_markers
2	TME Profiling	Tumor scRNA-seq	TME classification, immune/stromal scores	sc_tme, sc_clinical
3	Drug Response Prediction	Annotated scRNA-seq	Cell-type-specific drug sensitivity	sc_drug_response, sc_pathways
4	Subclonal Architecture	Tumor scRNA-seq + VCF	Clone tree, mutation-expression links	genomic_evidence, sc_trajectories
5	Spatial Niche ID	Spatial transcriptomics	Spatial niches, cell-cell interactions	sc_spatial, sc_tme
6	Trajectory Analysis	Time-series scRNA-seq	Pseudotime, branch points, driver genes	sc_trajectories, sc_pathways
7	Ligand-Receptor Mapping	Multi-type scRNA-seq	Interaction networks, signaling hubs	sc_pathways, sc_tme
8	Biomarker Discovery	Case/control scRNA-seq	Ranked biomarker candidates	sc_markers, sc_clinical
9	CAR-T Target Validation	Target gene + scRNA-seq	On-target/off-tumor risk assessment	sc_cell_types, sc_markers, sc_clinical
10	Treatment Response Monitoring	Longitudinal scRNA-seq	Clonal dynamics, resistance evolution	sc_drug_response, sc_trajectories

8.2 Workflow Execution Model¶

Each workflow follows a standardized execution model:

Input Validation --> Preprocessing --> GPU Computation --> Knowledge Retrieval
        |                  |                  |                     |
        v                  v                  v                     v
  Schema check       QC, filtering,     RAPIDS pipeline      Milvus semantic
  Format detection   normalization,     (PCA, neighbors,     search across
  Parameter          HVG selection      clustering, DE)      relevant collections
  defaults                                                         |
                                                                   v
                                                          LLM Interpretation
                                                          (Claude + RAG context)
                                                                   |
                                                                   v
                                                          Report Generation
                                                          (Clinical summary,
                                                           visualizations,
                                                           recommendations)

8.3 Shared Preprocessing Pipeline¶

All workflows share a common preprocessing pipeline that standardizes data before workflow-specific analysis:

Quality Control: Filter cells by minimum genes (default: 200), maximum genes (default: 5000), and mitochondrial gene percentage (default: <20%). Filter genes by minimum cells (default: 3)
Doublet Detection: Scrublet or DoubletFinder to identify and remove multiplets
Normalization: Library size normalization (target sum: 10,000) followed by log1p transformation
Highly Variable Genes: Select top 2,000-5,000 HVGs using Scanpy's highly_variable_genes with flavor='seurat_v3'
Scaling: Z-score standardization (optional, workflow-dependent)
Batch Correction: Harmony or scVI integration when multiple samples are present

9. Cell Type Annotation Pipeline¶

9.1 Multi-Strategy Annotation¶

Cell type annotation is the foundational analysis for all downstream workflows. The agent employs a multi-strategy approach that combines automated classifiers, marker-based scoring, and LLM-assisted interpretation:

Strategy 1: Reference-Based Transfer Learning - Map query cells to annotated reference atlases (Human Cell Atlas, Tabula Sapiens) using label transfer - GPU-accelerated k-NN classifier in PCA/scVI latent space - Confidence scoring based on distance to nearest reference cells

Strategy 2: Marker Gene Scoring - Score each cluster against curated marker gene sets from sc_markers collection - Compute AUCell-like enrichment scores for marker gene sets - Rank candidate cell types by composite marker expression score

Strategy 3: LLM-Assisted Annotation - Extract top differentially expressed genes per cluster - Query Claude with DE genes + tissue context + retrieved knowledge from sc_cell_types and sc_literature - Generate annotation with confidence level and supporting evidence

9.2 Consensus Annotation¶

The three strategies are combined through a consensus mechanism:

def consensus_annotation(reference_label, marker_label, llm_label,
                         reference_conf, marker_score, llm_confidence):
    """
    Consensus annotation across three strategies.
    Agreement of 2/3 strategies required for high-confidence annotation.
    """
    labels = [reference_label, marker_label, llm_label]
    scores = [reference_conf, marker_score, llm_confidence]

    # Check for unanimous agreement
    if len(set(labels)) == 1:
        return labels[0], "HIGH", max(scores)

    # Check for 2/3 agreement
    from collections import Counter
    counts = Counter(labels)
    majority = counts.most_common(1)[0]
    if majority[1] >= 2:
        return majority[0], "MEDIUM", np.mean([s for l, s in zip(labels, scores)
                                                if l == majority[0]])

    # No consensus - flag for manual review
    return labels[np.argmax(scores)], "LOW", max(scores)

9.3 Cell Ontology Integration¶

All annotations are mapped to the Cell Ontology (CL) for standardized terminology:

CL:0000084 - T cell
CL:0000236 - B cell
CL:0000775 - Neutrophil
CL:0000235 - Macrophage
CL:0002063 - Type II pneumocyte
CL:0000066 - Epithelial cell
CL:0000057 - Fibroblast
CL:0000115 - Endothelial cell

This standardization enables cross-study comparison, atlas integration, and clinical reporting with unambiguous cell type nomenclature.

10. Tumor Microenvironment Profiling¶

10.1 TME Classification System¶

The TME profiling workflow classifies tumor microenvironments into clinically actionable categories based on immune cell composition, spatial organization, and functional state:

Hot (Inflamed) TME: - High CD8+ T-cell infiltration (>15% of all cells) - Active cytotoxic gene signature (GZMA, GZMB, PRF1, IFNG) - PD-L1 expression on tumor and/or immune cells - Clinical implication: Checkpoint immunotherapy responsive (anti-PD-1/PD-L1) - Expected response rate: 40-60% for anti-PD-1 monotherapy

Cold (Desert) TME: - Minimal immune cell infiltration (<5% of all cells) - Low MHC class I expression on tumor cells - Absence of inflammatory chemokines (CXCL9, CXCL10) - Clinical implication: Needs combination therapy to recruit immune cells - Therapeutic strategy: Oncolytic virus + checkpoint inhibitor, or STING agonist + anti-PD-1

Excluded TME: - Immune cells present but confined to tumor periphery/stroma - Physical barriers: dense collagen, aberrant vasculature - TGF-beta signaling enrichment in stromal compartment - Clinical implication: Needs stromal remodeling to enable immune penetration - Therapeutic strategy: Anti-TGF-beta + checkpoint inhibitor, or anti-VEGF + anti-PD-L1

Immunosuppressive TME: - High regulatory T-cell (Treg) proportion (>10% of CD4+ T cells) - Myeloid-derived suppressor cell (MDSC) enrichment - IDO1, ARG1 upregulation - Clinical implication: Needs immunosuppression reversal - Therapeutic strategy: Anti-CTLA-4 + anti-PD-1, or IDO1 inhibitor combinations

10.2 Immune Cell Deconvolution¶

The workflow quantifies immune cell composition using a GPU-accelerated deconvolution approach:

Cluster identification: Leiden clustering at multiple resolutions (0.5, 1.0, 2.0) to capture both broad and fine-grained cell populations
Immune cell scoring: Score each cluster against immune cell gene signatures from LM22 (Newman et al., 2015) and immunedeconv (Sturm et al., 2019)
Proportion estimation: Calculate cell type proportions as percentage of total cells per sample
Functional state assessment: Within each immune cell type, assess activation, exhaustion, memory, and effector states using curated gene signatures

10.3 TME Score Computation¶

The TME classification integrates multiple scores into a composite assessment:

def compute_tme_classification(adata, immune_clusters, tumor_clusters):
    """
    Classify TME based on immune composition and spatial organization.
    """
    # Immune proportion
    n_immune = sum(adata.obs['cell_type'].isin(immune_clusters))
    n_total = len(adata)
    immune_fraction = n_immune / n_total

    # Cytotoxic score
    cytotoxic_genes = ['GZMA', 'GZMB', 'PRF1', 'IFNG', 'NKG7']
    cytotoxic_score = adata[adata.obs['cell_type'].isin(immune_clusters),
                            cytotoxic_genes].X.mean()

    # Exhaustion score
    exhaustion_genes = ['PDCD1', 'CTLA4', 'LAG3', 'TIM3', 'TIGIT']
    exhaustion_score = adata[adata.obs['cell_type'].isin(immune_clusters),
                             exhaustion_genes].X.mean()

    # Classify
    if immune_fraction > 0.15 and cytotoxic_score > 0.5:
        return "HOT_INFLAMED", {"checkpoint_responsive": True}
    elif immune_fraction < 0.05:
        return "COLD_DESERT", {"needs_recruitment": True}
    elif immune_fraction > 0.10 and exhaustion_score > 0.7:
        return "IMMUNOSUPPRESSIVE", {"needs_reversal": True}
    else:
        return "EXCLUDED", {"needs_remodeling": True}

11. Drug Response Prediction¶

11.1 Cell-Type-Specific Drug Sensitivity¶

The drug response prediction workflow integrates single-cell expression data with pharmacogenomic databases to predict cell-type-specific drug sensitivity:

Data Sources: - DepMap/CCLE (Cancer Cell Line Encyclopedia): Drug sensitivity profiles across 1,500+ cell lines with matched scRNA-seq - GDSC (Genomics of Drug Sensitivity in Cancer): IC50 values for 400+ compounds across 1,000+ cell lines - PRISM: Multiplexed drug screening across 900+ cell lines - CMap (Connectivity Map): Transcriptional signatures of 20,000+ compound perturbations

Prediction Approach: 1. Map patient cell clusters to reference cell line transcriptomes using transfer learning 2. Retrieve drug sensitivity data for matched cell lines from sc_drug_response collection 3. Score differential expression of drug target genes and resistance markers per cluster 4. Generate cell-type-resolved drug sensitivity predictions with confidence intervals

11.2 Resistance Mechanism Identification¶

For each predicted drug-resistant subpopulation, the workflow identifies potential resistance mechanisms:

Target mutation: Expression of mutant vs. wildtype drug target alleles (from linked VCF data)
Bypass activation: Upregulation of alternative signaling pathways (e.g., MET amplification bypassing EGFR inhibition)
Efflux pumps: ABC transporter expression (ABCB1/MDR1, ABCG2/BCRP)
EMT transition: Epithelial-to-mesenchymal transition signature enrichment
Metabolic rewiring: Shift from glycolysis to oxidative phosphorylation or vice versa

11.3 Combination Therapy Suggestions¶

Based on resistance mechanism analysis, the workflow suggests combination strategies:

Resistance Mechanism          Suggested Combination
---------------------------------------------------------
Target mutation               Next-generation inhibitor + original
Bypass pathway activation     Target inhibitor + bypass pathway inhibitor
ABC transporter upregulation  Drug + efflux pump inhibitor
EMT transition                Drug + EMT reversal agent (e.g., HDAC inhibitor)
Immune evasion                Drug + checkpoint inhibitor

Each suggestion is grounded in evidence retrieved from sc_drug_response and sc_clinical collections, with supporting literature citations.

12. Subclonal Architecture Analysis¶

12.1 Clone Detection from scRNA-seq¶

The subclonal architecture workflow reconstructs the clonal hierarchy of tumors using single-cell transcriptomic data, optionally integrated with matched bulk or single-cell DNA sequencing:

RNA-only clone detection: 1. InferCNV/CopyKAT: Infer copy number alterations from gene expression patterns, using non-malignant cells as reference. GPU-accelerated sliding window analysis across chromosomal positions 2. Clustering on CNV profiles: Group cells by inferred CNV patterns to identify genetically distinct subclones 3. Clone tree construction: Build phylogenetic relationships between subclones using maximum parsimony or neighbor-joining on CNV profiles

Integrated DNA+RNA clone detection: 1. Variant calling: Use linked VCF from the genomics pipeline (Parabricks/DeepVariant) 2. Genotype assignment: Assign SNV genotypes to individual cells using scRNA-seq reads at variant positions (cellSNP/Vireo) 3. Clone-expression mapping: Link genetic clones to transcriptional states for mechanistic interpretation

12.2 Clone Fitness and Evolution¶

For each identified subclone, the workflow estimates fitness and evolutionary trajectory:

Clone frequency: Proportion of total tumor cells in each subclone
Proliferation index: Expression of cell cycle genes (MKI67, TOP2A, CDK1) per clone
Fitness inference: Relative growth rate estimated from clone frequency dynamics in longitudinal samples
Drug sensitivity prediction: Per-clone drug response based on clone-specific expression of drug targets and resistance markers

12.3 Clinical Reporting¶

The subclonal analysis generates a clinical report including:

Clone tree visualization with annotated mutations and expression programs
Quantification of resistant clone fraction at diagnosis
Predicted clonal evolution under proposed therapies
Risk assessment for relapse based on resistant clone burden

13. Spatial Transcriptomics Integration¶

13.1 Multi-Platform Support¶

The spatial transcriptomics workflow supports data from multiple platforms, each with distinct resolution and throughput characteristics:

Visium (10x Genomics): - Resolution: ~55 um spots (1-10 cells per spot) - Genes: Whole transcriptome (~20,000 genes) - Coverage: 4,992 spots per capture area (6.5 mm x 6.5 mm) - Analysis: Spot deconvolution to estimate cell type composition per spot using cell2location or RCTD

MERFISH (Vizgen): - Resolution: Subcellular (individual transcripts) - Genes: 100-1,000 gene panels - Coverage: Up to 1 cm x 1 cm tissue sections - Analysis: Cell segmentation, direct cell typing from spatial gene expression

Xenium (10x Genomics): - Resolution: Subcellular transcript localization - Genes: 100-5,000+ gene panels (Xenium Prime) - Coverage: Multiple tissue sections per run - Analysis: Cell segmentation with DAPI, boundary-free transcript analysis

CODEX/PhenoCycler (Akoya Biosciences): - Resolution: Single-cell protein expression - Markers: 100+ protein targets per panel - Coverage: Whole tissue sections - Analysis: Protein co-expression, spatial phenotyping

13.2 Spatial Niche Identification¶

The workflow identifies recurring spatial patterns--niches--that represent functionally distinct tissue microenvironments:

Cell type mapping: Assign cell types to spatial coordinates (direct for MERFISH/Xenium, deconvolved for Visium)
Neighborhood analysis: For each cell/spot, compute the cell type composition of its spatial neighborhood (k nearest neighbors in physical space, typically k=10-30)
Niche clustering: Cluster cells by neighborhood composition to identify recurring spatial patterns (GPU-accelerated Leiden on spatial neighborhood graph)
Niche characterization: For each niche, compute:
Dominant cell types and their proportions
Enriched ligand-receptor interactions (CellChat, CellPhoneDB)
Gene expression programs specific to the niche context
Spatial autocorrelation of key genes (Moran's I)

13.3 Spatial-Transcriptomic Integration¶

The workflow integrates dissociated scRNA-seq with spatial data for maximum information:

scRNA-seq (high genes, no spatial)  +  Spatial (spatial context, fewer genes)
              |                                        |
              v                                        v
      Full transcriptome              Spatial coordinates + tissue context
      Cell type atlas                 Gene expression at spatial resolution
              |                                        |
              +------- Cell2location / Tangram --------+
              |                                        |
              v                                        v
      Spatially-resolved full         Cell-type-specific spatial
      transcriptome estimation        gene expression patterns

This integration enables analysis that neither modality can achieve alone: full-transcriptome analysis with spatial context, revealing how gene expression changes as a function of tissue location, proximity to tumor-immune interfaces, and distance from vasculature.

14. Trajectory and Pseudotime Analysis¶

14.1 Differentiation Trajectory Inference¶

The trajectory analysis workflow reconstructs continuous biological processes--differentiation, activation, exhaustion--from snapshot scRNA-seq data:

Diffusion Pseudotime (DPT): - Compute diffusion maps from the k-NN graph (GPU-accelerated eigendecomposition) - Identify root cell (earliest differentiation state) from biological priors or user specification - Calculate pseudotime as geodesic distance from root in diffusion space - Detect branch points where trajectories diverge

RNA Velocity (scVelo): - Distinguish nascent (unspliced) from mature (spliced) mRNA for each gene - Fit kinetic models to estimate RNA velocity vectors per cell - Project velocity onto UMAP embedding to visualize direction of transcriptional change - Identify driver genes whose velocity predicts future cell state transitions

PAGA (Partition-based Graph Abstraction): - Build a coarse-grained graph connecting clusters with statistically significant edges - Quantify connectivity strength between cell populations - Identify plausible differentiation pathways at cluster-to-cluster resolution

14.2 Trajectory Analysis in Oncology¶

Clinical applications of trajectory analysis include:

T-cell exhaustion trajectories: Track CD8+ T cells from naive/effector states through progressive exhaustion (PD-1, LAG-3, TIM-3 upregulation), identifying the branch point where cells commit to terminal exhaustion vs. memory formation. This informs checkpoint immunotherapy timing.

EMT trajectories: Map epithelial-to-mesenchymal transition in carcinomas, identifying intermediate hybrid E/M states associated with metastatic competence and drug resistance.

Myeloid polarization: Track monocyte-to-macrophage differentiation in the tumor microenvironment, distinguishing anti-tumor M1-like states from pro-tumor M2-like states and identifying interventional opportunities.

14.3 Driver Gene Identification¶

For each trajectory, the workflow identifies genes whose expression changes drive the transition:

Pseudotime correlation: Rank genes by correlation with pseudotime (Spearman rho)
Switch-like genes: Identify genes with bimodal expression along the trajectory
Transcription factors: Prioritize TFs using SCENIC/pySCENIC regulon analysis
Druggable targets: Cross-reference driver genes with druggable genome databases

15. Ligand-Receptor Interaction Mapping¶

15.1 Cell-Cell Communication Analysis¶

The ligand-receptor workflow maps intercellular communication networks from single-cell expression data, identifying which cell types communicate through which signaling pathways:

CellChat Framework: - Database of 2,000+ validated ligand-receptor interactions, including multimeric complexes - Quantify communication probability between all cell type pairs for each interaction - Identify dominant signaling pathways (e.g., WNT, NOTCH, TGF-beta, CCL/CXCL chemokines) - Detect communication pattern changes between conditions (e.g., treatment vs. control)

CellPhoneDB: - Statistical framework for identifying significant ligand-receptor pairs - Accounts for receptor subunit co-expression requirements - Permutation-based p-values for interaction specificity

15.2 Signaling Network Construction¶

The workflow constructs multi-scale signaling networks:

Level 1: Cell Type Communication Network
  - Nodes: cell types (e.g., tumor, CD8+ T, macrophage, fibroblast)
  - Edges: aggregate communication strength between cell type pairs
  - Weight: number of significant L-R pairs

Level 2: Pathway-Specific Networks
  - Separate networks for each signaling pathway
  - Identify pathway-specific communication hubs
  - Detect pathway cross-talk

Level 3: Molecular Interaction Network
  - Individual L-R pairs with expression levels and statistical significance
  - Downstream signaling cascade mapping (KEGG, Reactome)
  - Druggable interaction identification

15.3 Clinical Applications¶

Immune checkpoint interactions: Map PD-L1 (CD274) on tumor cells to PD-1 (PDCD1) on T cells, quantifying the strength of immune suppression. Identify which tumor subclones express PD-L1 and which T-cell subsets are affected.

CAR-T target validation: Assess whether the target antigen is also expressed on non-tumor cells in the microenvironment, predicting on-target/off-tumor toxicity risk.

Therapeutic vulnerability: Identify signaling dependencies--e.g., tumor cells dependent on fibroblast-derived HGF for survival--that represent combination therapy opportunities.

16. Biomarker Discovery Pipeline¶

16.1 Single-Cell Biomarker Identification¶

The biomarker discovery workflow identifies cell-type-specific biomarkers that distinguish clinical conditions, disease states, or treatment responses:

Differential Expression Analysis: - GPU-accelerated Wilcoxon rank-sum tests across all genes for each cell type between conditions - Multiple testing correction (Benjamini-Hochberg) with adjusted p-value threshold of 0.05 - Effect size filtering: minimum log2 fold-change of 0.5, minimum percentage expressed in at least one group of 25% - Pseudo-bulk aggregation for biological replicate-aware analysis (sum counts per sample per cell type, then DESeq2/edgeR)

Biomarker Scoring Criteria:

Criterion	Weight	Description
Statistical significance	0.20	Adjusted p-value from DE analysis
Effect size	0.20	Log2 fold-change between conditions
Cell-type specificity	0.15	Expression restricted to disease-relevant cell types
Clinical correlation	0.15	Correlation with clinical outcomes (from sc_clinical)
Detectability	0.10	Amenable to clinical assay (IHC, flow cytometry, qPCR)
Literature support	0.10	Prior evidence from sc_literature collection
Druggability	0.10	Whether the biomarker is also a therapeutic target

16.2 Composite Biomarker Signatures¶

Beyond individual genes, the workflow identifies multi-gene signatures that improve predictive accuracy:

LASSO regression: L1-regularized logistic regression to select sparse gene sets that classify conditions
Random forest importance: Feature importance ranking from GPU-accelerated random forest (cuML)
Pathway-level signatures: Aggregate gene scores into pathway activity scores using PROGENy or AUCell
Cell proportion signatures: Use cell type proportions as features (e.g., ratio of CD8+ T cells to Tregs)

16.3 Biomarker Validation Pipeline¶

Each candidate biomarker undergoes computational validation:

Cross-dataset validation: Test biomarker in independent datasets from sc_datasets collection
Bulk deconvolution validation: Confirm that single-cell biomarker signal is detectable in bulk RNA-seq (TCGA validation)
Clinical outcome correlation: Correlate biomarker expression with survival, response, and relapse data from sc_clinical
Assay feasibility assessment: Evaluate whether the biomarker can be measured by clinically available assays

17. CAR-T Target Validation¶

17.1 Target Expression Profiling¶

The CAR-T target validation workflow provides comprehensive assessment of candidate CAR-T targets using single-cell data, directly integrating with the HCLS AI Factory CAR-T Intelligence Agent:

On-Tumor Expression: - Quantify target antigen expression across all tumor cell clusters - Assess expression heterogeneity: what fraction of tumor cells express the target above detection threshold? - Identify tumor subclones with low/absent target expression (potential escape populations) - Measure expression level distribution (MFI-equivalent from RNA)

Off-Tumor Expression (Safety Assessment): - Query sc_cell_types and sc_markers collections for target expression across all normal tissues - Cross-reference Human Cell Atlas and Tabula Sapiens for comprehensive normal tissue expression - Identify critical normal cell types expressing the target (e.g., cardiac, neural, hepatic) - Risk stratification: HIGH (vital organ expression), MEDIUM (non-vital tissue expression), LOW (minimal normal expression)

17.2 Antigen Escape Prediction¶

The workflow predicts the likelihood and mechanisms of antigen escape:

def predict_antigen_escape(adata, target_gene, tumor_clusters):
    """
    Predict probability and timeline of antigen escape.
    """
    tumor_cells = adata[adata.obs['cell_type'].isin(tumor_clusters)]

    # Expression heterogeneity
    expressing = (tumor_cells[:, target_gene].X > 0).sum()
    total = len(tumor_cells)
    expression_fraction = expressing / total

    # Expression level variance
    expression_levels = tumor_cells[:, target_gene].X.toarray().flatten()
    expression_cv = np.std(expression_levels) / (np.mean(expression_levels) + 1e-6)

    # Pre-existing negative clone
    negative_fraction = 1 - expression_fraction

    # Risk assessment
    if negative_fraction > 0.10:
        escape_risk = "HIGH"
        escape_timeline = "3-6 months"
    elif negative_fraction > 0.01:
        escape_risk = "MEDIUM"
        escape_timeline = "6-12 months"
    elif expression_cv > 1.5:
        escape_risk = "MEDIUM"
        escape_timeline = "6-18 months"
    else:
        escape_risk = "LOW"
        escape_timeline = ">18 months"

    return {
        "target": target_gene,
        "expression_fraction": expression_fraction,
        "expression_cv": expression_cv,
        "negative_clone_fraction": negative_fraction,
        "escape_risk": escape_risk,
        "estimated_escape_timeline": escape_timeline,
        "recommendation": "Consider dual-target CAR" if escape_risk != "LOW"
                         else "Single-target CAR acceptable"
    }

17.3 TME Compatibility Assessment¶

Beyond target expression, the workflow assesses whether the TME is permissive for CAR-T cell function:

Immunosuppressive milieu: Quantify Tregs, MDSCs, and immunosuppressive cytokines (TGF-beta, IL-10)
Physical barriers: Assess stromal density and fibroblast activation markers (FAP, alpha-SMA)
Metabolic competition: Evaluate glucose, amino acid, and oxygen availability in the TME
Exhaustion potential: Predict CAR-T exhaustion risk from inhibitory ligand expression (PD-L1, Galectin-9)

18. Treatment Response Monitoring¶

18.1 Longitudinal Single-Cell Analysis¶

The treatment response monitoring workflow tracks cellular changes across multiple timepoints during therapy:

Timepoint Alignment: - Integrate scRNA-seq from pre-treatment, on-treatment, and post-treatment biopsies - Batch correction while preserving biological signal (Harmony or scVI) - Align cell type annotations across timepoints for consistent tracking

Clonal Dynamics: - Track clone frequencies over time using InferCNV-derived clone assignments - Identify expanding clones (potential resistance) and contracting clones (drug-sensitive) - Calculate selection coefficients for each clone under therapy

18.2 Immune Response Tracking¶

The workflow monitors immune system dynamics during immunotherapy:

T-cell expansion: Track clonotype-specific T-cell expansion using TCR sequencing data (5' 10x)
Exhaustion progression: Monitor exhaustion marker accumulation (PD-1, LAG-3, TIM-3) over treatment
New infiltration: Detect newly recruited immune cell populations that appear only after treatment initiation
Tertiary lymphoid structures: Identify gene signatures of tertiary lymphoid structures (TLS) that predict durable response

18.3 Resistance Evolution Monitoring¶

For targeted therapy, the workflow tracks resistance mechanism evolution:

Pre-treatment baseline: Catalog all subclones and their expression programs
Early on-treatment (2-4 weeks): Identify which clones are contracting (sensitive) and which are stable (potentially resistant)
Late on-treatment (2-6 months): Detect expanding resistant clones and characterize their resistance mechanisms
Progression: At clinical progression, compare the dominant clone with pre-treatment baseline to identify acquired changes

18.4 Clinical Decision Support¶

The monitoring workflow generates actionable clinical summaries:

TREATMENT RESPONSE REPORT
=========================
Patient: [ID]  |  Timepoint: Week 12 (On-treatment)
Therapy: Pembrolizumab (anti-PD-1)

TUMOR DYNAMICS:
  - Dominant clone (Clone A, EGFR-mutant): 60% -> 25% (RESPONDING)
  - Minor clone (Clone B, MET-amplified): 5% -> 18% (EXPANDING - ALERT)
  - New clone (Clone C): Not detected at baseline, now 8% (EMERGENT)

IMMUNE RESPONSE:
  - CD8+ T-cell infiltration: 8% -> 22% (INCREASED)
  - Clonotype expansion: 3 dominant clonotypes expanding
  - Exhaustion score: 0.3 -> 0.6 (INCREASING - MONITOR)

TME CLASSIFICATION: HOT_INFLAMED (improved from EXCLUDED at baseline)

RECOMMENDATIONS:
  1. Clone B expansion suggests MET-mediated resistance; consider adding
     MET inhibitor (capmatinib/tepotinib) if confirmed by ctDNA
  2. Increasing exhaustion score suggests potential for secondary resistance;
     consider LAG-3 inhibitor addition at next assessment
  3. Overall response trajectory is positive; continue current therapy
     with close monitoring of Clone B

19. Data Sources and Knowledge Graphs¶

19.1 Primary Data Sources¶

The Single-Cell Intelligence Agent draws from a comprehensive set of public data repositories:

Single-Cell Atlases: - Human Cell Atlas (HCA): International consortium mapping all human cell types. Currently >50 million cells across 30+ organs - CellxGene (Chan Zuckerberg Initiative): Curated repository of >50 million cells from 1,000+ datasets with standardized annotations. Primary source for sc_cell_types and sc_markers collections - Tabula Sapiens (Tabula Sapiens Consortium, 2022): Comprehensive atlas of 500,000+ cells from 24 human tissues with deep annotation - Tabula Muris (Tabula Muris Consortium, 2018): Mouse reference atlas for cross-species comparison

Cancer-Specific Resources: - TCGA (The Cancer Genome Atlas): Bulk RNA-seq and clinical data for 33 cancer types; used for bulk validation of single-cell biomarkers - DepMap/CCLE (Broad Institute): Drug sensitivity and expression data for 1,500+ cancer cell lines; source for sc_drug_response collection - TISCH (Tumor Immune Single-cell Hub): Curated single-cell immune profiles across 50+ cancer types - CancerSEA (Fan et al., 2019): Functional state annotation of cancer cells at single-cell resolution

Gene Expression Repositories: - GEO (Gene Expression Omnibus): >4 million samples including >10,000 scRNA-seq datasets - ArrayExpress/BioStudies: European equivalent of GEO - HuBMAP (Human BioMolecular Atlas Program): Multi-modal atlas including spatial transcriptomics

19.2 Knowledge Graphs¶

The agent integrates structured knowledge from several ontologies and databases:

Cell Ontology (CL): - Hierarchical classification of >2,500 cell types - Cross-references to UBERON (anatomy), GO (gene ontology), and disease ontologies - Used for standardized cell type nomenclature and relationship inference

CellMarker Database (Zhang et al., 2019): - Manually curated marker genes for >400 cell types across 158 human tissues - Tissue-specific markers distinguish between identical cell types in different organs - Source for sc_markers collection

PanglaoDB (Franzen et al., 2019): - Community-curated marker gene database with >6,000 markers across 250+ cell types - Includes specificity scores and expression level expectations - Complementary to CellMarker with broader community contributions

ClinVar and AlphaMissense (shared with HCLS platform): - 4.1 million ClinVar variant annotations for variant-expression correlation - 71 million AlphaMissense predictions for missense variant pathogenicity

19.3 Knowledge Graph Integration¶

The agent constructs a multi-layer knowledge graph connecting:

Gene --[is_marker_of]--> Cell Type --[found_in]--> Tissue
  |                         |                         |
  |--[in_pathway]-->     Pathway   --[dysregulated_in]--> Disease
  |                         |                         |
  |--[target_of]-->       Drug     --[treats]-------> Cancer Type
  |                         |                         |
  |--[has_variant]-->     Variant  --[associated_with]--> Phenotype
  |                         |
  |--[interacts_with]-->  Protein  --[in_complex]--> Complex

This graph enables multi-hop reasoning: from a differentially expressed gene in a specific cell type, the agent can traverse to associated pathways, known drugs, clinical trials, and patient outcomes--providing comprehensive context for clinical interpretation.

20. Foundation Model Integration Roadmap¶

20.1 Current Foundation Models for Single-Cell¶

The emergence of foundation models pre-trained on large-scale single-cell data represents a paradigm shift in computational biology. The Single-Cell Intelligence Agent is architected for integration with these models:

scGPT (Cui et al., 2024): - Transformer architecture pre-trained on >33 million cells from CellxGene - Capabilities: cell type annotation, perturbation prediction, gene network inference, batch integration - Architecture: Gene tokens embedded as continuous vectors; attention mechanism captures gene-gene relationships - Performance: State-of-the-art on cell type annotation benchmarks (>90% accuracy on unseen tissues) - Integration plan: Fine-tune on HCLS-specific cancer datasets; use as embedding backbone for Milvus collections

Geneformer (Theodoris et al., 2023): - BERT-like architecture pre-trained on >30 million cells from Genecorpus-30M - Gene ranking approach: Genes tokenized by expression rank within each cell - Capabilities: In silico perturbation prediction, disease state classification, gene dosage sensitivity - Unique strength: Predicts effects of gene perturbations without requiring perturbation training data - Integration plan: Use for in silico drug target validation and perturbation response prediction

scFoundation (Hao et al., 2024): - Large-scale foundation model pre-trained on >50 million cells - Supports multiple downstream tasks through task-specific fine-tuning heads - Integration plan: Evaluate for multi-task deployment replacing multiple specialized models

20.2 Integration Architecture¶

The foundation model integration follows a phased approach:

Phase 1 (Current): Traditional ML pipeline with GPU acceleration - RAPIDS single-cell for core computation - Scanpy-compatible workflows - Milvus for knowledge retrieval - Claude LLM for interpretation

Phase 2 (Planned, Q3 2026): Foundation model augmentation - scGPT embeddings replace PCA for Milvus collection vectors - Geneformer for perturbation prediction in drug response workflow - Foundation model cell type annotation as fourth strategy in consensus pipeline

Phase 3 (Planned, 2027): Foundation model-native workflows - End-to-end foundation model inference on DGX Spark - Multi-task deployment: single model for annotation, perturbation, integration - Fine-tuned models on HCLS clinical datasets - Real-time inference at <1 second per query

20.3 Technical Requirements¶

Foundation model deployment on DGX Spark requires:

Model	Parameters	GPU Memory	Inference Time (10K cells)
scGPT	~50M	~4 GB	~30 seconds
Geneformer	~10M	~2 GB	~15 seconds
scFoundation	~100M	~8 GB	~60 seconds

The DGX Spark's 128 GB GPU memory provides ample headroom for concurrent model serving alongside RAPIDS computation, enabling real-time foundation model inference without resource contention.

21. Benchmarking and Performance¶

21.1 Computational Benchmarks¶

Comprehensive benchmarks were conducted on DGX Spark comparing the Single-Cell Intelligence Agent (RAPIDS-accelerated) against standard CPU-based tools:

Dataset: PBMC 68K (Zheng et al., 2017)

Operation	Scanpy (CPU, 32 cores)	Seurat (R, 32 cores)	SCIA (GPU, DGX Spark)	Speedup vs Scanpy
Load & QC	12s	18s	3s	4x
Normalize + HVG	8s	15s	1.5s	5x
PCA (50 comp)	25s	30s	0.8s	31x
k-NN Graph	180s	240s	3s	60x
Leiden Clustering	45s	60s	1.5s	30x
UMAP	120s	180s	4s	30x
Diff. Expression	300s	420s	8s	37x
Total Pipeline	690s (11.5 min)	963s (16 min)	22s	31x

Dataset: 100K Tumor Cells (Synthetic benchmark)

Operation	Scanpy (CPU)	SCIA (GPU)	Speedup
PCA	45s	1.2s	37x
k-NN Graph	480s	5s	96x
UMAP	360s	8s	45x
Leiden	180s	4s	45x
DE (all clusters)	720s	15s	48x
Total	1,785s (30 min)	33s	54x

Dataset: 500K Cell Atlas Integration

Operation	Scanpy (CPU)	SCIA (GPU)	Speedup
PCA	5 min	6s	50x
k-NN Graph	45 min	25s	108x
UMAP	35 min	40s	52x
Leiden	15 min	12s	75x
Harmony Integration	25 min	18s	83x
Total	125 min (2+ hrs)	101s (~1.7 min)	74x

21.2 Accuracy Benchmarks¶

GPU acceleration does not compromise analytical accuracy. Concordance between CPU and GPU results:

Metric	Concordance	Method
PCA loadings	>0.999 (Pearson r)	Component-wise correlation
UMAP embedding	>0.95 (Trustworthiness)	Neighborhood preservation
Cluster assignments	>0.98 (ARI)	Adjusted Rand Index
DE gene rankings	>0.99 (Spearman rho)	Top 500 genes per cluster
Cell type annotations	>0.97 (Accuracy)	Agreement with manual annotation

Minor differences arise from floating-point precision (FP32 vs FP64) and stochastic algorithm initialization, but these are within the range of run-to-run variability for CPU-based tools as well.

21.3 Scaling Analysis¶

The agent's performance scales favorably with dataset size:

Cells     | CPU Time      | GPU Time    | Speedup
----------|---------------|-------------|--------
10,000    | 3 minutes     | 8 seconds   | 22x
50,000    | 15 minutes    | 18 seconds  | 50x
100,000   | 30 minutes    | 33 seconds  | 54x
250,000   | 75 minutes    | 55 seconds  | 82x
500,000   | 125 minutes   | 101 seconds | 74x
1,000,000 | >4 hours      | 210 seconds | >68x

GPU speedup increases with dataset size due to better GPU utilization at scale, making the advantage most pronounced for the largest clinical datasets.

22. Competitive Analysis¶

22.1 Competitive Landscape¶

The Single-Cell Intelligence Agent occupies a unique position in the single-cell analysis ecosystem:

Feature	Scanpy	Seurat	CellRanger	Seven Bridges	Terra/Broad	SCIA
Language	Python	R	C++/Python	Cloud	Cloud	Python
GPU Acceleration	No	No	Partial	No	Optional	Full (RAPIDS)
On-Premise	Yes	Yes	Yes	No	No	Yes (DGX)
Cloud Option	Manual	Manual	No	Yes	Yes	Planned
Spatial Support	Squidpy addon	STUtility	SpaceRanger	Limited	Limited	Integrated
Knowledge Base	No	No	No	No	No	12 Milvus collections
LLM Integration	No	No	No	No	No	Claude RAG
Clinical Reports	No	No	No	Limited	Limited	Automated
Multi-Modal	Limited	WNN (strong)	Multiome	Limited	Limited	Full
Foundation Models	No	No	No	No	No	Roadmap
Cost	Free	Free	10x license	$$/sample	$$/compute	DGX infra
Scalability	100K cells	100K cells	10K cells	Elastic	Elastic	1M+ cells

22.2 Differentiation¶

vs. Scanpy/Seurat (Open-Source CPU Tools): The SCIA provides 50-100x speedup over Scanpy and Seurat for identical analyses, transforming batch-mode computation into interactive exploration. The integrated knowledge base and LLM interpretation layer add clinical intelligence that open-source tools lack. However, Scanpy and Seurat benefit from larger user communities, more extensive documentation, and broader method availability. The SCIA mitigates this by maintaining Scanpy API compatibility, allowing users to leverage the existing ecosystem while gaining GPU acceleration.

vs. CellRanger (10x Genomics Proprietary): CellRanger excels at FASTQ-to-count-matrix processing for 10x Chromium data but offers limited downstream analysis. The SCIA complements CellRanger by accepting its output as input and providing GPU-accelerated downstream analysis, knowledge-augmented interpretation, and clinical reporting. CellRanger's Loupe Browser provides elegant visualization but lacks the analytical depth of the SCIA's 10 workflows.

vs. Seven Bridges / Terra (Cloud Platforms): Cloud platforms offer elastic scalability but incur per-analysis costs, data transfer latency, and potential data sovereignty concerns for clinical genomics. The SCIA's on-premise DGX deployment eliminates data transfer, provides consistent performance without cloud cost variability, and maintains full data control--critical for HIPAA/GDPR compliance in clinical settings.

22.3 Total Cost of Ownership¶

For a clinical genomics lab processing 50 samples per month:

Cost Component	Cloud (Terra/AWS)	On-Premise (SCIA/DGX)
Compute (per sample)	$15-50	~$2 (amortized)
Storage (per month)	$500-2,000	~$100 (NVMe amortized)
Data transfer	$200-1,000	$0
Software licenses	$0-5,000	$0 (open source)
Infrastructure (amortized/mo)	$0	~$3,000
Monthly Total	$2,500-10,000	~$3,200
Break-even	-	~6 months

The DGX Spark investment pays for itself within 6 months for high-throughput clinical operations, with the additional benefits of data locality, consistent performance, and no per-analysis variable costs.

23. Clinical Deployment Considerations¶

23.1 Regulatory Framework¶

Clinical deployment of single-cell analysis requires compliance with regulatory and quality standards:

FDA Considerations: - Single-cell analysis tools used for diagnostic purposes fall under FDA medical device regulations (21 CFR Part 820) - The SCIA is positioned as a Clinical Decision Support (CDS) tool, providing information to clinicians rather than autonomous diagnostic decisions - Validation documentation includes analytical accuracy benchmarks, reproducibility studies, and comparison with established methods

CLIA/CAP Compliance: - Laboratory procedures using the SCIA require validation under CLIA (Clinical Laboratory Improvement Amendments) - Standard operating procedures (SOPs) for sample processing, data analysis, and result interpretation - Proficiency testing and quality control programs

Data Security: - HIPAA compliance for patient data handling (encryption at rest and in transit) - Role-based access control via HCLS common security module - Audit logging of all data access and analysis operations - On-premise deployment eliminates cloud data sovereignty concerns

23.2 Quality Control Pipeline¶

The SCIA implements a comprehensive QC pipeline for clinical-grade analysis:

Sample-level QC: Minimum cell count (>1,000), viability (>80%), doublet rate (<10%)
Cell-level QC: Gene count range (200-5,000), UMI count range, mitochondrial percentage (<20%)
Batch QC: Batch effect quantification (kBET, LISI scores) before and after integration
Analysis QC: Cluster stability (bootstrapped clustering), annotation confidence thresholds
Report QC: Automated checks for completeness, internal consistency, and flagging of low-confidence results

23.3 Clinical Workflow Integration¶

The SCIA integrates into clinical workflows through several touchpoints:

Biopsy Collection --> Tissue Dissociation --> scRNA-seq Library Prep
        |                                           |
        v                                           v
  Pathology Review                          Sequencing (NovaSeq/NextSeq)
        |                                           |
        v                                           v
  H&E + IHC                               CellRanger (FASTQ -> Matrix)
  (traditional)                                     |
        |                                           v
        +------ Correlation --------->   SCIA Analysis (GPU-accelerated)
                                                    |
                                                    v
                                         Clinical Report Generation
                                                    |
                                                    v
                                         Tumor Board Presentation
                                                    |
                                                    v
                                         Treatment Decision

Turnaround Time Target: From sequencing completion to clinical report in <4 hours: - CellRanger processing: 2-3 hours - SCIA analysis: <5 minutes (GPU-accelerated) - LLM report generation: <5 minutes - QC review: 30-60 minutes (human review)

24. Future Directions¶

24.1 Technology Roadmap¶

Q2 2026: Spatial Transcriptomics Enhancement - Full Xenium support with subcellular transcript analysis - 3D spatial reconstruction from serial tissue sections - Spatial-temporal modeling for treatment response prediction

Q3 2026: Foundation Model Integration (Phase 2) - scGPT embeddings for improved Milvus retrieval - Geneformer-based perturbation prediction in drug response workflow - Multi-task foundation model deployment on DGX Spark

Q4 2026: Multi-Omics Fusion - Integrated ATAC-seq + RNA-seq (Multiome) analysis - Proteogenomic integration (CITE-seq + genomics) - Spatial multi-omics (Xenium + Visium + CODEX on same tissue)

2027: Clinical Trial Support - Prospective validation in 3-5 clinical trial sites - Automated MRD monitoring from longitudinal biopsies - Real-time biomarker tracking dashboards for trial coordinators - FDA 510(k) submission preparation for CDS classification

24.2 Algorithmic Innovations¶

Streaming single-cell analysis: Process cells in real-time as they are sequenced, enabling progressive analysis without waiting for complete datasets.

Federated single-cell analysis: Enable multi-institutional analysis without data sharing, preserving patient privacy while enabling atlas-scale integration.

Causal inference: Move beyond correlation to causal modeling of gene regulatory networks, drug mechanisms, and resistance pathways using single-cell perturbation data.

Digital twins: Patient-specific tumor digital twins incorporating single-cell composition, spatial organization, and clonal architecture to simulate treatment responses in silico before clinical administration.

24.3 Ecosystem Expansion¶

Wetlab integration: Direct connection to 10x Chromium Controller and sequencers for automated pipeline triggering
EHR integration: HL7 FHIR-compliant interfaces for embedding single-cell results in electronic health records
Collaboration platform: Shared analysis workspaces for multi-disciplinary tumor boards with role-based access
Training and education: Interactive tutorials and case studies for clinical adoption

25. References¶

Becht, E., McInnes, L., Healy, J., et al. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1), 38-44.
Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411-420.
Cui, H., Wang, C., Maan, H., et al. (2024). scGPT: Toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, 21, 1470-1480.
Dominguez Conde, C., Xu, C., Jarvis, L.B., et al. (2022). Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science, 376(6594), eabl5197.
Dries, R., Zhu, Q., Dong, R., et al. (2021). Giotto: A toolbox for integrative analysis and visualization of spatial expression data. Genome Biology, 22(1), 78.
Fan, J., Slowikowski, K., & Zhang, F. (2020). Single-cell transcriptomics in cancer: Computational challenges and opportunities. Experimental & Molecular Medicine, 52, 1452-1465.
Fleming, S.J., Chaffin, M.D., Arduini, A., et al. (2023). Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nature Methods, 20, 1323-1335.
Franzen, O., Gan, L.M., & Bjorkegren, J.L.M. (2019). PanglaoDB: A web server for exploration of mouse and human single-cell RNA sequencing data. Database, 2019, baz046.
Gayoso, A., Lopez, R., Xing, G., et al. (2022). A Python library for probabilistic analysis of single-cell omics data. Nature Biotechnology, 40(2), 163-166.
Hagemann-Jensen, M., Ziegenhain, C., Chen, P., et al. (2020). Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nature Biotechnology, 38(6), 708-714.
Hao, Y., Hao, S., Andersen-Nissen, E., et al. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.
Korsunsky, I., Millard, N., Fan, J., et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12), 1289-1296.
Kim, C., Gao, R., Sei, E., et al. (2018). Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell, 173(4), 879-893.
La Manno, G., Soldatov, R., Zeisel, A., et al. (2018). RNA velocity of single cells. Nature, 560(7719), 494-498.
Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053-1058.
Macosko, E.Z., Basu, A., Satija, R., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5), 1202-1214.
Marquart, J., Chen, E.Y., & Prasad, V. (2018). Estimation of the percentage of US patients with cancer who benefit from genome-driven oncology. JAMA Oncology, 4(8), 1093-1098.
McGranahan, N., & Swanton, C. (2017). Clonal heterogeneity and tumor evolution: Past, present, and the future. Cell, 168(4), 613-628.
Newman, A.M., Liu, C.L., Green, M.R., et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453-457.
Palla, G., Spitzer, H., Klein, M., et al. (2022). Squidpy: A scalable framework for spatial omics analysis. Nature Methods, 19(2), 171-178.
Patel, A.P., Tirosh, I., Trombetta, J.J., et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science, 344(6190), 1396-1401.
Picelli, S., Faridani, O.R., Bjorklund, A.K., et al. (2014). Full-length RNA-seq from single cells using Smart-seq2. Nature Protocols, 9(1), 171-181.
Sade-Feldman, M., Yizhak, K., Bjorgaard, S.L., et al. (2018). Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell, 175(4), 998-1013.
Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., & Regier, A. (2015). Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33(5), 495-502.
Stoeckius, M., Hafemeister, C., Stephenson, W., et al. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9), 865-868.
Stuart, T., Butler, A., Hoffman, P., et al. (2019). Comprehensive integration of single-cell data. Cell, 177(7), 1888-1902.
Sturm, G., Finotello, F., Petitprez, F., et al. (2019). Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology. Bioinformatics, 35(14), i436-i445.
Svensson, V., da Veiga Beltrame, E., & Pachter, L. (2020). A curated database reveals trends in single-cell transcriptomics. Database, 2020, baaa073.
Tabula Sapiens Consortium. (2022). The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science, 376(6594), eabl4896.
Tang, F., Barbacioru, C., Wang, Y., et al. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods, 6(5), 377-382.
Theodoris, C.V., Xiao, L., Chopra, A., et al. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624.
Tirosh, I., Izar, B., Prakadan, S.M., et al. (2016). Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 352(6282), 189-196.
Traag, V.A., Waltman, L., & van Eck, N.J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9(1), 5233.
Wolf, F.A., Angerer, P., & Theis, F.J. (2018). SCANPY: Large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 15.
Zhang, X., Lan, Y., Xu, J., et al. (2019). CellMarker: A manually curated resource for cell markers. Nucleic Acids Research, 47(D1), D721-D728.
Zheng, G.X., Terry, J.M., Belgrader, P., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8, 14049.

Appendix A: API Reference Summary¶

Endpoints (Port 8540)¶

Method	Endpoint	Description
POST	`/api/v1/annotate`	Cell type annotation workflow
POST	`/api/v1/tme-profile`	Tumor microenvironment profiling
POST	`/api/v1/drug-response`	Drug response prediction
POST	`/api/v1/subclonal`	Subclonal architecture analysis
POST	`/api/v1/spatial-niche`	Spatial niche identification
POST	`/api/v1/trajectory`	Trajectory/pseudotime analysis
POST	`/api/v1/ligand-receptor`	Ligand-receptor interaction mapping
POST	`/api/v1/biomarker`	Biomarker discovery
POST	`/api/v1/cart-validate`	CAR-T target validation
POST	`/api/v1/treatment-monitor`	Treatment response monitoring
GET	`/api/v1/health`	Health check
GET	`/api/v1/collections`	List Milvus collections and stats
GET	`/api/v1/gpu-status`	GPU memory and utilization

Dashboard (Port 8130)¶

The Streamlit dashboard provides interactive access to all workflows with: - File upload for H5AD, CSV, and spatial data formats - Real-time UMAP/t-SNE visualization with cluster coloring - Spatial maps with cell type overlay - Heatmaps, violin plots, and dot plots for marker gene visualization - Interactive TME classification and drug response reports - Export to PDF, HTML, and AnnData formats

Appendix B: Milvus Collection Statistics¶

Collection	Vectors	Dimension	Index Type	Avg Query Time
sc_cell_types	2,500	384	IVF_FLAT	2ms
sc_markers	45,000	384	IVF_FLAT	5ms
sc_spatial	18,000	384	IVF_FLAT	4ms
sc_tme	12,000	384	IVF_FLAT	3ms
sc_drug_response	35,000	384	IVF_FLAT	5ms
sc_literature	85,000	384	IVF_SQ8	8ms
sc_methods	3,500	384	IVF_FLAT	2ms
sc_datasets	8,000	384	IVF_FLAT	3ms
sc_trajectories	6,500	384	IVF_FLAT	3ms
sc_pathways	15,000	384	IVF_FLAT	4ms
sc_clinical	22,000	384	IVF_FLAT	4ms
genomic_evidence	3,560,000	384	IVF_SQ8	12ms
Total	~3,813,500

Appendix C: Supported File Formats¶

Format	Extension	Description	Max Size
AnnData	.h5ad	Scanpy native format	50 GB
10x HDF5	.h5	CellRanger filtered output	10 GB
10x MEX	.mtx + .tsv	CellRanger sparse matrix	10 GB
CSV/TSV	.csv/.tsv	Dense count matrix	5 GB
Loom	.loom	Velocyto/scVelo format	20 GB
Seurat RDS	.rds	Seurat object (via rpy2)	20 GB
Visium	SpaceRanger output	Spatial + expression	10 GB
MERFISH	Vizgen output	Spatial transcripts	50 GB
Xenium	Xenium output	In situ transcripts	50 GB
VCF	.vcf/.vcf.gz	Variant calls (integration)	5 GB

This document is part of the HCLS AI Factory documentation suite. For platform-wide architecture and deployment instructions, see the main HCLS AI Factory documentation. For agent-specific implementation details, see the Single-Cell Intelligence Agent source code and API documentation.

Last updated: March 2026