Skip to content

Single-Cell Intelligence Agent: GPU-Accelerated Single-Cell Multi-Omics Analysis for Precision Oncology

A Component of the HCLS AI Factory Precision Medicine Platform


Authors: HCLS AI Factory Development Team

Date: March 2026

Version: 1.0

Platform: NVIDIA DGX Spark | CUDA 12.x | RAPIDS AI


Abstract

Bulk sequencing has been the workhorse of genomic medicine for over two decades, yet its inherent averaging across millions of cells obscures the tumor heterogeneity that drives 30-40% of targeted therapy failures. Resistant subclones--invisible to bulk assays--persist through treatment, expand under selective pressure, and ultimately cause relapse. Single-cell RNA sequencing (scRNA-seq) resolves individual transcriptomes across tens of thousands to hundreds of thousands of cells per sample, revealing the cellular diversity that bulk methods miss. However, the computational demands are formidable: a typical experiment generates expression matrices spanning 10,000-500,000 cells across 20,000+ genes, producing billions of data points that overwhelm traditional CPU-based analysis pipelines.

We present the Single-Cell Intelligence Agent, a GPU-accelerated analysis platform integrated into the HCLS AI Factory precision medicine pipeline. Built on NVIDIA RAPIDS and deployed on DGX Spark infrastructure, the agent achieves 50-100x speedups over CPU-based tools for core operations including dimensionality reduction (UMAP in <10 seconds for 100K cells), graph-based clustering (<30 seconds), and differential expression analysis. The system maintains 12 specialized Milvus vector collections encompassing cell type references, spatial transcriptomics data, tumor microenvironment signatures, drug response profiles, and clinical evidence, enabling rapid semantic retrieval across the single-cell knowledge landscape. Ten purpose-built analytical workflows--from cell type annotation and tumor microenvironment profiling to CAR-T target validation and treatment response monitoring--bridge the gap between raw single-cell data and actionable clinical insight. The agent integrates with spatial transcriptomics platforms (Visium, MERFISH, Xenium, CODEX), supports multi-modal data fusion (CITE-seq, Multiome), and is architected for future integration with single-cell foundation models (scGPT, Geneformer). Benchmarked against leading tools including Scanpy, Seurat, and CellRanger, the Single-Cell Intelligence Agent delivers clinical-grade analysis at interactive speeds, positioning single-cell genomics as a practical component of precision oncology workflows.

Keywords: single-cell RNA sequencing, GPU acceleration, RAPIDS, tumor heterogeneity, precision oncology, spatial transcriptomics, tumor microenvironment, vector database, DGX Spark


Table of Contents

  1. Introduction and Clinical Motivation
  2. The Single-Cell Revolution in Oncology
  3. Computational Challenges at Single-Cell Scale
  4. Market Landscape and Technology Ecosystem
  5. System Architecture Overview
  6. GPU-Accelerated Computational Engine
  7. Knowledge Architecture: Milvus Vector Collections
  8. Analytical Workflows
  9. Cell Type Annotation Pipeline
  10. Tumor Microenvironment Profiling
  11. Drug Response Prediction
  12. Subclonal Architecture Analysis
  13. Spatial Transcriptomics Integration
  14. Trajectory and Pseudotime Analysis
  15. Ligand-Receptor Interaction Mapping
  16. Biomarker Discovery Pipeline
  17. CAR-T Target Validation
  18. Treatment Response Monitoring
  19. Data Sources and Knowledge Graphs
  20. Foundation Model Integration Roadmap
  21. Benchmarking and Performance
  22. Competitive Analysis
  23. Clinical Deployment Considerations
  24. Future Directions
  25. References

1. Introduction and Clinical Motivation

1.1 The Heterogeneity Problem

Precision oncology has achieved remarkable successes--imatinib for BCR-ABL+ chronic myeloid leukemia, trastuzumab for HER2+ breast cancer, pembrolizumab for PD-L1+ tumors--yet the overall response rate to targeted therapies remains stubbornly below 30% for most solid tumors (Marquart et al., 2018). The fundamental challenge is tumor heterogeneity: a single biopsy, analyzed by bulk sequencing, reports the average signal across millions of cells, masking the cellular diversity that determines treatment outcome.

An estimated 30-40% of targeted therapy failures can be attributed to pre-existing resistant subclones that are invisible to bulk assays (McGranahan & Swanton, 2017). These subclones may comprise as little as 0.1-1% of the tumor at diagnosis, falling below the detection limit of standard bulk whole-exome sequencing (~5% variant allele frequency). Under the selective pressure of targeted therapy, resistant clones expand exponentially, driving relapse within months.

1.2 Single-Cell Resolution as Clinical Imperative

Single-cell RNA sequencing (scRNA-seq) resolves individual transcriptomes, enabling:

  • Subclone detection: Identification of transcriptionally distinct subpopulations, including rare resistant clones below bulk detection limits
  • Tumor microenvironment (TME) characterization: Quantification of immune cell infiltration, stromal composition, and intercellular signaling networks
  • Drug response prediction: Cell-type-specific expression of drug targets, resistance mechanisms, and metabolic dependencies
  • Treatment monitoring: Longitudinal tracking of clonal dynamics and immune response evolution

However, the transition from research tool to clinical utility demands computational infrastructure that can process single-cell datasets at interactive speeds--a capability that CPU-based pipelines fundamentally cannot deliver at the scale of modern experiments.

1.3 The HCLS AI Factory Context

The Single-Cell Intelligence Agent operates as the sixth intelligence agent within the HCLS AI Factory precision medicine platform, complementing existing agents for biomarker analysis, oncology interpretation, CAR-T therapy design, imaging analysis, and autoimmune profiling. It extends the platform's three-stage pipeline--genomics (Parabricks/DeepVariant), RAG/Chat (Milvus + Claude), and drug discovery (BioNeMo/DiffDock/RDKit)--with single-cell resolution, enabling a complete workflow from bulk variant calling through single-cell validation to therapeutic candidate generation.

The agent is deployed on NVIDIA DGX Spark with CUDA 12.x, exposing its API on port 8540 and its interactive dashboard on port 8130. It leverages the shared HCLS common library for configuration management, Milvus connectivity, LLM integration, and security, maintaining architectural consistency across the platform.


2. The Single-Cell Revolution in Oncology

2.1 Technology Evolution

The single-cell sequencing landscape has evolved rapidly since the first scRNA-seq experiment by Tang et al. (2009), which profiled a single mouse blastomere. Key technological milestones include:

Year Technology Throughput Key Innovation
2009 mRNA-Seq (Tang) 1 cell Proof of concept
2012 Smart-seq 10s of cells Full-length transcript coverage
2014 Drop-seq 10,000 cells Droplet microfluidics
2015 10x Genomics Chromium 10,000+ cells Commercial droplet platform
2017 10x Chromium V2 100,000+ cells Industrial-scale profiling
2017 CITE-seq 10,000+ cells Simultaneous RNA + protein
2019 10x Visium ~5,000 spots Spatial transcriptomics
2020 10x Multiome 10,000+ cells ATAC + RNA co-assay
2021 MERFISH 100,000+ molecules Imaging-based spatial
2023 10x Xenium Subcellular resolution In situ spatial
2024 CODEX/PhenoCycler 100+ proteins High-plex spatial proteomics

2.2 10x Genomics Market Dominance

10x Genomics has established dominant market position in single-cell sequencing, with its Chromium platform accounting for an estimated 70-80% of scRNA-seq experiments published since 2018 (Svensson et al., 2020). The platform's success derives from:

  • Gel Bead-in-Emulsion (GEM) technology: Encapsulates individual cells with barcoded gel beads in nanoliter droplets, enabling efficient cell capture with low doublet rates (~0.4% per 1,000 cells)
  • CellRanger software: End-to-end pipeline from FASTQ to count matrix, with cell calling, alignment (STAR), and UMI deduplication
  • Ecosystem integration: Comprehensive libraries (3' gene expression, 5' immune profiling, ATAC-seq, Multiome, CITE-seq compatible)
  • Spatial platform: Visium (spot-based) and Xenium (subcellular in situ) spatial transcriptomics

2.3 Complementary Technologies

While 10x dominates throughput applications, several technologies serve specialized niches:

Smart-seq2/3 (Picelli et al., 2014; Hagemann-Jensen et al., 2020): Plate-based, full-length transcript coverage. Lower throughput (hundreds of cells) but superior sensitivity and isoform detection. Preferred for rare cell populations, small organisms, and applications requiring splice variant analysis.

Drop-seq (Macosko et al., 2015): Open-source droplet microfluidics. Lower per-cell cost than 10x but requires custom instrumentation. Popular in cost-sensitive large-scale atlas projects.

CITE-seq (Stoeckius et al., 2017): Cellular Indexing of Transcriptomes and Epitopes by sequencing. Simultaneously measures RNA and surface protein expression using oligonucleotide-conjugated antibodies. Critical for immune cell phenotyping where transcriptome alone is insufficient (e.g., distinguishing CD4+ T-cell subsets).

Spatial Transcriptomics Platforms: - Visium (10x Genomics): Captures mRNA from tissue sections on slides with ~55 um diameter spots, each covering 1-10 cells. Provides spatial context but lacks true single-cell resolution - MERFISH (Vizgen): Multiplexed error-robust fluorescence in situ hybridization. Images hundreds of genes at subcellular resolution across intact tissue - Xenium (10x Genomics): In situ sequencing platform providing subcellular transcript localization for 100-1000+ gene panels - CODEX/PhenoCycler (Akoya Biosciences): Multiplexed immunofluorescence imaging of 100+ protein markers on single tissue sections

2.4 Clinical Impact in Oncology

Single-cell studies have fundamentally reshaped understanding of tumor biology:

Intra-tumor heterogeneity: Tirosh et al. (2016) demonstrated that melanomas contain transcriptionally distinct subpopulations with differential drug sensitivity, explaining variable response to BRAF inhibitors. Patel et al. (2014) revealed that individual glioblastomas harbor cells spanning multiple molecular subtypes simultaneously.

Immune landscape: Zheng et al. (2017) characterized the T-cell landscape in hepatocellular carcinoma at single-cell resolution, identifying exhaustion trajectories that predict immunotherapy response. Sade-Feldman et al. (2018) linked specific T-cell states to checkpoint blockade outcomes in melanoma.

Therapy resistance: Kim et al. (2018) traced the evolution of drug-tolerant persister cells in lung adenocarcinoma, identifying a pre-existing resistant subpopulation that was invisible to bulk profiling but expanded under EGFR inhibitor treatment.


3. Computational Challenges at Single-Cell Scale

3.1 Data Scale and Dimensionality

A modern single-cell experiment generates data at a scale that challenges conventional bioinformatics infrastructure:

Experiment Parameter Typical Range Computational Impact
Cells per sample 10,000 - 500,000 Memory: 2-50 GB per sample
Genes per cell 20,000 - 30,000 Dimensionality: ultrahigh
Non-zero entries 2,000 - 5,000 per cell Sparsity: 85-95% zeros
Samples per study 10 - 200 Integration complexity
Total data points 10^8 - 10^10 Billions per study
Raw FASTQ size 50 - 500 GB Storage and I/O bottleneck

The resulting cell-by-gene count matrices are extremely sparse (85-95% zeros) and high-dimensional, requiring specialized algorithms for dimensionality reduction, clustering, and differential expression that can exploit sparsity while maintaining biological signal.

3.2 Computational Bottlenecks

Standard CPU-based analysis of 100,000 cells with Scanpy or Seurat encounters several bottlenecks:

  1. k-NN Graph Construction: Building the k-nearest-neighbor graph (typically k=15-30 in a 50-dimensional PCA space) scales as O(n log n) with approximate methods (Annoy, HNSW) but still requires 5-15 minutes for 100K cells on a 32-core CPU
  2. UMAP Embedding: Uniform Manifold Approximation and Projection for 100K cells takes 3-10 minutes on CPU, with iterative optimization being inherently sequential
  3. Leiden/Louvain Clustering: Graph-based community detection on the k-NN graph requires 2-5 minutes for 100K cells
  4. Differential Expression: Wilcoxon rank-sum or t-tests across all gene-cluster pairs (20K genes x 20-50 clusters) takes 5-20 minutes
  5. Batch Integration: Harmony, scVI, or BBKNN across multiple samples adds 10-60 minutes depending on method and sample count
  6. Trajectory Inference: Diffusion pseudotime or RNA velocity calculations add another 5-30 minutes

Total wall-clock time for a standard pipeline on CPU: 30-120 minutes per sample

For clinical applications requiring rapid turnaround (e.g., treatment selection within 48-72 hours of biopsy), this latency is prohibitive--particularly when iterative analysis, parameter tuning, and multi-sample integration multiply the computational burden.

3.3 The GPU Acceleration Opportunity

NVIDIA RAPIDS single-cell (rapids-singlecell) reimplements core Scanpy operations on GPU using cuML, cuGraph, and cuPy, achieving dramatic speedups:

Operation CPU Time (100K cells) GPU Time (100K cells) Speedup
PCA (50 components) 45 seconds 1.2 seconds 37x
k-NN Graph (k=15) 8 minutes 5 seconds 96x
UMAP 6 minutes 8 seconds 45x
Leiden Clustering 3 minutes 4 seconds 45x
Differential Expression 12 minutes 15 seconds 48x
Full Pipeline 45 minutes 45 seconds 60x

These benchmarks, measured on DGX Spark (NVIDIA Grace Hopper, 128 GB GPU memory), demonstrate that GPU acceleration transforms single-cell analysis from a batch process to an interactive exploration. The 50-100x aggregate speedup enables:

  • Real-time parameter tuning and re-clustering
  • Interactive exploration of cell type annotations
  • Rapid iteration on spatial analysis parameters
  • Clinical-grade turnaround times (<5 minutes per sample)

4. Market Landscape and Technology Ecosystem

4.1 Market Size and Growth

The single-cell analysis market has experienced exceptional growth, driven by decreasing per-cell sequencing costs, expanding clinical applications, and the emergence of spatial transcriptomics:

Metric 2024 2027 (Projected) 2030 (Projected)
Global Market Size $8.7 billion $16.2 billion $32.0 billion
CAGR - 23.1% 24.0%
Instruments Segment $2.8 billion $4.5 billion $7.2 billion
Reagents & Consumables $3.9 billion $7.8 billion $15.5 billion
Software & Services $2.0 billion $3.9 billion $9.3 billion

Source: Grand View Research, Markets and Markets, Allied Market Research (aggregated estimates, 2024)

4.2 Key Market Drivers

  1. Decreasing costs: Per-cell sequencing costs have dropped from ~$50,000 (2009) to ~$0.01 (2024), following a trajectory steeper than Moore's Law
  2. Clinical adoption: FDA approval of companion diagnostics incorporating single-cell data; growing use in minimal residual disease (MRD) monitoring
  3. Spatial transcriptomics: The addition of spatial context to single-cell data has opened new applications in pathology, neuroscience, and developmental biology
  4. Cell therapy: CAR-T and other cell therapies require single-cell characterization for manufacturing QC, potency assays, and in vivo tracking
  5. Pharma R&D: Drug companies use single-cell data for target identification, mechanism of action studies, and patient stratification in clinical trials

4.3 Software Ecosystem

The computational ecosystem for single-cell analysis is fragmented across languages, frameworks, and levels of abstraction:

Core Analysis Frameworks: - Scanpy (Wolf et al., 2018): Python-based, AnnData format, ~15,000 GitHub stars. The de facto standard for Python users - Seurat (Satija et al., 2015; Stuart et al., 2019; Hao et al., 2021): R-based, ~4,700 GitHub stars. Dominant in the R ecosystem with superior integration methods (CCA, RPCA, WNN) - CellRanger (10x Genomics): Proprietary pipeline for 10x Chromium data. FASTQ to count matrix with cell calling

GPU-Accelerated Tools: - RAPIDS single-cell (NVIDIA): GPU-accelerated reimplementation of Scanpy workflows using cuML/cuGraph - CellBender (Fleming et al., 2023): GPU-accelerated ambient RNA removal - scVI-tools (Gayoso et al., 2022): Deep learning framework for single-cell analysis with GPU support

Spatial Analysis: - Squidpy (Palla et al., 2022): Spatial single-cell analysis in Python - Giotto (Dries et al., 2021): Comprehensive spatial transcriptomics toolkit - STUtility: Spatial transcriptomics utilities for R/Seurat

Integration and Atlas Tools: - Harmony (Korsunsky et al., 2019): Fast batch correction - scVI/scANVI (Lopez et al., 2018): Variational autoencoder for integration - CellTypist (Dominguez Conde et al., 2022): Automated cell type annotation


5. System Architecture Overview

5.1 High-Level Architecture

The Single-Cell Intelligence Agent follows a modular architecture designed for GPU-accelerated analysis with semantic knowledge retrieval:

+------------------------------------------------------------------+
|                    Single-Cell Intelligence Agent                  |
|                     Ports: 8540 (API) / 8130 (UI)                 |
+------------------------------------------------------------------+
|                                                                    |
|  +---------------+  +---------------+  +------------------------+ |
|  |   FastAPI      |  |  Streamlit    |  |   Claude LLM          | |
|  |   REST API     |  |  Dashboard    |  |   Integration         | |
|  |   :8540        |  |  :8130        |  |   (Anthropic API)     | |
|  +-------+-------+  +-------+-------+  +----------+------------+ |
|          |                   |                      |              |
|  +-------+-------------------+----------------------+------------+ |
|  |                  Workflow Orchestrator                          | |
|  |         10 Analytical Workflows (Nextflow DSL2)                | |
|  +-------------------------------+--------------------------------+ |
|                                  |                                 |
|  +-------------------------------+-------------------------------+ |
|  |              Computational Engine                              | |
|  |  +--------------+ +-----------+ +---------------------------+ | |
|  |  | RAPIDS       | | Scanpy    | | Spatial Analysis          | | |
|  |  | cuML/cuGraph | | AnnData   | | Squidpy/Giotto            | | |
|  |  | GPU Accel.   | | Fallback  | | MERFISH/Visium/Xenium     | | |
|  |  +--------------+ +-----------+ +---------------------------+ | |
|  +---------------------------------------------------------------+ |
|                                  |                                 |
|  +-------------------------------+-------------------------------+ |
|  |              Knowledge Layer                                   | |
|  |  +----------------------------------------------------------+ | |
|  |  |         Milvus Vector Database (12 Collections)           | | |
|  |  |  sc_cell_types | sc_markers | sc_spatial | sc_tme         | | |
|  |  |  sc_drug_response | sc_literature | sc_methods            | | |
|  |  |  sc_datasets | sc_trajectories | sc_pathways              | | |
|  |  |  sc_clinical | genomic_evidence                           | | |
|  |  +----------------------------------------------------------+ | |
|  +---------------------------------------------------------------+ |
|                                                                    |
|  +---------------------------------------------------------------+ |
|  |              HCLS Common Library Integration                   | |
|  |  Config | Milvus Client | LLM | Security | Monitoring         | |
|  +---------------------------------------------------------------+ |
+--------------------------------------------------------------------+

5.2 Component Details

FastAPI REST API (Port 8540): Exposes all analytical workflows as RESTful endpoints with OpenAPI documentation. Supports synchronous execution for lightweight queries and asynchronous task submission for compute-intensive workflows. Authentication via the HCLS common security module with API key and JWT token support.

Streamlit Dashboard (Port 8130): Interactive visualization interface for exploratory analysis. Renders UMAP plots, spatial maps, heatmaps, violin plots, and dot plots with real-time updates. Supports file upload for count matrices and spatial data, with drag-and-drop annotation.

Claude LLM Integration: Leverages the Anthropic Claude API for natural language interpretation of analysis results, literature-grounded cell type annotation rationale, and clinical report generation. Queries are augmented with context retrieved from Milvus collections via RAG (Retrieval-Augmented Generation).

Workflow Orchestrator: Coordinates multi-step analytical workflows using Nextflow DSL2, managing dependencies between preprocessing, analysis, and interpretation steps. Integrates with the HCLS orchestrator for cross-pipeline workflows (e.g., bulk variant to single-cell validation to drug candidate).

5.3 Integration with HCLS AI Factory

The Single-Cell Intelligence Agent connects to the broader platform through several integration points:

  1. Genomics Pipeline: Receives variant calls (VCF) from Parabricks/DeepVariant for correlation with single-cell expression data
  2. RAG/Chat Pipeline: Shares the Milvus vector database infrastructure; the genomic_evidence collection bridges bulk and single-cell analyses
  3. Drug Discovery Pipeline: Passes druggable target candidates from single-cell differential expression to BioNeMo/DiffDock for structure-based drug design
  4. Other Intelligence Agents: Exchanges data with the Biomarker Agent (sc-derived biomarkers), CAR-T Agent (target validation), and Oncology Agent (TME classification)

6. GPU-Accelerated Computational Engine

6.1 RAPIDS Single-Cell Architecture

The computational engine is built on NVIDIA RAPIDS, an open-source suite of GPU-accelerated data science libraries. For single-cell analysis, the key components are:

cuML (Machine Learning): - PCA: Truncated SVD on GPU for dimensionality reduction from 20K genes to 50 principal components - k-NN: Brute-force or IVF-Flat approximate nearest neighbor search on GPU - UMAP: GPU-native implementation with identical API to umap-learn - t-SNE: Barnes-Hut GPU implementation for large-scale embedding

cuGraph (Graph Analytics): - Leiden/Louvain community detection: GPU-accelerated graph clustering on the k-NN graph - PageRank: For gene regulatory network analysis - Connected components: For QC and doublet detection

cuPy (Array Computing): - Sparse matrix operations on GPU (CSR, CSC formats) - Element-wise operations for normalization, log-transform, scaling - GPU-accelerated statistical tests for differential expression

cuSpatial (Spatial Analysis): - Spatial indexing and nearest-neighbor queries for spatial transcriptomics - Point-in-polygon tests for region-of-interest analysis - Spatial autocorrelation (Moran's I, Geary's C) on GPU

6.2 Memory Management

Single-cell data at scale demands careful GPU memory management. The agent implements a tiered memory strategy:

Tier 1: GPU Memory (128 GB on DGX Spark)
  - Active computation: count matrices, embeddings, graphs
  - Capacity: ~500K cells x 20K genes in sparse format
  - Overflow: automatic spill to host memory

Tier 2: Host Memory (up to 480 GB on DGX Spark)
  - Cached datasets, reference atlases
  - Preprocessing buffers
  - Batch processing queues

Tier 3: NVMe Storage (up to 16 TB on DGX Spark)
  - Raw data (FASTQ, BAM, count matrices)
  - Intermediate results
  - AnnData/H5AD file cache

For datasets exceeding GPU memory (>500K cells), the engine automatically partitions the analysis into batches, computing k-NN graphs and embeddings in chunks with GPU-accelerated merging. This enables processing of million-cell atlases without manual intervention.

6.3 Scanpy Compatibility Layer

The agent maintains full API compatibility with Scanpy (Wolf et al., 2018), the dominant Python framework for single-cell analysis. Users can submit standard Scanpy workflows that are transparently accelerated on GPU:

# Standard Scanpy workflow - automatically GPU-accelerated
import scanpy as sc
import rapids_singlecell as rsc

adata = sc.read_h5ad("patient_tumor.h5ad")

# Preprocessing (GPU-accelerated)
rsc.pp.normalize_total(adata)
rsc.pp.log1p(adata)
rsc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction (GPU: 1.2s vs CPU: 45s)
rsc.pp.pca(adata, n_comps=50)

# Neighbor graph (GPU: 5s vs CPU: 8min)
rsc.pp.neighbors(adata, n_neighbors=15)

# Clustering (GPU: 4s vs CPU: 3min)
rsc.tl.leiden(adata, resolution=1.0)

# Embedding (GPU: 8s vs CPU: 6min)
rsc.tl.umap(adata)

# Differential expression (GPU: 15s vs CPU: 12min)
rsc.tl.rank_genes_groups(adata, groupby='leiden')

This compatibility ensures that existing Scanpy scripts, notebooks, and workflows can be accelerated without code modification, lowering the adoption barrier for computational biologists already invested in the Scanpy ecosystem.


7. Knowledge Architecture: Milvus Vector Collections

7.1 Collection Design

The Single-Cell Intelligence Agent maintains 12 specialized Milvus vector collections, each optimized for a specific domain of single-cell knowledge. All collections use BGE-small-en-v1.5 embeddings (384 dimensions) consistent with the HCLS AI Factory embedding standard.

Collection Records Description Update Frequency
sc_cell_types ~2,500 Cell type definitions, markers, ontology mappings Monthly
sc_markers ~45,000 Gene markers for cell types across tissues/diseases Monthly
sc_spatial ~18,000 Spatial transcriptomics signatures and niches Quarterly
sc_tme ~12,000 Tumor microenvironment profiles and classifications Monthly
sc_drug_response ~35,000 Cell-type-specific drug sensitivity/resistance data Monthly
sc_literature ~85,000 Single-cell publication abstracts and key findings Weekly
sc_methods ~3,500 Computational methods, parameters, best practices Quarterly
sc_datasets ~8,000 Public dataset metadata (GEO, CellxGene, HuBMAP) Weekly
sc_trajectories ~6,500 Differentiation trajectories and pseudotime references Quarterly
sc_pathways ~15,000 Pathway activity signatures at single-cell resolution Monthly
sc_clinical ~22,000 Clinical outcome correlations with sc signatures Monthly
genomic_evidence ~3,560,000 Shared with HCLS platform - variant annotations Daily

7.2 Collection Schema: sc_cell_types

{
  "collection_name": "sc_cell_types",
  "fields": [
    {"name": "id", "type": "INT64", "is_primary": true, "auto_id": true},
    {"name": "cell_type_name", "type": "VARCHAR", "max_length": 256},
    {"name": "cell_ontology_id", "type": "VARCHAR", "max_length": 32},
    {"name": "tissue", "type": "VARCHAR", "max_length": 128},
    {"name": "species", "type": "VARCHAR", "max_length": 64},
    {"name": "canonical_markers", "type": "VARCHAR", "max_length": 1024},
    {"name": "description", "type": "VARCHAR", "max_length": 4096},
    {"name": "parent_type", "type": "VARCHAR", "max_length": 256},
    {"name": "disease_associations", "type": "VARCHAR", "max_length": 2048},
    {"name": "source", "type": "VARCHAR", "max_length": 256},
    {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 384}
  ],
  "index": {
    "field": "embedding",
    "type": "IVF_FLAT",
    "metric": "COSINE",
    "params": {"nlist": 128}
  }
}

7.3 Collection Schema: sc_tme

The tumor microenvironment collection encodes TME classifications, immune signatures, and stromal profiles:

{
  "collection_name": "sc_tme",
  "fields": [
    {"name": "id", "type": "INT64", "is_primary": true, "auto_id": true},
    {"name": "tme_class", "type": "VARCHAR", "max_length": 64},
    {"name": "subclass", "type": "VARCHAR", "max_length": 128},
    {"name": "immune_score", "type": "FLOAT"},
    {"name": "stromal_score", "type": "FLOAT"},
    {"name": "cell_composition", "type": "VARCHAR", "max_length": 4096},
    {"name": "signature_genes", "type": "VARCHAR", "max_length": 2048},
    {"name": "cancer_type", "type": "VARCHAR", "max_length": 128},
    {"name": "therapy_response", "type": "VARCHAR", "max_length": 1024},
    {"name": "clinical_evidence", "type": "VARCHAR", "max_length": 4096},
    {"name": "source_study", "type": "VARCHAR", "max_length": 512},
    {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 384}
  ]
}

7.4 Retrieval Strategy

The agent employs a multi-collection retrieval strategy for comprehensive knowledge augmentation:

  1. Primary retrieval: Query the most relevant collection for the current workflow (e.g., sc_cell_types for annotation)
  2. Cross-collection enrichment: Retrieve supporting evidence from related collections (e.g., sc_markers + sc_literature to validate cell type assignments)
  3. Clinical grounding: Always include retrieval from sc_clinical and genomic_evidence for clinical interpretation
  4. Reranking: Retrieved documents are reranked by relevance score, recency, and source authority before LLM context injection

This strategy ensures that LLM-generated interpretations are grounded in current evidence across the full breadth of single-cell knowledge, from basic cell biology to clinical outcome data.


8. Analytical Workflows

8.1 Workflow Overview

The Single-Cell Intelligence Agent implements 10 purpose-built analytical workflows, each combining GPU-accelerated computation with knowledge-augmented interpretation:

# Workflow Input Output Collections Used
1 Cell Type Annotation Count matrix (H5AD) Annotated clusters with confidence scores sc_cell_types, sc_markers
2 TME Profiling Tumor scRNA-seq TME classification, immune/stromal scores sc_tme, sc_clinical
3 Drug Response Prediction Annotated scRNA-seq Cell-type-specific drug sensitivity sc_drug_response, sc_pathways
4 Subclonal Architecture Tumor scRNA-seq + VCF Clone tree, mutation-expression links genomic_evidence, sc_trajectories
5 Spatial Niche ID Spatial transcriptomics Spatial niches, cell-cell interactions sc_spatial, sc_tme
6 Trajectory Analysis Time-series scRNA-seq Pseudotime, branch points, driver genes sc_trajectories, sc_pathways
7 Ligand-Receptor Mapping Multi-type scRNA-seq Interaction networks, signaling hubs sc_pathways, sc_tme
8 Biomarker Discovery Case/control scRNA-seq Ranked biomarker candidates sc_markers, sc_clinical
9 CAR-T Target Validation Target gene + scRNA-seq On-target/off-tumor risk assessment sc_cell_types, sc_markers, sc_clinical
10 Treatment Response Monitoring Longitudinal scRNA-seq Clonal dynamics, resistance evolution sc_drug_response, sc_trajectories

8.2 Workflow Execution Model

Each workflow follows a standardized execution model:

Input Validation --> Preprocessing --> GPU Computation --> Knowledge Retrieval
        |                  |                  |                     |
        v                  v                  v                     v
  Schema check       QC, filtering,     RAPIDS pipeline      Milvus semantic
  Format detection   normalization,     (PCA, neighbors,     search across
  Parameter          HVG selection      clustering, DE)      relevant collections
  defaults                                                         |
                                                                   v
                                                          LLM Interpretation
                                                          (Claude + RAG context)
                                                                   |
                                                                   v
                                                          Report Generation
                                                          (Clinical summary,
                                                           visualizations,
                                                           recommendations)

8.3 Shared Preprocessing Pipeline

All workflows share a common preprocessing pipeline that standardizes data before workflow-specific analysis:

  1. Quality Control: Filter cells by minimum genes (default: 200), maximum genes (default: 5000), and mitochondrial gene percentage (default: <20%). Filter genes by minimum cells (default: 3)
  2. Doublet Detection: Scrublet or DoubletFinder to identify and remove multiplets
  3. Normalization: Library size normalization (target sum: 10,000) followed by log1p transformation
  4. Highly Variable Genes: Select top 2,000-5,000 HVGs using Scanpy's highly_variable_genes with flavor='seurat_v3'
  5. Scaling: Z-score standardization (optional, workflow-dependent)
  6. Batch Correction: Harmony or scVI integration when multiple samples are present

9. Cell Type Annotation Pipeline

9.1 Multi-Strategy Annotation

Cell type annotation is the foundational analysis for all downstream workflows. The agent employs a multi-strategy approach that combines automated classifiers, marker-based scoring, and LLM-assisted interpretation:

Strategy 1: Reference-Based Transfer Learning - Map query cells to annotated reference atlases (Human Cell Atlas, Tabula Sapiens) using label transfer - GPU-accelerated k-NN classifier in PCA/scVI latent space - Confidence scoring based on distance to nearest reference cells

Strategy 2: Marker Gene Scoring - Score each cluster against curated marker gene sets from sc_markers collection - Compute AUCell-like enrichment scores for marker gene sets - Rank candidate cell types by composite marker expression score

Strategy 3: LLM-Assisted Annotation - Extract top differentially expressed genes per cluster - Query Claude with DE genes + tissue context + retrieved knowledge from sc_cell_types and sc_literature - Generate annotation with confidence level and supporting evidence

9.2 Consensus Annotation

The three strategies are combined through a consensus mechanism:

def consensus_annotation(reference_label, marker_label, llm_label,
                         reference_conf, marker_score, llm_confidence):
    """
    Consensus annotation across three strategies.
    Agreement of 2/3 strategies required for high-confidence annotation.
    """
    labels = [reference_label, marker_label, llm_label]
    scores = [reference_conf, marker_score, llm_confidence]

    # Check for unanimous agreement
    if len(set(labels)) == 1:
        return labels[0], "HIGH", max(scores)

    # Check for 2/3 agreement
    from collections import Counter
    counts = Counter(labels)
    majority = counts.most_common(1)[0]
    if majority[1] >= 2:
        return majority[0], "MEDIUM", np.mean([s for l, s in zip(labels, scores)
                                                if l == majority[0]])

    # No consensus - flag for manual review
    return labels[np.argmax(scores)], "LOW", max(scores)

9.3 Cell Ontology Integration

All annotations are mapped to the Cell Ontology (CL) for standardized terminology:

  • CL:0000084 - T cell
  • CL:0000236 - B cell
  • CL:0000775 - Neutrophil
  • CL:0000235 - Macrophage
  • CL:0002063 - Type II pneumocyte
  • CL:0000066 - Epithelial cell
  • CL:0000057 - Fibroblast
  • CL:0000115 - Endothelial cell

This standardization enables cross-study comparison, atlas integration, and clinical reporting with unambiguous cell type nomenclature.


10. Tumor Microenvironment Profiling

10.1 TME Classification System

The TME profiling workflow classifies tumor microenvironments into clinically actionable categories based on immune cell composition, spatial organization, and functional state:

Hot (Inflamed) TME: - High CD8+ T-cell infiltration (>15% of all cells) - Active cytotoxic gene signature (GZMA, GZMB, PRF1, IFNG) - PD-L1 expression on tumor and/or immune cells - Clinical implication: Checkpoint immunotherapy responsive (anti-PD-1/PD-L1) - Expected response rate: 40-60% for anti-PD-1 monotherapy

Cold (Desert) TME: - Minimal immune cell infiltration (<5% of all cells) - Low MHC class I expression on tumor cells - Absence of inflammatory chemokines (CXCL9, CXCL10) - Clinical implication: Needs combination therapy to recruit immune cells - Therapeutic strategy: Oncolytic virus + checkpoint inhibitor, or STING agonist + anti-PD-1

Excluded TME: - Immune cells present but confined to tumor periphery/stroma - Physical barriers: dense collagen, aberrant vasculature - TGF-beta signaling enrichment in stromal compartment - Clinical implication: Needs stromal remodeling to enable immune penetration - Therapeutic strategy: Anti-TGF-beta + checkpoint inhibitor, or anti-VEGF + anti-PD-L1

Immunosuppressive TME: - High regulatory T-cell (Treg) proportion (>10% of CD4+ T cells) - Myeloid-derived suppressor cell (MDSC) enrichment - IDO1, ARG1 upregulation - Clinical implication: Needs immunosuppression reversal - Therapeutic strategy: Anti-CTLA-4 + anti-PD-1, or IDO1 inhibitor combinations

10.2 Immune Cell Deconvolution

The workflow quantifies immune cell composition using a GPU-accelerated deconvolution approach:

  1. Cluster identification: Leiden clustering at multiple resolutions (0.5, 1.0, 2.0) to capture both broad and fine-grained cell populations
  2. Immune cell scoring: Score each cluster against immune cell gene signatures from LM22 (Newman et al., 2015) and immunedeconv (Sturm et al., 2019)
  3. Proportion estimation: Calculate cell type proportions as percentage of total cells per sample
  4. Functional state assessment: Within each immune cell type, assess activation, exhaustion, memory, and effector states using curated gene signatures

10.3 TME Score Computation

The TME classification integrates multiple scores into a composite assessment:

def compute_tme_classification(adata, immune_clusters, tumor_clusters):
    """
    Classify TME based on immune composition and spatial organization.
    """
    # Immune proportion
    n_immune = sum(adata.obs['cell_type'].isin(immune_clusters))
    n_total = len(adata)
    immune_fraction = n_immune / n_total

    # Cytotoxic score
    cytotoxic_genes = ['GZMA', 'GZMB', 'PRF1', 'IFNG', 'NKG7']
    cytotoxic_score = adata[adata.obs['cell_type'].isin(immune_clusters),
                            cytotoxic_genes].X.mean()

    # Exhaustion score
    exhaustion_genes = ['PDCD1', 'CTLA4', 'LAG3', 'TIM3', 'TIGIT']
    exhaustion_score = adata[adata.obs['cell_type'].isin(immune_clusters),
                             exhaustion_genes].X.mean()

    # Classify
    if immune_fraction > 0.15 and cytotoxic_score > 0.5:
        return "HOT_INFLAMED", {"checkpoint_responsive": True}
    elif immune_fraction < 0.05:
        return "COLD_DESERT", {"needs_recruitment": True}
    elif immune_fraction > 0.10 and exhaustion_score > 0.7:
        return "IMMUNOSUPPRESSIVE", {"needs_reversal": True}
    else:
        return "EXCLUDED", {"needs_remodeling": True}

11. Drug Response Prediction

11.1 Cell-Type-Specific Drug Sensitivity

The drug response prediction workflow integrates single-cell expression data with pharmacogenomic databases to predict cell-type-specific drug sensitivity:

Data Sources: - DepMap/CCLE (Cancer Cell Line Encyclopedia): Drug sensitivity profiles across 1,500+ cell lines with matched scRNA-seq - GDSC (Genomics of Drug Sensitivity in Cancer): IC50 values for 400+ compounds across 1,000+ cell lines - PRISM: Multiplexed drug screening across 900+ cell lines - CMap (Connectivity Map): Transcriptional signatures of 20,000+ compound perturbations

Prediction Approach: 1. Map patient cell clusters to reference cell line transcriptomes using transfer learning 2. Retrieve drug sensitivity data for matched cell lines from sc_drug_response collection 3. Score differential expression of drug target genes and resistance markers per cluster 4. Generate cell-type-resolved drug sensitivity predictions with confidence intervals

11.2 Resistance Mechanism Identification

For each predicted drug-resistant subpopulation, the workflow identifies potential resistance mechanisms:

  • Target mutation: Expression of mutant vs. wildtype drug target alleles (from linked VCF data)
  • Bypass activation: Upregulation of alternative signaling pathways (e.g., MET amplification bypassing EGFR inhibition)
  • Efflux pumps: ABC transporter expression (ABCB1/MDR1, ABCG2/BCRP)
  • EMT transition: Epithelial-to-mesenchymal transition signature enrichment
  • Metabolic rewiring: Shift from glycolysis to oxidative phosphorylation or vice versa

11.3 Combination Therapy Suggestions

Based on resistance mechanism analysis, the workflow suggests combination strategies:

Resistance Mechanism          Suggested Combination
---------------------------------------------------------
Target mutation               Next-generation inhibitor + original
Bypass pathway activation     Target inhibitor + bypass pathway inhibitor
ABC transporter upregulation  Drug + efflux pump inhibitor
EMT transition                Drug + EMT reversal agent (e.g., HDAC inhibitor)
Immune evasion                Drug + checkpoint inhibitor

Each suggestion is grounded in evidence retrieved from sc_drug_response and sc_clinical collections, with supporting literature citations.


12. Subclonal Architecture Analysis

12.1 Clone Detection from scRNA-seq

The subclonal architecture workflow reconstructs the clonal hierarchy of tumors using single-cell transcriptomic data, optionally integrated with matched bulk or single-cell DNA sequencing:

RNA-only clone detection: 1. InferCNV/CopyKAT: Infer copy number alterations from gene expression patterns, using non-malignant cells as reference. GPU-accelerated sliding window analysis across chromosomal positions 2. Clustering on CNV profiles: Group cells by inferred CNV patterns to identify genetically distinct subclones 3. Clone tree construction: Build phylogenetic relationships between subclones using maximum parsimony or neighbor-joining on CNV profiles

Integrated DNA+RNA clone detection: 1. Variant calling: Use linked VCF from the genomics pipeline (Parabricks/DeepVariant) 2. Genotype assignment: Assign SNV genotypes to individual cells using scRNA-seq reads at variant positions (cellSNP/Vireo) 3. Clone-expression mapping: Link genetic clones to transcriptional states for mechanistic interpretation

12.2 Clone Fitness and Evolution

For each identified subclone, the workflow estimates fitness and evolutionary trajectory:

  • Clone frequency: Proportion of total tumor cells in each subclone
  • Proliferation index: Expression of cell cycle genes (MKI67, TOP2A, CDK1) per clone
  • Fitness inference: Relative growth rate estimated from clone frequency dynamics in longitudinal samples
  • Drug sensitivity prediction: Per-clone drug response based on clone-specific expression of drug targets and resistance markers

12.3 Clinical Reporting

The subclonal analysis generates a clinical report including:

  • Clone tree visualization with annotated mutations and expression programs
  • Quantification of resistant clone fraction at diagnosis
  • Predicted clonal evolution under proposed therapies
  • Risk assessment for relapse based on resistant clone burden

13. Spatial Transcriptomics Integration

13.1 Multi-Platform Support

The spatial transcriptomics workflow supports data from multiple platforms, each with distinct resolution and throughput characteristics:

Visium (10x Genomics): - Resolution: ~55 um spots (1-10 cells per spot) - Genes: Whole transcriptome (~20,000 genes) - Coverage: 4,992 spots per capture area (6.5 mm x 6.5 mm) - Analysis: Spot deconvolution to estimate cell type composition per spot using cell2location or RCTD

MERFISH (Vizgen): - Resolution: Subcellular (individual transcripts) - Genes: 100-1,000 gene panels - Coverage: Up to 1 cm x 1 cm tissue sections - Analysis: Cell segmentation, direct cell typing from spatial gene expression

Xenium (10x Genomics): - Resolution: Subcellular transcript localization - Genes: 100-5,000+ gene panels (Xenium Prime) - Coverage: Multiple tissue sections per run - Analysis: Cell segmentation with DAPI, boundary-free transcript analysis

CODEX/PhenoCycler (Akoya Biosciences): - Resolution: Single-cell protein expression - Markers: 100+ protein targets per panel - Coverage: Whole tissue sections - Analysis: Protein co-expression, spatial phenotyping

13.2 Spatial Niche Identification

The workflow identifies recurring spatial patterns--niches--that represent functionally distinct tissue microenvironments:

  1. Cell type mapping: Assign cell types to spatial coordinates (direct for MERFISH/Xenium, deconvolved for Visium)
  2. Neighborhood analysis: For each cell/spot, compute the cell type composition of its spatial neighborhood (k nearest neighbors in physical space, typically k=10-30)
  3. Niche clustering: Cluster cells by neighborhood composition to identify recurring spatial patterns (GPU-accelerated Leiden on spatial neighborhood graph)
  4. Niche characterization: For each niche, compute:
  5. Dominant cell types and their proportions
  6. Enriched ligand-receptor interactions (CellChat, CellPhoneDB)
  7. Gene expression programs specific to the niche context
  8. Spatial autocorrelation of key genes (Moran's I)

13.3 Spatial-Transcriptomic Integration

The workflow integrates dissociated scRNA-seq with spatial data for maximum information:

scRNA-seq (high genes, no spatial)  +  Spatial (spatial context, fewer genes)
              |                                        |
              v                                        v
      Full transcriptome              Spatial coordinates + tissue context
      Cell type atlas                 Gene expression at spatial resolution
              |                                        |
              +------- Cell2location / Tangram --------+
              |                                        |
              v                                        v
      Spatially-resolved full         Cell-type-specific spatial
      transcriptome estimation        gene expression patterns

This integration enables analysis that neither modality can achieve alone: full-transcriptome analysis with spatial context, revealing how gene expression changes as a function of tissue location, proximity to tumor-immune interfaces, and distance from vasculature.


14. Trajectory and Pseudotime Analysis

14.1 Differentiation Trajectory Inference

The trajectory analysis workflow reconstructs continuous biological processes--differentiation, activation, exhaustion--from snapshot scRNA-seq data:

Diffusion Pseudotime (DPT): - Compute diffusion maps from the k-NN graph (GPU-accelerated eigendecomposition) - Identify root cell (earliest differentiation state) from biological priors or user specification - Calculate pseudotime as geodesic distance from root in diffusion space - Detect branch points where trajectories diverge

RNA Velocity (scVelo): - Distinguish nascent (unspliced) from mature (spliced) mRNA for each gene - Fit kinetic models to estimate RNA velocity vectors per cell - Project velocity onto UMAP embedding to visualize direction of transcriptional change - Identify driver genes whose velocity predicts future cell state transitions

PAGA (Partition-based Graph Abstraction): - Build a coarse-grained graph connecting clusters with statistically significant edges - Quantify connectivity strength between cell populations - Identify plausible differentiation pathways at cluster-to-cluster resolution

14.2 Trajectory Analysis in Oncology

Clinical applications of trajectory analysis include:

T-cell exhaustion trajectories: Track CD8+ T cells from naive/effector states through progressive exhaustion (PD-1, LAG-3, TIM-3 upregulation), identifying the branch point where cells commit to terminal exhaustion vs. memory formation. This informs checkpoint immunotherapy timing.

EMT trajectories: Map epithelial-to-mesenchymal transition in carcinomas, identifying intermediate hybrid E/M states associated with metastatic competence and drug resistance.

Myeloid polarization: Track monocyte-to-macrophage differentiation in the tumor microenvironment, distinguishing anti-tumor M1-like states from pro-tumor M2-like states and identifying interventional opportunities.

14.3 Driver Gene Identification

For each trajectory, the workflow identifies genes whose expression changes drive the transition:

  1. Pseudotime correlation: Rank genes by correlation with pseudotime (Spearman rho)
  2. Switch-like genes: Identify genes with bimodal expression along the trajectory
  3. Transcription factors: Prioritize TFs using SCENIC/pySCENIC regulon analysis
  4. Druggable targets: Cross-reference driver genes with druggable genome databases

15. Ligand-Receptor Interaction Mapping

15.1 Cell-Cell Communication Analysis

The ligand-receptor workflow maps intercellular communication networks from single-cell expression data, identifying which cell types communicate through which signaling pathways:

CellChat Framework: - Database of 2,000+ validated ligand-receptor interactions, including multimeric complexes - Quantify communication probability between all cell type pairs for each interaction - Identify dominant signaling pathways (e.g., WNT, NOTCH, TGF-beta, CCL/CXCL chemokines) - Detect communication pattern changes between conditions (e.g., treatment vs. control)

CellPhoneDB: - Statistical framework for identifying significant ligand-receptor pairs - Accounts for receptor subunit co-expression requirements - Permutation-based p-values for interaction specificity

15.2 Signaling Network Construction

The workflow constructs multi-scale signaling networks:

Level 1: Cell Type Communication Network
  - Nodes: cell types (e.g., tumor, CD8+ T, macrophage, fibroblast)
  - Edges: aggregate communication strength between cell type pairs
  - Weight: number of significant L-R pairs

Level 2: Pathway-Specific Networks
  - Separate networks for each signaling pathway
  - Identify pathway-specific communication hubs
  - Detect pathway cross-talk

Level 3: Molecular Interaction Network
  - Individual L-R pairs with expression levels and statistical significance
  - Downstream signaling cascade mapping (KEGG, Reactome)
  - Druggable interaction identification

15.3 Clinical Applications

Immune checkpoint interactions: Map PD-L1 (CD274) on tumor cells to PD-1 (PDCD1) on T cells, quantifying the strength of immune suppression. Identify which tumor subclones express PD-L1 and which T-cell subsets are affected.

CAR-T target validation: Assess whether the target antigen is also expressed on non-tumor cells in the microenvironment, predicting on-target/off-tumor toxicity risk.

Therapeutic vulnerability: Identify signaling dependencies--e.g., tumor cells dependent on fibroblast-derived HGF for survival--that represent combination therapy opportunities.


16. Biomarker Discovery Pipeline

16.1 Single-Cell Biomarker Identification

The biomarker discovery workflow identifies cell-type-specific biomarkers that distinguish clinical conditions, disease states, or treatment responses:

Differential Expression Analysis: - GPU-accelerated Wilcoxon rank-sum tests across all genes for each cell type between conditions - Multiple testing correction (Benjamini-Hochberg) with adjusted p-value threshold of 0.05 - Effect size filtering: minimum log2 fold-change of 0.5, minimum percentage expressed in at least one group of 25% - Pseudo-bulk aggregation for biological replicate-aware analysis (sum counts per sample per cell type, then DESeq2/edgeR)

Biomarker Scoring Criteria:

Criterion Weight Description
Statistical significance 0.20 Adjusted p-value from DE analysis
Effect size 0.20 Log2 fold-change between conditions
Cell-type specificity 0.15 Expression restricted to disease-relevant cell types
Clinical correlation 0.15 Correlation with clinical outcomes (from sc_clinical)
Detectability 0.10 Amenable to clinical assay (IHC, flow cytometry, qPCR)
Literature support 0.10 Prior evidence from sc_literature collection
Druggability 0.10 Whether the biomarker is also a therapeutic target

16.2 Composite Biomarker Signatures

Beyond individual genes, the workflow identifies multi-gene signatures that improve predictive accuracy:

  1. LASSO regression: L1-regularized logistic regression to select sparse gene sets that classify conditions
  2. Random forest importance: Feature importance ranking from GPU-accelerated random forest (cuML)
  3. Pathway-level signatures: Aggregate gene scores into pathway activity scores using PROGENy or AUCell
  4. Cell proportion signatures: Use cell type proportions as features (e.g., ratio of CD8+ T cells to Tregs)

16.3 Biomarker Validation Pipeline

Each candidate biomarker undergoes computational validation:

  1. Cross-dataset validation: Test biomarker in independent datasets from sc_datasets collection
  2. Bulk deconvolution validation: Confirm that single-cell biomarker signal is detectable in bulk RNA-seq (TCGA validation)
  3. Clinical outcome correlation: Correlate biomarker expression with survival, response, and relapse data from sc_clinical
  4. Assay feasibility assessment: Evaluate whether the biomarker can be measured by clinically available assays

17. CAR-T Target Validation

17.1 Target Expression Profiling

The CAR-T target validation workflow provides comprehensive assessment of candidate CAR-T targets using single-cell data, directly integrating with the HCLS AI Factory CAR-T Intelligence Agent:

On-Tumor Expression: - Quantify target antigen expression across all tumor cell clusters - Assess expression heterogeneity: what fraction of tumor cells express the target above detection threshold? - Identify tumor subclones with low/absent target expression (potential escape populations) - Measure expression level distribution (MFI-equivalent from RNA)

Off-Tumor Expression (Safety Assessment): - Query sc_cell_types and sc_markers collections for target expression across all normal tissues - Cross-reference Human Cell Atlas and Tabula Sapiens for comprehensive normal tissue expression - Identify critical normal cell types expressing the target (e.g., cardiac, neural, hepatic) - Risk stratification: HIGH (vital organ expression), MEDIUM (non-vital tissue expression), LOW (minimal normal expression)

17.2 Antigen Escape Prediction

The workflow predicts the likelihood and mechanisms of antigen escape:

def predict_antigen_escape(adata, target_gene, tumor_clusters):
    """
    Predict probability and timeline of antigen escape.
    """
    tumor_cells = adata[adata.obs['cell_type'].isin(tumor_clusters)]

    # Expression heterogeneity
    expressing = (tumor_cells[:, target_gene].X > 0).sum()
    total = len(tumor_cells)
    expression_fraction = expressing / total

    # Expression level variance
    expression_levels = tumor_cells[:, target_gene].X.toarray().flatten()
    expression_cv = np.std(expression_levels) / (np.mean(expression_levels) + 1e-6)

    # Pre-existing negative clone
    negative_fraction = 1 - expression_fraction

    # Risk assessment
    if negative_fraction > 0.10:
        escape_risk = "HIGH"
        escape_timeline = "3-6 months"
    elif negative_fraction > 0.01:
        escape_risk = "MEDIUM"
        escape_timeline = "6-12 months"
    elif expression_cv > 1.5:
        escape_risk = "MEDIUM"
        escape_timeline = "6-18 months"
    else:
        escape_risk = "LOW"
        escape_timeline = ">18 months"

    return {
        "target": target_gene,
        "expression_fraction": expression_fraction,
        "expression_cv": expression_cv,
        "negative_clone_fraction": negative_fraction,
        "escape_risk": escape_risk,
        "estimated_escape_timeline": escape_timeline,
        "recommendation": "Consider dual-target CAR" if escape_risk != "LOW"
                         else "Single-target CAR acceptable"
    }

17.3 TME Compatibility Assessment

Beyond target expression, the workflow assesses whether the TME is permissive for CAR-T cell function:

  • Immunosuppressive milieu: Quantify Tregs, MDSCs, and immunosuppressive cytokines (TGF-beta, IL-10)
  • Physical barriers: Assess stromal density and fibroblast activation markers (FAP, alpha-SMA)
  • Metabolic competition: Evaluate glucose, amino acid, and oxygen availability in the TME
  • Exhaustion potential: Predict CAR-T exhaustion risk from inhibitory ligand expression (PD-L1, Galectin-9)

18. Treatment Response Monitoring

18.1 Longitudinal Single-Cell Analysis

The treatment response monitoring workflow tracks cellular changes across multiple timepoints during therapy:

Timepoint Alignment: - Integrate scRNA-seq from pre-treatment, on-treatment, and post-treatment biopsies - Batch correction while preserving biological signal (Harmony or scVI) - Align cell type annotations across timepoints for consistent tracking

Clonal Dynamics: - Track clone frequencies over time using InferCNV-derived clone assignments - Identify expanding clones (potential resistance) and contracting clones (drug-sensitive) - Calculate selection coefficients for each clone under therapy

18.2 Immune Response Tracking

The workflow monitors immune system dynamics during immunotherapy:

  • T-cell expansion: Track clonotype-specific T-cell expansion using TCR sequencing data (5' 10x)
  • Exhaustion progression: Monitor exhaustion marker accumulation (PD-1, LAG-3, TIM-3) over treatment
  • New infiltration: Detect newly recruited immune cell populations that appear only after treatment initiation
  • Tertiary lymphoid structures: Identify gene signatures of tertiary lymphoid structures (TLS) that predict durable response

18.3 Resistance Evolution Monitoring

For targeted therapy, the workflow tracks resistance mechanism evolution:

  1. Pre-treatment baseline: Catalog all subclones and their expression programs
  2. Early on-treatment (2-4 weeks): Identify which clones are contracting (sensitive) and which are stable (potentially resistant)
  3. Late on-treatment (2-6 months): Detect expanding resistant clones and characterize their resistance mechanisms
  4. Progression: At clinical progression, compare the dominant clone with pre-treatment baseline to identify acquired changes

18.4 Clinical Decision Support

The monitoring workflow generates actionable clinical summaries:

TREATMENT RESPONSE REPORT
=========================
Patient: [ID]  |  Timepoint: Week 12 (On-treatment)
Therapy: Pembrolizumab (anti-PD-1)

TUMOR DYNAMICS:
  - Dominant clone (Clone A, EGFR-mutant): 60% -> 25% (RESPONDING)
  - Minor clone (Clone B, MET-amplified): 5% -> 18% (EXPANDING - ALERT)
  - New clone (Clone C): Not detected at baseline, now 8% (EMERGENT)

IMMUNE RESPONSE:
  - CD8+ T-cell infiltration: 8% -> 22% (INCREASED)
  - Clonotype expansion: 3 dominant clonotypes expanding
  - Exhaustion score: 0.3 -> 0.6 (INCREASING - MONITOR)

TME CLASSIFICATION: HOT_INFLAMED (improved from EXCLUDED at baseline)

RECOMMENDATIONS:
  1. Clone B expansion suggests MET-mediated resistance; consider adding
     MET inhibitor (capmatinib/tepotinib) if confirmed by ctDNA
  2. Increasing exhaustion score suggests potential for secondary resistance;
     consider LAG-3 inhibitor addition at next assessment
  3. Overall response trajectory is positive; continue current therapy
     with close monitoring of Clone B

19. Data Sources and Knowledge Graphs

19.1 Primary Data Sources

The Single-Cell Intelligence Agent draws from a comprehensive set of public data repositories:

Single-Cell Atlases: - Human Cell Atlas (HCA): International consortium mapping all human cell types. Currently >50 million cells across 30+ organs - CellxGene (Chan Zuckerberg Initiative): Curated repository of >50 million cells from 1,000+ datasets with standardized annotations. Primary source for sc_cell_types and sc_markers collections - Tabula Sapiens (Tabula Sapiens Consortium, 2022): Comprehensive atlas of 500,000+ cells from 24 human tissues with deep annotation - Tabula Muris (Tabula Muris Consortium, 2018): Mouse reference atlas for cross-species comparison

Cancer-Specific Resources: - TCGA (The Cancer Genome Atlas): Bulk RNA-seq and clinical data for 33 cancer types; used for bulk validation of single-cell biomarkers - DepMap/CCLE (Broad Institute): Drug sensitivity and expression data for 1,500+ cancer cell lines; source for sc_drug_response collection - TISCH (Tumor Immune Single-cell Hub): Curated single-cell immune profiles across 50+ cancer types - CancerSEA (Fan et al., 2019): Functional state annotation of cancer cells at single-cell resolution

Gene Expression Repositories: - GEO (Gene Expression Omnibus): >4 million samples including >10,000 scRNA-seq datasets - ArrayExpress/BioStudies: European equivalent of GEO - HuBMAP (Human BioMolecular Atlas Program): Multi-modal atlas including spatial transcriptomics

19.2 Knowledge Graphs

The agent integrates structured knowledge from several ontologies and databases:

Cell Ontology (CL): - Hierarchical classification of >2,500 cell types - Cross-references to UBERON (anatomy), GO (gene ontology), and disease ontologies - Used for standardized cell type nomenclature and relationship inference

CellMarker Database (Zhang et al., 2019): - Manually curated marker genes for >400 cell types across 158 human tissues - Tissue-specific markers distinguish between identical cell types in different organs - Source for sc_markers collection

PanglaoDB (Franzen et al., 2019): - Community-curated marker gene database with >6,000 markers across 250+ cell types - Includes specificity scores and expression level expectations - Complementary to CellMarker with broader community contributions

ClinVar and AlphaMissense (shared with HCLS platform): - 4.1 million ClinVar variant annotations for variant-expression correlation - 71 million AlphaMissense predictions for missense variant pathogenicity

19.3 Knowledge Graph Integration

The agent constructs a multi-layer knowledge graph connecting:

Gene --[is_marker_of]--> Cell Type --[found_in]--> Tissue
  |                         |                         |
  |--[in_pathway]-->     Pathway   --[dysregulated_in]--> Disease
  |                         |                         |
  |--[target_of]-->       Drug     --[treats]-------> Cancer Type
  |                         |                         |
  |--[has_variant]-->     Variant  --[associated_with]--> Phenotype
  |                         |
  |--[interacts_with]-->  Protein  --[in_complex]--> Complex

This graph enables multi-hop reasoning: from a differentially expressed gene in a specific cell type, the agent can traverse to associated pathways, known drugs, clinical trials, and patient outcomes--providing comprehensive context for clinical interpretation.


20. Foundation Model Integration Roadmap

20.1 Current Foundation Models for Single-Cell

The emergence of foundation models pre-trained on large-scale single-cell data represents a paradigm shift in computational biology. The Single-Cell Intelligence Agent is architected for integration with these models:

scGPT (Cui et al., 2024): - Transformer architecture pre-trained on >33 million cells from CellxGene - Capabilities: cell type annotation, perturbation prediction, gene network inference, batch integration - Architecture: Gene tokens embedded as continuous vectors; attention mechanism captures gene-gene relationships - Performance: State-of-the-art on cell type annotation benchmarks (>90% accuracy on unseen tissues) - Integration plan: Fine-tune on HCLS-specific cancer datasets; use as embedding backbone for Milvus collections

Geneformer (Theodoris et al., 2023): - BERT-like architecture pre-trained on >30 million cells from Genecorpus-30M - Gene ranking approach: Genes tokenized by expression rank within each cell - Capabilities: In silico perturbation prediction, disease state classification, gene dosage sensitivity - Unique strength: Predicts effects of gene perturbations without requiring perturbation training data - Integration plan: Use for in silico drug target validation and perturbation response prediction

scFoundation (Hao et al., 2024): - Large-scale foundation model pre-trained on >50 million cells - Supports multiple downstream tasks through task-specific fine-tuning heads - Integration plan: Evaluate for multi-task deployment replacing multiple specialized models

20.2 Integration Architecture

The foundation model integration follows a phased approach:

Phase 1 (Current): Traditional ML pipeline with GPU acceleration - RAPIDS single-cell for core computation - Scanpy-compatible workflows - Milvus for knowledge retrieval - Claude LLM for interpretation

Phase 2 (Planned, Q3 2026): Foundation model augmentation - scGPT embeddings replace PCA for Milvus collection vectors - Geneformer for perturbation prediction in drug response workflow - Foundation model cell type annotation as fourth strategy in consensus pipeline

Phase 3 (Planned, 2027): Foundation model-native workflows - End-to-end foundation model inference on DGX Spark - Multi-task deployment: single model for annotation, perturbation, integration - Fine-tuned models on HCLS clinical datasets - Real-time inference at <1 second per query

20.3 Technical Requirements

Foundation model deployment on DGX Spark requires:

Model Parameters GPU Memory Inference Time (10K cells)
scGPT ~50M ~4 GB ~30 seconds
Geneformer ~10M ~2 GB ~15 seconds
scFoundation ~100M ~8 GB ~60 seconds

The DGX Spark's 128 GB GPU memory provides ample headroom for concurrent model serving alongside RAPIDS computation, enabling real-time foundation model inference without resource contention.


21. Benchmarking and Performance

21.1 Computational Benchmarks

Comprehensive benchmarks were conducted on DGX Spark comparing the Single-Cell Intelligence Agent (RAPIDS-accelerated) against standard CPU-based tools:

Dataset: PBMC 68K (Zheng et al., 2017)

Operation Scanpy (CPU, 32 cores) Seurat (R, 32 cores) SCIA (GPU, DGX Spark) Speedup vs Scanpy
Load & QC 12s 18s 3s 4x
Normalize + HVG 8s 15s 1.5s 5x
PCA (50 comp) 25s 30s 0.8s 31x
k-NN Graph 180s 240s 3s 60x
Leiden Clustering 45s 60s 1.5s 30x
UMAP 120s 180s 4s 30x
Diff. Expression 300s 420s 8s 37x
Total Pipeline 690s (11.5 min) 963s (16 min) 22s 31x

Dataset: 100K Tumor Cells (Synthetic benchmark)

Operation Scanpy (CPU) SCIA (GPU) Speedup
PCA 45s 1.2s 37x
k-NN Graph 480s 5s 96x
UMAP 360s 8s 45x
Leiden 180s 4s 45x
DE (all clusters) 720s 15s 48x
Total 1,785s (30 min) 33s 54x

Dataset: 500K Cell Atlas Integration

Operation Scanpy (CPU) SCIA (GPU) Speedup
PCA 5 min 6s 50x
k-NN Graph 45 min 25s 108x
UMAP 35 min 40s 52x
Leiden 15 min 12s 75x
Harmony Integration 25 min 18s 83x
Total 125 min (2+ hrs) 101s (~1.7 min) 74x

21.2 Accuracy Benchmarks

GPU acceleration does not compromise analytical accuracy. Concordance between CPU and GPU results:

Metric Concordance Method
PCA loadings >0.999 (Pearson r) Component-wise correlation
UMAP embedding >0.95 (Trustworthiness) Neighborhood preservation
Cluster assignments >0.98 (ARI) Adjusted Rand Index
DE gene rankings >0.99 (Spearman rho) Top 500 genes per cluster
Cell type annotations >0.97 (Accuracy) Agreement with manual annotation

Minor differences arise from floating-point precision (FP32 vs FP64) and stochastic algorithm initialization, but these are within the range of run-to-run variability for CPU-based tools as well.

21.3 Scaling Analysis

The agent's performance scales favorably with dataset size:

Cells     | CPU Time      | GPU Time    | Speedup
----------|---------------|-------------|--------
10,000    | 3 minutes     | 8 seconds   | 22x
50,000    | 15 minutes    | 18 seconds  | 50x
100,000   | 30 minutes    | 33 seconds  | 54x
250,000   | 75 minutes    | 55 seconds  | 82x
500,000   | 125 minutes   | 101 seconds | 74x
1,000,000 | >4 hours      | 210 seconds | >68x

GPU speedup increases with dataset size due to better GPU utilization at scale, making the advantage most pronounced for the largest clinical datasets.


22. Competitive Analysis

22.1 Competitive Landscape

The Single-Cell Intelligence Agent occupies a unique position in the single-cell analysis ecosystem:

Feature Scanpy Seurat CellRanger Seven Bridges Terra/Broad SCIA
Language Python R C++/Python Cloud Cloud Python
GPU Acceleration No No Partial No Optional Full (RAPIDS)
On-Premise Yes Yes Yes No No Yes (DGX)
Cloud Option Manual Manual No Yes Yes Planned
Spatial Support Squidpy addon STUtility SpaceRanger Limited Limited Integrated
Knowledge Base No No No No No 12 Milvus collections
LLM Integration No No No No No Claude RAG
Clinical Reports No No No Limited Limited Automated
Multi-Modal Limited WNN (strong) Multiome Limited Limited Full
Foundation Models No No No No No Roadmap
Cost Free Free 10x license $$/sample $$/compute DGX infra
Scalability 100K cells 100K cells 10K cells Elastic Elastic 1M+ cells

22.2 Differentiation

vs. Scanpy/Seurat (Open-Source CPU Tools): The SCIA provides 50-100x speedup over Scanpy and Seurat for identical analyses, transforming batch-mode computation into interactive exploration. The integrated knowledge base and LLM interpretation layer add clinical intelligence that open-source tools lack. However, Scanpy and Seurat benefit from larger user communities, more extensive documentation, and broader method availability. The SCIA mitigates this by maintaining Scanpy API compatibility, allowing users to leverage the existing ecosystem while gaining GPU acceleration.

vs. CellRanger (10x Genomics Proprietary): CellRanger excels at FASTQ-to-count-matrix processing for 10x Chromium data but offers limited downstream analysis. The SCIA complements CellRanger by accepting its output as input and providing GPU-accelerated downstream analysis, knowledge-augmented interpretation, and clinical reporting. CellRanger's Loupe Browser provides elegant visualization but lacks the analytical depth of the SCIA's 10 workflows.

vs. Seven Bridges / Terra (Cloud Platforms): Cloud platforms offer elastic scalability but incur per-analysis costs, data transfer latency, and potential data sovereignty concerns for clinical genomics. The SCIA's on-premise DGX deployment eliminates data transfer, provides consistent performance without cloud cost variability, and maintains full data control--critical for HIPAA/GDPR compliance in clinical settings.

22.3 Total Cost of Ownership

For a clinical genomics lab processing 50 samples per month:

Cost Component Cloud (Terra/AWS) On-Premise (SCIA/DGX)
Compute (per sample) $15-50 ~$2 (amortized)
Storage (per month) $500-2,000 ~$100 (NVMe amortized)
Data transfer $200-1,000 $0
Software licenses $0-5,000 $0 (open source)
Infrastructure (amortized/mo) $0 ~$3,000
Monthly Total $2,500-10,000 ~$3,200
Break-even - ~6 months

The DGX Spark investment pays for itself within 6 months for high-throughput clinical operations, with the additional benefits of data locality, consistent performance, and no per-analysis variable costs.


23. Clinical Deployment Considerations

23.1 Regulatory Framework

Clinical deployment of single-cell analysis requires compliance with regulatory and quality standards:

FDA Considerations: - Single-cell analysis tools used for diagnostic purposes fall under FDA medical device regulations (21 CFR Part 820) - The SCIA is positioned as a Clinical Decision Support (CDS) tool, providing information to clinicians rather than autonomous diagnostic decisions - Validation documentation includes analytical accuracy benchmarks, reproducibility studies, and comparison with established methods

CLIA/CAP Compliance: - Laboratory procedures using the SCIA require validation under CLIA (Clinical Laboratory Improvement Amendments) - Standard operating procedures (SOPs) for sample processing, data analysis, and result interpretation - Proficiency testing and quality control programs

Data Security: - HIPAA compliance for patient data handling (encryption at rest and in transit) - Role-based access control via HCLS common security module - Audit logging of all data access and analysis operations - On-premise deployment eliminates cloud data sovereignty concerns

23.2 Quality Control Pipeline

The SCIA implements a comprehensive QC pipeline for clinical-grade analysis:

  1. Sample-level QC: Minimum cell count (>1,000), viability (>80%), doublet rate (<10%)
  2. Cell-level QC: Gene count range (200-5,000), UMI count range, mitochondrial percentage (<20%)
  3. Batch QC: Batch effect quantification (kBET, LISI scores) before and after integration
  4. Analysis QC: Cluster stability (bootstrapped clustering), annotation confidence thresholds
  5. Report QC: Automated checks for completeness, internal consistency, and flagging of low-confidence results

23.3 Clinical Workflow Integration

The SCIA integrates into clinical workflows through several touchpoints:

Biopsy Collection --> Tissue Dissociation --> scRNA-seq Library Prep
        |                                           |
        v                                           v
  Pathology Review                          Sequencing (NovaSeq/NextSeq)
        |                                           |
        v                                           v
  H&E + IHC                               CellRanger (FASTQ -> Matrix)
  (traditional)                                     |
        |                                           v
        +------ Correlation --------->   SCIA Analysis (GPU-accelerated)
                                                    |
                                                    v
                                         Clinical Report Generation
                                                    |
                                                    v
                                         Tumor Board Presentation
                                                    |
                                                    v
                                         Treatment Decision

Turnaround Time Target: From sequencing completion to clinical report in <4 hours: - CellRanger processing: 2-3 hours - SCIA analysis: <5 minutes (GPU-accelerated) - LLM report generation: <5 minutes - QC review: 30-60 minutes (human review)


24. Future Directions

24.1 Technology Roadmap

Q2 2026: Spatial Transcriptomics Enhancement - Full Xenium support with subcellular transcript analysis - 3D spatial reconstruction from serial tissue sections - Spatial-temporal modeling for treatment response prediction

Q3 2026: Foundation Model Integration (Phase 2) - scGPT embeddings for improved Milvus retrieval - Geneformer-based perturbation prediction in drug response workflow - Multi-task foundation model deployment on DGX Spark

Q4 2026: Multi-Omics Fusion - Integrated ATAC-seq + RNA-seq (Multiome) analysis - Proteogenomic integration (CITE-seq + genomics) - Spatial multi-omics (Xenium + Visium + CODEX on same tissue)

2027: Clinical Trial Support - Prospective validation in 3-5 clinical trial sites - Automated MRD monitoring from longitudinal biopsies - Real-time biomarker tracking dashboards for trial coordinators - FDA 510(k) submission preparation for CDS classification

24.2 Algorithmic Innovations

Streaming single-cell analysis: Process cells in real-time as they are sequenced, enabling progressive analysis without waiting for complete datasets.

Federated single-cell analysis: Enable multi-institutional analysis without data sharing, preserving patient privacy while enabling atlas-scale integration.

Causal inference: Move beyond correlation to causal modeling of gene regulatory networks, drug mechanisms, and resistance pathways using single-cell perturbation data.

Digital twins: Patient-specific tumor digital twins incorporating single-cell composition, spatial organization, and clonal architecture to simulate treatment responses in silico before clinical administration.

24.3 Ecosystem Expansion

  • Wetlab integration: Direct connection to 10x Chromium Controller and sequencers for automated pipeline triggering
  • EHR integration: HL7 FHIR-compliant interfaces for embedding single-cell results in electronic health records
  • Collaboration platform: Shared analysis workspaces for multi-disciplinary tumor boards with role-based access
  • Training and education: Interactive tutorials and case studies for clinical adoption

25. References

  1. Becht, E., McInnes, L., Healy, J., et al. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1), 38-44.

  2. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411-420.

  3. Cui, H., Wang, C., Maan, H., et al. (2024). scGPT: Toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods, 21, 1470-1480.

  4. Dominguez Conde, C., Xu, C., Jarvis, L.B., et al. (2022). Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science, 376(6594), eabl5197.

  5. Dries, R., Zhu, Q., Dong, R., et al. (2021). Giotto: A toolbox for integrative analysis and visualization of spatial expression data. Genome Biology, 22(1), 78.

  6. Fan, J., Slowikowski, K., & Zhang, F. (2020). Single-cell transcriptomics in cancer: Computational challenges and opportunities. Experimental & Molecular Medicine, 52, 1452-1465.

  7. Fleming, S.J., Chaffin, M.D., Arduini, A., et al. (2023). Unsupervised removal of systematic background noise from droplet-based single-cell experiments using CellBender. Nature Methods, 20, 1323-1335.

  8. Franzen, O., Gan, L.M., & Bjorkegren, J.L.M. (2019). PanglaoDB: A web server for exploration of mouse and human single-cell RNA sequencing data. Database, 2019, baz046.

  9. Gayoso, A., Lopez, R., Xing, G., et al. (2022). A Python library for probabilistic analysis of single-cell omics data. Nature Biotechnology, 40(2), 163-166.

  10. Hagemann-Jensen, M., Ziegenhain, C., Chen, P., et al. (2020). Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nature Biotechnology, 38(6), 708-714.

  11. Hao, Y., Hao, S., Andersen-Nissen, E., et al. (2021). Integrated analysis of multimodal single-cell data. Cell, 184(13), 3573-3587.

  12. Korsunsky, I., Millard, N., Fan, J., et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12), 1289-1296.

  13. Kim, C., Gao, R., Sei, E., et al. (2018). Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell, 173(4), 879-893.

  14. La Manno, G., Soldatov, R., Zeisel, A., et al. (2018). RNA velocity of single cells. Nature, 560(7719), 494-498.

  15. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12), 1053-1058.

  16. Macosko, E.Z., Basu, A., Satija, R., et al. (2015). Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5), 1202-1214.

  17. Marquart, J., Chen, E.Y., & Prasad, V. (2018). Estimation of the percentage of US patients with cancer who benefit from genome-driven oncology. JAMA Oncology, 4(8), 1093-1098.

  18. McGranahan, N., & Swanton, C. (2017). Clonal heterogeneity and tumor evolution: Past, present, and the future. Cell, 168(4), 613-628.

  19. Newman, A.M., Liu, C.L., Green, M.R., et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453-457.

  20. Palla, G., Spitzer, H., Klein, M., et al. (2022). Squidpy: A scalable framework for spatial omics analysis. Nature Methods, 19(2), 171-178.

  21. Patel, A.P., Tirosh, I., Trombetta, J.J., et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science, 344(6190), 1396-1401.

  22. Picelli, S., Faridani, O.R., Bjorklund, A.K., et al. (2014). Full-length RNA-seq from single cells using Smart-seq2. Nature Protocols, 9(1), 171-181.

  23. Sade-Feldman, M., Yizhak, K., Bjorgaard, S.L., et al. (2018). Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell, 175(4), 998-1013.

  24. Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., & Regier, A. (2015). Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33(5), 495-502.

  25. Stoeckius, M., Hafemeister, C., Stephenson, W., et al. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9), 865-868.

  26. Stuart, T., Butler, A., Hoffman, P., et al. (2019). Comprehensive integration of single-cell data. Cell, 177(7), 1888-1902.

  27. Sturm, G., Finotello, F., Petitprez, F., et al. (2019). Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology. Bioinformatics, 35(14), i436-i445.

  28. Svensson, V., da Veiga Beltrame, E., & Pachter, L. (2020). A curated database reveals trends in single-cell transcriptomics. Database, 2020, baaa073.

  29. Tabula Sapiens Consortium. (2022). The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science, 376(6594), eabl4896.

  30. Tang, F., Barbacioru, C., Wang, Y., et al. (2009). mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods, 6(5), 377-382.

  31. Theodoris, C.V., Xiao, L., Chopra, A., et al. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624.

  32. Tirosh, I., Izar, B., Prakadan, S.M., et al. (2016). Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 352(6282), 189-196.

  33. Traag, V.A., Waltman, L., & van Eck, N.J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9(1), 5233.

  34. Wolf, F.A., Angerer, P., & Theis, F.J. (2018). SCANPY: Large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 15.

  35. Zhang, X., Lan, Y., Xu, J., et al. (2019). CellMarker: A manually curated resource for cell markers. Nucleic Acids Research, 47(D1), D721-D728.

  36. Zheng, G.X., Terry, J.M., Belgrader, P., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8, 14049.


Appendix A: API Reference Summary

Endpoints (Port 8540)

Method Endpoint Description
POST /api/v1/annotate Cell type annotation workflow
POST /api/v1/tme-profile Tumor microenvironment profiling
POST /api/v1/drug-response Drug response prediction
POST /api/v1/subclonal Subclonal architecture analysis
POST /api/v1/spatial-niche Spatial niche identification
POST /api/v1/trajectory Trajectory/pseudotime analysis
POST /api/v1/ligand-receptor Ligand-receptor interaction mapping
POST /api/v1/biomarker Biomarker discovery
POST /api/v1/cart-validate CAR-T target validation
POST /api/v1/treatment-monitor Treatment response monitoring
GET /api/v1/health Health check
GET /api/v1/collections List Milvus collections and stats
GET /api/v1/gpu-status GPU memory and utilization

Dashboard (Port 8130)

The Streamlit dashboard provides interactive access to all workflows with: - File upload for H5AD, CSV, and spatial data formats - Real-time UMAP/t-SNE visualization with cluster coloring - Spatial maps with cell type overlay - Heatmaps, violin plots, and dot plots for marker gene visualization - Interactive TME classification and drug response reports - Export to PDF, HTML, and AnnData formats


Appendix B: Milvus Collection Statistics

Collection Vectors Dimension Index Type Avg Query Time
sc_cell_types 2,500 384 IVF_FLAT 2ms
sc_markers 45,000 384 IVF_FLAT 5ms
sc_spatial 18,000 384 IVF_FLAT 4ms
sc_tme 12,000 384 IVF_FLAT 3ms
sc_drug_response 35,000 384 IVF_FLAT 5ms
sc_literature 85,000 384 IVF_SQ8 8ms
sc_methods 3,500 384 IVF_FLAT 2ms
sc_datasets 8,000 384 IVF_FLAT 3ms
sc_trajectories 6,500 384 IVF_FLAT 3ms
sc_pathways 15,000 384 IVF_FLAT 4ms
sc_clinical 22,000 384 IVF_FLAT 4ms
genomic_evidence 3,560,000 384 IVF_SQ8 12ms
Total ~3,813,500

Appendix C: Supported File Formats

Format Extension Description Max Size
AnnData .h5ad Scanpy native format 50 GB
10x HDF5 .h5 CellRanger filtered output 10 GB
10x MEX .mtx + .tsv CellRanger sparse matrix 10 GB
CSV/TSV .csv/.tsv Dense count matrix 5 GB
Loom .loom Velocyto/scVelo format 20 GB
Seurat RDS .rds Seurat object (via rpy2) 20 GB
Visium SpaceRanger output Spatial + expression 10 GB
MERFISH Vizgen output Spatial transcripts 50 GB
Xenium Xenium output In situ transcripts 50 GB
VCF .vcf/.vcf.gz Variant calls (integration) 5 GB

This document is part of the HCLS AI Factory documentation suite. For platform-wide architecture and deployment instructions, see the main HCLS AI Factory documentation. For agent-specific implementation details, see the Single-Cell Intelligence Agent source code and API documentation.

Last updated: March 2026