Single-Cell Intelligence Agent -- Project Bible
Version: 1.0.0 | Date: 2026-03-22 | Author: Adam Jones | Classification: Internal Reference Document
1. Mission Statement
The Single-Cell Intelligence Agent delivers RAG-powered clinical decision support at single-cell resolution, transforming raw transcriptomic data into clinically actionable insights for oncology, immunology, and cell therapy teams. It bridges the gap between single-cell research output and bedside treatment decisions by combining curated domain knowledge, vector-based evidence retrieval, and LLM-powered synthesis.
2. Problem Statement
2.1 The Resolution Gap
Precision medicine has advanced from tissue-level to gene-level analysis, but clinical decisions still rely on bulk-averaged measurements that mask critical cellular heterogeneity. A tumor biopsy reporting "PD-L1 positive" may contain 30% immune-hot niches and 70% immune-cold desert -- information invisible at bulk resolution but decisive for immunotherapy selection.
2.2 The Interpretation Bottleneck
Single-cell RNA-seq generates datasets with 10,000-500,000 cells, each expressing 20,000+ genes. A trained bioinformatician requires 2-4 weeks to fully annotate, profile, and interpret a single dataset. Clinical turnaround expectations are 24-72 hours.
2.3 The Knowledge Integration Challenge
Actionable single-cell interpretation requires simultaneous access to:
- 44+ cell type definitions with canonical marker genes
- 12 cancer-specific TME reference profiles
- Drug sensitivity data for 30+ drugs
- Spatial transcriptomics platform specifications
- Active clinical trial registries
- 75+ validated marker gene associations
No single analyst maintains current expertise across all these domains.
3. Solution Architecture
3.1 High-Level Design
The agent follows the Plan-Search-Evaluate-Synthesize-Report pattern:
- Plan: Parse the query, classify the workflow type (1 of 11), and construct a search plan with collection-specific weights
- Search: Execute parallel vector searches across 12 Milvus collections using BGE-small-en-v1.5 embeddings
- Evaluate: Score evidence quality, check cross-collection corroboration, assess clinical relevance
- Synthesize: Generate a grounded LLM response using Claude Sonnet with retrieved evidence
- Report: Format structured output with citations, severity levels, and actionable recommendations
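The five stages above can be sketched as a chain of plain functions. This is a minimal illustration, not the agent's implementation: the keyword classifier, the word-overlap scoring (a stand-in for vector search), the placeholder synthesis step (a stand-in for the LLM call), the toy corpus, and every function and class name are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class SearchPlan:
    workflow: str               # hypothetical workflow label
    weights: dict[str, float]   # collection name -> weight


@dataclass
class Evidence:
    collection: str
    text: str
    score: float


def plan(query: str) -> SearchPlan:
    """Plan: classify the query and pick collection weights (toy keyword rule)."""
    if "tumor" in query.lower():
        return SearchPlan("tme_profiling", {"sc_cell_types": 0.4, "sc_markers": 0.6})
    return SearchPlan("general", {"sc_cell_types": 0.5, "sc_markers": 0.5})


def search(query: str, search_plan: SearchPlan, corpus: dict[str, list[str]]) -> list[Evidence]:
    """Search: stand-in for vector search, scoring by weighted word overlap."""
    terms = set(query.lower().split())
    hits = []
    for name, weight in search_plan.weights.items():
        for text in corpus.get(name, []):
            overlap = len(terms & set(text.lower().split())) / max(len(terms), 1)
            hits.append(Evidence(name, text, weight * overlap))
    return hits


def evaluate(hits: list[Evidence], min_score: float = 0.05) -> list[Evidence]:
    """Evaluate: drop weak evidence and rank the rest by score."""
    kept = [h for h in hits if h.score >= min_score]
    return sorted(kept, key=lambda h: h.score, reverse=True)


def synthesize(query: str, evidence: list[Evidence]) -> str:
    """Synthesize: placeholder for the LLM call; it only quotes the top hit."""
    if not evidence:
        return "No supporting evidence found."
    return f"Based on {len(evidence)} evidence item(s): {evidence[0].text}"


def report(answer: str, evidence: list[Evidence]) -> dict:
    """Report: package the answer together with its citations."""
    return {"answer": answer, "citations": [f"[{e.collection}] {e.text}" for e in evidence]}


def run(query: str, corpus: dict[str, list[str]]) -> dict:
    search_plan = plan(query)
    evidence = evaluate(search(query, search_plan, corpus))
    return report(synthesize(query, evidence), evidence)


# Toy corpus for illustration only; not the agent's seed data.
CORPUS = {
    "sc_cell_types": ["CD8 T cells infiltrate the tumor core"],
    "sc_markers": ["PDCD1 marks exhausted T cells", "EPCAM marks epithelial tumor cells"],
}
result = run("which T cells are in the tumor", CORPUS)
```

Keeping each stage a separate function means evaluation and reporting can be tested without a vector database or an LLM behind them.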
3.2 Key Design Decisions
| Decision | Rationale |
|---|---|
| 12 separate collections (not 1 monolithic) | Workflow-specific weight boosting, schema optimization per data type |
| BGE-small-en-v1.5 (384-dim) | Balance of quality and speed; 384-dim sufficient for biomedical text |
| IVF_FLAT index | Simple, accurate, adequate for < 100K records per collection |
| Pydantic models for all I/O | Type safety, validation, auto-documentation |
| 4 dedicated decision engines | Deterministic clinical logic separate from LLM stochasticity |
| Graceful degradation | A component failure reduces capability but never crashes the agent |
| COSINE similarity | Standard for normalized text embeddings |
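The graceful-degradation row can be illustrated with a small sketch: each collection search runs inside its own error boundary, so an unreachable component is logged and reported as degraded while the remaining results are still returned. The function names, the logger name, and the shape of the returned dictionary are hypothetical and do not come from the agent's code.

```python
import logging

logger = logging.getLogger("sc_agent_sketch")  # hypothetical logger name


def search_with_fallback(query, searchers):
    """Run every searcher; a failing one is logged and skipped, not fatal."""
    results, degraded = [], []
    for name, fn in searchers.items():
        try:
            results.extend(fn(query))
        except Exception as exc:  # broad on purpose: any component failure degrades
            logger.warning("collection %s unavailable: %s", name, exc)
            degraded.append(name)
    return {"results": results, "degraded": degraded}


def healthy(query):
    # Stand-in for a collection search that succeeds.
    return [f"hit for {query!r}"]


def broken(query):
    # Stand-in for a collection whose backend is down.
    raise ConnectionError("Milvus unreachable")


outcome = search_with_fallback(
    "exhausted T cells",
    {"sc_markers": healthy, "sc_cell_types": broken},
)
```

Returning the list of degraded components alongside the results lets a caller lower its confidence or tell the user which evidence sources were missing.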
3.3 What This Agent Does NOT Do
- Does not process raw FASTQ/BAM files (that is the genomics pipeline)
- Does not perform de novo clustering or dimensionality reduction (that is the computational pipeline)
- Does not store patient health records (no PHI storage)
- Does not replace pathologist review (decision support, not decision making)
- Does not run GPU-accelerated analysis in v1.0 (RAPIDS integration is planned for v1.1)
4. Stakeholder Map
| Stakeholder | Role | Interest |
|---|---|---|
| Clinical oncologist | End user | TME classification, drug response, treatment monitoring |
| Bioinformatician | Power user | Cell type annotation, trajectory inference, method selection |
| Cell therapy team | End user | CAR-T target validation, escape risk assessment |
| Spatial biology researcher | Power user | Spatial niche mapping, L-R interaction analysis |
| Clinical trial coordinator | Consumer | Biomarker discovery, trial-eligible target identification |
| Platform engineering | Operator | Deployment, monitoring, scaling |
5. Feature Inventory
5.1 Analysis Workflows (10)
| # | Workflow | Clinical Question Answered |
|---|---|---|
| 1 | Cell Type Annotation | "What cell types are in this sample and at what proportions?" |
| 2 | TME Profiling | "Is this tumor hot, cold, excluded, or immunosuppressive?" |
| 3 | Drug Response | "Which drugs will this tumor respond to at the cellular level?" |
| 4 | Subclonal Architecture | "Are there resistant subclones that could cause relapse?" |
| 5 | Spatial Niche | "Where are the immune cells relative to tumor cells in tissue?" |
| 6 | Trajectory Analysis | "What differentiation or exhaustion trajectories are active?" |
| 7 | Ligand-Receptor | "Which cell-cell communication axes are driving tumor progression?" |
| 8 | Biomarker Discovery | "What cell-type-specific biomarkers predict treatment outcome?" |
| 9 | CAR-T Validation | "Is this target safe and effective for CAR-T therapy?" |
| 10 | Treatment Monitoring | "How has the tumor composition changed under treatment?" |
5.2 Decision Support Engines (4)
| Engine | Input | Output | Deterministic |
|---|---|---|---|
| TMEClassifier | Cell proportions, gene expression | TME class + treatment recs | Yes |
| SubclonalRiskScorer | Clone data, target antigen | Risk level + timeline | Yes |
| TargetExpressionValidator | Tumor/normal expression | Safety verdict | Yes |
| CellularDeconvolutionEngine | Bulk expression | Cell type proportions | Yes |
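To show what "deterministic" means for these engines, here is a minimal sketch in the spirit of TargetExpressionValidator: a pure function that maps tumor and normal-tissue expression to a verdict, so the same inputs always yield the same output. The thresholds (`normal_max`, `min_ratio`), the verdict labels, the expression units, and the function and class names are all hypothetical placeholders, not the agent's clinical logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyVerdict:
    verdict: str                      # hypothetical labels: favorable / caution / unfavorable
    flagged_tissues: tuple[str, ...]  # normal tissues above the expression cutoff


def validate_target(
    tumor_expr: float,
    normal_expr: dict[str, float],
    normal_max: float = 1.0,   # hypothetical cutoff for normal-tissue expression
    min_ratio: float = 10.0,   # hypothetical minimum tumor-to-normal ratio
) -> SafetyVerdict:
    """Pure function: identical inputs always produce an identical verdict."""
    flagged = tuple(sorted(t for t, v in normal_expr.items() if v > normal_max))
    if flagged:
        return SafetyVerdict("unfavorable", flagged)
    highest_normal = max(normal_expr.values(), default=0.0)
    ratio = tumor_expr / highest_normal if highest_normal > 0 else float("inf")
    if ratio < min_ratio:
        return SafetyVerdict("caution", flagged)
    return SafetyVerdict("favorable", flagged)


safe = validate_target(50.0, {"heart": 0.2, "lung": 0.5})
risky = validate_target(50.0, {"heart": 3.0, "lung": 0.5})
```

Because the rule contains no sampling or model call, its verdicts can be covered by ordinary unit tests and audited independently of the LLM output.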
5.3 Knowledge Resources
| Resource | Records | Update Frequency |
|---|---|---|
| Cell Type Atlas | 44 types, 232 aliases | Static (v3.0.0) |
| Drug Database | 30 drugs, 10 classes | Semi-annual |
| Marker Genes | 75 genes | Static |
| Immune Signatures | 10 signatures | Static |
| L-R Pairs | 25 pairs | Static |
| Cancer TME Atlas | 12 cancer types | Semi-annual |
| CellxGene Seeds | 49 records | On ingest |
| Marker Seeds | 75 records | On ingest |
| TME Seeds | 20 records | On ingest |
6. Technical Stack
6.1 Core Technologies
| Layer | Technology | Version |
|---|---|---|
| Language | Python | 3.10 |
| API framework | FastAPI | 0.111.0 |
| UI framework | Streamlit | 1.33.0 |
| Vector database | Milvus | 2.4 |
| Embedding model | BGE-small-en-v1.5 | 384-dim |
| LLM | Claude Sonnet (Anthropic) | claude-sonnet-4-6 |
| Containerization | Docker | Multi-stage |
| Orchestration | Docker Compose | Compose file format 3.8 |
| Validation | Pydantic | 2.7.4 |
| Metrics | Prometheus client | 0.20.0 |
| Scheduling | APScheduler | 3.10.4 |
| Export | python-docx | 1.1.0 |
6.2 Data Sources
| Source | Data Type | Integration |
|---|---|---|
| Human Cell Atlas | Cell type references | Seed data |
| CellMarker 2.0 | Marker-cell associations | Seed data |
| Cell Ontology (CL) | Ontology identifiers | Static mapping |
| PanglaoDB | Marker gene database | Seed data |
| CellxGene | Dataset metadata | API ingest |
| GDSC | Drug sensitivity | Knowledge base |
| DepMap | Cancer dependency | Knowledge base |
| ClinVar | Variant classification | Shared collection |
| TISCH2 | TME atlas | Knowledge base |
| CellPhoneDB | L-R interactions | Knowledge base |
7. Port Allocation
| Port | Service | Protocol |
|---|---|---|
| 8540 | FastAPI REST API | HTTP |
| 8130 | Streamlit UI | HTTP |
| 19530 | Milvus (shared) | gRPC |
| 69530 | Milvus (standalone) | gRPC |
| 69091 | Milvus health (standalone) | HTTP |
8. Data Flow

    User Query
        |
        v
    Query Classification (SCWorkflowType)
        |
        v
    Search Plan Construction
        |-- Collection selection (12 collections)
        |-- Weight profile selection (11 profiles)
        |-- Filter expression generation
        |
        v
    Parallel Vector Search (ThreadPoolExecutor)
        |-- sc_cell_types (weight: 0.14)
        |-- sc_markers (weight: 0.12)
        |-- ... (10 more collections)
        |
        v
    Evidence Aggregation & Scoring
        |-- Cross-collection entity linking
        |-- Citation relevance scoring
        |-- Evidence level assessment
        |
        v
    LLM Synthesis (Claude Sonnet)
        |-- System prompt: single-cell specialist
        |-- Context: top-K evidence from search
        |-- Conversation history (3-turn window)
        |
        v
    Structured Response (SCResponse)
        |-- answer (natural language)
        |-- workflow_result (typed)
        |-- citations (formatted)
        |-- confidence (0-1)
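The parallel search and weighted aggregation steps of this flow can be sketched with the standard library alone. Everything here is an assumption for illustration: the in-memory records with 3-dimensional toy vectors stand in for Milvus collections holding 384-dimensional embeddings, brute-force cosine scoring stands in for the index lookup, and the function names are hypothetical. Only the two collection names and their weights are taken from the diagram.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_collection(name, records, query_vec, top_k=2):
    """Stand-in for one collection search: brute-force cosine over records."""
    scored = [(cosine(query_vec, vec), name, text) for text, vec in records]
    return sorted(scored, reverse=True)[:top_k]


def parallel_search(collections, weights, query_vec, top_k=3):
    """Search all collections concurrently, then rank by weight * similarity."""
    with ThreadPoolExecutor(max_workers=len(collections)) as pool:
        futures = {
            name: pool.submit(search_collection, name, recs, query_vec)
            for name, recs in collections.items()
        }
        merged = []
        for name, fut in futures.items():
            for sim, coll, text in fut.result():
                merged.append((weights[name] * sim, coll, text))
    return sorted(merged, reverse=True)[:top_k]


# Toy records: (text, vector). Real embeddings would be 384-dimensional.
COLLECTIONS = {
    "sc_cell_types": [("CD8 T cell", [1.0, 0.0, 0.0]), ("B cell", [0.0, 1.0, 0.0])],
    "sc_markers": [("CD8A", [0.9, 0.1, 0.0]), ("MS4A1", [0.0, 0.9, 0.1])],
}
WEIGHTS = {"sc_cell_types": 0.14, "sc_markers": 0.12}  # the two weights shown in the diagram
ranked = parallel_search(COLLECTIONS, WEIGHTS, query_vec=[1.0, 0.0, 0.0])
```

Multiplying each similarity by its collection weight before merging is one simple way to let a workflow favor some collections over others; the agent's actual scoring formula is not specified in this document.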
9. Quality Gates
9.1 Code Quality
| Gate | Tool | Threshold |
|---|---|---|
| Type safety | Pydantic validation | All I/O models typed |
| Unit tests | pytest | 185+ test cases |
| Configuration validation | SingleCellSettings.validate() | 0 critical warnings |
| Weight sum validation | Collections.py | Sum within 0.05 of 1.0 |
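The weight sum gate in the table above amounts to a tolerance check on each weight profile. A minimal sketch follows, assuming a profile is a plain mapping from collection name to weight; the function name, the return convention, and the example profile (including the `other_collections` placeholder key) are hypothetical.

```python
import math


def validate_weights(weights: dict[str, float], tolerance: float = 0.05) -> list[str]:
    """Return a list of problems; an empty list means the profile passes."""
    problems = []
    total = math.fsum(weights.values())
    if abs(total - 1.0) > tolerance:
        problems.append(f"weights sum to {total:.3f}, expected 1.0 +/- {tolerance}")
    return problems


# Hypothetical profile: the remaining collections are collapsed into one entry.
PROFILE = {"sc_cell_types": 0.14, "sc_markers": 0.12, "other_collections": 0.74}
ok = validate_weights(PROFILE)
bad = validate_weights({"sc_cell_types": 0.14, "sc_markers": 0.12})
```

Returning a list of messages rather than raising lets a configuration validator collect every failing profile in one pass.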
9.2 Clinical Quality
| Gate | Mechanism |
|---|---|
| TME classification accuracy | Validated against TISCH2 reference profiles |
| Drug sensitivity correlation | Cross-referenced with GDSC IC50 data |
| CAR-T safety thresholds | Based on published vital organ expression data |
| Evidence grading | Four-tier evidence level system |
| Severity classification | Five-level clinical severity scale |
10. Deployment Models
10.1 Standalone (Docker Compose)
Includes a dedicated Milvus instance (etcd + MinIO + standalone). Suitable for development, testing, and single-user deployment.
10.2 Integrated (DGX Spark)
Connects to the shared Milvus instance via the top-level docker-compose.dgx-spark.yml and reads from the shared genomic_evidence collection. Suitable for production deployment alongside other HCLS AI Factory agents.
10.3 VAST AI OS
Deployed as a function within the VAST AI OS platform with automatic scaling and health monitoring. Uses the VAST AI OS AgentEngine model for lifecycle management.
11. Roadmap
11.1 v1.0 (Current)
- 12 Milvus collections with seed data
- 10 analysis workflows
- 4 decision support engines
- FastAPI + Streamlit deployment
- Cross-agent integration (4 peer agents)
- 185 test cases
11.2 v1.1 (Q2 2026)
- RAPIDS GPU acceleration for cuML UMAP/clustering
- scGPT foundation model integration via NIM
- CIBERSORTx-grade deconvolution
- OpenTelemetry distributed tracing
- Redis-backed rate limiting
11.3 v2.0 (Q3 2026)
- Multi-modal integration (scATAC-seq, CITE-seq, Multiome)
- Real-time spatial analysis pipeline
- Automated report generation with institutional templates
- Patient longitudinal tracking dashboard
- FDA 21 CFR Part 11 compliance features
12. Glossary
| Term | Definition |
|---|---|
| AnnData | Annotated data matrix format for single-cell data (.h5ad) |
| CAR-T | Chimeric Antigen Receptor T-cell therapy |
| CITE-seq | Cellular Indexing of Transcriptomes and Epitopes by Sequencing |
| CL | Cell Ontology (standardized cell type identifier system) |
| DE | Differential Expression analysis |
| GDSC | Genomics of Drug Sensitivity in Cancer |
| HCA | Human Cell Atlas consortium |
| L-R | Ligand-Receptor (cell-cell communication) |
| MERFISH | Multiplexed Error-Robust FISH (spatial transcriptomics) |
| MRD | Minimal Residual Disease |
| NNLS | Non-Negative Least Squares (deconvolution method) |
| PCA | Principal Component Analysis |
| RAG | Retrieval-Augmented Generation |
| scRNA-seq | Single-cell RNA sequencing |
| TME | Tumor Microenvironment |
| UMAP | Uniform Manifold Approximation and Projection |
| Visium | 10x Genomics spatial transcriptomics platform |
| Xenium | 10x Genomics in situ spatial platform |
HCLS AI Factory -- Single-Cell Intelligence Agent Project Bible v1.0.0