CAR-T Intelligence Agent

Cross-functional intelligence across the CAR-T cell therapy development lifecycle. Part of the HCLS AI Factory.


Overview

The CAR-T Intelligence Agent breaks down data silos across the 5 stages of CAR-T development. It searches all data sources simultaneously and uses Claude to synthesize cross-functional insights.

Collection	Records	Source
Literature	5,047	PubMed abstracts via NCBI E-utilities
Clinical Trials	973	ClinicalTrials.gov API v2
CAR Constructs	6	6 FDA-approved CAR-T products
Assay Results	45	Curated from landmark papers (ELIANA, ZUMA-1, KarMMa, CARTITUDE-1, etc.)
Manufacturing	30	Curated CMC/process data (transduction, expansion, release, cryo, logistics)
Safety	—	Pharmacovigilance, CRS/ICANS profiles
Biomarkers	—	CRS prediction, exhaustion monitoring
Regulatory	—	Approval timelines, post-marketing requirements
Sequences	—	Molecular binding, scFv sequences
Real-World Evidence	—	Registry outcomes, real-world data
Genomic Evidence	(read-only)	Shared from Stage 2 RAG pipeline (Milvus)
Total	6,266+ vectors	11 collections (10 owned + 1 read-only)

Example Queries

"Why do CD19 CAR-T therapies fail in relapsed B-ALL?"
"Compare 4-1BB vs CD28 costimulatory domains for DLBCL"
"What manufacturing parameters predict clinical response?"
"BCMA CAR-T resistance mechanisms in multiple myeloma"
"How does T-cell exhaustion affect CAR-T persistence?"

All queries return grounded, cross-collection answers with clickable Literature:PMID and Trial:NCT... citations.
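One way the citation labels could be turned into links is sketched below. The `Literature:PMID` and `Trial:NCT...` label shapes come from the sentence above; the function name, the regular expressions, and the markdown link output are assumptions for illustration and may not match the agent's own code.

```python
import re

# Public URL patterns for PubMed records and ClinicalTrials.gov studies.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
TRIAL_URL = "https://clinicaltrials.gov/study/{nct}"

# Assumed label shapes: "Literature:<digits>" and "Trial:NCT<8 digits>".
_LITERATURE = re.compile(r"\bLiterature:(\d{1,9})\b")
_TRIAL = re.compile(r"\bTrial:(NCT\d{8})\b")


def linkify_citations(answer: str) -> str:
    """Replace citation labels in an answer with markdown links."""
    answer = _LITERATURE.sub(
        lambda m: f"[{m.group(0)}]({PUBMED_URL.format(pmid=m.group(1))})", answer
    )
    answer = _TRIAL.sub(
        lambda m: f"[{m.group(0)}]({TRIAL_URL.format(nct=m.group(1))})", answer
    )
    return answer
```

Text that contains no citation labels passes through unchanged, so the helper can be applied to every response.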

Comparative Analysis Mode

Comparative queries are auto-detected and produce structured side-by-side analysis with markdown tables, advantages/limitations, and clinical context.

"Compare CD19 vs BCMA"                              → Target vs target
"Compare 4-1BB vs CD28 costimulatory domains"        → Costimulatory domain comparison
"Kymriah versus Carvykti"                            → Product vs product (resolves to CD19 vs BCMA)
"Compare CRS and ICANS toxicity"                     → Toxicity profile comparison

How it works: The engine detects the keywords "vs", "versus", or "compare" and parses the query into two entities. It resolves each entity against the knowledge graph (25 antigens, 6 products, 8 toxicities, 10 manufacturing processes), runs one retrieval per entity with entity-specific filtering, and builds a comparative prompt that instructs Claude to produce structured tables. The evidence panel groups results by entity under color-coded headers.

Feature	Detail
Entity types	Targets, FDA products, costimulatory domains, toxicities, manufacturing
Entity resolution	25 antigens + 39+ product/domain/biomarker/regulatory aliases
Dual retrieval	~365 ms for 46 results (24 for one entity, 22 for the other)
Structured output	Comparison table, advantages, limitations, clinical context
Fallback	Unrecognized entities gracefully fall back to normal query path
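The detection, parsing, and fallback steps described above can be sketched as follows. This is a minimal illustration, not the engine's implementation: the alias table is a tiny hypothetical subset of the knowledge graph, and the regular expressions and function names are assumptions.

```python
import re
from typing import Optional, Tuple

# Tiny illustrative alias table; the real knowledge graph is far larger.
ALIASES = {
    "kymriah": "CD19",
    "carvykti": "BCMA",
    "cd19": "CD19",
    "bcma": "BCMA",
    "4-1bb": "4-1BB",
    "cd28": "CD28",
    "crs": "CRS",
    "icans": "ICANS",
}

# "X vs Y", "X vs. Y", "X versus Y", and "compare X and Y" patterns.
_VS = re.compile(r"^(?:compare\s+)?(.+?)\s+(?:vs\.?|versus)\s+(.+)$", re.IGNORECASE)
_COMPARE_AND = re.compile(r"^compare\s+(.+?)\s+and\s+(.+)$", re.IGNORECASE)


def _resolve(phrase: str) -> Optional[str]:
    """Return the canonical entity for the first known alias in the phrase."""
    for token in phrase.lower().split():
        entity = ALIASES.get(token.strip("?.,\"'"))
        if entity:
            return entity
    return None


def parse_comparison(query: str) -> Optional[Tuple[str, str]]:
    """Return (entity_a, entity_b), or None to signal the normal query path."""
    match = _VS.match(query.strip()) or _COMPARE_AND.match(query.strip())
    if not match:
        return None
    a, b = _resolve(match.group(1)), _resolve(match.group(2))
    if a is None or b is None:
        return None  # unrecognized entity: fall back to the normal path
    return a, b
```

Returning `None` for both "not a comparison" and "unknown entity" keeps the caller simple: anything other than a resolved pair goes through the standard retrieval path.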

Architecture

User Query
    |
    v
[Comparative Detection] ──── "X vs Y" detected? ──── YES ──┐
    |                                                        |
    NO                                              [Parse Two Entities]
    |                                              (resolve via knowledge graph)
    v                                                        |
[BGE-small-en-v1.5 Embedding]                      [Dual Retrieval]
(384-dim, asymmetric query prefix)                  (Entity A + Entity B)
    |                                                        |
    v                                                        v
[Parallel Search: 11 Milvus Collections]     [Comparative Prompt Builder]
(IVF_FLAT / COSINE)                         (tables + pros/cons format)
    |               |           |                            |
    v               v           v                            |
Literature      Trials     Constructs                        |
 5,047           973           6                             |
    |               |           |                            |
 Assays      Manufacturing     |                             |
   45             30           |                             |
    |               |           |                            |
    +-------+-------+----------+                             |
            |                                                |
            v                                                v
    [Query Expansion] (12 maps, 169 keywords -> 1,496 terms)
            |
            v
    [Knowledge Graph Augmentation]
    (25 antigens, 8 toxicities, 10 mfg processes)
            |
            v
    [Claude LLM] -> Grounded response with citations
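The Query Expansion stage in the diagram maps keywords to related terms. A minimal sketch of that idea follows; the two maps and their entries are hypothetical stand-ins for the 12 maps in `src/query_expansion.py`, and the function name and `max_terms` cap are assumptions.

```python
from typing import Dict, List

# Two illustrative maps with a handful of entries (hypothetical content).
TOXICITY_MAP: Dict[str, List[str]] = {
    "crs": ["cytokine release syndrome", "tocilizumab"],
    "icans": ["immune effector cell-associated neurotoxicity syndrome", "neurotoxicity"],
}
TARGET_MAP: Dict[str, List[str]] = {
    "bcma": ["B-cell maturation antigen", "TNFRSF17"],
    "cd19": ["B-lymphocyte antigen CD19"],
}
EXPANSION_MAPS = [TOXICITY_MAP, TARGET_MAP]


def expand_query(query: str, max_terms: int = 10) -> List[str]:
    """Collect related terms for every map keyword that appears in the query."""
    words = {w.strip("?.,\"'").lower() for w in query.split()}
    terms: List[str] = []
    for mapping in EXPANSION_MAPS:
        for keyword, related in mapping.items():
            if keyword in words:
                # Skip duplicates so a term shared by two maps appears once.
                terms.extend(t for t in related if t not in terms)
    return terms[:max_terms]
```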

Built on the HCLS AI Factory platform:

Vector DB: Milvus 2.4 with IVF_FLAT/COSINE indexes (nlist=1024, nprobe=16)
Embeddings: BGE-small-en-v1.5 (384-dim)
LLM: Claude Sonnet 4.6 (Anthropic API)
UI: Streamlit (port 8521)
Hardware target: NVIDIA DGX Spark ($3,999)
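The index and embedding settings listed above could be expressed as below. The dictionaries follow the shape pymilvus accepts for index and search parameters, and the query instruction is the one published for the BGE v1.5 English models. Whether the agent uses exactly this prefix, and the helper names, are assumptions.

```python
from typing import Any, Dict

# Instruction recommended for BGE v1.5 English models: prepend to queries
# only, never to stored passages (the "asymmetric query prefix").
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

EMBEDDING_DIM = 384  # output size of BGE-small-en-v1.5


def index_params(nlist: int = 1024) -> Dict[str, Any]:
    """Index parameters for an IVF_FLAT index with cosine similarity."""
    return {"index_type": "IVF_FLAT", "metric_type": "COSINE", "params": {"nlist": nlist}}


def search_params(nprobe: int = 16) -> Dict[str, Any]:
    """Search parameters: probe `nprobe` of the `nlist` clusters per query."""
    return {"metric_type": "COSINE", "params": {"nprobe": nprobe}}


def prepare_text(text: str, is_query: bool) -> str:
    """Prefix queries only; documents are embedded as-is."""
    return BGE_QUERY_PREFIX + text if is_query else text
```

With nlist=1024 and nprobe=16, each search scans roughly 1/64 of the clusters, trading a small amount of recall for speed.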

Setup

Prerequisites

Python 3.10+
Milvus 2.4 running on localhost:19530
ANTHROPIC_API_KEY environment variable (or in rag-chat-pipeline/.env)
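A quick preflight check for these prerequisites might look like the sketch below. It is not part of the repository; the function name is hypothetical, and it checks only the environment variable, not the `.env` file alternative.

```python
import os
import socket
from typing import List, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None,
              host: str = "localhost", port: int = 19530,
              timeout: float = 1.0) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems: List[str] = []
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    try:
        # A plain TCP connect only shows that something is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        problems.append(f"cannot reach Milvus at {host}:{port}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```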

Install

cd ai_agent_adds/cart_intelligence_agent
pip install -r requirements.txt

1. Create Collections and Seed FDA Constructs

python3 scripts/setup_collections.py --seed-constructs

This creates 11 Milvus collections (10 owned + 1 read-only) with IVF_FLAT indexes and inserts 6 FDA-approved CAR-T products (Kymriah, Yescarta, Tecartus, Breyanzi, Abecma, Carvykti).

2. Ingest PubMed Literature (~15 min)

python3 scripts/ingest_pubmed.py --max-results 5000

Fetches CAR-T abstracts via NCBI E-utilities (esearch + efetch), classifies by development stage, extracts target antigens, embeds with BGE-small, and stores in cart_literature.
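The esearch + efetch pattern can be illustrated by building the two request URLs. The endpoint and parameter names are those of the public NCBI E-utilities; the helper names and the example search term are assumptions, and the script's actual query string is not shown in this README.

```python
from typing import Iterable
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 100, retstart: int = 0) -> str:
    """URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json",
              "retmax": retmax, "retstart": retstart}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids: Iterable[str]) -> str:
    """URL that returns full records, including abstracts, as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical search term; step 1 finds IDs, step 2 fetches the records.
    print(esearch_url("chimeric antigen receptor T-cell", retmax=20))
```

Splitting the work this way lets the ID list be fetched once and the records be retrieved in batches.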

3. Ingest Clinical Trials (~3 min)

python3 scripts/ingest_clinical_trials.py --max-results 1500

Fetches CAR-T trials via ClinicalTrials.gov API v2, extracts phase/status/sponsor/antigen/generation, embeds, and stores in cart_trials.
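A request against the ClinicalTrials.gov API v2 studies endpoint could be constructed as sketched here. The `query.term`, `pageSize`, and `pageToken` parameters belong to the public API; the helper name and search term are assumptions, not the ingest script's code.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

STUDIES_ENDPOINT = "https://clinicaltrials.gov/api/v2/studies"


def studies_url(term: str, page_size: int = 100,
                page_token: Optional[str] = None) -> str:
    """Build one page request; pass the previous page's token to continue."""
    params: Dict[str, object] = {"query.term": term, "pageSize": page_size}
    if page_token:
        # The API returns a token for the next page while more results remain.
        params["pageToken"] = page_token
    return f"{STUDIES_ENDPOINT}?{urlencode(params)}"
```

An ingest loop would call this repeatedly, feeding each response's next-page token back in until no token is returned or the `--max-results` limit is reached.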

4. Seed Assay Data (~30 sec)

python3 scripts/seed_assays.py

Inserts 45 curated assay records from landmark CAR-T publications (ELIANA, ZUMA-1, KarMMa, CARTITUDE-1, etc.) covering cytotoxicity, cytokine, persistence, exhaustion, and resistance data.

5. Seed Manufacturing Data (~30 sec)

python3 scripts/seed_manufacturing.py

Inserts 30 curated manufacturing/CMC records covering transduction, expansion, harvest, cryopreservation, release testing, logistics, cost analysis, and emerging platforms (point-of-care, allogeneic, non-viral).

6. Validate

python3 scripts/validate_e2e.py

Runs 5 tests: collection stats, single-collection search, multi-collection search_all(), filtered search (target_antigen == "CD19"), and all demo queries.
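The filtered search relies on a Milvus boolean expression. A small helper for building such an equality filter is sketched below; the helper is hypothetical, and the backslash escaping of quotes is an assumption about how values containing quotes should be handled.

```python
def eq_filter(field: str, value: str) -> str:
    """Build a Milvus boolean expression such as: target_antigen == "CD19"."""
    if not field.isidentifier():
        raise ValueError(f"invalid field name: {field!r}")
    # Escape backslashes first, then quotes, so the literal stays well-formed.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field} == "{escaped}"'
```

The resulting string is the kind of value passed as the `expr` argument of a pymilvus search call.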

7. Run Integration Test (requires API key)

python3 scripts/test_rag_pipeline.py

Tests the full RAG pipeline: embed -> search_all -> knowledge graph -> Claude LLM response generation. Validates both synchronous and streaming modes.
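The last two stages, combining retrieved evidence with knowledge graph context before calling the LLM, can be sketched as a prompt builder. The `Hit` type, the prompt wording, and the `top_k` default are assumptions for illustration; the actual prompts live in `src/rag_engine.py` and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    collection: str   # e.g. "Literature" or "Trial"
    record_id: str    # e.g. a PMID or an NCT number
    text: str
    score: float      # cosine similarity, higher is closer


def build_prompt(question: str, hits: List[Hit], knowledge: str, top_k: int = 5) -> str:
    """Assemble a grounded prompt: ranked evidence, graph context, then the question."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
    evidence = "\n".join(f"[{h.collection}:{h.record_id}] {h.text}" for h in ranked)
    return (
        "Answer using only the evidence below and cite sources by their labels.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Knowledge graph context:\n{knowledge}\n\n"
        f"Question: {question}"
    )
```

Labelling each evidence line with its collection and record ID is what allows the response to carry citations that trace back to a specific source.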

8. Launch UI

streamlit run app/cart_ui.py --server.port 8521

Project Structure

cart_intelligence_agent/
├── Docs/
│   └── CART_Intelligence_Agent_Design.md  # Architecture design document
├── src/
│   ├── models.py                  # Pydantic data models (16 models + enums)
│   ├── collections.py             # 11 Milvus collection schemas + manager
│   ├── knowledge.py               # Knowledge graph (25 targets, 8 toxicities, 10 mfg)
│   ├── query_expansion.py         # 12 expansion maps (169 keywords -> 1,496 terms)
│   ├── rag_engine.py              # Multi-collection RAG engine + comparative analysis + Claude
│   ├── agent.py                   # CAR-T Intelligence Agent (plan -> search -> synthesize)
│   ├── ingest/
│   │   ├── base.py                # Base ingest pipeline (fetch -> parse -> embed -> store)
│   │   ├── literature_parser.py   # PubMed NCBI E-utilities ingest
│   │   ├── clinical_trials_parser.py  # ClinicalTrials.gov API v2 ingest
│   │   ├── construct_parser.py    # CAR construct data parser
│   │   ├── assay_parser.py        # Assay data parser
│   │   └── manufacturing_parser.py # Manufacturing/CMC data parser
│   └── utils/
│       └── pubmed_client.py       # NCBI E-utilities HTTP client
├── app/
│   └── cart_ui.py                 # Streamlit chat interface (NVIDIA theme, comparative mode)
├── config/
│   └── settings.py                # Pydantic BaseSettings configuration
├── data/
│   └── reference/
│       ├── assay_seed_data.json   # 45 curated assay records from landmark papers
│       └── manufacturing_seed_data.json  # 30 curated manufacturing/CMC records
├── scripts/
│   ├── setup_collections.py       # Create collections + seed FDA constructs
│   ├── ingest_pubmed.py           # CLI: ingest PubMed CAR-T literature
│   ├── ingest_clinical_trials.py  # CLI: ingest ClinicalTrials.gov trials
│   ├── seed_assays.py             # Seed assay data from published papers
│   ├── seed_manufacturing.py      # Seed manufacturing/CMC data
│   ├── validate_e2e.py            # End-to-end data layer validation
│   ├── test_rag_pipeline.py       # Full RAG + LLM integration test
│   └── seed_knowledge.py          # Export knowledge graph to JSON
├── requirements.txt
└── LICENSE                        # Apache 2.0

55 Python files | ~16,748 lines | Apache 2.0

Knowledge Graph

Component	Count	Examples
Target Antigens	25	CD19, BCMA, CD22, CD20, CD30, HER2, GPC3, EGFR, Mesothelin, GPRC5D, ...
FDA-Approved Products	6	Kymriah, Yescarta, Tecartus, Breyanzi, Abecma, Carvykti
Entity Aliases	39+	Product names, generic names, costimulatory domains (for comparative resolution)
Toxicity Profiles	8	CRS, ICANS, B-cell aplasia, HLH/MAS, cytopenias, TLS, GvHD, on-target/off-tumor
Manufacturing Processes	10	Transduction, expansion, leukapheresis, cryopreservation, release testing, ...
Biomarkers	15	CRS prediction, T-cell exhaustion, persistence, cytokine panels, ...
Regulatory Histories	6	Approval timelines, post-marketing requirements for all FDA products
Immunogenicity Topics	6	ADA, immunogenicity assays, risk factors, ...
Query Expansion Maps	12	Target Antigen, Disease, Toxicity, Manufacturing, Mechanism, Construct, Safety, Biomarker, Regulatory, Sequence, RealWorld, Immunogenicity
Expansion Keywords	169	Mapping to 1,496 related terms

Performance

Measured on NVIDIA DGX Spark (GB10 GPU, 128GB unified memory):

Metric	Value
PubMed ingest (5,047 abstracts)	~15 min
ClinicalTrials.gov ingest (973 trials)	~3 min
Assay seed ingest (45 records)	~30 sec
Manufacturing seed ingest (30 records)	~30 sec
Vector search (11 collections, top-5 each)	12-16 ms (cached)
Comparative dual retrieval (2x11 collections)	~365 ms
Full RAG query (search + Claude)	~24 sec
Comparative RAG query (dual search + Claude)	~30 sec
Cosine similarity scores	0.74 - 0.90

Credits

Adam Jones
Apache 2.0 License