Clinical Trial Intelligence Agent -- Project Bible¶
Date: March 22, 2026 Author: Adam Jones Platform: NVIDIA DGX Spark -- HCLS AI Factory
Table of Contents¶
- Overview
- Architecture
- Collections Reference
- Workflow Reference
- API Endpoint Reference
- Knowledge Base Reference
- Decision Support Engines
- Query Expansion Reference
- Configuration Reference
- Port Map
- Tech Stack
- Data Models
- Cross-Agent Integration
- Ingest Pipeline Reference
- Test Reference
1. Overview¶
The Clinical Trial Intelligence Agent is an AI-powered clinical trial decision support system that integrates RAG-based evidence retrieval across 14 Milvus vector collections, 10 clinical workflows, 5 decision support engines, and an autonomous reasoning pipeline. It serves pharmaceutical R&D teams, clinical operations, regulatory affairs, and medical affairs with evidence-based guidance across the entire clinical trial lifecycle.
Key Numbers¶
| Metric | Value |
|---|---|
| Python files | 46 |
| Lines of code | 22,607 |
| Milvus collections | 14 |
| Clinical workflows | 10 |
| Decision support engines | 5 + 1 (Historical Success Estimator) |
| API endpoints | 26 |
| Landmark trials | 40 |
| Therapeutic areas | 13 |
| Regulatory agencies | 9 |
| Entity aliases | 140 |
| Drug synonym entries | 33 |
| Biomarker entries | 22 |
| Tests | 769 (100% pass, 0.47s) |
| Knowledge version | 2.0.0 |
2. Architecture¶
User --> Streamlit UI (:8128) --> FastAPI API (:8538)
|
+----------------+----------------+
| | |
Workflows(10) Decision Engines(5) RAG Engine
| |
Knowledge Base Milvus (:19530)
(40 trials, 13 areas, 14 collections
9 agencies, 9 designs) 384-dim BGE
IVF_FLAT/COSINE
Tiers¶
- Presentation: Streamlit (5 tabs, NVIDIA dark theme, port 8128)
- Application: FastAPI (26 endpoints, CORS, auth, rate limiting, port 8538)
- Data: Milvus (14 collections, BGE-small-en-v1.5 embeddings, port 19530)
3. Collections Reference¶
3.1 Full Collection Catalog¶
| # | Name | Fields | Est. Records | Weight | Primary Use |
|---|---|---|---|---|---|
| 1 | trial_protocols |
trial_id, title, phase, status, therapeutic_area, sponsor, start_date, enrollment_target, text_content | 5,000 | 0.10 | Protocol design, competitive intel |
| 2 | trial_eligibility |
trial_id, criterion_type, criterion_text, logic_operator, population_impact | 50,000 | 0.09 | Patient matching, eligibility optimization |
| 3 | trial_endpoints |
trial_id, endpoint_type, measure, time_frame, statistical_method | 20,000 | 0.08 | Protocol design, adaptive design |
| 4 | trial_sites |
trial_id, site_id, facility_name, city, state, country, status, enrollment_count | 30,000 | 0.07 | Site selection, diversity assessment |
| 5 | trial_investigators |
investigator_id, name, specialty, h_index, publication_count, therapeutic_areas | 5,000 | 0.05 | Site selection, competitive intel |
| 6 | trial_results |
trial_id, outcome, p_value, effect_size, confidence_interval, publication_pmid | 3,000 | 0.09 | Protocol design, competitive intel |
| 7 | trial_regulatory |
submission_id, agency, decision, document_type, drug_name, indication | 2,000 | 0.07 | Regulatory docs, competitive intel |
| 8 | trial_literature |
pmid, title, journal, mesh_terms, publication_year, study_type | 10,000 | 0.08 | Evidence synthesis, protocol design |
| 9 | trial_biomarkers |
biomarker, assay, threshold, validated, trial_context | 3,000 | 0.07 | Patient matching, biomarker strategy |
| 10 | trial_safety |
trial_id, event_type, severity, frequency, soc_term | 20,000 | 0.08 | Safety signal, regulatory docs |
| 11 | trial_rwe |
source, population, outcome, study_design, sample_size | 2,000 | 0.06 | Eligibility optimization, diversity |
| 12 | trial_adaptive |
design_type, decision_rule, trigger_criteria, historical_precedent | 500 | 0.05 | Adaptive design evaluation |
| 13 | trial_guidelines |
guideline_id, organization, version, recommendation_text, evidence_class | 1,000 | 0.08 | All workflows (regulatory reference) |
| 14 | genomic_evidence |
gene, variant, clinical_significance, condition, evidence_summary, source | 100,000 | 0.03 | Cross-modal genomic queries |
3.2 Index Configuration¶
All collections use identical index parameters:
3.3 Workflow-Specific Weight Overrides¶
Each workflow boosts its primary collections. Example weights for the top-3 collections per workflow:
| Workflow | #1 Collection (Weight) | #2 Collection (Weight) | #3 Collection (Weight) |
|---|---|---|---|
| Protocol Design | trial_protocols (0.20) | trial_endpoints (0.15) | trial_eligibility (0.12) |
| Patient Matching | trial_eligibility (0.25) | trial_protocols (0.15) | trial_sites (0.12) |
| Site Selection | trial_sites (0.25) | trial_investigators (0.18) | trial_protocols (0.10) |
| Eligibility Optimization | trial_eligibility (0.25) | trial_protocols (0.12) | trial_rwe (0.10) |
| Adaptive Design | trial_adaptive (0.25) | trial_endpoints (0.15) | trial_guidelines (0.12) |
| Safety Signal | trial_safety (0.25) | trial_results (0.15) | trial_protocols (0.10) |
| Regulatory Docs | trial_regulatory (0.25) | trial_guidelines (0.18) | trial_results (0.12) |
| Competitive Intel | trial_protocols (0.20) | trial_results (0.15) | trial_endpoints (0.12) |
| Diversity Assessment | trial_sites (0.22) | trial_eligibility (0.18) | trial_protocols (0.12) |
| Decentralized Planning | trial_sites (0.18) | trial_protocols (0.15) | trial_guidelines (0.12) |
| General | trial_protocols (0.10) | trial_results (0.09) | trial_eligibility (0.09) |
4. Workflow Reference¶
4.1 Workflow Catalog¶
| Workflow | Enum Value | Key Inputs | Key Outputs |
|---|---|---|---|
| Protocol Design | protocol_design |
indication, phase, comparator, mechanism | Protocol blueprint, endpoint recommendations, sample size |
| Patient Matching | patient_matching |
PatientProfile (age, dx, biomarkers, variants) | MatchScore per criterion, OverallMatch, site distances |
| Site Selection | site_selection |
therapeutic_area, phase, target_enrollment | SiteScore list, enrollment rate, diversity index |
| Eligibility Optimization | eligibility_optimization |
criteria list with justifications | Population impact, BROADEN/REVIEW/RETAIN recs |
| Adaptive Design | adaptive_design |
trial parameters, uncertainty profile | Design type, interim analysis schedule, regulatory guidance |
| Safety Signal | safety_signal |
AE data, trial_id | SafetySignal list, PRR/ROR, severity, DSMB alerts |
| Regulatory Docs | regulatory_docs |
trial data, agency, document_type | Structured document draft, agency-specific formatting |
| Competitive Intel | competitive_intel |
indication, mechanism, target trial | CompetitorProfile list, threat scores, timeline comparison |
| Diversity Assessment | diversity_assessment |
trial sites, target demographics | Demographic gaps, site recommendations, FDORA compliance |
| Decentralized Planning | decentralized_planning |
trial type, patient population | DCT component assessment, regulatory feasibility |
4.2 Workflow Contract¶
All workflows inherit from BaseTrialWorkflow:
class BaseTrialWorkflow(ABC):
workflow_type: TrialWorkflowType
def run(self, inputs: dict) -> WorkflowResult:
processed = self.preprocess(inputs)
result = self.execute(processed)
return self.postprocess(result)
def preprocess(self, inputs: dict) -> dict: ...
@abstractmethod
def execute(self, inputs: dict) -> WorkflowResult: ...
def postprocess(self, result: WorkflowResult) -> WorkflowResult: ...
5. API Endpoint Reference¶
Base URL: http://localhost:8538¶
5.1 System Endpoints¶
| Method | Path | Response | Auth Required |
|---|---|---|---|
| GET | /health |
{"status": "ok", "collections": {...}} |
No |
| GET | /collections |
Collection names and record counts | No |
| GET | /workflows |
Available workflow types | No |
| GET | /metrics |
Prometheus text format | No |
5.2 Trial Endpoints (/v1/trial/)¶
| Method | Path | Request Body | Response Model |
|---|---|---|---|
| POST | /query |
{"question": "...", "workflow_type": "..."} |
QueryResponse |
| POST | /search |
{"query": "...", "collections": [...], "top_k": 5} |
SearchResponse |
| POST | /protocol/optimize |
{"indication": "...", "phase": "..."} |
ProtocolOptimizeResponse |
| POST | /match |
{"patient": {...}, "trial_ids": [...]} |
PatientMatchResponse |
| POST | /match/batch |
{"patients": [...]} |
BatchMatchResponse |
| POST | /site/recommend |
{"therapeutic_area": "...", "phase": "..."} |
SiteRecommendResponse |
| POST | /eligibility/optimize |
{"criteria": [...]} |
EligibilityOptimizeResponse |
| POST | /adaptive/evaluate |
{"trial_params": {...}} |
AdaptiveEvaluateResponse |
| POST | /safety/signal |
{"events": [...], "trial_id": "..."} |
SafetySignalResponse |
| POST | /regulatory/generate |
{"trial_data": {...}, "doc_type": "..."} |
RegulatoryGenerateResponse |
| POST | /competitive/landscape |
{"indication": "...", "mechanism": "..."} |
CompetitiveLandscapeResponse |
| POST | /diversity/assess |
{"sites": [...], "demographics": {...}} |
DiversityAssessResponse |
| POST | /dct/plan |
{"trial_type": "...", "population": "..."} |
DCTPlanResponse |
| GET | /therapeutic-areas |
-- | Therapeutic area catalog |
| GET | /phases |
-- | Phase reference |
| GET | /guidelines |
-- | Guideline reference |
| GET | /knowledge-version |
-- | Version metadata |
| POST | /workflow/{type} |
{"inputs": {...}} |
WorkflowResponse |
5.3 Report and Event Endpoints¶
| Method | Path | Description |
|---|---|---|
| POST | /v1/reports/generate |
Generate structured report |
| GET | /v1/reports/formats |
List supported export formats |
| GET | /v1/events/stream |
SSE event stream |
| GET | /v1/events/health |
Event subsystem health |
6. Knowledge Base Reference¶
6.1 Therapeutic Areas (13)¶
oncology, cardiovascular, neuroscience, immunology, infectious_disease, rare_diseases, metabolic, respiratory, hematology, gastroenterology, dermatology, ophthalmology, gene_cell_therapy
6.2 Trial Phases (7)¶
preclinical, phase_0, phase_1, phase_2, phase_3, phase_4, expanded_access
6.3 Regulatory Agencies (9)¶
FDA, EMA, PMDA, Health_Canada, TGA, MHRA, NMPA, Swissmedic, ANVISA
6.4 Endpoint Types (9)¶
primary, secondary, exploratory, safety, patient_reported, digital, composite, minimal_residual_disease, ctDNA_clearance
6.5 Adaptive Designs (9)¶
group_sequential, sample_size_reestimation, response_adaptive, biomarker_adaptive, platform_trial, seamless_phase, master_protocol, enrichment_adaptive, dose_finding
6.6 Biomarker Strategies (9)¶
enrichment, stratification, prognostic, predictive, pharmacodynamic, surrogate, companion_diagnostic, liquid_biopsy, digital_biomarker
6.7 DCT Components (9)¶
econsent, telemedicine, home_health, local_labs, wearables, epro_ecoa, direct_to_patient, remote_monitoring, digital_informed_consent
6.8 Landmark Trials (40)¶
KEYNOTE-024, EMPEROR-Reduced, RECOVERY, PARADIGM-HF, CheckMate-067, SPRINT, EMPA-REG_OUTCOME, DAPA-CKD, HIMALAYA, DESTINY-Breast04, ADVANCE, CLARITY-AD, FOURIER, FLAURA, ADAURA, MAGELLAN, I-SPY_2, VICTORIA, TOPAZ-1, KRYSTAL-1, ELARA, CREST, CheckMate-227, KEYNOTE-522, DESTINY-Lung02, CLEAR_Outcomes, SELECT, STEP_HFpEF, TRAILBLAZER-ALZ_2, EMERGE_ENGAGE, SPRINT_SMA, SUNFISH, CASGEVY, RINVOQ_SELECT, SURMOUNT-1, PURPOSE_1, PANORAMIC, EPIC-HR, IMpower110, DAPA-HF
7. Decision Support Engines¶
7.1 Engine Catalog¶
| Engine | Class | Purpose | Key Factors |
|---|---|---|---|
| Confidence Calibrator | ConfidenceCalibrator |
Calibrate raw confidence scores | raw (0.3), evidence (0.3), docs (0.2), agreement (0.2) |
| Protocol Complexity | ProtocolComplexityScorer |
Score protocol complexity | procedures, visits, endpoints, criteria, amendments |
| Enrollment Predictor | EnrollmentPredictor |
Predict monthly enrollment | historical rate, prevalence, competition, capacity, phase |
| Eligibility Analyzer | EligibilityAnalyzer |
Analyze criteria restrictiveness | 29 population impact patterns, justification scoring |
| Competitive Threat | CompetitiveThreatScorer |
Score competitor threat level | phase (0.3), enrollment (0.25), sponsor (0.2), differentiation (0.25) |
| Success Estimator | HistoricalSuccessEstimator |
Phase-specific success probability | 12 therapeutic areas, cumulative POS calculation |
7.2 Evidence Level Scoring¶
| Level | Score | Description |
|---|---|---|
| A1 | 1.00 | Systematic review of RCTs |
| A2 | 0.85 | High-quality RCT |
| B | 0.65 | Non-randomized controlled study |
| C | 0.45 | Observational study |
| D | 0.25 | Case series / case report |
| E | 0.15 | Expert opinion |
8. Query Expansion Reference¶
8.1 Synonym Maps¶
| Map | Key Count | Example Entry |
|---|---|---|
| ENTITY_ALIASES | 140 | "NSCLC" -> "non-small cell lung cancer" |
| THERAPEUTIC_AREA_MAP | 13 | "oncology" -> ["cancer", "tumor", "neoplasm", ...] |
| PHASE_MAP | 7 | "phase 3" -> ["phase III", "pivotal", "confirmatory", ...] |
| DRUG_SYNONYM_MAP | 33 | "pembrolizumab" -> ["Keytruda", "MK-3475", ...] |
| BIOMARKER_MAP | 22 | "PD-L1" -> ["CD274", "TPS", "CPS", "SP263", ...] |
| ENDPOINT_MAP | 15 | "OS" -> ["overall survival", "mortality", ...] |
| REGULATORY_MAP | 19 | "IND" -> ["investigational new drug", ...] |
| DESIGN_MAP | 14 | "adaptive" -> ["interim analysis", "Bayesian adaptive", ...] |
| POPULATION_MAP | 10 | "pediatric" -> ["children", "adolescent", "PREA", ...] |
| SAFETY_MAP | 10+ | "SAE" -> ["serious adverse event", "hospitalization", ...] |
8.2 QueryExpander Pipeline¶
- Resolve entity aliases (NSCLC -> non-small cell lung cancer)
- Detect therapeutic areas from query text
- Expand drug names (brand -> generic, code names)
- Expand biomarker references (PD-L1 -> all assay names)
- Expand endpoint and regulatory terms
- Apply workflow-aware term boosting
9. Configuration Reference¶
Environment Variables¶
All variables use the TRIAL_ prefix:
| Variable | Default | Description |
|---|---|---|
TRIAL_MILVUS_HOST |
localhost | Milvus server hostname |
TRIAL_MILVUS_PORT |
19530 | Milvus server port |
TRIAL_EMBEDDING_MODEL |
BAAI/bge-small-en-v1.5 | Embedding model name |
TRIAL_EMBEDDING_DIMENSION |
384 | Embedding vector dimension |
TRIAL_LLM_MODEL |
claude-sonnet-4-6 | LLM model for synthesis |
TRIAL_ANTHROPIC_API_KEY |
(none) | Anthropic API key |
TRIAL_API_PORT |
8538 | FastAPI server port |
TRIAL_STREAMLIT_PORT |
8128 | Streamlit UI port |
TRIAL_API_KEY |
(empty) | API authentication key |
TRIAL_CORS_ORIGINS |
localhost:8080,8538,8128 | Allowed CORS origins |
TRIAL_TOP_K_PER_COLLECTION |
5 | Results per collection |
TRIAL_SCORE_THRESHOLD |
0.4 | Minimum similarity score |
TRIAL_INGEST_SCHEDULE_HOURS |
24 | Ingest interval |
TRIAL_INGEST_ENABLED |
False | Enable scheduled ingest |
TRIAL_CROSS_AGENT_TIMEOUT |
30 | Cross-agent query timeout (s) |
10. Port Map¶
| Port | Service | Protocol | Notes |
|---|---|---|---|
| 8538 | FastAPI REST API | HTTP | Clinical Trial Intelligence Agent |
| 8128 | Streamlit UI | HTTP | 5-tab clinical trial interface |
| 19530 | Milvus | gRPC | Shared vector store |
| 2379 | etcd | gRPC | Milvus metadata |
| 9000 | MinIO | HTTP | Milvus blob storage |
| 8527 | Oncology Agent | HTTP | Cross-agent integration |
| 8107 | PGx Agent | HTTP | Cross-agent integration |
| 8126 | Cardiology Agent | HTTP | Cross-agent integration |
| 8529 | Biomarker Agent | HTTP | Cross-agent integration |
| 8080 | Landing Page | HTTP | HCLS AI Factory hub |
11. Tech Stack¶
| Layer | Technology | Version/Details |
|---|---|---|
| Compute | NVIDIA DGX Spark | CUDA 12.x |
| LLM | Claude (Anthropic) | claude-sonnet-4-6 |
| Vector DB | Milvus | 2.x with etcd + MinIO |
| Embeddings | BGE-small-en-v1.5 | 384 dimensions, sentence-transformers |
| API Framework | FastAPI | Uvicorn ASGI server |
| UI Framework | Streamlit | NVIDIA dark theme |
| Data Models | Pydantic v2 | BaseModel + BaseSettings |
| Config | pydantic-settings | .env file support |
| Metrics | prometheus_client | Counter, Histogram, Gauge, Info |
| Container | Docker | Multi-stage build |
| Orchestration | docker-compose | DGX Spark stack |
| Testing | pytest | 769 tests, 0.47s |
| Python | 3.10+ | Type hints, dataclasses |
12. Data Models¶
12.1 Enums¶
| Enum | Count | Members |
|---|---|---|
| TrialWorkflowType | 19 | protocol_design, patient_matching, site_selection, eligibility_optimization, eligibility_analysis, endpoint_strategy, adaptive_design, safety_signal, safety_monitoring, regulatory_docs, regulatory_strategy, competitive_intel, competitive_intelligence, biomarker_strategy, rwe_analysis, recruitment_optimization, diversity_assessment, decentralized_planning, general |
| TrialPhase | 7 | phase_i through not_applicable |
| TrialStatus | 7 | recruiting through not_yet_recruiting |
| EvidenceLevel | 6 | a1 through e |
| TherapeuticArea | 13 | oncology through other |
| SeverityLevel | 5 | critical through informational |
| RegulatoryAgency | 6 | fda through mhra |
| DocumentType | 6 | ind through dsur |
| CriterionType | 2 | inclusion, exclusion |
| EndpointType | 4 | primary through safety |
| DCTComponent | 7 | econsent through direct_to_patient |
12.2 Pydantic Models¶
| Model | Fields | Purpose |
|---|---|---|
| TrialQuery | question, workflow_type, patient_context, top_k, include_guidelines | Input query |
| TrialSearchResult | collection, content, score, metadata | Single search result |
| PatientProfile | age, sex, diagnosis, biomarkers, medications, genomic_variants, comorbidities, geographic_location | Patient data |
| MatchScore | criterion_text, criterion_type, met, confidence, evidence | Per-criterion match |
| OverallMatch | trial_id, title, phase, status, inclusion_met/total, exclusion_clear/total, overall_score, confidence | Overall match |
| EligibilityAnalysis | criterion, population_impact, justification_score, competitor_comparison, recommendation | Criterion analysis |
| SiteScore | site_id, facility_name, city, country, enrollment_rate, screen_failure_rate, diversity_index, overall_score | Site evaluation |
| SafetySignal | event_type, severity, frequency, prr, ror, causality_assessment | Safety signal |
| CompetitorProfile | trial_id, sponsor, phase, indication, mechanism, enrollment_progress, estimated_completion, threat_level | Competitor data |
| ProtocolComplexity | procedure_count, visit_count, endpoint_count, eligibility_criteria_count, complexity_score, percentile_rank | Complexity assessment |
| WorkflowResult | workflow_type, findings, recommendations, guideline_references, severity, cross_agent_triggers, confidence | Workflow output |
| TrialResponse | answer, citations, workflow_results, matches, confidence | Top-level response |
13. Cross-Agent Integration¶
| Agent | Endpoint | Purpose | Timeout |
|---|---|---|---|
| Oncology | http://localhost:8527 |
Molecular trial matches | 30s |
| PGx | http://localhost:8107 |
Pharmacogenomic screening | 30s |
| Cardiology | http://localhost:8126 |
Cardiac safety assessment | 30s |
| Biomarker | http://localhost:8529 |
Biomarker enrichment | 30s |
All queries use graceful degradation: unavailable agents return default responses with warnings.
14. Ingest Pipeline Reference¶
14.1 Parsers¶
| Parser | Source | API | Output Collections |
|---|---|---|---|
| ClinicalTrialsParser | ClinicalTrials.gov | XML/JSON REST API | protocols, eligibility, endpoints, sites, investigators |
| PubMedParser | PubMed/MEDLINE | E-utilities API | literature, results |
| RegulatoryParser | FDA/EMA docs | Document parsing | regulatory, guidelines, safety |
14.2 Pipeline Features¶
- Chunking with configurable overlap
- Content hash deduplication
- NCBI API key support for rate limit increase
- Incremental ingest (timestamp-based)
- Error reporting and validation
15. Test Reference¶
15.1 Test Files¶
| File | LOC | Coverage |
|---|---|---|
| test_models.py | 519 | All enums and Pydantic models |
| test_workflow_execution.py | 379 | All 10 workflow execute methods |
| test_clinical_workflows.py | 347 | Preprocess/postprocess logic |
| test_api.py | 310 | All 26 API endpoints |
| test_agent.py | 294 | Agent pipeline stages |
| test_decision_support.py | 281 | All decision support engines |
| test_query_expansion.py | 255 | All synonym maps and expansion |
| test_integration.py | 247 | End-to-end integration |
| test_rag_engine.py | 194 | RAG retrieval and scoring |
| test_collections.py | 152 | Schema validation |
| test_settings.py | 133 | Configuration validation |
| test_knowledge.py | 123 | Knowledge completeness |
15.2 Results¶
- Total: 769 tests
- Passed: 769
- Failed: 0
- Pass rate: 100%
- Execution time: 0.47s
Clinical Trial Intelligence Agent -- HCLS AI Factory -- March 2026
Clinical Decision Support Disclaimer
The Clinical Trial Intelligence Agent is a clinical decision support research tool for clinical trial analysis. It is not FDA-cleared and is not intended as a standalone diagnostic device. All recommendations should be reviewed by qualified healthcare professionals. Apache 2.0 License.