# HCLS AI Factory — Final Comprehensive Audit Report

Pre-Handoff Verification for VAST R&D

Date: February 9, 2026 | Auditor: Claude Opus 4.6 | Repo: ajones1923/hcls-ai-factory
## Executive Summary
| Metric | Result |
|---|---|
| Total Tests Run | 356 |
| Tests Passed | 355 |
| Tests Failed | 1 (pre-existing pynvml mock issue) |
| Critical Issues | 0 |
| High Severity Issues | 1 |
| Medium Severity Issues | 4 |
| Low Severity Issues | ~20 |
| MkDocs Site Build | SUCCESS (0 errors, 0 warnings) |
| Live Site (hcls-ai-factory.org) | LIVE, Stage 0 confirmed |
| VAST Site (dashing-pegasus-fc7708.netlify.app) | LIVE, auth working |
Verdict: APPROVED FOR VAST R&D HANDOFF
## Test Results by Component
| Component | Tests | Passed | Failed | Time |
|---|---|---|---|---|
| Stage 2: RAG/Chat Pipeline | 157 | 157 | 0 | 0.07s |
| Stage 1: Genomics Web Portal | 129 | 128 | 1 | — |
| Stage 3: Drug Discovery Pipeline | 59 | 59 | 0 | 0.06s |
| Landing Page | 11 | 11 | 0 | — |
| TOTAL | 356 | 355 | 1 | — |
The single failing test (`test_get_gpu_utilization_with_nvml`) is a pre-existing mock-patching issue in the genomics web portal: the test patches `server.pynvml`, but the module imports pynvml through a try/except pattern. It is not a functional bug.
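The failure mode can be reproduced in miniature. The sketch below builds a stand-in `server` module on the fly (all names are illustrative, not the portal's actual code) and shows the usual fix: patch the attribute on the module under test, with `create=True` so the patch also works when the optional import never bound the name.

```python
# Self-contained illustration of patching an optionally-imported library.
# The "server" module here is fabricated for the demo; the real portal's
# internals may differ.
import sys
import types
from unittest import mock

# Simulate a module whose `try: import pynvml / except ImportError` left None.
server = types.ModuleType("server")
server.pynvml = None

def get_gpu_utilization():
    # Looks pynvml up through the module namespace, as the real code would.
    if server.pynvml is None:
        return None
    handle = server.pynvml.nvmlDeviceGetHandleByIndex(0)
    return server.pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

server.get_gpu_utilization = get_gpu_utilization
sys.modules["server"] = server

fake = mock.Mock()
fake.nvmlDeviceGetUtilizationRates.return_value = mock.Mock(gpu=42)

# Patch the name where it is *looked up* (server.pynvml), not the pynvml
# package itself; create=True tolerates the attribute being absent or None.
with mock.patch("server.pynvml", fake, create=True):
    assert server.get_gpu_utilization() == 42

# Outside the patch, the optional-import fallback behavior is restored.
assert server.get_gpu_utilization() is None
```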
## Stage 0: Data Acquisition (setup-data.sh)

Status: PRODUCTION-QUALITY
| Check | Result |
|---|---|
| CLI flags (--all, --stage1/2/3, --verify, --status, --dry-run) | All implemented |
| Download URLs (NCBI FTP for FASTQ, Google Storage for AlphaMissense) | Valid format |
| Checksum verification | MD5 with retry |
| Retry/resume logic | aria2c primary, wget fallback, 3 retries, exponential backoff |
| Disk space preflight | Checks available space per stage |
| Tool dependency checks | aria2c, wget, curl, md5sum, pigz |
| Idempotency | State file (.data-setup-state), skip existing files |
| Error handling | Comprehensive with actionable diagnostics |
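The retry/resume strategy in the table (primary tool, fallback tool, 3 retries, exponential backoff) can be sketched as a pattern in Python. The real logic lives in bash inside `setup-data.sh`; the function and fetcher names below are illustrative only.

```python
# Sketch of the "primary + fallback + exponential backoff" download pattern.
import time

def download_with_retry(url, fetchers, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try each fetcher (e.g. aria2c then wget) once per round, up to
    `retries` rounds, doubling the wait between rounds."""
    last_error = None
    for attempt in range(retries):
        for fetch in fetchers:              # primary first, then fallback
            try:
                return fetch(url)
            except OSError as exc:          # failed transfer
                last_error = exc
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"all fetchers failed for {url}") from last_error

# Example: the primary fetcher fails, the fallback succeeds on the same round.
calls = []
def aria2c(url):
    calls.append("aria2c")
    raise OSError("connection reset")
def wget(url):
    calls.append("wget")
    return f"downloaded:{url}"

result = download_with_retry("http://example/file", [aria2c, wget],
                             sleep=lambda s: None)
assert result == "downloaded:http://example/file"
assert calls == ["aria2c", "wget"]
```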
Data Inventory:

- Stage 1: 68 GIAB HG002 FASTQ chunks (~300 GB) + GRCh38 reference (~11 GB)
- Stage 2: ClinVar variant_summary (394 MB) + ClinVar VCF (85 MB) + AlphaMissense (614 MB)
- Stage 3: PDB structure cache (optional)
Documentation Cross-References:
- docs/DATA_SETUP.md accurately describes setup-data.sh capabilities
- README.md, quickstart.md, demo guide all reference Stage 0 correctly
- PRODUCT_DOCUMENTATION.txt includes Stage 0 section
Finding: setup-data.sh saves the ClinVar summary as `clinvar_variant_summary.txt.gz`, while the RAG pipeline README documents the expected path as `variant_summary.txt.gz`. The `ingest_vcf.py` default argument is `clinvar_variant_summary.txt.gz`, which matches the download, so the code paths are correct; only the README naming is inconsistent.
## Stage 1: Genomics Pipeline

Status: FUNCTIONAL WITH DOCUMENTED ISSUES

### Scripts Audited (14 files)
| Script | Purpose | Error Handling |
|---|---|---|
| `run.sh` | Dispatcher for all subcommands | Missing `set -e` at top level |
| `00-setup-check.sh` | Prerequisites verification | `set -e` present |
| `01-ngc-login.sh` | NGC container registry login | `set -e` present |
| `02-download-data.sh` | FASTQ download (primary) | `set -e` present |
| `02-download-data-conservative.sh` | FASTQ with retry (5 attempts) | Full retry logic |
| `02-download-data-verified.sh` | FASTQ with MD5 verification | MD5 checksums |
| `03-setup-reference.sh` | GRCh38 reference genome | Idempotent, skip existing |
| `04-run-chr20-test.sh` | Chromosome 20 test run | `set -o pipefail`, trap, retry |
| `05-run-full-genome.sh` | Full genome pipeline | Resume logic, 3 DeepVariant retries |
### Key Findings
DGX Spark Compatibility:
- nvidia-smi wrapper reports 16 GB GPU memory (actual: 128 GB unified). This is a workaround for Parabricks not recognizing the GB10's memory. Conservative but functional — Parabricks may not fully utilize available memory.
- Resume logic in 05-run-full-genome.sh is excellent: detects existing BAM/VCF and skips completed steps.
- DeepVariant retry (3 attempts, 30s waits, GPU health checks) is robust.
Web Portal (Flask):
- 129 tests, 128 passing
- Security: CSRF tokens (constant-time comparison), rate limiting, path traversal protection
- Estimated GPU metrics (IOPS, bandwidth, SM efficiency) presented as real measurements — could mislead VAST R&D
- CDN dependencies (Bootstrap, Chart.js from jsdelivr) — will fail on air-gapped systems
- Thread safety: pipeline_state dict accessed without locks (Python GIL mitigates but not ideal)
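Two of the portal patterns noted above can be sketched for illustration (the portal's own helper names may differ): constant-time CSRF token comparison, and the explicit lock that would make `pipeline_state` updates thread-safe rather than GIL-dependent.

```python
import hmac
import secrets
import threading

csrf_token = secrets.token_hex(32)

def csrf_valid(submitted: str) -> bool:
    # hmac.compare_digest takes time independent of where the strings
    # first differ, defeating timing side-channels on token checks.
    return hmac.compare_digest(csrf_token, submitted)

assert csrf_valid(csrf_token)
assert not csrf_valid("0" * 64)

# Guarding shared mutable state: the GIL makes single dict operations
# atomic, but not read-modify-write sequences like `x["k"] += n`.
pipeline_state = {"status": "idle", "progress": 0}
state_lock = threading.Lock()

def advance(step: int) -> None:
    with state_lock:  # makes the read-modify-write atomic
        pipeline_state["progress"] += step

advance(10)
assert pipeline_state["progress"] == 10
```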
VCF Output Compatibility:
- Standard VCFv4.2 format, bgzip compressed, tabix indexed
- Naming convention (HG002.genome.vcf.gz) matches what Stage 2 expects
## Stage 2: RAG/Chat Pipeline

Status: APPROVED FOR HANDOFF — 157/157 TESTS PASS

### Architecture
```
VCF → vcf_parser.py → annotator.py (ClinVar + AlphaMissense) → embedder.py (BGE-small-en-v1.5, 384-dim)
    → milvus_client.py (IVF_FLAT, COSINE) → rag_engine.py (10 therapeutic areas) → llm_client.py (4 providers)
    → chat_ui.py (Streamlit) → target_hypothesis.py → Phase 5 export
```
### Module-by-Module
| Module | Lines | Key Features | Status |
|---|---|---|---|
| `vcf_parser.py` | 331 | cyvcf2 + fallback parser, multi-allelic splitting, long allele truncation | PASS |
| `annotator.py` | 610 | ClinVar (4.1M), AlphaMissense (71M), VEP | PASS |
| `embedder.py` | 200 | BGE-small-en-v1.5, normalize=True, disk cache | PASS |
| `milvus_client.py` | 409 | 17-field schema, IVF_FLAT, injection-safe sanitization | PASS |
| `llm_client.py` | 348 | 4 providers (Anthropic, OpenAI, Ollama, vLLM), factory pattern | PASS |
| `rag_engine.py` | 622 | 10 therapeutic area query expansion, Clinker knowledge integration | PASS |
| `knowledge.py` | 2,684 | 201 genes, 171 druggable, 13 therapeutic areas | PASS |
| `target_hypothesis.py` | 253 | CRUD, JSON persistence, Phase 5 export | PASS |
| `chat_ui.py` | 1,774 | 6 model options, streaming, evidence panels, file manager | PASS |
### Security
- Milvus filter injection: Prevented by regex sanitization on gene and chromosome inputs
- 7 injection payloads tested and rejected for each sanitizer
- API keys sourced from environment variables, never hardcoded
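The allow-list sanitization approach described above can be sketched as follows. The regex patterns and function names are illustrative assumptions, not the pipeline's actual code: inputs are validated against a strict allow-list before being interpolated into a Milvus filter expression, and injection-style payloads are rejected rather than escaped.

```python
import re

# Illustrative allow-lists: gene symbols and chromosome names only.
GENE_RE = re.compile(r"^[A-Za-z0-9_.-]{1,30}$")
CHROM_RE = re.compile(r"^(chr)?([0-9]{1,2}|[XYM]|MT)$")

def safe_gene(value: str) -> str:
    if not GENE_RE.fullmatch(value):
        raise ValueError(f"rejected gene filter input: {value!r}")
    return value

def build_filter(gene: str, chrom: str) -> str:
    if not CHROM_RE.fullmatch(chrom):
        raise ValueError(f"rejected chromosome filter input: {chrom!r}")
    # Safe to interpolate: both values passed the allow-list check.
    return f'gene == "{safe_gene(gene)}" && chrom == "{chrom}"'

assert build_filter("VCP", "chr9") == 'gene == "VCP" && chrom == "chr9"'

# Injection-style payloads never reach the filter string.
for payload in ['VCP" || gene != "', "x; drop", "a b", ""]:
    try:
        safe_gene(payload)
        raise AssertionError("payload should have been rejected")
    except ValueError:
        pass
```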
### Knowledge Base Statistics
- 201 genes across 13 therapeutic areas
- 171 druggable targets (85.1%)
- 73 genes with reference SMILES for drug discovery handoff
- 10 query expansion dictionaries (126+ keywords)
## Stage 3: Drug Discovery Pipeline

Status: APPROVED FOR HANDOFF — 59/59 TESTS PASS

### 10-Stage Pipeline
| Stage | Name | Implementation |
|---|---|---|
| 0 | Initialize | Config validation, output directory creation |
| 1 | Normalize Target | Target import from RAG pipeline |
| 2 | Structure Discovery | RCSB PDB query, resolution-based ranking |
| 3 | Structure Prep | Best structure selection (5FTK for VCP) |
| 4 | Molecule Generation | MolMIM NIM (real) or RDKit mock fallback |
| 5 | Chemistry QC | Lipinski Rule of Five, SMILES validation |
| 6 | Conformers | 3D conformer generation (RDKit) |
| 7 | Docking | DiffDock NIM (real) or hash-seeded mock |
| 8 | Ranking | Composite: 30% gen + 40% dock + 30% QED |
| 9 | Reporting | PDF via ReportLab, JSON export, SDF export |
### Scoring Formula (Verified)
```
composite = 0.3 * generation_score + 0.4 * dock_normalized + 0.3 * qed_score
dock_normalized = max(0, min(1, (10 + dock_score) / 20))
```

Mathematically verified against output data:

- Candidate #1: gen=1.0, dock=-8.62, qed=0.387 → composite=0.4437 ✓
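The formula reproduces in a few lines of Python and checks out against the audited Candidate #1 values:

```python
def dock_normalized(dock_score: float) -> float:
    # Maps raw docking scores (more negative = better) into [0, 1].
    return max(0.0, min(1.0, (10 + dock_score) / 20))

def composite(gen: float, dock: float, qed: float) -> float:
    return 0.3 * gen + 0.4 * dock_normalized(dock) + 0.3 * qed

# Candidate #1: gen=1.0, dock=-8.62, qed=0.387
# dock_normalized = (10 - 8.62) / 20 = 0.069
# composite = 0.3 + 0.0276 + 0.1161 = 0.4437
assert round(composite(1.0, -8.62, 0.387), 4) == 0.4437
```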
### Mock Fallback
NIM_ALLOW_MOCK_FALLBACK=true enables full pipeline execution without real BioNeMo NIM containers:
- MockMolMIMClient: RDKit-based analogues + 9 pre-designed VCP inhibitors
- MockDiffDockClient: Hash-seeded reproducible docking scores centered at -8.0
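The hash-seeded idea can be sketched as below: the mock score is a deterministic function of the SMILES string, centered at -8.0, so reruns are reproducible without a DiffDock container. The function name and the spread value are illustrative, not the pipeline's actual implementation.

```python
import hashlib
import random

def mock_dock_score(smiles: str, center: float = -8.0, spread: float = 1.5) -> float:
    # Same SMILES -> same seed -> same score, across runs and machines
    # (sha256 is stable, unlike Python's salted hash()).
    seed = int.from_bytes(hashlib.sha256(smiles.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return round(center + rng.uniform(-spread, spread), 2)

score = mock_dock_score("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
assert score == mock_dock_score("CC(=O)Oc1ccccc1C(=O)O")  # reproducible
assert -9.5 <= score <= -6.5                              # stays near the center
```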
### Minor Issues (0 critical, 8 minor)

- `max_retries` defined but no retry logic in NIM HTTP calls
- Dual `GeneratedMolecule` classes (dataclass in UI, Pydantic in pipeline)
- Pydantic v1 `.dict()` used instead of v2 `.model_dump()`
- `services` CLI command may crash if NIMs unavailable
- Docking stage passes PDB ID string, not file content (works in mock mode)
- No weight-sum validation in PipelineConfig
- Morgan fingerprint computed but unused in molecule_generator.py
- Nextflow script misplaced in monitoring/ directory
## Orchestrator & Infrastructure

### Nextflow DSL2 (hls-orchestrator/)
| Mode | Status |
|---|---|
| `full` | Chains Stage 1 → 2 → 3 correctly |
| `demo` | Works (VCP demo data) |
| `target` | Partial (genomics skipped) |
| `drug` | Partial (genomics + RAG skipped) |
Profiles: standard, docker, singularity, dgx_spark, slurm, test
Note: Nextflow modules use simplified/mock implementations (BWA-MEM + GATK instead of Parabricks, mock molecule generation instead of NIM calls). Real pipeline execution uses the individual stage scripts/code.
`run_pipeline.py` (Python alternative): only demo mode is implemented; other modes print a stub message.
### Landing Page (landing-page/)
- 11/11 tests passing
- Monitors 10 services in parallel with 2s timeout each
- Dynamic host IP detection
- Report freshness checking
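The parallel polling with a 2-second cap per check can be sketched in Python with a thread pool. The service names and ports below are drawn from this report's launcher table but the monitor's real endpoints and code may differ.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Illustrative subset of the monitored services.
SERVICES = {
    "landing": "http://localhost:8080",
    "chat": "http://localhost:8501",
    "portal": "http://localhost:8510",
}

def probe(name_url):
    name, url = name_url
    try:
        # timeout=2 caps each check at ~2 seconds, matching the monitor.
        with urllib.request.urlopen(url, timeout=2) as resp:
            return name, resp.status == 200
    except OSError:  # refused, unreachable, or timed out
        return name, False

def check_all(services):
    # One worker per service, so the slowest check bounds total time.
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        return dict(pool.map(probe, services.items()))

status = check_all(SERVICES)
assert set(status) == set(SERVICES)  # every service gets a verdict
assert all(isinstance(up, bool) for up in status.values())
```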
### Service Launchers
| Script | Services Started | Status |
|---|---|---|
| `start-services.sh` | Milvus, Landing (8080), Chat (8501), Drug Discovery (8505), Portal (8510) | PASS — uses `$SCRIPT_DIR` |
| `demo.sh` | Above + Genomics (5000), RAG API (5001) | PASS — independent |
| `health-monitor.sh` | All 11 services with auto-recovery, cron support | PASS |
### Docker Compose Files (4 total)
| Location | Services | GPU |
|---|---|---|
| `rag-chat-pipeline/` | Milvus v2.4.17 (ARM64), VEP | No |
| `drug-discovery-pipeline/` | MolMIM, DiffDock, Pipeline UI | 2 GPUs (shared) |
| `genomics-pipeline/web-portal/` | Flask portal | No |
| `drug-discovery-pipeline/monitoring/` | Prometheus, Grafana, DCGM, Node Exporter | No |
## Documentation & MkDocs Site

### Site Build
- Build result: SUCCESS (0 errors)
- Live site: hcls-ai-factory.org — 200 OK, Stage 0 content confirmed
- Pages: 20+ pages in sitemap including home, architecture, quickstart, all 3 stages, data setup, deployment guide, demo guide, white paper, project bible, learning guides
### Content Consistency
| Claim | Consistent Across Docs? |
|---|---|
| DGX Spark $3,999 | Yes (15+ references) |
| Under 5 hours end-to-end | Yes |
| 11.7M variants | Yes |
| 3.56M pass quality filter | Yes |
| Parabricks 4.6 | Yes (10+ references) |
| `claude-sonnet-4-20250514` | Yes (15+ references) |
| Stage 0 data acquisition | Yes (added to all relevant pages) |
### Community Files
| File | Status |
|---|---|
| `README.md` | Comprehensive, accurate |
| `CONTRIBUTING.md` | Fork guidance, code standards, PR process |
| `CODE_OF_CONDUCT.md` | Healthcare/life sciences appropriate |
| `SECURITY.md` | Contact info, HIPAA/GDPR scope |
| `LICENSE` | Apache 2.0 |
| `.env.example` | All variables documented |
## CI/CD

- GitHub Actions: lint (ruff) + test (4 services) + docs (mkdocs build)
- Dependabot: pip weekly for 4 directories + GitHub Actions monthly
- Secret scanner: `scripts/check-secrets.sh` — no secrets detected
## Git Hygiene

| Check | Result |
|---|---|
| Tracked `.pyc`/`__pycache__` | None |
| Tracked large data files | None |
| Tracked `site/` build output | None |
| Tracked `results/` | None (cleaned) |
| Tracked `.env` secrets | None |
| `start-services.sh` TRANSFER_DIR | Fixed — uses `$SCRIPT_DIR` |
| Total tracked files | 276 |
| `.gitignore` coverage | Comprehensive (159 lines) |
## Issues Summary

### High Severity (1)

| ID | Component | Description |
|---|---|---|
| H-1 | `landing-page/start-all.sh` | Drug Discovery Portal start command points to the wrong directory. Mitigated: the primary launcher `start-services.sh` handles this correctly; this is a secondary/legacy script. |
### Medium Severity (4)

| ID | Component | Description |
|---|---|---|
| M-1 | `hls-orchestrator/main.nf` | `genomics_only` mode referenced but not implemented — would cause a runtime failure |
| M-2 | `hls-orchestrator/main.nf` | `ch_targets` type mismatch in demo/drug modes when passed to GENERATE_REPORT |
| M-3 | `hls-orchestrator/portal/app.py` | DCGM metrics URL hardcoded to localhost (should use SERVICE_HOST) |
| M-4 | `hls-orchestrator/portal/app.py` | Sidebar service status checks hardcoded to localhost |
### Low Severity (~20)
Key items:
- Genomics run.sh missing set -e at top level
- Primary download script suppresses aria2c errors with || true
- Primary download script lacks checksum verification (conservative/verified variants have it)
- nvidia-smi wrapper reports 16 GB instead of 128 GB
- Web portal CDN dependencies (Bootstrap, Chart.js) fail on air-gapped systems
- Web portal estimated GPU metrics presented as real measurements
- Nextflow run_pipeline.py only implements demo mode
- Docker Compose version: '3.8' deprecated (cosmetic)
- 1 failing test (pynvml mock patching)
## Recommendations for VAST R&D

### Before Forking

- Fork from: github.com/ajones1923/hcls-ai-factory
- Run: `cp .env.example .env` and fill in API keys (NGC, Anthropic)
- Quick test: `./setup-data.sh --stage2` (2 GB, 5 min), then `./demo.sh`
- Full test: `./setup-data.sh --all` (~500 GB, 2-6 hours), then the full pipeline
### VAST-Specific Migration Path
The private deployment guide at dashing-pegasus-fc7708.netlify.app provides:
- 31-section blueprint mapping to all 6 VAST AI OS components
- 46-file migration checklist (Appendix D)
- PyArrow schemas for 15 VAST DataBase tables
- DataEngine trigger chain replacing Nextflow orchestration
- InsightEngine RAG pipeline configuration
- AgentEngine ReAct agent with 5 tools
### What Works Out of the Box

- `setup-data.sh` — production-quality data acquisition
- All 3 pipeline stages with working code
- Docker Compose for all services
- Mock fallback for demo without real NIM containers
- 356 tests (355 passing)
- MkDocs documentation site
HCLS AI Factory — Apache 2.0 | February 2026 | Audit performed by Claude Opus 4.6