# HCLS AI Factory — Final Comprehensive Audit Report

Pre-Handoff Verification for VAST R&D

Date: February 9, 2026 | Auditor: Claude Opus 4.6 | Repo: ajones1923/hcls-ai-factory
## Executive Summary
| Metric | Result |
|---|---|
| Total Tests Run | 356 |
| Tests Passed | 355 |
| Tests Failed | 1 (pre-existing pynvml mock issue) |
| Critical Issues | 0 |
| High Severity Issues | 1 |
| Medium Severity Issues | 4 |
| Low Severity Issues | ~20 |
| MkDocs Site Build | SUCCESS (0 errors, 0 warnings) |
| Live Site (hcls-ai-factory.org) | LIVE, Stage 0 confirmed |
| VAST Site (dashing-pegasus-fc7708.netlify.app) | LIVE, auth working |
Verdict: APPROVED FOR VAST R&D HANDOFF
## Test Results by Component
| Component | Tests | Passed | Failed | Time |
|---|---|---|---|---|
| Stage 2: RAG/Chat Pipeline | 157 | 157 | 0 | 0.07s |
| Stage 1: Genomics Web Portal | 129 | 128 | 1 | — |
| Stage 3: Drug Discovery Pipeline | 59 | 59 | 0 | 0.06s |
| Landing Page | 11 | 11 | 0 | — |
| TOTAL | 356 | 355 | 1 | — |
The single failing test (`test_get_gpu_utilization_with_nvml`) is a pre-existing mock-patching issue in the genomics web portal: the test patches `server.pynvml`, but the module imports pynvml through a try/except pattern. It is not a functional bug.
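The failure mode can be reproduced in miniature. The sketch below builds a stand-in `server` module on the fly (all names are illustrative, not the portal's actual code) and shows the usual fix: patch the attribute on the module under test, with `create=True` so the patch also works when the optional import never bound the name.

```python
# Self-contained illustration of patching an optionally-imported library.
# The "server" module here is fabricated for the demo; the real portal's
# internals may differ.
import sys
import types
from unittest import mock

# Simulate a module whose `try: import pynvml / except ImportError` left None.
server = types.ModuleType("server")
server.pynvml = None

def get_gpu_utilization():
    # Looks pynvml up through the module namespace, as the real code would.
    if server.pynvml is None:
        return None
    handle = server.pynvml.nvmlDeviceGetHandleByIndex(0)
    return server.pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

server.get_gpu_utilization = get_gpu_utilization
sys.modules["server"] = server

fake = mock.Mock()
fake.nvmlDeviceGetUtilizationRates.return_value = mock.Mock(gpu=42)

# Patch the name where it is *looked up* (server.pynvml), not the pynvml
# package itself; create=True tolerates the attribute being absent or None.
with mock.patch("server.pynvml", fake, create=True):
    assert server.get_gpu_utilization() == 42

# Outside the patch, the optional-import fallback behavior is restored.
assert server.get_gpu_utilization() is None
```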
## Stage 0: Data Acquisition (setup-data.sh)

Status: PRODUCTION-QUALITY
| Check | Result |
|---|---|
| CLI flags (--all, --stage1/2/3, --verify, --status, --dry-run) | All implemented |
| Download URLs (NCBI FTP for FASTQ, Google Storage for AlphaMissense) | Valid format |
| Checksum verification | MD5 with retry |
| Retry/resume logic | aria2c primary, wget fallback, 3 retries, exponential backoff |
| Disk space preflight | Checks available space per stage |
| Tool dependency checks | aria2c, wget, curl, md5sum, pigz |
| Idempotency | State file (.data-setup-state), skip existing files |
| Error handling | Comprehensive with actionable diagnostics |
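The retry/resume strategy in the table (primary tool, fallback tool, 3 retries, exponential backoff) can be sketched as a pattern in Python. The real logic lives in bash inside `setup-data.sh`; the function and fetcher names below are illustrative only.

```python
# Sketch of the "primary + fallback + exponential backoff" download pattern.
import time

def download_with_retry(url, fetchers, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try each fetcher (e.g. aria2c then wget) once per round, up to
    `retries` rounds, doubling the wait between rounds."""
    last_error = None
    for attempt in range(retries):
        for fetch in fetchers:              # primary first, then fallback
            try:
                return fetch(url)
            except OSError as exc:          # failed transfer
                last_error = exc
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"all fetchers failed for {url}") from last_error

# Example: the primary fetcher fails, the fallback succeeds on the same round.
calls = []
def aria2c(url):
    calls.append("aria2c")
    raise OSError("connection reset")
def wget(url):
    calls.append("wget")
    return f"downloaded:{url}"

result = download_with_retry("http://example/file", [aria2c, wget],
                             sleep=lambda s: None)
assert result == "downloaded:http://example/file"
assert calls == ["aria2c", "wget"]
```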
Data Inventory:

- Stage 1: 68 GIAB HG002 FASTQ chunks (~300 GB) + GRCh38 reference (~11 GB)
- Stage 2: ClinVar variant_summary (394 MB) + ClinVar VCF (85 MB) + AlphaMissense (614 MB)
- Stage 3: PDB structure cache (optional)
Documentation Cross-References:
- docs/DATA_SETUP.md accurately describes setup-data.sh capabilities
- README.md, quickstart.md, demo guide all reference Stage 0 correctly
- PRODUCT_DOCUMENTATION.txt includes Stage 0 section
Finding: setup-data.sh saves the ClinVar summary as `clinvar_variant_summary.txt.gz`, while the RAG pipeline README documents the expected path as `variant_summary.txt.gz`. The `ingest_vcf.py` default argument is `clinvar_variant_summary.txt.gz`, which matches the download, so the code paths are correct; only the README naming is inconsistent.
## Stage 1: Genomics Pipeline

Status: FUNCTIONAL WITH DOCUMENTED ISSUES

### Scripts Audited (14 files)
| Script | Purpose | Error Handling |
|---|---|---|
| `run.sh` | Dispatcher for all subcommands | Missing `set -e` at top level |
| `00-setup-check.sh` | Prerequisites verification | `set -e` present |
| `01-ngc-login.sh` | NGC container registry login | `set -e` present |
| `02-download-data.sh` | FASTQ download (primary) | `set -e` present |
| `02-download-data-conservative.sh` | FASTQ with retry (5 attempts) | Full retry logic |
| `02-download-data-verified.sh` | FASTQ with MD5 verification | MD5 checksums |
| `03-setup-reference.sh` | GRCh38 reference genome | Idempotent, skip existing |
| `04-run-chr20-test.sh` | Chromosome 20 test run | `set -o pipefail`, trap, retry |
| `05-run-full-genome.sh` | Full genome pipeline | Resume logic, 3 DeepVariant retries |
### Key Findings
DGX Spark Compatibility:
- nvidia-smi wrapper reports 16 GB GPU memory (actual: 128 GB unified). This is a workaround for Parabricks not recognizing the GB10's memory. Conservative but functional — Parabricks may not fully utilize available memory.
- Resume logic in 05-run-full-genome.sh is excellent: detects existing BAM/VCF and skips completed steps.
- DeepVariant retry (3 attempts, 30s waits, GPU health checks) is robust.
Web Portal (Flask):
- 129 tests, 128 passing
- Security: CSRF tokens (constant-time comparison), rate limiting, path traversal protection
- Estimated GPU metrics (IOPS, bandwidth, SM efficiency) presented as real measurements — could mislead VAST R&D
- CDN dependencies (Bootstrap, Chart.js from jsdelivr) — will fail on air-gapped systems
- Thread safety: pipeline_state dict accessed without locks (Python GIL mitigates but not ideal)
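Two of the portal patterns noted above can be sketched for illustration (the portal's own helper names may differ): constant-time CSRF token comparison, and the explicit lock that would make `pipeline_state` updates thread-safe rather than GIL-dependent.

```python
import hmac
import secrets
import threading

csrf_token = secrets.token_hex(32)

def csrf_valid(submitted: str) -> bool:
    # hmac.compare_digest takes time independent of where the strings
    # first differ, defeating timing side-channels on token checks.
    return hmac.compare_digest(csrf_token, submitted)

assert csrf_valid(csrf_token)
assert not csrf_valid("0" * 64)

# Guarding shared mutable state: the GIL makes single dict operations
# atomic, but not read-modify-write sequences like `x["k"] += n`.
pipeline_state = {"status": "idle", "progress": 0}
state_lock = threading.Lock()

def advance(step: int) -> None:
    with state_lock:  # makes the read-modify-write atomic
        pipeline_state["progress"] += step

advance(10)
assert pipeline_state["progress"] == 10
```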
VCF Output Compatibility:
- Standard VCFv4.2 format, bgzip compressed, tabix indexed
- Naming convention (HG002.genome.vcf.gz) matches what Stage 2 expects
## Stage 2: RAG/Chat Pipeline

Status: APPROVED FOR HANDOFF — 157/157 TESTS PASS

### Architecture
```
VCF → vcf_parser.py → annotator.py (ClinVar + AlphaMissense) → embedder.py (BGE-small-en-v1.5, 384-dim)
    → milvus_client.py (IVF_FLAT, COSINE) → rag_engine.py (10 therapeutic areas) → llm_client.py (4 providers)
    → chat_ui.py (Streamlit) → target_hypothesis.py → Phase 5 export
```
### Module-by-Module
| Module | Lines | Key Features | Status |
|---|---|---|---|
| `vcf_parser.py` | 331 | cyvcf2 + fallback parser, multi-allelic splitting, long allele truncation | PASS |
| `annotator.py` | 610 | ClinVar (4.1M), AlphaMissense (71M), VEP | PASS |
| `embedder.py` | 200 | BGE-small-en-v1.5, normalize=True, disk cache | PASS |
| `milvus_client.py` | 409 | 17-field schema, IVF_FLAT, injection-safe sanitization | PASS |
| `llm_client.py` | 348 | 4 providers (Anthropic, OpenAI, Ollama, vLLM), factory pattern | PASS |
| `rag_engine.py` | 622 | 10 therapeutic area query expansion, Clinker knowledge integration | PASS |
| `knowledge.py` | 2,684 | 201 genes, 171 druggable, 13 therapeutic areas | PASS |
| `target_hypothesis.py` | 253 | CRUD, JSON persistence, Phase 5 export | PASS |
| `chat_ui.py` | 1,774 | 6 model options, streaming, evidence panels, file manager | PASS |
### Security
- Milvus filter injection: Prevented by regex sanitization on gene and chromosome inputs
- 7 injection payloads tested and rejected for each sanitizer
- API keys sourced from environment variables, never hardcoded
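The allow-list sanitization approach described above can be sketched as follows. The regex patterns and function names are illustrative assumptions, not the pipeline's actual code: inputs are validated against a strict allow-list before being interpolated into a Milvus filter expression, and injection-style payloads are rejected rather than escaped.

```python
import re

# Illustrative allow-lists: gene symbols and chromosome names only.
GENE_RE = re.compile(r"^[A-Za-z0-9_.-]{1,30}$")
CHROM_RE = re.compile(r"^(chr)?([0-9]{1,2}|[XYM]|MT)$")

def safe_gene(value: str) -> str:
    if not GENE_RE.fullmatch(value):
        raise ValueError(f"rejected gene filter input: {value!r}")
    return value

def build_filter(gene: str, chrom: str) -> str:
    if not CHROM_RE.fullmatch(chrom):
        raise ValueError(f"rejected chromosome filter input: {chrom!r}")
    # Safe to interpolate: both values passed the allow-list check.
    return f'gene == "{safe_gene(gene)}" && chrom == "{chrom}"'

assert build_filter("VCP", "chr9") == 'gene == "VCP" && chrom == "chr9"'

# Injection-style payloads never reach the filter string.
for payload in ['VCP" || gene != "', "x; drop", "a b", ""]:
    try:
        safe_gene(payload)
        raise AssertionError("payload should have been rejected")
    except ValueError:
        pass
```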
### Knowledge Base Statistics
- 201 genes across 13 therapeutic areas
- 171 druggable targets (85.1%)
- 73 genes with reference SMILES for drug discovery handoff
- 10 query expansion dictionaries (126+ keywords)
## Stage 3: Drug Discovery Pipeline

Status: APPROVED FOR HANDOFF — 59/59 TESTS PASS

### 10-Stage Pipeline
| Stage | Name | Implementation |
|---|---|---|
| 0 | Initialize | Config validation, output directory creation |
| 1 | Normalize Target | Target import from RAG pipeline |
| 2 | Structure Discovery | RCSB PDB query, resolution-based ranking |
| 3 | Structure Prep | Best structure selection (5FTK for VCP) |
| 4 | Molecule Generation | MolMIM NIM (real) or RDKit mock fallback |
| 5 | Chemistry QC | Lipinski Rule of Five, SMILES validation |
| 6 | Conformers | 3D conformer generation (RDKit) |
| 7 | Docking | DiffDock NIM (real) or hash-seeded mock |
| 8 | Ranking | Composite: 30% gen + 40% dock + 30% QED |
| 9 | Reporting | PDF via ReportLab, JSON export, SDF export |
### Scoring Formula (Verified)
```
composite = 0.3 * generation_score + 0.4 * dock_normalized + 0.3 * qed_score
dock_normalized = max(0, min(1, (10 + dock_score) / 20))
```

Mathematically verified against output data:

- Candidate #1: gen=1.0, dock=-8.62, qed=0.387 → composite=0.4437 ✓
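The formula reproduces in a few lines of Python and checks out against the audited Candidate #1 values:

```python
def dock_normalized(dock_score: float) -> float:
    # Maps raw docking scores (more negative = better) into [0, 1].
    return max(0.0, min(1.0, (10 + dock_score) / 20))

def composite(gen: float, dock: float, qed: float) -> float:
    return 0.3 * gen + 0.4 * dock_normalized(dock) + 0.3 * qed

# Candidate #1: gen=1.0, dock=-8.62, qed=0.387
# dock_normalized = (10 - 8.62) / 20 = 0.069
# composite = 0.3 + 0.0276 + 0.1161 = 0.4437
assert round(composite(1.0, -8.62, 0.387), 4) == 0.4437
```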
### Mock Fallback
NIM_ALLOW_MOCK_FALLBACK=true enables full pipeline execution without real BioNeMo NIM containers:
- MockMolMIMClient: RDKit-based analogues + 9 pre-designed VCP inhibitors
- MockDiffDockClient: Hash-seeded reproducible docking scores centered at -8.0
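The hash-seeded idea can be sketched as below: the mock score is a deterministic function of the SMILES string, centered at -8.0, so reruns are reproducible without a DiffDock container. The function name and the spread value are illustrative, not the pipeline's actual implementation.

```python
import hashlib
import random

def mock_dock_score(smiles: str, center: float = -8.0, spread: float = 1.5) -> float:
    # Same SMILES -> same seed -> same score, across runs and machines
    # (sha256 is stable, unlike Python's salted hash()).
    seed = int.from_bytes(hashlib.sha256(smiles.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return round(center + rng.uniform(-spread, spread), 2)

score = mock_dock_score("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
assert score == mock_dock_score("CC(=O)Oc1ccccc1C(=O)O")  # reproducible
assert -9.5 <= score <= -6.5                              # stays near the center
```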
### Minor Issues (0 critical, 8 minor)

- `max_retries` defined but no retry logic in NIM HTTP calls
- Dual `GeneratedMolecule` classes (dataclass in UI, Pydantic in pipeline)
- Pydantic v1 `.dict()` used instead of v2 `.model_dump()`
- `services` CLI command may crash if NIMs unavailable
- Docking stage passes PDB ID string, not file content (works in mock mode)
- No weight-sum validation in PipelineConfig
- Morgan fingerprint computed but unused in molecule_generator.py
- Nextflow script misplaced in monitoring/ directory
## Orchestrator & Infrastructure

### Nextflow DSL2 (hls-orchestrator/)
| Mode | Status |
|---|---|
| `full` | Chains Stage 1 → 2 → 3 correctly |
| `demo` | Works (VCP demo data) |
| `target` | Partial (genomics skipped) |
| `drug` | Partial (genomics + RAG skipped) |
Profiles: standard, docker, singularity, dgx_spark, slurm, test
Note: Nextflow modules use simplified/mock implementations (BWA-MEM + GATK instead of Parabricks, mock molecule generation instead of NIM calls). Real pipeline execution uses the individual stage scripts/code.
`run_pipeline.py` (Python alternative): only demo mode is implemented; other modes print a stub message.
### Landing Page (landing-page/)
- 11/11 tests passing
- Monitors 10 services in parallel with 2s timeout each
- Dynamic host IP detection
- Report freshness checking
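The parallel polling with a 2-second cap per check can be sketched in Python with a thread pool. The service names and ports below are drawn from this report's launcher table but the monitor's real endpoints and code may differ.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Illustrative subset of the monitored services.
SERVICES = {
    "landing": "http://localhost:8080",
    "chat": "http://localhost:8501",
    "portal": "http://localhost:8510",
}

def probe(name_url):
    name, url = name_url
    try:
        # timeout=2 caps each check at ~2 seconds, matching the monitor.
        with urllib.request.urlopen(url, timeout=2) as resp:
            return name, resp.status == 200
    except OSError:  # refused, unreachable, or timed out
        return name, False

def check_all(services):
    # One worker per service, so the slowest check bounds total time.
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        return dict(pool.map(probe, services.items()))

status = check_all(SERVICES)
assert set(status) == set(SERVICES)  # every service gets a verdict
assert all(isinstance(up, bool) for up in status.values())
```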
### Service Launchers
| Script | Services Started | Status |
|---|---|---|
| `start-services.sh` | Milvus, Landing (8080), Chat (8501), Drug Discovery (8505), Portal (8510) | PASS — uses `$SCRIPT_DIR` |
| `demo.sh` | Above + Genomics (5000), RAG API (5001) | PASS — independent |
| `health-monitor.sh` | All 11 services with auto-recovery, cron support | PASS |
### Docker Compose Files (4 total)
| Location | Services | GPU |
|---|---|---|
| `rag-chat-pipeline/` | Milvus v2.4.17 (ARM64), VEP | No |
| `drug-discovery-pipeline/` | MolMIM, DiffDock, Pipeline UI | 2 GPUs (shared) |
| `genomics-pipeline/web-portal/` | Flask portal | No |
| `drug-discovery-pipeline/monitoring/` | Prometheus, Grafana, DCGM, Node Exporter | No |
## Documentation & MkDocs Site

### Site Build
- Build result: SUCCESS (0 errors)
- Live site: hcls-ai-factory.org — 200 OK, Stage 0 content confirmed
- Pages: 20+ pages in sitemap including home, architecture, quickstart, all 3 stages, data setup, deployment guide, demo guide, white paper, project bible, learning guides
### Content Consistency
| Claim | Consistent Across Docs? |
|---|---|
| DGX Spark $3,999 | Yes (15+ references) |
| Under 5 hours end-to-end | Yes |
| 11.7M variants | Yes |
| 3.56M pass quality filter | Yes |
| Parabricks 4.6 | Yes (10+ references) |
| `claude-sonnet-4-20250514` | Yes (15+ references) |
| Stage 0 data acquisition | Yes (added to all relevant pages) |
### Community Files
| File | Status |
|---|---|
| `README.md` | Comprehensive, accurate |
| `CONTRIBUTING.md` | Fork guidance, code standards, PR process |
| `CODE_OF_CONDUCT.md` | Healthcare/life sciences appropriate |
| `SECURITY.md` | Contact info, HIPAA/GDPR scope |
| `LICENSE` | Apache 2.0 |
| `.env.example` | All variables documented |
## CI/CD

- GitHub Actions: lint (ruff) + test (4 services) + docs (mkdocs build)
- Dependabot: pip weekly for 4 directories + GitHub Actions monthly
- Secret scanner: `scripts/check-secrets.sh` — no secrets detected
## Git Hygiene

| Check | Result |
|---|---|
| Tracked `.pyc`/`__pycache__` | None |
| Tracked large data files | None |
| Tracked `site/` build output | None |
| Tracked `results/` | None (cleaned) |
| Tracked `.env` secrets | None |
| `start-services.sh` TRANSFER_DIR | Fixed — uses `$SCRIPT_DIR` |
| Total tracked files | 276 |
| `.gitignore` coverage | Comprehensive (159 lines) |
## Issues Summary

### High Severity (1)

| ID | Component | Description |
|---|---|---|
| H-1 | `landing-page/start-all.sh` | Drug Discovery Portal start command points to the wrong directory. Mitigated: the primary launcher `start-services.sh` handles this correctly; this is a secondary/legacy script. |
### Medium Severity (4)

| ID | Component | Description |
|---|---|---|
| M-1 | `hls-orchestrator/main.nf` | `genomics_only` mode referenced but not implemented — would cause a runtime failure |
| M-2 | `hls-orchestrator/main.nf` | `ch_targets` type mismatch in demo/drug modes when passed to GENERATE_REPORT |
| M-3 | `hls-orchestrator/portal/app.py` | DCGM metrics URL hardcoded to localhost (should use SERVICE_HOST) |
| M-4 | `hls-orchestrator/portal/app.py` | Sidebar service status checks hardcoded to localhost |
### Low Severity (~20)
Key items:
- Genomics run.sh missing set -e at top level
- Primary download script suppresses aria2c errors with || true
- Primary download script lacks checksum verification (conservative/verified variants have it)
- nvidia-smi wrapper reports 16 GB instead of 128 GB
- Web portal CDN dependencies (Bootstrap, Chart.js) fail on air-gapped systems
- Web portal estimated GPU metrics presented as real measurements
- Nextflow run_pipeline.py only implements demo mode
- Docker Compose version: '3.8' deprecated (cosmetic)
- 1 failing test (pynvml mock patching)
## Recommendations for VAST R&D

### Before Forking

- Fork from: github.com/ajones1923/hcls-ai-factory
- Run: `cp .env.example .env` and fill in API keys (NGC, Anthropic)
- Quick test: `./setup-data.sh --stage2` (2 GB, 5 min), then `./demo.sh`
- Full test: `./setup-data.sh --all` (~500 GB, 2-6 hours), then the full pipeline
### VAST-Specific Migration Path
The private deployment guide at dashing-pegasus-fc7708.netlify.app provides:
- 31-section blueprint mapping to all 6 VAST AI OS components
- 46-file migration checklist (Appendix D)
- PyArrow schemas for 15 VAST DataBase tables
- DataEngine trigger chain replacing Nextflow orchestration
- InsightEngine RAG pipeline configuration
- AgentEngine ReAct agent with 5 tools
### What Works Out of the Box

- `setup-data.sh` — production-quality data acquisition
- All 3 pipeline stages with working code
- Docker Compose for all services
- Mock fallback for demo without real NIM containers
- 356 tests (355 passing)
- MkDocs documentation site
HCLS AI Factory — Apache 2.0 | February 2026 | Audit performed by Claude Opus 4.6