Skip to content

HCLS AI Factory — Final Comprehensive Audit Report

Pre-Handoff Verification for VAST R&D

Date: February 9, 2026 | Auditor: Claude Opus 4.6 | Repo: ajones1923/hcls-ai-factory


Executive Summary

Metric Result
Total Tests Run 356
Tests Passed 355
Tests Failed 1 (pre-existing pynvml mock issue)
Critical Issues 0
High Severity Issues 1
Medium Severity Issues 4
Low Severity Issues ~20
MkDocs Site Build SUCCESS (0 errors, 0 warnings)
Live Site (hcls-ai-factory.org) LIVE, Stage 0 confirmed
VAST Site (dashing-pegasus-fc7708.netlify.app) LIVE, auth working

Verdict: APPROVED FOR VAST R&D HANDOFF


Test Results by Component

Component Tests Passed Failed Time
Stage 2: RAG/Chat Pipeline 157 157 0 0.07s
Stage 1: Genomics Web Portal 129 128 1
Stage 3: Drug Discovery Pipeline 59 59 0 0.06s
Landing Page 11 11 0
TOTAL 356 355 1

The single failing test (test_get_gpu_utilization_with_nvml) is a pre-existing mock patching issue in the genomics web portal — the test patches server.pynvml but the module uses a try/except import pattern. Not a functional bug.


Stage 0: Data Acquisition (setup-data.sh)

Status: PRODUCTION-QUALITY

Check Result
CLI flags (--all, --stage1/2/3, --verify, --status, --dry-run) All implemented
Download URLs (NCBI FTP for FASTQ, Google Storage for AlphaMissense) Valid format
Checksum verification MD5 with retry
Retry/resume logic aria2c primary, wget fallback, 3 retries, exponential backoff
Disk space preflight Checks available space per stage
Tool dependency checks aria2c, wget, curl, md5sum, pigz
Idempotency State file (.data-setup-state), skip existing files
Error handling Comprehensive with actionable diagnostics

Data Inventory: - Stage 1: 68 GIAB HG002 FASTQ chunks (~300 GB) + GRCh38 reference (~11 GB) - Stage 2: ClinVar variant_summary (394 MB) + ClinVar VCF (85 MB) + AlphaMissense (614 MB) - Stage 3: PDB structure cache (optional)

Documentation Cross-References: - docs/DATA_SETUP.md accurately describes setup-data.sh capabilities - README.md, quickstart.md, demo guide all reference Stage 0 correctly - PRODUCT_DOCUMENTATION.txt includes Stage 0 section

Finding: ClinVar filename in setup-data.sh downloads as clinvar_variant_summary.txt.gz while the RAG pipeline README shows variant_summary.txt.gz as the expected path. The ingest_vcf.py default arg uses clinvar_variant_summary.txt.gz which matches the download. The README documentation has a minor naming inconsistency but the code paths are correct.


Stage 1: Genomics Pipeline

Status: FUNCTIONAL WITH DOCUMENTED ISSUES

Scripts Audited (14 files)

Script Purpose Error Handling
run.sh Dispatcher for all subcommands Missing set -e at top level
00-setup-check.sh Prerequisites verification set -e present
01-ngc-login.sh NGC container registry login set -e present
02-download-data.sh FASTQ download (primary) set -e present
02-download-data-conservative.sh FASTQ with retry (5 attempts) Full retry logic
02-download-data-verified.sh FASTQ with MD5 verification MD5 checksums
03-setup-reference.sh GRCh38 reference genome Idempotent, skip existing
04-run-chr20-test.sh Chromosome 20 test run set -o pipefail, trap, retry
05-run-full-genome.sh Full genome pipeline Resume logic, 3 DeepVariant retries

Key Findings

DGX Spark Compatibility: - nvidia-smi wrapper reports 16 GB GPU memory (actual: 128 GB unified). This is a workaround for Parabricks not recognizing the GB10's memory. Conservative but functional — Parabricks may not fully utilize available memory. - Resume logic in 05-run-full-genome.sh is excellent: detects existing BAM/VCF and skips completed steps. - DeepVariant retry (3 attempts, 30s waits, GPU health checks) is robust.

Web Portal (Flask): - 129 tests, 128 passing - Security: CSRF tokens (constant-time comparison), rate limiting, path traversal protection - Estimated GPU metrics (IOPS, bandwidth, SM efficiency) presented as real measurements — could mislead VAST R&D - CDN dependencies (Bootstrap, Chart.js from jsdelivr) — will fail on air-gapped systems - Thread safety: pipeline_state dict accessed without locks (Python GIL mitigates but not ideal)

VCF Output Compatibility: - Standard VCFv4.2 format, bgzip compressed, tabix indexed - Naming convention (HG002.genome.vcf.gz) matches what Stage 2 expects


Stage 2: RAG/Chat Pipeline

Status: APPROVED FOR HANDOFF — 157/157 TESTS PASS

Architecture

VCF  vcf_parser.py  annotator.py (ClinVar + AlphaMissense)  embedder.py (BGE-small-en-v1.5, 384-dim)
   milvus_client.py (IVF_FLAT, COSINE)  rag_engine.py (10 therapeutic areas)  llm_client.py (4 providers)
   chat_ui.py (Streamlit)  target_hypothesis.py  Phase 5 export

Module-by-Module

Module Lines Key Features Status
vcf_parser.py 331 cyvcf2 + fallback parser, multi-allelic splitting, long allele truncation PASS
annotator.py 610 ClinVar (4.1M), AlphaMissense (71M), VEP PASS
embedder.py 200 BGE-small-en-v1.5, normalize=True, disk cache PASS
milvus_client.py 409 17-field schema, IVF_FLAT, injection-safe sanitization PASS
llm_client.py 348 4 providers (Anthropic, OpenAI, Ollama, vLLM), factory pattern PASS
rag_engine.py 622 10 therapeutic area query expansion, Clinker knowledge integration PASS
knowledge.py 2,684 201 genes, 171 druggable, 13 therapeutic areas PASS
target_hypothesis.py 253 CRUD, JSON persistence, Phase 5 export PASS
chat_ui.py 1,774 6 model options, streaming, evidence panels, file manager PASS

Security

  • Milvus filter injection: Prevented by regex sanitization on gene and chromosome inputs
  • 7 injection payloads tested and rejected for each sanitizer
  • API keys sourced from environment variables, never hardcoded

Knowledge Base Statistics

  • 201 genes across 13 therapeutic areas
  • 171 druggable targets (85.1%)
  • 73 genes with reference SMILES for drug discovery handoff
  • 10 query expansion dictionaries (126+ keywords)

Stage 3: Drug Discovery Pipeline

Status: APPROVED FOR HANDOFF — 59/59 TESTS PASS

10-Stage Pipeline

Stage Name Implementation
0 Initialize Config validation, output directory creation
1 Normalize Target Target import from RAG pipeline
2 Structure Discovery RCSB PDB query, resolution-based ranking
3 Structure Prep Best structure selection (5FTK for VCP)
4 Molecule Generation MolMIM NIM (real) or RDKit mock fallback
5 Chemistry QC Lipinski Rule of Five, SMILES validation
6 Conformers 3D conformer generation (RDKit)
7 Docking DiffDock NIM (real) or hash-seeded mock
8 Ranking Composite: 30% gen + 40% dock + 30% QED
9 Reporting PDF via ReportLab, JSON export, SDF export

Scoring Formula (Verified)

composite = 0.3 * generation_score + 0.4 * dock_normalized + 0.3 * qed_score
dock_normalized = max(0, min(1, (10 + dock_score) / 20))

Mathematically verified against output data: - Candidate #1: gen=1.0, dock=-8.62, qed=0.387 → composite=0.4437 ✓

Mock Fallback

NIM_ALLOW_MOCK_FALLBACK=true enables full pipeline execution without real BioNeMo NIM containers: - MockMolMIMClient: RDKit-based analogues + 9 pre-designed VCP inhibitors - MockDiffDockClient: Hash-seeded reproducible docking scores centered at -8.0

Minor Issues (0 critical, 8 minor)

  1. max_retries defined but no retry logic in NIM HTTP calls
  2. Dual GeneratedMolecule classes (dataclass in UI, Pydantic in pipeline)
  3. Pydantic v1 .dict() used instead of v2 .model_dump()
  4. services CLI command may crash if NIMs unavailable
  5. Docking stage passes PDB ID string, not file content (works in mock mode)
  6. No weight-sum validation in PipelineConfig
  7. Morgan fingerprint computed but unused in molecule_generator.py
  8. Nextflow script misplaced in monitoring/ directory

Orchestrator & Infrastructure

Nextflow DSL2 (hls-orchestrator/)

Mode Status
full Chains Stage 1 → 2 → 3 correctly
demo Works (VCP demo data)
target Partial (genomics skipped)
drug Partial (genomics + RAG skipped)

Profiles: standard, docker, singularity, dgx_spark, slurm, test

Note: Nextflow modules use simplified/mock implementations (BWA-MEM + GATK instead of Parabricks, mock molecule generation instead of NIM calls). Real pipeline execution uses the individual stage scripts/code.

run_pipeline.py Python alternative: Only demo mode is implemented. Other modes print a stub message.

Landing Page (landing-page/)

  • 11/11 tests passing
  • Monitors 10 services in parallel with 2s timeout each
  • Dynamic host IP detection
  • Report freshness checking

Service Launchers

Script Services Started Status
start-services.sh Milvus, Landing (8080), Chat (8501), Drug Discovery (8505), Portal (8510) PASS — uses $SCRIPT_DIR
demo.sh Above + Genomics (5000), RAG API (5001) PASS — independent
health-monitor.sh All 11 services with auto-recovery, cron support PASS

Docker Compose Files (4 total)

Location Services GPU
rag-chat-pipeline/ Milvus v2.4.17 (ARM64), VEP No
drug-discovery-pipeline/ MolMIM, DiffDock, Pipeline UI 2 GPU (shared)
genomics-pipeline/web-portal/ Flask portal No
drug-discovery-pipeline/monitoring/ Prometheus, Grafana, DCGM, Node Exporter No

Documentation & MkDocs Site

Site Build

  • Build result: SUCCESS (0 errors)
  • Live site: hcls-ai-factory.org — 200 OK, Stage 0 content confirmed
  • Pages: 20+ pages in sitemap including home, architecture, quickstart, all 3 stages, data setup, deployment guide, demo guide, white paper, project bible, learning guides

Content Consistency

Claim Consistent Across Docs?
DGX Spark $3,999 Yes (15+ references)
Under 5 hours end-to-end Yes
11.7M variants Yes
3.56M pass quality filter Yes
Parabricks 4.6 Yes (10+ references)
claude-sonnet-4-20250514 Yes (15+ references)
Stage 0 data acquisition Yes (added to all relevant pages)

Community Files

File Status
README.md Comprehensive, accurate
CONTRIBUTING.md Fork guidance, code standards, PR process
CODE_OF_CONDUCT.md Healthcare/life sciences appropriate
SECURITY.md Contact info, HIPAA/GDPR scope
LICENSE Apache 2.0
.env.example All variables documented

CI/CD

  • GitHub Actions: lint (ruff) + test (4 services) + docs (mkdocs build)
  • Dependabot: pip weekly for 4 directories + GitHub Actions monthly
  • Secret scanner: scripts/check-secrets.sh — no secrets detected

Git Hygiene

Check Result
Tracked .pyc/__pycache__ None
Tracked large data files None
Tracked site/ build output None
Tracked results/ None (cleaned)
Tracked .env secrets None
start-services.sh TRANSFER_DIR Fixed — uses $SCRIPT_DIR
Total tracked files 276
.gitignore coverage Comprehensive (159 lines)

Issues Summary

High Severity (1)

ID Component Description
H-1 landing-page/start-all.sh Drug Discovery Portal start command points to wrong directory. Mitigated: Primary launcher start-services.sh handles this correctly. This is a secondary/legacy script.

Medium Severity (4)

ID Component Description
M-1 hls-orchestrator/main.nf genomics_only mode referenced but not implemented — would cause runtime failure
M-2 hls-orchestrator/main.nf ch_targets type mismatch in demo/drug modes when passed to GENERATE_REPORT
M-3 hls-orchestrator/portal/app.py DCGM metrics URL hardcoded to localhost (should use SERVICE_HOST)
M-4 hls-orchestrator/portal/app.py Sidebar service status checks hardcoded to localhost

Low Severity (~20)

Key items: - Genomics run.sh missing set -e at top level - Primary download script suppresses aria2c errors with || true - Primary download script lacks checksum verification (conservative/verified variants have it) - nvidia-smi wrapper reports 16 GB instead of 128 GB - Web portal CDN dependencies (Bootstrap, Chart.js) fail on air-gapped systems - Web portal estimated GPU metrics presented as real measurements - Nextflow run_pipeline.py only implements demo mode - Docker Compose version: '3.8' deprecated (cosmetic) - 1 failing test (pynvml mock patching)


Recommendations for VAST R&D

Before Forking

  1. Fork from: github.com/ajones1923/hcls-ai-factory
  2. Run: cp .env.example .env and fill in API keys (NGC, Anthropic)
  3. Quick test: ./setup-data.sh --stage2 (2 GB, 5 min) then ./demo.sh
  4. Full test: ./setup-data.sh --all (~500 GB, 2-6 hours) then full pipeline

VAST-Specific Migration Path

The private deployment guide at dashing-pegasus-fc7708.netlify.app provides: - 31-section blueprint mapping to all 6 VAST AI OS components - 46-file migration checklist (Appendix D) - PyArrow schemas for 15 VAST DataBase tables - DataEngine trigger chain replacing Nextflow orchestration - InsightEngine RAG pipeline configuration - AgentEngine ReAct agent with 5 tools

What Works Out of the Box

  • setup-data.sh — production-quality data acquisition
  • All 3 pipeline stages with working code
  • Docker Compose for all services
  • Mock fallback for demo without real NIM containers
  • 356 tests (355 passing)
  • MkDocs documentation site

HCLS AI Factory — Apache 2.0 | February 2026 Audit performed by Claude Opus 4.6