Troubleshooting Guide

Common issues and solutions for the HCLS AI Factory platform, organized by component.


Quick Diagnostics

Before diving into specific issues, run these checks:

# 1. Check all service health from landing page
curl -s http://localhost:8080/api/check-services | python3 -m json.tool

# 2. Check Docker containers
docker compose ps

# 3. Check GPU availability
nvidia-smi

# 4. Check disk space
df -h /

# 5. Check memory
free -h

Services Will Not Start

Docker Compose fails with port conflicts

Symptom: Bind for 0.0.0.0:8501 failed: port is already allocated

Solution:

# Find what is using the port
lsof -i :8501

# Stop the conflicting process, or change the port in docker-compose.yml
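
If you would rather keep the conflicting process, remapping the published port in docker-compose.yml is usually enough. The service name below is illustrative, not the project's actual value; only the host side of the mapping needs to change.

services:
  rag-chat-ui:            # illustrative name — use the service that publishes 8501
    ports:
      - "8502:8501"       # host:container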

NVIDIA Container Runtime not found

Symptom: docker: Error response from daemon: Unknown runtime specified nvidia

Solution:

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list" | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Services start but show "unhealthy"

Symptom: Landing page shows services as red/offline despite containers running.

Solution:

# Check container logs for the specific service
docker compose logs <service-name> --tail=50

# Common causes:
# 1. Milvus not ready yet (takes 30-60s to initialize)
# 2. Missing API keys (check .env file)
# 3. Missing data files (run setup-data.sh)
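
As a quick check on cause 2, confirm that the API keys referenced elsewhere in this guide are actually set, both in your shell and in .env. The loop below assumes bash and only covers the two keys named in this guide; add any others your deployment uses.

# Verify required API keys without printing their values (bash)
for var in ANTHROPIC_API_KEY NGC_CLI_API_KEY; do
  if [ -z "${!var}" ]; then echo "$var is NOT set"; else echo "$var is set"; fi
done
grep -E 'ANTHROPIC_API_KEY|NGC_CLI_API_KEY' .env | sed 's/=.*/=<redacted>/'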


Stage 1: Genomics Pipeline

Parabricks license error

Symptom: NVIDIA Parabricks license validation failed

Solution:

# Ensure NGC API key is set
echo $NGC_CLI_API_KEY

# Re-authenticate with NGC
docker login nvcr.io -u '$oauthtoken' -p $NGC_CLI_API_KEY
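
To avoid leaving the key in shell history, the same login can be done via stdin:

# Same login, reading the key from stdin instead of the command line
echo "$NGC_CLI_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin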

Out of GPU memory during alignment

Symptom: CUDA out of memory during BWA-MEM2 or DeepVariant

Solution:

  • Reduce batch size in Parabricks configuration
  • For DGX Spark (128GB unified memory), this should not occur with default settings
  • For GPUs with less VRAM, use the --low-memory flag if available, or process chromosomes individually (the monitoring command below helps confirm which step exhausts memory)
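
To watch GPU memory while the stage runs, the standard nvidia-smi query below is sufficient; no project-specific assumptions are involved.

# Refresh GPU memory usage every 2 seconds while the pipeline runs
watch -n 2 "nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv"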

FASTQ files not found

Symptom: FileNotFoundError when starting genomics pipeline

Solution:

# Verify data download completed
ls -la data/genomics/

# If missing, re-run data setup for genomics only
./setup-data.sh --genomics


Stage 2: RAG/Chat Pipeline

Milvus connection refused

Symptom: ConnectionRefusedError: [Errno 111] Connection refused on port 19530

Solution:

# Check if Milvus container is running
docker compose ps milvus

# Check Milvus logs
docker compose logs milvus --tail=50

# Restart Milvus
docker compose restart milvus

# Wait for initialization (30-60 seconds)
sleep 30
curl -s http://localhost:19530/v1/vector/collections
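
If the REST endpoint responds but the pipeline still cannot connect, a direct client check can isolate the problem. This assumes pymilvus is installed in your Python environment.

# Direct client connectivity check (assumes pymilvus is installed)
python3 -c "from pymilvus import connections, utility; connections.connect(host='localhost', port='19530'); print(utility.list_collections())"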

Empty search results

Symptom: Queries return no results or empty evidence

Solution:

# Check collection counts via Attu or API
curl -s http://localhost:19530/v1/vector/collections

# If collections are empty, re-run data ingestion
cd rag-chat-pipeline
python3 ingest.py --all
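
After ingestion, you can confirm a collection actually holds vectors. The collection name below is a placeholder — substitute a name returned by the collections listing above.

# Count entities in a collection (name is a placeholder; assumes pymilvus is installed)
python3 -c "from pymilvus import connections, Collection; connections.connect(host='localhost', port='19530'); print(Collection('your_collection').num_entities)"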

Claude API errors

Symptom: AuthenticationError or RateLimitError from Anthropic API

Solution:

# Verify API key is set
echo $ANTHROPIC_API_KEY

# Check API key validity
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model":"claude-sonnet-4-20250514","max_tokens":10,"messages":[{"role":"user","content":"test"}]}'

# For rate limits: the pipeline automatically retries with exponential backoff
# If persistent, check your Anthropic plan limits

Embedding model download fails

Symptom: OSError or timeout when loading BAAI/bge-small-en-v1.5

Solution:

# Pre-download the model
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-small-en-v1.5')"

# If behind a firewall, set HuggingFace cache directory
export HF_HOME=/path/to/cache
export TRANSFORMERS_CACHE=/path/to/cache
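
Once the model is cached, you can optionally force offline loading so the service never retries the download:

# Optional: load from the local cache only (skips all Hugging Face network calls)
export HF_HUB_OFFLINE=1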


Stage 3: Drug Discovery Pipeline

BioNeMo NIM services unavailable

Symptom: Connection refused on ports 8001 (MolMIM) or 8002 (DiffDock)

Solution:

# Check NIM mode
echo $NIM_MODE

# For cloud NIMs (recommended for DGX Spark ARM64):
export NIM_MODE=cloud
export NGC_CLI_API_KEY=your_key_here

# Verify cloud NIM access
curl -s https://health.api.nvidia.com/v1/health/ready

# For local NIMs (x86 only):
docker compose up -d molmim diffdock
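
Local NIM containers expose a standard readiness endpoint, so once they are up you can poll them directly (ports taken from this guide):

# Poll local NIM readiness (MolMIM on 8001, DiffDock on 8002)
curl -s http://localhost:8001/v1/health/ready
curl -s http://localhost:8002/v1/health/ready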

DiffDock docking fails

Symptom: Error during docking or empty results from DiffDock

Solution:

  • Verify the PDB file is valid and contains the target protein (a quick parse check is sketched below)
  • Check that the ligand SMILES string is valid: python3 -c "from rdkit import Chem; print(Chem.MolFromSmiles('your_smiles') is not None)"
  • For cloud NIM: verify NVCF asset staging completed (check logs for asset ID)
  • Try reducing the number of docking poses
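
For the first point, a quick way to confirm the receptor file parses at all is to load it with RDKit; target.pdb below is a placeholder path.

# Quick PDB sanity check (path is a placeholder)
python3 -c "from rdkit import Chem; print(Chem.MolFromPDBFile('target.pdb') is not None)"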

RDKit import errors

Symptom: ModuleNotFoundError: No module named 'rdkit'

Solution:

# Install RDKit via conda (recommended)
conda install -c conda-forge rdkit

# Or via pip
pip install rdkit-pypi
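
Either way, a one-liner confirms the install:

# Verify RDKit imports and print its version
python3 -c "from rdkit import rdBase; print(rdBase.rdkitVersion)"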


Intelligence Agents

Agent UI not loading

Symptom: Streamlit UI returns connection error on agent port (8521, 8525, or 8526)

Solution:

# Check if the agent container is running
docker compose ps | grep agent

# Start the specific agent
docker compose up -d cart-agent    # Port 8521
docker compose up -d imaging-agent # Port 8525
docker compose up -d onco-agent   # Port 8526

# Check agent logs
docker compose logs cart-agent --tail=50

Agent cannot connect to Milvus

Symptom: Agent health check shows milvus: false

Solution:

# Agents share the same Milvus instance as the core platform
# Verify Milvus is running and accessible
curl -s http://localhost:19530/v1/vector/collections

# Check the agent's Milvus host/port configuration
# Default: MILVUS_HOST=localhost, MILVUS_PORT=19530
# In Docker: MILVUS_HOST=milvus (container name)
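
To rule out a host/port mismatch, test the connection from inside the agent container itself. This sketch assumes the agent image ships pymilvus (it needs a Milvus client to function) and uses cart-agent as the example service.

# Test Milvus connectivity from inside an agent container (cart-agent used as an example)
docker compose exec cart-agent python3 -c "from pymilvus import connections; connections.connect(host='milvus', port='19530'); print('connected')"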

Cross-modal triggers not firing

Symptom: Agent queries do not pull evidence from shared genomic collections

Solution:

  • Verify the shared genomic_evidence collection exists in Milvus (see the check below)
  • Check that the cross-modal threshold is not set too high (default: 0.7)
  • Ensure the agent has read access to shared collections
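
The first point can be checked directly; this assumes pymilvus is installed on the host.

# Confirm the shared collection exists (assumes pymilvus is installed)
python3 -c "from pymilvus import connections, utility; connections.connect(host='localhost', port='19530'); print(utility.has_collection('genomic_evidence'))"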

Landing Page

Landing page shows all services as offline

Symptom: All service tiles are red despite services running

Solution:

# Check if the landing page can reach services
# Services must be accessible from the landing page container/process
# If running in Docker, ensure services are on the same network

docker network ls
docker network inspect hcls-ai-factory_default
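
If a container turns out to be missing from the shared network, it can be attached without recreating it; the container name below is a placeholder.

# Attach a container to the shared network (container name is a placeholder)
docker network connect hcls-ai-factory_default <container-name>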

Auto-start not working

Symptom: Genomics and RAG services do not auto-start from the landing page

Solution:

  • Auto-start only works when the landing page runs on the same host as the services
  • Check that the service directories exist at the expected paths
  • Verify Python virtual environments are set up for each service

Nextflow Orchestrator

Nextflow cgroup errors on DGX Spark

Symptom: Cannot get cgroup or process resource errors

Solution:

# Use the Python orchestrator as an alternative
cd hls-orchestrator
python3 run_pipeline.py --mode demo

# Or set explicit JVM memory limits via NXF_OPTS so Nextflow does not depend on cgroup detection
export NXF_OPTS="-Xms512m -Xmx4g"
./nextflow run main.nf -profile dgx_spark --mode demo

Pipeline hangs at a stage

Symptom: Nextflow shows a process running but no progress

Solution:

# Check Nextflow work directory for logs
ls -la work/

# Tail the logs of recently active tasks (modified in the last 30 minutes)
find work/ -name ".command.log" -mmin -30 -exec tail -n 20 {} \;

# Resume from the last successful stage
./nextflow run main.nf -resume


Data Setup

setup-data.sh download failures

Symptom: Downloads fail or stall during ./setup-data.sh --all

Solution:

# The script supports automatic retry — re-run safely
./setup-data.sh --all

# Download specific stages only
./setup-data.sh --genomics    # Reference genome, FASTQ files (~400GB)
./setup-data.sh --rag         # ClinVar, AlphaMissense, knowledge base (~2GB)
./setup-data.sh --drug        # PDB structures, seed compounds (~100MB)

# Verify checksums after download
./setup-data.sh --verify

Insufficient disk space

Symptom: No space left on device during data download or pipeline execution

Solution:

Component                      Approximate Size
Reference genome (GRCh38)      3.1 GB
FASTQ sequencing data          ~200 GB
ClinVar + AlphaMissense        ~2 GB
Milvus vector database         ~15 GB
Pipeline outputs (BAM, VCF)    ~120 GB
Docker images                  ~30 GB
Total recommended              500 GB minimum

# Check current usage
du -sh data/ genomics-pipeline/data/ rag-chat-pipeline/data/

# Clean up old pipeline outputs
rm -rf results/old_run_*/
docker system prune -f
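
If it is unclear what is consuming space, a quick breakdown of the largest directories helps before deleting anything:

# Show the largest directories under the current path, biggest first
du -h --max-depth=2 . 2>/dev/null | sort -rh | head -20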

Monitoring

Grafana dashboards empty

Symptom: Grafana loads but shows "No data" in panels

Solution:

# Check Prometheus is scraping targets
curl -s http://localhost:9099/api/v1/targets | python3 -m json.tool

# Verify Node Exporter is running
curl -s http://localhost:9100/metrics | head -5

# Verify DCGM Exporter is running (GPU metrics)
curl -s http://localhost:9400/metrics | head -5

# Default Grafana credentials: admin / admin
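
If targets are healthy but panels are still empty, querying Prometheus directly shows whether metrics are actually being recorded (port taken from this guide):

# Ask Prometheus which targets are reporting the "up" metric
curl -s 'http://localhost:9099/api/v1/query?query=up' | python3 -m json.tool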

Prometheus alerts firing

Symptom: Alert manager notifications for service down or GPU errors

Solution:

# Check active alerts
curl -s http://localhost:9099/api/v1/alerts | python3 -m json.tool

# Common alerts and resolutions:
# - ServiceDown: restart the affected service
# - GPUHighTemp: check GPU cooling, reduce workload
# - HighMemoryUsage: check for memory leaks, restart services
# - MilvusUnhealthy: restart Milvus container


Network and Security

CORS errors in browser

Symptom: Browser console shows Access-Control-Allow-Origin errors

Solution:

  • Add your origin to the CORS_ORIGINS environment variable (example below)
  • Default allows http://localhost:* patterns
  • For production, set specific origins instead of wildcards
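
A minimal sketch of the .env entry, assuming CORS_ORIGINS takes a comma-separated list — adjust to your deployment's actual format:

# Example .env entry (comma-separated format is an assumption)
CORS_ORIGINS=http://localhost:8080,https://factory.example.com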

API requests rejected (413)

Symptom: Request body exceeds X MB limit

Solution:

  • Default limit is 50 MB per request
  • Increase MAX_REQUEST_SIZE_MB in the service configuration (example below)
  • For large file uploads (VCF, FASTQ), use the dedicated file upload endpoints
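
For example, raising the limit to 200 MB in the service's environment (variable name taken from this guide):

# Example .env entry — raises the per-request body limit to 200 MB
MAX_REQUEST_SIZE_MB=200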

Getting Help

If your issue is not covered here:

  1. Check the Deployment Guide for detailed configuration
  2. Review service logs: docker compose logs <service> --tail=100
  3. Open an issue on GitHub with:
     • Steps to reproduce
     • Relevant log output
     • Hardware and OS details
     • Docker and NVIDIA driver versions

Clinical Decision Support Disclaimer

The HCLS AI Factory platform and its components are clinical decision support research tools. They are not FDA-cleared and are not intended as standalone diagnostic devices. All recommendations should be reviewed by qualified healthcare professionals. Apache 2.0 License.