Troubleshooting Guide¶
Common issues and solutions for the HCLS AI Factory platform, organized by component.
Quick Diagnostics¶
Before diving into specific issues, run these checks:
# 1. Check all service health from landing page
curl -s http://localhost:8080/api/check-services | python3 -m json.tool
# 2. Check Docker containers
docker compose ps
# 3. Check GPU availability
nvidia-smi
# 4. Check disk space
df -h /
# 5. Check memory
free -h
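If you want a single pass over everything at once, a quick port sweep covers the services referenced throughout this guide (the port list below simply collects the defaults used in later sections; adjust it to your deployment):
# Sweep the default service ports; requires netcat (nc)
for port in 8080 8501 8521 8525 8526 9099 9100 9400 19530; do
  nc -z localhost "$port" 2>/dev/null && echo "port $port: open" || echo "port $port: closed"
done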
Services Will Not Start¶
Docker Compose fails with port conflicts¶
Symptom: Bind for 0.0.0.0:8501 failed: port is already allocated
Solution:
# Find what is using the port
lsof -i :8501
# Stop the conflicting process, or change the port in docker-compose.yml
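If you would rather move the service than stop the conflicting process, a Compose override file keeps the change out of the main docker-compose.yml. This is only a sketch; the service name is illustrative and must match your compose file:
# Stop the process currently holding the port (only if it is safe to do so)
lsof -t -i :8501 | xargs -r sudo kill
# Or remap the host port via an override file (service name is illustrative)
cat > docker-compose.override.yml <<'EOF'
services:
  rag-chat-ui:            # replace with the actual service name
    ports:
      - "8502:8501"       # host 8502 -> container 8501
EOF
docker compose up -d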
NVIDIA Container Runtime not found¶
Symptom: docker: Error response from daemon: Unknown runtime specified nvidia
Solution:
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L "https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list" | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
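After restarting Docker, a quick way to confirm that containers can see the GPU (the CUDA image tag is illustrative):
# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi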
Services start but show "unhealthy"¶
Symptom: Landing page shows services as red/offline despite containers running.
Solution:
# Check container logs for the specific service
docker compose logs <service-name> --tail=50
# Common causes:
# 1. Milvus not ready yet (takes 30-60s to initialize)
# 2. Missing API keys (check .env file)
# 3. Missing data files (run setup-data.sh)
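For the missing-API-key case, a quick sanity check over the .env file (this sketch assumes the file sits at the repository root; the key names are the ones used elsewhere in this guide):
# Report which required keys are present in .env without printing their values
for key in ANTHROPIC_API_KEY NGC_CLI_API_KEY; do
  grep -q "^${key}=" .env && echo "$key: set" || echo "$key: MISSING"
done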
Stage 1: Genomics Pipeline¶
Parabricks license error¶
Symptom: NVIDIA Parabricks license validation failed
Solution:
# Ensure NGC API key is set
echo $NGC_CLI_API_KEY
# Re-authenticate with NGC
docker login nvcr.io -u '$oauthtoken' -p $NGC_CLI_API_KEY
Out of GPU memory during alignment¶
Symptom: CUDA out of memory during BWA-MEM2 or DeepVariant
Solution:
- Reduce batch size in Parabricks configuration
- For DGX Spark (128GB unified memory), this should not occur with default settings
- For GPUs with less VRAM, use the --low-memory flag if available, or process chromosomes individually (a quick memory check is shown after this list)
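Before changing batch sizes, it can help to confirm that GPU memory is actually the bottleneck by watching usage while the stage runs:
# Sample GPU memory usage every 5 seconds during the alignment/variant-calling stage
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 5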
FASTQ files not found¶
Symptom: FileNotFoundError when starting genomics pipeline
Solution:
# Verify data download completed
ls -la data/genomics/
# If missing, re-run data setup for genomics only
./setup-data.sh --genomics
Stage 2: RAG/Chat Pipeline¶
Milvus connection refused¶
Symptom: ConnectionRefusedError: [Errno 111] Connection refused on port 19530
Solution:
# Check if Milvus container is running
docker compose ps milvus
# Check Milvus logs
docker compose logs milvus --tail=50
# Restart Milvus
docker compose restart milvus
# Wait for initialization (30-60 seconds)
sleep 30
curl -s http://localhost:19530/v1/vector/collections
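Instead of a fixed sleep, a small readiness loop (using the same REST endpoint as above) waits only as long as needed:
# Poll until Milvus answers on its REST endpoint, then continue
until curl -sf http://localhost:19530/v1/vector/collections > /dev/null; do
  echo "Waiting for Milvus..."
  sleep 5
done
echo "Milvus is up"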
Empty search results¶
Symptom: Queries return no results or empty evidence
Solution:
# Check collection counts via Attu or API
curl -s http://localhost:19530/v1/vector/collections
# If collections are empty, re-run data ingestion
cd rag-chat-pipeline
python3 ingest.py --all
Claude API errors¶
Symptom: AuthenticationError or RateLimitError from Anthropic API
Solution:
# Verify API key is set
echo $ANTHROPIC_API_KEY
# Check API key validity
curl -s https://api.anthropic.com/v1/messages \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "content-type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{"model":"claude-sonnet-4-20250514","max_tokens":10,"messages":[{"role":"user","content":"test"}]}'
# For rate limits: the pipeline automatically retries with exponential backoff
# If persistent, check your Anthropic plan limits
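To reproduce the pipeline's retry behaviour by hand, you can wrap the test call above in a loop; this is an illustrative sketch of exponential backoff, not the pipeline's actual retry code:
# Retry the test request up to 5 times, doubling the wait between attempts
delay=1
for attempt in 1 2 3 4 5; do
  curl -sf https://api.anthropic.com/v1/messages \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "content-type: application/json" \
    -H "anthropic-version: 2023-06-01" \
    -d '{"model":"claude-sonnet-4-20250514","max_tokens":10,"messages":[{"role":"user","content":"test"}]}' && break
  echo "Attempt $attempt failed; retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
done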
Embedding model download fails¶
Symptom: OSError or timeout when loading BAAI/bge-small-en-v1.5
Solution:
# Pre-download the model
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-small-en-v1.5')"
# If behind a firewall, set HuggingFace cache directory
export HF_HOME=/path/to/cache
export TRANSFORMERS_CACHE=/path/to/cache
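To confirm the pre-download worked, force an offline load from the cache (HF_HUB_OFFLINE is a standard Hugging Face environment variable):
# Fails if the model is not already in the local cache
HF_HUB_OFFLINE=1 python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('BAAI/bge-small-en-v1.5'); print('loaded from cache')"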
Stage 3: Drug Discovery Pipeline¶
BioNeMo NIM services unavailable¶
Symptom: Connection refused on ports 8001 (MolMIM) or 8002 (DiffDock)
Solution:
# Check NIM mode
echo $NIM_MODE
# For cloud NIMs (recommended for DGX Spark ARM64):
export NIM_MODE=cloud
export NGC_CLI_API_KEY=your_key_here
# Verify cloud NIM access
curl -s https://health.api.nvidia.com/v1/health/ready
# For local NIMs (x86 only):
docker compose up -d molmim diffdock
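For local NIMs, the containers typically expose the same readiness path on their own ports, so the health check mirrors the cloud one (the path is assumed to follow the standard NIM layout):
# Readiness checks for locally hosted NIMs
curl -s http://localhost:8001/v1/health/ready   # MolMIM
curl -s http://localhost:8002/v1/health/ready   # DiffDock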
DiffDock docking fails¶
Symptom: Error during docking or empty results from DiffDock
Solution:
- Verify the PDB file is valid and contains the target protein
- Check that the ligand SMILES string is valid (a combined input check is sketched after this list): python3 -c "from rdkit import Chem; print(Chem.MolFromSmiles('your_smiles') is not None)"
- For cloud NIM: verify NVCF asset staging completed (check logs for asset ID)
- Try reducing the number of docking poses
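A quick way to sanity-check both inputs before re-running the docking job (the file name and SMILES string are illustrative):
# Validate a ligand SMILES string with RDKit
python3 -c "from rdkit import Chem; print('valid SMILES:', Chem.MolFromSmiles('CCO') is not None)"
# A usable PDB file should contain ATOM records for the target protein
grep -c '^ATOM' target.pdb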
RDKit import errors¶
Symptom: ModuleNotFoundError: No module named 'rdkit'
Solution:
# Install RDKit via conda (recommended)
conda install -c conda-forge rdkit
# Or via pip
pip install rdkit-pypi
Intelligence Agents¶
Agent UI not loading¶
Symptom: Streamlit UI returns connection error on agent port (8521, 8525, or 8526)
Solution:
# Check if the agent container is running
docker compose ps | grep agent
# Start the specific agent
docker compose up -d cart-agent # Port 8521
docker compose up -d imaging-agent # Port 8525
docker compose up -d onco-agent # Port 8526
# Check agent logs
docker compose logs cart-agent --tail=50
Agent cannot connect to Milvus¶
Symptom: Agent health check shows milvus: false
Solution:
# Agents share the same Milvus instance as the core platform
# Verify Milvus is running and accessible
curl -s http://localhost:19530/v1/vector/collections
# Check the agent's Milvus host/port configuration
# Default: MILVUS_HOST=localhost, MILVUS_PORT=19530
# In Docker: MILVUS_HOST=milvus (container name)
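To see the values an agent container is actually using (the cart-agent service name comes from the section above):
# Print the Milvus connection settings inside the running agent container
docker compose exec cart-agent env | grep -E 'MILVUS_(HOST|PORT)'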
Cross-modal triggers not firing¶
Symptom: Agent queries do not pull evidence from shared genomic collections
Solution:
- Verify the shared genomic_evidence collection exists in Milvus (a quick check follows this list)
- Check that the cross-modal threshold is not set too high (default: 0.7)
- Ensure the agent has read access to shared collections
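The collection check from the first bullet can be done with the same REST endpoint used throughout this guide:
# Prints "genomic_evidence" if the shared collection exists
curl -s http://localhost:19530/v1/vector/collections | grep -o genomic_evidence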
Landing Page¶
Landing page shows all services as offline¶
Symptom: All service tiles are red despite services running
Solution:
# Check if the landing page can reach services
# Services must be accessible from the landing page container/process
# If running in Docker, ensure services are on the same network
docker network ls
docker network inspect hcls-ai-factory_default
Auto-start not working¶
Symptom: Genomics and RAG services do not auto-start from the landing page
Solution:
- Auto-start only works when the landing page runs on the same host as the services
- Check that the service directories exist at the expected paths
- Verify Python virtual environments are set up for each service
Nextflow Orchestrator¶
Nextflow cgroup errors on DGX Spark¶
Symptom: Cannot get cgroup or process resource errors
Solution:
# Use the Python orchestrator as an alternative
cd hls-orchestrator
python3 run_pipeline.py --mode demo
# Or run Nextflow with NXF_OPTS to bypass cgroup
export NXF_OPTS="-Xms512m -Xmx4g"
./nextflow run main.nf -profile dgx_spark --mode demo
Pipeline hangs at a stage¶
Symptom: Nextflow shows a process running but no progress
Solution:
# Check Nextflow work directory for logs
ls -la work/
# Find the specific task directory
find work/ -name ".command.log" -newer work/ -exec tail -20 {} \;
# Resume from the last successful stage
./nextflow run main.nf -resume
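A more targeted way to find the stuck task is to look at the most recently modified task directory; Nextflow writes .command.log and .command.err inside each one:
# Show the tail of the logs from the most recently touched task directory
latest=$(ls -td work/*/* | head -1)
echo "Latest task dir: $latest"
tail -20 "$latest/.command.log" "$latest/.command.err"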
Data Setup¶
setup-data.sh download failures¶
Symptom: Downloads fail or stall during ./setup-data.sh --all
Solution:
# The script supports automatic retry — re-run safely
./setup-data.sh --all
# Download specific stages only
./setup-data.sh --genomics # Reference genome, FASTQ files (~400GB)
./setup-data.sh --rag # ClinVar, AlphaMissense, knowledge base (~2GB)
./setup-data.sh --drug # PDB structures, seed compounds (~100MB)
# Verify checksums after download
./setup-data.sh --verify
Insufficient disk space¶
Symptom: No space left on device during data download or pipeline execution
Solution:
| Component | Approximate Size |
|---|---|
| Reference genome (GRCh38) | 3.1 GB |
| FASTQ sequencing data | ~200 GB |
| ClinVar + AlphaMissense | ~2 GB |
| Milvus vector database | ~15 GB |
| Pipeline outputs (BAM, VCF) | ~120 GB |
| Docker images | ~30 GB |
| Total recommended | 500 GB minimum |
# Check current usage
du -sh data/ genomics-pipeline/data/ rag-chat-pipeline/data/
# Clean up old pipeline outputs
rm -rf results/old_run_*/
docker system prune -f
Monitoring¶
Grafana dashboards empty¶
Symptom: Grafana loads but shows "No data" in panels
Solution:
# Check Prometheus is scraping targets
curl -s http://localhost:9099/api/v1/targets | python3 -m json.tool
# Verify Node Exporter is running
curl -s http://localhost:9100/metrics | head -5
# Verify DCGM Exporter is running (GPU metrics)
curl -s http://localhost:9400/metrics | head -5
# Default Grafana credentials: admin / admin
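If the exporters respond but panels are still empty, ask Prometheus directly whether it has data for its targets (standard Prometheus query API, using the port from above):
# "up" should be 1 for every scrape target Prometheus knows about
curl -s 'http://localhost:9099/api/v1/query?query=up' | python3 -m json.tool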
Prometheus alerts firing¶
Symptom: Alert manager notifications for service down or GPU errors
Solution:
# Check active alerts
curl -s http://localhost:9099/api/v1/alerts | python3 -m json.tool
# Common alerts and resolutions:
# - ServiceDown: restart the affected service
# - GPUHighTemp: check GPU cooling, reduce workload
# - HighMemoryUsage: check for memory leaks, restart services
# - MilvusUnhealthy: restart Milvus container
Network and Security¶
CORS errors in browser¶
Symptom: Browser console shows Access-Control-Allow-Origin errors
Solution:
- Add your origin to the CORS_ORIGINS environment variable (an example entry follows this list)
- The default allows http://localhost:* patterns
- For production, set specific origins instead of wildcards
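An illustrative .env entry; the exact format (e.g. comma-separated origins) should be checked against the service's configuration docs:
# Example only, replace with your deployment's real origins
CORS_ORIGINS=https://app.example.com,https://dashboard.example.com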
API requests rejected (413)¶
Symptom: Request body exceeds X MB limit
Solution:
- Default limit is 50 MB per request
- Increase MAX_REQUEST_SIZE_MB in the service configuration (example below)
- For large file uploads (VCF, FASTQ), use the dedicated file upload endpoints
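For example, raising the limit in the service's environment (the value is illustrative):
# Raise the per-request limit from the 50 MB default to 100 MB
MAX_REQUEST_SIZE_MB=100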
Getting Help¶
If your issue is not covered here:
- Check the Deployment Guide for detailed configuration
- Review service logs: docker compose logs <service> --tail=100
- Open an issue on GitHub with:
- Steps to reproduce
- Relevant log output
- Hardware and OS details
- Docker and NVIDIA driver versions