Stage 0: Data Acquisition

The foundation step before any pipeline can run.

This guide covers downloading and verifying all data required to run the HCLS AI Factory end-to-end (~500 GB). The setup-data.sh script automates this process, but this document provides context, troubleshooting, and manual procedures for when things go wrong.

Stage 0 is a one-time setup step that acquires all reference data, sequencing data, and annotation databases needed by the three processing stages. Once complete, the pipeline can run repeatedly without re-downloading.


Quick Start

# Download everything (run once, ~500 GB total)
./setup-data.sh --all

# Or download stage by stage
./setup-data.sh --stage2    # ClinVar + AlphaMissense (~2 GB, 5 min)
./setup-data.sh --stage1    # HG002 FASTQ + reference (~300 GB, 2-6 hours)
./setup-data.sh --stage3    # PDB structure cache (optional)

# Check what's downloaded
./setup-data.sh --status

# Re-verify all checksums
./setup-data.sh --verify

What Gets Downloaded

Stage | Component | Size | Source | Checksum
------|-----------|------|--------|---------
1 | HG002 FASTQ files (68 files) | ~200 GB | NCBI GIAB FTP | MD5 per file
1 | GRCh38 reference genome + BWA index | ~11 GB | S3 (Parabricks bundle) | --
1 | Merged FASTQ (R1 + R2) | ~100 GB | Generated locally | --
2 | ClinVar variant_summary.txt.gz | ~394 MB | NCBI FTP | gzip integrity
2 | ClinVar VCF + tabix index | ~85 MB | NCBI FTP | gzip integrity
2 | AlphaMissense_hg38.tsv.gz | ~614 MB | Google Cloud Storage | gzip integrity
3 | PDB structures | ~3 MB/gene | RCSB PDB (auto-fetched) | --
Total | | ~500 GB | |

Time Estimates

Component | 100 Mbps | 1 Gbps | 10 Gbps
----------|----------|--------|--------
FASTQ download | 6 hours | 45 min | 5 min
Reference genome | 20 min | 2 min | 15 sec
FASTQ merge | 30-60 min (CPU-bound) | 30-60 min | 30-60 min
Stage 2 databases | 15 min | 2 min | 15 sec

Stage-by-Stage Reference

Stage 1: HG002 FASTQ Files

What: Whole-genome sequencing data for GIAB HG002 (Ashkenazi male reference standard). 30x coverage, Illumina 2x250bp paired-end.

Source: NCBI Genome in a Bottle (GIAB) FTP
  • Index URL: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_indexes/AshkenazimTrio/sequence.index.AJtrio_Illumina_2x250bps_06012016_updated.HG002
  • Data: 34 R1 files + 34 R2 files across 2 sequencing lanes (L001, L002)

Checksum: MD5 checksums provided in the GIAB index TSV file (columns 2 and 4).

Download tool: aria2c with parallel connections. Falls back to wget on retry.

Output paths:
  • Raw chunks: genomics-pipeline/data/input/giab_hg002/reads/*.fastq.gz
  • Merged: genomics-pipeline/data/input/giab_hg002/HG002_R1.fastq.gz and HG002_R2.fastq.gz
  • Pipeline input: genomics-pipeline/data/input/HG002_R1.fastq.gz and HG002_R2.fastq.gz
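
If a chunk repeatedly fails, you can fetch and check it by hand. The sketch below is the manual equivalent of what setup-data.sh does for one R1 file; it assumes the index lists a download URL in column 1 next to its MD5 in column 2 (only the MD5 columns are documented above), so confirm the layout with head -1 before relying on it:

# Manual download + verification of a single chunk (sketch; column layout assumed)
INDEX_URL="https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data_indexes/AshkenazimTrio/sequence.index.AJtrio_Illumina_2x250bps_06012016_updated.HG002"
DEST="genomics-pipeline/data/input/giab_hg002/reads"
mkdir -p "$DEST"

wget -qO index.tsv "$INDEX_URL"

# First non-comment entry: R1 URL (column 1, assumed) and its expected MD5 (column 2)
read -r FASTQ_URL EXPECTED_MD5 <<< "$(awk -F'\t' '!/^#/ {print $1, $2; exit}' index.tsv)"

aria2c -x 8 -s 8 -d "$DEST" "$FASTQ_URL"

# Verify against the MD5 published in the index
echo "$EXPECTED_MD5  $DEST/$(basename "$FASTQ_URL")" | md5sum -c -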

Stage 1: Reference Genome (GRCh38)

What: Human reference genome build 38 with pre-built BWA index files for NVIDIA Parabricks alignment.

Source: NVIDIA Parabricks sample bundle on S3
  • URL: https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz
  • Size: ~11 GB compressed

Contents after extraction:
  • GRCh38.fa — Reference FASTA (3.1 GB)
  • GRCh38.fa.bwt — BWT suffix array (3.0 GB)
  • GRCh38.fa.pac — Packed sequence (768 MB)
  • GRCh38.fa.sa — Suffix array (1.5 GB)
  • GRCh38.fa.amb — Ambiguous bases
  • GRCh38.fa.ann — Annotations
  • GRCh38.fa.fai — FASTA index (created via samtools)
  • GRCh38.dict — Sequence dictionary (created via samtools)

Output path: genomics-pipeline/data/ref/
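
For reference, a manual fetch looks roughly like the sketch below; setup-data.sh handles this for you. The extracted directory layout (parabricks_sample/Ref/) and the samtools Docker image are assumptions, so adjust both to what you actually find:

mkdir -p genomics-pipeline/data/ref
wget -c https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz
tar -xzf parabricks_sample.tar.gz

# Copy the FASTA and BWA index files into place (extracted names/paths may differ)
cp parabricks_sample/Ref/GRCh38.fa* genomics-pipeline/data/ref/

# Create the .fai index and .dict dictionary with samtools in Docker
# (the image tag is illustrative -- pick one that matches your architecture)
docker run --rm -v "$PWD/genomics-pipeline/data/ref:/ref" staphb/samtools:latest \
  samtools faidx /ref/GRCh38.fa
docker run --rm -v "$PWD/genomics-pipeline/data/ref:/ref" staphb/samtools:latest \
  samtools dict /ref/GRCh38.fa -o /ref/GRCh38.dict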

Stage 1: FASTQ Merge

What: Combines 68 individual FASTQ chunks (from 2 sequencing lanes) into 2 merged files.

Process: Decompresses all chunks with zcat, re-compresses with pigz (parallel gzip using all CPU cores).

Time: 30-60 minutes depending on CPU speed and I/O.

Requires: pigz for parallel compression (falls back to gzip if unavailable, but significantly slower).
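
If you skipped the merge (--skip-merge) or need to redo it by hand, the equivalent is roughly the following. The chunk filename patterns are assumptions, so match them to the files actually present in reads/, and keep R1 and R2 in the same sort order so read pairing is preserved:

cd genomics-pipeline/data/input/giab_hg002
zcat reads/*_R1_*.fastq.gz | pigz -p "$(nproc)" > HG002_R1.fastq.gz   # R1 chunks, lanes L001 + L002
zcat reads/*_R2_*.fastq.gz | pigz -p "$(nproc)" > HG002_R2.fastq.gz   # R2 chunks, same order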

Stage 2: ClinVar

What: NCBI Clinical Variants database. Contains clinical significance classifications for 4.1 million human variants.

Files:
  • clinvar_variant_summary.txt.gz (~394 MB) — Tab-delimited format used by the ClinVar annotator
  • clinvar.vcf.gz (~85 MB) — VCF format for secondary annotation paths
  • clinvar.vcf.gz.tbi — Tabix index for VCF

Source: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/

Output path: rag-chat-pipeline/data/annotations/
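
To fetch these files manually, the sketch below assumes NCBI's standard ClinVar layout (tab_delimited/ for the summary, vcf_GRCh38/ for the VCF); confirm against the FTP listing if the paths have moved:

DEST=rag-chat-pipeline/data/annotations
BASE=https://ftp.ncbi.nlm.nih.gov/pub/clinvar
mkdir -p "$DEST"

wget -c -O "$DEST/clinvar_variant_summary.txt.gz" "$BASE/tab_delimited/variant_summary.txt.gz"
wget -c -O "$DEST/clinvar.vcf.gz" "$BASE/vcf_GRCh38/clinvar.vcf.gz"
wget -c -O "$DEST/clinvar.vcf.gz.tbi" "$BASE/vcf_GRCh38/clinvar.vcf.gz.tbi"

# "gzip integrity" check: a truncated download fails gzip -t
gzip -t "$DEST/clinvar_variant_summary.txt.gz" && gzip -t "$DEST/clinvar.vcf.gz" && echo "ClinVar files OK"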

Stage 2: AlphaMissense

What: DeepMind AlphaMissense predictions. AI-predicted pathogenicity scores for 71 million missense variants.

Source: https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz

Size: ~614 MB compressed. Requires ~8-10 GB RAM when loaded into memory during ingestion.

License: CC-BY-4.0 (Google DeepMind)

Output path: rag-chat-pipeline/data/annotations/AlphaMissense_hg38.tsv.gz
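
A manual fetch follows the same pattern as ClinVar: a resumable download plus a gzip integrity check.

mkdir -p rag-chat-pipeline/data/annotations
wget -c -P rag-chat-pipeline/data/annotations \
  https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz

gzip -t rag-chat-pipeline/data/annotations/AlphaMissense_hg38.tsv.gz && echo "AlphaMissense OK"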

Stage 3: PDB Structures

What: Protein structure files from the RCSB Protein Data Bank. Used for molecular docking in Stage 3.

Behavior: Auto-fetched from RCSB during pipeline execution. The CryoEMEvidenceManager automatically downloads and caches structures for any gene with PDB IDs.

VCP demo structures (optionally pre-fetched by setup-data.sh):
  • 5FTK — VCP + CB-5083 inhibitor (2.9 MB)
  • 7K56 — VCP + cofactor complex (2.9 MB)
  • 8OOI — VCP wild-type hexamer (2.9 MB)
  • 9DIL — VCP disease mutant (950 KB)

Output path: drug-discovery-pipeline/data/structures/pdb_cache/
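
If you want the VCP demo structures in place before the first run, a manual pre-fetch can look like the sketch below. files.rcsb.org is RCSB's standard download host; the <ID>.pdb cache naming is an assumption, and the CryoEMEvidenceManager will simply re-fetch anything it does not find:

DEST=drug-discovery-pipeline/data/structures/pdb_cache
mkdir -p "$DEST"
for id in 5FTK 7K56 8OOI 9DIL; do
  wget -q -O "$DEST/${id}.pdb" "https://files.rcsb.org/download/${id}.pdb"
done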


Troubleshooting

FASTQ downloads fail checksum verification

This is the most common issue. GIAB FASTQ files are large (3-6 GB each) and the NCBI FTP server can be unreliable.

Symptoms: MD5 mismatch errors during download, or files that download but fail verification.

Solutions:

  1. Re-run the script — It's idempotent. Only failed files are re-downloaded:

    ./setup-data.sh --stage1
    

  2. Use fewer parallel connections — Reduces load on FTP server:

    ./setup-data.sh --stage1 --connections 4
    

  3. Wait and retry — NCBI rate-limits aggressive downloaders. Wait 15-30 minutes:

    sleep 1800 && ./setup-data.sh --stage1
    

  4. Try off-peak hours — NCBI servers are less loaded between roughly 2 and 6 AM US Eastern time.

  5. Check NCBI status — Verify the server isn't down: https://www.ncbi.nlm.nih.gov/Status/

  6. Verify existing downloads — See which files are good and which need re-download:

    ./setup-data.sh --verify
    

NCBI FTP server is slow or unreachable

Symptoms: Downloads stall, timeouts, "connection refused" errors.

Solutions:

  1. Check DNS resolution:

    nslookup ftp-trace.ncbi.nlm.nih.gov
    

  2. Check if HTTPS works (some networks block FTP):

    curl -I https://ftp-trace.ncbi.nlm.nih.gov/
    

  3. Check for firewall/proxy issues:

    wget --spider https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/
    

  4. Use a VPN — If your institution throttles FTP traffic.

  5. Check if your institution has a GIAB mirror — Some academic centers maintain local copies.

Disk space issues

Symptoms: "No space left on device" or pre-flight check failure.

Required space by stage:
  • Stage 1: ~350 GB (200 GB chunks + 11 GB reference + 100 GB merged + workspace)
  • Stage 2: ~2 GB
  • Stage 3: ~50 MB

Solutions:

  1. Download stages independently — Stage 2 only needs 2 GB:

    ./setup-data.sh --stage2
    

  2. Symlink data directories to a larger volume:

    ln -s /mnt/large-drive/genomics-data genomics-pipeline/data
    

  3. Skip the merge to save 100 GB (merged files can be created later):

    ./setup-data.sh --stage1 --skip-merge
    

  4. Free space:

    sudo apt-get clean
    docker system prune
    

Download interrupted — how to resume

The script is fully idempotent. Simply re-run the same command:

./setup-data.sh --stage1

How it works:
  1. A state file (.data-setup-state) tracks which files have been verified.
  2. On re-run, verified files are skipped instantly (no re-download, no re-checksum).
  3. Only missing or corrupt files are downloaded.
  4. The --verify flag forces a full re-verification (ignores the state file).

To reset and start fresh: Delete the state file:

rm .data-setup-state
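
Conceptually, the skip logic follows the pattern sketched below (illustrative only; the real script's bookkeeping may differ):

STATE_FILE=.data-setup-state

is_verified()   { grep -qxF "$1" "$STATE_FILE" 2>/dev/null; }
mark_verified() { echo "$1" >> "$STATE_FILE"; }

fetch_and_verify() {
    local file=$1 url=$2 md5=$3
    # Skip work that a previous run already verified
    is_verified "$file" && { echo "skip (already verified): $file"; return 0; }
    aria2c -x 8 -o "$file" "$url"
    echo "$md5  $file" | md5sum -c - && mark_verified "$file"
}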

DGX Spark (ARM64) specific notes

The HCLS AI Factory is designed to run on the NVIDIA DGX Spark with its ARM64 (aarch64) processor.

  • All required tools (aria2c, wget, pigz, md5sum) are available in ARM64 Ubuntu repos
  • Same commands work on both x86_64 and aarch64
  • Docker images used for samtools indexing support multi-architecture
  • The DGX Spark's 128 GB unified memory is sufficient for all pipeline stages

Install tools on DGX Spark:

sudo apt-get update
sudo apt-get install -y aria2 pigz

VAST AI OS deployment notes

For VAST R&D porting to VAST AI OS:

  • DataStore: Point data directories to VAST DataStore paths instead of local filesystem
  • DataEngine: FASTQ arrival can trigger automatic pipeline execution via event triggers
  • DataBase: Milvus can be replaced with VAST DataBase (unified SQL + vector)
  • Storage: VAST's parallel I/O significantly accelerates FASTQ download and merge operations
  • Data paths in genomics-pipeline/config/pipeline.env and rag-chat-pipeline/.env should be updated to VAST mount points
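
As a hypothetical example (the variable names below are placeholders, not confirmed against this repo; use the names actually defined in genomics-pipeline/config/pipeline.env and rag-chat-pipeline/.env), the change amounts to repointing the data roots at the VAST mount:

# genomics-pipeline/config/pipeline.env (hypothetical variable names)
DATA_DIR=/mnt/vast/hcls/genomics-pipeline/data
REF_DIR=/mnt/vast/hcls/genomics-pipeline/data/ref

# rag-chat-pipeline/.env (hypothetical variable names)
ANNOTATIONS_DIR=/mnt/vast/hcls/rag-chat-pipeline/data/annotations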

Data Sources and Licensing

Dataset | Provider | License | URL
--------|----------|---------|----
HG002 WGS | NIST (GIAB) | Public Domain | https://www.nist.gov/programs-projects/genome-bottle
GRCh38 Reference | NCBI | Public Domain | https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/
ClinVar | NCBI | Public Domain | https://www.ncbi.nlm.nih.gov/clinvar/
AlphaMissense | Google DeepMind | CC-BY-4.0 | https://github.com/google-deepmind/alphamissense
Parabricks Bundle | NVIDIA | Free to download | https://docs.nvidia.com/clara/parabricks/
PDB Structures | RCSB | CC0 1.0 | https://www.rcsb.org/

Prerequisites

Required tools

Tool | Purpose | Install
-----|---------|--------
aria2c | Parallel FASTQ downloads | sudo apt-get install -y aria2
wget | Reference genome + Stage 2 downloads | Usually pre-installed
pigz | Parallel FASTQ compression during merge | sudo apt-get install -y pigz
md5sum | Checksum verification | Pre-installed on Linux
Docker | Samtools indexing, Milvus, BioNeMo NIMs | https://docs.docker.com/get-docker/
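
A quick way to confirm everything is on PATH before starting:

for t in aria2c wget pigz md5sum docker; do
  command -v "$t" >/dev/null 2>&1 || echo "missing: $t"
done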

API keys

Key | Required For | Get It
----|--------------|-------
NGC API Key | Parabricks container, BioNeMo NIMs | https://ngc.nvidia.com/
Anthropic API Key | Claude AI in RAG pipeline | https://console.anthropic.com/
HuggingFace Token | Local LLM (optional, if using Llama) | https://huggingface.co/settings/tokens
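
The keys are typically supplied as environment variables or in each pipeline's .env file. The variable names below are common conventions, not confirmed against this repo, so check the pipeline configs for the exact names they read:

export NGC_API_KEY="<your NGC API key>"            # Parabricks container, BioNeMo NIMs
export ANTHROPIC_API_KEY="<your Anthropic key>"    # Claude AI in the RAG pipeline
export HF_TOKEN="<your HuggingFace token>"         # only if running a local Llama model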