Skip to content

Genomics Pipeline

NVIDIA Parabricks DGX Spark Docker Python License

Stage 1 of the Precision Medicine to Drug Discovery AI Factory

GPU-accelerated germline variant calling pipeline using NVIDIA Parabricks. This pipeline transforms raw DNA sequencing data (FASTQ) into analysis-ready variant calls (VCF) in under 2 hours using GPU acceleration.

┌──────────────────────────────────────────────────────────────────────────────────────┐
│                    PRECISION MEDICINE TO DRUG DISCOVERY AI FACTORY                   │
├──────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │  GENOMICS   │    │  RAG/CHAT   │    │   CRYO-EM   │    │ MOLECULE GENERATION │   │
│  │  PIPELINE   │───▶│  PIPELINE   │───▶│  EVIDENCE   │───▶│     (BioNeMo)       │   │
│  │ (This Repo) │    │             │    │             │    │                     │   │
│  └─────────────┘    └─────────────┘    └─────────────┘    └─────────────────────┘   │
│    FASTQ→VCF         VCF→Target        Target→Structure    Structure→Molecules      │
│    Parabricks        Milvus+Claude     PDB/EMDB            MolMIM+DiffDock          │
│                                                                                      │
└──────────────────────────────────────────────────────────────────────────────────────┘

Table of Contents


The Biological Foundation

From DNA to Disease: Why This Matters

Every cell in the human body contains DNA—the blueprint of human biology. All of that DNA together is called the genome. DNA is organized into chromosomes, and chromosomes contain genes. Genes encode proteins, and proteins are the molecular machines that make the body work.

When variants occur in genes, they can change how proteins function, and those changes can disrupt normal biology and lead to disease. Understanding these changes—and doing it at scale—is the foundation of modern precision medicine.

DNA (Blueprint)
Chromosomes (23 pairs in humans)
Genes (~20,000 protein-coding genes)
Proteins (Molecular machines)
Function (Normal biology or disease)

The Scale of Human Variation

A typical human genome contains approximately: - 3 billion base pairs (A, C, G, T) - 4-5 million variants compared to the reference genome - ~20,000 protein-coding genes - ~11.7 million total variants (including non-coding regions) in a 30x whole-genome sequence

Most variants are harmless—normal human diversity. But some affect genes in ways that change proteins, and those changes are often where disease begins. This pipeline is the first step in finding those critical variants.


From Biology to Digital Data

The Sequencing Process

We start with data from Genome in a Bottle (GIAB), a globally trusted reference initiative led by NIST that provides high-confidence human genome datasets used worldwide to validate sequencing and analysis. The genome we're using here, HG002, was generated with Illumina short-read sequencing—the clinical standard today.

What happens during sequencing:

  1. Sample Collection: DNA is extracted from cells (blood, saliva, tissue)
  2. Library Preparation: DNA is broken into millions of small fragments (~250bp each)
  3. Sequencing: Each fragment is read as a string of DNA letters (A, C, G, T)
  4. Quality Scoring: Each base call gets a confidence score (Phred quality)
  5. Output: FASTQ files containing hundreds of millions of reads

The challenge: A single 30x whole-genome sequence produces: - ~800 million read pairs - ~200 GB of raw data - Takes 24-48 hours on traditional CPU pipelines

This is where GPU acceleration transforms what's possible.

What is Secondary Analysis?

Primary analysis happens on the sequencer—converting light signals to base calls.

Secondary analysis (this pipeline) turns raw reads into meaningful variants:

┌─────────────────────────────────────────────────────────────────────────────┐
                         SECONDARY ANALYSIS PIPELINE                          ├─────────────────────────────────────────────────────────────────────────────┤
                                                                               FASTQ Files              BAM File                 VCF File                   (Raw Reads)              (Aligned Reads)          (Variants)                                                                                              ┌──────────┐    Align    ┌──────────┐    Call    ┌──────────┐               ACGTACGT    ──────▶    chr7:pos    ──────▶   GA at                  TGCATGCA    BWA-MEM2   chr7:pos   DeepVar    chr7:pos                GCTAGCTA               chr7:pos              Quality                   ...                    ...                 Score                  └──────────┘             └──────────┘            └──────────┘                                                                                           "What letters            "Where do they          "How does this           │
   were read?"              belong?"                differ from                                                                 reference?"             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

What This Pipeline Does

This pipeline processes whole-genome sequencing (WGS) data through a complete bioinformatics workflow using NVIDIA Parabricks, which accelerates genome analysis on GPUs—turning what used to take days on CPUs into hours.

The Four Steps

Step Tool Input Output Time (GPU)
1. Alignment BWA-MEM2 FASTQ reads Mapped positions 20-45 min
2. Sorting & Dedup fq2bam Mapped reads Sorted BAM (included)
3. Indexing samtools BAM file BAM index + QC 2-5 min
4. Variant Calling DeepVariant BAM file VCF file 10-35 min

Total: 120-240 minutes (vs. 24-48 hours on CPU)

Pipeline Flow

┌─────────────────────────────────────────────────────────────────────────────┐
                           NVIDIA PARABRICKS PIPELINE                         
├─────────────────────────────────────────────────────────────────────────────┤
                                                                             
   Input: FASTQ Files (HG002_R1.fastq.gz, HG002_R2.fastq.gz)                
          ~200GB paired-end reads from Illumina sequencing                   
                                                                             
                              ┌─────────────────┐                            
                                 Reference                                 
                                 GRCh38.fa                                 
                                 (3.1 GB)                                  
                              └────────┬────────┘                            
                                                                            
   ┌────────────┐                                                           
      FASTQ        ┌─────────────────────────────────────────┐            
      Files    │───▶│              fq2bam                                  
                      BWA-MEM2 alignment (GPU accelerated)              
     R1 + R2          Coordinate sorting                                
     ~200 GB          PCR duplicate marking                             
   └────────────┘    └──────────────────┬──────────────────────┘            
                                                                            
                                                                            
                     ┌─────────────────────────────────────────┐            
                                 BAM File                                  
                        Aligned reads with coordinates                    
                        Quality scores preserved                          
                        ~100 GB output                                    
                     └──────────────────┬──────────────────────┘            
                                                                            
                                                                            
                     ┌─────────────────────────────────────────┐            
                               samtools index                              
                        Create BAM index (.bai)                           
                        Generate alignment statistics                     
                     └──────────────────┬──────────────────────┘            
                                                                            
                                                                            
                     ┌─────────────────────────────────────────┐            
                                DeepVariant                                
                        Deep learning variant caller                      
                        GPU-accelerated inference                         
                        Trained on millions of examples                   
                     └──────────────────┬──────────────────────┘            
                                                                            
                                                                            
   Output: VCF File (HG002.genome.vcf.gz)                                   
           ~11.7 million variants identified                                 
           Ready for annotation in RAG/Chat Pipeline                         
                                                                             
└─────────────────────────────────────────────────────────────────────────────┘

Key Features

GPU Acceleration

  • 10-50x faster than CPU-only pipelines using NVIDIA Parabricks
  • Full genome analysis in 120-240 minutes vs. 24-48 hours
  • Optimized for NVIDIA DGX Spark, A100, V100, and consumer GPUs

Production-Ready Tools

  • BWA-MEM2: Industry-standard aligner used in clinical labs worldwide
  • DeepVariant: Google's deep learning variant caller (>99% accuracy)
  • GRCh38: Latest human reference genome build

Validated Output

  • Uses GIAB HG002 benchmark data with known truth sets
  • Enables accuracy benchmarking against gold-standard calls
  • Suitable for clinical genomics, research, and pharma applications

Dual Interface

  • Command Line: Scriptable pipeline for automation
  • Web Portal: Visual interface with real-time monitoring

Containerized & Reproducible

  • Fully Dockerized with NVIDIA Container Runtime
  • Version-controlled container images (Parabricks 4.6.0)
  • Consistent results across different systems

Architecture

System Architecture

┌────────────────────────────────────────────────────────────────────────────┐
                              HOST SYSTEM                                    
├────────────────────────────────────────────────────────────────────────────┤
                                                                            
   ┌────────────────────────────────────────────────────────────────────┐  
                         WEB PORTAL (Optional)                            
      Flask Server (Port 5000)                                            
       Real-time pipeline monitoring      Log streaming                 
       GPU utilization display            Configuration management      
       Progress tracking                  One-click step execution      
   └────────────────────────────────────────────────────────────────────┘  
                                                                           
   ┌────────────────────────────────────────────────────────────────────┐  
                         PIPELINE SCRIPTS                                 
                                                                          
      00-setup-check.sh ──▶ 01-ngc-login.sh ──▶ 02-download-data.sh     
                                                                         
                                                                         
      03-setup-reference.sh ──▶ 04-run-chr20-test.sh                     
                                                                         
                                                                         
      05-run-full-genome.sh ──▶ Output: VCF File                         
                                                                          
   └────────────────────────────────────────────────────────────────────┘  
                                                                           
   ┌────────────────────────────────────────────────────────────────────┐  
                 DOCKER + NVIDIA CONTAINER RUNTIME                        
      ┌────────────────────────────────────────────────────────────┐     
               clara-parabricks:4.6.0-1 Container                      
                                                                       
         ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      
          BWA-MEM2    samtools   DeepVariant  bcftools       
           (GPU)                   (GPU)                     
         └──────────┘  └──────────┘  └──────────┘  └──────────┘      
                                                                       
      └────────────────────────────────────────────────────────────┘     
   └────────────────────────────────────────────────────────────────────┘  
                                                                           
   ┌────────────────────────────────────────────────────────────────────┐  
                            NVIDIA GPU                                    
                                                                          
      CUDA Cores ─────── Tensor Cores ─────── GPU Memory                 
      (Alignment)        (DeepVariant)        (Data + Models)            
                                                                          
      Supported: DGX Spark (GB10), A100, V100, RTX 4090, RTX 3090       
                                                                          
   └────────────────────────────────────────────────────────────────────┘  
                                                                            
└────────────────────────────────────────────────────────────────────────────┘

Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
                              DATA FLOW                                       
├─────────────────────────────────────────────────────────────────────────────┤
                                                                             
   GIAB FTP Server                    NVIDIA Sample Bundle                   
                                                                           
                                                                           
   ┌──────────┐                        ┌──────────────┐                     
     FASTQ                             Reference                        
     Files                             GRCh38.fa                        
     ~200GB                            + Indexes                        
   └────┬─────┘                        └──────┬───────┘                     
                                                                           
        └──────────────┬──────────────────────┘                              
                                                                            
                                                                            
              ┌─────────────────┐                                           
                  fq2bam                                                  
                (Alignment)                                               
              └────────┬────────┘                                           
                                                                            
                                                                            
              ┌─────────────────┐                                           
                 BAM File                                                 
                 (~100 GB)                                                
              └────────┬────────┘                                           
                                                                            
                                                                            
              ┌─────────────────┐                                           
                DeepVariant                                               
               (Variant Call)                                             
              └────────┬────────┘                                           
                                                                            
                                                                            
              ┌─────────────────┐                                           
                 VCF File      │──────────▶  RAG/Chat Pipeline            
               ~11.7M variants              (Stage 2)                     
              └─────────────────┘                                           
                                                                             
└─────────────────────────────────────────────────────────────────────────────┘

System Requirements

Hardware Requirements

Component Minimum Recommended (DGX Spark) Notes
GPU 8GB VRAM 128GB (GB10) More VRAM = faster processing
GPU Architecture Volta (V100)+ Blackwell (GB10) Newer = better DeepVariant perf
System RAM 32 GB 512 GB BAM sorting is memory-intensive
Storage 500 GB SSD 1 TB NVMe Fast I/O critical for BAM files
CPU 8 cores 144 cores Parallel I/O and preprocessing

Software Requirements

Software Version Purpose
Operating System Ubuntu 20.04+ / RHEL 8+ Host OS
Docker 20.10+ Container runtime
NVIDIA Driver 525+ GPU driver
nvidia-container-toolkit Latest GPU container support
Python 3.11+ Web portal (optional)
aria2c Latest Parallel downloads (optional)

NGC Account (Required)

An NVIDIA NGC account is required to pull the Parabricks container:

  1. Sign up at ngc.nvidia.com
  2. Generate an API key at ngc.nvidia.com/setup/api-key
  3. The API key is used as your password (username is $oauthtoken)

Quick Start

# Clone the repository
git clone https://github.com/ajones1923/genomics-pipeline.git
cd genomics-pipeline

# Run the complete workflow
./run.sh check      # Verify prerequisites (Docker, GPU, disk space)
./run.sh login      # Authenticate with NGC
./run.sh download   # Download GIAB HG002 data (~200GB, 2-6 hours)
./run.sh reference  # Setup GRCh38 reference genome
./run.sh test       # Quick validation on chr20 (5-20 min)
./run.sh full       # Full genome analysis (120-240 min)

Option 2: Web Portal

# Start the web portal
cd web-portal
./start-portal.sh

# Open browser to http://localhost:5000
# Click through each step in the visual interface

Option 3: Using Your Own Data

# Copy your FASTQ files to the input directory
cp your_sample_R1.fastq.gz data/input/HG002_R1.fastq.gz
cp your_sample_R2.fastq.gz data/input/HG002_R2.fastq.gz

# Skip download, run the rest
./run.sh check
./run.sh login
./run.sh reference
./run.sh full

Installation

Step 1: Clone the Repository

git clone https://github.com/ajones1923/genomics-pipeline.git
cd genomics-pipeline

Step 2: Verify Prerequisites

./run.sh check

This verifies: - Docker installation and daemon status - NVIDIA Container Runtime configuration - GPU availability and driver version - Available disk space (minimum 500GB)

Expected output:

[CHECK] Docker installation... OK
[CHECK] Docker daemon running... OK
[CHECK] NVIDIA Container Runtime... OK
[CHECK] GPU detected: NVIDIA GB10 (128GB)
[CHECK] Driver version: 560.35.03
[CHECK] Available disk space: 1.2TB
[CHECK] All prerequisites met!

Step 3: Authenticate with NGC

./run.sh login

When prompted: - Username: $oauthtoken (enter this literally) - Password: Your NGC API key (from ngc.nvidia.com)

Step 4: Download Data

./run.sh download

This downloads the GIAB HG002 Illumina 2x250bp dataset: - Source: NCBI GIAB FTP server - Size: ~200GB (multiple lane files) - Features: Parallel downloads, automatic resume on interruption

Note: Skip this step if using your own FASTQ files.

Step 5: Setup Reference Genome

./run.sh reference

Downloads and indexes the GRCh38 human reference genome: - Reference FASTA (GRCh38.fa) - 3.1 GB - BWA index files (.bwt, .pac, .sa, .amb, .ann) - FASTA index (.fai) - Sequence dictionary (.dict)


Usage

Command Line Interface

The run.sh script provides a unified interface:

./run.sh <command>

Commands:
  check       Check prerequisites (Docker, GPU, disk space)
  login       Authenticate with NGC
  download    Download GIAB HG002 data (~200GB)
  reference   Setup GRCh38 reference genome
  test        Run chr20 fast test (~5-20 min)
  full        Run full genome pipeline (~120-240 min)
  clean       Clean output files (keeps input data)
  clean-all   Clean everything including downloaded data
  help        Show help message

Running the Chr20 Test (Validation)

Before running the full genome, validate your setup with chromosome 20:

./run.sh test

What this does: - Processes only chromosome 20 (~2% of genome) - Validates GPU acceleration is working - Completes in 5-20 minutes - Produces: HG002.chr20.bam, HG002.chr20.vcf.gz

Running Full Genome Analysis

After validation:

./run.sh full

What this does: - Processes all chromosomes (chr1-22, chrX, chrY, chrM) - Takes 120-240 minutes depending on GPU - Produces: HG002.genome.bam, HG002.genome.vcf.gz

Web Portal

The web portal provides visual pipeline management:

cd web-portal
./start-portal.sh
# Open http://localhost:5000

Portal Features: - Click-to-run buttons for each pipeline step - Real-time console output streaming - Live GPU utilization monitoring - Configuration management UI - Historical log browser


Pipeline Steps in Detail

Step 0: Prerequisites Check

Script: scripts/00-setup-check.sh

Validates all system requirements before running the pipeline:

Check Requirement Why It Matters
Docker Installed + running Container runtime
NVIDIA Runtime Configured GPU access in containers
GPU Detected Acceleration
Driver 525+ CUDA compatibility
Disk Space 500GB+ BAM files are large

Step 1: NGC Authentication

Script: scripts/01-ngc-login.sh

Authenticates with NVIDIA NGC to pull the Parabricks container:

docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

Step 2: Data Download

Script: scripts/02-download-data.sh

Downloads GIAB HG002 benchmark genome:

File Size Description
HG002_R1.fastq.gz ~100GB Forward reads (Read 1)
HG002_R2.fastq.gz ~100GB Reverse reads (Read 2)

Features: - Parallel downloads using aria2c (8 connections) - Automatic resume on network interruption - Checksum validation - Lane file merging

Step 3: Reference Genome Setup

Script: scripts/03-setup-reference.sh

Downloads and prepares the GRCh38 human reference:

File Size Purpose
GRCh38.fa 3.1 GB Reference sequence
GRCh38.fa.fai 3 KB FASTA index
GRCh38.fa.bwt 3 GB BWA index
GRCh38.dict 50 KB Sequence dictionary

Step 4: Chr20 Test

Script: scripts/04-run-chr20-test.sh

Quick validation run using only chromosome 20:

# Alignment (fq2bam)
pbrun fq2bam \
  --ref data/ref/GRCh38.fa \
  --in-fq data/input/HG002_R1.fastq.gz data/input/HG002_R2.fastq.gz \
  --out-bam data/output/HG002.chr20.bam \
  --interval chr20

# Variant Calling (DeepVariant)
pbrun deepvariant \
  --ref data/ref/GRCh38.fa \
  --in-bam data/output/HG002.chr20.bam \
  --out-vcf data/output/HG002.chr20.vcf.gz

Step 5: Full Genome Analysis

Script: scripts/05-run-full-genome.sh

Complete whole-genome analysis:

Sub-step 5.1: fq2bam (Alignment)

pbrun fq2bam \
  --ref data/ref/GRCh38.fa \
  --in-fq data/input/HG002_R1.fastq.gz data/input/HG002_R2.fastq.gz \
  --out-bam data/output/HG002.genome.bam \
  --num-gpus 1

What fq2bam does: 1. Alignment: Maps reads to reference using BWA-MEM2 (GPU) 2. Sorting: Orders reads by genomic coordinate 3. Duplicate Marking: Flags PCR duplicates

Sub-step 5.2: BAM Indexing

samtools index data/output/HG002.genome.bam
samtools flagstat data/output/HG002.genome.bam

Sub-step 5.3: DeepVariant (Variant Calling)

pbrun deepvariant \
  --ref data/ref/GRCh38.fa \
  --in-bam data/output/HG002.genome.bam \
  --out-vcf data/output/HG002.genome.vcf.gz \
  --num-gpus 1

What DeepVariant does: 1. Pileup Images: Creates images of read alignments at each position 2. CNN Inference: Deep learning model classifies variants (GPU) 3. Variant Calling: Outputs high-confidence variant calls


Understanding the Output VCF

What is a VCF?

A VCF (Variant Call Format) file is a structured summary of how one genome differs from the human reference. Instead of storing every DNA letter (3 billion bases), the VCF records only the meaningful differences.

VCF Statistics (HG002 Full Genome)

Metric Value Notes
Total Variants ~11.7 million All chromosomes
SNPs ~4.2 million Single nucleotide changes
Indels ~1.0 million Insertions/deletions
High Quality (QUAL>30) ~3.5 million Confident calls
In Coding Regions ~35,000 Potentially functional

VCF Format Example

#CHROM  POS       ID           REF  ALT  QUAL   FILTER  INFO                    FORMAT  HG002
chr7    117559590 rs188935092  G    A    45.2   PASS    DP=32;AF=0.5           GT:DP   0/1:32
chr17   7674220   rs1042522    G    C    99.0   PASS    DP=45;AF=0.5           GT:DP   0/1:45
Column Meaning
CHROM Chromosome (chr7)
POS Position on chromosome (117559590)
ID dbSNP identifier (rs188935092)
REF Reference allele (G)
ALT Alternate allele (A)
QUAL Quality score (higher = more confident)
GT Genotype (0/1 = heterozygous)

What Happens Next

The VCF file feeds into the RAG/Chat Pipeline (Stage 2), where: 1. Variants are annotated with ClinVar, AlphaMissense, VEP 2. Embedded into vector database (Milvus) 3. Connected to knowledge graph (Clinker) 4. Queried via natural language with Claude


Configuration

Configuration File

Edit config/pipeline.env to customize behavior:

# GPU Configuration
NUM_GPUS=1              # Number of GPUs to use (1-8)
LOW_MEMORY=0            # Set to 1 for GPUs with <16GB VRAM

# Sample Configuration
PATIENT_ID=HG002        # Sample identifier in output filenames

# Reference Genome
REF_BUILD=GRCh38        # Reference genome build

# Container Image
PB_IMG=nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1

# Performance Tuning
FQ2BAM_ARGS=""          # Additional fq2bam arguments
DEEPVARIANT_ARGS=""     # Additional DeepVariant arguments

Configuration Options

Parameter Default Description
NUM_GPUS 1 GPUs for parallel processing
LOW_MEMORY 0 Enable for GPUs with <16GB VRAM
PATIENT_ID HG002 Sample ID in output filenames
PB_IMG 4.6.0-1 Parabricks container version

Multi-GPU Configuration

For systems with multiple GPUs:

# config/pipeline.env
NUM_GPUS=4  # Use 4 GPUs in parallel

This distributes the workload across GPUs, reducing processing time proportionally.


Directory Structure

genomics-pipeline/
├── run.sh                      # Main CLI interface
├── README.md                   # This documentation
├── QUICKSTART.md               # Quick start guide
├── WEB-PORTAL-GUIDE.md         # Web portal documentation
├── .gitignore                  # Git ignore patterns

├── config/
   └── pipeline.env            # Pipeline configuration

├── scripts/
   ├── 00-setup-check.sh       # Prerequisites validation
   ├── 01-ngc-login.sh         # NGC authentication
   ├── 02-download-data.sh     # GIAB data download
   ├── 03-setup-reference.sh   # Reference genome setup
   ├── 04-run-chr20-test.sh    # Chr20 validation test
   └── 05-run-full-genome.sh   # Full genome pipeline

├── data/
   ├── input/                  # Input FASTQ files
      ├── HG002_R1.fastq.gz   # Forward reads (~100GB)
      └── HG002_R2.fastq.gz   # Reverse reads (~100GB)
   ├── ref/                    # Reference genome
      ├── GRCh38.fa           # Reference FASTA (3.1GB)
      ├── GRCh38.fa.fai       # FASTA index
      ├── GRCh38.fa.bwt       # BWA index
      └── GRCh38.dict         # Sequence dictionary
   └── output/                 # Pipeline outputs
       ├── logs/               # Execution logs
       ├── HG002.chr20.bam     # Chr20 test BAM
       ├── HG002.chr20.vcf.gz  # Chr20 test VCF
       ├── HG002.genome.bam    # Full genome BAM (~100GB)
       ├── HG002.genome.bam.bai # BAM index
       ├── HG002.genome.vcf.gz # Full genome VCF (→ RAG Pipeline)
       └── HG002.genome.vcf.gz.tbi # VCF index

├── web-portal/
   ├── start-portal.sh         # Portal startup script
   ├── requirements.txt        # Python dependencies
   ├── app/
      └── server.py           # Flask backend
   ├── templates/
      └── index.html          # Main UI
   └── static/
       ├── css/style.css       # Styles
       └── js/app.js           # Frontend logic

└── docs/                       # Additional documentation

Performance Benchmarks

Expected Timings by GPU

GPU Model VRAM Chr20 Test Full Genome Notes
DGX Spark (GB10) 128GB 5-10 min 30-60 min Recommended
NVIDIA A100 80GB 3-8 min 25-45 min Data center
NVIDIA A100 40GB 5-10 min 35-55 min Data center
NVIDIA V100 32GB 8-15 min 50-75 min Older data center
NVIDIA RTX 4090 24GB 6-12 min 40-60 min Consumer
NVIDIA RTX 3090 24GB 8-15 min 50-75 min Consumer

CPU baseline: 32-core Intel Xeon takes 24-48 hours

Step-by-Step Timing Breakdown

Step Chr20 Full Genome % of Total
fq2bam (alignment) 2-8 min 20-45 min ~60%
BAM indexing <1 min 2-5 min ~5%
DeepVariant 2-6 min 15-35 min ~35%
Total 5-15 min 37-85 min 100%

Resource Utilization

Resource During fq2bam During DeepVariant
GPU Compute 70-90% 80-95%
GPU Memory 8-12 GB 12-20 GB
System RAM 24-48 GB 16-32 GB
Disk I/O High (write) Moderate (read)

Output Files

Primary Outputs

File Description Size Next Step
HG002.genome.bam Aligned, sorted, deduplicated reads ~100 GB Archive
HG002.genome.bam.bai BAM index for random access ~10 MB With BAM
HG002.genome.vcf.gz Variant calls (main output) ~1-2 GB → RAG Pipeline
HG002.genome.vcf.gz.tbi VCF tabix index ~2 MB With VCF

Log Files

Located in data/output/logs/:

Log Contents
genome_fq2bam.log Alignment metrics, timing
genome_flagstat.log Alignment quality statistics
genome_deepvariant.log Variant calling statistics

VCF Downstream Usage

# Count total variants
bcftools view -H data/output/HG002.genome.vcf.gz | wc -l
# Output: ~11,700,000

# Filter high-quality variants (QUAL > 30)
bcftools filter -i 'QUAL>30' data/output/HG002.genome.vcf.gz | wc -l
# Output: ~3,500,000

# Extract specific gene region (TP53)
bcftools view -r chr17:7668402-7687550 data/output/HG002.genome.vcf.gz

# Copy to RAG Pipeline
cp data/output/HG002.genome.vcf.gz* ../rag-chat-pipeline/data/input/

Troubleshooting

Common Issues

GPU Out of Memory

Symptom: CUDA out of memory error during fq2bam or DeepVariant

Solution:

# Enable low-memory mode
# Edit config/pipeline.env:
LOW_MEMORY=1

# Re-run pipeline
./run.sh full

Docker Permission Denied

Symptom: Got permission denied while trying to connect to the Docker daemon

Solution:

# Add user to docker group
sudo usermod -aG docker $USER

# Log out and back in, or:
newgrp docker

NVIDIA Container Runtime Not Found

Symptom: could not select device driver "nvidia"

Solution:

# Install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

NGC Authentication Failed

Symptom: unauthorized: authentication required

Solution:

# Verify credentials
docker logout nvcr.io
docker login nvcr.io
# Username: $oauthtoken (literally)
# Password: <your NGC API key>

Download Interrupted

Symptom: Partial FASTQ files after network interruption

Solution:

# aria2c automatically resumes - just re-run
./run.sh download

Checking Logs

# Real-time monitoring
tail -f data/output/logs/genome_fq2bam.log

# GPU utilization during processing
watch -n 1 nvidia-smi

# Check all logs
ls -lh data/output/logs/

This pipeline is Stage 1 of the Precision Medicine to Drug Discovery AI Factory:

Stage Pipeline Repository Description
1 Genomics Pipeline genomics-pipeline FASTQ → VCF (This repo)
2 RAG/Chat Pipeline rag-chat-pipeline VCF → Target Hypothesis
3 Drug Discovery Pipeline drug-discovery-pipeline Target → Molecule Candidates

Complete Demo Flow

Stage 1 (This Pipeline)
    
      FASTQ files (200GB raw sequencing data)
            
            
      Parabricks fq2bam + DeepVariant
            
            
      VCF file (11.7M variants)
    
    └───────────────────────────────────────────▶ Stage 2 (RAG/Chat)
                                                      
                                                        Annotate with ClinVar,
                                                        AlphaMissense, VEP
                                                              
                                                              
                                                        Vector DB (Milvus)
                                                              
                                                              
                                                        Claude RAG + Clinker
                                                              
                                                              
                                                        "VCP is a druggable
                                                         target for FTD"
                                                      
                                                      └──────▶ Stage 3 (Drug Discovery)
                                                                    
                                                                    
                                                              Generate drug
                                                              candidates for VCP

References

Documentation

Data Sources


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Acknowledgments

  • NVIDIA for Parabricks and GPU-accelerated bioinformatics
  • NIST/GIAB for the HG002 benchmark genome dataset
  • Google Health for the DeepVariant variant caller
  • Broad Institute for samtools and bcftools

Note: This pipeline uses public, de-identified genomics data (GIAB HG002) for demonstration and validation purposes. For clinical use, ensure compliance with relevant regulations and institutional guidelines.