FETCH_GENOME Module¶
Overview¶
The FETCH_GENOME module downloads genome assembly sequences from NCBI in FASTA format using the GenBank Assembly (GCA) accession. It supports both direct file paths and automated downloads from NCBI's FTP server. Downloaded genomes are decompressed if necessary and made available for downstream quality assessment tools like BUSCO.
Module Location: pipelines/statistics/modules/fetch_genome.nf
Functionality¶
This module performs two key functions:
- Direct File Access: If meta.genome_file is provided, uses the specified file directly
- NCBI Download: If no file is provided, automatically downloads from NCBI using the GCA accession
The module handles:
- Genome download from NCBI FTP servers
- Decompression of .gz files
- File naming and organization
- Checksum validation (when available)
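The checksum step in the list above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `verify_genome_md5` is hypothetical, and the sketch assumes the checksum list uses the `<md5>  ./<filename>` layout of NCBI's md5checksums.txt.

```shell
# verify_genome_md5 <checksums_file> <genome_gz>   (hypothetical helper)
# Looks up the expected MD5 for one file in an md5checksums.txt-style list
# (assumed layout: "<md5>  ./<filename>" per line) and compares it with the
# MD5 of the local copy. Fails if the file is not listed or the sums differ.
verify_genome_md5() {
    vg_name=$(basename "$2")
    vg_expected=$(awk -v f="./$vg_name" '$2 == f { print $1 }' "$1")
    vg_actual=$(md5sum "$2" | awk '{ print $1 }')
    [ -n "$vg_expected" ] && [ "$vg_expected" = "$vg_actual" ]
}

# Example (not executed here):
# verify_genome_md5 md5checksums.txt GCA_000005845.2_ASM584v2_genomic.fna.gz \
#     || echo "checksum mismatch" >&2
```

Looking up a single entry avoids failures for the other files listed in the checksum file that were never downloaded.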
Inputs¶
Channel Input¶
Metadata Map:
[
gca: String, // REQUIRED: GCA accession (e.g., "GCA_000001405.29")
dbname: String, // Database name
species_id: Integer, // Species ID
taxon_id: String, // Taxonomy ID
production_species: String, // Production name (from DB_METADATA)
busco_dataset: String, // BUSCO dataset
genome_file: String // Optional: Direct path to genome file
]
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| params.outdir | String | ./results | Base output directory |
| params.cacheDir | String | ./cache | Directory for caching downloads |
Outputs¶
Channel Outputs¶
| Channel | Type | Description |
|---|---|---|
| genome_file_output | tuple val(meta), path("*.fna") | Metadata + genome FASTA file |
| versions_file | path("versions.yml") | Software versions used |
File Outputs¶
Genome FASTA File¶
Location: ${params.cacheDir}/${meta.gca}/*.fna
Naming Convention:
- NCBI Download: GCA_XXXXXXXXX.Y_<assembly_name>_genomic.fna
- Direct File: Original filename (must be .fna or .fna.gz)
Format: Standard FASTA format
Content Example:
>CM000663.2 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
...
File Size: Varies (50 MB - 10 GB typical range)
Versions File¶
Location: ${params.outdir}/${meta.gca}/versions.yml
Format: YAML
Content Example:
Process Configuration¶
Directives¶
scratch false // Don't use scratch directory
label 'fetch_file' // Use fetch_file resource allocation
tag "${meta.gca}" // Tag with GCA accession
storeDir "${params.cacheDir}/${meta.gca}" // Permanent cache
Resource Allocation¶
From nextflow.config (fetch_file label):
- CPUs: 2
- Memory: 2 GB
- Time: 4 hours
- Queue: Standard
Container¶
Installed Tools:
- wget - For downloading from NCBI FTP
- gzip - For decompression
- md5sum - For checksum validation
Implementation Details¶
NCBI FTP Structure¶
NCBI genomes are hosted at:
URL Pattern:
Example:
# GCA_000001405.29 -> GCA/000/001/405/GCA_000001405.29_GRCh38.p14/
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz
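As the example shows, the directory prefix follows from the accession alone: the nine digits are split into three groups of three. Below is a minimal sketch of that mapping. The helper name `gca_dir_url` is hypothetical and not taken from the module. The assembly name (GRCh38.p14 in the example) cannot be derived from the accession, so the sketch stops at the directory that holds all versions of the assembly.

```shell
# gca_dir_url <accession>   (hypothetical helper)
# Prints the NCBI directory URL that contains every version of an assembly,
# e.g. GCA_000001405.29 -> .../genomes/all/GCA/000/001/405/
gca_dir_url() {
    gd_digits="${1#GCA_}"          # strip the "GCA_" prefix  -> 000001405.29
    gd_digits="${gd_digits%%.*}"   # strip the version suffix -> 000001405
    gd_a=$(printf '%s' "$gd_digits" | cut -c1-3)
    gd_b=$(printf '%s' "$gd_digits" | cut -c4-6)
    gd_c=$(printf '%s' "$gd_digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/%s/%s/%s/\n' \
        "$gd_a" "$gd_b" "$gd_c"
}

gca_dir_url "GCA_000001405.29"
# prints https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/
```

The full assembly directory name can then be taken from that directory's listing.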
Download Logic¶
The module uses a bash script to:
- Check for Direct File
- Parse GCA Accession
- Construct NCBI URL
- Download Genome
- Decompress
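The control flow of these steps can be sketched roughly as below. This is an illustration under stated assumptions, not the module's script: `stage_genome` and its arguments are hypothetical names, and the download branch is stubbed out because the assembly directory name has to be looked up on the server.

```shell
# stage_genome <genome_file> <gca> <workdir>   (hypothetical helper)
# Prints the path of the uncompressed FASTA file on success.
stage_genome() {
    sg_file="$1"; sg_gca="$2"; sg_work="$3"
    if [ -n "$sg_file" ]; then
        # Check for direct file: use it as-is, no download needed
        [ -f "$sg_file" ] || { echo "missing file: $sg_file" >&2; return 1; }
        cp "$sg_file" "$sg_work/"
    else
        # Parse accession, construct URL, download: stubbed in this sketch
        echo "would download $sg_gca from NCBI" >&2
        return 2
    fi
    # Decompress only when the staged file is gzip-compressed
    sg_staged="$sg_work/$(basename "$sg_file")"
    case "$sg_staged" in
        *.gz) gunzip -f "$sg_staged"; sg_staged="${sg_staged%.gz}" ;;
    esac
    echo "$sg_staged"
}

# Example (not executed here):
# stage_genome /data/genomes/human_genome.fna.gz GCA_000001405.29 .
```

Checking for the direct file first means a run with a local genome never touches the network.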
Caching Strategy¶
The module uses storeDir for permanent caching:
Benefits:
- Downloaded files persist across pipeline runs
- Saves bandwidth and time for repeated analyses
- Shared cache across multiple workflows
Cache Location: ${params.cacheDir}/GCA_XXXXXXXXX.Y/
Usage Example¶
In a Workflow¶
include { FETCH_GENOME } from '../modules/fetch_genome.nf'
workflow {
// Create input channel with GCA accession
def input_ch = channel.of([
gca: 'GCA_000001405.29',
dbname: 'homo_sapiens_core_110_38',
species_id: 1,
taxon_id: '9606',
production_species: 'homo_sapiens',
busco_dataset: 'vertebrata_odb10',
genome_file: '' // Empty = download from NCBI
])
// Fetch genome
def genome_ch = FETCH_GENOME(input_ch).genome_file_output
// Use genome file
genome_ch.view { meta, fna_file ->
"Downloaded genome for ${meta.production_species}: ${fna_file}"
}
}
With Direct File Path¶
workflow {
// Use pre-downloaded genome file
def input_ch = channel.of([
gca: 'GCA_000001405.29',
dbname: 'homo_sapiens_core_110_38',
species_id: 1,
taxon_id: '9606',
production_species: 'homo_sapiens',
busco_dataset: 'vertebrata_odb10',
genome_file: '/data/genomes/human_genome.fna.gz'
])
FETCH_GENOME(input_ch)
}
Expected Output¶
Console Output:
Downloaded genome for homo_sapiens: /path/to/cache/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna
Cached File: cache/GCA_000001405.29/GCA_000001405.29_GRCh38.p14_genomic.fna
File Size: ~3.1 GB (uncompressed)
Error Handling¶
Common Errors¶
1. Invalid GCA Accession¶
Error Message:
Cause: GCA accession doesn't exist on NCBI FTP server
Solution:
- Verify GCA accession at NCBI Assembly
- Check for typos in accession number
- Ensure full accession with version (e.g., GCA_000001405.29, not GCA_000001405)
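A format check can catch truncated or unversioned accessions before any network request is made. The sketch below uses a hypothetical helper, `is_valid_gca`, that accepts only `GCA_` followed by nine digits and a version. It checks the format only and cannot tell whether the assembly exists at NCBI.

```shell
# is_valid_gca <accession>   (hypothetical helper)
# Succeeds only for versioned accessions of the form GCA_#########.N
is_valid_gca() {
    printf '%s\n' "$1" | grep -Eq '^GCA_[0-9]{9}\.[0-9]+$'
}

for acc in GCA_000001405.29 GCA_000001405 GCA_00001405.29; do
    if is_valid_gca "$acc"; then
        echo "$acc: ok"
    else
        echo "$acc: invalid (expected GCA_#########.version)"
    fi
done
```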
2. Network Timeout¶
Error Message:
Solution:
- Check internet connectivity
- Increase timeout: wget --timeout=600
- Try alternative NCBI mirror
- Consider using --genome_file parameter with pre-downloaded genome
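Transient network failures can often be handled by retrying with a pause between attempts. The wrapper below is a generic sketch; `retry` is a hypothetical helper and is not part of the module. wget also has its own `--tries` and `--waitretry` options, which may be enough on their own.

```shell
# retry <attempts> <delay_seconds> <command...>   (hypothetical helper)
# Runs the command until it succeeds or the attempt limit is reached.
retry() {
    rt_attempts="$1"; rt_delay="$2"; shift 2
    rt_n=1
    while :; do
        "$@" && return 0
        if [ "$rt_n" -ge "$rt_attempts" ]; then
            echo "failed after $rt_n attempts: $*" >&2
            return 1
        fi
        echo "attempt $rt_n failed, retrying in ${rt_delay}s" >&2
        sleep "$rt_delay"
        rt_n=$((rt_n + 1))
    done
}

# Example (not executed here): --continue resumes a partial download
# retry 3 30 wget --continue --timeout=600 "$URL"
```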
3. Insufficient Disk Space¶
Error Message:
Solution:
- Check available disk space: df -h
- Clean cache directory: rm -rf cache/GCA_*
- Use different cache location: --cacheDir /path/to/larger/disk
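Free space can also be checked up front, so that a download fails early rather than part-way through. In the sketch below, `has_free_space` is a hypothetical helper and the 4 GiB threshold is an illustrative number, not a module default.

```shell
# has_free_space <directory> <required_kb>   (hypothetical helper)
# Succeeds when the filesystem holding <directory> has at least
# <required_kb> kilobytes available (POSIX "df -Pk" output, column 4).
has_free_space() {
    hf_avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$hf_avail" -ge "$2" ]
}

# Illustrative threshold: 4 GiB = 4194304 KB
if has_free_space . 4194304; then
    echo "enough space"
else
    echo "not enough space"
fi
```

The threshold should cover both the compressed download and the uncompressed FASTA file, since both exist on disk during decompression.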
4. Corrupted Download¶
Error Message:
Solution:
# Remove corrupted file
rm cache/GCA_XXXXXXXXX.Y/*.fna.gz
# Re-run pipeline (will re-download)
nextflow run main.nf -resume
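A truncated or corrupted archive can be detected before decompression with gzip's built-in test mode. The sketch below wraps `gzip -t` in a hypothetical helper, `check_gzip`. It is an illustration and does not show how the module itself checks downloads.

```shell
# check_gzip <file>   (hypothetical helper)
# Succeeds only when the whole gzip stream decompresses cleanly and its
# stored CRC matches. The file itself is left untouched.
check_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Hypothetical usage before decompressing a download:
# if check_gzip "$archive"; then
#     gunzip "$archive"
# else
#     rm -f "$archive"   # force a fresh download on the next run
# fi
```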
5. Missing Assembly on NCBI¶
Error Message:
Cause: Assembly removed or suppressed by NCBI
Solution:
- Check assembly status at NCBI
- Look for replacement assembly
- Download genome manually and use --genome_file parameter
Version Tracking¶
The module captures software versions:
Captured Versions:
- wget version used for download
- gzip version used for decompression
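One way to capture these versions is sketched below. The helper names `first_version` and `write_versions` are hypothetical. The YAML layout, with the process name as the key, follows a common Nextflow convention and is an assumption here; the module's real versions.yml may differ.

```shell
# first_version   (hypothetical helper)
# Prints the first dotted version number found on stdin.
first_version() {
    grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1
}

# write_versions <process_name> <wget_version> <gzip_version>   (hypothetical helper)
# Writes versions.yml in the current directory.
write_versions() {
    cat > versions.yml <<EOF
"$1":
    wget: $2
    gzip: $3
EOF
}

# Typical use inside a process script (requires wget and gzip on PATH):
# write_versions "FETCH_GENOME" \
#     "$(wget --version | first_version)" \
#     "$(gzip --version | first_version)"
```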
Integration with Other Modules¶
Upstream Modules¶
- DB_METADATA: Provides gca accession and production name
- BUSCO_DATASET: Provides BUSCO dataset information
Downstream Modules¶
- BUSCO_GENOME_LINEAGE: Uses genome file for BUSCO assessment
Data Flow Diagram¶
graph LR
A[DB_METADATA] --> B[BUSCO_DATASET]
B --> C[FETCH_GENOME]
C --> D[BUSCO_GENOME_LINEAGE]
style C fill:#90EE90
Best Practices¶
1. Use Shared Cache¶
For multi-user environments, use a shared cache:
# Create shared cache directory
mkdir -p /shared/data/genome_cache
chmod 775 /shared/data/genome_cache
# Configure pipeline
nextflow run main.nf \
--cacheDir /shared/data/genome_cache
2. Pre-download Large Genomes¶
For large genomes (>5 GB), pre-download to avoid timeouts:
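# Download manually (replace the placeholders; wget does not expand wildcards in HTTPS URLs)
wget "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/XXX/YYY/ZZZ/GCA_XXXXXXXXX.Y_<assembly_name>/GCA_XXXXXXXXX.Y_<assembly_name>_genomic.fna.gz"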
# Use in pipeline
nextflow run main.nf \
--genome_file /data/genomes/large_genome.fna.gz
3. Monitor Disk Usage¶
Genomes can consume significant disk space:
# Check cache size
du -sh cache/
# List largest genomes
du -h cache/GCA_* | sort -h | tail -10
# Clean old genomes (older than 30 days)
find cache/ -name "*.fna" -mtime +30 -delete
4. Validate Downloaded Genomes¶
Verify genome integrity after download:
# Check FASTA format
head -n 1 cache/GCA_000001405.29/*.fna
# Should start with '>'
# Count sequences
grep -c "^>" cache/GCA_000001405.29/*.fna
# Check file size
ls -lh cache/GCA_000001405.29/*.fna
Performance Considerations¶
Download Times¶
Typical download times (100 Mbps connection):
| Genome Size | Compressed | Uncompressed | Download Time | Decompression |
|---|---|---|---|---|
| Small (10 MB) | 3 MB | 10 MB | 10 seconds | 5 seconds |
| Medium (100 MB) | 30 MB | 100 MB | 1 minute | 30 seconds |
| Large (1 GB) | 300 MB | 1 GB | 5 minutes | 3 minutes |
| Very Large (10 GB) | 3 GB | 10 GB | 30 minutes | 20 minutes |
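For comparison with the table, the line-rate lower bound for a transfer is the compressed size in megabits divided by the bandwidth. The sketch below computes that bound with integer arithmetic. The helper name `min_transfer_seconds` is hypothetical. Real downloads are usually slower than this bound, so the table's figures are best read as rough estimates.

```shell
# min_transfer_seconds <size_mb> <bandwidth_mbps>   (hypothetical helper)
# Lower bound only: megabytes * 8 bits / megabits per second, rounded up.
min_transfer_seconds() {
    echo $(( ($1 * 8 + $2 - 1) / $2 ))
}

min_transfer_seconds 300 100     # "Large" row, 300 MB compressed: prints 24
min_transfer_seconds 3000 100    # "Very Large" row, 3 GB taken as 3000 MB: prints 240
```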
Parallelization¶
Multiple genomes can be downloaded in parallel:
// Download 10 genomes simultaneously
channel.fromPath('genomes.csv')
.splitCsv(header: true)
.map { row -> [gca: row.gca, /*...*/] }
| FETCH_GENOME // Downloads in parallel
Recommended concurrency: 5-10 simultaneous downloads
Network consideration: Limit concurrent downloads to avoid overwhelming NCBI servers
Caching Benefits¶
With storeDir caching:
- First run: Full download time (minutes to hours)
- Subsequent runs: Instant (0 seconds) - file already cached
- Cache hit rate: Typically >80% for repeated analyses
Resource Usage¶
During download:
- CPU: <10% (wget is I/O bound)
- Memory: <100 MB
- Disk I/O: Write speed limited by network
- Network: Full available bandwidth
During decompression:
- CPU: 100% (single-threaded gzip)
- Memory: <500 MB
- Disk I/O: Heavy write activity
Advanced Features¶
Custom NCBI Mirror¶
Use alternative NCBI mirror for faster downloads:
script:
"""
# Use European mirror
NCBI_MIRROR="ftp.ebi.ac.uk/pub/databases/genbank"
wget \${NCBI_MIRROR}/genomes/all/GCA/...
"""
Checksum Validation¶
Add MD5 checksum validation:
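script:
"""
# Assumes ASSEMBLY_URL is a shell variable set earlier in this script (no trailing slash), hence the escaped dollar signs
# Download genome and checksum (wget does not expand wildcards over HTTP(S), so name the file)
wget \${ASSEMBLY_URL}/\$(basename \${ASSEMBLY_URL})_genomic.fna.gz
wget \${ASSEMBLY_URL}/md5checksums.txt
# Validate only the downloaded file; md5checksums.txt lists every file in the assembly directory
md5sum -c --ignore-missing md5checksums.txt
"""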
Compressed Output¶
Keep genome compressed to save disk space:
output:
path("*.fna.gz"), emit: genome_file_output
script:
"""
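# Skip decompression step
# Assumes ASSEMBLY_URL is a shell variable (no trailing slash); wget does not expand wildcards over HTTP(S)
wget \${ASSEMBLY_URL}/\$(basename \${ASSEMBLY_URL})_genomic.fna.gz
# Don't gunzip - use compressed file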
"""
Note: Downstream tools must support .gz input
Testing¶
Unit Test¶
Test genome download for a small genome:
# Test with E. coli genome (small, fast download)
nextflow run pipelines/statistics/main.nf \
--run_busco_ncbi \
--csvFile test_data/test_fetch_genome.csv \
--cacheDir ./test_cache \
--outdir ./test_results \
--max_cpus 2 \
-entry BUSCO \
-process.executor 'local'
test_fetch_genome.csv:
gca,taxon_id,species_id,busco_dataset,genome_file,protein_file
GCA_000005845.2,511145,1,bacteria_odb10,,
Expected Test Result¶
File: test_cache/GCA_000005845.2/GCA_000005845.2_ASM584v2_genomic.fna
File Size: ~4.6 MB
Sequences: 1 (single circular chromosome; E. coli K-12 MG1655 is a complete genome with no plasmids)
Validation¶
Verify genome download:
# Check file exists
ls -lh test_cache/GCA_000005845.2/*.fna
# Count sequences
grep -c "^>" test_cache/GCA_000005845.2/*.fna
# Check sequence headers
grep "^>" test_cache/GCA_000005845.2/*.fna | head -5
Expected headers:
Troubleshooting¶
Debug Mode¶
Enable verbose wget output:
Check NCBI FTP Availability¶
Test NCBI FTP access:
# Test connection
curl -I https://ftp.ncbi.nlm.nih.gov/genomes/
# List available assemblies for a GCA
curl -s https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/ | grep "GCA_000001405"
Manual Download¶
Download genome manually for troubleshooting:
# Find assembly directory
GCA="GCA_000001405.29"
curl -s https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/ | grep ${GCA}
# Download genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz
# Decompress
gunzip GCA_000001405.29_GRCh38.p14_genomic.fna.gz
# Use in pipeline
nextflow run main.nf --genome_file $(pwd)/GCA_000001405.29_GRCh38.p14_genomic.fna
Related Documentation¶
- DB_METADATA Module - Provides GCA accession
- BUSCO_GENOME_LINEAGE Module - Uses genome file for assessment
- BUSCO Workflow - Complete workflow using this module
References¶
- NCBI Assembly Database - Assembly search and information
- NCBI FTP Structure - FTP directory organization
- BUSCO Documentation - BUSCO genome mode requirements
Last Updated: 2026-02-06
Module Version: 1.0.0
Maintained By: Ensembl Genes Team