Statistics Pipeline Troubleshooting Guide¶

This guide covers common issues, error messages, and solutions for the Ensembl genes statistics pipeline.

Table of Contents¶

Database Connection Issues
Module-Specific Errors
Resource and Performance Issues
File System and I/O Problems
Version and Dependency Issues
Data Quality and Validation
Debugging Strategies
Common Error Patterns
Getting Help
Quick Reference

Database Connection Issues¶

Error: "Access denied for user"¶

Symptoms:

ERROR: Access denied for user 'username'@'host' (using password: YES)

Causes:
- Incorrect database credentials
- Missing database privileges
- Network access restrictions

Solutions:
1. Verify credentials in your configuration:

params.user = 'correct_username'
params.password = 'correct_password'
params.host = 'correct_host'
params.port = 3306

2. Test the connection manually:

mysql -h ${params.host} -P ${params.port} -u ${params.user} -p

3. Check database grants:
```
SHOW GRANTS FOR 'username'@'host';
```

Affected Modules: All database-dependent modules (FETCH_GENOME, FETCH_PROTEINS, RUN_STATISTICS, RUN_ENSEMBL_META, POPULATE_DB, BUSCO_CORE_METAKEYS)

Error: "Can't connect to MySQL server"¶

Symptoms:

ERROR: Can't connect to MySQL server on 'host' (110)

Causes:
- Database server is down
- Network connectivity issues
- Firewall blocking the connection
- Wrong host or port

Solutions:
1. Verify that the server is reachable and the port is open:

ping ${params.host}
telnet ${params.host} ${params.port}

2. Check that firewall rules allow outbound connections to the database port.
3. Verify that the host and port are correct in the parameters.
4. For compute environments, ensure security groups allow database access.
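
The `(110)` in the message is the Linux errno for a connection timeout, which usually points at a firewall or routing problem rather than a stopped server (a refused connection reports 111). Where `telnet` is not installed, the port check can be sketched with bash's `/dev/tcp` redirection. This is a minimal sketch that assumes `bash` and coreutils `timeout` are available; the helper name `port_open` and the 5-second default are illustrative and not part of the pipeline.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# Assumes bash and coreutils `timeout`; `port_open` is a hypothetical helper.
port_open() (
    host=$1
    port=$2
    secs=${3:-5}
    # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
)

# Example (replace with your own host and port):
#   port_open db.example.org 3306 && echo reachable || echo unreachable
```

A non-zero status only says the port could not be opened from this machine; run it from the same node type that executes the pipeline tasks.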

Error: "Unknown database"¶

Symptoms:

ERROR: Unknown database 'dbname'

Causes:
- Database name doesn't exist
- Typo in database name
- Metadata contains an incorrect dbname

Solutions:
1. List available databases:

SHOW DATABASES LIKE '%pattern%';

2. Verify that the metadata dbname field matches the actual database name.
3. Check for case sensitivity (database names are case-sensitive on some systems).
4. Ensure the database was created before running the pipeline.

Module-Specific Errors¶

BUSCO Modules¶

Error: "BUSCO dataset not found"¶

Symptoms:

ERROR: Could not download dataset 'lineage_odb10'

Causes:
- No network connectivity to the BUSCO data server
- Invalid lineage name
- Corrupted cache directory

Solutions:
1. Verify that the lineage name is valid:

busco --list-datasets

2. Clear the cache and retry:

rm -rf ${params.cacheDir}/busco_downloads/

3. Check internet connectivity:
```
curl -I https://busco-data.ezlab.org/
```

4. Manually download the dataset if needed:

busco --download ${params.busco_lineage}

Affected Modules: BUSCO_DATASET, BUSCO_GENOME_LINEAGE, BUSCO_PROTEIN_LINEAGE

Error: "BUSCO failed to run"¶

Symptoms:

ERROR: BUSCO analysis failed for sample

Causes:
- Input file format issues
- Insufficient memory
- Corrupted input data

Solutions:
1. Validate the input file format:

# For genome
head -n 20 genome.fa

# For proteins
head -n 20 proteins.fa

2. Check that the file is not empty:
```
wc -l input.fa
```

3. Increase the memory allocation in the config:

process {
    withLabel: busco {
        memory = '16 GB'
    }
}

4. Check the BUSCO logs in the work directory for the detailed error.
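
Looking at the first lines with `head` and counting lines with `wc -l` can be combined into a single pass/fail check. The sketch below is one possible way to do that; `check_fasta` is a hypothetical helper, not part of BUSCO or the pipeline, and it only tests basic structure (non-empty file, leading `>` header, at least one sequence line).

```shell
# Basic FASTA sanity check. `check_fasta` is a hypothetical helper name.
# It does not validate the sequence alphabet, only the overall structure.
check_fasta() (
    file=$1
    [ -s "$file" ] || { echo "empty or missing: $file"; exit 1; }
    # The first non-blank line must be a FASTA header
    first=$(grep -v '^[[:space:]]*$' "$file" | head -n 1)
    case $first in
        '>'*) ;;
        *) echo "first record does not start with '>': $file"; exit 1 ;;
    esac
    headers=$(grep -c '^>' "$file" || true)
    # Count non-header lines that contain at least one non-blank character
    seqlines=$(grep -v '^>' "$file" | grep -c '[^[:space:]]' || true)
    [ "$seqlines" -gt 0 ] || { echo "headers but no sequence: $file"; exit 1; }
    echo "ok: $headers sequences in $file"
)

# Example: check_fasta genome.fa && check_fasta proteins.fa
```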

OMAmer/OMark Modules¶

Error: "OMAmer database not found"¶

Symptoms:

ERROR: [Errno 2] No such file or directory: 'omamer_database'

Causes:
- params.omamer_database is not set correctly
- Database file doesn't exist
- Insufficient permissions

Solutions:
1. Verify that the database path exists:

ls -lh ${params.omamer_database}

2. Set the correct path in the parameters:

params.omamer_database = '/path/to/LUCA.h5'

3. Download the database if it is missing:

# See OMAmer documentation for download instructions

Affected Modules: OMAMER_HOG, OMARK

Error: "OMAmer search failed"¶

Symptoms:

ERROR: Search failed with exit code 1

Causes:
- Empty or invalid protein file
- Corrupted database
- Memory issues

Solutions:
1. Validate the protein FASTA:

grep -c "^>" proteins.fa  # Count sequences

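2. Check database integrity (the database is an HDF5 file, so confirming that it opens with the HDF5 tool `h5ls` is a reasonable first check):

h5ls ${params.omamer_database}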

3. Reduce maxForks if memory is constrained:

process {
    withName: OMAMER_HOG {
        maxForks = 10  // Reduce from 15
    }
}
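
A truncated or corrupted download can be caught cheaply before launching a search. The sketch below checks only that the file starts with the 8-byte HDF5 signature, assuming the database is a single HDF5 file as the `.h5` extension in `params.omamer_database` suggests. `looks_like_hdf5` is a hypothetical helper name, and passing this check does not prove the database is complete.

```shell
# Check that a file begins with the HDF5 signature bytes
# 0x89 'H' 'D' 'F' '\r' '\n' 0x1a '\n'.
# `looks_like_hdf5` is a hypothetical helper; it assumes the signature sits
# at offset 0, which is the usual case for files written without a user block.
looks_like_hdf5() (
    file=$1
    [ -r "$file" ] || exit 1
    # Dump the first 8 bytes as hex and strip the separators
    sig=$(od -An -tx1 -N8 "$file" | tr -d ' \n')
    [ "$sig" = "894844460d0a1a0a" ]
)

# Example: looks_like_hdf5 /path/to/LUCA.h5 || echo "not an HDF5 file"
```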

Fetch Modules¶

Error: "No translations found"¶

Symptoms:

WARNING: dump_translations.pl produced empty file

Causes:
- Database has no protein_coding genes
- Wrong database selected
- Database is missing translation data

Solutions:
1. Check the gene count:

SELECT COUNT(*) FROM gene WHERE biotype = 'protein_coding';

2. Check the translation table:
```
SELECT COUNT(*) FROM translation;
```
3. Verify that the metadata points to the correct database:
```
meta.dbname = 'correct_core_db_name'
```

Affected Modules: FETCH_PROTEINS

Error: "Genome fetch failed"¶

Symptoms:

ERROR: Failed to fetch genome from database

Causes:
- Missing DNA sequences in the database
- Insufficient memory for large genomes
- Database connection timeout

Solutions:
1. Verify that DNA data exists:

SELECT COUNT(*) FROM dna;

2. Increase the time limit and memory for large genomes:

process {
    withName: FETCH_GENOME {
        time = '6h'
        memory = '16 GB'
    }
}

Affected Modules: FETCH_GENOME

Statistics Modules¶

Error: "Statistics generation failed"¶

Symptoms:

ERROR: generate_species_homepage_stats.pl failed

Causes:
- Database schema issues
- Missing required tables
- Perl API version mismatch

Solutions:
1. Verify the database schema version:

SELECT meta_value FROM meta WHERE meta_key = 'schema_version';

2. Check that the required tables exist:

SHOW TABLES LIKE '%gene%';
SHOW TABLES LIKE '%transcript%';

3. Ensure the Ensembl API matches the schema:

perl -MBio::EnsEMBL::ApiVersion -e 'print software_version(), "\n"'

Affected Modules: RUN_STATISTICS, RUN_ENSEMBL_META

Database Population¶

Error: "SQL execution failed"¶

Symptoms:

ERROR: mysql returned non-zero exit code

Causes:
- SQL syntax errors
- Missing table or column
- Insufficient privileges
- Duplicate key violations

Solutions:
1. Test the SQL file manually:

mysql -h ${params.host} -u ${params.user} -p ${meta.dbname} < file.sql

2. Check the SQL file for error text:
```
grep -i error file.sql
```
3. Verify INSERT/UPDATE privileges:
```
SHOW GRANTS FOR CURRENT_USER();
```

4. Check for duplicate entries:

SELECT meta_key, COUNT(*) FROM meta GROUP BY meta_key HAVING COUNT(*) > 1;

Affected Modules: POPULATE_DB, BUSCO_CORE_METAKEYS

Resource and Performance Issues¶

Error: "OutOfMemoryError" or process killed¶

Symptoms:

ERROR: Process killed (exit code 137)
OutOfMemoryError: Java heap space

Causes:
- Insufficient memory allocation
- Memory leak
- Processing very large files

Solutions:
1. Increase memory in the configuration:

process {
    withLabel: busco {
        memory = { 8.GB * task.attempt }
    }
    withLabel: omamer {
        memory = { 16.GB * task.attempt }
    }
}

2. Enable automatic retry with more memory:

process {
    errorStrategy = { task.exitStatus == 137 ? 'retry' : 'terminate' }
    maxRetries = 3
}

3. Watch the task log for memory-related messages:

# Check work directory for .command.log
tail -f work/xx/yyyy/.command.log
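
The retry rule above keys on exit status 137 because shells report a process killed by signal N as 128 + N, so 137 corresponds to signal 9 (SIGKILL), the signal typically sent when a memory limit is exceeded. As a small sketch of that arithmetic, the hypothetical helper `signal_from_exit` below decodes any such status using the shell's `kill -l`.

```shell
# Translate an exit status into a signal name when it encodes one.
# Statuses above 128 mean "terminated by signal (status - 128)".
# `signal_from_exit` is a hypothetical helper name for illustration.
signal_from_exit() (
    code=$1
    if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        kill -l "$((code - 128))"
    else
        echo "none"
    fi
)

signal_from_exit 137   # prints KILL
signal_from_exit 1     # prints none
```

A status of 143 decodes to TERM in the same way, which some schedulers use when a job exceeds its time limit, so it can be worth checking before assuming a memory problem.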

Issue: Pipeline very slow / processes waiting¶

Symptoms:
- Many processes in "PENDING" state
- Low CPU usage
- Processes waiting hours to start

Causes:
- Too many parallel forks
- Database connection bottleneck
- I/O bottleneck

Solutions:
1. Reduce maxForks for database-heavy processes:

process {
    withLabel: fetch_file {
        maxForks = 10  // Reduce from 20
    }
    withName: OMAMER_HOG {
        maxForks = 5   // Reduce from 15
    }
}

2. Stagger job submission:

executor {
    submitRateLimit = '10/1min'  // At most 10 job submissions per minute
}

3. Check the database connection pool limits.
4. Monitor I/O with:
```
iostat -x 5
```

Issue: "Too many open files"¶

Symptoms:

ERROR: Too many open files

Causes:
- System file descriptor limit reached
- Too many parallel processes
- File handles not being released

Solutions:
1. Increase the file descriptor limit:

ulimit -n 4096

2. Make the limit persistent in your shell profile:
```
echo "ulimit -n 4096" >> ~/.bashrc
```
3. Limit how many jobs the executor handles at once:
```
executor.queueSize = 50
```

File System and I/O Problems¶

Error: "No space left on device"¶

Symptoms:

ERROR: No space left on device

Causes:
- Work directory full
- Cache directory full
- Output directory full

Solutions:
1. Check disk usage:

df -h
du -sh work/ outdir/ cacheDir/

2. Enable cleaning:

params.clean = true
params.clean_work_dir = true

3. Use different storage for the cache:

params.cacheDir = '/large/storage/path/'

4. Clean up old work directories:
```
nextflow clean -f
```
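
It can help to check free space before launching a run rather than after a failure. The sketch below is one way to do that with POSIX `df -P`; the helper names `used_percent` and `warn_if_full` and the 90% default threshold are illustrative choices, not pipeline settings.

```shell
# Print the percentage of space used on the filesystem holding a path.
# `df -P` keeps each filesystem on one line so the columns are stable.
# `used_percent` and `warn_if_full` are hypothetical helper names.
used_percent() (
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
)

# Exit non-zero when usage is at or above the limit (default 90%)
warn_if_full() (
    path=$1
    limit=${2:-90}
    pct=$(used_percent "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full"
        exit 1
    fi
    echo "ok: $path is ${pct}% full"
)

# Example: warn_if_full work/ 85 || echo "free some space before running"
```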

Issue: File system latency causing failures¶

Symptoms:
- Files not found immediately after creation
- Intermittent "No such file or directory" errors
- Process succeeds on retry

Causes:
- Distributed/NFS file system sync delay
- High I/O load

Solutions:
1. Increase the latency delay:

params.files_latency = 5  // Increase from default 1

2. Use faster storage for the work directory:
```
workDir = '/fast/local/storage/work'
```

3. Enable error retry:

process {
    errorStrategy = 'retry'
    maxRetries = 2
}
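
When debugging a latency problem by hand, polling for the expected file shows how long the file system actually takes to expose it. The sketch below is a generic polling loop; `wait_for_file` is a hypothetical helper and is not how the pipeline implements `params.files_latency`.

```shell
# Poll once per second until a file appears or the timeout (seconds) expires.
# `wait_for_file` is a hypothetical helper for manual debugging.
wait_for_file() (
    file=$1
    timeout=${2:-30}
    waited=0
    while [ ! -e "$file" ]; do
        # Give up once the allowed waiting time has been used
        [ "$waited" -ge "$timeout" ] && exit 1
        sleep 1
        waited=$((waited + 1))
    done
    exit 0
)

# Example: wait_for_file results/output.fa 60 || echo "file never appeared"
```

If files routinely need more than a few seconds to appear, moving the work directory to faster storage is likely a better fix than a longer delay.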

Error: "Permission denied"¶

Symptoms:

ERROR: Permission denied: '/path/to/file'

Causes:
- Insufficient file permissions
- Wrong user/group ownership
- Read-only file system

Solutions:
1. Check permissions:

ls -la /path/to/file

2. Fix ownership:
```
chown -R user:group /path/to/directory
```
3. Set correct permissions:
```
chmod -R 755 /path/to/directory
```
4. Verify write access to the output and cache directories.
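
Permission bits shown by `ls -la` do not reveal read-only mounts or ACLs, so an actual write attempt is a more direct test. The sketch below tries to create and remove a probe file in each directory; `check_writable` is a hypothetical helper name, and the directories in the example are placeholders.

```shell
# Try to create a file in each directory and report the result.
# `check_writable` is a hypothetical helper; it exits non-zero if any
# directory cannot be written to by the current user.
check_writable() (
    status=0
    for dir in "$@"; do
        probe="$dir/.write_test.$$"
        # Run the redirection in a subshell so a failure cannot abort the loop
        if ( : > "$probe" ) 2>/dev/null; then
            rm -f "$probe"
            echo "writable: $dir"
        else
            echo "NOT writable: $dir"
            status=1
        fi
    done
    exit "$status"
)

# Example: check_writable /path/to/outdir /path/to/cacheDir
```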

Version and Dependency Issues¶

Error: "Command not found"¶

Symptoms:

ERROR: busco: command not found
ERROR: omamer: command not found

Causes:
- Tool not installed in the container
- Wrong container image
- PATH not set correctly

Solutions:
1. Verify that the container has the tool:

docker run <container> which busco

2. Check the container definition in the config:

process {
    withLabel: busco {
        container = 'ezlabgva/busco:v5.4.7_cv1'
    }
}

3. For conda environments:
```
conda list | grep busco
```

Error: Version incompatibility¶

Symptoms:

ERROR: This script requires BUSCO v5.x
ERROR: Ensembl API version mismatch

Causes:
- Outdated tool version
- Wrong Ensembl release
- Script-API version mismatch

Solutions:
1. Check tool versions in the versions.yml outputs.

2. Update the container to the correct version:

container = 'ezlabgva/busco:v5.4.7_cv1'  // Specific version

3. Match the Ensembl API to the database schema:

# Database schema version
mysql> SELECT meta_value FROM meta WHERE meta_key='schema_version';

# Use matching API version
git checkout release/110  # Match schema version

Issue: Python/Perl module not found¶

Symptoms:

ERROR: Can't locate Bio/EnsEMBL/Registry.pm
ERROR: ModuleNotFoundError: No module named 'omamer'

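Causes:
- Missing dependencies in the environment
- Wrong Python/Perl environment active

Solutions:
1. For Perl modules (the Ensembl API is distributed through GitHub rather than CPAN, so clone it and add it to PERL5LIB):

git clone https://github.com/Ensembl/ensembl.git
export PERL5LIB="$PWD/ensembl/modules:$PERL5LIB"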

2. For Python modules:
```
pip install omamer omark
```

3. Ensure the correct environment in the container:

RUN pip install omamer==0.3.3 omark==0.3.0

4. Check PERL5LIB and PYTHONPATH:
```
echo $PERL5LIB
echo $PYTHONPATH
```

Data Quality and Validation¶

Issue: Empty BUSCO results¶

Symptoms:
- BUSCO completes but reports 0% completeness
- No genes found

Causes:
- Wrong lineage selected
- Input data format issues
- Corrupted input file

Solutions:
1. Verify that the lineage is appropriate:

busco --list-datasets
# Choose lineage matching your species

2. Check the input file format:

# Should be valid FASTA
grep "^>" input.fa | head

3. Try a broader lineage:

params.busco_lineage = 'eukaryota_odb10'  // Start broad

4. Validate file integrity by comparing the checksum with the one published by the data source:
```
md5sum input.fa
```

Issue: OMark reports high contamination¶

Symptoms:
- OMark warns of possible contamination
- Many "unexpected" proteins

Causes:
- Actual contamination in the assembly
- Wrong taxonomic placement
- Horizontal gene transfer

Solutions:
1. Review the detailed OMark output in the omark_output/ directory.

2. Check for known contaminants:

grep -i "bacteria\|virus" omark_output/*.txt

3. Run contamination screening on the assembly:

# Use tools like blobtools, NCBI FCS-GX

4. If the unexpected proteins are legitimate (for example, horizontal gene transfer), document this in the metadata.

Issue: Statistics don't match expectations¶

Symptoms:
- Gene counts seem wrong
- Missing expected data in statistics

Causes:
- Database not fully populated
- Wrong database selected
- Filtering parameters

Solutions:
1. Manually check the database:

SELECT biotype, COUNT(*) FROM gene GROUP BY biotype;
SELECT biotype, COUNT(*) FROM transcript GROUP BY biotype;

2. Verify that the correct database is being used:

# Check meta.dbname matches intended database

3. Review the statistics SQL files:
```
cat core_statistics/*.sql
```
4. Compare with previous runs if available.

Debugging Strategies¶

Strategy 1: Enable trace and debugging¶

// nextflow.config
trace {
    enabled = true
    file = 'trace.txt'
}

dag {
    enabled = true
    file = 'dag.html'
}

report {
    enabled = true
    file = 'report.html'
}

Strategy 2: Examine work directories¶

# Find the first work directory whose recorded exit code is not 0
find work/ -name .exitcode -exec grep -Lx '0' {} + | head -1 | xargs dirname

# View logs
cd <work_dir>
cat .command.sh    # Command that was run
cat .command.log   # Combined stdout and stderr captured by the task wrapper
cat .command.out   # stdout
cat .command.err   # stderr
cat .command.trace # Resource usage
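
To list every failed task rather than only the first, the `.exitcode` files can be read directly. The sketch below assumes each task directory holds a `.exitcode` file containing the numeric status; `failed_work_dirs` is a hypothetical helper name.

```shell
# Print "<exit code><TAB><work directory>" for every task that did not exit 0,
# sorted by exit code. `failed_work_dirs` is a hypothetical helper.
failed_work_dirs() (
    root=${1:-work}
    find "$root" -name .exitcode -type f | while IFS= read -r f; do
        # Strip whitespace so files with or without a trailing newline compare equal
        code=$(tr -d '[:space:]' < "$f")
        if [ "$code" != "0" ]; then
            printf '%s\t%s\n' "$code" "$(dirname "$f")"
        fi
    done | sort -n
)

# Example: failed_work_dirs work/
```

Comparing the whole code against `0` avoids false matches on statuses such as 10, which a substring search for `1` or `0` would misclassify.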

Strategy 3: Run process manually¶

# Navigate to work directory
cd <work_dir>

# Run command directly
bash .command.sh

# Or step through script line by line

Strategy 4: Reduce scope for testing¶

// Test with single sample
meta_ch = channel.of([
    [gca: 'GCA_000001405.15', dbname: 'test_db', production_name: 'test']
])

Strategy 5: Check versions compatibility¶

# Collect all versions
find work/ -name versions.yml -exec cat {} + > all_versions.yml

# Compare against known working versions
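
The comparison step can be sketched as follows. This assumes each versions.yml lists tools as indented `tool: version` lines under a process name, which is an assumption about the file layout rather than something this guide specifies; `collect_versions` and the file name `known_good_versions.txt` are hypothetical.

```shell
# Gather the distinct "tool: version" lines from all versions.yml files.
# Assumes tool entries are indented under an unindented process-name key;
# `collect_versions` is a hypothetical helper name.
collect_versions() (
    root=${1:-work}
    find "$root" -name versions.yml -type f -exec cat {} + |
        grep -E '^[[:space:]]+[^[:space:]].*:' |
        sed 's/^[[:space:]]*//' |
        sort -u
)

# Example: compare the current run with a saved list from a working run
#   collect_versions work/ > current_versions.txt
#   diff known_good_versions.txt current_versions.txt
```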

Strategy 6: Increase verbosity¶

# Run with debug logging
nextflow run main.nf -profile test --debug

# Nextflow trace
nextflow run main.nf -with-trace -with-report -with-timeline

Common Error Patterns¶

Pattern: Intermittent failures¶

Indicators:
- Process succeeds on retry
- Random timing of failures
- Different processes failing

Likely Causes:
- File system latency
- Network issues
- Resource contention

Solution:

process {
    errorStrategy = 'retry'
    maxRetries = 2
}
params.files_latency = 5

Pattern: All processes fail at same stage¶

Indicators:
- All samples fail at the same module
- Consistent error message
- Happens immediately on start

Likely Causes:
- Configuration error
- Missing parameter
- Wrong path or credential

Solution:
- Review the parameters for that specific module
- Check the parameter documentation
- Validate paths and credentials

Pattern: Failures after long runtime¶

Indicators:
- Process runs for hours, then fails
- Memory-related errors
- Disk space errors

Likely Causes:
- Insufficient resources
- Memory leak
- Large file handling

Solution:

process {
    memory = { 8.GB * task.attempt }
    time = { 4.h * task.attempt }
    errorStrategy = 'retry'
    maxRetries = 3
}
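
With `maxRetries = 3` a task can run up to four times (the first attempt plus three retries), and the closures above multiply the base request by the attempt number. The sketch below simply prints that schedule so the largest request can be checked against what the cluster allows; `retry_schedule` is a hypothetical helper for illustration.

```shell
# Print the memory and time requested on each attempt when both scale
# linearly with the attempt number. `retry_schedule` is a hypothetical helper.
# Arguments: base memory (GB), base time (hours), maxRetries.
retry_schedule() (
    base_gb=$1
    base_hours=$2
    max_retries=$3
    attempt=1
    while [ "$attempt" -le $((max_retries + 1)) ]; do
        echo "attempt $attempt: $((base_gb * attempt)) GB, $((base_hours * attempt)) h"
        attempt=$((attempt + 1))
    done
)

retry_schedule 8 4 3
# attempt 1: 8 GB, 4 h
# attempt 2: 16 GB, 8 h
# attempt 3: 24 GB, 12 h
# attempt 4: 32 GB, 16 h
```

If the final request exceeds a queue's limits, the last retry may never be scheduled, so lowering the base values or the retry count may be needed.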

Getting Help¶

Before asking for help:¶

✅ Check this troubleshooting guide
✅ Review module documentation
✅ Examine work directory logs
✅ Check parameter configuration
✅ Try with single sample/minimal data

When reporting issues, include:¶

Error message (full text from .command.err)
Command executed (.command.sh contents)
Configuration (relevant params)
Environment (Nextflow version, executor)
Steps to reproduce
Versions file (if generated)

Useful diagnostic commands:¶

# Nextflow version
nextflow -version

# Process logs
find work/ -name .command.err -exec grep -l ERROR {} \;

# Resource usage
find work/ -name .command.trace -exec cat {} +

# Configuration used
nextflow config -profile <your_profile>

# List failed processes
nextflow log <run_name> -f status,name,exit | grep FAILED

Quick Reference¶

Issue	First Check	Module
Can't connect to database	Credentials, network	All DB modules
BUSCO fails	Lineage, input format	BUSCO_*
OMAmer database not found	Path, file exists	OMAMER_HOG, OMARK
Empty output	Input data, database content	FETCH_*, BUSCO_*
SQL errors	Privileges, syntax	POPULATE_DB
Out of memory	Process memory config	BUSCO_*, OMAMER_*
File not found	File system latency	All
Slow pipeline	maxForks, database load	All

Last Updated: 2026-02-07
For: Ensembl Genes Statistics Pipeline v1.0