DB_METADATA Module¶
Overview¶
The DB_METADATA module extracts essential metadata from Ensembl core databases, including taxonomy ID, genome assembly accession (GCA), and production species name. This metadata is critical for downstream processes like BUSCO dataset selection and result organization.
Module Location: pipelines/statistics/modules/db_metadata.nf
Functionality¶
This module queries an Ensembl core database to retrieve three key metadata fields:
- Taxonomy ID: NCBI taxonomy identifier from the meta table
- GCA Accession: Genome assembly accession from the meta table
- Production Name: Scientific name in production format
The metadata is written to a file and propagated through the pipeline's metadata channel for use by downstream processes.
Inputs¶
Channel Input¶
Metadata Map:
[
gca: String, // Genome assembly accession (may be "UNKNOWN")
taxon_id: String, // Taxonomy ID (may be "UNKNOWN")
dbname: String, // Ensembl core database name
species_id: Integer, // Species ID (default: 1)
busco_mode: List, // BUSCO mode(s) to run
busco_dataset: String, // BUSCO dataset (optional)
genome_file: String, // Path to genome file (optional)
protein_file: String // Path to protein file (optional)
]
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| params.mysqlUrl | String | Required | MySQL connection URL (e.g., mysql://server:4157/) |
| params.mysqluser | String | Required | MySQL username for database access |
| params.mysqlpass | String | '' | MySQL password (optional if using SSH tunnel) |
| params.outdir | String | ./results | Base output directory |
Outputs¶
Channel Outputs¶
| Channel | Type | Description |
|---|---|---|
| metadata | tuple val(meta), path("*metadata.txt") | Original metadata + metadata file |
| versions_file | path("versions.yml") | Software versions used |
File Outputs¶
Metadata File¶
Location: ${params.outdir}/${meta.dbname}/*metadata.txt
Format: Plain text with key-value pairs
Content Example:
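The following is illustrative only; the actual values depend on the database queried (the human values here are representative):

```
taxon_id=9606
gca=GCA_000001405.28
production_name=homo_sapiens
```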
Fields:
- taxon_id: NCBI taxonomy ID from meta table
- gca: Genome assembly accession from meta table (key: assembly.accession)
- production_name: Production species name from meta table (key: species.production_name)
Versions File¶
Location: ${params.outdir}/${meta.dbname}/versions.yml
Format: YAML
Content Example:
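A representative sketch, assuming the usual Nextflow module convention of a process-name key with tool versions beneath it (the exact layout depends on the module script):

```yaml
DB_METADATA:
    python: 3.11.5
```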
Process Configuration¶
Directives¶
scratch false // Don't use scratch directory
label 'python' // Use Python resource allocation
tag "${meta.dbname}" // Tag with database name
Resource Allocation¶
From nextflow.config (python label):
- CPUs: 2
- Memory: 4 GB
- Time: 2 hours
- Queue: Standard
Container¶
Base Image: Ubuntu 20.04
Python Version: 3.11
MySQL Client: Installed
Python Libraries: pymysql, pandas
Implementation Details¶
Database Query¶
The module executes three MySQL queries against the meta table:
-- Get taxonomy ID
SELECT meta_value
FROM meta
WHERE species_id = ${meta.species_id}
AND meta_key = 'species.taxonomy_id';
-- Get GCA accession
SELECT meta_value
FROM meta
WHERE species_id = ${meta.species_id}
AND meta_key = 'assembly.accession';
-- Get production name
SELECT meta_value
FROM meta
WHERE species_id = ${meta.species_id}
AND meta_key = 'species.production_name';
Python Script Logic¶
The db_metadata.py script:
- Connects to MySQL using provided credentials
- Queries the meta table for each metadata field
- Handles missing values gracefully (returns "UNKNOWN")
- Writes results to a text file
- Captures Python version for reproducibility
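The fallback logic above can be sketched in plain Python. This is a simplified stand-in for db_metadata.py, not the actual script: fetch_meta_value reads from an in-memory dict so the example is self-contained, whereas the real script queries MySQL.

```python
def fetch_meta_value(meta_rows, species_id, meta_key):
    """Return the meta_value for (species_id, meta_key), or None if absent.

    Stand-in for the real MySQL query against the meta table.
    """
    return meta_rows.get((species_id, meta_key))

def collect_metadata(meta_rows, species_id):
    """Build the three metadata fields, defaulting missing values to 'UNKNOWN'."""
    keys = {
        "taxon_id": "species.taxonomy_id",
        "gca": "assembly.accession",
        "production_name": "species.production_name",
    }
    return {
        field: fetch_meta_value(meta_rows, species_id, meta_key) or "UNKNOWN"
        for field, meta_key in keys.items()
    }

def write_metadata_file(path, metadata):
    """Write key=value pairs in the order downstream parsers expect."""
    with open(path, "w") as fh:
        for field in ("taxon_id", "gca", "production_name"):
            fh.write(f"{field}={metadata[field]}\n")

# assembly.accession is deliberately missing here, so gca falls back to "UNKNOWN"
rows = {
    (1, "species.taxonomy_id"): "9606",
    (1, "species.production_name"): "homo_sapiens",
}
print(collect_metadata(rows, 1))
```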
Metadata Propagation¶
The output metadata map is enriched by downstream processes:
DB_METADATA(data).metadata.map { meta, metadata_file ->
def lines = metadata_file.text.readLines()
def new_taxon = lines[0].split('=')[1]
def new_gca = lines[1].split('=')[1]
def production_name = lines[2].split('=')[1]
def updated_meta = meta + [
taxon_id: new_taxon,
gca: new_gca,
production_species: production_name
]
updated_meta
}
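For reference, the same line-splitting logic in plain Python (a hypothetical helper; the pipeline itself does this in Groovy as shown above, and it relies on the fixed taxon_id/gca/production_name line order):

```python
def parse_metadata_file(text):
    """Parse key=value lines from a *metadata.txt file into a dict."""
    result = {}
    for line in text.strip().splitlines():
        # partition splits on the first '=', so values containing '=' survive intact
        key, _, value = line.partition("=")
        result[key] = value
    return result

example = "taxon_id=9606\ngca=GCA_000001405.28\nproduction_name=homo_sapiens\n"
print(parse_metadata_file(example)["production_name"])  # homo_sapiens
```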
Usage Example¶
In a Workflow¶
include { DB_METADATA } from '../modules/db_metadata.nf'
workflow {
// Create input channel
def input_ch = channel.of([
gca: 'UNKNOWN',
taxon_id: 'UNKNOWN',
dbname: 'homo_sapiens_core_110_38',
species_id: 1,
busco_mode: ['genome', 'protein'],
busco_dataset: '',
genome_file: '',
protein_file: ''
])
// Run DB_METADATA
def metadata_ch = DB_METADATA(input_ch).metadata
// Use the enriched metadata
metadata_ch.map { meta, metadata_file ->
def lines = metadata_file.text.readLines()
def taxon_id = lines[0].split('=')[1]
def gca = lines[1].split('=')[1]
def production_name = lines[2].split('=')[1]
println "Species: ${production_name} (${gca}, NCBI:${taxon_id})"
}
}
Expected Output¶
Console Output:
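With the human core database used above, the println statement produces a line of this shape (the GCA value is illustrative):

```
Species: homo_sapiens (GCA_000001405.28, NCBI:9606)
```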
File Created: results/homo_sapiens_core_110_38/homo_sapiens_core_110_38_metadata.txt
Error Handling¶
Common Errors¶
1. Database Connection Failure¶
Error Message:
ERROR ~ Error executing process > 'DB_METADATA (homo_sapiens_core_110_38)'
Can't connect to MySQL server on 'host'
Solution:
- Verify params.mysqlUrl is correct
- Check network connectivity to database server
- Verify SSH tunnel is active (if using port forwarding)
- Confirm username and password are correct
2. Missing Metadata Keys¶
Error Message:
Behavior: Process continues but writes "UNKNOWN" for missing values
Solution: Verify the database contains expected metadata:
SELECT meta_key, meta_value
FROM meta
WHERE species_id = 1
AND meta_key IN ('species.taxonomy_id', 'assembly.accession', 'species.production_name');
3. Invalid Species ID¶
Error Message:
Solution: Verify species_id exists in the database:
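For example, list the species IDs actually present in the meta table and confirm the one in your input CSV is among them:

```sql
SELECT DISTINCT species_id FROM meta;
```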
4. Database Name Mismatch¶
Error Message:
Solution:
- Check database name spelling
- List available databases: SHOW DATABASES LIKE '%core%';
- Update input CSV with correct database name
Version Tracking¶
The module captures software versions in versions.yml:
Captured Versions:
- Python interpreter version
- (Future): pymysql library version
Integration with Other Modules¶
Upstream Modules¶
None - This is typically the first module in the workflow.
Downstream Modules¶
The metadata extracted by this module is used by:
- BUSCO_DATASET: Uses taxon_id to select appropriate BUSCO lineage
- FETCH_GENOME: Uses gca to download genome assembly from NCBI
- FETCH_PROTEINS: Uses dbname to extract protein sequences
- All Analysis Modules: Use production_name for file naming and organization
Data Flow Diagram¶
graph TD
A[Input CSV] --> B[DB_METADATA]
B --> C[Metadata File]
B --> D[Updated Meta Map]
D --> E[BUSCO_DATASET]
D --> F[FETCH_GENOME]
D --> G[FETCH_PROTEINS]
C --> H[Published Results]
Best Practices¶
1. Database Connection Management¶
Use SSH Tunneling for Remote Databases:
# Open SSH tunnel
ssh -L 4157:mysql-server:3306 user@gateway-server -N &
# Configure Nextflow
nextflow run main.nf \
--mysqlUrl "mysql://localhost:4157/" \
--mysqluser "ensro" \
--mysqlpass ""
2. Species ID Handling¶
For multi-species databases (collections), use correct species_id:
dbname,species_id,busco_dataset
collection_core_110_1,1,bacteria_odb10
collection_core_110_1,2,bacteria_odb10
3. Error Resilience¶
Add error handling in workflows:
DB_METADATA(data).metadata
.filter { meta, file ->
def content = file.text
if (content.contains("UNKNOWN")) {
log.warn "Missing metadata for ${meta.dbname}"
return false
}
return true
}
.set { valid_metadata }
4. Caching Metadata¶
Since metadata rarely changes, consider caching:
process DB_METADATA {
// Add this directive to cache results
cache 'lenient' // Key cache on file path and size only, ignoring timestamps
// Or use storeDir for permanent caching
storeDir "${params.cacheDir}/metadata"
// ... rest of process
}
Performance Considerations¶
Execution Time¶
Typical execution time: 5-15 seconds per database
Factors affecting performance:
- Database server load
- Network latency
- Number of concurrent queries
Parallelization¶
This module can process multiple databases in parallel:
// Input: 10 databases
channel.fromPath('databases.csv')
.splitCsv(header: true)
.map { row -> [dbname: row.dbname, species_id: row.species_id] }
| DB_METADATA // Runs 10 instances in parallel
Recommended concurrency: 10-20 parallel instances
Resource Usage¶
Minimal resources required:
- CPU: < 5% utilization
- Memory: < 100 MB
- Disk: < 1 KB output per database
- Network: < 10 KB data transfer
Testing¶
Unit Test¶
Test metadata extraction for a known database:
# Test with homo_sapiens core
nextflow run pipelines/statistics/main.nf \
--run_busco_core \
--csvFile test_data/test_db_metadata.csv \
--mysqlUrl "mysql://ensembldb.ensembl.org:3306/" \
--mysqluser "anonymous" \
--mysqlpass "" \
-entry BUSCO \
--max_cpus 2 \
--max_memory 4.GB \
-process.executor 'local'
test_db_metadata.csv:
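A minimal test CSV, assuming the three-column layout shown in the Species ID Handling section (the busco_dataset value is illustrative):

```
dbname,species_id,busco_dataset
homo_sapiens_core_110_38,1,primates_odb10
```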
Expected Test Result¶
File: results/homo_sapiens_core_110_38/homo_sapiens_core_110_38_metadata.txt
Validation¶
Verify metadata values match database:
# Query database directly
mysql -h ensembldb.ensembl.org -u anonymous -e \
"SELECT meta_key, meta_value FROM homo_sapiens_core_110_38.meta
WHERE meta_key IN ('species.taxonomy_id', 'assembly.accession', 'species.production_name');"
# Compare with extracted metadata
cat results/homo_sapiens_core_110_38/homo_sapiens_core_110_38_metadata.txt
Troubleshooting¶
Debug Mode¶
Enable detailed MySQL logging:
script:
"""
python ${moduleDir}/bin/db_metadata.py \\
--mysql_host ${params.mysqlUrl} \\
--mysql_database "${meta.dbname}" \\
--mysql_user "${params.mysqluser}" \\
--mysql_pass "${params.mysqlpass}" \\
--species_id "${meta.species_id}" \\
--verbose # Add verbose flag
"""
Check Database Access¶
Test the connection manually, for example with the public mirror credentials used in the Testing section:
mysql -h ensembldb.ensembl.org -P 3306 -u anonymous -e "SELECT 1;"
Inspect Metadata Table¶
Examine available metadata keys:
USE homo_sapiens_core_110_38;
SELECT meta_key, COUNT(*) as count
FROM meta
GROUP BY meta_key
HAVING meta_key LIKE 'species.%' OR meta_key LIKE 'assembly.%';
Related Documentation¶
- BUSCO_DATASET Module - Uses taxonomy ID for dataset selection
- FETCH_GENOME Module - Uses GCA for genome download
- FETCH_PROTEINS Module - Uses database name for protein extraction
- BUSCO Workflow - Complete workflow using this module
References¶
- Ensembl Database Schema - Core database structure
- NCBI Taxonomy - Taxonomy ID reference
- GCA Accessions - Genome assembly accessions
Last Updated: 2026-02-06
Module Version: 1.0.0
Maintained By: Ensembl Genes Team