DB_METADATA Module¶
Overview¶
The DB_METADATA module extracts essential metadata from Ensembl core databases, including taxonomy ID, genome assembly accession (GCA), and production species name. This metadata is critical for downstream processes like BUSCO dataset selection and result organization.
Module Location: pipelines/statistics/modules/db_metadata.nf
Functionality¶
This module queries an Ensembl core database to retrieve three key metadata fields:
- Taxonomy ID: NCBI taxonomy identifier from the meta table
- GCA Accession: Genome assembly accession from the meta table
- Production Name: Scientific name in production format
The metadata is written to a file and propagated through the pipeline's metadata channel for use by downstream processes.
Inputs¶
Channel Input¶
Metadata Map:
[
gca: String, // Genome assembly accession (may be "UNKNOWN")
taxon_id: String, // Taxonomy ID (may be "UNKNOWN")
dbname: String, // Ensembl core database name
species_id: Integer, // Species ID (default: 1)
busco_mode: List, // BUSCO mode(s) to run
busco_dataset: String, // BUSCO dataset (optional)
genome_file: String, // Path to genome file (optional)
protein_file: String // Path to protein file (optional)
]
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| params.mysqlUrl | String | Required | MySQL connection URL (e.g., mysql://server:4157/) |
| params.mysqluser | String | Required | MySQL username for database access |
| params.mysqlpass | String | '' | MySQL password (optional if using SSH tunnel) |
| params.outdir | String | ./results | Base output directory |
Outputs¶
Channel Outputs¶
| Channel | Type | Description |
|---|---|---|
| metadata | tuple val(meta), path("*metadata.txt") | Original metadata + metadata file |
| versions_file | path("versions.yml") | Software versions used |
File Outputs¶
Metadata File¶
Location: ${params.outdir}/${meta.dbname}/*metadata.txt
Format: Plain text with key-value pairs
Content Example:
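The following is illustrative only; the actual values depend on the database queried (the human values here are representative):

```
taxon_id=9606
gca=GCA_000001405.28
production_name=homo_sapiens
```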
Fields:
- taxon_id: NCBI taxonomy ID from meta table
- gca: Genome assembly accession from meta table (key: assembly.accession)
- production_name: Production species name from meta table (key: species.production_name)
Versions File¶
Location: ${params.outdir}/${meta.dbname}/versions.yml
Format: YAML
Content Example:
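A representative sketch, assuming the usual Nextflow module convention of a process-name key with tool versions beneath it (the exact layout depends on the module script):

```yaml
DB_METADATA:
    python: 3.11.5
```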
Process Configuration¶
Directives¶
scratch false // Don't use scratch directory
label 'python' // Use Python resource allocation
tag "${meta.dbname}" // Tag with database name
Resource Allocation¶
From nextflow.config (python label):
- CPUs: 2
- Memory: 4 GB
- Time: 2 hours
- Queue: Standard
Container¶
Base Image: Ubuntu 20.04
Python Version: 3.11
MySQL Client: Installed
Python Libraries: pymysql, pandas
Implementation Details¶
Database Query¶
The module executes three MySQL queries against the meta table:
-- Get taxonomy ID
SELECT meta_value
FROM meta
WHERE species_id = ${meta.species_id}
AND meta_key = 'species.taxonomy_id';
-- Get GCA accession
SELECT meta_value
FROM meta
WHERE species_id = ${meta.species_id}
AND meta_key = 'assembly.accession';
-- Get production name
SELECT meta_value
FROM meta
WHERE species_id = ${meta.species_id}
AND meta_key = 'species.production_name';
Python Script Logic¶
The db_metadata.py script:
- Connects to MySQL using provided credentials
- Queries the meta table for each metadata field
- Handles missing values gracefully (returns "UNKNOWN")
- Writes results to a text file
- Captures Python version for reproducibility
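The fallback logic above can be sketched in plain Python. This is a simplified stand-in for db_metadata.py, not the actual script: fetch_meta_value reads from an in-memory dict so the example is self-contained, whereas the real script queries MySQL.

```python
def fetch_meta_value(meta_rows, species_id, meta_key):
    """Return the meta_value for (species_id, meta_key), or None if absent.

    Stand-in for the real MySQL query against the meta table.
    """
    return meta_rows.get((species_id, meta_key))

def collect_metadata(meta_rows, species_id):
    """Build the three metadata fields, defaulting missing values to 'UNKNOWN'."""
    keys = {
        "taxon_id": "species.taxonomy_id",
        "gca": "assembly.accession",
        "production_name": "species.production_name",
    }
    return {
        field: fetch_meta_value(meta_rows, species_id, meta_key) or "UNKNOWN"
        for field, meta_key in keys.items()
    }

def write_metadata_file(path, metadata):
    """Write key=value pairs in the order downstream parsers expect."""
    with open(path, "w") as fh:
        for field in ("taxon_id", "gca", "production_name"):
            fh.write(f"{field}={metadata[field]}\n")

# assembly.accession is deliberately missing here, so gca falls back to "UNKNOWN"
rows = {
    (1, "species.taxonomy_id"): "9606",
    (1, "species.production_name"): "homo_sapiens",
}
print(collect_metadata(rows, 1))
```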
Metadata Propagation¶
The output metadata map is enriched by downstream processes:
DB_METADATA(data).metadata.map { meta, metadata_file ->
def lines = metadata_file.text.readLines()
def new_taxon = lines[0].split('=')[1]
def new_gca = lines[1].split('=')[1]
def production_name = lines[2].split('=')[1]
def updated_meta = meta + [
taxon_id: new_taxon,
gca: new_gca,
production_species: production_name
]
updated_meta
}
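For reference, the same line-splitting logic in plain Python (a hypothetical helper; the pipeline itself does this in Groovy as shown above, and it relies on the fixed taxon_id/gca/production_name line order):

```python
def parse_metadata_file(text):
    """Parse key=value lines from a *metadata.txt file into a dict."""
    result = {}
    for line in text.strip().splitlines():
        # partition splits on the first '=', so values containing '=' survive intact
        key, _, value = line.partition("=")
        result[key] = value
    return result

example = "taxon_id=9606\ngca=GCA_000001405.28\nproduction_name=homo_sapiens\n"
print(parse_metadata_file(example)["production_name"])  # homo_sapiens
```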
Usage Example¶
In a Workflow¶
include { DB_METADATA } from '../modules/db_metadata.nf'
workflow {
// Create input channel
def input_ch = channel.of([
gca: 'UNKNOWN',
taxon_id: 'UNKNOWN',
dbname: 'homo_sapiens_core_110_38',
species_id: 1,
busco_mode: ['genome', 'protein'],
busco_dataset: '',
genome_file: '',
protein_file: ''
])
// Run DB_METADATA
def metadata_ch = DB_METADATA(input_ch).metadata
// Use the enriched metadata
metadata_ch.map { meta, metadata_file ->
def lines = metadata_file.text.readLines()
def taxon_id = lines[0].split('=')[1]
def gca = lines[1].split('=')[1]
def production_name = lines[2].split('=')[1]
println "Species: ${production_name} (${gca}, NCBI:${taxon_id})"
}
}
Expected Output¶
Console Output:
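With the human core database used above, the println statement produces a line of this shape (the GCA value is illustrative):

```
Species: homo_sapiens (GCA_000001405.28, NCBI:9606)
```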
File Created: results/homo_sapiens_core_110_38/homo_sapiens_core_110_38_metadata.txt
Error Handling¶
Common Errors¶
1. Database Connection Failure¶
Error Message:
ERROR ~ Error executing process > 'DB_METADATA (homo_sapiens_core_110_38)'
Can't connect to MySQL server on 'host'
Solution:
- Verify params.mysqlUrl is correct
- Check network connectivity to database server
- Verify SSH tunnel is active (if using port forwarding)
- Confirm username and password are correct
2. Missing Metadata Keys¶
Error Message:
Behavior: Process continues but writes "UNKNOWN" for missing values
Solution: Verify the database contains expected metadata:
SELECT meta_key, meta_value
FROM meta
WHERE species_id = 1
AND meta_key IN ('species.taxonomy_id', 'assembly.accession', 'species.production_name');
3. Invalid Species ID¶
Error Message:
Solution: Verify species_id exists in the database:
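For example, list the species IDs actually present in the meta table and confirm the one in your input CSV is among them:

```sql
SELECT DISTINCT species_id FROM meta;
```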
4. Database Name Mismatch¶
Error Message:
Solution:
- Check database name spelling
- List available databases: SHOW DATABASES LIKE '%core%';
- Update input CSV with correct database name
Version Tracking¶
The module captures software versions in versions.yml:
Captured Versions:
- Python interpreter version
- (Future): pymysql library version
Integration with Other Modules¶
Upstream Modules¶
None - This is typically the first module in the workflow.
Downstream Modules¶
The metadata extracted by this module is used by:
- BUSCO_DATASET: Uses taxon_id to select appropriate BUSCO lineage
- FETCH_GENOME: Uses gca to download genome assembly from NCBI
- FETCH_PROTEINS: Uses dbname to extract protein sequences
- All Analysis Modules: Use production_name for file naming and organization
Data Flow Diagram¶
graph TD
A[Input CSV] --> B[DB_METADATA]
B --> C[Metadata File]
B --> D[Updated Meta Map]
D --> E[BUSCO_DATASET]
D --> F[FETCH_GENOME]
D --> G[FETCH_PROTEINS]
C --> H[Published Results]
Best Practices¶
1. Database Connection Management¶
Use SSH Tunneling for Remote Databases:
# Open SSH tunnel
ssh -L 4157:mysql-server:3306 user@gateway-server -N &
# Configure Nextflow
nextflow run main.nf \
--mysqlUrl "mysql://localhost:4157/" \
--mysqluser "ensro" \
--mysqlpass ""
2. Species ID Handling¶
For multi-species databases (collections), use correct species_id:
dbname,species_id,busco_dataset
collection_core_110_1,1,bacteria_odb10
collection_core_110_1,2,bacteria_odb10
3. Error Resilience¶
Add error handling in workflows:
DB_METADATA(data).metadata
.filter { meta, file ->
def content = file.text
if (content.contains("UNKNOWN")) {
log.warn "Missing metadata for ${meta.dbname}"
return false
}
return true
}
.set { valid_metadata }
4. Caching Metadata¶
Since metadata rarely changes, consider caching:
process DB_METADATA {
// Add this directive to cache results
cache 'lenient' // Key cache on file path and size only, ignoring timestamps
// Or use storeDir for permanent caching
storeDir "${params.cacheDir}/metadata"
// ... rest of process
}
Performance Considerations¶
Execution Time¶
Typical execution time: 5-15 seconds per database
Factors affecting performance:
- Database server load
- Network latency
- Number of concurrent queries
Parallelization¶
This module can process multiple databases in parallel:
// Input: 10 databases
channel.fromPath('databases.csv')
.splitCsv(header: true)
.map { row -> [dbname: row.dbname, species_id: row.species_id] }
| DB_METADATA // Runs 10 instances in parallel
Recommended concurrency: 10-20 parallel instances
Resource Usage¶
Minimal resources required:
- CPU: < 5% utilization
- Memory: < 100 MB
- Disk: < 1 KB output per database
- Network: < 10 KB data transfer
Testing¶
Unit Test¶
Test metadata extraction for a known database:
# Test with homo_sapiens core
nextflow run pipelines/statistics/main.nf \
--run_busco_core \
--csvFile test_data/test_db_metadata.csv \
--mysqlUrl "mysql://ensembldb.ensembl.org:3306/" \
--mysqluser "anonymous" \
--mysqlpass "" \
-entry BUSCO \
--max_cpus 2 \
--max_memory 4.GB \
-process.executor 'local'
test_db_metadata.csv:
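A minimal test CSV, assuming the three-column layout shown in the Species ID Handling section (the busco_dataset value is illustrative):

```
dbname,species_id,busco_dataset
homo_sapiens_core_110_38,1,primates_odb10
```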
Expected Test Result¶
File: results/homo_sapiens_core_110_38/homo_sapiens_core_110_38_metadata.txt
Validation¶
Verify metadata values match database:
# Query database directly
mysql -h ensembldb.ensembl.org -u anonymous -e \
"SELECT meta_key, meta_value FROM homo_sapiens_core_110_38.meta
WHERE meta_key IN ('species.taxonomy_id', 'assembly.accession', 'species.production_name');"
# Compare with extracted metadata
cat results/homo_sapiens_core_110_38/homo_sapiens_core_110_38_metadata.txt
Troubleshooting¶
Debug Mode¶
Enable detailed MySQL logging:
script:
"""
python ${moduleDir}/bin/db_metadata.py \\
--mysql_host ${params.mysqlUrl} \\
--mysql_database "${meta.dbname}" \\
--mysql_user "${params.mysqluser}" \\
--mysql_pass "${params.mysqlpass}" \\
--species_id "${meta.species_id}" \\
--verbose # Add verbose flag
"""
Check Database Access¶
Test the connection manually, for example with the public mirror credentials used in the Testing section:
mysql -h ensembldb.ensembl.org -P 3306 -u anonymous -e "SELECT 1;"
Inspect Metadata Table¶
Examine available metadata keys:
USE homo_sapiens_core_110_38;
SELECT meta_key, COUNT(*) as count
FROM meta
GROUP BY meta_key
HAVING meta_key LIKE 'species.%' OR meta_key LIKE 'assembly.%';
Related Documentation¶
- BUSCO_DATASET Module - Uses taxonomy ID for dataset selection
- FETCH_GENOME Module - Uses GCA for genome download
- FETCH_PROTEINS Module - Uses database name for protein extraction
- BUSCO Workflow - Complete workflow using this module
References¶
- Ensembl Database Schema - Core database structure
- NCBI Taxonomy - Taxonomy ID reference
- GCA Accessions - Genome assembly accessions
Last Updated: 2026-02-06
Module Version: 1.0.0
Maintained By: Ensembl Genes Team