Statistics Pipeline Modules

This section provides detailed documentation for each module in the Statistics Pipeline. Modules are reusable Nextflow processes that perform specific tasks within the pipeline workflows.

📋 Module Overview

The Statistics Pipeline modules documented here (12 in total) are organized into three functional categories:

Data Retrieval Modules

Modules responsible for fetching input data from databases and external sources:

| Module | Purpose | Documentation |
|---|---|---|
| DB_METADATA | Extract metadata from Ensembl databases | View Docs |
| FETCH_GENOME | Download genome assemblies from NCBI | View Docs |
| FETCH_PROTEINS | Extract protein sequences from databases | View Docs |

Analysis Modules

Modules that perform quality assessment and statistical analyses:

| Module | Purpose | Documentation |
|---|---|---|
| BUSCO_DATASET | Select appropriate BUSCO lineage dataset | View Docs |
| BUSCO_GENOME_LINEAGE | Run BUSCO on genome assemblies | View Docs |
| BUSCO_PROTEIN_LINEAGE | Run BUSCO on protein sequences | View Docs |
| OMAMER_HOG | Generate HOG assignments with OMAmer | View Docs |
| OMARK | Assess proteome quality with OMArk | View Docs |
| RUN_STATISTICS | Generate Ensembl assembly statistics | View Docs |
| RUN_ENSEMBL_META | Generate Ensembl metadata statistics | View Docs |

Database Integration Modules

Modules for storing results in Ensembl databases:

| Module | Purpose | Documentation |
|---|---|---|
| BUSCO_CORE_METAKEYS | Insert BUSCO results into core database | View Docs |
| POPULATE_DB | Load statistics into database | View Docs |

🔗 Module Usage by Workflow

BUSCO Workflow

The BUSCO workflow uses these modules:

graph LR
    A[DB_METADATA] --> B[BUSCO_DATASET]
    B --> C[FETCH_GENOME]
    B --> D[FETCH_PROTEINS]
    C --> E[BUSCO_GENOME_LINEAGE]
    D --> F[BUSCO_PROTEIN_LINEAGE]
    E --> G[BUSCO_CORE_METAKEYS]
    F --> G

Modules: DB_METADATA, BUSCO_DATASET, FETCH_GENOME, FETCH_PROTEINS, BUSCO_GENOME_LINEAGE, BUSCO_PROTEIN_LINEAGE, BUSCO_CORE_METAKEYS

OMArk Workflow

The OMArk workflow uses these modules:

graph LR
    A[DB_METADATA] --> B[FETCH_PROTEINS]
    B --> C[OMAMER_HOG]
    C --> D[OMARK]

Modules: DB_METADATA, FETCH_PROTEINS, OMAMER_HOG, OMARK

Ensembl Stats Workflow

The Ensembl Stats workflow uses these modules:

graph LR
    A[DB_METADATA] --> B[RUN_STATISTICS]
    A --> C[RUN_ENSEMBL_META]
    B --> D[POPULATE_DB]
    C --> D

Modules: DB_METADATA, RUN_STATISTICS, RUN_ENSEMBL_META, POPULATE_DB
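
The three diagrams above are small dependency graphs, so a valid run order can be derived from them mechanically. The following is a minimal Python sketch, not pipeline code: the edges are copied from the diagrams, while the `WORKFLOWS` dictionary, its keys, and the helper function names are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each workflow maps a module to the set of modules it depends on.
# Edges are copied from the diagrams above; the dictionary keys
# ("busco", "omark", "ensembl_stats") are illustrative names only.
WORKFLOWS: dict[str, dict[str, set[str]]] = {
    "busco": {
        "DB_METADATA": set(),
        "BUSCO_DATASET": {"DB_METADATA"},
        "FETCH_GENOME": {"BUSCO_DATASET"},
        "FETCH_PROTEINS": {"BUSCO_DATASET"},
        "BUSCO_GENOME_LINEAGE": {"FETCH_GENOME"},
        "BUSCO_PROTEIN_LINEAGE": {"FETCH_PROTEINS"},
        "BUSCO_CORE_METAKEYS": {"BUSCO_GENOME_LINEAGE", "BUSCO_PROTEIN_LINEAGE"},
    },
    "omark": {
        "DB_METADATA": set(),
        "FETCH_PROTEINS": {"DB_METADATA"},
        "OMAMER_HOG": {"FETCH_PROTEINS"},
        "OMARK": {"OMAMER_HOG"},
    },
    "ensembl_stats": {
        "DB_METADATA": set(),
        "RUN_STATISTICS": {"DB_METADATA"},
        "RUN_ENSEMBL_META": {"DB_METADATA"},
        "POPULATE_DB": {"RUN_STATISTICS", "RUN_ENSEMBL_META"},
    },
}


def execution_order(workflow: str) -> list[str]:
    """Return one valid order in which dependencies precede dependents."""
    return list(TopologicalSorter(WORKFLOWS[workflow]).static_order())


def shared_modules() -> set[str]:
    """Return the modules that appear in every workflow."""
    return set.intersection(*(set(graph) for graph in WORKFLOWS.values()))


if __name__ == "__main__":
    for name in WORKFLOWS:
        print(name, "->", " > ".join(execution_order(name)))
    print("shared by all workflows:", sorted(shared_modules()))
```

Branches that do not depend on each other (for example FETCH_GENOME and FETCH_PROTEINS) may appear in either order; Nextflow itself schedules such processes in parallel as their inputs become available.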

📖 Module Documentation Structure

Each module documentation page includes:

  • Overview: Purpose and functionality
  • Inputs: Required input channels and parameters
  • Outputs: Generated output channels and files
  • Parameters: Configuration options and defaults
  • Container: Docker/Singularity image used
  • Resources: CPU, memory, and time requirements
  • Example: Usage example with sample data
  • Error Handling: Common issues and troubleshooting
  • Version Tracking: Software versions captured

🔧 Common Module Patterns

Process Labels

Modules use labels to define resource requirements:

  • python - Python-based processes (2 CPUs, 4GB RAM)
  • busco - BUSCO analysis (8 CPUs, 16GB RAM)
  • omamer - OMAmer/OMArk processes (4 CPUs, 8GB RAM)
  • fetch_file - File retrieval (2 CPUs, 2GB RAM)
  • default - Standard processes (1 CPU, 2GB RAM)

See nextflow.config for full resource definitions.
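
As a rough illustration of how the labels map to resource requests, the table above can be modelled as a lookup. This is a Python sketch rather than the pipeline's configuration: the CPU and memory figures are copied from the list above, but the `Resources` class, the function names, and the fallback to `default` for unknown labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    memory_gb: int


# Figures copied from the label list above.
LABEL_RESOURCES = {
    "python": Resources(cpus=2, memory_gb=4),
    "busco": Resources(cpus=8, memory_gb=16),
    "omamer": Resources(cpus=4, memory_gb=8),
    "fetch_file": Resources(cpus=2, memory_gb=2),
    "default": Resources(cpus=1, memory_gb=2),
}


def resources_for(label: str) -> Resources:
    """Look up a label's resources.

    Assumption: labels not listed above fall back to "default".
    """
    return LABEL_RESOURCES.get(label, LABEL_RESOURCES["default"])


def total_request(labels: list[str]) -> Resources:
    """Sum the requests for one concurrent task per given label."""
    selected = [resources_for(label) for label in labels]
    return Resources(
        cpus=sum(r.cpus for r in selected),
        memory_gb=sum(r.memory_gb for r in selected),
    )


if __name__ == "__main__":
    # One BUSCO task and one OMAmer task running side by side.
    print(total_request(["busco", "omamer"]))
```

A calculation like `total_request` can help when sizing a node or a scheduler allocation for tasks that are expected to run at the same time.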

Version Tracking

All modules emit a versions.yml file tracking software versions:

"PROCESS_NAME":
    tool_name: version_number
    python: 3.11.0

Recording versions this way supports reproducibility and gives each run an audit trail of the software it used.
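
To make the file layout concrete, here is a small Python sketch that writes and reads the two-level structure shown above using only the standard library. It is not the pipeline's own code: the function names are hypothetical, the parser handles only this exact layout (it is not a general YAML parser), and the BUSCO version number in the example is made up.

```python
def render_versions(process_name: str, versions: dict[str, str]) -> str:
    """Render one process entry in the versions.yml layout shown above."""
    lines = [f'"{process_name}":']
    lines += [f"    {tool}: {version}" for tool, version in versions.items()]
    return "\n".join(lines) + "\n"


def parse_versions(text: str) -> dict[str, dict[str, str]]:
    """Parse one or more concatenated entries of that two-level layout."""
    result: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines between entries
        if not raw.startswith(" "):
            # Unindented line: a quoted process name ending in a colon.
            current = raw.strip().rstrip(":").strip('"')
            result[current] = {}
        else:
            if current is None:
                raise ValueError("tool line found before any process name")
            tool, _, version = raw.strip().partition(":")
            result[current][tool.strip()] = version.strip()
    return result


if __name__ == "__main__":
    # The BUSCO version below is a hypothetical value for illustration.
    entry = render_versions(
        "BUSCO_GENOME_LINEAGE", {"busco": "5.4.7", "python": "3.11.0"}
    )
    print(entry, end="")
    print(parse_versions(entry))
```

Because each entry is keyed by process name, entries from several modules can be concatenated and parsed together to build a per-run software inventory.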

Publishing Strategy

Modules control output locations with the publishDir directive, plus storeDir where results should be cached permanently:

  • Analysis Results: published to ${params.outdir}/${meta.gca}/
  • Cached Data: published to ${params.cacheDir}/${meta.gca}/
  • Stored Data: written via storeDir for permanent caching
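
The two path patterns above are plain joins of a root directory and the assembly accession. The Python sketch below shows how they resolve for a given run; the `outdir`, `cacheDir`, and `gca` names come from the patterns above, while the `publish_dir` function and the example directory values are hypothetical.

```python
from pathlib import PurePosixPath


def publish_dir(params: dict, meta: dict, cached: bool = False) -> PurePosixPath:
    """Resolve <root>/<meta.gca>, where root is cacheDir or outdir."""
    gca = meta.get("gca")
    if not gca:
        # Without an accession, outputs would land directly in the root.
        raise ValueError("meta map has no 'gca' value")
    root = params["cacheDir"] if cached else params["outdir"]
    return PurePosixPath(root) / gca


if __name__ == "__main__":
    # Directory names here are example values, not pipeline defaults.
    params = {"outdir": "results", "cacheDir": "cache"}
    meta = {"gca": "GCA_000001405.29"}
    print(publish_dir(params, meta))               # analysis results
    print(publish_dir(params, meta, cached=True))  # cached data
```

Keying both directories on the accession keeps the outputs of different assemblies separate even when they share the same root.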

Metadata Propagation

All modules receive and emit a meta map containing:

[
    gca: "GCA_000001405.29",           // Genome assembly accession
    dbname: "homo_sapiens_core_110_38", // Database name
    species_id: 1,                      // Species ID in database
    taxon_id: "9606",                   // NCBI taxonomy ID
    production_name: "homo_sapiens",    // Production name
    busco_dataset: "vertebrata_odb10"   // BUSCO lineage
]
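
For readers who want to reason about the map outside Nextflow, the same fields can be modelled as an immutable record. This is a Python sketch under stated assumptions, not the pipeline's implementation: the field names and example values are taken from the map above, the accession check uses the standard GenBank assembly format (GCA_, nine digits, a dot, and a version), and treating `busco_dataset` as optional until a later step fills it in is an assumption.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional

# GenBank assembly accessions: "GCA_", nine digits, ".", version number.
GCA_PATTERN = re.compile(r"^GCA_\d{9}\.\d+$")


@dataclass(frozen=True)
class Meta:
    gca: str
    dbname: str
    species_id: int
    taxon_id: str
    production_name: str
    # Assumption: the lineage is unknown until a later step selects it.
    busco_dataset: Optional[str] = None

    def __post_init__(self) -> None:
        if not GCA_PATTERN.match(self.gca):
            raise ValueError(f"not a GCA accession: {self.gca!r}")


def with_busco_dataset(meta: Meta, dataset: str) -> Meta:
    """Return a copy of meta with the lineage set; the input is unchanged."""
    return replace(meta, busco_dataset=dataset)


if __name__ == "__main__":
    meta = Meta(
        gca="GCA_000001405.29",
        dbname="homo_sapiens_core_110_38",
        species_id=1,
        taxon_id="9606",
        production_name="homo_sapiens",
    )
    print(with_busco_dataset(meta, "vertebrata_odb10"))
```

Returning a new record instead of mutating the existing one mirrors how values passed through Nextflow channels are best treated: each module emits its own copy of the map, so downstream steps never see a half-updated one.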

Last Updated: 2026-02-06
Pipeline Version: 1.0.0