Statistics Pipeline Modules
This section provides detailed documentation for each module in the Statistics Pipeline. Modules are reusable Nextflow processes that perform specific tasks within the pipeline workflows.
📋 Module Overview
The Statistics Pipeline contains 12 modules organized into three functional categories:
Data Retrieval Modules
Modules responsible for fetching input data from databases and external sources:
| Module | Purpose | Documentation |
|---|---|---|
| DB_METADATA | Extract metadata from Ensembl databases | View Docs |
| FETCH_GENOME | Download genome assemblies from NCBI | View Docs |
| FETCH_PROTEINS | Extract protein sequences from databases | View Docs |
Analysis Modules
Modules that perform quality assessment and statistical analyses:
| Module | Purpose | Documentation |
|---|---|---|
| BUSCO_DATASET | Select appropriate BUSCO lineage dataset | View Docs |
| BUSCO_GENOME_LINEAGE | Run BUSCO on genome assemblies | View Docs |
| BUSCO_PROTEIN_LINEAGE | Run BUSCO on protein sequences | View Docs |
| OMAMER_HOG | Generate HOG assignments with OMAmer | View Docs |
| OMARK | Assess proteome quality with OMArk | View Docs |
| RUN_STATISTICS | Generate Ensembl assembly statistics | View Docs |
| RUN_ENSEMBL_META | Generate Ensembl metadata statistics | View Docs |
Database Integration Modules
Modules for storing results in Ensembl databases:
| Module | Purpose | Documentation |
|---|---|---|
| BUSCO_CORE_METAKEYS | Insert BUSCO results into core database | View Docs |
| POPULATE_DB | Load statistics into database | View Docs |
🔗 Module Usage by Workflow
BUSCO Workflow
The BUSCO workflow uses these modules:
```mermaid
graph LR
    A[DB_METADATA] --> B[BUSCO_DATASET]
    B --> C[FETCH_GENOME]
    B --> D[FETCH_PROTEINS]
    C --> E[BUSCO_GENOME_LINEAGE]
    D --> F[BUSCO_PROTEIN_LINEAGE]
    E --> G[BUSCO_CORE_METAKEYS]
    F --> G
```
Modules: DB_METADATA, BUSCO_DATASET, FETCH_GENOME, FETCH_PROTEINS, BUSCO_GENOME_LINEAGE, BUSCO_PROTEIN_LINEAGE, BUSCO_CORE_METAKEYS
OMArk Workflow
The OMArk workflow uses these modules:
Modules: DB_METADATA, FETCH_PROTEINS, OMAMER_HOG, OMARK
Ensembl Stats Workflow
The Ensembl Stats workflow uses these modules:
```mermaid
graph LR
    A[DB_METADATA] --> B[RUN_STATISTICS]
    A --> C[RUN_ENSEMBL_META]
    B --> D[POPULATE_DB]
    C --> D
```
Modules: DB_METADATA, RUN_STATISTICS, RUN_ENSEMBL_META, POPULATE_DB
📖 Module Documentation Structure
Each module documentation page includes:
- Overview: Purpose and functionality
- Inputs: Required input channels and parameters
- Outputs: Generated output channels and files
- Parameters: Configuration options and defaults
- Container: Docker/Singularity image used
- Resources: CPU, memory, and time requirements
- Example: Usage example with sample data
- Error Handling: Common issues and troubleshooting
- Version Tracking: Software versions captured
🔧 Common Module Patterns
Process Labels
Modules use labels to define resource requirements:
- `python` - Python-based processes (2 CPUs, 4GB RAM)
- `busco` - BUSCO analysis (8 CPUs, 16GB RAM)
- `omamer` - OMAmer/OMArk processes (4 CPUs, 8GB RAM)
- `fetch_file` - File retrieval (2 CPUs, 2GB RAM)
- `default` - Standard processes (1 CPU, 2GB RAM)
See nextflow.config for full resource definitions.
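As a rough illustration, the labels above could map to `withLabel` selectors in `nextflow.config` along the following lines. The CPU and memory values are copied from the list above. The layout of the pipeline's actual config file is an assumption, and time limits are left out because this page does not state them.

```groovy
// Sketch of label-based resources in nextflow.config.
// Values mirror the label list above; the real file may be organised differently.
process {
    withLabel: 'python' {
        cpus   = 2
        memory = 4.GB
    }
    withLabel: 'busco' {
        cpus   = 8
        memory = 16.GB
    }
    withLabel: 'omamer' {
        cpus   = 4
        memory = 8.GB
    }
    withLabel: 'fetch_file' {
        cpus   = 2
        memory = 2.GB
    }
    withLabel: 'default' {
        cpus   = 1
        memory = 2.GB
    }
}
```

A module opts in to one of these profiles by declaring, for example, `label 'busco'` at the top of its process block.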
Version Tracking
All modules emit a `versions.yml` file that records the software versions used in each task. This supports reproducibility and provides an audit trail.
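The sketch below shows one common way a Nextflow process can write such a file, following the heredoc convention used by nf-core modules. The process name `EXAMPLE_VERSIONS`, its inputs, and the choice to record the Python version are hypothetical; the pipeline's real modules may format `versions.yml` differently.

```groovy
// Hypothetical process; only the versions.yml pattern is the point here.
process EXAMPLE_VERSIONS {
    label 'python'

    input:
    tuple val(meta), path(input_file)

    output:
    tuple val(meta), path("${meta.gca}.linecount.txt"), emit: result
    path "versions.yml",                                emit: versions

    script:
    """
    # Placeholder task: count the lines of the input file
    wc -l < ${input_file} > ${meta.gca}.linecount.txt

    # Record the tool version under the fully qualified process name
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        python: \$(python --version 2>&1 | sed 's/Python //')
    END_VERSIONS
    """
}
```

Emitting the file on a separate `versions` channel lets a workflow collect the version files from every module without touching the result channels.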
Publishing Strategy
Modules use publishDir directives to control output locations:
- Analysis Results: `${params.outdir}/${meta.gca}/`
- Cached Data: `${params.cacheDir}/${meta.gca}/`
- Stored Data: `storeDir` for permanent caching
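A minimal sketch of how these directives might appear inside a process is shown below. Both process names, the shell commands, the `mode: 'copy'` option, and the directory passed to `storeDir` are assumptions made for illustration; only the `params.outdir` and `params.cacheDir` path patterns come from this page.

```groovy
// Hypothetical processes illustrating the output strategies listed above.
process EXAMPLE_ANALYSIS {
    // Analysis results are copied into the per-assembly results directory.
    // Cached data would use the same directive with params.cacheDir instead.
    publishDir "${params.outdir}/${meta.gca}/", mode: 'copy'

    input:
    tuple val(meta), path(proteins)

    output:
    tuple val(meta), path("summary.txt")

    script:
    """
    wc -l < ${proteins} > summary.txt
    """
}

process EXAMPLE_FETCH {
    // storeDir keeps outputs permanently and skips the task on later runs
    // when the declared output files already exist in that directory.
    // The directory chosen here is an assumption.
    storeDir "${params.cacheDir}/${meta.gca}/"

    input:
    val meta

    output:
    tuple val(meta), path("${meta.gca}.marker.txt")

    script:
    """
    echo "${meta.gca}" > ${meta.gca}.marker.txt
    """
}
```

The practical difference is that `publishDir` copies or links results out of the work directory after a task finishes, while `storeDir` makes the target directory the authoritative location and avoids re-running the task.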
Metadata Propagation
All modules receive and emit a meta map containing:
```groovy
[
    gca: "GCA_000001405.29",             // Genome assembly accession
    dbname: "homo_sapiens_core_110_38",  // Database name
    species_id: 1,                       // Species ID in database
    taxon_id: "9606",                    // NCBI taxonomy ID
    production_name: "homo_sapiens",     // Production name
    busco_dataset: "vertebrata_odb10"    // BUSCO lineage
]
```
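To make the propagation idea concrete, the plain Groovy sketch below adds the BUSCO lineage to a meta map without modifying the original. The helper name `withBuscoDataset` is hypothetical; how the real modules extend the map is not described on this page.

```groovy
// Plain Groovy sketch of meta-map propagation; runs without Nextflow.
def meta = [
    gca            : "GCA_000001405.29",
    dbname         : "homo_sapiens_core_110_38",
    species_id     : 1,
    taxon_id       : "9606",
    production_name: "homo_sapiens"
]

// Hypothetical helper: return a new map with the BUSCO lineage added,
// leaving the incoming map untouched so other consumers are unaffected.
def withBuscoDataset(Map m, String dataset) {
    return m + [busco_dataset: dataset]
}

def enriched = withBuscoDataset(meta, "vertebrata_odb10")

assert enriched.busco_dataset == "vertebrata_odb10"
assert !meta.containsKey("busco_dataset")   // original map is unchanged
assert enriched.gca == meta.gca             // existing keys are carried over
```

Inside a workflow, the same `meta + [key: value]` expression can be used in a channel `map` closure so that each downstream module receives the enriched map alongside its files.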
🚀 Quick Navigation
By Function
- Data Retrieval - Fetch genomes, proteins, and metadata
- Quality Assessment - BUSCO and OMArk analyses
- Statistics Generation - Ensembl statistics
- Database Operations - Store results in databases
By Tool
- BUSCO Modules - All BUSCO-related processes
- OMArk Modules - OMAmer and OMArk processes
- Ensembl Modules - Ensembl statistics generation
Alphabetical
- BUSCO_CORE_METAKEYS - Insert BUSCO results into Ensembl core database
- BUSCO_DATASET - Select appropriate BUSCO lineage dataset
- BUSCO_GENOME_LINEAGE - Run BUSCO assessment on genome assemblies
- BUSCO_PROTEIN_LINEAGE - Run BUSCO assessment on protein sequences
- DB_METADATA - Extract metadata from Ensembl databases
- FETCH_GENOME - Download genome assemblies from NCBI
- FETCH_PROTEINS - Extract protein sequences from databases
- OMAMER_HOG - Generate HOG assignments with OMAmer
- OMARK - Assess proteome quality with OMArk
- POPULATE_DB - Load statistics into Ensembl database
- RUN_ENSEMBL_META - Generate Ensembl metadata statistics
- RUN_STATISTICS - Generate comprehensive assembly statistics
📚 Additional Resources
- Workflow Documentation - How workflows use modules
- Parameter Reference - Configuration options
- Pipeline Overview - Architecture and design
- Source Code - View module implementations
Last Updated: 2026-02-06
Pipeline Version: 1.0.0