Statistics Pipeline Output Guide

This guide describes all outputs generated by the Ensembl genes statistics pipeline, including file formats, locations, and interpretation.

Table of Contents

  1. Output Directory Structure
  2. Module Outputs
  3. File Formats
  4. Cached Outputs
  5. Database Modifications
  6. Version Tracking
  7. Interpreting Results

Output Directory Structure

Standard Layout

${params.outdir}/
├── ${meta.gca}/                    # Per-genome assembly directory
│   ├── genome.fa                    # Genome sequence (FETCH_GENOME)
│   ├── ${production_name}.fa        # Protein translations (FETCH_PROTEINS)
│   ├── busco_genome_lineage/        # BUSCO genome results
│   ├── busco_protein_lineage/       # BUSCO protein results
│   ├── omark_output/                # OMark quality assessment
│   ├── core_statistics/             # Statistics SQL files
│   │   ├── *.sql                    # Generated SQL statements
│   │   └── README.txt               # Statistics summary
│   └── versions.yml                 # Tool versions used
└── pipeline_info/                   # Pipeline execution metadata
    ├── execution_report.html        # Nextflow execution report
    ├── execution_timeline.html      # Timeline visualization
    └── execution_trace.txt          # Process resource usage

Cache Directory Structure

${params.cacheDir}/
├── ${meta.gca}/
│   ├── genome/                      # Cached genome sequences
│   │   └── genome.fa
│   ├── translations/                # Cached protein translations
│   │   └── ${production_name}.fa
│   └── omamer/                      # Cached OMAmer results
│       └── proteins.omamer
└── busco_downloads/                 # BUSCO lineage datasets
    ├── ${lineage}_odb10/
    │   ├── dataset.cfg
    │   ├── hmms/
    │   ├── prfl/
    │   └── scores_cutoff
    └── lineages/

Module Outputs

FETCH_GENOME

Published to: ${params.outdir}/${meta.gca}/
Cached in: ${params.cacheDir}/${meta.gca}/genome/

Files:

  • genome.fa - FASTA format genome sequence

Format:

>1 dna:chromosome chromosome:Assembly:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
>2 dna:chromosome chromosome:Assembly:2:1:242193529:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...

Metadata:

  • Sequence headers include chromosome/scaffold IDs
  • Assembly version in header
  • N's represent gaps in assembly
  • REF indicates reference sequence

Typical Size:

  • Small genome (bacteria): 2-10 MB
  • Insect genome: 100-500 MB
  • Mammalian genome: 2-3 GB

FETCH_PROTEINS

Published to: ${params.outdir}/${meta.gca}/
Cached in: ${params.cacheDir}/${meta.gca}/translations/

Files:

  • ${production_name}.fa - FASTA format protein sequences

Format:

>ENSP00000401091 pep:protein_coding transcript:ENST00000442987 gene:ENSG00000227232
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTA
GQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKC

Metadata in Headers:

  • Protein stable ID (ENSP*)
  • Transcript stable ID (ENST*)
  • Gene stable ID (ENSG*)
  • Biotype (protein_coding)
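
The header fields listed above can be pulled apart with a few lines of Python. This is a minimal sketch that assumes the `>ID key:value key:value` layout shown in the example header; `parse_protein_header` is a hypothetical helper, not part of the pipeline.

```python
def parse_protein_header(header: str) -> dict:
    """Split an Ensembl-style protein FASTA header into its fields.

    Assumes the layout shown above: '>ID key:value key:value ...'.
    """
    tokens = header.lstrip(">").split()
    fields = {"protein_id": tokens[0]}
    for token in tokens[1:]:
        # Split on the first colon only, so values may contain colons.
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


example = (
    ">ENSP00000401091 pep:protein_coding "
    "transcript:ENST00000442987 gene:ENSG00000227232"
)
parsed = parse_protein_header(example)
print(parsed["protein_id"], parsed["pep"], parsed["transcript"], parsed["gene"])
```

The biotype ends up under the `pep` key because that is the prefix used in the header.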

Typical Size:

  • Depends on proteome size
  • Human: ~50 MB (~20,000 protein-coding genes, with multiple isoforms per gene)
  • Model organisms: 10-100 MB

Notes:

  • Only includes protein-coding translations
  • One sequence per protein isoform
  • Sequences may be written on a single line or wrapped across multiple lines

BUSCO_DATASET

Cached in: ${params.cacheDir}/busco_downloads/

Files:

${lineage}_odb10/
├── dataset.cfg          # Dataset configuration
├── hmms/                # HMM profiles for each BUSCO gene
├── prfl/                # Profile files
├── info/                # Dataset information
│   └── species.info     # Species included in lineage
└── scores_cutoff        # Score thresholds

Purpose:

  • Reference database for BUSCO assessment
  • Automatically downloaded once per lineage
  • Reused across all samples with same lineage

BUSCO_GENOME_LINEAGE

Published to: ${params.outdir}/${meta.gca}/busco_genome_lineage/

Files:

1. short_summary.specific.${lineage}.${run_name}.txt

Primary results file with completeness metrics.

Format:

# BUSCO version is: 5.4.7
# The lineage dataset is: vertebrata_odb10 (Creation date: 2021-02-19, number of genomes: 65, number of BUSCOs: 3354)
# Summarized benchmarking in BUSCO notation for file genome.fa
# BUSCO was run in mode: genome

    ***** Results: *****

    C:95.2%[S:94.1%,D:1.1%],F:2.3%,M:2.5%,n:3354       
    3194    Complete BUSCOs (C)            
    3156    Complete and single-copy BUSCOs (S)    
    38  Complete and duplicated BUSCOs (D)     
    77  Fragmented BUSCOs (F)              
    83  Missing BUSCOs (M)             
    3354    Total BUSCO groups searched        

Key Metrics:

  • C (Complete): BUSCOs found as complete genes (S + D)
  • S (Single-copy): Complete BUSCOs found exactly once, as expected
  • D (Duplicated): Complete BUSCOs found more than once
  • F (Fragmented): BUSCOs with only a partial match
  • M (Missing): BUSCOs not found
  • n: Total number of BUSCOs in the lineage dataset
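
The one-line BUSCO notation is easy to parse with a regular expression. The sketch below assumes the exact notation shown in the example summary above (BUSCO 5.4.7); other BUSCO versions may add fields, in which case the pattern would need adjusting. `parse_busco_summary` is a hypothetical helper name.

```python
import re

# Pattern for the one-line BUSCO notation, e.g.
# C:95.2%[S:94.1%,D:1.1%],F:2.3%,M:2.5%,n:3354
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text: str) -> dict:
    """Return the BUSCO percentages and total count found in a summary."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO notation line found")
    metrics = {k: float(v) for k, v in match.groupdict().items()}
    metrics["total"] = int(metrics["total"])
    return metrics


summary = """
    ***** Results: *****

    C:95.2%[S:94.1%,D:1.1%],F:2.3%,M:2.5%,n:3354
    3194    Complete BUSCOs (C)
"""
metrics = parse_busco_summary(summary)
print(metrics)
```

Because the pattern is searched rather than anchored, it works regardless of whether the line is indented with tabs or spaces.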

2. full_table.tsv

Detailed results for each BUSCO gene.

Format:

# Busco id  Status  Sequence    Gene Start  Gene End    Strand  Score   Length  OrthoDB url Description
1276at7742  Complete    1   12345   15678   +   1234.5  1112    https://... Description
1277at7742  Duplicated  2   23456   26789   -   2345.6  1045    https://... Description
1278at7742  Fragmented  3   34567   36890   +   456.7   223 https://... Description

Columns:

  • Busco id: BUSCO gene identifier
  • Status: Complete, Duplicated, Fragmented, or Missing
  • Sequence: Chromosome/scaffold where found
  • Gene Start/End: Genomic coordinates
  • Strand: + or -
  • Score: BUSCO score
  • Length: Length of matched region
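
To tally statuses from a full table, something like the following sketch may help. It assumes that comment lines start with `#`, that Missing rows carry only the id and status columns, and that a duplicated BUSCO appears on one row per copy; the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_busco_statuses(handle) -> Counter:
    """Count distinct BUSCO ids per status in a full_table.tsv stream."""
    seen = set()
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and '#' comment/header lines.
        if not row or row[0].startswith("#"):
            continue
        busco_id, status = row[0], row[1]
        # A duplicated BUSCO may appear on several rows; count it once.
        if (busco_id, status) not in seen:
            seen.add((busco_id, status))
            counts[status] += 1
    return counts


sample = io.StringIO(
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n"
    "1276at7742\tComplete\t1\t12345\t15678\t+\t1234.5\t1112\n"
    "1277at7742\tDuplicated\t2\t23456\t26789\t-\t2345.6\t1045\n"
    "1277at7742\tDuplicated\t5\t1000\t4000\t+\t2300.1\t1040\n"
    "1278at7742\tFragmented\t3\t34567\t36890\t+\t456.7\t223\n"
    "1279at7742\tMissing\n"
)
counts = count_busco_statuses(sample)
print(counts)
```

For a real run, pass an open file handle for `full_table.tsv` instead of the in-memory sample.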

3. missing_busco_list.tsv

List of missing BUSCO genes.

Format:

# Busco id
1279at7742
1280at7742

4. Additional Files:
  • busco_sequences/ - FASTA files of identified BUSCOs
  • single_copy_busco_sequences/ - Single-copy genes
  • multi_copy_busco_sequences/ - Duplicated genes
  • fragmented_busco_sequences/ - Fragmented genes
  • logs/ - Execution logs
  • hmmer_output/ - Raw HMMER results

Typical Size:

  • Summary files: < 1 MB
  • Full table: 1-5 MB
  • Sequences directory: 10-50 MB

BUSCO_PROTEIN_LINEAGE

Published to: ${params.outdir}/${meta.gca}/busco_protein_lineage/

Files:

Same structure as BUSCO_GENOME_LINEAGE, but:

  • Assesses protein sequences instead of the genome
  • Reports protein-level completeness
  • Headers reference protein IDs instead of genomic coordinates

Key Differences:

  • Usually higher completeness than genome (proteins already annotated)
  • No genomic coordinates in output
  • Faster execution (smaller input)

BUSCO_CORE_METAKEYS

Output: None published (direct database modification)

Database Changes:

Inserts/updates rows in the meta table:

-- Example entries added
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'busco.genome.version', '5.4.7'),
  (1, 'busco.genome.lineage', 'vertebrata_odb10'),
  (1, 'busco.genome.complete', '95.2'),
  (1, 'busco.genome.complete_single', '94.1'),
  (1, 'busco.genome.complete_duplicated', '1.1'),
  (1, 'busco.genome.fragmented', '2.3'),
  (1, 'busco.genome.missing', '2.5'),
  (1, 'busco.protein.complete', '98.5'),
  ...

Verification:

SELECT meta_key, meta_value 
FROM meta 
WHERE meta_key LIKE 'busco%' 
ORDER BY meta_key;

OMAMER_HOG

Cached in: ${params.cacheDir}/${meta.gca}/omamer/

Files:

  • proteins.omamer - OMAmer orthology assignments

Format:

Binary format containing:

  • Protein ID to HOG (Hierarchical Orthologous Group) mappings
  • Taxonomic placement information
  • Orthology scores

Size:

  • Typically 1-10 MB depending on proteome size

Notes:

  • Not human-readable (binary format)
  • Used as input for OMark
  • Cached to avoid recomputation

OMARK

Published to: ${params.outdir}/${meta.gca}/omark_output/

Files:

1. omark_output.txt

Main quality assessment results.

Format:

# OMark Quality Assessment
Taxon: Homo sapiens (9606)
Database: LUCA.h5
Number of proteins: 20391

# Quality Metrics
Completeness: 95.8%
Consistency: 97.2%
Contamination: 0.5%

# Detailed Results
Category                    Count    Percentage
Complete                    19542    95.8%
Consistent                  19832    97.2%
Inconsistent                559      2.7%
Contaminants                102      0.5%

Key Metrics:

  • Completeness: % of expected orthologs present
  • Consistency: Agreement between annotation and orthology
  • Contamination: Potential non-target sequences
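
The three headline metrics can be extracted from the summary text as sketched below. This assumes the `Name: 12.3%` layout shown in the example above; files written by other OMark versions may be laid out differently. `parse_omark_metrics` is a hypothetical helper name.

```python
import re


def parse_omark_metrics(text: str) -> dict:
    """Extract 'Name: 12.3%' style metrics from an OMark summary.

    Assumes the layout shown above; other OMark versions may differ.
    """
    wanted = ("Completeness", "Consistency", "Contamination")
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s*([\d.]+)%\s*$", line.strip())
        if match and match.group(1) in wanted:
            metrics[match.group(1).lower()] = float(match.group(2))
    return metrics


report = """# Quality Metrics
Completeness: 95.8%
Consistency: 97.2%
Contamination: 0.5%
"""
omark = parse_omark_metrics(report)
print(omark)
```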

2. inconsistent.txt

List of proteins with inconsistent orthology.

Format:

Protein_ID  Expected_Taxon  Observed_Taxon  Score
ENSP00000401091 Vertebrata  Bacteria    -2.5

3. contaminants.txt

Potential contamination candidates.

4. Additional Files:
  • detailed_summary.txt - Extended statistics
  • *.pdf - Visualization plots (if generated)
  • hog_mapping.txt - HOG assignments

Typical Size:

  • Text files: 1-10 MB
  • PDF plots: 1-5 MB each

RUN_STATISTICS

Published to: ${params.outdir}/${meta.gca}/core_statistics/

Files:

Multiple SQL files containing INSERT statements:

1. gene_stats.sql

Gene count statistics by biotype.

Format:

INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'gene.total', '20391'),
  (1, 'gene.protein_coding', '19950'),
  (1, 'gene.pseudogene', '15'),
  (1, 'gene.lncRNA', '345'),
  (1, 'gene.miRNA', '48'),
  (1, 'gene.snRNA', '23'),
  (1, 'gene.snoRNA', '10');

2. transcript_stats.sql

Transcript statistics.

Format:

INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'transcript.total', '81256'),
  (1, 'transcript.protein_coding', '79850'),
  (1, 'transcript.avg_per_gene', '3.98');

3. coding_stats.sql

Coding sequence statistics.

Format:

INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'coding.avg_length', '1821'),
  (1, 'coding.total_bases', '36234567'),
  (1, 'coding.avg_exons_per_transcript', '8.5');

4. Additional Stats:
  • exon_stats.sql - Exon counts and lengths
  • intron_stats.sql - Intron statistics
  • assembly_stats.sql - Genome assembly metrics
  • homology_stats.sql - Orthology/paralogy counts

SQL Structure:

All stats follow the pattern:

DELETE FROM meta WHERE species_id = X AND meta_key = 'stat.name';
INSERT INTO meta (species_id, meta_key, meta_value) VALUES (X, 'stat.name', 'value');
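
The DELETE-then-INSERT pattern is what makes the files safe to re-run. The sketch below demonstrates this with Python's built-in `sqlite3` module and an in-memory table standing in for the core database's meta table; the real pipeline targets MySQL, so this only illustrates the idea. `upsert_stat` is a hypothetical helper name.

```python
import sqlite3


def upsert_stat(conn, species_id: int, key: str, value: str) -> None:
    """Apply the DELETE-then-INSERT pattern for one meta key."""
    with conn:  # commit both statements together, roll back on error
        conn.execute(
            "DELETE FROM meta WHERE species_id = ? AND meta_key = ?",
            (species_id, key),
        )
        conn.execute(
            "INSERT INTO meta (species_id, meta_key, meta_value) VALUES (?, ?, ?)",
            (species_id, key, value),
        )


# In-memory stand-in for the core database's meta table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta (meta_id INTEGER PRIMARY KEY, species_id INTEGER, "
    "meta_key TEXT NOT NULL, meta_value TEXT NOT NULL)"
)

# Running the same statement pair twice leaves exactly one row.
upsert_stat(conn, 1, "gene.total", "20391")
upsert_stat(conn, 1, "gene.total", "20391")
rows = conn.execute(
    "SELECT meta_value FROM meta WHERE meta_key = 'gene.total'"
).fetchall()
print(rows)
```

Without the DELETE, the second call would leave two `gene.total` rows.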


RUN_ENSEMBL_META

Published to: ${params.outdir}/${meta.gca}/core_statistics/

Files:

1. core_meta.sql

Core database metadata.

Format:

-- Schema information
INSERT INTO meta (meta_key, meta_value) VALUES
  ('schema_type', 'core'),
  ('schema_version', '110');

-- Species information
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'species.production_name', 'homo_sapiens'),
  (1, 'species.scientific_name', 'Homo sapiens'),
  (1, 'species.common_name', 'Human'),
  (1, 'species.taxonomy_id', '9606'),
  (1, 'species.division', 'EnsemblVertebrates');

-- Assembly information
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'assembly.default', 'GRCh38'),
  (1, 'assembly.name', 'GRCh38.p14'),
  (1, 'assembly.accession', 'GCA_000001405.29');

-- Production information
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'genebuild.method', 'import'),
  (1, 'genebuild.start_date', '2023-10'),
  (1, 'genebuild.version', '2023_10');

2. external_db_meta.sql

External database references.

Format:

INSERT INTO meta (species_id, meta_key, meta_value) VALUES
  (1, 'xref.timestamp', '2023-10-01');


POPULATE_DB

Output: None (executes SQL against database)

Database Changes:

  • Executes all SQL files from RUN_STATISTICS and RUN_ENSEMBL_META
  • Populates meta table with statistics and metadata
  • May update other tables depending on SQL content

Verification:

-- Check statistics were inserted
SELECT meta_key, meta_value 
FROM meta 
WHERE meta_key LIKE 'gene%' 
   OR meta_key LIKE 'transcript%'
ORDER BY meta_key;

-- Check metadata
SELECT meta_key, meta_value 
FROM meta 
WHERE meta_key LIKE 'species%'
   OR meta_key LIKE 'assembly%'
ORDER BY meta_key;

DB_METADATA

Output: Database modifications

Database Changes:

Updates database metadata tracking tables.


CLEANING

Output: File deletions

Actions:

  • Removes files from ${params.outdir} if params.clean is true
  • Removes work directories if params.clean_work_dir is true
  • Preserves essential outputs and summaries

File Formats

FASTA Format (.fa, .fasta)

Used by: FETCH_GENOME, FETCH_PROTEINS, BUSCO outputs

Structure:

>header_line
SEQUENCE_DATA
>another_header
MORE_SEQUENCE_DATA

Notes:

  • Header starts with >
  • Sequence can be single-line or multi-line
  • Standard nucleotide codes (A, T, G, C, N) or amino acid codes
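
Because sequences may be wrapped, a reader has to join lines until the next header. The following is a minimal dependency-free sketch of that logic; for real work a library such as Biopython is likely the better choice. `read_fasta` is a hypothetical helper name.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream.

    Wrapped (multi-line) sequences are joined into one string.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


sample = io.StringIO(">seq1 single line\nACGTN\n>seq2 wrapped\nMTEYK\nLVVVG\n")
records = list(read_fasta(sample))
print(records)
```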

Tab-Separated Values (.tsv)

Used by: BUSCO full tables, OMark results

Structure:

# Comment lines start with #
Column1 Column2 Column3
Value1  Value2  Value3

Notes:

  • Tab character (\t) separates columns
  • May include header row
  • Comments prefixed with #

SQL Format (.sql)

Used by: Statistics and metadata files

Structure:

-- Comments start with --
DELETE FROM table WHERE condition;
INSERT INTO table (col1, col2) VALUES (val1, val2);
UPDATE table SET col1 = val WHERE condition;

Notes:

  • Standard SQL statements
  • Safe to execute multiple times (includes DELETE before INSERT)
  • Species-specific (uses species_id)

YAML Format (.yml)

Used by: Version tracking

Structure:

"PROCESS_NAME":
  tool_name: version_string
  python: 3.9.7

Notes:

  • Human-readable key-value pairs
  • Hierarchical structure
  • Easily parsed programmatically


Cached Outputs

Purpose of Caching

The pipeline uses the storeDir directive for expensive operations:

  • FETCH_GENOME: Avoid repeated database queries
  • FETCH_PROTEINS: Avoid repeated protein extraction
  • BUSCO_DATASET: Avoid repeated dataset downloads
  • OMAMER_HOG: Avoid expensive orthology searches

Cache Behavior

  1. First run: Computes result and stores in cache
  2. Subsequent runs: Uses cached result (instant)
  3. Cache invalidation: Manual deletion required if source data changes
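
The three behaviours above can be illustrated with a small Python sketch. It mimics the logic only (reuse the file if present, otherwise compute and store) and is not how Nextflow implements storeDir; `cached` and the file contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def cached(path: Path, compute) -> tuple[str, bool]:
    """Return (content, was_cached) for a cache file.

    Mirrors the behaviour described above: reuse the file if it exists,
    otherwise compute the result and store it. Nothing checks whether the
    source data changed, which is why invalidation is manual.
    """
    if path.exists():
        return path.read_text(), True
    path.parent.mkdir(parents=True, exist_ok=True)
    content = compute()
    path.write_text(content)
    return content, False


with tempfile.TemporaryDirectory() as cache_dir:
    target = Path(cache_dir) / "GCA_000001405.29" / "genome" / "genome.fa"
    first = cached(target, lambda: ">1\nACGT\n")   # computed and stored
    second = cached(target, lambda: ">1\nTTTT\n")  # source "changed", stale hit
    target.unlink()                                # manual invalidation
    third = cached(target, lambda: ">1\nTTTT\n")   # recomputed

print(first[1], second[1], third[1])
```

The second call returns the stale content even though the source changed, which is the situation the manual invalidation step is meant to address.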

Cache Management

# Set this to the value of params.cacheDir (the ${params.*} syntax is Nextflow, not shell)
cache_dir=/path/to/cache

# Check cache size
du -sh "${cache_dir}"

# Clear specific genome cache
rm -rf "${cache_dir}/GCA_000001405.15/"

# Clear BUSCO datasets
rm -rf "${cache_dir}/busco_downloads/"

# Clear all OMAmer results
find "${cache_dir}" -name "*.omamer" -delete

When to Clear Cache

  • Source database updated
  • BUSCO datasets updated
  • Assembly version changed
  • Proteins re-annotated
  • OMAmer database updated

Database Modifications

Tables Modified

1. meta Table

Primary target for statistics and metadata.

Structure:

CREATE TABLE meta (
  meta_id INT PRIMARY KEY AUTO_INCREMENT,
  species_id INT DEFAULT 1,
  meta_key VARCHAR(255) NOT NULL,
  meta_value TEXT NOT NULL
);

Keys Added:

  • busco.* - BUSCO metrics
  • gene.* - Gene statistics
  • transcript.* - Transcript statistics
  • species.* - Species information
  • assembly.* - Assembly information
  • schema_* - Schema metadata

2. Other Tables (depending on SQL)

  • May update attrib_type
  • May insert into seq_region_attrib
  • May modify production metadata

Backup Recommendation

Before running POPULATE_DB:

-- Backup meta table (SQL)
CREATE TABLE meta_backup AS SELECT * FROM meta;

# Or full database backup (shell command, run outside the SQL client)
mysqldump -h host -u user -p dbname > backup.sql


Version Tracking

versions.yml Structure

Each module generates a versions file:

"BUSCO_GENOME_LINEAGE":
  busco: 5.4.7
"FETCH_GENOME":
  perl: 5.32.0
"OMAMER_HOG":
  omamer: 0.3.3
"OMARK":
  omark: 0.3.0
"RUN_STATISTICS":
  perl: 5.32.0
"RUN_ENSEMBL_META":
  python: 3.9.7
"POPULATE_DB":
  mysql: 8.0.31

Collecting All Versions

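# Set this to the value of params.outdir
outdir=/path/to/results

# Combine all version files
find "${outdir}" -name "versions.yml" -exec cat {} \; > all_versions.yml

# Or from the Nextflow work directory (the ** glob needs bash globstar)
shopt -s globstar
cat work/**/versions.yml > complete_versions.yml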

Version Verification

# Check for version consistency across samples
# (outdir holds the value of params.outdir; -h drops file names so identical versions collapse)
grep -h "busco:" "${outdir}"/*/versions.yml | sort | uniq

# Verify against requirements
diff requirements.txt all_versions.yml
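
The consistency check can also be sketched in Python once the versions files are parsed. The example below works on already-loaded dictionaries (loading each versions.yml, for instance with PyYAML's `yaml.safe_load`, is assumed and not shown); the sample names and the second BUSCO version are invented.

```python
from collections import defaultdict


def find_version_conflicts(per_sample: dict) -> dict:
    """Return {tool: {version: [samples]}} for tools with more than one version.

    `per_sample` maps a sample name to the parsed content of its
    versions.yml, i.e. {process_name: {tool: version}}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for sample, processes in per_sample.items():
        for tools in processes.values():
            for tool, version in tools.items():
                if sample not in seen[tool][str(version)]:
                    seen[tool][str(version)].append(sample)
    return {
        tool: dict(versions)
        for tool, versions in seen.items()
        if len(versions) > 1
    }


# Hypothetical parsed versions for two samples.
parsed = {
    "GCA_1": {"BUSCO_GENOME_LINEAGE": {"busco": "5.4.7"}, "OMARK": {"omark": "0.3.0"}},
    "GCA_2": {"BUSCO_GENOME_LINEAGE": {"busco": "5.5.0"}, "OMARK": {"omark": "0.3.0"}},
}
conflicts = find_version_conflicts(parsed)
print(conflicts)
```

An empty result means every tool was run with a single version across all samples.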

Interpreting Results

BUSCO Completeness Guidelines

Genome Assessment:

  • > 95% Complete: Excellent quality
  • 90-95% Complete: Good quality
  • 80-90% Complete: Acceptable quality
  • < 80% Complete: Poor quality or wrong lineage

Expected Duplicated:

  • < 5%: Normal diploid genome
  • 5-15%: Recent duplication or polyploidy
  • > 15%: High heterozygosity or contamination

Fragmented:

  • < 5%: Good assembly contiguity
  • > 10%: Fragmented assembly, consider improvement
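
These bands can be applied automatically, for example when screening many genomes. The sketch below encodes the thresholds above; where the text leaves a boundary or a range open (such as 5-10% fragmented), the handling is an assumption and is marked in the comments. `rate_busco_genome` is a hypothetical helper name.

```python
def rate_busco_genome(complete: float, duplicated: float, fragmented: float) -> dict:
    """Map BUSCO percentages onto the guideline bands above.

    Shared boundary values (exactly 90 or 80) are assigned to the better
    band; that choice is an assumption, not stated in the guidelines.
    """
    if complete > 95:
        quality = "excellent"
    elif complete >= 90:
        quality = "good"
    elif complete >= 80:
        quality = "acceptable"
    else:
        quality = "poor or wrong lineage"

    if duplicated < 5:
        duplication = "normal"
    elif duplicated <= 15:
        duplication = "recent duplication or polyploidy"
    else:
        duplication = "high heterozygosity or contamination"

    if fragmented < 5:
        contiguity = "good"
    elif fragmented > 10:
        contiguity = "fragmented"
    else:
        # The guidelines do not say how to read 5-10%.
        contiguity = "unclassified"
    return {"quality": quality, "duplication": duplication, "contiguity": contiguity}


rating = rate_busco_genome(complete=95.2, duplicated=1.1, fragmented=2.3)
print(rating)
```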

OMark Quality Guidelines

Completeness:

  • > 95%: Excellent annotation
  • 90-95%: Good annotation
  • < 90%: Review annotation quality

Consistency:

  • > 95%: High consistency
  • 90-95%: Acceptable
  • < 90%: Review inconsistent proteins

Contamination:

  • < 1%: Clean
  • 1-5%: Minor contamination
  • > 5%: Significant contamination, investigate

Gene Statistics Interpretation

Protein-coding gene counts:

  • Human: ~20,000
  • Mouse: ~22,000
  • Drosophila: ~13,000
  • C. elegans: ~20,000

Transcripts per gene:

  • < 2: Limited alternative splicing
  • 2-4: Normal alternative splicing
  • > 5: High alternative splicing (vertebrates)

Coding length:

  • Average protein: 300-500 amino acids
  • Average CDS: 900-1500 bases
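
As a worked example of how these ratios are derived, the sketch below uses the illustrative values from the RUN_STATISTICS examples earlier in this guide (81256 transcripts, 20391 genes, average CDS of 1821 bases). The function names are hypothetical.

```python
def transcripts_per_gene(transcripts: int, genes: int) -> float:
    """Average number of transcripts per gene."""
    return transcripts / genes


def cds_to_codons(cds_bases: float) -> float:
    """Convert a CDS length in bases to a length in codons (3 bases each)."""
    return cds_bases / 3


ratio = transcripts_per_gene(81256, 20391)
codons = cds_to_codons(1821)
print(f"{ratio:.2f} transcripts per gene, {codons:.0f} codons per CDS")
```

With these inputs the ratio rounds to 3.98, which falls in the 2-4 band above.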


Output Checklist

After pipeline completion, verify:

  • [ ] All expected sample directories exist in ${params.outdir}
  • [ ] BUSCO results show reasonable completeness
  • [ ] OMark results don't show excessive contamination
  • [ ] Statistics SQL files exist and contain data
  • [ ] versions.yml files present for all modules
  • [ ] No empty output files (check file sizes)
  • [ ] Database meta table populated (if POPULATE_DB ran)
  • [ ] Execution reports generated in pipeline_info/
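
Parts of this checklist can be automated. The following sketch checks one sample directory for missing or empty files, using path patterns taken from the layout described in this guide; `check_outputs` is a hypothetical helper, and the demo runs against a throwaway directory.

```python
import tempfile
from pathlib import Path

# Patterns relative to ${params.outdir}/${meta.gca}/, taken from the layout above.
EXPECTED = [
    "busco_genome_lineage/short_summary.*.txt",
    "busco_protein_lineage/short_summary.*.txt",
    "omark_output/omark_output.txt",
    "core_statistics/*.sql",
    "versions.yml",
]


def check_outputs(sample_dir: Path) -> dict:
    """Return {pattern: problem} for missing or empty expected outputs."""
    problems = {}
    for pattern in EXPECTED:
        matches = list(sample_dir.glob(pattern))
        if not matches:
            problems[pattern] = "missing"
        elif any(p.stat().st_size == 0 for p in matches):
            problems[pattern] = "empty file"
    return problems


# Demo: one good output, one empty file, three missing outputs.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "GCA_000001405.29"
    (sample_dir / "core_statistics").mkdir(parents=True)
    (sample_dir / "core_statistics" / "gene_stats.sql").write_text("-- demo\n")
    (sample_dir / "versions.yml").write_text("")
    problems = check_outputs(sample_dir)

print(problems)
```

An empty dictionary means all listed outputs are present and non-empty; it does not judge whether the metrics themselves are reasonable.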

Accessing Results Programmatically

Python Example

import glob
import yaml
from Bio import SeqIO

# Adjust these to your run (placeholders)
outdir = "results"
gca = "GCA_000001405.29"
production_name = "homo_sapiens"

# Read BUSCO results; open() does not expand wildcards, so glob first
busco_dir = f"{outdir}/{gca}/busco_genome_lineage"
for summary in glob.glob(f"{busco_dir}/short_summary.*.txt"):
    with open(summary) as f:
        for line in f:
            # The metrics line is indented, so strip before matching
            if line.strip().startswith("C:"):
                print(f"BUSCO metrics: {line.strip()}")

# Read protein sequences
proteins = f"{outdir}/{gca}/{production_name}.fa"
seqs = list(SeqIO.parse(proteins, "fasta"))
print(f"Protein count: {len(seqs)}")

# Read versions (an empty file loads as None, hence the fallback)
version_files = glob.glob(f"{outdir}/**/versions.yml", recursive=True)
versions = {}
for vf in version_files:
    with open(vf) as f:
        versions.update(yaml.safe_load(f) or {})

SQL Example

-- Query all statistics
SELECT 
    meta_key,
    meta_value
FROM meta
WHERE species_id = 1
  AND (meta_key LIKE 'gene.%'
   OR meta_key LIKE 'transcript.%'
   OR meta_key LIKE 'busco.%')
ORDER BY meta_key;

-- Compare BUSCO genome vs protein
-- Group on the last key component so genome and protein land in the same row
SELECT
    SUBSTRING_INDEX(meta_key, '.', -1) AS metric_type,
    MAX(CASE WHEN meta_key LIKE '%.genome.%' THEN meta_value END) AS genome_value,
    MAX(CASE WHEN meta_key LIKE '%.protein.%' THEN meta_value END) AS protein_value
FROM meta
WHERE meta_key LIKE 'busco.%complete'
GROUP BY metric_type;

Bash Example

#!/bin/bash

# Summary report
# Arguments: output directory, then GCA accession
outdir=$1
gca=$2

echo "=== Statistics Summary for ${gca} ==="

# BUSCO genome (glob left unquoted so the shell expands it; -h hides file names)
echo "BUSCO Genome Completeness:"
grep -h "C:" "${outdir}/${gca}"/busco_genome_lineage/short_summary.*.txt | head -1

# BUSCO protein
echo "BUSCO Protein Completeness:"
grep -h "C:" "${outdir}/${gca}"/busco_protein_lineage/short_summary.*.txt | head -1

# OMark
omark="${outdir}/${gca}/omark_output/omark_output.txt"
echo "OMark Quality:"
grep -E "Completeness|Consistency|Contamination" "$omark"

# Gene count from SQL (the value is the quoted string that follows the key)
sql_file="${outdir}/${gca}/core_statistics/gene_stats.sql"
echo "Gene Count:"
grep -oP "'gene\.total',\s*'\K[^']*" "$sql_file"

Archiving and Delivery

Essential Files for Archive

Minimum files to preserve:

${gca}/
├── busco_genome_lineage/short_summary.*.txt
├── busco_protein_lineage/short_summary.*.txt
├── omark_output/omark_output.txt
├── core_statistics/*.sql
└── versions.yml

Full Archive

Complete output for reproducibility:

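# Create tarball (outdir holds the value of params.outdir; pipeline_info/ sits beside the sample directories)
tar -czf "${gca}_statistics.tar.gz" \
    "${outdir}/${gca}/" \
    "${outdir}/pipeline_info/"

# Verify the archive is readable and count its entries
tar -tzf "${gca}_statistics.tar.gz" | wc -l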

Delivery to Database Team

Provide:

  1. SQL files in core_statistics/
  2. BUSCO summaries (PDF or screenshots)
  3. OMark assessment
  4. versions.yml
  5. Execution report


Quick Reference

Module                 Primary Output          Location                               Format
FETCH_GENOME           genome.fa               outdir/${gca}/                         FASTA
FETCH_PROTEINS         ${production_name}.fa   outdir/${gca}/                         FASTA
BUSCO_GENOME_LINEAGE   short_summary.*.txt     outdir/${gca}/busco_genome_lineage/    TXT
BUSCO_PROTEIN_LINEAGE  short_summary.*.txt     outdir/${gca}/busco_protein_lineage/   TXT
OMAMER_HOG             proteins.omamer         cache/${gca}/omamer/                   Binary
OMARK                  omark_output.txt        outdir/${gca}/omark_output/            TXT
RUN_STATISTICS         *.sql                   outdir/${gca}/core_statistics/         SQL
RUN_ENSEMBL_META       *.sql                   outdir/${gca}/core_statistics/         SQL
All modules            versions.yml            outdir/${gca}/                         YAML