Statistics Pipeline Output Guide¶
This guide describes all outputs generated by the Ensembl genes statistics pipeline, including file formats, locations, and interpretation.
Table of Contents¶
- Output Directory Structure
- Module Outputs
- File Formats
- Cached Outputs
- Database Modifications
- Version Tracking
- Interpreting Results
Output Directory Structure¶
Standard Layout¶
${params.outdir}/
├── ${meta.gca}/                      # Per-genome assembly directory
│   ├── genome.fa                     # Genome sequence (FETCH_GENOME)
│   ├── ${production_name}.fa         # Protein translations (FETCH_PROTEINS)
│   ├── busco_genome_lineage/         # BUSCO genome results
│   ├── busco_protein_lineage/        # BUSCO protein results
│   ├── omark_output/                 # OMark quality assessment
│   ├── core_statistics/              # Statistics SQL files
│   │   ├── *.sql                     # Generated SQL statements
│   │   └── README.txt                # Statistics summary
│   └── versions.yml                  # Tool versions used
│
└── pipeline_info/                    # Pipeline execution metadata
    ├── execution_report.html         # Nextflow execution report
    ├── execution_timeline.html       # Timeline visualization
    └── execution_trace.txt           # Process resource usage
Cache Directory Structure¶
${params.cacheDir}/
├── ${meta.gca}/
│   ├── genome/                       # Cached genome sequences
│   │   └── genome.fa
│   ├── translations/                 # Cached protein translations
│   │   └── ${production_name}.fa
│   └── omamer/                       # Cached OMAmer results
│       └── proteins.omamer
│
└── busco_downloads/                  # BUSCO lineage datasets
    ├── ${lineage}_odb10/
    │   ├── dataset.cfg
    │   ├── hmms/
    │   ├── prfl/
    │   └── scores_cutoff
    └── lineages/
Module Outputs¶
FETCH_GENOME¶
Published to: ${params.outdir}/${meta.gca}/
Cached in: ${params.cacheDir}/${meta.gca}/genome/
Files:¶
- genome.fa - FASTA format genome sequence
Format:¶
>1 dna:chromosome chromosome:Assembly:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
>2 dna:chromosome chromosome:Assembly:2:1:242193529:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
Metadata:¶
- Sequence headers include chromosome/scaffold IDs
- Assembly version in header
- N's represent gaps in assembly
- REF indicates reference sequence
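As a rough illustration of the header layout described above, the following Python sketch splits a genome header into its parts. It assumes the location token always has six colon-separated fields (coordinate system, assembly, name, start, end, strand), as in the example headers; `GenomeHeader` and `parse_genome_header` are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class GenomeHeader(NamedTuple):
    seq_id: str
    coord_system: str
    assembly: str
    start: int
    end: int
    strand: int
    is_reference: bool


def parse_genome_header(line: str) -> GenomeHeader:
    """Parse a header such as
    '>1 dna:chromosome chromosome:Assembly:1:1:248956422:1 REF'."""
    tokens = line.lstrip(">").split()
    seq_id = tokens[0]
    # Assumption: the location token is the one with exactly five colons,
    # i.e. coord_system:assembly:name:start:end:strand
    location = next(t for t in tokens[1:] if t.count(":") == 5)
    coord_system, assembly, _name, start, end, strand = location.split(":")
    return GenomeHeader(
        seq_id, coord_system, assembly,
        int(start), int(end), int(strand),
        "REF" in tokens[2:],
    )


header = parse_genome_header(
    ">1 dna:chromosome chromosome:Assembly:1:1:248956422:1 REF"
)
# Sequence length implied by the 1-based, inclusive coordinates
print(header.seq_id, header.assembly, header.end - header.start + 1)
```

Headers that deviate from this layout would need additional handling.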
Typical Size:¶
- Small genome (bacteria): 2-10 MB
- Insect genome: 100-500 MB
- Mammalian genome: 2-3 GB
FETCH_PROTEINS¶
Published to: ${params.outdir}/${meta.gca}/
Cached in: ${params.cacheDir}/${meta.gca}/translations/
Files:¶
- ${production_name}.fa - FASTA format protein sequences
Format:¶
>ENSP00000401091 pep:protein_coding transcript:ENST00000442987 gene:ENSG00000227232
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTA
GQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKC
Metadata in Headers:¶
- Protein stable ID (ENSP*)
- Transcript stable ID (ENST*)
- Gene stable ID (ENSG*)
- Biotype (protein_coding)
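The header fields listed above can be pulled apart with a few lines of Python. This is a minimal sketch that assumes every token after the protein ID has the form `key:value` with no embedded spaces, as in the example header; `parse_protein_header` is a hypothetical helper, not part of the pipeline.

```python
def parse_protein_header(line: str) -> dict:
    """Split a protein FASTA header into its key:value fields."""
    tokens = line.lstrip(">").split()
    fields = {"protein": tokens[0]}
    for token in tokens[1:]:
        # Split on the first colon only, so values may contain colons
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


fields = parse_protein_header(
    ">ENSP00000401091 pep:protein_coding "
    "transcript:ENST00000442987 gene:ENSG00000227232"
)
print(fields["gene"], fields["transcript"], fields["pep"])
```

Here the biotype ends up under the `pep` key, mirroring the `pep:protein_coding` token in the header.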
Typical Size:¶
- Depends on proteome size
- Human: ~50 MB (20,000 proteins)
- Model organisms: 10-100 MB
Notes:¶
- Only includes protein-coding translations
- One sequence per protein isoform
- Sequences may be written on a single line or wrapped across multiple lines
BUSCO_DATASET¶
Cached in: ${params.cacheDir}/busco_downloads/
Files:¶
${lineage}_odb10/
├── dataset.cfg # Dataset configuration
├── hmms/ # HMM profiles for each BUSCO gene
├── prfl/ # Profile files
├── info/ # Dataset information
│ └── species.info # Species included in lineage
└── scores_cutoff # Score thresholds
Purpose:¶
- Reference database for BUSCO assessment
- Automatically downloaded once per lineage
- Reused across all samples with same lineage
BUSCO_GENOME_LINEAGE¶
Published to: ${params.outdir}/${meta.gca}/busco_genome_lineage/
Files:¶
1. short_summary.specific.${lineage}.${run_name}.txt¶
Primary results file with completeness metrics.
Format:
# BUSCO version is: 5.4.7
# The lineage dataset is: vertebrata_odb10 (Creation date: 2021-02-19, number of genomes: 65, number of BUSCOs: 3354)
# Summarized benchmarking in BUSCO notation for file genome.fa
# BUSCO was run in mode: genome
***** Results: *****
C:95.2%[S:94.1%,D:1.1%],F:2.3%,M:2.5%,n:3354
3194 Complete BUSCOs (C)
3156 Complete and single-copy BUSCOs (S)
38 Complete and duplicated BUSCOs (D)
77 Fragmented BUSCOs (F)
83 Missing BUSCOs (M)
3354 Total BUSCO groups searched
Key Metrics:
- C (Complete): Total % of BUSCOs found
- S (Single-copy): Expected single copy
- D (Duplicated): Found multiple times
- F (Fragmented): Partial matches
- M (Missing): Not found
- n: Total BUSCOs in lineage
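The one-line BUSCO notation can be parsed with a regular expression. The sketch below assumes the line has exactly the shape shown in the example summary (`C:..%[S:..%,D:..%],F:..%,M:..%,n:..`); `parse_busco_notation` is a hypothetical helper name.

```python
import re

# Pattern for the BUSCO one-line summary, e.g.
# C:95.2%[S:94.1%,D:1.1%],F:2.3%,M:2.5%,n:3354
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_notation(text: str) -> dict:
    """Return the percentages as floats and n as an int."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO notation line found")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["n"] = int(result["n"])
    return result


metrics = parse_busco_notation("C:95.2%[S:94.1%,D:1.1%],F:2.3%,M:2.5%,n:3354")
# Sanity check: C + F + M should account for roughly 100% of BUSCOs
total = metrics["C"] + metrics["F"] + metrics["M"]
print(metrics, round(total, 1))
```

Because `search` is used, the function also works when handed the full text of a summary file rather than just the results line.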
2. full_table.tsv¶
Detailed results for each BUSCO gene.
Format:
# Busco id Status Sequence Gene Start Gene End Strand Score Length OrthoDB url Description
1276at7742 Complete 1 12345 15678 + 1234.5 1112 https://... Description
1277at7742 Duplicated 2 23456 26789 - 2345.6 1045 https://... Description
1278at7742 Fragmented 3 34567 36890 + 456.7 223 https://... Description
Columns:
- Busco id: BUSCO gene identifier
- Status: Complete, Duplicated, Fragmented, or Missing
- Sequence: Chromosome/scaffold where found
- Gene Start/End: Genomic coordinates
- Strand: + or -
- Score: BUSCO score
- Length: Length of matched region
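To tally BUSCOs per status from a full table, something like the following Python sketch may help. It assumes comment lines start with `#`, that columns are tab-separated, and that a duplicated BUSCO appears on one row per copy (so IDs are counted once per status). The sample rows, including the second `Duplicated` row and the `Missing` row, are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def count_busco_statuses(handle) -> dict:
    """Count distinct BUSCO IDs per status in a full_table.tsv stream."""
    ids_by_status = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and '#' comment/header lines
        if not row or row[0].startswith("#"):
            continue
        busco_id, status = row[0], row[1]
        ids_by_status[status].add(busco_id)
    return {status: len(ids) for status, ids in ids_by_status.items()}


# Hypothetical sample data in the layout shown above
sample = io.StringIO(
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n"
    "1276at7742\tComplete\t1\t12345\t15678\t+\t1234.5\t1112\n"
    "1277at7742\tDuplicated\t2\t23456\t26789\t-\t2345.6\t1045\n"
    "1277at7742\tDuplicated\t5\t1000\t4000\t+\t2001.0\t1040\n"
    "1279at7742\tMissing\n"
)
counts = count_busco_statuses(sample)
print(counts)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.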
3. missing_busco_list.tsv¶
List of missing BUSCO genes.
Format:
4. Additional Files:¶
- busco_sequences/ - FASTA files of identified BUSCOs
- single_copy_busco_sequences/ - Single-copy genes
- multi_copy_busco_sequences/ - Duplicated genes
- fragmented_busco_sequences/ - Fragmented genes
- logs/ - Execution logs
- hmmer_output/ - Raw HMMER results
Typical Size:¶
- Summary files: < 1 MB
- Full table: 1-5 MB
- Sequences directory: 10-50 MB
BUSCO_PROTEIN_LINEAGE¶
Published to: ${params.outdir}/${meta.gca}/busco_protein_lineage/
Files:¶
Same structure as BUSCO_GENOME_LINEAGE, but:
- Assesses protein sequences instead of genome
- Reports protein-level completeness
- Headers reference protein IDs instead of genomic coordinates
Key Differences:¶
- Usually higher completeness than genome (proteins already annotated)
- No genomic coordinates in output
- Faster execution (smaller input)
BUSCO_CORE_METAKEYS¶
Output: None published (direct database modification)
Database Changes:¶
Inserts/updates rows in the meta table:
-- Example entries added
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'busco.genome.version', '5.4.7'),
(1, 'busco.genome.lineage', 'vertebrata_odb10'),
(1, 'busco.genome.complete', '95.2'),
(1, 'busco.genome.complete_single', '94.1'),
(1, 'busco.genome.complete_duplicated', '1.1'),
(1, 'busco.genome.fragmented', '2.3'),
(1, 'busco.genome.missing', '2.5'),
(1, 'busco.protein.complete', '98.5'),
...
Verification:¶
OMAMER_HOG¶
Cached in: ${params.cacheDir}/${meta.gca}/omamer/
Files:¶
- proteins.omamer - OMAmer orthology assignments
Format:¶
Binary format containing:
- Protein ID to HOG (Hierarchical Orthologous Group) mappings
- Taxonomic placement information
- Orthology scores
Size:¶
- Typically 1-10 MB depending on proteome size
Notes:¶
- Not human-readable (binary format)
- Used as input for OMark
- Cached to avoid recomputation
OMARK¶
Published to: ${params.outdir}/${meta.gca}/omark_output/
Files:¶
1. omark_output.txt¶
Main quality assessment results.
Format:
# OMark Quality Assessment
Taxon: Homo sapiens (9606)
Database: LUCA.h5
Number of proteins: 20391
# Quality Metrics
Completeness: 95.8%
Consistency: 97.2%
Contamination: 0.5%
# Detailed Results
Category Count Percentage
Complete 19542 95.8%
Consistent 19832 97.2%
Inconsistent 559 2.7%
Contaminants 102 0.5%
Key Metrics:
- Completeness: % of expected orthologs present
- Consistency: Agreement between annotation and orthology
- Contamination: Potential non-target sequences
2. inconsistent.txt¶
List of proteins with inconsistent orthology.
Format:
3. contaminants.txt¶
Potential contamination candidates.
4. Additional Files:¶
- detailed_summary.txt - Extended statistics
- *.pdf - Visualization plots (if generated)
- hog_mapping.txt - HOG assignments
Typical Size:¶
- Text files: 1-10 MB
- PDF plots: 1-5 MB each
RUN_STATISTICS¶
Published to: ${params.outdir}/${meta.gca}/core_statistics/
Files:¶
Multiple SQL files containing INSERT statements:
1. gene_stats.sql¶
Gene count statistics by biotype.
Format:
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'gene.total', '20391'),
(1, 'gene.protein_coding', '19950'),
(1, 'gene.pseudogene', '15'),
(1, 'gene.lncRNA', '345'),
(1, 'gene.miRNA', '48'),
(1, 'gene.snRNA', '23'),
(1, 'gene.snoRNA', '10');
2. transcript_stats.sql¶
Transcript statistics.
Format:
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'transcript.total', '81256'),
(1, 'transcript.protein_coding', '79850'),
(1, 'transcript.avg_per_gene', '3.98');
3. coding_stats.sql¶
Coding sequence statistics.
Format:
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'coding.avg_length', '1821'),
(1, 'coding.total_bases', '36234567'),
(1, 'coding.avg_exons_per_transcript', '8.5');
4. Additional Stats:¶
- exon_stats.sql - Exon counts and lengths
- intron_stats.sql - Intron statistics
- assembly_stats.sql - Genome assembly metrics
- homology_stats.sql - Orthology/paralogy counts
SQL Structure:¶
All stats follow the pattern:
DELETE FROM meta WHERE species_id = X AND meta_key = 'stat.name';
INSERT INTO meta (species_id, meta_key, meta_value) VALUES (X, 'stat.name', 'value');
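The DELETE-then-INSERT pattern is what makes these files safe to re-run. The Python sketch below demonstrates the effect using an in-memory SQLite database as a stand-in for the MySQL core database (an assumption made purely so the example is self-contained); `upsert_statements` is a hypothetical helper.

```python
import sqlite3


def upsert_statements(species_id: int, key: str, value: str):
    """Return (sql, params) pairs following the DELETE-then-INSERT pattern."""
    return [
        ("DELETE FROM meta WHERE species_id = ? AND meta_key = ?",
         (species_id, key)),
        ("INSERT INTO meta (species_id, meta_key, meta_value) VALUES (?, ?, ?)",
         (species_id, key, value)),
    ]


# SQLite stand-in with a simplified meta table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta ("
    "meta_id INTEGER PRIMARY KEY, species_id INTEGER, "
    "meta_key TEXT, meta_value TEXT)"
)

# Apply the same statistic twice, as would happen on a pipeline re-run
for _ in range(2):
    for sql, params in upsert_statements(1, "gene.total", "20391"):
        conn.execute(sql, params)

rows = conn.execute("SELECT meta_key, meta_value FROM meta").fetchall()
print(rows)  # a single row, not two
```

Without the DELETE step, the second pass would leave a duplicate `gene.total` row in the table.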
RUN_ENSEMBL_META¶
Published to: ${params.outdir}/${meta.gca}/core_statistics/
Files:¶
1. core_meta.sql¶
Core database metadata.
Format:
-- Schema information
INSERT INTO meta (meta_key, meta_value) VALUES
('schema_type', 'core'),
('schema_version', '110');
-- Species information
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'species.production_name', 'homo_sapiens'),
(1, 'species.scientific_name', 'Homo sapiens'),
(1, 'species.common_name', 'Human'),
(1, 'species.taxonomy_id', '9606'),
(1, 'species.division', 'EnsemblVertebrates');
-- Assembly information
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'assembly.default', 'GRCh38'),
(1, 'assembly.name', 'GRCh38.p14'),
(1, 'assembly.accession', 'GCA_000001405.29');
-- Production information
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
(1, 'genebuild.method', 'import'),
(1, 'genebuild.start_date', '2023-10'),
(1, 'genebuild.version', '2023_10');
2. external_db_meta.sql¶
External database references.
Format:
POPULATE_DB¶
Output: None (executes SQL against database)
Database Changes:¶
- Executes all SQL files from RUN_STATISTICS and RUN_ENSEMBL_META
- Populates meta table with statistics and metadata
- May update other tables depending on SQL content
Verification:¶
-- Check statistics were inserted
SELECT meta_key, meta_value
FROM meta
WHERE meta_key LIKE 'gene%'
OR meta_key LIKE 'transcript%'
ORDER BY meta_key;
-- Check metadata
SELECT meta_key, meta_value
FROM meta
WHERE meta_key LIKE 'species%'
OR meta_key LIKE 'assembly%'
ORDER BY meta_key;
DB_METADATA¶
Output: Database modifications
Database Changes:¶
Updates database metadata tracking tables.
CLEANING¶
Output: File deletions
Actions:¶
- Removes files from ${params.outdir} if params.clean is true
- Removes work directories if params.clean_work_dir is true
- Preserves essential outputs and summaries
File Formats¶
FASTA Format (.fa, .fasta)¶
Used by: FETCH_GENOME, FETCH_PROTEINS, BUSCO outputs
Structure:
Notes:
- Header starts with >
- Sequence can be single-line or multi-line
- Standard nucleotide codes (A, T, G, C, N) or amino acid codes
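Because sequences may be wrapped, a reader has to join lines until the next header. The following is a minimal standard-library sketch of that logic (Biopython's `SeqIO`, used later in this guide, does the same job more robustly); `read_fasta` and the sample records are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.strip())
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 wrapped\nACGT\nNNAC\n>seq2 single\nMTEYK\n")
records = list(read_fasta(example))
print(records)
```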
Tab-Separated Values (.tsv)¶
Used by: BUSCO full tables, OMark results
Structure:
Notes:
- Tab character (\t) separates columns
- May include header row
- Comments prefixed with #
SQL Format (.sql)¶
Used by: Statistics and metadata files
Structure:
-- Comments start with --
DELETE FROM table WHERE condition;
INSERT INTO table (col1, col2) VALUES (val1, val2);
UPDATE table SET col1 = val WHERE condition;
Notes:
- Standard SQL statements
- Safe to execute multiple times (includes DELETE before INSERT)
- Species-specific (uses species_id)
YAML Format (.yml)¶
Used by: Version tracking
Structure:
Notes:
- Human-readable key-value pairs
- Hierarchical structure
- Easily parsed programmatically
Cached Outputs¶
Purpose of Caching¶
The pipeline uses the Nextflow storeDir directive for expensive operations:
- FETCH_GENOME: Avoid repeated database queries
- FETCH_PROTEINS: Avoid repeated protein extraction
- BUSCO_DATASET: Avoid repeated dataset downloads
- OMAMER_HOG: Avoid expensive orthology searches
Cache Behavior¶
- First run: Computes result and stores in cache
- Subsequent runs: Skips the process and reuses the cached files (near-instant)
- Cache invalidation: Manual deletion required if source data changes
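The behavior above amounts to "compute only if the output file is absent". The Python sketch below imitates that logic to make the consequences concrete, in particular why stale results persist until the file is deleted. It is an illustration of the semantics only, not how Nextflow implements storeDir; `cached` and `expensive_fetch` are hypothetical names.

```python
import tempfile
from pathlib import Path


def cached(path: Path, compute) -> str:
    """Return the stored result if present, otherwise compute and store it."""
    if path.exists():
        return path.read_text()
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(result)
    return result


calls = []


def expensive_fetch() -> str:
    # Stand-in for a slow step such as dumping a genome from the database
    calls.append(1)
    return ">1\nACGT\n"


with tempfile.TemporaryDirectory() as cache_dir:
    target = Path(cache_dir) / "GCA_000001405.15" / "genome" / "genome.fa"
    first = cached(target, expensive_fetch)   # computed and stored
    second = cached(target, expensive_fetch)  # served from the cache
    target.unlink()                           # manual invalidation
    third = cached(target, expensive_fetch)   # computed again

print(len(calls))  # number of times the expensive step actually ran
```

Nothing in this scheme checks whether the source data changed, which is why manual deletion is needed.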
Cache Management¶
# Check cache size
du -sh ${params.cacheDir}
# Clear specific genome cache
rm -rf ${params.cacheDir}/GCA_000001405.15/
# Clear BUSCO datasets
rm -rf ${params.cacheDir}/busco_downloads/
# Clear all OMAmer results
find ${params.cacheDir} -name "*.omamer" -delete
When to Clear Cache¶
- Source database updated
- BUSCO datasets updated
- Assembly version changed
- Proteins re-annotated
- OMAmer database updated
Database Modifications¶
Tables Modified¶
1. meta Table¶
Primary target for statistics and metadata.
Structure:
CREATE TABLE meta (
meta_id INT PRIMARY KEY AUTO_INCREMENT,
species_id INT DEFAULT 1,
meta_key VARCHAR(255) NOT NULL,
meta_value TEXT NOT NULL
);
Keys Added:
- busco.* - BUSCO metrics
- gene.* - Gene statistics
- transcript.* - Transcript statistics
- species.* - Species information
- assembly.* - Assembly information
- schema_* - Schema metadata
2. Other Tables (depending on SQL)¶
- May update attrib_type
- May insert into seq_region_attrib
- May modify production metadata
Backup Recommendation¶
Before running POPULATE_DB:
-- Backup meta table
CREATE TABLE meta_backup AS SELECT * FROM meta;
# Or full database backup (run from the shell, not the SQL client)
mysqldump -h host -u user -p dbname > backup.sql
Version Tracking¶
versions.yml Structure¶
Each module generates a versions file:
"BUSCO_GENOME_LINEAGE":
    busco: 5.4.7
"FETCH_GENOME":
    perl: 5.32.0
"OMAMER_HOG":
    omamer: 0.3.3
"OMARK":
    omark: 0.3.0
"RUN_STATISTICS":
    perl: 5.32.0
"RUN_ENSEMBL_META":
    python: 3.9.7
"POPULATE_DB":
    mysql: 8.0.31
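For quick checks without a YAML library, the two-level module/tool layout can be read with a small hand-written parser. This is a sketch that assumes only the simple structure shown here (a module line with no value, followed by `tool: version` lines); for anything more complex, a real YAML parser such as PyYAML is the safer choice. `parse_versions` is a hypothetical helper.

```python
def parse_versions(text: str) -> dict:
    """Parse the simple two-level versions.yml layout into nested dicts."""
    versions, module = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().strip('"'), value.strip()
        if not value:
            # A key with no value starts a new module section
            module = key
            versions[module] = {}
        elif module is not None:
            versions[module][key] = value
    return versions


sample = (
    '"BUSCO_GENOME_LINEAGE":\n'
    "    busco: 5.4.7\n"
    '"OMARK":\n'
    "    omark: 0.3.0\n"
)
parsed = parse_versions(sample)
print(parsed["BUSCO_GENOME_LINEAGE"]["busco"], parsed["OMARK"]["omark"])
```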
Collecting All Versions¶
# Combine all version files
find ${params.outdir} -name "versions.yml" -exec cat {} \; > all_versions.yml
# Or from the Nextflow work directory (bash needs globstar for ** to recurse)
shopt -s globstar
cat work/**/versions.yml > complete_versions.yml
Version Verification¶
# Check for version consistency across samples
grep "busco:" ${params.outdir}/*/versions.yml | sort | uniq
# Verify against requirements
diff requirements.txt all_versions.yml
Interpreting Results¶
BUSCO Completeness Guidelines¶
Genome Assessment:
- > 95% Complete: Excellent quality
- 90-95% Complete: Good quality
- 80-90% Complete: Acceptable quality
- < 80% Complete: Poor quality or wrong lineage
Expected Duplicated:
- < 5%: Normal diploid genome
- 5-15%: Recent duplication or polyploidy
- > 15%: High heterozygosity or contamination
Fragmented:
- < 5%: Good assembly contiguity
- > 10%: Fragmented assembly, consider improvement
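The completeness and duplication thresholds can be encoded as simple lookup functions for automated triage. The sketch below is one possible reading of the guidelines: how exact boundary values (for example, exactly 95%) are assigned is an assumption, and the fragmented bands are left out because the guidelines do not cover the 5-10% range. `rate_completeness` and `rate_duplication` are hypothetical names.

```python
def rate_completeness(complete_pct: float) -> str:
    """Map BUSCO genome completeness (%) to the guideline category."""
    if complete_pct > 95:
        return "excellent"
    if complete_pct >= 90:
        return "good"
    if complete_pct >= 80:
        return "acceptable"
    return "poor quality or wrong lineage"


def rate_duplication(duplicated_pct: float) -> str:
    """Map BUSCO duplicated (%) to the guideline interpretation."""
    if duplicated_pct < 5:
        return "normal diploid genome"
    if duplicated_pct <= 15:
        return "recent duplication or polyploidy"
    return "high heterozygosity or contamination"


# Values from the example short summary (C:95.2%, D:1.1%)
print(rate_completeness(95.2), "/", rate_duplication(1.1))
```

These categories are rules of thumb; lineage choice and genome biology should still be considered before acting on them.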
OMark Quality Guidelines¶
Completeness:
- > 95%: Excellent annotation
- 90-95%: Good annotation
- < 90%: Review annotation quality
Consistency:
- > 95%: High consistency
- 90-95%: Acceptable
- < 90%: Review inconsistent proteins
Contamination:
- < 1%: Clean
- 1-5%: Minor contamination
- > 5%: Significant contamination, investigate
Gene Statistics Interpretation¶
Protein-coding gene counts:
- Human: ~20,000
- Mouse: ~22,000
- Drosophila: ~13,000
- C. elegans: ~20,000
Transcripts per gene:
- < 2: Limited alternative splicing
- 2-4: Normal alternative splicing
- > 5: High alternative splicing (vertebrates)
Coding length:
- Average protein: 300-500 amino acids
- Average CDS: 900-1500 bases
Output Checklist¶
After pipeline completion, verify:
- [ ] All expected sample directories exist in ${params.outdir}
- [ ] BUSCO results show reasonable completeness
- [ ] OMark results don't show excessive contamination
- [ ] Statistics SQL files exist and contain data
- [ ] versions.yml files present for all modules
- [ ] No empty output files (check file sizes)
- [ ] Database meta table populated (if POPULATE_DB ran)
- [ ] Execution reports generated in pipeline_info/
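Parts of this checklist can be automated. The Python sketch below checks one sample directory for the expected entries, for at least one SQL file, and for empty files. The list of expected names is taken from the directory layout in this guide; `check_sample` is a hypothetical helper, and the demo builds a deliberately incomplete directory in a temporary location.

```python
import tempfile
from pathlib import Path

# Entries expected in each per-genome directory (from the layout above)
EXPECTED = [
    "busco_genome_lineage",
    "busco_protein_lineage",
    "omark_output",
    "core_statistics",
    "versions.yml",
]


def check_sample(sample_dir: Path) -> list:
    """Return a list of human-readable problems found in one sample dir."""
    problems = []
    for name in EXPECTED:
        if not (sample_dir / name).exists():
            problems.append(f"missing: {name}")
    stats_dir = sample_dir / "core_statistics"
    if stats_dir.is_dir() and not list(stats_dir.glob("*.sql")):
        problems.append("no SQL files in core_statistics/")
    for path in sample_dir.rglob("*"):
        if path.is_file() and path.stat().st_size == 0:
            problems.append(f"empty file: {path.relative_to(sample_dir)}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "GCA_000001405.29"
    for sub in ("busco_genome_lineage", "busco_protein_lineage", "core_statistics"):
        (sample / sub).mkdir(parents=True)
    (sample / "core_statistics" / "gene_stats.sql").write_text("INSERT ...;\n")
    (sample / "versions.yml").write_text("")  # deliberately empty
    problems = check_sample(sample)

print(problems)
```

An empty list means the file-level checks passed; completeness, contamination, and database checks still need the inspections described above.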
Accessing Results Programmatically¶
Python Example¶
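import glob
import yaml
from Bio import SeqIO

# Placeholders: set these to match your run
outdir = "results"                 # value of params.outdir
gca = "GCA_000001405.29"           # assembly accession (meta.gca)
production_name = "homo_sapiens"   # species production name

# Read BUSCO results (open() does not expand wildcards, so glob first)
busco_dir = f"{outdir}/{gca}/busco_genome_lineage"
for summary in glob.glob(f"{busco_dir}/short_summary.*.txt"):
    with open(summary) as f:
        for line in f:
            # The results line may be indented, so strip before matching
            if line.strip().startswith("C:"):
                print(f"BUSCO metrics: {line.strip()}")

# Read protein sequences
proteins = f"{outdir}/{gca}/{production_name}.fa"
seqs = list(SeqIO.parse(proteins, "fasta"))
print(f"Protein count: {len(seqs)}")

# Read versions (an empty versions.yml loads as None)
version_files = glob.glob(f"{outdir}/**/versions.yml", recursive=True)
versions = {}
for vf in version_files:
    with open(vf) as f:
        versions.update(yaml.safe_load(f) or {})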
SQL Example¶
-- Query all statistics
SELECT
meta_key,
meta_value
FROM meta
WHERE species_id = 1
AND (meta_key LIKE 'gene.%'
OR meta_key LIKE 'transcript.%'
OR meta_key LIKE 'busco.%')
ORDER BY meta_key;
-- Compare BUSCO genome vs protein
-- Group on the last key component (e.g. 'complete') so the genome and
-- protein values for the same metric land on one row
SELECT
    SUBSTRING_INDEX(meta_key, '.', -1) AS metric,
    MAX(CASE WHEN meta_key LIKE 'busco.genome.%' THEN meta_value END) AS genome_value,
    MAX(CASE WHEN meta_key LIKE 'busco.protein.%' THEN meta_value END) AS protein_value
FROM meta
WHERE meta_key LIKE 'busco.%complete'
GROUP BY metric;
Bash Example¶
#!/bin/bash
# Summary report
# Arguments: <outdir> <gca>
outdir=$1
gca=$2
echo "=== Statistics Summary for ${gca} ==="
# BUSCO genome (the glob stays outside the quotes so the shell expands it)
echo "BUSCO Genome Completeness:"
grep -h "C:" "${outdir}/${gca}"/busco_genome_lineage/short_summary.*.txt | head -1
# BUSCO protein
echo "BUSCO Protein Completeness:"
grep -h "C:" "${outdir}/${gca}"/busco_protein_lineage/short_summary.*.txt | head -1
# OMark
omark="${outdir}/${gca}/omark_output/omark_output.txt"
echo "OMark Quality:"
grep -E "Completeness|Consistency|Contamination" "$omark"
# Gene count from SQL (grep -P requires GNU grep)
# The value sits on its own row, e.g. (1, 'gene.total', '20391'),
sql_file="${outdir}/${gca}/core_statistics/gene_stats.sql"
echo "Gene Count:"
grep -oP "'gene\.total', '\K[^']+" "$sql_file"
Archiving and Delivery¶
Essential Files for Archive¶
Minimum files to preserve:
${gca}/
├── busco_genome_lineage/short_summary.*.txt
├── busco_protein_lineage/short_summary.*.txt
├── omark_output/omark_output.txt
├── core_statistics/*.sql
└── versions.yml
Full Archive¶
Complete output for reproducibility:
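# Create tarball (pipeline_info/ lives directly under the output directory)
tar -czf ${gca}_statistics.tar.gz \
    ${params.outdir}/${gca}/ \
    ${params.outdir}/pipeline_info/
# Verify the archive is readable and count its entries
tar -tzf ${gca}_statistics.tar.gz | wc -l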
Delivery to Database Team¶
Provide:
1. SQL files in core_statistics/
2. BUSCO summaries (PDF or screenshots)
3. OMark assessment
4. versions.yml
5. Execution report
Quick Reference¶
| Module | Primary Output | Location | Format |
|---|---|---|---|
| FETCH_GENOME | genome.fa | outdir/${gca}/ | FASTA |
| FETCH_PROTEINS | ${production_name}.fa | outdir/${gca}/ | FASTA |
| BUSCO_GENOME_LINEAGE | short_summary.*.txt | outdir/${gca}/busco_genome_lineage/ | TXT |
| BUSCO_PROTEIN_LINEAGE | short_summary.*.txt | outdir/${gca}/busco_protein_lineage/ | TXT |
| OMAMER_HOG | proteins.omamer | cacheDir/${gca}/omamer/ | Binary |
| OMARK | omark_output.txt | outdir/${gca}/omark_output/ | TXT |
| RUN_STATISTICS | *.sql | outdir/${gca}/core_statistics/ | SQL |
| RUN_ENSEMBL_META | *.sql | outdir/${gca}/core_statistics/ | SQL |
| All modules | versions.yml | outdir/${gca}/ | YAML |