Run Statistics¶
Overview¶
The RUN_STATISTICS process generates comprehensive gene annotation statistics for Ensembl species homepages, computing counts and metrics for genes, transcripts, proteins, and other genomic features.
Process Details¶
- Label: fetch_file
- Tag: Uses genome assembly accession (meta.gca)
- Publish Directory: ${params.outdir}/${meta.gca}
- Max Forks: 20 (limits parallel execution)
Inputs¶
| Name | Type | Description |
|---|---|---|
| meta | val | Metadata map containing genome assembly and database connection details |
Required Metadata Fields¶
- gca: Genome assembly accession
- dbname: Database name
- production_name: Species production name
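As a minimal sketch of how the required fields and the publish directory pattern fit together, the Python below checks a metadata map for the three fields and derives the `${params.outdir}/${meta.gca}` path. The function names and the example values are illustrative assumptions, not part of the pipeline.

```python
from pathlib import PurePosixPath
from typing import List

# Field names taken from the "Required Metadata Fields" list.
REQUIRED_META_FIELDS = ("gca", "dbname", "production_name")


def missing_meta_fields(meta: dict) -> List[str]:
    """Return the required fields that are absent or empty in `meta`."""
    return [field for field in REQUIRED_META_FIELDS if not meta.get(field)]


def publish_dir(outdir: str, meta: dict) -> str:
    """Mirror the ${params.outdir}/${meta.gca} publish directory pattern."""
    missing = missing_meta_fields(meta)
    if missing:
        raise ValueError(f"meta is missing required fields: {', '.join(missing)}")
    return str(PurePosixPath(outdir) / meta["gca"])


# Hypothetical example values, for illustration only.
example_meta = {
    "gca": "GCA_000001405.29",
    "dbname": "example_core_db",
    "production_name": "homo_sapiens",
}
print(publish_dir("results", example_meta))
```

Checking the map before the process starts would surface a missing field as a clear error rather than as a failed database connection later on.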
Outputs¶
| Channel | Type | Description |
|---|---|---|
| statistics_output | tuple | Metadata and the generated SQL files from the core_statistics/ directory |
| versions_file | path | versions.yml file recording the Perl version |
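The versions.yml output can be pictured with a short Python sketch that extracts the version from `perl -v` style output and renders a small YAML mapping. The file layout shown (process name, then tool and version) is an assumption based on a common pipeline convention; the source does not show the actual file contents, and the function names are hypothetical.

```python
import re


def parse_perl_version(perl_v_output: str) -> str:
    """Extract the version from `perl -v` output, e.g. '(v5.26.2)' -> '5.26.2'."""
    match = re.search(r"\(v(\d+(?:\.\d+)*)\)", perl_v_output)
    if match is None:
        raise ValueError("could not find a Perl version string")
    return match.group(1)


def versions_yml(process_name: str, perl_version: str) -> str:
    """Render a YAML mapping: process name -> tool -> version (assumed layout)."""
    return f'"{process_name}":\n    perl: {perl_version}\n'


# Sample text in the format printed by `perl -v`.
sample = (
    "This is perl 5, version 26, subversion 2 (v5.26.2) "
    "built for x86_64-linux-thread-multi"
)
print(versions_yml("RUN_STATISTICS", parse_perl_version(sample)), end="")
```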
Parameters¶
Required¶
- params.outdir: Output directory for results
- params.enscode: Path to Ensembl code repository
- params.host: Database host
- params.port: Database port
Optional¶
- params.files_latency: Sleep delay applied after the script finishes, through the afterScript directive, to allow for file system latency
Script Details¶
The process:
1. Executes the generate_species_homepage_stats.pl Perl script
2. Connects to the specified Ensembl core database
3. Computes comprehensive annotation statistics including:
    - Gene counts (by biotype)
    - Transcript counts
    - Protein-coding gene metrics
    - Alternative splicing statistics
    - Non-coding RNA counts
    - Other genomic feature counts
4. Generates SQL files for inserting statistics into the database
5. Outputs results to the core_statistics/ subdirectory
6. Captures Perl version information
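Steps 4 and 5 above can be sketched in Python as follows. This is not the Perl script's implementation: the target table (the core schema's `meta` table, with `species_id`, `meta_key`, and `meta_value` columns), the statistic key names, and the `<dbname>.sql` file name are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def sql_literal(value) -> str:
    """Quote a value as an SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def stats_to_sql(stats: dict, species_id: int = 1) -> str:
    """Render one INSERT per statistic (table and key naming are assumptions)."""
    lines = [
        "INSERT INTO meta (species_id, meta_key, meta_value) "
        f"VALUES ({int(species_id)}, {sql_literal(key)}, {sql_literal(value)});"
        for key, value in sorted(stats.items())
    ]
    return "\n".join(lines) + "\n"


def write_stats_sql(outdir: Path, dbname: str, stats: dict) -> Path:
    """Write the statements to <outdir>/core_statistics/<dbname>.sql (hypothetical name)."""
    target_dir = outdir / "core_statistics"
    target_dir.mkdir(parents=True, exist_ok=True)
    sql_path = target_dir / f"{dbname}.sql"
    sql_path.write_text(stats_to_sql(stats))
    return sql_path


# Hypothetical statistic keys and values, for illustration only.
example_stats = {"example.coding_genes": 20000, "example.transcripts": 85000}
with tempfile.TemporaryDirectory() as tmp:
    path = write_stats_sql(Path(tmp), "example_core_db", example_stats)
    sql_text = path.read_text()
print(sql_text, end="")
```

Writing statements to a file instead of executing them keeps the process read-only against the database, which matches the read access listed under Dependencies.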
Dependencies¶
- Perl 5
- generate_species_homepage_stats.pl script (from ensembl-genes/src/perl/ensembl/genes/)
- Ensembl Perl API
- Database read access
Generated Statistics¶
The SQL files typically include statistics for:
Gene Metrics¶
- Total gene count
- Protein-coding genes
- Pseudogenes
- Non-coding RNA genes (by type: lncRNA, miRNA, snRNA, etc.)
Transcript Metrics¶
- Total transcript count
- Protein-coding transcripts
- Alternative splicing frequency
Protein Metrics¶
- Protein-coding sequences
- Average protein length
- Protein domains
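To make the gene, transcript, and protein metrics above concrete, here is a small Python sketch that computes a few of them from toy records. The data, the record layout, and the definition of alternative splicing frequency used here (the fraction of genes with more than one transcript) are assumptions; the Perl script may define and compute these differently.

```python
from collections import Counter
from statistics import mean

# Toy records (hypothetical): gene ID, biotype, and one entry per transcript
# giving its protein length, with None marking a non-coding transcript.
genes = [
    ("g1", "protein_coding", [350, 340, None]),
    ("g2", "protein_coding", [120]),
    ("g3", "lncRNA", [None, None]),
    ("g4", "pseudogene", [None]),
]


def summarise(records):
    """Compute a handful of gene, transcript, and protein metrics."""
    biotype_counts = Counter(biotype for _, biotype, _ in records)
    transcripts_per_gene = [len(transcripts) for _, _, transcripts in records]
    protein_lengths = [
        length
        for _, _, transcripts in records
        for length in transcripts
        if length is not None
    ]
    multi_transcript = sum(1 for count in transcripts_per_gene if count > 1)
    return {
        "total_genes": len(records),
        "genes_by_biotype": dict(biotype_counts),
        "total_transcripts": sum(transcripts_per_gene),
        "coding_transcripts": len(protein_lengths),
        "fraction_multi_transcript_genes": multi_transcript / len(records),
        "average_protein_length": mean(protein_lengths),
    }


summary = summarise(genes)
for name, value in summary.items():
    print(f"{name}: {value}")
```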
Other Features¶
- Alternative sequences
- Gene predictions
- Imported annotations
Notes¶
- Results are published to a genome-specific subdirectory
- Maximum of 20 concurrent processes to prevent database overload
- The process includes a configurable sleep delay after completion to handle file system latency
- Generated SQL files are designed for the Ensembl core database schema
- Statistics are used to populate species homepage displays in the Ensembl browser
- The process can be run against production or staging databases
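The effect of the 20-process cap noted above can be illustrated with a generic Python sketch: a worker pool bounded at 20 never has more than 20 tasks in flight, however many are queued. This models the idea only; it is not how Nextflow schedules tasks, and `fake_task` is a hypothetical stand-in for one process run.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_FORKS = 20  # mirrors the process's Max Forks setting

lock = threading.Lock()
running = 0  # tasks currently in flight
peak = 0     # highest number of overlapping tasks observed


def fake_task(_):
    """Stand-in for one process run; records how many tasks overlap."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.01)  # pretend to query the database
    with lock:
        running -= 1


# Queue 100 tasks, but allow at most MAX_FORKS to run at the same time.
with ThreadPoolExecutor(max_workers=MAX_FORKS) as pool:
    list(pool.map(fake_task, range(100)))

print(f"peak concurrency: {peak}")
```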