Run Ensembl Meta

Overview

The RUN_ENSEMBL_META process generates SQL files containing Ensembl core database metadata, including schema information, species details, and production metadata required for Ensembl release databases.

Process Details

  • Label: python
  • Tag: Uses genome assembly accession (meta.gca)
  • Publish Directory: ${params.outdir}/${meta.gca}

Inputs

Name  Type  Description
meta  val   Metadata map containing genome assembly and database connection details

Required Metadata Fields

  • gca: Genome assembly accession
  • dbname: Database name
  • production_name: Species production name
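
As a rough illustration of the three required fields, the sketch below checks a metadata map before it would be handed to the process. It is written in Python for convenience; the helper name `missing_meta_fields` and all example values are hypothetical and are not part of the pipeline.

```python
# Illustrative only: the field names come from the list above; the helper
# name and the example values are placeholders, not pipeline code.
REQUIRED_META_FIELDS = ("gca", "dbname", "production_name")


def missing_meta_fields(meta):
    """Return the required field names that are absent or empty in `meta`."""
    return [name for name in REQUIRED_META_FIELDS if not meta.get(name)]


example_meta = {
    "gca": "GCA_000000000.1",            # placeholder accession
    "dbname": "example_core_db",         # placeholder database name
    "production_name": "example_species",  # placeholder production name
}

missing = missing_meta_fields(example_meta)
if missing:
    raise ValueError(f"meta is missing required fields: {missing}")
```

A check of this kind fails early, before any database connection is attempted.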

Outputs

Channel              Type   Description
ensembl_meta_output  tuple  Metadata and generated SQL files (*.sql)
versions_file        path   versions.yml file tracking Python version
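
The exact layout of versions.yml is not shown here. The sketch below assumes the layout commonly used by Nextflow modules (process name as the top-level key, one tool version per line); the function name `render_versions_yml` is hypothetical and the real file may differ.

```python
import platform


def render_versions_yml(process_name="RUN_ENSEMBL_META"):
    """Render a versions.yml body recording the running Python version.

    Assumed layout: a quoted process name as the key, followed by an
    indented `python: <version>` entry.
    """
    return f'"{process_name}":\n    python: {platform.python_version()}\n'


print(render_versions_yml(), end="")
```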

Parameters

Required

  • params.outdir: Output directory for results
  • params.enscode: Path to Ensembl code repository
  • params.host: Database host
  • params.port: Database port
  • params.team: Team identifier for metadata attribution

Optional

  • params.files_latency: Delay applied after script execution to allow for file system latency (default handled by afterScript)

Script Details

The process:

  1. Executes core_meta_data.py Python script from the Ensembl genes repository
  2. Connects to the specified core database
  3. Generates SQL files with metadata insertions/updates for:
       • Schema version information
       • Species metadata (taxonomy, assembly, etc.)
       • Production database references
       • Team attribution
  4. Outputs SQL files to core_statistics/ subdirectory
  5. Creates symbolic links to SQL files in the publish directory
  6. Captures Python version information
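
Steps 4 and 5 can be pictured with the following minimal Python sketch, which links every *.sql file found in a core_statistics/ directory into a target directory. It is an assumption about the general mechanism, not the module's actual script; `link_sql_files` and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def link_sql_files(stats_dir, target_dir):
    """Create a symbolic link in `target_dir` for each *.sql file in `stats_dir`."""
    target_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for sql_file in sorted(stats_dir.glob("*.sql")):
        link = target_dir / sql_file.name
        # Replace a stale link or file so the function can be re-run safely.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(sql_file.resolve())
        links.append(link)
    return links


# Demo in a throwaway directory; the file name is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    stats_dir = work_dir / "core_statistics"
    stats_dir.mkdir()
    (stats_dir / "example_meta.sql").write_text("-- placeholder SQL\n")
    created = link_sql_files(stats_dir, work_dir / "published")
    print([link.name for link in created])
```

Linking rather than copying keeps a single copy of each SQL file on disk while still exposing it at the expected location.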

Dependencies

  • Python 3
  • core_meta_data.py script (from ensembl-genes/src/python/ensembl/genes/metadata/)
  • Ensembl Python libraries
  • Database read access

Generated SQL Content

The SQL files typically include INSERT/UPDATE statements for the meta table with keys such as:

  • schema_version
  • schema_type
  • assembly.default
  • species.production_name
  • species.taxonomy_id
  • species.scientific_name
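
To make the shape of such statements concrete, the Python sketch below renders INSERT statements for a few of the keys above. It assumes the standard Ensembl core meta table columns (species_id, meta_key, meta_value); the statements actually emitted by core_meta_data.py are not shown in this page and may differ. The helper names are hypothetical, and the quoting is deliberately simplistic and not suitable for untrusted input.

```python
def sql_quote(value):
    """Quote a value as a SQL string literal (doubles single quotes only)."""
    return "'" + str(value).replace("'", "''") + "'"


def meta_insert(meta_key, meta_value, species_id=1):
    """Render one INSERT for the meta table; species_id=None renders as NULL."""
    species = "NULL" if species_id is None else str(int(species_id))
    return (
        "INSERT INTO meta (species_id, meta_key, meta_value) VALUES "
        f"({species}, {sql_quote(meta_key)}, {sql_quote(meta_value)});"
    )


# Example values chosen for illustration only.
statements = [
    meta_insert("schema_type", "core", species_id=None),
    meta_insert("species.production_name", "homo_sapiens"),
    meta_insert("species.taxonomy_id", 9606),
    meta_insert("species.scientific_name", "Homo sapiens"),
]
print("\n".join(statements))
```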

Notes

  • Results are published to a genome-specific subdirectory
  • The process includes a configurable sleep delay after completion to handle file system latency
  • Generated SQL files can be executed using the POPULATE_DB process
  • The core_statistics/ directory is created as an output subdirectory
  • Symbolic links ensure SQL files are accessible in the publish directory