BUSCO Core Metakeys
Overview
The BUSCO_CORE_METAKEYS process patches BUSCO metadata into an Ensembl core database by parsing BUSCO summary files and inserting metakeys directly into the database.
Process Details
- Label: python
- Tag: Uses genome assembly accession (meta.gca)
- Cache: Disabled because the process writes to the database
- Publish Directory: ${params.outdir}/${meta.gca}
- Conditional Execution: Only runs when params.apply_busco_metakeys is true
Inputs
| Name | Type | Description |
|---|---|---|
| meta | val | Metadata map containing genome assembly info and database details |
| summary_file | path | BUSCO summary file to parse for metakeys |
Required Metadata Fields
- gca: Genome assembly accession
- dbname: Database name
- species_id: Species identifier in the database
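As an illustration of the required fields above, the following minimal sketch checks a metadata map before it would be handed to the process. The helper name `missing_meta_fields` and the example values are hypothetical and not part of the pipeline; only the three field names come from this page.

```python
# Field names taken from the "Required Metadata Fields" list.
REQUIRED_META_FIELDS = ("gca", "dbname", "species_id")


def missing_meta_fields(meta: dict) -> list[str]:
    """Return the required fields that are absent or empty in the meta map."""
    return [key for key in REQUIRED_META_FIELDS if meta.get(key) in (None, "")]


# Example metadata map; all values are made up for illustration.
meta = {"gca": "GCA_000000000.1", "dbname": "example_core_db", "species_id": 1}

missing = missing_meta_fields(meta)
if missing:
    raise ValueError(f"meta is missing required fields: {', '.join(missing)}")
```

A check like this fails early, before any database connection is attempted, which is useful for a process whose task caching is disabled.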
Outputs
| Channel | Type | Description |
|---|---|---|
| metakey_json | tuple val(meta), path | BUSCO metakey JSON file published to ${params.outdir}/${meta.gca} |
| versions_file | path | Versions YAML file tracking Python version |
Parameters
Required
- params.apply_busco_metakeys: Boolean flag to enable/disable process execution
- params.outdir: Output directory for results
- params.host: Database host
- params.port: Database port
- params.user: Database username
- params.password: Database password
Optional
- params.files_latency: Delay applied after script execution (the default is handled by afterScript)
Script Details
The process:
1. Executes busco_metakeys_patch.py with database connection parameters
2. Parses the BUSCO summary file
3. Creates a BUSCO metakey JSON file in the task working directory
4. Inserts metakeys into the specified Ensembl core database
5. Runs the query directly against the database (-run_query true)
6. Publishes the JSON file to ${params.outdir}/${meta.gca}
7. Generates a versions file tracking the Python version used
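The parsing and JSON steps above can be sketched roughly as follows. This is not the code of busco_metakeys_patch.py: it assumes the summary file contains the usual one-line BUSCO score string (for example `C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:255`), and the `example.busco` key prefix, the function names, and the sample scores are hypothetical.

```python
import json
import re

# Matches the one-line BUSCO score summary, e.g.
# C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:255
SUMMARY_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text: str) -> dict:
    """Extract the BUSCO percentages and the total BUSCO count from summary text."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found in summary text")
    scores = {name: float(value) for name, value in match.groupdict().items()}
    scores["total"] = int(match.group("total"))  # n is a count, not a percentage
    return scores


def to_metakeys(scores: dict, prefix: str = "example.busco") -> dict:
    """Map parsed scores to meta_key -> meta_value pairs (key names are hypothetical)."""
    return {f"{prefix}.{name}": str(value) for name, value in scores.items()}


# Made-up score line standing in for the contents of a summary file.
example = "\tC:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:255\t"
metakeys = to_metakeys(parse_busco_summary(example))
print(json.dumps(metakeys, indent=2))
```

Keeping parsing separate from key naming makes it easy to swap in the real metakey names once they are confirmed against the script.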
Dependencies
- Python 3
- busco_metakeys_patch.py script (from Ensembl genes repository)
- Database access credentials
Notes
- The process includes a configurable sleep delay after completion to handle file system latency
- JSON results are published to a genome-specific subdirectory
- Direct database modification requires appropriate write permissions
- Nextflow task caching is disabled for this process because it writes BUSCO metakeys to the database
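To make the database write concrete, here is a minimal sketch of inserting metakeys into a `meta` table with the `species_id`, `meta_key`, and `meta_value` columns used by Ensembl core databases. It is an assumption about what such a patch might look like, not the actual script: an in-memory SQLite database stands in for the MySQL core database so the example runs anywhere, and the function name and sample key are hypothetical. MySQL drivers typically use `%s` placeholders instead of `?`.

```python
import sqlite3


def insert_metakeys(conn, species_id: int, metakeys: dict) -> int:
    """Insert one meta row per metakey and return the number of rows written."""
    rows = [(species_id, key, value) for key, value in metakeys.items()]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO meta (species_id, meta_key, meta_value) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


# In-memory stand-in for the core database (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meta ("
    "meta_id INTEGER PRIMARY KEY, species_id INTEGER, "
    "meta_key TEXT, meta_value TEXT)"
)

inserted = insert_metakeys(conn, 1, {"example.busco.complete": "98.5"})
stored = conn.execute("SELECT species_id, meta_key, meta_value FROM meta").fetchall()
```

Wrapping the inserts in a single transaction means a failure part-way through leaves the table unchanged, which matters because the process is not cached and may be re-run.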