Fetch Proteins
Overview
The FETCH_PROTEINS process retrieves protein translations from an Ensembl core database, generating FASTA files of protein sequences for downstream analysis.
Process Details
- Label: fetch_file
- Tag: Uses genome assembly accession (meta.gca)
- Store Directory: ${params.cacheDir}/${meta.gca}/translations/
- Max Forks: 20 (limits parallel execution)
Inputs
| Name | Type | Description |
|---|---|---|
| meta | val | Metadata map containing genome assembly and database connection details |
Required Metadata Fields
- gca: Genome assembly accession
- dbname: Database name containing protein translations
- production_name: Species production name
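As a rough illustration of the three required fields, the sketch below checks a metadata map before it is handed to the process. It is written in Python purely for illustration and is not part of the pipeline; the helper name `missing_meta_fields` and the placeholder values are assumptions, not names taken from the module.

```python
# Required keys as listed in this section.
REQUIRED_META_FIELDS = ("gca", "dbname", "production_name")


def missing_meta_fields(meta):
    """Return the required fields that are absent or empty in a metadata map."""
    return [field for field in REQUIRED_META_FIELDS if not meta.get(field)]


# Hypothetical metadata map; the values are placeholders, not real records.
meta = {"gca": "GCA_000000000.1", "dbname": "example_core_db"}
print(missing_meta_fields(meta))  # prints ['production_name']
```

A check like this, run before the channel is built, would surface an incomplete record early rather than as a failed database connection inside the process.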
Outputs
| Channel | Type | Description |
|---|---|---|
| proteins_file | path | FASTA file containing protein translations |
| versions_file | path | versions.yml file tracking Perl version |
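For downstream steps that consume proteins_file, a minimal FASTA reader might look like the sketch below. It assumes standard FASTA layout (a `>` header line followed by one or more sequence lines) and keeps only the first whitespace-delimited token of each header as the identifier; the function name `read_fasta` is hypothetical and the exact header format written by the dump script is not specified here.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a dict mapping each header ID to its sequence."""
    records = {}
    current_id = None
    chunks = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Store the record collected so far before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    # Store the final record.
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records
```

Calling `read_fasta` on the emitted file and taking `len()` of the result gives a quick count of the translations retrieved, which can serve as a sanity check before further analysis.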
Parameters
Required
- params.cacheDir: Cache directory for storing protein files
- params.enscode: Path to Ensembl code repository
- params.host: Database host
- params.port: Database port
- params.production_name: Default production name (can be overridden by metadata)
Optional
- params.files_latency: Delay applied after script execution through the afterScript directive, to allow for file system latency
Script Details
The process:
1. Executes dump_translations.pl Perl script from the Ensembl genes repository
2. Connects to the specified Ensembl core database
3. Extracts all protein translations
4. Outputs a FASTA file named ${meta.production_name}.fa
5. Generates a versions file tracking the Perl version used
Dependencies
- Perl 5
- dump_translations.pl script (from ensembl-genes/src/perl/ensembl/genes/)
- Ensembl Perl API
- Database read access
Storage Strategy
Uses the storeDir directive to cache results permanently, keyed by genome assembly accession, so that the same genome is not queried from the database again.
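The caching behaviour can be pictured with the following Python sketch. It only mimics the general idea behind storeDir (skip the work when the declared outputs already exist in the store directory) and is not how Nextflow implements it. The helper names are hypothetical, and the assumption that both the FASTA file and versions.yml must be present follows from the outputs table rather than from the module source.

```python
from pathlib import Path


def store_dir(cache_dir, gca):
    """Build the per-assembly store directory: <cacheDir>/<gca>/translations."""
    return Path(cache_dir) / gca / "translations"


def needs_fetch(cache_dir, meta):
    """Return True when any expected output is missing from the store directory."""
    target = store_dir(cache_dir, meta["gca"])
    # Assumed outputs: the FASTA file named after the production name, plus versions.yml.
    expected = [f"{meta['production_name']}.fa", "versions.yml"]
    return not all((target / name).is_file() for name in expected)
```

Because the directory is keyed only by meta.gca, two records sharing an accession resolve to the same location, which is what lets a second run reuse the first run's files.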
Notes
- Results are cached permanently to reduce database load
- The process includes a configurable sleep delay after completion to handle file system latency
- Maximum of 20 concurrent processes to prevent database overload
- Output filename is based on the production name for consistency