Ensembl Genomio Pipelines:
Genomio prepare pipeline
Module [Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf]
Genome prepare pipeline for BRC/Metazoa
Description
Retrieve data for a genome from INSDC and prepare the following files in a separate folder for each genome:
- FASTA for DNA sequences
- FASTA for protein sequences
- GFF gene models
- JSON functional annotation
- JSON seq_region
- JSON genome
- JSON manifest
The JSON files follow the schemas defined in the src/python/ensembl/io/genomio/data/schemas folder.
These files can then be fed to the Genome loader pipeline.
How to run
```
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_prepare_conf \
    --host $HOST --port $PORT --user $USER --pass $PASS \
    --hive_force_init 1 \
    --pipeline_dir temp/prepare \
    --data_dir $INPUT \
    --output_dir $OUTPUT \
    ${OTHER_OPTIONS}
```
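
Before initialising the pipeline, it can help to confirm that the directory passed as `--data_dir` actually contains genome JSON files. The following is a minimal sketch of such a pre-flight check. It is not part of the pipeline: `count_genome_json` is a hypothetical helper, the default path `./genomes_json` is a placeholder, and the `.json` file extension is an assumption about how the input files are named.

```shell
# Hypothetical helper (not part of ensembl-genomio): count the *.json files
# directly inside a directory. The .json extension is an assumption.
count_genome_json() {
    dir=$1
    count=0
    for f in "$dir"/*.json; do
        [ -e "$f" ] || continue   # skip the unexpanded glob when nothing matches
        count=$((count + 1))
    done
    echo "$count"
}

# Placeholder default; in practice INPUT is the directory given to --data_dir.
INPUT="${INPUT:-./genomes_json}"

n=$(count_genome_json "$INPUT")
if [ "$n" -eq 0 ]; then
    echo "No genome JSON files found in $INPUT" >&2
else
    echo "Found $n genome JSON file(s) in $INPUT"
fi
```

This only counts files. It does not validate them against the `genome.json` schema.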
Parameters
| option | default value | meaning | 
|---|---|---|
| --pipeline_name | brc4_genome_prepare | name of the hive pipeline | 
| --pipeline_dir | | temp directory for this pipeline run |
| --data_dir | | directory with JSON files for each genome to prepare, following the format set by src/python/ensembl/io/genomio/data/schemas/genome.json |
| --output_dir | | directory where the prepared files are to be stored |
| --merge_split_genes | 0 | Sometimes the gene features are split in a GFF file. Ensembl expects genes to be contiguous, so this option merges the parts into one. |
| --exclude_seq_regions | | Do not include these seq_regions (applies to all genomes; this should seldom be used) |
| --validate_gene_id | 0 | Enforce a strong gene ID pattern (replace with the GeneID if available) |
| --ensembl_mode | 0 | By default, set additional metadata for BRC genomes. With this parameter, use vanilla Ensembl metadata. |
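
The optional parameters in the table above can be collected in the `OTHER_OPTIONS` variable used by the run command. The sketch below shows one way to do this in a POSIX shell. The choice of flags and values is only illustrative, and `n_args` is a name introduced here for the demonstration.

```shell
# Illustrative selection of optional flags; the flag names come from the
# parameters table, the values chosen here are only an example.
OTHER_OPTIONS="--merge_split_genes 1 --validate_gene_id 1 --ensembl_mode 1"

# In sh/bash, the unquoted expansion is split on whitespace into separate
# arguments, which is how ${OTHER_OPTIONS} reaches init_pipeline.pl.
set -- ${OTHER_OPTIONS}
n_args=$#

echo "Extra arguments passed: $n_args"
printf '  %s\n' "$@"
```

Because the expansion relies on word splitting, this approach assumes that none of the option values contains spaces.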