Input Format
The pipeline uses a CSV file to specify samples and their metadata. This guide explains the required format and provides examples for different use cases.
CSV File Structure
Basic Format
The CSV file must contain a header row with column names, followed by one row per genome to analyze.
CSV Requirements
- Header row required: First row must contain column names
- Comma-separated: Use commas as delimiters
- No surrounding whitespace: Column values must not have leading or trailing whitespace
- One genome per row: Each row represents a single sample
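The requirements above can be checked before launching the pipeline. The following is a minimal sketch, not the pipeline's own validation: the helper name `check_csv_text` is hypothetical and only Python's standard `csv` module is used. It cannot tell whether the first row really is a header; it only checks that every row has the same number of fields as the first row and that no value carries surrounding whitespace.

```python
import csv
import io


def check_csv_text(text):
    """Return a list of problems found in CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty: a header row is required"]

    problems = []
    header = rows[0]
    # Rows are numbered from 1, so row 1 is the header.
    for row_no, row in enumerate(rows, start=1):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, found {len(row)}"
            )
        for value in row:
            if value != value.strip():
                problems.append(
                    f"row {row_no}: value {value!r} has leading/trailing whitespace"
                )
    return problems
```

Calling `check_csv_text(open("samples.csv").read())` (with `samples.csv` standing in for your own file) returns an empty list when none of these problems is found.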
Column Specifications by Workflow
Different workflows require different columns. Choose the columns based on which workflows you're running.
For BUSCO (Core Database Mode)
Required when using --run_busco_core:
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| `dbname` | string | ✅ | Ensembl core database name | `homo_sapiens_core_110_38` |
| `species_id` | integer | ✅ | Species ID in the core database | `1` |
| `busco_dataset` | string | ⚠️ * | Specific BUSCO lineage dataset (if omitted, the taxonomically closest dataset is selected) | `primates_odb12` |
| `protein_file` | string | ⚠️ | Path to protein FASTA file (if not using DB) | `/data/proteins.fa` |
| `genome_file` | string | ⚠️ | Path to genome FASTA file (if not using DB) | `/data/genome.fa` |

* Can be specified globally with the --busco_dataset parameter instead
Example CSV:
dbname,species_id,busco_dataset
homo_sapiens_core_110_38,1,primates_odb12
mus_musculus_core_110_39,1,glires_odb12
danio_rerio_core_110_11,1,actinopterygii_odb12
For BUSCO (NCBI Mode)
Required when using --run_busco_ncbi:
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| `gca` | string | ✅ | NCBI assembly accession | `GCA_000001405.29` |
| `taxon_id` | integer | ✅ | NCBI taxonomy ID | `9606` |
| `busco_dataset` | string | ⚠️ * | Specific BUSCO lineage dataset (if omitted, the taxonomically closest dataset is selected) | `primates_odb12` |

* Can be specified globally with the --busco_dataset parameter instead
Example CSV:
gca,taxon_id,busco_dataset
GCA_000001405.29,9606,primates_odb12
GCA_000001635.9,10090,glires_odb12
GCA_000002035.4,7955,actinopterygii_odb12
NCBI Mode
NCBI mode only runs BUSCO in genome mode (not protein mode). The genome assembly is automatically downloaded from NCBI.
For OMArk
Required when using --run_omark:
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| `dbname` | string | ✅ | Ensembl core database name | `xenopus_tropicalis_core_110_10` |
| `species_id` | integer | ✅ | Species ID in the core database | `1` |
| `protein_file` | string | ⚠️ | Path to protein FASTA file (if not using DB) | `/data/proteins.fa` |

Example CSV:
dbname,species_id
xenopus_tropicalis_core_110_10,1
For Ensembl Statistics
Required when using --run_ensembl_stats or --run_ensembl_beta_metakeys:
| Column | Type | Required | Description | Example |
|---|---|---|---|---|
| `dbname` | string | ✅ | Ensembl core database name | `arabidopsis_thaliana_core_110_11` |
| `species_id` | integer | ✅ | Species ID in the core database | `1` |

Example CSV:
dbname,species_id
arabidopsis_thaliana_core_110_11,1
caenorhabditis_elegans_core_110_280,1
drosophila_melanogaster_core_110_9,1
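The per-workflow requirements in the tables above can be summarised as a lookup from workflow flag to required columns. This is a minimal sketch that assumes only the columns marked ✅ are mandatory; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names for illustration and are not part of the pipeline.

```python
import csv
import io

# Columns marked as required (✅) in the tables above, keyed by workflow flag.
REQUIRED_COLUMNS = {
    "--run_busco_core": {"dbname", "species_id"},
    "--run_busco_ncbi": {"gca", "taxon_id"},
    "--run_omark": {"dbname", "species_id"},
    "--run_ensembl_stats": {"dbname", "species_id"},
    "--run_ensembl_beta_metakeys": {"dbname", "species_id"},
}


def missing_columns(csv_text, flags):
    """Return {flag: sorted missing columns} for the requested workflow flags."""
    # Only the header row is needed to decide which columns are present.
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = set(header)
    result = {}
    for flag in flags:
        missing = REQUIRED_COLUMNS[flag] - present
        if missing:
            result[flag] = sorted(missing)
    return result
```

An empty dictionary means the header covers every requested workflow. A CSV with only `dbname,species_id` would be reported as missing `gca` and `taxon_id` for `--run_busco_ncbi`.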
Complete Examples
Example 1: BUSCO Quality Control (Core DB)
Full quality assessment with both protein and genome BUSCO modes:
dbname,species_id,busco_dataset
homo_sapiens_core_110_38,1,primates_odb12
pan_troglodytes_core_110_40,1,primates_odb12
gorilla_gorilla_core_110_6,1,primates_odb12
pongo_abelii_core_110_5,1,primates_odb12
Command:
nextflow run main.nf \
--csvFile primates.csv \
--run_busco_core \
--busco_mode both \
--host mysql-server.example.com \
--user_r ensro
Example 2: BUSCO from NCBI Assemblies
Quick assessment of public genome assemblies:
gca,taxon_id,busco_dataset
GCA_000001405.29,9606,primates_odb12
GCA_011100555.1,9598,primates_odb12
GCA_008122165.1,9593,primates_odb12
Command:
Example 3: Complete Multi-Workflow Analysis
Run all quality metrics together:
dbname,species_id,busco_dataset
drosophila_melanogaster_core_110_9,1,diptera_odb12
anopheles_gambiae_core_110_56,1,diptera_odb12
aedes_aegypti_core_110_6,1,diptera_odb12
Command:
nextflow run main.nf \
--csvFile insects.csv \
--run_busco_core \
--run_omark \
--run_ensembl_stats \
--busco_mode both \
--host mysql-server.example.com \
--user_r ensro \
--enscode /software/ensembl/ENSCODE
Example 4: Mixed Lineages
Different BUSCO lineages for different species:
dbname,species_id,busco_dataset
homo_sapiens_core_110_38,1,primates_odb12
danio_rerio_core_110_11,1,actinopterygii_odb12
drosophila_melanogaster_core_110_9,1,diptera_odb12
arabidopsis_thaliana_core_110_11,1,embryophyta_odb12
saccharomyces_cerevisiae_core_110_4,1,saccharomycetes_odb12
Command:
nextflow run main.nf \
--csvFile diverse_species.csv \
--run_busco_core \
--host mysql-server.example.com \
--user_r ensro
Example 5: Using External FASTA Files
When sequences are not in a database:
dbname,species_id,busco_dataset,protein_file,genome_file
custom_genome_v1,1,vertebrata_odb12,/data/project1/proteins.fa,/data/project1/genome.fa
custom_genome_v2,1,vertebrata_odb12,/data/project2/proteins.fa,/data/project2/genome.fa
Command:
Column Details
dbname
Format: {species}_{type}_{release}_{assembly_version}
- species: Species name (lowercase, underscores)
- type: Usually core
- release: Ensembl release number
- assembly_version: Assembly version number
Examples:
- homo_sapiens_core_110_38 (Human, release 110, GRCh38)
- mus_musculus_core_110_39 (Mouse, release 110, GRCm39)
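The naming format above can be split into its four parts with a regular expression. This is a minimal sketch under the assumption that names follow the documented pattern exactly; `parse_dbname` is a hypothetical helper, and names that deviate from the pattern simply return `None`.

```python
import re

# {species}_{type}_{release}_{assembly_version}
# species: lowercase with underscores; release and assembly_version: digits.
DBNAME_PATTERN = re.compile(r"([a-z0-9_]+)_([a-z]+)_([0-9]+)_([0-9]+)")


def parse_dbname(dbname):
    """Split a core database name into its parts, or return None if it does not fit."""
    match = DBNAME_PATTERN.fullmatch(dbname)
    if match is None:
        return None
    species, db_type, release, assembly = match.groups()
    return {
        "species": species,
        "type": db_type,
        "release": int(release),
        "assembly_version": int(assembly),
    }
```

For `homo_sapiens_core_110_38` this yields species `homo_sapiens`, type `core`, release 110 and assembly version 38.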
species_id
The species ID within the core database. In most cases, this is 1.
Multi-species databases
For bacteria collections or other multi-species databases, the ID may differ. Check the meta table in your core database.
busco_dataset
The BUSCO lineage dataset name. Must match available datasets in OrthoDB v12.
Format: {lineage}_odb12
Common lineages:
- primates_odb12, mammalia_odb12, vertebrata_odb12
- actinopterygii_odb12, aves_odb12
- diptera_odb12, insecta_odb12
- embryophyta_odb12, viridiplantae_odb12
- fungi_odb12, saccharomycetes_odb12
See the BUSCO documentation for the complete list.
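A quick shape check for the `{lineage}_odb12` format might look like the sketch below. It only confirms that a name ends in `_odb12` with a non-empty lineage in front; it does not confirm that the dataset actually exists in OrthoDB v12. The helper name `split_busco_dataset` is hypothetical.

```python
BUSCO_SUFFIX = "_odb12"


def split_busco_dataset(name):
    """Return the lineage part of a {lineage}_odb12 name, or None if malformed."""
    if not name.endswith(BUSCO_SUFFIX):
        return None
    lineage = name[: -len(BUSCO_SUFFIX)]
    # Reject an empty lineage and any embedded whitespace.
    if not lineage or any(ch.isspace() for ch in lineage):
        return None
    return lineage
```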
taxon_id
The NCBI Taxonomy ID for the species.
Find taxonomy IDs at: https://www.ncbi.nlm.nih.gov/taxonomy
Examples:
- 9606 = Homo sapiens (Human)
- 10090 = Mus musculus (Mouse)
- 7955 = Danio rerio (Zebrafish)
- 7227 = Drosophila melanogaster (Fruit fly)
- 3702 = Arabidopsis thaliana (Thale cress)
gca
NCBI GenBank assembly accession.
Format: GCA_{9digits}.{version}
Examples:
- GCA_000001405.29 (Human GRCh38.p14)
- GCA_000001635.9 (Mouse GRCm39)
Find assemblies at: https://www.ncbi.nlm.nih.gov/assembly
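The accession format above, together with the numeric `taxon_id` described earlier, can be checked row by row before running NCBI mode. This is a minimal sketch with a hypothetical helper `check_ncbi_row`; it checks format only and does not confirm that an accession or taxon exists at NCBI.

```python
import re

# GCA_ followed by exactly nine digits, a dot, and a numeric version.
GCA_PATTERN = re.compile(r"GCA_[0-9]{9}\.[0-9]+")


def check_ncbi_row(gca, taxon_id):
    """Return a list of format problems for one NCBI-mode row (both values as strings)."""
    problems = []
    if not GCA_PATTERN.fullmatch(gca):
        problems.append(f"invalid GCA format: {gca!r}")
    # isascii() excludes non-ASCII digit characters that isdigit() alone would accept.
    if not (taxon_id.isascii() and taxon_id.isdigit()):
        problems.append(f"taxon_id is not numeric: {taxon_id!r}")
    return problems
```

A well-formed row such as `GCA_000001405.29,9606` produces an empty list.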
protein_file and genome_file
Full paths to FASTA files:
- protein_file: Protein sequences (canonical transcripts preferred)
- genome_file: Genome assembly (unmasked or soft-masked)
Requirements:
- Files must exist and be readable
- FASTA format (.fa, .fasta, .fna, .faa)
- Can be gzip-compressed (.gz)
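The file requirements above can be checked with the standard library before the pipeline starts. This is a minimal sketch and not how the pipeline itself validates paths; `check_fasta_path` is a hypothetical helper, and it inspects only the file name and permissions, not the file contents.

```python
import os
from pathlib import Path

# Extensions listed in the requirements, each optionally followed by .gz.
FASTA_EXTENSIONS = {".fa", ".fasta", ".fna", ".faa"}


def check_fasta_path(path_str):
    """Return a list of problems for one protein_file or genome_file value."""
    path = Path(path_str)
    suffixes = path.suffixes
    # Ignore a trailing .gz so that proteins.fa.gz is judged by its .fa suffix.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]

    problems = []
    if not suffixes or suffixes[-1] not in FASTA_EXTENSIONS:
        problems.append(f"unexpected extension: {path.name}")
    if not path.is_file():
        problems.append(f"file not found: {path}")
    elif not os.access(path, os.R_OK):
        problems.append(f"file not readable: {path}")
    return problems
```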
Validation
The pipeline validates your CSV file against a JSON schema. Common errors:
❌ Invalid GCA format
Fix: Ensure the GCA accession follows the pattern GCA_{9digits}.{version}
❌ Missing required column
Fix: Add the missing column to your CSV header
❌ Invalid taxon_id
Fix: Use numeric NCBI taxonomy IDs only
❌ Trailing spaces
Fix: Remove leading and trailing spaces from every value: "homo_sapiens_core_110_38", not "homo_sapiens_core_110_38 "
Tips and Best Practices
Start small
Test with 1-2 samples first, then scale up to your full dataset.
Use absolute paths
For file paths (protein_file, genome_file), use absolute paths to avoid issues with working directories.
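One way to follow this tip is to rewrite relative paths against a known base directory before saving the CSV. This is a minimal sketch with a hypothetical helper `absolutize_paths`; the base directory is whatever location your relative paths were written against.

```python
import os


def absolutize_paths(row, base_dir, columns=("protein_file", "genome_file")):
    """Return a copy of a CSV row (as a dict) with file columns made absolute."""
    fixed = dict(row)
    for column in columns:
        value = fixed.get(column)
        # Leave empty values and already-absolute paths untouched.
        if value and not os.path.isabs(value):
            fixed[column] = os.path.normpath(os.path.join(base_dir, value))
    return fixed
```

Rows read with `csv.DictReader` can be passed through this function and written back with `csv.DictWriter`.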
Check database connectivity first
Before running the pipeline, verify you can connect to the database:
Validate GCA accessions
Verify NCBI accessions exist before running:
Group by lineage
For better resource efficiency, group species with the same BUSCO lineage together.
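Grouping can be done by sorting the CSV rows on the `busco_dataset` column. This is a minimal sketch that assumes every row has a `busco_dataset` value; `sort_by_lineage` is a hypothetical helper, and the sort is stable, so rows within one lineage keep their original order.

```python
import csv
import io


def sort_by_lineage(csv_text):
    """Return CSV text with data rows ordered by their busco_dataset value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: row["busco_dataset"])

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```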
Schema Validation
The input CSV is validated against a JSON schema. You can view the schema:
Or validate manually:
Next Steps
- Parameters Reference - Configure pipeline behavior
- Quick Start - Run your first analysis
- Output Documentation - Understand results