BUSCO Workflow¶
BUSCO (Benchmarking Universal Single-Copy Orthologs) provides a quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content.
Overview¶
BUSCO assesses genome quality by searching for conserved orthologous genes that are expected to be present in single copy in a given lineage. The presence, absence, or duplication of these genes provides insights into assembly and annotation quality.
How BUSCO Works¶
- Lineage Selection: Choose an appropriate lineage dataset based on taxonomy
- Gene Search: Search for conserved single-copy orthologs using HMMER
- Classification: Genes are classified as Complete, Duplicated, Fragmented, or Missing
- Scoring: Generate completeness scores and detailed reports
Analysis Modes¶
Protein Mode¶
Analyzes translated protein sequences from gene predictions.
When to use:
- Assessing annotation quality
- Evaluating gene prediction completeness
- Comparing gene sets across species
Input: Protein FASTA from canonical transcripts
Advantages:
- Direct assessment of annotation quality
- Faster than genome mode
- Better for comparing gene sets
Genome Mode¶
Analyzes the genome assembly directly using AUGUSTUS for gene prediction.
When to use:
- Assessing assembly completeness
- Checking for missing genes in annotation
- Validating genome assembly quality
Input: Genome FASTA (DNA sequence)
Advantages:
- Independent of existing annotation
- Can detect genes missed by annotation pipeline
- Better for assembly quality assessment
Both Modes¶
Run both protein and genome modes for comprehensive assessment.
When to use:
- Complete quality control
- Comparing annotation vs. assembly completeness
- Production annotation validation
Provides:
- Annotation completeness (protein mode)
- Assembly completeness (genome mode)
- Comparison of both metrics
Running BUSCO¶
Using Core Database¶
# Protein mode only
nextflow run main.nf \
--csvFile genomes.csv \
--run_busco_core \
--busco_mode protein \
--host mysql-server.example.com \
--user_r ensro
# Genome mode only
nextflow run main.nf \
--csvFile genomes.csv \
--run_busco_core \
--busco_mode genome \
--host mysql-server.example.com \
--user_r ensro
# Both modes (recommended for QC)
nextflow run main.nf \
--csvFile genomes.csv \
--run_busco_core \
--busco_mode both \
--host mysql-server.example.com \
--user_r ensro
Using NCBI Assembly¶
# Automatically download and analyze genome from NCBI
nextflow run main.nf \
--csvFile ncbi_genomes.csv \
--run_busco_ncbi \
--outdir results
NCBI Mode
NCBI mode only supports genome mode analysis. The assembly is automatically downloaded using the NCBI Datasets API.
Choosing a Lineage¶
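Select the most specific lineage that applies to your organism. More specific lineages contain more marker genes (for example, 13,780 BUSCOs in primates_odb12 versus 3,354 in vertebrata_odb12), so they give a finer-grained completeness estimate.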
Decision Tree¶
Is your organism a...
├─ Eukaryote?
│ ├─ Animal (Metazoa)?
│ │ ├─ Vertebrate?
│ │ │ ├─ Mammal?
│ │ │ │ ├─ Primate? → primates_odb12
│ │ │ │ └─ Other → mammalia_odb12
│ │ │ ├─ Bird? → aves_odb12
│ │ │ ├─ Fish? → actinopterygii_odb12
│ │ │ └─ Other → vertebrata_odb12
│ │ └─ Invertebrate?
│ │ ├─ Insect?
│ │ │ ├─ Fly/Mosquito? → diptera_odb12
│ │ │ └─ Other → insecta_odb12
│ │ └─ Other → metazoa_odb12
│ ├─ Plant (Viridiplantae)?
│ │ ├─ Land plant? → embryophyta_odb12
│ │ └─ Other → viridiplantae_odb12
│ └─ Fungus?
│ ├─ Yeast? → saccharomycetes_odb12
│ └─ Other → fungi_odb12
└─ Bacteria? → bacteria_odb10
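The decision tree can be sketched as a lookup that walks from the most specific taxon to the least specific one. This is a minimal illustration, not part of the pipeline: the function name `choose_lineage`, the table `LINEAGE_BY_TAXON`, and the idea of passing the taxonomy as a list of rank names are assumptions made for the example. The dataset names are the ones used in this section, with eukaryota_odb12 added as a fallback for eukaryotes outside the listed branches.

```python
# Hypothetical helper: pick the most specific dataset named in this section.
# Entries are ordered so that a nested taxon always precedes its parent.
LINEAGE_BY_TAXON = [
    ("Primates", "primates_odb12"),
    ("Mammalia", "mammalia_odb12"),
    ("Aves", "aves_odb12"),
    ("Actinopterygii", "actinopterygii_odb12"),
    ("Vertebrata", "vertebrata_odb12"),
    ("Diptera", "diptera_odb12"),
    ("Insecta", "insecta_odb12"),
    ("Metazoa", "metazoa_odb12"),
    ("Embryophyta", "embryophyta_odb12"),
    ("Viridiplantae", "viridiplantae_odb12"),
    ("Saccharomycetes", "saccharomycetes_odb12"),
    ("Fungi", "fungi_odb12"),
    ("Eukaryota", "eukaryota_odb12"),
    ("Bacteria", "bacteria_odb10"),
]


def choose_lineage(taxonomy):
    """Return the first (most specific) dataset whose taxon appears in `taxonomy`.

    `taxonomy` is an iterable of rank names, e.g. ["Eukaryota", "Metazoa", ...].
    Returns None when no listed taxon matches.
    """
    ranks = {name.lower() for name in taxonomy}
    for taxon, dataset in LINEAGE_BY_TAXON:
        if taxon.lower() in ranks:
            return dataset
    return None


human = ["Eukaryota", "Metazoa", "Vertebrata", "Mammalia", "Primates"]
print(choose_lineage(human))
```

Keeping the table ordered from specific to general means the first match is always the narrowest dataset, which mirrors reading the tree from its leaves upward.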
Common Lineages¶
| Lineage | # BUSCOs | Description | Example Species |
|---|---|---|---|
| primates_odb12 | 13,780 | Primates | Human, chimp, gorilla |
| mammalia_odb12 | 9,226 | Mammals | Mouse, dog, cow, whale |
| aves_odb12 | 8,338 | Birds | Chicken, zebra finch |
| actinopterygii_odb12 | 3,640 | Ray-finned fish | Zebrafish, medaka |
| vertebrata_odb12 | 3,354 | Vertebrates | Any vertebrate |
| diptera_odb12 | 3,285 | Flies and mosquitoes | Drosophila, Anopheles |
| insecta_odb12 | 1,367 | Insects | Bee, beetle, butterfly |
| embryophyta_odb12 | 1,614 | Land plants | Arabidopsis, rice |
| fungi_odb12 | 758 | Fungi | Yeast, Aspergillus |
| metazoa_odb12 | 954 | Animals | Any animal |
| eukaryota_odb12 | 255 | Eukaryotes | Any eukaryote |
Understanding Results¶
BUSCO Categories¶
BUSCO classifies each searched gene into one of four categories:
| Category | Code | Description | Interpretation |
|---|---|---|---|
| Complete and single-copy (S) | S | Gene found in single copy with expected length | ✅ Good |
| Complete and duplicated (D) | D | Complete gene found more than once | ⚠️ May indicate duplication or misassembly |
| Fragmented (F) | F | Gene found but incomplete | ⚠️ Partial gene or assembly issue |
| Missing (M) | M | Gene not found | ❌ Gap in assembly/annotation |
Score Interpretation¶
Completeness Score = (S + D) / Total BUSCOs × 100
In BUSCO notation, C (Complete) is the sum of S and D, so this score is the C percentage reported in the summary line.
| Score Range | Quality | Interpretation |
|---|---|---|
| 95-100% | Excellent | High-quality assembly/annotation |
| 90-95% | Good | Acceptable for most analyses |
| 80-90% | Fair | Some missing genes; investigate |
| <80% | Poor | Significant issues; troubleshoot |
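As a worked example of the score and the rating table, the sketch below adds the complete single-copy and duplicated counts, divides by the total, and maps the result to a quality label. The function names are hypothetical, and because the published ranges share their endpoints, the choice to assign a boundary value (such as exactly 95%) to the higher band is an assumption of this example.

```python
def completeness(single_copy, duplicated, total):
    """Completeness in percent: complete BUSCOs (single-copy + duplicated) over total."""
    if total <= 0:
        raise ValueError("total must be a positive number of BUSCO groups")
    return 100.0 * (single_copy + duplicated) / total


def quality_label(score):
    """Map a completeness percentage to the labels in the table above.

    Assumption: a score on a shared boundary falls into the higher band.
    """
    if score >= 95:
        return "Excellent"
    if score >= 90:
        return "Good"
    if score >= 80:
        return "Fair"
    return "Poor"


score = completeness(single_copy=13474, duplicated=96, total=13780)
print(f"{score:.1f}% -> {quality_label(score)}")
```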
Context Matters
Scores should be interpreted in context:
- Protein mode: Lower scores may indicate annotation issues
- Genome mode: Lower scores suggest assembly gaps
- Duplications: High duplication (>5%) may indicate misassembly or polyploidy
Example Output¶
Short Summary (*_busco_short_summary.txt)¶
# BUSCO version is: 5.4.7
# The lineage dataset is: primates_odb12 (Creation date: 2024-01-08, number of genomes: 44, number of BUSCOs: 13780)
# Summarized benchmarking in BUSCO notation for file /data/proteins.fa
# BUSCO was run in mode: protein
# Gene predictor used: None
***** Results: *****
C:98.5%[S:97.8%,D:0.7%],F:0.8%,M:0.7%,n:13780
13570 Complete BUSCOs (C)
13474 Complete and single-copy BUSCOs (S)
96 Complete and duplicated BUSCOs (D)
110 Fragmented BUSCOs (F)
100 Missing BUSCOs (M)
13780 Total BUSCO groups searched
Interpretation:
- ✅ Excellent completeness (98.5%)
- ✅ Low duplication (0.7%)
- ✅ Very few fragmented (0.8%) or missing (0.7%)
- Conclusion: High-quality annotation
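For automated checks, the one-line summary in BUSCO notation can be extracted with a regular expression. The following is a sketch that assumes the line has the layout shown in the example above (C, S, D, F, M percentages followed by n); the names `SUMMARY_RE` and `parse_summary_line` are introduced here for illustration only.

```python
import re

# Matches e.g. "C:98.5%[S:97.8%,D:0.7%],F:0.8%,M:0.7%,n:13780"
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_summary_line(text):
    """Return the percentages (floats) and n (int) from a short-summary text."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    result = {
        key: float(value)
        for key, value in match.groupdict().items()
        if key != "n"
    }
    result["n"] = int(match.group("n"))
    return result


summary = parse_summary_line("C:98.5%[S:97.8%,D:0.7%],F:0.8%,M:0.7%,n:13780")
print(summary["C"], summary["D"], summary["n"])
```

Because the function uses `search`, it can be given the whole contents of a short summary file rather than only the single line.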
Full Table (*_busco_full_table.tsv)¶
# Busco id Status Sequence Gene Start Gene End Strand Score Length
1000092at7742 Complete ENSP00000262735 - - - 1256.4 416
1000093at7742 Complete ENSP00000344818 - - - 2145.7 712
1000094at7742 Missing - - - - - -
1000095at7742 Fragmented ENSP00000123456 - - - 456.2 189
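To follow up on individual genes, the full table can be grouped by status. This sketch assumes the file is tab-separated, that comment lines start with `#`, and that the first two columns are the BUSCO ID and its status, as in the excerpt above. The function name `group_ids_by_status` is hypothetical.

```python
def group_ids_by_status(lines):
    """Group BUSCO IDs by the status column of a full table.

    `lines` is any iterable of text lines, for example an open file handle.
    IDs are kept in file order and listed once per status, since a duplicated
    BUSCO can occupy several rows.
    """
    ids_by_status = {}
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a data row
        busco_id, status = fields[0], fields[1]
        if (status, busco_id) not in seen:
            seen.add((status, busco_id))
            ids_by_status.setdefault(status, []).append(busco_id)
    return ids_by_status


rows = [
    "# Busco id\tStatus\tSequence\tScore\tLength",
    "1000092at7742\tComplete\tENSP00000262735\t1256.4\t416",
    "1000094at7742\tMissing",
    "1000095at7742\tFragmented\tENSP00000123456\t456.2\t189",
]
print(group_ids_by_status(rows).get("Missing", []))
```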
Output Files¶
Per-Sample Outputs¶
| File | Description | Size |
|---|---|---|
| {sample}_busco_short_summary.txt | Summary statistics (protein mode) | ~1 KB |
| {sample}_genome_busco_short_summary.txt | Summary statistics (genome mode) | ~1 KB |
| {sample}_busco_full_table.tsv | Detailed results per BUSCO | ~1-5 MB |
| {sample}_busco_missing_busco_list.tsv | List of missing BUSCOs | Variable |
Directory Structure¶
results/
└── busco/
├── sample1_busco_short_summary.txt
├── sample1_genome_busco_short_summary.txt
├── sample1_busco_full_table.tsv
├── sample2_busco_short_summary.txt
└── sample2_genome_busco_short_summary.txt
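When collecting results for many samples, the naming convention above is enough to pair each sample with its protein-mode and genome-mode summaries. The sketch below works on plain file names; `SUFFIXES` and `index_summaries` are names invented for this example, and it assumes that sample names do not themselves end in `_genome`.

```python
# Genome suffix first: it also ends with the protein suffix, so order matters.
SUFFIXES = {
    "_genome_busco_short_summary.txt": "genome",
    "_busco_short_summary.txt": "protein",
}


def index_summaries(filenames):
    """Map sample name -> {mode: filename} for short-summary files."""
    index = {}
    for name in filenames:
        for suffix, mode in SUFFIXES.items():
            if name.endswith(suffix):
                sample = name[: -len(suffix)]
                index.setdefault(sample, {})[mode] = name
                break  # stop at the first (most specific) matching suffix
    return index


files = [
    "sample1_busco_short_summary.txt",
    "sample1_genome_busco_short_summary.txt",
    "sample1_busco_full_table.tsv",
]
print(index_summaries(files))
```

In practice the file names could come from `pathlib.Path("results/busco").iterdir()`, using each entry's `.name`.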
Interpreting Results by Mode¶
Protein vs. Genome Comparison¶
| Scenario | Protein Score | Genome Score | Interpretation |
|---|---|---|---|
| Ideal | High (>95%) | High (>95%) | Excellent assembly and annotation |
| Good annotation | High (>95%) | Medium (85-95%) | Good annotation, some assembly gaps |
| Annotation issues | Medium (85-95%) | High (>95%) | Assembly good, annotation needs improvement |
| Both medium | 85-95% | 85-95% | Both assembly and annotation have room for improvement |
| Assembly problems | Low (<85%) | Low (<85%) | Significant assembly gaps affecting annotation |
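The comparison table can be expressed as a small lookup on score bands. This is an illustrative sketch only: the function name `interpret_modes` is hypothetical, the bands follow the table (high above 95%, medium from 85% to 95%, low below 85%), and combinations the table does not list return a generic message rather than a guessed interpretation.

```python
def _band(score):
    """Classify a completeness percentage into the bands used in the table."""
    if score > 95:
        return "high"
    if score >= 85:
        return "medium"
    return "low"


INTERPRETATION = {
    ("high", "high"): "Excellent assembly and annotation",
    ("high", "medium"): "Good annotation, some assembly gaps",
    ("medium", "high"): "Assembly good, annotation needs improvement",
    ("medium", "medium"): "Both assembly and annotation have room for improvement",
    ("low", "low"): "Significant assembly gaps affecting annotation",
}


def interpret_modes(protein_score, genome_score):
    """Look up the table's interpretation for a protein/genome score pair."""
    key = (_band(protein_score), _band(genome_score))
    return INTERPRETATION.get(
        key, "Combination not covered by the table; inspect both reports"
    )


print(interpret_modes(protein_score=90.2, genome_score=97.1))
```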
High Duplication (>5%)¶
Possible causes:
- Polyploidy: Natural in some species (e.g., wheat, salmon)
- Recent whole-genome duplication: Normal in some lineages
- Misassembly: Uncollapsed haplotypes (haplotigs) or chimeric contigs
- Contamination: Mixed samples
- Incorrect lineage: Using too broad a lineage
Action: Check genome characteristics and assembly quality metrics.
Troubleshooting¶
Low Completeness Scores¶
Problem: BUSCO completeness <80%
Possible causes:
1. Wrong lineage: Using inappropriate lineage dataset
2. Assembly gaps: Incomplete genome assembly
3. Annotation issues: Missing or incomplete gene predictions
4. Contamination: Non-target sequences in assembly
5. Unusual biology: Genuine gene loss (rare)
Solutions:
- Try a broader lineage (e.g., vertebrata instead of primates)
- Check assembly statistics (N50, gaps, contiguity)
- Review annotation pipeline parameters
- Screen for contamination
High Fragmentation¶
Problem: Many fragmented BUSCOs (>5%)
Possible causes:
- Assembly fragmentation (low N50)
- Incorrect gene models
- Pseudogenes or processed pseudogenes
Solutions:
- Improve assembly contiguity
- Adjust gene prediction parameters
- Check for frame shifts or premature stop codons
High Duplication¶
Problem: >10% duplicated BUSCOs (when not expected)
Possible causes:
- Haplotigs not removed during assembly
- Artificially duplicated regions in the assembly
- Chimeric sequences
- Recent gene duplications (normal)
Solutions:
- Run purge_dups or similar tools
- Check assembly for haplotigs
- Validate with Hi-C or genetic maps
Best Practices¶
Choose the right lineage
Use the most specific lineage available for your organism. A more specific lineage includes more marker genes and therefore resolves completeness more finely.
Run both modes
For production annotations, always run both protein and genome modes to get a complete picture of quality.
Compare across releases
Track BUSCO scores across annotation updates to monitor improvements or catch regressions.
Use with other QC metrics
BUSCO is one metric; combine it with N50, gene count trends, and manual inspection.
Document exceptions
Some genuine biological variations (gene loss, lineage-specific duplications) can affect scores. Document these cases so that a lower score is not mistaken for a quality problem.
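Tracking scores across releases, as recommended above, can be automated with a simple comparison of consecutive completeness values. The sketch below is one possible approach: the function name `find_regressions`, the release labels, and the default tolerance of 0.5 percentage points are all assumptions chosen for illustration, not values used by the pipeline.

```python
def find_regressions(scores_by_release, tolerance=0.5):
    """Report drops in completeness between consecutive releases.

    `scores_by_release` is a list of (release_label, completeness_percent)
    tuples ordered from oldest to newest. A drop larger than `tolerance`
    percentage points is reported as (previous, current, drop).
    """
    regressions = []
    for (prev_name, prev), (cur_name, cur) in zip(
        scores_by_release, scores_by_release[1:]
    ):
        drop = prev - cur
        if drop > tolerance:
            regressions.append((prev_name, cur_name, round(drop, 2)))
    return regressions


history = [("release_1", 98.5), ("release_2", 98.6), ("release_3", 96.9)]
for previous, current, drop in find_regressions(history):
    print(f"{previous} -> {current}: completeness fell by {drop} points")
```

A small tolerance avoids flagging fluctuations caused by minor gene model changes; the right value depends on the lineage size and should be set per project.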
Advanced Topics¶
Custom Lineages¶
For specialized projects, you can create custom lineage datasets:
- Download from OrthoDB
- Place in the lineage directory
- Reference in your CSV or with --busco_dataset
Offline Mode¶
Pre-download lineage datasets:
# Download lineages
busco --download_path /data/busco_lineages --download vertebrata_odb12
# Use in pipeline
nextflow run main.nf \
--csvFile genomes.csv \
--run_busco_core \
--download_path /data/busco_lineages
Metakey Integration¶
Load BUSCO results as database metakeys:
nextflow run main.nf \
--csvFile genomes.csv \
--run_busco_core \
--apply_busco_metakeys \
--host mysql-server.example.com \
--user ensadmin \
--password secret123
References¶
Primary Citations¶
- Manni et al. (2021). BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution, 38(10):4647–4654. DOI: 10.1093/molbev/msab199
- Manni et al. (2021). BUSCO: Assessing genomic data quality and beyond. Current Protocols, 1:e323. DOI: 10.1002/cpz1.323
Next Steps¶
- OMArk Workflow - Complementary proteome assessment
- Output Documentation - Detailed output file specifications
- Troubleshooting - Solve common issues