OMArk Workflow¶

OMArk (Orthology MARKer) is a proteome quality assessment tool that evaluates completeness based on the presence of conserved genes from the target lineage.

Overview¶

OMArk provides an orthology-based assessment of proteome quality by comparing your protein set against a database of conserved orthologous groups. It identifies complete, fragmented, missing, and potentially contaminant sequences.

How OMArk Works¶

OMAmer Mapping: Maps protein sequences to hierarchical orthologous groups (HOGs)
Lineage Assignment: Determines the taxonomic placement of each protein
Completeness Assessment: Evaluates expected gene presence based on lineage
Consistency Check: Identifies potential contamination or misassignments

Key Features¶

Lineage-Specific: Uses taxonomically appropriate conserved gene sets
Contamination Detection: Identifies sequences inconsistent with target lineage
Fragmentation Analysis: Distinguishes complete vs. partial genes
Placement Verification: Validates taxonomic assignments

Running OMArk¶

Basic Usage¶

nextflow run main.nf \
  --csvFile proteomes.csv \
  --run_omark \
  --host mysql-server.example.com \
  --port 3306 \
  --user_r ensro \
  --outdir results

Required Inputs¶

Create a CSV file with these columns:

dbname,species_id,taxon_id
danio_rerio_core_110_11,1,7955
xenopus_tropicalis_core_110_10,1,8364
gallus_gallus_core_110_7,1,9031
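
A malformed input file is easier to catch before the pipeline starts. The sketch below checks the header and the three fields of each row; `validate_omark_csv` is a hypothetical helper, not part of the pipeline, and it assumes `species_id` and `taxon_id` are always plain integers.

```shell
# Hypothetical helper: verify the header and the three fields of every data row.
# Prints one message per problem and returns non-zero if any were found.
validate_omark_csv() {
  awk -F',' '
    NR == 1 {
      if ($0 != "dbname,species_id,taxon_id") { print "bad header: " $0; bad = 1 }
      next
    }
    /^[[:space:]]*$/ { next }   # ignore blank lines
    NF != 3          { print "line " NR ": expected 3 fields"; bad = 1; next }
    $2 !~ /^[0-9]+$/ { print "line " NR ": species_id is not an integer"; bad = 1 }
    $3 !~ /^[0-9]+$/ { print "line " NR ": taxon_id is not an integer"; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}
```

Typical use would be `validate_omark_csv proteomes.csv && nextflow run main.nf ...`, so the run only starts when the file passes.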

Configuration Parameters¶

Parameter	Default	Description
`--omamer_database`	System path	Path to OMAmer HOG database (.h5 file)
`--omark_singularity_path`	System path	Path to OMArk Singularity container

Custom Database¶

nextflow run main.nf \
  --csvFile proteomes.csv \
  --run_omark \
  --omamer_database /data/omamer/LUCA.h5 \
  --host mysql-server.example.com \
  --user_r ensro

Understanding Results¶

Output File¶

OMArk generates a detailed summary file: {sample}_omark_proteins_detailed_summary.txt

Example Output¶

# OMArk Analysis Summary
# Database: danio_rerio_core_110_11
# Taxon ID: 7955 (Danio rerio)
# Date: 2024-01-15

Completeness Assessment:
========================
Expected HOGs: 3,640
Complete: 3,521 (96.7%)
Fragmented: 87 (2.4%)
Missing: 32 (0.9%)

Consistency Analysis:
====================
Consistent: 3,598 (98.8%)
Inconsistent: 10 (0.3%)

Result Categories¶

Category	Description	Interpretation
Complete	Full-length conserved genes present	✅ Expected genes found
Fragmented	Partial or truncated gene matches	⚠️ Incomplete genes or pseudogenes
Missing	Expected genes not detected	❌ Annotation gaps or genuine loss
Consistent	Proteins match expected lineage	✅ Correct taxonomic placement
Inconsistent	Proteins from unexpected lineage	⚠️ Contamination or HGT

Completeness Scores¶

Overall Completeness = Complete / Expected × 100

Score Range	Quality	Interpretation
95-100%	Excellent	High-quality, complete proteome
90-94.9%	Good	Minor gaps, generally good
85-89.9%	Fair	Some missing genes; investigate
<85%	Poor	Significant issues; troubleshoot
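
As a worked example, the formula and table above can be applied to the counts from the example output (3,521 complete out of 3,640 expected HOGs). This is a minimal sketch: the function names are hypothetical, and it assumes a score that falls exactly on a boundary belongs to the higher band.

```shell
# completeness_pct COMPLETE EXPECTED -> percentage with one decimal place
completeness_pct() {
  awk -v c="$1" -v e="$2" 'BEGIN { printf "%.1f\n", 100 * c / e }'
}

# completeness_band PCT -> quality label from the score table
# (assumption: boundary values fall into the higher band)
completeness_band() {
  awk -v p="$1" 'BEGIN {
    if      (p >= 95) print "Excellent"
    else if (p >= 90) print "Good"
    else if (p >= 85) print "Fair"
    else              print "Poor"
  }'
}

pct=$(completeness_pct 3521 3640)
echo "${pct}% -> $(completeness_band "$pct")"   # prints: 96.7% -> Excellent
```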

Consistency Scores¶

Consistency = Consistent / Total × 100

Score Range	Quality	Interpretation
98-100%	Excellent	Clean, no contamination
95-97.9%	Good	Minimal contamination
90-94.9%	Fair	Some inconsistent sequences
<90%	Poor	Significant contamination or misassignment

OMArk vs. BUSCO¶

Both tools assess proteome completeness but use different approaches:

Feature	OMArk	BUSCO
Database	OMAmer HOGs	OrthoDB single-copy orthologs
Approach	Hierarchical orthology	Single-copy gene presence
Contamination Detection	✅ Built-in	❌ Limited
Lineage Specificity	High	High
Speed	Fast	Slower
Gene Count	Variable by lineage	Fixed per lineage
Best For	Contamination screening	Assembly/annotation QC

Use Both

OMArk and BUSCO complement each other. Use both for comprehensive quality assessment.

Interpreting Results¶

High Completeness, High Consistency¶

Complete: 97.5%
Consistent: 99.2%

✅ Excellent Quality

- Proteome is complete and clean
- No significant contamination
- Ready for downstream analyses

High Completeness, Low Consistency¶

Complete: 96.8%
Consistent: 89.5%

⚠️ Contamination Issues

- Many genes present but from wrong lineage
- Possible contamination or mixed sample
- Check for foreign sequences

Action:

- Review inconsistent proteins
- Screen for contamination
- Verify sample purity

Low Completeness, High Consistency¶

Complete: 82.3%
Consistent: 98.9%

⚠️ Annotation Gaps

- Correct lineage but missing genes
- Annotation incomplete or gene loss
- Assembly may have gaps

Action:

- Review annotation pipeline
- Check assembly completeness
- Consider re-annotation

Low Completeness, Low Consistency¶

Complete: 78.5%
Consistent: 87.2%

❌ Multiple Issues

- Both completeness and contamination problems
- Severe quality issues
- Requires thorough investigation

Action:

- Comprehensive QC review
- Check sample identity
- Validate assembly and annotation

Use Cases¶

1. Contamination Screening¶

OMArk excels at detecting contamination:

# Screen multiple assemblies
nextflow run main.nf \
  --csvFile new_assemblies.csv \
  --run_omark \
  --host mysql-server.example.com \
  --user_r ensro

Look for:

- Low consistency scores (<95%)
- Inconsistent proteins from distantly related lineages
- Patterns suggesting systematic contamination
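
When many assemblies are screened at once, the low-consistency check can be scripted. The sketch below assumes each summary file contains a line shaped like `Consistent: 3,598 (98.8%)`, as in the example output on this page; `flag_low_consistency` is a hypothetical helper, and the parsing would need adjusting if the real file layout differs.

```shell
# flag_low_consistency THRESHOLD FILE...
# Prints "file<TAB>percent" for every summary whose consistency is below THRESHOLD.
flag_low_consistency() {
  threshold=$1; shift
  for f in "$@"; do
    # Pull the percentage out of a line such as "Consistent: 3,598 (98.8%)".
    pct=$(sed -n 's/^Consistent:.*(\([0-9.]*\)%).*/\1/p' "$f" | head -n 1)
    [ -n "$pct" ] || { echo "no Consistent line in $f" >&2; continue; }
    # awk handles the floating-point comparison; exit status 0 means pct < threshold.
    if awk -v p="$pct" -v t="$threshold" 'BEGIN { exit !(p < t) }'; then
      printf '%s\t%s\n' "$f" "$pct"
    fi
  done
}
```

For example, `flag_low_consistency 95 results/omark/*_omark_proteins_detailed_summary.txt` would list only the proteomes that need a closer look.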

2. Pre-Publication QC¶

Validate proteome quality before public release:

# Run both OMArk and BUSCO
nextflow run main.nf \
  --csvFile release_genomes.csv \
  --run_omark \
  --run_busco_core \
  --busco_mode protein \
  --host mysql-server.example.com \
  --user_r ensro

3. Comparative Proteomics¶

Assess relative quality across multiple species:

dbname,species_id,taxon_id
species_a_core_110_1,1,12345
species_b_core_110_1,1,12346
species_c_core_110_1,1,12347

nextflow run main.nf \
  --csvFile comparative_set.csv \
  --run_omark \
  --host mysql-server.example.com \
  --user_r ensro

4. Annotation Pipeline Validation¶

Test annotation pipeline improvements:

# Compare v1 vs. v2 annotations
cat > comparison.csv << EOF
dbname,species_id,taxon_id
species_v1_core,1,9606
species_v2_core,1,9606
EOF

nextflow run main.nf \
  --csvFile comparison.csv \
  --run_omark \
  --host mysql-server.example.com \
  --user_r ensro
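
Once both runs finish, the two summaries can be compared side by side. This is a minimal sketch with hypothetical helper names; it assumes the `Complete`, `Fragmented`, and `Missing` lines follow the `Label: count (percent%)` layout shown in the example output, and that you pass the paths of the two summary files yourself.

```shell
# extract_pct LABEL FILE -> percentage from a line "LABEL: <count> (<pct>%)"
extract_pct() {
  sed -n "s/^$1:.*(\([0-9.]*\)%).*/\1/p" "$2" | head -n 1
}

# compare_annotations OLD_SUMMARY NEW_SUMMARY
# Prints label, old %, new %, and the signed change, tab-separated.
compare_annotations() {
  for label in Complete Fragmented Missing; do
    old=$(extract_pct "$label" "$1")
    new=$(extract_pct "$label" "$2")
    awk -v l="$label" -v o="$old" -v n="$new" \
      'BEGIN { printf "%s\t%s\t%s\t%+.1f\n", l, o, n, n - o }'
  done
}
```

A positive change on the `Complete` row together with negative changes on `Fragmented` and `Missing` suggests the newer annotation is an improvement.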

Output Files¶

Directory Structure¶

results/
└── omark/
    ├── sample1_omark_proteins_detailed_summary.txt
    ├── sample2_omark_proteins_detailed_summary.txt
    └── sample3_omark_proteins_detailed_summary.txt

File Contents¶

The detailed summary includes:

Header: Metadata (database, taxon, date)
Completeness Metrics: HOG presence/absence
Consistency Metrics: Lineage validation
Fragmentation Details: Partial gene information
Inconsistent Proteins: List of potentially contaminating sequences

Troubleshooting¶

Low Completeness with BUSCO OK¶

Problem: OMArk shows low completeness but BUSCO is good

Possible causes:

- Different gene sets used (OMArk uses broader HOGs)
- Lineage-specific gene expansions/losses
- Different sensitivity thresholds

Solution: Review both results; some divergence is normal.

High Inconsistency Rate¶

Problem: >10% inconsistent proteins

Possible causes:

1. Contamination: Foreign DNA in sample
2. Wrong taxon ID: Incorrect species assignment
3. Horizontal gene transfer: Natural (e.g., in bacteria)
4. Symbionts: Co-sequenced organisms

Solutions:

- Check sample purity
- Verify taxon ID is correct
- Screen for contamination with tools like BlobTools
- Review inconsistent sequences manually

Missing OMAmer Database¶

Problem: Pipeline cannot find OMAmer database

Solution: Download and specify:

# Download the LUCA database from the OMA browser into /data
wget -P /data https://omabrowser.org/All/LUCA.h5

# Use in pipeline
nextflow run main.nf \
  --csvFile proteomes.csv \
  --run_omark \
  --omamer_database /data/LUCA.h5 \
  --host mysql-server.example.com \
  --user_r ensro
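
A truncated or misplaced download produces the same symptom as a missing file, so it can help to check the database before launching. The sketch below uses a hypothetical `check_omamer_db` helper. It relies on the fact that HDF5 files normally start with an 8-byte signature, and assumes the `.h5` file has no user block in front of that signature.

```shell
# check_omamer_db PATH -> returns 0 if PATH looks like a usable HDF5 file
check_omamer_db() {
  db=$1
  [ -f "$db" ] || { echo "not found: $db" >&2; return 1; }
  [ -r "$db" ] || { echo "not readable: $db" >&2; return 1; }
  [ -s "$db" ] || { echo "empty file: $db" >&2; return 1; }
  # HDF5 signature bytes: 89 48 44 46 0d 0a 1a 0a ("\x89HDF\r\n\x1a\n")
  sig=$(head -c 8 "$db" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "894844460d0a1a0a" ] || { echo "not an HDF5 file: $db" >&2; return 1; }
}
```

Running `check_omamer_db /data/LUCA.h5` before `nextflow run` gives a clearer error than a failure deep inside the workflow.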

Very Long Runtime¶

Problem: OMArk taking too long

Possible causes:

- Very large proteome (>50,000 proteins)
- Slow I/O on database file

Solutions:

- Use local SSD for OMAmer database
- Split large proteomes if possible
- Increase computational resources

Best Practices¶

Check Consistency First

Always review consistency scores before interpreting completeness. Contamination can inflate completeness scores.

Combine with BUSCO

Use OMArk and BUSCO together for comprehensive QC. They catch different types of issues.

Track Across Releases

Monitor OMArk scores across annotation updates to catch regressions or improvements.

Investigate Inconsistencies

Don't ignore inconsistent proteins—they often reveal real contamination or sample issues.

Use Appropriate Taxon IDs

Ensure taxon_id matches your species. Wrong IDs lead to incorrect consistency assessments.

Advanced Topics¶

Custom OMAmer Databases¶

For specialized analyses, you can create custom HOG databases:

Extract relevant clade from OMA
Build OMAmer index
Point pipeline to custom database

Integration with Contamination Tools¶

Combine OMArk with other contamination detection:

# 1. Run OMArk
nextflow run main.nf --csvFile genomes.csv --run_omark

# 2. Extract inconsistent proteins
grep "Inconsistent" results/omark/*_detailed_summary.txt

# 3. Further screen with BlobTools or similar

Batch Analysis¶

Process hundreds of proteomes:

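# Generate CSV for all core databases.
# Each core database has its own meta table, so query them one at a time.
echo "dbname,species_id,taxon_id" > all_cores.csv
for db in $(mysql -h mysql-server.example.com -u ensro -N -B \
    -e "SHOW DATABASES LIKE '%_core_%'"); do
  taxon_id=$(mysql -h mysql-server.example.com -u ensro -N -B -e "
    SELECT meta_value FROM ${db}.meta
    WHERE meta_key = 'species.taxonomy_id' AND species_id = 1")
  echo "${db},1,${taxon_id}" >> all_cores.csv
done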

# Run OMArk on all
nextflow run main.nf \
  --csvFile all_cores.csv \
  --run_omark \
  --host mysql-server.example.com \
  --user_r ensro

Performance Considerations¶

Resource Requirements¶

Proteome Size	Memory	CPU	Runtime
Small (<20k)	8 GB	4	10-20 min
Medium (20-40k)	16 GB	8	20-40 min
Large (>40k)	32 GB	16	40-90 min
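
The table can be turned into a small lookup when resources are set per run. The sketch below is only an illustration: `suggest_resources` is a hypothetical helper, the protein count has to be supplied by the caller, and the memory string uses the same `NN.GB` style as the `--max_memory` example on this page.

```shell
# suggest_resources N_PROTEINS -> "memory cpus" following the table above
suggest_resources() {
  n=$1
  if   [ "$n" -lt 20000 ]; then echo "8.GB 4"     # Small (<20k)
  elif [ "$n" -le 40000 ]; then echo "16.GB 8"    # Medium (20-40k)
  else                          echo "32.GB 16"   # Large (>40k)
  fi
}

suggest_resources 26000   # prints: 16.GB 8
```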

Optimization¶

# Increase parallelization
nextflow run main.nf \
  --csvFile proteomes.csv \
  --run_omark \
  --max_cpus 32 \
  --max_memory 64.GB

References¶

Primary Citation¶

Nevers et al. (2022). Quality assessment of gene repertoire annotations with OMArk. bioRxiv. DOI: 10.1101/2022.11.25.517970

Next Steps¶

BUSCO Workflow - Complementary completeness assessment
Ensembl Stats - Generate database statistics
Output Documentation - Detailed output specifications
Troubleshooting - Common issues and solutions