Genome Annotation using MAKER-P at iPlant
Collaboration with Mark Yandell Lab (University of Utah)
iPlant:
– Josh Stein (CSHL)
– Matt Vaughn (TACC)
– Dian Jiao (TACC)
– Zhenyuan Lu (CSHL)
– Nirav Merchant (U. Arizona)
– Carson Holt (Ontario Institute Cancer Research)
Cantarel et al Genome Research 18:188
Holt & Yandell BMC Bioinformatics 12:491
What Are Annotations?
Annotations are descriptions of features of the genome
– Structural: exons, introns, UTRs, splice forms etc.
– Coding & non-coding genes
– Functional: enzymatic activity, expression
Annotations should include evidence trail
– Assists in quality control of genome annotations
Examples of evidence supporting a structural annotation:
– Ab initio gene predictions
– ESTs
– Protein homology
Secondary Annotation
Protein Domains
– InterPro Scan: combines many HMM databases
GO and other ontologies
Pathway mapping
– E.g. BioCyc Pathway tools
Challenges in Plant Genome Annotation
Genomes are BIG
Highly repetitive
Many pseudogenes
Yet it is important to get it right!
Contamination Issue
Annotation Error Example: split gene models
Typical Annotation Pipeline
Contamination screening
Repeat/TE masking
Ab initio prediction
Evidence alignment (cDNA, EST, RNA-seq, protein)
Evidence-based prediction
Combiner
Evaluation/filtering
Manual curation
Options for Protein-coding Gene Annotation
MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next-generation sequencing technologies into a usable resource.
MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, automatically synthesizes these data into gene annotations, and produces evidence-based quality values for downstream annotation management.
Quality control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).
[Figure: AED comparison of MAKER-P and TAIR10; quality axis runs from better to worse]
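To make the AED score concrete, here is a minimal sketch of a nucleotide-level calculation. It assumes the usual formulation, AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases captured by the annotation and SP is the fraction of annotation bases backed by evidence. The function names, the half-open coordinate convention, and the toy intervals are illustrative assumptions, not MAKER-P's own code.

```python
def covered_bases(intervals):
    """Return the set of positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def annotation_edit_distance(annotation_exons, evidence_alignments):
    """Nucleotide-level AED: 0.0 is full agreement, 1.0 is no evidence support."""
    annotated = covered_bases(annotation_exons)
    evidence = covered_bases(evidence_alignments)
    if not annotated or not evidence:
        return 1.0  # nothing to compare, so treat as unsupported
    shared = len(annotated & evidence)
    sensitivity = shared / len(evidence)   # evidence bases captured by the annotation
    specificity = shared / len(annotated)  # annotation bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical toy gene models (coordinates are made up for illustration).
perfect = annotation_edit_distance([(100, 200), (300, 400)], [(100, 200), (300, 400)])
partial = annotation_edit_distance([(100, 200), (300, 400)], [(100, 200)])
unsupported = annotation_edit_distance([(100, 200)], [(500, 600)])
```

In this toy case the fully supported model scores 0.0, the model with one unsupported exon scores 0.25, and the model with no overlapping evidence scores 1.0, which matches the "better to worse" direction of the quality axis.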
MAKER-P MPI Support
Message Passing Interface (MPI) is a communication protocol for computer clusters which essentially allows multiple computers to act like a single powerful machine.
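The idea of many processes sharing one job can be sketched without MPI itself. The example below only shows how a list of contigs might be divided among worker ranks; the round-robin scheme, the function name, and the contig names are assumptions for illustration and do not describe how MAKER-P actually schedules work.

```python
def partition_round_robin(contig_ids, n_ranks):
    """Assign contigs to ranks in turn, so rank r gets items r, r + n_ranks, ..."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return [contig_ids[rank::n_ranks] for rank in range(n_ranks)]


# Hypothetical contig names split across two worker processes.
assignments = partition_round_robin(["c1", "c2", "c3", "c4", "c5"], 2)
```

Each rank would then annotate only its own share, which is why adding cores shortens wall-clock time as long as there are enough contigs to go around.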
Annotating the Genome – Apollo View
[Apollo tracks: Current evidence, Current Assembly]
Identify and Mask Repetitive Elements
[Apollo tracks: Current evidence, Current Assembly]
RepeatMasker
– RepBase
– Species specific library
RepeatRunner
– MAKER internal protein library
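What masking does to a sequence can be shown with a small sketch. It does not call RepeatMasker or RepeatRunner; it assumes repeat coordinates have already been found and are given as half-open (start, end) intervals. Both function names and the example sequence are hypothetical.

```python
def soft_mask(sequence, repeat_intervals):
    """Lowercase the bases inside repeat intervals; keep the rest uppercase."""
    bases = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def hard_mask(sequence, repeat_intervals):
    """Replace the bases inside repeat intervals with N."""
    bases = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


soft = soft_mask("ACGTACGTAC", [(2, 5)])
hard = hard_mask("ACGTACGTAC", [(2, 5)])
```

Soft masking keeps the underlying bases visible to downstream tools, while hard masking hides them entirely, so the choice affects whether a gene finder or aligner can still extend through a repeat.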
Generate Ab Initio Gene Predictions
[Apollo tracks: Current evidence, Current Assembly, Ab initio Predictions]
MAKER currently supports:
– SNAP
– Augustus
– GeneMark
– FGENESH
Can be run internally or externally
Align EST and Protein Evidence
[Apollo tracks: Current evidence, Current Assembly, Ab initio Predictions, EST TBLASTX, EST BLASTN, Protein BLASTX]
– Identify regions being actively transcribed (i.e. EST data)
– Identify regions with homology to a known protein
Polish BLAST Alignments with Exonerate
[Apollo tracks: Current evidence, Current Assembly, Ab initio Predictions, Polished protein, Polished EST]
– All base pairs must align in order.
– No HSP overlap is permitted.
– Aligns HSPs correctly with respect to splice sites.
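The first two polishing constraints (in-order alignment, no HSP overlap) can be expressed as a simple check. This is a minimal sketch, not Exonerate's algorithm: it assumes forward-strand HSPs given as hypothetical (query_start, query_end, genome_start, genome_end) tuples with half-open coordinates, and it does not model splice sites.

```python
from typing import List, Tuple

# (query_start, query_end, genome_start, genome_end), half-open, forward strand
HSP = Tuple[int, int, int, int]


def is_colinear_without_overlap(hsps: List[HSP]) -> bool:
    """True if HSPs keep the same order on query and genome and never overlap."""
    ordered = sorted(hsps, key=lambda h: h[2])  # walk along the genome
    for prev, curr in zip(ordered, ordered[1:]):
        if curr[2] < prev[3]:
            return False  # overlap on the genome
        if curr[0] < prev[1]:
            return False  # query segments overlap or are out of order
    return True


# Hypothetical HSP sets for illustration.
clean = [(0, 50, 1000, 1050), (50, 120, 1300, 1370)]
overlapping = [(0, 50, 1000, 1050), (40, 120, 1300, 1380)]
shuffled = [(50, 120, 1000, 1070), (0, 50, 1300, 1350)]

clean_ok = is_colinear_without_overlap(clean)
overlapping_ok = is_colinear_without_overlap(overlapping)
shuffled_ok = is_colinear_without_overlap(shuffled)
```

Raw BLAST output can violate either condition, which is why a polishing step is needed before the alignments are used as exon evidence.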
Pass Gene Finders Evidence-based ‘hints’
[Apollo tracks: Current evidence, Current Assembly, Ab initio Predictions, Hint-based SNAP, Hint-based FgenesH]
Identify Gene Model Most Consistent with Evidence*
[Apollo tracks: Current evidence, Current Assembly, Ab initio Predictions, Hint-based SNAP, Hint-based FgenesH]
* Quantitative Measures for the Management and Comparison of Annotated Genomes. Karen Eilbeck, Barry Moore, Carson Holt and Mark Yandell. BMC Bioinformatics :67 doi: /
Revise it further if necessary; Create New Annotation
[Apollo tracks: Current evidence, Current Assembly, Ab initio Predictions]
Compute Support for Each Portion of Gene Model
MAKER-P v2.28 at iPlant
TACC Lonestar Supercomputer with 22,656 CPU
– MPI enabled for parallel computation
– Can complete entire rice genome in ~2 hrs (1,152 cores)
– 96 CPU per chromosome
– Can complete Aegilops tauschii ALLPATHS-LG assembly in ~8 hrs (1,152 cores)
Currently being integrated into the iPlant Discovery Environment
Atmosphere
– MPI enabled for parallel computation
– Maximum instance size 16 CPU
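The run figures above fit together arithmetically, as the short calculation below shows. The core-hour totals are rough, since the slide gives run times only as approximate values; the variable names are illustrative.

```python
TOTAL_CORES = 1152             # cores used for both runs
CORES_PER_CHROMOSOME = 96      # allocation quoted for the rice run
LONESTAR_CPUS = 22656          # size of the Lonestar system quoted above

# 1,152 cores at 96 per chromosome covers 12 chromosomes at once,
# which matches the 12 chromosomes of the rice genome.
chromosomes_in_parallel = TOTAL_CORES // CORES_PER_CHROMOSOME

# Approximate cost in core-hours, using the ~2 hr and ~8 hr run times.
rice_core_hours = TOTAL_CORES * 2
tauschii_core_hours = TOTAL_CORES * 8

# Share of the whole machine used by one such run.
lonestar_fraction = TOTAL_CORES / LONESTAR_CPUS
```

By this estimate a single run occupies about 5% of Lonestar, while a maximum-size 16 CPU Atmosphere instance offers 1/72 of the cores used for these runs.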
Assembly & Annotation at iPlant