Genome Annotation using MAKER-P at iPlant Collaboration with Mark Yandell Lab (University of Utah) www.yandell-lab.org iPlant: Josh Stein (CSHL) Matt Vaughn.

Presentation transcript:

Genome Annotation using MAKER-P at iPlant
Collaboration with Mark Yandell Lab (University of Utah)
iPlant: Josh Stein (CSHL), Matt Vaughn (TACC), Dian Jiao (TACC), Zhenyuan Lu (CSHL), Nirav Merchant (U. Arizona)
Carson Holt (Ontario Institute for Cancer Research)
References: Cantarel et al., Genome Research 18:188; Holt & Yandell, BMC Bioinformatics 12:491

What Are Annotations?
– Annotations are descriptions of features of the genome
  – Structural: exons, introns, UTRs, splice forms, etc.
  – Coding and non-coding genes
  – Functional: enzymatic activity, expression
– Annotations should include an evidence trail
  – Assists in quality control of genome annotations
– Examples of evidence supporting a structural annotation:
  – Ab initio gene predictions
  – ESTs
  – Protein homology

Secondary Annotation
– Protein domains
  – InterProScan: combines many HMM databases
– GO and other ontologies
– Pathway mapping
  – E.g. BioCyc Pathway Tools

Challenges in Plant Genome Annotation
– Genomes are BIG
– Highly repetitive
– Many pseudogenes
– Yet it is important to get it right!

Contamination Issue

Annotation Error Example: split gene models

Typical Annotation Pipeline
– Contamination screening
– Repeat/TE masking
– Ab initio prediction
– Evidence alignment (cDNA, EST, RNA-seq, protein)
– Evidence-based prediction
– Combiner
– Evaluation/filtering
– Manual curation

Options for Protein-coding Gene Annotation

MAKER is an easy-to-use annotation pipeline designed to help smaller research groups convert the mountain of genomic data provided by next generation sequencing technologies into a usable resource.

MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, automatically synthesizes these data into gene annotations, and produces evidence-based quality values for downstream annotation management.

Quality control evaluation of the MAKER-P and TAIR10 datasets using Annotation Edit Distance (AED).
[Figure: AED comparison of the two datasets; the quality axis runs from better to worse.]
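As a rough illustration of what an AED value measures, the sketch below follows the nucleotide-level definition from the Eilbeck et al. paper cited later in this presentation: sensitivity and specificity of the annotation against its evidence are averaged, and AED is one minus that average, so 0 means complete agreement and 1 means no support. The function names and the interval representation are assumptions made for this example; MAKER's own calculation handles more cases than this.

```python
def covered_bases(intervals):
    """Return the set of positions covered by 1-based inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(annotation_exons, evidence_exons):
    """Nucleotide-level AED between an annotation and its aligned evidence.

    Both arguments are lists of (start, end) exon intervals on one sequence.
    """
    annotated = covered_bases(annotation_exons)
    evidence = covered_bases(evidence_exons)
    if not annotated or not evidence:
        return 1.0  # nothing to compare: treat as completely unsupported

    shared = len(annotated & evidence)
    sensitivity = shared / len(evidence)   # fraction of evidence bases annotated
    specificity = shared / len(annotated)  # fraction of annotated bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical gene model whose second exon extends past the evidence.
model = [(100, 199), (300, 449)]
evidence = [(100, 199), (300, 399)]
print(round(annotation_edit_distance(model, evidence), 3))
```

Lower values indicate a gene model that agrees more closely with its evidence, which is why the figure's axis runs from better to worse.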

MAKER-P MPI Support
Message Passing Interface (MPI) is a communication protocol for computer clusters. It allows multiple computers to work together on one job, as if they were a single powerful machine.
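The general idea behind MPI-style parallelism can be sketched without any MPI library: every process learns its own rank and the total number of processes, then takes a disjoint share of the independent work units. The example below simulates that with plain Python. It illustrates the concept only; the slides do not describe how MAKER-P actually schedules work, and the contig names, the `contigs_for_rank` helper, and the round-robin scheme are all assumptions.

```python
def contigs_for_rank(contig_ids, rank, size):
    """Round-robin split: rank r takes items r, r + size, r + 2 * size, ..."""
    return contig_ids[rank::size]


# Ten hypothetical contigs shared among four simulated MPI ranks.
contigs = [f"contig_{i}" for i in range(1, 11)]
size = 4
assignment = {rank: contigs_for_rank(contigs, rank, size) for rank in range(size)}

for rank, share in assignment.items():
    print(rank, share)
```

Because each contig can be annotated independently of the others, adding ranks shortens the wall-clock time until the number of ranks approaches the number of work units.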

Annotating the Genome – Apollo View
[Figure tracks: current assembly, current evidence]

Identify and Mask Repetitive Elements
– RepeatMasker
  – RepBase
  – Species-specific library
– RepeatRunner
  – MAKER internal protein library
[Figure tracks: current assembly, current evidence]
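Once repeats have been located, masking simply marks those regions so that downstream tools treat them differently. The sketch below shows the two common conventions: soft-masking (lowercase) and hard-masking (replacement with N). The function names and the 0-based, half-open interval convention are assumptions for this example, and the intervals would in practice come from a repeat finder's output rather than being typed in by hand.

```python
def soft_mask(sequence, repeat_intervals):
    """Lowercase the bases inside 0-based, half-open (start, end) repeat intervals."""
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


def hard_mask(sequence, repeat_intervals):
    """Replace the bases inside the repeat intervals with N."""
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


print(soft_mask("ACGTACGTAC", [(2, 5)]))  # ACgtaCGTAC
print(hard_mask("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```

Soft-masking keeps the underlying sequence available, while hard-masking removes it from view entirely.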

Generate Ab Initio Gene Predictions
– MAKER currently supports:
  – SNAP
  – Augustus
  – GeneMark
  – FGENESH
– Can be run internally or externally
[Figure tracks: current assembly, current evidence, ab initio predictions]

Align EST and Protein Evidence
– Identify regions being actively transcribed (i.e. EST data)
– Identify regions with homology to a known protein
[Figure tracks: current assembly, current evidence, ab initio predictions, EST TBLASTX, EST BLASTN, protein BLASTX]

Polish BLAST Alignments with Exonerate
– All base pairs must align in order.
– No HSP overlap is permitted.
– Aligns HSPs correctly with respect to splice sites.
[Figure tracks: current assembly, current evidence, ab initio predictions, polished protein, polished EST]
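The first two polishing constraints (alignment in order, no overlapping HSPs) can be expressed as a simple check on a set of HSPs. The sketch below is not Exonerate's algorithm, which performs a splice-aware realignment. It only tests whether a list of HSPs already satisfies those two rules. The tuple layout, the inclusive coordinates, and the assumption that all HSPs lie on the same strand are choices made for this example.

```python
from typing import List, Tuple

# (query_start, query_end, genome_start, genome_end), inclusive coordinates,
# all HSPs assumed to be on the same strand.
HSP = Tuple[int, int, int, int]


def is_collinear(hsps: List[HSP]) -> bool:
    """True if HSPs are in the same order on query and genome and never overlap."""
    ordered = sorted(hsps, key=lambda h: h[0])
    for prev, curr in zip(ordered, ordered[1:]):
        if curr[0] <= prev[1]:
            return False  # overlap on the query (EST or protein)
        if curr[2] <= prev[3]:
            return False  # overlap or reversed order on the genome
    return True


good = [(1, 100, 1000, 1099), (101, 200, 1500, 1599)]
overlapping = [(1, 100, 1000, 1099), (90, 200, 1500, 1610)]
shuffled = [(1, 100, 1500, 1599), (101, 200, 1000, 1099)]
print(is_collinear(good), is_collinear(overlapping), is_collinear(shuffled))
```

Raw BLAST output often violates these rules, which is the motivation for the polishing step.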

Pass Gene Finders Evidence-based ‘Hints’
[Figure tracks: current assembly, current evidence, ab initio predictions, hint-based SNAP, hint-based FGENESH]

Identify Gene Model Most Consistent with Evidence*
[Figure tracks: current assembly, current evidence, ab initio predictions, hint-based SNAP, hint-based FGENESH]
* Karen Eilbeck, Barry Moore, Carson Holt and Mark Yandell. Quantitative Measures for the Management and Comparison of Annotated Genomes. BMC Bioinformatics :67

Revise It Further if Necessary; Create New Annotation
[Figure tracks: current assembly, current evidence, ab initio predictions]

Compute Support for Each Portion of Gene Model
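One simple way to picture per-portion support is the fraction of each exon that is overlapped by aligned evidence. The sketch below computes that fraction for every exon of a gene model. It is an illustration only: the slides do not spell out how MAKER derives its quality values, and the function name and 1-based inclusive intervals are assumptions.

```python
def exon_support(exons, evidence):
    """For each (start, end) exon, return the fraction of bases covered by evidence."""
    evidence_bases = set()
    for start, end in evidence:
        evidence_bases.update(range(start, end + 1))

    fractions = []
    for start, end in exons:
        length = end - start + 1
        supported = sum(1 for pos in range(start, end + 1) if pos in evidence_bases)
        fractions.append(supported / length)
    return fractions


# Hypothetical three-exon model: fully, half, and not supported.
exons = [(1, 100), (201, 300), (401, 500)]
evidence = [(1, 100), (251, 300)]
print(exon_support(exons, evidence))  # [1.0, 0.5, 0.0]
```

Exons with little or no support are natural candidates for a closer look during manual curation.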

MAKER-P v2.28 at iPlant
– TACC Lonestar supercomputer with 22,656 CPUs
  – MPI enabled for parallel computation
  – Can complete the entire rice genome in ~2 hrs (1,152 cores: 96 CPUs for each of the 12 rice chromosomes)
  – Can complete the Aegilops tauschii ALLPATHS-LG assembly in ~8 hrs (1,152 cores)
  – Currently being integrated into the iPlant Discovery Environment
– Atmosphere
  – MPI enabled for parallel computation
  – Maximum instance size: 16 CPUs

Assembly & Annotation at iPlant