Genome Annotation BCB 660 October 20, 2011. From Carson Holt.


Annotations

Automated
– Ab initio (based on genomic sequence alone)
    Involves comparisons to known proteins (BLAST similarity)
    Sequence motifs such as start/stop codons, intron/exon boundaries
– Evidence-based (ESTs)
    Involves alignment of experimental EST (cDNA) data to a gene prediction

Manual
– Manual curation of genes predicted automatically
– Check gene structure, presence of conserved domains, match of ESTs to gene prediction
– Align to related genes/proteins and look for oddities (missing exons, early stop codons, etc.)
– Annotation can then be manually edited
– May also involve assigning function (based on sequence similarity, conserved domains) via Gene Ontology

Structural: exons, introns, UTRs, splice forms, etc.
Functional: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.
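As a rough illustration of what "sequence motifs such as start/stop codons" means in practice, the sketch below scans the forward strand of a DNA string for open reading frames. It is a toy, not how any real gene predictor works: the function name `find_orfs` and the `min_codons` parameter are made up for this example, and introns, the reverse strand and statistical models are ignored entirely.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

orfs = find_orfs("ATGAAATTTTAA")
```

Real ab initio predictors combine signals like these with trained probabilistic models, which is why the training step discussed later in the slides matters.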

Classic strategy
– Combine ab initio and evidence-based gene predictors to produce a consensus predicted gene set
– Ask the community to pitch in and manually annotate as many genes as possible
– Leads to great variability in the quality of different genome annotations, and often to many versions of official gene sets

NGS and the future of genome annotation
– In 2010, 1300 eukaryotic genome projects were underway
    – Assuming 10,000 genes per genome, 13,000,000 new annotations will be needed
    – Quality control and maintenance become an issue
– Some organizations are dedicated to genome annotation (e.g. Ensembl and VectorBase), but annotating 1300 genomes this way will not be feasible
– Need for high-quality, automated annotation pipelines that are easy to use by small research groups without extensive bioinformatics expertise
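The 13,000,000 figure is simply the product of the two numbers on the slide. A quick check, with variable names chosen only for this example:

```python
genome_projects = 1300       # eukaryotic genome projects underway in 2010 (from the slide)
genes_per_genome = 10_000    # assumed average gene count per genome (from the slide)

total_annotations = genome_projects * genes_per_genome
print(f"{total_annotations:,} new gene annotations")
```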

MAKER Pipeline: Especially effective for Emerging Eukaryote Model Organisms
– Incorporates ab initio and evidence-based gene predictors
– Gene predictions are run a first time
– Then a small subset of the genome assembly is used to train gene predictors (building genome-specific HMMs)
– Then trained gene predictors are run again on the whole genome
** Really nice if you don’t have a basis to start from (e.g. de novo gene prediction)

What does MAKER do?
* Identifies and masks out repeat elements
* Aligns ESTs to the genome
* Aligns proteins to the genome
* Produces ab initio gene predictions
* Synthesizes these data into final annotations
* Produces evidence-based quality values for downstream annotation management

MAKER Steps involved
1. Compute phase
    – RepeatMasker
    – BLAST
    – Exonerate
    – SNAP (and other gene predictors)
2. Filter/cluster phase
    – Identify/remove marginal predictions and alignments based on quality scores/cutoffs, etc.
    – Cluster to identify overlapping alignments/predictions, to remove redundancy and assess weight of evidence
3. Polish
    – Realigns BLAST hits to obtain greater precision at exon boundaries (Exonerate)
4. Synthesis
    – Collect evidence for each annotation, using EST evidence
    – Evidence scores plus sequences (genomic, EST, coding, intron) are passed to SNAP
    – SNAP then uses this evidence to retrain and alter its internal HMM
5. Annotate
    – Post-processing of SNAP prediction, recombined with evidence to generate complete annotations
    – Output is a GFF3 annotation that can be imported into genome browsers
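The filter/cluster phase can be pictured with a small sketch: drop hits below a score cutoff, then group the survivors into clusters of overlapping coordinates. This is not MAKER's actual code. The `Hit` record, the `filter_and_cluster` function and the cutoff are hypothetical, and all hits are assumed to lie on the same sequence and strand with inclusive coordinates.

```python
from collections import namedtuple

# Hypothetical record for one alignment or prediction on a single sequence.
Hit = namedtuple("Hit", "source start end score")

def filter_and_cluster(hits, min_score):
    """Discard low-scoring hits, then group overlapping hits into clusters."""
    kept = sorted(
        (h for h in hits if h.score >= min_score),
        key=lambda h: (h.start, h.end),
    )
    clusters = []
    cluster_end = None
    for h in kept:
        if clusters and h.start <= cluster_end:
            # Overlaps the current cluster: add it and extend the cluster.
            clusters[-1].append(h)
            cluster_end = max(cluster_end, h.end)
        else:
            # No overlap: start a new cluster.
            clusters.append([h])
            cluster_end = h.end
    return clusters

hits = [
    Hit("blastx", 100, 200, 50),
    Hit("est", 150, 300, 80),
    Hit("snap", 500, 600, 90),
    Hit("blastx", 120, 130, 5),   # marginal hit, removed by the cutoff
]
clusters = filter_and_cluster(hits, min_score=10)
```

Here the first cluster holds two independent sources supporting the same locus, which is the "weight of evidence" idea on the slide; the third hit stands alone in its own cluster.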

Inputs to MAKER
– Genomic sequence
– Config files
    – External executables
    – Sequence database locations
    – Compute parameters
– Sequence database files (choice of these turns out to be extremely important)
    – Transposons file (default plus known organism-specific)
    – RepeatMasker database file (organism-specific, optional)
    – Proteins file (known proteins from related organisms you want to align to the genome)
    – ESTs/mRNAs file (the evidence)
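Since the choice of input files matters so much, it can help to confirm they are all present before starting a long run. The sketch below is a generic pre-flight check and does not use MAKER itself; the function `missing_inputs`, the dictionary keys and the file names are all hypothetical placeholders.

```python
from pathlib import Path

def missing_inputs(inputs):
    """Return the names of inputs whose files do not exist.

    Optional inputs that were not supplied are given as None and skipped.
    """
    return [
        name
        for name, path in inputs.items()
        if path is not None and not Path(path).is_file()
    ]

# Hypothetical file names, for illustration only.
inputs = {
    "genome": "genome.fasta",
    "transposons": "transposon_proteins.fasta",
    "repeat_library": None,            # organism-specific, optional
    "proteins": "related_proteins.fasta",
    "ests": "ests.fasta",
}
problems = missing_inputs(inputs)
if problems:
    print("Missing input files:", ", ".join(problems))
```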

MAKER Output (Apollo browser)
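The GFF3 output from the Annotate step is a tab-separated text format with nine columns, which is what lets genome browsers load it. A minimal reader for one line might look like the sketch below. The function name `parse_gff3_line` and the sample record are invented for illustration, and details such as URL-escaped characters and multi-valued attributes are not handled.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),   # 1-based, inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }

# Made-up example record, not real MAKER output.
sample = "scaffold_1\tmaker\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=example_gene"
feature = parse_gff3_line(sample)
```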