Comparative ab initio prediction of gene structures using pair HMMs


Comparative ab initio prediction of gene structures using pair HMMs
Irmtraud M. Meyer and Richard Durbin
Bioinformatics Vol. 18 no. 10 (2002) pp. 1309-1318

Introduction
This paper describes Doublescan, a pair HMM approach to gene structure prediction that makes simultaneous structure predictions for two homologous nucleotide sequences. It also describes a new HMM-traversal algorithm, the Stepping Stone algorithm, as a lower-cost alternative to the well-known Viterbi algorithm.
Notes: medical applications of gene structure prediction; the difficulty of finding exons in a lab; working with nucleotide sequences.

Background
Gene structure prediction is an important problem. Most attempts to date have focused on predictions for a single nucleotide sequence. Little has been done with comparative gene structure prediction, because the necessary data has only recently become available (Human Genome Project, etc.).
Notes: Medical applications of predicting genes and experimenting with them. Genes are difficult to work with in the lab, so anything that helps narrow down the relevant region is useful. Relevance to proteomics: with genes in hand, we get many potential protein sequences to study. Some work HAS been done on comparative gene prediction, just not as much as on single-sequence prediction.

What are Hidden Markov Models?
Hidden Markov Models (HMMs) are a statistical modeling tool, similar to a finite state machine. Each state produces output, which follows a defined probability distribution. The transitions from state to state are also defined by probability distributions.
Several gene prediction algorithms (e.g. Genie) use HMMs. It is easy to make an association between the output of HMM states and a desired nucleotide sequence.

Example HMM
Each state has transition probabilities p_i and (1 - p_i). There are also emission probabilities for each state (not shown), reflecting the nucleotide distribution in that region.
[Figure: state diagram with Start, UTR, Gene, UTR, and Stop states; edges labeled p1, p2, and 1-p2.]
Notes: Give an example of very simple emission probabilities, using unrealistic distributions such as 90% A, 10% T for a Gene and 50% C, 50% G for a UTR.
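The toy model on this slide can be sketched in a few lines of Python. This is only an illustration, not anything taken from the paper: the two-state topology, the reading of p_i as the probability of staying in state i, the values 0.9 and 0.8, and the names STAY, OTHER, EMIT, and sample are all assumptions. The emission tables are the deliberately unrealistic ones from the notes.

```python
import random

# Hypothetical two-state toy HMM; every number here is an illustrative assumption.
STAY = {"UTR": 0.9, "Gene": 0.8}         # p_i: probability of staying in state i
OTHER = {"UTR": "Gene", "Gene": "UTR"}   # with probability 1 - p_i, switch state
EMIT = {
    "Gene": {"A": 0.9, "T": 0.1},        # unrealistic on purpose (see the notes)
    "UTR": {"C": 0.5, "G": 0.5},
}

def sample(length, start="UTR", seed=0):
    """Generate (states, sequence) of the given length from the toy HMM."""
    rng = random.Random(seed)
    state, states, seq = start, [], []
    for _ in range(length):
        symbols = list(EMIT[state])
        weights = [EMIT[state][s] for s in symbols]
        seq.append(rng.choices(symbols, weights=weights)[0])  # emit one nucleotide
        states.append(state)
        if rng.random() > STAY[state]:                        # leave with prob. 1 - p_i
            state = OTHER[state]
    return states, "".join(seq)

states, seq = sample(50)
```

Because the two emission tables share no symbols, the hidden state can be read directly off the sequence here. With realistic, overlapping distributions it cannot, which is why a decoding algorithm such as Viterbi is needed.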

HMM Gene Prediction from a Single Nucleotide Sequence
Genscan [Burge 1997]
Genie [Kulp 1996]
HMMgene [Krogh 1997]
A combination of Genscan and HMMgene [Rogic 2002]
Notes: HMMgene was written by Anders Krogh (Center for Biological Sequence Analysis, Technical University of Denmark). Genie was co-authored by David Haussler, among others. The combination of Genscan and HMMgene was co-authored by Sanja Rogic.

Why Use Multiple Sequences?
Around 20 years of work has gone into single-sequence prediction. Some methods involve purely statistical models; others also attempt to use known protein data. Comparative genomics represents a potential source of new information for prediction systems.
Notes: What were the successes and limitations here? Why did people think that comparative genomics would improve gene prediction?

HMM Gene Prediction involving Multiple Nucleotide Sequences
Twinscan (an extension of Genscan) [Korf 2001]
Rosetta [Batzoglou 2000]
Doublescan innovates in having an actual comparative HMM, making simultaneous predictions for each sequence, and retrieving conserved subsequences.
Notes: Twinscan is an improvement of Genscan that uses homology data from multiple sequences to adjust the Genscan probabilities. Rosetta involves human-mouse comparison: it uses an iterative global alignment system and then identifies genes based on conservation of exonic features at aligned positions. Rosetta only finds coding exons in the mouse and human sequences. The Rosetta gene recognition program should not be confused with the ab initio protein-structure predictor of the same name!

Structure of Doublescan
The basic structure of Doublescan is an HMM. Each state emits a codon. There are separate sections for introns, exons, and intergenic regions. Within each section, multiple states express the complexity of that region (matches, insertions in each sequence, transition states).

The Doublescan HMM
Layout of the state diagram: intergenic states are at the top left and exon states at the bottom left. Every section includes match, emit-x, and emit-y states, where x and y are the two sequences. Introns are on the right (with 5’ and 3’ splice sites). Start and stop refer to genes; the begin and end states are at the far top and connect to everything.
The top-right introns (in the box) model UTR (untranslated region) splicing. They improve the performance of the model, with positive effects on sensitivity and specificity, particularly for wrong genes (pages 1313-1314).

Model Weights
The emission probabilities were derived from matches between the sequences (specifically, from equal-length orthologous genes in the test set). The transition probabilities were estimated and then tuned by hand for optimum performance on the training set.
Notes: The authors note (page 1312) that they put the fewest constraints on the intergenic part of the model: “we do not attempt to model these features … as the ability to predict them is poor. If these functional elements are conserved, Doublescan should retrieve them as conserved … sequences, and they can be further investigated”. They used a Dirichlet distribution for the posterior mean estimate of the emission probabilities. What prior did they use to derive this posterior? Unfortunately, there is no additional information about the transition probabilities.

The Scoring Algorithm
The Viterbi algorithm is a well-known way of computing the best-scoring path through an HMM for given sequences. Its time and space requirements were too large for Doublescan. As a replacement, the authors wrote the Stepping Stone algorithm, a variant of the Viterbi algorithm that uses alignments to simplify the search.
Notes: Why was plain Viterbi not good enough? For a single sequence it is linear in the sequence length, but for a pair HMM it fills a table over every pair of positions in the two sequences, so time and memory grow with the product of the two lengths. On top of that come the large constants involved in evaluating every possible transition in a very complicated model. Stepping Stone avoids this by having portions of the path that it “knows” from the alignments, which (depending on how frequent the alignments are) reduces the combinatorial complexity of the search.

The Problem with Viterbi
The Viterbi algorithm computes the best path through an HMM by scoring, in parallel, each path through the HMM and keeping track of the best ones. This can be memory-intensive when it is time to reconstruct the path. There is also a time cost: the model is so complex that computing the set of next-state probabilities may involve very large constants (looking at all transition probabilities).
Notes: The paper mentions the Hirschberg algorithm (page 1313) as using less memory for path reconstruction than Viterbi. That could be applied here, but Stepping Stone is still faster (a factor of 4 was recorded in the paper). The resulting paths also agreed in the vast majority of cases (page 1315).
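To make the memory cost concrete, here is a minimal single-sequence Viterbi in log space. It is a textbook sketch, not the Doublescan implementation; the function name viterbi and the toy model (STATES, START, TRANS, EMIT, with made-up probabilities) are assumptions for illustration. Note the back table: it stores one pointer per state per position. In a pair HMM the same table is indexed by a pair of positions, one in each sequence, which is where the memory pressure comes from.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_log_score, best_state_path) for a non-empty observation string."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[t][s]: best log-probability of any path ending in state s at position t
    score = [{s: log(start_p[s]) + log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + log(trans_p[r][s]))
            score[t][s] = (score[t - 1][prev] + log(trans_p[prev][s])
                           + log(emit_p[s].get(obs[t], 0.0)))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path

# Hypothetical toy model with illustrative numbers only.
STATES = ["UTR", "Gene"]
START = {"UTR": 0.5, "Gene": 0.5}
TRANS = {"UTR": {"UTR": 0.9, "Gene": 0.1}, "Gene": {"Gene": 0.8, "UTR": 0.2}}
EMIT = {"Gene": {"A": 0.9, "T": 0.1}, "UTR": {"C": 0.5, "G": 0.5}}

best_score, best_path = viterbi("CGAATC", STATES, START, TRANS, EMIT)
```

For this input the decoded path labels the flanking C/G positions as UTR and the A/T run as Gene, since the toy emission tables do not overlap.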

The Stepping Stone Algorithm Part 1
The two sequences are aligned with BLASTN. We do the following to each locally aligned portion, in order of score:
1. Find the midpoint of the alignment.
2. Attempt to add the midpoint to a list of constraints.
At the end of this loop, we have a set of nucleotide pairing constraints for the path through the HMM. The description is on page 1313.
Notes: The assumption that the BLASTN alignments are correct will obviously fail in some cases. The key is that the authors do not care about perfection: they do not want the best path through the HMM, just a very good one (“… a method with which a NEARLY optimal state path can be derived …”).
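The loop in Part 1 might look roughly like the sketch below. The slide does not say what makes an attempt to add a midpoint fail, so the compatibility test used here (a new point must keep both coordinates strictly increasing along the list) is an assumption, as are the tuple layout of an alignment and the name select_constraints. Taking the midpoint of the coordinate ranges is also a simplification for alignments that contain gaps.

```python
from bisect import bisect_left

def select_constraints(alignments):
    """alignments: iterable of (score, x_start, x_end, y_start, y_end) local hits.

    Returns the accepted (x, y) midpoints, sorted by x, such that both
    coordinates strictly increase from one constraint to the next.
    """
    constraints = []  # kept sorted; strictly increasing in x and in y
    for score, xs, xe, ys, ye in sorted(alignments, reverse=True):  # best score first
        point = ((xs + xe) // 2, (ys + ye) // 2)
        i = bisect_left(constraints, point)
        # Only the two neighbours need checking because the list is monotone.
        left_ok = i == 0 or (constraints[i - 1][0] < point[0]
                             and constraints[i - 1][1] < point[1])
        right_ok = i == len(constraints) or (point[0] < constraints[i][0]
                                             and point[1] < constraints[i][1])
        if left_ok and right_ok:
            constraints.insert(i, point)
    return constraints

# Hypothetical hits: three collinear ones and one off-diagonal outlier.
hits = [(100, 0, 10, 0, 10), (90, 40, 60, 40, 60),
        (80, 20, 30, 20, 30), (50, 22, 32, 70, 80)]
stones = select_constraints(hits)
```

In this made-up input the lowest-scoring hit is rejected because its midpoint would force the path to go backwards in y, matching the observation that some outlying alignments cannot be included.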

A Visual Example
[Figure: diagonal lines represent local alignments; the hashed portion is the area defined by the constraints.]
Only some of the local alignments were used. Some outliers (presumably worse-scoring) could not be included.

The Stepping Stone Algorithm Part 2
For each pair of adjacent constraints, find the best-scoring path between them with a variant of the Viterbi algorithm. Then reconstruct the complete path through the HMM and find its score. The exact variant of the Viterbi algorithm is described on page 1313.
Notes: Note the extreme dependence on the alignment here. If the alignment is poor or misplaced, the algorithm can go wrong.
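A rough way to see what Part 2 saves: if the search between two adjacent constraints is confined to the rectangle they span (an assumption about the search region, which the slide does not spell out), the number of matrix cells visited can be compared with the full pair-HMM matrix. The function names and the example coordinates below are hypothetical.

```python
def search_rectangles(constraints, len_x, len_y):
    """Rectangles ((x0, y0), (x1, y1)) between consecutive stepping stones.

    The matrix corners (0, 0) and (len_x, len_y) are added as the first and
    last stones so that the rectangles cover the path from begin to end.
    """
    stones = [(0, 0)] + sorted(constraints) + [(len_x, len_y)]
    return list(zip(stones, stones[1:]))

def cell_count(rectangles):
    """Total area of the rectangles, i.e. cells a restricted Viterbi would fill."""
    return sum((x1 - x0) * (y1 - y0) for (x0, y0), (x1, y1) in rectangles)

rects = search_rectangles([(5, 5), (25, 25), (50, 50)], 100, 100)
restricted = cell_count(rects)   # cells inside the stepping-stone rectangles
full = 100 * 100                 # cells in the unrestricted matrix
```

With k evenly spaced constraints on the diagonal of an L by L matrix, the restricted area is L*L/(k+1), so the saving grows with the number of usable alignments. The same picture shows the risk raised in the notes: the true path is lost if a constraint is placed wrongly.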

Other Optimization of Doublescan
“Doublescan including UTR-splicing still has a 14% rate or wrong genes corresponding to 30 genes which are predicted in addition to those that overlap the annotated gene in each DNA sequence” [Meyer 2002, pp. 1313-1314].
The authors were able to raise the specificity of Doublescan substantially by removing “all predicted genes with introns of less than or equal to 50 base pairs length and or a total coding length of less than or equal to 120 base pairs length” [Meyer 2002, p. 1317].
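The post-filter quoted above is simple enough to sketch. The gene representation (a sorted list of half-open exon coordinates), the names INTRON_CUTOFF, CODING_CUTOFF, and keep_gene, and the reading of "and or" as "and/or" are assumptions; only the two thresholds come from the quote.

```python
INTRON_CUTOFF = 50    # genes with any intron of <= 50 bp are removed
CODING_CUTOFF = 120   # genes with total coding length of <= 120 bp are removed

def keep_gene(exons):
    """exons: sorted list of (start, end) half-open coordinates of coding exons."""
    coding = sum(end - start for start, end in exons)
    # An intron is the gap between the end of one exon and the start of the next.
    introns = [nxt[0] - cur[1] for cur, nxt in zip(exons, exons[1:])]
    return coding > CODING_CUTOFF and all(length > INTRON_CUTOFF for length in introns)

# Hypothetical predictions.
predictions = [
    [(0, 100), (200, 300)],   # 200 bp coding, 100 bp intron: kept
    [(0, 100), (140, 300)],   # 40 bp intron: removed
    [(0, 120)],               # 120 bp coding: removed
]
filtered = [g for g in predictions if keep_gene(g)]
```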

Results and Comparison with Genscan - Key Figures

                           Doublescan   Genscan
Gene         Sensitivity      0.57        0.47
             Specificity      0.50        0.46
Start Codon  Sensitivity      0.75        0.73
             Specificity      0.78        0.91
Stop Codon   Sensitivity      0.89        0.88
             Specificity      0.86        0.97

These data reflect a test set of mouse-human orthologs (page 1316).
Notes: Why didn’t they compare with Twinscan? Twinscan is a better version of Genscan, involving mouse homology data.
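For reading the table, it may help to recall how these two measures are commonly defined in the gene-finding literature, where "specificity" usually means the fraction of predictions that are correct (what other fields call precision). Whether the paper uses exactly this convention is an assumption here, and the counts in the example are made up.

```python
def sensitivity(tp, fn):
    """Fraction of annotated features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predicted features that are annotated: TP / (TP + FP).

    This is the gene-finding convention (precision elsewhere), assumed here.
    """
    return tp / (tp + fp)

# Hypothetical counts: 100 annotated genes, 114 predicted, 57 predicted correctly.
sn = sensitivity(tp=57, fn=43)
sp = specificity(tp=57, fp=57)
```

Under this reading, a method can gain sensitivity while losing specificity if it predicts more features overall, which is one way to read the start and stop codon rows.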

Results and Comparison with Genscan - Analysis
Overall, Doublescan gets better results at low resolution (whole genes), but its lack of model detail restricts its accuracy at more specific levels, where it loses specificity on start and stop codons.
Genscan has explicit states for promoters, the 5’ and 3’ UTR regions, and the final poly-A signal.
Doublescan has fewer distinctions: UTR, intron, or exon.

Conclusion
Doublescan provides a contrast with Genscan for gene structure detection: low-resolution versus high-resolution accuracy.
Interestingly, Genscan and Doublescan tended to fail on different genes (page 1315, left side, mid-page). It is not clear why they were complementary in this way.

Future Work Areas
Twinscan [Korf 2001] is a homology-based improvement on Genscan. A comparison of Twinscan and Doublescan could yield more useful information.
Scoring algorithms that find something other than the single best path (e.g. the forward-backward algorithm).
Exploration of how the alignment affects prediction quality.
Notes: How does Doublescan perform when the input sequences are identical and there is some (a total?) alignment between them? This might give a fairer comparison with Genscan in terms of the amount of information supplied. What if one of the sequences is random noise? In theory, less of an alignment would actually result in a better gene prediction, albeit with a longer running time.

References
Meyer, I. M., R. Durbin (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics Vol. 18 no. 10 pp 1309-1318
Korf, I., et al. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics Vol. 17 Suppl. 1 pp S140-S148
Batzoglou, S., et al. (2000) Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction. Genome Research Vol. 10 Is. 7 pp 950-958
Burge, C., S. Karlin (1997) Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268 pp 78-94
Krogh, A. (1997) Two methods for improving performance of an HMM and their application for gene finding. Proc. Fifth Int. Conf. Intelligent Systems for Molecular Biology, Eds T. Gaasterland et al. pp 179-186
Kulp, D., et al. (1996) A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA. ISMB-96 pp 134-141
Rogic, S., et al. (2002) Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics Vol. 18 no. 8 pp 1034-1045