Comparative ab initio prediction of gene structures using pair HMMs


1 Comparative ab initio prediction of gene structures using pair HMMs
Irmtraud M. Meyer and Richard Durbin Bioinformatics Vol. 18 no. 10 (2002) pp

2 Introduction
This paper describes Doublescan, a pair HMM approach to gene structure prediction. Doublescan predicts gene structures simultaneously for two homologous nucleotide sequences. The paper also describes a new HMM-traversal algorithm, the stepping stone algorithm, as a lower-cost alternative to the well-known Viterbi algorithm.
Points to cover: medical applications of gene structure prediction, the difficulty of finding exons in a lab, and working with nucleotide sequences.

3 Background
Gene structure prediction is an important problem. Most attempts to date have focused on predictions from a single nucleotide sequence. Less has been done with comparative gene structure prediction, because the necessary data has only recently become available (Human Genome Project, etc.).
Predicted genes have medical applications, and genes are difficult to work with in the lab, so anything that helps narrow down the relevant region is useful. The work is also relevant to proteomics: from genes we can derive many candidate protein sequences to study. Some comparative gene prediction work has been done, just not as much as single-sequence prediction.

4 What are Hidden Markov Models?
Hidden Markov Models (HMMs) are a statistical modeling tool, similar to a finite state machine. Each state produces output that follows a defined probability distribution. The transitions from state to state are also governed by probability distributions. Several gene prediction programs (e.g. Genie) use HMMs. The output of the HMM states maps naturally onto a nucleotide sequence: each state emits nucleotides, so a path through the states labels the sequence.

5 Example HMM
Each state has transition probabilities p_i and (1 - p_i). There are also emission probabilities for each state (not shown), reflecting the nucleotide distribution in that region.
[Figure: state diagram with node labels Start, Gene, UTR, UTR, Stop and transition labels p1, p2, 1-p2]
Give an example of very simple emission probabilities, using unrealistic distributions such as 90% A, 10% T for a Gene and 50% C, 50% G for a UTR.
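The toy emission probabilities above can be turned into a small worked example. The following is only a sketch: the two-state layout and the values chosen for p1 and p2 are assumptions made for illustration (the slide does not give them), and `joint_log_prob` is a hypothetical helper name.

```python
import math

# Deliberately unrealistic emission probabilities taken from the notes.
EMIT = {
    "Gene": {"A": 0.9, "T": 0.1},
    "UTR": {"C": 0.5, "G": 0.5},
}

# ASSUMPTION: p1 and p2 are not given on the slide; these values are made up.
P1, P2 = 0.8, 0.7
TRANS = {
    "Gene": {"Gene": P1, "UTR": 1 - P1},
    "UTR": {"UTR": P2, "Gene": 1 - P2},
}


def joint_log_prob(states, seq):
    """Log probability of emitting `seq` while following the state path `states`.

    The first state is taken as given (no start probability is modeled).
    Returns -inf if a state cannot emit the observed nucleotide.
    """
    if len(states) != len(seq) or not states:
        raise ValueError("states and seq must be non-empty and of equal length")
    logp = 0.0
    prev = None
    for state, base in zip(states, seq):
        if prev is not None:
            logp += math.log(TRANS[prev][state])
        p_emit = EMIT[state].get(base, 0.0)
        if p_emit == 0.0:
            return float("-inf")
        logp += math.log(p_emit)
        prev = state
    return logp


# Two Gene positions emitting A, then one UTR position emitting C.
example = joint_log_prob(["Gene", "Gene", "UTR"], "AAC")
```

Here `math.exp(example)` is the product 0.9 × 0.8 × 0.9 × 0.2 × 0.5. A path that asks the Gene state to emit C has probability zero under these toy numbers, which is what makes the state labels recoverable from the sequence.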

6 HMM Gene Prediction from a Single Nucleotide Sequence
Genscan [Burge 1997]
Genie [Kulp 1996]
HMMgene [Krogh 1997]
A combination of Genscan and HMMgene [Rogic 2002]
HMMgene was written by Anders Krogh (Center for Biological Sequence Analysis, Technical University of Denmark). Genie was co-authored by David Haussler, among others. The combination of Genscan and HMMgene was co-authored by Sanja Rogic.

7 Why Use Multiple Sequences?
Around 20 years of work has gone into single-sequence prediction. Some methods use purely statistical models; others also attempt to use known protein data. Comparative genomics represents a potential source of new information for prediction systems.
What were the successes and limitations here? Why did people think that comparative genomics would improve gene prediction?

8 HMM Gene Prediction involving Multiple Nucleotide Sequences
Twinscan (an extension of Genscan) [Korf 2001]
Rosetta [Batzoglou 2000]
Doublescan innovates in having an actual comparative HMM, making simultaneous predictions for each sequence, and retrieving conserved subsequences.
Twinscan improves on Genscan by using homology data from multiple sequences to adjust Genscan's probabilities. Rosetta performs a human-mouse comparison: it uses an iterative global alignment system and then identifies genes based on the conservation of exonic features at aligned positions. Rosetta only finds coding exons in the mouse and human sequences. The Rosetta gene recognition program should not be confused with the ab initio protein-structure predictor of the same name!

9 Structure of Doublescan
The basic structure of Doublescan is an HMM. Each state emits a codon. There are separate sections for introns, exons, and intergenic regions. Each section contains multiple states to express its internal complexity (matches, insertions in each sequence, transition states).

10 The Doublescan HMM
Intergenic states are at the top left and exon states at the bottom left. Every section includes match, emit x, and emit y states, where x and y are the two sequences. Introns are on the right (with 5’ and 3’ splice sites). Start and stop refer to genes; the begin and end states are at the far top and connect to everything. The top-right introns (in the box) are for UTR (untranslated region) splicing. They improve the performance of the model, with positive effects on sensitivity and specificity, particularly for wrong genes (pages 1313, 1314).

11 Model Weights
The emission probabilities were derived from matches between the sequences (specifically, from equal-length orthologous genes in the test set). The transition probabilities were estimated and then tuned by hand for optimum performance with the training set.
The authors note (page 1312) that they put the fewest constraints on the intergenic part of the model: “we do not attempt to model these features … as the ability to predict them is poor. If these functional elements are conserved, Doublescan should retrieve them as conserved … sequences, and they can be further investigated”.
They used a Dirichlet distribution for the posterior mean estimate of the emission probabilities. What prior did they use to derive this posterior? Unfortunately, there is no additional information about the transition probabilities.

12 The Scoring Algorithm
The Viterbi algorithm is a well-known way of taking sequences and computing the best-scoring path through an HMM. The time and space requirements of the Viterbi algorithm were too large for Doublescan. As a replacement, the authors wrote the Stepping Stone algorithm, a variant of the Viterbi algorithm that uses alignments to simplify the search.
Why was Viterbi too expensive? The single-sequence implementation is linear in sequence length, but for a pair HMM both time and memory grow with the product of the two sequence lengths. On top of that, a very complicated model brings large constant factors, because every possible transition has to be evaluated. Stepping Stone avoids this by having portions of the path that it “knows”, which (depending on how densely the alignments cover the sequences) reduces the combinatorial complexity of the search.

13 The Problem with Viterbi
The Viterbi algorithm computes the best path through an HMM by scoring, in parallel, each path through the HMM and keeping track of the best ones. This can be memory-intensive when it is time to reconstruct the path! Time is also a potential problem: the model is so complex that computing the set of next-state probabilities may involve very large constants (every transition probability has to be examined).
The paper mentions the Hirschberg algorithm on page 1313 as using less memory for path reconstruction than Viterbi. That could be applied here, but Stepping Stone is still faster (a factor of 4 was recorded in the paper). The Stepping Stone results also agreed with the vast majority of paths (page 1315).
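To make the memory argument concrete, here is a minimal log-space Viterbi sketch for an ordinary single-sequence HMM. It is not the Doublescan model: the two states and all probabilities are invented for illustration. The point to notice is the `back` table, which stores one pointer per state per sequence position. In a pair HMM the same table needs one entry per state per pair of positions (i, j), which is where the cost grows with the product of the two sequence lengths.

```python
import math


def viterbi(seq, states, log_init, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for a single-sequence HMM."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    # score[i][s]: best log score of any path ending in state s at position i.
    score = [{s: log_init[s] + log_emit[s][seq[0]] for s in states}]
    # back[i][s]: predecessor of s on that best path (the traceback table).
    back = [{}]
    for i in range(1, len(seq)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[i - 1][r] + log_trans[r][s])
            score[i][s] = score[i - 1][prev] + log_trans[prev][s] + log_emit[s][seq[i]]
            back[i][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return score[-1][last], path


# ASSUMPTION: toy model with invented numbers (AT-rich vs GC-rich regions).
STATES = ("noncoding", "coding")
LOG_INIT = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {r: {s: math.log(0.9 if r == s else 0.1) for s in STATES} for r in STATES}
LOG_EMIT = {
    "noncoding": {b: math.log(p) for b, p in zip("ACGT", (0.4, 0.1, 0.1, 0.4))},
    "coding": {b: math.log(p) for b, p in zip("ACGT", (0.1, 0.4, 0.4, 0.1))},
}

best_score, best_path = viterbi("ATATGCGCGC", STATES, LOG_INIT, LOG_TRANS, LOG_EMIT)
```

With these toy numbers the best path labels the first four positions noncoding and the remaining six coding, switching state exactly once.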

14 The Stepping Stone Algorithm Part 1
The two sequences are aligned with BLASTN. We do the following to each locally aligned portion, in order of score:
Find the midpoint of the alignment
Attempt to add the midpoint to a list of constraints
At the end of this loop, we have a set of nucleotide pairing constraints for the path through the HMM.
Description page
NOTE: The assumption of correctness for the BLASTN alignments will obviously fail in some cases. The key is that the authors don’t care about perfection: they don’t want the best path through the HMM, just a very good one (“… a method with which a NEARLY optimal state path can be derived …”).
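The loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slide does not say when an "attempt" to add a midpoint fails, so the sketch assumes a midpoint is rejected when it is not collinear with the constraints already accepted (that is, not strictly before or strictly after each of them in both sequences). `LocalAlignment`, `midpoint`, and `select_constraints` are hypothetical names, and the coordinates are made up rather than parsed from BLASTN output.

```python
from typing import List, NamedTuple, Tuple


class LocalAlignment(NamedTuple):
    x_start: int  # start of the aligned region in sequence x
    x_end: int    # end of the aligned region in sequence x
    y_start: int  # start of the aligned region in sequence y
    y_end: int    # end of the aligned region in sequence y
    score: float


def midpoint(a: LocalAlignment) -> Tuple[int, int]:
    """Approximate midpoint of a local alignment as an (x, y) position pair."""
    return ((a.x_start + a.x_end) // 2, (a.y_start + a.y_end) // 2)


def is_compatible(point: Tuple[int, int], accepted: List[Tuple[int, int]]) -> bool:
    """ASSUMED criterion: the point must keep all constraints collinear."""
    x, y = point
    return all((x < cx and y < cy) or (x > cx and y > cy) for cx, cy in accepted)


def select_constraints(alignments: List[LocalAlignment]) -> List[Tuple[int, int]]:
    """Greedily accept midpoints, visiting alignments from best to worst score."""
    accepted: List[Tuple[int, int]] = []
    for a in sorted(alignments, key=lambda a: a.score, reverse=True):
        p = midpoint(a)
        if is_compatible(p, accepted):
            accepted.append(p)
    return sorted(accepted)


hits = [
    LocalAlignment(0, 100, 0, 100, 200.0),
    LocalAlignment(200, 300, 220, 320, 150.0),
    LocalAlignment(120, 180, 400, 460, 90.0),   # crosses the second hit
    LocalAlignment(400, 500, 380, 480, 50.0),
]
constraints = select_constraints(hits)
```

In this made-up input the third hit is dropped because its midpoint lies before the second hit in x but after it in y. Visiting alignments in score order means that when two hits conflict, the better-scoring one wins.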

15 A Visual Example
Diagonal lines represent local alignments. The hatched portion is the area defined by the constraints.
Only some of the local alignments were used! Some outliers (presumably worse-scoring) could not be included.

16 The Stepping Stone Algorithm Part 2
For each pair of adjacent constraints, find the best-scoring path between them with a variant of the Viterbi algorithm.
Reconstruct the complete path through the HMM, and find its score.
The exact variant of the Viterbi algorithm is described on page 1313.
Note the extreme dependence on the alignment here! If the alignment is poor or misplaced, the constraints force the path through the wrong positions and the prediction can go wrong.
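A rough way to see why solving between adjacent constraints is cheaper is to count dynamic-programming cells. The sketch below assumes that each subproblem is the rectangle spanned by two consecutive constraints, with the sequence starts and ends acting as implicit first and last constraints. That rectangle shape is an assumption for illustration, the function names are hypothetical, and the count ignores boundary rows and the per-cell cost of the state space.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def subproblem_boxes(constraints: List[Point], len_x: int, len_y: int):
    """Rectangles (x0, y0, x1, y1) between consecutive constraints.

    (0, 0) and (len_x, len_y) are added as implicit end points.
    """
    points = [(0, 0)] + sorted(constraints) + [(len_x, len_y)]
    return [(x0, y0, x1, y1) for (x0, y0), (x1, y1) in zip(points, points[1:])]


def cell_counts(constraints: List[Point], len_x: int, len_y: int) -> Tuple[int, int]:
    """Approximate DP cells visited with constraints versus the full matrix."""
    boxes = subproblem_boxes(constraints, len_x, len_y)
    constrained = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return constrained, len_x * len_y


# Made-up constraints on two sequences of length 500.
constrained_cells, full_cells = cell_counts([(50, 50), (250, 270), (450, 430)], 500, 500)
```

For these invented numbers the constrained search touches 82,000 cells against 250,000 for the full matrix. The same arithmetic shows the dependence on the alignment: with no constraints the two counts are equal, and a misplaced constraint removes the true path from the search area entirely.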

17 Other Optimization of Doublescan
“Doublescan including UTR-splicing still has a 14% rate of wrong genes corresponding to 30 genes which are predicted in addition to those that overlap the annotated gene in each DNA sequence” [Meyer 2002, pp ].
The authors were able to raise the specificity of Doublescan substantially by removing “all predicted genes with introns of less than or equal to 50 base pairs length and/or a total coding length of less than or equal to 120 base pairs length” [Meyer 2002, pp 1317].
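The filter quoted above is simple enough to sketch. The thresholds (50 bp and 120 bp) come from the quote; everything else is an assumption for illustration: the `PredictedGene` record, the half-open exon coordinates, and the reading that a single short intron is enough to discard a gene.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PredictedGene:
    # Coding exons as half-open [start, end) coordinates, sorted by position.
    exons: List[Tuple[int, int]]


def intron_lengths(gene: PredictedGene) -> List[int]:
    """Gaps between consecutive exons."""
    return [nxt[0] - cur[1] for cur, nxt in zip(gene.exons, gene.exons[1:])]


def coding_length(gene: PredictedGene) -> int:
    """Total number of coding base pairs."""
    return sum(end - start for start, end in gene.exons)


def keep_gene(gene: PredictedGene, max_short_intron: int = 50,
              max_short_coding: int = 120) -> bool:
    """False if any intron is <= 50 bp or the coding length is <= 120 bp."""
    if any(length <= max_short_intron for length in intron_lengths(gene)):
        return False
    return coding_length(gene) > max_short_coding


predictions = [
    PredictedGene([(0, 100), (130, 300)]),   # 30 bp intron: removed
    PredictedGene([(0, 60), (200, 250)]),    # 110 bp coding: removed
    PredictedGene([(0, 100), (200, 300)]),   # 100 bp intron, 200 bp coding: kept
]
filtered = [g for g in predictions if keep_gene(g)]
```

A post-filter like this raises specificity only by discarding predictions, so it cannot recover genes that were missed.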

18 Results and Comparison with Genscan – Key Figures
[Table: Doublescan vs. Genscan; rows: Gene, Start Codon, Stop Codon; each with Sensitivity and Specificity]
These data reflect a test set of mouse-human orthologs (page 1316).
Why didn’t they compare with Twinscan? Twinscan is a better version of Genscan, involving mouse homology data.
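For reference, here is how sensitivity and specificity are commonly computed for features such as start codons. This sketch assumes the convention usual in the gene-finding literature, where specificity means the fraction of predictions that are correct, TP / (TP + FP). Whether the paper uses exactly this definition is an assumption, and the positions below are invented.

```python
from typing import Iterable, Tuple


def sensitivity_specificity(predicted: Iterable[int],
                            annotated: Iterable[int]) -> Tuple[float, float]:
    """Sensitivity = TP / annotated, specificity = TP / predicted.

    ASSUMPTION: "specificity" follows the gene-finding convention
    TP / (TP + FP), not the TN-based definition used in statistics.
    """
    predicted_set, annotated_set = set(predicted), set(annotated)
    true_positives = len(predicted_set & annotated_set)
    sensitivity = true_positives / len(annotated_set) if annotated_set else 0.0
    specificity = true_positives / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Invented start-codon positions: two of four annotated sites are found,
# and one of the three predictions is wrong.
sn, sp = sensitivity_specificity(predicted=[100, 500, 700],
                                 annotated=[100, 500, 900, 1300])
```

In this example sensitivity is 2/4 and specificity is 2/3.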

19 Results and Comparison with Genscan - Analysis
Overall, Doublescan gets better results at low resolution, but the lack of model detail restricts its accuracy at more specific levels.
Loss of codon specificity:
Genscan had explicit states for promoters, the 5’ and 3’ UTR regions, and the final poly-A signal
Doublescan has fewer distinctions: UTR, intron, or exon

20 Conclusion
Doublescan provides a contrast with Genscan for gene structure detection: low-resolution versus high-resolution accuracy. Interestingly, Genscan and Doublescan had a tendency to fail on different genes. It is not clear why they were complementary in this fashion.
The information about failing on different genes is on page 1315 (left side, mid-page).

21 Future Work Areas
Twinscan [Korf 2001] is a comparative improvement on Genscan. A comparison of Twinscan and Doublescan could contain more useful information.
Scoring algorithms that find something besides the best path (e.g. the forward-backward algorithm)
Exploration of how the alignment affects prediction quality
Also: how does Doublescan perform when the input sequences are identical and there is some (a total?) alignment between them? This might give a fairer comparison with Genscan, as far as the amount of input information is concerned. What if one of the sequences is random noise? In theory, fewer alignment constraints should bring the result closer to the full Viterbi path and so give a better gene prediction, albeit with a longer running time.

22 References
Meyer, I. M., R. Durbin (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics Vol. 18 no. 10 pp
Korf, I., et al. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics Vol. 17 Suppl. 1 pp S140-S148
Batzoglou, S., et al. (2000) Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction. Genome Research Vol. 10 Is. 7 pp
Burge, C., S. Karlin (1997) Prediction of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268 pp 78-94
Krogh, A. (1997) Two methods for improving performance of an HMM and their application for gene finding. Proc. Fifth Int. Conf. Intelligent Systems for Molecular Biology, Eds T. Gaasterland et al. pp
Kulp, D., et al. (1996) A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA. ISMB-96 pp
Rogic, S., et al. (2002) Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics Vol. 18 no. 8 pp

