Efficient Estimation of Emission Probabilities in Profile HMMs, by Virpi Ahola et al. Reviewed by Alok Datar.

Index
1. Motivation
2. Introduction
3. Method
4. Simulation Results
5. Conclusion

1. Motivation
A drawback of profile HMMs is that conserved amino acids are not emphasized: signal and noise are treated equally. As a result, the number of estimated parameters is enormous. Estimation should focus on the conserved amino acids only, to improve accuracy.

2. Introduction
Profile HMMs originate from profile analysis. The essence of profile analysis is that information about the conservation of residues is incorporated into the profile, which lets the analysis detect structural similarities and homologies to the sequence family. In profile HMMs, the emission probabilities of all 20 amino acids are estimated in every emitting state, so the number of estimated parameters can be enormous.

2. Introduction (continued)
For example, if the model includes 300 emitting states, the number of emission parameters is 300 × 20 = 6000. Most of these parameters are, however, noise, i.e. they describe unconserved positions. This paper presents an alternative, likelihood-based approach to reducing the parameter space in HMMs.

2. Introduction (continued)
The advantage of the new method is that it explicitly takes the conservation of the alignment into account.

3. Methods
3.1 Profile HMM
3.2 Classification Algorithm
3.3 EEP Estimation Method

3.1 Profile HMM
The profile HMM architecture (Durbin et al., 1998) has three classes of states:
- match states
- insert states
- delete states
The match and insert states always emit a symbol; delete states are silent. The model starts from the begin state and ends with the end state.

3.1 Profile HMM (continued)
The model length is determined by the number of positions, that is, the number of match-insert-delete state triplets between the begin and end states. An observation sequence {Y_i} is considered a stochastic process over a finite set of symbols O = {o_1, o_2, ..., o_S}. The state sequence, the path through the model, is a finite-state Markov chain {X_i}. The emitted symbols are assumed to be conditionally independent given the states.
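
Since the path is a Markov chain and the emitted symbols are conditionally independent given the states, the joint probability of a sequence and a known path factorizes into a product of transition and emission terms. A minimal sketch of that factorization (the state names, dict-based parameters, and M/I naming convention are illustrative, not from the paper):

```python
import math

def log_joint_probability(path, symbols, trans, emit):
    """Log of P(path, symbols) for a profile HMM when the path is known.

    path    -- state sequence, e.g. ["begin", "M1", "I1", "M2", "end"]
    symbols -- one emitted symbol per emitting (match/insert) state on the
               path; delete states are silent and consume no symbol
    trans   -- dict mapping (state, next_state) -> transition probability
    emit    -- dict mapping (state, symbol) -> emission probability
    """
    log_p = 0.0
    sym_iter = iter(symbols)
    for prev, state in zip(path, path[1:]):
        log_p += math.log(trans[(prev, state)])
        if state.startswith(("M", "I")):  # match and insert states emit
            log_p += math.log(emit[(state, next(sym_iter))])
    return log_p
```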

3.1 Profile HMM (continued)
When the estimation is based on a sequence alignment, the columns of the multiple alignment are assigned as match or insert states before the estimation; thus, the path that generates each sequence is known. Columns representing conserved positions are chosen as match states, and the remaining columns become insert states.
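
Because match columns are chosen before estimation, a simple conservation criterion suffices to fix the paths. The sketch below uses the common gap-fraction heuristic from Durbin et al. as an illustrative stand-in for whatever criterion the authors applied (the one-half threshold is an assumption):

```python
def assign_columns(alignment, gap_char="-", max_gap_fraction=0.5):
    """Label each column of a multiple alignment as a match or insert state.

    alignment -- list of equal-length aligned sequences
    Columns whose gap fraction is below max_gap_fraction are treated as
    conserved positions (match states); the remaining columns are inserts.
    """
    n_seqs = len(alignment)
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count(gap_char) / n_seqs
        labels.append("match" if gap_fraction < max_gap_fraction else "insert")
    return labels

# The gappy fourth column becomes an insert state:
print(assign_columns(["AC--G", "A-T-G", "ACTAG"]))
# ['match', 'match', 'match', 'insert', 'match']
```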

3.1 Profile HMM (continued)
The profile HMM has two sets of parameters:
- transition probabilities
- emission probabilities
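
Since the paths are known, both parameter sets can in principle be estimated by simple counting. The sketch below shows that plain maximum-likelihood baseline (the paper's contribution replaces the emission step with the EEP estimates developed in Section 3.3):

```python
from collections import Counter, defaultdict

def estimate_parameters(paths, emissions):
    """Count-based ML estimates of transition and emission probabilities.

    paths     -- list of state sequences, one per training sequence
    emissions -- for each sequence, its list of (state, symbol) pairs
    Returns (trans, emit): nested dicts of normalized frequencies.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for path in paths:
        for prev, state in zip(path, path[1:]):
            trans_counts[prev][state] += 1
    for pairs in emissions:
        for state, symbol in pairs:
            emit_counts[state][symbol] += 1

    def normalize(counts):
        return {state: {k: v / sum(c.values()) for k, v in c.items()}
                for state, c in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)
```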

3.2 Classification Algorithm
The basis for the EEP method is that in match states the emission probability distributions are concentrated on a few conserved residues, while the other residues occur relatively seldom. In practice, which residues are the conserved ones varies, which is why they are determined algorithmically.

3.2 Classification Algorithm (continued)

In the algorithm (shown as a figure on the slide), at each iteration step the residue with the largest relative frequency with respect to its background probability is classified as effective or ineffective, depending on a fixed threshold value. The remaining probabilities are then renormalized so that they again sum to one.

3.2 Classification Algorithm (continued)
The renormalization step is necessary because otherwise residues with low background probability would tend to be chosen as effective more often than those with high background probability.
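
Read that way, one plausible rendering of the classification loop is the sketch below (the function name, exact stopping behavior, and the choice to renormalize both distributions are my reading of the slide, not the paper's code):

```python
def classify_residues(freqs, background, threshold):
    """Iteratively split residues into effective and ineffective sets.

    freqs      -- residue -> relative frequency in a match column
    background -- residue -> background probability
    threshold  -- ratio above which a residue counts as effective
    """
    remaining_f = dict(freqs)
    remaining_b = dict(background)
    effective, ineffective = [], []
    while remaining_f:
        # Residue with the largest frequency relative to its background.
        residue = max(remaining_f, key=lambda r: remaining_f[r] / remaining_b[r])
        ratio = remaining_f[residue] / remaining_b[residue]
        (effective if ratio >= threshold else ineffective).append(residue)
        del remaining_f[residue], remaining_b[residue]
        # Renormalize over the residues still in play, so residues with a low
        # background probability are not unduly favored in later iterations.
        for dist in (remaining_f, remaining_b):
            total = sum(dist.values())
            if total > 0:
                for r in dist:
                    dist[r] /= total
    return effective, ineffective
```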

3.3 EEP Estimation Method
EEP is constructed from the log-likelihood function of the multinomial distribution, l(θ) = Σ_j n_j log θ_j (up to an additive constant), where n_j is the frequency of amino acid j and θ_j its emission probability.

3.3 EEP Estimation Method (continued)
The log-likelihood function is maximized subject to two constraints, stated on the slide as equations involving the background distribution and fixed constants.

3.3 EEP Estimation Method (continued)
The first constraint ensures that the mutual ratios of the ineffective residues remain the same as in the background distribution. The second constraint ensures that the total proportion of effective to ineffective residues does not increase too much with respect to the background distribution.
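
A plausible way to write the two constraints down (my notation, not the paper's: θ is the estimated emission distribution, q the background distribution, and c and t the constants mentioned on the slide, whose exact forms did not survive the transcript):

```latex
% Constraint 1: ineffective residues keep their mutual background ratios.
\theta_j = c\, q_j \qquad \text{for every ineffective residue } j
% Constraint 2: the effective-to-ineffective mass ratio may not grow
% too far beyond its value under the background distribution.
\frac{\sum_{j\ \mathrm{eff}} \theta_j}{\sum_{j\ \mathrm{ineff}} \theta_j}
  \;\le\; t \cdot \frac{\sum_{j\ \mathrm{eff}} q_j}{\sum_{j\ \mathrm{ineff}} q_j}
```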

3.3 EEP Estimation Method (continued)
There are two possible sets of solutions, depending on whether the above-mentioned inequality holds.

3.3 EEP Estimation Method (continued)
If the inequality holds, the optimal probabilities of the effective residues are rescaled (the explicit formulas are given on the slide), and the probabilities of the ineffective residues are estimated by dividing the remaining probability mass among them in proportion to their background probabilities.

3.3 EEP Estimation Method (continued)
If the inequality does not hold, the probabilities are given by a second set of equations, also shown on the slide.
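
Putting the two cases together, the estimation step could be sketched as follows. This is a schematic reading of the slides: the cap on the total effective mass stands in for the paper's exact inequality, and all names are illustrative.

```python
def eep_estimates(counts, effective, background, max_effective_mass):
    """Schematic EEP-style estimate of one match-state emission distribution.

    counts             -- residue -> observed frequency n_j
    effective          -- set of residues classified as effective
    background         -- residue -> background probability q_j
    max_effective_mass -- cap on the total probability of effective residues
                          (a stand-in for the constraint constant)
    """
    n = sum(counts.values())
    ineffective = [r for r in counts if r not in effective]

    # Unconstrained ML mass on the effective residues.
    eff_mass = sum(counts[r] for r in effective) / n
    # If the constraint is violated, rescale the effective probabilities to
    # the cap while keeping their mutual ML ratios; otherwise keep plain ML.
    scale = min(1.0, max_effective_mass / eff_mass) if eff_mass > 0 else 0.0

    probs = {r: scale * counts[r] / n for r in effective}
    # Ineffective residues share the leftover mass in proportion to background.
    leftover = 1.0 - sum(probs.values())
    if ineffective:
        bg_total = sum(background[r] for r in ineffective)
        for r in ineffective:
            probs[r] = leftover * background[r] / bg_total
    return probs
```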

4. Simulation Results
To study how successfully the EEP method classifies residues as effective or ineffective, the percentage of misclassified residues was calculated. The accuracy and variance of the EEP estimates were compared with the ML estimates. Finally, the robustness of the EEP method to the choice of the threshold value was examined.

4. Simulation Results (continued)
The theoretical simulation set was composed of three effective residues: alanine (35%), glycine (50%), and methionine (10%). The other residues were ineffective and shared the remaining probability in the same proportion as their background probabilities.
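
For instance, that simulation distribution can be assembled as in the sketch below (the six-letter alphabet and uniform background are placeholders for illustration, not the paper's values):

```python
def build_simulation_distribution(effective_probs, background):
    """Fix the effective residues' probabilities and share the remaining
    mass among the other residues in proportion to their background."""
    leftover = 1.0 - sum(effective_probs.values())
    rest = {r: q for r, q in background.items() if r not in effective_probs}
    total = sum(rest.values())
    dist = dict(effective_probs)
    dist.update({r: leftover * q / total for r, q in rest.items()})
    return dist

# Alanine 35%, glycine 50%, methionine 10%; 5% is left for the others.
toy_background = {r: 1 / 6 for r in ["A", "G", "M", "C", "H", "W"]}
dist = build_simulation_distribution({"A": 0.35, "G": 0.50, "M": 0.10},
                                     toy_background)
print(round(sum(dist.values()), 10))  # 1.0
```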

4.1 False Effective and Ineffective Residues
Among the effective residues, alanine and glycine were correctly classified in all simulations. The percentage of misclassified methionine residues increased from 0.8% to 2.9% as the threshold value was raised from 1 to 2. As the threshold value increases, the classification of effective residues whose probabilities are relatively low may fail; this problem, however, disappears as the number of sequences used for estimation increases.

4.1 False Effective and Ineffective Residues (continued)
When the threshold value was set to 1, cysteine, histidine, and tryptophan were misclassified in 5.3%, 2.6%, and 4.1% of the simulations, respectively. The variation appears closely related to the background distribution: residues with low background probabilities tend to be misclassified more often than the others.

4.2 Accuracy and Variance of Estimates

4.2 Accuracy and Variance of Estimates (continued)
The estimates of the effective residues were rather accurate; as the figure shows, there was no great difference between the ML and EEP estimates.

4.2 Accuracy and Variance of Estimates (continued)

For the ineffective residue estimates, by contrast, the figure shows a great difference between ML and EEP: the variance of the EEP estimates was clearly much smaller than that of the ML estimates.

4.3 Choosing the Threshold Value
To examine the effect of the threshold value on estimation, the data were also estimated using incorrect threshold values. For threshold values below the true threshold, sensitivity improved and specificity worsened; the opposite held for threshold values above the true threshold. As far as accuracy was concerned, sensitivity seemed more important than specificity.

5. Conclusion
The major advantage of the EEP method is the reduction in the dimension of the parameter space. In protein sequence alignments the decrease is significant, because in conserved positions only a few residues can be considered effective. A study with 20 well-defined protein families indicates that the EEP method detects sequences with, on average, 98% sensitivity and 99% specificity.

5. Conclusion (continued)
As a consequence of the reduced parameter space, the variance of the ineffective residue estimates decreases without affecting the variance of the effective residue estimates. The major disadvantage of the method is its inability to take into account the physical and chemical characteristics of the amino acids; it thus ignores the relationships among the amino acids.