CSE 5290: Algorithms for Bioinformatics Fall 2009


CSE 5290: Algorithms for Bioinformatics, Fall 2009
Suprakash Datta (datta@cs.yorku.ca)
Office: CSEB 3043; Phone: 416-736-2100 ext. 77875
Course page: http://www.cs.yorku.ca/course/5290

Profile Representation of Protein Families
Aligned DNA sequences can be represented by a 4 × n profile matrix reflecting the frequencies of nucleotides in each aligned position. Similarly, a protein family can be represented by a 20 × n profile reflecting the frequencies of amino acids.
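As a minimal illustration (not part of the original slides), the following Python sketch builds a 4 × n profile matrix from a gapless DNA alignment; the function name and the toy sequences are assumptions for the example.

```python
from collections import Counter

def profile_matrix(alignment, alphabet="ACGT"):
    """Column-wise symbol frequencies of a gapless alignment (4 x n for DNA)."""
    n = len(alignment[0])
    profile = {a: [0.0] * n for a in alphabet}
    for j in range(n):
        counts = Counter(seq[j] for seq in alignment)
        for a in alphabet:
            profile[a][j] = counts[a] / len(alignment)
    return profile

# Example: three aligned sequences of length 4
print(profile_matrix(["ACGT", "ACGA", "TCGT"]))
```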

Profiles and HMMs
HMMs can also be used to align a sequence against a profile representing a protein family. A 20 × n profile P corresponds to n sequentially linked match states M1, …, Mn in the profile HMM of P.

Profile HMM
[Figure: a profile HMM]

Building a profile HMM
A multiple alignment is used to construct the model:
- Assign each column to a Match state in the HMM.
- Add Insertion and Deletion states.
- Estimate the emission probabilities from the amino acid counts in each column; different positions in the protein will therefore have different emission probabilities.
- Estimate the transition probabilities between Match, Deletion, and Insertion states.
The HMM is then trained to derive the optimal parameters. A sketch of the emission estimates follows below.
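The following hedged Python sketch estimates match-state emissions from a gapped multiple alignment. The rule used here, marking a column as a match state when fewer than half of its entries are gaps, is a common heuristic assumed for this example, not something the lecture specifies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_state_emissions(alignment, pseudocount=1.0):
    """Estimate emission probabilities for the match states of a profile HMM.

    A column is treated as a match state when fewer than half of the
    sequences have a gap there (an assumed heuristic). Returns one
    {amino acid: probability} dict per match state.
    """
    n_cols = len(alignment[0])
    emissions = []
    for j in range(n_cols):
        column = [seq[j] for seq in alignment]
        if column.count("-") >= len(column) / 2:
            continue  # treated as an insertion column, not a match state
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append(
            {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}
        )
    return emissions

print(len(match_state_emissions(["MK-V", "MKAV", "MR-V"])))  # 3 match states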

States of a Profile HMM
- Match states M1, …, Mn (plus begin/end states)
- Insertion states I0, I1, …, In
- Deletion states D1, …, Dn

Transition Probabilities in a Profile HMM
log(aMI) + log(aIM) = gap initiation penalty
log(aII) = gap extension penalty
Consequently, a gap of length g contributes log(aMI) + (g-1)·log(aII) + log(aIM) to the log-space score, which is exactly an affine gap penalty.
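A tiny numeric sketch of this correspondence; the transition-probability values are illustrative assumptions, not from the lecture.

```python
import math

# Illustrative transition probabilities (assumed, not from the slides)
a_MI, a_IM, a_II = 0.1, 0.3, 0.7

gap_open = math.log(a_MI) + math.log(a_IM)   # gap initiation penalty
gap_extend = math.log(a_II)                  # gap extension penalty

def gap_score(g):
    """Log-space score of a gap of length g >= 1 (affine in g)."""
    return gap_open + (g - 1) * gap_extend

print(gap_score(1), gap_score(3))
```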

Emission Probabilities in a Profile HMM
The probability of emitting a symbol a at an insertion state Ij is
eIj(a) = p(a),
where p(a) is the background frequency of the symbol a over all the sequences.
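A minimal sketch of computing those background frequencies p(a) (the function name is an assumption):

```python
from collections import Counter

def background_frequencies(sequences):
    """p(a): frequency of each symbol over all sequences,
    used as the emission distribution of insertion states."""
    counts = Counter(c for seq in sequences for c in seq)
    total = sum(counts.values())
    return {a: counts[a] / total for a in counts}

print(background_frequencies(["ACGT", "AAGT"]))  # {'A': 0.375, ...}
```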

Profile HMM Alignment
Define vMj(i) as the log-likelihood score of the best path for matching x1…xi to the profile HMM, ending with xi emitted by state Mj. vIj(i) and vDj(i) are defined similarly.

Profile HMM Alignment: Dynamic Programming

vMj(i) = log(eMj(xi)/p(xi)) + max of:
    vMj-1(i-1) + log(aMj-1,Mj)
    vIj-1(i-1) + log(aIj-1,Mj)
    vDj-1(i-1) + log(aDj-1,Mj)

vIj(i) = log(eIj(xi)/p(xi)) + max of:
    vMj(i-1) + log(aMj,Ij)
    vIj(i-1) + log(aIj,Ij)
    vDj(i-1) + log(aDj,Ij)

(vDj(i) satisfies an analogous recurrence; deletion states are silent, so it has no emission term.)
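A compact Python sketch of the match/insert recurrences above, under simplifying assumptions made for brevity: deletion states are omitted, transition probabilities are position-independent, and the score is read from the last match state rather than a proper end state. All parameter names and the toy numbers are mine, not the lecture's.

```python
import math

NEG_INF = float("-inf")

def align_to_profile(x, e_M, e_I, p, a):
    """Best log-odds score of aligning x to a profile HMM with match and
    insertion states only (deletion states omitted to keep the sketch short).

    e_M[j][c]: emission prob of symbol c at match state Mj (j = 1..n)
    e_I[c]:    emission prob at insertion states (background, per the slides)
    p[c]:      background frequency of symbol c
    a[(s,t)]:  transition probability s -> t, position-independent (assumed)
    """
    n = len(e_M) - 1          # e_M[0] is unused; match states are 1..n
    L = len(x)
    vM = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    vI = [[NEG_INF] * (L + 1) for _ in range(n + 1)]
    vM[0][0] = 0.0            # begin state
    for i in range(1, L + 1):
        c = x[i - 1]
        for j in range(n + 1):
            # Insertion Ij emits c, coming from Mj or Ij (same column j)
            best = max(vM[j][i - 1] + math.log(a[("M", "I")]),
                       vI[j][i - 1] + math.log(a[("I", "I")]))
            vI[j][i] = math.log(e_I[c] / p[c]) + best
            if j >= 1:
                # Match Mj emits c, coming from Mj-1 or Ij-1
                best = max(vM[j - 1][i - 1] + math.log(a[("M", "M")]),
                           vI[j - 1][i - 1] + math.log(a[("I", "M")]))
                vM[j][i] = math.log(e_M[j][c] / p[c]) + best
    return vM[n][L]           # simplification: end in the last match state

# Toy usage (all numbers are illustrative assumptions)
e_M = [None,
       {"A": 0.8, "C": 0.2}, {"A": 0.2, "C": 0.8}]   # n = 2 match states
e_I = {"A": 0.5, "C": 0.5}
p = {"A": 0.5, "C": 0.5}
a = {("M", "M"): 0.8, ("M", "I"): 0.2, ("I", "M"): 0.4, ("I", "I"): 0.6}
print(align_to_profile("AC", e_M, e_I, p, a))
```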

Paths in Edit Graph and Profile HMM
[Figure: a path through an edit graph and the corresponding path through a profile HMM]

HMM Parameter Estimation
So far we have assumed that the transition and emission probabilities are known. In most HMM applications, however, they are not known, and estimating them is hard.

HMM Parameter Estimation Problem
Given: an HMM with its states and alphabet (emission characters), and independent training sequences x1, …, xm.
Find: HMM parameters Θ (that is, akl and ek(b)) that maximize P(x1, …, xm | Θ), the joint probability of the training sequences.

Maximize the likelihood
P(x1, …, xm | Θ), viewed as a function of Θ, is called the likelihood of the model. The training sequences are assumed independent, therefore
P(x1, …, xm | Θ) = Πi P(xi | Θ).
The parameter estimation problem seeks the Θ that realizes
max over Θ of Πi P(xi | Θ).
In practice the log-likelihood Σi log P(xi | Θ) is computed to avoid underflow errors.
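A small illustration (mine, not from the slides) of why the log is used: a product of many per-position probabilities underflows to 0.0 in double precision, while the log-likelihood stays finite.

```python
import math

probs = [0.25] * 1000        # e.g. 1000 emissions, each with probability 0.25

likelihood = 1.0
for q in probs:
    likelihood *= q          # 0.25**1000 ~ 1e-602: underflows to 0.0

log_likelihood = sum(math.log(q) for q in probs)  # about -1386.29, finite

print(likelihood, log_likelihood)
```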

Two situations
Known paths for the training sequences:
- CpG islands are marked on the training sequences
- One evening the casino dealer allows us to see when he changes dice
Unknown paths:
- CpG islands are not marked
- We do not see when the casino dealer changes dice

Known paths
Akl = # of times each transition k → l is taken in the training sequences
Ek(b) = # of times b is emitted from state k in the training sequences
Compute akl and ek(b) as maximum likelihood estimators:
akl = Akl / Σl' Akl'
ek(b) = Ek(b) / Σb' Ek(b')
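A minimal sketch of these count-and-normalize estimators; the data layout (pairs of emitted sequence and state path as equal-length strings) is an assumption for the example.

```python
from collections import defaultdict

def estimate_from_known_paths(training_data):
    """Maximum likelihood estimation of HMM parameters from labeled data.

    training_data: list of (x, path) pairs, where x is the emitted sequence
    and path the corresponding state sequence (equal-length strings).
    Returns (a, e): a[k][l] transition probs, e[k][b] emission probs.
    """
    A = defaultdict(lambda: defaultdict(float))   # transition counts Akl
    E = defaultdict(lambda: defaultdict(float))   # emission counts Ek(b)
    for x, path in training_data:
        for i in range(len(x)):
            E[path[i]][x[i]] += 1
            if i + 1 < len(x):
                A[path[i]][path[i + 1]] += 1
    a = {k: {l: c / sum(row.values()) for l, c in row.items()}
         for k, row in A.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()}
         for k, row in E.items()}
    return a, e

# Toy example: fair (F) vs loaded (L) dice states
a, e = estimate_from_known_paths([("16663", "FLLLF")])
print(a["L"], e["F"])   # {'L': 2/3, 'F': 1/3}, {'1': 0.5, '3': 0.5}
```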

Pseudocounts
Some state k may not appear in any of the training sequences. Then Akl = 0 for every state l, and akl cannot be computed with the equation above. To avoid this kind of overfitting, use predetermined pseudocounts rkl and rk(b):
Akl = (# of transitions k → l) + rkl
Ek(b) = (# of emissions of b from k) + rk(b)
The pseudocounts reflect our prior biases about the probability values.
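The change to the previous sketch is one addition per count before normalizing. A hedged fragment follows; the uniform choice rkl = rk(b) = 1 (Laplace smoothing) is an assumption, not the lecture's prescription.

```python
def smoothed_probs(counts, outcomes, pseudocount=1.0):
    """Normalize raw counts after adding a pseudocount for every outcome."""
    total = sum(counts.get(o, 0.0) for o in outcomes) + pseudocount * len(outcomes)
    return {o: (counts.get(o, 0.0) + pseudocount) / total for o in outcomes}

# Even an unseen transition now gets nonzero probability
print(smoothed_probs({"L": 2.0}, outcomes=["F", "L"]))  # {'F': 0.25, 'L': 0.75}
```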

Unknown paths: Baum-Welch
Idea:
- Guess initial values for the parameters (art and experience, not science).
- Estimate new (better) values for the parameters. (How?)
- Repeat until a stopping criterion is met. (What criterion?)

Better values for the parameters
We would need the Akl and Ek(b) values, but we cannot count them (the path is unknown), and we do not want to commit to a single most probable path. Instead, for all states k, l, every symbol b, and each training sequence x, compute Akl and Ek(b) as expected values given the current parameters, using the forward values fk(i) = P(x1…xi, πi = k) and backward values bk(i) = P(xi+1…xL | πi = k):
Akl = Σx (1/P(x)) Σi fk(i) · akl · el(xi+1) · bl(i+1)
Ek(b) = Σx (1/P(x)) Σ{i : xi = b} fk(i) · bk(i)

Notation
For any sequence of characters x emitted along some unknown path π, πi = k denotes the event that the state at position i (the state in which xi is emitted) is k.

The Baum-Welch algorithm
Initialization: pick a best guess for the model parameters (or arbitrary values).
Iteration:
- Run the forward algorithm for each x
- Run the backward algorithm for each x
- Calculate Akl and Ek(b)
- Calculate the new akl and ek(b)
- Calculate the new log-likelihood
Until the log-likelihood does not change much. A sketch follows below.
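A compact, didactic Python sketch of this loop (not the lecture's code). It works with raw probabilities, so it is only suitable for short toy sequences; real implementations use scaling or log-space arithmetic, as the earlier log-likelihood slide suggests. The initial-state distribution pi, the fixed iteration count in place of the convergence test, and all toy numbers are assumptions for the example.

```python
import math
from collections import defaultdict

def forward(x, states, a, e, pi):
    """f[i][k] = P(x1..xi, state at i = k); pi = initial-state probabilities."""
    f = [{k: pi[k] * e[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states)
                  for l in states})
    return f

def backward(x, states, a, e):
    """b[i][k] = P(x_{i+1}..x_L | state at i = k)."""
    b = [dict() for _ in x]
    b[-1] = {k: 1.0 for k in states}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

def baum_welch(seqs, states, symbols, a, e, pi, n_iter=50):
    """Re-estimate (a, e); a fixed iteration count stands in for the
    slide's stopping rule (monitor the change in log-likelihood)."""
    for _ in range(n_iter):
        A = {k: defaultdict(float) for k in states}  # expected transition counts
        E = {k: defaultdict(float) for k in states}  # expected emission counts
        for x in seqs:
            f, b = forward(x, states, a, e, pi), backward(x, states, a, e)
            px = sum(f[-1][k] for k in states)       # P(x | current parameters)
            for i in range(len(x)):
                for k in states:
                    E[k][x[i]] += f[i][k] * b[i][k] / px
                    if i + 1 < len(x):
                        for l in states:
                            A[k][l] += (f[i][k] * a[k][l]
                                        * e[l][x[i + 1]] * b[i + 1][l] / px)
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return a, e

# Toy two-state coin example with illustrative starting values (assumptions)
states, symbols = ["F", "L"], ["H", "T"]
a0 = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
e0 = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.7, "T": 0.3}}
pi = {"F": 0.5, "L": 0.5}
a1, e1 = baum_welch(["HHHTHHHHTH"], states, symbols, a0, e0, pi, n_iter=10)
print(e1["L"])
```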

Baum-Welch analysis
- The log-likelihood increases with each iteration.
- Baum-Welch is a particular case of the EM (expectation maximization) algorithm.
- It converges to a local maximum; the choice of initial parameters determines which local maximum the algorithm converges to.