Hidden Markov Models Lecture 6, Thursday April 17, 2003.

Presentation transcript:

Hidden Markov Models Lecture 6, Thursday April 17, 2003

Review of Last Lecture

Decoding
GIVEN $x = x_1 x_2 \ldots x_N$
We want to find $\pi = \pi_1, \ldots, \pi_N$ such that $P[x, \pi]$ is maximized:
$\pi^* = \arg\max_{\pi} P[x, \pi]$
We can use dynamic programming!
Let $V_k(i) = \max_{\pi_1, \ldots, \pi_{i-1}} P[x_1 \ldots x_{i-1}, \pi_1, \ldots, \pi_{i-1}, x_i, \pi_i = k]$
= probability of the most likely sequence of states ending at state $\pi_i = k$
[Figure: trellis of states 1, 2, …, K at each position of x]

The Viterbi Algorithm
Similar to “aligning” a set of states to a sequence
Time: $O(K^2 N)$
Space: $O(KN)$
[Figure: dynamic programming matrix with states 1, 2, …, K as rows and positions $x_1, \ldots, x_N$ as columns; cell (j, i) holds $V_j(i)$]
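
The recurrence behind this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `viterbi`, the dictionary layout, and the dishonest-casino numbers (0.95/0.05, 0.10/0.90, a loaded die that shows 6 half the time) are all assumptions chosen for the example. It works in log space and assumes no explicit end state.

```python
import math

def viterbi(x, states, start, trans, emit):
    """Most likely state path for x, computed in log space.

    start[k] stands in for a_0k, trans[k][l] for a_kl, emit[k][b] for e_k(b).
    Assumes every probability that is used is strictly positive.
    """
    # V[i][k]: log-probability of the best path ending in state k at position i
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    back = []  # back[i - 1][l]: best predecessor of state l at position i
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][l]))
            ptr[l] = best
            row[l] = math.log(emit[l][x[i]]) + V[i - 1][best] + math.log(trans[best][l])
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):  # follow the pointers from the end to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]

# Illustrative (assumed) dishonest-casino parameters: F = fair, L = loaded
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

path, logp = viterbi([6] * 10, states, start, trans, emit)
print("".join(path), logp)
```

The two nested loops over states inside the loop over positions give the $O(K^2 N)$ time bound, and the stored back-pointers account for the $O(KN)$ space.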

Evaluation
Compute:
$P(x)$: probability of x given the model
$P(x_i \ldots x_j)$: probability of a substring of x given the model
$P(\pi_i = k \mid x)$: probability that the i-th state is k, given x

The Forward Algorithm
We can compute $f_k(i)$ for all k, i, using dynamic programming!
Initialization: $f_0(0) = 1$; $f_k(0) = 0$ for all $k > 0$
Iteration: $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
Termination: $P(x) = \sum_k f_k(N)\, a_{k0}$
where $a_{k0}$ is the probability that the terminating state is k (usually $= a_{0k}$)
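
A direct transcription of the forward recurrence might look like the sketch below, where $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$. Two things are assumptions rather than part of the slide: the casino parameters are made up for illustration, and the model has no explicit end state, so the termination step uses $a_{k0} = 1$.

```python
def forward(x, states, start, trans, emit):
    """Return (f, px); f[i][k] is the forward value for position i + 1 (0-based list)."""
    # Position 1: f_l(1) = e_l(x_1) * a_0l, since f_0(0) = 1 is the only non-zero start
    f = [{l: start[l] * emit[l][x[0]] for l in states}]
    for i in range(1, len(x)):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    px = sum(f[-1][k] for k in states)  # assumption: a_k0 = 1 for every k
    return f, px

# Illustrative (assumed) dishonest-casino parameters
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

f, px = forward([6, 6], states, start, trans, emit)
print(px)
```

Without an end state, the values $P(x)$ over all sequences of a fixed length sum to 1, which is a convenient sanity check for an implementation.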

The Backward Algorithm
We can compute $b_k(i)$ for all k, i, using dynamic programming
Initialization: $b_k(N) = a_{k0}$ for all k
Iteration: $b_k(i) = \sum_l e_l(x_{i+1})\, a_{kl}\, b_l(i+1)$
Termination: $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$
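
The backward recurrence can be sketched the same way. As before, the parameter values are illustrative assumptions, and the sketch takes $a_{k0} = 1$ (no modelled end state), so the initialization becomes $b_k(N) = 1$.

```python
def backward(x, states, start, trans, emit):
    """Return (b, px); b[i][k] is the backward value for position i + 1 (0-based list)."""
    n = len(x)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}  # assumption: a_k0 = 1 for every k
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[i + 1][l] for l in states)
                for k in states}
    # Termination: P(x) = sum_l a_0l * e_l(x_1) * b_l(1)
    px = sum(start[l] * emit[l][x[0]] * b[0][l] for l in states)
    return b, px

# Illustrative (assumed) dishonest-casino parameters
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

b, px = backward([6, 6], states, start, trans, emit)
print(px)
```

For the two-roll input the result can be checked by hand, by summing the joint probability over the four possible state paths FF, FL, LF and LL.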

Posterior Decoding
We can now calculate
$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$
Then we can ask: what is the most likely state at position i of sequence x?
Define $\hat{\pi}$ by Posterior Decoding: $\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$
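
Putting the forward and backward tables together gives posterior decoding. The sketch below is one possible arrangement under stated assumptions: the name `posterior_decode` and the casino parameters are invented for the example, there is no end state ($a_{k0} = 1$), and no scaling is applied, so it is only suitable for short sequences.

```python
def posterior_decode(x, states, start, trans, emit):
    """Return (posteriors, hat): P(pi_i = k | x) per position and its argmax."""
    n = len(x)
    f = [{l: start[l] * emit[l][x[0]] for l in states}]
    for i in range(1, n):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}  # assumption: a_k0 = 1
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(emit[l][x[i + 1]] * trans[k][l] * b[i + 1][l] for l in states)
                for k in states}
    px = sum(f[n - 1][k] for k in states)
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
    hat = [max(states, key=lambda k: post[i][k]) for i in range(n)]
    return post, hat

# Illustrative (assumed) dishonest-casino parameters
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

post, hat = posterior_decode([6] * 10, states, start, trans, emit)
print("".join(hat))
```

Unlike the Viterbi path, the sequence $\hat{\pi}$ is chosen position by position, so it need not be a path the model can actually produce when some transitions have probability 0.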

Today
Example: CpG Islands
Learning

Implementation Techniques
Viterbi: sum of logs
$V_l(i) = \log e_l(x_i) + \max_k [ V_k(i-1) + \log a_{kl} ]$
Forward/Backward: scaling by $c(i)$
One way to perform scaling:
$f_l(i) = c(i) \times [ e_l(x_i) \sum_k f_k(i-1)\, a_{kl} ]$, where $c(i) = 1 / \sum_k f_k(i)$ is computed from the unscaled values at position i, so that the scaled $f_k(i)$ sum to 1
$b_l(i)$: use the same factors $c(i)$
Details in Rabiner’s Tutorial on HMMs, 1989
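
The scaling idea can be sketched as follows. This is an illustration under assumptions, not the lecture's code: the name `forward_scaled` and the casino parameters are hypothetical, and the model has no end state. Because each position is rescaled to sum to 1, the log-likelihood is recovered as $\log P(x) = -\sum_i \log c(i)$.

```python
import math

def forward_scaled(x, states, start, trans, emit):
    """Scaled forward pass; returns (fhat, c, log_px)."""
    fhat, c, prev = [], [], None
    for i, sym in enumerate(x):
        if i == 0:
            raw = {l: start[l] * emit[l][sym] for l in states}
        else:
            raw = {l: emit[l][sym] * sum(prev[k] * trans[k][l] for k in states)
                   for l in states}
        ci = 1.0 / sum(raw.values())        # c(i): makes the scaled values sum to 1
        prev = {l: ci * raw[l] for l in states}
        fhat.append(prev)
        c.append(ci)
    log_px = -sum(math.log(ci) for ci in c)  # assumes a_k0 = 1 (no end state)
    return fhat, c, log_px

# Illustrative (assumed) dishonest-casino parameters
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

_, _, log_px = forward_scaled([6] * 5000, states, start, trans, emit)
print(log_px)
```

An unscaled forward pass would underflow to 0 on a sequence of this length, whereas the scaled version keeps every stored value between 0 and 1 and only accumulates logarithms.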

A modeling Example: CpG islands in DNA sequences
[Figure: eight states A+, C+, G+, T+ and A-, C-, G-, T-]

Example: CpG Islands
CpG nucleotides in the genome are frequently methylated
(We write CpG to avoid confusion with the C-G base pair)
C → methyl-C → T
Methylation is often suppressed around genes and promoters → CpG islands

Example: CpG Islands
In CpG islands, CG is more frequent
Other pairs (AA, AG, AT, …) have different frequencies
Question: Detect CpG islands computationally

A model of CpG Islands: (1) Architecture
[Figure: states A+, C+, G+, T+ (CpG Island) and states A-, C-, G-, T- (Not CpG Island)]

A model of CpG Islands: (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0
1. Transition probabilities within CpG islands: established from many known (experimentally verified) CpG islands (Training Set)
2. Transition probabilities within other regions: established from many known non-CpG islands
[Tables: 4×4 transition probabilities over A, C, G, T for the + and - models; the values did not survive in the transcript]

Parenthesis: log likelihoods
A better way to see effects of transitions: log likelihoods
$L(u, v) = \log [ P(uv \mid +) / P(uv \mid -) ]$
[Table: 4×4 log-likelihood ratios over A, C, G, T; the values did not survive in the transcript]
Given a region $x = x_1 \ldots x_N$
A quick-&-dirty way to decide whether entire x is CpG:
$P(x \text{ is CpG}) > P(x \text{ is not CpG}) \iff \sum_i L(x_i, x_{i+1}) > 0$
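
The quick-and-dirty test can be sketched as below. The two transition tables are hypothetical numbers invented for this example (the slide's own tables are not in the transcript), chosen only so that C→G is common in the + table and rare in the - table. The sketch reads $P(uv \mid \pm)$ as the transition probability from u to v in each model.

```python
import math

# Hypothetical transition tables: outer key is the current base, inner key the next base.
PLUS = {"A": {"A": 0.20, "C": 0.30, "G": 0.40, "T": 0.10},
        "C": {"A": 0.20, "C": 0.35, "G": 0.30, "T": 0.15},
        "G": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "T": {"A": 0.10, "C": 0.35, "G": 0.40, "T": 0.15}}
MINUS = {"A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
         "C": {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30},
         "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
         "T": {"A": 0.20, "C": 0.25, "G": 0.30, "T": 0.25}}

def log_odds(u, v):
    """L(u, v): log ratio of the u -> v transition probability under + versus -."""
    return math.log(PLUS[u][v] / MINUS[u][v])

def score(x):
    """Sum of L(x_i, x_{i+1}) over adjacent pairs; a positive sum favours the + model."""
    return sum(log_odds(u, v) for u, v in zip(x, x[1:]))

print(score("CGCGCG"), score("ATATAT"))
```

With these made-up tables a CG-rich string scores above 0 and an AT-rich string scores below 0, which is the decision rule on the slide.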

A model of CpG Islands: (2) Transitions
What about transitions between (+) and (-) states?
They affect
 Avg. length of CpG island
 Avg. separation between two CpG islands
[Figure: two states X and Y; X stays in X with probability p and moves to Y with probability 1-p; Y stays in Y with probability q and moves to X with probability 1-q]
Length distribution of region X:
$P[l_X = 1] = 1-p$
$P[l_X = 2] = p(1-p)$
…
$P[l_X = k] = p^{k-1}(1-p)$
$E[l_X] = 1/(1-p)$
Geometric distribution (the discrete analogue of the exponential), with mean 1/(1-p)
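
The mean can be checked with a short derivation, taking $P[l_X = k] = p^{k-1}(1-p)$, which is the form consistent with the first two cases listed. The numeric example at the end uses an island length of 1000 purely as an illustration.

```latex
\begin{align*}
E[l_X] &= \sum_{k \ge 1} k \, p^{k-1} (1-p)
        = (1-p) \, \frac{d}{dp} \sum_{k \ge 0} p^{k}
        = (1-p) \, \frac{d}{dp} \frac{1}{1-p} \\
       &= (1-p) \, \frac{1}{(1-p)^{2}}
        = \frac{1}{1-p}.
\end{align*}
% Illustration (assumed number): an average island length of 1000
% corresponds to 1/(1-p) = 1000, i.e. p = 0.999.
```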

A model of CpG Islands: (2) Transitions
No reason to favor exiting/entering (+) and (-) regions at a particular nucleotide
To determine transition probabilities between (+) and (-) states:
1. Estimate average length of a CpG island: $l_{CPG} = 1/(1-p) \Rightarrow p = 1 - 1/l_{CPG}$
2. For each pair of (+) states k, l, let $a_{kl} \leftarrow p \times a_{kl}$
3. For each (+) state k, (-) state l, let $a_{kl} = (1-p)/4$ (better: take frequency of l in the (-) regions into account)
4. Do the same for (-) states
A problem with this model: CpG islands don’t have an exponential (geometric) length distribution
This is a defect of HMMs, compensated by ease of analysis & computation
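
Steps 1 to 4 can be sketched as a small function that assembles the 8-state transition table from the two 4-state tables. Everything concrete here is an assumption for illustration: the function name `combine`, the table values, and the average lengths of 1000 (island) and 10000 (non-island).

```python
def combine(plus, minus, p, q):
    """Build the 8-state table: p and q are the probabilities of staying in + and -."""
    bases = "ACGT"
    a = {}
    for u in bases:
        a[u + "+"], a[u + "-"] = {}, {}
        for v in bases:
            a[u + "+"][v + "+"] = p * plus[u][v]    # step 2: stay inside an island
            a[u + "+"][v + "-"] = (1 - p) / 4       # step 3: leave, uniform over bases
            a[u + "-"][v + "-"] = q * minus[u][v]   # step 4: same for the - states
            a[u + "-"][v + "+"] = (1 - q) / 4
    return a

# Hypothetical within-region transition tables (each row sums to 1)
plus = {"A": {"A": 0.20, "C": 0.30, "G": 0.40, "T": 0.10},
        "C": {"A": 0.20, "C": 0.35, "G": 0.30, "T": 0.15},
        "G": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "T": {"A": 0.10, "C": 0.35, "G": 0.40, "T": 0.15}}
minus = {"A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
         "C": {"A": 0.30, "C": 0.30, "G": 0.10, "T": 0.30},
         "G": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20},
         "T": {"A": 0.20, "C": 0.25, "G": 0.30, "T": 0.25}}

p = 1 - 1 / 1000    # step 1, assuming an average island length of 1000
q = 1 - 1 / 10000   # assumed average length of a non-island region
a = combine(plus, minus, p, q)
print(a["C+"]["G+"], a["C+"]["G-"])
```

Each row of the combined table still sums to 1, because the four within-region entries contribute p (or q) and the four cross-region entries contribute 1-p (or 1-q).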

Applications of the model
Given a DNA region x, the Viterbi algorithm predicts locations of CpG islands
Given a nucleotide $x_i$ (say $x_i = A$), the Viterbi parse tells whether $x_i$ is in a CpG island in the most likely general scenario
The Forward/Backward algorithms can calculate $P(x_i \text{ is in CpG island}) = P(\pi_i = A^+ \mid x)$
Posterior Decoding can assign locally optimal predictions of CpG islands: $\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$

What if a new genome comes?
We just sequenced the porcupine genome
We know CpG islands play the same role in this genome
However, we have no known CpG islands for porcupines
We suspect the frequency and characteristics of CpG islands are quite different in porcupines
How do we adjust the parameters in our model? - LEARNING

Problem 3: Learning
Re-estimate the parameters of the model based on training data

Two learning scenarios
1. Estimation when the “right answer” is known
Examples:
GIVEN: a genomic region $x = x_1 \ldots x_{1,000,000}$ where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
2. Estimation when the “right answer” is unknown
Examples:
GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice
QUESTION: Update the parameters $\theta$ of the model to maximize $P(x \mid \theta)$

When the right answer is known
Given $x = x_1 \ldots x_N$ for which the true $\pi = \pi_1 \ldots \pi_N$ is known,
Define:
$A_{kl}$ = # times the k → l transition occurs in $\pi$
$E_k(b)$ = # times state k in $\pi$ emits b in x
We can show that the maximum likelihood parameters $\theta$ are:
$a_{kl} = \frac{A_{kl}}{\sum_i A_{ki}}$, $e_k(b) = \frac{E_k(b)}{\sum_c E_k(c)}$

When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data
Drawback: given little data, there may be overfitting: $P(x \mid \theta)$ is maximized, but $\theta$ is unreasonable
0 probabilities: VERY BAD
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: $a_{FF} = 1$; $a_{FL} = 0$
$e_F(1) = e_F(3) = e_F(6) = .2$; $e_F(2) = .3$; $e_F(4) = 0$; $e_F(5) = .1$
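
The counting estimate is easy to reproduce on the ten-roll example. In this sketch the function name `ml_estimate` and the choice to return `None` for a state that never occurs are assumptions made for the illustration.

```python
from collections import Counter

def ml_estimate(x, pi, states, alphabet):
    """Maximum likelihood a_kl and e_k(b) from a labelled sequence, by counting."""
    A = Counter(zip(pi, pi[1:]))  # A[(k, l)]: number of k -> l transitions
    E = Counter(zip(pi, x))       # E[(k, b)]: number of times state k emits b
    a, e = {}, {}
    for k in states:
        row = sum(A[(k, l)] for l in states)
        tot = sum(E[(k, b)] for b in alphabet)
        # A state that was never visited has no defined estimate
        a[k] = {l: A[(k, l)] / row for l in states} if row else None
        e[k] = {b: E[(k, b)] / tot for b in alphabet} if tot else None
    return a, e

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ["F"] * 10
a, e = ml_estimate(x, pi, ["F", "L"], range(1, 7))
print(a["F"], e["F"], a["L"])
```

The output shows both problems at once: $e_F(4)$ and $a_{FL}$ come out as exactly 0, and the loaded state has no estimate at all because it never appears in the labels.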

Pseudocounts
Solution for small training sets: add pseudocounts
$A_{kl}$ = # times the k → l transition occurs in $\pi$ + $r_{kl}$
$E_k(b)$ = # times state k in $\pi$ emits b in x + $r_k(b)$
$r_{kl}$, $r_k(b)$ are pseudocounts representing our prior belief
Larger pseudocounts ⇒ strong prior belief
Small pseudocounts ($\epsilon < 1$): just to avoid 0 probabilities

Pseudocounts
Example: dishonest casino
We will observe the player for one day, 500 rolls
Reasonable pseudocounts:
$r_{0F} = r_{0L} = r_{F0} = r_{L0} = 1$
$r_{FL} = r_{LF} = r_{FF} = r_{LL} = 1$
$r_F(1) = r_F(2) = \ldots = r_F(6) = 20$ (strong belief fair is fair)
$r_L(1) = r_L(2) = \ldots = r_L(6) = 5$ (wait and see for loaded)
The above numbers are pretty arbitrary; assigning priors is an art
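
A sketch of the same counting estimate with pseudocounts added is shown below, using the transition and emission pseudocounts from the slide. The function name is hypothetical, the begin/end pseudocounts ($r_{0F}$, $r_{F0}$, …) are left out because this sketch does not model begin or end transitions, and the ten all-fair rolls are reused only as a small test input.

```python
from collections import Counter

def estimate_with_pseudocounts(x, pi, states, alphabet, r_trans, r_emit):
    """Counting estimate where every count is increased by its pseudocount."""
    A = Counter(zip(pi, pi[1:]))
    E = Counter(zip(pi, x))
    a, e = {}, {}
    for k in states:
        arow = {l: A[(k, l)] + r_trans[k][l] for l in states}
        erow = {b: E[(k, b)] + r_emit[k][b] for b in alphabet}
        a[k] = {l: v / sum(arow.values()) for l, v in arow.items()}
        e[k] = {b: v / sum(erow.values()) for b, v in erow.items()}
    return a, e

states, faces = ["F", "L"], list(range(1, 7))
r_trans = {k: {l: 1 for l in states} for k in states}  # r_FF = r_FL = r_LF = r_LL = 1
r_emit = {"F": {b: 20 for b in faces},                 # strong belief that fair is fair
          "L": {b: 5 for b in faces}}                  # wait and see for loaded

x = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ["F"] * 10
a, e = estimate_with_pseudocounts(x, pi, states, faces, r_trans, r_emit)
print(a["F"], e["F"][4], e["L"][6])
```

No probability is 0 any more: $a_{FL}$ becomes 1/11, $e_F(4)$ becomes 20/130, and the unseen loaded state falls back to its prior, a uniform 1/6 per face.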

When the right answer is unknown
We don’t know the true $A_{kl}$, $E_k(b)$
Idea:
We estimate our “best guess” of what $A_{kl}$, $E_k(b)$ are
We update the parameters of the model, based on our guess
We repeat

When the right answer is unknown
Starting with our best guess of a model M with parameters $\theta$:
Given $x = x_1 \ldots x_N$ for which the true $\pi = \pi_1 \ldots \pi_N$ is unknown,
we can get to a provably more likely parameter set $\theta$
Principle: EXPECTATION MAXIMIZATION
1. Estimate $A_{kl}$, $E_k(b)$ in the training data
2. Update $\theta$ according to $A_{kl}$, $E_k(b)$
3. Repeat 1 & 2, until convergence

Estimating new parameters
To estimate $A_{kl}$:
At each position i of sequence x, find the probability that transition k → l is used:
$P(\pi_i = k, \pi_{i+1} = l \mid x) = [1/P(x)] \times P(\pi_i = k, \pi_{i+1} = l, x_1 \ldots x_N) = Q / P(x)$
where
$Q = P(x_1 \ldots x_i, \pi_i = k, \pi_{i+1} = l, x_{i+1} \ldots x_N)$
$= P(\pi_{i+1} = l, x_{i+1} \ldots x_N \mid \pi_i = k)\, P(x_1 \ldots x_i, \pi_i = k)$
$= P(\pi_{i+1} = l, x_{i+1} x_{i+2} \ldots x_N \mid \pi_i = k)\, f_k(i)$
$= P(x_{i+2} \ldots x_N \mid \pi_{i+1} = l)\, P(x_{i+1} \mid \pi_{i+1} = l)\, P(\pi_{i+1} = l \mid \pi_i = k)\, f_k(i)$
$= b_l(i+1)\, e_l(x_{i+1})\, a_{kl}\, f_k(i)$
So: $P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x \mid \theta)}$

Estimating new parameters
So,
$A_{kl} = \sum_i P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \sum_i \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x \mid \theta)}$
Similarly,
$E_k(b) = [1/P(x)] \sum_{\{i \mid x_i = b\}} f_k(i)\, b_k(i)$
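
The two expected-count formulas can be sketched directly from the forward and backward tables. The function name `expected_counts`, the casino parameters and the roll sequence are assumptions for illustration; the sketch uses no end state ($b_k(N) = 1$) and no scaling, so it is only meant for short sequences.

```python
def expected_counts(x, states, alphabet, start, trans, emit):
    """Return (A, E, px): expected transition and emission counts given x."""
    n = len(x)
    f = [{l: start[l] * emit[l][x[0]] for l in states}]
    for i in range(1, n):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}  # assumption: no end state
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    px = sum(f[n - 1][k] for k in states)
    # A_kl = sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
    A = {k: {l: sum(f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                    for i in range(n - 1)) / px for l in states} for k in states}
    # E_k(s) = sum over positions with x_i = s of f_k(i) b_k(i) / P(x)
    E = {k: {s: sum(f[i][k] * b[i][k] for i in range(n) if x[i] == s) / px
             for s in alphabet} for k in states}
    return A, E, px

# Illustrative (assumed) dishonest-casino parameters
states, faces = ["F", "L"], list(range(1, 7))
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
emit = {"F": {r: 1 / 6 for r in faces},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

rolls = [1, 6, 6, 2, 6, 6, 6, 3, 5, 4]
A, E, px = expected_counts(rolls, states, faces, start, trans, emit)
print(A, E["L"][6])
```

A useful check is that the expected transition counts add up to N-1 and the expected emission counts add up to N, since each position contributes a probability distribution that sums to 1.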

Estimating new parameters
If we have several training sequences $x^1, \ldots, x^M$, each of length N, with forward and backward values $f_k^j(i)$, $b_k^j(i)$ computed on sequence $x^j$:
$A_{kl} = \sum_j \sum_i P(\pi_i = k, \pi_{i+1} = l \mid x^j, \theta) = \sum_j \sum_i \frac{f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)}{P(x^j \mid \theta)}$
Similarly,
$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f_k^j(i)\, b_k^j(i)$

The Baum-Welch Algorithm
Initialization: pick the best guess for model parameters (or arbitrary)
Iteration:
 Forward
 Backward
 Calculate $A_{kl}$, $E_k(b)$
 Calculate new model parameters $a_{kl}$, $e_k(b)$
 Calculate new log-likelihood $\log P(x \mid \theta)$
 GUARANTEED NOT TO BE LOWER, BY EXPECTATION-MAXIMIZATION
Until $P(x \mid \theta)$ does not change much
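
The loop above might be sketched as follows for a single training sequence. This is a simplified illustration under several assumptions: the function names, the generating parameters, the initial guess and the random seed are invented; the begin probabilities are kept fixed rather than re-estimated; there is no end state; and no scaling or pseudocounts are used, so it is only suitable for short sequences.

```python
import math
import random

def forward_backward(x, states, start, trans, emit):
    n = len(x)
    f = [{l: start[l] * emit[l][x[0]] for l in states}]
    for i in range(1, n):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}  # assumption: no end state
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return f, b, sum(f[n - 1].values())

def baum_welch(x, states, alphabet, start, trans, emit, max_iter=50, tol=1e-6):
    history = []  # log P(x | theta), recorded before each update
    n = len(x)
    for _ in range(max_iter):
        f, b, px = forward_backward(x, states, start, trans, emit)
        history.append(math.log(px))
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # the likelihood no longer changes much
        A = {k: {l: sum(f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                        for i in range(n - 1)) / px for l in states} for k in states}
        E = {k: {s: sum(f[i][k] * b[i][k] for i in range(n) if x[i] == s) / px
                 for s in alphabet} for k in states}
        trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        emit = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return trans, emit, history

def sample(n, states, start, trans, emit, rng):
    """Draw n symbols from the model (used here only to make training data)."""
    x, k = [], rng.choices(states, weights=[start[s] for s in states])[0]
    for _ in range(n):
        syms = list(emit[k])
        x.append(rng.choices(syms, weights=[emit[k][s] for s in syms])[0])
        k = rng.choices(states, weights=[trans[k][s] for s in states])[0]
    return x

states, faces = ["F", "L"], list(range(1, 7))
start = {"F": 0.5, "L": 0.5}
true_trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
true_emit = {"F": {r: 1 / 6 for r in faces},
             "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}
rolls = sample(150, states, start, true_trans, true_emit, random.Random(0))

guess_trans = {"F": {"F": 0.7, "L": 0.3}, "L": {"F": 0.3, "L": 0.7}}
guess_emit = {"F": {r: 1 / 6 for r in faces},
              "L": {**{r: 0.15 for r in range(1, 6)}, 6: 0.25}}
trans, emit, history = baum_welch(rolls, states, faces, start, guess_trans, guess_emit)
print(history[0], history[-1])
```

Printing the first and last entries of `history` shows the log-likelihood before training and at the last evaluated parameter set; the sequence in between should never decrease beyond rounding error.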

The Baum-Welch Algorithm: comments
Time complexity: # iterations × $O(K^2 N)$
Guaranteed to increase the log likelihood of the model
$P(\theta \mid x) = P(x, \theta) / P(x) = P(x \mid \theta)\, P(\theta) / P(x)$
Not guaranteed to find globally best parameters
Converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining

Alternative: Viterbi Training
Initialization: same
Iteration:
 Perform Viterbi, to find $\pi^*$
 Calculate $A_{kl}$, $E_k(b)$ according to $\pi^*$ + pseudocounts
 Calculate the new parameters $a_{kl}$, $e_k(b)$
Until convergence
Notes:
 Convergence is guaranteed. Why?
 Does not maximize $P(x \mid \theta)$
 In general, worse performance than Baum-Welch
 Convenient: when interested in Viterbi parsing, no need to implement additional procedures (Forward, Backward)!!
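
Viterbi training can be sketched by alternating a Viterbi parse with the counting estimate. The names, the single shared pseudocount of 1, the initial parameters and the roll sequence are all assumptions for this example; the begin probabilities are kept fixed, and the loop stops as soon as the Viterbi path repeats.

```python
import math
from collections import Counter

def viterbi_path(x, states, start, trans, emit):
    """Most likely state path in log space (all used probabilities must be > 0)."""
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    back = []
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[i - 1][k] + math.log(trans[k][l]))
            ptr[l] = best
            row[l] = math.log(emit[l][x[i]]) + V[i - 1][best] + math.log(trans[best][l])
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda k: V[-1][k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def viterbi_training(x, states, alphabet, start, trans, emit, pseudo=1.0, max_iter=50):
    prev = None
    for _ in range(max_iter):
        path = viterbi_path(x, states, start, trans, emit)
        if path == prev:
            break  # same parse as last time, so the parameters would not change
        prev = path
        A = Counter(zip(path, path[1:]))  # counts along the Viterbi path
        E = Counter(zip(path, x))
        trans = {k: {l: (A[(k, l)] + pseudo)
                     / (sum(A[(k, m)] for m in states) + pseudo * len(states))
                     for l in states} for k in states}
        emit = {k: {s: (E[(k, s)] + pseudo)
                    / (sum(E[(k, t)] for t in alphabet) + pseudo * len(alphabet))
                    for s in alphabet} for k in states}
    return trans, emit, prev

states, faces = ["F", "L"], list(range(1, 7))
start = {"F": 0.5, "L": 0.5}
trans0 = {"F": {"F": 0.8, "L": 0.2}, "L": {"F": 0.2, "L": 0.8}}
emit0 = {"F": {r: 1 / 6 for r in faces},
         "L": {**{r: 0.12 for r in range(1, 6)}, 6: 0.4}}
rolls = [1, 3, 2, 5, 4, 6, 6, 6, 5, 6, 6, 6, 6, 2, 6, 6, 1, 4, 3, 2, 5, 1, 2, 4]

trans, emit, path = viterbi_training(rolls, states, faces, start, trans0, emit0)
print("".join(path))
```

The pseudocount keeps every re-estimated probability above 0, which the log-space Viterbi step relies on.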

Exercise: submit any time, groups of up to 3
1. Implement an HMM for the dishonest casino (or any other simple process you feel like)
2. Generate training sequences with the model
3. Implement Baum-Welch and Viterbi training
4. Show a few sets of initial parameters such that
 a. Baum-Welch and Viterbi differ significantly, and/or
 b. Baum-Welch converges to parameters close to the model, and to unreasonable parameters, depending on initial parameters
Do not use 0-probability transitions
Do not use 0s in the initial parameters
Do use pseudocounts in Viterbi
This exercise will replace the 1-3 lowest problems, depending on thoroughness