Lecture #8: Parameter Estimation for HMM with Hidden States: Baum-Welch Training, Viterbi Training, Extensions of HMM. Background readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. © Shlomo Moran, following Dan Geiger and Nir Friedman.

2 Parameter Estimation in HMM: what we have seen so far

3 The Parameters
An HMM model is defined by the probability parameters m_kl and e_k(b), for all states k, l and all symbols b; θ denotes the collection of these parameters. To determine the values of θ, we use a training set {x^1,...,x^n}, where each x^j is a sequence which is assumed to fit the model.

4 Maximum Likelihood Parameter Estimation for HMM looks for the θ which maximizes
p(x^1,..., x^n | θ) = ∏_j p(x^j | θ).

5 Finding ML parameters for HMM when all states are known
Let M_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We look for the parameters θ = {m_kl, e_k(b)} that maximize the likelihood. The optimal ML parameters θ are given by the normalized counts:
m_kl = M_kl / ∑_l' M_kl',   e_k(b) = E_k(b) / ∑_b' E_k(b').
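
To make the counting step concrete, here is a minimal sketch (not from the lecture, assuming sequences and state paths are given as lists of integer indices) of ML estimation when the state paths are known:

```python
# A minimal sketch of ML estimation with known state paths:
# count transitions and emissions, then normalize each row.
import numpy as np

def ml_estimate(state_paths, sequences, n_states, alphabet):
    """state_paths[j][i] / sequences[j][i] are the state / symbol index at
    position i of training example j; states are 0..n_states-1."""
    M = np.zeros((n_states, n_states))        # M[k, l] = #(k -> l transitions)
    E = np.zeros((n_states, len(alphabet)))   # E[k, b] = #(emissions of b from k)
    for s, x in zip(state_paths, sequences):
        for i in range(len(x)):
            E[s[i], x[i]] += 1
            if i > 0:
                M[s[i - 1], s[i]] += 1
    # Optimal ML parameters: normalize the counts row by row
    # (pseudocounts could be added to avoid zero rows).
    m = M / M.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return m, e
```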

6 Case 2: State paths are unknown
We need to maximize p(x|θ) = ∑_s p(x, s|θ), where the summation is over all state paths s which can produce the output sequence x. Finding the θ* which maximizes ∑_s p(x, s|θ) is hard, unlike finding the θ* which maximizes p(x, s|θ) for a single known pair (x, s).

7 Parameter Estimation when States are Unknown
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ* such that p(x|θ*) > p(x|θ).
3. Set θ = θ*.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training algorithm.

8 Baum-Welch Training
We start with some values of m_kl and e_k(b), which define an initial value of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that p(x|θ*) > p(x|θ). This is done by "imitating" the algorithm for Case 1, where all states are known.

9 When states were known, we counted…
In Case 1 we computed the optimal values of m_kl and e_k(b) (the optimal θ) by simply counting the number M_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

10 When states are unknown
M_kl and E_k(b) are taken as averages (expectations), computed according to the current parameters θ:
M_kl = ∑_s M^s_kl · p(s|x, θ), where M^s_kl is the number of k to l transitions in the state path s.
Similarly, E_k(b) = ∑_s E^s_k(b) · p(s|x, θ), where E^s_k(b) is the number of times state k emits symbol b along the path s with output x.

11 Computing averages of state transitions
Since the number of state paths s is exponential in L, it is too expensive to compute M_kl = ∑_s M^s_kl p(s|x, θ) in the naïve way. Hence we use dynamic programming: for each pair (k, l) and for each edge s_{i-1} → s_i we compute the expected number of "k to l" transitions over this edge, and then take M_kl to be the sum over all edges.

12 …and of letter emissions
Similarly, for each edge s_i → x_i and each state k, we compute the expected number of times that s_i = k, which is the expected number of "k → b" emissions on this edge (where b = x_i). Then we take E_k(b) to be the sum over all such edges. These expected values are computed assuming the current parameters θ.

13 Baum-Welch: Step E for M_kl (count the average number of state transitions)
For computing the averages, Baum-Welch computes, for each index i and each pair of states k, l, the probability
P(s_{i-1} = k, s_i = l | x, θ).
For this it uses the forward and backward algorithms.

14 Reminder: finding state probabilities
F_k(i) = p(x_1,…,x_i, s_i = k): the probability that a path emits (x_1,..,x_i) and has s_i = k.
B_k(i) = p(x_{i+1},…,x_L | s_i = k): the probability that a path emits (x_{i+1},..,x_L), given that s_i = k.
Then p(s_i = k, x) = F_k(i) · B_k(i). The values {F_k(i), B_k(i)} for every i and k are computed by one run each of the forward and backward algorithms.
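
As a reference, here is a minimal, unscaled sketch of the forward and backward algorithms under assumed conventions (a uniform initial distribution over states, symbols given as integer indices); a real implementation would use scaling or log space to avoid underflow:

```python
import numpy as np

def forward(x, m, e):
    """F[i, k] = p(x_1..x_{i+1}, s_{i+1} = k) with 0-based positions."""
    L, K = len(x), m.shape[0]
    F = np.zeros((L, K))
    F[0] = (1.0 / K) * e[:, x[0]]              # assumed uniform initial distribution
    for i in range(1, L):
        F[i] = (F[i - 1] @ m) * e[:, x[i]]     # F_l(i) = e_l(x_i) * sum_k F_k(i-1) m_kl
    return F

def backward(x, m, e):
    """B[i, k] = p(x_{i+2}..x_L | s_{i+1} = k) with 0-based positions."""
    L, K = len(x), m.shape[0]
    B = np.zeros((L, K))
    B[L - 1] = 1.0                             # B_k(L) = 1
    for i in range(L - 2, -1, -1):
        B[i] = m @ (e[:, x[i + 1]] * B[i + 1]) # B_k(i) = sum_l m_kl e_l(x_{i+1}) B_l(i+1)
    return B

# p(s_i = k, x) = F[i, k] * B[i, k];  p(x) = F[-1].sum()
```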

15 Baum-Welch: Step E for M_kl
Claim: by the probability distribution of the HMM,
p(s_{i-1} = k, s_i = l | x, θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / p(x|θ)
(m_kl and e_l(x_i) are the parameters defined by θ, and F_k(i-1), B_l(i) are the outputs of the forward and backward algorithms).

16 Proof of claim
P(x_1,…,x_L, s_{i-1} = k, s_i = l | θ)
= P(x_1,…,x_{i-1}, s_{i-1} = k | θ) · m_kl · e_l(x_i) · P(x_{i+1},…,x_L | s_i = l, θ)
= F_k(i-1) · m_kl · e_l(x_i) · B_l(i),
where the first factor is given by the forward algorithm and the last by the backward algorithm. Dividing by p(x|θ) yields
p(s_{i-1} = k, s_i = l | x, θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / p(x|θ).

17 Step E for M_kl (end)
For each pair (k, l), compute the expected number of state transitions from k to l as the sum of the expected numbers of k to l transitions over all L edges:
M_kl = ∑_i p(s_{i-1} = k, s_i = l | x, θ) = (1/p(x|θ)) ∑_i F_k(i-1) · m_kl · e_l(x_i) · B_l(i).
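
A sketch of this computation for a single sequence, reusing the forward/backward sketch above; it sums the edge probabilities F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / p(x|θ) over the transitions inside the sequence (no explicit begin state is modeled here):

```python
import numpy as np

def expected_transitions(x, m, e, F, B):
    """Expected k->l transition counts M[k, l] for one sequence x,
    given F, B from the forward/backward sketch."""
    L, K = len(x), m.shape[0]
    p_x = F[-1].sum()                          # p(x | theta)
    M = np.zeros((K, K))
    for i in range(1, L):
        # outer(...)[k, l] = F_k(i-1) * e_l(x_i) * B_l(i); multiply by m_kl and normalize
        M += np.outer(F[i - 1], e[:, x[i]] * B[i]) * m / p_x
    return M
```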

18 Step E for M_kl, with many sequences
Claim: when we have n independent input sequences (x^1,..., x^n) of lengths L_1,…,L_n, then M_kl is given by the sum of the per-sequence expected counts:
M_kl = ∑_j (1/p(x^j|θ)) ∑_i F^j_k(i-1) · m_kl · e_l(x^j_i) · B^j_l(i),
where F^j and B^j are the forward and backward values computed for x^j.

19 Proof of Claim
When we have n independent input sequences (x^1,..., x^n), the probability space is the product of n spaces, one per sequence. A simple event in this space is a choice of a state path s^j for each sequence x^j, and its probability under parameters θ is
p((s^1, x^1),…,(s^n, x^n) | θ) = ∏_j p(s^j, x^j | θ).

20 Proof of Claim (cont.)
By independence, the probability of that simple event given x = (x^1,..,x^n) is
p(s^1,…,s^n | x, θ) = ∏_j p(s^j | x^j, θ),
and the probability of the compound event (s^j, x^j) given x = (x^1,..,x^n) is simply p(s^j | x^j, θ).

21 Proof of Claim (end)
Hence the expected total number of k to l transitions is the sum, over the n sequences, of the per-sequence expectations:
M_kl = ∑_j ∑_{s^j} M^{s^j}_kl · p(s^j | x^j, θ),
and each inner sum is computed for x^j by the single-sequence formula of slide 17.

22 Baum-Welch: Step E for E_k(b) (count the expected number of letter emissions)
For each state k and each symbol b: for each i where x_i = b, compute the expected number of times that s_i = k.

23 Baum-Welch: Step E for E_k(b)
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i for which x_i = b:
E_k(b) = ∑_{i: x_i = b} p(s_i = k | x, θ) = (1/p(x|θ)) ∑_{i: x_i = b} F_k(i) · B_k(i).
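
A matching sketch for the expected emission counts, again reusing F and B from the forward/backward sketch:

```python
import numpy as np

def expected_emissions(x, F, B, n_symbols):
    """Expected emission counts E[k, b] for one sequence x."""
    L, K = F.shape
    p_x = F[-1].sum()                          # p(x | theta)
    E = np.zeros((K, n_symbols))
    for i in range(L):
        # p(s_i = k | x, theta) = F_k(i) B_k(i) / p(x|theta), added to the column of x_i
        E[:, x[i]] += F[i] * B[i] / p_x
    return E
```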

24 Step E for E_k(b), many sequences
Exercise: when we have n sequences (x^1,..., x^n), write the formula for the expected number of emissions of b from k (analogous to slide 18).

25 Summary: the E part of the Baum-Welch training
This part computes the expected numbers M_kl of k→l transitions, for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b. The next step is the M step, which is identical to the computation of the optimal ML parameters when all states are known.

26 Baum-Welch: Step M
Use the M_kl's and E_k(b)'s to compute the new values of m_kl and e_k(b), by normalizing the expected counts as in the known-states case. These values define θ*. The correctness of the EM algorithm implies that if θ* ≠ θ, then
p(x^1,..., x^n | θ*) > p(x^1,..., x^n | θ),
i.e., θ* increases the probability of the data, unless it is equal to θ (this will follow from the correctness of the EM algorithm, to be proved later). This procedure is iterated until some convergence criterion is met.
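
Putting the pieces together, a sketch of one full Baum-Welch iteration over several training sequences, reusing the helper functions sketched above; in practice pseudocounts are added to the expected counts to avoid zero probabilities:

```python
import numpy as np

def baum_welch_iteration(sequences, m, e):
    """One E step + M step over all training sequences; returns new parameters
    and the log-likelihood under the *current* parameters."""
    K, S = e.shape
    M = np.zeros((K, K))
    E = np.zeros((K, S))
    log_likelihood = 0.0
    for x in sequences:
        F, B = forward(x, m, e), backward(x, m, e)
        M += expected_transitions(x, m, e, F, B)   # E step: expected counts
        E += expected_emissions(x, F, B, S)
        log_likelihood += np.log(F[-1].sum())
    # M step: identical to the known-states case, but with expected counts.
    m_new = M / M.sum(axis=1, keepdims=True)
    e_new = E / E.sum(axis=1, keepdims=True)
    return m_new, e_new, log_likelihood
```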

27 Viterbi training: Maximizing the probability of the most probable path

28 Assume that rather than finding the θ which maximizes the likelihood of the input x^1,..,x^n, we wish to maximize the probability of a most probable path: to find parameters θ and state paths s(x^1),..,s(x^n) such that the value of p(s(x^1),..,s(x^n), x^1,..,x^n | θ) is maximized. Clearly, s(x^j) should be a most probable path for x^j under the parameters θ. This is done by Viterbi training. Below we assume only one sequence (n = 1).

29 Maximizing the probability of a most probable path
States are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e. the value of p(s(x^1),..,s(x^n), x^1,..,x^n | θ), where s(x^j) is a most probable path (under θ) for x^j. We assume only one sequence (n = 1).

30 Viterbi training
Start from given values of m_kl and e_k(b), which define an initial value of θ. Each iteration:
Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x | θ).

31 Viterbi training (cont.)
Step 2: Use the ML method for HMM when the states are known to find the θ* which maximizes p(s(x), x | θ*).
Note: if after Step 2 we have p(s(x), x | θ*) = p(s(x), x | θ), then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.

32 Viterbi training (cont.)
Step 3: If θ ≠ θ*, set θ ← θ* and repeat. If θ = θ*, stop.
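
A sketch of Viterbi training for a single sequence under the same assumed conventions: viterbi() finds a most probable path in log space, and the M step reuses the counting-based ml_estimate sketch from the known-states case. The stopping test uses the observation that if the path does not change, neither do the parameters:

```python
import numpy as np

def viterbi(x, m, e):
    """Return a most probable state path for x (uniform initial distribution)."""
    L, K = len(x), m.shape[0]
    V = np.full((L, K), -np.inf)
    ptr = np.zeros((L, K), dtype=int)
    V[0] = np.log(1.0 / K) + np.log(e[:, x[0]])
    for i in range(1, L):
        scores = V[i - 1][:, None] + np.log(m)      # scores[k, l]; zero probs become -inf
        ptr[i] = scores.argmax(axis=0)              # best predecessor k for each state l
        V[i] = scores.max(axis=0) + np.log(e[:, x[i]])
    path = [int(V[-1].argmax())]
    for i in range(L - 1, 0, -1):                   # trace back the best path
        path.append(int(ptr[i][path[-1]]))
    return path[::-1]

def viterbi_training(x, m, e, n_symbols, max_iter=100):
    prev_path = None
    for _ in range(max_iter):
        path = viterbi(x, m, e)                     # Step 1: most probable path
        if path == prev_path:                       # unchanged path => unchanged theta
            break
        # Step 2: ML estimation as if the states were known (pseudocounts omitted)
        m, e = ml_estimate([path], [x], m.shape[0], range(n_symbols))
        prev_path = path
    return m, e
```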

33 Viterbi training (end)
Exercise: generalize the algorithm to the case of n training sequences x^1,..,x^n: find paths {s(x^1),..,s(x^n)} and parameters θ so that {s(x^1),..,s(x^n)} are most probable paths for x^1,..,x^n under θ.

34 Extensions of HMM

35 1. Monitoring probabilities of repetitions
Markov chains are rather limited in describing sequences of symbols with non-random structure. For instance, a Markov chain with a self-transition of probability p on a state A forces the probability that A is repeated exactly k+1 times to be (1-p)p^k, for some p (a geometric distribution). By adding states we may bypass this restriction.

36 1. State duplications
An extension of Markov chains which allows the distribution of segments in which a state is repeated k+1 times to have any desired value: assign k+1 states A_1,…,A_{k+1} to represent the same "real" state A. This may model k repetitions (or fewer) with any desired probability.

37 2. Silent states
- States which do not emit symbols.
- Can be used to model repetitions.
- Also used to allow arbitrary jumps (may be used to model deletions).
- Need to generalize the forward and backward algorithms for arbitrary acyclic digraphs to account for the silent states.

38 E.g., the forward algorithm is modified as follows: after computing F_l(i) for the regular (emitting) states l at position i, propagate the values through the silent states in topological order, setting F_l(i) = ∑_k F_k(i) · m_kl for each silent state l (there is no emission term, and the position i does not advance). Directed cycles of silent (or other) states complicate things, and should be avoided.

39 3. High-Order Markov Chains
Markov chains in which the transition probabilities depend on the last k states:
P(x_i | x_{i-1},...,x_1) = P(x_i | x_{i-1},...,x_{i-k}).
Such a chain can be represented by a standard (first-order) Markov chain with more states, whose states are k-tuples of original states. E.g., for k = 2 over the alphabet {A, B}, the states are AA, AB, BA, BB.
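
An illustrative sketch of this construction for k = 2: the states of the first-order chain are ordered pairs of original symbols, and only transitions of the form (u,v) → (v,w) are allowed, with probability P(w | u,v). The function name and dictionary representation are only for illustration:

```python
from itertools import product

def second_order_to_first_order(p2, alphabet):
    """p2[(u, v, w)] = P(next symbol is w | previous two symbols were u, v).
    Returns the pair states and the first-order transition probabilities."""
    pairs = list(product(alphabet, repeat=2))
    trans = {}
    for (u, v) in pairs:
        for w in alphabet:
            # only (u, v) -> (v, w) edges exist; all other transitions have probability 0
            trans[((u, v), (v, w))] = p2[(u, v, w)]
    return pairs, trans
```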

40 4. Inhomogeneous Markov Chains
- An important task in analyzing DNA sequences is recognizing the genes which code for proteins.
- A triplet of 3 nucleotides (a codon) codes for an amino acid.
- It is known that in parts of the DNA which code for genes, the three codon positions have different statistics.
- Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T) but also its position within the codon: the same nucleotide in different positions will have different transition probabilities. This is used in the GENEMARK gene finding program (1993).