
Inference and Parameter Estimation in HMM. Computational Genomics, Lecture 11. © Shlomo Moran, Ydo Wexler, Dan Geiger (Technion), modified by Benny Chor (TAU). Background reading: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al., 2001.

2 Hidden Markov Models. Today we will further explore algorithmic questions related to HMMs. Recall that an HMM is a network of the following form: a chain of hidden states S_1, S_2, …, S_{L-1}, S_L, where each hidden state S_i emits an observation O_i.

3 HMMs – Question I
- Given an observation sequence O = (O_1 O_2 O_3 … O_L) and a model M = {A, B, π}, how do we efficiently compute Pr(O|M), the probability that the given model M produces the observation O in a run of length L?
- Solved using the forward-backward algorithm (dynamic programming), the same machinery used in Baum-Welch training.

4 Coin-Tossing Example. L tosses; hidden states H_1, …, H_L (Fair/Loaded) emit observations X_1, …, X_L (Head/Tail). Start: Pr(Fair) = Pr(Loaded) = 1/2. Transitions: stay in the same state with probability 0.9, switch with probability 0.1. Emissions: the fair coin gives head/tail with probability 1/2 each; the loaded coin gives head with probability 3/4 and tail with probability 1/4.
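
As a concrete companion to Question I, here is a minimal sketch (assuming NumPy and the encoding 0 = Fair/head, 1 = Loaded/tail, which is not part of the original slides) of the forward algorithm computing Pr(O|M) for this coin model:

```python
import numpy as np

# Minimal sketch of the forward algorithm (Question I) for the coin-tossing HMM.
# States: 0 = Fair, 1 = Loaded; observations: 0 = head, 1 = tail.
start = np.array([0.5, 0.5])                  # Pr(H_1)
trans = np.array([[0.9, 0.1],                 # Pr(H_{i+1} | H_i): rows = from, cols = to
                  [0.1, 0.9]])
emit  = np.array([[0.5, 0.5],                 # Pr(X_i | H_i): Fair emits head/tail 1/2 each
                  [0.75, 0.25]])              # Loaded: head 3/4, tail 1/4

def forward(obs):
    """Return the forward table f[t, k] = Pr(X_1..X_t, H_t = k)."""
    f = np.zeros((len(obs), 2))
    f[0] = start * emit[:, obs[0]]
    for t in range(1, len(obs)):
        f[t] = (f[t - 1] @ trans) * emit[:, obs[t]]
    return f

obs = [0, 0, 1]                               # head, head, tail
f = forward(obs)
print(f)                                      # row t holds Pr(X_1..X_t, H_t = Fair/Loaded)
print("Pr(O | M) =", f[-1].sum())             # likelihood of the whole sequence
```

On the observation head, head, tail this gives Pr(O|M) ≈ 0.137, matching the forward values worked out by hand later in the lecture.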

5 HMM – Question II (Harder)
- Given an observation sequence O = (O_1 O_2 … O_L) and a model M = {A, B, π}, how do we efficiently compute the most probable sequence(s) of states S?
- Namely, the sequence of states S = (S_1 S_2 … S_L) which maximizes Pr(S|O,M) (equivalently, the joint Pr(O, S|M)): the probability that the given model M produces the given observation O while going through the specific sequence of states S.
- Recall that given a model M, a sequence of observations O, and a sequence of states S, we can efficiently compute Pr(O, S|M) (watching out for numeric underflow).

6 Easy: finding the most probable state at each step t
1. The forward algorithm computes f_t(i) = Pr(O_1,…,O_t, q_t = S_i) for t = 1,…,L.
2. The backward algorithm computes b_t(i) = Pr(O_{t+1},…,O_L | q_t = S_i) for t = 1,…,L.
3. Return Pr(q_t = S_i | O) = f_t(i) b_t(i) / Pr(O) for t = 1,…,L (the proof uses conditional probability).
To compute this for every i, simply run the forward and backward algorithms once, and compute f_t(i) b_t(i) for every i, t.
Here f_t(i) is the probability of a path which emits (O_1,…,O_t) and ends in state q_t = S_i, and b_t(i) is the probability of emitting (O_{t+1},…,O_L) given q_t = S_i. Notice the change of notation: f_t(i), b_t(i) denote the forward and backward probabilities.
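
A minimal sketch of this posterior-decoding recipe, again on the coin model with the hypothetical 0/1 encoding from the previous snippet:

```python
import numpy as np

# Posterior state probabilities Pr(q_t = S_i | O) via forward-backward,
# reusing the coin-tossing parameters (0 = Fair/head, 1 = Loaded/tail).
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit  = np.array([[0.5, 0.5], [0.75, 0.25]])

def forward(obs):
    f = np.zeros((len(obs), 2))
    f[0] = start * emit[:, obs[0]]
    for t in range(1, len(obs)):
        f[t] = (f[t - 1] @ trans) * emit[:, obs[t]]
    return f

def backward(obs):
    b = np.ones((len(obs), 2))                       # b_L(i) = 1
    for t in range(len(obs) - 2, -1, -1):
        b[t] = trans @ (emit[:, obs[t + 1]] * b[t + 1])
    return b

obs = [0, 0, 1]                                      # head, head, tail
f, b = forward(obs), backward(obs)
posterior = f * b / f[-1].sum()                      # Pr(q_t = S_i | O) = f_t(i) b_t(i) / Pr(O)
print(posterior)                                     # each row sums to 1;
                                                     # last row ≈ [0.53, 0.47] (Fair, Loaded)
```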

7 Reminder: Most Probable State Path. Given an output sequence O = (O_1,…,O_L), a most probable path S* = (S*_1,…,S*_L) is one which maximizes Pr(S|O).

8 Reformulating: MAP in a Given HMM (all probabilities conditioned on M)
1. Recall that Question I, the likelihood of the observation, is to compute Pr(O_1,…,O_L) = ∑_{(S_1,…,S_L)} Pr(O_1,…,O_L, S_1,…,S_L).
2. Now we wish to compute a similar quantity, P*(O_1,…,O_L) = max_{(S_1,…,S_L)} Pr(O_1,…,O_L, S_1,…,S_L), and to find a Maximum A Posteriori (MAP) assignment (S_1*,…,S_L*) that attains this maximum.

9 Reformulating: MAP in a Given HMM (all probabilities conditioned on M)
Goal: compute P*(O_1,…,O_L) = max_{(S_1,…,S_L)} Pr(S_1,…,S_L | O_1,…,O_L) and find a Maximum A Posteriori (MAP) assignment (S_1*,…,S_L*) attaining it (maximizing this posterior is equivalent to maximizing the joint above, since Pr(O_1,…,O_L) is fixed).
Solved by the Viterbi algorithm (Andrew Viterbi, 1967); see the HMM link on Roger Boyle's page at Leeds for additional information, including an online demo.

10 The Viterbi Algorithm: whiteboard presentation.
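
The algorithm itself is developed on the whiteboard; as an illustration only, here is a compact Viterbi decoder for the coin-tossing model (same assumed 0/1 encoding as in the earlier snippets):

```python
import numpy as np

# Minimal sketch of Viterbi decoding for the coin-tossing HMM
# (0 = Fair/head, 1 = Loaded/tail); not the whiteboard derivation itself.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit  = np.array([[0.5, 0.5], [0.75, 0.25]])

def viterbi(obs):
    """Return (most probable state path, its joint probability Pr(O, S*))."""
    L, K = len(obs), len(start)
    v = np.zeros((L, K))            # v[t, k] = best joint prob of a path ending in state k at time t
    ptr = np.zeros((L, K), dtype=int)
    v[0] = start * emit[:, obs[0]]
    for t in range(1, L):
        scores = v[t - 1][:, None] * trans          # scores[i, j]: come from i, move to j
        ptr[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * emit[:, obs[t]]
    path = [int(v[-1].argmax())]
    for t in range(L - 1, 0, -1):                   # backtrack
        path.append(int(ptr[t][path[-1]]))
    return path[::-1], v[-1].max()

path, p = viterbi([0, 0, 1])                        # head, head, tail
print(path, p)                                      # expect [1, 1, 1] (Loaded x3), p ≈ 0.05695
```

On head, head, tail it returns the all-Loaded path with joint probability ≈ 0.0570, which matches the exhaustive enumeration shown shortly.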

11 Example: Dishonest Casino

12 Dishonest Casino. Computing the posterior probability of "fair" at each time point (separately) in a long sequence (figure not reproduced here).

13 Dishonest Casino. Computing the Maximum A Posteriori (MAP) state sequence (S_1*,…,S_L*) in a long sequence using Viterbi (figure from Durbin et al.'s book, p. 57).

14 Coin-Tossing Example – Viterbi's Algorithm. A reminder of Question II: what is the most likely sequence of states to generate the given data? The model is the coin-tossing HMM from before (start 1/2 each; stay with probability 0.9, switch with 0.1; fair emits head/tail with probability 1/2 each, loaded emits head with probability 3/4 and tail with 1/4).

15 Loaded Coin Example – Exhaustive Solution
Small example: 3 tosses. Outcomes: head, head, tail.
Each product below is start probability × emission probabilities × transition probabilities, i.e. the joint Pr(O_1,O_2,O_3, S_1,S_2,S_3):

S_1,S_2,S_3   Pr(O_1,O_2,O_3, S_1,S_2,S_3)
F,F,F   (0.5)^3 * 0.5 * (0.9)^2             = 0.050625
F,F,L   (0.5)^2 * 0.25 * 0.5 * 0.9 * 0.1    = 0.0028125
F,L,F   0.5 * 0.75 * 0.5 * 0.5 * 0.1 * 0.1  = 0.0009375
F,L,L   0.5 * 0.75 * 0.25 * 0.5 * 0.1 * 0.9 = 0.00421875
L,F,F   0.75 * 0.5 * 0.5 * 0.5 * 0.1 * 0.9  = 0.0084375
L,F,L   0.75 * 0.5 * 0.25 * 0.5 * 0.1 * 0.1 = 0.00046875
L,L,F   0.75 * 0.75 * 0.5 * 0.5 * 0.9 * 0.1 = 0.01265625
L,L,L   0.75 * 0.75 * 0.25 * 0.5 * 0.9 * 0.9 = 0.056953125   <- max
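
For a sanity check, the same table can be reproduced by brute force (a small sketch; the dictionaries just restate the model parameters above):

```python
import itertools
import numpy as np

# Brute-force check of the exhaustive table: enumerate all 2^3 state paths
# and compute the joint probability Pr(O, S) for O = head, head, tail.
start = {'F': 0.5, 'L': 0.5}
trans = {('F', 'F'): 0.9, ('F', 'L'): 0.1, ('L', 'F'): 0.1, ('L', 'L'): 0.9}
emit  = {('F', 'head'): 0.5, ('F', 'tail'): 0.5, ('L', 'head'): 0.75, ('L', 'tail'): 0.25}

obs = ['head', 'head', 'tail']
best = max(
    (np.prod([start[s[0]]] +
             [emit[(s[i], obs[i])] for i in range(3)] +
             [trans[(s[i - 1], s[i])] for i in range(1, 3)]), s)
    for s in itertools.product('FL', repeat=3)
)
print(best)   # expect (~0.05695, ('L', 'L', 'L'))
```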

16 Coin-Tossing Example – Viterbi's Algorithm, Forward Phase
Numeric example: 3 tosses. Outcomes: head, head, tail.
Here b(S_i) denotes the best score of any continuation from state S_i at step i, computed in a preliminary backward pass: b(S_1=loaded) = 0.151875, b(S_1=fair) = 0.2025, b(S_2=loaded) = 0.225, b(S_2=fair) = 0.45, b(S_3) = 1.

S_1* = argmax Pr(S_1) Pr(O_1|S_1) b(S_1)
     = argmax{Pr(loaded) Pr(head|loaded) * 0.151875, Pr(fair) Pr(head|fair) * 0.2025} = loaded

S_2* = argmax Pr(S_2|loaded_1) Pr(head|S_2) b(S_2)
     = argmax{Pr(loaded|loaded_1) Pr(head|loaded) * 0.225, Pr(fair|loaded_1) Pr(head|fair) * 0.45} = loaded

S_3* = argmax Pr(S_3|loaded_2) Pr(tail|S_3) b(S_3)
     = argmax{Pr(loaded|loaded_2) Pr(tail|loaded), Pr(fair|loaded_2) Pr(tail|fair)} = loaded

We have dealt with HMM Question I (computing the likelihood of an observation) and with HMM Question II (finding the most likely sequence of states). Time for HMM Question III (the hardest): determining the model parameters.

18 HMM – Question III (Hardest)
- Given an observation sequence O = (O_1 O_2 … O_L) and a class of models, each of the form M = {A, B, π}, which specific model "best" explains the observations?
- A solution to HMM Question I enables the efficient computation of Pr(O|M) (the probability that a specific model M produces the observation O).
- Question III can be viewed as a learning problem: we want to use the sequence of observations in order to "train" an HMM and learn the optimal underlying model parameters (transition and emission probabilities).

19 Parameter Estimation for HMM. An HMM with a given structure is defined by the parameters a_kl (transition from state k to state l) and e_k(b) (emission of symbol b from state k), for all states k, l and all symbols b. Let θ denote the collection of these parameters.

20 Parameter Estimation for HMM. To determine the values of (the parameters in) θ, we use a training set {x_1,...,x_n}, where each x_j is a sequence of observations assumed to be generated by the model. Given the parameters θ, each sequence x_j has a well-defined likelihood Pr(x_j | θ) (or Pr(x_j | θ, HMM)).

21 ML Parameter Estimation for HMM. The elements of the training set {x_1,...,x_n} are assumed to be independent, so Pr(x_1,..., x_n | θ) = ∏_j Pr(x_j | θ). ML parameter estimation looks for the θ which maximizes this likelihood. The exact method for finding or approximating this θ depends on the nature of the training set used.

22 Data for HMM. Possible properties of (the sequences in) the training set:
1. For each x_j, what is our information on the states s_i? (The symbols x_i are usually known.)
2. The size (number of sequences) of the training set.

23 Case 1: State Sequences are Fully Known. We know the complete state sequence of each sequence in the training set {x_1,...,x_n}. We wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b. By the ML method, we look for parameters θ* which maximize the probability of the sample set: Pr(x_1,...,x_n | θ*) = max_θ Pr(x_1,...,x_n | θ).

24 Case 1: Sequences Known; the ML Method. Let m_kl = |{i : s_{i-1} = k, s_i = l}| and m_k(b) = |{i : s_i = k, x_i = b}| be the empirical counts in x_j. For each x_j (sequence no. j, taken together with its known state path) we have
Pr(x_j | θ) = ∏_{k,l} a_kl^{m_kl} · ∏_k ∏_b e_k(b)^{m_k(b)}.

25 Case 1 (cont.). By the independence of the x_j's, Pr(x_1,...,x_n | θ) = ∏_j Pr(x_j | θ). Thus, if A_kl = total number of transitions from k to l in the training set, and E_k(b) = total number of emissions of symbol b from state k in the training set, we have
Pr(x_1,...,x_n | θ) = ∏_{k,l} a_kl^{A_kl} · ∏_k ∏_b e_k(b)^{E_k(b)}.

26 Case 1 (cont.). So we need to find the a_kl's and e_k(b)'s which maximize
F = ∏_{k,l} a_kl^{A_kl} · ∏_k ∏_b e_k(b)^{E_k(b)},
subject to: for every state k, ∑_l a_kl = 1 and ∑_b e_k(b) = 1 (with all parameters nonnegative).

27 Case 1 (cont.). Rewriting, we need to maximize
F = ∏_k [ ∏_l a_kl^{A_kl} ] · ∏_k [ ∏_b e_k(b)^{E_k(b)} ],
a product of independent factors: one factor per state k for its outgoing transitions, and one per state k for its emissions.

28 Case 1 (cont.). If we maximize each of these factors separately, then we also maximize F. Each factor is a simpler ML problem (estimating the face probabilities of a multi-faceted die). This problem admits a simple, intuitive solution, and was (or will be) solved in a homework assignment.

29 Applying the ML Method to HMM. Let A_kl = total number of transitions from k to l in the training set, and E_k(b) = total number of emissions of symbol b from state k in the training set. We need to maximize ∏_{k,l} a_kl^{A_kl} · ∏_k ∏_b e_k(b)^{E_k(b)} subject to the normalization constraints.

30 Applying the ML Method to HMM (cont.). We apply the ML solution to get, for each state k, the parameters {a_kl | l = 1,..,m} and {e_k(b) | b ∈ Σ}:
a_kl = A_kl / ∑_{l'} A_{kl'},   e_k(b) = E_k(b) / ∑_{b'} E_k(b'),
which gives the optimal ML parameters.
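
A minimal sketch of this Case 1 estimator: count transitions and emissions in fully labeled sequences, then normalize per state. The two labeled training sequences are made up purely for illustration, and the pseudo argument anticipates the pseudocount correction on the next slide:

```python
from collections import defaultdict

# Case 1 (fully known state paths): count transitions and emissions, then normalize.
# The tiny labeled training set below is made up purely for illustration
# (states F/L, symbols H/T, as in the coin example).
training = [
    ("FFLLL", "HTHHH"),   # (state path, observation sequence)
    ("FFFFL", "THTHH"),
]

A = defaultdict(float)    # A[(k, l)] = #(transitions k -> l)
E = defaultdict(float)    # E[(k, b)] = #(emissions of symbol b from state k)
for states, obs in training:
    for i, (k, b) in enumerate(zip(states, obs)):
        E[(k, b)] += 1
        if i > 0:
            A[(states[i - 1], k)] += 1

state_set = sorted({k for path, _ in training for k in path})
symbol_set = sorted({b for _, obs in training for b in obs})

def row_normalize(counts, k, cols, pseudo=0.0):
    """ML estimate for one state k: counts / row total.
    Setting pseudo > 0 gives the pseudocount version of the next slide."""
    total = sum(counts[(k, c)] + pseudo for c in cols)
    return {c: ((counts[(k, c)] + pseudo) / total if total else 0.0) for c in cols}

a = {k: row_normalize(A, k, state_set) for k in state_set}    # a[k][l]
e = {k: row_normalize(E, k, symbol_set) for k in state_set}   # e[k][b]
print(a)
print(e)
```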

31 Adding Pseudocounts in HMM. If the sample set is too small, we may get a biased result, reflecting the sample but not the true model (overfitting). In this case we modify the actual counts using our prior knowledge/belief: r_kl is our prior belief on transitions from k to l, and r_k(b) is our prior belief on emissions of b from state k. The estimates become
a_kl = (A_kl + r_kl) / ∑_{l'} (A_{kl'} + r_{kl'}),   e_k(b) = (E_k(b) + r_k(b)) / ∑_{b'} (E_k(b') + r_k(b')).

32 Summary of Case 1: State Sequences are Fully Known. We know the complete state sequence of each sequence in the training set {x_1,...,x_n}, and we wish to estimate a_kl and e_k(b) for all pairs of states k, l and symbols b. We just showed a method which finds the (unique) parameters θ* maximizing Pr(x_1,...,x_n | θ*) = max_θ Pr(x_1,...,x_n | θ).

33 Case 2: State Paths are Unknown. In this case only the values of the x_i's of the input sequences are known. This is an ML problem with "missing data". We wish to find θ* such that Pr(x|θ*) = max_θ Pr(x|θ). For each sequence x, Pr(x|θ) = ∑_s Pr(x, s|θ), taken over all state paths s.

34 Case 2: State Paths are Unknown. So we need to maximize Pr(x|θ) = ∑_s Pr(x, s|θ), where the summation is over all state paths s that can produce the output sequence x. Finding the θ* which maximizes ∑_s Pr(x, s|θ) is hard, unlike finding the θ* which maximizes Pr(x, s|θ) for a single fully observed pair (x, s).

35 ML Parameter Estimation for HMM. The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Compute Pr(x_1,..., x_n | θ) (Question I).
3. Look for a "close by" θ' such that Pr(x_1,..., x_n | θ') > Pr(x_1,..., x_n | θ).
4. Set θ = θ'.
5. Repeat until some convergence criterion is met.
This fits the general framework of hill climbing, or local search. We will next discuss a more "targeted" search strategy.

36 ML Parameter Estimation for HMM. The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find a θ' such that Pr(x_1,..., x_n | θ') > Pr(x_1,..., x_n | θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
The Expectation-Maximization (EM) algorithm, which we will describe next, fits this general framework. For the specific case of HMM, it is called Baum-Welch training.

37 Learning the Parameters (EM Algorithm). A common algorithm for learning the parameters from unlabeled sequences is Expectation-Maximization (EM). In the current context it reads as follows:
Start with some probability tables (many possible choices), then iterate until convergence:
- E-step: compute Pr(q_{t-1} = S_i, q_t = S_j | O_1,…,O_L) using the current probability tables ("current parameters").
- M-step: use the expected counts found to update the local probability tables via maximum likelihood (a normalized ratio of expected counts, of the form n_{S_i→S_j} / n_{S_i}).
We start with the E-step (closely related to so-called belief updating in AI). As usual, whiteboard presentation.

38 Baum-Welch Training. We start with some initial values of a_kl and e_k(b), which define an initial θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* such that Pr(x|θ*) > Pr(x|θ). Each iteration consists of three main steps: first computations (forward and backward), then expectation, and finally maximization.

39 Baum-Welch Step 1: Expectation. Compute the expected number of state transitions. For each sequence x_j, for each position i, and for each pair of states k, l, compute the posterior state-transition probability Pr(s_{i-1} = k, s_i = l | x_j, θ).

40 Step 1: Computing Pr(s_{i-1} = k, s_i = l | x_j, θ). Using the forward and backward values,
Pr(s_{i-1} = k, s_i = l | x_j, θ) = f_{i-1}(k) · a_kl · e_l(x_i) · b_i(l) / Pr(x_j | θ).

41 Step 1 (cont.). For each sequence x_j, we computed the probability of a transition from state k to l at each position. Now, if x_j is of length L_j, then the expected number of transitions from state k to l along the sequence equals
∑_{i=2}^{L_j} Pr(s_{i-1} = k, s_i = l | x_j, θ).

42 Step 1 (end). For n sequences x_1, …, x_n, the expected number (summed over all sequences) of transitions from state k to l equals
A_kl = ∑_{j=1}^{n} ∑_{i=2}^{L_j} Pr(s_{i-1} = k, s_i = l | x_j, θ).

43 Baum-Welch Step 2: For each state k and each symbol b, compute the expected number of emissions of b from state k:
E_k(b) = ∑_{j=1}^{n} ∑_{i : x_i = b} Pr(s_i = k | x_j, θ)
(the inner sum is only over positions in the sequence where b was emitted).

44 Baum-Welch Step 3: Maximization. Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b), exactly as in Case 1 (a_kl = A_kl / ∑_{l'} A_{kl'}, e_k(b) = E_k(b) / ∑_{b'} E_k(b')). These values define θ*. It can be shown that Pr(x_1,..., x_n | θ*) ≥ Pr(x_1,..., x_n | θ), i.e., θ* does not decrease (and typically increases) the probability of the data. This procedure is iterated until some convergence criterion is met.
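
Putting the three steps together, here is a compact Baum-Welch sketch for the two-state coin model (the initial parameters and training sequences are made up for illustration, and the start distribution is kept fixed for brevity):

```python
import numpy as np

# Minimal sketch of Baum-Welch (EM) training on the coin-tossing HMM.
def forward(obs, start, trans, emit):
    f = np.zeros((len(obs), len(start)))
    f[0] = start * emit[:, obs[0]]
    for t in range(1, len(obs)):
        f[t] = (f[t - 1] @ trans) * emit[:, obs[t]]
    return f

def backward(obs, trans, emit):
    b = np.ones((len(obs), trans.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        b[t] = trans @ (emit[:, obs[t + 1]] * b[t + 1])
    return b

def baum_welch_step(seqs, start, trans, emit):
    K, M = emit.shape
    A = np.zeros((K, K))          # expected transition counts A_kl
    E = np.zeros((K, M))          # expected emission counts E_k(b)
    for obs in seqs:
        f, b = forward(obs, start, trans, emit), backward(obs, trans, emit)
        px = f[-1].sum()          # Pr(x | theta)
        for t in range(1, len(obs)):
            # Pr(s_{t-1}=k, s_t=l | x, theta) = f_{t-1}(k) a_kl e_l(x_t) b_t(l) / Pr(x)
            A += np.outer(f[t - 1], emit[:, obs[t]] * b[t]) * trans / px
        post = f * b / px         # Pr(s_t = k | x, theta)
        for t, o in enumerate(obs):
            E[:, o] += post[t]
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

# Two short head/tail sequences (0 = head, 1 = tail), made up for illustration.
seqs = [[0, 0, 1], [0, 1, 0, 0]]
start = np.array([0.5, 0.5])                       # kept fixed in this sketch
trans = np.array([[0.8, 0.2], [0.2, 0.8]])         # arbitrary initial guesses
emit  = np.array([[0.6, 0.4], [0.5, 0.5]])
for _ in range(10):                                # a few EM iterations
    trans, emit = baum_welch_step(seqs, start, trans, emit)
print(trans)
print(emit)
```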

45 This is the Basic EM Algorithm
- Many extensions have been considered.
- The algorithm is extensively used in practice.
- Numerical considerations, such as underflow, are important in any implementation.
- Convergence to some local maximum is guaranteed, even though this process may be very slow.
- The choice of model (how many states, how many non-zero directed edges) is crucial for meaningful results.
- Too many states may cause overfitting and slow down convergence.
- Still, EM is one of the most widely used heuristics.

46 A Variant when State Paths are Unknown: Viterbi Training. We also start from initial values of a_kl and e_k(b), which define an initial θ. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize Pr(s(x_1),..,s(x_n) | θ, x_1,..,x_n), where s(x_j) is the most probable (under θ) path for x_j. It finds the paths, computes the optimal ML parameter values for them, and then iterates.

47 Viterbi Training: Each Iteration
1. Find a set {s(x_j)} of most probable paths, which maximize Pr(s(x_1),..,s(x_n) | θ, x_1,..,x_n).
2. Find θ* which maximizes Pr(s(x_1),..,s(x_n) | θ*, x_1,..,x_n). (Note: in step 1 the maximizing arguments are the paths; in step 2 it is θ*.)
3. Set θ = θ* and repeat. Stop when the paths no longer change.

48 Case 2: State Paths Unknown – Viterbi Training. Pr(s(x_1),..,s(x_n) | θ*, x_1,..,x_n) can be expressed in closed form (since we use a single path for each x_j), so this time convergence is reached when the optimal paths no longer change. Since state paths are discrete, convergence after finitely many steps is guaranteed (though possibly exponentially many). On the negative side, it is not Pr(x_1,...,x_n | θ*) that is being maximized, and in general EM performs better.
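
For comparison, a minimal Viterbi-training sketch under the same assumptions (made-up initial parameters and sequences; small pseudocounts keep count rows from becoming all zero):

```python
import numpy as np

# Viterbi training on the coin-tossing HMM (0 = Fair/head, 1 = Loaded/tail):
# decode the most probable paths under the current parameters, then re-estimate by counting.
def viterbi(obs, start, trans, emit):
    L, K = len(obs), len(start)
    v, ptr = np.zeros((L, K)), np.zeros((L, K), dtype=int)
    v[0] = start * emit[:, obs[0]]
    for t in range(1, L):
        scores = v[t - 1][:, None] * trans
        ptr[t], v[t] = scores.argmax(axis=0), scores.max(axis=0) * emit[:, obs[t]]
    path = [int(v[-1].argmax())]
    for t in range(L - 1, 0, -1):
        path.append(int(ptr[t][path[-1]]))
    return path[::-1]

def viterbi_training_step(seqs, start, trans, emit, pseudo=1.0):
    K, M = emit.shape
    A, E = np.full((K, K), pseudo), np.full((K, M), pseudo)   # pseudocounts avoid zero rows
    for obs in seqs:
        path = viterbi(obs, start, trans, emit)
        for t, (k, o) in enumerate(zip(path, obs)):
            E[k, o] += 1
            if t > 0:
                A[path[t - 1], k] += 1
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

seqs = [[0, 0, 1], [0, 1, 0, 0]]                  # made up for illustration
start = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.2], [0.2, 0.8]])
emit  = np.array([[0.6, 0.4], [0.5, 0.5]])
for _ in range(10):   # iterate; in practice stop when the decoded paths stop changing
    trans, emit = viterbi_training_step(seqs, start, trans, emit)
print(trans, emit, sep="\n")
```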

49 Example I: Homogeneous HMM, One Sample.
Start with some probability tables, then iterate until convergence:
- E-step: compute the pairwise posteriors p_θ(h_i, h_{i-1} | x_1,…,x_L) (and from them p_θ(h_i | h_{i-1}, x_1,…,x_L)) using the forward-backward algorithm, as explained earlier.
- M-step: update the parameters simultaneously; with a single stay-probability parameter λ,
λ ← ( ∑_i p_θ(h_i = 1, h_{i-1} = 1 | x_1,…,x_L) + p_θ(h_i = 0, h_{i-1} = 0 | x_1,…,x_L) ) / (L-1).

50 Coin-Tossing Example. Numeric example: 3 tosses. Outcomes: head, head, tail.

51 Coin-Tossing Example. Numeric example: 3 tosses. Outcomes: head, head, tail.
Recall the forward recursion: f(h_i) = Pr(x_1,…,x_i, h_i) = ∑_{h_{i-1}} Pr(x_1,…,x_{i-1}, h_{i-1}) Pr(h_i | h_{i-1}) Pr(x_i | h_i).
Step 1 (forward):
Pr(x_1=head, h_1=loaded) = Pr(loaded_1) Pr(head | loaded_1) = 0.5*0.75 = 0.375
Pr(x_1=head, h_1=fair) = Pr(fair_1) Pr(head | fair_1) = 0.5*0.5 = 0.25
So the first coin is more likely to be loaded.

52 Coin-Tossing Example – Forward. Numeric example: 3 tosses. Outcomes: head, head, tail.
Recursion: Pr(x_1,…,x_i, h_i) = ∑_{h_{i-1}} Pr(x_1,…,x_{i-1}, h_{i-1}) Pr(h_i | h_{i-1}) Pr(x_i | h_i).
Step 1: Pr(x_1=head, h_1=loaded) = 0.5*0.75 = 0.375; Pr(x_1=head, h_1=fair) = 0.5*0.5 = 0.25.
Step 2:
Pr(x_1=head, x_2=head, h_2=loaded) = Pr(x_1=head, loaded_1) Pr(loaded_2 | loaded_1) Pr(x_2=head | loaded_2) + Pr(x_1=head, fair_1) Pr(loaded_2 | fair_1) Pr(x_2=head | loaded_2) = 0.375*0.9*0.75 + 0.25*0.1*0.75 = 0.253125 + 0.01875 = 0.271875
Pr(x_1=head, x_2=head, h_2=fair) = Pr(x_1=head, loaded_1) Pr(fair_2 | loaded_1) Pr(x_2=head | fair_2) + Pr(x_1=head, fair_1) Pr(fair_2 | fair_1) Pr(x_2=head | fair_2) = 0.375*0.1*0.5 + 0.25*0.9*0.5 = 0.01875 + 0.1125 = 0.13125

53 Coin-Tossing Example – Forward. Numeric example: 3 tosses. Outcomes: head, head, tail.
From step 2: Pr(x_1=head, x_2=head, h_2=loaded) = 0.271875; Pr(x_1=head, x_2=head, h_2=fair) = 0.13125.
Step 3:
Pr(x_1=head, x_2=head, x_3=tail, h_3=loaded) = Pr(x_1, x_2, loaded_2) Pr(loaded_3 | loaded_2) Pr(x_3=tail | loaded_3) + Pr(x_1, x_2, fair_2) Pr(loaded_3 | fair_2) Pr(x_3=tail | loaded_3) = 0.271875*0.9*0.25 + 0.13125*0.1*0.25 = 0.061171875 + 0.00328125 = 0.064453125
Pr(x_1=head, x_2=head, x_3=tail, h_3=fair) = Pr(x_1, x_2, loaded_2) Pr(fair_3 | loaded_2) Pr(x_3=tail | fair_3) + Pr(x_1, x_2, fair_2) Pr(fair_3 | fair_2) Pr(x_3=tail | fair_3) = 0.271875*0.1*0.5 + 0.13125*0.9*0.5 = 0.01359375 + 0.0590625 = 0.07265625
(The likelihood of the whole sequence is their sum, Pr(x_1, x_2, x_3) ≈ 0.1371.)

54 Coin-Tossing Example – Backward. Numeric example: 3 tosses. Outcomes: head, head, tail.
Recursion: b(h_i) = Pr(x_{i+1},…,x_L | h_i) = ∑_{h_{i+1}} Pr(h_{i+1} | h_i) Pr(x_{i+1} | h_{i+1}) b(h_{i+1}).
Step 1:
Pr(x_3=tail | h_2=loaded) = Pr(h_3=loaded | h_2=loaded) Pr(x_3=tail | h_3=loaded) + Pr(h_3=fair | h_2=loaded) Pr(x_3=tail | h_3=fair) = 0.9*0.25 + 0.1*0.5 = 0.275
Pr(x_3=tail | h_2=fair) = Pr(h_3=loaded | h_2=fair) Pr(x_3=tail | h_3=loaded) + Pr(h_3=fair | h_2=fair) Pr(x_3=tail | h_3=fair) = 0.1*0.25 + 0.9*0.5 = 0.475

55 Coin-Tossing Example – Backward. Numeric example: 3 tosses. Outcomes: head, head, tail.
From step 1: Pr(x_3=tail | h_2=loaded) = 0.275; Pr(x_3=tail | h_2=fair) = 0.475.
Step 2:
Pr(x_2=head, x_3=tail | h_1=loaded) = Pr(loaded_2 | loaded_1) Pr(head | loaded) * 0.275 + Pr(fair_2 | loaded_1) Pr(head | fair) * 0.475 = 0.9*0.75*0.275 + 0.1*0.5*0.475 = 0.185625 + 0.02375 = 0.209375 ≈ 0.209
Pr(x_2=head, x_3=tail | h_1=fair) = Pr(loaded_2 | fair_1) Pr(head | loaded) * 0.275 + Pr(fair_2 | fair_1) Pr(head | fair) * 0.475 = 0.1*0.75*0.275 + 0.9*0.5*0.475 = 0.020625 + 0.21375 = 0.234375 ≈ 0.234

56 Coin-Tossing Example. Outcomes: head, head, tail.
Recall: Pr(x_1,…,x_L, h_i, h_{i+1}) = f(h_i) Pr(h_{i+1} | h_i) Pr(x_{i+1} | h_{i+1}) b(h_{i+1}).
From the forward and backward passes: f(h_1=loaded) = 0.375, f(h_1=fair) = 0.25, b(h_2=loaded) = 0.275, b(h_2=fair) = 0.475.
(Step 1 gave Pr(x_1=head, h_1=loaded) = Pr(loaded_1) Pr(head | loaded_1) = 0.5*0.75 = 0.375 and Pr(x_1=head, h_1=fair) = Pr(fair_1) Pr(head | fair_1) = 0.5*0.5 = 0.25.)

57 Coin-Tossing Example. Outcomes: head, head, tail.
f(h_1=loaded) = 0.375, f(h_1=fair) = 0.25, b(h_2=loaded) = 0.275, b(h_2=fair) = 0.475.
Pr(x_1,…,x_L, h_1, h_2) = f(h_1) Pr(h_2 | h_1) Pr(x_2 | h_2) b(h_2):
Pr(x_1,…,x_L, h_1=loaded, h_2=loaded) = 0.375*0.9*0.75*0.275 ≈ 0.0696
Pr(x_1,…,x_L, h_1=loaded, h_2=fair) = 0.375*0.1*0.5*0.475 ≈ 0.0089
Pr(x_1,…,x_L, h_1=fair, h_2=loaded) = 0.25*0.1*0.75*0.275 ≈ 0.0052
Pr(x_1,…,x_L, h_1=fair, h_2=fair) = 0.25*0.9*0.5*0.475 ≈ 0.0534
(These four terms sum to Pr(x_1, x_2, x_3) ≈ 0.1371, as they should.)

58 Coin-Tossing Example. The conditional transition posterior:
Pr(h_i | h_{i-1}, x_1,…,x_L) = Pr(x_1,…,x_L, h_i, h_{i-1}) / Pr(h_{i-1}, x_1,…,x_L), where Pr(h_{i-1}, x_1,…,x_L) = f(h_{i-1}) · b(h_{i-1}), so
= f(h_{i-1}) Pr(h_i | h_{i-1}) Pr(x_i | h_i) b(h_i) / (f(h_{i-1}) · b(h_{i-1}))
= Pr(h_i | h_{i-1}) Pr(x_i | h_i) b(h_i) / b(h_{i-1}).

59 M-step. Update the parameters simultaneously (in this case we only have one parameter, the stay probability λ):
λ ← ( ∑_i Pr(h_i = loaded, h_{i-1} = loaded | x_1,…,x_L) + Pr(h_i = fair, h_{i-1} = fair | x_1,…,x_L) ) / (L-1)
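
A closing numeric check for the 3-toss example, assuming the single parameter is the stay probability λ and that the M-step uses the joint transition posteriors Pr(h_{i-1}, h_i | x) computed by forward-backward:

```python
import numpy as np

# Expected number of "stay" transitions in head, head, tail, and the resulting
# one-parameter update lambda_new (the variable names are illustrative).
start = np.array([0.5, 0.5])                     # 0 = fair, 1 = loaded
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit  = np.array([[0.5, 0.5], [0.75, 0.25]])     # columns: 0 = head, 1 = tail
obs = [0, 0, 1]                                  # head, head, tail

L = len(obs)
f, b = np.zeros((L, 2)), np.ones((L, 2))
f[0] = start * emit[:, obs[0]]
for t in range(1, L):
    f[t] = (f[t - 1] @ trans) * emit[:, obs[t]]
for t in range(L - 2, -1, -1):
    b[t] = trans @ (emit[:, obs[t + 1]] * b[t + 1])
px = f[-1].sum()                                 # Pr(x) ≈ 0.1371

stay = 0.0
for t in range(1, L):
    # joint posterior Pr(h_{t-1} = k, h_t = l | x)
    joint = np.outer(f[t - 1], emit[:, obs[t]] * b[t]) * trans / px
    stay += joint[0, 0] + joint[1, 1]
lambda_new = stay / (L - 1)
print(lambda_new)                                # ≈ 0.887
```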