Parameter Estimation For HMM. Lecture #7.
Background Readings: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Slides by Shlomo Moran, following Danny Geiger and Nir Friedman.

2 Reminder: Hidden Markov Model
Markov chain transition probabilities: p(S_{i+1} = t | S_i = s) = m_st.
Emission probabilities: p(X_i = b | S_i = s) = e_s(b).
[Figure: the hidden state chain S_1 → S_2 → … → S_{L-1} → S_L, where each state S_i emits the output symbol x_i.]
For each integer L > 0, this defines a probability space in which the simple events are state/output sequences of length L.

3 Hidden Markov Model for CpG Islands
The states: Domain(S_i) = {+, -} × {A, C, T, G} (8 values).
In this representation P(x_i | s_i) = 0 or 1, depending on whether x_i is consistent with s_i. For example, x_i = G is consistent with s_i = (+, G) and with s_i = (-, G), but not with any other state.
[Figure: a state path such as A-, T+, G+ emitting the output letters A, T, G.]

4 Reminder: Most Probable state path
Given an output sequence x = (x_1, …, x_L), a most probable path s* = (s*_1, …, s*_L) is one which maximizes p(s | x).

5 Reminder: Viterbi's algorithm for most probable path
We add the special initial state 0.
Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0.
For i = 1 to L, for each state l:
  v_l(i) = e_l(x_i) · MAX_k { v_k(i-1) · m_kl }
  ptr_i(l) = argmax_k { v_k(i-1) · m_kl }   [storing the previous state for reconstructing the path]
Termination: the probability of the most probable path is p(s*_1, …, s*_L; x_1, …, x_L) = MAX_k v_k(L).
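As an illustration, here is a minimal Python sketch of Viterbi's algorithm as described above. The model representation (dictionaries init, trans, emit) and the toy two-state '+'/'-' example with made-up numbers are assumptions for the example, not part of the lecture; an explicit initial distribution is used in place of the special state 0.

```python
def viterbi(x, states, init, trans, emit):
    """Most probable state path for output sequence x.

    init[k]     = probability of starting in state k
    trans[k][l] = m_kl, transition probability from state k to state l
    emit[k][b]  = e_k(b), probability that state k emits symbol b
    """
    # v[i][l] = probability of the best path emitting x[0..i] and ending in state l
    v = [{l: init[l] * emit[l][x[0]] for l in states}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] * trans[k][l])
            ptr[i][l] = best_k
            v[i][l] = emit[l][x[i]] * v[i - 1][best_k] * trans[best_k][l]
    # Termination: pick the best final state, then trace the pointers back
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    path.reverse()
    return path, v[-1][last]

# Toy two-state example ('+' = CpG island, '-' = background); numbers are made up.
states = ['+', '-']
init = {'+': 0.5, '-': 0.5}
trans = {'+': {'+': 0.7, '-': 0.3}, '-': {'+': 0.1, '-': 0.9}}
emit = {'+': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        '-': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}
path, prob = viterbi("CGCGATATAT", states, init, trans, emit)
print(path, prob)
```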

6 Predicting CpG islands via the most probable path
Output symbols: A, C, G, T (4 letters).
Markov chain states: 4 "-" states and 4 "+" states, two for each letter (8 states total).
The most probable path (found by Viterbi's algorithm) predicts CpG islands.
An experiment (Durbin et al.) shows that the predicted islands are shorter than the annotated ones; in addition, quite a few "false negatives" are found.
[Figure: a state path A-, C-, T-, T+, G+ emitting the sequence A, C, T, T, G.]

7 Reminder: Most probable state
Given an output sequence x = (x_1, …, x_L), s_i is a most probable state (at location i) if:
s_i = argmax_k p(S_i = k | x).

8 Reminder: finding the most probable state
f_l(i) = p(x_1, …, x_i, s_i = l): the probability of emitting (x_1, …, x_i) along a path in which s_i = l.
b_l(i) = p(x_{i+1}, …, x_L | s_i = l): the probability of emitting (x_{i+1}, …, x_L) given that s_i = l.
1. The forward algorithm finds f_k(i) = P(x_1, …, x_i, s_i = k) for k = 1, …, m.
2. The backward algorithm finds b_k(i) = P(x_{i+1}, …, x_L | s_i = k) for k = 1, …, m.
3. Return p(S_i = k | x) = f_k(i) · b_k(i) / p(x) for k = 1, …, m, where p(x) = ∑_k f_k(L).
To compute this for every i, simply run the forward and backward algorithms once, and compute f_k(i) · b_k(i) for every i, k.
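A minimal Python sketch of the forward/backward computation of posterior state probabilities, reusing the toy model representation from the Viterbi example above. The function names and data layout are my own choices for illustration; no scaling is done, so this is only suitable for short sequences.

```python
def forward(x, states, init, trans, emit):
    # f[i][k] = p(x_1..x_{i+1}, s_{i+1} = k), positions indexed from 0
    f = [{k: init[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    return f

def backward(x, states, trans, emit):
    # b[i][k] = p(x_{i+2}..x_L | s_{i+1} = k); the last column is all ones
    b = [{k: 1.0 for k in states} for _ in x]
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

def posterior(x, states, init, trans, emit):
    f, b = forward(x, states, init, trans, emit), backward(x, states, trans, emit)
    px = sum(f[-1][k] for k in states)          # p(x) = sum_k f_k(L)
    # p(S_i = k | x) = f_k(i) * b_k(i) / p(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))], px
```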

9 Finding the probability that a letter is in a CpG island, via the algorithm for the most probable state
The probability that the occurrence of G in the i-th location is in a CpG island ("+" state) is:
∑_{s+} p(S_i = s+ | x) = (1/p(x)) ∑_{s+} F_{s+}(i) · B_{s+}(i),
where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
[Figure: the state path A-, C-, T-, T+, G+, with the i-th position emitting G.]

10 Parameter Estimation for HMM
An HMM model is defined by the parameters m_kl and e_k(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.
[Figure: the state chain with a transition k → l labeled m_kl and an emission of symbol b from state k labeled e_k(b).]

11 Parameter Estimation for HMM
To determine the values of (the parameters in) θ, we use a training set {x^1, …, x^n}, where each x^j is a sequence which is assumed to fit the model.
Given the parameters θ, each sequence x^j has an assigned probability p(x^j | θ).

12 Maximum Likelihood Parameter Estimation for HMM
The elements of the training set {x^1, …, x^n} are assumed to be independent, so p(x^1, …, x^n | θ) = ∏_j p(x^j | θ).
ML parameter estimation looks for the θ which maximizes the above. The exact method for finding or approximating this θ depends on the nature of the training set used.

13 Data for HMM
The training set is characterized by:
1. For each x^j, the information available on the states s^j_i (the symbols x^j_i are usually known).
2. Its size (the number of sequences in it).

14 Case 1: ML when sequences are fully known
We know the complete structure (both the symbols and the states) of each sequence in the training set {x^1, …, x^n}. We wish to estimate m_kl and e_k(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set:
p(x^1, …, x^n | θ*) = MAX_θ p(x^1, …, x^n | θ).

15 Case 1: Sequences are fully known
Let M_kl = |{i : s_{i-1} = k, s_i = l}| and E_k(b) = |{i : s_i = k, x_i = b}| (counted in x^j).
For each x^j we then have:
p(x^j | θ) = ∏_{k,l} m_kl^{M_kl} · ∏_k ∏_b e_k(b)^{E_k(b)}.

16 Case 1 (cont)
By the independence of the x^j's, p(x^1, …, x^n | θ) = ∏_j p(x^j | θ).
Thus, if M_kl = #(transitions from k to l) in the whole training set, and E_k(b) = #(emissions of symbol b from state k) in the whole training set, we have:
p(x^1, …, x^n | θ) = ∏_{k,l} m_kl^{M_kl} · ∏_k ∏_b e_k(b)^{E_k(b)}.
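When the state paths are known, ML estimation reduces to counting and normalizing, as the following Python sketch illustrates. The data format (a list of (state path, symbol sequence) pairs) and the '+'/'-' toy data are assumptions for the example.

```python
from collections import defaultdict

def ml_estimate(training_set):
    """ML estimation of HMM parameters from fully observed (states, symbols) pairs.

    training_set: list of (s, x) pairs, where s is the state path and x the
    emitted sequence, with len(s) == len(x).
    Returns (trans, emit): trans[k][l] = m_kl, emit[k][b] = e_k(b).
    """
    M = defaultdict(lambda: defaultdict(float))   # M[k][l] = #(k -> l transitions)
    E = defaultdict(lambda: defaultdict(float))   # E[k][b] = #(emissions of b from k)
    for s, x in training_set:
        for i in range(len(s)):
            E[s[i]][x[i]] += 1
            if i > 0:
                M[s[i - 1]][s[i]] += 1
    # Normalize each row: m_kl = M_kl / sum_l' M_kl', e_k(b) = E_k(b) / sum_b' E_k(b')
    trans = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in M.items()}
    emit = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return trans, emit

# Tiny example with hypothetical '+'/'-' CpG states:
data = [("--++++--", "ACGCGCTA"), ("++--", "CGAT")]
trans, emit = ml_estimate(data)
print(trans)
print(emit)
```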

17 Case 1 (cont)
So we need to find the m_kl's and e_k(b)'s which maximize:
F(θ) = ∏_{k,l} m_kl^{M_kl} · ∏_k ∏_b e_k(b)^{E_k(b)},
subject to: for every state k, ∑_l m_kl = 1 and ∑_b e_k(b) = 1 (all parameters nonnegative).

18 Case 1 (cont)
Rewriting, we need to maximize:
F(θ) = ∏_k [ ∏_l m_kl^{M_kl} ] · ∏_k [ ∏_b e_k(b)^{E_k(b)} ],
i.e., a product of independent factors, one for the transitions out of each state k and one for the emissions from each state k.

19 Case 1 (cont)
If, for each state k separately, we maximize ∏_l m_kl^{M_kl} subject to ∑_l m_kl = 1, and ∏_b e_k(b)^{E_k(b)} subject to ∑_b e_k(b) = 1, then we will also maximize F.
Each of the above is a simpler ML problem, which is similar to ML parameter estimation for a die, treated next.

20 ML parameter estimation for a die
Let X be a random variable with 6 values x_1, …, x_6 denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {θ_1, θ_2, θ_3, θ_4, θ_5, θ_6}, with ∑_i θ_i = 1.
Assume that the data is one sequence:
Data = (x_6, x_1, x_1, x_3, x_2, x_2, x_3, x_4, x_5, x_2, x_6).
So we have to maximize
P(Data | θ) = θ_6 · θ_1 · θ_1 · θ_3 · θ_2 · θ_2 · θ_3 · θ_4 · θ_5 · θ_2 · θ_6 = θ_1^2 · θ_2^3 · θ_3^2 · θ_4 · θ_5 · θ_6^2,
subject to: θ_1 + θ_2 + θ_3 + θ_4 + θ_5 + θ_6 = 1 [and θ_i ≥ 0].

21 Side comment: Sufficient Statistics
- To compute the probability of the data in the die example we only need to record the number of times N_i the die fell on side i (namely N_1, N_2, …, N_6).
- We do not need to recall the entire sequence of outcomes.
- {N_i | i = 1…6} is called a sufficient statistic for the multinomial sampling.

22 Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data':
  s(Data) = s(Data')  ⇒  P(Data | θ) = P(Data' | θ).
Exercise: Define a sufficient statistic for the HMM model.

23 Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function) P(Data | θ) = ∏_i θ_i^{N_i}.
We find the parameters by considering the corresponding log-likelihood function:
log P(Data | θ) = ∑_i N_i log θ_i.
A necessary condition for a (local) maximum, under the constraint ∑_i θ_i = 1 (using a Lagrange multiplier λ), is:
∂/∂θ_j [ ∑_i N_i log θ_i − λ(∑_i θ_i − 1) ] = N_j/θ_j − λ = 0, for every j.

24 Finding the Maximum
Rearranging terms: θ_j = N_j / λ (one equation for each j).
Divide the j-th equation by the i-th equation: θ_j / θ_i = N_j / N_i.
Sum from j = 1 to 6: 1/θ_i = (∑_j N_j) / N_i, hence θ_i = N_i / ∑_j N_j.
So there is only one local, and hence global, maximum. Hence the MLE is given by:
θ_i = N_i / N, where N = ∑_j N_j is the total number of throws.

25 Generalization for a distribution with any number of outcomes
Let X be a random variable with k values x_1, …, x_k denoting the k outcomes of Independently and Identically Distributed experiments, with parameters θ = {θ_1, θ_2, …, θ_k} (θ_i is the probability of x_i).
Again, the data is one sequence of length n, in which x_i appears n_i times. Then we have to maximize
P(Data | θ) = θ_1^{n_1} · θ_2^{n_2} · … · θ_k^{n_k},
subject to: θ_1 + θ_2 + … + θ_k = 1.

26 Generalization for k outcomes (cont)
By a treatment identical to the die case, the maximum is obtained when, for all i:
θ_i = n_i / ∑_j n_j.
Hence the MLE is given by the relative frequencies:
θ_i = n_i / n.
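A small Python sketch of this relative-frequency MLE for the die data above (the helper name and the use of Counter are my own choices for illustration):

```python
from collections import Counter

def multinomial_mle(data):
    """MLE for a multinomial distribution: theta_i = n_i / n (relative frequencies)."""
    counts = Counter(data)
    n = len(data)
    return {outcome: n_i / n for outcome, n_i in counts.items()}

# The die data from the lecture: Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)
data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
print(multinomial_mle(data))   # e.g. theta_2 = 3/11, theta_4 = 1/11, ...
```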

27 Fractional Exponents
Some models allow the n_i's to be fractions (e.g., if we are uncertain of a die outcome, we may count it as "6" with 20% confidence and as "5" with 80%).
We can still write P(Data | θ) = θ_1^{n_1} · θ_2^{n_2} · … · θ_k^{n_k}, where the n_i's need not be integers, and the same analysis yields:
θ_i = n_i / n, where n = ∑_j n_j.
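The same estimator works unchanged with fractional (weighted) counts. A small sketch, where the split for the uncertain throw is the 20%/80% example mentioned above and the remaining counts are made up:

```python
def weighted_mle(weighted_counts):
    """theta_i = n_i / n, where the n_i may be fractional expected counts."""
    n = sum(weighted_counts.values())
    return {outcome: n_i / n for outcome, n_i in weighted_counts.items()}

# Ten ordinary throws plus one uncertain throw counted as 0.2 of a '6' and 0.8 of a '5'.
counts = {1: 2, 2: 3, 3: 2, 4: 1, 5: 1 + 0.8, 6: 1 + 0.2}
print(weighted_mle(counts))
```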

28 Apply the ML method to HMM
Let M_kl = #(transitions from k to l) in the training set, and E_k(b) = #(emissions of symbol b from state k) in the training set. We need to maximize
∏_{k,l} m_kl^{M_kl} · ∏_k ∏_b e_k(b)^{E_k(b)},
subject to ∑_l m_kl = 1 and ∑_b e_k(b) = 1 for every state k.

29 Apply to HMM (cont.)
We apply the previous technique to get, for each state k, the parameters {m_kl | l = 1, …, m} and {e_k(b) | b ∈ Σ}:
m_kl = M_kl / ∑_{l'} M_kl'    and    e_k(b) = E_k(b) / ∑_{b'} E_k(b'),
which gives the optimal ML parameters.

30 Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x^1, …, x^n}. We wish to estimate m_kl and e_k(b) for all pairs of states k, l and symbols b.
When everything is known, we can find the (unique set of) parameters θ* which maximizes the likelihood:
p(x^1, …, x^n | θ*) = MAX_θ p(x^1, …, x^n | θ).

31 Adding pseudo counts in HMM
We may modify the actual counts using our prior knowledge/belief (e.g., when the sample set is too small):
r_kl is our prior belief on transitions from k to l; r_k(b) is our prior belief on emissions of b from state k. The estimates become:
m_kl = (M_kl + r_kl) / ∑_{l'} (M_kl' + r_kl'),    e_k(b) = (E_k(b) + r_k(b)) / ∑_{b'} (E_k(b') + r_k(b')).
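A sketch of how pseudocounts might be folded into the counting estimator from the earlier example. The uniform pseudocount of 1 and the toy observed counts are arbitrary choices for illustration:

```python
def normalize_with_pseudocounts(counts, prior):
    """counts[k][v] = observed count, prior[k][v] = pseudocount r.
    Returns parameters proportional to (count + pseudocount), normalized per state k."""
    params = {}
    for k in prior:                       # iterate over the states/rows in the prior
        row = {v: counts.get(k, {}).get(v, 0.0) + r for v, r in prior[k].items()}
        total = sum(row.values())
        params[k] = {v: c / total for v, c in row.items()}
    return params

# Example: uniform pseudocount of 1 on every transition between '+' and '-'.
observed = {'+': {'+': 7, '-': 1}}        # the '-' row was never observed in the data
prior = {'+': {'+': 1, '-': 1}, '-': {'+': 1, '-': 1}}
print(normalize_with_pseudocounts(observed, prior))
# {'+': {'+': 0.8, '-': 0.2}, '-': {'+': 0.5, '-': 0.5}}
```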

32 Case 2: State paths are unknown
For a given θ we have p(x^1, …, x^n | θ) = p(x^1 | θ) · … · p(x^n | θ) (since the x^j are independent).
For each sequence x, p(x | θ) = ∑_s p(x, s | θ), the sum taken over all state paths s which emit x.

33 Case 2: State paths are unknown
Thus, for the n sequences (x^1, …, x^n) we have:
p(x^1, …, x^n | θ) = ∑_{(s^1, …, s^n)} p(x^1, …, x^n, s^1, …, s^n | θ),
where the summation is taken over all tuples of n state paths (s^1, …, s^n) which generate (x^1, …, x^n). For simplicity, we will assume that n = 1.

34 Case 2: State paths are unknown
So we need to maximize p(x | θ) = ∑_s p(x, s | θ), where the summation is over all the state paths s which produce the output sequence x.
Finding a θ* which maximizes ∑_s p(x, s | θ) is hard. [Unlike finding the θ* which maximizes p(x, s | θ) for a single fully observed pair (x, s).]

35 ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x | θ') > p(x | θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

36 Baum-Welch training
We start with some values of m_kl and e_k(b), which define an initial value of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that p(x | θ*) > p(x | θ).
This is done by "imitating" the algorithm for Case 1, where all states are known.

37 Baum-Welch training
In Case 1 we computed the optimal values of m_kl and e_k(b) (for the optimal θ) by simply counting the number M_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible because we knew all the states.
[Figure: a transition S_{i-1} = k → S_i = l, with emissions x_{i-1} = b and x_i = c.]

38 Baum-Welch training
When the states are unknown, the counting process is replaced by an averaging process:
For each edge s_{i-1} → s_i we compute the expected (average) number of "k to l" transitions, for all possible pairs (k, l), over this edge. Then, for each k and l, we take M_kl to be the sum over all edges.
[Figure: a transition S_{i-1} = ? → S_i = ?, with emissions x_{i-1} = b and x_i = c.]

39 Baum-Welch training
Similarly, for each emission edge s_i → x_i and each state k, we compute the expected number of times that s_i = k, which is the expected number of "k → b" emissions on this edge (where b = x_i). Then we take E_k(b) to be the sum over all such edges.
These expected values are computed assuming the current parameters θ.

40 M_kl and E_k(b) when states are unknown
M_kl and E_k(b) are computed according to the current distribution θ, that is:
M_kl = ∑_s M^s_kl · p(s | x, θ), where M^s_kl is the number of k to l transitions in the state path s.
E_k(b) = ∑_s E^s_k(b) · p(s | x, θ), where E^s_k(b) is the number of times k emits b in the state path s with output x.

41 Baum-Welch: step 1a (count expected number of state transitions)
For each index i and states k, l, compute the state transition probabilities under the current θ:
P(s_{i-1} = k, s_i = l | x, θ).
For this, we use the forward and backward algorithms.

42 Reminder: finding state probabilities
p(s_i = k, x) = F_k(i) · B_k(i).
F_k(i) and B_k(i), for every i and k, are computed by one run of the forward and backward algorithms:
F_k(i) = p(x_1, …, x_i, s_i = k): the probability of emitting (x_1, …, x_i) along a path in which s_i = k.
B_k(i) = p(x_{i+1}, …, x_L | s_i = k): the probability of emitting (x_{i+1}, …, x_L) given that s_i = k.

43 Baum-Welch: Step 1a (cont)
Claim: By the probability distribution of the HMM,
p(s_{i-1} = k, s_i = l | x, θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / p(x).
(m_kl and e_l(x_i) are the parameters defined by θ, and F_k(i-1), B_l(i) are the outputs of the forward and backward algorithms.)

44 Step 1a (proof of claim)
P(x_1, …, x_L, s_{i-1} = k, s_i = l | θ)
= P(x_1, …, x_{i-1}, s_{i-1} = k | θ) · m_kl · e_l(x_i) · P(x_{i+1}, …, x_L | s_i = l, θ)
= F_k(i-1) · m_kl · e_l(x_i) · B_l(i).
The first factor is computed via the forward algorithm, the last via the backward algorithm.
Since P(x, s_{i-1} = k, s_i = l | θ) = p(x | θ) · p(s_{i-1} = k, s_i = l | x, θ), we get
p(s_{i-1} = k, s_i = l | x, θ) = F_k(i-1) · m_kl · e_l(x_i) · B_l(i) / p(x), proving the claim.

45 Step 1a (end)
For each pair (k, l), compute the expected number of state transitions from k to l as the sum of the expected number of k to l transitions over all L edges:
M_kl = ∑_{i=1}^{L} p(s_{i-1} = k, s_i = l | x, θ) = (1/p(x)) ∑_{i=1}^{L} F_k(i-1) · m_kl · e_l(x_i) · B_l(i).
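The following Python sketch computes these expected transition counts from the forward and backward helpers defined in the earlier sketches (the table layout, with positions indexed from 0 and an explicit initial distribution instead of the special state 0, is an assumption of these sketches):

```python
def expected_transitions(x, states, init, trans, emit):
    """M_kl = sum_i p(s_{i-1}=k, s_i=l | x, theta)
            = (1/p(x)) * sum_i F_k(i-1) * m_kl * e_l(x_i) * B_l(i)"""
    fwd = forward(x, states, init, trans, emit)     # F table, from the earlier sketch
    bwd = backward(x, states, trans, emit)          # B table, from the earlier sketch
    px = sum(fwd[-1][k] for k in states)            # p(x)
    M = {k: {l: 0.0 for l in states} for k in states}
    for i in range(1, len(x)):                      # edges between positions i-1 and i
        for k in states:
            for l in states:
                M[k][l] += fwd[i - 1][k] * trans[k][l] * emit[l][x[i]] * bwd[i][l] / px
    return M
```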

46 Step 1a for many sequences
Exercise: Prove that when we have n independent input sequences (x^1, …, x^n), then M_kl is given by:
M_kl = ∑_{j=1}^{n} (1/p(x^j)) ∑_i F^j_k(i-1) · m_kl · e_l(x^j_i) · B^j_l(i),
where F^j and B^j are the forward and backward tables computed for x^j.

47 Baum-Welch: Step 1b (count expected number of symbol emissions)
For each state k and each symbol b, and for each i where x_i = b, compute the expected number of times that s_i = k.

48 Baum-Welch: Step 1b
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i's for which x_i = b:
E_k(b) = ∑_{i : x_i = b} p(s_i = k | x, θ) = (1/p(x)) ∑_{i : x_i = b} F_k(i) · B_k(i).
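A matching sketch for the expected emission counts, again reusing the earlier forward/backward helpers:

```python
def expected_emissions(x, states, init, trans, emit):
    """E_k(b) = sum over positions i with x_i = b of p(s_i = k | x, theta)
             = (1/p(x)) * sum_{i: x_i = b} F_k(i) * B_k(i)"""
    fwd = forward(x, states, init, trans, emit)
    bwd = backward(x, states, trans, emit)
    px = sum(fwd[-1][k] for k in states)
    E = {k: {} for k in states}
    for i, symbol in enumerate(x):
        for k in states:
            E[k][symbol] = E[k].get(symbol, 0.0) + fwd[i][k] * bwd[i][k] / px
    return E
```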

49 Step 1b for many sequences
Exercise: when we have n sequences (x^1, …, x^n), the expected number of emissions of b from k is given by:
E_k(b) = ∑_{j=1}^{n} (1/p(x^j)) ∑_{i : x^j_i = b} F^j_k(i) · B^j_k(i).

50 Summary of Steps 1a and 1b: the E part of the Baum-Welch training
These steps compute the expected numbers M_kl of k → l transitions, for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b.
The next step is the M step, which is identical to the computation of the optimal ML parameters when all states are known.

51 Baum-Welch: step 2
Use the M_kl's and E_k(b)'s to compute the new values of m_kl and e_k(b), exactly as in Case 1:
m_kl = M_kl / ∑_{l'} M_kl',    e_k(b) = E_k(b) / ∑_{b'} E_k(b').
These values define θ*. The correctness of the EM algorithm implies that p(x^1, …, x^n | θ*) ≥ p(x^1, …, x^n | θ), i.e., θ* does not decrease the probability of the data.
This procedure is iterated until some convergence criterion is met.
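Putting the pieces together, here is a sketch of one possible Baum-Welch training loop built from the helper functions in the earlier sketches. The convergence test on the log-likelihood, the fixed initial distribution, and the absence of pseudocounts and numerical scaling are simplifications of my own, not prescribed by the lecture.

```python
import math

def baum_welch(x, states, init, trans, emit, tol=1e-6, max_iters=200):
    """Iteratively re-estimate (trans, emit); p(x | theta) never decreases."""
    prev_loglik = -math.inf
    for _ in range(max_iters):
        # E step: expected counts under the current parameters.
        M = expected_transitions(x, states, init, trans, emit)
        E = expected_emissions(x, states, init, trans, emit)
        # M step: normalize the expected counts, exactly as in Case 1.
        trans = {k: {l: M[k][l] / sum(M[k].values()) for l in states} for k in states}
        emit = {k: {b: c / sum(E[k].values()) for b, c in E[k].items()} for k in states}
        # Convergence check on the log-likelihood log p(x | theta).
        fwd = forward(x, states, init, trans, emit)
        loglik = math.log(sum(fwd[-1][k] for k in states))
        if loglik - prev_loglik < tol:
            break
        prev_loglik = loglik
    return trans, emit, loglik
```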