Lecture #8: Parameter Estimation for HMM with Hidden States: the Baum-Welch Training; Viterbi Training; Extensions of HMM. Background Readings: Chapters 3.3, 11.2 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Dan Geiger and Nir Friedman

2 Parameter Estimation in HMM: What we saw until now

3 The Parameters An HMM is defined by the probability parameters $m_{kl}$ and $e_k(b)$, for all states k, l and all symbols b; θ denotes the collection of these parameters. To determine the values of the parameters θ, use a training set $\{x^1, \ldots, x^n\}$, where each $x^j$ is a sequence which is assumed to fit the model.

4 Maximum Likelihood Parameter Estimation for HMM looks for θ which maximizes: $p(x^1, \ldots, x^n \mid \theta) = \prod_j p(x^j \mid \theta)$.

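5 Finding ML parameters for HMM when all states are known: Let $M_{kl}$ = #(transitions from k to l) in the training set, and $E_k(b)$ = #(emissions of symbol b from state k) in the training set. We look for parameters $\theta = \{m_{kl}, e_k(b)\}$ that maximize the likelihood of the training set, $\prod_{k,l} m_{kl}^{M_{kl}} \prod_{k,b} e_k(b)^{E_k(b)}$. The optimal ML parameters θ are given by: $m_{kl} = M_{kl} / \sum_{l'} M_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$.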
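A minimal Python sketch of this counting estimator, assuming each training item is a pair of equal-length strings (the emitted symbols and the known states). The function name `ml_estimate` and the coin-style toy data are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def ml_estimate(pairs, states, symbols):
    """pairs: list of (x, s), x = emitted symbols, s = known states (same length)."""
    M = Counter()  # M[(k, l)] = number of k -> l transitions
    E = Counter()  # E[(k, b)] = number of emissions of b from state k
    for x, s in pairs:
        for i, (state, sym) in enumerate(zip(s, x)):
            E[(state, sym)] += 1
            if i > 0:
                M[(s[i - 1], state)] += 1
    m, e = {}, {}
    for k in states:
        trans_total = sum(M[(k, l)] for l in states)
        emit_total = sum(E[(k, b)] for b in symbols)
        for l in states:
            # Rows with no observations are left at 0.0 (an arbitrary choice here).
            m[(k, l)] = M[(k, l)] / trans_total if trans_total else 0.0
        for b in symbols:
            e[(k, b)] = E[(k, b)] / emit_total if emit_total else 0.0
    return m, e

# Hypothetical toy data: states F/L, symbols H/T.
m, e = ml_estimate([("HHTH", "FFLL"), ("TTH", "FLL")], "FL", "HT")
```

In this toy set state F is followed by F once and by L twice, so the estimate for the F to L transition is 2/3.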

6 Case 2: State paths are unknown We need to maximize $p(x \mid \theta) = \sum_s p(x, s \mid \theta)$, where the summation is over all the state sequences s which produce the output sequence x. Finding θ* which maximizes $\sum_s p(x, s \mid \theta)$ is hard. [Unlike finding θ* which maximizes $p(x, s \mid \theta)$ for a single sequence (x, s).] [Figure: the HMM as a chain of hidden states $s_1, \ldots, s_L$ emitting the symbols $x_1, \ldots, x_L$.]

7 Parameter Estimation when States are Unknown The general process for finding θ in this case is: 1. Start with an initial value of θ. 2. Find θ* so that $p(x \mid \theta^*) > p(x \mid \theta)$. 3. Set θ = θ*. 4. Repeat until some convergence criterion is met. A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

8 Baum-Welch Training We start with some values of $m_{kl}$ and $e_k(b)$, which define the initial values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* such that $p(x \mid \theta^*) > p(x \mid \theta)$. This is done by “imitating” the algorithm for Case 1, where all states are known.

9 When states were known, we counted… In Case 1 we computed the optimal values of $m_{kl}$ and $e_k(b)$ (for the optimal θ) by simply counting the number $M_{kl}$ of transitions from state k to state l, and the number $E_k(b)$ of emissions of symbol b from state k, in the training set. This was possible since we knew all the states. [Figure: an edge $s_{i-1} = k \to s_i = l$, with emissions $x_{i-1} = b$ and $x_i = c$.]

10 When states are unknown $M_{kl}$ and $E_k(b)$ are taken as averages, computed according to the current distribution θ, that is: $M_{kl} = \sum_s M^s_{kl}\, p(s \mid x, \theta)$, where $M^s_{kl}$ is the number of k to l transitions in the sequence s. Similarly, $E_k(b) = \sum_s E^s_k(b)\, p(s \mid x, \theta)$, where $E^s_k(b)$ is the number of times k emits b in the sequence s with output x.
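To make these averages concrete, the following sketch computes them literally, by enumerating every state path of a short sequence. It assumes a two-state toy model with an explicit initial distribution `init` (the slides do not specify one); all names and numbers are hypothetical. The cost is exponential in the sequence length, so this is only a reference for checking faster methods.

```python
from itertools import product

STATES = (0, 1)
init = [0.5, 0.5]                                  # assumed initial state distribution
m = [[0.9, 0.1], [0.2, 0.8]]                       # m[k][l]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]   # e[k][b]

def joint(x, s):
    """p(x, s | theta) for one state path s."""
    p = init[s[0]] * e[s[0]][x[0]]
    for i in range(1, len(x)):
        p *= m[s[i - 1]][s[i]] * e[s[i]][x[i]]
    return p

def expected_counts_brute_force(x):
    paths = list(product(STATES, repeat=len(x)))
    px = sum(joint(x, s) for s in paths)           # p(x | theta)
    M = [[0.0 for _ in STATES] for _ in STATES]
    E = [{b: 0.0 for b in e[k]} for k in STATES]
    for s in paths:
        w = joint(x, s) / px                       # p(s | x, theta)
        for i in range(len(x)):
            E[s[i]][x[i]] += w                     # path s emits x[i] from s[i]
            if i > 0:
                M[s[i - 1]][s[i]] += w             # path s uses edge s[i-1] -> s[i]
    return M, E, px

M, E, px = expected_counts_brute_force("HHTH")
```

Every path has exactly L-1 transitions and L emissions, so the entries of `M` sum to L-1 and those of `E` sum to L.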

11 Computing averages of state-transitions: Since the number of sequences s is exponential in L, it is too expensive to compute $M_{kl} = \sum_s M^s_{kl}\, p(s \mid x, \theta)$ in the naïve way. Hence, we use dynamic programming: For each pair (k, l) and for each edge $s_{i-1} \to s_i$ we compute the average number of “k to l” transitions over this edge. Then we take $M_{kl}$ to be the sum over all edges. [Figure: an edge between unknown states $s_{i-1}$ and $s_i$, with emissions $x_{i-1} = b$ and $x_i = c$.]

12 …and of Letter-Emissions Similarly, for each edge $s_i \to b$ and each state k, we compute the average number of times that $s_i = k$, which is the expected number of “k → b” emissions on this edge. Then we take $E_k(b)$ to be the sum over all such edges. These expected values are computed by assuming the current parameters θ. [Figure: an unknown state $s_i$ emitting $x_i = b$.]

13 Baum-Welch: Step E for $M_{kl}$ Count the average number of state transitions. For computing the averages, Baum-Welch computes, for each index i and states k, l, the following probability: $P(s_{i-1} = k, s_i = l \mid x, \theta)$. For this, it uses the forward and backward algorithms.

14 Reminder: finding state probabilities $p(s_i = k, x) = F_k(i)\, B_k(i)$. The values $\{F_k(i), B_k(i)\}$ for every i and k are computed by one run of the forward/backward algorithms as follows: $F_k(i) = p(x_1, \ldots, x_i, s_i = k)$, the probability that a path emits $(x_1, \ldots, x_i)$ and is in state $s_i = k$. $B_k(i) = p(x_{i+1}, \ldots, x_L \mid s_i = k)$, the probability that a path emits $(x_{i+1}, \ldots, x_L)$, given that state $s_i = k$.
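A compact sketch of the two recursions in plain Python, using 0-based positions and an explicit initial distribution `init` (an assumption; the slides do not state how the chain starts). The toy parameters are hypothetical. No scaling or log-space arithmetic is applied, so this form is suitable only for short sequences.

```python
def forward(x, init, m, e):
    """F[i][k] = p(x[0..i], s_i = k)."""
    K = range(len(init))
    F = [[init[k] * e[k][x[0]] for k in K]]
    for sym in x[1:]:
        prev = F[-1]
        F.append([e[l][sym] * sum(prev[k] * m[k][l] for k in K) for l in K])
    return F

def backward(x, m, e):
    """B[i][k] = p(x[i+1..L-1] | s_i = k); the last row is all ones."""
    K = range(len(m))
    B = [[1.0 for _ in K]]
    for sym in reversed(x[1:]):
        nxt = B[0]
        B.insert(0, [sum(m[k][l] * e[l][sym] * nxt[l] for l in K) for k in K])
    return B

# Hypothetical two-state toy model.
init = [0.5, 0.5]
m = [[0.9, 0.1], [0.2, 0.8]]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]

x = "HHTH"
F = forward(x, init, m, e)
B = backward(x, m, e)
px = sum(F[-1])                                   # p(x | theta)
posterior = [[F[i][k] * B[i][k] / px for k in range(2)] for i in range(len(x))]
```

Since $\sum_k F_k(i) B_k(i) = p(x)$ at every position i, each row of `posterior` sums to 1.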

15 Baum-Welch: Step E for $M_{kl}$ Claim: By the probability distribution of HMM, $p(s_{i-1} = k, s_i = l \mid x, \theta) = \dfrac{F_k(i-1)\, m_{kl}\, e_l(x_i)\, B_l(i)}{p(x \mid \theta)}$ ($m_{kl}$ and $e_l(x_i)$ are the parameters defined by θ, and $F_k(i-1)$, $B_l(i)$ are the outputs of the forward / backward algorithms).

16 Proof of claim $P(x_1, \ldots, x_L, s_{i-1} = k, s_i = l \mid \theta) = P(x_1, \ldots, x_{i-1}, s_{i-1} = k \mid \theta)\, m_{kl}\, e_l(x_i)\, P(x_{i+1}, \ldots, x_L \mid s_i = l, \theta) = F_k(i-1)\, m_{kl}\, e_l(x_i)\, B_l(i)$, where the first factor is obtained via the forward algorithm and the last factor via the backward algorithm. Dividing by $p(x \mid \theta)$ gives $p(s_{i-1} = k, s_i = l \mid x, \theta) = F_k(i-1)\, m_{kl}\, e_l(x_i)\, B_l(i) / p(x \mid \theta)$.

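17 Step E for $M_{kl}$ (end) For each pair (k, l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges: $M_{kl} = \sum_{i=1}^{L} p(s_{i-1} = k, s_i = l \mid x, \theta) = \dfrac{1}{p(x \mid \theta)} \sum_{i=1}^{L} F_k(i-1)\, m_{kl}\, e_l(x_i)\, B_l(i)$.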
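The sum over edges can be sketched as follows. This version assumes an explicit initial distribution `init` instead of a begin state, so a sequence of length L has L-1 edges between emitting states; positions are 0-based and all names and toy values are hypothetical.

```python
def forward(x, init, m, e):
    K = range(len(init))
    F = [[init[k] * e[k][x[0]] for k in K]]
    for sym in x[1:]:
        prev = F[-1]
        F.append([e[l][sym] * sum(prev[k] * m[k][l] for k in K) for l in K])
    return F

def backward(x, m, e):
    K = range(len(m))
    B = [[1.0 for _ in K]]
    for sym in reversed(x[1:]):
        nxt = B[0]
        B.insert(0, [sum(m[k][l] * e[l][sym] * nxt[l] for l in K) for k in K])
    return B

def expected_transitions(x, init, m, e):
    """M[k][l] = sum over edges of p(s_{i-1} = k, s_i = l | x, theta)."""
    F, B = forward(x, init, m, e), backward(x, m, e)
    px = sum(F[-1])
    K = range(len(init))
    M = [[0.0 for _ in K] for _ in K]
    for i in range(1, len(x)):
        for k in K:
            for l in K:
                M[k][l] += F[i - 1][k] * m[k][l] * e[l][x[i]] * B[i][l] / px
    return M

# Hypothetical two-state toy model.
init = [0.5, 0.5]
m = [[0.9, 0.1], [0.2, 0.8]]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]
x = "HHTHT"
M = expected_transitions(x, init, m, e)
```

The work is proportional to L times the square of the number of states, in contrast to the exponential cost of summing over all paths.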

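18 Step E for $M_{kl}$, with many sequences: Claim: When we have n independent input sequences $(x^1, \ldots, x^n)$ of lengths $L_1, \ldots, L_n$, then $M_{kl}$ is given by: $M_{kl} = \sum_{j=1}^{n} \dfrac{1}{p(x^j \mid \theta)} \sum_{i=1}^{L_j} F^j_k(i-1)\, m_{kl}\, e_l(x^j_i)\, B^j_l(i)$, where $F^j$ and $B^j$ are the forward and backward values computed for $x^j$.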

19 Proof of Claim: When we have n independent input sequences $(x^1, \ldots, x^n)$, the probability space is the product of n spaces. The probability of a simple event $((s^1, x^1), \ldots, (s^n, x^n))$ in this space with parameters θ is given by: $\prod_{j=1}^{n} p(s^j, x^j \mid \theta)$.

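20 Proof of Claim (cont): The probability of that simple event given $x = (x^1, \ldots, x^n)$: $p(s^1, \ldots, s^n \mid x, \theta) = \prod_{j=1}^{n} p(s^j \mid x^j, \theta)$. The probability of the compound event $(s^j, x^j)$ given $x = (x^1, \ldots, x^n)$: $p(s^j \mid x^j, \theta)$.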

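21 Proof of Claim (end): By linearity of expectation, $M_{kl} = \sum_{j=1}^{n} \sum_{s^j} M^{s^j}_{kl}\, p(s^j \mid x^j, \theta)$, i.e. the sum over j of the expected number of k to l transitions computed for $x^j$ alone.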

22 Baum-Welch: Step E for $E_k(b)$ Count the expected number of letter emissions: for state k and each symbol b, for each i where $x_i = b$, compute the expected number of times that $s_i = k$.

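23 Baum-Welch: Step E for $E_k(b)$ For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that $s_i = k$, over all i’s for which $x_i = b$: $E_k(b) = \sum_{i:\, x_i = b} p(s_i = k \mid x, \theta) = \dfrac{1}{p(x \mid \theta)} \sum_{i:\, x_i = b} F_k(i)\, B_k(i)$.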

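24 Step E for $E_k(b)$, many sequences Exercise: when we have n sequences $(x^1, \ldots, x^n)$, the expected number of emissions of b from k is given by: $E_k(b) = \sum_{j=1}^{n} \dfrac{1}{p(x^j \mid \theta)} \sum_{i:\, x^j_i = b} F^j_k(i)\, B^j_k(i)$.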

25 Summary: the E part of the Baum-Welch training This part computes the expected numbers $M_{kl}$ of k→l transitions for all pairs of states k and l, and the expected numbers $E_k(b)$ of emissions of symbol b from state k, for all states k and symbols b. The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

26 Baum-Welch: Step M Use the $M_{kl}$’s and $E_k(b)$’s to compute the new values of $m_{kl}$ and $e_k(b)$. These values define θ*. If θ* ≠ θ, then $p(x^1, \ldots, x^n \mid \theta^*) > p(x^1, \ldots, x^n \mid \theta)$, i.e., θ* increases the probability of the data unless it is equal to θ (this will follow from the correctness of the EM algorithm, to be proved later). This procedure is iterated until some convergence criterion is met.
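Putting the E and M steps together, one Baum-Welch iteration over several training sequences might look like the sketch below. It assumes an explicit initial distribution `init` that is held fixed rather than re-estimated, strictly positive starting parameters, and short sequences (no scaling); function names and toy data are hypothetical.

```python
import math

def forward(x, init, m, e):
    K = range(len(init))
    F = [[init[k] * e[k][x[0]] for k in K]]
    for sym in x[1:]:
        prev = F[-1]
        F.append([e[l][sym] * sum(prev[k] * m[k][l] for k in K) for l in K])
    return F

def backward(x, m, e):
    K = range(len(m))
    B = [[1.0 for _ in K]]
    for sym in reversed(x[1:]):
        nxt = B[0]
        B.insert(0, [sum(m[k][l] * e[l][sym] * nxt[l] for l in K) for k in K])
    return B

def baum_welch_step(seqs, init, m, e, symbols):
    K = range(len(init))
    M = [[0.0 for _ in K] for _ in K]
    E = [{b: 0.0 for b in symbols} for _ in K]
    loglik = 0.0
    for x in seqs:                                  # E step, summed over sequences
        F, B = forward(x, init, m, e), backward(x, m, e)
        px = sum(F[-1])
        loglik += math.log(px)
        for i in range(len(x)):
            for k in K:
                E[k][x[i]] += F[i][k] * B[i][k] / px
                if i > 0:
                    for l in K:
                        M[k][l] += F[i - 1][k] * m[k][l] * e[l][x[i]] * B[i][l] / px
    # M step: normalise expected counts exactly as in the known-states case.
    new_m = [[M[k][l] / sum(M[k]) for l in K] for k in K]
    new_e = [{b: E[k][b] / sum(E[k].values()) for b in symbols} for k in K]
    return new_m, new_e, loglik                     # loglik is for the OLD parameters

seqs = ["HHHHTHTTHH", "THHHHHHTTT"]
init = [0.5, 0.5]
m = [[0.7, 0.3], [0.4, 0.6]]
e = [{"H": 0.5, "T": 0.5}, {"H": 0.8, "T": 0.2}]
history = []
for _ in range(10):
    m, e, ll = baum_welch_step(seqs, init, m, e, "HT")
    history.append(ll)
```

The recorded log-likelihoods in `history` should never decrease from one iteration to the next, which gives a simple sanity check for an implementation; a practical stopping rule is to halt when the increase falls below a small threshold.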

27 Viterbi training: Maximizing the probability of the most probable path

28 Assume that rather than finding θ which maximizes the likelihood of the input $x^1, \ldots, x^n$, we wish to maximize the probability of a most probable path, i.e. to find parameters θ and state paths $s(x^1), \ldots, s(x^n)$ such that the value of $p(s(x^1), \ldots, s(x^n), x^1, \ldots, x^n \mid \theta)$ is maximized. Clearly, $s(x^j)$ should be the most probable path for $x^j$ under the parameters θ. We assume only one sequence (n=1). This is done by Viterbi training.

29 Maximizing the probability of the most probable path States are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e. the value of $p(s(x^1), \ldots, s(x^n), x^1, \ldots, x^n \mid \theta)$, where $s(x^j)$ is the most probable (under θ) path for $x^j$. We assume only one sequence (n=1).

30 Viterbi training Start from given values of $m_{kl}$ and $e_k(b)$, which define the initial values of θ. Each iteration: Step 1: Use Viterbi’s algorithm to find a most probable path s(x), which maximizes $p(s(x), x \mid \theta)$.

31 Viterbi training (cont) Step 2: Use the ML method for HMM when the states are known to find θ* which maximizes $p(s(x), x \mid \theta^*)$. Note: If after Step 2 we have $p(s(x), x \mid \theta^*) = p(s(x), x \mid \theta)$, then it must be that θ = θ*. In this case the next iteration will be identical to the current one, and hence we may terminate the algorithm.

32 Viterbi training (cont) Step 3: If θ ≠ θ*, set θ ← θ*, and repeat. If θ = θ*, stop.
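The three steps can be sketched for a single sequence as below. The sketch assumes a fixed initial distribution `init`, works with plain probabilities (short sequences only), and keeps the old parameters of any state that the current path never visits; that last rule, like the function names and toy data, is an assumption of this example rather than something stated in the lecture.

```python
def viterbi(x, init, m, e):
    """Return a most probable path s(x) and p(s(x), x | theta)."""
    K = range(len(init))
    V = [init[k] * e[k][x[0]] for k in K]
    back = []
    for sym in x[1:]:
        ptr = [max(K, key=lambda k: V[k] * m[k][l]) for l in K]
        V = [V[ptr[l]] * m[ptr[l]][l] * e[l][sym] for l in K]
        back.append(ptr)
    last = max(K, key=lambda k: V[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[last]

def reestimate(x, path, m, e, symbols):
    """ML parameters for the pair (x, path), as in the known-states case."""
    K = range(len(m))
    M = [[0 for _ in K] for _ in K]
    E = [{b: 0 for b in symbols} for _ in K]
    for i, (k, b) in enumerate(zip(path, x)):
        E[k][b] += 1
        if i > 0:
            M[path[i - 1]][k] += 1
    new_m = [[M[k][l] / sum(M[k]) for l in K] if sum(M[k]) else list(m[k]) for k in K]
    new_e = [{b: E[k][b] / sum(E[k].values()) for b in symbols}
             if sum(E[k].values()) else dict(e[k]) for k in K]
    return new_m, new_e

def viterbi_training(x, init, m, e, symbols, max_iter=50):
    scores = []
    for _ in range(max_iter):
        path, score = viterbi(x, init, m, e)                  # Step 1
        scores.append(score)
        new_m, new_e = reestimate(x, path, m, e, symbols)     # Step 2
        if new_m == m and new_e == e:                         # Step 3: theta* == theta
            break
        m, e = new_m, new_e
    return m, e, path, scores

x = "HHHHHTTTTTHHHHH"
init = [0.5, 0.5]
m0 = [[0.8, 0.2], [0.2, 0.8]]
e0 = [{"H": 0.7, "T": 0.3}, {"H": 0.3, "T": 0.7}]
m, e, path, scores = viterbi_training(x, init, m0, e0, "HT")
```

Each entry of `scores` is the joint probability of the current best path under the current parameters; neither step can lower it, so the list is non-decreasing.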

33 Viterbi training (end) Exercise: generalize the algorithm for the case where there are n training sequences $x^1, \ldots, x^n$, to find paths $\{s(x^1), \ldots, s(x^n)\}$ and parameters θ so that $\{s(x^1), \ldots, s(x^n)\}$ are most probable paths for $x^1, \ldots, x^n$ under θ.

34 Extensions of HMM

35 1. Monitoring probabilities of repetitions Markov chains are rather limited in describing sequences of symbols with non-random structures. For instance, a Markov chain forces the distribution of segments in which some state is repeated k+1 times to be $(1-p)p^k$, for some p. By adding states we may bypass this restriction. [Figure: a single state A with a self-loop.]

36 1. State duplications An extension of Markov chain which allows the distribution of segments in which a state is repeated k+1 times to have any desired value: Assign k+1 states to represent the same “real” state. This may model k repetitions (or less) with any desired probability. [Figure: duplicated states $A_1, A_2, A_3, A_4$.]
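As an illustration of the difference, the sketch below first evaluates the geometric run-length law of a single self-looping state, and then shows how a chain of duplicated states can reproduce an arbitrary run-length distribution. The chain topology assumed here (copy j either moves on to copy j+1 or leaves the block) is one possible choice, and the function names are hypothetical.

```python
def geometric_run_length(p, k):
    """Single state with self-loop probability p: P(run of length k+1) = (1-p) p^k."""
    return (1.0 - p) * p ** k

def continue_probs(q):
    """q[j] = desired P(run length = j+1). Returns c[j] = P(copy j+1 moves on to copy j+2)."""
    c, remaining = [], 1.0
    for qj in q:
        c.append(max(0.0, 1.0 - qj / remaining) if remaining > 0 else 0.0)
        remaining -= qj
    return c

def run_length_distribution(c):
    """Run-length distribution realised by the chain with continue-probabilities c."""
    dist, reach = [], 1.0            # reach = probability of reaching the current copy
    for cj in c:
        dist.append(reach * (1.0 - cj))
        reach *= cj
    return dist

target = [0.1, 0.4, 0.3, 0.2]        # a non-geometric distribution over lengths 1..4
c = continue_probs(target)
realised = run_length_distribution(c)
```

With four copies the chain matches the target distribution over lengths 1 to 4, which no single value of p in the geometric law can do.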

37 2. Silent states States which do not emit symbols. They can be used to model repetitions. They are also used to allow arbitrary jumps (may be used to model deletions). We need to generalize the Forward and Backward algorithms for arbitrary acyclic digraphs to account for the silent states. [Figure legend: silent states, regular states.]

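38 E.g., the forward algorithm must be adapted: a silent state at position i collects probability from its predecessor states at the same position i, without consuming a symbol. Directed cycles of silent (or other) states complicate things, and should be avoided. [Figure: states x, v, z; legend: silent states, regular states, symbols.]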
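One way to sketch a forward pass with silent states is shown below. It assumes that the silent states form no directed cycle and are supplied in topological order, that paths start in regular states, and that the probability of x is read off the regular states at the last position; the data layout and names are hypothetical.

```python
def forward_with_silent(x, init, trans, emit, silent_order):
    """
    init:         dict regular state -> start probability
    trans:        dict (k, l) -> transition probability (missing pairs are 0)
    emit:         dict regular state -> {symbol: probability}
    silent_order: silent states in topological order (no silent cycles)
    """
    regular = list(emit)
    states = regular + list(silent_order)
    F = []
    for i, sym in enumerate(x):
        col = {}
        # Regular states consume x[i]; predecessors come from the previous column.
        for l in regular:
            if i == 0:
                incoming = init.get(l, 0.0)
            else:
                incoming = sum(F[i - 1][k] * trans.get((k, l), 0.0) for k in states)
            col[l] = emit[l][sym] * incoming
        # Silent states consume nothing; predecessors are the regular states and
        # the earlier silent states of the SAME column, which are already in col.
        for l in silent_order:
            col[l] = sum(col[k] * trans.get((k, l), 0.0) for k in col)
        F.append(col)
    return F, sum(F[-1][k] for k in regular)

# Hypothetical model: regular states A and B, one silent state S.
trans = {("A", "A"): 0.5, ("A", "S"): 0.5,
         ("S", "A"): 0.3, ("S", "B"): 0.7,
         ("B", "B"): 0.6, ("B", "S"): 0.4}
emit = {"A": {"a": 0.8, "b": 0.2}, "B": {"a": 0.1, "b": 0.9}}
F, px = forward_with_silent("ab", {"A": 1.0}, trans, emit, ["S"])
```

For this toy model the silent state can be eliminated by hand (A to B has total probability 0.5 × 0.7 = 0.35, A to A has 0.5 + 0.5 × 0.3 = 0.65), and the ordinary forward algorithm on the reduced model gives the same value for `px`.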

39 3. High Order Markov Chains Markov chains in which the transition probabilities depend on the last k states: $P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}, \ldots, x_{i-k})$. Can be represented by a standard Markov chain with more states, e.g. for k=2: [Figure: a chain over the four states AA, AB, BA, BB.]
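The k=2 construction can be sketched as follows: each state of the new chain is a pair of consecutive symbols, and the pair (a, b) may only move to a pair of the form (b, c). The function name and the numbers in `p2` are hypothetical.

```python
from itertools import product

def to_first_order(alphabet, p2):
    """p2[(a, b)][c] = P(x_i = c | x_{i-2} = a, x_{i-1} = b).
    Returns T with T[(a, b)][(b, c)] = p2[(a, b)][c] and 0 elsewhere."""
    pairs = list(product(alphabet, repeat=2))
    T = {u: {v: 0.0 for v in pairs} for u in pairs}
    for a, b in pairs:
        for c in alphabet:
            T[(a, b)][(b, c)] = p2[(a, b)][c]
    return T

# Hypothetical second-order chain over the alphabet {A, B}.
p2 = {("A", "A"): {"A": 0.9, "B": 0.1},
      ("A", "B"): {"A": 0.5, "B": 0.5},
      ("B", "A"): {"A": 0.3, "B": 0.7},
      ("B", "B"): {"A": 0.2, "B": 0.8}}
T = to_first_order("AB", p2)
```

The resulting chain has four states instead of two, and each of its rows has at most two non-zero entries.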

40 4. Inhomogeneous Markov Chains An important task in analyzing DNA sequences is recognizing the genes which code for proteins. A triplet of 3 nucleotides, a codon, codes for an amino acid. It is known that in parts of DNA which code for genes, the three codon positions have different statistics. Thus a Markov chain model for DNA should represent not only the nucleotide (A, C, G or T), but also its position: the same nucleotide in different positions will have different transition probabilities. Used in the GENEMARK gene finding program (93).
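A minimal sketch of such a position-dependent chain uses three transition tables, selected by the codon position of the nucleotide being generated. The probabilities below are made up for illustration and are not GENEMARK parameters; the function name and the `phase` argument are likewise hypothetical.

```python
NUC = "ACGT"

def codon_aware_probability(seq, init, trans, phase=0):
    """trans[f][a][b] = P(next = b | current = a) when the next nucleotide
    sits at codon position f (0, 1 or 2). phase = codon position of seq[0]."""
    p = init[seq[0]]
    for i in range(1, len(seq)):
        f = (phase + i) % 3
        p *= trans[f][seq[i - 1]][seq[i]]
    return p

# Made-up tables: positions 0 and 1 are uniform, position 2 favours C and G.
uniform = {a: {b: 0.25 for b in NUC} for a in NUC}
biased = {a: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1} for a in NUC}
trans = [uniform, uniform, biased]
init = {a: 0.25 for a in NUC}

in_frame = codon_aware_probability("ATGGCC", init, trans, phase=0)
shifted = codon_aware_probability("ATGGCC", init, trans, phase=1)
```

The same sequence receives different probabilities under different reading frames, which is the property a gene finder can exploit to locate coding regions and their frame.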
