Inference and Parameter Estimation in HMM. Lecture 11, Computational Genomics. © Shlomo Moran, Ydo Wexler, Dan Geiger (Technion), modified by Benny Chor.

1 Inference and Parameter Estimation in HMM. Lecture 11, Computational Genomics. © Shlomo Moran, Ydo Wexler, Dan Geiger (Technion), modified by Benny Chor (TAU). Background Readings: Chapter 3.3 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.

2 Hidden Markov Models. Today we will further explore algorithmic questions related to HMMs. Recall that an HMM is a network of the following form: [Figure: a chain of hidden states $S_1 \to S_2 \to \dots \to S_L$, where each state $S_i$ emits an observation $O_i$.]

3 HMMs – Question I
• Given an observation sequence $O = (O_1 O_2 O_3 \dots O_L)$ and a model $M = \{A, B, \pi\}$, how do we efficiently compute $\Pr(O \mid M)$, the probability that the given model $M$ produces the observation $O$ in a run of length $L$?
• Solved using the Baum-Welch forward-backward algorithm (dynamic programming).

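4 Coin-Tossing Example. $L$ tosses; hidden states $H_i \in \{\text{fair}, \text{loaded}\}$, observations $X_i \in \{\text{head}, \text{tail}\}$.
Start: fair 1/2, loaded 1/2.
Transitions: fair → fair 0.9, fair → loaded 0.1, loaded → loaded 0.9, loaded → fair 0.1.
Emissions: fair coin: head 1/2, tail 1/2; loaded coin: head 3/4, tail 1/4.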

5 HMM – Question II (Harder)
• Given an observation sequence $O = (O_1 O_2 \dots O_L)$ and a model $M = \{A, B, \pi\}$, how do we efficiently compute the most probable sequence(s) of states?
• Namely, the sequence of states $S = (S_1 S_2 \dots S_L)$ which maximizes $\Pr(S \mid O, M)$, or equivalently the joint probability $\Pr(O, S \mid M)$ that the given model $M$ goes through the specific sequence of states $S$ and produces the given observation $O$.
• Recall that given a model $M$, a sequence of observations $O$, and a sequence of states $S$, we can efficiently compute $\Pr(O \mid S, M)$ (but should watch out for numeric underflows).
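The underflow remark can be illustrated with a minimal sketch that sums logarithms instead of multiplying probabilities. The dictionaries `START`, `TRANS`, `EMIT` and the function `log_joint` are illustrative names, not part of the lecture; the numbers are the coin-tossing model from slide 4, with states F (fair) and L (loaded) and symbols H and T.

```python
import math

# Coin-tossing model (values taken from the coin-tossing slide).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def log_joint(obs, states):
    """Return log Pr(O, S | M) for an observation string and a state string."""
    logp = math.log(START[states[0]]) + math.log(EMIT[states[0]][obs[0]])
    # Each later position contributes one transition and one emission factor.
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        logp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][o])
    return logp

# Short sequence: exponentiating recovers the ordinary probability.
print(math.exp(log_joint("HHT", "FFF")))
# Long sequence: the plain product would underflow to 0.0, the log stays finite.
print(log_joint("H" * 5000, "L" * 5000))
```

Working in log space is one common choice; rescaling the probabilities at every step is another.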

6 Easy: Finding the most probable state at step t
1. The forward algorithm finds $\{f_t(i) = \Pr(O_1,\dots,O_t, q_t = S_i) : t = 1,\dots,L\}$.
2. The backward algorithm finds $\{b_t(i) = \Pr(O_{t+1},\dots,O_L \mid q_t = S_i) : t = 1,\dots,L\}$.
3. Return $\{\Pr(q_t = S_i \mid O) = f_t(i)\, b_t(i) / \Pr(O) : t = 1,\dots,L\}$ (the proof uses conditional probability).
To compute this for every $i$, simply run the forward and backward algorithms once and compute $f_t(i)\, b_t(i)$ for every $i, t$; the normalizer is $\Pr(O) = \sum_i f_L(i)$.
$f_t(i)$ is the probability of a path which emits $(O_1,\dots,O_t)$ and in which $q_t = S_i$. $b_t(i)$ is the probability of emitting $(O_{t+1},\dots,O_L)$ given that $q_t = S_i$.
Notice the change of notation: $f_t(i)$ and $b_t(i)$ denote the forward and backward probabilities.
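The three steps can be sketched in a few lines of Python. This is an unscaled illustration for short sequences only (a practical implementation would rescale or use logs). The names `forward`, `backward` and `posteriors` are hypothetical, and the model is the coin-tossing example with states F/L and symbols H/T.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def forward(obs):
    """f[t][s] = Pr(O_1..O_t, q_t = s), with t counted from 0 in the list."""
    f = [{s: START[s] * EMIT[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][o] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f

def backward(obs):
    """b[t][s] = Pr(O_{t+1}..O_L | q_t = s); the last entry is all ones."""
    b = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][o] * nxt[r] for r in STATES)
                     for s in STATES})
    return b

def posteriors(obs):
    """Pr(q_t = s | O) = f_t(s) * b_t(s) / Pr(O) for every t and s."""
    f, b = forward(obs), backward(obs)
    likelihood = sum(f[-1].values())  # Pr(O) = sum over s of f_L(s)
    return [{s: f[t][s] * b[t][s] / likelihood for s in STATES}
            for t in range(len(obs))]

for t, p in enumerate(posteriors("HHT"), start=1):
    print(t, p)
```

Note that picking the most probable state separately at each step does not necessarily give the most probable path as a whole.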

7 Reminder: Most Probable State Path. Given an output sequence $O = (O_1,\dots,O_L)$, a most probable path $s^* = (s^*_1,\dots,s^*_L)$ is one which maximizes $\Pr(s \mid O)$ over all state paths $s$.

8 Reformulating: MAP in a Given HMM (all probabilities conditioned on M)
1. Recall that Question I, the likelihood of the observation, is to compute $\Pr(O_1,\dots,O_L) = \sum_{(S_1,\dots,S_L)} \Pr(O_1,\dots,O_L, S_1,\dots,S_L)$.
2. Now we wish to compute a similar quantity: $P^*(O_1,\dots,O_L) = \max_{(S_1,\dots,S_L)} \Pr(O_1,\dots,O_L, S_1,\dots,S_L)$, and to find a maximum a posteriori (MAP) assignment $(S_1^*,\dots,S_L^*)$ that yields this maximum.

9 Reformulating: MAP in a Given HMM (all probabilities conditioned on M). Goal: compute $\max_{(S_1,\dots,S_L)} \Pr(S_1,\dots,S_L \mid O_1,\dots,O_L)$ and find a maximum a posteriori (MAP) assignment $(S_1^*,\dots,S_L^*)$. Since $\Pr(O_1,\dots,O_L)$ does not depend on the states, the same assignment maximizes the joint probability $\Pr(O_1,\dots,O_L, S_1,\dots,S_L)$. Solution: the Viterbi algorithm (1967), due to Andrea (Andrew) Viterbi (born 1935). See the HMM link from Roger Boyle's page at Leeds for additional information, including an online demo: http://www.comp.leeds.ac.uk/roger/

10 The Viterbi Algorithm White board presentation.
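Since the algorithm itself was presented on the whiteboard, here is a minimal sketch of the standard Viterbi recursion in log space, applied to the coin-tossing model. It is an illustration under that assumption, not a transcript of the whiteboard derivation; the function name `viterbi` and the F/L, H/T encoding are illustrative.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def viterbi(obs):
    """Return (most probable state path, log of its joint probability)."""
    # v[s] = best log-probability of any path that ends in state s so far.
    v = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        ptr, new_v = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            new_v[s] = v[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
        back.append(ptr)
        v = new_v
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]

path, logp = viterbi("HHT")
print(path, math.exp(logp))
```

The running time is proportional to the sequence length times the square of the number of states, compared with the exponential number of paths examined by exhaustive search.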

11 Example: Dishonest Casino

12 Dishonest Casino: computing posterior probabilities for “fair” at each time point (separately) in a long sequence. [Figure not reproduced.]

13 Dishonest Casino: computing the maximum a posteriori (MAP) state sequence $(S_1^*,\dots,S_L^*)$ in a long sequence, using Viterbi (from Durbin et al.’s book, p. 57). [Figure not reproduced.]

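14 Coin-Tossing Example – Viterbi’s algorithm. A reminder of the model: $L$ tosses with hidden states fair/loaded and outcomes head/tail; the start probabilities are 1/2 each; a coin is kept with probability 0.9 and switched with probability 0.1; the fair coin gives head or tail with probability 1/2 each, and the loaded coin gives head with probability 3/4 and tail with probability 1/4. Q2: what is the most likely sequence of states to generate the given data?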

15 Loaded Coin Example – Exhaustive Solution. Small example: 3 tosses. Outcomes: head, head, tail. Each row gives the joint probability $\Pr(O_1,O_2,O_3, S_1,S_2,S_3)$: emission terms, then the start probability, then the two transition terms.
S1,S2,S3   Pr(O1,O2,O3, S1,S2,S3)
F,F,F      (0.5)^3 * 0.5 * (0.9)^2 = 0.050625
F,F,L      (0.5)^2 * 0.25 * 0.5 * 0.9 * 0.1 = 0.0028125
F,L,F      0.5 * 0.75 * 0.5 * 0.5 * 0.1 * 0.1 = 0.0009375
F,L,L      0.5 * 0.75 * 0.25 * 0.5 * 0.1 * 0.9 = 0.00422
L,F,F      0.75 * 0.5 * 0.5 * 0.5 * 0.1 * 0.9 = 0.0084375
L,F,L      0.75 * 0.5 * 0.25 * 0.5 * 0.1 * 0.1 = 0.000468
L,L,F      0.75 * 0.75 * 0.5 * 0.5 * 0.9 * 0.1 = 0.01265
L,L,L      0.75 * 0.75 * 0.25 * 0.5 * 0.9 * 0.9 = 0.0569 (max)
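The exhaustive table can be reproduced by enumerating all $2^3$ state paths. The sketch below assumes the same coin model; `joint` and `table` are illustrative names.

```python
from itertools import product

START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}

def joint(obs, path):
    """Joint probability Pr(O, S) of an observation string and a state path."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, o in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][o]
    return p

obs = "HHT"  # head, head, tail
# Enumerate every state path of length 3 over {F, L}.
table = {"".join(path): joint(obs, path) for path in product("FL", repeat=len(obs))}
for path, p in table.items():
    print(path, p)
best = max(table, key=table.get)
print("most probable path:", best)
```

The number of paths grows exponentially with the number of tosses, which is why the dynamic-programming solution is needed for anything but toy examples.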

We have dealt with HMM Question I (computing the likelihood of an observation) and with HMM Question II (finding the most likely sequence of states). Time for HMM Question III (hardest): determining the model parameters.

18 HMM – Question III (Hardest)
• Given an observation sequence $O = (O_1 O_2 \dots O_L)$ and a class of models, each of the form $M = \{A, B, \pi\}$, which specific model “best” explains the observations?
• A solution to HMM Question I enables the efficient computation of $\Pr(O \mid M)$ (the probability that a specific model $M$ produces the observation $O$).
• Question III can be viewed as a learning problem: we want to use the sequence of observations in order to “train” an HMM and learn the optimal underlying model parameters (transition and output probabilities).

19 Parameter Estimation for HMM. An HMM with a given structure is defined by the parameters $a_{kl}$ (the probability of a transition from state $k$ to state $l$) and $e_k(b)$ (the probability that state $k$ emits symbol $b$), for all states $k, l$ and all symbols $b$. Let $\theta$ denote the collection of these parameters.

20 Parameter Estimation for HMM. To determine the values of (the parameters in) $\theta$, use a training set $\{x^1,\dots,x^n\}$, where each $x^j$ is a sequence of observations, assumed to be generated by the model. Given the parameters $\theta$, each sequence $x^j$ has a well-defined likelihood $\Pr(x^j \mid \theta)$ (or $\Pr(x^j \mid \theta, \text{HMM})$).

21 ML Parameter Estimation for HMM. The elements of the training set $\{x^1,\dots,x^n\}$ are assumed to be independent, so $\Pr(x^1,\dots,x^n \mid \theta) = \prod_j \Pr(x^j \mid \theta)$. ML parameter estimation looks for the $\theta$ which maximizes the above. The exact method for finding or approximating this $\theta$ depends on the nature of the training set used.

22 Data for HMM. Possible properties of (the sequences in) the training set:
1. For each $x^j$, what is our information on the states $s_i$? (The symbols $x_i$ are usually known.)
2. The size (number of sequences) of the training set.

23 Case 1: State sequences are fully known. We know the complete structure of each sequence in the training set $\{x^1,\dots,x^n\}$. We wish to estimate $a_{kl}$ and $e_k(b)$ for all pairs of states $k, l$ and symbols $b$. By the ML method, we look for parameters $\theta^*$ which maximize the probability of the sample set: $\Pr(x^1,\dots,x^n \mid \theta^*) = \max_\theta \Pr(x^1,\dots,x^n \mid \theta)$.

24 Case 1: State sequences known; ML method. Let $m_{kl} = |\{i : s_{i-1} = k, s_i = l\}|$ and $m_k(b) = |\{i : s_i = k, x_i = b\}|$ be the empirical counts in $x^j$. For each $x^j$ (sequence no. $j$, together with its known state path) we have: $\Pr(x^j \mid \theta) = \prod_{(k,l)} a_{kl}^{m_{kl}} \prod_{(k,b)} [e_k(b)]^{m_k(b)}$.

25 Case 1 (cont). By the independence of the $x^j$’s, $\Pr(x^1,\dots,x^n \mid \theta) = \prod_j \Pr(x^j \mid \theta)$. Thus, if $A_{kl}$ = total #(transitions from $k$ to $l$) in the training set, and $E_k(b)$ = total #(emissions of symbol $b$ from state $k$) in the training set, we have: $\Pr(x^1,\dots,x^n \mid \theta) = \prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$.

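26 Case 1 (cont). So we need to find $a_{kl}$’s and $e_k(b)$’s which maximize $\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$, subject to: for every state $k$, $\sum_l a_{kl} = 1$ and $\sum_b e_k(b) = 1$, with all $a_{kl}, e_k(b) \ge 0$.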

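27 Case 1 (cont). Rewriting, we need to maximize: $F = \prod_k \left[ \prod_l a_{kl}^{A_{kl}} \right] \times \prod_k \left[ \prod_b [e_k(b)]^{E_k(b)} \right]$, subject to the same normalization constraints for every state $k$.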

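28 Case 1 (cont). If we maximize, for each state $k$, $\prod_l a_{kl}^{A_{kl}}$ subject to $\sum_l a_{kl} = 1$, and also $\prod_b [e_k(b)]^{E_k(b)}$ subject to $\sum_b e_k(b) = 1$, then we will also maximize $F$, the product of all these factors. Each of the above is a simpler ML problem (a multi-faceted die). This problem admits a simple, intuitive solution, and was solved in a homework assignment (or will be).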

29 Apply the ML method to HMM. Let $A_{kl}$ = total #(transitions from $k$ to $l$) in the training set, and $E_k(b)$ = total #(emissions of symbol $b$ from state $k$) in the training set. We need to maximize $\prod_{(k,l)} a_{kl}^{A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$.

30 Apply to HMM (cont.). We apply the ML solution to get, for each $k$, the parameters $\{a_{kl} \mid l = 1,\dots,m\}$ and $\{e_k(b) \mid b \in \Sigma\}$: $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$, which gives the optimal ML parameters.

31 Adding pseudocounts in HMM. If the sample set is too small, we may get a biased result, reflecting the sample but not the true model (overfitting). In this case we modify the actual counts by our prior knowledge/belief: $r_{kl}$ is our prior belief on transitions from $k$ to $l$, and $r_k(b)$ is our prior belief on emissions of $b$ from state $k$. The counts used for estimation then become $A_{kl} + r_{kl}$ and $E_k(b) + r_k(b)$.
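Case 1 estimation, with optional pseudocounts, can be sketched as follows. The function `estimate` and its arguments are hypothetical. As a simplifying assumption, a single constant `pseudo` stands in for all the prior beliefs $r_{kl}$ and $r_k(b)$, and a state that is never observed (with `pseudo=0`) falls back to a uniform distribution.

```python
from collections import Counter

def estimate(labeled, states, symbols, pseudo=0.0):
    """ML estimates of a_kl and e_k(b) from (symbols, states) string pairs.

    Assumption: one constant pseudocount plays the role of every r_kl and r_k(b).
    """
    A, E = Counter(), Counter()
    for x, s in labeled:
        A.update(zip(s, s[1:]))  # A[k, l]: transitions from k to l
        E.update(zip(s, x))      # E[k, b]: emissions of b from state k
    a, e = {}, {}
    for k in states:
        total_a = sum(A[k, l] + pseudo for l in states)
        total_e = sum(E[k, b] + pseudo for b in symbols)
        for l in states:
            # Uniform fallback if state k has no outgoing transitions at all.
            a[k, l] = (A[k, l] + pseudo) / total_a if total_a else 1 / len(states)
        for b in symbols:
            e[k, b] = (E[k, b] + pseudo) / total_e if total_e else 1 / len(symbols)
    return a, e

training = [("HHT", "FFL")]  # one labeled sequence: tosses and coin used
a, e = estimate(training, "FL", "HT")
a_smooth, e_smooth = estimate(training, "FL", "HT", pseudo=1.0)
print(e["F", "H"], e_smooth["F", "H"])
```

With so little data the raw estimate claims the fair coin never shows tail; the pseudocount pulls the estimate back toward the prior.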

32 Summary of Case 1: State sequences are fully known. We know the complete structure of each sequence in the training set $\{x^1,\dots,x^n\}$. We wish to estimate $a_{kl}$ and $e_k(b)$ for all pairs of states $k, l$ and symbols $b$. We just showed a method which finds the unique parameters $\theta^*$ which maximize the likelihood: $\Pr(x^1,\dots,x^n \mid \theta^*) = \max_\theta \Pr(x^1,\dots,x^n \mid \theta)$.

33 Case 2: State paths are unknown. In this case only the values of the $x_i$’s of the input sequences are known. This is an ML problem with “missing data”. We wish to find $\theta^*$ so that $\Pr(x \mid \theta^*) = \max_\theta \Pr(x \mid \theta)$. For each sequence $x$, $\Pr(x \mid \theta) = \sum_s \Pr(x, s \mid \theta)$, taken over all state paths $s$.

34 Case 2: State paths are unknown. So we need to maximize $\Pr(x \mid \theta) = \sum_s \Pr(x, s \mid \theta)$, where the summation is over all the state paths $s$ which produce the output sequence $x$. Finding $\theta^*$ which maximizes $\sum_s \Pr(x, s \mid \theta)$ is hard, unlike finding $\theta^*$ which maximizes $\Pr(x, s \mid \theta)$ for a single pair $(x, s)$.

35 ML Parameter Estimation for HMM. The general process for finding $\theta$ in this case is:
1. Start with an initial value of $\theta$.
2. Compute $\Pr(x^1,\dots,x^n \mid \theta)$ (Question I).
3. Look for a “close by” $\theta'$ so that $\Pr(x^1,\dots,x^n \mid \theta') > \Pr(x^1,\dots,x^n \mid \theta)$.
4. Set $\theta = \theta'$.
5. Repeat until some convergence criterion is met.
This fits the general framework of hill climbing, or local search. We will discuss a more “targeted” search strategy.

36 ML Parameter Estimation for HMM. The general process for finding $\theta$ in this case is:
1. Start with an initial value of $\theta$.
2. Find $\theta'$ so that $\Pr(x^1,\dots,x^n \mid \theta') > \Pr(x^1,\dots,x^n \mid \theta)$.
3. Set $\theta = \theta'$.
4. Repeat until some convergence criterion is met.
The Expectation-Maximization (EM) algorithm, which we will describe next, fits this general framework. For the specific case of HMMs, it is called Baum-Welch training.

37 Learning the parameters (EM algorithm). A common algorithm to learn the parameters from unlabeled sequences is called Expectation-Maximization (EM). In the current context it reads as follows:
Start with some probability tables (many possible choices).
Iterate until convergence:
E-step: Compute $\Pr(q_{t-1} = S_i, q_t = S_j \mid O_1,\dots,O_L)$ using the current probability tables (“current parameters”).
M-step: Use the expected counts found to update the local probability tables via maximum likelihood ($= n_{s_1 \to s_2} / n_{s_1}$, the expected number of transitions from $s_1$ to $s_2$ divided by the expected number of transitions out of $s_1$).
We start with the E-step (closely related to the so-called belief update in AI). As usual, whiteboard presentation.

38 Baum-Welch training. We start with some initial values of $a_{kl}$ and $e_k(b)$, which define the initial value of $\theta$. Baum-Welch training is an iterative algorithm which attempts to replace $\theta$ by a $\theta^*$ such that $\Pr(x \mid \theta^*) > \Pr(x \mid \theta)$. Each iteration consists of three main steps: first computations, then expectation, and finally maximization.

39 Baum-Welch Step 1: Expectation. Compute the expected number of state transitions: for each sequence $x^j$, for each position $i$, and for each pair of states $k, l$, compute the posterior state transition probabilities $\Pr(s_{i-1} = k, s_i = l \mid x^j, \theta)$.

40 Step 1: Computing $\Pr(s_{i-1} = k, s_i = l \mid x^j, \theta)$
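The formula on this slide did not survive in the transcript. The standard expression, as given in Durbin et al. and stated here as a reconstruction rather than the slide's exact content, follows from the Markov property. Below, $f_k(i-1) = \Pr(x_1,\dots,x_{i-1}, s_{i-1} = k)$ and $b_l(i) = \Pr(x_{i+1},\dots,x_L \mid s_i = l)$ are the forward and backward values for the sequence $x = x^j$.

```latex
\begin{align*}
\Pr(x, s_{i-1}=k, s_i=l \mid \theta)
  &= \Pr(x_1,\dots,x_{i-1}, s_{i-1}=k)\,
     \Pr(s_i=l \mid s_{i-1}=k)\,
     \Pr(x_i \mid s_i=l)\,
     \Pr(x_{i+1},\dots,x_L \mid s_i=l) \\
  &= f_k(i-1)\; a_{kl}\; e_l(x_i)\; b_l(i), \\[4pt]
\Pr(s_{i-1}=k, s_i=l \mid x, \theta)
  &= \frac{f_k(i-1)\; a_{kl}\; e_l(x_i)\; b_l(i)}{\Pr(x \mid \theta)} .
\end{align*}
```

The denominator $\Pr(x \mid \theta)$ is the Question I likelihood, available from the forward pass.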

41 Step 1 (end). For each sequence $x^j$, we computed the probability of a transition from state $k$ to $l$ at each position. Now, if $x^j$ is of length $L_j$, then the expected number of transitions from state $k$ to $l$ along the sequence equals $\sum_{i=2}^{L_j} \Pr(s_{i-1} = k, s_i = l \mid x^j, \theta)$.

42 Step 1 (end). For each sequence $x^j$, we computed the probability of a transition from state $k$ to $l$ at each position. For $n$ sequences $x^1,\dots,x^n$, the expected number (over all sequences) of transitions from state $k$ to $l$ equals $A_{kl} = \sum_{j=1}^{n} \sum_{i=2}^{L_j} \Pr(s_{i-1} = k, s_i = l \mid x^j, \theta)$.

43 Baum-Welch: Step 2. For each state $k$ and each symbol $b$, compute the expected number of emissions of $b$ from state $k$: $E_k(b) = \sum_{j=1}^{n} \sum_{i : x^j_i = b} \Pr(s_i = k \mid x^j, \theta)$ (the inner sum is only over positions in the sequence where $b$ was emitted).

44 Baum-Welch Step 3: Maximization. Use the $A_{kl}$’s and $E_k(b)$’s to compute the new values of $a_{kl}$ and $e_k(b)$. These values define $\theta^*$. It can be shown that $p(x^1,\dots,x^n \mid \theta^*) \ge p(x^1,\dots,x^n \mid \theta)$, i.e., $\theta^*$ does not decrease the probability of the data, and strictly increases it unless $\theta$ is already a stationary point. This procedure is iterated until some convergence criterion is met.
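One Baum-Welch iteration (Steps 1 to 3) can be sketched as follows. This is an unscaled illustration suitable only for short sequences. The function names are hypothetical, parameters are stored in dictionaries keyed by `(k, l)` and `(k, b)`, and the start distribution is kept fixed because the slides only re-estimate $a_{kl}$ and $e_k(b)$.

```python
import math

def forward(x, start, a, e, states):
    f = [{k: start[k] * e[k, x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: e[l, sym] * sum(f[-1][k] * a[k, l] for k in states)
                  for l in states})
    return f

def backward(x, a, e, states):
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(a[k, l] * e[l, sym] * b[0][l] for l in states)
                     for k in states})
    return b

def baum_welch_step(seqs, start, a, e, states, symbols):
    """Return (new a, new e, log-likelihood of seqs under the OLD parameters)."""
    A = {(k, l): 0.0 for k in states for l in states}
    E = {(k, c): 0.0 for k in states for c in symbols}
    loglik = 0.0
    for x in seqs:
        f, b = forward(x, start, a, e, states), backward(x, a, e, states)
        px = sum(f[-1][k] for k in states)  # Pr(x | theta)
        loglik += math.log(px)
        # Step 1: expected transition counts A_kl.
        for i in range(1, len(x)):
            for k in states:
                for l in states:
                    A[k, l] += f[i - 1][k] * a[k, l] * e[l, x[i]] * b[i][l] / px
        # Step 2: expected emission counts E_k(b).
        for i, sym in enumerate(x):
            for k in states:
                E[k, sym] += f[i][k] * b[i][k] / px
    # Step 3: maximization, i.e. normalize the expected counts.
    new_a = {(k, l): A[k, l] / sum(A[k, m] for m in states)
             for k in states for l in states}
    new_e = {(k, c): E[k, c] / sum(E[k, d] for d in symbols)
             for k in states for c in symbols}
    return new_a, new_e, loglik

states, symbols = "FL", "HT"
start = {"F": 0.5, "L": 0.5}
a = {("F", "F"): 0.9, ("F", "L"): 0.1, ("L", "F"): 0.1, ("L", "L"): 0.9}
e = {("F", "H"): 0.5, ("F", "T"): 0.5, ("L", "H"): 0.75, ("L", "T"): 0.25}
seqs = ["HHTHHHTTHH", "THTTHHHHHH"]
for _ in range(5):
    a, e, loglik = baum_welch_step(seqs, start, a, e, states, symbols)
    print(loglik)
```

The printed log-likelihoods should never decrease from one iteration to the next, which is a convenient sanity check on any implementation.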

45 This is the basic EM Algorithm
• Many extensions have been considered.
• The algorithm is extensively used in practice.
• Numerical considerations, like underflow, are important in any implementation.
• Convergence to some local maximum is guaranteed, even though this process may be very slow.
• The choice of model (how many states, how many non-zero directed edges) is crucial for meaningful results.
• Too many states may cause overfitting and slow down convergence.
• Still, EM is one of the most widely used heuristics.

46 A Variant when State paths are unknown: Viterbi Training. Also start from initial values of $a_{kl}$ and $e_k(b)$, which define the initial value of $\theta$. Viterbi training attempts to maximize the probability of a most probable path, i.e., to maximize $p(s(x^1),\dots,s(x^n) \mid \theta, x^1,\dots,x^n)$, where $s(x^j)$ is the most probable (under $\theta$) path for $x^j$. It finds paths, computes optimal ML values for them, then iterates.

47 A Variant when State paths are unknown: Viterbi Training. Each iteration:
1. Find a set $\{s(x^j)\}$ of most probable paths, which maximize $p(s(x^1),\dots,s(x^n) \mid \theta, x^1,\dots,x^n)$.
2. Find $\theta^*$ which maximizes $p(s(x^1),\dots,s(x^n) \mid \theta^*, x^1,\dots,x^n)$.
3. Set $\theta = \theta^*$, and repeat. Stop when the paths are not changed.
Note: in 1 the maximizing arguments are the paths; in 2 it is $\theta^*$.

48 Case 2: State paths are unknown: Viterbi training. $\Pr(s(x^1),\dots,s(x^n) \mid \theta^*, x^1,\dots,x^n)$ can be expressed in closed form (since we are using a single path for each $x^j$), so this time convergence is achieved when the optimal paths no longer change. State paths are discrete, so convergence after finitely many steps is guaranteed (but maybe exponentially many). On the negative side, it is not $\Pr(x^1,\dots,x^n \mid \theta^*)$ that is maximized, and in general EM performs better.
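The Viterbi training loop can be sketched as below. Two choices here are assumptions rather than part of the slides: a pseudocount of 1 keeps every re-estimated probability positive (so the logarithms are defined), and `max_iter` caps the loop as a safeguard. All function names are illustrative.

```python
import math
from collections import Counter

def viterbi(x, start, a, e, states):
    """Most probable state path for x under the current parameters."""
    v = {k: math.log(start[k]) + math.log(e[k, x[0]]) for k in states}
    back = []
    for sym in x[1:]:
        ptr = {l: max(states, key=lambda k: v[k] + math.log(a[k, l]))
               for l in states}
        v = {l: v[ptr[l]] + math.log(a[ptr[l], l]) + math.log(e[l, sym])
             for l in states}
        back.append(ptr)
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

def reestimate(seqs, paths, states, symbols, pseudo=1.0):
    """Case 1 ML estimation, treating the decoded paths as known states."""
    A, E = Counter(), Counter()
    for x, s in zip(seqs, paths):
        A.update(zip(s, s[1:]))
        E.update(zip(s, x))
    a = {(k, l): (A[k, l] + pseudo) / sum(A[k, m] + pseudo for m in states)
         for k in states for l in states}
    e = {(k, c): (E[k, c] + pseudo) / sum(E[k, d] + pseudo for d in symbols)
         for k in states for c in symbols}
    return a, e

def viterbi_training(seqs, start, a, e, states, symbols, max_iter=100):
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi(x, start, a, e, states) for x in seqs]
        if new_paths == paths:  # stop when the paths are not changed
            break
        paths = new_paths
        a, e = reestimate(seqs, paths, states, symbols)
    return a, e, paths

start = {"F": 0.5, "L": 0.5}
a0 = {("F", "F"): 0.9, ("F", "L"): 0.1, ("L", "F"): 0.1, ("L", "L"): 0.9}
e0 = {("F", "H"): 0.5, ("F", "T"): 0.5, ("L", "H"): 0.75, ("L", "T"): 0.25}
seqs = ["HHTHHHHTTT", "TTHTHHHHHH"]
a, e, paths = viterbi_training(seqs, start, a0, e0, "FL", "HT")
print(paths)
```

Each iteration reuses the fully-labeled estimation of Case 1, which is what makes this variant simpler than Baum-Welch.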

49 Example I: Homogeneous HMM, one sample.
Start with some probability tables.
Iterate until convergence:
E-step: Compute $p_\theta(h_i \mid h_{i-1}, x_1,\dots,x_L)$ from $p_\theta(h_i, h_{i-1} \mid x_1,\dots,x_L)$, which is computed using the forward-backward algorithm as explained earlier.
M-step: Update the parameters simultaneously: $\theta \leftarrow \sum_i \left[\, p_\theta(h_i = 1 \mid h_{i-1} = 1, x_1,\dots,x_L) + p_\theta(h_i = 0 \mid h_{i-1} = 0, x_1,\dots,x_L) \,\right] / (L-1)$.

50 Coin-Tossing Example. Numeric example: 3 tosses. Outcomes: head, head, tail.

51 Coin-Tossing Example. Numeric example: 3 tosses. Outcomes: head, head, tail.
Recall: $F(h_i) = \Pr(x_1,\dots,x_i, h_i) = \sum_{h_{i-1}} \Pr(x_1,\dots,x_{i-1}, h_{i-1}) \Pr(h_i \mid h_{i-1}) \Pr(x_i \mid h_i)$.
Step 1 (forward):
First coin is loaded: $\Pr(x_1 = \text{head}, h_1 = \text{loaded}) = \Pr(\text{loaded}_1) \Pr(\text{head} \mid \text{loaded}_1) = 0.5 \cdot 0.75 = 0.375$.
First coin is fair: $\Pr(x_1 = \text{head}, h_1 = \text{fair}) = \Pr(\text{fair}_1) \Pr(\text{head} \mid \text{fair}_1) = 0.5 \cdot 0.5 = 0.25$.

54 Coin-Tossing Example – backward. Numeric example: 3 tosses. Outcomes: head, head, tail.
$b(h_i) = \Pr(x_{i+1},\dots,x_L \mid h_i) = \sum_{h_{i+1}} \Pr(h_{i+1} \mid h_i) \Pr(x_{i+1} \mid h_{i+1})\, b(h_{i+1})$.
Step 1:
$\Pr(x_3 = \text{tail} \mid h_2 = \text{loaded}) = \Pr(h_3 = \text{loaded} \mid h_2 = \text{loaded}) \Pr(x_3 = \text{tail} \mid h_3 = \text{loaded}) + \Pr(h_3 = \text{fair} \mid h_2 = \text{loaded}) \Pr(x_3 = \text{tail} \mid h_3 = \text{fair}) = 0.9 \cdot 0.25 + 0.1 \cdot 0.5 = 0.275$.
$\Pr(x_3 = \text{tail} \mid h_2 = \text{fair}) = \Pr(h_3 = \text{loaded} \mid h_2 = \text{fair}) \Pr(x_3 = \text{tail} \mid h_3 = \text{loaded}) + \Pr(h_3 = \text{fair} \mid h_2 = \text{fair}) \Pr(x_3 = \text{tail} \mid h_3 = \text{fair}) = 0.1 \cdot 0.25 + 0.9 \cdot 0.5 = 0.475$.