Hidden Markov Models — presentation transcript

1 Hidden Markov Models
[Trellis figure: K states per position emitting x_1, x_2, x_3, …]

2 Definition of a hidden Markov model
Definition: A hidden Markov model (HMM) consists of:
- An alphabet Σ = { b_1, b_2, …, b_M }
- A set of states Q = { 1, ..., K }
- Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + … + a_iK = 1 for all states i = 1…K
- Start probabilities a_0i, with a_01 + … + a_0K = 1
- Emission probabilities within each state: e_k(b) = P(x_i = b | π_i = k), with e_k(b_1) + … + e_k(b_M) = 1 for all states k = 1…K

3 A HMM is memory-less
At each time step t, the only thing that affects future states is the current state π_t:
P(π_{t+1} = k | "whatever happened so far") = P(π_{t+1} = k | π_1, π_2, …, π_t, x_1, x_2, …, x_t) = P(π_{t+1} = k | π_t)

4 The three main questions on HMMs
1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[ x | M ]
2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[ x, π | M ]
3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (e_i(.), a_ij) that maximize P[ x | θ ]

5 Let's not be confused by notation
P[ x | M ]: the probability that sequence x was generated by the model. The model is: architecture (#states, etc.) + parameters θ = (a_ij, e_i(.)).
So P[ x | M ] is the same as P[ x | θ ] and P[ x ], when the architecture and the parameters, respectively, are implied.
Similarly, P[ x, π | M ], P[ x, π | θ ] and P[ x, π ] are the same when the architecture and the parameters are implied.
In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ* that maximizes P[ x | θ ].

6 Example: The Dishonest Casino
A casino has two dice:
- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The casino player switches back-and-forth between the fair and the loaded die about once every 20 turns.
Game: 1. You bet $1  2. You roll (always with a fair die)  3. Casino player rolls (maybe with the fair die, maybe with the loaded die)  4. Highest number wins $2

7 Question # 1 – Evaluation GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs

8 Question # 2 – Decoding GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

9 Question # 3 – Learning GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163616515615115146123562344 QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs

10 The dishonest casino model
Two states, FAIR and LOADED; stay in the same state with probability 0.95 and switch with probability 0.05.
Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2

11 A parse of a sequence
Given a sequence x = x_1 …… x_N, a parse of x is a sequence of states π = π_1, ……, π_N
[Trellis figure: K states per position emitting x_1, x_2, x_3, …, x_N]

12 Likelihood of a parse
Given a sequence x = x_1 …… x_N and a parse π = π_1, ……, π_N, how likely is the parse (given our HMM)?
P(x, π) = P(x_1, …, x_N, π_1, ……, π_N)
= P(x_N, π_N | x_1 … x_{N-1}, π_1, ……, π_{N-1}) P(x_1 … x_{N-1}, π_1, ……, π_{N-1})
= P(x_N, π_N | π_{N-1}) P(x_1 … x_{N-1}, π_1, ……, π_{N-1})
= …
= P(x_N, π_N | π_{N-1}) P(x_{N-1}, π_{N-1} | π_{N-2}) …… P(x_2, π_2 | π_1) P(x_1, π_1)
= P(x_N | π_N) P(π_N | π_{N-1}) …… P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1)
= a_0π_1 a_π_1π_2 …… a_π_{N-1}π_N e_π_1(x_1) …… e_π_N(x_N)
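To make the product form concrete, here is a minimal sketch in Python that evaluates P(x, π) for the dishonest-casino model of slide 10; the dictionary layout and names (a0, a, e, parse_likelihood) are illustrative choices, not part of the slides.

```python
# Sketch: P(x, pi) = a_{0,pi_1} e_{pi_1}(x_1) * prod_{i>1} a_{pi_{i-1},pi_i} e_{pi_i}(x_i)
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.05, 'L': 0.95}}
e  = {'F': {r: 1/6 for r in '123456'},
      'L': {**{r: 0.1 for r in '12345'}, '6': 0.5}}

def parse_likelihood(x, pi):
    p = a0[pi[0]] * e[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= a[pi[i-1]][pi[i]] * e[pi[i]][x[i]]
    return p

x = "1215621624"                        # the rolls 1,2,1,5,6,2,1,6,2,4 of slide 13
print(parse_likelihood(x, "F" * 10))    # ~5.2e-09  (all Fair)
print(parse_likelihood(x, "L" * 10))    # ~7.9e-10  (all Loaded)
```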

13 Example: the dishonest casino
Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
Then, what is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs a_0Fair = ½, a_0Loaded = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 = 5.21 × 10^-9

14 Example: the dishonest casino
So, the likelihood that the die is fair throughout this run is 5.21 × 10^-9.
OK, but what is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded) = ½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 = 0.79 × 10^-9
Therefore, it is somewhat more likely that the die is fair all the way than that it is loaded all the way.

15 Example: the dishonest casino
Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
Now, what is the likelihood of π = F, F, …, F?
½ × (1/6)^10 × (0.95)^9 = 5.21 × 10^-9, same as before
What is the likelihood of π = L, L, …, L?
½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 = 0.49 × 10^-6
So, it is about 100 times more likely that the die is loaded.

16 Problem 1: Decoding Find the best parse of a sequence

17 Decoding
GIVEN x = x_1 x_2 …… x_N, we want to find π = π_1, ……, π_N such that P[ x, π ] is maximized:
π* = argmax_π P[ x, π ]
We can use dynamic programming!
Let V_k(i) = max_{π_1,…,π_{i-1}} P[ x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k ]
= probability of the most likely sequence of states ending at π_i = k

18 Decoding – main idea
Given that, for all states k and for a fixed position i,
V_k(i) = max_{π_1,…,π_{i-1}} P[ x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i = k ],
what is V_l(i+1)? From the definition,
V_l(i+1) = max_{π_1,…,π_i} P[ x_1 … x_i, π_1, …, π_i, x_{i+1}, π_{i+1} = l ]
= max_{π_1,…,π_i} P(x_{i+1}, π_{i+1} = l | x_1 … x_i, π_1,…, π_i) P[ x_1 … x_i, π_1,…, π_i ]
= max_{π_1,…,π_i} P(x_{i+1}, π_{i+1} = l | π_i) P[ x_1 … x_{i-1}, π_1, …, π_{i-1}, x_i, π_i ]
= max_k [ P(x_{i+1}, π_{i+1} = l | π_i = k) max_{π_1,…,π_{i-1}} P[ x_1 … x_{i-1}, π_1,…, π_{i-1}, x_i, π_i = k ] ]
= e_l(x_{i+1}) max_k a_kl V_k(i)

19 The Viterbi Algorithm
Input: x = x_1 …… x_N
Initialization: V_0(0) = 1 (0 is the imaginary first position); V_k(0) = 0, for all k > 0
Iteration: V_j(i) = e_j(x_i) × max_k a_kj V_k(i-1); Ptr_j(i) = argmax_k a_kj V_k(i-1)
Termination: P(x, π*) = max_k V_k(N)
Traceback: π_N* = argmax_k V_k(N); π_{i-1}* = Ptr_{π_i*}(i)
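A sketch of this recursion in Python, working in plain probability space (slide 21 explains why logs are preferable for long sequences); the dictionary-based parameter layout matches the earlier parse-likelihood sketch and is an assumption.

```python
def viterbi(x, states, a0, a, e):
    """x: sequence of symbols; a0[k]: start prob; a[k][l]: transition; e[k][b]: emission."""
    N = len(x)
    V   = {k: [0.0] * N for k in states}
    ptr = {k: [None] * N for k in states}
    for k in states:                                    # initialization (position 1)
        V[k][0] = a0[k] * e[k][x[0]]
    for i in range(1, N):                               # iteration
        for l in states:
            k_best = max(states, key=lambda k: V[k][i - 1] * a[k][l])
            ptr[l][i] = k_best
            V[l][i] = e[l][x[i]] * V[k_best][i - 1] * a[k_best][l]
    last = max(states, key=lambda k: V[k][N - 1])       # termination
    path = [last]
    for i in range(N - 1, 0, -1):                       # traceback
        path.append(ptr[path[-1]][i])
    return V[last][N - 1], path[::-1]
```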

20 The Viterbi Algorithm
Similar to "aligning" a set of states to a sequence
Time: O(K^2 N)  Space: O(KN)
[DP matrix figure: states 1…K (rows) against positions x_1 x_2 x_3 ……… x_N (columns); each cell holds V_j(i)]

21 Viterbi Algorithm – a practical detail
Underflows are a significant problem:
P[ x_1, …, x_i, π_1, …, π_i ] = a_0π_1 a_π_1π_2 …… a_π_{i-1}π_i e_π_1(x_1) …… e_π_i(x_i)
These numbers become extremely small – underflow
Solution: take the logs of all values: V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ]

22 Example
Let x be a sequence with a portion of ~1/6 6's, followed by a portion of ~½ 6's:
x = 123456123456…123456 1626364656…1626364656
Then it is not hard to show that the optimal parse is (exercise): FFF…………………...F LLL………………………...L
6-character blocks: "123456" parsed as F contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5; parsed as L contributes 0.95^6 × (1/2)^1 × (1/10)^5 = 0.4 × 10^-5
"162636" parsed as F contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5; parsed as L contributes 0.95^6 × (1/2)^3 × (1/10)^3 = 9.0 × 10^-5

23 Problem 2: Evaluation Find the likelihood a sequence is generated by the model

24 Generating a sequence by the model
Given an HMM, we can generate a sequence of length n as follows:
1. Start at state π_1 according to probability a_0π_1
2. Emit letter x_1 according to probability e_π_1(x_1)
3. Go to state π_2 according to probability a_π_1π_2
4. … until emitting x_n
[Trellis figure: start state 0, transition a_02 into state 2, emission e_2(x_1), and so on]
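A possible implementation of this sampling procedure, under the same dictionary-based parameter layout as the earlier sketches (an assumption); the seed is arbitrary.

```python
import random

def generate(n, states, a0, a, e, seed=0):
    rng = random.Random(seed)
    states = list(states)
    seq, path = [], []
    state = rng.choices(states, weights=[a0[k] for k in states])[0]   # step 1: pick pi_1
    for _ in range(n):
        symbols = list(e[state])
        seq.append(rng.choices(symbols, weights=[e[state][b] for b in symbols])[0])  # emit x_i
        path.append(state)
        state = rng.choices(states, weights=[a[state][l] for l in states])[0]        # move to next state
    return seq, path
```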

25 A couple of questions
Given a sequence x: what is the probability that x was generated by the model? Given a position i, what is the most likely state that emitted x_i?
Example: the dishonest casino. Say x = 12341623162616364616234161221341
Most likely path: π = FF……F
However: the marked letters are more likely to be L than the unmarked letters
P(box: FFFFFFFFFFF) = (1/6)^11 × 0.95^12 = 2.76 × 10^-9 × 0.54 = 1.49 × 10^-9
P(box: LLLLLLLLLLL) = [ (1/2)^6 × (1/10)^5 ] × 0.95^12 × 0.05^2 = 1.56 × 10^-7 × 1.5 × 10^-3 = 0.23 × 10^-9

26 Evaluation
We will develop algorithms that allow us to compute:
P(x): probability of x given the model
P(x_i … x_j): probability of a substring of x given the model
P(π_i = k | x): probability that the i-th state is k, given x (a more refined measure of which states x may be in)

27 The Forward Algorithm
We want to calculate P(x) = probability of x, given the HMM.
Sum over all possible ways of generating x: P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π)
To avoid summing over an exponential number of paths π, define f_k(i) = P(x_1 … x_i, π_i = k) (the forward probability)

28 The Forward Algorithm – derivation
Define the forward probability:
f_k(i) = P(x_1 … x_i, π_i = k)
= Σ_{π_1 … π_{i-1}} P(x_1 … x_{i-1}, π_1, …, π_{i-1}, π_i = k) e_k(x_i)
= Σ_l Σ_{π_1 … π_{i-2}} P(x_1 … x_{i-1}, π_1, …, π_{i-2}, π_{i-1} = l) a_lk e_k(x_i)
= e_k(x_i) Σ_l f_l(i-1) a_lk

29 The Forward Algorithm
We can compute f_k(i) for all k, i, using dynamic programming!
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_k(i) = e_k(x_i) Σ_l f_l(i-1) a_lk
Termination: P(x) = Σ_k f_k(N) a_k0, where a_k0 is the probability that the terminating state is k (usually = a_0k)
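A sketch of the forward recursion; the only liberty taken is that no explicit end state is modeled, so the termination sums f_k(N) directly instead of weighting by a_k0 (an assumption, not something the slide fixes).

```python
def forward(x, states, a0, a, e):
    N = len(x)
    f = {k: [0.0] * N for k in states}
    for k in states:                                   # initialization
        f[k][0] = a0[k] * e[k][x[0]]
    for i in range(1, N):                              # iteration: f_k(i) = e_k(x_i) sum_l f_l(i-1) a_lk
        for k in states:
            f[k][i] = e[k][x[i]] * sum(f[l][i - 1] * a[l][k] for l in states)
    return f, sum(f[k][N - 1] for k in states)         # (forward table, P(x))
```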

30 Relation between Forward and Viterbi
VITERBI: Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0. Iteration: V_j(i) = e_j(x_i) max_k V_k(i-1) a_kj. Termination: P(x, π*) = max_k V_k(N)
FORWARD: Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0. Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl. Termination: P(x) = Σ_k f_k(N) a_k0

31 Motivation for the Backward Algorithm
We want to compute P(π_i = k | x), the probability distribution of the i-th state, given x.
We start by computing
P(π_i = k, x) = P(x_1 … x_i, π_i = k, x_{i+1} … x_N)
= P(x_1 … x_i, π_i = k) P(x_{i+1} … x_N | x_1 … x_i, π_i = k)
= P(x_1 … x_i, π_i = k) P(x_{i+1} … x_N | π_i = k)
= [Forward, f_k(i)] × [Backward, b_k(i)]
Then, P(π_i = k | x) = P(π_i = k, x) / P(x)

32 The Backward Algorithm – derivation
Define the backward probability:
b_k(i) = P(x_{i+1} … x_N | π_i = k)
= Σ_{π_{i+1} … π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1}, …, π_N | π_i = k)
= Σ_l Σ_{π_{i+1} … π_N} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1} = l, π_{i+2}, …, π_N | π_i = k)
= Σ_l e_l(x_{i+1}) a_kl Σ_{π_{i+2} … π_N} P(x_{i+2}, …, x_N, π_{i+2}, …, π_N | π_{i+1} = l)
= Σ_l e_l(x_{i+1}) a_kl b_l(i+1)

33 The Backward Algorithm
We can compute b_k(i) for all k, i, using dynamic programming
Initialization: b_k(N) = a_k0, for all k
Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)
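A matching sketch of the backward recursion, under the same no-end-state assumption as the forward sketch (so b_k(N) = 1).

```python
def backward(x, states, a0, a, e):
    N = len(x)
    b = {k: [0.0] * N for k in states}
    for k in states:
        b[k][N - 1] = 1.0                              # initialization (a_k0 taken as 1 here)
    for i in range(N - 2, -1, -1):                     # iteration: b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)
        for k in states:
            b[k][i] = sum(e[l][x[i + 1]] * a[k][l] * b[l][i + 1] for l in states)
    return b, sum(a0[l] * e[l][x[0]] * b[l][0] for l in states)   # (backward table, P(x))
```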

34 Computational Complexity
What are the running time and space required for Forward and Backward?
Time: O(K^2 N)  Space: O(KN)
Useful implementation techniques to avoid underflows: Viterbi: sum of logs; Forward/Backward: rescaling at each position by multiplying by a constant

35 Posterior Decoding
We can now calculate P(π_i = k | x) = f_k(i) b_k(i) / P(x)
Then, we can ask: what is the most likely state at position i of sequence x?
Define π^ by Posterior Decoding: π^_i = argmax_k P(π_i = k | x)
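A sketch of posterior decoding that combines the two tables; it reuses the forward() and backward() sketches above, which is an assumption about how these pieces are wired together.

```python
def posterior_decode(x, states, a0, a, e):
    f, px = forward(x, states, a0, a, e)
    b, _ = backward(x, states, a0, a, e)
    posterior = [{k: f[k][i] * b[k][i] / px for k in states} for i in range(len(x))]
    path = [max(p, key=p.get) for p in posterior]      # note: may be an invalid state sequence
    return posterior, path
```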

36 Posterior Decoding
For each state, Posterior Decoding gives us a curve of the likelihood of that state at each position.
That is sometimes more informative than the Viterbi path π*.
Posterior Decoding may give an invalid sequence of states. Why?

37 Posterior Decoding
P(π_i = k | x) = Σ_π P(π | x) 1(π_i = k) = Σ_{π : π_i = k} P(π | x)
[Figure: posterior curve P(π_i = l | x) for state l plotted along positions x_1 x_2 x_3 …… x_N]

38 Posterior Decoding
Example: how do we compute P(π_i = l, π_{j≠i} = l' | x)?
P(π_i = l, π_j = l' | x) = f_l(i) b_{l'}(j) / P(x)
[Figure: posterior curves P(π_i = l | x) and P(π_j = l' | x) along positions x_1 x_2 x_3 …… x_N]

39 Viterbi, Forward, Backward
VITERBI: Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0. Iteration: V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl. Termination: P(x, π*) = max_k V_k(N)
FORWARD: Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0. Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl. Termination: P(x) = Σ_k f_k(N) a_k0
BACKWARD: Initialization: b_k(N) = a_k0, for all k. Iteration: b_l(i) = Σ_k e_k(x_{i+1}) a_lk b_k(i+1). Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1)

40 Modeling CpG Islands by HMMs
In human genomes the pair CG often transforms to (methyl-C)G, which in turn often transforms to TG. Hence the pair CG appears less frequently than expected from the independent frequencies of C and G alone.
For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as in the start regions of many genes. These areas are called CpG islands (p denotes "pair").

41 CpG Islands In CpG islands,  CG is more frequent  Other pairs (AA, AG, AT…) have different frequencies Question: Detect CpG islands computationally

42 Problems: CpG Island
We consider two questions (and some variants):
Question 1: Given a short stretch of genomic data, does it come from a CpG island?
Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?
We "solve" the first question by modeling strings with and without CpG islands as Markov chains over the same states {A,C,G,T} but with different transition probabilities:

43 Question 1: Using two Markov chains
We need to specify p+(x_i | x_{i-1}), where + stands for CpG island. From Durbin et al we have the matrix A+ (for CpG islands), rows = x_{i-1}, columns = x_i:
        A       C         G        T
A    0.18    0.27      0.43     0.12
C    0.17    p+(C|C)   0.274    p+(T|C)
G    0.16    p+(C|G)   p+(G|G)  p+(T|G)
T    0.08    p+(C|T)   p+(G|T)  p+(T|T)
(Recall: rows must add up to one; columns need not.)

44 Question 1: Using two Markov chains
…and for p-(x_i | x_{i-1}) (where "-" stands for non-CpG island) we have the matrix A- (for non-CpG islands):
        A       C         G        T
A    0.3     0.2       0.29     0.21
C    0.32    p-(C|C)   0.078    p-(T|C)
G    0.25    p-(C|G)   p-(G|G)  p-(T|G)
T    0.18    p-(C|T)   p-(G|T)  p-(T|T)

45 Discriminating between the two models
Given a string x = (x_1 … x_L), compute the ratio
RATIO = P(x | + model) / P(x | - model) = Π_{i=1..L} p+(x_i | x_{i-1}) / Π_{i=1..L} p-(x_i | x_{i-1})
If RATIO > 1, a CpG island is more likely. In practice, the log of this ratio is computed.
Note: p+(x_1 | x_0) is defined for convenience as p+(x_1); p-(x_1 | x_0) is defined for convenience as p-(x_1).

46 Log Odds-Ratio test
Taking the logarithm yields
log Q = log [ P(x | + model) / P(x | - model) ] = Σ_{i=1..L} log [ p+(x_i | x_{i-1}) / p-(x_i | x_{i-1}) ]
If log Q > 0, then + is more likely (CpG island). If log Q < 0, then - is more likely (non-CpG island).

47 Where do the parameters (transition probabilities) come from?
Learning from complete data, namely, when the label is given and every x_i is measured:
Source: a collection of sequences from CpG islands, and a collection of sequences from non-CpG islands.
Input: tuples of the form (x_1, …, x_L, h), where h is + or -
Output: maximum likelihood parameters (MLE)
Count all pairs (X_i = a, X_{i-1} = b) with label +, and with label -; say the numbers are N_ba,+ and N_ba,-.

48 Maximum Likelihood Estimate (MLE) of the parameters (using labeled data)
The needed parameters are: p+(x_1), p+(x_i | x_{i-1}), p-(x_1), p-(x_i | x_{i-1})
The ML estimates are given by:
p+(a) = N_a,+ / Σ_a' N_a',+ , where N_a,+ is the number of times letter a appears in CpG islands in the dataset
p+(a | b) = N_ba,+ / Σ_a' N_ba',+ , where N_ba,+ is the number of times letter a appears after letter b in CpG islands in the dataset
(and similarly for the "-" parameters)
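A sketch of this counting estimate for one of the two chains; the toy training sequences are invented for illustration only, and pseudocounts are omitted.

```python
from collections import Counter

def mle_transitions(sequences):
    """sequences: strings over {A,C,G,T}, all carrying the same label (+ or -)."""
    pair_counts = Counter()
    for s in sequences:
        for b, a in zip(s, s[1:]):           # b = previous letter, a = current letter
            pair_counts[(b, a)] += 1
    p = {}
    for b in "ACGT":
        row_total = sum(pair_counts[(b, a)] for a in "ACGT")
        for a in "ACGT":
            p[(b, a)] = pair_counts[(b, a)] / row_total if row_total else 0.0
    return p

plus_model  = mle_transitions(["CGCGGCGCGC", "GCGCGCCGGC"])   # toy '+' training set
minus_model = mle_transitions(["ATTAATCGAT", "TTATAAATCT"])   # toy '-' training set
```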

49 A model of CpG Islands – Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits its own letter)
1. Transition probabilities within CpG islands: established from many known (experimentally verified) CpG islands (training set)
2. Transition probabilities within other regions: established from many known non-CpG islands
+      A      C      G      T
A   .180   .274   .426   .120
C   .171   .368   .274   .188
G   .161   .339   .375   .125
T   .079   .355   .384   .182

-      A      C      G      T
A   .300   .205   .285   .210
C   .322   .298   .078   .302
G   .248   .246   .298   .208
T   .177   .239   .292   .292

50 Log Likelihoods – Telling "Prediction" from "Random"
Another way to see the effect of transitions: log likelihoods L(u, v) = log[ P(uv | +) / P(uv | -) ]
L(u,v)    A        C        G        T
A      -0.740   +0.419   +0.580   -0.803
C      -0.913   +0.302   +1.812   -0.685
G      -0.624   +0.461   +0.331   -0.730
T      -1.169   +0.573   +0.393   -0.679
Given a region x = x_1 … x_N, a quick-&-dirty way to decide whether the entire x is CpG:
P(x is CpG) > P(x is not CpG)  iff  Σ_i L(x_i, x_{i+1}) > 0
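A sketch of the quick-&-dirty test using the table above (the entries are consistent with log base 2 of p+(v|u)/p-(v|u), though the slide does not state the base); the example strings are made up.

```python
L = {'A': {'A': -0.740, 'C': +0.419, 'G': +0.580, 'T': -0.803},
     'C': {'A': -0.913, 'C': +0.302, 'G': +1.812, 'T': -0.685},
     'G': {'A': -0.624, 'C': +0.461, 'G': +0.331, 'T': -0.730},
     'T': {'A': -1.169, 'C': +0.573, 'G': +0.393, 'T': -0.679}}

def cpg_score(x):
    # sum of L(x_i, x_{i+1}) over consecutive pairs; > 0 suggests a CpG island
    return sum(L[u][v] for u, v in zip(x, x[1:]))

print(cpg_score("CGCGCGCG"))   # strongly positive: looks like a CpG island
print(cpg_score("ATATTTAA"))   # negative: looks like background
```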

51 A model of CpG Islands – (1) Architecture
Eight states: A+, C+, G+, T+ (CpG island) and A-, C-, G-, T- (not CpG island), with transitions within and between the two groups.

52 A model of CpG Islands – (2) Transitions
What about transitions between (+) and (-) states? They affect:
- the average length of a CpG island
- the average separation between two CpG islands
[Figure: two states X and Y; X stays with probability p and moves to Y with probability 1-p; Y stays with probability q and moves to X with probability 1-q]
Length distribution of region X: P[l_X = 1] = 1-p; P[l_X = 2] = p(1-p); …; P[l_X = k] = p^(k-1) (1-p)
E[l_X] = 1/(1-p): a geometric distribution, with mean 1/(1-p)

53 A model of CpG Islands – (2) Transitions
There is no reason to favor exiting/entering the (+) and (-) regions at a particular nucleotide.
To determine transition probabilities between (+) and (-) states:
1. Estimate the average length of a CpG island: l_CpG = 1/(1-p), so p = 1 - 1/l_CpG
2. For each pair of (+) states k, l, let a_kl become p × a_kl
3. For each (+) state k and (-) state l, let a_kl = (1-p)/4 (better: take the frequency of l in the (-) regions into account)
4. Do the same for the (-) states (see the sketch below)
A problem with this model: CpG islands don't have exponential length distribution. This is a defect of HMMs, compensated by ease of analysis and computation.
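A sketch of steps 1-4 above, using the within-region tables of slide 49; the assumed average lengths (and hence p and q) are placeholders, not values given in the slides.

```python
import numpy as np

plus  = np.array([[.180, .274, .426, .120],
                  [.171, .368, .274, .188],
                  [.161, .339, .375, .125],
                  [.079, .355, .384, .182]])   # rows/cols: A+, C+, G+, T+
minus = np.array([[.300, .205, .285, .210],
                  [.322, .298, .078, .302],
                  [.248, .246, .298, .208],
                  [.177, .239, .292, .292]])   # rows/cols: A-, C-, G-, T-

p = 1 - 1/1000    # assumed average CpG island length ~1000 (placeholder)
q = 1 - 1/10000   # assumed average non-island length ~10000 (placeholder)

A = np.zeros((8, 8))              # state order: A+, C+, G+, T+, A-, C-, G-, T-
A[:4, :4] = p * plus              # stay inside the island
A[:4, 4:] = (1 - p) / 4           # leave the island (uniform over the 4 (-) states)
A[4:, 4:] = q * minus             # stay outside
A[4:, :4] = (1 - q) / 4           # enter an island
assert np.allclose(A.sum(axis=1), 1.0, atol=2e-3)   # tables are rounded to 3 decimals
```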

54 Applications of the model
Given a DNA region x, the Viterbi algorithm predicts locations of CpG islands.
Given a nucleotide x_i (say x_i = A), the Viterbi parse tells whether x_i is in a CpG island in the most likely general scenario.
The Forward/Backward algorithms can calculate P(x_i is in a CpG island) = P(π_i = A+ | x).
Posterior Decoding can assign locally optimal predictions of CpG islands: π^_i = argmax_k P(π_i = k | x)

55 What if a new genome comes? We just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG islands for porcupines We suspect the frequency and characteristics of CpG islands are quite different in porcupines How do we adjust the parameters in our model? LEARNING

56 Problem 3: Learning Re-estimate the parameters of the model based on training data

57 Two learning scenarios
1. Estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x_1 … x_1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls
2. Estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: update the parameters θ of the model to maximize P(x | θ)

58 1. When the right answer is known
Given x = x_1 … x_N for which the true π = π_1 … π_N is known, define:
A_kl = # times the k → l transition occurs in π
E_k(b) = # times state k in π emits b in x
We can show that the maximum likelihood parameters θ (maximizing P(x | θ)) are:
a_kl = A_kl / Σ_i A_ki    e_k(b) = E_k(b) / Σ_c E_k(c)
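A sketch of this counting estimate; the pseudocount argument anticipates the next two slides, and the example reproduces the 10-roll case discussed on slide 59.

```python
from collections import Counter

def mle_from_labeled(x, pi, states, alphabet, pseudo=0.0):
    A, E = Counter(), Counter()
    for k, l in zip(pi, pi[1:]):          # transition counts A_kl
        A[(k, l)] += 1
    for k, b in zip(pi, x):               # emission counts E_k(b)
        E[(k, b)] += 1
    a, e = {}, {}
    for k in states:
        denA = sum(A[(k, l)] + pseudo for l in states)
        denE = sum(E[(k, b)] + pseudo for b in alphabet)
        for l in states:
            a[(k, l)] = (A[(k, l)] + pseudo) / denA if denA else 0.0
        for b in alphabet:
            e[(k, b)] = (E[(k, b)] + pseudo) / denE if denE else 0.0
    return a, e

# 10 rolls, all labeled Fair: state L gets no counts at all, so without
# pseudocounts its parameters degenerate to 0 (the overfitting issue).
a, e = mle_from_labeled("2156123623", "F" * 10, "FL", "123456")
print(a[("F", "F")], e[("F", "6")])   # 1.0  0.2
```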

59 1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable, with 0 probabilities – VERY BAD.
Example: given 10 casino rolls, we observe x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3 and π = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1; a_FL = 0; e_F(1) = e_F(3) = e_F(6) = .2; e_F(2) = .3; e_F(4) = 0; e_F(5) = .1

60 Pseudocounts
Solution for small training sets: add pseudocounts
A_kl = # times the k → l transition occurs in π, + r_kl
E_k(b) = # times state k in π emits b in x, + r_k(b)
r_kl, r_k(b) are pseudocounts representing our prior belief
Larger pseudocounts: strong prior belief. Small pseudocounts (ε < 1): just to avoid 0 probabilities.

61 Pseudocounts
Example: dishonest casino. We will observe the player for one day, 600 rolls.
Reasonable pseudocounts:
r_0F = r_0L = r_F0 = r_L0 = 1; r_FL = r_LF = r_FF = r_LL = 1
r_F(1) = r_F(2) = … = r_F(6) = 20 (strong belief the fair die is fair)
r_L(1) = r_L(2) = … = r_L(6) = 5 (wait and see for the loaded die)
The above numbers are pretty arbitrary – assigning priors is an art.

62 2. When the right answer is unknown
We don't know the true A_kl, E_k(b)
Idea: we estimate our "best guess" of what A_kl, E_k(b) are; we update the parameters of the model based on our guess; we repeat

63 2. When the right answer is unknown
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x_1, ..., x_n | θ') > p(x_1, ..., x_n | θ)
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization (EM) algorithm, which we will meet later. For the specific case of HMMs, it is Baum-Welch training.

64 2. When the right answer is unknown
We don't know the true A_kl, E_k(b).
Starting with our best guess of a model M with parameters θ, and given x = x_1 … x_N for which the true π = π_1 … π_N is unknown, we can get to a provably more likely parameter set θ = (a_kl, e_k(b)).
Principle: EXPECTATION MAXIMIZATION
1. E-STEP: estimate A_kl, E_k(b) in the training data
2. M-STEP: update θ = (a_kl, e_k(b)) according to A_kl, E_k(b)
3. Repeat 1 & 2 until convergence

65 Baum-Welch training
We start with some values of a_kl and e_k(b), which define prior values of θ. Baum-Welch training is an iterative algorithm which attempts to replace θ by a θ* s.t. p(x | θ*) > p(x | θ). Each iteration consists of a few steps.
[Figure: hidden chain s_1, s_2, …, s_L emitting x_1, x_2, …, x_L]

66 Baum-Welch training
In case 1 we computed the optimal values of a_kl and e_k(b) (for the optimal θ) by simply counting the number A_kl of transitions from state k to state l, and the number E_k(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
[Figure: adjacent states S_{i-1} = k and S_i = l, emitting x_{i-1} = b and x_i = c]

67 Baum-Welch: step 1a – count expected number of state transitions
For each i and for each k, l, compute the posterior state-transition probabilities P(s_{i-1} = k, s_i = l | x, θ).
For this, we use the forward and backward algorithms.

68 Reminder: finding posterior state probabilities
f_k(i) = p(x_1, …, x_i, s_i = k), the probability that a path which emits (x_1, .., x_i) has state s_i = k.
b_k(i) = p(x_{i+1}, …, x_L | s_i = k), the probability that a path emits (x_{i+1}, .., x_L), given that state s_i = k.
p(s_i = k, x) = f_k(i) b_k(i) (since, given s_i = k, the prefix and the suffix are independent)
{f_k(i), b_k(i)} for every i, k are computed by one run each of the forward and backward algorithms.

69 Baum-Welch: Step 1a (cont)
Claim: P(s_{i-1} = k, s_i = l | x, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / P(x | θ)
(a_kl and e_l(x_i) are the parameters defined by θ, and f_k(i-1), b_l(i) are the forward and backward functions)

70 Estimating new parameters
To estimate A_kl (assume "| θ" in all formulas below):
At each position i of sequence x, find the probability that the transition k → l is used:
P(π_i = k, π_{i+1} = l | x) = [1/P(x)] × P(π_i = k, π_{i+1} = l, x_1 … x_N) = Q/P(x), where
Q = P(x_1 … x_i, π_i = k, π_{i+1} = l, x_{i+1} … x_N)
= P(π_{i+1} = l, x_{i+1} … x_N | π_i = k) P(x_1 … x_i, π_i = k)
= P(π_{i+1} = l, x_{i+1} x_{i+2} … x_N | π_i = k) f_k(i)
= P(x_{i+2} … x_N | π_{i+1} = l) P(x_{i+1} | π_{i+1} = l) P(π_{i+1} = l | π_i = k) f_k(i)
= b_l(i+1) e_l(x_{i+1}) a_kl f_k(i)
So: P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)

71 Step 1a: Computing P(s_{i-1} = k, s_i = l | x, θ)
P(x_1, …, x_L, s_{i-1} = k, s_i = l | θ) = P(x_1, …, x_{i-1}, s_{i-1} = k | θ) a_kl e_l(x_i) P(x_{i+1}, …, x_L | s_i = l, θ)
= f_k(i-1) a_kl e_l(x_i) b_l(i)
(the first factor via the forward algorithm, the last via the backward algorithm)
Hence p(s_{i-1} = k, s_i = l | x, θ) = f_k(i-1) a_kl e_l(x_i) b_l(i) / P(x | θ)

72 Step 1a (end)
For each pair (k, l), compute the expected number of state transitions from k to l, as the sum of the expected number of k → l transitions over all L edges:
A_kl = Σ_i P(s_{i-1} = k, s_i = l | x, θ)

73 Baum-Welch: Step 1b
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that s_i = k, over all i's for which x_i = b:
E_k(b) = Σ_{i : x_i = b} P(s_i = k | x, θ)

74 Estimating new parameters
So, A_kl = Σ_i P(π_i = k, π_{i+1} = l | x, θ) = Σ_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)
Similarly, E_k(b) = [1/P(x | θ)] Σ_{i : x_i = b} f_k(i) b_k(i)
[Figure: transition from state k at position i to state l at position i+1, combining f_k(i), a_kl, e_l(x_{i+1}) and b_l(i+1), along x_1 ……… x_{i-1} x_i x_{i+1} x_{i+2} ……… x_N]
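A sketch of these two expectations, assuming the forward() and backward() sketches from earlier (0-based tables, no end state).

```python
def expected_counts(x, states, a0, a, e):
    f, px = forward(x, states, a0, a, e)
    b, _  = backward(x, states, a0, a, e)
    N = len(x)
    # A_kl = (1/P(x)) * sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)
    A = {(k, l): sum(f[k][i] * a[k][l] * e[l][x[i + 1]] * b[l][i + 1]
                     for i in range(N - 1)) / px
         for k in states for l in states}
    # E_k(b) = (1/P(x)) * sum_{i: x_i = b} f_k(i) b_k(i)
    E = {(k, sym): sum(f[k][i] * b[k][i] for i in range(N) if x[i] == sym) / px
         for k in states for sym in set(x)}
    return A, E
```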

75 Step 1a for many sequences
When we have n input sequences (x^1, ..., x^n), then A_kl is given by summing the per-sequence expected counts:
A_kl = Σ_j [1/P(x^j | θ)] Σ_i f_k^j(i-1) a_kl e_l(x^j_i) b_l^j(i)

76 Step 1b for many sequences
When we have n sequences (x^1, ..., x^n), the expected number of emissions of b from k is given by:
E_k(b) = Σ_j [1/P(x^j | θ)] Σ_{i : x^j_i = b} f_k^j(i) b_k^j(i)

77 Summary of Steps 1a and 1b: the E part of the Baum-Welch training
These steps compute the expected numbers A_kl of k → l transitions for all pairs of states k and l, and the expected numbers E_k(b) of emissions of symbol b from state k, for all states k and symbols b.
The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

78 Baum-Welch: step 2
Use the A_kl's and E_k(b)'s to compute the new values of a_kl and e_k(b). These values define θ*.
The correctness of the EM algorithm implies that p(x^1, ..., x^n | θ*) ≥ p(x^1, ..., x^n | θ), i.e., θ* increases (or at least does not decrease) the probability of the data.
This procedure is iterated until some convergence criterion is met.

79 The Baum-Welch Algorithm
Initialization: pick the best-guess model parameters (or arbitrary ones)
Iteration:
1. Forward
2. Backward
3. Calculate A_kl, E_k(b)
4. Calculate new model parameters a_kl, e_k(b)
5. Calculate the new log-likelihood P(x | θ) (GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION)
Until P(x | θ) does not change much
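One possible Baum-Welch loop built on the expected_counts() sketch above; the simplifications listed in the comments (single training sequence, no rescaling, tiny pseudocounts, fixed start probabilities) are assumptions made for brevity, not part of the slides.

```python
from math import log

def baum_welch(x, states, a0, a, e, tol=1e-6, max_iter=200):
    # Assumptions: one sequence, no rescaling (short x only), emission alphabet
    # taken from the observed symbols, a0 not re-estimated, 1e-12 pseudocounts
    # only to avoid divisions by zero.
    alphabet = set(x)
    prev_ll = float("-inf")
    for _ in range(max_iter):
        A, E = expected_counts(x, states, a0, a, e)                       # E-step
        a = {k: {l: (A[(k, l)] + 1e-12) / sum(A[(k, m)] + 1e-12 for m in states)
                 for l in states} for k in states}                        # M-step
        e = {k: {b: (E[(k, b)] + 1e-12) / sum(E[(k, c)] + 1e-12 for c in alphabet)
                 for b in alphabet} for k in states}
        ll = log(forward(x, states, a0, a, e)[1])                         # new log-likelihood
        if ll - prev_ll < tol:                                            # monotone increase (EM)
            break
        prev_ll = ll
    return a, e
```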

80 The Baum-Welch Algorithm
Time complexity: # iterations × O(K^2 N)
Guaranteed to increase the log likelihood P(x | θ)
Not guaranteed to find the globally best parameters: converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining

81 Viterbi Learning: maximizing the probability of the most probable path
The states are unknown. Viterbi training attempts to maximize the probability of a most probable path, i.e., the value of p(s(x^1), .., s(x^n), x^1, .., x^n | θ), where s(x^j) is the most probable (under θ) path for x^j.
We assume only one sequence (n = 1).

82 Viterbi training (cont)
Start from given values of a_kl and e_k(b), which define prior values of θ. Each iteration:
Step 1: Use Viterbi's algorithm to find a most probable path s(x), which maximizes p(s(x), x | θ).

83 Viterbi training (cont)
Step 2: Use the ML method for an HMM with known parameters to find θ* which maximizes p(s(x), x | θ*)
Note: in Step 1 the maximizing argument is the path s(x); in Step 2 it is the parameters θ*.

84 Viterbi training (cont)
Step 3: Set θ = θ*, and repeat. Stop when the paths are not changed.
Claim 2: If s(x) is the optimal path in step 1 of two different iterations, then in both iterations θ has the same values, and hence p(s(x), x | θ) will not increase in any later iteration. Hence the algorithm can terminate in this case.

85 Alternative: Viterbi Training
Initialization: same
Iteration:
1. Perform Viterbi to find π*
2. Calculate A_kl, E_k(b) according to π* + pseudocounts
3. Calculate the new parameters a_kl, e_k(b)
Until convergence
Notes: convergence is guaranteed – why? It does not maximize P(x | θ); it maximizes P(x, π* | θ). In general, worse performance than Baum-Welch.
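A sketch of this loop, reusing the viterbi() and mle_from_labeled() sketches from earlier slides (that reuse is an assumption); stopping when the decoded path stops changing follows the claim on slide 84.

```python
def viterbi_training(x, states, a0, a, e, alphabet, pseudo=1.0, max_iter=100):
    prev_path = None
    for _ in range(max_iter):
        _, path = viterbi(x, states, a0, a, e)            # step 1: decode with current theta
        if path == prev_path:                             # paths unchanged: terminate
            break
        prev_path = path
        # step 2: re-estimate as if the decoded path were the true labeling
        a_flat, e_flat = mle_from_labeled(x, path, states, alphabet, pseudo)
        a = {k: {l: a_flat[(k, l)] for l in states} for k in states}
        e = {k: {b: e_flat[(k, b)] for b in alphabet} for k in states}
    return a, e, prev_path
```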

86 Coin-Tossing Example using Baum-Welch
Two hidden states, Fair and Loaded, with stay probability 0.9 and switch probability 0.1; start probability 1/2 for each.
Emissions: Fair: head 1/2, tail 1/2; Loaded: head 3/4, tail 1/4.
[Figure: chain of L tosses, hidden states H_1 … H_L (Fair/Loaded) emitting X_1 … X_L (Head/Tail)]

87 Example: Homogeneous HMM, one sample
Start with some probability tables and iterate until convergence:
E-step: compute p_θ(h_i | h_{i-1}, x_1,…,x_L) from p_θ(h_i, h_{i-1} | x_1,…,x_L), which is computed using the forward-backward algorithm as explained earlier.
M-step: update the parameters simultaneously, e.g. the stay probability:
stay-prob ← Σ_i [ p_θ(h_i = 1 | h_{i-1} = 1, x_1,…,x_L) + p_θ(h_i = 0 | h_{i-1} = 0, x_1,…,x_L) ] / (L-1)

88 Coin-Tossing Example
Numeric example: 3 tosses. Outcomes: head, head, tail
[Figure: hidden chain H_1 … H_L emitting X_1 … X_L]

89 Coin-Tossing Example
Numeric example: 3 tosses. Outcomes: head, head, tail. Is the first coin loaded?
{Step 1 – forward} Recall: F(h_i) = P(x_1,…,x_i, h_i) = Σ_{h_{i-1}} P(x_1,…,x_{i-1}, h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i)
P(x_1 = head, h_1 = loaded) = P(loaded_1) P(head | loaded_1) = 0.5 × 0.75 = 0.375
P(x_1 = head, h_1 = fair) = P(fair_1) P(head | fair_1) = 0.5 × 0.5 = 0.25

90 Coin-Tossing Example - forward
Numeric example: 3 tosses. Outcomes: head, head, tail
P(x_1,…,x_i, h_i) = Σ_{h_{i-1}} P(x_1,…,x_{i-1}, h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i)
{Step 1} P(x_1 = head, h_1 = loaded) = P(loaded_1) P(head | loaded_1) = 0.5 × 0.75 = 0.375; P(x_1 = head, h_1 = fair) = P(fair_1) P(head | fair_1) = 0.5 × 0.5 = 0.25
{Step 2} P(x_1 = head, x_2 = head, h_2 = loaded) = Σ_{h_1} P(x_1, h_1) P(h_2 | h_1) P(x_2 | h_2)
= p(x_1 = head, loaded_1) P(loaded_2 | loaded_1) P(x_2 = head | loaded_2) + p(x_1 = head, fair_1) P(loaded_2 | fair_1) P(x_2 = head | loaded_2)
= 0.375 × 0.9 × 0.75 + 0.25 × 0.1 × 0.75 = 0.253125 + 0.01875 = 0.271875
P(x_1 = head, x_2 = head, h_2 = fair) = p(x_1 = head, loaded_1) P(fair_2 | loaded_1) P(x_2 = head | fair_2) + p(x_1 = head, fair_1) P(fair_2 | fair_1) P(x_2 = head | fair_2)
= 0.375 × 0.1 × 0.5 + 0.25 × 0.9 × 0.5 = 0.01875 + 0.1125 = 0.13125

91 Coin-Tossing Example - forward
Numeric example: 3 tosses. Outcomes: head, head, tail
P(x_1,…,x_i, h_i) = Σ_{h_{i-1}} P(x_1,…,x_{i-1}, h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i)
{Step 2 results} P(x_1 = head, x_2 = head, h_2 = loaded) = 0.271875; P(x_1 = head, x_2 = head, h_2 = fair) = 0.13125
{Step 3} P(x_1 = head, x_2 = head, x_3 = tail, h_3 = loaded) = Σ_{h_2} P(x_1, x_2, h_2) P(h_3 | h_2) P(x_3 | h_3)
= p(x_1 = head, x_2 = head, loaded_2) P(loaded_3 | loaded_2) P(x_3 = tail | loaded_3) + p(x_1 = head, x_2 = head, fair_2) P(loaded_3 | fair_2) P(x_3 = tail | loaded_3)
= 0.271875 × 0.9 × 0.25 + 0.13125 × 0.1 × 0.25 ≈ 0.06445
P(x_1 = head, x_2 = head, x_3 = tail, h_3 = fair) = p(x_1 = head, x_2 = head, loaded_2) P(fair_3 | loaded_2) P(x_3 = tail | fair_3) + p(x_1 = head, x_2 = head, fair_2) P(fair_3 | fair_2) P(x_3 = tail | fair_3)
= 0.271875 × 0.1 × 0.5 + 0.13125 × 0.9 × 0.5 ≈ 0.07266

92 Coin-Tossing Example - backward
Numeric example: 3 tosses. Outcomes: head, head, tail
b(h_i) = P(x_{i+1},…,x_L | h_i) = Σ_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b(h_{i+1})
{Step 1} P(x_3 = tail | h_2 = loaded) = P(h_3 = loaded | h_2 = loaded) P(x_3 = tail | h_3 = loaded) + P(h_3 = fair | h_2 = loaded) P(x_3 = tail | h_3 = fair) = 0.9 × 0.25 + 0.1 × 0.5 = 0.275
P(x_3 = tail | h_2 = fair) = P(h_3 = loaded | h_2 = fair) P(x_3 = tail | h_3 = loaded) + P(h_3 = fair | h_2 = fair) P(x_3 = tail | h_3 = fair) = 0.1 × 0.25 + 0.9 × 0.5 = 0.475

93 Coin-Tossing Example - backward
Numeric example: 3 tosses. Outcomes: head, head, tail
b(h_i) = P(x_{i+1},…,x_L | h_i) = Σ_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b(h_{i+1})
{Step 1} P(x_3 = tail | h_2 = loaded) = 0.275; P(x_3 = tail | h_2 = fair) = 0.475
{Step 2} P(x_2 = head, x_3 = tail | h_1 = loaded) = P(loaded_2 | loaded_1) P(head | loaded) × 0.275 + P(fair_2 | loaded_1) P(head | fair) × 0.475 = 0.9 × 0.75 × 0.275 + 0.1 × 0.5 × 0.475 ≈ 0.209
P(x_2 = head, x_3 = tail | h_1 = fair) = P(loaded_2 | fair_1) P(head | loaded) × 0.275 + P(fair_2 | fair_1) P(head | fair) × 0.475 = 0.1 × 0.75 × 0.275 + 0.9 × 0.5 × 0.475 ≈ 0.234
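A short numeric check of the forward and backward values on slides 89-93, using the coin model of slide 86; the variable names are illustrative.

```python
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.9, 'L': 0.1}, 'L': {'F': 0.1, 'L': 0.9}}
e  = {'F': {'H': 0.5, 'T': 0.5}, 'L': {'H': 0.75, 'T': 0.25}}
x  = ['H', 'H', 'T']                                  # head, head, tail

f = {k: [0.0] * 3 for k in 'FL'}
b = {k: [0.0] * 3 for k in 'FL'}
for k in 'FL':
    f[k][0] = a0[k] * e[k][x[0]]                      # 0.375 (L), 0.25 (F)
for i in (1, 2):
    for k in 'FL':
        f[k][i] = e[k][x[i]] * sum(f[l][i - 1] * a[l][k] for l in 'FL')
for k in 'FL':
    b[k][2] = 1.0
for i in (1, 0):
    for k in 'FL':
        b[k][i] = sum(e[l][x[i + 1]] * a[k][l] * b[l][i + 1] for l in 'FL')

print(f['L'][1], f['F'][1])   # 0.271875, 0.13125    (slide 90)
print(f['L'][2], f['F'][2])   # ~0.06445, ~0.072656  (slide 91)
print(b['L'][1], b['F'][1])   # 0.275, 0.475         (slide 92)
print(b['L'][0], b['F'][0])   # ~0.209, ~0.234       (slide 93)
```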

94 Coin-Tossing Example
p(x_1,…,x_L, h_i, h_{i+1}) = f(h_i) p(h_{i+1} | h_i) p(x_{i+1} | h_{i+1}) b(h_{i+1})
Outcomes: head, head, tail
Recall: f(h_1 = loaded) = 0.375, f(h_1 = fair) = 0.25; b(h_2 = loaded) = 0.275, b(h_2 = fair) = 0.475
{Step 1} P(x_1 = head, h_1 = loaded) = P(loaded_1) P(head | loaded_1) = 0.5 × 0.75 = 0.375; P(x_1 = head, h_1 = fair) = P(fair_1) P(head | fair_1) = 0.5 × 0.5 = 0.25

95 Coin-Tossing Example
Outcomes: head, head, tail
f(h_1 = loaded) = 0.375, f(h_1 = fair) = 0.25; b(h_2 = loaded) = 0.275, b(h_2 = fair) = 0.475
p(x_1,…,x_L, h_1, h_2) = f(h_1) p(h_2 | h_1) p(x_2 | h_2) b(h_2)
p(x_1,…,x_L, h_1 = loaded, h_2 = loaded) = 0.375 × 0.9 × 0.75 × 0.275 = 0.0696
p(x_1,…,x_L, h_1 = loaded, h_2 = fair) = 0.375 × 0.1 × 0.5 × 0.475 = 0.0089
p(x_1,…,x_L, h_1 = fair, h_2 = loaded) = 0.25 × 0.1 × 0.75 × 0.275 = 0.00516
p(x_1,…,x_L, h_1 = fair, h_2 = fair) = 0.25 × 0.9 × 0.5 × 0.475 = 0.0534

96 Coin-Tossing Example
p(h_i | h_{i-1}, x_1,…,x_L) = p(x_1,…,x_L, h_i, h_{i-1}) / p(h_{i-1}, x_1,…,x_L)
= f(h_{i-1}) p(h_i | h_{i-1}) p(x_i | h_i) b(h_i) / ( f(h_{i-1}) b(h_{i-1}) )

97 M-step
M-step: update the parameters simultaneously (in this case we only have one parameter, the stay probability):
stay-prob ← ( Σ_i p(h_i = loaded | h_{i-1} = loaded, x_1,…,x_L) + p(h_i = fair | h_{i-1} = fair, x_1,…,x_L) ) / (L-1)

98 Example: The ABO locus – EM Learning
A locus is a particular place on the chromosome. Each locus' state (called a genotype) consists of two alleles – one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type.
Suppose we randomly sampled N individuals and found that N_{a/a} have genotype a/a, N_{a/b} have genotype a/b, etc. Then, the MLE is given by the observed genotype frequencies, e.g. θ_{a/a} = N_{a/a} / N.
The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions in a population of the 6 genotypes.

99 The ABO locus (Cont.)
However, testing individuals for their genotype is a very expensive test. Can we estimate the proportions of genotypes using the common cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?
The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies of the three alleles θ_a, θ_b, θ_o in the population determine the frequencies of the genotypes as follows:
θ_{a/b} = 2 θ_a θ_b, θ_{a/o} = 2 θ_a θ_o, θ_{b/o} = 2 θ_b θ_o, θ_{a/a} = [θ_a]^2, θ_{b/b} = [θ_b]^2, θ_{o/o} = [θ_o]^2.
So now we have three parameters that we need to estimate.

100 The Likelihood Function
Let X be a random variable with 6 values x_{a/a}, x_{a/o}, x_{b/b}, x_{b/o}, x_{a/b}, x_{o/o} denoting the six genotypes. The parameters are θ = {θ_a, θ_b, θ_o}.
The probability P(X = x_{a/b} | θ) = 2 θ_a θ_b. The probability P(X = x_{o/o} | θ) = θ_o θ_o. And so on for the other four genotypes.
What is the probability of Data = {B, A, B, B, O, A, B, A, O, B, AB}? It is the product, over the observed blood types, of P(A | θ) = θ_a^2 + 2 θ_a θ_o, P(B | θ) = θ_b^2 + 2 θ_b θ_o, P(AB | θ) = 2 θ_a θ_b, and P(O | θ) = θ_o^2.
Obtaining the maximum of this function yields the MLE.

101 ABO loci as a special case of HMM
Model the ABO sampling as an HMM with 6 states (genotypes) a/a, a/b, a/o, b/b, b/o, o/o, and 4 outputs (blood types) A, B, AB, O.
Assume 3 transition types: a, b and o; a state is determined by 2 successive transitions. The probability of transition x is θ_x.
Emission is done every other state and is determined by the state. E.g., e_{a/o}(A) = 1, since a/o produces blood type A.
[Figure: transitions a then o lead to state a/o, which emits A; transitions a then b lead to state a/b, which emits AB]

102 A faster and simpler EM for ABO loci
This can be solved via Baum-Welch EM training, but that is quite inefficient: for L samples it requires running the forward and backward algorithms on an HMM of length 2L, even though there are only 6 distinct genotypes.
Direct application of the EM algorithm yields a simpler and more efficient way.

103 The EM algorithm in Bayes' nets
E-step: go over the data; sum the expectations of the hidden variables that you get from each data element.
M-step: for every hidden variable x, update your belief according to the expectation you calculated in the last E-step.

104 EM - ABO Example
Hidden: allele a / b / o. Observed: blood type A / B / AB / O.
Data: type A: 100 people; type B: 200; type AB: 50; type O: 50.
We choose a "reasonable" θ = {0.2, 0.2, 0.6}; θ = {θ_a, θ_b, θ_o} is the parameter we need to evaluate.

105 EM - ABO Example
E-step: for each blood type m and allele l, compute the expected number of l alleles carried by the individuals of type m, under the current θ.
M-step: set each θ_l to its expected allele count divided by the total number of alleles (with l = allele and m = blood type).
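A sketch of this allele-counting EM for the data of slide 104 under the Hardy-Weinberg model of slide 99; the function names are illustrative. One iteration from θ0 = {0.2, 0.2, 0.6} lands near θ1 = {0.2, 0.35, 0.45}, matching the following slides.

```python
data  = {'A': 100, 'B': 200, 'AB': 50, 'O': 50}
theta = {'a': 0.2, 'b': 0.2, 'o': 0.6}

def em_step(theta, data):
    ta, tb, to = theta['a'], theta['b'], theta['o']
    n = {'a': 0.0, 'b': 0.0, 'o': 0.0}                 # expected allele counts (E-step)
    pA = ta**2 + 2*ta*to                               # blood type A: genotype a/a or a/o
    n['a'] += data['A'] * (2*ta**2 + 2*ta*to) / pA
    n['o'] += data['A'] * (2*ta*to) / pA
    pB = tb**2 + 2*tb*to                               # blood type B: genotype b/b or b/o
    n['b'] += data['B'] * (2*tb**2 + 2*tb*to) / pB
    n['o'] += data['B'] * (2*tb*to) / pB
    n['a'] += data['AB']; n['b'] += data['AB']         # AB and O are unambiguous
    n['o'] += 2 * data['O']
    total = 2 * sum(data.values())                     # two alleles per person
    return {k: n[k] / total for k in n}                # M-step: normalize

for _ in range(3):
    theta = em_step(theta, data)
    print({k: round(v, 3) for k, v in theta.items()})
```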

106 EM - ABO Example E-step: we compute all the necessary elements

107-111 EM - ABO Example
θ0 = {0.2, 0.2, 0.6}, n = 400 (data size)
Data: A 100, B 200, AB 50, O 50
E-step (1st step): compute the expected allele counts under θ0

112 EM - ABO Example
θ0 = {0.2, 0.2, 0.6}, n = 400 (data size)
M-step (1st step): normalize the expected allele counts to obtain θ1 = {0.2, 0.35, 0.45}

113 EM - ABO Example
θ1 = {0.2, 0.35, 0.45}
E-step (2nd step):

114 EM - ABO Example
θ1 = {0.2, 0.35, 0.45}
M-step (2nd step):

115 EM - ABO Example
θ2 = {0.21, 0.38, 0.41}
E-step (3rd step):

116 EM - ABO Example
M-step (3rd step): θ2 = {0.29, 0.56, 0.15}. No change.

