
1 Hidden Markov Models (CS262 Lecture 6, Win07, Batzoglou)
[Figure: HMM trellis; columns of K states, one column per emitted symbol x1, x2, x3, …, xK]

2 Generating a sequence by an HMM
An HMM is a generative model:
1. Start at state π1 according to prob a_{0,π1}
2. Emit letter x1 according to prob e_{π1}(x1)
3. Go to state π2 according to prob a_{π1,π2}
4. … until emitting xn
[Figure: trellis of K states per position emitting x1, x2, x3, …, xn; emission e_2(x1) and transition a_{0,2} highlighted]

3 Viterbi, Forward, Backward

VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0
Iteration: V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl
Termination: P(x, π*) = max_k V_k(N) a_{k0}

FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0
Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination: P(x) = Σ_k f_k(N) a_{k0}

BACKWARD
Initialization: b_k(N) = a_{k0}, for all k
Iteration: b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
Termination: P(x) = Σ_k a_{0k} e_k(x_1) b_k(1)
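A minimal NumPy sketch of the three recurrences, assuming the model is given as arrays a0 (start probabilities a_{0k}), a (transitions a_{kl}), a_end (end probabilities a_{k0}), and e (emissions e_k(b)), with x a sequence of symbol indices; all names are illustrative, not the lecture's code. For long sequences one would work in log space to avoid underflow.

```python
import numpy as np

def viterbi(x, a0, a, a_end, e):
    """Most likely state path. a0: (K,), a: (K,K), a_end: (K,), e: (K,M)."""
    K, N = a.shape[0], len(x)
    V = np.zeros((N, K))
    ptr = np.zeros((N, K), dtype=int)
    V[0] = a0 * e[:, x[0]]                      # V_l(1) = a_{0l} e_l(x_1)
    for i in range(1, N):
        scores = V[i-1][:, None] * a            # scores[k, l] = V_k(i-1) a_kl
        ptr[i] = scores.argmax(axis=0)
        V[i] = e[:, x[i]] * scores.max(axis=0)  # V_l(i) = e_l(x_i) max_k ...
    last = int((V[-1] * a_end).argmax())
    path = [last]
    for i in range(N - 1, 0, -1):               # trace back the best path
        path.append(ptr[i][path[-1]])
    return path[::-1], (V[-1] * a_end).max()

def forward(x, a0, a, a_end, e):
    K, N = a.shape[0], len(x)
    f = np.zeros((N, K))
    f[0] = a0 * e[:, x[0]]
    for i in range(1, N):
        f[i] = e[:, x[i]] * (f[i-1] @ a)        # f_l(i) = e_l(x_i) sum_k f_k(i-1) a_kl
    return f, float(f[-1] @ a_end)              # P(x) = sum_k f_k(N) a_{k0}

def backward(x, a0, a, a_end, e):
    K, N = a.shape[0], len(x)
    b = np.zeros((N, K))
    b[-1] = a_end                               # b_k(N) = a_{k0}
    for i in range(N - 2, -1, -1):
        b[i] = a @ (e[:, x[i+1]] * b[i+1])      # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
    return b, float(a0 @ (e[:, x[0]] * b[0]))   # P(x) = sum_k a_{0k} e_k(x_1) b_k(1)
```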

4 A modeling example: CpG islands in DNA sequences
[Figure: eight-state HMM with CpG states A+, C+, G+, T+ and non-CpG states A-, C-, G-, T-]

5 Methylation & Silencing
One way cells differentiate is methylation:
- Addition of CH3 to C nucleotides
- Silences genes in the region
CG (denoted CpG) often mutates to TG when methylated.
In each cell, one copy of X is silenced; methylation plays a role.
Methylation is inherited during cell division.

6 Example: CpG Islands
CpG dinucleotides in the genome are frequently methylated. (Write CpG, not to confuse with the CG base pair.)
C → methyl-C → T
Methylation is often suppressed around genes and promoters: CpG islands.

7 Example: CpG Islands
In CpG islands:
- CG is more frequent
- Other pairs (AA, AG, AT, …) have different frequencies
Question: detect CpG islands computationally.

8 A model of CpG Islands – (1) Architecture
[Figure: two groups of four states: A+, C+, G+, T+ for "CpG Island" and A-, C-, G-, T- for "Not CpG Island"]

9 A model of CpG Islands – (2) Transitions
How do we estimate the parameters of the model?
Emission probabilities: 1/0 (each state emits its own nucleotide with probability 1).
1. Transition probabilities within CpG islands: estimated from known CpG islands (training set)
2. Transition probabilities within other regions: estimated from known non-CpG islands (training set)

 +     A     C     G     T
 A   .180  .274  .426  .120
 C   .171  .368  .274  .188
 G   .161  .339  .375  .125
 T   .079  .355  .384  .182

 -     A     C     G     T
 A   .300  .205  .285  .210
 C   .322  .298  .078  .302
 G   .248  .246  .298  .208
 T   .177  .239  .292  .292

Note: the transitions out of each state add up to one, leaving no room for transitions between (+) and (-) states.

10 Log Likelihoods: Telling "CpG Island" from "Non-CpG Island"
Another way to see the effect of transitions: log-likelihood ratios (log base 2)
L(u, v) = log[ P(uv | +) / P(uv | -) ]

 L     A       C       G       T
 A  -0.740  +0.419  +0.580  -0.803
 C  -0.913  +0.302  +1.812  -0.685
 G  -0.624  +0.461  +0.331  -0.730
 T  -1.169  +0.573  +0.393  -0.679

Given a region x = x1…xN, a quick-&-dirty way to decide whether the entire x is CpG:
P(x is CpG) > P(x is not CpG) ⟺ Σ_i L(x_i, x_{i+1}) > 0
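The quick-&-dirty test is a one-liner once the table is in hand; a sketch, with values copied from the table above:

```python
# Log-likelihood ratios L(u, v) = log2[ P(uv | +) / P(uv | -) ] from slide 10.
L = {
    'A': {'A': -0.740, 'C': 0.419, 'G': 0.580, 'T': -0.803},
    'C': {'A': -0.913, 'C': 0.302, 'G': 1.812, 'T': -0.685},
    'G': {'A': -0.624, 'C': 0.461, 'G': 0.331, 'T': -0.730},
    'T': {'A': -1.169, 'C': 0.573, 'G': 0.393, 'T': -0.679},
}

def looks_like_cpg_island(x):
    """Quick-&-dirty test: sum L(x_i, x_i+1) over the region; > 0 means CpG-like."""
    score = sum(L[u][v] for u, v in zip(x, x[1:]))
    return score > 0, score

print(looks_like_cpg_island("CGCGCGCG"))   # strongly positive
print(looks_like_cpg_island("ATATATAT"))   # strongly negative
```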

11 A model of CpG Islands – (2) Transitions
What about transitions between (+) and (-) states? They affect:
- the average length of a CpG island
- the average separation between two CpG islands
[Figure: two states X and Y; self-transitions with probabilities p and q, cross-transitions 1-p and 1-q]
Length distribution of region X:
P[l_X = 1] = 1-p
P[l_X = 2] = p(1-p)
…
P[l_X = k] = p^{k-1} (1-p)
E[l_X] = 1/(1-p): a geometric distribution with mean 1/(1-p)
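The stated mean follows from the standard geometric-series identity; a short derivation:

```latex
E[l_X] = \sum_{k \ge 1} k \, p^{k-1}(1-p)
       = (1-p)\,\frac{d}{dp}\sum_{k \ge 1} p^{k}
       = (1-p)\,\frac{d}{dp}\,\frac{p}{1-p}
       = (1-p)\cdot\frac{1}{(1-p)^{2}}
       = \frac{1}{1-p}.
```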

12 A model of CpG Islands – (2) Transitions
[Figure: eight-state model; within-(+) transitions scaled by p++, a (1-p++) share going to (-) states]
Right now, a_{A+A+} + a_{A+C+} + a_{A+G+} + a_{A+T+} = 1. We need to adjust the a_ij so as to allow transitions between (+) and (-) states. Say we want probability p++ of staying within CpG, and p-- of staying within non-CpG.
1. Adjust all within-(+) probabilities by that factor: for example, let a_{A+G+} ← p++ × a_{A+G+}
2. Now calculate the probabilities between (+) and (-) states:
   a. The total probability a_{A+S}, where S is any (-) state, is (1 - p++)
   b. Let q_{A-}, q_{C-}, q_{G-}, q_{T-} be the proportions of A, C, G, and T within non-CpG states in the training set
   c. Then let a_{A+A-} = (1 - p++) × q_{A-}; a_{A+C-} = (1 - p++) × q_{C-}; …
   d. Do the same for (-) to (+) transitions
3. OK, but how do we estimate p++ and p--?
   a. Estimate the average length of a CpG island: l+ = 1/(1 - p++), so p++ = 1 - 1/l+
   b. Do the same for the average length between two CpG islands, l-
What we just did is a back-of-the-envelope learning procedure: we adjusted the parameters in a manner similar to the learning algorithms we will cover in this lecture.
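A compact sketch of this construction, assuming the 4×4 tables from slide 9 and the training-set nucleotide proportions are available as NumPy arrays; the function and argument names are illustrative:

```python
import numpy as np

def combine(a_plus, a_minus, q_plus, q_minus, mean_island_len, mean_gap_len):
    """Build the 8x8 transition matrix of slide 12.
    a_plus, a_minus: 4x4 within-region tables (rows sum to 1).
    q_plus, q_minus: nucleotide proportions in CpG / non-CpG training regions."""
    p_pp = 1 - 1 / mean_island_len              # p++ = 1 - 1/l+
    p_mm = 1 - 1 / mean_gap_len                 # p-- = 1 - 1/l-
    A = np.zeros((8, 8))                        # states 0-3: A+..T+; 4-7: A-..T-
    A[:4, :4] = p_pp * a_plus                   # stay within (+)
    A[:4, 4:] = (1 - p_pp) * q_minus[None, :]   # (+) -> (-): (1 - p++) q_b-
    A[4:, 4:] = p_mm * a_minus                  # stay within (-)
    A[4:, :4] = (1 - p_mm) * q_plus[None, :]    # (-) -> (+): (1 - p--) q_b+
    # Each row still sums to one: p + (1 - p) * sum(q) = 1.
    return A
```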

13 Applications of the model
Given a DNA region x, the Viterbi algorithm predicts locations of CpG islands.
Given a nucleotide x_i (say x_i = A), the Viterbi parse tells whether x_i is in a CpG island in the most likely overall scenario. The Forward/Backward algorithms can calculate
P(x_i is in a CpG island) = P(π_i = A+ | x)
Posterior decoding can assign locally optimal predictions of CpG islands:
π̂_i = argmax_k P(π_i = k | x)
Advantage: each nucleotide is more likely to be called correctly.
Disadvantage: the overall parse will be "choppy", with CpG islands too short.
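Posterior decoding then falls out of the forward and backward sketches from slide 3 in a few lines (same illustrative names as before):

```python
# Posterior decoding: P(pi_i = k | x) = f_k(i) b_k(i) / P(x)
f, px = forward(x, a0, a, a_end, e)
b, _  = backward(x, a0, a, a_end, e)
posterior = f * b / px              # posterior[i, k] = P(pi_i = k | x)
pi_hat = posterior.argmax(axis=1)   # locally optimal state at each position
```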

14 What if a new genome comes?
We just sequenced the porcupine genome. We know CpG islands play the same role in this genome; however, we have no known CpG islands for porcupines, and we suspect the frequency and characteristics of CpG islands are quite different there.
How do we adjust the parameters in our model? LEARNING

15 Problem 3: Learning
Re-estimate the parameters of the model based on training data.

16 Two learning scenarios
1. Estimation when the "right answer" is known. Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls
2. Estimation when the "right answer" is unknown. Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: update the parameters θ of the model to maximize P(x | θ)

17 1. When the right answer is known
Given x = x1…xN for which the true path π = π1…πN is known, define:
A_kl = # times the k→l transition occurs in π
E_k(b) = # times state k in π emits b in x
We can show that the maximum-likelihood parameters θ (maximizing P(x | θ)) are:
a_kl = A_kl / Σ_i A_ki
e_k(b) = E_k(b) / Σ_c E_k(c)
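A minimal counting sketch of this estimator, with the pseudocounts of slide 19 folded in as optional arguments; the dict-based representation and all names are illustrative:

```python
def ml_estimate(x, pi, states, alphabet, r_trans=0.0, r_emit=None):
    """ML parameters from a labeled sequence (slide 17), with optional
    pseudocounts (slide 19): r_trans per transition, r_emit a dict
    state -> per-symbol pseudocount. Without them, unseen events get 0."""
    r_emit = r_emit or {}
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit.get(k, 0.0) for b in alphabet} for k in states}
    for i, b in enumerate(x):
        E[pi[i]][b] += 1                      # E_k(b): state k emits b
        if i + 1 < len(x):
            A[pi[i]][pi[i + 1]] += 1          # A_kl: transition k -> l
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e
```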

18 1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of transitions & emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable. Zero probabilities are BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1; a_FL = 0
e_F(1) = e_F(3) = e_F(6) = .2; e_F(2) = .3; e_F(4) = 0; e_F(5) = .1

19 Pseudocounts
Solution for small training sets: add pseudocounts.
A_kl = (# times the k→l transition occurs in π) + r_kl
E_k(b) = (# times state k in π emits b in x) + r_k(b)
r_kl and r_k(b) are pseudocounts representing our prior belief.
Larger pseudocounts: strong prior belief.
Small pseudocounts (ε < 1): just to avoid zero probabilities.

20 Pseudocounts
Example: the dishonest casino. We will observe the player for one day, 600 rolls. Reasonable pseudocounts:
r_0F = r_0L = r_F0 = r_L0 = 1
r_FL = r_LF = r_FF = r_LL = 1
r_F(1) = r_F(2) = … = r_F(6) = 20 (strong belief fair is fair)
r_L(1) = r_L(2) = … = r_L(6) = 5 (wait and see for loaded)
The numbers above are arbitrary; assigning priors is an art.
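As a usage illustration of the ml_estimate sketch from slide 17, using the rolls from slide 18 and the emission pseudocounts above (the start/end pseudocounts r_0F etc. are omitted in that sketch):

```python
rolls = "2156123623"   # the ten observed rolls from slide 18
path  = "FFFFFFFFFF"   # the known all-fair dice path from slide 18
a, e = ml_estimate(rolls, path, states="FL", alphabet="123456",
                   r_trans=1.0, r_emit={"F": 20.0, "L": 5.0})
print(a["F"]["L"], e["L"]["6"])   # both nonzero despite L never appearing
```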

21 2. When the right answer is unknown
We don't know the true A_kl, E_k(b).
Idea:
- We estimate our "best guess" of what A_kl and E_k(b) are (or we start with random / uniform values)
- We update the parameters of the model based on our guess
- We repeat

22 2. When the right answer is unknown
Starting with our best guess of a model M with parameters θ:
Given x = x1…xN for which the true path π = π1…πN is unknown, we can get to a provably more likely parameter set θ, i.e., a θ that increases the probability P(x | θ).
Principle: EXPECTATION MAXIMIZATION
1. Estimate A_kl, E_k(b) in the training data
2. Update θ according to A_kl, E_k(b)
3. Repeat 1 & 2 until convergence

23 Estimating new parameters
To estimate A_kl (assume "| θ_CURRENT" in all formulas below):
At each position i of sequence x, find the probability that transition k→l is used:
P(π_i = k, π_{i+1} = l | x) = [1/P(x)] P(π_i = k, π_{i+1} = l, x1…xN) = Q/P(x)
where
Q = P(x1…x_i, π_i = k, π_{i+1} = l, x_{i+1}…xN)
  = P(π_{i+1} = l, x_{i+1}…xN | π_i = k) P(x1…x_i, π_i = k)
  = P(π_{i+1} = l, x_{i+1} x_{i+2}…xN | π_i = k) f_k(i)
  = P(x_{i+2}…xN | π_{i+1} = l) P(x_{i+1} | π_{i+1} = l) P(π_{i+1} = l | π_i = k) f_k(i)
  = b_l(i+1) e_l(x_{i+1}) a_kl f_k(i)
So:
P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ_CURRENT)

24 Estimating new parameters
So A_kl is E[# times the transition k→l occurs, given the current θ]:
A_kl = Σ_i P(π_i = k, π_{i+1} = l | x, θ) = Σ_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)
Similarly,
E_k(b) = [1/P(x | θ)] Σ_{i | x_i = b} f_k(i) b_k(i)
[Figure: sequence x1………x_{i-1} x_i x_{i+1} x_{i+2}………xN with the k→l transition between positions i and i+1, weighted f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)]

25 The Baum-Welch Algorithm
Initialization: pick the best guess for the model parameters (or arbitrary values)
Iteration:
1. Forward
2. Backward
3. Calculate A_kl, E_k(b), given θ_CURRENT
4. Calculate new model parameters θ_NEW: a_kl, e_k(b)
5. Calculate the new log-likelihood P(x | θ_NEW), guaranteed to be higher by expectation-maximization
Until P(x | θ) does not change much
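A sketch of one Baum-Welch iteration, reusing the forward/backward sketches from slide 3; names are illustrative, x is again a sequence of symbol indices, and the start/end probabilities are kept fixed for brevity:

```python
import numpy as np

def baum_welch_step(x, a0, a, a_end, e):
    """One EM iteration (slides 23-25): expected counts, then re-estimate."""
    f, px = forward(x, a0, a, a_end, e)
    b, _  = backward(x, a0, a, a_end, e)
    N, K = f.shape
    # A_kl = sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
    A = np.zeros((K, K))
    for i in range(N - 1):
        A += np.outer(f[i], e[:, x[i+1]] * b[i+1]) * a / px
    # E_k(b) = sum over {i | x_i = b} of f_k(i) b_k(i) / P(x)
    E = np.zeros_like(e)
    for i in range(N):
        E[:, x[i]] += f[i] * b[i] / px
    a_new = A / A.sum(axis=1, keepdims=True)   # a_kl = A_kl / sum_l' A_kl'
    e_new = E / E.sum(axis=1, keepdims=True)   # e_k(b) = E_k(b) / sum_c E_k(c)
    return a_new, e_new, np.log(px)

# The outer loop repeats this step until log P(x | theta) stops improving.
```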

26 The Baum-Welch Algorithm
Time complexity: (# iterations) × O(K²N)
Guaranteed to increase the log-likelihood P(x | θ)
Not guaranteed to find the globally best parameters: converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining

27 Alternative: Viterbi Training
Initialization: same
Iteration:
1. Perform Viterbi to find π*
2. Calculate A_kl, E_k(b) according to π*, plus pseudocounts
3. Calculate the new parameters a_kl, e_k(b)
Until convergence
Notes:
- Not guaranteed to increase P(x | θ)
- Guaranteed to increase P(x | θ, π*)
- In general, worse performance than Baum-Welch
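A sketch of one Viterbi-training iteration in the same illustrative NumPy setting, reusing the viterbi sketch from slide 3; r is a flat pseudocount:

```python
import numpy as np

def viterbi_training_step(x, a0, a, a_end, e, r=1.0):
    """One iteration of slide 27: hard-assign states with Viterbi,
    then re-estimate by counting, smoothed with pseudocount r."""
    path, _ = viterbi(x, a0, a, a_end, e)
    K, M = e.shape
    A = np.full((K, K), r)                    # transition counts + pseudocounts
    E = np.full((K, M), r)                    # emission counts + pseudocounts
    for i in range(len(x) - 1):
        A[path[i], path[i + 1]] += 1
    for i in range(len(x)):
        E[path[i], x[i]] += 1
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)
```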

28 Variants of HMMs

29 Higher-order HMMs
How do we model "memory" of more than one time point?
First order: P(π_{i+1} = l | π_i = k) = a_kl
Second order: P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl
…
A second-order HMM with K states is equivalent to a first-order HMM with K² states.
[Figure: states H, T with context-dependent transitions a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T), expanded to states HH, HT, TH, TT with transitions a_HHT, a_HTT, a_HTH, a_THH, a_THT, a_TTH]
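A small sketch of that state-space expansion, assuming the second-order transitions are given as a nested dict a2[j][k][l]; the names are illustrative. The composite state (j, k) means "previous state j, current state k", and it emits whatever k emits:

```python
from itertools import product

def expand_second_order(states, a2):
    """Encode a second-order HMM (slide 29) as a first-order HMM over
    state pairs: (j, k) -> (k, l) with probability a2[j][k][l]."""
    pairs = list(product(states, repeat=2))
    a = {p: {} for p in pairs}
    for (j, k), (k2, l) in product(pairs, repeat=2):
        # A pair transition is legal only if the middle state matches.
        a[(j, k)][(k2, l)] = a2[j][k][l] if k == k2 else 0.0
    return pairs, a
```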

30 Modeling the Duration of States
[Figure: two states X and Y; self-transitions p and q, cross-transitions 1-p and 1-q]
Length distribution of region X: E[l_X] = 1/(1-p), a geometric distribution with mean 1/(1-p).
This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.

31 Example: exon lengths in genes
[Figure: empirical distribution of exon lengths]

32 Solution 1: Chain several states
[Figure: several copies of state X chained before Y; each X has self-transition p and exit 1-p, Y has self-transition q]
l_X = C + geometric with mean 1/(1-p)
Disadvantage: still very inflexible

33 Solution 2: Negative binomial distribution
[Figure: states X(1), X(2), …, X(n), each with self-loop p and forward edge 1-p, then Y]
Duration in X: m turns, where
- during the first m-1 turns, exactly n-1 arrows to the next state are followed
- during the m-th turn, an arrow to the next state is followed
P(l_X = m) = C(m-1, n-1) (1-p)^{n-1+1} p^{(m-1)-(n-1)} = C(m-1, n-1) (1-p)^n p^{m-n}
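The pmf can be computed and sanity-checked directly; the helper below is an illustrative sketch of the stated formula, not lecture code:

```python
from math import comb

def neg_binomial_duration(m, n, p):
    """P(l_X = m) for the chain of n X-copies on slide 33: choose which
    n-1 of the first m-1 turns advance, then the m-th turn exits."""
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

# Sanity check: the probabilities over all durations sum to (nearly) 1.
print(sum(neg_binomial_duration(m, 3, 0.9) for m in range(3, 2000)))  # ~1.0
```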

34 Example: genes in prokaryotes
EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A)
Negative binomial with n = 3

35 Solution 3: Duration modeling
Upon entering a state:
1. Choose duration d according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities
[Figure: state F with duration d < D_F drawn from P_F, emitting x_i…x_{i+d-1}]
Disadvantage: increase in complexity. Time: O(D) increase; Space: O(1) increase, where D = maximum duration of a state.
(Warning: Rabiner's tutorial claims O(D²) time and O(D) space increases.)

36 Viterbi with duration modeling
Recall the original iteration:
V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl
New iteration:
V_F(i) = max_k max_{d=1…D_F} V_k(i-d) × P_F(d) × a_kF × Π_{j=i-d+1…i} e_F(x_j)
[Figure: states F and L; F emits x_i…x_{i+d-1} with duration d < D_F from P_F, L emits x_j…x_{j+d-1} with duration d < D_L from P_L]
Precompute cumulative values to evaluate the emission products quickly.
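A sketch of that recurrence, with the cumulative-emission precomputation the slide hints at; log space keeps the emission products stable, and the end transition and traceback are omitted for brevity. All names are assumptions: dur is a (K, Dmax) array with dur[l, d-1] = P_l(d).

```python
import numpy as np

def viterbi_duration(x, a0, a, dur, e, Dmax):
    """Duration-modeling Viterbi (slide 36); returns log P of the best parse."""
    K, N = a.shape[0], len(x)
    loge = np.log(e[:, x])                        # (K, N) log emissions per position
    cum = np.concatenate([np.zeros((K, 1)), np.cumsum(loge, axis=1)], axis=1)
    V = np.full((N + 1, K), -np.inf)              # V[i, l]: best parse ending at i in l
    with np.errstate(divide="ignore"):
        la, la0, ldur = np.log(a), np.log(a0), np.log(dur)
    for i in range(1, N + 1):
        for d in range(1, min(Dmax, i) + 1):
            emit = cum[:, i] - cum[:, i - d]      # sum_{j=i-d+1..i} log e_l(x_j)
            if i == d:                            # segment starts the sequence
                best = la0
            else:                                 # best predecessor state k
                best = (V[i - d][:, None] + la).max(axis=0)
            V[i] = np.maximum(V[i], best + ldur[:, d - 1] + emit)
    return V[N].max()
```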

