Pair-HMMs and CRFs
Chuong B. Do
CS262, Winter 2009, Lecture #8

Outline
I'll cover two different topics today:
– pair-HMMs
– conditional random fields (CRFs)
Other resources:
– For more information on pair-HMMs, see the Durbin et al. book.
– For more information on CRFs, see the materials online.

Part I: Pair-HMMs

Quick recap of HMMs
Formally, an HMM is a tuple (Σ, Q, A, a_0, e):
– alphabet: Σ = {b_1, …, b_M}
– set of states: Q = {1, …, K}
– transition probabilities: A = [a_ij]
– initial state probabilities: a_0i
– emission probabilities: e_i(b_k)
Example: the dishonest casino. [State diagram: FAIR ⇄ LOADED, switching probability 0.05, staying probability 0.95]
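As a concrete reference point, the HMM tuple above can be written down directly in code. This is an illustrative sketch, not code from the lecture: the 0.95/0.05 transition probabilities follow the casino diagram, while the loaded die's emission bias and the uniform start distribution are assumed values.

```python
# The HMM tuple (alphabet, states, transitions, start, emissions) as plain dicts.
# Transition probabilities follow the slide's diagram; the loaded die's bias
# and the uniform start distribution are assumed illustrative values.
states = ["FAIR", "LOADED"]
trans = {                                   # A = [a_ij]
    "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
    "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
}
init = {"FAIR": 0.5, "LOADED": 0.5}         # a_0i
emit = {                                    # e_i(b_k)
    "FAIR":   {r: 1 / 6 for r in "123456"},
    "LOADED": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}
```

Each row of `trans` and `emit` is a probability distribution, so its values must sum to 1.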

3 basic questions for HMMs
Evaluation: given a sequence of observations x and a sequence of states π, compute P(x, π).
Decoding: given a sequence of observations x, compute the maximum-probability sequence of states π^ML = arg max_π P(x, π).
Learning: given an HMM with unspecified parameters θ = (A, a_0, e), compute the parameters that maximize the likelihood of x and π, i.e., θ^ML = arg max_θ P(x, π | θ).

Pair-HMMs
Roughly speaking, a pair-HMM is the generalization of an HMM from 1 observation sequence to 2 observation sequences. Unfortunately, the term "pair-HMM" has come to mean two very different types of models in computational biology. I'll talk about both.

Pair-HMMs: Usage #1
Consider the HMM (Σ_1 × Σ_2, Q, A, a_0, e). The alphabet consists of pairs of letters: one letter from Σ_1 paired with one letter from Σ_2.
Example: a dishonest casino dealer playing two games at once. [State diagram: HONEST ⇄ CHEAT, switching probability 0.05, staying probability 0.95]

Pair-HMMs: Usage #1 (continued)
QUESTION: How do we handle the 3 basic questions (evaluation, decoding, learning)?
QUESTION: How would you generalize to 3 sequences? What about K sequences? Why might this be a bad idea?
Where this version of pair-HMMs comes up in practice:
– comparative gene recognition using pairwise alignments
– protein secondary structure prediction using pairwise alignments as input
– etc.

Pair-HMMs: Usage #2
Consider the HMM ((Σ_1 ∪ {η}) × (Σ_2 ∪ {η}), Q, A, a_0, e).
Similar to Usage #1, except that this time, instead of emitting a pair of letters in every state, in some states we may choose to emit a letter paired with η (the empty string).
– For simplicity, we'll assume that η is never emitted for both observation sequences simultaneously.
– We'll call the two observation sequences x and y.

Application: sequence alignment
Consider the following three-state pair-HMM:
– state M emits an aligned pair (x_i, y_j) with probability P(x_i, y_j)
– state I emits (x_i, η) with probability Q(x_i); state J emits (η, y_j) with probability Q(y_j)
– from M, the model moves to I or J with probability δ each, staying in M with probability 1 - 2δ; from I or J, it extends the gap with probability ε and returns to M with probability 1 - ε
– optionally, for all c ∈ Σ, P(η, c) = P(c, η) = Q(c)
QUESTION: What are the interpretations of P(c, d) and Q(c) for c, d ∈ Σ?
QUESTION: What does this model have to do with alignments?
QUESTION: What is the average length of a gapped region in alignments generated by this model? The average length of matched regions?
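One way to build intuition for this model is to sample alignment columns from it. The sketch below is illustrative, not from the lecture: it draws columns from the three-state process, writing a gap in place of η as '-'; the emission tables, δ, and ε passed in are assumed toy values. Note that since gap states extend with probability ε, gap lengths are geometrically distributed.

```python
import random

def sample_alignment(P_match, Q, delta, eps, length=10, seed=0):
    """Sample `length` alignment columns from the three-state pair-HMM.
    P_match: dict over letter pairs for state M; Q: dict over letters for I/J.
    A gap (eta) is written as '-'."""
    rng = random.Random(seed)

    def draw(dist):
        # Inverse-CDF sampling from a dict of outcome -> probability.
        r, acc = rng.random(), 0.0
        for k, p in dist.items():
            acc += p
            if r < acc:
                return k
        return k  # guard against floating-point rounding

    state, cols = "M", []
    for _ in range(length):
        if state == "M":
            cols.append(draw(P_match))                  # aligned pair (x_i, y_j)
            r = rng.random()
            state = "I" if r < delta else ("J" if r < 2 * delta else "M")
        elif state == "I":
            cols.append((draw(Q), "-"))                 # (x_i, eta)
            state = "I" if rng.random() < eps else "M"
        else:
            cols.append(("-", draw(Q)))                 # (eta, y_j)
            state = "J" if rng.random() < eps else "M"
    return cols
```

By construction, no column ever has a gap in both rows, matching the simplifying assumption above.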

Pair-HMMs: Usage #2 (continued)
Key difference from Usage #1: in these pair-HMMs, we cannot rely directly on the algorithms for single-sequence HMMs in order to do inference. However, we can modify the single-sequence algorithms to work for alignment pair-HMMs.

Recap: Viterbi for single-sequence HMMs
Algorithm:
– Define V_k(i) = max_{π_1 … π_{i-1}} P(x_1 … x_{i-1}, π_1 … π_{i-1}, x_i, π_i = k)
– Compute using dynamic programming!
[Trellis diagram: K states at each of the positions x_1, x_2, x_3, …]
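The recap above can be made concrete with a short sketch. This log-space Viterbi implementation is illustrative, not the lecture's code, and the two-state demo model at the bottom (states F/L with coin-like emissions) is entirely assumed.

```python
import math

def viterbi(x, states, init, trans, emit):
    """Log-space Viterbi: V_k(i) = e_k(x_i) * max_j a_jk V_j(i-1).
    Returns the maximum-probability state path via traceback."""
    V = [{k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}]
    back = []
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for k in states:
            best = max(states, key=lambda j: V[-1][j] + math.log(trans[j][k]))
            row[k] = V[-1][best] + math.log(trans[best][k]) + math.log(emit[k][x[i]])
            ptr[k] = best
        V.append(row)
        back.append(ptr)
    k = max(states, key=lambda s: V[-1][s])   # best final state
    path = [k]
    for ptr in reversed(back):                # follow back-pointers
        k = ptr[k]
        path.append(k)
    return path[::-1]

# Assumed two-state demo model (coin-like, biased "loaded" state):
states = ("F", "L")
init = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.05, "L": 0.95}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}
```

On a run of heads, the biased state wins at every position, so the decoded path stays in L.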

(Broken) Viterbi for pair-HMMs
In the single-sequence case, we defined
V_k(i) = max_{π_1 … π_{i-1}} P(x_1 … x_{i-1}, π_1 … π_{i-1}, x_i, π_i = k)
       = e_k(x_i) · max_j a_jk V_j(i - 1)
– QUESTION: What's the computational complexity of the DP?
Observation: in the pairwise case, (x_1, y_1) … (x_{i-1}, y_{i-1}) no longer correspond to the first i - 1 letters of x and y respectively.
– QUESTION: Why does this cause a problem in the pair-HMM case?

(Fixed) Viterbi for pair-HMMs
We'll just consider this special case (the alignment model from before), indexing V by positions in both x and y:
V_M(i, j) = P(x_i, y_j) · max{ (1 - 2δ) V_M(i-1, j-1), (1 - ε) V_I(i-1, j-1), (1 - ε) V_J(i-1, j-1) }
V_I(i, j) = Q(x_i) · max{ δ V_M(i-1, j), ε V_I(i-1, j) }
V_J(i, j) = Q(y_j) · max{ δ V_M(i, j-1), ε V_J(i, j-1) }
A similar trick can be used when working with the forward/backward algorithms (see Durbin et al. for details).
QUESTION: What's the computational complexity of the DP?
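These recurrences can be sketched directly as a three-matrix DP over (i, j), which also makes the O(|x|·|y|) time and space cost visible. This is an illustrative implementation, not code from the lecture: it returns only the best log joint probability (traceback omitted), assumes the process starts in a virtual match state, and the toy two-letter emission tables at the bottom are assumed values.

```python
import math

NEG = float("-inf")

def pair_viterbi(x, y, P, Q, delta, eps):
    """Viterbi over the three alignment states M, I, J, filled as
    (|x|+1) x (|y|+1) tables. Returns the best log joint probability;
    traceback omitted for brevity."""
    n, m = len(x), len(y)
    VM = [[NEG] * (m + 1) for _ in range(n + 1)]
    VI = [[NEG] * (m + 1) for _ in range(n + 1)]
    VJ = [[NEG] * (m + 1) for _ in range(n + 1)]
    VM[0][0] = 0.0  # assumed: alignment starts in a virtual match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                VM[i][j] = math.log(P[x[i-1], y[j-1]]) + max(
                    math.log(1 - 2 * delta) + VM[i-1][j-1],
                    math.log(1 - eps) + VI[i-1][j-1],
                    math.log(1 - eps) + VJ[i-1][j-1])
            if i > 0:
                VI[i][j] = math.log(Q[x[i-1]]) + max(
                    math.log(delta) + VM[i-1][j],
                    math.log(eps) + VI[i-1][j])
            if j > 0:
                VJ[i][j] = math.log(Q[y[j-1]]) + max(
                    math.log(delta) + VM[i][j-1],
                    math.log(eps) + VJ[i][j-1])
    return max(VM[n][m], VI[n][m], VJ[n][m])

# Assumed toy emission tables over a two-letter alphabet:
Qd = {c: 0.5 for c in "AB"}
Pd = {(a, b): 0.4 if a == b else 0.05 for a in "AB" for b in "AB"}
```

With these tables, aligning a sequence against itself should score strictly better than against its reversal.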

Connection to NW with affine gaps
QUESTION: How would the optimal alignment change if we divided the probability of every single alignment by ∏_{i=1…|x|} Q(x_i) · ∏_{j=1…|y|} Q(y_j)?
V_M(i, j) = P(x_i, y_j) · max{ (1 - 2δ) V_M(i-1, j-1), (1 - ε) V_I(i-1, j-1), (1 - ε) V_J(i-1, j-1) }
V_I(i, j) = Q(x_i) · max{ δ V_M(i-1, j), ε V_I(i-1, j) }
V_J(i, j) = Q(y_j) · max{ δ V_M(i, j-1), ε V_J(i, j-1) }

Connection to NW with affine gaps
Account for the extra terms "along the way":
V_M(i, j) = [P(x_i, y_j) / (Q(x_i) Q(y_j))] · max{ (1 - 2δ) V_M(i-1, j-1), (1 - ε) V_I(i-1, j-1), (1 - ε) V_J(i-1, j-1) }
V_I(i, j) = max{ δ V_M(i-1, j), ε V_I(i-1, j) }
V_J(i, j) = max{ δ V_M(i, j-1), ε V_J(i, j-1) }

Connection to NW with affine gaps
Take logs, and ignore a couple of terms:
log V_M(i, j) = log [P(x_i, y_j) / (Q(x_i) Q(y_j))] + max{ log(1 - 2δ) + log V_M(i-1, j-1), log(1 - ε) + log V_I(i-1, j-1), log(1 - ε) + log V_J(i-1, j-1) }
log V_I(i, j) = max{ log δ + log V_M(i-1, j), log ε + log V_I(i-1, j) }
log V_J(i, j) = max{ log δ + log V_M(i, j-1), log ε + log V_J(i, j-1) }

Connection to NW with affine gaps
Rename!
M(i, j) = S(x_i, y_j) + max{ M(i-1, j-1), I(i-1, j-1), J(i-1, j-1) }
I(i, j) = max{ d + M(i-1, j), e + I(i-1, j) }
J(i, j) = max{ d + M(i, j-1), e + J(i, j-1) }
Here S(x_i, y_j) = log [P(x_i, y_j) / (Q(x_i) Q(y_j))] plays the role of the substitution score, and d ≈ log δ and e ≈ log ε are the gap-open and gap-extend penalties (after dropping the log(1 - 2δ) and log(1 - ε) terms).
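After the renaming, the recurrences are exactly Needleman-Wunsch with affine gaps and can be implemented directly. A sketch (the substitution scores and gap penalties at the bottom are assumed toy values; traceback is omitted):

```python
NEG = float("-inf")

def nw_affine(x, y, S, d, e):
    """Needleman-Wunsch with affine gaps, mirroring the renamed recurrences:
    M tracks matches, I gaps in y, J gaps in x; d = gap-open, e = gap-extend
    (both negative). Returns the optimal global alignment score."""
    n, m = len(x), len(y)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    J = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = S[x[i-1], y[j-1]] + max(M[i-1][j-1], I[i-1][j-1], J[i-1][j-1])
            if i > 0:
                I[i][j] = max(d + M[i-1][j], e + I[i-1][j])
            if j > 0:
                J[i][j] = max(d + M[i][j-1], e + J[i][j-1])
    return max(M[n][m], I[n][m], J[n][m])

# Assumed toy scoring scheme: +2 match, -1 mismatch, gap open -3, extend -1.
S = {(a, b): 2 if a == b else -1 for a in "ACGT" for b in "ACGT"}
```

For example, "GAT" vs "GAT" scores three matches (2 + 2 + 2), while "GAT" vs "GT" pays one gap-open for the skipped A (2 - 3 + 2).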

Summary of pair-HMMs
We briefly discussed two types of pair-HMMs that come up in the literature. We talked about pair-HMMs for sequence alignment. We derived a Viterbi inference algorithm for alignment pair-HMMs and showed its rough "equivalence" to Needleman-Wunsch with affine gap penalties.

Part II: Conditional random fields

Motivation
So far, we've seen HMMs, a type of probabilistic model for labeling sequences.
– For an input sequence x and a parse π, an HMM defines P(x, π).
– In practice, however, when we want to perform inference, what we really care about is P(π | x). In the case of the Viterbi algorithm, arg max_π P(π | x) = arg max_π P(x, π), but posterior decoding needs P(π | x) itself.
Why do we need to parameterize P(x, π)? If we really are interested in the conditional distribution over possible parses π for a given input x, can't we just model P(π | x) directly?
– Intuition: P(x, π) = P(x) P(π | x), so why waste effort modeling P(x)?

Conditional random fields
Definition:
P(π | x) = exp(∑_{i=1…|x|} w^T F(π_i, π_{i-1}, x, i)) / ∑_{π'} exp(∑_{i=1…|x|} w^T F(π'_i, π'_{i-1}, x, i))
where
– F : state × state × observations × index → R^n is the "local feature mapping"
– w ∈ R^n is the "parameter vector"
– the summation in the denominator (the partition function) is taken over all possible state sequences π'_1 … π'_|x|
– a^T b for two vectors a, b ∈ R^n denotes the inner product ∑_{i=1…n} a_i b_i

Interpretation of weights
Define the "global feature mapping"
F(x, π) = ∑_{i=1…|x|} F(π_i, π_{i-1}, x, i)
Then,
P(π | x) = exp(w^T F(x, π)) / ∑_{π'} exp(w^T F(x, π'))
QUESTION: If all features are nonnegative, what is the effect of increasing some component w_j?

Relationship with HMMs
Recall that in an HMM,
log P(x, π) = log P(π_0) + ∑_{i=1…|x|} [ log P(π_i | π_{i-1}) + log P(x_i | π_i) ]
            = log a_0(π_0) + ∑_{i=1…|x|} [ log a(π_{i-1}, π_i) + log e_{π_i}(x_i) ]
Suppose that we list the logarithms of all the parameters of our model in a long vector w ∈ R^n:
w = ( log a_0(1), …, log a_0(K), log a_11, …, log a_KK, log e_1(b_1), …, log e_K(b_M) )

Relationship with HMMs (continued)
Now, consider a feature mapping with the same dimensionality as w. For each component w_j, define F_j to be a 0/1 indicator variable indicating whether the corresponding parameter should be included in scoring (x, π) at position i:
F(π_i, π_{i-1}, x, i) = ( 1{i = 1 ∧ π_{i-1} = 1}, …, 1{i = 1 ∧ π_{i-1} = K}, 1{π_{i-1} = 1 ∧ π_i = 1}, …, 1{π_{i-1} = K ∧ π_i = K}, 1{x_i = b_1 ∧ π_i = 1}, …, 1{x_i = b_M ∧ π_i = K} )
Then, since
log P(x, π) = log a_0(π_0) + ∑_{i=1…|x|} [ log a(π_{i-1}, π_i) + log e_{π_i}(x_i) ],   (*)
we have log P(x, π) = ∑_{i=1…|x|} w^T F(π_i, π_{i-1}, x, i).

Equivalently,
P(π | x) = exp(∑_{i=1…|x|} w^T F(π_i, π_{i-1}, x, i)) / ∑_{π'} exp(∑_{i=1…|x|} w^T F(π'_i, π'_{i-1}, x, i)) = P(x, π) / ∑_{π'} P(x, π')
Therefore, an HMM can be converted to an equivalent CRF.
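The conversion can be checked numerically: put the log-parameters of a toy HMM into a weight vector w, define the 0/1 indicator features, and verify that exp(w^T F(x, π)) reproduces P(x, π). Everything in this sketch is illustrative: the HMM numbers are assumed, and the feature names ("init", "trans", "emit") are hypothetical labels, not notation from the lecture.

```python
import math

# Toy two-state HMM with assumed illustrative numbers:
hmm_states = ["F", "L"]
obs_alpha = "HT"
a0 = {"F": 0.5, "L": 0.5}
a = {("F", "F"): 0.9, ("F", "L"): 0.1, ("L", "F"): 0.2, ("L", "L"): 0.8}
e = {("F", "H"): 0.5, ("F", "T"): 0.5, ("L", "H"): 0.9, ("L", "T"): 0.1}

# One weight per HMM parameter, holding its logarithm (the vector w above):
w = {("init", k): math.log(a0[k]) for k in hmm_states}
w.update({("trans", j, k): math.log(a[j, k]) for j in hmm_states for k in hmm_states})
w.update({("emit", k, c): math.log(e[k, c]) for k in hmm_states for c in obs_alpha})

def active_features(pi_i, pi_prev, x, i):
    """F(pi_i, pi_{i-1}, x, i): the set of 0/1 indicator features equal to 1."""
    f = {("emit", pi_i, x[i])}
    f.add(("init", pi_i) if i == 0 else ("trans", pi_prev, pi_i))
    return f

def crf_score(x, pi):
    """w^T F(x, pi): sum the weights of all active features at every position."""
    return sum(w[f] for i in range(len(x))
               for f in active_features(pi[i], pi[i-1] if i > 0 else None, x, i))
```

Exponentiating the linear score recovers the product of the HMM parameters along the path, i.e., P(x, π).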

CRFs ≥ HMMs
All HMMs can be converted to CRFs. But the reverse does not hold!
Reason #1: HMM parameter vectors obey certain constraints (e.g., ∑_i a_0i = 1), but CRF parameter vectors w are unconstrained!
Reason #2: CRFs can contain features that are NOT possible to include in HMMs.

CRFs ≥ HMMs (continued)
In an HMM, our features were of the form F(π_i, π_{i-1}, x, i) = F(π_i, π_{i-1}, x_i, i).
– I.e., when scoring position i in the sequence, a feature could only consider the emission x_i at position i.
– It cannot look at other positions (e.g., x_{i-1}, x_{i+1}), since that would involve "emitting" a character more than once: double-counting of probability!
CRFs don't have this problem. Why? Because CRFs don't attempt to model the observations x!

Examples of non-local features for CRFs
Casino:
– The dealer looks at the previous 100 positions and determines whether at least 50 of them were 6's.
– F_j(LOADED, FAIR, x, i) = 1{ x_{i-100} … x_i has > 50 6's }
CpG islands:
– A gene occurs near a CpG island.
– F_j(*, EXON, x, i) = 1{ x_{i-1000} … x_{i+1000} has > 1/16 CpG's }

3 basic questions for CRFs
Evaluation: given a sequence of observations x and a sequence of states π, compute P(π | x).
Decoding: given a sequence of observations x, compute the maximum-probability sequence of states π^ML = arg max_π P(π | x).
Learning: given a CRF with unspecified parameters w, compute the parameters that maximize the likelihood of π given x, i.e., w^ML = arg max_w P(π | x, w).

Viterbi for CRFs
Note that:
arg max_π P(π | x) = arg max_π [ exp(∑_{i=1…|x|} w^T F(π_i, π_{i-1}, x, i)) / ∑_{π'} exp(∑_{i=1…|x|} w^T F(π'_i, π'_{i-1}, x, i)) ]
                   = arg max_π exp(∑_{i=1…|x|} w^T F(π_i, π_{i-1}, x, i))
                   = arg max_π ∑_{i=1…|x|} w^T F(π_i, π_{i-1}, x, i)
We can derive the following recurrence:
V_k(i) = max_j [ w^T F(k, j, x, i) + V_j(i-1) ]
Notes:
– Even though the features may depend on arbitrary positions in x, the sequence x is constant. The correctness of the DP depends only on knowing the previous state.
– Computing the partition function can be done by a similar adaptation of the forward/backward algorithms.
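The recurrence translates almost line-for-line into code. In this illustrative sketch (not the lecture's code), `score(k, j, x, i)` stands for w^T F(k, j, x, i) and may freely inspect any position of x; the toy score function at the end is an assumed example that simply rewards labeling a position with its observed letter.

```python
def crf_viterbi(x, states, score):
    """CRF Viterbi: V_k(i) = max_j [score(k, j, x, i) + V_j(i-1)], where
    score(k, j, x, i) plays the role of w^T F(k, j, x, i). Because x is fixed,
    score may inspect arbitrary positions of x without breaking the DP."""
    V = {k: score(k, None, x, 0) for k in states}
    back = []
    for i in range(1, len(x)):
        newV, ptr = {}, {}
        for k in states:
            best = max(states, key=lambda j: V[j] + score(k, j, x, i))
            newV[k] = V[best] + score(k, best, x, i)
            ptr[k] = best
        V = newV
        back.append(ptr)
    k = max(states, key=lambda s: V[s])   # best final state
    path = [k]
    for ptr in reversed(back):            # follow back-pointers
        k = ptr[k]
        path.append(k)
    return path[::-1]

# Assumed toy score: reward labeling position i with the observed letter.
def toy_score(k, j, x, i):
    return 1.0 if k == x[i] else 0.0
```

With the toy score, the decoded path simply copies the observation sequence.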

3 basic questions for CRFs
Evaluation: given a sequence of observations x and a sequence of states π, compute P(π | x).
Decoding: given a sequence of observations x, compute the maximum-probability sequence of states π^ML = arg max_π P(π | x).
Learning: given a CRF with unspecified parameters w, compute the parameters that maximize the likelihood of π given x, i.e., w^ML = arg max_w P(π | x, w).

Learning CRFs
Key observation: log P(π | x, w) is a differentiable, concave function of w.
[Figure: a concave curve; the chord between (x, f(x)) and (y, f(y)) lies below the function]
Any local maximum of a concave function is a global maximum.

Learning CRFs (continued)
Compute the partial derivative of log P(π | x, w) with respect to each parameter w_j, and use the gradient ascent learning rule: w ← w + α ∇_w log P(π | x, w).
[Figure: the gradient points in the direction of greatest function increase]

The CRF gradient
It turns out (you'll show this on your homework) that
(∂/∂w_j) log P(π | x, w) = F_j(x, π) - E_{π' ~ P(π' | x, w)} [ F_j(x, π') ]
i.e., the correct value of the jth feature minus its expected value under the current parameters.
This has a very nice interpretation:
– We increase parameters for which the correct feature values are greater than the predicted feature values.
– We decrease parameters for which the correct feature values are less than the predicted feature values.
This moves probability mass from incorrect parses to correct parses.
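The gradient rule can be exercised end-to-end on a tiny example: a two-state CRF with two hand-picked global features, where the expected feature values are computed by brute-force enumeration of all π' (feasible only for short sequences; real implementations use forward/backward). The features, step size, and numbers here are all assumed illustrations of the formula above, not code from the course.

```python
import math
from itertools import product

# Two assumed global features over label sequences from states {"A", "B"}:
def global_F(x, pi):
    f1 = sum(1.0 for i in range(len(x)) if pi[i] == x[i])        # labels matching x
    f2 = sum(1.0 for i in range(1, len(x)) if pi[i] != pi[i-1])  # state changes
    return (f1, f2)

def log_unnorm(x, pi, w):
    f = global_F(x, pi)
    return w[0] * f[0] + w[1] * f[1]  # w^T F(x, pi)

def log_prob(x, pi, w, states="AB"):
    Z = sum(math.exp(log_unnorm(x, p, w)) for p in product(states, repeat=len(x)))
    return log_unnorm(x, pi, w) - math.log(Z)

def grad(x, pi, w, states="AB"):
    """dL/dw_j = F_j(x, pi) - E_{pi' ~ P(pi'|x,w)}[F_j(x, pi')], with the
    expectation computed by brute-force enumeration (exponential; toy only)."""
    Z = sum(math.exp(log_unnorm(x, p, w)) for p in product(states, repeat=len(x)))
    exp_f = [0.0, 0.0]
    for p in product(states, repeat=len(x)):
        pr = math.exp(log_unnorm(x, p, w)) / Z
        f = global_F(x, p)
        exp_f[0] += pr * f[0]
        exp_f[1] += pr * f[1]
    f = global_F(x, pi)
    return (f[0] - exp_f[0], f[1] - exp_f[1])

# Gradient ascent with an assumed step size of 0.1:
x_train, pi_train = "AABB", ("A", "A", "B", "B")
w = [0.0, 0.0]
before = log_prob(x_train, pi_train, w)
for _ in range(100):
    g = grad(x_train, pi_train, w)
    w = [w[0] + 0.1 * g[0], w[1] + 0.1 * g[1]]
after = log_prob(x_train, pi_train, w)
```

Since the correct parse matches x at every position, the "matching" weight is pushed upward and the conditional log-likelihood of the true parse increases, exactly the mass-shifting interpretation above.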

Summary of CRFs We described a probabilistic model known as a CRF. We showed how to convert any HMM into an equivalent CRF. We mentioned briefly that inference in CRFs is very similar to inference in HMMs. We described a gradient ascent approach to training CRFs when true parses are available.
