# Pair-HMMs and CRFs Chuong B. Do CS262, Winter 2009 Lecture #8.

## Presentation on theme: "Pair-HMMs and CRFs Chuong B. Do CS262, Winter 2009 Lecture #8."— Presentation transcript:

Pair-HMMs and CRFs Chuong B. Do CS262, Winter 2009 Lecture #8

Outline I’ll cover two different topics today – pair-HMMs – conditional random fields (CRFs) Other resources – For more information on pair-HMMs, see the Durbin et al. book – For more information on CRFs, see the materials online.

Part I: Pair-HMMs

Quick recap of HMMs Formally, an HMM = (Σ, Q, A, a 0, e). – alphabet: Σ = {b 1, …, b M } – set of states: Q = {1, …, K} – transition probabilities: A = [a ij ] – initial state probabilities: a 0i – emission probabilities: e i (b k ) Example: FAIRLOADED 0.05 0.95

3 basic questions for HMMs Evaluation: Given a sequence of observations x and a sequence of states π, compute P(x, π) Decoding: Given a sequence of observations x, compute the maximum probability sequence of states π ML = arg max π P(x, π) Learning: Given an HMM with unspecified parameters θ = (A, a 0, b), compute the parameters that maximize the likelihood of x and π, i.e., θ ML = arg max θ P(x, π | θ)

Pair-HMMs Roughly speaking, the generalization of HMMs with 1 observation sequence to 2 observation sequences. Unfortunately, the usage of the term “pair-HMM” has come to mean two very different types of models in computational biology. I’ll talk about both.

Pair-HMMs: Usage #1 Consider the HMM = (Σ 1 x Σ 2, Q, A, a 0, e). Alphabet consists of pairs of letters – one letter from Σ 1 paired with one letter from Σ 2. Example: Dishonest casino dealer playing two games at once. HONESTCHEAT 0.05 0.95

Pair-HMMs: Usage #1 (continued) QUESTION: How do we handle the 3 basic questions? (evaluation, decoding, learning) QUESTION: How would you generalize to 3 sequences? What about K sequences? Why might this be a bad idea? Where this version of pair-HMMs comes up in practice: – Comparative gene recognition using pairwise alignments – Protein secondary structure prediction using pairwise alignments as input – etc.

Pair-HMMs: Usage #2 Consider the HMM = ((Σ 1 U {η}) x (Σ 2 U {η}), Q, A, a 0, e). Similar to Usage #1, except that this time, instead of emitting a pair of letters in every state, in some states we may choose to emit a letter paired with η (the empty string). – For simplicity, we’ll assume that η is never emitted for both observation sequences simultaneously. – We’ll call the two observation sequences x and y.

Application: sequence alignment Consider the following pair-HMM: M P(x i, y j ) I P(x i, η) J P(η, y j ) 1 – 2  1 –      optional  c  Σ, P(η, c) = P(c, η) = Q(c) QUESTION: What are the interpretations of P(c,d) and Q(c) for c,d  Σ? QUESTION: What does this model have to do with alignments? QUESTION: What is the average length of a gapped region in alignments generated by this model? Average length of matched regions?

Pair-HMMs: Usage #2 (continued) Key difference from Usage #1: In these pair-HMMs, we cannot rely on the algorithms for single-sequence HMMs in order to do inference.  However, we can modify the algorithms for single sequence HMMs to work for alignment pair-HMMs.

Recap: Viterbi for single-sequence HMMs Algorithm: – Define V k (i) = max π1 … πi-1 P(x 1 … x i-1, π 1 … π i-1, x i, π i = k) – Compute using dynamic programming! 1 2 K … 1 2 K … 1 2 K … … … … 1 2 K … x1x1 x2x2 x3x3 xKxK 2 1 K 2

(Broken) Viterbi for pair-HMMs In the single sequence case, we defined V k (i) = max π1 … πi-1 P(x 1 … x i-1, π 1 … π i-1, x i, π i = k) = e k (x i ) ∙ max j a jk V j (i - 1) – QUESTION: What’s the computational complexity of DP? Observation: In the pairwise case (x 1, y 1 ) … (x i -1, y i-1 ) no longer correspond to the first i – 1 letters of x and y respectively. – QUESTION: Why does this cause a problem in the pair-HMM case?

(Fixed) Viterbi for pair-HMMs We’ll just consider this special case: A similar trick can be used when working with the forward/backward algorithms (see Durbin et al for details) QUESTION: What’s the computational complexity of DP? M P(x i, y j ) I P(x i, η) J P(η, y j ) 1 – 2  1 –      optional  c  Σ, P(η, c) = P(c, η) = Q(c) V M (i, j) = P(x i, y j ) max V I (i, j) = Q(x i ) max V J (i, j) =Q(y j ) max (1 - 2δ) V M (i - 1, j - 1) (1 - ε) V I (i - 1, j - 1) δ V M (i - 1, j) ε V I (i - 1, j) δ V M (i, j - 1) ε V J (i, j - 1)

Connection to NW with affine gaps QUESTION: How would the optimal alignment change if we divided the probability for every single alignment by ∏ i=1 … |x| Q(x i ) ∏ j = 1 … |y| Q(y j )? V M (i, j) = P(x i, y j ) max V I (i, j) = Q(x i ) max V J (i, j) =Q(y j ) max (1 - 2δ) V M (i - 1, j - 1) (1 - ε) V I (i - 1, j - 1) δ V M (i - 1, j) ε V I (i - 1, j) δ V M (i, j - 1) ε V J (i, j - 1)

Connection to NW with affine gaps Account for the extra terms “along the way.” V M (i, j) = P(x i, y j ) max V I (i, j) = max V J (i, j) = max (1 - 2δ) V M (i - 1, j - 1) (1 - ε) V I (i - 1, j - 1) δ V M (i - 1, j) ε V I (i - 1, j) δ V M (i, j - 1) ε V J (i, j - 1) Q(x i ) Q(y j )

Connection to NW with affine gaps Take logs, and ignore a couple terms. log V M (i, j) = P(x i, y j ) + max log V I (i, j) = max log V J (i, j) = max log (1 - 2δ) + log V M (i - 1, j - 1) log (1 - ε) + log V I (i - 1, j - 1) log δ + V M (i - 1, j) log ε + V I (i - 1, j) log δ + V M (i, j - 1) log ε + V J (i, j - 1) Q(x i ) Q(y j ) log

Connection to NW with affine gaps Rename! M(i, j) = S(x i, y j ) + max I(i, j) = max J(i, j) = max M(i - 1, j - 1) I(i - 1, j - 1) J(i - 1, j - 1) d + M(i - 1, j) e + I(i - 1, j) d + M(i, j - 1) e + J(i, j - 1)

Summary of pair-HMMs We briefly discussed two types of pair-HMMs that come up in the literature. We talked about pair-HMMs for sequence alignment. We derived a Viterbi inference algorithm for alignment pair- HMMs and showed its rough “equivalence” to Needleman- Wunsch with affine gap penalties.

Part II: Conditional random fields

Motivation So far, we’ve seen HMMs, a type of probabilistic model for labeling sequences. – For an input sequence x and a parse π, an HMM defines P(π, x) – In practice, when we want to perform inference, however, what we really care about is P(π | x). In the case of the Viterbi algorithm, arg max π P(π | x) = arg max π P(x, π). But when using posterior decoding, need P(π | x). Why do need to parameterize P(x, π)? If we really are interested in the conditional distribution over possible parses π for a given input x, can’t we just model of P(π | x) directly? – Intuition: P(x, π) = P(x) P(π | x)  why waste effort modeling P(x)?

Conditional random fields Definition where F : state x state x observations x index  R n “local feature mapping” w  R n “parameter vector” – Summation is taken over all possible state sequences π’ 1 … π’ |x| – a T b for two vectors a, b  R n denotes the inner product, ∑ i=1... n a i b i exp(∑ i=1 … |x| w T F(π i, π i-1, x, i)) ∑ π’ exp(∑ i=1 … |x| w T F(π’ i, π’ i-1, x, i)) P(π | x) = partition coefficient

Interpretation of weights Define “global feature mapping” Then, QUESTION: If all features are nonnegative, what is the effect of increasing some component w j ? exp(w T F(x, π)) ∑ π’ exp(w T F(x, π’)) P(π | x) = F(x, π) = ∑ i=1 … |x| w T F(π i, π i-1, x, i)

Relationship with HMMs Recall that in an HMM, log P(x, π) = log P(π 0 ) + ∑ i=1 … |x| [ log P(π i | π i-1 ) + log P(x i | π i ) ] = log a 0 (π 0 ) + ∑ i=1 … |x| [ log a(π i-1, π i ) + log e πi (x i ) ] Suppose that we list the logarithms of all the parameters of our model in a long vector w. log a 0 (1) … log a 0 (K) log a 11 … log a KK log e 1 (b 1 ) … log e K (b M ) w =  Rn Rn

Now, consider a feature mapping with the same dimensionality as w. For each component w j, define F j to be a 0/1 indicator variable indicating whether the corresponding parameter should be included in scoring x, π at position i: Then, log P(x, π) = ∑ i=1 … |x| w T F(π i, π i-1, x, i). log P(x, π) = log a 0 (π 0 ) + ∑ i=1 … |x| [ log a(π i-1, π i ) + log e πi (x i ) ] (*) log a 0 (1) … log a 0 (K) log a 11 … log a KK log e 1 (b 1 ) … log e K (b M ) w =  Rn Rn 1{i = 1 Λ π i-1 = 1} … 1{i = 1 Λ π i-1 = K} 1{π i-1 = 1 Λ π i = 1} … 1{π i-1 = K Λ π i = K} 1{x i = b 1 Λ π i = 1} … 1{x i = b M Λ π i = K} F(π i, π i-1, x, i) =  Rn Rn Relationship with HMMs

Equivalently, Therefore, an HMM can be converted to an equivalent CRF. log P(x, π) = ∑ i=1 … |x| w T F(π i, π i-1, x, i) exp(∑ i=1 … |x| w T F(π i, π i-1, x, i)) ∑ π’ exp(∑ i=1 … |x| w T F(π i, π i-1, x, i)) P(π | x) = = P(x, π) ∑ π P(x, π)

CRFs ≥ HMMs All HMMs can be converted to CRFs. But, the reverse does not hold true! Reason #1: HMM parameter vectors obey certain constraints (e.g., ∑ i a 0i = 1) but CRF parameter vectors (w) are unconstrained! Reason #2: CRFs can contain features that are NOT possible to include in HMMs.

CRFS ≥ HMMs (continued) In an HMM, our features were of the form – I.e., when scoring position i in the sequence, feature only considered the emission x i at position i. – Cannot look at other positions (e.g., x i-1, x i+1 ) since that would involve “emitting” a character more than once – double-counting of probability! CRFs don’t have this problem. – Why? Because CRFs don’t attempt to model the observations x! F(π i, π i-1, x, i) = F(π i, π i-1, x i, i)

Examples of non-local features for CRFs Casino: – Dealer looks at previous 100 positions, and determines whether at least 50 over them had 6’s. F j (LOADED, FAIR, x, i) = 1{ x i-100 … x i has > 50 6s } CpG islands: – Gene occurs near a CpG island. F j (*, EXON, x, i) = 1{ x i-1000 … x i+1000 has > 1/16 CpGs }.

3 basic questions for CRFs Evaluation: Given a sequence of observations x and a sequence of states π, compute P(π | x) Decoding: Given a sequence of observations x, compute the maximum probability sequence of states π ML = arg max π P(π | x) Learning: Given a CRF with unspecified parameters w, compute the parameters that maximize the likelihood of π given x, i.e., w ML = arg max w P(π | x, w)

Note that: We can derive the following recurrence: Notes: – Even though the features may depend on arbitrary positions in x, the sequence x is constant. The correctness of DP depends only on knowing the previous state. – Computing the partition function can be done by a similar adaptation of the forward/backward algorithms. Viterbi for CRFs exp(∑ i=1 … |x| w T F(π i, π i-1, x, i)) ∑ π’ exp(∑ i=1 … |x| w T F(π’ i, π’ i-1, x, i)) argmax π P(π | x) = argmax π = arg max π exp(∑ i=1 … |x| w T F(π i, π i-1, x, i)) = arg max π ∑ i=1 … |x| w T F(π i, π i-1, x, i) V k (i) = max j [ w T F(k, j, x, i) + V j (i-1) ]

3 basic questions for CRFs Evaluation: Given a sequence of observations x and a sequence of states π, compute P(π | x) Decoding: Given a sequence of observations x, compute the maximum probability sequence of states π ML = arg max π P(π | x) Learning: Given a CRF with unspecified parameters w, compute the parameters that maximize the likelihood of π given x, i.e., w ML = arg max w P(π | x, w)

Learning CRFs Key observation: – log P(π | x, w) is a differentiable, convex function of w f(x) f(y) convex function Any local minimum is a global minimum.

Learning CRFs (continued) Compute partial derivative of log P(π | x, w) with respect to each parameter w j, and use the gradient ascent learning rule: w Gradient points in the direction of greatest function increase

The CRF gradient It turns out (you’ll show this on your homework) that This has a very nice interpretation: – We increase parameters for which the correct feature values are greater than the predicted feature values. – We decrease parameters for which the correct feature values are less than the predicted feature values This moves probability mass from incorrect parses to correct parses. (  /  w j ) log P(π | x, w) = F j (x, π) – E π’  P(π’ | x, w) [ F j (x, π’) ] correct value for jth feature expected value for jth feature (given the current parameters)

Summary of CRFs We described a probabilistic model known as a CRF. We showed how to convert any HMM into an equivalent CRF. We mentioned briefly that inference in CRFs is very similar to inference in HMMs. We described a gradient ascent approach to training CRFs when true parses are available.

Download ppt "Pair-HMMs and CRFs Chuong B. Do CS262, Winter 2009 Lecture #8."

Similar presentations