
1 Conditional Random Fields

2 Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging: The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.

3 Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. Another example, partial parsing (aka chunking): The/B-NP cat/I-NP sat/B-VP on/B-PP the/B-NP mat/I-NP

4 Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. Another example, relation extraction: The/B-Arg cat/I-Arg sat/B-Rel on/I-Rel the/B-Arg mat/I-Arg
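All three tasks share the same shape: a token sequence paired with an equally long label sequence. A minimal sketch of that representation in Python (the data layout is just an illustration, not something prescribed by the slides):

```python
# One label per token; the tags come from the three example slides above.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

pos_tags   = ["DT",    "NN",    "VBD",   "IN",    "DT",    "NN"]     # POS tagging
chunk_tags = ["B-NP",  "I-NP",  "B-VP",  "B-PP",  "B-NP",  "I-NP"]   # chunking
rel_tags   = ["B-Arg", "I-Arg", "B-Rel", "I-Rel", "B-Arg", "I-Arg"]  # relation extraction

for labels in (pos_tags, chunk_tags, rel_tags):
    assert len(labels) == len(tokens)   # sequence labeling: exactly one label per token
```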

5 The CRF Equation A CRF model consists of F = (f_1, ..., f_J), a vector of "feature functions", and θ = (θ_1, ..., θ_J), a vector of weights, one for each feature function. Let O = (o_1, ..., o_N) be an observed sentence and let A = (a_1, ..., a_N) be the latent variables. Then P(A | O) = exp(Σ_j θ_j f_j(A, O)) / Σ_A' exp(Σ_j θ_j f_j(A', O)). This is the same as the Maximum Entropy equation!

6 CRF Equation, standard format Note that the denominator depends on O, but not on any particular labeling y (it marginalizes over y). Typically, we write P(y | O) = (1/Z(O)) exp(Σ_j θ_j f_j(y, O)), where Z(O) = Σ_y' exp(Σ_j θ_j f_j(y', O)).
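To make the equation concrete, here is a brute-force sketch (illustrative only, not from the deck): it scores a labeling with exp(Σ_j θ_j f_j(y, O)) and normalizes by Z(O) computed over every possible label sequence. The feature functions and weights below are made up for the example; later slides explain why this enumeration is only feasible for toy inputs.

```python
import math
from itertools import product

def crf_probability(y, obs, feature_fns, theta, label_set):
    """P(y | obs) for a toy CRF, normalized by brute force over all label sequences."""
    def score(labels):
        # Unnormalized log-score: sum_j theta_j * f_j(labels, obs)
        return sum(t * f(labels, obs) for t, f in zip(theta, feature_fns))

    numerator = math.exp(score(y))
    # Z(obs): sum over all K^N label sequences -- only feasible for tiny examples.
    Z = sum(math.exp(score(cand)) for cand in product(label_set, repeat=len(obs)))
    return numerator / Z

# Tiny illustration with made-up feature functions and weights (hypothetical):
obs = ["the", "cat"]
feature_fns = [
    lambda y, x: sum(1 for i, w in enumerate(x) if w == "the" and y[i] == "DT"),
    lambda y, x: sum(1 for i, w in enumerate(x) if w == "cat" and y[i] == "NN"),
]
theta = [2.0, 2.0]
print(crf_probability(("DT", "NN"), obs, feature_fns, theta, ["DT", "NN", "VBD"]))
```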

7 Making Structured Predictions

8 Aside: Structured prediction vs. Text Classification Recall maximum entropy for text classification: P(c | d) = (1/Z(d)) exp(Σ_j λ_j f_j(c, d)), where c is a single class label for document d. CRFs for sequence labeling: P(y | O) = (1/Z(O)) exp(Σ_j θ_j f_j(y, O)), where y is an entire label sequence. What's the difference?

9 Aside: Structured prediction vs. Text Classification Two (related) differences, both for the sake of efficiency: 1) Feature functions in CRFs are restricted to graph parts (described later). 2) We can't do brute force to compute the argmax; instead, we do Viterbi.

10 Finding the Best Sequence The best sequence is y* = argmax_y P(y | O). Recall from the HMM discussion: if there are K possible states for each y_i variable, and N total y_i variables, then there are K^N possible settings for y. So brute force can't find the best sequence. Instead, we resort to a Viterbi-like dynamic program.

11 Viterbi Algorithm [trellis diagram over observations o_1 ... o_T and states A_1 ... A_{t-1}, A_t = j] The state sequence which maximizes the score of seeing the observations to time t-1, landing in state j at time t, and seeing the observation at time t.

12 Viterbi Algorithm [trellis diagram over positions x_1 ... x_{t-1}, x_t, x_{t+1}, ... x_T] Compute the most likely state sequence by working backwards.

13 Viterbi Algorithm Recursive Computation [trellis diagram over observations o_1 ... o_T and states A_1 ... A_{t-1}, A_t = j, A_{t+1}, with the recursive update for the best score ending in state j at time t] ??!

14 Feature functions and Graph parts To make efficient computation (dynamic programs) possible, we restrict the feature functions to: Graph parts (or just parts): A feature function that counts how often a particular configuration occurs for a clique in the CRF graph. Clique: a set of completely connected nodes in a graph. That is, each node in the clique has an edge connecting it to every other node in the clique.

15 Clique Example The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes. [diagram: linear-chain CRF with label nodes y_1 ... y_6, each connected to its observation x_1 ... x_6 and to its neighboring labels]

16 Clique Example The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes. [same linear-chain CRF diagram, with the individual-node cliques highlighted]

17 Clique Example The cliques in a linear chain CRF are the set of individual nodes, and the set of pairs of consecutive nodes. [same linear-chain CRF diagram, with the pair-of-consecutive-node cliques highlighted]

18 Clique Example For non-linear-chain CRFs (something we won't normally consider in this class), you can get larger cliques: [diagram: a CRF over y_1 ... y_6 and x_1 ... x_6 with additional edges between label nodes, forming larger cliques]

19 Graph part as Feature Function Example Graph parts are feature functions p(y,x) that count how many cliques have a particular configuration. For example, p(y,x) = count of [y_i = Noun]. Here, y_2 and y_6 are both Nouns, so p(y,x) = 2. [diagram: linear-chain CRF with labels y_1=D, y_2=N, y_3=V, y_4=D, y_5=A, y_6=N over observations x_1 ... x_6]

20 Graph part as Feature Function Example For a pair-of-nodes example, p(y,x) = count of [y_i = Noun, y_{i+1} = Verb]. Here, y_2 is a Noun and y_3 is a Verb, so p(y,x) = 1. [same diagram: labels y_1=D, y_2=N, y_3=V, y_4=D, y_5=A, y_6=N over x_1 ... x_6]

21 Features can depend on the whole observation In a CRF, each feature function can depend on x, in addition to a clique in y. Normally, we draw a CRF like this: [side-by-side diagrams labeled HMM and CRF, each with label nodes y_1 ... y_6 and observation nodes x_1 ... x_6, each x_i connected only to its own y_i]

22 Features can depend on the whole observation In a CRF, each feature function can depend on x, in addition to a clique in y. But really, it's more like this: [side-by-side diagrams labeled HMM and CRF; in the CRF, every label clique can see the entire observation x_1 ... x_6] This would cause problems for a generative model, but in a conditional model, x is always a fixed constant, so we can still run the relevant algorithms, like Viterbi, efficiently.

23 Graph part as Feature Function Example An example part including x: p(y,x) = count of [y_i = A or D, y_{i+1} = N, x_2 = cat]. Here, y_1 is a D and y_2 is an N, plus y_5 is an A and y_6 is an N, plus x_2 = cat, so p(y,x) = 2. Notice that the clique y_5-y_6 is allowed to depend on x_2. [diagram: y_1=D over "The", y_2=N over "cat", y_3=V over "chased", y_4=D over "the", y_5=A over "tiny", y_6=N over "fly"]

24 Graph part as Feature Function Example A more usual example including x: p(y,x) = count of [y_i = A or D, y_{i+1} = N, x_{i+1} = cat]. Here, y_1 is a D and y_2 is an N, plus x_2 = cat, so p(y,x) = 1. [same diagram: y_1=D over "The", y_2=N over "cat", y_3=V over "chased", y_4=D over "the", y_5=A over "tiny", y_6=N over "fly"]
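A small sketch of the parts from slides 19, 20, 23, and 24 written as counting functions over a labeled sentence (the helper names and calling convention are mine, not from the deck):

```python
def count_node(y, x, label):
    """Individual-node part: count of positions i with y_i == label."""
    return sum(1 for yi in y if yi == label)

def count_pair(y, x, label1, label2):
    """Pair-of-nodes part: count of i with y_i == label1 and y_{i+1} == label2."""
    return sum(1 for i in range(len(y) - 1) if y[i] == label1 and y[i + 1] == label2)

def count_pair_with_word(y, x, labels1, label2, word):
    """Pair part that also inspects the observation: y_i in labels1, y_{i+1} == label2, x_{i+1} == word."""
    return sum(1 for i in range(len(y) - 1)
               if y[i] in labels1 and y[i + 1] == label2 and x[i + 1] == word)

x = ["The", "cat", "chased", "the", "tiny", "fly"]
y = ["D",   "N",   "V",      "D",   "A",    "N"]

print(count_node(y, x, "N"))                               # 2, as on slide 19
print(count_pair(y, x, "N", "V"))                          # 1, as on slide 20
print(count_pair_with_word(y, x, {"A", "D"}, "N", "cat"))  # 1, as on slide 24
```

The printed counts match the values worked out on the slides.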

25 The CRF Equation, with Parts A CRF model consists of P = (p_1, ..., p_J), a vector of parts, and θ = (θ_1, ..., θ_J), a vector of weights, one for each part. Let O = (o_1, ..., o_N) be an observed sentence and let A = (a_1, ..., a_N) be the latent variables. Then P(A | O) = (1/Z(O)) exp(Σ_j θ_j p_j(A, O)).

26 Viterbi Algorithm – 2 nd Try Recursive Computation [trellis diagram over observations o_1 ... o_T and states A_1 ... A_{t-1}, A_t = j, A_{t+1}, with the recursion restated in terms of the weighted graph parts]
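The recursion image did not survive in the transcript, so the following is a hedged sketch of the standard linear-chain version: because every part touches at most one node or one pair of consecutive nodes, the best score of a prefix ending in state j at position t depends only on the best prefix scores at position t-1. The function names and score interfaces are assumptions for illustration.

```python
def viterbi(x, labels, node_score, pair_score):
    """Best label sequence for a linear-chain CRF.

    node_score(t, j, x): summed theta_k * p_k for node parts at position t with label j.
    pair_score(i, j, x, t): summed theta_k * p_k for the pair (y_{t-1}=i, y_t=j).
    Scores are in log space, so they add instead of multiply.
    """
    n = len(x)
    delta = [{j: node_score(0, j, x) for j in labels}]   # best prefix score ending in j
    back = [{}]
    for t in range(1, n):
        delta.append({})
        back.append({})
        for j in labels:
            best_i = max(labels, key=lambda i: delta[t - 1][i] + pair_score(i, j, x, t))
            delta[t][j] = delta[t - 1][best_i] + pair_score(best_i, j, x, t) + node_score(t, j, x)
            back[t][j] = best_i
    # Trace back from the best final state.
    last = max(labels, key=lambda j: delta[n - 1][j])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

This runs in O(N K^2) time instead of the K^N of brute force.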

27 Supervised Parameter Estimation

28 Conditional Training Given a set of observations o and the correct labels y for each, determine the best θ: θ* = argmax_θ Σ_i log P(y^(i) | o^(i); θ). Because the CRF equation is just a special form of the maximum entropy equation, we can train it exactly the same way: determine the gradient, step in the direction of the gradient, and repeat until convergence.

29 Recall: Training a ME model Training is an optimization problem: find the value for λ that maximizes the conditional log-likelihood of the training data: CLL(λ) = Σ_i log P(y^(i) | x^(i); λ).

30 Recall: Training a ME model Optimization is normally performed using some form of gradient ascent: 0) Initialize λ_0 to 0. 1) Compute the gradient: ∇CLL. 2) Take a step in the direction of the gradient: λ_(i+1) = λ_i + α ∇CLL. 3) Repeat until CLL doesn't improve: stop when |CLL(λ_(i+1)) - CLL(λ_i)| < ε.
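A minimal sketch of that loop (the cll and grad_cll arguments are placeholders for whatever computes the conditional log-likelihood and its gradient; the step size and stopping test follow the slide):

```python
import numpy as np

def train(cll, grad_cll, dim, alpha=0.1, eps=1e-6, max_iters=1000):
    """Generic gradient-based maximization of the conditional log-likelihood.

    cll(lam)      -> scalar conditional log-likelihood of the training data
    grad_cll(lam) -> gradient vector with the same shape as lam
    """
    lam = np.zeros(dim)                      # 0) initialize lambda_0 to 0
    prev = cll(lam)
    for _ in range(max_iters):
        lam = lam + alpha * grad_cll(lam)    # 1)-2) step in the direction of the gradient
        cur = cll(lam)
        if abs(cur - prev) < eps:            # 3) stop when CLL stops improving
            break
        prev = cur
    return lam
```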

31 Recall: Training a ME model Computing the gradient: ∂CLL/∂λ_j = Σ_i [ f_j(y^(i), x^(i)) - Σ_y P(y | x^(i); λ) f_j(y, x^(i)) ], i.e., the observed count of feature j minus its expected count under the current model.

32 Recall: Training a ME model Computing the gradient: ∂CLL/∂λ_j = Σ_i [ f_j(y^(i), x^(i)) - Σ_y P(y | x^(i); λ) f_j(y, x^(i)) ]. The expected-count term, Σ_y P(y | x^(i); λ) f_j(y, x^(i)), is the hard part for CRFs, because the sum ranges over all K^N label sequences.

33 Training a CRF: Expected feature counts … (sorry, ran out of time)
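The deck stops here, so the following is only a rough illustration of what "expected feature counts" means: for one sentence, the model's expectation of a feature is Σ_y P(y | x) p(y, x). The brute-force enumeration below is exponential in sentence length and is shown only to pin down the quantity; the practical computation uses a forward-backward style dynamic program over the graph parts, which the slides do not cover.

```python
import math
from itertools import product

def expected_feature_count(x, feature_fn, score_fn, label_set):
    """E_{y ~ P(.|x)}[ feature_fn(y, x) ], by brute-force enumeration (tiny inputs only).

    score_fn(y, x) is the unnormalized log-score sum_j theta_j p_j(y, x).
    """
    seqs = list(product(label_set, repeat=len(x)))
    weights = [math.exp(score_fn(y, x)) for y in seqs]   # unnormalized probabilities
    Z = sum(weights)
    return sum(w * feature_fn(y, x) for w, y in zip(weights, seqs)) / Z
```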

34 CRFs vs. HMMs

35 Generative (Joint Probability) Models HMMs are generative models: that is, they can compute the joint probability P(sentence, hidden-states). From a generative model, one can compute the conditional models P(sentence | hidden-states) and P(hidden-states | sentence), and the marginal models P(sentence) and P(hidden-states). For sequence labeling, we want P(hidden-states | sentence).

36 Discriminative (Conditional) Models Most often, people are interested in the conditional probability P(hidden-states | sentence); for example, this is the distribution needed for sequence labeling. Discriminative (also called conditional) models directly represent the conditional distribution P(hidden-states | sentence). These models cannot tell you the joint distribution, marginals, or other conditionals, but they're quite good at this particular conditional distribution.

37 Discriminative vs. Generative
Marginal / language model P(sentence): HMM (generative): forward or backward algorithm, linear in the length of the sentence. CRF (discriminative): can't do it.
Find optimal label sequence: HMM: Viterbi, linear in the length of the sentence. CRF: Viterbi, linear in the length of the sentence.
Supervised parameter estimation: HMM: Bayesian learning, easy and fast. CRF: convex optimization, can be quite slow.
Unsupervised parameter estimation: HMM: Baum-Welch (non-convex optimization), slow but doable. CRF: very difficult, and requires making extra assumptions.
Feature functions: HMM: parents and children in the graph (restrictive!). CRF: arbitrary functions of a latent state and any portion of the observed nodes.

38 CRFs vs. HMMs, a closer look It's possible to convert an HMM into a CRF:
Set p_prior,state(y,x) = count[y_1 = state] and θ_prior,state = log P_HMM(y_1 = state) = log π_state.
Set p_trans,state1,state2(y,x) = count[y_i = state1, y_{i+1} = state2] and θ_trans,state1,state2 = log P_HMM(y_{i+1} = state2 | y_i = state1) = log A_state1,state2.
Set p_obs,state,word(y,x) = count[y_i = state, x_i = word] and θ_obs,state,word = log P_HMM(x_i = word | y_i = state) = log B_state,word.
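A sketch of that conversion in code, assuming the HMM is given as nested dictionaries pi (initial), A (transition), and B (emission); the weight-naming scheme is mine:

```python
import math

def hmm_to_crf_weights(pi, A, B):
    """Map HMM probabilities to CRF feature weights, one per prior/transition/emission feature.

    pi[s]     = P(y_1 = s)
    A[s1][s2] = P(y_{i+1} = s2 | y_i = s1)
    B[s][w]   = P(x_i = w | y_i = s)
    """
    theta = {}
    for s, p in pi.items():
        theta[("prior", s)] = math.log(p)
    for s1, row in A.items():
        for s2, p in row.items():
            theta[("trans", s1, s2)] = math.log(p)
    for s, row in B.items():
        for w, p in row.items():
            theta[("obs", s, w)] = math.log(p)
    return theta   # every weight is a log-probability, hence <= 0 (see the next slide)
```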

39 CRF vs. HMM, a closer look If we convert an HMM to a CRF, all of the CRF parameters θ will be logs of probabilities, so they will all be between -∞ and 0. Notice: CRF parameters in general can be anywhere between -∞ and +∞. So, how do HMMs and CRFs compare in terms of bias and variance (as sequence labelers)? HMMs have more bias; CRFs have more variance.

40 Comparing feature functions The biggest advantage of CRFs over HMMs is that they can handle overlapping features. For example, for POS tagging, using words as features (like x_i = "the" or x_j = "jogging") is quite useful. However, it's often also useful to use "orthographic" features, like "the word ends in -ing" or "the word starts with a capital letter." These features overlap with the word features: "jogging" triggers both the word feature and the ends-in-"ing" feature. Generative models have to include parameters in the model for predicting when features will overlap. Discriminative models don't: they can simply use the features.

41 CRF Example A CRF POS Tagger for English

42 Vocabulary We need to determine the set of possible word types V. Let V = {all types in 1 million tokens of Wall Street Journal text, which we'll use for training} ∪ {UNKNOWN} (for word types we haven't seen).
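A sketch of building V with the UNKNOWN entry (the training-file path in the comment is hypothetical, not from the slides):

```python
def build_vocab(tokens):
    """V = set of word types seen in training, plus UNKNOWN for everything else."""
    vocab = set(tokens)
    vocab.add("UNKNOWN")
    return vocab

def lookup(word, vocab):
    """Map unseen word types to UNKNOWN at test time."""
    return word if word in vocab else "UNKNOWN"

# e.g. tokens = open("wsj_train.txt").read().split()   # ~1M WSJ tokens (hypothetical path)
```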

43 L = Label Set Standard Penn Treebank tagset:
1. CC: Coordinating conjunction
2. CD: Cardinal number
3. DT: Determiner
4. EX: Existential there
5. FW: Foreign word
6. IN: Preposition or subordinating conjunction
7. JJ: Adjective
8. JJR: Adjective, comparative
9. JJS: Adjective, superlative
10. LS: List item marker
11. MD: Modal
12. NN: Noun, singular or mass
13. NNS: Noun, plural
14. NNP: Proper noun, singular
15. NNPS: Proper noun, plural
16. PDT: Predeterminer
17. POS: Possessive ending

44 L = Label Set
18. PRP: Personal pronoun
19. PRP$: Possessive pronoun
20. RB: Adverb
21. RBR: Adverb, comparative
22. RBS: Adverb, superlative
23. RP: Particle
24. SYM: Symbol
25. TO: to
26. UH: Interjection
27. VB: Verb, base form
28. VBD: Verb, past tense
29. VBG: Verb, gerund or present participle
30. VBN: Verb, past participle
31. VBP: Verb, non-3rd person singular present
32. VBZ: Verb, 3rd person singular present
33. WDT: Wh-determiner
34. WP: Wh-pronoun
35. WP$: Possessive wh-pronoun
36. WRB: Wh-adverb

45 CRF Features Feature types and descriptions:
Prior: for each label k, y_i = k.
Transition: for each pair of labels k, k', y_i = k and y_{i+1} = k'.
Word: for each k, w: y_i = k and x_i = w; y_i = k and x_{i-1} = w; y_i = k and x_{i+1} = w. For each k, w, w': y_i = k and x_i = w and x_{i-1} = w'; y_i = k and x_i = w and x_{i+1} = w'.
Orthography, suffix: for each s in {"ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity", …} and each k: y_i = k and x_i ends with s.
Orthography, punctuation: for each k: y_i = k and x_i is capitalized; y_i = k and x_i is hyphenated; y_i = k and x_i contains a period; y_i = k and x_i is ALL CAPS; y_i = k and x_i contains a digit (0-9); …
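A sketch of instantiating the lexical and orthographic templates from this table at one position i (the feature-string naming scheme is mine, not from the deck):

```python
def token_features(x, i):
    """Instantiate the word and orthography feature templates at position i of sentence x."""
    w = x[i]
    feats = ["word=" + w.lower()]
    if i > 0:
        feats.append("prev_word=" + x[i - 1].lower())
    if i + 1 < len(x):
        feats.append("next_word=" + x[i + 1].lower())
    # Orthography: suffix features
    for s in ("ing", "ed", "ogy", "s", "ly", "ion", "tion", "ity"):
        if w.lower().endswith(s):
            feats.append("suffix=" + s)
    # Orthography: punctuation and shape features
    if w[0].isupper():
        feats.append("capitalized")
    if "-" in w:
        feats.append("hyphenated")
    if "." in w:
        feats.append("contains_period")
    if w.isupper():
        feats.append("all_caps")
    if any(c.isdigit() for c in w):
        feats.append("contains_digit")
    return feats

print(token_features(["The", "cat", "was", "jogging"], 3))
# ['word=jogging', 'prev_word=was', 'suffix=ing']
```

Each of these strings would then be conjoined with every candidate label k (the "for each k" in the table) to give the actual CRF features of the form y_i = k and x_i has this property.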

