
Conditional Random Fields


1 Conditional Random Fields
William W. Cohen CALD

2 Announcements Upcoming assignments:
Today: Sha & Pereira; Lafferty et al.
Mon 2/23: Klein & Manning; Toutanova et al.
Wed 2/25: no writeup due
Mon 3/1: no writeup due
Wed 3/3: project proposal due (personnel page)
Spring break week, no class

3 Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words:
identity of the word
ends in "-ski"
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in a hyperlink anchor
(diagram: states S at positions t-1, t, t+1 over observations O at positions t-1, t, t+1; example features attached to one observation: is "Wisniewski", part of noun phrase, ends in "-ski")

4 Motivation for CMMs
(same feature list and state/observation diagram as the previous slide)
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state. A sketch of this factorization appears below.
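For concreteness, this idea leads to the standard locally normalized CMM/MEMM factorization (notation mine, chosen to match the maxent formulas later in the deck):

```latex
% MEMM / CMM: a maxent model per position, normalized locally
P(y_{1:n} \mid x_{1:n}) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}, x_t),
\qquad
P(y_t \mid y_{t-1}, x_t)
  = \frac{\exp\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x_t)\big)}
         {\sum_{y'} \exp\big(\sum_k \lambda_k f_k(y', y_{t-1}, x_t)\big)}
```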

5 Implications of the model
Does this do what we want? Q: does Y[i-1] depend on X[i+1]? Recall: "a node is conditionally independent of its non-descendants given its parents."

6 Label Bias Problem Consider this MEMM:
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1 and hence P(2 | 1 and x) = 1 for all x.
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
Per-state normalization does not allow the required expectation.

7 Label Bias Problem
Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rib) = 1, Pr(0453|rob) = 1
With per-state normalization, the MEMM still gives:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1

8 How important is label bias?
Could be avoided in this case by changing the structure:
Our models are always wrong; is this "wrongness" a problem?
See Klein & Manning's paper for next week…

9 Another view of label bias [Sha & Pereira]
So what’s the alternative?

10 Review of maxent

11 Review of maxent/MEMM/CMMs

12 Details on CMMs

13 From CMMs to CRFs
Recall why we're unhappy: we don't want local normalization.
New model:
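A compact way to state the change (my notation, consistent with the linear-chain formulas later in the deck): the CMM/MEMM normalizes at every position, while the CRF normalizes once, globally, over entire label sequences.

```latex
% Locally normalized (MEMM): one normalizer Z_t per position
P_{\text{MEMM}}(y \mid x) = \prod_{t} \frac{\exp\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\big)}{Z_t(y_{t-1}, x)}

% Globally normalized (CRF): a single normalizer over all label sequences
P_{\text{CRF}}(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t}\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t}\sum_k \lambda_k f_k(y'_t, y'_{t-1}, x, t)\Big)
```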

14 What’s the new model look like?
What's independent? (diagram: labels y1, y2, y3 with per-position observations x1, x2, x3)

15 What’s the new model look like?
What's independent now? (diagram: labels y1, y2, y3 with a single observation x)

16 Hammersley-Clifford
For positive distributions P(x1,…,xn):
Pr(xi | x1,…,xi-1, xi+1,…,xn) = Pr(xi | Neighbors(xi))
Pr(A | B, S) = Pr(A | S), where A and B are sets of nodes and S is a set that separates A and B
P can be written as a normalized product of "clique potentials"
So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution")
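As a reminder, "normalized product of clique potentials" means the following factorization (standard notation, not from the slides):

```latex
% Hammersley-Clifford: P factors over the cliques c of the graph G
P(x_1,\dots,x_n) = \frac{1}{Z} \prod_{c \in \text{cliques}(G)} \phi_c(x_c),
\qquad
Z = \sum_{x_1,\dots,x_n} \prod_{c \in \text{cliques}(G)} \phi_c(x_c)
```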

17 Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences

18 Example of CRFs

19 Graphical comparison among HMMs, MEMMs and CRFs
(diagrams: HMM, MEMM, CRF)

20 Lafferty et al. notation
If the graph G = (V, E) of Y is a tree, then by the fundamental theorem of random fields the conditional distribution over the label sequence Y = y, given X = x, is as shown below, where:
x is a data sequence
y is a label sequence
v is a vertex from the vertex set V = set of label random variables
e is an edge from the edge set E over V
fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
k indexes the features
λk and μk are the parameters to be estimated
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v
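The formula itself appeared on the slide as an image; reconstructed from Lafferty et al., it is:

```latex
p_\theta(y \mid x) \;\propto\;
\exp\Big(
  \sum_{e \in E,\, k} \lambda_k\, f_k\big(e,\, y|_e,\, x\big)
  + \sum_{v \in V,\, k} \mu_k\, g_k\big(v,\, y|_v,\, x\big)
\Big)
```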

21 Conditional Distribution (cont’d)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
Z(x) is a normalization over the data sequence x
Learning: Lafferty et al.'s IIS-based method is rather inefficient; gradient-based methods are faster
Trickiest bit is computing the normalization, which is a sum over exponentially many label vectors y (written out below)
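Written out for the linear-chain case (same notation as the globally normalized formula above), the normalizer is a sum over |Y|^n label sequences; the point of the next slides is that it can nonetheless be computed by dynamic programming in time polynomial in the sequence length and the label set size.

```latex
Z(x) = \sum_{y \in \mathcal{Y}^n} \exp\Big(\sum_{t}\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\Big)
```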

22 CRF learning – from Sha & Pereira

23 CRF learning – from Sha & Pereira

24 CRF learning – from Sha & Pereira
Something like forward-backward.
Idea: define a matrix of y,y' "affinities" at stage i:
Mi[y,y'] = "unnormalized probability" of a transition from y to y' at stage i
Mi * Mi+1 = "unnormalized probability" of any path through stages i and i+1
(a sketch of this matrix view follows)
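A minimal Python sketch of the matrix formulation, under the Sha & Pereira convention that the label set is augmented with start/stop labels and Z(x) is the (start, stop) entry of the product of the Mi matrices. The matrices here are placeholders standing in for exp of the weighted feature sums; this is an illustration, not the authors' code.

```python
import numpy as np

def partition_function(M_list, start, stop):
    """Z(x) as the (start, stop) entry of M_1 M_2 ... M_{n+1}.

    M_list: list of (L x L) nonnegative arrays; M_list[i][y_prev, y] is the
    unnormalized score ("affinity") of moving from label y_prev to label y
    at stage i+1.  A real implementation would work in log space to avoid
    overflow, and would multiply a vector through instead of forming the
    full matrix product.
    """
    prod = np.eye(M_list[0].shape[0])
    for M in M_list:
        prod = prod @ M          # accumulate M_1 M_2 ... M_{n+1}
    return prod[start, stop]     # Z(x)

# Toy usage: 4 stages, 5 labels; labels 0 and 4 play the roles of start/stop.
rng = np.random.default_rng(0)
Ms = [np.exp(rng.normal(size=(5, 5))) for _ in range(4)]
print(partition_function(Ms, start=0, stop=4))
```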

25 y1 y2 y3 x y1 y2 y3

26 Forward backward ideas
(trellis diagram: three positions, each with label values "name" and "nonName"; edges between adjacent positions annotated with weights b, c, d, f, g, h)
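To connect the trellis picture to learning: the gradient of the CRF log-likelihood needs expected feature counts, which come from edge marginals computed by a forward-backward pass over the same Mi matrices. A sketch (same placeholder matrices as above; no start/stop bookkeeping or log-space scaling):

```python
import numpy as np

def edge_marginals(M_list):
    """P(y_i = y', y_{i+1} = y | x) for each stage, via forward-backward."""
    n = len(M_list)
    L = M_list[0].shape[0]
    alpha = [np.ones(L)]                      # forward weights
    for M in M_list:
        alpha.append(alpha[-1] @ M)           # alpha_i = alpha_{i-1} M_i
    beta = [None] * n + [np.ones(L)]          # backward weights
    for i in range(n - 1, -1, -1):
        beta[i] = M_list[i] @ beta[i + 1]     # beta_i = M_{i+1} beta_{i+1}
    Z = alpha[n].sum()
    # marginal over the edge at stage i: alpha_i[y'] * M_i[y', y] * beta_{i+1}[y] / Z
    return [np.outer(alpha[i], beta[i + 1]) * M_list[i] / Z for i in range(n)]
```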

27 CRF learning – from Sha & Pereira

28 CRF learning – from Sha & Pereira

29 Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

30 Sha & Pereira results: training times in minutes, 375k training examples

31 POS tagging experiments in Lafferty et al.
Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
Each word in a given input sentence must be labeled with one of 45 syntactic tags
Add a small set of orthographic features (see the sketch below): whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
oov = out-of-vocabulary (not observed in the training set)
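A small sketch of the kind of orthographic feature extractor described above; the feature names and exact definitions are mine, not from the paper.

```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    """Binary orthographic features for one token, per the list on the slide."""
    feats = {
        "starts_with_digit": word[:1].isdigit(),
        "starts_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suffix in SUFFIXES:
        feats["suffix_-" + suffix] = word.lower().endswith(suffix)
    return {name: 1 for name, fired in feats.items() if fired}

# e.g. orthographic_features("Wisniewski") -> {"starts_with_upper": 1}
```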

32 POS tagging vs MXPost

