
Conditional Random Fields


1 Conditional Random Fields
William W. Cohen CALD

2 Announcements Upcoming assignments:
Today: Sha & Pereira; Lafferty et al.
Mon 2/23: Klein & Manning; Toutanova et al.
Wed 2/25: no writeup due
Mon 3/1: no writeup due
Wed 3/3: project proposal due (personnel page)
Spring break week, no class

3 Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words:
identity of the word
ends in "-ski"
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in a hyperlink anchor
(diagram: states S at positions t-1, t, t+1 over observations O at positions t-1, t, t+1; example features attached to one observation: is "Wisniewski", part of noun phrase, ends in "-ski")

4 Motivation for CMMs
(same feature list and state/observation diagram as the previous slide)
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state. A sketch of this factorization appears below.
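For concreteness, this idea leads to the standard locally normalized CMM/MEMM factorization (notation mine, chosen to match the maxent formulas later in the deck):

```latex
% MEMM / CMM: a maxent model per position, normalized locally
P(y_{1:n} \mid x_{1:n}) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}, x_t),
\qquad
P(y_t \mid y_{t-1}, x_t)
  = \frac{\exp\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x_t)\big)}
         {\sum_{y'} \exp\big(\sum_k \lambda_k f_k(y', y_{t-1}, x_t)\big)}
```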

5 Implications of the model
Does this do what we want? Q: does Y[i-1] depend on X[i+1]? Recall: "a node is conditionally independent of its non-descendants given its parents."

6 Label Bias Problem Consider this MEMM:
P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1 and hence P(2 | 1 and x) = 1 for all x.
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
Per-state normalization does not allow the required expectation.

7 Label Bias Problem
Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rib) = 1, Pr(0453|rob) = 1
With per-state normalization, the MEMM still gives:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1

8 How important is label bias?
Could be avoided in this case by changing the structure:
Our models are always wrong; is this "wrongness" a problem?
See Klein & Manning's paper for next week…

9 Another view of label bias [Sha & Pereira]
So what’s the alternative?

10 Review of maxent

11 Review of maxent/MEMM/CMMs

12 Details on CMMs

13 From CMMs to CRFs
Recall why we're unhappy: we don't want local normalization.
New model:
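A compact way to state the change (my notation, consistent with the linear-chain formulas later in the deck): the CMM/MEMM normalizes at every position, while the CRF normalizes once, globally, over entire label sequences.

```latex
% Locally normalized (MEMM): one normalizer Z_t per position
P_{\text{MEMM}}(y \mid x) = \prod_{t} \frac{\exp\big(\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\big)}{Z_t(y_{t-1}, x)}

% Globally normalized (CRF): a single normalizer over all label sequences
P_{\text{CRF}}(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t}\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{t}\sum_k \lambda_k f_k(y'_t, y'_{t-1}, x, t)\Big)
```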

14 What’s the new model look like?
What's independent? (diagram: labels y1, y2, y3 with per-position observations x1, x2, x3)

15 What’s the new model look like?
What's independent now? (diagram: labels y1, y2, y3 with a single observation x)

16 Hammersley-Clifford
For positive distributions P(x1,…,xn):
Pr(xi | x1,…,xi-1, xi+1,…,xn) = Pr(xi | Neighbors(xi))
Pr(A | B, S) = Pr(A | S), where A and B are sets of nodes and S is a set that separates A and B
P can be written as a normalized product of "clique potentials"
So this is very general: any Markov distribution can be written in this form (modulo nits like "positive distribution")
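As a reminder, "normalized product of clique potentials" means the following factorization (standard notation, not from the slides):

```latex
% Hammersley-Clifford: P factors over the cliques c of the graph G
P(x_1,\dots,x_n) = \frac{1}{Z} \prod_{c \in \text{cliques}(G)} \phi_c(x_c),
\qquad
Z = \sum_{x_1,\dots,x_n} \prod_{c \in \text{cliques}(G)} \phi_c(x_c)
```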

17 Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences

18 Example of CRFs

19 Graphical comparison among HMMs, MEMMs and CRFs
(diagrams: HMM, MEMM, CRF)

20 Lafferty et al. notation
If the graph G = (V, E) of Y is a tree, then by the fundamental theorem of random fields the conditional distribution over the label sequence Y = y, given X = x, is as shown below, where:
x is a data sequence
y is a label sequence
v is a vertex from the vertex set V = set of label random variables
e is an edge from the edge set E over V
fk and gk are given and fixed; gk is a Boolean vertex feature, fk is a Boolean edge feature
k indexes the features
λk and μk are the parameters to be estimated
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v
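The formula itself appeared on the slide as an image; reconstructed from Lafferty et al., it is:

```latex
p_\theta(y \mid x) \;\propto\;
\exp\Big(
  \sum_{e \in E,\, k} \lambda_k\, f_k\big(e,\, y|_e,\, x\big)
  + \sum_{v \in V,\, k} \mu_k\, g_k\big(v,\, y|_v,\, x\big)
\Big)
```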

21 Conditional Distribution (cont’d)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
Z(x) is a normalization over the data sequence x
Learning: Lafferty et al.'s IIS-based method is rather inefficient; gradient-based methods are faster
Trickiest bit is computing the normalization, which is a sum over exponentially many label vectors y (written out below)
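Written out for the linear-chain case (same notation as the globally normalized formula above), the normalizer is a sum over |Y|^n label sequences; the point of the next slides is that it can nonetheless be computed by dynamic programming in time polynomial in the sequence length and the label set size.

```latex
Z(x) = \sum_{y \in \mathcal{Y}^n} \exp\Big(\sum_{t}\sum_k \lambda_k f_k(y_t, y_{t-1}, x, t)\Big)
```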

22 CRF learning – from Sha & Pereira

23 CRF learning – from Sha & Pereira

24 CRF learning – from Sha & Pereira
Something like forward-backward.
Idea: define a matrix of y,y' "affinities" at stage i:
Mi[y,y'] = "unnormalized probability" of a transition from y to y' at stage i
Mi * Mi+1 = "unnormalized probability" of any path through stages i and i+1
(a sketch of this matrix view follows)
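A minimal Python sketch of the matrix formulation, under the Sha & Pereira convention that the label set is augmented with start/stop labels and Z(x) is the (start, stop) entry of the product of the Mi matrices. The matrices here are placeholders standing in for exp of the weighted feature sums; this is an illustration, not the authors' code.

```python
import numpy as np

def partition_function(M_list, start, stop):
    """Z(x) as the (start, stop) entry of M_1 M_2 ... M_{n+1}.

    M_list: list of (L x L) nonnegative arrays; M_list[i][y_prev, y] is the
    unnormalized score ("affinity") of moving from label y_prev to label y
    at stage i+1.  A real implementation would work in log space to avoid
    overflow, and would multiply a vector through instead of forming the
    full matrix product.
    """
    prod = np.eye(M_list[0].shape[0])
    for M in M_list:
        prod = prod @ M          # accumulate M_1 M_2 ... M_{n+1}
    return prod[start, stop]     # Z(x)

# Toy usage: 4 stages, 5 labels; labels 0 and 4 play the roles of start/stop.
rng = np.random.default_rng(0)
Ms = [np.exp(rng.normal(size=(5, 5))) for _ in range(4)]
print(partition_function(Ms, start=0, stop=4))
```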

25 y1 y2 y3 x y1 y2 y3

26 Forward backward ideas
(trellis diagram: three positions, each with label values "name" and "nonName"; edges between adjacent positions annotated with weights b, c, d, f, g, h)
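To connect the trellis picture to learning: the gradient of the CRF log-likelihood needs expected feature counts, which come from edge marginals computed by a forward-backward pass over the same Mi matrices. A sketch (same placeholder matrices as above; no start/stop bookkeeping or log-space scaling):

```python
import numpy as np

def edge_marginals(M_list):
    """P(y_i = y', y_{i+1} = y | x) for each stage, via forward-backward."""
    n = len(M_list)
    L = M_list[0].shape[0]
    alpha = [np.ones(L)]                      # forward weights
    for M in M_list:
        alpha.append(alpha[-1] @ M)           # alpha_i = alpha_{i-1} M_i
    beta = [None] * n + [np.ones(L)]          # backward weights
    for i in range(n - 1, -1, -1):
        beta[i] = M_list[i] @ beta[i + 1]     # beta_i = M_{i+1} beta_{i+1}
    Z = alpha[n].sum()
    # marginal over the edge at stage i: alpha_i[y'] * M_i[y', y] * beta_{i+1}[y] / Z
    return [np.outer(alpha[i], beta[i + 1]) * M_list[i] / Z for i in range(n)]
```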

27 CRF learning – from Sha & Pereira

28 CRF learning – from Sha & Pereira

29 Sha & Pereira results CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron

30 Sha & Pereira results: training times in minutes, 375k training examples

31 POS tagging experiments in Lafferty et al.
Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
Each word in a given input sentence must be labeled with one of 45 syntactic tags
Add a small set of orthographic features (see the sketch below): whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
oov = out-of-vocabulary (not observed in the training set)
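A small sketch of the kind of orthographic feature extractor described above; the feature names and exact definitions are mine, not from the paper.

```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    """Binary orthographic features for one token, per the list on the slide."""
    feats = {
        "starts_with_digit": word[:1].isdigit(),
        "starts_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suffix in SUFFIXES:
        feats["suffix_-" + suffix] = word.lower().endswith(suffix)
    return {name: 1 for name, fired in feats.items() if fired}

# e.g. orthographic_features("Wisniewski") -> {"starts_with_upper": 1}
```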

32 POS tagging vs MXPost

