IE With Undirected Models

1 IE With Undirected Models
William W. Cohen CALD

2 Announcements Upcoming assignments:
Mon 2/23: Klein & Manning, Toutanova et al
Wed 2/25: no writeup due
Mon 3/1: no writeup due
Wed 3/3: project proposal due: personnel page
Spring break week, no class

3 Motivation for CMMs
[Figure: an HMM-style chain of states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; the observation "Wisniewski" is annotated with example features: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
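Written out (the standard per-position maxent form, not copied from the slides), the local model is
P(s_t \mid s_{t-1}, o_t) = \frac{\exp(\sum_k \lambda_k f_k(s_t, s_{t-1}, o_t))}{\sum_{s'} \exp(\sum_k \lambda_k f_k(s', s_{t-1}, o_t))},
i.e. one locally normalized distribution per (previous state, observation) pair.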

4 Implications of the model
Does this do what we want? Q: does Y[i-1] depend on X[i+1]? "A node is conditionally independent of its non-descendants given its parents." (By this rule, no: X[i+1] is not a descendant of Y[i-1], so later observations cannot influence earlier labels, which is arguably not what we want.)

5 Another view of label bias [Sha & Pereira]
So what’s the alternative?

6 CRF model
[Figure: a linear-chain CRF with label nodes y1, y2, y3, y4 connected in a chain, all conditioned on the observation sequence x]
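For reference (the standard linear-chain form, as in Lafferty et al. and Sha & Pereira):
P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, x, i)\Big), \qquad Z(x) = \sum_{y'} \exp\Big(\sum_i \sum_k \lambda_k f_k(y'_{i-1}, y'_i, x, i)\Big),
with a single global normalizer Z(x) instead of one normalizer per position.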

7 CRF learning – from Sha & Pereira

8 CRF learning – from Sha & Pereira

9 CRF learning – from Sha & Pereira
Something like forward-backward.
Idea: define a matrix of (y, y') "affinities" at stage i:
M_i[y, y'] = "unnormalized probability" of a transition from y to y' at stage i
M_i * M_{i+1} = "unnormalized probability" of any path through stages i and i+1
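A minimal numpy sketch of this matrix view (a reconstruction, not code from the slides; the feature function feats(y_prev, y, x, i) and the weight vector w are hypothetical names, and feats is assumed to return a numpy feature vector):

```python
import numpy as np

def transition_matrix(w, feats, x, i, labels):
    """M_i[y, y'] = exp(w . f(y, y', x, i)): unnormalized affinity of moving
    from label y at position i-1 to label y' at position i."""
    K = len(labels)
    M = np.empty((K, K))
    for a, y_prev in enumerate(labels):
        for b, y_cur in enumerate(labels):
            M[a, b] = np.exp(w @ feats(y_prev, y_cur, x, i))
    return M

def partition_function(w, feats, x, labels, start="<s>"):
    """Multiply the M_i together: the running product scores every label
    path, so summing over the final labels gives Z(x)."""
    # scores for the first label, conditioned on a dummy start symbol
    alpha = np.array([np.exp(w @ feats(start, y, x, 0)) for y in labels])
    for i in range(1, len(x)):
        alpha = alpha @ transition_matrix(w, feats, x, i, labels)
    return float(alpha.sum())
```

In practice the product is accumulated against a vector, as above, rather than materializing the full matrix product; that is exactly the forward pass on the next slide.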

10 Forward backward ideas
[Figure: a lattice over three positions with labels "name" and "nonName" at each position, with lettered scores (c, g, b, f, d, h) on the transitions between adjacent positions]
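In terms of the same M_i matrices, the standard recursions (following Sha & Pereira) are
\alpha_i = \alpha_{i-1} M_i, \qquad \beta_i^\top = M_{i+1} \beta_{i+1}^\top, \qquad P(y_{i-1}=y,\ y_i=y' \mid x) = \frac{\alpha_{i-1}[y]\, M_i[y,y']\, \beta_i[y']}{Z(x)},
and these edge marginals give the expected feature counts needed for the gradient.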

11 CRF learning – from Sha & Pereira

12 CRF results (from Sha & Pereira, Lafferty et al.)
Sha & Pereira even use some statistical tests! And show CRF beats MEMM (McNemar’s test) - but not voted perceptron.

13 CRFs: the good, the bad, and the cumbersome…
Good points:
- Global optimization of the weight vector that guides decision making
- Trades off decisions made at different points in the sequence
Worries:
- Cost (of training)
- Complexity (do we need all this math?)
- Amount of context: the matrix for the normalizer is |Y| * |Y|, so high-order models for many classes get expensive fast (see the example below)
- Strong commitment to maxent-style learning: loglinear models are nice, but nothing is always best
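A rough illustration of the cost point (my arithmetic, not from the slides): with the roughly 45 Penn Treebank POS tags, a first-order model needs a 45 * 45 = 2,025-entry matrix per position, while a second-order model effectively has 45^2 = 2,025 states and therefore about 2,025^2, roughly 4.1 million, matrix entries per position.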

14 Dependency Nets

15

16 Proposed solution:
- Parents of a node are its Markov blanket
- Like an undirected Markov net: capture all "correlational associations"
- One conditional probability for each node X, namely P(X | parents of X)
- Like a directed Bayes net: no messy clique potentials

17 Dependency nets The bad and the ugly:
- Inference is less efficient: MCMC sampling
- Can't reconstruct the joint probability via the chain rule
- Networks might be inconsistent, i.e. the local P(x|pa(x))'s don't define a pdf
- Exactly equal, representationally, to normal undirected Markov nets

18

19 Dependency nets The good:
Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X. (You might not learn a consistent model, but you'll probably learn a reasonably good one.) Inference can be sped up substantially over naïve Gibbs sampling.

20 Dependency nets
Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X|pa(X)) for each node X.
[Figure: a chain y1-y2-y3-y4 over the observation x, with local conditionals Pr(y1|x,y2), Pr(y2|x,y1,y3), Pr(y3|x,y2,y4), Pr(y4|x,y3)]
Learning is local, but inference is not, and need not be unidirectional.
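A minimal sketch of naïve Gibbs inference in such a net (illustrative names, not from the slides: each local_conditionals[i] is assumed to be a trained classifier exposing predict_proba(x, neighbor_labels), returning one weight per label):

```python
import random

def gibbs_sample(local_conditionals, neighbors, labels, x, n_sweeps=100):
    """Naive Gibbs sampling in a dependency net: repeatedly resample each
    y_i from its learned local conditional P(y_i | x, labels of its neighbors).
    Returns per-position label counts, i.e. approximate marginals."""
    n = len(local_conditionals)
    y = [random.choice(labels) for _ in range(n)]      # random initial labeling
    counts = [{lab: 0 for lab in labels} for _ in range(n)]
    for _ in range(n_sweeps):
        for i in range(n):
            context = [y[j] for j in neighbors[i]]     # current Markov-blanket labels
            probs = local_conditionals[i].predict_proba(x, context)
            y[i] = random.choices(labels, weights=probs, k=1)[0]
        for i in range(n):
            counts[i][y[i]] += 1
    return counts
```

Learning stays local (one classifier per node), while this sampling loop is what makes inference global and bidirectional.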

21 Toutanova, Klein, Manning, Singer
Dependency nets for POS tagging vs. CMMs. Maxent is used for the local conditional model.
Goals: an easy-to-train bidirectional model; a really good POS tagger.

22 Toutanova et al
Don't use Gibbs sampling for inference: instead use a Viterbi variant (which is not guaranteed to produce the ML sequence).
Example: D = {11, 11, 11, 12, 21, 33}, so the ML state is {11}, yet
P(a=1|b=1) P(b=1|a=1) < 1 while P(a=3|b=3) P(b=3|a=3) = 1.
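Working through the counts (my arithmetic, consistent with D above): b=1 occurs four times, three of them with a=1, so P(a=1|b=1) = 3/4 and, symmetrically, P(b=1|a=1) = 3/4, giving a product of 9/16. For the state 33 both conditionals are 1, so the Viterbi variant scores 33 above the true ML state 11 (empirical probability 3/6).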

23 Results with model

24 Results with model

25 Results with model “Best” model includes some special unknown-word features, including “a crude company-name detector”

26 Results with model
Final test-set results: MXPost: 47.6, 96.4, 86.2; CRF+: 95.7, 76.4

27 Other comments Smoothing (quadratic regularization, aka Gaussian prior) is important—it avoids overfitting effects reported elsewhere
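Concretely (the standard penalized objective; the variance actually used is not given here):
\sum_d \log P(y^{(d)} \mid x^{(d)}; \lambda) \;-\; \sum_k \frac{\lambda_k^2}{2\sigma^2}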

28 More on smoothing...

29 Klein & Manning: Conditional Structure vs Estimation

30 Task 1: WSD (Word Sense Disambiguation)
"Bush's election-year ad campaign will begin this summer, with..." (sense1)
"Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails?" (sense2)
Class is sense1/sense2; features are context words.

31 Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model. Use the conditional rule to predict sense s from context-word observations o. Standard NB training maximizes the "joint likelihood" under the independence assumption.
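For reference (the standard multinomial NB forms, not copied from the slides), the joint model and prediction rule are
P(s, o) = P(s) \prod_j P(o_j \mid s), \qquad \hat{s} = \arg\max_s P(s) \prod_j P(o_j \mid s),
and "joint likelihood" training maximizes \sum_d \log P(s_d, o_d).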

32 Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize the conditional likelihood (sound familiar?), or maybe the SenseEval score (SCL), or maybe even accuracy.
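Written out (a reconstruction of the conditional-likelihood objective for the same model, not copied from the slides):
CL(\theta) = \sum_d \log P(s_d \mid o_d) = \sum_d \log \frac{P(s_d) \prod_j P(o_{d,j} \mid s_d)}{\sum_{s'} P(s') \prod_j P(o_{d,j} \mid s')}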

33 Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning
- Optimize SCL, CL with conjugate gradient
- Also over "non-deficient models" (?) using Lagrange penalties to enforce a "soft" version of the deficiency constraint (I think this makes sure the non-conditional version is a valid probability)
- "Punt" on optimizing accuracy
- Penalty for extreme predictions in SCL

34

35 Conclusion: maxent beats NB?
All generalizations are wrong?

36 Task 2: POS Tagging
Sequential problem: replace NB with an HMM model.
Standard algorithms maximize joint likelihood.
Claim: keeping the same model but maximizing conditional likelihood leads to a CRF. Is this true?
The alternative is conditional structure (a CMM).

37 Using conditional structure vs maximizing conditional likelihood
The CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e. the JL estimate = the CL estimate for Pr(s|o).
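Spelling the step out (a short reconstruction): if the two factors have disjoint parameters \theta and \phi, then
\log P(s, o; \theta, \phi) = \log P(s \mid o; \theta) + \log P(o; \phi),
so maximizing joint likelihood over \theta is the same as maximizing conditional likelihood over \theta, because the \log P(o; \phi) term does not involve \theta.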

38 Task 2: POS Tagging Experiments with a simple feature set:
For a fixed model, CL is preferred to JL (CRF beats HMM).
For a fixed objective, HMM is preferred to MEMM/CMM.

39 Error analysis for POS tagging
Label bias is not the issue: state-state dependencies are weak compared to observation-state dependencies.
There is too much emphasis on the observation and not enough on previous states ("observation bias").
Put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...

40 Error analysis for POS tagging

