Noah A. Smith and Jason Eisner Department of Computer Science /


1 Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner Department of Computer Science / Center for Language and Speech Processing Johns Hopkins University ACL • N. A. Smith and J. Eisner • Contrastive Estimation

2 Nutshell Version
[Diagram: contrastive estimation with lattice neighborhoods, surrounded by four labels: unannotated text, tractable training, “max ent” features, sequence models]
Experiments on unlabeled data:
POS tagging: 46% error rate reduction (relative to EM)
“Max ent” features make it possible to survive damage to tag dictionary
Dependency parsing: 21% attachment error reduction (relative to EM)

3 “Red leaves don’t hide blue jays.”

4 Maximum Likelihood Estimation (Supervised)
y = JJ NNS MD VB JJ NNS
x = red leaves don’t hide blue jays
numerator: p(x, y), the observed words with their observed tags
denominator: p summed over all of Σ* × Λ*, every possible sentence with every possible tagging

5 Maximum Likelihood Estimation (Unsupervised)
y = ? ? ? ? ? ? (tags unobserved)
x = red leaves don’t hide blue jays
numerator: p(x), the observed words with any tagging
denominator: p summed over all of Σ* × Λ*
This is what EM does.

6-7 Focusing Probability Mass
[Figure labels: numerator, denominator]

8 Conditional Estimation (Supervised)
y = JJ NNS MD VB JJ NNS
x = red leaves don’t hide blue jays
numerator: p(x, y)
denominator: p of x with any tagging (? ? ? ? ? ?), that is, a sum over (x) × Λ*
A different denominator!

9-10 Objective Functions
Objective | Optimization Algorithm | Numerator | Denominator
MLE | Count & Normalize* | tags & words | Σ* × Λ*
MLE with hidden variables | EM* | words | Σ* × Λ*
Conditional Likelihood | Iterative Scaling | tags & words | (words) × Λ*
Perceptron | Backprop | tags & words | hypothesized tags & words
Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, raw word sequence, sum over all possible taggings) | ?
*For generative models.

11 This talk is about denominators ...
... in the unsupervised case. A good denominator can improve accuracy and tractability.

12 Language Learning (Syntax)
red leaves don’t hide blue jays
Why didn’t he say, “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words? → EM

13 Language Learning (Syntax)
red leaves don’t hide blue jays
Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...

14 What is a syntax model supposed to explain?
Each learning hypothesis corresponds to a denominator / neighborhood.

15 The Job of Syntax
“Explain why each word is necessary.” → DEL1WORD neighborhood
red leaves don’t hide blue jays
leaves don’t hide blue jays
red don’t hide blue jays
red leaves hide blue jays
red leaves don’t blue jays
red leaves don’t hide jays
red leaves don’t hide blue
ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation

16 The Job of Syntax
“Explain the (local) order of the words.” → TRANS1 neighborhood
red leaves don’t hide blue jays
leaves red don’t hide blue jays
red don’t leaves hide blue jays
red leaves hide don’t blue jays
red leaves don’t blue hide jays
red leaves don’t hide jays blue
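The two neighborhoods listed on these slides can be generated mechanically. The following is a minimal sketch, not the authors' code: the function names `del1word` and `trans1` are chosen here for illustration, and each neighborhood is assumed to include the observed sentence itself, as the slide listings do.

```python
def del1word(words):
    """Observed sentence plus every sentence with exactly one word deleted."""
    words = tuple(words)
    neighbors = [words]
    for i in range(len(words)):
        neighbors.append(words[:i] + words[i + 1:])
    # Repeated words can produce identical neighbors; keep each one once.
    return list(dict.fromkeys(neighbors))


def trans1(words):
    """Observed sentence plus every transposition of two adjacent words."""
    words = tuple(words)
    neighbors = [words]
    for i in range(len(words) - 1):
        neighbors.append(words[:i] + (words[i + 1], words[i]) + words[i + 2:])
    return list(dict.fromkeys(neighbors))


sentence = "red leaves don’t hide blue jays".split()
for neighbor in trans1(sentence):
    print(" ".join(neighbor))
```

For the six-word example sentence, `del1word` yields seven sentences and `trans1` yields six, matching the listings above.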

17 [Figure: numerator over denominator for the TRANS1 neighborhood]
numerator: p(red leaves don’t hide blue jays), with any tagging (? ? ? ? ? ?)
denominator: p summed over the sentences in the TRANS1 neighborhood, each with any tagging:
red leaves don’t hide blue jays
leaves red don’t hide blue jays
red don’t leaves hide blue jays
red leaves hide don’t blue jays
red leaves don’t blue hide jays
red leaves don’t hide jays blue

18 [Figure: the same ratio, with the denominator drawn as a word lattice]
numerator: p(red leaves don’t hide blue jays), with any tagging (? ? ? ? ? ?)
denominator: p summed over the sentences in the TRANS1 neighborhood (with any tagging), drawn as a lattice whose arcs carry the words red, leaves, don’t, hide, blue, jays

19 The New Modeling Imperative
A good sentence hints that a set of bad ones is nearby.
[Figure labels: numerator; denominator (“neighborhood”)]
“Make the good sentence likely, at the expense of those bad neighbors.”
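To make the imperative concrete, here is a toy sketch of a contrastive objective for one sentence: the log of the observed sentence's total score (over all taggings) divided by the total score of every sentence in its neighborhood. Everything specific in it is an assumption for illustration: the three-tag set `TAGS`, the feature templates in `features`, and the brute-force enumeration of taggings, which is exponential in sentence length and is only workable at toy scale.

```python
import itertools
import math
from collections import Counter

TAGS = ("ADJ", "N", "V")  # hypothetical toy tag set, not the talk's


def trans1(words):
    """Observed sentence plus every transposition of two adjacent words."""
    words = tuple(words)
    out = [words]
    for i in range(len(words) - 1):
        out.append(words[:i] + (words[i + 1], words[i]) + words[i + 2:])
    return list(dict.fromkeys(out))  # drop duplicates, keep order


def features(words, tags):
    """Illustrative features only: tag/word pairs and tag bigrams."""
    feats = Counter()
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("emit", tag, word)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_total_score(words, weights):
    """log sum_y exp(weights . f(words, y)), enumerating every tagging y."""
    scores = []
    for tags in itertools.product(TAGS, repeat=len(words)):
        feats = features(words, tags)
        scores.append(sum(weights.get(k, 0.0) * c for k, c in feats.items()))
    return logsumexp(scores)


def contrastive_log_likelihood(words, weights, neighborhood=trans1):
    """log of (score of observed sentence / score of its whole neighborhood)."""
    numerator = log_total_score(tuple(words), weights)
    denominator = logsumexp(
        [log_total_score(n, weights) for n in neighborhood(words)]
    )
    return numerator - denominator


sentence = "red leaves don’t hide blue jays".split()
print(contrastive_log_likelihood(sentence, weights={}))
print(contrastive_log_likelihood(sentence, weights={("trans", "ADJ", "N"): 1.0}))
```

With all weights at zero, every neighbor scores the same, so the objective equals the negative log of the neighborhood size. Training would adjust the weights to push this value toward zero, that is, to move probability mass from the neighbors onto the observed sentence.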

20 This talk is about denominators ...
... in the unsupervised case. A good denominator can improve accuracy and tractability.

21 Log-Linear Models
[Equation labels: score of x, y; partition function Z]
Computing Z is undesirable! Sums over all possible taggings of all possible sentences!
Conditional Estimation (Supervised): 1 sentence
Contrastive Estimation (Unsupervised): a few sentences
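The slide's equations did not survive in this transcript. As a sketch, the standard log-linear forms can be written as follows, using notation assumed here rather than taken from the slide: θ for the weight vector, f(x, y) for the feature vector, and N(x) for the neighborhood of sentence x.

```latex
% Log-linear model: score of (x, y) divided by the partition function Z
p_\theta(x, y) = \frac{\exp\bigl(\theta \cdot f(x, y)\bigr)}{Z(\theta)},
\qquad
Z(\theta) = \sum_{(x', y') \in \Sigma^* \times \Lambda^*} \exp\bigl(\theta \cdot f(x', y')\bigr)

% Conditional estimation (supervised): the denominator covers 1 sentence
p_\theta(y \mid x) = \frac{\exp\bigl(\theta \cdot f(x, y)\bigr)}
                          {\sum_{y' \in \Lambda^*} \exp\bigl(\theta \cdot f(x, y')\bigr)}

% Contrastive estimation (unsupervised): the denominator covers a few sentences
p_\theta\bigl(x \mid N(x)\bigr) =
  \frac{\sum_{y \in \Lambda^*} \exp\bigl(\theta \cdot f(x, y)\bigr)}
       {\sum_{x' \in N(x)} \sum_{y \in \Lambda^*} \exp\bigl(\theta \cdot f(x', y)\bigr)}
```

In the last two ratios Z(θ) appears in both numerator and denominator and cancels, which is why neither objective requires the sum over all taggings of all sentences.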

22 A Big Picture: Sequence Model Estimation
[Diagram: estimation methods arranged by three properties: unannotated data, tractable sums, overlapping features]
generative, MLE: p(x, y)
generative, EM: p(x)
log-linear, MLE: p(x, y)
log-linear, conditional estimation: p(y | x)
log-linear, EM: p(x)
log-linear, CE with lattice neighborhoods

23 Contrastive Neighborhoods
Guide the learner toward models that do what syntax is supposed to do.
Lattice representation → efficient algorithms.
There is an art to choosing neighborhood functions.

24 Neighborhoods
neighborhood | size | lattice arcs | perturbations
DEL1WORD | n+1 | O(n) | delete up to 1 word
TRANS1 | n | O(n) | transpose any bigram
DELORTRANS1 | O(n) | O(n) | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²) | delete any contiguous subsequence
Σ* (EM) | ∞ | - | replace each word with anything
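The quadratic size of DEL1SUBSEQUENCE can be checked by direct enumeration. This is a sketch under two assumptions not stated on the slide: the neighborhood includes the observed sentence, and deleting the whole sentence (leaving the empty sequence) counts as a neighbor. The function name is chosen for illustration.

```python
def del1subsequence(words):
    """Observed sentence plus every result of deleting one contiguous span."""
    words = tuple(words)
    n = len(words)
    neighbors = [words]
    for start in range(n):
        for end in range(start + 1, n + 1):
            # Delete words[start:end]; end == n with start == 0 leaves ().
            neighbors.append(words[:start] + words[end:])
    return list(dict.fromkeys(neighbors))  # drop duplicates, keep order


sentence = "red leaves don’t hide blue jays".split()
n = len(sentence)
neighbors = del1subsequence(sentence)
# n(n+1)/2 non-empty spans plus the sentence itself bounds the size.
print(len(neighbors), n * (n + 1) // 2 + 1)
```

When all words are distinct, as in the example sentence, the bound n(n+1)/2 + 1 is met exactly; repeated words can only make the neighborhood smaller.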

25 The Merialdo (1994) Task
Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.
A form of supervision.
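A POS dictionary restricts which taggings the learner has to consider. The sketch below enumerates the taggings a dictionary allows for one sentence. The dictionary entries and the tag inventory are invented for illustration, and words missing from the dictionary are allowed any tag, which is the treatment a later slide describes for OOV words.

```python
import itertools

# Hypothetical tag inventory and dictionary entries, for illustration only.
ALL_TAGS = {"JJ", "NN", "NNS", "MD", "VB", "VBZ"}
TAG_DICT = {
    "red": {"JJ", "NN"},
    "leaves": {"NNS", "VBZ"},
    "don’t": {"MD"},
    "hide": {"VB", "NN"},
    "blue": {"JJ"},
    # "jays" is deliberately absent: an out-of-vocabulary word.
}


def candidate_taggings(words, tag_dict, all_tags):
    """Every tag sequence the dictionary permits; unknown words take any tag."""
    options = [sorted(tag_dict.get(word, all_tags)) for word in words]
    return list(itertools.product(*options))


sentence = "red leaves don’t hide blue jays".split()
taggings = candidate_taggings(sentence, TAG_DICT, ALL_TAGS)
print(len(taggings))
```

Without the dictionary, the same six-word sentence would have 6^6 candidate taggings under this toy inventory, which is why the dictionary counts as a form of supervision.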

26 Trigram Tagging Model
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
feature set:
tag trigrams
tag/word pairs from a POS dictionary
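The feature set on this slide can be sketched as a function that counts features for one tagged sentence. The boundary symbols `<s>` and `</s>` and the feature-name prefixes are assumptions made here, and for brevity the sketch emits a tag/word feature for every position rather than only for pairs listed in a POS dictionary.

```python
from collections import Counter


def trigram_tagger_features(words, tags, start="<s>", stop="</s>"):
    """Count tag-trigram and tag/word features for one tagged sentence."""
    feats = Counter()
    padded = [start, start] + list(tags) + [stop]
    for i in range(2, len(padded)):
        feats[("trigram", padded[i - 2], padded[i - 1], padded[i])] += 1
    for word, tag in zip(words, tags):
        feats[("tagword", tag, word)] += 1
    return feats


words = "red leaves don’t hide blue jays".split()
tags = "JJ NNS MD VB JJ NNS".split()
feats = trigram_tagger_features(words, tags)
for name, count in feats.items():
    print(name, count)
```

A log-linear model scores the tagged sentence as the dot product of these counts with a weight vector, so features may overlap freely without any independence assumption.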

27 [Figure: tagging results by training method]
Labels: CRF; supervised HMM; LENGTH ≈ log-linear EM; TRANS1; DELORTRANS1; DA, Smith & Eisner (2004); 10 × data; EM, Merialdo (1994); EM; DEL1WORD; DEL1SUBSEQUENCE; random
Setup: 96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions

28 What if we damage the POS dictionary?
Dictionary includes ...
all words
words from 1st half of corpus
words with count ≥ 2
words with count ≥ 3
Dictionary excludes OOV words, which can get any tag.

29 [Figure: results for random, EM, LENGTH, and DELORTRANS1 under each dictionary condition]
Setup: 96K words, 17 coarse POS tags, uninformative initializer
Dictionary includes ...
all words
words from 1st half of corpus
words with count ≥ 2
words with count ≥ 3
Dictionary excludes OOV words, which can get any tag.

30 Trigram Tagging Model + Spelling
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
feature set:
tag trigrams
tag/word pairs from a POS dictionary
1- to 3-character suffixes, contains hyphen, digit
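The spelling features added on this slide can be sketched as follows. Pairing each spelling property with the tag, and the feature names themselves, are assumptions made for illustration; the slide only lists the properties (suffixes of 1 to 3 characters, contains a hyphen, contains a digit).

```python
def spelling_features(word, tag):
    """Spelling-based features for one word/tag pair."""
    feats = []
    for length in (1, 2, 3):
        if len(word) >= length:
            feats.append(("suffix", tag, word[-length:]))
    if "-" in word:
        feats.append(("contains_hyphen", tag))
    if any(ch.isdigit() for ch in word):
        feats.append(("contains_digit", tag))
    return feats


print(spelling_features("leaves", "NNS"))
print(spelling_features("mid-1990s", "CD"))
```

Because these features depend on the word's form rather than its identity, they can still fire for words that a damaged dictionary no longer covers.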

31 Spelling features aided recovery, but only with a smart neighborhood.
[Figure: results for random, EM, LENGTH, LENGTH + spelling, DELORTRANS1, DELORTRANS1 + spelling]

32 The model need not be finite-state.

33 Unsupervised Dependency Parsing
[Figure: attachment accuracy by initializer for EM, LENGTH, and TRANS1; the figure is also labeled Klein & Manning (2004)]
See our paper at the IJCAI 2005 Grammatical Inference workshop.

34 To Sum Up ...
Contrastive Estimation means picking your own denominator for tractability or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models:
with max ent features
unsupervised sequence models
all in time for ACL 2006.
