Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
ACL • N. A. Smith and J. Eisner • Contrastive Estimation

Nutshell Version
Contrastive estimation with lattice neighborhoods: unannotated text, tractable training, “max ent” features, sequence models.
Experiments on unlabeled data:
POS tagging: 46% error rate reduction (relative to EM).
“Max ent” features make it possible to survive damage to the tag dictionary.
Dependency parsing: 21% attachment error reduction (relative to EM).

“Red leaves don’t hide blue jays.”

Maximum Likelihood Estimation (Supervised)
y: JJ NNS MD VB JJ NNS
x: red leaves don’t hide blue jays
Numerator: p of the observed words x with the observed tags y.
Denominator: p summed over every word sequence with every tagging, Σ* × Λ*.

Maximum Likelihood Estimation (Unsupervised)
y: ? ? ? ? ? ?
x: red leaves don’t hide blue jays
Numerator: p of the observed words x, summed over all possible taggings. This is what EM does.
Denominator: p summed over every word sequence with every tagging, Σ* × Λ*.

Focusing Probability Mass
numerator / denominator

Conditional Estimation (Supervised)
y: JJ NNS MD VB JJ NNS
x: red leaves don’t hide blue jays
Numerator: p of the observed words x with the observed tags y.
Denominator: p of the same words x with any tagging (? ? ? ? ? ?), (x) × Λ*. A different denominator!

Objective Functions
Objective | Optimization Algorithm | Numerator | Denominator
MLE | Count & Normalize* | tags & words | Σ* × Λ*
MLE with hidden variables | EM* | words | Σ* × Λ*
Conditional Likelihood | Iterative Scaling | tags & words | (words) × Λ*
Perceptron | Backprop | tags & words | hypothesized tags & words
*For generative models.

Objective Functions
Objective | Optimization Algorithm | Numerator | Denominator
MLE | Count & Normalize* | tags & words | Σ* × Λ*
MLE with hidden variables | EM* | words | Σ* × Λ*
Conditional Likelihood | Iterative Scaling | tags & words | (words) × Λ*
Perceptron | Backprop | tags & words | hypothesized tags & words
Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, raw word sequence, sum over all possible taggings) | ?
*For generative models.
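Each row of the table can be read as a ratio that training tries to make large. The following LaTeX sketch writes those ratios out. The notation is assumed here rather than taken from the slides: x and y are the observed words and tags, primed symbols are summation variables, p may be an unnormalized score, and N(x) stands for the neighborhood that later slides supply for the "?" cell.

```latex
\begin{align*}
\text{MLE:} \quad
  & \frac{p(x, y)}{\sum_{(x', y') \in \Sigma^* \times \Lambda^*} p(x', y')} \\[4pt]
\text{MLE with hidden variables:} \quad
  & \frac{\sum_{y'} p(x, y')}{\sum_{(x', y') \in \Sigma^* \times \Lambda^*} p(x', y')} \\[4pt]
\text{Conditional likelihood:} \quad
  & \frac{p(x, y)}{\sum_{y'} p(x, y')} \\[4pt]
\text{Contrastive estimation:} \quad
  & \frac{\sum_{y'} p(x, y')}{\sum_{x' \in N(x)} \sum_{y'} p(x', y')}
\end{align*}
```

The perceptron row is left out because its "denominator" is a single hypothesized tagging rather than a sum.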

This talk is about denominators ... in the unsupervised case.
A good denominator can improve accuracy and tractability.

Language Learning (Syntax)
red leaves don’t hide blue jays
Why didn’t he say, “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words? (EM)

Language Learning (Syntax)
red leaves don’t hide blue jays
Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...

What is a syntax model supposed to explain?
Each learning hypothesis corresponds to a denominator / neighborhood.

ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation
The Job of Syntax
“Explain why each word is necessary.” → DEL1WORD neighborhood
red leaves don’t hide blue jays (the observed sentence)
leaves don’t hide blue jays
red don’t hide blue jays
red leaves hide blue jays
red leaves don’t blue jays
red leaves don’t hide jays
red leaves don’t hide blue
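The DEL1WORD neighborhood listed on this slide can be generated mechanically. A minimal Python sketch; the function name `del1word` is hypothetical, and it enumerates neighbors explicitly rather than building a lattice:

```python
def del1word(words):
    """Return the DEL1WORD neighborhood of a sentence: the sentence itself
    plus every sentence obtained by deleting exactly one word."""
    sentence = tuple(words)
    neighbors = [sentence]  # the observed sentence is its own neighbor
    for i in range(len(sentence)):
        neighbors.append(sentence[:i] + sentence[i + 1:])
    return neighbors


sentence = "red leaves don’t hide blue jays".split()
for neighbor in del1word(sentence):
    print(" ".join(neighbor))
```

For a sentence of n words this yields n + 1 entries. If the sentence contains the same word twice in a row, two of the entries coincide.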

The Job of Syntax
“Explain the (local) order of the words.” → TRANS1 neighborhood
red leaves don’t hide blue jays (the observed sentence)
leaves red don’t hide blue jays
red don’t leaves hide blue jays
red leaves hide don’t blue jays
red leaves don’t blue hide jays
red leaves don’t hide jays blue
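The TRANS1 neighborhood can be enumerated the same way, by swapping each adjacent pair of words once. A minimal Python sketch; the function name `trans1` is hypothetical:

```python
def trans1(words):
    """Return the TRANS1 neighborhood of a sentence: the sentence itself
    plus every sentence obtained by transposing one adjacent word pair."""
    sentence = list(words)
    neighbors = [tuple(sentence)]  # the observed sentence is its own neighbor
    for i in range(len(sentence) - 1):
        swapped = sentence[:]
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        neighbors.append(tuple(swapped))
    return neighbors


sentence = "red leaves don’t hide blue jays".split()
for neighbor in trans1(sentence):
    print(" ".join(neighbor))
```

For a sentence of n words this yields n entries: the original plus n - 1 transpositions.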

Numerator: p of “red leaves don’t hide blue jays” with any tagging (? ? ? ? ? ?).
Denominator: p of the sentences in the TRANS1 neighborhood, each with any tagging:
red leaves don’t hide blue jays
leaves red don’t hide blue jays
red don’t leaves hide blue jays
red leaves hide don’t blue jays
red leaves don’t blue hide jays
red leaves don’t hide jays blue

Numerator: p of “red leaves don’t hide blue jays” with any tagging (? ? ? ? ? ?).
Denominator: p of the sentences in the TRANS1 neighborhood (with any tagging), drawn as a single lattice over the words red, leaves, don’t, hide, blue, jays.

The New Modeling Imperative
A good sentence hints that a set of bad ones is nearby.
numerator / denominator (“neighborhood”)
“Make the good sentence likely, at the expense of those bad neighbors.”
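To make the numerator and denominator concrete, here is a minimal brute-force sketch in Python of the log of that ratio for one sentence. Everything in it is an illustrative assumption rather than the talk's implementation: the three-tag set, the bigram and tag/word features, and the weights are invented, and it enumerates every tagging instead of using a lattice, so it only works on very short sentences.

```python
import itertools
import math

TAGS = ("JJ", "NNS", "VB")  # hypothetical toy tag set


def score(words, tags, weights):
    """Unnormalized log-linear score: sum of the weights of firing features."""
    total, prev = 0.0, "<s>"
    for word, tag in zip(words, tags):
        total += weights.get(("tag_bigram", prev, tag), 0.0)
        total += weights.get(("tag_word", tag, word), 0.0)
        prev = tag
    return total


def log_sum_exp(values):
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def log_sum_over_taggings(words, weights):
    """log of the sum of exp(score) over every possible tagging of words."""
    return log_sum_exp([score(words, tags, weights)
                        for tags in itertools.product(TAGS, repeat=len(words))])


def trans1(words):
    """The sentence plus all single adjacent transpositions."""
    s = tuple(words)
    return [s] + [s[:i] + (s[i + 1], s[i]) + s[i + 2:] for i in range(len(s) - 1)]


def ce_log_objective(words, weights, neighborhood=trans1):
    """log numerator - log denominator; always <= 0 when the sentence is
    a member of its own neighborhood."""
    numerator = log_sum_over_taggings(words, weights)
    denominator = log_sum_exp([log_sum_over_taggings(n, weights)
                               for n in neighborhood(words)])
    return numerator - denominator


weights = {("tag_word", "JJ", "red"): 2.0,
           ("tag_word", "NNS", "leaves"): 2.0,
           ("tag_bigram", "JJ", "NNS"): 2.0}
print(ce_log_objective(["red", "leaves", "hide"], {}))       # all weights zero
print(ce_log_objective(["red", "leaves", "hide"], weights))  # favors the observed order
```

With all weights at zero every neighbor scores the same, so the objective equals minus the log of the neighborhood size. Weights that reward the observed word order raise it above that baseline.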

This talk is about denominators ... in the unsupervised case.
A good denominator can improve accuracy and tractability.

Log-Linear Models
p(x, y) = (score of x, y) / Z, where Z is the partition function.
Computing Z is undesirable! It sums over all possible taggings of all possible sentences!
Conditional Estimation (Supervised): the denominator covers 1 sentence.
Contrastive Estimation (Unsupervised): the denominator covers a few sentences.
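The labels on this slide match the standard log-linear form, sketched below in LaTeX. The slide's own formula did not survive extraction, so the symbols are assumptions: θ is the weight vector, f the feature vector, and N(x) the neighborhood of the observed sentence x.

```latex
\begin{align*}
p_\theta(x, y) &= \frac{\exp\bigl(\theta \cdot f(x, y)\bigr)}{Z(\theta)},
&
Z(\theta) &= \sum_{(x', y') \in \Sigma^* \times \Lambda^*} \exp\bigl(\theta \cdot f(x', y')\bigr) \\[6pt]
\frac{\sum_{y} p_\theta(x, y)}{\sum_{x' \in N(x)} \sum_{y'} p_\theta(x', y')}
&= \frac{\sum_{y} \exp\bigl(\theta \cdot f(x, y)\bigr)}{\sum_{x' \in N(x)} \sum_{y'} \exp\bigl(\theta \cdot f(x', y')\bigr)}
\end{align*}
```

Z(θ) appears in both the numerator and the denominator of the ratio and cancels, so only the sentences in N(x) have to be summed over.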

A Big Picture: Sequence Model Estimation
Properties: unannotated data, tractable sums, overlapping features.
Estimators placed among them:
generative, EM: p(x)
generative, MLE: p(x, y)
log-linear, EM: p(x)
log-linear, MLE: p(x, y)
log-linear, conditional estimation: p(y | x)
log-linear, CE with lattice neighborhoods

Contrastive Neighborhoods
Guide the learner toward models that do what syntax is supposed to do.
Lattice representation → efficient algorithms.
There is an art to choosing neighborhood functions.

Neighborhoods
neighborhood | size | lattice arcs | perturbations
DEL1WORD | n+1 | O(n) | delete up to 1 word
TRANS1 | n | O(n) | transpose any bigram
DELORTRANS1 | O(n) | O(n) | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²) | delete any contiguous subsequence
Σ* (EM) | ∞ | - | replace each word with anything
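The sizes in the table can be checked by enumeration. A minimal Python sketch with hypothetical function names; it counts neighbors explicitly and does not build the lattices. One assumption is flagged in the code: DEL1SUBSEQUENCE here never deletes the entire sentence.

```python
def del1word(words):
    s = tuple(words)
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def trans1(words):
    s = tuple(words)
    return {s} | {s[:i] + (s[i + 1], s[i]) + s[i + 2:] for i in range(len(s) - 1)}


def delortrans1(words):
    # Union of the two neighborhoods; they share only the original sentence.
    return del1word(words) | trans1(words)


def del1subsequence(words):
    s = tuple(words)
    n = len(s)
    neighbors = {s}
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j - i < n:  # assumption: never delete the entire sentence
                neighbors.add(s[:i] + s[j:])
    return neighbors


sentence = "red leaves don’t hide blue jays".split()  # n = 6, all words distinct
for name, fn in [("DEL1WORD", del1word), ("TRANS1", trans1),
                 ("DELORTRANS1", delortrans1), ("DEL1SUBSEQUENCE", del1subsequence)]:
    print(name, len(fn(sentence)))
```

Because the neighborhoods are stored as sets, sentences with repeated words can produce fewer distinct neighbors than the table's counts.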

The Merialdo (1994) Task
Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag.
The dictionary is a form of supervision.

Trigram Tagging Model
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
Feature set: tag trigrams; tag/word pairs from a POS dictionary.
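The two feature types named on this slide can be counted for a tagged sentence as follows. This is a minimal Python sketch under assumptions: the boundary symbols `<s>` and `</s>`, the feature-tuple layout, and the function name are hypothetical, and it does not restrict tag/word pairs to those licensed by a POS dictionary.

```python
from collections import Counter


def trigram_tagger_features(words, tags):
    """Count tag-trigram and tag/word features for one tagged sentence."""
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]  # hypothetical boundary symbols
    features = Counter()
    for i in range(2, len(padded)):
        features[("trigram", padded[i - 2], padded[i - 1], padded[i])] += 1
    for word, tag in zip(words, tags):
        features[("tag_word", tag, word)] += 1
    return features


words = "red leaves don’t hide blue jays".split()
tags = "JJ NNS MD VB JJ NNS".split()
for feature, count in trigram_tagger_features(words, tags).items():
    print(feature, count)
```

A log-linear score for the tagged sentence is then the dot product of these counts with a weight for each feature.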

[Chart of tagging results. Bar labels: CRF and HMM (supervised); LENGTH (≈ log-linear EM); TRANS1; DELORTRANS1; DA, Smith & Eisner (2004); 10 × data, EM, Merialdo (1994); EM; DEL1WORD; DEL1SUBSEQUENCE; random.]
Setup: 96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions.

What if we damage the POS dictionary?
Dictionary includes ...
all words
words from 1st half of corpus
words with count ≥ 2
words with count ≥ 3
Dictionary excludes OOV words, which can get any tag.
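One of the damage conditions, keeping only words at or above a count threshold, can be sketched in a few lines of Python. The function names, the toy corpus, and the tag set below are hypothetical; the sketch only illustrates the rule that out-of-dictionary words may take any tag.

```python
from collections import Counter, defaultdict


def build_dictionary(tagged_corpus, min_count=1):
    """Map each word seen at least min_count times to the set of tags it took."""
    counts = Counter(word for word, _ in tagged_corpus)
    dictionary = defaultdict(set)
    for word, tag in tagged_corpus:
        if counts[word] >= min_count:
            dictionary[word].add(tag)
    return dict(dictionary)


def allowed_tags(word, dictionary, all_tags):
    """Dictionary words get their listed tags; OOV words can get any tag."""
    return dictionary.get(word, set(all_tags))


ALL_TAGS = ("DT", "MD", "NN", "NNS")  # hypothetical toy tag set
corpus = [("the", "DT"), ("can", "MD"), ("the", "DT"), ("can", "NN"), ("jays", "NNS")]
damaged = build_dictionary(corpus, min_count=2)
print(allowed_tags("can", damaged, ALL_TAGS))   # listed tags only
print(allowed_tags("jays", damaged, ALL_TAGS))  # count 1, so treated as OOV
```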

[Chart of tagging results under each dictionary condition (all words; words from 1st half of corpus; words with count ≥ 2; words with count ≥ 3). Dictionary excludes OOV words, which can get any tag. Series: EM, LENGTH, DELORTRANS1, random.]
Setup: 96K words, 17 coarse POS tags, uninformative initializer.

Trigram Tagging Model + Spelling
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes, contains hyphen, digit.
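The spelling cues listed here are easy to extract per word. A minimal Python sketch; the function name and feature-tuple layout are hypothetical, and how the talk's model pairs these cues with tags is not shown on the slide.

```python
def spelling_features(word):
    """Return the spelling cues for one word: suffixes of length 1 to 3,
    plus flags for containing a hyphen or a digit."""
    features = []
    for k in (1, 2, 3):
        if len(word) >= k:
            features.append(("suffix", word[-k:]))
    if "-" in word:
        features.append(("contains_hyphen",))
    if any(ch.isdigit() for ch in word):
        features.append(("contains_digit",))
    return features


for word in ["jays", "mid-1990s", "a"]:
    print(word, spelling_features(word))
```

Because these features overlap with the tag/word features, they fit a log-linear model more naturally than a generative one.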

Spelling features aided recovery, but only with a smart neighborhood.
[Chart series: EM, LENGTH + spelling, LENGTH, random, DELORTRANS1 + spelling, DELORTRANS1.]

The model need not be finite-state.

Unsupervised Dependency Parsing
Klein & Manning (2004)
[Chart: attachment accuracy by initializer for EM, LENGTH, and TRANS1.]
See our paper at the IJCAI 2005 Grammatical Inference workshop.

To Sum Up ...
Contrastive Estimation means picking your own denominator for tractability or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models: with max ent features, unsupervised, for sequence models, all in time for ACL 2006.
