
Presentation transcript: "Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data" (ACL 2005, N. A. Smith and J. Eisner)

Slide 1: Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data. Noah A. Smith and Jason Eisner, Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University. (ACL 2005)

Slide 2: Nutshell Version. The pieces that fit together: contrastive estimation with lattice neighborhoods, sequence models, tractable training, unannotated text, and "max ent" features. Experiments on unlabeled data: POS tagging, 46% error rate reduction (relative to EM); "max ent" features make it possible to survive damage to the tag dictionary; dependency parsing, 21% attachment error reduction (relative to EM).

Slide 3: The running example: "Red leaves don't hide blue jays."

Slide 4: Maximum Likelihood Estimation (Supervised). [Figure: the observed pair x = "red leaves don't hide blue jays" with gold tags y = JJ NNS MD VB JJ NNS; training moves probability mass p onto the observed (x, y) within the full event space Σ* × Λ*, all word sequences crossed with all tag sequences.]

Slide 5: Maximum Likelihood Estimation (Unsupervised). [Figure: the same sentence x with its tags hidden (? ? ? ? ? ?); training moves mass onto x, summed over all of its taggings, within Σ* × Λ*.] This is what EM does.
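The two criteria, in symbols (a standard formulation consistent with the slides, not copied from them):

```latex
% Supervised MLE: fit the observed (sentence, tagging) pairs directly.
\hat{\theta}_{\text{sup}} = \arg\max_{\theta} \prod_{i} p_{\theta}(x_i, y_i)

% Unsupervised MLE: the tagging is hidden, so marginalize it out.
% EM is a local hill-climbing method for this objective.
\hat{\theta}_{\text{unsup}} = \arg\max_{\theta} \prod_{i} \sum_{y \in \Lambda^*} p_{\theta}(x_i, y)
```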

Slides 6-7: Focusing Probability Mass. [Figure: every estimation criterion here is a ratio; training inflates the probability mass of the numerator set relative to the mass of the denominator set that contains it.]

Slide 8: Conditional Estimation (Supervised). [Figure: the numerator is the observed tagged sentence (x, y) = "red leaves don't hide blue jays" / JJ NNS MD VB JJ NNS; the denominator is (x) × Λ*, i.e. the same sentence under every possible tagging.] A different denominator!

Slide 9: Objective Functions.

  Objective                   Optimization algorithm   Numerator      Denominator
  MLE                         Count & Normalize*       tags & words   Σ* × Λ*
  MLE with hidden variables   EM*                      words          Σ* × Λ*
  Conditional Likelihood      Iterative Scaling        tags & words   (words) × Λ*
  Perceptron                  Backprop                 tags & words   hypothesized tags & words

  *For generative models.

Slide 10: Objective Functions, continued. Contrastive Estimation adds a new row to the table: the numerator is the observed data (in this talk, the raw word sequence, summed over all possible taggings); the denominator is the open question, marked "?"; and optimization uses generic numerical solvers (in this talk, LMVM L-BFGS).

Slide 11: This talk is about denominators... in the unsupervised case. A good denominator can improve accuracy and tractability.

Slide 12: Language Learning (Syntax). EM's implicit question about "red leaves don't hide blue jays": why didn't he say "birds fly," or "dancing granola," or "the wash dishes," or any other sequence of words?

Slide 13: Language Learning (Syntax), continued. A sharper question: why did he pick that sequence for those words? Why not say "leaves red..." or "...hide don't..." or...?

Slide 14: What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator, i.e. a neighborhood.

Slide 15: The Job of Syntax. "Explain why each word is necessary." → the DEL1WORD neighborhood of "red leaves don't hide blue jays":
  leaves don't hide blue jays
  red don't hide blue jays
  red leaves hide blue jays
  red leaves don't blue jays
  red leaves don't hide jays
  red leaves don't hide blue

Slide 16: The Job of Syntax, continued. "Explain the (local) order of the words." → the TRANS1 neighborhood:
  leaves red don't hide blue jays
  red don't leaves hide blue jays
  red leaves hide don't blue jays
  red leaves don't blue hide jays
  red leaves don't hide jays blue
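Both neighborhoods are cheap to enumerate explicitly. A minimal sketch (our illustration, not the authors' code):

```python
def del1word(words):
    """DEL1WORD: the sentence itself plus every single-word deletion
    (n + 1 neighbors for an n-word sentence)."""
    yield list(words)
    for i in range(len(words)):
        yield words[:i] + words[i + 1:]

def trans1(words):
    """TRANS1: the sentence itself plus every adjacent-word swap
    (n neighbors for an n-word sentence)."""
    yield list(words)
    for i in range(len(words) - 1):
        yield words[:i] + [words[i + 1], words[i]] + words[i + 2:]

sentence = "red leaves don't hide blue jays".split()
for neighbor in trans1(sentence):
    print(" ".join(neighbor))
```

Including the sentence itself in its own neighborhood matches the sizes given on slide 24.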

Slide 17: [Figure: the contrastive objective with the TRANS1 neighborhood; the numerator is the observed sentence "red leaves don't hide blue jays" summed over all taggings (? ? ? ? ? ?), and the denominator sums over every sentence in its TRANS1 neighborhood, each likewise summed over all taggings.]
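A brute-force sketch of this quantity for a toy model, to make the numerator/denominator structure concrete (toy_score and its feature names are our inventions; the real system sums over taggings by dynamic programming on lattices, not by enumeration):

```python
import math
from itertools import product

TAGS = ["JJ", "NNS", "MD", "VB"]  # toy tag inventory

def toy_score(words, tags, weights):
    # Stand-in for the log-linear score theta . f(x, y):
    # tag/word weights plus tag-bigram weights.
    s = 0.0
    for w, t in zip(words, tags):
        s += weights.get(("emit", t, w), 0.0)
    for t1, t2 in zip(tags, tags[1:]):
        s += weights.get(("bigram", t1, t2), 0.0)
    return s

def log_mass(words, weights):
    # log unnormalized mass of one sentence, summed over ALL of its
    # taggings (brute force: exponential in sentence length).
    return math.log(sum(math.exp(toy_score(words, t, weights))
                        for t in product(TAGS, repeat=len(words))))

def trans1(words):
    yield list(words)
    for i in range(len(words) - 1):
        yield words[:i] + [words[i + 1], words[i]] + words[i + 2:]

def ce_log_objective(words, weights):
    # log(numerator / denominator): the observed sentence vs. its
    # whole TRANS1 neighborhood, each summed over all taggings.
    num = log_mass(words, weights)
    den = math.log(sum(math.exp(log_mass(n, weights))
                       for n in trans1(words)))
    return num - den

sentence = "red leaves don't hide blue jays".split()
print(ce_log_objective(sentence, {("emit", "JJ", "red"): 1.0}))
```

Training would hand the negated objective (and its gradient, a difference of expected feature vectors under the numerator and denominator distributions) to a generic optimizer, per slide 10.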

Slide 18: [Figure: the same objective with the TRANS1 neighborhood compactly encoded as a lattice, so the denominator's sentences (with any tagging) are summed by dynamic programming rather than enumerated one by one.]

Slide 19: The New Modeling Imperative. "Make the good sentence likely, at the expense of those bad neighbors." A good sentence hints that a set of bad ones is nearby. [Figure: numerator vs. denominator ("neighborhood").]

Slide 20 (recap): This talk is about denominators... in the unsupervised case. A good denominator can improve accuracy and tractability.

Slide 21: Log-Linear Models. The model assigns every (x, y) a score, normalized by the partition function Z. Computing Z is undesirable: it sums over all possible taggings of all possible sentences! Conditional estimation (supervised) shrinks the sum to 1 sentence; contrastive estimation (unsupervised) shrinks it to a few sentences.
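In symbols (our notation for the standard setup the slide describes):

```latex
u_\theta(x, y) = \exp\big(\theta \cdot \mathbf{f}(x, y)\big), \qquad
p_\theta(x, y) = \frac{u_\theta(x, y)}{Z(\theta)}, \qquad
Z(\theta) = \sum_{(x', y') \,\in\, \Sigma^* \times \Lambda^*} u_\theta(x', y')

% Conditional estimation (supervised): the sum covers 1 sentence.
p_\theta(y \mid x) = \frac{u_\theta(x, y)}{\sum_{y'} u_\theta(x, y')}

% Contrastive estimation (unsupervised): the sum covers a few sentences.
\frac{\sum_{y} u_\theta(x, y)}{\sum_{x' \in N(x)} \sum_{y'} u_\theta(x', y')}
```

Because each criterion is a ratio, the intractable Z(θ) cancels; only the denominator set actually needs to be summed.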

Slide 22: A Big Picture: Sequence Model Estimation. [Figure: a map of estimators by three desiderata: tractable sums, overlapping features, and unannotated data. Generative MLE of p(x, y) has tractable sums; log-linear conditional estimation of p(y | x) adds overlapping features; generative EM for p(x) handles unannotated data; log-linear MLE of p(x, y) and log-linear EM for p(x) have overlapping features but intractable sums; log-linear CE with lattice neighborhoods sits at the intersection of all three.]

Slide 23: Contrastive Neighborhoods. Guide the learner toward models that do what syntax is supposed to do. Lattice representation → efficient algorithms. There is an art to choosing neighborhood functions.

Slide 24: Neighborhoods.

  Neighborhood       Size    Lattice arcs   Perturbations
  DEL1WORD           n + 1   O(n)           delete up to 1 word
  TRANS1             n       O(n)           transpose any bigram
  DELORTRANS1        O(n)    O(n)           DEL1WORD ∪ TRANS1
  DEL1SUBSEQUENCE    O(n²)   O(n²)          delete any contiguous subsequence
  Σ* (EM)            ∞       -              replace each word with anything
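A throwaway sanity check on the size column (our script; we keep the original sentence and forbid deleting the whole thing):

```python
def del1subseq(words):
    # DEL1SUBSEQUENCE: the sentence plus every deletion of one
    # contiguous, non-empty, proper subsequence -- O(n^2) neighbors.
    yield list(words)
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j - i < n:
                yield words[:i] + words[j:]

sentence = "red leaves don't hide blue jays".split()
print(sum(1 for _ in del1subseq(sentence)))  # 21 = 1 original + 20 deletions
```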

Slide 25: The Merialdo (1994) Task. Given unlabeled text and a POS dictionary (which lists all possible tags for each word type), learn to tag. The dictionary is a form of supervision.

Slide 26: Trigram Tagging Model. [Figure: "red leaves don't hide blue jays" tagged JJ NNS MD VB JJ NNS.] Feature set: tag trigrams, and tag/word pairs from a POS dictionary.
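For concreteness, a sketch of extracting this feature set from one tagged sentence (the feature-name tuples and the boundary padding are our assumptions):

```python
def tag_features(words, tags):
    feats = []
    # Tag trigrams, with boundary symbols (padding scheme assumed).
    padded = ["<s>", "<s>"] + tags + ["</s>"]
    for i in range(len(padded) - 2):
        feats.append(("trigram", padded[i], padded[i + 1], padded[i + 2]))
    # Tag/word emission pairs (licensed by the POS dictionary).
    for w, t in zip(words, tags):
        feats.append(("emit", t, w))
    return feats

words = "red leaves don't hide blue jays".split()
tags = ["JJ", "NNS", "MD", "VB", "JJ", "NNS"]
for f in tag_features(words, tags):
    print(f)
```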

Slide 27: [Figure: bar chart of tagging accuracy on 96K words with the full POS dictionary, an uninformative initializer, and the best of 8 smoothing conditions. Bars: random, EM (Merialdo, 1994), DA EM (Smith & Eisner, 2004), and CE with the LENGTH (≈ log-linear EM), DEL1SUBSEQUENCE, DEL1WORD, TRANS1, and DELORTRANS1 neighborhoods; supervised HMM and CRF results, and a 10×-data supervised result, mark the ceiling.]

Slide 28: What if we damage the POS dictionary? Dictionary includes: all words; words from the 1st half of the corpus; words with count ≥ 2; words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.

Slide 29: [Figure: tagging accuracy under each of the four dictionary conditions above (96K words, 17 coarse POS tags, uninformative initializer), comparing random, EM, LENGTH, and DELORTRANS1.]

Slide 30: Trigram Tagging Model + Spelling. [Figure: "red leaves don't hide blue jays" tagged JJ NNS MD VB JJ NNS.] Feature set: tag trigrams; tag/word pairs from a POS dictionary; and spelling features (1- to 3-character suffixes, contains hyphen, contains digit).
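The added spelling features are simple per-word predicates; a sketch (again with our own feature names, conjoined with the tag like the tagging features above):

```python
def spelling_features(word, tag):
    # 1- to 3-character suffixes, plus hyphen and digit indicators.
    feats = [("suffix", tag, word[-k:]) for k in (1, 2, 3) if len(word) >= k]
    if "-" in word:
        feats.append(("has_hyphen", tag))
    if any(c.isdigit() for c in word):
        feats.append(("has_digit", tag))
    return feats

print(spelling_features("leaves", "NNS"))
# [('suffix', 'NNS', 's'), ('suffix', 'NNS', 'es'), ('suffix', 'NNS', 'ves')]
```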

Slide 31: [Figure: the same comparison (random, EM, LENGTH, DELORTRANS1) with spelling features added to LENGTH and DELORTRANS1.] Spelling features aided recovery, but only with a smart neighborhood.

Slide 32: The model need not be finite-state.

Slide 33: Unsupervised Dependency Parsing. [Figure: attachment accuracy of the Klein & Manning (2004) model trained with EM vs. CE with the LENGTH and TRANS1 neighborhoods, across initializers.] See our paper at the IJCAI 2005 Grammatical Inference workshop.

Slide 34: To Sum Up... Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both). Now we can use the task to guide the unsupervised learner, like discriminative techniques do for supervised learners. It's a particularly good fit for log-linear models: "max ent" features plus unsupervised sequence models, all in time for ACL.


