Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
ACL 2005

Nutshell Version
Contrastive estimation with lattice neighborhoods: tractable training of "max ent" (log-linear) sequence models on unannotated text.
Experiments on unlabeled data:
POS tagging: 46% error rate reduction (relative to EM).
"Max ent" features make it possible to survive damage to the tag dictionary.
Dependency parsing: 21% attachment error reduction (relative to EM).

“Red leaves don’t hide blue jays.”

Maximum Likelihood Estimation (Supervised)
y: JJ NNS MD VB JJ NNS
x: red leaves don’t hide blue jays
Numerator: the observed (x, y) pair. Denominator: Σ* × Λ*, all word sequences paired with all tag sequences.
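In symbols (a sketch in the talk's notation, where Σ* is all word strings and Λ* is all tag strings):

$$\max_{\vec{\theta}} \prod_i \frac{p_{\vec{\theta}}(x_i, y_i)}{\sum_{(x,y) \in \Sigma^* \times \Lambda^*} p_{\vec{\theta}}(x, y)}$$

For a normalized generative model the denominator is 1; for a log-linear model it is the partition function.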

Maximum Likelihood Estimation (Unsupervised)
y: ? ? ? ? ? ?
x: red leaves don’t hide blue jays
Numerator: the observed words x, summed over all possible taggings. Denominator: Σ* × Λ*. This is what EM does.
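The unsupervised objective marginalizes out the hidden tagging (a sketch in the same notation):

$$\max_{\vec{\theta}} \prod_i \frac{\sum_{y \in \Lambda^*} p_{\vec{\theta}}(x_i, y)}{\sum_{(x,y) \in \Sigma^* \times \Lambda^*} p_{\vec{\theta}}(x, y)}$$

EM climbs this likelihood to a local maximum.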

Focusing Probability Mass
Estimation shifts probability mass from the denominator set onto the numerator: make the observed data likely at the expense of the alternatives.

Conditional Estimation (Supervised)
y: JJ NNS MD VB JJ NNS
x: red leaves don’t hide blue jays
Numerator: the observed (x, y) pair. Denominator: {x} × Λ*, the observed words under every possible tagging. A different denominator!
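A sketch of the conditional objective:

$$\max_{\vec{\theta}} \prod_i \frac{p_{\vec{\theta}}(x_i, y_i)}{\sum_{y \in \Lambda^*} p_{\vec{\theta}}(x_i, y)}$$

The per-example denominator now sums over taggings of the observed sentence only.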

Objective Functions

Objective | Optimization Algorithm | Numerator | Denominator
MLE | Count & Normalize* | tags & words | Σ* × Λ*
MLE with hidden variables | EM* | words | Σ* × Λ*
Conditional Likelihood | Iterative Scaling | tags & words | (words) × Λ*
Perceptron | Backprop | tags & words | hypothesized tags & words
Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS) | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?

*For generative models.

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Language Learning (Syntax)
red leaves don’t hide blue jays
EM asks: why didn’t he say “birds fly” or “dancing granola” or “the wash dishes” or any other sequence of words?

Language Learning (Syntax)
red leaves don’t hide blue jays
Why did he pick that sequence for those words? Why not say “leaves red ...” or “... hide don’t ...” or ...?

What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood. ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation

The Job of Syntax
“Explain why each word is necessary.” → DEL1WORD neighborhood of “red leaves don’t hide blue jays”:
  leaves don’t hide blue jays
  red don’t hide blue jays
  red leaves hide blue jays
  red leaves don’t blue jays
  red leaves don’t hide jays
  red leaves don’t hide blue
  (plus the sentence itself)

The Job of Syntax
“Explain the (local) order of the words.” → TRANS1 neighborhood of “red leaves don’t hide blue jays”:
  leaves red don’t hide blue jays
  red don’t leaves hide blue jays
  red leaves hide don’t blue jays
  red leaves don’t blue hide jays
  red leaves don’t hide jays blue
  (plus the sentence itself)

Numerator: p(red leaves don’t hide blue jays), summed over all taggings.
Denominator: the total probability of the sentences in the TRANS1 neighborhood (leaves red don’t hide blue jays; red don’t leaves hide blue jays; red leaves hide don’t blue jays; red leaves don’t blue hide jays; red leaves don’t hide jays blue), each with any tagging.

The New Modeling Imperative
A good sentence hints that a set of bad ones is nearby. Numerator: the good sentence. Denominator: its “neighborhood” of bad ones. “Make the good sentence likely, at the expense of those bad neighbors.”
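Combining the unsupervised numerator with a neighborhood denominator N(x) gives the contrastive estimation objective:

$$\max_{\vec{\theta}} \prod_i \frac{\sum_{y \in \Lambda^*} p_{\vec{\theta}}(x_i, y)}{\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} p_{\vec{\theta}}(x', y)}$$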

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Log-Linear Models
The score of (x, y) is normalized by Z, the partition function. Computing Z is undesirable: it sums over all possible taggings of all possible sentences! Conditional estimation (supervised) normalizes over 1 sentence; contrastive estimation (unsupervised) normalizes over a few sentences.
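A sketch of the log-linear form, with feature vector f and weights θ:

$$p_{\vec{\theta}}(x, y) = \frac{\exp\big(\vec{\theta} \cdot \vec{f}(x, y)\big)}{Z(\vec{\theta})}, \qquad Z(\vec{\theta}) = \sum_{(x', y') \in \Sigma^* \times \Lambda^*} \exp\big(\vec{\theta} \cdot \vec{f}(x', y')\big)$$

Conditional and contrastive estimation never compute Z: each training example gets its own small normalizer, over {x} × Λ* or N(x) × Λ* respectively, and the global Z cancels out of the ratio.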

A Big Picture: Sequence Model Estimation
What we want: unannotated data, tractable sums, overlapping features.
  generative, MLE: p(x, y) — tractable sums
  generative, EM: p(x) — unannotated data, tractable sums
  log-linear, MLE: p(x, y) — overlapping features
  log-linear, conditional estimation: p(y | x) — overlapping features, tractable sums
  log-linear, EM: p(x) — unannotated data, overlapping features
  log-linear, CE with lattice neighborhoods — all three

Contrastive Neighborhoods
Guide the learner toward models that do what syntax is supposed to do.
Lattice representation → efficient algorithms.
There is an art to choosing neighborhood functions.

Neighborhoods

Neighborhood | Size | Lattice arcs | Perturbations
DEL1WORD | n+1 | O(n) | delete up to 1 word
TRANS1 | n | O(n) | transpose any bigram
DELORTRANS1 | O(n) | O(n) | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²) | delete any contiguous subsequence
Σ* (EM) | ∞ | — | replace each word with anything
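The talk encodes these neighborhoods as lattices so the inner sums run efficiently; purely for illustration, here is a brute-force enumeration of two of them (a sketch; function names are ours, not from the authors' code):

```python
def del1word(words):
    """DEL1WORD: the sentence plus every way to delete one word (n+1 strings)."""
    yield tuple(words)
    for i in range(len(words)):
        yield tuple(words[:i] + words[i+1:])

def trans1(words):
    """TRANS1: the sentence plus its n-1 adjacent transpositions (n strings)."""
    yield tuple(words)
    for i in range(len(words) - 1):
        w = list(words)
        w[i], w[i + 1] = w[i + 1], w[i]
        yield tuple(w)

for s in trans1("red leaves don't hide blue jays".split()):
    print(" ".join(s))
```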

The Merialdo (1994) Task
Given unlabeled text and a POS dictionary (that tells all possible tags for each word type), learn to tag. The dictionary is a form of supervision.

Trigram Tagging Model
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
Feature set: tag trigrams; tag/word pairs from a POS dictionary.
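As a sketch of how such a feature vector might be computed (the function and feature names are hypothetical, not the authors' implementation):

```python
from collections import Counter

def features(words, tags):
    """Count tag-trigram and tag/word features for one (sentence, tagging) pair."""
    f = Counter()
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]
    for i in range(len(padded) - 2):
        f[("trigram", padded[i], padded[i + 1], padded[i + 2])] += 1
    for w, t in zip(words, tags):
        f[("emit", t, w)] += 1  # t must be a tag the POS dictionary allows for w
    return f

f = features("red leaves don't hide blue jays".split(),
             ["JJ", "NNS", "MD", "VB", "JJ", "NNS"])
```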

[Results chart: tagging accuracy on 96K words with the full POS dictionary, an uninformative initializer, and the best of 8 smoothing conditions. Systems compared: supervised HMM and CRF; DELORTRANS1; TRANS1; DEL1WORD; DEL1SUBSEQUENCE; LENGTH (≈ log-linear EM); EM (Merialdo 1994; also EM with 10 × data); DA (Smith & Eisner, 2004); random.]

What if we damage the POS dictionary?
Dictionary includes: all words; words from the 1st half of the corpus; words with count ≥ 2; words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.

[Results chart: tagging accuracy as the dictionary is damaged (all words; words from the 1st half of the corpus; count ≥ 2; count ≥ 3; excluded OOV words can get any tag), on 96K words with 17 coarse POS tags and an uninformative initializer. Systems: DELORTRANS1, LENGTH, EM, random.]

Trigram Tagging Model + Spelling
JJ NNS MD VB JJ NNS
red leaves don’t hide blue jays
Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains hyphen; contains digit.
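A matching sketch of the spelling features (again with invented names):

```python
from collections import Counter

def spelling_features(word, tag):
    """Suffix and character-class features pairing the tag with word shape."""
    f = Counter()
    for k in (1, 2, 3):          # 1- to 3-character suffixes
        if len(word) > k:
            f[("suffix", tag, word[-k:])] += 1
    if "-" in word:
        f[("has-hyphen", tag)] += 1
    if any(c.isdigit() for c in word):
        f[("has-digit", tag)] += 1
    return f
```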

Spelling features aided recovery, but only with a smart neighborhood.
[Results chart: accuracy under dictionary damage. Systems: DELORTRANS1 ± spelling, LENGTH ± spelling, EM, random.]

The model need not be finite-state. ACL 2005 • N. A. Smith and J. Eisner • Contrastive Estimation

Unsupervised Dependency Parsing
[Results chart: attachment accuracy of the Klein & Manning (2004) model, by initializer. Systems: TRANS1, LENGTH, EM.]
See our paper at the IJCAI 2005 Grammatical Inference workshop.

To Sum Up ...
Contrastive estimation means picking your own denominator: for tractability, or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (as discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models: unsupervised sequence models with “max ent” features, all in time for ACL 2006.
