Statistical NLP: Lecture 8 Statistical Inference: n-gram Models over Sparse Data (Ch 6)

Overview Statistical Inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about this distribution. There are three issues to consider: –Dividing the training data into equivalence classes –Finding a good statistical estimator for each equivalence class –Combining multiple estimators

Forming Equivalence Classes I Classification problem: try to predict the target feature based on various classificatory features. ==> Reliability versus discrimination. Markov assumption: only the prior local context affects the next entry: an (n-1)th order Markov model, or n-gram model. Size of the n-gram model versus number of parameters: we would like n to be large, but the number of parameters increases exponentially with n. There exist other ways to form equivalence classes of the history, but they require more complicated methods ==> we will use n-grams here.
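
To make the parameter blow-up concrete, here is a small arithmetic sketch: an n-gram model conditions on the previous n-1 words, so over a vocabulary of V words it has on the order of V^(n-1)·(V-1) free parameters. The vocabulary size below is an assumed, illustrative figure.

```python
# Rough parameter counts for an n-gram model over a vocabulary of V words.
V = 20_000          # assumed vocabulary size, for illustration only
for n in (2, 3, 4):
    params = V ** (n - 1) * (V - 1)   # one distribution per (n-1)-word history
    print(f"n={n}: about {params:.2e} parameters")
```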

Statistical Estimators I: Overview Goal: to derive a good probability estimate for the target feature based on observed data. Running example: from n-gram data P(w1,..,wn), predict P(wn+1|w1,..,wn). Solutions we will look at: –Maximum Likelihood Estimation –Laplace’s, Lidstone’s and Jeffreys-Perks’ Laws –Held Out Estimation –Cross-Validation –Good-Turing Estimation

Statistical Estimators II: Maximum Likelihood Estimation P_MLE(w1,..,wn) = C(w1,..,wn)/N, where C(w1,..,wn) is the frequency of the n-gram w1,..,wn in the training data and N is the total number of training n-gram instances. P_MLE(wn|w1,..,wn-1) = C(w1,..,wn)/C(w1,..,wn-1). This estimate is called the Maximum Likelihood Estimate (MLE) because it is the choice of parameters that gives the highest probability to the training corpus. MLE is usually unsuitable for NLP because of the sparseness of the data ==> use a discounting or smoothing technique.
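
A minimal sketch of the MLE bigram estimate over a made-up toy corpus; it only illustrates the counting, not a full language model.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()   # toy corpus (assumed)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(w2, w1):
    """P_MLE(w2 | w1) = C(w1, w2) / C(w1)."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_mle("cat", "the"))   # 2/3: "the" occurs 3 times, "the cat" twice
print(p_mle("dog", "the"))   # 0.0: an unseen bigram gets no probability mass at all
```

The second call shows the sparseness problem the slide mentions: any n-gram absent from training gets probability zero, which is what the smoothing techniques below try to repair.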

Example

Statistical Estimators III: Smoothing Techniques: Laplace P_LAP(w1,..,wn) = (C(w1,..,wn)+1)/(N+B), where C(w1,..,wn) is the frequency of the n-gram w1,..,wn and B is the number of bins the training instances are divided into (e.g., the number of possible n-grams). ==> the adding-one process. The idea is to give a little bit of the probability space to unseen events. However, in NLP applications, which are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.
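
A sketch of the add-one estimate; N and B below are made-up numbers chosen only to show how a large B dominates the denominator in sparse n-gram spaces.

```python
def p_laplace(c, N, B):
    """Add-one (Laplace) estimate: (C + 1) / (N + B)."""
    return (c + 1) / (N + B)

# Assumed toy numbers: 10,000 training bigrams over a 1,000-word vocabulary,
# so B = 1,000**2 possible bigram bins.
N, B = 10_000, 1_000 ** 2
print(p_laplace(5, N, B))   # a bigram seen 5 times
print(p_laplace(0, N, B))   # every unseen bigram still receives 1 / (N + B)
# With ~990,000 unseen bins, unseen events soak up most of the probability
# space, which is exactly the over-smoothing problem the slide describes.
```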

Example

Statistical Estimators IV: Smoothing Techniques: Lidstone and Jeffreys-Perks Since the adding-one process may be adding too much, we can instead add a smaller value λ. P_LID(w1,..,wn) = (C(w1,..,wn)+λ)/(N+Bλ), where C(w1,..,wn) is the frequency of the n-gram w1,..,wn, B is the number of bins the training instances are divided into, and λ > 0. ==> Lidstone’s Law. If λ = 1/2, Lidstone’s Law corresponds to the expectation of the likelihood and is called Expected Likelihood Estimation (ELE) or the Jeffreys-Perks Law.
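
A sketch of the Lidstone estimate with the same assumed toy numbers as above; λ = 1/2 gives the ELE / Jeffreys-Perks case and λ = 1 recovers Laplace.

```python
def p_lidstone(c, N, B, lam=0.5):
    """Lidstone estimate: (C + lambda) / (N + B*lambda); lam = 0.5 is ELE."""
    return (c + lam) / (N + B * lam)

N, B = 10_000, 1_000 ** 2            # same assumed toy numbers as above
print(p_lidstone(5, N, B, lam=1.0))  # identical to the Laplace estimate
print(p_lidstone(5, N, B, lam=0.5))  # ELE: seen events keep more of their mass
print(p_lidstone(0, N, B, lam=0.5))  # unseen events get lambda / (N + B*lambda)
```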

Statistical Estimators V: Robust Techniques: Held Out Estimation For each n-gram w1,..,wn, we compute C1(w1,..,wn) and C2(w1,..,wn), the frequencies of w1,..,wn in the training and held-out data, respectively. Let N_r be the number of n-grams with frequency r in the training text. Let T_r be the total number of times that all n-grams that appeared r times in the training text appeared in the held-out data. An estimate for the probability of one of these n-grams is: P_ho(w1,..,wn) = T_r/(N_r N), where C(w1,..,wn) = r.
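
A minimal sketch of held-out estimation over toy bigram lists (all data below is made up). It takes N to be the size of the held-out sample, which is an assumption of this sketch; the slide's formula leaves N implicit.

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """For every training frequency r, P_ho = T_r / (N_r * N): the probability
    assigned to each individual n-gram that was seen r times in training."""
    train_counts = Counter(train_ngrams)
    heldout_counts = Counter(heldout_ngrams)
    N = len(heldout_ngrams)                      # size of the held-out sample

    n_r = Counter(train_counts.values())         # N_r: types with training count r
    t_r = Counter()                              # T_r: their held-out occurrences
    for ngram, r in train_counts.items():
        t_r[r] += heldout_counts[ngram]

    return {r: t_r[r] / (n_r[r] * N) for r in n_r}

# Made-up bigram lists, just to exercise the function:
train = [("a", "b")] * 3 + [("b", "c")] * 3 + [("c", "d")]
heldout = [("a", "b")] * 2 + [("b", "c")] + [("c", "d")] * 2 + [("d", "e")]
print(held_out_probs(train, heldout))   # e.g. r=3: T_3 = 3, N_3 = 2, N = 6
```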

Statistical Estimators VI: Robust Techniques: Cross-Validation Held-out estimation is useful if there is a lot of data available. If not, it is useful to use each part of the data both as training data and as held-out data. Deleted Estimation [Jelinek & Mercer, 1985]: let N_r^a be the number of n-grams occurring r times in the a-th part of the training data and T_r^ab be the total occurrences of those n-grams from part a in part b. Then P_del(w1,..,wn) = (T_r^ab + T_r^ba)/(N(N_r^a + N_r^b)), where C(w1,..,wn) = r. Leave-One-Out [Ney et al., 1997].
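
A small sketch of the deleted-estimation formula itself; the input statistics are assumed to have been computed from the two halves of the training data, and the numbers in the usage line are invented.

```python
def p_deleted(t_r_ab, t_r_ba, n_r_a, n_r_b, N):
    """Deleted estimation: P_del = (T_r^ab + T_r^ba) / (N * (N_r^a + N_r^b)),
    the probability given to one n-gram whose count in its half was r."""
    return (t_r_ab + t_r_ba) / (N * (n_r_a + n_r_b))

# Invented statistics for singletons (r = 1): 500 / 480 singleton types in the
# two halves, appearing 420 / 390 times in the opposite half; N = 100,000 tokens.
print(p_deleted(t_r_ab=420, t_r_ba=390, n_r_a=500, n_r_b=480, N=100_000))
```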

Statistical Estimators VII: Related Approach: Good-Turing Estimator If C(w1,..,wn) = r > 0, P_GT(w1,..,wn) = r*/N, where r* = (r+1)N_{r+1}/N_r. If C(w1,..,wn) = 0, P_GT(w1,..,wn) ≈ N_1/(N_0 N). Simple Good-Turing [Gale & Sampson, 1995]: use a smoothed estimate of the expectation of N_r. As a smoothing curve, use N_r = a·r^b (with b < -1) and estimate a and b by simple linear regression on the logarithmic form of this equation, log N_r = log a + b·log r, if r is large. For low values of r, use the measured N_r directly.
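
A sketch of plain (unsmoothed) Good-Turing over a counts table; it uses the raw N_r values, so the adjusted count is only defined where N_{r+1} > 0, which is exactly the gap that Simple Good-Turing's regression on log N_r fills. All counts are made up.

```python
from collections import Counter

def good_turing(ngram_counts, N):
    """r* = (r + 1) * N_{r+1} / N_r and P_GT = r* / N for seen n-grams;
    the total mass left over for unseen n-grams is roughly N_1 / N."""
    n_r = Counter(ngram_counts.values())
    probs = {}
    for ngram, r in ngram_counts.items():
        if n_r[r + 1] > 0:                       # raw N_{r+1}; SGT smooths this
            r_star = (r + 1) * n_r[r + 1] / n_r[r]
            probs[ngram] = r_star / N
    return probs, n_r[1] / N                     # (seen probs, unseen mass)

# Toy counts (made up): three singletons, two doubletons, one triple; N = 10.
counts = {"ab": 1, "cd": 1, "ef": 1, "gh": 2, "ij": 2, "kl": 3}
print(good_turing(counts, N=10))
```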

Good-Turing Smoothing (example) In the Brown Corpus, suppose that for n = 2 we have N_2 = 4000 and N_3 = 2400. Then 2* = 3 × (2400/4000) = 1.8. Similarly, P_GT(jungle|green) = 3*/207 = 2.2/207 ≈ 0.011.

Good-Turing Smoothing (example)

Combining Estimators I: Overview If we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model. Combination Methods Considered: –Simple Linear Interpolation –Katz’s Backing Off –General Linear Interpolation

Combining Estimators II: Simple Linear Interpolation One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness. This can be done by linear interpolation (also called finite mixture models). When the functions being interpolated all use a subset of the conditioning information of the most discriminating function, this method is referred to as deleted interpolation. P_li(w_n|w_{n-2},w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n|w_{n-1}) + λ_3 P_3(w_n|w_{n-2},w_{n-1}), where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1. The weights can be set automatically using the Expectation-Maximization (EM) algorithm.
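
A minimal interpolation sketch. The component models are stand-in lambdas with made-up constant values, and the weights are fixed by hand rather than fit with EM, purely to show the mixing step.

```python
def p_interp(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | prev2, prev1) = l1*P1(w) + l2*P2(w|prev1) + l3*P3(w|prev2, prev1)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9        # weights must sum to one
    return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)

# Stand-in component models with invented probabilities:
print(p_interp("jungle", "the", "green",
               p_uni=lambda w: 1e-4,
               p_bi=lambda w, w1: 2e-3,
               p_tri=lambda w, w2, w1: 0.0))     # trigram unseen; lower orders fill in
```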

Combining Estimators III: Katz’s Backing Off Model In back-off models, different models are consulted in order, depending on their specificity. If the n-gram of concern has appeared more than k times, then an n-gram estimate is used, but an amount of the MLE estimate gets discounted (it is reserved for unseen n-grams). If the n-gram occurred k times or fewer, then we use an estimate from a shorter n-gram (the back-off probability), normalized by the amount of probability remaining and the amount of data covered by this estimate. The process continues recursively.
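
The sketch below shows only the back-off control flow, not the full Katz model: the back-off weight alpha is a fixed constant here, so the result is not properly normalized (real Katz back-off computes alpha(history) from the discounted mass), and all counts in the usage lines are invented.

```python
def backoff_prob(w, history, counts, total, vocab_size,
                 discount=0.5, alpha=0.4, k=0):
    """If the n-gram (history + w) was seen more than k times, return a
    discounted MLE estimate; otherwise recurse on the shortened history."""
    ngram = tuple(history) + (w,)
    if len(ngram) == 1:                              # unigram base case
        c = counts.get(ngram, 0)
        return (c - discount) / total if c > k else alpha / vocab_size
    hist_count = counts.get(tuple(history), 0)
    if counts.get(ngram, 0) > k and hist_count > 0:
        return (counts[ngram] - discount) / hist_count
    return alpha * backoff_prob(w, history[1:], counts, total,
                                vocab_size, discount, alpha, k)

# Invented counts: "green" seen 207 times, "green jungle" twice.
counts = {("green",): 207, ("green", "jungle"): 2, ("jungle",): 5}
print(backoff_prob("jungle", ("green",), counts, total=100_000, vocab_size=10_000))
print(backoff_prob("tree", ("green",), counts, total=100_000, vocab_size=10_000))
```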

Katz’s Backing Off Model (3-grams)

Katz’s Backing Off Model (2-grams)

Combining Estimators IV: General Linear Interpolation In simple linear interpolation, the weights were just a single number, but one can define a more general and powerful model where the weights are a function of the history. For k probability functions P_1,..,P_k, the general form of a linear interpolation model is: P_li(w|h) = Σ_i λ_i(h) P_i(w|h), where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1.
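
A sketch of history-dependent weights, with every component invented: a toy weight function trusts the higher-order model more when the history was frequent in training, and the component models are stand-in lambdas.

```python
def p_general_interp(w, history, models, weight_fn):
    """P_li(w | h) = sum_i lambda_i(h) * P_i(w | h), with weights from weight_fn."""
    weights = weight_fn(history)
    assert abs(sum(weights) - 1.0) < 1e-9            # lambdas(h) must sum to one
    return sum(lam * p(w, history) for lam, p in zip(weights, models))

history_count = {("green",): 207}                    # invented training statistic
def weight_fn(h):
    # trust the bigram-style model more for histories seen at least 100 times
    return (0.1, 0.9) if history_count.get(h, 0) >= 100 else (0.7, 0.3)

models = [lambda w, h: 1e-3,                         # unigram-style fallback (stand-in)
          lambda w, h: 1e-2 if w == "jungle" else 0.0]
print(p_general_interp("jungle", ("green",), models, weight_fn))
```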