N-gram model limitations An important question was asked in class: what do we do about N-grams that were not in our training corpus? The answer given: we distribute some probability mass from seen N-grams to this new N-gram. This leads to another question: how do we do this?

Unsmoothed bigrams Recall that we use unigram and bigram counts to compute bigram probabilities: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
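As a concrete illustration of this formula, here is a minimal Python sketch (not from the slides; the function and variable names are illustrative) that builds unigram and bigram counts from a token list and computes the unsmoothed bigram probability:

```python
from collections import Counter

def train_counts(tokens):
    """Count unigram and bigram tokens in a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w_prev, w):
    """Unsmoothed bigram probability P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}).
    Assumes w_prev occurred in the training data (otherwise the denominator is zero)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]
```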

Recall exercise from last class Suppose a text has N words; how many bigram tokens does it contain? At most N: we assume a start symbol <s> appearing before the first word, so that we also get a bigram probability for the word in the initial position. Example (5 words): – words: w1 w2 w3 w4 w5 – bigrams: <s> w1, w1 w2, w2 w3, w3 w4, w4 w5
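For instance, a small Python sketch of this bookkeeping (illustrative, not from the slides) showing that a 5-word text yields exactly 5 bigram tokens once the assumed start symbol <s> is prepended:

```python
words = ["w1", "w2", "w3", "w4", "w5"]   # a 5-word text
tokens = ["<s>"] + words                 # assumed start-of-text symbol
bigrams = list(zip(tokens, tokens[1:]))
print(len(bigrams))  # 5
print(bigrams)       # [('<s>', 'w1'), ('w1', 'w2'), ('w2', 'w3'), ('w3', 'w4'), ('w4', 'w5')]
```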

How many possible bigrams are there? With a vocabulary of V word types, there are V^2 possible bigrams (note that V is the vocabulary size, not the length N of the text from the previous slide).

Example description Berkeley Restaurant Project corpus – approximately 10,000 sentences – 1616 word types – tables will show counts or probabilities for 7 word types, carefully chosen so that the 7 by 7 matrix is not too sparse – notice that many counts in first table are zero (25 zeros of 49 entries)

Unsmoothed N-grams Bigram counts C(w_{n-1} w_n) for the seven words I, want, to, eat, Chinese, food, lunch (figure 6.4 from text); the 7-by-7 table of counts is not reproduced here.

Computing probabilities Recall the formula (we normalize by unigram counts): P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). The unigram counts for I, want, to, eat, Chinese, food, lunch are given in a table (not reproduced here). Examples: p(eat | to) = c(to eat) / c(to) = 860 / 3256 = .26; p(to | eat) = c(eat to) / c(eat) = 2 / 938 = .0021.
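A quick arithmetic check of these two examples, using the counts quoted on the slide:

```python
c_to, c_to_eat = 3256, 860   # unigram count of "to", bigram count of "to eat"
c_eat, c_eat_to = 938, 2     # unigram count of "eat", bigram count of "eat to"
print(round(c_to_eat / c_to, 2))   # 0.26   -> p(eat | to)
print(round(c_eat_to / c_eat, 4))  # 0.0021 -> p(to | eat)
```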

Unsmoothed N-grams Bigram probabilities p(w_n | w_{n-1}) for the same seven words (figure 6.5 from text); the table of probabilities is not reproduced here.

What do zeros mean? Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus. So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?

Let’s rephrase the question How can we ensure that none of the possible bigrams have zero counts/probabilities? The process of spreading probability mass around to all possible bigrams is called smoothing. We start with a very simple method: add-one smoothing.

Add-one smoothing counts New counts are obtained by adding one to the original counts across the board. This ensures that there are no zero counts, but it typically gives too much probability mass to non-occurring bigrams.

Add-one smoothing probabilities Unadjusted probabilities: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). Adjusted probabilities: P*(w_n | w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V], where V is the total number of word types in the vocabulary. In the numerator we add one to the count of each bigram, as with the plain counts. In the denominator we add V, since we are adding one more bigram token of the form w_{n-1} w for each w in our vocabulary.
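A minimal sketch of the adjusted estimate in Python (illustrative names, not from the slides; it reuses count dictionaries like the ones built in the earlier sketch):

```python
def add_one_prob(unigrams, bigrams, V, w_prev, w):
    """Add-one (Laplace) smoothed bigram probability:
    P*(w_n | w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V]."""
    return (bigrams.get((w_prev, w), 0) + 1) / (unigrams.get(w_prev, 0) + V)
```

With the counts quoted a few slides later (V = 1616), this gives (860 + 1) / (3256 + 1616) ≈ .18 for p*(eat | to), matching the worked example below.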

A simple approach to smoothing: Add-one smoothing Add-one smoothed bigram counts for the same seven words (figure 6.6 from text); the table is not reproduced here.

Calculating the probabilities Recall the formula for the adjusted probabilities: P*(w_n | w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V]. The unigram counts, adjusted by adding V = 1616, are given in a table (not reproduced here). Examples: p(eat | to) = [c(to eat) + 1] / [c(to) + V] = 861 / 4872 = .18 (was .26); p(to | eat) = [c(eat to) + 1] / [c(eat) + V] = 3 / 2554 = .0012 (was .0021); p(eat | lunch) = [c(lunch eat) + 1] / [c(lunch) + V] = 1 / 2075 ≈ .00048 (was 0); p(eat | want) = [c(want eat) + 1] / [c(want) + V] = 1 / 2931 ≈ .00034 (was 0).

A simple approach to smoothing: Add-one smoothing Add-one smoothed bigram probabilities for the same seven words (figure 6.7 from text); the table is not reproduced here.

Discounting We can define the discount to be the ratio of new to old counts (in our case, smoothed to unsmoothed counts). A table (not reproduced here) shows the add-one discounts for each of the seven words in this example.

Witten-Bell discounting Another approach to smoothing. Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211] The total probability mass assigned to all (as yet) unseen bigrams is T / (T + N), where T is the total number of observed types and N is the number of tokens. “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211] The formula above estimates “the probability of a new type event occurring.” [p. 211]

Distribution of probability mass This probability mass is distributed evenly amongst the unseen bigrams. Let Z be the number of zero-count bigrams; then each unseen bigram gets p_i* = T / [Z (N + T)].

Discounting This probability mass has to come from somewhere! For seen bigrams, p_i* = c_i / (N + T) if c_i > 0. The smoothed counts are c_i* = (T / Z) · N / (N + T) if c_i = 0 (work back from the probability formula), and c_i* = c_i · N / (N + T) if c_i > 0.
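Putting these formulas together, here is a minimal Python sketch (illustrative, not from the slides; it assumes a dictionary of bigram counts over a vocabulary of size V, so that there are V² possible bigrams):

```python
def witten_bell_prob(bigram_counts, V):
    """Simple (unconditioned) Witten-Bell estimate over bigrams,
    following the formulas on these slides."""
    N = sum(bigram_counts.values())   # number of observed bigram tokens
    T = len(bigram_counts)            # number of observed bigram types
    Z = V * V - T                     # number of zero-count bigrams

    def prob(w_prev, w):
        c = bigram_counts.get((w_prev, w), 0)
        if c > 0:
            return c / (N + T)        # seen bigram: discounted relative frequency
        return T / (Z * (N + T))      # unseen bigram: equal share of the reserved mass T/(N+T)

    return prob
```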

Witten-Bell discounting Witten-Bell smoothed (discounted) bigram counts for the same seven words (figure 6.9 from text); the table is not reproduced here.

Discounting comparison A table (not reproduced here) shows the discounts for add-one and Witten-Bell smoothing for each of the seven words in this example.

Training sets and test sets The corpus is divided into a training set and a test set. Test items must not appear in the training set, or they will receive artificially high probability. We can use this split to evaluate different systems: – train two different systems on the same training set – compare the performance of the systems on the same test set
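A minimal sketch of such a split in Python (illustrative; the slides do not specify an evaluation measure, but comparing the total log-probability each model assigns to the same test set is one common choice):

```python
import math
import random

def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Randomly hold out a test set; train on the remainder."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def test_log_prob(sentence_prob, test_sentences):
    """Total log-probability a model (a sentence -> probability function) assigns
    to the test set; assumes the model never assigns zero probability, which is
    exactly why smoothing is needed."""
    return sum(math.log(sentence_prob(s)) for s in test_sentences)
```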