Lecture 3: Language Model Smoothing


1 Lecture 3: Language Model Smoothing
Kai-Wei Chang, University of Virginia. Course webpage: CS6501 Natural Language Processing

2 CS6501 Natural Language Processing
This lecture:
Zipf’s law
Dealing with unseen words/n-grams
Add-one smoothing
Linear smoothing
Good-Turing smoothing
Absolute discounting
Kneser-Ney smoothing
CS6501 Natural Language Processing

3 Recap: Bigram language model
Training corpus:
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>
Let P(<S>) = 1. Then:
P( I | <S> ) = 2/3
P( am | I ) = 1
P( Sam | am ) = 1/3
P( </S> | Sam ) = 1/2
P( <S> I am Sam </S> ) = 1 * 2/3 * 1 * 1/3 * 1/2
CS6501 Natural Language Processing
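A minimal sketch (my own variable names, not the course's code) that reproduces these numbers from the three training sentences:

from collections import Counter

corpus = [
    ["<S>", "I", "am", "Sam", "</S>"],
    ["<S>", "I", "am", "legend", "</S>"],
    ["<S>", "Sam", "I", "am", "</S>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def p_bigram(word, prev):
    # Maximum-likelihood estimate: P(word | prev) = c(prev, word) / c(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def p_sentence(sentence):
    # Probability of a sentence as a product of its bigram probabilities
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_bigram("I", "<S>"))                           # 2/3
print(p_sentence(["<S>", "I", "am", "Sam", "</S>"]))  # 2/3 * 1 * 1/3 * 1/2 = 1/9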

4 More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
CS6501 Natural Language Processing

5 CS6501 Natural Language Processing
Raw bigram counts Out of 9222 sentences CS6501 Natural Language Processing

6 Raw bigram probabilities
Normalize by unigrams: P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}). Result: CS6501 Natural Language Processing

7 CS6501 Natural Language Processing
Zeros
Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request
P(“offer” | denied the) = 0
Test set:
… denied the offer
… denied the loan
CS6501 Natural Language Processing

8 CS6501 Natural Language Processing
Smoothing
This dark art is why NLP is taught in the engineering school. There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique. But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
Credit: the following slides are adapted from Jason Eisner’s NLP course.
CS6501 Natural Language Processing

9 CS6501 Natural Language Processing
What is smoothing?
[Figure: estimated distributions from samples of size 20, 200, and 2000.]
CS6501 Natural Language Processing

10 ML 101: bias variance tradeoff
Different samples of size 20 vary considerably, though on average they give the correct bell curve!
CS6501 Natural Language Processing

11 CS6501 Natural Language Processing
Overfitting CS6501 Natural Language Processing

12 The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus.
In real life, it often doesn’t.
We need to train robust models that generalize!
One kind of generalization: zeros!
Things that don’t ever occur in the training set, but occur in the test set.
CS6501 Natural Language Processing

13 The intuition of smoothing
When we have sparse statistics, steal probability mass to generalize better.
P(w | denied the), observed counts (7 total):
3 allegations, 2 reports, 1 claims, 1 request
P(w | denied the), after smoothing (7 total):
2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (e.g., attack, man, outcome)
Credit: Dan Klein
CS6501 Natural Language Processing

14 Add-one estimation (Laplace smoothing)
Pretend we saw each word one more time than we did: just add one to all the counts!
MLE estimate: P_MLE(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
Add-1 estimate: P_Add-1(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + V)
CS6501 Natural Language Processing
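A hedged sketch of the two estimates on this slide. The dictionaries bigram_counts and unigram_counts and the parameter vocab_size are my own illustrative names, with counts assumed to be collected as in the bigram sketch above:

def p_mle(word, prev, bigram_counts, unigram_counts):
    # Maximum-likelihood estimate: P(word | prev) = c(prev, word) / c(prev)
    return bigram_counts.get((prev, word), 0) / unigram_counts[prev]

def p_add1(word, prev, bigram_counts, unigram_counts, vocab_size):
    # Add-1 (Laplace) estimate: P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts[prev] + vocab_size)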

15 CS6501 Natural Language Processing
Add-One Smoothing (26 letter types):
event       count  prob      count+1  smoothed prob
xya         100    100/300   101      101/326
xyb         0      0/300     1        1/326
xyc         0      0/300     1        1/326
xyd         200    200/300   201      201/326
xye … xyz   0      0/300     1        1/326
Total xy    300    300/300   326      326/326
CS6501 Natural Language Processing

16 Berkeley Restaurant Corpus: Laplace smoothed bigram counts

17 Laplace-smoothed bigrams
V=1446 in the Berkeley Restaurant Project corpus

18 Reconstituted counts

19 Compare with raw bigram counts
C(want to) went from 608 to 238, and P(to | want) from .66 to .26! Define the discount d = c*/c. For “chinese food”, d = .10, a 10x reduction!

20 Problem with Add-One Smoothing
We’ve been considering just 26 letter types …
event       count  prob   count+1  smoothed prob
xya         1      1/3    2        2/29
xyb         0      0/3    1        1/29
xyc         0      0/3    1        1/29
xyd         2      2/3    3        3/29
xye … xyz   0      0/3    1        1/29
Total xy    3      3/3    29       29/29
CS6501 Natural Language Processing

21 Problem with Add-One Smoothing
Suppose we’re considering 20000 word types:
event             count  prob   count+1  smoothed prob
see the abacus    1      1/3    2        2/20003
see the abbot     0      0/3    1        1/20003
see the abduct    0      0/3    1        1/20003
see the above     2      2/3    3        3/20003
see the Abram     0      0/3    1        1/20003
…
see the zygote    0      0/3    1        1/20003
Total (see the)   3      3/3    20003    20003/20003
CS6501 Natural Language Processing

22 Problem with Add-One Smoothing
Suppose we’re considering 20000 word types (same table as the previous slide).
“Novel event” = an event that never happened in the training data. Here: 19998 novel events, with total estimated probability 19998/20003.
Add-one smoothing thinks we are extremely likely to see novel events, rather than words we’ve seen.
CS6501 Natural Language Processing Intro to NLP - J. Eisner 22

23 CS6501 Natural Language Processing
Infinite Dictionary?
In fact, aren’t there infinitely many possible word types?
event             count  prob   count+1  smoothed prob
see the aaaaa     1      1/3    2        2/(∞+3)
see the aaaab     0      0/3    1        1/(∞+3)
see the aaaac     0      0/3    1        1/(∞+3)
see the aaaad     2      2/3    3        3/(∞+3)
see the aaaae     0      0/3    1        1/(∞+3)
…
see the zzzzz     0      0/3    1        1/(∞+3)
Total (see the)   3      3/3    ∞+3      (∞+3)/(∞+3)
CS6501 Natural Language Processing

24 CS6501 Natural Language Processing
Add-Lambda Smoothing
A large dictionary makes novel events too probable.
To fix: instead of adding 1 to all counts, add λ = 0.01?
This gives much less probability to novel events.
But how to pick the best value for λ? That is, how much should we smooth?
CS6501 Natural Language Processing

25 CS6501 Natural Language Processing
Add-0.001 Smoothing
Doesn’t smooth much (estimated distribution has high variance):
event       count  prob   count+λ  smoothed prob
xya         1      1/3    1.001    0.331
xyb         0      0/3    0.001    0.0003
xyc         0      0/3    0.001    0.0003
xyd         2      2/3    2.001    0.661
xye … xyz   0      0/3    0.001    0.0003
Total xy    3      3/3    3.026    1
CS6501 Natural Language Processing

26 CS6501 Natural Language Processing
Add-1000 Smoothing
Smooths too much (estimated distribution has high bias):
event       count  prob   count+1000  smoothed prob
xya         1      1/3    1001        ≈ 1/26
xyb         0      0/3    1000        ≈ 1/26
xyc         0      0/3    1000        ≈ 1/26
xyd         2      2/3    1002        ≈ 1/26
xye … xyz   0      0/3    1000        ≈ 1/26
Total xy    3      3/3    26003       1
CS6501 Natural Language Processing

27 CS6501 Natural Language Processing
Add-Lambda Smoothing
A large dictionary makes novel events too probable.
To fix: instead of adding 1 to all counts, add λ = 0.01?
This gives much less probability to novel events.
But how to pick the best value for λ? That is, how much should we smooth?
E.g., how much probability to “set aside” for novel events?
Depends on how likely novel events really are!
Which may depend on the type of text, size of training corpus, …
Can we figure it out from the data?
We’ll look at a few methods for deciding how much to smooth.
CS6501 Natural Language Processing
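A one-function sketch of add-λ smoothing, generalizing the add-1 sketch above; λ = 0.01 here is just the slide’s example value, not a recommendation, and the argument names are mine:

def p_add_lambda(word, prev, bigram_counts, unigram_counts, vocab_size, lam=0.01):
    # Add-λ estimate: P(word | prev) = (c(prev, word) + λ) / (c(prev) + λV)
    return (bigram_counts.get((prev, word), 0) + lam) / (unigram_counts[prev] + lam * vocab_size)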

28 Setting Smoothing Parameters
How to pick best value for ? (in add- smoothing) Try many  values & report the one that gets best results? How to measure whether a particular  gets good results? Is it fair to measure that on test data (for setting )? Moral: Selective reporting on test data can make a method look artificially good. So it is unethical. Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it. Training Test CS6501 Natural Language Processing

29 Setting Smoothing Parameters
Feynman’s Advice: “The first principle is that you must not fool yourself, and you are the easiest person to fool.”
(Otherwise the same questions and rule as the previous slide: pick λ without peeking at the test data.)
CS6501 Natural Language Processing

30 Setting Smoothing Parameters
How to pick best value for ? Try many  values & report the one that gets best results? Training Test Now use that  to get smoothed counts from all 100% … … and report results of that final model on test data. Dev. Training Pick  that gets best results on this 20% … … when we collect counts from this 80% and smooth them using add- smoothing. CS6501 Natural Language Processing Intro to NLP - J. Eisner

31 CS6501 Natural Language Processing
Large or small Dev set?
Here we held out 20% of our training set (yellow) for development.
Would like to use > 20% yellow: 20% may not be enough to reliably assess λ.
Would like to use > 80% blue: we want the best λ for smoothing 80% of the data ≈ the best λ for smoothing 100%.
CS6501 Natural Language Processing

32 CS6501 Natural Language Processing
Cross-Validation
Try 5 training/dev splits as below; pick the λ that gets the best average performance.
[Diagram: five different 80%/20% training/dev splits, plus a held-out Test set.]
Pro: tests on all 100% of the training data as dev (yellow), so we can more reliably assess λ.
Con: still picks a λ that’s good at smoothing the 80% size, not 100%. But now we can grow that 80% without trouble.
CS6501 Natural Language Processing Intro to NLP - J. Eisner 32
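A sketch of this 5-fold scheme, reusing bigram_counts() and cross_entropy() from the previous sketch; each candidate λ is scored by its average dev cross-entropy over the five splits, so every sentence serves as dev data once:

def pick_lambda_cv(sentences, vocab_size, candidates=(0.001, 0.01, 0.1, 1.0), k=5):
    folds = [sentences[i::k] for i in range(k)]          # k disjoint dev folds
    def avg_score(lam):
        total = 0.0
        for i in range(k):
            dev = folds[i]
            train = [s for j, f in enumerate(folds) if j != i for s in f]
            uni, bi = bigram_counts(train)               # counts from the other folds
            total += cross_entropy(dev, uni, bi, vocab_size, lam)
        return total / k
    return min(candidates, key=avg_score)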

33 N-fold Cross-Validation (“Leave One Out”)
Test each sentence with a smoothed model built from the other N-1 sentences.
Pro: still tests on all 100% as dev (yellow), so we can reliably assess λ.
Pro: trains on nearly 100% of the blue data ((N-1)/N) to measure whether λ is good for smoothing that much data.
CS6501 Natural Language Processing Intro to NLP - J. Eisner 33

34 N-fold Cross-Validation (“Leave One Out”)
Test  Surprisingly fast: why? Usually easy to retrain on blue by adding/subtracting 1 sentence’s counts CS6501 Natural Language Processing Intro to NLP - J. Eisner 34

35 More Ideas for Smoothing
Remember, we’re trying to decide how much to smooth.
E.g., how much probability to “set aside” for novel events?
Depends on how likely novel events really are,
which may depend on the type of text, size of training corpus, …
Can we figure this out from the data?
CS6501 Natural Language Processing

36 How likely are novel events?
[Table: a vocabulary of 20000 types; two samples of 300 tokens each, with observed counts such as a 150, both 18, candy 1, donuts 2, every 50, his 38, ice cream 7, and 0/300 for words like versus and grapes.]
Which zero would you expect is really rare?
CS6501 Natural Language Processing

37 How likely are novel events?
[Same counts as the previous slide, now including farina with count 0.]
Determiners: a closed class.
CS6501 Natural Language Processing

38 How likely are novel events?
[Same counts as the previous slide.]
(Food) nouns: an open class.
CS6501 Natural Language Processing

39 Zipf’s law: the r-th most common word w_r has P(w_r) ∝ 1/r
A few words are very frequent; most words are very rare. CS6501 Natural Language Processing
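A quick way to eyeball Zipf’s law on any tokenized corpus: sort words by frequency and check that count × rank stays roughly constant (since P(w_r) ∝ 1/r). The tiny tokens list below is just a stand-in for a real corpus:

from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the cat".split()  # toy stand-in

counts = Counter(tokens)
for rank, (word, count) in enumerate(sorted(counts.items(),
                                            key=lambda kv: -kv[1]), start=1):
    # Under Zipf's law, count * rank should be roughly constant across ranks
    print(f"rank {rank:4d}  count {count:6d}  count*rank {count * rank:8d}  {word}")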

40 CS6501 Natural Language Processing

41 CS6501 Natural Language Processing

42 How common are novel events?
[Bar chart: the number of word types N_r that occurred r times, from single very frequent words (1 × “the”, 1 × EOS) down through N_6, N_5, …, N_1 to N_0 (novel words). Example types shown include abdomen, bachelor, Caesar; aberrant, backlog, cabinets; abdominal, Bach, cabana; Abbas, babel, Cabot; aback, Babbitt, cabanas; abaringe, Babatinde, cabaret.]
CS6501 Natural Language Processing

43 How common are novel events?
Counts from the Brown Corpus (N ≈ 1 million tokens):
[Bar chart of N_r, the number of word types occurring r times: novel words (in the dictionary but never occurring) N_0, singletons (occur once) N_1, doubletons (occur twice) N_2, then N_3, N_4, N_5, N_6, …]
N_2 = # doubleton types; N_2 * 2 = # doubleton tokens.
Σ_r N_r = total # types = T (purple bars)
Σ_r (N_r * r) = total # tokens = N (all bars)
CS6501 Natural Language Processing

44 Witten-Bell Smoothing Idea
If T/N is large, we’ve seen lots of novel types in the past, so we expect lots more.
unsmoothed → smoothed:
doubletons: 2/N → 2/(N+T)
singletons: 1/N → 1/(N+T)
novel words: 0/N → (T/(N+T)) / N_0
CS6501 Natural Language Processing
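A sketch of the Witten-Bell estimates shown here: with N observed tokens and T observed types, a word seen r times gets r/(N+T), and the reserved mass T/(N+T) is split evenly over the N_0 unseen types. The toy counts and the dictionary size of 20000 are illustrative only:

from collections import Counter

def witten_bell(counts, dict_size):
    # counts: word -> observed count; dict_size: total vocabulary size (seen + unseen)
    N = sum(counts.values())            # observed tokens
    T = len(counts)                     # observed types
    N0 = dict_size - T                  # unseen (novel) types
    p_seen = {w: r / (N + T) for w, r in counts.items()}   # r/N  ->  r/(N+T)
    p_each_novel = (T / (N + T)) / N0                       # 0/N  ->  (T/(N+T))/N0
    return p_seen, p_each_novel

counts = Counter({"a": 150, "every": 50, "his": 38, "both": 18})   # toy data
p_seen, p_each_novel = witten_bell(counts, dict_size=20000)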

45 Good-Turing Smoothing Idea
Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in the training data.
Use the observed total probability of class r+1 to estimate the total probability of class r.
unsmoothed → smoothed (per word):
doubletons: 2/N → (N_3 * 3/N) / N_2  (obs. total p(tripletons), 1.2%, reallocated to est. p(doubleton))
singletons: 1/N → (N_2 * 2/N) / N_1  (obs. total p(doubletons), 1.5%, reallocated to est. p(singleton))
novel words: 0/N → (N_1 * 1/N) / N_0  (obs. total p(singletons), 2%, reallocated to est. p(novel))
In general: r/N = (N_r * r/N) / N_r → (N_{r+1} * (r+1)/N) / N_r
CS6501 Natural Language Processing
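A sketch of the count-class bookkeeping behind r/N → (N_{r+1}·(r+1)/N)/N_r: it returns the smoothed probability for one word of each observed frequency, plus the total mass reserved for novel words (N_1/N). Function and variable names are mine:

from collections import Counter

def good_turing(counts):
    # counts: word -> observed count
    N = sum(counts.values())            # total tokens
    Nr = Counter(counts.values())       # count-of-counts: Nr[r] = number of types seen r times
    def p_one_word_seen_r_times(r):
        # The observed total probability of the class seen r+1 times, Nr[r+1]*(r+1)/N,
        # is reassigned to the class seen r times and split among its Nr[r] types.
        return (Nr[r + 1] * (r + 1) / N) / Nr[r]
    p_by_count = {r: p_one_word_seen_r_times(r) for r in Nr}
    p_novel_total = Nr[1] * 1 / N       # mass set aside for all never-seen words
    return p_by_count, p_novel_total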

46 Witten-Bell vs. Good-Turing
Estimate p(z | xy) using just the tokens we’ve seen in context xy. Might be a small set …
Witten-Bell intuition: if those tokens were distributed over many different types, then novel types are likely in future.
Good-Turing intuition: if many of those tokens came from singleton types, then novel types are likely in future.
Very nice idea (but a bit tricky in practice). See the paper “Good-Turing smoothing without tears”.
CS6501 Natural Language Processing

47 Good-Turing Reweighting
Problem 1: what about “the”? (k = 4417)
For small k, N_k > N_{k+1}; for large k, the counts get too jumpy.
Problem 2: we don’t observe events for every k.
[Figure: Good-Turing maps each count class onto the one below it: N_1 → N_0, N_2 → N_1, N_3 → N_2, …, N_3511 → N_3510, N_4417 → N_4416.]

48 Good-Turing Reweighting
Simple Good-Turing [Gale and Sampson]: replace the empirical N_k with a best-fit regression (e.g., a power law) once the count-of-counts get unreliable.
f(k) = a + b log(k); find a, b such that f(k) ≈ N_k.
[Figure: empirical N_1, N_2, N_3, … with the fitted curve overlaid.]
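A hedged sketch of the regression step, following the parenthetical “power law” reading (as in Gale & Sampson): fit log N_k = a + b log k by ordinary least squares and use the fitted curve in place of the noisy empirical N_k. Whether to fit N_k directly or in log space is an assumption on my part:

import math

def fit_power_law(count_of_counts):
    # count_of_counts: dict mapping k -> empirical N_k (needs at least two distinct k values)
    xs = [math.log(k) for k in count_of_counts]
    ys = [math.log(nk) for nk in count_of_counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b                         # log N_k ≈ a + b log k

def smoothed_Nk(k, a, b):
    # Smooth stand-in for N_k, usable inside the Good-Turing formula above
    return math.exp(a + b * math.log(k))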

49 Backoff and interpolation
Why are we treating all novel events as the same? CS6501 Natural Language Processing Intro to NLP - J. Eisner 49

50 Backoff and interpolation
p(zygote | see the) vs. p(baby | see the): what if count(see the zygote) = count(see the baby) = 0?
“baby” beats “zygote” as a unigram; “the baby” beats “the zygote” as a bigram;
so “see the baby” should beat “see the zygote”? (Even if both have the same count, such as 0.)
CS6501 Natural Language Processing Intro to NLP - J. Eisner 50

51 Backoff and interpolation
Condition on less context for contexts you haven’t learned much about.
Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram.
Interpolation: a mixture of unigram, bigram, trigram (etc.) models.
CS6501 Natural Language Processing Intro to NLP - J. Eisner 51

52 CS6501 Natural Language Processing
Smoothing + backoff
Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell): holds out some probability mass for novel events, divided up evenly among the novel events.
Backoff smoothing: holds out the same amount of probability mass for novel events, but divides it up unevenly, in proportion to the backoff probability.
CS6501 Natural Language Processing Intro to NLP - J. Eisner 52

53 CS6501 Natural Language Processing
Smoothing + backoff When defining p(z | xy), the backoff prob for novel z is p(z | y) Even if z was never observed after xy, it may have been observed after y (why?). Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z). CS6501 Natural Language Processing Intro to NLP - J. Eisner 53

54 CS6501 Natural Language Processing
Linear Interpolation (Jelinek-Mercer smoothing)
Use a weighted average of backed-off naïve models:
p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
where λ3 + λ2 + λ1 = 1 and all λ ≥ 0.
The weights λ can depend on the context xy.
E.g., we can make λ3 large if count(xy) is high.
Learn the weights on held-out data with jackknifing.
Use a different λ3 when xy is observed 1 time, 2 times, 5 times, …
CS6501 Natural Language Processing Intro to NLP - J. Eisner 54
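A direct transcription of this formula into code. The component models are passed in as functions, and the example weights (0.6, 0.3, 0.1) are purely illustrative; in practice they would be learned on held-out data, as the slide says:

def p_interpolated(z, x, y, p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
    # p_average(z | x y) = λ3 * p(z | x y) + λ2 * p(z | y) + λ1 * p(z)
    lam3, lam2, lam1 = lambdas
    assert min(lambdas) >= 0 and abs(sum(lambdas) - 1.0) < 1e-9   # λs nonnegative, sum to 1
    return lam3 * p_trigram(z, x, y) + lam2 * p_bigram(z, y) + lam1 * p_unigram(z)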

55 Absolute Discounting Interpolation
Save ourselves some time and just subtract 0.75 (or some discount d)!
P_AbsoluteDiscount(w_i | w_{i-1}) = (c(w_{i-1} w_i) - d) / c(w_{i-1}) + λ(w_{i-1}) P(w_i)
(discounted bigram + interpolation weight × unigram)
But should we really just use the regular unigram P(w)?
CS6501 Natural Language Processing
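A sketch matching the labeled pieces of this formula (discounted bigram, interpolation weight, unigram). The choice λ(prev) = d · (number of distinct words seen after prev) / c(prev) is the standard normalizing weight and is my assumption here, since the slide only shows the general form:

def p_absolute_discount(word, prev, bigram_counts, unigram_counts, p_unigram, d=0.75):
    # Discounted bigram term: max(c(prev, word) - d, 0) / c(prev)
    c_prev = unigram_counts[prev]
    discounted = max(bigram_counts.get((prev, word), 0) - d, 0) / c_prev
    # Interpolation weight: the mass freed up by discounting each observed continuation by d
    num_continuations = sum(1 for (p, _w) in bigram_counts if p == prev)
    lam = d * num_continuations / c_prev
    return discounted + lam * p_unigram(word)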

56 Absolute discounting: just subtract a little from each count
How much to subtract? Church and Gale (1991)’s clever idea: divide the data into training and held-out sets; for each bigram in the training set, see the actual count in the held-out set. It sure looks like c* = (c - .75):
Bigram count in training   Bigram count in held-out set
1                          0.448
2                          1.25
3                          2.24
4                          3.23
5                          4.21
6                          5.23
7                          6.21
8                          7.21
9                          8.26
CS6501 Natural Language Processing

