N-Grams Chapter 4 Part 2.


1 N-Grams Chapter 4 Part 2 (Speech and Language Processing, Jurafsky and Martin)

2 Today
Word prediction task
Language modeling (N-grams)
N-gram intro
The chain rule
Model evaluation
Smoothing

3 Zero Counts
Back to Shakespeare: Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.
So 99.96% of the possible bigrams were never seen (they have zero entries in the bigram table).
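A quick sanity check of those numbers, as a small Python sketch (the vocabulary size of roughly 29,000 word types is the figure the textbook uses for Shakespeare and is an assumption here):

```python
# Back-of-the-envelope check of the sparsity figure.
# Assumption: Shakespeare's vocabulary is roughly 29,066 word types (textbook figure).
V = 29_066
possible_bigrams = V ** 2            # ~844 million possible bigrams
observed_bigrams = 300_000           # bigram types Shakespeare actually produced

unseen = 1 - observed_bigrams / possible_bigrams
print(f"{possible_bigrams:,} possible bigrams")
print(f"{unseen:.2%} of possible bigrams never seen")   # ~99.96%
```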

4 Zero Counts
Some of those zeros are really zeros: things that really can't or shouldn't happen.
On the other hand, some of them are just rare events. If the training corpus had been a little bigger, they would have had a count (but probably a count of 1!).
Zipf's Law (the long-tail phenomenon):
A small number of events occur with high frequency.
A large number of events occur with low frequency.
You can quickly collect statistics on the high-frequency events.
You might have to wait an arbitrarily long time to get valid statistics on the low-frequency events.
Result: our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate.
Answer: estimate the likelihood of unseen (zero-count) N-grams.

5 Laplace Smoothing
Also called add-one smoothing.
Just add one to all the counts! Very simple.
MLE estimate: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
Laplace estimate: P_Laplace(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
Reconstructed counts: c*(w_{i-1}, w_i) = (c(w_{i-1}, w_i) + 1) * c(w_{i-1}) / (c(w_{i-1}) + V)
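A minimal sketch of these three estimates in code, for bigrams (the toy corpus and function names are illustrative, not from the slides):

```python
from collections import Counter

# Toy training data; in practice the counts come from a large corpus.
tokens = "i want to eat chinese food i want to spend".split()

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
V = len(unigram)                                 # vocabulary size

def p_mle(w_prev, w):
    """Unsmoothed maximum-likelihood estimate: c(w_prev, w) / c(w_prev)."""
    return bigram[(w_prev, w)] / unigram[w_prev]

def p_laplace(w_prev, w):
    """Add-one (Laplace) estimate: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram[(w_prev, w)] + 1) / (unigram[w_prev] + V)

def c_star(w_prev, w):
    """Reconstructed count implied by the Laplace estimate."""
    return (bigram[(w_prev, w)] + 1) * unigram[w_prev] / (unigram[w_prev] + V)

print(p_laplace("i", "want"), p_laplace("i", "eat"))  # the unseen bigram gets mass too
```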

6 Bigram Counts Before Smoothing (from the Part 1 slides)
V = 1446. E.g., "I want" occurred 827 times.

7 Laplace-Smoothed Bigram Counts

8 Laplace-Smoothed Bigram Probabilities

9 Reconstituted Counts

10 Big Change to the Counts!
C(want to) went from 608 to 238! P(to | want) went from .66 to .26!
Discount d = c*/c. For "chinese food", d = .10: a 10x reduction!
So in general, Laplace is a blunt instrument.
Laplace smoothing is not used for N-grams, as there are better methods.
Despite its flaws, Laplace is still used to smooth other probabilistic models in NLP because it is simple, especially for pilot studies in domains where the number of zeros isn't so huge.
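A worked check of those numbers (the underlying counts C(want) = 927, C(want to) = 608, C(chinese) = 158, C(chinese food) = 82 and V = 1446 are the textbook's restaurant-corpus figures; treat them as assumptions here):

```python
# Reconstituted Laplace count: c* = (c + 1) * C(prev) / (C(prev) + V)
V = 1446

def c_star(c_bigram, c_prev):
    return (c_bigram + 1) * c_prev / (c_prev + V)

print(round(c_star(608, 927)))                          # ~238, down from 608
print(round(608 / 927, 2), round(609 / (927 + V), 2))   # P(to|want): .66 -> .26
print(round(c_star(82, 158) / 82, 2))                   # d = c*/c for "chinese food" ~ 0.10
```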

11 Better Smoothing
The intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) is to use the count of things we've seen once to help estimate the count of things we've never seen.
We'll just cover the basic idea.

12 Good-Turing Idea
Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18
Assuming so, how likely is it that the next fish is a trout? It must be less than 1/18.
Slide adapted from Josh Goodman
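A minimal sketch of the Good-Turing arithmetic behind this example (the adjusted count uses the standard formula c* = (c+1) N_{c+1} / N_c; this is just the basic idea, not the full smoothing method):

```python
from collections import Counter

# The fishing catch so far.
catch = Counter(carp=10, perch=3, whitefish=2, trout=1, salmon=1, eel=1)
N = sum(catch.values())                       # 18 fish

# N_c = number of species seen exactly c times.
freq_of_freqs = Counter(catch.values())       # {10: 1, 3: 1, 2: 1, 1: 3}
N1, N2 = freq_of_freqs[1], freq_of_freqs[2]

# Probability mass reserved for unseen species: N_1 / N.
print(N1 / N)                                 # 3/18

# Adjusted count for something seen once, e.g. trout: c* = (c + 1) * N_{c+1} / N_c
c_star_trout = (1 + 1) * N2 / N1              # 2/3
print(c_star_trout / N)                       # ~0.037, indeed less than 1/18
```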

13 Back-off Methods
Notice that:
N-grams are more precise than (N-1)-grams (remember the Shakespeare example).
But also, N-grams are sparser than (N-1)-grams.
How to combine things? Attempt N-grams and back off to (N-1)-grams if counts are not available.
E.g., attempt prediction using 4-grams, and back off to trigrams (or bigrams, or unigrams) if counts are not available.
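A minimal sketch of the back-off idea for trigrams (real back-off methods such as Katz back-off also discount and weight the lower-order estimates; that is omitted here, so this shows only the control flow):

```python
from collections import Counter

def backoff_prob(context, word, counts, total_tokens):
    """Try the longest context first; back off to shorter contexts when the count is zero."""
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return counts[ngram] / counts[context]
        context = context[1:]                 # drop the earliest word and back off
    return counts[(word,)] / total_tokens     # unigram fallback

# Tiny illustration with counts from a toy corpus.
tokens = "i want to eat i want to spend i want chinese food".split()
counts = Counter()
for n in (1, 2, 3):
    counts.update(zip(*(tokens[i:] for i in range(n))))

print(backoff_prob(("want", "to"), "eat", counts, len(tokens)))   # trigram count exists
print(backoff_prob(("spend", "to"), "eat", counts, len(tokens)))  # backs off to the bigram
```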

14 Interpolation
Mix (interpolate) the trigram, bigram, and unigram estimates:
P̂(w_i | w_{i-2} w_{i-1}) = λ1 P(w_i | w_{i-2} w_{i-1}) + λ2 P(w_i | w_{i-1}) + λ3 P(w_i), where λ1 + λ2 + λ3 = 1
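A tiny numeric sketch of that formula (the component probabilities and the lambda weights below are made-up values, purely to show the arithmetic):

```python
# Made-up MLE estimates for one word given trigram, bigram, and unigram contexts.
p_trigram, p_bigram, p_unigram = 0.55, 0.52, 0.014

# Interpolation weights; they must sum to 1.
lam1, lam2, lam3 = 0.6, 0.3, 0.1

p_interpolated = lam1 * p_trigram + lam2 * p_bigram + lam3 * p_unigram
print(p_interpolated)     # 0.4874
```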

15 How to Set the Lambdas?
Use a held-out, or development, corpus.
Choose lambdas which maximize the probability of some held-out data.
One possibility: use the expectation-maximization (EM) algorithm (see Chapter 6; not required).
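A minimal sketch of picking the lambda on held-out data by grid search, a simpler alternative to EM (the corpus, the bigram/unigram interpolation, and the candidate weights are all illustrative assumptions):

```python
from collections import Counter
from math import log

train = "i want to eat chinese food i want to spend money".split()
heldout = "i want chinese food".split()

uni = Counter(train)
bi = Counter(zip(train, train[1:]))
N = len(train)

def p_interp(w_prev, w, lam):
    """Interpolated bigram/unigram estimate (trigram term left out for brevity)."""
    p_bi = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0
    return lam * p_bi + (1 - lam) * uni[w] / N

def heldout_loglik(lam):
    return sum(log(p_interp(w1, w2, lam) + 1e-12)
               for w1, w2 in zip(heldout, heldout[1:]))

# Grid search over lambda; EM would find the weights without an explicit grid.
best = max((l / 10 for l in range(11)), key=heldout_loglik)
print(best, heldout_loglik(best))
```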

16 Practical Issue
We do everything in log space, to avoid underflow.
(Also, adding is faster than multiplying.)
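A small sketch of the underflow problem (the probabilities are made up; any long product of small N-gram probabilities behaves this way):

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point ...
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product)                      # 0.0 -- underflow

# ... but summing their logs stays well within range.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)                     # about -1151.29
```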

