Overview. Smoothing methods: simple smoothing; Witten-Bell and Good-Turing estimation; held-out estimation and cross-validation. Combining several n-gram models: back-off models.
Rationale behind smoothing
Sample frequencies: seen events have probability P; unseen events (including "grammatical zeroes") have probability 0.
Real population frequencies: seen events, which include the events unseen in our sample.
Smoothing is applied to the sample frequencies to approximate the real population frequencies. It results in lower probabilities for seen events (discounting), with the left-over probability mass distributed over the unseen events (smoothing).
Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
[Figure: Instances in the training corpus of "inferior to ________"; axis label F(w)]
[Figure: Maximum likelihood estimate; axis label F(w)] Unknowns are assigned 0% of the probability mass.
[Figure: Actual probability distribution; axis label F(w)] These are non-zero probabilities in the real distribution.
[Figure: Laplace's law; axis label F(w)] NB: this method ends up assigning most of the probability mass to unseens.
Generalisation: Lidstone's law
$$P_{Lid}(x) = \frac{C(x) + \lambda}{N + V\lambda}$$
P = probability of a specific n-gram; C(x) = count of n-gram x in the training data; N = total number of n-grams in the training data; V = number of "bins" (possible n-grams); $\lambda$ = a small positive number.
MLE: $\lambda = 0$. Laplace's law: $\lambda = 1$ (add-one smoothing). Jeffreys-Perks law: $\lambda = 1/2$.
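As a rough illustration of Lidstone's law, the following Python sketch computes the estimate for a toy count table. The function name `lidstone_prob`, the toy counts, and the choice of V = 10 bins are assumptions made for this example only; they do not come from the slides.

```python
from collections import Counter

def lidstone_prob(ngram, counts, num_bins, lam=1.0):
    """Lidstone estimate: (C(x) + lam) / (N + num_bins * lam)."""
    n_tokens = sum(counts.values())  # N: total n-grams in the training data
    return (counts[ngram] + lam) / (n_tokens + num_bins * lam)

# Toy training data (assumed): two attested types out of V = 10 possible bins.
counts = Counter({"a": 3, "b": 1})
V = 10

# lam = 0 is the maximum likelihood estimate: unseen events get zero mass.
p_seen_mle = lidstone_prob("a", counts, V, lam=0.0)    # 3/4
p_unseen_mle = lidstone_prob("z", counts, V, lam=0.0)  # 0

# lam = 1 is Laplace's law (add-one smoothing).
p_seen = lidstone_prob("a", counts, V, lam=1.0)        # 4/14
p_unseen = lidstone_prob("z", counts, V, lam=1.0)      # 1/14

# Total mass given to the 8 unseen bins under add-one: 8/14.
unseen_mass = (V - len(counts)) * p_unseen

# lam = 1/2 is the Jeffreys-Perks law.
p_seen_jp = lidstone_prob("a", counts, V, lam=0.5)     # 3.5/9
```

In this toy setting add-one smoothing hands 8/14 of the probability mass to the unseen bins, which matches the warning that Laplace's law can give most of the mass to unseens when V is large relative to N.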
Main intuition. A zero-frequency event can be thought of as an event which hasn't happened (yet). The probability of it happening can be estimated from the probability of something happening for the first time. The count of things seen only once can be used to estimate the count of things never seen.
Witten-Bell method
1. Let T = the number of times we saw an event for the first time, i.e. the number of different n-gram types (bins) observed. NB: T is the number of types actually attested (unlike V, the number of possible types in add-one smoothing).
2. Estimate the total probability mass of unseen n-grams:
$$\sum_{i : c_i = 0} p_i^* = \frac{T}{N + T}$$
where N is the number of actual n-gram tokens and T is the number of actual types. Each token is an event and each new type is an event, so the equation gives the MLE of the probability of a new-type event occurring ("being seen for the first time"). This is the total probability mass to be distributed among all zero events (unseens).
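Witten-Bell method
3. Divide the total probability mass among all the zero n-grams. It can be distributed equally: writing Z for the number of n-grams with zero count, N for the number of n-gram tokens and T for the number of attested types,
$$p_i^* = \frac{T}{Z\,(N + T)} \quad \text{if } c_i = 0$$
4. Remove this probability mass from the non-zero n-grams (discounting):
$$p_i^* = \frac{c_i}{N + T} \quad \text{if } c_i > 0$$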
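The four steps can be sketched in Python as below. This is a minimal illustration, assuming the unseen mass is shared equally and that the number of possible bins V is known, so that the number of zero-count n-grams is V - T. The function name `witten_bell` and the toy counts are hypothetical.

```python
from collections import Counter

def witten_bell(counts, num_bins):
    """Return (discounted probabilities of seen n-grams, probability of each unseen n-gram)."""
    n_tokens = sum(counts.values())                     # N: n-gram tokens
    n_types = sum(1 for c in counts.values() if c > 0)  # T: attested types
    n_unseen = num_bins - n_types                       # zero-count n-grams (assumes V is known)
    denom = n_tokens + n_types                          # N + T

    # Step 4: discounted estimate c_i / (N + T) for seen n-grams.
    seen = {x: c / denom for x, c in counts.items() if c > 0}

    # Steps 2-3: total unseen mass T / (N + T), shared equally.
    unseen_each = n_types / (n_unseen * denom) if n_unseen > 0 else 0.0
    return seen, unseen_each

# Toy data (assumed): N = 4 tokens, T = 2 types, V = 10 bins, so 8 unseen bins.
counts = Counter({"a": 3, "b": 1})
seen, unseen_each = witten_bell(counts, num_bins=10)
# seen["a"] = 3/6, seen["b"] = 1/6, unseen_each = 2/(8*6) = 1/24
```

With these counts the unseen bins receive T / (N + T) = 2/6 of the mass in total, noticeably less than the 8/14 that add-one smoothing would give them for the same data.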
Witten-Bell vs. add-one. If we work with unigrams, Witten-Bell and add-one smoothing give very similar results. The difference shows up with n-grams for n > 1. Main idea: estimate the probability of an unseen bigram from the probability of seeing a bigram starting with $w_1$ for the first time.
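Witten-Bell with bigrams
Generalised total probability mass estimate:
$$\sum_{i : c(w_x w_i) = 0} p^*(w_i \mid w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}$$
where $T(w_x)$ = number of bigram types beginning with $w_x$, and $N(w_x)$ = number of bigram tokens beginning with $w_x$. The left-hand side is the estimated total probability of the unseen bigrams starting with some word $w_x$.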
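Witten-Bell with bigrams
Non-zero bigrams get discounted as before, but again conditioning on the history:
$$p^*(w_i \mid w_x) = \frac{c(w_x w_i)}{N(w_x) + T(w_x)} \quad \text{if } c(w_x w_i) > 0$$
where $N(w_x)$ and $T(w_x)$ are the numbers of bigram tokens and bigram types beginning with $w_x$.
Note: Witten-Bell won't assign the same probability mass to all unseen n-grams. The amount assigned depends on the first word in the bigram (the first n-1 words in the n-gram).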
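The history-conditioned version can be sketched as follows. This is only an illustration under stated assumptions: the unseen mass for each history is shared equally among the vocabulary words not yet seen after it, every queried word belongs to a known vocabulary, and histories that never occur are rejected (the slides do not say how to handle them). The name `make_witten_bell_bigram` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def make_witten_bell_bigram(bigram_counts, vocab):
    """Build a function prob(w2, w1) giving the Witten-Bell estimate of p*(w2 | w1)."""
    tokens = Counter()            # N(w1): bigram tokens beginning with w1
    followers = defaultdict(set)  # words seen after w1; its size is T(w1)
    for (w1, w2), c in bigram_counts.items():
        if c > 0:
            tokens[w1] += c
            followers[w1].add(w2)

    def prob(w2, w1):
        n, t = tokens[w1], len(followers[w1])
        if t == 0:
            # Assumption: an unseen history is outside the scope of this sketch.
            raise ValueError(f"history {w1!r} never seen in training data")
        if w2 in followers[w1]:
            return bigram_counts[(w1, w2)] / (n + t)  # discounted seen bigram
        z = len(vocab) - t                            # unseen continuations of w1
        return t / (z * (n + t))                      # equal share of the unseen mass

    return prob

# Toy data (assumed): a four-word vocabulary and three attested bigram types.
vocab = {"a", "b", "c", "d"}
bigram_counts = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 2})
prob = make_witten_bell_bigram(bigram_counts, vocab)

p_seen = prob("b", "a")      # 3 / (4 + 2) = 1/2
p_unseen_a = prob("d", "a")  # 2 / (2 * 6) = 1/6
p_unseen_b = prob("d", "b")  # 1 / (3 * 3) = 1/9
```

The same unseen word "d" gets 1/6 after "a" but 1/9 after "b", which is the point of the note above: the mass assigned to an unseen bigram depends on its history.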