1 Smoothing Ritvik Mudur 14 January 2011
(Thanks to: Mengqiu Wang, Anish Johnson, Nate Chambers, Bill MacCartney, Jenny Finkel, and Sushant Prakash for these materials)

2 Format and content of sections
Mix of theoretical and practical topics and tips targeted at the PAs
Emphasis on simple examples worked out in detail

3 Outline for today Smoothing: examples, proofs, implementation
Good-Turing smoothing tricks
Katz backoff
Absolute discounting example
Proving you have a proper probability distribution
Kneser-Ney smoothing
Information theory (time permitting): intuitions and examples for entropy, joint entropy, conditional entropy, mutual information, relative entropy (KL divergence), cross entropy, and perplexity
Java implementation: representing huge models efficiently
Some tips on the programming assignments

4 Example Toy language with two words: "A" and "B". We want to build a predictive model from the following utterances:
"A B, B A."
"A B, B A A A!"
"B A, A B!"
"A!"
"A, A, A."
"" (can say nothing)

5 A Unigram Model Let's omit punctuation and put a stop symbol after each utterance:
A B B A . A B B A A A . B A A B . A . A A A . .
Let C(x) be the observed count of unigram x
Let o(x) be the observed frequency of unigram x: o(x) = C(x) / C
This is a multinomial* probability distribution with 2 free parameters
Event space: {A, B, .}
The o(x) are the MLE estimates of the unknown true parameters
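
A minimal Java sketch of these definitions, computing C(x) and o(x) = C(x)/C for the toy corpus above. It uses plain java.util rather than the course's Counter classes; class and variable names are illustrative.

  import java.util.*;

  public class UnigramMLE {
      public static void main(String[] args) {
          // Toy corpus with a stop symbol "." after each utterance (24 tokens)
          String[] tokens = {"A","B","B","A",".", "A","B","B","A","A","A",".",
                             "B","A","A","B",".", "A",".", "A","A","A",".", "."};
          Map<String, Integer> counts = new HashMap<>();
          for (String t : tokens) counts.merge(t, 1, Integer::sum);   // C(x)

          int total = tokens.length;                                  // C = 24
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              double mle = (double) e.getValue() / total;             // o(x) = C(x) / C
              System.out.printf("o(%s) = %d/%d = %.3f%n",
                                e.getKey(), e.getValue(), total, mle);
          }
      }
  }

Running this prints o(A) = 12/24 = 0.5, o(B) = 6/24 = 0.25, o(.) = 6/24 = 0.25, matching the counts used in the later slides.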

6 A Bigram Model (1/2) This time we'll put a stop symbol before and after each utterance:
. A B B A . . A B B A A A . . B A A B . . A . . A A A . . .
Let C(x, y) be the observed count of bigram xy
Let o(x, y) be the observed frequency of bigram xy
This is a multinomial* probability distribution with 8 free parameters
Event space: {A, B, .} × {A, B, .}
The o(x, y) are the MLE estimates of the unknown true parameters

7 A Bigram Model (2/2) What about the marginal distributions o(x) and o(y), and the conditional distributions o(y | x) and o(x | y)?
o(y | x) = o(x, y) / o(x) = C(x, y) / C(x)
For example: o(B | A) = C(A, B) / C(A) = 3 / 12 = 1/4
Equivalently: o(B | A) = o(A, B) / o(A) = (1/8) / (1/2) = 1/4

8 Terminology concerns * We referred to the unigram and bigram model distributions as multinomial; however, this is an error in terminology. These are actually discrete (or categorical) distributions. Still, the use of the term multinomial to refer to such distributions is rather common within NLP, so much so that it gets a special mention on its Wikipedia page.

9 Good-Turing Smoothing
We redistribute the count mass of types observed r+1 times evenly among the types observed r times. Then we estimate P(x) as r*/C, where r* is an adjusted count for types observed r times.
We want r* such that: r* Nr = (r+1) Nr+1
So: r* = (r+1) Nr+1 / Nr, which gives PGT(x) = ((r+1) Nr+1 / Nr) / C
Special case (all r = 0 events): PGT(unseen) = N1 / C
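
A short Java sketch of the count-of-counts bookkeeping behind these formulas (illustrative names, not the assignment starter code). It leaves r* undefined wherever Nr+1 = 0, which is exactly the problem the next slides address.

  import java.util.*;

  public class GoodTuring {
      // Returns adjusted counts r* = (r+1) * N_{r+1} / N_r for each observed r.
      // A sketch only: real implementations smooth N_r first (see the S(r) fit below)
      // or fall back to raw counts above some cutoff k.
      static Map<Integer, Double> adjustedCounts(Map<String, Integer> counts) {
          Map<Integer, Integer> countOfCounts = new HashMap<>();      // N_r
          for (int c : counts.values()) countOfCounts.merge(c, 1, Integer::sum);

          Map<Integer, Double> rStar = new HashMap<>();
          for (int r : countOfCounts.keySet()) {
              Integer nr = countOfCounts.get(r);
              Integer nr1 = countOfCounts.get(r + 1);
              if (nr1 != null) {
                  rStar.put(r, (r + 1) * (double) nr1 / nr);
              }   // if N_{r+1} is 0, r* is left undefined here
          }
          return rStar;
      }
  }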

10 Good-Turing (Cont.) To see how this works, let's go back to our example and look at bigram counts. To make the example more interesting, we'll assume we've also seen the sentence, "C C C C C C C C C C C", giving us another 12 bigram observations, as summarized in the following table of counts:

11 Good-Turing (Cont.) But now the probabilities do not add up to 1
One problem is that for high values of r (i.e., for high-frequency bigrams), Nr+1 is quite likely to be 0.
One way to address this is to use the Good-Turing estimates only for frequencies r < k, for some constant cutoff k; above the cutoff, the MLE estimates are used.
OK, now what? Just normalize!

12 Good-Turing (Cont.) So we fit some function S to the observed values (r, Nr).
Gale & Sampson (1995) suggest using a power curve Nr = a·r^b, with b < -1.
Fit it using linear regression in log space: log Nr = log a + b·log r.
The observed distribution of Nr, when transformed into log space, looks like this:

13 Good-Turing (Cont.) Here is a graph of the line fit in log space and the resulting power curve. The fit is not perfect, but it gives us smoothed values S(r) for Nr. Now, for r > k, we use S(r) to generate the adjusted counts r*: r* = (r+1) S(r+1) / S(r)
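
A sketch of the log-space regression, assuming the observed (r, Nr) pairs are passed in as parallel arrays; method and variable names are illustrative.

  // Least-squares fit of log Nr = log a + b log r, in the spirit of Gale & Sampson.
  static double[] fitPowerCurve(double[] rs, double[] nrs) {
      int n = rs.length;
      double sx = 0, sy = 0, sxx = 0, sxy = 0;
      for (int i = 0; i < n; i++) {
          double x = Math.log(rs[i]), y = Math.log(nrs[i]);
          sx += x; sy += y; sxx += x * x; sxy += x * y;
      }
      double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope, should come out < -1
      double logA = (sy - b * sx) / n;
      return new double[] { Math.exp(logA), b };             // S(r) = a * r^b
  }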

14 Good-Turing (Cont.) What distribution are we smoothing?
In the previous example we smoothed the joint distribution; this is what is used in Katz backoff, as we shall see later. The adjusted count r* only represents a good "average" value over the conditional distributions.
We can also smooth the conditional probability distributions all at once (i.e., smooth each P(y | x) for every unigram x). Witten-Bell smoothing does something similar (see the Chen and Goodman paper). The count changes are then specific to each conditional distribution, but the data is now a lot sparser.

15 Katz Backoff Good-Turing is not often used in isolation but in conjunction with methods like Katz backoff.
Why? GT only gives you a reasonable estimate for the total probability of all unseen events, not how to distribute this mass among the unseen events. That is, GT tells us that P0 = N1/N, but not how to spread this probability mass.
Idea: use the highest-order n-grams for events we have seen and recursively back off to lower-order n-grams for unseen events. P* is the result of the discounted joint distribution. We distribute the discounted probability mass among unseen events in the higher-order n-gram by relying on information from the lower-order n-gram:
PKATZ(z|y) = P*(z|y) if C(y, z) > 0
PKATZ(z|y) = α(y) P*(z) otherwise
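
A sketch of the bigram backoff decision, assuming the discounted estimates P*(z|y), the unigram P*(z), and α(y) have already been computed (as in the worked example on the next slide); the maps and names here are hypothetical.

  import java.util.Map;

  // Bigram Katz backoff: use the discounted estimate for seen bigrams,
  // otherwise back off to the unigram estimate scaled by alpha(y).
  static double pKatz(String y, String z,
                      Map<String, Map<String, Double>> pStarBigram,
                      Map<String, Double> pStarUnigram,
                      Map<String, Double> alpha) {
      Map<String, Double> row = pStarBigram.get(y);
      if (row != null && row.containsKey(z)) {
          return row.get(z);                        // seen bigram: discounted estimate
      }
      return alpha.get(y) * pStarUnigram.get(z);    // unseen bigram: back off
  }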

16 Katz Backoff (cont.) Consider P(z | C) and the unigram distribution we will back off to.

Bigrams with history C, with P*(z|y) = PGT''(y, z) / P(y):
  y z     r    PGT''(y, z)   P*(z|y)
  C C     10   0.24          0.78
  C .     1    0.01          0.03
  C A     0    0.03          0.10
  C B     0    0.03          0.10
  total   11   0.31          1.00

Unigram distribution:
  y       r    Pmle(y)
  A       12   0.33
  B       6    0.17
  C       11   0.31
  .       7    0.19
  total   36   1.00

Unseen bigrams with history C:
  y z     PGT''(y, z)   P*(z|y)   Pmle(z)
  C A     0.03          0.10      0.33
  C B     0.03          0.10      0.17
  total   0.06          0.20      0.50

We can now use the zero-count events to find the normalization coefficient α(y):
α(C) = 0.20 / 0.50 = 0.4   (so that α(C)·(0.33 + 0.17) = 0.20)
Check: ∑z pKATZ(z|C) = pKATZ(C|C) + pKATZ(.|C) + α(C)·[Pmle(A) + Pmle(B)] = 0.78 + 0.03 + (0.20/0.50)·(0.33 + 0.17) ≈ 1.0

17 Katz Backoff (cont.) From the textbook: c* is the adjusted count from Good-Turing smoothing.
For α(y): the numerator is simply the total discounted probability mass for unseen events in the highest-order n-gram model; the denominator is the total probability mass for those same events when we back off to the lower-order n-gram model.

18 Smoothing: Absolute Discounting
Idea: reduce the counts of observed event types by a fixed amount δ, and reallocate the count mass to unobserved event types. Absolute discounting is simple and gives quite good results.
Terminology:
x - an event (a type) (e.g., a bigram)
C - count of all observations (tokens) (e.g., training size)
C(x) - count of observations (tokens) of type x (e.g., bigram counts)
V - number of event types: |{x}| (e.g., size of vocabulary)
Nr - number of event types observed r times: |{x : C(x) = r}|
δ - a number between 0 and 1 (e.g., 0.75)

19 Absolute Discounting (Cont.)
For seen types, we deduct δ from the count mass:
Pad(x) = (C(x) - δ) / C   if C(x) > 0
How much count mass did we harvest by doing this? We took δ from each of the V - N0 seen types, so we have δ(V - N0) to redistribute among the N0 unseen types. So each unseen type gets a count mass of δ(V - N0) / N0:
Pad(x) = δ(V - N0) / (N0 C)   if C(x) = 0
To see how this works, let's go back to our original example and look at bigram counts. To bring unseens into the picture, let's suppose we have another word, C, giving rise to 7 new unseen bigrams.
See a problem? We've given more probability mass to an unseen event than to some seen events, because we discounted the same mass δ from all events, irrespective of their counts.
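
The two cases as a small Java helper; delta, V, C, and N0 follow the terminology on the previous slide, and the method name is illustrative.

  // Absolute discounting, following the slide's formulas.
  // countX = C(x); delta is the discount (e.g. 0.75); C is the total token count;
  // V is the number of event types; n0 the number of unseen types.
  static double pAbsoluteDiscount(int countX, double delta, long C, long V, long n0) {
      if (countX > 0) {
          return (countX - delta) / C;                  // P_ad(x) = (C(x) - delta) / C
      }
      return delta * (V - n0) / (n0 * (double) C);      // harvested mass spread over unseen types
  }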

20 Absolute Discounting (Cont.)
The probabilities add up to 1, but how do we prove this in general? Also, how do you choose a good value for δ? 0.75 is often recommended, but you can also use held-out data to tune this parameter. And how should you distribute the discounted mass among unseen events?

21 Proper Probability Distributions
To prove that a function p is a probability distribution, you must show:
1. ∀x p(x) ≥ 0
2. ∑x p(x) = 1
The first is generally trivial; the second can be more challenging.
For conditional distributions:
1. ∀x, y p(x | y) ≥ 0
2. ∀y ∑x p(x | y) = 1
A proof for absolute discounting can be found in the Excel sheet (see the section notes online).
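
A numerical sanity check is not a proof, but it is a useful companion to one when debugging a language model. A sketch with illustrative names:

  import java.util.Map;

  // Checks non-negativity and that the probabilities sum to 1 within tolerance.
  static boolean looksLikeDistribution(Map<String, Double> p) {
      double sum = 0.0;
      for (double v : p.values()) {
          if (v < 0) return false;
          sum += v;
      }
      return Math.abs(sum - 1.0) < 1e-6;
  }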

22 Kneser-Ney Smoothing Uses absolute discounting, but distributes the discounted probability mass with a "continuation probability" that estimates the fraction of contexts in which the word z has appeared:
PKN(z|y) = max(C(y, z) - d, 0) / C(y) + β(y) PCONT(z)
PCONT(z) = |{y : C(y, z) > 0}| / ∑z' |{y : C(y, z') > 0}|
The numerator of PCONT(z) is the number of seen bigram types in which z is the second word; the denominator is simply the number of all seen bigram types.

23 Kneser-Ney (cont.) Consider our toy example again (with the extra C sentence) and let's look at P(z | A), using discount d = 0.75.

All bigrams, with discounted counts C'(y,z) = max(C(y,z) - d, 0):
  y z     C(y,z)   C'(y,z)
  C C     10       9.25
  A A     5        4.25
  A .     4        3.25
  . A     4        3.25
  A B     3        2.25
  B A     3        2.25
  B B     2        1.25
  B .     1        0.25
  . B     1        0.25
  . .     1        0.25
  C .     1        0.25
  . C     1        0.25
  A C     0        0.00
  B C     0        0.00
  C A     0        0.00
  C B     0        0.00
  total   36

Bigrams with history A:
  y z     C(y,z)   C'(y,z)   Pad*(z|y)   PCONT(z)
  A A     5        4.25      0.35        3/12
  A .     4        3.25      0.27        4/12
  A B     3        2.25      0.19        3/12
  A C     0        0.00      0.00        2/12
  total   12       9.75      0.81        1

24 Kneser-Ney (cont.) From Chen and Goodman (1998) we can find a formula for β(y):
β(y) = d · |{z : C(y, z) > 0}| / ∑z C(y, z)
But wait! β(y) is again a normalizing coefficient: it is simply the probability mass that we have discounted from our original distribution. For our example:
β(A) = (0.75 + 0.75 + 0.75) / 12 = 2.25 / 12 = 0.75 · 3 / 12 = d · |{z : C(A, z) > 0}| / ∑z C(A, z)
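
A sketch of the bigram Kneser-Ney formula above, computed directly from a nested map of raw bigram counts; the data structure and names are hypothetical, and it assumes the history y has been seen at least once.

  import java.util.*;

  // Bigram Kneser-Ney: discounted MLE term plus beta(y) * continuation probability.
  // bigramCounts.get(y).get(z) holds C(y, z).
  static double pKneserNey(String y, String z, double d,
                           Map<String, Map<String, Double>> bigramCounts) {
      Map<String, Double> row = bigramCounts.getOrDefault(y, Collections.emptyMap());
      double cY = 0, typesAfterY = 0;
      for (double c : row.values()) { cY += c; if (c > 0) typesAfterY++; }

      // Continuation probability: in how many distinct contexts has z been seen,
      // relative to the total number of seen bigram types?
      double contextsOfZ = 0, totalSeenBigramTypes = 0;
      for (Map<String, Double> r : bigramCounts.values()) {
          for (Map.Entry<String, Double> e : r.entrySet()) {
              if (e.getValue() > 0) {
                  totalSeenBigramTypes++;
                  if (e.getKey().equals(z)) contextsOfZ++;
              }
          }
      }
      double pCont = contextsOfZ / totalSeenBigramTypes;

      double discounted = Math.max(row.getOrDefault(z, 0.0) - d, 0) / cY;
      double beta = d * typesAfterY / cY;          // discounted mass for history y
      return discounted + beta * pCont;
  }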

25 Kneser-Ney/Katz Summary
Higher-order Kneser-Ney and Katz backoff models are defined recursively; see the textbook or the Chen and Goodman (1998) paper for details.
Both techniques provide a way to distribute discounted probability mass more "intelligently" among unseen events.
Don't let the math deter you: the equations can look hairy, but they're really not that bad.

26 Smoothing and Conditional Probabilities
Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution: P(x | y) = P(x, y) / P(y). What if you want to use smoothing?
WRONG: smooth the joint P(x, y) and the marginal P(y) independently, then combine them: P'''(x | y) = P'(x, y) / P''(y).
Correct: smooth the conditional P(x | y) directly, or derive it from a single smoothed joint, as shown on the next slide.

27 Smoothing and Conditional Probabilities (Cont.)
The problem with the WRONG approach is that the joint and the marginal are smoothed separately, so it makes no sense to divide the results. The right way to compute the smoothed conditional probability distribution P(x | y) is:
1. From the joint P(x, y), compute a smoothed joint P'(x, y).
2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y).
3. Divide them: let P'(x | y) = P'(x, y) / P'(y).
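
The three steps as a sketch, assuming the smoothed joint P'(x, y) has already been computed by whatever smoothing method you chose; the map layout (keys are (x, y) pairs) and the names are illustrative.

  import java.util.*;

  // P'(x | y) = P'(x, y) / P'(y), with P'(y) derived from the same smoothed joint.
  static double smoothedConditional(String x, String y,
                                    Map<List<String>, Double> smoothedJoint) {
      double pXY = smoothedJoint.getOrDefault(Arrays.asList(x, y), 0.0);
      double pY = 0.0;
      for (Map.Entry<List<String>, Double> e : smoothedJoint.entrySet()) {
          if (e.getKey().get(1).equals(y)) pY += e.getValue();   // P'(y) = sum_x P'(x, y)
      }
      return pXY / pY;
  }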

28 Smoothing and Conditional Probabilities (Cont.)
Suppose we're on safari, and we count the animals we observe by species (x) and gender (y): From these counts, we can easily compute unsmoothed, empirical joint and marginal distributions:

29 Smoothing and Conditional Probabilities (Cont.)
Now suppose we want to use absolute discounting with discount δ, and compute the adjusted counts. From these counts, we can compute smoothed joint and marginal distributions. Now, since both P'(x, y) and P'(y) result from the same smoothing operation, it's OK to divide them.

30 Entropy (1/2) If X is a random variable whose distribution is p (which we write "X ~ p"), then we can define the entropy of X as H(X) = -∑x p(x) lg p(x).
Note that entropy is (the negative of) the expected value of the log probability of x. Log probabilities come up a lot in statistical modeling. Since probabilities are always ≤ 1, log probabilities are always ≤ 0, and therefore entropy is always ≥ 0.
Let's calculate the entropy of our observed unigram distribution, o(x).
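
A small helper for this calculation (illustrative names), using the convention 0 lg 0 = 0; for o(x) = {A: 1/2, B: 1/4, .: 1/4} it returns 1.5 bits.

  import java.util.Map;

  // Entropy in bits: H(X) = -sum_x p(x) lg p(x), skipping zero-probability events.
  static double entropy(Map<String, Double> p) {
      double h = 0.0;
      for (double v : p.values()) {
          if (v > 0) h -= v * Math.log(v) / Math.log(2);   // lg = log base 2
      }
      return h;
  }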

31 Entropy (2/2) What does entropy mean, intuitively?
It is helpful to think of entropy as a measure of the evenness or uniformity of the distribution.
Lowest entropy: what parameters achieve the minimum?
Highest entropy: what parameters achieve the maximum?
What if m(x) = 0 for some x? By convention 0 lg 0 = 0, so events with probability 0 do not affect the entropy calculation.

32 Joint Entropy (1/2) If random variables X and Y have joint distribution p(x, y), then we can define the joint entropy of X and Y as H(X, Y) = -∑x ∑y p(x, y) lg p(x, y).
Let's calculate the joint entropy of our observed bigram distribution, o(x, y).
What happens when X and Y are independent? Then H(X, Y) = H(X) + H(Y), because p(x, y) = p(x) · p(y).

33 Joint Entropy (2/2) Try fiddling with the parameters of the following joint distribution m(x, y), and observe what happens to the joint entropy:

34 Conditional Entropy (1/2)
If random variables X and Y have joint distribution p(x, y), then we can define the conditional entropy of Y given X as H(Y | X) = ∑x p(x) [ -∑y p(y | x) lg p(y | x) ].
Conditional entropy tells us how much uncertainty about Y remains once we know X.
If X and Y are independent, then H(Y | X) = H(Y); we can also see that from the formula.

35 Conditional Entropy (2/2)
From the previous slide: H(Y | X) = ∑x p(x) [ -∑y p(y | x) lg p(y | x) ]
An alternative way of computing H(Y | X): H(Y | X) = H(X, Y) - H(X)

36 Relative entropy (1/3) (KL divergence)
If we have two probability distributions, p(x) and q(x), we can define the relative entropy (aka KL divergence) between p and q as D(p || q) = ∑x p(x) lg (p(x) / q(x)).
We define 0 lg 0 = 0 and p lg(p / 0) = ∞.
Relative entropy measures how much two probability distributions differ. It is asymmetric and never negative: identical distributions have zero relative entropy, and non-identical distributions have positive relative entropy.

37 Relative entropy (2/3) (KL divergence)
Suppose we want to compare our observed unigram distribution o(x) to some arbitrary model distribution m(x). What is the relative entropy between them? Try fiddling with the parameters of m(x) and see what it does to the KL divergence. What parameters minimize the divergence? Maximize?

38 Relative entropy (3/3) (KL divergence)
What happens when o(x) = 0? (lg 0 is normally undefined!) Events that cannot happen (according to o(x)) do not contribute to the KL divergence between o(x) and any other distribution.
What happens when m(x) = 0? (division by 0!) If an event x was observed (o(x) > 0) but your model says it can't happen (m(x) = 0), then your model is infinitely surprised: D(o || m) = ∞.
Cross entropy: H(p, q) = -∑x p(x) lg q(x), and D(p || q) = H(p, q) - H(p).
Why is D(p || q) ≥ 0? Can you prove it? Hint: Gibbs' inequality.
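
The same conventions as a small Java helper (illustrative names): zero-probability events under p contribute nothing, and a zero under q with p(x) > 0 yields infinity.

  import java.util.Map;

  // KL divergence in bits: D(p || q) = sum_x p(x) lg(p(x) / q(x)).
  static double klDivergence(Map<String, Double> p, Map<String, Double> q) {
      double d = 0.0;
      for (Map.Entry<String, Double> e : p.entrySet()) {
          double px = e.getValue();
          if (px == 0) continue;                         // 0 lg 0 = 0
          double qx = q.getOrDefault(e.getKey(), 0.0);
          if (qx == 0) return Double.POSITIVE_INFINITY;  // model is infinitely surprised
          d += px * Math.log(px / qx) / Math.log(2);
      }
      return d;
  }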

39 Java Bug fix!! In the generateWord() function of EmpiricalUnigramLanguageModel.java, the sampling loop should divide by total + 1, since a little probability mass was reserved for unknowns:

  public String generateWord() {
    double sample = Math.random();
    double sum = 0.0;
    for (String word : wordCounter.keySet()) {
      sum += wordCounter.getCount(word) / (total + 1);  // was: / total
      if (sum > sample) {
        return word;
      }
    }
    return "*UNKNOWN*";  // a little probability mass was reserved for unknowns
  }

40 Java You want to train a trigram model on 10 million words, but you keep running out of memory. Don't use a V × V × V matrix!
CounterMap? Good, but it needs to be elaborated for trigrams.
Use HashMap or Hashtable together with the Counter class, e.g. for bigrams: HashMap<String, Counter<String>>
Intern your Strings.
Increase the JVM heap size: -mx2000m
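
A sketch of that layout, with plain java.util standing in for the course's Counter/CounterMap classes; class and method names are illustrative.

  import java.util.*;

  public class BigramCounts {
      // History word -> (next word -> count); interned Strings keep memory down.
      private final Map<String, Map<String, Integer>> counts = new HashMap<>();

      public void observe(String previous, String word) {
          String prev = previous.intern();   // avoid millions of duplicate String objects
          String w = word.intern();
          counts.computeIfAbsent(prev, k -> new HashMap<>()).merge(w, 1, Integer::sum);
      }

      public int getCount(String previous, String word) {
          return counts.getOrDefault(previous, Collections.emptyMap())
                       .getOrDefault(word, 0);
      }
  }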

41 Programming Assignment Tips
Objectives: the assignments are intentionally open-ended. Think of the assignment as an investigation, and of your report as a mini research paper.
Implement the basics: bigram and trigram models, smoothing, interpolation.
Choose one or more additions, e.g.:
Fancy smoothing: Katz, Witten-Bell, Kneser-Ney, gapped bigrams
Compare smoothing joints vs. smoothing conditionals
A crude spelling model for unknown words
Trade-offs between memory and performance
Explain what you did, why you did it, and what you found out.

42 Programming Assignment Tips
Development:
Use a very small dataset during development (especially for debugging).
Use validation data to tune hyperparameters.
Investigate the learning curve (performance as a function of training size).
Investigate the variance of your model results.
Report:
Be concise! 6 pages is plenty.
Prove that your actual, implemented distributions are proper (concisely!).
Include a graph or two.
Error analysis: discuss examples that your model gets wrong.

43 Programming Assignment Tips
DO NOT make us read a 20-page report.
DO NOT underestimate the time it takes to write the actual report.
DO NOT wait until MLK Day (Monday) to start your assignment. OR ELSE…
