
1 Advanced Smoothing, Evaluation of Language Models

2 Witten-Bell Discounting
A zero ngram is just an ngram you haven't seen yet... but every ngram in the corpus was unseen once... so:
How many times did we see an ngram for the first time? Once for each ngram type (T)
Estimate the total probability of unseen bigrams as T / (N + T)
View the training corpus as a series of events, one for each token (N) and one for each new type (T)
We can divide that probability mass equally among unseen bigrams... or we can condition the probability of an unseen bigram on the first word of the bigram
Discount values for Witten-Bell are much more reasonable than Add-One
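
A minimal sketch of Witten-Bell discounting for bigrams under the slide's assumptions, on a toy corpus; the function name, the toy data, and the equal split of the unseen mass over all V x V - seen bigram types are illustrative choices, not from the slides.

from collections import Counter

def witten_bell_bigram_probs(tokens, vocab):
    """Witten-Bell: reserve T / (N + T) of the probability mass for unseen
    bigrams, where N = bigram tokens seen and T = distinct bigram types."""
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    N = len(bigrams)                 # events: one per bigram token
    T = len(counts)                  # events: one per new bigram type
    Z = len(vocab) ** 2 - T          # number of unseen bigram types
    unseen_mass = T / (N + T)

    def prob(w1, w2):
        c = counts.get((w1, w2), 0)
        if c > 0:
            return c / (N + T)       # discounted estimate for seen bigrams
        return unseen_mass / Z       # unseen mass divided equally

    return prob

# Toy usage
tokens = "the cat sat on the mat the cat ate".split()
p = witten_bell_bigram_probs(tokens, set(tokens))
print(p("the", "cat"), p("cat", "the"))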

3 Good-Turing Discounting
Re-estimate the amount of probability mass for zero-count (or low-count) ngrams by looking at ngrams with higher counts
Estimate the adjusted count as c* = (c + 1) N_{c+1} / N_c, where N_c is the number of ngram types occurring c times
–E.g. N_0's adjusted count is a function of the count of ngrams that occur once, N_1
Assumes:
–word bigrams follow a binomial distribution
–we know the number of unseen bigrams (V x V - seen)
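
A small sketch of Good-Turing adjusted counts, c* = (c + 1) N_{c+1} / N_c, on a toy bigram table; using raw count-of-counts without smoothing the N_c values themselves is a simplification for illustration.

from collections import Counter

def good_turing_adjusted_counts(ngram_counts, vocab_size, max_c=3):
    """Good-Turing: c* = (c + 1) * N_{c+1} / N_c, where N_c is the number
    of ngram types occurring exactly c times (N_0 = unseen bigram types)."""
    count_of_counts = Counter(ngram_counts.values())
    count_of_counts[0] = vocab_size ** 2 - len(ngram_counts)  # V x V - seen
    adjusted = {}
    for c in range(max_c + 1):
        if count_of_counts[c] > 0:
            adjusted[c] = (c + 1) * count_of_counts[c + 1] / count_of_counts[c]
    return adjusted

# Toy usage: bigram counts from a tiny corpus
tokens = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(good_turing_adjusted_counts(bigram_counts, vocab_size=len(set(tokens))))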

4 Interpolation and Backoff
Typically used in addition to smoothing/discounting techniques
Example: trigrams
Smoothing gives some probability mass to all the trigram types not observed in the training data
We could make a more informed decision! How?
If backoff finds an unobserved trigram in the test data, it will "back off" to bigrams (and ultimately to unigrams)
–Backoff doesn't treat all unseen trigrams alike
–When we have observed a trigram, we rely solely on the trigram counts

5 Backoff methods (e.g. Katz '87)
For example, a trigram model:
Compute unigram, bigram and trigram probabilities
In use:
–Where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability
–E.g. "An omnivorous unicorn"
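
A simplified backoff sketch in the spirit of this slide (hypothetical function names; unlike full Katz '87 it has no discounting and no backoff weights, so the result is not a properly normalized distribution):

from collections import Counter

def backoff_prob(w1, w2, w3, tri, bi, uni, total):
    """Simplified backoff: use the trigram estimate if the trigram was
    observed, otherwise the bigram estimate, otherwise the unigram."""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return bi[(w2, w3)] / uni[w2]
    return uni[w3] / total

tokens = "an omnivorous unicorn is still a unicorn".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

print(backoff_prob("an", "omnivorous", "unicorn", tri, bi, uni, len(tokens)))  # trigram observed
print(backoff_prob("still", "a", "omnivorous", tri, bi, uni, len(tokens)))     # backs off to unigram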

6 Smoothing: Simple Interpolation
Trigram is very context-specific, very noisy
Unigram is context-independent, smooth
Interpolate trigram, bigram and unigram for the best combination
Find the λs, 0 < λ < 1, by optimizing on "held-out" data
Almost good enough
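
A minimal sketch of simple linear interpolation, P_interp = λ1·P(trigram) + λ2·P(bigram) + λ3·P(unigram); the fixed λ values and the component probabilities below are placeholders, since in practice the λs come from held-out optimization (next slide):

def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram and unigram estimates.
    The lambdas must sum to 1 so the result is a valid probability."""
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Toy usage with made-up component probabilities
print(interpolated_prob(p_tri=0.0, p_bi=0.04, p_uni=0.001))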

7 Smoothing: Held-out Estimation
Finding parameter values:
Split data into training, "held-out" and test sets
Try lots of different values for the λs on the held-out data, pick the best
Test on the test data
Sometimes we can use tricks like EM (expectation maximization) to find the values
How much data for training, held-out and test? Answer: enough test data to be statistically significant (thousands of words, perhaps)
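
A self-contained sketch of held-out estimation by grid search; the 0.1-step grid and the toy per-position (trigram, bigram, unigram) probabilities are made up for illustration, and EM would be the more principled alternative the slide mentions:

import math
from itertools import product

# Toy held-out data: for each word position, the (trigram, bigram, unigram)
# probabilities assigned by models estimated on the training set.
heldout = [(0.00, 0.05, 0.01), (0.20, 0.10, 0.02), (0.00, 0.00, 0.03)]

def heldout_log_likelihood(lambdas, data):
    l1, l2, l3 = lambdas
    return sum(math.log(l1 * pt + l2 * pb + l3 * pu) for pt, pb, pu in data)

grid = [x / 10 for x in range(1, 9)]
best = max(
    ((l1, l2, round(1 - l1 - l2, 10))
     for l1, l2 in product(grid, grid) if l1 + l2 < 1),
    key=lambda lam: heldout_log_likelihood(lam, heldout),
)
print("best lambdas:", best)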

8 Summary
N-gram probabilities can be used to estimate the likelihood
–of a word occurring in a context (the preceding N-1 words)
–of a sentence occurring at all
Smoothing techniques deal with the problem of unseen words in a corpus

9 Practical Issues
Represent and compute language model probabilities in log format
p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
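
A tiny sketch of why log space matters: multiplying many small probabilities underflows to 0.0 in floating point, while summing their logs stays well-behaved (the probabilities here are made up):

import math

probs = [1e-5] * 100          # 100 made-up word probabilities

product = 1.0
for p in probs:
    product *= p              # underflows to 0.0 long before the end

log_prob = sum(math.log(p) for p in probs)   # fine: about -1151.3

print(product)                # 0.0
print(log_prob)               # -1151.29...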

10 Class-based n-grams
P(w_i | w_{i-1}) = P(c_i | c_{i-1}) × P(w_i | c_i), where c_i is the class of w_i
Factored Language Models
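
A minimal sketch of the class-based bigram factorization; the word-to-class mapping and the component probabilities are hypothetical values chosen purely for illustration:

# Hypothetical word classes and component probabilities (illustrative only)
word_class = {"monday": "DAY", "tuesday": "DAY", "on": "PREP"}
p_class_bigram = {("PREP", "DAY"): 0.2}        # P(c_i | c_{i-1})
p_word_given_class = {("monday", "DAY"): 0.15} # P(w_i | c_i)

def class_bigram_prob(prev_word, word):
    """P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)"""
    prev_c, c = word_class[prev_word], word_class[word]
    p_cc = p_class_bigram.get((prev_c, c), 0.0)
    p_wc = p_word_given_class.get((word, c), 0.0)
    return p_cc * p_wc

print(class_bigram_prob("on", "monday"))   # 0.2 * 0.15 ≈ 0.03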

11 Evaluating language models
We need evaluation metrics to determine how well our language models predict the next word
Intuition: one should average over the probability of new words

12 Some basic information theory
Evaluation metrics for language models
Information theory: measures of information
–Entropy
–Perplexity

13 Entropy
Average length of the most efficient coding for a random variable
Binary encoding

14 Entropy
Example: betting on horses
8 horses, each horse is equally likely to win
(Binary) messages required: 001, 010, 011, 100, 101, 110, 111, 000
A 3-bit message is required

15 Entropy
8 horses, some horses are more likely to win:
–Horse 1: probability 1/2, code 0
–Horse 2: probability 1/4, code 10
–Horse 3: probability 1/8, code 110
–Horse 4: probability 1/16, code 1110
–Horses 5-8: probability 1/64 each, codes 111100, 111101, 111110, 111111
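
A short check (not from the slides) that the variable-length code above has an expected length of 2 bits, i.e. shorter on average than the 3-bit fixed code from the previous slide:

# (probability, code) pairs for the biased horse race
horses = [(1/2, "0"), (1/4, "10"), (1/8, "110"), (1/16, "1110")] + \
         [(1/64, c) for c in ("111100", "111101", "111110", "111111")]

expected_length = sum(p * len(code) for p, code in horses)
print(expected_length)   # 2.0 bits on average, vs. 3 bits for the fixed code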

16 Perplexity
Entropy: H
Perplexity: 2^H
Intuitively: the weighted average number of choices a random variable has to make
Equally likely horses: entropy 3, perplexity 2^3 = 8
Biased horses: entropy 2, perplexity 2^2 = 4
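
A minimal sketch computing entropy and perplexity for the two horse distributions above (the function names are illustrative):

import math

def entropy(probs):
    """H = -sum_i p_i * log2(p_i)"""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    return 2 ** entropy(probs)

uniform = [1/8] * 8
biased = [1/2, 1/4, 1/8, 1/16] + [1/64] * 4

print(entropy(uniform), perplexity(uniform))   # 3.0  8.0
print(entropy(biased), perplexity(biased))     # 2.0  4.0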

17 Entropy
Uncertainty measure (Shannon) given a random variable x:
–H(x) = -Σ_i p_i log_r p_i, with r = 2 and p_i = probability that the event is i
–lg = log2 (log base 2); entropy = H(x) = Shannon uncertainty
Examples:
–Biased coin: -0.8 · lg 0.8 - 0.2 · lg 0.2 = 0.258 + 0.464 = 0.722
–Unbiased coin: -2 · 0.5 · lg 0.5 = 1
Perplexity: (average) branching factor
–weighted average number of choices a random variable has to make
–Formula: 2^H, directly related to the entropy value H
Examples:
–Biased coin: 2^0.722 ≈ 1.65
–Unbiased coin: 2^1 = 2

18 Entropy and Word Sequences
Given a word sequence: W = w_1 ... w_n
Entropy for word sequences of length n in language L:
–H(w_1 ... w_n) = -Σ p(w_1 ... w_n) log p(w_1 ... w_n), summed over all sequences of length n in L
Entropy rate for word sequences of length n:
–(1/n) H(w_1 ... w_n) = -(1/n) Σ p(w_1 ... w_n) log p(w_1 ... w_n)
Entropy rate of the language, where n is the number of words in the sequence:
–H(L) = lim_{n→∞} -(1/n) Σ p(w_1 ... w_n) log p(w_1 ... w_n)
Shannon-McMillan-Breiman theorem:
–H(L) = lim_{n→∞} -(1/n) log p(w_1 ... w_n)
–For sufficiently large n we can take a single long sequence instead of summing over all possible w_1 ... w_n, since a long sequence will contain many shorter sequences

19 Entropy of a sequence
Finite sequence: strings from a language L
Entropy rate (per-word entropy): (1/n) H(w_1 ... w_n) = -(1/n) Σ p(w_1 ... w_n) log p(w_1 ... w_n)

20 Entropy of a language
Entropy rate of language L: H(L) = lim_{n→∞} -(1/n) log p(w_1 ... w_n)
Shannon-McMillan-Breiman theorem: if a language is stationary and ergodic, a sufficiently long single sequence is representative of the language
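
A small sketch (not from the slides) of how this is used in practice: estimate the per-word cross-entropy of a model on a single long test sequence as -(1/n) log2 p(w_1 ... w_n), and report 2 to that power as the perplexity. The add-one-smoothed unigram model here is only a stand-in for whatever language model is being evaluated:

import math
from collections import Counter

train = "the cat sat on the mat the dog sat on the mat".split()
test = "the cat sat on the mat".split()

# Stand-in model: add-one-smoothed unigram probabilities from the training data
counts = Counter(train)
vocab = set(train) | set(test)
def p(w):
    return (counts[w] + 1) / (len(train) + len(vocab))

log2_prob = sum(math.log2(p(w)) for w in test)
cross_entropy = -log2_prob / len(test)     # per-word entropy estimate
perplexity = 2 ** cross_entropy

print(cross_entropy, perplexity)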

