Lecture 1, 7/21/2005, Natural Language Processing: CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 7, 8 August 2007.


1 CS60057 Speech & Natural Language Processing, Autumn 2007, Lecture 7, 8 August 2007

2 A Simple Example
P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese), where <s> marks the start of the sentence.

3 A Bigram Grammar Fragment from BERP

Bigram        P       Bigram          P
Eat on        .16     Eat Thai        .03
Eat some      .06     Eat breakfast   .03
Eat lunch     .06     Eat in          .02
Eat dinner    .05     Eat Chinese     .02
Eat at        .04     Eat Mexican     .02
Eat a         .04     Eat tomorrow    .01
Eat Indian    .04     Eat dessert     .007
Eat today     .03     Eat British     .001

4 A Bigram Grammar Fragment from BERP (continued; <s> marks the start of a sentence)

Bigram        P       Bigram               P
<s> I         .25     Want some            .04
<s> I’d       .06     Want Thai            .01
<s> Tell      .04     To eat               .26
<s> I’m       .02     To have              .14
I want        .32     To spend             .09
I would       .29     To be                .02
I don’t       .08     British food         .60
I have        .04     British restaurant   .15
Want to       .65     British cuisine      .01
Want a        .05     British lunch        .01

5
- P(I want to eat British food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British) = .25 × .32 × .65 × .26 × .001 × .60 = .0000081
- vs. P(I want to eat Chinese food) = .00015
- Probabilities seem to capture "syntactic" facts and "world knowledge":
  - eat is often followed by an NP
  - British food is not too popular
- N-gram models can be trained by counting and normalization
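The chain of bigram probabilities above can be checked with a short script. This is a minimal sketch: the probabilities are the ones quoted on the slides, while the `"<s>"` start marker and the names `BIGRAM_P` and `sentence_probability` are assumptions introduced for illustration.

```python
# Bigram probabilities quoted on the slides (BERP fragment).
# "<s>" is an assumed name for the sentence-start marker.
BIGRAM_P = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("eat", "Chinese"): 0.02,
    ("British", "food"): 0.60,
    ("Chinese", "food"): 0.56,
}


def sentence_probability(words, bigram_p=BIGRAM_P, start="<s>"):
    """Multiply P(w_i | w_{i-1}) along the sentence; unseen bigrams count as 0."""
    prob = 1.0
    prev = start
    for word in words:
        prob *= bigram_p.get((prev, word), 0.0)
        prev = word
    return prob


british = sentence_probability("I want to eat British food".split())
chinese = sentence_probability("I want to eat Chinese food".split())
print(f"British: {british:.7f}  Chinese: {chinese:.5f}")
```

With these inputs the British sentence comes out at roughly 8.1e-6 and the Chinese sentence at roughly 1.5e-4, so the Chinese variant is the more probable of the two.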

6 BERP Bigram Counts (rows: first word; columns: second word)

          I     Want   To    Eat   Chinese   Food   Lunch
I         8     1087   0     13    0         0      0
Want      3     0      786   0     6         8      6
To        3     0      10    860   3         0      12
Eat       0     0      2     0     19        2      52
Chinese   2     0      0     0     0         120    1
Food      19    0      17    0     0         0      0
Lunch     4     0      0     0     0         1      0

7 BERP Bigram Probabilities
- Normalization: divide each row's counts by the appropriate unigram count for w_{n-1}

Word            I      Want   To     Eat   Chinese   Food   Lunch
Unigram count   3437   1215   3256   938   213       1506   459

- Computing the bigram probability of "I I": P(I | I) = C(I, I) / C(all I) = 8 / 3437 = .0023
- Maximum Likelihood Estimation (MLE): relative frequency, e.g. P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
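The normalization step can be sketched directly from the BERP counts. The counts below are the ones on the slides; the names `WORDS`, `BIGRAM_COUNTS`, `UNIGRAM_COUNTS` and `mle_bigram` are hypothetical, chosen only for this example.

```python
WORDS = ["I", "want", "to", "eat", "Chinese", "food", "lunch"]

# Rows: first word w_{n-1}; columns follow WORDS (second word w_n).
BIGRAM_COUNTS = {
    "I":       [8, 1087, 0, 13, 0, 0, 0],
    "want":    [3, 0, 786, 0, 6, 8, 6],
    "to":      [3, 0, 10, 860, 3, 0, 12],
    "eat":     [0, 0, 2, 0, 19, 2, 52],
    "Chinese": [2, 0, 0, 0, 0, 120, 1],
    "food":    [19, 0, 17, 0, 0, 0, 0],
    "lunch":   [4, 0, 0, 0, 0, 1, 0],
}

UNIGRAM_COUNTS = {
    "I": 3437, "want": 1215, "to": 3256, "eat": 938,
    "Chinese": 213, "food": 1506, "lunch": 459,
}


def mle_bigram(prev, word):
    """Relative-frequency estimate P(word | prev) = C(prev word) / C(prev)."""
    count = BIGRAM_COUNTS[prev][WORDS.index(word)]
    return count / UNIGRAM_COUNTS[prev]


for prev, word in [("I", "I"), ("I", "want"), ("want", "to"), ("Chinese", "food")]:
    print(f"P({word} | {prev}) = {mle_bigram(prev, word):.4f}")
```

Each row is divided by the unigram count of its first word, which is the normalization the slide describes.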

8 What do we learn about the language?
- What's being captured with...
  - P(want | I) = .32
  - P(to | want) = .65
  - P(eat | to) = .26
  - P(food | Chinese) = .56
  - P(lunch | eat) = .055
- What about...
  - P(I | I) = .0023
  - P(I | want) = .0025
  - P(I | food) = .013

9
- P(I | I) = .0023: "I I I I want"
- P(I | want) = .0025: "I want I want"
- P(I | food) = .013: "the kind of food I want is..."

10 Approximating Shakespeare
- As we increase the value of N, the accuracy of the n-gram model increases, since the choice of next word becomes increasingly constrained
- Generating sentences with random unigrams...
  - Every enter now severally so, let
  - Hill he late speaks; or! a more to leg less first you enter
- With bigrams...
  - What means, sir. I confess she? then all sorts, he is trim, captain.
  - Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.

11
- Trigrams
  - Sweet prince, Falstaff shall die.
  - This shall forbid it should be branded, if renown made it empty.
- Quadrigrams
  - What! I will go seek the traitor Gloucester.
  - Will you not tell me who I am?

12
- There are 884,647 tokens, with 29,066 word form types, in an approximately one-million-word Shakespeare corpus
- Shakespeare produced 300,000 bigram types out of 844 million possible bigrams, so 99.96% of the possible bigrams were never seen (have zero entries in the table)
- Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
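The sparsity figures follow from the type count alone. A quick arithmetic check, using only the numbers quoted on the slide (the variable names are illustrative):

```python
word_types = 29_066          # word form types in the Shakespeare corpus
seen_bigram_types = 300_000  # bigram types actually observed

# Every ordered pair of types is a possible bigram.
possible_bigrams = word_types ** 2
unseen_fraction = 1 - seen_bigram_types / possible_bigrams

print(f"possible bigrams: {possible_bigrams:,}")
print(f"never seen: {unseen_fraction:.2%}")
```

The square of 29,066 is about 844.8 million, which is where the "844 million possible bigrams" figure comes from.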

13 N-Gram Training Sensitivity
- If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?
- This has major implications for corpus selection or design
- Dynamically adapting language models to different genres

14 Unknown words
- Unknown or out-of-vocabulary (OOV) words
- Open vocabulary system: model the unknown word by <UNK>
- Training is as follows:
  1. Choose a vocabulary
  2. Convert any word in the training set not belonging to this set to <UNK>
  3. Estimate the probabilities for <UNK> from its counts
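The three training steps can be sketched as follows. The slide does not say how the vocabulary is chosen, so the frequency threshold `min_count` is an assumption, as are the token name `"<UNK>"`, the helper names and the toy training text.

```python
from collections import Counter

UNK = "<UNK>"  # assumed name for the unknown-word token


def build_vocab(tokens, min_count=2):
    """Step 1 (assumed policy): keep words seen at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


def replace_oov(tokens, vocab):
    """Step 2: map every word outside the vocabulary to the UNK token."""
    return [word if word in vocab else UNK for word in tokens]


train = "i want to eat i want thai food i want to sleep".split()
vocab = build_vocab(train)
mapped = replace_oov(train, vocab)

# Step 3: estimate the UNK probability from its counts, like any other word.
unk_prob = Counter(mapped)[UNK] / len(mapped)
print(mapped, unk_prob)
```

At test time the same `replace_oov` call would be applied before looking up probabilities, so unseen words fall back to the `<UNK>` estimate.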

15 Evaluating n-grams - Perplexity
- Evaluating applications (like speech recognition) is potentially expensive
- Need a metric to quickly evaluate potential improvements in a language model
- Perplexity
  - Intuition: the better model has a tighter fit to the test data (assigns higher probability to the test data)
  - PP(W) = P(w1 w2 … wN)^(-1/N) (pg 14, chapter 4)
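A minimal sketch of the perplexity formula, assuming the model has already produced one conditional probability per test word. It works in log space to avoid underflow on long test sets; the function name and the example inputs are illustrative.

```python
import math


def perplexity(word_probs):
    """PP(W) = P(w1 ... wN)^(-1/N), computed from per-word probabilities."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)


# Per-word bigram probabilities for "I want to eat Chinese food" from the slides.
probs = [0.25, 0.32, 0.65, 0.26, 0.02, 0.56]
pp = perplexity(probs)
print(f"perplexity: {pp:.2f}")
```

A model that assigns every word probability 0.1 has perplexity 10, which matches the reading of perplexity as an average branching factor.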

16 Some Useful Empirical Observations
- A small number of events occur with high frequency
- A large number of events occur with low frequency
- You can quickly collect statistics on the high frequency events
- You might have to wait an arbitrarily long time to get valid statistics on low frequency events
- Some of the zeroes in the table are really zeros, but others are simply low frequency events you haven't seen yet. How to address?

17 Smoothing: None
- Called the Maximum Likelihood estimate.
- Terrible on test data: if there are no occurrences of C(xyz), the probability is 0.

18 Smoothing Techniques
- Every n-gram training matrix is sparse, even for very large corpora (Zipf’s law)
- Solution: estimate the likelihood of unseen n-grams
- Problem: how do you adjust the rest of the corpus to accommodate these ‘phantom’ n-grams?

19 Smoothing = Redistributing Probability Mass

21 Add-one Smoothing
- For unigrams:
  - Add 1 to every word (type) count
  - Normalize by N (tokens) / (N (tokens) + V (types))
  - Smoothed count (adjusted for additions to N) is c* = (c + 1) N / (N + V)
  - Normalize by N to get the new unigram probability: p* = (c + 1) / (N + V)
- For bigrams:
  - Add 1 to every bigram count: c(w_{n-1} w_n) + 1
  - Increment the unigram count by the vocabulary size: c(w_{n-1}) + V
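The bigram case can be sketched in a few lines. The counts 786 and 1215 are the BERP values for "want to" and "want" from the slides; the vocabulary size `V = 1616` is not stated in these slides and is an assumption here, as is the function name.

```python
V = 1616  # assumed BERP vocabulary size (not given on the slides)


def add_one_bigram_prob(bigram_count, prev_count, vocab_size=V):
    """Add-one estimate: (c(w_{n-1} w_n) + 1) / (c(w_{n-1}) + V)."""
    return (bigram_count + 1) / (prev_count + vocab_size)


p_mle = 786 / 1215                        # unsmoothed P(to | want)
p_add1 = add_one_bigram_prob(786, 1215)   # smoothed P(to | want)
p_unseen = add_one_bigram_prob(0, 1215)   # e.g. an unseen bigram after "want"

print(f"MLE: {p_mle:.2f}  add-one: {p_add1:.2f}  unseen: {p_unseen:.5f}")
```

Under that assumed vocabulary size the smoothed estimate drops from about .65 to about .28, and every previously unseen bigram after "want" receives a small non-zero probability.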

22 Effect on BERP bigram counts

23 Add-one bigram probabilities

24 The problem

25 The problem
- Add-one has a huge effect on probabilities: e.g., P(to|want) went from .65 to .28!
- Too much probability gets ‘removed’ from n-grams actually encountered (more precisely: the ‘discount factor’

26 Discount: ratio of new counts to old (e.g. add-one smoothing changes the BERP bigram (to|want) from 786 to 331 (d_c = .42) and p(to|want) from .65 to .28)
But this changes counts drastically:
- too much weight is given to unseen n-grams
- in practice, unsmoothed bigrams often work better!

27 Smoothing
- Add-one smoothing:
  - Works very badly.
- Add-delta smoothing:
  - Still very bad.
[based on slides by Joshua Goodman]

28 Witten-Bell Discounting
- A zero n-gram is just an n-gram you haven’t seen yet... but every n-gram in the corpus was unseen once... so:
  - How many times did we see an n-gram for the first time? Once for each n-gram type (T)
  - Estimate the total probability of unseen bigrams as T / (N + T)
  - View the training corpus as a series of events, one for each token (N) and one for each new type (T)
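The "one event per token, one event per new type" view leads to a reserved mass of T / (N + T) for unseen events. The sketch below applies that to bigrams in a toy sentence; the function names and the toy data are assumptions, and how the reserved mass is divided among individual unseen bigrams is left out.

```python
from collections import Counter


def witten_bell_unseen_mass(events):
    """Total probability reserved for unseen events: T / (N + T)."""
    n_tokens = len(events)
    n_types = len(set(events))
    return n_types / (n_tokens + n_types)


def witten_bell_seen_prob(count, n_tokens, n_types):
    """Discounted probability of an event seen `count` times: c / (N + T)."""
    return count / (n_tokens + n_types)


words = "i want to eat i want food".split()
bigrams = list(zip(words, words[1:]))  # N = 6 bigram tokens, T = 5 bigram types
counts = Counter(bigrams)

unseen_mass = witten_bell_unseen_mass(bigrams)
seen_mass = sum(
    witten_bell_seen_prob(c, len(bigrams), len(counts)) for c in counts.values()
)
print(unseen_mass, seen_mass)
```

The seen and unseen masses add up to 1, which is the sense in which the probability mass is redistributed.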

29
- We can divide the probability mass equally among unseen bigrams... or we can condition the probability of an unseen bigram on the first word of the bigram
- Discount values for Witten-Bell are much more reasonable than for Add-One

30 Good-Turing Discounting
- Re-estimate the amount of probability mass for zero (or low count) n-grams by looking at n-grams with higher counts
  - N_c: the number of n-grams with frequency c
  - Estimate the smoothed count as c* = (c + 1) N_{c+1} / N_c
  - E.g. N_0’s adjusted count is a function of the count of n-grams that occur once, N_1
  - P (tfrequency
- Assumes:
  - word bigrams follow a binomial distribution
  - we know the number of unseen bigrams (V × V - seen)
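A small sketch of the Good-Turing re-estimate c* = (c + 1) N_{c+1} / N_c, using frequency-of-frequency counts. The toy counts, the number of unseen bigrams and the function name are made up for illustration. The raw formula is undefined when N_{c+1} is zero, so this sketch simply skips those cases; practical implementations usually smooth the N_c values first.

```python
from collections import Counter


def good_turing_counts(ngram_counts, n_unseen):
    """Return {c: c*} with c* = (c + 1) * N_{c+1} / N_c where both are non-zero."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c for c >= 1
    freq_of_freq[0] = n_unseen                     # N_0: unseen n-grams
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        if n_c > 0 and n_next > 0:
            adjusted[c] = (c + 1) * n_next / n_c
    return adjusted


# Toy bigram counts: N_1 = 3, N_2 = 2, N_3 = 1.
toy_counts = {"a b": 3, "b c": 1, "c d": 1, "d e": 1, "e f": 2, "f g": 2}
adjusted = good_turing_counts(toy_counts, n_unseen=30)
print(adjusted)
```

The adjusted count for unseen bigrams here is N_1 / N_0, which is the dependence on singleton counts that the slide mentions.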

31 Interpolation and Backoff
- Typically used in addition to smoothing techniques/discounting
  - Example: trigrams
  - Smoothing gives some probability mass to all the trigram types not observed in the training data
  - We could make a more informed decision! How?
- If backoff finds an unobserved trigram in the test data, it will “back off” to bigrams (and ultimately to unigrams)
  - Backoff doesn’t treat all unseen trigrams alike
  - When we have observed a trigram, we will rely solely on the trigram counts
- Interpolation generally takes bigrams and unigrams into account for trigram probability

32 Backoff methods (e.g. Katz ‘87)
- For e.g. a trigram model
  - Compute unigram, bigram and trigram probabilities
  - In use: where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability
  - E.g. An omnivorous unicorn
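The lookup order can be sketched as below. This is plain backoff without discounting, not Katz's method: Katz backoff additionally discounts the higher-order estimates and scales the lower-order ones so the result is a proper distribution. The probability tables and their values are invented for the example.

```python
def backoff_prob(w1, w2, w3, trigram_p, bigram_p, unigram_p):
    """Use the trigram estimate if present, else the bigram, else the unigram."""
    if (w1, w2, w3) in trigram_p:
        return trigram_p[(w1, w2, w3)]
    if (w2, w3) in bigram_p:
        return bigram_p[(w2, w3)]
    return unigram_p.get(w3, 0.0)


# Hypothetical probability tables (values are illustrative only).
trigram_p = {("want", "to", "eat"): 0.3}
bigram_p = {("to", "eat"): 0.26, ("an", "omnivorous"): 0.001}
unigram_p = {"eat": 0.01, "unicorn": 0.00002}

seen = backoff_prob("want", "to", "eat", trigram_p, bigram_p, unigram_p)
rare = backoff_prob("an", "omnivorous", "unicorn", trigram_p, bigram_p, unigram_p)
print(seen, rare)
```

For "an omnivorous unicorn" neither the trigram nor the bigram "omnivorous unicorn" is in the tables, so the estimate falls through to the unigram probability of "unicorn".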

33 Smoothing: Simple Interpolation
- Trigram is very context specific, very noisy
- Unigram is context-independent, smooth
- Interpolate trigram, bigram and unigram for the best combination
- Find 0 < λ < 1 by optimizing on “held-out” data
- Almost good enough
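One common way to write the interpolation is a weighted sum of the three estimates with weights that add up to 1. The sketch below assumes that form; the parameter names `lam_tri` and `lam_bi` and the example numbers are hypothetical.

```python
def interpolate(p_tri, p_bi, p_uni, lam_tri, lam_bi):
    """Linear interpolation; the unigram weight is whatever is left over."""
    lam_uni = 1.0 - lam_tri - lam_bi
    if min(lam_tri, lam_bi, lam_uni) < 0:
        raise ValueError("weights must be non-negative and sum to 1")
    return lam_tri * p_tri + lam_bi * p_bi + lam_uni * p_uni


# An unseen trigram (p_tri = 0) still gets probability from the lower orders.
p = interpolate(p_tri=0.0, p_bi=0.2, p_uni=0.01, lam_tri=0.6, lam_bi=0.3)
print(p)
```

Because the unigram term is non-zero for any in-vocabulary word, the interpolated estimate never collapses to zero the way the raw trigram estimate does.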

34 Smoothing: Held-out estimation
- Finding parameter values
  - Split data into training, “heldout”, test
  - Try lots of different values for λ on heldout data, pick the best
  - Test on test data
  - Sometimes, can use tricks like “EM” (expectation maximization) to find values
  - [Joshua Goodman:] I prefer to use a generalized search algorithm, “Powell search”; see Numerical Recipes in C
[based on slides by Joshua Goodman]
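"Try lots of different values on heldout data, pick the best" can be sketched as a grid search. To keep it short this example interpolates only a bigram and a unigram estimate with a single weight; the held-out probabilities are invented, and a grid search is a stand-in for the EM or Powell search mentioned on the slide.

```python
import math


def heldout_log_likelihood(lam, heldout_events):
    """heldout_events: (p_bigram, p_unigram) pairs, one per held-out word."""
    return sum(
        math.log(lam * p_bi + (1 - lam) * p_uni) for p_bi, p_uni in heldout_events
    )


def best_lambda(heldout_events, steps=99):
    """Grid search over 0.01 .. 0.99 for the weight maximizing the likelihood."""
    candidates = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(candidates, key=lambda lam: heldout_log_likelihood(lam, heldout_events))


# Hypothetical held-out data: two words whose bigrams were unseen in training.
heldout = [(0.5, 0.1), (0.0, 0.05), (0.3, 0.02), (0.0, 0.01)]
lam = best_lambda(heldout)
print(lam)
```

The unseen bigrams in the held-out set pull the best weight away from 1, which is why the weight has to be tuned on data the counts were not estimated from.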

35 Held-out estimation: splitting data
- How much data for training, heldout, test?
- Some people say things like “1/3, 1/3, 1/3” or “80%, 10%, 10%”. They are WRONG
- Heldout should have (at least) 100-1000 words per parameter.
- Answer: enough test data to be statistically significant (1000s of words perhaps)
[based on slides by Joshua Goodman]

36 Summary
- N-gram probabilities can be used to estimate the likelihood
  - of a word occurring in a context (N-1)
  - of a sentence occurring at all
- Smoothing techniques deal with problems of unseen words in a corpus

37 Practical Issues
- Represent and compute language model probabilities in log format:
  p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
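The reason for working in log format is numerical: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays representable. A short demonstration, with made-up probabilities and an illustrative helper name:

```python
import math


def product_via_logs(probs):
    """p1 * p2 * ... computed as exp(log p1 + log p2 + ...)."""
    return math.exp(sum(math.log(p) for p in probs))


# Eighty words, each with probability 1e-5: the true product is 1e-400.
probs = [1e-5] * 80
naive = math.prod(probs)                     # underflows to 0.0
log_total = sum(math.log(p) for p in probs)  # stays finite

print(naive, log_total)
print(product_via_logs([0.25, 0.32, 0.65, 0.26]))
```

In practice the log probabilities are usually compared directly and never exponentiated back, since only their ordering matters when ranking hypotheses.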

38 Class-based n-grams
- P(w_i | w_{i-1}) = P(c_i | c_{i-1}) × P(w_i | c_i)
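The class-based factorization can be illustrated with a tiny lookup. The word classes, the probability tables and the function name below are all hypothetical; the sketch also assumes each word belongs to exactly one class.

```python
# Hypothetical hard class assignment: each word has exactly one class.
WORD_CLASS = {"Monday": "DAY", "Tuesday": "DAY", "on": "PREP", "at": "PREP"}

# Hypothetical class-transition and word-emission probabilities.
CLASS_BIGRAM_P = {("PREP", "DAY"): 0.3}
WORD_GIVEN_CLASS_P = {("Monday", "DAY"): 0.5, ("Tuesday", "DAY"): 0.5}


def class_bigram_prob(prev_word, word):
    """P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c_prev = WORD_CLASS[prev_word]
    c = WORD_CLASS[word]
    return CLASS_BIGRAM_P.get((c_prev, c), 0.0) * WORD_GIVEN_CLASS_P.get((word, c), 0.0)


print(class_bigram_prob("on", "Tuesday"))
print(class_bigram_prob("at", "Monday"))
```

Both word pairs share the same class transition, so a bigram such as "at Monday" gets a sensible estimate even if only "on Tuesday" appeared in training.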

