
1 Language Modeling 1

2 Roadmap (for next two classes)  Review LM evaluation metrics  Entropy  Perplexity  Smoothing  Good-Turing  Backoff and Interpolation  Absolute Discounting  Kneser-Ney 2

3 Language Model Evaluation Metrics 3

4 Applications 4

5 Entropy and perplexity 5
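(Added for reference; the equation slides in this part of the deck are images with no extractable text. For a test set W = w_1 … w_N, the standard definitions of cross-entropy and perplexity are:)

H(W) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \dots, w_{i-1})

PP(W) = 2^{H(W)} = P(w_1, \dots, w_N)^{-1/N}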

9 Language model perplexity  Recipe:  Train a language model on training data  Get negative logprobs of test data, compute average  Exponentiate!  Perplexity correlates rather well with:  Speech recognition error rates  MT quality metrics  LM Perplexities for word-based models are normally between say 50 and 1000  Need to drop perplexity by a significant fraction (not absolute amount) to make a visible impact 9
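A minimal sketch of this recipe (illustrative, not code from the deck), using an MLE unigram model for concreteness:

import math
from collections import Counter

# Train a model, sum negative log probabilities of the test data, average per
# token, then exponentiate. An unseen test word would raise a KeyError here,
# which is exactly why the smoothing material below matters.
def train_unigram(train_tokens):
    counts = Counter(train_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def perplexity(model, test_tokens):
    avg_neg_logprob = sum(-math.log(model[w]) for w in test_tokens) / len(test_tokens)
    return math.exp(avg_neg_logprob)

train = "the cat sat on the mat and the dog sat".split()
test = "the dog sat on the mat".split()
print(perplexity(train_unigram(train), test))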

10 Parameter estimation  What is it? 10

11 Parameter estimation  Model form is fixed (coin unigrams, word bigrams, …)  We have observations  H H H T T H T H H  Want to find the parameters  Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data  c(H) = 6; c(T) = 3  P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3  MLE picks parameters best for training data…  …but these don’t generalize well to test data – zeros! 11
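A short sketch of the MLE computation above (illustrative, not from the deck):

from collections import Counter

# MLE parameters are just relative frequencies of the observed outcomes.
obs = "H H H T T H T H H".split()
counts = Counter(obs)                                   # c(H) = 6, c(T) = 3
mle = {outcome: c / len(obs) for outcome, c in counts.items()}
print(mle)                                              # roughly {'H': 0.667, 'T': 0.333}
# Any outcome never observed gets probability 0: the "zeros" problem noted above.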

14 Smoothing 14

15 Smoothing techniques  Laplace  Good-Turing  Backoff  Mixtures  Interpolation  Kneser-Ney 15

16 Laplace 16
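(Added for reference; the equation on this slide is an image. The standard add-one (Laplace) estimate for bigrams, with V the vocabulary size, is:)

P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}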

17 Good-Turing Smoothing  New idea: Use counts of things you have seen to estimate those you haven’t

18 Good-Turing Josh Goodman Intuition  Imagine you are fishing  There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass  You have caught  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish  How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? Slide adapted from Josh Goodman, Dan Jurafsky

19 Good-Turing Josh Goodman Intuition  Imagine you are fishing  There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass  You have caught  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish  How likely is it that the next fish caught is from a new species (one not seen in our previous catch)?  3/18  Assuming so, how likely is it that next species is trout? Slide adapted from Josh Goodman, Dan Jurafsky

20 Good-Turing Josh Goodman Intuition  Imagine you are fishing  There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass  You have caught  10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish  How likely is it that the next fish caught is from a new species (one not seen in our previous catch)?  3/18  Assuming so, how likely is it that next species is trout?  Must be less than 1/18 Slide adapted from Josh Goodman, Dan Jurafsky

21 Some more hypotheticals

Species     Puget Sound   Lake Washington   Greenlake
Salmon           8              12               0
Trout            3               1               1
Cod              1               1               0
Rockfish         1               0               0
Snapper          1               0               0
Skate            1               0               0
Bass             0               1              14
TOTAL           15              15              15

How likely is it to find a new fish in each of these places?
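One way to answer, applying the singleton estimate P(new) = N_1 / N that the following slides develop to the counts in the table above (a sketch, not from the deck):

# Species seen exactly once (singletons) divided by total fish caught.
catches = {
    "Puget Sound":     [8, 3, 1, 1, 1, 1, 0],
    "Lake Washington": [12, 1, 1, 0, 0, 0, 1],
    "Greenlake":       [0, 1, 0, 0, 0, 0, 14],
}
for place, counts in catches.items():
    n1 = sum(1 for c in counts if c == 1)   # species caught exactly once
    n = sum(counts)                          # 15 fish caught in each place
    print(place, f"{n1}/{n}")                # 4/15, 3/15, 1/15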

22 Good-Turing Smoothing  New idea: Use counts of things you have seen to estimate those you haven’t

23 Good-Turing Smoothing  New idea: Use counts of things you have seen to estimate those you haven’t  Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

24 Good-Turing Smoothing
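(Added for reference; the equation on this slide is an image. The standard Good-Turing re-estimate, with N_c the number of n-gram types seen exactly c times and N the total number of observations, is:)

c^* = (c + 1)\,\frac{N_{c+1}}{N_c}, \qquad P_{GT}(\text{unseen}) = \frac{N_1}{N}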

26 GT Fish Example

27 Enough about the fish… how does this relate to language?  Name some linguistic situations where the number of new words would differ 27

28 Enough about the fish… how does this relate to language?  Name some linguistic situations where the number of new words would differ  Different languages:  Chinese has almost no morphology  Turkish has a lot of morphology  Lots of new words in Turkish! 28

29 Enough about the fish… how does this relate to language?  Name some linguistic situations where the number of new words would differ  Different languages:  Chinese has almost no morphology  Turkish has a lot of morphology  Lots of new words in Turkish!  Different domains:  Airplane maintenance manuals: controlled vocabulary  Random web posts: uncontrolled vocab 29

30 Bigram Frequencies of Frequencies and GT Re-estimates

31 Good-Turing Smoothing

32 Additional Issues in Good-Turing  General approach:  Estimate of c* for N_c depends on N_{c+1}  What if N_{c+1} = 0?  More zero count problems  Not uncommon: e.g. fish example, no 4s

33 Modifications  Simple Good-Turing  Compute N_c bins, then smooth N_c to replace zeroes  Fit linear regression in log space  log(N_c) = a + b log(c)  What about large c’s?  Should be reliable  Assume c* = c if c is large, e.g. c > k (Katz: k = 5)  Typically combined with other approaches
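A minimal sketch of the Simple Good-Turing fit described above, with made-up N_c values rather than data from the deck:

import math

# Fit log(N_c) = a + b * log(c) by ordinary least squares, then read smoothed
# N_c values off the line where the observed counts of counts are zero or noisy.
def fit_loglog(freq_of_freq):
    xs = [math.log(c) for c in freq_of_freq]
    ys = [math.log(n) for n in freq_of_freq.values()]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

freq_of_freq = {1: 120, 2: 40, 3: 24, 5: 10, 8: 3}   # toy values; note the gap at c = 4
a, b = fit_loglog(freq_of_freq)
smoothed_N4 = math.exp(a + b * math.log(4))           # fills the N_4 = 0 hole
print(smoothed_N4)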

34 Backoff and Interpolation  Another really useful source of knowledge  If we are estimating:  trigram p(z|x,y)  but count(xyz) is zero  Use info from:

35 Backoff and Interpolation  Another really useful source of knowledge  If we are estimating:  trigram p(z|x,y)  but count(xyz) is zero  Use info from:  Bigram p(z|y)  Or even:  Unigram p(z)

36 Backoff and Interpolation  Another really useful source of knowledge  If we are estimating:  trigram p(z|x,y)  but count(xyz) is zero  Use info from:  Bigram p(z|y)  Or even:  Unigram p(z)  How to combine this trigram, bigram, unigram info in a valid fashion?

37 Backoff vs. Interpolation  Backoff: use trigram if you have it, otherwise bigram, otherwise unigram

38 Backoff vs. Interpolation  Backoff: use trigram if you have it, otherwise bigram, otherwise unigram  Interpolation: always mix all three

39 Backoff 39

40 Backoff 40

41 Backoff 41
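(Added for reference; the backoff equations on these slides are images. The general Katz-style schema for a trigram, where P* is a discounted estimate and the weight α(x, y) is chosen so the distribution sums to 1, is:)

P_{\text{backoff}}(z \mid x, y) =
\begin{cases}
P^*(z \mid x, y) & \text{if } C(x\,y\,z) > 0 \\
\alpha(x, y)\, P_{\text{backoff}}(z \mid y) & \text{otherwise}
\end{cases}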

42 Mixtures 42

43 Interpolation
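(Added for reference; the interpolation equation on this slide is an image. The standard linear interpolation that slide 44 refers to is:)

\hat{P}(z \mid x, y) = \lambda_3\, P(z \mid x, y) + \lambda_2\, P(z \mid y) + \lambda_1\, P(z), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1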

44 How to Set the Lambdas?  Use a held-out, or development, corpus  Choose lambdas which maximize the probability of some held-out data  I.e. fix the N-gram probabilities  Then search for lambda values  That when plugged into previous equation  Give largest probability for held-out set  Can use EM to do this search
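A sketch of the held-out search described above, using a coarse grid rather than EM; the component probability functions and held-out format are assumptions, not anything from the deck:

import math
from itertools import product

# p_tri, p_bi, p_uni are assumed callables returning the fixed (already estimated)
# n-gram probabilities; `heldout` is a list of (x, y, z) trigrams from held-out data.
def heldout_logprob(lams, heldout, p_tri, p_bi, p_uni):
    l3, l2, l1 = lams
    total = 0.0
    for x, y, z in heldout:
        p = l3 * p_tri(x, y, z) + l2 * p_bi(y, z) + l1 * p_uni(z)
        total += math.log(max(p, 1e-12))   # guard against zeros from unsmoothed components
    return total

def best_lambdas(heldout, p_tri, p_bi, p_uni, step=0.1):
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    candidates = [(l3, l2, round(1 - l3 - l2, 10))
                  for l3, l2 in product(grid, grid) if l3 + l2 <= 1]
    # Pick the lambdas that give the largest held-out probability.
    return max(candidates, key=lambda lams: heldout_logprob(lams, heldout, p_tri, p_bi, p_uni))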

45 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)

46 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)

47 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)  P(glasses|reading) > 0  High unigram frequency of Francisco means backoff gives P(Francisco|reading) > P(glasses|reading)

48 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)  P(glasses|reading) > 0  High unigram frequency of Francisco means backoff gives P(Francisco|reading) > P(glasses|reading)  However, Francisco appears in few contexts, glasses many

49 Kneser-Ney Smoothing  Most commonly used modern smoothing technique  Intuition: improving backoff  I can’t see without my reading……  Compare P(Francisco|reading) vs P(glasses|reading)  P(Francisco|reading) backs off to P(Francisco)  P(glasses|reading) > 0  High unigram frequency of Francisco means backoff gives P(Francisco|reading) > P(glasses|reading)  However, Francisco appears in few contexts, glasses many  Interpolate based on # of contexts  Words seen in more contexts, more likely to appear in others

50 Kneser-Ney Smoothing: bigrams 50

51 Kneser-Ney Smoothing: bigrams 51

52 Kneser-Ney Smoothing: bigrams 52
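(Added for reference; the bigram Kneser-Ney equations on slides 50-52 are images. The standard interpolated form, with d an absolute discount, is:)

P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1} w_i) - d,\, 0\big)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{continuation}}(w_i)

P_{\text{continuation}}(w_i) = \frac{\big|\{ w' : C(w' w_i) > 0 \}\big|}{\big|\{ (w', w'') : C(w' w'') > 0 \}\big|}, \qquad
\lambda(w_{i-1}) = \frac{d}{C(w_{i-1})}\, \big|\{ w : C(w_{i-1} w) > 0 \}\big|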

53 OOV words: the <UNK> word  Out Of Vocabulary = OOV words

54 OOV words: the <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these

55 OOV words: the <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these  Because GT assumes we know the number of unseen events  Instead: create an unknown word token <UNK>

56 OOV words: the <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these  Because GT assumes we know the number of unseen events  Instead: create an unknown word token <UNK>  Training of <UNK> probabilities  Create a fixed lexicon L of size V  At text normalization phase, any training word not in L changed to <UNK>  Now we train its probabilities like a normal word

57 OOV words: the <UNK> word  Out Of Vocabulary = OOV words  We don’t use GT smoothing for these  Because GT assumes we know the number of unseen events  Instead: create an unknown word token <UNK>  Training of <UNK> probabilities  Create a fixed lexicon L of size V  At text normalization phase, any training word not in L changed to <UNK>  Now we train its probabilities like a normal word  At decoding time  If text input: Use <UNK> probabilities for any word not in training  Plus an additional penalty! <UNK> predicts the class of unknown words; then we need to pick a member
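A minimal sketch of the <UNK> recipe above; picking the V most frequent training words as the lexicon L is an assumption here, one common choice rather than anything stated on the slide:

from collections import Counter

# Fix a lexicon L, rewrite everything else as <UNK>, then train as usual.
def build_lexicon(train_tokens, V):
    return {w for w, _ in Counter(train_tokens).most_common(V)}

def apply_unk(tokens, lexicon):
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the cat sat on the mat the dog sat in the hat".split()
L = build_lexicon(train, V=5)
print(apply_unk(train, L))
# At decoding time, apply_unk(test_tokens, L) maps unseen test words to <UNK> as well.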

58 Class-Based Language Models  Variant of n-gram models using classes or clusters

59 Class-Based Language Models  Variant of n-gram models using classes or clusters  Motivation: Sparseness  Flight app.: P(ORD|to), P(JFK|to), …, P(airport_name|to)  Relate probability of n-gram to word classes & class n-gram

60 Class-Based Language Models  Variant of n-gram models using classes or clusters  Motivation: Sparseness  Flight app.: P(ORD|to), P(JFK|to), …, P(airport_name|to)  Relate probability of n-gram to word classes & class n-gram  IBM clustering: assume each word in single class  P(w_i|w_{i-1}) ≈ P(c_i|c_{i-1}) × P(w_i|c_i)  Learn by MLE from data  Where do classes come from?

61 Class-Based Language Models  Variant of n-gram models using classes or clusters  Motivation: Sparseness  Flight app.: P(ORD|to), P(JFK|to), …, P(airport_name|to)  Relate probability of n-gram to word classes & class n-gram  IBM clustering: assume each word in single class  P(w_i|w_{i-1}) ≈ P(c_i|c_{i-1}) × P(w_i|c_i)  Learn by MLE from data  Where do classes come from?  Hand-designed for application (e.g. ATIS)  Automatically induced clusters from corpus
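A toy sketch of the IBM class-bigram factorization above, with a hand-built class map and made-up probabilities standing in for MLE estimates from a real corpus:

# P(w_i | w_{i-1}) is approximated by P(c_i | c_{i-1}) * P(w_i | c_i).
word2class = {"to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT", "Boston": "CITY"}

p_class_bigram = {("PREP", "AIRPORT"): 0.3, ("PREP", "CITY"): 0.5}      # P(c_i | c_{i-1})
p_word_given_class = {("ORD", "AIRPORT"): 0.4, ("JFK", "AIRPORT"): 0.6,
                      ("Boston", "CITY"): 1.0}                          # P(w_i | c_i)

def p_class_based(w, w_prev):
    c, c_prev = word2class[w], word2class[w_prev]
    return p_class_bigram.get((c_prev, c), 0.0) * p_word_given_class.get((w, c), 0.0)

print(p_class_based("JFK", "to"))   # 0.3 * 0.6 = 0.18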

63 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data

64 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data  Intuition: Much of language is pretty general  Can build from ‘general’ LM + in-domain data

65 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data  Intuition: Much of language is pretty general  Can build from ‘general’ LM + in-domain data  Approach: LM adaptation  Train on large domain independent corpus  Adapt with small in-domain data set  What large corpus?

66 LM Adaptation  Challenge: Need LM for new domain  Have little in-domain data  Intuition: Much of language is pretty general  Can build from ‘general’ LM + in-domain data  Approach: LM adaptation  Train on large domain independent corpus  Adapt with small in-domain data set  What large corpus?  Web counts! e.g. Google n-grams

67 Incorporating Longer Distance Context  Why use longer context?

68 Incorporating Longer Distance Context  Why use longer context?  N-grams are approximation  Model size  Sparseness

69 Incorporating Longer Distance Context  Why use longer context?  N-grams are approximation  Model size  Sparseness  What sorts of information in longer context?

70 Incorporating Longer Distance Context  Why use longer context?  N-grams are approximation  Model size  Sparseness  What sorts of information in longer context?  Priming  Topic  Sentence type  Dialogue act  Syntax

71 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better

72 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better  Cache n-gram:  Intuition: Priming: word used previously, more likely  Incrementally create ‘cache’ unigram model on test corpus  Mix with main n-gram LM

73 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better  Cache n-gram:  Intuition: Priming: word used previously, more likely  Incrementally create ‘cache’ unigram model on test corpus  Mix with main n-gram LM  Topic models:  Intuition: Text is about some topic, on-topic words likely  P(w|h) ≈ Σ_t P(w|t) P(t|h)

74 Long Distance LMs  Bigger n!  284M words: <= 6-grams improve; 7-20 no better  Cache n-gram:  Intuition: Priming: word used previously, more likely  Incrementally create ‘cache’ unigram model on test corpus  Mix with main n-gram LM  Topic models:  Intuition: Text is about some topic, on-topic words likely  P(w|h) ≈ Σ_t P(w|t) P(t|h)  Non-consecutive n-grams:  skip n-grams, triggers, variable-length n-grams
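A minimal sketch of the cache n-gram mixing idea above; main_lm is an assumed callable returning P(word | context), not an API from the deck:

from collections import Counter

# A unigram "cache" built incrementally over the words decoded so far,
# interpolated with the main n-gram model with weight lam.
class CacheLM:
    def __init__(self, main_lm, lam=0.1):
        self.main_lm = main_lm
        self.lam = lam
        self.cache = Counter()

    def prob(self, word, context):
        total = sum(self.cache.values())
        p_cache = self.cache[word] / total if total else 0.0
        return (1 - self.lam) * self.main_lm(word, context) + self.lam * p_cache

    def observe(self, word):
        self.cache[word] += 1   # update the cache as the test text is processed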

75 Language Models  N-gram models:  Finite approximation of infinite context history  Issues: Zeroes and other sparseness  Strategies: Smoothing  Add-one, add-δ, Good-Turing, etc.  Use partial n-grams: interpolation, backoff  Refinements  Class, cache, topic, trigger LMs

