Download presentation

Presentation is loading. Please wait.

Published byDontae Culpepper Modified about 1 year ago

1
Language Modeling 1

2
Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation Absolute Discounting Kneser-Ney 2

3
Language Model Evaluation Metrics 3

4
Applications 4

5
Entropy and perplexity 5

6
6

7
7

8
8

9
Language model perplexity Recipe: Train a language model on training data Get negative logprobs of test data, compute average Exponentiate! Perplexity correlates rather well with: Speech recognition error rates MT quality metrics LM Perplexities for word-based models are normally between say 50 and 1000 Need to drop perplexity by a significant fraction (not absolute amount) to make a visible impact 9

10
Parameter estimation What is it? 10

11
Parameter estimation Model form is fixed (coin unigrams, word bigrams, …) We have observations H H H T T H T H H Want to find the parameters Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data c(H) = 6; c(T) = 3 P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3 MLE picks parameters best for training data… …but these don’t generalize well to test data – zeros! 11

12
Parameter estimation Model form is fixed (coin unigrams, word bigrams, …) We have observations H H H T T H T H H Want to find the parameters Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data c(H) = 6; c(T) = 3 P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3 MLE picks parameters best for training data… …but these don’t generalize well to test data – zeros! 12

13
Parameter estimation Model form is fixed (coin unigrams, word bigrams, …) We have observations H H H T T H T H H Want to find the parameters Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data c(H) = 6; c(T) = 3 P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3 MLE picks parameters best for training data… …but these don’t generalize well to test data – zeros! 13

14
Smoothing 14

15
Smoothing techniques Laplace Good-Turing Backoff Mixtures Interpolation Kneser-Ney 15

16
Laplace 16

17
Good-Turing Smoothing New idea: Use counts of things you have seen to estimate those you haven’t

18
Good-Turing Josh Goodman Intuition Imagine you are fishing There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? Slide adapted from Josh Goodman, Dan Jurafsky

19
Good-Turing Josh Goodman Intuition Imagine you are fishing There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18 Assuming so, how likely is it that next species is trout? Slide adapted from Josh Goodman, Dan Jurafsky

20
Good-Turing Josh Goodman Intuition Imagine you are fishing There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18 Assuming so, how likely is it that next species is trout? Must be less than 1/18 Slide adapted from Josh Goodman, Dan Jurafsky

21
Some more hypotheticals SpeciesPuget SoundLake WashingtonGreenlake Salmon8120 Trout311 Cod110 Rockfish100 Snapper100 Skate100 Bass0114 TOTAL15 21 How likely is it to find a new fish in each of these places?

22
Good-Turing Smoothing New idea: Use counts of things you have seen to estimate those you haven’t

23
Good-Turing Smoothing New idea: Use counts of things you have seen to estimate those you haven’t Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams

24
Good-Turing Smoothing

25

26
GT Fish Example

27
Enough about the fish… how does this relate to language? Name some linguistic situations where the number of new words would differ 27

28
Enough about the fish… how does this relate to language? Name some linguistic situations where the number of new words would differ Different languages: Chinese has almost no morphology Turkish has a lot of morphology Lots of new words in Turkish! 28

29
Enough about the fish… how does this relate to language? Name some linguistic situations where the number of new words would differ Different languages: Chinese has almost no morphology Turkish has a lot of morphology Lots of new words in Turkish! Different domains: Airplane maintenance manuals: controlled vocabulary Random web posts: uncontrolled vocab 29

30
Bigram Frequencies of Frequencies and GT Re-estimates

31
Good-Turing Smoothing

32
Additional Issues in Good-Turing General approach: Estimate of c * for N c depends on N c+1 What if N c+1 = 0? More zero count problems Not uncommon: e.g. fish example, no 4s

33
Modifications Simple Good-Turing Compute N c bins, then smooth N c to replace zeroes Fit linear regression in log space log(N c ) = a +b log(c) What about large c’s? Should be reliable Assume c * =c if c is large, e.g c > k (Katz: k =5) Typically combined with other approaches

34
Backoff and Interpolation Another really useful source of knowledge If we are estimating: trigram p(z|x,y) but count(xyz) is zero Use info from:

35
Backoff and Interpolation Another really useful source of knowledge If we are estimating: trigram p(z|x,y) but count(xyz) is zero Use info from: Bigram p(z|y) Or even: Unigram p(z)

36
Backoff and Interpolation Another really useful source of knowledge If we are estimating: trigram p(z|x,y) but count(xyz) is zero Use info from: Bigram p(z|y) Or even: Unigram p(z) How to combine this trigram, bigram, unigram info in a valid fashion?

37
Backoff vs. Interpolation Backoff: use trigram if you have it, otherwise bigram, otherwise unigram

38
Backoff vs. Interpolation Backoff: use trigram if you have it, otherwise bigram, otherwise unigram Interpolation: always mix all three

39
Backoff 39

40
Backoff 40

41
Backoff 41

42
Mixtures 42

43
Interpolation

44
How to Set the Lambdas? Use a held-out, or development, corpus Choose lambdas which maximize the probability of some held-out data I.e. fix the N-gram probabilities Then search for lambda values That when plugged into previous equation Give largest probability for held-out set Can use EM to do this search

45
Kneser-Ney Smoothing Most commonly used modern smoothing technique Intuition: improving backoff I can’t see without my reading…… Compare P(Francisco|reading) vs P(glasses|reading)

46
Kneser-Ney Smoothing Most commonly used modern smoothing technique Intuition: improving backoff I can’t see without my reading…… Compare P(Francisco|reading) vs P(glasses|reading) P(Francisco|reading) backs off to P(Francisco)

47
Kneser-Ney Smoothing Most commonly used modern smoothing technique Intuition: improving backoff I can’t see without my reading…… Compare P(Francisco|reading) vs P(glasses|reading) P(Francisco|reading) backs off to P(Francisco) P(glasses|reading) > 0 High unigram frequency of Francisco > P(glasses|reading)

48
Kneser-Ney Smoothing Most commonly used modern smoothing technique Intuition: improving backoff I can’t see without my reading…… Compare P(Francisco|reading) vs P(glasses|reading) P(Francisco|reading) backs off to P(Francisco) P(glasses|reading) > 0 High unigram frequency of Francisco > P(glasses|reading) However, Francisco appears in few contexts, glasses many

49
Kneser-Ney Smoothing Most commonly used modern smoothing technique Intuition: improving backoff I can’t see without my reading…… Compare P(Francisco|reading) vs P(glasses|reading) P(Francisco|reading) backs off to P(Francisco) P(glasses|reading) > 0 High unigram frequency of Francisco > P(glasses|reading) However, Francisco appears in few contexts, glasses many Interpolate based on # of contexts Words seen in more contexts, more likely to appear in others

50
Kneser-Ney Smoothing: bigrams 50

51
Kneser-Ney Smoothing: bigrams 51

52
Kneser-Ney Smoothing: bigrams 52

53
OOV words: word Out Of Vocabulary = OOV words

54
OOV words: word Out Of Vocabulary = OOV words We don’t use GT smoothing for these

55
OOV words: word Out Of Vocabulary = OOV words We don’t use GT smoothing for these Because GT assumes we know the number of unseen events Instead: create an unknown word token

56
OOV words: word Out Of Vocabulary = OOV words We don’t use GT smoothing for these Because GT assumes we know the number of unseen events Instead: create an unknown word token Training of probabilities Create a fixed lexicon L of size V At text normalization phase, any training word not in L changed to Now we train its probabilities like a normal word

57
OOV words: word Out Of Vocabulary = OOV words We don’t use GT smoothing for these Because GT assumes we know the number of unseen events Instead: create an unknown word token Training of probabilities Create a fixed lexicon L of size V At text normalization phase, any training word not in L changed to Now we train its probabilities like a normal word At decoding time If text input: Use UNK probabilities for any word not in training Plus an additional penalty! UNK predicts the class of unknown words; then we need to pick a member

58
Class-Based Language Models Variant of n-gram models using classes or clusters

59
Class-Based Language Models Variant of n-gram models using classes or clusters Motivation: Sparseness Flight app.: P(ORD|to),P(JFK|to),.. P(airport_name|to) Relate probability of n-gram to word classes & class ngram

60
Class-Based Language Models Variant of n-gram models using classes or clusters Motivation: Sparseness Flight app.: P(ORD|to),P(JFK|to),.. P(airport_name|to) Relate probability of n-gram to word classes & class ngram IBM clustering: assume each word in single class P(w i |w i-1 )~P(c i |c i-1 )xP(w i |c i ) Learn by MLE from data Where do classes come from?

61
Class-Based Language Models Variant of n-gram models using classes or clusters Motivation: Sparseness Flight app.: P(ORD|to),P(JFK|to),.. P(airport_name|to) Relate probability of n-gram to word classes & class ngram IBM clustering: assume each word in single class P(w i |w i-1 )~P(c i |c i-1 )xP(w i |c i ) Learn by MLE from data Where do classes come from? Hand-designed for application (e.g. ATIS) Automatically induced clusters from corpus

62
Class-Based Language Models Variant of n-gram models using classes or clusters Motivation: Sparseness Flight app.: P(ORD|to),P(JFK|to),.. P(airport_name|to) Relate probability of n-gram to word classes & class ngram IBM clustering: assume each word in single class P(w i |w i-1 )~P(c i |c i-1 )xP(w i |c i ) Learn by MLE from data Where do classes come from? Hand-designed for application (e.g. ATIS) Automatically induced clusters from corpus

63
LM Adaptation Challenge: Need LM for new domain Have little in-domain data

64
LM Adaptation Challenge: Need LM for new domain Have little in-domain data Intuition: Much of language is pretty general Can build from ‘general’ LM + in-domain data

65
LM Adaptation Challenge: Need LM for new domain Have little in-domain data Intuition: Much of language is pretty general Can build from ‘general’ LM + in-domain data Approach: LM adaptation Train on large domain independent corpus Adapt with small in-domain data set What large corpus?

66
LM Adaptation Challenge: Need LM for new domain Have little in-domain data Intuition: Much of language is pretty general Can build from ‘general’ LM + in-domain data Approach: LM adaptation Train on large domain independent corpus Adapt with small in-domain data set What large corpus? Web counts! e.g. Google n-grams

67
Incorporating Longer Distance Context Why use longer context?

68
Incorporating Longer Distance Context Why use longer context? N-grams are approximation Model size Sparseness

69
Incorporating Longer Distance Context Why use longer context? N-grams are approximation Model size Sparseness What sorts of information in longer context?

70
Incorporating Longer Distance Context Why use longer context? N-grams are approximation Model size Sparseness What sorts of information in longer context? Priming Topic Sentence type Dialogue act Syntax

71
Long Distance LMs Bigger n! 284M words: <= 6-grams improve; 7-20 no better

72
Long Distance LMs Bigger n! 284M words: <= 6-grams improve; 7-20 no better Cache n-gram: Intuition: Priming: word used previously, more likely Incrementally create ‘cache’ unigram model on test corpus Mix with main n-gram LM

73
Long Distance LMs Bigger n! 284M words: <= 6-grams improve; 7-20 no better Cache n-gram: Intuition: Priming: word used previously, more likely Incrementally create ‘cache’ unigram model on test corpus Mix with main n-gram LM Topic models: Intuition: Text is about some topic, on-topic words likely P(w|h) ~ Σt P(w|t)P(t|h)

74
Long Distance LMs Bigger n! 284M words: <= 6-grams improve; 7-20 no better Cache n-gram: Intuition: Priming: word used previously, more likely Incrementally create ‘cache’ unigram model on test corpus Mix with main n-gram LM Topic models: Intuition: Text is about some topic, on-topic words likely P(w|h) ~ Σt P(w|t)P(t|h) Non-consecutive n-grams: skip n-grams, triggers, variable lengths n-grams

75
Language Models N-gram models: Finite approximation of infinite context history Issues: Zeroes and other sparseness Strategies: Smoothing Add-one, add-δ, Good-Turing, etc Use partial n-grams: interpolation, backoff Refinements Class, cache, topic, trigger LMs

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google