Download presentation

Presentation is loading. Please wait.

Published byGarret Wescott Modified over 4 years ago

1
Language Models & Smoothing Shallow Processing Techniques for NLP Ling570 October 19, 2011

2
Announcements Career exploration talk: Bill McNeill Thursday (10/20): 2:30-3:30pm Thomson 135 & Online (Treehouse URL) Treehouse meeting: Friday 10/21: 11-12 Thesis topic brainstorming GP Meeting: Friday 10/21: 3:30-5pm PCAR 291 & Online (…/clmagrad)

3
Roadmap Ngram language models Constructing language models Generative language models Evaluation: Training and Testing Perplexity Smoothing: Laplace smoothing Good-Turing smoothing Interpolation & backoff

4
Ngram Language Models Independence assumptions moderate data needs Approximate probability given all prior words Assume finite history Unigram: Probability of word in isolation Bigram: Probability of word given 1 previous Trigram: Probability of word given 2 previous N-gram approximation Bigram sequence

5
Berkeley Restaurant Project Sentences can you tell me about any good cantonese restaurants close by mid priced thai food is what im looking for tell me about chez panisse can you give me a listing of the kinds of food that are available im looking for a good place to eat breakfast when is caffe venezia open during the day

6
Bigram Counts Out of 9222 sentences Eg. I want occurred 827 times

7
Bigram Probabilities Divide bigram counts by prefix unigram counts to get probabilities.

8
Bigram Estimates of Sentence Probabilities P( I want english food ) = P(i| )* P(want|I)* P(english|want)* P(food|english)* P( |food) =.000031

9
Kinds of Knowledge P(english|want) =.0011 P(chinese|want) =.0065 P(to|want) =.66 P(eat | to) =.28 P(food | to) = 0 P(want | spend) = 0 P (i | ) =.25 What types of knowledge are captured by ngram models?

10
Kinds of Knowledge P(english|want) =.0011 P(chinese|want) =.0065 P(to|want) =.66 P(eat | to) =.28 P(food | to) = 0 P(want | spend) = 0 P (i | ) =.25 World knowledge What types of knowledge are captured by ngram models?

11
Kinds of Knowledge P(english|want) =.0011 P(chinese|want) =.0065 P(to|want) =.66 P(eat | to) =.28 P(food | to) = 0 P(want | spend) = 0 P (i | ) =.25 World knowledge Syntax What types of knowledge are captured by ngram models?

12
Kinds of Knowledge P(english|want) =.0011 P(chinese|want) =.0065 P(to|want) =.66 P(eat | to) =.28 P(food | to) = 0 P(want | spend) = 0 P (i | ) =.25 World knowledge Syntax Discourse What types of knowledge are captured by ngram models?

13
Probabilistic Language Generation Coin-flipping models A sentence is generated by a randomized algorithm The generator can be in one of several states Flip coins to choose the next state Flip other coins to decide which letter or word to output

14
Generated Language: Effects of N 1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

15
Generated Language: Effects of N 1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD 2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL

16
Generated Language: Effects of N 1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD 2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL 3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

17
Word Models: Effects of N 1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

18
Word Models: Effects of N 1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE 2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

19
Shakespeare

20
The Wall Street Journal is Not Shakespeare

21
Evaluation

22
Evaluation - General Evaluation crucial for NLP systems Required for most publishable results Should be integrated early Many factors:

23
Evaluation - General Evaluation crucial for NLP systems Required for most publishable results Should be integrated early Many factors: Data Metrics Prior results …..

24
Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic)

25
Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic) Clearly lay out experimental setting

26
Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic) Clearly lay out experimental setting Compare to baseline and previous results Perform error analysis

27
Evaluation Guidelines Evaluate your system Use standard metrics Use (standard) training/dev/test sets Describing experiments: (Intrinsic vs Extrinsic) Clearly lay out experimental setting Compare to baseline and previous results Perform error analysis Show utility in real application (ideally)

28
Data Organization Training: Training data: used to learn model parameters

29
Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters

30
Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters Development (Dev) set: Used to evaluate system during development Avoid overfitting

31
Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters Development (Dev) set: Used to evaluate system during development Avoid overfitting Test data: Used for final, blind evaluation

32
Data Organization Training: Training data: used to learn model parameters Held-out data: used to tune additional parameters Development (Dev) set: Used to evaluate system during development Avoid overfitting Test data: Used for final, blind evaluation Typical division of data: 80/10/10 Tradeoffs Cross-validation

33
Evaluting LMs Extrinsic evaluation (aka in vivo) Embed alternate models in system See which improves overall application MT, IR, …

34
Evaluting LMs Extrinsic evaluation (aka in vivo) Embed alternate models in system See which improves overall application MT, IR, … Intrinsic evaluation: Metric applied directly to model Independent of larger application Perplexity

35
Evaluting LMs Extrinsic evaluation (aka in vivo) Embed alternate models in system See which improves overall application MT, IR, … Intrinsic evaluation: Metric applied directly to model Independent of larger application Perplexity Why not just extrinsic?

36
Perplexity

37
Intuition: A better model will have tighter fit to test data Will yield higher probability on test data

38
Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally,

39
Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally,

40
Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally,

41
Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, For bigrams:

42
Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, For bigrams: Inversely related to probability of sequence Higher probability Lower perplexity

43
Perplexity Intuition: A better model will have tighter fit to test data Will yield higher probability on test data Formally, For bigrams: Inversely related to probability of sequence Higher probability Lower perplexity Can be viewed as average branching factor of model

44
Perplexity Example Alphabet: 0,1,…,9 Equiprobable

45
Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10

46
Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10 PP(W)=

47
Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10 PP(W)= If probability of 0 is higher, PP(W) will be

48
Perplexity Example Alphabet: 0,1,…,9; Equiprobable: P(X)=1/10 PP(W)= If probability of 0 is higher, PP(W) will be lower

49
Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V|

50
Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V| Under a unigram LM, the perplexity is PP(W) =

51
Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V| Under a unigram LM, the perplexity is PP(W) =

52
Thinking about Perplexity Given some vocabulary V with a uniform distribution I.e. P(w) = 1/|V| Under a unigram LM, the perplexity is PP(W) = Perplexity is effective branching factor of language

53
Perplexity and Entropy Given that Consider the perplexity equation: PP(W) = P(W) -1/N =

54
Perplexity and Entropy Given that Consider the perplexity equation: PP(W) = P(W) -1/N =

55
Perplexity and Entropy Given that Consider the perplexity equation: PP(W) = P(W) -1/N = =

56
Perplexity and Entropy Given that Consider the perplexity equation: PP(W) = P(W) -1/N = = = 2 H(L,P) Where H is the entropy of the language L

57
Entropy Information theoretic measure Measures information in grammar Conceptually, lower bound on # bits to encode

58
Entropy Information theoretic measure Measures information in grammar Conceptually, lower bound on # bits to encode Entropy: H(X): X is a random var, p: prob fn

59
Entropy Information theoretic measure Measures information in grammar Conceptually, lower bound on # bits to encode Entropy: H(X): X is a random var, p: prob fn E.g. 8 things: number as code => 3 bits/trans Alt. short code if high prob; longer if lower Can reduce

60
Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i)

61
Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i) = 1/8

62
Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i) = 1/8 Some horses more likely: 1: ½; 2: ¼; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64

63
Computing Entropy Picking horses (Cover and Thomas) Send message: identify horse - 1 of 8 If all horses equally likely, p(i) = 1/8 Some horses more likely: 1: ½; 2: ¼; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64

64
Entropy of a Sequence Basic sequence Entropy of language: infinite lengths Assume stationary & ergodic

65
Computing P(s): s is a sentence Let s = w 1 w 2 ….w n Assume a bigram model P(s) = P(w 1 w 2 …w n ) = P(BOS w 1 w 2 ….w n EOS)

66
Computing P(s): s is a sentence Let s = w 1 w 2 ….w n Assume a bigram model P(s) = P(w 1 w 2 …w n ) = P(BOS w 1 w 2 ….w n EOS) ~ P(BOS)*P(w 1 |BOS)*P(w 2 |w 1 )*…*P(w n |w n-1 )*P(EOS|w n ) Out-of-vocabulary words (OOV): If n-gram contains OOV word,

67
Computing P(s): s is a sentence Let s = w 1 w 2 ….w n Assume a bigram model P(s) = P(w 1 w 2 …w n ) = P(BOS w 1 w 2 ….w n EOS) ~ P(BOS)*P(w 1 |BOS)*P(w 2 |w 1 )*…*P(w n |w n-1 )*P(EOS|w n ) Out-of-vocabulary words (OOV): If n-gram contains OOV word, Remove n-gram from computation Increment oov_count N

68
Computing P(s): s is a sentence Let s = w 1 w 2 ….w n Assume a trigram model P(s) = P(w 1 w 2 …w n ) = P(BOS w 1 w 2 ….w n EOS) ~P(w 1 |BOS)*P(w 2 |w 1 BOS)*…*P(w n |w n-2 w n-1 )*P(EOS|w n-1 w n ) Out-of-vocabulary words (OOV): If n-gram contains OOV word, Remove n-gram from computation Increment oov_count N =sent_leng + 1 – oov_count

69
Computing Perplexity PP(W) = Where W is a set of m sentences: s 1,s 2,…,s m log P(W)

70
Computing Perplexity PP(W) = Where W is a set of m sentences: s 1,s 2,…,s m log P(W) =

71
Computing Perplexity PP(W) = Where W is a set of m sentences: s 1,s 2,…,s m log P(W) = N

72
Computing Perplexity PP(W) = Where W is a set of m sentences: s 1,s 2,…,s m log P(W) = N = word_count + sent_count – oov_count

73
Perplexity Model Comparison Compare models with different history

74
Homework #4

75
Building Language Models Step 1: Count ngrams Step 2: Build model – Compute probabilities MLE Smoothed: Laplace, GT Step 3: Compute perplexity Steps 2 & 3 depend on model/smoothing choices

76
Q1: Counting N-grams Collect real counts from the training data: ngram_count.* training_data ngram_count_file Output ngrams and real count c(w1), c(w1, w2), and c(w1, w2, w3). Given a sentence: John called Mary Insert BOS and EOS: John called Mary

77
Q1: Output Count key 875a … 200 the book … 20thank you very In chunks – unigrams, then bigrams, then trigrams Sort in decreasing order of count within chunk

78
Q2: Create Language Model build_lm.* ngram_count_file lm_file Store the logprob of ngrams and other parameters in the lm There are actually three language models: P(w3), P(w3|w2) and P(w3|w1,w2) The output file is in a modified ARPA format (see next slide) Lines for n-grams are sorted by n-gram counts

79
Modified ARPA Format \data\ ngram 1: type = xx; token = yy ngram 2: type = xx; token = yy ngram 3: type = xx; token = yy \1-grams: count prob logprob w1 \2-grams: count prob logprob w1 w2 \3-grams: count prob logprob w1 w2 w3 # xx: is type count # yy: is token count # prob is P(w) # prob is P(w2|w1) #count in C(w1w2)

80
Q3: Calculating Perplexity pp.* lm_file n test_file outfile Compute perplexity for n-gram history given model sum=0; count=0; for each s in test_file: if n-gram of history n exists Compute P(wi|…wi-n+1) sum += log_2 P(wi…) count ++ total = -sum/count pp(test_file) = 2 total

81
Output format Sent #1: Influential members of the House … 1: log P(Influential | ) = -inf(unknown word) 2: log P(members | Influential) = -inf (unseen ngrams) 4: log P(the | members of) = -0.673243382588536 1 sentence, 38 words, 9 OOVs logprob=-82.8860891791949 ppl=721.341645452964 %%%%%%%%%%%%%%%% sent_num=50 word_num=1175 oov_num=190 logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283

82
Q4: Compute Perplexity Compute perplexity for different n

Similar presentations

OK

Chapter6. Statistical Inference : n-gram Model over Sparse Data 2005. 1. 13 이 동 훈 Foundations of Statistic Natural Language Processing.

Chapter6. Statistical Inference : n-gram Model over Sparse Data 2005. 1. 13 이 동 훈 Foundations of Statistic Natural Language Processing.

© 2018 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google