# Language Models & Smoothing: Shallow Processing Techniques for NLP (Ling570, October 19, 2011)

Announcements
- Career exploration talk: Bill McNeill, Thursday (10/20), 2:30-3:30pm, Thomson 135 & online (Treehouse URL)
- Treehouse meeting: Friday 10/21, 11-12; thesis topic brainstorming
- GP Meeting: Friday 10/21, 3:30-5pm, PCAR 291 & online (…/clmagrad)

Roadmap
- Ngram language models
- Constructing language models
- Generative language models
- Evaluation: training and testing, perplexity
- Smoothing: Laplace smoothing, Good-Turing smoothing, interpolation & backoff

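Ngram Language Models
- Independence assumptions moderate data needs
- Approximate the probability of a word given all prior words by assuming a finite history
- Unigram: probability of word in isolation
- Bigram: probability of word given 1 previous word
- Trigram: probability of word given 2 previous words
- N-gram approximation: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$
- Bigram sequence: $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$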

Berkeley Restaurant Project Sentences
- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what im looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- im looking for a good place to eat breakfast
- when is caffe venezia open during the day

Bigram Counts: out of 9222 sentences. E.g., "i want" occurred 827 times.

Bigram Probabilities Divide bigram counts by prefix unigram counts to get probabilities.
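The division described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name `bigram_mle`, the `<s>`/`</s>` boundary markers, and the three-sentence toy corpus (drawn from the Berkeley Restaurant sentences) are assumptions made for the example.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed sentence-boundary markers


def bigram_mle(sentences):
    """Return {(w1, w2): c(w1 w2) / c(w1)} for whitespace-tokenized sentences."""
    prefix_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = [BOS] + sent.split() + [EOS]
        # Every token except the last one starts exactly one bigram.
        prefix_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {(w1, w2): c / prefix_counts[w1] for (w1, w2), c in bigram_counts.items()}


toy_corpus = [
    "tell me about chez panisse",
    "can you tell me about any good cantonese restaurants close by",
    "im looking for a good place to eat breakfast",
]
probs = bigram_mle(toy_corpus)
print(probs[("good", "place")])  # "good" is followed by "place" in 1 of its 2 occurrences
```

Bigrams that never occur in the training data are simply absent from the table, which is the zero-probability problem that smoothing addresses.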

Bigram Estimates of Sentence Probabilities: P(`<s>` i want english food `</s>`) = P(i|`<s>`) * P(want|i) * P(english|want) * P(food|english) * P(`</s>`|food) = .000031
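As a rough check of this product, the sketch below multiplies the five bigram probabilities in log space. Only P(i|`<s>`) = .25 and P(english|want) = .0011 appear on these slides; the other three values (.33, .5, .68) are assumptions taken to match the standard textbook version of this example, and the names `bigram_prob` and `sentence_prob` are illustrative.

```python
import math

bigram_prob = {
    ("<s>", "i"): 0.25,            # given on the slides
    ("i", "want"): 0.33,           # assumed value, not on the slides
    ("want", "english"): 0.0011,   # given on the slides
    ("english", "food"): 0.5,      # assumed value, not on the slides
    ("food", "</s>"): 0.68,        # assumed value, not on the slides
}


def sentence_prob(words, probs):
    """Bigram probability of a sentence, summing logs to avoid underflow."""
    tokens = ["<s>"] + words + ["</s>"]
    log_p = sum(math.log(probs[(prev, cur)]) for prev, cur in zip(tokens, tokens[1:]))
    return math.exp(log_p)


p = sentence_prob("i want english food".split(), bigram_prob)
print(f"{p:.6f}")  # approximately 0.000031 under the assumed values
```

Summing logs instead of multiplying raw probabilities matters little for a five-word sentence, but it keeps longer sentences from underflowing to zero.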

Kinds of Knowledge
- P(english|want) = .0011
- P(chinese|want) = .0065
- P(to|want) = .66
- P(eat|to) = .28
- P(food|to) = 0
- P(want|spend) = 0
- P(i|`<s>`) = .25
- What types of knowledge are captured by ngram models? World knowledge, syntax, discourse

Probabilistic Language Generation
- Coin-flipping models
- A sentence is generated by a randomized algorithm
- The generator can be in one of several states
- Flip coins to choose the next state
- Flip other coins to decide which letter or word to output
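A bigram model is one simple instance of such a generator: the state is the previous word, and the "coin flip" is a weighted random draw over its successors. The sketch below assumes a hand-written toy model (`toy_model`) and a hypothetical `generate` function; neither comes from the slides.

```python
import random


def generate(model, rng, max_len=20):
    """Sample words from a bigram model until </s> is drawn or max_len is reached."""
    state = "<s>"
    words = []
    while len(words) < max_len:
        successors = model[state]
        # Weighted draw: the coin flip that picks the next state and output word.
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        state = nxt
    return words


# Toy transition table: each row maps a word to P(next word | word).
toy_model = {
    "<s>": {"i": 0.6, "tell": 0.4},
    "i": {"want": 1.0},
    "want": {"thai": 0.5, "chinese": 0.5},
    "thai": {"food": 1.0},
    "chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
    "tell": {"me": 1.0},
    "me": {"</s>": 1.0},
}
sentence = generate(toy_model, random.Random(0))
print(" ".join(sentence))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.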

Generated Language: Effects of N
1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Word Models: Effects of N
1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

Shakespeare

The Wall Street Journal is Not Shakespeare

Evaluation

Evaluation - General
- Evaluation crucial for NLP systems
- Required for most publishable results
- Should be integrated early
- Many factors: data, metrics, prior results, …

Evaluation Guidelines
- Evaluate your system
- Use standard metrics
- Use (standard) training/dev/test sets
- Describing experiments (intrinsic vs extrinsic):
  - Clearly lay out experimental setting
  - Compare to baseline and previous results
  - Perform error analysis
  - Show utility in real application (ideally)

Data Organization
- Training:
  - Training data: used to learn model parameters
  - Held-out data: used to tune additional parameters
- Development (dev) set: used to evaluate system during development; avoid overfitting
- Test data: used for final, blind evaluation
- Typical division of data: 80/10/10
  - Tradeoffs
  - Cross-validation

Evaluating LMs
- Extrinsic evaluation (aka in vivo): embed alternate models in a system and see which improves the overall application (MT, IR, …)
- Intrinsic evaluation: metric applied directly to the model, independent of the larger application; perplexity
- Why not just extrinsic?

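Perplexity
- Intuition: a better model will have a tighter fit to the test data, so it will yield higher probability on the test data
- Formally, $PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$
- For bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
- Inversely related to probability of sequence: higher probability means lower perplexity
- Can be viewed as average branching factor of model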

Perplexity Example
- Alphabet: 0,1,…,9; equiprobable: P(X) = 1/10
- $PP(W) = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = 10$
- If probability of 0 is higher, PP(W) will be lower
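The digit example can be checked numerically. The sketch below computes perplexity as 2 raised to the entropy of a distribution, which assumes the test data is drawn from the same distribution as the model; the function name `perplexity` and the particular skewed distribution (0.91 for digit 0, 0.01 for each other digit) are assumptions chosen for illustration.

```python
import math


def perplexity(dist):
    """Perplexity of a distribution, computed as 2 ** entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return 2 ** entropy


uniform = {digit: 1 / 10 for digit in range(10)}
# Assumed skewed distribution: digit 0 is far more likely than the rest.
skewed = {0: 0.91, **{digit: 0.01 for digit in range(1, 10)}}

print(perplexity(uniform))  # the uniform case recovers the alphabet size
print(perplexity(skewed))   # a peaked distribution gives a smaller value
```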

Thinking about Perplexity
- Given some vocabulary V with a uniform distribution, i.e. P(w) = 1/|V|
- Under a unigram LM, the perplexity is $PP(W) = \left(\left(\tfrac{1}{|V|}\right)^{N}\right)^{-1/N} = |V|$
- Perplexity is the effective branching factor of the language

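Perplexity and Entropy
- Given that $H(L,P) \approx -\frac{1}{N}\log_2 P(W)$
- Consider the perplexity equation: $PP(W) = P(W)^{-1/N} = 2^{\log_2 P(W)^{-1/N}} = 2^{-\frac{1}{N}\log_2 P(W)} = 2^{H(L,P)}$
- where H is the entropy of the language L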

Entropy
- Information theoretic measure
- Measures information in grammar
- Conceptually, lower bound on # bits to encode
- Entropy H(X), where X is a random variable and p is its probability function: $H(X) = -\sum_{x} p(x)\log_2 p(x)$
- E.g. 8 things: number as code => 3 bits/trans
- Alt. short code if high prob; longer if lower
- Can reduce the average number of bits

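Computing Entropy
- Picking horses (Cover and Thomas)
- Send message: identify horse, 1 of 8
- If all horses equally likely, p(i) = 1/8: $H(X) = -\sum_{i=1}^{8} \tfrac{1}{8}\log_2 \tfrac{1}{8} = 3$ bits
- Some horses more likely: 1: ½; 2: ¼; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64: $H(X) = -\sum_{i=1}^{8} p(i)\log_2 p(i) = 2$ bits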
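Both horse-race cases can be worked through directly from the entropy definition. The sketch below is illustrative only; the function name `entropy` is an assumption, and the probabilities are the ones listed on the slide.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform_horses = [1 / 8] * 8
skewed_horses = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

print(entropy(uniform_horses))  # 3 bits: a fixed 3-bit code per horse is optimal
print(entropy(skewed_horses))   # 2 bits: shorter codes for likelier horses pay off
```

The skewed case sums to 1/2·1 + 1/4·2 + 1/8·3 + 1/16·4 + 4·(1/64·6) = 2 bits, one bit fewer per message than the uniform case.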

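Entropy of a Sequence
- Basic sequence: $\frac{1}{n} H(w_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)$
- Entropy of language (infinite lengths): $H(L) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log_2 p(W_1^n)$
- Assume stationary & ergodic: $H(L) = \lim_{n \to \infty} -\frac{1}{n} \log_2 p(w_1^n)$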

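Computing P(s): s is a sentence
- Let $s = w_1 w_2 \dots w_n$
- Assume a bigram model
- $P(s) = P(w_1 w_2 \dots w_n) = P(\mathrm{BOS}\ w_1 w_2 \dots w_n\ \mathrm{EOS}) \approx P(\mathrm{BOS}) \cdot P(w_1 \mid \mathrm{BOS}) \cdot P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}) \cdot P(\mathrm{EOS} \mid w_n)$
- Out-of-vocabulary words (OOV): if an n-gram contains an OOV word, remove the n-gram from the computation and increment oov_count
- N = sent_leng + 1 – oov_count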

Computing P(s): s is a sentence
- Let $s = w_1 w_2 \dots w_n$
- Assume a trigram model
- $P(s) = P(w_1 w_2 \dots w_n) = P(\mathrm{BOS}\ w_1 w_2 \dots w_n\ \mathrm{EOS}) \approx P(w_1 \mid \mathrm{BOS}) \cdot P(w_2 \mid \mathrm{BOS}\ w_1) \cdots P(w_n \mid w_{n-2} w_{n-1}) \cdot P(\mathrm{EOS} \mid w_{n-1} w_n)$
- Out-of-vocabulary words (OOV): if an n-gram contains an OOV word, remove the n-gram from the computation and increment oov_count
- N = sent_leng + 1 – oov_count

Computing Perplexity
- $PP(W) = P(W)^{-1/N}$
- where W is a set of m sentences: $s_1, s_2, \dots, s_m$
- $\log P(W) = \sum_{i=1}^{m} \log P(s_i)$
- N = word_count + sent_count – oov_count
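The pieces above (per-sentence log probabilities, OOV removal, and the normalizer N) can be combined as in the following sketch for a bigram model. It is one possible reading of the slides, not the assignment's reference solution: the function name `corpus_perplexity` and the toy model are hypothetical, bigrams with an OOV word in either position are skipped, and every in-vocabulary bigram is assumed to have a nonzero (for example, smoothed) probability.

```python
import math

BOS, EOS = "<s>", "</s>"  # assumed boundary markers


def corpus_perplexity(sentences, bigram_prob, vocab):
    """Perplexity of tokenized sentences with N = word_count + sent_count - oov_count."""
    known = set(vocab) | {BOS, EOS}
    log_prob = 0.0
    word_count = 0
    oov_count = 0
    for words in sentences:
        word_count += len(words)
        oov_count += sum(w not in known for w in words)
        tokens = [BOS] + list(words) + [EOS]
        for prev, cur in zip(tokens, tokens[1:]):
            # Remove any bigram that contains an OOV word from the computation.
            if prev in known and cur in known:
                log_prob += math.log2(bigram_prob[(prev, cur)])
    n = word_count + len(sentences) - oov_count
    return 2 ** (-log_prob / n)


# Toy model: every in-vocabulary bigram has probability 0.5.
toy_vocab = {"a", "b"}
toy_probs = {(x, y): 0.5 for x in (BOS, "a", "b") for y in ("a", "b", EOS)}

pp_clean = corpus_perplexity([["a", "b"]], toy_probs, toy_vocab)
pp_oov = corpus_perplexity([["a", "b"], ["a", "zzz"]], toy_probs, toy_vocab)
print(pp_clean, pp_oov)
```

In the OOV case, four bigram terms survive while N is 5, so the result depends on the normalizer convention; the sketch follows the formula for N given on the slide.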

Perplexity Model Comparison Compare models with different history

Homework #4

Building Language Models
- Step 1: Count ngrams
- Step 2: Build model – compute probabilities
  - MLE
  - Smoothed: Laplace, GT
- Step 3: Compute perplexity
- Steps 2 & 3 depend on model/smoothing choices
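For the Laplace option in Step 2, the usual add-one estimate for bigrams is (c(w1 w2) + 1) / (c(w1) + V), where V is the vocabulary size. The formula is not spelled out on this slide, so the sketch below should be read as an illustration; the function name and the toy counts are assumptions.

```python
from collections import Counter


def laplace_bigram_prob(w1, w2, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed P(w2 | w1) = (c(w1 w2) + 1) / (c(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)


# Toy counts (assumed): "i" occurs 10 times, "i want" 4 times, vocabulary of 5 words.
unigram_counts = Counter({"i": 10})
bigram_counts = Counter({("i", "want"): 4})
vocab_size = 5

p_seen = laplace_bigram_prob("i", "want", bigram_counts, unigram_counts, vocab_size)
p_unseen = laplace_bigram_prob("i", "food", bigram_counts, unigram_counts, vocab_size)
print(p_seen, p_unseen)  # the unseen bigram now gets a small nonzero probability
```

Using `Counter` means unseen bigrams look up as zero counts instead of raising `KeyError`, which keeps the smoothing formula free of special cases.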

Q1: Counting N-grams
- Collect real counts from the training data: `ngram_count.* training_data ngram_count_file`
- Output ngrams and real counts c(w1), c(w1, w2), and c(w1, w2, w3)
- Given a sentence: John called Mary
- Insert BOS and EOS: `<s> John called Mary </s>`

Q1: Output
- Format: count key, e.g.
  - 875 a
  - …
  - 200 the book
  - …
  - 20 thank you very
- In chunks: unigrams, then bigrams, then trigrams
- Sort in decreasing order of count within chunk
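The counting and output steps for Q1 might look roughly like the sketch below. It is not the assignment's reference solution: the function names, the `<s>`/`</s>` marker strings, the tab separator between count and key, and the alphabetical tie-break among equal counts are all assumptions.

```python
from collections import Counter


def count_ngrams(sentences, max_n=3):
    """Count 1..max_n-grams over sentences padded with BOS/EOS markers."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][" ".join(tokens[i:i + n])] += 1
    return counts


def format_counts(counts):
    """One 'count<TAB>key' line per n-gram: unigram chunk first, then bigrams, trigrams."""
    lines = []
    for n in sorted(counts):
        # Decreasing count within the chunk; ties broken alphabetically (assumed).
        for ngram, c in sorted(counts[n].items(), key=lambda kv: (-kv[1], kv[0])):
            lines.append(f"{c}\t{ngram}")
    return lines


counts = count_ngrams(["John called Mary"])
lines = format_counts(counts)
print("\n".join(lines))
```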

Q2: Create Language Model
- `build_lm.* ngram_count_file lm_file`
- Store the logprob of ngrams and other parameters in the lm
- There are actually three language models: P(w3), P(w3|w2) and P(w3|w1,w2)
- The output file is in a modified ARPA format (see next slide)
- Lines for n-grams are sorted by n-gram counts

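Modified ARPA Format
- `\data\`
- `ngram 1: type = xx; token = yy`
- `ngram 2: type = xx; token = yy`
- `ngram 3: type = xx; token = yy`
- `\1-grams:` followed by lines of the form `count prob logprob w1`
- `\2-grams:` followed by lines of the form `count prob logprob w1 w2`
- `\3-grams:` followed by lines of the form `count prob logprob w1 w2 w3`
- xx is the type count; yy is the token count
- For 1-grams, prob is P(w)
- For 2-grams, prob is P(w2|w1) and count is C(w1 w2)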

Q3: Calculating Perplexity
- `pp.* lm_file n test_file outfile`
- Compute perplexity for n-gram history given model:
  - sum = 0; count = 0
  - for each s in test_file:
    - if n-gram of history n exists: compute $P(w_i \mid w_{i-n+1} \dots w_{i-1})$; sum += $\log_2 P(w_i \mid \dots)$; count++
  - total = -sum/count
  - pp(test_file) = $2^{total}$

Output format
- `Sent #1: Influential members of the House …`
- `1: log P(Influential | <s>) = -inf (unknown word)`
- `2: log P(members | Influential) = -inf (unseen ngrams)`
- `4: log P(the | members of) = -0.673243382588536`
- `1 sentence, 38 words, 9 OOVs`
- `logprob=-82.8860891791949 ppl=721.341645452964`
- `%%%%%%%%%%%%%%%%`
- `sent_num=50 word_num=1175 oov_num=190`
- `logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283`
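The corpus-level summary in this sample output can be reproduced from its own numbers, which is a useful sanity check on the formula N = word_count + sent_count – oov_count. The figures appear consistent with base-10 logarithms (an inference from the values, not something the slide states); the perplexity itself does not depend on the base as long as the same base is used for the log and the exponentiation.

```python
# Values copied from the sample output's summary lines.
sent_num, word_num, oov_num = 50, 1175, 190
logprob = -2854.78157013778

n = word_num + sent_num - oov_num   # normalizer N
ave_logprob = logprob / n           # average log probability per counted token
pp = 10 ** (-ave_logprob)           # assumes the logs are base 10

print(n, ave_logprob, pp)
```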

Q4: Compute Perplexity Compute perplexity for different n
