Introduction to N-grams


1 Introduction to N-grams
[Many of the slides were originally created by Prof. Dan Jurafsky from Stanford.]

2 Who wrote this? “You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for. This was written by a machine: it is a random sentence generated from a trigram model trained on Jane Austen.

3 Assign a probability to a phrase
Speech Recognition P(I saw a van) >> P(eyes awe of an) Spell Correction The office is about fifteen minuets from my house P(about fifteen minutes from) > P(about fifteen minuets from) Machine Translation P(high winds tonite) > P(large winds tonite)

4 Predict the next word Please turn your homework ____ ….
Google search bar

5 Counts and probabilities
P(its water is so transparent that) P(its water is so transparent that the) P(the|its water is so transparent that)

6 Probabilistic Language Modeling
Compute the probability of a sentence or sequence of words: P(W) = P(w1w2w3w4w5…wn) Related task: probability of an upcoming word: P(w5|w1w2w3w4) A model that computes either of these: P(W) or P(wn|w1w2…wn-1) is called a language model.

7 The Chain Rule P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)

8 The Chain Rule in general: P(w1 w2 … wn) = ∏i P(wi | w1 … wi-1)

9 Unigram model Some automatically generated sentences from a unigram model: fifth an of futures the an incorporated a a the inflation most dollars quarter in is mass thrift did eighty said hard 'm july bullish that or limited the

10 Bigram model Condition on only the previous word:
texaco rose one in this issue is pursuing growth in a boiler house said mr. gurria mexico 's motion control proposal without permission from five hundred fifty five yen outside new car parking lot of the agreement reached this would be a record november

11 N-gram models We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of language because language has long-distance dependencies: “The computer which I had just put into the machine room on the fifth floor ___.” Predict the last word in the above sentence.

12 Bigram probabilities Use the Maximum Likelihood Estimate:
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

13 Corpus from Dr. Seuss A corpus is a collection of written texts.
Here is a mini-corpus of three sentences. <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> Each sentence starts with the special token <s> and ends with </s>.

14 Calculate bigram probabilities
<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>
P(I | <s>) = 2/3, P(am | I) = 2/3, P(Sam | am) = 1/2

15 Calculate bigram probabilities
<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>
P(Sam | <s>) = 1/3, P(do | I) = 1/3, P(</s> | Sam) = 1/2, P(</s> | am) = 1/2
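The probabilities above can be checked with a short sketch that counts unigrams and bigrams in the mini-corpus and applies the MLE formula (the helper name p_bigram is just for illustration):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """MLE estimate: P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("I", "<s>"))   # 2/3
print(p_bigram("Sam", "am"))  # 1/2
```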

16 Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

17 Unigram counts There are 9222 sentences in the corpus.

18 Bigram counts

19 Bigram probabilities Normalize the bigram counts by the unigram counts. Example results:
5/2533 = 0.002, 9/2533 = 0.0036, 211/2417 = 0.087
Sparsity: lots of the entries are zero.

20 Bigram estimates of sentence probabilities
P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = .25 × .33 × .0011 × .5 × .68 ≈ .000031
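As a sketch, the product can be computed directly. The probability values below are illustrative approximations of Berkeley Restaurant corpus estimates, not a real lookup table:

```python
from math import prod

# Illustrative bigram probabilities (assumed values, not a real table).
P = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(tokens):
    """Bigram model: multiply P(w_i | w_{i-1}) over adjacent word pairs."""
    return prod(P[pair] for pair in zip(tokens, tokens[1:]))

p = sentence_prob(["<s>", "i", "want", "english", "food", "</s>"])
print(f"{p:.2g}")  # 3.1e-05
```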

21 Knowledge in the bigrams
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66 (the verb "want" is usually followed by "to" + infinitive)
P(eat|to) = .28 ("to" + infinitive)
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25

22 Computational efficiency issues
Store log probabilities, not the raw probabilities: log(p1 × p2 × p3) = log p1 + log p2 + log p3. This avoids underflow, and adding is faster than multiplying.
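A minimal illustration of working in log space (the probability values are arbitrary examples):

```python
import math

# Example per-bigram probabilities for one sentence (assumed values).
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

# Summing logs avoids the underflow that multiplying many
# small probabilities would eventually cause.
log_p = sum(math.log(p) for p in probs)

# Exponentiate only if the raw probability is actually needed.
p = math.exp(log_p)
```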

23 Google 4-gram counts
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234

24 Complete the sentence I always order pizza with cheese and ____.
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
…
fried rice 0.0001
…
and 1e-100
More examples: The current president of the US is ____. I saw a ____. Is the unigram model good at this guessing game? Which sentence perplexes you the most?

25 Perplexity How hard is the task of recognizing the digits {0,1,2,3,4,5,6,7,8,9}? Each digit is equally likely, so perplexity = 10. How hard is recognizing 30,000 names at Microsoft? Perplexity = 30,000. Perplexity of a test set: PP(W) = P(w1 w2 … wN)^(-1/N). Minimizing perplexity is the same as maximizing probability.
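The definition above can be sketched as a small helper, assuming we have the model's probability for each event in the test sequence:

```python
import math

def perplexity(probs):
    """Perplexity = inverse probability of the test set, normalized
    by length: P(w1 ... wN) ** (-1/N), computed in log space."""
    n = len(probs)
    log_p = sum(math.log(p) for p in probs)
    return math.exp(-log_p / n)

# Ten equally likely digits: every prediction has probability 1/10,
# so the perplexity is 10 regardless of sequence length.
print(perplexity([0.1] * 5))  # 10.0
```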

26 Lower perplexity = better model
Training: 38 million words; test: 1.5 million words (WSJ).
Model: Unigram / Bigram / Trigram
Perplexity: 962 / 170 / 109

27 Shannon Visualization Method
Choose the first bigram (<s>, w) according to its probability. Then choose the next bigram (w, x) according to its probability, and so on, until we choose </s>. Then string the words together:
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
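The Shannon method can be sketched by sampling from bigram counts; the Dr. Seuss mini-corpus from the earlier slides stands in for a real model here:

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Count, for each word, how often each following word occurs.
following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        following[prev][word] += 1

def generate(rng=random):
    """Sample one word at a time, each conditioned on the previous
    word, starting from <s> and stopping at </s>."""
    words, current = [], "<s>"
    while current != "</s>":
        choices = following[current]
        current = rng.choices(list(choices), weights=choices.values())[0]
        words.append(current)
    return " ".join(words[:-1])  # drop the closing </s>

print(generate())
```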

28 Shakespeare lines generated by N-grams

29 Shakespeare as corpus N = 884,647 tokens; vocabulary size V = 29,066.
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams, so 99.96% of the possible bigrams were never seen. Quadrigrams: what's coming out looks like Shakespeare because it is Shakespeare.

30 Wall Street Journal

31 Training set and test set
Use the Shakespeare corpus to train the language model, then test the model on sentences from the WSJ. Your natural language processor will run into big trouble.

32 Problem with zeros
Training set: denied the allegations, denied the reports, denied the claims, denied the request
Test set: denied the offer, denied the loan
P(“offer” | denied the) = 0, so perplexity = ∞

33 Bigram counts

34 Berkeley Restaurant Corpus: Add 1 counts

35 Laplace smoothing
MLE estimate: P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Add-1 estimate: pretend we saw each word one more time than we did: P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), where V is the vocabulary size.
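Add-1 smoothing on the Dr. Seuss mini-corpus, as a sketch:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size (includes <s> and </s> here)

def p_add1(word, prev):
    """Add-1 estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# An unseen bigram such as (<s>, ham) now gets a small nonzero probability.
print(p_add1("ham", "<s>"))  # 1 / (3 + V)
```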

36 Deal with zeros by smoothing
When we have sparse statistics, steal probability mass to generalize better.
Before smoothing, P(w | denied the): allegations 3, reports 2, claims 1, request 1 (7 total); unseen words like attack, man, outcome get zero.
After smoothing, P(w | denied the): allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2 (7 total).

37 Laplace-smoothed bigram probabilities

38 Reconstituted smoothed counts

39 Compare the raw count with smoothed count
C(want to) went from 608 to 238, and P(to|want) from .66 to .26! Discount d = c*/c; d for “chinese food” = .10, a 10× reduction!

40 Backoff and Interpolation
Backoff: use the trigram if there is good evidence for it; otherwise back off to the bigram, and otherwise to the unigram.
Interpolation: use all three by mixing them. Interpolation works better in practice.
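A sketch of linear interpolation (the lambda values and the input probabilities are arbitrary assumptions, not tuned weights):

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.5, 0.3, 0.2)):
    """Linear interpolation: a weighted mix of the trigram, bigram,
    and unigram estimates. The lambdas must sum to 1 so that the
    result is still a probability."""
    l3, l2, l1 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# The trigram estimate may be 0 (unseen), but the mix is not.
print(interpolate(0.0, 0.4, 0.05))  # 0.13
```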

41 How to set the lambdas? Training Data Held-Out Data Test Data
Use the training data to find the N-gram probabilities. Use a held-out data to find the λs. Choose λs to maximize the probability of success on held-out data. After the system is set, test it on the future test data.

42 How to deal with words not in vocabulary?
If we know all the words in advance, the vocabulary V is fixed: a closed vocabulary task. Often we don't know this: out-of-vocabulary (OOV) words make it an open vocabulary task. Use a special unknown word token <UNK>: reduce V to V' by throwing out the unimportant, low-count words; at the text normalization phase, change any training word not in V' to <UNK>; calculate the probability of <UNK> as if it were a normal word; at testing time, use the <UNK> probability for any word not in V'.
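A sketch of the <UNK> normalization step (the count threshold of two and the toy corpus are assumptions for illustration):

```python
from collections import Counter

train_tokens = "the cat sat on the mat the cat ran".split()

# Keep only words seen at least twice; everything else becomes <UNK>.
counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c >= 2}

def normalize(tokens):
    """Map out-of-vocabulary words to the <UNK> token."""
    return [w if w in vocab else "<UNK>" for w in tokens]

print(normalize("the dog cat".split()))  # ['the', '<UNK>', 'cat']
```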

43 Conclusion N-gram technology is underpinned by sound probability theory and statistics, simplified by the Markov assumption. The calculations involve simple counting and division. It is useful in spelling correction, machine translation, imitating Shakespeare, machine-written poetry, and more.

