Introduction to N-grams


Introduction to N-grams [Many of the slides were originally created by Prof. Dan Jurafsky from Stanford.]

Who wrote this? “You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for. This was written by a machine: it is a random sentence generated from a trigram model trained on Jane Austen.

Assign a probability to a phrase Speech Recognition P(I saw a van) >> P(eyes awe of an) Spell Correction The office is about fifteen minuets from my house P(about fifteen minutes from) > P(about fifteen minuets from) Machine Translation P(high winds tonite) > P(large winds tonite)

Predict the next word Please turn your homework ____. The Google search bar does this same kind of next-word prediction.

Counts and probabilities How do we estimate P(the | its water is so transparent that)? Count and divide: P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)

Probabilistic Language Modeling Compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5, …, wn) Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4) A model that computes either of these, P(W) or P(wn | w1, w2, …, wn-1), is called a language model.

The Chain Rule P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)

Unigram model Some automatically generated sentences from a unigram model: fifth an of futures the an incorporated a a the inflation most dollars quarter in is mass thrift did eighty said hard 'm july bullish that or limited the

Bigram model Condition on only the previous word: texaco rose one in this issue is pursuing growth in a boiler house said mr. gurria mexico 's motion control proposal without permission from five hundred fifty five yen outside new car parking lot of the agreement reached this would be a record november

N-gram models We can extend to trigrams, 4-grams, 5-grams In general this is an insufficient model of language because language has long-distance dependencies: “The computer which I had just put into the machine room on the fifth floor ___.” Predict the last word in the above sentence.

Bigram probabilities Use the Maximum Likelihood Estimate: P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

Corpus from Dr. Seuss A corpus is a collection of written texts. Here is a mini-corpus of three sentences. <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> Each sentence starts with the special token <s> and ends with </s>.

Calculate bigram probabilities <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> For example: P(I|<s>) = 2/3, P(Sam|<s>) = 1/3, P(am|I) = 2/3, P(</s>|Sam) = 1/2, P(Sam|am) = 1/2, P(do|I) = 1/3
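The bigram estimates above can be reproduced with a few lines of code. A minimal sketch in Python (the corpus and token conventions come from the slide; the helper name `p_mle` is my own):

```python
from collections import Counter

# Mini-corpus from the slide, with sentence-boundary tokens.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
unigram_counts = Counter()
for sent in sentences:
    tokens = sent.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Maximum likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("Sam", "am"))  # 1/2
```

Running this reproduces the hand-computed values above.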

Berkeley Restaurant Project sentences can you tell me about any good cantonese restaurants close by mid priced thai food is what i’m looking for tell me about chez panisse can you give me a listing of the kinds of food that are available i’m looking for a good place to eat breakfast when is caffe venezia open during the day

Unigram counts There are 9222 sentences in the corpus. Selected unigram counts: i 2533, want 927, to 2417, eat 746, chinese 158, food 1093, lunch 341, spend 278.

Bigram counts

Bigram probabilities Normalize each bigram count by the unigram count of the first word. For example: 5/2533 = 0.002, 9/2533 = 0.0036, 211/2417 = 0.087. The resulting table is sparse: lots of zeros.

Bigram estimates of sentence probabilities P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = .000031

Knowledge in the bigrams P(english|want) = .0011, P(chinese|want) = .0065, P(to|want) = .66 (the verb want is followed by to + infinitive), P(eat | to) = .28 (to + infinitive), P(food | to) = 0, P(want | spend) = 0, P(i | <s>) = .25

Computational efficiency issues Store log probabilities, not the raw probabilities. This avoids numeric underflow, and adding log probabilities is faster than multiplying raw ones.
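A minimal sketch of why log probabilities help (the probability values below are made up for illustration):

```python
import math

# Illustrative bigram probabilities for one short sentence.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

product = 1.0
for p in probs:
    product *= p

log_sum = sum(math.log(p) for p in probs)

# The two agree up to floating-point error, but the log form
# never underflows even for very long sentences.
assert abs(math.exp(log_sum) - product) < 1e-12
```

For a long sentence, the raw product can shrink below the smallest representable float, while the sum of logs stays in a safe range.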

Google 4-gram counts serve as the incoming 92 serve as the incubator 99 serve as the independent 794 serve as the index 223 serve as the indication 72 serve as the indicator 120 serve as the indicators 45 serve as the indispensable 111 serve as the indispensible 40 serve as the individual 234

Complete the sentence I always order pizza with cheese and ____. mushrooms 0.1 pepperoni 0.1 anchovies 0.01 …. fried rice 0.0001 and 1e-100 I always order pizza with cheese and ____. The current president of the US is ____. I saw a ____. Is the unigram model good at this guessing game? Which sentence perplexes you the most?

Perplexity How hard is the task of recognizing the digits {0,1,2,3,4,5,6,7,8,9}? Each digit is equally likely, so perplexity = 10. How hard is recognizing 30,000 names at Microsoft? Perplexity = 30,000. Minimizing perplexity is the same as maximizing probability.

Lower perplexity = better model Trained on 38 million words, tested on 1.5 million words of WSJ text: Unigram perplexity 962, Bigram 170, Trigram 109.
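Perplexity is the inverse probability of the test set, normalized by the number of words; equivalently, the exponential of the average negative log probability. A minimal sketch (the helper name `perplexity` is my own):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Four tokens, each assigned probability 0.1 (as in the ten-digit task):
print(perplexity([0.1] * 4))  # ≈ 10, matching the digit example above
```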

Shannon Visualization Method Choose the first bigram (<s>, w) according to its probability Now choose the next bigram (w, x) according to its probability And so on until we choose </s> Then string the words together: <s> I / I want / want to / to eat / eat Chinese / Chinese food / food </s> gives: I want to eat Chinese food
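The Shannon method above amounts to repeated weighted sampling from the bigram counts. A toy illustration (the two-sentence training corpus and the `generate` helper are hypothetical):

```python
import random
from collections import Counter, defaultdict

# Toy bigram model trained on a tiny made-up corpus.
corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want English food </s>",
]
counts = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    for prev, cur in zip(toks, toks[1:]):
        counts[prev][cur] += 1

def generate(seed=0):
    """Sample words one at a time, each conditioned on the previous word."""
    random.seed(seed)
    word, out = "<s>", []
    while True:
        successors = list(counts[word].keys())
        weights = list(counts[word].values())
        word = random.choices(successors, weights=weights)[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())
```

Every generated sentence starts "I want" here, because those transitions are deterministic in this tiny corpus; the choice points are where the training data had alternatives.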

Shakespeare lines generated by N-grams

Shakespeare as corpus N = 884,647 tokens, vocabulary size V = 29,066. Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams, so 99.96% of the possible bigrams were never seen. Quadrigrams: what comes out looks like Shakespeare because it is Shakespeare.

Wall Street Journal

Training set and test set Use the Shakespeare corpus to train the language model, then test the model on sentences from the WSJ. Your natural language processor will run into big trouble.

Problem with zeros Training set: denied the allegations, denied the reports, denied the claims, denied the request. Test set: denied the offer, denied the loan. The model assigns P(offer | denied the) = 0, so the whole test set gets probability 0 and perplexity = ∞.

Bigram counts

Berkeley Restaurant Corpus: Add 1 counts

Laplace smoothing MLE estimate: P(wi | wi-1) = c(wi-1, wi) / c(wi-1). Pretend we saw each bigram one more time than we did. Add-1 estimate: P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), where V is the vocabulary size.
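A minimal sketch of the add-1 estimate, reusing the Dr. Seuss mini-corpus from earlier (the helper name `p_add1` is my own):

```python
from collections import Counter

# Dr. Seuss mini-corpus as training data.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
bigrams, unigrams = Counter(), Counter()
for sent in sentences:
    toks = sent.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

V = len(unigrams)  # vocabulary size, including <s> and </s>

def p_add1(word, prev):
    """Add-1 (Laplace) estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# The unseen bigram (Sam, green) now gets a small nonzero probability.
print(p_add1("green", "Sam"))  # 1 / (2 + V) instead of 0
```

Every seen bigram is discounted slightly so that the stolen mass covers all the unseen ones.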

Deal with zeros by smoothing When we have sparse statistics, steal probability mass from seen events to generalize better. MLE, P(w | denied the): allegations 3/7, reports 2/7, claims 1/7, request 1/7. After smoothing, P(w | denied the): allegations 2.5/7, reports 1.5/7, claims 0.5/7, request 0.5/7, leaving 2/7 for other words (attack, man, outcome, …).

Laplace-smoothed bigram probabilities

Reconstituted smoothed counts

Compare the raw count with the smoothed count C(want to) went from 608 to 238, and P(to|want) from .66 to .26! The discount d = c*/c for “chinese food” is .10: a 10x reduction.

Backoff and Interpolation Backoff: use the trigram if there is good evidence for it; otherwise back off to the bigram, and failing that to the unigram. Interpolation: use all three by mixing them with weights. Interpolation works better in practice.
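Linear interpolation can be sketched in a few lines (the component probabilities and lambda values below are made-up placeholders; real λs are tuned on held-out data):

```python
def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Mix trigram, bigram, and unigram estimates with fixed weights."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even when the trigram probability is 0, the mixed estimate stays nonzero.
print(interpolate(0.0, 0.2, 0.05))  # ≈ 0.065
```

Because the unigram component is never zero for in-vocabulary words, the interpolated estimate avoids the zero-probability problem entirely.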

How to set the lambdas? Split the data into Training Data, Held-Out Data, and Test Data. Use the training data to find the N-gram probabilities. Use the held-out data to find the λs: choose the λs that maximize the probability of the held-out data. After the system is set, test it on the future test data.

How to deal with words not in the vocabulary? If we know all the words in advance, the vocabulary V is fixed: a closed vocabulary task. Often we don’t know this; unseen words are Out Of Vocabulary (OOV) words, and we have an open vocabulary task. Solution: a special unknown-word token <UNK>. Reduce V to V’ by throwing out the unimportant, low-count words. At the text normalization phase, change any training word not in V’ to <UNK>. Calculate the probability of <UNK> as if it were a normal word. At test time, use the <UNK> probability for any word not in V’.
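The <UNK> recipe can be sketched as follows (the tiny corpus and the count threshold of 2 are assumptions for illustration):

```python
from collections import Counter

# Hypothetical training tokens; rare words will become <UNK>.
train_tokens = "the cat sat on the mat the dog sat".split()
counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c >= 2}  # keep words seen twice or more

def normalize(tokens):
    """Map any token outside the reduced vocabulary V' to <UNK>."""
    return [t if t in vocab else "<UNK>" for t in tokens]

print(normalize(train_tokens))
print(normalize("the zebra sat".split()))  # unseen 'zebra' maps to <UNK>
```

After this normalization, <UNK> has real counts in the training data, so the model assigns it a probability like any other word.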

Conclusion N-gram technology is underpinned by sound probability theory and statistics, simplified by the Markov assumption. Calculations involve simple counting and division. It is useful in spelling correction, machine translation, imitating Shakespeare, machine-generated poetry, etc.