CS60057 Speech & Natural Language Processing, Autumn 2007. Lecture 7, 8 August 2007.

A Simple Example

P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)

A Bigram Grammar Fragment from BERP

  eat on        .16     eat Thai       .03
  eat some      .06     eat breakfast  .03
  eat lunch     .06     eat in         .02
  eat dinner    .05     eat Chinese    .02
  eat at        .04     eat Mexican    .02
  eat a         .04     eat tomorrow   .01
  eat Indian    .04     eat dessert    .007
  eat today     .03     eat British    .001

  <s> I            .25     I want              .32
  <s> I'd          .06     I would             .29
  <s> Tell         .04     I don't             .08
  <s> I'm          .02     I have              .04
  want to          .65     to eat              .26
  want a           .05     to have             .14
  want some        .04     to spend            .09
  want Thai        .01     to be               .02
  British food     .60     British restaurant  .15
  British cuisine  .01     British lunch       .01

- P(I want to eat British food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
  = .25 * .32 * .65 * .26 * .001 * .60 ≈ .0000081
- vs. P(I want to eat Chinese food) = .25 * .32 * .65 * .26 * .02 * .56 ≈ .00015
- The probabilities seem to capture "syntactic" facts and "world knowledge":
  - eat is often followed by an NP
  - British food is not too popular
- N-gram models can be trained by counting and normalization
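A minimal sketch of this computation in Python, using the bigram probabilities from the BERP fragment above (the dictionary below holds just those illustrative values, not a trained model):

```python
# Bigram probabilities taken from the BERP fragment above (illustrative values only).
bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "british"): 0.001, ("eat", "chinese"): 0.02,
    ("british", "food"): 0.60, ("chinese", "food"): 0.56,
}

def sentence_probability(sentence):
    """Multiply the bigram probabilities P(w_i | w_{i-1}) over the sentence."""
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)  # unseen bigram -> 0 (no smoothing yet)
    return prob

print(sentence_probability("I want to eat British food"))   # ~8.1e-06
print(sentence_probability("I want to eat Chinese food"))   # ~1.5e-04
```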

BERP Bigram Counts

[Table of bigram counts C(w_{n-1} w_n) over the words: I, want, to, eat, Chinese, food, lunch]

BERP Bigram Probabilities

- Normalization: divide each row's counts by the appropriate unigram count for w_{n-1}
- Computing the bigram probability of "I I": P(I | I) = C(I, I) / C(I) = 8 / 3437 = .0023
- Maximum Likelihood Estimation (MLE): the relative frequency, e.g. C(w_{n-1} w_n) / C(w_{n-1})

[Bigram probability table over the words: I, want, to, eat, Chinese, food, lunch]
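A minimal sketch of MLE bigram estimation by counting and normalizing; the two-sentence corpus here is a made-up illustration, not the BERP data:

```python
from collections import Counter

def mle_bigram_probs(sentences):
    """Estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) by relative frequency."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigram_counts.update(words[:-1])            # counts of w_{n-1} as a history
        bigram_counts.update(zip(words, words[1:]))  # counts of each bigram
    return {bg: c / unigram_counts[bg[0]] for bg, c in bigram_counts.items()}

# Toy corpus (illustrative only)
probs = mle_bigram_probs(["I want to eat Chinese food", "I want lunch"])
print(probs[("i", "want")])   # 1.0 -- both sentences continue "I" with "want"
```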

What do we learn about the language?

- What's being captured with...
  P(want | I) = .32
  P(to | want) = .65
  P(eat | to) = .26
  P(food | Chinese) = .56
  P(lunch | eat) = .055
- What about...
  P(I | I) = .0023
  P(I | want) = .0025
  P(I | food) = .013

- P(I | I) = .0023      "I I I I want"
- P(I | want) = .0025   "I want I want"
- P(I | food) = .013    "the kind of food I want is ..."

Approximating Shakespeare

- As we increase the value of N, the accuracy of the n-gram model increases, since the choice of next word becomes increasingly constrained
- Generating sentences with random unigrams:
  "Every enter now severally so, let"
  "Hill he late speaks; or! a more to leg less first you enter"
- With bigrams:
  "What means, sir. I confess she? then all sorts, he is trim, captain."
  "Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry."

- With trigrams:
  "Sweet prince, Falstaff shall die."
  "This shall forbid it should be branded, if renown made it empty."
- With quadrigrams:
  "What! I will go seek the traitor Gloucester."
  "Will you not tell me who I am?"

- There are 884,647 tokens, with 29,066 word-form types, in the roughly one-million-word Shakespeare corpus
- Shakespeare produced about 300,000 bigram types out of 844 million possible bigrams (29,066 squared), so 99.96% of the possible bigrams were never seen (they have zero entries in the table)
- Quadrigrams are worse: what comes out looks like Shakespeare because it is Shakespeare

N-Gram Training Sensitivity

- If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?
- This has major implications for corpus selection or design
- Dynamically adapting language models to different genres

Unknown Words

- Unknown or out-of-vocabulary (OOV) words
- Open-vocabulary system: model the unknown word with a special token <UNK>
- Training is as follows:
  1. Choose a vocabulary
  2. Convert any word in the training set not belonging to this vocabulary to <UNK>
  3. Estimate the probabilities for <UNK> from its counts
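A minimal sketch of the vocabulary-closing step, assuming a simple frequency cutoff to choose the vocabulary (the cutoff and helper name are illustrative, not the lecture's exact recipe):

```python
from collections import Counter

def close_vocabulary(sentences, min_count=2):
    """Replace rare/unseen words with <UNK> so the model has counts for unknowns."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    vocab = {w for w, c in counts.items() if c >= min_count}              # step 1: choose a vocabulary
    closed = [[w if w in vocab else "<UNK>" for w in s.lower().split()]   # step 2: map OOV words to <UNK>
              for s in sentences]
    return vocab, closed   # step 3: estimate <UNK> probabilities from these counts as usual

vocab, closed = close_vocabulary(["I want Chinese food", "I want Thai food", "I want lunch"])
print(closed[2])   # ['i', 'want', '<UNK>'] -- 'lunch' occurs only once
```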

Evaluating N-grams: Perplexity

- Evaluating in downstream applications (like speech recognition) is potentially expensive
- We need a metric to quickly evaluate potential improvements in a language model
- Perplexity intuition: the better model is a tighter fit to the test data (it assigns higher probability to the test data)
- PP(W) = P(w_1 w_2 ... w_N)^(-1/N)   (Chapter 4, p. 14)
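A minimal sketch of computing perplexity over a test set, assuming a hypothetical dictionary of (smoothed) bigram probabilities so that no probability is zero:

```python
import math

def perplexity(sentences, bigram_prob):
    """PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space for numerical stability."""
    log_prob, n_words = 0.0, 0
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            log_prob += math.log(bigram_prob[(prev, cur)])   # assumes a smoothed model: no zeros
            n_words += 1
    return math.exp(-log_prob / n_words)

# e.g. perplexity(test_sentences, smoothed_bigram_probs)
```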

Some Useful Empirical Observations

- A small number of events occur with high frequency
- A large number of events occur with low frequency
- You can quickly collect statistics on the high-frequency events
- You might have to wait an arbitrarily long time to get valid statistics on low-frequency events
- Some of the zeros in the table are really zeros
- But others are simply low-frequency events you haven't seen yet. How to address this?

Smoothing: None

- Called the Maximum Likelihood estimate: P(z | xy) = C(xyz) / C(xy)
- Terrible on test data: if there are no occurrences of xyz (C(xyz) = 0), the estimated probability is 0

Smoothing Techniques

- Every n-gram training matrix is sparse, even for very large corpora (Zipf's law)
- Solution: estimate the likelihood of unseen n-grams
- Problem: how do you adjust the rest of the corpus to accommodate these 'phantom' n-grams?

Smoothing = Redistributing Probability Mass

Add-One Smoothing

- For unigrams:
  - Add 1 to every word (type) count
  - Normalize by N (tokens) / (N (tokens) + V (types))
  - The smoothed count (adjusted for additions to N) is c_i* = (c_i + 1) N / (N + V)
  - Normalize by N to get the new unigram probability: p_i* = (c_i + 1) / (N + V)
- For bigrams:
  - Add 1 to every bigram count: c(w_{n-1} w_n) + 1
  - Increment the unigram (history) count by the vocabulary size: c(w_{n-1}) + V
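A minimal sketch of the add-one (Laplace) bigram estimate, assuming bigram and unigram count dictionaries like the ones built earlier:

```python
def add_one_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts.get(prev, 0) + vocab_size)

# Every bigram, seen or unseen, now gets a non-zero probability,
# e.g. add_one_bigram_prob("want", "to", bigram_counts, unigram_counts, V)
```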

Effect on BERP Bigram Counts

Add-One Bigram Probabilities

The Problem

- Add-one has a huge effect on probabilities: e.g., P(to | want) went from .65 to .28!
- Too much probability gets 'removed' from n-grams actually encountered (more precisely, the 'discount factor' is too large)

Discount: the ratio of new counts to old

- E.g., add-one smoothing changes the BERP bigram count c(want to) from 786 to 331 (d_c = .42) and P(to | want) from .65 to .28
- But this changes counts drastically:
  - too much weight is given to unseen n-grams
  - in practice, unsmoothed bigrams often work better!

Smoothing

- Add-one smoothing: P(z | xy) = (C(xyz) + 1) / (C(xy) + V)
  - Works very badly.
- Add-delta smoothing: P(z | xy) = (C(xyz) + δ) / (C(xy) + δV)
  - Still very bad.

[based on slides by Joshua Goodman]
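A minimal sketch of add-delta (Lidstone) smoothing for a trigram context; delta is a tunable constant and the count dictionaries are assumed, hypothetical inputs:

```python
def add_delta_prob(history, word, ngram_counts, history_counts, vocab_size, delta=0.5):
    """P(z | xy) = (C(xyz) + delta) / (C(xy) + delta * V); delta = 1 recovers add-one."""
    return ((ngram_counts.get(history + (word,), 0) + delta)
            / (history_counts.get(history, 0) + delta * vocab_size))

# e.g. add_delta_prob(("want", "to"), "eat", trigram_counts, bigram_counts, V, delta=0.1)
```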

Witten-Bell Discounting

- A zero n-gram is just an n-gram you haven't seen yet... but every n-gram in the corpus was unseen once... so:
  - How many times did we see an n-gram for the first time? Once for each n-gram type (T)
  - Estimate the total probability of unseen bigrams as T / (N + T)
  - View the training corpus as a series of events: one for each token (N) and one for each new type (T)

Lecture 1, 7/21/2005Natural Language Processing29 We can divide the probability mass equally among unseen bigrams….or we can condition the probability of an unseen bigram on the first word of the bigram Discount values for Witten-Bell are much more reasonable than Add-One

Good-Turing Discounting

- Re-estimate the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts
  - N_c: the number of n-grams with frequency c
  - Estimate the smoothed count as c* = (c + 1) N_{c+1} / N_c
  - E.g., N_0's adjusted count is a function of the count of n-grams that occur once, N_1; the total probability mass assigned to unseen n-grams is N_1 / N
- Assumes:
  - word bigrams follow a binomial distribution
  - we know the number of unseen bigrams (V x V minus the number seen)
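A minimal sketch of the Good-Turing adjusted counts c* = (c + 1) N_{c+1} / N_c; a practical implementation would also smooth the N_c values themselves, which is omitted here:

```python
from collections import Counter

def good_turing_adjusted_counts(bigram_counts):
    """Return c* = (c + 1) * N_{c+1} / N_c for each count value c observed."""
    freq_of_freqs = Counter(bigram_counts.values())       # N_c: how many bigrams occur c times
    adjusted = {}
    for c, n_c in freq_of_freqs.items():
        n_c_plus_1 = freq_of_freqs.get(c + 1, 0)
        adjusted[c] = (c + 1) * n_c_plus_1 / n_c           # 0 if no n-gram has count c+1
    return adjusted

# Mass reserved for unseen bigrams: N_1 / N  (N = total number of bigram tokens)
```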

Interpolation and Backoff

- Typically used in addition to smoothing/discounting techniques
- Example: trigrams
  - Smoothing gives some probability mass to all the trigram types not observed in the training data
  - We could make a more informed decision! How?
- If backoff finds an unobserved trigram in the test data, it will "back off" to bigrams (and ultimately to unigrams)
  - Backoff doesn't treat all unseen trigrams alike
  - When we have observed a trigram, we rely solely on the trigram counts
- Interpolation generally takes bigrams and unigrams into account as well when computing a trigram probability

Backoff Methods (e.g., Katz '87)

- For, e.g., a trigram model:
  - Compute unigram, bigram, and trigram probabilities
  - In use: where the trigram is unavailable, back off to the bigram if available, otherwise to the unigram probability
  - E.g., "an omnivorous unicorn" -- a trigram we are unlikely to have seen in training
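A minimal sketch of the backoff control flow only; real Katz backoff also discounts the higher-order estimates and renormalizes the lower-order ones with alpha weights, which is omitted here, so this is not a proper probability distribution:

```python
def backoff_prob(w1, w2, w3, tri_p, bi_p, uni_p):
    """Use the trigram estimate if we have it; otherwise back off to the bigram, then the unigram.
    NOTE: true Katz backoff discounts the higher-order estimates and weights the
    lower-order ones (alpha) so everything still sums to 1 -- omitted in this sketch."""
    if (w1, w2, w3) in tri_p:
        return tri_p[(w1, w2, w3)]
    if (w2, w3) in bi_p:
        return bi_p[(w2, w3)]
    return uni_p.get(w3, 0.0)

# e.g. backoff_prob("an", "omnivorous", "unicorn", tri_p, bi_p, uni_p)
```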

Smoothing: Simple Interpolation

- The trigram is very context-specific, but very noisy
- The unigram is context-independent, but smooth
- Interpolate trigram, bigram, and unigram for the best combination:
  P(z | xy) ≈ λ P(z | xy) + μ P(z | y) + (1 - λ - μ) P(z)
- Find 0 < λ, μ < 1 by optimizing on "held-out" data
- Almost good enough
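A minimal sketch of linear interpolation with fixed weights; the probability dictionaries are assumed MLE estimates and the default lambdas are arbitrary placeholders:

```python
def interpolated_prob(w1, w2, w3, tri_p, bi_p, uni_p, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1 w2) = l3*P(w3 | w1 w2) + l2*P(w3 | w2) + l1*P(w3), with l3 + l2 + l1 = 1."""
    l3, l2, l1 = lambdas
    return (l3 * tri_p.get((w1, w2, w3), 0.0)
            + l2 * bi_p.get((w2, w3), 0.0)
            + l1 * uni_p.get(w3, 0.0))
```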

Smoothing: Held-Out Estimation

- Finding parameter values:
  - Split the data into training, "held-out", and test sets
  - Try lots of different values for λ and μ on the held-out data, pick the best
  - Test on the test data
- Sometimes you can use tricks like EM (expectation maximization) to find the values
- [Joshua Goodman:] I prefer to use a generalized search algorithm, "Powell search" -- see Numerical Recipes in C

[based on slides by Joshua Goodman]
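A minimal sketch of picking the interpolation weights by grid search on held-out log-likelihood, reusing the hypothetical interpolated_prob function from the previous sketch:

```python
import itertools
import math

def pick_lambdas(heldout_trigrams, tri_p, bi_p, uni_p, step=0.1):
    """Grid-search (l3, l2, l1) summing to 1; keep the setting with the best held-out log-likelihood."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(1, int(1 / step))]
    for l3, l2 in itertools.product(grid, grid):
        l1 = 1.0 - l3 - l2
        if l1 <= 0:
            continue
        ll = sum(math.log(interpolated_prob(w1, w2, w3, tri_p, bi_p, uni_p, (l3, l2, l1)) or 1e-12)
                 for (w1, w2, w3) in heldout_trigrams)   # floor zero probabilities for the log
        if ll > best_ll:
            best, best_ll = (l3, l2, l1), ll
    return best
```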

Held-Out Estimation: Splitting the Data

- How much data for training, held-out, and test?
- Some people say things like "1/3, 1/3, 1/3" or "80%, 10%, 10%" -- they are WRONG
- The held-out set should have (at least) a certain minimum number of words per parameter being estimated
- Answer: enough test data to be statistically significant (1000s of words, perhaps)

[based on slides by Joshua Goodman]

Summary

- N-gram probabilities can be used to estimate the likelihood:
  - of a word occurring in a context of the N-1 previous words
  - of a sentence occurring at all
- Smoothing techniques deal with the problem of n-grams unseen in the training corpus

Practical Issues

- Represent and compute language model probabilities in log format to avoid numerical underflow:
  p1 * p2 * p3 * p4 = exp(log p1 + log p2 + log p3 + log p4)
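A minimal sketch of the log-space trick, using the bigram probabilities from the earlier example; summing logs avoids underflow when many small probabilities are multiplied:

```python
import math

probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]   # bigram probabilities from the example above
log_prob = sum(math.log(p) for p in probs)      # work in log space
print(math.exp(log_prob))                       # ~8.1e-06, same as multiplying directly
```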

Class-Based N-grams

- P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i), where c_i is the class of word w_i
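A minimal sketch of the class-based bigram estimate, assuming hypothetical word-to-class and probability dictionaries:

```python
def class_bigram_prob(prev_word, word, word2class, class_bigram_p, word_given_class_p):
    """P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c_prev, c_cur = word2class[prev_word], word2class[word]
    return class_bigram_p.get((c_prev, c_cur), 0.0) * word_given_class_p.get((word, c_cur), 0.0)

# e.g. word2class = {"friday": "WEEKDAY", "monday": "WEEKDAY", "to": "TO", ...}
```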