Chapter 6: N-GRAMS
Heshaam Faili, University of Tehran

2 N-grams: Motivation
- An n-gram is a stretch of text n words long.
- Approximation of language: the information in n-grams tells us something about language, but doesn't capture its structure.
- Efficient: finding and using every two-word collocation in a text, for example, is quick and easy to do.
- N-grams can help in a variety of NLP applications:
  - Word prediction: n-grams can be used to predict the next word of an utterance, based on the previous n-1 words.
  - Useful for context-sensitive spelling correction, approximation of language, ...

3 Corpus-based NLP
- Corpus (pl. corpora): a computer-readable collection of text and/or speech, often with annotations.
- We can use corpora to gather probabilities and other information about language use.
- A corpus used to gather prior information is called training data.
- Testing data, by contrast, is the data used to test the accuracy of a method.
- We can distinguish types and tokens in a corpus:
  - type = a distinct word (e.g., like)
  - token = an individual occurrence of a word (e.g., the type like might have 20,000 tokens in a corpus)
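
As a concrete illustration (a small sketch added here, not from the original slides), a few lines of Python can count types and tokens in a toy corpus; the corpus text below is made up for the example.

```python
from collections import Counter

# A toy "corpus"; in practice this would be read from a file.
corpus = "I saw the cat and the dog and the cat saw me".lower().split()

tokens = len(corpus)        # every occurrence counts
types = len(set(corpus))    # distinct words only
counts = Counter(corpus)

print(f"{tokens} tokens, {types} types")   # 12 tokens, 7 types
print(counts.most_common(3))               # the most frequent word types
```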

4 Simple n-grams
- Suppose we want to predict the next word, based on the previous context I dreamed I saw the knights in.
- What we want is the likelihood of w_8 being the next word, given that we've seen w_1, ..., w_7: in other words, P(w_8 | w_1, ..., w_7).
- In general, for w_n we need the chain rule of probability:
  (1) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
- But these long-history probabilities are impractical to estimate: the full histories hardly ever occur in a corpus, if at all. (And it would be a lot of data to store, even if we could calculate them.)
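
To make the chain rule concrete, here is a worked expansion for the example prefix (an illustrative sketch added here, not part of the original slides):

```latex
P(\text{I dreamed I saw the knights in}) =
  P(\text{I}) \cdot
  P(\text{dreamed} \mid \text{I}) \cdot
  P(\text{I} \mid \text{I dreamed}) \cdot
  P(\text{saw} \mid \text{I dreamed I}) \cdots
  P(\text{in} \mid \text{I dreamed I saw the knights})
```

The final factor conditions on a six-word history, which is exactly the kind of event that is too rare to estimate reliably from corpus counts.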

5 Unigrams
- We can approximate these probabilities with a particular n-gram, for a given n. What should n be?
- Unigrams (n = 1):
  (2) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n)
- Easy to calculate, but we have no contextual information:
  (3) The quick brown fox jumped ...
- We would like to say that over has a higher probability in this context than lazy does.

6 Bigrams
- Bigrams (n = 2) are a better choice and still easy to calculate:
  (4) P(w_n | w_1, ..., w_{n-1}) ≈ P(w_n | w_{n-1})
  (5) P(over | The, quick, brown, fox, jumped) ≈ P(over | jumped)
- Thus we obtain for the probability of a sentence:
  (6) P(w_1, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})

7 Markov models
- A bigram model is also called a first-order Markov model because it has one element of memory (one token in the past).
- Markov models are essentially weighted FSAs, i.e., the arcs between states have probabilities.
- The states in the FSA are words.
- Much more on Markov models when we get to POS tagging...

8 Bigram example
- What is the probability of seeing the sentence The quick brown fox jumped over the lazy dog?
  (7) P(The quick brown fox jumped over the lazy dog) = P(The | <s>) P(quick | The) P(brown | quick) ... P(dog | lazy)
- Probabilities are generally small, so log probabilities are usually used.
- Does this favor shorter sentences?
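
A minimal sketch of scoring the example sentence with a bigram model in log space; the probability table below is invented for illustration, and <s> marks the sentence start.

```python
import math

# Hypothetical bigram probabilities P(w_n | w_{n-1}); real values would
# come from a trained model.
bigram_p = {
    ("<s>", "the"): 0.20, ("the", "quick"): 0.01, ("quick", "brown"): 0.30,
    ("brown", "fox"): 0.25, ("fox", "jumped"): 0.15, ("jumped", "over"): 0.40,
    ("over", "the"): 0.35, ("the", "lazy"): 0.02, ("lazy", "dog"): 0.50,
}

def sentence_logprob(words):
    """Sum of log P(w_i | w_{i-1}); summing logs avoids numeric underflow."""
    logp = 0.0
    for w_prev, w in zip(["<s>"] + words, words):
        logp += math.log(bigram_p[(w_prev, w)])
    return logp

words = "the quick brown fox jumped over the lazy dog".split()
print(sentence_logprob(words))
```

Because every additional factor is at most 1, longer sentences do receive lower probabilities; this is one reason evaluation measures such as perplexity normalize by sentence length.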

9 Trigrams
- If bigrams are good, then trigrams (n = 3) can be even better.
- Wider context: P(know | did, he) vs. P(know | he).
- Trigrams are generally still short enough that we will have enough data to gather accurate probabilities.

10 Training n-gram models
- Go through the corpus and calculate relative frequencies:
  (8) P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
  (9) P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})
- This technique of estimating probabilities from a training corpus is called maximum likelihood estimation (MLE).
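
A minimal sketch of MLE bigram training by relative-frequency counting, following equation (8); the toy training sentences below are made up for the example.

```python
from collections import defaultdict

# Toy training corpus; a real model would be trained on a large corpus.
sentences = [
    "<s> I dreamed I saw the knights </s>".split(),
    "<s> I saw the lazy dog </s>".split(),
]

bigram_count = defaultdict(int)
history_count = defaultdict(int)
for sent in sentences:
    for w_prev, w in zip(sent, sent[1:]):
        bigram_count[(w_prev, w)] += 1
        history_count[w_prev] += 1

def p_mle(w, w_prev):
    """P(w | w_prev) = C(w_prev, w) / C(w_prev) -- relative frequency."""
    return bigram_count[(w_prev, w)] / history_count[w_prev]

print(p_mle("saw", "I"))   # 2/3: "saw" follows "I" in two of its three occurrences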

11 Maximum likelihood estimation (MLE)
- The resulting parameter set is the one in which the likelihood of the training set T given the model M, P(T | M), is maximized.
- Example: Chinese occurs 400 times in the 1-million-word Brown corpus, so the MLE estimate is 400/1,000,000 = 0.0004.
- This is not the best possible estimate of the probability of Chinese occurring in all situations, but it is the estimate that makes it most likely that Chinese will occur 400 times in a million-word corpus.

12 Know your corpus
- We mentioned training data and testing data earlier; it's important to remember what your training data is when applying your technology to new data.
- If you train your trigram model on Shakespeare, then you have learned the probabilities of Shakespeare, not the probabilities of English overall.
- Which corpus you use depends on what you want to do later.

13 Smoothing: Motivation
- Assume we have a good corpus and have trained a bigram model on it, i.e., learned MLE probabilities for bigrams.
- But we won't have seen every possible bigram: lickety split is a possible English bigram, but it may not be in the corpus.
- This is a problem of data sparseness: there are zero-probability bigrams that are actually possible bigrams in the language.
- To account for this sparseness, we turn to smoothing techniques: making zero probabilities non-zero, i.e., adjusting probabilities to account for unseen data.

14 Add-One Smoothing
- One way to smooth is to add a count of one to every bigram.
- In order to still have a probability distribution, all probabilities must sum to one, so we add the number of word types V to the denominator (we added one to every type of bigram, so we need to account for all our additions to the numerator).
  (10) P(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + V)
- V = the total number of word types that we might see.

15 Add-One smoothing
- Unsmoothed (MLE): P(w_x) = C(w_x) / Σ_i C(w_i)
- Adjusted count: c_i* = (c_i + 1) · N / (N + V)
- Discount: d_c = c* / c
- Smoothed probability: P_i* = (c_i + 1) / (N + V)
- For bigrams: P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

16 Smoothing example
- If treasure trove never occurred in the data, but treasure occurred twice, we have:
  (11) P(trove | treasure) = (0 + 1) / (2 + V)
- The probability won't be very high, but it will be better than 0.
- If all the surrounding probabilities are still high, then treasure trove could still be the best pick; if the probability were zero, there would be no chance of it appearing.
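
A hedged sketch of add-one smoothing matching the treasure/trove example above; the counts and the vocabulary size V are assumed values for illustration.

```python
from collections import defaultdict

# Assumed to be filled by a counting pass over the training corpus.
bigram_count = defaultdict(int, {("treasure", "chest"): 2})
history_count = defaultdict(int, {"treasure": 2})
V = 10000   # hypothetical vocabulary size (number of word types)

def p_addone(w, w_prev):
    """P(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigram_count[(w_prev, w)] + 1) / (history_count[w_prev] + V)

print(p_addone("trove", "treasure"))   # unseen bigram: (0+1)/(2+V), small but non-zero
print(p_addone("chest", "treasure"))   # seen bigram: (2+1)/(2+V)
```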

17 Discounting
- An alternative way of viewing smoothing is as discounting: lowering non-zero counts to obtain the probability mass we need for the zero-count items.
- The discounting factor can be defined as the ratio of the smoothed count to the MLE count.
- Jurafsky and Martin show that add-one smoothing can discount probabilities by a factor of 8! That's way too much.

18 Witten-Bell Discounting
- Key idea: use the count of things seen for the first time (i.e., the number of observed types T) to estimate the count of things never seen.
- Estimate the probability of seeing an N-gram for the first time by counting how often we saw N-grams for the first time in the training corpus:
  Σ_{i: c_i = 0} p_i* = T / (T + N)
- This is the total probability mass for all unseen N-grams; it must be divided among them. With Z = Σ_{i: c_i = 0} 1, the number of zero-count N-grams:
  p_i* = T / (Z (T + N))

19 Witten-Bell Discounting
- p_i* = c_i / (N + T)              if c_i > 0
- c_i* = (T / Z) · N / (N + T)      if c_i = 0
- c_i* = c_i · N / (N + T)          if c_i > 0

20 Witten-Bell Discounting, Bigram
- Main idea: instead of simply adding one to every n-gram, compute the probability of (w_{i-1}, w_i) by seeing how likely w_{i-1} is to start any bigram.
- Words that begin lots of bigrams lead to higher "unseen bigram" probabilities.
- Non-zero bigrams are discounted in essentially the same manner as zero-count bigrams.
- Jurafsky and Martin show that they are only discounted by about a factor of one.

21 Witten-Bell Discounting formula
- (12) Zero-count bigrams: p*(w_i | w_{i-1}) = T(w_{i-1}) / ( Z(w_{i-1}) · (N(w_{i-1}) + T(w_{i-1})) )
- T(w_{i-1}) = the number of bigram types starting with w_{i-1}; as the numerator, it determines how high the value will be.
- N(w_{i-1}) = the number of bigram tokens starting with w_{i-1}; N(w_{i-1}) + T(w_{i-1}) gives the total number of "events" to divide by.
- Z(w_{i-1}) = the number of bigram types starting with w_{i-1} that have zero count; this just distributes the probability mass evenly among all zero-count bigrams starting with w_{i-1}.
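
A sketch of Witten-Bell smoothing for bigrams following the formula above; the counts and the vocabulary size V are made up, and T, N, and Z follow the definitions on this slide (with Z(h) = V - T(h)).

```python
from collections import defaultdict

# Toy bigram counts; in practice these come from a training corpus.
bigram_count = defaultdict(int, {
    ("the", "cat"): 3, ("the", "dog"): 2, ("the", "end"): 1,
})
V = 10000   # hypothetical vocabulary size

# Per-history statistics: N = bigram tokens starting with h,
# T = bigram types starting with h.
N = defaultdict(int)
T = defaultdict(int)
for (h, w), c in bigram_count.items():
    N[h] += c
    T[h] += 1

def p_witten_bell(w, h):
    n, t = N[h], T[h]
    z = V - t                      # number of unseen continuations of h
    if bigram_count[(h, w)] > 0:   # seen bigram: discounted relative frequency
        return bigram_count[(h, w)] / (n + t)
    return t / (z * (n + t))       # unseen bigram: share of the reserved mass

print(p_witten_bell("cat", "the"))     # 3 / (6 + 3)
print(p_witten_bell("zebra", "the"))   # 3 / (9997 * 9)
```

The seen probabilities sum to N/(N+T) and the unseen ones to T/(N+T) per history, so the distribution still sums to one.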

22 Good-Turing discounting
- Uses the joint probability P(w_x w_i) instead of the conditional probability P(w_i | w_x).
- The main idea is to re-estimate the amount of probability mass to assign to N-grams with zero or low counts by looking at the number of N-grams with higher counts.

23 Good-Turing discounting
- N_c: the number of n-grams that occur exactly c times: N_c = Σ_{b: c(b) = c} 1
- Good-Turing estimate: c_i* = (c_i + 1) · N_{c+1} / N_c
- For c = 0: c* = N_1 / N_0
- In practice, c* = c is kept for larger counts (c > k, for some threshold k).
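
A minimal Good-Turing sketch; the observed counts and the number of unseen n-grams N_0 are made up, and, as the slide notes, a real implementation keeps c* = c above some threshold k and usually smooths the N_c values before using them.

```python
from collections import Counter

# Hypothetical observed counts for some set of n-grams.
ngram_count = {"a b": 3, "b c": 2, "c d": 2, "d e": 1, "e f": 1, "f g": 1}
N0 = 50   # assumed number of possible but unseen n-grams

# N_c: how many n-gram types occur exactly c times.
count_of_counts = Counter(ngram_count.values())
count_of_counts[0] = N0

def c_star(c, k=5):
    """Good-Turing re-estimated count: c* = (c+1) * N_{c+1} / N_c (for c <= k)."""
    if c > k or count_of_counts[c] == 0:
        return float(c)            # fall back to the raw count
    return (c + 1) * count_of_counts[c + 1] / count_of_counts[c]

print(c_star(0))   # mass shifted to each unseen n-gram's count: N_1 / N_0 = 3/50
print(c_star(1))   # 2 * N_2 / N_1 = 2 * 2 / 3
```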

24 Backoff models: Basic idea
- Say we're using a trigram model for predicting language, and we haven't seen a particular trigram before.
- But maybe we've seen the bigram, or if not, the unigram information would still be useful.
- Backoff models allow one to try the most informative n-gram first and then back off to lower-order n-grams.

25 Backoff equations
- Roughly speaking, this is how a backoff model works. If the trigram has a non-zero count, we use that information:
  (13) P̂(w_i | w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})
- Else, if the bigram count is non-zero, we use the bigram information:
  (14) P̂(w_i | w_{i-2} w_{i-1}) = α_1 P(w_i | w_{i-1})
- In all other cases we just take the unigram information:
  (15) P̂(w_i | w_{i-2} w_{i-1}) = α_2 P(w_i)

26 Backoff models: example
- Say we've never seen the trigram "maples want more" before, but we have seen "want more", so we can use that bigram to calculate a probability estimate: we look at P(more | want).
- But we are now assigning probability to P(more | maples, want), which was zero before, so we no longer have a true probability model.
- This is why α_1 was used in the previous equations: to re-weight (reduce) the lower-order probability.
- In general, backoff models have to be combined with discounting models.
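
A simplified backoff sketch with fixed re-weighting constants; the probability tables are invented, and real Katz backoff derives the α weights per history from the discounted probability mass rather than using a fixed value like the hypothetical 0.4 below.

```python
# Hypothetical MLE probability tables; real ones come from counting.
trigram_p = {("i", "want", "more"): 0.2}
bigram_p  = {("want", "more"): 0.3, ("maples", "want"): 0.05}
unigram_p = {"more": 0.01, "want": 0.02}

ALPHA = 0.4   # fixed back-off weight (a simplification; Katz computes it per history)

def p_backoff(w, w2, w1):
    """Use the trigram if seen, else back off to the bigram, else to the unigram."""
    if (w2, w1, w) in trigram_p:
        return trigram_p[(w2, w1, w)]
    if (w1, w) in bigram_p:
        return ALPHA * bigram_p[(w1, w)]
    return ALPHA * ALPHA * unigram_p.get(w, 1e-7)   # tiny floor for unknown words

print(p_backoff("more", "i", "want"))        # trigram is available
print(p_backoff("more", "maples", "want"))   # falls back to P(more | want)
```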

27 Deleted Interpolation
- Deleted interpolation is similar to backing off, except that we always use the trigram, bigram, and unigram information to calculate the probability estimate:
  (16) P̂(w_i | w_{i-2} w_{i-1}) = λ_1 P(w_i | w_{i-2} w_{i-1}) + λ_2 P(w_i | w_{i-1}) + λ_3 P(w_i)
- where the lambdas (λ) all sum to one.
- Every trigram probability, then, is a composite of the focus word's trigram, bigram, and unigram estimates.
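
A minimal interpolation sketch; the λ weights below are made-up constants that sum to one, whereas deleted interpolation actually tunes them on held-out data.

```python
# Hypothetical component probabilities; real ones come from MLE counts.
trigram_p = {("i", "want", "more"): 0.20}
bigram_p  = {("want", "more"): 0.30}
unigram_p = {"more": 0.01}

# Interpolation weights; must sum to 1.  In deleted interpolation they are
# chosen to maximise the likelihood of held-out data, not set by hand.
L1, L2, L3 = 0.6, 0.3, 0.1

def p_interp(w, w2, w1):
    return (L1 * trigram_p.get((w2, w1, w), 0.0)
            + L2 * bigram_p.get((w1, w), 0.0)
            + L3 * unigram_p.get(w, 0.0))

print(p_interp("more", "i", "want"))       # mixes all three estimates
print(p_interp("more", "maples", "want"))  # trigram term is 0, but result stays > 0
```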

28 A note on information theory
- Some very useful notions for n-gram work come from information theory. We'll just go over the basic ideas:
- entropy = a measure of how much information is needed to encode something
- perplexity = a measure of the amount of surprise at an outcome
- mutual information = the amount of information one item carries about another item (e.g., collocations have high mutual information)
- cross entropy = a measure of how well a model matches observed data (used to compare models)

29 Entropy
- A measure of information: how much information there is in a particular grammar, and how well a given grammar matches a given language.
- H(X) = - Σ_{x ∈ X} p(x) log_2 p(x)   (in bits)
- H(X) is maximized when all members of the alphabet have equal probability.

30 Perplexity
- Perplexity = 2^H (where H is the entropy).
- The weighted average number of choices a random variable has to make.

31 Cross entropy
- Used for comparing models:
  H(p, m) = lim_{n→∞} -(1/n) Σ_{W ∈ L} p(w_1 ... w_n) log m(w_1 ... w_n)
- For a stationary, ergodic language this simplifies to:
  H(p, m) = lim_{n→∞} -(1/n) log m(w_1 ... w_n)
- Note that H(p) ≤ H(p, m).
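
A small sketch tying the three quantities together; the distributions below are invented, and the cross-entropy here is the simple per-word average of -log2 m(w) over a toy test sequence rather than the limit in the formula above.

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def perplexity(h_bits):
    """Perplexity = 2^H: the 'effective number of choices'."""
    return 2 ** h_bits

# A uniform 8-way distribution has 3 bits of entropy, i.e. perplexity 8.
uniform8 = {w: 1 / 8 for w in "abcdefgh"}
print(entropy(uniform8), perplexity(entropy(uniform8)))

# Per-word cross-entropy of a model m on test data (finite-sample estimate),
# here with a hypothetical unigram model; lower = better match to the data.
model_m = {"the": 0.1, "cat": 0.02, "sat": 0.01}
test = ["the", "cat", "sat"]
h = -sum(math.log2(model_m[w]) for w in test) / len(test)
print(h, perplexity(h))
```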

32 Context-sensitive spelling correction
- Getting back to the task of spelling correction, we can look at bigrams of words to correct a particular misspelling.
- Question: given the previous word, what is the probability of the current word?
- E.g., given these, suppose we have a 5% chance of seeing reports and a 0.001% chance of seeing report (as in these report cards). Then we change report to reports.
- Generally, we choose the correction that maximizes the probability of the whole sentence.
- As mentioned, we may hardly ever see these reports, so we won't know the probability of that bigram. Aside from smoothing techniques, another possible solution is to use bigrams of parts of speech: e.g., what is the probability of a noun given that the previous word was an adjective?
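
A toy sketch of the bigram-based choice between report and reports after these; the probabilities are the illustrative 5% and 0.001% figures from the slide, not estimates from a real corpus.

```python
# Illustrative bigram probabilities taken from the slide's example.
bigram_p = {
    ("these", "reports"): 0.05,
    ("these", "report"): 0.00001,
}

def best_correction(prev_word, candidates):
    """Pick the candidate with the highest P(candidate | prev_word)."""
    return max(candidates, key=lambda w: bigram_p.get((prev_word, w), 0.0))

print(best_correction("these", ["report", "reports"]))   # -> "reports"
```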

33 Exercises
- Who will present Markov models and their basic algorithms?
- Exercises 6.2, 6.4, 6.8, 6.9