1 Smoothing LING 570 Fei Xia Week 5: 10/24/07

2 Smoothing
What? Why?
– To deal with events observed zero times.
– "event": a particular n-gram
How?
– Shave a little bit of probability mass from the higher counts and pile it on the zero counts instead.
– For the time being, we assume that there are no unknown words; that is, V is a closed vocabulary.

3 Smoothing methods
Laplace smoothing (a.k.a. add-one smoothing)
Good-Turing smoothing
Linear interpolation (a.k.a. Jelinek-Mercer smoothing)
Katz backoff
Class-based smoothing
Absolute discounting
Kneser-Ney smoothing

4 Laplace smoothing Add 1 to all frequency counts. Let V be the vocabulary size.
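In its standard form, with N the number of training tokens and |V| the vocabulary size, the add-one unigram estimate is:

```latex
P_{\mathrm{Lap}}(w) = \frac{c(w) + 1}{N + |V|}
```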

5 Laplace smoothing: n-gram
Bigram and general n-gram cases (see below).
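In their standard form, with |V| the vocabulary size, the add-one estimates are:

```latex
P_{\mathrm{Lap}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + |V|},
\qquad
P_{\mathrm{Lap}}(w_i \mid w_{i-n+1}^{\,i-1}) = \frac{c(w_{i-n+1}^{\,i}) + 1}{c(w_{i-n+1}^{\,i-1}) + |V|}
```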

6 Problem with Laplace smoothing
Example: |V| = 100K, the bigram "w1 w2" occurs 10 times, and the trigram "w1 w2 w3" occurs 9 times.
– P_MLE(w3 | w1, w2) = 9/10 = 0.9
– P_Lap(w3 | w1, w2) = (9+1)/(10+100K) ≈ 0.0001
Problem: Laplace smoothing gives too much probability mass to unseen n-grams. Add-one smoothing works horribly in practice.

7 Add-λ smoothing: add a fractional count λ (0 < λ ≤ 1) instead of 1. It works better than add-one, but still works horribly. Need to choose λ.
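One common formulation of this variant, shown here for bigrams (λ is the added fractional count):

```latex
P_{\mathrm{add}\text{-}\lambda}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + \lambda}{c(w_{i-1}) + \lambda |V|},
\qquad 0 < \lambda \le 1
```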

8 Good-Turing smoothing

9 Basic ideas
Re-estimate the frequency of zero-count n-grams using the counts of n-grams that occur once.
Let N_c be the number of n-grams that occurred c times. For any n-gram that occurs c times, the Good-Turing estimate pretends that it occurs c* times:
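The adjusted count, in its standard form:

```latex
c^{*} = (c + 1)\,\frac{N_{c+1}}{N_{c}}
```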

10 N_0 is the number of unseen n-grams. For unseen n-grams, we assume that each of them occurs c_0* times. The total prob mass for unseen n-grams, and therefore the prob of EACH unseen n-gram, are given below.
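With N the total number of n-gram tokens in training, the standard Good-Turing quantities are:

```latex
c_{0}^{*} = \frac{N_{1}}{N_{0}},
\qquad
P(\text{all unseen}) = \frac{N_{1}}{N},
\qquad
P(\text{each unseen n-gram}) = \frac{N_{1}}{N \cdot N_{0}}
```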

11 An example

12 N-gram counts to conditional probability
c* comes from the GT estimate.
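One common way to do the conversion is to normalize the adjusted counts over the history; some formulations divide by the raw history count c(w1, w2) instead:

```latex
P(w_3 \mid w_1, w_2) = \frac{c^{*}(w_1, w_2, w_3)}{\sum_{w} c^{*}(w_1, w_2, w)}
```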

13 Issues in Good-Turing estimation
If N_{c+1} is zero, how do we estimate c*?
– Smooth N_c by some function, e.g., log(N_c) = a + b log(c)
– Large counts are assumed to be reliable → gt_max[ ], e.g., c* = c for c > gt_max
– May also want to treat n-grams with low counts (especially 1) as zeroes → gt_min[ ]
Need to renormalize all the estimates to ensure that the probs add to one.
Good-Turing is often not used by itself; it is used in combination with the backoff and interpolation algorithms.

14 One way to implement Good-Turing
Let N be the number of trigram tokens in the training corpus, and min3 and max3 be the min and max cutoffs for trigrams.
From the trigram counts, calculate N_0, N_1, …, N_{max3+1}, and N; calculate a function f(c) for c = 0, 1, …, max3.
Define c* = c if c > max3, and c* = f(c) otherwise.
Do the same for bigram counts and unigram counts.
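A minimal Python sketch of the counting-and-adjustment step for one n-gram order (the function name, the cutoff handling, and the toy counts are illustrative assumptions, not from the slides):

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, max_cutoff):
    """Map each raw count c in 1..max_cutoff to its Good-Turing adjusted
    count c* = (c + 1) * N_{c+1} / N_c; counts above max_cutoff keep c* = c.
    (The c = 0 case additionally needs N_0, the number of unseen n-grams,
    which depends on the vocabulary size.)
    """
    # N_c = number of distinct n-grams observed exactly c times
    freq_of_freq = Counter(ngram_counts.values())

    c_star = {}
    for c in range(1, max_cutoff + 1):
        n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
        if n_c > 0 and n_c1 > 0:
            c_star[c] = (c + 1) * n_c1 / n_c
        else:
            # N_c or N_{c+1} is zero: in practice one smooths the N_c values
            # first (e.g. fit log N_c = a + b log c); here we just keep c.
            c_star[c] = float(c)
    return c_star

# Toy trigram counts, just to show the call
counts = Counter({("a", "b", "c"): 3, ("a", "b", "d"): 1,
                  ("b", "c", "d"): 1, ("c", "d", "e"): 2})
print(good_turing_adjusted_counts(counts, max_cutoff=3))
```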

15 Good-Turing implementation (cont)
Estimate the trigram conditional prob from the adjusted counts c* (slide 12).
For an unseen trigram, the joint prob is N_1 / (N · N_0) (slide 10).
Do the same for the unigram and bigram models.

16 Backoff and interpolation

17 N-gram hierarchy
P_3(w3 | w1, w2), P_2(w3 | w2), P_1(w3)
Back off to a lower n-gram → backoff estimation
Mix the probability estimates from all the n-grams → interpolation

18 Katz Backoff
P_katz(w_i | w_{i-2}, w_{i-1}) = P_3(w_i | w_{i-2}, w_{i-1})                      if c(w_{i-2}, w_{i-1}, w_i) > 0
                               = α(w_{i-2}, w_{i-1}) · P_katz(w_i | w_{i-1})      otherwise
P_katz(w_i | w_{i-1}) = P_2(w_i | w_{i-1})        if c(w_{i-1}, w_i) > 0
                      = α(w_{i-1}) · P_1(w_i)     otherwise

19 Katz backoff (cont)
The α weights are used to normalize probability mass so that it still sums to 1, and to "smooth" the lower-order probabilities that are used.
See J&M for details of how to calculate α (and M&S for additional discussion).

20 Jelinek-Mercer smoothing (interpolation)
Bigram and trigram cases (see below).
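In standard form, with interpolation weights λ that sum to one:

```latex
P_{\mathrm{JM}}(w_i \mid w_{i-1}) = \lambda\, P_{\mathrm{ML}}(w_i \mid w_{i-1}) + (1 - \lambda)\, P_{\mathrm{ML}}(w_i)
```

```latex
P_{\mathrm{JM}}(w_i \mid w_{i-2}, w_{i-1}) =
\lambda_3\, P_{\mathrm{ML}}(w_i \mid w_{i-2}, w_{i-1})
+ \lambda_2\, P_{\mathrm{ML}}(w_i \mid w_{i-1})
+ \lambda_1\, P_{\mathrm{ML}}(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```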

21 Interpolation (cont)

22 How to set λ_i?
Generally, here's what's done:
– Split the data into training, held-out, and test sets
– Train the model on the training set
– Use the held-out set to test different λ values and pick the ones that work best (i.e., maximize the likelihood of the held-out data)
– Test the model on the test data
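A minimal sketch of the held-out step for the bigram case, using a simple grid search (the function names, the grid, and the dict-based MLE tables are illustrative assumptions; in practice the λs are often set with EM instead):

```python
import math

def interp_logprob(heldout_bigrams, lam, p_ml_bigram, p_ml_unigram):
    """Log-likelihood of held-out bigrams under
    P(w2 | w1) = lam * P_ML(w2 | w1) + (1 - lam) * P_ML(w2)."""
    total = 0.0
    for w1, w2 in heldout_bigrams:
        p = lam * p_ml_bigram.get((w1, w2), 0.0) + (1 - lam) * p_ml_unigram.get(w2, 0.0)
        total += math.log(p) if p > 0 else float("-inf")
    return total

def pick_lambda(heldout_bigrams, p_ml_bigram, p_ml_unigram):
    """Grid-search lambda and keep the value with the highest held-out log-likelihood."""
    grid = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda lam: interp_logprob(
        heldout_bigrams, lam, p_ml_bigram, p_ml_unigram))
```

The same idea extends to trigram interpolation by searching over (λ_1, λ_2, λ_3) jointly or, more commonly, by running EM over the λs.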

23 So far
Laplace smoothing
Good-Turing smoothing: gt_min[ ], gt_max[ ]
Linear interpolation: λ_i(w_{i-2}, w_{i-1})
Katz backoff: α(w_{i-2}, w_{i-1})

24 Additional slides

25 Another example for Good-Turing
10 tuna, 3 unagi, 2 salmon, 1 shrimp, 1 octopus, 1 yellowtail
How likely is octopus? Since c(octopus) = 1, the GT estimate is 1*. To compute 1*, we need N_1 = 3 and N_2 = 1 (worked out below).
What happens when N_c = 0?
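Working this out with the formulas above (N = 18 tokens here):

```latex
1^{*} = (1+1)\,\frac{N_{2}}{N_{1}} = 2 \cdot \frac{1}{3} = \frac{2}{3},
\qquad
P_{\mathrm{GT}}(\text{octopus}) = \frac{1^{*}}{N} = \frac{2/3}{18} \approx 0.037
```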

26 Absolute discounting
What is the value for D? How to set α(x)?
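One standard backoff formulation for bigrams, shown as a sketch (D is the discount, 0 < D < 1, and α(w_{i-1}) redistributes the discounted mass; not necessarily the slide's exact form):

```latex
P_{\mathrm{abs}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-1}, w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1}, w_i) > 0 \\[1.5ex]
\alpha(w_{i-1})\, P_{\mathrm{abs}}(w_i) & \text{otherwise}
\end{cases}
```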

27 Intuition for Kneser-Ney smoothing
I cannot find my reading __
P(Francisco | reading) > P(glasses | reading)
– Francisco is common, so interpolation gives P(Francisco | reading) a high value
– But Francisco occurs in few contexts (only after San), whereas glasses occurs in many contexts.
– Hence weight the interpolation based on the number of contexts for the word, using discounting → words that have appeared in more contexts are more likely to appear in some new context as well.

28 Kneser-Ney smoothing (cont)
Interpolated and backoff forms (see the sketch below).
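A sketch of the interpolated form for bigrams, in its standard textbook presentation (P_cont is the continuation probability, counting distinct left contexts; not necessarily the slide's exact notation):

```latex
P_{\mathrm{KN}}(w_i \mid w_{i-1}) =
\frac{\max\big(c(w_{i-1}, w_i) - D,\, 0\big)}{c(w_{i-1})}
+ \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)
```

```latex
\lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\,\big|\{w : c(w_{i-1}, w) > 0\}\big|,
\qquad
P_{\mathrm{cont}}(w_i) = \frac{\big|\{w' : c(w', w_i) > 0\}\big|}{\big|\{(w', w) : c(w', w) > 0\}\big|}
```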

29 Class-based LM
Examples:
– The stock rises to $81.24
– He will visit Hyderabad next January
P(w_i | w_{i-1}) ≈ P(c_{i-1} | w_{i-1}) · P(c_i | c_{i-1}) · P(w_i | c_i)
Hard clustering vs. soft clustering

30 Summary
Laplace smoothing (a.k.a. add-one smoothing)
Good-Turing smoothing
Linear interpolation: λ(w_{i-2}, w_{i-1}), λ(w_{i-1})
Katz backoff: α(w_{i-2}, w_{i-1}), α(w_{i-1})
Absolute discounting: D, α(w_{i-2}, w_{i-1}), α(w_{i-1})
Kneser-Ney smoothing: D, α(w_{i-2}, w_{i-1}), α(w_{i-1})
Class-based smoothing: clusters