
Slide 1: Chapter 6. Statistical Inference: n-gram Models over Sparse Data
2005. 1. 13, 이동훈 (huni77@pusan.ac.kr)
Foundations of Statistical Natural Language Processing

Slide 2: Table of Contents
- Introduction
- Bins: Forming Equivalence Classes
  - Reliability vs. Discrimination
  - N-gram models
- Statistical Estimators
  - Maximum Likelihood Estimation (MLE)
  - Laplace's law, Lidstone's law and the Jeffreys-Perks law
  - Held out estimation
  - Cross-validation (deleted estimation)
  - Good-Turing estimation
- Combining Estimators
  - Simple linear interpolation
  - Katz's backing-off
  - General linear interpolation
- Conclusions

Slide 3: Introduction
- The object of statistical NLP is to perform statistical inference for the field of natural language.
- Statistical inference in general consists of:
  - taking some data generated by an unknown probability distribution, and
  - making some inferences about this distribution.
- The problem divides into three areas:
  - dividing the training data into equivalence classes,
  - finding a good statistical estimator for each equivalence class, and
  - combining multiple estimators.

Slide 4: Bins: Forming Equivalence Classes [1/2]
- Reliability vs. discrimination:
  - "large green ___________"  (tree? mountain? frog? car?)
  - "swallowed the large green ________"  (pill? broccoli?)
- Larger n: more information about the context of the specific instance (greater discrimination).
- Smaller n: more instances in the training data, better statistical estimates (more reliability).

Slide 5: Bins: Forming Equivalence Classes [2/2]
- N-gram models
  - An "n-gram" is a sequence of n words.
  - Task: predicting the next word.
  - Markov assumption: only the prior local context - the last few words - affects the next word.
- Selecting an n, for a vocabulary size of 20,000 words (the bin counts are worked through in the sketch below):

  n             Number of bins
  2 (bigrams)   400,000,000
  3 (trigrams)  8,000,000,000,000
  4 (4-grams)   1.6 x 10^17
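
The bin counts in the table follow from the vocabulary size: with a vocabulary of V words, an n-gram model has V^n possible n-grams (bins) whose probabilities must be estimated. A minimal Python sketch, illustrative only and not part of the original slides:

```python
# Number of parameters (bins) of an n-gram model over a V-word vocabulary.
V = 20_000

for n, label in [(2, "bigrams"), (3, "trigrams"), (4, "4-grams")]:
    print(f"{n} ({label}): {V**n:,} bins")
# 2 (bigrams): 400,000,000 bins
# 3 (trigrams): 8,000,000,000,000 bins
# 4 (4-grams): 160,000,000,000,000,000 bins
```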

Slide 6: Statistical Estimators [1/3]
- Given the observed training data, how do you develop a model (probability distribution) to predict future events?
- The target feature is the probability estimate of an n-gram.
- Goal: estimating the unknown probability distribution of n-grams.

Slide 7: Statistical Estimators [2/3]
- Notation for the statistical estimation chapter:
  - N: number of training instances
  - B: number of bins the training instances are divided into
  - w_1n: an n-gram w_1...w_n in the training text
  - C(w_1...w_n): frequency of the n-gram w_1...w_n in the training text
  - r: frequency of an n-gram
  - f(.): frequency estimate of a model
  - N_r: number of bins that have r training instances in them
  - T_r: total count of n-grams of frequency r in further data
  - h: 'history' of preceding words

Slide 8: Statistical Estimators [3/3]
- Example: instances in the training corpus of "inferior to ________".

Slide 9: Maximum Likelihood Estimation (MLE) [1/2]
- Definition: use the relative frequency as the probability estimate.
- Example:
  - In the corpus, 10 training instances of "comes across" were found.
  - 8 times it was followed by "as": P(as) = 0.8.
  - Once each by "more" and "a": P(more) = 0.1, P(a) = 0.1.
  - Any word not among those three: P(x) = 0.0.
- Formula (see the sketch below):
    P_MLE(w_1...w_n) = C(w_1...w_n) / N
    P_MLE(w_n | w_1...w_{n-1}) = C(w_1...w_n) / C(w_1...w_{n-1})
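
A minimal sketch of the MLE computation for the "comes across" example above; the counts come from the slide, the helper names are my own:

```python
from collections import Counter

# Observed continuations of "comes across" in the toy example from the slide.
continuations = ["as"] * 8 + ["more", "a"]       # 10 instances in total
counts = Counter(continuations)
N = sum(counts.values())

def p_mle(word: str) -> float:
    """P_MLE(word | 'comes across') = C('comes across' word) / C('comes across')."""
    return counts[word] / N

print(p_mle("as"), p_mle("more"), p_mle("a"))    # 0.8 0.1 0.1
print(p_mle("the"))                              # 0.0 -- unseen continuations get no probability mass
```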

Slide 10: Maximum Likelihood Estimation (MLE) [2/2]

Slide 11: Laplace's law, Lidstone's law and the Jeffreys-Perks law [1/2]
- Laplace's law
  - Adds a little bit of probability space to unseen events:
    P_Lap(w_1...w_n) = (C(w_1...w_n) + 1) / (N + B)

Slide 12: Laplace's law, Lidstone's law and the Jeffreys-Perks law [2/2]
- Lidstone's law
  - Adds some positive value lambda instead of 1:
    P_Lid(w_1...w_n) = (C(w_1...w_n) + lambda) / (N + B * lambda)
- Jeffreys-Perks law
  - The case lambda = 0.5.
  - Also called ELE (Expected Likelihood Estimation).
- (A small worked sketch of these estimators follows below.)
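
A minimal sketch of Lidstone smoothing, covering Laplace's law (lambda = 1) and the Jeffreys-Perks/ELE case (lambda = 0.5) as special settings; the counts and the vocabulary size are made up for illustration:

```python
from collections import Counter

def p_lidstone(counts: Counter, N: int, B: int, word: str, lam: float) -> float:
    """P_Lid(w) = (C(w) + lam) / (N + B * lam).
    lam = 1.0 gives Laplace's law; lam = 0.5 gives the Jeffreys-Perks law (ELE)."""
    return (counts[word] + lam) / (N + B * lam)

# Toy data: 10 observed continuations over a hypothetical 1,000-word vocabulary.
counts = Counter({"as": 8, "more": 1, "a": 1})
N, B = 10, 1_000

for lam in (1.0, 0.5, 0.1):
    seen = p_lidstone(counts, N, B, "as", lam)
    unseen = p_lidstone(counts, N, B, "xyzzy", lam)
    print(f"lambda={lam}: P(as)={seen:.4f}  P(unseen)={unseen:.6f}")
```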

Slide 13: Held out estimation
- Validate by holding out part of the training data.
  - C_1(w_1...w_n) = frequency of w_1...w_n in the training data
  - C_2(w_1...w_n) = frequency of w_1...w_n in the held out data
  - T = number of tokens in the held out data
  - T_r = total held-out count of the n-grams that occur r times in the training data
- Estimate (see the sketch below): P_ho(w_1...w_n) = T_r / (N_r * T), where r = C_1(w_1...w_n).
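
A minimal sketch of held out estimation under the formula above; the toy bigram counts are invented for illustration, and unseen n-grams (r = 0) are not handled:

```python
from collections import Counter

def held_out_probs(train_counts: Counter, heldout_counts: Counter) -> dict:
    """For each training frequency r, return the held out estimate
    P_ho = T_r / (N_r * T) assigned to a single n-gram seen r times in training."""
    T = sum(heldout_counts.values())                 # tokens in the held out data
    N_r, T_r = Counter(), Counter()
    for ngram, r in train_counts.items():
        N_r[r] += 1                                  # bins with training count r
        T_r[r] += heldout_counts[ngram]              # their total held out count
    return {r: T_r[r] / (N_r[r] * T) for r in N_r}

# Toy bigram counts (illustrative only).
train = Counter({("comes", "across"): 3, ("across", "as"): 1, ("as", "more"): 1})
heldout = Counter({("comes", "across"): 2, ("across", "as"): 2, ("as", "a"): 1})
print(held_out_probs(train, heldout))                # {3: 0.4, 1: 0.2}
```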

Slide 14: Cross-validation (deleted estimation) [1/2]
- Use the data for both training and validation.
  - Divide the training data into two parts, A and B.
  - Train on A, validate on B (model 1).
  - Train on B, validate on A (model 2).
  - Combine the two models into the final model.

Slide 15: Cross-validation (deleted estimation) [2/2]
- Cross-validation: the training data is used both as
  - initial training data, and
  - held out data.
- On large training corpora, deleted estimation works better than held-out estimation.
- (A sketch of the pooled deleted estimate follows below.)
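
A rough sketch of deleted estimation, pooling the two held-out directions (A as training with B held out, and the reverse); it assumes the two halves contain the same number of tokens, and the toy counts are invented:

```python
from collections import Counter

def heldout_stats(train_counts: Counter, heldout_counts: Counter):
    """Return N_r (bins with training count r) and T_r (their total held-out count)."""
    N_r, T_r = Counter(), Counter()
    for ngram, r in train_counts.items():
        N_r[r] += 1
        T_r[r] += heldout_counts[ngram]
    return N_r, T_r

def deleted_estimate(counts_a: Counter, counts_b: Counter, tokens_per_half: int) -> dict:
    """Pool both directions: for each training frequency r, the probability of a
    single n-gram in that class is (T_r^AB + T_r^BA) / ((N_r^A + N_r^B) * tokens_per_half)."""
    N_ra, T_rab = heldout_stats(counts_a, counts_b)
    N_rb, T_rba = heldout_stats(counts_b, counts_a)
    return {r: (T_rab[r] + T_rba[r]) / ((N_ra[r] + N_rb[r]) * tokens_per_half)
            for r in set(N_ra) | set(N_rb)}

# Toy halves of equal size (5 bigram tokens each), purely for illustration.
part_a = Counter({("comes", "across"): 3, ("across", "as"): 1, ("as", "more"): 1})
part_b = Counter({("comes", "across"): 2, ("across", "as"): 2, ("as", "a"): 1})
print(deleted_estimate(part_a, part_b, tokens_per_half=5))
```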

Slide 16: Good-Turing estimation
- Suitable for a large number of observations from a large vocabulary; works well for n-grams.
- Adjusted frequency (see the sketch below):
    r* = (r + 1) * E[N_{r+1}] / E[N_r]
  (r* is an adjusted frequency; E denotes the expectation of a random variable)
- Probability of an n-gram seen r times: P_GT = r* / N.
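
A minimal sketch of Good-Turing adjusted counts, using the observed counts-of-counts N_r directly in place of the expectations E[N_r] (real implementations smooth N_r first); the toy counts are invented:

```python
from collections import Counter

def good_turing_adjusted_counts(counts: Counter):
    """Return r* = (r + 1) * N_{r+1} / N_r for each observed r, plus the total
    probability mass N_1 / N reserved for unseen n-grams."""
    N_r = Counter(counts.values())          # N_r: how many n-grams occur exactly r times
    N = sum(counts.values())                # total number of training instances
    adjusted = {}
    for r in sorted(N_r):
        if N_r.get(r + 1, 0) == 0:          # no higher count class observed: leave r unchanged
            adjusted[r] = float(r)
        else:
            adjusted[r] = (r + 1) * N_r[r + 1] / N_r[r]
    p_unseen_total = N_r.get(1, 0) / N      # mass given to all unseen n-grams together
    return adjusted, p_unseen_total

# Toy counts (illustrative only).
counts = Counter({"a": 5, "b": 3, "c": 1, "d": 1, "e": 1, "f": 2, "g": 2})
print(good_turing_adjusted_counts(counts))
```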

Slide 17: Combining Estimators [1/3]
- Basic idea
  - Consider how to combine multiple probability estimates from various different models.
  - How can you develop a model that uses different-length n-grams as appropriate?
- Simple linear interpolation (see the sketch below)
  - A weighted combination of the unigram, bigram and trigram estimates:
    P_li(w_n | w_{n-2}, w_{n-1}) = lambda_1 * P_1(w_n) + lambda_2 * P_2(w_n | w_{n-1}) + lambda_3 * P_3(w_n | w_{n-1}, w_{n-2})
    with 0 <= lambda_i <= 1 and sum_i lambda_i = 1.
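
A minimal sketch of simple linear interpolation; the weights are illustrative placeholders (in practice they are tuned on held-out data, e.g. with EM):

```python
def interpolate(p_uni: float, p_bi: float, p_tri: float,
                lambdas=(0.2, 0.3, 0.5)) -> float:
    """Fixed weighted mix of unigram, bigram and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9            # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Example: the trigram is unseen, but the bigram and unigram still contribute mass.
print(interpolate(p_uni=0.001, p_bi=0.05, p_tri=0.0))  # 0.0152
```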

Slide 18: Combining Estimators [2/3]
- Katz's backing-off (see the sketch below)
  - Used to smooth or to combine information sources.
  - If the n-gram appeared more than k times: use the (discounted) n-gram estimate.
  - If it appeared k times or fewer: back off to the estimate from a shorter n-gram.
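
A simplified sketch of the back-off idea; a real Katz back-off computes the discount and the back-off weight alpha per history so that probabilities sum to 1, whereas the fixed values here are placeholders for illustration:

```python
def backoff_prob(ngram_count: int, history_count: int, shorter_prob: float,
                 k: int = 0, discount: float = 0.5, alpha: float = 0.4) -> float:
    """If the n-gram was seen more than k times, use a discounted relative
    frequency; otherwise back off to the lower-order estimate scaled by alpha."""
    if ngram_count > k and history_count > 0:
        return (ngram_count - discount) / history_count
    return alpha * shorter_prob

print(backoff_prob(ngram_count=3, history_count=10, shorter_prob=0.05))  # 0.25
print(backoff_prob(ngram_count=0, history_count=10, shorter_prob=0.05))  # 0.02
```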

Slide 19: Combining Estimators [3/3]
- General linear interpolation (see the sketch below)
  - The weights are a function of the history h:
    P_li(w | h) = sum_i lambda_i(h) * P_i(w | h),  with lambda_i(h) >= 0 and sum_i lambda_i(h) = 1.
  - A very general (and commonly used) way to combine models.
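
A minimal sketch of general linear interpolation, where each component model has a weight that depends on the history; the particular weight scheme (trust the higher-order model more when its history is frequent) and the toy models are my own illustration:

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str, Tuple[str, ...]], float]      # P_i(w | h)
WeightFn = Callable[[Tuple[str, ...]], float]        # lambda_i(h)

def general_interpolate(models: Sequence[Model], weights: Sequence[WeightFn],
                        w: str, h: Tuple[str, ...]) -> float:
    raw = [wf(h) for wf in weights]
    total = sum(raw)
    lambdas = [x / total for x in raw]               # normalize so weights sum to 1
    return sum(lam * m(w, h) for lam, m in zip(lambdas, models))

# Toy usage with two hypothetical component models.
bigram = lambda w, h: 0.05
unigram = lambda w, h: 0.001
history_count = {("comes",): 12}                     # made-up history frequencies
w_bigram = lambda h: history_count.get(h, 0)         # more weight if the history is frequent
w_unigram = lambda h: 1.0
print(general_interpolate([bigram, unigram], [w_bigram, w_unigram], "across", ("comes",)))
```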

Slide 20: Conclusions
- To address the problems of sparse data, use Good-Turing estimation, linear interpolation or back-off.
- Good-Turing smoothing performs well (Church & Gale, 1991).
- Active research areas:
  - combining probability models
  - dealing with sparse data

