Chapter 6: Statistical Inference: n-gram Models over Sparse Data


Chapter 6: Statistical Inference: n-gram Models over Sparse Data TDM Seminar Jonathan Henke http://www.sims.berkeley.edu/~jhenke/Tdm/TDM-Ch6.ppt

Basic Idea: Examine short sequences of words. How likely is each sequence? "Markov Assumption": a word is affected only by its "prior local context" (the last few words).
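As a sketch (the notation is mine, not spelled out on this slide), the Markov assumption truncates the chain rule so each word is conditioned only on the previous n-1 words:

```latex
P(w_1 \dots w_m) = \prod_{i=1}^{m} P(w_i \mid w_1 \dots w_{i-1})
                 \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \dots w_{i-1})
```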

Possible Applications: OCR / voice recognition (resolving ambiguity), spelling correction, machine translation, confirming the author of a newly discovered work, the "Shannon game".

"Shannon Game": predict the next word, given the (n-1) previous words (Claude E. Shannon, "Prediction and Entropy of Printed English", Bell System Technical Journal 30:50-64, 1951). Determine the probability of different sequences by examining a training corpus.

Forming Equivalence Classes (Bins): an "n-gram" is a sequence of n words (bigram, trigram, four-gram).
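A minimal sketch of forming such bins from a token sequence (the function name ngrams is my own, not from the slides):

```python
# Extract n-grams (tuples of n consecutive tokens) from a token list.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "swallowed the large green pill".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```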

Reliability vs. Discrimination “large green ___________” tree? mountain? frog? car? “swallowed the large green ________” pill? broccoli?

Reliability vs. Discrimination: a larger n gives more information about the context of the specific instance (greater discrimination); a smaller n gives more instances in the training data, hence better statistical estimates (more reliability).

Selecting an n: Vocabulary V = 20,000 words. Number of bins: bigrams (n = 2): 400,000,000; trigrams (n = 3): 8,000,000,000,000; 4-grams (n = 4): 1.6 × 10^17.
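The bin counts are just V to the power n; a quick check of the slide's numbers (my own arithmetic):

```python
# Number of parameter bins grows as V**n for a 20,000-word vocabulary.
V = 20_000
for n in (2, 3, 4):
    print(n, f"{V ** n:.1e}")   # 4.0e+08, 8.0e+12, 1.6e+17
```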

Statistical Estimators Given the observed training data … How do you develop a model (probability distribution) to predict future events?

Statistical Estimators Example: Corpus: five Jane Austen novels (N = 617,091 words, V = 14,585 unique words). Task: predict the next word of the trigram "inferior to ________" from test data (Persuasion): "[In person, she was] inferior to both [sisters.]"

Instances in the Training Corpus: “inferior to ________”

Maximum Likelihood Estimate: P_MLE(w1…wn) = C(w1…wn) / N; equivalently, for prediction, P_MLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1).
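A minimal sketch of the MLE computation over raw counts (the toy corpus and helper name are mine, not the slide's Austen counts):

```python
# Sketch: MLE conditional probability P(w_n | w_1..w_{n-1}) = C(w_1..w_n) / C(w_1..w_{n-1}).
from collections import Counter

def mle_prob(ngram_counts, prefix_counts, ngram):
    prefix = ngram[:-1]
    if prefix_counts[prefix] == 0:
        return 0.0                      # unseen history: MLE assigns nothing
    return ngram_counts[ngram] / prefix_counts[prefix]

tokens = "in person she was inferior to both sisters".split()
trigrams = Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))
bigrams = Counter(tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1))
print(mle_prob(trigrams, bigrams, ("inferior", "to", "both")))  # 1.0 on this tiny corpus
```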

Actual Probability Distribution:

"Smoothing": develop a model which decreases the probability of seen events and allows for the occurrence of previously unseen n-grams (a.k.a. "discounting methods"). "Validation": smoothing methods which utilize a second batch of test data.

LaPlace's Law (adding one): P_Lap(w1…wn) = (C(w1…wn) + 1) / (N + B), where N = total n-grams in the training data and B = number of "bins" (possible n-grams).

Lidstone's Law: P_Lid(w1…wn) = (C(w1…wn) + λ) / (N + λB), where P = probability of the specific n-gram, C = count of that n-gram in the training data, N = total n-grams in the training data, B = number of "bins" (possible n-grams), and λ = a small positive number. M.L.E.: λ = 0; LaPlace's Law: λ = 1; Jeffreys-Perks Law: λ = ½.
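A small sketch applying the formula (the example count of 8 and the trigram bin size are illustrative assumptions, not slide values):

```python
# Lidstone's Law: P_Lid = (C + lam) / (N + lam * B).
# lam = 0 gives the MLE, lam = 1 LaPlace's Law, lam = 0.5 the Jeffreys-Perks Law.
def lidstone_prob(count, N, B, lam):
    return (count + lam) / (N + lam * B)

N = 617_091            # corpus size from the Austen example
B = 14_585 ** 3        # possible trigram bins, V**3 (illustrative)
for lam in (0.0, 1.0, 0.5):
    print(lam, lidstone_prob(8, N, B, lam))
```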

Jeffreys-Perks Law: λ = ½ (also called Expected Likelihood Estimation, ELE).

Objections to Lidstone's Law: Need an a priori way to determine λ. Predicts all unseen events to be equally likely. Gives probability estimates linear in the M.L.E. frequency.

Smoothing: Lidstone's Law (incl. LaPlace's Law and Jeffreys-Perks Law) modifies the observed counts. Other methods modify the probabilities.

Held-Out Estimator: How much of the probability distribution should be "held out" to allow for previously unseen events? Validate by holding out part of the training data: how often do events unseen in the training data occur in the validation data? (e.g., to choose λ for the Lidstone model)

Held-Out Estimator: P_ho(w1…wn) = T_r / (N_r · N), where r = C(w1…wn), N_r = the number of n-gram types occurring r times in the training data, and T_r = the total number of occurrences of those types in the held-out data.
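A sketch of the computation (the counts are passed as plain dicts, and the exact normalization convention for N is my assumption):

```python
# Held-out estimate for an n-gram that occurred r times in training:
#   P_ho = T_r / (N_r * N)
def held_out_prob(ngram, train_counts, heldout_counts, N_heldout):
    r = train_counts.get(ngram, 0)
    types_with_r = [g for g, c in train_counts.items() if c == r]   # the N_r types
    if not types_with_r:
        return 0.0            # r = 0 would need the count of unseen types instead
    T_r = sum(heldout_counts.get(g, 0) for g in types_with_r)
    return T_r / (len(types_with_r) * N_heldout)
```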

Testing Models: Hold out ~5-10% for testing and ~10% for validation (smoothing). For testing, it is useful to test on multiple sets of data and report the variance of the results: are the results (good or bad) just the result of chance?

Cross-Validation (a.k.a. deleted estimation): use the data for both training and validation. Divide the training data into two parts, A and B: train on A and validate on B (Model 1); train on B and validate on A (Model 2); then combine the two models into the final model.

Cross-Validation: Two held-out-style estimates, T_r^01 / (N_r^0 · N) and T_r^10 / (N_r^1 · N), where N_r^a = number of n-grams occurring r times in the a-th part of the training set and T_r^ab = total occurrences of those n-grams in the b-th part. Combined estimate (arithmetic mean): P_del(w1…wn) = (T_r^01 + T_r^10) / (N · (N_r^0 + N_r^1)), where r = C(w1…wn).

Good-Turing Estimator: r* = (r + 1) · E(N_{r+1}) / E(N_r) is the "adjusted frequency" (and P_GT = r*/N), where N_r = number of n-gram types which occur r times and E(N_r) = the expected value of N_r; note that E(N_{r+1}) < E(N_r).
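A minimal sketch of computing the adjusted counts (using raw N_r in place of the expectation E(N_r); real implementations smooth the N_r values first):

```python
from collections import Counter

# Good-Turing adjusted count: r* = (r + 1) * N_{r+1} / N_r.
def good_turing_adjusted_counts(ngram_counts):
    N_r = Counter(ngram_counts.values())        # N_r: number of types seen exactly r times
    adjusted = {}
    for r in sorted(N_r):
        if N_r.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[r] = float(r)              # no N_{r+1} data: leave the count unadjusted
    return adjusted
```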

Discounting Methods: First, determine the held-out probability. Absolute discounting: decrease the probability of each observed n-gram by subtracting a small constant. Linear discounting: decrease the probability of each observed n-gram by multiplying it by the same proportion.
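A rough sketch of the two schemes (delta and alpha are assumed tuning constants, not values given on the slide):

```python
# Absolute discounting: subtract a small constant from every observed count.
def absolute_discount(count, prefix_count, delta=0.5):
    return max(count - delta, 0.0) / prefix_count

# Linear discounting: scale every observed probability down by the same proportion.
def linear_discount(count, prefix_count, alpha=0.1):
    return (1 - alpha) * count / prefix_count
```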

Combining Estimators (Sometimes a trigram model is best, sometimes a bigram model is best, and sometimes a unigram model is best.) How can you develop a model to utilize different length n-grams as appropriate?

Simple Linear Interpolation (a.k.a. finite mixture models; a.k.a. deleted interpolation): a weighted average of unigram, bigram, and trigram probabilities, P_li(wn | wn-2, wn-1) = λ1·P1(wn) + λ2·P2(wn | wn-1) + λ3·P3(wn | wn-2, wn-1), with 0 ≤ λi ≤ 1 and Σ λi = 1.
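A sketch with fixed weights (in practice the lambdas are tuned on held-out data, e.g. with EM; the values below are placeholders):

```python
# P_li(w3 | w1, w2) = l1*P(w3) + l2*P(w3 | w2) + l3*P(w3 | w1, w2), with l1 + l2 + l3 = 1.
def interpolated_prob(p_uni, p_bi, p_tri, l1=0.1, l2=0.3, l3=0.6):
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```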

Katz's Backing-Off: Use the n-gram probability when there is enough training data (when the adjusted count > k; k usually 0 or 1). If not, "back off" to the (n-1)-gram probability. (Repeat as needed.)
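A sketch of the back-off control flow only (full Katz back-off also discounts the higher-order counts and applies a normalizing weight so the probabilities sum to 1):

```python
# Back off from trigram to bigram to unigram estimates when counts are too small.
def backoff_prob(trigram, tri_counts, bi_counts, uni_counts, N, k=0):
    w1, w2, w3 = trigram
    if tri_counts.get(trigram, 0) > k:
        return tri_counts[trigram] / bi_counts[(w1, w2)]
    if bi_counts.get((w2, w3), 0) > k:
        return bi_counts[(w2, w3)] / uni_counts[(w2,)]
    return uni_counts.get((w3,), 0) / N          # N = total unigram tokens
```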

Problems with Backing-Off: If the bigram w1 w2 is common but the trigram w1 w2 w3 is unseen, this may be a meaningful gap rather than a gap due to chance and scarce data (i.e., a "grammatical null"), so we may not want to back off to the lower-order probability.