LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26

2 Administrivia
– Reminder: Homework 4 due tonight
– Questions? After class

3 Last Time
Background: general introduction to probability concepts
– sample space and events; statistical experiment: outcomes
– permutations and combinations
– rule of counting
– event probability / conditional probability
– uncertainty and entropy

4 Today’s Topic
– statistical methods are widely used in language processing: apply probability theory to language
– N-grams
– reading: textbook chapter 6 (N-grams)

5 N-grams: Unigrams
introduction
– given a corpus of text, the n-grams are the sequences of n consecutive words that occur in the corpus
example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N=1 (8 distinct unigrams)
– the 3
– sat 2
– on 2
– cat 1
– that 1
– sofa 1
– also 1
– mat 1

6 N-grams: Bigrams
example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N=2 (9 distinct bigrams; a bigram = 2 consecutive words)
– sat on 2
– on the 2
– the cat 1
– cat that 1
– that sat 1
– the sofa 1
– sofa also 1
– also sat 1
– the mat 1

7 N-grams: Trigrams
example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N=3 (9 distinct trigrams; a trigram = 3 consecutive words)
– most language models stop here, some stop at quadrigrams (too many n-grams, low frequencies)
– sat on the 2
– the cat that 1
– cat that sat 1
– that sat on 1
– on the sofa 1
– the sofa also 1
– sofa also sat 1
– also sat on 1
– on the mat 1

8 N-grams: Quadrigrams
example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N=4 (9 distinct quadrigrams; a quadrigram = 4 consecutive words)
– the cat that sat 1
– cat that sat on 1
– that sat on the 1
– sat on the sofa 1
– on the sofa also 1
– the sofa also sat 1
– sofa also sat on 1
– also sat on the 1
– sat on the mat 1
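The counts on the four slides above can be reproduced with a few lines of Python. This is an illustrative sketch, not part of the lecture materials; the example sentence serves as a toy corpus.

    from collections import Counter

    def ngram_counts(tokens, n):
        # count the n-grams (sequences of n consecutive words) in a list of tokens
        return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

    sentence = "the cat that sat on the sofa also sat on the mat".split()
    for n in (1, 2, 3, 4):
        counts = ngram_counts(sentence, n)
        # sort by decreasing frequency, as the slides do
        print(n, sorted(counts.items(), key=lambda kv: -kv[1]))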

9 N-grams: frequency curves
– a family of curves, one per n-gram order (unigrams, bigrams, trigrams, quadrigrams, ...)
– n-grams sorted by decreasing frequency f
– [figure: frequency curve family]

10 N-grams: the word as a unit
we count words, but what counts as a word?
– punctuation
  – useful surface cue; also <s> = beginning of a sentence, as a dummy word
  – part-of-speech taggers include punctuation as words (why?)
– capitalization: They vs. they, same token or not?
– wordform vs. lemma: cats vs. cat, same token or not?
– disfluencies: part of spoken language (er, um, main- mainly); speech recognition systems have to cope with them

11 N-grams: Word
what counts as a word?
– punctuation
  – useful surface cue; also <s> = beginning of a sentence, as a dummy word
  – part-of-speech taggers include punctuation as words (why?)
– [figure: punctuation tags from the Penn Treebank tagset]

12 Language Models and N-grams
Brown corpus (1 million words):
– word w, f(w), p(w)
– the: 69,… 
– rabbit: …
given a word sequence
– w1 w2 w3 ... wn
– the probability of seeing wi depends on what we've seen before
– recall conditional probability, introduced last time
example (section 6.2)
– Just then, the white rabbit ...
– expectation is p(rabbit|white) > p(the|white)
– but p(the) > p(rabbit)

13 Language Models and N-grams
given a word sequence
– w1 w2 w3 ... wn
chain rule: how to compute the probability of a sequence of words
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
note
– it's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences

14 Language Models and N-grams
given a word sequence
– w1 w2 w3 ... wn
bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite-length history
– 1st-order Markov model
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-3 wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
note
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)

15 Language Models and N-grams
trigram approximation
– 2nd-order Markov model
– just look at the preceding two words only
– p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-3 wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
note
– p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
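To make the two approximations concrete, here is a minimal sketch (not from the slides) of sentence probability under the bigram and trigram Markov assumptions. The lookup tables p1, p2 and p3 are assumed to hold unigram, bigram-conditional and trigram-conditional probabilities, e.g. p2[("want", "to")] = p(to|want); how such tables are estimated is the topic of the next few slides.

    from math import prod

    def bigram_prob(words, p1, p2):
        # p(w1 ... wn) ≈ p(w1) * product of p(wi | wi-1)   (1st-order Markov model)
        return p1[words[0]] * prod(p2[(prev, w)] for prev, w in zip(words, words[1:]))

    def trigram_prob(words, p1, p2, p3):
        # p(w1 ... wn) ≈ p(w1) p(w2|w1) * product of p(wi | wi-2 wi-1)   (2nd-order Markov model)
        p = p1[words[0]] * p2[(words[0], words[1])]
        for u, v, w in zip(words, words[1:], words[2:]):
            p *= p3[(u, v, w)]
        return p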

16 Language Models and N-grams
example (bigram language model from section 6.2)
– <s> = start of sentence
– p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
– [figure 6.2]

17 Language Models and N-grams
example (bigram language model from section 6.2)
– <s> = start of sentence
– p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
– [figure 6.3]
– p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  = 0.25 * 0.32 * 0.65 * 0.26 * … * 0.60
  = … (different from the textbook)

18 Language Models and N-grams
estimating from corpora: how to compute bigram probabilities
– p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w)   (where w ranges over any word)
– since Σw f(wn-1 w) = f(wn-1), the unigram frequency of wn-1,
– p(wn|wn-1) = f(wn-1 wn) / f(wn-1)   (relative frequency)
note
– the technique of estimating (true) probabilities using relative frequencies over a training corpus is known as maximum likelihood estimation (MLE)
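As a sketch of the MLE estimate (again illustrative code rather than lecture material), the relative-frequency formula can be computed directly from the n-gram counts; ngram_counts is the helper defined after the quadrigram slide above.

    def mle_bigram_probs(tokens):
        # p(wn | wn-1) = f(wn-1 wn) / f(wn-1): relative frequency (MLE)
        unigrams = ngram_counts(tokens, 1)
        bigrams = ngram_counts(tokens, 2)
        return {(w1, w2): c / unigrams[(w1,)] for (w1, w2), c in bigrams.items()}

    sentence = "the cat that sat on the sofa also sat on the mat".split()
    p2 = mle_bigram_probs(sentence)
    print(p2[("sat", "on")])   # f(sat on)/f(sat) = 2/2 = 1.0
    print(p2[("the", "cat")])  # f(the cat)/f(the) = 1/3 ≈ 0.33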

19 Language Models and N-grams
example
– p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
  = 0.25 * 0.32 * 0.65 * 0.26 * … * 0.60
  = … (a tiny number)
in practice, calculations are done in log space
– use logprob (log2 probability)
– actually the sum of (negative) log2s of the probabilities
Question: why sum negative logs of probabilities?
Answer (part 1):
– computer floating-point storage has a limited range: roughly 5.0×10^−324 to 1.7×10^308 for a double (64 bit)
– danger of underflow

20 Language Models and N-grams
Question: why sum negative logs of probabilities?
Answer (part 2):
– A = BC implies log(A) = log(B) + log(C), so products of probabilities become sums of logs
– probabilities are in the range (0, 1]
– note: we want probabilities to be non-zero, since log(0) = −∞
– logs of probabilities will be negative (up to 0)
– take the negative to make them positive
– [figure: log function, region of interest (0, 1]]
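A small sketch (not from the slides) of the same idea in code: the bigram product becomes a sum of negative log2 probabilities, so long sentences no longer underflow. p1 and p2 are the hypothetical unigram and bigram-conditional lookup tables used in the earlier sketches.

    from math import log2

    def bigram_neg_logprob(words, p1, p2):
        # -log2 p(w1 ... wn) under the bigram approximation: a sum of positive terms
        lp = -log2(p1[words[0]])
        for prev, w in zip(words, words[1:]):
            lp += -log2(p2[(prev, w)])
        return lp

    # the probability itself is 2 ** (-lp); a smaller lp means a more probable sentence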

21 Motivation for smoothing
smoothing: avoid zero probabilities
consider what happens when any individual probability component is zero
– multiplication law: 0 × X = 0
– very brittle! even in a large corpus, many n-grams will have zero frequency, particularly so for larger n
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

22 Language Models and N-grams
example
– [tables: unigram frequencies, bigram frequencies f(wn-1 wn), bigram probabilities p(wn|wn-1)]
– sparse matrix: zeros render probabilities unusable (we'll need to add fudge factors, i.e. do smoothing)

23 Smoothing and N-grams
a sparse dataset means zeros are a problem
– zero probabilities are a problem
  – p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)   (bigram model)
  – one zero and the whole product is zero
– zero frequencies are a problem
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1)   (relative frequency)
  – the bigram f(wn-1 wn) doesn't exist in the dataset
smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value
– we'll look at two ways here (just one of them today)

24 Smoothing and N-grams
Add-One smoothing
– add 1 to all frequency counts
– simple, and no more zeros (but there are better methods)
unigram
– p(w) = f(w)/N   (before Add-One; N = size of corpus)
– p(w) = (f(w)+1)/(N+V)   (with Add-One; V = number of distinct words in corpus)
– f*(w) = (f(w)+1) * N/(N+V)   (with Add-One)
– N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
bigram
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1)   (before Add-One)
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)   (after Add-One)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)   (after Add-One)
– must rescale so that total probability mass stays at 1
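A minimal sketch (illustrative, not from the slides) of the Add-One bigram estimate, reusing the ngram_counts helper; V is taken to be the number of distinct words in the toy corpus.

    def addone_bigram_prob(tokens, w_prev, w):
        # p(w | w_prev) = (f(w_prev w) + 1) / (f(w_prev) + V): the Add-One estimate
        unigrams = ngram_counts(tokens, 1)
        bigrams = ngram_counts(tokens, 2)
        V = len(unigrams)   # number of distinct words in the corpus
        return (bigrams[(w_prev, w)] + 1) / (unigrams[(w_prev,)] + V)

    sentence = "the cat that sat on the sofa also sat on the mat".split()
    print(addone_bigram_prob(sentence, "the", "cat"))   # (1+1)/(3+8) ≈ 0.18
    print(addone_bigram_prob(sentence, "the", "sat"))   # (0+1)/(3+8) ≈ 0.09: unseen bigram, non-zero probability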

25 Smoothing and N-grams
Add-One smoothing: add 1 to all frequency counts
bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
– [tables of frequencies: figure 6.4 and figure 6.8]
remarks: the perturbation problem
– Add-One causes large changes in some frequencies due to the relative size of V (1616)
– e.g. want to: 786 → 338

26 Smoothing and N-grams
Add-One smoothing: add 1 to all frequency counts
bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
– [tables of probabilities: figure 6.5 and figure 6.7]
remarks: the perturbation problem
– similar changes in the probabilities

27 Smoothing and N-grams Excel spreadsheet available –addone.xls