
LING 438/538 Computational Linguistics Sandiway Fong Lecture 19: 10/31

2 Administrivia
Homework 3
– graded and returned by ...

3 Last Time
applying probability theory to language
– N-grams: unigrams, bigrams, trigrams, etc.
– frequency counts for a sliding window (e.g., a 3-word window)
– example (12-word sentence): the cat that sat on the sofa also sat on the mat
reading
– textbook chapter 6: N-grams

4 Last Time
the importance of conditional probability or expectation in language
– example (section 6.2): Just then, the white ___ (rabbit? the?)
– expectation is p(rabbit|white) > p(the|white) (conditional)
– but p(the) > p(rabbit) (unconditional)
chain rule
– how to compute the probability of a sequence of words w_1 w_2 w_3 ... w_n
– p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-2} w_{n-1})
Markov assumption: finite-length history
– bigram approximation: look at the previous word only (1st-order Markov model)
  p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})
– trigram approximation: look at the preceding two words only (2nd-order Markov model)
  p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) p(w_4|w_2 w_3) ... p(w_n|w_{n-2} w_{n-1})
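To make the bigram approximation concrete, here is a minimal Python sketch (not from the lecture); the probability tables p_uni and p_bi and all numbers are purely illustrative.

```python
# Minimal sketch: scoring a sentence with the bigram approximation,
# assuming unigram and bigram probability tables already exist.

def bigram_sentence_prob(words, p_uni, p_bi):
    """p(w_1 ... w_n) ~ p(w_1) * p(w_2|w_1) * ... * p(w_n|w_{n-1})."""
    prob = p_uni.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= p_bi.get((prev, cur), 0.0)   # conditional p(cur | prev)
    return prob

# toy numbers, purely illustrative
p_uni = {"the": 0.07}
p_bi = {("the", "cat"): 0.002, ("cat", "sat"): 0.01}
print(bigram_sentence_prob(["the", "cat", "sat"], p_uni, p_bi))
```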

5 Last Time
computing estimates from corpora
– maximum likelihood estimation (MLE)
estimating p(w_n | w_{n-1})
– we can count f(w_{n-1} w_n)
– what do we divide f(w_{n-1} w_n) by to get a probability?
– answer: the sum of f(w_{n-1} w) over all possible w, which equals f(w_{n-1}), the unigram frequency of w_{n-1}
therefore p(w_n | w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})
– the relative frequency, used as the bigram approximation
– generalizes to the trigram case, etc.
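As an illustration of the MLE estimate above, a minimal Python sketch (not part of the original slides) that counts unigrams and bigrams and computes the relative frequencies:

```python
# Sketch of the MLE bigram estimate p(w_n | w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1}),
# assuming a tokenized corpus given as a list of words. Illustrative only.

from collections import Counter

def mle_bigram(corpus):
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    # relative frequency: divide each bigram count by the count of its first word
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = "the cat that sat on the sofa also sat on the mat".split()
p_bi = mle_bigram(corpus)
print(p_bi[("sat", "on")])   # f(sat on)/f(sat) = 2/2 = 1.0
print(p_bi[("the", "cat")])  # f(the cat)/f(the) = 1/3
```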

6 Last Time
started talking about smoothing...
– the problem of zero-frequency n-gram counts, even in a large corpus

7 Smoothing and N-grams
Add-One Smoothing
– add 1 to all frequency counts
– simple and no more zeros (but there are better methods)
unigram
– p(w) = f(w)/N (before Add-One), N = size of corpus
– p(w) = (f(w)+1)/(N+V) (with Add-One), V = number of distinct words in the corpus
– f*(w) = (f(w)+1) * N/(N+V) (with Add-One)
– N/(N+V) is the normalization factor adjusting for the effective increase in corpus size caused by Add-One
bigram
– p(w_n | w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1}) (before Add-One)
– p(w_n | w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V) (after Add-One)
– f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) * f(w_{n-1}) / (f(w_{n-1})+V) (after Add-One)
– must rescale so that the total probability mass stays at 1
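A small, hypothetical Python sketch of the Add-One formulas on this slide; the toy corpus and names are illustrative and this is not the lecture's addone.xls.

```python
# Sketch of Add-One (Laplace) smoothing for bigrams, following the slide's formulas.

from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_f, unigram_f, V):
    """p(w | w_prev) = (f(w_prev w) + 1) / (f(w_prev) + V)"""
    return (bigram_f[(w_prev, w)] + 1) / (unigram_f[w_prev] + V)

def add_one_reconstituted_count(w_prev, w, bigram_f, unigram_f, V):
    """f*(w_prev w) = (f(w_prev w) + 1) * f(w_prev) / (f(w_prev) + V)"""
    return (bigram_f[(w_prev, w)] + 1) * unigram_f[w_prev] / (unigram_f[w_prev] + V)

corpus = "the cat that sat on the sofa also sat on the mat".split()
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))
V = len(unigram_f)
print(add_one_bigram_prob("the", "dog", bigram_f, unigram_f, V))       # unseen bigram, now non-zero
print(add_one_reconstituted_count("the", "cat", bigram_f, unigram_f, V))
```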

8 Smoothing and N-grams
Add-One Smoothing
– add 1 to all frequency counts
bigram
– p(w_n | w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V)
– f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) * f(w_{n-1}) / (f(w_{n-1})+V)
[tables of original and reconstituted bigram frequencies: textbook figures 6.4 and 6.8]
Remarks: the perturbation problem
– Add-One causes large changes in some frequencies due to the relative size of V (1616)
– e.g., want to: 786 → 338

9 Smoothing and N-grams Revisited
let's illustrate the problem
– take the bigram case: w_{n-1} w_n
– p(w_n | w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})
– suppose there are continuations w^0_1 ... w^0_m of w_{n-1} that don't occur in the corpus
[diagram: the probability mass f(w_{n-1}) is divided among the seen bigrams f(w_{n-1} w_n), while f(w_{n-1} w^0_1) = ... = f(w_{n-1} w^0_m) = 0]

10 Smoothing and N-grams Revisited
add-one
– "give everyone 1"
[diagram: within the probability mass f(w_{n-1}), each seen bigram gets f(w_{n-1} w_n)+1 and each unseen bigram w^0_1 ... w^0_m gets count 1]

11 Smoothing and N-grams Revisited
add-one
– "give everyone 1"
– V = |{w_i}|, the number of distinct words
redistribution of probability mass
– p(w_n | w_{n-1}) = (f(w_{n-1} w_n)+1) / (f(w_{n-1})+V)
[diagram: as before, with every count incremented by 1]

12 Smoothing and N-grams Excel spreadsheet available –addone.xls

13 Smoothing and N-grams Revisited
Witten-Bell
– "don't give everyone the same increment"
– "instead, condition the increment on probability"
take the bigram case: w_{n-1} w_n
– we want to base our estimate y for the zero cases on how often w_{n-1} starts a bigram
– T(w_{n-1}) = number of different bigram types beginning with w_{n-1} (that have already been seen in the corpus)
– what do we scale T(w_{n-1}) by? let Z(w_{n-1}) be the number of w^0_i's, i.e. the zero cases, giving the count estimate y = T(w_{n-1}) / Z(w_{n-1})
– for probability, scale everything by T(w_{n-1}) + N(w_{n-1}), where N(w_{n-1}) = f(w_{n-1}), the total number of bigrams beginning with w_{n-1}
– hence, for the zeros: p(w_n | w_{n-1}) = T(w_{n-1}) / (Z(w_{n-1}) * (T(w_{n-1}) + N(w_{n-1})))
– for the non-zeros, we also have to rescale: p(w_n | w_{n-1}) = f(w_{n-1} w_n) / (f(w_{n-1}) + T(w_{n-1}))
[diagram: the probability mass f(w_{n-1}) split among seen bigrams f(w_{n-1} w_n) and unseen bigrams f(w_{n-1} w^0_1) = ... = f(w_{n-1} w^0_m) = y]

14 Smoothing and N-grams
Witten-Bell Smoothing
– equate zero-frequency items with frequency-1 items
– use the frequency of things seen once to estimate the frequency of things we haven't seen yet
– smaller impact than Add-One
unigram
– a zero-frequency word (unigram) is "an event that hasn't happened yet"
– count the number of different words T we've observed in the corpus
– p(w) = T / (Z * (N + T)), where w is a word with zero frequency, Z = number of zero-frequency words, N = size of corpus
bigram
– p(w_n | w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1}) (original)
– p(w_n | w_{n-1}) = T(w_{n-1}) / (Z(w_{n-1}) * (T(w_{n-1}) + N(w_{n-1}))) for zero bigrams (after Witten-Bell)
  T(w_{n-1}) = number of (seen) bigram types beginning with w_{n-1}
  Z(w_{n-1}) = number of unseen bigrams beginning with w_{n-1}
  Z(w_{n-1}) = (possible bigrams beginning with w_{n-1}) − (the ones we've seen) = V − T(w_{n-1})
– estimated zero-bigram frequency: T(w_{n-1}) / Z(w_{n-1}) * f(w_{n-1}) / (f(w_{n-1}) + T(w_{n-1}))
– p(w_n | w_{n-1}) = f(w_{n-1} w_n) / (f(w_{n-1}) + T(w_{n-1})) for non-zero bigrams (after Witten-Bell)
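To show the bigram formulas in one place, here is a hedged Python sketch of Witten-Bell as summarized above; it assumes the history word has been seen in the corpus and is not the lecture's wb.xls implementation.

```python
# Sketch of Witten-Bell bigram smoothing, following the slide's definitions:
# T(w) = number of distinct bigram types seen starting with w,
# N(w) = f(w) = number of bigram tokens starting with w,
# Z(w) = V - T(w) = number of unseen continuations of w. Illustrative only.

from collections import Counter, defaultdict

def witten_bell_bigram(corpus):
    unigram_f = Counter(corpus)
    bigram_f = Counter(zip(corpus, corpus[1:]))
    V = len(unigram_f)
    types_after = defaultdict(set)
    for w1, w2 in bigram_f:
        types_after[w1].add(w2)

    def prob(w_prev, w):
        # assumes w_prev occurs in the corpus
        T = len(types_after[w_prev])
        N = unigram_f[w_prev]
        Z = V - T
        if (w_prev, w) in bigram_f:          # seen bigram: discounted MLE
            return bigram_f[(w_prev, w)] / (N + T)
        return T / (Z * (N + T))             # unseen bigram: share of the held-out mass

    return prob

corpus = "the cat that sat on the sofa also sat on the mat".split()
p = witten_bell_bigram(corpus)
print(p("the", "cat"), p("the", "dog"))
```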

15 Smoothing and N-grams
Witten-Bell Smoothing
– use the frequency of things seen once to estimate the frequency of things we haven't seen yet
bigram
– estimated zero-bigram frequency: T(w_{n-1}) / Z(w_{n-1}) * f(w_{n-1}) / (f(w_{n-1}) + T(w_{n-1}))
  T(w_{n-1}) = number of bigram types beginning with w_{n-1}
  Z(w_{n-1}) = number of unseen bigrams beginning with w_{n-1}
[tables: textbook figures 6.4 and 6.9]
Remark: smaller changes than Add-One

16 Smoothing and N-grams Witten-Bell excel spreadsheet: –wb.xls

17 Smoothing and N-grams
Witten-Bell Smoothing Implementation (Excel)
– one-line formula
– sheet1; sheet2: rescaled sheet1

18 Language Models and N-grams
N-gram models
– they're technically easy to compute (in the sense that lots of training data are available)
– but just how good are these n-gram language models?
– and what can they show us about language?

19 Language Models and N-grams
approximating Shakespeare
– generate random sentences using n-grams
– train on the complete works of Shakespeare
[examples: unigram generation (pick random, unconnected words) and bigram generation]

20 Language Models and N-grams
approximating Shakespeare (section 6.2)
– generate random sentences using n-grams
– train on the complete works of Shakespeare
[examples: trigram and quadrigram generation]
Remarks: dataset size problem
– the training set is small: 884,647 words, 29,066 different words
– 29,066² = 844,832,356 possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, so the program can't be very innovative for higher n
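As a rough illustration of how such random sentences are produced, a minimal Python sketch (illustrative only; the lecture trained on the complete works of Shakespeare, replaced here by a one-line placeholder):

```python
# Sketch of the "approximating Shakespeare" idea: sample random sentences
# from a bigram model trained on some text.

import random
from collections import defaultdict

def train_bigrams(tokens):
    followers = defaultdict(list)
    for w1, w2 in zip(tokens, tokens[1:]):
        followers[w1].append(w2)       # duplicates preserve bigram frequencies
    return followers

def generate(followers, start, max_len=15):
    out = [start]
    while len(out) < max_len and followers[out[-1]]:
        out.append(random.choice(followers[out[-1]]))
    return " ".join(out)

tokens = "to be or not to be that is the question".split()  # placeholder corpus
followers = train_bigrams(tokens)
print(generate(followers, "to"))
```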

21 Possible Application?

22 Language Models and N-grams Aside:

23 Language Models and N-grams
N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability

24 Colorless green ideas
examples
– (1) colorless green ideas sleep furiously
– (2) furiously sleep ideas green colorless
Chomsky (1957):
– "... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
idea
– (1) is syntactically valid, (2) is word salad
Statistical Experiment (Pereira 2002)

25 Colorless green ideas
examples
– (1) colorless green ideas sleep furiously
– (2) furiously sleep ideas green colorless
Statistical Experiment (Pereira 2002)
[figure: a p(w_i | w_{i-1}) bigram language model applied to the two sentences]

26 Interesting things to Google
example
– colorless green ideas sleep furiously
– first hit

27 Interesting things to Google
example
– colorless green ideas sleep furiously
first hit: compositional semantics
– a green idea is, according to well established usage of the word "green", one that is an idea that is new and untried.
– again, a colorless idea is one without vividness, dull and unexciting.
– so it follows that a colorless green idea is a new, untried idea that is without vividness, dull and unexciting.
– to sleep is, among other things, to be in a state of dormancy or inactivity, or in a state of unconsciousness.
– to sleep furiously may seem a puzzling turn of phrase but one reflects that the mind in sleep often indeed moves furiously with ideas and images flickering in and out.

28 Interesting things to Google
example
– colorless green ideas sleep furiously
another hit (a story):
– "So this is our ranking system," said Chomsky. "As you can see, the highest rank is yellow."
– "And the new ideas?"
– "The green ones? Oh, the green ones don't get a color until they've had some seasoning. These ones, anyway, are still too angry. Even when they're asleep, they're furious. We've had to kick them out of the dormitories - they're just unmanageable."
– "So where are they?"
– "Look," said Chomsky, and pointed out of the window. There below, on the lawn, the colorless green ideas slept, furiously.

29 More on N-grams
how to degrade gracefully when we don't have evidence
– Backoff
– Deleted Interpolation
N-grams and Spelling Correction
Entropy

30 Backoff
idea
– hierarchy of approximations: trigram > bigram > unigram
– degrade gracefully
given a word sequence fragment: ... w_{n-2} w_{n-1} w_n ...
preference rule
1. p(w_n | w_{n-2} w_{n-1}) if f(w_{n-2} w_{n-1} w_n) ≠ 0
2. α1 * p(w_n | w_{n-1}) if f(w_{n-1} w_n) ≠ 0
3. α2 * p(w_n)
notes:
– α1 and α2 are fudge factors to ensure that the probabilities still sum to 1

31 Backoff
preference rule
1. p(w_n | w_{n-2} w_{n-1}) if f(w_{n-2} w_{n-1} w_n) ≠ 0
2. α1 * p(w_n | w_{n-1}) if f(w_{n-1} w_n) ≠ 0
3. α2 * p(w_n)
problem
– if f(w_{n-2} w_{n-1} w_n) = 0, we use one of the estimates from (2) or (3)
– assume the backoff value is non-zero
– then we are introducing a non-zero probability for p(w_n | w_{n-2} w_{n-1}), which is zero in the corpus
– this adds "probability mass" to p(w_n | w_{n-2} w_{n-1}) that is not in the original system
– therefore, we have to be careful to juggle the probabilities so they still sum to 1
– [details are given in section 6.4 of the textbook]
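A minimal Python sketch of the preference rule, with placeholder weights standing in for the properly normalized α factors discussed above (the careful bookkeeping is in section 6.4 of the textbook):

```python
# Sketch of simple backoff over trigram -> bigram -> unigram estimates.
# tri_p, bi_p, uni_p are assumed probability tables; the alphas are placeholders,
# not properly estimated normalizing weights.

ALPHA1, ALPHA2 = 0.4, 0.4   # placeholder fudge factors

def backoff_prob(w2, w1, w, tri_p, bi_p, uni_p):
    """Use the trigram estimate if its count was non-zero, else back off."""
    if (w2, w1, w) in tri_p:
        return tri_p[(w2, w1, w)]
    if (w1, w) in bi_p:
        return ALPHA1 * bi_p[(w1, w)]
    return ALPHA2 * uni_p.get(w, 0.0)

# toy tables, purely illustrative
tri_p = {("the", "white", "rabbit"): 0.3}
bi_p = {("white", "rabbit"): 0.2}
uni_p = {"rabbit": 0.001}
print(backoff_prob("the", "white", "rabbit", tri_p, bi_p, uni_p))  # trigram seen: 0.3
print(backoff_prob("a", "white", "rabbit", tri_p, bi_p, uni_p))    # backs off to the bigram
```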

32 Deleted Interpolation
– fundamental idea of interpolation
– equation (trigram):
  p̂(w_n | w_{n-2} w_{n-1}) = λ1 * p(w_n | w_{n-2} w_{n-1}) + λ2 * p(w_n | w_{n-1}) + λ3 * p(w_n)
– note: λ1, λ2 and λ3 are fudge factors to ensure that the probabilities still sum to 1
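And a corresponding sketch of deleted interpolation; the λ weights below are placeholders rather than values estimated from held-out data, and the probability tables are again assumed.

```python
# Sketch of deleted interpolation: mix the trigram, bigram, and unigram
# estimates with weights that sum to 1.

LAMBDAS = (0.6, 0.3, 0.1)   # placeholder lambdas; in practice tuned on held-out data

def interpolated_prob(w2, w1, w, tri_p, bi_p, uni_p, lambdas=LAMBDAS):
    l1, l2, l3 = lambdas
    return (l1 * tri_p.get((w2, w1, w), 0.0)
            + l2 * bi_p.get((w1, w), 0.0)
            + l3 * uni_p.get(w, 0.0))

# toy tables, purely illustrative
tri_p = {("the", "white", "rabbit"): 0.3}
bi_p = {("white", "rabbit"): 0.2}
uni_p = {"rabbit": 0.001}
print(interpolated_prob("the", "white", "rabbit", tri_p, bi_p, uni_p))  # 0.6*0.3 + 0.3*0.2 + 0.1*0.001
```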