BASIC TECHNIQUES IN STATISTICAL NLP: Word prediction, n-grams, smoothing (September 2003)



Statistical Methods in NLE
Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:
– VARIETY (no programmer can really take into account all possibilities)
– AMBIGUITY (need to have ways of choosing between alternatives)
In a number of NLE applications, statistical methods are very common.
The simplest application: WORD PREDICTION

We are good at word prediction
Stocks plunged this morning, despite a cut in interest …
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall …
Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began …

Real Spelling Errors
– They are leaving in about fifteen minuets to go to her house
– The study was conducted mainly be John Black.
– The design an construction of the system will take more than one year.
– Hopefully, all with continue smoothly in my absence.
– Can they lave him my messages?
– I need to notified the bank of this problem.
– He is trying to fine out.

Handwriting recognition
From Woody Allen's Take the Money and Run (1969):
– Allen (a bank robber) walks up to the teller and hands her a note that reads "I have a gun. Give me all your cash." The teller, however, is puzzled, because he reads "I have a gub." "No, it's 'gun'," Allen says. "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.

Applications of word prediction
– Spelling checkers
– Mobile phone texting
– Speech recognition
– Handwriting recognition
– Disabled users

Statistics and word prediction
The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of a spelling error.
I.e., to compute P(w | W1 … WN-1) for all words w, and predict as next word the one for which this (conditional) probability is highest.
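A minimal sketch of this argmax idea (not part of the original slides), assuming we already have a table of conditional probabilities; the `cond_prob` dictionary format is a made-up convenience:

```python
# Hypothetical format: cond_prob maps a history tuple to a {word: probability} dict.
def predict_next(cond_prob, history):
    dist = cond_prob.get(tuple(history), {})
    if not dist:
        return None  # no estimate for this history; smoothing (later slides) addresses this
    return max(dist, key=dist.get)  # the word with the highest conditional probability

# Toy example:
cond_prob = {("of",): {"the": 0.35, "a": 0.15, "course": 0.05}}
print(predict_next(cond_prob, ["of"]))  # -> 'the'
```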

Using corpora to estimate probabilities
But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.
The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.
'Maximum' because it doesn't waste any probability mass on events not in the corpus.

Maximum Likelihood Estimation for conditional probabilities
In order to estimate P(w | W1 … WN), we can use a ratio of counts instead:
P(w | W1 … WN) = C(W1 … WN w) / C(W1 … WN)
Cf.: P(A | B) = P(A & B) / P(B)
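A small sketch of the MLE / relative-frequency estimate for bigrams (an assumed implementation on a toy corpus, not code from the lecture):

```python
from collections import Counter

def mle_bigram_probs(tokens):
    # P(w2 | w1) = C(w1, w2) / C(w1), estimated by relative frequency
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigram_counts[w1]
            for (w1, w2), c in bigram_counts.items()}

tokens = "i want to eat chinese food i want to eat".split()
probs = mle_bigram_probs(tokens)
print(probs[("want", "to")])  # 1.0 -- 'to' always follows 'want' in this toy corpus
```

Strictly, the denominator should be the number of times the first word occurs as the start of a bigram; using the plain unigram count, as above, only differs for the final token of the corpus.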

Aside: counting words in corpora
Keep in mind that it's not always so obvious what 'a word' is (cf. yesterday).
In text:
– He stepped out into the hall, was delighted to encounter a brother. (From the Brown corpus.)
In speech:
– I do uh main- mainly business data processing
LEMMAS: cats vs. cat
TYPES vs. TOKENS
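As a small illustration (a made-up sentence, not from the Brown corpus), even a naive whitespace tokenisation already shows the TOKEN vs. TYPE distinction:

```python
text = "the cat sat on the mat because the cat was tired"
tokens = text.lower().split()   # naive tokenisation; real corpora need much more care
types = set(tokens)
print(len(tokens), len(types))  # 11 tokens, 8 types (and 'cats' vs 'cat' would still be 2 types without lemmatisation)
```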

The problem: sparse data
In principle, we would like the n of our models to be fairly large, to model 'long distance' dependencies such as:
– Sue SWALLOWED the large green …
However, in practice, most sequences of words of length greater than 3 hardly ever occur in our corpora! (See below.)
(Part of the) Solution: we APPROXIMATE the probability of a word given all previous words.

The Markov Assumption
The probability of being in a certain state only depends on the previous state:
P(Xn = Sk | X1 … Xn-1) = P(Xn = Sk | Xn-1)
This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite.
(N-gram models / Markov models can be seen as probabilistic finite state automata.)

The Markov assumption for language: n-gram models
Making the Markov assumption for word prediction means assuming that the probability of a word only depends on a fixed, small number of previous words: an N-GRAM model conditions on the previous N-1 words.

Bigrams and trigrams
Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):
P(Wn | W1 … Wn-1) ≈ P(Wn | Wn-2, Wn-1)
P(W1, … Wn) ≈ Π P(Wi | Wi-2, Wi-1)
What the bigram model means in practice:
– Instead of P(rabbit | Just the other day I saw a)
– we use P(rabbit | a)
Unigram: P(dog)
Bigram: P(dog | big)
Trigram: P(dog | the, big)

The chain rule
So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:
P(W1 … Wn) = P(W1) P(W2 | W1) P(W3 | W1, W2) … P(Wn | W1 … Wn-1)
E.g.,
– P(the big dog) = P(the) P(big | the) P(dog | the big)
Then we use the Markov assumption to reduce this to manageable proportions:
P(W1 … Wn) ≈ Π P(Wi | Wi-1) (bigram case)
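Putting the chain rule and the Markov assumption together, a bigram sentence probability can be sketched as below (an assumed helper, not the lecture's code; `bigram_prob` is a hypothetical {(previous, word): probability} dict). Working in log space avoids numerical underflow for long sentences:

```python
import math

def sentence_logprob(bigram_prob, words, start="<s>"):
    # log P(w1 ... wn) ~ sum_i log P(wi | w(i-1)) under the bigram assumption
    logp, prev = 0.0, start
    for w in words:
        p = bigram_prob.get((prev, w), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram zeroes the whole product (sparse data!)
        logp += math.log(p)
        prev = w
    return logp
```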

Example: the Berkeley Restaurant Project (BERP) corpus
BERP is a speech-based restaurant consultant. The corpus contains user queries; examples include:
– I'm looking for Cantonese food
– I'd like to eat dinner someplace nearby
– Tell me about Chez Panisse
– I'm looking for a good place to eat breakfast

Computing the probability of a sentence
Given a corpus like BERP, we can compute the probability of a sentence like "I want to eat Chinese food".
Making the bigram assumption and using the chain rule, the probability can be approximated as follows:
– P(I want to eat Chinese food) ≈ P(I | "sentence start") P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)

Bigram counts

How the bigram probabilities are computed
Example for P("I" | "I"):
– C("I", "I") = 8
– C("I") = … = 3437
– P("I" | "I") = 8 / 3437 = .0023

Bigram probabilities

The probability of the example sentence
P(I want to eat Chinese food) ≈ P(I | "sentence start") * P(want | I) * P(to | want) * P(eat | to) * P(Chinese | eat) * P(food | Chinese) = .25 * .32 * .65 * .26 * .002 * .60 ≈ .000016
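The arithmetic can be checked directly with the quoted BERP estimates:

```python
p = 0.25 * 0.32 * 0.65 * 0.26 * 0.002 * 0.60
print(p)  # ~1.6e-05
```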

Examples of actual bigram probabilities computed using BERP

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
For unigrams:
– Choose a random value r between 0 and 1
– Print out the word w whose cumulative probability interval contains r (i.e., sample w according to P(w))
For bigrams:
– Choose a random bigram starting with <s> according to its probability
– Then pick bigrams to follow (each starting with the previous word) in the same way
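A sketch of this sampling procedure for bigrams (an assumed implementation; `cond_dist` maps a word to a {next word: probability} dict, and the <s>/</s> boundary symbols are a convention adopted here):

```python
import random

def generate(cond_dist, start="<s>", end="</s>", max_len=20):
    out, prev = [], start
    for _ in range(max_len):
        words, probs = zip(*cond_dist[prev].items())
        nxt = random.choices(words, weights=probs, k=1)[0]  # sample from P(. | prev)
        if nxt == end:
            break
        out.append(nxt)
        prev = nxt
    return " ".join(out)

toy = {"<s>": {"i": 0.6, "the": 0.4}, "i": {"want": 1.0},
       "the": {"food": 1.0}, "want": {"food": 1.0}, "food": {"</s>": 1.0}}
print(generate(toy))  # e.g. 'i want food'
```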

The Shannon/Miller/Selfridge method trained on Shakespeare

Approximating Shakespeare, cont'd

A more formal evaluation mechanism
– Entropy
– Cross-entropy
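A rough sketch of how cross-entropy is used for evaluation (assuming a hypothetical `model_prob(prev, w)` that returns a smoothed bigram probability for held-out text):

```python
import math

def cross_entropy(model_prob, words, start="<s>"):
    logp, prev = 0.0, start
    for w in words:
        logp += math.log2(model_prob(prev, w))
        prev = w
    H = -logp / len(words)   # average bits per word on the held-out text
    return H, 2 ** H         # cross-entropy and the corresponding perplexity
```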

The downside
The entire Shakespeare oeuvre consists of:
– 884,647 tokens (N)
– 29,066 types (V)
– 300,000 bigrams
All of Jane Austen's novels (on Manning and Schuetze's website):
– N = 617,091 tokens
– V = 14,585 types

Comparing Austen n-grams: unigrams
Test sentence: In person she was inferior to
Under a unigram model, P(.) is the same at every position; ranked by probability:
1 the .034 | 2 to .032 | 3 and .030 | … | 8 was .015 | … | 13 she .011 | … | 1701 inferior .00005

Comparing Austen n-grams: bigrams
Test sentence: In person she was inferior to
P(. | person): 1 and .099 | 2 who .099 | … | 23 she .009
P(. | she): 1 had .141 | 2 was .122 | …
P(. | was): 1 not .065 | 2 a .052 | … | inferior 0
P(. | inferior): 1 to .212 | …

Comparing Austen n-grams: trigrams
Test sentence: In person she was inferior to
P(. | In, person): UNSEEN
P(. | person, she): 1 did .05 | 2 was .05 | …
P(. | she, was): 1 not .057 | 2 very .038 | … | inferior 0
P(. | was, inferior): UNSEEN

Maybe with a larger corpus?
Words such as 'ergativity' are unlikely to be found outside a corpus of linguistic articles.
More generally: Zipf's law.

Zipf's law for the Brown corpus

Addressing the zeroes
SMOOTHING: re-evaluating some of the zero-probability and low-probability n-grams and assigning them non-zero probabilities
– Add-one
– Witten-Bell
– Good-Turing
BACK-OFF: using the probabilities of lower-order n-grams when higher-order ones are not available
– Backoff
– Linear interpolation

Add-one ('Laplace's Law')
Add one to every count, then renormalize; for bigrams:
P(Wn | Wn-1) = (C(Wn-1 Wn) + 1) / (C(Wn-1) + V)
where V is the vocabulary size.
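A minimal add-one sketch for bigram probabilities (an assumed implementation on a toy corpus, not the BERP code):

```python
from collections import Counter

def laplace_bigram(tokens):
    V = len(set(tokens))                       # vocabulary size
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    def prob(prev, w):
        # (C(prev, w) + 1) / (C(prev) + V)
        return (bigram[(prev, w)] + 1) / (unigram[prev] + V)
    return prob

prob = laplace_bigram("i want to eat chinese food".split())
print(prob("want", "to"))    # a seen bigram: its probability is pushed down a little
print(prob("want", "food"))  # an unseen bigram now gets a small non-zero probability
```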

Effect on BERP bigram counts

Add-one bigram probabilities

The problem

The problem
Add-one has a huge effect on probabilities: e.g., P(to | want) went from .65 to .28!
Too much probability gets 'removed' from n-grams actually encountered (more precisely: the 'discount factor' is too large).

Witten-Bell Discounting
How can we get a better estimate of the probabilities of things we haven't seen?
The Witten-Bell algorithm is based on the idea that a zero-frequency N-gram is just an event that hasn't happened yet.
How often do such events happen? We model this by the probability of seeing an N-gram for the first time (we just count the number of times we first encountered a type).

Witten-Bell: the equations
Total probability mass assigned to zero-frequency N-grams:
T / (N + T)
(NB: T is the number of OBSERVED types, not V; N is the number of tokens.)
So each zero-count N-gram gets the probability:
T / (Z (N + T))
where Z is the number of N-grams with zero count.

Witten-Bell: why 'discounting'
Now of course we have to take away something ('discount') from the probability of the events we have seen:
c* = c · N / (N + T)

Witten-Bell for bigrams
We 'relativize' the types to the previous word: T(w) is the number of distinct types seen after w, N(w) the number of bigram tokens starting with w, and Z(w) the number of types never seen after w; then
P(wi | w) = C(w wi) / (N(w) + T(w)) if C(w wi) > 0, and T(w) / (Z(w) (N(w) + T(w))) otherwise.
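A sketch of these bigram equations (assumed code, not the lecture's; the uniform fallback for a completely unseen history is an extra assumption):

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    vocab = set(tokens)
    V = len(vocab)
    followers = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        followers[w1][w2] += 1
    def prob(prev, w):
        c = followers[prev]
        N, T = sum(c.values()), len(c)        # tokens and observed types after prev
        if N == 0:
            return 1 / V                      # unseen history: fall back to uniform (assumption)
        Z = V - T                             # types never seen after prev
        if c[w] > 0:
            return c[w] / (N + T)             # discounted estimate for a seen bigram
        return T / (Z * (N + T)) if Z else 0.0  # equal share of the reserved mass
    return prob
```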

Add-one vs. Witten-Bell discounts for unigrams in the BERP corpus
Words compared: "I", "want", "to", "eat", "Chinese", "food", "lunch"
(table of per-word discount values; e.g. for "lunch", .22 under add-one vs. .91 under Witten-Bell)

One last discounting method…
The best-known discounting method is GOOD-TURING (Good, 1953).
Basic insight: re-estimate the probability of N-grams with zero counts by looking at the number of N-grams that occurred once.
For example, the revised count for bigrams that never occurred is estimated by dividing N1, the number of bigrams that occurred once, by N0, the number of bigrams that never occurred.
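A sketch of the Good-Turing re-estimated counts on bigrams (assumed code; in practice the frequency-of-frequency counts N_c are themselves smoothed, and counts above a small threshold are usually left untouched):

```python
from collections import Counter

def good_turing_cstar(tokens):
    bigram = Counter(zip(tokens, tokens[1:]))
    Nc = Counter(bigram.values())            # N_c: number of bigram types seen exactly c times
    V = len(set(tokens))
    Nc[0] = V * V - len(bigram)              # bigram types that never occurred
    def c_star(c):
        if Nc[c] == 0 or Nc[c + 1] == 0:
            return float(c)                  # no re-estimate available without smoothing N_c
        return (c + 1) * Nc[c + 1] / Nc[c]   # c* = (c+1) * N_{c+1} / N_c
    return c_star
```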

Combining estimators
A method often used (generally in combination with discounting methods) is to use lower-order estimates to 'help' with higher-order ones:
– Backoff (Katz, 1987)
– Linear interpolation (Jelinek and Mercer, 1980)

Backoff: the basic idea
Use the trigram estimate if the trigram was seen; otherwise back off to the bigram estimate; otherwise back off to the unigram estimate.

Backoff with discounting
Simply backing off would add probability mass, so the higher-order estimates are discounted (e.g., with Good-Turing) and the left-over mass is spread over the lower-order estimates via normalizing α weights.
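For comparison, the linear-interpolation alternative mentioned above is simpler to sketch than full Katz backoff: mix the trigram, bigram, and unigram MLE estimates with weights that sum to one (the fixed lambdas below are illustrative; in practice they are tuned on held-out data, and the probability dicts are hypothetical inputs):

```python
def interpolated_prob(p_tri, p_bi, p_uni, w, w1, w2, lambdas=(0.6, 0.3, 0.1)):
    # P(w | w1, w2) = l3*P_tri(w | w1, w2) + l2*P_bi(w | w2) + l1*P_uni(w)
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((w1, w2, w), 0.0)
            + l2 * p_bi.get((w2, w), 0.0)
            + l1 * p_uni.get(w, 0.0))
```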

Readings
Jurafsky and Martin, chapter 6
The Statistics Glossary
Word prediction:
– for mobile phones
– for disabled users
Further reading: Manning and Schuetze, chapter 6 (Good-Turing)

Acknowledgments
Some of the material in these slides was taken from lecture notes by Diane Litman & James Martin.