
Lecture 3: N-grams. Topics: Python, NLTK, N-grams, Smoothing. Readings: Chapter 4, Jurafsky and Martin. January 23, 2013. CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2013 Last Time: slides from lecture; regular expressions in Python (and in grep, vi, emacs, Word?); Eliza; morphology. Today: N-gram models for prediction

– 3 – CSCE 771 Spring 2013 Eliza.py
A list of (regular expression, response patterns) pairs: if the regular expression matches, respond with one of the patterns (excerpt).

pairs = (
    (r'I need (.*)',
        ("Why do you need %1?",
         "Would it really help you to get %1?",
         "Are you sure you need %1?")),
    (r'Why don\'t you (.*)',
        ("Do you really think I don't %1?",
         "Perhaps eventually I will %1.",
         "Do you really want me to %1?")),
)
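A minimal sketch (not the actual eliza.py driver) of how such pairs might be applied, using the pairs excerpt above. eliza_respond is a hypothetical helper, and treating "%1" as the first captured group is an assumption based on the response templates shown.

import random
import re

def eliza_respond(utterance, pairs):
    """Return a canned response for the first pattern that matches the utterance."""
    for pattern, responses in pairs:
        m = re.match(pattern, utterance, re.IGNORECASE)
        if m:
            response = random.choice(responses)
            # Substitute the first captured group for the "%1" placeholder.
            return response.replace('%1', m.group(1))
    return "Please tell me more."

print(eliza_respond("I need a vacation", pairs))   # e.g. "Why do you need a vacation?"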

– 4 – CSCE 771 Spring 2013 Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Steven Bird, Ewan Klein, and Edward Loper. nltk.org/book
Preface
1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words
6. Learning to Classify Text
7. Extracting Information from Text
8. Analyzing Sentence Structure
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences
11. Managing Linguistic Data
12. Afterword: Facing the Language Challenge

– 5 – CSCE 771 Spring 2013 Language Processing and Python

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
...

nltk.org/book

– 6 – CSCE 771 Spring 2013 Simple text processing with NLTK

>>> text1.concordance("monstrous")
>>> text1.similar("monstrous")
>>> text2.common_contexts(["monstrous", "very"])
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>> text3.generate()
>>> text5[16715:16735]

nltk.org/book

– 7 – CSCE 771 Spring 2013 Counting Vocabulary

>>> len(text3)
>>> sorted(set(text3))
>>> from __future__ import division
>>> len(text3) / len(set(text3))
>>> text3.count("smote")

nltk.org/book

– 8 – CSCE 771 Spring 2013 lexical_diversity

>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
...
>>> def percentage(count, total):
...     return 100 * count / total
...

nltk.org/book

– 9 – CSCE 771 Spring 2013 Computing with Language: Simple Statistics. Frequency Distributions

>>> fdist1 = FreqDist(text1)
>>> fdist1
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
>>> fdist1['whale']
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)

nltk.org/book

– 10 – CSCE 771 Spring 2013 List constructors in Python

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])

nltk.org/book

– 11 – CSCE 771 Spring 2013 Collocations and Bigrams

>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money

nltk.org/book

– 12 – CSCE 771 Spring 2013 Table 1.2: FreqDist methods (example / description)

fdist = FreqDist(samples)     create a frequency distribution containing the given samples
fdist.inc(sample)             increment the count for this sample
fdist['monstrous']            count of the number of times a given sample occurred
fdist.freq('monstrous')       frequency of a given sample
fdist.N()                     total number of samples
fdist.keys()                  the samples sorted in order of decreasing frequency
for sample in fdist:          iterate over the samples, in decreasing frequency
fdist.max()                   sample with the greatest count
fdist.tabulate()              tabulate the frequency distribution
fdist.plot()                  graphical plot of the frequency distribution
fdist.plot(cumulative=True)   cumulative plot of the frequency distribution
fdist1 < fdist2               test if samples in fdist1 occur less frequently than in fdist2

nltk.org/book

– 13 – CSCE 771 Spring 2013 Quotes from Chapter 4
"But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." Noam Chomsky
"Anytime a linguist leaves the group the recognition rate goes up." Fred Jelinek (then of the IBM speech group)
SLP (Jurafsky and Martin) for the rest of the day

– 14 – CSCE 771 Spring 2013 Predicting Words Please turn your homework … What is the next word? Language models: N-gram models

– 15 – CSCE 771 Spring 2013 Word/Character Prediction: Uses
1. Spelling correction (at the character level)
2. Spelling correction at a higher level, when the corrector corrects to the wrong word
3. Augmentative communication: a person with a disability chooses words from a menu predicted by the system

– 16 – CSCE 771 Spring 2013 Real-Word Spelling Errors
Mental confusions: their/they're/there, to/too/two, weather/whether, peace/piece, you're/your
Typos that result in real words

– 17 – CSCE 771 Spring 2013 Spelling Errors that are Words
Typos
Context: left context and right context

– 18 – CSCE 771 Spring 2013 Real-Word Spelling Errors
Collect a set of common pairs of confusions.
Whenever a member of this set is encountered, compute the probability of the sentence in which it appears.
Substitute the other possibilities and compute the probability of each resulting sentence.
Choose the highest-probability version (see the sketch below).
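A minimal sketch of this procedure. The confusion sets and the bigram_prob(w1, w2) scorer are assumptions (any smoothed bigram model with no zero probabilities would do); nothing here comes from the slides themselves.

import math

CONFUSION_SETS = [{"their", "there", "they're"},
                  {"to", "too", "two"},
                  {"peace", "piece"}]

def sentence_logprob(words, bigram_prob):
    """Score a tokenized sentence with a bigram model, in log space to avoid underflow."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob(w1, w2)) for w1, w2 in zip(padded, padded[1:]))

def correct_real_word_errors(words, bigram_prob):
    """For each word in a confusion set, keep the variant whose sentence scores highest."""
    best = list(words)
    for i, w in enumerate(words):
        for confusion in CONFUSION_SETS:
            if w in confusion:
                scored = [(sentence_logprob(best[:i] + [alt] + best[i + 1:], bigram_prob),
                           best[:i] + [alt] + best[i + 1:])
                          for alt in confusion]
                best = max(scored)[1]
    return best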

– 19 – CSCE 771 Spring 2013 Word Counting
Probability based on counting: "He stepped out into the hall, was delighted to encounter a water brother." (from the Brown corpus)
Count words? Bigrams? Frequencies of words, but what counts as a word?
Corpora: the Web (everything on it), Shakespeare, the Bible/Koran, spoken transcripts (Switchboard)
Problems with spoken speech: fillers such as "uh" and "um"

– 20 – CSCE 771 Spring 2013 Bigrams from the Berkeley Restaurant Project
The Berkeley Restaurant Project: a speech-based restaurant consultant handling requests such as "I'm looking for Cantonese food" and "I'm looking for a good place to eat breakfast."

– 21 – CSCE 771 Spring 2013 Chain Rule
Recall the definition of conditional probability: P(A | B) = P(A, B) / P(B).
Rewriting: P(A, B) = P(A | B) P(B), or equivalently P(A, B) = P(B | A) P(A).
Applying this repeatedly gives the chain rule: P(w1 w2 ... wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1 ... wn-1)

– 22 – CSCE 771 Spring 2013 Example
"The big red dog": P(The) * P(big | the) * P(red | the big) * P(dog | the big red)
Better: condition the first word on a sentence-start marker, writing P(The | <s>).

– 23 – CSCE 771 Spring 2013 General Case
The word sequence from position 1 to n is written w1..n = w1 w2 ... wn.
So the probability of a sequence is P(w1..n) = P(w1) P(w2 | w1) P(w3 | w1 w2) ... P(wn | w1..n-1).

– 24 – CSCE 771 Spring 2013 Unfortunately
That doesn't help, since it's unlikely we'll ever gather the right statistics for such long prefixes.

– 25 – CSCE 771 Spring 2013 Markov Assumption
Assume that the entire prefix history isn't necessary. In other words, an event doesn't depend on all of its history, just on a fixed-length recent history.

– 26 – CSCE 771 Spring 2013 Markov Assumption
So, for each component in the product, replace it with its N-gram approximation: P(wn | w1 ... wn-1) ≈ P(wn | wn-N+1 ... wn-1)

– 27 – CSCE 771 Spring 2013 Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE): a method to estimate probabilities for the n-gram models by normalizing counts from a corpus. For bigrams: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
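As a concrete illustration (not from the slides), the same relative-frequency estimates can be computed with NLTK's frequency-distribution classes; the choice of the Brown corpus here is just an example.

import nltk
from nltk.corpus import brown   # may require nltk.download('brown') first

# MLE bigram estimates: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
words = [w.lower() for w in brown.words()]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
cpd = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)

print(cpd['want'].prob('to'))      # relative frequency of "to" after "want"
print(cpd['want'].prob('zebra'))   # 0.0 for a bigram never seen in training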

– 28 – CSCE 771 Spring 2013 N-Grams: "The big red dog"
Unigrams: P(dog)
Bigrams: P(dog | red)
Trigrams: P(dog | big red)
Four-grams: P(dog | the big red)
In general, we'll be dealing with P(word | some fixed prefix)

– 29 – CSCE 771 Spring 2013 Caveat
The formulation P(word | some fixed prefix) is not really appropriate in many applications. It is if we're dealing with real-time speech, where we only have access to prefixes. But if we're dealing with text, we already have both the right and left contexts; there's no a priori reason to stick to left contexts only.

– 30 – CSCE 771 Spring 2013 BERP Table: Counts (fig 4.1) Then we can normalize by dividing each row by the unigram counts.

– 31 – CSCE 771 Spring 2013 BERP Table: Bigram Probabilities

– 32 – CSCE 771 Spring 2013 Example
For this example: P(I | <s>) = .25, P(food | english) = .5, P(english | want), P(</s> | food) = .68
Now consider "I want English food":
P(<s> I want English food </s>) = P(I | <s>) P(want | I) P(english | want) P(food | english) P(</s> | food)

– 33 – CSCE 771 Spring 2013 An Aside on Logs
You don't really do all those multiplies: the numbers are too small and lead to underflow. Convert the probabilities to logs and then add. To get the real probability back (if you need it), take the antilog.
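A small illustration of the point; the probability values are made up for the example.

import math

probs = [0.25, 0.32, 0.0011, 0.5, 0.68]   # hypothetical bigram probabilities for one sentence

# Multiplying many small probabilities underflows; adding their logs does not.
log_p = sum(math.log(p) for p in probs)
print(log_p)             # the sentence's log probability
print(math.exp(log_p))   # back to a real probability (the antilog), if needed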

– 34 – CSCE 771 Spring 2013 Some Observations
The following numbers are very informative. Think about what they capture.
P(want | I) = .32
P(to | want) = .65
P(eat | to) = .26
P(food | Chinese) = .56
P(lunch | eat) = .055

– 35 – CSCE 771 Spring 2013 Some More Observations
P(I | I), P(want | I), P(I | food): these capture fragments such as "I I I want", "I want I want to", and "The food I want is".

– 36 – CSCE 771 Spring 2013 Generation Choose N-Grams according to their probabilities and string them together

– 37 – CSCE 771 Spring 2013 BERP
Bigrams strung together: "I want", "want to", "to eat", "eat Chinese", "Chinese food", "food .", yielding "I want to eat Chinese food."

– 38 – CSCE 771 Spring 2013 Some Useful Observations
A small number of events occur with high frequency; you can collect reliable statistics on these events with relatively small samples.
A large number of events occur with low frequency; you might have to wait a long time to gather statistics on the low-frequency events.

– 39 – CSCE 771 Spring 2013 Some Useful Observations
Some zeroes are really zeroes, meaning that they represent events that can't or shouldn't occur.
On the other hand, some zeroes aren't really zeroes; they represent low-frequency events that simply didn't occur in the corpus.

– 40 – CSCE 771 Spring 2013 Shannon's Method
Sentences randomly generated based on the probability models (n-gram models):
Sample a random bigram (<s>, w) according to its probability.
Now sample a random bigram (w, x) according to its probability, where the prefix w matches the suffix of the first.
And so on, until we randomly choose a (y, </s>). Then string the words together.
Example: <s> I, I want, want to, to eat, eat Chinese, Chinese food, food </s>
Slide from: Speech and Language Processing, Jurafsky and Martin
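A rough sketch of this sampling process over bigram counts; the corpus and the <s>/</s> padding are illustrative choices, not prescribed by the slides.

import random
from collections import defaultdict

import nltk
from nltk.corpus import brown   # may require nltk.download('brown') first

# Collect bigram successors, with <s>/</s> marking sentence boundaries.
successors = defaultdict(list)
for sent in brown.sents(categories='news'):
    tokens = ['<s>'] + [w.lower() for w in sent] + ['</s>']
    for w1, w2 in nltk.bigrams(tokens):
        successors[w1].append(w2)

def shannon_sentence():
    """Sample (<s>, w), then (w, x), ... until </s> is drawn, and join the words."""
    word, out = '<s>', []
    while True:
        word = random.choice(successors[word])   # picks successors in proportion to bigram counts
        if word == '</s>':
            return ' '.join(out)
        out.append(word)

print(shannon_sentence())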

– 41 – CSCE 771 Spring 2013 Shannon’s method applied to Shakespeare

– 42 – CSCE 771 Spring 2013 Shannon applied to Wall Street Journal

– 43 – CSCE 771 Spring 2013 Evaluating N-grams: Perplexity
Train on a training set; evaluate on a test set W = w1 w2 ... wN.
Perplexity (PP) is a measure of how good a model is: PP(W) = P(w1 w2 ... wN)^(-1/N)
Higher probability means lower perplexity.
Wall Street Journal perplexities of models.
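A minimal perplexity computation in this spirit, assuming a smoothed bigram_prob(w1, w2) function (an assumption; an unsmoothed model with zero probabilities would make the logarithm blow up).

import math

def perplexity(test_words, bigram_prob):
    """PP(W) = P(w1 ... wN)^(-1/N), computed in log space for numerical stability."""
    padded = ['<s>'] + test_words + ['</s>']
    log_p = sum(math.log(bigram_prob(w1, w2)) for w1, w2 in zip(padded, padded[1:]))
    n = len(padded) - 1   # number of predicted tokens, including </s>
    return math.exp(-log_p / n)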

– 44 – CSCE 771 Spring 2013 Unknown Words: Open versus Closed Vocabularies
In an open vocabulary, map unknown words to an unrecognized-word token such as <UNK>.

– 45 – CSCE 771 Spring 2013 Google words visualization

– 46 – CSCE 771 Spring 2013 Problem
Let's assume we're using N-grams. How can we assign a probability to a sequence where one of the component n-grams has a value of zero? (Assume all the words are known and have been seen.)
Options: go to a lower-order n-gram, i.e. back off from bigrams to unigrams, or replace the zero with something else (see the sketch below).
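One simple possibility, sketched here, is a crude backoff in the spirit of "stupid backoff": if the bigram was seen, use its relative frequency; otherwise fall back to a scaled unigram estimate. The function, the count dictionaries, and the alpha constant are all illustrative, and the result is a score rather than a properly normalized probability.

def backoff_score(w1, w2, bigram_counts, unigram_counts, total_tokens, alpha=0.4):
    """Back off from bigrams to unigrams when the bigram count is zero."""
    c = bigram_counts.get((w1, w2), 0)
    if c > 0:
        return c / unigram_counts[w1]                        # seen bigram: relative frequency
    return alpha * unigram_counts.get(w2, 0) / total_tokens  # unseen bigram: scaled unigram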

– 47 – CSCE 771 Spring 2013 Smoothing
Smoothing: re-evaluating some of the zero and low-probability N-grams and assigning them non-zero values.
Add-One (Laplace): make the zero counts 1. Rationale: they're just events you haven't seen yet. If you had seen them, chances are you would only have seen them once... so make the count equal to 1.

– 48 – CSCE 771 Spring 2013 Add-One Smoothing
Terminology: N is the number of total words (tokens); V is the vocabulary size, i.e. the number of distinct words (types).
Maximum likelihood estimate: P(wi) = ci / N
Add-one (Laplace) estimate: P_Laplace(wi) = (ci + 1) / (N + V)
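At the bigram level the same adjustment conditions on the previous word; a minimal sketch, with dictionary-of-counts arguments assumed rather than taken from the slides.

def add_one_bigram_prob(w1, w2, bigram_counts, unigram_counts, V):
    """Laplace estimate: P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + V)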

– 49 – CSCE 771 Spring 2013 Adjusted Counts c*
Terminology: N is the number of total words; V is the vocabulary size, i.e. the number of distinct words.
Adjusted count: ci* = (ci + 1) N / (N + V)
Adjusted probabilities: pi* = ci* / N = (ci + 1) / (N + V)
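The adjusted count follows directly from the formula above; the corpus size below is made up purely for illustration.

def adjusted_count(c, N, V):
    """Add-one adjusted count: c* = (c + 1) * N / (N + V), chosen so the c* still sum to N."""
    return (c + 1) * N / (N + V)

# Hypothetical numbers: an unseen word (c = 0) in a corpus of 100,000 tokens with V = 1616 types.
print(adjusted_count(0, 100_000, 1616))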

– 50 – CSCE 771 Spring 2013 Discounting
Discounting: lowering some of the larger non-zero counts to obtain the probability mass to assign to the zero entries.
dc = c*/c, the relative discount: the ratio of the discounted count to the original count.
The discounted probabilities can then be calculated directly.

– 51 – CSCE 771 Spring 2013 Original BERP Counts (fig 6.4 again) Berkeley Restaurant Project data V = 1616

– 52 – CSCE 771 Spring 2013 Figure 6.6 Add one counts Counts Probabilities

– 53 – CSCE 771 Spring 2013 Figure 6.6 Add one counts & prob. Counts Probabilities

– 54 – CSCE 771 Spring 2013 Add-One Smoothed Bigram Counts

– 55 – CSCE 771 Spring 2013 Witten-Bell
Think about the occurrence of an unseen item (word, bigram, etc.) as an event. The probability of such an event can be measured in a corpus by just looking at how often it happens.
Take the single-word case first. Assume a corpus of N tokens and T types. How many times was an as-yet-unseen type encountered? Every type was unseen the first time it appeared, so T times out of N + T events, giving the unseen-event probability T / (N + T).

– 56 – CSCE 771 Spring 2013 Witten-Bell
First compute the probability of an unseen event, then distribute that probability mass equally among the as-yet-unseen events.
That should strike you as odd for a number of reasons:
In the case of words...
In the case of bigrams...

– 57 – CSCE 771 Spring 2013 Witten-Bell
In the case of bigrams, not all conditioning events are equally promiscuous: compare P(x | the) vs. P(x | going). So distribute the mass assigned to the zero-count bigrams according to their promiscuity.

– 58 – CSCE 771 Spring 2013 Witten-Bell Finally, renormalize the whole table so that you still have a valid probability
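NLTK ships a Witten-Bell estimator that can be plugged into a conditional distribution, which applies the discounting per conditioning word; the corpus and the bins value below are illustrative choices, not taken from the slides.

import nltk
from nltk.corpus import brown   # may require nltk.download('brown') first

# For each conditioning word w1, Witten-Bell reserves T(w1) / (N(w1) + T(w1)) of the mass
# for unseen successors, where T is the number of distinct successor types and N the tokens.
words = [w.lower() for w in brown.words(categories='news')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
V = len(set(words))   # bins: how many outcomes could in principle follow a word

cpd = nltk.ConditionalProbDist(cfd, nltk.WittenBellProbDist, V)

print(cpd['the'].prob('house'))    # a seen bigram, slightly discounted
print(cpd['the'].prob('qwerty'))   # an unseen successor still receives some probability mass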

– 59 – CSCE 771 Spring 2013 Original BERP Counts; Now the Add 1 counts

– 60 – CSCE 771 Spring 2013 Witten-Bell Smoothed and Reconstituted

– 61 – CSCE 771 Spring 2013 Add-One Smoothed BERP Reconstituted