SI485i : NLP Set 3 Language Models Fall 2012 : Chambers

Language Modeling
Which sentence is most likely (most probable)?
  "I saw this dog running across the street."
  "Saw dog this I running across street the."
Why? You have a language model in your head: P("I saw this") >> P("saw dog this")

Language Modeling
Compute P(w1,w2,w3,w4,w5,...,wn): the probability of a sequence
Compute P(w5 | w1,w2,w3,w4): the probability of a word given some previous words
The model that computes P(W) is the language model.
A better term for this would be "The Grammar", but "language model" or LM is standard.

LMs: "fill in the blank"
Think of this as a "fill in the blank" problem: P( wn | w1, w2, ..., wn-1 )
"He picked up the bat and hit the _____"
Ball? Poetry?
P( ball | he, picked, up, the, bat, and, hit, the ) = ???
P( poetry | he, picked, up, the, bat, and, hit, the ) = ???

How do we count words?
"They picnicked by the pool then lay back on the grass and looked at the stars"
  16 tokens, 14 types
Brown et al. (1992), a big corpus of English text:
  583 million wordform tokens
  293,181 wordform types
N = number of tokens
V = vocabulary = number of types
General wisdom: V > O(sqrt(N))
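A minimal sketch of this kind of counting, assuming plain whitespace tokenization (the tokenizer and variable names are illustrative, not from the course materials):

```python
from collections import Counter

sentence = ("They picnicked by the pool then lay back on the grass "
            "and looked at the stars")
tokens = sentence.lower().split()   # whitespace tokenization
counts = Counter(tokens)

N = len(tokens)   # number of tokens
V = len(counts)   # number of types (vocabulary size)
print(N, V)       # 16 tokens, 14 types ("the" occurs three times)
```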

Computing P(W)
How to compute this joint probability:
P("the", "other", "day", "I", "was", "walking", "along", "and", "saw", "a", "lizard")
Rely on the Chain Rule of Probability.

The Chain Rule of Probability
Recall the definition of conditional probability: P(A|B) = P(A,B) / P(B)
Rewriting: P(A,B) = P(A|B) P(B)
More generally: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
P(x1, x2, x3, ..., xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1)

The Chain Rule applied to the joint probability of words in a sentence
P("the big red dog was") = ???
  = P(the) * P(big|the) * P(red|the big) * P(dog|the big red) * P(was|the big red dog)

How to estimate?
P(the | its water is so transparent that)
Very easy estimate:
P(the | its water is so transparent that)
  = C(its water is so transparent that the) / C(its water is so transparent that)

Unfortunately
There are a lot of possible sentences.
We'll never be able to get enough data to compute the statistics for those long prefixes:
P(lizard | the, other, day, I, was, walking, along, and, saw, a)

Markov Assumption
Make a simplifying assumption:
P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | a)
Or maybe:
P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | saw, a)

Markov Assumption
So for each component in the product, replace it with the approximation (assuming a prefix of N):
P(wn | w1, ..., wn-1) ≈ P(wn | wn-N+1, ..., wn-1)
Bigram version:
P(wn | w1, ..., wn-1) ≈ P(wn | wn-1)

N-gram Terminology
Unigrams: single words
Bigrams: pairs of words
Trigrams: three-word phrases
4-grams, 5-grams, 6-grams, etc.
"I saw a lizard yesterday"
Unigrams: I, saw, a, lizard, yesterday
Bigrams: <s> I, I saw, saw a, a lizard, lizard yesterday, yesterday </s>
Trigrams: <s> <s> I, <s> I saw, I saw a, saw a lizard, a lizard yesterday, lizard yesterday </s>
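A sketch of how these n-grams could be extracted, padding with <s> and </s> boundary symbols as above; the helper name `ngrams` is an assumption, not something defined in the course materials:

```python
def ngrams(tokens, n):
    # Pad with n-1 start symbols and one end symbol, then slide a window of size n.
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = "I saw a lizard yesterday".split()
print(ngrams(tokens, 2))   # ('<s>', 'I'), ('I', 'saw'), ..., ('yesterday', '</s>')
print(ngrams(tokens, 3))   # ('<s>', '<s>', 'I'), ('<s>', 'I', 'saw'), ...
```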

Estimating bigram probabilities
The Maximum Likelihood Estimate:
P(wi | wi-1) = C(wi-1, wi) / C(wi-1)
Bigram language model: what counts do I have to keep track of?
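A minimal sketch of the MLE bigram model: the only counts needed are the bigram counts C(wi-1, wi) and the context (unigram) counts C(wi-1). Function and variable names here are assumptions for illustration:

```python
from collections import Counter

def train_bigram_mle(sentences):
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigram_counts.update(tokens[:-1])               # context counts C(w_{i-1})
        bigram_counts.update(zip(tokens, tokens[1:]))    # bigram counts C(w_{i-1}, w_i)

    def prob(word, prev):
        # MLE: P(word | prev) = C(prev, word) / C(prev)
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    return prob
```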

An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I | <s>) = 2/3   P(Sam | <s>) = 1/3   P(am | I) = 2/3
P(</s> | Sam) = 1/2   P(Sam | am) = 1/2   P(do | I) = 1/3
This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model).
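Running the hypothetical `train_bigram_mle` sketch from the previous slide on this toy corpus reproduces the estimates above:

```python
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
prob = train_bigram_mle(corpus)
print(prob("I", "<s>"))      # 2/3
print(prob("am", "I"))       # 2/3
print(prob("</s>", "Sam"))   # 1/2
```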

Maximum Likelihood Estimates
The MLE of a parameter of a model M from a training set T is the estimate that maximizes the likelihood of the training set T given the model M.
Suppose the word "Chinese" occurs 400 times in a corpus of a million words.
What is the probability that a random word from another text will be "Chinese"?
MLE estimate is 400 / 1,000,000 = .0004
This may be a bad estimate for some other corpus, but it is the estimate that makes it most likely that "Chinese" will occur 400 times in a million-word corpus.

Example: Berkeley Restaurant Project
  can you tell me about any good cantonese restaurants close by
  mid priced thai food is what i'm looking for
  tell me about chez panisse
  can you give me a listing of the kinds of food that are available
  i'm looking for a good place to eat breakfast
  when is caffe venezia open during the day

Raw bigram counts Out of 9222 sentences

Raw bigram probabilities
Normalize by unigram counts: P(wi | wi-1) = C(wi-1, wi) / C(wi-1)
Result: a table of bigram probabilities

Bigram estimates of sentence probabilities
P( <s> I want english food </s> )
  = P(I | <s>) * P(want | I) * P(english | want) * P(food | english) * P(</s> | food)
  = .24 x .33 x .0011 x 0.5 x 0.68
  ≈ .00003
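A sketch of how such a sentence probability could be computed with a bigram `prob` function like the earlier MLE sketch, working in log space to avoid underflow on longer sentences (the names are assumptions):

```python
import math

def sentence_logprob(sentence, prob):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(prob(word, prev)) for prev, word in zip(tokens, tokens[1:]))

# e.g. math.exp(sentence_logprob("I want english food", prob)) on a model
# trained on the restaurant corpus would give roughly the value above.
```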

Unknown words
Closed Vocabulary Task
  We know all the words in advance.
  Vocabulary V is fixed.
Open Vocabulary Task
  You typically don't know the full vocabulary.
  Out Of Vocabulary (OOV) words

Unknown words: Fixed lexicon solution
Create a fixed lexicon L of size V.
Create an unknown word token <UNK>.
Training: at the text normalization phase, any training word not in L is changed to <UNK>; train its probabilities like a normal word.
At decoding time: use the <UNK> probabilities for any word not seen in training.

Unknown words: A Simplistic Approach
Count all tokens in your training set; call that count N.
Create an "unknown" token <UNK> and assign it probability P(<UNK>) = 1 / (N+1).
All other tokens receive P(word) = C(word) / (N+1).
During testing, any new word not in the vocabulary receives P(<UNK>).
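A minimal sketch of this simplistic scheme for a unigram model (the function name is an assumption): one unit of the (N+1) denominator is reserved for <UNK>, and every unseen word falls back to it:

```python
from collections import Counter

def train_unigram_with_unk(tokens):
    counts = Counter(tokens)
    N = len(tokens)

    def prob(word):
        # Seen words: C(word) / (N+1); anything unseen behaves like <UNK>: 1 / (N+1)
        return counts.get(word, 1) / (N + 1)

    return prob
```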

Evaluate
I counted a bunch of words. But is my language model any good?
1. Auto-generate sentences
2. Perplexity
3. Word-Error Rate

The Shannon Visualization Method
Generate random sentences:
Choose a random bigram "<s> w" according to its probability.
Now choose a random bigram "w x" according to its probability.
And so on until we randomly choose "</s>".
Then string the words together:
  <s> I, I want, want to, to eat, eat Chinese, Chinese food, food </s>
  → "I want to eat Chinese food"
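A sketch of that sampling loop, assuming a dictionary of bigram counts keyed by (previous word, word) pairs like the one kept internally in the earlier MLE sketch:

```python
import random

def generate(bigram_counts):
    prev, words = "<s>", []
    while True:
        followers = [(w, c) for (p, w), c in bigram_counts.items() if p == prev]
        choices, weights = zip(*followers)
        word = random.choices(choices, weights=weights)[0]   # sample proportionally to count
        if word == "</s>":
            return " ".join(words)
        words.append(word)
        prev = word
```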

Evaluation
We learned probabilities from a training set.
Look at the model's performance on some new data: this is a test set, a dataset different from our training set.
Then we need an evaluation metric to tell us how well our model is doing on the test set.
One such metric is perplexity (introduced below).

Perplexity
Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
Chain rule: PP(W) = ( prod_{i=1..N} 1 / P(wi | w1 ... wi-1) )^(1/N)
For bigrams: PP(W) = ( prod_{i=1..N} 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability.
The best language model is one that best predicts an unseen test set.
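A sketch of bigram perplexity computed in log space, assuming a `prob(word, prev)` function like the earlier MLE sketch (and that it never returns zero on the test set):

```python
import math

def perplexity(test_sentences, prob):
    log_prob, N = 0.0, 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(prob(word, prev))
            N += 1   # count every predicted token, including </s>
    return math.exp(-log_prob / N)
```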

Lower perplexity = better model
Training: 38 million words; test: 1.5 million words (WSJ)
  Unigram perplexity: 962
  Bigram perplexity: 170
  Trigram perplexity: 109

Begin the lab! Make bigram and trigram models!