NLP Language Models 1 Language Models (LM): Noisy Channel Model, Simple Markov Models, Smoothing, Statistical Language Models

Presentation transcript:

NLP Language Models 1 Language Models (LM): Noisy Channel Model, Simple Markov Models, Smoothing, Statistical Language Models

NLP Language Models 2 Two Main Approaches to NLP: knowledge-based (AI) and statistical models.
- Statistical models were inspired by speech recognition: the probability of the next word is based on the previous ones.
- Statistical models are different from statistical methods.

NLP Language Models 3 Probability Theory
Let X be the uncertain outcome of some event; X is called a random variable.
V(X) is the finite set of possible outcomes (not necessarily real numbers).
P(X = x) is the probability of the particular outcome x (x ∈ V(X)).
Example: X is the disease of your patient, V(X) the set of all possible diseases.

NLP Language Models 4 Probability Theory
Conditional probability: the probability of the outcome of an event given the outcome of a second event.
We pick two consecutive words at random from a book. We know the first word is "the" and want the probability that the second word is "dog":
P(W_2 = dog | W_1 = the) = |W_1 = the, W_2 = dog| / |W_1 = the|
Bayes' law: P(x|y) = P(x) P(y|x) / P(y)
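
A minimal Python sketch of this count-based conditional probability, using a made-up toy sentence (the text and the resulting value are purely illustrative):

    from collections import Counter

    text = "the dog barks at the cat while the dog sleeps".split()

    pair_counts  = Counter(zip(text, text[1:]))   # counts of adjacent word pairs (w1, w2)
    first_counts = Counter(text[:-1])             # counts of w1 as the first word of a pair

    # P(W_2 = dog | W_1 = the) = count(the, dog) / count(the)
    print(pair_counts[("the", "dog")] / first_counts["the"])   # 2/3 for this toy sentence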

NLP Language Models 5 Probability Theory
Bayes' law: P(x|y) = P(x) P(y|x) / P(y)
P(disease | symptom) = P(disease) P(symptom | disease) / P(symptom)
P(w_{1..n} | speech signal) = P(w_{1..n}) P(speech signal | w_{1..n}) / P(speech signal)
Since the denominator does not depend on w_{1..n}, we only need to maximize the numerator.
P(speech signal | w_{1..n}) expresses how well the speech signal fits the word sequence w_{1..n}.

NLP Language Models 6 Probability Theory
Useful generalizations of Bayes' law:
- To find the probability of something happening, compute the probability that it happens given some second event, times the probability of that second event.
- P(w, x | y, z) = P(w, x) P(y, z | w, x) / P(y, z), where w, x, y, z are separate events (e.g., individual words).
- Chain rule: P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1})
  The same holds when conditioning on some event x: P(w_1, ..., w_n | x) = P(w_1 | x) P(w_2 | w_1, x) ... P(w_n | w_1, ..., w_{n-1}, x)

NLP Language Models 7 Statistical Model of a Language
Statistical models of word sequences are called language models.
They assign a probability to every possible sequence of words: for sequences of length n, a number P(W_{1..n} = w_{1..n}), where w_{1..n} = w_1, w_2, ..., w_n and each w_i can be any word.
Computing the probability of the next word is closely related to computing the probability of a sequence:
P(the big dog) = P(the) P(big | the) P(dog | the big)
P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
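
A small sketch of this chain-rule decomposition in Python; the conditional probabilities are made-up numbers chosen only to show how the factors multiply:

    # Chain rule: P(w1 .. wn) = product of P(wi | w1 .. wi-1)
    cond_prob = {                        # hypothetical values, not from any corpus
        ("the",): 0.05,
        ("big", "the"): 0.01,
        ("dog", "the big"): 0.2,
    }

    def sentence_prob(words):
        p, history = 1.0, []
        for w in words:
            key = (w,) if not history else (w, " ".join(history))
            p *= cond_prob[key]          # P(w | history)
            history.append(w)
        return p

    print(sentence_prob(["the", "big", "dog"]))   # 0.05 * 0.01 * 0.2 = 1e-4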

NLP Language Models 8 N-gram Model
A simple but durable statistical model.
Useful for identifying words in noisy, ambiguous input:
- Speech recognition: many input speech sounds are similar and confusable.
- Machine translation, spelling correction, handwriting recognition, predictive text input.
- Other NLP tasks: part-of-speech tagging, natural language generation, word similarity.

NLP Language Models 9 CORPORA
Corpora (singular: corpus) are online collections of text or speech.
- Brown Corpus: a 1-million-word collection from 500 written texts of different genres (newspapers, novels, academic prose). Punctuation marks can be treated as words.
- Switchboard corpus: 2,430 telephone conversations averaging 6 minutes each, about 240 hours of speech and 3 million words. There is no punctuation. There are disfluencies: filled pauses and fragments, e.g. "I do uh main- mainly business data processing".

NLP Language Models 10 Training and Test Sets
The probabilities of an N-gram model come from the corpus it is trained on.
The data in the corpus is divided into a training set (or training corpus) and a test set (or test corpus).
Perplexity is used to compare statistical models.

NLP Language Models 11 N-gram Model
How can we compute the probabilities of entire sequences P(w_1, w_2, ..., w_n)?
Decomposition using the chain rule of probability:
P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1})
This assigns a conditional probability to each possible next word given its history.
Markov assumption: we can predict the probability of some future unit without looking too far into the past.
Bigrams consider only one previous unit, trigrams two previous units, and n-grams n-1 previous units.

NLP Language Models 12 N-gram Model
Only the n-1 previous words affect the probability of the next word.
For n = 3 (trigrams): P(w_n | w_1, ..., w_{n-1}) = P(w_n | w_{n-2}, w_{n-1})
How do we estimate these trigram or N-gram probabilities?
Maximum Likelihood Estimation (MLE): maximize the likelihood of the training set T given the model M, i.e. P(T | M).
To build the model, use a training text (corpus), take counts and normalize them so they lie between 0 and 1.

NLP Language Models 13 N-gram Model
For n = 3 (trigrams): P(w_n | w_1, ..., w_{n-1}) = P(w_n | w_{n-2}, w_{n-1})
To build the model, use a training text and record the pairs and triples of words that appear in it, together with how many times they occur:
P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
P(submarine | the, yellow) = C(the, yellow, submarine) / C(the, yellow)
Relative frequency: the observed frequency of a particular sequence divided by the observed frequency of its prefix.
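
A minimal sketch of this relative-frequency (MLE) estimate, trained on a tiny made-up corpus (real models need vastly more text):

    from collections import Counter

    corpus = "we all live in the yellow submarine the yellow submarine".split()

    trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigram_counts  = Counter(zip(corpus, corpus[1:]))

    def p_mle(w, w2, w1):
        """MLE trigram probability P(w | w2, w1) = C(w2, w1, w) / C(w2, w1)."""
        if bigram_counts[(w2, w1)] == 0:
            return 0.0                    # unseen history: return 0 in this sketch
        return trigram_counts[(w2, w1, w)] / bigram_counts[(w2, w1)]

    print(p_mle("submarine", "the", "yellow"))   # 2/2 = 1.0 in this toy corpus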

NLP Language Models 14 Statistical Models: Language Models (LM)
Vocabulary V, word w ∈ V.
Language L, sentence s ∈ L; L ⊆ V*, usually infinite; s = w_1, ..., w_N.
The language model assigns a probability P(s) to each sentence s.

NLP Language Models 15 Noisy Channel Model
W --encoder--> X --channel p(y|x)--> Y --decoder--> W*
W: message; X: input to the channel; Y: output from the channel; W*: attempt to reconstruct the message based on the output.

NLP Language Models 16 Noisy Channel Model in NLP
In NLP we usually do not act on the encoding. The problem reduces to decoding: finding the most likely input I given the output O of the noisy channel p(O|I).

NLP Language Models 17 Decoding rule (Bayes): Î = argmax_I P(I | O) = argmax_I P(I) P(O | I), where P(I) is the language model and P(O | I) is the channel model.
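
A toy sketch of this decoding rule in a spelling-correction setting; the candidate words and all probabilities below are made-up values, only meant to show the argmax:

    observed = "acress"                                              # misspelled output O

    lm_prob = {"actress": 2e-5, "across": 5e-5, "acres": 1e-5}       # P(I): language model
    channel_prob = {("acress", "actress"): 1e-3,                     # P(O | I): channel model
                    ("acress", "across"): 2e-4,
                    ("acress", "acres"): 5e-4}

    best = max(lm_prob, key=lambda i: lm_prob[i] * channel_prob[(observed, i)])
    print(best)   # "actress" wins for these illustrative numbers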

NLP Language Models 18 Noisy channel X → Y: a real language X and an observed language Y. We want to recover X from Y.

NLP Language Models 19 Example: text correction. Real language X: correct text; channel: introduction of errors; observed language Y: text with errors.

NLP Language Models 20 Example: space restoration. Real language X: correct text; channel: removal of spaces; observed language Y: text without spaces.

NLP Language Models 21 Example: speech recognition. Real language X: text (spelling); observed language Y: speech. P(X) is given by the language model, P(Y|X) by the acoustic model.

NLP Language Models 22 Example: machine translation. Real language X: source language; channel: translation; observed language Y: target language.

NLP Language Models 23 Example: ASR (Automatic Speech Recognition). The recognizer maps an acoustic chain X_1 ... X_T to a word chain w_1 ... w_N by combining a language model with an acoustic model.

NLP Language Models 24 Example: machine translation, combining a target-language model with a translation model.

NLP Language Models 25 Markov Models
Naive implementation: enumerate every s ∈ L and compute P(s); the model then has |L| parameters.
But L is usually infinite, so how do we estimate the parameters?
Simplification: condition each word on its history h_i = {w_1, ..., w_{i-1}}.

NLP Language Models 26 An n-gram model is a Markov model of order n-1:
P(w_i | h_i) = P(w_i | w_{i-n+1}, ..., w_{i-1})
0-gram: no conditioning on context
1-gram: P(w_i | h_i) = P(w_i)
2-gram: P(w_i | h_i) = P(w_i | w_{i-1})
3-gram: P(w_i | h_i) = P(w_i | w_{i-2}, w_{i-1})

NLP Language Models 27 Selecting n
- n large: more context information (more discriminative power).
- n small: more cases observed in the training corpus (more reliable estimates).
E.g., for |V| = 20,000 words:
n = 2 (bigrams): 400,000,000 parameters
n = 3 (trigrams): 8,000,000,000,000 parameters
n = 4 (4-grams): 1.6 x 10^17 parameters
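
These parameter counts are simply |V|^n; a short Python check reproduces the table above:

    V = 20_000                       # vocabulary size used in the table above
    for n in (2, 3, 4):
        print(n, f"{V ** n:.1e}")    # 4.0e+08, 8.0e+12, 1.6e+17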

NLP Language Models 28 An n-gram model has |V|^n parameters, estimated by MLE from a training corpus. This leads to the problem of sparseness.

NLP Language Models 29 MLE estimates
1-gram model: P(w_i) = C(w_i) / N
2-gram model: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
3-gram model: P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})

NLP Language Models 32 True probability distribution (figure).

NLP Language Models 33 With MLE, the seen cases are overestimated and the unseen ones get a null probability.

NLP Language Models 34 SMOOTHING: save part of the probability mass of the seen cases and assign it to the unseen ones.

NLP Language Models 35 Smoothing methods
- Some operate on the counts: Laplace, Lidstone, Jeffreys-Perks.
- Some operate on the probabilities: held-out estimation, Good-Turing, discounting.
- Some combine models: linear interpolation, back-off.

NLP Language Models 36 Laplace (add one)
P(w_1 ... w_n) = (C + 1) / (N + B), where
P = probability of an n-gram
C = count of the n-gram in the training corpus
N = total number of n-grams in the training corpus
B = number of parameters of the model (possible n-grams)

NLP Language Models 37 Lidstone (a generalization of Laplace)
P = (C + λ) / (N + λB), where λ is a small positive number:
MLE: λ = 0
Laplace: λ = 1
Jeffreys-Perks: λ = 1/2
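
A minimal sketch of Lidstone smoothing, which covers Laplace and Jeffreys-Perks as special cases of λ; the bigram counts and the value of B are toy values:

    from collections import Counter

    def lidstone_prob(ngram, counts, B, lam=1.0):
        """Lidstone estimate (C + lambda) / (N + lambda * B).
        lam = 1.0 gives Laplace (add one), lam = 0.5 gives Jeffreys-Perks."""
        N = sum(counts.values())                 # total n-gram tokens in training data
        return (counts[ngram] + lam) / (N + lam * B)

    counts = Counter({("the", "dog"): 2, ("dog", "barks"): 1})   # toy bigram counts
    print(lidstone_prob(("the", "cat"), counts, B=9))            # unseen bigram, still > 0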

NLP Language Models 38 Held-Out
Compute the percentage of the probability mass that has to be reserved for the n-grams unseen in the training corpus.
We set aside part of the training corpus as a held-out corpus and count how many n-grams that are unseen in the training corpus occur in the held-out corpus.
An alternative to a separate held-out corpus is cross-validation (held-out interpolation, deleted interpolation).

NLP Language Models 39 Held-Out
Let w_1 ... w_n be an n-gram, with r = C_1(w_1 ... w_n) its count in the training set and C_2(w_1 ... w_n) its count in the held-out set; N_r is the number of n-grams with count r in the training set.
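
For reference, the held-out estimator built from these quantities (not spelled out on the slide, so take it as the standard formulation consistent with the definitions above) is:

    T_r = \sum_{w_{1..n}:\, C_1(w_{1..n}) = r} C_2(w_{1..n}), \qquad
    P_{ho}(w_{1..n}) = \frac{T_r}{N_r \, N} \quad \text{where } r = C_1(w_{1..n})

with N the total number of n-gram tokens in the held-out data.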

NLP Language Models 40 Good-Turing
r* = adjusted count: r* = (r + 1) E(N_{r+1}) / E(N_r)
N_r = number of n-gram types occurring r times
E(N_r) = expected value of N_r; E(N_{r+1}) < E(N_r) (Zipf's law)
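
A minimal sketch of the Good-Turing adjusted counts, using the raw frequencies of frequencies N_r in place of E(N_r) (practical implementations smooth the N_r values first); the counts in the example are toy values:

    from collections import Counter

    def good_turing_adjusted_counts(ngram_counts):
        """r* = (r + 1) * N_{r+1} / N_r, computed from raw frequencies of frequencies."""
        n_r = Counter(ngram_counts.values())          # N_r for each observed count r
        adjusted = {}
        for ngram, r in ngram_counts.items():
            if n_r.get(r + 1, 0) > 0:
                adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
            else:
                adjusted[ngram] = r                   # no N_{r+1} evidence: keep raw count
        return adjusted

    counts = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 1, ("c", "a"): 2})
    print(good_turing_adjusted_counts(counts))        # adjusted count for each toy n-gram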

NLP Language Models 41 Combination of models: linear combination (interpolation)
A linear combination of the 1-gram, 2-gram, 3-gram, ... models, e.g.
P(w_i | w_{i-2}, w_{i-1}) = λ_1 P(w_i) + λ_2 P(w_i | w_{i-1}) + λ_3 P(w_i | w_{i-2}, w_{i-1}), with the λ_j summing to 1.
The weights λ are estimated on a development corpus.
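
A sketch of the interpolation itself; the component models passed in below are stand-in constant functions and the λ values are only illustrative:

    def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
        """Linear interpolation of unigram, bigram and trigram estimates.
        The lambdas must sum to 1; in practice they are tuned on a development corpus."""
        l1, l2, l3 = lambdas
        return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)

    print(interpolated_prob("dog", "the", "big",
                            p_uni=lambda w: 0.001,
                            p_bi=lambda w, p1: 0.01,
                            p_tri=lambda w, p2, p1: 0.2))   # 0.1*0.001 + 0.3*0.01 + 0.6*0.2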

NLP Language Models 42 Katz's Backing-Off
Start with an n-gram model; back off to the (n-1)-gram model for null (or low) counts; proceed recursively.
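
A deliberately simplified back-off sketch: it drops to the shorter n-gram when the count is too low, but it omits the discounting and the back-off weights α that real Katz back-off needs to keep the distribution normalized; the counts are toy values:

    def backoff_prob(ngram, counts, k=0):
        """Use the relative frequency of the longest n-gram with count > k,
        otherwise back off to the n-gram obtained by dropping the oldest word."""
        history = ngram[:-1]
        if not history:                                         # unigram base case
            total = sum(c for ng, c in counts.items() if len(ng) == 1)
            return counts.get(ngram, 0) / total
        if counts.get(ngram, 0) > k and counts.get(history, 0) > 0:
            return counts[ngram] / counts[history]              # MLE at this order
        return backoff_prob(ngram[1:], counts, k)               # back off one order

    counts = {("the",): 3, ("dog",): 2, ("barks",): 1,
              ("the", "dog"): 2, ("dog", "barks"): 1,
              ("the", "dog", "barks"): 1}
    print(backoff_prob(("the", "dog", "barks"), counts))        # seen trigram: 1/2
    print(backoff_prob(("the", "dog", "sleeps"), counts))       # backs off down to 0.0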

NLP Language Models 43 Operating on the history: class-based models
Clustering (or classifying) words into classes (POS, syntactic, semantic).
Rosenfeld, 2000:
P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | w_{i-2}, w_{i-1})
P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | w_{i-2}, C_{i-1})
P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | C_{i-2}, C_{i-1})
P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_{i-2}, C_{i-1})
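
A toy sketch of the third of these factorizations, P(w_i | w_{i-2}, w_{i-1}) = P(w_i | C_i) P(C_i | C_{i-2}, C_{i-1}); the class map and the probability tables are made-up values:

    word_class = {"the": "DET", "dog": "NOUN", "cat": "NOUN", "barks": "VERB"}

    p_word_given_class = {("dog", "NOUN"): 0.4, ("cat", "NOUN"): 0.3}      # P(w | C)
    p_class_given_hist = {("NOUN", ("DET", "NOUN")): 0.05,                 # P(C | C-2, C-1)
                          ("VERB", ("DET", "NOUN")): 0.6,
                          ("NOUN", ("VERB", "DET")): 0.7}

    def class_trigram_prob(w, w2, w1):
        """P(w | w2, w1) approximated as P(w | C) * P(C | C2, C1)."""
        c, c2, c1 = word_class[w], word_class[w2], word_class[w1]
        return (p_word_given_class.get((w, c), 0.0)
                * p_class_given_hist.get((c, (c2, c1)), 0.0))

    print(class_trigram_prob("dog", "barks", "the"))   # 0.4 * 0.7 = 0.28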

NLP Language Models 44 Structured Language Models (Jelinek and Chelba, 1999)
The syntactic structure is included in the history; the T_i are binarized, lexicalized syntactic trees.