Language Modeling.


Language Modeling

Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation Absolute Discounting Kneser-Ney

Language Model Evaluation Metrics

Applications

Entropy and perplexity Entropy – measures information content, in bits: $H(p) = \sum_x p(x) \cdot (-\log_2 p(x))$. Here $-\log_2 p(x)$ is the message length under an ideal code; use $\log_2$ if you want to measure in bits! Cross entropy – measures the ability of a trained model to compactly represent test data $w_1^n$: $\frac{1}{n} \sum_{i=1}^{n} -\log_2 p(w_i \mid w_1^{i-1})$, the average logprob of the test data. Perplexity – measures the average branching factor: $2^{\text{cross entropy}}$.

Language model perplexity Recipe: train a language model on training data; get negative logprobs of the test data and compute the average; exponentiate! Perplexity correlates rather well with speech recognition error rates and MT quality metrics. Perplexities for word-based models are normally between, say, 50 and 1000. You need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact.
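
To make the recipe concrete, here is a minimal sketch in Python (the toy probabilities are made up; only the arithmetic matters):

```python
import math

def perplexity(log2_probs):
    """Perplexity from per-token log2 probabilities of the test data."""
    # Cross entropy = average negative log2 probability per token.
    cross_entropy = -sum(log2_probs) / len(log2_probs)
    # Perplexity = 2 ** cross entropy.
    return 2 ** cross_entropy

# A model that assigns probability 0.1 to each of 4 test tokens has
# cross entropy log2(10) ~ 3.32 bits and perplexity 10.
probs = [0.1, 0.1, 0.1, 0.1]
print(perplexity([math.log2(p) for p in probs]))  # -> 10.0 (approximately)
```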

Parameter estimation What is it?

Parameter estimation Model form is fixed (coin unigrams, word bigrams, …) We have observations H H H T T H T H H Want to find the parameters Maximum Likelihood Estimation – pick the parameters that assign the most probability to our training data c(H) = 6; c(T) = 3 P(H) = 6 / 9 = 2 / 3; P(T) = 3 / 9 = 1 / 3 MLE picks parameters best for training data… …but these don’t generalize well to test data – zeros!
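
As a sketch, the coin example above in a few lines of Python (nothing here beyond counting and dividing):

```python
from collections import Counter

observations = ["H", "H", "H", "T", "T", "H", "T", "H", "H"]
counts = Counter(observations)                      # c(H) = 6, c(T) = 3
total = sum(counts.values())                        # 9
mle = {outcome: c / total for outcome, c in counts.items()}
print(mle)                                          # {'H': 0.666..., 'T': 0.333...}
# Any outcome never observed gets probability 0 under MLE -- the zeros problem.
```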

Smoothing Take mass from seen events, give to unseen events: Robin Hood for probability models. MLE is at one end of the spectrum; the uniform distribution is at the other. Need to pick a happy medium, and yet maintain a distribution: $\sum_x p(x) = 1$, $p(x) \ge 0\ \forall x$.

Smoothing techniques Laplace Good-Turing Backoff Mixtures Interpolation Kneser-Ney

Laplace From MLE: $p(x) = \frac{c(x)}{\sum_{x'} c(x')}$ To Laplace: $p(x) = \frac{c(x) + 1}{\sum_{x'} (c(x') + 1)}$
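
A sketch of add-one smoothing over a fixed vocabulary (the counts and vocabulary here are illustrative, not from the slides):

```python
from collections import Counter

def laplace_probs(counts, vocab):
    """Add-one smoothing: add 1 to every count, including for unseen words."""
    total = sum(counts.get(w, 0) for w in vocab) + len(vocab)
    return {w: (counts.get(w, 0) + 1) / total for w in vocab}

counts = Counter({"the": 5, "cat": 2, "sat": 1})
vocab = {"the", "cat", "sat", "dog"}          # "dog" was never seen
probs = laplace_probs(counts, vocab)
print(probs["dog"])                           # 1 / (8 + 4) = 0.0833..., no longer zero
print(sum(probs.values()))                    # 1.0
```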

Good-Turing Smoothing New idea: Use counts of things you have seen to estimate those you haven’t

Good-Turing Josh Goodman Intuition Imagine you are fishing There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18 Assuming so, how likely is it that next species is trout? Must be less than 1/18 Slide adapted from Josh Goodman, Dan Jurafsky

Some more hypotheticals [Table: counts of fish species (Salmon, Trout, Cod, Rockfish, Snapper, Skate, Bass) caught in Puget Sound, Lake Washington, and Greenlake.] How likely is it to find a new fish in each of these places?

Good-Turing Smoothing New idea: Use counts of things you have seen to estimate those you haven't Good-Turing approach: Use frequency of singletons to re-estimate frequency of zero-count n-grams Notation: $N_c$ is the frequency of frequency $c$, i.e. the number of n-grams which appear $c$ times. $N_0$: # of n-grams with count 0; $N_1$: # of n-grams with count 1. $N_c = \sum_{x : c(x) = c} 1$

Good-Turing Smoothing Estimate the probability of things which occur $c$ times with the probability of things which occur $c+1$ times. Discounted counts: steal mass from seen cases to provide for the unseen: $c^* = (c+1) \frac{N_{c+1}}{N_c}$. MLE: $p(x) = \frac{c(x)}{N}$. GT: $p(x) = \frac{c^*(x)}{N}$.
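
A small sketch of the re-estimate $c^* = (c+1) N_{c+1} / N_c$ using the fishing counts from the earlier slide (carp 10, perch 3, whitefish 2, trout/salmon/eel 1 each):

```python
from collections import Counter

species_counts = {"carp": 10, "perch": 3, "whitefish": 2,
                  "trout": 1, "salmon": 1, "eel": 1}
N = sum(species_counts.values())              # 18 fish caught
Nc = Counter(species_counts.values())         # N1 = 3, N2 = 1, N3 = 1, N10 = 1

def gt_count(c):
    """Good-Turing discounted count c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc.get(c + 1, 0) / Nc[c]

p_unseen = Nc[1] / N                          # mass for unseen species: 3/18
p_trout = gt_count(1) / N                     # (2 * N2 / N1) / 18, less than 1/18
print(p_unseen, p_trout)
```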

GT Fish Example

Enough about the fish… how does this relate to language? Name some linguistic situations where the number of new words would differ. Different languages: Chinese has almost no morphology; Turkish has a lot of morphology, so lots of new words in Turkish! Different domains: airplane maintenance manuals have a controlled vocabulary; random web posts have an uncontrolled vocabulary.

Bigram Frequencies of Frequencies and GT Re-estimates

Good-Turing Smoothing N-gram counts to conditional probability: $P(w_i \mid w_1 \ldots w_{i-1}) = \frac{c^*(w_1 \ldots w_i)}{c^*(w_1 \ldots w_{i-1})}$ Use $c^*$ from the GT estimate.

Additional Issues in Good-Turing General approach: the estimate of $c^*$ for $N_c$ depends on $N_{c+1}$. What if $N_{c+1} = 0$? More zero count problems. Not uncommon: e.g. in the fish example, there are no species with count 4.

Modifications Simple Good-Turing: compute the $N_c$ bins, then smooth the $N_c$ to replace zeroes; fit a linear regression in log space: $\log N_c = a + b \log c$. What about large $c$'s? Those should be reliable: assume $c^* = c$ if $c$ is large, e.g. $c > k$ (Katz: $k = 5$). Typically combined with other approaches.
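
A sketch of that log-space regression, assuming NumPy; the $N_c$ values below are made up and include a gap at $c = 4$:

```python
import numpy as np

# Observed frequency-of-frequency counts, with a missing bin at c = 4 (N4 = 0).
Nc = {1: 120, 2: 40, 3: 18, 5: 6, 6: 4}

cs = np.array(sorted(Nc))
b, a = np.polyfit(np.log(cs), np.log([Nc[c] for c in cs]), 1)   # slope b, intercept a

def smoothed_Nc(c):
    """Smoothed N_c from the fit log(N_c) = a + b*log(c); defined even where N_c was 0."""
    return np.exp(a + b * np.log(c))

print(smoothed_Nc(4))   # fills in the empty bin
```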

Backoff and Interpolation Another really useful source of knowledge If we are estimating: trigram p(z|x,y) but count(xyz) is zero Use info from: Bigram p(z|y) Or even: Unigram p(z) How to combine this trigram, bigram, unigram info in a valid fashion?

Backoff vs. Interpolation Backoff: use trigram if you have it, otherwise bigram, otherwise unigram Interpolation: always mix all three

Backoff Bigram distribution: $p(b \mid a) = \frac{c(ab)}{c(a)}$. But $c(a)$ could be zero… What if we fell back (or "backed off") to a unigram distribution? $p(b \mid a) = \frac{c(ab)}{c(a)}$ if $c(a) > 0$, $\frac{c(b)}{N}$ otherwise. Also $c(ab)$ could be zero…

Backoff What's wrong with this distribution? $p(b \mid a) = \frac{c(ab)}{c(a)}$ if $c(ab) > 0$; $\frac{c(b)}{N}$ if $c(ab) = 0, c(a) > 0$; $\frac{c(b)}{N}$ if $c(a) = 0$. Doesn't sum to one! Need to steal mass…

Backoff $p(b \mid a) = \frac{c(ab) - D}{c(a)}$ if $c(ab) > 0$; $\alpha(a) \frac{c(b)}{N}$ if $c(ab) = 0, c(a) > 0$; $\frac{c(b)}{N}$ if $c(a) = 0$; where $\alpha(a) = \frac{1 - \sum_{b' : c(ab') \ne 0} p(b' \mid a)}{\sum_{b' : c(ab') = 0} \frac{c(b')}{N}}$.
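
A sketch of this discounted backoff bigram with toy counts; the discount D and the counts are assumptions for illustration, and alpha is computed exactly as in the formula above:

```python
from collections import Counter

D = 0.5
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 2, ("a", "cat"): 1})
unigrams = Counter({"the": 5, "a": 1, "cat": 4, "dog": 2})
N = sum(unigrams.values())
vocab = set(unigrams)

def p_backoff(b, a):
    c_a = unigrams[a]
    if c_a == 0:
        return unigrams[b] / N                      # no history: plain unigram
    if bigrams[(a, b)] > 0:
        return (bigrams[(a, b)] - D) / c_a          # discounted bigram
    # Mass left over after discounting the seen bigrams for history a ...
    seen = [w for w in vocab if bigrams[(a, w)] > 0]
    leftover = 1.0 - sum((bigrams[(a, w)] - D) / c_a for w in seen)
    # ... is spread over unseen continuations in proportion to unigram probability.
    unseen_mass = sum(unigrams[w] / N for w in vocab if bigrams[(a, w)] == 0)
    alpha = leftover / unseen_mass
    return alpha * unigrams[b] / N

print(p_backoff("cat", "the"), p_backoff("dog", "a"))
print(sum(p_backoff(w, "the") for w in vocab))      # sums to 1.0
```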

Mixtures Given distributions $p_1(x)$ and $p_2(x)$, pick any number $\lambda$ between 0 and 1; then $p(x) = \lambda p_1(x) + (1 - \lambda) p_2(x)$ is a distribution. (Laplace is a mixture!)

Interpolation Simple interpolation: $p^*(w_i \mid w_{i-1}) = \lambda \frac{c(w_{i-1} w_i)}{c(w_{i-1})} + (1 - \lambda) \frac{c(w_i)}{N}$, with $\lambda \in [0, 1]$. Or, pick the interpolation value based on context: $p^*(w_i \mid w_{i-1}) = \lambda(w_{i-1}) \frac{c(w_{i-1} w_i)}{c(w_{i-1})} + (1 - \lambda(w_{i-1})) \frac{c(w_i)}{N}$. Intuition: higher weight on more frequent n-grams.

How to Set the Lambdas? Use a held-out, or development, corpus Choose lambdas which maximize the probability of some held-out data I.e. fix the N-gram probabilities Then search for lambda values That when plugged into previous equation Give largest probability for held-out set Can use EM to do this search
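
A sketch of both ideas together: the simple interpolation formula from the previous slide plus a grid search over held-out bigrams for the best lambda (the counts and held-out data are toy examples; EM would replace the grid search in practice):

```python
import math
from collections import Counter

def interp_prob(w, prev, lam, bigrams, unigrams, N):
    """Simple interpolation: lam * bigram MLE + (1 - lam) * unigram MLE."""
    bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * bi + (1 - lam) * unigrams[w] / N

def best_lambda(heldout, bigrams, unigrams, N):
    """Pick the lambda that maximizes held-out log probability.
    Assumes every held-out word was seen in training (else map it to <UNK>)."""
    def loglik(lam):
        return sum(math.log(interp_prob(w, prev, lam, bigrams, unigrams, N))
                   for prev, w in heldout)
    return max((i / 100 for i in range(1, 100)), key=loglik)

unigrams = Counter({"the": 5, "cat": 3, "dog": 2})
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 1})
N = sum(unigrams.values())
heldout = [("the", "cat"), ("the", "cat"), ("cat", "the")]
print(best_lambda(heldout, bigrams, unigrams, N))   # -> about 0.33
```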

Kneser-Ney Smoothing Most commonly used modern smoothing technique. Intuition: improving backoff. "I can't see without my reading……" Compare P(Francisco|reading) vs P(glasses|reading). P(Francisco|reading) backs off to P(Francisco); P(glasses|reading) > 0 too, but the high unigram frequency of Francisco makes P(Francisco|reading) > P(glasses|reading) under naive backoff. However, Francisco appears in few contexts, glasses in many. So interpolate based on # of contexts: words seen in more contexts are more likely to appear in others.

Kneser-Ney Smoothing: bigrams Modeling diversity of contexts: $c_{div}(w)$ = # of contexts in which $w$ occurs $= |\{v : c(vw) > 0\}|$. So $c_{div}(\text{glasses}) \gg c_{div}(\text{Francisco})$. $p_{div}(w_i) = \frac{c_{div}(w_i)}{\sum_{w'} c_{div}(w')}$

Kneser-Ney Smoothing: bigrams $c_{div}(w)$ = # of contexts in which $w$ occurs $= |\{v : c(vw) > 0\}|$, $p_{div}(w_i) = \frac{c_{div}(w_i)}{\sum_{w'} c_{div}(w')}$. Backoff: $p(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D}{c(w_{i-1})}$ if $c(w_{i-1} w_i) > 0$, $\alpha(w_{i-1})\, p_{div}(w_i)$ otherwise.

Kneser-Ney Smoothing: bigrams $c_{div}(w)$ = # of contexts in which $w$ occurs $= |\{v : c(vw) > 0\}|$, $p_{div}(w_i) = \frac{c_{div}(w_i)}{\sum_{w'} c_{div}(w')}$. Interpolation: $p(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D}{c(w_{i-1})} + \beta(w_{i-1})\, p_{div}(w_i)$
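
A sketch of the interpolated form with continuation ("diversity") counts computed from toy bigram counts; the discount D, the counts, and the max(..., 0) safeguard are assumptions for illustration:

```python
from collections import Counter

D = 0.75
bigrams = Counter({("reading", "glasses"): 2, ("my", "glasses"): 1,
                   ("dark", "glasses"): 1, ("San", "Francisco"): 4})

history_totals = Counter()
for (prev, w), c in bigrams.items():
    history_totals[prev] += c

# Continuation count: in how many distinct contexts does w occur?
c_div = Counter(w for (prev, w) in bigrams)
p_div = {w: c_div[w] / sum(c_div.values()) for w in c_div}

def p_kn(w, prev):
    c_prev = history_totals[prev]
    if c_prev == 0:
        return p_div.get(w, 0.0)
    discounted = max(bigrams[(prev, w)] - D, 0) / c_prev
    # beta returns the discounted mass, multiplied into the continuation unigram.
    num_types = sum(1 for (p, _) in bigrams if p == prev)
    beta = D * num_types / c_prev
    return discounted + beta * p_div.get(w, 0.0)

# "glasses" is seen after many contexts, "Francisco" after only one:
print(p_kn("glasses", "reading"), p_kn("Francisco", "reading"))
```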

OOV words: the <UNK> word Out Of Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events. Instead: create an unknown word token <UNK>. Training of <UNK> probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; now we train its probabilities like a normal word. At decoding time, for text input: use <UNK> probabilities for any word not in training, plus an additional penalty! <UNK> predicts the class of unknown words; then we need to pick a member.
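
A minimal sketch of that recipe: fix a lexicon, rewrite out-of-lexicon training tokens to <UNK>, then treat <UNK> like any other word (the lexicon size and sentences are illustrative):

```python
from collections import Counter

UNK = "<UNK>"

def build_lexicon(tokens, V):
    """Keep the V most frequent training words; everything else becomes <UNK>."""
    return {w for w, _ in Counter(tokens).most_common(V)}

def map_unk(tokens, lexicon):
    return [w if w in lexicon else UNK for w in tokens]

train = "the cat sat on the mat the dog sat".split()
lexicon = build_lexicon(train, V=4)            # e.g. {"the", "sat", "cat", "on"}
train_mapped = map_unk(train, lexicon)         # rare words replaced by <UNK>
counts = Counter(train_mapped)                 # now train <UNK> like a normal word

# At decoding time, unseen test words get <UNK>'s probability.
test = "the bird sat".split()
print(map_unk(test, lexicon))                  # ['the', '<UNK>', 'sat']
```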

Class-Based Language Models Variant of n-gram models using classes or clusters. Motivation: sparseness. Flight app.: P(ORD|to), P(JFK|to), … vs. P(airport_name|to). Relate the probability of an n-gram to word classes and a class n-gram. IBM clustering: assume each word is in a single class, $P(w_i \mid w_{i-1}) \approx P(c_i \mid c_{i-1}) \times P(w_i \mid c_i)$, learned by MLE from data. Where do classes come from? Hand-designed for the application (e.g. ATIS), or automatically induced clusters from a corpus.
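
A sketch of the IBM-style factorization with a tiny hand-designed class map; all of the words, classes, and counts here are made up to mirror the flight example:

```python
from collections import Counter

# Hand-designed classes, in the spirit of a flight application.
word2class = {"to": "PREP", "from": "PREP",
              "ORD": "AIRPORT", "JFK": "AIRPORT", "SEA": "AIRPORT"}

tokens = ["to", "ORD", "to", "JFK", "from", "SEA", "to", "SEA"]
pairs = list(zip(tokens, tokens[1:]))

word_counts = Counter(tokens)
class_counts = Counter(word2class[w] for w in tokens)
class_bigrams = Counter((word2class[a], word2class[b]) for a, b in pairs)
class_context = Counter(word2class[a] for a, _ in pairs)

def p_class_bigram(w, prev):
    """IBM clustering: P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i), all by MLE."""
    cw, cp = word2class[w], word2class[prev]
    p_cc = class_bigrams[(cp, cw)] / class_context[cp]    # class transition
    p_wc = word_counts[w] / class_counts[cw]              # word given class
    return p_cc * p_wc

# "from ORD" never occurred, but it still gets probability via its classes.
print(p_class_bigram("ORD", "from"))
```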

LM Adaptation Challenge: Need LM for new domain Have little in-domain data Intuition: Much of language is pretty general Can build from ‘general’ LM + in-domain data Approach: LM adaptation Train on large domain independent corpus Adapt with small in-domain data set What large corpus? Web counts! e.g. Google n-grams

Incorporating Longer Distance Context Why use longer context? N-grams are approximation Model size Sparseness What sorts of information in longer context? Priming Topic Sentence type Dialogue act Syntax

Long Distance LMs Bigger n: with 284M words, n-grams up to 6 improve; 7 to 20 are no better. Cache n-gram: intuition: priming, a word used previously is more likely to be used again; incrementally create a "cache" unigram model on the test corpus and mix it with the main n-gram LM. Topic models: intuition: text is about some topic, and on-topic words are likely; $P(w \mid h) \approx \sum_t P(w \mid t) P(t \mid h)$. Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams.
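
A sketch of just the cache idea: keep a unigram cache over the test text seen so far and mix it with the main LM. The mixing weight, the add-one smoothing of the cache, and the stub main LM are all assumptions for illustration:

```python
from collections import Counter

class CacheLM:
    """Mix a main LM with a unigram cache built from the test text seen so far."""
    def __init__(self, main_lm, gamma=0.1, vocab_size=10000):
        self.main_lm = main_lm        # callable: main_lm(word, history) -> probability
        self.gamma = gamma            # weight on the cache component
        self.vocab_size = vocab_size
        self.cache = Counter()

    def prob(self, word, history):
        # Add-one smoothed cache unigram, so words absent from the cache keep some mass.
        cache_p = (self.cache[word] + 1) / (sum(self.cache.values()) + self.vocab_size)
        return (1 - self.gamma) * self.main_lm(word, history) + self.gamma * cache_p

    def observe(self, word):
        self.cache[word] += 1         # update the cache as we move through the text

# Usage with a stub main LM that returns a uniform probability:
lm = CacheLM(main_lm=lambda w, h: 1 / 10000)
for w in "the report discusses the report".split():
    lm.prob(w, history=None)
    lm.observe(w)
print(lm.prob("report", None) > lm.prob("unrelated", None))   # True: priming effect
```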

Language Models N-gram models: a finite approximation of an infinite context history. Issues: zeroes and other sparseness. Strategies: smoothing (add-one, add-δ, Good-Turing, etc.); use partial n-grams: interpolation, backoff. Refinements: class, cache, topic, trigger LMs.