N-gram Models CMSC 25000 Artificial Intelligence February 24, 2005.

Roadmap
n-gram models
–Motivation
Basic n-grams
–Markov assumptions
Coping with sparse data
–Smoothing, backoff
Evaluating the model
–Entropy and perplexity

Information & Communication
Shannon (1948) perspective:
–A message is selected from a set of possible messages
–The number of possible messages (or a function of that number) measures the information produced by selecting one
–Logarithmic measure, base 2: # of bits (see the formula sketched below)
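A minimal statement of the base-2 measure, reconstructed from the standard formulation rather than from the slide's own notation:

```latex
% Information from selecting one of N equally likely messages (base-2 measure)
I = \log_2 N \ \text{bits}, \qquad \text{e.g. } N = 8 \ \Rightarrow\ I = 3 \ \text{bits}
```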

Probabilistic Language Generation
Coin-flipping models:
–A sentence is generated by a randomized algorithm
–The generator can be in one of several “states”
–Flip coins to choose the next state; flip other coins to decide which letter or word to output (a small sampling sketch follows)
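A minimal sketch of such a coin-flipping generator in Python; the states, letters, and transition probabilities below are an invented toy table, not Shannon's actual model:

```python
import random

# Toy "coin-flipping" generator: the state is the previous letter, and a
# weighted random choice (the coin flip) picks the next letter to output.
transitions = {
    " ": {"t": 0.4, "a": 0.3, "o": 0.3},
    "t": {"h": 0.6, "o": 0.2, " ": 0.2},
    "h": {"e": 0.7, "a": 0.2, " ": 0.1},
    "e": {" ": 0.5, "r": 0.3, "n": 0.2},
    "a": {"t": 0.5, "n": 0.3, " ": 0.2},
    "o": {"n": 0.4, "f": 0.3, " ": 0.3},
    "r": {"e": 0.5, " ": 0.5},
    "n": {" ": 0.6, "d": 0.4},
    "d": {" ": 1.0},
    "f": {" ": 1.0},
}

def generate(n_chars=40, state=" "):
    out = []
    for _ in range(n_chars):
        letters, weights = zip(*transitions[state].items())
        state = random.choices(letters, weights=weights)[0]  # weighted coin flip
        out.append(state)
    return "".join(out)

print(generate())
```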

Shannon’s Generated Language
1. Zero-order approximation:
–XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation:
–OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation:
–ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Shannon’s Word Models
1. First-order approximation:
–REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
2. Second-order approximation:
–THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

N-grams
Perspective:
–Some sequences (words/characters) are more likely than others
–Given a sequence, we can guess the most likely next item
Used in:
–Speech recognition
–Spelling correction
–Augmentative communication
–Language identification
–Information retrieval

Corpus Counts
Estimate probabilities from counts over large collections of text/speech
Issues:
–Wordforms (surface) vs. lemmas (roots)
–Case? Punctuation? Disfluencies?
–Types (distinct words) vs. tokens (total occurrences)

Basic N-grams
Most trivial estimate: 1/#tokens for every word: too simple!
Standard unigram: relative frequency
–# occurrences of the word / total corpus size
–E.g., the = 0.07; rabbit has a far smaller value
–Still too simple: no context!
Better: conditional probabilities of word sequences (a counting sketch follows)
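A minimal sketch of unigram relative-frequency estimation in Python, using an invented toy corpus:

```python
from collections import Counter

# Unigram (relative-frequency) estimation: count each word, divide by corpus size.
corpus = "the rabbit saw the dog and the dog saw the rabbit".split()

counts = Counter(corpus)
total = len(corpus)
unigram_p = {w: c / total for w, c in counts.items()}

print(unigram_p["the"])     # frequent word, relatively high probability
print(unigram_p["rabbit"])  # rarer word, much lower probability
```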

Markov Assumptions
Exact computation requires too much data
Approximate the probability given all prior words:
–Assume a finite history
–Bigram: probability of a word given 1 previous word (first-order Markov)
–Trigram: probability of a word given 2 previous words
The n-gram approximation and the bigram sequence probability are written out below.
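The formulas the slide refers to, reconstructed from the standard textbook formulation (the slide's own equation images are not in the transcript):

```latex
% Chain rule, N-gram approximation, and bigram sequence probability
P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})              % exact chain rule
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})     % N-gram approximation
P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})          % bigram sequence
```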

Issues
Relative frequency estimation:
–Typically compute the count of the sequence and divide by the count of its prefix (formula below)
Corpus sensitivity:
–Shakespeare vs. Wall Street Journal yield very different models
Generated n-grams can look very unnatural:
–Unigrams: little structure; bigrams: collocations; trigrams: phrase-like
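The relative-frequency (maximum likelihood) estimate in its standard form, written here for a bigram:

```latex
% Count of the sequence divided by the count of its prefix
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
```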

Sparse Data Issues
Zero-count n-grams:
–Problem: not seen yet, but not necessarily impossible
–Solution: estimate probabilities of unseen events
Two strategies:
–Smoothing: redistribute some of the estimated probability mass to unseen events
–Backoff: estimate higher-order n-grams from lower-order ones

Smoothing out Zeroes
Add-one smoothing
–Simple: add 1 to all counts -> no zeroes!
–Normalize by count and vocabulary size
Unigrams:
–Adjusted count and adjusted probability (see the formulas below)
Bigrams:
–Adjusted probability (see the formulas below)
Problem: too much weight ends up on the (former) zeroes
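The standard add-one (Laplace) formulas, with corpus size N and vocabulary size V, reconstructed in place of the slide's missing equations:

```latex
c_i^* = (c_i + 1)\,\frac{N}{N + V}                                    % adjusted unigram count
p_i^* = \frac{c_i + 1}{N + V}                                         % adjusted unigram probability
P^*(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}   % adjusted bigram probability
```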

Backoff
Idea: if there are no trigram counts, estimate with bigrams (and so on down to unigrams)
–E.g., back off from the trigram estimate to a weighted bigram estimate when the trigram is unseen
Deleted interpolation:
–Replace the α’s with λ’s that are trained for word contexts (a small interpolation sketch follows)
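A minimal interpolation sketch in Python; the fixed lambdas and the function name interpolated_p are illustrative assumptions (true deleted interpolation trains the lambdas on held-out data, possibly per context):

```python
# Linearly interpolated trigram estimate:
#   P_hat(w | u, v) = l1*P(w | u, v) + l2*P(w | v) + l3*P(w)
def interpolated_p(w, u, v, trigram_p, bigram_p, unigram_p,
                   lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    return (l1 * trigram_p.get((u, v, w), 0.0)
            + l2 * bigram_p.get((v, w), 0.0)
            + l3 * unigram_p.get(w, 0.0))
```

A caller would pass precomputed maximum-likelihood probability dictionaries keyed by the trigram tuple, the bigram tuple, and the word, respectively.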

Toward an Information Measure
Knowledge: event probabilities are available
Desirable characteristics for a measure H(p1, p2, …, pn):
–Continuous in the pi
–If the pi are equally likely, monotonically increasing in n (with equally likely outcomes, more elements means more choice)
–If a choice is broken into successive choices, H should be the weighted sum of the component measures
Entropy H(X), for a random variable X with probability function p, satisfies these (definition below)
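The standard definition the slide points to:

```latex
% Entropy of a random variable X with probability function p
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
```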

Evaluating n-gram models
Entropy & perplexity:
–Information-theoretic measures
–Measure the information in a grammar, or its fit to the data
–Conceptually, a lower bound on the # of bits needed to encode the data
Entropy: H(X), for a random variable X with probability function p
Perplexity:
–A weighted average of the number of choices (both defined below)
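The standard definitions, reconstructed in place of the slide's missing equations:

```latex
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)           % entropy
PP(X) = 2^{H(X)}                                            % perplexity as 2^entropy
PP(W) = P(w_1 \ldots w_N)^{-1/N}                            % per-word perplexity of a test set W
```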

Computing Entropy
Picking horses (Cover and Thomas)
Send a message identifying one horse out of 8:
–If all horses are equally likely, p(i) = 1/8
–Some horses more likely: 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64 each
The two cases are worked out below.
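Working both cases with the entropy definition above:

```latex
% Equally likely horses: a 3-bit message suffices
H = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = \log_2 8 = 3 \text{ bits}
% Skewed distribution: shorter codes for likely horses pay off
H = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{16}(4) + 4 \cdot \tfrac{1}{64}(6) = 2 \text{ bits}
```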

Entropy of a Sequence
Basic sequence entropy, and the entropy of a language over infinite lengths:
–Assume the language is stationary & ergodic (formulas below)
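The standard formulation of the sequence entropy and the per-word entropy rate:

```latex
H(w_1 \ldots w_n) = -\sum_{w_1^n \in L} p(w_1^n) \log_2 p(w_1^n)
H(L) = \lim_{n \to \infty} \tfrac{1}{n} H(w_1 \ldots w_n)
     = \lim_{n \to \infty} -\tfrac{1}{n} \log_2 p(w_1 \ldots w_n) \quad \text{(stationary, ergodic)}
```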

Cross-Entropy
Comparing models:
–The actual distribution is unknown
–Use a simplified model to estimate it
–The closer the match, the lower the cross-entropy (definition below)
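The standard definition, reconstructed in place of the slide's missing equation:

```latex
% Cross-entropy of the true distribution p measured with model m;
% it upper-bounds the true entropy, so a closer model scores lower.
H(p, m) = \lim_{n \to \infty} -\tfrac{1}{n} \log_2 m(w_1 \ldots w_n) \ \geq\ H(p)
```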

Perplexity Model Comparison
Compare models with different history lengths
Train models:
–38 million words of Wall Street Journal text
Compute perplexity on a held-out test set:
–1.5 million words (~20K unique words, smoothed)

N-gram order | Perplexity
Unigram      | 962
Bigram       | 170
Trigram      | 109

Does the model improve?
Compute the probability of the data under the model
–Compute perplexity
Relative measure:
–Does it decrease toward an optimum across iterations?
–Is it lower than a competing model?
[Slide table: P(data) per training iteration, rising from roughly 9×10^-19 to 5×10^-16, with perplexity falling correspondingly]

Entropy of English
Shannon’s experiment:
–Subjects guess strings of letters; count the guesses
–Entropy of the guess sequence = entropy of the letter sequence
–Result: about 1.3 bits per character (restricted text)
Build a stochastic model on text & compute:
–Brown et al. computed a trigram model on a varied corpus
–Computed the (per-character) entropy of the model
–Result: about 1.75 bits per character

Using N-grams
Language identification:
–Take text samples in English, French, Spanish, German
–Build character trigram models
–Test sample: compute the likelihood under each model; the best match is the chosen language (a small sketch follows)
Authorship attribution
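A minimal character-trigram language-ID sketch in Python; the toy training strings, the add-k smoothing, and the function names are illustrative assumptions, not the deck's actual setup:

```python
from collections import Counter
import math

def trigram_model(text):
    """Character-trigram counts plus totals for later add-k smoothing."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return grams, sum(grams.values()), len(grams) + 1  # +1 for unseen trigrams

def score(sample, model, k=1.0):
    """Add-k smoothed log-likelihood of the sample under one language model."""
    grams, total, vocab = model
    return sum(math.log((grams.get(sample[i:i + 3], 0) + k) / (total + k * vocab))
               for i in range(len(sample) - 2))

# Toy training data; real systems would use large corpora per language.
models = {
    "english": trigram_model("the cat sat on the mat and the dog ran home"),
    "spanish": trigram_model("el gato se sento en la alfombra y el perro corrio"),
}

sample = "the dog sat on the mat"
best = max(models, key=lambda lang: score(sample, models[lang]))
print(best)  # the maximum-likelihood match is the chosen language
```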