Statistical Language Models
Hongning Wang, CS 6501: Text Mining

Today's lecture
1. How to represent a document? - Make it computable
2. How to infer the relationship among documents or identify the structure within a document? - Knowledge discovery

What is a statistical LM?
A model specifying a probability distribution over word sequences, e.g.:
- p("Today is Wednesday")
- p("Today Wednesday is")
- p("The eigenvalue is positive")
A good model assigns a much higher probability to a fluent sentence like "Today is Wednesday" than to a scrambled one like "Today Wednesday is".
It can be regarded as a probabilistic mechanism for "generating" text, and is therefore also called a "generative" model.

Why is a LM useful?
It provides a principled way to quantify the uncertainties associated with natural language, and lets us answer questions like:
- Given that we see "John" and "feels", how likely are we to see "happy", as opposed to "habit", as the next word? (speech recognition)
- Given that we observe "baseball" three times and "game" once in a news article, how likely is the article about "sports" vs. "politics"? (text categorization)
- Given that a user is interested in sports news, how likely is the user to use "baseball" in a query? (information retrieval)

Measure the fluency of documents

Source-Channel framework [Shannon 48]
Source → Transmitter (encoder) → Noisy Channel → Receiver (decoder) → Destination
The source emits X with probability P(X); the noisy channel transforms it into Y with probability P(Y|X); the receiver recovers X' = argmax_X P(X|Y) = argmax_X P(Y|X) P(X) (Bayes' rule). When X is text, P(X) is a language model.
Many examples:
- Speech recognition: X = word sequence, Y = speech signal
- Machine translation: X = English sentence, Y = Chinese sentence
- OCR error correction: X = correct word, Y = erroneous word
- Information retrieval: X = document, Y = query
- Summarization: X = summary, Y = document
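To make the decoding rule concrete, here is a minimal Python sketch of noisy-channel decoding for a misspelled word. The candidate list, the channel_model table, and the language_model table are invented illustrative numbers, not values from the slides.

```python
# Minimal noisy-channel decoder sketch (hypothetical numbers, for illustration only).
# X = intended word, Y = observed (possibly corrupted) word.

def decode(observed, candidates, channel_model, language_model):
    """Return argmax over X of P(Y=observed | X) * P(X)."""
    return max(candidates,
               key=lambda x: channel_model.get((observed, x), 0.0) * language_model.get(x, 0.0))

# Made-up example: correcting the OCR/typo output "teh".
language_model = {"the": 0.05, "ten": 0.002, "tech": 0.001}          # P(X): unigram LM
channel_model = {("teh", "the"): 0.3, ("teh", "ten"): 0.05,          # P(Y|X): error model
                 ("teh", "tech"): 0.02}

print(decode("teh", ["the", "ten", "tech"], channel_model, language_model))  # -> "the"
```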

Basic concepts of probability
- Random experiment: an experiment with an uncertain outcome (e.g., tossing a coin, picking a word from text)
- Sample space (S): all possible outcomes of an experiment; e.g., for tossing 2 fair coins, S = {HH, HT, TH, TT}
- Event (E): E ⊆ S; E happens iff the outcome is in E; e.g., E = {HH} (all heads), E = {HH, TT} (same face)
  - Impossible event ({}), certain event (S)
- Probability of an event: 0 ≤ P(E) ≤ 1

Recap: how to assign weights?
This is important. Why?
- Corpus-wise: some terms carry more information about the document content
- Document-wise: not all terms are equally important
How?
- Two basic heuristics:
  - TF (Term Frequency) = within-document frequency
  - IDF (Inverse Document Frequency): rewards terms that appear in few documents

Recap: TF-IDF weighting
"Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." (Wikipedia)
The Gerard Salton Award is the highest achievement award in IR.

Recap: cosine similarity
Documents are represented as unit-length TF-IDF vectors, and similarity is the cosine of the angle between them. (Figure: documents D1, D2, and D6 plotted in a TF-IDF space whose axes are "Sports" and "Finance".)

Recap: disadvantages of the vector space model
- Assumes term independence
- Lacks "predictive adequacy"
  - Arbitrary term weighting
  - Arbitrary similarity measure
- Lots of parameter tuning!

Recap: what is a statistical LM?
A model specifying a probability distribution over word sequences, e.g., p("Today is Wednesday"), p("Today Wednesday is"), p("The eigenvalue is positive"). It can be regarded as a probabilistic mechanism for "generating" text, and is therefore also called a "generative" model.

Sampling with replacement
Pick a random shape, then put it back in the bag.

Essential probability concepts

Essential probability concepts (continued)

Language model for text
Chain rule: from conditional probabilities to the joint probability of a sentence:
p(w1 w2 ... wn) = p(w1) p(w2|w1) p(w3|w1,w2) ... p(wn|w1,...,wn-1)
How large is this? The average English sentence length is 14.3 words, and there are 475,000 main headwords in Webster's Third New International Dictionary, so the full joint distribution has far too many parameters to estimate. We need independence assumptions!
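To make "How large is this?" concrete, here is a rough back-of-the-envelope sketch in Python using the slide's figures of 475,000 headwords and roughly 14-word sentences; the bigram/unigram comparison is added for contrast and the counts are only order-of-magnitude estimates.

```python
# Rough parameter counts for language models over a vocabulary of 475,000 words.
V = 475_000          # vocabulary size (Webster's Third main headwords, from the slide)
n = 14               # approximate average English sentence length (from the slide)

full_joint = V ** n             # one parameter per possible 14-word sequence
bigram = V ** 2                 # p(w_i | w_{i-1}) for every word pair
unigram = V                     # p(w) for every word

print(f"full joint over 14-word sentences: ~{full_joint:.2e} parameters")
print(f"bigram model:                      ~{bigram:.2e} parameters")
print(f"unigram model:                     ~{unigram:.2e} parameters")
```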

Unigram language model
The simplest and most popular choice: generate each word independently, so that p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn).
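A minimal sketch of a unigram language model in Python: the probability of a word sequence is simply the product of the individual word probabilities. The toy distribution theta is invented for illustration and is not a full vocabulary.

```python
import math

def unigram_log_prob(words, theta):
    """log p(w1..wn | theta) = sum_i log p(wi | theta) under the unigram assumption."""
    return sum(math.log(theta[w]) for w in words)

# Invented toy distribution (a real model sums to 1 over the whole vocabulary).
theta = {"text": 0.2, "mining": 0.1, "is": 0.05, "fun": 0.05, "the": 0.1}
print(unigram_log_prob(["text", "mining", "is", "fun"], theta))
```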

More sophisticated LMs

Why just unigram models?
Difficulty in moving toward more complex models:
- They involve more parameters, so they need more data to estimate
- They significantly increase computational complexity, in both time and space
Capturing word order or structure may not add much value for "topical inference".
But using more sophisticated models can still be expected to improve performance.

Generative view of text documents
A (unigram) language model θ specifies p(w|θ); a document is generated by sampling words from it.
- Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food ...
- Topic 2 (health): text 0.01, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...
Sampling from the text-mining model produces a text-mining document; sampling from the health model produces a food/nutrition document.
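A small sketch of this generative view: sample a document word-by-word from a topic's unigram distribution. The two topic tables echo the slide's example, but the exact probabilities (renormalized over a tiny vocabulary) are assumptions.

```python
import random

# Toy topic language models (probabilities renormalized over a tiny vocabulary; illustrative only).
topic_text_mining = {"text": 0.4, "mining": 0.3, "association": 0.1, "clustering": 0.15, "food": 0.05}
topic_health = {"text": 0.02, "food": 0.4, "nutrition": 0.25, "healthy": 0.2, "diet": 0.13}

def generate_document(theta, length, seed=0):
    """Sample `length` words i.i.d. from the unigram LM theta."""
    rng = random.Random(seed)
    words, probs = zip(*theta.items())
    return rng.choices(words, weights=probs, k=length)

print(" ".join(generate_document(topic_text_mining, 10)))
print(" ".join(generate_document(topic_health, 10)))
```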

How to generate text from an N-gram language model?

Generating text from language models
Under a unigram language model, text is generated by repeatedly sampling words independently from p(w|θ).

Generating text from language models (continued)
Under a unigram language model, any reordering of the same words has exactly the same likelihood!
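A quick demonstration of the "same likelihood" point: under a unigram model, reordering the words leaves the probability unchanged. The tiny distribution below is invented for the demonstration.

```python
import math

theta = {"today": 0.2, "is": 0.3, "wednesday": 0.1}   # invented toy unigram LM

def log_prob(words):
    return sum(math.log(theta[w]) for w in words)

print(log_prob(["today", "is", "wednesday"]))   # fluent order
print(log_prob(["today", "wednesday", "is"]))   # scrambled order -> identical value
```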

N-gram language models will help (text generated from language models trained on the New York Times):
- Unigram: "Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a q acquire to six executives."
- Bigram: "Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already told M.X. corporation of living on information such as more frequently fishing to keep her."
- Trigram: "They also point to ninety nine point six billon dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions."
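For contrast with the unigram sampler shown earlier, here is a sketch of generating text from a bigram model, where each word is sampled conditioned on the previous word. The transition table is a tiny invented example, not the New York Times model behind the slide's outputs.

```python
import random

# Invented bigram model: p(next | previous). "<s>" starts a sentence, "</s>" ends it.
bigram = {
    "<s>":    {"the": 0.6, "stocks": 0.4},
    "the":    {"market": 0.7, "stocks": 0.3},
    "stocks": {"fell": 0.5, "rose": 0.5},
    "market": {"rose": 0.6, "fell": 0.4},
    "rose":   {"</s>": 1.0},
    "fell":   {"</s>": 1.0},
}

def generate_bigram_sentence(model, seed=1, max_len=20):
    rng = random.Random(seed)
    word, out = "<s>", []
    for _ in range(max_len):
        nxt = rng.choices(list(model[word]), weights=list(model[word].values()))[0]
        if nxt == "</s>":
            break
        out.append(nxt)
        word = nxt
    return " ".join(out)

print(generate_bigram_sentence(bigram))
```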

Turing test: generating Shakespeare
(Figure: four text samples, labeled A through D, generated from different language models; which one reads most like Shakespeare?) See also SCIgen, an automatic CS paper generator.

Estimation of language models
Document: a "text mining" paper (total #words = 100) with word counts text 10, mining 5, association 3, database 3, algorithm 2, ..., query 1, efficient 1, ...
Unigram language model θ: p(w|θ) = ? (text ?, mining ?, association ?, database ?, ..., query ?, ...)
Estimation: infer the word probabilities p(w|θ) from the observed document.

Sampling with replacement
Pick a random shape, then put it back in the bag.

Recap: language model for text
Chain rule: from conditional probabilities to the joint probability of a sentence, p(w1 w2 ... wn) = p(w1) p(w2|w1) ... p(wn|w1,...,wn-1). With an average English sentence length of 14.3 words and 475,000 main headwords in Webster's Third New International Dictionary, this joint distribution is far too large to estimate directly; we need independence assumptions!

Recap: how to generate text from an N-gram language model?

Recap: estimation of language models
Maximum likelihood estimation of the unigram language model p(w|θ) for the "text mining" paper (total #words = 100): p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, ..., p(query|θ) = 1/100, ...

Parameter estimation

Maximum likelihood vs. Bayesian estimation
- Maximum likelihood estimation: "best" means the data likelihood reaches its maximum (the frequentist's point of view). Issue: small sample size.
- Bayesian estimation: "best" means being consistent with our "prior" knowledge while explaining the data well; a.k.a. maximum a posteriori (MAP) estimation (the Bayesian's point of view). Issue: how do we define the prior?

Illustration of Bayesian estimation
- Prior: p(θ)
- Likelihood: p(X|θ), where X = (x1, ..., xN)
- Posterior: p(θ|X) ∝ p(X|θ) p(θ)
(Figure: the prior mode, the ML estimate θ_ML, and the posterior mode shown along the θ axis.)
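A minimal sketch of MAP estimation for a unigram model, assuming a Dirichlet prior whose pseudo-counts are proportional to a background distribution; the strength mu and the background probabilities are illustrative assumptions, not values from the slides.

```python
def map_unigram(counts, background, mu=10.0):
    """MAP (posterior-mode) estimate under a Dirichlet prior with alpha_w = mu * background[w] + 1.

    Equivalent to adding mu * background[w] imaginary occurrences of each word before
    normalizing, so the estimate is pulled from the ML estimate toward the prior.
    """
    total = sum(counts.values()) + mu
    return {w: (counts.get(w, 0) + mu * background[w]) / total for w in background}

counts = {"text": 10, "mining": 5}                         # observed document counts
background = {"text": 0.1, "mining": 0.05, "food": 0.85}   # assumed prior / background LM
print(map_unigram(counts, background))                     # "food" now gets nonzero probability
```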

Maximum likelihood estimation
Maximize the log-likelihood log p(d|θ) = Σ_w c(w,d) log p(w|θ) subject to the requirement from probability that Σ_w p(w|θ) = 1. Setting the partial derivatives (of the Lagrangian) to zero, and using the constraint, we have the ML estimate
p(w|θ) = c(w,d) / Σ_{w'} c(w',d).

Maximum likelihood estimation
Equivalently, p(w|θ) = c(w,d) / |d|, where the denominator is the length of the document (or the total number of words in a corpus).
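A minimal sketch of the maximum likelihood estimate in code, mirroring the "text mining paper" example; the short word list standing in for the 100-word document is an assumption.

```python
from collections import Counter

def mle_unigram(words):
    """p(w | theta) = c(w, d) / |d|  (maximum likelihood estimate)."""
    counts = Counter(words)
    total = sum(counts.values())           # |d|, the document length
    return {w: c / total for w, c in counts.items()}

# A stand-in for the "text mining" paper from the slide (the real document has 100 words).
doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["query"] * 1 + ["other"] * 81
theta = mle_unigram(doc)
print(theta["text"], theta["mining"], theta["query"])   # 0.1 0.05 0.01
```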

Problem with MLE: unseen events
- We estimated a model on 440K word tokens, but only 30,000 unique words occurred, and only 0.04% of all possible bigrams occurred.
- This means any word or N-gram that does not occur in the training data gets zero probability!
- Consequently, the model assigns zero probability to any future document containing those unseen words or N-grams.
(Figure: word frequency vs. word rank by frequency in Wikipedia, Nov 27, 2006.)

Smoothing
If we want to assign non-zero probabilities to unseen words (new words, new N-grams), we must discount the probabilities of the observed words.
General procedure:
1. Reserve some probability mass from the words seen in a document/corpus
2. Re-allocate it to the unseen words

Illustration of N-gram language model smoothing
(Figure: probability p(w) plotted over words w for the maximum likelihood estimate and for the smoothed LM; smoothing discounts the seen words and assigns the reserved mass as nonzero probabilities to the unseen words.)

Smoothing methods: additive smoothing
Add a constant δ to the count of each word ("add one" / Laplace smoothing when δ = 1). For a unigram language model:
p(w|d) = (c(w,d) + δ) / (|d| + δ|V|)
where c(w,d) is the count of w in d, |d| is the length of d (its total count), and |V| is the vocabulary size.
Problems? Hint: are all unseen words equally important?
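A minimal sketch of additive smoothing matching the formula above; the vocabulary and counts are invented, and setting delta = 1 gives Laplace ("add one") smoothing.

```python
def additive_smoothing(counts, vocabulary, delta=1.0):
    """p(w | d) = (c(w, d) + delta) / (|d| + delta * |V|)."""
    doc_len = sum(counts.values())
    denom = doc_len + delta * len(vocabulary)
    return {w: (counts.get(w, 0) + delta) / denom for w in vocabulary}

vocabulary = {"text", "mining", "food", "nutrition"}        # invented vocabulary
counts = {"text": 10, "mining": 5}                           # observed counts in d
probs = additive_smoothing(counts, vocabulary, delta=1.0)
print(probs)                # unseen words "food" and "nutrition" each get 1/19
```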

Add-one smoothing for bigrams
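A generic sketch of add-one smoothing for bigram probabilities over a tiny invented corpus: every conditional distribution p(w | prev) receives one extra count per vocabulary word. The corpus and the handling of sentence boundaries are assumptions.

```python
from collections import Counter

def add_one_bigram(tokens):
    """p(w | prev) = (c(prev, w) + 1) / (c(prev) + |V|)  -- add-one smoothed bigrams."""
    vocab = set(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens[:-1])     # counts of the conditioning word
    def prob(prev, w):
        return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + len(vocab))
    return prob

tokens = "the market fell the market rose the stocks fell".split()   # invented corpus
p = add_one_bigram(tokens)
print(p("the", "market"))   # seen bigram
print(p("the", "rose"))     # unseen bigram, still gets nonzero probability
```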

After smoothing
Problem: this assigns too much probability mass to the unseen events.

Refine the idea of smoothing
Should all unseen words get equal probabilities? We can use a reference model (e.g., a background/collection language model) to discriminate among unseen words:
- For a word w seen in d: use a discounted ML estimate p_seen(w|d)
- For a word w not seen in d: use α_d p(w|REF), where p(w|REF) is the reference language model and α_d normalizes the distribution
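One common way to instantiate this "discounted ML estimate plus reference model" idea is linear interpolation with a collection language model (Jelinek-Mercer style); the fixed weight lam and the document/collection statistics below are invented for illustration.

```python
def interpolate_with_reference(doc_counts, collection_lm, lam=0.5):
    """p(w | d) = (1 - lam) * p_ml(w | d) + lam * p(w | REF).

    Seen words are discounted (their ML estimate is scaled by 1 - lam); the reserved
    mass is redistributed to all words in proportion to the reference model.
    """
    doc_len = sum(doc_counts.values())
    return {w: (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_ref
            for w, p_ref in collection_lm.items()}

doc_counts = {"text": 10, "mining": 5, "query": 1}                          # invented counts, |d| = 16
collection_lm = {"text": 0.1, "mining": 0.02, "query": 0.05, "food": 0.83}  # invented p(w | REF)
print(interpolate_with_reference(doc_counts, collection_lm, lam=0.3))
```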

Smoothing methods: linear interpolation
Use (N-1)-gram probabilities to smooth N-gram probabilities. For example, we may never see the trigram "Bob was reading", but we do see the bigram "was reading", so the trigram estimate can borrow strength from the bigram estimate.

Smoothing methods: linear interpolation (continued)
Further generalize it recursively: the smoothed N-gram probability is a weighted combination of the ML N-gram estimate and the smoothed (N-1)-gram probability,
p_smooth(w_i | w_{i-N+1} ... w_{i-1}) = λ p_ML(w_i | w_{i-N+1} ... w_{i-1}) + (1 - λ) p_smooth(w_i | w_{i-N+2} ... w_{i-1})
and so on down the orders.
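A minimal sketch of the generalized recursive interpolation: the smoothed N-gram estimate mixes the ML N-gram estimate with the smoothed (N-1)-gram estimate, bottoming out at a uniform distribution over the vocabulary. The corpus, the single shared lam, and the uniform base case are illustrative assumptions.

```python
from collections import Counter

def interpolated_prob(word, history, counts, vocab_size, lam=0.7):
    """p_smooth(w | history) = lam * p_ML(w | history) + (1 - lam) * p_smooth(w | shorter history)."""
    if not history:                                  # base case: unigram backs off to uniform
        total = sum(c for (ctx, _), c in counts.items() if ctx == ())
        p_ml = counts[((), word)] / total if total else 0.0
        return lam * p_ml + (1 - lam) / vocab_size
    ctx_total = sum(c for (ctx, _), c in counts.items() if ctx == history)
    p_ml = counts[(history, word)] / ctx_total if ctx_total else 0.0
    return lam * p_ml + (1 - lam) * interpolated_prob(word, history[1:], counts, vocab_size, lam)

# Build (history, word) counts for all orders up to trigram from a tiny invented corpus.
tokens = "bob was reading a book and alice was reading too".split()
counts = Counter()
for n in range(1, 4):                                # unigram, bigram, trigram contexts
    for i in range(len(tokens) - n + 1):
        counts[(tuple(tokens[i:i + n - 1]), tokens[i + n - 1])] += 1

print(interpolated_prob("reading", ("bob", "was"), counts, vocab_size=len(set(tokens))))
```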

Smoothing methods
(We will come back to this later.)

Smoothing methods
(Formula expressing the smoothed N-gram in terms of the smoothed (N-1)-gram.)

Language model evaluation
- Train the models on the same training set
  - Parameter tuning can be done by holding out part of the training set for validation
- Test the models on an unseen test set
  - This data set must be disjoint from the training data
- Language model A is better than model B if A assigns a higher probability to the test data than B does

Perplexity
The standard evaluation metric for language models:
- A function of the probability that a language model assigns to a data set
- Rooted in the notion of cross-entropy in information theory

Perplexity (continued)
The inverse of the likelihood of the test set as assigned by the language model, normalized by the number of words:
PP(w1 ... wN) = p(w1 w2 ... wN)^(-1/N)
For an N-gram language model, p(w1 ... wN) is the product of the model's N-gram conditional probabilities.
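A minimal sketch of computing perplexity from per-word probabilities produced by some model function; the smoothed unigram stand-in and its "<unk>" handling are assumptions, and a real evaluation would use the actual N-gram model's conditional probabilities.

```python
import math

def perplexity(test_words, prob):
    """PP = p(w1..wN)^(-1/N) = exp(-(1/N) * sum_i log p(wi | context))."""
    n = len(test_words)
    log_likelihood = sum(math.log(prob(w)) for w in test_words)
    return math.exp(-log_likelihood / n)

# Invented smoothed unigram model as the stand-in `prob` function.
theta = {"text": 0.3, "mining": 0.2, "is": 0.2, "fun": 0.2, "<unk>": 0.1}
prob = lambda w: theta.get(w, theta["<unk>"])

print(perplexity("text mining is fun".split(), prob))
```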

An experiment
- Models: unigram, bigram, and trigram models (with proper smoothing)
- Training data: 38M words of WSJ text (vocabulary: 20K types)
- Test data: 1.5M words of WSJ text
- Results: test-set perplexity for the unigram, bigram, and trigram models

What you should know
- N-gram language models
- How to generate text documents from a language model
- How to estimate a language model
- The general idea of smoothing and the different smoothing methods
- Language model evaluation

Today's reading
- Introduction to Information Retrieval, Chapter 12: Language Models for Information Retrieval
- Speech and Language Processing, Chapter 4: N-Grams