CS590I: Information Retrieval


CS590I: Information Retrieval Retrieval Models: Language models Luo Si Department of Computer Science Purdue University

Retrieval Model: Language Model Introduction to language models Unigram language model Document language model estimation Maximum likelihood estimation Maximum a posteriori estimation Jelinek-Mercer smoothing Model-based feedback

Language Models: Motivation Vector space model for information retrieval Documents and queries are vectors in the term space Relevance is measured by the similarity between document vectors and the query vector Problems with the vector space model Ad-hoc term weighting schemes Ad-hoc similarity measurement No justification of the relationship between relevance and similarity We need more principled retrieval models…

Introduction to Language Models: A language model can be created for any language sample A document A collection of documents A sentence, paragraph, chapter, query… The size of the language sample affects the quality of the language model Long documents yield more accurate models Short documents yield less accurate models Models for a sentence, paragraph or query may not be reliable

Introduction to Language Models: A document language model defines a probability distribution over indexed terms E.g., the probability of generating a term The probabilities sum to 1 A query can be seen as observed data from unknown models A query also defines a language model (more on this later) How might the models be used for IR? Rank documents by Pr( | ) Rank documents by the Kullback-Leibler (KL) divergence between the query and document language models (covered later)

Language Model for IR: Example Estimate a language model for each document: sport, basketball, ticket, sport basketball, ticket, finance, ticket, sport stock, finance, finance, stock Given the query “sport, basketball”, estimate the generation probability Pr( | ) under each document’s language model and rank the documents to generate the retrieval results

Language Models Three basic problems for language models What type of probability distribution can be used to construct language models? How to estimate the parameters of the distribution of the language models? How to compute the likelihood of generating queries given the language models of documents?

Multinomial/Unigram Language Models Language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary Example: five words in the vocabulary (sport, basketball, ticket, finance, stock) For a document, its language model is: {Pi(“sport”), Pi(“basketball”), Pi(“ticket”), Pi(“finance”), Pi(“stock”)} Formally: the language model is {Pi(w) for any word w in vocabulary V}

Multinomial/Unigram Language Models Estimating a multinomial language model for each document: sport, basketball, ticket, sport basketball, ticket, finance, ticket, sport stock, finance, finance, stock

Maximum Likelihood Estimation (MLE) Find the model parameters that maximize the generation likelihood: M* = argmax_M Pr(D|M) There are K words in the vocabulary, w1...wK (e.g., K = 5) Data: one document with counts tfi(w1), …, tfi(wK) and length |di| Model: multinomial M with parameters {pi(wk)} Likelihood: Pr(di|M) ∝ pi(w1)^tfi(w1) · … · pi(wK)^tfi(wK) M* = argmax_M Pr(di|M)

Maximum Likelihood Estimation (MLE) Use the Lagrange multiplier approach with the constraint that the probabilities sum to 1 Set partial derivatives to zero Get the maximum likelihood estimate: pi(wk) = tfi(wk) / |di|
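The Lagrange-multiplier derivation sketched above can be written out as:

```latex
% Maximize the log-likelihood subject to the sum-to-one constraint:
%   L(p, \lambda) = \sum_k tf_i(w_k) \log p_i(w_k)
%                 + \lambda \bigl(1 - \sum_k p_i(w_k)\bigr)
\frac{\partial L}{\partial p_i(w_k)}
  = \frac{tf_i(w_k)}{p_i(w_k)} - \lambda = 0
  \quad\Rightarrow\quad
  p_i(w_k) = \frac{tf_i(w_k)}{\lambda}
% Summing over k and applying \sum_k p_i(w_k) = 1 gives
% \lambda = \sum_k tf_i(w_k) = |d_i|, hence
p_i(w_k) = \frac{tf_i(w_k)}{|d_i|}
```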

Maximum Likelihood Estimation (MLE) Estimating the language model for each document: sport, basketball, ticket, sport → (psp, pb, pt, pf, pst) = (0.5, 0.25, 0.25, 0, 0) basketball, ticket, finance, ticket, sport → (psp, pb, pt, pf, pst) = (0.2, 0.2, 0.4, 0.2, 0) stock, finance, finance, stock → (psp, pb, pt, pf, pst) = (0, 0, 0, 0.5, 0.5)
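The MLE estimates above can be reproduced with a short Python sketch (function and variable names are illustrative, not from the lecture):

```python
from collections import Counter

def mle_language_model(doc_tokens, vocabulary):
    """Maximum likelihood estimate: p(w|d) = tf(w, d) / |d|."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: counts[w] / n for w in vocabulary}

vocab = ["sport", "basketball", "ticket", "finance", "stock"]
doc = ["sport", "basketball", "ticket", "sport"]
model = mle_language_model(doc, vocab)
# model == {"sport": 0.5, "basketball": 0.25, "ticket": 0.25,
#           "finance": 0.0, "stock": 0.0}
```

Note the zero probabilities for the unseen words finance and stock; this is exactly the sparseness problem the following slides address.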

Maximum Likelihood Estimation (MLE) MLE assigns zero probabilities to unseen words in a small sample A specific example: only two words in the vocabulary (w1 = sport, w2 = business), like (head, tail) for a coin; a document generates a sequence of the two words, as if the coin were flipped many times If we observe only two words (flip the coin twice), the MLE estimates are: “business sport” Pi(w1) = 0.5 “sport sport” Pi(w1) = 1 ? “business business” Pi(w1) = 0 ?

Maximum Likelihood Estimation (MLE) A specific example: if we observe only two words (flip the coin twice), the MLE estimates are: “business sport” Pi(w1)* = 0.5 “sport sport” Pi(w1)* = 1 ? “business business” Pi(w1)* = 0 ? This is the data sparseness problem

Solutions to Sparse Data Problems Maximum a posteriori (MAP) estimation Shrinkage Bayesian ensemble approach

Maximum A Posteriori (MAP) Estimation MAP estimation: select the model that maximizes the probability of the model given the observed data M* = argmax_M Pr(M|D) = argmax_M Pr(D|M) Pr(M) Pr(M): prior belief/knowledge Use the prior Pr(M) to avoid zero probabilities A specific example: only two words in the vocabulary (sport, business) For a document, place a prior distribution over the model parameters

Maximum A Posteriori (MAP) Estimation Introduce a prior on the multinomial distribution Use the prior Pr(M) to avoid zero probabilities; most coins are more or less unbiased Use a Dirichlet prior on p(w): Dir(p | α1, …, αK) ∝ p1^(α1−1) · … · pK^(αK−1), with hyper-parameters α1, …, αK; the normalizing constant is Γ(Σk αk) / Πk Γ(αk), where Γ(x) is the gamma function

Maximum A Posteriori (MAP) Estimation For the two-word example: a Dirichlet (Beta) prior ∝ P(w1)^2 (1 − P(w1))^2

Maximum A Posteriori (MAP) Estimation M* = argmax_M Pr(M|D) = argmax_M Pr(D|M) Pr(M) The hyper-parameters act as pseudo counts: pi(wk)* = (tfi(wk) + αk − 1) / (|di| + Σj (αj − 1))

Maximum A Posteriori (MAP) Estimation A specific example: observe only two words (flip a coin twice): “sport sport” Pi(w1)* = 1 ? Apply the prior ∝ P(w1)^2 (1 − P(w1))^2

Maximum A Posteriori (MAP) Estimation A specific example: observe only two words (flip a coin twice): “sport sport” Pi(w1)* = 1 ? No: with the prior ∝ P(w1)^2 (1 − P(w1))^2, the MAP estimate is (2 + 2) / (2 + 4) = 2/3 rather than 1
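The pseudo-count effect of the prior can be checked with a minimal sketch for the two-word case (the function name is illustrative; the prior p^2 (1 − p)^2 corresponds to Beta(3, 3)):

```python
def map_estimate(tf, n, alpha, beta):
    """MAP estimate for a two-word vocabulary under a Beta(alpha, beta)
    prior: (tf + alpha - 1) / (n + alpha + beta - 2)."""
    return (tf + alpha - 1) / (n + alpha + beta - 2)

# Observing "sport sport" (tf = 2 out of n = 2) with prior Beta(3, 3):
p_sport = map_estimate(2, 2, 3, 3)     # 4/6, no longer 1
p_business = map_estimate(0, 2, 3, 3)  # 2/6, no longer 0
```

The prior's pseudo counts pull both estimates away from the extremes while keeping them summing to 1.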

MAP Estimation for the Unigram Language Model Use a Dirichlet prior for the multinomial distribution How should the parameters of the Dirichlet prior be set?

MAP Estimation for the Unigram Language Model Use a Dirichlet prior for the multinomial distribution There are K terms in the vocabulary; the prior Dir(p | α1, …, αK) has hyper-parameters α1, …, αK and a gamma-function normalizing constant

MAP Estimation for the Unigram Language Model Use a Lagrange multiplier; set the derivative to 0: pi(wk)* = (tfi(wk) + αk − 1) / (|di| + Σj (αj − 1)) The pseudo counts are set by the hyper-parameters

MAP Estimation for the Unigram Language Model Use a Lagrange multiplier; set the derivative to 0 How should we determine appropriate values for the hyper-parameters? When nothing is observed from a document, what is the most likely pi(wk) without looking at the content of the document?

MAP Estimation for the Unigram Language Model The most likely pi(wk) without looking into the content of document d is the unigram probability of the whole collection: {p(w1|C), p(w2|C), …, p(wK|C)} Without any other information, guess the behavior of one member from the behavior of the whole population Set αk = μ p(wk|C) + 1 for a constant μ

MAP Estimation for the Unigram Language Model Use a Lagrange multiplier; set the derivative to 0: pi(wk)* = (tfi(wk) + μ p(wk|C)) / (|di| + μ) Pseudo counts: μ p(wk|C); pseudo document length: μ

Maximum A Posteriori (MAP) Estimation Dirichlet MAP estimation for the unigram language model: Step 0: compute the collection unigram language model p(w|C) over the whole collection Step 1: for each document, compute its smoothed unigram language model (Dirichlet smoothing) as p(w|d) = (tfd(w) + μ p(w|C)) / (|d| + μ)

Maximum A Posteriori (MAP) Estimation Dirichlet MAP estimation for the unigram language model: Step 2: for a given query q = {tfq(w1), …, tfq(wK)}, compute the likelihood for each document: Pr(q|d) = Πk p(wk|d)^tfq(wk) The larger the likelihood, the more relevant the document is to the query
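Steps 0-2 can be sketched end to end on the running example; a minimal illustration in Python (the names and the choice μ = 2 are assumptions, not from the lecture):

```python
import math
from collections import Counter

def collection_model(docs, vocab):
    """Step 0: unigram probabilities over the whole collection."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts[w] for w in vocab)
    return {w: counts[w] / total for w in vocab}

def dirichlet_smoothed(doc, p_c, mu):
    """Step 1: p(w|d) = (tf_d(w) + mu * p(w|C)) / (|d| + mu)."""
    counts = Counter(doc)
    n = len(doc)
    return {w: (counts[w] + mu * p_c[w]) / (n + mu) for w in p_c}

def query_log_likelihood(query, doc_model):
    """Step 2: log Pr(q|d) = sum over query terms of log p(w|d)."""
    return sum(math.log(doc_model[w]) for w in query)

vocab = ["sport", "basketball", "ticket", "finance", "stock"]
docs = [
    ["sport", "basketball", "ticket", "sport"],
    ["basketball", "ticket", "finance", "ticket", "sport"],
    ["stock", "finance", "finance", "stock"],
]
p_c = collection_model(docs, vocab)
models = [dirichlet_smoothed(d, p_c, mu=2.0) for d in docs]
scores = [query_log_likelihood(["sport", "basketball"], m) for m in models]
# The first document ranks highest; the third, which never
# mentions the query words, ranks lowest.
```

Because of the smoothing, the third document still gets a finite (if low) score instead of a zero likelihood.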

Dirichlet Smoothing & TF-IDF Expanding the query log-likelihood under Dirichlet smoothing reveals a TF-IDF-like structure: log Pr(q|d) = Σ_{w: tfd(w)>0} tfq(w) log(1 + tfd(w) / (μ p(w|C))) + |q| log(μ / (|d| + μ)) + a document-independent part Look at the tf.idf part: tfd(w) plays the role of term frequency and 1/p(w|C) plays the role of inverse document frequency, while |q| log(μ / (|d| + μ)) acts as document length normalization; the document-independent part is irrelevant to ranking

Dirichlet Smoothing Hyper-Parameter When μ is very small, the estimate approaches the MLE estimator When μ is very large, it approaches the probability on the whole collection How to set an appropriate μ?

Dirichlet Smoothing Hyper-Parameter Leave-one-out validation: for each document, leave each word wj out in turn, estimate the smoothed language model from the remaining words, and compute the likelihood of the held-out word Do this procedure for all documents in the collection Find the μ that maximizes the total leave-one-out log-likelihood
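Leave-one-out selection of μ can be sketched as follows (the candidate grid, document set, and all names are illustrative assumptions):

```python
import math
from collections import Counter

def loo_log_likelihood(docs, p_c, mu):
    """Total leave-one-out log-likelihood under Dirichlet smoothing:
    each occurrence of w in d is predicted from d with that one
    occurrence removed."""
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        n = len(doc)
        for w, c in counts.items():
            # c occurrences of w; each is held out in turn.
            p = (c - 1 + mu * p_c[w]) / (n - 1 + mu)
            total += c * math.log(p)
    return total

vocab = ["sport", "basketball", "ticket", "finance", "stock"]
docs = [
    ["sport", "basketball", "ticket", "sport"],
    ["basketball", "ticket", "finance", "ticket", "sport"],
    ["stock", "finance", "finance", "stock"],
]
counts = Counter(w for d in docs for w in d)
p_c = {w: counts[w] / sum(counts.values()) for w in vocab}

candidates = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
best_mu = max(candidates, key=lambda mu: loo_log_likelihood(docs, p_c, mu))
```

In practice one would optimize μ over a finer grid (or by Newton's method) on the full collection rather than a toy example.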

Dirichlet Smoothing Hyper-Parameter What type of document/collection would get a large μ? One where most documents use similar vocabulary and wording patterns to the whole collection What type of document/collection would get a small μ? One where most documents use different vocabulary and wording patterns than the whole collection

Shrinkage Maximum likelihood (MLE) builds the model purely on document data and generates query words The model may not be accurate when the document is short (many unseen words) A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model) Example: estimate P(Lung_Cancer|Smoke) for West Lafayette by shrinking toward the estimates for Indiana and the U.S.

Shrinkage Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability λ and from the collection language model (MLE) with probability 1 − λ A linear interpolation between the document language model and the collection language model JM smoothing: pλ(w|d) = λ pML(w|d) + (1 − λ) p(w|C)
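The interpolation is one line of code; a minimal sketch (the collection model values are made up for illustration):

```python
from collections import Counter

def jm_smoothed(doc, p_c, lam):
    """Jelinek-Mercer smoothing: linear interpolation of the
    document MLE model with the collection model."""
    counts = Counter(doc)
    n = len(doc)
    return {w: lam * counts[w] / n + (1 - lam) * p_c[w] for w in p_c}

# Hypothetical collection model over a five-word vocabulary.
p_c = {"sport": 0.3, "basketball": 0.2, "ticket": 0.2,
       "finance": 0.2, "stock": 0.1}
doc = ["sport", "basketball", "ticket", "sport"]
model = jm_smoothed(doc, p_c, lam=0.8)
# Unseen words such as "finance" now get nonzero probability
# from the collection component.
```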

Shrinkage Relationship between JM smoothing and Dirichlet smoothing: Dirichlet smoothing is JM smoothing with a document-dependent weight λd = |d| / (|d| + μ), since (tfd(w) + μ p(w|C)) / (|d| + μ) = λd pML(w|d) + (1 − λd) p(w|C)

Model Based Feedback Equivalence of retrieval based on query generation likelihood and Kullback-Leibler (KL) divergence between query and document language models KL divergence between two probability distributions: KL(p || q) = Σw p(w) log(p(w) / q(w)) It measures the distance between the two distributions It is always non-negative (zero only when the distributions are equal) How to prove it? (Hint: Jensen's inequality)

Model Based Feedback Equivalence of retrieval based on query generation likelihood and KL divergence between query and document language models Log-likelihood of the query generation probability: log Pr(q|d) = Σw tfq(w) log p(w|d) = −|q| KL(θq || θd) + a document-independent constant, where θq is the MLE query model This generalizes the query representation to a distribution (fractional term weighting)
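The equivalence can be checked numerically; a sketch with two hypothetical smoothed document models (all values are made up for illustration):

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """KL(p || q) = sum_w p(w) log(p(w) / q(w))."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

query = ["sport", "basketball", "sport"]
theta_q = {w: c / len(query) for w, c in Counter(query).items()}

# Two hypothetical smoothed document models (no zero probabilities).
theta_d1 = {"sport": 0.4, "basketball": 0.2, "ticket": 0.2,
            "finance": 0.1, "stock": 0.1}
theta_d2 = {"sport": 0.1, "basketball": 0.1, "ticket": 0.2,
            "finance": 0.3, "stock": 0.3}

entropy = -sum(p * math.log(p) for p in theta_q.values())
results = []
for theta_d in (theta_d1, theta_d2):
    log_lik = sum(math.log(theta_d[w]) for w in query)
    neg_kl = -kl_divergence(theta_q, theta_d)
    results.append((log_lik, neg_kl))
# Both scores prefer theta_d1, and for each document
# neg_kl == log_lik / |q| + H(theta_q), a document-independent shift.
```

Since the shift is the same for every document, ranking by query likelihood and ranking by negative KL divergence produce the same ordering.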

Model Based Feedback Two equivalent retrieval pipelines: (1) estimate a language model for each document and rank by the query generation probability Pr( | ); (2) estimate both a query language model and document language models, then rank by the KL divergence between them

Model Based Feedback Use feedback documents from the initial retrieval results to estimate a new query model: interpolate between the original query language model (no feedback) and a model estimated from the feedback documents (full feedback), then re-rank by KL divergence

Model Based Feedback: Estimate θF Assume a generative model produces each word in the feedback document(s): flip a coin; with probability 1 − λ the word is generated from the background (collection) model PC(w), and with probability λ from the unknown topic model θF(w)

Model Based Feedback: Estimate θF For each word, there is a hidden variable telling which language model it comes from Background model pC(w|C), weight 1 − λ = 0.8: the 0.12, to 0.05, it 0.04, a 0.02, …, sport 0.0001, basketball 0.00005 Unknown query topic model p(w|F) = ?, weight λ = 0.2, e.g. “Basketball”: sport = ?, basketball = ?, game = ?, player = ? If we knew the value of the hidden variable for each word, the MLE estimator would be straightforward

Model Based Feedback: Estimate θF For each word, the hidden variable Zi = {1 (feedback), 0 (background)} Step 1 (Expectation, E-step): estimate the hidden variables based on the current model parameters, e.g. p(Zi = 1 | wi) for: the (0.1), basketball (0.7), game (0.6), is (0.2), … Step 2 (Maximization, M-step): update the model parameters based on the guesses in Step 1

Model Based Feedback: Estimate θF Expectation-Maximization (EM) algorithm, given λ = 0.5: Step 0: initialize the values of θF Step 1 (Expectation): compute p(Zi = 1 | wi) = λ θF(wi) / (λ θF(wi) + (1 − λ) pC(wi)) Step 2 (Maximization): re-estimate θF from the expected counts; iterate Steps 1-2 until convergence
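The EM loop above can be sketched as runnable code for θF under a fixed background model (the toy vocabulary, λ, and all names are illustrative assumptions):

```python
from collections import Counter

def em_feedback_model(feedback_tokens, p_c, lam, iters=100):
    """EM for the mixture: each word comes from the topic model theta_F
    with probability lam, else from the background model p_c.
    lam is fixed; only theta_F is re-estimated."""
    vocab = list(p_c)
    counts = Counter(feedback_tokens)
    theta_f = {w: 1.0 / len(vocab) for w in vocab}  # Step 0: uniform init
    for _ in range(iters):
        # E-step: posterior that each word came from the topic model.
        z = {w: lam * theta_f[w] / (lam * theta_f[w] + (1 - lam) * p_c[w])
             for w in vocab}
        # M-step: re-estimate theta_F from expected topic counts.
        expected = {w: counts[w] * z[w] for w in vocab}
        total = sum(expected.values())
        theta_f = {w: expected[w] / total for w in vocab}
    return theta_f

# Background model dominated by common words.
p_c = {"the": 0.5, "is": 0.3, "basketball": 0.1, "game": 0.1}
feedback = ["the", "the", "the", "is", "is",
            "basketball", "basketball", "game"]
theta_f = em_feedback_model(feedback, p_c, lam=0.5)
# theta_f concentrates on "basketball" and "game", which occur more
# often in the feedback text than the background model predicts.
```

The background component absorbs common words, so θF ends up emphasizing the topic words of the feedback documents.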

Model Based Feedback: Estimate θF Properties of the parameter λ: if λ is close to 0, most common words can be generated by the collection language model, so more topic words appear in the query language model; if λ is close to 1, the query language model has to generate most common words itself, so fewer topic words appear in the query language model

Retrieval Model: Language Model Introduction to language models Unigram language model Document language model estimation Maximum likelihood estimation Maximum a posteriori estimation Jelinek-Mercer smoothing Model-based feedback