Information Retrieval Models: Probabilistic Models

Information Retrieval Models: Probabilistic Models
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

Modeling Relevance: Roadmap for Retrieval Models
- Relevance constraints [Fang et al. 04]
- Relevance (Rep(q), Rep(d)) can be modeled as:
  - Similarity: different representations and similarity measures, e.g., the vector space model (Salton et al., 75) and the prob. distr. model (Wong & Yao, 89)
  - Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
    - Regression model (Fuhr 89)
    - Learning to rank (Joachims 02, Burges et al. 05)
    - Generative models, P(d|q) or P(q|d):
      - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
    - Divergence from randomness (Amati & Rijsbergen 02)
  - Probabilistic inference, P(d→q) or P(q→d): different inference systems, e.g., the inference network model (Turtle & Croft, 91) and the prob. concept space model (Wong & Yao, 95)

The Basic Question
- What is the probability that THIS document is relevant to THIS query?
- Formally, there are three random variables: query Q, document D, and relevance R ∈ {0,1}
- Given a particular query q and a particular document d, what is p(R=1|Q=q,D=d)?

Probability of Relevance
- Three random variables: query Q, document D, relevance R ∈ {0,1}
- Goal: rank D based on P(R=1|Q,D)
- We do not actually need to evaluate P(R=1|Q,D) exactly; we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., to rank documents
- There are several different ways to refine P(R=1|Q,D)

Probabilistic Retrieval Models: Intuitions
- Suppose we have a large number of relevance judgments (e.g., clickthroughs: "1" = clicked; "0" = skipped)
- We can then score documents with empirical estimates, e.g., P(R=1|Q1,D1)=1/2, P(R=1|Q1,D2)=2/2, P(R=1|Q1,D3)=0/2, ...

  Query (Q)   Doc (D)   Rel (R)
  Q1          D1        1
  Q1          D2        1
  Q1          D3        0
  Q1          D4        0
  Q1          D5        1
  ...
  Q1          D1        0
  Q2          D3        1
  Q3          D1        1
  Q4          D2        1
  Q4          D3        0

- What if we don't have a (sufficient) search log? We can approximate p(R=1|Q,D)! Different assumptions lead to different models (see the sketch below)
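
As a concrete illustration of these empirical estimates, here is a minimal Python sketch (the log data is hypothetical and simply mirrors the table above; it is not from the original slides) that computes P(R=1|Q,D) as clicks over impressions:

from collections import defaultdict

# Hypothetical clickthrough log: (query, doc, relevance) triples,
# mirroring the table above ("1" = clicked, "0" = skipped).
log = [
    ("Q1", "D1", 1), ("Q1", "D2", 1), ("Q1", "D3", 0), ("Q1", "D4", 0),
    ("Q1", "D5", 1), ("Q1", "D1", 0), ("Q2", "D3", 1), ("Q3", "D1", 1),
    ("Q4", "D2", 1), ("Q4", "D3", 0),
]

counts = defaultdict(lambda: [0, 0])   # (query, doc) -> [clicks, impressions]
for q, d, r in log:
    counts[(q, d)][0] += r
    counts[(q, d)][1] += 1

def p_rel(q, d):
    """Empirical estimate of P(R=1 | Q=q, D=d) from the logged judgments."""
    clicks, shown = counts[(q, d)]
    return clicks / shown if shown else None

print(p_rel("Q1", "D1"))  # 0.5 (clicked once, skipped once)
print(p_rel("Q1", "D2"))  # 1.0
print(p_rel("Q1", "D3"))  # 0.0

The point of the slide is precisely that such direct estimation breaks down when a (query, document) pair has few or no logged judgments, which is what motivates the models that follow.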

Refining P(R=1|Q,D), Method 1: Conditional Models
- Basic idea: relevance depends on how well a query matches a document
- Define features on Q x D, e.g., the number of matched terms, the highest IDF of a matched term, the document length, or even score(Q,D) from any other retrieval function, ...
- Model P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
- Use training data (known relevance judgments) to estimate the parameter θ
- Apply the model to rank new documents (a sketch follows)
- Covered in more depth later under learning to rank
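
A minimal sketch of such a conditional model, assuming scikit-learn is available; the feature set, toy IDF values, and training pairs are hypothetical, and g(·) is realized here as logistic regression rather than any particular model from the literature:

import math
from sklearn.linear_model import LogisticRegression

def features(query_terms, doc_terms, idf):
    """Hypothetical Q x D features: #matched terms, max IDF of a matched
    term, and (log) document length, as in the slide's examples."""
    matched = [t for t in query_terms if t in doc_terms]
    return [
        len(matched),
        max((idf.get(t, 0.0) for t in matched), default=0.0),
        math.log(1 + len(doc_terms)),
    ]

# Toy training data: feature vectors f(Q,D) with known relevance judgments.
idf = {"retrieval": 2.3, "probabilistic": 3.1, "the": 0.1}
X = [
    features(["probabilistic", "retrieval"],
             ["probabilistic", "retrieval", "models"], idf),
    features(["probabilistic", "retrieval"],
             ["the", "cat", "sat"], idf),
]
y = [1, 0]

model = LogisticRegression().fit(X, y)   # estimate the parameter theta
p = model.predict_proba(
    [features(["retrieval"], ["retrieval", "evaluation"], idf)])[0, 1]
print(p)  # estimated P(R=1|Q,D) for a new query-document pair

Any classifier that outputs a probability could play the role of g; the design choice is simply that relevance is predicted directly from query-document features using supervised training data.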

Refining P(R=1|Q,D), Method 2: Generative Models
- Basic idea: define P(Q,D|R) and compute the odds O(R=1|Q,D) using Bayes' rule
- Special cases:
  - Document "generation": P(Q,D|R) = P(D|Q,R) P(Q|R)
  - Query "generation": P(Q,D|R) = P(Q|D,R) P(D|R)
- In the document-generation case, the factor P(Q|R) does not depend on D and is ignored for ranking D (see the derivation below)
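
Writing out the Bayes'-rule step for the document-generation case (a standard reconstruction, consistent with the slide's definitions but not copied from it):

\[
O(R=1\mid Q,D)
\;=\; \frac{P(R=1\mid Q,D)}{P(R=0\mid Q,D)}
\;=\; \frac{P(Q,D\mid R=1)\,P(R=1)}{P(Q,D\mid R=0)\,P(R=0)}
\;=\; \underbrace{\frac{P(D\mid Q,R=1)}{P(D\mid Q,R=0)}}_{\text{depends on } D}
\;\cdot\;
\underbrace{\frac{P(Q\mid R=1)\,P(R=1)}{P(Q\mid R=0)\,P(R=0)}}_{\text{constant w.r.t. } D}
\]

Since the second factor is identical for every document, ranking by the odds is equivalent to ranking by the first factor alone, which is exactly what the document-generation models below estimate.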

Document Generation
- P(D|Q,R=1) is a model of relevant docs for Q; P(D|Q,R=0) is a model of non-relevant docs for Q
- Assume independent attributes A1…Ak (why?)
- Let D = d1…dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1…qk)
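
A sketch of the factorization implied by this independence assumption (standard form, reconstructed rather than copied from the slide):

\[
P(D \mid Q, R) \;=\; \prod_{i=1}^{k} P(A_i = d_i \mid Q, R),
\qquad
O(R=1 \mid Q, D) \;\propto\; \prod_{i=1}^{k} \frac{P(A_i = d_i \mid Q, R=1)}{P(A_i = d_i \mid Q, R=0)},
\]

where the proportionality hides only a factor that does not depend on D and therefore does not affect the ranking.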

Robertson-Sparck Jones Model (Robertson & Sparck Jones 76) (RSJ model)
- Two parameters for each term Ai:
  - pi = P(Ai=1|Q,R=1): probability that term Ai occurs in a relevant doc
  - qi = P(Ai=1|Q,R=0): probability that term Ai occurs in a non-relevant doc
- How to estimate the parameters? Suppose we have relevance judgments; then smoothed count estimates can be used, and the "+0.5" and "+1" in those estimates can be justified by Bayesian estimation (see below)
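
The ranking function and the smoothed estimates alluded to here take the following standard form (a reconstruction in the usual notation: N documents in the collection, ni of them containing Ai, |R| judged-relevant documents, ri of them containing Ai):

\[
\mathrm{score}(Q,D) \;\overset{\mathrm{rank}}{=}\;
\sum_{i:\, q_i = d_i = 1} \log \frac{p_i\,(1-q_i)}{q_i\,(1-p_i)},
\qquad
\hat{p}_i = \frac{r_i + 0.5}{|R| + 1},
\qquad
\hat{q}_i = \frac{n_i - r_i + 0.5}{N - |R| + 1}.
\]

The sum runs over terms present in both the query and the document (assuming pi = qi for terms not in the query), and the +0.5 and +1 act as pseudo-counts, which is the Bayesian-estimation justification mentioned on the slide.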

RSJ Model: No Relevance Info (Croft & Harper 79)
- How to estimate the parameters when we do not have relevance judgments?
  - Assume pi to be a constant
  - Estimate qi by assuming all documents to be non-relevant
- N: number of documents in the collection; ni: number of documents in which term Ai occurs
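
Plugging these assumptions into the RSJ term weight gives the familiar IDF-like form (reconstructed; the exact constants on the original slide may differ slightly):

\[
q_i \approx \frac{n_i}{N},
\qquad
w_i \;=\; \log\frac{p_i(1-q_i)}{q_i(1-p_i)}
\;\overset{p_i=0.5}{=}\; \log\frac{1-q_i}{q_i}
\;\approx\; \log\frac{N - n_i}{n_i}
\quad\Bigl(\text{or } \log\tfrac{N - n_i + 0.5}{n_i + 0.5} \text{ with smoothing}\Bigr),
\]

which is essentially an IDF weight: matching a rare term contributes more to the score than matching a common one.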

RSJ Model: Summary
- The most important classic probabilistic IR model
- Uses only term presence/absence, and is thus also referred to as the Binary Independence Model
- Essentially Naïve Bayes applied to document ranking
- Most natural for relevance/pseudo feedback
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as a well-tuned vector space model

Improving RSJ: Adding TF
- Basic doc generation model: let D = d1…dk, where di is now the frequency count of term Ai
- Model term frequencies with a 2-Poisson mixture model
- Many more parameters to estimate! (How many exactly?)
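
The 2-Poisson mixture referred to here models the frequency of term Ai as a mixture of two Poisson distributions, one for "elite" documents (documents genuinely about the concept behind the term) and one for the rest; in one common parameterization (a reconstruction, not the slide's own notation):

\[
P(A_i = f \mid R) \;=\; \pi_{i,R}\, \frac{\lambda_i^{\,f} e^{-\lambda_i}}{f!}
\;+\; (1-\pi_{i,R})\, \frac{\mu_i^{\,f} e^{-\mu_i}}{f!},
\]

with an eliteness probability that depends on relevance and two Poisson rates per term (lambda for elite, mu for non-elite), i.e., on the order of four parameters per term rather than the two (pi, qi) of the binary model.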

BM25/Okapi Approximation (Robertson et al. 94)
- Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
- Observations: log O(R=1|Q,D) is a sum of term weights Wi, where
  - Wi = 0 if TFi = 0
  - Wi increases monotonically with TFi
  - Wi has an asymptotic limit
- A simple function with these properties is given below
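
A simple saturating function with exactly these three properties, in the form commonly used for the Okapi approximation (a standard form; k1 controls how quickly the weight saturates):

\[
W_i \;\propto\; \frac{(k_1 + 1)\,\mathit{TF}_i}{k_1 + \mathit{TF}_i},
\]

which is 0 when TFi = 0, increases monotonically with TFi, and approaches k1 + 1 as TFi grows.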

Adding Doc Length & Query TF
- Incorporating document length
  - Motivation: the 2-Poisson model assumes equal document lengths
  - Implementation: "carefully" penalize long documents
- Incorporating query TF
  - Motivation: appears not to be well justified theoretically
  - Implementation: a similar TF transformation applied to query term counts
- The final formula is called BM25, and it has achieved top TREC performance

The BM25 Formula “Okapi TF/BM25 TF”
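
One common presentation of the BM25 scoring function, consistent with the components described above (the IDF-like RSJ weight, the saturating document TF with length normalization, and an optional query-TF factor); the exact parameterization here is a standard reconstruction, not copied from the slide:

\[
\mathrm{BM25}(Q,D) \;=\; \sum_{i \in Q \cap D}
\log\frac{N - n_i + 0.5}{n_i + 0.5}
\;\cdot\;
\frac{(k_1 + 1)\,\mathit{tf}_{i,D}}
     {k_1\bigl((1-b) + b\,\tfrac{|D|}{\mathit{avdl}}\bigr) + \mathit{tf}_{i,D}}
\;\cdot\;
\frac{(k_3 + 1)\,\mathit{qtf}_i}{k_3 + \mathit{qtf}_i},
\]

where |D| is the document length, avdl is the average document length in the collection, and typical settings are k1 between 1.2 and 2.0, b around 0.75, with k3 large (or the last factor dropped entirely) for short queries.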

Extensions of "Doc Generation" Models
- Capture term dependence (Rijsbergen & Harper 78)
- Alternative ways to incorporate TF (Croft 83, Kalt 96)
- Feature/term selection for feedback (Okapi's TREC reports)
- Estimation of the relevance model based on pseudo feedback, to be covered later (Lavrenko & Croft 01)

What You Should Know
- The basic idea of probabilistic retrieval models
- How to use Bayes' rule to derive a general document-generation retrieval model
- How to derive the RSJ retrieval model (i.e., the binary independence model)
- The assumptions that have to be made in order to derive the RSJ model
- The two-Poisson model and the heuristics introduced to derive BM25