Introduction to Information Retrieval Introduction to Information Retrieval Lecture 12-13 Probabilistic Information Retrieval.

Slides:



Advertisements
Similar presentations
Information Retrieval and Organisation Chapter 11 Probabilistic Information Retrieval Dell Zhang Birkbeck, University of London.
Advertisements

Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.
Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
INSTRUCTOR: DR.NICK EVANGELOPOULOS PRESENTED BY: QIUXIA WU CHAPTER 2 Information retrieval DSCI 5240.
The Probabilistic Model. Probabilistic Model n Objective: to capture the IR problem using a probabilistic framework; n Given a user query, there is an.
Probabilistic Information Retrieval Chris Manning, Pandu Nayak and
Probabilistic Information Retrieval
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Introduction to Information Retrieval Information Retrieval and Data Mining (AT71.07) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor:
CpSc 881: Information Retrieval
Lecture 11: Probabilistic Information Retrieval
Probabilistic Ranking Principle
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
Information Retrieval Models: Probabilistic Models
IR Models: Overview, Boolean, and Vector
Hinrich Schütze and Christina Lioma
ISP 433/533 Week 2 IR Models.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 11: Probabilistic Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Probabilistic IR Models Based on probability theory Basic idea : Given a document d and a query q, Estimate the likelihood of d being relevant for the.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Probabilistic Information Retrieval Part II: In Depth Alexander Dekhtyar Department of Computer Science University of Maryland.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Modeling Modern Information Retrieval
Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University.
Vector Space Model CS 652 Information Extraction and Integration.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.
CS276A Text Retrieval and Mining Lecture 10. Recap of the last lecture Improving search results Especially for high recall. E.g., searching for aircraft.
IR Models: Review Vector Model and Probabilistic.
Probabilistic Models in IR Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Using majority of the slides from.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Query Operations J. H. Wang Mar. 26, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
Relevance Feedback Hongning Wang What we have learned so far Information Retrieval User results Query Rep Doc Rep (Index) Ranker.
Probabilistic Ranking Principle Hongning Wang
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval.
C.Watterscsci64031 Probabilistic Retrieval Model.
Web-Mining Agents Probabilistic Information Retrieval Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Karsten Martiny (Übungen)
CpSc 881: Information Retrieval. 2 Using language models (LMs) for IR ❶ LM = language model ❷ We view the document as a generative model that generates.
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 6 22 Oct 2002.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Searching Full Text 3.
Relevance Feedback Hongning Wang
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
Introduction to Information Retrieval Probabilistic Information Retrieval Chapter 11 1.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 14: Language Models for IR.
1 Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof.
1 Probabilistic Models for Ranking Some of these slides are based on Stanford IR Course slides at
Probabilistic Information Retrieval
Plan for Today’s Lecture(s)
Lecture 13: Language Models for IR
CS276 Information Retrieval and Web Search
Probabilistic Retrieval Models
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Information Retrieval Models: Probabilistic Models
Relevance Feedback Hongning Wang
Text Retrieval and Mining
Probabilistic Ranking Principle
CS 430: Information Discovery
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Probabilistic Information Retrieval
INF 141: Information Retrieval
Information Retrieval and Web Design
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Presentation transcript:

Introduction to Information Retrieval Introduction to Information Retrieval Lecture Probabilistic Information Retrieval

Introduction to Information Retrieval Why probabilities in IR? User Information Need Documents Document Representation Query Representation Query Representation How to match? In vector space model (VSM), matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties? Uncertain guess of whether document has relevant content Understanding of user need is uncertain

Introduction to Information Retrieval Probabilistic IR topics  Classical probabilistic retrieval model  Probability ranking principle, etc.  Binary independence model (≈ Naïve Bayes text cat)  (Okapi) BM25  Bayesian networks for text retrieval  Language model approach to IR  An important emphasis in recent work  Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR.

Introduction to Information Retrieval The document ranking problem  We have a collection of documents  User issues a query  A list of documents needs to be returned  Ranking method is the core of an IR system:  In what order do we present documents to the user?  We want the “best” document to be first, second best second, etc….  Idea: Rank by probability of relevance of the document w.r.t. information need  P(R=1|document i, query)

Introduction to Information Retrieval  For events A and B:  Bayes’ Rule  Odds: Prior Recall a few probability basics Posterior

Introduction to Information Retrieval Probability Ranking Principle Let x represent a document in the collection. Let R represent relevance of a document w.r.t. given (fixed) query and let R=1 represent relevant and R=0 not relevant. p(x|R=1), p(x|R=0) - probability that if a relevant (not relevant) document is retrieved, it is x. Need to find p(R=1|x) - probability that a document x is relevant. p(R=1),p(R=0) - prior probability of retrieving a relevant or non-relevant document

Introduction to Information Retrieval Probability Ranking Principle  How do we compute all those probabilities?  Do not know exact probabilities, have to use estimates  Binary Independence Model (BIM) – which we discuss next – is the simplest model  Questionable assumptions  “Relevance” of each document is independent of relevance of other documents.  Really, it’s bad to keep on returning duplicates  “ term independence assumption ”  terms ’ contributions to relevance are treated as independent events.

Introduction to Information Retrieval Probabilistic Retrieval Strategy  Estimate how terms contribute to relevance  How do things like tf, df, and document length influence your judgments about document relevance?  A more nuanced answer is the Okapi formulae  Spärck Jones / Robertson  Combine to find document relevance probability  Order documents by decreasing probability

Introduction to Information Retrieval Probabilistic Ranking Basic concept: “For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically.” Van Rijsbergen

Introduction to Information Retrieval Binary Independence Model  Traditionally used in conjunction with PRP  “Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. IIR Chapter 1):   iff term i is present in document x.  “Independence”: terms occur in documents independently  Different documents can be modeled as the same vector

Introduction to Information Retrieval Binary Independence Model  Queries: binary term incidence vectors  Given query q,  for each document d need to compute p(R|q,d).  replace with computing p(R|q,x) where x is binary term incidence vector representing d.  Interested only in ranking  Will use odds and Bayes’ Rule:

Introduction to Information Retrieval Binary Independence Model Using Independence Assumption: Constant for a given query Needs estimation

Introduction to Information Retrieval Binary Independence Model Since x i is either 0 or 1 : Let Assume, for all terms not occurring in the query (q i =0)

Introduction to Information Retrieval documentrelevant (R=1)not relevant (R=0) term presentx i = 1pipi riri term absentx i = 0(1 – p i )(1 – r i )

Introduction to Information Retrieval All matching terms Non-matching query terms Binary Independence Model All matching terms All query terms

Introduction to Information Retrieval Binary Independence Model Constant for each query Only quantity to be estimated for rankings Retrieval Status Value:

Introduction to Information Retrieval Binary Independence Model All boils down to computing RSV. So, how do we compute c i ’ s from our data ? The c i are log odds ratios They function as the term weights in this model

Introduction to Information Retrieval Binary Independence Model Estimating RSV coefficients in theory For each term i look at this table of document counts: Estimates: For now, assume no zero terms. See later lecture.

Introduction to Information Retrieval Estimation – key challenge  If non-relevant documents are approximated by the whole collection, then r i (prob. of occurrence in non-relevant documents for query) is n/N and

Introduction to Information Retrieval Estimation – key challenge  p i (probability of occurrence in relevant documents) cannot be approximated as easily  p i can be estimated in various ways:  from relevant documents if know some  Relevance weighting can be used in a feedback loop  constant (Croft and Harper combination match) – then just get idf weighting of terms (with p i =0.5 )  proportional to prob. of occurrence in collection  Greiff (SIGIR 1998) argues for 1/3 + 2/3 df i /N

Introduction to Information Retrieval Probabilistic Relevance Feedback 1.Guess a preliminary probabilistic description of R=1 documents and use it to retrieve a first set of documents 2.Interact with the user to refine the description: learn some definite members with R=1 and R=0 3.Reestimate p i and r i on the basis of these  Or can combine new information with original guess (use Bayesian prior): 4.Repeat, thus generating a succession of approximations to relevant documents κ is prior weight

Introduction to Information Retrieval 22 Iteratively estimating p i and r i (= Pseudo-relevance feedback) 1.Assume that p i is constant over all x i in query and r i as before  p i = 0.5 (even odds) for any given doc 2.Determine guess of relevant document set:  V is fixed size set of highest ranked documents on this model 3.We need to improve our guesses for p i and r i, so  Use distribution of x i in docs in V. Let V i be set of documents containing x i  p i = |V i | / |V|  Assume if not retrieved then not relevant  r i = (n i – |V i |) / (N – |V|) 4.Go to 2. until converges then return ranking

Introduction to Information Retrieval Okapi BM25: A Non-binary Model The BIM was originally designed for short catalog records of fairly consistent length, and it works reasonably in these contexts For modern full-text search collections, a model should pay attention to term frequency and document length BestMatch25 (a.k.a BM25 or Okapi) is sensitive to these quantities From 1994 until today, BM25 is one of the most widely used and robust retrieval models 23

Introduction to Information Retrieval Okapi BM25: A Nonbinary Model  The simplest score for document d is just idf weighting of the query terms present in the document:  Improve this formula by factoring in the term frequency and document length:  tf td : term frequency in document d  L d (Lave): length of document d (average document length in the whole collection)  k 1 : tuning parameter controlling the document term frequency scaling  b: tuning parameter controlling the scaling by document length 24

Introduction to Information Retrieval Saturation function  For high values of k 1, increments in tf i continue to contribute significantly to the score  Contributions tail off quickly for low values of k 1

Introduction to Information Retrieval Document length normalization

Introduction to Information Retrieval Okapi BM25: A Nonbinary Model  If the query is long, we might also use similar weighting for query terms  tf tq : term frequency in the query q  k 3 : tuning parameter controlling term frequency scaling of the query  No length normalisation of queries (because retrieval is being done with respect to a single fixed query)  The above tuning parameters should ideally be set to optimize performance on a development test collection. In the absence of such optimisation, experiments have shown reasonable values are to set k 1 and k 3 to a value between 1.2 and 2 and b =