Information Retrieval and Organisation Chapter 11 Probabilistic Information Retrieval Dell Zhang Birkbeck, University of London.

Why Probabilities in IR?
[Diagram: the user's information need leads to a query representation, documents lead to a document representation, and the question is how to match the two.]
In IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. The understanding of the user need is uncertain, and so is the guess of whether a document has relevant content. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

Why Probabilities in IR?
Problems with the vector space model: ad hoc term weighting schemes, ad hoc basis vectors, and ad hoc similarity measurement. We need something more principled!

Probability Ranking Principle
The document ranking method is the core of an IR system. We have a collection of documents. The user issues a query. A list of documents needs to be returned. In what order do we present documents to the user? We want the “best” document to be first, the second best second, and so on.

Probability Ranking Principle
“If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.” van Rijsbergen (1979: )

Probability Ranking Principle
Theorem. The PRP is optimal, in the sense that it minimizes the expected loss (also known as the Bayes risk) under 1/0 loss. This is provable if all probabilities are known correctly.
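As a minimal sketch of what the PRP prescribes, the snippet below sorts documents by an estimated probability of relevance and applies the 1/0-loss decision rule (return a document when relevance is more likely than non-relevance). The function names and the example probabilities are illustrative assumptions, not part of the original slides.

```python
def rank_by_relevance(prob_relevant):
    """Return document ids sorted by decreasing estimated P(R=1 | d, q)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)


def retrieve_under_01_loss(prob_relevant):
    """Keep documents for which relevance is more likely than non-relevance."""
    return [d for d in rank_by_relevance(prob_relevant) if prob_relevant[d] > 0.5]


# Hypothetical probability estimates for three documents.
estimates = {"d1": 0.2, "d2": 0.9, "d3": 0.6}
ranking = rank_by_relevance(estimates)
retrieved = retrieve_under_01_loss(estimates)
```

How good the ranking is depends entirely on how well the probabilities are estimated, which is what the models on the following slides address.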

Binary Independence Model
The BIM is the model that has traditionally been used in conjunction with the PRP. “Binary” = Boolean: documents and queries are represented as binary incidence vectors of terms. “Independence”: terms occur in documents and queries independently. BIM = Bernoulli Naive Bayes model.

Binary Independence Model
Use Bayes’ rule.
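For reference, this step can be written out as follows. The notation is an assumption borrowed from the usual textbook treatment of the BIM: $\vec{x}$ is the binary term incidence vector of the document, $\vec{q}$ that of the query, and $R \in \{0, 1\}$ indicates relevance.

```latex
P(R=1 \mid \vec{x}, \vec{q})
  = \frac{P(\vec{x} \mid R=1, \vec{q})\, P(R=1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})},
\qquad
P(R=0 \mid \vec{x}, \vec{q})
  = \frac{P(\vec{x} \mid R=0, \vec{q})\, P(R=0 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
```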

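Binary Independence Model
Rank documents by their odds of relevance. The prior odds of relevance are constant for a given query, so they do not affect the ranking.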
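A sketch of the odds in the usual textbook form, assuming $\vec{x}$ and $\vec{q}$ denote the binary term vectors of the document and the query and $R$ the relevance indicator (these symbols are assumptions, not taken from the slide):

```latex
O(R \mid \vec{x}, \vec{q})
  = \frac{P(R=1 \mid \vec{x}, \vec{q})}{P(R=0 \mid \vec{x}, \vec{q})}
  = \underbrace{\frac{P(R=1 \mid \vec{q})}{P(R=0 \mid \vec{q})}}_{\text{constant for a given query}}
    \cdot \frac{P(\vec{x} \mid R=1, \vec{q})}{P(\vec{x} \mid R=0, \vec{q})}
```

The shared denominator $P(\vec{x} \mid \vec{q})$ cancels when the two posteriors are divided, which is why it does not appear.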

Binary Independence Model
Make the Naïve Bayes conditional independence assumption.
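Under this assumption the likelihood ratio factorizes over terms. The following is a sketch in assumed textbook notation, where $x_t \in \{0, 1\}$ records whether term $t$ occurs in the document, $\vec{q}$ is the query vector, and $M$ is the vocabulary size:

```latex
\frac{P(\vec{x} \mid R=1, \vec{q})}{P(\vec{x} \mid R=0, \vec{q})}
  = \prod_{t=1}^{M} \frac{P(x_t \mid R=1, \vec{q})}{P(x_t \mid R=0, \vec{q})}
```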

Binary Independence Model Let
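In the usual textbook treatment, the quantities introduced at this point are the per-term occurrence probabilities in relevant and non-relevant documents. The symbols $p_t$ and $u_t$ below are an assumption based on that treatment, with $x_t \in \{0, 1\}$ marking whether term $t$ occurs in the document:

```latex
p_t = P(x_t = 1 \mid R=1, \vec{q}),
\qquad
u_t = P(x_t = 1 \mid R=0, \vec{q})
```

The probabilities of absence follow as $1 - p_t$ and $1 - u_t$.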

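Binary Independence Model
Assume that terms not occurring in the query are equally likely to occur in relevant and non-relevant documents. The odds then split into a factor that is constant for a given query and a factor that is useful for ranking.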
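A sketch of the resulting decomposition, in assumed textbook notation: $p_t$ and $u_t$ are the probabilities that term $t$ occurs in a relevant and in a non-relevant document, and $x_t$, $q_t$ mark the presence of term $t$ in the document and in the query.

```latex
O(R \mid \vec{x}, \vec{q})
  = O(R \mid \vec{q})
    \cdot \underbrace{\prod_{t:\, x_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)}}_{\text{useful}}
    \cdot \underbrace{\prod_{t:\, q_t = 1} \frac{1 - p_t}{1 - u_t}}_{\text{constant}}
```

The second product runs over all query terms regardless of the document, so only the first product distinguishes one document from another.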

Binary Independence Model
Taking the logarithm, which is monotonic and therefore preserves the ranking, we get the retrieval status value (RSV). Each matching term contributes a log odds ratio.
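The RSV can be sketched in a few lines of Python. This assumes the textbook form in which each term present in both query and document contributes c_t = log[p_t / (1 - p_t)] + log[(1 - u_t) / u_t], with p_t and u_t the probabilities that the term occurs in a relevant and in a non-relevant document. The function names and the probability values are hypothetical.

```python
import math


def term_weight(p_t, u_t):
    """Log odds ratio c_t = log[p_t/(1-p_t)] + log[(1-u_t)/u_t]."""
    return math.log(p_t / (1.0 - p_t)) + math.log((1.0 - u_t) / u_t)


def rsv(doc_terms, query_terms, p, u):
    """Sum c_t over terms that occur in both the document and the query."""
    shared = set(doc_terms) & set(query_terms)
    return sum(term_weight(p[t], u[t]) for t in shared)


# Hypothetical estimates: "probabilistic" is discriminative, "retrieval" is not.
p = {"probabilistic": 0.8, "retrieval": 0.5}
u = {"probabilistic": 0.2, "retrieval": 0.5}
score = rsv(["probabilistic", "model"], ["probabilistic", "retrieval"], p, u)
```

A term with p_t = u_t gets weight zero, so it neither helps nor hurts a document.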

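Binary Independence Model
Assume that relevant documents are a very small percentage of the collection. The statistics of non-relevant documents can then be approximated by those of the whole collection, which yields the IDF weight.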
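A small numeric check of this approximation, assuming the textbook estimate u_t ≈ df_t / N (N documents in the collection, df_t of them containing term t), so that log[(1 - u_t) / u_t] = log[(N - df_t) / df_t] ≈ log(N / df_t). The function names and collection sizes are hypothetical.

```python
import math


def exact_weight(N, df_t):
    """log[(1 - u_t) / u_t] with u_t estimated as df_t / N."""
    return math.log((N - df_t) / df_t)


def idf_weight(N, df_t):
    """The IDF approximation log(N / df_t), valid when df_t is much smaller than N."""
    return math.log(N / df_t)


# Hypothetical collection: one million documents, a term found in 100 of them.
gap = idf_weight(1_000_000, 100) - exact_weight(1_000_000, 100)
```

For rare terms the two weights are nearly identical; the approximation degrades only for terms that occur in a large share of the collection.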

Okapi BM25
Under a further simplifying assumption, factor in the term frequencies (tf) and the document length (L_d and L_ave). Values that work well in practice: 1.2 ≤ k_1 ≤ 2; b = 0.75. Ideally the parameters k_1 and b should be tuned on a validation set.
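A minimal sketch of a BM25 scorer, assuming the common textbook form in which each query term contributes log(N / df_t) · (k_1 + 1) · tf / (k_1 · ((1 - b) + b · L_d / L_ave) + tf). The function name `bm25_score`, its arguments, and the toy statistics are hypothetical, and document length is assumed to be measured in tokens.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, df, N, L_ave, k1=1.2, b=0.75):
    """Score one document against a query with the textbook BM25 formula."""
    tf = Counter(doc_terms)
    L_d = len(doc_terms)  # assumption: length counted in tokens
    score = 0.0
    for t in set(query_terms):
        if tf[t] == 0 or df.get(t, 0) == 0:
            continue  # term absent from the document or the collection
        idf = math.log(N / df[t])
        norm = k1 * ((1 - b) + b * L_d / L_ave) + tf[t]
        score += idf * (k1 + 1) * tf[t] / norm
    return score


# Hypothetical collection statistics and documents.
df = {"bm25": 10}
short_doc = ["okapi", "bm25", "ranking", "function"]
long_doc = short_doc + ["filler"] * 12
s_short = bm25_score(["bm25"], short_doc, df, N=1000, L_ave=4)
s_long = bm25_score(["bm25"], long_doc, df, N=1000, L_ave=4)
```

For a document of exactly average length containing the term once, the tf factor equals 1 and the score reduces to the IDF weight; a longer document with the same term count scores lower because of the length normalization controlled by b.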

Appraisal
Getting reasonable approximations of probabilities is possible, but requires restrictive assumptions. In the BIM, these are: a Boolean representation of documents, queries, and relevance; term independence; terms not in the query do not affect the outcome; and document relevance values are independent. Problem: the model either requires partial relevance information or can derive only somewhat inferior term weights.

Appraisal
Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR. Traditionally, they offered neat ideas but never won on performance. That may be different now: for example, the Okapi BM25 term weighting formulas have been very successful, especially in TREC evaluations.

Okapi BM25
The parameters k_1 and b should ideally be tuned on a validation set. Good values in practice are 1.2 ≤ k_1 ≤ 2 and b = 0.75. The annotated formula labels the retrieval status value (RSV), the IDF(t) factor, the document length of d (L_d), and the average document length for the collection (L_ave).
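For reference, the formula these labels refer to can be written in its common textbook form as below. This is an assumed reconstruction, with $N$ the number of documents in the collection, $\mathrm{df}_t$ the document frequency of term $t$, and $\mathrm{tf}_{td}$ the frequency of $t$ in document $d$:

```latex
\mathrm{RSV}_d = \sum_{t \in q}
  \underbrace{\log\!\left[\frac{N}{\mathrm{df}_t}\right]}_{\mathrm{IDF}(t)}
  \cdot
  \frac{(k_1 + 1)\,\mathrm{tf}_{td}}
       {k_1\left((1 - b) + b \cdot \dfrac{L_d}{L_{\mathrm{ave}}}\right) + \mathrm{tf}_{td}}
```

Setting $b = 0$ switches off length normalization, while $b = 1$ scales the term frequency fully by the relative document length $L_d / L_{\mathrm{ave}}$.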

Well-Known UK Researchers
Stephen Robertson, Keith van Rijsbergen, Karen Spärck Jones