Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto


Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto. Chapter 2: Modeling.

Introduction Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun). Advantages: simple; the semantics of the documents and of the user information need can be naturally expressed through sets of index terms.

IR Models Ranking algorithms are at the core of information retrieval systems: they predict which documents are relevant and which are not.

A taxonomy of information retrieval models. User task: Retrieval (ad hoc, filtering) and Browsing. Classic Models: Boolean, Vector, Probabilistic. Set Theoretic: Fuzzy, Extended Boolean. Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks. Probabilistic: Inference Network, Belief Network. Structured Models: Non-Overlapping Lists, Proximal Nodes. Browsing models: Flat, Structure Guided, Hypertext.

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task. Index terms, Retrieval task: Classic, Set Theoretic, Algebraic, Probabilistic; Browsing task: Flat. Full text, Retrieval task: Classic, Set Theoretic, Algebraic, Probabilistic; Browsing task: Flat, Hypertext. Full text + structure, Retrieval task: Structured; Browsing task: Structure Guided, Hypertext.

Retrieval: Ad hoc and Filtering. Ad hoc (search): the documents in the collection remain relatively static while new queries are submitted to the system. Filtering (routing): the queries remain relatively static while new documents come into the system.

A formal characterization of IR models: an IR model is a quadruple [D, Q, F, R(qi, dj)] where D is a set composed of logical views (or representations) for the documents in the collection; Q is a set composed of logical views (or representations) for the user information needs (queries); F is a framework for modeling document representations, queries, and their relationships; R(qi, dj) is a ranking function which defines an ordering among the documents with regard to the query qi.

Define ki: a generic index term. K: the set of all index terms {k1, …, kt}. wi,j: a weight associated with index term ki of a document dj. gi: a function that returns the weight associated with ki in any t-dimensional vector (gi(dj) = wi,j).

Classic IR Models Basic concepts: Each document is described by a set of representative keywords called index terms. Since index terms are not equally useful for describing a document's contents, a numerical weight is assigned to each index term to capture its relevance.

Boolean model Based on a binary decision criterion (relevant or non-relevant); closer to a data retrieval model than to an information retrieval model. Advantages: clean formalism, simplicity. Disadvantages: it is not simple to translate an information need into a Boolean expression; exact matching may lead to retrieval of too few or too many documents.

Example: a Boolean query can be represented as a disjunction of conjunctive vectors (in disjunctive normal form, DNF). q = ka ∧ (kb ∨ ¬kc) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0). Formal definition: for the Boolean model, the index term weights are all binary. A query is a conventional Boolean expression, which can be transformed into a disjunctive normal form qdnf. Then sim(dj, q) = 1 if there exists a conjunctive component qcc ∈ qdnf such that for all ki, gi(dj) = gi(qcc); otherwise sim(dj, q) = 0.
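As a minimal sketch (not from the book's text), the DNF matching above can be expressed in Python; the three-term vocabulary (ka, kb, kc) and the documents are hypothetical:

```python
# Sketch of Boolean retrieval in DNF over a hypothetical vocabulary
# (ka, kb, kc); documents are binary term vectors of gi(dj) values.

# q = ka AND (kb OR NOT kc) as a disjunction of conjunctive components:
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def boolean_sim(doc, q_dnf):
    """Return 1 if some conjunctive component qcc of the query agrees
    with the document on every term (gi(dj) = gi(qcc) for all i)."""
    return 1 if tuple(doc) in q_dnf else 0

# (1, 1, 0): contains ka and kb, lacks kc -> relevant
# (0, 1, 1): lacks ka -> not relevant
```

Note the exact-match behavior: a document is either in the answer set or not, with no ranking, which is the binary decision criterion the slide describes.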

Vector model Assigns non-binary weights to index terms in queries and in documents (TF×IDF). Computes the similarity between documents and the query, sim(dj, q). More precise than the Boolean model.

The IR problem as a clustering problem. We think of the documents as a collection C of objects and of the user query as a specification of a set A of objects. Intra-cluster: what are the features which better describe the objects in the set A? Inter-cluster: what are the features which better distinguish the objects in the set A from the remaining objects in the collection C?

Idea for TF×IDF. TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj; this term frequency (the tf factor) provides one measure of how well that term describes the document contents. IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection; this is referred to as the inverse document frequency (the idf factor).

Vector Model (1/4) Index terms are assigned positive, non-binary weights. The index terms in the query are also weighted. Term weights are used to compute the degree of similarity between documents and the user query. Retrieved documents are then sorted in decreasing order of this similarity.

Vector Model (2/4) Degree of similarity: sim(dj, q) = (dj · q) / (|dj| × |q|) = Σi (wi,j × wi,q) / (√(Σi wi,j²) × √(Σi wi,q²)), i.e. the cosine of the angle between the document vector and the query vector.
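The cosine measure can be sketched directly from its definition; the weight vectors here are hypothetical:

```python
import math

def cosine_sim(d, q):
    """sim(dj, q) = (dj . q) / (|dj| * |q|): cosine of the angle
    between the document and query weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)
```

Since all weights are non-negative, sim(dj, q) varies from 0 (no shared terms) to 1 (same direction), which is what makes it usable as a ranking score.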

Vector Model (3/4) Definitions: normalized frequency fi,j = freqi,j / maxl freql,j; inverse document frequency idfi = log(N / ni), where N is the total number of documents and ni the number of documents containing ki; term-weighting scheme wi,j = fi,j × log(N / ni); query-term weights wi,q = (0.5 + 0.5 × fi,q) × log(N / ni).
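A small Python sketch of this weighting scheme (the document weights wi,j = fi,j × log(N/ni)); the toy corpus is hypothetical:

```python
import math

def tf_idf(docs):
    """w[j][t] = f(t,j) * log(N / n_t), where f(t,j) is the raw
    frequency of term t in document j normalized by the maximum raw
    frequency of any term in j. `docs` is a list of token lists."""
    N = len(docs)
    df = {}                                   # n_t: docs containing t
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        freq = {}
        for t in d:
            freq[t] = freq.get(t, 0) + 1
        max_f = max(freq.values())
        weights.append({t: (f / max_f) * math.log(N / df[t])
                        for t, f in freq.items()})
    return weights
```

A term that occurs in every document gets idf = log(N/N) = 0, so it contributes nothing to the ranking, which matches the inter-cluster dissimilarity idea.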

Vector Model (4/4) Advantages: its term-weighting scheme improves retrieval performance; its partial matching strategy allows retrieval of documents that approximate the query conditions; its cosine ranking formula sorts the documents according to their degree of similarity to the query. Disadvantage: the assumption of mutual independence between index terms.

Orthogonality example. In the index-term representation v1 = (1,0), v2 = (1,1), v3 = (0,1): cos(v1,v2) = 1/√2, cos(v2,v3) = 1/√2, cos(v1,v3) = 0. Mapped to an orthogonal representation v1 = (1,0), v2 = (0,1), v3 = (-1,1): cos(v1,v2) = 0, cos(v1,v3) = -1/√2.

Probabilistic Model (1/6) Introduced by Robertson and Sparck Jones, 1976. Also called the binary independence retrieval (BIR) model. Idea: given a user query q and the ideal answer set of the relevant documents, the problem is to specify the properties of this set; i.e., the probabilistic model tries to estimate the probability that the user will find the document dj relevant, ranking by the ratio P(dj relevant to q) / P(dj non-relevant to q).

Probabilistic Model (2/6) Definitions: all index term weights are binary, i.e., wi,j ∈ {0,1}. Let R be the set of documents known to be relevant to query q, and let R̄ be the complement of R. Let P(R|dj) be the probability that the document dj is relevant to the query q, and P(R̄|dj) the probability that dj is non-relevant to q.

Probabilistic Model (3/6) The similarity sim(dj, q) of the document dj to the query q is defined as the ratio sim(dj, q) = P(R|dj) / P(R̄|dj). Using Bayes' rule, sim(dj, q) = [P(dj|R) × P(R)] / [P(dj|R̄) × P(R̄)]. P(R) stands for the probability that a document randomly selected from the entire collection is relevant; P(dj|R) stands for the probability of randomly selecting the document dj from the set R of relevant documents.

Probabilistic Model (4/6) Assuming independence of index terms and writing dj as the binary vector (g1(dj), …, gt(dj)), sim(dj, q) ~ [ Π(gi(dj)=1) P(ki|R) × Π(gi(dj)=0) P(¬ki|R) ] / [ Π(gi(dj)=1) P(ki|R̄) × Π(gi(dj)=0) P(¬ki|R̄) ].

Probabilistic Model (5/6) P(ki|R) stands for the probability that the index term ki is present in a document randomly selected from the set R; P(¬ki|R) stands for the probability that the index term ki is not present in a document randomly selected from the set R.

Probabilistic Model (6/6) Taking logarithms and ignoring factors that are constant for all documents, the final ranking formula is sim(dj, q) ~ Σi wi,q × wi,j × [ log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) ].

Estimation of Term Relevance In the very beginning: P(ki|R) = 0.5 and P(ki|R̄) = ni / N. Next, let V be a subset of the documents initially retrieved (e.g., the top-ranked ones) and Vi the subset of V containing ki; the ranking can be improved with P(ki|R) = Vi / V and P(ki|R̄) = (ni − Vi) / (N − V). For small values of V, adjustment factors are added, e.g., P(ki|R) = (Vi + 0.5) / (V + 1) and P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1).
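A sketch of BIR scoring with the initial estimates P(ki|R) = 0.5 and P(ki|R̄) = ni/N; the collection statistics and terms below are hypothetical:

```python
import math

def bir_score(doc_terms, query_terms, p_rel, p_nonrel):
    """For each query term present in the document, add
    log(p/(1-p)) + log((1-u)/u), where p = P(ki|R), u = P(ki|~R)."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            p, u = p_rel[t], p_nonrel[t]
            score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score

# Initial estimates for a hypothetical collection of N = 100 documents
# with document frequencies n_i:
N = 100
n = {"retrieval": 10, "model": 50}
p_rel = {t: 0.5 for t in n}            # P(ki|R) = 0.5
p_nonrel = {t: n[t] / N for t in n}    # P(ki|~R) = n_i / N
```

With these initial estimates the first term of the sum vanishes (log(0.5/0.5) = 0), so rare terms (small ni) dominate the score, an idf-like effect.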

Alternative Set Theoretic Models Fuzzy Set Model Extended Boolean Model

Fuzzy Theory A fuzzy subset A of a universe U is characterized by a membership function μA: U → [0,1] which associates with each element u ∈ U a number μA(u) in [0,1]. Let A and B be two fuzzy subsets of U; then complement, union, and intersection are defined by μĀ(u) = 1 − μA(u), μA∪B(u) = max(μA(u), μB(u)), and μA∩B(u) = min(μA(u), μB(u)).

Fuzzy Information Retrieval Uses a term-term correlation matrix with ci,l = ni,l / (ni + nl − ni,l), where ni,l is the number of documents containing both ki and kl. A fuzzy set is associated to each index term ki, with document membership μi(dj) = 1 − Πkl∈dj (1 − ci,l). If some term kl of dj is strongly related to ki, that is ci,l ≈ 1, then μi(dj) ≈ 1; if every term of dj is only loosely related to ki, that is ci,l ≈ 0, then μi(dj) ≈ 0.
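The membership computation can be sketched in a few lines; the correlation factors below are hypothetical, not derived from a real collection:

```python
def fuzzy_membership(ki, doc_terms, c):
    """mu_i(dj) = 1 - prod over kl in dj of (1 - c[ki][kl]):
    an algebraic sum (computed as the complement of a product) of the
    correlations between ki and the terms of the document."""
    prod = 1.0
    for kl in doc_terms:
        prod *= 1.0 - c[ki][kl]
    return 1.0 - prod

# hypothetical correlation factors c[i][l] for the term "database"
c = {"database": {"database": 1.0, "sql": 0.8, "cat": 0.0}}
```

Note that a document need not contain ki itself to have nonzero membership in ki's fuzzy set; one strongly correlated term is enough, which is the point of the model.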

Example: the query is first converted to disjunctive normal form (as in the Boolean model), and the membership of a document in the query fuzzy set is computed from its memberships in the conjunctive components.

Algebraic Sum and Product The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of the max function. The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of the min function. These are smoother than the max and min functions.

Alternative Algebraic Models Generalized Vector Space Model Latent Semantic Model

Latent Semantic Indexing (1/5) Let A be a term-document association matrix with m rows and n columns. Latent semantic indexing decomposes A using the singular value decomposition A = U Σ V^T. U (m×m) is the matrix of eigenvectors derived from the term-to-term correlation matrix (AA^T); V (n×n) is the matrix of eigenvectors derived from the document-to-document matrix (A^T A); Σ is an m×n diagonal matrix of singular values, where r ≤ min(m, n) is the rank of A.

Latent Semantic Indexing (2/5) Consider now only the s largest singular values of Σ, together with their corresponding columns in U and V (the remaining singular values are deleted). The resultant matrix As (of rank s) is closest to the original matrix A in the least-squares sense. s < r is the dimensionality of a reduced concept space.
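The truncated decomposition is a one-liner with numpy; this is a sketch on a hypothetical toy matrix, not the book's example:

```python
import numpy as np

def rank_s_approx(A, s):
    """A_s = U_s Sigma_s V_s^T, keeping only the s largest singular
    values; by the Eckart-Young theorem this is the best rank-s
    least-squares approximation of A."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :s] @ np.diag(sigma[:s]) @ Vt[:s, :]

# toy 3x3 term-document matrix of rank 2 (rows 1 and 2 are identical)
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
A2 = rank_s_approx(A, 2)   # rank(A) = 2, so A2 recovers A exactly
```

With s below the rank of A, the reconstruction is no longer exact; the discarded directions are the "non-relevant representational details" the next slide refers to.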

Latent Semantic Indexing (3/5) The selection of s attempts to balance two opposing effects: s should be large enough to allow fitting all the structure in the real data, and small enough to allow filtering out the non-relevant representational details. Us = {u1, u2, …, us} are the s principal components of the column space (document space) in Rm; Vs = {v1, v2, …, vs} are the s principal components of the row space (term space) in Rn.

Latent Semantic Indexing (4/5) The relationship between any two documents is given by the matrix As^T As = Vs Σs² Vs^T: its (i, j) element quantifies the relationship between documents di and dj. The i-th column of Σs Vs^T is the projected vector for document di (Rm → Rs); the i-th row of Us Σs is the projected vector for term ti (Rn → Rs).

Latent Semantic Indexing (5/5) To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix A. Assume the query is modeled as the document with number k. Then the k-th row of the matrix As^T As provides the ranks of all documents with respect to this query.

Speedup Computing the ranks directly from the full matrix requires on the order of N × t scalar multiplications per query (N documents, t terms), while working with the projected s-dimensional vectors requires only on the order of (N + t) × s scalar multiplications.

Alternative Probabilistic Model Bayesian Networks Inference Network Model Belief Network Model

Bayesian Network Let xi be a node in a Bayesian network G and Γxi be the set of parent nodes of xi. The influence of Γxi on xi can be specified by any set of functions Fi(xi, Γxi) that satisfy 0 ≤ Fi(xi, Γxi) ≤ 1 and Σxi Fi(xi, Γxi) = 1. For the example network, P(x1,x2,x3,x4,x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3).
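The factorization on this slide can be checked numerically; the conditional probability tables below are hypothetical values, chosen only to illustrate the chain-rule product:

```python
def joint(x, P1, P2, P3, P4, P5):
    """P(x1,...,x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)
    for the example network; each CPT maps parent values to P(child=1)."""
    x1, x2, x3, x4, x5 = x

    def p(prob_one, value):          # P(child = value) from P(child = 1)
        return prob_one if value == 1 else 1.0 - prob_one

    return (p(P1, x1) * p(P2[x1], x2) * p(P3[x1], x3)
            * p(P4[(x2, x3)], x4) * p(P5[x3], x5))

# hypothetical conditional probability tables
P1 = 0.6
P2 = {0: 0.3, 1: 0.7}
P3 = {0: 0.5, 1: 0.2}
P4 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}
P5 = {0: 0.25, 1: 0.75}
```

Summing the product over all 2^5 assignments gives 1, confirming that the factored functions define a proper joint distribution, which is the constraint the slide states.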

Belief Network Model (1/6) The probability space: the set K = {k1, k2, …, kt} of index terms is the universe. To each subset u ⊆ K is associated a vector k such that gi(k) = 1 if and only if ki ∈ u. Random variables: to each index term ki is associated a binary random variable.

Belief Network Model (2/6) Concept space A document dj is represented as a concept composed of the terms used to index dj. A user query q is also represented as a concept composed of the terms used to index q. Both user query and document are modeled as subsets of index terms. Probability distribution P over K

Belief Network Model (3/6) A query is modeled as a network node This variable is set to 1 whenever q completely covers the concept space K P(q) computes the degree of coverage of the space K by q A document dj is modeled as a network node This random variable is 1 to indicate that dj completely covers the concept space K P(dj) computes the degree of coverage of the space K by dj

Belief Network Model (4/6) The rank of dj is computed as P(dj|q) = P(dj ∧ q) / P(q), which expands to a sum over all concepts u (states of the index-term variables) of P(dj|u) × P(q|u) × P(u).

Belief Network Model (5/6) Assumption P(dj |q) is adopted as the rank of the document dj with respect to the query q.

Belief Network Model (6/6) Specify the conditional probabilities as follows: Thus, the belief network model can be tuned to subsume the vector model.

Comparison: belief network model vs. inference network model. The belief network model is based on a set-theoretic view; it provides a clear separation between the document and the query; and it is able to reproduce any ranking strategy generated by the inference network model. The inference network model, in contrast, takes a purely epistemological view of the retrieval problem.