Information Retrieval and Web Search
IR models: Boolean model
Instructor: Rada Mihalcea
Class web page:



Slide 1 Today’s topics
Boolean retrieval
Improvements / variations of the boolean model
– Extended boolean model
– Fuzzy information retrieval

Slide 2 IR Models
[Figure: taxonomy of IR models, organized by user task]
User task
– Retrieval: ad hoc, filtering
– Browsing
Classic models: boolean, vector, probabilistic
– Set theoretic: fuzzy, extended Boolean
– Algebraic: generalized vector, latent semantic indexing, neural networks
– Probabilistic: inference network, belief network
Structured models: non-overlapping lists, proximal nodes
Browsing: flat, structure guided, hypertext

Slide 3 The Boolean Model
Simple model based on set theory
Queries specified as boolean expressions
– precise semantics
– neat formalism
– $q = k_a \wedge (k_b \vee \neg k_c)$
Terms are either present or absent. Thus, $w_{ij} \in \{0,1\}$
Consider
– $q = k_a \wedge (k_b \vee \neg k_c)$
– $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
– $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component
Each query can be transformed into disjunctive normal form (DNF)

Slide 4 The Boolean Model
$q = k_a \wedge (k_b \vee \neg k_c)$
$sim(q, d_j) = 1$ if the document satisfies the boolean query, 0 otherwise
– no in-between, only 0 or 1
[Figure: the sets Ka, Kb, and Kc, with the regions (1,1,1), (1,1,0), and (1,0,0) marked]
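The matching rule on Slides 3 and 4 can be sketched in a few lines of Python. This is only an illustration: the function names `query`, `dnf_components`, and `sim` are made up for this example, and document vectors are assumed to be restricted to the three query terms (ka, kb, kc).

```python
from itertools import product

def query(ka: bool, kb: bool, kc: bool) -> bool:
    """The example query q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)

def dnf_components(q, n_terms: int) -> set:
    """Enumerate every binary weight vector that satisfies q.

    Each satisfying vector is one conjunctive component of the DNF.
    """
    return {
        bits
        for bits in product((1, 0), repeat=n_terms)
        if q(*(bool(b) for b in bits))
    }

def sim(q_dnf: set, doc_vector) -> int:
    """Boolean similarity: 1 if the document matches a component, else 0."""
    return 1 if tuple(doc_vector) in q_dnf else 0

Q_DNF = dnf_components(query, 3)
print(sorted(Q_DNF, reverse=True))  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
print(sim(Q_DNF, (1, 1, 0)))        # 1
print(sim(Q_DNF, (0, 1, 1)))        # 0
```

Enumerating all 2^t vectors is fine for a three-term illustration; a real system would evaluate the expression against an inverted index instead.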

Slide 5 Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information ∧ retrieval”
Q2 = “information ∧ ¬computer”
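One way to work the exercise is with set operations over a small inverted index. The sketch below reads Q1 as a conjunction and Q2 as "information AND NOT computer"; that reading of the slide notation is an assumption, so the disjunctive reading of Q1 is computed as well for contrast. The variable names are illustrative.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

# Q1 read as a conjunction (assumed), and as a disjunction for contrast.
q1_and = index["information"] & index["retrieval"]
q1_or = index["information"] | index["retrieval"]

# Q2: NOT is the complement with respect to the whole collection.
q2 = index["information"] & (all_docs - index["computer"])

print(sorted(q1_and))  # ['D1']
print(sorted(q1_or))   # ['D1', 'D2', 'D3', 'D4']
print(sorted(q2))      # ['D3']
```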

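Slide 6 Exercise
Each row lists the authors present in one of the 16 possible combinations:
0: (none)
1: Swift
2: Shakespeare
3: Shakespeare, Swift
4: Milton
5: Milton, Swift
6: Milton, Shakespeare
7: Milton, Shakespeare, Swift
8: Chaucer
9: Chaucer, Swift
10: Chaucer, Shakespeare
11: Chaucer, Shakespeare, Swift
12: Chaucer, Milton
13: Chaucer, Milton, Swift
14: Chaucer, Milton, Shakespeare
15: Chaucer, Milton, Shakespeare, Swift
Query: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))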
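The exercise can be checked mechanically by enumerating all 16 author combinations and evaluating the query on each. This is a sketch only; it assumes the rows are numbered in binary with Chaucer as the most significant bit and Swift as the least significant, and the names `matches` and `matching_rows` are invented for the example.

```python
from itertools import product

AUTHORS = ("chaucer", "milton", "shakespeare", "swift")

def matches(chaucer: bool, milton: bool, shakespeare: bool, swift: bool) -> bool:
    """((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))."""
    return ((chaucer or milton) and not swift) or (
        (not chaucer) and (swift or shakespeare)
    )

# product() varies the last position fastest, so row n is the binary
# encoding of n with Chaucer = 8, Milton = 4, Shakespeare = 2, Swift = 1.
rows = list(product((False, True), repeat=len(AUTHORS)))
matching_rows = [n for n, bits in enumerate(rows) if matches(*bits)]

for n in matching_rows:
    present = [name for name, bit in zip(AUTHORS, rows[n]) if bit]
    print(n, ", ".join(present))
```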

Slide 7 Drawbacks of the Boolean Model
Retrieval is based on binary decision criteria, with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
The information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Slide 8
The Boolean model imposes a binary criterion for deciding relevance
The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
Two extensions of the boolean model:
– Fuzzy Set Model
– Extended Boolean Model

Slide 9 Fuzzy Set Model
Queries and docs are represented by sets of index terms: matching is approximate from the start
This vagueness can be modeled using a fuzzy framework, as follows:
– each term is associated with a fuzzy set
– each doc has a degree of membership in this fuzzy set
This interpretation provides the foundation for many IR models based on fuzzy theory
Here we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

Slide 10 Fuzzy Set Theory
Framework for representing classes whose boundaries are not well defined
The key idea is to introduce the notion of a degree of membership associated with the elements of a set
This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic

Slide 11 Fuzzy Set Theory
Definition
– A fuzzy subset $A$ of $U$ is characterized by a membership function $\mu_A : U \to [0,1]$ which associates with each element $u$ of $U$ a number $\mu_A(u)$ in the interval $[0,1]$
Definition
– Let $A$ and $B$ be two fuzzy subsets of $U$. Also, let $\neg A$ be the complement of $A$. Then,
$\mu_{\neg A}(u) = 1 - \mu_A(u)$
$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
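The three definitions translate directly into code. In this sketch a fuzzy subset is represented as a dictionary from elements to membership degrees; that representation, the function names, and the sample degrees are assumptions made for illustration, and elements missing from a dictionary are treated as having degree 0.

```python
def complement(mu_a: dict) -> dict:
    """Membership in NOT A: 1 - mu_A(u)."""
    return {u: 1.0 - m for u, m in mu_a.items()}

def union(mu_a: dict, mu_b: dict) -> dict:
    """Membership in A OR B: max(mu_A(u), mu_B(u))."""
    return {
        u: max(mu_a.get(u, 0.0), mu_b.get(u, 0.0))
        for u in mu_a.keys() | mu_b.keys()
    }

def intersection(mu_a: dict, mu_b: dict) -> dict:
    """Membership in A AND B: min(mu_A(u), mu_B(u))."""
    return {
        u: min(mu_a.get(u, 0.0), mu_b.get(u, 0.0))
        for u in mu_a.keys() | mu_b.keys()
    }

# Hypothetical membership degrees for two documents.
A = {"d1": 0.8, "d2": 0.3}
B = {"d1": 0.5, "d2": 0.6}

print(union(A, B))         # d1 -> 0.8, d2 -> 0.6
print(intersection(A, B))  # d1 -> 0.5, d2 -> 0.3
print(complement(A))       # d1 -> about 0.2, d2 -> 0.7
```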

Slide 12 Fuzzy Information Retrieval
Fuzzy sets are modeled based on a thesaurus
This thesaurus is built as follows:
– Let $\vec{c}$ be a term-term correlation matrix
– Let $c(i,l)$ be a normalized correlation factor for $(k_i, k_l)$:
$c(i,l) = \frac{n(i,l)}{n_i + n_l - n(i,l)}$
– $n_i$: number of docs which contain $k_i$
– $n_l$: number of docs which contain $k_l$
– $n(i,l)$: number of docs which contain both $k_i$ and $k_l$
We now have the notion of proximity among index terms.
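As a rough illustration of the correlation factor, the sketch below computes c(i,l) for every pair of terms in a tiny collection. The collection reuses the four documents from the Slide 5 exercise, and the function name `correlation` and the dictionary layout are choices made for this example only.

```python
def correlation(docs: dict) -> dict:
    """Return c[(ki, kl)] = n(i,l) / (n_i + n_l - n(i,l)) for all term pairs.

    docs maps a document id to the set of terms it contains.
    """
    # postings[term] is the set of documents containing the term.
    postings = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    c = {}
    for ki, docs_i in postings.items():
        for kl, docs_l in postings.items():
            both = len(docs_i & docs_l)
            c[ki, kl] = both / (len(docs_i) + len(docs_l) - both)
    return c

docs = {
    "D1": {"computer", "information", "retrieval"},
    "D2": {"computer", "retrieval"},
    "D3": {"information"},
    "D4": {"computer", "information"},
}

c = correlation(docs)
print(c["computer", "information"])   # 0.5
print(c["information", "retrieval"])  # 0.25
print(c["retrieval", "retrieval"])    # 1.0
```

The denominator is the number of documents containing either term, so c(i,l) is 1 when the two terms always co-occur and 0 when they never do.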

Slide 13 Fuzzy Information Retrieval
The correlation factor $c(i,l)$ can be used to define fuzzy set membership for a document $d_j$ as follows:
$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$
– $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
The above expression computes an algebraic sum (the complement of a product of complements) over all terms in the doc $d_j$
A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$

Slide 14 Fuzzy Information Retrieval
$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$
– $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have
– $c(i,l) \approx 1$
– $\mu_{i,j} \approx 1$
– index $k_i$ is a good fuzzy index for the doc
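The membership formula can be sketched as a one-line product. The correlation values in `C` below are hypothetical inputs chosen for the example (they happen to match what the Slide 12 formula gives on the Slide 5 documents), and `membership` is an illustrative name rather than anything defined in the slides.

```python
from math import prod

# Hypothetical correlation factors c(i, l) for the term ki = "retrieval".
C = {
    ("retrieval", "retrieval"): 1.0,
    ("retrieval", "computer"): 2 / 3,
    ("retrieval", "information"): 0.25,
}

def membership(ki: str, doc_terms, c: dict) -> float:
    """mu(i, j) = 1 - product over kl in dj of (1 - c(i, l))."""
    return 1.0 - prod(1.0 - c[ki, kl] for kl in doc_terms)

# A document that does not contain "retrieval" still gets partial membership.
print(membership("retrieval", {"computer", "information"}, C))  # about 0.75
print(membership("retrieval", {"information"}, C))              # 0.25

# A document containing the term itself has c = 1, so membership is 1.
print(membership("retrieval", {"computer", "retrieval"}, C))    # 1.0
```

A single strongly correlated term drives one factor of the product toward 0 and therefore the membership toward 1, which is the behavior described on Slide 14.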

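Slide 15 Fuzzy IR: An Example
$q = k_a \wedge (k_b \vee \neg k_c)$
$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$
$\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\,j}$
$= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \cdot (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \cdot (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))$
[Figure: the sets Ka, Kb, and Kc, with the conjunctive components cc1, cc2, and cc3 marked]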
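To make the example concrete, the following sketch evaluates the query membership for one document. The membership degrees passed in (0.8, 0.5, 0.2) are invented for illustration, and the helper names are not from the slides.

```python
from math import prod

# Conjunctive components of q = ka AND (kb OR NOT kc), in (ka, kb, kc) order.
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def component_membership(cc, mu) -> float:
    """Algebraic product: use mu for a required term, 1 - mu for a negated one."""
    return prod(m if bit else 1.0 - m for bit, m in zip(cc, mu))

def query_membership(q_dnf, mu) -> float:
    """Algebraic sum over components: 1 - product of (1 - mu_cc)."""
    return 1.0 - prod(1.0 - component_membership(cc, mu) for cc in q_dnf)

# Hypothetical memberships of one document in the fuzzy sets of ka, kb, kc.
mu_doc = (0.8, 0.5, 0.2)
print(round(query_membership(Q_DNF, mu_doc), 6))  # 0.574592
```

With crisp memberships (all 0 or 1) the result collapses to the Boolean answer: exactly 1 when the document matches one of the components, and 0 otherwise.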

Slide 16 Fuzzy Information Retrieval
Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
Experiments with standard test collections are not available
Difficult to compare with other models at this time

Slide 17 Extended Boolean Model
The Boolean model is simple and elegant, but it makes no provision for ranking
As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
Extend the Boolean model with the notions of partial matching and term weighting
Combine characteristics of the Vector model with properties of Boolean algebra

Slide 18 The Idea
The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
Let
– $q = k_x \wedge k_y$
– Use weights associated with $k_x$ and $k_y$
– In the boolean model: $w_x = w_y = 1$; all other documents are irrelevant

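Slide 19 The Idea: AND
$q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$
We want a document to be as close as possible to (1,1)
[Figure: documents $d_j$ and $d_{j+1}$ plotted in the unit square between (0,0) and (1,1), with axes $k_x$ ($x = w_{xj}$) and $k_y$ ($y = w_{yj}$)]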

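Slide 20 The Idea: OR
$q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
We want a document to be as far as possible from (0,0)
[Figure: documents $d_j$ and $d_{j+1}$ plotted in the unit square between (0,0) and (1,1), with axes $k_x$ ($x = w_{xj}$) and $k_y$ ($y = w_{yj}$)]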

Slide 21 Generalizing the Idea
We can extend the previous model to consider Euclidean distances in a t-dimensional space
This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter
A generalized disjunctive query is given by
– $q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_t$
A generalized conjunctive query is given by
– $q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_t$

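Slide 22 Generalizing the Idea
– $sim(q_{and}, d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$
– $sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$
– If $p = 1$ then (vector-like) $sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \dots + x_m}{m}$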
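A minimal sketch of the two p-norm similarities follows. The function names and the sample term weights (0.8, 0.3) are assumptions made for illustration. With p = 2 and two terms the functions reduce to the Euclidean formulas on Slides 19 and 20, and with p = 1 both give the plain average of the weights.

```python
def sim_or(weights, p: float) -> float:
    """p-norm OR: normalized p-distance from the origin (0, ..., 0)."""
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1.0 / p)

def sim_and(weights, p: float) -> float:
    """p-norm AND: 1 minus the normalized p-distance from (1, ..., 1)."""
    m = len(weights)
    return 1.0 - (sum((1.0 - x) ** p for x in weights) / m) ** (1.0 / p)

# Hypothetical weights of two query terms in one document.
w = (0.8, 0.3)

for p in (1, 2, 10, 200):
    print(p, round(sim_or(w, p), 4), round(sim_and(w, p), 4))
```

As p grows, the OR score moves toward the largest weight and the AND score toward the smallest, so large p behaves like the fuzzy max/min operators, while p = 1 ignores the operator altogether.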

Slide 23 Conclusions
The model is quite powerful
Its properties are interesting and might be useful
Computation is somewhat complex
However, distributivity does not hold for the ranking computation:
– $q_1 = (k_1 \vee k_2) \wedge k_3$
– $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
– $sim(q_1, d_j) \ne sim(q_2, d_j)$
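The failure of distributivity is easy to see numerically. The sketch below evaluates nested queries by applying the p-norm operators recursively, which is one common way to score mixed queries and is assumed here rather than taken from the slides. The tuple encoding of queries, the function names, and the term weights are all illustrative.

```python
def p_or(values, p: float) -> float:
    """p-norm OR over a list of scores."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def p_and(values, p: float) -> float:
    """p-norm AND over a list of scores."""
    return 1.0 - (sum((1.0 - v) ** p for v in values) / len(values)) ** (1.0 / p)

def evaluate(node, weights: dict, p: float) -> float:
    """Score a query tree: a leaf is a term name, an inner node is (op, child, ...)."""
    if isinstance(node, str):
        return weights[node]
    op, *children = node
    scores = [evaluate(child, weights, p) for child in children]
    return p_and(scores, p) if op == "and" else p_or(scores, p)

# Two queries that are equivalent in classic Boolean algebra.
q1 = ("and", ("or", "k1", "k2"), "k3")
q2 = ("or", ("and", "k1", "k3"), ("and", "k2", "k3"))

# Hypothetical term weights for one document.
weights = {"k1": 0.9, "k2": 0.1, "k3": 0.5}

s1 = evaluate(q1, weights, p=2)
s2 = evaluate(q2, weights, p=2)
print(round(s1, 3), round(s2, 3))  # the two scores differ
```

For these weights the two forms give noticeably different scores at p = 2, so rewriting a query into an equivalent Boolean form can change the ranking.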