Information Retrieval and Web Search
IR models: Boolean model
Instructor: Rada Mihalcea
Class web page: http://www.cs.unt.edu/~rada/CSCE5300

Today’s topics
Boolean retrieval
Improvements / variations of the Boolean model
  – Extended Boolean model
  – Fuzzy information retrieval

IR Models
[Figure: taxonomy of IR models, organized by user task]
User task: Retrieval (Adhoc, Filtering); Browsing
Classic Models: boolean, vector, probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext

The Boolean Model
Simple model based on set theory
Queries specified as Boolean expressions
  – precise semantics
  – neat formalism
  – $q = k_a \wedge (k_b \vee \neg k_c)$
Terms are either present or absent. Thus, $w_{ij} \in \{0,1\}$
Consider
  – $q = k_a \wedge (k_b \vee \neg k_c)$
  – $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
  – $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component
Each query can be transformed into disjunctive normal form (DNF)

The Boolean Model
$q = k_a \wedge (k_b \vee \neg k_c)$
$sim(q, d_j) = 1$ if the document satisfies the Boolean query, 0 otherwise
  – no in-between, only 0 or 1
[Figure: Venn diagram of Ka, Kb, Kc with the regions (1,1,1), (1,1,0), and (1,0,0) marked]
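The matching rule above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names `query` and `sim` and the term labels "ka", "kb", "kc" are hypothetical. The sketch enumerates every binary weight vector that satisfies the example query (its conjunctive components) and returns 1 only when a document's vector is one of them.

```python
from itertools import product

def query(ka, kb, kc):
    """The example query q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)

# Every binary vector (ka, kb, kc) that satisfies q is a conjunctive
# component of the query's disjunctive normal form.
dnf = [bits for bits in product((1, 0), repeat=3)
       if query(*(bool(b) for b in bits))]

def sim(doc_terms):
    """Return 1 if the document's term vector is a conjunctive component, else 0."""
    vec = tuple(int(k in doc_terms) for k in ("ka", "kb", "kc"))
    return int(vec in dnf)

print(dnf)                # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
print(sim({"ka", "kb"}))  # 1: vector (1,1,0) satisfies q
print(sim({"kb", "kc"}))  # 0: ka is missing
```

Because the result is strictly 0 or 1, a document that misses a single required term is treated the same as one that matches nothing.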

Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information ∧ retrieval”
Q2 = “information ∧ ¬computer”
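One way to check an answer to this exercise is to run the queries over term sets. The sketch below assumes that the connective lost from both queries in the transcript is AND (that is, Q1 = information AND retrieval, Q2 = information AND NOT computer); if the original slide used OR, the predicates would need to change. The names `index`, `q1`, and `q2` are illustrative.

```python
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

# Represent each document as the set of index terms it contains.
index = {name: set(text.split()) for name, text in docs.items()}

def q1(terms):
    # Assumed reading: information AND retrieval
    return "information" in terms and "retrieval" in terms

def q2(terms):
    # Assumed reading: information AND NOT computer
    return "information" in terms and "computer" not in terms

r1 = [name for name, terms in index.items() if q1(terms)]
r2 = [name for name, terms in index.items() if q2(terms)]
print(r1)  # ['D1']
print(r2)  # ['D3']
```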

Exercise
Doc  Terms
0    (none)
1    Swift
2    Shakespeare
3    Shakespeare, Swift
4    Milton
5    Milton, Swift
6    Milton, Shakespeare
7    Milton, Shakespeare, Swift
8    Chaucer
9    Chaucer, Swift
10   Chaucer, Shakespeare
11   Chaucer, Shakespeare, Swift
12   Chaucer, Milton
13   Chaucer, Milton, Swift
14   Chaucer, Milton, Shakespeare
15   Chaucer, Milton, Shakespeare, Swift
Query: ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
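The query in this exercise can be evaluated mechanically. The sketch below assumes the document numbers 0 to 15 encode term presence as a 4-bit pattern (Chaucer = 8, Milton = 4, Shakespeare = 2, Swift = 1), which is how the listing appears to be organized; `terms_of` and `matches` are hypothetical helper names.

```python
AUTHORS = ("chaucer", "milton", "shakespeare", "swift")

def terms_of(doc_id):
    """Decode a document number 0..15 into its term set (assumed 4-bit encoding)."""
    return {a for bit, a in zip((8, 4, 2, 1), AUTHORS) if doc_id & bit}

def matches(terms):
    """((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))."""
    c, m, sh, sw = (a in terms for a in AUTHORS)
    return ((c or m) and not sw) or ((not c) and (sw or sh))

result = [d for d in range(16) if matches(terms_of(d))]
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14]
```

Under this encoding the query retrieves 11 of the 16 documents; the five it rejects are the empty document 0 and every document containing both Chaucer and Swift.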

Drawbacks of the Boolean Model
Retrieval based on binary decision criteria with no notion of partial matching
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users find awkward
The Boolean queries formulated by the users are most often too simplistic
As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

The Boolean model imposes a binary criterion for deciding relevance
The question of how to extend the Boolean model to accommodate partial matching and ranking has attracted considerable attention in the past
Two extensions of the Boolean model:
  – Fuzzy Set Model
  – Extended Boolean Model

Fuzzy Set Model
Queries and docs are represented by sets of index terms: matching is approximate from the start
This vagueness can be modeled using a fuzzy framework, as follows:
  – each term is associated with a fuzzy set
  – each doc has a degree of membership in this fuzzy set
This interpretation provides the foundation for many models for IR based on fuzzy theory
Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

Fuzzy Set Theory
Framework for representing classes whose boundaries are not well defined
Key idea is to introduce the notion of a degree of membership associated with the elements of a set
This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
Thus, membership is now a gradual notion, contrary to the notion enforced by classic Boolean logic

Fuzzy Set Theory
Definition
  – A fuzzy subset $A$ of $U$ is characterized by a membership function $\mu_A : U \to [0,1]$ which associates with each element $u$ of $U$ a number $\mu_A(u)$ in the interval $[0,1]$
Definition
  – Let $A$ and $B$ be two fuzzy subsets of $U$. Also, let $\neg A$ be the complement of $A$. Then,
    $\mu_{\neg A}(u) = 1 - \mu_A(u)$
    $\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$
    $\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$
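As a small illustration of these three definitions, a fuzzy subset can be stored as a mapping from elements to membership degrees. The dictionary representation, the function names, and the sample degrees below are assumptions made for the sketch, not something the slides prescribe.

```python
def fuzzy_not(a):
    """Complement: membership becomes 1 minus the original degree."""
    return {u: 1.0 - m for u, m in a.items()}

def fuzzy_or(a, b):
    """Union: take the larger of the two membership degrees."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def fuzzy_and(a, b):
    """Intersection: take the smaller of the two membership degrees."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

# Hypothetical membership degrees of two documents in fuzzy sets A and B.
A = {"d1": 0.75, "d2": 0.25}
B = {"d1": 0.5, "d2": 1.0}

print(fuzzy_not(A))
print(fuzzy_or(A, B))
print(fuzzy_and(A, B))
```

When every degree is exactly 0 or 1, these operations reduce to ordinary set complement, union, and intersection.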

Fuzzy Information Retrieval
Fuzzy sets are modeled based on a thesaurus
This thesaurus is built as follows:
  – Let $\vec{c}$ be a term-term correlation matrix
  – Let $c_{i,l}$ be a normalized correlation factor for $(k_i, k_l)$:
    $c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$
    - $n_i$: number of docs which contain $k_i$
    - $n_l$: number of docs which contain $k_l$
    - $n_{i,l}$: number of docs which contain both $k_i$ and $k_l$
We now have the notion of proximity among index terms.
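The correlation factor is easy to compute from document counts. The following sketch uses a made-up four-document collection and a hypothetical function name `correlation`; returning 0.0 when neither term occurs is a choice made here to avoid division by zero, not something stated in the slides.

```python
def correlation(docs, ki, kl):
    """Normalized correlation c(i,l) = n(i,l) / (ni + nl - n(i,l))."""
    n_i = sum(1 for terms in docs if ki in terms)
    n_l = sum(1 for terms in docs if kl in terms)
    n_il = sum(1 for terms in docs if ki in terms and kl in terms)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

# Toy collection: each document is a set of index terms.
docs = [
    {"computer", "information", "retrieval"},
    {"computer", "retrieval"},
    {"information"},
    {"computer", "information"},
]

print(correlation(docs, "information", "retrieval"))  # 1 / (3 + 2 - 1) = 0.25
print(correlation(docs, "computer", "information"))   # 2 / (3 + 3 - 2) = 0.5
```

The factor is 1 when two terms always co-occur (including a term with itself) and 0 when they never appear in the same document.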

Fuzzy Information Retrieval
The correlation factor $c_{i,l}$ can be used to define fuzzy set membership for a document $d_j$ as follows:
$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
  – $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
The above expression computes an algebraic sum over all terms in the doc $d_j$
A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$

Fuzzy Information Retrieval
$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
  – $\mu_{i,j}$: membership of doc $d_j$ in the fuzzy subset associated with $k_i$
If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have
  – $c_{i,l} \sim 1$
  – $\mu_{i,j} \sim 1$
  – index $k_i$ is a good fuzzy index for the doc
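To see the membership formula at work, the sketch below computes the degree to which documents belong to the fuzzy set of the term "retrieval". The toy collection and the function names `correlation` and `membership` are assumptions for illustration only.

```python
from math import prod

def correlation(docs, ki, kl):
    """Normalized correlation c(i,l) = n(i,l) / (ni + nl - n(i,l))."""
    n_i = sum(1 for terms in docs if ki in terms)
    n_l = sum(1 for terms in docs if kl in terms)
    n_il = sum(1 for terms in docs if ki in terms and kl in terms)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

def membership(docs, ki, doc_terms):
    """Membership of a doc in the fuzzy set of ki: 1 - product of (1 - c(i,l))."""
    return 1.0 - prod(1.0 - correlation(docs, ki, kl) for kl in doc_terms)

# Toy collection: each document is a set of index terms.
docs = [
    {"computer", "information", "retrieval"},
    {"computer", "retrieval"},
    {"information"},
    {"computer", "information"},
]

print(membership(docs, "retrieval", docs[0]))  # 1.0: the doc contains the term itself
print(membership(docs, "retrieval", docs[2]))  # 0.25: only "information" contributes
print(membership(docs, "retrieval", docs[3]))  # about 0.75
```

The last document never mentions "retrieval", yet it receives a fairly high degree because both of its terms co-occur with "retrieval" elsewhere in the collection.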

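Fuzzy IR: An Example
$q = k_a \wedge (k_b \vee \neg k_c)$
$\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$
$\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\,j}$
$= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \times (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \times (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))$
[Figure: Venn diagram of Ka, Kb, Kc with the regions cc1, cc2, and cc3 marked]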
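The expression for this example query translates directly into code. The sketch below is illustrative: the function name and the sample membership values 0.8, 0.5, and 0.2 are made up, and the three inputs stand for the memberships of a document in the fuzzy sets of ka, kb, and kc.

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """Membership of a doc in the fuzzy set of q = ka AND (kb OR NOT kc)."""
    cc1 = mu_a * mu_b * mu_c              # component (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)        # component (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # component (1,0,0)
    # Algebraic sum of the three conjunctive components.
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(fuzzy_query_membership(0.8, 0.5, 0.2))  # a graded score between 0 and 1
print(fuzzy_query_membership(1.0, 1.0, 0.0))  # 1.0: crisp vector (1,1,0) matches
print(fuzzy_query_membership(0.0, 1.0, 1.0))  # 0.0: ka is entirely absent
```

With crisp 0/1 memberships the function agrees with plain Boolean matching, while intermediate values produce a score that can be used for ranking.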

Fuzzy Information Retrieval
Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
Experiments with standard test collections are not available
Difficult to compare at this time

Extended Boolean Model
The Boolean model is simple and elegant, but makes no provision for ranking
As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
Extend the Boolean model with the notions of partial matching and term weighting
Combine characteristics of the Vector model with properties of Boolean algebra

The Idea
The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
Let
  – $q = k_x \wedge k_y$
  – Use weights associated with $k_x$ and $k_y$
  – In the Boolean model, only documents with $w_x = w_y = 1$ are relevant; all other documents are irrelevant

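The Idea: AND
$q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$
We want a document to be as close as possible to (1,1)
[Figure: documents dj and dj+1 plotted in the unit square from (0,0) to (1,1), with axes x = wxj (term kx) and y = wyj (term ky)]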

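The Idea: OR
$q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$
$sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}$
We want a document to be as far as possible from (0,0)
[Figure: documents dj and dj+1 plotted in the unit square from (0,0) to (1,1), with axes x = wxj (term kx) and y = wyj (term ky)]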
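The two distance-based similarities for a two-term query can be tried out with a short sketch. The function names `sim_and` and `sim_or` and the sample weights are chosen here for illustration.

```python
from math import sqrt

def sim_and(x, y):
    """Similarity to kx AND ky: 1 minus the normalized distance from (1,1)."""
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_or(x, y):
    """Similarity to kx OR ky: the normalized distance from (0,0)."""
    return sqrt((x ** 2 + y ** 2) / 2)

# A document with both terms fully present, one with only kx, and one with neither.
for x, y in [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]:
    print((x, y), round(sim_and(x, y), 4), round(sim_or(x, y), 4))
```

A document containing only one of the two terms scores about 0.29 for the AND query and about 0.71 for the OR query, instead of the hard 0 and 1 that the plain Boolean model would assign.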

Generalizing the Idea
We can extend the previous model to consider Euclidean distances in a t-dimensional space
This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter
A generalized disjunctive query is given by
  – $q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_t$
A generalized conjunctive query is given by
  – $q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_t$

Generalizing the Idea
  – $sim(q_{and}, d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$
  – $sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$
  – If p = 1 then (Vector like)
    $sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \dots + x_m}{m}$
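The p-norm formulas can be sketched as two small functions that take a list of term weights and a finite p. The function names and the sample weights are illustrative assumptions; the case p = infinity is not handled here.

```python
def sim_or(weights, p):
    """p-norm similarity for a generalized disjunctive (OR) query."""
    m = len(weights)
    return (sum(x ** p for x in weights) / m) ** (1 / p)

def sim_and(weights, p):
    """p-norm similarity for a generalized conjunctive (AND) query."""
    m = len(weights)
    return 1 - (sum((1 - x) ** p for x in weights) / m) ** (1 / p)

weights = [0.8, 0.4]  # hypothetical weights of the query terms in one document

# With p = 1 both formulas reduce to the average weight (vector-like behaviour).
print(sim_or(weights, 1), sim_and(weights, 1))

# With p = 2 they reproduce the Euclidean-distance formulas for two terms.
print(sim_or(weights, 2), sim_and(weights, 2))
```

Raising p pulls the OR score toward the largest weight and the AND score toward the smallest, so p controls how strictly the operators are interpreted.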

Conclusions
Model is quite powerful
Properties are interesting and might be useful
Computation is somewhat complex
However, the distributive property does not hold for ranking computation:
  – $q_1 = (k_1 \vee k_2) \wedge k_3$
  – $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
  – $sim(q_1, d_j) \neq sim(q_2, d_j)$
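A quick numeric check makes the loss of distributivity concrete. The sketch below assumes p = 2 and assumes that nested queries are scored recursively, feeding the similarity of an inner clause in as the weight of that clause; the weights (1, 0, 1) and the helper names are made up for the illustration.

```python
from math import sqrt

def and2(x, y):
    """Two-operand AND similarity with p = 2."""
    return 1 - sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def or2(x, y):
    """Two-operand OR similarity with p = 2."""
    return sqrt((x ** 2 + y ** 2) / 2)

# Hypothetical weights of k1, k2, k3 in one document.
w1, w2, w3 = 1.0, 0.0, 1.0

s1 = and2(or2(w1, w2), w3)            # q1 = (k1 OR k2) AND k3
s2 = or2(and2(w1, w3), and2(w2, w3))  # q2 = (k1 AND k3) OR (k2 AND k3)

print(round(s1, 4), round(s2, 4))  # two different scores for equivalent queries
```

The two queries are logically equivalent, yet the document receives roughly 0.79 for the first form and 0.74 for the second, so the way a query is written can change the ranking.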
