CS 430: Information Discovery, Lecture 12: Extending the Boolean Model


1 CS 430: Information Discovery, Lecture 12: Extending the Boolean Model

2 Course Administration
Midterm examination:
Date: Wednesday, 31 October, 7:30 to 8:30 p.m.
Room: TBA
Open book
Assignment 1: Grades have been sent by email. If you have not received a grade, please send a message to wya@cs.cornell.edu

3 Problems with the Boolean model
Counter-intuitive results:
Query q = A and B and C and D and E
Document d has terms A, B, C and D, but not E.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = A or B or C or D or E
Document d1 has terms A, B, C, D and E.
Document d2 has term A, but not B, C, D or E.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
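The two counter-intuitive cases above can be reproduced in a few lines. This is a minimal sketch that assumes documents are represented as plain sets of terms; the function names `boolean_and` and `boolean_or` are illustrative and not part of the lecture.

```python
def boolean_and(doc_terms, query_terms):
    """Strict Boolean AND: the document matches only if every query term is present."""
    return all(term in doc_terms for term in query_terms)


def boolean_or(doc_terms, query_terms):
    """Strict Boolean OR: the document matches if at least one query term is present."""
    return any(term in doc_terms for term in query_terms)


query = ["A", "B", "C", "D", "E"]

d = {"A", "B", "C", "D"}        # four of the five query terms
d1 = {"A", "B", "C", "D", "E"}  # all five query terms
d2 = {"A"}                      # only one query term

# AND rejects d outright, even though it misses only one term.
and_result = boolean_and(d, query)

# OR returns the same answer for d1 and d2, so they cannot be told apart.
or_results = (boolean_or(d1, query), boolean_or(d2, query))

print(and_result, or_results)
```

Because both functions return a bare true or false, there is no value left over to sort the results by, which is the gap the extensions in the rest of the lecture try to fill.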

4 Problems with the Boolean model (continued)
The Boolean model has no way to rank documents.
The Boolean model allows for no uncertainty in assigning index terms to documents.
The Boolean model has no provision for assigning weights to the importance of query terms.
Boolean is all or nothing.

5 Boolean model as sets
[Figure: diagram labeled A, d, q]
d and q are either in the set A or not in A. There is no halfway!

6 Extending the Boolean model
Term weighting:
Give weights to terms in documents and/or queries.
Combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets:
Relax the boundaries of the sets used in Boolean retrieval.

7 Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights:
Add term weights to documents.
Weights are calculated by the standard method of term frequency * inverse document frequency.
Ranking:
Calculate the results set by standard Boolean methods.
Rank the results by vector distances.
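The two-stage idea on this slide (Boolean filter first, vector ranking second) can be sketched as follows. This is an illustration under stated assumptions, not SIRE's actual implementation: the idf form `log(N / df)`, the use of cosine similarity as the "vector distance", the unit-weight query vector, and all function names are choices made here for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to {term: tf * idf}, with idf = log(N / df) (assumed form)."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    return {
        name: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boolean_then_rank(docs, and_terms):
    """Stage 1: strict Boolean AND. Stage 2: rank the hits by similarity to the query."""
    vectors = tfidf_vectors(docs)
    query_vec = {t: 1.0 for t in and_terms}  # assumption: every query term has weight 1
    hits = [name for name, tokens in docs.items() if all(t in tokens for t in and_terms)]
    return sorted(hits, key=lambda name: cosine(query_vec, vectors[name]), reverse=True)


docs = {
    "d1": ["boolean", "retrieval", "boolean", "model"],
    "d2": ["boolean", "retrieval", "vector", "space", "ranking", "methods"],
    "d3": ["fuzzy", "sets"],
}

ranked = boolean_then_rank(docs, ["boolean", "retrieval"])
print(ranked)
```

Here d3 is excluded by the Boolean stage, and d1 is placed ahead of d2 because the query terms account for a larger share of its vector.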

8 Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
The results set is created by standard Boolean retrieval.
The user selects one document from the results set.
Other documents in the collection are ranked by vector distance from this document.
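The feedback step described here amounts to using the selected document as the query vector. The sketch below assumes term weights are already available as sparse dicts and uses cosine similarity as the vector measure; both are assumptions for illustration, as are the names `rank_by_example` and the sample weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_example(vectors, chosen):
    """Rank every other document in the collection by similarity to the chosen one."""
    reference = vectors[chosen]
    others = [name for name in vectors if name != chosen]
    return sorted(others, key=lambda name: cosine(reference, vectors[name]), reverse=True)


# Hypothetical term-weight vectors for a four-document collection.
vectors = {
    "d1": {"boolean": 0.8, "retrieval": 0.4},
    "d2": {"boolean": 0.7, "retrieval": 0.5, "fuzzy": 0.1},
    "d3": {"fuzzy": 0.9, "sets": 0.6},
    "d4": {"retrieval": 0.9},
}

# Suppose the user picked d1 from a Boolean results set.
ranked = rank_by_example(vectors, "d1")
print(ranked)
```

Note that d4 does not contain the term "boolean" and would be missed by a query such as "boolean and retrieval", yet it still receives a nonzero score. That is the sense in which feedback expands the results set.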

9 Boolean model as fuzzy sets
[Figure: diagram labeled A, d, q]
q is more or less in A. There is a halfway!

10 Basic concept
A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
For a given query, calculate the similarity between the query and each document in the collection.
This calculation is needed for every document that has a nonzero weight for any of the terms in the query.

11 MMM: Mixed Min and Max model
Fuzzy set theory
$d_A$ is the degree of membership of an element in set $A$.
intersection (and): $d_{A \cap B} = \min(d_A, d_B)$
union (or): $d_{A \cup B} = \max(d_A, d_B)$

12 MMM: Mixed Min and Max model
Fuzzy set theory example

                         standard set theory    fuzzy set theory
$d_A$                    1    1    0    0       0.5  0.5  0    0
$d_B$                    1    0    1    0       0.7  0    0.7  0
and: $d_{A \cap B}$      1    0    0    0       0.5  0    0    0
or:  $d_{A \cup B}$      1    1    1    0       0.7  0.5  0.7  0
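The table can be checked mechanically. The following sketch applies min for "and" and max for "or" to each column of membership values; the function names `fuzzy_and` and `fuzzy_or` are illustrative.

```python
def fuzzy_and(d_a, d_b):
    """Fuzzy intersection: membership in (A and B) is the smaller of the two memberships."""
    return min(d_a, d_b)


def fuzzy_or(d_a, d_b):
    """Fuzzy union: membership in (A or B) is the larger of the two memberships."""
    return max(d_a, d_b)


# Columns of the table: four crisp (0/1) cases, then four fuzzy cases.
pairs = [(1, 1), (1, 0), (0, 1), (0, 0), (0.5, 0.7), (0.5, 0), (0, 0.7), (0, 0)]

and_row = [fuzzy_and(a, b) for a, b in pairs]
or_row = [fuzzy_or(a, b) for a, b in pairs]

print(and_row)
print(or_row)
```

On the first four columns the results coincide with ordinary Boolean and/or, so standard set theory is the special case in which every membership value is 0 or 1.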

13 MMM: Mixed Min and Max model
Terms: $A_1, A_2, \ldots, A_n$
Document $D$, with index-term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
$Q_{or} = (A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n)$
Query-document similarity:
$S(Q_{or}, D) = C_{or1} \cdot \max(d_{A_1}, d_{A_2}, \ldots, d_{A_n}) + C_{or2} \cdot \min(d_{A_1}, d_{A_2}, \ldots, d_{A_n})$
where $C_{or1} + C_{or2} = 1$

14 MMM: Mixed Min and Max model
Terms: $A_1, A_2, \ldots, A_n$
Document $D$, with index-term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
$Q_{and} = (A_1 \text{ and } A_2 \text{ and } \ldots \text{ and } A_n)$
Query-document similarity:
$S(Q_{and}, D) = C_{and1} \cdot \min(d_{A_1}, \ldots, d_{A_n}) + C_{and2} \cdot \max(d_{A_1}, \ldots, d_{A_n})$
where $C_{and1} + C_{and2} = 1$
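Both MMM similarity formulas are short enough to write out directly. In this sketch the default coefficients (0.6 for the "or" query, 0.7 for the "and" query) are arbitrary picks for illustration, and the function names are not from the lecture. The example weights reuse the documents from the "Problems with the Boolean model" slide, encoded as 0/1 term weights.

```python
def mmm_or(weights, c_or1=0.6):
    """MMM similarity for an OR query: mostly the max, softened by the min."""
    c_or2 = 1.0 - c_or1  # the two coefficients must sum to 1
    return c_or1 * max(weights) + c_or2 * min(weights)


def mmm_and(weights, c_and1=0.7):
    """MMM similarity for an AND query: mostly the min, softened by the max."""
    c_and2 = 1.0 - c_and1  # the two coefficients must sum to 1
    return c_and1 * min(weights) + c_and2 * max(weights)


# Weights for terms A..E in each document.
d = [1, 1, 1, 1, 0]   # has A, B, C, D but not E
d1 = [1, 1, 1, 1, 1]  # has all five terms
d2 = [1, 0, 0, 0, 0]  # has only A

and_score = mmm_and(d)   # no longer rejected outright
or_score_d1 = mmm_or(d1)
or_score_d2 = mmm_or(d2)

print(and_score, or_score_d1, or_score_d2)
```

With these settings the "and" query gives d a score of about 0.3 instead of rejecting it, and the "or" query scores d1 above d2, which addresses both counter-intuitive results from earlier in the lecture. Setting the first coefficient to 1 in either function recovers pure fuzzy min or max.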

15 MMM: Mixed Min and Max model
Experimental values:
$C_{and1}$ in the range [0.5, 0.8]
$C_{or1} > 0.2$
Computational cost is low. Retrieval performance is much improved.

16 Paice model
The Paice model is a relative of the MMM model.
The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights.
Computational cost is higher than for MMM. Retrieval performance is improved.
See Frakes, pages 396-397, for more details.
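The slide defers the formula to the cited pages. As a hedged sketch, the form commonly given for the Paice model sorts the term weights (ascending for an "and" query, descending for an "or" query) and takes a weighted average in which the i-th sorted weight is multiplied by r^(i-1), for a coefficient r between 0 and 1. That formula, the parameter name `r`, and the function name `paice` are supplied here from general knowledge of the model and should be checked against the cited pages.

```python
def paice(weights, r, operator):
    """Paice-style similarity (assumed form): a geometrically weighted average of sorted weights.

    operator == "and": weights sorted ascending, so small weights dominate.
    operator == "or":  weights sorted descending, so large weights dominate.
    """
    if not weights:
        raise ValueError("at least one term weight is required")
    ordered = sorted(weights, reverse=(operator == "or"))
    numerator = sum((r ** i) * d for i, d in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator


d = [1, 1, 1, 1, 0]   # term weights for A..E; E is missing
d2 = [1, 0, 0, 0, 0]  # only A is present

strict_and = paice(d, 0.0, "and")   # r = 0 keeps only the smallest weight
average_and = paice(d, 1.0, "and")  # r = 1 is the plain average of all weights
soft_or = paice(d2, 0.7, "or")

print(strict_and, average_and, soft_or)
```

Unlike MMM, every weight contributes to the score when r is greater than 0, which is why the cost is higher: the weights must be sorted rather than scanned once for the minimum and maximum.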

17 P-norm model
Terms: $A_1, A_2, \ldots, A_n$
Document $D$, with term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
Query terms are given weights, $a_1, a_2, \ldots, a_n$, which indicate their relative importance.
Operators have coefficients that indicate their degree of strictness.
Query-document similarity is calculated by considering each document and query as a point in n-dimensional space.
See Frakes, pages 397-398, for details.
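The slide leaves the formulas to the cited pages. As a hedged sketch, the usual statement of the P-norm model measures an "or" query by the normalized distance of the document from the all-zeros point, and an "and" query by one minus its normalized distance from the all-ones point, with the strictness coefficient p as the exponent. These formulas and the function names below come from general knowledge of the model, not from the slides, and should be checked against the cited pages.

```python
def pnorm_or(d, a, p):
    """P-norm OR (assumed form): ((sum a_i^p * d_i^p) / (sum a_i^p)) ** (1/p)."""
    numerator = sum((ai ** p) * (di ** p) for ai, di in zip(a, d))
    denominator = sum(ai ** p for ai in a)
    return (numerator / denominator) ** (1.0 / p)


def pnorm_and(d, a, p):
    """P-norm AND (assumed form): 1 - ((sum a_i^p * (1 - d_i)^p) / (sum a_i^p)) ** (1/p)."""
    numerator = sum((ai ** p) * ((1 - di) ** p) for ai, di in zip(a, d))
    denominator = sum(ai ** p for ai in a)
    return 1.0 - (numerator / denominator) ** (1.0 / p)


d = [1, 1, 1, 1, 0]  # document term weights for A..E; E is missing
a = [1, 1, 1, 1, 1]  # query term weights: all terms equally important

# p = 1: AND and OR coincide (a weighted average, as in vector-style matching).
and_p1 = pnorm_and(d, a, 1)
or_p1 = pnorm_or(d, a, 1)

# Larger p makes AND stricter and OR more permissive.
and_p2 = pnorm_and(d, a, 2)
or_p2 = pnorm_or(d, a, 2)

print(and_p1, or_p1, and_p2, or_p2)
```

As p grows, the "and" score approaches the minimum weight and the "or" score approaches the maximum, so the strict Boolean operators appear as the limiting case.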

18 Test data

          CISI   CACM   INSPEC
P-norm      79    106      210
Paice       77    104      206
MMM         68    109      195

Percentage improvement over standard Boolean model (average best precision)
Lee and Fox, 1988

19 Reading
E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, Frakes, Chapter 15
Methods based on fuzzy set concepts

