1 CS 430: Information Discovery Lecture 12 Extending the Boolean Model.


1 CS 430: Information Discovery Lecture 12 Extending the Boolean Model

2 Course Administration Midterm examination: Date: Wednesday, 31 October, 7:30 to 8:30 p.m. Room: TBA Open book Assignment 1: Grades have been sent by . If you have not received a grade, please send a message to

3 Problems with the Boolean model
Counter-intuitive results:
Query q = A and B and C and D and E. Document d has terms A, B, C and D, but not E. Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = A or B or C or D or E. Document d1 has terms A, B, C, D and E. Document d2 has term A, but not B, C, D or E. Intuitively, d1 is a much better match than d2, but the Boolean model treats them as equal.
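The two cases above can be reproduced in a few lines. This is only a minimal sketch: the helper names `boolean_and` and `boolean_or` are made up for illustration, and documents are modelled as plain sets of terms.

```python
def boolean_and(doc_terms, query_terms):
    """Strict Boolean AND: the document must contain every query term."""
    return set(query_terms) <= set(doc_terms)


def boolean_or(doc_terms, query_terms):
    """Strict Boolean OR: the document must contain at least one query term."""
    return bool(set(query_terms) & set(doc_terms))


query = ["A", "B", "C", "D", "E"]
d = {"A", "B", "C", "D"}          # four of the five terms
d1 = {"A", "B", "C", "D", "E"}    # all five terms
d2 = {"A"}                        # only one term

# d is rejected outright by the AND query, despite matching 4 of 5 terms.
and_result = boolean_and(d, query)

# d1 and d2 are indistinguishable under the OR query: both simply match.
or_results = (boolean_or(d1, query), boolean_or(d2, query))

print(and_result, or_results)
```

The output is a bare yes/no per document, which is why no ordering of d1 above d2 can be derived from it.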

4 Problems with the Boolean model (continued)
The Boolean model has no way to rank documents.
The Boolean model allows for no uncertainty in assigning index terms to documents.
The Boolean model has no provision for assigning weights to the importance of query terms.
Boolean is all or nothing.

5 Boolean model as sets
[Figure: a set A, with points d and q]
d and q are either in the set A or not in A. There is no halfway!

6 Extending the Boolean model
Term weighting: Give weights to terms in documents and/or queries. Combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets: Relax the boundaries of the sets used in Boolean retrieval.

7 Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights: Add term weights to documents. Weights are calculated by the standard method of term frequency * inverse document frequency.
Ranking: Calculate the results set by standard Boolean methods. Rank the results by vector distances.
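The two-stage idea (Boolean retrieval first, vector ranking second) might look roughly like the sketch below. It is not SIRE's actual implementation: the logarithmic idf, the use of cosine similarity as the "vector distance", and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency for every term (assumed form: log(N / df))."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    return {t: math.log(n / count) for t, count in df.items()}


def tf_idf(terms, idf):
    """Weight vector {term: tf * idf} for one document or query."""
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(terms).items()}


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def boolean_then_rank(docs, query_terms):
    """Stage 1: strict Boolean AND. Stage 2: rank the hits by similarity."""
    idf = idf_table(docs)
    q_vec = tf_idf(query_terms, idf)
    hits = [name for name, terms in docs.items() if set(query_terms) <= set(terms)]
    return sorted(
        hits, key=lambda name: cosine(tf_idf(docs[name], idf), q_vec), reverse=True
    )


docs = {
    "d1": ["boolean", "retrieval", "boolean", "retrieval"],
    "d2": ["boolean", "retrieval", "fuzzy", "sets", "vector", "ranking"],
    "d3": ["vector", "ranking"],
    "d4": ["fuzzy", "sets"],
}
ranked = boolean_then_rank(docs, ["boolean", "retrieval"])
print(ranked)
```

Here d1 and d2 both satisfy the Boolean query, but d1 is about nothing else, so it comes first; d3 and d4 never enter the results set.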

8 Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
The results set is created by standard Boolean retrieval.
The user selects one document from the results set.
Other documents in the collection are ranked by vector distance from this document.
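The feedback step can be sketched as a "more like this" ranking. The weight vectors below are invented for illustration, cosine similarity stands in for the vector distance, and `more_like_this` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def more_like_this(vectors, selected):
    """Rank every other document in the collection by similarity to the selected one."""
    chosen = vectors[selected]
    others = [name for name in vectors if name != selected]
    return sorted(others, key=lambda name: cosine(vectors[name], chosen), reverse=True)


# Hypothetical term-weight vectors for a four-document collection.
vectors = {
    "d1": {"boolean": 0.9, "retrieval": 0.8},
    "d2": {"boolean": 0.7, "fuzzy": 0.6},
    "d3": {"fuzzy": 0.9, "sets": 0.8},
    "d4": {"vector": 0.5, "ranking": 0.9},
}

# Suppose the user picked d1 from a Boolean results set.
feedback_ranking = more_like_this(vectors, "d1")
print(feedback_ranking)
```

Note that d2 would have been rejected by the strict query "boolean and retrieval", yet it surfaces at the top of the feedback ranking; this is the sense in which feedback expands the results set.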

9 Boolean model as fuzzy sets
[Figure: a fuzzy set A, with points d and q]
q is more or less in A. There is a halfway!

10 Basic concept
A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document.
Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.)
For a given query, calculate the similarity between the query and each document in the collection.
This calculation is needed for every document that has a nonzero weight for any of the terms in the query.

11 MMM: Mixed Min and Max model
Fuzzy set theory
$d_A$ is the degree of membership of an element in set A.
Intersection (and): $d_{A \cap B} = \min(d_A, d_B)$
Union (or): $d_{A \cup B} = \max(d_A, d_B)$
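A minimal sketch of these two operators follows; the function names and the sample membership values are chosen only for illustration. With memberships restricted to 0 and 1 the operators reduce to ordinary Boolean and/or.

```python
def fuzzy_and(*memberships):
    """Fuzzy intersection: the smallest degree of membership."""
    return min(memberships)


def fuzzy_or(*memberships):
    """Fuzzy union: the largest degree of membership."""
    return max(memberships)


# Illustrative degrees of membership in A and B.
d_a, d_b = 0.5, 0.7
both = fuzzy_and(d_a, d_b)     # 0.5
either = fuzzy_or(d_a, d_b)    # 0.7

# Crisp (0/1) memberships behave exactly like standard set theory.
crisp = (fuzzy_and(1, 0), fuzzy_or(1, 0))   # (0, 1)

print(both, either, crisp)
```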

12 MMM: Mixed Min and Max model
Fuzzy set theory example
[Table: standard set theory compared with fuzzy set theory, with rows for $d_A$, $d_B$, and ($d_{A \cap B}$), or ($d_{A \cup B}$); cell values not preserved]

13 MMM: Mixed Min and Max model
Terms: $A_1, A_2, \ldots, A_n$
Document D, with index-term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
$Q_{or} = (A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_n)$
Query-document similarity:
$S(Q_{or}, D) = C_{or1} \cdot \max(d_{A_1}, d_{A_2}, \ldots, d_{A_n}) + C_{or2} \cdot \min(d_{A_1}, d_{A_2}, \ldots, d_{A_n})$
where $C_{or1} + C_{or2} = 1$

14 MMM: Mixed Min and Max model
Terms: $A_1, A_2, \ldots, A_n$
Document D, with index-term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
$Q_{and} = (A_1 \text{ and } A_2 \text{ and } \ldots \text{ and } A_n)$
Query-document similarity:
$S(Q_{and}, D) = C_{and1} \cdot \min(d_{A_1}, \ldots, d_{A_n}) + C_{and2} \cdot \max(d_{A_1}, \ldots, d_{A_n})$
where $C_{and1} + C_{and2} = 1$
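The two MMM similarities translate directly into code. In this sketch the default coefficients (0.6 for or, 0.7 for and) are illustrative assumptions, and the documents reuse the A to E example from the earlier slide with 0/1 weights.

```python
def mmm_or(weights, c_or1=0.6):
    """S(Q_or, D) = C_or1 * max(weights) + C_or2 * min(weights), with C_or2 = 1 - C_or1."""
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)


def mmm_and(weights, c_and1=0.7):
    """S(Q_and, D) = C_and1 * min(weights) + C_and2 * max(weights), with C_and2 = 1 - C_and1."""
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)


# Weights for terms A..E in each document (1 = present, 0 = absent).
d = [1, 1, 1, 1, 0]     # has A, B, C, D but not E
d1 = [1, 1, 1, 1, 1]    # has all five terms
d2 = [1, 0, 0, 0, 0]    # has only A

and_score = mmm_and(d)        # about 0.3: no longer rejected outright
or_score_d1 = mmm_or(d1)      # about 1.0
or_score_d2 = mmm_or(d2)      # about 0.6: now ranked below d1

print(and_score, or_score_d1, or_score_d2)
```

Setting `c_and1 = 1` or `c_or1 = 1` recovers the pure fuzzy min and max, so the coefficients control how far the model softens the strict operators.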

15 MMM: Mixed Min and Max model
Experimental values: $C_{and1}$ in the range [0.5, 0.8]; $C_{or1} > 0.2$
Computational cost is low.
Retrieval performance is much improved.

16 Paice Model
The Paice model is a relative of the MMM model.
The MMM model considers only the maximum and minimum document weights. The Paice model takes into account all of the document weights.
Computational cost is higher than for MMM.
Retrieval performance is improved.
See Frakes for more details.
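The slide does not give the Paice formula. As the model is usually stated, the weights are sorted (ascending for and, descending for or) and combined with geometrically decaying coefficients $r^{i-1}$; the sketch below assumes that form, and both the function name and the value of r are illustrative, so check them against the reading.

```python
def paice(weights, operator, r=0.7):
    """Paice-style similarity (assumed form):
    sum(r**(i-1) * d_i) / sum(r**(i-1)), with the d_i sorted
    ascending for 'and' and descending for 'or'."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    if not weights:
        raise ValueError("weights must not be empty")
    ordered = sorted(weights, reverse=(operator == "or"))
    coeffs = [r ** i for i in range(len(ordered))]
    return sum(c * w for c, w in zip(coeffs, ordered)) / sum(coeffs)


weights = [0.2, 0.9, 0.5]
and_score = paice(weights, "and", r=0.5)   # dominated by the small weights
or_score = paice(weights, "or", r=0.5)     # dominated by the large weights

print(and_score, or_score)
```

With only two terms this form coincides with MMM (the first coefficient plays the role of $C_{or1}$ or $C_{and1}$), and unlike MMM the middle weight 0.5 contributes to both scores.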

17 P-norm model
Terms: $A_1, A_2, \ldots, A_n$
Document D, with term weights: $d_{A_1}, d_{A_2}, \ldots, d_{A_n}$
Query terms are given weights, $a_1, a_2, \ldots, a_n$, which indicate their relative importance.
Operators have coefficients that indicate their degree of strictness.
Query-document similarity is calculated by considering each document and query as a point in n-dimensional space.
See Frakes for details.
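The slide leaves out the similarity formulas. The sketch below assumes the form in which the P-norm model is usually stated, with the strictness coefficient written as p (p = 1 behaves like a weighted average, larger p approaches strict Boolean behaviour); the function names are hypothetical.

```python
def p_norm_or(doc_weights, query_weights, p=2.0):
    """Assumed form: (sum(a_i^p * d_i^p) / sum(a_i^p)) ** (1/p)."""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)


def p_norm_and(doc_weights, query_weights, p=2.0):
    """Assumed form: 1 - (sum(a_i^p * (1 - d_i)^p) / sum(a_i^p)) ** (1/p)."""
    num = sum((a ** p) * ((1.0 - d) ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)


doc = [1.0, 0.0]      # the document has the first term but not the second
query = [1.0, 1.0]    # both query terms equally important

# p = 1: 'and' and 'or' give the same score (a plain average).
soft = (p_norm_and(doc, query, p=1.0), p_norm_or(doc, query, p=1.0))

# p = 2: 'or' rewards the one matching term, 'and' penalises the missing one.
medium = (p_norm_and(doc, query, p=2.0), p_norm_or(doc, query, p=2.0))

# Large p: close to strict Boolean (and near 0, or near 1).
strict = (p_norm_and(doc, query, p=100.0), p_norm_or(doc, query, p=100.0))

print(soft, medium, strict)
```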

18 Test data
[Table: percentage improvement over the standard Boolean model (average best precision) for the P-norm, Paice and MMM models on the CISI, CACM and INSPEC collections; cell values not preserved]
Lee and Fox, 1988

19 Reading
E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models. Frakes, Chapter 15.
Methods based on fuzzy set concepts