1 CS 430 / INFO 430 Information Retrieval Lecture 8 Evaluation of Retrieval Effectiveness 1

2 Course administration Change of Office Hours Office hours are now: Tuesday: 9:30 to 10:30 Thursday: 9:30 to 10:30

3 Course administration Discussion Class 4 Check the Web site. (a) It is not necessary to study the entire paper in detail (b) The PDF version of the file is damaged. Use the PostScript version.

4 Retrieval Effectiveness Designing an information retrieval system involves many decisions: Manual or automatic indexing? Natural language or controlled vocabulary? What stoplists? What stemming methods? What query syntax? etc. How do we know which of these methods are most effective? Is everything a matter of judgment?

5 From Lecture 1: Evaluation To place information retrieval on a systematic basis, we need repeatable criteria to evaluate how effective a system is in meeting the information needs of its users. This proves very difficult with a human in the loop: it is hard to define both the task that the human is attempting and the criteria for measuring success.

6 Relevance as a set comparison D = set of documents A = set of documents that satisfy some user-based criterion (the relevant documents) B = set of documents identified by the search system (the retrieved documents)

7 Measures based on relevance
recall = retrieved relevant / relevant = |A ∩ B| / |A|
precision = retrieved relevant / retrieved = |A ∩ B| / |B|
fallout = retrieved not-relevant / not-relevant = |B − (A ∩ B)| / |D − A|
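The three measures can be illustrated with a short Python sketch (not part of the original lecture); the document identifiers and the sets A, B, and D below are invented for illustration.

def recall(relevant, retrieved):
    # |A ∩ B| / |A| : fraction of the relevant documents that were retrieved
    return len(relevant & retrieved) / len(relevant)

def precision(relevant, retrieved):
    # |A ∩ B| / |B| : fraction of the retrieved documents that are relevant
    return len(relevant & retrieved) / len(retrieved)

def fallout(relevant, retrieved, all_docs):
    # |B − (A ∩ B)| / |D − A| : fraction of the non-relevant documents retrieved
    return len(retrieved - relevant) / len(all_docs - relevant)

# Hypothetical collection of 10 documents
D = {f"d{i}" for i in range(1, 11)}
A = {"d1", "d2", "d3", "d4"}            # satisfy the user-based criterion
B = {"d2", "d3", "d5", "d6"}            # identified by the search system
print(recall(A, B), precision(A, B), fallout(A, B, D))   # 0.5 0.5 0.333...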

8 Relevance Recall and precision: depend on concept of relevance Relevance is a context-, task-dependent property of documents "Relevance is the correspondence in context between an information requirement statement... and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement." F. W. Lancaster, 1979

9 Relevance How stable are relevance judgments? For textual documents, knowledgeable users show good agreement in deciding whether a document is relevant to an information requirement. There is less consistency with non-textual documents, e.g., a photograph. Attempts to have users give a level of relevance, e.g., on a five-point scale, produce inconsistent results.

10 Studies of Retrieval Effectiveness
The Cranfield Experiments: Cyril W. Cleverdon, Cranfield College of Aeronautics
SMART System: Gerald Salton, Cornell University
TREC: Donna Harman, National Institute of Standards and Technology (NIST)

11 Cranfield Experiments (Example) Comparative efficiency of indexing systems: Universal Decimal Classification, alphabetical subject index, a special facet classification, Uniterm system of co-ordinate indexing. Four indexes prepared manually for each document in three batches of 6,000 documents -- a total of 18,000 documents, each indexed four times. The documents were reports and papers in aeronautics. Indexes for testing were prepared on index cards and other cards. Very careful control of indexing procedures.

12 Cranfield Experiments (continued) Searching: 1,200 test questions, each satisfied by at least one document; reviewed by an expert panel; searches carried out by 3 expert librarians; two rounds of searching to develop the testing methodology; subsidiary experiments at English Electric Whetstone Laboratory and Western Reserve University.

13 The Cranfield Data The Cranfield data was made widely available and used by other researchers. Salton used the Cranfield data with the SMART system (a) to study the relationship between recall and precision, and (b) to compare automatic indexing with human indexing. Spärck Jones and van Rijsbergen used the Cranfield data for experiments in relevance weighting, clustering, definition of test corpora, etc.

14 Cranfield Experiments -- Measures of Effectiveness for Matching Methods Cleverdon's work was applied to matching methods. He made extensive use of recall and precision, based on the concept of relevance. [Scatter plot of precision (%) against recall (%); each point represents one search. The graph illustrates the trade-off between precision and recall.]

15 Typical precision-recall graph for different queries [Figure: precision against recall for Boolean-type queries, contrasting a broad, general query with a narrow, specific query.]

16 Some Cranfield Results The various manual indexing systems have similar retrieval effectiveness. Automatic indexing can be at least as effective as manual indexing with controlled vocabularies -> original result from the Cranfield + SMART experiments (published in 1967) -> considered counter-intuitive at the time -> other results since then have supported this conclusion

17 Precision and Recall with Ranked Results Precision and recall are defined for a fixed set of hits, e.g., Boolean retrieval. Their use needs to be modified for a ranked list of results.

18 Ranked retrieval: Recall and precision after retrieval of n documents
 n  relevant  recall  precision
 1  yes       0.2     1.00
 2  yes       0.4     1.00
 3  no        0.4     0.67
 4  yes       0.6     0.75
 5  no        0.6     0.60
 6  yes       0.8     0.67
 7  no        0.8     0.57
 8  no        0.8     0.50
 9  no        0.8     0.44
10  no        0.8     0.40
11  no        0.8     0.36
12  no        0.8     0.33
13  yes       1.0     0.38
14  no        1.0     0.36
SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
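The recall and precision columns follow directly from the slide 7 definitions, so the table can be recomputed from the yes/no judgments alone. A minimal Python sketch, using the judgments above and the stated figure of 5 relevant documents (the function name is my own):

def recall_precision_at_ranks(judgments, total_relevant):
    # Yield (n, recall, precision) after each of the top-n documents.
    hits = 0
    for n, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            hits += 1
        yield n, hits / total_relevant, hits / n

judgments = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
for n, r, p in recall_precision_at_ranks(judgments, total_relevant=5):
    print(f"{n:2d}  recall={r:.1f}  precision={p:.2f}")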

19 Precision-recall graph [Figure: precision plotted against recall.] Note: Some authors plot recall against precision.

20 11 Point Precision (Recall Cut Off) p(n) is the precision at the point where recall has first reached n. Define 11 standard recall points p(r_0), p(r_1), ..., p(r_10), where r_j = j/10. Note: if p(r_j) is not an exact data point, use interpolation.

21 Recall cutoff graph: choice of interpolation points [Figure: precision against recall; the blue line is the recall cutoff graph.]

22 Example: SMART System on Cranfield Data [Table of recall levels and the corresponding precision values. Precision values in blue are actual data; precision values in red are obtained by interpolation (by convention, equal to the next actual data value).]
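Under the slide 20 convention (take the precision at the first rank where recall has reached the level, rather than the TREC maximum-over-higher-recall rule), the 11-point values can be computed with a short sketch. The ranking below is the slide 18 example; the precise values tabulated on the original slide are not reproduced here.

def eleven_point_precision(judgments, total_relevant):
    # Precision at recall levels 0.0, 0.1, ..., 1.0 for one ranked query.
    hits, points = 0, []
    for n, rel in enumerate(judgments, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / n))   # (recall, precision)
    # precision at the first rank where recall has reached each level j/10
    return [next((p for r, p in points if r >= j / 10), 0.0) for j in range(11)]

judgments = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]     # slide 18 ranking
print([round(p, 2) for p in eleven_point_precision(judgments, 5)])
# [1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.67, 0.38, 0.38]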

23 Average precision Average precision for a single topic is the mean of the precision values obtained after each relevant document is retrieved. Example: p = (…) / 5 = 0.75 Mean average precision for a run consisting of many topics is the mean of the average precision scores of the individual topics in the run. Definitions from TREC-8.
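A sketch of these two TREC-8 definitions; the ranking below is the slide 18 example and is not necessarily the data behind the elided numbers in the example above.

def average_precision(judgments, total_relevant):
    # Mean of the precision values obtained after each relevant document.
    hits, precisions = 0, []
    for n, rel in enumerate(judgments, start=1):
        if rel:
            hits += 1
            precisions.append(hits / n)
    return sum(precisions) / total_relevant

def mean_average_precision(topics):
    # topics: list of (judgments, total_relevant) pairs, one per topic in the run.
    return sum(average_precision(j, r) for j, r in topics) / len(topics)

judgments = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(round(average_precision(judgments, 5), 2))    # 0.76 for this ranking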

24 Normalized recall measure [Figure: recall plotted against the ranks of the retrieved documents, showing curves for the ideal ranks, the actual ranks, and the worst ranks.]

25 Normalized recall
Normalized recall = (area between actual and worst) / (area between best and worst)
After some mathematical manipulation:
R_norm = 1 − (Σ_{i=1..n} r_i − Σ_{i=1..n} i) / (n(N − n))
where r_i is the rank at which the i-th relevant document is retrieved, n is the number of relevant documents, and N is the total number of documents.
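A sketch of this formula; the ranks and collection size below are taken from the slide 18 setting (5 relevant documents at ranks 1, 2, 4, 6, 13 in a 200-document collection) and are otherwise illustrative.

def normalized_recall(relevant_ranks, collection_size):
    # R_norm = 1 − (Σ r_i − Σ i) / (n(N − n))
    n = len(relevant_ranks)
    ideal = range(1, n + 1)                     # best case: ranks 1..n
    return 1 - (sum(relevant_ranks) - sum(ideal)) / (n * (collection_size - n))

print(round(normalized_recall([1, 2, 4, 6, 13], 200), 3))   # 0.989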

26 Combining Recall and Precision: Normalized Symmetric Difference D = set of documents, A = relevant documents, B = retrieved documents. Symmetric difference: the set of elements belonging to one but not both of two given sets. S = (A ∪ B) − (A ∩ B) Normalized symmetric difference = |S| / (|A| + |B|) = 1 − 1 / {½ (1/recall + 1/precision)}, i.e., one minus the harmonic mean of recall and precision.
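A quick numerical check of this identity on a hypothetical pair of sets:

A = {"d1", "d2", "d3", "d4"}                    # relevant
B = {"d2", "d3", "d5", "d6"}                    # retrieved
S = (A | B) - (A & B)                           # symmetric difference

recall = len(A & B) / len(A)
precision = len(A & B) / len(B)
nsd = len(S) / (len(A) + len(B))
print(nsd, 1 - 1 / (0.5 * (1 / recall + 1 / precision)))    # both 0.5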

27 Statistical tests Suppose that a search is carried out on systems i and j. System i is superior to system j if, for every test case, recall(i) >= recall(j) and precision(i) >= precision(j). In practice, we have data from only a limited number of test cases. What conclusions can we draw?

28 Recall-precision graph [Figure: recall-precision curves for two systems. The red system appears better than the black, but is the difference statistically significant?]

29 Statistical tests The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normal distributions that do not hold for this data. The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but it assumes independent samples. The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes; it makes no assumption of normality but assumes independent samples.
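A sketch of how the three tests might be run on per-query scores for two systems, assuming SciPy is available; the score values below are invented for illustration.

from scipy import stats

system_a = [0.35, 0.41, 0.28, 0.50, 0.44, 0.39, 0.47, 0.31]   # e.g., average precision per query
system_b = [0.30, 0.38, 0.29, 0.42, 0.40, 0.33, 0.45, 0.28]

# Paired t-test: assumes the differences are normally distributed.
print(stats.ttest_rel(system_a, system_b))

# Sign test: a binomial test on the number of positive differences (ties dropped).
diffs = [a - b for a, b in zip(system_a, system_b) if a != b]
print(stats.binomtest(sum(d > 0 for d in diffs), n=len(diffs), p=0.5))

# Wilcoxon signed-rank test: uses the ranks of the differences.
print(stats.wilcoxon(system_a, system_b))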