CS 430: Information Discovery, Lecture 8: Evaluation of Retrieval Effectiveness II


1 CS 430: Information Discovery Lecture 8 Evaluation of Retrieval Effectiveness II

2 Course administration

3 The Cranfield methodology
Recall and precision depend on the concept of relevance.
-> Is relevance a context- and task-independent property of documents?
"Relevance is the correspondence in context between an information requirement statement (a query) and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement."
F. W. Lancaster, 1979

4 Relevance
Recall and precision values are for a specific set of documents and a specific set of queries.
Relevance is subjective, but experimental evidence suggests that, for textual documents, different experts have similar judgments about relevance. Estimates of the level of relevance are less consistent.
Query types are important and differ in specificity:
-> subject-heading queries
-> title queries
-> paragraphs or free text
Tests should use realistic queries.

5 Text Retrieval Conferences (TREC)
Led by Donna Harman (NIST), with DARPA support
Annual since 1992 (the initial experiment ended in 1999)
Corpus of several million textual documents, a total of more than five gigabytes of data
Researchers attempt a standard set of tasks:
-> search the corpus for topics provided by surrogate users
-> match a stream of incoming documents against standard queries
Participants include large commercial companies, small information retrieval vendors, and university research groups.

6 The TREC Corpus (Examples)
(Table: source, size in Mbytes, number of documents, and median words per document; the numeric values did not survive in this transcript.)
Sources include the Wall Street Journal, Associated Press newswire, Computer Selects articles, the Federal Register, and abstracts of DOE publications.

7 The TREC Corpus (continued)
(Table: source, size in Mbytes, number of documents, and median words per document; the numeric values did not survive in this transcript.)
Sources include the San Jose Mercury News, Associated Press newswire, Computer Selects articles, U.S. patents, the Financial Times, the Federal Register, the Congressional Record, the Foreign Broadcast Information Service, and the LA Times.

8 The TREC Corpus (continued)
Notes:
1. The TREC corpus consists mainly of general articles. The Cranfield data was in a specialized engineering domain.
2. The TREC data is raw data:
-> No stop words are removed; no stemming
-> Words are alphanumeric strings
-> No attempt is made to correct spelling, sentence fragments, etc.
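As an illustration of treating words simply as alphanumeric strings, with no stop-word removal and no stemming, here is a minimal tokenizer sketch in Python (the regular expression and function name are illustrative choices, not part of the TREC software):

```python
import re

def tokenize(text):
    """Split raw text into alphanumeric tokens.
    No stop words are removed and no stemming is applied."""
    return re.findall(r"[A-Za-z0-9]+", text)

# Example
print(tokenize("Pan Am Flight 103, Lockerbie (1988)."))
# -> ['Pan', 'Am', 'Flight', '103', 'Lockerbie', '1988']
```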

9 TREC Experiments
1. NIST provides the text corpus on CD-ROM. Each participant builds an index using its own technology.
2. NIST provides 50 natural-language topic statements. Each participant converts them into queries (automatically or manually).
3. Each participant runs the searches and returns up to 1,000 hits to NIST. NIST analyzes the results for recall and precision. (All TREC participants use rank-based methods of searching.)

10 TREC Topic Statement
Number: 409
Title: legal, Pan Am, 103
Description: What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland, on December 21, 1988?
Narrative: Documents describing any charges, claims, or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.
A sample TREC topic statement

11 Relevance Assessment
For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant. The human expert who set the query looks at every document in the pool and determines whether it is relevant. Documents outside the pool are not examined.
In a TREC-8 example, with 71 participants:
7,100 documents in the pool
1,736 unique documents (eliminating duplicates)
94 judged relevant
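A minimal sketch of the pooling step, assuming each run is given as a ranked list of document identifiers (the function and variable names are illustrative, not part of the TREC tooling):

```python
def build_pool(runs, depth=100):
    """Assemble the judging pool for one topic.

    runs: list of ranked document-id lists, one per participating system.
    Returns the set of unique documents appearing in the top `depth`
    positions of any run; documents outside this pool are never judged.
    """
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

# Example: three toy runs; the pool is the union of their top-2 documents.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd4', 'd5']
```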

12 A Cornell Footnote
The TREC analysis uses a program developed by Chris Buckley, who spent 17 years at Cornell before completing his Ph.D. Buckley has continued to maintain the SMART software and has been a participant at every TREC conference. SMART is used as the basis against which other systems are compared.
During the early TREC conferences, the tuning of SMART with the TREC corpus led to steady improvements in retrieval effectiveness, but after about TREC-5 a plateau was reached. TREC-8, in 1999, was the final year for this experiment.

13 Measures based on relevance

                 relevant     not relevant
retrieved        RR           RN
not retrieved    NR           NN

RR = retrieved and relevant
RN = retrieved and not relevant
NR = not retrieved and relevant
NN = not retrieved and not relevant

14 Measures based on relevance

recall = |retrieved ∩ relevant| / |relevant|

precision = |retrieved ∩ relevant| / |retrieved|

fallout = |retrieved ∩ not-relevant| / |not-relevant|
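A minimal sketch of these three measures as set operations in Python (the function names and the example documents are illustrative):

```python
def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def fallout(retrieved, relevant, collection):
    not_relevant = collection - relevant
    return len(retrieved & not_relevant) / len(not_relevant)

# Example: 200-document collection, 5 relevant documents, 14 retrieved.
collection = {f"d{i}" for i in range(200)}
relevant = {"d0", "d1", "d3", "d5", "d12"}
retrieved = {f"d{i}" for i in range(14)}
print(recall(retrieved, relevant))               # 1.0 (all 5 relevant retrieved)
print(precision(retrieved, relevant))            # 5/14, about 0.36
print(fallout(retrieved, relevant, collection))  # 9/195, about 0.046
```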

15 Estimates of Recall
The pooling method used by TREC depends on the pool of nominated documents. Are there relevant documents that are not in the pool?
An example of estimating recall:
Litigation support system using the IBM STAIRS system
Corpus of 40,000 documents; 51 queries
Random samples of documents were examined by lawyers in a blind sampling experiment
Estimate: only about 20% of the relevant documents were found by STAIRS
Blair and Maron, 1985
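A rough sketch of how such a sampling-based recall estimate can be computed; the numbers below are invented for illustration and are not the Blair and Maron figures:

```python
def estimated_recall(relevant_found, sample_size, sample_relevant, unexamined_size):
    """Estimate recall when the full set of relevant documents is unknown.

    relevant_found:   relevant documents actually retrieved by the system
    sample_size:      number of unretrieved documents sampled blind
    sample_relevant:  how many sampled documents the judges found relevant
    unexamined_size:  total number of unretrieved documents
    """
    # Extrapolate from the sample to estimate relevant documents missed.
    estimated_missed = sample_relevant / sample_size * unexamined_size
    return relevant_found / (relevant_found + estimated_missed)

# Hypothetical numbers: 100 relevant documents found; a 1,000-document blind
# sample of the 39,000 unretrieved documents contains 10 relevant ones.
print(round(estimated_recall(100, 1000, 10, 39000), 2))  # about 0.2
```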

16 Recall-precision after retrieval of n documents

n    relevant   recall   precision
1    yes        0.20     1.00
2    yes        0.40     1.00
3    no         0.40     0.67
4    yes        0.60     0.75
5    no         0.60     0.60
6    yes        0.80     0.67
7    no         0.80     0.57
8    no         0.80     0.50
9    no         0.80     0.44
10   no         0.80     0.40
11   no         0.80     0.36
12   no         0.80     0.33
13   yes        1.00     0.38
14   no         1.00     0.36

SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.
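A small sketch that recomputes the table above from the ranked yes/no relevance judgments (the function name is illustrative):

```python
def recall_precision_at_ranks(is_relevant, total_relevant):
    """Cumulative recall and precision after each retrieved document.

    is_relevant:    list of booleans, one per ranked result
    total_relevant: number of relevant documents in the collection
    """
    rows, hits = [], 0
    for n, rel in enumerate(is_relevant, start=1):
        hits += rel
        rows.append((n, hits / total_relevant, hits / n))
    return rows

# The 14-document ranking from the slide: relevant at ranks 1, 2, 4, 6, 13.
ranking = [True, True, False, True, False, True,
           False, False, False, False, False, False, True, False]
for n, r, p in recall_precision_at_ranks(ranking, total_relevant=5):
    print(f"{n:2d}  recall={r:.2f}  precision={p:.2f}")
```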

17 Recall-precision graph
(figure: precision plotted against recall for the ranking above)

18 Typical recall-precision graph
(figure: precision plotted against recall for two query types, a broad, general query and a narrow, specific query)

19 Normalized recall measure
(figure: recall plotted against the ranks of retrieved documents, showing the ideal ranks, the actual ranks, and the worst ranks)

20 Normalized recall

Normalized recall = (area between actual and worst) / (area between best and worst)

R_norm = 1 - (Σ_{i=1}^{n} r_i - Σ_{i=1}^{n} i) / (n(N - n))

where r_i is the rank at which the i-th relevant document is retrieved, n is the number of relevant documents, and N is the total number of documents.
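A minimal sketch of the normalized recall computation, assuming the ranks of the relevant documents are known (names are illustrative):

```python
def normalized_recall(relevant_ranks, total_docs):
    """Normalized recall R_norm for one query.

    relevant_ranks: ranks (1-based) at which the relevant documents appear
    total_docs:     N, the total number of documents in the collection
    """
    n = len(relevant_ranks)
    actual = sum(relevant_ranks)     # sum of r_i
    ideal = sum(range(1, n + 1))     # sum of i, the best possible ranks
    return 1 - (actual - ideal) / (n * (total_docs - n))

# The example from the earlier table: relevant at ranks 1, 2, 4, 6, 13 of 200.
print(round(normalized_recall([1, 2, 4, 6, 13], 200), 3))  # about 0.989
```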

21 Normalized Symmetric Difference

(Venn diagram: A = retrieved documents, B = relevant documents, within the set of all documents)

Symmetric difference: S = (A ∪ B) - (A ∩ B)

Normalized symmetric difference = |S| / ½(|A| + |B|)
= 2 - 4 / (1/recall + 1/precision)
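A short sketch that checks the normalized symmetric difference against the recall/precision form above (the set names follow the slide; the function name and example documents are illustrative):

```python
def normalized_symmetric_difference(retrieved, relevant):
    """|S| / (0.5 * (|A| + |B|)), where S is the symmetric difference of
    A = retrieved and B = relevant."""
    s = (retrieved | relevant) - (retrieved & relevant)
    return len(s) / (0.5 * (len(retrieved) + len(relevant)))

# Example: 14 retrieved documents, 5 relevant documents, all 5 retrieved.
retrieved = {f"d{i}" for i in range(14)}
relevant = {"d0", "d1", "d3", "d5", "d12"}
nsd = normalized_symmetric_difference(retrieved, relevant)

recall = len(retrieved & relevant) / len(relevant)      # 1.0
precision = len(retrieved & relevant) / len(retrieved)  # 5/14
print(round(nsd, 3))                                    # about 0.947
print(round(2 - 4 / (1 / recall + 1 / precision), 3))   # same value
```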

22 Statistical tests
Suppose that a search is carried out on systems i and j.
System i is superior to system j if, for all test cases:
recall(i) >= recall(j)
precision(i) >= precision(j)

23 Recall-precision graph
(figure: precision plotted against recall for two systems)
The red system appears better than the black, but is the difference statistically significant?

24 Statistical tests
The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normally distributed data that do not apply to this data.
The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but it assumes independent samples.
The Wilcoxon signed-rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but it assumes independent samples.
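A hedged sketch of running these three tests on per-query scores from two systems using SciPy; the scores are invented, and `binomtest` assumes SciPy 1.7 or later:

```python
from scipy import stats

# Hypothetical average precision scores for the same 10 queries on two systems.
system_a = [0.35, 0.42, 0.51, 0.28, 0.60, 0.47, 0.33, 0.55, 0.40, 0.49]
system_b = [0.30, 0.45, 0.44, 0.25, 0.52, 0.46, 0.30, 0.50, 0.38, 0.41]

# Paired t-test: assumes the per-query differences are normally distributed.
t_stat, t_p = stats.ttest_rel(system_a, system_b)

# Sign test: count queries where A beats B; test against a fair coin.
wins = sum(a > b for a, b in zip(system_a, system_b))
ties = sum(a == b for a, b in zip(system_a, system_b))
sign_p = stats.binomtest(wins, n=len(system_a) - ties, p=0.5).pvalue

# Wilcoxon signed-rank test: uses ranks of the differences, no normality assumed.
w_stat, w_p = stats.wilcoxon(system_a, system_b)

print(f"t-test p={t_p:.3f}, sign test p={sign_p:.3f}, Wilcoxon p={w_p:.3f}")
```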

25 User criteria
System-centered and user-centered evaluation:
-> Is the user satisfied?
-> Is the user successful?
System efficiency:
-> What efforts are involved in carrying out the search?
Suggested criteria:
recall and precision
response time
user effort
form of presentation
content coverage

26 System factors that affect user satisfaction
Collection:
Input policies -- coverage, error rates, timeliness
Document characteristics -- title, abstract, summary, full text
Indexing:
Rules for assigning terms, specificity, exhaustivity
Query:
Formulation, operators