Presentation transcript:

C.Watters cs6403 — Evaluation Measures

C.Watters cs6403 — Evaluation? Effectiveness? For whom? For what? Efficiency? Time? Computational cost? Cost of missed information? Too much information?

C.Watters cs6403 — Studies of Retrieval Effectiveness
- The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957–1968 (hundreds of docs)
- SMART System, Gerald Salton, Cornell University (thousands of docs)
- TREC, Donna Harman, National Institute of Standards and Technology (NIST) (millions of docs, 100k to 7.5M per set; training and test queries, 150 each)

C.Watters cs6403 — What can we measure?
Algorithm (Efficiency)
- Speed of the algorithm
- Update potential of the indexing scheme
- Size of storage required
- Potential for distribution & parallelism
User Experience (Effectiveness)
- How many of all relevant docs were found
- How many were missed
- How many errors in selection
- How many must be scanned before the good ones are reached

C.Watters cs6403 — Measures based on relevance
[Diagram: the document set is partitioned into four quadrants]
- RR: retrieved and relevant
- RN: retrieved, not relevant
- NR: not retrieved, relevant
- NN: not retrieved, not relevant

C.Watters cs6403 — Relevance
The system is always correct!! Who judges relevance? Inter-rater reliability.
- Early evaluations: done by a panel of experts, using abstracts of docs
- TREC experiments: done automatically over thousands of docs; pooling + people

C.Watters cs6403 — Defining the universe of relevant docs
- Manual inspection
- Manual exhaustive search
- Pooling (TREC): the relevant set is the union of multiple techniques
- Sampling: take a random sample, inspect it, estimate from the sample for the whole set

C.Watters cs6403 — Defining the relevant docs in a retrieved set (hit list)
- Panel of judges
- Individual users
- Automatic detection techniques: vocabulary overlap with known relevant docs, metadata

C.Watters cs6403 — Estimates of Recall
The pooling system used by TREC depends on the quality of the set of nominated documents. Are there relevant documents not in the pool?

C.Watters cs6403 — Measures based on relevance (repeats the RR / RN / NR / NN quadrant diagram above)

C.Watters cs6403 — Measures based on relevance
recall = RR / (RR + NR)
precision = RR / (RR + RN)
fallout = RN / (RN + NN)
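As a quick illustration of these three ratios, a minimal Python sketch (the function and the example counts are mine, not from the slides):

```python
def relevance_measures(rr, rn, nr, nn):
    """Compute recall, precision, and fallout from the four quadrant counts.

    rr: retrieved and relevant      rn: retrieved, not relevant
    nr: not retrieved, relevant     nn: not retrieved, not relevant
    """
    recall = rr / (rr + nr) if (rr + nr) else 0.0
    precision = rr / (rr + rn) if (rr + rn) else 0.0
    fallout = rn / (rn + nn) if (rn + nn) else 0.0
    return recall, precision, fallout

# Hypothetical case: 2 relevant hits among 5 retrieved, 10 relevant docs in a 200-doc collection.
print(relevance_measures(rr=2, rn=3, nr=8, nn=187))  # -> (0.2, 0.4, ~0.016)
```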

C.Watters cs6403 — Relevance Evaluation Measures
- Recall and precision
- Single-valued measures: macro and micro averages, R-precision, E-measure, Swets' measure

C.Watters cs6403 — Recall and Precision
- Recall: proportion of relevant docs that are retrieved
- Precision: proportion of retrieved docs that are relevant

C.Watters cs6403 — Formula (what do we have to work with?)
R_q = number of docs in the whole data set relevant to query q
R_r = number of docs in the hit set (retrieved docs) that are relevant
R_h = number of docs in the hit set
Recall = R_r / R_q
Precision = R_r / R_h

C.Watters cs6403 — Recall and precision (diagram)
[Venn diagram: the collection contains the relevant set R_q and the hit set R_h; their overlap is R_r]
Recall = R_r / R_q
Precision = R_r / R_h

C.Watters cs6403 — Recall and precision: an example
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h = {d6, d8, d9, d84, d123}
R_r = R_q ∩ R_h = {d9, d123}
Recall = 2/10 = 0.2
Precision = 2/5 = 0.4
What does that tell us?
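The same calculation can be done directly with sets; a short sketch using the document identifiers from this example (variable names are mine):

```python
R_q = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}  # relevant docs
R_h = {"d6", "d8", "d9", "d84", "d123"}                                     # hit list (retrieved)

R_r = R_q & R_h                      # relevant docs that were retrieved
recall = len(R_r) / len(R_q)         # 2 / 10 = 0.2
precision = len(R_r) / len(R_h)      # 2 / 5  = 0.4
print(sorted(R_r), recall, precision)
```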

C.Watters cs6403 — Recall-Precision Graphs
[Plot: precision on the y-axis vs. recall on the x-axis, from 0 to 100%]

C.Watters cs6403 — Typical recall-precision graph
[Plot: precision vs. recall curves for a narrow, specific query and for a broad, general query]

C.Watters cs6403 — Recall-precision after retrieval of n documents
SMART system using Cranfield data: 200 documents in aeronautics, of which 5 are relevant.

 n   relevant   recall   precision
 1   yes        0.2      1.00
 2   yes        0.4      1.00
 3   no         0.4      0.67
 4   yes        0.6      0.75
 5   no         0.6      0.60
 6   yes        0.8      0.67
 7   no         0.8      0.57
 8   no         0.8      0.50
 9   no         0.8      0.44
10   no         0.8      0.40
11   no         0.8      0.36
12   no         0.8      0.33
13   yes        1.0      0.38
14   no         1.0      0.36
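A small sketch that reproduces the table above from the per-rank relevance judgments (only the yes/no column and the count of 5 relevant documents come from the slide; everything else is illustrative):

```python
# Recompute recall and precision after each retrieved document.
judgments = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
total_relevant = 5

found = 0
for n, is_relevant in enumerate(judgments, start=1):
    found += is_relevant
    recall = found / total_relevant
    precision = found / n
    print(f"{n:2d}  {'yes' if is_relevant else 'no':3s}  recall={recall:.1f}  precision={precision:.2f}")
```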

C.Watters cs6403 — Recall-precision graph
[Plot: precision vs. recall]

C.Watters cs6403 — Recall-precision graph
[Plot: precision vs. recall curves for two systems, one red and one black]
The red system appears better than the black, but is the difference statistically significant?

C.Watters cs6403 — Consider rank
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h = {d123, d84, d56, d6, d8, d9, d511, d129, d187, d25, d38, d48, d250, d113, d3}  (in rank order)
R_r = R_q ∩ R_h = {d123, d56, d9, d25, d3}
Recall = 5/10 = 0.5
Precision = 5/15 = 0.33
What happens as we go through the hits?
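To see what happens as we go through the hits, a short sketch (names mine) that walks the ranked list from this slide and prints recall and precision at each rank:

```python
R_q = ["d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"]    # relevant docs
ranked_hits = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
               "d25", "d38", "d48", "d250", "d113", "d3"]                     # hit list, in rank order

relevant = set(R_q)
found = 0
for rank, doc in enumerate(ranked_hits, start=1):
    found += doc in relevant
    print(f"rank {rank:2d}: {doc:5s}  recall={found/len(relevant):.2f}  precision={found/rank:.2f}")
```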

C.Watters cs6403 — Standard Recall Levels
Plot precision at recall = 0%, 10%, ..., 100%.
[Plot: precision (P) vs. recall (R), 0 to 100%]
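A sketch of precision at the 11 standard recall levels, assuming the usual interpolation rule that P(r) is the maximum precision observed at any recall >= r (the slides do not spell out the interpolation, so treat this as one common convention):

```python
def precision_at_standard_recall(recalls, precisions):
    """Interpolated precision at recall = 0.0, 0.1, ..., 1.0.

    recalls/precisions: parallel lists of the (recall, precision) values observed
    after each retrieved document, as in the table above.
    """
    levels = [i / 10 for i in range(11)]
    out = []
    for r in levels:
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

# Example: the (recall, precision) pairs from the Cranfield table above.
recalls    = [0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 1.0, 1.0]
precisions = [1.00, 1.00, 0.67, 0.75, 0.60, 0.67, 0.57, 0.50, 0.44, 0.40, 0.36, 0.33, 0.38, 0.36]
print(precision_at_standard_recall(recalls, precisions))
```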

C.Watters cs6403 — Consider two queries
Query 1:
R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
R_h = {d123, d84, d56, d8, d9, d511, d129, d25, d38, d3}
Query 2:
R_q = {d3, d5, d56, d89, d90, d94, d129, d206, d500, d502}
R_h = {d12, d84, d56, d6, d8, d3, d511, d129, d44, d89}

C.Watters cs6403 — Comparison of query results

C.Watters cs6403 — P-R for Multiple Queries
For each recall level, average the precision across queries:

P_avg(r) = (1 / N_q) * Σ_{i=1}^{N_q} P_i(r)

where N_q is the number of queries and P_i(r) is the precision at recall level r for the i-th query.
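A minimal sketch of this averaging step, assuming each query has already been reduced to an 11-point interpolated curve such as the one produced by precision_at_standard_recall() above:

```python
def average_precision_at_recall_levels(per_query_curves):
    """Average the 11-point interpolated precision values across queries.

    per_query_curves: list of 11-element lists, one per query.
    """
    n_q = len(per_query_curves)
    return [sum(curve[i] for curve in per_query_curves) / n_q for i in range(11)]
```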

C.Watters cs6403 — Comparison of two systems

C.Watters cs6403 — Macro and Micro Averaging
- Micro: average over each individual point (every query-document decision counts equally)
- Macro: average of the per-query averages (every query counts equally)
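A small sketch of the difference, using hypothetical per-query counts of relevant-retrieved and retrieved documents:

```python
# Per-query (relevant retrieved, total retrieved) counts -- illustrative numbers only.
queries = [(2, 5), (8, 10), (1, 20)]

# Macro average: compute precision per query, then average (each query counts equally).
macro = sum(rr / ret for rr, ret in queries) / len(queries)

# Micro average: pool the counts first, then compute one precision (each decision counts equally).
micro = sum(rr for rr, _ in queries) / sum(ret for _, ret in queries)

print(f"macro precision = {macro:.2f}, micro precision = {micro:.2f}")  # 0.42 vs 0.31
```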

C.Watters cs6403 — Statistical tests
Suppose that a search is carried out on systems i and j. System i is superior to system j if:
recall(i) >= recall(j)
precision(i) >= precision(j)

C.Watters cs6403 — Statistical tests
- The t-test is the standard statistical test for comparing two tables of numbers, but it depends on assumptions of independence and normal distributions that do not apply to this data.
- The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but assumes independent samples.
- The Wilcoxon signed rank test uses the ranks of the differences, not their magnitudes, and makes no assumption of normality, but assumes independent samples.
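A minimal sketch of how such paired comparisons are often run, assuming one score per query for each system (the scores are illustrative; scipy's ttest_rel and wilcoxon are used, and the sign test is computed directly from the binomial distribution):

```python
from math import comb

from scipy import stats

# Per-query scores (e.g. average precision) for two systems -- illustrative numbers only.
system_a = [0.42, 0.31, 0.55, 0.60, 0.27, 0.48, 0.39, 0.52]
system_b = [0.38, 0.33, 0.49, 0.51, 0.25, 0.47, 0.30, 0.45]

# Paired t-test and Wilcoxon signed-rank test on the per-query differences.
t_stat, t_p = stats.ttest_rel(system_a, system_b)
w_stat, w_p = stats.wilcoxon(system_a, system_b)

# Sign test: count queries where A beats B, then a two-sided binomial p-value under p = 0.5.
diffs = [a - b for a, b in zip(system_a, system_b) if a != b]
wins = sum(d > 0 for d in diffs)
n = len(diffs)
sign_p = min(1.0, 2 * sum(comb(n, k) for k in range(max(wins, n - wins), n + 1)) / 2 ** n)

print(f"t-test p={t_p:.3f}, Wilcoxon p={w_p:.3f}, sign test p={sign_p:.3f}")
```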

C.Watters cs6403 — II. Single-Value Measures
- E-measure & F1 measure
- Swets' measure
- Expected search length
- etc.

C.Watters cs6403 — E Measure & F1 Measure
- Weighted combination of recall and precision; can increase the importance of recall or of precision
- F = 1 - E (bigger is better)
- F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta is often 1 (the F1 measure)
- Increasing beta gives more weight to recall
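A small sketch of F_beta and E under the standard definition F = 1 - E assumed above (function names are mine):

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    """E = 1 - F_beta (smaller is better)."""
    return 1.0 - f_measure(precision, recall, beta)

print(f_measure(0.4, 0.2))          # F1 for the earlier example: ~0.27
print(f_measure(0.4, 0.2, beta=2))  # weighting recall more heavily: ~0.22
```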

C.Watters cs6403 — Normalized Recall
Recall is normalized against all relevant documents. Suppose there are N documents in the collection, of which n are relevant, and the relevant documents are retrieved at ranks i_1, i_2, ..., i_n. Normalized recall is calculated as:

R_norm = 1 - [ Σ_{j=1}^{n} (i_j - j) ] / [ n(N - n) ]

J. Allan — Normalized recall measure
[Plot: recall vs. ranks of retrieved documents, with curves for the ideal ranks, the actual ranks, and the worst ranks]

J. Allan — Normalized recall
Normalized recall = (area between actual and worst) / (area between best and worst)

R_norm = 1 - [ Σ_{i=1}^{n} r_i - Σ_{i=1}^{n} i ] / [ n(N - n) ]

C.Watters cs6403 — Example: N = 200, n = 5, relevant docs at ranks 1, 3, 5, 10, 14
Σ (r_i - i) = (1-1) + (3-2) + (5-3) + (10-4) + (14-5) = 18
n(N - n) = 5 × 195 = 975
R_norm = 1 - 18/975 ≈ 0.98
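A short sketch of the same computation (function name and arguments are mine):

```python
def normalized_recall(relevant_ranks, collection_size):
    """Normalized recall: 1 - sum(r_i - i) / (n * (N - n)).

    relevant_ranks: ranks (1-based) at which the n relevant docs were retrieved.
    collection_size: N, the total number of documents in the collection.
    """
    n = len(relevant_ranks)
    penalty = sum(rank - ideal for ideal, rank in enumerate(sorted(relevant_ranks), start=1))
    return 1.0 - penalty / (n * (collection_size - n))

print(normalized_recall([1, 3, 5, 10, 14], 200))  # ~0.98
```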

C.Watters cs6403 — Expected Search Length
The number of non-relevant docs examined before the required relevant doc(s) are found. Assume weak or no ordering:

L_q = j + (i * s) / (r + 1)

where s is the number of relevant docs still required from the final level examined, r is the number of relevant docs in that level, i is the number of non-relevant docs in that level, and j is the number of non-relevant docs in the levels examined before it.
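A minimal sketch of the computation, assuming the weakly ordered output is given as levels of (relevant, non-relevant) counts (the representation and names are mine):

```python
def expected_search_length(levels, wanted):
    """Expected number of non-relevant docs examined before `wanted` relevant docs are found.

    levels: list of (relevant_count, nonrelevant_count) tuples, one per level of the
            weakly ordered output, in the order the user would examine them.
    """
    j = 0  # non-relevant docs in levels examined completely
    for r, i in levels:
        if wanted <= r:
            # Final level: on average i * s / (r + 1) non-relevant docs are seen here.
            return j + i * wanted / (r + 1)
        wanted -= r
        j += i
    raise ValueError("not enough relevant documents in the ranking")

# Example: first level has 2 relevant + 3 non-relevant docs, second has 4 relevant + 6 non-relevant.
print(expected_search_length([(2, 3), (4, 6)], wanted=3))  # 3 + 6*1/5 = 4.2
```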

C.Watters cs6403 — Problems with testing
- Determining relevant docs
- Setting up test questions
- Comparing results
- Understanding relevance of the results