SLIDE 1: Principles of Information Retrieval, Lecture 14: Evaluation (cont.)
IS 240 – Spring 2007, Prof. Ray Larson, University of California, Berkeley School of Information. Tuesday and Thursday, 10:30 am - 12:00 pm.

SLIDE 2: Overview
Review
–Calculating Recall and Precision
–The trec_eval program
Limits of retrieval performance
–Relationship of Recall and Precision
More on Evaluation (Alternatives to R/P)
Expected Search Length
Blair and Maron

SLIDE 3: How Test Runs are Evaluated
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} : 10 relevant documents
Ranked answer (relevant documents marked with *):
1. d123 *
2. d84
3. d56 *
4. d6
5. d8
6. d9 *
7. d511
8. d129
9. d187
10. d25 *
11. d38
12. d48
13. d250
14. d113
15. d3 *
The first ranked document is relevant, and it is 10% of the total relevant set. Therefore precision at the 10% recall level is 100%. The next relevant document (rank 3) gives us 66% precision at the 20% recall level. Etc.
Example from Chapter 3 in Baeza-Yates.
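Purely as an illustration (not part of the original slides), here is how these recall/precision points can be computed; the relevance pattern of the ranking and the size of Rq are taken from the example above:

# Relevance flag for each of the 15 ranked documents in the example
# (relevant documents appear at ranks 1, 3, 6, 10 and 15).
is_relevant = [True, False, True, False, False, True, False, False,
               False, True, False, False, False, False, True]
total_relevant = 10   # |Rq| for this query

found = 0
for rank, rel in enumerate(is_relevant, start=1):
    if rel:
        found += 1
        recall = found / total_relevant       # fraction of all relevant seen so far
        precision = found / rank              # fraction of retrieved docs that are relevant
        print(f"rank {rank:2d}: recall {recall:.0%}  precision {precision:.1%}")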

SLIDE 4: Graphing for a Single Query
[figure: precision plotted against recall for the single query above]

SLIDE 5: Averaging Multiple Queries

SLIDE 6: Interpolation
Now let Rq = {d3, d56, d129} (3 relevant documents), with the same ranking as before.
The first relevant document is d56 at rank 3, which gives recall and precision of 33.3%.
The next relevant document (d129, rank 8) gives us 66.7% recall at 25% precision.
The last (d3, rank 15) gives us 100% recall with 20% precision.
How do we figure out the precision at the 11 standard recall levels?

SLIDE 7: Interpolation
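A standard statement of the interpolation rule, following the Baeza-Yates chapter this example is drawn from: the interpolated precision at the j-th standard recall level r_j is the maximum known precision at any recall level greater than or equal to r_j,

P(r_j) = \max_{r \ge r_j} P(r), \qquad r_j \in \{0.0, 0.1, \ldots, 1.0\}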

SLIDE 8: Interpolation
So, at recall levels 0%, 10%, 20%, and 30% the interpolated precision is 33.3%.
At recall levels 40%, 50%, and 60% the interpolated precision is 25%.
And at recall levels 70%, 80%, 90%, and 100% the interpolated precision is 20%.
Giving the graph…
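A small sketch (not from the slides) that reproduces the interpolated values above; the observed (recall, precision) points are the ones just computed for Rq = {d3, d56, d129}:

# Observed (recall, precision) points for the interpolation example.
observed = [(1/3, 1/3), (2/3, 0.25), (3/3, 0.20)]

# Interpolated precision at recall r = max precision at any recall >= r.
for level in range(11):
    r = level / 10
    p = max((prec for rec, prec in observed if rec >= r), default=0.0)
    print(f"recall {r:.1f}: interpolated precision {p:.1%}")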

SLIDE 9: Interpolation
[figure: interpolated precision plotted against recall]

SLIDE 10: TREC_EVAL Output
Queryid (Num): 49   (number of queries)
Total number of documents over all queries:
  Retrieved:
  Relevant: 1670   (from QRELS)
  Rel_ret: 1258   (relevant and retrieved)
Interpolated Recall – Precision Averages: average precision at the 11 fixed recall levels (0.0, 0.1, …, 1.0), from the individual queries
Average precision (non-interpolated) for all rel docs, averaged over queries

SLIDE 11: TREC_EVAL Output
Precision: average precision at fixed numbers of documents retrieved
  At 5 docs:
  At 10 docs:
  At 15 docs:
  At 20 docs:
  At 30 docs:
  At 100 docs:
  At 200 docs:
  At 500 docs:
  At 1000 docs:
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact:
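As an illustrative sketch (not part of the slides), the two per-query summary numbers trec_eval reports can be computed from a ranked list of relevance judgments like this; the example ranking is the interpolation query above (relevant documents at ranks 3, 8, and 15, with 3 relevant in total):

def average_precision(is_relevant, total_relevant):
    # Non-interpolated average precision: average the precision observed
    # at each rank where a relevant document appears, divided over all
    # relevant documents for the query (retrieved or not).
    hits, total = 0, 0.0
    for rank, rel in enumerate(is_relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / total_relevant

def r_precision(is_relevant, total_relevant):
    # Precision after R documents have been retrieved, where R = num_rel.
    return sum(is_relevant[:total_relevant]) / total_relevant

ranked = [False, False, True, False, False, False, False, True,
          False, False, False, False, False, False, True]
print(average_precision(ranked, 3))   # (1/3 + 2/8 + 3/15) / 3 ≈ 0.261
print(r_precision(ranked, 3))         # 1/3 ≈ 0.333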

SLIDE 12: Problems with Precision/Recall
Can't know the true recall value
–except in small collections
Precision and Recall are related
–A combined measure is sometimes more appropriate
Assumes batch mode
–Interactive IR is important and has different criteria for successful searches
–We will touch on this in the UI section
Assumes a strict rank ordering matters
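One widely used combined measure, not named on this slide but standard in the IR literature, is the harmonic mean of precision and recall (van Rijsbergen's F measure; note that later slides reuse the letter F for fallout):

F = \frac{2\,P\,R}{P + R}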

SLIDE 13: Relationship between Precision and Recall
                      | Doc is Relevant | Doc is NOT relevant
Doc is retrieved      |                 |
Doc is NOT retrieved  |                 |
(Buckland & Gey, JASIS, Jan. 1994)

SLIDE 14: Recall Under Various Retrieval Assumptions
[figure: recall (y-axis) vs. proportion of documents retrieved (x-axis) for Random, Perfect, Perverse, Tangent Parabolic Recall, and Parabolic Recall retrieval; 1000 documents, 100 relevant]
(Buckland & Gey, JASIS, Jan. 1994)

SLIDE 15: Precision Under Various Assumptions
[figure: precision (y-axis) vs. proportion of documents retrieved (x-axis) for Random, Perfect, Perverse, Tangent Parabolic Recall, and Parabolic Recall retrieval; 1000 documents, 100 relevant]

SLIDE 16: Recall-Precision
[figure: precision vs. recall for Random, Perfect, Perverse, Tangent Parabolic Recall, and Parabolic Recall retrieval; 1000 documents, 100 relevant]

SLIDE 17: CACM Query 25

SLIDE 18: Relationship of Precision and Recall

SLIDE 19: Today
More on Evaluation (Alternatives to R/P)
Expected Search Length
Non-Binary Relevance and Evaluation

SLIDE 20: Other Relationships
From van Rijsbergen, Information Retrieval (2nd ed.)
                  | RELEVANT | NON-RELEVANT
RETRIEVED         |          |
NOT RETRIEVED     |          |

SLIDE 21: Other Relationships
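For reference, the standard definitions of these measures in van Rijsbergen's notation, with A the set of relevant documents, B the set of retrieved documents, and N the collection size (the exact formulas shown on this slide are assumed to match these):

P = \frac{|A \cap B|}{|B|}, \qquad
R = \frac{|A \cap B|}{|A|}, \qquad
F = \frac{|\bar{A} \cap B|}{|\bar{A}|}, \qquad
G = \frac{|A|}{N}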

SLIDE 22: Other Relationships
All of the previous measures are related by a single equation, where P = Precision, R = Recall, F = Fallout, and G = Generality.
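The relation usually given (e.g., in van Rijsbergen's Information Retrieval) is

P = \frac{R\,G}{R\,G + F\,(1 - G)}

which follows because, in a collection of N documents, R·G·N relevant documents and F·(1 − G)·N non-relevant documents are retrieved, and precision is the first quantity divided by their sum.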

SLIDE 23: MiniTREC 2000
Collection: Financial Times (FT), ~600 MB
50 Topics (#   ) from TREC
FT QRELs from TREC-8
Four groups, 12 runs:
–Cheshire 1 – 4 runs
–Cheshire 2 – 5 runs
–MG – 3 runs
–SMART – still working… (not really)
Total of        ranked documents submitted

SLIDE 24: Precision/Recall for all MiniTREC runs

SLIDE 25: Revised Precision/Recall

SLIDE 26: Further Analysis
Analysis of Variance (ANOVA) – see the sketch after this list
–Uses the rel_ret, total relevant, and average precision for each topic
–Not a perfectly balanced design…
–Looked at the simple models:
  Recall = Runid
  Precision = Runid
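The analysis in these slides was done with SAS (hence the Waller-Duncan output that follows). Purely as an illustrative sketch, the same two one-way models could be fit in Python with statsmodels, assuming a per-topic results table with hypothetical column names runid, recall, and avgprec:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-topic results: one row per (runid, topic) pair with
# that topic's recall and average precision for the run.
df = pd.read_csv("minitrec_results.csv")   # assumed file layout

# One-way ANOVA: does the run explain variation in recall?
recall_model = smf.ols("recall ~ C(runid)", data=df).fit()
print(sm.stats.anova_lm(recall_model, typ=1))

# Same model for average precision.
precision_model = smf.ols("avgprec ~ C(runid)", data=df).fit()
print(sm.stats.anova_lm(precision_model, typ=1))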

SLIDE 27: ANOVA results: Recall
Waller-Duncan K-ratio t Test for recall
NOTE: This test minimizes the Bayes risk under additive loss and certain other assumptions.
  Kratio: 100
  Error Degrees of Freedom: 572
  Error Mean Square:
  F Value: 4.65
  Critical Value of t:
  Minimum Significant Difference:
  Harmonic Mean of Cell Sizes:
NOTE: Cell sizes are not equal.

SLIDE 28: ANOVA Results: Mean Recall
Means with the same letter are not significantly different.
Waller Grouping   Mean   N   runid
A                             mg_manua
B A                           ch1_test
B A                           ch1_newc
B A                           ch2_run2
B A                           ch2_run1
B A                           mg_t5_al
B A                           ch2_run3
B A                           mg_t5_re
B A                           ch1_cont
B                             ch2_run5
B                             ch2_run4
C                             ch1_relf

SLIDE 29: ANOVA Recall – Revised
Means with the same letter are not significantly different.
Waller Grouping   Mean   N   runid
A                             ch1_test
A                             mg_manua
A                             ch1_newc
A                             ch2_run2
A                             ch2_run1
A                             mg_t5_al
A                             ch2_run3
A                             mg_t5_re
A                             ch1_cont
A                             ch2_run5
A                             ch2_run4
B                             ch1_relf

SLIDE 30: ANOVA Results: Avg Precision
Waller-Duncan K-ratio t Test for avgprec
NOTE: This test minimizes the Bayes risk under additive loss and certain other assumptions.
  Kratio: 100
  Error Degrees of Freedom: 572
  Error Mean Square:
  F Value: 0.78
  Critical Value of t:
  Minimum Significant Difference:
  Harmonic Mean of Cell Sizes:
NOTE: Cell sizes are not equal.

SLIDE 31: ANOVA Results: Avg Precision
Means with the same letter are not significantly different.
Waller Grouping   Mean   N   runid
A                             ch2_run1
A                             ch2_run3
A                             ch1_test
A                             ch1_newc
A                             mg_manua
A                             ch2_run5
A                             ch1_cont
A                             ch2_run4
A                             ch2_run2
A                             mg_t5_re
A                             mg_t5_al
A                             ch1_relf

SLIDE 32: ANOVA – Avg Precision Revised
Means with the same letter are not significantly different.
Waller Grouping   Mean   N   runid
A                             ch2_run1
A                             ch2_run3
A                             ch1_test
A                             ch1_newc
A                             mg_manua
A                             ch2_run5
A                             ch1_cont
A                             ch2_run4
A                             ch2_run2
A                             mg_t5_re
A                             mg_t5_al
A                             ch1_relf

SLIDE 33: What to Evaluate?
Effectiveness
–Difficult to measure
–Recall and Precision are only one way
–What might be others?

SLIDE 34: Other Ways of Evaluating
"The primary function of a retrieval system is conceived to be that of saving its users to as great an extent as possible, the labor of perusing and discarding irrelevant documents, in their search for relevant ones."
– William S. Cooper (1968), "Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems," American Documentation, 19(1).

SLIDE 35: Other Ways of Evaluating
If the purpose of a retrieval system is to rank the documents in descending order of their probability of relevance for the user, then maybe the sequence itself is important and can be used as a way of evaluating systems. How do we do that?

SLIDE 36: Query Types
Only one relevant document is wanted
Some arbitrary number n is wanted
All relevant documents are wanted
Some proportion of the relevant documents is wanted
No documents are wanted? (special case)

SLIDE 37: Search Length and Expected Search Length
Work by William Cooper in the late 1960s
Issues with IR measures:
–Usually not a single measure
–Assume "retrieved" and "not retrieved" sets without considering more than two classes
–No built-in way to compare to purely random retrieval
–Don't take into account how much relevant material the user actually needs (or wants)

SLIDE 38: Weak Ordering in IR Systems
The assumption that there are just two sets, "Retrieved" and "Not Retrieved", is not really accurate. IR systems usually rank documents into many sets of equal retrieval weight. Consider coordinate-level ranking…

SLIDE 39: Weak Ordering

SLIDE 40: Search Length
Rank:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Relevant: n  y  n  y  y  y  y  n  y  n  n  n  n  y  n  y  n  n  n  n
Search Length = the number of NON-RELEVANT documents that a user must examine before finding the number of relevant documents that they want (n).
If n = 2, the search length is 2.
If n = 6, the search length is 3.
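A small sketch (not from the slides) that computes search length for this strongly ordered case; the relevance string is the one shown above:

# Relevance of each rank in the example (y = relevant, n = not relevant).
ranking = "nynyyyynynnnnynynnnn"

def search_length(ranking, n):
    # Number of non-relevant documents examined before the n-th
    # relevant document is found.
    found = skipped = 0
    for flag in ranking:
        if flag == "y":
            found += 1
            if found == n:
                return skipped
        else:
            skipped += 1
    return None   # fewer than n relevant documents in the ranking

print(search_length(ranking, 2))   # -> 2
print(search_length(ranking, 6))   # -> 3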

SLIDE 41: Weak Ordering Search Length
Rank:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Relevant: n  n  y  y  n  y  y  y  n  y  y  n  n  n  n  n  n  n  y  n
If we assume that the order within ranks (levels of equal weight) is random…
If n = 6 then we must go to level 3 of the ranking, but the POSSIBLE search lengths are 3, 4, 5, or 6.
To compute the Expected Search Length we need to know the probability of each possible search length. To get this we need to consider the number of different ways in which documents may be distributed among the ranks…

SLIDE 42: Expected Search Length
[figure: the weakly ordered ranking from the previous slide, with relevance pattern n n y y n y y y n y y n n n n n n n y n by rank]

SLIDE 43: Expected Search Length
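Cooper's measure can be stated compactly. In one common notation, if j is the number of non-relevant documents in all of the levels that must be consulted in their entirety, r and i are the numbers of relevant and non-relevant documents in the final level consulted, and s is the number of relevant documents still required from that final level, then (assuming random order within a level)

ESL = j + \frac{s \cdot i}{r + 1}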

SLIDE 44: Expected Search Length
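A small sketch of this computation in Python (not from the slides). The level structure below is hypothetical, chosen only to be consistent with the n = 6 example on SLIDE 41: three non-relevant documents in the fully consulted levels, and a final level holding 2 relevant and 3 non-relevant documents, so the possible search lengths are 3 through 6:

from fractions import Fraction

def expected_search_length(levels, wanted):
    # Cooper's expected search length for a weakly ordered output.
    # levels: list of (relevant, nonrelevant) counts, one pair per level,
    # in rank order.  wanted: number of relevant documents the user needs.
    esl = Fraction(0)
    for rel, nonrel in levels:
        if wanted > rel:
            # Whole level is consulted; all its non-relevant docs are examined.
            esl += nonrel
            wanted -= rel
        else:
            # Final level: expected non-relevant docs seen before the
            # wanted-th relevant one is  wanted * nonrel / (rel + 1).
            esl += Fraction(wanted * nonrel, rel + 1)
            return esl
    return None   # not enough relevant documents in the whole ranking

# Hypothetical levels for the 20-document example (8 relevant in total).
levels = [(0, 2), (5, 1), (2, 3), (1, 6)]
print(expected_search_length(levels, 6))   # -> 4 (possible lengths were 3..6)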