ISP 433/633 Week 6 IR Evaluation

Why Evaluate? Determine if the system is desirable Make comparative assessments

What to Evaluate? How much of the information need is satisfied. How much was learned about a topic. Incidental learning: –How much was learned about the collection. –How much was learned about other topics. How inviting the system is.

Relevance In what ways can a document be relevant to a query? –Answer precise question precisely. –Partially answer question. –Suggest a source for more information. –Give background information. –Remind the user of other knowledge. –Others...

Relevance How relevant is the document –for this user, for this information need. Subjective, but measurable to some extent –How often do people agree a document is relevant to a query? How well does it answer the question? –Complete answer? Partial? –Background information? –Hints for further exploration?

Get me stuff I want Two considerations: Accuracy: Did it give me only stuff I wanted from the collection, or did it give me junk too? Comprehensiveness: Did it give me all the stuff I wanted from the collection, or did it miss some?

Relevant vs. Retrieved (Venn diagram: the Retrieved set and the Relevant set overlap within All docs)

After Query…
                     Relevant: Yes      Relevant: No
Retrieved: Yes       Hits               False Alarms
Retrieved: No        Misses             Correct Rejections

Precision vs. Recall Accuracy: Did it give me only stuff I wanted from the collection, or did it give me junk too? –Hits vs. False Alarms –Standard measure: Precision = |Hits| / |Retrieved| Comprehensiveness: Did it give me all the stuff I wanted from the collection, or did it miss some? –Hits vs. Misses –Standard measure: Recall = |Hits| / |Relevant| (Same contingency table as above: Hits, False Alarms, Misses, Correct Rejections.)
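A minimal sketch (not from the original slides) of how these set-based definitions translate into code; the document identifiers in the example are hypothetical.

# Precision and recall from the retrieved and relevant sets:
# a "hit" is a document that is both retrieved and relevant.
def precision_recall(retrieved, relevant):
    hits = set(retrieved) & set(relevant)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example with hypothetical document ids:
retrieved = ["d123", "d84", "d56", "d6", "d8"]
relevant = ["d123", "d56", "d9", "d25", "d3"]
p, r = precision_recall(retrieved, relevant)  # p = 2/5, r = 2/5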

Retrieved vs. Relevant (Venn diagram): very high precision, very low recall

Retrieved vs. Relevant (Venn diagram): very low precision, very low recall (0 in fact)

Retrieved vs. Relevant (Venn diagram): high recall, but low precision

Retrieved vs. Relevant (Venn diagram): high precision, high recall

Precision/Recall Curves There is a tradeoff between Precision and Recall, so measure Precision at different levels of Recall. Note: this is an AVERAGE over MANY queries. (Plot: precision on the y-axis against recall on the x-axis.)

After a single Query Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}: 10 relevant documents. Ranking (relevant documents marked *): 1. d123*, 2. d84, 3. d56*, 4. d6, 5. d8, 6. d9*, 7.-9. (not relevant), 10. d25*, 11.-14. (not relevant), 15. d3*. The first ranked doc is relevant, which is 10% of the total relevant. Therefore Precision at the 10% Recall level is 100%. The next relevant gives us 66% Precision at the 20% Recall level…
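A small sketch (not from the original slides) of the calculation just described: walking down a ranked list and recording a (recall, precision) pair each time a relevant document appears.

# (recall, precision) at each rank where a relevant document appears.
def recall_precision_points(ranking, relevant):
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

# For the ranking above, the first point is (0.10, 1.00) and the
# second is (0.20, 0.667), matching the 100% and 66% figures quoted.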

Graph for a Single Query (Plot: precision on the y-axis against recall on the x-axis for the ranking above.)

Averaging Multiple Queries
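The averaging formula itself appeared on the slide as an image and is not in the transcript; the standard form (as in Baeza-Yates and Ribeiro-Neto) averages the precision at each recall level over all queries:

\bar{P}(r) = \frac{1}{N_q} \sum_{i=1}^{N_q} P_i(r)

where N_q is the number of queries and P_i(r) is the precision at recall level r for the i-th query.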

11 standard recall levels: 0%, 10%, 20%, …, 100%. May need interpolation.

Interpolation Rq = {d3, d56, d129}: 3 relevant documents. Using the same ranking as before, the relevant documents now fall at ranks 3 (d56), 8 (d129), and 15 (d3). The first relevant doc is d56 at rank 3, which gives recall of 33.3% at precision of 33.3%. The next relevant (d129) at rank 8 gives us 66.7% recall at 25% precision. The next (d3) at rank 15 gives us 100% recall with 20% precision. How do we figure out the precision at the 11 standard recall levels?

Interpolation
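The interpolation rule was shown on the slide as an image; the standard definition (used at TREC) takes, at each standard recall level r_j, the maximum precision observed at any recall point at or above it:

P_{interp}(r_j) = \max_{r \geq r_j} P(r)

This is why interpolated precision never increases as recall grows.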

So, at recall levels 0%, 10%, 20%, and 30% the interpolated precision is 33.3%. At recall levels 40%, 50%, and 60% the interpolated precision is 25%. And at recall levels 70%, 80%, 90%, and 100%, the interpolated precision is 20%. Giving the graph…
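A brief sketch of the 11-point interpolation just worked through (not from the original slides); it assumes the (recall, precision) points produced by the recall_precision_points helper sketched earlier.

# Interpolated precision at the 11 standard recall levels.
def interpolated_precision(points, levels=None):
    # points: list of (recall, precision) pairs for one query
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interp = []
    for level in levels:
        # maximum precision at any recall >= this standard level
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

# points = [(1/3, 1/3), (2/3, 0.25), (1.0, 0.20)] reproduces the values
# above: 0.333 at levels 0-0.3, 0.25 at 0.4-0.6, 0.20 at 0.7-1.0.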

Interpolated precision/recall (Plot: interpolated precision on the y-axis against recall on the x-axis.)

Document Cutoff Levels Another way to evaluate: –Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500 –Measure precision at each of these levels –Take a (weighted) average over results This is a way to focus on how well the system ranks the first k documents.
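A minimal sketch (not from the original slides) of precision at a fixed document cutoff, often written precision@k:

# Fraction of the top-k retrieved documents that are relevant.
def precision_at_k(ranking, relevant, k):
    top_k = ranking[:k]
    hits = sum(1 for doc in top_k if doc in set(relevant))
    return hits / k

# Averaging precision_at_k over many queries for k in
# (5, 10, 20, 50, 100, 500) gives the cutoff-level comparison above.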

Problem with Precision/Recall Can’t know true recall value –except in small collections Precision/Recall are related –A combined measure sometimes more appropriate Assumes batch mode –Interactive IR is important and has different criteria for successful searches Assumes a strict rank ordering matters.

Precision/Recall Curves Difficult to determine which of these two hypothetical results is better: (Plot: two precision/recall curves, one for each hypothetical result set.)

The E-Measure Combines Precision and Recall into one number (van Rijsbergen, 1979). P = precision, R = recall, b = measure of the relative importance of P or R. For example, b = 0.5 means the user is twice as interested in precision as in recall.
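The formula itself appeared on the slide as an image; van Rijsbergen's E-measure is usually written as

E = 1 - \frac{(1 + b^2) P R}{b^2 P + R}

so that small b emphasizes precision and large b emphasizes recall; F = 1 - E is the familiar F-measure.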

User-Oriented Measures

Reference Collections Need consistent benchmarks for evaluation TREC - Text REtrieval Conference/Competition –Run by NIST (National Institute of Standards & Technology) –Held each year since 1992 –Queries + Relevance Judgments Queries devised and judged by “Information Specialists” Relevance judgments done only for those documents retrieved -- not the entire collection! Results judged on precision and recall, going up to a recall level of 1000 documents

Sample TREC queries (topics) Number: 168 Topic: Financing AMTRAK Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK). Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

TREC Benefits: –made research systems scale to large collections (pre-WWW) –allows for somewhat controlled comparisons Drawbacks: –emphasis on high recall, which may be unrealistic for what most users want –very long queries, also unrealistic –comparisons still difficult to make, because systems are quite different on many dimensions –focus on batch ranking rather than interaction (there is an interactive track)

Some Observations Relatively low recall is sometimes acceptable –In a study, the system retrieved less than 20% of the relevant documents for a particular information need; users thought they had 75%, and were satisfied. Users can’t foresee the exact words and phrases that will indicate relevant documents –“accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” … –differing technical terminology –slang, misspellings