Evaluation David Kauchak cs458 Fall 2012 adapted from:

Administrative. Assignment 2: great job getting ahead! HW 3 will be out soon and due next Thursday.

IR Evaluation For hw1, you examined 5 systems. How did you evaluate the systems/queries? What are important features for an IR system? How might we automatically evaluate the performance of a system? Compare two systems? What data might be useful?

Measures for a search engine: how fast does it index (how frequently can we update the index); how fast does it search; how big the index is; the expressiveness of the query language; the UI; is it free; and the quality of the search results.

Measuring user performance: who is the user we are trying to make happy, and how can we measure this? Web search engine: the user finds what they want and returns to the engine; measure the rate of returning users. Financial drivers, e.g., an eCommerce site: the user finds what they want and makes a purchase. Is it the end-user, or the eCommerce site, whose happiness we measure? Measures: time to purchase, fraction of searchers who become buyers, revenue, profit, … Enterprise (company/govt/academic): we care about "user productivity": how much time do my users save when looking for information?

Common IR evaluation. The most common proxy: relevance of search results. Relevance is assessed relative to the information need, not the query. Information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine." Query: wine red white heart attack effective. You evaluate whether the doc addresses the information need, NOT whether it has these words.

Data for evaluation: a collection of documents, a set of test queries, and an IR system that returns results for each query. What do we want to know about these results? For each result, whether it is relevant vs. non-relevant. What if we want to test another system? 10 more systems?

Data for evaluation, option 1: for each query, identify ALL the relevant (and non-relevant) documents. Given a new system, we then know whether each result it retrieves is relevant or not. Problems? Ideas?

Data for evaluation, option 2: in many domains, finding ALL relevant documents is infeasible (think the web). Instead, evaluate a few sets of results from a few systems, and assume these are all the relevant documents.

How can we quantify the results? We want a numerical score to quantify how well our system is doing, which allows us to compare systems. To start with, let's just talk about boolean retrieval.

Accuracy? The search engine divides ALL of the documents into two sets: relevant and non-relevant The accuracy of a search engine is the proportion of these that it got right Accuracy is a commonly used evaluation measure in machine learning classification Is this a good approach for IR?

Accuracy? How to build a (nearly) 100% accurate search engine on a low budget: return nothing for every query ("Search for: … 0 matching results found"). Since the vast majority of documents are non-relevant, such an engine is almost always right. But people doing information retrieval want to find something and have a certain tolerance for junk.

Unranked retrieval evaluation: Precision and Recall. Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved). Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant).
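As a concrete illustration, here is a minimal sketch of these two definitions, assuming we have the set of retrieved doc IDs and the set of judged-relevant doc IDs (the IDs and counts below are made up):

```python
def precision_recall(retrieved, relevant):
    """Unranked precision and recall from two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant docs that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# made-up judgments: 4 docs retrieved, 5 docs judged relevant
p, r = precision_recall(retrieved={1, 2, 5, 9}, relevant={2, 5, 7, 9, 11})
print(p, r)   # 0.75 0.6
```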

Precision/Recall tradeoff Often a trade-off between better precision and better recall How can we increase recall? Increase the number of documents retrieved (for example, return all documents) What impact will this likely have on precision? Generally, retrieving more documents will result in a decrease in precision

A combined measure: F. A combined measure that assesses the precision/recall tradeoff is the F measure, a weighted harmonic mean: F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α. People usually use the balanced F1 measure, i.e., with β = 1 (α = ½). The harmonic mean is a conservative average.
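A small sketch of this weighted F measure (β = 1 gives the balanced F1); the precision and recall values are the made-up ones from the earlier example:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 is balanced F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.75, 0.6))           # F1, about 0.667
print(f_measure(0.75, 0.6, beta=3))   # F3 weights recall more heavily, about 0.61
```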

F1 and other averages

Evaluating ranked results: most IR systems are ranked systems, and we want to evaluate them based on their ranking of the documents. What might we do? With a ranked system, we can look at the precision/recall for the top K results; plotting this over K gives us the precision-recall curve.

A precision-recall curve

Which system is better? (two precision-recall curves, with recall on the x-axis and precision on the y-axis)

Evaluation: graphs are good, but people want summary measures! Precision at a fixed retrieval level. Precision-at-k: precision of the top k results. Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages. But it averages badly and has an arbitrary parameter k. Any way to capture more of the graph? 11-point average precision: take the precision at 11 recall levels, varying from 0 to 1 by tenths (using interpolation), and average them. This evaluates performance at all recall levels (which may be good or bad).
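A sketch of both summary measures, assuming the ranking is given as a list of booleans (True = relevant at that rank) and we know the total number of relevant documents for the query. There are several interpolation conventions for the 11-point measure; this sketch uses the common "maximum precision at any recall at or above the level" rule:

```python
def precision_at_k(ranking, k):
    """Fraction of the top k results that are relevant."""
    return sum(ranking[:k]) / k

def eleven_point_average_precision(ranking, num_relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0, averaged."""
    points = []          # (recall, precision) after each rank position
    hits = 0
    for i, rel in enumerate(ranking, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / i))
    interpolated = []
    for level in (lvl / 10 for lvl in range(11)):
        # interpolated precision: max precision at any recall >= level (0 if none)
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / 11

# hypothetical ranking: relevant docs at ranks 1, 4, 5; 3 relevant docs total
ranking = [True, False, False, True, True, False]
print(precision_at_k(ranking, 5))                   # 0.6
print(eleven_point_average_precision(ranking, 3))
```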

Typical (good) 11 point precisions SabIR/Cornell 8A1 11pt precision from TREC 8 (1999)

11-point is somewhat arbitrary… What are we really interested in? How high up the relevant results are. How might we measure this? Average position in the list? Any issue with this? It is query dependent: if there are more relevant documents, the average position will be higher (worse). Mean average precision (MAP): the average of the precision values obtained for the top k documents, each time a relevant doc is retrieved.

MAP: the average of the precision value obtained each time a relevant doc is retrieved, over all relevant documents. If a relevant document is not retrieved, it is given a precision of 0 in the average. Worked example from the slides: with relevant documents at ranks 1, 4, 5, and 7, the precision at k for each of those ranks is 1/1, 2/4, 3/5, and 4/7, and the average precision for the query is the average of these four values.
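A sketch of average precision and MAP under this definition; the example ranking reproduces the slide's 1/1, 2/4, 3/5, 4/7 case:

```python
def average_precision(ranking, num_relevant):
    """Mean of precision@k at each rank k where a relevant doc appears.
    Relevant documents that are never retrieved contribute 0 to the average."""
    hits = 0
    precisions = []
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / num_relevant if num_relevant else 0.0

def mean_average_precision(per_query):
    """MAP: average of per-query average precision over (ranking, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)

# the slide's example: relevant docs at ranks 1, 4, 5, 7, and 4 relevant docs in total
ranking = [True, False, False, True, True, False, True]
print(average_precision(ranking, num_relevant=4))   # (1/1 + 2/4 + 3/5 + 4/7) / 4, about 0.668
```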

Other issues: human evaluations. Humans are not perfect or consistent, so we often want multiple people to evaluate the results.

Number of docs   Judge 1        Judge 2
300              Relevant       Relevant
 70              Nonrelevant    Nonrelevant
 20              Relevant       Nonrelevant
 10              Nonrelevant    Relevant

Multiple human labelers: can we trust the data? How do we use multiple judges?

Number of docs   Judge 1        Judge 2
300              Relevant       Relevant
 70              Nonrelevant    Nonrelevant
 20              Relevant       Nonrelevant
 10              Nonrelevant    Relevant

Number of docs   Judge 1        Judge 2
100              Relevant       Relevant
 30              Nonrelevant    Nonrelevant
200              Relevant       Nonrelevant
 70              Nonrelevant    Relevant

Measuring inter-judge agreement: the simple agreement rate is the fraction of documents the judges label the same way. For the first table above, agreement is 370/400 = 92.5%; for the second, 130/400 = 32.5%. Is there any problem with this?

Measuring inter-judge (dis)agreement Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ] P(A) – proportion of time judges agree P(E) – what agreement would be by chance Kappa = -1 for total disagreement, 0 for chance agreement, 1 for total agreement Kappa above 0.7 is usually considered good enough
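A sketch of the kappa computation for a 2x2 table of binary judgments like the ones above; the two calls use the counts from the two tables:

```python
def kappa(both_rel, both_nonrel, j1rel_j2non, j1non_j2rel):
    """Kappa for two judges making binary relevance judgments."""
    total = both_rel + both_nonrel + j1rel_j2non + j1non_j2rel
    p_agree = (both_rel + both_nonrel) / total
    # marginal probability of each judge saying "relevant"
    j1_rel = (both_rel + j1rel_j2non) / total
    j2_rel = (both_rel + j1non_j2rel) / total
    p_chance = j1_rel * j2_rel + (1 - j1_rel) * (1 - j2_rel)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(300, 70, 20, 10))    # about 0.78: usually considered good enough
print(kappa(100, 30, 200, 70))   # negative: agreement is worse than chance
```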

Other issues: pure relevance Why does Google do this?

Other issues: pure relevance Relevance vs Marginal Relevance A document can be redundant even if it is highly relevant Duplicates The same information from different sources Marginal relevance is a better measure of utility for the user Measuring marginal relevance can be challenging, but search engines still attempt to tackle the problem

Evaluation at large search engines Search engines have test collections of queries and hand-ranked results Search engines also use non-relevance-based measures Ideas? Clickthrough on first result Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate. Studies of user behavior in the lab A/B testing

A/B Testing: Google has a new font and wants to test the two variants (old font vs. new font) to see what the impact is. How can they do it?

A/B testing Have most users use old system Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation Evaluate with an “automatic” measure like clickthrough on first result Now we can directly see if the innovation does improve user happiness
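As an illustration of how the clickthrough comparison might be analyzed, here is a sketch of a two-proportion z-test on the first-result clickthrough rates; the 1% split, traffic volumes, and click counts below are all made up:

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """z statistic and two-sided p-value for a difference in clickthrough rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# hypothetical numbers: 99% of traffic on the old system, 1% diverted to the new font
z, p = two_proportion_z_test(clicks_a=4_200, n_a=990_000,   # old system
                             clicks_b=55, n_b=10_000)       # new variant
print(z, p)
```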

Guest speaker today Ron Kohavi