Information Retrieval, Lecture 3
Introduction to Information Retrieval (Manning et al. 2007), Chapter 8
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London

Evaluating an IR System
- Expressiveness: the ability of the query language to express complex information needs, e.g., Boolean operators, wildcards, phrases, proximity, etc.
- Efficiency: How fast does it index? How large is the index? How fast does it search?
- Effectiveness (the key measure): How effectively does it find relevant documents?
Is this search engine good? Which search engine is better?

Relevance
How do we quantify relevance? We need:
- A benchmark set of docs (a corpus)
- A benchmark set of queries
- A binary assessment for each query-doc pair: either relevant or irrelevant

Relevance
Relevance should be evaluated according to the information need (which is translated into a query).
- [Information need] I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- [Query] wine red white heart attack effective
- We judge whether the document addresses the information need, not whether it merely contains those words.

Benchmarks
Common test corpora:
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
- Reuters: e.g., the Reuters RCV1 collection
- 20 Newsgroups
- ...
Relevance judgements are given by human experts.

TREC
The TREC ad hoc tasks from the first 8 TRECs are standard IR benchmarks:
- 50 detailed information needs per year
- Human evaluation of pooled results returned by the participating systems
More recently, other related tracks have been run: QA, Web, Genomics, etc.

TREC
A query from TREC-5:
- Number: 225
- Description: What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?

Ranking-Ignorant Measures
The IR system returns a certain number of documents, and the retrieved documents are regarded as a set. Retrieval can then be viewed as classification: each document is classified/predicted to be either 'relevant' or 'irrelevant'.

Contingency Table

                 Relevant    Not Relevant
Retrieved        tp          fp
Not Retrieved    fn          tn

(p = positive; n = negative; t = true; f = false)
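To make the four cells concrete, here is a minimal sketch (not from the lecture) that counts them from sets of hypothetical document IDs; the corpus and judgements below are invented for illustration:

```python
# Hypothetical example data: a tiny corpus, the docs retrieved for one query,
# and the docs judged relevant to that query.
corpus = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
retrieved = {"d1", "d2", "d3"}
relevant = {"d1", "d3", "d6"}

tp = len(retrieved & relevant)             # retrieved and relevant
fp = len(retrieved - relevant)             # retrieved but not relevant
fn = len(relevant - retrieved)             # relevant but not retrieved
tn = len(corpus - retrieved - relevant)    # neither retrieved nor relevant

print(tp, fp, fn, tn)  # 2 1 1 4
```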

Accuracy
Accuracy = (tp + tn) / (tp + fp + tn + fn)
- The fraction of classifications that are correct.
- Not a very useful evaluation measure in IR. Why? Accuracy puts equal weight on relevant and irrelevant documents, but the number of relevant documents is usually very small compared to the total number of documents, and people doing information retrieval want to find something and have a certain tolerance for junk.

Accuracy
"Search for: ... 0 matching results found."
Suppose a Web search engine returns 0 matching results for every query.
- How much time do you need to build it? One minute!
- What accuracy does it achieve? Nearly 100%, since almost all documents are irrelevant to any given query.
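To see why, here is a small worked example with invented numbers (a collection of 1,000,000 documents, only 100 of them relevant to the query); the "0 results" engine still scores very high accuracy:

```python
# The "0 matching results" engine retrieves nothing, so tp = fp = 0.
tp, fp = 0, 0
fn = 100                   # every relevant doc is missed
tn = 1_000_000 - 100       # every irrelevant doc is (correctly) not retrieved

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)            # 0.9999 -> 99.99% accurate, yet completely useless
```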

Precision and Recall
Precision: P = tp / (tp + fp)
- The fraction of retrieved docs that are relevant: Pr[relevant | retrieved]
Recall: R = tp / (tp + fn)
- The fraction of relevant docs that are retrieved: Pr[retrieved | relevant]
- Recall is a non-decreasing function of the number of docs retrieved: you can get perfect recall (but low precision) by retrieving all docs for all queries!
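As a sketch (the function names and example data below are my own, not from the lecture), both measures can be computed directly from sets of document IDs:

```python
def precision(retrieved, relevant):
    """Fraction of retrieved docs that are relevant: tp / (tp + fp)."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant docs that are retrieved: tp / (tp + fn)."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

retrieved = {"d1", "d2", "d3"}
relevant = {"d1", "d3", "d6"}
print(precision(retrieved, relevant))  # 2/3
print(recall(retrieved, relevant))     # 2/3
```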

Precision and Recall
The precision/recall trade-off: in a good IR system, precision typically decreases as recall increases, and vice versa.

F Measure
F_β: the weighted harmonic mean of P and R
- F_β = (β² + 1)PR / (β²P + R), equivalently F = 1 / (α/P + (1 − α)/R) with β² = (1 − α)/α
- A combined measure that assesses the precision/recall trade-off.
- The harmonic mean is a conservative average: it is dominated by the smaller of P and R.

F Measure
F_1: the balanced F measure (with β = 1, equivalently α = ½)
- F_1 = 2PR / (P + R)
- The most popular IR evaluation measure.
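A minimal sketch of F_β, following the harmonic-mean formula above; F_1 is the special case β = 1, and the precision/recall values passed in are illustrative only:

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r (F_beta)."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.6, 0.4))            # F_1 = 2*0.6*0.4 / (0.6+0.4) = 0.48
print(f_measure(0.6, 0.4, beta=2.0))  # beta > 1 weights recall more heavily
```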

F_1 and Other Averages
[Figure: F_1 (harmonic mean) compared with other ways of averaging precision and recall.]

F Measure: Exercise
[Figure: an IR result for query q over documents d1-d5, each marked as retrieved or not and as relevant or irrelevant. Compute F_1 for this result.]

Ranking-Aware Measures
The IR system ranks all docs in decreasing order of their estimated relevance to the query. Returning different numbers of the top-ranked docs leads to different recall levels (and, accordingly, different precisions).

Precision-Recall Curve
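As an illustration of how the points on such a curve arise, this sketch walks down a hypothetical ranked list (the ranking and judgements are invented) and records (recall, precision) after each rank:

```python
# Hypothetical ranked result list and relevance judgements for one query.
ranking = ["d3", "d7", "d1", "d9", "d6", "d2"]
relevant = {"d1", "d3", "d6"}

points, hits = [], 0
for k, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    points.append((hits / len(relevant), hits / k))  # (recall, precision) at rank k

print(points)
# roughly: [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6), (1.0, 0.5)]
```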

Interpolated Precision
The interpolated precision at a recall level R is the highest precision found at any recall level at or above R.
- Interpolation removes the jiggles in the precision-recall curve.
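A sketch of the interpolation rule, assuming the (recall, precision) points for a query have already been computed as in the sketch above; the data here are again invented:

```python
def interpolated_precision(points, r_level):
    """Highest precision observed at any recall level >= r_level.
    'points' is a list of (recall, precision) pairs for one query."""
    candidates = [p for r, p in points if r >= r_level]
    return max(candidates) if candidates else 0.0

points = [(1/3, 1.0), (1/3, 0.5), (2/3, 2/3), (2/3, 0.5), (1.0, 0.6), (1.0, 0.5)]
print(interpolated_precision(points, 0.5))  # 2/3: best precision at recall >= 0.5
```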

11-Point Interpolated Average Precision
- For each information need, the interpolated precision is measured at 11 recall levels: 0.0, 0.1, 0.2, ..., 1.0.
- The measured interpolated precisions are then averaged (arithmetic mean) over the set of queries in the benchmark.
- A composite precision-recall curve showing the 11 points can be graphed.
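Under the same assumptions (per-query lists of (recall, precision) points), a compact sketch of the 11-point measure: interpolate at recall levels 0.0, 0.1, ..., 1.0 for each query, average the 11 values, then average over queries:

```python
def eleven_point_avg(points_per_query):
    """points_per_query: one list of (recall, precision) pairs per query."""
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    per_query = []
    for points in points_per_query:
        interp = [max((p for r, p in points if r >= lvl), default=0.0)
                  for lvl in levels]      # interpolated precision at each level
        per_query.append(sum(interp) / len(levels))
    return sum(per_query) / len(per_query)  # arithmetic mean over queries

print(eleven_point_avg([[(1/3, 1.0), (2/3, 2/3), (1.0, 0.6)]]))  # ~0.76 for this toy query
```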

11-Point Interpolated Average Precision
[Figure: 11-point interpolated average precision curve for a representative (good) TREC system.]

Mean Average Precision (MAP)
- For one information need, the average precision is the mean of the precision values obtained for the top k docs each time a relevant doc is retrieved. No fixed recall levels and no interpolation are used.
- A relevant doc that is never retrieved contributes a precision value of 0.
- The MAP value for a test collection is the arithmetic mean of the average precision values of the individual information needs.
- This is macro-averaging: each query counts equally.
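A sketch of average precision for one ranked list and of MAP over several information needs; the rankings and judgements are hypothetical, and relevant documents that are never retrieved count as 0 because they stay in the denominator:

```python
def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k at which a relevant doc appears;
    dividing by |relevant| makes unretrieved relevant docs count as 0."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per information need."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(average_precision(["d3", "d7", "d1"], {"d1", "d3", "d6"}))  # (1 + 2/3) / 3 ≈ 0.56
```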

Precision/Recall at k
Precision at k (P@k): precision over the top k retrieved docs.
- Appropriate for Web search engines: most users scan only the first few (e.g., 10) results presented.
Recall at k (R@k): recall over the top k retrieved docs.
- Appropriate for archival retrieval systems: what fraction of the total number of relevant docs did a user find after scanning the first (say 100) docs?
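A brief sketch of both cutoff measures over a hypothetical ranked list (data invented):

```python
def precision_at_k(ranking, relevant, k):
    """Precision over the top k retrieved docs."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Recall over the top k retrieved docs."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

ranking = ["d3", "d7", "d1", "d9", "d6"]
relevant = {"d1", "d3", "d6"}
print(precision_at_k(ranking, relevant, 3))  # 2/3
print(recall_at_k(ranking, relevant, 3))     # 2/3
```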

R-Precision
Precision over the top Rel retrieved docs, where Rel is the size of the set of known relevant documents (though that set is perhaps incomplete).
- A perfect IR system could score 1 on this metric for every query.
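R-precision is simply precision at a cutoff equal to the number of known relevant documents; a short sketch in the same style as the previous examples:

```python
def r_precision(ranking, relevant):
    """Precision at rank |Rel|, where Rel is the set of known relevant docs."""
    rel_size = len(relevant)
    return sum(1 for d in ranking[:rel_size] if d in relevant) / rel_size

print(r_precision(["d3", "d7", "d1", "d9", "d6"], {"d1", "d3", "d6"}))  # 2/3
```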

PRBEP
Given a precision-recall curve, the Precision/Recall Break-Even Point (PRBEP) is the value at which precision equals recall.
- It follows from the definitions of precision and recall that they are equal exactly when tp + fp = tp + fn, i.e., when the number of retrieved docs equals the number of relevant docs.

ROC Curve
An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity).
- True positive rate = sensitivity = recall = tp / (tp + fn)
- False positive rate = fp / (fp + tn) = 1 − specificity, where specificity = tn / (fp + tn)
- The area under the ROC curve (AUC) is sometimes used as a single summary figure.

ROC Curve
[Figure: an example ROC curve.]
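A sketch of how ROC points could be traced down a ranked list; it assumes we know the total number of irrelevant documents in the collection (needed for the false positive rate), and all data are invented:

```python
# Hypothetical ranked list, relevance judgements, and collection statistics.
ranking = ["d3", "d7", "d1", "d9", "d6"]
relevant = {"d1", "d3", "d6"}
n_irrelevant = 7           # total number of irrelevant docs in the (tiny) collection

roc_points, tp, fp = [], 0, 0
for doc in ranking:
    if doc in relevant:
        tp += 1
    else:
        fp += 1
    tpr = tp / len(relevant)   # true positive rate = sensitivity = recall
    fpr = fp / n_irrelevant    # false positive rate = 1 - specificity
    roc_points.append((fpr, tpr))

print(roc_points)  # one (fpr, tpr) point per rank cutoff
```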

Variance in Performance
It is normally the case that the variance in performance of the same system across different queries is much greater than the variance in performance of different systems on the same query.
- For a given test collection, an IR system may perform terribly on some information needs (e.g., average precision 0.1) and excellently on others (e.g., 0.7).
- There are easy information needs and hard ones!

Take Home Messages
Evaluation of effectiveness is based on relevance.
- Ranking-ignorant measures: accuracy; precision and recall; the F measure (especially F_1)
- Ranking-aware measures: the precision-recall curve; 11-point interpolated average precision; MAP; precision/recall at k; R-precision; PRBEP; the ROC curve