
1 Information Retrieval, Lecture 3
Introduction to Information Retrieval (Manning et al. 2007), Chapter 8
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London

2 Evaluating an IR System
- Expressiveness: the ability of the query language to express complex information needs, e.g., Boolean operators, wildcards, phrases, proximity, etc.
- Efficiency: How fast does it index? How large is the index? How fast does it search?
- Effectiveness (the key measure): How effectively does it find relevant documents? Is this search engine good? Which search engine is better?

3 Relevance
How do we quantify relevance?
- A benchmark set of docs (a corpus)
- A benchmark set of queries
- A binary assessment for each query-doc pair: either relevant or irrelevant

4 Relevance
Relevance should be evaluated according to the information need (which is translated into a query).
- [information need] I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- [query] wine red white heart attack effective
- We judge whether the document addresses the information need, not whether it merely contains those words.

5 Benchmarks
Common test corpora:
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years
- Reuters: Reuters-21578, RCV1
- 20 Newsgroups
- ...
Relevance judgements are given by human experts.

6 TREC
The TREC Ad Hoc tasks from the first 8 TRECs are standard IR tasks:
- 50 detailed information needs per year
- Human evaluation of pooled results returned
- More recently, other related tracks: QA, Web, Genomics, etc.

7 TREC
A query (topic) from TREC-5:
Number: 225
Description: What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?

8 Ranking-Ignorant Measures
The IR system returns a certain number of documents, and the retrieved documents are regarded as a set. Retrieval can then be viewed as classification: each doc is classified/predicted to be either 'relevant' or 'irrelevant'.

9 Contingency Table

                 Relevant    Not Relevant
  Retrieved         tp            fp
  Not Retrieved     fn            tn

p = positive; n = negative; t = true; f = false.
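As an illustrative sketch (not part of the original slides), the four counts can be computed from sets of retrieved and relevant doc IDs; the corpus and doc IDs below are made-up assumptions.

```python
# Illustrative sketch: contingency counts for one query.
# The corpus and doc IDs are hypothetical.
def contingency(retrieved, relevant, corpus):
    tp = len(retrieved & relevant)             # relevant docs that were retrieved
    fp = len(retrieved - relevant)             # irrelevant docs that were retrieved
    fn = len(relevant - retrieved)             # relevant docs that were missed
    tn = len(corpus - retrieved - relevant)    # irrelevant docs correctly left out
    return tp, fp, fn, tn

corpus = {f"d{i}" for i in range(1, 11)}       # ten hypothetical documents
retrieved = {"d1", "d2", "d3"}
relevant = {"d2", "d3", "d7"}
print(contingency(retrieved, relevant, corpus))  # (2, 1, 1, 6)
```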

10 Accuracy
Accuracy = (tp+tn) / (tp+fp+tn+fn)
- The fraction of correct classifications.
- Not a very useful evaluation measure in IR. Why?
  - Accuracy puts equal weight on relevant and irrelevant documents.
  - The number of relevant documents is usually very small compared to the total number of documents.
  - People doing information retrieval want to find something and have a certain tolerance for junk.

11 Accuracy
Search for: [any query]
0 matching results found.
This Web search engine returns 0 matching results for all queries.
How much time do you need to build it? 1 minute!
How accurate is it? 99.9999%
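The 99.9999% figure follows from the arithmetic below, under the assumption (not stated on the slide) of a corpus of 1,000,000 documents with a single relevant one.

```python
# Worked arithmetic with assumed numbers: 1,000,000 documents, only 1 relevant.
# The "return nothing" engine gets tp = 0, fp = 0, fn = 1, tn = 999,999.
tp, fp, fn, tn = 0, 0, 1, 999_999
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"{accuracy:.4%}")   # 99.9999% accurate, yet it retrieves nothing useful
```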

12 Precision and Recall
Precision P = tp/(tp+fp)
- The fraction of retrieved docs that are relevant.
- Pr[relevant | retrieved]
Recall R = tp/(tp+fn)
- The fraction of relevant docs that are retrieved.
- Pr[retrieved | relevant]
- Recall is a non-decreasing function of the number of docs retrieved. You can get perfect recall (but low precision) by retrieving all docs for all queries!
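A minimal sketch of the two measures, reusing the counts from the hypothetical contingency example above (tp = 2, fp = 1, fn = 1):

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0   # fraction of retrieved docs that are relevant

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0   # fraction of relevant docs that are retrieved

tp, fp, fn = 2, 1, 1
print(precision(tp, fp), recall(tp, fn))          # 0.666... 0.666...
```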

13 Precision and Recall
Precision/Recall Tradeoff
- In a good IR system, precision decreases as recall increases, and vice versa.

14 F measure
F_β: the weighted harmonic mean of P and R
- A combined measure that assesses the precision/recall tradeoff.
- The harmonic mean is a conservative average.
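The formula on this slide did not survive the transcript; as a reconstruction of the standard definition (consistent with Manning et al., Chapter 8):

```latex
F_\beta \;=\; \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}
       \;=\; \frac{(\beta^{2}+1)\,P\,R}{\beta^{2}P + R},
\qquad \beta^{2} = \frac{1-\alpha}{\alpha}
```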

15 F measure F 1 : balanced F measure (with  = 1 or  = ½ )  Most popular IR evaluation measure

16 F_1 and other averages

17 F measure – Exercise
(Figure: an IR result for query q over documents d1..d5, marking which docs are retrieved and which are relevant or irrelevant.)
F_1 = ?

18 Ranking-Aware Measures
The IR system ranks all docs in decreasing order of their relevance to the query. Returning different numbers of top-ranked docs leads to different recalls (and, accordingly, different precisions).

19 Precision-Recall Curve

20 The interpolated precision at a recall level R
- The highest precision found for any recall level at or above R.
- Removes the jiggles in the precision-recall curve.
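A sketch of interpolated precision over hypothetical (recall, precision) points; the points below are made up for illustration.

```python
# Interpolated precision: the highest precision at any recall level at or above R.
def interpolated_precision(pr_points, r_level):
    candidates = [p for r, p in pr_points if r >= r_level]
    return max(candidates) if candidates else 0.0

pr_points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.38)]
print(interpolated_precision(pr_points, 0.5))   # 0.5
```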

21 11-Point Interpolated Average Precision
- For each information need, the interpolated precision is measured at 11 recall levels: 0.0, 0.1, 0.2, ..., 1.0.
- The measured interpolated precisions are then averaged (arithmetic mean) over the set of queries in the benchmark.
- A composite precision-recall curve showing 11 points can be graphed.
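A sketch of the 11-point measure: interpolate at recall 0.0, 0.1, ..., 1.0 for each query, then take the arithmetic mean across queries at each level. The per-query points are hypothetical.

```python
def eleven_point_avg(per_query_pr):
    levels = [i / 10 for i in range(11)]       # recall levels 0.0, 0.1, ..., 1.0
    return [
        sum(max((p for r, p in pr if r >= level), default=0.0) for pr in per_query_pr)
        / len(per_query_pr)
        for level in levels
    ]

# A single hypothetical query's precision-recall points.
per_query_pr = [[(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.38)]]
print(eleven_point_avg(per_query_pr))
```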

22 11-Point Interpolated Average Precision
(Figure: the curve of a representative (good) TREC system.)

23 Mean Average Precision (MAP)
- For one information need, average precision is the average of the precision values obtained for the top k docs each time a relevant doc is retrieved.
  - No use of fixed recall levels. No interpolation.
  - When a relevant doc is not retrieved at all, its precision value is taken to be 0.
- The MAP value for a test collection is then the arithmetic mean of the average precision values for the individual information needs.
  - Macro-averaging: each query counts equally.
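A sketch of average precision for one ranked list and MAP across queries; the ranked list and relevance judgements below are hypothetical.

```python
def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)        # precision at the rank of each relevant doc
    # relevant docs that are never retrieved contribute 0 to the average
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):              # runs: list of (ranking, relevant) pairs
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical query: relevant docs d2 and d5 appear at ranks 2 and 4.
print(average_precision(["d1", "d2", "d3", "d5"], {"d2", "d5"}))   # (1/2 + 2/4) / 2 = 0.5
```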

24 Precision/Recall at k
Prec@k: precision on the top k retrieved docs.
- Appropriate for Web search engines: most users scan only the first few (e.g., 10) hyperlinks that are presented.
Rec@k: recall on the top k retrieved docs.
- Appropriate for archival retrieval systems: what fraction of the total number of relevant docs did a user find after scanning the first (say 100) docs?
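A sketch of both cut-off measures on the same hypothetical ranked list used above:

```python
def precision_at_k(ranking, relevant, k):
    return sum(doc in relevant for doc in ranking[:k]) / k

def recall_at_k(ranking, relevant, k):
    return sum(doc in relevant for doc in ranking[:k]) / len(relevant)

ranking, relevant = ["d1", "d2", "d3", "d5"], {"d2", "d5"}
print(precision_at_k(ranking, relevant, 3))   # 1/3
print(recall_at_k(ranking, relevant, 3))      # 1/2
```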

25 R-Precision
Precision on the top Rel retrieved docs, where Rel is the size of the set of relevant documents (though perhaps incomplete).
- A perfect IR system could score 1 on this metric for each query.
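A sketch: R-precision is simply precision at a cut-off equal to the number of relevant docs (hypothetical data again).

```python
def r_precision(ranking, relevant):
    k = len(relevant)                           # cut off at the number of relevant docs
    return sum(doc in relevant for doc in ranking[:k]) / k if k else 0.0

print(r_precision(["d1", "d2", "d3", "d5"], {"d2", "d5"}))   # top 2 holds one relevant doc -> 0.5
```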

26 PRBEP
Given a precision-recall curve, the Precision/Recall Break-Even Point (PRBEP) is the value at which precision equals recall.
- From the definitions of precision and recall, equality is achieved for contingency tables with tp+fp = tp+fn, i.e., when the number of retrieved docs equals the number of relevant docs.

27 ROC Curve
An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity).
- true positive rate = sensitivity = recall = tp/(tp+fn)
- false positive rate = fp/(fp+tn) = 1 - specificity
- specificity = tn/(fp+tn)
- The area under the ROC curve (AUC) can be used as a single summary figure.
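A sketch of the two rates an ROC curve plots, computed from the hypothetical contingency counts used earlier:

```python
def true_positive_rate(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0    # sensitivity = recall

def false_positive_rate(fp, tn):
    return fp / (fp + tn) if (fp + tn) else 0.0    # 1 - specificity

tp, fp, fn, tn = 2, 1, 1, 6
print(true_positive_rate(tp, fn), false_positive_rate(fp, tn))   # 0.666... 0.1428...
```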

28 ROC Curve

29 Variance in Performance
It is normally the case that the variance in performance of the same system across different queries is much greater than the variance in performance of different systems on the same query.
- For a given test collection, an IR system may perform terribly on some information needs (e.g., MAP = 0.1) but excellently on others (e.g., MAP = 0.7).
- There are easy information needs and hard ones!

30 Take Home Messages
Evaluation of effectiveness based on relevance:
- Ranking-ignorant measures: accuracy; precision & recall; F measure (especially F_1)
- Ranking-aware measures: precision-recall curve; 11-point interpolated average precision; MAP; precision/recall at k; R-precision; PRBEP; ROC curve

