Precision and Recall.


Information Retrieval Evaluation In the information retrieval (search engines) community, system evaluation revolves around the notion of relevant and not relevant documents. With respect to a given query, a document is given a binary classification as either relevant or not relevant. An information retrieval system can be thought of as a two-class classifier that attempts to label documents as such: it retrieves the subset of documents it believes to be relevant. To measure information retrieval effectiveness, we need: (1) a test collection of documents, (2) a benchmark suite of queries, and (3) a binary assessment of either relevant or not relevant for each query-document pair.
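As a rough sketch of what such a test collection might look like in code (the document IDs, texts, queries, and judgments below are invented purely for illustration):

```python
# A toy test collection: documents, a benchmark suite of queries, and
# binary relevance judgments ("qrels"). Everything here is made up.
documents = {
    "d1": "scoring and term weighting in the vector space model",
    "d2": "crawling the web by traversing hyperlinks",
    "d3": "evaluating search engines with precision and recall",
}

queries = {
    "q1": "search engine evaluation",
    "q2": "web crawling",
}

# qrels[query_id] is the set of document IDs judged relevant to that query.
qrels = {
    "q1": {"d3"},
    "q2": {"d2"},
}
```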

Precision and Recall Precision is the fraction of retrieved documents that are relevant: Precision = |relevant ∩ retrieved| / |retrieved|. Recall is the fraction of relevant documents that are retrieved: Recall = |relevant ∩ retrieved| / |relevant|.
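A minimal sketch of these two definitions in code, assuming the system output and the relevance judgments are available as plain Python collections (the function name precision_recall and the document IDs are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: iterable of document IDs returned by the system
    relevant:  set of document IDs judged relevant for the query
    """
    retrieved = set(retrieved)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 3 of the 4 retrieved documents are relevant (precision 0.75),
# but only 3 of the 6 relevant documents were found (recall 0.5).
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        {"d1", "d2", "d3", "d5", "d6", "d7"})
print(p, r)  # 0.75 0.5
```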

In terms of the confusion matrix… a retrieved relevant document is a true positive (tp), a retrieved non-relevant document is a false positive (fp), a relevant document that is not retrieved is a false negative (fn), and a non-relevant document that is not retrieved is a true negative (tn). With these counts, Precision = tp / (tp + fp) and Recall = tp / (tp + fn).

Why not accuracy? Accuracy is (tp + tn) / (tp + fp + fn + tn). There is a good reason why accuracy is not an appropriate measure for information retrieval: the data is extremely skewed, with normally over 99.9% of the documents in the not relevant category. In such circumstances, a system tuned to maximize accuracy will almost always declare every document not relevant. A web user is always going to want to see some documents, and can be assumed to have a certain tolerance for seeing some false positives. Precision and recall concentrate the evaluation on the return of true positives, asking what percentage of the relevant documents have been found and how many false positives have also been returned.
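To see how misleading accuracy can be on skewed data, here is a small illustration with invented numbers (a collection of 1,000,000 documents, 100 of them relevant) for a degenerate system that declares every document not relevant:

```python
# Illustrative numbers only: a collection of 1,000,000 documents,
# of which 100 are relevant to the query.
collection_size = 1_000_000
n_relevant = 100

# A degenerate "system" that labels every document as not relevant:
tp, fp = 0, 0
fn = n_relevant
tn = collection_size - n_relevant

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2%}")  # 99.99% -- looks excellent
print(f"recall   = {recall:.2%}")    # 0.00% -- finds nothing at all
```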

Why have two numbers? The advantage of having the two numbers for precision and recall is that one is more important than the other in many circumstances. Typical web surfers would like every result on the first page to be relevant (high precision), but have not the slightest interest in knowing, let alone looking at, every document that is relevant. In contrast, various professional searchers such as paralegals and intelligence analysts are very concerned with trying to get as high recall as possible, and will tolerate fairly low precision results in order to get it. Nevertheless, the two quantities clearly trade off against one another: you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries! Recall is a non-decreasing function of the number of documents retrieved, whereas precision usually decreases as the number of documents retrieved is increased.

What about a single number? The combined measure which is standardly used is called the F measure, the weighted harmonic mean of precision and recall: F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1 − α)/α. The default is to equally weight precision and recall, giving a balanced F measure; this corresponds to making α = 1/2 or β = 1, so that F = 2·P·R / (P + R). It is commonly written as F1, which is short for Fβ=1.
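A short sketch of the weighted F measure as defined above; the function name f_measure is arbitrary, and beta = 1 gives the balanced F1:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more
    heavily; beta = 1 gives the balanced F1 measure.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


print(f_measure(0.75, 0.5))          # balanced F1 = 0.6
print(f_measure(0.75, 0.5, beta=2))  # recall-heavy F2 ≈ 0.536
```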

Why not arithmetic mean? We can always get 100% recall by just returning all documents, and therefore we can always get an arithmetic mean of precision and recall of about 50% by the same process, since recall is 1 while precision is close to 0. In contrast, if we assume that 1 document in 10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The harmonic mean, the third of the classical Pythagorean means, is even more conservative: it is always less than or equal to the geometric mean, and it is closer to the minimum of two numbers than to their arithmetic mean.
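The slide's numbers can be checked directly; assuming 1 document in 10,000 is relevant and the system returns everything, recall is 1 and precision is 0.0001:

```python
# Degenerate strategy: return every document in the collection.
# Assume (as on the slide) that 1 document in 10,000 is relevant.
precision = 1 / 10_000   # 0.0001
recall = 1.0

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"arithmetic mean: {arithmetic_mean:.2%}")  # about 50%
print(f"harmonic mean:   {harmonic_mean:.2%}")    # about 0.02%
```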

Evaluation of ranked retrieval results Precision-recall curves have a distinctive saw-tooth shape: if the (k+1)th document retrieved is irrelevant, then recall is the same as for the top k documents, but precision has dropped; if it is relevant, then both precision and recall increase, and the curve jags up and to the right. It is often useful to remove these jiggles by interpolation: the interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r. (On the slide's precision-recall plot, the interpolated precision is drawn as a red line over the saw-tooth curve.) Justification: almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher).
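A sketch of how the saw-tooth curve and its interpolation could be computed for a single query; the ranking, the relevance judgments, and the helper names are invented for illustration:

```python
def precision_recall_points(ranking, relevant):
    """Precision and recall after each of the top k results, k = 1..len(ranking)."""
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))  # (recall, precision)
    return points


def interpolate(points):
    """Interpolated precision at recall r = max precision at any recall q >= r."""
    return [(r, max(p for q, p in points if q >= r)) for r, _ in points]


ranking = ["d3", "d9", "d1", "d8", "d2"]   # made-up ranked system output
relevant = {"d1", "d2", "d3"}              # made-up relevance judgments
points = precision_recall_points(ranking, relevant)
for (r, p), (_, ip) in zip(points, interpolate(points)):
    print(f"recall={r:.2f}  precision={p:.2f}  interpolated={ip:.2f}")
```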

Precision at k The above measures precision at all recall levels. What often matters instead is how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low numbers of retrieved results, such as 10 or 30 documents. This is referred to as "Precision at k", for example "Precision at 10".
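A corresponding sketch for precision at k, using the same kind of made-up ranking and judgments as above:

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


ranking = ["d3", "d9", "d1", "d8", "d2"]   # made-up ranked system output
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranking, relevant, 3))  # 2/3: two of the top 3 are relevant
print(precision_at_k(ranking, relevant, 5))  # 0.6: three of the top 5 are relevant
```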