Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.

Slides:



Advertisements
Similar presentations
Introduction to Information Retrieval
Advertisements

Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
Retrieval Evaluation J. H. Wang Mar. 18, Outline Chap. 3, Retrieval Evaluation –Retrieval Performance Evaluation –Reference Collections.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Retrieval Models Probabilistic and Language Models.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Exercising these ideas  You have a description of each item in a small collection. (30 web sites)  Assume we are looking for information about boxers,
Precision and Recall.
Evaluating Search Engine
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Modern Information Retrieval
INFO 624 Week 3 Retrieval System Evaluation
Retrieval Evaluation. Brief Review Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
Modeling Modern Information Retrieval
Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.
Retrieval Evaluation: Precision and Recall. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
Retrieval Evaluation. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Evaluation CSC4170 Web Intelligence and Social Computing Tutorial 5 Tutor: Tom Chao Zhou
Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
WXGB6106 INFORMATION RETRIEVAL Week 3 RETRIEVAL EVALUATION.
ISP 433/633 Week 6 IR Evaluation. Why Evaluate? Determine if the system is desirable Make comparative assessments.
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
LIS618 lecture 11 i/r performance evaluation Thomas Krichel
Evaluation of Image Retrieval Results Relevant: images which meet user’s information need Irrelevant: images which don’t meet user’s information need Query:
Chapter 5: Information Retrieval and Web Search
Search and Retrieval: Relevance and Evaluation Prof. Marti Hearst SIMS 202, Lecture 20.
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 5: Evaluation.
Evaluation David Kauchak cs458 Fall 2012 adapted from:
Evaluation David Kauchak cs160 Fall 2009 adapted from:
Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR2006.
Philosophy of IR Evaluation Ellen Voorhees. NIST Evaluation: How well does system meet information need? System evaluation: how good are document rankings?
IR Evaluation Evaluate what? –user satisfaction on specific task –speed –presentation (interface) issue –etc. My focus today: –comparative performance.
Evaluation of IR LIS531H. Why eval? When designing and using a system there are decisions to be made:  Manual or automatic indexing?  Controlled vocabularies.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 7 9/13/2011.
Evaluating What’s Been Learned. Cross-Validation Foundation is a simple idea – “ holdout ” – holds out a certain amount for testing and uses rest for.
Chapter 6: Information Retrieval and Web Search
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Evaluation INST 734 Module 5 Doug Oard. Agenda Evaluation fundamentals Test collections: evaluating sets  Test collections: evaluating rankings Interleaving.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Chapter 23: Probabilistic Language Models April 13, 2004.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Searching Specification Documents R. Agrawal, R. Srikant. WWW-2002.
C.Watterscs64031 Evaluation Measures. C.Watterscs64032 Evaluation? Effectiveness? For whom? For what? Efficiency? Time? Computational Cost? Cost of missed.
Performance Measurement. 2 Testing Environment.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Evaluation of Retrieval Effectiveness 1.
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Evaluation. The major goal of IR is to search document relevant to a user query. The evaluation of the performance of IR systems relies on the notion.
Information Retrieval Quality of a Search Engine.
Information Retrieval Lecture 3 Introduction to Information Retrieval (Manning et al. 2007) Chapter 8 For the MSc Computer Science Programme Dell Zhang.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008 Annotations by Michael L. Nelson.
INFORMATION RETRIEVAL MEASUREMENT OF RELEVANCE EFFECTIVENESS 1Adrienn Skrop.
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval Chengxiang Zhai, John Lafferty School of Computer Science Carnegie.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
Sampath Jayarathna Cal Poly Pomona
Evaluation of Information Retrieval Systems
7CCSMWAL Algorithmic Issues in the WWW
Evaluation of IR Systems
Evaluation.
Modern Information Retrieval
IR Theory: Evaluation Methods
Retrieval Evaluation - Measures
Retrieval Performance Evaluation - Measures
Precision and Recall Reminder:
Precision and Recall.
Presentation transcript:

Evaluation

(Slides: Allan, Ballesteros, Croft, and/or Turtle)

Types of Evaluation
Might evaluate several aspects:
– Assistance in formulating queries
– Speed of retrieval
– Resources required
– Presentation of documents
– Ability to find relevant documents
Evaluation is generally comparative:
– System A vs. B
– System A vs. A′
Most common evaluation: retrieval effectiveness.

The Concept of Relevance
Relevance of a document D to a query Q is subjective:
– Different users will have different judgments
– The same user may judge differently at different times
– The degree of relevance of different documents may vary

The Concept of Relevance
In evaluating IR systems it is assumed that:
– A subset of the documents in the database (DB) is relevant
– A document is either relevant or not

Relevance
In a small collection, the relevance of each document can be checked.
With real collections, we never know the full set of relevant documents.
Any retrieval model includes an implicit definition of relevance:
– Satisfiability of a FOL expression
– Distance
– P(Relevance | query, document)
– P(query | document)

Evaluation
An evaluation needs:
– A set of queries
– A collection of documents (corpus)
– Relevance judgments: which documents are correct and incorrect for each query
[Figure: example query "Potato farming and nutritional value of potatoes" with documents such as "Mr. Potato Head …", "nutritional info for spuds", "potato blight …", "growing potatoes …" marked as relevant or not relevant]
If the collection is small, we can review all documents; this is not practical for large collections.
Any ideas about how we might approach collecting relevance judgments for very large collections?

Finding Relevant Documents
Pooling:
– Retrieve documents using several automatic techniques
– Judge the top n documents for each technique
– The relevant set is the union of the judged-relevant documents
– This is a subset of the true relevant set
It is possible to estimate the size of the relevant set by sampling.
When testing:
– How should un-judged documents be treated?
– How might this affect results?
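
A minimal sketch of pooling in Python, assuming each system's run is a dict mapping a query id to its ranked list of document ids (the names build_pool, run_a, and run_b are illustrative, not from the slides):

    def build_pool(runs, depth=100):
        """Pool = union of the top-`depth` documents from each run, per query."""
        pool = {}
        for run in runs:
            for qid, ranking in run.items():
                pool.setdefault(qid, set()).update(ranking[:depth])
        return pool

    # Two hypothetical systems' rankings for one query:
    run_a = {"q1": ["d1", "d2", "d3", "d4"]}
    run_b = {"q1": ["d3", "d5", "d1", "d6"]}
    print(build_pool([run_a, run_b], depth=3))   # {'q1': {'d1', 'd2', 'd3', 'd5'}} (set order may vary)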

Test Collections
To compare the performance of two techniques:
– Each technique is used to evaluate the same queries
– Results (a set or ranked list) are compared using a metric
– The most common measures are precision and recall
Usually use multiple measures to get different views of performance.
Usually test with multiple collections, since performance is collection dependent.


Evaluation
Retrieved documents, relevant documents, and the relevant & retrieved documents (their intersection).
Recall: the ability to return ALL relevant items = (relevant & retrieved) / relevant
Precision: the ability to return ONLY relevant items = (relevant & retrieved) / retrieved
Example: let retrieved = 100, relevant = 25, relevant & retrieved = 10
Recall = 10/25 = 0.40
Precision = 10/100 = 0.10
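
A minimal sketch of set-based recall and precision, reproducing the numbers above (the document ids are made up for illustration):

    def recall_precision(retrieved, relevant):
        rel_ret = len(retrieved & relevant)                  # relevant & retrieved
        recall = rel_ret / len(relevant) if relevant else 0.0
        precision = rel_ret / len(retrieved) if retrieved else 0.0
        return recall, precision

    # 100 retrieved, 25 relevant, 10 in the intersection:
    relevant = {f"r{i}" for i in range(25)}
    retrieved = {f"r{i}" for i in range(10)} | {f"n{i}" for i in range(90)}
    print(recall_precision(retrieved, relevant))             # (0.4, 0.1)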

Precision and Recall
Precision and recall are well-defined for sets.
For ranked retrieval:
– Compute the value at fixed recall points (e.g. precision at 20% recall)
– Compute a P/R point for each relevant document, then interpolate
– Compute the value at fixed rank cutoffs (e.g. precision at rank 20)
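
A sketch of the second option: compute a (recall, precision) point at each relevant document in a ranking (the helper name pr_points and the doc ids are illustrative). It is checked against the example used on the interpolation slides below, where relevant documents sit at ranks 4, 9, and 20:

    def pr_points(ranking, relevant):
        points, rel_seen = [], 0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                rel_seen += 1
                points.append((rel_seen / len(relevant), rel_seen / rank))
        return points

    # Relevant docs at ranks 4, 9, and 20 of a 20-doc ranking:
    ranking = [f"n{i}" for i in range(20)]
    ranking[3], ranking[8], ranking[19] = "a", "b", "c"
    print(pr_points(ranking, {"a", "b", "c"}))
    # [(0.333.., 0.25), (0.666.., 0.222..), (1.0, 0.15)]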

Average Precision for a Query
We often want a single-number effectiveness measure.
Average precision is widely used in IR.
Calculate it by averaging precision whenever recall increases, i.e. at each relevant document retrieved.
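
A sketch of (non-interpolated) average precision for one query, averaging the precision at each relevant document retrieved; dividing by the total number of relevant documents (so relevant documents that are never retrieved contribute zero) is the usual convention and is an assumption here:

    def average_precision(ranking, relevant):
        rel_seen, precisions = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                rel_seen += 1
                precisions.append(rel_seen / rank)   # precision each time recall increases
        return sum(precisions) / len(relevant) if relevant else 0.0

    # For the worked example later in these slides (relevant at ranks 1, 3, 6, 10, 15; |Rq| = 10):
    # AP = (1/1 + 2/3 + 3/6 + 4/10 + 5/15) / 10 = 0.29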


Averaging Across Queries
It is hard to compare P/R graphs or tables for individual queries (too much data), so we need to average over many queries.
Two main types of averaging:
– Micro-average: each relevant document is a point in the average (most common)
– Macro-average: each query is a point in the average
Averaging is also done with the average precision value:
– The average of many queries' average precision values is called mean average precision (MAP)
– ("Average average precision" sounds weird)
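
MAP is then just the macro-average of the per-query average precision values; a sketch reusing the average_precision() function from the earlier snippet (the dict shapes for runs and qrels are assumptions for illustration):

    def mean_average_precision(runs, qrels):
        """runs: query id -> ranked doc list; qrels: query id -> set of relevant doc ids."""
        aps = [average_precision(runs[q], qrels[q]) for q in qrels]
        return sum(aps) / len(aps) if aps else 0.0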


Averaging and Interpolation Interpolation –actual recall levels of individual queries are seldom equal to standard levels –interpolation estimates the best possible performance value between two known values e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20 their precision at actual recall is.25,.22, and.15 –On average, as recall increases, precision decreases

Averaging and Interpolation
Actual recall levels of individual queries are seldom equal to the standard levels.
Interpolated precision at the i-th recall level, R_i, is the maximum precision at all points p such that R_i ≤ p ≤ R_(i+1).
– Assume only 3 relevant docs are retrieved, at ranks 4, 9, and 20
– Their actual recall points are 0.33, 0.67, and 1.0
– Their precision is 0.25, 0.22, and 0.15
– What is the interpolated precision at the standard recall points?

Recall level            Interpolated precision
0.0, 0.1, 0.2, 0.3      0.25
0.4, 0.5, 0.6           0.22
0.7, 0.8, 0.9, 1.0      0.15
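
A sketch of the interpolation rule behind the table above: the interpolated precision at a standard recall level is taken as the maximum precision at any actual recall point at or beyond that level (the usual TREC-style reading of the definition):

    def interpolate(points, levels=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
        """points: list of (recall, precision) pairs for one query."""
        return {r: max((p for rec, p in points if rec >= r - 1e-9), default=0.0)
                for r in levels}

    points = [(0.33, 0.25), (0.67, 0.22), (1.0, 0.15)]   # the example above
    print(interpolate(points))
    # {0.0: 0.25, 0.1: 0.25, 0.2: 0.25, 0.3: 0.25, 0.4: 0.22, 0.5: 0.22,
    #  0.6: 0.22, 0.7: 0.15, 0.8: 0.15, 0.9: 0.15, 1.0: 0.15}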

Interpolated Average Precision
Average precision at standard recall points:
– For a given query, compute a P/R point for every relevant doc
– Interpolate precision at the standard recall levels
– 11-pt is usually 100%, 90%, 80%, …, 10%, 0% (yes, 0% recall)
– 3-pt is usually 75%, 50%, 25%
– Average over all queries to get the average precision at each recall level
– Average the interpolated recall levels to get a single result, called "interpolated average precision"
Not used much anymore; "mean average precision" is more common. Values at specific interpolated points are still commonly used.

Micro-averaging: One Query
Let R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}, so |R_q| = 10, the number of relevant docs for q.
In the ranked answer set of 15 retrieved documents, the relevant documents (marked *) appear at ranks 1 (d123), 3 (d56), 6 (d9), 10 (d25), and 15 (d3); the remaining ranks hold non-relevant documents.
Find the precision given the total number of docs retrieved at a given recall value:
– 10% recall: 0.1 × 10 = 1 relevant doc retrieved; 1 doc retrieved to get 1 relevant doc: precision = 1/1 = 1.00
– 20% recall: 0.2 × 10 = 2 relevant docs retrieved; 3 docs retrieved to get 2 relevant docs: precision = 2/3 = 0.67
– 30% recall: 0.3 × 10 = 3 relevant docs retrieved; 6 docs retrieved to get 3 relevant docs: precision = 3/6 = 0.50
What is the precision at the remaining recall values up to 100%?

Recall/Precision Curve
|R_q| = 10 relevant docs for q; in the ranked answer set, the relevant documents are retrieved at ranks 1 (d123), 3 (d56), 6 (d9), 10 (d25), and 15 (d3).

Recall      Precision
0.1         1/1  = 1.00
0.2         2/3  = 0.67
0.3         3/6  = 0.50
0.4         4/10 = 0.40
0.5         5/15 = 0.33
0.6 … 1.0   0.00 (these recall levels are never reached: only 5 of the 10 relevant docs are retrieved)
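
A short sketch reproducing the table above for this single query: 10 relevant documents in total, of which 5 are retrieved, at ranks 1, 3, 6, 10, and 15:

    relevant_ranks = [1, 3, 6, 10, 15]   # ranks at which relevant docs appear
    total_relevant = 10                  # |Rq|

    for level in range(1, 11):                     # recall levels 0.1 .. 1.0
        need = level * total_relevant // 10        # relevant docs needed for this recall
        if need <= len(relevant_ranks):
            cutoff = relevant_ranks[need - 1]      # docs retrieved to reach that recall
            print(f"recall {level / 10:.1f}: precision = {need}/{cutoff} = {need / cutoff:.2f}")
        else:
            print(f"recall {level / 10:.1f}: precision = 0.00 (recall level never reached)")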

Averaging and Interpolation
Macro-average: each query is a point in the average.
– Can be independent of any parameter
– Average of precision values across several queries at standard recall levels
e.g.) assume 3 relevant docs are retrieved at ranks 4, 9, and 20
– Their actual recall points are 0.33, 0.67, and 1.0 (why?)
– Their precision is 0.25, 0.22, and 0.15 (why?)
Averaging over all relevant docs rewards systems that retrieve relevant docs at the top:
(0.25 + 0.22 + 0.15)/3 = 0.21

Recall-Precision Tables & Graphs
[Figure: example recall-precision tables and graphs]

Document Level Averages
Precision after a given number of docs retrieved (e.g. 5, 10, 15, 20, 30, 100, 200, 500, and 1000 documents).
This reflects the actual system performance as a user might see it.
Each precision average is computed by summing the precisions at the specified doc cut-off and dividing by the number of queries, i.e. the average precision for all queries at the point where n docs have been retrieved.
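
A sketch of these document-level averages: precision after the first k documents, averaged over queries. The cut-off list mirrors the slide; dividing by k even when fewer than k documents are retrieved follows the usual convention and is an assumption here:

    def precision_at_k(ranking, relevant, k):
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    def document_level_averages(runs, qrels, cutoffs=(5, 10, 15, 20, 30, 100, 200, 500, 1000)):
        return {k: sum(precision_at_k(runs[q], qrels[q], k) for q in qrels) / len(qrels)
                for k in cutoffs}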

R-Precision
Precision after R documents are retrieved, where R = the number of relevant docs for the query.
Average R-Precision: the mean of the R-Precisions across all queries.
e.g.) Assume 2 queries having 50 and 10 relevant docs; the system retrieves 17 and 7 relevant docs in the top 50 and top 10 documents retrieved, respectively.
Average R-Precision = (17/50 + 7/10)/2 = (0.34 + 0.70)/2 = 0.52
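
A sketch of R-Precision, plus the slide's two-query example worked out:

    def r_precision(ranking, relevant):
        R = len(relevant)
        return sum(1 for doc in ranking[:R] if doc in relevant) / R if R else 0.0

    # Query 1: R = 50, 17 relevant in the top 50 -> 17/50 = 0.34
    # Query 2: R = 10,  7 relevant in the top 10 ->  7/10 = 0.70
    average_r_precision = (17 / 50 + 7 / 10) / 2
    print(average_r_precision)   # ~0.52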

Evaluation
Recall and precision value pairs may co-vary in ways that are hard to understand.
We would like to find composite measures:
– A single-number measure of effectiveness
– Primarily ad hoc and not theoretically justifiable
Some attempts invent measures that combine parts of the contingency table into a single-number measure.

Contingency Table

                 Relevant    Non-relevant
Retrieved           A             B
Not retrieved       C             D

Miss = C/(A+C)
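
A sketch of the measures that fall out of this table, assuming the labelling above (A = relevant and retrieved, B = non-relevant and retrieved, C = relevant and not retrieved, D = non-relevant and not retrieved); fallout is included as the commonly paired measure, not something stated on the slide:

    def contingency_measures(A, B, C, D):
        return {
            "recall":    A / (A + C) if A + C else 0.0,
            "precision": A / (A + B) if A + B else 0.0,
            "miss":      C / (A + C) if A + C else 0.0,   # Miss = C/(A+C), as above
            "fallout":   B / (B + D) if B + D else 0.0,
        }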

Symmetric Difference
A is the retrieved set of documents; B is the relevant set of documents.
A Δ B (the symmetric difference) is the shaded area of the Venn diagram: the documents in exactly one of the two sets.

E Measure (van Rijsbergen)
Used to emphasize precision or recall; it behaves like a weighted average of precision and recall.
– A large α increases the importance of precision
– Can transform by α = 1/(β² + 1), β = P/R
– When α = 1/2, β = 1: precision and recall are equally important
E is the normalized symmetric difference of the retrieved and relevant sets:
E(β=1) = |A Δ B| / (|A| + |B|)
F = 1 − E is typical (good results mean larger values of F).
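
A sketch of the E and F measures in the standard van Rijsbergen form with α = 1/(β² + 1), i.e. E = 1 − (β² + 1)PR/(β²P + R); with β = 1 this reduces to the familiar harmonic-mean pair:

    def e_measure(precision, recall, beta=1.0):
        if precision == 0 or recall == 0:
            return 1.0
        b2 = beta * beta
        return 1.0 - (b2 + 1) * precision * recall / (b2 * precision + recall)

    def f_measure(precision, recall, beta=1.0):
        return 1.0 - e_measure(precision, recall, beta)

    # With the earlier example's P = 0.10 and R = 0.40:
    print(f_measure(0.10, 0.40))   # ~0.16  (= 2PR/(P+R) when beta = 1)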