Evaluation in IR
Sampath Jayarathna, Cal Poly Pomona
Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin and Prof. Rong Jin at MSU.

Evaluation
- Evaluation is key to building effective and efficient IR systems; measurement is usually carried out in controlled experiments, and online testing can also be done.
- Effectiveness and efficiency are related: high efficiency may be obtained at the price of effectiveness.

Why System Evaluation?
- There are many retrieval models/algorithms/systems; which one is the best?
- What is the best component for: ranking function (dot-product, cosine, ...), term selection (stopword removal, stemming, ...), term weighting (TF, TF-IDF, ...)?
- How far down the ranked list will a user need to look to find some/all relevant documents?

Difficulties in Evaluating IR Systems
- Effectiveness is related to the relevancy of retrieved items.
- Relevancy is not typically binary but continuous; even if relevancy is binary, it can be a difficult judgment to make.
- Relevancy, from a human standpoint, is:
  - Subjective: depends upon a specific user's judgment.
  - Situational: relates to the user's current needs.
  - Cognitive: depends on human perception and behavior.
  - Dynamic: changes over time.

Human Labeled Corpora (Gold Standard)
- Start with a corpus of documents.
- Collect a set of queries for this corpus.
- Have one or more human experts exhaustively label the relevant documents for each query.
- Typically assumes binary relevance judgments.
- Requires considerable human effort for large document/query corpora.

Precision and Recall
[Diagram: the space of all documents, partitioned into relevant vs. not relevant and retrieved vs. not retrieved; precision and recall are defined from the overlap of the retrieved and relevant sets.]

Precision and Recall
- Precision: the ability to retrieve top-ranked documents that are mostly relevant. P = tp/(tp + fp)
- Recall: the ability of the search to find all of the relevant items in the corpus. R = tp/(tp + fn)

                 Relevant   Nonrelevant
  Retrieved         tp          fp
  Not Retrieved     fn          tn
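These definitions translate directly into code. A minimal Python sketch (not from the slides; the counts are chosen to match the F-measure example used later in the lecture):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of retrieved documents that are relevant: tp / (tp + fp)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of relevant documents that are retrieved: tp / (tp + fn)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Illustrative counts: 2 relevant retrieved, 1 nonrelevant retrieved,
# 4 relevant documents missed.
tp, fp, fn = 2, 1, 4
print(round(precision(tp, fp), 2))  # 0.67
print(round(recall(tp, fn), 2))     # 0.33
```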

Precision/Recall: Example

Precision/Recall: Example

Accuracy: Example
Accuracy = (tp + tn)/(tp + tn + fp + fn)

                 Relevant    Nonrelevant
  Retrieved       tp = ?       fp = ?
  Not Retrieved   fn = ?       tn = ?

F Measure (F1/Harmonic Mean)
- A single measure of performance that takes both recall and precision into account: the harmonic mean of recall and precision, F = 2*Recall*Precision/(Recall + Precision).
- Why the harmonic mean? It emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large; both precision and recall need to be high for the harmonic mean to be high.
- The data are extremely skewed: over 99% of documents are non-relevant, which is why accuracy is not an appropriate measure.

F Measure (F1/Harmonic Mean): Example
Recall = 2/6 = 0.33
Precision = 2/3 = 0.67
F = 2*Recall*Precision/(Recall + Precision) = 2*0.33*0.67/(0.33 + 0.67) = 0.44

F Measure (F1/Harmonic Mean): Example
Recall = 5/6 = 0.83
Precision = 5/6 = 0.83
F = 2*Recall*Precision/(Recall + Precision) = 2*0.83*0.83/(0.83 + 0.83) = 0.83

E Measure (Parameterized F Measure)
- A variant of the F measure that allows weighting emphasis on precision over recall, commonly written as F_beta = (1 + beta^2)*Precision*Recall/(beta^2*Precision + Recall).
- The value of beta controls the trade-off:
  - beta = 1: equally weight precision and recall (E = F).
  - beta > 1: weight recall more.
  - beta < 1: weight precision more.
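A minimal sketch of the parameterized measure in the F_beta form above; beta = 1 recovers the balanced F1 (the numbers reuse the earlier example and are illustrative):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more,
    and beta = 1 gives the balanced F1 measure.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(round(f_beta(0.67, 0.33), 2))       # F1 ~ 0.44, as in the earlier example
print(round(f_beta(0.67, 0.33, 2.0), 2))  # recall-heavy F2 ~ 0.37 (recall is the weaker value here)
```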

Computing Recall/Precision Points
- For a given query, produce the ranked list of retrievals.
- Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures.
- Mark each document in the ranked list that is relevant according to the gold standard.
- Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: Example 1
Let the total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
One relevant document is missing, so 100% recall is never reached.

Computing Recall/Precision Points: Example 2
Let the total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/3 = 0.667
R = 3/6 = 0.5;    P = 3/5 = 0.6
R = 4/6 = 0.667;  P = 4/8 = 0.5
R = 5/6 = 0.833;  P = 5/9 = 0.556
R = 6/6 = 1.0;    P = 6/14 = 0.429

Mean Average Precision (MAP)
- Average precision: the average of the precision values at the points at which each relevant document is retrieved, i.e., the precision values from the rank positions where a relevant document was retrieved; relevant documents that are never retrieved contribute a precision of zero.
- Ex. 1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
- Ex. 2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
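As a sketch, the recall/precision points and the average precision can be computed from a ranked list of binary judgments. The ranking below is a hypothetical one constructed to reproduce Example 2 (relevant documents at ranks 1, 3, 5, 8, 9, and 14):

```python
def precision_recall_points(relevance, num_relevant):
    """Return (recall, precision) pairs at each rank that holds a relevant document.

    relevance    -- list of 0/1 judgments in ranked order
    num_relevant -- total number of relevant documents for the query
    """
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / num_relevant, hits / rank))
    return points


def average_precision(relevance, num_relevant):
    """Mean of the precision values at relevant ranks; relevant documents
    that are never retrieved contribute a precision of zero."""
    precisions = [p for _, p in precision_recall_points(relevance, num_relevant)]
    precisions += [0.0] * (num_relevant - len(precisions))
    return sum(precisions) / num_relevant


# Hypothetical ranking matching Example 2: relevant docs at ranks 1, 3, 5, 8, 9, 14.
ranking = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(precision_recall_points(ranking, 6))
print(round(average_precision(ranking, 6), 3))   # ~0.625, as on the slide
```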

Average Precision: Example

Average Precision: Example

Average Precision: Example

Average Precision: Example (one relevant document missed)

Average Precision: Example (two relevant documents missed)

Mean Average Precision (MAP)
- Summarizes rankings from multiple queries by averaging the per-query average precision.
- The most commonly used measure in research papers.
- Assumes the user is interested in finding many relevant documents for each query.
- Requires many relevance judgments in the text collection.

Mean Average Precision (MAP)
MAP = (1/|Q|) * sum over queries q in Q of AveragePrecision(q)
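A sketch of MAP as the mean of per-query average precision (a compact restatement of the helper above); the two rankings are hypothetical ones consistent with Examples 1 and 2, with six relevant documents per query and one never retrieved in Example 1:

```python
def average_precision(relevance, num_relevant):
    """Precision at each relevant rank, averaged over all relevant documents;
    missed relevant documents contribute zero."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant


def mean_average_precision(rankings, num_relevant_per_query):
    """MAP = mean of the per-query average precision values."""
    aps = [average_precision(r, n) for r, n in zip(rankings, num_relevant_per_query)]
    return sum(aps) / len(aps)


# Hypothetical rankings consistent with Examples 1 and 2.
example1 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1]   # one relevant doc never retrieved
example2 = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(round(mean_average_precision([example1, example2], [6, 6]), 3))  # ~0.629
```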

Non-Binary Relevance
- Documents are rarely entirely relevant or non-relevant to a query.
- There are many sources of graded relevance judgments, e.g., relevance judgments on a 5-point scale, or multiple judges.

Cumulative Gain
With graded relevance judgments, we can compute the gain at each rank. Cumulative gain at rank n: CG_n = rel_1 + rel_2 + ... + rel_n, where rel_i is the graded relevance of the document at position i.

Discounted Cumulative Gain (DCG)
- A popular measure for evaluating web search and related tasks; uses graded relevance.
- Two assumptions: highly relevant documents are more useful than marginally relevant documents, and the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined.

Discounted Cumulative Gain
- Gain is accumulated starting at the top of the ranking and is discounted at lower ranks.
- A typical discount is 1/log(rank): with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3.

Discounting Based on Position
Users care more about high-ranked documents, so we discount results by 1/log2(rank). Discounted cumulative gain: DCG_n = rel_1 + sum over i = 2..n of rel_i/log2(i).
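A sketch of DCG under these assumptions (1/log2(rank) discount, rank 1 undiscounted); the gains passed in are graded relevance judgments, here the ones from the NDCG example later in the lecture:

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the document at rank 1 contributes its full
    gain; the gain at rank i >= 2 is divided by log2(i)."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))


# The discounts quoted above: 1/log2(4) = 1/2 and 1/log2(8) = 1/3.
print(1 / math.log2(4), round(1 / math.log2(8), 3))   # 0.5  0.333
print(round(dcg([3, 2, 3, 0, 1, 2]), 2))              # ~8.10
```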

Normalized Discounted Cumulative Gain (NDCG)
To compare DCGs across queries, normalize values so that an ideal ranking would have an NDCG of 1.0. The ideal ranking orders the documents by decreasing graded relevance.

Normalized Discounted Cumulative Gain (NDCG)
- Normalize by the DCG of the ideal ranking: NDCG_n = DCG_n / IDCG_n.
- NDCG <= 1 at all ranks, and NDCG is comparable across different queries.

NDCG Example
- Documents ordered by the ranking algorithm: D1, D2, D3, D4, D5, D6.
- The user provides the following relevance scores: 3, 2, 3, 0, 1, 2.
- Changing the order of any two documents does not affect the CG measure: if D3 and D4 are switched, the CG remains the same, 11.

NDCG Example
- DCG is used to emphasize highly relevant documents appearing early in the result list. Using the logarithmic discount, the DCG is computed at each position in the result list.
- If D3 and D4 are switched, the DCG is reduced, because a less relevant document is placed higher in the ranking; that is, a more relevant document is discounted more by being placed at a lower rank.
- To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering is the monotonically decreasing sort of the relevance judgments provided by the experiment participant: 3, 3, 2, 2, 1, 0.
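A sketch that ties DCG and the ideal-ranking normalization together for this example (relevance scores 3, 2, 3, 0, 1, 2; the sorted ideal is 3, 3, 2, 2, 1, 0):

```python
import math


def dcg(gains):
    """DCG with the 1/log2(rank) discount (rank 1 is undiscounted)."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (sorted-by-relevance) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


scores = [3, 2, 3, 0, 1, 2]   # judgments in the order the system ranked the documents
print(sum(scores))                                   # CG = 11
print(round(dcg(scores), 2))                         # DCG ~ 8.10
print(round(dcg(sorted(scores, reverse=True)), 2))   # ideal DCG ~ 8.69
print(round(ndcg(scores), 2))                        # NDCG ~ 0.93
```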

Focusing on Top Documents
- Users tend to look at only the top part of the ranked result list to find relevant documents.
- Some search tasks have only one relevant document, e.g., navigational search or question answering.
- Recall is not appropriate here; instead we need to measure how well the search engine does at retrieving relevant documents at very high ranks.

Focusing on Top Documents
- Precision at rank R (R typically 5, 10, or 20): easy to compute, average, and understand, but not sensitive to rank positions less than R.
- Reciprocal rank: the reciprocal of the rank at which the first relevant document is retrieved. Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries; it is very sensitive to rank position.

R-Precision
Precision at the R-th position in the ranking of results for a query that has R relevant documents.
Example: R = # of relevant docs = 6; R-Precision = 4/6 = 0.67.
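A sketch of precision at rank k and R-precision over a binary relevance list; the ranking is hypothetical, arranged to give the 4-out-of-6 outcome in the example above:

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def r_precision(relevance, num_relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(relevance, num_relevant)


# Hypothetical ranking with 6 relevant documents in total,
# 4 of which appear in the top 6 positions.
ranking = [1, 1, 0, 1, 1, 0, 0, 1, 0, 1]
print(precision_at_k(ranking, 5))        # P@5 = 0.8
print(round(r_precision(ranking, 6), 2)) # 4/6 ~ 0.67
```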

Mean Reciprocal Rank (MRR)
MRR = (1/|Q|) * sum over queries q in Q of 1/(rank of the first relevant document for q)
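A sketch of reciprocal rank and MRR; the two query judgment lists are illustrative:

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_query_relevance):
    """Average of the reciprocal ranks over a set of queries."""
    rrs = [reciprocal_rank(r) for r in per_query_relevance]
    return sum(rrs) / len(rrs)


# Two illustrative queries: first relevant result at rank 1 and at rank 4.
queries = [[1, 0, 0, 1], [0, 0, 0, 1]]
print(mean_reciprocal_rank(queries))   # (1/1 + 1/4) / 2 = 0.625
```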

Significance Tests
- Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
- A significance test:
  - Null hypothesis: no difference between A and B.
  - Alternative hypothesis: B is better than A.
  - The power of a test is the probability that the test will correctly reject the null hypothesis.
  - Increasing the number of queries in the experiment also increases the power of the test.

Example Experimental Results
Significance level: α = 0.05. We want the probability of the observed per-query results under the null hypothesis B = A (the p-value).

Example Experimental Results
t-test: t = 2.33, p-value = 0.02 < 0.05. The probability for B = A is 0.02, so we reject the null hypothesis: B is better than A.
Average effectiveness over the queries: A = 41.1, B = 62.5.
Significance level: α = 0.05.
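A sketch of the paired t-test behind these numbers, using only the standard library. The per-query scores are hypothetical: they are not given on the slide, but are chosen so that the averages (41.1 and 62.5) and the resulting t-statistic match the values reported above:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-query effectiveness (e.g., average precision) for systems A and B.
a = [25, 43, 39, 75, 43, 15, 20, 52, 49, 50]   # mean = 41.1
b = [35, 84, 15, 75, 68, 85, 80, 50, 58, 75]   # mean = 62.5

diffs = [bi - ai for ai, bi in zip(a, b)]            # per-query improvement of B over A
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))  # paired t-statistic
print(round(t, 2))
# ~2.33; with 9 degrees of freedom the one-sided p-value is ~0.02,
# so at alpha = 0.05 we reject the null hypothesis that B = A.
```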

Online Testing
- Test (or even train) using live traffic on a search engine.
- Benefits: real users, less biased, large amounts of test data.
- Drawbacks: noisy data; can degrade the user experience.
- Often done on a small proportion (1-5%) of live traffic.

Summary
- No single measure is the correct one for every application: choose measures appropriate for the task, and use a combination to show different aspects of system effectiveness.
- Use significance tests (e.g., the t-test).
- Analyze the performance of individual queries.