IR Theory: Evaluation Methods


Evaluation of IR Systems: Why?
- Which is the best? IR model, term weighting, indexing method, user interface, search system
- How do we determine whether a given approach is effective?
  - Is my new term weighting formula better than Okapi?
  - Is Google better than Naver?
  - Is VSM better than Boolean retrieval?
- Without standard evaluation methods, assessments can be anecdotal, biased, or simply incorrect.

Evaluation of IR Systems: How?
- Measure the degree to which the information need is satisfied
  - Number and ranks of relevant documents in the search results
  - User's own assessment
  - Usage: how often is the system used? What is the return rate?
  - Change in the user's knowledge
  - Task completion
- Determining whether a given approach is effective: baseline system vs. test system
  - Baseline system: a standard (top-performing) system
  - Test system: identical to the baseline except for the method under test
  - The comparison is easy to interpret if the systems differ in only one respect
  - When multiple factors differ (e.g., IR model + term weighting), their effects compound and it is difficult to isolate the contribution of each factor

Evaluation of IR Systems: Challenges
- What measure to use? Effectiveness, relevance, utility, user satisfaction
- Who should judge? An individual user, all users, a subject expert
- How to measure? Binary, multi-level, or continuous judgments
- Difficulties of evaluation
  - At its core, relevance judgment is a subjective process
  - Information needs change over time and with learning
  - Evaluation criteria vary by user and context

IR Evaluation: Standard Approach
- What measure to use? Effectiveness, relevance, utility, user satisfaction
- Who should judge? An individual user, all users, a subject expert
- How to measure? Binary, multi-level, or continuous judgments
- Process (a minimal sketch of this loop follows below)
  1. Collect a set of documents (the collection)
  2. Construct a set of queries (queries/topics)
  3. For each query, have human assessors identify the relevant documents (relevance judgments)
  4. Run the queries against the collection with each IR system to produce search results
  5. Evaluate the results by comparing them to the relevance judgments
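The following Python sketch illustrates this loop under simplifying assumptions: `system` stands in for whatever retrieval system is under test (any callable returning a ranked list of document ids), and precision at rank k stands in for whichever measure is chosen; none of the names come from a standard toolkit.

```python
# Sketch of the Cranfield/TREC-style evaluation loop. `system` is any callable
# that maps a query string to a ranked list of doc ids; `qrels` maps each
# topic id to its set of relevant doc ids (the relevance judgments).

def evaluate_system(system, topics, qrels, k=10):
    """Average precision@k of `system` over all topics."""
    scores = []
    for topic_id, query in topics.items():
        ranked = system(query)                         # produce search results
        relevant = qrels[topic_id]                     # relevance judgments
        hits = sum(1 for doc in ranked[:k] if doc in relevant)
        scores.append(hits / k)                        # compare to judgments
    return sum(scores) / len(scores)

# Toy usage: a "system" that always returns the same two documents.
topics = {"t1": "test query"}
qrels = {"t1": {"d7"}}
print(evaluate_system(lambda q: ["d7", "d3"], topics, qrels, k=2))   # 0.5
```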

IR Evaluation: TREC Approach
- Making exhaustive relevance judgments for a large collection is practically impossible
  - #docs to review = #queries × #documents (collection size)
  - For 50 queries and 1 million documents, assessors would need to review 50 million documents
- Pooled relevance judgments (sketched below)
  - Create a pool of documents to judge by collecting only the top n (e.g., 100) documents from each system's result
  - Justification (prior research finding): different systems retrieve different sets of documents
  - Assumption: a pool built from top-ranked results will contain the majority of the relevant documents in the collection
  - TREC finding: pooling does not change the relative ranking of systems
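A minimal sketch of pool construction, assuming each run is just a ranked list of document ids for one topic (the function name and data layout are illustrative):

```python
# Sketch of TREC-style pooling for one topic: merge the top-n documents from
# every submitted run; only documents in the pool are judged by assessors.

def build_pool(runs, n=100):
    """runs: list of ranked doc-id lists, one per system, for a single topic."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:n])   # take the top n from this run
    return pool

# Toy example with pool depth 2:
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
print(sorted(build_pool(runs, n=2)))   # ['d1', 'd2', 'd4', 'd5']
```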

Evaluation Measures: Overview
- Test collection
  - Document corpus
  - Topics: descriptions of information needs
  - Relevance judgments: the set of known relevant documents for each topic
- Evaluation measures (a minimal sketch of each follows below)
  - Precision (P): a measure of the system's ability to present only relevant items
    P = number of relevant items retrieved / total number of items retrieved
  - Recall (R): a measure of the system's ability to retrieve all relevant items
    R = number of relevant items retrieved / number of relevant items in the collection
  - R-Precision (RP): de-emphasizes the exact ranking of documents (useful when many relevant documents are retrieved)
    RP = precision at rank R, where R = number of relevant items in the collection
  - Reciprocal Rank (RR): well suited to known-item search
    RR = 1 / rank at which the first relevant item is retrieved
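The four measures above, written as minimal Python functions over a ranked list of document ids and a set of relevant ids (the names are illustrative, not taken from any standard library):

```python
# Sketches of the measures defined above. `ranked` is a ranked list of doc
# ids; `relevant` is the set of doc ids judged relevant for the topic.

def precision(ranked, relevant):
    return sum(1 for d in ranked if d in relevant) / len(ranked)

def recall(ranked, relevant):
    return sum(1 for d in ranked if d in relevant) / len(relevant)

def r_precision(ranked, relevant):
    r = len(relevant)                       # R = number of relevant items
    return precision(ranked[:r], relevant)  # precision at rank R

def reciprocal_rank(ranked, relevant):
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / rank                 # first relevant item found
    return 0.0                              # no relevant item retrieved
```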

Evaluation Measures: Precision & Recall
[Figure: Venn diagram over the document collection showing the retrieved documents, the relevant documents, and their intersection (relevant documents retrieved)]
- P = |relevant documents retrieved| / |retrieved documents|
- R = |relevant documents retrieved| / |relevant documents|

Evaluation Measures: Precision vs. Recall
[Figure: recall-precision tradeoff curve, with precision on the vertical axis and recall on the horizontal axis, both ranging up to 1; optimum performance lies at the upper-right corner]
- One extreme retrieves mostly relevant documents but misses many relevant documents (high precision, low recall)
- The other extreme retrieves most of the relevant documents along with lots of junk (high recall, low precision)

Evaluation Measures: Recall-Precision
- Recall-precision graph: precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0)
- Commonly used to compare systems: a curve closer to the upper right indicates a better system (a sketch of the computation follows below)

  Recall     0.0     0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1.0
  Precision  0.6169  0.4517  0.3938  0.3243  0.2725  0.2224  0.1642  0.1342  0.0904  0.0472  0.0031
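A minimal sketch of how such 11-point figures are typically computed for a single topic, assuming the usual interpolation rule (interpolated precision at recall level r is the maximum precision observed at any recall ≥ r); the numbers in the table above come from the slide, not from this code:

```python
# Sketch of interpolated precision at the 11 standard recall levels.

def eleven_point_precision(ranked, relevant):
    points = []                                    # (recall, precision) at each relevant rank
    hits = 0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]           # 0.0, 0.1, ..., 1.0
    # Interpolated precision: maximum precision at any recall >= the level.
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]
```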

Evaluation Measures: Average Precision
- Average Precision (AP): a single-valued measure that reflects performance over all relevant documents
  - Computed as the mean of the precision values at the ranks where relevant documents are retrieved, with unretrieved relevant documents contributing 0
  - Rewards systems that retrieve relevant documents at high ranks
  - The standard evaluation measure for IR systems, established through TREC (a minimal sketch follows below)
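A minimal sketch of this definition in Python (non-interpolated AP, dividing by the total number of relevant documents so that missed relevant documents count as 0):

```python
# Sketch of average precision for one topic.

def average_precision(ranked, relevant):
    hits, precisions = 0, []
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant rank
    # Divide by the total number of relevant documents, not just those
    # retrieved, so unretrieved relevant documents contribute 0.
    return sum(precisions) / len(relevant)
```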

Examples: Precision & Recall (Nrel = 10)

  rank  Doc#  Relevant?  Precision       Recall
     1    42  Yes        1/1  = 1.00     1/10 = 0.10
     2   221  No         1/2  = 0.50     1/10 = 0.10
     3   123  No         1/3  = 0.33     1/10 = 0.10
     4    21  Yes        2/4  = 0.50     2/10 = 0.20
     5   111  Yes        3/5  = 0.60     3/10 = 0.30
     6    11  No         3/6  = 0.50     3/10 = 0.30
     7    93  No         3/7  = 0.43     3/10 = 0.30
     8   234  No         3/8  = 0.38     3/10 = 0.30
     9     –  Yes        4/9  = 0.44     4/10 = 0.40
    10   254  No         4/10 = 0.40     4/10 = 0.40
    11   333  Yes        5/11 = 0.45     5/10 = 0.50
    12   421  Yes        6/12 = 0.50     6/10 = 0.60
    13    45  No         6/13 = 0.46     6/10 = 0.60
    14   761  Yes        7/14 = 0.50     7/10 = 0.70
    15     –  Yes        8/15 = 0.53     8/10 = 0.80

Examples: Average Precision
- Using the same ranked list as above (Nrel = 10), relevant documents appear at ranks 1, 4, 5, 9, 11, 12, 14, and 15, with precision values 1.00, 0.50, 0.60, 0.44, 0.45, 0.50, 0.50, and 0.53
- The two relevant documents that were never retrieved contribute 0
- AP = (1.00 + 0.50 + 0.60 + 0.44 + 0.45 + 0.50 + 0.50 + 0.53 + 0 + 0) / 10 ≈ 0.45 (recomputed in the sketch below)
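A short, self-contained recomputation of this example; the relevance pattern is read off the table above, and document ids are omitted because two of them are missing from the slide:

```python
# Recompute precision, recall, and AP for the example above.
# rels[i] == 1 means the document at rank i+1 is relevant; Nrel = 10.

rels = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]   # ranks 1..15
n_rel = 10

hits, ap_sum = 0, 0.0
for rank, rel in enumerate(rels, start=1):
    if rel:
        hits += 1
        ap_sum += hits / rank
        print(f"rank {rank:2d}: P = {hits}/{rank} = {hits/rank:.2f}, "
              f"R = {hits}/{n_rel} = {hits/n_rel:.2f}")

print(f"AP = {ap_sum:.4f} / {n_rel} = {ap_sum/n_rel:.4f}")   # ~0.4532
```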