

Evaluation INST 734 Module 5 Doug Oard

Agenda
Evaluation fundamentals
→ Test collections: evaluating sets
Test collections: evaluating rankings
Interleaving
User studies

Batch Evaluation Model
[Diagram: a Query and the Documents go into the IR "black box" (Search); the Result it returns is passed to an Evaluation Module, which compares it against Relevance Judgments to produce a Measure of Effectiveness.]
These are the four things we need.
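To make the pipeline concrete, here is a minimal batch-evaluation sketch in Python; the function name run_query, the system object, and the data layout are illustrative assumptions, not anything from the slides.

    # Batch evaluation: run every topic's query through the IR "black box",
    # score the results against the relevance judgments, and average.
    def evaluate(system, topics, qrels, measure):
        scores = {}
        for topic_id, query in topics.items():
            results = system.run_query(query)            # the result document set or ranking
            judged = qrels[topic_id]                     # {doc_id: relevance} for this topic
            scores[topic_id] = measure(results, judged)  # e.g., precision, recall, or F1
        return sum(scores.values()) / len(scores)        # mean effectiveness across topics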

IR Test Collection Design
Representative document collection
– Size, sources, genre, topics, …
"Random" sample of topics
– Associated somehow with queries
Known (often binary) levels of relevance
– For each topic-document pair (topic, not query!)
– Assessed by humans, used only for evaluation
Measure(s) of effectiveness
– Used to compare alternate systems
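A minimal sketch of how such judgments might be stored in Python; the topic and document identifiers are made up, and the nested-dictionary layout is an assumption (TREC distributes judgments in a similar per-line "qrels" form).

    # Relevance judgments are keyed by (topic, document) pairs, not by query strings.
    qrels = {
        "301": {"doc17": 1, "doc42": 0},   # topic 301: doc17 judged relevant, doc42 judged non-relevant
        "302": {"doc99": 1},
    }

    def is_relevant(topic_id, doc_id):
        # Unjudged documents are conventionally treated as non-relevant.
        return qrels.get(topic_id, {}).get(doc_id, 0) > 0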

A TREC Ad Hoc Topic
Title: Health and Computer Terminals
Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
Narrative: Relevant documents would contain any information that expands on any physical disorder/problems that may be associated with the daily working with computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

Saracevic on Relevance
"Relevance is the ___ of a ___ existing between a ___ and a ___ as determined by ___", where the blanks are drawn from:
– measure, degree, dimension, estimate, appraisal, relation
– correspondence, utility, connection, satisfaction, fit, bearing, matching
– document, article, textual form, reference, information provided, fact
– query, request, information used, point of view, information need statement
– person, judge, user, requester, information specialist
Tefko Saracevic (1975). Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6).

Teasing Apart "Relevance"
Relevance relates a topic and a document
– Duplicates are equally relevant by definition
– Constant over time and across users
Pertinence relates a task and a document
– Accounts for quality, complexity, language, …
Utility relates a user and a document
– Accounts for prior knowledge
Dagobert Soergel (1994). Indexing and Retrieval Performance: The Logical Evidence. JASIS, 45(8).

Set-Based Effectiveness Measures
Precision
– How much of what was found is relevant?
– Often of interest, particularly for interactive searching
Recall
– How much of what is relevant was found?
– Particularly important for law, patents, and medicine
Fallout
– How much of what was irrelevant was rejected?
– Useful when different size collections are compared
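A minimal sketch of these set-based measures in Python; the document IDs and the ten-document collection are made-up examples, not anything from the slides.

    # Set-based measures computed from the retrieved and relevant document-ID sets.
    retrieved  = {"d1", "d2", "d3", "d4"}
    relevant   = {"d1", "d3", "d7"}
    collection = {f"d{i}" for i in range(1, 11)}   # a tiny 10-document collection

    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)         # how much of what was found is relevant: 2/4
    recall    = len(hits) / len(relevant)          # how much of what is relevant was found: 2/3

    nonrelevant = collection - relevant
    fallout = len(retrieved & nonrelevant) / len(nonrelevant)   # non-relevant that was retrieved: 2/7
    # 1 - fallout is then the fraction of the irrelevant material that was rejected.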

[Venn diagram: the Retrieved set and the Relevant set overlap; their intersection is Relevant + Retrieved, and everything outside both is Not Relevant + Not Retrieved.]

Effectiveness Measures
                  Retrieved             Not Retrieved
  Relevant        Relevant Retrieved    Miss
  Not Relevant    False Alarm           Irrelevant Rejected
(Rows are the Truth, columns are what the System did; the slide also labels user-oriented and system-oriented views of these outcomes.)
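To connect this table to the measures on the earlier slide, here are the standard formulas; the shorthand TP (relevant retrieved), FN (miss), FP (false alarm), and TN (irrelevant rejected) is added here for compactness and is not the slide's notation.

    \mathrm{Precision} = \frac{TP}{TP + FP} \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN} \qquad
    \mathrm{Fallout} = \frac{FP}{FP + TN}

Fallout defined this way is the fraction of the non-relevant documents that were retrieved; its complement, TN / (FP + TN), is the fraction of the irrelevant material that was rejected.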

Single-Figure Set-Based Measures
Balanced F-measure
– Harmonic mean of recall and precision
– Weakness: What if no relevant documents exist?
Cost function
– Reward relevant retrieved, penalize non-relevant
– For example, 3R⁺ - 2N⁺
– Weakness: Hard to normalize, so hard to average
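Written out (a sketch of the intended formulas; P and R denote precision and recall, and R⁺ and N⁺ count the relevant and non-relevant documents retrieved):

    F_{1} = \frac{2PR}{P + R} \qquad\qquad C = 3R^{+} - 2N^{+}

If no relevant documents exist, recall (and therefore F1) is undefined, which is the weakness noted above; the cost function has no fixed range across topics, which is why it is hard to normalize and hence hard to average.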

(Paired) Statistical Significance Tests
[Worked example (table not preserved): per-query scores for System A and System B, their averages, and p-values for three paired tests over the per-query differences: sign test p = 1.0, Wilcoxon p not preserved, t-test p = 0.34; the slide also carried a "% of outcomes" annotation and a "Try some at:" pointer whose value and URL are not preserved.]
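A minimal sketch of how these three paired tests might be run in Python with SciPy; the per-query scores below are invented for illustration and do not reproduce the slide's numbers (binomtest needs SciPy 1.7 or later; older releases call it binom_test).

    # Paired significance tests on per-query effectiveness scores.
    from scipy import stats

    system_a = [0.20, 0.35, 0.10, 0.50, 0.40, 0.30]   # made-up per-query scores
    system_b = [0.25, 0.30, 0.15, 0.55, 0.45, 0.28]

    # Two-tailed paired t-test on the per-query differences.
    t_stat, t_p = stats.ttest_rel(system_b, system_a)

    # Wilcoxon signed-rank test (nonparametric; uses ranks of the differences).
    w_stat, w_p = stats.wilcoxon(system_b, system_a)

    # Sign test: only the direction of each difference matters; ties are dropped.
    diffs = [b - a for a, b in zip(system_a, system_b)]
    wins = sum(d > 0 for d in diffs)
    n = sum(d != 0 for d in diffs)
    sign_p = stats.binomtest(wins, n, 0.5).pvalue

    print(f"t-test p={t_p:.3f}  Wilcoxon p={w_p:.3f}  sign test p={sign_p:.3f}")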

Reporting Results
Do you have a measurable improvement?
– Inter-assessor agreement limits max precision
– Using one judge to assess another yields ~0.8
Do you have a meaningful improvement?
– 0.05 (absolute) in precision might be noticed
– 0.10 (absolute) in precision makes a difference
Do you have a reliable improvement?
– Two-tailed paired statistical significance test
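One way the ~0.8 agreement ceiling can be measured is to score one assessor's relevant set against the other's as if it were a system run; this is a sketch under that assumption, with made-up document IDs.

    # Treat judge B's relevant set as a "retrieved" set and score it against judge A's judgments.
    judge_a = {"d1", "d2", "d3", "d5", "d8"}   # documents judge A called relevant
    judge_b = {"d1", "d2", "d4", "d5"}         # documents judge B called relevant

    overlap = judge_a & judge_b
    precision_b_vs_a = len(overlap) / len(judge_b)   # 3/4 = 0.75
    recall_b_vs_a    = len(overlap) / len(judge_a)   # 3/5 = 0.60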

Agenda
Evaluation fundamentals
Test collections: evaluating sets
→ Test collections: evaluating rankings
Interleaving
User studies