Retrieval Evaluation - Reference Collections


Retrieval Evaluation - Reference Collections
Berlin Chen
Department of Computer Science & Information Engineering, National Taiwan Normal University

References:
1. Modern Information Retrieval, Chapter 4 and teaching material
2. Text REtrieval Conference, http://trec.nist.gov/
3. Search Engines: Information Retrieval in Practice, Chapter 8

Premises

Research in IR has frequently been criticized on two fronts:
- It lacks a solid formal framework as a basic foundation.
  - The inherent psychological subjectiveness involved in deciding the relevance of a given document is difficult to dismiss entirely.
  - Relevance can be binary or graded:
    - Binary relevance: relevant vs. not relevant
    - Graded relevance: e.g., highly relevant, (mildly) relevant, and not relevant
- It lacks robust and consistent testbeds and benchmarks.
  - Small test collections did not reflect real-world applications.
  - There were no widely accepted benchmarks.
  - Comparisons between retrieval systems were difficult, because different groups focused on different aspects of retrieval.

In short, the two main external criticisms of IR are that deciding whether a document is relevant is quite subjective, and that consistent test platforms and benchmarks have been lacking.

The TREC Collection

Text REtrieval Conference (TREC)
- Established in 1991, co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA)
- Aimed at the evaluation of large-scale IR problems
- The first annual conference was held at NIST in November 1992
- The most well-known IR evaluation setting
- http://trec.nist.gov/overview.html

Parts of the following slides were taken from TREC overviews by Ellen Voorhees of NIST.

Topics: Users' Information Need

Donna Harman graduated from Cornell University with a degree in electrical engineering, but has been involved in research on new search engine techniques for many years. She currently heads a group at the National Institute of Standards and Technology working on natural language access to full text, in both search and browsing modes. In 1992 she started the Text REtrieval Conference (TREC), an ongoing forum that brings together researchers from industry and academia to test their search engines against a common corpus of over a million documents, with appropriate topics and relevance judgments.

TREC - Test Collection and Benchmarks

A TREC test collection consists of:
- The documents
- The example information requests/needs (called topics in TREC nomenclature)
- A set of relevant documents for each example information request

Benchmark tasks:
- Ad hoc task: new queries are run against a static set of documents
- Routing task: fixed queries are run against a continuously changing document stream; the retrieved documents must be ranked
- Other tasks were added starting from TREC-4

The collections are divided into training/development and evaluation collections.
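
The relevance judgments for the topics are typically distributed as a "qrels" file with one judgment per line: topic id, an unused iteration field, document id, and relevance value. A minimal loading sketch under that assumption (the file name in the usage note is hypothetical):

```python
from collections import defaultdict

def load_qrels(path):
    """Load TREC-style relevance judgments: 'topic iter docno relevance' per line."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            qrels[topic][docno] = int(rel)
    return qrels

# Hypothetical usage: collect the documents judged relevant (rel > 0) for topic 301
# qrels = load_qrels("qrels.adhoc.txt")
# relevant = {doc for doc, rel in qrels["301"].items() if rel > 0}
```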

TREC - Document Collection: example statistics for TREC-6 (the accompanying table is not included in the transcript)

TREC - Document Collection

TREC document example (WSJ880406-0090). Documents are tagged with SGML (Standard Generalized Markup Language):

    <doc>
    <docno> WSJ880406-0090 </docno>
    <hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl>
    <author> Janet Guyon (WSJ staff) </author>
    <dateline> New York </dateline>
    <text>
    American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad …
    </text>
    </doc>

The major structures are the document number (<docno>) and the document text (<text>). The original structures of the source documents are preserved as much as possible, while a common framework is provided.
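
Because these files are SGML-tagged rather than strict XML, a common approach is lightweight pattern matching over the fields of interest. A minimal sketch, assuming a file that simply concatenates <doc> ... </doc> records like the one above (the file name in the usage note is hypothetical):

```python
import re

DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.DOTALL | re.IGNORECASE)
FIELD_RE = r"<{tag}>(.*?)</{tag}>"

def parse_trec_docs(sgml_text):
    """Yield (docno, text) pairs from SGML-tagged TREC documents."""
    for body in DOC_RE.findall(sgml_text):
        docno = re.search(FIELD_RE.format(tag="docno"), body, re.DOTALL | re.IGNORECASE)
        text = re.search(FIELD_RE.format(tag="text"), body, re.DOTALL | re.IGNORECASE)
        if docno and text:
            yield docno.group(1).strip(), text.group(1).strip()

# Hypothetical usage:
# with open("wsj880406.sgml") as f:
#     for docno, text in parse_trec_docs(f.read()):
#         print(docno, text[:60])
```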

TREC Topic Example

A TREC topic typically has three fields:
- Title: taken as a short query, more typical of a web application
- Description: taken as a longer query
- Narrative: describes the criteria for relevance used by the people doing the relevance judgments, and is not taken as a query
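
For concreteness, topics are distributed in an SGML-like layout with <num>, <title>, <desc>, and <narr> fields. The sketch below shows an invented topic in that layout (the wording is illustrative, not an actual TREC topic) and one way to pull the fields apart:

```python
import re

# A hypothetical topic in the usual TREC layout (illustrative text only)
SAMPLE_TOPIC = """
<top>
<num> Number: 901
<title> cystic fibrosis gene therapy
<desc> Description:
Find documents about gene therapy approaches for treating cystic fibrosis.
<narr> Narrative:
Relevant documents report clinical or laboratory results of gene therapy for
cystic fibrosis; general overviews of the disease are not relevant.
</top>
"""

def parse_topic(topic_text):
    """Extract the num, title, desc, and narr fields from one TREC-style topic."""
    fields = {}
    # Each field runs until the next tag (or </top>); 'Number:'-style labels are skipped
    pattern = r"<(num|title|desc|narr)>\s*(?:\w+:)?\s*(.*?)(?=<(?:num|title|desc|narr|/top)>)"
    for tag, value in re.findall(pattern, topic_text, re.DOTALL | re.IGNORECASE):
        fields[tag.lower()] = " ".join(value.split())
    return fields

# parse_topic(SAMPLE_TOPIC)["title"] -> "cystic fibrosis gene therapy"
```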

TREC - Creating Relevance Judgments

For each topic (example information request):
- Each participating system contributes its top K (set between 50 and 200) retrieved documents to a pool
- Duplicates are removed, and the pooled documents are presented in random order to the relevance judges
- Human "assessors" decide on the relevance of each document
- Usually, an assessor judges a document as relevant (most judgments are binary) if it contains information that could be used to help write a report on the query topic

This is the so-called "pooling method", which rests on two assumptions:
- The vast majority of the relevant documents is collected in the assembled pool
- Documents not in the pool can be considered irrelevant

These assumptions have been verified to be accurate.
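
A minimal sketch of pool construction as described above; the run format (a dict mapping each topic id to a ranked list of document ids) and the depth K are assumptions for illustration:

```python
import random

def build_pool(runs, k=100, seed=0):
    """Merge the top-k documents of each run into a de-duplicated, shuffled judging pool.

    runs: list of dicts mapping topic id -> ranked list of docnos (hypothetical format).
    Returns a dict mapping topic id -> list of docnos to present to the assessors.
    """
    rng = random.Random(seed)
    pooled = {}
    for run in runs:
        for topic, ranking in run.items():
            pooled.setdefault(topic, set()).update(ranking[:k])  # the set removes duplicates
    # Present the pooled documents to the assessors in random order
    return {topic: rng.sample(sorted(docs), len(docs)) for topic, docs in pooled.items()}

# Hypothetical usage with two tiny runs for topic "301":
# run_a = {"301": ["WSJ880406-0090", "AP890101-0001"]}
# run_b = {"301": ["AP890101-0001", "FT911-3000"]}
# pool = build_pool([run_a, run_b], k=2)   # three unique documents to judge
```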

TREC - Pros and Cons

Pros:
- Large-scale collections applied to a common task
- Allows for somewhat controlled comparisons

Cons:
- Time-consuming in preparation and testing
- Very long queries, which are also unrealistic
- A high-recall search task and collections of news articles are sometimes inappropriate for other retrieval tasks
- Comparisons are still difficult to make, because systems differ on many dimensions
- Topics used in different conference years have little overlap, which makes comparison across years difficult
- Focus on batch ranking rather than interaction (though there is already an interactive track)

Some Experiences Learned from TREC

An analysis of TREC experiments has shown that:
- With 25 queries, an absolute difference of 0.05 in the effectiveness measure mAP leads to the wrong conclusion about which system is better in about 13% of the comparisons
- With 50 queries, this error rate falls below 4% (which means that an absolute difference of 0.05 in mAP is quite large)
- If a significance test is used, a relative difference of 10% in mAP is sufficient to guarantee a low error rate with 50 queries
- If more relevance judgments become possible, it is more productive to judge more queries than to judge more documents for the existing queries
- Although relevance may be a very subjective concept, differences in relevance judgments do not have a significant effect on the error rate of comparisons (perhaps because of the "narrative" field)
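
To illustrate the significance-test point above, a common practice is a paired test over the per-query average precision values of the two systems being compared. A minimal sketch; the AP values below are invented, and with only ten topics the test has far less power than the roughly 50 queries recommended above:

```python
from scipy import stats

# Hypothetical per-query average precision for two systems over the same 10 topics
ap_system_a = [0.42, 0.31, 0.55, 0.20, 0.48, 0.37, 0.29, 0.61, 0.44, 0.33]
ap_system_b = [0.38, 0.35, 0.50, 0.22, 0.41, 0.36, 0.25, 0.58, 0.40, 0.30]

map_a = sum(ap_system_a) / len(ap_system_a)
map_b = sum(ap_system_b) / len(ap_system_b)

# Paired t-test: are the per-query differences consistently in one direction?
t_stat, p_value = stats.ttest_rel(ap_system_a, ap_system_b)
print(f"mAP A = {map_a:.3f}, mAP B = {map_b:.3f}, p = {p_value:.3f}")
```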

Other Collections

The CACM Collection:
- 3,204 articles (containing only the title and abstract parts) published in the Communications of the ACM from 1958 to 1979
- Topics cover the computer science literature
- Queries were generated by students and faculty of a computer science department (relevance judgments were also done by the same people)

The ISI Collection:
- 1,460 documents selected from a collection assembled at the Institute for Scientific Information (ISI)

The Cystic Fibrosis (CF) Collection:
- 1,239 documents indexed with the term "cystic fibrosis" in the National Library of Medicine's MEDLINE database
- Much human expertise was involved

The Cystic Fibrosis (CF) Collection

- 1,239 abstracts of articles
- 100 information requests in the form of complete questions
- 4 separate relevance scores for each request
  - Relevant documents were determined and rated by 3 separate subject experts and one medical bibliographer on a 0-2 scale:
    - 0: not relevant
    - 1: marginally relevant
    - 2: highly relevant
  - The 4 relevance scores therefore combine into a total ranging from 0 to 8
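
A small sketch of how the four 0-2 judgments yield the 0-8 total (a simple sum, as the 0-8 range implies; the document ids and scores are invented):

```python
# Four judges (three subject experts plus one medical bibliographer), each scoring 0-2
judgments = {
    "CF-0017": (2, 2, 1, 2),
    "CF-0145": (0, 1, 0, 0),
}

combined = {doc: sum(scores) for doc, scores in judgments.items()}
print(combined)  # {'CF-0017': 7, 'CF-0145': 1}
```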

User Actions as Implicit Relevance Judgments

- Query logs that capture user interactions with a search engine have become an extremely important resource for web search engine development
- Many user actions can also be considered implicit relevance judgments; if these can be exploited, the effort of constructing a test collection can be substantially reduced
- The following actions (i.e., clickthrough data) may, to some extent, indicate the relevance of a document to a query:
  - Clicking on a document in a result list
  - Moving a document to a folder
  - Sending a document to a printer, etc.
- But how can the privacy of users be maintained?

More on Clickthrough Data

- Clickthrough data may be used to predict preferences between pairs of documents (such preferences correlate highly with relevance)
- This is appropriate for tasks with multiple levels of relevance (graded relevance) and is focused on user relevance rather than purely topical relevance
- Clickthrough data can also be aggregated to remove potential noise and individual differences
- The "Skip Above" and "Skip Next" strategies generate preferences from click data
- Preference: documents with more relevance should be ranked higher
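
A sketch of one common reading of the Skip Above / Skip Next strategies named above (an assumption, not necessarily the exact formulation intended here): a clicked result is preferred over every unclicked result ranked above it, and over the unclicked result immediately below it.

```python
def click_preferences(ranking, clicked):
    """Generate (preferred, less_preferred) document pairs from one result list.

    ranking: list of docnos in the order they were shown to the user.
    clicked: set of docnos the user clicked.
    """
    prefs = []
    for i, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Skip Above: the clicked document beats unclicked documents ranked above it
        prefs.extend((doc, above) for above in ranking[:i] if above not in clicked)
        # Skip Next: it also beats the unclicked document directly below it
        if i + 1 < len(ranking) and ranking[i + 1] not in clicked:
            prefs.append((doc, ranking[i + 1]))
    return prefs

# Hypothetical usage: the user clicked the 2nd and 4th results
# click_preferences(["d1", "d2", "d3", "d4", "d5"], {"d2", "d4"})
# -> [('d2', 'd1'), ('d2', 'd3'), ('d4', 'd1'), ('d4', 'd3'), ('d4', 'd5')]
```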