Search and Retrieval: Relevance and Evaluation
Prof. Marti Hearst, SIMS 202, Lecture 20

Finding Out About
- Three phases:
  - Asking of a question
  - Construction of an answer
  - Assessment of the answer
- Part of an iterative process

Today
- Relevance
- Evaluation of IR Systems
- Precision vs. Recall
- Cutoff Points
- Test Collections/TREC
- Blair & Maron Study

Evaluation
- Why Evaluate?
- What to Evaluate?
- How to Evaluate?

Why Evaluate?

Why Evaluate?
- Determine if the system is desirable
- Make comparative assessments
- Others?

What to Evaluate?

What to Evaluate?
- How much is learned about the collection
- How much is learned about a topic
- How much of the information need is satisfied
- How inviting the system is

What to Evaluate?
- What can be measured that reflects users' ability to use the system? (Cleverdon 66)
  - Coverage of information
  - Form of presentation
  - Effort required / ease of use
  - Time and space efficiency
  - Recall: proportion of relevant material actually retrieved
  - Precision: proportion of retrieved material actually relevant
  - (The last two, recall and precision, together characterize effectiveness)

Assessing the Answer
- How well does it answer the question?
  - Complete answer? Partial?
  - Background information?
  - Hints for further exploration?
- How relevant is it to the user?

Relevance
- Subjective
- Measurable to some extent
  - How often do people agree a document is relevant to a query?

Relevance
- In what ways can a document be relevant to a query?
  - Answer a precise question precisely.
  - Partially answer the question.
  - Suggest a source for more information.
  - Give background information.
  - Remind the user of other knowledge.
  - Others...

Retrieved vs. Relevant Documents
- (Series of Venn diagrams contrasting the set of retrieved documents with the set of relevant documents within the collection: high precision when most of what is retrieved is relevant; high recall when most of what is relevant is retrieved; the ideal is both high precision and high recall.)

Why Precision and Recall?
- Get as much good stuff as possible while at the same time getting as little junk as possible

Standard IR Evaluation
- Precision = (# relevant retrieved) / (# retrieved)
- Recall = (# relevant retrieved) / (# relevant in collection)
- (Diagram: the retrieved documents form a subset of the whole collection; the counts above refer to that picture.)
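
A minimal sketch of these two ratios, assuming the retrieved results and the relevance judgments are available as Python sets of document IDs (names and example values below are illustrative, not from the lecture):

    def precision_recall(retrieved, relevant):
        # Precision and recall from sets of document IDs.
        relevant_retrieved = retrieved & relevant
        precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
        recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 4 retrieved documents are relevant; 6 are relevant in the collection.
    p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})   # p = 0.75, r = 0.5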

Precision/Recall Curves
- There is a tradeoff between Precision and Recall
- So measure Precision at different levels of Recall
- (Plot: precision on the y-axis against recall on the x-axis, one point per recall level.)

Precision/Recall Curves
- Difficult to determine which of these two hypothetical results is better
- (Plot: two precision/recall curves, neither of which dominates the other at every recall level; see the sketch below for one way to summarize a curve as a single number.)
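
One common way to reduce such a curve to a single, comparable number (a sketch of standard 11-point interpolated average precision, not something stated on the slide) is to take the best precision at or beyond each of the recall levels 0.0, 0.1, ..., 1.0 and average them:

    def eleven_point_avg_precision(ranking, total_relevant):
        # `ranking` is a list of 0/1 relevance labels in rank order;
        # `total_relevant` is the number of relevant documents in the collection.
        points = []
        hits = 0
        for i, rel in enumerate(ranking, start=1):
            hits += rel
            points.append((hits / total_relevant, hits / i))   # (recall, precision)
        # interpolated precision at recall r = max precision at any recall >= r
        total = 0.0
        for level in (i / 10 for i in range(11)):
            candidates = [p for rec, p in points if rec >= level]
            total += max(candidates) if candidates else 0.0
        return total / 11

    # Example: relevant documents at ranks 1, 3, and 6; four relevant documents overall.
    eleven_point_avg_precision([1, 0, 1, 0, 0, 1], total_relevant=4)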

Precision/Recall Curves
- (Figure only: example precision/recall curves.)

Document Cutoff Levels
- Another way to evaluate:
  - Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100, top 500
  - Measure precision at each of these levels
  - Take a (weighted) average over results
- This is a way to focus on high precision
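
A small sketch of precision at fixed cutoffs, again assuming a 0/1 relevance label per rank position (the cutoff values are the ones listed on the slide):

    def precision_at_cutoffs(ranking, cutoffs=(5, 10, 20, 50, 100, 500)):
        results = {}
        for k in cutoffs:
            top_k = ranking[:k]
            # if fewer than k documents were returned, use however many there are
            results[k] = sum(top_k) / len(top_k) if top_k else 0.0
        return results

    # Example: 3 of the top 5 documents are relevant, so precision at 5 is 0.6.
    precision_at_cutoffs([1, 1, 0, 1, 0, 0, 1, 0], cutoffs=(5, 10))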

The E-Measure
- Combines Precision and Recall into one number (van Rijsbergen 79):
    E = 1 - ((1 + b^2) * P * R) / (b^2 * P + R)
  - P = precision
  - R = recall
  - b = measure of the relative importance of P or R
- For example, b = 0.5 means the user is twice as interested in precision as in recall
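
A small sketch of that formula (0 is the best possible score, 1 the worst; the function name is illustrative):

    def e_measure(precision, recall, b=1.0):
        # b < 1 weights precision more heavily; b > 1 weights recall more heavily.
        if precision == 0 or recall == 0:
            return 1.0   # worst possible score
        return 1.0 - ((1 + b * b) * precision * recall) / (b * b * precision + recall)

    # b = 0.5: the user is twice as interested in precision as in recall.
    e_measure(0.75, 0.5, b=0.5)

Note that 1 - E with b = 1 is the familiar F-measure, the harmonic mean of precision and recall.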

Expected Search Length
- Documents are presented in order of predicted relevance
- Search length: the number of non-relevant documents that the user must scan through in order to have their information need satisfied
- The shorter the better
- Example ranked list (y = relevant, n = not relevant): n y n y y y y n y n n n n
  - If the user needs 2 relevant documents, the search length is 2
  - If the user needs 3, the search length can be read off the same list (see the sketch below)
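
A minimal sketch of this computation, assuming the ranked list is given as a sequence of "y"/"n" relevance labels (the function name is illustrative):

    # Search length: number of non-relevant documents the user scans before
    # having found `needed` relevant documents in the ranked list.
    def search_length(labels, needed):
        non_relevant_seen = 0
        found = 0
        for label in labels:
            if label == "y":
                found += 1
                if found == needed:
                    return non_relevant_seen
            else:
                non_relevant_seen += 1
        return non_relevant_seen   # the need was never fully satisfied

    ranking = list("nynyyyynynnnn")    # the example list from the slide
    search_length(ranking, needed=2)   # -> 2, matching the slide

The expected search length averages this per-query quantity over queries (and over ties in the ranking); the sketch above computes only the simple value the slide defines.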

What to Evaluate?
- Effectiveness
  - Difficult to measure
  - Recall and Precision are one way
  - What might be others?

How to Evaluate?

TREC
- Text REtrieval Conference/Competition
  - Run by NIST (National Institute of Standards & Technology)
  - 1997 was the 6th year
- Collection: 3 gigabytes, more than 1 million documents
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments
  - Queries devised and judged by "Information Specialists"
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
  - Various research and commercial groups compete
  - Results judged on precision and recall, going up to a recall level of 1000 documents
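
The judging strategy noted above (judging only documents that were actually retrieved) is usually called pooling; a rough sketch of the idea (the pool depth of 100 is illustrative, not taken from the slide):

    def build_pool(runs, depth=100):
        # `runs` is a list of ranked document-ID lists, one per participating system.
        pool = set()
        for ranked_docs in runs:
            pool.update(ranked_docs[:depth])
        return pool   # assessors judge only the documents in this pool

Documents that no participating system ranks near the top are never judged, which is exactly the caveat the slide flags.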

Sample TREC queries (topics)
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

TREC
- Benefits:
  - Made research systems scale to large collections (pre-WWW)
  - Allows for somewhat controlled comparisons
- Drawbacks:
  - Emphasis on high recall, which may be unrealistic for what most users want
  - Very long queries, also unrealistic
  - Comparisons still difficult to make, because systems are quite different on many dimensions
  - Focus on batch ranking rather than interaction
  - No focus on the WWW

TREC is changing
- Emphasis on specialized "tracks":
  - Interactive track
  - Natural Language Processing (NLP) track
  - Multilingual tracks (Chinese, Spanish)
  - Filtering track
  - High-Precision track
  - High-Performance track
- www-nlpir.nist.gov/TREC

TREC Results
- Differ each year
- For the main track:
  - Best systems not statistically significantly different
  - Small differences sometimes have big effects
    - how good was the hyphenation model?
    - how was document length taken into account?
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
- Excitement is in the new tracks

Blair and Maron 1985
- A classic study of retrieval effectiveness
  - Earlier studies were on unrealistically small collections
- Studied an archive of documents for a legal suit
  - ~350,000 pages of text
  - 40 queries
  - Focus on high recall
  - Used IBM's STAIRS full-text system
- Main result: the system retrieved less than 20% of the relevant documents for particular information needs, when the lawyers thought they had 75%
- But many queries had very high precision

Blair and Maron, cont.
- How they estimated recall:
  - Generated partially random samples of unseen documents
  - Had users (unaware these were random) judge them for relevance
- Other results:
  - The two lawyers' searches had similar performance
  - The lawyers' recall was not much different from the paralegals'

Blair and Maron, cont.
- Why recall was low:
  - Users can't foresee the exact words and phrases that will indicate relevant documents
    - "accident" was referred to by those responsible as "event," "incident," "situation," "problem," ...
    - differing technical terminology
    - slang, misspellings
  - Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

Still to come:
- Evaluating user interfaces