Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.

Organizational Remarks
Exercises: Please register for the exercises by sending me an email by Friday, May 5th, with
- your name,
- your matriculation number (Matrikelnummer),
- your degree program (Studiengang: BA, MSc, Diploma, …),
- your plans for the exam (yes, no, undecided).
This is just to organize the exercises; it has no consequences if you decide to drop this course later.

Recap: IR System & Tasks Involved
[Slide diagram of the IR system pipeline: user interface, information need, query, query processing (parsing & term processing), logical view of the information need, documents, selecting data for indexing, parsing & term processing, index, searching, ranking, results, result representation; performance evaluation spans the whole process.]

Evaluation of IR Systems
Standard approaches for evaluating algorithms and computer systems:
- Speed / processing time
- Storage requirements
- Correctness of the used algorithms and their implementation
But most importantly: performance / effectiveness.
Another important issue: usability and the users' perception.
Questions: What is a good / better search engine? How to measure search engine quality? How to perform evaluations? Etc.

What does Performance / Effectiveness of IR Systems Mean?
Typical questions:
- How good is the quality of a system?
- Which system should I buy? Which one is better?
- How can I measure the quality of a system?
- What does quality mean for me? Etc.
The answers depend on the users, the application, …
Very different views and perceptions: user vs. search engine provider, developer vs. manager, seller vs. buyer, …
And remember: queries can be ambiguous, unspecific, etc.
Hence, in practice, restrictions and idealizations are used, e.g. only binary relevance decisions.

Precision & Recall
Precision = (# found & relevant) / (# found)
Recall = (# found & relevant) / (# relevant)
[Slide figure: document collection A-J with the relevant documents marked; example result list: 1. Doc. B, 2. Doc. E, 3. Doc. F, 4. Doc. G, 5. Doc. D, 6. Doc. H.]
Restrictions: 0/1 relevance, sets instead of an order/ranking.
But: We can use this for the evaluation of rankings, too (via the top N docs.).
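
To make the set-based definitions concrete, here is a minimal Python sketch (not from the lecture; the set of relevant documents below is made up, since the slide's figure is not in the transcript, and only the result list B, E, F, G, D, H is taken from the slide):

```python
def precision_recall(found, relevant):
    """Set-based precision and recall with binary (0/1) relevance."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant                 # found AND relevant
    precision = len(hits) / len(found)      # fraction of the result that is relevant
    recall = len(hits) / len(relevant)      # fraction of the relevant docs that were found
    return precision, recall

# Result list from the slide: B, E, F, G, D, H.
# Hypothetical assumption: the relevant documents are B, E, D, I, J.
p, r = precision_recall(["B", "E", "F", "G", "D", "H"], ["B", "E", "D", "I", "J"])
print(p, r)  # 0.5 0.6
```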

Calculating Precision & Recall
Precision: can be calculated directly from the result.
Recall: requires relevance ratings for the whole (!) data collection.
In practice, approaches to estimate recall:
1. Use a representative sample instead of the whole data collection
2. Document-source method
3. Expanding queries
4. Compare the result with external sources
5. Pooling method
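
As an illustration of the pooling method (item 5), here is a small Python sketch, not from the lecture: the union of the top-k results of several systems forms the pool, only the pooled documents are judged, and the judged relevant documents then serve as an approximation of the full relevant set when recall is estimated. The rankings and the value of k are made up.

```python
def build_pool(rankings, k=3):
    """Pooling: union of the top-k documents of each system's ranking."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    return pool

# Hypothetical rankings of three systems for the same query.
rankings = [["d1", "d2", "d3", "d4"],
            ["d2", "d5", "d1", "d6"],
            ["d7", "d2", "d8", "d1"]]
pool = build_pool(rankings, k=3)
# Only the documents in `pool` are shown to assessors; the judged relevant
# documents are then used as the (approximate) set of all relevant documents.
print(sorted(pool))  # ['d1', 'd2', 'd3', 'd5', 'd7', 'd8']
```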

Precision & Recall – Special Cases
Special treatment is necessary if no document is found or no relevant documents exist (division by zero).
Writing A = found & relevant, B = found & irrelevant, C = relevant but not found (so precision = A / (A + B) and recall = A / (A + C)):
- No relevant doc. exists (A = C = 0, recall undefined): 1st case B = 0 (the result set is empty, too), 2nd case B > 0.
- Empty result set (A = B = 0, precision undefined): 1st case C = 0 (no relevant doc. exists either), 2nd case C > 0.
[Slide figure: the corresponding 2x2 contingency tables over found x relevant with cells A, B, C, D.]
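
A small Python sketch of these cases, assuming one common convention (an empty result for a query without relevant documents counts as a perfect answer; any other undefined value is reported as 0). The slide only identifies the cases and does not prescribe this particular convention:

```python
def precision_recall_safe(found, relevant):
    """Set-based precision/recall with explicit handling of the special cases.

    Convention used here (one of several possible, not taken from the slide):
    if both the result set and the set of relevant documents are empty,
    precision = recall = 1.0; any other undefined value is reported as 0.0.
    """
    found, relevant = set(found), set(relevant)
    if not found and not relevant:          # A = B = C = 0: nothing to find, nothing returned
        return 1.0, 1.0
    hits = len(found & relevant)            # A
    precision = hits / len(found) if found else 0.0      # undefined when the result set is empty
    recall = hits / len(relevant) if relevant else 0.0   # undefined when no relevant doc exists
    return precision, recall

print(precision_recall_safe([], []))            # (1.0, 1.0)
print(precision_recall_safe(["d1"], []))        # (0.0, 0.0): no relevant doc exists, but docs returned
print(precision_recall_safe([], ["d1", "d2"]))  # (0.0, 0.0): relevant docs exist, empty result
```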

Precision & Recall Graphs
Comparing 2 systems:
- System 1: Prec_1 = 0.6, Rec_1 = 0.3
- System 2: Prec_2 = 0.4, Rec_2 = 0.6
Which one is better?
[Slide figure: precision-recall graph with both systems plotted; axes: precision vs. recall.]
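
A quick sketch for reproducing such a precision-recall plot (using matplotlib; the two points are the values from the slide, everything else is just plotting boilerplate):

```python
import matplotlib.pyplot as plt

systems = {"System 1": (0.3, 0.6),   # (recall, precision) from the slide
           "System 2": (0.6, 0.4)}

for name, (rec, prec) in systems.items():
    plt.scatter(rec, prec, label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.legend()
plt.show()
```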

The F Measure
Alternative measures exist, including ones combining precision p and recall r in one single value.
Example: the F measure (β = relative weight for recall, set manually):
F_β = ((β² + 1) * p * r) / (β² * p + r)
[Slide figure: example F values for different β.]
Source: N. Fuhr (Univ. Duisburg), lecture notes (Skriptum) for the course Information Retrieval, SS 2006.
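
A minimal Python sketch of the F measure, applied to the two systems from the previous slide (β = 1 weights precision and recall equally; larger β emphasizes recall):

```python
def f_measure(p, r, beta=1.0):
    """F measure: combines precision p and recall r into a single value."""
    if p == 0 and r == 0:
        return 0.0
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# Systems from the precision-recall graph slide.
print(f_measure(0.6, 0.3))            # System 1: ~0.40
print(f_measure(0.4, 0.6))            # System 2: ~0.48
print(f_measure(0.6, 0.3, beta=2.0))  # System 1 with recall weighted higher: ~0.33
```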

Calculating Average Precision Values
1. Macro assessment: estimates the expected value of the precision for a randomly chosen query (query- or user-oriented). Problem: queries with an empty result set.
2. Micro assessment: estimates the likelihood of a randomly chosen retrieved document being relevant (document- or system-oriented). Problem: does not satisfy monotony.
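
A Python sketch contrasting the two averages (the per-query counts are made up): the macro assessment averages the per-query precision values, while the micro assessment pools the counts over all queries first.

```python
def macro_precision(per_query):
    """Macro assessment: average of the per-query precision values."""
    return sum(a / (a + b) for a, b in per_query) / len(per_query)

def micro_precision(per_query):
    """Micro assessment: precision pooled over all retrieved documents."""
    total_hits = sum(a for a, b in per_query)
    total_found = sum(a + b for a, b in per_query)
    return total_hits / total_found

# (A, B) per query: A = found & relevant, B = found & irrelevant.
per_query = [(9, 1), (20, 80)]     # per-query precisions 0.9 and 0.2
print(macro_precision(per_query))  # (0.9 + 0.2) / 2 = 0.55
print(micro_precision(per_query))  # 29 / 110 ≈ 0.26
```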

Monotony of Precision & Recall
Monotony: adding a query that delivers the same results for both systems must not change their relative quality assessment.
Example (precision): [slide table with a numeric example; a constructed sketch follows below]
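
The example table from the slide is not contained in the transcript; the following constructed Python example (all numbers made up) shows the kind of effect it illustrates: with micro-averaged precision, adding a query on which both systems return exactly the same result can flip which system looks better, while the macro average preserves the ordering.

```python
def micro(per_query):
    """Micro-averaged precision: pooled over all retrieved documents."""
    return sum(a for a, b in per_query) / sum(a + b for a, b in per_query)

# (A, B) per query: A = found & relevant, B = found & irrelevant.
s1 = [(9, 1)]        # system 1, query 1: precision 0.9
s2 = [(800, 200)]    # system 2, query 1: precision 0.8
extra = (100, 900)   # query 2: identical result for both systems (precision 0.1)

print(micro(s1), micro(s2))                      # 0.9 > 0.8: system 1 ahead
print(micro(s1 + [extra]), micro(s2 + [extra]))  # ~0.11 < 0.45: the ordering flips
# Macro averaging keeps the ordering: (0.9 + 0.1) / 2 = 0.5 > (0.8 + 0.1) / 2 = 0.45.
```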

Precision & Recall for Rankings
Distinguish between linear and weak rankings.
Basic idea: evaluate precision and recall by looking at the top n results for different n.
Generally: precision decreases and recall increases with growing n.
[Slide table and graph: rank, relevance, recall, and precision for the top-10 results, plotted as a precision-recall curve; the numeric values are not recoverable from the transcript.]
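
A Python sketch of this top-n evaluation (the ranking and relevance judgments are made up, since the slide's own numbers are not in the transcript): for each cutoff n, precision and recall are computed over the first n documents of the ranking.

```python
def precision_recall_at_n(ranking, relevant):
    """Precision and recall over the top-n results, for n = 1 .. len(ranking)."""
    relevant = set(relevant)
    hits = 0
    points = []
    for n, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((n, hits / n, hits / len(relevant)))  # (n, prec@n, rec@n)
    return points

# Hypothetical ranking; relevant documents: d1, d3, d4, d7.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
for n, prec, rec in precision_recall_at_n(ranking, {"d1", "d3", "d4", "d7"}):
    print(f"n={n}: precision={prec:.2f}, recall={rec:.2f}")
```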

Precision & Recall for Rankings (Cont.)

Realizing Evaluations
Now we have a system to evaluate and:
- measures to quantify performance,
- methods to calculate them.
What else do we need?
- Documents d_j (test set)
- Tasks (information needs) and respective queries q_i
- Relevance judgments r_ij (normally binary)
- Results (delivered by the system)
Evaluation = comparison of the given, perfect result (q_i, d_j, r_ij) with the result from the system (q_i, d_j, r_ij(S_1)). A minimal sketch of such a comparison follows below.
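
To make the comparison concrete, here is a minimal Python sketch, roughly in the spirit of (but far simpler than) evaluation tools such as trec_eval; the query IDs, document IDs, and judgments are made up:

```python
# Perfect result: binary relevance judgments r_ij for query q_i and document d_j.
qrels = {("q1", "d1"): 1, ("q1", "d2"): 0, ("q1", "d3"): 1,
         ("q2", "d1"): 0, ("q2", "d4"): 1}

# Result delivered by system S1: a ranked list of documents per query.
system_results = {"q1": ["d1", "d2", "d3"],
                  "q2": ["d1", "d4"]}

def evaluate(system_results, qrels):
    """Compare the system's results with the relevance judgments, per query."""
    scores = {}
    for query, ranking in system_results.items():
        relevant = {d for (q, d), r in qrels.items() if q == query and r == 1}
        found = set(ranking)
        hits = len(found & relevant)
        scores[query] = (hits / len(found) if found else 0.0,
                         hits / len(relevant) if relevant else 0.0)
    return scores

print(evaluate(system_results, qrels))
# q1: precision 2/3, recall 1.0; q2: precision 0.5, recall 1.0
```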

The TREC Conference Series
In the old days, IR evaluation was problematic because:
- there were no good (i.e. big) test sets,
- there was no comparability because of the different test sets.
Motivation for initiatives such as the Text REtrieval Conference (TREC), held since 1992.
Goals of TREC:
- Create realistic, significant test sets
- Achieve comparability of different systems
- Establish common basics for IR evaluation
- Increase technology transfer between industry and research

The TREC Conf. Series (Cont.)
TREC offers:
- various collections of test data,
- standardized retrieval tasks (queries & topics),
- related relevance measures,
- different tasks ("tracks") for certain problems.
Examples of tracks targeted by TREC: traditional text retrieval, spoken document retrieval, non-English or multilingual retrieval, information filtering, user interactions, web search, SPAM (since 2005), blog (since 2005), video retrieval, etc.

Advantages and Disadvantages of TREC
TREC (and other IR initiatives) has been very successful, leading to progress that otherwise might not have happened.
But disadvantages exist as well, e.g.:
- It only compares performance, not the actual reasons for different behavior.
- Unrealistic data (e.g. still too small, not representative enough).
- Often just batch-mode evaluation, no interactivity or user experience (note: there are interactive tracks!).
- Often no analysis of significance.
Note: Most of these arguments are general problems of IR evaluation and not necessarily TREC-specific.

TREC Home Page
Visit the TREC web site and browse the different tracks (this gives you an idea of what is going on in the IR community).

Recap: IR System & Tasks Involved
[Slide diagram of the IR system pipeline, as at the beginning of the section: user interface, information need, query, query processing (parsing & term processing), logical view of the information need, documents, selecting data for indexing, parsing & term processing, index, searching, ranking, results, result representation; performance evaluation spans the whole process.]