Evaluation: Information Retrieval and the Web

Purposes of Evaluation
- System performance evaluation: efficiency of data structures and methods; operational profile.
- Retrieval evaluation: how well can the system "guess" what is needed across a range of queries (including ill-specified and vague ones)?
- Comparison (who's best?): how does a new algorithm or data structure compare to its predecessors, and what is its contribution?

Canonical Retrieval & Comparison Experiment
1. For each method M being compared:
   a. For each parameter setting P of M:
      1) Train M(P) on a training data set D'.
      2) For each testing data set D:
         a) Run M(P) on D.
         b) Compare actual results to expected results.
         c) Compute performance metrics.
   b. Compare performance across the parameter settings P for method M.
2. Compare performance across the methods M using statistical tests for significance.
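A minimal sketch of this harness in Python; all of the names here (train, metric, param_grid, and so on) are hypothetical placeholders rather than anything defined on the slides:

```python
from statistics import mean

def run_experiment(methods, param_grid, train_set, test_sets, metric):
    """Hypothetical harness for the canonical experiment.

    methods:    dict mapping a method name to train(train_set, **params),
                which returns a ranking function query -> ranked doc IDs
    param_grid: dict mapping a method name to a list of parameter dicts
    test_sets:  list of (queries, qrels) pairs
    metric:     metric(rankings, qrels) -> float
    """
    results = {}
    for name, train in methods.items():                        # 1. for each method M
        for params in param_grid[name]:                        # 1a. for each setting P of M
            ranker = train(train_set, **params)                # 1a.1 train M(P) on D'
            scores = []
            for queries, qrels in test_sets:                   # 1a.2 for each test set D
                rankings = {q: ranker(q) for q in queries}     #   a) run M(P) on D
                scores.append(metric(rankings, qrels))         #   b-c) score against expected results
            results[(name, tuple(sorted(params.items())))] = mean(scores)
    # Steps 1b and 2 (comparing settings within a method, then methods against
    # each other with significance tests) would operate on `results`.
    return results
```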

Key Questions
- Which methods should you compare against?
- What parameters should be checked?
- What data sets should be used?
- How should you divide the data into training and testing sets?
- What metrics should be used?
- What statistical tests should be run?

Choosing Methods and Parameters
This depends on your hypothesis: experiments are built around a hypothesis, for example:
- My system has significantly better performance than the state of the art.
- Heuristics significantly improve performance.
- Negative feedback makes little difference to performance.

IR Data Sets
TREC (Text REtrieval Conference):
- A yearly conference organized by NIST in which groups compare IR systems on designated tasks and data.
- Data sets are large and include human relevance and topic judgments.
- Tasks are usually IR (e.g., retrieval, filtering), more recently Web search.
- The largest and most widely used collections.

TREC Data Collections (>800,000 docs)
1. Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and information from the Computer Select disks (1989, 1990).
2. Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988), and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
3. San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents, and information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis.
4. Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).

TREC Relevance Judgments
Definition of relevance: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant."
- Judgments are binary; a document is relevant if any part of it is relevant.
- Judgments are pooled (based on the returns of the actual participating systems); a subset of the documents is judged by the topic author.
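A small sketch of the pooling step, assuming each run is a dict from topic to a ranked list of document IDs; the pool depth of 100 is a commonly used TREC value, not something stated on this slide:

```python
def build_pool(runs, depth=100):
    """Pooling: only the union of each system's top-`depth` documents per topic
    is sent to the assessors; documents outside the pool go unjudged."""
    pool = {}
    for run in runs:                          # run: {topic_id: ranked list of doc IDs}
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool

# Toy usage with invented run data:
run_a = {"066": ["d3", "d7", "d1"]}
run_b = {"066": ["d7", "d9"]}
print(build_pool([run_a, run_b], depth=2))    # {'066': {'d3', 'd7', 'd9'}}
```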

TREC Topics
An example Tipster topic description:
Number: 066
Domain: Science and Technology
Topic: Natural Language Processing
Description: Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
Narrative: A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.
Concept(s): 1. natural language processing; 2. translation, language, dictionary, font; 3. software applications
Factor(s): Nationality: U.S.
Definition(s):

TREC Tasks
- Ad hoc retrieval
- Routing (query stream from profile)
- Confusion (documents with scanning errors)
- Database merging (documents from separate collections)
- Spoken documents
- Filtering (construct a profile to filter an incoming stream)
- Question answering
- Web ad hoc
- Web homepage finding

TREC 2003 Tasks
- Cross-language: retrieve documents in multiple languages
- Filtering: the user's information need is stable; check an incoming stream for which documents should be disseminated
- Genome: gene sequences as well as supporting documents
- High Accuracy Retrieval from Documents (HARD): leverage knowledge about the user and/or their goals
- Interactive
- Novelty: retrieve new, non-redundant data
- Question answering
- Video: content-based retrieval of digital video
- Web: search on a snapshot of the web

Web Topics
An example web topic:
Number: 501
Title: deduction and induction in English?
Description: What is the difference between deduction and induction in the process of reasoning?
Narrative: A relevant document will contrast inductive and deductive reasoning. A document that discusses only one or the other is not relevant.

Metrics: Precision and Recall
Precision: were all retrieved documents relevant? Precision = hits / (hits + false positives).
Recall: were all relevant documents retrieved? Recall = hits / (hits + false negatives).
Contingency table:
- Retrieved and relevant: hit
- Retrieved but not relevant: false positive
- Not retrieved but relevant: false negative (miss)
- Not retrieved and not relevant: true negative
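As a worked illustration in Python (the document IDs are invented for the example):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall, following the contingency table above."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # retrieved and relevant
    false_positives = len(retrieved - relevant)   # retrieved but not relevant
    false_negatives = len(relevant - retrieved)   # relevant but missed
    precision = hits / (hits + false_positives) if retrieved else 0.0
    recall = hits / (hits + false_negatives) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]   # invented system output
relevant = ["d1", "d3", "d7", "d9"]    # invented relevance judgments
print(precision_recall(retrieved, relevant))   # (0.5, 0.5)
```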

Precision/Recall Tradeoff

Performance for Each Query
- Hold recall fixed (through parameters) and allow precision to vary.
- Average precision at seen relevant documents: the running average of precision taken at each rank where a relevant document appears.
- R-precision: the precision at rank r, where r = |relevant documents|.
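A short sketch of both measures, assuming a ranked list of document IDs and a set of relevant IDs (the data below is invented):

```python
def avg_precision_at_seen_relevant(ranking, relevant):
    """Average of the precision values observed at each seen relevant document.
    (MAP-style average precision instead divides by the total number of relevant
    documents, penalising relevant documents that are never retrieved.)"""
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)       # precision at this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking, relevant):
    """Precision at rank r, where r is the number of relevant documents."""
    relevant = set(relevant)
    r = len(relevant)
    return len(set(ranking[:r]) & relevant) / r if r else 0.0

ranking = ["d3", "d5", "d1", "d8", "d7"]         # invented ranked output
relevant = {"d1", "d3", "d7"}
print(avg_precision_at_seen_relevant(ranking, relevant))  # (1/1 + 2/3 + 3/5) / 3, about 0.756
print(r_precision(ranking, relevant))                     # 2 of the top 3 are relevant, about 0.667
```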

Metrics: P&R Combination
Precision and recall tend to balance each other: improving one usually comes at the cost of the other. The F measure combines recall and precision in a single metric as their harmonic mean:
F = 2 * Precision * Recall / (Precision + Recall)
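The same measure as a small Python helper, including the weighted F_beta generalisation (a sketch; the example values are arbitrary):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives the balanced F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.5, 0.5))    # 0.5   (equal precision and recall)
print(f_measure(0.25, 0.75))  # 0.375 (the harmonic mean punishes imbalance)
```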

Experiment Pitfalls
- Ill-specified hypothesis
- Reliance on flaky and/or too few users
- Bias in the user base
- Poorly constructed user instructions
- Inappropriate comparison methods
- Too much varying simultaneously
- Biased evaluation metric (confounding) or data selection
- Too many or too few statistical tests

Strategies for Avoiding Pitfalls
- Fully specify the experiment before running it.
- Run a pilot experiment.
- Put serious thought into the "right" hypothesis.
- Think about what you hope to say when the experiment concludes: will the experiment support your saying it?

Web Experiment Example
[O. Zamir & O. Etzioni, "Web Document Clustering", Proc. of the 21st International SIGIR Conference on Research and Development in Information Retrieval, 1998.]
The authors developed an incremental, linear-time clustering algorithm, Suffix Tree Clustering (STC), which clusters documents based on shared phrases in their snippets.
Hypotheses:
- STC has higher precision than standard clustering methods for this domain.
- Clustering based on snippets does not significantly degrade precision for this domain.
- STC is faster than standard clustering methods for this domain.
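STC itself finds shared phrases with a generalised suffix tree and then merges overlapping base clusters; the sketch below is only a rough, hypothetical illustration of the base-cluster idea using word n-grams, not the authors' algorithm:

```python
from collections import defaultdict

def base_clusters(snippets, max_phrase_len=3, min_docs=2):
    """Toy stand-in for STC's base-cluster step: group snippets by shared word
    n-grams ("phrases") and score each cluster by its size times the phrase length."""
    clusters = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_phrase_len + 1):
            for i in range(len(words) - n + 1):
                clusters[" ".join(words[i:i + n])].add(doc_id)
    scored = [(len(docs) * len(phrase.split()), phrase, docs)
              for phrase, docs in clusters.items() if len(docs) >= min_docs]
    return sorted(scored, key=lambda item: item[0], reverse=True)

snippets = ["suffix tree clustering of web snippets",        # invented snippets
            "clustering web snippets with shared phrases",
            "linear time suffix tree construction"]
for score, phrase, docs in base_clusters(snippets)[:3]:
    print(score, repr(phrase), sorted(docs))
```

Real STC gets its linear running time from the suffix tree; this n-gram version is only meant to show what a phrase-based base cluster and its score look like.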

Zamir & Etzioni Experiment
Data set:
- Thought up 10 queries.
- Generated 10 collections of 200 snippets each from MetaCrawler query results (the original documents were retrieved as well).
- Manually assigned relevance judgments (mean of 40 relevant documents per query).
Methods compared: Single-Pass, K-Means, Buckshot, Fractionation, GAHC.

Z&E Exp: STC has higher precision.

Z&E Exp: Snippet Clustering does not unduly affect precision.

Z&E Exp: STC is faster.