1 Evaluation: Information Retrieval and the Web

2 Purposes of Evaluation
System performance evaluation: efficiency of data structures and methods; operational profile.
Retrieval evaluation: how well can the system "guess" what is needed across a range of queries (including ill-specified and vague ones)?
Comparison (who's best?): how does a new algorithm or data structure compare to its predecessors; what is its contribution?

3 Canonical Retrieval & Comparison Experiment
1. For each method M being compared:
   a. For each parameter setting P of M:
      1) Train M(P) on a training data set D'.
      2) For each testing data set D:
         a) Run M on D.
         b) Compare actual results to expected results.
         c) Compute performance metrics.
   b. Compare performance across the settings P for method M.
2. Compare performance across the methods M using statistical tests for significance.
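
A minimal sketch of this loop in Python, assuming hypothetical names throughout: `methods` maps a name to a model factory, `param_grid` lists settings per method, each test set carries its queries and expected results, and `score` computes a metric. It is not tied to any particular IR system or toolkit.

```python
# Sketch of the canonical comparison loop; all names and interfaces here
# (make_method, model.train, model.run, score) are assumptions for illustration.
def run_experiment(methods, param_grid, train_set, test_sets, score):
    """Return {method_name: {param_setting: [score per test data set]}}."""
    results = {}
    for name, make_method in methods.items():               # 1. each method M
        results[name] = {}
        for params in param_grid[name]:                      # a. each parameter setting P
            model = make_method(**params)
            model.train(train_set)                           # 1) train M(P) on D'
            scores = []
            for test in test_sets:                           # 2) each testing data set D
                actual = model.run(test["queries"])          # a) run M on D
                scores.append(score(actual, test["expected"]))  # b)-c) compare and score
            results[name][frozenset(params.items())] = scores
        # b. compare performance across the settings P for this method
    # 2. compare across methods using a statistical significance test
    return results
```

The per-test-set scores returned here are what step 2 would feed into a significance test, for example a paired t-test across the shared test sets.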

4 Key Questions
Which methods should you compare against?
What parameters should be checked?
What data sets should be used?
How should you divide the data into training and testing sets?
What metrics should be used?
What statistical tests should be run?

5 Choosing Methods and Parameters
Depends on your hypothesis. Experiments are based on a hypothesis, e.g.:
My system has significantly better performance than the state of the art.
Heuristics significantly improve performance.
Negative feedback makes little difference to performance.

6 IR Data Sets
TREC (Text Retrieval Conference), http://trec.nist.gov/
Yearly conference organized by NIST.
Groups compare IR systems on designated tasks and data.
Data sets are large and include human relevance and topic judgments.
Tasks: usually IR (e.g., retrieval, filtering), more recently the Web.
The largest and most widely used collections.

7 TREC Data Collections (>800,000 docs)
1. Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and information from the Computer Select disks (1989, 1990).
2. Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988), and information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis.
3. San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis.
4. Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).

8 TREC Relevance Judgments
Definition of relevance: “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant.”
Judgments are binary; a document is relevant if any part of it is relevant.
Judgments are pooled (based on the returns of actual systems).
A subset of documents is judged by the topic author.

9 TREC Topics
Tipster Topic Description
Number: 066
Domain: Science and Technology
Topic: Natural Language Processing
Description: Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
Narrative: A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.
Concept(s): 1. natural language processing 2. translation, language, dictionary, font 3. software applications
Factor(s): Nationality: U.S.
Definition(s):

10 TREC Tasks
Ad hoc retrieval
Routing (query stream from a profile)
Confusion (docs with scanning errors)
Database merging (docs from separate collections)
Spoken documents
Filtering (construct a profile to filter an incoming stream)
Question answering
Web ad hoc
Web homepage finding

11 TREC 2003 Tasks
Cross-language: retrieve documents in multiple languages
Filtering: the user's information need is stable; check an incoming stream for which docs should be disseminated
Genome: gene sequences as well as supporting docs
High Accuracy Retrieval from Documents (HARD): leverage knowledge about the user and/or the user's goals
Interactive
Novelty: retrieve new, non-redundant data
Question answering
Video: content-based retrieval of digital video
Web: search on a snapshot of the web

12 Web Topics
Number: 501
Title: deduction and induction in English?
Description: What is the difference between deduction and induction in the process of reasoning?
Narrative: A relevant document will contrast inductive and deductive reasoning. A document that discusses only one or the other is not relevant.

13 Metrics: Precision and Recall
Precision: were all retrieved docs relevant? hits / (hits + false positives)
Recall: were all relevant docs retrieved? hits / (hits + false negatives)

              Relevant          ~Relevant
Retrieved     hit               false positive
~Retrieved    false negative    miss
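
As a concrete illustration, a minimal sketch of the two metrics in Python over sets of document ids; the example inputs are invented, not TREC data.

```python
# Precision and recall over sets of retrieved and relevant document ids.
def precision(retrieved: set, relevant: set) -> float:
    hits = len(retrieved & relevant)                      # true positives
    return hits / len(retrieved) if retrieved else 0.0    # hits / (hits + false positives)

def recall(retrieved: set, relevant: set) -> float:
    hits = len(retrieved & relevant)
    return hits / len(relevant) if relevant else 0.0      # hits / (hits + false negatives)

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
print(precision(retrieved, relevant), recall(retrieved, relevant))  # 0.5 0.666...
```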

14 Precision/Recall Tradeoff

15 Performance for Each Query
Hold recall fixed (through parameters) and allow precision to vary.
Average precision at seen relevant docs: a running average of precision, taken at each relevant document in the ranking.
R-precision: precision at rank r, where r = |relevant docs|.
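
A minimal sketch of both per-query measures, following the slide's "running average at seen relevant docs" reading (TREC-style average precision would instead divide by the total number of relevant documents); the ranked list and relevance set below are invented toy data.

```python
# Per-query metrics over a ranked list of document ids.
def avg_precision_at_seen(ranking: list, relevant: set) -> float:
    """Running average of precision, taken at each relevant doc seen in the ranking."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)                # precision at this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking: list, relevant: set) -> float:
    """Precision at rank r, where r = |relevant docs|."""
    r = len(relevant)
    return len(set(ranking[:r]) & relevant) / r if r else 0.0

ranking = [7, 2, 9, 4, 1]
relevant = {2, 4, 5}
print(avg_precision_at_seen(ranking, relevant))  # (1/2 + 2/4) / 2 = 0.5
print(r_precision(ranking, relevant))            # 1/3
```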

16 Metrics: P&R Combination
Precision and recall tend to balance each other: improving one comes at the cost of the other.
F is the harmonic mean of recall and precision, combining them in a single metric.
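
A minimal sketch of the harmonic-mean combination (the balanced F measure, often written F1), reusing the precision and recall values from the sketch above:

```python
# Harmonic mean of precision and recall (balanced F / F1).
def f_measure(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.5, 2 / 3))  # ~0.571: pulled toward the lower of the two values
```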

17 Experiment Pitfalls
Ill-specified hypothesis
Reliance on flaky and/or too few users
Bias in the user base
Poorly constructed user instructions
Inappropriate comparison methods
Too much varying simultaneously
Biased evaluation metric (confounding) or data selection
Too many or too few statistical tests

18 Strategies for Avoiding Pitfalls
Fully specify the experiment before running it.
Run a pilot experiment.
Put serious thought into the "right" hypothesis.
Think about what you hope to say when the experiment concludes... will the experiment support your saying it?

19 Web Experiment Example
[O. Zamir & O. Etzioni, "Web Document Clustering", Proc. of the 21st International SIGIR Conference on Research and Development in Information Retrieval, 1998.]
Developed an incremental, linear-time clustering algorithm, Suffix Tree Clustering (STC), which clusters documents based on shared phrases in their snippets.
Hypotheses:
STC has higher precision than standard clustering methods for this domain.
Clustering based on snippets does not significantly degrade precision for this domain.
STC is faster than standard clustering methods for this domain.
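
For illustration only, a simplified sketch of grouping snippets by shared phrases. It uses plain word n-grams over invented snippets and is a rough stand-in for the idea, not the incremental suffix-tree construction from the paper.

```python
# Simplified shared-phrase grouping (NOT the paper's Suffix Tree Clustering):
# each phrase shared by enough snippets forms a base cluster of snippet ids.
from collections import defaultdict

def phrase_clusters(snippets, n=2, min_docs=2):
    """Map each n-word phrase to the set of snippet ids containing it."""
    clusters = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            clusters[" ".join(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in clusters.items() if len(ids) >= min_docs}

snips = ["natural language processing for search",
         "search engines and natural language processing",
         "image retrieval on the web"]
print(phrase_clusters(snips))  # {'natural language': {0, 1}, 'language processing': {0, 1}}
```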

20 Zamir & Etzioni Experiment
Data set:
Thought up 10 queries.
Generated 10 collections of 200 snippets from MetaCrawler query results (retrieved the original docs as well).
Manually assigned relevance (mean = 40 relevant docs/query).
Methods: Single-Pass, K-Means, Buckshot, Fractionation, GAHC.

21 Z&E Exp: STC has higher precision.

22 Z&E Exp: Snippet Clustering does not unduly affect precision.

23 Z&E Exp: STC is faster.

