
1 Evaluation INST 734 Module 5 Doug Oard

2 Agenda
Evaluation fundamentals
→ Test collections: evaluating sets
Test collections: evaluating rankings
Interleaving
User studies

3 Batch Evaluation Model
[Diagram: a Query and a Document collection feed the IR "black box"; the Search Results it produces, together with Relevance Judgments, feed an Evaluation Module that outputs a Measure of Effectiveness.]
These are the four things we need: a document collection, queries (topics), relevance judgments, and a measure of effectiveness.
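The loop below is a minimal sketch, not from the slides, of how this batch model is typically driven in code; run_query, relevant_by_topic, and measure are hypothetical stand-ins for the IR black box, the relevance judgments, and the effectiveness measure.

from statistics import mean

def batch_evaluate(topics, run_query, relevant_by_topic, measure):
    # topics: {topic_id: query string}; run_query: the IR "black box" under test
    scores = []
    for topic_id, query in topics.items():
        retrieved = set(run_query(query))                    # search results for this query
        judged_relevant = relevant_by_topic.get(topic_id, set())
        scores.append(measure(retrieved, judged_relevant))   # e.g., precision or recall
    return mean(scores) if scores else 0.0                   # effectiveness averaged over topics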

4 IR Test Collection Design
Representative document collection
– Size, sources, genre, topics, …
"Random" sample of topics
– Associated somehow with queries
Known (often binary) levels of relevance
– For each topic-document pair (topic, not query!)
– Assessed by humans, used only for evaluation
Measure(s) of effectiveness
– Used to compare alternate systems
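As an illustration of how the human judgments are stored and used only at evaluation time, here is a sketch that reads a TREC-style qrels file; the four-column format (topic, iteration, document ID, judgment) is an assumption, and real collections may use graded rather than binary judgments.

from collections import defaultdict

def load_qrels(path):
    # Map each topic ID to the set of document IDs judged relevant for it.
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _iteration, docid, judgment = line.split()
            if int(judgment) > 0:        # treat any positive grade as relevant (binary view)
                relevant[topic].add(docid)
    return relevant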

5 A TREC Ad Hoc Topic
Title: Health and Computer Terminals
Description: Is it hazardous to the health of individuals to work with computer terminals on a daily basis?
Narrative: Relevant documents would contain any information that expands on any physical disorders/problems that may be associated with daily work with computer terminals. Such things as carpal tunnel, cataracts, and fatigue have been said to be associated, but how widespread are these or other problems and what is being done to alleviate any health problems.

6 Saracevic on Relevance
Relevance is the {measure / degree / dimension / estimate / appraisal / relation} of {correspondence / utility / connection / satisfaction / fit / bearing / matching} existing between a {document / article / textual form / reference / information provided / fact} and a {query / request / information used / point of view / information need statement} as determined by a {person / judge / user / requester / information specialist}.
Tefko Saracevic (1975). Relevance: A Review of and a Framework for Thinking on the Notion in Information Science. Journal of the American Society for Information Science, 26(6), 321-343.

7 Teasing Apart "Relevance"
Relevance relates a topic and a document
– Duplicates are equally relevant by definition
– Constant over time and across users
Pertinence relates a task and a document
– Accounts for quality, complexity, language, …
Utility relates a user and a document
– Accounts for prior knowledge
Dagobert Soergel (1994). Indexing and Retrieval Performance: The Logical Evidence. JASIS, 45(8), 589-599.

8 Set-Based Effectiveness Measures
Precision
– How much of what was found is relevant?
– Often of interest, particularly for interactive searching
Recall
– How much of what is relevant was found?
– Particularly important for law, patents, and medicine
Fallout
– How much of what was irrelevant was rejected?
– Useful when different size collections are compared
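A small sketch of these three measures computed over document-ID sets; the function name and arguments are illustrative, not from the slides.

def set_measures(retrieved, relevant, collection_size):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                     # relevant documents that were found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0   # undefined when nothing is relevant
    not_relevant = collection_size - len(relevant)
    fallout = (len(retrieved) - hits) / not_relevant if not_relevant else 0.0
    return precision, recall, fallout

Note that fallout here follows the usual definition (the fraction of non-relevant documents that were retrieved), so lower values mean more of the irrelevant material was rejected.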

9 [Venn diagram: the Retrieved set and the Relevant set overlap; their intersection is Relevant + Retrieved, and documents outside both sets are Not Relevant + Not Retrieved.]

10 Effectiveness Measures

Truth \ System     Retrieved             Not Retrieved
Relevant           Relevant Retrieved    Miss
Not Relevant       False Alarm           Irrelevant Rejected

(The slide contrasts a user-oriented view and a system-oriented view of these four outcomes.)
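For concreteness, a sketch that tallies the four outcomes from per-document (is_relevant, is_retrieved) pairs; this helper is illustrative and not part of the slides.

def contingency(outcomes):
    # outcomes: iterable of (is_relevant, is_retrieved) booleans, one per document
    cells = {"relevant retrieved": 0, "false alarm": 0, "miss": 0, "irrelevant rejected": 0}
    for is_relevant, is_retrieved in outcomes:
        if is_relevant and is_retrieved:
            cells["relevant retrieved"] += 1
        elif is_retrieved:
            cells["false alarm"] += 1
        elif is_relevant:
            cells["miss"] += 1
        else:
            cells["irrelevant rejected"] += 1
    return cells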

11 Single-Figure Set-Based Measures
Balanced F-measure
– Harmonic mean of recall and precision: F = 2PR / (P + R)
– Weakness: What if no relevant documents exist?
Cost function
– Reward relevant retrieved, penalize non-relevant retrieved
– For example, 3R⁺ − 2N⁺ (R⁺ = relevant retrieved, N⁺ = non-relevant retrieved)
– Weakness: Hard to normalize, so hard to average
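A sketch of both single-figure measures; the cost weights 3 and 2 are just the slide's example values.

def f1(precision, recall):
    # Balanced F-measure: harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

def example_cost(relevant_retrieved, nonrelevant_retrieved):
    # 3R+ - 2N+: reward each relevant document retrieved, penalize each non-relevant one
    return 3 * relevant_retrieved - 2 * nonrelevant_retrieved   # unnormalized, so hard to average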

12 (Paired) Statistical Significance Tests

Query       1      2      3      4      5      6      7     Average
System A   0.02   0.39   0.16   0.58   0.04   0.09   0.12    0.20
System B   0.76   0.07   0.37   0.21   0.02   0.91   0.46    0.40

Sign test on the signs of the differences (B − A):  + − + − − + +    p = 1.0
Wilcoxon signed-rank test on the differences:  +0.74  −0.32  +0.21  −0.37  −0.02  +0.82  +0.38    p = 0.94
Paired t-test on the same differences:  p = 0.34

[Plot: distribution of differences centered on 0, with the 95% range of outcomes marked.]
Try some at: http://www.socscistatistics.com/tests/signedranks/
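The same three tests can be run with SciPy on the per-query scores above; this sketch assumes SciPy 1.7 or later for binomtest, and exact p-values may differ slightly from the slide depending on the test options used.

import numpy as np
from scipy import stats

a = np.array([0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12])   # System A, queries 1-7
b = np.array([0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46])   # System B, queries 1-7
diff = b - a

# Sign test: two-sided binomial test on how many differences are positive
n_pos = int((diff > 0).sum())
n_nonzero = int((diff != 0).sum())
sign_p = stats.binomtest(n_pos, n_nonzero, 0.5).pvalue

# Wilcoxon signed-rank test and paired t-test on the same paired observations
wilcoxon_p = stats.wilcoxon(b, a).pvalue
ttest_p = stats.ttest_rel(b, a).pvalue

print(f"sign test p={sign_p:.2f}, Wilcoxon p={wilcoxon_p:.2f}, paired t-test p={ttest_p:.2f}")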

13 Reporting Results
Do you have a measurable improvement?
– Inter-assessor agreement limits maximum precision
– Using one judge to assess another yields ~0.8
Do you have a meaningful improvement?
– 0.05 (absolute) in precision might be noticed
– 0.10 (absolute) in precision makes a difference
Do you have a reliable improvement?
– Two-tailed paired statistical significance test

14 Agenda
Evaluation fundamentals
Test collections: evaluating sets
→ Test collections: evaluating rankings
Interleaving
User studies

