
1 Evaluating Evaluation Measure Stability Authors: Chris Buckley, Ellen M. Voorhees Presenters: Burcu Dal, Esra Akbaş

2 Retrieval System Evaluation  Experiments on the accuracy of evaluation measures  Requirements for acceptable experiments:  A reasonable number of requests (topics).  A reasonable evaluation measure.  A reasonable notion of difference between systems. A test collection consists of a set of documents, a set of topics, and a set of relevance judgments.

3 Retrieval System Evaluation-2  Each retrieval strategy produces a ranked list of documents for each topic  The list is ordered by decreasing likelihood of relevance  The effectiveness of a strategy is computed as a function of the ranks of the relevant documents

4 IR Measures  Prec(λ): precision after λ documents are retrieved  Recall(1000): recall after 1000 documents are retrieved  Prec at .5 Recall: interpolated precision at recall level 0.5  R-Prec: precision after R documents, where R is the number of relevant documents  Average Precision: mean of the precision values at each relevant document's rank
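
A minimal sketch of these measures for a single topic, assuming binary relevance judgments and a ranked list of document IDs; the function names are illustrative, not from the paper:

```python
# Hypothetical sketch of the listed measures for one topic, assuming
# binary relevance. `ranking` is a list of doc IDs in system order;
# `relevant` is the (non-empty) set of judged-relevant doc IDs.

def prec_at(ranking, relevant, k):
    """Prec(k): fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(ranking, relevant, k=1000):
    """Recall(1000): fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def prec_at_half_recall(ranking, relevant):
    """Prec at .5 recall: best precision at any rank where recall >= 0.5."""
    hits, best = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
        if hits / len(relevant) >= 0.5:
            best = max(best, hits / i)
    return best

def r_prec(ranking, relevant):
    """R-Prec: precision at rank R, where R is the number of relevant docs."""
    return prec_at(ranking, relevant, len(relevant))

def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)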

5 Computing error rate  Goal: quantify the error rate associated with deciding that one retrieval method is better than another  The error rate is computed for an experiment defined by:  a particular number of topics  a specific evaluation measure  a particular fuzziness value

6  Select an evaluation measure and a fuzziness value  Pick a query set; for each pair of the nine retrieval methods, decide whether the first is better than, worse than, or equal to the second with respect to the fuzziness value, as sketched below
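
A sketch of this pairwise comparison, assuming "equal" means the two mean scores lie within the fuzziness band of the larger score (5% on the next slide); `measure(method, topic)` is a hypothetical scoring function, not an API from the paper:

```python
from itertools import combinations
from statistics import mean

def compare(score_a, score_b, fuzziness=0.05):
    """Decide A>B, B>A, or a tie: mean scores within the fuzziness
    band of the larger score are deemed equivalent."""
    if abs(score_a - score_b) <= fuzziness * max(score_a, score_b):
        return "tie"
    return "A>B" if score_a > score_b else "B>A"

def tally(methods, query_sets, measure, fuzziness=0.05):
    """Build Figure-1-style counts: for every pair of methods, count
    better/worse/tie decisions over all query sets."""
    counts = {}
    for a, b in combinations(methods, 2):
        c = {"A>B": 0, "B>A": 0, "tie": 0}
        for qs in query_sets:
            score_a = mean(measure(a, q) for q in qs)
            score_b = mean(measure(b, q) for q in qs)
            c[compare(score_a, score_b, fuzziness)] += 1
        counts[(a, b)] = c
    return counts
```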

7 Figure 1: Counts of the number of times the retrieval method of the row was better than, worse than, or equal to the method of the column. Counts were computed using a fuzziness factor of 5% and the original 21 query sets.

8  |A > B| denotes the number of times method A is better than method B in an entry.  The number of times methods are deemed equivalent reflects the power of a measure to discriminate among systems.  The proportion of ties and the error rate are derived from these counts, as sketched below
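
A minimal sketch of the two summary statistics, assuming the error rate for a pair is the smaller of |A > B| and |B > A| divided by the total number of decisions (the minority verdict is treated as the error), and the proportion of ties is the tie count over the same total:

```python
def error_rate(c):
    """Assumed definition: the minority directional count over all
    decisions, i.e. the fraction of tests contradicting the majority."""
    total = c["A>B"] + c["B>A"] + c["tie"]
    return min(c["A>B"], c["B>A"]) / total

def proportion_of_ties(c):
    """How often the measure failed to discriminate the two methods."""
    total = c["A>B"] + c["B>A"] + c["tie"]
    return c["tie"] / total
```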

9 Average error rate and average proportion of ties for different evaluation measures.

10 Varying topic set size  Investigate how changing the number of topics used in a test affects the error rate of the evaluation measures  Consider topic set sizes of 5, 10, 15, 20, 25, 30, 40, and 50  Run 100 trials for each topic set size (see the sketch below)
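
A sketch of the trial generation, assuming subsets are drawn uniformly at random from the available topics (the paper's exact sampling scheme may differ); each (size, query_sets) pair can then be fed to the tally() sketch above to recompute error rates at that size:

```python
import random

def topic_trials(all_topics, sizes=(5, 10, 15, 20, 25, 30, 40, 50),
                 n_trials=100, seed=0):
    """For each topic set size, draw n_trials random topic subsets;
    each subset serves as one query set when tallying comparisons."""
    rng = random.Random(seed)
    for size in sizes:
        yield size, [rng.sample(all_topics, size) for _ in range(n_trials)]
```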

11 [figure: error rates across the varying topic set sizes]

12 Varying fuzziness values  Larger fuzziness values decrease the error rate but also decrease the discrimination power of the measure (see the sweep below).
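
This trade-off can be checked directly with the earlier sketches by sweeping the fuzziness band; the values below, and the `methods`, `query_sets`, and `measure` inputs, are illustrative:

```python
from statistics import mean

# Illustrative sweep: widening the fuzziness band converts narrow wins
# into ties, lowering the error rate at the cost of discrimination.
for f in (0.01, 0.05, 0.10, 0.20):
    counts = tally(methods, query_sets, measure, fuzziness=f)
    avg_err = mean(error_rate(c) for c in counts.values())
    avg_tie = mean(proportion_of_ties(c) for c in counts.values())
    print(f"fuzziness={f:.2f}  error={avg_err:.3f}  ties={avg_tie:.3f}")
```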

13 The effect of fuzziness value on average error rate.

14 Conclusion  Error rate depends on  Topic set size  Query size  Fuzziness value  Evaluation measure

15  Thanks

