Evaluating Evaluation Measure Stability Authors: Chris Buckley, Ellen M. Voorhees Presenters: Burcu Dal, Esra Akbaş.

Retrieval System Evaluation  Experiments on the accuracies of evaluation measures  Requirements for acceptable experiments:  Reasonable number of requests.  Reasonable evaluation measure.  Reasonable notion of difference. A test collection consists of a set of documents, a set of topics, and a set of relevance judgments.

Retrieval System Evaluation-2  Each retrieval strategy produces a ranked list of documents for each topic  The list is ordered by decreasing likelihood that the document is relevant to the topic  The effectiveness of a strategy is computed as a function of the ranks of the relevant documents

IR Measures  Prec(λ)  Recall(1000)  Prec at .5 Recall  R-Prec  Average Precision
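The measures listed above can be illustrated for a single topic with a minimal Python sketch. The function names, the toy ranking, and the relevance set are hypothetical, and the interpolation rule used for precision at .5 recall (the best precision at any rank where recall is at least .5) is an assumption, since the slides do not define it.

```python
def precision_at(ranking, relevant, cutoff):
    """Prec(cutoff): fraction of the top `cutoff` documents that are relevant."""
    hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
    return hits / cutoff


def recall_at(ranking, relevant, cutoff):
    """Recall(cutoff): fraction of all relevant documents found in the top `cutoff`."""
    hits = sum(1 for doc in ranking[:cutoff] if doc in relevant)
    return hits / len(relevant)


def r_precision(ranking, relevant):
    """R-Prec: precision after R documents, where R is the number of relevant documents."""
    return precision_at(ranking, relevant, len(relevant))


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision_at_recall(ranking, relevant, level=0.5):
    """Assumed interpolation: best precision at any rank with recall >= level."""
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= level:
            best = max(best, hits / rank)
    return best


# Hypothetical toy data: five retrieved documents, three relevant ones.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
```

On this toy data, two of the three relevant documents are retrieved (at ranks 1 and 3), so the measures differ mainly in how much weight they give to those ranks.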

Computing error rate  Goal: to quantify the error rate associated with deciding that one retrieval method is better than another  The error rate is based on an experiment that uses: a particular number of topics, a specific evaluation measure, and a particular fuzziness value (the percentage difference below which two scores are deemed equivalent)

 Select an evaluation measure and a fuzziness value  Pick a query set and score each of the nine retrieval methods on it  For each pair of methods, decide whether the first is better than, worse than, or equal to the second with respect to the fuzziness value
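The pairwise comparison described above might look like the following Python sketch. The names `compare` and `tally` are hypothetical, and measuring the percentage difference relative to the larger of the two scores is an assumption; the slides only say that scores closer than the fuzziness value count as equal.

```python
from collections import Counter


def compare(score_a, score_b, fuzziness=0.05):
    """Return 'better', 'worse', or 'equal' for method A relative to method B.

    Assumption: the difference is taken relative to the larger score.
    """
    larger = max(score_a, score_b)
    if larger == 0:
        return "equal"  # both scores are zero
    if abs(score_a - score_b) / larger < fuzziness:
        return "equal"
    return "better" if score_a > score_b else "worse"


def tally(scores_a, scores_b, fuzziness=0.05):
    """Count outcomes over paired scores, one pair per query set."""
    return Counter(compare(a, b, fuzziness) for a, b in zip(scores_a, scores_b))


# Hypothetical scores for two methods on three query sets.
scores_a = [0.30, 0.30, 0.20]
scores_b = [0.29, 0.25, 0.25]
outcome = tally(scores_a, scores_b)
```

Here the first pair differs by about 3% and is treated as a tie, while the other two pairs differ by more than 5% and are counted as a win and a loss for method A.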

Figure 1: Counts of the number of times the retrieval method of the row was better than, worse than, or equal to the method of the column. Counts were computed using a fuzziness factor of 5% and the original 21 query sets.

 |A > B| is the number of times method A is better than method B in an entry of the table.  The number of times two methods are deemed equivalent reflects the power of a measure to discriminate among systems.  The proportion of ties is the fraction of all comparisons in which the methods are deemed equivalent.
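The formulas themselves did not survive in this transcript. The sketch below assumes the error rate is the sum, over all method pairs, of the smaller of |A > B| and |A < B|, divided by the total number of comparisons, and that the proportion of ties is the sum of the equal counts over the same total. The function names and the counts are hypothetical illustration data, not values from Figure 1.

```python
def error_rate(counts):
    """counts maps a method pair to (better, worse, equal) counts.

    Assumption: the minority outcome of each pair is treated as the error.
    """
    wrong = sum(min(better, worse) for better, worse, _ in counts.values())
    total = sum(better + worse + equal for better, worse, equal in counts.values())
    return wrong / total


def proportion_of_ties(counts):
    """Fraction of all comparisons in which the two methods were deemed equal."""
    ties = sum(equal for _, _, equal in counts.values())
    total = sum(better + worse + equal for better, worse, equal in counts.values())
    return ties / total


# Hypothetical counts over 21 query sets for three method pairs.
counts = {
    ("A", "B"): (15, 2, 4),
    ("A", "C"): (10, 10, 1),
    ("B", "C"): (0, 18, 3),
}
```

With these counts the error rate is 12/63 and the proportion of ties is 8/63. A pair such as A and C, which splits evenly, contributes the most to the error rate.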

Average error rate and average proportion of ties for different evaluation measures.

Varying topic set size  Investigate how changing the number of topics used in a test affects the error rate of the evaluation measures  Look at topic set sizes of 5, 10, 15, 20, 25, 30, 40, and 50  Run 100 trials for each topic set size
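One way to set up such trials is sketched below in Python. This is a simplified illustration and not necessarily the sampling procedure the authors used: it assumes a pool of 50 topics with hypothetical identifiers 1 to 50, and draws each trial as a random subset without replacement. The helper names are hypothetical.

```python
import random


def sample_topic_sets(topics, size, trials=100, seed=0):
    """Draw `trials` random topic subsets of the given size (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the trials are reproducible
    return [rng.sample(topics, size) for _ in range(trials)]


def mean_score(per_topic_scores, topic_subset):
    """Average a method's per-topic scores over one topic subset."""
    return sum(per_topic_scores[t] for t in topic_subset) / len(topic_subset)


topics = list(range(1, 51))  # assumed pool of 50 topics
sizes = [5, 10, 15, 20, 25, 30, 40, 50]
subsets = {size: sample_topic_sets(topics, size) for size in sizes}
```

Each method's mean score on a subset can then be compared pairwise under the chosen fuzziness value, and the outcomes accumulated per topic set size.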

Varying fuzziness values  Larger fuzziness values decrease the error rate but also decrease the discrimination power of the measure.

The effect of fuzziness value on average error rate.

Conclusion  Error rate depends on  Topic set size  Query size  Fuzziness value  Evaluation measure

 Thanks
