1 Developments in Evaluation of Search Engines
Mark Sanderson, University of Sheffield

2 Evaluation in IR
Use a test collection: a set of
Documents
Topics
Relevance judgements
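As a rough sketch of what those three parts look like in practice (not taken from the slides), a TREC-style test collection can be held in three simple structures. The qrels layout assumed below (topic, iteration, document ID, judgement) is one common convention, not something the presentation specifies.

```python
# Minimal sketch of the three components of a test collection.
# The qrels layout (topic, iteration, doc_id, judgement) is an
# assumption for illustration; real collections vary in format.

documents = {
    "DOC-001": "Text of the first document ...",
    "DOC-002": "Text of the second document ...",
}

topics = {
    "51": "nuclear waste dumping",
    "52": "radioactive waste storage",
}

def load_qrels(path):
    """Read relevance judgements into {topic_id: {doc_id: judgement}}."""
    qrels = {}
    with open(path) as f:
        for line in f:
            topic_id, _iteration, doc_id, judgement = line.split()
            qrels.setdefault(topic_id, {})[doc_id] = int(judgement)
    return qrels
```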

3 How to get lots of judgements?
Do you check all documents for all topics?
In the old days, yes. But this doesn't scale.

4 To form larger test collections
Get your relevance judgements from pools.
How does that work?

5 Pooling – many participants
[Diagram: a collection searched by seven participant runs (Run 1 ... Run 7), whose top-ranked results are merged into a single pool for judging.]

6 Classic pool formation
Pool the top-ranked documents from many runs.
Judge 1-2K documents per topic: 10-20 hours per topic.
With 50 topics, too much effort for one person.
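The slides describe pooling only in outline; the sketch below is one plausible reading, in which each run maps a topic to a ranked list of document IDs and the pool for a topic is the union of every run's top-k results. The names `form_pool` and `pool_depth` are illustrative, not from the presentation.

```python
def form_pool(runs, topic_id, pool_depth=100):
    """Union of the top `pool_depth` documents from every run for one topic.

    `runs` is assumed to be a list of dicts mapping topic_id -> ranked
    doc_id list (real TREC run files carry more fields). Only the pooled
    documents are shown to the relevance assessors.
    """
    pool = set()
    for run in runs:
        pool.update(run.get(topic_id, [])[:pool_depth])
    return pool

# Toy example: the pool is much smaller than the whole collection.
run_a = {"51": ["d3", "d7", "d1", "d9"]}
run_b = {"51": ["d7", "d2", "d3", "d8"]}
print(sorted(form_pool([run_a, run_b], "51", pool_depth=3)))
# -> ['d1', 'd2', 'd3', 'd7']
```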

7 Look at the two problem areas
Pooling requires many participants.
Relevance assessment requires many person-hours.

8 Query pooling
Don't have multiple runs from different groups?
Have one person create multiple queries instead.

9 Query pooling
First proposed by:
Cormack, G.V., Palmer, C.R., Clarke, C.L.A. (1998): Efficient Construction of Large Test Collections, in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Confirmed by:
Sanderson, M., Joho, H. (2004): Forming Test Collections with No System Pooling, in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

10 Query pooling
[Diagram: for the topic "Nuclear waste dumping", one assessor searches the collection with several query variants ("radioactive waste", "radioactive waste storage", "hazardous waste", "nuclear waste storage", "Utah nuclear waste", "waste dump"), and the results are pooled.]
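A hedged sketch of query pooling as pictured above: one assessor issues several reformulations of the same topic and the top results of each are unioned into the pool. The toy `search` function is a stand-in assumption, not the retrieval system used in the cited work.

```python
def search(query, documents, k=10):
    """Toy keyword search: rank documents by how many query terms they contain.
    A stand-in for whatever real engine the assessor would use."""
    terms = set(query.lower().split())
    ranked = sorted(documents.items(),
                    key=lambda kv: -len(terms & set(kv[1].lower().split())))
    return [doc_id for doc_id, _text in ranked[:k]]

def query_pool(query_variants, documents, k=10):
    """Pool built from one assessor's reformulations of a single topic."""
    pool = set()
    for query in query_variants:
        pool.update(search(query, documents, k))
    return pool

variants = ["radioactive waste", "radioactive waste storage", "hazardous waste",
            "nuclear waste storage", "Utah nuclear waste", "waste dump"]
docs = {"d1": "nuclear waste storage in Utah",
        "d2": "hazardous waste dump regulations",
        "d3": "radioactive waste handling"}
print(sorted(query_pool(variants, docs, k=2)))
```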

11 Another approach
Maybe your assessors can read very fast, but can't search very well.
Form different queries with relevance feedback.

12 Query pooling, relevance feedback
[Diagram: for the topic "Nuclear waste dumping", six rounds of relevance feedback (Feedback 1 through Feedback 6) each search the collection, and their results are pooled.]
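One speculative reading of the diagram, sketched in code: after each round of judging, terms from the newly judged relevant documents seed the next query. The expansion rule used here (appending the most frequent terms from relevant documents) is an assumption for illustration, not the method of the cited TREC work; `search` and `judge` are supplied by the caller.

```python
from collections import Counter

def feedback_pooling(topic_query, search, judge, documents, rounds=6, k=10):
    """Build a judging pool over several relevance feedback rounds.

    `search(query, documents, k)` returns a ranked list of doc_ids and
    `judge(doc_id)` returns True for relevant documents; the human
    assessor plays the role of `judge`.
    """
    pool, query = set(), topic_query
    for _ in range(rounds):
        results = search(query, documents, k)
        pool.update(results)
        relevant = [d for d in results if judge(d)]
        if not relevant:
            break
        # Expand the query with the most common terms from the relevant docs.
        term_counts = Counter(w for d in relevant
                              for w in documents[d].lower().split())
        query = topic_query + " " + " ".join(
            t for t, _ in term_counts.most_common(5))
    return pool
```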

13 Relevance feedback
Use relevance feedback to form queries.
Soboroff, I., Robertson, S. (2003): Building a Filtering Test Collection for TREC 2002, in Proceedings of the ACM SIGIR Conference.

14 Both options save time
With query pooling: 2 hours per topic
With system pooling: 10-20 hours per topic?

15 Notice, didn't get everything
How much was missed? Attempts to estimate:
Zobel, ACM SIGIR 1998
Manmatha, ACM SIGIR 2001
[Plot: probability of relevance P(r) against rank.]

16 Do missing rels matter?
For conventional IR testing? No, not interested in such things.
Just want to know whether A>B, A=B, or A<B.
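As an illustration of the A-versus-B comparison the slide refers to: with a fixed (possibly incomplete) set of judgements, two systems can still be ranked against each other by a measure such as mean average precision. The helpers below are a generic sketch, not code from the presentation; unjudged documents are treated as non-relevant, the usual convention with pooled judgements.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one topic; documents outside the judged pool count as non-relevant."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant_doc_ids), 1)

def mean_average_precision(run, qrels):
    """`run`: {topic: ranked doc_ids}; `qrels`: {topic: set of relevant doc_ids}."""
    return sum(average_precision(run.get(t, []), rel)
               for t, rel in qrels.items()) / len(qrels)

# Deciding A>B, A=B or A<B is then a comparison of two MAP scores,
# ideally backed by a significance test over the per-topic differences.
```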

17 Not good enough?
1-2 hours per topic is still a lot of work.
Hints that 50 topics are too few: the Million Query track of TREC.
What can we do?

18 Test collections are
Reproducible
Reusable
Encourage collaboration
Cross-comparison
Tell you if your new idea works
Help you publish your work

19 How do you do this?
Focus on reducing the number of relevance assessments.

20 Simple approach
TREC/CLEF: judge down to top 100 (sometimes 50).
Instead, judge down to top 10: far fewer documents.
11%-14% of the relevance assessor effort compared to top 100.
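The 11%-14% figure can be checked for any set of runs by comparing pool sizes at the two judging depths, assuming assessor effort is roughly proportional to the number of pooled documents; the function names below are illustrative.

```python
def pool_size(runs, topic_id, depth):
    """Number of unique documents to judge for one topic at a given pool depth."""
    pool = set()
    for run in runs:
        pool.update(run.get(topic_id, [])[:depth])
    return len(pool)

def effort_ratio(runs, topic_ids, shallow=10, deep=100):
    """Judging effort at the shallow depth as a fraction of the deep depth."""
    shallow_total = sum(pool_size(runs, t, shallow) for t in topic_ids)
    deep_total = sum(pool_size(runs, t, deep) for t in topic_ids)
    return shallow_total / deep_total
```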

21 Impact of saving
Save a lot of time.
Lose a little in measurement accuracy.

22 Use time saved
Work on more topics: measurement accuracy improves.
Sanderson, M., Zobel, J. (2005): Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability, in Proceedings of the 28th Annual International ACM SIGIR Conference.

23 Questions? dis.shef.ac.uk/mark

