1 Developments in Evaluation of Search Engines
Mark Sanderson, University of Sheffield

2 Evaluation in IR
Use a test collection: a set of
Documents
Topics
Relevance judgements
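As a rough sketch of what those three parts look like in practice (not taken from the slides), a TREC-style test collection can be held in three simple structures. The qrels layout assumed below (topic, iteration, document ID, judgement) is one common convention, not something the presentation specifies.

```python
# Minimal sketch of the three components of a test collection.
# The qrels layout (topic, iteration, doc_id, judgement) is an
# assumption for illustration; real collections vary in format.

documents = {
    "DOC-001": "Text of the first document ...",
    "DOC-002": "Text of the second document ...",
}

topics = {
    "51": "nuclear waste dumping",
    "52": "radioactive waste storage",
}

def load_qrels(path):
    """Read relevance judgements into {topic_id: {doc_id: judgement}}."""
    qrels = {}
    with open(path) as f:
        for line in f:
            topic_id, _iteration, doc_id, judgement = line.split()
            qrels.setdefault(topic_id, {})[doc_id] = int(judgement)
    return qrels
```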

3 How to get lots of judgements?
Do you check all documents for all topics?
In the old days, yes. But this doesn't scale.

4 To form larger test collections
Get your relevance judgements from pools.
How does that work?

5 Pooling – many participants
[Diagram: a collection searched by seven participant runs (Run 1 ... Run 7), whose top-ranked results are merged into a single pool for judging.]

6 Classic pool formation
Pool the top-ranked documents from many runs.
Judge 1-2K documents per topic: 10-20 hours per topic.
With 50 topics, too much effort for one person.
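The slides describe pooling only in outline; the sketch below is one plausible reading, in which each run maps a topic to a ranked list of document IDs and the pool for a topic is the union of every run's top-k results. The names `form_pool` and `pool_depth` are illustrative, not from the presentation.

```python
def form_pool(runs, topic_id, pool_depth=100):
    """Union of the top `pool_depth` documents from every run for one topic.

    `runs` is assumed to be a list of dicts mapping topic_id -> ranked
    doc_id list (real TREC run files carry more fields). Only the pooled
    documents are shown to the relevance assessors.
    """
    pool = set()
    for run in runs:
        pool.update(run.get(topic_id, [])[:pool_depth])
    return pool

# Toy example: the pool is much smaller than the whole collection.
run_a = {"51": ["d3", "d7", "d1", "d9"]}
run_b = {"51": ["d7", "d2", "d3", "d8"]}
print(sorted(form_pool([run_a, run_b], "51", pool_depth=3)))
# -> ['d1', 'd2', 'd3', 'd7']
```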

7 Look at the two problem areas
Pooling requires many participants.
Relevance assessment requires many person-hours.

8 Query pooling
Don't have multiple runs from different groups?
Have one person create multiple queries instead.

9 Query pooling
First proposed by:
Cormack, G.V., Palmer, C.R., Clarke, C.L.A. (1998): Efficient Construction of Large Test Collections, in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Confirmed by:
Sanderson, M., Joho, H. (2004): Forming Test Collections with No System Pooling, in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

10 Query pooling
[Diagram: for the topic "Nuclear waste dumping", one assessor searches the collection with several query variants ("radioactive waste", "radioactive waste storage", "hazardous waste", "nuclear waste storage", "Utah nuclear waste", "waste dump"), and the results are pooled.]
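A hedged sketch of query pooling as pictured above: one assessor issues several reformulations of the same topic and the top results of each are unioned into the pool. The toy `search` function is a stand-in assumption, not the retrieval system used in the cited work.

```python
def search(query, documents, k=10):
    """Toy keyword search: rank documents by how many query terms they contain.
    A stand-in for whatever real engine the assessor would use."""
    terms = set(query.lower().split())
    ranked = sorted(documents.items(),
                    key=lambda kv: -len(terms & set(kv[1].lower().split())))
    return [doc_id for doc_id, _text in ranked[:k]]

def query_pool(query_variants, documents, k=10):
    """Pool built from one assessor's reformulations of a single topic."""
    pool = set()
    for query in query_variants:
        pool.update(search(query, documents, k))
    return pool

variants = ["radioactive waste", "radioactive waste storage", "hazardous waste",
            "nuclear waste storage", "Utah nuclear waste", "waste dump"]
docs = {"d1": "nuclear waste storage in Utah",
        "d2": "hazardous waste dump regulations",
        "d3": "radioactive waste handling"}
print(sorted(query_pool(variants, docs, k=2)))
```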

11 Another approach
Maybe your assessors can read very fast, but can't search very well.
Form different queries with relevance feedback.

12 Query pooling, relevance feedback
[Diagram: for the topic "Nuclear waste dumping", six rounds of relevance feedback (Feedback 1 through Feedback 6) each search the collection, and their results are pooled.]
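One speculative reading of the diagram, sketched in code: after each round of judging, terms from the newly judged relevant documents seed the next query. The expansion rule used here (appending the most frequent terms from relevant documents) is an assumption for illustration, not the method of the cited TREC work; `search` and `judge` are supplied by the caller.

```python
from collections import Counter

def feedback_pooling(topic_query, search, judge, documents, rounds=6, k=10):
    """Build a judging pool over several relevance feedback rounds.

    `search(query, documents, k)` returns a ranked list of doc_ids and
    `judge(doc_id)` returns True for relevant documents; the human
    assessor plays the role of `judge`.
    """
    pool, query = set(), topic_query
    for _ in range(rounds):
        results = search(query, documents, k)
        pool.update(results)
        relevant = [d for d in results if judge(d)]
        if not relevant:
            break
        # Expand the query with the most common terms from the relevant docs.
        term_counts = Counter(w for d in relevant
                              for w in documents[d].lower().split())
        query = topic_query + " " + " ".join(
            t for t, _ in term_counts.most_common(5))
    return pool
```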

13 Relevance feedback
Use relevance feedback to form queries.
Soboroff, I., Robertson, S. (2003): Building a Filtering Test Collection for TREC 2002, in Proceedings of the ACM SIGIR Conference.

14 Both options save time
With query pooling: 2 hours per topic
With system pooling: 10-20 hours per topic?

15 Notice, didn't get everything
How much was missed? Attempts to estimate:
Zobel, ACM SIGIR 1998
Manmatha, ACM SIGIR 2001
[Plot: probability of relevance P(r) against rank.]

16 Do missing rels matter?
For conventional IR testing? No, not interested in such things.
Just want to know whether A>B, A=B, or A<B.
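As an illustration of the A-versus-B comparison the slide refers to: with a fixed (possibly incomplete) set of judgements, two systems can still be ranked against each other by a measure such as mean average precision. The helpers below are a generic sketch, not code from the presentation; unjudged documents are treated as non-relevant, the usual convention with pooled judgements.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one topic; documents outside the judged pool count as non-relevant."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant_doc_ids), 1)

def mean_average_precision(run, qrels):
    """`run`: {topic: ranked doc_ids}; `qrels`: {topic: set of relevant doc_ids}."""
    return sum(average_precision(run.get(t, []), rel)
               for t, rel in qrels.items()) / len(qrels)

# Deciding A>B, A=B or A<B is then a comparison of two MAP scores,
# ideally backed by a significance test over the per-topic differences.
```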

17 Not good enough?
1-2 hours per topic is still a lot of work.
Hints that 50 topics are too few: the Million Query track of TREC.
What can we do?

18 Test collections are
Reproducible
Reusable
Encourage collaboration
Cross-comparison
Tell you if your new idea works
Help you publish your work

19 How do you do this?
Focus on reducing the number of relevance assessments.

20 Simple approach
TREC/CLEF: judge down to top 100 (sometimes 50).
Instead, judge down to top 10: far fewer documents.
11%-14% of the relevance assessor effort compared to top 100.
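The 11%-14% figure can be checked for any set of runs by comparing pool sizes at the two judging depths, assuming assessor effort is roughly proportional to the number of pooled documents; the function names below are illustrative.

```python
def pool_size(runs, topic_id, depth):
    """Number of unique documents to judge for one topic at a given pool depth."""
    pool = set()
    for run in runs:
        pool.update(run.get(topic_id, [])[:depth])
    return len(pool)

def effort_ratio(runs, topic_ids, shallow=10, deep=100):
    """Judging effort at the shallow depth as a fraction of the deep depth."""
    shallow_total = sum(pool_size(runs, t, shallow) for t in topic_ids)
    deep_total = sum(pool_size(runs, t, deep) for t in topic_ids)
    return shallow_total / deep_total
```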

21 Impact of saving
Save a lot of time.
Lose a little in measurement accuracy.

22 Use time saved
Work on more topics: measurement accuracy improves.
Sanderson, M., Zobel, J. (2005): Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability, in Proceedings of the 28th Annual International ACM SIGIR Conference.

23 Questions? dis.shef.ac.uk/mark

