Modern Retrieval Evaluations
Hongning Wang

What we already know about IR evaluation
Three key elements of IR evaluation
– A document collection
– A test suite of information needs
– A set of relevance judgments
Evaluation of unranked retrieval sets
– Precision/Recall
Evaluation of ranked retrieval results
– MAP, MRR, NDCG (a small NDCG computation is sketched below)
Statistical significance
– Guard against conclusions driven by random variation
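A minimal sketch of how NDCG@k can be computed for one query, using the linear-gain DCG formulation (rel / log2(rank + 1)); the graded relevance labels in the example are hypothetical.

```python
# NDCG@k for a single ranked list, assuming graded relevance labels
# (0 = non-relevant, higher = more relevant), linear-gain formulation.
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (0-based rank)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels of the documents a system returned, in rank order.
print(ndcg_at_k([3, 2, 0, 1, 0], k=5))   # ~0.985
```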

Rethinking retrieval evaluation
Goal of any IR system
– Satisfying users' information needs
Core quality criterion
– "how well a system meets the information needs of its users" (Wikipedia)
Are traditional IR evaluations sufficient for this purpose?
– What is missing?

Do user preferences and evaluation measures line up? [Sanderson et al., SIGIR'10]
Research questions
1. Does effectiveness measured on a test collection predict user preferences for one IR system over another?
2. If such predictive power exists, does the strength of prediction vary across different search tasks and topic types?
3. If present, does the predictive power vary when different effectiveness measures are employed?
4. When choosing one system over another, what reasons do users give for their choice?

Experiment settings
User population
– Crowdsourced via Amazon Mechanical Turk
– 296 ordinary users
Test collection
– TREC'09 Web track: 50 million documents from ClueWeb09
– 30 topics, each with several sub-topics
– Binary relevance judgments against the sub-topics

Experiment settings (cont'd)
IR systems
– 19 runs submitted to the TREC evaluation
Users make side-by-side comparisons and state their preference between the two systems' ranking results

Experimental results
User preferences vs. retrieval metrics
– Metrics generally match users' preferences; no significant differences between metrics

Experimental results: zooming into nDCG
– Separate the comparisons into groups with small and large nDCG differences (relative to the mean difference)
– Users tend to agree with the metric more often when the difference between the ranking results is large

Experimental results
What if one system did not retrieve anything relevant?
– All metrics agree, and they mostly align with the users' preferences

Experimental results
What if both systems retrieved something relevant at the top positions?
– It becomes hard to distinguish the difference between the systems

Conclusions of this study
– IR evaluation metrics measured on a test collection predict user preferences for one IR system over another
– The correlation is strong when the performance difference is large
– The predictive power of different metrics varies

How does clickthrough data reflect retrieval quality? [Radlinski et al., CIKM'08]
User-behavior-oriented retrieval evaluation
– Low cost
– Large scale
– Natural usage context and utility
Common practice in modern search engines
– A/B testing

A/B test
Two-sample hypothesis testing
– Two versions (A and B) are compared; they are identical except for one variation that might affect a user's behavior
  E.g., BM25 with different parameter settings
– Randomized experiment
  Split the population into equal-sized groups, e.g., 10% of users randomly assigned to system A and 10% to system B
– Null hypothesis: no difference between systems A and B
  Tested with a z-test or t-test (a sketch follows below)
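As a concrete illustration of the z-test step, here is a minimal sketch of a two-proportion z-test comparing, say, the fraction of non-abandoned queries in bucket A against bucket B; the counts are hypothetical.

```python
# Two-proportion z-test for an A/B experiment on a binary per-query outcome
# (e.g., "at least one result was clicked"). Counts below are hypothetical.
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Z statistic under H0: both buckets have the same success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z = two_proportion_z(success_a=5400, n_a=10000, success_b=5250, n_b=10000)
print(z)   # |z| > 1.96 rejects H0 at the 5% level (two-sided)
```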

Behavior-based metrics
Abandonment rate
– Fraction of queries for which no results were clicked
Reformulation rate
– Fraction of queries that were followed by another query during the same session
Queries per session
– Mean number of queries issued by a user during a session
(A sketch computing these from a session log follows.)
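A minimal sketch of these three metrics computed from a session log; the log format (a list of sessions, each a chronological list of query records carrying a click count) is an assumption made for illustration.

```python
# Abandonment rate, reformulation rate, and queries per session from a
# hypothetical session log.
def behavior_metrics(sessions):
    """sessions: list of sessions, each a chronological list of dicts
    like {"query": str, "clicks": int}."""
    total_queries = sum(len(s) for s in sessions)
    abandoned = sum(1 for s in sessions for q in s if q["clicks"] == 0)
    # A query is "reformulated" if another query follows it in the same session.
    reformulated = sum(max(len(s) - 1, 0) for s in sessions)
    return {
        "abandonment_rate": abandoned / total_queries,
        "reformulation_rate": reformulated / total_queries,
        "queries_per_session": total_queries / len(sessions),
    }

log = [
    [{"query": "ir eval", "clicks": 0}, {"query": "ir evaluation", "clicks": 2}],
    [{"query": "ndcg", "clicks": 1}],
]
print(behavior_metrics(log))   # rates of 1/3, 1/3, and 1.5 queries per session
```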

Behavior-based metrics (cont'd)
– [Further metric definitions shown on the slides]

Experiment setup
Philosophy
– Given systems with known relative ranking performance
– Test which metric can recognize that difference
This reverses the usual logic of hypothesis testing
– In hypothesis testing, we choose a system based on the test statistic
– In this study, we choose the test statistic based on the systems

Constructing comparison systems
Orig > Flat > Rand
– Orig: the original ranking algorithm used by arXiv.org
– Flat: the original ranking algorithm with structure features (known to be important) removed
– Rand: a random shuffle of Flat's results
Orig > Swap2 > Swap4
– Swap2: randomly selects two documents from the top 5 and swaps them with two random documents from ranks 6–10 (and likewise on each following page)
– Swap4: like Swap2, but swaps four documents (a sketch of this perturbation follows below)
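A minimal sketch of the Swap-k degradation described above; anything beyond the slide's description (e.g., how short result lists are handled) is an assumption.

```python
# Swap-k degradation: on each result page, swap k random documents from the
# top 5 with k random documents from ranks 6-10. Swap2 uses k=2, Swap4 k=4.
import random

def swap_k(ranking, k=2, page_size=10):
    """Return a degraded copy of `ranking` (a list of document ids)."""
    degraded = list(ranking)
    for start in range(0, len(degraded), page_size):
        top = list(range(start, min(start + 5, len(degraded))))
        bottom = list(range(start + 5, min(start + page_size, len(degraded))))
        n = min(k, len(top), len(bottom))
        for i, j in zip(random.sample(top, n), random.sample(bottom, n)):
            degraded[i], degraded[j] = degraded[j], degraded[i]
    return degraded

print(swap_k([f"d{i}" for i in range(1, 11)], k=2))   # one hypothetical page
```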

Results of the A/B test
– 1/6 of arXiv.org's users were routed to each of the test systems over a one-month period
– [Result plots shown on the slides]
– Few of these comparisons are statistically significant

Interleaved test
Design principle borrowed from sensory analysis
– Instead of asking for absolute ratings, ask for a relative comparison between alternatives
  E.g., is A better than B?
– Randomized experiment
  Interleave the results from A and B, show the interleaved list to the same population, and collect their preferences
  Run a hypothesis test over the preference votes

Interleaving for IR evaluation
Team-draft interleaving
– [Worked example on the slides: Ranking A, Ranking B, the interleaved ranking, and the random coin flips (RND) deciding which ranker picks first in each round]
(A code sketch of team-draft interleaving follows.)
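A minimal sketch of team-draft interleaving as commonly formulated: in each round a coin flip (the RND on the slide) decides which ranker picks first, each ranker contributes its highest-ranked document not yet in the interleaved list, and the contributing ranker is recorded so that clicks can be credited to A or B afterwards.

```python
# Team-draft interleaving: merge rankings A and B into one list and remember
# which ranker contributed each document.
import random

def team_draft_interleave(ranking_a, ranking_b):
    interleaved, team = [], {}              # team maps doc -> "A" or "B"
    a_iter, b_iter = iter(ranking_a), iter(ranking_b)

    def pick(it, label):
        for doc in it:
            if doc not in team:             # skip documents already placed
                interleaved.append(doc)
                team[doc] = label
                return True
        return False                        # this ranker has nothing new left

    while True:
        a_first = random.random() < 0.5     # the RND coin flip for this round
        order = [(a_iter, "A"), (b_iter, "B")] if a_first else [(b_iter, "B"), (a_iter, "A")]
        progressed = [pick(it, label) for it, label in order]
        if not any(progressed):
            return interleaved, team

mixed, credit = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(mixed, credit)
```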

Results of the interleaved test
– 1/6 of arXiv.org's users were routed to each test condition over a one-month period
– Test which ranker's results receive more clicks in the interleaved lists (a sketch of this preference test follows below)
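A minimal sketch of the preference test, assuming per-impression click credits from team-draft interleaving: an impression where one ranker's documents receive strictly more clicks counts as a win for that ranker, ties are dropped, and a two-sided binomial (sign) test checks whether the win rate differs from 50%. The tallies are hypothetical.

```python
# Sign test on interleaving outcomes: is ranker A preferred over ranker B?
from scipy.stats import binomtest   # SciPy >= 1.7

def interleaving_sign_test(wins_a, wins_b):
    """Two-sided p-value for H0: P(A wins an impression) = 0.5, ties ignored."""
    return binomtest(wins_a, wins_a + wins_b, p=0.5, alternative="two-sided").pvalue

print(interleaving_sign_test(wins_a=310, wins_b=250))   # ~0.01 -> A preferred
```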

Conclusions
The interleaved test is more accurate and sensitive
– 4 out of 6 experiments follow our expectation
Only the click count is used in this interleaved test
– More signals could be incorporated, e.g., dwell time, reciprocal rank of the click, whether the click leads to a download, whether it is the last click, whether it is the first click

Comparing the sensitivity of information retrieval metrics [Radlinski & Craswell, SIGIR'10]
How sensitive are these IR evaluation metrics?
– How many queries do we need to get a confident comparison result?
– How quickly can a metric recognize the difference between IR systems?

Experiment setup
– IR systems with known relative search effectiveness
– A large annotated corpus: 12k queries, with each retrieved document labeled on a 5-grade relevance scale
– A large collection of real users' clicks from a major commercial search engine
Approach
– Gradually increase the evaluation query sample size and check when each metric reaches the correct conclusion (a sketch of this procedure follows below)
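A minimal sketch of that procedure, with synthetic per-query scores standing in for real metric values (e.g., per-query NDCG@5): for each sample size, repeatedly draw that many queries and record how often the sample mean ranks the known-better system A above system B.

```python
# Sensitivity by subsampling: how often does a metric's mean over a random
# query sample agree with the known ordering A > B?
import random

def agreement_rate(scores_a, scores_b, sample_size, trials=1000):
    queries = list(range(len(scores_a)))
    agree = 0
    for _ in range(trials):
        sample = random.sample(queries, sample_size)
        mean_a = sum(scores_a[q] for q in sample) / sample_size
        mean_b = sum(scores_b[q] for q in sample) / sample_size
        agree += mean_a > mean_b
    return agree / trials

# Synthetic per-query scores for 2,000 queries; A is slightly better on average.
scores_a = [random.betavariate(6.0, 4.0) for _ in range(2000)]
scores_b = [random.betavariate(5.5, 4.5) for _ in range(2000)]
for n in (10, 50, 200, 1000):
    print(n, agreement_rate(scores_a, scores_b, n))
```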

Sensitivity of IR metrics (known system effectiveness: A > B > C)
– [Result figures shown on the slides]

Sensitivity of interleaving
– [Result figure shown on the slide]

Correlation between IR metrics and interleaving
– [Result figure shown on the slide]

How to assess search result quality?
Query-level relevance evaluation
– Metrics: MAP, NDCG, MRR
Task-level satisfaction evaluation
– Users' satisfaction with the whole search task
Goal: find existing work on "action-level search satisfaction prediction"

Example of a search task
Information need: find out what metal can float on water

Search Action                                   Engine   Time
Q:  metals float on water                       Google   10s
SR: wiki.answers.com                                      2s
BR: blog.sciseek.com                                      3s
Q:  which metals float on water                 Google   31s
Q:  metals floating on water                    Google   16s
SR:
Q:  metals floating on water                    Bing     53s
Q:  lithium sodium potassium float on water     Google   38s
SR:

(Annotations on the slide mark a quick back, a query reformulation, and a search engine switch.)

Beyond DCG: User Behavior as a Predictor of a Successful Search [Ahmed Hassan et al., WSDM'10]
Model users' sequential search behaviors with Markov models
– One model for successful search patterns
– One model for unsuccessful search patterns
Parameters estimated by maximum likelihood on an annotated data set

Predicting user satisfaction
– Prior: the difficulty of the task, or the user's search expertise
– Likelihood: how well each model explains the user's observed behavior
– [Figure on the slide: prediction performance for search-task satisfaction]
(A sketch of this likelihood-based prediction follows.)
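A minimal sketch of the two-model idea, under simplifying assumptions not spelled out on the slides: action sequences come from a small vocabulary (Q = query, SR = result click, END = stop), each class (satisfied vs. unsatisfied) gets its own first-order Markov transition model estimated by maximum likelihood with add-one smoothing, and a new session is labeled by comparing prior-weighted log-likelihoods.

```python
# Two class-conditional first-order Markov models over search actions, used
# to predict whether a search task was satisfied. Action names, smoothing,
# and the uniform prior are illustrative assumptions.
from collections import defaultdict
import math

def train_markov(sequences, smoothing=1.0):
    """Estimate log transition probabilities from labeled action sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    states = set()
    for seq in sequences:
        states.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    log_p = {}
    for prev in states:
        total = sum(counts[prev].values()) + smoothing * len(states)
        for nxt in states:
            log_p[(prev, nxt)] = math.log((counts[prev][nxt] + smoothing) / total)
    return log_p

def log_likelihood(seq, log_p, floor=math.log(1e-6)):
    return sum(log_p.get(pair, floor) for pair in zip(seq, seq[1:]))

# Hypothetical labeled sessions.
satisfied   = [["Q", "SR", "END"], ["Q", "SR", "SR", "END"]]
unsatisfied = [["Q", "Q", "Q", "END"], ["Q", "SR", "Q", "Q", "END"]]
model_sat, model_unsat = train_markov(satisfied), train_markov(unsatisfied)

session = ["Q", "Q", "SR", "Q", "END"]
score = (math.log(0.5) + log_likelihood(session, model_sat)) - \
        (math.log(0.5) + log_likelihood(session, model_unsat))
print("satisfied" if score > 0 else "unsatisfied")   # many reformulations -> unsatisfied
```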

What you should know
– IR evaluation metrics generally align with users' result preferences
– A/B tests vs. interleaved tests
– Sensitivity of evaluation metrics
– Direct evaluation of search satisfaction