Practical and Reliable Retrieval Evaluation Through Online Experimentation. WSDM Workshop on Web Search Click Data, February 12th, 2012. Yisong Yue, Carnegie Mellon University

Offline Post-hoc Analysis Launch some ranking function on live traffic – Collect usage data (clicks) – Often beyond our control Do something with the data – User modeling, learning to rank, etc Did we improve anything? – Often only evaluated on pre-collected data

Evaluating via Click Logs [screenshot of a result page with an observed click]: Suppose our model swaps results 1 and 6. Did retrieval quality improve?

What Results do Users View/Click? [Joachims et al. 2005, 2007]

Online Evaluation Try out new ranking function on real users Collect usage data Interpret usage data Conclude whether or not quality has improved

Challenges Establishing a live system Getting real users Needs to be practical – Evaluation shouldn't take too long – i.e., a sensitive experiment Needs to be reliable – Feedback needs to be properly interpretable – Not too systematically biased Interleaving Experiments!

Team Draft Interleaving [Radlinski et al., 2008]
Ranking A: 1. Napa Valley – The authority for lodging… 2. Napa Valley Wineries – Plan your wine… 3. Napa Valley College 4. Been There | Tips | Napa Valley 5. Napa Valley Wineries and Wine 6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
Ranking B: 1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley) 2. Napa Valley – The authority for lodging… 3. Napa: The Story of an American Eden… (books.google.co.uk/books?isbn=…) 4. Napa Valley Hotels – Bed and Breakfast… 5. NapaValley.org 6. The Napa Valley Marathon
Presented Ranking: 1. Napa Valley – The authority for lodging… (A) 2. Napa Country, California – Wikipedia (B) 3. Napa: The Story of an American Eden… (B) 4. Napa Valley Wineries – Plan your wine… (A) 5. Napa Valley Hotels – Bed and Breakfast… (B) 6. Napa Valley College (A) 7. NapaValley.org (B)
Clicks are credited to the team (A or B) that contributed the clicked result; in the example shown, the outcome for this query is a tie.
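To make the mechanism concrete, here is a minimal Python sketch of team-draft interleaving and per-query click credit, following the description in [Radlinski et al., 2008]. Function and variable names are ours, and a production system would also handle deduplication, ranking length, and logging; treat this as an illustrative sketch rather than the authors' code.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length=10, rng=None):
    """Sketch of team-draft interleaving (after Radlinski et al., 2008).
    Returns the combined ranking and, per position, which team ('A'/'B')
    contributed the result."""
    rng = rng or random.Random()
    interleaved, teams = [], []
    picks_a = picks_b = 0
    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by coin flip.
        pick_a = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        ranking = ranking_a if pick_a else ranking_b
        doc = next((d for d in ranking if d not in interleaved), None)
        if doc is None:                      # this ranking is exhausted,
            pick_a = not pick_a              # so let the other team pick
            ranking = ranking_a if pick_a else ranking_b
            doc = next((d for d in ranking if d not in interleaved), None)
            if doc is None:
                break                        # both rankings exhausted
        interleaved.append(doc)
        teams.append('A' if pick_a else 'B')
        picks_a += pick_a
        picks_b += not pick_a
    return interleaved, teams

def score_query(teams, clicked_positions):
    """Credit each click to the team that contributed the clicked result."""
    a = sum(teams[i] == 'A' for i in clicked_positions)
    b = sum(teams[i] == 'B' for i in clicked_positions)
    return 'A wins' if a > b else 'B wins' if b > a else 'tie'
```

Aggregating these per-query outcomes over many queries gives the pairwise preference between the two rankers.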

Simple Example Two users, Alice & Bob – Alice clicks a lot, Bob clicks very little Two retrieval functions, r1 & r2 – r1 > r2 Two ways of evaluating: – Run r1 & r2 independently, measure absolute metrics – Interleave r1 & r2, measure pairwise preference
Absolute metrics (higher chance of falsely concluding that r2 > r1):
User | Retrieval function served | #clicks
Alice | r2 | 5
Bob | r1 | 1
Interleaving (both users prefer r1):
User | #clicks on r1 | #clicks on r2
Alice | 4 | 1
Bob | 1 | 0
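Encoding the slide's two tables directly (the numbers are taken from the slide; the Python names are ours) makes the contrast explicit: the absolute totals favor r2 only because the heavy clicker happened to be served r2, while the paired per-user interleaving counts point to r1 for both users.

```python
# Absolute metrics: each user was served only one retrieval function.
absolute = {"Alice": ("r2", 5), "Bob": ("r1", 1)}   # (function served, #clicks)
# Interleaving: both functions contribute results to every session.
interleaved = {"Alice": {"r1": 4, "r2": 1}, "Bob": {"r1": 1, "r2": 0}}

# Absolute view: total clicks per function -> r2 looks better (5 vs. 1),
# driven entirely by Alice's high click propensity.
totals = {}
for func, clicks in absolute.values():
    totals[func] = totals.get(func, 0) + clicks
print("absolute totals:", totals)            # {'r2': 5, 'r1': 1}

# Interleaving view: per-user paired preference -> both users prefer r1.
prefs = {user: max(counts, key=counts.get) for user, counts in interleaved.items()}
print("per-user preferences:", prefs)        # {'Alice': 'r1', 'Bob': 'r1'}
```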

Comparison with Absolute Metrics (Online) [Radlinski et al. 2008; Chapelle et al., 2012] Experiments on arXiv.org, about 1000 queries per experiment. Interleaving is more sensitive and more reliable: absolute metrics diverge in their preference estimate, while interleaving achieves significance faster. [Figure: p-value vs. query set size and disagreement probability, ArXiv.org Pair 1 and Pair 2]

Comparison with Absolute Metrics (Online) [Radlinski et al. 2008; Chapelle et al., 2012] Large-scale experiments on Yahoo! (smaller differences in quality). Interleaving is sensitive and more reliable (~7K queries for significance). [Figure: p-value vs. query set size and disagreement probability, Yahoo! Pair 1 and Pair 2]

Benefits & Drawbacks of Interleaving Benefits – A more direct way to elicit user preferences – A more direct way to perform retrieval evaluation – Deals with issues of position bias and calibration Drawbacks – Can only elicit pairwise ranking-level preferences – Unclear how to interpret at document-level – Unclear how to derive user model

Demo!

Story So Far Interleaving is an efficient and consistent online experiment framework. How can we improve interleaving experiments? How do we efficiently schedule multiple interleaving experiments?

Not All Clicks Created Equal Interleaving constructs a paired test – Controls for position bias – Calibrates clicks But not all clicks are equally informative – Attractive summaries – Last click vs first click – Clicks at rank 1

Title Bias Effect [Yue et al., 2010]: the bars should be equal if there were no title bias. [Figure: click percentage on the bottom result, for pairs of adjacent rank positions]

Not All Clicks Created Equal Example: query session with 2 clicks – One click at rank 1 (from A) – Later click at rank 4 (from B) – Normally would count this query session as a tie – But second click is probably more informative… – …so B should get more credit for this query

Linear Model for Weighting Clicks A feature vector φ(q,c) describes each click c in query q; the weight of the click is w^T φ(q,c). [Yue et al., 2010; Chapelle et al., 2012]
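A minimal sketch of how such a weighting could be applied when scoring an interleaved query, assuming a toy two-feature φ (is this the last click of the session, or not) to match the example on the next slides. The names are ours, and the actual feature set and scoring in [Yue et al., 2010] are richer.

```python
import numpy as np

def phi(click_index, num_clicks):
    """Toy feature map phi(q, c): [is last click, is any other click]."""
    is_last = 1.0 if click_index == num_clicks - 1 else 0.0
    return np.array([is_last, 1.0 - is_last])

def query_credit(click_teams, w):
    """Signed, weighted credit for one interleaved query.
    click_teams: 'A'/'B' labels, one per click, in click order.
    Positive total -> evidence for A, negative -> evidence for B."""
    total = 0.0
    for i, team in enumerate(click_teams):
        weight = w @ phi(i, len(click_teams))      # w^T phi(q, c)
        total += weight if team == 'A' else -weight
    return total

# w = (1, 1) weights every click equally; w = (1, 0) counts only the last click.
print(query_credit(['B', 'B', 'A'], np.array([1.0, 1.0])))   # -1.0: raw counts favor B
print(query_credit(['B', 'B', 'A'], np.array([1.0, 0.0])))   #  1.0: only the last click counts
```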

Example w^T φ(q,c) differentiates last clicks and other clicks. Interleave A vs B – 3 clicks per session – Last click 60% on result from A – Other 2 clicks random. Conventional w = (1,1) – has significant variance. Only count last click, w = (1,0) – minimizes variance. [Yue et al., 2010; Chapelle et al., 2012]
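Assuming exactly the setup stated on this slide (3 clicks per query, last click on A's result with probability 0.6, the other two clicks uniformly at random), a quick Monte-Carlo sketch of our own illustrates why w = (1,0) is the lower-variance choice: the two random clicks add noise without adding signal, so the per-query z-score is noticeably larger when they are ignored.

```python
import random
import statistics

def per_query_z(w, n_queries=200_000, seed=0):
    """Monte-Carlo estimate of mean/std of the per-query credit.
    Setup from the slide: 3 clicks per query; the last click lands on a
    result from A with probability 0.6; the other two clicks land on A or B
    uniformly at random. w = (weight of last click, weight of other clicks)."""
    rng = random.Random(seed)
    credits = []
    for _ in range(n_queries):
        credit = 0.0
        for click in range(3):
            is_last = (click == 2)
            weight = w[0] if is_last else w[1]
            p_click_on_a = 0.6 if is_last else 0.5
            credit += weight if rng.random() < p_click_on_a else -weight
        credits.append(credit)
    return statistics.fmean(credits) / statistics.pstdev(credits)

print("w = (1, 1):", per_query_z((1.0, 1.0)))   # ~0.12: the random clicks add noise
print("w = (1, 0):", per_query_z((1.0, 0.0)))   # ~0.20: only the informative click counts
```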

Learning Parameters Training set: interleaved click data on pairs of retrieval functions (A,B) – We know A > B Learning: train parameters w to maximize sensitivity of interleaving experiments Example: z-test depends on z-score = mean / std – The larger the z-score, the more confident the test – Inverse z-test learns w to maximize z-score on training set [Yue et al., 2010; Chapelle et al., 2012]

Inverse z-Test [Yue et al., 2010; Chapelle et al., 2012] Aggregate the features of all clicks in a query, then choose w* to maximize the resulting z-score.
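A sketch of what this optimization can look like, assuming each training query has been reduced to a single signed, aggregated feature vector (clicks on the known-better ranker counted positively). Maximizing mean(Xw)/std(Xw) over w has the same closed form as Fisher's linear discriminant, w* ∝ Σ⁻¹μ. Variable names are ours, and the published inverse z-test uses a richer setup; this is only meant to convey the shape of the computation.

```python
import numpy as np

def inverse_z_weights(X):
    """X: one row per training query, holding that query's aggregated, signed
    click features (clicks on the known-better ranker counted positively,
    clicks on the other ranker counted negatively).
    Returns w maximizing the z-score mean(Xw) / std(Xw); that objective has
    the closed form w* proportional to Sigma^{-1} mu."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularized covariance
    w = np.linalg.solve(sigma, mu)
    return w / np.linalg.norm(w)

def z_score(X, w):
    """Sensitivity proxy: z-score of the per-query credits X @ w."""
    s = X @ w
    return s.mean() / s.std(ddof=1)
```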

ArXiv.org Experiments Trained on 6 interleaving experiments, tested on 12 interleaving experiments. Median relative score of 1.37; the baseline requires 1.88 times more data. [Figure: baseline vs. learned z-scores and their ratio, per experiment] [Yue et al., 2010; Chapelle et al., 2012]

Yahoo! Experiments 16 markets, 4-6 interleaving experiments each; leave-one-market-out validation. Median relative score of 1.25; the baseline requires 1.56 times more data. [Figure: baseline vs. learned z-scores and their ratio, per market] [Yue et al., 2010; Chapelle et al., 2012]

Improving Interleaving Experiments Can re-weight clicks based on importance – Reduces noise – Parameters are correlated, so hard to interpret – Largest weight on a single click at rank > 1 Can alter the interleaving mechanism – Probabilistic interleaving [Hofmann et al., 2011] Reusing interleaving usage data

Story So Far Interleaving is an efficient and consistent online experiment framework. How can we improve interleaving experiments? How do we efficiently schedule multiple interleaving experiments?

Information Systems

Interleave A vs B. Tally of pairwise outcomes (left wins / right wins): A vs B: 1/0; A vs C: 0/0; B vs C: 0/0.

Interleave A vs C. Tally (left wins / right wins): A vs B: 1/0; A vs C: 0/1; B vs C: 0/0.

Interleave B vs C. Tally (left wins / right wins): A vs B: 1/0; A vs C: 0/1; B vs C: 1/0.

Interleave A vs B. Tally (left wins / right wins): A vs B: 1/1; A vs C: 0/1; B vs C: 1/0. Exploration / Exploitation Tradeoff!

Identifying Best Retrieval Function Tournament – E.g., tennis – Eliminated by an arbitrary player Champion – E.g., boxing – Eliminated by champion Swiss – E.g., group rounds – Eliminated based on overall record

Tournaments are Bad Two bad retrieval functions are dueling They are similar to each other – Takes a long time to decide the winner – Can't make progress in the tournament until deciding Suffer very high regret for each comparison – Could have been using better retrieval functions

Champion is Good The champion gets better fast – If it starts out bad, it quickly gets replaced – Duels against each competitor in round robin (one of these competitors will become the next champion) Treat the sequence of champions as a random walk – Logarithmic number of rounds to arrive at the best retrieval function [Yue et al., 2009]
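The champion idea can be sketched in a few lines. This is a schematic of our own, not the analyzed algorithm from [Yue et al., 2009]: `duel(a, b)` stands in for one interleaved comparison, and the confidence radius is a plain Hoeffding bound.

```python
import math

def champion_schedule(rankers, duel, confidence=0.95, max_duels=100_000):
    """Schematic champion-style scheduler (a sketch of the idea only, not the
    algorithm analyzed in Yue et al., 2009). Keep a current champion,
    interleave it against the remaining rankers in round robin, promote a
    challenger that is confidently ahead, and drop challengers that are
    confidently behind. duel(a, b) runs one interleaved query and returns
    True if a wins it."""
    champion, challengers = rankers[0], list(rankers[1:])
    wins = {c: 0 for c in challengers}    # challenger wins against the champion
    plays = {c: 0 for c in challengers}
    for t in range(max_duels):
        if not challengers:
            break
        c = challengers[t % len(challengers)]         # round robin over challengers
        wins[c] += 0 if duel(champion, c) else 1
        plays[c] += 1
        p_hat = wins[c] / plays[c]                    # challenger's empirical win rate
        radius = math.sqrt(math.log(1.0 / (1.0 - confidence)) / (2.0 * plays[c]))
        if p_hat - radius > 0.5:                      # challenger is confidently better
            champion = c
            challengers = [r for r in rankers if r != champion]
            wins = {r: 0 for r in challengers}
            plays = {r: 0 for r in challengers}
        elif p_hat + radius < 0.5:                    # challenger is confidently worse
            challengers.remove(c)
    return champion
```

The point of the slide is the shape of this loop: a bad initial champion loses quickly and is replaced, so the sequence of champions behaves like a random walk toward the best retrieval function.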

Swiss is Even Better Champion has a lot of variance – Depends on the initial champion Swiss offers a low-variance alternative – Successively eliminate the retrieval function with the worst record Analysis & intuition are more complicated [Yue & Joachims, 2011]

Interleaving for Online Evaluation Interleaving is practical for online evaluation – High sensitivity – Low bias (preemptively controls for position bias) Interleaving can be improved – Dealing with secondary sources of noise/bias – New interleaving mechanisms Exploration/exploitation tradeoff – Need to balance evaluation with servicing users

References:
Large Scale Validation and Analysis of Interleaved Search Evaluation (TOIS 2012) Olivier Chapelle, Thorsten Joachims, Filip Radlinski, Yisong Yue
A Probabilistic Method for Inferring Preferences from Clicks (CIKM 2011) Katja Hofmann, Shimon Whiteson, Maarten de Rijke
Evaluating the Accuracy of Implicit Feedback from Clicks and Query Reformulations (TOIS 2007) Thorsten Joachims, Laura Granka, Bing Pan, Helen Hembrooke, Filip Radlinski, Geri Gay
How Does Clickthrough Data Reflect Retrieval Quality? (CIKM 2008) Filip Radlinski, Madhu Kurup, Thorsten Joachims
The K-armed Dueling Bandits Problem (COLT 2009) Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims
Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation (SIGIR 2010) Yisong Yue, Yue Gao, Olivier Chapelle, Ya Zhang, Thorsten Joachims
Beat the Mean Bandit (ICML 2011) Yisong Yue, Thorsten Joachims
Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data (WWW 2010) Yisong Yue, Rajan Patel, Hein Roehrig
Papers and demo scripts available at