Published by Olivia Klein. Modified over 3 years ago.

1
The K-armed Dueling Bandits Problem
COLT 2009
Cornell University
Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims

2
Multi-armed Bandits
K bandits (arms / actions / strategies)
Each time step, the algorithm chooses a bandit based on prior actions and feedback.
Only observe feedback from actions taken.
Partial Feedback Online Learning Problem

3
Optimizing Retrieval Functions
Interactively learn the best retrieval function
Search users provide implicit feedback
  – E.g., what they click on
Seems like a natural bandit problem
  – Each retrieval function is a bandit
  – Feedback is clicks
  – Find the best w.r.t. clickthrough rate
  – Assumes clicks are explicit absolute feedback!

4
What Results do Users View/Click? [Joachims et al., TOIS 2007]

5
Team-Game Interleaving
(u = thorsten, q = svm)

$f_1(u, q) \to r_1$:
1. Kernel Machines
2. Support Vector Machine
3. An Introduction to Support Vector Machines
4. Archives of SUPPORT-VECTOR-MACHINES
5. SVM-Light Support Vector Machine

$f_2(u, q) \to r_2$:
1. Kernel Machines
2. SVM-Light Support Vector Machine
3. Support Vector Machine and Kernel... References
4. Lucent Technologies: SVM demo applet
5. Royal Holloway Support Vector Machine

Interleaving($r_1$, $r_2$), with the team (T1 or T2) credited for each result:
1. Kernel Machines (T2)
2. Support Vector Machine (T1)
3. SVM-Light Support Vector Machine (T2)
4. An Introduction to Support Vector Machines (T1)
5. Support Vector Machine and Kernel... References (T2)
6. Archives of SUPPORT-VECTOR-MACHINES... (T1)
7. Lucent Technologies: SVM demo applet (T2)

Interpretation: $(r_1 > r_2) \Leftrightarrow \text{clicks}(T_1) > \text{clicks}(T_2)$
Mix results of $f_1$ and $f_2$
Relative feedback
More reliable
[Radlinski, Kurup, Joachims; CIKM 2008]
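The interleaving idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the exact procedure of [Radlinski, Kurup, Joachims; CIKM 2008]: the function names `team_interleave` and `duel_outcome`, the coin-flip tie-breaking, and the rule that the team with fewer picks goes next are all assumptions made for the example.

```python
import random

def team_interleave(r1, r2, rng=None):
    """Merge two rankings; return the merged list and the team credited per result.

    Assumes each ranking is a list of distinct, hashable results.
    """
    rng = rng or random.Random()
    rankings = {1: r1, 2: r2}
    picks = {1: 0, 2: 0}
    merged, team = [], {}
    total = len(set(r1) | set(r2))
    while len(merged) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks[1] != picks[2]:
            t = 1 if picks[1] < picks[2] else 2
        else:
            t = rng.choice([1, 2])
        unseen = [d for d in rankings[t] if d not in team]
        if not unseen:
            # This team has nothing new to contribute, so the other team picks.
            t = 3 - t
            unseen = [d for d in rankings[t] if d not in team]
        doc = unseen[0]  # highest-ranked result of team t not yet shown
        merged.append(doc)
        team[doc] = t
        picks[t] += 1
    return merged, team

def duel_outcome(team, clicked):
    """Return 1 or 2 for the team with more clicks, or 0 for a tie."""
    c1 = sum(1 for d in clicked if team.get(d) == 1)
    c2 = sum(1 for d in clicked if team.get(d) == 2)
    return 1 if c1 > c2 else 2 if c2 > c1 else 0
```

Calling `team_interleave(r1, r2)` and then `duel_outcome(team, clicked)` with the results a user clicked turns one query into one noisy comparison between the two retrieval functions.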

6
Dueling Bandits Problem
Given K bandits $b_1, \dots, b_K$
Each iteration: compare (duel) two bandits
  – E.g., interleaving two retrieval functions
Comparison is noisy
  – Each comparison result independent
  – Comparison probabilities initially unknown
  – Comparison probabilities fixed over time
Total preference ordering, initially unknown

7
Dueling Bandits Problem
Want to find best (or good) bandit
  – Similar to finding the max w/ noisy comparisons
  – Ours is a regret minimization setting
Choose pair $(b_t, b_t')$ to minimize regret:
  $R_T = \sum_{t=1}^{T} \left[ P(b^* > b_t) + P(b^* > b_t') - 1 \right]$
  (% users who prefer best bandit over chosen ones)
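To make the regret notion concrete, here is a small sketch that simulates a noisy duel and sums the regret of a sequence of chosen pairs. It assumes the regret of one duel is the sum, over both chosen bandits, of how far above ½ the best bandit's win probability is (consistent with the later "Pay ε 1i + ε 1j regret for each comparison"). The preference matrix `P` and the function names are illustrative, not from the slides.

```python
import random

def duel(P, i, j, rng):
    """Return True if bandit i wins one noisy comparison against bandit j."""
    return rng.random() < P[i][j]

def regret(P, best, pairs):
    """Sum of (P[best][b] - 1/2) over both bandits of every chosen pair."""
    return sum((P[best][i] - 0.5) + (P[best][j] - 0.5) for i, j in pairs)

# Hypothetical preference matrix: P[i][j] is the probability that i beats j.
# Bandit 0 is the best; rows and columns are ordered best to worst.
P = [
    [0.5, 0.6, 0.8],
    [0.4, 0.5, 0.7],
    [0.2, 0.3, 0.5],
]

# Dueling the best bandit against itself costs nothing;
# every duel involving a worse bandit adds regret.
example_regret = regret(P, best=0, pairs=[(0, 0), (0, 1), (1, 2)])
```

Under this definition the only zero-regret action is to duel the best bandit against itself, which is why an algorithm has to stop exploring once it is confident.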

8
Related Work
Existing MAB settings
  – [Lai & Robbins, 85] [Auer et al., 02]
  – PAC setting [Even-Dar et al., 06]
Partial monitoring problem
  – [Cesa-Bianchi et al., 06]
Computing with noisy comparisons
  – Finding max [Feige et al., 97]
  – Binary search [Karp & Kleinberg, 07] [Ben-Or & Hassidim, 08]
Continuous-armed Dueling Bandits Problem
  – [Yue & Joachims, 09]

9
Assumptions
$P(b_i > b_j) = \tfrac{1}{2} + \epsilon_{ij}$ (distinguishability)
Strong Stochastic Transitivity
  – For three bandits $b_i > b_j > b_k$: $\epsilon_{ik} \ge \max\{\epsilon_{ij}, \epsilon_{jk}\}$
  – Monotonicity property
Stochastic Triangle Inequality
  – For three bandits $b_i > b_j > b_k$: $\epsilon_{ik} \le \epsilon_{ij} + \epsilon_{jk}$
  – Diminishing returns property
Satisfied by many standard models
  – E.g., Logistic/Bradley-Terry
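The two assumptions can be checked numerically for any preference matrix. The sketch below assumes the standard statements of the properties, namely $\epsilon_{ik} \ge \max\{\epsilon_{ij}, \epsilon_{jk}\}$ for strong stochastic transitivity and $\epsilon_{ik} \le \epsilon_{ij} + \epsilon_{jk}$ for the stochastic triangle inequality, with bandits indexed best first. The helper names and the strength values are made up for illustration.

```python
import itertools

def bradley_terry_matrix(strengths):
    """P[i][j] = s_i / (s_i + s_j): probability that i beats j."""
    return [[si / (si + sj) for sj in strengths] for si in strengths]

def satisfies_assumptions(P, tol=1e-12):
    """Check both properties for every triple i < j < k (index 0 is the best)."""
    def eps(i, j):
        return P[i][j] - 0.5
    for i, j, k in itertools.combinations(range(len(P)), 3):
        # Strong stochastic transitivity (monotonicity).
        if eps(i, k) < max(eps(i, j), eps(j, k)) - tol:
            return False
        # Stochastic triangle inequality (diminishing returns).
        if eps(i, k) > eps(i, j) + eps(j, k) + tol:
            return False
    return True

# Strengths sorted in decreasing order, so index 0 is the best bandit.
bt_ok = satisfies_assumptions(bradley_terry_matrix([8.0, 4.0, 2.0, 1.0]))

# A matrix where the best bandit barely beats the worst one breaks monotonicity.
bad = [
    [0.5, 0.9, 0.6],
    [0.1, 0.5, 0.9],
    [0.4, 0.1, 0.5],
]
bad_ok = satisfies_assumptions(bad)
```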

10
Naïve Approach
In deterministic case, O(K) comparisons to find max
Extend to noisy case:
  – Repeatedly compare until confident one is better
Problem: comparing two awful (but similar) bandits
  – Waste comparisons to see which awful bandit is better
  – Incur high regret for each comparison
  – Also applies to elimination tournaments
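A rough calculation shows why similar but awful bandits are costly. The sketch assumes a two-sided Hoeffding bound is used to decide when a comparison is settled, so about $\log(2/\delta) / (2\epsilon^2)$ duels are needed to separate two bandits whose gap is $\epsilon$. That stopping rule and the function names are assumptions for illustration.

```python
import math

def comparisons_needed(gap, delta):
    """Smallest t whose Hoeffding radius sqrt(log(2/delta) / (2t)) is at most gap."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * gap ** 2))

def naive_match_regret(eps_best_i, eps_best_j, eps_ij, delta):
    """Regret paid while settling i vs j: duels needed times regret per duel."""
    return comparisons_needed(eps_ij, delta) * (eps_best_i + eps_best_j)

# Two awful, nearly identical bandits: many duels, each with high regret.
similar_awful = naive_match_regret(0.40, 0.41, 0.01, delta=0.05)

# Two good, easily separated bandits: few duels, each with low regret.
distinct_good = naive_match_regret(0.00, 0.20, 0.20, delta=0.05)
```

The first case needs roughly 400 times as many comparisons as the second (the gap is 20 times smaller and enters squared), and each of those comparisons also costs more regret.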

15
Interleaved Filter
Choose candidate bandit at random
Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously
  – Maintain mean and confidence interval for each pair of bandits being compared
…until another bandit is better
  – With confidence $1 - \delta$
Repeat process with new candidate
  – Remove all empirically worse bandits
Continue until 1 candidate left
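The steps above can be sketched as a short simulation. This is a minimal reading of the slide, not the authors' reference implementation: the Hoeffding-style confidence radius, the choice $\delta = 1/(TK^2)$ (taken from the regret analysis slide), the horizon cap, and the name `interleaved_filter` are assumptions. `P[i][j]` is a hypothetical matrix of true win probabilities that the algorithm only observes through noisy duels.

```python
import math
import random

def interleaved_filter(P, T, rng=None):
    """Return (surviving candidate, number of duels used), using at most about T duels."""
    rng = rng or random.Random()
    K = len(P)
    delta = 1.0 / (T * K * K)
    candidate = rng.randrange(K)              # choose candidate at random
    remaining = set(range(K)) - {candidate}
    wins = {b: 0 for b in remaining}          # wins of the candidate over b
    t = 0                                     # duels per opponent in this round
    duels = 0
    while remaining and duels < T:
        t += 1
        for b in remaining:                   # duel against all others "simultaneously"
            wins[b] += rng.random() < P[candidate][b]
            duels += 1
        # Assumed confidence radius from a two-sided Hoeffding bound.
        radius = math.sqrt(math.log(2.0 / delta) / (2.0 * t))
        # Drop opponents the candidate beats with confidence 1 - delta.
        for b in [b for b in remaining if wins[b] / t - radius > 0.5]:
            remaining.discard(b)
        # Has some opponent beaten the candidate with confidence 1 - delta?
        better = [b for b in remaining if wins[b] / t + radius < 0.5]
        if better:
            winner = better[0]
            # Remove all empirically worse bandits, then start a new round.
            remaining = {b for b in remaining if wins[b] / t <= 0.5}
            remaining.discard(winner)
            candidate = winner
            wins = {b: 0 for b in remaining}
            t = 0
    return candidate, duels

# Hypothetical instance: bandit 0 is best, and every better bandit wins w.p. 0.8.
P = [[0.5 if i == j else (0.8 if i < j else 0.2) for j in range(4)] for i in range(4)]
best, used = interleaved_filter(P, T=100_000, rng=random.Random(0))
```

Because every duel involves the current candidate, the algorithm never spends comparisons on deciding which of two non-candidate bandits is better, which is the failure mode of the naïve approach.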

16
Regret Analysis
Define a round to be all the time steps for a particular candidate bandit
  – Halts round when $1 - \delta$ confidence met
  – Treat candidate bandits as random walk
  – Takes log K rounds to reach best bandit
Define a match to be all the comparisons between two bandits in a round
  – O(K) total matches in expectation
  – Constant fraction of bandits removed each round

17
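Regret Analysis
O(K) total matches
Each match incurs $O\!\left(\frac{1}{\epsilon_{12}} \log T\right)$ regret
  – Depends on $\delta = K^{-2} T^{-1}$
Finds best bandit w.p. $1 - 1/T$
Expected regret: $E[R_T] = O\!\left(\frac{K}{\epsilon_{12}} \log T\right)$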
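One way to see how the pieces on this slide combine is sketched below. It assumes a per-match regret of $O\!\left(\frac{1}{\epsilon_{12}} \log T\right)$, where $\epsilon_{12}$ is the gap between the two best bandits, and that the regret over the whole horizon is at most $O(T)$ when the best bandit is missed. Both assumptions are a reading of the accompanying analysis rather than something this slide text states.

```latex
\begin{align*}
\log\frac{1}{\delta}
  &= \log\!\left(T K^{2}\right) = \log T + 2\log K = O(\log T)
  \qquad (\text{assuming } K \le T), \\
E[R_T]
  &\le \underbrace{O(K)}_{\text{matches}}
     \cdot \underbrace{O\!\left(\tfrac{1}{\epsilon_{12}} \log T\right)}_{\text{regret per match}}
     \;+\; \underbrace{\tfrac{1}{T}}_{\text{failure prob.}}
     \cdot \underbrace{O(T)}_{\text{worst-case regret}}
   = O\!\left(\tfrac{K}{\epsilon_{12}} \log T\right).
\end{align*}
```

The failure term contributes only a constant, so the bound is driven by the number of matches and the regret of each one.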

18
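Lower Bound Example
Order bandits $b_1 > b_2 > \dots > b_K$
  – $P(b_i > b_j) = \tfrac{1}{2} + \epsilon$ for $i < j$
Each match takes $\Omega\!\left(\frac{1}{\epsilon^2} \log T\right)$ comparisons
Pay $\Theta(\epsilon)$ regret for each comparison
Accumulated regret over all matches is at least $\Omega\!\left(\frac{K}{\epsilon} \log T\right)$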

19
Moving Forward
Extensions
  – Drifting user interests
      Changing user population / document collection
  – Contextualization / side information
Cost-sensitive active learning w/ relative feedback
Live user studies
  – Interactive experiments on search service
  – Sheds insight; guides future design

20
Extra Slides

21
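Per-Match Regret
Number of comparisons in match $b_i$ vs $b_j$: $O\!\left(\frac{1}{\max\{\epsilon_{1i}, \epsilon_{ij}\}^2} \log T\right)$
  – $\epsilon_{1i} > \epsilon_{ij}$: round ends before concluding $b_i > b_j$
  – $\epsilon_{1i} < \epsilon_{ij}$: conclude $b_i > b_j$ before round ends, remove $b_j$
Pay $\epsilon_{1i} + \epsilon_{1j}$ regret for each comparison
  – By triangle inequality $\epsilon_{1i} + \epsilon_{1j} \le 2 \max\{\epsilon_{1i}, \epsilon_{ij}\}$
  – Thus by stochastic transitivity accumulated regret is $O\!\left(\frac{1}{\epsilon_{12}} \log T\right)$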

22
Number of Rounds
Assume all superior bandits have equal prob of defeating candidate
  – Worst case scenario under transitivity
Model this as a random walk
  – $r_j$ transitions to each $r_i$ ($i < j$) with equal probability
  – Compute total number of steps before reaching $r_1$ (i.e., $r^*$)
Can show $O(\log K)$ w.h.p. using Chernoff bound
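The random-walk model is easy to simulate. The sketch below assumes the initial candidate is drawn uniformly from the K ranks and that each round moves to a uniformly random better rank, exactly as in the model on this slide. The function names are illustrative.

```python
import random

def rounds_to_best(K, rng):
    """Number of candidate changes before the walk reaches rank 1."""
    rank = rng.randint(1, K)              # random initial candidate
    steps = 0
    while rank > 1:
        rank = rng.randint(1, rank - 1)   # uniform over the superior ranks
        steps += 1
    return steps

def average_rounds(K, trials=2000, seed=0):
    """Monte Carlo estimate of the expected number of candidate changes."""
    rng = random.Random(seed)
    return sum(rounds_to_best(K, rng) for _ in range(trials)) / trials
```

From rank j the expected number of steps is the harmonic number $H_{j-1}$, so the average grows like $\ln K$. For K = 1000 the simulated mean should come out near 6.5, far below the worst case of K − 1 steps.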

23
Total Matches Played
O(K) matches played in each round
  – Naïve analysis yields O(K log K) total
However, all empirically worse bandits are also removed at the end of each round
  – Will not participate in future rounds
Assume worst case that inferior bandits have ½ chance of being empirically worse
  – Can show w.h.p. that O(K) total matches are played over O(log K) rounds
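The effect of pruning on the match count can be explored with a small simulation of this worst-case model. The sketch makes several assumptions: the next candidate is uniform over the remaining superior bandits, each bandit inferior to the current candidate is pruned with probability ½ at the end of a round, superior bandits are never pruned, and the final round (against the best remaining candidate) is counted once. Names are illustrative.

```python
import random

def total_matches(K, rng):
    """Count matches played until no bandit superior to the candidate remains."""
    candidate = rng.randint(1, K)                  # rank 1 is the best bandit
    remaining = set(range(1, K + 1)) - {candidate}
    matches = 0
    while remaining:
        matches += len(remaining)                  # one match per remaining opponent
        superior = [b for b in remaining if b < candidate]
        if not superior:
            break                                  # candidate is the best one left
        winner = rng.choice(superior)
        # Inferior bandits survive pruning with probability 1/2; superior ones stay.
        remaining = {b for b in remaining if b < candidate or rng.random() < 0.5}
        remaining.discard(winner)
        candidate = winner
    return matches

def matches_per_bandit(K, trials=100, seed=0):
    """Average total matches divided by K."""
    rng = random.Random(seed)
    return sum(total_matches(K, rng) for _ in range(trials)) / (trials * K)
```

If the total really is O(K), `matches_per_bandit(K)` should stay bounded by a small constant as K grows instead of increasing like log K.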

24
Removing Inferior Bandits
At conclusion of each round
  – Remove any empirically worse bandits
Intuition:
  – High confidence that winner is better than incumbent candidate
  – Empirically worse bandits cannot be much better than incumbent candidate
  – Can show via Hoeffding bound that winner is also better than empirically worse bandits with high confidence
  – Preserves $1 - 1/T$ confidence overall that we'll find the best bandit
