# The K-armed Dueling Bandits Problem

COLT 2009. Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims (Cornell University)

Multi-armed Bandits
- K bandits (arms / actions / strategies)
- At each time step, the algorithm chooses a bandit based on prior actions and feedback
- Feedback is observed only for the actions taken
- A partial-feedback online learning problem

Optimizing Retrieval Functions
- Interactively learn the best retrieval function
- Search users provide implicit feedback
  - E.g., what they click on
- Seems like a natural bandit problem
  - Each retrieval function is a bandit
  - Feedback is clicks
  - Find the best w.r.t. clickthrough rate
  - Assumes clicks → explicit absolute feedback!

What Results do Users View/Click?
[Joachims et al., TOIS 2007]

Team-Game Interleaving

Dueling Bandits Problem
- Given K bandits b1, …, bK
- Each iteration: compare (duel) two bandits
  - E.g., interleaving two retrieval functions
- Comparison is noisy
  - Each comparison result is independent
  - Comparison probabilities are initially unknown
  - Comparison probabilities are fixed over time
- Total preference ordering, initially unknown
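
The comparison model on this slide can be simulated in a few lines. This is a minimal sketch, not code from the talk: the matrix `P_WIN`, its values, and the helper name `duel` are hypothetical, chosen only so that b0 > b1 > b2 form a total order.

```python
import random

def duel(p_win, i, j, rng):
    """Return True if bandit i wins one noisy comparison against bandit j.

    p_win[i][j] is P(b_i > b_j); it is fixed over time and hidden from the learner.
    """
    return rng.random() < p_win[i][j]

# Hypothetical 3-bandit instance with total order b0 > b1 > b2.
P_WIN = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]

rng = random.Random(0)
n = 10_000
wins = sum(duel(P_WIN, 0, 2, rng) for _ in range(n))
print(wins / n)  # empirical estimate of P(b0 > b2)
```

Each call is an independent Bernoulli trial, so the empirical win rate concentrates around the hidden probability as the number of duels grows.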

Dueling Bandits Problem
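- Want to find the best (or a good) bandit
  - Similar to finding the max w/ noisy comparisons
  - Ours is a regret minimization setting
- Choose pair (bt, bt’) to minimize regret: $R_T = \sum_{t=1}^{T} \left[ \epsilon(b^*, b_t) + \epsilon(b^*, b_t') \right]$, where $\epsilon(b^*, b) = P(b^* > b) - \tfrac{1}{2}$
  - (% users who prefer best bandit over chosen ones)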
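
To make the regret notion concrete, here is a small sketch that assumes each duel of (bt, bt’) costs ε(b*, bt) + ε(b*, bt’), where ε(b*, b) = P(b* > b) − ½. The function name, the list `EPS_BEST`, and the example pairs are hypothetical.

```python
def cumulative_regret(eps_best, chosen_pairs):
    """Sum eps(b*, b_t) + eps(b*, b_t') over all dueled pairs.

    eps_best[i] is P(b* > b_i) - 1/2, so the entry for the best bandit is 0.
    """
    return sum(eps_best[i] + eps_best[j] for i, j in chosen_pairs)

# Hypothetical instance: b0 is best; it beats b1 w.p. 0.6 and b2 w.p. 0.7.
EPS_BEST = [0.0, 0.1, 0.2]

# Three duels: two poor bandits, then best vs. second best, then best vs. itself.
total = cumulative_regret(EPS_BEST, [(1, 2), (0, 1), (0, 0)])
print(total)
```

Under this definition, dueling the best bandit against itself is the only choice that adds no regret, which is why an algorithm wants to stop comparing poor bandits as early as possible.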

Related Work
- Existing MAB settings: [Lai & Robbins, ’85], [Auer et al., ’02]
- PAC setting: [Even-Dar et al., ’06]
- Partial monitoring problem: [Cesa-Bianchi et al., ’06]
- Computing with noisy comparisons
  - Finding max: [Feige et al., ’97]
  - Binary search: [Karp & Kleinberg, ’07], [Ben-Or & Hassidim, ’08]
- Continuous-armed Dueling Bandits Problem: [Yue & Joachims, ’09]

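Assumptions
- P(bi > bj) = ½ + εij (distinguishability)
- Strong Stochastic Transitivity
  - For three bandits bi > bj > bk: $\epsilon_{ik} \ge \max\{\epsilon_{ij}, \epsilon_{jk}\}$
  - Monotonicity property
- Stochastic Triangle Inequality
  - For three bandits bi > bj > bk: $\epsilon_{ik} \le \epsilon_{ij} + \epsilon_{jk}$
  - Diminishing returns property
- Satisfied by many standard models
  - E.g., Logistic / Bradley-Terry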
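
As a sanity check on the claim that a logistic model satisfies both assumptions, the sketch below tests them numerically on every ordered triple. It takes strong stochastic transitivity to mean ε_ik ≥ max{ε_ij, ε_jk} and the stochastic triangle inequality to mean ε_ik ≤ ε_ij + ε_jk for bi > bj > bk. The utility values and function names are hypothetical.

```python
import itertools
import math

def logistic_pref(mu_i, mu_j):
    """P(b_i > b_j) under a logistic (Bradley-Terry-style) model with utilities mu."""
    return 1.0 / (1.0 + math.exp(-(mu_i - mu_j)))

def check_assumptions(mus, tol=1e-12):
    """Check both assumptions on every triple b_i > b_j > b_k.

    mus must be sorted in descending order so that index order is preference order.
    """
    n = len(mus)
    eps = [[logistic_pref(mus[i], mus[j]) - 0.5 for j in range(n)] for i in range(n)]
    for i, j, k in itertools.combinations(range(n), 3):
        sst = eps[i][k] >= max(eps[i][j], eps[j][k]) - tol   # monotonicity
        sti = eps[i][k] <= eps[i][j] + eps[j][k] + tol       # diminishing returns
        if not (sst and sti):
            return False
    return True

ok = check_assumptions([2.0, 1.2, 0.5, 0.0, -1.0])
print(ok)
```

The check is only numerical evidence for one utility vector, not a proof; the tolerance `tol` guards against floating-point rounding.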

Naïve Approach
- In the deterministic case, O(K) comparisons suffice to find the max
- Extend to the noisy case:
  - Repeatedly compare until confident that one is better
- Problem: comparing two awful (but similar) bandits
  - Wastes comparisons on deciding which awful bandit is better
  - Incurs high regret for each comparison
  - Also applies to elimination tournaments

Interleaved Filter
- Choose candidate bandit at random
- Make noisy comparisons (Bernoulli trials) against all other bandits simultaneously
  - Maintain mean and confidence interval for each pair of bandits being compared
- …until another bandit is better
  - With confidence 1 – δ
- Repeat process with new candidate
  - Remove all empirically worse bandits
- Continue until 1 candidate left
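
The steps above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' reference implementation: the Hoeffding-style radius, the in-round removal of bandits the candidate beats with confidence, the `horizon` budget check, and all names (`interleaved_filter`, `P_WIN`) are assumptions made for the example.

```python
import math
import random

def interleaved_filter(p_win, horizon, rng):
    """Candidate-vs-field loop; returns (final candidate, comparisons used).

    p_win[i][j] = P(b_i > b_j) is used only to simulate the noisy duels.
    """
    k = len(p_win)
    delta = 1.0 / (horizon * k * k)        # delta = 1 / (T K^2)
    candidate = rng.randrange(k)           # random initial candidate
    field = set(range(k)) - {candidate}
    wins = {b: 0 for b in field}           # candidate's wins against each b
    plays = 0                              # comparisons per opponent this round
    used = 0

    while field and used + len(field) <= horizon:
        # One noisy comparison of the candidate against every remaining bandit.
        for b in field:
            wins[b] += rng.random() < p_win[candidate][b]
        plays += 1
        used += len(field)

        # Assumed Hoeffding radius for a 1 - delta confidence interval.
        radius = math.sqrt(math.log(2.0 / delta) / (2.0 * plays))
        mean = {b: wins[b] / plays for b in field}

        # Assumption: drop bandits the candidate beats with confidence 1 - delta.
        field -= {b for b in field if mean[b] - radius > 0.5}

        # Some bandit beats the candidate with confidence 1 - delta: round ends.
        winners = [b for b in field if mean[b] + radius < 0.5]
        if winners:
            # Remove all bandits that look empirically worse than the old candidate.
            field -= {b for b in field if mean[b] > 0.5}
            candidate = winners[0]
            field.discard(candidate)
            wins = {b: 0 for b in field}   # reset statistics for the new round
            plays = 0
    return candidate, used

# Hypothetical instance with total order b0 > b1 > b2.
P_WIN = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.6],
    [0.3, 0.4, 0.5],
]
best, used = interleaved_filter(P_WIN, horizon=100_000, rng=random.Random(0))
print(best, used)
```

The candidate is compared against all remaining bandits in lockstep, so a round ends as soon as any single opponent is confidently better. This avoids long matches between two similar, poor bandits.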

Regret Analysis
- Define a round to be all the time steps for a particular candidate bandit
  - A round halts when 1 – δ confidence is met
  - Treat the sequence of candidate bandits as a random walk
  - Takes O(log K) rounds to reach the best bandit
- Define a match to be all the comparisons between two bandits in a round
  - O(K) total matches in expectation
  - Constant fraction of bandits removed each round

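Regret Analysis
- O(K) total matches
- Each match incurs regret $O\!\left(\frac{1}{\epsilon_{12}} \log T\right)$
  - Depends on $\delta = K^{-2} T^{-1}$
- Finds best bandit w.p. 1 – 1/T
- Expected regret: $E[R_T] = O\!\left(\frac{K}{\epsilon_{12}} \log T\right)$, where $\epsilon_{12}$ is the gap between the best and second-best bandit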
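
One way to see how these pieces combine is the worked bound below. It is a sketch under assumptions not stated on the slide: each of the O(K) matches costs $O\!\left(\frac{1}{\epsilon_{12}} \log T\right)$ regret, the returned bandit is played against itself once the filter stops, and per-step regret is at most a constant. With probability at least 1 – 1/T the best bandit is found, and otherwise the regret is at most O(T):

$$
E[R_T] \le \left(1 - \tfrac{1}{T}\right) \cdot O\!\left(\tfrac{K}{\epsilon_{12}} \log T\right) + \tfrac{1}{T} \cdot O(T) = O\!\left(\tfrac{K}{\epsilon_{12}} \log T\right)
$$

Here $\epsilon_{12}$ denotes $P(b_1 > b_2) - \tfrac{1}{2}$, the gap between the best and second-best bandit. The failure term contributes only a constant, which is why δ is chosen to shrink with T.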

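Lower Bound Example
- Order bandits b1 > b2 > … > bK
- P(b > b’) = ½ + ε whenever b > b’
- Each match takes $\Omega\!\left(\frac{1}{\epsilon^2} \log T\right)$ comparisons
- Pay Θ(ε) regret for each comparison
- Accumulated regret over all matches is at least $\Omega\!\left(\frac{K}{\epsilon} \log T\right)$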

Moving Forward
- Extensions
  - Drifting user interests
  - Changing user population / document collection
  - Contextualization / side information
  - Cost-sensitive active learning w/ relative feedback
- Live user studies
  - Interactive experiments on search service
  - Sheds insight; guides future design

Extra Slides

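Per-Match Regret
- Number of comparisons in match bi vs bj: $O\!\left(\frac{1}{\max\{\epsilon_{1i}, \epsilon_{ij}\}^2} \log T\right)$
  - $\epsilon_{1i} > \epsilon_{ij}$: round ends before concluding bi > bj
  - $\epsilon_{1i} < \epsilon_{ij}$: conclude bi > bj before round ends, remove bj
- Pay $\epsilon_{1i} + \epsilon_{1j}$ regret for each comparison
  - By triangle inequality, $\epsilon_{1i} + \epsilon_{1j} \le 2 \max\{\epsilon_{1i}, \epsilon_{ij}\}$
- Thus by stochastic transitivity, accumulated regret is $O\!\left(\frac{1}{\epsilon_{12}} \log T\right)$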

Number of Rounds
- Assume all superior bandits have equal probability of defeating the candidate
  - Worst-case scenario under transitivity
- Model this as a random walk
  - rj transitions to each ri (i < j) with equal probability
  - Compute total number of steps before reaching r1 (i.e., r*)
- Can show O(log K) w.h.p. using a Chernoff bound
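
The random-walk model can be checked by simulation. The sketch below assumes the walk starts at the worst rank rK (a pessimistic starting point chosen for illustration) and jumps uniformly to a better rank at each step. Under that model the expected number of steps is the harmonic number H(K−1), which grows like ln K. Function names and the values of `k` and `trials` are hypothetical.

```python
import random

def rounds_to_best(k, rng):
    """Number of steps for the walk r_k -> ... -> r_1 with uniform jumps to better ranks."""
    rank, steps = k, 0
    while rank > 1:
        rank = rng.randrange(1, rank)   # uniform over {1, ..., rank - 1}
        steps += 1
    return steps

def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

rng = random.Random(0)
k = 1000
trials = 20_000
avg = sum(rounds_to_best(k, rng) for _ in range(trials)) / trials
print(avg, harmonic(k - 1))  # simulated mean vs. H(K - 1)
```

The simulated mean should land close to H(K−1), illustrating why the number of rounds is logarithmic in K even when the walk starts from the worst bandit.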

Total Matches Played
- O(K) matches played in each round
  - Naïve analysis yields O(K log K) total
- However, all empirically worse bandits are also removed at the end of each round
  - They will not participate in future rounds
- Assume worst case that inferior bandits have ½ chance of being empirically worse
  - Can show w.h.p. that O(K) total matches are played over O(log K) rounds

Removing Inferior Bandits
- At the conclusion of each round
  - Remove any empirically worse bandits
- Intuition:
  - High confidence that the winner is better than the incumbent candidate
  - Empirically worse bandits cannot be “much better” than the incumbent candidate
  - Can show via Hoeffding bound that the winner is also better than the empirically worse bandits with high confidence
  - Preserves 1 – 1/T confidence overall that we’ll find the best bandit