1 The Dueling Bandits Problem Yisong Yue

2 Outline Brief Overview of Multi-Armed Bandits Dueling Bandits – Mathematical properties – Connections to other problems – Algorithmic Principles New Directions & Ongoing Research

3 Multi-Armed Bandit Problem (stochastic version) K actions (aka arms or bandits). Each action has an average reward μ_k – Unknown to us – Assume WLOG that μ_1 is largest. For t = 1…T – Algorithm chooses action a(t) – Receives random reward y(t) with expectation μ_a(t). Algorithm only receives feedback on the chosen action. Goal: minimize the "regret" R(T) = T·μ_1 − (μ_a(1) + μ_a(2) + … + μ_a(T)), i.e., the expected reward if we had perfect information to start, minus the expected reward of the algorithm.
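
A minimal sketch of this loop (not from the slides; the Bernoulli reward model, the UCB1-style index, and all names are illustrative assumptions) showing one way an algorithm can trade off exploration and exploitation while tracking the regret defined above:

```python
# Minimal sketch of the stochastic MAB loop on slide 3. The Bernoulli reward
# model, the UCB1-style index, and all names are illustrative assumptions.
import math
import random

def ucb1(means, T, seed=0):
    """Run a UCB1-style policy on Bernoulli arms; return the cumulative regret."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K        # number of pulls per arm
    totals = [0.0] * K      # sum of observed rewards per arm
    best = max(means)       # mu_1 in the slide's notation
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1       # pull each arm once to initialize
        else:
            a = max(range(K), key=lambda k: totals[k] / counts[k]
                    + math.sqrt(2.0 * math.log(t) / counts[k]))
        y = 1.0 if rng.random() < means[a] else 0.0   # noisy reward with mean mu_a(t)
        counts[a] += 1
        totals[a] += y
        regret += best - means[a]   # contributes to T*mu_1 - sum_t mu_a(t)
    return regret

# Example: arm means matching the "Average Likes" that appear on slide 12.
print(ucb1([0.44, 0.40, 0.33, 0.20], T=10000))   # regret grows roughly like log T
```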

4–11 Example: Interactive Personalization (animation) – The system repeatedly recommends an article topic (Sports, Politics, World, Economy, …), observes whether the user likes it, and updates a per-topic "# Shown" count and "Average Likes" rate.

12 Example: Interactive Personalization – After 24 likes, each topic (Politics, Economy, Celebrity, …) has a "# Shown" count and an "Average Likes" rate (roughly 0.44, 0.4, 0.33, 0.2). What should the algorithm recommend: exploit the best-looking topic, or explore one shown less often? How to optimally balance the explore/exploit tradeoff? Characterized by the Multi-Armed Bandit Problem.

13 Regret: R(T) = T·μ_1 − (μ_a(1) + … + μ_a(T)) over time horizon T – the opportunity cost of not knowing preferences. "No-regret" if R(T)/T → 0 – Efficiency measured by convergence rate.

14 The Motivating Problem – Slot machine = one-armed bandit; each arm has a different payoff. Goal: minimize regret from pulling suboptimal arms. Image source: http://research.microsoft.com/en-us/projects/bandits/

15 Many Applications: Online Advertising, Search Engines, Recommender Systems, Personalized Clinical Treatment, Sequential Experimental Design.

16 What if Rewards aren't Directly Measurable? Many applications involve human preferences. MAB assumes direct (noisy) access to rewards – "Rate option A", "Rate option B" – difficult to reliably measure! Dueling Bandits only needs preference feedback – "Prefer option A or option B?" – much easier to reliably measure!

17 Dueling Bandits Problem (with Josef Broder, Robert Kleinberg and Thorsten Joachims) [Yue, Broder, Kleinberg & Joachims, COLT 2009] K bandits b_1, …, b_K. Each iteration: compare (duel) two bandits – Observe (noisy) outcome – Requires a dueling mechanism. Cost function (regret): R(T) = Σ_{t=1}^{T} [ ε(b*, b_t) + ε(b*, b_t') ], where (b_t, b_t') are the two bandits chosen at iteration t, b* is the overall best one, and ε(b_i, b_j) = P(b_i > b_j) − ½ (how much the human user preferred b* over the chosen bandits).
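
A small sketch of this interaction and its regret bookkeeping (the class and all names are illustrative assumptions, not code from the paper), using the preference matrix shown on the next slide:

```python
# Sketch of the dueling-bandits interaction and regret defined on slide 17,
# using the slide-18 preference matrix. All names are illustrative assumptions.
import random

# EPS[i][j] = P(b_i > b_j) - 0.5 for bandits A..F (A, index 0, is best).
EPS = [
    [ 0.00,  0.03,  0.04,  0.06,  0.10,  0.11],   # A
    [-0.03,  0.00,  0.03,  0.05,  0.08,  0.11],   # B
    [-0.04, -0.03,  0.00,  0.04,  0.07,  0.09],   # C
    [-0.06, -0.05, -0.04,  0.00,  0.05,  0.07],   # D
    [-0.10, -0.08, -0.07, -0.05,  0.00,  0.03],   # E
    [-0.11, -0.11, -0.09, -0.07, -0.03,  0.00],   # F
]

class DuelingBandits:
    def __init__(self, eps, best=0, seed=0):
        self.eps, self.best = eps, best       # best = index of b*
        self.rng = random.Random(seed)
        self.regret = 0.0

    def duel(self, i, j):
        """Noisy comparison; returns True iff b_i beats b_j this round."""
        self.regret += self.eps[self.best][i] + self.eps[self.best][j]
        return self.rng.random() < 0.5 + self.eps[i][j]

env = DuelingBandits(EPS)
env.duel(4, 5)     # compare E & F: adds 0.10 + 0.11 = 0.21 regret (slide 19)
env.duel(0, 0)     # compare A & A: adds 0 regret (slide 21)
print(round(env.regret, 2))   # 0.21
```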

18 Example Pairwise Preferences – values are Pr(row > col) – 0.5:
      A      B      C      D      E      F
A   0      0.03   0.04   0.06   0.10   0.11
B  -0.03   0      0.03   0.05   0.08   0.11
C  -0.04  -0.03   0      0.04   0.07   0.09
D  -0.06  -0.05  -0.04   0      0.05   0.07
E  -0.10  -0.08  -0.07  -0.05   0      0.03
F  -0.11  -0.11  -0.09  -0.07  -0.03   0

19 Example Pairwise Preferences (same matrix as slide 18). Compare E & F and observe the duel outcome: P(A > E) = 0.60 and P(A > F) = 0.61, so the incurred regret is 0.10 + 0.11 = 0.21.

20 Example Pairwise Preferences (same matrix as slide 18). Compare B & C and observe the duel outcome: P(A > B) = 0.53 and P(A > C) = 0.54, so the incurred regret is 0.03 + 0.04 = 0.07.

21 Example Pairwise Preferences (same matrix as slide 18). Compare A & A and observe the duel outcome: P(A > A) = 0.50, so the incurred regret is 0.00 – playing the best bandit against itself incurs no regret.

22 Modeling Assumptions P(b_i > b_j) = ½ + ε_{i,j} (distinguishability). Strong Stochastic Transitivity – For three bandits b_i > b_j > b_k: ε_{i,k} ≥ max(ε_{i,j}, ε_{j,k}) – Monotonicity property. Stochastic Triangle Inequality – For three bandits b_i > b_j > b_k: ε_{i,k} ≤ ε_{i,j} + ε_{j,k} – Diminishing returns property. Satisfied by many standard models – E.g., Logistic / Bradley-Terry.

23 Example Pairwise Preferences (same matrix as slide 18) – Monotonic: the margins ε only grow as the opponent gets worse, illustrating strong stochastic transitivity.

24–25 Example Pairwise Preferences (same matrix as slide 18) – Red ≤ Blue + Green: for b_i > b_j > b_k, the direct margin ε_{i,k} is at most the sum of the intermediate margins ε_{i,j} + ε_{j,k} (e.g., ε(A, C) = 0.04 ≤ ε(A, B) + ε(B, C) = 0.03 + 0.03 = 0.06), illustrating the stochastic triangle inequality.
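
As a quick sanity check, a small sketch (illustrative only, not from the slides) verifying that the example matrix satisfies both slide-22 assumptions:

```python
# Sketch checking strong stochastic transitivity and the stochastic triangle
# inequality (slide 22) on the slide-18 matrix. Bandits are indexed best to
# worst, so i < j < k means b_i > b_j > b_k.
from itertools import combinations

EPS = [[ 0.00,  0.03,  0.04,  0.06,  0.10,  0.11],
       [-0.03,  0.00,  0.03,  0.05,  0.08,  0.11],
       [-0.04, -0.03,  0.00,  0.04,  0.07,  0.09],
       [-0.06, -0.05, -0.04,  0.00,  0.05,  0.07],
       [-0.10, -0.08, -0.07, -0.05,  0.00,  0.03],
       [-0.11, -0.11, -0.09, -0.07, -0.03,  0.00]]

triples = list(combinations(range(len(EPS)), 3))
sst = all(EPS[i][k] >= max(EPS[i][j], EPS[j][k]) for i, j, k in triples)
sti = all(EPS[i][k] <= EPS[i][j] + EPS[j][k] for i, j, k in triples)
print(sst, sti)   # True True: the example matrix satisfies both assumptions
```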

26 Explore-then-Exploit Decompose into 2 Phases Explore Phase – Identify the best bandit w.h.p. – Minimize incurred regret Exploit Phase – Play best bandit vs itself – Incurs no regret

27 Connection to Tournaments – Each pair "duels" until statistical significance, aka a Noisy Tournament [Feige et al., 1994] – Guarantees finding the best bandit w.h.p. – Can we use it as the explore algorithm?

28 Tournament is Bad – Each pair "duels" until statistical significance – Problem: two (nearly) equally bad bandits. Analogy: hypothetical soccer tournament – A team wins when it has a 3-goal lead – The audience prefers that good teams play (regret) – Two (nearly) equally bad teams will play for a long time.

29 Other Explore Strategies – Tournament, e.g., tennis [Feige et al., 1994] – Champion, e.g., boxing [Yue, Broder, Kleinberg & Joachims, 2009] – Swiss, e.g., group rounds in the World Cup [Yue & Joachims, 2011].

30 Champion (Interleaved Filter) [Yue, Broder, Kleinberg & Joachims, COLT 2009] – Dueling mechanism: the champion duels each challenger (round robin) until statistical significance.
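
A simplified sketch of this champion-style explore phase in the spirit of Interleaved Filter: the champion duels all remaining challengers round robin; challengers it confidently beats are pruned, and a challenger that confidently beats it becomes the new champion. The confidence radius, pruning rule, and names are illustrative assumptions, not the exact COLT 2009 algorithm.

```python
# Simplified champion-style explore phase in the spirit of Interleaved Filter.
# Confidence radius and constants are illustrative assumptions.
import math

def interleaved_filter(duel, K, delta=0.01, max_rounds=100_000):
    """duel(i, j) must return True iff bandit i wins a noisy comparison with j.
    Returns the index of the surviving bandit (the best one w.h.p.)."""
    remaining = set(range(K))
    champ = next(iter(remaining))
    wins = {b: 0 for b in remaining if b != champ}   # champion's wins vs each challenger
    n = {b: 0 for b in remaining if b != champ}      # duels vs each challenger
    for _ in range(max_rounds):
        if len(remaining) == 1:
            break
        new_champ = None
        for b in list(wins):                         # round robin vs the champion
            wins[b] += duel(champ, b)
            n[b] += 1
            p_hat = wins[b] / n[b]
            c = math.sqrt(math.log(1.0 / delta) / (2.0 * n[b]))   # confidence radius
            if p_hat - c > 0.5:                      # champion confidently beats b: prune b
                remaining.discard(b)
                del wins[b], n[b]
            elif p_hat + c < 0.5 and new_champ is None:
                new_champ = b                        # b confidently beats the champion
        if new_champ is not None:                    # replace champion, reset statistics
            remaining.discard(champ)
            champ = new_champ
            wins = {b: 0 for b in remaining if b != champ}
            n = {b: 0 for b in remaining if b != champ}
    return champ

# Usage with the DuelingBandits environment sketched after slide 17:
#   best = interleaved_filter(env.duel, K=6)   # should return 0 (bandit A) w.h.p.
```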

31–33 Champion is Good – Regret per champion is bounded (a bad champion is quickly replaced). The sequence of champions behaves like a random walk, so roughly log(#bandits) rounds suffice to arrive at the best – one of the remaining challengers becomes the next champion. Regret per champion scales with the number of remaining bandits divided by the margin between the best bandit and the rest. [Yue, Broder, Kleinberg & Joachims, COLT 2009]

34 Champion is Good – Total regret: O((K / ε) · log T), where K is the number of bandits, ε is the margin between the best bandit and the others, and T is the time horizon – an optimal regret guarantee! [Yue, Broder, Kleinberg & Joachims, COLT 2009]

35 Swiss (Beat the Mean) [Yue & Joachims, ICML 2011] – Dueling mechanism: each iteration, duel a random pair; a bandit's record is its win rate vs. the "mean" bandit, and the bandit with the worst record is eliminated; duels with the eliminated bandit are removed from all remaining records (only a fraction of each record). Related to action elimination algorithms [Even-Dar et al., 2006]. Regret until the next removal is bounded.
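
A simplified sketch in the spirit of Beat the Mean: every remaining bandit's record is its win rate against a uniformly random remaining opponent (the "mean" bandit); once the records separate, the worst bandit is eliminated and its duels are dropped from the surviving records. The elimination rule and all constants are illustrative assumptions, not the exact ICML 2011 algorithm.

```python
# Simplified sketch in the spirit of Beat the Mean (slide 35); illustrative only.
import math
import random

def beat_the_mean(duel, K, delta=0.01, max_rounds=200_000, seed=0):
    rng = random.Random(seed)
    remaining = list(range(K))
    logs = {b: [] for b in remaining}     # per-bandit list of (opponent, win) pairs
    for _ in range(max_rounds):
        if len(remaining) == 1:
            return remaining[0]
        b = rng.choice(remaining)                              # duel a random pair
        opp = rng.choice([x for x in remaining if x != b])
        logs[b].append((opp, duel(b, opp)))
        counts = {x: len(logs[x]) for x in remaining}
        if min(counts.values()) == 0:
            continue
        rates = {x: sum(w for _, w in logs[x]) / counts[x] for x in remaining}
        c = math.sqrt(math.log(1.0 / delta) / (2.0 * min(counts.values())))
        worst = min(remaining, key=lambda x: rates[x])
        best = max(remaining, key=lambda x: rates[x])
        if rates[best] - rates[worst] > 2.0 * c:               # records have separated
            remaining.remove(worst)                            # eliminate the worst bandit
            for x in remaining:                                # drop duels vs the loser
                logs[x] = [(o, w) for o, w in logs[x] if o != worst]
    # time ran out: return the empirically best survivor
    return max(remaining, key=lambda x: sum(w for _, w in logs[x]) / max(1, len(logs[x])))

# Usage with the environment from the slide-17 sketch:
#   best = beat_the_mean(env.duel, K=6)   # should return 0 (bandit A) w.h.p.
```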

36 Swiss is Better – Champion has high variance (depends on the initial champion). Swiss offers a low-variance alternative – successively eliminate the worst bandit. Regret: optimal regret guarantee with high probability! [Yue & Joachims, ICML 2011]

37 Experiments: regret vs. number of bandits [Yue & Joachims, ICML 2011] – Tournament performs poorly (>10x worse; in some experiments >10000x worse). Swiss has lower variance than Champion.

38 General Algorithmic Structure (most existing DB algorithms follow this structure) First bandit is a pivot/anchor – Interleaved Filter: Champion – Beat the Mean: Uniform(all remaining bandits) Second bandit explores relative to pivot – Interleaved Filter: Round Robin – Beat the Mean: Randomized Round Robin Update Pivot Periodically – This structure makes it easier to analyze regret – E.g., spend more time exploring good pivots

39 More Recent Results – Relaxing Strong Transitivity – Beyond the Condorcet Winner: Borda Winner, von Neumann Winner, Copeland Winner – Anytime algorithms (original work required knowing T a priori) – More sophisticated dueling mechanisms, including off-policy evaluation – Adversarial Setting – Contextual Setting.

40 Ongoing Work: Dependent Arms (with Yanan Sui, Vincent Zhuang and Joel Burdick) Suppose K is very large (possibly infinite) – But arms have dependency structure – E.g., P(a>b) similar to P(a>b’) if b similar to b’ – Measure similarity using kernel Initial results: – Asymptotically optimal but impractical algorithm:

41 Personalized Clinical Treatment (with Yanan Sui, Vincent Zhuang and Joel Burdick) – SCI patient; Medtronic human array (49 mm × 10 mm). Each patient is unique – 10^9 possible configurations! Image source: williamcapicottomd.com

42 Centralized vs Distributed Most DB algorithms are centralized – Single algorithm controls choice of both bandits What about distributed algorithms? – Two algorithms each controlling one bandit? – “Sparring” [Ailon, Karnin and Joachims, ICML 2014] Each algorithm plays a standard MAB algorithm Recently Analyzed [Dudik, Schapire and Slivkins, COLT 2015]
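
A sketch of the distributed "Sparring" idea: two independent MAB learners each control one side of the duel, and the winner's learner receives reward 1. EXP3 is used as the per-side learner here as an illustrative choice in the spirit of Ailon, Karnin & Joachims (2014), not their exact reduction; all names are assumptions.

```python
# Sketch of "Sparring" (slide 42): two independent MAB learners, one per side.
import math
import random

class Exp3:
    def __init__(self, K, gamma=0.05, seed=0):
        self.K, self.gamma = K, gamma
        self.w = [1.0] * K
        self.rng = random.Random(seed)

    def probs(self):
        total = sum(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.K for wi in self.w]

    def draw(self):
        return self.rng.choices(range(self.K), weights=self.probs())[0]

    def update(self, arm, reward):
        p = self.probs()[arm]
        self.w[arm] *= math.exp(self.gamma * (reward / p) / self.K)
        if max(self.w) > 1e100:                       # renormalize to avoid overflow
            m = max(self.w)
            self.w = [wi / m for wi in self.w]

def sparring(duel, K, T, seed=0):
    row, col = Exp3(K, seed=seed), Exp3(K, seed=seed + 1)
    for _ in range(T):
        i, j = row.draw(), col.draw()                 # each learner picks one bandit
        i_wins = duel(i, j)                           # single preference observation
        row.update(i, 1.0 if i_wins else 0.0)
        col.update(j, 0.0 if i_wins else 1.0)
    return row, col

# Usage with the environment from the slide-17 sketch:
#   row, col = sparring(env.duel, K=6, T=20_000)  # both learners should concentrate on A
```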

43 Dueling Bandits ≈ Zero-Sum Game – The pairwise preference matrix (values are Pr(row > col) – 0.5; same matrix as slide 18) defines a zero-sum game. Basic setting: single dominant strategy; regret = opportunity cost to social welfare.

44 Ongoing Work: Learning in Games (with Sid Barman and Katrina Ligett) – Efficient convergence to good equilibria, i.e., low regret in the Dueling Bandits setting. Recent result for sparse preference matrices (zero-sum games): the convergence rate depends on the sparsity of each column or row.

45 Ongoing Work: Learning in Games (with Sid Barman and Katrina Ligett) Centralized algorithms = “coordination” – Settings that benefit from minimal coordination? Low Rank Matrix: – Two algorithms coordinate on when to explore – Analogous to compressive sensing! – Speed up convergence??? (in zero-sum games)

46 Summary: Dueling Bandits Problem Elicits preference feedback – Motivated by human-centric personalization – Characterizes explore/exploit tradeoff Connections to noisy tournaments Connections to learning in games

47 References
- The K-armed Dueling Bandits Problem, by Yisong Yue, Josef Broder, Robert Kleinberg and Thorsten Joachims, COLT 2009
- Interactively Optimizing Information Retrieval Systems as a Dueling Bandits Problem, by Yisong Yue and Thorsten Joachims, ICML 2009
- Beat the Mean Bandit, by Yisong Yue and Thorsten Joachims, ICML 2011
- Reusing Historical Interaction Data for Faster Online Learning to Rank for IR, by Katja Hofmann, Anne Schuth, Shimon Whiteson, and Maarten de Rijke, WSDM 2013
- Generic Exploration and K-armed Voting Bandits, by Tanguy Urvoy, Fabrice Clerot, Raphael Feraud and Sami Naamane, ICML 2013
- Reducing Dueling Bandits to Cardinal Bandits, by Nir Ailon, Zohar Karnin and Thorsten Joachims, ICML 2014
- Relative Upper Confidence Bound for the K-armed Dueling Bandit Problem, by Masrour Zoghi, Shimon Whiteson, Remi Munos and Maarten de Rijke, ICML 2014
- Clinical Online Recommendation with Subgroup Rank Feedback, by Yanan Sui and Joel Burdick, RecSys 2014
- Sparse Dueling Bandits, by Kevin Jamieson, Sumeet Katariya, Atul Deshpande and Robert Nowak, AISTATS 2015
- Contextual Dueling Bandits, by Miro Dudik, Robert Schapire and Alex Slivkins, COLT 2015
- A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits, by Pratik Gajane, Tanguy Urvoy and Fabrice Clerot, ICML 2015
- Copeland Dueling Bandits, by Masrour Zoghi, Zohar Karnin, Shimon Whiteson and Maarten de Rijke, NIPS 2015
- Online Rank Elicitation for Plackett-Luce: A Dueling Bandits Approach, by Balazs Szorenyi, Robert Busa-Fekete, Adil Paul and Eyke Hullermeier, NIPS 2015

48 Extra Slides

49 Learning from Click Data – A click admits two interpretations. Interpretation 1: Result #2 is good (Absolute). Interpretation 2: Result #2 is better than Result #1 (Relative / Preference).

50 Learning from Click Data – Retrieval Function A vs. Retrieval Function B: which is better, based on observed clicks?

51 Analogy to Sensory Testing – (Hypothetical) taste experiment comparing two drinks in a natural usage context. Experiment 1: absolute metrics – count the cans each participant consumes (totals: 8 cans vs. 9 cans) – but one participant was very thirsty, which skews the absolute counts.

52 Analogy to Sensory Testing – Experiment 2: relative metrics – each participant compares the two drinks directly (per-participant preferences: 2–1, 3–0, 2–0, 1–0, 4–1, 2–1): all 6 prefer Pepsi.

53 Interleaving (Taste Test in Search) [Radlinski et al. 2008]
Ranking A: 1. Napa Valley – The authority for lodging... (www.napavalley.com) 2. Napa Valley Wineries - Plan your wine... (www.napavalley.com/wineries) 3. Napa Valley College (www.napavalley.edu/homex.asp) 4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681) 5. Napa Valley Wineries and Wine (www.napavintners.com) 6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
Ranking B: 1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley) 2. Napa Valley – The authority for lodging... (www.napavalley.com) 3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...) 4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com) 5. NapaValley.org (www.napavalley.org) 6. The Napa Valley Marathon (www.napavalleymarathon.org)
Presented Ranking (interleaved from A and B): 1. Napa Valley – The authority for lodging... (www.napavalley.com) 2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley) 3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...) 4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries) 5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com) 6. Napa Valley College (www.napavalley.edu/homex.asp) 7. NapaValley.org (www.napavalley.org)

54 Interleaving (Taste Test in Search) [Radlinski et al. 2008] – Same Ranking A, Ranking B, and presented interleaved ranking as slide 53. The user clicks a result contributed by Ranking B – B wins! Extension for comparing greedy submodular policies [Yue & Guestrin, 2011].
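
A sketch of team-draft-style interleaving in the spirit of [Radlinski et al. 2008]: the two rankings take turns contributing their highest-ranked result not yet shown, each click is credited to the ranking that contributed the clicked result, and the ranking with more credited clicks wins. The per-pick coin flip and the tie handling are simplifications of the actual team-draft rule, and the toy data below only reuses a few of the slide-53 results.

```python
# Sketch of team-draft-style interleaving (slides 53-54); illustrative only.
import random

def team_draft_interleave(rank_a, rank_b, seed=0):
    rng = random.Random(seed)
    interleaved, team = [], {}                 # team[doc] = 'A' or 'B'
    ia = ib = 0
    while ia < len(rank_a) or ib < len(rank_b):
        pick_a = ia < len(rank_a) and (ib >= len(rank_b) or rng.random() < 0.5)
        doc = rank_a[ia] if pick_a else rank_b[ib]
        if doc not in team:                    # skip documents already placed
            interleaved.append(doc)
            team[doc] = 'A' if pick_a else 'B'
        if pick_a:
            ia += 1
        else:
            ib += 1
    return interleaved, team

def judge(team, clicked_docs):
    """Credit clicks to the contributing ranking; return 'A', 'B', or 'tie'."""
    a = sum(1 for d in clicked_docs if team.get(d) == 'A')
    b = sum(1 for d in clicked_docs if team.get(d) == 'B')
    return 'A' if a > b else 'B' if b > a else 'tie'

# Toy example with a few of the slide-53 results:
ranking_a = ["napavalley.com", "napavalley.com/wineries", "napavalley.edu"]
ranking_b = ["en.wikipedia.org/wiki/Napa_Valley", "napavalley.com", "napalinks.com"]
shown, team = team_draft_interleave(ranking_a, ranking_b)
print(judge(team, ["en.wikipedia.org/wiki/Napa_Valley"]))   # 'B', as on slide 54
```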

55 Deployment on Yahoo! Search Engine – comparing two ranking functions, each receiving 50% of traffic [Chapelle, Joachims, Radlinski & Yue, TOIS 2012]. Interleaving is more sensitive and more reliable than absolute metrics (e.g., #Clicks@1, total #Clicks): its disagreement probability drops with far fewer queries (roughly 100x fewer).

