Published by Briana Pointer. Modified over 2 years ago.

1
A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem
Matthew Streeter and Stephen Smith
Carnegie Mellon University

2
Outline
- The max k-armed bandit problem
- Previous work
- Our distribution-free approach
- Experimental evaluation

3
What is the max k-armed bandit problem?

4
The classical k-armed bandit
- You are in a room with k slot machines
- Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution D_i
- You are allowed n total pulls
- Goal: maximize total payoff
- More than 50 years of papers

5
The max k-armed bandit
- You are in a room with k slot machines
- Pulling the arm of machine i returns a payoff drawn (independently at random) from an unknown distribution D_i
- You are allowed n total pulls
- Goal: maximize the highest payoff received
- Introduced ~2003

6
Why study it?

7
Goal: improve multi-start heuristics
- A multi-start heuristic runs an underlying randomized heuristic many times and returns the best solution found
- Examples:
  - HBSS (Bresina 1996)
  - VBSS (Cicirello & Smith 2005)
  - GRASPs (Feo & Resende 1995, and many others)

8
Application: selecting among heuristics
- Given: some optimization problem and k randomized heuristics
- Each time you run a heuristic, you get a solution with a certain quality
- You are allowed n runs
- Goal: maximize the quality of the best solution

9
The max k-armed bandit: example
- Given n pulls, how can we maximize the (expected) maximum payoff?
- If n=1, we should pull the blue arm (higher mean)
- If n=1000, we should mainly pull the maroon arm (higher variance)
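
The effect on this slide can be checked with a small Monte Carlo sketch. The two Gaussian arms below are hypothetical stand-ins for the blue and maroon arms (the slides do not give their distributions), and `expected_max` is an illustrative helper, not part of the talk.

```python
import random

def expected_max(sample, n, trials=300, seed=0):
    """Monte Carlo estimate of E[max of n independent draws from sample(rng)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sample(rng) for _ in range(n))
    return total / trials

# Hypothetical arms (assumed for illustration only).
def blue(rng):
    # Higher mean, low variance.
    return rng.gauss(0.6, 0.05)

def maroon(rng):
    # Lower mean, high variance.
    return rng.gauss(0.4, 0.3)

for n in (1, 1000):
    print(n, round(expected_max(blue, n), 3), round(expected_max(maroon, n), 3))
```

With these assumed parameters, the blue arm should have the larger estimate at n=1 and the maroon arm the larger estimate at n=1000, because the expected maximum of many draws is driven by the upper tail rather than the mean.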

10
Distributional assumptions?
- Without distributional assumptions, the optimal strategy is not interesting
- For example, suppose payoffs are in {0,1} and the arms are shuffled so you don’t know which is which
- The optimal strategy samples the arms in round-robin order!
- You can’t distinguish a “good” arm until you receive payoff 1, at which point the max payoff can’t be improved

11
Distributional assumptions?
- All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution
- Why?
  - Extremal Types Theorem: let M_n = max. of n independent draws from some fixed distribution. As n → ∞, the distribution of M_n (suitably normalized) → a GEV distribution
  - GEV sometimes gives an excellent fit to payoff distributions we care about

12
Previous work
- Cicirello & Smith (CP 2004, AAAI 2005):
  - Assumed Gumbel distributions (a special case of GEV), no rigorous performance guarantees
  - Good results selecting among heuristics for the RCPSP/max
- Streeter & Smith (AAAI 2006):
  - Rigorous result for general GEV distributions
  - But no experimental evaluation

13
Our contributions
- Threshold ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine
- Chernoff interval estimation: a strategy for the classical k-armed bandit that works well when mean payoffs are small (we assume payoffs in [0,1])

14
Threshold Ascent
- Parameters: strategy S for the classical k-armed bandit, integer m > 0
- Idea:
  - Initialize t ← -∞
  - Use S to maximize the number of payoffs that exceed t
  - Once m payoffs > t have been received, increase t and repeat

15
Threshold Ascent
- Designed to work well when: for t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms

16
Threshold Ascent
- Parameters: strategy S for the classical k-armed bandit, integer m > 0
- Idea:
  - Initialize t ← -∞
  - Use S to maximize the number of payoffs that exceed t
  - Once m payoffs > t have been received, increase t and repeat
- m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t)
- As t gets large, S sees a classical k-armed bandit instance where almost all payoffs are zero
- We don’t really start S from scratch each time we increase t
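
The idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slide does not say by how much t is raised, so the sketch assumes t is set to the m-th largest payoff seen so far (the smallest increase that leaves fewer than m payoffs above t). The stand-in strategy S, `hoeffding_index`, and all function and parameter names are hypothetical. Payoff histories are kept across threshold changes, so S is not restarted from scratch.

```python
import bisect
import heapq
import math
import random

def hoeffding_index(successes, pulls, total_pulls):
    """Stand-in for S (an assumption): optimistic estimate of P[payoff > t]."""
    if pulls == 0:
        return math.inf
    bonus = math.sqrt(2.0 * math.log(total_pulls + 1) / pulls)
    return successes / pulls + bonus

def threshold_ascent(arms, n, m=100, index=hoeffding_index, seed=0):
    """Sketch of Threshold Ascent; returns (best payoff seen, final threshold t)."""
    rng = random.Random(seed)
    seen = [[] for _ in arms]  # sorted payoffs per arm, reused when t changes
    top = []                   # min-heap holding the m largest payoffs overall
    t = -math.inf
    best = -math.inf
    for pull in range(n):
        # Score each arm by how often its payoffs have exceeded t so far.
        scores = []
        for payoffs in seen:
            above = len(payoffs) - bisect.bisect_right(payoffs, t)
            scores.append(index(above, len(payoffs), pull))
        i = max(range(len(arms)), key=scores.__getitem__)
        x = arms[i](rng)
        best = max(best, x)
        bisect.insort(seen[i], x)
        # Track the m largest payoffs; the smallest of them is the new t.
        if len(top) < m:
            heapq.heappush(top, x)
        elif x > top[0]:
            heapq.heapreplace(top, x)
        if len(top) == m:
            t = top[0]  # assumed update rule: fewer than m payoffs now exceed t
    return best, t

# Hypothetical arms: uniform payoffs on [0, 0.5] and on [0, 1].
arms = [lambda rng: rng.uniform(0.0, 0.5), lambda rng: rng.uniform(0.0, 1.0)]
print(threshold_ascent(arms, n=2000, m=20))
```

Any classical bandit rule with the same signature can be passed as `index`; a Chernoff-style bound is a natural choice once t is large and successes are rare.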

17
Interval Estimation
- Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm’s mean payoff and pulls the arm with the highest upper bound
[Figure: confidence intervals around the mean payoffs μ1, μ2, μ3 of Arm 1, Arm 2, and Arm 3]

18
Chernoff Interval Estimation
- We analyze a variant of interval estimation with confidence intervals derived from Chernoff bounds
- regret = μ* - average_payoff(strategy), where μ* = mean payoff of the best arm
- We prove an O(sqrt(μ*) · X) regret bound, where X = sqrt(k (log n)/n)
- Using Hoeffding’s inequality gives only O(X) (Auer et al. 2002)
- As μ* → 0, our bound is much better
- Comparable bounds can be obtained using “multiplicative weight update” algorithms
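
To see why a Chernoff-based interval helps when mean payoffs are small, the sketch below compares two upper confidence bounds for payoffs in [0,1]. The Chernoff form is obtained by inverting the multiplicative bound P[μ̂ < (1 - β)μ] ≤ exp(-nμβ²/2); it is an assumed, illustrative form and may differ in constants from the one analyzed in the talk. The function names and the per-interval failure probability `delta` are hypothetical.

```python
import math

def chernoff_upper_bound(mean, pulls, delta=0.01):
    """Upper bound on an arm's true mean from a multiplicative Chernoff bound.

    mean  : empirical mean payoff of the arm (payoffs assumed in [0, 1])
    pulls : number of times the arm has been pulled
    delta : allowed failure probability for this one interval (assumption)
    """
    if pulls == 0:
        return math.inf
    alpha = math.log(1.0 / delta)
    return mean + (alpha + math.sqrt(2.0 * pulls * mean * alpha + alpha * alpha)) / pulls

def hoeffding_upper_bound(mean, pulls, delta=0.01):
    """Upper bound from Hoeffding's inequality, for comparison."""
    if pulls == 0:
        return math.inf
    alpha = math.log(1.0 / delta)
    return mean + math.sqrt(alpha / (2.0 * pulls))

# Interval widths after 1000 pulls for a small and a large empirical mean.
for mean in (0.01, 0.5):
    c = chernoff_upper_bound(mean, 1000) - mean
    h = hoeffding_upper_bound(mean, 1000) - mean
    print(mean, round(c, 4), round(h, 4))
```

The Chernoff width scales roughly with sqrt(mean/pulls), so it shrinks as the empirical mean approaches 0, while the Hoeffding width does not depend on the mean at all. For means near 0.5 the Hoeffding interval is the tighter of the two.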

19
Experimental Evaluation

20
The RCPSP/max
- Assign start times to activities subject to resource and temporal constraints
- Goal: find a schedule with minimum makespan
- NP-hard, “one of the most intractable problems in operations research” (Möhring 2000)
- Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)

21
Evaluation
- Five multi-start heuristics; each is a randomized rule for greedily building a schedule:
  - LPF: “longest path following”
  - LST: “latest start time”
  - MST: “minimum slack time”
  - MTS: “most total successors”
  - RSM: “resource scheduling method”
- Three max k-armed bandit strategies:
  - Threshold Ascent (m=100, S = Chernoff interval estimation with 99% confidence intervals)
  - Round-robin sampling
  - QD-BEACON (Cicirello & Smith 2004, 2005)
- Note: we use a less aggressive variant of interval estimation in these experiments

22
Evaluation
- Ran on 169 instances from the ProGen/max library
- For each instance, ran each of the five rules 10,000 times and saved the results to a file
- For each of the three strategies, solved the instance as a max 5-armed bandit with n=10,000 pulls
- Define regret = difference between the max. possible payoff and the max. payoff actually obtained
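
The replay protocol above can be sketched as follows. This is an illustration under assumptions: the storage layout (`stored[i]` is the list of saved payoffs for rule i, consumed in order), the reading of "max. possible payoff" as the largest stored payoff over all rules, and every name below are hypothetical rather than taken from the talk.

```python
def replay_regret(stored, choose, n):
    """Replay a strategy against pre-computed payoffs and return its regret.

    stored : stored[i] is the list of saved payoffs for rule i
    choose : function (pull_number, history) -> index of the rule to run next
    n      : number of pulls; must not exceed the results saved per rule
    """
    next_unused = [0] * len(stored)  # position of the next unused result per rule
    history = []                     # (rule index, payoff) pairs seen so far
    best = float("-inf")
    for pull in range(n):
        i = choose(pull, history)
        x = stored[i][next_unused[i]]
        next_unused[i] += 1
        history.append((i, x))
        best = max(best, x)
    # Assumed definition: best payoff anywhere in the saved results.
    max_possible = max(max(row) for row in stored)
    return max_possible - best

def make_round_robin(k):
    """Baseline strategy: cycle through the k rules in order."""
    return lambda pull, history: pull % k

# Tiny made-up example with two rules and four saved results each.
stored = [[1, 2, 3, 10], [5, 0, 0, 0]]
print(replay_regret(stored, make_round_robin(2), n=4))
```

Replaying saved results lets every strategy be compared on identical payoff sequences, so differences in regret reflect the strategies' choices rather than sampling noise.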

23
Results
- Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five “pure” strategies

24
Summary & Conclusions
- The max k-armed bandit problem is a simple online learning problem with applications to heuristic search
- We described a new, distribution-free approach to the max k-armed bandit problem
- Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max
