Multi-armed Bandit Problems with Dependent Arms


1 Multi-armed Bandit Problems with Dependent Arms
Sandeep Pandey, Deepayan Chakrabarti, Deepak Agarwal

2 Background: Bandits
Bandit “arms” with unknown reward probabilities μ1, μ2, μ3
Pull arms sequentially so as to maximize the total expected reward
Examples: show ads on a webpage to maximize clicks; recommend products to maximize sales
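For concreteness, here is a minimal sketch of one standard bandit policy (UCB1), assuming Bernoulli 0/1 rewards; the slides do not commit to a particular policy, and the names (ucb1, pull, mus) are purely illustrative.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1: try every arm once, then repeatedly pull the arm with the
    highest upper confidence bound on its estimated reward probability."""
    counts = [0] * n_arms
    successes = [0] * n_arms
    total = 0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                          # initialization: one pull per arm
        else:
            arm = max(range(n_arms),
                      key=lambda i: successes[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)                       # observe a 0/1 reward
        counts[arm] += 1
        successes[arm] += reward
        total += reward
    return total

# Example: three independent Bernoulli arms with unknown reward probabilities.
mus = [0.3, 0.28, 1e-6]
print(ucb1(lambda i: int(random.random() < mus[i]), n_arms=3, horizon=10000))
```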

3 Dependent Arms
Reward probabilities μi are generally assumed to be independent of each other
What if they are dependent?
E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
“Skiing, snowboarding” (μ1 = 0.3), “Skiing, snowshoes” (μ2 = 0.28), “Snowshoe rental” (μ3 = 0.31), “Get Vonage!” (μ4 = 10⁻⁶)

4 Dependent Arms
Reward probabilities μi are generally assumed to be independent of each other
What if they are dependent?
E.g., ads on similar topics, using similar text/phrases, should have similar rewards
A click on one ad → other “similar” ads may generate clicks as well
Can we increase total reward using this dependency?

5 Cluster Model of Dependence
[Figure: Arms 1-2 form Cluster 1, Arms 3-4 form Cluster 2]
Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
μi ~ f(π[i]), where f is some known distribution and π[i] is the (unknown) cluster-specific parameter of arm i’s cluster
No dependence across clusters
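A small sketch of sampling from this cluster model, assuming purely for illustration that the known distribution f is a Beta distribution and that each cluster's unknown parameter π is an (a, b) pair; the function name and example numbers are not from the slides.

```python
import random

def sample_cluster_model(cluster_params, arms_per_cluster, pulls_per_arm):
    """Sample arms and rewards from the cluster model, assuming f = Beta and
    pi = (a, b) per cluster (an illustrative choice; f only needs to be known)."""
    data = []
    for a, b in cluster_params:                       # one unknown pi per cluster
        cluster = []
        for _ in range(arms_per_cluster):
            mu = random.betavariate(a, b)             # mu_i ~ f(pi of this cluster)
            s = sum(random.random() < mu for _ in range(pulls_per_arm))  # s_i ~ Bin(n_i, mu_i)
            cluster.append({"mu": mu, "pulls": pulls_per_arm, "successes": s})
        data.append(cluster)
    return data

# Two clusters: one with high reward probabilities, one with low.
print(sample_cluster_model([(3, 7), (1, 99)], arms_per_cluster=2, pulls_per_arm=100))
```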

6 Cluster Model of Dependence
[Figure: Arms 1-2 with μi ~ f(π1); Arms 3-4 with μi ~ f(π2)]
Total reward:
Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
Undiscounted: ∑_{t=0}^{T} E[R(t)]
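A tiny illustration of the two objectives on an observed reward sequence (the reward values and α = 0.9 are arbitrary):

```python
# Discounted vs. undiscounted total reward for an observed reward sequence R(0), ..., R(T).
rewards = [1, 0, 1, 1, 0]   # example 0/1 rewards
alpha = 0.9                 # discounting factor
discounted = sum(alpha ** t * r for t, r in enumerate(rewards))
undiscounted = sum(rewards)
print(discounted, undiscounted)
```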

7 Discounted Reward
The optimal policy can be computed using per-cluster MDPs only.
[Figure: MDP for cluster 1: pulling Arm 1 moves the belief state (x1, x2) to (x’1, x’2) or (x”1, x”2); MDP for cluster 2: pulling Arm 3 moves (x3, x4) to (x’3, x’4) or (x”3, x”4)]
Optimal Policy:
Compute an (“index”, arm) pair for each cluster
Pick the cluster with the largest index, and pull the corresponding arm

8 Discounted Reward
The optimal policy can be computed using per-cluster MDPs only (same per-cluster MDP figure as on the previous slide).
Reduces the problem to smaller state spaces
Reduces to Gittins’ Theorem [1979] for independent bandits
Approximation bounds on the index for k-step lookahead
Optimal Policy:
Compute an (“index”, arm) pair for each cluster
Pick the cluster with the largest index, and pull the corresponding arm
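To make the k-step lookahead idea concrete, here is a rough sketch that scores a single cluster by a finite-horizon lookahead over its belief state. It assumes, only for illustration, that each arm keeps an independent Beta(1,1) belief and uses an arbitrary discount factor α = 0.9; in the paper's per-cluster MDP a pull also updates the beliefs of the other arms in the cluster through the shared parameter π, so this is a stand-in for the actual index computation, not the paper's method.

```python
from functools import lru_cache

ALPHA = 0.9  # discount factor (illustrative value)

@lru_cache(maxsize=None)
def lookahead_value(belief, k):
    """k-step lookahead value of one cluster's belief state.
    belief = tuple of (successes, failures) pairs, one per arm in the cluster.
    Simplifying assumption: independent Beta(1,1) belief per arm."""
    if k == 0:
        return 0.0
    best = 0.0
    for i, (s, f) in enumerate(belief):
        p = (s + 1.0) / (s + f + 2.0)                       # posterior mean of arm i
        succ = belief[:i] + ((s + 1, f),) + belief[i + 1:]  # belief after a success
        fail = belief[:i] + ((s, f + 1),) + belief[i + 1:]  # belief after a failure
        value = p * (1.0 + ALPHA * lookahead_value(succ, k - 1)) \
            + (1.0 - p) * ALPHA * lookahead_value(fail, k - 1)
        best = max(best, value)
    return best

# Score each cluster separately; pull inside the cluster with the largest score.
cluster_beliefs = [((3, 7), (4, 6)), ((1, 9), (0, 10))]
scores = [lookahead_value(b, 3) for b in cluster_beliefs]
print(scores.index(max(scores)), scores)
```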

9 Cluster Model of Dependence
[Figure: Arms 1-2 with μi ~ f(π1); Arms 3-4 with μi ~ f(π2)]
Total reward:
Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
Undiscounted: ∑_{t=0}^{T} E[R(t)]

10 Undiscounted Reward
[Figure: Arms 1-2 grouped under “cluster arm” 1; Arms 3-4 under “cluster arm” 2]
All arms in a cluster are similar → they can be grouped into one hypothetical “cluster arm”

11 Undiscounted Reward
Two-Level Policy. In each iteration:
Pick a “cluster arm” using a traditional bandit policy
Pick an arm within that cluster using a traditional bandit policy
Each “cluster arm” must have some estimated reward probability
[Figure: “cluster arm” 1 over Arms 1-2; “cluster arm” 2 over Arms 3-4]
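A sketch of the Two-Level Policy, assuming UCB1 as the "traditional bandit policy" at both levels (the slides leave the policy unspecified). The cluster_reward argument stands for the estimated reward probability of a "cluster arm" (e.g. the MEAN scheme defined on the following slides); all names here are illustrative.

```python
import math
import random

def two_level_policy(clusters, horizon, cluster_reward):
    """Two-Level Policy sketch: UCB1 over "cluster arms", then UCB1 over the
    arms inside the chosen cluster. `clusters` is a list of lists of pull
    functions returning 0/1 rewards; `cluster_reward` maps a cluster's per-arm
    (successes, pulls) statistics to its estimated reward probability."""
    stats = [[[0, 0] for _ in cl] for cl in clusters]   # [successes, pulls] per arm
    cluster_pulls = [0] * len(clusters)

    def ucb(value, pulls, t):                           # UCB1 score
        return float("inf") if pulls == 0 else value + math.sqrt(2 * math.log(t) / pulls)

    total = 0
    for t in range(1, horizon + 1):
        # Level 1: pick a "cluster arm".
        c = max(range(len(clusters)),
                key=lambda j: ucb(cluster_reward(stats[j]), cluster_pulls[j], t))
        # Level 2: pick an arm within that cluster.
        a = max(range(len(clusters[c])),
                key=lambda i: ucb(stats[c][i][0] / max(stats[c][i][1], 1),
                                  stats[c][i][1], t))
        r = clusters[c][a]()                            # pull the arm, observe 0/1 reward
        stats[c][a][0] += r
        stats[c][a][1] += 1
        cluster_pulls[c] += 1
        total += r
    return total

# Example: MEAN scheme for the "cluster arm" reward, two clusters of Bernoulli arms.
mus = [[0.30, 0.28], [0.31, 1e-6]]
mean_scheme = lambda st: sum(s for s, n in st) / max(sum(n for s, n in st), 1)
pulls = [[lambda i=i, j=j: int(random.random() < mus[i][j]) for j in range(2)] for i in range(2)]
print(two_level_policy(pulls, horizon=5000, cluster_reward=mean_scheme))
```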

12 Issues What is the reward probability of a “cluster arm”?
How do cluster characteristics affect performance?

13 Reward probability of a “cluster arm”
What is the reward probability r of a “cluster arm”?
MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
Initially, r = μavg (the average μ of the arms in the cluster)
Finally, r = μmax (the maximum μ among the arms in the cluster)
This is “drift” in the reward probability of the “cluster arm”

14 Reward probability drift causes problems
[Figure: Arms 1-2 in Cluster 1; Arms 3-4 in Cluster 2 (the opt cluster), which contains the best (optimal) arm with reward probability μopt]
Drift → non-optimal clusters might temporarily look better → the optimal arm is explored only O(log T) times

15 Reward probability of a “cluster arm”
What is the reward probability r of a “cluster arm”?
MEAN: r = ∑si / ∑ni
MAX: r = max( E[μi] )
PMAX: r = E[ max(μi) ]
(sums and maxima taken over all arms i in the cluster)
Both MAX and PMAX aim to estimate μmax and thus reduce drift
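A sketch of the three schemes, assuming a Beta(1,1) prior on each μi so that MAX and PMAX can be computed from per-arm posteriors; the prior choice and the Monte Carlo estimate for PMAX are assumptions for illustration, not prescribed by the slides.

```python
import random

def mean_estimate(stats):
    """MEAN: pooled success rate over all arms in the cluster."""
    s = sum(si for si, ni in stats)
    n = sum(ni for si, ni in stats)
    return s / n if n else 0.0

def max_estimate(stats):
    """MAX: largest posterior mean, assuming a Beta(1,1) prior on each arm."""
    return max((si + 1) / (ni + 2) for si, ni in stats)

def pmax_estimate(stats, n_samples=1000):
    """PMAX: posterior expectation of max(mu_i), here estimated by Monte Carlo
    under the same Beta(1,1) prior assumption."""
    draws = (max(random.betavariate(si + 1, ni - si + 1) for si, ni in stats)
             for _ in range(n_samples))
    return sum(draws) / n_samples

# stats: per-arm (successes, pulls) pairs for one cluster
stats = [(30, 100), (28, 100)]
print(mean_estimate(stats), max_estimate(stats), pmax_estimate(stats))
```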

16 Reward probability of a “cluster arm”
MEAN: r = ∑si / ∑ni
MAX: r = max( E[μi] )
PMAX: r = E[ max(μi) ]
Both MAX and PMAX aim to estimate μmax and thus reduce drift
Trade-off between bias in estimating μmax and variance of the estimator: MEAN has high bias but low variance, while PMAX is unbiased but has high variance

17 Comparison of schemes
[Plot comparing the schemes: 10 clusters, 11.3 arms/cluster]
MAX performs best

18 Issues What is the reward probability of a “cluster arm”?
How do cluster characteristics affect performance?

19 Effects of cluster characteristics
We analytically study the effects of cluster characteristics on the “crossover time”
Crossover time: the time at which the expected reward probability of the optimal cluster becomes the highest among all “cluster arms”

20 Effects of cluster characteristics
Crossover time Tc for MEAN depends on:
Cluster separation Δ = μopt − (max μ outside the opt cluster): Δ increases → Tc decreases
Cluster size Aopt: Aopt increases → Tc increases
Cohesiveness in the opt cluster, 1 − avg(μopt − μi): cohesiveness increases → Tc decreases
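A small helper computing these three quantities from the true μ values of each cluster (known only in simulation); the function name and example numbers are illustrative.

```python
def cluster_characteristics(clusters):
    """Given the true mu's of each cluster, return (Delta, A_opt, cohesiveness):
    cluster separation, size of the optimal cluster, and its cohesiveness."""
    opt = max(range(len(clusters)), key=lambda j: max(clusters[j]))
    mu_opt = max(clusters[opt])
    delta = mu_opt - max(max(cl) for j, cl in enumerate(clusters) if j != opt)
    a_opt = len(clusters[opt])
    cohesiveness = 1 - sum(mu_opt - mu for mu in clusters[opt]) / a_opt
    return delta, a_opt, cohesiveness

# Example: the optimal cluster is the first one (it contains mu_opt = 0.31).
print(cluster_characteristics([[0.30, 0.28, 0.31], [1e-6, 0.05]]))
```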

21 Experiments (effect of separation)
[Plot: circle = MEAN, triangle = MAX, square = independent]
Δ increases → Tc decreases → higher reward

22 Experiments (effect of size)
Aopt increases → Tc increases → lower reward

23 Experiments (effect of cohesiveness)
Cohesiveness increases → Tc decreases → higher reward

24 Related Work
Typical multi-armed bandit problems: do not consider dependencies; very few arms
Bandits with side information: cannot handle dependencies among arms
Active learning: emphasis on the number of examples required to achieve a given prediction accuracy

25 Conclusions
We analyze bandits where dependencies are encapsulated within clusters
Discounted reward → the optimal policy is an index scheme on the clusters
Undiscounted reward → Two-Level Policy with MEAN, MAX, and PMAX
Analysis of the effect of cluster characteristics on performance, for MEAN

26 Discounted Reward
[Figure: pulling Arm 1 from the belief state (x1, x2, x3, x4) of estimated reward probabilities leads, on success or failure, to updated states (x’1, x’2) or (x”1, x”2); the belief changes for both arms 1 and 2]
Create a belief-state MDP: each state contains the estimated reward probabilities for all arms
Solve for the optimal policy

27 Background: Bandits
Bandit “arms” p1, p2, p3 (unknown payoff probabilities)
Regret = optimal payoff − actual payoff

28 Reward probability of a “cluster arm”
What is the reward probability of a “cluster arm”?
Eventually, every “cluster arm” must converge to the most rewarding arm (μmax) within its cluster, since a bandit policy is used within each cluster
However, “drift” causes problems

29 Experiments
Simulation based on one week’s worth of data from a large-scale ad-matching application
10 clusters, with 11.3 arms/cluster on average

30 Comparison of schemes
[Plot: 10 clusters, 11.3 arms/cluster; cluster separation Δ = 0.08, cluster size Aopt = 31, cohesiveness = 0.75]
MAX performs best

31 Reward probability drift causes problems
[Figure: Arms 1-2 in Cluster 1; Arms 3-4 in Cluster 2 (the opt cluster), which contains the best (optimal) arm with reward probability μopt]
Intuitively, to reduce regret, we must quickly converge to the optimal “cluster arm”, and then to the best arm within that cluster

