Published by Aliyah Dolphin. Modified about 1 year ago.

1
Taming the monster: A fast and simple algorithm for contextual bandits
Presented by Satyen Kale
Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li, and Rob Schapire

2
Learning to interact: example #1
Loop:
› 1. Patient arrives with symptoms, medical history, genome, …
› 2. Physician prescribes treatment.
› 3. Patient’s health responds (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.

3
Learning to interact: example #2
Loop:
› 1. User visits website with profile, browsing history, …
› 2. Website operator chooses content/ads to display.
› 3. User reacts to content/ads (e.g., click, “like”).
Goal: choose content/ads that yield desired user behavior.

4
Contextual bandit setting (i.i.d. version)
Set $X$ of contexts/features and $K$ possible actions.
For $t = 1, 2, \dots, T$:
› 0. Nature draws $(x_t, r_t)$ from distribution $D$ over $X \times [0,1]^K$.
› 1. Observe context $x_t$. [e.g., user profile, browsing history]
› 2. Choose action $a_t \in [K]$. [e.g., content/ad to display]
› 3. Collect reward $r_t(a_t)$. [e.g., indicator of click or positive feedback]
Goal: algorithm for choosing actions $a_t$ that yield high reward.
Contextual setting: use features $x_t$ to choose good actions $a_t$.
Bandit setting: $r_t(a)$ for $a \neq a_t$ is not observed.
› Exploration vs. exploitation
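The protocol above can be sketched as a small simulation. This is only an illustration under stated assumptions: `run_bandit`, `toy_draw`, `uniform`, and `matching` are hypothetical names, and the two-action toy distribution (reward 1 when the action equals the context bit) is invented for the example, not taken from the talk.

```python
import random

def run_bandit(policy, draw_example, T, seed=0):
    """Run the i.i.d. contextual bandit protocol for T rounds."""
    rng = random.Random(seed)
    total = 0.0
    history = []
    for _ in range(T):
        x, r = draw_example(rng)   # step 0: nature draws (x_t, r_t)
        a = policy(x, rng)         # steps 1-2: observe x_t, choose a_t
        reward = r[a]              # step 3: only r_t(a_t) is revealed
        history.append((x, a, reward))
        total += reward
    return total, history

def toy_draw(rng):
    """Hypothetical distribution D: context is a bit, matching action pays 1."""
    x = rng.randint(0, 1)
    r = [1.0 if a == x else 0.0 for a in range(2)]
    return x, r

def uniform(x, rng):
    """Ignores the context and picks an action uniformly at random."""
    return rng.randrange(2)

def matching(x, rng):
    """Uses the context: plays the action equal to the context bit."""
    return x

best_total, _ = run_bandit(matching, toy_draw, T=100)
rand_total, log = run_bandit(uniform, toy_draw, T=100)
```

Note that `history` stores only the reward of the chosen action, which is exactly the bandit restriction: rewards of the other actions never reach the learner.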

5
Learning objective and difficulties
No single action is good in all situations: need to exploit context.
Policy class $\Pi$: set of functions (“policies”) mapping $X \to [K]$ (e.g., advice of experts, linear classifiers, neural networks).
Regret (i.e., relative performance) to policy class $\Pi$: … a strong benchmark if $\Pi$ contains a policy with high reward.
Difficulties: feedback on action only informs about subset of policies; explicit bookkeeping is computationally infeasible when $\Pi$ is large.
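For reference, regret to a policy class is commonly written as below. This is the standard textbook form, given here as an assumption about what the slide displayed; the symbol $\mathrm{Regret}_T$ is introduced only for this illustration.

```latex
\mathrm{Regret}_T
  \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr)
  \;-\; \sum_{t=1}^{T} r_t(a_t)
```

The first term is the total reward of the best fixed policy in hindsight; the second is the reward the algorithm actually collected.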

6
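Arg max oracle (AMO)
Given fully-labeled data $(x_1, r_1), \dots, (x_t, r_t)$, AMO returns $\arg\max_{\pi \in \Pi} \sum_{\tau=1}^{t} r_\tau(\pi(x_\tau))$, i.e., the policy in $\Pi$ with the highest total reward on that data.
Abstraction for efficient search of policy class $\Pi$.
In practice: implement using standard heuristics (e.g., convex relaxation, backprop) for cost-sensitive multiclass learning algorithms.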
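A minimal sketch of what the oracle computes, assuming a small finite policy class that can be enumerated. Enumeration is exactly what the oracle abstraction is meant to avoid for large classes, so this brute-force version only pins down the input/output contract. The names `argmax_oracle`, `always0`, `always1`, and `identity` are hypothetical.

```python
def argmax_oracle(policies, data):
    """Return the policy with the highest total reward on fully labeled data.

    policies: iterable of functions mapping a context x to an action index.
    data: list of (x, r) pairs, where r lists the reward of every action.
    """
    def total_reward(pi):
        return sum(r[pi(x)] for x, r in data)
    return max(policies, key=total_reward)

def always0(x):
    return 0

def always1(x):
    return 1

def identity(x):
    return x

# Toy fully labeled data: contexts are bits, two actions.
data = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (1, [0.2, 0.9])]
best = argmax_oracle([always0, always1, identity], data)
```

On this toy data the totals are 1.2, 1.9, and 2.9, so the oracle returns `identity`.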

7
Our results
New fast and simple algorithm for contextual bandits
› Optimal regret bound (up to log factors):
› Amortized calls to argmax oracle (AMO) per round.
Comparison to previous work
› [Thompson’33]: no general analysis.
› [ACBFS’02]: Exp4 algorithm; optimal regret, enumerates policies.
› [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
› [DHKKLRZ’11]: “monster paper”; optimal regret, $O(T^5 K^4)$ AMO calls/round.
Note: Exp4 also works in adversarial setting.

8
Rest of this talk
1. Action distributions, reward estimates via inverse probability weights [oldies but goodies]
2. Algorithm for finding policy distributions that balance exploration/exploitation [new]
3. Warm-start / epoch trick [new]

9
Basic algorithm structure (same as Exp4)
Start with initial distribution $Q_1$ over policies $\Pi$.
For $t = 1, 2, \dots, T$:
› 0. Nature draws $(x_t, r_t)$ from distribution $D$ over $X \times [0,1]^K$.
› 1. Observe context $x_t$.
› 2a. Compute distribution $p_t$ over actions $\{1, 2, \dots, K\}$ (based on $Q_t$ and $x_t$).
› 2b. Draw action $a_t$ from $p_t$.
› 3. Collect reward $r_t(a_t)$.
› 4. Compute new distribution $Q_{t+1}$ over policies $\Pi$.
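Steps 2a and 2b can be sketched as follows, assuming the simplest projection: each action receives the total weight of the policies that choose it on the current context. The talk does not spell out how $p_t$ is computed, so this projection (and the absence of any extra smoothing toward uniform) is an assumption; `action_distribution` and `draw_action` are hypothetical names, and actions are indexed from 0 here.

```python
import random

def action_distribution(Q, x, K):
    """Step 2a: p(a) = total weight of policies pi in Q with pi(x) == a.

    Q: list of (policy, weight) pairs, assumed here to sum to 1.
    """
    p = [0.0] * K
    for pi, w in Q:
        p[pi(x)] += w
    return p

def draw_action(p, rng):
    """Step 2b: sample an action index according to p."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]

def always0(x):
    return 0

def identity(x):
    return x

Q = [(always0, 0.5), (identity, 0.5)]
p_ctx1 = action_distribution(Q, x=1, K=2)   # policies disagree on x = 1
p_ctx0 = action_distribution(Q, x=0, K=2)   # both policies pick action 0
a = draw_action(p_ctx0, random.Random(0))
```

When the policies disagree on a context, the induced action distribution spreads out, which is where exploration comes from.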

10
Inverse probability weighting (old trick)
Importance-weighted estimate of reward from round $t$:
Unbiased, and has range & variance bounded by $1/p_t(a)$.
Can estimate total reward and regret of any policy:
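A sketch of the usual importance-weighted estimator, $\hat r_t(a) = r_t(a_t)\,\mathbf{1}\{a = a_t\} / p_t(a_t)$, which is assumed here to be the form the slide refers to. The helper names are hypothetical, and `expected_estimate` exists only to check unbiasedness by exact enumeration rather than sampling.

```python
def ipw_estimate(a_chosen, reward, p, K):
    """Importance-weighted reward vector for one round.

    Equals reward / p[a_chosen] at the chosen action and 0 elsewhere.
    """
    est = [0.0] * K
    est[a_chosen] = reward / p[a_chosen]
    return est

def expected_estimate(r, p):
    """Exact expectation of the estimate when the action is drawn from p."""
    K = len(r)
    out = [0.0] * K
    for a in range(K):
        if p[a] > 0:
            est = ipw_estimate(a, r[a], p, K)
            for b in range(K):
                out[b] += p[a] * est[b]
    return out

def estimated_total_reward(pi, log, K):
    """Sum the estimates along a policy's choices.

    log: list of (x, a, reward, p) tuples recorded during interaction.
    """
    return sum(ipw_estimate(a, rew, p, K)[pi(x)] for x, a, rew, p in log)

r = [0.2, 0.7, 1.0]
p = [0.5, 0.3, 0.2]
mean_est = expected_estimate(r, p)

log = [(0, 0, 1.0, [0.5, 0.5]), (1, 0, 0.0, [0.5, 0.5])]
value = estimated_total_reward(lambda x: x, log, K=2)
```

The division by `p[a_chosen]` is what makes small action probabilities expensive: the estimate stays unbiased, but its range grows as $1/p_t(a)$.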

11
Constructing policy distributions
Optimization problem (OP): find policy distribution $Q$ such that:
› Low estimated regret (LR): “exploitation”
› Low estimation variance (LV): “exploration”
Theorem: If we obtain policy distributions $Q_t$ via solving (OP), then with high probability, regret after $T$ rounds is at most

12
Feasibility
Feasibility of (OP): implied by minimax argument.
Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.

13
Coordinate descent algorithm
INPUT: Initial weights $Q$.
LOOP:
› IF (LR) is violated, THEN replace $Q$ by $cQ$.
› IF there is a policy $\pi$ causing (LV) to be violated, THEN UPDATE $Q(\pi) = Q(\pi) + \alpha$.
› ELSE RETURN $Q$.
Above, both $0 < c < 1$ and $\alpha$ have closed-form expressions.
(Technical detail: actually optimize over sub-distributions $Q$ that may sum to $< 1$.)
Claim: the check for a violating policy $\pi$ can be made with one AMO call per iteration.
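The control flow of this loop can be sketched as below. The actual (LR) and (LV) conditions and the closed forms for $c$ and $\alpha$ are not given in this transcript, so they are passed in as callables; every name here is hypothetical, and the toy conditions in the usage example are stand-ins chosen only so the loop terminates, not the real constraints.

```python
def coordinate_descent(Q, lr_violated, shrink_factor, find_lv_violator,
                       step_size, max_iters=10_000):
    """Skeleton of the loop; Q maps policy -> weight (a sub-distribution)."""
    Q = dict(Q)
    for _ in range(max_iters):
        if lr_violated(Q):
            c = shrink_factor(Q)                 # assumed to satisfy 0 < c < 1
            Q = {pi: c * w for pi, w in Q.items()}
        pi = find_lv_violator(Q)                 # None when (LV) holds everywhere
        if pi is None:
            return Q
        Q[pi] = Q.get(pi, 0.0) + step_size(Q, pi)
    raise RuntimeError("no solution found within max_iters")

# Toy stand-ins (NOT the real constraints): two named policies, each of
# which "violates" until it holds weight at least 0.2.
POLICIES = ("a", "b")

def toy_violator(Q):
    return next((pi for pi in POLICIES if Q.get(pi, 0.0) < 0.2), None)

result = coordinate_descent(
    Q={},
    lr_violated=lambda Q: sum(Q.values()) > 1.0,
    shrink_factor=lambda Q: 0.5,
    find_lv_violator=toy_violator,
    step_size=lambda Q, pi: 0.25,
)
```

Only policies that were ever returned by the violation check receive weight, so the returned `Q` is sparse: its support is bounded by the number of iterations.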

14
Iteration bound for coordinate descent
# steps of coordinate descent =
Also gives bound on sparsity of $Q$.
Analysis via a potential function argument.

15
Warm-start
If we warm-start coordinate descent (initialize with $Q_t$ to get $Q_{t+1}$), then only need coordinate descent iterations over all $T$ rounds.
Caveat: need one AMO call/round to even check if (OP) is solved.

16
Epoch trick
Regret analysis: $Q_t$ has low instantaneous expected regret (crucially relying on i.i.d. assumption).
› Therefore the same $Q_t$ can be used for $O(t)$ more rounds!
Epoching: split $T$ rounds into epochs, solve (OP) once per epoch.
Doubling: only update on rounds $2^1, 2^2, 2^3, 2^4, \dots$
› Total of $O(\log T)$ updates, so overall # AMO calls unchanged (up to log factors).
Squares: only update on rounds $1^2, 2^2, 3^2, 4^2, \dots$
› Total of $O(T^{1/2})$ updates, each requiring AMO calls, on average.
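The two update schedules are easy to compare numerically. The sketch below lists the rounds on which (OP) would be re-solved under each schedule; the function names and the choice $T = 1000$ are illustrative assumptions.

```python
import math

def doubling_rounds(T):
    """Update rounds 2^1, 2^2, ... that fall within the first T rounds."""
    rounds, t = [], 2
    while t <= T:
        rounds.append(t)
        t *= 2
    return rounds

def square_rounds(T):
    """Update rounds 1^2, 2^2, ... that fall within the first T rounds."""
    return [i * i for i in range(1, math.isqrt(T) + 1)]

T = 1000
doubling = doubling_rounds(T)   # 2, 4, ..., 512
squares = square_rounds(T)      # 1, 4, ..., 961
```

For $T = 1000$ this gives 9 updates under doubling and 31 under squares, matching the $O(\log T)$ and $O(T^{1/2})$ counts stated above.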

17
Experiments
| Algorithm | Epsilon-greedy | Bagging | Linear UCB | “Online Cover” | [Supervised] |
|---|---|---|---|---|---|
| Loss | 0.095 | 0.059 | 0.128 | 0.053 | 0.051 |
| Time (seconds) | 22 | 339 | 212000 | 17 | 6.9 |
Bandit problem derived from classification task (RCV1).
Reporting progressive validation loss.
“Online Cover” = variant with stateful AMO.
