
1 Taming the monster: A fast and simple algorithm for contextual bandits PRESENTED BY Satyen Kale Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire

2 Learning to interact: example #1  Loop: › 1. Patient arrives with symptoms, medical history, genome, … › 2. Physician prescribes treatment. › 3. Patient’s health responds (e.g., improves, worsens).  Goal: prescribe treatments that yield good health outcomes.

3 Learning to interact: example #2  Loop: › 1. User visits website with profile, browsing history, … › 2. Website operator chooses content/ads to display. › 3. User reacts to content/ads (e.g., click, “like”).  Goal: choose content/ads that yield desired user behavior.

4 Contextual bandit setting (i.i.d. version)  Set X of contexts/features and K possible actions.  For t = 1, 2, …, T: › 0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K. › 1. Observe context x_t. [e.g., user profile, browsing history] › 2. Choose action a_t ∈ [K]. [e.g., content/ad to display] › 3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]  Goal: algorithm for choosing actions a_t that yield high reward.  Contextual setting: use features x_t to choose good actions a_t.  Bandit setting: r_t(a) for a ≠ a_t is not observed. › Exploration vs. exploitation
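To make the protocol concrete, here is a minimal simulation sketch in Python; the callbacks `draw_context_reward` and `choose_action` and the horizon `T` are illustrative names, not from the talk.

```python
def run_contextual_bandit(draw_context_reward, choose_action, T, K):
    """Simulate the i.i.d. contextual bandit protocol of this slide.

    draw_context_reward() -> (x, r): nature draws a context x and a length-K
        reward vector r (one entry per action) from the distribution D.
    choose_action(x, t) -> a: the learner picks an action in {0, ..., K-1}.
    Only the reward of the chosen action is revealed to the learner.
    """
    total_reward = 0.0
    for t in range(T):
        x, r = draw_context_reward()   # step 0: nature draws (x_t, r_t) ~ D
        a = choose_action(x, t)        # steps 1-2: observe x_t, choose a_t
        assert 0 <= a < K
        total_reward += r[a]           # step 3: collect r_t(a_t); other r_t(a') stay hidden
    return total_reward
```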

5 Learning objective and difficulties  No single action is good in all situations: we need to exploit context.  Policy class Π: set of functions (“policies”) from X to [K] (e.g., advice of experts, linear classifiers, neural networks).  Regret (i.e., relative performance) with respect to policy class Π: … a strong benchmark if Π contains a policy with high reward.  Difficulties: feedback on the chosen action only informs us about a subset of policies; explicit bookkeeping is computationally infeasible when Π is large.
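For concreteness, the regret against the best policy in Π can be written as follows (the standard definition for this setting, reconstructed rather than copied from the slide):

```latex
\mathrm{Regret}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr) \;-\; \sum_{t=1}^{T} r_t(a_t).
```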

6 Arg max oracle (AMO)  Given fully-labeled data (x_1, r_1), …, (x_t, r_t), AMO returns the empirically best policy, arg max over π ∈ Π of Σ_{τ≤t} r_τ(π(x_τ)).  Abstraction for efficient search of the policy class Π.  In practice: implement using standard heuristics (e.g., convex relaxations, backprop) for cost-sensitive multiclass learning.
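A brute-force stand-in for this oracle, for illustration only (enumerating Π defeats the purpose; in practice the call is served by a cost-sensitive multiclass learner as noted above):

```python
def argmax_oracle(policy_class, data):
    """Brute-force arg-max oracle: given fully-labeled rounds
    data = [(x_1, r_1), ..., (x_t, r_t)], return the policy in policy_class
    with the highest cumulative reward sum_tau r_tau(pi(x_tau)).
    Reference implementation only; not efficient for large policy classes.
    """
    return max(policy_class, key=lambda pi: sum(r[pi(x)] for x, r in data))
```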

7 Our results  New fast and simple algorithm for contextual bandits › Optimal regret bound (up to log factors): › Amortized calls to argmax oracle (AMO) per round.  Comparison to previous work › [Thompson’33]: no general analysis. › [ACBFS’02]: Exp4 algorithm; optimal regret, but enumerates policies. › [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round. › [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round. Note: Exp4 also works in the adversarial setting.

8 Rest of this talk 1. Action distributions, reward estimates via inverse probability weights [oldies but goodies] 2. Algorithm for finding policy distributions that balance exploration/exploitation 3. Warm-start / epoch trick [new]

9 Basic algorithm structure (same as Exp4)  Start with initial distribution Q_1 over policies Π.  For t = 1, 2, …, T: › 0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K. › 1. Observe context x_t. › 2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t). › 2b. Draw action a_t from p_t. › 3. Collect reward r_t(a_t). › 4. Compute new distribution Q_{t+1} over policies Π.
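A sketch of step 2a, projecting a policy distribution onto an action distribution for the observed context; the uniform-smoothing weight `mu` is an illustrative stand-in for the talk's exact smoothing scheme.

```python
import numpy as np

def action_distribution(Q, policies, x, K, mu=0.05):
    """Project weights Q over policies onto a distribution p over the K actions
    for context x, then mix in a little uniform exploration.

    Q[i] is the weight of policies[i]; policies[i](x) returns an action index.
    """
    p = np.zeros(K)
    for q, pi in zip(Q, policies):
        p[pi(x)] += q                   # each policy votes for its chosen action
    p = (1.0 - mu) * p + mu / K         # every action keeps probability >= mu/K
    return p / p.sum()                  # renormalize (Q may be a sub-distribution)
```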

10 Inverse probability weighting (old trick)  Importance-weighted estimate of reward from round t:  Unbiased, and has range & variance bounded by 1/p_t(a).  Can estimate total reward and regret of any policy:
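The estimator referred to above, written out as a small helper; this is the standard inverse-probability-weighting formula r̂_t(a) = r_t(a_t)·1{a = a_t}/p_t(a_t), reconstructed rather than copied from the slide.

```python
import numpy as np

def ipw_reward_estimate(r_observed, a_taken, p_taken, K):
    """Importance-weighted reward vector for one round:
    r_hat(a) = r_t(a_t) / p_t(a_t) if a == a_t, and 0 otherwise.

    Unbiased: E[r_hat(a)] = p_t(a) * r_t(a) / p_t(a) = r_t(a);
    its range (and hence variance) scales with 1 / p_t(a).
    """
    r_hat = np.zeros(K)
    r_hat[a_taken] = r_observed / p_taken
    return r_hat
```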

11 Constructing policy distributions Optimization problem (OP): Find policy distribution Q such that: Low estimated regret (LR) – “exploitation” Low estimation variance (LV) – “exploration” Theorem: If we obtain policy distributions Q_t via solving (OP), then with high probability, regret after T rounds is at most
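The two constraints appear as images in the original slide; schematically they take the following form, where the thresholds c_1, c_2, c_3 and the smoothing parameter μ are placeholders (they depend on K, t, and the confidence level and are not the slide's exact constants):

```latex
\text{(LR)}\;\; \sum_{\pi \in \Pi} Q(\pi)\,\widehat{\mathrm{Reg}}_t(\pi) \;\le\; c_1,
\qquad
\text{(LV)}\;\; \forall \pi \in \Pi:\;
\widehat{\mathbb{E}}_{x}\!\left[\frac{1}{Q^{\mu}\bigl(\pi(x)\mid x\bigr)}\right] \;\le\; c_2 \;+\; c_3\,\widehat{\mathrm{Reg}}_t(\pi).
```

Here Reg-hat_t(π) is the estimated regret of π from the importance-weighted reward estimates, and Q^μ(a | x) is the μ-smoothed probability that Q assigns to action a on context x; (LR) enforces exploitation on average, while (LV) caps the variance of the reward estimates for every policy.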

12 Feasibility  Feasibility of (OP): implied by a minimax argument.  Monster solution [DHKKLRZ’11]: solves a variant of (OP) with the ellipsoid algorithm, where the separation oracle = AMO + perceptron + ellipsoid.

13 Coordinate descent algorithm  INPUT: Initial weights Q.  LOOP: › IF (LR) is violated, THEN replace Q by cQ. › IF there is a policy π causing (LV) to be violated, THEN UPDATE Q(π) = Q(π) + α. › ELSE RETURN Q. Above, both 0 < c < 1 and α have closed-form expressions. (Technical detail: we actually optimize over sub-distributions Q that may sum to < 1.) Claim: the check for a violating policy can be done by making one AMO call per iteration.
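A schematic Python rendering of this loop; `lr_violated`, `find_lv_violator`, `shrink_factor`, and `step_size` are placeholder callables standing in for the closed-form quantities and the AMO-based check mentioned on the slide.

```python
def coordinate_descent(Q, lr_violated, find_lv_violator, shrink_factor, step_size,
                       max_iters=10_000):
    """Schematic coordinate-descent solver for (OP).

    Q: dict mapping policy -> weight (a sub-distribution; may sum to < 1).
    lr_violated(Q) -> bool: is the low-regret constraint (LR) violated?
    find_lv_violator(Q) -> policy or None: a policy violating the
        low-variance constraint (LV), found via one AMO call.
    shrink_factor(Q) -> c in (0, 1) and step_size(Q, pi) -> alpha stand in
        for the closed-form expressions referred to on the slide.
    """
    for _ in range(max_iters):
        if lr_violated(Q):
            c = shrink_factor(Q)
            Q = {pi: c * w for pi, w in Q.items()}     # rescale: Q <- cQ
        pi = find_lv_violator(Q)
        if pi is None:
            return Q                                   # all constraints satisfied
        Q[pi] = Q.get(pi, 0.0) + step_size(Q, pi)      # boost the violating policy
    return Q
```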

14 Iteration bound for coordinate descent # steps of coordinate descent = Also gives bound on sparsity of Q. Analysis via a potential function argument.

15 Warm-start  If we warm-start coordinate descent (initialize with Q t to get Q t+1 ), then only need coordinate descent iterations over all T rounds.  Caveat: need one AMO call/round to even check if (OP) is solved.

16 Epoch trick  Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption). › Therefore the same Q_t can be used for O(t) more rounds!  Epoching: Split T rounds into epochs, solve (OP) once per epoch.  Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, … › Total of O(log T) updates, so overall # AMO calls unchanged (up to log factors).  Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, … › Total of O(T^{1/2}) updates, each requiring AMO calls, on average.
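The two update schedules, written out as a small illustrative helper (not code from the talk):

```python
import math

def update_rounds_doubling(T):
    """Rounds on which (OP) is re-solved under the doubling schedule:
    2^1, 2^2, 2^3, ... up to T, i.e. O(log T) updates in total."""
    return [2 ** i for i in range(1, int(math.log2(T)) + 1)]

def update_rounds_squares(T):
    """Rounds 1^2, 2^2, 3^2, ... up to T, i.e. O(sqrt(T)) updates in total."""
    return [i * i for i in range(1, math.isqrt(T) + 1)]
```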

17 Experiments  Table: loss and running time (seconds) for epsilon-greedy, bagging, LinearUCB, “Online Cover”, and a [Supervised] baseline.  Bandit problem derived from a classification task (RCV1). Reporting progressive validation loss. “Online Cover” = variant with a stateful AMO.

