Tradeoffs in contextual bandit learning
1 Tradeoffs in contextual bandit learning
Alekh Agarwal, Microsoft Research. Joint work with Daniel Hsu, John Langford, Lihong Li, Satyen Kale, and Rob Schapire.

2 Learning to interact: example #1
Loop:
1. Patient arrives with symptoms, medical history, genome, …
2. Physician prescribes a treatment.
3. Patient's health responds (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.

3 Learning to interact: example #2
Loop:
1. User visits website with profile, browsing history, …
2. Website operator chooses content/ads to display.
3. User reacts to the content/ads (e.g., click, "like").
Goal: choose content/ads that yield desired user behavior.

4 Contextual bandit setting
For t = 1, 2, …, T:
1. Observe context x_t [e.g., user profile, browsing history].
1b. Reward vector r_t ∈ R^K is generated (but not observed).
2. Choose action a_t ∈ {1, 2, …, K} [e.g., content/ad to display].
3. Collect reward r_t(a_t) [e.g., indicator of click or positive feedback].
The reward is observed only for the chosen action.
Goal: an algorithm for choosing actions with high total reward $\sum_{t=1}^T r_t(a_t)$.
i.i.d. setting: (x_t, r_t) drawn i.i.d. from a distribution D over X × [0,1]^K.
Speaker notes: A simple setting that captures a large class of interactive learning problems is the contextual bandit problem. Here we consider the i.i.d. version of the problem, where each round t repeats the steps above. GOAL: choose actions that yield high reward over the T rounds. MUST USE CONTEXT: no single action is good in all situations. BANDIT PROBLEM (or PARTIAL LABEL PROBLEM): you don't see rewards for actions you don't take, so exploration (taking actions just to learn about them) must be balanced with exploitation (taking actions known to be good).
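To make the protocol concrete, here is a minimal simulation sketch of the loop above. The environment (Gaussian contexts, a hidden per-action logistic reward model `theta`) and the uniformly random action choice are invented for illustration; only the interaction pattern mirrors the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, d = 3, 1000, 5                      # actions, rounds, context dimension
theta = rng.normal(size=(K, d))           # hidden per-action reward model (assumed, for the toy environment)

total_reward = 0.0
for t in range(T):
    x_t = rng.normal(size=d)              # 1. observe context x_t
    r_t = 1 / (1 + np.exp(-theta @ x_t))  # 1b. reward vector in [0,1]^K (generated, never fully observed)
    a_t = rng.integers(K)                 # 2. choose an action (here: uniformly at random)
    total_reward += r_t[a_t]              # 3. collect reward only for the chosen action

print(f"average reward of uniform exploration: {total_reward / T:.3f}")
```

The rest of the talk is about replacing the uniform choice in step 2 with something that learns from the partial feedback.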

5 Challenges
Fundamental dilemma:
Exploit what has been learned.
Explore to find which behaviors lead to high rewards.
Need to use context effectively:
Different actions are preferred under different contexts.
Might not see the same context twice.
Computational efficiency.

6 Special case: Multi-armed bandits
Reward information only on the selected action; no context.
Goal: do as well as the best single action.
Tacit assumption: there is one action which always gives high rewards.
E.g.: a single treatment/content/ad that is right for the entire population.

7 From actions to policies
Policy: a rule mapping context to action, which allows choosing different good actions in different contexts.
E.g.: If (sex = male) choose action 1; else if (age > 45) choose action 2; else choose action 3.
Policy π: context x ↦ action a.
Goal: find a good policy π from a rich policy class Π.
Policy class examples: all decision trees, all linear models, all neural networks, …
Tacit assumption: given a rich enough Π, there must be a policy π ∈ Π which yields high rewards.
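As a concrete illustration, a policy is just a function from context to action. The context fields (`sex`, `age`) mirror the rule on the slide; the second, context-ignoring policy is an assumption added for contrast.

```python
def pi_rule(x):
    # x is a dict such as {"sex": "male", "age": 50}
    if x["sex"] == "male":
        return 0          # action 1
    elif x["age"] > 45:
        return 1          # action 2
    return 2              # action 3

def pi_always_2(x):
    return 1              # a policy that ignores context, for contrast

# A tiny policy class Pi; in practice Pi is all decision trees, all linear
# models, all neural networks, ... and far too large to enumerate.
Pi = [pi_rule, pi_always_2]

print(pi_rule({"sex": "female", "age": 50}))  # -> 1 (action 2)
```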

8 Learning with contexts and policies
Goal: learn through experimentation to do (almost) as well as the best π ∈ Π.
Policies may be very complex and expressive ⇒ powerful approach.
Challenges:
Π must be extremely large to be expressive ⇒ computationally difficult.
Exploration versus exploitation at an enormous scale!

9 Formal model (revisited)
For t = 1, 2, …, T:
1. Observe context x_t.
1b. Reward vector r_t is chosen (but not observed).
2. Choose action a_t ∈ {1, 2, …, K}.
3. Collect reward r_t(a_t).
Goal: maximize net reward relative to the best policy, i.e., achieve small regret
$$\max_{\pi \in \Pi} \frac{1}{T}\sum_{t=1}^T r_t(\pi(x_t)) \;-\; \frac{1}{T}\sum_{t=1}^T r_t(a_t),$$
the best policy's average reward minus the learner's average reward.

10 A solution template
Start with an initial distribution Q_1 over policies Π.
For t = 1, 2, …, T:
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K}.
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Create importance-weighted reward estimates r̂_t.
5. Compute a new distribution Q_{t+1} over policies Π (using Q_t).
Speaker notes: Maintain a policy distribution Q --- need to do this efficiently, so we'll make sure it's sparse. After seeing context x_t, we need to pick an action a_t: take a "smoothed" projection of Q_t to get p_t and randomly pick a_t according to p_t. After collecting the reward, update the policy distribution from Q_t to Q_{t+1}.
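A schematic sketch of this template in Python follows. The toy environment, the three-policy enumerable class, and the exploration schedule `mu_t` are illustrative assumptions; step 5 is deliberately a stub, because constructing Q_{t+1} is exactly what the rest of the talk is about.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 500
policies = [lambda x: 0, lambda x: 1, lambda x: int(x > 0)]   # toy, enumerable policy class Pi
Q = np.ones(len(policies)) / len(policies)                    # Q_1: uniform over Pi

def action_distribution(Q, x, mu):
    # "Smoothed" projection of Q onto actions for context x, mixed with a uniform
    # floor so that every action has probability at least mu (see slide 33).
    p = np.zeros(K)
    for w, pi in zip(Q, policies):
        p[pi(x)] += w
    return (1 - K * mu) * p + mu

for t in range(1, T + 1):
    x_t = rng.normal()                                             # 1. observe context
    r_t = rng.uniform(size=K)                                      # hidden reward vector (toy environment)
    mu_t = min(0.5 / K, np.sqrt(np.log(len(policies)) / (K * t)))  # assumed shrinking exploration floor
    p_t = action_distribution(Q, x_t, mu_t)                        # 2a. distribution over actions
    a_t = rng.choice(K, p=p_t)                                     # 2b. draw action
    reward = r_t[a_t]                                              # 3. collect reward
    r_hat = np.zeros(K)
    r_hat[a_t] = reward / p_t[a_t]                                 # 4. importance-weighted estimate
    # 5. compute Q_{t+1} from Q_t and r_hat -- the subject of the following slides.
```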

11 Computational properties
Need to update and sample from Q_t at each round.
This typically scales with the number of policies, unless Q_t has sparse support.

12 Prior work
Thompson Sampling (1933): no general analysis.
Exp4 algorithm (2002): optimal regret, exponential computation.
ε-greedy variant (2007): suboptimal regret, optimal computation.
"Monster paper" (2011): optimal regret, poly computation.
New fast and simple algorithm for contextual bandits:
Optimal regret bound (up to log factors): $O(\sqrt{KT \log|\Pi|})$.
Computational complexity: $\tilde{O}(\sqrt{K}\, T^{1.5})$.
Speaker notes: New algorithm --- statistical performance: achieves the statistically optimal regret bound. Computational benchmark: "oracle complexity" --- how many times it has to call the AMO. Computational performance: sublinear in the number of rounds (i.e., vanishing per-round complexity). Previous algorithms were either statistically suboptimal or computationally more complex --- it is challenging to achieve both simultaneously. Note: we crucially rely on the i.i.d. assumption, whereas Exp4 works in the adversarial setting; other works assume an argmax oracle.

13 Regret vs. computation
[Figure: regret versus per-round computation. ε-greedy attains O(T^{2/3}) regret with O(1) computation; Exp4 attains O(√T) regret but requires O(|Π|) computation.]

14 This talk
New and general algorithm for the contextual bandit problem:
Optimal statistical performance.
Much faster and simpler than predecessors.
A computational barrier for a class of algorithms.

15 Formal model
For t = 1, 2, …, T:
1. Observe context x_t.
1b. Reward vector r_t is chosen (but not observed).
2. Choose action a_t ∈ {1, 2, …, K}.
3. Collect reward r_t(a_t).
Goal: maximize net reward $\sum_{t=1}^T r_t(a_t)$.
i.i.d. setting: (x_t, r_t) drawn i.i.d. from a distribution D over X × [0,1]^K.

16 Example
Context          | Action 1 | Action 2 | Action 3
(Male, 50, …)    |   1.0    |   0.2    |   0.0

17 Example
Context          | Reward of chosen action
(Male, 50, …)    |   0.2
Total reward = 0.2 + …

18 Example
Context          | Reward of chosen action
(Male, 50, …)    |   0.2
(Female, 18, …)  |   1.0
Total reward = 0.2 + 1.0 + …

19 Example
Context          | Reward of chosen action
(Male, 50, …)    |   0.2
(Female, 18, …)  |   1.0
(Female, 48, …)  |   0.1
Total reward = 0.2 + 1.0 + 0.1 + …

20 Example
(Same table as above.) Total reward = 0.2 + 1.0 + 0.1 + ….

21 Formal model (revisited)
For t = 1, 2, …, T: observe context x_t; reward vector r_t is chosen (but not observed); choose action a_t ∈ {1, 2, …, K}; collect reward r_t(a_t).
Goal: maximize net reward relative to the best policy, i.e., achieve small regret
$$\max_{\pi \in \Pi} \frac{1}{T}\sum_{t=1}^T r_t(\pi(x_t)) \;-\; \frac{1}{T}\sum_{t=1}^T r_t(a_t),$$
the best policy's average reward minus the learner's average reward.

22 Formal model (revisited)
For t = 1, 2, …, T: observe context x_t; reward vector r_t is chosen (but not observed); choose action a_t ∈ {1, 2, …, K}; collect reward r_t(a_t).
Goal: maximize net reward relative to the best policy, i.e., achieve small regret (equivalently, small average risk relative to the best policy):
$$\frac{1}{T}\sum_{t=1}^T \mathrm{Risk}(\pi_t).$$

23 The (Fantasy) Full Information Setting
Imagine we see the rewards of all actions.
Learner's total reward = 0.2 + …
Context          | Action 1 | Action 2 | Action 3
(Male, 50, …)    |   1.0    |   0.2    |   0.0

24 The (Fantasy) Full Information Setting
Imagine we see the rewards of all actions.
Learner's total reward = …
Context          | Action 1 | Action 2 | Action 3
(Male, 50, …)    |   1.0    |   0.2    |   0.0
(Female, 18, …)  |          |          |
(Female, 48, …)  |   0.5    |   0.1    |   0.7

25 The (Fantasy) Full Information Setting
Imagine we see the rewards of all actions.
Learner's total reward = …
We can then easily compute the reward of any other π′ ∈ Π: on each context, look up the reward of the action π′ would have chosen (highlighted on the slide), and sum.
π′'s total reward = …
Context          | Action 1 | Action 2 | Action 3
(Male, 50, …)    |   1.0    |   0.2    |   0.0
(Female, 18, …)  |          |          |
(Female, 48, …)  |   0.5    |   0.1    |   0.7

26 Greedy Full-Information Algorithm
Compute the average reward of each policy π ∈ Π.
Act according to the π with the largest empirical reward.
Can show: $\mathrm{regret}_{\text{full}}(\text{Greedy}) = O\!\big(\sqrt{\ln|\Pi| / T}\big)$.
Need: efficient computation of the empirically best policy for large Π.

27 Arg max oracle (AMO)
Input: fully-labeled data (x_1, r_1), …, (x_t, r_t), where x_i is a context and r_i = (r_i(1), …, r_i(K)) contains the rewards of all actions.
AMO output: $\arg\max_{\pi \in \Pi} \sum_{i=1}^t r_i(\pi(x_i))$.
An abstraction for efficient search of the policy class Π; many policy classes have efficient implementations.
Speaker notes: The AMO is an abstraction for efficient search of a policy class. Given a fully-labeled data set, it returns the policy in the class with maximum total reward, i.e., it solves the off-line, full-information problem. This is generally computationally hard, but in practice we have effective heuristics for very rich policy classes. Still, it is not clear how to use it here, because we only have partial feedback.
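For a small, enumerable policy class, the AMO can be sketched as a brute-force search; real implementations instead reduce to a cost-sensitive classification learner. The helper `amo` and the toy data (loosely reusing two rows of the running example) are illustrative only.

```python
import numpy as np

def amo(policies, data):
    """data: list of (x_i, r_i) pairs, with r_i a length-K reward vector for all actions."""
    def total_reward(pi):
        return sum(r[pi(x)] for x, r in data)
    return max(policies, key=total_reward)

# Usage on a toy, fully-labeled data set with K = 3 actions.
policies = [lambda x: 0, lambda x: 1, lambda x: 2]
data = [(("male", 50), np.array([1.0, 0.2, 0.0])),
        (("female", 48), np.array([0.5, 0.1, 0.7]))]
best = amo(policies, data)          # the always-play-action-1 policy wins (total 1.5)
print(best(("male", 50)))           # -> 0
```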

28 Back to bandit setting…
The AMO allows implementation of Greedy in the full-information setting, but in the bandit setting we only see the reward of the chosen action.
Context          | Action 1 | Action 2 | Action 3
(Male, 50, …)    |   1.0    |   0.2    |   0.0
(Female, 18, …)  |          |          |
(Female, 48, …)  |   0.5    |   0.1    |   0.7

29 Back to bandit setting…
In the bandit setting we only see the reward of the chosen action.
Learner's total reward = …
Context          | Reward of chosen action
(Male, 50, …)    |   0.2
(Female, 18, …)  |   1.0
(Female, 48, …)  |   0.1

30 Back to bandit setting…
In the bandit setting we only see the reward of the chosen action.
Learner's total reward = …
For any π ∈ Π, rewards are only observed on the subset of rounds where the chosen action matches π's action, so π′'s total reward = ?? + ?? + …
Context          | Reward of chosen action
(Male, 50, …)    |   0.2
(Female, 18, …)  |   1.0
(Female, 48, …)  |   0.1

31 Inverse probability weighting (old trick)
Importance-weighted estimate of the reward from round t:
$$\hat{r}_t(a) := \frac{r_t(a_t)\cdot \mathbf{1}\{a = a_t\}}{p_t(a_t)}.$$
Unbiased. Challenge: range and variance are bounded only by 1/p_t(a).
Can estimate the total reward of any policy (and use this with the AMO):
$$\widehat{\mathrm{Reward}}_t(\pi) := \sum_{i=1}^t \hat{r}_i(\pi(x_i)).$$
Speaker notes: This is an old trick for unbiased estimates of the rewards of all actions --- including actions you didn't take. Estimate zero for actions you didn't take; for the action you took, scale up its reward by the inverse of the probability of taking that action. Upshot: we can estimate the total reward of any policy π through time t --- and use this with the AMO. So where does p_t come from?
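A short sketch of the estimator and of scoring a policy on logged bandit data. The log entries, action indices, and logging probabilities below are invented for illustration.

```python
import numpy as np

K = 3

def ips_estimate(a_t, reward, p_at, K=K):
    """r_hat_t(a) = r_t(a_t) * 1{a == a_t} / p_t(a_t): zero for unchosen actions."""
    r_hat = np.zeros(K)
    r_hat[a_t] = reward / p_at
    return r_hat

def estimated_total_reward(pi, log):
    """Reward_hat_t(pi) = sum_i r_hat_i(pi(x_i)) -- the quantity handed to the AMO."""
    return sum(ips_estimate(a, r, p)[pi(x)] for (x, a, r, p) in log)

# Toy log: (context, chosen action, observed reward, probability of that action).
log = [(("male", 50), 1, 0.2, 0.4),
       (("female", 18), 0, 1.0, 0.5),
       (("female", 48), 1, 0.1, 0.25)]
pi = lambda x: 1                          # a policy that always plays action 2
print(estimated_total_reward(pi, log))    # 0.2/0.4 + 0 + 0.1/0.25 = 0.9
```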

32 An algorithmic template
Start with an initial distribution p_1 over actions.
For t = 1, 2, …, T:
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K}.
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Create importance-weighted reward estimates r̂_t.
Caveat: the quality of actions changes over time! The construction of p_t is unclear.

33 Computing a distribution over actions
Draw π according to Q and play π(x), i.e.
$$Q(a \mid x) := \sum_{\pi \in \Pi} Q(\pi)\cdot \mathbf{1}\{\pi(x) = a\}.$$
We use the smoothed distribution
$$p_t := Q_t^{\mu_t}(\cdot \mid x_t) := (1 - K\mu_t)\, Q_t(\cdot \mid x_t) + \mu_t,$$
so every action has probability at least μ_t.
Speaker notes: Given Q and x, for every action a, sum up the weights of the policies that pick action a in context x. We then mix the resulting action distribution with a little bit of the uniform distribution; the amount of mixing μ_t goes to zero over time.
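A worked numeric sketch of the smoothed projection: the policy weights, the actions each policy picks at this context, and μ are all made up; only the two formulas above are mirrored.

```python
import numpy as np

K, mu = 3, 0.05
policies = [lambda x: 0, lambda x: 1, lambda x: 0]   # actions each policy picks at this context (assumed)
Q = np.array([0.5, 0.3, 0.2])                        # weights Q(pi) over the three policies (assumed)

Q_a = np.zeros(K)
for w, pi in zip(Q, policies):
    Q_a[pi("some context")] += w                     # Q(a|x): sum of weights of policies picking a

p = (1 - K * mu) * Q_a + mu                          # Q^mu(a|x): every action gets probability >= mu
print(Q_a, p, p.sum())                               # [0.7 0.3 0.] [0.645 0.305 0.05], sums to 1 (up to float error)
```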

34 Constructing policy distributions
Optimization problem (OP): find a policy distribution Q satisfying:
Low estimated regret (LR) – "exploitation".
Low estimation variance (LV) – "exploration".
Theorem (Regret): If we obtain the policy distributions Q_t by solving (OP), then with high probability the regret after T rounds is at most $O(KT\mu_T) = \tilde{O}\big(\sqrt{KT \log|\Pi|}\big)$.
Speaker notes: We will repeatedly call the AMO to construct the policy distribution --- somewhat like how boosting uses a weak learner. Before the details, we describe an optimization (feasibility) problem over the space of policy distributions whose solutions give the right behavior: a low estimated-regret bound (LR) for exploitation, and low estimation variance (LV) for exploration. The theorem says that distributions satisfying (LR) and (LV) achieve the optimal explore/exploit tradeoff.

35 Low estimated regret constraint
Optimization problem (OP): find a policy distribution Q such that
$$\sum_{\pi} Q(\pi)\, \widehat{\mathrm{Regret}}(\pi) \text{ is small.}$$
Low estimated regret (LR) – "exploitation".
Enforce low regret using the importance-weighted estimates; this is good if the estimated regret ≈ the true regret.

36 Low variance constraint
$$\widehat{\mathrm{Var}}_Q(\pi) \text{ is small } \quad \forall \pi \in \Pi.$$
Low estimation variance (LV) – "exploration".
Low variance for the regret estimate of each policy helps ensure that low estimated regret ⇒ low true regret.
One constraint for each policy!
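The quantity behind this constraint can be sketched as the empirical average of inverse action probabilities under π, which controls the variance of π's importance-weighted regret estimate (the same quantity appears on the implementation slide in the appendix). The contexts, smoothed probabilities, and policy below are invented.

```python
import numpy as np

def estimated_variance(pi, contexts, smoothed_probs):
    """(1/t) * sum_i 1 / Q^mu(pi(x_i) | x_i); smoothed_probs[i] is the vector Q^mu(.|x_i)."""
    return np.mean([1.0 / p[pi(x)] for x, p in zip(contexts, smoothed_probs)])

contexts = [0.3, -1.2, 0.7]
smoothed_probs = [np.array([0.6, 0.3, 0.1])] * 3      # toy Q^mu(.|x_i), identical across rounds here
pi = lambda x: 2 if x > 0 else 0                      # a toy policy

print(estimated_variance(pi, contexts, smoothed_probs))  # (1/0.1 + 1/0.6 + 1/0.1) / 3 ≈ 7.22
```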

37 Feasibility
Feasibility of (OP): implied by a minimax argument.
The "monster" solution [DHKKLRZ'11] solves a variant of (OP) with the ellipsoid algorithm; we present a simpler coordinate descent method.

38 Potential for rescue
Define the potential function
$$\Phi(Q) := \frac{t\,\mu_t\, \mathbb{E}_x\!\big[\mathrm{RE}\big(\mathrm{unif} \,\|\, Q^{\mu_t}(\cdot \mid x)\big)\big]}{1 - K\mu_t} \;+\; \frac{\lambda \sum_{\pi} Q(\pi)\, \widehat{\mathrm{Regret}}_t(\pi)}{K t \mu_t}.$$
The constraints in (OP) are based on derivatives of Φ, and minimizing Φ leads to satisfaction of the constraints.
Optimize using coordinate descent, which is well suited to large |Π|.

39 Coordinate descent algorithm
Claim: each check can be made with one AMO call per iteration.
INPUT: initial weights Q.
LOOP:
IF (LR) is violated, THEN replace Q by cQ.
IF there is a policy π ∈ Π causing (LV) to be violated, THEN update Q(π) := Q(π) + α.
ELSE RETURN Q.
Here both 0 < c < 1 and α > 0 have closed-form expressions.
Speaker notes: We use coordinate descent to iteratively construct a policy distribution Q; in each iteration we add at most one new policy to the support of Q. [Technical detail: we optimize over sub-distributions.] Repeat: if (LR) is violated, re-scale so it is satisfied; if some policy is causing (LV) to be violated, increase its weight. Each iteration requires just one AMO call.
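A structural sketch of this loop in Python. The constraint checks are passed in as callables, and the step sizes `c` and `alpha` are illustrative placeholders rather than the closed-form expressions used in the talk; in the real algorithm, finding a violating policy is itself a single AMO call.

```python
def coordinate_descent(Q, lr_violated, lv_violation, c=0.9, alpha=0.1, max_iters=100):
    """Q: dict mapping policy -> weight (a sparse sub-distribution over Pi)."""
    for _ in range(max_iters):
        if lr_violated(Q):                               # (LR) violated: rescale all weights down
            Q = {pi: c * w for pi, w in Q.items()}
        pi_bad = lv_violation(Q)                         # a policy violating (LV), or None
        if pi_bad is None:
            return Q                                     # all constraints satisfied
        Q[pi_bad] = Q.get(pi_bad, 0.0) + alpha           # add weight to the violating policy
    return Q

# Toy usage: constraints already satisfied, so Q is returned unchanged.
Q0 = {"pi_1": 0.6, "pi_2": 0.4}
print(coordinate_descent(Q0, lr_violated=lambda Q: False, lv_violation=lambda Q: None))
```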

40 I LOVE TO CON BANDITS
Start with an initial distribution Q_1 over policies Π.
For t = 1, 2, …, T:
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Create importance-weighted reward estimates r̂_t.
5. Compute a sparse Q_{t+1} using coordinate descent to solve (OP).
Speaker notes: Recap --- use coordinate descent to solve (OP) and get a sparse policy distribution that balances exploration and exploitation. So far it seems we need about T^{3/2} AMO calls, i.e., T^{1/2} calls per round over T rounds, but the promise was roughly T^{0.5} AMO calls over all T rounds.

41 Iteration bound for coordinate descent
Theorem (Computational complexity): the number of coordinate descent steps at round t is $\tilde{O}(1/\mu_t) = \tilde{O}\big(\sqrt{Kt/\log|\Pi|}\big)$.

42 Story so far…
A new optimization problem for statistical optimality.
A coordinate descent algorithm for computational tractability.
Optimally sparse distributions over policies.
Can we do even better?

43 Warm-start + epoch trick
If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then we only need $\tilde{O}\big(\sqrt{KT/\log|\Pi|}\big)$ coordinate descent iterations over all T rounds.
Caveat: we still need one AMO call per round just to check whether (OP) is already solved.
Solution: split the T rounds into epochs (e.g., doubling) and only solve (OP) once per epoch.
Speaker notes: When solving (OP) with coordinate descent to get Q_{t+1}, we can initialize with the current policy distribution Q_t --- a warm start. The total number of coordinate descent iterations over all T rounds is then Õ(√(KT/log|Π|)), based on an amortized analysis that bounds the round-to-round increase of a potential function with high probability. Caveat: we need one AMO call just to check whether (OP) is solved, so we don't (re-)solve (OP) too often: split the rounds into epochs and only solve (OP) once per epoch, using the warm-start initialization.

44 Epoch trick
Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption), so the same Q_t can be reused for O(t) more rounds.
Epoch trick: split the T rounds into epochs and solve (OP) once per epoch.
Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …; a total of O(log T) updates, so the overall number of AMO calls is unchanged (up to log factors).
Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …; a total of O(T^{1/2}) updates, each requiring $\tilde{O}\big(\sqrt{K/\log|\Pi|}\big)$ AMO calls on average.
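A small sketch of the two update schedules named on this slide. The helper `is_update_round` is hypothetical; a real implementation would solve (OP) with warm start on exactly these rounds and reuse the current Q_t otherwise.

```python
import math

def is_update_round(t, schedule="doubling"):
    if schedule == "doubling":                 # rounds 2, 4, 8, 16, ...: O(log T) solves of (OP)
        return t > 1 and (t & (t - 1)) == 0
    if schedule == "squares":                  # rounds 1, 4, 9, 16, ...: O(sqrt(T)) solves of (OP)
        return math.isqrt(t) ** 2 == t
    return True                                # fall back: solve every round

print([t for t in range(1, 65) if is_update_round(t, "doubling")])   # [2, 4, 8, 16, 32, 64]
print([t for t in range(1, 65) if is_update_round(t, "squares")])    # [1, 4, 9, 16, 25, 36, 49, 64]
```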

45 Warm-start
If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then we only need $\tilde{O}\big(\sqrt{KT/\log|\Pi|}\big)$ coordinate descent iterations over all T rounds.

46 Computational barrier
Suppose we want to construct a distribution Q which satisfies the (LV) constraint. Any such distribution must have support size $\Omega\big(\sqrt{KT/\log|\Pi|}\big)$, i.e., it can be no sparser than that.
This shows computational optimality within a class of algorithms.
Related: no AMO-based algorithm with o(|Π|) oracle calls can have low regret in the adversarial setting [HK2016].

47 Conclusion
Contextual bandit learning reduced to supervised learning: access the policy class only via the AMO.
New coordinate descent algorithm.
Statistically optimal and computationally practical.
Learn more:

48 Statistical analysis ideas
The low-regret constraint ensures small empirical regret.
Unbiased reward estimates come from importance weighting.
The low-variance constraint ensures concentration around the mean.
Martingale concentration plus a union bound over policies gives the main result.

49 Open problems and future steps
A more adaptive algorithm and analysis (similar to Epoch-Greedy)?
An algorithm for an online oracle? Can we use a computationally efficient AMO?
Extensions to structured actions (such as rankings, lists, …).
Extensions to reinforcement learning.


51 Implementation via AMO
To check violation of (LV), we need
$$\frac{1}{t}\sum_{i=1}^t \frac{1}{Q^{\mu_t}(\pi(x_i) \mid x_i)} \;\le\; K + \frac{\lambda\, \widehat{\mathrm{Regret}}_t(\pi)}{t\,\mu_t} \quad \forall \pi \in \Pi.$$
Equivalently, with $v_i(a) := \frac{\mu_t}{Q^{\mu_t}(a \mid x_i)} + \lambda\, \hat{r}_i(a)$,
$$\sum_{i=1}^t v_i(\pi(x_i)) \;\le\; K t \mu_t + \lambda\, \widehat{\mathrm{Reward}}_t(\hat{\pi}) \quad \forall \pi \in \Pi, \qquad \text{where } \hat{\pi} := \mathrm{AMO}(\hat{r}_{1:t}).$$
Obtain $\tilde{\pi} := \mathrm{AMO}(v_{1:t})$. Then (LV) is violated by some policy iff $\sum_{i=1}^t v_i(\tilde{\pi}(x_i)) > K t \mu_t + \lambda\, \widehat{\mathrm{Reward}}_t(\hat{\pi})$.
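A sketch of this check as a single (brute-force) AMO call, following the reconstruction above; the inline `amo`, the inputs, and all numbers are illustrative stand-ins for the real oracle and data structures.

```python
import numpy as np

def lv_violated(policies, contexts, smoothed_probs, r_hats, mu, lam, reward_best):
    """Return a policy violating (LV) if one exists, else None (one 'AMO call' on synthetic rewards v_i)."""
    K = len(r_hats[0])
    # v_i(a) = mu / Q^mu(a | x_i) + lam * r_hat_i(a)
    v = [mu / p + lam * r for p, r in zip(smoothed_probs, r_hats)]
    amo = lambda rewards: max(policies, key=lambda pi: sum(ri[pi(x)] for x, ri in zip(contexts, rewards)))
    pi_tilde = amo(v)                                              # maximizer of the left-hand side
    lhs = sum(vi[pi_tilde(x)] for x, vi in zip(contexts, v))
    threshold = K * len(contexts) * mu + lam * reward_best         # K*t*mu_t + lam * Reward_hat_t(pi_hat)
    return pi_tilde if lhs > threshold else None

# Toy check with K = 2 actions and t = 2 rounds; prints None (no violation here).
contexts = [0.5, -0.5]
policies = [lambda x: 0, lambda x: 1]
smoothed_probs = [np.array([0.6, 0.4]), np.array([0.7, 0.3])]      # Q^mu(.|x_i)
r_hats = [np.array([0.0, 0.5]), np.array([1.2, 0.0])]              # IPS reward estimates
print(lv_violated(policies, contexts, smoothed_probs, r_hats, mu=0.1, lam=1.0, reward_best=1.2))
```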

