Online Learning in Complex Environments


1 Online Learning in Complex Environments
Aditya Gopalan (IISc), MLSIG, 30 March 2015 (joint work with Shie Mannor and Yishay Mansour)

2 Machine Learning
Algorithms/systems for learning to "do stuff" … with data/observations of some sort. - R. E. Schapire
Learning by experience = Reinforcement Learning (RL): (Action 1, Reward 1), (Action 2, Reward 2), …
Act based on past data; future data/rewards depend on the present action; maximize some notion of utility/reward.
Data is gathered interactively.

3 (Simple) Multi-Armed Bandit
N "arms" or actions 1, 2, …, N (items in a recommender system, transmission frequencies, trades, …); each arm i is an unknown probability distribution with parameter θ_i and mean μ_i (think Bernoulli).

4 (Simple) Multi-Armed Bandit
Time 1: play an arm, collect an i.i.d. "reward" from that arm's distribution (ad clicks, data rate, profit, …).

5–8 (Simple) Multi-Armed Bandit
[Animation: at times 2, 3, 4, … an arm is played and a reward collected.] Play a while …

9 Performance Metrics
Total (expected) reward at time T.
Regret: shortfall in expected reward relative to always playing the best arm (defined below).
Probability of identifying the best arm.
Risk aversion: (mean – variance) of the reward.
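For reference, with arm means μ_1, …, μ_N, μ* = max_i μ_i, and a_t the arm played at time t, the standard expected regret after T rounds is

\[ R(T) \;=\; T\,\mu^* \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big]. \]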

10 Motivation, Applications
Clinical trials (the original application), Internet advertising, A/B testing, comment scoring, cognitive radio, dynamic pricing, sequential investment, noisy function optimization, adaptive routing/congestion control, job scheduling, bidding in auctions, crowdsourcing, learning in games.

11 Upper Confidence Bound algo [AuerEtAl'02]
Idea 1: Consider the variance of the estimates! Idea 2: Be optimistic under uncertainty!
Play the arm maximizing the index \hat{\mu}_i(t) + \sqrt{2 \ln t / n_i(t)}, where \hat{\mu}_i(t) is arm i's empirical mean reward and n_i(t) the number of times it has been played so far.
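A minimal Python sketch of this rule (UCB1 of [AuerEtAl'02]); the Bernoulli samplers in the usage comment are illustrative stand-ins for the unknown arms.

```python
import numpy as np

def ucb1(reward_fns, horizon):
    """UCB1 sketch: play each arm once, then always play the arm with the
    largest index  empirical mean + sqrt(2 ln t / n_i)."""
    n_arms = len(reward_fns)
    counts = np.zeros(n_arms)          # n_i(t): number of plays of arm i
    means = np.zeros(n_arms)           # empirical mean rewards
    history = []
    for t in range(1, horizon + 1):
        if t <= n_arms:                # initialization: play every arm once
            arm = t - 1
        else:                          # optimism in the face of uncertainty
            arm = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        r = reward_fns[arm]()          # collect an i.i.d. reward
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        history.append((arm, r))
    return history

# Usage with three Bernoulli arms (means unknown to the learner):
# arms = [lambda p=p: np.random.binomial(1, p) for p in (0.3, 0.5, 0.7)]
# history = ucb1(arms, horizon=10_000)
```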

12 UCB Performance [AuerEtAl'02]
After T plays, UCB's expected reward is within O(log T) of always playing the best arm (see the bound below).
Per-round regret vanishes as t becomes large: the algorithm is learning.
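For reference, the gap-dependent UCB1 guarantee of [AuerEtAl'02]: with gaps \Delta_i = \mu^* - \mu_i, the expected regret after n plays is at most

\[ \Bigg[\, 8 \sum_{i \,:\, \mu_i < \mu^*} \frac{\ln n}{\Delta_i} \Bigg] \;+\; \Big(1 + \frac{\pi^2}{3}\Big) \sum_{j=1}^{N} \Delta_j. \]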

13 Variations on the Theme
(Idealized) assumption in the MAB: all arms' rewards are independent of each other. Often there is more structure/coupling.
Variation 1: Linear Bandits [DaniEtAl'08, …]
Each arm is a vector x in R^d; playing arm x_t at time t gives reward r_t = x_t^T θ* + noise, where θ* in R^d is unknown.
[Figure: arms 1–5 as vectors.]

14 Variations on the Theme
Variation 1: Linear Bandits [DaniEtAl'08, …]
Regret after T time steps: see the rate below.
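For reference, the standard regret rate for linear bandits in this setting [DaniEtAl'08] is, up to logarithmic factors,

\[ \mathrm{Regret}(T) \;=\; \tilde{O}\big(d \sqrt{T}\big), \]

where d is the dimension of the arm vectors.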

15 Variations on the Theme
Variation 1: Linear Bandits [DaniEtAl'08, …]
e.g., binary vectors representing paths in a graph, or the collection of subsets of a ground set (budgeted ad display), with θ* representing per-edge/per-element cost/utility.

16 Variations on the Theme
Variation 1: Linear Bandits [DaniEtAl'08, …]
The LinUCB algo: build a point estimate \hat{θ}_t (least squares) and a confidence region (an ellipsoid around \hat{θ}_t); play the most optimistic action w.r.t. this ellipsoid, i.e. the arm x maximizing max_{θ in the ellipsoid} x^T θ. A sketch follows.
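A minimal LinUCB-style sketch under the linear model above; `reward_fn` stands in for the unknown environment, and the exploration scale `alpha` and ridge parameter `lam` are illustrative choices rather than the tuned constants of the cited algorithms.

```python
import numpy as np

def lin_ucb(arms, reward_fn, horizon, alpha=1.0, lam=1.0):
    """LinUCB sketch: ridge/least-squares estimate plus an ellipsoidal bonus.
    arms: (N, d) array of arm vectors; reward_fn(x): noisy reward for arm x."""
    N, d = arms.shape
    A = lam * np.eye(d)                       # regularized design matrix
    b = np.zeros(d)                           # sum of x_t * r_t
    history = []
    for t in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b                 # least-squares point estimate
        widths = np.sqrt(np.einsum('nd,de,ne->n', arms, A_inv, arms))
        idx = arms @ theta_hat + alpha * widths   # optimistic index per arm
        x = arms[int(np.argmax(idx))]
        r = reward_fn(x)
        A += np.outer(x, x)
        b += r * x
        history.append((x, r))
    return history
```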

17 Variations on the Theme
Variation 2: Non-parametric or X-Armed Bandits [Agrawal'95, Kleinberg'04, …]
Arms lie in a continuous space; the mean-reward function comes from some smooth function class (e.g., Lipschitz), and the observed rewards are noisy.
UCB-style algorithms based on hierarchical/adaptive discretization.
[Figure: mean reward as a function of the arm.]

18 What would a "Bayesian" do?
The Thompson Sampling algorithm: a "fake Bayesian's" approach to bandits. Prehistoric [Thompson'33].
Maintain a "prior" distribution over each arm's mean (arms 1 and 2 shown).

19 What would a "Bayesian" do?
Draw a random sample from each arm's current "prior".

20 What would a "Bayesian" do?
Play the best arm assuming the sampled means are the true means.

21 What would a "Bayesian" do?
Update the played arm's distribution to the "posterior" via Bayes' rule.

22–24 What would a "Bayesian" do?
Repeat: draw random samples, play the best arm under the sampled means, and update to the posterior.
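A minimal sketch of this loop for Bernoulli arms; the Beta(1, 1) priors are a standard conjugate choice assumed here, since the slides do not fix the prior family.

```python
import numpy as np

def thompson_sampling_bernoulli(reward_fns, horizon):
    """Beta-Bernoulli Thompson Sampling sketch.
    reward_fns[i]() returns a 0/1 reward from (unknown) arm i."""
    n_arms = len(reward_fns)
    alpha = np.ones(n_arms)   # Beta(1, 1) "prior" on each arm's mean
    beta = np.ones(n_arms)
    history = []
    for t in range(horizon):
        samples = np.random.beta(alpha, beta)   # one sample per arm
        arm = int(np.argmax(samples))           # act as if samples were the truth
        r = reward_fns[arm]()
        alpha[arm] += r                         # Bayes' rule: Beta posterior
        beta[arm] += 1 - r
        history.append((arm, r))
    return history
```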

25 What we know
[Thompson 1933], [Ortega-Braun 2010], [Agrawal-Goyal 2011, 2012]: optimal for the standard MAB; linear contextual bandits.
[Kaufmann et al. 2012, 2013]: standard MAB optimality.
[Russo-VanRoy 2013], [Bubeck-Liu 2013]: purely Bayesian setting (Bayesian regret).
But the analysis doesn't generalize: specific conjugate priors, and no closed form for complex bandit feedback, e.g., MAX.

26 TS for Linear Bandits
Idea: use the same least-squares estimate as LinUCB, but sample from the (multivariate Gaussian) posterior and act greedily! (No need to optimize over an ellipsoid.)
Shipra Agrawal, Navin Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML 2013.
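A minimal sketch of this idea, reusing the LinUCB statistics above; the posterior scale `v` and ridge parameter `lam` are illustrative choices, not the paper's tuned constants.

```python
import numpy as np

def thompson_sampling_linear(arms, reward_fn, horizon, v=1.0, lam=1.0):
    """Linear TS sketch: same ridge statistics as LinUCB, but draw theta from
    a Gaussian centred at the estimate and act greedily on the sample."""
    N, d = arms.shape
    A = lam * np.eye(d)
    b = np.zeros(d)
    history = []
    for t in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        # sample from the (approximate) posterior N(theta_hat, v^2 * A^{-1})
        theta_tilde = np.random.multivariate_normal(theta_hat, v**2 * A_inv)
        x = arms[int(np.argmax(arms @ theta_tilde))]   # greedy w.r.t. the sample
        r = reward_fn(x)
        A += np.outer(x, x)
        b += r * x
        history.append((x, r))
    return history
```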

27 More Generally – "Complex Bandits"
[Figure: basic arms 1, 2, 3, …, N.]

28 e.g. Subset Selection

29 e.g. Job Scheduling

30 e.g. Ranking

31 General Thompson Sampling
Imagine a 'fictitious' prior distribution over all the basic parameters.

32 General Thompson Sampling
Sample a set of parameters from the current prior.

33 General Thompson Sampling
Assume the sampled parameters are the truth; play BestAction(sampled parameters).

34 General Thompson Sampling
Get the reward, and update the prior to the posterior (Bayes' theorem).

35 Thompson Sampling Alg
Let Θ be the space of all basic parameters and π_0 a fictitious prior measure on Θ, e.g., π_0 = Uniform(Θ). At each time t:
SAMPLE θ̃_t from the current prior/posterior π_{t-1};
PLAY the best complex action given the sample, a_t = BestAction(θ̃_t);
GET the reward/feedback;
UPDATE the prior to the posterior via Bayes' rule. (A code sketch follows.)
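A minimal sketch of this loop over a finite parameter space (matching the discrete-prior setting of the next slide); `best_action`, `likelihood`, and `observe` are hypothetical problem-specific hooks, e.g., subset selection with MAX feedback would supply its own versions.

```python
import numpy as np

def general_thompson_sampling(thetas, prior, best_action, likelihood,
                              observe, horizon):
    """General TS sketch over a finite set of candidate parameters.
    thetas      : list of candidate parameter vectors (the space Theta)
    prior       : initial probabilities over thetas (fictitious prior)
    best_action : theta -> best complex action if theta were the truth
    likelihood  : (obs, action, theta) -> P(obs | action, theta)
    observe     : action -> feedback observed from the real environment"""
    posterior = np.array(prior, dtype=float)
    history = []
    for t in range(horizon):
        k = np.random.choice(len(thetas), p=posterior)       # SAMPLE
        action = best_action(thetas[k])                       # PLAY
        obs = observe(action)                                 # GET feedback
        posterior *= np.array([likelihood(obs, action, th) for th in thetas])
        posterior /= posterior.sum()                          # UPDATE (Bayes)
        history.append((action, obs))
    return history
```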

36 Main Result: "Information Complexity"
Gap/problem-dependent regret bound: under any "reasonable" discrete prior and a finite number of actions, with probability at least 1 - δ, the regret is bounded in terms of an "information complexity" quantity.
The "information complexity" captures the structure of the complex bandit and can be much smaller than the number of actions; it is the solution to an optimization problem in "path space" (LP interpretation).
General feedback! Previously: only LINEAR complex bandits [DaniEtAl'08, Abbasi-YadkoriEtAl'11, Cesa-Bianchi-Lugosi'12].
* Aditya Gopalan, Shie Mannor & Yishay Mansour, "Complex Bandit Problems and Thompson Sampling", ICML 2014

37 Example 1: "Semi-bandit"
Pick size-K subsets of the N arms; observe all K chosen rewards ("semi-bandit" feedback).
Semi-bandit regret bound: under a reasonable prior, with probability at least 1 - δ, there are N-choose-K actions, but the regret scales only with a much smaller problem-dependent quantity.

38 Example 2: MAX Feedback
Pick size-K subsets of arms; observe only the MAX of the K chosen rewards.
Regret bound under MAX-feedback structure: under a reasonable prior, with probability at least 1 - δ, the bound is far smaller than the number of actions (a big saving over #ACTIONS).

39 Numerics: Play Subsets, See MAX
[Plot.] UCB is still exploring (linear region)!

40 Markov Decision Process
States, actions, transition probabilities, rewards.
Special case: the multi-armed bandit (an MDP with a single state).
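For reference (notation assumed here, not taken from the slide), an MDP is a tuple

\[ M = (\mathcal{S}, \mathcal{A}, P, r), \qquad P(s' \mid s, a) \text{ the transition probabilities}, \qquad r(s, a) \text{ the rewards}, \]

and the multi-armed bandit is the special case with a single state.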

41 Markov Decision Process
[Figure: an MDP with 3 states and 2 actions. Source: Wikipedia.]

42 Online Reinforcement Learning
Suppose the true MDP parameter is θ*, but this is unknown to the decision maker a priori.
Must "LEARN the optimal policy": what action to take in each state to maximize long-run reward, or equivalently, to minimize regret (see below).
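In the standard undiscounted formulation (as used by UCRL2-style analyses), with ρ*(θ*) the optimal long-run average reward and r_t the reward collected at step t, the regret after T steps is

\[ R(T) \;=\; T\,\rho^*(\theta^*) \;-\; \sum_{t=1}^{T} r_t. \]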

43 Online Reinforcement Learning
Tradeoff: explore the state space, or exploit existing knowledge to design a good current policy?
Upper-confidence-based approaches: build confidence intervals per state-action pair and be optimistic! Rmax [Brafman-Tennenholtz 2001], UCRL2 [JakschEtAl 2007].
Key idea: maintain estimates + high-confidence sets for the transition probabilities & rewards of every (state, action) pair.
"Wasteful" if the transitions/rewards have structure/relations.

44 Parameterized MDP – Queueing System
Single queue with N states (queue lengths), discrete time, Bernoulli(λ) arrivals in every state.
2 actions: {FAST, SLOW}, with (Bernoulli) service rates {μ_FAST, μ_SLOW}.
Assume the service rates are known; the uncertainty is in the arrival rate λ only.

45 Thompson Sampling
Draw an MDP instance ~ prior over possible MDPs.
Compute the optimal policy for that instance (value iteration, linear programming, simulation, …).
Play the action prescribed by the optimal policy for the current state. Repeat indefinitely.
In fact, consider the following variant (TSMDP), sketched below:
Designate a marker state and divide time into visits to the marker state (epochs). At each epoch: sample an MDP ~ current posterior; compute the optimal policy for the sampled MDP; play that policy until the end of the epoch; update the posterior using the epoch's samples.
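A minimal sketch of the TSMDP epoch structure over a finite set of candidate MDPs; `solve_optimal_policy` (e.g., value iteration, returning a state-to-action mapping), `likelihood`, and the environment hook `env_step` are hypothetical helpers assumed for illustration.

```python
import numpy as np

def tsmdp(candidate_mdps, prior, solve_optimal_policy, likelihood,
          env_step, marker_state, start_state, horizon):
    """TSMDP sketch: resample an MDP at each return to the marker state,
    play its optimal policy for the epoch, then Bayes-update the posterior
    with the transitions observed during that epoch."""
    posterior = np.array(prior, dtype=float)
    state, policy, epoch_data = start_state, None, []
    for t in range(horizon):
        if policy is None or (state == marker_state and epoch_data):
            for (s, a, r, s_next) in epoch_data:      # posterior update
                posterior *= np.array([likelihood(s, a, r, s_next, m)
                                       for m in candidate_mdps])
                posterior /= posterior.sum()
            epoch_data = []
            k = np.random.choice(len(candidate_mdps), p=posterior)
            policy = solve_optimal_policy(candidate_mdps[k])   # e.g. value iteration
        action = policy[state]                        # play the epoch's policy
        reward, next_state = env_step(state, action)
        epoch_data.append((state, action, reward, next_state))
        state = next_state
    return posterior
```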

46 Numerics – Queueing System

47 Main Result [G.-Mannor'15]
Under a suitably "nice" prior, with probability at least (1 - δ), TSMDP has regret B + C log(T) in T rounds, where B depends on the problem instance and the prior, and C depends only on the true model and, more importantly, the "effective dimension" of the parameter space.
Implication: provably rapid learning if the effective dimensionality of the MDP is small, e.g., a queueing system with a single scalar uncertain parameter.

48 Future Directions
Continuum-armed bandits? Risk-averse decision making. Misspecified models (e.g., "almost-linear bandits"). Relax the epoch structure? Relax the ergodicity assumptions? Infinite state/action spaces. Function approximation for state/policy representations?

49 Thank you

