Tuning bandit algorithms in stochastic environments

Tuning bandit algorithms in stochastic environments. Jean-Yves Audibert, CERTIS, École des Ponts; Rémi Munos, INRIA Futurs Lille; Csaba Szepesvári, University of Alberta. The 18th International Conference on Algorithmic Learning Theory, October 3, 2007, Sendai International Center, Sendai, Japan.

Contents: Bandit problems; UCB and motivation; Tuning UCB by using variance estimates; Concentration of the regret; Finite horizon, finite regret (PAC-UCB); Conclusions.

Exploration vs. Exploitation. Two treatments with unknown success probabilities. Goal: find the best treatment while losing the smallest number of patients. Explore or exploit?

Playing Bandits. Payoff is 0 or 1. Arm 1: X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, X_{1,5}, X_{1,6}, X_{1,7}, … Arm 2: X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, X_{2,5}, X_{2,6}, X_{2,7}, … (The slide shows a sample sequence of observed 0/1 payoffs for the two arms.)

Exploration vs. Exploitation: Some Applications. Simple processes: clinical trials; job shop scheduling (random jobs); which ad to put on a web page. More complex processes (with memory): optimizing production; controlling an inventory; optimal investment; poker.

Bandit Problems – "Optimism in the Face of Uncertainty". Introduced by Lai and Robbins (1985) (?). i.i.d. payoffs X_{1,1}, X_{1,2}, …, X_{1,t}, … and X_{2,1}, X_{2,2}, …, X_{2,t}, … Principle: the inflated value of an option is the maximum expected reward that looks "quite" possible given the observations so far. Select the option with the best inflated value.

Some definitions. T_k(t) denotes the number of times arm k has been played up to time t, and I_s denotes the index of the arm played at time s. Payoff is 0 or 1. Arm 1: X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, X_{1,5}, X_{1,6}, X_{1,7}, … Arm 2: X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, X_{2,5}, X_{2,6}, X_{2,7}, … Example from the slide: now t = 11, T_1(t-1) = 4, T_2(t-1) = 6, I_1 = 1, I_2 = 2, … (The slide again shows a sample sequence of observed 0/1 payoffs.)

Parametric Bandits [Lai & Robbins]. Payoffs X_{i,t} ∼ p_{θ_i}(·), with θ_i unknown, t = 1, 2, … Uncertainty set: "reasonable values of θ given the experience so far", U_{i,t} = { θ : p_θ(X_{i,1:T_i(t)}) is "large" mod (t, T_i(t)) }. Inflated values: Z_{i,t} = max{ E_θ[X] : θ ∈ U_{i,t} }. Rule: I_t = arg max_i Z_{i,t}.

Bounds. Upper bound: … Lower bound: if an algorithm is uniformly good, then its regret must grow at least logarithmically in n (Lai & Robbins).

UCB1 Algorithm (Auer et al., 2002). Algorithm UCB1(b): try all options once, then repeatedly use the option k with the highest index. Regret bound: let R_n be the expected loss due to not selecting the best option during the first n time steps; then R_n grows only logarithmically in n.
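As a point of reference, here is a minimal Python sketch of UCB1 in its standard form (Auer et al., 2002): the index is the empirical mean plus an exploration bonus of order b·sqrt(2·log t / T_k(t)). The function and variable names, and the explicit scaling by the payoff range b, are illustrative assumptions, not notation from the slides.

```python
import math

def ucb1_index(mean, pulls, t, b=1.0):
    # Empirical mean plus an exploration bonus that shrinks as the arm is
    # pulled more often; b is the a priori range of the payoffs.
    return mean + b * math.sqrt(2.0 * math.log(t) / pulls)

def run_ucb1(arms, n_steps, b=1.0):
    """Run UCB1 on a list of arms, each a callable returning a payoff in [0, b]."""
    k_arms = len(arms)
    pulls = [0] * k_arms
    means = [0.0] * k_arms
    # Try all options once.
    for k in range(k_arms):
        means[k] = arms[k]()
        pulls[k] = 1
    # Then always use the option with the highest index.
    for t in range(k_arms + 1, n_steps + 1):
        k = max(range(k_arms), key=lambda i: ucb1_index(means[i], pulls[i], t, b))
        x = arms[k]()
        pulls[k] += 1
        means[k] += (x - means[k]) / pulls[k]   # incremental mean update
    return means, pulls
```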

Problem #1: when b² ≫ σ², the regret should scale with σ² and not with b²!

UCB1-NORMAL. Algorithm UCB1-NORMAL: try all options once, then use the option k with the highest index (an index tailored to normally distributed payoffs). Regret bound: the regret scales with the variances σ_k² instead of b² (next slide).

Problem #1. The regret of UCB1(b) scales with O(b²). The regret of UCB1-NORMAL scales with O(σ²) … but UCB1-NORMAL assumes normally distributed payoffs. UCB-Tuned(b): good experimental results, but no theoretical guarantees.

UCB-V. Algorithm UCB-V(b): try all options once, then use the option k with the highest index, built from the empirical mean and the empirical variance of the arm's payoffs. Regret bound: the dependence on the a priori range b is lessened; the leading term depends on the variance estimates instead.
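The following Python sketch shows one way such a variance-aware index can be computed, assuming the general shape used by UCB-V: empirical mean, plus a bonus driven by the empirical variance, plus a correction driven by the range b, with an exploration function of order ζ·log t. All names, default values, and the exact constants are illustrative assumptions rather than values read off the slides.

```python
import math

def ucbv_index(mean, var, pulls, t, b=1.0, zeta=1.2, c=1.0):
    """Illustrative variance-aware (UCB-V style) index.

    mean, var: empirical mean and empirical variance of the arm's payoffs
    pulls:     number of times the arm has been pulled so far
    t:         current time step
    b:         a priori range of the payoffs
    zeta, c:   tuning parameters (see the following slides); the defaults
               are placeholders, not prescribed values
    """
    explo = zeta * math.log(t)                       # exploration function
    return (mean
            + math.sqrt(2.0 * var * explo / pulls)   # variance-driven bonus
            + c * 3.0 * b * explo / pulls)           # range-driven correction
```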

Proof. The "missing bound" (hunch.net); bounding the sampling times of suboptimal arms (new bound).

Can we decrease exploration? Algorithm UCB-V(b, ζ, c): try all options once, then use the option k with the highest index (exploration function scaled by ζ, correction term scaled by c). Theorem: when ζ < 1, the regret will be polynomial for some bandit problems; when c < 1/6, the regret will be polynomial for some bandit problems.
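In terms of the illustrative ucbv_index sketch above, these two regimes correspond to setting the zeta parameter below 1 or the c parameter below 1/6; the theorem says that on some bandit problems the expected regret then becomes polynomial in n rather than logarithmic.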

Concentration bounds. Averages concentrate exponentially fast around their mean; does the regret of UCB* concentrate in the same way? RISK??

Logarithmic regret implies high risk. Theorem: consider the pseudo-regret R_n = Σ_{k=1}^K T_k(n) Δ_k. Then for γ > 1 and z > γ log(n), P(R_n > z) ≤ C z^{-γ} (only a polynomial tail, whereas a Gaussian tail would give P(R_n > z) ≤ C exp(-z²)). Illustration: two arms with gap Δ = μ_2 - μ_1 > 0; the law of R_n has modes at O(log n) and at O(Δ n)! This only happens when the support of the second-best arm's distribution overlaps with that of the optimal arm.
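To get a feel for this bimodal behaviour, one could run a small Monte Carlo experiment on a two-armed Bernoulli bandit whose payoff supports overlap. This is purely an illustrative sketch, reusing the hypothetical run_ucb1 helper from the earlier block; it is not an experiment reported in the paper.

```python
import random

# Two Bernoulli arms with overlapping supports (both can pay 0 or 1),
# which is the situation in which the heavy regret tail can show up.
arms = [lambda: float(random.random() < 0.6),   # better arm, mean 0.6
        lambda: float(random.random() < 0.5)]   # worse arm, mean 0.5
gap = 0.6 - 0.5

pseudo_regrets = []
for _ in range(200):
    _, pulls = run_ucb1(arms, n_steps=10_000)
    pseudo_regrets.append(gap * pulls[1])       # gap * (plays of the worse arm)

# Most runs should show a small (logarithmic-order) pseudo-regret, but a
# histogram of pseudo_regrets may reveal a second mode where the worse arm
# was played a constant fraction of the time.
pseudo_regrets.sort()
print(pseudo_regrets[:5], pseudo_regrets[-5:])
```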

Finite horizon: PAC-UCB. Algorithm PAC-UCB(N): try all options once, then use the option k with the highest index. Theorem: at time N, with probability 1 - 1/N, the number of suboptimal plays is bounded by O(log(K N)). Good when N is known beforehand.

Conclusions. Taking into account the variance lessens the dependence on the a priori bound b. Low expected regret => high risk. PAC-UCB: finite regret, known horizon, exponential concentration of the regret. Optimal balance? Other algorithms? Greater generality: look up the paper!

Thank you! Questions?

References. Optimism in the face of uncertainty: Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22. UCB1 and more: Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256. Audibert, J.-Y., Munos, R., and Szepesvári, Cs. (2007). Tuning bandit algorithms in stochastic environments. In Proceedings of the 18th International Conference on Algorithmic Learning Theory (ALT 2007).