Published by Salvador Grubb. Modified about 1 year ago.

1
Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design
Niranjan Srinivas, Andreas Krause (Caltech)
Sham Kakade (Wharton), Matthias Seeger (Saarland)
theory and practice collide

2
Optimizing Noisy, Unknown Functions
Given: set of possible inputs D; black-box access to an unknown function f
Want: adaptive choice of inputs $x_1, x_2, \dots$ from D maximizing $\sum_{t=1}^{T} f(x_t)$
Many applications: robotic control [Lizotte et al. ’07], sponsored search [Pande & Olston ’07], clinical trials, …
Sampling is expensive
Algorithms are evaluated using “regret”
Goal: minimize the average regret $\frac{1}{T}\sum_{t=1}^{T}\left[f(x^*) - f(x_t)\right]$, where $x^* = \arg\max_{x \in D} f(x)$

3
Running example: Noisy Search
How to find the hottest point in a building?
Many noisy sensors are available, but sampling is expensive
D: set of sensors; $f(x_i)$: temperature at the sensor $x_i$ chosen at step i
Observe $y_i = f(x_i) + \epsilon_i$
Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries

4
Key insight: Exploit correlation
Sampling f(x) at one point x yields information about f(x’) for points x’ near x
In this paper: model correlation using a Gaussian process (GP) prior for f
[Figure: temperature is spatially correlated]

5
Gaussian Processes to model payoff f
Gaussian process (GP) = normal distribution over functions
Finite marginals are multivariate Gaussians
Closed-form formulae for the Bayesian posterior update exist
Parameterized by covariance function K(x,x’) = Cov(f(x),f(x’))
[Figure: normal distribution (1-D Gaussian), multivariate normal (n-D Gaussian), Gaussian process (∞-D Gaussian)]
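
The closed-form posterior update mentioned on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a zero-mean prior, 1-D inputs, the squared exponential kernel, and Gaussian observation noise. The names `sq_exp_kernel`, `gp_posterior`, `h` and `noise_var`, and the toy data, are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def gp_posterior(x_obs, y_obs, x_query, h=0.3, noise_var=0.01):
    """Posterior mean and variance of f at x_query (zero-mean GP prior assumed)."""
    y_obs = np.asarray(y_obs, dtype=float)
    # Covariance of the noisy observations: K + sigma^2 I.
    K = sq_exp_kernel(x_obs, x_obs, h) + noise_var * np.eye(len(y_obs))
    k_star = sq_exp_kernel(x_obs, x_query, h)      # shape (n_obs, n_query)
    solved = np.linalg.solve(K, k_star)            # K^{-1} k_star
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_star * solved, axis=0)    # K(x, x) = 1 for this kernel
    return mean, np.maximum(var, 0.0)

# Toy data (made up for illustration): three noisy readings on [0, 1].
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.array([0.3, 1.0, -0.2])
x_query = np.linspace(0.0, 1.0, 11)
mean, var = gp_posterior(x_obs, y_obs, x_query)
```

The posterior variance shrinks near observed inputs and stays close to the prior variance far from them, which is the property the sampling rules on later slides exploit.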

6
Example of GPs
Squared exponential kernel $K(x,x') = \exp(-(x-x')^2/h^2)$
[Figure: kernel value against distance |x-x’| and samples from P(f), for bandwidths h = 0.1 and h = 0.3]
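
To reproduce something like the sample paths in the figure, one can draw from the GP prior on a grid. The sketch below assumes a zero-mean prior and adds a small diagonal jitter for numerical stability. The jitter, grid size, seed and function names are illustrative choices, not taken from the slides.

```python
import numpy as np

def sq_exp_kernel(a, b, h):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def sample_gp_prior(x, h, n_samples=3, seed=0, jitter=1e-6):
    """Draw n_samples functions from a zero-mean GP prior, evaluated on grid x."""
    rng = np.random.default_rng(seed)
    # Jitter keeps the Cholesky factorization stable for smooth kernels.
    K = sq_exp_kernel(x, x, h) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    # If z ~ N(0, I), then L z ~ N(0, K).
    return (L @ rng.standard_normal((len(x), n_samples))).T

x = np.linspace(0.0, 1.0, 50)
rough = sample_gp_prior(x, h=0.1)    # small bandwidth: wiggly samples
smooth = sample_gp_prior(x, h=0.3)   # large bandwidth: smooth samples
```

Plotting the rows of `rough` and `smooth` against `x` should show the qualitative difference between the two bandwidths.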

7
Gaussian process optimization [e.g., Jones et al. ’98]
Goal: adaptively pick inputs $x_1, x_2, \dots$ such that the average regret $\frac{1}{T}\sum_{t=1}^{T}\left[f(x^*) - f(x_t)\right] \to 0$, where $x^* = \arg\max_{x \in D} f(x)$
Key question: how should we pick samples?
So far, only heuristics:
  Expected Improvement [Močkus et al. ’78]
  Most Probable Improvement [Močkus ’89]
Used successfully in machine learning [Ginsbourger et al. ’08, Jones ’01, Lizotte et al. ’07]
No theoretical guarantees on their regret!
[Figure: f(x) against x]

8
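Simple algorithm for GP optimization
In each round t do:
  Pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$
  Observe $y_t = f(x_t) + \epsilon_t$
  Use Bayes’ rule to get the posterior mean $\mu_t(x) = E[f(x) \mid y_1, \dots, y_t]$
Can get stuck in local maxima!
[Figure: f(x) against x]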

9
Uncertainty sampling
Pick: $x_t = \arg\max_{x \in D} \sigma_{t-1}^2(x)$
That is equivalent to (greedily) maximizing information gain
Popular objective in Bayesian experimental design (where the goal is pure exploration of f)
But… wastes samples by exploring f everywhere!
[Figure: f(x) against x]
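
A small sketch of uncertainty sampling, assuming the rule is "pick the input with the largest posterior variance" under a squared exponential kernel. Because the GP posterior variance does not depend on the observed values, no payoff function is needed here. The grid, bandwidth, noise level and function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, h):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def posterior_variance(x_obs, x_all, h, noise_var):
    """GP posterior variance on x_all after (noisy) observations at x_obs."""
    if len(x_obs) == 0:
        return np.ones(len(x_all))                 # prior variance K(x, x) = 1
    K = sq_exp_kernel(x_obs, x_obs, h) + noise_var * np.eye(len(x_obs))
    k_star = sq_exp_kernel(x_obs, x_all, h)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return np.maximum(var, 0.0)

def uncertainty_sampling(x_all, n_rounds, h=0.3, noise_var=0.01):
    """Repeatedly pick the candidate with the largest posterior variance."""
    chosen = []
    for _ in range(n_rounds):
        var = posterior_variance(chosen, x_all, h, noise_var)
        chosen.append(float(x_all[int(np.argmax(var))]))
    return chosen

D = np.linspace(0.0, 1.0, 21)
picks = uncertainty_sampling(D, n_rounds=4)
```

The picks spread out over the whole domain regardless of where f is large, which illustrates the slide's objection that pure exploration wastes samples when the goal is to find the maximum.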

10
Avoiding unnecessary samples
Key insight: never need to sample where the Upper Confidence Bound (UCB) < best lower bound!
[Figure: f(x) against x, with the best lower bound marked]

11
Upper Confidence Bound (UCB) Algorithm
Pick the input that maximizes the Upper Confidence Bound (UCB): $x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$
Naturally trades off explore and exploit; no samples wasted
Regret bounds: classic [Auer ’02] & linear f [Dani et al. ’07]
But none in the GP optimization setting! (popular heuristic)
How should we choose $\beta_t$? Need theory!
[Figure: f(x) against x]
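
The UCB rule on this slide can be sketched as a short loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: a finite grid D, a squared exponential kernel, Gaussian noise, and the exploration weight $\beta_t = 2\log(|D| t^2 \pi^2 / (6\delta))$ that the paper gives for finite D. The payoff `toy_payoff`, the grid, and all parameter values are made up for the example.

```python
import numpy as np

def sq_exp_kernel(a, b, h):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def posterior(x_obs, y_obs, x_all, h, noise_var):
    """Posterior mean and standard deviation on x_all (zero-mean GP prior)."""
    if len(x_obs) == 0:
        return np.zeros(len(x_all)), np.ones(len(x_all))
    K = sq_exp_kernel(x_obs, x_obs, h) + noise_var * np.eye(len(x_obs))
    k_star = sq_exp_kernel(x_obs, x_all, h)
    solved = np.linalg.solve(K, k_star)
    mean = solved.T @ np.asarray(y_obs, dtype=float)
    var = np.maximum(1.0 - np.sum(k_star * solved, axis=0), 0.0)
    return mean, np.sqrt(var)

def gp_ucb(f, x_all, n_rounds, h=0.3, noise_std=0.1, delta=0.1, seed=0):
    """Each round, query the input with the largest upper confidence bound."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for t in range(1, n_rounds + 1):
        mean, std = posterior(xs, ys, x_all, h, noise_std**2)
        # Exploration weight for finite D; grows like log t.
        beta_t = 2.0 * np.log(len(x_all) * t**2 * np.pi**2 / (6.0 * delta))
        x_t = float(x_all[int(np.argmax(mean + np.sqrt(beta_t) * std))])
        xs.append(x_t)
        ys.append(float(f(x_t)) + noise_std * rng.standard_normal())
    return np.array(xs), np.array(ys)

def toy_payoff(x):
    """Hypothetical payoff with a single peak at x = 0.7."""
    return np.exp(-((x - 0.7) ** 2) / 0.05)

D = np.linspace(0.0, 1.0, 51)
xs, ys = gp_ucb(toy_payoff, D, n_rounds=40)
simple_regret = toy_payoff(D).max() - toy_payoff(xs).max()
```

Early rounds are dominated by the $\sigma$ term (exploration); once the confidence bands shrink, the queries concentrate around the peak.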

12
How well does UCB work?
Intuitively, performance should depend on how “learnable” the function is
The quicker the confidence bands collapse, the easier f is to learn
Key idea: the rate of collapse is tied to the growth of information gain
[Figure: an “easy” function (bandwidth h = 0.3) and a “hard” function (bandwidth h = 0.1)]

13
Learnability and information gain
We show that regret bounds depend on how quickly we can gain information
Mathematically: [formula not preserved]
Establishes a novel connection between GP optimization and Bayesian experimental design

14
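Performance of optimistic sampling
Theorem: If we choose $\beta_t = \Theta(\log t)$, then with high probability the cumulative regret $R_T = \sum_{t=1}^{T}\left[f(x^*) - f(x_t)\right]$ satisfies $R_T = O(\sqrt{T \beta_T \gamma_T})$
Hereby $\gamma_T = \max_{A \subseteq D, |A| \le T} I(f; y_A)$ is the maximal information gain due to sampling!
The slower $\gamma_T$ grows, the easier f is to learn
Key question: how quickly does $\gamma_T$ grow?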
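
For a GP prior with Gaussian noise of variance $\sigma^2$, the information gain of a set of sampled inputs A has the closed form $I(f; y_A) = \frac{1}{2}\log\det(I + \sigma^{-2} K_A)$. The sketch below evaluates it for the two bandwidths used on the earlier slides. The grid and noise level are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b, h):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def information_gain(x_sampled, h, noise_var=0.01):
    """I(f; y_A) = 1/2 log det(I + K_A / sigma^2) for inputs A = x_sampled."""
    K = sq_exp_kernel(x_sampled, x_sampled, h)
    # slogdet avoids overflow/underflow in the determinant.
    _, logdet = np.linalg.slogdet(np.eye(len(x_sampled)) + K / noise_var)
    return 0.5 * logdet

A = np.linspace(0.0, 1.0, 20)
gain_smooth = information_gain(A, h=0.3)   # "easy" function class
gain_rough = information_gain(A, h=0.1)    # "hard" function class
```

The same 20 samples carry more information when the bandwidth is small, because nearby observations are less redundant. That is the sense in which a faster-growing information gain means a harder function to learn.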

15
Learnability and information gain
Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin ’05]
Our bounds depend on the “rate” of diminishment
[Figure: two cases, one with little diminishing returns and one where returns diminish fast]

16
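Dealing with high dimensions
Theorem: For various popular kernels, we have:
  Linear: $\gamma_T = O(d \log T)$;
  Squared-exponential: $\gamma_T = O((\log T)^{d+1})$;
  Matérn with $\nu > 1$: $\gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} \log T)$;
Smoothness of f helps battle the curse of dimensionality!
Our bounds rely on submodularity of the information gain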

17
What if f is not from a GP?
In practice, f may not be Gaussian
Theorem: Let f lie in the RKHS of kernel K with $\|f\|_K^2 \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t = 2B + 300 \gamma_t \log^3(t/\delta)$. Then with high probability, the cumulative regret satisfies $R_T = O(\sqrt{T \beta_T \gamma_T})$
Frees us from knowing the “true prior”
Intuitively, the bound depends on the “complexity” of the function through its RKHS norm

18
Experiments: UCB vs. heuristics
Temperature data
  46 sensors deployed at Intel Research, Berkeley
  Collected data for 5 days (1 sample/minute)
  Want to adaptively find the highest temperature as quickly as possible
Traffic data
  Speed data from 357 sensors deployed along highway I-880 South
  Collected between 6am and 11am, for one month
  Want to find the most congested (lowest speed) area as quickly as possible

19
Comparison: UCB vs. heuristics
GP-UCB compares favorably with existing heuristics

20
Conclusions
First theoretical guarantees and convergence rates for GP optimization
Both the true-prior and the agnostic case are covered
Performance depends on “learnability”, captured by the maximal information gain
Connects GP Bandit Optimization & Experimental Design!
Performance on real data comparable to other heuristics

21
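Relating the two problems
Adaptive choice of $x_1, x_2, \dots$ from decision set D
Min regret: $\min_{t \le T} \left[f(x^*) - f(x_t)\right]$
Easy to see that $\min_{t \le T} \left[f(x^*) - f(x_t)\right] \le R_T / T$, where $R_T = \sum_{t=1}^{T}\left[f(x^*) - f(x_t)\right]$ is the cumulative regret
We bound $R_T$; the bound also applies to the min regret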

22
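Performance of Optimistic sampling
Theorem [Srinivas, Krause, Kakade, Seeger]: Let $D \subset \mathbb{R}^d$ be compact and convex. Pick $\delta \in (0,1)$ and $\beta_t = \Theta(\log t)$. If K satisfies weak regularity conditions, we have, with probability at least $1 - \delta$, $R_T \le \sqrt{C_1 T \beta_T \gamma_T} + 2$ for all $T \ge 1$, where $C_1$ is independent of T.
$\gamma_T$ is the maximal information gain due to sampling
First performance guarantee for GP optimization!
“It pays to be optimistic”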

23
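Bounding Information Gain
Theorem: For finite D, submodularity and the greedy algorithm yield $\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{m_1, \dots, m_{|D|}} \sum_{t=1}^{|D|} \log(1 + \sigma^{-2} m_t \hat{\lambda}_t)$ subject to $\sum_t m_t = T$
$m_t$: #samples of the t-th eigenvector; $\hat{\lambda}_t$: eigenvalues of the covariance matrix
Greedy algorithm samples eigenvectors
Maximal info-gain depends on spectral properties of the kernel
The faster the eigenvalues decay, the better the performance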
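
The role of submodularity can be checked numerically on a tiny problem: greedy selection of sampling locations should reach at least a $(1 - 1/e)$ fraction of the best achievable information gain. The sketch below compares greedy against brute force over all multisets of size T. The candidate set, kernel bandwidth and noise level are assumptions chosen to keep the brute force cheap.

```python
import itertools
import numpy as np

def sq_exp_kernel(a, b, h):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def information_gain(points, h=0.3, noise_var=0.01):
    """I(f; y_A) = 1/2 log det(I + K_A / sigma^2); repeated points are allowed."""
    K = sq_exp_kernel(points, points, h)
    _, logdet = np.linalg.slogdet(np.eye(len(points)) + K / noise_var)
    return 0.5 * logdet

def greedy_selection(candidates, budget):
    """Add, one at a time, the candidate that increases information gain most."""
    chosen = []
    for _ in range(budget):
        gains = [information_gain(chosen + [x]) for x in candidates]
        chosen.append(candidates[int(np.argmax(gains))])
    return chosen

D = [float(x) for x in np.linspace(0.0, 1.0, 8)]
T = 3
greedy_gain = information_gain(greedy_selection(D, T))
# Brute force over all multisets of size T (120 of them for |D| = 8).
best_gain = max(
    information_gain(list(A))
    for A in itertools.combinations_with_replacement(D, T)
)
```

On small instances like this one greedy is often close to optimal; the $(1 - 1/e)$ factor is a worst-case guarantee.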

24
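What if f is not from a GP?
In practice, f may not be Gaussian
Theorem: Let $\delta \in (0,1)$. Assume that f lies in the RKHS of kernel K. Let $\epsilon_t$ be a noise process that has zero mean conditioned on the history and is bounded by $\sigma$ almost surely. Assume $\|f\|_K^2 \le B$ and pick $\beta_t = 2B + 300 \gamma_t \log^3(t/\delta)$. Then for all $T \ge 1$, we have $R_T \le \sqrt{C_1 T \beta_T \gamma_T}$ with probability at least $1 - \delta$
Frees us from knowing the “true prior”
For f ~ GP, $\|f\|_K = \infty$ almost surely; sample paths are rougher
Therefore neither theorem subsumes the other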

25
Examples of GPs
Exponential kernel $K(x,x') = \exp(-|x-x'|/h)$
[Figure: kernel value against distance |x-x’|, for bandwidths h = 1 and h = 0.3]

26
Multi-armed bandits
At each time pick arm i; get independent payoff with probability $p_i$
Classic model for the exploration-exploitation tradeoff
Extensively studied (Robbins ’52, Gittins ’79)
Typically assume each arm is tried multiple times
[Figure: arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k$]

27
Infinite-armed bandits
In many applications, the number of arms is huge (sponsored search, sensor selection)
Cannot try each arm even once
Assumptions on the payoff function f are essential
[Figure: arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k, \dots, p_\infty$]

28
Thinking about GPs
Kernel function K(x, x’) specifies covariance
Encodes smoothness assumptions
[Figure: f(x) against x, and the distribution P(f(x)) over f(x)]

29
Examples of GPs
Linear kernel with features: $K(x,x') = \Phi(x)^T \Phi(x')$
E.g., $\Phi(x) = [0, x, x^2]$
E.g., $\Phi(x) = \sin(x)$
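
A feature-based kernel like the ones on this slide is just an inner product of feature vectors. The sketch below builds the kernel matrix for the two feature maps named above and checks its rank. The helper `feature_kernel` and the evaluation grid are hypothetical.

```python
import numpy as np

def feature_kernel(a, b, phi):
    """K(x, x') = Phi(x)^T Phi(x') for a user-supplied feature map phi."""
    Phi_a = np.array([np.atleast_1d(phi(x)) for x in a], dtype=float)
    Phi_b = np.array([np.atleast_1d(phi(x)) for x in b], dtype=float)
    return Phi_a @ Phi_b.T

x = np.linspace(-1.0, 1.0, 6)
# Feature map [0, x, x^2] as written on the slide.
K_poly = feature_kernel(x, x, lambda v: [0.0, v, v**2])
# Single feature sin(x).
K_sin = feature_kernel(x, x, np.sin)

# The rank of K is at most the number of non-trivial features.
rank_poly = np.linalg.matrix_rank(K_poly)
rank_sin = np.linalg.matrix_rank(K_sin)
```

Because the kernel matrix has rank at most the number of features, functions drawn from such a GP live in a finite-dimensional space, which is one way to see why linear kernels are comparatively easy to learn.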

30
Assumptions on f
Linear? [Dani et al. ’07]
  Fast convergence; but strong assumption
Lipschitz-continuous (bounded slope) [Kleinberg ’08]
  Very flexible, but
