Published by Salvador Grubb. Modified over 2 years ago.

1
**Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design**

Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland)

2
**Optimizing Noisy, Unknown Functions**

Given: a set of possible inputs D and black-box access to an unknown function f. Want: an adaptive choice of inputs $x_1, x_2, \dots$ from D maximizing the total payoff $\sum_{t=1}^{T} f(x_t)$. Many applications: robotic control [Lizotte et al. ’07], sponsored search [Pande & Olston ’07], clinical trials, … Sampling is expensive. Algorithms are evaluated using “regret”. Goal: minimize the cumulative regret $R_T = \sum_{t=1}^{T} \big(f(x^*) - f(x_t)\big)$, where $x^*$ maximizes f over D.
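To make “regret” concrete, here is a minimal sketch that assumes the standard definition of cumulative regret, $R_T = \sum_{t=1}^{T} (f(x^*) - f(x_t))$ with $x^*$ a maximizer of f over a finite D. The payoff table `f_values` and the index list `chosen` are hypothetical illustration data, not taken from the talk.

```python
import numpy as np

def cumulative_regret(f_values, chosen):
    """Sum over rounds of (best achievable payoff - payoff of the chosen input)."""
    f_values = np.asarray(f_values, dtype=float)
    best = f_values.max()
    return float(np.sum(best - f_values[np.asarray(chosen)]))

f_values = [0.1, 0.7, 0.4]   # hypothetical noise-free payoffs f(x) for three inputs
chosen = [0, 2, 1, 1]        # hypothetical indices picked in rounds 1..4
regret = cumulative_regret(f_values, chosen)
```

An algorithm that keeps picking the best input after some point stops accumulating regret, so $R_T / T$ tends to zero.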

3
**Running example: Noisy Search**

How do we find the hottest point in a building? Many noisy sensors are available, but sampling is expensive. D: the set of sensors; $f(x_i)$: the temperature at the sensor $x_i$ chosen at step i. Observe $y_i = f(x_i) + \varepsilon_i$. Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries.

4
**Key insight: Exploit correlation**

Temperature is spatially correlated. Sampling f(x) at one point x yields information about f(x’) for points x’ near x. In this paper: model the correlation using a Gaussian process (GP) prior for f.

5
**Gaussian Processes to model payoff f**

From the normal distribution (1-D Gaussian) to the multivariate normal (n-D Gaussian) to the Gaussian process (∞-D Gaussian). A Gaussian process (GP) is a normal distribution over functions. Finite marginals are multivariate Gaussians. Closed-form formulae for the Bayesian posterior update exist. A GP is parameterized by its covariance function K(x,x’) = Cov(f(x),f(x’)).
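The closed-form posterior update can be sketched as follows. This is an illustrative sketch under stated assumptions: a zero-mean prior, i.i.d. Gaussian observation noise with variance `noise_var`, and a squared exponential kernel with bandwidth `h`. The function names and the toy data are hypothetical. The posterior mean is $k_*^T (K + \sigma^2 I)^{-1} y$ and the posterior variance is $K(x,x) - k_*^T (K + \sigma^2 I)^{-1} k_*$.

```python
import numpy as np

def se_kernel(a, b, h=0.3):
    """Squared exponential kernel exp(-(x - x')^2 / h^2) between two 1-D point sets."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return np.exp(-((a - b) ** 2) / h ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise_var=0.01, h=0.3):
    """Posterior mean and variance of a zero-mean GP at the query points."""
    K = se_kernel(x_obs, x_obs, h) + noise_var * np.eye(len(x_obs))
    k_star = se_kernel(x_obs, x_query, h)            # shape (n_obs, n_query)
    mean = k_star.T @ np.linalg.solve(K, np.asarray(y_obs, dtype=float))
    v = np.linalg.solve(K, k_star)
    var = 1.0 - np.sum(k_star * v, axis=0)           # K(x, x) = 1 for this kernel
    return mean, np.maximum(var, 0.0)

x_obs = np.array([0.2, 0.5, 0.9])                    # hypothetical sampled inputs
y_obs = np.sin(2 * np.pi * x_obs)                    # hypothetical observations
mean, var = gp_posterior(x_obs, y_obs, np.linspace(0.0, 1.0, 11))
```

The variance shrinks near the observed inputs and stays close to the prior variance far away from them, which is the effect the later slides exploit.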

6
**Example of GPs**

Squared exponential kernel: $K(x,x') = \exp(-(x-x')^2/h^2)$.

[Figure: kernel value against distance |x-x’| and samples from P(f), for bandwidth h = 0.3 and bandwidth h = 0.1.]
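Samples like those in the figure can be drawn as in the sketch below, which assumes a zero-mean GP prior on a grid over [0, 1] and uses a Cholesky factor of the kernel matrix. The grid, the `jitter` term added for numerical stability, and the function names are illustrative assumptions.

```python
import numpy as np

def se_kernel(a, b, h):
    """Squared exponential kernel exp(-(x - x')^2 / h^2)."""
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def sample_prior(x, h, n_samples=3, seed=0, jitter=1e-6):
    """Draw functions from a zero-mean GP prior, evaluated on the grid x."""
    rng = np.random.default_rng(seed)
    K = se_kernel(x, x, h) + jitter * np.eye(len(x))  # jitter keeps K positive definite
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(x), n_samples))
    return (L @ z).T                                  # one sampled function per row

x = np.linspace(0.0, 1.0, 50)
smooth = sample_prior(x, h=0.3)   # larger bandwidth: slowly varying samples
wiggly = sample_prior(x, h=0.1)   # smaller bandwidth: rapidly varying samples
```

A larger bandwidth keeps distant points correlated, so the sampled functions vary slowly; a smaller bandwidth lets them change quickly.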

7
**Gaussian process optimization [e.g., Jones et al ’98]**

Goal: adaptively pick inputs $x_1, x_2, \dots$ so as to optimize f. [Figure: f(x) plotted against x.] Key question: how should we pick samples? So far, only heuristics: Expected Improvement [Močkus et al. ’78] and Most Probable Improvement [Močkus ’89]. These have been used successfully in machine learning [Ginsbourger et al. ’08, Jones ’01, Lizotte et al. ’07], but there are no theoretical guarantees on their regret!

8
**Simple algorithm for GP optimization**

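In each round t: pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$; observe $y_t = f(x_t) + \varepsilon_t$; use Bayes’ rule to get the posterior mean $\mu_t(x)$. Can get stuck in local maxima! [Figure: f(x) plotted against x.]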

9
**Uncertainty sampling**

Pick: $x_t = \arg\max_{x \in D} \sigma^2_{t-1}(x)$. That is equivalent to (greedily) maximizing information gain. Popular objective in Bayesian experimental design (where the goal is pure exploration of f). But… wastes samples by exploring f everywhere! [Figure: f(x) plotted against x.]
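The sketch below illustrates uncertainty sampling on a grid. It assumes a GP with a squared exponential kernel and Gaussian noise of variance `noise_var`, in which case the information gain of a sample set A is $\tfrac{1}{2}\log\det(I + \sigma^{-2} K_A)$. All function names and parameter values are illustrative.

```python
import numpy as np

def se_kernel(a, b, h=0.3):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def posterior_variance(K, picked, noise_var):
    """Posterior variance at every grid point given noisy samples at `picked`."""
    if not picked:
        return np.diag(K).copy()
    K_AA = K[np.ix_(picked, picked)] + noise_var * np.eye(len(picked))
    K_Ax = K[picked, :]
    return np.diag(K) - np.sum(K_Ax * np.linalg.solve(K_AA, K_Ax), axis=0)

def uncertainty_sampling(K, n_rounds, noise_var=0.01):
    """In each round, pick the grid point with the largest posterior variance."""
    picked = []
    for _ in range(n_rounds):
        picked.append(int(np.argmax(posterior_variance(K, picked, noise_var))))
    return picked

def information_gain(K, picked, noise_var=0.01):
    """0.5 * log det(I + K_A / noise_var) for the sampled set A."""
    K_AA = K[np.ix_(picked, picked)]
    _, logdet = np.linalg.slogdet(np.eye(len(picked)) + K_AA / noise_var)
    return 0.5 * logdet

grid = np.linspace(0.0, 1.0, 21)
K = se_kernel(grid, grid)
picked = uncertainty_sampling(K, n_rounds=5)
gain = information_gain(K, picked)
```

Note that the picks never look at the observed values: the rule spreads samples over the whole domain regardless of where f is large.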

10
**Avoiding unnecessary samples**

Key insight: we never need to sample where the Upper Confidence Bound (UCB) is below the best lower bound! [Figure: f(x) plotted against x, with the best lower bound marked.]

11
**Upper Confidence Bound (UCB) Algorithm**

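Pick the input that maximizes the Upper Confidence Bound (UCB): $x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$. This naturally trades off exploration and exploitation; no samples are wasted. Regret bounds exist for the classic setting [Auer ’02] and for linear f [Dani et al. ’07], but none in the GP optimization setting, where UCB is a popular heuristic. How should we choose $\beta_t$? Need theory! [Figure: f(x) plotted against x.]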
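A minimal sketch of the UCB rule with a GP model on a finite grid follows. It assumes a squared exponential kernel, Gaussian noise, and the schedule $\beta_t = 2\log(|D|\, t^2 \pi^2 / (6\delta))$, which is one choice of order log t for finite D; the slide itself does not fix a formula. The `target` function and all parameter values are hypothetical.

```python
import numpy as np

def se_kernel(a, b, h=0.3):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def gp_ucb(f, grid, n_rounds, noise_std=0.1, h=0.3, delta=0.1, seed=0):
    """Pick argmax of mean + sqrt(beta_t) * std in each round; return picked indices."""
    rng = np.random.default_rng(seed)
    K = se_kernel(grid, grid, h)
    prior_var = np.diag(K).copy()
    picked, ys = [], []
    for t in range(1, n_rounds + 1):
        if picked:
            K_AA = K[np.ix_(picked, picked)] + noise_std ** 2 * np.eye(len(picked))
            K_Ax = K[picked, :]
            sol = np.linalg.solve(K_AA, K_Ax)
            mu = sol.T @ np.array(ys)                       # posterior mean
            var = np.maximum(prior_var - np.sum(K_Ax * sol, axis=0), 0.0)
        else:
            mu, var = np.zeros(len(grid)), prior_var        # prior before any data
        # Assumed schedule of order log t for a finite decision set.
        beta = 2.0 * np.log(len(grid) * t ** 2 * np.pi ** 2 / (6.0 * delta))
        i = int(np.argmax(mu + np.sqrt(beta * var)))
        picked.append(i)
        ys.append(f(grid[i]) + noise_std * rng.standard_normal())
    return picked

def target(x):
    """Hypothetical unknown function with a single bump near x = 0.7."""
    return np.exp(-((x - 0.7) ** 2) / 0.02)

grid = np.linspace(0.0, 1.0, 41)
picks = gp_ucb(target, grid, n_rounds=20)
```

Setting the confidence width to zero would recover the greedy posterior-mean rule, and dropping the mean would recover uncertainty sampling; the UCB rule interpolates between the two.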

12
**How well does UCB work?**

Intuitively, performance should depend on how “learnable” the function is. [Figure: two panels, bandwidth h = 0.3 (“easy”) and bandwidth h = 0.1 (“hard”).] The quicker the confidence bands collapse, the easier f is to learn. Key idea: the rate of collapse corresponds to the growth of information gain.

13
**Learnability and information gain**

We show that regret bounds depend on how quickly we can gain information. Mathematically: $\gamma_T = \max_{A \subseteq D,\, |A| \le T} I(f; y_A)$. This establishes a novel connection between GP optimization and Bayesian experimental design. [Figure: $\gamma_T$ plotted against T.]

14
**Performance of optimistic sampling**

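Theorem: If we choose $\beta_t = \Theta(\log t)$, then with high probability, $R_T = O\big(\sqrt{T \beta_T \gamma_T}\big)$. Hereby $\gamma_T = \max_{|A| \le T} I(f; y_A)$ is the maximal information gain due to sampling. The slower $\gamma_T$ grows, the easier f is to learn. Key question: how quickly does $\gamma_T$ grow?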

15
**Learnability and information gain**

[Figure: two panels, “Little diminishing returns” and “Returns diminish fast”.] Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin ’05]. Our bounds depend on the “rate” of diminishment.
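Diminishing returns can be observed numerically with the sketch below. It assumes a GP with a squared exponential kernel and Gaussian noise, for which the marginal information gain of sampling x after t-1 samples is $\tfrac{1}{2}\log(1 + \sigma^{-2}\sigma^2_{t-1}(x))$. The grid, bandwidths and noise level are illustrative choices.

```python
import numpy as np

def se_kernel(x, h):
    d = x[:, None] - x[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def greedy_marginal_gains(K, n_rounds, noise_var=0.01):
    """Marginal information gain of each greedy (max-variance) pick."""
    cov = K.copy()                       # current posterior covariance on the grid
    gains = []
    for _ in range(n_rounds):
        var = np.maximum(np.diag(cov), 0.0)
        i = int(np.argmax(var))
        gains.append(0.5 * np.log(1.0 + var[i] / noise_var))
        # Rank-one update of the posterior covariance after a noisy sample at i.
        cov = cov - np.outer(cov[:, i], cov[i, :]) / (cov[i, i] + noise_var)
    return np.array(gains)

grid = np.linspace(0.0, 1.0, 50)
gains_easy = greedy_marginal_gains(se_kernel(grid, 0.3), n_rounds=10)
gains_hard = greedy_marginal_gains(se_kernel(grid, 0.1), n_rounds=10)
```

Because information gain is submodular, the greedy marginal gains form a non-increasing sequence; how fast they fall depends on the kernel, which is the “rate” the bounds refer to.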

16
**Dealing with high dimensions**

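Theorem: For various popular kernels in d dimensions, we have: Linear: $\gamma_T = O(d \log T)$; Squared-exponential: $\gamma_T = O((\log T)^{d+1})$; Matérn with $\nu > 1$: $\gamma_T = O\big(T^{d(d+1)/(2\nu + d(d+1))} \log T\big)$. Smoothness of f helps battle the curse of dimensionality! Our bounds rely on the submodularity of the information gain.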

17
**What if f is not from a GP?**

In practice, f may not be Gaussian. Theorem: Let f lie in the RKHS of kernel K with $\|f\|_K^2 \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t$ depending on $B$, $\gamma_t$ and the confidence level. Then with high probability, $R_T = O\big(\sqrt{T \beta_T \gamma_T}\big)$. This frees us from knowing the “true prior”. Intuitively, the bound depends on the “complexity” of the function through its RKHS norm.

18
**Experiments: UCB vs. heuristics**

Temperature data: 46 sensors deployed at Intel Research, Berkeley. Data collected for 5 days (1 sample/minute). Want to adaptively find the highest temperature as quickly as possible.

Traffic data: speed data from 357 sensors deployed along highway I-880 South. Collected between 6am and 11am, for one month. Want to find the most congested (lowest speed) area as quickly as possible.

19
**Comparison: UCB vs. heuristics**

GP-UCB compares favorably with existing heuristics.

20
**Conclusions**

First theoretical guarantees and convergence rates for GP optimization. Both the true-prior and the agnostic case are covered. Performance depends on “learnability”, captured by the maximal information gain. Connects GP bandit optimization and experimental design! Performance on real data is comparable to other heuristics.

21
**Relating the two problems**

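Adaptive choice of $x_1, \dots, x_T$ from decision set D. Min regret: $\min_{t \le T} \big(f(x^*) - f(x_t)\big)$. Easy to see that the min regret is at most $R_T / T$. We bound $R_T$; the bound also applies to the min regret.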

22
**Performance of Optimistic sampling**

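Theorem [Srinivas, Krause, Kakade, Seeger]: Let $\delta \in (0,1)$ and let $D \subset \mathbb{R}^d$ be compact and convex. Pick $\beta_t = \Theta(\log t)$. If K satisfies weak regularity conditions, then with probability at least $1 - \delta$ we have $R_T = O\big(\sqrt{T \beta_T \gamma_T}\big)$, where the hidden constant is independent of T and $\gamma_T$ is the maximal information gain due to sampling. First performance guarantee for GP optimization! “It pays to be optimistic.”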

23
**Bounding Information Gain**

Theorem: For finite D, submodularity and the greedy algorithm yield a bound on $\gamma_T$ in terms of the eigenvalues of the covariance matrix and the number of samples allotted to each of them. The greedy algorithm samples eigenvectors. The maximal information gain depends on the spectral properties of the kernel: the faster the eigenvalues decay, the better the performance.
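The role of eigenvalue decay can be illustrated numerically. The sketch below compares the spectra of two kernel matrices on the same grid, a squared exponential kernel and an exponential kernel, and reports how much of the total spectrum lies beyond the ten largest eigenvalues. The grid, the bandwidth and the cutoff of ten are illustrative assumptions, and the sketch does not reproduce the theorem itself.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 50)
dist = np.abs(grid[:, None] - grid[None, :])
h = 0.3

K_se = np.exp(-(dist ** 2) / h ** 2)   # squared exponential kernel (smooth samples)
K_exp = np.exp(-dist / h)              # exponential kernel (rough samples)

def tail_fraction(K, top=10):
    """Fraction of the eigenvalue sum that lies outside the `top` largest eigenvalues."""
    eig = np.sort(np.linalg.eigvalsh(K))[::-1]
    return float(eig[top:].sum() / eig.sum())

tail_se = tail_fraction(K_se)
tail_exp = tail_fraction(K_exp)
```

The smoother kernel concentrates almost all of its spectrum in a few eigenvalues, so few samples capture most of the available information.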

24
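**What if f is not from a GP?**

In practice, f may not be Gaussian. Theorem: Let $\delta \in (0,1)$. Assume that f lies in the RKHS of kernel K. Let $\varepsilon_t$ be a noise process that has zero mean conditioned on the history and is bounded by $\sigma$ almost surely. Assume $\|f\|_K^2 \le B$ and pick $\beta_t$ depending on $B$, $\gamma_t$ and $\delta$. Then for all $T \ge 1$, we have $R_T = O\big(\sqrt{T \beta_T \gamma_T}\big)$ with probability at least $1 - \delta$. This frees us from knowing the “true prior”. For f ~ GP, $\|f\|_K = \infty$ with probability 1; sample paths are rougher than RKHS functions. Therefore neither theorem subsumes the other.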

25
**Examples of GPs**

Exponential kernel: $K(x,x') = \exp(-|x-x'|/h)$.

[Figure: kernel value against distance |x-x’| and samples, for bandwidth h = 1 and bandwidth h = 0.3.]

26
**Multi-armed bandits**

[Figure: k arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k$.] At each time step, pick an arm i and get an independent payoff with probability $p_i$. Classic model for the exploration-exploitation tradeoff. Extensively studied (Robbins ’52, Gittins ’79). Typically assumes each arm is tried multiple times.
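For illustration, the k-armed model can be simulated as below. The sketch uses the UCB1 index rule, a standard strategy for this setting that is swapped in here as an example and is not described on this slide. The payoff probabilities and the number of rounds are hypothetical.

```python
import numpy as np

def ucb1(probs, n_rounds, seed=0):
    """Play a Bernoulli k-armed bandit with UCB1; return how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(probs)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                                   # try every arm once first
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)     # optimism for rarely tried arms
            arm = int(np.argmax(sums / counts + bonus))
        reward = float(rng.random() < probs[arm])         # payoff 1 with probability p_arm
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8], n_rounds=500)
```

The initial pass over all arms is exactly the step that becomes impossible when the number of arms is huge.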

27
**Infinite-armed bandits**

[Figure: arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k, \dots, p_\infty$.] In many applications, the number of arms is huge (sponsored search, sensor selection). We cannot try each arm even once. Assumptions on the payoff function f are essential.

28
**Thinking about GPs**

[Figure: f(x) plotted against x, with the marginal distribution P(f(x)) at a point x.] The kernel function K(x, x’) specifies the covariance and encodes smoothness assumptions.

29
**Examples of GPs**

Linear kernel with features: $K(x,x') = \phi(x)^T \phi(x')$. E.g., $\phi(x) = [0, x, x^2]$. E.g., $\phi(x) = \sin(x)$.
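A feature-based linear kernel can be sketched as follows. The feature map `phi` below, with polynomial features $[1, x, x^2]$, is a hypothetical choice for illustration and is not the one from the slide.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: polynomial features [1, x, x^2]."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

def linear_kernel(a, b):
    """K(x, x') = phi(x)^T phi(x') for all pairs of points in a and b."""
    return phi(a) @ phi(b).T

x = np.linspace(-1.0, 1.0, 20)
K = linear_kernel(x, x)
rank = np.linalg.matrix_rank(K)   # at most the number of features
```

Since the kernel matrix has rank at most the number of features, a GP with this kernel only produces functions that are linear combinations of those features (here, quadratics).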

30
**Assumptions on f**

Linear? [Dani et al. ’07] Fast convergence, but a strong assumption. Lipschitz-continuous (bounded slope)? [Kleinberg ’08] Very flexible, but …
