Published by Salvador Grubb. Modified over 3 years ago.

1
**Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design**

Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland)

…where theory and practice collide

2
**Optimizing Noisy, Unknown Functions**

- Given: a set of possible inputs $D$ and black-box access to an unknown function $f$.
- Want: an adaptive choice of inputs $x_1, x_2, \dots$ from $D$ maximizing $\sum_{t=1}^{T} f(x_t)$.
- Many applications: robotic control [Lizotte et al. ’07], sponsored search [Pande & Olston ’07], clinical trials, …
- Sampling is expensive.
- Algorithms are evaluated using “regret”: $R_T = \sum_{t=1}^{T} \big(f(x^*) - f(x_t)\big)$, where $x^* = \arg\max_{x \in D} f(x)$.
- Goal: minimize the average regret $R_T / T$.
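To make the regret criterion concrete, here is a minimal sketch for a finite input set. The payoff values and the sequence of picks are made-up illustrations, not data from the talk, and `cumulative_regret` is a hypothetical helper name.

```python
import numpy as np

def cumulative_regret(f_values, chosen):
    """R_T = sum_t (f(x*) - f(x_t)) for a finite set D indexed 0..n-1."""
    f_values = np.asarray(f_values, dtype=float)
    best = f_values.max()                      # f(x*), the best achievable payoff
    return float(np.sum(best - f_values[np.asarray(chosen)]))

f_values = [0.2, 1.0, 0.5]   # hypothetical payoffs f(x) for three inputs
chosen = [0, 2, 1, 1]        # hypothetical picks x_1, ..., x_4 (indices into D)
R_T = cumulative_regret(f_values, chosen)
average_regret = R_T / len(chosen)
print(R_T, average_regret)
```

An algorithm has "no regret" when the average regret goes to zero as the number of rounds grows.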

3
**Running example: Noisy Search**

- How to find the hottest point in a building?
- Many noisy sensors are available, but sampling is expensive.
- $D$: set of sensors; $f(x_i)$: temperature at the sensor $x_i$ chosen at step $i$.
- Observe $y_i = f(x_i) + \epsilon_i$.
- Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries.

4
**Key insight: Exploit correlation**

- Temperature is spatially correlated.
- Sampling $f(x)$ at one point $x$ yields information about $f(x')$ for points $x'$ near $x$.
- In this paper: model correlation using a Gaussian process (GP) prior for $f$.

5
**Gaussian Processes to model payoff f**

- Normal distribution (1-D Gaussian) → multivariate normal (n-D Gaussian) → Gaussian process (∞-D Gaussian).
- Gaussian process (GP) = normal distribution over functions.
- Finite marginals are multivariate Gaussians.
- Closed-form formulae for the Bayesian posterior update exist.
- Parameterized by the covariance function $K(x,x') = \mathrm{Cov}(f(x), f(x'))$.
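The closed-form posterior update can be sketched as follows. This is an illustration under assumptions: a zero-mean prior, i.i.d. Gaussian noise with variance `noise_var`, and a squared-exponential covariance chosen only for the example. The function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """K(x, x') = exp(-(x - x')^2 / h^2) for 1-D inputs (illustrative choice)."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return np.exp(-((a - b) ** 2) / h ** 2)

def gp_posterior(x_obs, y_obs, x_query, kernel, noise_var=0.01):
    """Posterior mean and variance of f at x_query given noisy observations."""
    K = kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    k_star = kernel(x_obs, x_query)                  # shape (n_obs, n_query)
    mean = k_star.T @ np.linalg.solve(K, np.asarray(y_obs, dtype=float))
    v = np.linalg.solve(K, k_star)
    var = np.diag(kernel(x_query, x_query)) - np.sum(k_star * v, axis=0)
    return mean, np.maximum(var, 0.0)                # clip tiny negative round-off

# One observation y = 1 at x = 0.5; query the same point and a far-away point.
mean, var = gp_posterior([0.5], [1.0], [0.5, 5.0], sq_exp_kernel)
print(mean, var)
```

Near the observation the variance collapses to roughly the noise level; far away, the posterior falls back to the prior (mean 0, variance 1).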

6
**Example of GPs**

Squared exponential kernel: $K(x,x') = \exp(-(x-x')^2/h^2)$

[Figure: kernel value against distance $|x-x'|$, and samples from $P(f)$ for bandwidth $h=0.3$ and bandwidth $h=0.1$]
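Samples like those in the figure can be drawn with a short sketch. The grid, seed, sample count, and the small `jitter` added for numerical stability are assumptions of this example; `sample_prior` is a hypothetical helper.

```python
import numpy as np

def sq_exp_kernel(x, h):
    """Gram matrix of K(x, x') = exp(-(x - x')^2 / h^2) on a 1-D grid."""
    diff = x[:, None] - x[None, :]
    return np.exp(-(diff ** 2) / h ** 2)

def sample_prior(x, h, n_samples=3, seed=0, jitter=1e-6):
    """Draw functions from a zero-mean GP prior, evaluated on the grid x."""
    rng = np.random.default_rng(seed)
    K = sq_exp_kernel(x, h) + jitter * np.eye(len(x))   # jitter keeps K positive definite
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(x), n_samples))).T

x = np.linspace(0.0, 1.0, 50)
smooth = sample_prior(x, h=0.3)   # wide bandwidth: slowly varying samples
wiggly = sample_prior(x, h=0.1)   # narrow bandwidth: rapidly varying samples
print(smooth.shape, wiggly.shape)
```

Each row is one sampled function; plotting the rows against `x` shows the smoothness difference between the two bandwidths.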

7
**Gaussian process optimization [e.g., Jones et al ’98]**

- Goal: adaptively pick inputs $x_1, x_2, \dots$ such that the average regret $\frac{1}{T}\sum_{t=1}^{T}\big(f(x^*) - f(x_t)\big) \to 0$.
- Key question: how should we pick samples?
- So far, only heuristics: Expected Improvement [Močkus et al. ’78], Most Probable Improvement [Močkus ’89].
- Used successfully in machine learning [Ginsbourger et al. ’08, Jones ’01, Lizotte et al. ’07].
- No theoretical guarantees on their regret!

8
**Simple algorithm for GP optimization**

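In each round $t$ do:

- Pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$
- Observe $y_t = f(x_t) + \epsilon_t$
- Use Bayes’ rule to get the posterior mean $\mu_t(x) = \mathbb{E}[f(x) \mid y_1, \dots, y_t]$

Can get stuck in local maxima!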

9
**Uncertainty sampling**

- Pick: $x_t = \arg\max_{x \in D} \sigma^2_{t-1}(x)$, where $\sigma^2_{t-1}(x)$ is the posterior variance at $x$.
- That’s equivalent to (greedily) maximizing information gain.
- Popular objective in Bayesian experimental design (where the goal is pure exploration of $f$).
- But… wastes samples by exploring $f$ everywhere!
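A minimal sketch of uncertainty sampling on a finite grid, assuming a squared-exponential kernel and Gaussian noise (both illustrative choices). The GP posterior variance does not depend on the observed values, so the rule needs no access to $f$ at all, which is exactly why it explores everywhere.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.2):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / h ** 2)

def uncertainty_sampling(D, T, noise_var=0.01, h=0.2):
    """Pick, in each round, the input with the largest posterior variance."""
    K = sq_exp_kernel(D, D, h)
    picked = []
    var = np.diag(K).copy()                  # prior variance K(x, x)
    for _ in range(T):
        picked.append(int(np.argmax(var)))   # ties resolve to the lowest index
        K_obs = K[np.ix_(picked, picked)] + noise_var * np.eye(len(picked))
        K_cross = K[picked, :]               # shape (t, |D|)
        var = np.diag(K) - np.sum(K_cross * np.linalg.solve(K_obs, K_cross), axis=0)
    return picked

D = np.linspace(0.0, 1.0, 11)
picked = uncertainty_sampling(D, T=3)
print(picked)
```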

10
**Avoiding unnecessary samples**

[Figure: posterior confidence bands around $f(x)$, with the best lower bound marked]

Key insight: never need to sample where the Upper Confidence Bound (UCB) < best lower bound!

11
**Upper Confidence Bound (UCB) Algorithm**

- Pick the input that maximizes the Upper Confidence Bound (UCB), a popular heuristic: $x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$
- Naturally trades off explore and exploit; no samples wasted.
- Regret bounds: classic [Auer ’02] & linear $f$ [Dani et al. ’07]. But none in the GP optimization setting!
- How should we choose $\beta_t$? Need theory!
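The UCB rule can be sketched on a finite grid as below. Assumptions of this example, none of them taken from the slides: a squared-exponential kernel with unit prior variance, Gaussian noise, a toy objective, and a logarithmic schedule $\beta_t = 2\log(|D|\,t^2\pi^2/(6\delta))$ of the kind used for finite $D$; treat its constants as illustrative.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.2):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / h ** 2)

def gp_ucb(f, D, T, noise_std=0.1, delta=0.1, seed=0):
    """Run T rounds of the UCB rule over the finite grid D; return picked indices."""
    rng = np.random.default_rng(seed)
    K = sq_exp_kernel(D, D)
    picked, ys = [], []
    mu = np.zeros(len(D))                    # prior mean
    var = np.ones(len(D))                    # prior variance K(x, x) = 1
    for t in range(1, T + 1):
        beta_t = 2.0 * np.log(len(D) * t ** 2 * np.pi ** 2 / (6.0 * delta))
        i = int(np.argmax(mu + np.sqrt(beta_t * var)))   # maximize the UCB
        picked.append(i)
        ys.append(f(D[i]) + noise_std * rng.standard_normal())
        # Closed-form GP posterior given all observations so far.
        K_obs = K[np.ix_(picked, picked)] + noise_std ** 2 * np.eye(t)
        K_cross = K[picked, :]               # shape (t, |D|)
        sol = np.linalg.solve(K_obs, K_cross)
        mu = sol.T @ np.array(ys)
        var = np.maximum(1.0 - np.sum(K_cross * sol, axis=0), 0.0)
    return picked

D = np.linspace(0.0, 1.0, 21)
picked = gp_ucb(lambda x: -(x - 0.7) ** 2, D, T=30)   # toy objective, maximum at 0.7
print(D[picked])
```

Setting `beta_t` to zero recovers the greedy posterior-mean rule, while a very large `beta_t` behaves like uncertainty sampling, so the schedule controls the explore/exploit balance.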

12
**How well does UCB work?**

- Intuitively, performance should depend on how “learnable” the function is.
- Bandwidth $h=0.3$: “easy”; bandwidth $h=0.1$: “hard”.
- The quicker the confidence bands collapse, the easier to learn.
- Key idea: rate of collapse ⇔ growth of information gain.

13
**Learnability and information gain**

- We show that regret bounds depend on how quickly we can gain information.
- Mathematically: $\gamma_T = \max_{A \subseteq D,\ |A| \le T} I(f; y_A)$, the maximal information gain from $T$ samples.
- Establishes a novel connection between GP optimization and Bayesian experimental design.
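For a GP with Gaussian noise of variance $\sigma^2$, the information gain of a sample set $A$ has the closed form $I(f; y_A) = \frac{1}{2}\log\det(I + \sigma^{-2} K_A)$. The sketch below evaluates it on evenly spaced points, which only gives a lower bound on the maximum over all sets of that size. The kernel, noise level, and grid are assumptions of the example.

```python
import numpy as np

def info_gain(x, h, noise_var=0.01):
    """I(f; y_A) = 0.5 * log det(I + K_A / noise_var) for the inputs in x."""
    diff = x[:, None] - x[None, :]
    K_A = np.exp(-(diff ** 2) / h ** 2)          # squared-exponential Gram matrix
    _, logdet = np.linalg.slogdet(np.eye(len(x)) + K_A / noise_var)
    return 0.5 * logdet

for T in (5, 10, 20, 40):
    x = np.linspace(0.0, 1.0, T)
    print(T, round(info_gain(x, h=0.3), 2), round(info_gain(x, h=0.1), 2))
```

The narrower bandwidth keeps gaining information as points are added, matching the intuition that rougher functions are harder to learn.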

14
**Performance of optimistic sampling**

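Theorem: If we choose $\beta_t = \Theta(\log t)$, then with high probability, $\frac{1}{T}\sum_{t=1}^{T}\big(f(x^*) - f(x_t)\big) = O^*\big(\sqrt{\gamma_T / T}\big)$. Hereby $\gamma_T = \max_{|A| \le T} I(f; y_A)$ is the maximal information gain due to sampling.

- The slower $\gamma_T$ grows, the easier $f$ is to learn.
- Key question: how quickly does $\gamma_T$ grow?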

15
**Learnability and information gain**

- Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin ’05].
- Our bounds depend on the “rate” of diminishment.

[Figure: two information-gain curves, one labeled “Little diminishing returns” and one labeled “Returns diminish fast”]

16
**Dealing with high dimensions**

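Theorem: For various popular kernels, we have:

- Linear: $\gamma_T = O(d \log T)$
- Squared-exponential: $\gamma_T = O\big((\log T)^{d+1}\big)$
- Matérn with $\nu > 1$: $\gamma_T = O\big(T^{d(d+1)/(2\nu + d(d+1))} \log T\big)$

Smoothness of $f$ helps battle the curse of dimensionality! Our bounds rely on submodularity of the information gain $I(f; y_A)$.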

17
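**What if f is not from a GP?**

In practice, $f$ may not be Gaussian.

Theorem: Let $f$ lie in the RKHS of kernel $K$ with $\|f\|_K^2 \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t = 2B + 300\,\gamma_t \log^3(t/\delta)$. Then with high probability (at least $1-\delta$), the cumulative regret satisfies $R_T = O^*\big(\sqrt{T}\,(B\sqrt{\gamma_T} + \gamma_T)\big)$.

- Frees us from knowing the “true prior”.
- Intuitively, the bound depends on the “complexity” of the function through its RKHS norm.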

18
**Experiments: UCB vs. heuristics**

Temperature data

- 46 sensors deployed at Intel Research, Berkeley
- Collected data for 5 days (1 sample/minute)
- Want to adaptively find the highest temperature as quickly as possible

Traffic data

- Speed data from 357 sensors deployed along highway I-880 South
- Collected during 6am-11am, for one month
- Want to find the most congested (lowest speed) area as quickly as possible

19
**Comparison: UCB vs. heuristics**

GP-UCB compares favorably with existing heuristics.

20
**Conclusions**

- First theoretical guarantees and convergence rates for GP optimization
- Both true prior and agnostic case covered
- Performance depends on “learnability”, captured by maximal information gain
- Connects GP Bandit Optimization & Experimental Design!
- Performance on real data comparable to other heuristics

21
**Relating the two problems**

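- Adaptive choice of $x_1, \dots, x_T$ from decision set $D$.
- Min regret: $R_T = \sum_{t=1}^{T} \big(f(x^*) - f(x_t)\big)$.
- Easy to see that $f(x^*) - \max_{t \le T} f(x_t) \le R_T / T$.
- We bound $R_T$; the bound also applies to $f(x^*) - \max_{t \le T} f(x_t)$.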

22
**Performance of Optimistic sampling**

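Theorem [Srinivas, Krause, Kakade, Seeger]: Let $d \in \mathbb{N}$, $r > 0$, and let $D \subset [0, r]^d$ be compact and convex. Pick $\delta \in (0,1)$ and $\beta_t = \Theta(\log(t/\delta))$. If $K$ satisfies weak regularity conditions, we have, with probability at least $1-\delta$, $R_T \le \sqrt{C\,T\,\beta_T\,\gamma_T} + 2$ for all $T \ge 1$, where $C$ is independent of $T$ and $\gamma_T$ is the maximal information gain due to sampling.

- First performance guarantee for GP optimization!
- “It pays to be optimistic”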

23
**Bounding Information Gain**

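Theorem: For finite $D$, submodularity and the greedy algorithm yield $\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{m_1, \dots, m_{|D|}} \sum_{t=1}^{|D|} \log\big(1 + \sigma^{-2} m_t \lambda_t\big)$, subject to $\sum_t m_t = T$.

- $m_t$: number of samples of eigenvector $t$; $\lambda_t$: eigenvalues of the covariance matrix
- The greedy algorithm samples eigenvectors.
- Maximal info-gain depends on spectral properties of the kernel.
- The faster the eigenvalues decay, the better the performance.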

24
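**What if f is not from a GP?**

In practice, $f$ may not be Gaussian.

Theorem: Let $\delta \in (0,1)$. Assume that $f$ lies in the RKHS of kernel $K$. Let $\epsilon_t$ be a noise process that has zero mean conditioned on history and is bounded by $\sigma$ almost surely. Assume $\|f\|_K^2 \le B$ and pick $\beta_t = 2B + 300\,\gamma_t \log^3(t/\delta)$. Then for all $T \ge 1$, we have $R_T \le \sqrt{C_1 T \beta_T \gamma_T}$ with probability at least $1-\delta$, where $R_T$ is the cumulative regret and $C_1$ is a constant independent of $T$.

- Frees us from knowing the “true prior”.
- For $f \sim \mathrm{GP}$, $\|f\|_K = \infty$ almost surely; sample paths are rougher.
- Therefore neither theorem subsumes the other.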

25
**Examples of GPs**

Exponential kernel: $K(x,x') = \exp(-|x-x'|/h)$

[Figure: kernel value against distance $|x-x'|$, and samples for bandwidth $h=1$ and bandwidth $h=0.3$]

26
**Multi-armed bandits**

[Figure: arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k$]

- At each time pick arm $i$; get independent payoff with probability $p_i$.
- Classic model for the exploration-exploitation tradeoff.
- Extensively studied (Robbins ’52, Gittins ’79).
- Typically assume each arm is tried multiple times.
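For comparison with the GP setting, the classic $k$-armed model can be simulated in a few lines. The sketch below uses the standard UCB1 index (empirical mean plus $\sqrt{2 \ln t / n_i}$) with Bernoulli payoffs. The arm probabilities and horizon are made up for illustration, and the slides do not specify this particular variant.

```python
import math
import random

def ucb1(probs, T, seed=0):
    """Play a Bernoulli bandit for T rounds with UCB1; return pulls per arm."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k          # n_i: number of times arm i was pulled
    sums = [0.0] * k          # total payoff collected from arm i
    for t in range(1, T + 1):
        if t <= k:
            i = t - 1         # initialization: try each arm once
        else:
            i = max(range(k), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t) / counts[j]))
        reward = 1.0 if rng.random() < probs[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = ucb1([0.1, 0.9], T=2000)   # hypothetical payoff probabilities p_1, p_2
print(counts)
```

Note the initialization step that pulls every arm once, which is the assumption that breaks down when the number of arms is huge.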

27
**Infinite-armed bandits**

[Figure: arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k, \dots, p_\infty$]

- In many applications, the number of arms is huge (sponsored search, sensor selection).
- Cannot try each arm even once.
- Assumptions on the payoff function $f$ are essential.

28
**Thinking about GPs**

- Kernel function $K(x, x')$ specifies covariance.
- Encodes smoothness assumptions.

[Figure: samples of $f(x)$ against $x$, with the marginal distribution $P(f(x))$]

29
**Examples of GPs**

Linear kernel with features: $K(x,x') = \Phi(x)^T \Phi(x')$

- E.g., $\Phi(x) = [0, x, x^2]$
- E.g., $\Phi(x) = \sin(x)$

30
**Assumptions on f**

- Linear? [Dani et al. ’07]: fast convergence, but strong assumption.
- Lipschitz-continuous (bounded slope) [Kleinberg ’08]: very flexible, but …
