# Niranjan Srinivas Andreas Krause Caltech Caltech


Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design
Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland)

Optimizing Noisy, Unknown Functions
Given: a set of possible inputs D and black-box access to an unknown function f. Want: an adaptive choice of inputs $x_1, x_2, \dots$ from D maximizing the total payoff $\sum_{t=1}^{T} f(x_t)$. Many applications: robotic control [Lizotte et al. ’07], sponsored search [Pande & Olston ’07], clinical trials, … Sampling is expensive. Algorithms are evaluated using “regret”. Goal: minimize the cumulative regret $R_T = \sum_{t=1}^{T} \left( f(x^*) - f(x_t) \right)$, where $x^* = \arg\max_{x \in D} f(x)$.
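As a small illustration of how regret is tallied, here is a sketch for a finite input set. The function name `cumulative_regret` and the toy values are assumptions made for this example; in a real problem f is unknown, so regret can only be computed in hindsight.

```python
import numpy as np

def cumulative_regret(f_values, chosen):
    """Running sum of f(x*) - f(x_t) over the rounds, for a finite input set.

    f_values: f(x) for every x in D (known here only for evaluation).
    chosen:   indices of the inputs x_1, ..., x_T picked by an algorithm.
    """
    f_values = np.asarray(f_values, dtype=float)
    instantaneous = f_values.max() - f_values[np.asarray(chosen)]
    return np.cumsum(instantaneous)

# Hypothetical toy problem: 4 inputs, the picks home in on the best one (index 2).
f_toy = [0.1, 0.5, 0.9, 0.3]
picks = [0, 1, 2, 2]
R = cumulative_regret(f_toy, picks)
print(R)                    # cumulative regret after each round
print(R[-1] / len(picks))   # average regret after T rounds
```

Once the algorithm keeps picking the maximizer, the cumulative regret stops growing and the average regret shrinks toward zero.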

Running example: Noisy Search
How can we find the hottest point in a building? Many noisy sensors are available, but sampling is expensive. D: set of sensors; $f(x_i)$: temperature at the sensor $x_i$ chosen at step i. Observe $y_i = f(x_i) + \epsilon_i$. Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries.

Key insight: Exploit correlation
Temperature is spatially correlated. Sampling f(x) at one point x yields information about f(x’) for points x’ near x. In this paper: model the correlation using a Gaussian process (GP) prior for f.

Gaussian Processes to model payoff f
Normal distribution (1-D Gaussian); multivariate normal (n-D Gaussian); Gaussian process (∞-D Gaussian). A Gaussian process (GP) is a normal distribution over functions. Finite marginals are multivariate Gaussians. Closed-form formulae for the Bayesian posterior update exist. Parameterized by the covariance function K(x,x’) = Cov(f(x),f(x’)).
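The closed-form posterior update can be sketched in a few lines. This is a minimal illustration, assuming a zero-mean GP, a squared exponential kernel on 1-D inputs, and Gaussian observation noise; the names `sq_exp_kernel` and `gp_posterior` and the default parameters are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """Squared exponential kernel K(x, x') = exp(-(x - x')^2 / h^2) on 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise_var=0.01, h=0.3):
    """Posterior mean and variance of a zero-mean GP at the points x_query."""
    K = sq_exp_kernel(x_obs, x_obs, h) + noise_var * np.eye(len(x_obs))
    k_star = sq_exp_kernel(x_obs, x_query, h)           # shape (n_obs, n_query)
    mean = k_star.T @ np.linalg.solve(K, np.asarray(y_obs, dtype=float))
    # Prior variance K(x, x) is 1 for this kernel; subtract what the data explains.
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

# One observation at x = 0.5; query the same point and a far-away point.
mean, var = gp_posterior([0.5], [1.0], [0.5, 5.0])
print(mean, var)
```

Near the observation the variance drops almost to the noise level, while far away the posterior falls back to the prior.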

Example of GPs
Squared exponential kernel $K(x,x') = \exp(-(x-x')^2/h^2)$. [Figure: samples from P(f) for bandwidth h = 0.3 and bandwidth h = 0.1; axis: distance |x-x’|.]
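To reproduce the flavor of the lost figure, the sketch below draws functions from a zero-mean GP prior with the squared exponential kernel at both bandwidths. The grid, the jitter term, the seed, and the roughness measure are assumptions chosen for illustration.

```python
import numpy as np

def sq_exp_kernel(x, h):
    """K(x, x') = exp(-(x - x')^2 / h^2) evaluated on a 1-D grid x."""
    diff = x[:, None] - x[None, :]
    return np.exp(-(diff ** 2) / h ** 2)

def sample_gp_prior(x, h, n_samples=20, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP prior evaluated on the grid x."""
    rng = np.random.default_rng(seed)
    K = sq_exp_kernel(x, h) + jitter * np.eye(len(x))   # jitter keeps Cholesky stable
    L = np.linalg.cholesky(K)
    return (L @ rng.standard_normal((len(x), n_samples))).T

def roughness(samples):
    """Mean absolute step between neighbouring grid points; larger means wigglier."""
    return float(np.mean(np.abs(np.diff(samples, axis=1))))

x = np.linspace(0.0, 1.0, 100)
smooth = sample_gp_prior(x, h=0.3)   # one sampled function per row
rough = sample_gp_prior(x, h=0.1)
print(roughness(smooth), roughness(rough))
```

The smaller bandwidth yields visibly wigglier sample paths, which is the sense in which h = 0.1 is the harder case.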

Gaussian process optimization [e.g., Jones et al ’98]
Goal: adaptively pick inputs $x_1, x_2, \dots$ such that the average regret vanishes, $\frac{1}{T} \sum_{t=1}^{T} \left( f(x^*) - f(x_t) \right) \to 0$. Key question: how should we pick samples? So far, only heuristics: Expected Improvement [Močkus et al. ’78], Most Probable Improvement [Močkus ’89]. Used successfully in machine learning [Ginsbourger et al. ’08, Jones ’01, Lizotte et al. ’07]. No theoretical guarantees on their regret!

Simple algorithm for GP optimization
In each round t do: pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$; observe $y_t = f(x_t) + \epsilon_t$; use Bayes’ rule to get the posterior mean $\mu_t(x)$. Can get stuck in local maxima!

Uncertainty sampling
Pick: $x_t = \arg\max_{x \in D} \sigma_{t-1}^2(x)$. That’s equivalent to (greedily) maximizing information gain. Popular objective in Bayesian experimental design (where the goal is pure exploration of f). But… wastes samples by exploring f everywhere!

Avoiding unnecessary samples
Key insight: never need to sample where the Upper Confidence Bound (UCB) < best lower bound! [Figure: f(x) against x, with the best lower bound marked.]

Upper Confidence Bound (UCB) Algorithm
Pick the input that maximizes the Upper Confidence Bound (UCB), a popular heuristic: $x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$. Naturally trades off exploration and exploitation; no samples wasted. Regret bounds: classic [Auer ’02] and linear f [Dani et al. ’07]. But none in the GP optimization setting! How should we choose $\beta_t$? Need theory!
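A minimal sketch of the UCB rule on a finite input set follows. It assumes a zero-mean GP with a squared exponential kernel, Gaussian noise, and the schedule $\beta_t = 2 \log(|D| t^2 \pi^2 / (6\delta))$, which is one choice that grows like log t; the function names, the test function, and all parameter values are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """K(x, x') = exp(-(x - x')^2 / h^2) on 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h ** 2)

def gp_ucb(f, D, T=30, noise_std=0.1, h=0.3, delta=0.1, seed=0):
    """Run a GP-UCB sketch on a finite 1-D input set D; returns the chosen indices."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    K_DD = sq_exp_kernel(D, D, h)
    chosen, ys = [], []
    for t in range(1, T + 1):
        if chosen:
            K = K_DD[np.ix_(chosen, chosen)] + noise_std ** 2 * np.eye(len(chosen))
            k_star = K_DD[chosen, :]                      # shape (t - 1, |D|)
            mu = k_star.T @ np.linalg.solve(K, np.array(ys))
            var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
        else:
            mu, var = np.zeros(len(D)), np.ones(len(D))   # prior: zero mean, unit variance
        sigma = np.sqrt(np.clip(var, 0.0, None))
        # Assumed exploration schedule for finite D; grows like log t.
        beta_t = 2.0 * np.log(len(D) * t ** 2 * np.pi ** 2 / (6.0 * delta))
        idx = int(np.argmax(mu + np.sqrt(beta_t) * sigma))
        chosen.append(idx)
        ys.append(f(D[idx]) + noise_std * rng.standard_normal())
    return chosen

D = np.linspace(0.0, 1.0, 50)
f = lambda x: np.sin(6.0 * x)        # hypothetical test function
picks = gp_ucb(f, D)
print(D[picks])                      # inputs queried in each round
```

Early rounds spread out because the variance term dominates; later rounds concentrate where the posterior mean is high.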

How well does UCB work?
Intuitively, performance should depend on how “learnable” the function is. [Figure: bandwidth h = 0.3 (“easy”) versus bandwidth h = 0.1 (“hard”).] The quicker the confidence bands collapse, the easier f is to learn. Key idea: rate of collapse ⇔ growth of information gain.

Learnability and information gain
We show that regret bounds depend on how quickly we can gain information. Mathematically: $\gamma_T = \max_{A \subseteq D,\, |A| \le T} I(f; y_A)$, the maximal information gain from T samples. This establishes a novel connection between GP optimization and Bayesian experimental design. [Figure: $\gamma_T$ plotted against T.]
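For a GP with Gaussian noise of variance $\sigma^2$, the information gain of a set A has the closed form $I(f; y_A) = \frac{1}{2} \log\det(I + \sigma^{-2} K_A)$. The sketch below evaluates it on two tiny hand-built kernel matrices; the function name and the example matrices are assumptions for illustration.

```python
import numpy as np

def information_gain(K_A, noise_var):
    """I(f; y_A) = 0.5 * log det(I + K_A / noise_var) for a GP with Gaussian noise."""
    n = K_A.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K_A / noise_var)
    return 0.5 * logdet

# Two uncorrelated points versus the same point measured twice.
gain_far = information_gain(np.eye(2), noise_var=1.0)
gain_same = information_gain(np.ones((2, 2)), noise_var=1.0)
print(gain_far, gain_same)
```

Measuring two uncorrelated points is more informative than measuring one point twice, which is why correlation structure controls how fast information can be gained.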

Performance of optimistic sampling
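Theorem: if we choose $\beta_t = \Theta(\log t)$, then with high probability $R_T = \mathcal{O}^*\!\left(\sqrt{T \gamma_T}\right)$, where $\mathcal{O}^*$ hides logarithmic factors. Hereby $\gamma_T = \max_{A \subseteq D,\, |A| \le T} I(f; y_A)$ is the maximal information gain due to sampling! The slower $\gamma_T$ grows, the easier f is to learn. Key question: how quickly does $\gamma_T$ grow?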

Learnability and information gain
Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin ’05]. Our bounds depend on the “rate” of diminishment. [Figure: two panels, “little diminishing returns” and “returns diminish fast”.]
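Diminishing returns can be observed numerically. The sketch below greedily picks the input with the largest posterior variance (the uncertainty sampling rule) and records the marginal information gain of each pick, $\frac{1}{2}\log(1 + \sigma^{-2}\sigma_{t-1}^2(x_t))$. The grid, kernel bandwidth, and noise level are assumptions for illustration.

```python
import numpy as np

def greedy_information_gain(K, noise_var, T):
    """Greedily pick T inputs from a finite set with prior kernel matrix K.

    Returns the chosen indices and the marginal information gain of each pick.
    """
    cov = K.copy()                       # posterior covariance over all of D
    chosen, gains = [], []
    for _ in range(T):
        var = np.diag(cov)
        i = int(np.argmax(var))          # uncertainty sampling step
        gains.append(0.5 * np.log1p(var[i] / noise_var))
        chosen.append(i)
        # Rank-one posterior update after observing y_i = f(x_i) + noise.
        cov = cov - np.outer(cov[:, i], cov[:, i]) / (var[i] + noise_var)
    return chosen, np.array(gains)

x = np.linspace(0.0, 1.0, 30)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.3 ** 2)
chosen, gains = greedy_information_gain(K, noise_var=0.01, T=10)
print(np.round(gains, 3))                # marginal gains shrink from pick to pick
```

The marginal gains add up to the information gain of the whole chosen set, so how fast they shrink determines how slowly the total grows.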

Dealing with high dimensions
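Theorem: for various popular kernels, we have: linear: $\gamma_T = \mathcal{O}(d \log T)$; squared-exponential: $\gamma_T = \mathcal{O}((\log T)^{d+1})$; Matérn with $\nu > 1$: $\gamma_T = \mathcal{O}\!\left(T^{d(d+1)/(2\nu + d(d+1))} \log T\right)$. Smoothness of f helps battle the curse of dimensionality! Our bounds rely on submodularity of the information gain.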

What if f is not from a GP?
In practice, f may not be Gaussian. Theorem: let f lie in the RKHS of kernel K with $\|f\|_K^2 \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t = 2B + 300 \gamma_t \log^3(t/\delta)$. Then with high probability, $R_T = \mathcal{O}^*\!\left(\sqrt{T}\,(B\sqrt{\gamma_T} + \gamma_T)\right)$. This frees us from knowing the “true prior”. Intuitively, the bound depends on the “complexity” of the function through its RKHS norm.

Experiments: UCB vs. heuristics
Temperature data: 46 sensors deployed at Intel Research, Berkeley. Data collected for 5 days (1 sample/minute). Want to adaptively find the highest temperature as quickly as possible.
Traffic data: speed data from 357 sensors deployed along highway I-880 South. Collected from 6am to 11am, for one month. Want to find the most congested (lowest speed) area as quickly as possible.

Comparison: UCB vs. heuristics
GP-UCB compares favorably with existing heuristics.

Conclusions
First theoretical guarantees and convergence rates for GP optimization. Both the true-prior and the agnostic case are covered. Performance depends on “learnability”, captured by the maximal information gain. Connects GP bandit optimization and experimental design! Performance on real data is comparable to other heuristics.

Relating the two problems
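Adaptive choice of $x_1, x_2, \dots$ from decision set D. Min regret: $\min_{t \le T} \left( f(x^*) - f(x_t) \right)$. Easy to see that $\min_{t \le T} \left( f(x^*) - f(x_t) \right) \le R_T / T$, where $R_T = \sum_{t=1}^{T} \left( f(x^*) - f(x_t) \right)$ is the cumulative regret. We bound $R_T$; the bound also applies to the min regret.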
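The step that links the two problems is the elementary fact that the smallest of T nonnegative numbers is at most their average. Writing $r_t$ for the regret in round t (notation assumed here for brevity):

```latex
\min_{t \le T} r_t \;\le\; \frac{1}{T} \sum_{t=1}^{T} r_t \;=\; \frac{R_T}{T},
\qquad r_t = f(x^*) - f(x_t) \ge 0 .
```

So any bound showing that the average regret goes to zero also shows that the best point sampled so far approaches the maximum of f.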

Performance of Optimistic sampling
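Theorem [Srinivas, Krause, Kakade, Seeger]: let $D \subset \mathbb{R}^d$ be compact and convex. Pick $\delta \in (0,1)$ and $\beta_t = \Theta(\log(t/\delta))$. If K satisfies weak regularity conditions, we have $\Pr\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} + 2 \;\; \forall T \ge 1 \} \ge 1 - \delta$, where $C_1$ is independent of T. $\gamma_T$ is the maximal information gain due to sampling. First performance guarantee for GP optimization! “It pays to be optimistic.”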

Bounding Information Gain
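Theorem: for finite D, submodularity and the greedy algorithm yield $\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{m_1, \dots, m_{|D|}} \sum_{t=1}^{|D|} \log\left(1 + \sigma^{-2} m_t \hat\lambda_t\right)$ subject to $\sum_t m_t = T$. Here $m_t$ is the number of samples assigned to eigenvalue $\hat\lambda_t$ of the covariance matrix. Greedy algorithm samples eigenvectors. Maximal information gain depends on spectral properties of the kernel. The faster the eigenvalues decay, the better the performance.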
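The role of spectral decay can be checked numerically. The sketch below compares the eigenvalues of squared exponential kernel matrices at two bandwidths on a 1-D grid; the grid size and the relative threshold used to count significant eigenvalues are assumptions made for this example.

```python
import numpy as np

def kernel_spectrum(h, n=100):
    """Eigenvalues (descending) of the squared exponential kernel matrix on [0, 1]."""
    x = np.linspace(0.0, 1.0, n).reshape(-1, 1)
    K = np.exp(-((x - x.T) ** 2) / h ** 2)
    eig = np.linalg.eigvalsh(K)[::-1]        # eigvalsh returns ascending order
    return np.clip(eig, 0.0, None)           # clip tiny negative round-off

def effective_rank(eig, threshold=1e-6):
    """Number of eigenvalues above a relative threshold of the largest one."""
    return int(np.sum(eig > threshold * eig[0]))

eig_smooth = kernel_spectrum(h=0.3)
eig_rough = kernel_spectrum(h=0.1)
print(effective_rank(eig_smooth), effective_rank(eig_rough))
```

The wider bandwidth concentrates the spectrum in fewer eigenvalues, so fewer well-chosen samples capture most of the uncertainty.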

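What if f is not from a GP?
In practice, f may not be Gaussian. Theorem: let $\delta \in (0,1)$. Assume that f lies in the RKHS of kernel K. Let $\epsilon_t$ be a noise process that has zero mean conditioned on the history and is bounded by $\sigma$ almost surely. Assume $\|f\|_K^2 \le B$ and pick $\beta_t = 2B + 300 \gamma_t \log^3(t/\delta)$. Then for all $T \ge 1$, we have $\Pr\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} \} \ge 1 - \delta$, for a constant $C_1$ independent of T. This frees us from knowing the “true prior”. For f ~ GP, $\|f\|_K = \infty$ with probability 1; sample paths are rougher. Therefore neither theorem subsumes the other.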

Examples of GPs
Exponential kernel $K(x,x') = \exp(-|x-x'|/h)$. [Figure: panels for bandwidth h = 1 and bandwidth h = 0.3; axis: distance |x-x’|.]

Multi-armed bandits
Arms with payoff probabilities $p_1, p_2, p_3, \dots, p_k$. At each time, pick arm i; get an independent payoff with probability $p_i$. Classic model for the exploration-exploitation tradeoff. Extensively studied (Robbins ’52, Gittins ’79). Typically assumes each arm is tried multiple times.
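For readers new to the k-armed setting, here is a small simulation. It uses the classic UCB1 index rule as a stand-in policy, which is an assumption of this example rather than something these slides prescribe; the payoff probabilities, horizon, and seed are hypothetical.

```python
import numpy as np

def ucb1(probs, T=2000, seed=0):
    """Simulate UCB1 on Bernoulli arms; returns how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    k = len(probs)
    counts = np.zeros(k)
    sums = np.zeros(k)
    for t in range(1, T + 1):
        if t <= k:
            arm = t - 1                      # try each arm once first
        else:
            # Empirical mean plus a confidence width that shrinks with more pulls.
            index = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(index))
        reward = float(rng.random() < probs[arm])
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8])
print(counts)                                # pulls per arm
```

Note that the policy starts by pulling every arm once, which is exactly what becomes impossible when the number of arms is huge.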

Infinite-armed bandits
Arms $p_1, p_2, p_3, \dots, p_k, \dots, p_\infty$. In many applications, the number of arms is huge (sponsored search, sensor selection). Cannot try each arm even once. Assumptions on the payoff function f are essential.

Thinking about GPs
The kernel function K(x, x’) specifies the covariance and encodes smoothness assumptions. [Figure: f(x) against x, with the marginal density P(f(x)).]

Examples of GPs
Linear kernel with features: $K(x,x') = \phi(x)^T \phi(x')$. E.g., $\phi(x) = [0, x, x^2]$. E.g., $\phi(x) = \sin(x)$.
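A feature-based kernel is easy to build and inspect. The sketch below uses a hypothetical feature map $\phi(x) = [x, x^2]$, chosen for illustration, and draws one function from the corresponding GP as a random linear combination of the features.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map phi(x) = [x, x^2] for scalar inputs."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x ** 2], axis=-1)

def linear_kernel(a, b):
    """K(x, x') = phi(x)^T phi(x') for all pairs of inputs in a and b."""
    return phi(a) @ phi(b).T

x = np.linspace(-1.0, 1.0, 50)
K = linear_kernel(x, x)

# A sample from the GP with this kernel is f(x) = phi(x)^T w with w ~ N(0, I).
rng = np.random.default_rng(0)
w = rng.standard_normal(2)
f_sample = phi(x) @ w
print(np.linalg.matrix_rank(K))   # rank is at most the number of features
```

Because the kernel matrix has rank at most the number of features, every sampled function lives in a low-dimensional family, which is what makes the linear case easy to learn.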

Assumptions on f
Linear? [Dani et al. ’07] Fast convergence, but strong assumption. Lipschitz-continuous (bounded slope)? [Kleinberg ’08] Very flexible, but