1. Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design. Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland). ...where theory and practice collide
2. Optimizing Noisy, Unknown Functions. Given: a set of possible inputs D and black-box access to an unknown function f. Want: an adaptive choice of inputs x_1, x_2, ... from D maximizing the total payoff. Many applications: robotic control [Lizotte et al. '07], sponsored search [Pande & Olston '07], clinical trials, ... Sampling is expensive, so algorithms are evaluated using "regret"; the goal is to minimize it.
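For concreteness, the regret notion used throughout (standard for this setting and consistent with the GP-UCB paper) is
\[
r_t = f(x^*) - f(x_t), \qquad R_T = \sum_{t=1}^{T} r_t, \qquad x^* = \arg\max_{x \in D} f(x),
\]
and an algorithm is "no-regret" if R_T / T → 0 as T → ∞.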
3. Running example: Noisy Search. How do we find the hottest point in a building? Many noisy sensors are available, but sampling is expensive. D: set of sensors; f(x): temperature at sensor x; x_t: sensor chosen at step t; we observe y_t = f(x_t) + noise. Goal: find the hottest sensor x* with a minimal number of queries.
4. Key insight: Exploit correlation. Temperature is spatially correlated: sampling f(x) at one point x yields information about f(x') for points x' near x. In this paper: model correlation using a Gaussian process (GP) prior for f.
5. Gaussian Processes to model the payoff f. From the normal distribution (1-D Gaussian) to the multivariate normal (n-D Gaussian) to the Gaussian process (∞-D Gaussian): a Gaussian process (GP) is a normal distribution over functions. All finite marginals are multivariate Gaussians, and closed-form formulae exist for the Bayesian posterior update. A GP is parameterized by its covariance function K(x,x') = Cov(f(x), f(x')).
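To make the closed-form posterior update concrete, here is a minimal Python/NumPy sketch (not from the slides; the squared-exponential kernel, bandwidth, noise variance, and toy data are assumptions chosen for illustration):

```python
import numpy as np

def se_kernel(A, B, h=0.3):
    """Squared-exponential kernel K(x, x') = exp(-|x - x'|^2 / h^2) for 1-D inputs."""
    d = A[:, None] - B[None, :]          # pairwise differences
    return np.exp(-(d ** 2) / h ** 2)

def gp_posterior(X, y, Xstar, noise_var=0.01, h=0.3):
    """Posterior mean and variance of f at Xstar, given noisy observations (X, y)."""
    K = se_kernel(X, X, h) + noise_var * np.eye(len(X))
    Ks = se_kernel(X, Xstar, h)          # cross-covariances
    Kss = se_kernel(Xstar, Xstar, h)
    mu = Ks.T @ np.linalg.solve(K, y)    # posterior mean
    v = np.linalg.solve(K, Ks)
    var = np.diag(Kss) - np.sum(Ks * v, axis=0)  # posterior variance
    return mu, var

# Tiny usage example on a 1-D domain
X = np.array([0.1, 0.4, 0.8])                       # sampled inputs
y = np.sin(6 * X) + 0.1 * np.random.randn(3)        # noisy observations
Xstar = np.linspace(0, 1, 5)
mu, var = gp_posterior(X, y, Xstar)
```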
6. Example of GPs. Squared-exponential kernel: K(x,x') = exp(-(x-x')²/h²), a function of the distance |x-x'|. [Figure: samples from P(f) for bandwidth h = 0.3 and for bandwidth h = 0.1.]
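A short sketch of how such prior samples could be drawn, under the same illustrative assumptions (the grid, jitter, and number of draws are mine):

```python
import numpy as np

def se_kernel(A, B, h):
    d = A[:, None] - B[None, :]
    return np.exp(-(d ** 2) / h ** 2)

x = np.linspace(0, 1, 200)
for h in (0.3, 0.1):
    K = se_kernel(x, x, h) + 1e-8 * np.eye(len(x))    # jitter for numerical stability
    # Each draw is one random function f ~ GP(0, K) evaluated on the grid
    f = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
    print(f"h={h}: drew {f.shape[0]} sample paths on {f.shape[1]} grid points")
```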
7. Gaussian process optimization [e.g., Jones et al. '98]. Goal: adaptively pick inputs x_1, x_2, ... so as to maximize the total payoff (equivalently, minimize regret). Key question: how should we pick samples? So far, only heuristics: Expected Improvement [Močkus et al. '78], Most Probable Improvement [Močkus '89]. Used successfully in machine learning [Ginsbourger et al. '08, Jones '01, Lizotte et al. '07], but with no theoretical guarantees on their regret!
8. Simple algorithm for GP optimization. In each round t: pick the maximizer of the posterior mean, x_t = argmax_x μ_{t-1}(x); observe y_t = f(x_t) + noise; use Bayes' rule to get the updated posterior mean μ_t. Problem: this purely exploiting rule can get stuck in local maxima!
9. Uncertainty sampling. Pick the point of maximal posterior variance, x_t = argmax_x σ²_{t-1}(x). This is equivalent to (greedily) maximizing the information gain, a popular objective in Bayesian experimental design (where the goal is pure exploration of f). But it wastes samples by exploring f everywhere!
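A minimal sketch of the uncertainty-sampling rule on a finite candidate set (illustrative assumptions: squared-exponential kernel, noise variance 0.01, 1-D domain). Note that for a GP the posterior variance, and hence the chosen points, do not depend on the observed values y:

```python
import numpy as np

def se_kernel(A, B, h=0.3):
    d = A[:, None] - B[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def posterior_variance(X, Xstar, noise_var=0.01, h=0.3):
    """Posterior variance of f at Xstar given sampled inputs X (observed values are irrelevant here)."""
    K = se_kernel(X, X, h) + noise_var * np.eye(len(X))
    Ks = se_kernel(X, Xstar, h)
    return 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)   # K(x,x) = 1 for the SE kernel

candidates = np.linspace(0, 1, 100)
sampled = np.array([0.5])                        # start somewhere
for t in range(5):
    var = posterior_variance(sampled, candidates)
    x_next = candidates[np.argmax(var)]          # uncertainty sampling: most uncertain point
    sampled = np.append(sampled, x_next)
print(sampled)
```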
10. Avoiding unnecessary samples. Key insight: we never need to sample where the Upper Confidence Bound (UCB) is below the best lower bound!
11. Upper Confidence Bound (UCB) Algorithm. Pick the input that maximizes the upper confidence bound, x_t = argmax_x μ_{t-1}(x) + √β_t σ_{t-1}(x). This naturally trades off exploration and exploitation, and no samples are wasted. Regret bounds exist for the classic bandit setting [Auer '02] and for linear f [Dani et al. '07], but none in the GP optimization setting, where UCB is a popular heuristic. How should we choose β_t? We need theory!
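Putting the pieces together, a small sketch of the GP-UCB selection rule on a finite decision set, with the greedy-mean rule of slide 8 and the pure-variance rule of slide 9 shown as commented-out alternatives. The kernel, noise level, β_t schedule constant, and test function are assumptions made for the example, not the paper's exact choices:

```python
import numpy as np

def se_kernel(A, B, h=0.2):
    d = A[:, None] - B[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def gp_posterior(X, y, Xstar, noise_var=0.01, h=0.2):
    K = se_kernel(X, X, h) + noise_var * np.eye(len(X))
    Ks = se_kernel(X, Xstar, h)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)   # K(x,x) = 1 for the SE kernel
    return mu, np.maximum(var, 1e-12)                         # clip tiny negatives from round-off

f = lambda x: -(x - 0.7) ** 2 + 0.1 * np.sin(20 * x)   # unknown payoff (for simulation only)
D = np.linspace(0, 1, 200)                              # finite decision set
X, y = np.array([0.5]), np.array([f(0.5) + 0.1 * np.random.randn()])

for t in range(1, 31):
    mu, var = gp_posterior(X, y, D)
    beta_t = 2.0 * np.log(len(D) * t ** 2)               # a Theta(log t) schedule, as in the theorem
    ucb = mu + np.sqrt(beta_t) * np.sqrt(var)
    # Alternatives from slides 8-9:  x_t = D[np.argmax(mu)]   (greedy mean, can get stuck)
    #                                x_t = D[np.argmax(var)]  (uncertainty sampling, over-explores)
    x_t = D[np.argmax(ucb)]
    X = np.append(X, x_t)
    y = np.append(y, f(x_t) + 0.1 * np.random.randn())

print("best sampled point:", X[np.argmax(y)])
```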
12. How well does UCB work? Intuitively, performance should depend on how "learnable" the function is: bandwidth h = 0.3 is "easy", bandwidth h = 0.1 is "hard". The quicker the confidence bands collapse, the easier the function is to learn. Key idea: relate the rate of collapse to the growth of the information gain.
13. Learnability and information gain. We show that the regret bounds depend on how quickly we can gain information about f; mathematically, the regret is bounded in terms of the maximal information gain γ_T (precise statement on the next slide). This establishes a novel connection between GP optimization and Bayesian experimental design.
14. Performance of optimistic sampling. Theorem: if we choose β_t = Θ(log t), then with high probability R_T = O*(√(T γ_T)), where O* hides log factors. Here γ_T is the maximal information gain due to sampling T points. The slower γ_T grows, the easier f is to learn. Key question: how quickly does γ_T grow?
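For reference, the information gain defining γ_T has a closed form under the GP model (following the paper's definitions; K_A denotes the kernel matrix of the sampled set A and σ² the noise variance):
\[
I(y_A; f) = H(y_A) - H(y_A \mid f) = \tfrac{1}{2}\log\det\!\left(I + \sigma^{-2} K_A\right),
\qquad
\gamma_T = \max_{A \subseteq D,\, |A| \le T} I(y_A; f).
\]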
15. Learnability and information gain. Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin '05]. Our bounds depend on the "rate" at which the returns diminish. [Figure: little diminishing returns vs. returns diminish fast.]
16. Dealing with high dimensions. Theorem: for various popular kernels we have: linear kernel: γ_T = O(d log T); squared-exponential kernel: γ_T = O((log T)^{d+1}); Matérn kernel with ν > 1: γ_T = O(T^{d(d+1)/(2ν+d(d+1))} log T). Smoothness of f helps battle the curse of dimensionality! These bounds rely on the submodularity of the information gain.
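Combining these rates with the regret theorem gives concrete no-regret guarantees. A worked example (hiding log factors inside O*, as the slides do) for the squared-exponential kernel in dimension d:
\[
R_T = O^*\!\left(\sqrt{T\,\gamma_T}\right) = O^*\!\left(\sqrt{T\,(\log T)^{d+1}}\right),
\qquad\text{so}\qquad \frac{R_T}{T} \to 0 \ \text{for every fixed } d.
\]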
17. What if f is not from a GP? In practice, f may not be Gaussian. Theorem: let f lie in the RKHS of kernel K with ‖f‖_K ≤ B, and let the noise be bounded almost surely by σ. Choose β_t = 2B² + 300 γ_t log³(t/δ). Then, with high probability, R_T = O*(√T (B√γ_T + γ_T)). This frees us from knowing the "true prior"; intuitively, the bound depends on the "complexity" of the function through its RKHS norm.
18. Experiments: UCB vs. heuristics. Temperature data: 46 sensors deployed at Intel Research, Berkeley; data collected for 5 days (1 sample/minute); the task is to adaptively find the highest temperature as quickly as possible. Traffic data: speed data from 357 sensors deployed along highway I-880 South, collected 6am-11am for one month; the task is to find the most congested (lowest-speed) area as quickly as possible.
19. Comparison: UCB vs. heuristics. GP-UCB compares favorably with existing heuristics.
20. Conclusions. First theoretical guarantees and convergence rates for GP optimization; both the true-prior and the agnostic case are covered. Performance depends on "learnability", captured by the maximal information gain. Connects GP bandit optimization and experimental design! Performance on real data is comparable to other heuristics.
21. Relating the two problems. Both problems involve an adaptive choice of x_1, ..., x_T from the decision set D; in the bandit problem we minimize the cumulative regret R_T. It is easy to see that the best point sampled so far is within R_T / T of the optimum, so the bound we prove on R_T also applies to the error of the best sampled point.
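The reasoning in one line (standard, not specific to this paper):
\[
\min_{t \le T}\bigl(f(x^*) - f(x_t)\bigr)\;\le\;\frac{1}{T}\sum_{t=1}^{T}\bigl(f(x^*) - f(x_t)\bigr)\;=\;\frac{R_T}{T},
\]
so any sublinear bound on the cumulative regret R_T also bounds how far the best sampled point is from the optimum.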
22. Performance of optimistic sampling. Theorem [Srinivas, Krause, Kakade, Seeger]: let δ ∈ (0,1) and let D ⊂ R^d be compact and convex. Pick β_t as in the paper (a Θ(log t) schedule whose constants depend on δ, d, and the kernel). If K satisfies weak regularity conditions, we have Pr{ R_T ≤ √(C₁ T β_T γ_T) for all T } ≥ 1 − δ, where C₁ is independent of T and γ_T is the maximal information gain due to sampling. First performance guarantee for GP optimization! "It pays to be optimistic."
23. Bounding Information Gain. Theorem: for finite D, submodularity and the greedy algorithm yield a bound on γ_T in terms of the eigenvalues of the covariance matrix (intuitively, the greedy algorithm samples along eigenvector directions, picking up the largest eigenvalues first). The maximal info-gain therefore depends on the spectral properties of the kernel: the faster the eigenvalues decay, the better the performance.
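To make these quantities concrete, a small sketch that computes I(y_A; f) = ½ log det(I + σ⁻² K_A) and builds A greedily by maximizing the marginal gain; by submodularity (slide 15) the greedy value is within a (1 − 1/e) factor of γ_T. The kernel, noise level, and domain are illustrative assumptions:

```python
import numpy as np

def se_kernel(A, B, h=0.2):
    d = A[:, None] - B[None, :]
    return np.exp(-(d ** 2) / h ** 2)

def info_gain(idx, K, noise_var):
    """I(y_A; f) = 0.5 * log det(I + sigma^-2 K_A) for the sampled index set A."""
    KA = K[np.ix_(idx, idx)]
    sign, logdet = np.linalg.slogdet(np.eye(len(idx)) + KA / noise_var)
    return 0.5 * logdet

D = np.linspace(0, 1, 100)
K = se_kernel(D, D)
noise_var = 0.01
chosen = []
for t in range(10):
    # Greedy step: add the candidate with the largest marginal information gain
    gains = [info_gain(chosen + [j], K, noise_var) for j in range(len(D))]
    chosen.append(int(np.argmax(gains)))
print("greedy info gain after 10 samples:", info_gain(chosen, K, noise_var))
```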
24. What if f is not from a GP? In practice, f may not be Gaussian. Theorem: let δ ∈ (0,1), and assume f lies in the RKHS of kernel K with ‖f‖_K ≤ B. Let the noise process have zero mean conditioned on the history and be bounded by σ almost surely. Pick β_t = 2B² + 300 γ_t log³(t/δ). Then, with probability at least 1 − δ, R_T ≤ √(C₁ T β_T γ_T) for all T. This frees us from knowing the "true prior". For f ~ GP, the RKHS norm ‖f‖_K is infinite almost surely (GP sample paths are rougher than RKHS functions); therefore neither theorem subsumes the other.
26. Multi-armed bandits. At each time step, pick an arm i and get an independent payoff with probability p_i. This is the classic model for the exploration-exploitation tradeoff and has been studied extensively (Robbins '52, Gittins '79). It typically assumes each arm is tried multiple times. [Figure: k slot machines with payoff probabilities p_1, ..., p_k.]
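For context, a minimal sketch of the classic UCB1 index of Auer '02 (referenced on slide 11) in this k-armed Bernoulli setting; the payoff probabilities are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.7])          # unknown Bernoulli payoff probabilities
counts = np.ones(len(p))               # play each arm once to initialize
sums = rng.binomial(1, p).astype(float)

for t in range(len(p) + 1, 1001):
    means = sums / counts
    ucb = means + np.sqrt(2 * np.log(t) / counts)   # UCB1 index [Auer '02]
    i = int(np.argmax(ucb))
    sums[i] += rng.binomial(1, p[i])
    counts[i] += 1

print("pulls per arm:", counts)        # most pulls should go to the best arm
```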
27. Infinite-armed bandits. In many applications, the number of arms is huge (sponsored search, sensor selection), and we cannot try each arm even once. Assumptions on the payoff function f are essential. [Figure: infinitely many arms with payoff probabilities p_1, p_2, ..., p_∞.]
28. Thinking about GPs. The kernel function K(x, x') specifies the covariance of f and thereby encodes smoothness assumptions. [Figure: distribution P(f(x)) of the function value at a point x.]
29. Examples of GPs. Linear kernel with features: K(x,x') = Φ(x)ᵀΦ(x'), e.g., Φ(x) = [0, x, x²], or Φ(x) = sin(x).
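A one-line check (standard GP material, not from the slides) of why a linear model in features is a GP with this kernel: if f(x) = wᵀΦ(x) with w ~ N(0, I), then
\[
\mathrm{Cov}\bigl(f(x), f(x')\bigr) = \Phi(x)^\top \,\mathbb{E}[w w^\top]\, \Phi(x') = \Phi(x)^\top \Phi(x') = K(x, x').
\]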
30. Assumptions on f. Linear? [Dani et al. '07]: fast convergence, but a strong assumption. Lipschitz-continuous (bounded slope)? [Kleinberg '08]: very flexible, but convergence is slow in high dimensions.