# Nonmyopic Active Learning of Gaussian Processes: An Exploration-Exploitation Approach

Andreas Krause, Carlos Guestrin, Carnegie Mellon University

## Presentation transcript

River monitoring
- We want to monitor the ecological condition of a river.
- We need to decide where to make observations.
- [Figures: mixing zone of the San Joaquin and Merced rivers; NIMS (UCLA)]

Observation selection for spatial prediction
- Gaussian processes (GPs) define a distribution over functions (e.g., how pH varies in space).
- They allow estimating the uncertainty in a prediction.
- [Figure: pH value against horizontal position, showing the observations, the unobserved process, the prediction, and confidence bands]

Mutual information [Caselton and Zidek 1984]
- Finite set of possible locations $V$.
- For any subset $A \subseteq V$, we can compute $\mathrm{MI}(A) = H(X_{V \setminus A}) - H(X_{V \setminus A} \mid X_A)$: the entropy of the uninstrumented locations before sensing minus their entropy after sensing.
- Want: $A^* = \operatorname{argmax}_A \mathrm{MI}(A)$ subject to $|A| \le k$.
- Finding $A^*$ is an NP-hard optimization problem.
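
For jointly Gaussian variables this criterion has a closed form. The following is a sketch, assuming MI(A) denotes the mutual information between the observed locations and the rest, and writing $\Sigma$ for the covariance of $X_V$ and $\bar{A} = V \setminus A$ (both symbols are introduced here for illustration and do not appear on the slides):

$$
\begin{aligned}
H(X_A) &= \tfrac{1}{2} \log\!\left( (2\pi e)^{|A|} \det \Sigma_{AA} \right), \\
\mathrm{MI}(A) &= H(X_{\bar{A}}) - H(X_{\bar{A}} \mid X_A) = H(X_A) + H(X_{\bar{A}}) - H(X_V) \\
&= \tfrac{1}{2} \log \frac{\det \Sigma_{AA} \, \det \Sigma_{\bar{A}\bar{A}}}{\det \Sigma_{VV}}.
\end{aligned}
$$

The $(2\pi e)$ factors cancel because $|A| + |\bar{A}| = |V|$, so only log-determinants of sub-blocks of the covariance are needed.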

The greedy algorithm for finding optimal a priori sets
- Want to find: $A^* = \operatorname{argmax}_{|A| = k} \mathrm{MI}(A)$.
- Greedy algorithm:
  - Start with $A = \emptyset$.
  - For $i = 1$ to $k$:
    - $s^* := \operatorname{argmax}_s \mathrm{MI}(A \cup \{s\})$
    - $A := A \cup \{s^*\}$
- Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]: the result of the greedy algorithm is within a constant factor (~63%) of the optimal solution.
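
The greedy loop on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the kernel form, the noise term, and the function names (`sq_exp_kernel`, `mutual_information`, `greedy_mi`) are assumptions made for the example, and MI is evaluated with the Gaussian log-determinant formula.

```python
import numpy as np

def sq_exp_kernel(X, bandwidth=1.0, variance=1.0, noise=1e-2):
    """Assumed squared exponential covariance on row-wise points X, plus diagonal noise."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-d2 / bandwidth ** 2) + noise * np.eye(len(X))

def mutual_information(K, A):
    """MI between the variables indexed by A and all remaining variables (zero-mean Gaussian)."""
    A = list(A)
    B = [i for i in range(len(K)) if i not in A]
    if not A or not B:
        return 0.0
    logdet = lambda idx: np.linalg.slogdet(K[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(A) + logdet(B) - np.linalg.slogdet(K)[1])

def greedy_mi(K, k):
    """Start with the empty set and add, k times, the location that maximizes MI(A + [s])."""
    A = []
    for _ in range(k):
        candidates = [s for s in range(len(K)) if s not in A]
        best = max(candidates, key=lambda s: mutual_information(K, A + [s]))
        A.append(best)
    return A

# Toy usage: 8 candidate locations on a line, pick 3 of them.
X = np.linspace(0.0, 4.0, 8)[:, None]
K = sq_exp_kernel(X)
A = greedy_mi(K, 3)
```

Each greedy step re-evaluates MI from scratch, which is simple but costly; it is enough for small candidate sets like this one.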

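Sequential design
- Observed variables depend on previous measurements and on the observation policy $\pi$.
- $\mathrm{MI}(\pi)$ = expected MI score over the outcomes of the observations.
- [Figure: an observation policy $\pi$ drawn as a decision tree. The root observes $X_5$, and each outcome (e.g., < 20°C or ≥ 20°C) selects the next location to observe ($X_3$, $X_2$, $X_7$, $X_{12}$, $X_{23}$, with further thresholds at 15°C and 18°C). One path scores $\mathrm{MI}(X_5 = 17, X_3 = 16, X_7 = 19) = 3.4$, two others score 2.1 and 2.4, and the policy as a whole has $\mathrm{MI}(\pi) = 3.1$.]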

A priori vs. sequential
- Sets are very simple policies. Hence $\max_A \mathrm{MI}(A) \le \max_\pi \mathrm{MI}(\pi)$ subject to $|A| = |\pi| = k$.
- Key question addressed in this work: how much better is sequential design than a priori design?
- Main motivation:
  - Performance guarantees about sequential design?
  - A priori design is logistically much simpler!

GPs slightly more formally
- Set of locations $V$.
- Joint distribution $P(X_V)$.
- For any $A \subseteq V$, $P(X_A)$ is Gaussian.
- A GP is defined by:
  - Prior mean $\mu(s)$ [often constant, e.g., 0]
  - Kernel $K(s, t)$
- Example: squared exponential kernel with parameters $\theta_1$ (variance, or amplitude) and $\theta_2$ (bandwidth).
- [Figure: correlation against distance for the squared exponential kernel]
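
As a rough illustration of these definitions, the sketch below builds a squared exponential kernel and computes the GP posterior mean and covariance at new locations. The kernel formula on the slide did not survive extraction, so the parametrization `variance * exp(-(s - t)^2 / bandwidth^2)` is an assumption, as are the noise level, the function names, and the made-up readings (given as deviations from a constant mean, so that a zero prior mean applies).

```python
import numpy as np

def sq_exp_kernel(s, t, variance=1.0, bandwidth=1.0):
    """Assumed form: K(s, t) = variance * exp(-(s - t)^2 / bandwidth^2) for 1-D inputs."""
    diff = s[:, None] - t[None, :]
    return variance * np.exp(-diff ** 2 / bandwidth ** 2)

def gp_posterior(s_obs, x_obs, s_new, noise=1e-2, **kernel_args):
    """Posterior mean and covariance at s_new for a zero-mean GP with noisy observations."""
    K_oo = sq_exp_kernel(s_obs, s_obs, **kernel_args) + noise * np.eye(len(s_obs))
    K_no = sq_exp_kernel(s_new, s_obs, **kernel_args)
    K_nn = sq_exp_kernel(s_new, s_new, **kernel_args)
    mean = K_no @ np.linalg.solve(K_oo, x_obs)
    cov = K_nn - K_no @ np.linalg.solve(K_oo, K_no.T)
    return mean, cov

s_obs = np.array([0.0, 1.0, 3.0])          # observed positions
x_obs = np.array([0.3, -0.1, 0.4])         # made-up deviations from the mean reading
s_new = np.linspace(0.0, 4.0, 9)           # positions to predict
mean, cov = gp_posterior(s_obs, x_obs, s_new)

# Approximate confidence bands: two posterior standard deviations around the mean.
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
lower, upper = mean - 2 * std, mean + 2 * std
```

Note that `cov` is computed without ever touching `x_obs`: for fixed kernel parameters the posterior covariance depends only on where observations are made, not on their values.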

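Known parameters
- Known parameters $\theta$ (bandwidth, variance, etc.).
- Mutual information does not depend on the observed values, because the posterior covariance of a Gaussian depends only on which locations are observed.
- No benefit in sequential design: $\max_A \mathrm{MI}(A) = \max_\pi \mathrm{MI}(\pi)$.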

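Unknown parameters
- Unknown (discretized) parameters with prior $P(\Theta = \theta)$.
- Mutual information does depend on the observed values, because the parameter posterior $P(\theta \mid X_A = x_A)$ depends on the observations.
- Sequential design can be better: $\max_A \mathrm{MI}(A) \le \max_\pi \mathrm{MI}(\pi)$.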
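
To see how observed values enter once the parameters are unknown, the sketch below updates a prior over a discretized bandwidth with Bayes' rule, using the Gaussian marginal likelihood of the observations. The kernel form, noise level, candidate bandwidths, and the two toy data vectors are all assumptions made for illustration.

```python
import numpy as np

def sq_exp_cov(locs, bandwidth, variance=1.0, noise=1e-2):
    """Assumed squared exponential covariance on 1-D locations, plus diagonal noise."""
    diff = locs[:, None] - locs[None, :]
    return variance * np.exp(-diff ** 2 / bandwidth ** 2) + noise * np.eye(len(locs))

def log_gaussian_likelihood(x, K):
    """log N(x; 0, K), computed through a Cholesky factorization."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, x)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(x) * np.log(2 * np.pi))

def bandwidth_posterior(locs, x, bandwidths, prior):
    """P(bandwidth | observations) over a discrete grid of candidate bandwidths."""
    log_post = np.log(prior) + np.array(
        [log_gaussian_likelihood(x, sq_exp_cov(locs, b)) for b in bandwidths])
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

locs = 0.5 * np.arange(10)
bandwidths = np.array([0.3, 3.0])
prior = np.array([0.5, 0.5])

smooth = np.ones(10)                                # slowly varying readings
rough = np.array([1.0, -1.0] * 5)                   # rapidly alternating readings
post_smooth = bandwidth_posterior(locs, smooth, bandwidths, prior)
post_rough = bandwidth_posterior(locs, rough, bandwidths, prior)
```

Smooth readings shift the posterior toward the long bandwidth and rough readings toward the short one, which is why a policy that reacts to observations can outperform a fixed set.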

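Key result: how big is the gap?
- If $\theta$ is known: $\mathrm{MI}(A^*) = \mathrm{MI}(\pi^*)$.
- If $\theta$ is "almost" known: $\mathrm{MI}(A^*) \approx \mathrm{MI}(\pi^*)$.
- Theorem: the gap between the MI of the best policy and the MI of the best parameter-specific set depends on $H(\Theta)$; as $H(\Theta) \to 0$, the MI of the best policy approaches the MI of the best parameter-specific set.
- [Figure: MI axis marking $\mathrm{MI}(A^*)$ and $\mathrm{MI}(\pi^*)$, with the gap size between the MI of the best set and the MI of the best policy]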

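Near-optimal policy if parameter approximately known
- Use the greedy algorithm to optimize $\mathrm{MI}(A_{\text{greedy}} \mid \Theta) = \sum_\theta P(\theta)\, \mathrm{MI}(A_{\text{greedy}} \mid \theta)$.
- Note: $|\mathrm{MI}(A \mid \Theta) - \mathrm{MI}(A)| \le H(\Theta)$.
- We can compute $\mathrm{MI}(A \mid \Theta)$ analytically, but not $\mathrm{MI}(A)$.
- Corollary [using our result from ICML 05]: the result of the greedy algorithm is within ~63% of the optimal sequential plan, up to a gap that is ≈ 0 when the parameters are known.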
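
The two quantities on this slide, the parameter-averaged criterion MI(A | Θ) and the entropy H(Θ) that bounds its distance from MI(A), are both cheap to evaluate for a discretized parameter. The sketch below assumes only the bandwidth is unknown and uses an assumed squared exponential kernel; all function names are hypothetical.

```python
import numpy as np

def sq_exp_cov(locs, bandwidth, variance=1.0, noise=1e-2):
    """Assumed squared exponential covariance on 1-D locations, plus diagonal noise."""
    diff = locs[:, None] - locs[None, :]
    return variance * np.exp(-diff ** 2 / bandwidth ** 2) + noise * np.eye(len(locs))

def mi_given_theta(K, A):
    """Gaussian MI between the variables in A and the rest, for one fixed covariance K."""
    A = sorted(A)
    B = [i for i in range(len(K)) if i not in A]
    if not A or not B:
        return 0.0
    logdet = lambda idx: np.linalg.slogdet(K[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(A) + logdet(B) - np.linalg.slogdet(K)[1])

def expected_mi(locs, A, bandwidths, prior):
    """MI(A | Theta): the prior-weighted average of MI(A | theta) over candidate bandwidths."""
    return sum(p * mi_given_theta(sq_exp_cov(locs, b), A)
               for b, p in zip(bandwidths, prior))

def parameter_entropy(prior):
    """H(Theta) in nats for a discrete prior; zero-probability entries contribute nothing."""
    p = np.asarray(prior, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

locs = np.linspace(0.0, 4.0, 9)
A = [2, 6]
bandwidths = [1.0, 3.0]
score = expected_mi(locs, A, bandwidths, [0.5, 0.5])
gap_bound = parameter_entropy([0.5, 0.5])   # upper bound on |MI(A | Theta) - MI(A)|
```

Plugging `expected_mi` into a greedy selection loop gives the a priori design described here; `gap_bound` shrinks to zero as the prior concentrates on one bandwidth.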

Exploration-Exploitation for GPs

| | Reinforcement learning | Active learning in GPs |
|---|---|---|
| Parameters | $P(S_{t+1} \mid S_t, A_t)$, $\mathrm{Rew}(S_t)$ | Kernel parameters $\theta$ |
| Known parameters: exploitation | Find a near-optimal policy by solving the MDP. | Find a near-optimal policy by finding the best set. |
| Unknown parameters: exploration | Try to quickly learn the parameters. Need to waste only polynomially many robots! | Try to quickly learn the parameters. How many samples do we need? |

Parameter info-gain exploration (IGE)
- The gap depends on $H(\Theta)$.
- Intuitive heuristic: greedily select $s^* = \operatorname{argmax}_s I(\Theta; X_s) = \operatorname{argmax}_s \left[ H(\Theta) - H(\Theta \mid X_s) \right]$, the parameter entropy before observing $s$ minus the parameter entropy after observing $s$.
- Does not directly try to improve spatial prediction.
- No sample complexity bounds.

Implicit exploration (IE)
- Intuition: any observation will help us reduce $H(\Theta)$.
- Sequential greedy algorithm: given previous observations $X_A = x_A$, greedily select $s^* = \operatorname{argmax}_s \mathrm{MI}(\{X_s\} \mid X_A = x_A, \Theta)$.
- Contrary to the a priori greedy algorithm, this algorithm takes observations into account (it updates the parameters).
- Proposition: $H(\Theta \mid X_\pi) \le H(\Theta)$, i.e. "information never hurts" for policies.
- No sample complexity bounds.

Learning the bandwidth
- We can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance.
- Sensors within the bandwidth are correlated.
- Sensors outside the bandwidth are ≈ independent.
- [Figure: kernel and its bandwidth, with sensors A, B, C]

Hypothesis testing: distinguishing two bandwidths
- Setting: squared exponential kernel.
- Choose pairs of samples at the distance where the correlation gap between the two hypotheses is largest, and test their correlation.
- [Figure: correlation against distance under BW = 1 and under BW = 3, marking the distance at which the correlation gap is largest; sample functions drawn with BW = 1 and BW = 3]
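
The best test distance can be found numerically. The sketch below assumes the correlation at distance d has the form `exp(-d^2 / bandwidth^2)` (the slide's kernel formula was lost, so this parametrization is an assumption) and searches a grid for the distance that maximizes the gap between two candidate bandwidths.

```python
import numpy as np

def correlation(d, bandwidth):
    """Assumed squared exponential correlation at distance d."""
    return np.exp(-d ** 2 / bandwidth ** 2)

def best_test_distance(bw_a, bw_b, d_max=10.0, n_grid=100001):
    """Grid search for the distance at which the two hypotheses disagree most."""
    d = np.linspace(0.0, d_max, n_grid)
    gap = np.abs(correlation(d, bw_a) - correlation(d, bw_b))
    i = int(np.argmax(gap))
    return d[i], gap[i]

d_star, gap_star = best_test_distance(1.0, 3.0)
```

Under this parametrization the maximizer for bandwidths 1 and 3 also has a closed form, $d^2 = 9 \ln 9 / 8$, which the grid search reproduces to within the grid spacing.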

Hypothesis testing: sample complexity
- Theorem: to distinguish bandwidths with a given minimum gap in correlation and error < $\varepsilon$, we need a number of independent samples that depends on the gap and the error level.
- In GPs, samples are dependent, but "almost" independent samples suffice (details in paper).
- Other tests can be used for variance, noise, etc.
- What if we want to distinguish more than two bandwidths?

Hypothesis testing: binary searching for bandwidth
- Find the "most informative split" at the posterior median.
- The testing policy $\pi_{\mathrm{ITE}}$ needs only logarithmically many tests.
- Theorem: if we have tests with error < $\varepsilon_T$, then …
- [Figure: $P(\theta)$ over candidate bandwidths 1 to 5]
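
The median-split idea can be sketched as a binary search over the discretized bandwidths. This is a simplification under stated assumptions: the test is treated as error-free and each outcome eliminates one side outright, whereas the slides work with tests that have a bounded error. The `test` callable and the function names are hypothetical.

```python
import numpy as np

def median_split(posterior):
    """Index i such that candidates 0..i and i+1.. split the mass near the posterior median."""
    cdf = np.cumsum(posterior)
    # First index whose cumulative mass reaches 1/2, capped so both sides stay non-empty.
    return min(int(np.searchsorted(cdf, 0.5)), len(posterior) - 2)

def binary_search_bandwidth(bandwidths, prior, test):
    """Repeatedly split the remaining candidates at the median and keep the side the test picks.

    test(left, right) is assumed to return True when the true bandwidth is <= left.
    """
    lo, hi = 0, len(bandwidths) - 1
    n_tests = 0
    while lo < hi:
        p = prior[lo:hi + 1] / prior[lo:hi + 1].sum()
        split = lo + median_split(p)
        if test(bandwidths[split], bandwidths[split + 1]):
            hi = split
        else:
            lo = split + 1
        n_tests += 1
    return bandwidths[lo], n_tests

bandwidths = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prior = np.full(5, 0.2)
true_bandwidth = 4.0
found, n_tests = binary_search_bandwidth(
    bandwidths, prior, lambda left, right: true_bandwidth <= left)
```

With five equally likely candidates the search finishes in at most three tests, in line with the logarithmic behaviour described on the slide.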

Exploration-Exploitation algorithm
- Exploration phase:
  - Sample according to the exploration policy.
  - Compute a bound on the gap between the best set and the best policy.
  - If the bound is below a specified threshold, go to the exploitation phase; otherwise continue exploring.
- Exploitation phase:
  - Use the a priori greedy algorithm to select the remaining samples.
- For hypothesis testing, the algorithm is guaranteed to proceed to exploitation after logarithmically many samples.
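
The control flow of the two phases can be sketched as a small loop. This is only a skeleton: `explore_pick`, `gap_bound`, and `greedy_pick` are hypothetical callables standing in for the exploration policy, the bound on the set-versus-policy gap, and the a priori greedy step, and the toy versions used below are placeholders rather than real GP computations.

```python
def explore_then_exploit(k, threshold, explore_pick, gap_bound, greedy_pick):
    """Select k locations: explore until the gap bound drops below threshold, then exploit."""
    chosen, phases = [], []
    exploring = True
    while len(chosen) < k:
        # Once the bound is small enough, switch to exploitation for good.
        if exploring and gap_bound(chosen) < threshold:
            exploring = False
        pick = explore_pick if exploring else greedy_pick
        chosen.append(pick(chosen))
        phases.append("explore" if exploring else "exploit")
    return chosen, phases

# Toy stand-ins (assumptions for illustration only).
candidates = list(range(10))
first_unused = lambda chosen: next(s for s in candidates if s not in chosen)
last_unused = lambda chosen: next(s for s in reversed(candidates) if s not in chosen)
toy_bound = lambda chosen: 1.0 / (1 + len(chosen))   # shrinks as observations accumulate

chosen, phases = explore_then_exploit(5, 0.3, first_unused, toy_bound, last_unused)
```

In a real run the bound would be computed from the current parameter posterior, so the switch to exploitation happens as soon as the data have pinned the parameters down well enough.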

Results
- None of the strategies dominates the others.
- Usefulness depends on the application.
- [Figures, temperature data: RMS error against number of observations, and parameter uncertainty against number of observations, for IE (implicit exploration), ITE (hypothesis testing), and IGE (parameter info-gain)]

Nonstationarity by spatial partitioning
- Isotropic GP for each region, weighted by region membership: a spatially varying linear combination.
- Problem: the parameter space grows exponentially in the number of regions.
- Solution: a variational approximation (BK-style) allows efficient approximate inference (details in paper).
- [Figures: stationary fit and nonstationary fit]

Results on river data
- Nonstationary model + active learning lead to lower RMS error.
- [Figure: RMS error against number of observations for IE nonstationary, IE isotropic, and a priori nonstationary. In the sample-location plot, larger bars = later sample.]

Results on temperature data
- IE reduces error most quickly.
- IGE reduces parameter entropy most quickly.
- [Figures: RMS error against number of observations for IE isotropic, IGE nonstationary, IE nonstationary, and random nonstationary; parameter uncertainty against number of observations for IE nonstationary and IGE nonstationary]

Conclusions
- Nonmyopic approach towards active learning in GPs.
- If parameters are known, the greedy algorithm achieves near-optimal exploitation.
- If parameters are unknown, perform exploration:
  - Implicit exploration
  - Explicit, using information gain
  - Explicit, using hypothesis tests, with logarithmic sample complexity bounds
- Each exploration strategy has its own advantages.
- Can use the bound to compute a stopping criterion.
- Presented extensive evaluation on real-world data.
