Published by Barrett Maher. Modified over 2 years ago.

1
Nonmyopic Active Learning of Gaussian Processes: An Exploration-Exploitation Approach
Andreas Krause, Carlos Guestrin
Carnegie Mellon University

2
River monitoring
Want to monitor the ecological condition of a river. Need to decide where to make observations!
[Figure: mixing zone of the San Joaquin and Merced rivers; NIMS (UCLA)]

3
Observation selection for spatial prediction
Gaussian processes: a distribution over functions (e.g., how pH varies in space). Allows estimating uncertainty in prediction.
[Figure: pH value vs. horizontal position, showing the observations, the unobserved process, the prediction, and the confidence bands]

4
Mutual information [Caselton Zidek 1984]
Finite set of possible locations V.
For any subset A ⊆ V, can compute MI(A) = H(X_{V\A}) − H(X_{V\A} | X_A), where H(X_{V\A}) is the entropy of uninstrumented locations before sensing and H(X_{V\A} | X_A) is the entropy of uninstrumented locations after sensing.
Want: A* = argmax MI(A) subject to |A| ≤ k.
Finding A* is an NP-hard optimization problem.

5
The greedy algorithm for finding optimal a priori sets
Want to find: A* = argmax_{|A|=k} MI(A)
Greedy algorithm:
    Start with A = ∅
    For i = 1 to k
        s* := argmax_s MI(A ∪ {s})
        A := A ∪ {s*}
Theorem [ICML 2005, with Carlos Guestrin, Ajit Singh]: result of greedy algorithm ≥ (1 − 1/e) × optimal solution (constant factor, ~63%).
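The greedy loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a zero-mean Gaussian over a small finite set of locations with a known covariance matrix `K`, and it uses the standard Gaussian identity MI(A) = ½ (log det K_AA + log det K_BB − log det K_VV) with B = V \ A. The helper names and the toy covariance are hypothetical.

```python
import numpy as np

def logdet(K, idx):
    """Log-determinant of the submatrix of K indexed by idx (0 for the empty set)."""
    if len(idx) == 0:
        return 0.0
    _, value = np.linalg.slogdet(K[np.ix_(idx, idx)])
    return value

def mutual_information(K, A):
    """MI between X_A and the remaining variables, for a Gaussian with covariance K."""
    V = list(range(K.shape[0]))
    rest = [v for v in V if v not in A]
    return 0.5 * (logdet(K, list(A)) + logdet(K, rest) - logdet(K, V))

def greedy_mi(K, k):
    """Start with the empty set and add the location with the highest MI, k times."""
    A = []
    for _ in range(k):
        candidates = [s for s in range(K.shape[0]) if s not in A]
        best = max(candidates, key=lambda s: mutual_information(K, A + [s]))
        A.append(best)
    return A

# Toy example (assumed): 6 locations on a line, squared exponential covariance plus noise.
locations = np.linspace(0.0, 5.0, 6)
d = locations[:, None] - locations[None, :]
K = np.exp(-d**2 / 2.0) + 0.1 * np.eye(6)
A = greedy_mi(K, k=2)
print("greedy set:", A, "MI:", mutual_information(K, A))
```

Each greedy step re-evaluates MI from scratch, which is fine for a handful of locations but would need incremental updates for larger sets.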

6
Sequential design
Observed variables depend on previous measurements and the observation policy π.
MI(π) = expected MI score over outcome of observations.
[Figure: observation policy π drawn as a decision tree. Nodes ask for X_5, X_3, X_2, X_7, X_12, X_23, and branches are chosen by temperature thresholds (<20°C / ≥20°C, >15°C, <18°C / ≥18°C). Example paths start from X_5=17 and X_5=21. One path gives MI(X_5=17, X_3=16, X_7=19) = 3.4; other leaves give MI(…) = 2.1 and MI(…) = 2.4; overall MI(π) = 3.1.]

7
A priori vs. sequential
Sets are very simple policies. Hence: max_A MI(A) ≤ max_π MI(π) subject to |A| = |π| = k.
Key question addressed in this work: How much better is sequential vs. a priori design?
Main motivation: Performance guarantees about sequential design? A priori design is logistically much simpler!

8
GPs slightly more formally
Set of locations V. Joint distribution P(X_V). For any A ⊆ V, P(X_A) is Gaussian.
GP defined by: prior mean μ(s) [often constant, e.g., 0] and kernel K(s,t).
Example: squared exponential kernel, with parameters θ_1: variance (amplitude) and θ_2: bandwidth.
[Figure: correlation vs. distance for the squared exponential kernel]
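As an illustration of the squared exponential kernel named on this slide, here is a small sketch. The exact parameterization on the original slide was lost; the form below, variance · exp(−d² / (2 · bandwidth²)), is one common convention and should be read as an assumption, as should the function name.

```python
import numpy as np

def squared_exponential_kernel(points, variance=1.0, bandwidth=1.0):
    """Kernel matrix K[i, j] = variance * exp(-||p_i - p_j||^2 / (2 * bandwidth^2)).

    points: sequence of n locations (scalars or d-dimensional coordinates).
    """
    X = np.asarray(points, dtype=float).reshape(len(points), -1)
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    return variance * np.exp(-sq_dist / (2.0 * bandwidth**2))

# Correlation decays with distance: nearby points are strongly correlated.
K = squared_exponential_kernel([0.0, 1.0, 4.0], variance=1.0, bandwidth=1.0)
print(np.round(K, 4))
```

With this convention, two points one bandwidth apart have correlation exp(−1/2), and points several bandwidths apart are nearly uncorrelated.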

9
Known parameters
Known parameters Θ (bandwidth, variance, etc.).
Mutual information does not depend on observed values.
No benefit in sequential design! max_A MI(A) = max_π MI(π)
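The equation that followed on the slide was lost in extraction. A standard Gaussian identity that supports the claim is given here (the notation K_{sA} for the cross-covariance between location s and set A is an assumption, not taken from the slide):

```latex
\sigma^2_{s \mid A} = K(s,s) - K_{sA} K_{AA}^{-1} K_{As},
\qquad
H(X_s \mid X_A = x_A) = \tfrac{1}{2} \log\!\left( 2 \pi e \, \sigma^2_{s \mid A} \right)
```

The observed values x_A do not appear on the right-hand side: with known kernel parameters, the conditional entropy, and hence the mutual information, depends only on which locations are observed.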

10
Unknown parameters
Unknown (discretized) parameters: prior P(Θ = θ).
Mutual information does depend on observed values, since the distribution over Θ depends on observations!
Sequential design can be better! max_A MI(A) ≤ max_π MI(π)

11
Key result: How big is the gap?
If Θ known: MI(A*) = MI(π*). If Θ “almost” known: MI(A*) ≈ MI(π*).
Theorem: MI of best policy ≤ MI of best param. spec. set + gap size, where the gap depends on H(Θ).
As H(Θ) → 0: MI of best policy → MI of best set.
[Figure: MI axis starting at 0, marking MI(A*), MI(π*), and the gap between them]

12
Near-optimal policy if parameter approximately known
Use the greedy algorithm to optimize MI(A_greedy | Θ) = Σ_θ P(θ) MI(A_greedy | θ).
Note: |MI(A | Θ) − MI(A)| ≤ H(Θ). Can compute MI(A | Θ) analytically, but not MI(A).
Corollary [using our result from ICML 05]: the result of the greedy algorithm is within ~63% of the optimal seq. plan, up to a gap that is ≈ 0 for known parameters.

13
Exploration-Exploitation for GPs
Parameters. Reinforcement learning: P(S_{t+1} | S_t, A_t), Rew(S_t). Active learning in GPs: kernel parameters.
Known parameters (exploitation). Reinforcement learning: find near-optimal policy by solving MDP! Active learning in GPs: find near-optimal policy by finding best set.
Unknown parameters (exploration). Reinforcement learning: try to quickly learn parameters! Need to waste only polynomially many robots! Active learning in GPs: try to quickly learn parameters. How many samples do we need?

14
Parameter info-gain exploration (IGE)
Gap depends on H(Θ).
Intuitive heuristic: greedily select s* = argmax_s I(Θ; X_s) = argmax_s H(Θ) − H(Θ | X_s), where H(Θ) is the parameter entropy before observing s and H(Θ | X_s) is the parameter entropy after observing s.
Does not directly try to improve spatial prediction.
No sample complexity bounds.
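A rough sketch of the info-gain criterion for discretized parameters follows. It assumes that, for each candidate location s and each parameter value θ, the Gaussian predictive mean and variance of X_s are already available (for example from GP conditioning, which is not shown). The expected posterior entropy is approximated by a simple Riemann sum; all names and the toy numbers are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def parameter_info_gain(prior, means, variances, grid_size=4001):
    """I(Theta; X_s) = H(Theta) - E_x[H(Theta | X_s = x)], by numerical integration."""
    prior = np.asarray(prior, dtype=float)
    means = np.asarray(means, dtype=float)
    std = np.sqrt(np.asarray(variances, dtype=float))
    x = np.linspace((means - 8 * std).min(), (means + 8 * std).max(), grid_size)
    dx = x[1] - x[0]
    # Likelihood p(x | theta) for each parameter value, shape (n_theta, grid_size).
    z = (x[None, :] - means[:, None]) / std[:, None]
    lik = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * std[:, None])
    joint = prior[:, None] * lik
    marginal = joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        post = np.where(marginal > 0, joint / marginal, 0.0)
        log_post = np.where(post > 0, np.log(post), 0.0)
    expected_posterior_entropy = -np.sum(marginal * np.sum(post * log_post, axis=0)) * dx
    return entropy(prior) - expected_posterior_entropy

# Two parameter hypotheses, two candidate locations (toy predictive distributions).
prior = np.array([0.5, 0.5])
candidates = {
    "s1": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),  # hypotheses predict the same
    "s2": (np.array([0.0, 0.0]), np.array([1.0, 4.0])),  # hypotheses disagree
}
gains = {s: parameter_info_gain(prior, m, v) for s, (m, v) in candidates.items()}
s_star = max(gains, key=gains.get)
print(gains, s_star)
```

A location where all hypotheses make the same prediction carries no information about Θ, so the heuristic prefers locations where the hypotheses disagree.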

15
Implicit exploration (IE)
Intuition: any observation will help us reduce H(Θ).
Sequential greedy algorithm: given previous observations X_A = x_A, greedily select s* = argmax_s MI({X_s} | X_A = x_A, Θ).
Contrary to a priori greedy, this algorithm takes observations into account (updates parameters).
Proposition: H(Θ | X_π) ≤ H(Θ). “Information never hurts” for policies.
No sample complexity bounds.
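The "updates parameters" step can be illustrated with a Bayesian update over a discretized bandwidth. This is a sketch under assumptions the slide does not state: a zero-mean GP, a squared exponential kernel of the form exp(−d² / (2 · bandwidth²)), a fixed noise variance, and hypothetical function names.

```python
import numpy as np

def se_kernel(x, bandwidth, variance=1.0):
    """Squared exponential kernel matrix for 1-D locations x."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-d**2 / (2.0 * bandwidth**2))

def log_marginal_likelihood(x_loc, y, bandwidth, noise=0.01):
    """Log density of observations y under a zero-mean GP with the given bandwidth."""
    K = se_kernel(x_loc, bandwidth) + noise * np.eye(len(x_loc))
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (quad + logdet + len(y) * np.log(2 * np.pi))

def update_parameter_posterior(prior, bandwidths, x_loc, y, noise=0.01):
    """P(theta | x_A) is proportional to P(theta) * P(x_A | theta), in log space."""
    log_post = np.log(prior) + np.array(
        [log_marginal_likelihood(x_loc, y, b, noise) for b in bandwidths]
    )
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Three nearby sensors reporting equal values favor the longer bandwidth.
bandwidths = [0.1, 3.0]
prior = np.array([0.5, 0.5])
x_loc = np.array([0.0, 0.5, 1.0])
y = np.array([1.0, 1.0, 1.0])
posterior = update_parameter_posterior(prior, bandwidths, x_loc, y)
print(posterior)
```

After each observation the sequential greedy rule would recompute its selection criterion under this updated posterior instead of the prior.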

16
Learning the bandwidth
Can narrow down the kernel bandwidth by sensing inside and outside the bandwidth distance!
Sensors within the bandwidth are correlated. Sensors outside the bandwidth are ≈ independent.
[Figure: kernel with its bandwidth marked, and sensors A, B, C]

17
Hypothesis testing: Distinguishing two bandwidths
Squared exponential kernel: choose pairs of samples at the distance where the correlation gap is largest to test correlation!
[Figure: correlation vs. distance under BW = 1 and under BW = 3, marking the distance at which the correlation gap is largest; sample plots for BW = 1 and BW = 3]
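To make the "distance with the largest correlation gap" concrete, the sketch below searches a grid of distances for two candidate bandwidths. The kernel parameterization exp(−d² / (2 · bandwidth²)) is an assumption (the slide's formula was lost), so the resulting distance depends on that convention; the function names are hypothetical.

```python
import numpy as np

def se_correlation(distance, bandwidth):
    """Correlation between two points at the given distance (assumed kernel form)."""
    return np.exp(-distance**2 / (2.0 * bandwidth**2))

def best_test_distance(bw_a, bw_b, max_distance=10.0, n_grid=10001):
    """Grid search for the distance where the two hypotheses disagree most."""
    d = np.linspace(0.0, max_distance, n_grid)
    gap = np.abs(se_correlation(d, bw_a) - se_correlation(d, bw_b))
    i = int(np.argmax(gap))
    return d[i], gap[i]

d_star, gap_star = best_test_distance(1.0, 3.0)
print(d_star, gap_star)
```

Under this parameterization the maximizer can also be derived in closed form, d² = (9/4) ln 9, which the grid search reproduces to within the grid spacing.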

18
Hypothesis testing: Sample complexity
Theorem: To distinguish bandwidths with minimum gap … in correlation and error < …, we need … independent samples.
In GPs, samples are dependent, but “almost” independent samples suffice! (details in paper)
Other tests can be used for variance/noise etc.
What if we want to distinguish more than two bandwidths?

19
Hypothesis testing: Binary searching for bandwidth
Find “most informative split” at posterior median.
Testing policy π_ITE needs only logarithmically many tests!
[Figure: posterior P(Θ) over candidate bandwidths 1 to 5]
Theorem: If we have tests with error < T then
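The median-split idea can be sketched as a binary search over sorted candidate bandwidths. This sketch assumes an error-free test oracle `is_below(threshold)`, which is hypothetical; the tests in the talk have a nonzero error probability, which this toy version ignores.

```python
import numpy as np

def median_split(posterior):
    """Index i such that candidates[:i] and candidates[i:] split at the posterior median."""
    cdf = np.cumsum(posterior)
    i = int(np.searchsorted(cdf, 0.5)) + 1
    # Keep both sides non-empty.
    return min(max(i, 1), len(posterior) - 1)

def binary_search_bandwidth(candidates, prior, is_below):
    """Repeatedly split the remaining candidates at the posterior median.

    is_below(threshold) is an assumed oracle: True if the test decides that
    the true bandwidth lies below the threshold.
    """
    cand = np.asarray(candidates, dtype=float)
    p = np.asarray(prior, dtype=float)
    tests = 0
    while len(cand) > 1:
        p = p / p.sum()
        i = median_split(p)
        threshold = 0.5 * (cand[i - 1] + cand[i])
        if is_below(threshold):
            cand, p = cand[:i], p[:i]
        else:
            cand, p = cand[i:], p[i:]
        tests += 1
    return cand[0], tests

# Eight candidate bandwidths, uniform prior, true bandwidth 6.
true_bandwidth = 6.0
found, n_tests = binary_search_bandwidth(
    np.arange(1, 9), np.full(8, 1 / 8), lambda t: true_bandwidth < t
)
print(found, n_tests)
```

With a uniform prior over eight candidates, three tests are enough, which matches the logarithmic behavior the slide describes.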

20
Exploration-Exploitation Algorithm
Exploration phase:
    Sample according to exploration policy.
    Compute bound on gap between best set and best policy.
    If bound < specified threshold, go to exploitation phase; otherwise continue exploring.
Exploitation phase:
    Use a priori greedy algorithm to select remaining samples.
For hypothesis testing, guaranteed to proceed to exploitation after logarithmically many samples!
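The control flow of the two phases can be sketched as a small driver function. Everything passed in (`explore_step`, `gap_bound`, `exploit`) is a hypothetical callback standing in for the exploration policy, the gap bound, and the a priori greedy algorithm; the toy callbacks at the bottom only exercise the loop.

```python
def explore_then_exploit(budget, threshold, explore_step, gap_bound, exploit):
    """Explore until the gap bound drops below the threshold, then exploit.

    explore_step(observations) -> next observation chosen by the exploration policy
    gap_bound(observations)    -> bound on the gap between best set and best policy
    exploit(observations, n)   -> n further samples chosen a priori (e.g. greedily)
    """
    observations = []
    while len(observations) < budget:
        observations.append(explore_step(observations))
        if gap_bound(observations) < threshold:
            break  # parameters are known well enough: switch phases
    remaining = budget - len(observations)
    return observations, exploit(observations, remaining)

# Toy callbacks: the bound shrinks as 1 / (number of observations).
explored, exploited = explore_then_exploit(
    budget=10,
    threshold=0.3,
    explore_step=lambda obs: len(obs),
    gap_bound=lambda obs: 1.0 / len(obs),
    exploit=lambda obs, n: list(range(n)),
)
print(len(explored), len(exploited))
```

In this toy run the bound first drops below 0.3 after four observations, so four samples are spent exploring and the remaining six are selected in the exploitation phase.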

21
Results
Temperature data. IGE: parameter info-gain. ITE: hypothesis testing. IE: implicit exploration.
[Figure: RMS error vs. number of observations, and parameter uncertainty vs. number of observations, for IE, ITE, and IGE]
None of the strategies dominates the others. Usefulness depends on the application.

22
Nonstationarity by spatial partitioning
Isotropic GP for each region, weighted by region membership: a spatially varying linear combination.
[Figure: stationary fit vs. nonstationary fit]
Problem: Parameter space grows exponentially in #regions!
Solution: Variational approximation (BK-style) allows efficient approximate inference. (Details in paper)

23
Results on river data
Nonstationary model + active learning lead to lower RMS error.
[Figure: RMS error vs. number of observations for IE nonstationary, IE isotropic, and a priori nonstationary; in the sample plot, larger bars = later sample]

24
Results on temperature data
IE reduces error most quickly. IGE reduces parameter entropy most quickly.
[Figure: RMS error vs. number of observations for IE isotropic, IGE nonstationary, IE nonstationary, and random nonstationary; parameter uncertainty vs. number of observations for IE nonstationary and IGE nonstationary]

25
Conclusions
Nonmyopic approach towards active learning in GPs.
If parameters are known, the greedy algorithm achieves near-optimal exploitation.
If parameters are unknown, perform exploration:
    Implicit exploration
    Explicit, using information gain
    Explicit, using hypothesis tests, with logarithmic sample complexity bounds!
Each exploration strategy has its own advantages.
Can use the bound to compute a stopping criterion.
Presented extensive evaluation on real-world data.
