1 Predicting Output from Computer Experiments Design and Analysis of Computer Experiments Chapter 3 Kevin Leyton-Brown

2 Overview
Overall program in this chapter:
– predict the output of a computer simulation
– we're going to review approaches to regression, looking for various kinds of optimality
First, we'll talk about just predicting our random variable (§3.2)
– note: in this setting, we have no "features"
Then, we'll consider the inclusion of features in our predictions, based on features in our training data (§3.2, 3.3)
In the end, we'll apply these ideas to computer experiments (§3.3)
Not covered:
– an empirical evaluation of seven EBLUPs on small-sample data (§3.3, pp. 69-81)
– proofs of some esoteric BLUP theorems (§3.4, pp. 82-84)
If you've done the reading you already know:
– the difference between "minimum MSPE linear unbiased predictors" and BLUPs
– three different "intuitive" interpretations of r_0^T R^{-1}(Y^n − FB)
– a lot about statistics
– whether this chapter has anything to do with computer experiments
If you haven't, you're in for a treat

3 Predictors
Y_0 is our random variable; our data is Y^n = (Y_1, …, Y_n)^T
– no "features": we just predict one response from the others
A generic predictor Ŷ_0 = Ŷ_0(Y^n) predicts Y_0 based on Y^n
– (the original slides drop the hat on Ŷ_0 to avoid PowerPoint agony)
There are three kinds of predictors discussed:
– "predictors": Ŷ_0(Y^n) has unrestricted functional form
– "linear predictors" (LPs): Ŷ_0 = a_0 + Σ_{i=1}^n a_i Y_i = a_0 + a^T Y^n
– "linear unbiased predictors" (LUPs): again linear, Ŷ_0 = a_0 + a^T Y^n, and furthermore "unbiased" with respect to a given family 𝓕 of distributions for (Y_0, Y^n)
Definition: a predictor Ŷ_0 is unbiased for Y_0 with respect to the class of distributions 𝓕 over (Y_0, Y^n) if E_F{Ŷ_0} = E_F{Y_0} for all F ∈ 𝓕.
– E_F denotes expectation under the distribution F(·) for (Y_0, Y^n)
– this definition depends on 𝓕: a linear predictor is unbiased with respect to a class
– as 𝓕 gets bigger, the set of LUPs gets weakly smaller

4 LUP Example 1
Suppose that Y_i = β_0 + ε_i, where ε_i ~ N(0, σ²_ε), σ²_ε > 0. Define 𝓕 as those distributions in which
– β_0 is a given nonzero constant
– σ²_ε is unknown, but σ²_ε > 0 is known
Any Ŷ_0 = a_0 + a^T Y^n is an LP of Y_0. Which are unbiased? We know that:
– E{Ŷ_0} = E{a_0 + Σ_{i=1}^n a_i Y_i} = a_0 + β_0 Σ_{i=1}^n a_i  (Eq 1)
– and E{Y_0} = β_0  (Eq 2)
For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ²_ε
– since (Eq 1) and (Eq 2) are independent of σ²_ε, we just need that, given β_0, a satisfies a_0 + β_0 Σ_{i=1}^n a_i = β_0
– solutions:
  a_0 = β_0 and a such that Σ_{i=1}^n a_i = 0 (the data-independent predictor Ŷ_0 = β_0)
  a_0 = 0 and a such that Σ_{i=1}^n a_i = 1
– e.g., the sample mean of Y^n is the LUP corresponding to a_0 = 0, a_i = 1/n

5 LUP Example 2
Suppose again that Y_i = β_0 + ε_i, where ε_i ~ N(0, σ²_ε), σ²_ε > 0. Define 𝓕 as those distributions in which
– β_0 is an unknown real constant
– σ²_ε is unknown, but σ²_ε > 0 is known
Any Ŷ_0 = a_0 + a^T Y^n is an LP of Y_0. Which are unbiased? We know that:
– E{Ŷ_0} = E{a_0 + Σ_{i=1}^n a_i Y_i} = a_0 + β_0 Σ_{i=1}^n a_i  (Eq 1)
– and E{Y_0} = β_0  (Eq 2)
For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ²_ε and ∀ β_0
– since (Eq 1) and (Eq 2) are independent of σ²_ε, we just need that a satisfies a_0 + β_0 Σ_{i=1}^n a_i = β_0 for every β_0
– solution: a_0 = 0 and a such that Σ_{i=1}^n a_i = 1 (the data-independent choice a_0 = β_0 is no longer available, since β_0 is unknown)
– e.g., the sample mean of Y^n is the LUP corresponding to a_0 = 0, a_i = 1/n
This illustrates that a LUP for 𝓕 is also a LUP for subfamilies of 𝓕
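Not from the slides: a minimal numpy sketch of the unbiasedness condition in Examples 1 and 2 (the sample size, σ_ε, and the "bad" weights are my own choices). Weights with a_0 = 0 and Σ a_i = 1 give E{Ŷ_0} = β_0 for every β_0; other weights do not.

import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 5, 2.0, 200_000

def expected_prediction(a0, a, beta0):
    # Monte Carlo estimate of E{a_0 + a' Y^n} under Y_i = beta_0 + eps_i
    Y = beta0 + sigma * rng.normal(size=(reps, n))
    return np.mean(a0 + Y @ a)

sample_mean = np.full(n, 1.0 / n)       # a_0 = 0, sum a_i = 1  ->  unbiased for every beta_0
bad_weights = np.full(n, 0.5)           # sum a_i = 2.5         ->  biased unless beta_0 = 0
for beta0 in (1.0, 3.0, -2.0):
    print(beta0,
          expected_prediction(0.0, sample_mean, beta0),   # ~ beta_0, matching E{Y_0}
          expected_prediction(0.0, bad_weights, beta0))   # ~ 2.5 * beta_0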

6 Best Mean Squared Prediction Error (MSPE) Predictors
Definition: MSPE(Ŷ_0, F) ≡ E_F{(Ŷ_0 − Y_0)²}
Definition: Ŷ_0 is a minimum MSPE predictor at F if, for any predictor Ŷ_0*, MSPE(Ŷ_0, F) ≤ MSPE(Ŷ_0*, F)
– we'll also call this a best MSPE predictor
"Fundamental theorem of prediction":
– the conditional mean of Y_0 given Y^n is the minimum MSPE predictor of Y_0 based on Y^n

7 Best Mean Squared Prediction Error (MSPE) Predictors
Theorem: Suppose that (Y_0, Y^n) has a joint distribution F for which the conditional mean of Y_0 given Y^n exists. Then Ŷ_0 = E{Y_0 | Y^n} is the best MSPE predictor of Y_0.
Proof: Fix an arbitrary predictor Ŷ_0* = Ŷ_0*(Y^n).
– MSPE(Ŷ_0*, F) = E_F{(Ŷ_0* − Y_0)²} = E_F{(Ŷ_0* − Ŷ_0 + Ŷ_0 − Y_0)²}
  = E_F{(Ŷ_0* − Ŷ_0)²} + MSPE(Ŷ_0, F) + 2 E_F{(Ŷ_0* − Ŷ_0)(Ŷ_0 − Y_0)}
  ≥ MSPE(Ŷ_0, F) + 2 E_F{(Ŷ_0* − Ŷ_0)(Ŷ_0 − Y_0)}  (Eq 3)
– for the cross term, condition on Y^n (both predictors are functions of Y^n):
  E_F{(Ŷ_0* − Ŷ_0)(Ŷ_0 − Y_0)} = E_F{(Ŷ_0* − Ŷ_0) E_F{(Ŷ_0 − Y_0) | Y^n}} = E_F{(Ŷ_0* − Ŷ_0)(Ŷ_0 − E_F{Y_0 | Y^n})} = E_F{(Ŷ_0* − Ŷ_0) × 0} = 0
– thus, MSPE(Ŷ_0*, F) ≥ MSPE(Ŷ_0, F) ∎
Notes:
– Ŷ_0 = E{Y_0 | Y^n} is essentially the unique best MSPE predictor: MSPE(Ŷ_0*, F) = MSPE(Ŷ_0, F) iff Ŷ_0* = Ŷ_0 almost everywhere
– Ŷ_0 = E{Y_0 | Y^n} is always unbiased: E{Ŷ_0} = E{E{Y_0 | Y^n}} = E{Y_0} (why can we condition here?)
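A Monte Carlo illustration of the theorem (my own sketch, not in the slides), using a bivariate normal with correlation ρ = 0.8, for which E{Y_0 | Y_1} = ρ Y_1: the conditional mean has a smaller MSPE than two other predictors based on Y_1.

import numpy as np

rng = np.random.default_rng(1)
rho, reps = 0.8, 1_000_000
cov = np.array([[1.0, rho], [rho, 1.0]])
Y0, Y1 = rng.multivariate_normal(np.zeros(2), cov, size=reps).T

def mspe(pred):
    return np.mean((pred - Y0) ** 2)

print(mspe(rho * Y1))    # conditional mean E{Y_0 | Y_1}: ~ 1 - rho^2 = 0.36
print(mspe(Y1))          # another function of Y_1:       ~ 2(1 - rho) = 0.40
print(mspe(0.0 * Y1))    # ignore the data entirely:      ~ Var(Y_0)   = 1.00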

8 Example Continued: Best MSPE Predictors
What is the best MSPE predictor when each Y_i ~ N(β_0, σ²_ε) with β_0 known?
– since the Y_i's are independent, [Y_0 | Y^n] = N(β_0, σ²_ε)
– thus, Ŷ_0 = E{Y_0 | Y^n} = β_0
What if σ²_ε is known, and Y_i ~ N(β_0, σ²_ε), but β_0 is unknown (i.e., [β_0] ∝ 1)?
– improper priors do not always give proper posteriors, but here: [Y_0 | Y^n = y^n] ~ N_1[ȳ, σ²_ε(1 + 1/n)], where ȳ is the sample mean of the training data y^n
– thus, the best MSPE predictor of Y_0 is Ŷ_0 = (Σ_i Y_i)/n
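The improper prior [β_0] ∝ 1 can be viewed as the limit of a proper N(0, τ²) prior; here is a small sketch (standard normal-normal conjugacy with made-up data, not from the slides) showing the posterior predictive of Y_0 converging to N(ȳ, σ²_ε(1 + 1/n)) as τ grows.

import numpy as np

rng = np.random.default_rng(2)
n, sigma, beta0_true = 8, 1.5, 4.0
y = beta0_true + sigma * rng.normal(size=n)       # training data
ybar = y.mean()

for tau in (1.0, 10.0, 1e3, 1e6):                 # prior beta_0 ~ N(0, tau^2)
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    post_mean = post_var * y.sum() / sigma**2
    # posterior predictive of Y_0 = beta_0 + eps_0
    print(tau, post_mean, post_var + sigma**2)

print(ybar, sigma**2 * (1 + 1 / n))               # the flat-prior limit quoted on the slide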

9 Now let's dive in to Gaussian Processes (uh oh…)
Consider the regression model from Chapter 2: Y_i ≡ Y(x_i) = Σ_{j=1}^p f_j(x_i) β_j + Z(x_i) = f^T(x_i) β + Z(x_i)
– each f_j is a known regression function
– β is an unknown nonzero p × 1 vector
– Z(x) is a zero-mean stationary Gaussian process with dependence specified by Cov{Z(x_i), Z(x_j)} = σ²_Z R(x_i − x_j) for some known correlation function R
Then the joint distribution of Y_0 = Y(x_0) and Y^n = (Y(x_1), …, Y(x_n)) is multivariate normal: (Y_0, Y^n)^T ~ N_{1+n}[(f_0^T β, (Fβ)^T)^T, σ²_Z [[1, r_0^T], [r_0, R]]], where F is the n × p matrix with i-th row f^T(x_i), r_0 = (R(x_0 − x_1), …, R(x_0 − x_n))^T, and R is the n × n matrix with (i, j) entry R(x_i − x_j)
The definition of unbiasedness and the conditional distribution of a multivariate normal then give (Eq 4) on the next slide

10 Gaussian Process Example Continued
The best MSPE predictor of Y_0 is Ŷ_0 = E{Y_0 | Y^n} = f_0^T β + r_0^T R^{-1}(Y^n − Fβ)  (Eq 4)
…But for what class of distributions 𝓕 is this true?
– Ŷ_0 depends on: the multivariate normality of (Y_0, Y^n), β, and R(·)
– thus the best MSPE predictor changes when β or R change; however, it remains the same for all σ²_Z > 0
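A numerical check (my own design points, regressors, and Gaussian correlation R(h) = exp(−θh²); none of these values come from the slides) that (Eq 4) coincides with the generic conditional-mean formula for the multivariate normal on slide 9, and that σ²_Z cancels out.

import numpy as np

rng = np.random.default_rng(3)
theta, sigma2_Z = 5.0, 2.0
x = np.linspace(0.05, 0.95, 6)                  # made-up training inputs
x0 = 0.37                                       # made-up prediction point
F = np.column_stack([np.ones(6), x])            # regressors f(x) = (1, x), so p = 2
beta = np.array([1.0, -2.0])                    # "known" beta for the one-stage model
f0 = np.array([1.0, x0])

R = np.exp(-theta * (x[:, None] - x[None, :]) ** 2)   # Gaussian correlation R(h) = exp(-theta h^2)
r0 = np.exp(-theta * (x0 - x) ** 2)
Yn = rng.multivariate_normal(F @ beta, sigma2_Z * R)  # one draw of the training outputs

resid = Yn - F @ beta
pred_eq4 = f0 @ beta + r0 @ np.linalg.solve(R, resid)                          # (Eq 4)
pred_mvn = f0 @ beta + (sigma2_Z * r0) @ np.linalg.solve(sigma2_Z * R, resid)  # MVN conditional mean

print(pred_eq4, pred_mvn, np.allclose(pred_eq4, pred_mvn))   # equal: sigma2_Z cancels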

11 Second GP example
Second example: analogous to the previous linear example, what if we add uncertainty about β?
– we assume that σ²_Z is known, although the authors say this isn't required
Now we have a two-stage model:
– the first stage, our conditional distribution of (Y_0, Y^n) given β, is the same distribution we saw before
– the second stage is our prior on β
One can show that the best MSPE predictor of Y_0 is Ŷ_0 = E{Y_0 | Y^n} = f_0^T E{β | Y^n} + r_0^T R^{-1}(Y^n − F E{β | Y^n})
– compare this to what we had in the one-stage case: Ŷ_0 = f_0^T β + r_0^T R^{-1}(Y^n − Fβ)
– the authors give a derivation; see the book

12 So what about E{β | Y^n}?
Of course, the formula for E{β | Y^n} depends on our prior on β
– when this prior is non-informative, we can derive [β | Y^n] ~ N_p[(F^T R^{-1} F)^{-1} F^T R^{-1} Y^n, σ²_Z (F^T R^{-1} F)^{-1}]
– this (somehow) gives us Ŷ_0 = f_0^T B + r_0^T R^{-1}(Y^n − FB)  (Eq 5), where B = (F^T R^{-1} F)^{-1} F^T R^{-1} Y^n
  (for PowerPoint reasons, I use B rather than β̂)
What sense can we make of (Eq 5)? It is:
1. the sum of the regression predictor f_0^T B and a "correction" r_0^T R^{-1}(Y^n − FB)
2. a function of the training data Y^n
3. a function of x_0, the point at which a prediction is made
   (recall that f_0 ≡ f(x_0) and r_0 ≡ (R(x_0 − x_1), …, R(x_0 − x_n))^T)
For the moment, we consider (1); we consider (2) and (3) in §3.3
– (that's right, we're still in §3.2!)
The correction term is a linear combination of the residuals Y^n − FB from the GP model f^T β + Z, with prediction-point-specific coefficients:
r_0^T R^{-1}(Y^n − FB) = Σ_i c_i(x_0) (Y^n − FB)_i, where the weight c_i(x_0) is the i-th element of R^{-1} r_0 and (Y^n − FB)_i is the i-th residual of the fitted model
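A sketch of (Eq 5) with the GLS estimator B (made-up inputs, outputs, and correlation function; this is not the book's example), confirming interpretation (1): the correction equals Σ_i c_i(x_0) times the i-th residual.

import numpy as np

theta = 8.0
x = np.linspace(0.05, 0.95, 7)                   # made-up training inputs
x0 = 0.42                                        # made-up prediction point
F = np.column_stack([np.ones_like(x), x])        # regressors f(x) = (1, x)
f0 = np.array([1.0, x0])
Yn = np.sin(2 * np.pi * x)                       # pretend these are the observed outputs

R = np.exp(-theta * (x[:, None] - x[None, :]) ** 2)
r0 = np.exp(-theta * (x0 - x) ** 2)
Rinv = np.linalg.inv(R)

B = np.linalg.solve(F.T @ Rinv @ F, F.T @ Rinv @ Yn)    # B = (F' R^-1 F)^-1 F' R^-1 Y^n
resid = Yn - F @ B                                      # residuals of the fitted model
c = Rinv @ r0                                           # c_i(x0) is the i-th element of R^-1 r0

print(f0 @ B + r0 @ Rinv @ resid)                          # (Eq 5): regression term + correction
print(np.allclose(r0 @ Rinv @ resid, np.sum(c * resid)))   # correction = sum_i c_i(x0) * residual_i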

13 Example
Suppose the true unknown curve is the 1D dampened cosine: y(x) = e^{−1.4x} cos(7πx/2)
7-point training set:
– x_1 drawn from [0, 1/7]
– x_i = x_1 + (i − 1)/7
Consider predicting y using a stationary GP Y(x) = β_0 + Z(x)
– Z has zero mean, variance σ²_Z, and correlation function R(h) = e^{−136.1 h²}
– F is a 7 × 1 column vector of ones, i.e., we have no features, just an intercept β_0
Using the regression/correction interpretation of (Eq 5), we can write: Ŷ(x_0) = B_0 + Σ_{i=1}^7 c_i(x_0)(Y_i − B_0)
– c_i(x_0) is the i-th element of R^{-1} r_0
– (Y_i − B_0) are the residuals from fitting the constant model

14 Example continued
Consider predicting y(x_0) at x_0 = 0.55 (plotted as a cross on the slide's figure)
– the residuals (Y_i − B_0) and their associated weights c_i(x_0) are plotted there as well
Note:
– weights can be positive or negative
– the correction to the regression term B_0 is based primarily on the residuals at the training points closest to x_0; the weights for the 3 furthest training points are indistinguishable from zero
– the fitted ŷ interpolates the data
– what does the whole curve look like? We need to wait for §3.3 to find out…

15 …but I’ll show you now anyway!
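The figures did not survive the transcript, so here is a minimal numpy reconstruction of the example from the previous two slides (x_1 is random, so the numbers will differ from the original plots): the 7-point design, the constant-mean predictor of (Eq 5), the weights c_i(0.55), and the predicted curve on a grid.

import numpy as np

rng = np.random.default_rng(5)

def y_true(x):
    return np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)      # the dampened cosine

x1 = rng.uniform(0, 1 / 7)                                    # x_1 drawn from [0, 1/7]
x = x1 + np.arange(7) / 7                                     # x_i = x_1 + (i - 1)/7
Y = y_true(x)                                                 # deterministic "simulator" output

def corr(a, b):
    return np.exp(-136.1 * (a[:, None] - b[None, :]) ** 2)    # R(h) = exp(-136.1 h^2)

R = corr(x, x)
Rinv = np.linalg.inv(R)
F = np.ones((7, 1))
B0 = np.linalg.solve(F.T @ Rinv @ F, F.T @ Rinv @ Y).item()   # GLS intercept B_0

def predict(x0):
    c = corr(np.atleast_1d(x0), x) @ Rinv                     # rows of weights c_i(x0)
    return B0 + c @ (Y - B0), c

yhat, c = predict(np.array([0.55]))
print(B0, yhat[0], y_true(0.55))               # intercept, prediction at 0.55, true value
print(np.round(c[0], 3))                       # only the points nearest 0.55 get sizable weight

grid = np.linspace(0, 1, 11)
print(np.round(predict(grid)[0], 3))           # the "whole curve" previewed on this slide
print(np.allclose(predict(x)[0], Y))           # the predictor interpolates the training data

With θ = 136.1, the correlation R(h) drops below 0.01 once |h| exceeds roughly 0.18, which is why the residuals nearest x_0 dominate the correction in the plot on slide 14.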

16 Interpolating the data
The correction term r_0^T R^{-1}(Y^n − FB) forces the model to interpolate the data:
– suppose x_0 is x_i for some i ∈ {1, …, n}
– then f_0 = f(x_i), and r_0 = (R(x_i − x_1), …, R(x_i − x_n))^T, which is the i-th row (equivalently, since R is symmetric, the i-th column) of R
– so R^{-1} r_0 is the i-th column of R^{-1} R = I_n, the identity matrix; thus R^{-1} r_0 = (0, …, 0, 1, 0, …, 0)^T = e_i, the i-th unit vector
– hence: r_0^T R^{-1}(Y^n − FB) = e_i^T (Y^n − FB) = Y_i − f^T(x_i) B
– and so, by (Eq 5), Ŷ(x_0) = f^T(x_i) B + (Y_i − f^T(x_i) B) = Y_i
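A direct numerical check of this identity (an arbitrary made-up design with Gaussian correlation): when x_0 coincides with a training point x_i, R^{-1} r_0 is the i-th unit vector, so the predictor reproduces Y_i exactly.

import numpy as np

x = np.linspace(0.05, 0.95, 6)                            # any training design
R = np.exp(-20.0 * (x[:, None] - x[None, :]) ** 2)        # Gaussian correlation, theta = 20
i = 3
r0 = np.exp(-20.0 * (x[i] - x) ** 2)                      # x_0 = x_i, so r_0 is the i-th row of R

e_i = np.zeros(len(x))
e_i[i] = 1.0
print(np.allclose(np.linalg.solve(R, r0), e_i))           # R^{-1} r_0 = e_i  ->  True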

17 An example showing that best MSPE predictors need not be linear
Suppose that (Y_0, Y_1) has the joint distribution shown on the slide.
Then the conditional distribution of Y_0 given Y_1 = y_1 is uniform over the interval (0, y_1²).
The best MSPE predictor of Y_0 is the center of this interval: Ŷ_0 = E{Y_0 | Y_1} = Y_1²/2
The minimum MSPE linear unbiased predictor is Ŷ_0^L = −1/12 + (1/2) Y_1
– based on a bunch of calculus
Their MSPEs are very similar:
– E{(Y_0 − Y_1²/2)²} ≈ 0.01667
– E{(Y_0 − (−1/12 + (1/2) Y_1))²} ≈ 0.01806
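The joint-distribution display is missing from the transcript; the quoted numbers are consistent with Y_1 ~ Uniform(0, 1) and Y_0 | Y_1 = y_1 ~ Uniform(0, y_1²), and under that assumption a Monte Carlo sketch reproduces both MSPEs.

import numpy as np

rng = np.random.default_rng(6)
reps = 2_000_000
Y1 = rng.uniform(0, 1, size=reps)            # assumed marginal (reconstructed, not from the slide)
Y0 = rng.uniform(0, 1, size=reps) * Y1 ** 2  # Y_0 | Y_1 = y_1 ~ Uniform(0, y_1^2)

best = Y1 ** 2 / 2                           # conditional-mean predictor
linear = -1 / 12 + Y1 / 2                    # minimum-MSPE linear unbiased predictor

print(np.mean((Y0 - best) ** 2))             # ~ 0.01667
print(np.mean((Y0 - linear) ** 2))           # ~ 0.01806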

18 Best Linear Unbiased MSPE Predictors
Minimum MSPE predictors depend on the joint distribution of Y^n and Y_0
– thus, they tend to be optimal only within a very restricted class 𝓕
In an attempt to find predictors that are more broadly optimal, consider:
1. predictors that are linear in Y^n; the best of these are called best linear predictors (BLPs)
2. predictors that are both linear in Y^n and unbiased for Y_0; the best of these are called best linear unbiased predictors (BLUPs)

19 BLUP Example 1
Recall our first example:
– Y_i = β_0 + ε_i, where ε_i ~ N(0, σ²_ε), σ²_ε > 0
– 𝓕 is the set of distributions in which β_0 is a given nonzero constant and σ²_ε is unknown, but σ²_ε > 0 is known
– any Ŷ_0 = a_0 + a^T Y^n is a LUP of Y_0 if a_0 + β_0 Σ_{i=1}^n a_i = β_0
The MSPE of a linear unbiased predictor Ŷ_0 = a_0 + a^T Y^n is
– E{(a_0 + Σ_{i=1}^n a_i Y_i − Y_0)²} = E{(a_0 + Σ_i a_i(β_0 + ε_i) − β_0 − ε_0)²}
  = (a_0 + β_0 Σ_i a_i − β_0)² + σ²_ε Σ_i a_i² + σ²_ε
  = σ²_ε (1 + Σ_i a_i²)  (Eq 6)
  ≥ σ²_ε  (Eq 7)
– we have equality in (Eq 6) because Ŷ_0 is unbiased, so the squared-bias term vanishes
– we have equality in (Eq 7) iff a_i = 0 for all i ∈ {1, …, n} (and hence a_0 = β_0)
– thus, the unique BLUP is Ŷ_0 = β_0

20 BLUP Example 2
Consider again the enlarged model 𝓕 with β_0 an unknown real and σ²_ε > 0
– recall that every unbiased Ŷ_0 = a_0 + a^T Y^n must satisfy a_0 = 0 and Σ_i a_i = 1
– the MSPE of Ŷ_0 is E{(Σ_i a_i Y_i − Y_0)²} = (β_0 Σ_i a_i − β_0)² + σ²_ε Σ_i a_i² + σ²_ε = 0 + σ²_ε (1 + Σ_i a_i²)  (Eq 8)  ≥ σ²_ε (1 + 1/n)  (Eq 9)
– the squared-bias term in (Eq 8) vanishes because Σ_i a_i = 1
– (Eq 9): Σ_i a_i² is minimized subject to Σ_i a_i = 1 when a_i = 1/n
Thus the sample mean Ŷ_0 = (1/n) Σ_i Y_i is the best linear unbiased predictor of Y_0 for the enlarged 𝓕.
– How can the BLUP for a large class not also be the BLUP for a subclass (didn't we see a claim to the contrary earlier)?
  The earlier claim was that every LUP for a class is also a LUP for a subclass; the analogous statement doesn't hold for BLUPs.
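A quick simulation check of (Eq 8) and (Eq 9), with my own choice of n, σ_ε, and weights: any unbiased weights give MSPE σ²_ε(1 + Σ_i a_i²), and the sample mean attains the lower bound σ²_ε(1 + 1/n).

import numpy as np

rng = np.random.default_rng(7)
n, sigma, beta0, reps = 4, 1.0, 2.5, 1_000_000

def mspe(a):
    eps = sigma * rng.normal(size=(reps, n + 1))        # eps_1..eps_n and eps_0
    Yn, Y0 = beta0 + eps[:, :n], beta0 + eps[:, n]
    return np.mean((Yn @ a - Y0) ** 2)                  # a_0 = 0 for unbiased predictors here

mean_w = np.full(n, 1 / n)                   # the sample mean: sum a_i = 1, sum a_i^2 = 1/n
other_w = np.array([0.7, 0.3, 0.0, 0.0])     # also unbiased, but larger sum a_i^2
print(mspe(mean_w), sigma**2 * (1 + 1 / n))                 # ~ 1.25, the bound in (Eq 9)
print(mspe(other_w), sigma**2 * (1 + np.sum(other_w**2)))   # ~ 1.58, as in (Eq 8)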

21 BLUP Example 3
Consider the measurement error model: Y_i = Y(x_i) = Σ_j f_j(x_i) β_j + ε_i, where the f_j are known regression functions, the β_j are unknown, and each ε_i ~ N(0, σ²_ε)
Consider the BLUP of Y(x_0) for unknown β and σ²_ε > 0
A linear predictor Ŷ_0 = a_0 + a^T Y^n is unbiased provided that, for all (β, σ²_ε), E{a_0 + a^T Y^n} = a_0 + a^T F β is equal to E{Y_0} = f^T(x_0) β
– this implies a_0 = 0 and F^T a = f(x_0)
The BLUP of Y_0 is Ŷ_0 = f^T(x_0) B
– where B = (F^T F)^{-1} F^T Y^n is the ordinary least squares estimator of β
– and the BLUP is unique
This is proved in the chapter notes, §3.4.
…and now we've reached the end of §3.2!
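A minimal sketch of this BLUP with made-up regressors f(x) = (1, x) (not the book's example): B is the ordinary least squares fit, and the BLUP at x_0 is f^T(x_0) B.

import numpy as np

rng = np.random.default_rng(8)
beta, sigma = np.array([1.0, -2.0]), 0.3          # hypothetical true coefficients and noise sd
x = np.linspace(0, 1, 10)
F = np.column_stack([np.ones_like(x), x])         # regressors f(x) = (1, x)
Yn = F @ beta + sigma * rng.normal(size=len(x))   # the measurement error model

B = np.linalg.lstsq(F, Yn, rcond=None)[0]         # ordinary least squares: (F'F)^{-1} F' Y^n

x0 = 0.4
f0 = np.array([1.0, x0])
print(B)                                          # close to the true beta
print(f0 @ B)                                     # the BLUP of Y(x0); close to f(x0)' beta = 0.2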

22 …that’s all for today!

23 Prediction for Computer Experiments
The idea is to build a "surrogate" or "simulator":
– a model that predicts the output of a simulation, to spare you from having to run the actual simulation
– neural networks, splines, and GPs all work; guess what, they like GPs
Let f_1, …, f_p be known regression functions, β be a vector of unknown regression coefficients, and Z be a stationary GP on X having zero mean, variance σ²_Z, and correlation function R.
– Then we can see experimental output Y(x) as the realization of the random function Y(x) = Σ_{j=1}^p f_j(x) β_j + Z(x) = f^T(x) β + Z(x)
This model implies that Y_0 and Y^n have the multivariate normal distribution from slide 9, where β and σ²_Z > 0 are now unknown
Finally, we can drop the Gaussian assumption and consider a nonparametric moment model based on an arbitrary second-order stationary process Z, again with unknown β and σ²_Z
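To make "output as the realization of a random function" concrete, a small sketch (all parameter values are my own choices) drawing a few realizations of Y(x) = β_0 + Z(x) on a grid; each draw is one smooth curve that could play the role of a simulator's response surface.

import numpy as np

rng = np.random.default_rng(9)
beta0, sigma2_Z, theta = 1.0, 0.5, 50.0
xg = np.linspace(0, 1, 200)
R = np.exp(-theta * (xg[:, None] - xg[None, :]) ** 2)   # stationary Gaussian correlation
cov = sigma2_Z * R + 1e-8 * np.eye(len(xg))             # small jitter for numerical stability

# three realizations of the random function Y(x) = beta_0 + Z(x) on the grid
draws = rng.multivariate_normal(beta0 * np.ones(len(xg)), cov, size=3)
print(draws.shape)
print(np.round(draws[:, :5], 3))   # each row is one smooth sample path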

24 Conclusion: is this the right tool when we can get lots of data?

