2Cost of surrogatesIn linear regression, the process of fitting involves solving a set of linear equations once.With radial basis neural networks we have to optimize the selection of neurons, which will again entail multiple solutions of the linear system. We may find the best spread by minimizing cross-validation errors.Kriging, our next surrogate is even more expensive, we have a spread constant in every direction and we have to perform optimization to calculate the best set of constants (hyperparameters).With many hundreds of data points this can become significant computational burden.For moving least squares, we need to form and solve the system at every prediction point.
3Introduction to Kriging Method invented in the 1950s by South African geologist Danie G. Krige ( ) for predicting distribution of minerals.Formalized by French engineer, Georges Matheron in 1960.Statisticians refer to a more general Gaussian Process regression.Became very popular for fitting surrogates to expensive computer simulations in the 21st century.It is one of the best surrogates available.It probably became popular late mostly because of the high computer cost of fitting it to data.Kriging was invented in the 1950s by Daniel Krige (Professor at University of the Witwatersrand, Republic of South Africa), a South African geologist , for the purpose of predicting the distribution of minerals on the basis of samples. It became very popular for fitting surrogates to expensive computer simulations only about years later. This is at least partly because it is an expensive surrogate in terms of computation, and it had to wait until computers became fast enough.My experience is that of all currently popular surrogates, it has the highest chance of being the most accurate for a given problem. However, from my experience that chance is still less than 50%.Krige, D. G. (1951). "A statistical approach to some basic mine valuation problems on the Witwatersrand". J. of the Chem., Metal. and Mining Soc. of South Africa 52 (6): 119–139.Matheron, G., "Principles of geostatistics", Economic Geology, 58, pp 1246–1266, 1963
4Kriging philosophyWe assume that the data is sampled from an unknown function that obeys simple correlation rules.The value of the function at a point is correlated to the values at neighboring points based on their separation in different directions.The correlation is strong to nearby points and weak with far away points, but strength does not change based on location.Normally Kriging is used with the assumption that there is no noise so that it interpolates exactly the function values.For linear regression we normally assume that we know the functional shape (e.g. a polynomial) and the data is to be used to find the coefficients that will minimize the root mean square errors at the data points. Kriging takes a very different approach. It assumes that we don’t know much about the function, except for the form of correlation between the value of the function at nearby points. In particular, the correlation depends only on the distance between points and decays as they are further apart.Kriging is a smooth predictor. If two points are close, then their Kriging estimates are close.Kriging is usually used as an interpolator, so that if fits exactly the data, and we could not use the rms error as a way to find its parameters. That is, we assume that there is no noise in the data. There is a version of kriging with noise (often called kriging with a nugget).Because of the correlation decay, kriging is a local surrogate with shape functions that are similar to those used for radial basis surrogates, except that the decay can be different in different directions.
5Reminder: Covariance and Correlation Covariance of two random variables X and YThe covariance of a random variable with itself is the square of the standard deviation.Covariance matrix for a vector contains the covariances of the components.CorrelationThe correlation matrix has 1 on the diagonal.
6Correlation between function values at nearby points for sine(x) Generate 10 random numbers, translate them by a bit (0.1), and by more (1.0)x=10*rand(1,10)xnear=x+0.1; xfar=x+1;Calculate the sine function at the three sets.ynear=sin(xnear)y=sin(x)yfar=sin(xfar)Compare corelations: r=corrcoef(y,ynear) ; rfar=corrcoef(y,yfar)Decay to about 0.4 over one sixth of the wavelength.Wavelength on sine function is 2𝜋~6To illustrate the values of correlations that are expected between function values, we generate random numbers between 0 and 10 and evaluate the sine function at these points. We also translate the points by a small amount compared to the wavelength (0.1) and a larger amount (1.0) and calculate the correlation coefficients between the function values in the original and translated sets.The correlation coefficient is about 0.99 with the nearby points and 0.42 with the set further away. This reflects the change in function values, as illustrated by the pair of points marked in red.In kriging, finding the rate of correlation decay is part of the fitting process. This example shows us that with a wavy function we can expect the correlation to decay to about 0.4 over one sixth of the wavelength.
7Gaussian correlation function Correlation between point x and point sWe would like the correlation to decay to about 0.4 at one sixth of the wavelength 𝑙 𝑖 .Approximately 𝜃 𝑖 𝑙 𝑖 =1 or 𝜃 𝑖 = 36 𝑙 𝑖 2For the function 𝑦=sin( 𝑥 1 )∗sin(5 𝑥 2 we would like to estimate 𝜃 1 ≈1, 𝜃 2 ≈25The most popular correlation decay function for kriging is Gaussian. Given two points x (with components 𝑥 𝑖 ) and s we assume that the correlation of the unknown function Z at these two points is given as𝐶 𝑍(𝐱),𝑍(𝐬),𝛉 = 𝑖=1 𝑁 𝑣 exp − 𝜃 𝑖 𝑥 𝑖 − 𝑠 𝑖 2Where the larger 𝜃 𝑖 is, the faster the decay of the correlation with distance. For 𝜃 𝑖 =1, the decay in the ith direction is 1/e=0.37, which approximately corresponds to what we found for one sixth of the wavelength for the function sin(x). So if the wavelength in the ith direction is 𝑙 𝑖 , we should have approximately𝜃 𝑖 𝑙 𝑖 = or 𝜃 𝑖 = 36 𝑙 𝑖 2So for the function 𝑦=sin( 𝑥 1 )∗sin(5 𝑥 2 we have 𝑙 1 =2𝜋≈6, 𝑙 2 ≈1.2 ~(6/5) so we should use 𝜃 1 ≈1, 𝜃 2 ≈25.The gaussian correlation function can also be generally referred to as gaussian kernel or basis functions.
8Universal KrigingTrend modelSystematic departureTrend function is most often a low order polynomialWe will cover Ordinary Kriging, where linear trend is just a constant to be estimated by data.There is also Simple Kriging , where constant is assumed to be known (often as 0).Assumption: Systematic departures Z(x) are correlated.Kriging prediction comes with a normal distribution of the uncertainty in the prediction.xyKrigingSampling data pointsSystematic DepartureLinear Trend ModelThe most general version of kriging is called universal kriging, and it includes a linear trend function and the deviation 𝑍(𝐱 which includes the local contributions and is governed by the correlation function.𝑦 (𝐱)= 𝑖=1 𝑁 𝑣 𝛽 𝑖 𝜉 𝑖 (𝐱 +𝑍(𝐱)The trend function is composed of known shape functions 𝜉 𝑖 𝐱 and coefficients 𝛽 𝑖 to be determined. The shape functions are typically monomials, so that popular trend functions are linear and quadratic polynomials.Here we will limit the description to a constant trend, a version of kriging known as ordinary kriging, with the constant estimated from the data. Another version of kriging, known as simple kriging is when the constant is assumed to be known.As in linear regression, the kriging fit, shown in red in the figure, is only the mean of a normal distribution, and the distribution can be used to provide estimates of uncertainty in the kriging predictions, as well as guide the placements of further sampling points so as to reduce the uncertainty in the kriging prediction.
9NotationThe function values are given at 𝑛 𝑦 points 𝐱 𝑖 , 𝑖=1,..., 𝑛 𝑦 , with the point 𝐱 𝑖 having components 𝑥 𝑘 𝑖 , k=1,…,n (number of dimensions).The function value at the ith point is 𝑦 𝑖 =y( 𝐱 𝑖 ), and the vector of 𝑛 𝑦 function values is denoted y.Given decay rates 𝜃 𝑘 , we form the covariance matrix of the dataThe correlation matrix R above is formed from the covariance matrix, assuming a constant standard deviation 𝜎, which measures the uncertainty in function values.For dense data, 𝜎 will be small, for sparse data 𝜎 will be large.How do you decide whether the data is sparse or dense?We use subscripts to denote components of a vector, and superscript to identify data points. We are in n dimensional space, and we have 𝑛 𝑦 data points.In the process of fitting the data, we will also need to estimate the decay rates 𝜃 𝑘 and the standard deviation 𝜎. The standard deviation will be small when the data is dense. Dense means that we have more than enough points to describe the waviness of the function. That means that the distance from one point to the nearest one should be an order of magnitude less than the wavelength of the function.
10Prediction and shape functions Ordinary Kriging prediction formulaThe equation is linear in r, which means that the exponentials 𝑟 𝑖 may be viewed as basis functions.The equation is linear in the data y, in common with linear regression, but b is not calculated by minimizing RMS.Note that far away from data 𝑦 (𝐱)~ 𝜇 (not good for substantial extrapolation).The form of the surrogate for ordinary kriging is𝑦 (𝐱)= 𝜇 + 𝐫 𝑇 𝑅 −1 𝐲= 𝜇 + 𝐛 𝑇 𝐫Where 𝜇 is the constant, the estimated mean value. The vector b of coefficient is 𝒃=𝑅 −1 𝐲, and the shape functions that it multiplies are the exponentials𝑟 𝑖 =exp − 𝑘=1 𝑛 𝜃 𝑘 𝑥 𝑘 𝑖 − 𝑥 𝑘 2So kriging is similar to linear regression, which is a linear combination of unknown coefficients time known shape function. However, there are two important differences. First the shape functions are not known, in that the decay rates 𝜃 𝑘 will be estimated from the data (in the next slide). Second, the coefficients are not found by minimizing the rms error at the data points (which is zero, because the data is exactly interpolated).Note that since the exponentials decay away from the data points, when we are far from all data points the prediction will be close to 𝜇 . So ordinary kriging is not good for substantial extrapolation. If we expect to need to extrapolate far from data points, we should use universal kriging with a trend.
11Fitting the data Fitting refers to finding the parameters 𝜃 𝑘 . We fit by maximizing the likelihood that the data comes from a Gaussian process defined by 𝜃 𝑘 .ℒ 𝜃 𝑌,𝑋 =𝑃(𝑌|𝑋,𝜃)Once they are found, the estimate of the mean and standard deviation is obtained asMaximum likelihood is a tough optimization problem.Some kriging codes minimize the cross validation error.Stationary covariance assumption: smoothness of response is quite uniform in all regions of input space.The basic assumption of kriging is that the function is a realization of a Gaussian process with given mean and covariance matrix. The fitting process consists of finding the mean and the covariance matrix from the data. The most common way of doing that is to find the mean and covariance that maximize the likelihood that the data came from them. It can be shown that this likelihood is𝐿= 1 2𝜋 𝜎 2 𝑛 2 exp[− 𝑖=1 𝑛 𝑦 𝑦 𝑖 −𝜇 2 𝜎 2 = 1 2𝜋 𝜎 2 𝑛 2 |𝑅 | exp 𝐲−𝟏𝜇 𝑇 𝑅 −1 𝐲−𝟏𝜇 2 𝜎 2Where 1 denotes a vector of ones.The dependence on the 𝜃 𝑘 is complex since it is part of the matrix R, and so finding the 𝜃 𝑘 that maximize the likelihood is done numerically. It is a tough optimization problem with normally large number of local maxima. On the other hand, given the 𝜃 𝑘 , one can find 𝜇 and 𝜎 that will maximize the likelihood by differentiating the log of the likelihood with respect to these two parameters. This yields the following estimates.𝜇 = 𝟏 𝑇 𝑅 −1 𝐲 𝟏 𝑇 𝑅 −1 𝟏 𝜎 2 = 𝐲−𝟏𝜇 𝑇 𝑅 −1 𝐲−𝟏𝜇 𝑛Instead of using a maximum likelihood approach for the parameters, cross validation is also used by some. This means leaving out one of the points, fitting the rest with the parameters, and calculating the error at the left-out point. This is repeated for every point to produce a total error measure (e.g. rms). Then the set of parameters that yields the lowest cross validation error is chosen.Stationary covariance assumption: smoothness of response is quite uniform in all regions of input space. So, once the covariance matrix is found using the initial data, it is constant throughout the input space.There has also been some work on non-stationary covariance based kriging also.Xiong, Y., Chen, W., Apley, D., & Ding, X. (2007). A non‐stationary covariance‐based Kriging method for metamodelling in engineering design. International Journal for Numerical Methods in Engineering, 71(6),
12Prediction variance Square root of variance is called standard error. The uncertainty at any x is normally distributed.𝑦 𝑥 represents the mean of Kriging prediction.Once we have fitted the kriging parameters, besides the surrogate estimate, 𝑦 (𝐱 , we also get a prediction variance or the variance of the estimate as𝑉 𝑦 (𝐱 = 𝜎 2 1− 𝐫 𝑇 𝑅 −1 𝐫+ 1− 𝟏 𝑇 𝑅 −1 𝐫 𝟏 𝑇 𝑅 −1 𝐫The square root of V is estimate of the standard deviation, called the standard error. It is often shown along the kriging fit by a shaded area as can be seen in the top right figure.In general it is high far from data points and low near data points. It can be used to decide where to put additional data points in order to reduce the uncertainty in the kriging predictions (that is add points where V is maximal).For some applications we take advantage that the distribution of uncertainty about the prediction is normal, with 𝑦 (𝐱 being the mean and the standard error being the standard deviation, as shown in the left figure. This entire distribution is used for the Efficient Global Optimization in order to find points that have maximal expected improvement over the best sample value available so far.For derivations one can refer to: Sacks, J., Welch, W.J., Mitchell, T.J. and Wynn, H.P. (1989), Design and analysis of computer experiments (with discussion), Statistical Science 4: 409–435.12
13Kriging fitting issues The maximum likelihood or cross-validation optimization problem solved to obtain the kriging fit is often ill-conditioned leading to poor fit, or poor estimate of the prediction variance.Poor estimate of the prediction variance can be checked by comparing it to the cross validation error.Poor fits are often characterized by the kriging surrogate having large curvature near data points (see example on next slide).It is recommended to visualize by plotting the kriging fit and its standard error.
14Example of poor fits. True function 𝑦= 𝑥 2 +5𝑥−10 Kriging Model Trend model : 𝜉(𝑥)=0Covariance function : c𝑜𝑣 𝑥 𝑖 , 𝑥 𝑗 =exp(−𝜃 ( 𝑥 𝑖 − 𝑥 𝑗 ) 2 )The example of fiting a quadratic function was used with several kriging packages, using the points shown in the figure or also additional points outside the range. The two cases shown on the next slide are extreme cases for this simple function, but they show the kind of behavior that is occasionally encountered and can be detected by plotting the kriging fit and bounds representing two standard errors.
15𝜃 selection 𝑃𝑜𝑜𝑟 𝑓𝑖𝑡 𝑎𝑛𝑑 Good fit and 𝑝𝑜𝑜𝑟 𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑒𝑟𝑟𝑜𝑟 poor standard errorSE: standard error
16Kriging with nuggetsNuggets – refers to the inclusion of noise at data points.The more general Gaussian Process surrogates or Kriging with nuggets can handle data with noise (e.g. experimental results with noise).
17Introduction to optimization with surrogates Based on cycles. Each consists of sampling design points by simulations, fitting surrogates to simulations and then optimizing an objective.Construct surrogate, optimize original objective, refine region and surrogate.Typically small number of cycles with large number of simulations in each cycle.Adaptive samplingConstruct surrogate, add points by taking into account not only surrogate prediction but also uncertainty in prediction.Most popular, Jones’s EGO (Efficient Global Optimization).Easiest with one added sample at a time.Optimization with surrogates goes through cycles, with each cycle consisting of performing a number of simulations, fitting a surrogates to these simulations (possibly using also simulations from previous cycles) and then optimizing based on the surrogates. A termination test is undertaken, and if it is not satisfied, another cycle is undertaken. The optimization algorithm used to solve the optimization problem in each cycle is less critical than in optimization without surrogates, because the objective functions and constraints are inexpensive to evaluate from the surrogates.The other strategy is called adaptive sampling. Its best known algorithm is called EGO (efficient global optimization) and it is the subject of another lecture. In adaptive sampling we add one or small number of simulations in each cycle. The points selected for sampling are based not only on the surrogate value but also on its estimate of the uncertainty in its predictions. So points with high uncertainty have a chance of being sampled even if the surrogate prediction there is not as good as at other points with less uncertainty.
18Problems KrigingFit the quadratic function of Slide 13 with kriging using different options, like different covariance (different 𝜃s) and trend functions and compare the accuracy of the fit.For this problem compare the standard error with the actual error.