# Sampling plans for linear regression

## Presentation on theme: "Sampling plans for linear regression"— Presentation transcript:

Sampling plans for linear regression
Given a domain, we can reduce the prediction error by a good choice of the sampling points. The choice of sampling locations is called “design of experiments” or DOE. In this lecture we consider DOEs for linear regression using linear and quadratic polynomials, where the errors are due to noise in the data. With a given number of points, the best DOE is the one that minimizes the prediction variance (reviewed in the next few slides). The simplest DOE is the full factorial design, where we sample each variable (factor) at a fixed number of values (levels). Example: with four factors and three levels each we sample $3^4 = 81$ points. Full factorial design is not practical except in low dimensions.

This lecture is about sampling plans that reduce the prediction errors for linear regression when the main source of error is assumed to be noise in the data. We assume throughout that we fit the data with either a linear or a quadratic polynomial. When noise is not the main issue, and the error is instead due to fitting an unknown function, we often fit with other surrogates such as Kriging. Then other sampling plans, such as Latin Hypercube Sampling (LHS), are used. We also mostly discuss sampling points in a box-like domain. In that case the simplest DOE is a grid, where we sample each direction (factor) at a fixed number of levels. This is called a full factorial design.
For example, if we have three variables in the unit box, $0 \le x_1, x_2, x_3 \le 1$, we may decide to sample $x_1$ at five locations (0, 0.25, 0.5, 0.75, 1), $x_2$ at two locations (0, 1), and $x_3$ at three locations (0, 0.5, 1). All possible combinations of $x_1, x_2, x_3$ give a total of $5 \times 2 \times 3 = 30$ samples. If $x_1$ varies fastest, the first sample is (0, 0, 0), the second is (0.25, 0, 0), and the last two (#29 and #30) are (0.75, 1, 1) and (1, 1, 1), respectively. Full factorial design is not practical for high dimensions. For example, in 20-dimensional space, even with only two points for each factor, we end up with $2^{20} = 1{,}048{,}576$ sampling locations, which we can rarely afford.
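The grid construction can be sketched in a few lines of Python. This is only an illustration: the helper name `full_factorial` is hypothetical, and letting the first factor vary fastest is an assumed ordering convention.

```python
from itertools import product

def full_factorial(levels):
    """Return every combination of the per-factor levels as a list of tuples.

    itertools.product varies its LAST argument fastest, so the factor order is
    reversed on the way in and restored on the way out. The result is that the
    first factor varies fastest (an assumed ordering convention).
    """
    return [tuple(reversed(combo)) for combo in product(*reversed(levels))]

levels = [
    (0.0, 0.25, 0.5, 0.75, 1.0),  # x1: five levels
    (0.0, 1.0),                   # x2: two levels
    (0.0, 0.5, 1.0),              # x3: three levels
]
points = full_factorial(levels)
print(len(points))                        # number of samples: 5 * 2 * 3
print(points[0], points[1], points[-1])   # first, second and last sample

# Two levels in 20 dimensions: the count alone shows why grids do not scale.
n_two_level_20d = 2 ** 20
print(n_two_level_20d)
```

The number of points is the product of the level counts, so it grows exponentially with the number of factors.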

Linear Regression
The surrogate is a linear combination of $n_b$ given shape functions, $\hat{y}(\mathbf{x}) = \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x})$. For a linear approximation in two variables, $\xi_1 = 1$, $\xi_2 = x_1$, $\xi_3 = x_2$. The difference (error) between the $n_y$ data and the surrogate is $\mathbf{e} = \mathbf{y} - X\mathbf{b}$. We minimize the square error $\mathbf{e}^T\mathbf{e}$ and differentiate to obtain $X^TX\mathbf{b} = X^T\mathbf{y}$.

In linear regression the surrogate is a linear combination of given shape functions with an unknown coefficient vector $\mathbf{b}$, that is

$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x}),$$

where $\xi_i(\mathbf{x})$ are given shape functions, usually monomials. For example, for the linear approximation in two variables we may have $\xi_1 = 1$, $\xi_2 = x_1$, $\xi_3 = x_2$. We fit the surrogate to $n_y$ data points $y_j$. The difference between the surrogate and the data at the $j$th point is denoted $e_j$ and is given as

$$e_j = y_j - \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x}_j),$$

or in vector form $\mathbf{e} = \mathbf{y} - X\mathbf{b}$. Note that the $(i,j)$ component of the matrix $X$ is $\xi_j(\mathbf{x}_i)$. Regression minimizes the sum of the squares of the errors, which is

$$\mathbf{e}^T\mathbf{e} = (\mathbf{y} - X\mathbf{b})^T(\mathbf{y} - X\mathbf{b}) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^TX\mathbf{b} - \mathbf{b}^TX^T\mathbf{y} + \mathbf{b}^TX^TX\mathbf{b}.$$

Setting the derivative of $\mathbf{e}^T\mathbf{e}$ with respect to $\mathbf{b}$ to zero in order to find the best fit, we get $X^TX\mathbf{b} = X^T\mathbf{y}$, a set of $n_b$ equations. The equations are often ill conditioned, especially when the number of coefficients is large and close to the number of data points. The fact that linear regression merely requires the solution of a set of linear equations to do the fit is a reason for its popularity. Nonlinear regression, or other fit metrics, usually require the numerical solution of an optimization problem in order to obtain $\mathbf{b}$.
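As a minimal sketch of the fit, the Python below forms the normal equations and solves them for a linear polynomial in two variables. The data values, the helper names `solve` and `fit`, and the use of exact fractions are all assumptions made for illustration. In practice a QR-based least-squares routine is usually preferred, because forming $X^TX$ worsens the ill-conditioning mentioned above.

```python
from fractions import Fraction

def solve(a, rhs):
    """Solve the square system a @ b = rhs by Gauss-Jordan elimination (exact)."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def fit(points, y, shape_functions):
    """Return b solving the normal equations X^T X b = X^T y."""
    X = [[f(p) for f in shape_functions] for p in points]  # X[i][j] = xi_j(x_i)
    nb = len(shape_functions)
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(nb)] for i in range(nb)]
    xty = [sum(row[i] * yj for row, yj in zip(X, y)) for i in range(nb)]
    return solve(xtx, xty)

# Shape functions 1, x1, x2 for a linear polynomial in two variables.
shape_functions = [lambda p: 1, lambda p: p[0], lambda p: p[1]]
points = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
y = [2 + 3 * x1 - x2 for x1, x2 in points]  # hypothetical noise-free data
b = fit(points, y, shape_functions)
print(b)  # coefficients b1, b2, b3 as exact fractions
```

With noise-free data generated from a linear polynomial, the fit recovers the generating coefficients exactly.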

Model based error for linear regression
The common assumptions for linear regression: the true function is described by the functional form of the surrogate; the data is contaminated with normally distributed error with the same standard deviation at every point; the errors at different points are not correlated. Under these assumptions, the noise standard deviation (called the standard error) is estimated as $\hat\sigma = \sqrt{\mathbf{e}^T\mathbf{e}/(n_y - n_b)}$, and $\hat\sigma$ is used as an estimate of the prediction error.

In linear regression we commonly assume that the form that we select for the surrogate is the true form of the function. When we compare the prediction of the surrogate to future simulations we will have two sources of error: the noise in the simulation, and the inaccuracy in the coefficients of the shape functions that is caused by the noise in the data. The noise is assumed to be Gaussian (normally distributed) with zero mean and the same standard deviation at all points. Furthermore, the noise at different points is uncorrelated. Under these assumptions, we can show that the estimate of the square of the standard deviation of the noise is

$$\hat\sigma^2 = \frac{\mathbf{e}^T\mathbf{e}}{n_y - n_b}.$$

This standard deviation is called the standard error, and it is a good measure of the overall accuracy of the surrogate even when the assumptions do not hold.
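A small numerical illustration of the standard error, using made-up data: four points fitted with a straight line ($n_y = 4$, $n_b = 2$). The closed-form slope and intercept of simple linear regression stand in for the general normal equations, and the data values are assumptions.

```python
from fractions import Fraction
import math

# Hypothetical data: four points fitted with y = b1 + b2 * x.
x = [Fraction(v) for v in (0, 1, 2, 3)]
y = [Fraction(v) for v in (1, 3, 2, 4)]
ny, nb = len(x), 2

# Closed-form least-squares line (equivalent to solving the normal equations).
x_mean = sum(x) / ny
y_mean = sum(y) / ny
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Residuals e = y - X b and the estimate sigma^2 = e^T e / (ny - nb).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)
sigma2 = sse / (ny - nb)
sigma = math.sqrt(sigma2)
print(slope, intercept, sigma2, sigma)
```

The divisor $n_y - n_b$ is the number of surplus data points. With a saturated design ($n_y = n_b$) the residuals vanish and the estimate is undefined.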

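Prediction variance
Linear regression model: $\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x})$. Define $x_i^{(m)} = \xi_i(\mathbf{x})$; then $\hat{y} = \mathbf{x}^{(m)T}\mathbf{b}$. With some algebra, $\mathrm{Var}[\hat{y}(\mathbf{x})] = \sigma^2\,\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}$, and the standard error is $s_y = \hat\sigma\sqrt{\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}}$.

To estimate the variability in prediction at a point $\mathbf{x}$ we start with the linear regression model $\hat{y} = \sum_{i=1}^{n_b} b_i \xi_i(\mathbf{x})$ and write it in vector form as $\hat{y} = \mathbf{x}^{(m)T}\mathbf{b}$, where the vector of coefficients $\mathbf{b}$ multiplies a vector of shape functions $x_i^{(m)} = \xi_i(\mathbf{x})$. Since the coefficients are linear functions of the data, and the data is normally distributed, the coefficients $\mathbf{b}$ are also normally distributed. From the equation for $\mathbf{b}$ (Slide 2) we can show that the covariance matrix of $\mathbf{b}$ is $\Sigma_b = \sigma^2 (X^TX)^{-1}$, and then with some algebra it follows that

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \mathbf{x}^{(m)T}\,\Sigma_b\,\mathbf{x}^{(m)} = \sigma^2\,\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}.$$

Since we do not know the exact value of the noise $\sigma$, we substitute the estimate $\hat\sigma$ and take the square root to get the standard error, which, as usual, is the estimate of the standard deviation:

$$s_y = \hat\sigma\sqrt{\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}}.$$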
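The algebra can be sketched as follows. It assumes, as in the common assumptions of linear regression, that the data covariance is $\sigma^2 I$, and in addition that $X^TX$ is invertible. Here $\Sigma_b$ denotes the covariance matrix of $\mathbf{b}$ and $\mathbf{x}^{(m)}$ the vector of shape functions evaluated at $\mathbf{x}$.

```latex
\begin{aligned}
\mathbf{b} &= (X^TX)^{-1}X^T\mathbf{y}, \\
\Sigma_b &= (X^TX)^{-1}X^T\,(\sigma^2 I)\,X\,(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}, \\
\mathrm{Var}[\hat{y}(\mathbf{x})] &= \mathbf{x}^{(m)T}\,\Sigma_b\,\mathbf{x}^{(m)}
  = \sigma^2\,\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}.
\end{aligned}
```

The second line uses the rule $\mathrm{Cov}(A\mathbf{y}) = A\,\mathrm{Cov}(\mathbf{y})\,A^T$ with $A = (X^TX)^{-1}X^T$, together with the symmetry of $X^TX$.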

Prediction variance for full factorial design
Recall that the standard error (the square root of the prediction variance) is $s_y = \hat\sigma\sqrt{\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}}$. For full factorial design the domain is normally a box. The cheapest full factorial design has two levels (not good for quadratic polynomials). For a linear polynomial the standard error is then $s_y = \hat\sigma\sqrt{(1 + x_1^2 + \dots + x_n^2)/2^n}$, with the maximum error at the vertices. What does the ratio in the square root represent?

We consider first the normalized box where all the variables are in $[-1,1]$. For a linear polynomial we can make do with two levels for each variable, that is, include only the vertices. For example, in two dimensions the surrogate is $\hat{y} = b_1 + b_2x_1 + b_3x_2$. Then the vector of shape functions is $\mathbf{x}^{(m)} = [1 \;\; x_1 \;\; x_2]^T$, and the matrix $X$, which is the shape functions evaluated at the data points, is

$$X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}.$$

Then we get

$$s_y = \hat\sigma\sqrt{\frac{1 + x_1^2 + x_2^2}{4}}.$$

For a linear polynomial in $n$ dimensions we similarly get

$$s_y = \hat\sigma\sqrt{\frac{1 + x_1^2 + x_2^2 + \dots + x_n^2}{2^n}}.$$

The minimum error, at the origin, is substantially smaller than the noise. However, even the maximum error (at the vertices), $s_y = \hat\sigma\sqrt{(n+1)/2^n}$, is smaller than the noise for $n \ge 2$, and much smaller in higher dimensions. Notice that the noise there is reduced by the square root of the ratio between the number of coefficients in the polynomial and the number of data points. This ratio is in general a good estimate of the effect of having a surplus of data over coefficients for filtering out noise. If the noise is the only source of error (that is, the true function is a linear polynomial), then the fit will be more accurate than the data.
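The formula can be checked numerically for the two-level full factorial design. The sketch below assumes the box $[-1,1]^n$ and a linear polynomial. The helper names are hypothetical, and the shortcut of inverting only the diagonal is valid only because the design is orthogonal.

```python
from itertools import product
from fractions import Fraction

def two_level_design(n):
    """Rows [1, x1, ..., xn] of X at all 2**n vertices of the box [-1, 1]^n."""
    return [(1,) + vertex for vertex in product((-1, 1), repeat=n)]

def gram(X):
    """Return X^T X as a nested list."""
    nb = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(nb)] for i in range(nb)]

def variance_factor(n, x):
    """Return x_m^T (X^T X)^{-1} x_m, i.e. prediction variance for unit sigma.

    Uses only the diagonal of X^T X, which is valid because the off-diagonal
    terms are checked to be zero (orthogonal design).
    """
    xtx = gram(two_level_design(n))
    assert all(xtx[i][j] == 0 for i in range(n + 1) for j in range(n + 1) if i != j)
    xm = (1,) + tuple(x)
    return sum(Fraction(v) ** 2 / xtx[i][i] for i, v in enumerate(xm))

at_origin = variance_factor(2, (0, 0))      # expected 1 / 2**n
at_vertex = variance_factor(2, (1, 1))      # expected (n + 1) / 2**n
at_vertex_3d = variance_factor(3, (1, 1, 1))
print(at_origin, at_vertex, at_vertex_3d)
```

At a vertex the variance factor equals the number of coefficients divided by the number of points, which is the ratio inside the square root.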

Designs for linear polynomials
Traditionally we use only two levels. A design is orthogonal when $X^TX$ is diagonal. Full factorial design is orthogonal; it is not so easy to produce other orthogonal designs with fewer points. It is beneficial to place the points at the edges of the design domain. Stability, meaning a small variation of the prediction variance over the domain, is also a desirable property.

The full factorial design is often not affordable because of the large number of points, especially in high dimensions. However, even with a smaller number of points, it pays to retain the orthogonality of the columns of $X$ that leads to a diagonal $X^TX$. This is not easy to do with an arbitrary number of points. The minimum number of points is, of course, $n+1$, when the number of points is equal to the number of coefficients of the linear polynomial. This is called a saturated design, and it does not allow us to estimate the noise from the data. The quest for small values of the prediction variance usually pushes the design towards placing points on the boundary of the domain. However, another desired property is stability, which is limited variation of the prediction variance between different points in the domain. Stability is usually measured by the ratio of the highest to the lowest prediction variance. This sometimes pushes us to add points at the center of the domain, as we will see in examples later.

Example Linear polynomial y=b1+b2x1+b3x2
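Compare the prediction variance for an orthogonal design based on the equilateral triangle with vertices $(\sqrt{6}/2, -\sqrt{2}/2)$, $(-\sqrt{6}/2, -\sqrt{2}/2)$, $(0, \sqrt{2})$ to that of a right triangle with vertices $(-1,-1)$, $(1,-1)$, $(-1,1)$ (both are saturated), for the linear polynomial $y = b_1 + b_2x_1 + b_3x_2$.

For two dimensions, a full factorial design has 4 points, and the minimum (called a saturated design) for fitting a linear polynomial is three points. So we compare a triangle that leads to an orthogonal design with a right triangle that does not. We start with the equilateral triangle shown in blue in the figure. For that design

$$X = \begin{bmatrix} 1 & \sqrt{6}/2 & -\sqrt{2}/2 \\ 1 & -\sqrt{6}/2 & -\sqrt{2}/2 \\ 1 & 0 & \sqrt{2} \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{bmatrix}, \qquad (X^TX)^{-1} = \frac{1}{3}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$

so it is orthogonal. For the right triangle, we have

$$X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 3 & -1 & -1 \\ -1 & 3 & -1 \\ -1 & -1 & 3 \end{bmatrix}, \qquad (X^TX)^{-1} = \frac{1}{4}\begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}.$$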

Comparison Prediction variances for equilateral triangle
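For the equilateral triangle the prediction variance (for unit $\sigma$) is $(1 + x_1^2 + x_2^2)/3$, and the maximum variance, at (1,1), is three times the lowest one. For the right triangle the maximum variance (3) is six times the value at the origin, and triple the maximum for the equilateral triangle. A fairer comparison is when we restrict the orthogonal triangle to lie inside the box. The prediction variance is then $1/3 + x_1^2/2 + 2x_2^2/3$. The maximum prediction variance (1.5) and stability (ratio of 4.5) are still better than for the right triangle, but by less.

For the equilateral triangle we get

$$\mathbf{x}^{(m)} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}, \qquad \mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)} = \frac{1 + x_1^2 + x_2^2}{3}.$$

So in the range $[-1,1]$, the maximum prediction variance is 1 (for unit $\sigma$) and the minimum is 1/3. For the right triangle,

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = 0.5\,(1 + x_1 + x_2 + x_1^2 + x_1x_2 + x_2^2),$$

so that the maximum prediction variance is 3, at (1,1), which is six times the value at the origin (0.5). The lowest value over the box is 1/3, at $(-1/3, -1/3)$, the centroid of the triangle, so the ratio of the highest to the lowest prediction variance is nine. A fairer comparison would restrict the orthogonal design to lie in the $[-1,1]$ box. This is accomplished by moving the top vertex of the right triangle to the middle of the top edge and raising the base to $x_2 = -0.5$, which keeps the centroid at the origin. The result is an isosceles triangle with vertices $(1,-0.5)$, $(-1,-0.5)$, $(0,1)$. Then

$$X = \begin{bmatrix} 1 & 1 & -0.5 \\ 1 & -1 & -0.5 \\ 1 & 0 & 1 \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1.5 \end{bmatrix}, \qquad (X^TX)^{-1} = \begin{bmatrix} 1/3 & 0 & 0 \\ 0 & 1/2 & 0 \\ 0 & 0 & 2/3 \end{bmatrix},$$

and

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \frac{1}{3} + \frac{x_1^2}{2} + \frac{2x_2^2}{3}.$$

Now the maximum prediction variance is 1.5, still only half of that of the right triangle. The ratio of the maximum to the minimum is 4.5, which is 25% lower than the right triangle's factor of six relative to the origin, and half of its ratio of nine relative to its true minimum.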
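The three designs can be compared numerically with a short Python sketch. Two things here are assumptions. The equilateral triangle is taken to be centered at the origin with circumradius $\sqrt{2}$, which is consistent with the stated maximum of 1 and minimum of 1/3. The right triangle is taken to have vertices $(-1,-1)$, $(1,-1)$, $(-1,1)$. The helper names are hypothetical.

```python
import math

def solve(a, rhs):
    """Solve a @ z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def variance_factor(design, x):
    """Return x_m^T (X^T X)^{-1} x_m for the linear model 1, x1, x2."""
    X = [(1.0, p[0], p[1]) for p in design]
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xm = (1.0, x[0], x[1])
    z = solve(xtx, xm)  # z = (X^T X)^{-1} x_m, without forming the inverse
    return sum(a * b for a, b in zip(xm, z))

designs = {
    # Assumed: equilateral triangle centered at the origin, circumradius sqrt(2).
    "equilateral": [(math.sqrt(1.5), -math.sqrt(0.5)),
                    (-math.sqrt(1.5), -math.sqrt(0.5)),
                    (0.0, math.sqrt(2.0))],
    "right": [(-1, -1), (1, -1), (-1, 1)],
    "isosceles": [(1, -0.5), (-1, -0.5), (0, 1)],
}
results = {}
for name, design in designs.items():
    results[name] = (variance_factor(design, (0, 0)), variance_factor(design, (1, 1)))
    print(f"{name}: origin {results[name][0]:.3f}, corner (1,1) {results[name][1]:.3f}")
```

Solving a linear system instead of inverting $X^TX$ keeps the sketch short. For these 3-by-3 cases the two approaches give the same numbers.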

Quadratic Polynomial
A quadratic polynomial has $(n+1)(n+2)/2$ coefficients, so we need at least that many points. We need at least three different values of each variable. The simplest DOE is the three-level full factorial design. It is impractical for $n > 5$, and it also gives an unreasonable ratio between the number of points and the number of coefficients. For example, for $n = 8$ we get 6561 samples for 45 coefficients. My rule of thumb is that you want twice as many points as coefficients.

A quadratic polynomial in $n$ variables has $(n+1)(n+2)/2$ coefficients, so you need at least that many data points. You also need three different values of each variable. We can satisfy both requirements with a full factorial design with three levels. However, for $n > 5$ this will normally give more points than we can afford to evaluate. In addition, for $n > 3$ we get an unreasonable ratio between the number of points and the number of coefficients; a ratio of around 2 is normally what we want. For $n = 4$ we have 15 coefficients and 81 points, and for $n = 8$ we have 45 coefficients and 6561 points.
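The growth of the two counts is easy to tabulate. A minimal sketch follows; the function names are hypothetical.

```python
def n_quadratic_coefficients(n):
    """Number of coefficients of a full quadratic polynomial in n variables."""
    return (n + 1) * (n + 2) // 2

def n_three_level_points(n):
    """Number of points in a three-level full factorial design."""
    return 3 ** n

for n in (2, 3, 4, 5, 8):
    coeffs = n_quadratic_coefficients(n)
    points = n_three_level_points(n)
    print(f"n={n}: {coeffs} coefficients, {points} points, ratio {points / coeffs:.1f}")
```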

Central Composite Design
The design includes $2^n$ vertices, $2n$ face (axial) points, plus $n_c$ repetitions of the central point. We can choose $\alpha$ to achieve a spherical design ($\alpha = \sqrt{n}$), to achieve rotatability, where the prediction variance is spherical ($\alpha = \sqrt[4]{V}$), or to stay in the box (face-centered, FCCCD, $\alpha = 1$). The design is still impractical for $n > 8$.

The central composite design takes the two-level full factorial design and adds to it the minimum number of points needed to provide three levels for each variable, so that a quadratic polynomial can be fitted. This is done by adding points along the axes, in both the positive and negative directions, at a distance $\alpha$ from the origin. The figure showing the central composite design for $n = 3$ is taken from Myers and Montgomery’s Response Surface Methodology, Figure 7.5 in the 1995 edition. The value of $\alpha$ can be chosen to achieve a spherical design, where all the points are at the same distance from the origin ($\alpha = \sqrt{3}$ for $n = 3$), or to achieve “rotatability”, which is when the prediction variance is a spherical function. Myers and Montgomery show that this requires $\alpha = \sqrt[4]{V}$, where $V$ is the number of vertices. So for $n = 3$, $\alpha = \sqrt[4]{8} \approx 1.682$. Of course, if the sample points must stay in the box, we have to choose $\alpha = 1$. This design is called the face-centered central composite design (FCCCD). Central composite design is popular for $3 \le n \le 6$, where it gives reasonable ratios between the number of points and the number of coefficients of a quadratic polynomial. For $n = 7$, we already require at least 143 data points (without repetitions at the center) for 36 coefficients, and for $n = 9$, 531 points for 55 coefficients.
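A central composite design in the normalized box can be generated as sketched below. The function name `central_composite` and its arguments are hypothetical. It is not the Matlab `ccdesign` routine, and it makes no attempt to reproduce that routine's ordering of the points.

```python
from itertools import product

def central_composite(n, alpha=1.0, n_center=1):
    """Return the points of a central composite design in n variables.

    The design has 2**n vertices of [-1, 1]^n, 2n axial points at distance
    alpha from the origin, and n_center copies of the origin.
    """
    vertices = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=n)]
    axial = []
    for i in range(n):
        for sign in (-1.0, 1.0):
            point = [0.0] * n
            point[i] = sign * alpha
            axial.append(tuple(point))
    center = [tuple([0.0] * n)] * n_center
    return vertices + axial + center

n = 3
rotatable_alpha = (2 ** n) ** 0.25   # fourth root of the number of vertices
spherical_alpha = n ** 0.5           # all non-center points at the same radius
face_centered = central_composite(n, alpha=1.0)
print(len(face_centered), round(rotatable_alpha, 3), round(spherical_alpha, 3))

# Point counts without center repetition: 2**n + 2n + 1.
for dims in (7, 9):
    print(dims, len(central_composite(dims)))
```

For two variables with five center points, `central_composite(2, alpha=2 ** 0.5, n_center=5)` gives 13 points.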

Repeated observations at origin
Unlike linear designs, the prediction variance is high at the origin. Repetition at the origin decreases the variance there and improves stability. What other rationale is there for choosing the origin for repetition? Repetition also gives an independent measure of the magnitude of the noise, and it can also be used for lack-of-fit tests.

Central composite designs without repetition at the origin have high prediction variance there. This is in contrast to the linear designs that we discussed earlier in the lecture, where the prediction variance was lowest at the origin. Repetition at the origin decreases the variance there, and so it improves stability. There is also another obvious reason for choosing the origin for repetition rather than any other point. What is it? Repetition also gives an independent measure of the magnitude of the noise. If that noise is substantially smaller than the value of $\hat\sigma$, it indicates that there is lack of fit, that is, the function we chose for the regression is different in form from the true function. Repeated observations make sense in the case of noisy measurements. Numerical simulations can also be noisy, due to discretization error as remeshing occurs. However, repeating a simulation will usually give the same results, not showing the numerical noise. In that case, it is advisable to repeat the simulations at a small distance from the origin, such as 0.01 if the box is normalized to the range $[-1,1]$.

Without repetition (9 points)
Contours of prediction variance for the spherical CCD design. How come it is rotatable? The figure is from Myers and Montgomery’s Response Surface Methodology, 1995 edition (the companion figure is on the next slide). For $n = 2$, the spherical design with $\alpha = \sqrt{2}$ is also rotatable, as can be checked from the equation on Slide 10. We see that without repetition, the prediction variance at the origin is substantially higher than elsewhere.

Center repeated 5 times (13 points)
With five repetitions we reduce the maximum prediction variance and greatly improve the uniformity. Five points is the optimum for uniformity.

We now add four repetitions at the origin, for a total of five points there, and we get a much improved prediction variance, both in terms of the maximum and in terms of the stability. This design can be obtained from Matlab by using the ccdesign function with the call below. The function call tells it to repeat points at the center. We can give the actual number of repetitions or, as we do below, ask for the number that leads to the most uniform prediction variance.

    d=ccdesign(2,'center','uniform')

Variance optimal designs
Full factorial and CCD are not flexible in the number of points. The standard error is $s_y = \hat\sigma\sqrt{\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}}$. A key to most optimal DOE methods is the moment matrix $M = X^TX/n_y$. A good design of experiments will maximize the terms in this matrix, especially the diagonal elements. D-optimal designs maximize the determinant of the moment matrix. The determinant is inversely proportional to the square of the volume of the confidence region on the coefficients.

Full factorial design and CCD are not flexible in terms of the number of points and the shape of the domain, and they often lead to an unaffordable number of points. More modern DOEs emphasize availability for any number of points. Then the issue is how to spread the points to minimize the prediction variance. The formula for the standard error of the prediction,

$$s_y = \hat\sigma\sqrt{\mathbf{x}^{(m)T}(X^TX)^{-1}\mathbf{x}^{(m)}},$$

reveals that we need to minimize the terms of the inverse matrix $(X^TX)^{-1}$, that is, to make $X^TX$ large. A normalized version of this matrix, $M = X^TX/n_y$, with determinant $|M| = |X^TX|/n_y^{n_b}$, is known as the moment matrix. D-optimal designs maximize the determinant of the moment matrix. It can be shown that the determinant is inversely proportional to the square of the volume of the confidence region for the polynomial coefficients. Another set of popular designs, often used with other surrogates such as Kriging, are DOEs that strive for geometric uniformity in design space, such as Latin Hypercube Sampling (LHS) designs.

Example
Given the model $y = b_1x_1 + b_2x_2$ and the two data points (0,0) and (1,0), find the optimum third data point $(p,q)$ in the unit square. We have $\det(X^TX) = q^2$, so that the third point is $(p,1)$, for any value of $p$. Finding a D-optimal design in higher dimensions is a difficult optimization problem, often solved heuristically.

We use the simple example of fitting $y = b_1x_1 + b_2x_2$ in the unit square with three points. We know that variance-optimal designs push points to the boundary, so we place two points at (0,0) and (1,0), and we solve analytically for the position of the third point $(p,q)$. We get

$$X = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ p & q \end{bmatrix}, \qquad X^TX = \begin{bmatrix} 1 + p^2 & pq \\ pq & q^2 \end{bmatrix}, \qquad \det(X^TX) = (1 + p^2)q^2 - p^2q^2 = q^2.$$

So to maximize the determinant we must choose $q = 1$. The value of $p$ does not matter (for symmetry we may want to place it at 0.5). Finding the D-optimal design in higher dimensions is a difficult optimization problem, and in the next slide we will use Matlab to get us designs.
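The analytical result can be checked by brute force over a grid of candidate third points. This is only a sketch: the grid spacing of 0.1 and the function name `det_xtx` are arbitrary choices, and exhaustive search is not how D-optimal designs are found in practice.

```python
from fractions import Fraction

def det_xtx(p, q):
    """det(X^T X) for the model y = b1*x1 + b2*x2 with points (0,0), (1,0), (p,q)."""
    X = [(0, 0), (1, 0), (p, q)]
    a = sum(r[0] * r[0] for r in X)   # sum of x1^2
    b = sum(r[0] * r[1] for r in X)   # sum of x1*x2
    c = sum(r[1] * r[1] for r in X)   # sum of x2^2
    return a * c - b * b

grid = [Fraction(k, 10) for k in range(11)]   # 0, 0.1, ..., 1 (exact)
candidates = [(p, q) for p in grid for q in grid]

# The determinant depends on q only, so every point with q = 1 is D-optimal;
# max() returns the first such point it encounters.
best = max(candidates, key=lambda pq: det_xtx(*pq))
print(best, det_xtx(*best))
```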

Matlab example

    >> ny=6; nbeta=6;
    >> [dce,x]=cordexch(2,ny,'quadratic');   % D-optimal design by coordinate exchange
    >> dce'
    >> scatter(dce(:,1),dce(:,2),200,'filled')
    >> det(x'*x)/ny^nbeta                    % determinant of the moment matrix
    ans =

With 12 points:

    >> ny=12;
    ans =0.0102

Other criteria
A-optimal designs minimize the trace of the inverse of the moment matrix. This minimizes the sum of the variances of the coefficients. G-optimal designs minimize the maximum of the prediction variance.

Example For the previous example, find the A-optimal design
Minimum at (0,1), so this point is both A-optimal and D-optimal.
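A sketch of why the minimum is at (0,1), using the $X^TX$ of the previous example with third point $(p,q)$ in the unit square. The constant factor $n_y$ that separates $X^TX$ from the moment matrix is dropped, since it does not change the location of the minimum.

```latex
\begin{aligned}
(X^TX)^{-1} &= \frac{1}{q^2}
  \begin{bmatrix} q^2 & -pq \\ -pq & 1 + p^2 \end{bmatrix}, \\
\operatorname{tr}\,(X^TX)^{-1} &= \frac{q^2 + 1 + p^2}{q^2} = 1 + \frac{1 + p^2}{q^2}.
\end{aligned}
```

For $0 \le p, q \le 1$ the trace decreases as $q$ grows and increases as $p$ grows, so it is smallest at $q = 1$, $p = 0$, where it equals 2. Unlike the D-optimal criterion, the A-optimal criterion therefore also fixes the value of $p$.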

Problems
Create a 13-point D-optimal design in two-dimensional space and compare its prediction variance to that of the CCD design shown on Slide 13. Generate noisy data for the function $f(x, y) = (x+y)^2$, fit it using the two designs, and compare the accuracy of the coefficients.