COMPUTATIONAL ASPECTS OF LOCAL REGRESSION MODELLING: taking spatial analysis to another level
Stewart Fotheringham, Martin Charlton, Chris Brunsdon
Spatial Analysis Research Group, Department of Geography, University of Newcastle upon Tyne, Newcastle upon Tyne, ENGLAND NE1 7RU
GEOGRAPHICALLY WEIGHTED REGRESSION
– The mechanics of GWR
– New software for GWR: GWR 2.0
– GWR in practice: an example of the determinants of London house prices
We won’t discuss the math of GWR in much detail.
Some Definitions
– Spatial nonstationarity exists when the same stimulus provokes a different response in different parts of the study region.
– Global models are statements about processes which are assumed to be stationary and as such are location independent.
– Local models are spatial disaggregations of global models, the results of which are location-specific.
Local versus Global
– Local versus global data: the example of US climate data
– Local versus global relationships: the example of house price determinants
– Local versus global models: the example of regression
Why might relationships vary spatially?
– Sampling variation
– Relationships are intrinsically different across space, e.g. differences in attitudes or preferences, or different administrative, political or other contextual effects, produce different responses to the same stimuli
– Model misspecification: suppose a global statement can ultimately be made, but the models are not properly specified to allow us to make it. Local models are then a good indicator of how the model is misspecified.
Can all contextual effects ever be removed? Can all significant variations in local relationships be removed?
Regression: In a typical linear regression model applied to spatial data we assume a stationary process:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_n x_{ni} + \varepsilon_i$$
so that... The parameter estimates obtained in the calibration of such a model are constant over space:
$$\beta' = (X^T X)^{-1} X^T Y$$
which means that any spatial variations in the processes being examined can only be measured by the error term.
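As a minimal sketch of the global estimator above, the following Python/NumPy fragment solves the normal equations for a single set of coefficients. The function name `ols_fit` and the synthetic data are assumptions made for illustration; they are not part of GWR 2.0.

```python
import numpy as np

def ols_fit(X, y):
    """Global estimate beta' = (X^T X)^{-1} X^T y, one coefficient vector for all locations."""
    # Solve the normal equations instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (illustrative only): y = 2 + 3x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])   # first column is the intercept
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=50)

beta = ols_fit(X, y)
residuals = y - X @ beta                    # all remaining variation ends up here
```

Because `beta` carries no location index, any spatial pattern in the process can only show up in `residuals`.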
Consequently... We might map the residuals from the regression to determine whether there are any spatial patterns, or compute an autocorrelation statistic. We might even try to ‘model’ the error dependency with various types of spatial regression models.
However... Why not address the issue of spatial nonstationarity directly and allow the relationships we are measuring to vary over space? This is the essence of GWR:
$$y(g) = \beta_0(g) + \beta_1(g) x_1 + \beta_2(g) x_2 + \dots + \beta_n(g) x_n + \varepsilon(g)$$
where (g) refers to a location at which estimates of the parameters are obtained.
… with the estimator
$$\beta'(g) = (X^T W(g) X)^{-1} X^T W(g) Y$$
where W(g) is a matrix of weights specific to location g such that observations nearer to g are given greater weight than observations further away.
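$$W(g) = \begin{pmatrix} w_{g1} & 0 & 0 & \cdots & 0 \\ 0 & w_{g2} & 0 & \cdots & 0 \\ 0 & 0 & w_{g3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & w_{gn} \end{pmatrix}$$
where $w_{gn}$ is the weight given to data point n for the estimate of the local parameters at location g.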
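The local estimator can be sketched in a few lines of Python/NumPy. This is an illustration under stated assumptions, not the GWR 2.0 FORTRAN code: the function name `gwr_fit_at`, the Gaussian distance-decay weights, the bandwidth `h = 0.1` and the synthetic data are all hypothetical.

```python
import numpy as np

def gwr_fit_at(g, coords, X, y, h):
    """Local estimate beta'(g) = (X^T W(g) X)^{-1} X^T W(g) y at location g."""
    d = np.linalg.norm(coords - g, axis=1)   # distance from g to every sample point
    w = np.exp(-0.5 * (d / h) ** 2)          # diagonal of W(g); Gaussian decay is assumed
    XtW = X.T * w                            # equals X.T @ diag(w) without building W(g)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Synthetic data (illustrative): the slope drifts from 1 in the west to 3 in the east.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 1, size=(400, 2))
x = rng.normal(size=400)
X = np.column_stack([np.ones(400), x])
y = 1.0 + (1.0 + 2.0 * coords[:, 0]) * x + rng.normal(0, 0.05, size=400)

west = gwr_fit_at(np.array([0.1, 0.5]), coords, X, y, h=0.1)
east = gwr_fit_at(np.array([0.9, 0.5]), coords, X, y, h=0.1)
```

Comparing `west[1]` with `east[1]` shows the slope estimate changing with location, which a single global fit cannot report.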
Weighting schemes: Numerous weighting schemes can be used. They can be either fixed or adaptive. Two examples of a fixed weighting scheme are the Gaussian function:
$$w_{ij} = \exp\left[-\left(d_{ij}^2 / h^2\right)/2\right]$$
where $d_{ij}$ is the distance between points i and j, and h is known as the bandwidth and controls the degree of distance-decay, and the bisquare function:
$$w_{ij} = \begin{cases} \left[1 - \left(d_{ij}^2 / h^2\right)\right]^2 & \text{if } d_{ij} < h \\ 0 & \text{otherwise} \end{cases}$$
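The two fixed kernels translate directly into code. The sketch below uses hypothetical function names and an arbitrary bandwidth of 1000 (in whatever units the distances are measured).

```python
import numpy as np

def gaussian_kernel(d, h):
    """w = exp[-(d^2 / h^2) / 2]; never reaches zero, however large d is."""
    d = np.asarray(d, dtype=float)
    return np.exp(-0.5 * (d / h) ** 2)

def bisquare_kernel(d, h):
    """w = [1 - d^2 / h^2]^2 for d < h, and exactly 0 otherwise."""
    d = np.asarray(d, dtype=float)
    return np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)

d = np.array([0.0, 500.0, 1000.0, 2000.0])   # distances, same units as h
wg = gaussian_kernel(d, h=1000.0)
wb = bisquare_kernel(d, h=1000.0)
```

The practical difference is that the bisquare kernel drops points beyond h entirely, while the Gaussian kernel only down-weights them.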
Perhaps better... is to use a spatially adaptive weighting function such as:
$$w_{ij} = \exp(-R_{ij} / h)$$
where $R_{ij}$ is the ranked distance, or
$$w_{ij} = \begin{cases} \left[1 - \left(d_{ij}^2 / h^2\right)\right]^2 & \text{if } j \text{ is one of the } N \text{ nearest neighbours of } i \\ 0 & \text{otherwise} \end{cases}$$
In the latter, we estimate an optimal value of N in the GWR routine.
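A sketch of the nearest-neighbour bisquare kernel follows. It assumes that the local bandwidth h is set to the distance from the regression point to its N-th nearest sample point; the slide does not state how h is chosen, so this rule and the function name are assumptions.

```python
import numpy as np

def adaptive_bisquare_weights(g, coords, n_neighbours):
    """Bisquare weights whose bandwidth adapts to the local density of sample points."""
    d = np.linalg.norm(coords - g, axis=1)
    # Assumed rule: h is the distance to the N-th nearest sample point.
    h = np.sort(d)[n_neighbours - 1]
    w = np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)
    return w, h

# Ten sample points spaced one unit apart along a line (illustrative).
coords = np.column_stack([np.arange(10.0), np.zeros(10)])
w, h = adaptive_bisquare_weights(np.array([0.0, 0.0]), coords, n_neighbours=4)
```

Under this rule the N-th neighbour sits exactly on the kernel edge and receives zero weight, so at most N − 1 points contribute. In sparse areas h grows and in dense areas it shrinks, which is the point of an adaptive kernel.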
Calibration: The results of GWR appear to be relatively insensitive to the choice of weighting function, as long as it is a continuous distance-based function. Whichever weighting function is used, however, the results will be sensitive to the degree of distance-decay. Therefore an optimal value of either h or N has to be obtained. This can be found by minimising a crossvalidation score (CV) or the Akaike Information Criterion (AIC)
where...
$$CV = \sum_i \left[y_i - y_i'(h)\right]^2$$
where $y_i'(h)$ is the fitted value of $y_i$ with data from point i omitted from the calibration, and
$$AIC = 2n \ln(\sigma') + n \ln(2\pi) + n \, \frac{n + Tr(S)}{n - 2 - Tr(S)}$$
where n is the number of data points, $\sigma'$ is the estimated standard deviation of the error term, and Tr(S) is the trace of the hat matrix S, defined by $y' = Sy$.
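Both criteria can be computed from rows of the hat matrix S. The sketch below makes several assumptions that are not taken from the slides: a fixed Gaussian kernel, σ' computed as sqrt(RSS / n), the corrected AIC form commonly given for GWR with denominator n − 2 − Tr(S), and a small grid of hypothetical candidate bandwidths in place of whatever search GWR 2.0 uses.

```python
import numpy as np

def hat_row(i, coords, X, h, leave_out=False):
    """Row i of S, so that the fitted value at sample point i is hat_row @ y."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    w = np.exp(-0.5 * (d / h) ** 2)      # Gaussian kernel (an assumed choice)
    if leave_out:
        w[i] = 0.0                       # omit point i from its own calibration
    XtW = X.T * w
    return X[i] @ np.linalg.solve(XtW @ X, XtW)

def cv_score(coords, X, y, h):
    """Sum of squared leave-one-out prediction errors for bandwidth h."""
    fitted = np.array([hat_row(i, coords, X, h, leave_out=True) @ y
                       for i in range(len(y))])
    return float(np.sum((y - fitted) ** 2))

def aic_score(coords, X, y, h):
    """AIC = 2n ln(sigma') + n ln(2 pi) + n (n + Tr(S)) / (n - 2 - Tr(S))."""
    n = len(y)
    S = np.vstack([hat_row(i, coords, X, h) for i in range(n)])
    resid = y - S @ y
    sigma = np.sqrt(resid @ resid / n)   # assumption: sigma' taken as sqrt(RSS / n)
    tr = np.trace(S)
    return float(2 * n * np.log(sigma) + n * np.log(2 * np.pi)
                 + n * (n + tr) / (n - 2 - tr))

# Synthetic data (illustrative): the slope varies from west to east.
rng = np.random.default_rng(2)
n = 80
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + (1.0 + 2.0 * coords[:, 0]) * x + rng.normal(0, 0.1, size=n)

candidates = [0.2, 0.4, 0.8]             # hypothetical bandwidths to compare
cv = {h: cv_score(coords, X, y, h) for h in candidates}
aic = {h: aic_score(coords, X, y, h) for h in candidates}
best_h_cv = min(cv, key=cv.get)
best_h_aic = min(aic, key=aic.get)
```

Omitting point i matters for CV: without it, shrinking h towards zero would fit every point with itself and drive the score to zero.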
GWR Jargon
– Sample points: locations at which your data are measured
– Regression points: locations at which you require parameter estimates
These need not be the same locations. This can be handy if you want to map the results from very large data sets.
In GWR, we can also...
– estimate local standard errors
– derive local t statistics
– calculate local goodness-of-fit measures
– calculate local leverage measures
– perform tests to assess the significance of the spatial variation in the local parameter estimates
– perform tests to determine if the local model performs better than the global one, accounting for differences in degrees of freedom
Software for GWR (GWR 2.0)
– User-friendly and menu-driven
– Provide a range of options
– Provide some helpful diagnostics
– Provide results for other software
The architecture of GWR 2.0
– Currently about 3000 lines of FORTRAN
– VB front-end to create a control file and run the program
– Can run the FORTRAN code under Unix with a control file or, more easily, by using the VB model editor in Windows (95, 98, ME)
Running the program: There are several routes through the program, depending on what you want to do.
Creating a new model
– Data
– Dependent variable
– Independent variables
– Sample point locational variables
– Regression point locational variables
– Weighting scheme/fitting
– Output/listing
Data first... Usual Windows form. Assumes a filetype of .dat
Data structure
– First line is a comma-separated list of variable names (<= 8 characters)
– Data lines have numeric items only, terminated by a carriage return
– One line of data per location
– Space or comma delimited (easily imported)
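A file in this layout is easy to read or check outside the program. The Python reader below is a hypothetical helper, not part of GWR 2.0; the variable names are borrowed from the output listing shown later in the talk, and the numeric values are made up for illustration.

```python
import io
import re

def read_gwr_data(stream):
    """Read a header of comma-separated names, then one numeric record per line."""
    header = stream.readline()
    names = [name.strip() for name in header.split(",")]
    if any(len(name) > 8 for name in names):
        raise ValueError("variable names must be at most 8 characters")
    rows = []
    for line in stream:
        if not line.strip():
            continue                                  # skip blank lines
        # Items may be separated by spaces or by commas.
        values = [float(tok) for tok in re.split(r"[,\s]+", line.strip())]
        if len(values) != len(names):
            raise ValueError("expected one value per variable on each line")
        rows.append(values)
    return names, rows

# Made-up records in the layout described above (one location per line).
sample = io.StringIO(
    "Longitud,Latitude,PctBatch,PctPov\n"
    "-82.5 31.7 8.2 19.9\n"
    "-83.1,31.3,6.4,26.0\n"
)
names, rows = read_gwr_data(sample)
```

Reading from a stream rather than a path keeps the sketch testable; for a real .dat file, pass `open(path)` instead of the `io.StringIO` object.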
User’s decision Are the sample points and regression points identical?
Output File: This will appear in the results folder. What type do you want?
Types of output: Currently the program will output parameter estimates and other information in these formats:
– ArcInfo ungenerated export (.e00)
– MapInfo Interchange Format (.mif)
– Comma Separated Values (.csv)
The Model Editor The variable names have already been extracted from the file
The Model Editor To specify a dependent variable, highlight it in the list on the left and click on the [->] symbol
The Model Editor To specify independent variables, highlight them in the list on the left and click on the [->] symbol
The Model Editor To specify location variables, highlight them in the list on the left and click on the relevant [->] symbols
The Model Editor To specify a weight variable, highlight it in the list on the left and click on the [->] symbol
The Model Editor Next you specify the type of kernel: this can be fixed (Gaussian) or variable (bisquare)
The Model Editor You can either preset the bandwidth, in the units in which the location variables are measured (for example, metres),
or, if you want the program to determine the optimal bandwidth, specify crossvalidation. For large files, there is a sampling option to speed up the process.
The Model Editor The parameter estimates can be output in different formats. The .csv output is a text format; the ArcInfo and MapInfo outputs will load directly into those packages.
The Model Editor The type of output in the printed listing can also be controlled.
The Model Editor Once the model specification is completed, save it before you run it. The file appears in the model folder.
Running the Model: The saved model is identified. You must also specify a listing file for the printed output. Then hit Run.
The output listing: some useful statistics
  Dependent (y) variable PctBatch
  Easting (x-coord) variable.....Longitud
  Northing (y-coord) variable.....Latitude
  No weight variable specified
  Dependent variables in this model...
  PctPov
  Kernel type: Variable width
  Determine kernel size using crossvalidation...
  ** Crossvalidation report requested
  ** Prediction report requested
  ** Parameter estimates to be written to .e00 file
  ** Monte Carlo significance tests for spatial variation
  ** Casewise diagnostics to be printed
  ** Regression models will include intercept
  Number of data cases read: 159
  Observation points read...
  Dependent mean=
  Number of observations, nobs= 159
  Number of predictors, nvar= 1
  Obseration Easting extent:
  Observation Northing extent:
Calibration report
  CrossValidation begins...
  CV will be based on 159 cases
  Initial estimates for bandwidth
  Bandwidth AIC
  … … … … … … … … … …
  ** Convergence: NNN= 23
In this case we are using an adaptive kernel. Each kernel will contain the 23 sample points nearest to each regression point.
Global Results
  **********************************************************
  * GLOBAL REGRESSION PARAMETERS *
  **********************************************************
  Diagnostic information...
  Residual sum of squares
  Effective degrees of freedom
  Sigma
  Akaike Information Criterion
  Parameter Estimate Std Err T
  Intercept
  PctPov
The parameter estimates for the “global” model are reported first. Both parameters are significant.
GWR Results
  **********************************************************
  * GWR ESTIMATION *
  **********************************************************
  Fitting Geographically Weighted Regression Model...
  Number of observations
  Number of independent variables... 2
  (Intercept is variable 1)
  Number of nearest neighbours
  Number of locations to fit model
  Diagnostic information...
  Residual sum of squares
  Effective degrees of freedom
  Sigma
  Akaike Information Criterion
The equivalent diagnostic information for the GWR model is printed. The AIC value is smaller than that of the global model, suggesting that the GWR model is “better”.
  Predictions from this model...
  Obs Y(i) Yhat(i) Res(i) X(i) Y(i)
  F F F F F F F
If you have requested an output file, this information and the diagnostics are also written to this file.
  **********************************************************
  * PARAMETER 5-NUMBER SUMMARIES *
  **********************************************************
  Label Minimum Lwr Quartile Median Upr Quartile Maximum
  Intrcept
  PctPov
As there are as many sets of parameter estimates as there are regression points, a “5 number summary” is provided to give the main “highlights” of each distribution. As well as the maximum, minimum and median, the program prints the upper and lower quartiles.
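The same summary can be reproduced from the exported parameter estimates. The sketch below uses NumPy's default (linearly interpolated) percentiles; the quartile rule used by GWR 2.0 is not stated, so results may differ slightly. The function name and the toy estimates are hypothetical.

```python
import numpy as np

def five_number_summary(estimates, labels):
    """Minimum, lower quartile, median, upper quartile, maximum for each parameter.

    `estimates` has one row per regression point and one column per parameter.
    """
    q = np.percentile(estimates, [0, 25, 50, 75, 100], axis=0)   # shape (5, p)
    return {label: q[:, k] for k, label in enumerate(labels)}

# Toy local estimates at five regression points (made-up values).
local = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0],
                  [5.0, 50.0]])
summary = five_number_summary(local, ["Intrcept", "PctPov"])
```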
  *************************************************
  *                                               *
  * Test for spatial variability of parameters    *
  *                                               *
  *************************************************
  Tests based on the Monte Carlo significance test procedure due to Hope [1968,JRSB,30(3), ]
  Parameter p-value
  Intercept
  PctPov
It is useful to know whether there is sufficient evidence to suggest that the variation in parameter estimates is not random. Two tests have been programmed: a Monte Carlo significance test and a test after Leung.
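One way to build a Monte Carlo test of this kind is sketched below. It assumes the test statistic is the variance of each local parameter estimate across regression points, and that the reference distribution comes from randomly reassigning the locations among the observations. The slides do not spell out these details, so treat the statistic, the fixed Gaussian kernel, the number of permutations and the function names as assumptions.

```python
import numpy as np

def local_estimates(coords, X, y, h):
    """Local parameter estimates at every sample point (fixed Gaussian kernel assumed)."""
    out = []
    for g in coords:
        d = np.linalg.norm(coords - g, axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)
        XtW = X.T * w
        out.append(np.linalg.solve(XtW @ X, XtW @ y))
    return np.array(out)

def monte_carlo_pvalues(coords, X, y, h, n_perm=99, seed=0):
    """Rank the observed variance of each local parameter among permuted variances."""
    rng = np.random.default_rng(seed)
    observed = local_estimates(coords, X, y, h).var(axis=0)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Shuffle which location goes with which observation; (X, y) pairs stay intact.
        shuffled = coords[rng.permutation(len(coords))]
        exceed += local_estimates(shuffled, X, y, h).var(axis=0) >= observed
    return (exceed + 1) / (n_perm + 1)

# Synthetic data (illustrative): constant intercept, slope varying from west to east.
rng = np.random.default_rng(3)
n = 100
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + (1.0 + 2.0 * coords[:, 0]) * x + rng.normal(0, 0.1, size=n)

p_values = monte_carlo_pvalues(coords, X, y, h=0.2)
```

With 99 permutations the smallest attainable p-value is 0.01. A small p-value for a parameter suggests that its spatial variation is larger than would be expected if location were irrelevant.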
Parameter estimates: To import the results into ArcView, use the Import71 utility. This creates an ArcInfo point coverage.
Spatial variation in the intercept term
Spatial variation in the poverty parameter
What’s planned for GWR 3.0?
– Regression through the origin
– Further kernel types
– Poisson regression
– Further export options
– Other suggestions from users of GWR 2.0
The GWR website is:
Details of program availability given