Chapter 8 – Spatial contiguity (geostatistic + lattice data)

Slides:

Advertisements

Similar presentations

Spatial point patterns and Geostatistics an introduction

Advertisements

11-1 Empirical Models Many problems in engineering and science involve exploring the relationships between two or more variables. Regression analysis.

Chapter 7 Statistical Data Treatment and Evaluation

Hypothesis Testing Steps in Hypothesis Testing:

Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Lecture Slides Elementary Statistics Eleventh Edition and the Triola.

Correlation and regression

Objectives (BPS chapter 24)

Basic geostatistics Austin Troy.

Spatial Autocorrelation Basics NR 245 Austin Troy University of Vermont.

GIS and Spatial Statistics: Methods and Applications in Public Health

Chapter 13 Additional Topics in Regression Analysis

Correlation and Autocorrelation

Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.

Deterministic Solutions Geostatistical Solutions

1 Testing spatial correlation (autocorrelation) 1.Moran’s I 2.Geary’s c 3.Variogram 4.Join counts Cliff, A. D. & Ord, J. K Spatial processes: models.

Applied Geostatistics

Examining Relationship of Variables  Response (dependent) variable - measures the outcome of a study.  Explanatory (Independent) variable - explains.

Nemours Biomedical Research Statistics April 2, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.

Deterministic Solutions Geostatistical Solutions

Chapter 11 Multiple Regression.

SA basics Lack of independence for nearby obs

Why Geography is important.

Ordinary Kriging Process in ArcGIS

Business Statistics - QBM117 Statistical inference for regression.

Introduction to Regression Analysis, Chapter 13,

1 1 Slide © 2008 Thomson South-Western. All Rights Reserved Slides by JOHN LOUCKS & Updated by SPIROS VELIANITIS.

Objectives of Multiple Regression

Descriptive Methods in Regression and Correlation

Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved Section 10-3 Regression.

Regression Method.

Regression Analysis (2)

CORRELATION & REGRESSION

What does it mean? The variance of the error term is not constant

The Examination of Residuals. The residuals are defined as the n differences : where is an observation and is the corresponding fitted value obtained.

1 1 Slide © 2004 Thomson/South-Western Slides Prepared by JOHN S. LOUCKS St. Edward’s University Slides Prepared by JOHN S. LOUCKS St. Edward’s University.

Ecosystems are: Hierarchically structured, Metastable, Far from equilibrium Spatial Relationships Theoretical Framework: “An Introduction to Applied Geostatistics“,

Random Regressors and Moment Based Estimation Prepared by Vera Tabakova, East Carolina University.

1 1 Slide © 2014 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole.

The Examination of Residuals. Examination of Residuals The fitting of models to data is done using an iterative approach. The first step is to fit a simple.

Chap 14-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 14 Additional Topics in Regression Analysis Statistics for Business.

Geographic Information Science

Chapter 11 Linear Regression Straight Lines, Least-Squares and More Chapter 11A Can you pick out the straight lines and find the least-square?

GEOSTATISICAL ANALYSIS Course: Special Topics in Remote Sensing & GIS Mirza Muhammad Waqar Contact: EXT:2257.

The Semivariogram in Remote Sensing: An Introduction P. J. Curran, Remote Sensing of Environment 24: (1988). Presented by Dahl Winters Geog 577,

MGS3100_04.ppt/Sep 29, 2015/Page 1 Georgia State University - Confidential MGS 3100 Business Analysis Regression Sep 29 and 30, 2015.

Chapter 4 Linear Regression 1. Introduction Managerial decisions are often based on the relationship between two or more variables. For example, after.

© Copyright McGraw-Hill Correlation and Regression CHAPTER 10.

MARKETING RESEARCH CHAPTER 18 :Correlation and Regression.

Exploratory Tools for Spatial Data: Diagnosing Spatial Autocorrelation Main Message when modeling & analyzing spatial data: SPACE MATTERS! Relationships.

Geo479/579: Geostatistics Ch4. Spatial Description.

VI. Regression Analysis A. Simple Linear Regression 1. Scatter Plots Regression analysis is best taught via an example. Pencil lead is a ceramic material.

Spatial Statistics in Ecology: Point Pattern Analysis Lecture Two.

Simple linear regression Tron Anders Moger

1 1 Slide © 2003 South-Western/Thomson Learning™ Slides Prepared by JOHN S. LOUCKS St. Edward’s University.

Correlation & Regression Analysis

ANOVA, Regression and Multiple Regression March

Copyright © 2010 Pearson Education, Inc Chapter Seventeen Correlation and Regression.

Geo479/579: Geostatistics Ch7. Spatial Continuity.

McGraw-Hill/IrwinCopyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Simple Linear Regression Analysis Chapter 13.

Tutorial I: Missing Value Analysis

More on regression Petter Mostad More on indicator variables If an independent variable is an indicator variable, cases where it is 1 will.

Spatial statistics: Spatial Autocorrelation

Chapter 5 Part B: Spatial Autocorrelation and regression modelling.

Correlation and Regression

CHAPTER 29: Multiple Regression*

Prepared by Lee Revere and John Large

Spatial Autocorrelation

Product moment correlation

MGS 3100 Business Analysis Regression Feb 18, 2016

Presentation transcript:

Chapter 8 – Spatial contiguity (geostatistic + lattice data) EDA involves methods of describing and visualizing data and its structure in order to formulate hypotheses and check the validity of assumptions. R is a particularly useful tool for EDA. In the follows, some common techniques are introduced. Data posting – It is a map on which each data location is plotted along with its corresponding data value. gigantemap.r(soil87.dat) # soil87.dat is regularly spaced at 87 locations text(soil87.dat$gx,soil87.dat$gy, as.character(soil87.dat$pHFsrf),col=4,cex=0.8) 100 200 300 400 500 600 800 5.92 5.25 5.66 6.23 6.18 6.97 6.16 5.45 4.98 6.44 5.21 5.97 5.83 5.95 6.03 5.86 5.24 6.24 5.54 5.74 5.6 5.23 4.83 6.59 5.71 5.85 5.27 5.59 5.13 4.73 5.35 5.52 5.03 5.17 5.96 5.72 6.4 5.53 4.48 5.5 6.07 5.42 5.51 4.97 5.06 5.19 4.95 4.94 5.26 5.37 5.93 4.65 4.75 6.17 5.39 5.3 4.69 5.47 4.51 4.99 5.58 5.46 5.16 5.79 5.64

Symbol map It is similar to the data posting map, but with each location replaced by a symbol that represents the class to which the data belongs. z=soil87.dat$pHFsrf summary(z) Min. 1st Qu. Median Mean 3rd Qu. Max. 4.48 5.23 5.47 5.499 5.83 6.97 z[z<=5.23]=1 z[z>5.23&z<=5.83]=2 z[z>5.83]=3 gigantemap.r(soil87.dat) text(soil87.dat$gx,soil87.dat$gy, as.character(z),col=4) x y 100 200 300 400 500 600 800 3 2 1

Contour map, image and 3-D plot Contour/image maps are useful to reveal overall trends in the data. However, the smoothing out of the variability in the data sometimes can be misleading. Contour map (all these are R functions): 1. >pH.mat=gigante.mat.r(soil87.dat$gx,soil87.dat$gy,soil87.dat$pHFsrf) # Interpolate a surface for z values. 2. >gigantemap.r(soil87.dat) # draw a location map 3. >contour(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat), add=T) # add a contour map Image: 1. >gigantemap.r(soil87.dat) # draw a location map 2. >image(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat), add=T) # add an image 3-D perspective plot: >persp(sort(unique(soil87.dat$gx)),sort(unique(soil87.dat$gy)),t(pH.mat))

H-scatter plot Locations close to each other are likely to have similar values than locations far apart. H-scatter plot is a useful tool to display this spatial continuity. An h-scatter plot shows all possible pairs of data values whose locations are separated by a certain distance in a particular direction. A set of sample data, z(s), is from a systematical grid. The h-scatter plot is a scatter plot in which the x-axis is labeled z(s) and the y-axis is labeled z(s+h), where h is a spatial lag. If spatial continuity is strong, the cloud of points (z(s), z(s+h)) should fall closely along the 1-1 diagonal line. As h increases, the data values become less similar, the cloud on the h-scatter plot becomes fatter and more diffuse.

Distance lag and angle tolerance for h-scatter plot The construction of h-scatter plot is straightforward for regularly spaced sample locations. However, if the locations are irregularly spaced, in some situation there may be very few sample pairs for a given h, while in others there may be a lot of sample pairs. To overcome this problem we need to specify a tolerance both on the distance of h and on its direction. Illustrated below, any locations falling within the shaded area are considered to be within 5 m of a given location in an easterly direction. 5 m 20° 0.5 m + Regularly spaced + Irregularly spaced tol

Stationarity and isotropy The assumption of intrinsic stationarity of a random field is essential for geostatistical analysis. It is defined through first differences as follows: where g(h) is called (semi)variogram. Intrinsic stationarity basically implies a process with a constant mean and with a variance depending only through the magnitude and direction of h. Furthermore, if the variance 2g(h) is independent of direction, i.e., it only depends on the magnitude of ||h||, the random field is isotropic. Geostatistical data are typically composed of both large-scale trend and small-scale random variation. Geostatistics (variogram/Kriging) is designed to describe the small-scale variation, so it is necessary to remove the trend and anisotropy before conducting geostatistical analysis and spatial prediction.

zij = z.. + (zi. – z..) + (z.j – z..) + (zij – zi. – z.j + z..) Median polishing Median polishing is a resistant method and appropriate for grided data. It is based on an additive decomposition as data = grand mean + row + column + residual zij = z.. + (zi. – z..) + (z.j – z..) + (zij – zi. – z.j + z..) where a subscript dot denotes mean (median) over that subscript. The algorithm iterates successive sweeps of medians out of rows, then out of columns, accumulating them in row, column, and grand effects. At each step of the algorithm, the above relation is preserved. The residuals are what is left after the algorithm converges. The method cannot remove the interaction effects between rows and columns. Note: The residuals are the data that we should use in geostatistical analysis. It has been thought working with residuals introduces bias into variogram estimation, but the bias is generally small (particularly when median is used instead of mean).

Trend surface analysis Trend surface analysis is appropriate for (non-)grided data. It basically fits the data using a multiple regression (considering x and y coordinates as independent variables, interaction of x and y or higher order terms are usually needed). Data model: >grid=expand.grid(x=1:50,y=1:25) >x=grid$x >y=grid$y >z=10+0.1*x+0.0012*y^2+rnorm(1250,0,1) #data generating >x2=x^2 >y2=y^2 >xy=x*y >z=lm(z~x+y+xy+x2+y2) # fit the data z Mapping z: >x0=0:50 >y0=0:25 > image(x0,y0,matrix(z,nrow=50),col=topo.colors(100),main="Species distribution",xlab="x",ylab="y")

Trend surface analysis (cont’d) > summary(z.lm) Call: lm(formula = z0 ~ x + y + xy + x2 + y2) Residuals: Min 1Q Median 3Q Max -3.06491 -0.65335 0.03657 0.66170 2.88715 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.002e+01 1.478e-01 67.841 <2e-16 *** x 1.078e-01 8.404e-03 12.826 <2e-16 *** y -8.981e-03 1.706e-02 -0.527 0.599 xy 7.934e-05 2.614e-04 0.304 0.762 x2 -1.473e-04 1.461e-04 -1.008 0.314 y2 1.273e-03 5.862e-04 2.172 0.030 * --- Mapping z.lm residuals: > image(x0,y0,matrix(resid(z.lm),nrow=50),col=topo.colors(100),main=“Residuals",xlab="x",ylab="y")

Spatial autocorrelation Several techniques available to detect spatial autocorrelation (e.g., variogram and correlogram). Here we introduce four indices; we defer variogram to next chapter. These indices deal with binary, ordinal and continuous data. Mantel test Wij = ||si-sj|| Uij = ||z(si)-z(sj)|| z(si) random variable β is the slope of the regression Uij= βWij+eij

Mantel test (cont’d) Wij = ||si-sj|| Uij = ||z(si)-z(sj)|| z(si) We do not usually use β for Mantel test. We instead directly use M2 (or M1) for testing spatial patterning. Several methods can be used to test for significance. The basic idea is to randomly assign z1, z2, …, zn to the study area and then compare the observed M2 against the random M2’s to see how extreme the observed M2obs is. If M2obs is very different from the majority of random M2’s, we then reject the assumption of random distribution.

Mantel test – significance test Permutation test: Under the hypothesis of random distribution, the observed z(si) can be thought of as a random assignment of values to spatial location s. There are in total n! possible # of assignments (maps). Enumerating M2 for the n! arrangements yields its null distribution and the probability to exceed M2obs can be ascertained. Monte Carlo test: To permute n! is computingly very expensive. In this case, we can sample independently small subset of random assignments (e.g., 99) to construct a null distribution of M2. M2obs

Spatial connectivity (contiguity) Rook Bishop Queen Spatial connectivity (contiguity) North Carolina Measures on lattices Defining spatial connectivity is critical to the analysis of lattice data. There are several ways to define connectivity: by contiguity and distance. if sites i and j are connected if they are not connected

2. Black-Black and Black-White join counts Defining spatial connectivity is critical to the analysis of lattice data. There are several ways to define connectivity: by contiguity and distance. if sites i and j are connected if they are not connected if site i is occupied if it is not occupied

Test for significance Permutation or Monte Carlo test can be used to test the significance of the BB or BW statistics. Typically, the significance is tested by assuming an asymptotic normal distribution: Cliff, A.D. & Ord, J.K. 1981. Spatial processes: models and applications. Pion Limited, London.

3. Moran’s I where zi is the data value at location i, h is the distance between locations i and j, wij takes 1 if the pair (i, j) pertains to distance class h (the one for which the coefficient is computed), otherwise 0. W is the sum of wij. Moran’s I behaves like Pearson’s correlation coef because its numerator consists of a covariance term. Thus, it is sensitive to extreme values. Covariance based Distance based The association of z values at locations i and j can be measured by several ways:

Testing Moran’s I If zi is iid normal distribution, it can be shown that, as n is large, Moran’s I is asymptotically normally distributed with the first two moments: where Two types of test: Normal test if data are iid normal dist. Permutation (randomization) test for whatever data distribution.

Geary’s c Different from Moran’s I, this coefficient is a distance-type function because the numerator sums the squared differences between values at the various pairs of locations. In fact, Geary’s c corresponds in form to the Durbin-Watson statistic used to test the presence of autocorrelation in the residuals of a regression (particularly in time series data). A side note on the effects of autocorrelation on regression analysis: OLS regression coefs. are still unbiased, but they are no longer minimum variance estimates, i.e., the estimates are inefficient. If the residuals are positively autocorrelated, the residual mean square errors (MSE) may seriously underestimate s2, so that the std error may be too small. Thus, the CI’s are shorter than they really should be (see Cressie p.13-15), and tests of hypotheses on individual regression coefs may indicate that one or more regressors contribute significantly to the model when they really do not. Inflating Type I error! Generally, underestimating s2 gives a false impression of accuracy. The CI’s and tests of hypotheses based on the t- and F-distributions are not really appropriate.

Testing Geary’s I Similar to the Moran’s I, c is also asymptotically normally distributed if zi is iid normal distribution and n is large. The first two moments are: Two types of test: Normal test if data are iid normal distribution. Permutation (randomization) test for whatever data distribution.

spatstat functions for Moran’s I and Geary’s c > xy.dat=expand.grid(x=1:4, y=1:4) #x and y locations > xy.z=1:16 #illustrative data > xy.nb=dnearneigh(as.matrix(xy.dat), 0, 1) #create neighborhood structure > str(xy.nb) #view the neighbors > plot(xy.nb,xy.dat) #plot neighborhood > xy.W=nb2listw(xy.nb) #generate weight for each neighbor #assign equal weight to each neighbor > moran(xy.z, xy.W, length(xy.nb), Szero(xy.W)) # compute Moran’s I. > geary(xy.z, xy.W, length(xy.nb), length(xy.nb)-1,Szero(xy.W)) # compute Geary’s c. sp.correlogram is used to compute correlogram (the graphic relationship between autocorrelation coef against distance). However, care is needed for the proper use of sp.correlogram. When using sp.correlogram, only use the basic nb object that connect all cells, e.g. for the regular grids, we only need to use xy.nb=dnearneigh(as.matrix(xy.dat), 0, 1) to define the nb (higher orders of neighborhood is not needed). If using higher order neighborhood structure, e.g., xy.nb=dnearneigh(as.matrix(xy.dat), 0, 3), sp.correlogram will still compute correct sp.correlogram but the lag has a slightly different tolerance band.

styles in nb2listw >xy.W=nb2listw(xy.nb, style=“W”) It is helpful to understand the different weighting style in nblistw. There are five options in style: “W”, “B”, “C”, “U” and “S”. “W” is the default. You can use >str(xy.W) to view the difference between the styles. “W”: it weighs the neighbors of each focal point independently from the neighbors of other focal points. Each neighbors of a focal point has the same weight to the focal point. For example, if a focal point has two neighbors, then the weight for each neighbor is 0.5. If a focal point has four neighbors, then the weight for each neighbor is 0.25. The sum of the weights for all the focal points must be the total number of the focal points. “B”: it simply takes binary weight, i.e., each neighbor of a focal point contributes 1 to the focal point. The sum of the weights for all the focal points must be the total number of the neighbors. “C”: global standardized. It is calculated by dividing the number of neighbors of each focal point by the total number of neighbors (the sum calculated from “B” method). The sum of the weights must be 1. “S”: the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. 1999, p. 167-168. The sum of the weights is n.

Let’s compute correlogram for Gigante surface water pH values > soil87.xy=data.frame(x=soil87.dat$gx,y=soil87.dat$gy) #x and y locations > soil87.nb=dnearneigh(as.matrix(soil87.xy), 0, 60) #create xy neighborhood structure > plot(soil87.nb, soil87.xy) #plot the nb structure > soil87.pH.correlogram=sp.correlogram(soil87.nb, soil87.dat$pHFsrf, order=18, method=“I”, zero.policy=T) #compute correlogram > soil87.pH.correlogram > plot(soil87.pH.correlogram, main=“Gigante surface water pH values”) Neighborhood structure

R function for Moran’s I and Geary’s c The R program, autocorrelation.r, computes both I and c for a given data at a particular distance. >autocorrelation.r(data,x,y,dist1,dist2,k=0) # data – a data array we want to test its autocorrelation # x – the x-coordinates for the data array # y – the y-coordinates for the data array # dist1 and dist2 – used to define neighbors; locations fall within (dist1,dist2) of a focal location are defined as neighbors, not including dist1 and dist2. # k – the number of simulations for computing permutation distribution under the null hypothesis of no spatial autocorrelation (k=0 default) Example: autocorrelation.r(soil87.dat$pHFsrf,soil87.dat$gx,soil87.dat$gy,50,100,k=99)

LISAs (Localized Indicators of Spatial Autocorrelation) We can rewrite Mantel’s statistic M2 in terms of contribution of each individual site: Mi can be thought of as the autocorrelation measure associated with the ith site, a local measure. Definition of LISA: (1) It measures the extent of spatial autocorrelation around a particular site, (2) if its sum is proportional to a global measure of spatial autocorrelation such as BB, I or c. Applications: Detect regions where the autocorrelation is markedly different from other area. Detect local spatial clusters (hotspots). There are many definitions of hot (cold) spots. One way is to use histogram and identifies those sites as hot (cold) spots where observations are extreme. LISAs enable hotspot definition to take into account of the surrounding values of a site. A hot spot can be defined as one or several contiguous sites where the LISAs are unusually large or small. Distinguish spatial outliers from distributional outliers. A spatial outlier may not be a distributional outlier. where Anselin, L. 1995. Local indicators of spatial association – LISA. Geographic Analysis 27:93-115.

Moran’s LISAs A local version of Moran’s I can be written as It is obvious that the sum of the local Moran’s I is proportional to the total I (scaled by W, the total weights): Anselin (1995) derives the mean and variance of Ii under the randomization assumption: spdep packagge: “localmoran”.

A few technical comments Spatial autocorrelation analysis should not be performed with fewer than ~30 data points because the number of location pairs in each distance class would then become too small to produce reliable results. There are two ways of dividing distance into classes: (1) by forming equal distance classes (as we did for the soil pH values in soil87.dat), (2) by forming classes with equal frequencies (i.e., distance h is not equally spaced, but for each distance class the number of locations are approximately equal). Spatial autocorrelation analysis should not be performed with a data containing a lot of 0’s, because the degree of autocorrelation would then be overestimated and would reflect the fact that the locations share their absence for that variable, which is not what is intended in most applications. Euclidean distances between locations may not be the best way to express spatial relationships in applications. Such a spatial structure should be defined depending on the particular purposes of a study. Legendre, P. and Fortin, M.-J. 1989. Spatial pattern and ecological analysis. Vegetatio 80:107-138.