# EXPLORING SPATIAL CORRELATION IN RIVERS by Joshua French.


Introduction A city is required to extend its sewage pipelines farther into its bay to meet EPA requirements. How far should the pipelines be extended? The city doesn't want to spend any more money than it needs to on extending the pipelines. It needs a way to predict the waste levels at different sites in the bay.

With the passage of the Clean Water Act in the 1970s, spatial analysis of aquatic data has become even more important. Section 305(b) requires state governments to prepare "a description of the water quality of all navigable waters in such State...". It is not physically or financially possible to take measurements at all sites, so some sort of spatial interpolation will need to be used.

Usually we might try to fit some sort of linear model to the data to make predictions, and usually we assume the observations are independent. For spatial data, however, we intuitively expect two sampling sites in close proximity to be more similar than two sites separated by a great distance. We can use the correlation between sampling sites to make better predictions with our model.

The Ohio River

The Road Ahead
- Methods
  - Introduction to the Variogram
  - Exploratory Analysis
  - Sample Variogram
  - Modeling the Variogram
- Analysis
  - 3 types of results
- Conclusions
- Future Work

Introduction to the Variogram Spatial data is often viewed as a stochastic process. For each point $x$, a specific property $Z(x)$ is viewed as a random variable with mean $\mu$, variance $\sigma^2$, higher-order moments, and a cumulative distribution function.

Each individual $Z(x_i)$ is assumed to have its own distribution, and the set $\{Z(x_1), Z(x_2), \ldots\}$ is a stochastic process. The data values in a given data set are simply one realization of that stochastic process.

We want to measure the relationship between different points. Define the covariance of $Z(x_j)$ and $Z(x_k)$ to be:

$$\mathrm{Cov}(Z(x_j), Z(x_k)) = E[\{Z(x_j) - \mu(x_j)\}\{Z(x_k) - \mu(x_k)\}]$$

where $\mu(x_j)$ and $\mu(x_k)$ are the means of $Z$ at the respective locations.

However, we have a problem. We don't know the means at each point because we only have one realization. To solve this, we must assume some sort of stationarity: certain features of the distribution are identical everywhere. We will work with data that satisfies second-order stationarity.

Second-order stationarity means that the mean is the same everywhere, i.e. $E[Z(x_j)] = \mu$ for all points $x_j$. It also implies that $\mathrm{Cov}(Z(x_j), Z(x_k))$ becomes a function only of the separation between $x_j$ and $x_k$, not of their absolute positions.

Thus, $\mathrm{Cov}(Z(x_j), Z(x_k)) = \mathrm{Cov}(Z(x), Z(x+h)) = \mathrm{Cov}(h)$, where $h$ measures the distance between the two points. We can then derive that

$$\mathrm{Cov}(Z(x), Z(x+h)) = E[(Z(x) - \mu)(Z(x+h) - \mu)] = E[Z(x)\,Z(x+h)] - \mu^2$$

Sometimes it is clear that our data is not second-order stationary. Georges Matheron solved this problem in 1965 by establishing his intrinsic hypothesis. For small distances $h$, Matheron held that $E[Z(x) - Z(x+h)] = 0$.

Looking at the variance of differences, this leads to

$$\mathrm{Var}[Z(x) - Z(x+h)] = E[(Z(x) - Z(x+h))^2] = 2\gamma(h)$$

Intrinsic stationarity is useful because analysis may be conducted even if second-order stationarity is violated. Unfortunately, the covariance function $\mathrm{Cov}(h)$ is not guaranteed to exist under intrinsic stationarity alone.

For this reason, we will work with data that is second-order stationary. If the original data violates second-order stationarity, we will perform additional procedures (such as removing a trend or transforming the variable) to obtain data that is second-order stationary.

Note that second-order stationarity implies intrinsic stationarity, so the variogram equation is still defined. Under second-order stationarity, $\gamma(h) = \mathrm{Cov}(0) - \mathrm{Cov}(h)$. Strictly speaking, $\gamma(h)$ is the semivariogram. In practice, however, it is usually referred to simply as the variogram.
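The identity between the variogram and the covariance can be checked in a few lines. The derivation below is a sketch that uses only the definitions already given (constant mean, covariance depending only on $h$), with $\mathrm{Cov}(0)$ standing for the common variance $\sigma^2$:

```latex
\begin{aligned}
2\gamma(h) &= \mathrm{Var}[Z(x) - Z(x+h)] \\
           &= \mathrm{Var}[Z(x)] + \mathrm{Var}[Z(x+h)] - 2\,\mathrm{Cov}(Z(x), Z(x+h)) \\
           &= \mathrm{Cov}(0) + \mathrm{Cov}(0) - 2\,\mathrm{Cov}(h) \\
\gamma(h)  &= \mathrm{Cov}(0) - \mathrm{Cov}(h)
\end{aligned}
```

Because $\mathrm{Cov}(h)$ typically shrinks toward 0 as $h$ grows, this also shows why the variogram levels off near $\mathrm{Cov}(0) = \sigma^2$ at large lags.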

Things to know about variograms:
1. $\gamma(h) = \gamma(-h)$. Because it is an even function, usually only positive lag distances are shown.
2. Nugget effect: by definition, $\gamma(0) = 0$. In practice, however, sample variograms often have a positive intercept at lag 0. This is called the nugget effect.
3. Variograms tend to increase monotonically with lag distance.
4. Sill: the maximum value of the variogram, where it levels off.
5. Range: the lag distance at which the sill is reached.

The following figure shows these features.

Variogram Example

Exploratory Analysis Before we model variograms, we should explore the data.
- We need to make sure that the data analyzed satisfies second-order stationarity.
- We need to check for outliers.
- We need to make sure that the data is not too badly skewed (skewness coefficient $G_1 > 1$).

We can look at the river data as a one-dimensional linear system. It is fairly easy to check for stationarity using a scatter plot.

If there is an obvious trend in the data, we should remove it and analyze the residuals. If the variance increases or decreases with lag distance, then we should transform the variable to correct this.
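As a minimal sketch of the trend-removal step, the code below fits a straight line by ordinary least squares and returns the residuals. The linear form of the trend, the function names, and the sample numbers are all assumptions made for illustration; the slides do not say which trend model was used.

```python
def fit_linear_trend(x, z):
    """Ordinary least-squares fit of z = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return intercept, slope


def detrend(x, z):
    """Return the residuals of z after removing the fitted linear trend."""
    intercept, slope = fit_linear_trend(x, z)
    return [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]


# Hypothetical values: position along the river and a measured variable.
river_mile = [0.0, 10.0, 20.0, 30.0, 40.0]
measured = [12.0, 15.5, 17.0, 21.5, 24.0]
residuals = detrend(river_mile, measured)
```

The residuals, not the raw values, would then be carried forward into the variogram analysis.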

To check for outliers, we may use a typical boxplot. If the data contains outliers, we should do analysis both with and without outliers present.
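A boxplot conventionally flags points that lie more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule numerically; the 1.5 multiplier and the quartile method are assumptions, since the slides do not state which boxplot convention was used, and the function name is illustrative.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside the boxplot whisker fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - whisker * iqr
    high = q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = boxplot_outliers(values)

# Keep both versions so the analysis can be run with and without outliers.
without_outliers = [v for v in values if v not in flagged]
```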

If $G_1 > 1$, then we should transform the data to approximate normality if possible. To check approximate normality, a standard Q-Q plot can be used.
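The skewness check can be sketched as follows. This assumes that $G_1$ is the adjusted Fisher-Pearson sample skewness coefficient, which is a common definition but is not confirmed by the slides, and it uses the natural logarithm as the candidate transformation. The counts are made up for illustration.

```python
import math


def skewness_g1(values):
    """Adjusted Fisher-Pearson sample skewness coefficient (assumed form of G1)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical right-skewed counts (strictly positive so the log is defined).
counts = [5, 8, 12, 20, 35, 60, 150, 900]
logged = [math.log(c) for c in counts]

raw_skew = skewness_g1(counts)
log_skew = skewness_g1(logged)
```

If the data contained zeros, the logarithm would not be defined and a different transformation (or an offset) would have to be chosen.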

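The Sample Variogram One of the previous definitions of semivariance is:

$$\gamma(h) = \tfrac{1}{2}\, E[(Z(x) - Z(x+h))^2]$$

The logical estimator is:

$$\hat{\gamma}(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} \{z(x_i) - z(x_i + h)\}^2$$

where $N(h)$ is the number of pairs of observations associated with that lag.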
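The estimator averages the squared differences over the $N(h)$ pairs at each lag and halves the result. The sketch below does this for one-dimensional positions such as river miles. Grouping pairs into lag bins centered on multiples of `lag_width` is an assumption (irregularly spaced sites rarely share exact lags), and the function and parameter names are illustrative.

```python
def sample_variogram(positions, values, lag_width, max_lag):
    """Return (lag, semivariance, n_pairs) for each non-empty lag bin.

    Bin k is centered on k * lag_width. Pairs closer together than half a
    lag width fall in bin 0 and are ignored.
    """
    n_bins = int(max_lag / lag_width + 0.5)
    sq_sums = [0.0] * (n_bins + 1)
    counts = [0] * (n_bins + 1)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            h = abs(positions[i] - positions[j])
            k = int(h / lag_width + 0.5)  # index of the nearest bin center
            if 1 <= k <= n_bins:
                sq_sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [
        (k * lag_width, sq_sums[k] / (2 * counts[k]), counts[k])
        for k in range(1, n_bins + 1)
        if counts[k] > 0
    ]


# Hypothetical evenly spaced sites.
result = sample_variogram([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 4.0, 7.0],
                          lag_width=1.0, max_lag=3.0)
```

Returning the pair count alongside each estimate makes it easy to see which lags rest on too few pairs to be trusted.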

Sample Variogram Example

Modeling the Variogram Our goal is to estimate the true variogram of the data. Four variogram models were used to model the sample variogram: the spherical, Gaussian, exponential, and Matérn models.

Variogram Models
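For reference, three of the four models can be written as short functions of the lag. The parameterization below (nugget, partial sill, and a range parameter, with no scaling factor inside the exponentials) is one common convention and is an assumption here; software packages differ on this, and the slides do not give the exact formulas. The Matérn model is omitted because it needs a modified Bessel function that the Python standard library does not provide.

```python
import math


def spherical(h, nugget, partial_sill, range_):
    """Spherical model: reaches the sill exactly at h = range_."""
    if h <= 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)


def exponential(h, nugget, partial_sill, range_):
    """Exponential model: approaches the sill asymptotically."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_))


def gaussian(h, nugget, partial_sill, range_):
    """Gaussian model: parabolic near the origin, asymptotic sill."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-((h / range_) ** 2)))


# Illustrative parameter values only.
lags = [0.0, 5.0, 10.0, 25.0]
curve = [spherical(h, nugget=0.2, partial_sill=0.5, range_=10.0) for h in lags]
```

In all three functions the total sill is `nugget + partial_sill`, and the value at lag 0 is 0 by definition, so a positive nugget appears as a jump just above the origin.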

The spherical model is fit by least squares. The exponential, Gaussian, and Matérn models are fit by maximum likelihood. The spherical model is fit first to get initial estimates of the sill, nugget, and range.

These estimates are then used as starting values to fit the other three models. The best model is the one that minimizes the AICC (corrected Akaike information criterion) statistic.
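Model selection by AICC can be sketched as below, using the standard small-sample correction to the Akaike information criterion. The log-likelihoods and parameter counts are invented for illustration and are not results from the Ohio River data; the function names are hypothetical.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected AIC: -2 log L + 2k + 2k(k + 1) / (n - k - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("need n_obs > n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


def best_model(candidates, n_obs):
    """Pick the candidate name with the smallest AICC.

    candidates maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: aicc(*candidates[name], n_obs))


# Made-up fitted log-likelihoods for three maximum likelihood fits.
candidates = {
    "exponential": (-101.2, 3),
    "gaussian": (-103.9, 3),
    "matern": (-100.8, 4),
}
chosen = best_model(candidates, n_obs=200)
```

Note that the extra smoothness parameter of the Matérn model is penalized, so it must improve the likelihood enough to offset the larger parameter count.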

Analysis The data analyzed is a set of particle size and biological variables for the Ohio River. The data was collected by the Ohio River Valley Water Sanitation Commission, better known as ORSANCO.

ORSANCO data collection

There were between 190 and 235 unique sampling sites, depending on the variable. Some sites had more than one observation. In these situations, the average value for the site was used for analysis.
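Collapsing repeated observations to one value per site might look like the following. Identifying a site by its position along the river is an assumption, as are the function name and the sample values.

```python
from collections import defaultdict


def average_by_site(observations):
    """Average all values recorded at the same site.

    observations is an iterable of (site, value) pairs; the result maps
    each site to the mean of its values, ordered by site.
    """
    grouped = defaultdict(list)
    for site, value in observations:
        grouped[site].append(value)
    return {site: sum(vals) / len(vals) for site, vals in sorted(grouped.items())}


# Hypothetical records: (river mile, measured value).
records = [(12.5, 10.0), (40.0, 7.0), (12.5, 14.0)]
site_means = average_by_site(records)
```

Averaging first ensures that no pair of observations has a lag of exactly zero when the sample variogram is computed.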

Ohio River Sampling Sites

There were two main types of data: particle size data and biological levels. The particle size data measured percent gravel, percent sand, percent fines, percent hardpan, percent boulder, and percent cobble.

The biological data measured:
- Number of individuals at a site
- Number of species at a site
- Percent tolerant fish
- Percent simple lithophilic fish (fish that lay eggs on rocks)
- Percent non-native fish
- Percent detritivore fish (fish that eat mostly decomposed plants or animals)
- Percent invertivore fish (fish that eat mostly invertebrate animals)
- Percent piscivore fish (fish that eat mostly other fish)

The results of the analysis fell into three main groups:
- Sample variogram fit well
- Sample variogram did not fit well
- Analysis not reasonable

Good Results: Number of Individuals at a site The skewness coefficient of the data is 8.16, which is much too high. The data is transformed using the natural logarithm. The new skewness coefficient is reduced to 0.56. Not perfect, but much less skewed.

Check Normality of log(Num Individuals)

Check Second-Order Stationarity of log(Num Individuals)

Check for outliers of log(Num Individuals)

There are a number of outliers for the transformed variable. We should do the analysis with and without the outliers present.

log(Num Individuals) Sample Variogram with outliers

Check normality of log(Num Individuals) without outliers

log(Num Individuals) Sample Variogram without outliers

We were not able to model the sample variogram perfectly, but we were able to detect some amount of spatial correlation in the data, especially when the outliers were removed. For the transformed variable without outliers, the exponential model estimated the nugget to be 0.20, the sill to be 0.2709, and the range to be 37.7 miles.

Poor Results: Percent Sand The skewness coefficient is only 0.18, so skewness is not a major factor. Check second-order stationarity using a scatter plot.

Check Stationarity of Percent Sand

There appears to be a trend in the data. After removing the trend, the data appears to be second-order stationary. The residuals are also approximately normal.

Check stationarity of percent sand residuals

Check normality of percent sand residuals

Sample Variogram of percent sand residuals

The sample variogram does not really increase monotonically with distance. Our variogram models cannot fit this very well. Though we can obtain estimates of the nugget, sill, and range, the estimates cannot be trusted.

No results: Percent Hardpan This variable was so badly skewed that analysis was not reasonable. The skewness coefficient is 12.38. This is extremely high.

QQplot of Percent Hardpan

Scatter plot of Percent Hardpan

The data is nearly all zeros! There is also an erroneous data value. A percentage cannot be greater than 100%. Data analysis does not seem reasonable. Our data does not meet the conditions necessary to use the spatial methods discussed.

Conclusions
- Able to fit sample variogram reasonably well: percent gravel, number of individuals, number of species
- Not able to fit sample variogram well: percent sand, percent detritivore, percent simple lithophilic individuals, percent invertivore
- No results: remaining variables

Summary of Results

Future Work Data set involving three streams in Norfolk, Virginia. Each stream has 25 observations. Collected by researchers at Old Dominion University.
- Difficulties to overcome
  - What is the best way to measure distance between points?
  - Few observations
  - Overlapping points after coordinate conversion

Problem: What is the best way to measure distance between points? There is some aspect of two-dimensionality to the data, but it is still really a one-dimensional problem.

Problem: 25 observations per stream is considered the minimum number of points needed to create a variogram
- The sample variogram will be very rough
- Our variogram model estimates will probably be bad

To correct this, we will explore the possibility of combining the data from the three streams.

Problem: Overlapping points after conversion
- Original data is in longitude/latitude coordinates
- Convert to UTM coordinates so that Euclidean distance makes sense
- Converted UTM coordinates often result in overlapping sites (and even fewer unique sampling sites)

Stream Sampling Sites (Lat/Long)

Stream Sampling Sites (UTM)

Stream Sampling Sites (Lat/Long)

Stream Sampling Sites (UTM)

Acknowledgments - My committee: Dr. Urquhart, Dr. Wang, and Dr. Theobald - Dr. Davis and Dr. Reich for answering my spatial questions and letting me use their S-Plus spatial library

Concluding Thought Before you criticize someone, you should walk a mile in their shoes. That way, when you criticize them, you're a mile away and you have their shoes. - Jack Handey