Comparison of Design-Based and Model-Based Techniques for Selecting Spatially Balanced Samples of Environmental Resources. Don L. Stevens, Jr. Department of Statistics Oregon State University
The research described in this presentation has been funded by the U.S. Environmental Protection Agency through the STAR Cooperative Agreement CR82-9096-01 Program on Designs and Models for Aquatic Resource Surveys at Oregon State University. It has not been subjected to the Agency's review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.
Preview Two conceptual frameworks to support inference from sample properties to population characteristics: model-based & design-based Both encompass inference and sample selection methodologies Both sets of selection methodologies have techniques to incorporate prior information and knowledge
Preview Conjecture: With same prior information & knowledge, probabilistic samples can be near-optimal judged by model-based criteria Conjecture: Probabilistic samples can be more robust than optimal model-based samples
Preview Claim: With same prior information & knowledge, probabilistic samples can be near-optimal judged by model-based criteria Claim: Probabilistic samples can be more robust than optimal model-based samples There’s a catch: what is “optimal”?
Study Context Environmental monitoring and assessment application, particularly aquatics Response is a condition measure –Water quality –Chemical contamination –Biological quantity, e.g., IBI –Physical habitat metric –Salmon population levels
Study Context Populations distributed over space Sample sites will be visited more than once, possibly over a period of many years Overall sample may be split into panels
Study Context Environmental populations have spatial structure –Things close together tend to be influenced by same set of factors –Things close together tend to share similar substrates But the spatial structure is almost certainly not stationary
Study Context Structure may be patchy rather than smoothly changing –Localized management practices –Localized contamination –Localized development –Natural discontinuities Slope Substrate: soil, geology Watercourse
Study Context Most populations of interest have existing samples in place –Frequently convenience samples –Preservation of historical continuity important Most large populations (e.g., covering a substantial portion of a state) will have accessibility issues
Study Context Some portions of the population will require a higher intensity sample –Scientific, economic, or political interest Sample allocation may need to be modified –Emerging issues –Problems solved
Study Strategy Compare some techniques for optimal model-based design to the Generalized Random Tessellation Stratified (GRTS) design Various scenarios: –Variety of optimality criteria –Existing sample points –Variable interest (variable spatial density) –Inaccessible regions a priori & a posteriori
Optimality Criteria Statisticians think “best” means minimum variance, so optimum design is one that gives the minimum variance estimator, but….. Not always straightforward to decide on appropriate variance!
Optimality Criteria For example, suppose we need a value for the population mean –Usual approach in spatial statistics is to use kriging to “predict” a mean, so we need the prediction variance –But we need a variogram to krige, which usually has to be estimated assuming some model, so we should include variogram parameter uncertainty –But what about the variogram model itself?
Optimality Criteria For example, suppose we need a value for the population mean –Usual design-based approach is to minimize sampling variance over repeated sample selections –But even the design-based variance is dependent on spatial structure –So we could adopt a super-population model, and minimize expected variance Which puts us back in the spatial stats arena
Optimality Criterion Minimal assumptions: Points that are close together contain redundant information, so we want a design that gives maximal dispersion A point pattern that is “regular” in the stochastic point process sense gives maximal dispersion Thus, we need to look at regularity criteria to select optimality criterion
Study Strategy Compare using several optimality criteria –Regularity of point process K-function Van Groenigen & Stein MMSD Fractal dimension Mean square deviation of distance to side, vertex, boundary of Voronoi polygon –Variance of estimated population mean Over replicated sample selection Over replicated population realizations With models for non-stationary mean structure
Optimal Design A number of recent papers have used spatial simulated annealing (SSA) to locate optimal sampling points –Begin with a random set of points –Cycle through points, perturbing one at a time –At each step, calculate an optimality criterion –If better than old optimum, keep –If worse, accept with some probability that decreases with the number of cycles
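The annealing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited papers: the unit-square domain, the 12 x 12 evaluation grid, the step size, the starting temperature `t0`, and the geometric `cooling` schedule are all hypothetical choices, and the criterion used here is the mean shortest distance from grid nodes to the nearest sample point.

```python
import math
import random


def mean_shortest_distance(points, grid):
    """Average distance from each evaluation node to its nearest sample point."""
    return sum(min(math.dist(g, p) for p in points) for g in grid) / len(grid)


def anneal(n_points=5, n_cycles=60, step=0.2, t0=0.05, cooling=0.95, seed=1):
    """Spatial simulated annealing on the unit square (all settings are assumptions)."""
    rng = random.Random(seed)
    m = 12  # evaluation grid resolution (hypothetical)
    grid = [((i + 0.5) / m, (j + 0.5) / m) for i in range(m) for j in range(m)]

    # Begin with a random set of points.
    points = [(rng.random(), rng.random()) for _ in range(n_points)]
    current = mean_shortest_distance(points, grid)
    start = current
    best, best_points = current, list(points)

    temp = t0
    for _ in range(n_cycles):
        # Cycle through the points, perturbing one at a time.
        for k in range(n_points):
            x, y = points[k]
            cand = (
                min(1.0, max(0.0, x + rng.uniform(-step, step))),
                min(1.0, max(0.0, y + rng.uniform(-step, step))),
            )
            trial = points[:k] + [cand] + points[k + 1:]
            value = mean_shortest_distance(trial, grid)
            # Keep improvements; accept a worse design with a probability
            # that shrinks as the temperature falls over the cycles.
            if value < current or rng.random() < math.exp((current - value) / temp):
                points, current = trial, value
                if current < best:
                    best, best_points = current, list(points)
        temp *= cooling
    return start, best, best_points


start, best, design = anneal()
print(f"criterion: start={start:.4f}, best={best:.4f}")
```

Tracking the best design seen so far, separately from the current state, means an accepted uphill move never loses a good configuration.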
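Optimal Design Van Groenigen & Stein MMSD –Minimizes the mean shortest distance: with S a set of sample points and x a point in the target domain D, let d(x,S) be the distance from x to the nearest point in S. Then $\mathrm{MSD}(S) = \frac{1}{|D|}\int_D d(x,S)\,dx$ –Note that $\int_D d(x,S)\,dx = \sum_{s \in S}\int_{C(s)} \lVert x - s \rVert\,dx$ for C(s) the Voronoi polygon of s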
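As a rough numerical illustration of the mean shortest distance, d(x,S) can be averaged over a fine grid of evaluation nodes. The sketch below assumes a unit-square domain and a 50 x 50 midpoint grid (both hypothetical), and checks the approximation against the known closed form for a single sample point at the centre of a unit square, (√2 + ln(1 + √2))/6 ≈ 0.3826.

```python
import math


def mean_shortest_distance(sample, m=50):
    """Midpoint-rule approximation of the average of d(x, S) over the unit square."""
    total = 0.0
    for i in range(m):
        for j in range(m):
            node = ((i + 0.5) / m, (j + 0.5) / m)
            # d(x, S): distance from the node to the nearest sample point.
            total += min(math.dist(node, s) for s in sample)
    return total / (m * m)


approx = mean_shortest_distance([(0.5, 0.5)])
exact = (math.sqrt(2) + math.asinh(1)) / 6  # asinh(1) = ln(1 + sqrt(2))
print(f"grid approximation {approx:.4f}, closed form {exact:.4f}")
```

Because each node is assigned to its nearest sample point, the same loop implicitly sums the contributions of the individual Voronoi polygons.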
Optimal Design Ripley’s K function: K(r) : average number of additional sites within radius r of a site divided by the intensity of the process
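A naive estimate of K(r) follows directly from this definition. The sketch below is an assumption-laden illustration: it applies no edge correction (which a practical estimator would need), estimates the intensity as n divided by the domain area, and the function name `ripley_k` is hypothetical.

```python
import math


def ripley_k(points, r, area=1.0):
    """Naive K(r): mean number of other sites within r of a site, over intensity."""
    n = len(points)
    intensity = n / area
    pairs = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(points[i], points[j]) <= r
    )
    return (pairs / n) / intensity


# Four sites at the corners of the unit square.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
for radius in (0.5, 1.0, 1.5):
    print(radius, ripley_k(corners, radius))
```

For the four corner sites, no neighbour lies within 0.5, two lie within 1.0, and all three lie within 1.5, so the estimates are 0, 0.5, and 0.75.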
Optimal Design Di Zio, Fontanella & Ippoliti used a measure related to the fractal dimension: Let D be the slope of the best fitting line produced when log(K(r)) is regressed against log(r) As sites become more evenly dispersed, D should approach 2, so 2-D is a measure of irregularity.
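The slope can be obtained with an ordinary least-squares fit on the log-log scale. In this sketch the function name `loglog_slope` and the choice of radii are hypothetical, and the input K(r) = πr² (the theoretical K function of a homogeneous Poisson process) is used only because its log-log slope is exactly 2, which makes the arithmetic easy to check.

```python
import math


def loglog_slope(radii, k_values):
    """Least-squares slope of log K(r) regressed on log r (K values must be > 0)."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(k) for k in k_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


radii = [0.05 * i for i in range(1, 11)]
k_reference = [math.pi * r ** 2 for r in radii]

slope = loglog_slope(radii, k_reference)
irregularity = 2 - slope  # the "2 - D" measure from the slide
print(f"slope = {slope:.6f}, 2 - D = {irregularity:.6f}")
```

In practice the K values would come from an estimate on the candidate design rather than from a formula.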
Optimal Design Proposed criterion (SVB): Let B(C(s)) be the boundary of the Voronoi polygon of s. SVB is approximated by the mean square deviation (MSD) of the distance from a sample point to the Sides, Vertices, and Boundaries of its Voronoi polygon, relative to a nominal value (a side is an edge that separates two sample points; a boundary is an edge determined by the domain)
Existing Points Spatial simulated annealing (SSA) can optimize placement of new sample points given some existing points Can do something similar with GRTS: –Determine limits on grid resolution & placement such that existing points are all in distinct cells –Do GRTS design conditional on those limits, and “select” cells with existing points
Existing Points Illustrate with 25 point design –Unrestricted –5 points fixed, 20 unrestricted
Simulation Study Model-based approach: vary the surface, not the sample Created a patchy surface by “mixing” 3 smooth surfaces: a plane, a normal density, and a surface with several bumps, plus random noise
Simulation Study Generated 1000 replicates of the random surface Sample each replicate with the Uniform Random, Fractal, SVB, and GRTS design points Calculate mean for each replicate, & variance of estimated mean over all replicates
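The "vary the surface, not the sample" procedure can be sketched as below. Everything specific here is an assumption made for illustration: the three component surfaces, the random mixing weights, the noise level, the 200 replicates (instead of 1000), and the two fixed 25-point designs. A regular grid stands in for a spatially balanced design; GRTS, Fractal, and SVB designs are not implemented.

```python
import math
import random
import statistics


def make_surface(rng):
    """One random surface: a random-weight mix of three smooth components plus noise."""
    w = [rng.random() for _ in range(3)]
    total = sum(w)
    w = [v / total for v in w]

    def value(x, y):
        plane = x + y
        bell = math.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / 0.05)
        bumps = math.sin(3 * math.pi * x) ** 2 * math.sin(3 * math.pi * y) ** 2
        return w[0] * plane + w[1] * bell + w[2] * bumps + rng.gauss(0.0, 0.05)

    return value


def variance_of_estimated_mean(design, n_reps=200, seed=0):
    """Hold the design fixed, redraw the surface, and track the sample mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_reps):
        surface = make_surface(rng)
        means.append(statistics.fmean([surface(x, y) for x, y in design]))
    return statistics.variance(means)


rng = random.Random(42)
random_design = [(rng.random(), rng.random()) for _ in range(25)]
grid_design = [((i + 0.5) / 5, (j + 0.5) / 5) for i in range(5) for j in range(5)]

results = {
    "uniform random": variance_of_estimated_mean(random_design),
    "regular grid": variance_of_estimated_mean(grid_design),
}
for name, var in results.items():
    print(f"{name}: variance of estimated mean = {var:.5f}")
```

Both designs are evaluated with the same seed, so they see the same sequence of mixing weights and differ only in where the surface is sampled.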
Mean Structure Model Express the response as $y(s) = m(s) + z(s)$, where m(s) is mean structure, and z(s) is a random field (hopefully stationary) Following a suggestion by Cressie, we’ll use a model based on applying a median polish to determine mean structure
Mean Structure Model Median polish is analogous to ANOVA, in that the mean is expressed as sum of overall, row, & column effects Effects are estimated in an iterative procedure: –Extract row-wise medians –Extract column-wise medians –Add sum of median of row medians & median of column medians to overall effect –Iterate several times.
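The iteration can be sketched with the standard Tukey median polish on a rectangular table. This is a minimal illustration, assuming the responses have already been arranged on a row-by-column grid; the function name, the fixed iteration count, and the small additive test table are hypothetical.

```python
from statistics import median


def median_polish(table, n_iter=10):
    """Decompose table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    resid = [[float(v) for v in row] for row in table]
    nrow, ncol = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * nrow
    col_eff = [0.0] * ncol

    for _ in range(n_iter):
        # Extract row-wise medians from the residuals.
        for i in range(nrow):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        # Move the median of the row effects into the overall effect.
        m = median(row_eff)
        overall += m
        row_eff = [v - m for v in row_eff]

        # Extract column-wise medians from the residuals.
        for j in range(ncol):
            m = median(resid[i][j] for i in range(nrow))
            for i in range(nrow):
                resid[i][j] -= m
            col_eff[j] += m
        # Move the median of the column effects into the overall effect.
        m = median(col_eff)
        overall += m
        col_eff = [v - m for v in col_eff]

    return overall, row_eff, col_eff, resid


# A purely additive table: 10 + row effect + column effect.
rows, cols = [-1, 0, 2], [0, 1, 3, 5]
table = [[10 + r + c for c in cols] for r in rows]
overall, row_eff, col_eff, resid = median_polish(table)
print(overall, row_eff, col_eff)
```

For a purely additive table the residuals vanish after the first sweep; whatever remains in real data is the part treated as the random field.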
Mean Structure Model Median polish will extract some kinds of structure, but doesn’t handle a patch-like response Try CART, with x,y coordinates as “classifying” variables
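The core step of CART with coordinates as predictors is an axis-aligned split that minimizes the within-group sum of squares. The sketch below finds only the single best split, not a full tree, and the site layout, response values, and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(sites):
    """sites: list of (x, y, response). Returns (axis, threshold, total_sse)."""
    best = None
    for axis in (0, 1):  # 0 splits on x, 1 splits on y
        coords = sorted({s[axis] for s in sites})
        # Candidate thresholds are midpoints between adjacent distinct coordinates.
        for lo, hi in zip(coords, coords[1:]):
            thr = (lo + hi) / 2
            left = [s[2] for s in sites if s[axis] <= thr]
            right = [s[2] for s in sites if s[axis] > thr]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (axis, thr, total)
    return best


# A 6 x 6 grid of sites with a patch: response jumps where x exceeds 0.5.
sites = [
    (i / 5, j / 5, 5.0 if i / 5 > 0.5 else 1.0)
    for i in range(6)
    for j in range(6)
]
axis, threshold, total = best_split(sites)
print(f"split on {'x' if axis == 0 else 'y'} at {threshold:.2f}, SSE = {total:.2f}")
```

Applying the same search recursively inside each resulting group would grow a tree whose leaves are rectangular patches with constant fitted mean.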
Example Data Set ODFW Coho Salmon spawners –Basic response is density (fish/km) of adult fish at a site –Pooled data set over five years –Normalized each year by total number of fish counted that year –Response is then proportion of total run at the site