Spatial Scan Statistic for Geographical and Network Hotspot Detection C. Taillie and G. P. Patil Center for Statistical Ecology and Environmental Statistics.

Presentation transcript:

Spatial Scan Statistic for Geographical and Network Hotspot Detection
C. Taillie and G. P. Patil
Center for Statistical Ecology and Environmental Statistics, Penn State University
Joint Statistical Meetings, Toronto, Canada, August 11, 2004

Examples of Hotspot Analysis
Spatial:
– Disease surveillance
– Biodiversity: species-rich and species-poor areas
– Geographical poverty analysis
Network:
– Water resource impairment at watershed scales
– Water distribution systems, subway systems, and road transport systems
– Social networks/terror networks

Issues in Hotspot Analysis
– Estimation: Identification of areas having unusually high (or low) response
– Testing: Can the elevated response be attributed to chance variation (false positive) or is it statistically significant?
– Explanation: Assess explanatory factors that may account for the elevated response

Spatial Scan Statistic Model
Study area with a spatially distributed response; candidate hotspot (zone) Z
Given Z, assume:
– Response is spatially homogeneous inside Z and outside Z, but with different mean values
– Response inside Z and outside Z is described by parametric distributions (binomial or Poisson in disease surveillance)
Likelihood: L(Z, p_0, p_1)
Key idea: Z is an unknown parameter

Likelihood Estimation
Maximize L(Z, p_0, p_1) in two stages:
– Fix Z and maximize with respect to the conventional parameters p_0 and p_1. This gives the profile likelihood L(Z) for Z
– Maximize L(Z) across all candidate zones Z
What is a candidate zone?
(Figure: study area with spatially distributed response and a zone Z)
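The two stages can be illustrated with a minimal Python sketch. It assumes the Poisson model (one of the two families named for disease surveillance), with one rate inside Z and one outside; the function names, the toy counts, and the hand-picked candidate list are hypothetical and only serve the illustration.

```python
import math

def xlog_rate(y, a):
    """y * log(y / a), using the convention 0 * log(0) = 0."""
    return y * math.log(y / a) if y > 0 else 0.0

def profile_loglik(zone, counts, sizes):
    """Stage 1: Poisson log-likelihood with the inside and outside rates
    replaced by their MLEs (count / size). Terms that do not depend on
    the zone are dropped."""
    y_in = sum(counts[c] for c in zone)
    a_in = sum(sizes[c] for c in zone)
    y_out = sum(counts.values()) - y_in
    a_out = sum(sizes.values()) - a_in
    return xlog_rate(y_in, a_in) + xlog_rate(y_out, a_out)

def best_zone(candidates, counts, sizes):
    """Stage 2: maximize the profile likelihood over the candidate zones."""
    return max(candidates, key=lambda z: profile_loglik(z, counts, sizes))

# Toy data (assumed): four cells of equal size, elevated counts in b and c.
counts = {"a": 2, "b": 9, "c": 8, "d": 1}
sizes = {"a": 10, "b": 10, "c": 10, "d": 10}
candidates = [frozenset("b"), frozenset("c"), frozenset("bc"), frozenset("abc")]

best = best_zone(candidates, counts, sizes)
print(sorted(best))  # ['b', 'c']
```

This sketch does not check that the rate inside the zone is the higher one; a hotspot search would normally add that restriction.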

Candidate Zones
– Tessellated study area
– Want zones to be connected
– Maximize L(Z) for Z in Ω, the set of candidate zones
(Figure: an allowable zone and a set of cells that is not a zone)

Maximize L(Z) for Z in Ω
Ω is a finite set, but usually too big for exhaustive search
Possible strategies:
– Search space reduction: replace Ω by a smaller set Ω_0 and do an exhaustive search across Ω_0
  – Circles (Martin Kulldorff)

Poor Hotspot Delineation by Circular Zones
– Circular zones may represent a single hotspot as multiple hotspots
(Figure: a hotspot and its circular zone approximations)

Maximize L(Z) for Z in Ω
Ω is a finite set, but usually too big for exhaustive search
Possible strategies:
– Search space reduction: replace Ω by a smaller set Ω_0 and do an exhaustive search across Ω_0
  – Circles (Martin Kulldorff)
  – Ellipses
  – Upper level sets
– Stochastic optimization
  – Simulated annealing (Luiz Duczmal)
  – Genetic algorithms
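As an illustration of search space reduction by circles, the sketch below builds a reduced candidate set from cell centroids: each cell acts as a circle center, and the circle grows by absorbing the next-nearest centroid. The function name, the centroid representation, and the `max_fraction` size cap are assumptions made for this example, not details taken from the slides.

```python
import math

def circular_zones(centroids, sizes, max_fraction=0.5):
    """Candidate zones made of the cells whose centroids fall in a circle
    around some cell centroid. Circles stop growing once the zone would
    exceed max_fraction of the total size (an assumed cap)."""
    total = sum(sizes.values())
    zones = set()
    for center in centroids:
        # Cells ordered by distance from the current center (center first).
        by_distance = sorted(
            centroids,
            key=lambda c: math.dist(centroids[center], centroids[c]),
        )
        zone, size = [], 0.0
        for cell in by_distance:
            size += sizes[cell]
            if size > max_fraction * total:
                break
            zone.append(cell)
            zones.add(frozenset(zone))
    return zones

# Toy data (assumed): four cells on a line, no tied distances.
centroids = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.5, 0.0), "d": (3.0, 0.0)}
sizes = {"a": 10, "b": 10, "c": 10, "d": 10}
zones = circular_zones(centroids, sizes)
print(len(zones))  # 6
```

Tied distances are broken arbitrarily here, which is a simplification. The resulting set grows at most quadratically with the number of cells, which is what makes an exhaustive search feasible.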

Spatial Scan Statistic Setup
Tessellation of a geographic region:
– Region R, tessellation T = {a} of R
– Cell a, response Y_a, cell “size” A_a
– Two distributional settings:
  – Y_a is Binomial(N_a, p_a), A_a = N_a, p_a = cell rate
  – Y_a is Poisson(λ_a A_a), λ_a = cell rate
– Cell sizes A_a are known and fixed
– Cell responses Y_a are independent across cells a
(Figure: tessellation with cell labels a, b, c, ..., k)
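For reference, the profile likelihood for a fixed zone Z can be written out under the two settings. This is a sketch under assumed notation: Y_Z, A_Z, N_Z are sums over the cells in Z, Z^c is the complement of Z in R, subscript 1 marks the parameter inside Z and subscript 0 the parameter outside, and C, C' collect terms that do not depend on the parameters.

```latex
% Poisson setting: rate \lambda_1 inside Z, \lambda_0 outside Z
\begin{aligned}
\log L(Z,\lambda_0,\lambda_1)
  &= Y_Z \log\lambda_1 - \lambda_1 A_Z
   + Y_{Z^c} \log\lambda_0 - \lambda_0 A_{Z^c} + C, \\
\hat\lambda_1 &= Y_Z / A_Z, \qquad \hat\lambda_0 = Y_{Z^c} / A_{Z^c}, \\
\log L(Z)
  &= Y_Z \log\frac{Y_Z}{A_Z} + Y_{Z^c} \log\frac{Y_{Z^c}}{A_{Z^c}}
   - (Y_Z + Y_{Z^c}) + C .
\end{aligned}

% Binomial setting: rate p_1 inside Z, p_0 outside Z
\begin{aligned}
\log L(Z,p_0,p_1)
  &= Y_Z \log p_1 + (N_Z - Y_Z)\log(1 - p_1)
   + Y_{Z^c} \log p_0 + (N_{Z^c} - Y_{Z^c})\log(1 - p_0) + C', \\
\hat p_1 &= Y_Z / N_Z, \qquad \hat p_0 = Y_{Z^c} / N_{Z^c}.
\end{aligned}
```

In the Poisson case the term Y_Z + Y_{Z^c} is the total count over R, so it is the same for every zone and can be dropped when comparing zones.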

Upper Level Set (ULS)
Data-adaptive approach to reduced parameter space Ω_0:
– Zones in Ω_0 are connected components of upper level sets of the empirical rate function G_a = Y_a / A_a
– Upper level set (ULS) at level g consists of all cells a where G_a ≥ g
– Upper level sets may be disconnected. Connected components are the candidate zones in Ω_0
– These connected components form a rooted tree under set inclusion:
  – Root node = entire region R
  – Leaf nodes = local maxima of empirical rates
  – Junction nodes occur when two zones coalesce
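One possible way to enumerate the ULS zones is to lower the level g through the observed rates and track connected components with a union-find structure. The sketch below is an assumed implementation for illustration only; the function name, the adjacency-list input, and the toy data are hypothetical.

```python
from itertools import groupby

def uls_zones(rates, neighbors):
    """Return (level, zone) pairs: the connected components of the upper
    level sets {a : G_a >= g} that change as g drops through each rate."""
    parent, members = {}, {}

    def find(cell):
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path halving
            cell = parent[cell]
        return cell

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if len(members[ra]) < len(members[rb]):
            ra, rb = rb, ra
        parent[rb] = ra
        members[ra] |= members.pop(rb)

    zones = []
    order = sorted(rates, key=rates.get, reverse=True)
    for level, group in groupby(order, key=rates.get):
        group = list(group)  # all cells whose rate equals this level
        for cell in group:
            parent[cell] = cell
            members[cell] = {cell}
        for cell in group:
            for nb in neighbors[cell]:
                if nb in parent:  # neighbor already in the upper level set
                    union(cell, nb)
        for root in {find(cell) for cell in group}:
            zones.append((level, frozenset(members[root])))
    return zones

# Toy data (assumed): five cells on a line, a - b - c - d - e.
rates = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.5, "e": 0.1}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
zones = uls_zones(rates, neighbors)
for level, zone in zones:
    print(level, sorted(zone))
```

In this toy example the leaves are {a} and {c}, the zones {a} and {c, d} coalesce when cell b enters at level 0.2, and the last zone is the whole region. Each level contributes at most one new zone per cell entering at that level, so the number of zones never exceeds the number of cells.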

Upper Level Sets (ULS) of Response Surface
(Figure: hotspot zones at level g, the connected components of the upper level set)

Changing Connectivity of ULS as Level Drops

ULS Zonal Tree
– A, B, C are junction nodes where multiple zones coalesce into a single zone
(Figure: schematic rate “surface” with junction nodes A, B, C)

ULS Zonal Tree
– Maximize profile likelihood L(Z) across the tree
– Number of nodes does not exceed number of cells in tessellation
(Figure: schematic rate “surface” with junction nodes A, B, C, and typical behavior of L(Z))

Confidence Sets and Hotspot Rating
– Determine a confidence set for the hotspot
– Each member of the confidence set is a zone that is a statistically plausible delineation of the hotspot at the specified confidence level
– The confidence set lets us rate individual cells a for hotspot membership
– Rating for cell a is the percentage of zones in the confidence set that contain a
– Map of cell ratings:
  – Inner envelope = cells with 100% rating
  – Outer envelope = cells with positive rating
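The rating rule is simple enough to state in a few lines of Python. The sketch below assumes the confidence set is already available as a list of zones (sets of cell labels); the function name and the toy zones are hypothetical.

```python
def hotspot_ratings(confidence_set, cells):
    """Rate each cell by the percentage of zones in the confidence set that
    contain it, and derive the inner and outer envelopes from the counts."""
    n_zones = len(confidence_set)
    hits = {c: sum(c in zone for zone in confidence_set) for c in cells}
    ratings = {c: 100.0 * hits[c] / n_zones for c in cells}
    inner = {c for c in cells if hits[c] == n_zones}  # 100% rating
    outer = {c for c in cells if hits[c] > 0}         # positive rating
    return ratings, inner, outer

# Toy confidence set (assumed): three plausible delineations of one hotspot.
cells = ["a", "b", "c", "d", "e"]
confidence_set = [{"c"}, {"c", "d"}, {"b", "c", "d"}]
ratings, inner, outer = hotspot_ratings(confidence_set, cells)
print(sorted(inner), sorted(outer))  # ['c'] ['b', 'c', 'd']
```

The inner envelope is the intersection of the zones in the confidence set and the outer envelope is their union.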

Hotspot Rating
(Figure: map showing the MLE zone, the inner envelope, and the outer envelope)

Confidence Set Determination
– Confidence set is all null hypotheses that cannot be rejected
– As hypotheses, use H_0: Z = Z_0, where Z_0 ∈ Ω_0 is a given zone
– Confidence set is all Z_0 ∈ Ω_0 for which H_0: Z = Z_0 cannot be rejected
– Likelihood ratio test: compare the profile likelihood L(Z_0) with its maximum over Ω_0
– Null distributions have to be determined by simulation
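A simulation-based test of H_0: Z = Z_0 might look like the sketch below. Several choices here are assumptions rather than details from the slides: the Poisson model, simulating from the rates fitted under H_0 (a parametric bootstrap), the log likelihood ratio as test statistic, the Monte Carlo p-value formula, and all function names and toy data.

```python
import math
import random

def profile_loglik(zone, counts, sizes):
    """Poisson profile log-likelihood of a zone, up to zone-free constants."""
    def term(y, a):
        return y * math.log(y / a) if y > 0 else 0.0
    y_in = sum(counts[c] for c in zone)
    a_in = sum(sizes[c] for c in zone)
    return term(y_in, a_in) + term(sum(counts.values()) - y_in,
                                   sum(sizes.values()) - a_in)

def lr_statistic(z0, zones, counts, sizes):
    """log of max L(Z) / L(Z_0); zero when Z_0 is itself the maximizer."""
    best = max(profile_loglik(z, counts, sizes) for z in zones)
    return best - profile_loglik(z0, counts, sizes)

def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for small means only."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def in_confidence_set(z0, zones, counts, sizes, level=0.95, n_sim=199, seed=0):
    """Monte Carlo test of H_0: Z = z0. Returns (kept, p_value)."""
    rng = random.Random(seed)
    y_in = sum(counts[c] for c in z0)
    a_in = sum(sizes[c] for c in z0)
    y_out = sum(counts.values()) - y_in
    a_out = sum(sizes.values()) - a_in
    rate_in = y_in / a_in
    rate_out = y_out / a_out if a_out > 0 else 0.0
    observed = lr_statistic(z0, zones, counts, sizes)
    exceed = 0
    for _ in range(n_sim):
        # Simulate counts with the rates fitted under H_0 (assumption).
        sim = {c: poisson_draw((rate_in if c in z0 else rate_out) * sizes[c], rng)
               for c in counts}
        if lr_statistic(z0, zones, sim, sizes) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_sim + 1)
    return p_value > 1 - level, p_value

# Toy data (assumed): four equal cells and a small candidate set.
counts = {"a": 2, "b": 9, "c": 8, "d": 1}
sizes = {"a": 10, "b": 10, "c": 10, "d": 10}
zones = [frozenset("b"), frozenset("c"), frozenset("bc"), frozenset("abc")]
mle = max(zones, key=lambda z: profile_loglik(z, counts, sizes))
confidence_set = [z for z in zones if in_confidence_set(z, zones, counts, sizes)[0]]
```

Because the observed statistic is zero at the maximum likelihood zone, that zone always has p-value 1 under this scheme and so always belongs to the confidence set.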

Confidence Region on ULS Tree
(Figure: ULS tree over the tessellated region R, marking the MLE, a junction node, an alternative hotspot locus, and an alternative hotspot delineation)

Space-Time Generalizations

Some Space-Time Hotspots and Their Cylindrical Approximations
– Cylindrical approximation sees a single hotspot as multiple hotspots
(Figure: hotspots and their cylindrical approximations; axes: space, time)

Shifting poverty
(Figure: Oakland poverty data for 1970, 1980, and 1990)

Typology of Space-Time Hotspots
– Stationary hotspot
– Shifting hotspot
– Expanding hotspot
– Merging hotspot
(Figure: time slices of each space-time hotspot; axes: space (census tract), time (census year))

Trajectory of a Merging Hotspot

Hotspot Detection for Continuous Responses
Human health context:
– Blood pressure levels for spatial variation in hypertension
– Cancer survival (censoring issues)
Environmental context:
– Landscape metrics such as forest cover, fragmentation, etc.
– Pollutant loadings
– Animal abundance

Hotspot Model for Continuous Responses
– Simplest distributional model: gamma, Y_a ~ Gamma(k, β), with index parameter k and scale parameter β
– Additivity with respect to the index parameter k suggests that we model k as proportional to size: k_a ∝ A_a
– Scale parameter β takes one value inside Z and another outside Z
– Other distribution models (e.g., lognormal) are possible but are computationally more complex and applicable to only a single spatial scale

Circles vs ULS
– Circles capture only compactly shaped clusters; want to identify clusters of arbitrary shape
– Circles provide point estimate of hotspot; want to assess estimation uncertainty (hotspot confidence set)
– Circles handle only synoptic (tessellated) data; want to also handle data on a network

Features of ULS Scan Statistic
– Identifies arbitrarily shaped clusters
– Applicable to data on a network
– Confidence set, hotspot rating
– Computationally efficient
– Applicable to continuous response
– Generalizes to space-time scan