Hot Spot Detection in a Network Space: Geocomputational Approaches. Ikuho Yamada, Ph.D., Department of Geography & School of Informatics, IUPUI. October 3, 2005. Fall 2005 Talk Series on Network and Complex Systems.
Introduction. Clusters in a spatial phenomenon are hot spots: places where the occurrence or level of the phenomenon is higher than expected. Detecting hot spots is useful for understanding the nature of the phenomenon itself (factors influencing the phenomenon) and for decision making in related policies and planning (remedial/preventive actions, regional development planning, new facility design, etc.).
Introduction (cont.). Potential problem: the spatial distribution of the phenomenon may be affected by a transportation network, e.g., vehicle crashes, retail facilities, crime locations. Analytical results derived without considering the network's influence will be misleading, especially for detailed micro-scale data and local-scale analysis. The analysis should therefore be based on a network space rather than the Euclidean space. Figure annotation: "Cluster? No!"
Objectives. Data: highway network; vehicle crash locations. Stage 1: detecting local clusters, i.e., black spots (clusters of crashes). Stage 2: identifying influencing factors with a classifier that determines cluster or not (e.g., a decision tree). Questions: 1. Is there any clustering tendency? 2. Where are the clusters? 3. How large are the clusters? 4. What causes the clusters? Stage 1 answers Questions 1, 2, and 3; Stage 2 answers Question 4.
Objectives (cont.). Stage 1: Cluster detection in the network space. To develop exploratory spatial data analysis methods for network-based local-cluster detection, named local indicators of network-constrained clusters (LINCS). For event-based data: K-function. For link-attribute-based data: Moran's I and Getis & Ord's G statistics.
Objectives (cont.). Stage 2: Influencing factor identification. To examine the applicability of inductive learning techniques for constructing models that explain the clusters in relation to the characteristics of the network space: decision tree induction algorithms; feedforward neural networks; and discrete choice/regression models as examples of traditional statistical methods.
Outline. Constraints imposed by the network space. Stage 1: development of LINCS; network K-function for event-based data. Stage 2: inductive learning; decision tree induction to model relationships between the detected clusters and the network attributes. Case study: 1997 vehicle crash data in Buffalo, NY. Conclusions.
Constraints imposed by the network space Location constraint: Some spatial phenomena occur only on the links of the network. E.g., vehicle crashes, retail facilities, geocoded addresses (crime locations, patient residences, …); Movement constraint: Movement between locations is restricted to the network links; E.g., One can get to a gas station only by driving along the streets; Distance between locations is more appropriately represented by the network (shortest-path) distance than by the Euclidean (straight-line) distance.
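The gap between the two distance measures can be illustrated with a minimal sketch. The four-node U-shaped network, its coordinates, and the function names below are hypothetical and are not taken from the talk; link lengths are assumed to equal the straight-line length of each link.

```python
import heapq
import math

# Hypothetical toy network: node -> (x, y) coordinates.
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (1.0, 1.0), "D": (0.0, 1.0)}
# Undirected links forming a "U": A-B, B-C, C-D (there is no direct A-D link).
links = [("A", "B"), ("B", "C"), ("C", "D")]


def euclidean(u, v):
    """Straight-line distance between two nodes."""
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return math.hypot(x2 - x1, y2 - y1)


def build_adjacency(links):
    """Adjacency list with link lengths (assumed equal to straight segments)."""
    adj = {node: [] for node in coords}
    for u, v in links:
        length = euclidean(u, v)
        adj[u].append((v, length))
        adj[v].append((u, length))
    return adj


def network_distance(adj, source, target):
    """Shortest-path distance along the links, by Dijkstra's algorithm."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf


adj = build_adjacency(links)
straight = euclidean("A", "D")                    # 1.0
along_network = network_distance(adj, "A", "D")   # 3.0
print(straight, along_network)
```

In this toy case A and D are one unit apart in a straight line but three units apart along the network, so a distance-based cluster test would treat them very differently depending on the space used.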
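Network constraints (cont.)
Phenomena classified by whether the location constraint and the movement constraint apply:
Both constraints apply: vehicle crashes, retail facilities, traffic speed, …
Only one of the two constraints applies: traffic noise, vehicle emission, …
Neither constraint applies: trees in a forest, people on a square, molecules in the air, …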
Stage 1 Cluster detection in the network space
Global Network K-function (Okabe & Yamada 2001). An extension of Ripley's K-function (1976) to determine whether a point pattern has a clustering/dispersal tendency significantly different from random with respect to the network. It is defined for a set of network-constrained events P, with ρ denoting the intensity of points. Figure: planar K-function vs. network K-function, marking events within distance h and not within distance h.
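As a rough illustration, the sketch below computes a network K value on a toy network. It assumes the estimator L_T / (n(n-1)) times the number of ordered event pairs within network distance h, where L_T is the total network length; the slide's own equation is not preserved here, so this form is an assumption. For simplicity events are assumed to sit exactly on nodes, and the road, event positions, and function names are hypothetical.

```python
import heapq
import math


def dijkstra(adj, source):
    """Shortest-path distances from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def network_k(adj, events, total_length, h):
    """Assumed estimator: L_T / (n (n - 1)) * number of ordered pairs within h."""
    n = len(events)
    pairs_within_h = 0
    for p in events:
        dist = dijkstra(adj, p)
        pairs_within_h += sum(
            1 for q in events if q != p and dist.get(q, math.inf) <= h
        )
    return total_length * pairs_within_h / (n * (n - 1))


# Hypothetical network: one straight road of length 10 with nodes 0..10.
adj = {i: [] for i in range(11)}
for i in range(10):
    adj[i].append((i + 1, 1.0))
    adj[i + 1].append((i, 1.0))

events = [0, 1, 2, 9]   # three events close together, one far away
total_length = 10.0

k_at_2 = network_k(adj, events, total_length, h=2.0)
print(k_at_2)  # 5.0
```

Six ordered pairs among the events at 0, 1, and 2 fall within distance 2, giving 10 × 6 / 12 = 5.0. In practice this value would be compared against values simulated under a random pattern on the same network.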
Global Net K-function (cont.) An example of random distribution in a network space
Global Net K-function (cont.). Weakness of the global K-function in determining the scale of clustering: if there is a strong cluster with radius R, K(h) tends to exceed the upper significance envelope, indicating clustering, even for h ≥ R. Incremental K-function: instead of examining the total number of events within distance h, examine the increment in the number of events per unit distance; it can identify the clustering scale more accurately than the original K-function. Figure: patterns with similar K(h) but different IncK(h_t).
Local Network K-function. A local indicator of clustering tendency, obtained by decomposing the global K-function. This indicator is determined only for event locations, i.e., only for limited locations in a network. Introduction of reference points: points distributed over the network at a constant interval, for which indicator values are calculated; cf. the regular grid used in planar space analysis such as the Geographical Analysis Machine (GAM).
Local Net K-function (cont.). LINCS for event-based data (KLINCS). The local network K-function is defined for reference points j = 1, …, m, where m is the number of reference points. For an observed pattern, local K-function values are obtained for the reference points over a range of distances h.
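A minimal sketch of the incremental idea, assuming IncK(h_t) = K(h_t) - K(h_{t-1}) with K(h_0) = 0. That definition is an assumption based on the description above, and the K values are made-up numbers for illustration only.

```python
def incremental_k(k_values):
    """Differences between successive cumulative K values.

    Assumes IncK(h_t) = K(h_t) - K(h_{t-1}) with K(h_0) = 0.
    """
    previous = 0.0
    increments = []
    for k in k_values:
        increments.append(k - previous)
        previous = k
    return increments


# Hypothetical cumulative K(h) at h = 1, 2, 3, 4 (illustrative values only).
k_values = [4.0, 5.0, 5.5, 6.0]
inc = incremental_k(k_values)
print(inc)  # [4.0, 1.0, 0.5, 0.5]
```

The cumulative values keep growing with h, while the increments show that most of the excess is concentrated at the first distance step.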
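The reference-point idea can be sketched as follows. The normalization used here (the count of events within network distance h of reference point j, multiplied by L_T / n) is an assumption, since the slide's equation is not preserved. The straight road, event positions, reference-point spacing, and function names are hypothetical.

```python
import heapq
import math


def dijkstra(adj, source):
    """Shortest-path distances from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def local_k(adj, events, ref_points, total_length, h):
    """Assumed local indicator: (L_T / n) * events within h of each reference point."""
    n = len(events)
    values = []
    for ref in ref_points:
        dist = dijkstra(adj, ref)
        count = sum(1 for e in events if dist.get(e, math.inf) <= h)
        values.append(total_length * count / n)
    return values


# Hypothetical network: one straight road of length 10 with nodes 0..10.
adj = {i: [] for i in range(11)}
for i in range(10):
    adj[i].append((i + 1, 1.0))
    adj[i + 1].append((i, 1.0))

events = [0, 1, 2, 9]                 # events assumed to lie on nodes
ref_points = [0, 2, 4, 6, 8, 10]      # reference points every 2 units

lk = local_k(adj, events, ref_points, total_length=10.0, h=2.0)
print(lk)  # [7.5, 7.5, 2.5, 0.0, 2.5, 2.5]
```

The two reference points nearest the group of three events receive the highest values, which is the kind of local contrast the significance test then evaluates.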
Example of the KLINCS analysis. The incremental K-function can serve as an indicator of the scale of clustering, helping us determine which scale(s) of the local K-function should be examined closely: distance 2, in this case.
KLINCS (cont.). Results of the local network K-function: the significance of individual reference points is determined by comparison with 1,000 simulations of random patterns on the network. Observed LK_j(h) ≥ the largest simulated LK_j(h): clustering; observed LK_j(h) ≤ the smallest simulated LK_j(h): dispersal (0.1% significance level).
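The Monte Carlo comparison can be sketched as below for a single reference point. The sketch assumes a single straight road, so that network distance reduces to the difference in position along the road; the road length, event positions, seed, and function names are hypothetical. The handling of ties (when the observed value equals both the largest and the smallest simulated value) is an added assumption, not something stated in the talk.

```python
import random


def count_within(events, ref, h):
    """Number of events within distance h of the reference point (single road)."""
    return sum(1 for e in events if abs(e - ref) <= h)


def simulate_counts(n_events, road_length, ref, h, n_sims, rng):
    """Counts at the reference point for random patterns of n_events on the road."""
    sims = []
    for _ in range(n_sims):
        pattern = [rng.uniform(0.0, road_length) for _ in range(n_events)]
        sims.append(count_within(pattern, ref, h))
    return sims


def classify(observed, simulated):
    """Compare the observed value with the simulated extremes.

    Tie rule (an assumption): a value equal to both extremes is called random.
    """
    lowest, highest = min(simulated), max(simulated)
    if observed >= highest and observed > lowest:
        return "cluster"
    if observed <= lowest and observed < highest:
        return "dispersion"
    return "random"


# Hypothetical observed pattern: 8 events near position 50, 12 spread out.
observed_events = [49.2, 49.5, 49.8, 50.0, 50.1, 50.4, 50.7, 50.9,
                   3.0, 11.0, 17.5, 24.0, 31.0, 38.0,
                   57.0, 64.0, 72.5, 80.0, 88.0, 95.0]
road_length, ref, h = 100.0, 50.0, 1.0

rng = random.Random(42)  # fixed seed so the run is repeatable
simulated = simulate_counts(len(observed_events), road_length, ref, h, 1000, rng)
observed_count = count_within(observed_events, ref, h)
print(observed_count, classify(observed_count, simulated))
```

Repeating this for every reference point and every distance h yields the per-point cluster, random, or dispersion labels.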
LINCS for link-attribute-based data. Moran's I statistic (1948): a global measure of spatial autocorrelation, i.e., the dependence of a variable value at a location on the values at nearby locations in a spatial context; its local version is LISA (local indicators of spatial association) by Anselin (1995). Network Moran's I (Black 1992): a measure of network autocorrelation, i.e., the dependence between a variable value at a given link and those of other links connected to it in a network context; its local version is ILINCS. Getis and Ord local G statistics (1992): a local measure of the concentration of variable values around a region, applicable to link-attribute-based data (Berglund and Karlström 1999); the corresponding LINCS method is GLINCS.
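A minimal sketch of the two local statistics on link data, under stated assumptions: binary weights between directly connected links, Anselin's local Moran form I_i = (z_i / m2) Σ_j w_ij z_j, and the standardized G_i* form usually attributed to Ord and Getis (1995). The exact ILINCS and GLINCS definitions used in the talk are not preserved here, and the six-link chain and its values are hypothetical.

```python
import math


def local_moran(x, neighbors):
    """Local Moran's I_i = (z_i / m2) * sum of z_j over neighboring links."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    m2 = sum(d * d for d in z) / n
    return [z[i] / m2 * sum(z[j] for j in neighbors[i]) for i in range(n)]


def local_g_star(x, neighbors):
    """Standardized G_i* with binary weights; the target link is included."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum(v * v for v in x) / n - mean * mean)
    result = []
    for i in range(n):
        members = [i] + list(neighbors[i])
        w_sum = len(members)      # sum of w_ij for 0/1 weights
        w_sq_sum = len(members)   # sum of w_ij squared is the same for 0/1 weights
        numerator = sum(x[j] for j in members) - mean * w_sum
        denominator = s * math.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        result.append(numerator / denominator)
    return result


# Hypothetical chain of six links; each link neighbors the links it touches.
values = [8.0, 8.0, 2.0, 2.0, 2.0, 2.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

moran = local_moran(values, neighbors)
g_star = local_g_star(values, neighbors)
print(moran)   # [2.0, 1.0, -0.5, 1.0, 1.0, 0.5]
print(g_star)
```

Link 0 (high value, high neighbor) gets a positive I_i and a positive G_i*, while link 4 (low value, low neighbors) gets a positive I_i and a negative G_i*. In an actual analysis, significance would be assessed by simulation rather than read off the raw values.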
Relationship between I and G statistics
Rows: value of the target link i. Columns: values of the links in the neighborhood of link i.
      | High                               | Low
High  | Positive I_i, positive G_i*        | Negative I_i, non-significant G_i*
Low   | Negative I_i, non-significant G_i* | Positive I_i, negative G_i*
From LINCS to inductive learning Question: What causes the detected clusters? LINCS gives a measure of clustering tendency for each spatial unit (ref. point or link segment). Network data include attributes that may be related to the cause of the clusters. E.g., travel speed, traffic volume, … Spatial attributes can also be assigned to the spatial units. E.g., distance from the closest intersection, travel time from the closest police station, average income of the area, …
LINCS to IL (cont.). The spatial units can be categorized based on their LINCS values, e.g., cluster/random/dispersion; large cluster/medium cluster/small cluster/random; cluster center/cluster fringe/random. Diagram: spatial units carry network attributes, spatial attributes, and LINCS results (clustering, random, dispersion); inductive learning (decision tree induction, feedforward neural network) examines the relationships among them. Relationships? Causality.
Stage 2 Influencing factor identification
Inductive learning. A means to model relationships between input variables and an outcome (classification) without relying on prior knowledge (Gahegan 2000): it learns from a set of instances for which the desired outcome is known, and predicts outcomes for new instances with known input variables.
Advantages: robustness to noise in data; flexibility as to the data types to be combined; ability to handle a large number of attributes; less training data required; fewer assumptions required on variable distributions and model structure.
Disadvantage: overfitting, i.e., learning too much detail of the training dataset to capture the general patterns embedded in it.
Decision tree. A way of representing rules for classification in a hierarchical manner (Witten & Frank 2000; Thill & Wheeler 2000). Node: a test on an attribute. Leaf node: the specification of a class. Decision tree induction: a recursive process of splitting a set of instances with correct class information (the training set) into subsets based on a particular attribute; e.g., CHAID (Kass 1980), CART (Breiman et al. 1984), C4.5 (Quinlan 1993).
Other techniques of modeling. Feedforward neural network with back-propagation (Thill & Mozolin 2000; Demuth & Beale 2000): a way of deriving a mapping from multiple input variables to a classification from a training dataset. Discrete choice model, as an example of traditional statistical modeling: a way to analyze the relationship between a set of independent variables and a dependent variable of binary form or a discrete choice outcome among a small set of alternatives; probit model/logit model.
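One step of the recursive splitting can be sketched as choosing the attribute with the highest gain ratio, the criterion C4.5 uses for nominal attributes. The four training instances and the attribute names `near_intersection` and `lanes` are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on the attribute, divided by split information."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    split_info = -sum(
        (len(s) / total) * math.log2(len(s) / total) for s in subsets.values()
    )
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical training set: one row per spatial unit.
rows = [
    {"near_intersection": "yes", "lanes": 2},
    {"near_intersection": "yes", "lanes": 4},
    {"near_intersection": "no", "lanes": 2},
    {"near_intersection": "no", "lanes": 4},
]
labels = ["cluster", "cluster", "random", "random"]

ratios = {a: gain_ratio(rows, labels, a) for a in ("near_intersection", "lanes")}
best_attribute = max(ratios, key=ratios.get)
print(ratios, best_attribute)
```

Here `near_intersection` separates the classes perfectly (gain ratio 1.0) and `lanes` not at all (gain ratio 0.0), so the first node of the tree would test `near_intersection`. A full induction algorithm would then recurse on each subset and prune the result.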
Data. 1997 vehicle crash data for the Buffalo, NY area (by the New York State Department of Transportation): NY State highways; milepost system with a resolution of 0.1 mile; 1,658 crashes in the study region; mileposts are used as reference points; scale of analysis = 0.1 mile; Monte Carlo simulation with 1,000 trials (0.1% significance level). Figure: crash distribution in the study region.
Stage 1: Global scale results. Under the null hypothesis: crash probability is uniform over the network; crash probability is proportional to traffic volume (Annual Average Daily Traffic, AADT). Figure annotations: 0.1~0.5 mile; 0.1 mile.
Stage 1: Local scale results
KLINCS at 0.1 mile scale, not adjusted for AADT: Cluster: 125 ref. points; Random: 1327 ref. points; Dispersion: 0 ref. points (Total: 1452).
KLINCS at 0.1 mile scale, adjusted for AADT: Cluster: 110 ref. points; Random: 1304 ref. points; Dispersion: 38 ref. points (Total: 1452).
Stage 1 local results (cont.)
ILINCS at 0.1 mile scale, adjusted for AADT: Positive autocorrelation: 23 links; Not significant: 1462 links; Negative autocorrelation: 0 links (Total: 1485).
GLINCS at 0.1 mile scale, adjusted for AADT: High-valued cluster: 19 links; Not significant: 1438 links; Low-valued cluster: 28 links (Total: 1485).
Stage 1 local results (cont.). Figure panels: Priority Investigation Locations (PILs) designated by NYSDOT; KLINCS at 0.1 mile scale, adjusted for AADT.
Stage 2: Inductive learning results AADT-adjusted KLINCS classification Decision tree by the C4.5 induction algorithm with 24 attributes
Stage 2 results (cont.)
AADT-adjusted GLINCS model. Dependent variable = degree of significant clustering (0~1000). Model tree, where each leaf node represents a linear model.
Linear model | Positive | Negative
1 | AADT, Area type, Base type, Pavement type, Roadways, Shoulder width, VC ratio | ARC, Culture, Inter_link, Median type, Shoulder, Surface, Work type
2 | AADT, ARC, Base type, Lanes, Pavement type, Roadways, Shoulder width, Sub-base type | Area type, Culture, Inter_link, Median type, Shoulder, Surface, VC ratio, Work type
3 | ARC, Base type, Lanes, Pavement type, Roadways, Shoulder width, Sub-base type, VC ratio | AADT, Area type, Culture, Inter_link, Median type, Shoulder, Surface, Work type
4 | ARC, Area type, Pavement type, Roadways | Base type, Culture, Inter_link, Median type, Median width, Shoulder, Surface, Work type
5 | ARC, Area type, Pavement type, Pavement width, Roadways, Work type | AADT, Base type, Culture, Inter_link, Lanes, Median type, Median width, Shoulder, Shoulder width, Surface
6 | ARC, Culture, Roadways, Shoulder width, VC ratio | Base type, Inter_link, Median type, Pavement width, Shoulder, Surface, Surface type, Work type
Stage 2 results (cont.)
Accuracy for the test set, KLINCS results:
Model | All instances | Cluster instances | Random instances | Dispersion instances
DTree | 89% | 56% | 97% | 55%
FNNet | 89% | 23% | 95% | 58%
DChoice | 90% | 0% | 98% | 55%
Accuracy for the test set, GLINCS results:
Model | All instances
MTree | 0.539
FNNet | 0.587
Regression | 0.553
There is not much difference between the three models, especially in terms of all instances. Because 90% of the instances are "random," the modeling processes fit the models mainly to the random instances in order to make fewer errors overall. Weighting schemes could be used to emphasize underrepresented classes.
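One possible weighting scheme is inverse class frequency, sketched below. The formula weight = N / (k × n_c) is a common heuristic and an assumption here, not the scheme proposed in the talk. The class counts are the AADT-adjusted KLINCS counts over all 1,452 reference points reported for Stage 1, used only as an illustration of the imbalance; the actual training and test splits are not given.

```python
# AADT-adjusted KLINCS counts over all reference points (Stage 1 results).
counts = {"cluster": 110, "random": 1304, "dispersion": 38}

total = sum(counts.values())
n_classes = len(counts)

# Accuracy of a trivial model that always predicts the majority class.
majority_share = max(counts.values()) / total

# Inverse-frequency weights (assumed heuristic): N / (k * n_c).
weights = {label: total / (n_classes * n) for label, n in counts.items()}

print(round(majority_share, 3))
for label, weight in weights.items():
    print(label, round(weight, 2))
```

With these counts a model that always answers "random" is already right about 90% of the time, which is why overall accuracy says little about how well clusters are recognized. Under the weights above, an error on a dispersion or cluster instance costs far more than an error on a random instance.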
Conclusions. This research proposes a comprehensive framework for network-based spatial cluster analysis when the phenomenon of interest is constrained by a network space, covering both event-based data and link-attribute-based data. Detection of local clusters (stage 1): the LINCS methods can detect clusters without detecting spurious clusters caused merely by the network constraints. Identification of influencing factors (stage 2): inductive learning techniques are useful for constructing robust models that explain the detected clusters in relation to the network's attributes.
Conclusions (cont.). The combination of exploratory spatial data analysis and inductive learning modeling is a powerful tool to reveal latent relationships between distributions of spatial phenomena and characteristics of physical/social environments, and then to assist spatial decision-making processes by providing guidance on where and on what to focus attention. Stage 1: spatial focus; Stage 2: contextual focus. The case study showed relatively good correspondence between the LINCS results and the PILs, which supports the effectiveness of the LINCS methods.