Comparison of Perturbation Approaches for Spatial Outliers in Microdata. Natalie Shlomo* and Jordi Marés**. * Social Statistics, University of Manchester


1 Comparison of Perturbation Approaches for Spatial Outliers in Microdata
Natalie Shlomo* and Jordi Marés**
* Social Statistics, University of Manchester, Natalie.Shlomo@manchester.ac.uk
** IIIA and CSIC, Barcelona, jmares@iiia.csic.es
The project was funded by the Census Statistical Disclosure Control project at Westat, Inc. through the sponsorship of the U.S. Bureau of the Census.

2 Topics Covered
- Introduction
- Description of Data
- Outlier Detection
- Coherence Function
- Perturbation Methods
  - Record Swapping Method
  - Hot Deck Method
- Results
- Conclusions

3 Introduction
- Geographical spatial outliers arise from multivariate relationships between spatial and non-spatial characteristics and have a high probability of identification
- Treat them through targeted statistical disclosure control (SDC) perturbation in the microdata
- Focus on US American Community Survey (ACS) transportation outputs: trajectories defined as vectors of coordinates, with place of residence as origin and workplace as destination
- Example of an outlier: an overly long commute to work by a non-typical means of transportation (MOT), such as cycling
- Objective: to inform and guide decisions on best practices for future dissemination of these and similar datasets by the US Census Bureau

4 Description of Data
- Simulation study based on an artificial population produced from the combined 2006-2008 PUMS of the ACS
- Population: those living in California who were employed and worked within the US (N=438,850)
- Latitude and longitude of residence and workplace generated by adding random distances within a radius around the centroid of the relevant PUMA (public-use microdata area, with population greater than 100K)
- Survey weights were not taken into account (they need to be recalibrated following perturbation); however, other calibration variables are used as controls to minimize distortions to the original weights

5 Outlier Detection
- Outlier detection methods include univariate and multivariate methods and can take parametric or non-parametric forms
- For this study we use multivariate outlier detection based on the Mahalanobis distance, where large values indicate outliers
- Replace the mean vector by the median vector and the covariance matrix by the minimum covariance determinant (MCD) (Rousseeuw, 1985)
- Let h be the minimum number of points which are not outlying: (formula not preserved in the transcript)
- The cut-off for squared Mahalanobis distances based on p variables generally uses a quantile of the chi-square distribution with p degrees of freedom
- Under robust Mahalanobis distances, use the adjusted cut-off: (formula not preserved in the transcript)
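The detection step on this slide can be illustrated with a minimal Python sketch for two variables. It is not the SAS macro used in the study. Several things here are assumptions: the subset size h = floor((n + p + 1) / 2) is a common default rather than the formula from the slide, the covariance is found by simple concentration steps from a single start instead of a full MCD search, the 0.975 quantile is illustrative, and the plain chi-square cut-off is used in place of the adjusted one. No consistency correction is applied, so the sketch tends to flag more points than a calibrated method would.

```python
import math
from statistics import median

def mean_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def sq_mahalanobis(point, center, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def robust_outlier_flags(points, quantile=0.975, max_steps=20):
    """Flag points whose robust squared distance exceeds the cut-off."""
    n = len(points)
    h = (n + 2 + 1) // 2  # assumed default: floor((n + p + 1) / 2) with p = 2
    # Start from the coordinate-wise median and the classical covariance.
    center = (median(x for x, _ in points), median(y for _, y in points))
    _, cov = mean_cov_2d(points)
    subset = None
    for _ in range(max_steps):
        # Concentration step: keep the h points closest to the current fit.
        order = sorted(range(n), key=lambda i: sq_mahalanobis(points[i], center, cov))
        new_subset = sorted(order[:h])
        if new_subset == subset:
            break
        subset = new_subset
        center, cov = mean_cov_2d([points[i] for i in subset])
    # Chi-square quantile with 2 degrees of freedom has a closed form.
    cutoff = -2.0 * math.log(1.0 - quantile)
    return [sq_mahalanobis(p, center, cov) > cutoff for p in points]
```

In the study's setting the two coordinates would be distance travelled and minutes to work, and the function would be applied separately within each class.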

6 Outlier Detection
- Robust Mahalanobis distances calculated on distance travelled and minutes to work
- SAS: DistanceToWork = geodist(latitude, longitude, POW_latitude, POW_longitude, 'DM');
- Determine explanatory variables predictive of distance travelled to produce classes: mode of transport, sex, earnings and occupation
- SAS macro: 'Robcov' Version 1.3-2 (written by Michael Friendly)
- Collapse classes to at least 20 individuals and calculate the robust Mahalanobis distance, with a flag if it exceeds the critical value
- Reduced dataset to 283,423 records without missing values and with a high degree of consistency: 60,007 outliers (21.2%), reduced to 59,080 outliers (20.8%) after deleting the 'other' mode of transport
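For readers without SAS, the distance calculation can be approximated with the haversine formula. This is a sketch under stated assumptions: the 'DM' option is taken to mean coordinates in degrees and a result in miles, the Earth is treated as a sphere with a mean radius of 3958.8 miles, and the SAS function may use a different (ellipsoidal) formula, so results can differ slightly. The coordinates in the example are illustrative.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for a spherical model

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Guard against rounding pushing the square root just above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))

# Illustrative residence (Sacramento area) and workplace (San Francisco area).
distance_to_work = haversine_miles(38.58, -121.49, 37.77, -122.42)
```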

7 Coherence Function
- Coherence function built from the maximum and minimum velocity for each mode of transport, estimated on the set of non-outliers
- Assign high coherence to individuals whose travelled distance is close to the mean, and low coherence to individuals whose travelled distance is far from the mean
- Use it as the objective function to guide perturbation, where the aim is to obtain a higher coherence for outliers
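The slide does not give the functional form of the coherence function, so the following Python sketch is purely hypothetical. It assumes a triangular shape: coherence is 1 at the mean velocity of the non-outliers for a given mode of transport and falls linearly to 0 at their minimum and maximum velocity. The names `velocity_profile` and `coherence` are invented for illustration.

```python
def velocity_profile(non_outliers):
    """Min, mean and max velocity from (distance, minutes) pairs for one mode."""
    velocities = [d / m for d, m in non_outliers if m > 0]
    return min(velocities), sum(velocities) / len(velocities), max(velocities)

def coherence(distance, minutes, profile):
    """Hypothetical triangular coherence in [0, 1]; minutes must be positive."""
    v_min, v_mean, v_max = profile
    v = distance / minutes
    if v <= v_min or v >= v_max:
        return 0.0  # outside the plausible velocity range for this mode
    if v <= v_mean:
        return (v - v_min) / (v_mean - v_min)
    return (v_max - v) / (v_max - v_mean)

# Three illustrative cyclists: 2, 3 and 4 miles in 20 minutes.
cycling = velocity_profile([(2.0, 20.0), (3.0, 20.0), (4.0, 20.0)])
typical = coherence(3.0, 20.0, cycling)      # close to the mean velocity
implausible = coherence(30.0, 20.0, cycling)  # far too fast for this mode
```

Any function that peaks near the typical velocity and decays towards the extremes would serve the same role as an objective for the perturbation.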

8 Record Swapping
- Pair outliers with different workplaces by swapping their places of residence so that the coherence function increases for at least one of the outliers (without decreasing for the other)
- Carry out within classes: mode of transport, sex and age group
- Split outliers according to workplace; calculate the coherence function after swapping the residence of each outlier with every other outlier in a different workplace
- If one of the outliers has higher coherence, then swap
- Continue iteratively
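The swapping procedure can be sketched as a greedy pairwise loop. This is an illustration rather than the authors' implementation: the record layout (dictionaries with `home`, `work` and `workplace` keys), the planar coordinates and the stand-in `coherence` function with a typical distance of 5 units are all assumptions. The caller is expected to pass the outliers of one class at a time.

```python
import math

def coherence(home, work, typical=5.0):
    """Hypothetical stand-in: highest when the commute is near a typical length."""
    commute = math.hypot(home[0] - work[0], home[1] - work[1])
    return 1.0 / (1.0 + abs(commute - typical))

def swap_outliers(outliers):
    """Swap residences between outliers in different workplaces while it helps."""
    records = [dict(r) for r in outliers]  # work on copies
    n_swaps = 0
    swapped = True
    while swapped:  # each swap raises total coherence, so the loop terminates
        swapped = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                a, b = records[i], records[j]
                if a["workplace"] == b["workplace"]:
                    continue
                before_a = coherence(a["home"], a["work"])
                before_b = coherence(b["home"], b["work"])
                after_a = coherence(b["home"], a["work"])
                after_b = coherence(a["home"], b["work"])
                no_loss = after_a >= before_a and after_b >= before_b
                some_gain = after_a > before_a or after_b > before_b
                if no_loss and some_gain:
                    a["home"], b["home"] = b["home"], a["home"]
                    n_swaps += 1
                    swapped = True
    return records, n_swaps
```

Because residences are only exchanged between records, the set of residences in the class is unchanged, which is why swapping leaves marginal frequencies intact.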

9 Hot Deck
- Impute the residence of an outlier with the residence of a non-outlier within the same class and having the same workplace
- Two approaches for selecting the donor (note: need more than one individual in the workplace):
  1. Candidate donors are those whose distance to work lies within the coherence range of distances; select the donor that maximizes the coherence function, i.e. the candidate donor whose distance to work is closest to that implied by the mean velocity
  2. Instead of the coherence function, choose as donor the non-outlier in the same workplace with the most similar travelled minutes (nearest neighbor)
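The second (nearest neighbor) approach can be sketched in a few lines of Python. The record layout (dictionaries with `home`, `workplace` and `minutes` keys) and the function names are assumptions made for illustration, and the caller is expected to restrict `non_outliers` to the outlier's class. The first approach would differ only in the selection rule, replacing the difference in minutes with a coherence score.

```python
def nearest_neighbour_donor(outlier, non_outliers):
    """Non-outlier in the same workplace with the closest travel minutes."""
    candidates = [r for r in non_outliers if r["workplace"] == outlier["workplace"]]
    if not candidates:
        return None  # no donor available in this workplace
    return min(candidates, key=lambda r: abs(r["minutes"] - outlier["minutes"]))

def impute_residence(outlier, non_outliers):
    """Return a copy of the outlier with the donor's residence, if a donor exists."""
    imputed = dict(outlier)
    donor = nearest_neighbour_donor(outlier, non_outliers)
    if donor is not None:
        imputed["home"] = donor["home"]
    return imputed
```

Unlike swapping, the donor keeps its own residence, so a residence can be duplicated and marginal counts by area can change.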

10 Results
Outlier status before and after perturbation (HD = hot deck; column percentages in parentheses)

Original outlier | Total | Swapping: Yes | Swapping: No | HD (Coherence): Yes | HD (Coherence): No | HD (Minutes): Yes | HD (Minutes): No
Yes | 59,080 (20.8%) | 42,788 (92.0%) | 16,292 (6.9%) | 27,099 (76.2%) | 31,981 (12.9%) | 28,123 (79.3%) | 30,957 (12.5%)
No | 224,343 (79.2%) | 3,731 (8.0%) | 220,612 (93.1%) | 8,456 (23.8%) | 215,887 (87.1%) | 7,321 (20.7%) | 217,022 (87.5%)
Total | 283,423 (100%) | 46,519 (100%) | 236,904 (100%) | 35,555 (100%) | 247,868 (100%) | 35,444 (100%) | 247,979 (100%)

- Swapping corrected fewer outliers than the hot deck methods (16K vs 31K), but swapping was carried out only on outliers
- Some non-outliers became outliers because perturbation changed the structure of the distribution (4K for swapping vs 8K for hot deck)
- The number of non-outliers classified as outliers following perturbation was much smaller than the number of outliers corrected to non-outliers

11 Results
- Individuals who had their PUMA changed due to the perturbation: Swapping Method: 56,562; Hot Deck Method (Minutes): 53,945; Hot Deck Method (Coherence): 53,181
- Hot deck methods perturb bivariate counts more than swapping, since swapping does not change marginal frequencies
- Hot deck using the coherence function approach resulted in less information loss than the nearest neighbor approach

Information loss for bivariate counts of each variable crossed with PUMA (HD = hot deck)

Variable | Normalized Absolute Difference: Swapping | HD Minutes | HD Coherence | Normalized Hellinger's Distance: Swapping | HD Minutes | HD Coherence
AGE9 | 0 | 0.109 | 0.095 | 0 | 0.154 | 0.134
AGEP | 0.048 | 0.119 | 0.107 | 0.059 | 0.166 | 0.147
SEX | 0 | 0.113 | 0.095 | 0 | 0.161 | 0.140
OCCUPATION | 0.094 | 0.134 | 0.125 | 0.164 | 0.215 | 0.203
EARNINGS | 0.024 | 0.104 | 0.089 | 0.029 | 0.148 | 0.129
MODE | 0 | 0.104 | 0.087 | 0 | 0.154 | 0.131
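The two information loss measures can be illustrated with a short Python sketch that compares cell counts before and after perturbation. The slide does not state how the measures were normalized, so both definitions here are assumptions: the absolute difference is divided by the original total, and the Hellinger distance is computed on cell proportions, which bounds it between 0 and 1. Tables are represented as dictionaries mapping a (PUMA, category) cell to a count.

```python
import math

def normalized_abs_difference(orig, pert):
    """Sum of absolute cell differences divided by the original total (assumed form)."""
    cells = set(orig) | set(pert)
    total = sum(orig.values())
    return sum(abs(orig.get(c, 0) - pert.get(c, 0)) for c in cells) / total

def hellinger_distance(orig, pert):
    """Hellinger distance between the two tables' cell proportions."""
    cells = set(orig) | set(pert)
    n_orig, n_pert = sum(orig.values()), sum(pert.values())
    squared = sum(
        (math.sqrt(orig.get(c, 0) / n_orig) - math.sqrt(pert.get(c, 0) / n_pert)) ** 2
        for c in cells
    )
    return math.sqrt(squared / 2)

# Illustrative PUMA x SEX counts: five men move from PUMA "P1" to "P2".
original = {("P1", "M"): 10, ("P1", "F"): 10, ("P2", "M"): 10, ("P2", "F"): 10}
perturbed = {("P1", "M"): 5, ("P1", "F"): 10, ("P2", "M"): 15, ("P2", "F"): 10}
nad = normalized_abs_difference(original, perturbed)
hd = hellinger_distance(original, perturbed)
```

Both measures are zero when the perturbed table equals the original, which matches the zeros reported for the variables that define the swapping classes.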

12 Discussion
- Record swapping had the lowest information loss (especially for bivariate counts of the swapping variable with other control variables) but reduced the number of outliers by only 21.3% (from 59,080 to 46,519), while the hot-deck methods reduced it by about 40.0%
- Hot-deck methods transformed more non-outliers into outliers than record swapping
- Recommendation: carry out both methods, starting with record swapping and then applying the hot-deck method to the remaining outliers
- Recalibrate survey weights to the new place of residence; including calibration variables as controls minimizes distortion to the survey weights, especially under record swapping

13 Thank you for your attention

