Comparison of Perturbation Approaches for Spatial Outliers in Microdata
Natalie Shlomo* and Jordi Marés** (* Social Statistics, University of Manchester)

Presentation transcript:

1 Comparison of Perturbation Approaches for Spatial Outliers in Microdata
Natalie Shlomo* and Jordi Marés**
* Social Statistics, University of Manchester
** IIIA and CSIC, Barcelona
The project was funded by the Census Statistical Disclosure Control project at Westat, Inc. through the sponsorship of the U.S. Bureau of the Census.

2 Topics Covered
- Introduction
- Description of Data
- Outlier Detection
- Coherence Function
- Perturbation Methods
  - Record Swapping Method
  - Hot Deck Method
- Results
- Conclusions

3 Introduction
- Geographical spatial outliers arise from multivariate relationships between spatial and non-spatial characteristics and have a high probability of identification.
- Treat them through targeted SDC perturbation in the microdata.
- Focus on US American Community Survey (ACS) transportation outputs: trajectories defined as vectors of coordinates, namely place of residence (origin) and workplace (destination).
- Example of an outlier: an overly long commute to work on a non-typical means of transportation (MOT), such as cycling.
- Objective: to inform and guide decisions about best practices that the US Census Bureau could use in future dissemination strategies for these and other similar types of datasets.

4 Description of Data
- Simulation study based on an artificial population produced from the combined PUMS of the ACS.
- Population: those living in California, employed and working within the US (N=438,850).
- Latitude and longitude of residence and workplace were generated by adding random distances within a radius around the centroid of the relevant PUMA (public-use microdata area, with a population greater than 100K).
- Survey weights were not taken into account (they need to be recalibrated following perturbation); however, other calibration variables are used as controls to minimize distortions to the original weights.

5 Outlier Detection
- Outlier detection methods include univariate and multivariate methods and can take parametric or non-parametric forms.
- For this study we use multivariate outlier detection based on the Mahalanobis distance, where large values indicate outliers.
- Replace the mean vector by the median vector and the covariance matrix by the minimum covariance determinant (MCD) estimate (Rousseeuw, 1985).
- Let h be the minimum number of points which are not outlying.
- The cut-off for squared Mahalanobis distances based on p variables is generally a quantile of the chi-square distribution with p degrees of freedom.
- Under robust Mahalanobis distances, use an adjusted cut-off.
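As a reminder of the quantity being thresholded, the squared Mahalanobis distance and the classical flagging rule can be written as below. This is a sketch under stated assumptions: \hat{\mu} stands for the robust location estimate (the median vector in these slides), \hat{\Sigma} for the MCD covariance estimate, and \alpha is a hypothetical significance level that the slides do not state. The adjusted cut-off used for the robust distances would replace the chi-square quantile and is not reproduced here.

```latex
d_i^2 = (x_i - \hat{\mu})^{\top} \hat{\Sigma}^{-1} (x_i - \hat{\mu}),
\qquad
\text{flag record } i \text{ as an outlier if } d_i^2 > \chi^2_{p,\,1-\alpha}
```

In this study the distances are computed on two variables (distance travelled and minutes to work), so p = 2.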

6 Outlier Detection
- Robust Mahalanobis distances are calculated on distance travelled and minutes to work:
  DistanceToWork = geodist(latitude, longitude, POW_latitude, POW_longitude, 'DM');
- Determine explanatory variables predictive of distance travelled to produce classes: mode of transport, sex, earnings and occupation.
- SAS macro: 'Robcov' Version (written by Michael Friendly).
- Collapse classes so that each contains at least 20 individuals, calculate the robust Mahalanobis distance, and flag a record if its distance exceeds the critical value.
- The dataset was reduced to 283,423 records without missing values and with a high degree of consistency: 60,007 outliers (21.2%), reduced to 59,080 outliers (20.8%) after deleting the 'other' mode of transport.
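For readers without SAS, the distance calculation above can be approximated as follows. This is a minimal sketch, assuming a spherical Earth and the haversine formula; the function name `great_circle_miles` and the radius constant are illustrative, and the SAS `geodist` function may use a different Earth model, so values could differ slightly. The `'DM'` options in the SAS call indicate input in degrees and output in miles, which is what the sketch mirrors.

```python
import math

# Mean Earth radius in miles; an assumption of this sketch, not taken from the slides.
EARTH_RADIUS_MILES = 3958.8


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

The arguments follow the order of the SAS call: residence latitude and longitude first, then place-of-work latitude and longitude.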

7 Coherence Function
- The coherence function is defined from the maximum and minimum velocity for each mode of transport, based on the set of non-outliers.
- Assign high coherence to individuals whose travelled distance is close to the mean, and low coherence to individuals whose travelled distance is far from the mean.
- Use it as the objective function to guide perturbation, where we aim to obtain a higher coherence for outliers.
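The slide's formula for the coherence function is not preserved in this transcript, so the following is only an illustrative stand-in that has the properties described above. It assumes velocity is distance divided by minutes, gives a score of 1 at the mean velocity of the mode, decreases linearly away from it, and gives 0 outside the minimum and maximum velocity range. The function name, the linear shape and all parameter names are hypothetical.

```python
def coherence(distance, minutes, v_min, v_mean, v_max):
    """Illustrative coherence score in [0, 1] (assumed form, not the authors' formula).

    v_min, v_mean and v_max are the minimum, mean and maximum velocity
    observed among non-outliers for the person's mode of transport.
    """
    if minutes <= 0:
        return 0.0
    velocity = distance / minutes
    # Velocities outside the range seen among non-outliers are incoherent.
    if velocity < v_min or velocity > v_max:
        return 0.0
    half_width = max(v_max - v_mean, v_mean - v_min)
    if half_width == 0:
        return 1.0
    # Linear decay from 1 at the mean velocity.
    return 1.0 - abs(velocity - v_mean) / half_width
```

Any function that peaks at the typical velocity and falls off towards the bounds would serve the same role as an objective for the perturbation steps.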

8 Record Swapping
- Pair outliers with different workplaces by swapping their places of residence so that the coherence function increases for at least one of the outliers (without decreasing it for the other).
- Carry out within classes: mode of transport, sex and age group.
- Split outliers according to workplace, then calculate the coherence function obtained by swapping the residence of an outlier with every other outlier in a different workplace.
- If one of the outliers has higher coherence, then swap.
- Continue iteratively.
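The swapping steps above can be sketched as a greedy loop. This is a simplified illustration, not the authors' implementation: it assumes planar coordinates with Euclidean distance, records stored as dictionaries with hypothetical keys `home`, `work`, `work_id` and `minutes`, and a caller-supplied `coherence(distance, minutes)` function. It would be run separately within each class (mode of transport, sex and age group).

```python
import math
from itertools import combinations


def planar_distance(a, b):
    """Euclidean distance between two (x, y) points; a simplification of geodesic distance."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def swap_residences(outliers, coherence):
    """Greedily swap residences between outliers with different workplaces.

    A swap is accepted when neither record loses coherence and at least one
    gains. Records are modified in place; returns the number of swaps made.
    """
    def score(record, home):
        return coherence(planar_distance(home, record["work"]), record["minutes"])

    swaps = 0
    improved = True
    while improved:  # repeat passes until no acceptable swap remains
        improved = False
        for a, b in combinations(outliers, 2):
            if a["work_id"] == b["work_id"]:
                continue
            before_a, before_b = score(a, a["home"]), score(b, b["home"])
            after_a, after_b = score(a, b["home"]), score(b, a["home"])
            no_loss = after_a >= before_a and after_b >= before_b
            gain = after_a > before_a or after_b > before_b
            if no_loss and gain:
                a["home"], b["home"] = b["home"], a["home"]
                swaps += 1
                improved = True
    return swaps
```

Because every accepted swap strictly increases the total coherence of the pair, the loop cannot cycle and stops once no improving pair is left.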

9 Hot Deck
- Impute the residence of an outlier with the residence of a non-outlier within the same class and having the same workplace.
- Two approaches for selecting the donor (note: more than one individual is needed in the workplace):
  1. Candidate donors are those whose distance to work is within the coherence range of distances; the donor selected is the one that maximizes the coherence function (i.e. the candidate donor whose distance to work is closest to the mean velocity).
  2. Instead of the coherence function, choose the donor from the non-outliers in the same workplace having similar travelled minutes (nearest neighbor).
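The second donor-selection approach (nearest neighbor on travel minutes) can be sketched as follows. This is a minimal illustration with hypothetical record keys (`cls` for the class, `work_id`, `minutes`, `home`); it is not the authors' code. The first approach would use the same structure with the `min` on minutes replaced by a `max` on the coherence score, restricted to donors within the coherent range of distances.

```python
def hot_deck_by_minutes(outlier, non_outliers):
    """Impute an outlier's residence from the nearest-neighbor donor on travel minutes.

    Donors must be non-outliers in the same class and the same workplace.
    Returns the chosen donor, or None if no eligible donor exists (in which
    case the outlier is left unchanged).
    """
    donors = [
        r for r in non_outliers
        if r["cls"] == outlier["cls"] and r["work_id"] == outlier["work_id"]
    ]
    if not donors:
        return None
    donor = min(donors, key=lambda r: abs(r["minutes"] - outlier["minutes"]))
    outlier["home"] = donor["home"]  # the donor keeps its own residence
    return donor
```

Unlike swapping, the donor's record is not modified, so the same residence can appear for more than one individual after imputation.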

10 Results
Outlier status before perturbation (rows) by outlier status after each perturbation method (columns); percentages are column percentages.
Original outlier | Total | Swapping: Yes | Swapping: No | HD (Coherence Measure): Yes | HD (Coherence Measure): No | HD (Minutes): Yes | HD (Minutes): No
Yes | 59,080 (20.9%) | 42,788 (92.0%) | 16,292 (6.9%) | 27,099 (76.2%) | 31,981 (12.9%) | 28,123 (79.3%) | 30,957 (12.5%)
No | 224,343 (79.2%) | 3,731 (8.0%) | 220,612 (93.1%) | 8,456 (23.8%) | 215,887 (87.1%) | 7,321 (20.7%) | 217,022 (87.5%)
Total | 283,423 (100%) | 46,519 (100%) | 236,904 (100%) | 35,555 (100%) | 247,868 (100%) | 35,444 (100%) | 247,979 (100%)
- Swapping corrected fewer outliers than the hot deck methods (16K vs 31K), but swapping was carried out only on outliers.
- Some non-outliers became outliers, since the distribution structure changed following perturbation (4K for swapping vs 8K for hot deck).
- The number of non-outliers defined as outliers following perturbation was much smaller than the number of outliers that were corrected to non-outliers.
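The counts in the table above can be combined into a net change in the number of outliers per method: outliers corrected minus non-outliers newly flagged. The short calculation below uses only the table's figures; the dictionary keys and function names are illustrative.

```python
ORIGINAL_OUTLIERS = 59_080

# Counts taken from the results table.
corrected = {"swapping": 16_292, "hd_coherence": 31_981, "hd_minutes": 30_957}
newly_flagged = {"swapping": 3_731, "hd_coherence": 8_456, "hd_minutes": 7_321}
outliers_after = {"swapping": 46_519, "hd_coherence": 35_555, "hd_minutes": 35_444}


def net_reduction(method):
    """Outliers corrected minus non-outliers that became outliers."""
    return corrected[method] - newly_flagged[method]


def net_reduction_pct(method):
    """Net reduction as a percentage of the original number of outliers."""
    return round(100 * net_reduction(method) / ORIGINAL_OUTLIERS, 1)


for method in corrected:
    # The net reduction must match the drop in the total number of outliers.
    assert ORIGINAL_OUTLIERS - outliers_after[method] == net_reduction(method)
    print(method, net_reduction(method), net_reduction_pct(method))
```

The net reductions come to 12,561 for swapping, 23,525 for hot deck with the coherence measure and 23,636 for hot deck on minutes, that is roughly 21.3%, 39.8% and 40.0% of the original outliers.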

11 Results
- Individuals who had their PUMA changed due to the perturbation: Swapping Method: 56,562; Hot Deck Method (Minutes): 53,945; Hot Deck Method (Coherence): 53,181.
- Hot deck methods perturb bivariate counts more than swapping, since swapping does not change marginal frequencies.
- Hot deck using the coherence function approach resulted in less information loss than the nearest neighbor approach.
Table: information loss for bivariate counts of each variable crossed with PUMA (rows: AGE, AGEP, SEX, OCCUPATION, EARNINGS, MODE), measured by Normalized Absolute Difference and Normalized Hellinger's Distance, each reported for Swapping, HD Minutes and HD Coherence.
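The two information loss measures named above compare the cell counts of a table before and after perturbation. The sketch below shows one common way to compute them; the exact normalization used in the study is not given in the slides, so dividing the absolute differences by the original total and computing Hellinger's distance on cell proportions are assumptions, as are the function names.

```python
import math


def normalized_abs_diff(orig_counts, pert_counts):
    """Sum of absolute cell differences divided by the original total (assumed normalization)."""
    total = sum(orig_counts)
    return sum(abs(o - p) for o, p in zip(orig_counts, pert_counts)) / total


def hellinger_distance(orig_counts, pert_counts):
    """Hellinger distance between the two cell distributions, in [0, 1]."""
    n_orig, n_pert = sum(orig_counts), sum(pert_counts)
    squared = sum(
        (math.sqrt(o / n_orig) - math.sqrt(p / n_pert)) ** 2
        for o, p in zip(orig_counts, pert_counts)
    )
    return math.sqrt(squared / 2)
```

Both functions take the cells of the original and perturbed tables flattened in the same order, for example the counts of SEX crossed with PUMA. A value of 0 means the perturbed table is identical to the original.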

12 Discussion
- Record swapping had the lowest information loss (especially for bivariate counts of the swapping variable with other control variables) but reduced the number of outliers by only 21.3% (from 59,080 to 46,519), while the hot-deck methods reduced it by about 40%.
- The hot-deck methods transformed more non-outliers into outliers than record swapping did.
- The recommendation would be to carry out both methods, starting with record swapping and then proceeding to the hot-deck method on the remaining outliers.
- Survey weights need to be recalibrated to the new place of residence, but including calibration variables as controls minimizes distortion to the survey weights, especially under record swapping.

13 Thank you for your attention