A Measure of Disclosure Risk for Fully Synthetic Data Mark Elliot Manchester University Acknowledgements: Chris Dibben, Beata Nowok and Gillian Raab.


Traditional approaches On partially synthetic data ◦ Reiter (2008), Domingo-Ferrer and Torra (2007) ◦ Reidentification risk is assessed by linkage between original data and synthetic part.

Traditional approaches On fully synthetic data ◦ Naive view: It's synthetic, so why are you even asking this question? ◦ But what if the data generation process for the synthetic data is fully saturated?

What is synthetic data? [Figure: a continuum from random data (pure noise) to original data (fully saturated model). Labels on the figure: negligible identification risk; empirical attribution risk; non-negligible identification risk and attribution risk; zone of plausible inference; useable synthetic/disclosure controlled data.]

Disclosure risk and synthetic data On fully synthetic data ◦ Reidentification is meaningless ◦ But attribution is not

Approach for the Sylls project Part 1 Empirical differential privacy ◦ Can I learn more about a particular individual from a synthetic dataset based on a source dataset that contains that individual's record than from one based on a dataset that does not? ◦ Also interested in relative residuals.

Approach for the Sylls evaluation project Test data set ◦ 2011 Living Costs and Food Survey (LCF) ◦ Synthetic versions of the 2010, 2011 and 2012 surveys.

Principles ◦ Generate a risk measure for a given record based on: the probability of accurate attribution (for categorical targets); residuals of estimated values (for continuous targets). ◦ Non-parametric methodology, similar to the Skinner and Elliot (2002) method.

By-record empirical differential privacy procedure, assuming a categorical target variable
1. Obtain two consecutive years of the same survey dataset.
2. Generate synthetic versions of those datasets.
3. Take a record r at random from the original dataset and, using a predefined key (K):
   i. Match r back onto the original dataset (O)
   ii. Match r against the synthetic version of O (S)
   iii. Match r against the synthetic dataset for the other year (S’)
4. Select a target variable T.
5. Each of 3 i-iii will produce a match set (a set of records which match r).
   i. The proportion of each match set with the correct value on T is the probability of an accurate attribution (PAA) value for that match set.
6. Repeat 4 and 5 several times with different targets.
7. Repeat 3-6 with different records until the PAA values stabilise.
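The by-record procedure can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: records are assumed to be plain dictionaries, the function names (`match_set`, `paa`, `by_record_paa`) and the toy variables are hypothetical, and the "repeat until the PAA values stabilise" rule is replaced by a fixed number of sampled records for simplicity.

```python
import random

def match_set(record, dataset, key):
    """Records in dataset that agree with record on every key variable."""
    return [row for row in dataset if all(row[k] == record[k] for k in key)]

def paa(record, dataset, key, target):
    """Proportion of the match set with the record's true target value.

    Returns None when the match set is empty (no match on the key).
    """
    matches = match_set(record, dataset, key)
    if not matches:
        return None
    correct = sum(1 for row in matches if row[target] == record[target])
    return correct / len(matches)

def by_record_paa(original, synth, synth_other, key, targets, n_records, seed=0):
    """Steps 3-7: sample records from O and score them against O, S and S'.

    Assumption: a fixed sample size stands in for the stabilisation rule.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_records):
        r = rng.choice(original)
        for t in targets:
            results.append({
                "target": t,
                "O": paa(r, original, key, t),
                "S": paa(r, synth, key, t),
                "S_prime": paa(r, synth_other, key, t),
            })
    return results

# Toy data (invented for illustration only).
original = [
    {"region": "NW", "tenure": "own", "econ": "employed"},
    {"region": "NW", "tenure": "own", "econ": "retired"},
    {"region": "SE", "tenure": "rent", "econ": "employed"},
]
synth = [
    {"region": "NW", "tenure": "own", "econ": "employed"},
    {"region": "SE", "tenure": "own", "econ": "retired"},
]
synth_other = [
    {"region": "NW", "tenure": "own", "econ": "retired"},
]

key = ["region", "tenure"]
print(paa(original[0], original, key, "econ"))     # match set of 2, 1 correct
print(paa(original[0], synth, key, "econ"))        # match set of 1, 1 correct
print(paa(original[2], synth, key, "econ"))        # empty match set
```

Comparing the PAA against S with the PAA against S’ is what gives the measure its differential-privacy flavour: S was generated from a file containing r, S’ was not.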

General empirical differential privacy procedure, assuming a categorical target variable
1. For each record in O, record its multivariate class membership for both K and K+T.
   ◦ The size of the equivalence class for K+T divided by the size of the equivalence class for K for a given record is the PAA for that record.
2. Repeat 1 for each record in S. Impute the PAA values from S into O against the corresponding records (matched on K+T).
3. Repeat 2 for S’.
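The general procedure amounts to counting equivalence classes. The sketch below is one possible reading, with hypothetical function names and invented toy data. Records of O whose K+T class does not occur in the source file get `None`; the slides do not say how such records are scored, so whether to treat them as zero or to exclude them is left as an assumption for the caller.

```python
from collections import Counter

def class_sizes(dataset, variables):
    """Size of each equivalence class defined by the given variables."""
    return Counter(tuple(row[v] for v in variables) for row in dataset)

def paa_by_class(dataset, key, target):
    """PAA for every K+T class: |class on K+T| / |class on K|."""
    k_sizes = class_sizes(dataset, key)
    kt_sizes = class_sizes(dataset, key + [target])
    return {kt: n / k_sizes[kt[:len(key)]] for kt, n in kt_sizes.items()}

def impute_paa(original, source, key, target):
    """Attach to each record of O the PAA of its K+T class in source.

    source may be O itself, S, or S'. Unmatched records get None.
    """
    lookup = paa_by_class(source, key, target)
    return [lookup.get(tuple(row[v] for v in key + [target])) for row in original]

# Toy data (invented for illustration only).
original = [
    {"region": "NW", "tenure": "own", "econ": "employed"},
    {"region": "NW", "tenure": "own", "econ": "retired"},
    {"region": "SE", "tenure": "rent", "econ": "employed"},
]
synth = [
    {"region": "NW", "tenure": "own", "econ": "employed"},
    {"region": "SE", "tenure": "own", "econ": "retired"},
]

key = ["region", "tenure"]
print(impute_paa(original, original, key, "econ"))
print(impute_paa(original, synth, key, "econ"))
```

Because every record is scored in one pass over the class counts, this version avoids the random sampling of the by-record procedure.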

General empirical differential privacy procedure assuming a continuous target variable

Key variables
Key 1: GOR, Output area classifier.
Key 2: GOR, Output area classifier, tenure.
Key 3: GOR, Output area classifier, tenure, dwelling type.
Key 4: GOR, Output area classifier, tenure, dwelling type, Internet in household.

Example residuals for saturated model predicting income for a single case

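Mean probability of accurate attribution (PAA) of economic position of household reference person
Columns: File; Key 1; Key 2; Key 3; Key 4; mean cumulative key impact
Rows: 2011 original; 2010 synth; 2011 synth; 2012 synth; baseline (0.26); synth2011 - baseline; DP residual
(Cell values other than the baseline figure of 0.26 are not preserved in this transcript.)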

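Mean probability of accurate attribution (PAA) of economic position of household reference person, given that a match has occurred.
Columns: File; Key 1; Key 2; Key 3; Key 4; mean cumulative key impact
Rows: 2011 original; 2010 synth; 2011 synth; 2012 synth; baseline (0.26); synth2011 - baseline; DP residual
(Cell values other than the baseline figure of 0.26 are not preserved in this transcript.)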

Mean PAA scores over 3 categorical targets across four keys of increasing size.

Hit rate for primary key matches from the original file onto the synthetic file.
File            Key 1   Key 2   Key 3   Key 4   Mean cumulative key impact
2011 original   100%    100%    100%    100%    0%
2010 synth      84%     69%     51%     41%     -14%
2011 synth      95%     77%     58%     48%     -16%
2012 synth      94%     78%     57%     48%     -15%
DP residual: 6%, 3%, 4% (column alignment not recoverable from the transcript)
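The summary figures in the hit-rate table can be checked with a little arithmetic. The definitions below are assumptions inferred from the numbers, not stated on the slides: the mean cumulative key impact is taken to be the average change in hit rate from one key to the next larger key, and the DP residual is taken to be the 2011 synthetic hit rate minus a baseline formed by averaging the 2010 and 2012 synthetic files.

```python
# Hit rates (%) for Keys 1-4, copied from the table above.
hit_rates = {
    "2010 synth": [84, 69, 51, 41],
    "2011 synth": [95, 77, 58, 48],
    "2012 synth": [94, 78, 57, 48],
}

def mean_cumulative_key_impact(rates):
    """Average change in hit rate when moving to the next larger key (assumed definition)."""
    steps = [b - a for a, b in zip(rates, rates[1:])]
    return sum(steps) / len(steps)

def dp_residual(target_rates, baseline_files):
    """Target file minus the per-key mean of the baseline files (assumed definition)."""
    baseline = [sum(col) / len(col) for col in zip(*baseline_files)]
    return [t - b for t, b in zip(target_rates, baseline)]

for name, rates in hit_rates.items():
    print(name, round(mean_cumulative_key_impact(rates)))

print(dp_residual(
    hit_rates["2011 synth"],
    [hit_rates["2010 synth"], hit_rates["2012 synth"]],
))
```

Under the first assumption the rounded impacts reproduce the reported -14%, -16% and -15%. Under the second, the per-key residuals come out at 6, 3.5, 4 and 3.5 percentage points, which is close to but not identical with the reported 6%, 3%, 4%, so the slide may have used unrounded hit rates or a different baseline.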

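Table 10: Residual sizes for estimated weekly income using two keys against the LCF and three synthetic files
Columns: File; Key 1; Key 2; Key 3; Key 4; mean cumulative key impact
Rows: 2011 original; 2010 synth; 2011 synth; 2012 synth; baseline (385); synth2011 - baseline; DP residual (197811, digits run together in the transcript)
(Other cell values are not preserved in this transcript.)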

Concluding remarks The empirical DP approach to measuring disclosure risk looks promising. Future work ◦ Intruder strategies – optimising key size ◦ Testing with disclosure controlled data ◦ Investigating the impact of risky records ◦ Testing with a wider range of synthetic data