Statistical Confidentiality: Is Synthetic Data the Answer? George Duncan, 13 February 2006

Acknowledging Colleagues
Diane Lambert, Google
Stephen Fienberg, Carnegie Mellon
Stephen Roehrig, Carnegie Mellon
Lynne Stokes, Southern Methodist
Sallie Keller-McNulty, Rice
Mark Elliot, Manchester, UK
JJ Salazar, Universidad de La Laguna, Spain

Acknowledging Current Funding
NSF, NISS Digital Government II: Data Confidentiality, Data Quality and Data Integration for Federal Databases: Foundations to Software Prototypes
Agency Partners: Bureau of Labor Statistics, Bureau of Transportation Statistics, Census Bureau, National Agricultural Statistics Service, National Center for Education Statistics

Questions Addressed
What's the R-U confidentiality map?
What are synthetic data?
Can the research community benefit from synthetic data?
Source data: the Gold Standard?
How should we evaluate a synthesizer?

Brokering Role of the Information Organization
[Diagram: respondents' data are captured by the information organization, which disseminates them to policy analysts, decision makers, the media, researchers, and data snoopers.]

Why Confidentiality Matters
Ethical: keeping promises; a basic value tied to the privacy concerns of solitude, autonomy and individuality
Pragmatic: without confidentiality, respondents may not provide data; worse, they may provide inaccurate data
Legal: required under law

Confidentiality Audit
Sensitive objects: attribute values, relationships
Susceptible data: geographical detail, longitudinal or panel structure, outliers, many attribute variables, detailed attribute variables, census versus survey/sample, existence of linkable external databases

Making It Safe: Restricted Data and Restricted Access

Restricted Access
Special Sworn Employee: Census Bureau
Licensed Researchers: National Center for Education Statistics
External Sites: California Census Research Data Center

Online Access

Restricted Data and Restricted Access

Matrix Masking
Transforming the source data (X) into the disseminated data (Y): Y = AXB + C
Covers suppressions, perturbations, samplings, aggregations
A is a row operator (record transformation); B is a column operator (attribute transformation); C is an additive perturbation
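
A minimal numpy sketch of the matrix-masking form Y = AXB + C. The particular choices below (A subsamples records, B averages two attributes, C adds Gaussian noise) are illustrative assumptions, not the operators used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Source microdata X: n records (rows) by p attributes (columns).
n, p = 10, 4
X = rng.normal(size=(n, p))

# Row operator A: a sampling matrix that keeps a random half of the records.
keep = rng.choice(n, size=n // 2, replace=False)
A = np.zeros((n // 2, n))
A[np.arange(n // 2), keep] = 1.0

# Column operator B: aggregates attributes 0 and 1 into their average and
# passes attributes 2 and 3 through unchanged (p columns -> 3 columns).
B = np.array([[0.5, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Additive perturbation C: independent noise on every released cell.
C = rng.normal(scale=0.1, size=(n // 2, 3))

# Masked data released to users.
Y = A @ X @ B + C
print(Y.shape)  # (5, 3)
```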

Use the source data X to estimate a synthesizer (an approximate posterior predictive distribution for the data); generate the released synthetic samples from that estimated distribution.

Origins of the Synthetic Data Idea
Computer science: Liew, C. K., Choi, U. J., and Liew, C. J. (1985), A data distortion by probability distribution, ACM Transactions on Database Systems
Statistics: Rubin, D. B. (1993), Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata, Journal of Official Statistics

Further Developments
Fienberg, S. E., Makov, U. E., and Steele, R. J. (1998), Disclosure limitation using perturbation and related methods for categorical data, Journal of Official Statistics
Kennickell, A. B. (1999), Multiple imputation and disclosure protection, Statistical Data Protection '98, Lisbon
Now the attention of other authors, particularly Little, Raghunathan, Reiter, Rubin, Abowd, Woodcock
My latest bibliography on synthetic data has 31 entries

What was the original purpose? A public-use microdata file that allows users to make valid inferences about population parameters using straightforward statistical tools while protecting confidentiality (Rubin 1993)

One Person's Assessment
"… synthetic data sets which have all of the statistical properties of the original data set, but have entirely false data - made-up data, so that you cannot break confidentiality because, in fact, any data set, any data record you have is a synthetic data record. … possibly the way of the future for lots of very, very confidential data, and maybe because the … the ability to protect confidentiality … is being eroded by the internet … this is probably where we are going to be driven to, although, I hope not."
-- Norman Bradburn (2003)

Use X to estimate an approximate posterior predictive distribution; generate synthetic samples from it. How should we get the synthesizer?
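
As a minimal sketch of the two-step idea on this slide, the code below fits a synthesizer to X and releases draws from it. The multivariate-normal plug-in fit is an illustrative assumption for brevity, not the full posterior predictive machinery of Rubin (1993).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy source file X with correlated attributes (stand-in for real microdata).
X = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

# Step 1: use X to estimate the synthesizer (here, plug-in mean and covariance).
mu_hat = X.mean(axis=0)
sigma_hat = np.cov(X, rowvar=False)

# Step 2: release samples from the fitted synthesizer; no released record is
# any respondent's actual record.
synthetic = rng.multivariate_normal(mu_hat, sigma_hat, size=X.shape[0])
```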

Less-Ambitious Data-Use Purposes
Gain familiarity with the dataset structure, develop code, and estimate analytical models; compare against the "gold standard file" (Abowd and Lane 2003, Abowd 2005)
"… people can send in their sort of model. They can make up the synthetic data. You can go back, you can run things, sharpen up your hypotheses and so forth, and then after you've got everything and get your codes all right and get your SAS codes right, and then send it in and they will run the data - the real data, and they'll send you back the results." (Bradburn 2003)

R-U Confidentiality Map
[Figure: disclosure risk R plotted against data utility U, with "no data" at the origin, the released data between, and the original data at highest risk and utility; a horizontal line marks the maximum tolerable risk.]

Disclosure Limitation Parameters
Parameters specify the extent of disclosure limitation; disclosure risk and data utility vary with these parameter values (an illustrative R-U trace is sketched below)
Examples: the top-coding limit; the standard deviation of additive noise
Interpretation for synthetic data: the extent to which released data are synthetic (partially synthetic data, Little 1993); the extent to which the synthetic data match the source data (e.g., outliers)
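
A sketch tracing an R-U curve as the disclosure-limitation parameter (here the standard deviation of additive noise) varies. The risk and utility proxies below, a crude nearest-neighbor re-identification rate and one minus the relative error of the mean, are illustrative stand-ins rather than the measures behind the map.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.lognormal(mean=3.0, sigma=0.5, size=200)   # source attribute

for sd in [0.0, 0.5, 1.0, 2.0, 5.0]:
    y = x + rng.normal(scale=sd, size=x.size)       # released, noise-masked values

    # Risk proxy R: fraction of released values whose nearest source value
    # is their own source record.
    matched = np.abs(x[:, None] - y[None, :]).argmin(axis=0) == np.arange(x.size)
    R = matched.mean()

    # Utility proxy U: one minus the relative error in the sample mean.
    U = 1.0 - abs(y.mean() - x.mean()) / x.mean()

    print(f"noise sd={sd:4.1f}  R={R:.2f}  U={U:.3f}")
```

As the noise standard deviation grows, R falls and U falls with it, tracing the trade-off the map depicts.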

Does Synthetic Data Guarantee Confidentiality?
Because a synthetic data record is not any respondent's actual data record, identity disclosure is directly impossible
Attribute disclosure is still possible
Particularly with extreme values, it may still be possible to re-identify a source record
Some simulated individuals may have data values virtually identical to original sample individuals, so the possibility of both identity and attribute disclosure remains (Fienberg 1997, 2003); a matching check is sketched below
If quasi-identifier attributes are synthesized, re-identification can still happen if a data snooper can link an external identified data source using those quasi-identifiers (Domingo-Ferrer et al. 2005)
Answer: not fully, but it can appreciably lower disclosure risk
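
The Fienberg point can be checked directly: a sketch that flags synthetic records lying very close to some source record, a signal of residual identity or attribute disclosure risk. The distance threshold is an assumption for illustration, not a published rule.

```python
import numpy as np

def near_matches(source, synthetic, threshold=0.05):
    """Indices of synthetic records whose nearest source record (Euclidean
    distance on standardized attributes) lies below the threshold."""
    mu, sd = source.mean(axis=0), source.std(axis=0)
    s = (source - mu) / sd
    t = (synthetic - mu) / sd
    # Pairwise distances between every synthetic and every source record.
    d = np.sqrt(((t[:, None, :] - s[None, :, :]) ** 2).sum(axis=2))
    return np.where(d.min(axis=1) < threshold)[0]

# Usage: risky = near_matches(X_source, X_synthetic)
# Any flagged records warrant review before release.
```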

Are Synthetic Data Valid?
Not unless we are careful in how the data are synthesized
Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)
Answer: yes, if we do it right

Synthesizer Build
Building a synthesizer involves constructing a statistical model
But the model's purpose is not the usual one: not prediction, control or scientific understanding
Usual model construction exploits Occam's Razor and seeks parsimony

Careful with Occam's Razor
"Everything should be made as simple as possible, but not one bit simpler." -- Albert Einstein
"Seek simplicity, and distrust it." -- Alfred North Whitehead

Source Data: Not a 24-Karat Gold Standard?
Steve Fienberg has noted: the sampled population is often not the target population; there are coding errors and imputed missing data
Do we really want to duplicate the statistical results obtainable from the source data? (Match the source data.)
Or do we want to obtain statistical inferences equally valid as those from the source data? (Match the source data's goal.)

What posterior predictive distribution for synthetic data?
"In actual implementations, the correct posterior predictive distribution is not known, and an imputer-constructed approximation is used." -- Jerry Reiter (2002)
What sampling distributions? What priors work best?
What if the data analyst uses a prior very different from the synthesizer?

Regression Analysis: Y versus X, X-squared
[Minitab output: regression of Y on X and X-squared fitted to the source data; the coefficient estimates, standard errors, t-statistics and p-values did not survive extraction. R-Sq = 99.9%.]

[Minitab output: linear regression of Y on X fitted to the source data; the coefficient estimates, standard errors, t-statistics and p-values did not survive extraction. R-Sq = 99.9%.]

What should we use to generate the synthetic data?
[Minitab output: descriptive statistics (N, mean, standard deviation) for X and Y; the numerical values did not survive extraction.]

Usual Modeling Approach (Non-Informative Bayes)
Take a standard model for the joint distribution of (X, Y) with a non-informative prior, fit it to the source data, and simulate (Sim X, Sim Y) from the fitted model. [The model formulas on the original slide did not survive extraction.]
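
A worked sketch of this example, assuming (consistent with the descriptive-statistics slide and the "lost quadratic effect" conclusion later on) that the synthesizer reduces to a joint normal fit to (X, Y). The numbers are made up for illustration; the values on the original slides are not recoverable.

```python
import numpy as np

rng = np.random.default_rng(3)

# Source data with a genuine quadratic relationship (illustrative numbers only).
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + 0.4 * x**2 + rng.normal(scale=1.0, size=x.size)

# Synthesizer in its plug-in form: fit a bivariate normal to (X, Y) and simulate.
data = np.column_stack([x, y])
sim = rng.multivariate_normal(data.mean(axis=0), np.cov(data, rowvar=False),
                              size=data.shape[0])
sim_x, sim_y = sim[:, 0], sim[:, 1]

# Fit Y ~ X + X^2 on both files: the quadratic coefficient survives in the
# source fit but is essentially zero in the synthetic fit.
def quad_fit(xx, yy):
    design = np.column_stack([np.ones_like(xx), xx, xx**2])
    coef, *_ = np.linalg.lstsq(design, yy, rcond=None)
    return coef

print("source   :", quad_fit(x, y))
print("synthetic:", quad_fit(sim_x, sim_y))
```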

[Minitab output: linear regression of Sim Y on Sim X fitted to the simulated data; the coefficient estimates, standard errors, t-statistics and p-values did not survive extraction. R-Sq = 99.9%.]

Compare with the "Gold Standard" Analysis
[Side-by-side Minitab output: the linear regression of Y on X from the source data next to the regression of Sim Y on Sim X from the simulated data; both report R-Sq = 99.9%, and the numerical coefficients did not survive extraction.]

Reality
[Figure: scatterplot of the source data, showing the curvature and outliers that the simulated data miss.]

So What's So Bad?
Lost the quadratic effect (think of an analyst with a positive prior on it)
Lost the outliers

Data Utility: Inference-Valid?
What does inference-valid mean?
The same results as with the original data?
Or equal inference capability as the original data? (Think like a post-19th-century statistician.)
One operational check is sketched below.
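
One way to make "equal inference capability" operational is the confidence-interval overlap measure in the spirit of Karr et al. (2006); the formula below is stated from memory as a sketch, not quoted from these slides.

```python
import numpy as np

def ci_overlap(ci_source, ci_synth):
    """Average relative overlap of two confidence intervals for the same
    estimand: near 1 means the two files support essentially the same inference."""
    lo, uo = ci_source
    ls, us = ci_synth
    inter = max(0.0, min(uo, us) - max(lo, ls))
    return 0.5 * (inter / (uo - lo) + inter / (us - ls))

def mean_ci(values, z=1.96):
    """Normal-approximation 95% confidence interval for the mean."""
    m = values.mean()
    se = values.std(ddof=1) / np.sqrt(values.size)
    return m - z * se, m + z * se

# Usage: ci_overlap(mean_ci(source_column), mean_ci(synthetic_column))
```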

Is Inference-Valid Synthetic Data Possible?
"How robust are inferences to mis-specifications in the model used to draw synthetic data?" -- Jerry Reiter
The method used in imputation must foresee the complete-data analyses

Implementation is Hard
Model development is time-consuming and human-resource demanding, typically needing both domain knowledge and statistical skills
A model is a simplification of reality, an incomplete image
Model selection and parameterization are subjective
Data users' models and methods grow more and more sophisticated
(Bucher & Vckovski, 1995)

Multivariate Difficulties
Capturing multivariate statistical characteristics is time consuming (Dandekar 2004)
It is difficult to model the joint distribution of several variables, especially in the presence of categorical variables (Singh, Yu, and Dunteman 2003); one common workaround is sketched below
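
One common response to this difficulty is sequential, variable-by-variable synthesis: model each variable conditional on the ones already synthesized. A minimal sketch with one categorical and one continuous variable, using simple plug-in models purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy source file: a categorical variable G and a continuous variable W
# whose distribution depends on G.
n = 1000
G = rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.3, 0.2])
W = np.where(G == "A", 30, np.where(G == "B", 45, 60)) + rng.normal(scale=5, size=n)

# Step 1: synthesize G from its estimated marginal distribution.
cats, counts = np.unique(G, return_counts=True)
G_syn = rng.choice(cats, size=n, p=counts / n)

# Step 2: synthesize W conditional on the synthesized G, using group-specific
# means and standard deviations estimated from the source file.
W_syn = np.empty(n)
for c in cats:
    src = W[G == c]
    idx = G_syn == c
    W_syn[idx] = rng.normal(loc=src.mean(), scale=src.std(ddof=1), size=idx.sum())
```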

Sample Survey Data
Generate synthetic data for the sampled units: more disclosure risk; data utility?
Generate synthetic data for population units: less disclosure risk; data utility?
Preserve the structure of the sampling design? (Singh, Yu, and Dunteman 2003)

Usual Hard Problems Remain Hard!
Geographical detail: synthetic data for sampled units?
Longitudinal data: preserve complex relationships, or approximate à la Abowd and Woodcock (2001)
Target known to be in the sample: synthetic data for sampled units?

Final Messages
Follow the R-U confidentiality map
Don't accept the source data as the Gold Standard
In sculpting a synthesizer, Occam's Razor cuts too deeply
Implementing synthetic data is hard, so it is no panacea for microdata release