Statistical Confidentiality: Is Synthetic Data the Answer? George Duncan 2006 February 13.

Presentation transcript:

1 Statistical Confidentiality: Is Synthetic Data the Answer? George Duncan 2006 February 13

2 Acknowledging Colleagues
- Diane Lambert, Google
- Stephen Fienberg, Carnegie Mellon
- Stephen Roehrig, Carnegie Mellon
- Lynne Stokes, Southern Methodist
- Sallie Keller-McNulty, Rice
- Mark Elliot, Manchester, UK
- JJ Salazar, Universidad de La Laguna, Spain

3 Acknowledging Current Funding
- NSF, NISS Digital Government II, Data Confidentiality, Data Quality and Data Integration for Federal Databases: Foundations to Software Prototypes
- Agency Partners: Bureau of Labor Statistics; Bureau of Transportation Statistics; Census Bureau; National Agricultural Statistics Service; National Center for Education Statistics

4 Questions Addressed
- What’s the R-U confidentiality map?
- What are synthetic data?
- Can the research community benefit from synthetic data?
- Source data: the Gold Standard?
- How should we evaluate a synthesizer?

5 Brokering Role of the Information Organization Respondent DATA CAPTURE Policy Analyst Decision Maker Media Researcher Data Snooper DISSEMINATION

6 Why Confidentiality Matters
- Ethical: Keeping promises; basic value tied to privacy concerns of solitude, autonomy and individuality
- Pragmatic: Without confidentiality, respondent may not provide data; worse, may provide inaccurate data
- Legal: Required under law

7 Confidentiality Audit
- Sensitive objects
- Attribute values
- Relationships
- Susceptible data
- Geographical detail
- Longitudinal or panel structure
- Outliers
- Many attribute variables
- Detailed attribute variables
- Census versus survey/sample
- Existence of linkable external databases

8 Restricted Data Restricted Access Making It Safe

9 RESTRICTED ACCESS
- Special Sworn Employee
- Census Bureau
- Licensed Researchers
- National Center for Education Statistics
- External Sites
- California Census Research Data Center

10 On Line Access

11 Restricted Access Restricted Data Restricted Access

12

13 Matrix Masking
Transforming the source data (X) to the disseminated data (Y)
- Suppressions
- Perturbations
- Samplings
- Aggregations
Y = AXB + C

14 Matrix Masking
Transforming the original data (X) to the disseminated data (Y)
- Suppressions
- Perturbations
- Samplings
- Aggregations
Y = AXB + C

15 Matrix Masking
Transforming the original data (X) to the disseminated data (Y)
- Suppressions
- Perturbations
- Samplings
- Aggregations
Y = AXB + C
- A: row operator, so record transformation
- B: column operator, so attribute transformation
- C: additive perturbation
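
The masking equation Y = AXB + C can be tried out on a toy matrix. The sketch below is only an illustration under stated assumptions: the data, the pair-averaging choice for A, the dropped attribute for B, and the noise scale for C are all made up here and are not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source data X: 6 records (rows) by 3 attributes (columns).
X = rng.normal(loc=50.0, scale=10.0, size=(6, 3))

# A acts on rows (records): here it averages consecutive pairs of records,
# an aggregation that turns 6 records into 3.
A = np.kron(np.eye(3), np.full((1, 2), 0.5))

# B acts on columns (attributes): here it drops the third attribute,
# a suppression that keeps attributes 0 and 1.
B = np.eye(3)[:, :2]

# C is additive perturbation: zero-mean noise with the shape of A @ X @ B.
C = rng.normal(loc=0.0, scale=1.0, size=(3, 2))

# Disseminated data.
Y = A @ X @ B + C
print(Y.shape)
```

Sampling fits the same form: an A built from a subset of the rows of an identity matrix keeps only the selected records.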

16

17 Use X to estimate Generate samples from

18 Origins of the Synthetic Data Idea
- Computer Science: Liew, C. K., Choi, U. J., and Liew, C. J. (1985) A data distortion by probability distribution, ACM Transactions on Database Systems 10 395-411
- Statistics: Rubin, D. B. (1993), Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata, Journal of Official Statistics 9 461-468

19 Further Developments
- Fienberg, S. E., Makov, U. E. and Steele, R. J. (1998) Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14 347-360
- Kennickell, Arthur B. (1999) Multiple imputation and disclosure protection. Statistical Data Protection ’98 Lisbon 381-400
- Now attention of other authors, particularly Little, Raghunathan, Reiter, Rubin, Abowd, Woodcock
- My latest bibliography on SD has 31 entries

20 What was the original purpose? Public-use microdata file to allow user to make valid inferences about population parameters using straightforward statistical tools while protecting confidentiality (Rubin 1993)

21 One Person’s Assessment
“… synthetic data sets which have all of the statistical properties of the original data set, but have entirely false data - made-up data, so that you cannot break confidentiality because, in fact, any data set, any data record you have is a synthetic data record. … possibly the way of the future for lots of very, very confidential data, and maybe because the … the ability to protect confidentiality … is being eroded by the internet … this is probably where we are going to be driven to, although, I hope not.”
Norman Bradburn (2003)

22 Use X to estimate Generate samples from How should we get the synthesizer?

23 Less-Ambitious Data-Use Purposes
- “Gain familiarity with the dataset structure, develop code, and estimate analytical models; compare against ‘gold standard file’” (Abowd and Lane 2003, Abowd 2005)
- “…people can send in their sort of model. They can make up the synthetic data. You can go back, you can run things, sharpen up your hypotheses and so forth, and then after you’ve got everything and get your codes all right and get your SAS Codes right, and then send it in and they will run the data - the real data, and they’ll send you back the results.” (Bradburn 2003)

24 R-U Confidentiality Map No Data Data Utility U Disclosure Risk R Original Data Maximum Tolerable Risk Released Data

25 Disclosure Limitation Parameters
- Specify extent of disclosure limitation
- Disclosure risk and data utility vary with these parameter values
- Top-coding limit
- Standard deviation of additive noise
- Interpretation for synthetic data
- Extent released data are synthetic: partial synthetic data (Little, 1993)
- Extent synthetic data matches source data (e.g., outliers)
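
One way to see how a disclosure limitation parameter moves a release around the R-U confidentiality map is to sweep the standard deviation of additive noise and record a risk value and a utility value at each setting. The sketch below is a rough illustration, not the talk's method: the data are simulated, and both measures (`risk` as a nearest-record match rate, `utility` as closeness of total variance) are hypothetical proxies chosen only for simplicity.

```python
import numpy as np

def ru_point(x, noise_sd, rng):
    """Return (risk, utility) for additive noise with the given SD.

    Both measures are illustrative proxies, not standard definitions:
    risk    = share of masked records whose nearest source record is their own
    utility = 1 / (1 + relative error of the masked data's total variance)
    """
    y = x + rng.normal(0.0, noise_sd, size=x.shape)
    # Pairwise squared distances between masked and source records.
    d = ((y[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    risk = float(np.mean(d.argmin(axis=1) == np.arange(len(x))))
    total_var = x.var(axis=0).sum()
    rel_err = abs(y.var(axis=0).sum() - total_var) / total_var
    utility = 1.0 / (1.0 + rel_err)
    return risk, utility

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))  # hypothetical source microdata
for sd in (0.0, 0.1, 0.5, 2.0):
    risk, utility = ru_point(x, sd, rng)
    print(f"noise SD {sd:>4}: risk {risk:.2f}, utility {utility:.2f}")
```

Plotting the (risk, utility) pairs and marking a maximum tolerable risk gives a crude version of the map; with zero noise the point is the original data, with risk and utility both at 1 under these proxies.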

26 Does Synthetic Data Guarantee Confidentiality?
- Synthetic data record not respondent’s actual data record, so identity disclosure is impossible
- Attribute disclosure can happen
- Particularly with extreme values, it may be possible to re-identify a source record

27 Does Synthetic Data Guarantee Confidentiality?
- If simulated individuals have data values virtually identical to source individuals, possibility of both identity and attribute disclosure (Fienberg 1997, 2003)
- If quasi-identifier attributes are synthesized, re-identification can happen if data snooper can link an external identified data source using the quasi-identifier attributes (Domingo-Ferrer et al 2005)
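
The linkage attack mentioned on this slide can be sketched with two toy files. Everything below is hypothetical: the records, the names, the choice of `age`, `zip`, and `sex` as quasi-identifiers, and the exact-match rule are assumptions made for illustration, and a real data snooper might use fuzzier matching.

```python
from collections import defaultdict

# Hypothetical released microdata: quasi-identifiers plus a sensitive attribute.
released = [
    {"age": 34, "zip": "15213", "sex": "F", "income": 58000},
    {"age": 34, "zip": "15213", "sex": "M", "income": 61000},
    {"age": 71, "zip": "15217", "sex": "F", "income": 240000},
]
# Hypothetical external file that carries names with the same quasi-identifiers.
external = [
    {"name": "Pat", "age": 71, "zip": "15217", "sex": "F"},
    {"name": "Lee", "age": 34, "zip": "15213", "sex": "M"},
    {"name": "Sam", "age": 50, "zip": "15232", "sex": "M"},
]
QUASI = ("age", "zip", "sex")

def link(released, external, keys=QUASI):
    """Return {name: released record} for names matching exactly one released record."""
    index = defaultdict(list)
    for rec in released:
        index[tuple(rec[k] for k in keys)].append(rec)
    matches = {}
    for person in external:
        hits = index.get(tuple(person[k] for k in keys), [])
        if len(hits) == 1:  # a unique match is a claimed re-identification
            matches[person["name"]] = hits[0]
    return matches

for name, rec in link(released, external).items():
    print(name, "->", rec["income"])
```

If the released quasi-identifier values stay close to the source values, each unique match attaches a name to a sensitive value, which is the risk the slide describes.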

28 Does Synthetic Data Guarantee Confidentiality?
- Because a synthetic data record is not any respondent’s actual data record, identity disclosure is directly impossible
- Attribute disclosure is still possible
- But, particularly with extreme values, it may still be possible to re-identify a source record
- Some simulated individuals may have data values virtually identical to original sample individuals, so the possibility of both identity and attribute disclosure remains (Fienberg 1997, 2003)
Not fully, but it can appreciably lower disclosure risk

29 Are Synthetic Data Valid?
- Not unless we are careful in how it is synthesized
- Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)

30 Are Synthetic Data Valid?
- Not unless we are careful in how it is synthesized
- Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)
If we do it right

31 Synthesizer Build
- Synthesizer build involves constructing a statistical model
- But… model purpose not the usual
- Not prediction, control or scientific understanding
- Usual model construction exploits Occam’s Razor and seeks parsimony

32 Careful with Occam’s Razor
- "Everything should be made as simple as possible, but not one bit simpler." -- Albert Einstein
- "Seek simplicity, and distrust it." -- Alfred North Whitehead

33 Source Data not 24 Karat Gold Standard?
- Steve Fienberg has noted
- Sampled population often not target population
- Coding errors, imputed missing data
- Do we really want to duplicate the statistical results obtainable from the source data?
- Match source data
- Or, do we want to obtain statistical inferences equally valid as those from the source data?
- Match source data goal

34 What posterior predictive distribution for synthetic data?
- “In actual implementations, the correct posterior predictive distribution is not known, and an imputer-constructed approximation is used.” Jerry Reiter (2002)
- What sampling distributions?
- What priors work best?
- What if the data analyst uses a prior very different from the synthesizer?

35

36 Regression Analysis: Y versus X, X-squared
The regression equation is Y = 6.61 + 3.05 X + 0.00062 X-squared
Predictor   Coef       SE Coef    T       P
Constant    6.605      9.829      0.67    0.507
X           3.0516     0.2044     14.93   0.000
X-squared   0.000621   0.001046   0.59    0.558
S = 1.62190   R-Sq = 99.9%

37 The regression equation is Y = 0.88 + 3.17 X
Predictor   Coef      SE Coef   T        P
Constant    0.881     1.890     0.47     0.645
X           3.17236   0.01892   167.64   0.000
S = 1.60303   R-Sq = 99.9%

38 What should we use to generate the synthetic data?
Descriptive Statistics: X, Y
Variable   N    Mean     StDev
X          30   98.65    15.73
Y          30   313.85   9.12

39

40 Usual Modeling Approach (non-informative Bayes) Take

41

42 The regression equation is Sim Y = 3.39 + 3.14 Sim X
Predictor   Coef      SE Coef   T        P
Constant    3.393     1.810     1.87     0.071
Sim X       3.14138   0.01921   163.56   0.000
S = 1.55825   R-Sq = 99.9%

43 Compare with the “Gold Standard” Analysis
Based on Source Data:
The regression equation is Y = 0.88 + 3.17 X
Predictor   Coef      SE Coef   T        P
Constant    0.881     1.890     0.47     0.645
X           3.17236   0.01892   167.64   0.000
S = 1.60303   R-Sq = 99.9%
Based on Simulated Data:
The regression equation is Sim Y = 3.39 + 3.14 Sim X
Predictor   Coef      SE Coef   T        P
Constant    3.393     1.810     1.87     0.071
Sim X       3.14138   0.01921   163.56   0.000
S = 1.55825   R-Sq = 99.9%

44 Reality

45 So What’s So Bad?
- Lost quadratic effect
- Think of analyst with positive prior on this
- Lost outliers
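
The loss of a quadratic effect can be reproduced in a small simulation. The sketch below rests on several assumptions that are not from the talk: the source data are simulated with a deliberately strong quadratic term, only Y is synthesized while X is kept fixed, and the synthesizer is a straight-line model with one posterior draw under the standard noninformative prior for normal linear regression. It is meant as an illustration of the mechanism, not a reconstruction of the slides' analysis.

```python
import numpy as np

rng = np.random.default_rng(2006)

# Hypothetical source data with a real quadratic effect (not the talk's data).
n = 30
x = np.linspace(70.0, 130.0, n)
xc = x - 100.0  # centered X keeps the design well conditioned
y = 3.0 * x + 0.02 * xc ** 2 + rng.normal(0.0, 1.5, size=n)

def ols(design, response):
    """Least-squares coefficients and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ beta
    return beta, float(resid @ resid)

# Synthesizer: a parsimonious straight-line model, as Occam's Razor suggests.
D1 = np.column_stack([np.ones(n), xc])
beta_hat, rss = ols(D1, y)

# One posterior draw under the standard noninformative prior:
# sigma^2 ~ RSS / chi-square(n - p), beta | sigma^2 ~ N(beta_hat, sigma^2 (D'D)^-1).
sigma2 = rss / rng.chisquare(n - D1.shape[1])
beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(D1.T @ D1))
y_syn = D1 @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

# The analyst fits a quadratic model to the source and to the synthetic data.
D2 = np.column_stack([np.ones(n), xc, xc ** 2])
quad_src = ols(D2, y)[0][2]
quad_syn = ols(D2, y_syn)[0][2]
print(f"quadratic coefficient, source:    {quad_src:.4f}")
print(f"quadratic coefficient, synthetic: {quad_syn:.4f}")
```

Because the synthesizer contains no quadratic term, the curvature in the source data is absorbed into the synthesizer's residual variance, and an analyst who looks for it in the synthetic file has nothing to find.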

46 Data Utility: Inference-Valid?
- What does inference valid mean?
- Same results as with original data
- Equal inference capability as original data? (Think like post-19th-century statistician)

47 Is Inference-Valid Synthetic Data Possible?
- “How robust are inferences to misspecifications in the model used to draw synthetic data?” Jerry Reiter
- Method used in imputation must foresee complete-data analyses
- http://www.multiple-imputation.com/

48 Implementation is Hard
- Model development time-consuming and human-resource demanding, typically needing domain knowledge and statistical skills
- Model is a simplification of reality: an incomplete image
- Model selection/parameterization subjective
- Data users’ models and methods more and more sophisticated
(Bucher & Vckovski, 1995)

49 Multivariate Difficulties
- Capturing multivariate statistical characteristics is time consuming. Dandekar (2004)
- Difficult to model joint distribution for several variables, especially in the presence of categorical variables. Singh, Yu, and Dunteman (2003)

50 Sample Survey Data
- Generate synthetic data for sampled units
  - More disclosure risk
  - Data utility?
- Generate synthetic data for population units
  - Less disclosure risk
  - Data utility?
- Preserve structure of sampling design?
Singh, Yu, and Dunteman (2003)

51 Usual Hard Problems Remain Hard!
- Geographical detail
  - Synthetic data for sampled units?
- Longitudinal data
  - Preserve complex relationships
  - Approximate à la Abowd and Woodcock (2001)
- Target known to be in sample
  - Synthetic data for sampled units?

52 Final Messages
- Follow the R-U confidentiality map
- Don’t accept the source data as the Gold Standard
- In sculpting a synthesizer, Occam’s Razor cuts too deeply
- Implementing synthetic data is hard, so no panacea for microdata release

