1 Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods
  John M. Abowd, Cornell University and Census Bureau
  January 30, 2006
  UCLA Institute for Digital Research and Education Presentation
2 Acknowledgements
  Many current and past LEHD staff and senior research fellows contributed to the development of the LEHD infrastructure system and the Quarterly Workforce Indicators. Kevin McKinney, Bryce Stephens and Lars Vilhuber were particularly responsible for the confidentiality protection system.
  Fredrik Andersson and Marc Roemer at LEHD did the data analysis and implementation of the On the Map package. John Carpenter of Excensus, Inc. developed the mapping application.
  Gary Benedetto, Lisa Dragoset, Martha Stinson and Bryan Ricchetti did the synthesis programming for the SIPP-PUF application.
3 Overview
  What is the problem?
  What are synthetic data?
  How can the research community benefit from synthetic data?
  The NSF-ITR synthetic data grant
  The Census Bureau’s synthetic data and related products:
    QWI Online
    On the Map
    The new SIPP-SSA-IRS Public Use File
  Tools
4 Information Release and Data Protection are Competing Objectives
  Statisticians call this the Risk-Utility trade-off
  Economists prefer to distinguish between technological trade-offs and preference trade-offs
  Information release and data protection are technological trade-offs
5 A Simple Example of the Technological Trade-off
  There are two outputs: information released and data protection
  Consider a census with sampling as the release technology
  The production possibility frontier (PPF) measures the amount of information that must be sacrificed to get additional protection
  The information measure is Shannon’s H (or the Kullback-Leibler divergence between the census and the sample)
  The protection measure is the maximum probability of an exact disclosure
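As a rough illustration of the two measures named on this slide, the sketch below computes the Kullback-Leibler divergence between a census distribution and a sample distribution. It also treats the sampling fraction as the chance that a population-unique record is released. The function names, the example shares, and the use of the sampling fraction as the disclosure measure are assumptions for illustration and do not come from the presentation.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions.

    Assumes q is positive wherever p is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def disclosure_probability(sampling_fraction):
    """Assumed risk measure: chance that a population-unique record is sampled."""
    return sampling_fraction

census = [0.5, 0.3, 0.2]   # hypothetical category shares in the census
sample = [0.6, 0.3, 0.1]   # hypothetical shares in a released sample
loss = kl_divergence(census, sample)   # information lost by releasing the sample
risk = disclosure_probability(0.1)     # hypothetical 10 percent sample
```

Under these assumptions, a smaller sample lowers the disclosure probability but generally moves the released distribution further from the census, which is the trade-off the frontier traces out.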
8 What Are Synthetic Data?
  Public use micro data products that reproduce essential features of confidential micro data products
  Essential features include:
    Univariate distributions overall and in subpopulations
    Multivariate relations among the variables
9 Some History
  Original fully synthetic data idea was due to Rubin (JOS, 1993)
    Synthesize the Decennial Census long form responses for the short form households, then release samples that do not include any actual long form records
  Original partially synthetic data idea was due to Little (JOS, 1993)
    Synthesize the sensitive values on the public use file
  Critical refinement (Fienberg, 1994)
    Use a parametric posterior predictive distribution (instead of a Bayesian bootstrap) to do the sampling
  Other authors, particularly Raghunathan, Reiter, Rubin, Abowd, Woodcock
    Partially synthetic data with missing data (Reiter)
    Sequential Regression Multivariate Imputation (Raghunathan, Reiter, and Rubin; Abowd and Woodcock)
10 How Can You Preserve Confidentiality and Multivariate Relations?
  Fundamental trade-off: better protection vs. better data quality
  Protection results from summarizing the data with a complicated multivariate distribution, then sampling that distribution instead of the original data
  The synthetic data are not any respondent’s actual data
  But, for some techniques, it may still be possible to re-identify the source record in the confidential data
  New techniques address this problem
11 How Can the Research Community Benefit from Synthetic Data?
  Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity
  Many more users will have access to the information because there is a public use micro data product
12 The Research – Synthetic Data Feedback Cycle
  Confidentiality Protection
  Scientific Modeling
  Data Synthesis
  Analytic Validity
13 The Multi-layer System
  Basic confidential data
    Fundamental product of virtually all Census programs
    Leads to the publication of public-use products (summary data, micro data, narrative data)
  Gold-standard confidential data
    Edited, documented and archived research versions of confidential data
    Used in internal Census research and at Research Data Centers
14 More Layers
  Partially-synthetic micro data
    Preserves the record structure or sampling frame of the gold standard micro data
    Replaces the data elements with synthetic values sampled from an appropriate probability model
  Fully-synthetic micro data
    Uses only the population or record linkage structure of the gold standard micro data
    Generates synthetic entities and data elements from appropriate probability models
15 The NSF Information Technologies Research Grant
  A program that encourages innovative, high-payoff IT research and education
  Our grant proposal cited the many research studies and data products created by previous NSF support for the Research Data Center network and the Longitudinal Employer-Household Dynamics Program
16 What Is It?
  $2.9 million 3-year grant to the RDC network (Cornell is the coordinating institution)
  Provides core support for scientific activities at the RDCs
  To develop public use, analytically valid synthetic data from many of the RDC-accessible data sets
  To facilitate collaboration with RDC projects that help design and test these products
17 The Quarterly Workforce Indicators
  QWI was the LEHD Program’s first public use data product
  QWI Online
  Detailed labor force information by sub-state geography, detailed industry, ownership class, sex and age group
18 The Confidentiality Protection System
  All QWI protections are done by noise infusion of the micro-data
  All micro-data items are distorted by at least a minimal percentage and at most a maximal percentage
  Only the distorted items are used in the production of the release product
19 Protection and Validity Principles
  Cells with few businesses contributing or with few individuals contributing have been distorted in the cross-section but not the time-series
  Bias in the cross-section is controlled and random; no analyst knows its sign
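The noise infusion described on the two slides above can be sketched as follows. This is a minimal illustration, assuming one multiplicative factor per business that stays fixed across quarters and a uniform draw for the size of the distortion. The function names, the bounds, and the uniform shape are hypothetical; the actual distortion distribution is the one charted on the following slides.

```python
import random

def draw_distortion_factor(rng, min_pct, max_pct):
    """Draw a multiplicative factor whose distance from 1 lies in [min_pct, max_pct]."""
    magnitude = rng.uniform(min_pct, max_pct)  # assumed uniform; the real shape may differ
    sign = rng.choice((-1.0, 1.0))             # the direction of the distortion is random
    return 1.0 + sign * magnitude

def distort_series(series, factor):
    """Apply one fixed factor to every period, so period-to-period ratios are unchanged."""
    return [value * factor for value in series]

rng = random.Random(2006)
employment = [120.0, 125.0, 131.0, 128.0]          # hypothetical quarterly employment
factor = draw_distortion_factor(rng, 0.05, 0.15)   # hypothetical minimal and maximal bounds
protected = distort_series(employment, factor)
```

Because the same factor multiplies every quarter, the level in any one quarter is distorted while growth rates within the series are preserved, which is one way to read "distorted in the cross-section but not the time-series".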
20 Theoretical Distribution of the QWI Distortion Factor
21 Theoretical Distribution of the QWI Distortion Factor
22 Actual Confidentiality Protection Distortion: Employment, Beginning-of-Quarter
23 Table 8: Distribution of Error in First Order Serial Correlation
24 Graph: Distribution of Error in First Order Serial Correlation
25 Enhancements
  The current product has suppressions for cells too small to protect by noise infusion
  The enhanced product replaces these suppressions with synthetic data
26 Percentage of Data Items in County Level Release File
27 Beginning of Period Employment in NAICS Sector 62
29 The Census Bureau’s First Public Use Synthetic Data Application
  LEHD On-the-map application
  Shows commuting patterns at the Census Block level with characteristics of the origin and destination block groups
  Origin block data are synthetic
  Sampled from the posterior predictive distribution of origin blocks and origin characteristics given the destination block and destination block characteristics
30 Where people living in the selected area (Mobile’s neighboring communities of Daphne and Fairhope) work
  DRAFT – Beta Test Document Only
  Source: “On the Map” beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau, September 23, 2005
31 Where people working in the selected area (downtown Mobile) live
  DRAFT – Beta Test Document Only
  Source: “On the Map” beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau, September 23, 2005
32 Synthetic Data Model
  y_ijk are the counts for residence block i, work place block j and characteristics k
  Characteristics are age groups, earnings groups, industry (NAICS sector), ownership sector
33 Complications
  Informative prior “shape”
  Prior “sample size”
  Work place counts must be compatible with the protection system used by Quarterly Workforce Indicators (QWI)
  Dynamically consistent noise infusion
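One way to picture a posterior predictive draw of origin blocks with a prior "shape" and a prior "sample size" is a Dirichlet-multinomial model, sketched below. The transcript does not reproduce the model equation, so the Dirichlet-multinomial form, the function name, and all numbers here are assumptions for illustration only.

```python
import random

def sample_synthetic_origins(rng, origin_counts, prior_shape, prior_size, n_workers):
    """Draw synthetic residence-block counts for one work-place block.

    origin_counts: observed counts by residence block (confidential).
    prior_shape:   prior probabilities over residence blocks (positive, sums to 1).
    prior_size:    prior "sample size" that weights the prior shape.
    """
    # Posterior Dirichlet parameters: prior pseudo-counts plus observed counts.
    alpha = [prior_size * s + c for s, c in zip(prior_shape, origin_counts)]
    # A Dirichlet draw is a vector of independent gamma draws, normalized.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Multinomial draw of synthetic origins from the sampled probabilities.
    draws = rng.choices(range(len(probs)), weights=probs, k=n_workers)
    return [draws.count(i) for i in range(len(probs))]

rng = random.Random(32)
observed = [40, 10, 0]     # hypothetical counts for three origin blocks
shape = [0.5, 0.3, 0.2]    # hypothetical prior shape
synthetic = sample_synthetic_origins(rng, observed, shape, prior_size=5.0, n_workers=50)
```

In this sketch a larger prior_size pulls the synthetic counts toward the prior shape and away from the confidential counts, so the two prior settings act as the protection dial.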
41 SIPP-SSA-IRS Public Use File
  Links IRS detailed earnings records and Social Security benefit data to public use SIPP data
  Basic confidential data: SIPP ( , 1996); W-2 earnings data; SSA benefit data
  Gold standard: completely linked, edited version of the data with variables drawn from all of the sources
  Partially-synthetic data: created using the record structure of the existing SIPP panels with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods
42 Multiple Imputation Confidentiality Protection
  Denote confidential data by Y and disclosable data by X.
  Both Y and X may contain missing data, so that Y = (Y_obs, Y_mis) and X = (X_obs, X_mis).
  Assume database can be represented by joint density p(Y, X, θ).
43 Sequential Regression Multivariate Imputation Method
  Synthetic data values Y are draws from the posterior predictive density.
  In practice, use a two-step procedure:
    1) draw m completed datasets using SRMI (imputes values for all missing data)
    2) draw r synthetic datasets for each completed dataset from the predictive density given the completed data
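The two-step procedure can be sketched for a single variable as below. This is a simplified stand-in, assuming one outcome y, one predictor x, and a normal linear regression with a flat prior. A real SRMI run cycles through many variables with models suited to each. All names and data here are hypothetical.

```python
import math
import random

def draw_regression_params(rng, xs, ys):
    """Draw (intercept, slope, sigma) from the flat-prior posterior of y = a + b*x + e."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    sse = sum((y - ybar - slope_hat * (x - xbar)) ** 2 for x, y in zip(xs, ys))
    # sigma^2 given the data is SSE divided by a chi-square(n - 2) draw.
    chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)
    sigma = math.sqrt(sse / chi2)
    # With x centered, the level and the slope are independent given sigma.
    level = rng.gauss(ybar, sigma / math.sqrt(n))
    slope = rng.gauss(slope_hat, sigma / math.sqrt(sxx))
    return level - slope * xbar, slope, sigma

def predict(rng, params, x):
    """Draw one value of y at x from the predictive distribution."""
    a, b, sigma = params
    return a + b * x + rng.gauss(0.0, sigma)

def two_step_synthesis(rng, xs, ys, m, r):
    """Return m*r synthetic copies of ys; None marks a missing value."""
    obs_x = [x for x, y in zip(xs, ys) if y is not None]
    obs_y = [y for y in ys if y is not None]
    synthetic_sets = []
    for _ in range(m):
        # Step 1: complete the data by imputing the missing values.
        params = draw_regression_params(rng, obs_x, obs_y)
        completed = [y if y is not None else predict(rng, params, x)
                     for x, y in zip(xs, ys)]
        for _ in range(r):
            # Step 2: replace every value with a draw given the completed data.
            params = draw_regression_params(rng, xs, completed)
            synthetic_sets.append([predict(rng, params, x) for x in xs])
    return synthetic_sets

rng = random.Random(43)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 2.3, None, 3.9, 5.2, None]   # hypothetical data with two missing values
implicates = two_step_synthesis(rng, xs, ys, m=4, r=2)
```

Each of the m completed datasets yields r synthetic datasets, so the call above produces eight implicates in which no value of y is an original response.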
44 Confidentiality Protection
  Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based.
  This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF
  Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability
45 Disclosure Analysis
  Uses probabilistic record linking
  Each synthetic implicate is matched to the gold standard
  All unsynthesized variables are used as blocking variables
  Different matching variable sets are used
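The matching exercise above can be illustrated with a much simpler rule than probabilistic record linking. The sketch below blocks on an unsynthesized variable and then links each synthetic record to the closest gold-standard record on the matching variables. The nearest-record rule replaces probabilistic match weights, and the record layout, variable names, and values are all hypothetical.

```python
def link_records(synthetic, gold, blocking_keys, matching_keys):
    """Link each synthetic record to the closest gold-standard record in its block.

    Records are dicts with an "id" key. Returns the share of links whose ids agree,
    which is the true match rate among linked records.
    """
    true_matches = 0
    linked = 0
    for syn in synthetic:
        # Only gold-standard records that agree on every blocking variable compete.
        block = [g for g in gold if all(g[k] == syn[k] for k in blocking_keys)]
        if not block:
            continue
        best = min(block, key=lambda g: sum((g[k] - syn[k]) ** 2 for k in matching_keys))
        linked += 1
        true_matches += best["id"] == syn["id"]
    return true_matches / linked if linked else 0.0

gold = [
    {"id": 1, "sex": "F", "earnings": 30.0, "benefit": 5.0},
    {"id": 2, "sex": "F", "earnings": 52.0, "benefit": 9.0},
    {"id": 3, "sex": "M", "earnings": 41.0, "benefit": 7.0},
]
synthetic = [
    {"id": 1, "sex": "F", "earnings": 50.0, "benefit": 8.5},
    {"id": 2, "sex": "F", "earnings": 33.0, "benefit": 5.5},
    {"id": 3, "sex": "M", "earnings": 45.0, "benefit": 6.0},
]
rate = link_records(synthetic, gold, ["sex"], ["earnings", "benefit"])
```

Rerunning the function with different matching_keys mirrors the slide's point that several matching variable sets are tried. The resulting true match rate can then be compared with the rate a guess within the block would achieve.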
47 Testing Analytic Validity
  Run analyses on each synthetic implicate.
    Average coefficients
    Combine standard errors using formulae that take account of the average variance of the estimates (within-implicate variance) and the variation in the estimates across implicates (between-implicate variance).
  Run analyses on gold standard data.
  Compare average synthetic coefficient and standard error to the same quantities for the gold standard.
  Analytic validity is measured by the overlap in the coverage of the synthetic and gold standard confidence intervals for a parameter.
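A minimal sketch of the combining and comparison steps above follows. It assumes the single-level rule for partially synthetic data, total variance = average within-implicate variance + between-implicate variance / m. The design with m completed and r synthetic datasets needs a further variance term that is not shown here. The overlap measure, the function names, and the numbers are illustrative assumptions.

```python
import math
import statistics

def combine_implicates(estimates, variances):
    """Combine one coefficient across synthetic implicates.

    Assumes the partially synthetic rule T = u_bar + b / m.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # average coefficient
    u_bar = statistics.fmean(variances)   # within-implicate variance
    b = statistics.variance(estimates)    # between-implicate variance
    return q_bar, math.sqrt(u_bar + b / m)

def interval_overlap(ci_a, ci_b):
    """Average share of each interval covered by their intersection (0 to 1)."""
    lo = max(ci_a[0], ci_b[0])
    hi = min(ci_a[1], ci_b[1])
    width = max(0.0, hi - lo)
    return 0.5 * (width / (ci_a[1] - ci_a[0]) + width / (ci_b[1] - ci_b[0]))

# Hypothetical coefficient estimates and squared standard errors from four implicates.
coef, se = combine_implicates([0.52, 0.48, 0.50, 0.50], [0.01, 0.01, 0.01, 0.01])
synthetic_ci = (coef - 1.96 * se, coef + 1.96 * se)   # normal approximation
gold_ci = (0.30, 0.70)                                # hypothetical gold-standard interval
overlap = interval_overlap(synthetic_ci, gold_ci)
```

An overlap near 1 says the synthetic and gold-standard intervals cover nearly the same range. An overlap of 0 says they are disjoint.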
50 Tools
  NSF sponsored supercomputer
  Virtual RDC
  Cornell INFO 747
51 The NSF-sponsored Supercomputer on the RDC Network
  NSF01 is a 64-processor (384GB memory) supercomputer
  Installed and optimized for complex data synthesizing and simulation
  Projects related to the ITR grant have access and priority
52 The Virtual RDC
  Virtual RDC (news server)
  The virtual RDC environment contains multiple servers that closely approximate an RDC compute server (e.g., NSF01)
  Disclosure-proofed metadata and synthetic data
  Now fully operational
  Any current or potential RDC user can have an account
53 Cornell Information Science 747
  Course available to any potential RDC user, on DVD and via internet feed
  Training for using RDC-based data products
  Training for creating and testing synthetic data
54 Conclusions
  An important and challenging area that social scientists must be part of
  Use of confidential data collected by a public agency carries with it an obligation to disseminate enough data to permit scientific discourse
  Synthetic data is an important tool for this dissemination