Alternative Approaches to Data Dissemination and Data Sharing Jerome Reiter Duke University
Two general settings: (1) An agency seeks to release confidential data to the public. (2) Multiple agencies seek to improve analyses by sharing their confidential data. For both settings, agencies seek strategies that (i) do not reveal identities or sensitive attributes, (ii) are useful for a wide range of analyses, and (iii) are easy for analysts and agencies to use.
Some alternative approaches: remote access servers; synthetic (i.e., simulated) data; secure computation techniques.
Definition of servers: A server is any system that (i) allows users to submit queries for output from statistical analyses of microdata, but (ii) does not give direct access to the microdata. Two types: table servers and model servers.
Queries and responses. Queries to a model server: users request results from fitting a statistical model to the data. Responses from a model server: for an answerable query, model output; for an unanswerable query, no results. Model output should also include diagnostics.
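The query/response pattern above can be illustrated with a toy model server. This is only a sketch under stated assumptions: the class `ModelServer`, the method `fit_line`, the threshold `MIN_SUBSET_SIZE`, and the rule "refuse when the requested subset is too small" are all hypothetical and are not taken from the talk. The example data are made up.

```python
import statistics

MIN_SUBSET_SIZE = 30  # hypothetical disclosure threshold, not from the talk


class ModelServer:
    """Toy model server: fits models on request, never returns microdata."""

    def __init__(self, microdata):
        self._rows = list(microdata)  # list of dicts, kept private

    def fit_line(self, response, predictor, subset=None):
        """Fit response = a + b * predictor by least squares.

        subset is an optional (variable, low, high) range filter.
        Returns a dict of model output, or None for an unanswerable query.
        """
        rows = self._rows
        if subset is not None:
            var, low, high = subset
            rows = [r for r in rows if low <= r[var] <= high]
        if len(rows) < MIN_SUBSET_SIZE:
            return None  # unanswerable: subset too small
        xs = [r[predictor] for r in rows]
        ys = [r[response] for r in rows]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:
            return None  # unanswerable: predictor is constant
        b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        a = my - b * mx
        ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
        ss_tot = sum((y - my) ** 2 for y in ys)
        # R-squared is returned as a simple diagnostic alongside the estimates.
        r2 = 1 - ss_res / ss_tot if ss_tot > 0 else None
        return {"n": len(rows), "intercept": a, "slope": b, "r_squared": r2}


# Made-up microdata for illustration only.
data = [{"age": a, "income": 1000 + 50 * a + 10 * (a % 7)} for a in range(20, 120)]
server = ModelServer(data)
answer = server.fit_line("income", "age")                            # answerable
refused = server.fit_line("income", "age", subset=("age", 20, 25))   # only 6 rows
```

Here `answer` is a dictionary of estimates plus one diagnostic, while `refused` is `None`. A real server would need far more careful rules for deciding which queries are answerable.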
Challenges in developing model servers. Non-statistical: operation costs, server security, etc. Statistical: (i) disclosure risks from smart queries (e.g., subsets, transformations); (ii) inferential disclosure risks; (iii) enabling complex model fitting.
Synthetic data. Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that (i) no unit in the released data has sensitive data from an actual unit in the population, (ii) the released data look like actual data, and (iii) statistical procedures valid for the original data are valid for the released data.
Generating fully synthetic data: (1) Randomly sample new units from the sampling frame. (2) Impute survey variables for the new units using models fit from the observed data. (3) Repeat multiple times and release the datasets.
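The three steps above can be sketched in a few lines. This is a simplified illustration, not the procedure from the cited papers: it assumes one frame covariate `x` and one survey variable `y`, uses a straight-line regression as the imputation model, and plugs in point estimates rather than drawing model parameters from a posterior distribution. The names `fit_line` and `synthesize` and all data are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(e * e for e in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def synthesize(frame_x, observed, n_units, m, rng):
    """Return m synthetic datasets, each a list of n_units (x, y) pairs."""
    xs, ys = zip(*observed)
    a, b, sd = fit_line(xs, ys)                      # model fit from observed data
    datasets = []
    for _ in range(m):                               # step 3: repeat m times
        new_x = rng.sample(frame_x, n_units)         # step 1: new units from frame
        datasets.append(                             # step 2: impute y for them
            [(x, a + b * x + rng.gauss(0, sd)) for x in new_x]
        )
    return datasets


# Made-up frame and survey data for illustration only.
rng = random.Random(0)
frame_x = [rng.uniform(20, 70) for _ in range(1000)]
observed = [(x, 10 + 2 * x + rng.gauss(0, 5)) for x in rng.sample(frame_x, 100)]
released = synthesize(frame_x, observed, n_units=50, m=3, rng=rng)
```

Every `y` value in `released` is simulated, so no released record carries an observed survey value; how well analyses on `released` match analyses on `observed` depends entirely on how good the imputation model is.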
Modification: release partially synthetic data. Little (1993, JOS): create multiple, partially synthetic datasets for public release so that (i) the released data comprise a mix of observed and synthetic values, (ii) the released data look like actual data, and (iii) statistical procedures valid for the original data are valid for the released data.
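A minimal sketch of partial synthesis follows, assuming records of the form `(x, y)` where `y` is the sensitive variable and only the values flagged by a selection rule are replaced. The function `partially_synthesize`, the `at_risk` rule, the straight-line model, and the data are all hypothetical choices made for illustration; as before, the sketch plugs in point estimates instead of drawing parameters from a posterior.

```python
import random
import statistics


def partially_synthesize(records, at_risk, m, rng):
    """Replace y with a model draw for records where at_risk(record) is true.

    records: list of (x, y) pairs. Returns m datasets of the same length.
    """
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in records) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    sd = statistics.pstdev([y - (a + b * x) for x, y in records])
    datasets = []
    for _ in range(m):
        datasets.append([
            # Synthetic value for selected units, observed value otherwise.
            (x, a + b * x + rng.gauss(0, sd)) if at_risk((x, y)) else (x, y)
            for x, y in records
        ])
    return datasets


# Made-up data: x is age, y is income; the top incomes are treated as at risk.
rng = random.Random(1)
records = [(age, 1000 + 40 * age + rng.gauss(0, 200)) for age in range(25, 65)]
cutoff = sorted(y for _, y in records)[-5]
released = partially_synthesize(records, lambda rec: rec[1] >= cutoff, m=3, rng=rng)
```

Each released dataset keeps the observed values for unselected records and carries simulated values only for the five records with the highest incomes.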
Existing applications. Kennickell (1997, Record Linkage Techniques): replace sensitive values for selected units. Liu and Little (2002, JSM Proceedings): replace values of key identifiers for selected units. Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access): replace all values of sensitive variables.
Sample of research agenda Implement and compare various data generation approaches on genuine data in production settings. Evaluate risk/usefulness profile on genuine data in production setting. Develop packaged synthesizers for data disseminators to use.
Secure computations Horizontally Partitioned: Agencies have different records but same variables. Purely Vertically Partitioned: Agencies have same records but different variables. Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.
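The first two partitioning settings can be pictured by splitting one small table in two ways. This is only an illustration: the helper names, the record identifier `id` used to line up vertically partitioned records, and the data are assumptions, not part of the talk.

```python
def horizontal_partition(rows, n_agencies):
    """Each agency gets different records with the same variables."""
    return [rows[i::n_agencies] for i in range(n_agencies)]


def vertical_partition(rows, variable_groups, key="id"):
    """Each agency gets the same records with a different group of variables."""
    return [
        [{key: r[key], **{v: r[v] for v in group}} for r in rows]
        for group in variable_groups
    ]


# Made-up records for illustration only.
rows = [
    {"id": 1, "age": 34, "income": 52000, "diagnosis": "A"},
    {"id": 2, "age": 51, "income": 61000, "diagnosis": "B"},
    {"id": 3, "age": 29, "income": 38000, "diagnosis": "A"},
    {"id": 4, "age": 45, "income": 70000, "diagnosis": "C"},
]
by_rows = horizontal_partition(rows, 2)
by_cols = vertical_partition(rows, [["age"], ["income", "diagnosis"]])
```

In `by_rows` the two agencies hold disjoint records with all variables; in `by_cols` both agencies hold all four records but disjoint variables, linked only by `id`.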
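Horizontal partitioning: secure summation. Goal: obtain the sum of the agencies' values x without sharing any individual value. 1. Agency A adds a random number R, known only to itself, to its x and passes (x + R) to Agency B. 2. Agency B adds its x to this value and passes the sum to Agency C. 3. The process continues until all agencies have added their x. 4. Agency A subtracts R from the final sum, which leaves the total of the agencies' values.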
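The four steps above can be simulated in a single process. This sketch makes assumptions beyond the slide: values are non-negative integers, arithmetic is done modulo a large number so that the masked running total reveals nothing about its size, and every agency follows the protocol without colluding. The function name `secure_sum` and the example values are hypothetical.

```python
import secrets


def secure_sum(values, modulus=2**64):
    """Simulate ring-based secure summation.

    values[0] belongs to Agency A, values[1] to Agency B, and so on.
    Assumes non-negative integers whose true sum is below modulus.
    """
    # Agency A draws a random mask R that only it knows.
    r = secrets.randbelow(modulus)
    # Step 1: Agency A passes (x + R) to Agency B.
    running = (values[0] + r) % modulus
    # Steps 2-3: each remaining agency adds its own x and passes the total on.
    for x in values[1:]:
        running = (running + x) % modulus
    # Step 4: Agency A subtracts R to recover the sum.
    return (running - r) % modulus


agency_values = [120, 45, 310]  # made-up values held by Agencies A, B, C
total = secure_sum(agency_values)
print(total)  # 475
```

Each agency after A sees only a masked running total, never another agency's individual value. In a real deployment the messages would travel between separate parties rather than through a loop.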
Purely vertical partitioning. Secure dot/matrix product: (i) shares dot/matrix products without sharing data; (ii) allows regressions, clustering, and classification; (iii) assumes semi-honest agencies. Synthetic data approaches: (i) share synthetic copies of data across agencies; (ii) allow any analysis when the distributions used to generate the data are accurate; (iii) generate a public use data file.
A research agenda for secure computation methods: - How to specify models without viewing data? - What if sophisticated models are needed? - How to incorporate matching errors and differences in data quality and definitions? - How to account for disclosure risks from models that fit too well?
Some references. Remote access servers: Rowland (2003, NAS Panel on Data Access); Gomatam, Karr, Reiter, and Sanil (2005, Stat. Science). Synthetic data: Raghunathan, Reiter, and Rubin (2003, JOS); Reiter (2003, Surv. Meth.; 2005, JRSSA). Secure computation: Benaloh (1987, CRYPTO '86); Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.).