John M. Abowd Cornell University and Census Bureau

Presentation transcript:

Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods John M. Abowd Cornell University and Census Bureau January 30, 2006 UCLA Institute for Digital Research and Education Presentation

Acknowledgements Many current and past LEHD staff and senior research fellows contributed to the development of the LEHD infrastructure system and the Quarterly Workforce Indicators. Kevin McKinney, Bryce Stephens and Lars Vilhuber were particularly responsible for the confidentiality protection system. Fredrik Andersson and Marc Roemer at LEHD did the data analysis and implementation of the On the Map package. John Carpenter of Excensus, Inc. developed the mapping application. Gary Benedetto, Lisa Dragoset, Martha Stinson and Bryan Ricchetti did the synthesis programming for the SIPP-PUF application.

Overview: What is the problem? What are synthetic data? How can the research community benefit from synthetic data? The NSF-ITR synthetic data grant. The Census Bureau’s synthetic data and related products: QWI Online, On the Map, and the new SIPP-SSA-IRS Public Use File. Tools.

Information Release and Data Protection are Competing Objectives: Statisticians call this the risk-utility trade-off. Economists prefer to distinguish between technological trade-offs and preference trade-offs. Information release and data protection are technological trade-offs.

A Simple Example of the Technological Trade-off: The two outputs are information released and data protection. Consider a census with sampling as the release technology. The production possibility frontier (PPF) measures the amount of information that must be sacrificed to get additional protection. The information measure is Shannon’s H (or the Kullback-Leibler divergence between the census and the sample). The protection measure is the maximum probability of an exact disclosure.
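As a rough illustration of the information measure named on this slide, the sketch below computes Shannon's H and the Kullback-Leibler divergence for a small discrete distribution. The category shares for the "census" and the "sample" are made-up numbers, and the function names are hypothetical; the slide does not specify how the measures were computed in practice.

```python
import math

def shannon_entropy(probs):
    """Shannon's H, in bits, for a discrete distribution given as shares."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical category shares in the full census.
census = [0.5, 0.3, 0.2]
# Hypothetical shares observed in a small sample released from that census.
sample = [0.6, 0.3, 0.1]

h_census = shannon_entropy(census)
# Information lost by releasing the sample instead of the census,
# measured as the divergence of the sample shares from the census shares.
information_loss = kl_divergence(census, sample)
```

A smaller sample generally gives more protection and a larger divergence from the census, which is the trade-off the PPF traces out.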

What Are Synthetic Data? Public use micro data products that reproduce essential features of confidential micro data products. Essential features include univariate distributions, overall and in subpopulations, and multivariate relations among the variables.

Some History: The original fully synthetic data idea was due to Rubin (JOS, 1993): synthesize the Decennial Census long form responses for the short form households, then release samples that do not include any actual long form records. The original partially synthetic data idea was due to Little (JOS, 1993): synthesize the sensitive values on the public use file. A critical refinement (Fienberg, 1994): use a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling. Other authors, particularly Raghunathan, Reiter, Rubin, Abowd, and Woodcock: partially synthetic data with missing data (Reiter); Sequential Regression Multivariate Imputation (Raghunathan, Reiter, and Rubin; Abowd and Woodcock).

How Can You Preserve Confidentiality and Multivariate Relations? The fundamental trade-off is better protection v. better data quality. Protection results from summarizing the data with a complicated multivariate distribution, then sampling that distribution instead of the original data. The synthetic data are not any respondent’s actual data. But, for some techniques, it may still be possible to re-identify the source record in the confidential data. New techniques address this problem.

How Can the Research Community Benefit from Synthetic Data? Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity. Many more users will have access to the information because there is a public use micro data product.

The Research – Synthetic Data Feedback Cycle: Confidentiality Protection, Scientific Modeling, Data Synthesis, Analytic Validity.

The Multi-layer System: Basic confidential data: the fundamental product of virtually all Census programs; leads to the publication of public-use products (summary data, micro data, narrative data). Gold-standard confidential data: edited, documented and archived research versions of confidential data; used in internal Census research and at Research Data Centers.

More Layers: Partially-synthetic micro data: preserves the record structure or sampling frame of the gold standard micro data; replaces the data elements with synthetic values sampled from an appropriate probability model. Fully-synthetic micro data: uses only the population or record linkage structure of the gold standard micro data; generates synthetic entities and data elements from appropriate probability models.

The NSF Information Technologies Research Grant: A program that encourages innovative, high-payoff IT research and education. Our grant proposal cited the many research studies and data products created by previous NSF support for the Research Data Center network and the Longitudinal Employer-Household Dynamics Program.

What Is It? A $2.9 million, 3-year grant to the RDC network (Cornell is the coordinating institution). It provides core support for scientific activities at the RDCs: to develop public use, analytically valid synthetic data from many of the RDC-accessible data sets, and to facilitate collaboration with RDC projects that help design and test these products.

The Quarterly Workforce Indicators: QWI was the LEHD Program’s first public use data product. QWI Online provides detailed labor force information by sub-state geography, detailed industry, ownership class, sex and age group.

The Confidentiality Protection System: All QWI protections are done by noise infusion of the micro-data. All micro-data items are distorted by at least a minimal percentage and at most a maximal percentage. Only the distorted items are used in the production of the release product.
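The noise infusion idea can be sketched as a multiplicative factor whose distance from 1 is bounded below and above. In the sketch below the bounds `a` and `b`, the uniform draw, the establishment names, and both function names are assumptions for illustration only; the slides do not give the actual bounds or the shape of the distortion distribution.

```python
import random

def draw_distortion_factor(rng, a=0.02, b=0.10):
    """Draw a multiplicative factor delta with a <= |delta - 1| <= b.

    a and b are hypothetical minimum and maximum distortion shares.
    A uniform magnitude is used here purely for simplicity.
    """
    magnitude = rng.uniform(a, b)
    sign = rng.choice((-1, 1))
    return 1.0 + sign * magnitude

def infuse_noise(values_by_establishment, rng, a=0.02, b=0.10):
    """Assign one factor per establishment and distort its value with it."""
    factors = {est: draw_distortion_factor(rng, a, b)
               for est in values_by_establishment}
    distorted = {est: value * factors[est]
                 for est, value in values_by_establishment.items()}
    return distorted, factors

rng = random.Random(0)
# Hypothetical establishment-level employment counts.
employment = {"est_A": 120, "est_B": 8, "est_C": 45}
distorted, factors = infuse_noise(employment, rng)
```

Because no factor can fall inside the band (1 - a, 1 + a), every item is moved by at least the minimal percentage, and only the values in `distorted` would feed the released tabulations.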

Protection and Validity Principles: Cells with few businesses contributing or with few individuals contributing have been distorted in the cross-section but not the time-series. Bias in the cross-section is controlled and random; no analyst knows its sign.

Theoretical Distribution of the QWI Distortion Factor

Actual Confidentiality Protection Distortion: Employment, Beginning-of-Quarter

Table 8: Distribution of Error in First Order Serial Correlation

Graph: Distribution of Error in First Order Serial Correlation

Enhancements: The current product has suppressions for cells too small to protect by noise infusion. The enhanced product replaces these suppressions with synthetic data.

Percentage of Data Items in County Level Release File

Beginning of Period Employment in NAICS Sector 62

Full Quarter New Hires in NAICS4 3259

The Census Bureau’s First Public Use Synthetic Data Application: The LEHD On-the-map application shows commuting patterns at the Census Block level with characteristics of the origin and destination block groups. Origin block data are synthetic, sampled from the posterior predictive distribution of origin blocks and origin characteristics given the destination block and the destination block characteristics.

Where people living in the selected area (Mobile’s neighboring communities of Daphne and Fairhope) work. DRAFT – Beta Test Document Only. Source: “On the Map” beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau, September 23, 2005.

Where people working in the selected area (downtown Mobile) live. Source: “On the Map” beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau, September 23, 2005.

Synthetic Data Model: y_{ijk} are the counts for residence block i, workplace block j and characteristics k. Characteristics are age groups, earnings groups, industry (NAICS sector), and ownership sector.

Complications: Informative prior “shape”. Prior “sample size”. Workplace counts must be compatible with the protection system used by the Quarterly Workforce Indicators (QWI). Dynamically consistent noise infusion.
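The model equation itself is not part of this transcript. The references to a prior "shape" and a prior "sample size" suggest a Dirichlet-multinomial structure, so the sketch below assumes one: for a single workplace block and characteristic cell, origin-block counts are resampled from the posterior predictive distribution. The function name, the example counts, and the prior values are all hypothetical.

```python
import random

def synthesize_origin_counts(observed_counts, prior_shape, prior_sample_size, rng):
    """Draw synthetic origin-block counts for one workplace block.

    Assumed model (not confirmed by the slides): multinomial counts with a
    Dirichlet prior whose parameters are prior_sample_size * prior_shape.
    prior_shape must be strictly positive and sum to 1.
    """
    # Posterior Dirichlet parameters: prior pseudo-counts plus observed counts.
    alpha = [prior_sample_size * s + y
             for s, y in zip(prior_shape, observed_counts)]
    # A Dirichlet draw is a vector of independent gamma draws, normalized.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Multinomial draw of the same number of workers as in the observed data.
    n_workers = sum(observed_counts)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_workers)
    synthetic = [0] * len(probs)
    for block_index in draws:
        synthetic[block_index] += 1
    return synthetic

rng = random.Random(42)
observed = [12, 0, 3, 5]             # hypothetical counts by residence block
prior_shape = [0.4, 0.3, 0.2, 0.1]   # hypothetical prior shares
synthetic = synthesize_origin_counts(observed, prior_shape, 2.0, rng)
```

A larger prior sample size pulls the synthetic counts toward the prior shape and away from the observed counts, which is one way the protection versus validity balance could be tuned under this assumed model.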

Analytic Validity: Assess the bias. Assess the incremental variation.

Confidentiality Protection The reclassification index is a measure of how many workers were geographically relocated by the synthetic data.

SIPP-SSA-IRS Public Use File: Links IRS detailed earnings records and Social Security benefit data to public use SIPP data. Basic confidential data: SIPP (1990-1993, 1996); W-2 earnings data; SSA benefit data. Gold standard: completely linked, edited version of the data with variables drawn from all of the sources. Partially-synthetic data: created using the record structure of the existing SIPP panels with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods.

Multiple Imputation Confidentiality Protection: Denote confidential data by Y and disclosable data by X. Both Y and X may contain missing data, so that Y = (Y_obs, Y_mis) and X = (X_obs, X_mis). Assume the database can be represented by the joint density p(Y, X, θ).

Sequential Regression Multivariate Imputation Method: Synthetic data values Ỹ are draws from the posterior predictive density. In practice, use a two-step procedure: 1) draw m completed datasets using SRMI (this imputes values for all missing data); 2) draw r synthetic datasets for each completed dataset from the predictive density given the completed data.
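The density displayed on this slide is not reproduced in the transcript. As a hedged sketch in generic notation (the tilde, the superscripts, and the indices ℓ and k are introduced here for illustration and are not taken from the slide), a posterior predictive density and the two-step procedure can be written as:

```latex
% Generic posterior predictive density for the synthetic values
p(\tilde{Y} \mid Y_{obs}, X_{obs})
  = \int p(\tilde{Y} \mid Y_{obs}, X_{obs}, \theta)\,
         p(\theta \mid Y_{obs}, X_{obs})\, d\theta

% Step 1: m completed datasets from the SRMI imputation model
(Y_{mis}^{(\ell)}, X_{mis}^{(\ell)}) \sim p(Y_{mis}, X_{mis} \mid Y_{obs}, X_{obs}),
  \qquad \ell = 1, \dots, m

% Step 2: r synthetic datasets per completed dataset
\tilde{Y}^{(\ell,k)} \sim p(\tilde{Y} \mid Y^{(\ell)}, X^{(\ell)}),
  \qquad k = 1, \dots, r
```

Here Y^{(ℓ)} = (Y_obs, Y_mis^{(ℓ)}) and X^{(ℓ)} = (X_obs, X_mis^{(ℓ)}) denote the ℓ-th completed dataset, so the procedure yields m × r synthetic implicates in total.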

Confidentiality Protection: Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based. This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF. Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability.

Disclosure Analysis: Uses probabilistic record linking. Each synthetic implicate is matched to the gold standard. All unsynthesized variables are used as blocking variables. Different matching variable sets are used.
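The matching exercise can be illustrated with a much simpler stand-in than probabilistic record linking: within each block defined by the unsynthesized variables, link every synthetic record to the nearest gold-standard record on the matching variables, then count how often the link is correct. The nearest-neighbor rule, the field names, and the records below are all assumptions for illustration; they are not the linking software or the variables used for the SIPP file.

```python
def link_records(synthetic, gold, block_keys, match_keys):
    """Link each synthetic record to its closest gold-standard record.

    Candidates must agree exactly on block_keys; closeness is squared
    Euclidean distance on match_keys (a simplification of probabilistic
    linking). Returns {source_id of synthetic record: id of linked record}.
    """
    links = {}
    for s in synthetic:
        candidates = [g for g in gold
                      if all(g[k] == s[k] for k in block_keys)]
        if not candidates:
            continue
        best = min(candidates,
                   key=lambda g: sum((g[k] - s[k]) ** 2 for k in match_keys))
        links[s["source_id"]] = best["id"]
    return links

def true_match_rate(links):
    """Share of linked synthetic records matched to their own source record."""
    if not links:
        return 0.0
    return sum(1 for src, linked in links.items() if src == linked) / len(links)

# Hypothetical gold-standard records (earnings and benefit in thousands).
gold = [
    {"id": 1, "sex": "F", "cohort": 1950, "earnings": 30.0, "benefit": 9.0},
    {"id": 2, "sex": "F", "cohort": 1950, "earnings": 52.0, "benefit": 12.0},
    {"id": 3, "sex": "M", "cohort": 1950, "earnings": 41.0, "benefit": 10.0},
]
# Hypothetical synthetic implicate; source_id is known only to the evaluator.
synthetic = [
    {"source_id": 1, "sex": "F", "cohort": 1950, "earnings": 50.0, "benefit": 11.5},
    {"source_id": 2, "sex": "F", "cohort": 1950, "earnings": 49.0, "benefit": 12.5},
    {"source_id": 3, "sex": "M", "cohort": 1950, "earnings": 38.0, "benefit": 10.5},
]

links = link_records(synthetic, gold, ["sex", "cohort"], ["earnings", "benefit"])
rate = true_match_rate(links)
```

Rerunning the same calculation with different `match_keys` mirrors the slide's point that several matching variable sets are tried.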

Testing Analytic Validity: Run analyses on each synthetic implicate. Average the coefficients. Combine standard errors using formulae that take account of the average variance of the estimates (within-implicate variance) and the variation of the estimates across implicates (between-implicate variance). Run analyses on the gold standard data. Compare the average synthetic coefficient and standard error to the same quantities for the gold standard. Analytic validity is measured by the overlap in the coverage of the synthetic and gold standard confidence intervals for a parameter.
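A minimal sketch of these steps follows, assuming the single-stage combining rule for partially synthetic data (total variance equals the average within-implicate variance plus the between-implicate variance divided by the number of implicates) and a simple interval-overlap measure. The two-stage design described earlier, with m completed datasets and r synthetic datasets each, calls for a nested rule that is not shown here. The function names, the 1.96 normal multiplier, and all numbers are illustrative assumptions.

```python
import math
from statistics import mean, variance

def combine_synthetic(estimates, variances):
    """Combine one coefficient across m synthetic implicates.

    estimates: the coefficient from each implicate.
    variances: its estimated sampling variance from each implicate.
    Returns (average coefficient, combined standard error).
    """
    m = len(estimates)
    q_bar = mean(estimates)         # average coefficient
    b = variance(estimates)         # between-implicate variance (m - 1 divisor)
    u_bar = mean(variances)         # average within-implicate variance
    total_variance = u_bar + b / m  # assumed partially synthetic rule
    return q_bar, math.sqrt(total_variance)

def interval_overlap(ci_syn, ci_gold):
    """Average share of each interval covered by their intersection (0 to 1)."""
    lower = max(ci_syn[0], ci_gold[0])
    upper = min(ci_syn[1], ci_gold[1])
    if upper <= lower:
        return 0.0
    width = upper - lower
    return 0.5 * (width / (ci_syn[1] - ci_syn[0])
                  + width / (ci_gold[1] - ci_gold[0]))

# Hypothetical coefficient estimates and variances from four implicates.
estimates = [0.10, 0.12, 0.11, 0.13]
variances = [0.0004, 0.0004, 0.0004, 0.0004]
q_bar, se = combine_synthetic(estimates, variances)

# Normal-approximation 95% intervals; the gold-standard values are made up.
ci_syn = (q_bar - 1.96 * se, q_bar + 1.96 * se)
ci_gold = (0.12 - 1.96 * 0.02, 0.12 + 1.96 * 0.02)
overlap = interval_overlap(ci_syn, ci_gold)
```

An overlap near 1 means the synthetic and gold-standard intervals nearly coincide; an overlap of 0 means they are disjoint.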

Log Annual Earnings Amount

Log Annual Benefit Amount

Tools: NSF-sponsored supercomputer. Virtual RDC. Cornell INFO 747.

The NSF-sponsored Supercomputer on the RDC Network: NSF01 is a 64-processor (384GB memory) supercomputer, installed and optimized for complex data synthesizing and simulation. Projects related to the ITR grant have access and priority.

The Virtual RDC (news server): The virtual RDC environment contains multiple servers that closely approximate an RDC compute server (e.g., NSF01). It holds disclosure-proofed metadata and synthetic data. It is now fully operational. Any current or potential RDC user can have an account.

Cornell Information Science 747: Course available to any potential RDC user, on DVD and via internet feed. Training for using RDC-based data products. Training for creating and testing synthetic data.

Conclusions: An important and challenging area that social scientists must be part of. Use of confidential data collected by a public agency carries with it an obligation to disseminate enough data to permit scientific discourse. Synthetic data are an important tool for this dissemination.