Recent Advances In Confidentiality Protection – Synthetic Data John M. Abowd April 2007.

Presentation transcript:

Recent Advances In Confidentiality Protection – Synthetic Data John M. Abowd April 2007

Overview
The NSF-ITR Grant Synthetic Data Projects
SIPP-SSA-IRS Synthetic Beta file
OnTheMap

NSF-ITR Synthetic Data Projects
Longitudinal Business Database
  – First release data due this week
  – First establishment-level micro-data file of its type
Survey of Income and Program Participation/Social Security Administration/Internal Revenue Service Synthetic Beta File
  – Details later; beta file to be accessible on the Virtual RDC very soon
OnTheMap
  – First release February 2006 (on Census.gov)
  – Second release April 2007 (on Census.gov)
  – All synthetic data (10 implicates on Virtual RDC)
American Community Survey
  – Next official PUMS uses synthetic data as part of the confidentiality protection
LEHD Infrastructure Files
  – Employer, individual, job data all synthetic
Quarterly Workforce Indicators
  – All suppressions and variables related by identities replaced with synthetic values
  – Prototype completed

SIPP-SSA-IRS Synthetic Beta File
Links IRS detailed earnings records and Social Security benefit data to public use SIPP data
Basic confidential data: SIPP ( , 1996); W-2 earnings data; SSA benefit data
Gold standard: completely linked, edited version of the data with variables drawn from all of the sources
Partially-synthetic data: created using the record structure of the existing SIPP panels with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods

Multiple Imputation Confidentiality Protection
Denote confidential data by $Y$ and disclosable data by $X$.
Both $Y$ and $X$ may contain missing data, so that $Y = (Y_{obs}, Y_{mis})$ and $X = (X_{obs}, X_{mis})$.
Assume the database can be represented by the joint density $p(Y, X, \theta)$.

Sequential Regression Multivariate Imputation Method
Synthetic data values $\tilde{Y}$ are draws from the posterior predictive density.
In practice, use a two-step procedure:
1) draw m=4 completed datasets using SRMI (imputes values for all missing data)
2) draw r=4 synthetic datasets for each completed dataset from the predictive density given the completed data
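The posterior predictive density referred to on this slide can be written, in the notation of the previous slide, roughly as follows. This is a reconstruction, not the formula from the slide: it assumes that synthetic values depend on the confidential data only through $X$ and $\theta$, and the tilde on $\tilde{Y}$ is notation introduced here to mark synthetic values.

```latex
p(\tilde{Y} \mid Y_{obs}, X_{obs})
  = \int p(\tilde{Y} \mid X_{obs}, X_{mis}, \theta)\,
         p(X_{mis}, \theta \mid Y_{obs}, X_{obs})\,
         dX_{mis}\, d\theta
```

Under this reading, step 1 of the two-step procedure draws the missing values and step 2 draws $\tilde{Y}$ given each completed dataset, giving m x r = 16 synthetic implicates.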

Confidentiality Protection
Protection is based on the inability of users of the public-use file (PUF) to re-identify the SIPP record upon which the PUF record is based.
This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF.
Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability

Disclosure Analysis
Uses probabilistic record linking and distance-based record linking
Each synthetic implicate is matched to the gold standard
All unsynthesized variables are used as blocking variables
Different sets of unsynthesized variables are used as matching variables in the probabilistic record linking
All synthesized variables are used in the distance-based record linking
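A minimal sketch of the distance-based step, assuming numeric synthesized variables, a single blocking key built from the unsynthesized variables, and Euclidean distance on standardized values. The function names `distance_link` and `true_match_rate` are hypothetical illustrations, not the Census Bureau's matching software, and the sketch assumes row i of the synthetic file was generated from row i of the gold standard.

```python
import numpy as np

def distance_link(gold, synth, block_gold, block_synth):
    """For each synthetic record, return the index of the closest gold-standard
    record inside the same block (-1 if that block has no gold records)."""
    gold = np.asarray(gold, dtype=float)
    synth = np.asarray(synth, dtype=float)
    block_gold = np.asarray(block_gold)
    block_synth = np.asarray(block_synth)
    # Standardize with gold-standard moments so no variable dominates the distance.
    mu = gold.mean(axis=0)
    sd = gold.std(axis=0)
    sd[sd == 0] = 1.0
    g = (gold - mu) / sd
    s = (synth - mu) / sd
    matches = np.full(len(s), -1)
    for i in range(len(s)):
        # Blocking: only gold records that share the blocking key are candidates.
        candidates = np.flatnonzero(block_gold == block_synth[i])
        if candidates.size == 0:
            continue
        d = np.linalg.norm(g[candidates] - s[i], axis=1)
        matches[i] = candidates[np.argmin(d)]
    return matches

def true_match_rate(matches):
    """Share of synthetic records linked back to their own source record."""
    matches = np.asarray(matches)
    return float(np.mean(matches == np.arange(len(matches))))
```

A low true-match rate relative to the false-match rate within a block is what the protection goal asks for; the rate is computed separately for each synthetic implicate.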

Testing Analytic Validity
Verify all marginal distributions
Verify second, third, and fourth order interactions for stratifying variables
Run analyses on each synthetic implicate
  – Average coefficients
  – Combine standard errors using formulae that take account of the average variance of the estimates (within-implicate variance) and the variation in the estimates across implicates (between-implicate variance)
Run analyses on gold standard data
Compare average synthetic coefficient and standard error to the same quantities for the gold standard
Analytic validity is measured by the overlap in the coverage of the synthetic and gold standard confidence intervals for a parameter
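The averaging and combining steps can be sketched as below. This is an illustration under stated assumptions, not the formulae used for the beta file: `combine_synthetic` applies the single-stage combining rule commonly used for partially synthetic data (total variance = average within-implicate variance + between-implicate variance / m), whereas the two-step design described earlier (m completed datasets, r synthetic datasets each) would call for a nested variant. `interval_overlap` is one possible overlap measure (the average share of each interval covered by their intersection, set to zero when they do not intersect); both function names are hypothetical.

```python
import math

def combine_synthetic(estimates, variances):
    """Combine one coefficient across m synthetic implicates.
    estimates: point estimate from each implicate
    variances: squared standard error from each implicate
    Returns (average coefficient, combined standard error)."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two implicates with matching variances")
    q_bar = sum(estimates) / m                              # average coefficient
    u_bar = sum(variances) / m                              # within-implicate variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-implicate variance
    total_var = u_bar + b / m
    return q_bar, math.sqrt(total_var)

def interval_overlap(ci_synth, ci_gold):
    """Average fraction of each confidence interval covered by the intersection."""
    lo = max(ci_synth[0], ci_gold[0])
    hi = min(ci_synth[1], ci_gold[1])
    if hi <= lo:
        return 0.0  # intervals do not intersect
    inter = hi - lo
    return 0.5 * (inter / (ci_synth[1] - ci_synth[0])
                  + inter / (ci_gold[1] - ci_gold[0]))
```

An overlap near 1 means the synthetic and gold standard intervals nearly coincide; values near 0 signal that inferences from the synthetic data would differ materially.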

Log Total Annual Labor Earnings (white males)

Log Total Annual Labor Earnings (black males)

Log Annual Benefit Amount

On The Map
The Census Bureau’s first public-use synthetic data application (publicly released on February 3, 2006)
Developed by the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics Program (LEHD)
Data on commute patterns between Census Blocks and area characteristics

Origin-Destination Database
Home Block (U/I Wage Data)
Workplace Block (ES-202 Data)
# of Workers (primary job) and # of Jobs
Home Profile - block group
  – # workers
  – Worker distribution by age range (up to 30; 31-54; 55+), by monthly earnings range (up to $1,199; $1,200-$3,399; $3,400+), by industry (20 NAICS)
Work Profile - block group
  – # establishments
  – # workers
  – Worker distribution by age range (up to 30; 31-54; 55+), by monthly earnings range (up to $1,199; $1,200-$3,399; $3,400+), by industry (20 NAICS)
  – Demand/growth indicators: Job creation/loss; Hires/separations; Earnings hires/separations

On The Map
Origin block data are synthetic
  – Sampled from the posterior predictive distribution of origin blocks given destination block and worker characteristics
All 10 implicates of the synthetic O/D data are available via the Virtual RDC
Data available within a mapping application online: On The Map

Where Are Workers Residing in Sausalito, CA Employed?

Shed Report

Area Profile Report

Disclosure Protection System
Goal: “to protect confidentiality while preserving analytical validity of data”
  – No cell suppression
  – Synthetic place of residence data conditional on data on workplace and other characteristics
  – Bayesian techniques to estimate the posterior predictive distribution
  – Workplace data protected by QWI confidentiality rules
  – “Noise” in data increases as population in work place cell decreases

Synthetic Data Model
$y_{ijk}$ are the counts for residence block $i$, work place block $j$ and characteristics $k$
Characteristics are age groups, earnings group, industry (NAICS sector), ownership sector

Complications
Informative prior “shape”
Prior “sample size”
Work place counts must be compatible with the protection system used by Quarterly Workforce Indicators (QWI)
  – Dynamically consistent noise infusion

Design of Prior
Unique priors for each of the $J \times K$ cells in the contingency table
Only consider priors that have support across at least 10 residence blocks
Search algorithm
1. Work place tract, age category, earnings category, industry and ownership sector, else
2. Work place tract, age category, earnings category, else
3. Work place county, age category, earnings category, else
4. Work place county
Shape parameters based on observed distribution
Scale parameters are confidential
  – The relative weight of the prior when sampling from the posterior distribution is larger for smaller populations
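The prior search and the posterior sampling can be sketched as follows. The slides do not give the model's functional form, so this sketch assumes a Dirichlet prior over residence blocks with a multinomial likelihood for each work place cell; that choice, the function names, and the scale value used in any call are assumptions for illustration (the actual scale parameters are confidential). Raising an error when no candidate prior has enough support is also an assumption.

```python
import numpy as np

def choose_prior_shape(candidates, min_support=10):
    """candidates: list of (level_name, counts_by_residence_block), ordered from
    the most detailed aggregation to the coarsest. Returns the first level whose
    counts are positive in at least min_support residence blocks."""
    for level, counts in candidates:
        counts = np.asarray(counts, dtype=float)
        if np.count_nonzero(counts) >= min_support:
            return level, counts / counts.sum()
    raise ValueError("no candidate prior has enough support")

def prior_weight(prior_scale, n_workers):
    """Relative weight of the prior in the posterior mean for a cell of n workers."""
    return prior_scale / (prior_scale + n_workers)

def synthesize_origins(observed_counts, prior_shape, prior_scale, rng):
    """Draw synthetic residence-block counts for one work place cell."""
    observed_counts = np.asarray(observed_counts, dtype=float)
    prior_shape = np.asarray(prior_shape, dtype=float)
    # Posterior Dirichlet parameters: prior pseudo-counts plus observed counts.
    alpha = prior_scale * prior_shape + observed_counts
    support = alpha > 0  # blocks with no prior mass and no workers stay at zero
    probs = np.zeros_like(alpha)
    probs[support] = rng.dirichlet(alpha[support])
    # Posterior predictive draw: same number of workers, reallocated across blocks.
    return rng.multinomial(int(observed_counts.sum()), probs)
```

Because the prior pseudo-counts are fixed while the observed counts grow with the cell, `prior_weight` shrinks as the work place population grows, which is one way to read the statement that the prior carries more weight for smaller populations.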

Analytic Validity
Assess the bias
Assess the incremental variation

Confidentiality Protection
The reclassification index (RI) is a measure of how many workers were geographically relocated by the synthetic data
Interpretation: Proportion of workers that need to be reallocated across residence areas in synthetic data in order to replicate confidential data
  – If counts identical in synthetic and confidential data, RI = 0
  – If no overlap, RI = 1
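The transcript does not include the formula for the index. One definition that matches the stated interpretation and both boundary cases is half the sum of absolute differences in residence-area counts, divided by the total number of workers; the sketch below uses that definition as an assumption, and the function name is hypothetical.

```python
def reclassification_index(synthetic, confidential):
    """Share of workers who would have to move between residence areas for the
    synthetic counts to reproduce the confidential counts."""
    if len(synthetic) != len(confidential):
        raise ValueError("both inputs must cover the same residence areas")
    total = sum(confidential)
    if sum(synthetic) != total:
        raise ValueError("synthetic and confidential totals must agree")
    if total == 0:
        return 0.0
    # Each misplaced worker is counted twice (once where missing, once where extra).
    moved = 0.5 * sum(abs(s - c) for s, c in zip(synthetic, confidential))
    return moved / total
```

For example, synthetic counts of 6 and 4 against confidential counts of 4 and 6 require moving 2 of 10 workers, so the index is 0.2.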

In aggregate synthetic data mimic residence patterns in confidential data well

Level of protection increases as population in work block decreases