1 Methods of Confidentiality Protection Arnold P. Reznek U.S. Census Bureau CES Room 2K128F Washington, DC 20233 301-763-1856 Fax 301-763-5935

Slides:



Advertisements
Similar presentations
Alternative Approaches to Data Dissemination and Data Sharing Jerome Reiter Duke University
Advertisements

Issues in Designing a Confidentiality Preserving Model Server by Philip M Steel & Arnold Reznek.
Protecting the Confidentiality of Tables by Adding Noise to the Underlying Microdata Paul Massell and Jeremy Funk Statistical Research Division U.S. Census.
The Microdata Analysis System (MAS): A Tool for Data Dissemination Disclaimer: The views expressed are those of the authors and not necessarily those of.
Statistical Disclosure Control (SDC) at SURS Andreja Smukavec General Methodology and Standards Sector.
Local Employment Dynamics Data: Advanced Topics C2ER Training Workshop June 4, 2012 Stephen Tibbets Erika McEntarfer LEHD Program US Census Bureau.
Confidentiality risks of releasing measures of data quality Jerry Reiter Department of Statistical Science Duke University
© Statistisches Bundesamt, IIA - Mathematisch Statistische Methoden Summary of Topic ii (Tabular Data Protection) Frequency Tables Magnitude Tables Web.
SDC for continuous variables under edit restrictions Natalie Shlomo & Ton de Waal UN/ECE Work Session on Statistical Data Editing, Bonn, September 2006.
What are Wage Records? Wage records are an administrative database used to calculate Unemployment Insurance benefits for employees who have been laid-off.
11 ACS Public Use Microdata Samples of 2005 and 2006 – How to Use the Replicate Weights B. Dale Garrett and Michael Starsinic U.S. Census Bureau AAPOR.
Methods of Geographical Perturbation for Disclosure Control Division of Social Statistics And Department of Geography Caroline Young Supervised jointly.
© John M. Abowd 2005, all rights reserved Recent Advances In Confidentiality Protection John M. Abowd April 2005.
Recent Advances In Confidentiality Protection – Synthetic Data John M. Abowd April 2007.
Census Bureau Employment Data ACS, EC, and LED… And why you should use the data from one program vs. another… SDC/CIC Annual Training Conference Wednesday,
“OnTheMap” The Census Bureau’s New Tool for Residence-Workplace Analysis Fredrik Andersson and Jeremy Wu May 7, 2007 Daytona Beach, FL.
Metadata driven application for aggregation and tabular protection Andreja Smukavec SURS.
Household Surveys ACS – CPS - AHS INFO 7470 / ECON 8500 Warren A. Brown University of Georgia February 22,
Regional Seminar on Census Data Archiving for Africa, Addis Ababa, Ethiopia, September 2011 Overview of Archiving of Microdata Session 4 United Nations.
Microdata Simulation for Confidentiality of Tax Returns Using Quantile Regression and Hot Deck Jennifer Huckett Iowa State University June 20, 2007.
1 Overview of Statistical Disclosure Methodology for Microdata Laura Zayatz Census Bureau BTS Confidentiality Seminar Series, April.
Overview of 2002 CIPSEA: Methods to Protect Confidential Tabular Data Amrut Champaneri, Ph.D. U.S. Department of Transportation Bureau of Transportation.
1 Supplementing ACS: The LEHD Program Jeremy S. Wu Marc Roemer U.S. Census Bureau May 12, 2005 Jeremy S. Wu Marc Roemer U.S. Census Bureau May 12, 2005.
Confidentiality Issues with “Small Cell” Data Michael C. Samuel, DrPH STD Control Branch California Department of Public Health 2008 National STD Prevention.
Local Employment Dynamics (LED) & OnTheMap Nick Beleiciks Oregon Census State Data Center Meeting April 14, 2009.
Disclosure Avoidance: An Overview Irene Wong ACCOLEDS/DLI Training December 8, 2003.
Using IPUMS.org Katie Genadek Minnesota Population Center University of Minnesota The IPUMS projects are funded by the National Science.
Record matching for census purposes in the Netherlands Eric Schulte Nordholt Senior researcher and project leader of the Census Statistics Netherlands.
Discussion of “ Statistical Disclosure Limitation: Releasing Useful Data for Statistical Analysis” Nancy J. Kirkendall Energy Information Administration.
Daniel Beckler United States Department of Agriculture National Agricultural Statistics Service Timothy Mulcahy NORC at the University of Chicago Topic.
Population census micro data for research: the case of Slovenia Danilo Dolenc Statistical Office of the Republic of Slovenia Ljubljana, First Regional.
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
1 New Implementations of Noise for Tabular Magnitude Data, Synthetic Tabular Frequency and Microdata, and a Remote Microdata Analysis System Laura Zayatz.
1 Assessing the Impact of SDC Methods on Census Frequency Tables Natalie Shlomo Southampton Statistical Sciences Research Institute University of Southampton.
Assessing Disclosure for a Longitudinal Linked File Sam Hawala – US Census Bureau November 9 th, 2005.
Innovations in Data Dissemination Thomas L. Mesenbourg, Jr. Acting Director U.S. Census Bureau United Nations Seminar on Innovations in Official Statistics.
The use of protected microdata in tabulation: case of SDC-methods microaggregation and PRAM Researcher Janika Konnu Manchester, United Kingdom December.
Security Control Methods for Statistical Database Li Xiong CS573 Data Privacy and Security.
LEHD and OnTheMap From Jobs to Transportation Matthew Graham Geographer U.S. Census Bureau 1.
WP 19 Assessment of Statistical Disclosure Control Methods for the 2001 UK Census Natalie Shlomo University of Southampton Office for National Statistics.
Michelle Simard Statistics Canada UNECE Worksessions on Statistical Disclosure Control Methods Helsinki, October 2015 Development of rules from administrative.
Disclosure Avoidance at Statistics Canada INFO747 Session on Confidentiality Protection April 19, 2007 Jean-Louis Tambay, Statistics Canada
1 IPAM 2010 Privacy Protection from Sampling and Perturbation in Surveys Natalie Shlomo and Chris Skinner Southampton Statistical Sciences Research Institute.
1 Using Fixed Intervals to Protect Sensitive Cells Instead of Cell Suppression By Steve Cohen and Bogong Li U.S. Bureau of Labor Statistics UNECE/Work.
Disclosure Limitation in Microdata with Multiple Imputation Jerry Reiter Institute of Statistics and Decision Sciences Duke University.
Statistical Methodology for the Automatic Confidentialisation of Remote Servers at the ABS Session 1 UNECE Work Session on Statistical Data Confidentiality.
Protection of frequency tables – current work at Statistics Sweden Karin Andersson Ingegerd Jansson Karin Kraft Joint UNECE/Eurostat.
Creating Open Data whilst maintaining confidentiality Philip Lowthian, Caroline Tudor Office for National Statistics 1.
The views expressed herein are those of the author and should not necessarily be attributed to the IMF, its Executive Board, or its management Data Confidentiality,
Disclosure Analysis: What do RDC Analysts do? Research Data Centre Program, Statistics Canada James Chowhan Ontario DLI Training, Queen's University
Joint UNECE/Eurostat work session on statistical data confidentiality Manchester, December 2007 Dealing with Confidentiality in Dissemination: The.
Estimation of the Probit Model From Anonymized Micro Data Gerd Ronning and Martin Rosemann Universität Tübingen & IAW Tübingen UNECE Work Session on Statistical.
The LEHD Program and Employment Dynamics Estimates Ronald Prevost Director, LEHD Program US Bureau of the Census
Local Employment Dynamics: Partnership, Public-Use Data, and Innovative Web Tools Eric Coyle Data Dissemination Specialist U.S. Census Bureau 1.
INFO 7470 Statistical Tools: Edit and Imputation Examples of Multiple Imputation John M. Abowd and Lars Vilhuber April 18, 2016.
11 Measuring Disclosure Risk and Data Utility for Flexible Table Generators Natalie Shlomo, Laszlo Antal, Mark Elliot University of Manchester
Expanding the Role of Synthetic Data at the U.S. Census Bureau 59 th ISI World Statistics Congress August 28 th, 2013 By Ron S. Jarmin U.S. Census Bureau.
No Free Lunch: Working Within the Tradeoff Between Quality and Privacy
Natalie Shlomo Social Statistics, School of Social Sciences
Data Confidentiality and the Common Good.
Creation of synthetic microdata in 2021 Census Transformation Programme (proof of concept) Robert Rendell.
Progress towards a table builder with in-built disclosure control for 2021 Census Keith Spicer UNECE, 22 September 2017.
Dissemination Workshop for African countries on the Implementation of International Recommendations for Distributive Trade Statistics May 2008,
Treatment of statistical confidentiality Table protection using Excel and tau-Argus Practical course Trainer: Felix Ritchie CONTRACTOR IS ACTING UNDER.
Protecting Confidential Data
Disclosure Avoidance: An Overview
Treatment of statistical confidentiality Part 3: Generalised Output SDC Introductory course Trainer: Felix Ritchie CONTRACTOR IS ACTING UNDER A FRAMEWORK.
Imputation as a Practical Alternative to Data Swapping
Jerome Reiter Department of Statistical Science Duke University
Presentation transcript:

1 Methods of Confidentiality Protection Arnold P. Reznek U.S. Census Bureau CES Room 2K128F Washington, DC Fax April 19, 2007

2 Plan of talk Define Confidentiality, Disclosure, and set the context. Very General principles Tables Microdata Models

3 Confidentiality –Def. 1: Release (“disclose”) data so that Dissemination of data about respondent is not harmful to respondent Immune from legal process –Def. 2: Treat info a respondent provides In a “relationship of trust” So it will only be used in ways consistent with original purpose.

4 Confidentiality and Disclosure Statistical disclosure – when released statistical data (tabular or individual records) reveal confidential information about an individual respondent. Types of Disclosure –Identity – respondent is identified from released file –Attribute - sensitive information about a respondent is revealed –Inferential - released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise

5 Legal Requirements (U.S.) Census Bureau –Title 13, U.S. Code Other U.S. Statistical Agencies – –See Tech. Paper 22 (FCSM 2005) Chap. 3. Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002 Information collected or acquired for exclusively statistical purposes under a pledge of confidentiality

6 The Balancing Act Publish as much valuable statistical information as possible without violating the confidentiality of respondents Preserve data utility while avoiding disclosure Conflicting forces – driven by computer revolution –Increasing disclosure risks from agencies’ traditional data outputs (Sweeney 2001). –Increased value of research with data.

7 General Principles of Disclosure Review and Control Traditional methods –Suppression –Coarsening –Adding noise (explicit or via swapping) Types of data –Business –Household New methods (Abowd lecture) –Partially-synthetic data –Fully-synthetic data

8 Tables of Magnitude Data Disclosure occurs if user can estimate a respondent’s value “too closely.” Risk is at firm (enterprise) level Linear sensitivity measures (Cox 2001) –p%, (n,k), other rules Complementary Suppressions

9 Other cells to “protect” primary suppressions This is Tricky –Example - Jewett (1992) p. I-D-4.Jewett (1992) Cell K is unprotected – can “back out” exact value! TotalCol 1Col 2Col 3Col 4 Total Row A40B Row 2125E20F30 Row CKD Row 4 80G10H20

10 Complementary suppressions (cont’d) Linear Programming-based methods Fischetti-Salazar (2005) advances But – too much info lost! –E.g. ~ 1.5 million cells suppressed in County Business Patterns (Isserman and Westervelt 2006) Community is exploring alternatives

11 Alternatives to Cell Suppression: Adding Noise Adding Noise –Each unit’s data perturbed up or down randomly by number “close to” a (confidential) percentage. –Estabs of a single firm are perturbed in the same direction. –Overall distribution of perturbations symmetric about 1. –Sensitive cells perturbed most. –All cells can be shown.

12 Adding Noise (Cont’d) U.S. Census Bureau using noise in Quarterly Workforce Indicators (QWI) –Longitudinal Employer Household Dynamics (LEHD) program –Local Employment Dynamics (LED) partnership with states –Employment, layoffs, hires, separations, job creations & destructions, earnings, etc.

13 Adding noise (cont’d) SDC for QWI –Permanent multiplicative noise factor for each record –All statistics distorted –Analytically valid statistics – unbiased, time series properties preserved –Some cells still suppressed –Now working on replacing suppressions with synthetic data

14 Adding Noise – (cont’d) Census Bureau considering noise in County Business Patterns, Economic Census Noise addition could affect Census RDCs –Question: should research output use “real” or “noisy” microdata? –Noise addition should decrease inquiries about “tabular” projects.

15 Microdata (in U.S., household) In U.S., almost always household A few typical records: AgeRaceSexMarital S.Occup.State 26BMMarriedDoctorDE 54WFDivorcedTeacherMD 40AMWidowedLawyerFL

16 Practices for Microdata Removal of variables Subsampling Geographic thresholds Rounding Combining categories Changing continuous variables into categorical variables Noise addition

17 Practices for Microdata Thresholds on categories Data swapping Top and bottom coding Synthetic data

18 Typical Demographic Table MarriedDivorcedOtherTotal Total

19 Practices for Frequency Count Data Data swapping Aggregates, means, medians, other quantiles Universe thresholds Traditional rounding Group quarters data Profiles (300 people and check for slivers)

20 SDC for Regressions, and Model Servers SDC for statistical model output becoming more important –RDCs –Model servers

21 SDC in Regressions Some models produce (parts of) tables –Regressions, logits, probits with only fully- interacted right-hand side dummy variables –Correlation and covariance matrices that include dummies –Regressions run on “nested” samples Can reveal means of dependent variable via “subtraction”

22 Linear Regression Models with Only Discrete Regressors: Regressions Can Produce Tables –Example: Model with two dummy vars, with and without interactions, and adding a continuous variable: (CPS data from Berndt (1991) Chap. 5 Lwage = natural log of wage Colplus = 1 for persons with a college degree or above, 0 otherwise fe = 1 if female colfe = colplus*fe, - interaction, i.e., females with a college education. Ex = experience (age – education –6); Exsq = square of experience

23 Linear Regression (cont’d) Results: –(1) Without interaction, dummies not = means –(2) With interaction they do: a = mean (log) income of males without college a+b = mean income of males with college a+c = mean income of females without college a+b+c+d =income of females with college. –(3) With interaction and dummies, not so

24 SDC in Regressions (cont’d) Other Risks in Regression Models (NISS project researchers) –Indicator for single enterprise –Use of transformations to generate outliers that pull regression line close to single observation –Regression diagnostics – synthetic diagnostics Regressions produced by data from several agencies while maintaining confidentiality

25 Model Servers How they work –Users do not have access to microdata –Users specify models remotely –System returns disclosure-free output SDC –All the considerations discussed above under models –System needs to prevent “attacks” Several countries and agencies developing model servers

26 References Cox, L. (2001). “Disclosure Risk for Tabular Economic Data.” Chapter 8 in Doyle et al. (2001). Cox, L. (2002). “Confidentiality Issues For Statistical Database Query Systems.” Invited Paper for Joint UNECE/Eurostat Seminar on Integrated Statistical Information Systems and Related Matters (ISIS 2002). (17-19 April 2002, Geneva, Switzerland). Available at Doyle, P., J.I. Lane, J.J.M. Theeuwes, and L.V. Zayatz (2001). Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Amsterdam: Elsevier Science B.V. Federal Committee on Statistical Methodology (FCSM, 2005). Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology, Washington, DC, U.S. Office of Management and Budget. Revision of 1994 edition; available at Fischetti, M. and J.J. Salazar (2005). “A Unified Mathematical Programming Framework for Different Statistical Disclosure Limitation Methods.” Operations Research, Vol. 53, No. 5, pp Giessing, S. (2001). “Nonperturbative Disclosure Control Methods for Tabular Data.” Chapter 9 in Doyle et al. (2001). Isserman, A.M. and Westervelt, J. (2006). “1.5 Million Missing Numbers: Overcoming Employment Suppression in County Busness Patterns Data.” International Regional Science Review 20, 2: pp Reiter, J. (2003). “Model Diagnostics for Remote Access Regression Servers.” Statistics and Computing, 13, pp

27 References (cont’d) Sweeney, L. (2001). “Information Explosion.” In Doyle et al. (2001), Chapter 3. Willenborg, L. and T. de Waal (2001). Elements of Statistical Disclosure Control. Lecture Notes in Statistics, volume 155. New York: Springer-Verlag. Willenborg, L. and T. de Waal (1996). Statistical Disclosure Control in Practice. Lecture Notes in Statistics, volume 111. New York: Springer-Verlag.