Issues in Designing a Confidentiality Preserving Model Server by Philip M Steel & Arnold Reznek.

Talk Outline
- Background
- Basic design
- Description of operation
- Confidentiality outline
- Constraints on universe formation
- Other constraints
- Summary

Background
- PUBLIC remote access to confidential data
- Restriction of queries and responses rather than registering and monitoring the user
- Current Population Survey (CPS): employment and economic well-being; demographic supplement
- Software development by Synectics
- HTML, MySQL, PHP to develop the query …
- SAS as the statistical package run against the data

Risk Model for Microdata
- Intruder has access to record linkage software and identified data sources
- Disclosure occurs if the intruder is successful in linking his identified data with the published microdata

Risk Model for a Model Server
- Intruder has access to record linkage software and identified data sources
- Intruder uses the model server to reconstruct microdata for both the variables overlapping his data sources and a sensitive variable
- Disclosure occurs if the intruder is successful in linking his identified data with the reconstructed microdata and has a valid estimate of a sensitive characteristic or value

Basic Design Choice
- Enable: Choose which functions will operate
  - Must construct a friendly interface
  - Limited to the procedures developed
  - Safe from unknown code
- Disable: Choose which functions will not operate
  - User free to program within disabling constraints
  - No limit on complexity
  - Must be monitored (human, program or mix)

Operation
- User visits web site, chooses data set, explores data, chooses geography, analysis type
- User chooses population, constructs model, selects output
- Web site constructs code to send behind firewall
- Code checked and run against data at Census
- Results checked and returned to user

Structure of Confidentiality Rules
- Data preparation
- Data exploration
- Model universe formation
- Model statement
- Model output

Data exploration rules
- Users may request tables for categorical variables and numeric recodes up to e1 dimensions (start e1=4, including geography).
- Users may transform numeric recodes using a limited set of functions: log, root, square.
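The two exploration rules can be read as simple request checks. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, "root" is taken to mean square root, and e1 is counted over distinct variables with geography included as one of them. None of this is confirmed by the slides beyond the limits themselves.

```python
import math

E1 = 4  # starting value of e1 from the slide; the count includes geography

# The three transformations named on the slide. Reading "root" as the
# square root is an assumption.
ALLOWED_TRANSFORMS = {
    "log": math.log,
    "root": math.sqrt,
    "square": lambda x: x * x,
}


def table_request_allowed(variables, e1=E1):
    """Return True if a table request uses between 1 and e1 distinct variables."""
    return 0 < len(set(variables)) <= e1


def apply_transform(name, value):
    """Apply one of the permitted transforms and reject anything else."""
    if name not in ALLOWED_TRANSFORMS:
        raise ValueError(f"transform {name!r} is not permitted")
    return ALLOWED_TRANSFORMS[name](value)
```

A four-way request such as state by sex by race by education passes, while adding a fifth variable fails; a request for an unlisted transform such as "exp" raises an error instead of being run.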

Universe formation: Categorical Variables
- Example: Hispanic heads of household with a college degree.
- Conditions: X1=H, X2=1, X3=5 (table cell)
- Implication: Data preparation must support safe lower dimensional tables

Universe formation rules: Categorical Variables
- Limit on the number of categorical variables (u1=3)
- Minimum on the size of the universe selected (u2=75)
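As an illustration of how the u1 and u2 limits might be applied to a requested universe, here is a minimal sketch. It assumes records are plain dictionaries and that a universe is defined by equality conditions on categorical variables; the function name and the variable names in the usage note are hypothetical.

```python
U1 = 3   # limit on the number of categorical variables (from the slide)
U2 = 75  # minimum size of the selected universe (from the slide)


def universe_allowed(records, conditions, u1=U1, u2=U2):
    """Check a categorical universe request against the u1 and u2 limits.

    records    -- list of dicts, one per observation (assumed layout)
    conditions -- dict mapping variable name to the required category
    """
    if len(conditions) > u1:
        return False
    size = sum(
        1
        for record in records
        if all(record.get(var) == value for var, value in conditions.items())
    )
    return size >= u2
```

For the example universe on the previous slide, the request would be something like `universe_allowed(records, {"hispanic": "H", "head": 1, "educ": 5})`, which uses three variables and is accepted only if at least 75 records match.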

Universe Formation: Numeric Variables
- Example: Families in poverty
- Condition: Family income<18,500. Or Family income<18,501?
- Implication: Rounding or pre-assigned cutpoints.
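The reason the two nearly identical thresholds matter is that a user who can request both universes can difference them and isolate the records in between. The sketch below uses made-up income values and a hypothetical `snap_to_cutpoint` helper; snapping a free threshold down to the nearest pre-assigned cutpoint is one possible reading of the "pre-assigned cutpoints" remedy, not a description of the actual system.

```python
import bisect


def universe(incomes, threshold):
    """Select the observations with income strictly below the threshold."""
    return [x for x in incomes if x < threshold]


def snap_to_cutpoint(threshold, cutpoints):
    """Map a threshold to the largest pre-assigned cutpoint not above it.

    cutpoints must be sorted in increasing order (assumed).
    """
    i = bisect.bisect_right(cutpoints, threshold) - 1
    if i < 0:
        raise ValueError("threshold lies below the lowest cutpoint")
    return cutpoints[i]


# Made-up incomes, for illustration only.
incomes = [12_000, 15_250, 18_500, 21_000]

# Free thresholds: the two universes differ by exactly one record.
small = universe(incomes, 18_500)
large = universe(incomes, 18_501)
isolated = [x for x in large if x not in small]

# Pre-assigned cutpoints: both requests collapse to the same universe.
cutpoints = [15_000, 20_000, 25_000]
same_universe = universe(incomes, snap_to_cutpoint(18_500, cutpoints)) == universe(
    incomes, snap_to_cutpoint(18_501, cutpoints)
)
```

Here `isolated` contains the single family with income 18,500, while `same_universe` is true because both thresholds map to the 15,000 cutpoint.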

Universe formation rules: Numeric variables
- Users will select categorical variables first.
- Numeric variables can be used only at pre-assigned cutpoints.
- The number of observations in the whole CPS universe between cutpoints shall be at least u3 for every numeric variable. (start u3=80)

Universe formation rules (cont)
- If a cutpoint is used in universe formation then the difference in the size of the model universe obtained by incrementing the cutpoint up or down cannot be less than u4. (start u4=4)
- The universe for the model must have at least u2 observations. (start u2=75)
- There will be no cutpoints above the 97th percentile of nonzero points or the last half percentile of all points.
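The u3 and u4 rules can be sketched as two counting checks. This is an illustration under assumptions: intervals between cutpoints are taken as half-open [lower, upper), a numeric universe condition is taken to be "value below the cutpoint", and "incrementing the cutpoint up or down" is read as moving to the adjacent cutpoint. The function names are hypothetical, and the percentile rule is not covered here.

```python
import bisect


def interval_counts(values, cutpoints):
    """Count observations in each half-open interval between sorted cutpoints."""
    ordered = sorted(values)
    return [
        bisect.bisect_left(ordered, upper) - bisect.bisect_left(ordered, lower)
        for lower, upper in zip(cutpoints, cutpoints[1:])
    ]


def cutpoints_satisfy_u3(values, cutpoints, u3=80):
    """u3 rule: every interval in the whole universe holds at least u3 observations."""
    return all(count >= u3 for count in interval_counts(values, cutpoints))


def cutpoint_satisfies_u4(universe_values, cutpoints, index, u4=4):
    """u4 rule: moving to an adjacent cutpoint changes the universe size by >= u4.

    universe_values holds the numeric variable for records already selected
    by the categorical conditions (assumed).
    """
    ordered = sorted(universe_values)

    def size_below(cutpoint):
        return bisect.bisect_left(ordered, cutpoint)

    here = size_below(cutpoints[index])
    neighbours = [i for i in (index - 1, index + 1) if 0 <= i < len(cutpoints)]
    return all(abs(size_below(cutpoints[i]) - here) >= u4 for i in neighbours)
```

The u3 check runs once, against the whole file, when cutpoints are assigned; the u4 check depends on the categorical universe the user has already selected, so it has to run per request.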

Model statement rules
- At most m1 variables may be used in the model statement. (start m1=20)
- Dummy variables must distinguish at least m2 observations. (start m2=20)
- No interaction term may involve more than 4 variables. (m3=4)
- No model involving 3 or more variables can be fully interacted. (m4=3)
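A rough sketch of how these four limits could be checked before a model is run. Several readings here are assumptions rather than anything the slides state: a model is represented as a list of terms, each a tuple of variable names; "fully interacted" is taken to mean that the model contains the interaction of all its variables; and a dummy "distinguishes" m2 observations when both of its groups contain at least m2 records. The function names and rule codes are hypothetical.

```python
def check_model_statement(terms, m1=20, m3=4, m4=3):
    """Return the codes of the model statement rules that a model violates.

    terms -- list of tuples of variable names, e.g. [("age",), ("age", "educ")]
    """
    term_sets = [frozenset(term) for term in terms]
    variables = frozenset().union(*term_sets)
    problems = []
    if len(variables) > m1:
        problems.append("m1")  # too many variables in the model statement
    if any(len(term) > m3 for term in term_sets):
        problems.append("m3")  # an interaction term involves too many variables
    if len(variables) >= m4 and variables in term_sets:
        problems.append("m4")  # model is fully interacted (assumed reading)
    return problems


def dummy_allowed(flags, m2=20):
    """m2 rule (assumed reading): both dummy groups hold at least m2 records."""
    ones = sum(1 for flag in flags if flag)
    zeros = len(flags) - ones
    return min(ones, zeros) >= m2
```

Under this reading a two-variable model with its interaction passes, since m4 applies only from three variables upward, while a three-variable model that includes the three-way interaction is rejected.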

Model Output
- Residuals will be based on synthetic data
- Limit on the number of significant digits?
- R² cannot be 1?
- Rules for other diagnostics

Synthetic Residuals
- Users may see synthetic bar charts or distributions and synthetic 2-way plots.
- Synthetic data must be generated from fixed random number starts and topcoded (and bottom coded where appropriate) at 4 standard deviations from the mean.
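The slides do not say how the synthetic values are generated, so the following is only a sketch of the two stated constraints: a fixed random number start and coding at 4 standard deviations from the mean. Drawing from a normal distribution matched to the mean and standard deviation of the real residuals is an assumption made for illustration, as are the function name and the seed value.

```python
import random
import statistics


def synthetic_residuals(residuals, seed=12345, k=4.0, bottom_code=True):
    """Generate synthetic residuals with a fixed seed, coded at k std devs.

    The normal model is an illustrative assumption; the slides specify only
    the fixed random start and the top/bottom coding.
    """
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    upper = mean + k * sd
    lower = mean - k * sd
    rng = random.Random(seed)  # fixed start: repeated requests give identical output
    synthetic = []
    for _ in residuals:
        draw = min(rng.gauss(mean, sd), upper)  # topcode
        if bottom_code:
            draw = max(draw, lower)  # bottom code where appropriate
        synthetic.append(draw)
    return synthetic
```

Presumably the fixed start is there so that a user who submits the same request repeatedly always receives the same synthetic values and cannot average over fresh draws to approach the real residuals.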

Data preparation
- The topcode for numeric data needs to be calculated
- Cutpoints must be determined
- Separate lists of variables for exploration, universe formation, dependent and independent variables, model estimation
- Standard recodes added
- Inference from the collection of all 4-way categorical tables checked

Major Hurdles
- Implementing facility for dummy variables
- Presentation of geographic options
- Implementing synthetic residuals
- Architecture for differing variable roles

Future development
- Relaxation of top codes
- Implementation of model variance estimation (NSO weighting)
- Introduction of new dataset
- Introduction of new statistical procedures
- Facility to add contextual data or merge files
- Use of non-sampled data

Overview
- Avoids (as much as possible) tests which accept or reject a user's choice.
- Restricts the dimension of the data access.
- Has some flexibility in setting system confidentiality parameters.
- Changes the intruder model.
- Introduces a modification of k-anonymity.

My thanks to Jerry Reiter, Laura Zayatz and Stephen Wenck
Contact: