Disclosure Avoidance: An Overview


Disclosure Avoidance: An Overview Irene Wong ACCOLEDS/DLI Training December 8, 2003

Note: The following slides were prepared in conjunction with the ACCOLEDS/DLI Training presentations at the University of Calgary (Alberta) on December 8, 2003, and are not intended for use as documentation of disclosure risk control and practices. For more information about the slides, please contact the author at RDC@library.ualberta.ca.

Presentation Outline
- Overview of data confidentiality
- Different types of disclosure and output
- Some examples
- Facing the challenge

Why is keeping data confidential so important?
- Retain and respect public trust
  - Most household/population surveys do not have mandatory participation
  - Respondents volunteer their time and information
  - Respondents trust Statistics Canada to ensure their privacy and the confidentiality of their information
- To ensure future data collection
- Statistics Act: judiciously guarding respondents’ confidential information

Types of data
- Aggregated data vs. microdata: dictates the data release method
- Enterprise data vs. household data: mandatory vs. voluntary participation
- Administrative data and census vs. sample survey: different degrees of disclosure risk

Confidentiality and Disclosure
- Under the Statistics Act, Statistics Canada must protect the confidentiality of respondents’ data and identity.
- Disclosure relates to the inappropriate attribution of information to a data subject, whether the subject is an individual or an organization.

So what’s the problem?
- Direct identifiers (name, address, health number, etc.) uniquely identify a respondent. These are all stripped from released data files.
- Indirect identifiers are variables such as age, marital status, occupation, ethnicity, postal code, type of business, etc. When combined, they could be used to identify a respondent.
- Sensitive variables are information or characteristics relating to a respondent’s private life or business which are usually unknown to others (income, illness, behaviour, etc.).

The concern is…
- Combining indirect identifiers with sensitive variables poses a disclosure risk, but…
- It is usually what researchers want to do: relate specific characteristics of some response groups to specific activities/characteristics, and explain how and why they are related
- Control methods: restricted access, data reduction, disclosure analysis…

Controls on microdata release
- Restricted access
  - License and data sharing agreement
  - Strictly control record linkage (direct identifiers)
  - Survey data access restricted within the organization
  - Employee access granted on a “need to know” basis only
  - Analytical (confidential) database with direct identifiers removed
  - Direct access: authorized employees/deemed employees only
  - Indirect data access (Remote Access services/Remote Data Access services): screening
- Data reduction, e.g. PUMF

Public Use Microdata File (PUMF)
- Files of anonymous individual records
- Created for research purposes
- Follows Statistics Canada’s Policy on Microdata Release
- Expect some forms of data reduction and suppression
- Expect suppression of sample design information (cluster, stratification, etc.)

PUMF disclosure risk control
- Suppress some indirect identifiers (e.g. small geographical codes, race details, etc.)
- Avoid unique combinations of indirect identifiers that can disclose a response unit (such as gender, age, occupation, chronic conditions, religion, etc.)
- Perform univariate analyses and look for outliers
- Sometimes maximum/minimum values are capped
- And more…
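Two of the PUMF checks above, looking for unique combinations of indirect identifiers and capping extreme values, can be sketched in a few lines. This is an illustration only, not Statistics Canada's procedure: the function names, the sample records and the cap of 150,000 are all hypothetical.

```python
from collections import Counter

def unique_combinations(records, keys):
    """Return the combinations of `keys` that occur exactly once in `records`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [combo for combo, n in counts.items() if n == 1]

def top_code(values, cap):
    """Cap every value above `cap` at `cap` (hypothetical threshold)."""
    return [min(v, cap) for v in values]

# Made-up records: the third one is unique on all three indirect identifiers.
records = [
    {"sex": "F", "age_group": "30-39", "occupation": "teacher"},
    {"sex": "F", "age_group": "30-39", "occupation": "teacher"},
    {"sex": "M", "age_group": "60-69", "occupation": "judge"},
]
risky = unique_combinations(records, ["sex", "age_group", "occupation"])
capped = top_code([42000, 58000, 579789], cap=150000)
```

A record flagged in `risky` would be a candidate for coarser categories or suppression of one of the identifiers.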

Protection of confidential data
- Physical protection of the data storage area
- Protection of the computer systems
- Enforcement of data releasers’ and users’ responsibilities to protect respondent confidentiality
- Disclosure analysis on output that leaves the restricted data storage area

Identity Disclosure
- Occurs when a respondent can be identified from the released data
- Combines an identifier with sensitive variables
- Examples:
  - Spontaneous recognition of a well-known characteristic by others (e.g. from a small sample)
  - Self-disclosure (e.g. a respondent self-identifies when complaining to the media about a privacy violation)

Attribute Disclosure
- Occurs when confidential information is revealed and can be attributed to an individual or a group, for example when all persons with characteristic x have characteristic y
- Examples:
  - People in occupation W make $50,000-60,000/year…
  - 100% of the respondents of age W in area X reported that they experimented with …

Residual Disclosure
- Occurs when confidential information is disclosed by combining previously released output and information
- Extra care is needed where the risk of residual disclosure is high, such as:
  - Subsequent cycles of longitudinal data files (e.g. NLSCY, NPHS, etc.)
  - Samples from dependent surveys (e.g. SLID and LFS)
  - Research projects using the same data file
  - Overlapping small geographical areas (e.g. Health Region and Economic Region)

Types of outputs
- Analytic studies (e.g. inferential statistics/model output)
  - Model parameters such as regression coefficients
  - Hypothesis test results such as p-values and t-statistics
- Descriptive studies (e.g. table output)
  - Frequencies, percentiles, cross-tabulations, standard errors, correlation matrices, etc.

To lower disclosure risk
General rules we follow for household sample surveys:
- Do not report statistics or table cells based on a small number of respondents (e.g. fewer than 5 respondents)
- No anecdotal information may be given about specific respondents
- ‘Zero’ and ‘Full’ cell restriction
- Minimum and maximum value restriction
- Saturated models and covariance/correlation matrices are treated like their underlying tables
- And more…

Some examples…

Low frequency cells

          M     F     total
1        34    14      48
0        15     2 (X)  17
total    49    16      65

(F, 0) is a low frequency cell. Solution?
- Collapse columns ‘M’ and ‘F’ = column ‘total’
- Collapse rows ‘1’ and ‘0’ = row ‘total’
- Report column ‘M’ and row ‘1’, but not together with the ‘total’
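The low frequency check on this slide can be sketched as below. It assumes the slide's table reads 34 and 14 in row 1 and 15 and 2 in row 0 (the layout implied by the totals), and it uses the "fewer than 5 respondents" threshold quoted earlier as an example; the function names are hypothetical.

```python
MIN_CELL = 5  # example threshold: cells with fewer than 5 respondents

def low_frequency_cells(table, threshold=MIN_CELL):
    """Return (row, column) pairs whose count is above 0 but below the threshold."""
    return [
        (row, col)
        for row, cols in table.items()
        for col, n in cols.items()
        if 0 < n < threshold
    ]

def collapse_columns(table):
    """Collapse all columns into a single 'total' column per row."""
    return {row: {"total": sum(cols.values())} for row, cols in table.items()}

# Counts as read from the slide (assumed layout: rows 1/0, columns M/F).
table = {"1": {"M": 34, "F": 14}, "0": {"M": 15, "F": 2}}
flagged = low_frequency_cells(table)
collapsed = collapse_columns(table)
still_flagged = low_frequency_cells(collapsed)
```

Collapsing ‘M’ and ‘F’ leaves only the row totals 48 and 17, so no cell falls under the threshold any more.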

Frequency distributions
- Frequency curve, e.g. the user wishes to release the value of the observation at the 99th percentile
- If fewer than 5 respondents are above the 99th percentile, there is a problem. One solution is to describe the distribution using the 95th percentile. *
- * If the survey is multilevel (NLSCY), then the 5 or more respondents from level 1 (child) must come from at least 3 different units from level 2 (household), e.g. child 1: family 1; child 2: family 1; child 3: family 2; child 4: family 2; child 5: family 3; …
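The percentile rule, including the multilevel condition, amounts to two counts over the respondents above the chosen percentile. The sketch below uses the thresholds stated on the slide (at least 5 respondents from at least 3 households); the function name and the data layout, a list of (child, family) pairs, are assumptions made for illustration.

```python
def tail_is_releasable(tail, min_respondents=5, min_units=3):
    """Check the respondents above a percentile.

    `tail` is a list of (respondent_id, unit_id) pairs, e.g. (child, family).
    """
    respondents = {respondent for respondent, _ in tail}
    units = {unit for _, unit in tail}
    return len(respondents) >= min_respondents and len(units) >= min_units

# The slide's example: five children spread over three families.
slide_example = [
    ("child 1", "family 1"),
    ("child 2", "family 1"),
    ("child 3", "family 2"),
    ("child 4", "family 2"),
    ("child 5", "family 3"),
]
ok = tail_is_releasable(slide_example)

# Five children from only two families fail the level-2 condition.
too_clustered = [(f"child {i}", f"family {1 + i % 2}") for i in range(1, 6)]
not_ok = tail_is_releasable(too_clustered)
```

If the check fails at the 99th percentile, the slide's suggestion is to fall back to a lower percentile such as the 95th and test again.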

‘Zero’ and ‘Full’ cell

          M     F     total
1        52    64     116
0        13     0      13
total    65    64     129

- (F, 1) is a full cell
- (F, 0) is a non-structural zero cell
- Both could pose a confidentiality problem

age      married   single
<12          0        40
13-20        5        35
>20         32         8
total       37        83     120

- (Married, age <12) is a structural zero cell
- Not a data confidentiality problem
- No one is expected to be in this category
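A possible way to screen a table for zero and full cells is sketched below. It assumes the zeros implied by the slide's totals (0 for (F, 0) and 0 for married respondents under 12), checks full cells within each column only, and takes the structural zeros as an input supplied by the analyst; the function name is hypothetical.

```python
def zero_and_full_cells(table, structural_zeros=frozenset()):
    """Scan each column of `table` ({row: {col: count}}).

    Returns (zero_cells, full_cells) as lists of (row, col) pairs.
    Cells listed in `structural_zeros` are not reported as zero cells.
    """
    columns = list(next(iter(table.values())).keys())
    zero_cells, full_cells = [], []
    for col in columns:
        total = sum(table[row][col] for row in table)
        for row in table:
            n = table[row][col]
            if n == 0 and (row, col) not in structural_zeros:
                zero_cells.append((row, col))
            elif total > 0 and n == total:
                full_cells.append((row, col))
    return zero_cells, full_cells

# First table on the slide (rows 1/0, columns M/F).
by_sex = {"1": {"M": 52, "F": 64}, "0": {"M": 13, "F": 0}}
zeros, fulls = zero_and_full_cells(by_sex)

# Second table: married under 12 is a structural zero, so it is not flagged.
by_age = {
    "<12": {"married": 0, "single": 40},
    "13-20": {"married": 5, "single": 35},
    ">20": {"married": 32, "single": 8},
}
age_zeros, age_fulls = zero_and_full_cells(
    by_age, structural_zeros={("<12", "married")}
)
```

Which cells count as structural zeros is a judgment about the population, not something the counts themselves reveal, which is why it is passed in rather than detected.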

Implied tables - residual disclosure

Select if Married = 1
          Yes     No
1        2013     40
2         205     35
3         132      8
total    2350     83

Select all cases
          Yes     No
1        2020     41
2         209     52
3         430     16
total    2659    109

- Implied tables are tables produced by subtracting results from one or more published tables from another published table
- In this example, the counts for ‘non-married’ individuals can easily be calculated
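The subtraction behind an implied table is easy to reproduce, which is the point of the slide. The sketch below uses the slide's counts and assumes that both tables share the row labels 1 to 3 and the columns Yes and No; the function name and the threshold of 5 are illustrative.

```python
def implied_table(all_cases, subset):
    """Subtract a published subset table from the published all-cases table."""
    return {
        row: {col: all_cases[row][col] - subset[row][col] for col in all_cases[row]}
        for row in all_cases
    }

# Counts from the slide (row labels 1-3 assumed to match across both tables).
married = {1: {"Yes": 2013, "No": 40}, 2: {"Yes": 205, "No": 35}, 3: {"Yes": 132, "No": 8}}
everyone = {1: {"Yes": 2020, "No": 41}, 2: {"Yes": 209, "No": 52}, 3: {"Yes": 430, "No": 16}}

non_married = implied_table(everyone, married)

# Cells of the implied table that fall under an example threshold of 5.
small_cells = [
    (row, col)
    for row, cols in non_married.items()
    for col, n in cols.items()
    if 0 < n < 5
]
```

Neither published table contains a cell below 5, yet the implied non-married table does, so releasing both tables discloses counts that would not have been released directly.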

When reporting information…
Writing a report is no different from working with table output. Avoid statements such as:
- “… reported incomes ranging from $2,498 to $579,789.” If necessary, give general indications (e.g. “no income was above $600,000”).
- “… all respondents of age 16 reported experimenting with drugs.” This is equivalent to a full cell situation.

Related Outputs
If a PUMF and analytical outputs based on confidential data are released for the same survey, the published results should not disclose sensitive information about individual respondents that was suppressed in the PUMF. That is, it should not be possible to infer from the reported results any information that allows a PUMF respondent to be identified.

Facing Challenges
- No single control of all the releases
  - Remote Access, PUMFs, RDCs, survey data publications, etc.
- Potential residual disclosure
  - Can residual disclosure be totally accounted for?
  - Can it be better controlled?

What RDCs are doing now…
Educate data users to:
- Take precautions when dealing with confidential information
- Recognize disclosure risk
- Make use of alternative reporting and complementary suppression
- Limit intermediary outputs

What else should we do?
- Match against other types of file releases to assess overall disclosure risk?
- Future data reduction in PUMFs and publications?
- Follow the American RDC approach?
- Different disclosure analysis approach for different data files?
- Stricter screening process?
- ……