1 Tel Aviv April 29th, 2007 Disclosure Limitation from a Statistical Perspective Natalie Shlomo Dept. of Statistics, Hebrew University Central Bureau of.

Presentation transcript:

1 Tel Aviv April 29th, 2007 Disclosure Limitation from a Statistical Perspective Natalie Shlomo Dept. of Statistics, Hebrew University Central Bureau of Statistics

2 Topics of Discussion
1. Introduction and Motivation
2. Disclosure risk – data utility decision problem
3. Assessing disclosure risk
4. Methods for masking statistical data
   - microdata
   - tabular data
5. Assessing information loss

3 Statistical Data
Sources of statistical data:
- Census: full enumeration of the population
- Administrative: data collected by government agencies for other purposes, e.g. tax records, population register
- Survey: random sample out of the population. Each unit in the sample is assigned a sampling weight. Often the population is unknown.
SDC approach: “Safe Data” versus “Safe Settings”
Microdata Review Panels need to make informed decisions on releasing microdata and on the mode of access.

4 Assessing Disclosure Risk
- Physical disclosure: disclosure from a breach of physical security, e.g. stolen questionnaires, computer hacking
- Statistical disclosure: disclosure from statistical outputs
- Disclosure risk scenarios: assumptions about information or IT tools available to an intruder that increase the probability of disclosure, e.g. matching to external files or spontaneous recognition
- Key: combination of indirect identifying variables, such as sex, age, occupation, place of residence, country of birth and year of immigration, marital status, etc.

5 Types of Statistical Disclosure
- Identity disclosure: an intruder identifies a data subject. Confidentiality pledges and code of practice: “…no statistics will be produced that are likely to identify an individual unless specifically agreed with them”
- Individual attribute disclosure: confidential information is revealed and can be attributed to a data subject. Identity disclosure is a necessary condition for attribute disclosure and should be avoided.
- Group attribute disclosure: learning about a group but not about a single subject. May cause harm, e.g. all adults in a village collect unemployment.

6 The SDC Problem
R-U Confidentiality Map (Duncan et al. 2001)
[Figure: R-U map with points labelled Original Data, Released Data and No data, and a Maximum Tolerable Risk threshold]
Data Utility: quantitative measure of the statistical quality
Disclosure Risk: probability of re-identification

7 Disclosure Risk Measures
Frequency tables with full population counts:
- 1’s and 2’s in cells lead to disclosure
- 0’s may be disclosive if there is only one non-zero cell in a row or column
Disclosure risk is quantified by the percentage of small cells and by the probability that a high-risk cell is protected (taking into account other measurement errors, e.g. imputation rates).
Magnitude tables: sensitivity measures based on the number of contributing units and the distribution of the target variable in the cell.

8 Disclosure Risk Measures
Microdata from surveys (and frequency tables):
Decisions are typically based on check lists and ad-hoc decision rules regarding low frequencies in combinations of identifying key variables.
In recent years, objective quantitative criteria have been developed for measuring disclosure risk when the population is unknown:
- Probability that a sample unique is a population unique
- Probability of a correct match between a record in the microdata and an external file
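
The low-frequency checks on key variables can be sketched as a count of sample uniques. This is only an illustration: the records, the field names and the helper `sample_uniques` are hypothetical, and estimating the probability that a sample unique is also a population unique would need a population model that is not shown here.

```python
from collections import Counter

def sample_uniques(records, key_vars):
    """Return the records whose key combination occurs exactly once in the sample."""
    keys = [tuple(rec[v] for v in key_vars) for rec in records]
    counts = Counter(keys)
    return [rec for rec, key in zip(records, keys) if counts[key] == 1]

# Hypothetical microdata sample (not taken from the presentation).
sample = [
    {"sex": "F", "age": 34, "occupation": "nurse"},
    {"sex": "F", "age": 34, "occupation": "nurse"},
    {"sex": "M", "age": 71, "occupation": "pilot"},
    {"sex": "M", "age": 45, "occupation": "teacher"},
]
uniques = sample_uniques(sample, ["sex", "age", "occupation"])
share_unique = len(uniques) / len(sample)
```

Adding or coarsening key variables changes the count, which is why recoding is a common first masking step.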

9 On Definitions of Disclosure Risk
In the statistics literature, we present examples of risk measures but lack formal definitions of when a file is safe.
In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5); Adam and Wortmann (1989)).
In some of the CS literature, any data must be released with noise of magnitude $\sqrt{n}$.
Adding noise of order $\sqrt{n}$ hides information on individuals and small groups, but yields meaningful information about sums of $O(n)$ units, for which noise of order $\sqrt{n}$ is natural.

10 On Definitions of Disclosure Risk
The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped) simplifies definitions, so there is no need to consider other, more realistic but more complicated scenarios.
But would Statistics Bureaus and statisticians agree to adding noise to any data?
Other approaches, like query restriction or query auditing, do not lead to formal definitions.

11 On Definitions of Disclosure Risk
Collaboration between the CS and statistical communities, where:
1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk
2. In the CS community, there is a need for statistical methods to preserve the utility of the data:
   - allow sufficient statistics to be released without perturbation
   - methods for adding correlated noise
   - sub-sampling and other methods for data masking
Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?

12 SDC Methods for Microdata
Data masking techniques:
- Non-perturbative methods: preserve the integrity of the data (impact on the variance). Examples: recoding, local suppression, sub-sampling
- Perturbative methods: alter the data (impact on bias). Examples: adding noise, rounding, microaggregation, record swapping, post-randomization method, synthetic data

13 SDC Methods for Microdata
Additive noise
A random vector with zero mean (for example, from a normal distribution) is generated independently for each individual in the microdata and added to the continuous variables to be perturbed.
Use correlated noise based on the target variables in order to preserve the means and the covariance matrix, and also linear balance edits, e.g. X+Y=Z.
Let [definition missing from transcript]. Generate [noise distribution missing from transcript]. Calculate: [equation missing from transcript], where [parameter missing from transcript] controls the amount of noise.
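
The formulas for this step are not legible in the transcript, so the sketch below is an assumed construction rather than the presenter's own: a univariate version in which the masked value is `d1*x + d2*eps`, with `d1 = sqrt(1 - delta**2)`, `d2 = delta`, and noise `eps` drawn with the data's standard deviation and a shifted mean. Under these assumptions the mean and variance are preserved in expectation. The names `noise_parameters`, `perturb` and `delta` are hypothetical.

```python
import math
import random
import statistics

def noise_parameters(mu, sigma, delta):
    """Return (d1, d2, noise_mean, noise_sd) for the masked value d1*x + d2*eps."""
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    d1 = math.sqrt(1 - delta ** 2)
    d2 = delta
    # Shifted noise mean, chosen so that d1*mu + d2*noise_mean equals mu.
    noise_mean = (1 - d1) / d2 * mu
    # The noise shares the data's spread, so d1^2 + d2^2 = 1 preserves the variance.
    return d1, d2, noise_mean, sigma

def perturb(values, delta, rng):
    """Mask a list of numbers with the assumed correlated-noise construction."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    d1, d2, noise_mean, noise_sd = noise_parameters(mu, sigma, delta)
    return [d1 * x + d2 * rng.gauss(noise_mean, noise_sd) for x in values]

rng = random.Random(0)
original = [rng.gauss(50, 10) for _ in range(20000)]  # hypothetical variable
masked = perturb(original, delta=0.3, rng=rng)
```

A larger `delta` puts more weight on the noise term. Preserving linear edits such as X+Y=Z would need the multivariate version, with noise drawn using the covariance matrix of the data, which this sketch does not attempt.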

14 SDC Methods for Microdata
PRAM (post-randomisation method)
Misclassify categorical variables according to a transition matrix $P$, where $p_{ij}$ is the probability that original category $i$ is changed to category $j$, and a random draw.
For $T^*$ the vector of perturbed frequencies, $T^* P^{-1}$ is an unbiased moment estimator of the original data.
Under the condition of invariance, $TP = T$ (the vector of original frequencies $T$ is an eigenvector of $P$), the perturbed file is an unbiased estimate of the original file.
Expected values of the marginal distributions are preserved. Exact marginal distributions can also be ensured by controlling the selection process for changing records.
Use control strata to ensure no silly combinations.

15 SDC Methods for Microdata
PRAM example: T' = (25, 30, 50, 10)
- Generate a transition matrix with a minimum value on the diagonal and all other probabilities equal.
- Calculate the invariant matrix R and determine [parameter missing from transcript] such that the final matrix will have the desired diagonals.
Note that [equation missing from transcript].
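
One way to obtain an invariant matrix for the example frequencies is a two-stage construction: start from a matrix P with a fixed diagonal, form the Bayes reverse matrix Q, and take R = PQ. This is a sketch under that assumption, since the slide's own formulas are not preserved; the diagonal value 0.8 and the function name `invariant_pram` are hypothetical, and the further adjustment to reach desired diagonals is not shown.

```python
def invariant_pram(freqs, diag):
    """Return a transition matrix R satisfying T R = T for the row vector T = freqs."""
    k = len(freqs)
    off = (1 - diag) / (k - 1)
    # Stage 1: P[i][j] = Pr(perturbed = j | original = i), equal off-diagonals.
    P = [[diag if i == j else off for j in range(k)] for i in range(k)]
    # Expected perturbed frequencies under P.
    col = [sum(freqs[i] * P[i][j] for i in range(k)) for j in range(k)]
    # Stage 2 (Bayes): Q[j][l] = Pr(original = l | perturbed = j).
    Q = [[freqs[l] * P[l][j] / col[j] for l in range(k)] for j in range(k)]
    # R = P Q maps the original frequencies back onto themselves in expectation.
    return [[sum(P[i][j] * Q[j][l] for j in range(k)) for l in range(k)]
            for i in range(k)]

T = [25, 30, 50, 10]
R = invariant_pram(T, diag=0.8)
expected = [sum(T[i] * R[i][l] for i in range(len(T))) for l in range(len(T))]
```

Here `expected` reproduces `T` up to floating-point error, which is the invariance condition described for PRAM.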

16 SDC Methods for Microdata
Synthetic data
- Fit the data to a model, e.g. using multiple imputation techniques to develop the posterior distribution of a population given the sample data
- Can be implemented on parts of the data, giving a mixture of real and synthetic data
- In practice, it is very difficult to capture all of the conditional relationships between variables and within sub-groups
Microaggregation
- Identify groups of records, e.g. of size 3, and replace values by the group mean (it has been shown that this is easy to ‘unpick’ for one variable)
- Carry out on several variables at once, using clustering algorithms to reduce the within-group variance
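
Univariate microaggregation with groups of size 3 can be sketched as follows. This is a minimal illustration for a single variable, using sorted order to form groups. The multivariate, clustering-based version mentioned above is not shown, and the data and the name `microaggregate` are hypothetical.

```python
def microaggregate(values, group_size=3):
    """Replace each value by the mean of its group of neighbouring sorted values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Consecutive blocks of the sorted order form the groups.
    groups = [order[s:s + group_size] for s in range(0, len(order), group_size)]
    # A short tail joins the previous group so that no group is below group_size.
    if len(groups) > 1 and len(groups[-1]) < group_size:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [None] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result

incomes = [12, 3, 11, 1, 10, 2, 40]  # hypothetical values
masked = microaggregate(incomes)
```

The total is unchanged by construction, while the outlying value 40 is absorbed into its group mean.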

17 SDC Methods for Magnitude Tables
Cell Suppression
(axis labels: Industry, Region)

                                              Total
A        2,608 (5)    11,358     4,871       18,837
B        2,562 (3)    11,631     3,652       17,845
C        2,608 (12)   11,956     3,054       17,618
D        Suppress     12,281     3,051       17,641
E        2,240 (2)     7,347     3,537       13,124
Total    12,327       54,573    18,165       85,065

18 SDC Methods for Magnitude Tables
Cell Suppression
(axis labels: Industry, Region)

                                              Total
A        2,608 (5)    11,358     4,871       18,837
B        Secondary    11,631     Secondary   17,845
C        2,608 (12)   11,956     3,054       17,618
D        Suppress     12,281     Secondary   17,641
E        2,240 (2)     7,347     3,537       13,124
Total    12,327       54,573    18,165       85,065

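19 SDC Methods for Magnitude Tables
Information Available to Table User
(1) T(1)+T(2) = 12327-2608-2608-2240 = 4871
(2) T(1)+T(3) = 17845-11631 = 6214
(3) T(3)+T(4) = 18165-4871-3054-3537 = 6703
(4) T(2)+T(4) = 17641-12281 = 5360
(5) T(1)>0, (6) T(2)>0, (7) T(3)>0, (8) T(4)>0
Hence:
T(2) <= 4871 from (1)
T(4) <= 6703 from (3)
T(2) >= max(0, 5360-6703) = 0 from (4), (6)
etc…
Represent as a matrix equation and vector inequality $AT = b$, $T > 0$, where
$$A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{pmatrix}, \quad T = \begin{pmatrix} T(1) \\ T(2) \\ T(3) \\ T(4) \end{pmatrix}, \quad b = \begin{pmatrix} 4871 \\ 6214 \\ 6703 \\ 5360 \end{pmatrix}$$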

20 SDC Methods for Magnitude Tables
Disclosure Protection
Determine upper and lower bounds for T(1), …, T(4) (feasibility intervals) using eight linear programming solutions:
1a maximise T(1) subject to AT=b, T>0
1b minimise T(1) subject to AT=b, T>0
2a maximise T(2) subject to AT=b, T>0
…
There must be ‘feasible’ solutions, and the true values of T(X) will lie within the bounds. Let the bounds be $T(1)^L$ and $T(1)^U$, etc.
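
For this particular table the bounds can be found without a linear programming solver, because the four equations in T(1), …, T(4) leave a single free parameter. The sketch below uses that shortcut. It is not a general LP implementation, the function name `feasibility_intervals` is hypothetical, and the bounds are reported as closed intervals.

```python
# Sums of the suppressed cells implied by the published cells and margins.
COL1 = 12327 - 2608 - 2608 - 2240   # T(1) + T(2)
ROW_B = 17845 - 11631               # T(1) + T(3)
COL3 = 18165 - 4871 - 3054 - 3537   # T(3) + T(4)
ROW_D = 17641 - 12281               # T(2) + T(4)

def feasibility_intervals():
    """Bounds for each suppressed cell, parametrised by t = T(1)."""
    # Every suppressed cell equals offset + sign * t.
    cells = {
        "T1": (0, 1),
        "T2": (COL1, -1),
        "T3": (ROW_B, -1),
        "T4": (ROW_D - COL1, 1),
    }
    # The column-3 equation is redundant; confirm that it is consistent.
    assert cells["T3"][0] + cells["T4"][0] == COL3
    # Non-negativity of every cell restricts t to the interval [lo, hi].
    lo = max(-off for off, sign in cells.values() if sign > 0)
    hi = min(off for off, sign in cells.values() if sign < 0)
    return {name: tuple(sorted((off + sign * lo, off + sign * hi)))
            for name, (off, sign) in cells.items()}

intervals = feasibility_intervals()
```

For larger suppression patterns with several free parameters, each bound would instead come from a separate maximise or minimise linear programme, as the slide describes.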

21 SDC Methods for Magnitude Tables
Disclosure Protection

2,608        11,358    4,871           18,837
[0, 4871]    11,631    [1343, 6214]    17,845
2,608        11,956    3,054           17,618
[0, 4871]    12,281    [489, 5360]     17,641
2,240         7,347    3,537           13,124
12,327       54,573    18,165          85,065

22 SDC Methods for Magnitude Tables
Choice of Secondary Cells
Stipulate a requirement on $T(1)^L$ and $T(1)^U$ to ensure that the interval is sufficiently wide, by a fixed percentage $p$, e.g. $T(X)^U - T(X)^L \ge (p/100)\,T(X)$ for all X.
Employ a sensitivity measure: require $T(X)^U > T(X) + (p/100)\,T(X)$ and, by symmetry, $T(X)^L < T(X) - (p/100)\,T(X)$.
Sliding rule protection: only the width is predetermined and the interval may be skewed.
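
The symmetric requirement can be written as a small check. In this sketch the value 2309 is the primary suppressed cell implied by the row margin of the earlier table (17641 - 12281 - 3051), the interval [0, 4871] is its feasibility interval, and the percentages tried are hypothetical choices, not values from the presentation.

```python
def is_protected(value, lower, upper, p):
    """True if the feasibility interval extends more than p percent on both sides."""
    margin = p / 100 * value
    return upper > value + margin and lower < value - margin

true_value = 17641 - 12281 - 3051          # 2309, the primary suppressed cell
ok_at_20 = is_protected(true_value, 0, 4871, p=20)
ok_at_100 = is_protected(true_value, 0, 4871, p=100)
```

With p = 100 the lower requirement fails, because the lower bound 0 is not strictly below 2309 - 2309.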

23 SDC Methods for Magnitude Tables
Choice of Secondary Cells
There are many possible sets of suppressed cells (including all cells!). Define a target function and minimise it subject to constraints that preserve the protection intervals.
Idea: target function based on a cost C(X) = information content of cell X.
Common choices of C(X):
a) C(X)=1: minimise the number of cells suppressed
b) C(X)=N(X): minimise the number of contributors suppressed
c) C(X)=T(X): minimise the total value suppressed (all cells must be non-negative)

24 SDC Methods for Magnitude Tables
Choice of Secondary Cells
Hypercube method: simple but not optimal
On a k-dimensional table, choose a k-dimensional hypercube with the sensitive cell in a corner. All $2^k$ corner points are suppressed.
Criteria:
- A corner can’t be zero, since structural zeros may allow recalculating other corners
- Feasibility intervals should be sufficiently wide (intervals are simpler to calculate on a hypercube)
- Identify possible suppression candidates and choose the one with minimal information loss (minimise the cost function)
- A priori, choose cells that were previously suppressed, to minimise information loss, by putting a large negative cost on the suppressed cells

25 SDC Methods for Frequency Tables: Rounding
Round frequencies:
- deterministic, e.g. to nearest 5
- random, e.g. to a close multiple of 5
- controlled, e.g. to a multiple of 5, with margins = sum of interior cells
Usually interior cells and margins are rounded independently, so tables don’t add up.
Can implement rounding on only the small cells of the table; margins are then added up from perturbed and non-perturbed cells.

26 SDC Methods for Frequency Tables Rounding Example - complete census What types of disclosure risk are present in this table?

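27 SDC Methods for Frequency Tables: Rounding
Deterministic rounding process to base 3
[Two tables with columns 1 to 4 and Total and rows A to D and Total, one of them the published table. Only two rows survive in the transcript: A = 0, 0, 0, 0, total 0 and C = 6, 0, 0, 3, total 9.]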

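28 SDC Methods for Frequency Tables: Rounding
Random rounding algorithm:
Let $\lfloor x \rfloor_b$ be the largest multiple $kb$ of the base $b$ such that $kb \le x$ for an entry $x$. Define $\mathrm{res}(x) = x - \lfloor x \rfloor_b$.
$x$ is rounded up to $\lfloor x \rfloor_b + b$ with probability $\mathrm{res}(x)/b$ and rounded down to $\lfloor x \rfloor_b$ with probability $1 - \mathrm{res}(x)/b$.
If $x$ is already a multiple of $b$, it remains unchanged.
The expected value of the rounded entry is the original entry, since
$$\lfloor x \rfloor_b \left(1 - \frac{\mathrm{res}(x)}{b}\right) + \left(\lfloor x \rfloor_b + b\right)\frac{\mathrm{res}(x)}{b} = \lfloor x \rfloor_b + \mathrm{res}(x) = x.$$
Each small cell is rounded independently in the table. The selection process can also be controlled to ensure additive totals in one dimension.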

29 SDC Methods for Frequency Tables: Rounding
An example of random rounding (a typical rounding scheme; margins rounded separately):
0 → 0
1 → 0 with probability 2/3, 3 with probability 1/3
2 → 0 with probability 1/3, 3 with probability 2/3
3 → 3
4 → 3 with probability 2/3, 6 with probability 1/3
…
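
The rounding scheme above can be sketched as follows. The code assumes non-negative integer counts; the function names are hypothetical, and exact fractions are used so that the unbiasedness property can be checked without floating-point error.

```python
import random
from fractions import Fraction

def rounding_distribution(x, base):
    """Return {rounded value: probability} for unbiased random rounding of x."""
    residual = x % base
    if residual == 0:
        return {x: Fraction(1)}          # multiples of the base stay unchanged
    lower = x - residual
    p_up = Fraction(residual, base)      # closer to the upper multiple => more likely up
    return {lower: 1 - p_up, lower + base: p_up}

def random_round(x, base, rng):
    """Draw one rounded value for x."""
    dist = rounding_distribution(x, base)
    values = list(dist)
    weights = [float(dist[v]) for v in values]
    return rng.choices(values, weights=weights)[0]

def expected_value(x, base):
    """Expected rounded value; equals x for every non-negative integer x."""
    return sum(v * p for v, p in rounding_distribution(x, base).items())

rng = random.Random(1)
rounded_cells = [random_round(x, 3, rng) for x in [0, 1, 2, 3, 4]]
```

Each cell is rounded independently here, so a table rounded this way need not be additive.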

30 SDC Methods for Frequency Tables: Rounding
Complete census in small area, after random rounding
[Table by Age not preserved in transcript]
- Ones and twos disappear
- Doubt cast on zeroes, so disclosure prevented
- Figures don’t add up, which may allow the table to be “unpicked”

31 SDC Methods for Frequency Tables: Rounding
Controlled rounding: rounding in such a way that tables are additive
[Table by Age not preserved in transcript]
- Ones and twos disappear
- Doubt cast on zeros, so disclosure prevented
- Tables additive
- Zero-restricted: an entry that is an integer multiple of the base b is unchanged

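32 Auditing Random Rounding
Example: random rounding to base 5
[Two tables of sex (Male, Female, Total) by age group (a column beginning "Under", a column lost in the transcript, "Over 60", Total). Surviving entries: Male 6, 7, 1, total 14; Female 5, 4, 0, total 9. The remaining entries are not preserved.]
Feasible interval generally 8 wide (between 1 and 9), except for column 3, which is 4 wide (between 0 and 4).
Column one and row one do not add up to totals, nor do the one-way margins add up to the grand total.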

33 Auditing Random Rounding
Example: restrict attention just to one-way margins and total.

34 Auditing Random Rounding
Feasible intervals

35 Auditing Random Rounding
Feasible intervals are sometimes much narrower than the rounding method suggests. In some cases, where frequencies are low, this can result in potential disclosure.

36 Impact on Analysis
- Loss of information: combining categories
- Inflated or deflated variance
- Bias and inconsistency in the data
Some SDC methods are transparent and users can take them into account, e.g. rounding. Other methods have hidden bias and their effects are difficult to assess, e.g. record swapping.

37 Information Loss Measures
Types of information loss measures, depending on the use of the data:
- Distortion to distributions and totals (bias), as measured by distance metrics, entropy, average perturbation per cell, etc.
- Impact on the variance of estimates
- Impact on measures of association based on chi-squared tests for independence
- Impact on goodness-of-fit criteria, regression coefficients, statistical analysis and inference

38 Information Loss Measures
Measures for Bias and Distortion
- Hellinger’s Distance (HD)
- Relative Absolute Distance (RAD)
- Average Absolute Distance per cell (AAD)
[Formulas and the results table (rows HD, RAD, AAD; columns SCA, CSCA, CRND) are not preserved in the transcript]
SCA: small cell rounding; CRND: semi controlled full rounding
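
The three distance measures can be sketched for a pair of frequency vectors (original and perturbed cell counts). The slide's own formulas did not survive, so the definitions below are common forms and should be read as assumptions; in particular, the Hellinger distance is computed on raw counts, and cells with an original count of zero are skipped in the relative measure. The example counts are hypothetical.

```python
import math

def hellinger_distance(orig, pert):
    """sqrt(1/2 * sum over cells of (sqrt(pert) - sqrt(orig))^2), on raw counts."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig, pert)))

def relative_absolute_distance(orig, pert):
    """Sum of |pert - orig| / orig over cells with a non-zero original count."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o != 0)

def average_absolute_distance(orig, pert):
    """Mean absolute perturbation per cell."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)

original = [1, 2, 7, 12, 0, 30]   # hypothetical cell counts
rounded = [0, 3, 6, 12, 0, 30]    # the same cells after rounding to base 3
hd = hellinger_distance(original, rounded)
rad = relative_absolute_distance(original, rounded)
aad = average_absolute_distance(original, rounded)
```

The relative measure weights changes to small cells most heavily, which is where rounding methods concentrate their perturbation.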

39 Information Loss Measures
Measures for Bias and Distortion for 10 consecutive OA’s
[Figure not preserved. Legend: R = random, R/I = random (no imputed), T = targeted]

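40 Information Loss Measures
Impact on Measure of Association: Cramer’s V
On the two-way table defined by OA * Age-Sex and Economic Activity * Long-Term Illness, calculate Cramer’s V [formula missing from transcript] and the information loss measure [formula missing from transcript].
[First results table, headed "SJ Cramer’s V=", with columns SCA, CSCA, CRND and a row Original: its entries survive only as the fused string "6.8%--6.7%-7.8%"]
Second results table, headed "SJ Cramer’s V=" (first column heading lost):

Method      [?]%    10%     20%
Random      0.3%    2.8%    4.8%
Rand/Imp    0.3%    2.0%    3.8%
Targeted    0.1%    1.4%    3.3%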
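
Cramer's V and a relative-change loss measure can be sketched as below. The textbook form of Cramer's V is used, the square root of chi-squared divided by n times min(rows - 1, columns - 1). The percentage relative change as the loss measure is an assumption, because the slide's formula is missing. The sketch assumes every row and column total is non-zero, and the function names are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for a two-way table of counts with non-zero margins."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min(len(row_tot) - 1, len(col_tot) - 1)))

def relative_change(v_original, v_perturbed):
    """Percentage change in Cramer's V caused by the SDC method (assumed measure)."""
    return 100 * (v_perturbed - v_original) / v_original

v_perfect = cramers_v([[10, 0], [0, 10]])
v_none = cramers_v([[5, 5], [5, 5]])
```

A negative relative change would indicate that the perturbation has weakened the measured association between the two variables.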

41 Disclosure Control Techniques
Record Swapping
- Disclosure risk measure: percentage of records in small cells of the tables that were not perturbed or not imputed
- Information loss measure: average absolute difference per cell