The use of protected microdata in tabulation: case of SDC-methods microaggregation and PRAM. Researcher Janika Konnu. Manchester, United Kingdom, 17-19 December.


The use of protected microdata in tabulation: case of SDC-methods microaggregation and PRAM
Researcher Janika Konnu
Manchester, United Kingdom, December 2007

Tuesday 18 December Janika Konnu
Outline: Data, SDC-methods, Results, Conclusions, Forthcoming research

Data used in the study
The teacher data were originally collected for administrative purposes.
Only high school teachers (N=7798) were included in our study.
The data included information about:
the teachers: age, gender, position, etc.
the schools those teachers taught in: the location of the school, number of students, etc.

SDC Methods: Microaggregation
First the data are divided into groups of k observations, and the group averages are released instead of the original values of the variable.
The MDAV algorithm was used for grouping: it finds the average observation with respect to the values and forms groups by using the distance from this average observation.
Grouping the data is the crucial point of this method: when each group contains the most similar observations, information loss is minimised.
In our study microaggregation was applied to categorical data, although it is intended for numerical data.
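The grouping step can be illustrated with a small sketch. This is not the implementation used in the study; it is a simplified MDAV-style routine written for illustration, assuming numeric records stored as tuples and Euclidean distance. The function names (`centroid`, `mdav_groups`, `microaggregate`) are hypothetical.

```python
from math import dist  # available in Python 3.8+


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def mdav_groups(points, k):
    """Partition record indices into groups of at least k records (MDAV-style).

    Assumes len(points) >= k; with fewer records a single smaller group results.
    """
    remaining = list(range(len(points)))
    groups = []

    def take_group(seed):
        # Remove and return the k remaining records closest to the seed record.
        remaining.sort(key=lambda i: dist(points[i], points[seed]))
        group = remaining[:k]
        del remaining[:k]
        return group

    def farthest_from_centroid():
        # The record most distant from the average observation.
        c = centroid([points[i] for i in remaining])
        return max(remaining, key=lambda i: dist(points[i], c))

    while len(remaining) >= 3 * k:
        r = farthest_from_centroid()
        # s is the record most distant from r, i.e. at the opposite extreme.
        s = max(remaining, key=lambda i: dist(points[i], points[r]))
        groups.append(take_group(r))
        groups.append(take_group(s))
    if len(remaining) >= 2 * k:
        groups.append(take_group(farthest_from_centroid()))
    if remaining:
        # Whatever is left (between k and 2k-1 records) forms the last group.
        groups.append(list(remaining))
    return groups


def microaggregate(points, k):
    """Replace every record by the average of its group."""
    protected = list(points)
    for group in mdav_groups(points, k):
        mean = centroid([points[i] for i in group])
        for i in group:
            protected[i] = mean
    return protected
```

For example, `microaggregate([(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)], 3)` releases the value 2.0 for the first three records and 11.0 for the last three.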

SDC Methods: The Post RAndomization Method (PRAM)
The method changes the values of a variable according to a probability matrix (a Markov matrix).
Example: (probability matrix shown on the slide)
When PRAM is applied, the data user must take the probability matrix into account in order to obtain correct results.
In our study we tested the usefulness of PRAM when the probability matrix is not used in the analysis.
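A minimal sketch of both sides of PRAM follows: the data protector draws each new value from the row of the Markov matrix that belongs to the original value, and the data user can recover estimates of the original frequencies by solving P^T t = t_pram, where t_pram holds the observed frequencies in the protected data. The matrix and counts below are made-up illustration values, and the function names are hypothetical; this is not the μ-Argus implementation.

```python
import random


def apply_pram(values, categories, matrix, rng):
    """Replace each value by a random draw from its row of the Markov matrix."""
    index = {c: i for i, c in enumerate(categories)}
    return [rng.choices(categories, weights=matrix[index[v]])[0] for v in values]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def estimate_original_counts(pram_counts, matrix):
    """Estimate the original frequencies from the protected ones.

    Expected protected counts are P^T t, so the estimate solves P^T t = t_pram.
    The matrix must be invertible.
    """
    transposed = [list(column) for column in zip(*matrix)]
    return solve(transposed, pram_counts)


# Hypothetical two-category example: rows are original values, columns new values.
P = [[0.9, 0.1],
     [0.2, 0.8]]
protected = apply_pram(["a"] * 900 + ["b"] * 100, ["a", "b"], P, random.Random(1))
observed = [protected.count("a"), protected.count("b")]
print(observed, estimate_original_counts(observed, P))
```

Tabulating `observed` directly, without the matrix, is the situation examined in the study; the corrected estimate shows what the matrix adds for the data user.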

Empirical work: μ-Argus software
The software includes disclosure risk measurement and the following methods: global recoding, local suppression, top and bottom coding, PRAM, numerical microaggregation, numerical rank swapping and Sullivan masking.
The software produces protected data if suppressions are allowed.
In our case, only the SDC methods PRAM and numerical microaggregation were studied.
No suppressions were made, because we needed information on the difference between the original and the protected data.

Results: Data protected by Microaggregation
The group sizes used in protection were 2, 5, 8, 10 and 15.
Microaggregation does not have an effect on the frequencies.
Unfortunately, this implies that hardly any change occurs in the values.
Conclusion: microaggregation does not give strong enough protection when it comes to categorical data.

Results: Data protected by PRAM (no bandwidth)
Changing probabilities: 0.05, 0.10, 0.20, 0.30 and 0.40.
PRAM changes the values of variables, and in that way the data are protected.
Unfortunately, PRAM leads to problems when the categories differ greatly in frequency: the larger frequencies keep getting smaller while the smaller ones keep growing.

Results: Data protected by PRAM (bandwidth is 2)
Changing probabilities: 0.05, 0.10, 0.20, 0.30 and 0.40.
Restricting the change of values cannot solve the problem caused by differences in frequencies.
Our study shows that the frequencies in the categories next to the one with the largest frequency still grow too fast.
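The frequency drift with and without a bandwidth can be reproduced in expectation with a short sketch. It assumes that every category keeps its value with probability 1 minus the changing probability and that the changed mass is spread evenly over the allowed target categories (all other categories, or only those within the bandwidth). This is an assumption made for illustration and may differ from how μ-Argus builds its matrix. The counts are hypothetical and the function names are made up.

```python
def pram_matrix(n, change_prob, bandwidth=None):
    """Build an n x n Markov matrix with a constant changing probability.

    With bandwidth=None a value may move to any other category; otherwise
    only to categories at most `bandwidth` positions away (ordered categories).
    """
    matrix = []
    for i in range(n):
        targets = [j for j in range(n)
                   if j != i and (bandwidth is None or abs(i - j) <= bandwidth)]
        row = [0.0] * n
        if targets:
            row[i] = 1.0 - change_prob
            for j in targets:
                row[j] = change_prob / len(targets)
        else:
            row[i] = 1.0  # no allowed target: the value can never change
        matrix.append(row)
    return matrix


def expected_counts(counts, matrix):
    """Expected frequencies after PRAM: column j collects mass from every row i."""
    n = len(counts)
    return [sum(counts[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Hypothetical skewed frequency table over five ordered categories.
counts = [900, 60, 20, 15, 5]
print(expected_counts(counts, pram_matrix(5, 0.20)))
print(expected_counts(counts, pram_matrix(5, 0.20, bandwidth=2)))
```

With these made-up counts and a changing probability of 0.20, the expected frequencies are about 725, 95, 65, 61 and 54 without a bandwidth, and about 725, 140, 112, 18 and 6 with bandwidth 2. The dominant category shrinks in both cases, and with the bandwidth the growth is concentrated in the categories next to it.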

Results: Data protected by PRAM
[Figure: two panels, "No bandwidth" and "Bandwidth is 2"]

Conclusion: Microaggregation
Microaggregation performs well with numerical data, but its application to categorical data needs more research.
Data protected by microaggregation contain almost the same information as the original data.
Are we sure that microaggregation is able to protect categorical data properly?

Conclusion: PRAM
PRAM seems to perform quite well when it comes to protecting data, but there are some issues to overcome.
PRAM can protect data even with small changing probabilities, because its protection is based on uncertainty of identification.
In this case our concern is information loss.
Are the protected data useful without using the probability matrix?

Forthcoming research
Include more methods: rank swapping, noise adding.
Include disclosure risk measures.
Include a more precise measurement of information loss.

Some references
Domingo-Ferrer, J., Torra, V. A Quantitative Comparison of Disclosure Control Methods for Microdata. In Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Amsterdam: North-Holland.
Gouweleeuw, J., Kooiman, P., Willenborg, L., and de Wolf, P. Post Randomisation for Statistical Disclosure Control: Theory and Implementation. Journal of Official Statistics, Vol. 14, No. 4.
Group Crises Research Reports: Microaggregation for Privacy Protection in Statistical Databases. In July

Thank You!