Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona Sampling as a way to reduce risk and create a Public Use File maintaining weighted.

Presentation transcript:

Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona
Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals
Maria Cristina Casciano, Laura Corallo, Daniela Ichim

Outline
Multiple releases: MFR and PUF
Subsampling
– allocation: reduce the risk of disclosure
– selection: pre-defined quality standards
Results
– Career of Doctorate Holders Survey
Further work

Multiple …
[Diagram: three harmonisation dimensions. Multiple countries (MS1, MS2, ..., MS27); multiple surveys (SURVEY 1, SURVEY 2, ..., SURVEY X); multiple releases for each survey (TABLES, PUF, MFR, OTHER).]

Comparability
ESSnet on SDC harmonisation and common tools
– WP1: test the comparability concept
– Istat, Destatis, Statistics Austria
– multiple countries
How:
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to determine when action is needed
3. Setting a process for choosing acceptable practices

Multiple releases
SURVEY 1: TABLES 1, PUF 1, MFR 1, OTHER 1
A particular harmonisation dimension
Hierarchical structure
– Utility
– Risk of disclosure

Multiple releases: hierarchical structure
MFR: more restrictive license, less aggregated information
PUF: less restrictive license, more aggregated information
UNIQUE PRODUCTION PROCESS!

PUF-MFR
MFR
– definition of a disclosure scenario
– risk assessment R1
– risk limitation w.r.t. the adopted disclosure scenario
– some data utility requirements
PUF
– harmonized with the MFR (e.g. weighted totals)
– reduced risk of disclosure
– random sample
– internal consistency of records
– some (other) data utility requirements (CV and weighted totals: precision and accuracy)

Data description
CDH 2009 Survey (Career of Doctorate Holders)
[Timeline: Year t-5, Year t-3, Year t]
Focus on the characterisation of the occupational status of the PhD holders:
– labour market entry
– usefulness of the PhD for obtaining a job
– type of contract
– type of work
– earnings
– job satisfaction
Estimates by PhD scientific area, by gender and by region

Data description
PhD Holders (Census): 72% respondents, 28% non-respondents
Adjustment for non-response via calibration
Weights obtained by constraining on known marginal distributions:
– Citizenship (2 categories)
– PhD Scientific Area (14 categories)
– Gender
– Region

PUF-subsampling
Simple random sampling
– Utility: weighted totals may always be preserved by calibration
– Risk: how many units at risk are sampled?
Example (MFR-CDH): … units, 24.7% of units at risk
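The utility and risk points of this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the arrays `weights`, `category` and `at_risk` and the function name are hypothetical, the data are simulated, and post-stratification on a single variable stands in for the more general calibration mentioned on the slide.

```python
import numpy as np


def srs_subsample_calibrated(weights, category, n, rng):
    """Draw a simple random subsample and rescale its weights so that the
    weighted total of every category matches the full file
    (post-stratification, the simplest form of calibration)."""
    weights = np.asarray(weights, dtype=float)
    category = np.asarray(category)
    N = weights.size
    idx = rng.choice(N, size=n, replace=False)      # SRS without replacement
    w_sub = weights[idx] * N / n                    # design-based expansion
    for c in np.unique(category):
        target = weights[category == c].sum()       # weighted total to preserve
        in_c = category[idx] == c
        if not in_c.any():
            raise ValueError(f"category {c!r} is empty in the subsample")
        w_sub[in_c] *= target / w_sub[in_c].sum()   # calibration factor
    return idx, w_sub


# Hypothetical toy data, not taken from the CDH survey.
rng = np.random.default_rng(0)
N = 200
weights = rng.uniform(1.0, 3.0, size=N)
category = np.arange(N) % 3
at_risk = rng.random(N) < 0.25

idx, w_sub = srs_subsample_calibrated(weights, category, n=50, rng=rng)
# Share of the units at risk that ended up in the subsample.
share_at_risk_sampled = at_risk[idx].sum() / at_risk.sum()
```

Under simple random sampling the expected share of units at risk in the subsample equals the share in the full file, which is why the proposal that follows moves to an allocation that takes risk into account.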

Subsampling
[Diagram of keywords: allocation, domains, utility, disclosure, sample size, stratification, dissemination, totals, scenario, calibration, key variables, quality, users, auxiliary]

PUF-subsampling: proposal
1. Optimal allocation of the units to be sampled in each domain according to Bethel's approach (risk minimization)
2. Selection of a fixed-size balanced sample with the CUBE method (data utility maximization)

1. Bethel's approach (1989)
● Minimize a cost function
● Constraint: the expected coefficient of variation (CV) of the estimates of the total of variable P in domain j_d must be equal to or lower than prefixed thresholds
→ n_h and C_h are related to the risk to be reduced
→ Optimal allocation: n_h*
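The formulas of this slide are not in the transcript. As a hedged reminder, the allocation problem in Bethel's approach is commonly written in the following form for stratified simple random sampling without replacement. The notation is an assumption and may differ from the slide: N_h and n_h are the population and sample sizes of stratum h, C_h is the unit cost in stratum h, S^2_{h,p,d} is the stratum variance of variable p restricted to domain d, and CV*_{p,d} is the prefixed threshold.

```latex
\min_{n_1,\dots,n_H}\; C(n_1,\dots,n_H) = \sum_{h=1}^{H} C_h\, n_h
\quad\text{subject to}\quad
CV(\hat{Y}_{p,d}) = \frac{\sqrt{Var(\hat{Y}_{p,d})}}{Y_{p,d}} \le CV^{*}_{p,d}
\;\;\text{for all } p, d,
\qquad
Var(\hat{Y}_{p,d}) = \sum_{h=1}^{H} N_h^{2}\left(\frac{1}{n_h}-\frac{1}{N_h}\right) S^{2}_{h,p,d}
```

In the presentation the costs C_h are tied to the disclosure risk, so minimizing the cost function amounts to sampling as few risky units as the CV constraints allow.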

2. Balanced sampling
A sampling design is said to be balanced on the auxiliary variables x if and only if the balancing equations

$\hat{X}_{HT} = \sum_{k \in s} x_k / \pi_k = X$

are satisfied for every sample s, where X is the vector of known population totals and $\hat{X}_{HT}$ is the Horvitz-Thompson (H.-T.) estimator
→ exact estimates for pre-defined variables
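The balancing equations can be checked numerically for a given sample. The sketch below is only an illustration on a hypothetical population of four units; the function names are not from the presentation.

```python
import numpy as np


def ht_totals(x, pi, sample):
    """Horvitz-Thompson estimates of the totals of the columns of x."""
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    sample = np.asarray(sample)
    return (x[sample] / pi[sample][:, None]).sum(axis=0)


def is_balanced(x, pi, sample):
    """True when the sample reproduces the population totals of x."""
    totals = np.asarray(x, dtype=float).sum(axis=0)
    return bool(np.allclose(ht_totals(x, pi, sample), totals))


# Hypothetical population: 4 units, inclusion probability 0.5 each.
pi = np.array([0.5, 0.5, 0.5, 0.5])
# Balancing on pi itself (first column) fixes the sample size.
x = np.column_stack([pi, [1.0, 2.0, 2.0, 1.0]])

balanced = is_balanced(x, pi, [0, 1])     # H.-T. totals (2, 6) match the true totals
unbalanced = is_balanced(x, pi, [0, 3])   # H.-T. totals (2, 4): second equation fails
```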

Balanced sampling: the CUBE method
Geometrically, each vertex of the N-cube is a sample. The balancing equations define a subspace of R^N named K. The problem is to choose a vertex (sample) of the N-cube that remains in the subspace of constraints K.
[Figure: cube with vertices (000), (100), (010), (110), (101), (011), (111) and the constraint subspace K]
Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector π of inclusion probabilities and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C ∩ K, a sample is selected as close as possible to the constraint space K.
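The flight phase can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' code or a production implementation: it recomputes a null-space direction with a full SVD at every step (slow for large N), the function name and toy data are hypothetical, and the landing phase is not included.

```python
import numpy as np


def flight_phase(pi, x, rng, tol=1e-9):
    """Random walk from pi inside the cube that preserves the balancing equations."""
    pi0 = np.asarray(pi, dtype=float)
    A = (np.asarray(x, dtype=float) / pi0[:, None]).T   # column k is x_k / pi_k
    p = pi0.copy()
    while True:
        free = np.flatnonzero((p > tol) & (p < 1 - tol))  # units not yet decided
        if free.size == 0:
            break
        _, s, vt = np.linalg.svd(A[:, free])
        rank = int(np.sum(s > 1e-10 * max(s[0], 1.0)))
        if rank >= free.size:
            break                                   # no direction left inside K
        u = vt[-1]                                  # A[:, free] @ u is (numerically) 0
        pos, neg = u > tol, u < -tol
        pf = p[free]
        # Largest steps in directions +u and -u that keep every coordinate in [0, 1].
        lam1 = min(np.min((1 - pf[pos]) / u[pos], initial=np.inf),
                   np.min(pf[neg] / -u[neg], initial=np.inf))
        lam2 = min(np.min(pf[pos] / u[pos], initial=np.inf),
                   np.min((1 - pf[neg]) / -u[neg], initial=np.inf))
        # These probabilities make the step a martingale, so E[p] stays equal to pi.
        if rng.random() < lam2 / (lam1 + lam2):
            pf = pf + lam1 * u
        else:
            pf = pf - lam2 * u
        pf[pf < tol] = 0.0                          # snap to the faces of the cube
        pf[pf > 1 - tol] = 1.0
        p[free] = pf
    return p


# Hypothetical toy population: fixed sample size plus one auxiliary variable.
rng = np.random.default_rng(1)
N = 20
pi = np.full(N, 0.5)
x = np.column_stack([pi, rng.uniform(1.0, 10.0, size=N)])
p_star = flight_phase(pi, x, rng)
```

At the end of the flight phase at most as many coordinates of `p_star` as there are balancing equations are still strictly between 0 and 1; deciding those remaining units is the job of the landing phase.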

Implementation
1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV of the estimates below a 5% threshold, for three combinations of the allocation and domain variables
– Allocation variables: Occup, JobS, Contract, Work, Income
– Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
a. Risk R1 used as the minimization cost of the algorithm
b. Risk R1 used as a stratification variable
c. Include all units of the strata containing no units at risk

Allocations (CV* = 5%)
Table columns: C.S | Risk.cost | Risk.strat | Cens.no.risk | # Strata | # Cens.strata | # Cens.units | Size Bethel | Size Prop. | Size Equal | Max.Bethel-Prop | Max.Bethel-Equal
Settings (Risk.cost, Risk.strat, Cens.no.risk), repeated in three blocks of six rows, the first block labelled C.S 1:
N Y N
N Y Y
Y Y N
Y Y Y
Y N N
Y N Y
[Numeric cells not recovered]

Allocations
[Figure]

Balanced sample
Selection of samples of fixed size from the CDH survey
Utility constraints on:
– the population size N
– the optimal sample size n
– the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area
→ 18 equations
CUBE algorithm:
I. The input vector π is the optimal one determined by Bethel
II. The flight phase ends with no exact solution
III. The landing phase starts: selection of a sample which ensures a low difference to the balance, according to the distance between p* and p

Results
Median of absolute relative errors
[Figure]
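The quality measure named on this slide can be computed as in the following sketch. It assumes that the errors compare estimates from replicated subsamples with a reference value; the transcript does not state which reference the authors used, and the function name and numbers are hypothetical.

```python
import numpy as np


def median_abs_rel_error(estimates, reference):
    """Median over replications of |estimate - reference| / |reference|."""
    estimates = np.asarray(estimates, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.median(np.abs(estimates - reference) / np.abs(reference)))


# Hypothetical estimates of a total from three replicated subsamples,
# compared with a reference total of 100.
mare = median_abs_rel_error([110.0, 90.0, 105.0], 100.0)
```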

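Results
Table columns: C.S | Risk.cost | Risk.strat | Cens.no.risk | Risk | Occup | JobS | Contract | Work | Income
Settings (Risk.cost, Risk.strat, Cens.no.risk), repeated in three blocks of six rows, the first block labelled C.S 1; the Y Y Y row of the third block is marked with *:
N Y N
N Y Y
Y Y N
Y Y Y
Y N N
Y N Y
[Numeric cells not recovered]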

Further work
1. The relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design
2. The introduction of a utility-priority approach into the way the balancing equations are dealt with
3. The use of other data utility constraints, to be investigated