Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona. Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals


1 Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona
Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals
Maria Cristina Casciano, Laura Corallo, Daniela Ichim

2 Outline
Multiple releases: MFR and PUF
Subsampling
–allocation: reduce the risk of disclosure
–selection: pre-defined quality standards
Results
–Career of Doctorate Holders Survey
Further work

3 Multiple countries, multiple surveys, multiple releases
[Diagram: for each Member State (MS1, MS2, ..., MS27), each survey (SURVEY 1, SURVEY 2, ..., SURVEY X) gives rise to multiple releases: TABLES, PUF, MFR and OTHER]

4 Comparability
ESSnet on SDC harmonisation and common tools
–WP1: test the comparability concept
–Istat, Destatis, Statistics Austria
–multiple countries
HOW
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to decide when action is needed
3. Setting a process for choosing acceptable practices

5 Multiple releases
[Diagram: SURVEY 1 with its releases TABLES 1, PUF 1, MFR 1, OTHER 1]
A particular harmonisation dimension
Hierarchical structure
–Utility
–Risk of disclosure

6 Multiple releases: hierarchical structure
MFR: more restrictive license, less aggregated information
PUF: less restrictive license, more aggregated information
UNIQUE PRODUCTION PROCESS!

7 PUF-MFR
MFR
–definition of a disclosure scenario
–risk assessment R1
–risk limitation w.r.t. the adopted disclosure scenario
–some data utility requirements
PUF
–harmonized with the MFR (e.g. weighted totals)
–reduced risk of disclosure
–random sample
–internal consistency of records
–some (other) data utility requirements (CV and weighted totals: precision and accuracy)

8 Data description
CDH 2009 Survey
[Timeline: Doctorate Holders at Year t-5, Year t-3, Year t]
Focus on the characterisation of the occupational status of the PhD holders:
–labour market entry
–usefulness of the PhD for obtaining a job
–type of contract
–type of work
–earnings
–job satisfaction
Estimates by PhD scientific area, by gender and by region

9 Data description
18500 PhD Holders (Census)
72% response, 28% non-response
12964 respondents
Adjustment for non-responses via calibration: weights obtained by constraining on known marginal distributions:
–Citizenship (2 categories)
–PhD Scientific Area (14 categories)
–Gender
–Region

10 PUF-subsampling
Simple random sampling
–Utility: weighted totals may always be preserved by calibration
–Risk: how many units at risk are sampled?
Example (MFR-CDH): 12964 units, 24.7% of units at risk
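
As a rough illustration of the risk question on this slide: under simple random sampling without replacement, the number of at-risk units that land in the subsample follows a hypergeometric distribution, so on average the subsample carries the same share of at-risk units as the MFR. The sketch below uses the 12964 units and the 24.7% share from the slide; the subsample size of 5000 and all function names are assumptions made only for this example.

```python
import numpy as np

N_UNITS = 12964                       # MFR size, from the slide
N_AT_RISK = round(0.247 * N_UNITS)    # 24.7% of units at risk, from the slide
SAMPLE_SIZE = 5000                    # hypothetical PUF size (assumption)


def expected_at_risk(n_units, n_at_risk, sample_size):
    """Expected number of at-risk units in a simple random subsample."""
    return sample_size * n_at_risk / n_units


def simulate_at_risk(n_units, n_at_risk, sample_size, n_draws, seed=0):
    """Simulate the number of at-risk units over repeated SRS draws."""
    rng = np.random.default_rng(seed)
    return rng.hypergeometric(n_at_risk, n_units - n_at_risk,
                              sample_size, size=n_draws)


expected = expected_at_risk(N_UNITS, N_AT_RISK, SAMPLE_SIZE)
draws = simulate_at_risk(N_UNITS, N_AT_RISK, SAMPLE_SIZE, n_draws=2000)
print(f"expected at-risk units in the subsample: {expected:.1f}")
print(f"simulated mean: {draws.mean():.1f}, share: {draws.mean() / SAMPLE_SIZE:.3f}")
```

Simple random sampling therefore shrinks the count of at-risk units but not their proportion, which is the motivation for steering the allocation by risk.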

11 Subsampling
[Diagram of related concepts: allocation, domains, utility, disclosure, sample size, stratification, dissemination, totals, scenario, calibration, key variables, quality, users, auxiliary]

12 PUF-subsampling: proposal
1. Optimal allocation of units to be sampled in each domain according to Bethel's approach (risk minimization)
2. Selection of a fixed-size balanced sample (CUBE method) (data utility maximization)

13 1. Bethel's approach (1989)
● Cost function to minimize: [formula not preserved in the transcript]
● Expected coefficient of variation (CV) of the estimates of the total of variable P in domain j_d equal to or lower than prefixed thresholds: [formula not preserved in the transcript]
→ n_h and C_h related to the risk to be reduced
→ Optimal allocation: n_h*
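
The formulas on this slide are not reproduced in the transcript. For orientation only, a common textbook statement of Bethel's multivariate allocation problem is sketched below. The symbols H (number of strata), N_h (stratum population size), S_{h,p,d} (stratum standard deviation of variable p restricted to domain d) and U_{p,d} (CV threshold) are assumptions introduced here, not the authors' notation, and the stratified simple random sampling variance is assumed.

```latex
\begin{aligned}
\min_{n_1,\dots,n_H}\quad & C(n_1,\dots,n_H) = \sum_{h=1}^{H} C_h\, n_h \\
\text{subject to}\quad
& \mathrm{CV}\bigl(\hat{Y}_{p,d}\bigr)
  = \frac{\sqrt{\sum_{h=1}^{H} N_h^{2}\left(\frac{1}{n_h}-\frac{1}{N_h}\right) S_{h,p,d}^{2}}}{Y_{p,d}}
  \le U_{p,d}
  \quad \text{for every variable } p \text{ and domain } d, \\
& 1 \le n_h \le N_h, \qquad h = 1,\dots,H .
\end{aligned}
```

In the setting of the slides, the per-unit "cost" C_h would be tied to the disclosure risk in stratum h, so that minimizing the cost pushes the allocation n_h* away from risky strata while the CV constraints protect the precision of the estimates.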

14 2. Balanced sampling
A sampling design s is said to be balanced on the auxiliary variables if and only if the balancing equations, given by $\hat{X}_{HT} = X$, are satisfied, where X is the vector of known population totals and $\hat{X}_{HT}$ is the H.-T. estimator.
→ exact estimates for pre-defined variables
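
A minimal sketch of what the balancing equations require: the Horvitz-Thompson (H.-T.) estimate of each auxiliary total, computed from the sample alone, must reproduce the known population total. The four-unit population, the inclusion probabilities and the function names are hypothetical and serve only to make the check concrete.

```python
import numpy as np


def horvitz_thompson_totals(x, pi, sample_mask):
    """H.-T. estimate of the total of each column of x from the sampled units."""
    x = np.asarray(x, dtype=float)
    pi = np.asarray(pi, dtype=float)
    mask = np.asarray(sample_mask, dtype=bool)
    return (x[mask] / pi[mask, None]).sum(axis=0)


def is_balanced(x, pi, sample_mask, rtol=1e-9):
    """True when the H.-T. estimates equal the known population totals."""
    known_totals = np.asarray(x, dtype=float).sum(axis=0)
    return bool(np.allclose(horvitz_thompson_totals(x, pi, sample_mask),
                            known_totals, rtol=rtol))


# Toy population: a constant column (population size) and one auxiliary variable.
x = np.array([[1.0, 1.0],
              [1.0, 3.0],
              [1.0, 5.0],
              [1.0, 7.0]])
pi = np.full(4, 0.5)                                       # equal inclusion probabilities
balanced_sample = np.array([True, False, False, True])     # units 1 and 4
unbalanced_sample = np.array([True, True, False, False])   # units 1 and 2

print(horvitz_thompson_totals(x, pi, balanced_sample), is_balanced(x, pi, balanced_sample))
print(horvitz_thompson_totals(x, pi, unbalanced_sample), is_balanced(x, pi, unbalanced_sample))
```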

15 Balanced sampling: the CUBE method
Geometrically, each vertex of the hypercube is a sample. The balancing equations define a subspace of R^N named K. The problem is to choose a vertex (sample) of the N-cube that remains in the subspace of constraints K.
[Figure: cube with vertices (000), (100), (010), (110), (101), (011), (111) and the constraint subspace K]
Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector π and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C ∩ K, a sample is selected as close as possible to the constraint space K.
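
The flight phase can be sketched as a random walk that, at each step, moves along a direction lying in the kernel of the balancing matrix until some coordinate reaches 0 or 1. The code below is a small illustrative version of the general flight phase, not the authors' implementation; the function name, the tolerance and the toy data are assumptions. The landing phase is not covered.

```python
import numpy as np


def flight_phase(pi, x, rng, tol=1e-9):
    """Random walk from pi inside the cube [0, 1]^N that stays in the subspace K.

    pi : inclusion probabilities, all strictly between 0 and 1
    x  : (N, p) matrix of auxiliary variables
    Returns a vector with at most p entries that are neither 0 nor 1.
    """
    pi = np.asarray(pi, dtype=float).copy()
    a = (np.asarray(x, dtype=float) / pi[:, None]).T      # p x N, column k is x_k / pi_k
    while True:
        free = np.flatnonzero((pi > tol) & (pi < 1 - tol))  # units not yet decided
        if free.size == 0:
            break
        _, s, vt = np.linalg.svd(a[:, free], full_matrices=True)
        rank = int(np.sum(s > tol * max(1.0, s.max())))
        if rank >= free.size:        # no direction left inside K: the flight ends
            break
        u = vt[-1]                   # unit vector with a[:, free] @ u close to 0
        p_free = pi[free]
        pos, neg = u > tol, u < -tol
        # Largest steps keeping 0 <= p_free + lam1 * u <= 1 and 0 <= p_free - lam2 * u <= 1
        lam1 = min(np.min((1 - p_free[pos]) / u[pos], initial=np.inf),
                   np.min(-p_free[neg] / u[neg], initial=np.inf))
        lam2 = min(np.min(p_free[pos] / u[pos], initial=np.inf),
                   np.min(-(1 - p_free[neg]) / u[neg], initial=np.inf))
        # Pick the direction at random so that the expected move is zero
        step = lam1 if rng.random() < lam2 / (lam1 + lam2) else -lam2
        new = np.clip(p_free + step * u, 0.0, 1.0)
        new[new < tol] = 0.0
        new[new > 1 - tol] = 1.0
        pi[free] = new
    return pi


rng = np.random.default_rng(42)
N, n = 12, 4
pi0 = np.full(N, n / N)
z = np.arange(1, N + 1, dtype=float)      # hypothetical auxiliary variable
x = np.column_stack([pi0, z])             # balancing on pi0 fixes the sample size
pi_star = flight_phase(pi0, x, rng)
print(np.round(pi_star, 3))
```

Entries of `pi_star` equal to 1 are selected and entries equal to 0 are rejected; any remaining fractional entries are what the landing phase has to resolve.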

16 Implementation
1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV level of the estimates below a 5% threshold for three combinations of the allocation and domain variables
–Allocation variables: Occup, JobS, Contract, Work, Income
–Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
a. Risk R1 used as the minimization cost of the algorithm
b. Risk R1 used as a stratification variable
c. Include all units of the strata containing no units at risk

17 Allocations (CV* = 5%)
C.S  Risk.cost  Risk.strat  Cens.no.risk  # Strata  #Cens.strata  #Cens.units  Size Bethel  Size Prop.  Size Equal  Max.Bethel-Prop  Max.Bethel-Equal
1    N          Y           N             925       153           252          4933         5391        5550        459              618
2    N          Y           Y             925       214           704          5105         5547        5550        443              446
3    Y          Y           N             925       204           558          5239         5719        5550        480              311
4    Y          Y           Y             925       235           814          5330         5781        5550        451              220
5    Y          N           N             925       240           687          5555         5953        6475        399              921
6    Y          N           Y             925       269           983          5649         6094        6475        446              827
7    N          Y           N             925       306           1614         8725         9256        9250        530              524
8    N          Y           Y             925       352           1919         8827         9324        9250        498              424
9    Y          Y           N             925       416           3229         8955         9424        9250        468              294
10   Y          Y           Y             925       451           3398         9045         9511        9250        466              205
11   Y          N           N             925       426           3243         9151         9601        9250        451              100
12   Y          N           Y             925       457           3399         9222         9669        9250        446              84
13   N          Y           N             56        0             0            4745         4773        4760        138              132
14   N          Y           Y             56        28            9761         10320        10346       10360       166              630
15   Y          Y           N             56        21            5844         8812         8841        8848        189              389
16   Y          Y           Y             56        28            9761         10323        10349       10360       166              630
17   Y          N           N             28        0             0            4760         4774        4788        176              88
18   Y          N           Y             28        0             0            4759         4774        4788        176              88

18 Allocations

19 Balanced sample
Selection of samples of fixed size from the CDH survey
Utility constraints on:
–the population size N
–the optimal sample size n
–the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area
→ 18 equations
CUBE algorithm:
I. Input: the vector π is the optimal one determined by Bethel
II. Flight phase ends with no exact solution
III. Landing phase starts: selection of a sample which ensures a low difference to the balance, according to the distance between p* and p
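
One way to picture the utility constraints on this slide is as columns of a matrix of balancing variables: a constant column for the population size N, the inclusion probabilities themselves for the fixed sample size n, and one indicator column per category for each marginal distribution. The sketch below builds such a matrix for a toy population; the six units, the category labels and the function names are hypothetical, and the resulting column count is not meant to reproduce the 18 equations of the CDH application.

```python
import numpy as np


def one_hot(labels):
    """Indicator matrix (one column per category) and the sorted categories."""
    labels = np.asarray(labels)
    cats = np.unique(labels)
    return (labels[:, None] == cats[None, :]).astype(float), cats


def balancing_variables(pi, factors):
    """Stack the balancing variables: constant (N), pi (n) and category indicators."""
    pi = np.asarray(pi, dtype=float)
    cols = [np.ones((pi.size, 1)), pi[:, None]]
    names = ["N", "n"]
    for name, labels in factors.items():
        indicators, cats = one_hot(labels)
        cols.append(indicators)
        names += [f"{name}={c}" for c in cats]
    return np.hstack(cols), names


pi = np.full(6, 0.5)                                   # toy inclusion probabilities
factors = {
    "gender": ["F", "M", "F", "M", "F", "M"],          # hypothetical labels
    "area": ["A", "A", "B", "B", "C", "C"],            # hypothetical labels
}
x, names = balancing_variables(pi, factors)
for name, total in zip(names, x.sum(axis=0)):
    print(f"{name:10s} population total = {total:g}")
```

The indicator columns of one variable sum to the constant column, so the equations are not all independent; a flight phase that works in the kernel of the balancing matrix is unaffected by this redundancy.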

20 Results
Median of absolute relative errors
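
The slide names the quality measure but its definition is not in the transcript. A plausible reading, assumed here, is the median over a set of estimates of |PUF estimate - MFR estimate| / |MFR estimate|. The function name and the three toy estimates are illustrative only.

```python
import numpy as np


def median_abs_relative_error(puf_estimates, mfr_estimates):
    """Median of |PUF - MFR| / |MFR| over a set of estimates (assumed definition)."""
    puf = np.asarray(puf_estimates, dtype=float)
    mfr = np.asarray(mfr_estimates, dtype=float)
    return float(np.median(np.abs(puf - mfr) / np.abs(mfr)))


mfr = [100.0, 200.0, 400.0]   # hypothetical reference estimates from the MFR
puf = [110.0, 190.0, 400.0]   # hypothetical estimates from the subsample
mare = median_abs_relative_error(puf, mfr)
print(f"median absolute relative error: {mare:.3f}")
```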

21 Results
C.S  Risk.cost  Risk.strat  Cens.no.risk  Risk  Occup  JobS  Contract  Work  Income
1    N          Y           N             1366  [0.88 0.97 0.99: two of the five values missing in the transcript]
2    N          Y           Y             1333  0.92   0.99  0.94      0.97  0.99
3    Y          Y           N             1335  [0.92 0.98 0.95 0.99: one of the five values missing in the transcript]
4    Y          Y           Y             1354  0.87   0.99  0.95      0.97  0.99
5    Y          N           N             1490  [0.86 0.98 0.97 0.98: one of the five values missing in the transcript]
6    Y          N           Y             1525  0.91   0.98  0.95      0.97  0.99
7    N          Y           N             2194  0.83   0.91  0.99      0.97  1.00
8    N          Y           Y             2177  0.56   0.81  0.99      0.94  0.99
9    Y          Y           N             2149  0.78   0.91  0.99      0.91  1.00
10   Y          Y           Y             2163  0.64   0.88  0.97      0.95  0.99
11   Y          N           N             2232  0.63   0.87  0.99      0.86  1.00
12   Y          N           Y             2233  0.55   0.78  0.96      0.94  0.99
13   N          Y           N             1272  0.96   0.99  0.92      0.96  0.98
14   N          Y           Y             559   0.52   0.79  0.41      0.83  0.98
15   Y          Y           N             564   0.77   0.94  0.93      0.97  0.99
16   Y          Y           Y             562   0.56*  0.84  0.59      0.88  0.99
17   Y          N           N             1270  [0.95 0.99 0.98 0.99: one of the five values missing in the transcript]
18   Y          N           Y             1247  0.91   0.99  0.98      0.99  0.98

22 Further work
1. The relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design
2. The introduction of a utility-priority approach into the way the balancing equations are dealt with
3. The use of other data utility constraints, to be investigated

