Published by Jonathon Wherry. Modified over 2 years ago.

1
SDC for continuous variables under edit restrictions
Natalie Shlomo & Ton de Waal
UN/ECE Work Session on Statistical Data Editing, Bonn, September 2006

2
Contents
  The problem
  Evaluation data
  SDC techniques
    Additive noise
    Microaggregation
    Rounding
    Rank swapping
  Conclusions

3
The problem
  Statistical disclosure control (SDC): microdata need to be protected against disclosure before release
  Several SDC techniques are available for continuous microdata
    They do not take edit constraints into account
    Inconsistent microdata lead to loss of utility and point potential intruders to the protected data
  Problem: extend SDC techniques for continuous microdata to take edit constraints into account
    Micro edits – record-level inconsistencies
    Macro edits – overall loss of utility (bias and variance)

4
Evaluation data
  2000 Israel Income Survey with three continuous variables (gross, net and tax) and one control variable (age)
  32,896 individuals, of which 16,232 earned income from salaries
  Edits:
    E1a: gross ≥ 0
    E1b: net ≥ 0
    E1c: tax ≥ 0
    E2: IF age ≤ 17 THEN gross ≤ 6,910
    E3: net + tax = gross
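The edits above can be expressed as a small record-level check. This is only a sketch: the `Record` class, the `failed_edits` function and the tolerance used for the equality edit E3 are illustrative assumptions, not part of the original presentation.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One survey record (hypothetical container for illustration)."""
    age: int
    gross: float
    net: float
    tax: float


def failed_edits(rec, tol=1e-9):
    """Return the labels of the edits that the record violates."""
    failures = []
    if rec.gross < 0:
        failures.append("E1a")
    if rec.net < 0:
        failures.append("E1b")
    if rec.tax < 0:
        failures.append("E1c")
    if rec.age <= 17 and rec.gross > 6910:
        failures.append("E2")
    # E3 is an equality edit; a small tolerance guards against float rounding.
    if abs(rec.net + rec.tax - rec.gross) > tol:
        failures.append("E3")
    return failures


consistent = Record(age=35, gross=8000.0, net=6000.0, tax=2000.0)
inconsistent = Record(age=16, gross=7500.0, net=7000.0, tax=400.0)

consistent_failures = failed_edits(consistent)      # no edit fails
inconsistent_failures = failed_edits(inconsistent)  # E2 and E3 fail
```

Counting the records for which `failed_edits` is non-empty, before and after perturbation, gives failure counts of the kind reported on the following slides.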

5
Additive noise
  Generate a random value and add it to the value to be protected
  The random value can be drawn in several ways, depending on
    whether the aim is to preserve variances or not
    whether a single variable or multiple variables are perturbed

6
Additive noise for a single variable using standard approach
  Adding standard noise: perturb Y as follows
    $Y^* = Y + e$, with $e$ drawn from $N(0, \sigma^2)$
    Adding random noise to gross with $\sigma^2 = 0.2 \times \mathrm{Var}(\mathrm{gross})$ resulted in 1,685 failures of E1 and 119 failures of E2
  Adding standard noise in groups
    Define 5 equal groupings (quintiles) by sorting
    Applying the above method within each group resulted in 66 failures of E1 and no failures of E2
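A minimal sketch of both variants, assuming NumPy and synthetic lognormal incomes in place of the survey data. The function names and the `ratio` parameter (the 0.2 in the slide) are illustrative choices; in the grouped variant the noise variance is taken relative to the variance within each quintile, which is one possible reading of "applying the above method within each group".

```python
import numpy as np


def add_noise(y, ratio, rng):
    """Y* = Y + e with e ~ N(0, ratio * Var(Y))."""
    y = np.asarray(y, dtype=float)
    sigma = np.sqrt(ratio * y.var())
    return y + rng.normal(0.0, sigma, size=y.shape)


def add_noise_in_quintiles(y, ratio, rng):
    """Sort the values, split them into 5 equal groups, perturb each group separately."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(y, kind="stable")
    out = np.empty_like(y)
    for group in np.array_split(order, 5):
        out[group] = add_noise(y[group], ratio, rng)
    return out


rng = np.random.default_rng(0)
gross = rng.lognormal(mean=8.5, sigma=0.6, size=10_000)  # synthetic stand-in data

plain = add_noise(gross, 0.2, rng)
grouped = add_noise_in_quintiles(gross, 0.2, rng)

# Failures of the non-negativity edit after perturbation.
negatives_plain = int((plain < 0).sum())
negatives_grouped = int((grouped < 0).sum())
```

Because the within-quintile variance is much smaller than the overall variance, the grouped version adds far less noise to low incomes and is therefore less likely to push them below zero.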

7
Additive noise for a single variable using correlated noise
  Perturb value Y as follows (Natalie’s trick):
    $Y^* = d_1 Y + d_2 e$, with $d_1 = (1 - \delta^2)^{1/2}$ and $d_2 = \delta$ for a positive parameter $\delta$ ($\delta \le 1$, so that $d_1$ is real)
    $e$ drawn from $N\left(\frac{1 - d_1}{d_2}\,\mathrm{mean}(Y),\ \mathrm{Var}(Y)\right)$
  Note that
    $E(Y^*) = E(d_1 Y) + E(d_2 e) = E(Y)$
    $\mathrm{Var}(Y^*) = (1 - \delta^2)\,\mathrm{Var}(Y) + \delta^2\,\mathrm{Var}(Y) = \mathrm{Var}(Y)$
  Linear equations are preserved
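The formulas above translate almost directly into code. The sketch below assumes NumPy and synthetic data; the function name `correlated_noise` and the sample size are illustrative, and the mean and variance are preserved only up to sampling error.

```python
import numpy as np


def correlated_noise(y, delta, rng):
    """Y* = d1*Y + d2*e with d1 = sqrt(1 - delta^2), d2 = delta."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    y = np.asarray(y, dtype=float)
    d1 = np.sqrt(1.0 - delta ** 2)
    d2 = delta
    # The noise mean is shifted so that E(Y*) = E(Y).
    noise_mean = (1.0 - d1) / d2 * y.mean()
    e = rng.normal(noise_mean, y.std(), size=y.shape)
    return d1 * y + d2 * e


rng = np.random.default_rng(1)
gross = rng.lognormal(mean=8.5, sigma=0.6, size=10_000)  # synthetic stand-in data
protected_gross = correlated_noise(gross, delta=0.3, rng=rng)
```

A larger `delta` gives more protection (more weight on the noise) while the expected mean and variance stay those of the original variable.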

8
Additive noise for multiple variables and linear programming
  Perturb each variable $Y_i$ separately, resulting in $Y_i^*$
  Adjust the perturbed values $Y_i^*$ slightly so that all edits become satisfied (LP-trick)
    Minimize $\sum_i |Y_i^* - Y_{i,\mathrm{final}}|$ subject to the edit constraints
    $Y_{i,\mathrm{final}}$ are the final perturbed values
    The problem is a simple linear programming problem
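One way to set up the LP for a single record, sketched with `scipy.optimize.linprog`. The absolute values are linearised with auxiliary variables $t_i \ge |Y_i^* - Y_{i,\mathrm{final}}|$; this formulation, the function name `adjust_to_edits`, the variable order (net, tax, gross) and the optional `gross_cap` argument (for edit E2) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog


def adjust_to_edits(perturbed, gross_cap=None):
    """Minimally adjust perturbed (net, tax, gross) so that the edits hold.

    Decision vector: [net, tax, gross, t_net, t_tax, t_gross], where each t
    bounds the absolute adjustment of the corresponding variable.
    """
    y_star = np.asarray(perturbed, dtype=float)
    n = y_star.size
    identity = np.eye(n)
    cost = np.concatenate([np.zeros(n), np.ones(n)])  # minimise sum of t
    # y - t <= y*  and  -y - t <= -y*  together give  |y - y*| <= t
    a_ub = np.block([[identity, -identity], [-identity, -identity]])
    b_ub = np.concatenate([y_star, -y_star])
    # E3: net + tax - gross = 0
    a_eq = np.array([[1.0, 1.0, -1.0, 0.0, 0.0, 0.0]])
    b_eq = np.array([0.0])
    # E1a-E1c: non-negativity; E2: optional upper bound on gross
    bounds = [(0, None), (0, None), (0, gross_cap)] + [(0, None)] * n
    result = linprog(cost, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n], result.fun


# net + tax = 7,050 but gross = 7,200: the record fails E3 by 150.
final_values, total_adjustment = adjust_to_edits([5100.0, 1950.0, 7200.0])

# Same record for someone aged 17 or younger: gross must also be <= 6,910.
final_minor, total_adjustment_minor = adjust_to_edits(
    [5100.0, 1950.0, 7200.0], gross_cap=6910.0
)
```

The optimal solution is generally not unique (the 150 can be spread over the three variables in several ways), so only the total adjustment is determined by the LP.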

9
Additive noise for multiple variables using correlated noise
  Perturb vector Y by applying Natalie’s trick
    $Y^* = d_1 Y + d_2 e$, with $d_1 = (1 - \delta^2)^{1/2}$ and $d_2 = \delta$ for a positive parameter $\delta$
    $e$ drawn from $N\left(\frac{1 - d_1}{d_2}\,\mathrm{mean}(Y),\ \mathrm{Var}(Y)\right)$
    mean(Y) is the mean vector of Y; Var(Y) is the covariance matrix of Y
  Means, covariances and equations are again preserved
    $E(Y^*) = E(Y)$
    $E(\mathrm{Var}(Y^*)) = E(\mathrm{Var}(Y))$
    Linear equations are preserved
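A sketch of the multivariate case for (net, tax, gross), assuming NumPy and synthetic data. Since gross = net + tax holds exactly, the 3×3 covariance matrix is singular; to avoid numerical issues the sketch draws the noise for (net, tax) from their 2×2 covariance and sets the gross noise to the sum, which corresponds to drawing from the degenerate three-dimensional normal. That shortcut and the function name are assumptions for illustration.

```python
import numpy as np


def correlated_noise_income(net, tax, delta, rng):
    """Apply Y* = d1*Y + d2*e to (net, tax, gross) with gross = net + tax."""
    y = np.column_stack([net, tax]).astype(float)
    d1 = np.sqrt(1.0 - delta ** 2)
    d2 = delta
    noise_mean = (1.0 - d1) / d2 * y.mean(axis=0)
    e = rng.multivariate_normal(noise_mean, np.cov(y, rowvar=False), size=len(y))
    # The noise satisfies the same linear equation as the data.
    e_full = np.column_stack([e, e.sum(axis=1)])
    y_full = np.column_stack([y, y.sum(axis=1)])
    return d1 * y_full + d2 * e_full


rng = np.random.default_rng(2)
n = 20_000
net = rng.lognormal(mean=8.3, sigma=0.5, size=n)   # synthetic stand-in data
tax = net * rng.uniform(0.1, 0.4, size=n)
original = np.column_stack([net, tax, net + tax])

protected = correlated_noise_income(net, tax, delta=0.3, rng=rng)
```

Because both the data and the noise satisfy net + tax = gross, any linear combination of them does as well, so edit E3 holds for every perturbed record. The non-negativity edits are not guaranteed by this construction.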

10
Microaggregation
  Replace the value to be protected by the average value in a small group
  Reduction in variance due to elimination of “within” variance
  Microaggregation can be applied in several ways:
    Standard version of microaggregation
    Microaggregation followed by adding noise (to preserve original variance) and using linear programming to ensure preservation of linear equations (LP-trick)
    Microaggregation followed by adding correlated noise to ensure preservation of linear equations (Natalie’s trick)
      Avoids the need for the LP-trick but does not raise the variance to the expected level
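The standard version can be sketched as follows, assuming univariate microaggregation on sorted values with NumPy. The function name and the rule for forming groups (each group holds at least `k` and fewer than `2k` records) are illustrative assumptions.

```python
import numpy as np


def microaggregate(y, k):
    """Replace each value by the mean of its group of at least k neighbours in rank."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(y, kind="stable")
    out = np.empty_like(y)
    n_groups = max(len(y) // k, 1)
    for group in np.array_split(order, n_groups):
        out[group] = y[group].mean()
    return out


incomes = np.array([10.0, 12.0, 50.0, 11.0, 52.0, 51.0])
aggregated = microaggregate(incomes, k=3)
# Groups {10, 11, 12} and {50, 51, 52} are replaced by 11 and 51:
# aggregated is [11, 11, 51, 11, 51, 51]
```

The overall mean is unchanged, while the variance drops by exactly the "within" variance of the groups, which is why the later variants add noise back.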

11
Results for microaggregation
  Var.    SD original   SD microaggregation   SD random noise   SD and LP   SD and Natalie’s trick
  tax     2,119         2,082                 2,115             2,103       2,091
  net     5,137         5,114                 5,134             5,129       5,119
  gross   7,181, 7,174, 7,171 (only three of the five values survive in the source; their columns are not recoverable)

12
Rounding
  Round the value to be protected to a multiple of the rounding base
  Rounding can be applied in several ways:
    Random rounding
    Controlling totals and additivity
    Controlling totals and additivity, and selecting all rounded values within a base of the original value

13
Random rounding
  Univariate rounding with rounding base b
  $\mathrm{res}(X) = X - b \lfloor X / b \rfloor$, i.e. X minus the largest multiple of b not exceeding X
  Round X up with probability res(X)/b and down with probability 1 - res(X)/b
  The expected rounding error is zero
  In expectation, totals are preserved
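A short sketch of unbiased random rounding, assuming NumPy and synthetic integer-valued data; the function name `random_round` is illustrative. Rounding up adds b - res(X) with probability res(X)/b and rounding down subtracts res(X) otherwise, so the expected change is zero.

```python
import numpy as np


def random_round(x, base, rng):
    """Round each value to a multiple of base, up with probability res/base."""
    x = np.asarray(x, dtype=float)
    lower = np.floor(x / base) * base
    res = x - lower
    # Values that are already multiples of base (res == 0) are never moved.
    round_up = rng.random(x.shape) < res / base
    return lower + base * round_up


rng = np.random.default_rng(3)
values = rng.integers(0, 20_000, size=100_000).astype(float)  # synthetic data
rounded = random_round(values, base=10, rng=rng)

total_difference = rounded.sum() - values.sum()  # small relative to the total, but not zero
```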

14
Random rounding: controlling totals
  Select a fraction res(X)/b of the entries at random to be rounded upward and round the rest downward
    The total is exactly preserved
  gross is calculated as the sum of rounded tax and rounded net
    gross may jump a base
    Apply a reshuffling algorithm to correct this
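One possible reading of the controlled variant, sketched with NumPy on synthetic integer data. The assumption here is that the number of entries rounded up equals the sum of the residuals divided by b, rounded to the nearest integer, and that those entries are chosen uniformly at random among entries with a non-zero residual. Under that assumption the rounded total differs from the original total by at most b/2. The reshuffling algorithm mentioned on the slide is not specified in the source and is not implemented.

```python
import numpy as np


def controlled_round(x, base, rng):
    """Round to multiples of base while keeping the total within base/2."""
    x = np.asarray(x, dtype=np.int64)
    lower = (x // base) * base
    res = x - lower
    n_up = int(round(res.sum() / base))  # how many entries must go up
    rounded = lower.copy()
    if n_up > 0:
        candidates = np.flatnonzero(res > 0)
        up = rng.choice(candidates, size=n_up, replace=False)
        rounded[up] += base
    return rounded


rng = np.random.default_rng(4)
net = rng.integers(0, 20_000, size=1_000)   # synthetic stand-in data
tax = rng.integers(0, 5_000, size=1_000)
gross = net + tax

base = 10
net_rounded = controlled_round(net, base, rng)
tax_rounded = controlled_round(tax, base, rng)
gross_rounded = net_rounded + tax_rounded   # additivity holds by construction

# Records where the derived gross is a full base or more away from the original.
jumped = int((np.abs(gross_rounded - gross) >= base).sum())
```

The `jumped` count shows why a correction step is needed: each component stays within a base of its original value, but their sum can drift by up to two bases.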

15
Results for rounding
  Var.    Total         Diff. random rounding   Diff. controlled totals   Diff. controlled totals and in base
  tax     25,443,623    -787                    -7                        -87
  net     86,724,755    -575                    -15                       -105
  gross   112,168,378   -1,362                  -22                       -192

16
Rank swapping
  Sort the variable to be protected and construct groupings, select random pairs in each group and swap values between pairs
  Different group sizes lead to different results
  Evaluation criteria:
    $AD = \sum_i |X_{i,\mathrm{orig}} - X_{i,\mathrm{pert}}| / n_r$, where i is a cell in age group (14) x sex (2) x income group (22)
    $BV = \sum_j n_j (\bar{X}_j - \bar{X})^2 / (p - 1)$, with $j = 1, \ldots, p$ in age group (14) x sex (2), where $\bar{X}_j$ is the average of X in group j and $\bar{X}$ the overall average
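A sketch of rank swapping and of the BV criterion, assuming NumPy and synthetic data. The pairing rule (a random permutation within each group of consecutive ranks, paired off two by two, with an odd record left unchanged) and the function names are assumptions; the slide does not specify these details.

```python
import numpy as np


def rank_swap(values, group_size, rng):
    """Swap values between random pairs inside groups of consecutive ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    swapped = values.copy()
    for start in range(0, len(values), group_size):
        members = rng.permutation(order[start:start + group_size])
        # Pair members two by two; with an odd count the last one keeps its value.
        for a, b in zip(members[0::2], members[1::2]):
            swapped[a], swapped[b] = values[b], values[a]
    return swapped


def between_variance(x, groups):
    """BV = sum_j n_j * (mean_j - mean)^2 / (p - 1) over the p groups."""
    x = np.asarray(x, dtype=float)
    labels, inverse = np.unique(groups, return_inverse=True)
    n_j = np.bincount(inverse)
    mean_j = np.bincount(inverse, weights=x) / n_j
    return float(np.sum(n_j * (mean_j - x.mean()) ** 2) / (len(labels) - 1))


rng = np.random.default_rng(5)
income = rng.lognormal(mean=8.5, sigma=0.6, size=1_000)  # synthetic stand-in data
age_group = rng.integers(0, 14, size=1_000)              # synthetic grouping

income_swapped = rank_swap(income, group_size=10, rng=rng)
bv_ratio = between_variance(income_swapped, age_group) / between_variance(income, age_group)
```

Swapping only permutes values, so the marginal distribution of the variable is unchanged; what changes is its relationship with other variables, which is what `bv_ratio` measures.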

17
Results for rank swapping
                                                 Groupings of 10   Groupings of 20
  Number and percent of cells with differences   106 (22%)         166 (34%)
  AD                                             0.224             0.338
  Ratio of BV_pert / BV_orig                     -0.03%            0.58%

18
Conclusion
  Standard perturbation methods can be extended so that they take (micro and macro) edit constraints into account
  The “best” method to protect a data set is to some extent a subjective choice
    It must provide protection against disclosure risk according to a tolerable risk threshold
    It must provide fit-for-purpose data according to the needs of users
