WP 33 Information Loss Measures for Frequency Tables Natalie Shlomo University of Southampton Office for National Statistics n.shlomo@soton.ac.uk Caroline Young University of Southampton Office for National Statistics cjy@soton.ac.uk

Topics of Discussion
1. Introduction
2. Methods for perturbing frequency tables containing whole population counts
3. Information loss measures for assessing the impact of SDC methods on utility and quality
4. Data description and definition of tables
5. Examples and analysis of results
6. Conclusions and future research

Introduction
1. Focus on frequency tables containing whole population counts: the UK Neighbourhood Statistics (NeSS) website, which disseminates small area statistics from census and administrative data
2. Tables are intentionally perturbed for statistical disclosure control (SDC), causing information loss
3. Develop quantitative information loss measures for choosing optimal SDC methods that preserve high utility in the tables
4. Information loss depends on the SDC method, the characteristics of the table and the use of the data

SDC Methods for Frequency Tables
SDC for frequency tables containing population counts:
- Small Cell Adjustments (SCA): random rounding to base 3 of small cells. The perturbation has a mean of zero and a variance of 2. Marginal totals are obtained by adding perturbed and non-perturbed cells.
- Full Random Rounding (RaRo): random rounding to base 3 for all entries, using the same method described above after converting all entries to residuals of 3. Marginal totals are rounded separately, so the tables are not additive. Utility can be improved by semi-controlling for marginal totals.
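As an illustration of the rounding step, here is a minimal Python sketch of unbiased random rounding to base 3. The function names are hypothetical, and treating "small cells" as counts of 1 and 2 is an assumption taken from the cell suppression description. A cell with residual r (its count modulo 3) is rounded up with probability r/3 and down otherwise, which gives a perturbation with mean zero and variance 2 for residuals 1 and 2.

```python
import random


def random_round_base3(count, rng=random):
    """Unbiased random rounding of a non-negative integer count to base 3."""
    residual = count % 3
    if residual == 0:
        return count  # already a multiple of 3, left unchanged
    # Round up with probability residual/3, down otherwise,
    # so that the expected value of the result equals the original count.
    if rng.random() < residual / 3:
        return count + (3 - residual)
    return count - residual


def small_cell_adjust(count, rng=random):
    """Round only small cells (assumed here to be counts of 1 and 2)."""
    return random_round_base3(count, rng) if count in (1, 2) else count
```

For a residual of 1 the perturbation is +2 with probability 1/3 and -1 with probability 2/3, so the mean is 0 and the variance is 4/3 + 2/3 = 2; the residual of 2 case is symmetric.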

SDC Methods for Frequency Tables
SDC for frequency tables containing population counts (cont.):
- Controlled Rounding (Cr(3)): all entries are rounded to base 3 according to the solution of a linear programming problem, ensuring that the aggregated rounded internal cells equal the rounded margins. Controlled rounding is carried out via Tau-Argus (the standard tool for NeSS tables).
- Cell suppression: small cells (ones and twos) are suppressed, and secondary suppressions are found to protect against recalculation through the margins. Cell suppression is carried out via Tau-Argus and the hyper-cube method.

SDC Methods for Frequency Tables
SDC for frequency tables containing population counts (cont.):
Imputation methods for cell suppression:
- The margins are known, and the total of the suppressed cells is known
- S-A: impute by the average of the total of the suppressed cells in each row
- S-WA: impute by a weighted average of the total of the suppressed cells in each row, where the weights are the column totals
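The two imputation rules can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the function name and arguments are hypothetical, and reading the weighted rule as "shares proportional to the column totals of the suppressed cells" is an assumption.

```python
def impute_row(suppressed_total, column_totals, weighted=False):
    """Share a row's known suppressed total among its suppressed cells.

    suppressed_total: known total of the suppressed cells in the row.
    column_totals: column margin for each suppressed cell in the row.
    """
    if not column_totals:
        raise ValueError("the row has no suppressed cells")
    n = len(column_totals)
    if not weighted:
        # Simple average: every suppressed cell receives the same share.
        return [suppressed_total / n] * n
    # Weighted average (assumed reading): shares proportional to column totals.
    weight_sum = sum(column_totals)
    return [suppressed_total * w / weight_sum for w in column_totals]
```

For example, a row whose suppressed cells sum to 6 and sit in columns with totals 10, 20 and 30 would be imputed as 2, 2, 2 under the simple average and as 1, 2, 3 under the weighted average.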

Information Loss Measures
Measuring distortion to distributions:
- Distance metrics are calculated between the original and perturbed cells in each geography (i.e., ward (NUTS5)) and averaged across all wards
- The metrics are defined in terms of the table for ward k, the number of cells in the ward, the number of wards, and the cell frequency for cell c
- Hellinger's Distance (HD)
- Relative Absolute Distance (RAD)
- Average Absolute Distance per Cell (AAD)
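The formulas for the three metrics are not reproduced in this transcript, so the sketch below uses commonly used forms as an assumption: a Hellinger-type distance sqrt(sum over cells of 0.5 * (sqrt(pert) - sqrt(orig))^2), a relative absolute distance summing |pert - orig| / orig, and an average absolute distance dividing the summed absolute differences by the number of cells. All function names are hypothetical, and skipping cells with an original count of zero in the relative measure is a choice made here to avoid division by zero.

```python
import math


def hellinger(orig, pert):
    """Hellinger-type distance between two lists of cell counts for one ward."""
    return math.sqrt(sum(0.5 * (math.sqrt(p) - math.sqrt(o)) ** 2
                         for o, p in zip(orig, pert)))


def relative_absolute_distance(orig, pert):
    """Sum of |pert - orig| / orig, skipping cells whose original count is 0."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o > 0)


def average_absolute_distance(orig, pert):
    """Mean absolute difference per cell within one ward."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)


def average_over_wards(metric, orig_wards, pert_wards):
    """Apply a per-ward metric to each ward and average across wards."""
    values = [metric(o, p) for o, p in zip(orig_wards, pert_wards)]
    return sum(values) / len(values)
```

Each per-ward function takes the original and perturbed cell counts of a single ward; `average_over_wards` then gives the across-ward average described on the slide.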

Information Loss Measures
Aggregation of perturbed cells and effects on sub-totals:
- Users aggregate perturbed lower-level geographies to obtain non-standard geographies
- Calculate the sub-total over the aggregated lower-level geographies
Impact on Tests for Independence:
- Cramér's V measure of association, which is based on the Pearson chi-square statistic
- Information loss measure: the percent relative difference in Cramér's V between the perturbed and original tables
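A minimal sketch of this measure for a two-way table follows. It uses the standard definition of Cramér's V, sqrt(chi-square / (n * min(rows - 1, columns - 1))); the function names and the list-of-rows table layout are assumptions made for illustration.

```python
import math


def cramers_v(table):
    """Cramér's V for a two-way table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # empty rows or columns contribute nothing
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


def percent_relative_difference(orig_table, pert_table):
    """Percent change in Cramér's V after perturbation (positive = increase)."""
    v_orig = cramers_v(orig_table)
    return 100.0 * (cramers_v(pert_table) - v_orig) / v_orig
```

A positive value indicates that perturbation increased the apparent association, and a negative value that it decreased it.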

Information Loss Measures
Impact on Variance:
- Little impact on the variance of cell counts
- Between-ward variance of target variables expressed as proportions, defined from the proportion in each ward k and the overall proportion
- Information loss measure: compares the between variance of the perturbed table with that of the original table
- Mixed effects were found for this information loss measure
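The between-variance formulas are not reproduced in this transcript. The sketch below assumes one plausible form: the ward proportion is the target count divided by the ward total, the overall proportion pools all wards, the squared deviations are divided by the number of wards minus one, and the loss is the percent relative change. The function names, the divisor and the percent form are all assumptions.

```python
def between_variance(target_counts, ward_totals):
    """Between-ward variance of the proportion target / total.

    The (number of wards - 1) divisor is an assumption.
    """
    proportions = [t / n for t, n in zip(target_counts, ward_totals)]
    overall = sum(target_counts) / sum(ward_totals)
    return sum((p - overall) ** 2 for p in proportions) / (len(proportions) - 1)


def between_variance_change(orig_counts, orig_totals, pert_counts, pert_totals):
    """Percent relative change in between variance after perturbation."""
    bv_orig = between_variance(orig_counts, orig_totals)
    bv_pert = between_variance(pert_counts, pert_totals)
    return 100.0 * (bv_pert - bv_orig) / bv_orig
```

Passing the perturbed ward totals separately allows for methods where the margins are themselves perturbed.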

Information Loss Measures
Impact on Rank Correlations:
- Sort the original cell counts and define deciles
- Repeat on the perturbed cell counts
- Information loss measure: the percentage of cells that fall in a different decile after perturbation, computed using the indicator function I and the number of wards
Log Linear Analysis:
- Information loss measure based on the ratio of the deviance (likelihood ratio test statistic) between the perturbed table and the original table for a given model
- Different models also need to be compared, since the model for the original table may differ from the model for the perturbed table
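The decile comparison can be sketched as below. The function names are hypothetical, and the handling of ties (broken by position, since Python's sort is stable) is an assumption; with many tied small counts, a different tie rule could change the result.

```python
def decile_groups(counts):
    """Assign each cell a decile (0 to 9) from its rank in the sorted counts."""
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    groups = [0] * len(counts)
    for rank, index in enumerate(order):
        groups[index] = rank * 10 // len(counts)
    return groups


def percent_changed_decile(orig, pert):
    """Percentage of cells whose decile differs between original and perturbed."""
    g_orig = decile_groups(orig)
    g_pert = decile_groups(pert)
    changed = sum(1 for a, b in zip(g_orig, g_pert) if a != b)
    return 100.0 * changed / len(orig)
```

Here `orig` and `pert` would hold the same cell (for example, one column of the table) across all wards, so the denominator is the number of wards.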

Data Used
Estimation Area Southwest England: 437,744 persons, 182,337 households, 70 wards (on average 6,250 persons to a ward)
The tables were the following:
- Tenure (3) * Age (7) * Health (4) * Ward
- Ethnicity (17) * Ward
- Economic Activity (9) * Sex (2) * Long-Term Illness (2) * Ward

Data Used
                             Tenure        Ethnicity       Employment
Number of cells              5,880         1,120           2,520
Average cell size and SE     73.8 (3.3)    387.3 (51.3)    125.8 (6.6)
% of zero cells              26%           23%             17%
% of small cells: 12%, 9%

Figure: Distance Metrics: (Left) Hellinger's Distance, (Centre) Relative Absolute Difference and (Right) Absolute Distance per cell, each shown for CR3, RaRo, SA, SCA and SWA.

Figure: Box Plots: Difference between Perturbed and Original Subtotals of Three Consecutive Wards (ADs) PAs for Number of Unemployed Females with Long Term Illness. Axis: Perturbation Method (Internal cells).

Figure: Change in Cramér's V Measure of Association after Perturbation. Axis: Percent Relative Difference. Legend: Increase in association, Decrease in association. Labelled values: 2.36, 48.27.

Figure: Percentage of Cells in a Different Decile after Perturbation, for Students with Long Term Illness: Male Students (column 1) and Female Students (column 2), shown for CR3, RaRo, SA, SCA and SWA. Axis: Percentage of cells.
N.B. The selected columns are very sparse, with approximately 70% of cells having counts < 4.

Log-Linear Models: Effect of Perturbation on Model Selection
Original Model:
Choose a better model?

Method       Deviance    Ratio (/Orig)
SCA          5,283       1.09
RaRo         5,316       1.10
CR3          5,214       1.08
SA           6,404       1.32
SWA          4,744       0.98
Original     4,486

Conclusions
Inconsistent results for some of the information loss measures (Cramér's V, between variance), showing that stochastic processes for SDC will have varying effects on the quality of the data
Emergence of some guidelines:
- skewed tables (one or two large columns and the rest small columns): prefer rounding to cell suppression
- uniform tables: less information loss due to SDC methods, so choose the method with the least changes to the table
- sparse tables: need to have benchmarked totals, so use controlled rounding (if possible) or semi-controlled random rounding
Improve utility by:
- designing tables to avoid disclosive cells
- controlling for totals when using random or small cell rounding
- giving clear guidance to users on how best to impute suppressed cells

Future Research
- Determine optimal methods of SDC depending on the use of the data and the characteristics of the table (skewed, sparse, uniform)
- Generalize and expand information loss measures for all types of statistical data (tabular and microdata) and statistical analysis
- Develop software to give to suppliers of data for assessing information loss under different SDC methods and choosing the optimal method that gives high-utility tables
