CBS-SSB STATISTICS NETHERLANDS – STATISTICS NORWAY
Work Session on Statistical Data Editing, Ljubljana, Slovenia, 9-11 May 2011
Jeroen Pannekoek and Li-Chun Zhang
Partial (donor) imputation with adjustments
Contents
- The problem of inconsistent micro-data
- Simple solutions and their limitations
- More general approaches
Example
Variable           | Response I | Response II | Donor values
x1: Profit         |            |             | 330
x2: Employees      |            | 25          | 20
x3: Turnover main  |            |             | 1000
x4: Turnover other |            |             | 30
x5: Turnover total | 950        | 950         | 1030
x6: Wages          |            | 550         | 500
x7: Other costs    |            |             | 200
x8: Total costs    |            |             | 700
Edit rules: x3 + x4 = x5; x6 + x7 = x8; x1 = x5 - x8.
Simple solutions (for response pattern I)
Prorating
- Edit 1: Turnover = Profit + Total costs. Since 950 ≠ 330 + 700, multiply the imputed values by 950 / (330 + 700) = 0.92.
- Edit 2: Total costs = Wages + Other costs. Since 0.92 × 700 ≠ Wages + Other costs, multiply the right-hand side by 0.92 as well.
Ratio adjustment (ratio imputation) with R = Turnover total (observed) / Turnover total (donor). In this case the results are the same as for prorating, except that Employees, which does not appear in any edit rule, is also adjusted.
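The prorating step for Edit 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper name `prorate` is hypothetical, and the inputs are the donor values for Profit (330) and Total costs (700) together with the observed Turnover (950) from the example.

```python
def prorate(values, target):
    """Scale all values by one common factor so that they sum to target."""
    factor = target / sum(values)
    return [v * factor for v in values], factor


# Edit 1 for response pattern I: Turnover (observed) = Profit + Total costs.
turnover_observed = 950.0
profit_donor, total_costs_donor = 330.0, 700.0

(profit_adj, total_costs_adj), factor = prorate(
    [profit_donor, total_costs_donor], turnover_observed
)
print(round(factor, 2))  # 0.92
```

The same factor would then be applied to the right-hand side of Edit 2, so that Wages and Other costs again add up to the adjusted Total costs.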
Problems with single constraint adjustments
Consider response pattern II.
Edit violations:
- E1: Turnover ≠ Profit + Total costs
- E2: Total costs ≠ Wages + Other costs
Option:
1. Adjust Profit and Total costs to fit E1.
2. For the resulting value of Total costs, adjust Other costs to fit E2.
Problems:
- The order matters: we get a different solution if we do it the other way around.
- Information on Wages is not used in adjusting Total costs.
- Infeasible solutions for adjusted Total costs do occur (adjusted Total costs < Wages).
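The order dependence can be made concrete with the numbers of response pattern II. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the first ordering follows the two steps listed on the slide, and the reverse ordering assumes that Total costs is first set to Wages + Other costs and that Profit then absorbs the remainder (the slide does not spell out how the reverse order would be carried out).

```python
# Response pattern II: Turnover total and Wages are observed, the rest is imputed.
TURNOVER, WAGES = 950.0, 550.0
donor = {"profit": 330.0, "other_costs": 200.0, "total_costs": 700.0}


def e1_then_e2(rec):
    """Prorate Profit and Total costs to Turnover (E1), then derive Other costs (E2)."""
    f = TURNOVER / (rec["profit"] + rec["total_costs"])
    profit, total = rec["profit"] * f, rec["total_costs"] * f
    return {"profit": profit, "other_costs": total - WAGES, "total_costs": total}


def e2_then_e1(rec):
    """Assumed reverse order: fix Total costs from E2 first, then derive Profit from E1."""
    total = WAGES + rec["other_costs"]
    return {"profit": TURNOVER - total, "other_costs": rec["other_costs"], "total_costs": total}


def satisfies_edits(rec, tol=1e-9):
    """Check E1 (Turnover = Profit + Total costs) and E2 (Total costs = Wages + Other costs)."""
    e1 = abs(TURNOVER - rec["profit"] - rec["total_costs"]) < tol
    e2 = abs(rec["total_costs"] - WAGES - rec["other_costs"]) < tol
    return e1 and e2


first = e1_then_e2(donor)
second = e2_then_e1(donor)
print(first)
print(second)
```

Both records satisfy the two edits, yet they differ substantially. In the first ordering Other costs is whatever remains after subtracting Wages from the prorated Total costs, so it turns negative whenever the prorated Total costs falls below Wages.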
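Edit constraints as a system of equations
For the vector of values $x = (x_1, \dots, x_8)^T$ the constraints are $Ex = 0$. Writing the three edit rules of the example as E1: $x_1 - x_5 + x_8 = 0$, E2: $x_3 + x_4 - x_5 = 0$ and E3: $x_6 + x_7 - x_8 = 0$ gives
$$E = \begin{pmatrix} 1 & 0 & 0 & 0 & -1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & -1 \end{pmatrix}$$
Each row of E is a constraint and the columns correspond to the variables. Constraints E1 and E2 are linked because they have variable x5 (Turnover total) in common. E2 and E3 are also linked (through E1, which shares x8 with E3).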
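As a small illustration, the constraint matrix and the edit violations of the unadjusted record can be written out in Python. The row order and signs of `E` are assumptions (any sign convention that makes each row equal zero would do), chosen so that E1 and E2 share x5 as stated on the slide; the helper names `residuals` and `linked` are hypothetical. The record is response pattern II with donor values filled in for the missing items.

```python
# Columns are x1..x8: Profit, Employees, Turnover main, Turnover other,
# Turnover total, Wages, Other costs, Total costs.
E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # E1: x1 - x5 + x8 = 0
    [0, 0, 1, 1, -1, 0, 0, 0],    # E2: x3 + x4 - x5 = 0
    [0, 0, 0, 0, 0, 1, 1, -1],    # E3: x6 + x7 - x8 = 0
]


def residuals(E, x):
    """Return E x; a consistent record gives all zeros."""
    return [sum(e * v for e, v in zip(row, x)) for row in E]


def linked(E, k, l):
    """Two constraints are directly linked if they share at least one variable."""
    return any(a != 0 and b != 0 for a, b in zip(E[k], E[l]))


# Response pattern II: x2, x5, x6 observed, other values taken from the donor.
x = [330, 25, 1000, 30, 950, 550, 200, 700]
print(residuals(E, x))                                     # [80, 80, 50]
print(linked(E, 0, 1), linked(E, 1, 2), linked(E, 0, 2))   # True False True
```

All three edits are violated by the unadjusted record, and E2 and E3 share no variable directly, so they are connected only through E1.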
An optimization approach
Change the values of the imputed variables such that:
- the edit rules are satisfied;
- the change is as small as possible.
Formally, find an adjusted data vector $x^A$ such that
$$x^A = \arg\min D(x^A, x) \quad \text{s.t.} \quad E x^A \le 0.$$
Here $E x^A \le 0$ means that we consider both equalities and inequalities.
Distance functions
- Least Squares (LS): $\sum_i (x_i - x_i^A)^2$
- Weighted Least Squares (WLS): $\sum_i w_i (x_i - x_i^A)^2$
- Kullback-Leibler Divergence (KL): $\sum_i x_i (\ln x_i - \ln x_i^A)$
Adjustment models 1/2
Least squares (LS): $D = \sum_i (x_i - x_i^A)^2$, giving $x_i^A = x_i + \sum_k e_{ki} \alpha_k$.
Additive adjustments: the total adjustment for a variable is the sum of the adjustments to each of the constraints. The same adjustment parameter ($\alpha_k$) applies to all variables in constraint $k$.
Weighted least squares (WLS): $D = \sum_i w_i (x_i - x_i^A)^2$, giving $x_i^A = x_i + (1/w_i) \sum_k e_{ki} \alpha_k$.
Additive adjustments, but the amount of adjustment varies according to the weights.
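For equality constraints, the additive adjustment formulas can be evaluated directly. The sketch below is an illustration under assumptions rather than the authors' code: the sign convention of `E`, the helper names `solve` and `adjust`, and the choice to hold observed values fixed are all assumptions. Requiring $E x^A = 0$ with $x_i^A = x_i + (1/w_i) \sum_k e_{ki} \alpha_k$ for the imputed variables leads to a small linear system for the $\alpha_k$, solved here with plain Gauss-Jordan elimination.

```python
# Edit rules as rows of E (sign convention assumed); columns are x1..x8.
E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # x1 - x5 + x8 = 0
    [0, 0, 1, 1, -1, 0, 0, 0],    # x3 + x4 - x5 = 0
    [0, 0, 0, 0, 0, 1, 1, -1],    # x6 + x7 - x8 = 0
]
# Response pattern II: x2, x5, x6 observed; the rest are donor values.
x = [330.0, 25.0, 1000.0, 30.0, 950.0, 550.0, 200.0, 700.0]
imputed = [0, 2, 3, 6, 7]          # zero-based indices of x1, x3, x4, x7, x8


def solve(a, b):
    """Solve the small linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))   # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adjust(E, x, imputed, w):
    """Return x^A with x_i^A = x_i + (1/w_i) * sum_k e_ki * alpha_k for imputed i."""
    K = range(len(E))
    # alpha solves (E W^-1 E') alpha = -E x, using only imputed columns on the left.
    M = [[sum(E[k][i] * E[l][i] / w[i] for i in imputed) for l in K] for k in K]
    rhs = [-sum(e * v for e, v in zip(row, x)) for row in E]
    alpha = solve(M, rhs)
    xa = list(x)
    for i in imputed:
        xa[i] += sum(E[k][i] * alpha[k] for k in K) / w[i]
    return xa


ls = adjust(E, x, imputed, [1.0] * len(x))          # LS: all weights equal to 1
wls = adjust(E, x, imputed, [1.0 / v for v in x])   # WLS with w_i = 1 / x_i
print([round(v, 2) for v in ls])
print([round(v, 2) for v in wls])
```

With these inputs the unweighted LS solution pushes Turnover other to -10, because every variable in a constraint receives the same absolute adjustment regardless of its size. The weights $1/x_i$ make the adjustments proportional to the values and keep all adjusted values positive here.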
Adjustment models 2/2
Kullback-Leibler Divergence (KL): $D = \sum_i x_i (\ln x_i - \ln x_i^A)$, giving $x_i^A = x_i \times \prod_k \exp(e_{ki} \alpha_k)$.
The factor for constraint $k$ can be written as $\beta_k$ if $e_{ki} = 1$ and $1/\beta_k$ if $e_{ki} = -1$.
Multiplicative adjustments: the total adjustment to a variable is the product of the adjustments to each constraint. The same multiplicative adjustment parameter $\beta_k$ applies to all variables in constraint $k$.
It can be shown that KL ≈ WLS for weights $w_i = 1/x_i$.
Algorithm
Simple iterative procedures exist to estimate the adjustments for general convex distances.
- Adjust the x-vector to each constraint, one by one. Each of these single-constraint adjustments is easy to perform.
- After all constraints have been visited, one iteration is complete. Repeat.
For sum-to-total constraints and the KL divergence this is equivalent to repeated prorating and Iterative Proportional Fitting.
But the procedure also covers more general constraints (differences, linear inequalities, interval constraints), as well as more general distances and confidence weights.
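A minimal sketch of the iterative scheme with multiplicative single-constraint steps is given below. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the sign convention of `E` is assumed, coefficients are assumed to be -1, 0 or 1, values are assumed positive, and observed values are held fixed. For one constraint, multiplying the imputed variables with coefficient +1 by $\beta$ and those with coefficient -1 by $1/\beta$ leads to a quadratic equation in $\beta$, which reduces to ordinary prorating when only +1 coefficients are involved.

```python
import math

E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # x1 - x5 + x8 = 0
    [0, 0, 1, 1, -1, 0, 0, 0],    # x3 + x4 - x5 = 0
    [0, 0, 0, 0, 0, 1, 1, -1],    # x6 + x7 - x8 = 0
]
x0 = [330.0, 25.0, 1000.0, 30.0, 950.0, 550.0, 200.0, 700.0]
free = {0, 2, 3, 6, 7}             # imputed variables; the others stay fixed


def fit_constraint(row, x, free):
    """Rescale imputed variables in one constraint by beta (e = 1) or 1/beta (e = -1)."""
    pos = sum(x[i] for i in free if row[i] == 1)
    neg = sum(x[i] for i in free if row[i] == -1)
    fixed = sum(row[i] * x[i] for i in range(len(x)) if i not in free)
    # beta solves pos * beta - neg / beta + fixed = 0.
    if pos == 0:
        beta = neg / fixed
    else:
        beta = (-fixed + math.sqrt(fixed ** 2 + 4 * pos * neg)) / (2 * pos)
    for i in free:
        if row[i] != 0:
            x[i] *= beta ** row[i]


def max_violation(E, x):
    return max(abs(sum(e * v for e, v in zip(row, x))) for row in E)


def iterate(E, x, free, tol=1e-9, max_iter=1000):
    """Visit the constraints one by one; one pass over all rows is one iteration."""
    x = list(x)
    for n in range(1, max_iter + 1):
        for row in E:
            fit_constraint(row, x, free)
        if max_violation(E, x) < tol:
            return x, n
    return x, max_iter


xa, n_iter = iterate(E, x0, free)
print([round(v, 2) for v in xa], n_iter)
```

Fitting one constraint generally disturbs the linked constraints again, which is why several passes are needed. The second constraint shares no imputed variable with the others, so it is settled by a single prorating step in the first pass.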
The generalized ratio approach 1/2
- The methods so far adjust only variables that appear in edit constraints; the aim is only to satisfy “hard” edits.
- Inconsistencies between imputed and observed values indicate a difference between the donor record and the receptor record. Therefore: adjust all donor values to better fit the receptor record.
- For response pattern I, with only Turnover total observed, all donor values were multiplied by the ratio Observed/Donor Turnover, thus rescaling with a measure of “size”.
The generalized ratio approach 2/2
As a generalisation we propose the following component-wise multiplicative adjustments: $x_i^A = x_i \delta_i$.
The $\delta_i$ are determined by minimizing their variance subject to the adjusted record satisfying the edit constraints.
- Adjustments are as uniform as possible, as with ratio imputation.
- But all kinds of constraints can be satisfied.
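One way to compute minimum-variance factors for equality constraints is sketched below. It rests on several assumptions that the slide does not fix: only the imputed (donor) values receive a factor $\delta_i$, observed values are held fixed, the sign convention of `E` is assumed, and the function names are hypothetical. Minimizing the variance of the $\delta_i$ is treated as minimizing $\sum_i (\delta_i - \mu)^2$ jointly over $\delta$ and a common level $\mu$, subject to the linear constraints $A\delta = b$ implied by $E x^A = 0$.

```python
E = [
    [1, 0, 0, 0, -1, 0, 0, 1],    # x1 - x5 + x8 = 0
    [0, 0, 1, 1, -1, 0, 0, 0],    # x3 + x4 - x5 = 0
    [0, 0, 0, 0, 0, 1, 1, -1],    # x6 + x7 - x8 = 0
]
x = [330.0, 25.0, 1000.0, 30.0, 950.0, 550.0, 200.0, 700.0]
imputed = [0, 2, 3, 6, 7]          # donor values that receive a factor delta_i


def solve(a, b):
    """Solve the small linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))   # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def dot(p, q):
    return sum(s * t for s, t in zip(p, q))


def generalized_ratio(E, x, imputed):
    """Minimum-variance factors delta with E x^A = 0 and x_i^A = x_i * delta_i."""
    observed = [i for i in range(len(x)) if i not in imputed]
    # The edits become A delta = b: imputed part on the left, observed part on the right.
    A = [[row[i] * x[i] for i in imputed] for row in E]
    b = [-sum(row[i] * x[i] for i in observed) for row in E]
    ones = [sum(row) for row in A]            # A applied to a vector of ones
    G = [[dot(p, q) for q in A] for p in A]   # A A'
    u, v = solve(G, b), solve(G, ones)
    mu = dot(ones, u) / dot(ones, v)          # common level; equals the mean of delta
    nu = [s - mu * t for s, t in zip(u, v)]
    return [mu + sum(A[k][j] * nu[k] for k in range(len(E))) for j in range(len(imputed))]


delta = generalized_ratio(E, x, imputed)
xa = list(x)
for j, i in enumerate(imputed):
    xa[i] = x[i] * delta[j]
print([round(d, 3) for d in delta])
print([round(v, 2) for v in xa])
```

If a single common factor could satisfy all edits, this procedure would return it for every variable, which is the plain ratio adjustment. In response pattern II no such factor exists, so the factors differ, but by as little as the constraints allow.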
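Example revisited (response pattern II)
[Table comparing the unadjusted imputed record with the three adjusted records. Columns: Variable, Imputed unadj., LS adjust., WLS / KL, Gen. ratio. Rows: x1: Profit, x2: Employees, x3: Turnover main, x4: Turnover other, x5: Turnover total, x6: Costs wages, x7: Costs other, x8: Costs total. Values shown: Employees 25, Turnover total 950, Costs wages 550.]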
Concluding remarks
Optimization approach to solving inconsistency problems:
- Simultaneous adjustment to all constraints.
- Generalizes prorating and ratio adjustment for single constraints.
- The minimum distance approach aims at consistency with minimum (optimal) adjustments.
- The generalized ratio approach aims to better preserve the structure of the imputed record, as in ratio imputation.