Multivariate selective editing via mixture models: first applications to Italian structural business surveys Orietta Luzi, Guarnera U., Silvestri F., Buglielli.


1 Multivariate selective editing via mixture models: first applications to Italian structural business surveys
Orietta Luzi, Guarnera U., Silvestri F., Buglielli T., Nurra A., Siesto G.
Italian National Institute of Statistics
UNECE Worksession on Statistical Data Editing, Oslo, 22-24 September 2012

2 Outline
- Objective of the work
- The SeleMix approach to selective editing
- The software SeleMix
- The applications
- Final remarks and future work

3 Objective of the work
- Assessing the advantages (in terms of quality improvement and cost reduction) of using a multivariate, model-based, robust selective editing approach to detect influential errors in business surveys.
- Exploring the potential benefits of using administrative data when detecting influential errors in economic business surveys.
The idea is to improve the effectiveness of selective editing by directly incorporating the auxiliary information available in external sources (both administrative and statistical) into the selective editing strategy.

4 Selective Editing
Key elements:
- score function
- cut-off value (threshold) determining the units to be manually reviewed
The components of a score function are:
- risk ~ probability of error occurrence
- influence ~ (expected) impact on estimates

5 Score Function
A local score is often defined for each record and each variable by comparing the current value with an “estimated” true value, e.g.
- historical values for the same unit (when available)
- estimates (predictions) obtained using auxiliary information (e.g. admin data) or covariates from the same survey
The local scores are then combined into a single global score. The cut-off value of the global score determines which units are to be manually reviewed.
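The scoring scheme on this slide can be illustrated with a minimal Python sketch. The slide does not say how local scores are scaled or combined, so the weighted absolute deviation relative to a reference total, the use of the maximum as global score, the cut-off of 0.05 and the toy data are all assumptions made for illustration.

```python
import numpy as np

def local_scores(observed, predicted, weights, totals):
    """Local score per unit and variable: weighted absolute deviation of the
    observed value from its predicted value, relative to a reference total
    (an assumed, commonly used form of local score)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    weights = np.asarray(weights, dtype=float)[:, None]
    return weights * np.abs(observed - predicted) / np.asarray(totals, dtype=float)

def select_units(scores, cutoff):
    """Global score = maximum local score over variables (assumed rule);
    units whose global score exceeds the cut-off are flagged for review."""
    global_score = scores.max(axis=1)
    return global_score, global_score > cutoff

# Hypothetical data: 4 units, 2 variables (turnover, costs).
# The second unit has a turnover ten times larger than predicted.
observed = [[100.0, 80.0], [1000.0, 55.0], [210.0, 150.0], [300.0, 240.0]]
predicted = [[98.0, 79.0], [100.0, 60.0], [200.0, 155.0], [310.0, 235.0]]
weights = [1.0, 1.0, 1.0, 1.0]
totals = np.sum(predicted, axis=0)  # reference totals taken from the predictions

scores = local_scores(observed, predicted, weights, totals)
global_score, flagged = select_units(scores, cutoff=0.05)
print(flagged)
```

Only the unit with the large turnover deviation exceeds the cut-off here; the other deviations are small relative to the total.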

6 Selective Editing
The difference between observed and predicted values is due to
- the potential error
- the natural variability of the analyzed quantity.
In the usual setting these two elements cannot be distinguished, so the score of an observation is not directly related to the expected error of that unit. As a consequence, the selective editing threshold cannot be related to the desired accuracy of the final estimates.
Problem: relate the threshold value of the score function to the desired accuracy of the estimates (i.e. the residual error left in the data).

7 Model-based Selective Editing
Proposed solution: use an approach based on
1) explicit modeling of both the data and the error mechanism (via mixture models). In particular, a latent variable model makes it possible, under certain assumptions, to estimate the expected error associated with each unit. The method uses contaminated normal models, in which the distribution of the erroneous data is assumed to be obtained from the distribution of the error-free data by inflating the variance
2) definition of the score function in terms of the conditional distribution of “true” data given observed data

8 The model
- Y*: true data
- Y: observed data
- X: covariates (no error)
- B: regression coefficients
- U: residuals
- I: Bernoulli variable
True data model, error model, distribution of observed data: [formulas not preserved in the transcript]
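The formulas of this slide are missing from the transcript. The following is a hedged reconstruction, assuming the standard contaminated normal regression model that matches the definitions above and the later statement that the error covariance is proportional to the covariance of the true data. The symbols \Sigma (residual covariance), \varepsilon (error term), \lambda (variance inflation factor) and \pi (error probability) are notation chosen here, not taken from the slide.

```latex
\begin{align*}
\text{True data model:} \quad
  & Y^* = XB + U, \qquad U \sim N(0, \Sigma) \\
\text{Error model:} \quad
  & Y = Y^* + I\,\varepsilon, \qquad
    \varepsilon \sim N(0, \Sigma_\varepsilon), \quad
    \Sigma_\varepsilon = (\lambda - 1)\,\Sigma, \quad \lambda > 1, \quad
    I \sim \mathrm{Bernoulli}(\pi) \\
\text{Observed data:} \quad
  & f_Y(y \mid x) = (1 - \pi)\, N(y;\, B^\top x,\, \Sigma)
    + \pi\, N(y;\, B^\top x,\, \lambda \Sigma)
\end{align*}
```

With I = 0 the observed value equals the true value; with I = 1 the covariance becomes \Sigma + (\lambda - 1)\Sigma = \lambda\Sigma, which is the variance inflation described on the previous slide.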

9 The method
Model parameters can be estimated from the observed data via EM. These estimates can then be used to estimate the conditional distribution of true data given observed data, which involves the posterior probability for unit i, and to obtain a prediction for unit i. [formulas not preserved in the transcript]
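The formulas of this slide are also missing. As a sketch, assume a contaminated normal model in which an error occurs with probability \pi and inflates the covariance \Sigma of the true data by a factor \lambda > 1 around the regression mean B^\top x_i (all of this notation is an assumption). The conditional distribution and the prediction then take the following form, where \tau_i denotes the posterior probability that unit i is in error.

```latex
\begin{align*}
\tau_i &= \frac{\pi\, N(y_i;\, B^\top x_i,\, \lambda\Sigma)}
               {(1-\pi)\, N(y_i;\, B^\top x_i,\, \Sigma)
                + \pi\, N(y_i;\, B^\top x_i,\, \lambda\Sigma)} \\[4pt]
f(y_i^* \mid y_i) &= (1 - \tau_i)\, \delta(y_i^* - y_i)
   + \tau_i\, N(y_i^*;\, \tilde{\mu}_i,\, \tilde{\Sigma}), \qquad
   \tilde{\mu}_i = \frac{y_i + (\lambda - 1)\, B^\top x_i}{\lambda}, \quad
   \tilde{\Sigma} = \Bigl(1 - \frac{1}{\lambda}\Bigr)\Sigma \\[4pt]
\hat{y}_i &= E(Y_i^* \mid y_i) = (1 - \tau_i)\, y_i + \tau_i\, \tilde{\mu}_i
\end{align*}
```

The expressions for \tilde{\mu}_i and \tilde{\Sigma} follow from the usual Gaussian conditioning formulas applied to Y^* \sim N(B^\top x, \Sigma) and Y = Y^* + \varepsilon with \varepsilon \sim N(0, (\lambda - 1)\Sigma).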

10 Risk and Influence
The expected error is the product of two components: [formula not preserved in the transcript]
- risk component
- influence component
It is natural to define the score function in terms of the expected error.
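A minimal univariate Python sketch of this decomposition follows. It assumes a contaminated normal model with known parameters: error-free values follow N(mean, var), and erroneous values have the same mean but variance lam * var and occur with prior probability w. The parameter names and the numeric values are illustrative assumptions, not taken from the slides.

```python
import math

def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def expected_error(y, mean, var, lam, w):
    """Return (risk, influence, expected error) for one observed value y.

    mean, var: mean and variance of error-free data (assumed known here)
    lam:       variance inflation factor of erroneous data (lam > 1)
    w:         prior probability that a value is erroneous
    """
    clean = (1.0 - w) * normal_pdf(y, mean, var)
    contaminated = w * normal_pdf(y, mean, lam * var)
    risk = contaminated / (clean + contaminated)        # posterior probability of error
    true_given_error = (y + (lam - 1.0) * mean) / lam   # E(true value | y, error occurred)
    influence = y - true_given_error                    # expected error size given an error
    return risk, influence, risk * influence

# A value close to the mean and a value far from it (hypothetical numbers).
risk_in, infl_in, err_in = expected_error(10.5, mean=10.0, var=1.0, lam=100.0, w=0.05)
risk_out, infl_out, err_out = expected_error(25.0, mean=10.0, var=1.0, lam=100.0, w=0.05)
print(risk_in, err_in)
print(risk_out, err_out)
```

The value near the mean gets a posterior error probability close to zero, so its expected error is negligible; the outlying value gets a probability close to one, so its expected error is essentially its distance from the conditional prediction.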

11 The score function
If a total Y in a finite population is to be estimated from a sample S via the robust estimator [formula not preserved in the transcript], we define a (local) score function r_i^Y as the weighted expected error for variable Y in unit i [formula not preserved in the transcript].
Ordering the records by that score in descending order, correcting the first k units, and summing the r_i^Y scores over all the units not edited, we obtain an estimate of the relative expected residual error R_k^Y in the data [formula not preserved in the transcript].
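The selection rule described here can be sketched in Python as follows. Since the formulas are missing, the exact forms are assumptions: the robust total is taken as the weighted sum of the predictions, the score as r_i = w_i (y_i - yhat_i) / That, units are ordered by |r_i|, and the residual error after editing the first k units is the absolute value of the sum of the remaining scores. The stopping rule (edit until the residual stays below the threshold for every larger k) and the toy data are also assumptions.

```python
import numpy as np

def selective_editing(observed, predicted, weights, threshold):
    """Return the indices of the units to edit and the residual error curve.

    residual[k] is the estimated relative residual error left in the data
    after the k highest-scoring units have been edited.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    weights = np.asarray(weights, dtype=float)

    total_hat = np.sum(weights * predicted)                 # robust estimate of the total
    scores = weights * (observed - predicted) / total_hat   # signed local scores r_i
    order = np.argsort(-np.abs(scores), kind="stable")      # most influential first
    sorted_scores = scores[order]

    # tail_sums[k] = sum of the scores of the units still unedited after k edits
    tail_sums = np.concatenate([np.cumsum(sorted_scores[::-1])[::-1], [0.0]])
    residual = np.abs(tail_sums)

    # smallest k such that the residual error stays below the threshold from k onwards
    above = np.nonzero(residual >= threshold)[0]
    k = int(above[-1]) + 1 if above.size else 0
    return order[:k], residual

# Hypothetical data: observed values, model predictions, sampling weights.
observed = [100.0, 1000.0, 230.0, 305.0]
predicted = [98.0, 100.0, 200.0, 310.0]
weights = [1.0, 1.0, 1.0, 1.0]

selected, residual = selective_editing(observed, predicted, weights, threshold=0.01)
print(selected, residual)
```

In this toy case two units have to be edited before the estimated residual error falls below 1% of the estimated total.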

12 Warnings
1) Model assumptions
- true data are assumed to be normal/log-normal
- the error is modeled as additive and Gaussian (in a suitable scale)
- the covariance matrices of the true data and error distributions are assumed to be proportional
2) Population estimates
The score function and the stopping criterion have a straightforward interpretation only for linear estimates such as means or totals.

13 The software SeleMix
SeleMix is an R package for selective editing based on a contamination model. Its main functionalities are:
- parameter estimation via the ECM algorithm
- prediction of “true” values conditional on observed values, according to the estimated model
- computation of score functions, ordering of the units, and identification of influential errors according to a user-specified threshold
SeleMix also provides anticipated values (predictions) for units where some (or all) of the Y variables are not observed. Missing values in the X covariates are not allowed.
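To illustrate the estimation step, here is a minimal Python sketch of an ECM-style fit for a univariate contaminated normal model without covariates. It is not the SeleMix implementation (which is written in R and handles multivariate data with regression on X); the function names, the starting values and the simulated data are assumptions made for illustration.

```python
import numpy as np

def log_normal_pdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def fit_contaminated_normal(y, n_iter=200):
    """ECM-style fit of a univariate contaminated normal model.

    Returns estimates of mu and sigma2 (mean and variance of error-free data),
    lam (variance inflation of erroneous values) and w (error probability).
    """
    y = np.asarray(y, dtype=float)
    mu = np.median(y)
    sigma2 = (1.4826 * np.median(np.abs(y - mu))) ** 2  # robust start from the MAD
    lam, w = 10.0, 0.1                                   # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each value is erroneous
        log_clean = np.log(1.0 - w) + log_normal_pdf(y, mu, sigma2)
        log_cont = np.log(w) + log_normal_pdf(y, mu, lam * sigma2)
        tau = 1.0 / (1.0 + np.exp(log_clean - log_cont))
        # CM-steps: update each parameter in turn, holding the others fixed
        w = float(np.clip(tau.mean(), 1e-6, 1.0 - 1e-6))
        c = 1.0 - tau + tau / lam            # precision weight of each observation
        mu = np.sum(c * y) / np.sum(c)
        sigma2 = np.sum(c * (y - mu) ** 2) / y.size
        lam = max(np.sum(tau * (y - mu) ** 2) / (sigma2 * np.sum(tau)), 1.0 + 1e-6)
    return {"mu": mu, "sigma2": sigma2, "lam": lam, "w": w}

# Simulated data: 10% of the values receive an additive Gaussian error, so
# erroneous values have variance 1 + 24 = 25 times the error-free variance.
rng = np.random.default_rng(0)
n = 5000
true_values = rng.normal(10.0, 1.0, n)
is_error = rng.random(n) < 0.1
observed = true_values + is_error * rng.normal(0.0, np.sqrt(24.0), n)

fit = fit_contaminated_normal(observed)
print(fit)
```

The fitted parameters can then be plugged into the posterior probabilities and predictions used by the score function.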

14 The Applications: the surveys
The economic surveys:
- the annual sample survey on Information and Communication Technology usage and e-commerce in industry (ICT)
- the annual sample survey on Small and Medium Enterprises (SME)
The target variables: Turnover, Costs
The target parameters: totals of the variables (by domain)

15 The Applications: the auxiliary sources
Administrative archives
- Financial Statements (FS)
  - Corporate companies (~15,000 enterprises)
  - Best harmonized source w.r.t. SBS Regulation definitions
- Sector Studies Survey (SS)
  - Fiscal survey (~4 million enterprises)
  - Detailed costs and income
  - Similar to financial statements
Statistical sources
- Annual total survey on the Economic Accounts of Enterprises (SEA) (≥ 100 employees; ~12,000 enterprises)

16 ICT - Experiment 1
Objective: evaluating the effectiveness of the proposed selective editing approach in terms of correct identification of influential errors and correct treatment of both influential errors and item non-responses in the ICT context.
Experimental approach
- Simulation of contaminated values and item non-responses on edited values of Turnover and Costs in the sub-sample of corporate enterprises of the 2009 ICT sample
- Monte Carlo evaluation of selective editing and imputation w.r.t. FS (different thresholds); “corrections” based on either 2009 FS (true) data or model-based predictions
- Auxiliary variables: Turnover and Costs from 2008 FS data
Results
Editing a small number of units is sufficient to remove the most influential errors: the bias of the estimates based on edited data is always below 0.3%, while the RRMSE is quite close to the threshold value (0.5%) for almost all domains.
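The two evaluation measures used in this experiment can be computed from Monte Carlo replicates as in the following Python sketch. The slides do not give the formulas, so the usual definitions are assumed (mean relative error for the bias, root of the mean squared relative error for the RRMSE, both in percent); the replicate values are hypothetical.

```python
import numpy as np

def relative_bias_and_rrmse(estimates, true_value):
    """Relative bias (%) and relative root mean square error (%) of a set of
    Monte Carlo estimates of a total with respect to the true total."""
    estimates = np.asarray(estimates, dtype=float)
    rel_err = (estimates - true_value) / true_value
    rel_bias = 100.0 * rel_err.mean()
    rrmse = 100.0 * np.sqrt(np.mean(rel_err ** 2))
    return rel_bias, rrmse

# Hypothetical estimates of a total from four Monte Carlo replications.
estimates = [101.0, 99.0, 102.0, 98.0]
rel_bias, rrmse = relative_bias_and_rrmse(estimates, true_value=100.0)
print(rel_bias, rrmse)
```

In this toy case the errors cancel on average, so the relative bias is zero while the RRMSE is about 1.58%, which shows why both measures are reported.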

17 ICT - Results of experiment 1

                              |  Relative bias (%)                        |  RRMSE (%)
                              |           RAW        EDITED       ROB.EST |           RAW        EDITED       ROB.EST
Dom      N n.cont n.out n.sel |   turn   cost   turn   cost   turn   cost |   turn   cost   turn   cost   turn   cost
G     3497    336   515   116 |    2.8    2.6    0.0     ..    0.9    1.2 |    4.2    3.7    0.2     ..    1.0    1.3
F     3260    317   565   238 |   15.4   18.1   -0.2    0.0   -7.6   -7.0 |   22.3   32.8    0.4    0.2    7.7    7.1
DE     876     85   143    16 |    4.4   13.6    0.1     ..   -0.2     .. |   10.4   39.3    0.3    0.5    2.0    1.8
C     3691    362   494   231 |   13.7   16.3   -0.1     ..    0.9    0.3 |   19.4   23.9    0.3     ..    1.0    0.7
H      653     63   144    20 |    2.7    3.3    0.1    0.0   -0.6   -0.8 |    8.8   10.5    0.4    0.5    0.9    1.0
L      133     13    25    16 |   44.5  166.7    0.0   -0.1    3.9   10.2 |   95.4  686.4    1.0    0.7    7.9   11.5
J      565     55    76    15 |   16.2   19.4    0.0   -0.1   -1.8   -3.6 |   35.0   50.1    0.6    0.4    2.1    3.8
I      224     22    35    16 |    6.4    4.6   -0.2     ..    2.1    2.9 |   15.6   12.3    0.8    0.6    2.3    3.0
NS    1156    111   211    18 |    6.8    6.5    0.2    0.1    0.5   -0.4 |   11.0   12.1    0.5    0.7    0.9    0.8
M      450     43    78    38 |   39.2   30.5   -0.1    0.0   -6.1    7.4 |   79.5   64.3    0.4     ..    6.1    7.5

Relative bias and relative root mean square error (RRMSE) of the estimates based on raw data (RAW), edited data (EDITED) and SeleMix predictions (ROB.EST); threshold = 0.005. Cells marked .. could not be recovered from the transcript.

18 ICT - Experiment 2
Objective: assessing the potential reduction of follow-up and interactive editing costs obtained by integrating selective editing into the current E&I procedure.
Experimental approach
- Application of selective editing to raw Turnover and Costs of all the 2008 ICT responding units (different thresholds)
- Comparison of the parameter estimates obtained after selective editing with the estimates obtained by the current procedure
- Auxiliary variables: Turnover and Costs available in at least one external source (SEA, FS, SME, SS, in order of priority), year 2008
- Correction using either ICT edited data or model-based predictions
Results
- Large reduction in the number of units selected as suspect compared with the number of units manually revised under the current approach
- Small distances between the estimates of totals based on selective editing and the corresponding final ICT estimates for most domains

19 ICT - Results of experiment 2
Influential errors and missing values imputed with ICT edited data

                     |      Turnover     |       Costs
Dom         N  n.sel |  n.miss  ICT.Sel  |  n.miss  ICT.Sel
1         745      3 |       9     0.90  |       6     0.55
2         338      6 |       3     1.62  |       5     1.70
3         293      0 |       0    -0.22  |       1     0.39
4         546      8 |       2     0.16  |       7     0.21
5         255      1 |       1    -0.88  |       2    -0.35
6        1036     33 |       8     0.89  |       8    -0.12
7         146      6 |       3    -0.21  |       2     0.26
8         603     19 |       8     0.34  |      10    -0.24
9         169      4 |       1    -0.29  |       1     0.37
10        416      5 |       4     0.26  |       5     0.96
11        201      5 |       0    -0.48  |       1    -0.40
11        747     17 |      15     0.27  |      17     0.56
12       5155     16 |      74    -1.68  |      97    -1.91
13        620      3 |       7     0.13  |       9    -0.61
14       2795     13 |      22    -0.94  |      29     0.26
15       1174      5 |      17     0.06  |      22     0.30
16        752      8 |       5     0.07  |       8    -0.01
17        290      0 |       0     0.00  |       1     0.20
18        205      6 |       4     0.23  |       4     1.97
19        131      1 |       4     0.04  |       5     3.69
20        120      0 |       2     0.35  |       2    -0.32
21         47      0 |       0     0.01  |       0    -0.25
22        362      0 |       0     0.18  |       0     0.19
23        406      2 |       1    -1.77  |       4    -2.28
24       1492      2 |       2     0.00  |       2     0.42
25        613     16 |      10     1.17  |      12     1.33
26       1124     25 |      16     0.32  |      17     0.39
27        176      0 |       2     0.18  |       3     0.00
28        749      0 |       0     0.20  |       0     0.01
Total  19,101    235 |     220

Relative distances between the SeleMix estimates (Sel) and the estimates based on raw data (Raw) and on ICT edited data (ICT); threshold = 0.01.

20 SME - Experiment 1
Objective: assessing the potential reduction of follow-up and interactive editing that could be obtained by integrating selective editing into the current E&I procedure.
Experimental approach
- Application of selective editing and imputation to raw Turnover and Costs of all the 2008 SME responding units (different thresholds and imputation approaches)
- Comparison of the parameter estimates obtained after selective editing and imputation with the “true” estimates obtained from administrative archives
- Auxiliary variables: Turnover and Costs available in at least one external source (FS, SS, in order of priority), year 2007

21 SME - Results of experiment 1
As expected, higher thresholds imply a substantial reduction in the expected number of revisions, at the price of less accurate estimates. In SME, this loss of accuracy seems to affect too many domains.
Threshold = 0.01
- 869 units selected as influential (~2.9% of the experimental sub-sample)
- Diff(True.Sel) ≤ 1.5 in 89% of domains (the median of Diff(True.Sel) over the domains is 0.65)
Threshold = 0.02
- 382 units selected as influential (~1.3% of the experimental sub-sample)
- Diff(True.Sel) ≤ 1.5 in 75% of domains (the median of Diff(True.Sel) over the domains is 0.9)
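The domain-level summaries quoted on this slide can be reproduced with a short Python sketch. The definition of Diff(True.Sel) is not given in the transcript; it is assumed here to be the absolute relative difference, in percent, between the "true" total and the total estimated after selective editing. The domain totals are hypothetical.

```python
import numpy as np

def domain_summary(true_totals, sel_totals, limit=1.5):
    """Per-domain relative difference (%) between true and selectively edited
    totals, the share of domains within the limit, and the median difference."""
    true_totals = np.asarray(true_totals, dtype=float)
    sel_totals = np.asarray(sel_totals, dtype=float)
    diff = 100.0 * np.abs(true_totals - sel_totals) / true_totals
    share_within = float(np.mean(diff <= limit))
    return diff, share_within, float(np.median(diff))

# Hypothetical totals for four domains.
true_totals = [1000.0, 2000.0, 500.0, 800.0]
sel_totals = [1005.0, 1950.0, 503.0, 808.0]

diff, share_within, median_diff = domain_summary(true_totals, sel_totals)
print(diff, share_within, median_diff)
```

Repeating the computation for each threshold gives the kind of comparison reported above: the share of domains within the limit and the median difference across domains.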

22 SME - Results of experiment 1
[Figure: Turnover, relative differences between Diff(True.Sel) with threshold = 0.01 and with threshold = 0.02]

23 Conclusions
Application to ICT data
- Fully satisfactory results. The integration of the method into the current E&I procedure is already in progress.
Application to SME data
- Further analyses are needed: different thresholds for different domains? Additional covariates?

24 Thank you for your attention

