Multivariate selective editing via mixture models: first applications to Italian structural business surveys Orietta Luzi, Guarnera U., Silvestri F., Buglielli.

Presentation transcript:

Multivariate selective editing via mixture models: first applications to Italian structural business surveys
Orietta Luzi, Guarnera U., Silvestri F., Buglielli T., Nurra A., Siesto G.
Italian National Institute of Statistics
UNECE Worksession on Statistical Data Editing, Oslo, September 2012

Outline
- Objective of the work
- The SeleMix approach to selective editing
- The software SeleMix
- The applications
- Final remarks and future work
UNECE Worksession on Statistical Data Editing, September 22-24, Oslo

Objective of the work
- Assessing the advantages (in terms of quality improvement and cost reduction) of a multivariate, model-based, robust selective editing approach for detecting influential errors in business surveys.
- Exploring the potential benefits of using administrative data when detecting influential errors in economic business surveys.
The idea is to improve the effectiveness of selective editing by incorporating directly into the selective editing strategy the auxiliary information available in external (both administrative and statistical) sources.

Selective Editing
Key elements:
- score function
- cut-off value (threshold) determining the units to be manually reviewed
The components of a score function are:
- risk ~ probability of error occurrence
- influence ~ (expected) impact on estimates

Score Function
A local score is often defined for each record and each variable by comparing current values with "estimated" true values, e.g.
- historical values on the same units (when available)
- estimates (predictions) obtained using auxiliary information (e.g. admin data) or covariates from the same survey
The different local scores are combined into a single global score. The cut-off value of the global score determines which units are to be manually reviewed.
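The combination of local scores into a global score can be sketched as follows. This is a minimal illustration, not the method of the presentation: the scaling of each local score by the estimated total, the use of the maximum as global score, the cut-off of 0.1 and all function and variable names are assumptions made for the example.

```python
import numpy as np

def local_scores(observed, predicted, weights):
    """Local score per unit and variable: weighted |observed - predicted|,
    divided by the estimated total of that variable (assumed scaling)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    w = np.asarray(weights, dtype=float)[:, None]
    totals = np.abs((w * predicted).sum(axis=0))
    return w * np.abs(observed - predicted) / totals

def global_score(scores):
    """Combine local scores into one score per unit (maximum: an assumption)."""
    return scores.max(axis=1)

# Three units, two variables (e.g. Turnover and Costs); illustrative values only.
observed = np.array([[100.0, 80.0], [5000.0, 60.0], [120.0, 95.0]])
predicted = np.array([[105.0, 82.0], [500.0, 58.0], [118.0, 90.0]])
weights = np.array([10.0, 10.0, 10.0])

scores = global_score(local_scores(observed, predicted, weights))
to_review = np.flatnonzero(scores > 0.1)  # hypothetical cut-off value
```

Only the second unit, whose observed turnover is ten times its predicted value, exceeds the cut-off and would be sent to manual review.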

Selective Editing
The difference between observed and predicted values is due to both:
- the potential error
- the natural variability of the analyzed quantity
In the usual setting these two elements cannot be distinguished, and the score of an observation is not directly related to the expected error of that unit. As a consequence, the selective editing threshold cannot be related to the desired degree of accuracy in the final estimates.
Problem: relate the threshold value of the score function to the desired accuracy of the estimates (i.e. the residual error left in the data).

Model-based Selective Editing
Proposed solution: use an approach based on
1) explicit modeling of both the data and the error mechanism (via mixture models). In particular, a latent variable model makes it possible, under certain assumptions, to estimate the expected error associated with each unit. The method uses contaminated normal models, where the distribution of the erroneous data is assumed to be obtained from the distribution of the error-free data by inflating the variance
2) definition of the score function in terms of the conditional distribution of "true" data given observed data

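The model
- Y*: true data
- Y: observed data
- X: covariates (no error)
- B: regression coefficients
- U: residuals
- I: Bernoulli variable (error indicator)
True data model: regression of Y* on X with coefficients B and Gaussian residuals U.
Error model: when I = 1, an additive Gaussian error with covariance proportional to that of the true data affects Y*; otherwise Y = Y*.
Distribution of observed data: a mixture of the error-free and the contaminated (inflated-variance) Gaussian distributions.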
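The formulas of this slide are not in the transcript. A sketch of a contaminated normal model that agrees with the description given in the surrounding slides (additive Gaussian error, proportional covariance matrices, variance inflation) is the following; the symbols \pi (error probability), \Sigma (residual covariance), \lambda (variance inflation factor) and \varepsilon (error term) are notation assumed here.

```latex
% True data model (unit i)
y_i^{*} = B^{\top} x_i + u_i, \qquad u_i \sim N(0, \Sigma)

% Error model: intermittent additive Gaussian error
y_i = y_i^{*} + I_i\,\varepsilon_i, \qquad
I_i \sim \mathrm{Bernoulli}(\pi), \qquad
\varepsilon_i \sim N\bigl(0, (\lambda - 1)\Sigma\bigr), \quad \lambda > 1

% Distribution of observed data: two-component Gaussian mixture
f(y_i \mid x_i) = (1 - \pi)\, N\bigl(y_i;\, B^{\top} x_i,\, \Sigma\bigr)
                + \pi\, N\bigl(y_i;\, B^{\top} x_i,\, \lambda \Sigma\bigr)
```

With this parameterization the contaminated component keeps the same mean as the error-free one and has covariance \lambda\Sigma, which is the "inflated variance" of the description.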

The method
Model parameters can be estimated from the observed data via EM. These estimates can be used to estimate the conditional distribution of true data given observed data, through the posterior probability that unit i is in error. A prediction for unit i is then obtained from this conditional distribution.
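A minimal sketch of this step, assuming a contaminated normal model in which erroneous units have covariance lam * sigma around the same regression mean: given parameter values, it computes the posterior probability of error for each unit and the prediction as the conditional expectation of the true value. The function names, the parameter names (lam, pi) and the numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(resid, cov):
    """Density of N(0, cov) evaluated at each row of resid."""
    p = resid.shape[1]
    inv = np.linalg.inv(cov)
    maha = np.einsum("ij,jk,ik->i", resid, inv, resid)  # squared Mahalanobis distance
    norm = np.sqrt((2 * np.pi) ** p * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def posterior_and_prediction(y, x, B, sigma, lam, pi):
    """Posterior error probability tau_i and prediction E(Y* | y_i, x_i)."""
    mu = x @ B
    resid = y - mu
    f_clean = (1 - pi) * gaussian_pdf(resid, sigma)
    f_error = pi * gaussian_pdf(resid, lam * sigma)
    tau = f_error / (f_clean + f_error)
    # If the unit is in error, the expected true value shrinks towards the mean.
    y_if_error = mu + resid / lam
    y_hat = (1 - tau)[:, None] * y + tau[:, None] * y_if_error
    return tau, y_hat

# One variable, intercept-only regression; illustrative parameter values.
x = np.ones((3, 1))
B = np.array([[10.0]])
sigma = np.array([[1.0]])
y = np.array([[10.2], [9.5], [40.0]])
tau, y_hat = posterior_and_prediction(y, x, B, sigma, lam=100.0, pi=0.05)
```

The third unit lies 30 standard deviations from the regression mean, so its posterior error probability is essentially 1 and its prediction is pulled back close to the mean, while the first two units are left almost unchanged.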

Risk and Influence
The expected error is the product of two components: a risk component and an influence component. It is natural to define the score function in terms of the expected error.
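The formula is not in the transcript; one way to write the decomposition, under an assumed contaminated normal model with variance inflation factor \lambda, is sketched below. Here \tau_i denotes the posterior probability that unit i is in error, \mu_i its regression mean, \tilde{y}_i the expected true value given that the unit is in error, and \hat{y}_i the prediction; all of this notation is assumed.

```latex
% Prediction: mixture of "observed value is correct" and "observed value is in error"
\hat{y}_i = (1 - \tau_i)\, y_i + \tau_i\, \tilde{y}_i,
\qquad \tilde{y}_i = \mu_i + \tfrac{1}{\lambda}\,(y_i - \mu_i)

% Expected error = risk x influence
y_i - \hat{y}_i
  = \underbrace{\tau_i}_{\text{risk}}
    \;\underbrace{(y_i - \tilde{y}_i)}_{\text{influence}}
  = \tau_i \Bigl(1 - \tfrac{1}{\lambda}\Bigr)(y_i - \mu_i)
```

The first equality follows directly from the definition of \hat{y}_i; the second uses the assumed form of \tilde{y}_i.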

The score function
If a total of Y in a finite population is to be estimated from a sample S via a robust estimator, we define a (local) score function r_i^Y as the weighted expected error for variable Y in unit i.
Ordering the records by that score function (in descending order), correcting the first k units, and summing the r_i^Y scores over all the units not edited, we obtain an estimate R_k^Y of the relative expected residual error in the data.
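The ordering and stopping rule described above can be sketched as follows. The exact formulas are not in the transcript, so the definitions used here are assumptions: the robust total is taken as the weighted sum of predictions, the local score as the weighted expected error divided by that total, and the residual error after editing k units as the absolute value of the sum of the remaining scores. All names and numbers are hypothetical.

```python
import numpy as np

def select_influential(y_obs, y_pred, weights, threshold):
    """Return indices of units to edit and the residual error R_k for k = 0..n."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total_hat = np.sum(weights * y_pred)         # assumed robust estimate of the total
    r = weights * (y_obs - y_pred) / total_hat   # signed local scores
    order = np.argsort(-np.abs(r))               # descending by absolute score
    r_sorted = r[order]
    # residual[k] = |sum of scores of the units left unedited after editing the first k|
    tail = np.concatenate([np.cumsum(r_sorted[::-1])[::-1], [0.0]])
    residual = np.abs(tail)
    # Smallest k such that the residual error stays below the threshold from k onwards.
    above = np.flatnonzero(residual >= threshold)
    k = 0 if above.size == 0 else int(above[-1]) + 1
    return order[:k], residual

y_obs = [100.0, 5000.0, 120.0, 90.0]
y_pred = [102.0, 520.0, 117.0, 91.0]
weights = [10.0, 10.0, 10.0, 10.0]
selected, residual = select_influential(y_obs, y_pred, weights, threshold=0.01)
```

In this toy example a single unit (the second one) accounts for almost all of the expected error, so editing it alone brings the estimated residual error below the 1% threshold.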

Warnings
1) Model assumptions
- true data are assumed to be normal/log-normal
- error is modeled as additive and Gaussian (in a suitable scale)
- covariance matrices of true data and error distributions are assumed to be proportional
2) Population estimates
The score function and the stopping criterion have a straightforward interpretation only for linear estimates such as means or totals.

The software SeleMix
SeleMix is an R package for selective editing based on a contamination model. Its main functionalities are:
- parameter estimation via the ECM algorithm
- prediction of "true" values conditional on observed values according to the estimated model
- computation of score functions, ordering of the units, and identification of influential errors according to the user-specified threshold
SeleMix also provides anticipated values (predictions) for units where some (or all) of the Y variables are not observed. Missing values in the X covariates are not allowed.
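To give an idea of what the estimation step involves, here is a small NumPy sketch of an ECM-type algorithm for a contaminated normal regression model. It is not the SeleMix code and does not use its API: the parameterization (mixing weight w, variance inflation factor lam), the starting values, the function names and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def log_density(resid, cov):
    """Log density of N(0, cov) at each row of resid."""
    p = resid.shape[1]
    maha = np.einsum("ij,jk,ik->i", resid, np.linalg.inv(cov), resid)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + p * np.log(2 * np.pi))

def ecm_fit(y, x, lam=10.0, w=0.1, n_iter=200):
    """Fit B, sigma, lam, w; also return the log-likelihood trace."""
    n, p = y.shape
    B = np.linalg.lstsq(x, y, rcond=None)[0]   # OLS starting values
    resid = y - x @ B
    sigma = resid.T @ resid / n
    trace = []
    for _ in range(n_iter):
        # E-step: posterior probability tau that each unit is contaminated
        log_clean = np.log1p(-w) + log_density(resid, sigma)
        log_error = np.log(w) + log_density(resid, lam * sigma)
        log_mix = np.logaddexp(log_clean, log_error)
        trace.append(float(np.sum(log_mix)))
        tau = np.exp(log_error - log_mix)
        # CM-step 1: weighted least squares for B and sigma, lam held fixed
        u = (1 - tau) + tau / lam
        xw = x * u[:, None]
        B = np.linalg.solve(x.T @ xw, xw.T @ y)
        resid = y - x @ B
        sigma = (resid * u[:, None]).T @ resid / n
        # CM-step 2: lam with B and sigma held fixed (kept above 1)
        maha = np.einsum("ij,jk,ik->i", resid, np.linalg.inv(sigma), resid)
        lam = max(float(np.sum(tau * maha) / (p * np.sum(tau))), 1.0 + 1e-6)
        w = float(np.mean(tau))
    return B, sigma, lam, w, trace

# Simulated data: about 10% of units contaminated, error variance 99 times larger.
rng = np.random.default_rng(0)
n = 2000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
y_true = x @ np.array([[1.0], [2.0]]) + rng.normal(size=(n, 1))
is_error = rng.random(n) < 0.1
y = y_true + is_error[:, None] * rng.normal(scale=np.sqrt(99.0), size=(n, 1))

B_hat, sigma_hat, lam_hat, w_hat, trace = ecm_fit(y, x)
```

The two conditional maximization steps each have a closed form, which is the reason for preferring an ECM scheme over a joint M-step in this sketch. The log-likelihood trace can be inspected to check that the iterations have stabilized before the estimates are used for prediction.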

The Applications: the surveys
The economic surveys:
- the annual sampling survey on Information and Communication Technology usage and e-commerce in industry (ICT)
- the annual sampling survey on Small and Medium Enterprises (SME)
The target variables: Turnover, Costs
The target parameters: totals of the variables (by domain)

The Applications: the auxiliary sources
Administrative archives
- Financial Statements (FS)
  - Corporate companies (~ enterprises)
  - Best harmonized source w.r.t. SBS Regulation definitions
- Sector Studies Survey (SS)
  - Fiscal survey (~ 4 million enterprises)
  - Detailed costs and income
  - Similar to financial statements
Statistical sources
- Annual total Survey on the Economic Accounts of Enterprises (SEA) (≥ 100 employees; ~12,000 enterprises)

ICT - Experiment 1
Objective: evaluating the effectiveness of the proposed selective editing in terms of correct identification of influential errors and correct treatment of both influential errors and item non-responses in the ICT context.
Experimental approach:
- Simulation of contaminated values and item non-responses on edited values of Turnover and Costs in the sub-sample of corporate enterprises of the 2009 ICT sample
- Monte Carlo evaluation of selective editing & imputation w.r.t. FS (different thresholds); "corrections" based on either 2009 FS (true) data or model-based predictions
- Auxiliary variables: Turnover and Costs from 2008 FS data
Results: editing a small number of units is sufficient to remove the most influential errors. The bias of the estimates based on edited data is always below 0.3%, while the RRMSE is quite close to the threshold value (0.5%) for almost all domains.

ICT - Results of experiment 1
[Table: one row per domain (G, F, DE, C, H, L, J, I, NS, M); columns Dom, N, n.cont, n.out, n.sel, followed by Relative Bias (%) and RRMSE (%) of Turnover and Costs for RAW, EDITED and ROB.EST]
Relative bias and relative root mean square error (RRMSE) for the estimates based on raw data (RAW), edited data (EDITED) and SeleMix predictions (ROB.EST) (threshold = 0.005)

ICT - Experiment 2
Objective: assessing the advantages, in terms of potential reduction of follow-up and interactive editing costs, of integrating selective editing in the current E&I procedure.
Experimental approach:
- Application of selective editing to raw Turnover and Costs of all the 2008 ICT responding units (different thresholds)
- Comparative evaluation of the parameter estimates obtained after selective editing against the estimates obtained by the current procedure
- Auxiliary variables: Turnover and Costs available in at least one external source (SEA, FS, SME, SS, in this order of priority), year 2008
- Correction using either ICT edited data or model-based predictions
Results:
- Large reduction in the number of units selected as suspect compared with the number of units manually revised under the current approach
- Small distances between the estimates of totals based on selective editing and the corresponding final ICT estimates for most domains

ICT - Results of experiment 2
Influential errors and missing values imputed with ICT edited data
[Table: one row per domain plus a Total row; columns Dom, N, n.sel, followed by n.miss and ICT.Sel for Turnover and for Costs]
Relative distances of SeleMix estimates (Sel) from estimates based on raw data (Raw) and on ICT edited data (ICT) (threshold = 0.01)

SME - Experiment 1
Objective: assessing the advantages, in terms of potential reduction of follow-up and interactive editing, that could derive from integrating selective editing in the current E&I procedure.
Experimental approach:
- Application of selective editing and imputation to raw Turnover and Costs of all the 2008 SME responding units (different thresholds and imputation approaches)
- Comparative evaluation of the parameter estimates obtained after selective editing & imputation against the "true" estimates obtained from administrative archives
- Auxiliary variables: Turnover and Costs available in at least one external source (FS, SS, in this order of priority), year 2007

SME - Results of experiment 1
As expected, higher threshold levels imply a consistent reduction of expected revisions, which is balanced by less accurate estimates. In SME this seems to happen in too many domains.
Threshold = 0.01:
- 869 units selected as influential (~2.9% of the experimental sub-sample)
- Diff(True.Sel) ≤ 1.5 in 89% of domains (the median of the distribution of Diff(True.Sel) over the domains is 0.65)
Threshold = 0.02:
- 382 units selected as influential (~1.3% of the experimental sub-sample)
- Diff(True.Sel) ≤ 1.5 in 75% of domains (the median of the distribution of Diff(True.Sel) over the considered domains is 0.9)

SME - Results of experiment 1
[Figure: Turnover, relative differences between Diff(True.Sel) at threshold 0.01 and at threshold 0.02]

Conclusions
Application to ICT data
- Fully satisfactory results. The integration of the method in the current E&I procedure is already in progress.
Application to SME data
- Further analyses are needed: different thresholds for different domains? Additional covariates?

Thank you for your attention