Topic (vi): New and Emerging Methods Topic organizer: Maria Garcia (USA) UNECE Work Session on Statistical Data Editing Oslo, Norway, 24-26 September 2012.


Topic (vi): Overview
• Papers under this topic present new ideas and advancements in the development of methods and techniques for solving, improving, and optimizing the editing and imputation of data.
• Contributions cover:
  – Probability editing
  – Machine learning methods
  – Model-based imputation methods
  – Automatic editing of numerical data

Topic (vi): Overview
• Probability editing, WP.36 (Sweden)
  – Select a subset of units to edit using a probability sampling framework.
• Machine learning methods, WP.37 (EUROSTAT)
  – Imputation of categorical data using a neural network classifier and a Bayesian network classifier.

Topic (vi): Overview
• Model-based imputation, WP.38 (Slovenia)
  – Bayesian model + linear regression
  – Multiple imputation
• Automatic editing, WP.39 (Netherlands)
  – Ensure hard edits are satisfied while incorporating information from soft edits into the editing/imputation process.

Topic (vi): New and Emerging Methods
Enjoy the presentations!

Topic (vi): New and Emerging Methods Summary of main developments and points for discussion

Topic (vi): Summary
• WP.36 – Probability Editing (Sweden)
  – Proposes selecting units for editing using a traditional probability sampling framework in which only a fraction of the data is edited.
  – Applies to all types of data.
  – Addresses the statistical properties of the estimators.
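To make the idea of editing only a probability sample concrete, here is a minimal sketch. It is not the method of WP.36: it assumes Poisson sampling with inclusion probabilities proportional to a hypothetical per-unit score, and it corrects the raw total with a Horvitz-Thompson estimate of the total editing effect. All function names are illustrative.

```python
import random

def inclusion_probabilities(scores, expected_n):
    """Inclusion probabilities proportional to score, capped at 1.

    Capping without redistribution means the expected sample size can
    fall below expected_n; a production design would rescale.
    """
    total = sum(scores)
    return [min(1.0, expected_n * s / total) for s in scores]

def poisson_sample(probs, rng):
    """Select each unit independently with its own inclusion probability."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

def ht_corrected_total(raw, edited_values, probs):
    """Raw total plus a Horvitz-Thompson estimate of the total editing effect.

    edited_values maps a sampled unit index to its value after editing.
    """
    correction = sum(
        (edited_values[i] - raw[i]) / probs[i] for i in edited_values
    )
    return sum(raw) + correction
```

Because every unit has a known, positive inclusion probability, the correction term is design-unbiased for the total effect of editing all units, which is what allows the statistical properties of the estimator to be studied.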

Topic (vi): Summary
• WP.37 – Use of machine learning methods to impute categorical data (EUROSTAT)
  – A new neural network classifier, a supervised learning method extended to deal with mixed numerical and categorical data.
  – A Bayesian network classifier.
  – Machine learning results are compared with results obtained from logistic regression and multiple imputation.
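As a rough illustration of classifier-based imputation of a categorical variable, the sketch below uses naive Bayes, the simplest Bayesian network classifier (all predictors are assumed conditionally independent given the class). This is a swapped-in stand-in, not the classifiers of WP.37, and the function and field names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(records, target):
    """Count class and per-class predictor frequencies.

    Only records whose target is observed (not None) are used.
    """
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, predictor) -> value counts
    levels = defaultdict(set)            # predictor -> observed values
    for rec in records:
        if rec[target] is None:
            continue
        c = rec[target]
        class_counts[c] += 1
        for key, value in rec.items():
            if key != target:
                value_counts[(c, key)][value] += 1
                levels[key].add(value)
    return class_counts, value_counts, levels

def impute(rec, target, model):
    """Return the most probable class, using add-one (Laplace) smoothing."""
    class_counts, value_counts, levels = model
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n  # prior probability of the class
        for key, value in rec.items():
            if key == target:
                continue
            seen = value_counts[(c, key)][value]
            score *= (seen + 1) / (count + len(levels[key]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Always imputing the most probable class tends to over-represent common categories; drawing the imputed value from the estimated class probabilities instead is one way to stay closer to the original distribution of the variable.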

Topic (vi): Summary
• WP.38 – Implementation of the Bayesian approach to imputation at SORS (Slovenia)
  – A Bayesian model and linear regression are combined into a method for imputing annual gross income in a household survey.
  – The imputation problem is solved within separate data groupings, with different levels of the auxiliary variables within each group.
  – Multiple imputation.
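The combination of regression, grouping, and multiple imputation can be sketched as follows. This is a simplified stand-in, not the SORS implementation: it fits an ordinary least squares line within one group and creates m completed data sets by adding a random residual to each prediction. A fully Bayesian version would also draw the regression parameters from their posterior. The names and the single-predictor setup are assumptions.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd).

    Needs at least three observations and non-constant x.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute_group(rows, m, rng):
    """Return m completed copies of the y-values for one group.

    rows is a list of (x, y) pairs, with y set to None when missing.
    """
    observed = [(x, y) for x, y in rows if y is not None]
    a, b, sd = fit_line([x for x, _ in observed], [y for _, y in observed])
    completed = []
    for _ in range(m):
        completed.append([
            y if y is not None else a + b * x + rng.gauss(0.0, sd)
            for x, y in rows
        ])
    return completed
```

The requirement of at least three observed values per group shows why very small groups, or groups where most values are missing, are a practical problem for group-wise models.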

Topic (vi): Summary
• WP.39 – Automatic editing with hard and soft edits (Netherlands)
  – Error localization problem involving both hard (inconsistency) and soft (query) edits.
  – The objective of minimizing the number of fields to impute also incorporates the cost associated with failed query edits.
  – The associated software is written in R and uses an existing R package.
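One way to picture error localization with both kinds of edits is a brute-force search over small candidate domains. The sketch below is illustrative only and is not the algorithm or the R software of WP.39: the cost function (reliability weights of changed fields plus penalties for soft edits that still fail), the candidate domains, and all names are assumptions.

```python
from itertools import combinations, product

def localize(record, domains, weights, hard_edits, soft_edits):
    """Brute-force search for the cheapest set of fields to change.

    cost = sum of reliability weights of the changed fields
         + sum of penalties of soft edits that still fail.
    Every hard edit must hold after the change. Returns
    (cost, changed_fields, corrected_record), or None if no solution exists.
    """
    fields = list(record)
    best = None
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            change_cost = sum(weights[f] for f in subset)
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if not all(edit(candidate) for edit in hard_edits):
                    continue
                cost = change_cost + sum(
                    penalty for edit, penalty in soft_edits
                    if not edit(candidate)
                )
                if best is None or cost < best[0]:
                    best = (cost, subset, candidate)
    return best

# Hypothetical record: profit should equal turnover minus costs.
record = {"turnover": 100, "costs": 150, "profit": 20}
domains = {"turnover": [100, 170], "costs": [80, 150], "profit": [-50, 20]}
weights = {"turnover": 2, "costs": 1, "profit": 0.8}
hard = [lambda r: r["turnover"] - r["costs"] == r["profit"]]
soft = [(lambda r: r["profit"] >= 0, 0.5)]  # (edit, penalty if it fails)

with_soft = localize(record, domains, weights, hard, soft)
without_soft = localize(record, domains, weights, hard, [])
```

In this example the hard edit alone points to `profit` as the cheapest field to change, while adding the soft edit shifts the solution to `costs`. The search is exponential in the number of fields, so real systems rely on optimization techniques such as branch-and-bound or mixed-integer programming rather than enumeration.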

Topic (vi): Points for discussion
In probability editing, a sampling design is used to select units for editing. If the two-step approach is not used, important units may not be in the sample, and errors (possibly large) may remain in the data file.
• What are the implications?
• How will this affect the estimator and the variance of the estimator?
• How do errors remaining in the data affect analysis, particularly if the data are used for purposes not envisioned in the original survey design?

Topic (vi): Points for discussion
The paper from EUROSTAT deals with the use of machine learning methods for imputing missing data:
• The method uses either a Bayesian network classifier or a neural network classifier. What is the effect of using these methods for imputing missing values on the original distribution of the variables?
• Choosing the appropriate method for handling imputation of missing values is a challenging problem. For what kind of surveys are these methods suitable?

Topic (vi): Points for discussion
• What are the experiences at other agencies when incorporating machine learning methods into their imputation menu (e.g., at BOC - B. Winkler 2009, 2010)?

Topic (vi): Points for discussion
The paper from Slovenia implements the imputation model using an available procedure within the SAS language, PROC MCMC:
• For some complex imputation problems (i.e., large number of variables, different types of variables, large data files), the default settings within commercial software procedures may be inappropriate. What type of diagnostics, graphical or analytical, should be examined to ensure the procedure is working properly?

Topic (vi): Points for discussion
• How can we address the situation in which a large proportion of the missing fields occur within a particular data group?
• Or the situation in which a particular data group is too small to fit the model?

Topic (vi): Points for discussion
Incorporating soft or query edits into the error localization problem increases the number of variables and edits of this computationally intensive optimization problem:
• How do we approach the added complexity? Is this new approach computationally feasible?
• Does adding the information from the soft (query) edits lead to a reduction of the time and resources spent on analysts’ review (trade-off)?
• What are the effects of adding more edits and thus changing more fields on data quality?

New and Emerging Methods: Closing Remarks

Closing Remarks
• What happened to error localization?
• Should we be focusing more on imputation than on editing (again: error localization)?
• Do we forecast multiple imputation as becoming the “standard” at most NSIs:
  – to smooth variability of single imputations?
  – to estimate variance due to imputation?
  – issue of non-response bias vs. variance estimation?
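For the question of estimating variance due to imputation, the usual tool is Rubin's combining rules: the pooled estimate is the mean of the m completed-data estimates, and the total variance adds the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance. A minimal sketch follows; the function name is illustrative.

```python
import statistics

def rubin_combine(estimates, variances):
    """Combine m point estimates and their within-imputation variances.

    Returns (pooled_estimate, total_variance), where
    total_variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance, divisor m - 1
    return pooled, within + (1 + 1 / m) * between
```

The between-imputation term is what a single imputation cannot supply, which is why single imputation tends to understate the variance of estimates when imputed values are treated as observed.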