New and Emerging Methods Maria Garcia and Ton de Waal UN/ECE Work Session on Statistical Data Editing, 16-18 May 2005, Ottawa.



Introduction
- New methods of data editing and imputation
- Subdivided into 5 different themes:
  - Automatic editing
  - Imputation
  - E & I for demographic variables
  - Selective editing
  - Software

Invited Papers
- WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)
- WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US)
- WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)

Automatic editing: papers (1/2)
Six papers:
- WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)
- WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US)
- WP 33: Data editing and logic (Australia)

Automatic editing: papers (2/2)
- WP 43: Automatic editing system for the case of two short-term business surveys (Republic of Slovenia)
- WP 44: A variable neighbourhood local search approach for the continuous data editing problem (Spain)
- WP 46: Implicit linear inequality edits and error localization in the SPEER edit system (USCB, US)

Automatic Editing: main developments
Methods based on Fellegi-Holt model
- Developments at SORS
  - General system combines error localization with outlier detection
  - Plans for automation of implied edit generation
- Further improvements of SPEER
  - Preprocessing program for generation of implied edits
  - Improve error localization

Automatic Editing: main developments
- Framework of Fellegi-Holt theory in propositional logic
  - Generation of implied edits framed as logical deduction
  - Automatic tools that can potentially be used for finding minimal deletion set
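
To make the "minimal deletion set" idea concrete, here is a small brute-force sketch in Python. It is not the method of any paper listed above: the edits, field names and domains are hypothetical, and the exhaustive search is only workable for tiny categorical examples. It looks for a smallest set of fields that can be changed so that the record passes every edit, which is the core of the Fellegi-Holt principle.

```python
from itertools import combinations, product


def violates_any(record, edits):
    """Return True if the record fails at least one edit rule."""
    return any(edit(record) for edit in edits)


def minimal_deletion_set(record, domains, edits):
    """Find a smallest set of fields whose values can be changed so that
    the record passes every edit (brute force, for illustration only)."""
    fields = list(record)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            # Try every combination of replacement values for the chosen fields.
            for values in product(*(domains[f] for f in subset)):
                candidate = dict(record)
                candidate.update(zip(subset, values))
                if not violates_any(candidate, edits):
                    return set(subset)
    return None  # the edits cannot be satisfied at all


# Hypothetical edits: each function returns True when the record FAILS.
edits = [
    lambda r: r["age"] == "child" and r["marital"] == "married",
    lambda r: r["age"] == "child" and r["employment"] == "employed",
]
# Hypothetical value domains for each field.
domains = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employment": ["employed", "not employed"],
}
record = {"age": "child", "marital": "married", "employment": "employed"}

result = minimal_deletion_set(record, domains, edits)
print(result)
```

Changing only `age` repairs both failed edits, so the search stops at a set of size one. Production systems avoid this exhaustive search, for example by generating implied edits first or by using logic or integer programming solvers, as the slides mention.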

Methods based on some other approach
- Erroneous unit measures
  - Model as cluster analysis problem
- Ratio and balance constraints
  - Hybrid ratio editing and quadratic programming
  - Controlled rounding
- Error localization as a combinatorial optimization problem
  - Continuous data
  - Successful on very large data sets
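
As a rough illustration of how a quadratic programming formulation can restore a balance edit, the sketch below handles the simplest special case only: a single linear balance edit and an unweighted least-squares objective, for which the solution has a closed form. It is an assumption-laden toy, not the hybrid method of WP 32; it ignores ratio edits, inequality constraints and weights, and the variable names and figures are hypothetical.

```python
def adjust_to_balance(values, coefficients, total=0.0):
    """Return the values closest (in least squares) to the reported ones
    that satisfy sum(coefficients[i] * x[i]) == total."""
    residual = sum(a * x for a, x in zip(coefficients, values)) - total
    norm = sum(a * a for a in coefficients)
    # Spread the residual over the variables in proportion to their coefficients.
    return [x - a * residual / norm for a, x in zip(coefficients, values)]


# Hypothetical balance edit: turnover - costs - profit = 0
reported = [100.0, 60.0, 30.0]
coefficients = [1.0, -1.0, -1.0]

adjusted = adjust_to_balance(reported, coefficients)
print(adjusted)
```

The reported record misses the balance by 10, and the adjustment spreads that discrepancy evenly over the three variables. A real application would typically add weights reflecting the reliability of each variable and further constraints, which requires a general quadratic programming solver.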

Imputation: papers (1/2)
Six papers:
- WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)
- WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)
- WP 36: Integrated modeling approach to imputation and discussion on imputation variance (Statistics Finland)

Imputation: papers (2/2)
- WP 40: Imputation of data subject to balance and inequality restrictions using the truncated normal distribution (Statistics Netherlands)
- WP 41: On the imputation of categorical data subject to edit restrictions using loglinear models (Statistics Netherlands)
- WP 48: Improving imputation: the plan to examine count, status, vacancy and item imputation in the decennial census (USCB, US)

Imputation: main developments
Model based methods
- Discrete Data
  - Constrained loglinear model
  - Linear regression model
- Continuous Data
  - Truncated normal distribution followed by MCEM
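
The idea of imputing from a truncated normal distribution can be sketched for a single variable as follows. This is a deliberately simplified assumption: one variable, one interval edit, known mean and standard deviation, and plain rejection sampling. The multivariate model, the balance restrictions and the MCEM estimation step mentioned on the slide are not covered, and all numbers are hypothetical.

```python
import random


def draw_truncated_normal(mean, sd, lower, upper, rng, max_tries=10_000):
    """Draw from a normal distribution restricted to [lower, upper]
    by simple rejection sampling."""
    for _ in range(max_tries):
        value = rng.gauss(mean, sd)
        if lower <= value <= upper:
            return value
    raise RuntimeError("no draw fell inside the interval; check the bounds")


rng = random.Random(2005)  # fixed seed so the sketch is reproducible

# Hypothetical edit: 0 <= costs <= turnover, with turnover observed as 100.
imputed_costs = draw_truncated_normal(
    mean=80.0, sd=30.0, lower=0.0, upper=100.0, rng=rng
)
print(imputed_costs)
```

Because the draw is restricted to the interval allowed by the edit, the imputed value satisfies the inequality restriction by construction. Rejection sampling becomes inefficient when the interval has little probability mass, so other samplers would be needed in that case.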

Imputation: main developments Implementation of imputation methods  Use Bayesian networks for imputation of discrete data  Development of QUIS for imputation of continuous data written in SAS uses EM algorithm, nearest neighbor, and MI

Imputation: main developments Implementation of imputation methods  Integrated Modeling Approach (IMAI) Summary and analysis of principles of IMAI Estimation of imputation variance  U.S. Decennial Census Research on alternative imputation options Administrative records, model based imputation, CANCEIS, hot deck Development of a truth deck for evaluation

E & I for demographic variables: papers
Three papers:
- WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)
- WP 35: Edit and imputation for the 2006 Canadian Census (Statistics Canada)
- WP 38: New procedures for editing and imputation of demographic variables (ISTAT, Italy)

E & I for demographic variables: main developments  Further improvement of CANCEIS capability of processing all census variables improved editing and imputation of alphanumeric, discrete, continuous and coded variables improved user interface  Development of DIESIS combined use of “data driven” approach (NIM) and “minimum change” approach (Fellegi-Holt)

E & I for demographic variables: main developments  Development of DIESIS Use of graph theory to improve quality of sequential imputation Optimization procedure to locate the household reference person New approach for selection of donors  based on partitioning passed records into smaller subsets of similar characteristics  search for donor records within the smaller clusters

Selective editing: papers
Two papers:
- WP 42: Evaluation of score functions for selective editing of annual structural business statistics (Statistics Netherlands)
- WP 45: An editing procedure for low pay data in the annual survey of hours and earnings (Office for National Statistics, UK)

Selective editing: main developments
- Continued use and development of selective editing
- Evaluation of selective editing approaches
  - experiments with different sets of score functions
- Development of "hybrid editing"
  - validate a sample of failed records
  - use associated data to impute remaining records
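
For readers unfamiliar with score functions, the sketch below shows one commonly used form of local score: the weighted absolute difference between the reported and an anticipated value, scaled by an estimated total. It is an illustrative assumption, not one of the score functions evaluated in WP 42, and the records, weights, total and threshold are hypothetical.

```python
def local_score(reported, anticipated, weight, estimated_total):
    """Weighted deviation from the anticipated value, relative to the total."""
    return weight * abs(reported - anticipated) / estimated_total


def select_for_review(records, estimated_total, threshold):
    """Return ids of records whose score exceeds the threshold, highest first."""
    scored = [
        (local_score(r["reported"], r["anticipated"], r["weight"], estimated_total), r["id"])
        for r in records
    ]
    return [rid for score, rid in sorted(scored, reverse=True) if score > threshold]


# Hypothetical records: reported value, anticipated value (e.g. last year's), weight.
records = [
    {"id": "A", "reported": 1200.0, "anticipated": 1000.0, "weight": 10.0},
    {"id": "B", "reported": 50000.0, "anticipated": 5000.0, "weight": 5.0},
    {"id": "C", "reported": 300.0, "anticipated": 310.0, "weight": 20.0},
]

selected = select_for_review(records, estimated_total=100000.0, threshold=0.01)
print(selected)
```

Records B and A exceed the threshold and would go to manual review, while record C would be left to automatic treatment. Comparing different score functions and thresholds on the same data is the kind of experiment the slide refers to.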

Software: papers
Four papers:
- WP 34: The transition from GEIS to BANFF (Statistics Canada)
- WP 37: Concepts, materials and IT modules for data editing of German statistics (Destatis, Germany)
- WP 39: SLICE 1.5: a software framework for automatic edit and imputation (Statistics Netherlands)
- WP 47: Improving an edit and imputation system for the US Census of agriculture (NASS, US)

Software: main developments
- Flexibility
  - modules rather than large systems are developed
  - standard statistical packages are used (SAS in BANFF and US Census of Agriculture)
- Testing and implementation of the software
- Quality control measures
  - e.g. for (donor) imputation
- Integration of the edit and imputation software in entire production process
  - process chain: planning, data collection, edit and imputation

General points for discussion
- Are there any really new approaches?
  - Are new approaches extensions of existing ideas?
  - Are new approaches combinations of old ones?
- Develop new approaches or consolidate old approaches?
  - development versus evaluation studies and testing
  - prototype software versus implementation of production software
- Is our focus shifting?
  - from editing towards imputation?
  - from development towards implementation?
  - from computational aspects towards quality issues?

Automatic editing: points for discussion  Can operations research techniques be combined with techniques from mathematical logic?  What are the (dis)advantages of using SAT solvers when compare to direct integer programming methods?  What is the quality of the imputations when editing data using the quadratic programming approach?

Automatic editing: points for discussion  What is the quality of the solutions found by using the combinatorial optimization approach on real survey data? How fast is this approach on realistic data?  Can finite mixture models be used for detection of other types of systematic errors?  Should we invest on developing generic tools or software tools tailored to a particular application?

Automatic editing: points for discussion  Are there any other types of surveys that are worth the effort of generating implied edits prior to error localization?  What are the most cost-effective methods for edit/imputation in terms of resources, time, clerical intervention, quality of results?

Imputation: points for discussion  What are the (dis)advantages of using complex mathematical models for missing data imputation? Are these models too complex for survey practitioners?  What are the expected computational difficulties of applying complex models to real survey data?  What are the largest (most complex) surveys that can be imputed using these models?

Imputation: points for discussion  What is the quality of the imputations carried out using model based methods for filling-in missing data?  Can we compare the different imputation models?

Imputation: points for discussion  Can more guidelines for the IMAI process be developed?  To what extent can we develop a systematic way of applying IMAI?  Is imputation variance an important issue at the moment, or should we (still) focus on imputation bias?

E & I for demographic variables: points for discussion
- Can CANCEIS/DIESIS be used for other data besides demographic census data?
- Can CANCEIS/DIESIS be further developed?
- Should we use a combination of edit and imputation methods or a single method for demographic variables?

Selective editing: points for discussion
- Can selective editing be successfully applied to large/complex surveys?
- Can current methods for selective editing be further developed?
- Can a general theory for selective editing be developed?
- How promising is hybrid editing?

Software: points for discussion
- Should we develop generic software or software tools for particular applications?
- How can we ensure the flexibility of software?
- Are the software tools fast enough for large/complex data sets?
- To what extent should we aim to automate the editing process?