Methods and software for editing and imputation: recent advancements at Istat M. Di Zio, U. Guarnera, O. Luzi, A. Manzari ISTAT – Italian Statistical Institute.

Methods and software for editing and imputation: recent advancements at Istat M. Di Zio, U. Guarnera, O. Luzi, A. Manzari ISTAT – Italian Statistical Institute UN/ECE Work Session on Statistical Data Editing Ottawa, May 2005

Outline
• Introduction
• Editing: Finite Mixture Models for continuous data
• Imputation: Bayesian Networks for categorical data
• Imputation: the Quis system for continuous data
• E&I: data clustering for improving the search for donors in the Diesis system

Recent advancements at Istat
In order to reduce waste of resources and to disseminate best practices, efforts were concentrated in two directions:
– identifying methodological solutions for some common types of errors
– providing survey practitioners with generalized tools, in order to facilitate the adoption of new methods and to increase the standardization of processes

Editing: identifying systematic unity measure errors (UME)
A UME occurs when the “true” value of a variable X_j is reported in a wrong scale, i.e. as X_j·C with C = 100, C = 1,000, and so on.

Finite Mixture Models of Normal Distributions
• Probabilistic clustering, based on the assumption that the observations come from a mixture of a finite number of populations or groups G_g, in various proportions π_g
• Given a parametric form for the density function in each group, maximum likelihood estimates can be obtained for the unknown parameters

Finite Mixture Models for UME
• Given q variables X_1, …, X_q, the h = 2^q possible clusters (mixture components) correspond to groups of units whose items are affected by UME in different subsets (error patterns)
• Assuming that valid data are normally distributed and working on a log scale, each cluster is characterised by a p.d.f. f_g(y; θ) = MN(μ_g, Σ), where μ_g is translated by a known vector and Σ is constant across all clusters
• Units are assigned to clusters on the basis of their posterior probabilities τ_g(y_i; θ)
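
On the log scale a UME becomes an additive shift, since log(X_j·C) = log(X_j) + log(C): the components therefore share one covariance matrix and differ only by known translations of the mean, so EM only has to estimate the base mean, the common covariance and the mixing proportions. The following Python sketch illustrates this constrained EM; it assumes q = 2 variables and a single scale factor C = 1,000, and all names are illustrative rather than taken from the Istat implementation.

```python
# Minimal sketch of the constrained mixture model for unity measure
# errors (UME), on log-transformed data.
import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

def fit_ume_mixture(y, C=1000.0, n_iter=200):
    """EM for a normal mixture whose component means are the base mean
    translated by log(C) on the variables affected by a UME; the
    covariance matrix is common to all components."""
    n, q = y.shape
    patterns = np.array(list(product([0, 1], repeat=q)))  # 2^q error patterns
    shifts = patterns * np.log(C)                          # known translations
    h = len(patterns)
    pi = np.full(h, 1.0 / h)           # mixing proportions
    mu = y.mean(axis=0)                # base (error-free) mean
    Sigma = np.cov(y, rowvar=False)    # common covariance
    for _ in range(n_iter):
        # E-step: posterior probability tau_ig of pattern g for unit i
        dens = np.column_stack([
            pi[g] * multivariate_normal.pdf(y, mu + shifts[g], Sigma)
            for g in range(h)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update proportions, base mean and common covariance
        pi = tau.mean(axis=0)
        mu = sum(tau[:, g:g+1] * (y - shifts[g]) for g in range(h)).sum(axis=0) / n
        Sigma = sum(
            (tau[:, g:g+1] * (y - mu - shifts[g])).T @ (y - mu - shifts[g])
            for g in range(h)) / n
    return pi, mu, Sigma, tau, patterns

# Each unit is assigned to the error pattern with the largest posterior
# probability; pattern (0, ..., 0) means "no UME on any variable".
```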

Model diagnostics used to prioritise units for manual checking
• Atypicality index: identifies outliers with respect to the fitted model (e.g. units possibly affected by errors other than UME)
• Classification probabilities τ_g(y_i; θ): identify possibly misclassified units; they can be used directly to single out misclassifications that are potentially influential on target estimates (significance editing)
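
One plausible construction of such an atypicality index (a sketch under the model assumptions above, not necessarily the definition used at Istat) scores each unit by how extreme its squared Mahalanobis distance is even from its closest component, using the chi-square distribution that this distance follows under the model:

```python
import numpy as np
from scipy.stats import chi2

def atypicality(y, mu, Sigma, shifts):
    """Chi-square CDF of the squared Mahalanobis distance from the
    closest mixture component; values near 1 flag units that are
    unlikely under every component, i.e. possible errors other than
    UME. (Illustrative construction.)"""
    inv = np.linalg.inv(Sigma)
    d2 = np.column_stack([
        np.einsum('ij,jk,ik->i', y - mu - s, inv, y - mu - s)
        for s in shifts])
    return chi2.cdf(d2.min(axis=1), df=y.shape[1])
```

Units can then be ranked for review by this score or, for instance, by how close their largest classification probability is to 1/2 (ambiguous classifications).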

Main findings
• Finite mixture modelling allows multivariate, non-hierarchical data analysis; the costs of developing ad hoc procedures are saved
• Finite mixture modelling produces highly reliable automatic data clustering/error localization
• Model diagnostics can be used to reduce the costs of manual editing
• The approach is robust to moderate departures from normality
• The number of model parameters is kept low by the model constraints on μ and Σ

Imputation: Bayesian Networks for categorical variables
• The first idea of using BNs for imputation is due to Thibaudeau and Winkler (2002)
• Let C_1, …, C_n be a set of categorical variables, each with a finite set of mutually exclusive states
• BNs represent the joint distribution of the variables both graphically and numerically:
– a BN can be viewed as a Directed Acyclic Graph, together with
– an inferential engine that allows inferences on the distribution parameters

Graphical representation of BNs
To each variable C_j with parents Pa(C_j) is attached a conditional probability distribution P(C_j | Pa(C_j)). BNs allow the joint probability distribution to be factorised as
P(C_1, …, C_n) = Π_{j=1,…,n} P(C_j | Pa(C_j))
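
A tiny worked example of the factorisation, on a hypothetical three-variable network A → B, A → C with binary states (all numbers invented for illustration):

```python
# Conditional probability tables: Pa(A) = {}, Pa(B) = Pa(C) = {A}
P_A = {0: 0.7, 1: 0.3}
P_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
P_C_given_A = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

def joint(a, b, c):
    # P(A, B, C) = P(A) * P(B | A) * P(C | A)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]

# The full joint table would need 2^3 - 1 = 7 free parameters; the
# factorisation needs only 1 + 2 + 2 = 5 (the "reduction of complexity"
# mentioned below).
assert abs(sum(joint(a, b, c)
               for a in (0, 1) for b in (0, 1) for c in (0, 1)) - 1.0) < 1e-12
```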

BNs and imputation: method 1
1. Order the variables according to their “reliability”
2. Estimate the network conditioned on this ordering
3. Estimate the conditional probabilities for each node, given the network from step 2
4. Impute each missing item by a random draw from its conditional probability distribution
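
A minimal sketch of method 1 with illustrative names: the network and the conditional probability tables of steps 2–3 are assumed to have been estimated already, and the reliability ordering is assumed consistent with the parent structure, so that parents are always filled in before their children.

```python
import random

def impute_record(record, order, parents, cpt):
    """record: dict variable -> observed state or None (missing);
    order: variables sorted by decreasing reliability;
    parents: dict variable -> tuple of its parent variables;
    cpt: dict variable -> {parent states tuple: {state: prob}}."""
    for var in order:
        if record[var] is None:
            pa_vals = tuple(record[p] for p in parents[var])
            dist = cpt[var][pa_vals]                 # P(var | Pa(var))
            states, probs = zip(*dist.items())
            record[var] = random.choices(states, weights=probs)[0]  # step 4
    return record
```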

BNs and imputation: methods 2/3
In a multivariate context it is more convenient to use not only the information coming from a node’s parents, but also that coming from its children. This can be done through the Markov blanket (Mb):
Mb(X) = Pa(X) ∪ Ch(X) ∪ Pa(Ch(X))
In this case the conditional probabilities of each node are estimated with respect to its Mb.
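
Computing the Markov blanket from the graph structure is straightforward; a sketch on a DAG stored as a dict of parent sets (an illustrative representation):

```python
def markov_blanket(x, parents):
    """Mb(x) = parents of x, children of x, and the other parents of
    x's children."""
    children = {v for v, pa in parents.items() if x in pa}
    co_parents = {p for c in children for p in parents[c]} - {x}
    return set(parents[x]) | children | co_parents

# Example: in A -> X -> C with B -> C, the blanket of X is {A, C, B}.
parents = {"A": set(), "B": set(), "X": {"A"}, "C": {"X", "B"}}
assert markov_blanket("X", parents) == {"A", "B", "C"}
```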

Main findings
• BNs express joint probability distributions with a dramatic decrease in the number of parameters to be estimated (reduction of complexity)
• BNs can estimate the relationships between variables that are really informative for predicting values
• Parametric models like BNs are efficient in terms of preservation of joint distributions
• The graphical representation facilitates modelling
• BNs and hot-deck methods behave in the same way only when the hot deck is stratified according to variables that exactly explain the missingness mechanism

Imputation: the Quis system for continuous variables
Quis (QUick Imputation System) is a generalized SAS tool developed at Istat to impute continuous survey data in a unified environment. Given a set of variables subject to nonresponse, different methods can be used in a completely integrated way:
• Regression imputation via the EM algorithm
• Nearest Neighbour Donor imputation (NND)
• Multivariate Predictive Mean Matching (PMM)

Regression imputation via EM
In the context of imputation, the EM algorithm is used to obtain maximum likelihood estimates of the parameters of the model assumed for the data, in the presence of missing values. Assumptions:
• MAR mechanism
• Normality

Regression imputation via EM
Once ML estimates of the parameters have been obtained, missing data can be imputed in two different ways:
• directly, through the expectations of the missing values conditional on the observed ones (predictive means)
• by adding a normal random residual to the predictive means (i.e. drawing values from the conditional distribution of the missing values)
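
Quis itself is a SAS tool; the following Python sketch is only an illustration of the two steps, assuming multivariate normal data with entries missing at random (np.nan marks a missing value):

```python
import numpy as np

def em_mvnorm(Y, n_iter=100):
    """EM estimates of (mu, Sigma) for multivariate normal data with
    missing values."""
    n, q = Y.shape
    miss = np.isnan(Y)
    mu = np.nanmean(Y, axis=0)
    Sigma = np.diag(np.nanvar(Y, axis=0))
    for _ in range(n_iter):
        S1, S2 = np.zeros(q), np.zeros((q, q))
        for i in range(n):
            m, o = miss[i], ~miss[i]
            y, C = Y[i].copy(), np.zeros((q, q))
            if m.any():
                B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
                y[m] = mu[m] + B @ (Y[i, o] - mu[o])      # predictive mean
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
            S1 += y
            S2 += np.outer(y, y) + C
        mu, Sigma = S1 / n, S2 / n - np.outer(S1 / n, S1 / n)
    return mu, Sigma

def impute(Y, mu, Sigma, stochastic=True, rng=None):
    """Fill missing entries with the predictive mean, optionally adding
    a residual drawn from the conditional normal distribution."""
    rng = rng or np.random.default_rng(0)
    Y = Y.copy()
    for i in range(len(Y)):
        m = np.isnan(Y[i])
        if not m.any():
            continue
        o = ~m
        B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
        pred = mu[m] + B @ (Y[i, o] - mu[o])
        if stochastic:
            Cmm = Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
            pred = rng.multivariate_normal(pred, Cmm)
        Y[i, m] = pred
    return Y
```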

Multivariate Predictive Mean Matching (PMM)
Let Y = (Y_1, …, Y_q) be a set of variables subject to nonresponse.
• ML estimates of the parameters θ of the joint distribution of Y are derived via EM
• For each pattern of missing data y_miss, the parameters of the corresponding conditional distribution are estimated starting from θ (sweep operator)
• For each unit u_i, the predictive mean based on the estimated parameters is computed
• For each unit with missing data, imputation uses the nearest donor with respect to the predictive mean
The Mahalanobis distance is adopted to find donors.
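
A sketch of the matching step for one missing-data pattern, assuming (mu, Sigma) have been estimated via EM as above; the sweep operator of the actual implementation is replaced here by explicit conditional-normal formulas:

```python
import numpy as np

def pmm_impute_pattern(donors, recipients, m, mu, Sigma):
    """donors: complete (n_d, q) array; recipients: (n_r, q) array,
    all sharing the same missing pattern m (boolean mask)."""
    o = ~m
    B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
    pred = lambda X: mu[m] + (X[:, o] - mu[o]) @ B.T   # predictive means
    pd, pr = pred(donors), pred(recipients)
    # Mahalanobis metric on the space of predictive means
    k = int(m.sum())
    Vinv = np.linalg.inv(np.cov(pd, rowvar=False).reshape(k, k))
    out = recipients.copy()
    for i, p in enumerate(pr):
        d2 = np.einsum('ij,jk,ik->i', pd - p, Vinv, pd - p)
        out[i, m] = donors[np.argmin(d2), m]           # nearest donor
    return out
```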

Data clustering for improving the search for donors in the Diesis system
The Diesis system was developed at Istat to treat the demographic variables of the 2001 Population Census. Diesis uses both the data-driven and the minimum-change approach to editing and imputation. For each failed household, the set of potential donors contains only the nearest passed households. The adopted distance function is a weighted sum of the distances on each demographic variable, taken over all the individuals in the household.
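
An illustrative, much simplified version of such a distance, on households represented as aligned lists of person records; the variable names and weights are invented, and real households of different sizes would first need to be matched person by person:

```python
def household_distance(failed, donor, weights):
    """Weighted sum, over individuals and demographic variables, of
    per-variable mismatch distances (0 if equal, the weight if not)."""
    return sum(
        w * (person_f.get(var) != person_d.get(var))
        for person_f, person_d in zip(failed, donor)
        for var, w in weights.items())

# Hypothetical usage:
h1 = [{"sex": "F", "age": 34}, {"sex": "M", "age": 36}]
h2 = [{"sex": "F", "age": 34}, {"sex": "M", "age": 41}]
print(household_distance(h1, h2, {"sex": 2.0, "age": 1.0}))  # -> 1.0
```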

The in-use approach to donor search
For each failed household e, the potential donors should be identified by searching the set D of all passed households. When D is very large, as in the case of a Census, computing the distance between e and every d ∈ D (exhaustive search) could require an unacceptable amount of time. The in-use sub-optimal search therefore stops before the entire set D has been examined, according to some stopping criterion. This solution does not guarantee that the selected potential donors are those at actual minimum distance from e.

The new approach to donor search
In order to reduce the number of passed households to examine, the set D is preliminarily divided into smaller homogeneous subsets {D_1, …, D_n} (with D_1 ∪ … ∪ D_n = D). This subdivision is obtained by solving an unsupervised clustering problem (donor search guided by clustering). The search for potential donors is then conducted, for each failed household e, by examining only the households in the cluster(s) most similar to e.
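
A minimal sketch of the clustering-guided search, using k-means as the unsupervised partitioning step (the clustering method actually adopted in Diesis is not detailed here) and assuming households have been encoded as numeric feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_index(passed, n_clusters=50, seed=0):
    """Partition the passed households into homogeneous subsets."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(passed)

def candidate_donors(failed_vec, passed, km, n_search=1):
    """Return only the households in the n_search clusters whose
    centroids are closest to the failed household."""
    d = np.linalg.norm(km.cluster_centers_ - failed_vec, axis=1)
    nearest = np.argsort(d)[:n_search]
    return passed[np.isin(km.labels_, nearest)]

# The exhaustive minimum-distance donor search is then run only on the
# subset returned by candidate_donors.
```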

Main findings
• The donor search guided by clustering reduces computation time while preserving the E&I quality obtained by the exhaustive search
• The donor search guided by clustering increases, with respect to the sub-optimal search, the proportion of selected donors that are at actual minimum distance (this is especially useful for households with an uncommon structure, for which few passed households are generally available)