Predictive Mean Matching using a Factor Model, Varriale - Guarnera – Nuremberg, 09/09/2013 Predictive Mean Matching using a Factor Model, an application.

Presentation transcript:

Predictive Mean Matching using a Factor Model, an application to the Business Multipurpose Survey
Roberta Varriale, Ugo Guarnera (Istat)
Nuremberg, 09/09/2013

Motivation: (Business) Multi Purpose Survey

Sample survey on businesses carried out by Istat together with the Business and Services Census 2011.

Features:
• Aim: identifying specific business profiles
• Complex questionnaire
• Large number of variables (of different nature)
• Nonresponse rate among the 72k enterprises (more than 20 employees): 15%
• Difficult to use an explicit parametric method for imputation

Nearest Neighbor Donor (NND): matches completely observed units (donors) with incomplete units (recipients), based on some distance function, and transfers values from donors to recipients.

Predictive Mean Matching (PMM): the distance function "weights" the covariates used in NND by their relative predictive power with respect to the "target" variables.


Nonresponse imputation

Nonresponse rate among the 72k enterprises (more than 20 employees): 15%

Possible approaches:
• Provide users with an incomplete subset of data
• Delete units with missing values and (possibly) use weighting methods
• Manage missing values with imputation, by "supplying" artificial values to replace missing values

Istat needs to meet the specific demands of external users who carry out analyses with standard software.

Auxiliary information, mostly continuous, is available for the entire dataset; this, together with the complexity of the phenomenon under investigation, makes weighting methods complex to apply.

Predictive Mean Matching

$Y = (Y_1, \dots, Y_P)$: variables of a sample survey to be imputed
$X = (X_1, \dots, X_Q)$: variables available for all units (covariates)

With $Y$ continuous, a typical application of PMM is:
1. The parameters of the regression model of $Y$ on $X$ are estimated with standard methods.
2. Based on the estimates from step 1, the predictive means $Y^* \equiv E(Y \mid X)$ are computed both for the units with missing values (recipients) and for the units with complete data (donors).
3. For each recipient $u_r$, a donor $u_d$ is selected so as to minimize the Mahalanobis distance
   $D(u_d, u_r) \equiv (y_d^* - y_r^*)^T S^{-1} (y_d^* - y_r^*)$,
   where $y_d^*$ and $y_r^*$ are the predictive means estimated for the donor and the recipient, respectively, and $S$ is the residual variance-covariance matrix of the regression model.
4. Each $u_r$ is imputed by transferring the $Y$ values from its closest donor.
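The four steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the regression is an ordinary least-squares fit with an intercept, estimated on the donors only, and the function name `pmm_impute` and the simulated data are hypothetical.

```python
import numpy as np

def pmm_impute(X, Y, observed):
    """Impute rows of Y where `observed` is False by predictive mean matching."""
    # Step 1: least-squares regression of Y on X (with intercept), fitted on donors.
    Xd = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(Xd[observed], Y[observed], rcond=None)
    # Step 2: predictive means Y* for donors and recipients alike.
    Y_star = Xd @ B
    # Residual variance-covariance matrix S of the regression model.
    resid = Y[observed] - Y_star[observed]
    dof = observed.sum() - Xd.shape[1]
    S_inv = np.linalg.inv(resid.T @ resid / dof)
    # Step 3: Mahalanobis distance between predictive means, nearest donor per recipient.
    donors = np.flatnonzero(observed)
    recipients = np.flatnonzero(~observed)
    diff = Y_star[recipients][:, None, :] - Y_star[donors][None, :, :]
    D = np.einsum("rdi,ij,rdj->rd", diff, S_inv, diff)
    nearest = donors[D.argmin(axis=1)]
    # Step 4: transfer the observed Y values from the closest donor.
    Y_imp = Y.copy()
    Y_imp[recipients] = Y[nearest]
    return Y_imp, nearest

# Hypothetical data: 3 covariates, 2 continuous targets, about 20% missing.
rng = np.random.default_rng(0)
n, Q, P = 200, 3, 2
X = rng.normal(size=(n, Q))
Y = X @ rng.normal(size=(Q, P)) + rng.normal(scale=0.5, size=(n, P))
observed = rng.random(n) > 0.2
Y_missing = Y.copy()
Y_missing[~observed] = np.nan
Y_imp, nearest = pmm_impute(X, Y_missing, observed)
```

Because imputed values are copied from real donors, every imputed record is an observed combination of Y values, which is the usual argument for donor methods over drawing from the fitted model.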

Predictive Mean Matching
• "weights" the covariates to be used in NND by their relative predictive power with respect to the target variables
• establishes the distance between units by weighting the "goodness of prediction" of the single target variables → the largest "weights" go to the target variables whose predictive means have the smallest prediction error

PMM → Mahalanobis distance (residual covariance matrix from the regression model)

Research problem: which model and which distance function should be used when the target variables are not all continuous?

MPS: the target variables are all categorical

Predictive Mean Matching

In PMM for categorical variables to be imputed:
• the "natural" imputation model to define an appropriate distance is the log-linear model with covariates X
• the distance function is defined on the estimated probabilities of the categories of the target variables

Two main limitations:
1. The model becomes very complex when the number of variables used in the imputation model is large (i.e. it can be used only when we are able to set up and process the full multi-way cross-tabulation required for the log-linear analysis).
2. The distance function has to take into account both the distance between the expected frequencies of the multi-way cross-tabulation and the variability due to the estimation process.

→ PMM with latent variables (factor model)
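To give a feel for the first limitation, the snippet below counts the cells of a full multi-way cross-tabulation. The category counts of the four target variables (4, 2, 3, 2) are those of the MPS application described in this presentation; cutting each continuous covariate into 5 classes is purely a hypothetical assumption made for the illustration.

```python
from math import prod

def n_cells(y_categories, x_categories):
    """Number of cells in the full cross-tabulation of targets and covariates."""
    return prod(y_categories) * prod(x_categories)

y_categories = [4, 2, 3, 2]      # Y1..Y4 in the MPS application
x_categories = [5, 5, 5, 2]      # hypothetical: X1..X3 cut into 5 classes, X4 binary

cells_targets_only = prod(y_categories)          # joint categories of Y1..Y4
cells_full = n_cells(y_categories, x_categories)
print(cells_targets_only, cells_full)            # 48 12000
```

Each additional variable multiplies the table size, so sparse cells appear quickly even with a sample of tens of thousands of enterprises.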

Factor model (with covariates)

Structural Equation Model: MIMIC (Multiple Indicators Multiple Causes) model

[Path diagram: the covariates x_1, …, x_q, …, x_Q and the error term u point to the latent factor η, which points to the indicators y_1, …, y_p, …, y_P]

The model is composed of 2 parts:
1. Factor model → links the latent factor to the observed indicators:
   $g_p(E(y_p \mid \eta)) = \nu_p$, with $\nu_p = \mu_p + \lambda_p \eta$
2. Regression model → links the covariates to the latent factor:
   $\eta = \alpha_0 + \alpha_1 X_1 + \dots + \alpha_Q X_Q + u$
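The two parts of the model can be made concrete by simulating data from them. The sketch below assumes binary indicators with a logit link for every g_p and a standard normal error u; the slides do not fix these choices, and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, Q, P = 1000, 3, 4

# Hypothetical parameter values, chosen only for illustration.
alpha0 = 0.0
alpha = np.array([0.8, -0.5, 0.3])      # regression of eta on X_1..X_Q
mu = np.array([-0.5, 0.0, 0.5, 1.0])    # intercepts mu_p
lam = np.array([1.0, 1.5, 0.7, 1.2])    # loadings lambda_p

# Regression part: eta = alpha_0 + alpha_1 X_1 + ... + alpha_Q X_Q + u
X = rng.normal(size=(n, Q))
u = rng.normal(size=n)
eta = alpha0 + X @ alpha + u

# Factor part: nu_p = mu_p + lambda_p * eta, with g_p(E(y_p | eta)) = nu_p
nu = mu + np.outer(eta, lam)            # shape (n, P)
prob = 1.0 / (1.0 + np.exp(-nu))        # inverse of the assumed logit link
Y = rng.binomial(1, prob)               # binary indicators y_1..y_P
```

The covariates influence the indicators only through the single latent factor, which is what makes its expected value a natural one-dimensional matching variable.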


PMM with one latent factor

$Y = (Y_1, \dots, Y_P)$: variables of a sample survey to be imputed
$X = (X_1, \dots, X_Q)$: variables available for all units (covariates)

With $Y$ categorical (or not all continuous), PMM with a latent factor works as follows:
1. The parameters of the factor model with covariates are estimated with standard methods.
2. Based on the estimates from step 1, the predictive means $\eta^* \equiv E(\eta \mid X)$ are computed both for the units with missing values (recipients) and for the units with complete data (donors).
3. For each recipient $u_r$, a donor $u_d$ is selected so as to minimize the distance between the $\eta^*$ values.
4. Each $u_r$ is imputed by transferring the $Y$ values from its closest donor.

The "matching variable" of PMM is the (expected value of the) latent factor η, which
• summarizes data variability (response variables $y_p$)
• is related to the variables $X_1, \dots, X_Q$
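Steps 2 to 4 above can be sketched as follows, assuming the regression coefficients of the factor model have already been estimated elsewhere (the presentation mentions Latent Gold for this). The function names, the toy data and the use of -1 as a missing-value code are hypothetical.

```python
import numpy as np

def match_on_latent_mean(X, alpha0, alpha, observed):
    """Nearest donor for each recipient on eta* = alpha0 + X @ alpha."""
    eta_star = alpha0 + X @ alpha                    # step 2: E(eta | X)
    donors = np.flatnonzero(observed)
    recipients = np.flatnonzero(~observed)
    # Step 3: eta* is one-dimensional, so the distance is an absolute difference.
    dist = np.abs(eta_star[recipients][:, None] - eta_star[donors][None, :])
    return recipients, donors[dist.argmin(axis=1)]

def impute_from_donors(Y, recipients, nearest):
    """Step 4: copy all Y values from the closest donor."""
    Y_imp = Y.copy()
    Y_imp[recipients] = Y[nearest]
    return Y_imp

# Toy example: one covariate, alpha treated as already estimated.
X = np.array([[0.0], [1.4], [2.0], [3.0], [10.0]])
Y = np.array([[0, 1], [-1, -1], [2, 0], [3, 1], [-1, -1]])   # -1 marks missing
observed = np.array([True, False, True, True, False])

recipients, nearest = match_on_latent_mean(X, 0.0, np.array([1.0]), observed)
Y_imp = impute_from_donors(Y, recipients, nearest)
print(nearest)   # [2 3]
```

Since the matching variable is a scalar, no covariance matrix is needed and the donor search can be done by sorting, which scales well to large surveys.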

(Business) Multi Purpose Survey

Target variables:
• Y_1: type and nationality of decision management (4 categories: 0 = other, 1 = physical person/family, Italian, 2 = other firm, Italian, 3 = abroad)
• Y_2: employees with high skills (2 categories: high/not high)
• Y_3: type of partnership (3 categories: 0 = no partnership, 1 = both formal and informal partnership, 2 = only formal partnership)
• Y_4: delocalization of specific production functions (2 categories: yes/no)

Covariates:
• X_1: number of employees (continuous)
• X_2: value added (continuous)
• X_3: turnover (continuous)
• X_4: membership in an enterprise group (2 categories)

Stratification variables in the imputation process:
• Section of economic activity (ATECO)
• Export/import activity

Simulation study

Monte Carlo simulation study
Data: MPS, available at 20/12/2012
# enterprises: (20 or more employees)
200 iterations

1. Simulation of item nonresponse (20% of the total number of observations) on the Y variables according to a Missing at Random (MAR) mechanism
2. Estimation of the marginal and joint frequencies of the categories of Y_1, …, Y_4 for the dropped units, using different methods
3. Evaluation of the different methods by comparing the true and estimated frequencies obtained at each iteration with the Hellinger distance, and averaging the results over the 200 replications (frequencies are compared separately for each estimation domain defined by the ATECO)

Software: R, Latent Gold
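Steps 1 and 3 of the simulation can be illustrated with two small helpers. Both rest on assumptions that the slides do not spell out: the Hellinger distance uses the normalization that keeps it between 0 and 1, and the MAR mechanism lets the probability of nonresponse depend on a single, always observed covariate through a logistic function. The names `hellinger` and `mar_mask` are hypothetical.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def mar_mask(x, rate, rng):
    """Missingness indicator whose probability depends only on the observed x."""
    score = (x - x.mean()) / x.std()
    w = 1.0 / (1.0 + np.exp(-score))
    prob = np.clip(rate * w / w.mean(), 0.0, 1.0)   # average probability close to `rate`
    return rng.random(len(x)) < prob

rng = np.random.default_rng(2)
x = rng.lognormal(mean=3.0, sigma=1.0, size=5000)   # hypothetical size variable
missing = mar_mask(x, rate=0.20, rng=rng)

# Compare "true" and "estimated" category frequencies for one variable.
true_freq = np.array([0.50, 0.30, 0.15, 0.05])
est_freq = np.array([0.48, 0.31, 0.16, 0.05])
d = hellinger(true_freq, est_freq)
```

In the study this comparison is repeated within each ATECO domain and averaged over the 200 replications; the sketch shows a single comparison only.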


Simulation study

Estimation of the marginal and joint frequencies of the categories of Y_1, …, Y_4 for the dropped units, using different methods.

Computation of the expected frequencies according to the estimates obtained by the models:
• Logit: multinomial logit model
• Factor: factor model with covariates

Drawing realizations of Y_1, …, Y_4 using the probabilities estimated by the models:
• Logit.Rnd: multinomial logit model
• Factor.Rnd: factor model with covariates

Imputation by NND (Euclidean distance):
• X.Donor, matching variables: X_1, …, X_4
• Logit.Donor, matching variables: probabilities of each category of Y_1, …, Y_4 estimated by a multinomial logit model with covariates X_1, …, X_4
• Factor.Donor, matching variable: value of the latent factor estimated by a factor model with covariates (PMM with a latent factor)
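The difference between the "expected frequency" methods and their ".Rnd" counterparts can be shown on a single categorical variable. The sketch assumes a matrix of estimated category probabilities, one row per dropped unit, is already available from some model; here it is filled with hypothetical Dirichlet draws, and the function names are illustrative.

```python
import numpy as np

def expected_frequencies(prob):
    """Average the estimated category probabilities over units."""
    return prob.mean(axis=0)

def drawn_frequencies(prob, rng):
    """Draw one category per unit from its probabilities, then tabulate."""
    n, K = prob.shape
    cum = prob.cumsum(axis=1)
    u = rng.random(n)
    # Inverse-CDF draw: count how many cumulative probabilities lie below u.
    draws = np.minimum((u[:, None] > cum).sum(axis=1), K - 1)
    return np.bincount(draws, minlength=K) / n

rng = np.random.default_rng(3)
prob = rng.dirichlet(np.ones(4), size=500)   # hypothetical estimated probabilities

freq_expected = expected_frequencies(prob)
freq_drawn = drawn_frequencies(prob, rng)
```

Both vectors estimate the same frequencies; the drawn version carries extra sampling noise from the random draw, while the expected version does not produce an imputed value for each unit.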

Simulation study - results

• The Hellinger distance computed for each method is similar
• The reduction of dimensionality performed by the PMM with a factor model does not harm the results of the imputation process
• The performance of the NND methods is very similar to that of the corresponding methods based on drawing directly from the estimated probabilities of the categories of Y → an advantage of using NND is that it allows us to impute all variables of each incomplete record, rather than only the target variables
• The additional variability introduced by the random drawing methods results in a small increase in the Hellinger distance values


Future research

The results show a quite good performance of the PMM with a factor model.

Further analysis (subject matter and methodological matter):
• (additional) analysis of the nonresponse process (nonresponse pattern)
• (additional) analysis of the variable "meanings"
• Monte Carlo experiments using simulated data
• Suggestions?

Thank you for your attention
