Calibrated imputation of numerical data under linear edit restrictions
Jeroen Pannekoek, Natalie Shlomo, Ton de Waal


Missing data
Data may be missing from collected data sets.
Unit non-response: data from entire units are missing. This is often dealt with by means of weighting.
Item non-response: some items from units are missing. This is usually dealt with by means of imputation.

Linear edit restrictions
Data often have to satisfy edit restrictions. For numerical data most edits are linear.
Balance equations: $a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b = 0$
Inequalities: $a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b \ge 0$

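Totals
Sometimes totals are also known. In the layout below, each row is a record, each column is a variable, and the bottom row holds the known column totals:

    x_11   x_12   x_13
    x_21   x_22   x_23
    ...    ...    ...
    x_r1   x_r2   x_r3
    ------------------
    X_1    X_2    X_3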

Eliminating balance equations
We can "eliminate balance equations".
Example set of edits: net + tax – gross = 0; net ≥ tax; net ≥ 0.
Eliminating the balance equation gives net = gross – tax, and the remaining edits become gross – tax ≥ tax and gross – tax ≥ 0.

Eliminating balance equations
By eliminating all balance equations we only have to deal with inequality edits.
If we impute variables sequentially, we only have to ensure that each imputed value lies in an interval $L_i \le x_i \le U_i$.
We can now focus on satisfying totals.

Imputation methods
1. Adjusted predicted mean imputation
2. Adjusted predicted mean imputation with random residuals
3. MCMC approach

Adjusted predicted mean imputation
We use sequential imputation: all missing values for a variable (the target variable) are imputed simultaneously. We impute target column $x_t$.
We use the model $x_t = \beta_0 + \beta x_p + e$.
We impute $x_t = \beta_0 + \beta x_p$.
Imputed values obtained this way generally satisfy neither the edits nor the totals.

Satisfying totals
The totals of the missing data for the target variable ($X_{t,mis}$) as well as for the predictor ($X_{p,mis}$) are known. We construct the following model:
$x_{t,obs} = \beta_0 + \beta x_{p,obs} + e$ (observed data)
$X_{t,mis} = \beta_1 m + \beta X_{p,mis}$ (totals of the missing data)
Here m is the number of missing values. We apply OLS to estimate the model parameters.
We impute $x_{t,mis} = \beta_1 + \beta x_{p,mis}$. The sum of the imputed values then equals the known value of this total.
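As an illustration of the benchmarked regression above, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It stacks the observed records and the single totals equation into one least-squares problem with parameters $\beta_0$, $\beta_1$ and $\beta$. The function name `benchmarked_impute` and the example numbers are assumptions made for this sketch.

```python
import numpy as np

def benchmarked_impute(x_p_obs, x_t_obs, x_p_mis, total_t_mis):
    """Impute x_t for missing records so that the imputations sum to total_t_mis.

    Hypothetical helper: fits the observed-data regression and the totals
    equation jointly by OLS, with parameters (beta_0, beta_1, beta).
    """
    x_p_obs = np.asarray(x_p_obs, dtype=float)
    x_t_obs = np.asarray(x_t_obs, dtype=float)
    x_p_mis = np.asarray(x_p_mis, dtype=float)
    n, m = len(x_p_obs), len(x_p_mis)

    # Rows 0..n-1: x_t,obs = beta_0 + beta * x_p,obs + e
    # Row n:       X_t,mis = beta_1 * m + beta * X_p,mis
    design = np.zeros((n + 1, 3))
    design[:n, 0] = 1.0
    design[:n, 2] = x_p_obs
    design[n, 1] = m
    design[n, 2] = x_p_mis.sum()
    response = np.append(x_t_obs, total_t_mis)

    (beta_0, beta_1, beta), *_ = np.linalg.lstsq(design, response, rcond=None)
    return beta_1 + beta * x_p_mis

# Made-up example data, for illustration only
imputed = benchmarked_impute([10, 20, 30, 40], [8, 15, 24, 31], [15, 25, 35], 60.0)
print(imputed, imputed.sum())
```

Because $\beta_1$ appears only in the totals row, that row is fitted exactly, which is why the imputed values add up to the known total.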

Satisfying totals and intervals (edits)
We impute $x_{t,mis} = \beta_1 + \beta x_{p,mis} + a_t$.
The adjustments $a_{t,i}$ are chosen in such a way that the imputed values lie in their feasible intervals and $\sum_i a_{t,i} = 0$.
Appropriate values for $a_{t,i}$ can be found by means of an operations research technique. For a simple alternative technique, see the paper.
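The slides do not spell out how the adjustments are computed, so the following Python sketch uses a simple clip-and-redistribute heuristic purely as an illustration. It is an assumption of this sketch, not the operations research technique or the alternative from the paper. The function name `adjust_to_intervals` and the example values are hypothetical.

```python
import numpy as np

def adjust_to_intervals(x_hat, lower, upper):
    """Return adjusted values inside [lower, upper] with the same sum as x_hat.

    Illustrative heuristic: clip to the intervals, then spread the resulting
    gap over the records in proportion to their remaining slack.
    """
    x_hat = np.asarray(x_hat, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    total = x_hat.sum()
    if not (lower.sum() <= total <= upper.sum()):
        raise ValueError("total cannot be met within the intervals")

    x = np.clip(x_hat, lower, upper)
    gap = total - x.sum()
    if gap > 0:                      # clipped sum too small: move values up
        slack = upper - x
        x = x + gap * slack / slack.sum()
    elif gap < 0:                    # clipped sum too large: move values down
        slack = x - lower
        x = x + gap * slack / slack.sum()
    return x, x - x_hat              # adjusted values and adjustments a_{t,i}

# Made-up example: two of the three predictions violate the interval [0, 10]
adjusted, a = adjust_to_intervals([5.0, 12.0, -1.0], [0, 0, 0], [10, 10, 10])
print(adjusted, a.sum())
```

The adjustments sum to zero by construction, so a total that was already satisfied before the adjustment stays satisfied.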

Satisfying totals and intervals (edits)
Alternatively, draw m residuals by acceptance/rejection sampling from a normal distribution (zero mean, variance equal to the residual variance of the regression model) such that the interval constraints are satisfied.
Then adjust the random residuals to meet the sum constraint, in the same way as for the $a_{t,i}$.
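The acceptance/rejection step can be sketched as follows in Python with NumPy. This is a minimal illustration under stated assumptions: the function name `draw_residuals`, the `max_tries` safeguard and the example numbers are not from the slides, and the subsequent adjustment to the sum constraint is not shown.

```python
import numpy as np

def draw_residuals(pred, lower, upper, sigma, rng, max_tries=10_000):
    """Draw one N(0, sigma^2) residual per record, rejecting any draw that
    would push pred + residual outside the interval [lower, upper]."""
    pred, lower, upper = (np.asarray(a, dtype=float) for a in (pred, lower, upper))
    resid = np.empty_like(pred)
    for i in range(len(pred)):
        for _ in range(max_tries):
            e = rng.normal(0.0, sigma)
            if lower[i] <= pred[i] + e <= upper[i]:
                resid[i] = e          # accepted
                break
        else:
            raise RuntimeError(f"no feasible residual found for record {i}")
    return resid

# Made-up example values
rng = np.random.default_rng(0)
pred = np.array([5.0, 6.0, 7.0])
resid = draw_residuals(pred, [4, 4, 4], [8, 8, 8], sigma=1.0, rng=rng)
print(pred + resid)
```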

MCMC approach
Start with a pre-imputed consistent data set.
Randomly select two records.
We select a variable in these records. Note that we know the sum of the two values of this variable for the two records.

MCMC approach
We then apply the following two steps:
1. We determine intervals for the two values.
2. We then draw a value for one missing value. The other value then immediately follows.
Now, repeat Steps 1 and 2 until "convergence".
In Step 2 we draw a value from the posterior predictive distribution implied by a linear regression model under an uninformative prior, conditional on the fact that it has to lie inside the corresponding interval.
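One pairwise update can be sketched as follows in Python with NumPy. This is a simplified illustration: instead of the posterior predictive distribution described above, it draws from a normal distribution with a given mean and standard deviation and rejects draws outside the feasible interval. The function name `pair_update` and all example values are assumptions of this sketch.

```python
import numpy as np

def pair_update(x, i, j, lower, upper, mean_i, sigma, rng, max_tries=1000):
    """Redraw x[i] and x[j] while keeping x[i] + x[j] fixed.

    Step 1: the interval for x[i] combines its own bounds with the bounds
            implied for x[j] = s - x[i].
    Step 2: draw x[i] inside that interval; x[j] then follows from the sum.
    """
    s = x[i] + x[j]
    lo = max(lower[i], s - upper[j])
    hi = min(upper[i], s - lower[j])
    for _ in range(max_tries):
        draw = rng.normal(mean_i, sigma)   # stand-in for the predictive draw
        if lo <= draw <= hi:
            x[i] = draw
            x[j] = s - draw
            return True
    return False                           # no accepted draw: keep current values

# Made-up consistent starting values
rng = np.random.default_rng(1)
x = np.array([3.0, 7.0, 5.0])
lower, upper = np.zeros(3), np.full(3, 10.0)
pair_update(x, 0, 1, lower, upper, mean_i=4.0, sigma=1.0, rng=rng)
print(x, x.sum())
```

Since every update preserves the pairwise sum, the column total stays at its known value throughout the iterations.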

Evaluation study: methods
Evaluated imputation methods:
UPMA: unbenchmarked simple predictive mean imputation with adjustments to imputations that satisfy interval constraints.
BPMA: benchmarked predictive mean imputation with adjustments to imputations that satisfy interval constraints and totals.
MCMC: BPMA with adjustments was used as the pre-imputed data set for the MCMC approach.

Evaluation study: data set
11,907 individuals aged 15 and over who responded to all questions in the 2005 Israel Income Survey and whose monthly gross income exceeded 1000 Israeli shekels.
Item non-response was introduced randomly to the income variables:
20% of records were selected randomly and their net income variable deleted.
20% of records were selected randomly and their tax variable deleted, while 10% of those records were in common with the missing net income variable.
Totals of each of the income variables are known.

Evaluation study: data set
We focus on three variables from the Income Survey:
gross: gross income from earnings
net: net income from earnings
tax: tax paid
Edits: net + tax = gross; net ≥ tax; gross ≥ 3 × tax; gross ≥ 0, net ≥ 0, tax ≥ 0.
A log transform was carried out on the variables to ensure normality of the data.
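To make the edits concrete, the following Python sketch checks a record against them and derives the feasible interval for tax given gross, after eliminating the balance equation via net = gross – tax. The function names `tax_interval` and `satisfies_edits` and the example amounts are hypothetical.

```python
def tax_interval(gross):
    """Feasible interval for tax given gross, with net = gross - tax eliminated."""
    upper = min(
        gross / 2.0,   # net >= tax      ->  gross - tax >= tax
        gross / 3.0,   # gross >= 3 * tax
        gross,         # net >= 0        ->  gross - tax >= 0
    )
    return 0.0, upper  # lower bound from tax >= 0

def satisfies_edits(gross, net, tax, tol=1e-9):
    """Check the balance equation and all inequality edits for one record."""
    return (
        abs(net + tax - gross) <= tol
        and net >= tax
        and gross >= 3 * tax
        and min(gross, net, tax) >= 0
    )

# Made-up example amounts
print(tax_interval(3000.0))
print(satisfies_edits(3000.0, 2200.0, 800.0))
print(satisfies_edits(3000.0, 1900.0, 1100.0))
```

For non-negative gross the binding upper bound is gross / 3, so the edit gross ≥ 3 × tax makes net ≥ tax redundant in this edit set.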

Evaluation criteria
$d_{L1}$: average distance between imputed and true values.
Z: number of imputed records on the boundary of the feasible region defined by the edits.
K-S (Kolmogorov-Smirnov): compares the empirical distribution of the original values to the empirical distribution of the imputed values.
Sign: sign test carried out on the difference between original value and imputed value.
Kappa: Kappa statistic for a 2-dimensional contingency table; compares agreement against that which might be expected by chance.
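Two of these criteria can be sketched in a few lines of Python with NumPy. The sketch assumes that $d_{L1}$ is the unweighted mean absolute difference and that K-S is the usual two-sample statistic (the largest gap between the two empirical distribution functions). The exact definitions used in the study may differ, and the function names are hypothetical.

```python
import numpy as np

def d_l1(imputed, true):
    """Mean absolute difference between imputed and true values."""
    imputed, true = np.asarray(imputed, float), np.asarray(true, float)
    return np.mean(np.abs(imputed - true))

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.concatenate([a, b])          # the gap can only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

# Made-up example values
print(d_l1([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
print(ks_statistic([1.0, 2.0], [3.0, 4.0]))
```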

Results: Net
[Table: $d_{L1}$, Z, K-S, Sign and Kappa for UPMA, BPMA and MCMC; the values are not recoverable from the transcript.]

Results: Tax
[Table: $d_{L1}$, Z, K-S, Sign and Kappa for UPMA, BPMA and MCMC; the values are not recoverable from the transcript.]

Conclusions
The MCMC approach does worse than the other methods on all criteria except the number of records that lie on the boundary.
However, MCMC allows multiple imputation, so that imputation uncertainty can be taken into account in variance estimation.
BPMA appears to be slightly better than UPMA, except for the K-S statistic.
The number of records that lie on the boundary for UPMA is a cause for concern. The MCMC approach does slightly better than the BPMA approach in this respect.

Future research
Improving the MCMC approach.
Carrying out multiple imputation using the MCMC approach to obtain proper variance estimation.
In our study a log transformation was carried out on the variables to ensure normality of the data. A correction factor was introduced into the constant term of the regression model to correct for this log transformation. A better approach to this problem will be investigated.
Extending the problem to situations where one has unequal sampling weights.