# Integrated Data Editing and Imputation Ton de Waal Department of Methodology Voorburg Statistics Netherlands ICES III conference, Montréal June 19, 2007.

What is statistical data editing and imputation? Observed data generally contain errors and missing values. Statistical Data Editing (SDE): process of checking observed data and, when necessary, correcting them. Imputation: process of estimating missing data and filling these values into the data set.

What is integrated SDE and imputation? Integration of error localization and imputation Integration of several edit and imputation techniques to optimize edit and imputation process Integration of statistical data editing into rest of statistical process

SDE and the survey process We will focus on identifying and correcting errors. Other goals of SDE are to identify error sources in order to provide feedback on the entire survey process, and to provide information about the quality of incoming and outgoing data. The role of SDE is slowly shifting towards these goals: feedback on other survey phases can be used to improve those phases and reduce the number of errors arising in them.

Edits Edit rules, or edits for short, are often used to determine whether a record is consistent or not. Inconsistent records are considered to contain errors. Consistent records that are also not otherwise suspicious, e.g. are not outlying with respect to the bulk of the data, are considered error-free. Example of edits (T turnover, P profit, and C costs): T = P + C (balance edit); T ≥ 0.
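As a minimal sketch of how edits can be checked against a record, the Python fragment below encodes the two example edits as predicates. The second edit is read here as T ≥ 0, which is an assumption, and the names `EDITS` and `violated_edits` are illustrative rather than taken from any particular system.

```python
# Edits for a record with turnover T, profit P and costs C.
# Each edit is a predicate that returns True when the record satisfies it.
EDITS = {
    "balance": lambda r: r["T"] == r["P"] + r["C"],  # T = P + C
    "T_nonneg": lambda r: r["T"] >= 0,               # assumed reading of the second edit
}

def violated_edits(record):
    """Return the names of the edits that the record fails."""
    return [name for name, holds in EDITS.items() if not holds(record)]

consistent = {"T": 100, "P": 25, "C": 75}
inconsistent = {"T": 100, "P": 25, "C": 750}  # costs with a digit too many

print(violated_edits(consistent))
print(violated_edits(inconsistent))
```

A record with an empty list of violated edits is consistent; it may still be suspicious for other reasons, such as being an outlier.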

SDE and imputation Three related problems: Error localization: determine which values are erroneous Correction: correct missing and erroneous data in best possible way Consistency: adjust values such that all edits become satisfied Correction often done by means of imputation

SDE and imputation Three related problems: Error localization: determine which values are erroneous Imputation: impute missing and erroneous data in best possible way Consistency: adjust imputed values such that all edits become satisfied Most SDE techniques focus on error localization

SDE in the old days Use of computers in SDE started many years ago In early years role of computers restricted to checking which edits were violated Subject-matter specialists retrieved paper questionnaires that did not pass all edits and corrected them After correction, data were again entered into computer, and again checked whether all edits were satisfied Major problem: during manual correction process records were not checked for consistency

Modern SDE techniques Interactive editing Selective editing Automatic editing Macro-editing

Interactive editing During interactive editing a modern survey processing system (e.g. BLAISE) is used Such a system allows one to check and – if necessary – correct in a single step Advantages: number of variables, edits and records may be high quality of interactively edited data is generally high Disadvantage: all records have to be edited: costly in terms of budget and time not transparent

Selective editing Umbrella term for several methods to identify the influential errors Aim is to split data into two streams: critical stream: records that are the most likely ones to contain influential errors non-critical stream: records that are unlikely to contain influential errors Records in critical stream are edited interactively Records in non-critical stream are either not edited or are edited automatically

Selective editing Many selective editing methods are based on common sense. The most often applied basic idea is to use a score function with two important components. Influence: measures the relative influence of a record on the publication figure. Risk: measures the deviation of observed values from anticipated values (e.g. medians or values from previous years).

Selective editing Local score for a single variable within a record is usually defined as the distance between observed and anticipated values, taking the influence of the record into account. Example: $W \times |Y - Y^*|$, with $W$ the raising weight, $Y$ the observed value and $Y^*$ the anticipated value; influence component: $W \times Y^*$; risk component: $|Y - Y^*| / Y^*$. Local scores are combined into a global score for the entire record by taking either the sum of the local scores or the maximum of the local scores. Records with a global score above a certain cut-off value are edited interactively.
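The score function described above can be sketched in a few lines of Python. The record layout (an `id`, a raising `weight` and a list of observed/anticipated pairs) and the cut-off value are hypothetical choices made only for illustration.

```python
def local_score(weight, observed, anticipated):
    """W x |Y - Y*|, i.e. influence (W x Y*) times risk (|Y - Y*| / Y*)."""
    return weight * abs(observed - anticipated)

def global_score(local_scores, how="sum"):
    """Combine local scores by their sum or by their maximum."""
    return sum(local_scores) if how == "sum" else max(local_scores)

def split_streams(records, cutoff, how="sum"):
    """Split record ids into a critical and a non-critical stream."""
    critical, non_critical = [], []
    for rec in records:
        scores = [local_score(rec["weight"], y, y_star)
                  for y, y_star in rec["values"]]
        stream = critical if global_score(scores, how) > cutoff else non_critical
        stream.append(rec["id"])
    return critical, non_critical

# Hypothetical records: (observed, anticipated) pairs per variable.
records = [
    {"id": "a", "weight": 10, "values": [(105, 100), (52, 50)]},
    {"id": "b", "weight": 10, "values": [(1000, 100), (50, 50)]},
]
critical, non_critical = split_streams(records, cutoff=500)
print(critical, non_critical)
```

Record `a` has a global score of 70 and stays in the non-critical stream; record `b` scores 9000 and goes to the critical stream for interactive editing.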

Selective editing: (dis)advantages Advantage: selective editing improves efficiency in terms of budget and time Disadvantage: no good techniques for combining local scores into global score are available if there are many variables Selective editing has gradually become popular method to edit business data

Automatic editing Two kinds of errors: systematic ones and random ones. Systematic error: error reported consistently among (some) responding units, e.g. gross values reported instead of net values, or values reported in units instead of the requested thousands of units (so-called thousand-errors). Random error: error caused not by a systematic reason but by accident, e.g. an observed value where the respondent by mistake typed in a digit too many.

Automatic editing of systematic errors Can often be detected by comparing a respondent's present values with those from previous years, by comparing responses to questionnaire variables with values of register variables, or by using subject-matter knowledge. Once detected, a systematic error is often simple to correct.
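To illustrate detection by comparison with a previous year, here is a hedged sketch for thousand-errors. The ratio bounds `low` and `high` are hypothetical tuning parameters, not values from the presentation, and the function name is illustrative.

```python
def fix_thousand_error(current, previous, low=300.0, high=3000.0):
    """Flag and correct a value that looks like it was reported in units
    instead of thousands, judged by its ratio to last year's value.

    Returns (possibly corrected value, whether a correction was made).
    """
    if previous > 0 and low <= current / previous <= high:
        return current / 1000.0, True
    return current, False

print(fix_thousand_error(1_250_000, 1200))  # ratio of roughly 1000
print(fix_thousand_error(1300, 1200))       # plausible year-on-year change
```

Once such an error is flagged, the correction itself is the simple part: divide by 1000.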

Automatic editing of random errors Three classes of methods: methods based on statistical models (e.g. outlier detection techniques and neural networks) methods based on deterministic checking rules methods based on solving a mathematical optimization problem

Deterministic checking rules State which values are considered erroneous when a record violates edits. Example: if component variables do not sum up to the total, the total variable is considered to be erroneous. Advantages: drastically improves efficiency in terms of budget and time; transparency and simplicity. Disadvantages: many rules have to be specified, maintained and checked for validity; bias may be introduced as one aims to detect random errors in a systematic manner.
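The example rule can be written down directly. In this sketch the rule flags the total as erroneous, as in the slide; replacing the total by the component sum is an added assumption about how the flagged value is then corrected, and the function name is illustrative.

```python
def apply_total_rule(record, components, total):
    """If the components do not sum to the total, blame the total.

    Returns (corrected copy of the record, list of variables flagged as erroneous).
    """
    component_sum = sum(record[c] for c in components)
    corrected = dict(record)
    if record[total] != component_sum:
        corrected[total] = component_sum  # assumed correction step
        return corrected, [total]
    return corrected, []

record = {"T": 100, "P": 25, "C": 85}
print(apply_total_rule(record, ["P", "C"], "T"))
```

The sketch also shows where the bias comes from: the rule always blames the total, even when the actual error sits in one of the components.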

Error localization as mathematical optimization problem Guiding principle is needed Freund and Hartley (1967): minimize sum of the distance between observed and corrected data and a measure for violation of edits Casado Valera et al. (90s): minimize quadratic function measuring distance between observed and corrected data such that corrected data satisfy all edits Bankier (90s): impute missing data and potentially erroneous values by means of donor imputation, and select imputed record that satisfies all edits and that is closest to original record

Fellegi-Holt paradigm (1976) Data should be made to satisfy all edits by changing values of fewest possible number of variables Generalization: data should be made to satisfy all edits by changing values of variables with smallest possible sum of reliability weights reliability weight expresses how reliable one considers values of this variable to be high reliability weight corresponds to variable of which values are considered trustworthy
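A brute-force sketch of the generalized paradigm for a small record with finite variable domains is given below. It tries sets of variables in order of increasing total reliability weight and returns all sets of minimal weight whose values can be changed so that every edit holds. The record, domains and edits are hypothetical, and production systems use far more efficient algorithms than this enumeration.

```python
from itertools import combinations, product

def fellegi_holt(record, domains, edits, weights):
    """Return (minimal weight, list of minimal-weight variable sets to change)."""
    names = list(record)
    subsets = [s for size in range(len(names) + 1)
               for s in combinations(names, size)]
    subsets.sort(key=lambda s: sum(weights[v] for v in s))

    best, solutions = None, []
    for subset in subsets:
        cost = sum(weights[v] for v in subset)
        if best is not None and cost > best:
            break  # every remaining subset is more expensive
        # Can all edits be satisfied by changing only the variables in `subset`?
        for values in product(*(domains[v] for v in subset)):
            candidate = {**record, **dict(zip(subset, values))}
            if all(edit(candidate) for edit in edits):
                best = cost
                solutions.append(set(subset))
                break
    return best, solutions

# Hypothetical categorical record with two hard edits.
record = {"age": "under16", "marital": "married", "employed": "yes"}
domains = {
    "age": ["under16", "16plus"],
    "marital": ["married", "unmarried"],
    "employed": ["yes", "no"],
}
edits = [
    lambda r: not (r["age"] == "under16" and r["marital"] == "married"),
    lambda r: not (r["age"] == "under16" and r["employed"] == "yes"),
]
equal_weights = {"age": 1, "marital": 1, "employed": 1}
print(fellegi_holt(record, domains, edits, equal_weights))
```

With equal weights, changing `age` alone resolves both violated edits. If `age` is given a high reliability weight instead, the cheapest solution becomes changing `marital` and `employed` together.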

Fellegi-Holt paradigm: (dis)advantages Advantages: drastically improves efficiency in terms of budget and time; in comparison to deterministic checking rules, fewer and less detailed rules have to be specified. Disadvantages: the class of errors that can safely be treated is limited to random errors; the class of edits that can be handled is restricted to so-called hard (or logical) edits, which hold true for all correctly observed records; it is risky to treat influential errors by means of automatic editing.

Macro-editing Macro-editing techniques often examine potential impact on survey estimates to identify suspicious data in individual records Two forms of macro-editing aggregation method distribution method

Macro-editing: aggregation method Verification whether figures to be published seem plausible Compare quantities in publication tables with same quantities in previous publications quantities based on register data related quantities from other sources

Macro-editing: distribution method Available data used to characterize distribution of variables Individual values compared with this distribution Records containing values that are considered uncommon given the distribution are candidates for further inspection and possibly for editing
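One possible way to compare individual values with the distribution is a robust score based on the median and the median absolute deviation (MAD). This particular choice, the threshold and the function name are assumptions for illustration; the presentation does not prescribe a specific measure.

```python
from statistics import median

def flag_uncommon(values, threshold=3.0):
    """Return indices of values that lie far from the median, measured in MADs."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]

turnover = [10, 12, 11, 13, 12, 11, 95]
print(flag_uncommon(turnover))
```

The flagged records are only candidates for further inspection; an uncommon value may still be correct.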

Macro-editing: graphical techniques Exploratory Data Analysis techniques can be applied: box plots, scatter plots, (outlier robust) fitting. Other techniques often used in software applications: anomaly plots (graphical overviews of important estimates, where unusual estimates are highlighted), time series analysis, outlier detection methods. Once suspicious data have been detected on a macro level, one can drill down to sub-populations and individual units.

Macro-editing: (dis)advantages Advantages: directly related to publication figures or distribution; efficient in terms of budget and time. Disadvantages: records that are considered non-suspicious may still contain influential errors; publication of unexpected (but true) changes in trend may be prevented; for data sets with many important variables graphical macro-editing is not the most suitable SDE method, as most persons cannot interpret 10 scatter plots at the same time.

Integrating SDE techniques We advocate an SDE approach that consists of the following phases: correction of evident systematic errors application of selective editing to split records in critical stream and non-critical stream editing of data: records in critical stream edited interactively records in non-critical stream edited automatically validation of the publication figures by means of (graphical) macro-editing

Imputation Expert guess Deductive imputation Multivariate regression imputation Nearest neighbor hot-deck imputation Ratio hot-deck imputation

Deductive imputation Sometimes missing values can be determined unambiguously from edits. Examples: a single missing value involved in a balance edit; for non-negative variables, if a total variable has zero value, all missing subtotal (component) variables are zero.
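Both examples can be sketched for a single balance edit as follows. Missing values are represented by `None`, and the function name and record layout are illustrative assumptions. The zero-total rule is only valid when all components are non-negative.

```python
def deductive_impute(record, components, total):
    """Fill in values that follow unambiguously from total = sum(components)."""
    out = dict(record)
    missing = [c for c in components if out[c] is None]
    if out[total] is None:
        if not missing:  # the total is the single missing value
            out[total] = sum(out[c] for c in components)
        return out
    if len(missing) == 1:
        # Single missing component: it must equal the remainder.
        observed = sum(out[c] for c in components if out[c] is not None)
        out[missing[0]] = out[total] - observed
    elif missing and out[total] == 0:
        # Non-negative components that sum to zero must all be zero.
        for c in missing:
            out[c] = 0
    return out

print(deductive_impute({"T": 100, "P": None, "C": 75}, ["P", "C"], "T"))
print(deductive_impute({"T": 0, "P": None, "C": None}, ["P", "C"], "T"))
```

When two or more components are missing and the total is non-zero, nothing can be deduced and the record is returned unchanged.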

Regression imputation Regression model per variable to be imputed: $Y = A + BX + e$. Imputations for missing data can be obtained from $Y = A_{\text{est}} + B_{\text{est}} X$ or from $Y = A_{\text{est}} + B_{\text{est}} X + e^*$, where $e^*$ is drawn from an appropriate distribution.
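A small sketch of both variants for a single auxiliary variable X, using ordinary least squares on the complete records. Drawing e* from a normal distribution with the residual standard deviation is an assumption; the slide only asks for an "appropriate distribution". The data are made up.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates of A and B, plus the residual standard deviation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sigma = (sum(e * e for e in residuals) / (n - 2)) ** 0.5
    return a, b, sigma

def impute(x, a, b, sigma=0.0, rng=None):
    """Deterministic imputation if rng is None, else add a random residual e*."""
    e_star = rng.gauss(0.0, sigma) if rng is not None else 0.0
    return a + b * x + e_star

# Complete records used to estimate the model (hypothetical data).
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a_est, b_est, sigma_est = fit_line(xs, ys)

deterministic = impute(6, a_est, b_est)
stochastic = impute(6, a_est, b_est, sigma_est, rng=random.Random(0))
print(deterministic, stochastic)
```

The deterministic variant always lands on the regression line; adding e* helps preserve the variability of the imputed variable.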

Regression imputation Imputation can also be based on a multivariate regression model that relates each missing value to all observed values: $Y_{\text{mis}} = \text{Mean}_{\text{mis}} + B\,(Y_{\text{obs}} - \text{Mean}_{\text{obs}}) + e$. Estimates of the model parameters can be obtained by using the EM-algorithm. Imputations for missing data can be obtained from $Y_{\text{mis}} = \text{Mean}_{\text{est,mis}} + B_{\text{est}}\,(Y_{\text{obs}} - \text{Mean}_{\text{est,obs}})$ or from $Y_{\text{mis}} = \text{Mean}_{\text{est,mis}} + B_{\text{est}}\,(Y_{\text{obs}} - \text{Mean}_{\text{est,obs}}) + e^*$, where $e^*$ is drawn from an appropriate distribution.
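If one assumes that the data follow a multivariate normal distribution (an assumption not stated on the slide, though common when the EM-algorithm is used), the coefficient matrix B and the distribution of the residual follow from the partitioned covariance matrix. The symbol $\Sigma$ for the covariance matrix is notation introduced here:

$$
B = \Sigma_{\text{mis,obs}}\,\Sigma_{\text{obs,obs}}^{-1},
\qquad
e \sim N\!\left(0,\; \Sigma_{\text{mis,mis}} - \Sigma_{\text{mis,obs}}\,\Sigma_{\text{obs,obs}}^{-1}\,\Sigma_{\text{obs,mis}}\right)
$$

Under that assumption, the imputation without a residual is the conditional mean of the missing values given the observed ones, and $e^*$ can be drawn from the normal distribution above with the estimated covariance matrix plugged in.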

Nearest neighbor hot deck imputation For each receptor record with missing values on some (target) variables, a donor record is selected that has no missing values on auxiliary and target variables and has the smallest distance to the receptor. Replace missing values by values from the donor. An often used distance measure is the minimax distance. $Z_{si}$: value of scaled auxiliary variable $i$ in record $s$; distance between records $s$ and $t$: $D(s,t) = \max_i |Z_{si} - Z_{ti}|$.
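The donor selection with the minimax distance can be sketched as follows. The record layout (a list of scaled auxiliary values under `aux` and a dictionary of target values under `targets`, with `None` for missing) is a hypothetical choice for illustration.

```python
def minimax_distance(z_s, z_t):
    """D(s, t) = max_i |Z_si - Z_ti| over the scaled auxiliary variables."""
    return max(abs(a - b) for a, b in zip(z_s, z_t))

def hot_deck_impute(receptor, donors):
    """Copy missing target values from the donor closest to the receptor."""
    donor = min(donors, key=lambda d: minimax_distance(receptor["aux"], d["aux"]))
    imputed = dict(receptor["targets"])
    for name, value in imputed.items():
        if value is None:
            imputed[name] = donor["targets"][name]
    return imputed

receptor = {"aux": [0.5, 0.2], "targets": {"P": None, "C": 40}}
donors = [
    {"aux": [0.4, 0.9], "targets": {"P": 10, "C": 30}},  # distance 0.7
    {"aux": [0.7, 0.3], "targets": {"P": 20, "C": 50}},  # distance 0.2
]
print(hot_deck_impute(receptor, donors))
```

Only the missing target values are taken from the donor; the receptor's observed value of C is left untouched.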

Ratio hot deck imputation Modified version of nearest neighbor hot-deck for variables that are part of a balance edit. Calculate the difference between the total variable and the sum of the observed components; this difference equals the sum of the missing components. The sum of the missing components is distributed over the missing components using ratios (of missing components to sum of missing components) from the donor record. The level of the imputed components is determined by the total variable, but their ratios are determined by the donor. Imputed and observed components add up to the total.

Example of ratio hot deck P + C = T Record to be imputed given by T = 400, P = ?, C = ? Donor record T = 100, P = 25, C = 75 Imputed record T = 400, P = 100, C = 300
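The example can be reproduced with a short sketch. The function name and the use of `None` for missing values are illustrative assumptions; donor selection is taken as already done.

```python
def ratio_hot_deck(receptor, donor, components, total):
    """Distribute the unexplained part of the total over the missing
    components, using the donor's ratios for those components."""
    out = dict(receptor)
    missing = [c for c in components if out[c] is None]
    remainder = out[total] - sum(out[c] for c in components if out[c] is not None)
    donor_sum = sum(donor[c] for c in missing)
    if donor_sum == 0:
        raise ValueError("donor has no mass on the missing components")
    for c in missing:
        out[c] = remainder * donor[c] / donor_sum
    return out

receptor = {"T": 400, "P": None, "C": None}
donor = {"T": 100, "P": 25, "C": 75}
print(ratio_hot_deck(receptor, donor, ["P", "C"], "T"))
```

The imputed record satisfies the balance edit by construction, because the imputed components sum to the remainder of the total.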

Consistency If imputed values violate edits, adjust them slightly. Observed values are not adjusted. Minimize $\sum_i w_i\,|Y_{i,\text{final}} - Y_{i,\text{imp}}|$ subject to the restriction that the $Y_{i,\text{final}}$ in combination with the observed values satisfy all edits. $Y_{i,\text{imp}}$: imputed values (possibly failing edits); $Y_{i,\text{final}}$: final values; $w_i$: user-specified weights. As numerical edits are generally linear (in)equalities, the resulting problem is a linear programming problem.
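The general problem needs a linear programming solver. As a hedged illustration, the sketch below handles only a special case that can be solved without one: a single balance edit over the imputed values plus non-negativity, where `target` is the total minus the observed components. In that special case the weighted sum of absolute adjustments is minimized by adjusting the variables with the smallest weights first. All names are illustrative, and integer values are used to keep the arithmetic exact.

```python
def adjust_to_balance(imputed, weights, target):
    """Adjust non-negative imputed values so that they sum to `target`,
    minimizing sum_i w_i * |final_i - imputed_i| (single balance edit only)."""
    final = dict(imputed)
    gap = target - sum(final.values())
    cheapest_first = sorted(final, key=lambda name: weights[name])
    if gap >= 0:
        # Upward adjustments are unbounded: the cheapest variable takes it all.
        final[cheapest_first[0]] += gap
        return final
    for name in cheapest_first:
        # Downward adjustments are bounded by non-negativity.
        step = min(final[name], -gap)
        final[name] -= step
        gap += step
        if gap == 0:
            return final
    raise ValueError("no non-negative solution satisfies the balance edit")

weights = {"P": 1, "C": 2}
print(adjust_to_balance({"P": 120, "C": 300}, weights, target=400))
print(adjust_to_balance({"P": 10, "C": 300}, weights, target=280))
```

In the second call P can only absorb 10 of the required reduction before reaching zero, so the remaining 20 is taken from C. With several interacting edits this greedy shortcut no longer applies and a general linear programming formulation is needed.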

Consistency Prerequisite: it should be possible to find values $Y_{i,\text{final}}$ such that all edits become satisfied; this is the case if the Fellegi-Holt paradigm has been applied to identify errors. Instead of first imputing and then adjusting values, a better (but more complicated) approach is to impute under the restriction that the edits become satisfied; see the doctoral thesis by Caren Tempelman (Statistics Netherlands, www.cbs.nl).

Conclusion All editing and imputation methods have their own (dis)advantages Integrated use of editing techniques (selective editing, interactive editing, automatic editing, and macro-editing) as well as various imputation techniques can improve efficiency of SDE and imputation process while at same time maintaining or even enhancing statistical quality of produced data
