MEASUREMENT OF THE QUALITY OF STATISTICS


1 MEASUREMENT OF THE QUALITY OF STATISTICS
Processing errors
Orietta Luzi, Istat – Department for National Accounts and Economic Statistics

2 Processing errors (1)
Data processing is the set of operations that transform observed data from their "raw" state at the data-capture stage into a "clean" state usable for data analysis and dissemination (Biemer and Lyberg, 2003).
Data processing is a costly and time-consuming phase of a statistical survey.
Data processing errors derive from one or more of the operations used to capture and clean the data:
- Coding (pre-editing)
- Data entry and key entry
- Editing and imputation
- Programming

3 Processing errors (2)
Although data processing is a step meant to improve the quality of the data, making them more usable, it can itself introduce errors into the data.
The non-sampling errors that occur at this stage are called processing errors.
These errors are similar to measurement errors in the effects they have on the estimates; in fact one sometimes speaks of "measurement errors in the broad sense".

4 Processing errors (3)
Definition*: processing errors (or measurement errors in the broad sense) are errors introduced into the observed data after they have been collected, during the coding, data entry, editing and imputation phases, before the final estimates are computed.
* Eurostat, 2003b
Sources:
- Typing errors (coding and data entry)
- Misinterpretation errors (coding)
- Errors in the localization and correction of data
- Errors in the application of a model to the data (processing)

5 Processing errors (4)
By definition these errors occur during the processing of the data. Note, however, that with computer-assisted data collection techniques most of these operations (recording, coding, editing and correction) are moved forward to the data collection phase, improving the quality of the data obtained and reducing treatment time.

6 Processing errors (5)
Impact of processing errors on target estimates: like measurement errors, processing errors may bias the final estimates and increase the variable component of the error.
In particular, for operations performed by survey operators, an increase of the variability and of the correlated component of the error can be observed, just as happens with interviewers for the measurement error. The operations that are most frequently automated tend to cause errors of a systematic nature.

7 Data entry error (1)
Data entry consists of transferring the information collected through the questionnaire onto a computer medium.
Data entry can be performed by an operator or partially automated through optical-reading instruments.
If a computer-assisted technique is used, data entry is moved forward to the data collection stage.
When data entry is carried out by an operator, the data entry error corresponds to typing errors.

8 Data entry error (2)
Some examples of typical errors:
- Key-entry errors involving contiguous keys on the keyboard
- Transposition of keys
- Skipping a value
Techniques for preventing data entry errors:
- Training and instructions for operators
- Pre-coding of answers with codes on the questionnaire
- Computer-aided data entry
- Supervision by expert operators
- Preliminary testing of optical-reading tools

9 Data entry error (3)
Evaluation techniques: to estimate the data entry error it is necessary to draw a sample of the recorded questionnaires, re-key the sampled questionnaires independently of the first data entry, and then compare the results of the two entry passes. This approach is analogous to re-interview studies.

10 Data entry error (4)
Evaluation techniques: the error rate can then be computed as the percentage of incorrect characters over the total number of characters, or as the percentage of erroneous "values" (fields, cells) over the whole data matrix.
In the literature this error is generally very low (below 2%), but an assessment of its possible impact on the data is almost never provided.
Example: true value 17,400; possible erroneous values: 1740, 17500, 17599 (Biemer and Lyberg, 2003)
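The re-entry evaluation described above can be sketched as follows: compare an original data-entry pass with an independent second pass on a sample of questionnaires and compute the field-level error rate. The records and field values are hypothetical.

```python
# Compare two independent data-entry passes over the same questionnaires
# and compute the percentage of fields whose keyed values disagree.
def field_error_rate(first_pass, second_pass):
    """Share (in %) of fields whose two independently keyed values differ."""
    fields = [(a, b) for rec1, rec2 in zip(first_pass, second_pass)
              for a, b in zip(rec1, rec2)]
    errors = sum(1 for a, b in fields if a != b)
    return 100.0 * errors / len(fields)

first  = [["17400", "M", "1975"], ["230", "F", "1981"]]
second = [["17400", "M", "1975"], ["2300", "F", "1981"]]  # one keying disagreement
print(round(field_error_rate(first, second), 2))  # 1 of 6 fields differ -> 16.67
```

A character-level rate would be computed the same way, zipping the characters of each pair of keyed values instead of whole fields.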

11 Coding (1)
Coding consists in assigning a code to the answers to questions that cannot be pre-coded. Examples: coding of complex variables such as education, professional status, economic activity, diseases; coding of open questions.
Not all surveys include a significant coding phase (in terms of costs), but for those that do, coding is often the most important operation; for this reason it is also performed and monitored accurately.

12 Coding (2)
Coding can be:
- Performed manually by operators after the data collection phase
- Computer-assisted, with ad hoc software that makes it easier and quicker to identify the codes corresponding to the answers given. In this case coding may also be moved forward to the data collection stage, with the interviewers also playing the role of coders
- Automatic, performed after the data collection phase in batch mode

13 Coding error (1)
Techniques for preventing and monitoring coding errors:
- Use of standard classifications
- Testing of the coding procedure at the design stage
- Training and instructions for operators
- Use of supporting tools (paper or computerized dictionaries, computer-assisted coding, fully automated coding)
- Supervision by expert operators

14 Coding error (2)
Evaluation techniques: quality indicators
Recall rate = (N° of values coded by means of a specific coding procedure / N° of coded values) x 100 (an indicator of the data processing)
Precision of coding = (N° of values correctly coded / N° of coded values) x 100 (to compute it, double coding by more experienced coders is needed)
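A minimal sketch of these two indicators, with hypothetical counts: `coded_by_procedure` is the number of values the specific (e.g. automatic) procedure managed to code, `total_coded` the overall number of coded values, and `correctly_coded` the number confirmed correct by independent double coding.

```python
def recall_rate(coded_by_procedure, total_coded):
    """Share of coded values handled by the specific procedure, in %."""
    return 100.0 * coded_by_procedure / total_coded

def coding_precision(correctly_coded, coded_by_procedure):
    """Share of the procedure's codes confirmed by double coding, in %."""
    return 100.0 * correctly_coded / coded_by_procedure

print(recall_rate(850, 1000))      # 85.0
print(coding_precision(800, 850))  # about 94.1
```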

15 Editing and Imputation
The set of logical, statistical and mathematical operations carried out in order to detect, process and eliminate measurement errors in statistical survey data once all the previous survey phases (data collection, data entry, data coding) have been completed.
The definition includes the treatment (imputation) of non-responses.

16 Aim of Editing (Error Detection)
- Obtain a data set which is coherent (without logical or mathematical contradictions among variables) and complete (i.e. without item non-responses)
- Identify those measurement errors that, if ignored, may produce strong biases in results, inferences and data analyses
- Increase the accuracy of estimates by reducing the biasing effects of measurement errors and item non-responses
- Provide information on measurement errors and their sources in order to improve the survey process in subsequent survey repetitions

17 How errors "appear" in data
[Diagram] Non-sampling errors appear in the data as inconsistencies, outliers and item non-responses; outliers may be identifiable or not identifiable. Inliers are erroneous values lying in the interior of the distribution, hence not identifiable.

18 Error detection: Inconsistencies
Definition: data not coherent with respect to logical and/or statistical and/or mathematical constraints (edits) derived from the structure of the questionnaire and from knowledge of the observed variables.
Intra-record edits involve items observed in the same unit. Examples:
- If 'age' < 14 then 'occupation' = blank
- If 'number of employees' > 0 then 'number of worked hours' > 0
Inter-record edits involve items observed in hierarchically linked units. Examples:
- If 'relation with the household head' = 'spouse' then 'age' > 14
- 'employees_enterprise' = 'employees_branch1' + 'employees_branch2'
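Intra-record edits like the examples above can be sketched as predicates that a clean record must satisfy; the record then fails exactly the edits it violates. Field names, records and the edit set are hypothetical.

```python
# Each edit is an implication written as a predicate: "if A then B"
# becomes "not A or B". A clean record satisfies every edit.
EDITS = [
    ("age/occupation", lambda r: not (r["age"] < 14) or r["occupation"] is None),
    ("employees/hours", lambda r: not (r["employees"] > 0) or r["hours"] > 0),
]

def failed_edits(record):
    """Return the names of the edits the record violates."""
    return [name for name, ok in EDITS if not ok(record)]

rec = {"age": 12, "occupation": "clerk", "employees": 0, "hours": 0}
print(failed_edits(rec))  # ['age/occupation']
```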

19 Error detection: Dealing with influential inconsistencies
Influential inconsistencies: non-coherent data originated by measurement errors with a significant impact on target estimates.
Main approach: selective/significance editing. Manual review and editing are limited to the subset of (potentially) erroneous units considered most influential on the target estimates, on the basis of a score depending on:
- the magnitude of the unit affected by the error
- the sampling weight of the unit affected by the error
- the magnitude of the errors affecting the unit
- the number of variables potentially in error in the unit
(Latouche and Berthelot, 1992; Lawrence et al., 2000; Farwell, 2000)
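The score-based selection above can be sketched with a simple Latouche-Berthelot style local score, combining each unit's sampling weight with the magnitude of its suspected error (distance of the reported value from an expected/anticipated one); only the top-scoring units go to manual review. Weights, values and the cut of two units are hypothetical.

```python
# Local score for selective editing: weight times suspected-error magnitude.
def local_score(weight, reported, expected):
    return weight * abs(reported - expected)

units = [
    {"id": "A", "weight": 120.0, "reported": 5000, "expected": 4800},
    {"id": "B", "weight": 3.5,   "reported": 9000, "expected": 1000},
    {"id": "C", "weight": 50.0,  "reported": 100,  "expected": 90},
]
for u in units:
    u["score"] = local_score(u["weight"], u["reported"], u["expected"])

# Send only the most influential suspicious units to manual review.
to_review = sorted(units, key=lambda u: u["score"], reverse=True)[:2]
print([u["id"] for u in to_review])  # ['B', 'A']
```

Note how unit B, despite its small weight, outranks the heavily weighted unit A because its suspected error is much larger.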

20 Error detection: Dealing with non-influential inconsistencies
Main approaches:
Systematic error mechanisms
- Graphical editing
- Deterministic approach (analysis of the failure rates of edits)
- Mixture models (Di Zio et al., 2005)
Stochastic error mechanisms
- Probabilistic approach: Fellegi-Holt paradigm based on the minimum-change criterion (Fellegi and Holt, 1976; Riccini et al., 1995)
- Data-driven approach based on the nearest-neighbour criterion (Bankier et al., 2001; Manzari et al., 2002)
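The Fellegi-Holt minimum-change criterion can be illustrated with a brute-force sketch: find the smallest set of fields which, once changed, lets the record satisfy all edits. Real systems use set-covering machinery over the edits; here we simply enumerate candidate values from hypothetical, tiny domains.

```python
from itertools import combinations, product

# Hypothetical categorical domains and a single illustrative edit.
DOMAINS = {"age": [10, 30], "occupation": [None, "clerk"]}
EDITS = [lambda r: not (r["age"] < 14) or r["occupation"] is None]

def passes(record):
    return all(edit(record) for edit in EDITS)

def minimum_change(record):
    """Return a consistent record obtained by changing as few fields as possible."""
    for k in range(len(DOMAINS) + 1):          # try 0 changes, then 1, then 2...
        for fields in combinations(DOMAINS, k):
            for values in product(*(DOMAINS[f] for f in fields)):
                candidate = {**record, **dict(zip(fields, values))}
                if passes(candidate):
                    return candidate
    return None

# age=12 with a non-blank occupation violates the edit; changing one field suffices.
print(minimum_change({"age": 12, "occupation": "clerk"}))
```

Enumerating field subsets in order of size guarantees that the first consistent candidate found changes the fewest fields, which is exactly the minimum-change idea.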

21 Error detection: Dealing with outliers (1)
Definition: values that contribute heavily to the target estimates (influential values) and differ significantly either from the other observations or from an implicit or explicit model assumed for the data.
Classification / possible origin:
- Representative outliers: anomalous values corresponding to acceptable (true) values, hence potentially representative of other units in the population
- Non-representative outliers: values which are anomalous due to measurement errors (either systematic or stochastic), hence not representative of other units in the population, or unique cases (i.e. there are no other similar units in the sample/population)
Chambers (1986), Outlier Robust Finite Population Estimation, JASA

22 Error detection: Dealing with outliers (2)
In the editing phase the aim is to identify non-representative outliers and correct them (manually). Representative outliers and unique cases are treated at the estimation phase.
Main approaches for outlier detection at the editing phase:
- Parametric: assumptions are made on the data distributions
- Non-parametric: no assumptions are made on the data distributions
- Univariate: based on the analysis of the marginal distributions of the variables
- Multivariate: based on the analysis of the joint distributions
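As a minimal non-parametric univariate sketch, values can be flagged when they lie far from the median in units of the median absolute deviation (MAD). The 3.5 cutoff is a common rule of thumb, not taken from the slides, and the reported values are hypothetical.

```python
import statistics

def mad_outliers(values, cutoff=3.5):
    """Flag values whose robust (MAD-based) z-score exceeds the cutoff."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 0.6745 rescales the MAD to the standard deviation under normality
    return [v for v in values if mad and 0.6745 * abs(v - med) / mad > cutoff]

turnover = [120, 135, 128, 131, 125, 122, 990]  # hypothetical reported values
print(mad_outliers(turnover))  # [990]
```

Using the median and MAD instead of the mean and standard deviation keeps the detection threshold itself from being dragged around by the very outliers one is trying to find.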

23 Error detection: Dealing with outliers (3)
- Micro-editing: outliers are identified by analysing the (marginal or joint) distributions of the observed data (Lee, 1995; Hidiroglou et al., 1986; Winkler et al., 1996)
- Macro-editing: aggregates of microdata are analysed first, then manual review is limited to the data that contribute most to suspicious aggregates (Granquist, 1992; Lindstrom, 1992)
- Graphical editing: graphs are used to identify anomalies in the data. Powerful interactive methods interconnect groups of graphs automatically and retrieve the microdata for manual editing (Chambers et al., 1983; Des Jardins et al., 2000; De Waal et al., 2000; Engstrom et al., 1994; Esposito et al., 1993)

24 Error detection: Dealing with outliers at the estimation stage
- Re-weighting: adjustment of the weights associated with outliers (generally by post-stratification)
- Robust estimation: use of robust estimators (e.g. M-estimators), i.e. estimators that are less sensitive to the presence of outliers in the data, even if generally biased (trade-off between bias and variance)
Lee (1995), Outliers in Business Surveys, in Cox, Binder, Chinappa, Christianson, Colledge, Kott (eds.), Business Survey Methods, New York: Wiley
Huber (1981), Robust Statistics, New York: Wiley
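The bias/variance trade-off above can be illustrated with winsorization, one simple robust device: extreme values are pulled back to a cutoff before estimation, reducing the outliers' influence at the price of some bias. The cutoffs and data are hypothetical.

```python
def winsorize(values, lower, upper):
    """Clip each value into the [lower, upper] interval."""
    return [min(max(v, lower), upper) for v in values]

turnover = [120, 135, 128, 131, 125, 122, 990]  # one gross outlier
plain_mean = sum(turnover) / len(turnover)
robust_mean = sum(winsorize(turnover, 100, 200)) / len(turnover)
print(round(plain_mean, 1), round(robust_mean, 1))  # 250.1 137.3
```

The single outlier doubles the plain mean; the winsorized mean stays near the bulk of the data, at the cost of a downward bias if 990 happened to be a true (representative) value.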

25 Error detection: Risks of editing
- Over-editing: editing the data beyond the point after which as many errors are introduced as are corrected
- Creative editing: a process whereby manual editors (i.e. those doing manual review) invent editing procedures just to avoid reviewing yet another error message from subsequent machine editing
- Inliers: a data value that lies in the interior of a statistical distribution and is in error. Because inliers are difficult to distinguish from good data values, they are often hard to find and correct
- Respondent burden: respondents are often re-interviewed/followed up even for non-influential measurement errors, with a risk of non-response and additional time and costs without any increase in data quality

26 Measuring the effects of Editing and Imputation (1)
Editing and imputation (E&I) represent additional sources of measurement errors (processing errors):
- E&I can modify the statistical properties of the observed data
- E&I can affect the joint relationships among the data
- E&I can introduce additional bias/variance terms into the MSE
The impact of E&I on the final results and target estimates is to be minimised: E&I has to be optimized.

27 Optimizing Editing and Imputation (1)
1) Perform preliminary tests of the editing and imputation methods:
- Choose among competing editing/imputation methods for a specific application/error problem
- Optimize the performance of an editing/imputation method/approach for a specific type of error problem by analysing its effects on the statistical properties of the observed data
Main approaches:
- Use of data from a previous survey repetition or from pilot surveys
- Use of experimental data (simulation approach): Euredit project; Di Zio et al. (2005), Evaluating the Quality of Editing and Imputation: the Simulation Approach

28 Optimizing Editing and Imputation (2)
2) Measure the effects of editing and imputation:
- Measure the effects on the bias and variance of the target estimates
- Measure the effects on the marginal and joint distributions
- Collect information on the error types (error profile) and analyse it in order to identify their possible sources for future improvements of the survey process
- Better quality reporting to users
(Granquist, 1997; Norbotten, 2000; Whitridge et al., 1999; Della Rocca et al., 2004)

29 Optimizing Editing and Imputation (3)
3) Improve the overall editing and imputation strategy:
- Move the data editing activities forward to the data collection stage (CATI, CAPI, web) or the data entry stage
- Optimize the questionnaire design (structure, wording, skip rules, …)
- Concentrate resources on the most relevant units and errors
- Optimize the design of the edits (analysis of edits, type of edits, …)
- Use the appropriate method for each type of variable/error
- Use any available auxiliary information and/or historical data in the editing and imputation process
- Flag imputed data so that users can deal with non-response using their own methods and/or evaluate the impact of non-response on the data distributions and final estimates

30 Errors in the editing and imputation phase
Monitoring and evaluating editing and imputation errors:
- Analysis of indicators on the failure rates of the edit rules => gathering information on the accuracy of the previous survey phases
- Analysis of indicators on the data changes performed (corrections and imputations), e.g. imputation rate = number of imputed values / total number of values
- Comparison of the distributions before and after editing and imputation
- Analysis of the additional variance due to imputation
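The monitoring indicators above can be sketched as follows: the imputation rate, and a before/after comparison of a simple distribution summary. The data are hypothetical; `None` marks values that were missing and then imputed.

```python
def imputation_rate(imputed_flags):
    """Share of values that were imputed, as a percentage."""
    return 100.0 * sum(imputed_flags) / len(imputed_flags)

raw    = [12.0, None, 15.0, 14.0, None, 13.0]   # observed data, with item non-responses
edited = [12.0, 13.5, 15.0, 14.0, 13.5, 13.0]   # after mean imputation
flags  = [r is None for r in raw]

print(round(imputation_rate(flags), 2))  # 2 of 6 values imputed -> 33.33

# Compare a summary before and after imputation: here mean imputation
# leaves the mean unchanged, but it shrinks the apparent variance.
observed = [r for r in raw if r is not None]
print(sum(observed) / len(observed), sum(edited) / len(edited))  # 13.5 13.5
```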

31 Data processing (1)
Processing errors: the programming of procedures can be affected by errors, or can fail to be coherent with the sampling design and with the measures adopted to deal with non-sampling errors:
- computation of estimates (calculation of weights, correction for non-response and coverage errors, estimation of the parameters of interest)
- application of models
- data tabulation

32 Data processing (2)
Preventing errors in data processing:
- Use of consolidated methodologies for data processing
- Development and testing of the data processing procedures at the design phase

33 References
Biemer, P.P., Lyberg, L.E. (2003), Introduction to Survey Quality. Hoboken, New Jersey: John Wiley & Sons.
Biemer, P.P., Groves, R.M., Lyberg, L.E., Mathiowetz, N.A., Sudman, S. (1991), Measurement Errors in Surveys. John Wiley & Sons.
Brancato, G., Pellegrini, C., Signore, M., Simeoni, G. (2004), "Standardising, Evaluating and Documenting Quality: the Implementation of Istat Information System for Survey Documentation – SIDI", Proceedings of the European Conference on Quality and Methodology in Official Statistics, Mainz, Germany.
Eurostat (2003a), Definition of Quality in Statistics. Eurostat Working Group on Assessment of Quality in Statistics, Luxembourg, October 2-3.
Eurostat (2003b), Glossary. Eurostat Working Group on Assessment of Quality in Statistics, Luxembourg, October 2-3.
FCSM (2001), "Measuring and Reporting Sources of Error in Surveys". Federal Committee on Statistical Methodology, Statistical Policy Working Paper 31.
Kalton, G., Kasprzyk, D. (1986), The treatment of missing survey data. Survey Methodology, vol. 12, n. 1, pp. 1-16.
Lessler, J., Kalsbeek, W. (1992), Nonsampling Errors in Surveys. Wiley, New York.
Särndal, C.E., Swensson, B., Wretman, J. (1992), Model Assisted Survey Sampling. Springer-Verlag, New York.

34 References (editing)
Béguin, C., Hulliger, B. (2004), Multivariate outlier detection in incomplete survey data: the epidemic algorithm and transformed rank correlations. Journal of the Royal Statistical Society, vol. 167, part 2.
Chambers, R.L. (1986), Outlier Robust Finite Population Estimation. Journal of the American Statistical Association.
Della Rocca, G., Di Zio, M., Luzi, O. (2004), Assessing editing and imputation effects on statistical survey data. Proceedings of the International Conference on Quality in Official Statistics, Mainz, May.
Di Zio, M., Guarnera, U., Luzi, O., Manzari, A. (2005), Evaluating the Quality of Editing and Imputation: the Simulation Approach. UN/ECE Work Session on Statistical Data Editing, Ottawa, May.
Farwell, K. (2000), Some Current Approaches to Editing in the ABS. Proceedings of the International Conference on Establishment Surveys II, Buffalo, June.
Fellegi, I.P., Holt, D. (1976), A systematic approach to edit and imputation. Journal of the American Statistical Association, vol. 71, pp. 17-35.
Granquist, L. (1992), A Review of Methods for Rationalizing the Editing of Survey Data. Statistical Data Editing Methods and Techniques, United Nations, vol. I.
Granquist, L. (1997), An Overview of Methods of Evaluating Data Editing Procedures. Statistical Data Editing Methods and Techniques, United Nations, vol. II.
Hadi, A.S., Simonoff, J.S. (1993), Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, vol. 88, n. 424.
Hawkins, D.M. (1974), The Detection of Errors in Multivariate Data Using Principal Components. Journal of the American Statistical Association, vol. 69, n. 346.

35 References (editing, continued)
Hidiroglou, M.A., Berthelot, J.M. (1986), Statistical Editing and Imputation for Periodic Business Surveys. Survey Methodology, Statistics Canada, June 1986, vol. 12, n. 1, pp. 73-83.
Huber, P.J. (1981), Robust Statistics. New York: Wiley.
Kovar, J.G., MacMillan, J.H., Whitridge, P. (1988), Overview and strategy for the generalized edit and imputation system. Statistics Canada, Methodology Branch, April 1988.
Lawrence, D., McKenzie, R. (2000), The General Application of Significance Editing. Journal of Official Statistics, vol. 16, n. 3.
Latouche, M., Berthelot, J.M. (1992), Use of a Score Function to Prioritize and Limit Recontacts in Editing Business Surveys. Journal of Official Statistics, vol. 8, n. 3, part II.
Lee, H. (1995), Outliers in Business Surveys. In Business Survey Methods, eds. B.G. Cox, D.A. Binder, B.N. Chinappa, A. Christianson, M.J. Colledge and P.S. Kott. New York: Wiley.
Norbotten, S. (2000), Evaluating Efficiency of Statistical Data Editing: A General Framework. United Nations.
Riccini, E., Silvestri, F., Barcaroli, G., Ceccarelli, C., Luzi, O., Manzari, A. (1995), "The methodology of editing and imputation by qualitative variables implemented in SCIA". Istat technical report.
