Katherine Jenny Thompson

Presentation transcript:

Investigation of Macro Editing Techniques for Outlier Detection in Survey Data Katherine Jenny Thompson Office of Statistical Methods and Research for Economic Programs

Simplified Survey Processing Cycle Data Collection/ Analyst Review Micro-editing And Imputation Individual Returns Macro-editing Tabulated Initial Estimates Publication Estimates Analyst Investigation And Correction

Identifying Outlying Estimates Set of Estimates Unknown parametric distribution (robust) Contains outliers (resistant) Outlier-identification problems (Multiple Outliers) Masking: difficult to detect an individual outlier Swamping: too many false outliers flagged

Outlier Detection Approaches Sets of “bivariate” (Ratio) comparisons Same estimate from two consecutive collection periods (historic cell ratios) Different estimates in same collection period (current cell ratios) Multivariate comparisons Current period data

Method for Bivariate Comparisons Resistant Fences Methods Symmetrized Resistant fences Asymmetric Fences Robust Regression Hidiroglou-Berthelot Edit

Bivariate Comparisons (Current Cell Ratios) Linear relationship between payroll and employment No intercept

“Traditional” Ratio Edit (Current Cell Ratio) Outlier Region Acceptance Region Outlier Region “Cone-shaped” tolerances Goes through origin Strong statistical association
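A minimal sketch of the traditional ratio edit described on this slide, assuming fixed lower and upper tolerances on the payroll-to-employment ratio. The function name, the tolerance values, and the sample cells are hypothetical and chosen only for illustration.

```python
def ratio_edit(numerator, denominator, lower, upper):
    """Return True when the ratio falls in either outlier region.

    The acceptance region lower <= numerator/denominator <= upper is a cone
    through the origin in the (denominator, numerator) plane.
    """
    if denominator <= 0:
        # Assumption: a ratio that cannot be computed is referred for review.
        return True
    ratio = numerator / denominator
    return ratio < lower or ratio > upper


# Hypothetical (payroll, employment) cell estimates and tolerances.
cells = [(300.0, 10.0), (3200.0, 100.0), (900.0, 10.0), (50.0, 10.0)]
flags = [ratio_edit(payroll, employment, lower=20.0, upper=45.0)
         for payroll, employment in cells]
print(flags)
```

Because the tolerances are fixed ratios, the acceptance region widens with the size of the cell, which is the cone shape named on the slide.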

Resistant Fences Methods Fences: q25 - 1.5H and q75 + 1.5H, where H = q75 - q25 is the interquartile range Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer) Implicitly assumes symmetry May want to “symmetrize”, apply rule, use inverse transformation
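The resistant fences rule can be sketched as follows. This is an illustration only: the slides do not say which quartile definition was used, so the "inclusive" method of Python's `statistics.quantiles` is an assumption, and the ratios are made-up values.

```python
import statistics


def resistant_fences(values, k=1.5):
    """Return (lower, upper) fences: q25 - k*H and q75 + k*H.

    k = 1.5 gives the inner fences and k = 3 the outer fences.
    """
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    h = q75 - q25  # interquartile range
    return q25 - k * h, q75 + k * h


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = resistant_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical cell ratios with one obviously large value.
ratios = [0.9, 1.0, 1.1, 1.2, 1.0, 0.95, 1.05, 3.0]
print(resistant_fences(ratios))       # inner fences
print(flag_outliers(ratios))          # flagged with inner fences
print(flag_outliers(ratios, k=3.0))   # flagged with outer fences
```

Quartiles are barely affected by a few extreme values, so the fences stay put even when the data contain the outliers being searched for.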

Asymmetric Fences Methods Fences: q25 - 3 (m - q25) and q75 + 3 (q75 - m), where m is the median Different numbers of interquartile ranges (3 = Inner, 6 = Outer) Incorporates skewness of distribution in outlier rule (“Fences”)
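A sketch of the asymmetric fences rule, assuming the lower fence is q25 - k(m - q25) and the upper fence is q75 + k(q75 - m), with m the median. The quartile definition ("inclusive") and the sample data are assumptions for illustration.

```python
import statistics


def asymmetric_fences(values, k=3.0):
    """Return (lower, upper) fences built from quartile-to-median distances.

    k = 3 gives the inner fences and k = 6 the outer fences. Each side uses
    its own distance to the median, so a long right tail widens only the
    upper fence.
    """
    q25, m, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - k * (m - q25), q75 + k * (q75 - m)


# Hypothetical right-skewed ratios.
skewed = [1.0, 1.1, 1.2, 1.3, 1.5, 2.0, 2.8, 4.0, 12.0]
low, high = asymmetric_fences(skewed)
flagged = [v for v in skewed if v < low or v > high]
print(low, high, flagged)
```

In this example the upper fence sits much farther from the median than the lower fence does, which reflects the skewness of the data.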

Robust Regression Least Trimmed Squares Robust Regression Resistant (minimizes median residual) Outlier = |residual| > 3 × robust M.S.E.

Issue at Origin (Historic Cell Ratio) Alternative Approach (Ratio Editing) --Hidiroglou-Berthelot (HB) Edit Originally designed to detect outlying values in periodically collected micro-data Requires “complete” set of micro-data in an “industry” Characterized by dynamic tolerances

Hidiroglou-Berthelot (HB) Edit Accounts for magnitude of unit (variability at origin)

Hidiroglou-Berthelot (HB) Edit Two-step transformation (Ei) Centering transformation on ratios Magnitude transformation that accounts for the relative importance of large cases Asymmetric Fences “Type” Outlier Rule Key Parameters U = magnitude transformation parameter (0 ≤ U ≤ 1) C = controls width of outlier region
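The two-step transformation and the outlier rule can be sketched as below. This follows the commonly published form of the HB edit rather than anything shown in the slides: the small constant `a = 0.05`, the "inclusive" quartile method, and the sample data are assumptions, and the sketch requires strictly positive values in both periods.

```python
import statistics


def hb_edit(previous, current, u=0.3, c=10.0, a=0.05):
    """Return a list of booleans, True where the unit is flagged.

    previous, current: strictly positive values for the same units.
    u: magnitude transformation parameter (0 <= u <= 1).
    c: controls the width of the outlier region.
    a: assumed guard constant that keeps the fences from collapsing
       when the transformed values cluster tightly around the median.
    """
    ratios = [cur / prev for prev, cur in zip(previous, current)]
    r_med = statistics.median(ratios)

    # Step 1: centering transformation, symmetric around the median ratio.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]

    # Step 2: magnitude transformation, giving large units more weight.
    e = [si * max(prev, cur) ** u
         for si, prev, cur in zip(s, previous, current)]

    # Asymmetric-fences-type rule on the transformed values.
    e_q1, e_med, e_q3 = statistics.quantiles(e, n=4, method="inclusive")
    d_q1 = max(e_med - e_q1, abs(a * e_med))
    d_q3 = max(e_q3 - e_med, abs(a * e_med))
    low, high = e_med - c * d_q1, e_med + c * d_q3
    return [ei < low or ei > high for ei in e]


# Hypothetical prior-period and current-period values for nine units.
previous = [100, 200, 150, 1000, 50, 400, 120, 300, 80]
current = [105, 190, 160, 1100, 52, 2000, 118, 310, 85]
flags = hb_edit(previous, current)
print(flags)
```

Setting `u = 0` removes the magnitude adjustment, so the rule then depends on the centered ratios alone. Larger values of `c` widen the acceptance region.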

Multivariate Methods: Mahalanobis Distance Multivariate normal (μ, Σ) T(X) estimates μ C(X) estimates Σ p is the number of distinct variables (items) Prone to masking (difficult to detect individual outliers)
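A minimal sketch of the classical Mahalanobis distance for p = 2 items, assuming T(X) is the sample mean and C(X) is the sample covariance matrix. The 2 × 2 inverse is written out by hand to keep the example free of external libraries. The data are hypothetical, and the cutoff uses the fact that a chi-square quantile with 2 degrees of freedom equals -2 ln(1 - probability).

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distances for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) C^-1 (dx, dy)^T with the 2x2 inverse expanded.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical (employment, payroll) records; the last one is unusual.
records = [(10, 300), (12, 350), (15, 460), (20, 610), (25, 740),
           (30, 900), (35, 1060), (40, 1190), (45, 1360), (22, 2500)]
d2 = mahalanobis_sq_2d(records)
cutoff = -2 * math.log(1 - 0.975)  # chi-square 0.975 quantile, 2 df
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(flagged)
```

With the sample mean and covariance, no squared distance can exceed (n - 1)² / n, because every record, including an outlier, pulls the estimates toward itself. That is one way to see why the classical version is prone to masking and why resistant estimates such as MVE or MCD are used instead.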

Robust Alternatives M-estimation (not considered) “Production Method” Minimum Volume Ellipse (MVE) Resistant (50% breakdown) and robust Minimum Covariance Determinant (MCD) Assumption of Normality Log-transformation

Evaluation: Classify Item Estimates Input Value Reported Final Value Tabulated Ratio Input/Final Not an Outlier Potential Outlier Outlier

Evaluation: Classify Ratios (Bivariate) Conservative Ratio is “outlier” if numerator or denominator is an outlier Anti-Conservative Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier

Evaluation: Classify Records (Multivariate) Conservative Record is “outlier” if at least one estimate is an outlier Anti-Conservative Record is “outlier” if at least one estimate is an outlier or a potential outlier
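The conservative and anti-conservative rules for ratios and records reduce to the same check, sketched below. The class labels and function name are hypothetical. The item-level classes are taken as given, because the slides do not state the thresholds used to assign them.

```python
NOT_OUTLIER = "not outlier"
POTENTIAL = "potential outlier"
OUTLIER = "outlier"


def is_true_outlier(item_classes, conservative=True):
    """Classify a ratio or record from the classes of its item estimates.

    For a ratio, pass the classes of the numerator and the denominator.
    For a record, pass the classes of all of its item estimates.
    """
    bad = {OUTLIER} if conservative else {OUTLIER, POTENTIAL}
    return any(item in bad for item in item_classes)


# A ratio whose denominator is only a potential outlier.
ratio_items = [NOT_OUTLIER, POTENTIAL]
print(is_true_outlier(ratio_items))                      # conservative
print(is_true_outlier(ratio_items, conservative=False))  # anti-conservative
```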

Evaluation Statistics: Bivariate Comparisons Individual Test Level Type I Error Rate: proportion of false rejects Type II Error Rate: proportion of false accepts Hit Rate: proportion of flagged estimates that are outliers All-Test Level All-item Type II error rate

Evaluation Statistics: Multivariate Comparisons Type I error rate: the proportion of non-outlier records that are flagged as outliers Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
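The evaluation statistics defined on these two slides can be computed as in the sketch below. The function name, dictionary keys, and sample data are hypothetical. Returning 0.0 when a denominator is empty is an assumption made only to keep the example total.

```python
def evaluation_rates(truth, flagged):
    """Compute error rates from parallel lists of booleans.

    truth[i]   is True when estimate i really is an outlier.
    flagged[i] is True when the method flagged estimate i.
    """
    pairs = list(zip(truth, flagged))
    false_rejects = sum(1 for t, f in pairs if not t and f)
    false_accepts = sum(1 for t, f in pairs if t and not f)
    hits = sum(1 for t, f in pairs if t and f)
    n_clean = sum(1 for t, _ in pairs if not t)
    n_outliers = sum(1 for t, _ in pairs if t)
    n_flagged = sum(1 for _, f in pairs if f)
    return {
        # Non-outliers that were flagged, out of all non-outliers.
        "type_1": false_rejects / n_clean if n_clean else 0.0,
        # Outliers that were missed, out of all outliers.
        "type_2": false_accepts / n_outliers if n_outliers else 0.0,
        # Flagged estimates that really are outliers, out of all flagged.
        "hit_rate": hits / n_flagged if n_flagged else 0.0,
    }


# Hypothetical results for eight estimates.
truth = [True, False, False, True, False, False, False, True]
flagged = [True, True, False, False, False, False, False, True]
rates = evaluation_rates(truth, flagged)
print(rates)
```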

Annual Capital Expenditures Survey (ACES) Sample Survey (Stratified SRS-WOR) ACE-1: Employer companies ACE-2: Non-employer companies (not discussed) New sample selection each year Total and year-to-year change estimates Total Capital Expenditures Structures (New and Used) Equipment (New and Used)

Capital Expenditures Data Characterized by Low year-to-year correlation (same company) Weak association with available auxiliary data Editing procedures focus on additivity Outlier correction at micro-level

Bivariate Comparisons Robust Regression Resistant Fences HB Edit Structures/Total New Structures/Structures New Structures/Used Structures Equipment/Total New Equipment/Equipment Resistant Fences: (Symmetric or Asymmetric) × (Inner or Outer) HB Edit: (U = 0.3 or 0.5) × (c = 10 or 20)

Results – Individual Tests Robust Regression prone to swamping High Type I error rate (false rejects) Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, c = 10) Low Type I error rates High Hit Rates High Type II error rates Other variations of Resistant Fences and HB edit not as good

Results – All-Tests Very large Type II error rates (approx. 50%) Robust regression Symmetric resistant outer fences HB edit with c = 20 Improved Type II error rates (30% - 40%) Asymmetric inner fences HB edit (U = 0.3, C=10)

Multivariate Results Original Data: considered methods ineffective Log-transformed data: improved performance (MCD and MVE) Reduced Type I error rates Comparable Type II error rates (to original-data MCD and MVE)

Multivariate Versus Bivariate: Different Outcomes (Conservative) Combined HB edits flag more “outliers”: Higher Type I error rate Lower Type II error rates for the complete set of HB edits

Comments Economic data with inconsistent statistical association between items in each collection period Critical values must be determined by the data set at hand (no “hard-coding”) Dynamically standardize the comparisons (HB edit, log transformation) Compute outlier limits Could try hybrid approach: Multivariate + a few current cell ratio tests with the HB edit Perform all bivariate tests, but unduplicate cells before analyst review

Final Thoughts/Next Steps Examined one set of economic data and considered only two separate collections from this program. Extrapolation would be foolish My results need to be validated on other economic data sets: a more typical periodic business survey and/or a well-constructed simulation study

Any Questions? Katherine Jenny Thompson Katherine.J.Thompson@census.gov