Presentation on theme: "1 Data Editing, Coding, and Just a Little Imputation Katherine (Jenny) Thompson Office of Statistical Methods and Research for Economic Programs"— Presentation transcript:
1 Data Editing, Coding, and Just a Little Imputation Katherine (Jenny) Thompson Office of Statistical Methods and Research for Economic Programs (301)
2 The Basics: What is Editing? Editing (procedures) review reported/keyed data for errors and pinpoints inconsistent values For industry For respondent Editing does not change the data. Items that fail edits are referred to an analyst; or automatically imputed (replaced with consistent values)
3 The Basics: What Is Imputation? Imputation is the replacement of a missing or incorrectly reported item using logical edits or statistical procedures. In other words, Imputation replaces a missing or incorrect data item with an educated guess.
4 The Basics: What is Coding? Coding is the assignment of recognizable values to flags that describe key characteristics of the unit or item, such as Industry (unit level) Response status (unit or item level) Source of data correction (item level) Imputation model (item level)
5 We Begin With Coding Before we can evaluate whether a response is reasonable, we have to know where it comes from: Classification variable(s) value, e.g., industry, state Frame information may be erroneous or unit may have changed classification value Each unit must be assigned classification code(s) before editing/imputation
6 We End With Coding At the end of the processing cycle, we want to know How the data were changed, Where the data were changed, Why (if possible) data were changed, and The final status of the reporting unit (respondent, non-respondent).
7 Some Edit Definitions Editing:Procedures for detecting incorrect keyed or respondent data. Micro-Editing:Editing at the individual record (questionnaire) level Macro-Editing:Editing at the tabulated value level
8 Typical Editing Processing Flow Micro-editing (static) Performed on a flow basis Predetermined edit tests and edit parameters (historic data) Administered by machine Resolved by machine and human Outlier detection (dynamic) Performed after close-out Administered by machine Often resolved by human Macro-editing (dynamic) See above
9 Micro-Edits are Either: FatalMust be resolved before subsequent editing Unit is Out-of-Scope for Survey Unit is missing classification variable value Required data item not reported QueryCan be corrected automatically Detail items do not add to reported total Ratio of two items is outside (user- determined) limits
10 Where Do Micro-Edits Come From? Questionnaire Reality Subject-Matter Expert Rules (Enforced) Statistical Relationships
11 Fictional Sample Questionnaire Instructions: Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. MillionsThousands Item 1. ANNUAL PAYROLL Item 2. 1 ST QUARTER PAYROLL Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Item 4. TOTAL HOURS WORKED Item 5. EMPLOYMENT
12 Edit Sources: Questionnaire Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Balance Edit Item 3.a. Value + Item 3.b. Value = Item 3.c. Value Things have to add up!
13 Edits Sources: Questionnaire/Reality MillionsThousands Item 1. ANNUAL PAYROLL Item 2. 1 ST QUARTER PAYROLL Ratio Edit ANNUAL PAYROLL/1 ST QUARTER PAYROLL 1 Cant spend more on payroll in one quarter than for the entire year!
14 Edit Sources: Questionnaire/Reality MillionsThousands Item 4. TOTAL HOURS WORKED Item 5. EMPLOYMENT Ratio Edit 0.96 < TOTAL HOURS WORKED/EMPLOYMENT < 8.76
15 Edits Sources: Questionnaire/Reality Item 5. EMPLOYMENT Range Edit 0 EMPLOYMENT 5,615,727 A unit cant have more employees than the population of the resident state (or negatively-value employees!)
16 Edit Sources: Subject-Matter Rules MillionsThousands Item 1. ANNUAL PAYROLL Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Ratio Edit TOTAL SALES/ANNUAL PAYROLL > 1 Full-year reporters should operate at a profit!
17 Edit Sources: Statistical Relationships MillionsThousands Item 1. ANNUAL PAYROLL Item 5. EMPLOYMENT Ratio Edit A ANNUAL PAYROLL/EMPLOYMENT B Wage per employee should be within the (industry) range.
18 Examples of Fatal Micro-edits Classification Edits Required Data Item Tests
19 Examples of Query Micro-edits List Directed (Verification) Edits Skip Pattern Validation Edits Range Edits (Including negative tests) Ratio Edits Within same questionnaire Current to prior period Balance Edits Subject-matter rules
20 List Directed/Verification Edits Purpose:To compare the reported value of a data field to a pre-determined list of legal values. Machine edits, but highly dependent on data- quality of list Human (manual) correction of edit failures
21 Skip Pattern Validation Edits Purpose:To verify that values of skip items are consistent with the skip instructions provided on the questionnaire. Machine edits that CAN be resolved by machine- imputation Subject-matter rules (if..then..logic) Operations Research approach
22 Range Edits Purpose:To check the reported value of a data item to see if it is within specified minimum and maximum values. Form of edit: lower bound data item upper bound Upper and lower bounds are tolerances. If data item is not contained within the bounds, then it fails the range edit (out of tolerance). Negative tests are a special case of range edits. Can be used to define an imputation region.
23 Range Edits Examples: 0 Employment 301,064,982 (2006 U.S. Population) 0 Sales 12,455.8 billion (2005 Gross Domestic Product) 0 Percent of work done in category 100%
24 Ratio Edits Purpose:To compare two related items in a questionnaire to see if reported values are consistent. Form of Ratio Edit: Upper and lower bounds are known as tolerances. Tolerances generally developed from prior period data. If ratio is not contained within the bounds, then it fails the ratio edit (out of tolerance).
25 Some Reasons for Ratio Editing One data item is a function of another. Annual Payroll = 1st Quarter Payroll + Payroll for Remaining 3 Quarters Ratio Edit:
26 Some Reasons for Ratio Editing One data item can only be evaluated in comparison with another item (reasonable lower bound) (reasonable upper bound)
27 Some Reasons for Ratio Editing One data item is a good predictor of another. Annual Payroll = factor Total Employment
28 Plot of a Typical Ratio Edit
29 Advantages of Ratio Edits Useful for detecting systematic and random errors Reasonable comparisons for quantitative data Verifiable assumptions Often insensitive to changes in economy when both items are in the same units Imply certain imputation models Can be solved simultaneously imputation region implications
30 Disadvantages of Ratio Edits Edit failure identifies a pair of potentially incorrect data fields Need to have a tie-breaker Often work best when combined with other edits (can be ratio edits) Very dependent on the distribution of ratios Highly correlated Goes through origin
31 Best Practices for Ratio Edits Incorporate unit size categories as well as classification variables in editing cells Perform preliminary data analysis to determine validity of edit model Incorporate tests to prior data from same unit and item when reasonable Use non-parametric outlier-resistant methods for setting ratio edit tolerances Audit edits An edit test that has a high rate of failure could indicate problems with the tolerances or the test itself
32 Periodic Data and Ratio Edits (Caution)
33 Brief Digression on Imputation Situation: Missing item or item marked for imputation (replacement) due to edit failure(s) We would like the machine to automatically replace the inconsistent item with a consistent value.
34 The ideal imputations find replacement values that are still considered reported (from the same respondent) Examples divide reported data by correct reporting unit replace reported total with sum of details
35 Link Between Imputation and Program Published tabulations (macro-data) Ratio imputation models Regression imputation models Published micro-data Hot deck imputation
36 Commonly-Used Imputation Methods (Economic Data) Rounding/Data Slides (systematic error) Respondent data divided by unit conversion factor (e.g., imputed value = reported value/1,000) Direct Substitution Another data item (same questionnaire) Absolute value of reported/keyed item Sum of Reported Details (logical edit) Derived value from other reported/keyed item Previously reported value (historic) from same respondent Administrative data value (same respondent)
37 Ratio Imputation (Model Imputation) imputed item = (factor) (another data field) Same reporting unit/questionnaire Edit-passing item Industry (Category) Average Ratio (use average ratio of two items in industry/category) e.g., factor = industry wage/employee ratio Historic Imputation (Auxiliary Trend) (use ratio of prior data to current data for same respondent) e.g. factor = previous tabulated value of edit-failing item previous tabulated value of auxiliary data field
38 Balance Edits Purpose:To determine if detail items add to associated reported total. Form of Edit: TOTAL = DETAIL 1 + DETAIL DETAIL n Developed from questionnaire A set of details along with their associated total is called a balance complex. More complicated balance complexes Nested 1-Dimensional 2-Dimensional
39 Sample Questionnaire Example Instructions: Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. MillionsThousands Item 1. ANNUAL PAYROLL Item 2. 1 ST QUARTER PAYROLL Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Item 4. TOTAL HOURS WORKED Item 5. EMPLOYMENT
40 Fixing a Failed Balance Edit Editing generally integrated with imputation: Editor decides which is more believable: TOTAL or SUM OF DETAILS Only change one side of balance complex (TOTAL or SUM OF DETAILS)
41 Balance Edit Definitions Residual:TOTAL - SUM OF DETAILS Failed edit solution can depend on SIZE of residual (absolute tolerance) RATIO of residual to total (relative tolerance)
42 A Few Balance Edit Fixes RAKE*Rake all detail items to TOTAL YSUMX*Replace TOTAL with the SUM OF DETAILS ROUNDDivide all details by 1000 or divide TOTAL by 1000 RESIDUAL Set one missing DETAIL to the RESIDUAL IMPUTE*Replace all DETAILS with imputed values *Briefly discussed…
43 Raking Adjust each detail item as Conditions: Reported TOTAL must be acceptable. Relative tolerance is small (e.g., within 5%).
44 Raking -- Considerations Is not considered imputation Preserves reported distribution of the detail items
45 YSUMX Set TOTAL equal to SUM OF DETAILS Conditions: TOTAL can be changed by edit (not fixed); (Optional, but preferable) SUM OF DETAILS is reasonable (e.g., verify with ratio test or range test)
46 YSUMX -- Considerations Not (considered: imputation; logical edit or deductive imputation Useful when TOTAL is missing (and details are not); Can be imputation solution to ratio edit
47 Impute Replace ALL reported DETAILS with imputed values Imputed DETAIL i for reporting unit c is given by factor i TOTAL Conditions TOTAL > 0 (and value of TOTAL acceptable) No restriction on SUM OF DETAILS (all DETAILS are replaced...] Difference between TOTAL and SUM OF DETAILS too large for raking
48 Macro-Editing (Brief Comments) Systematic review of tabulations (estimates) Tendency to rely on ratio comparisons to identify outlying estimates Hidiroglou-Berthelot edit Ratio Edits Need to analyze micro-data in outlying cells
49 Back to Coding… Throughout the editing and imputation process, what do we need to keep track of?
50 Back to Coding… Original source of data item Reported from respondent Elicited by analyst/subject-matter expert Missing/not reported
51 Back to Coding… Final source of data item value Unchanged (respondent data) Raked/considered reported Rounded detail item (rescaled) Substitution Administrative data Other item from same questionnaire Prior period value from same respondent (can indicate bad survey practice) Model imputation (+ auxiliary data) Other imputation
52 Back to Coding… Why was data item value changed? Not reported? Edit failure (automatic) Macro or micro level? Which edit/edit module Analyst change (manual) Should be documented in notes
53 Back to Coding… What is the final disposition the data item? Reported data? Equivalent in quality to reported data? Imputed data? What is the final disposition of the entire reporting unit?
54 Wrap Up Talked in GREAT detail on Methods and sources of micro-edits Common imputation models Talked in SEMI-GREAT detail on imputation methods Brought up the idea of macro-editing Mentioned a few coding concerns here and there…
55 For me, 40 minutes is barely a warm-up. Contact information: (301)