1
Automatic Editing with Soft Edits
Sander Scholtus (Statistics Netherlands)
2
Automatic editing
Goal: detect and correct errors and missing values without human intervention
Data is made consistent with respect to a set of edits
Two steps:
  1. detecting erroneous and missing values (error localisation)
  2. imputation of new values
3
Automatic editing (2)
Fellegi-Holt paradigm for error localisation:
  Find the smallest subset of the variables that can be imputed to satisfy all edits
Generalised version uses confidence weights
At Statistics Netherlands: SLICE software
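The generalised Fellegi-Holt problem can be illustrated with a brute-force search. This is only a minimal sketch, not the SLICE algorithm: the function name `fellegi_holt_bruteforce` and the feasibility oracle `can_be_imputed` are hypothetical, and the oracle below hard-codes a single toy edit (P = T - C) for which changing any one variable is enough to repair a failing record.

```python
from itertools import combinations

def fellegi_holt_bruteforce(variables, weights, can_be_imputed):
    """Return (subset, weight) with the smallest total confidence weight
    among subsets that can be imputed to satisfy all edits.
    `can_be_imputed` is a user-supplied (hypothetical) feasibility oracle."""
    best_subset, best_weight = None, float("inf")
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            weight = sum(weights[v] for v in subset)
            # Only call the oracle when the subset could improve on the best so far.
            if weight < best_weight and can_be_imputed(set(subset)):
                best_subset, best_weight = set(subset), weight
    return best_subset, best_weight

# Toy record and a toy oracle for the single assumed edit P = T - C.
record = {"T": 100, "P": 40000, "C": 60000}

def can_be_imputed(subset):
    if record["P"] == record["T"] - record["C"]:
        return True           # record already consistent: nothing to impute
    return len(subset) >= 1   # any one of T, P, C can be adjusted to restore the equality

weights = {"T": 2, "P": 1, "C": 1}
solution = fellegi_holt_bruteforce(["T", "P", "C"], weights, can_be_imputed)
```

Enumerating all subsets is exponential in the number of variables, which is why a branch-and-bound search is used in practice.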
4
SLICE
Branch-and-bound algorithm:
[Figure: binary tree. The root branches on x1 (x1 erroneous / x1 correct); each child branches on x2 (x2 erroneous / x2 correct); the next level branches on x3.]
5
SLICE
Branch-and-bound algorithm:
[Figure: the same binary tree, with the branches relabelled as operations. The root branches on x1 (eliminate x1 / fix x1); each child branches on x2 (eliminate x2 / fix x2); the next level branches on x3.]
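The shape of the tree can be sketched by enumerating its leaves. This is an illustrative sketch only: the function name `enumerate_leaves` is hypothetical, and the code performs neither the bounding step nor any processing of edits; it just lists, for every leaf, which variables were eliminated on the path from the root.

```python
def enumerate_leaves(variables, decided=()):
    """Yield, for each leaf of the binary tree, the set of eliminated variables.
    At depth i the tree branches on variables[i]: 'eliminate' or 'fix'."""
    if len(decided) == len(variables):
        yield {v for v, action in zip(variables, decided) if action == "eliminate"}
        return
    for action in ("eliminate", "fix"):
        yield from enumerate_leaves(variables, decided + (action,))

leaves = list(enumerate_leaves(["x1", "x2", "x3"]))
```

With three variables the tree has 2^3 = 8 leaves, ranging from "all variables eliminated" to "all variables fixed".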
6
SLICE (2)
Leaf nodes of the tree:
  all variables have been either fixed or eliminated
  interpretation: eliminated variables are incorrect
Associated sets of edits:
  contain no variables
  either empty or contain only trivial statements
Theorem (De Waal and Quere, 2003): A leaf node corresponds to a feasible solution of the error localisation problem if and only if the associated set of edits contains no contradictions
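The feasibility test at a leaf can be sketched as follows. The representation is an assumption made for illustration: once no variables remain, each edit is taken to have the form "0 <= b" or "0 == b" for some constant b, stored as a pair `(kind, b)`. The function name `leaf_is_feasible` is hypothetical.

```python
def leaf_is_feasible(constant_edits):
    """Check a leaf's variable-free edits for contradictions.
    Each edit is (kind, b): ("ineq", b) means 0 <= b, ("eq", b) means 0 == b."""
    for kind, b in constant_edits:
        if kind == "eq" and b != 0:
            return False   # contradiction such as 0 == 3
        if kind == "ineq" and b < 0:
            return False   # contradiction such as 0 <= -3
    return True            # empty set or only trivial statements

feasible = leaf_is_feasible([("ineq", 5), ("eq", 0)])
infeasible = leaf_is_feasible([("ineq", -3)])
```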
7
SLICE (3)
Application of SLICE in the production process:
  automatic editing of micro data for the Dutch structural business statistics
  approximately 100 variables and 100 edits
  evaluation studies: sometimes large differences between automatic and manual editing
8
Hard edits and soft edits
Examples of edits:
  Profit = Turnover - Costs
  Profit < 0.6 × Turnover
First example: hard edit
  has to hold by definition
Second example: soft edit
  can also be failed by correct values
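The two example edits can be written as simple checks on a record. This is a minimal sketch; the function names and the record values are made up for illustration. The record shown satisfies the hard edit but fails the soft edit, which by itself does not prove that any value is wrong.

```python
def hard_edit_ok(record):
    """Hard edit: Profit = Turnover - Costs (must hold by definition)."""
    return record["Profit"] == record["Turnover"] - record["Costs"]

def soft_edit_ok(record):
    """Soft edit: Profit < 0.6 x Turnover (can be failed by correct values)."""
    return record["Profit"] < 0.6 * record["Turnover"]

# Hypothetical record with an unusually high, but possibly genuine, profit margin.
record = {"Turnover": 100, "Profit": 70, "Costs": 30}
hard_ok = hard_edit_ok(record)
soft_ok = soft_edit_ok(record)
```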
9
Hard edits and soft edits (2)
Manual editing uses both hard and soft edits
Current methods for automatic editing can only handle hard edits
Practical solutions:
  ignore all soft edits
  treat soft edits as hard edits
Can this be improved?
10
Error localisation with soft edits
Current error localisation problem:
  Minimise, among subsets of variables that can be imputed to satisfy all edits, the sum of the confidence weights
Suggested new error localisation problem:
  Minimise, among subsets of variables that can be imputed to satisfy all hard edits, the sum of the confidence weights plus a cost term for failed soft edits
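One possible way to write the suggested problem down is given below. The notation is an assumption, not taken from the slides: x is the observed record, S the set of variables to impute, \tilde{x} the imputed record, w_j the confidence weight of variable j, and z_k(\tilde{x}) equals 1 if soft edit k is failed by \tilde{x} and 0 otherwise. The additive cost with a fixed s_k per failed soft edit is just one simple choice of cost term.

```latex
\min_{S,\;\tilde{x}} \quad \sum_{j \in S} w_j \;+\; \sum_{k=1}^{K} s_k \, z_k(\tilde{x})
\qquad \text{subject to} \qquad
\tilde{x}_j = x_j \ \ (j \notin S), \qquad
\tilde{x} \text{ satisfies all hard edits.}
```

Setting every s_k to 0 ignores the soft edits, while letting s_k grow very large effectively treats them as hard edits.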
11
Error localisation with soft edits (2)
The new error localisation problem can be solved by an extended version of the SLICE algorithm
[Figure: binary tree. The root branches on x1 (eliminate x1 / fix x1); each child branches on x2 (eliminate x2 / fix x2); the next level branches on x3.]
12
Example
Variables: Turnover (T), Profit (P), Costs (C), Number of Employees (N)
Edits:
  Hard edits:
  Soft edits:
Confidence weights: Turnover: 2; Profit: 1; Costs: 1; Number of Employees: 3
Contribution of each failed soft edit: 2
13
Example (2)
Original data and edits:
  T = 100; P = 40000; C = 60000; N = 5
  Hard edits:
  Soft edits:
14
Example (3)
Original data and edits:
  T = 100; P = 40000; C = 60000; N = 5
  Hard edits:
  Soft edits:
Eliminate P from the original edits:
  Implied hard edits:
  Implied soft edits:
15
Example (4)
According to the theory, P can be imputed to satisfy all hard edits, but the second soft edit is failed
Imputing only P is a feasible solution to the error localisation problem
The value of the target function equals 1 + 2 = 3 (confidence weight of P plus the contribution of one failed soft edit)
16
Example (5)
Data and edits after eliminating P:
  T = 100; C = 60000; N = 5
  Implied hard edits:
  Implied soft edits:
Eliminate C from these edits:
17
Example (6)
According to the theory, P and C can be imputed to satisfy all hard and soft edits
Imputing P and C is a feasible solution to the error localisation problem
The value of the target function equals 1 + 1 = 2 (confidence weights of P and C, no failed soft edits)
This turns out to be the optimal solution
Possible corrected version of the record:
  T = 100; P = 40; C = 60; N = 5
18
Example (7)
Imputing only P is the optimal solution if the soft edits are ignored
Corrected version of the record:
  T = 100; P = ; C = 60000; N = 5
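The arithmetic of the example can be retraced in a few lines. This sketch compares only the two candidate solutions discussed in the slides, not all subsets of variables. The confidence weights, the contribution of 2 per failed soft edit, and the number of failed soft edits per candidate are taken from the example; the function name `target` and the data layout are assumptions.

```python
WEIGHTS = {"T": 2, "P": 1, "C": 1, "N": 3}   # confidence weights from the example
SOFT_EDIT_COST = 2                           # contribution of each failed soft edit

def target(imputed, n_failed_soft, use_soft_edits=True):
    """Sum of confidence weights, plus the soft-edit cost term if enabled."""
    cost = sum(WEIGHTS[v] for v in imputed)
    if use_soft_edits:
        cost += SOFT_EDIT_COST * n_failed_soft
    return cost

# (imputed variables, number of soft edits that remain failed)
candidates = {
    "impute P": ({"P"}, 1),
    "impute P and C": ({"P", "C"}, 0),
}

with_soft = {name: target(v, k) for name, (v, k) in candidates.items()}
without_soft = {name: target(v, k, use_soft_edits=False)
                for name, (v, k) in candidates.items()}

best_with_soft = min(with_soft, key=with_soft.get)
best_without_soft = min(without_soft, key=without_soft.get)
```

With the soft edits included, imputing P and C (value 2) beats imputing only P (value 3); with the soft edits ignored, imputing only P (value 1) wins.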
19
Discussion
Future work:
  Implementation of the algorithm in R (in progress)
  Test on realistic data (Dutch structural business statistics)
  How to model the costs of failed soft edits
Thank you for your attention!