Automatic Editing with Soft Edits


Automatic Editing with Soft Edits Sander Scholtus (Statistics Netherlands)

Automatic editing
Goal: detect and correct errors and missing values without human intervention
- Data is made consistent with respect to a set of edits
- Two steps:
  - detecting erroneous and missing values (error localisation)
  - imputation of new values

Automatic editing (2)
- Fellegi-Holt paradigm for error localisation: find the smallest subset of the variables that can be imputed to satisfy all edits
- Generalised version uses confidence weights
- At Statistics Netherlands: SLICE software

SLICE
Branch-and-bound algorithm:
[Figure: binary search tree. The root branches on x1 (x1 erroneous / x1 correct), each child branches on x2 (x2 erroneous / x2 correct), and the next level branches on x3.]

SLICE
Branch-and-bound algorithm:
[Figure: binary search tree. The root branches on x1 (eliminate x1 / fix x1), each child branches on x2 (eliminate x2 / fix x2), and the next level branches on x3.]

SLICE (2)
Leaf nodes of the tree:
- all variables have been either fixed or eliminated
- interpretation: eliminated variables are incorrect
Associated sets of edits:
- contain no variables
- are either empty or contain only trivial statements
Theorem (De Waal and Quere, 2003): A leaf node corresponds to a feasible solution of the error localisation problem if and only if the associated set of edits contains no contradictions.
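The leaf-node test in the theorem can be sketched in Python. This is a minimal illustration, assuming linear inequality edits of the form sum(a_v * x_v) + b >= 0 and Fourier-Motzkin elimination as the "eliminate" step. The function names, the non-negativity edits and the record values are hypothetical choices for illustration, not SLICE's actual implementation, and the exhaustive enumeration of subsets stands in for the branch-and-bound search.

```python
from itertools import combinations

# An edit is a pair (coeffs, const) meaning: sum(coeffs[v] * x[v]) + const >= 0.

def fix(edits, var, value):
    """'Fix' step: substitute the observed value of var into every edit."""
    return [({v: a for v, a in c.items() if v != var},
             k + c.get(var, 0) * value) for c, k in edits]

def eliminate(edits, var):
    """'Eliminate' step: Fourier-Motzkin elimination of var (an assumption)."""
    pos = [e for e in edits if e[0].get(var, 0) > 0]
    neg = [e for e in edits if e[0].get(var, 0) < 0]
    out = [({v: a for v, a in c.items() if v != var}, k)
           for c, k in edits if c.get(var, 0) == 0]
    for pc, pk in pos:
        for nc, nk in neg:
            ap, an = pc[var], -nc[var]  # positive multipliers; var cancels
            coeffs = {}
            for v in (set(pc) | set(nc)) - {var}:
                a = an * pc.get(v, 0) + ap * nc.get(v, 0)
                if a != 0:
                    coeffs[v] = a
            out.append((coeffs, an * pk + ap * nk))
    return out

def can_be_imputed(record, edits, to_impute):
    """Walk to a leaf node and test its edits for contradictions."""
    current = edits
    for var in record:
        if var in to_impute:
            current = eliminate(current, var)
        else:
            current = fix(current, var, record[var])
    # No variables remain, so every edit now reads "const >= 0".
    return all(const >= 0 for _, const in current)

def localise_errors(record, edits, weights):
    """Exhaustive search for the feasible subset with the smallest weight."""
    best = None
    for r in range(len(record) + 1):
        for subset in combinations(record, r):
            cost = sum(weights[v] for v in subset)
            if (best is None or cost < best[0]) and \
                    can_be_imputed(record, edits, set(subset)):
                best = (cost, set(subset))
    return best

# Hypothetical toy problem: balance edit P = T - C written as two
# inequalities, plus assumed non-negativity of T and C.
edits = [
    ({"P": 1, "T": -1, "C": 1}, 0),
    ({"P": -1, "T": 1, "C": -1}, 0),
    ({"T": 1}, 0),
    ({"C": 1}, 0),
]
record = {"T": 100, "P": 40000, "C": 60000}
weights = {"T": 2, "P": 1, "C": 1}

print(localise_errors(record, edits, weights))  # (1, {'P'})
```

Under these assumed edits, only P needs to be imputed. Branch-and-bound reaches the same leaves but prunes subtrees whose edits are already contradictory or whose weight exceeds the best solution found so far.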

SLICE (3)
Application of SLICE in the production process:
- automatic editing of micro data for the Dutch structural business statistics
- approximately 100 variables and 100 edits
- evaluation studies: sometimes large differences between automatic and manual editing

Hard edits and soft edits
Examples of edits:
- Profit = Turnover − Costs
- Profit < 0.6 × Turnover
First example: hard edit, has to hold by definition
Second example: soft edit, can also be failed by correct values
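The difference between the two example edits can be made concrete with a short Python sketch. The function name check_record and the three records are illustrative assumptions; only the two edit rules come from the slide.

```python
def check_record(turnover, profit, costs):
    """Return (hard_ok, soft_ok) for the two example edits."""
    hard_ok = profit == turnover - costs   # hard edit: must hold by definition
    soft_ok = profit < 0.6 * turnover      # soft edit: may fail for correct data
    return hard_ok, soft_ok

plausible = check_record(100, 40, 60)         # both edits satisfied
unusual = check_record(100, 70, 30)           # consistent, but soft edit fails
erroneous = check_record(100, 40000, 60000)   # hard edit fails: certain error

print(plausible, unusual, erroneous)
# (True, True) (True, False) (False, False)
```

The second record shows why a soft edit cannot simply be treated as hard: the values are internally consistent and may well be correct, even though the profit margin is unusually high.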

Hard edits and soft edits (2)
- Manual editing uses both hard and soft edits
- Current methods for automatic editing can only handle hard edits
- Practical solutions:
  - ignore all soft edits
  - treat soft edits as hard edits
- Can this be improved?

Error localisation with soft edits
Current error localisation problem:
- Minimise, among subsets of variables that can be imputed to satisfy all edits, the sum of the confidence weights
Suggested new error localisation problem:
- Minimise, among subsets of variables that can be imputed to satisfy all hard edits, the sum of the confidence weights plus a cost term for failed soft edits
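One possible way to write the new target function is given below. The notation is an assumption introduced here, not taken from the slides: y_j indicates whether variable j is imputed, w_j is its confidence weight, z_k indicates whether soft edit k remains failed, and s_k is the cost assigned to that failure. An additive cost per failed soft edit is only one option; how to model these costs is listed as open in the discussion.

```latex
\min_{y \in \{0,1\}^n} \; \sum_{j=1}^{n} w_j \, y_j \;+\; \sum_{k=1}^{K} s_k \, z_k
\quad \text{subject to: the variables with } y_j = 1
\text{ can be imputed so that all hard edits hold.}
```

With no soft edits (K = 0) the second sum vanishes and the expression reduces to the current problem, the weighted Fellegi-Holt criterion.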

Error localisation with soft edits (2)
The new error localisation problem can be solved by an extended version of the SLICE algorithm.
[Figure: binary search tree. The root branches on x1 (eliminate x1 / fix x1), each child branches on x2 (eliminate x2 / fix x2), and the next level branches on x3.]

Example
Variables: Turnover (T), Profit (P), Costs (C), Number of Employees (N)
Edits:
- Hard edits: [formulas not preserved in the transcript]
- Soft edits: [formulas not preserved in the transcript]
Confidence weights: Turnover: 2; Profit: 1; Costs: 1; Number of Employees: 3
Contribution of each failed soft edit: 2

Example (2)
Original data and edits:
- T = 100; P = 40000; C = 60000; N = 5
- Hard edits: [formulas not preserved in the transcript]
- Soft edits: [formulas not preserved in the transcript]

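Example (3)
Original data and edits:
- T = 100; P = 40000; C = 60000; N = 5
- Hard edits: [formulas not preserved in the transcript]
- Soft edits: [formulas not preserved in the transcript]
Eliminate P from the original edits:
- Implied hard edits: [formulas not preserved in the transcript]
- Implied soft edits: [formulas not preserved in the transcript]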

Example (4)
- According to the theory, P can be imputed to satisfy all hard edits, but the second soft edit is failed
- Imputing only P is a feasible solution to the error localisation problem
- The value of the target function equals 1 + 2 = 3

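Example (5)
Data and edits after eliminating P:
- T = 100; C = 60000; N = 5
- Implied hard edits: [formulas not preserved in the transcript]
- Implied soft edits: [formulas not preserved in the transcript]
Eliminate C from these edits: [formulas not preserved in the transcript]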

Example (6)
- According to the theory, P and C can be imputed to satisfy all hard and soft edits
- Imputing P and C is a feasible solution to the error localisation problem
- The value of the target function equals 1 + 1 = 2
- This turns out to be the optimal solution
- Possible corrected version of the record: T = 100; P = 40; C = 60; N = 5
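The two target-function values in the example can be checked with a few lines of Python. The confidence weights and the cost of 2 per failed soft edit are those stated in the example; the function name target_value and the additive form of the cost term are assumptions made for this sketch.

```python
# Confidence weights and soft-edit cost as stated in the example.
WEIGHTS = {"T": 2, "P": 1, "C": 1, "N": 3}
SOFT_EDIT_COST = 2

def target_value(imputed, n_failed_soft):
    """Sum of confidence weights of imputed variables plus soft-edit costs."""
    return sum(WEIGHTS[v] for v in imputed) + SOFT_EDIT_COST * n_failed_soft

only_p = target_value({"P"}, n_failed_soft=1)        # 1 + 2
p_and_c = target_value({"P", "C"}, n_failed_soft=0)  # 1 + 1

print(only_p, p_and_c)  # 3 2
```

Imputing one extra variable costs less here than leaving a soft edit failed, which is why the solution that imputes both P and C has the lower target value.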

Example (7)
- Imputing only P is the optimal solution if the soft edits are ignored
- Corrected version of the record: T = 100; P = -59900; C = 60000; N = 5

Discussion
Future work:
- Implementation of the algorithm in R (in progress)
- Test on realistic data (Dutch structural business statistics)
- How to model the costs of failed soft edits
Thank you for your attention!