Published by Alyson Walters. Modified over 9 years ago.
1
Data Cleaning
An overview of data quality problems and data cleaning solution approaches
Seminar talk: Digital Information Curation
Jens Gerken, Konstanz, 20.12.2005
2
Outline
  Motivation
  Data Quality Problems
  Data Cleaning Process
  Typical Conflict Resolution Steps
  Data Cleaning Tool: Potter's Wheel
  Conclusion
3
Motivation
  Problem: how to deal with data errors and inconsistencies in large data collections such as databases, e.g. missing information, spelling errors, invalid data
  Especially crucial for data warehouses: combining several sources multiplies errors quickly and introduces the problem of redundant data
  Objectives of data cleaning:
    Detect errors
    Remove errors
    In a (semi-)automatic process
4
Data Quality Problems (1/2)
  Single-source problems
    Can be schema-related as well as instance-related
5
Data Quality Problems (2/2)
  Multi-source problems
    Overlapping or contradicting data
    Different representations
    Again both schema-related and instance-related problems
6
Data Cleaning Process (1/3)
  Data auditing
    Analysis of the data to find errors
  Workflow specification
    Definition of appropriate transformations
    Goal: eliminate all or most errors and anomalies automatically
  Workflow execution
  Post-processing / controlling
    Check results
    Resolve remaining problems manually
7
Data Cleaning Process (2/3)
  Data auditing
    Data profiling: create metadata by analysing individual attributes on the instance level (e.g. data type, length, value range, discrete values, variance, uniqueness)
    Data mining: discover data patterns by analysing the whole data collection (e.g. association rules like "total = quantity * unit price")
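The two auditing techniques can be illustrated with a minimal Python sketch. The order records, the `profile` and `rule_violations` helpers, and the tolerance value are assumptions made for illustration and do not come from the talk. The sketch also only checks a rule that is already known; it does not mine the rule from the data.

```python
from statistics import pvariance

# Hypothetical order records; None marks a missing value.
orders = [
    {"quantity": 2, "unit_price": 5.0, "total": 10.0},
    {"quantity": 1, "unit_price": 3.5, "total": 3.5},
    {"quantity": 4, "unit_price": 2.0, "total": 9.0},   # breaks the rule
    {"quantity": 3, "unit_price": None, "total": 6.0},  # missing value
]


def profile(records, attribute):
    """Collect simple instance-level metadata for one numeric attribute."""
    values = [r[attribute] for r in records if r[attribute] is not None]
    return {
        "types": sorted({type(v).__name__ for v in values}),
        "min": min(values),
        "max": max(values),
        "distinct": len(set(values)),
        "missing": len(records) - len(values),
        "variance": pvariance(values),
        "unique": len(set(values)) == len(values),
    }


def rule_violations(records, tolerance=1e-9):
    """Return the indices of records that break total = quantity * unit_price."""
    bad = []
    for i, r in enumerate(records):
        if None in (r["quantity"], r["unit_price"], r["total"]):
            continue  # incomplete records cannot be checked against the rule
        if abs(r["total"] - r["quantity"] * r["unit_price"]) > tolerance:
            bad.append(i)
    return bad


unit_price_profile = profile(orders, "unit_price")
violations = rule_violations(orders)
```

Profiling looks at one attribute at a time, while the rule check needs several attributes of the same record, which is why the slide assigns such rules to the analysis of the whole collection.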
8
Data Cleaning Process (3/3)
  Workflow specification
    Sequence of operations on the data (schema-related data transformations and cleaning steps)
    Choose the appropriate operations to handle data errors, anomalies, etc.
    The process should be as automatic as possible
    Possibility to include user-written cleaning code
    The cause of an error is important (e.g. the keyboard layout can help to correct misspellings)
    Early steps: correct single-source instance problems
    Later on: multi-source problems, e.g. duplicates
    The workflow has to be tested and verified
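A workflow of this kind can be sketched as an ordered list of functions, each taking and returning a list of records. The step names, the misspelling table and the sample records below are hypothetical. The sketch assumes that all attribute values are strings or other hashable values.

```python
def trim_strings(records):
    """Early step: fix a single-source instance problem (stray whitespace)."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in records
    ]


# Hypothetical user-written cleaning code: a lookup table of known misspellings.
KNOWN_MISSPELLINGS = {"Konstnaz": "Konstanz"}


def fix_known_misspellings(records):
    return [
        {k: KNOWN_MISSPELLINGS.get(v, v) for k, v in r.items()} for r in records
    ]


def drop_exact_duplicates(records):
    """Later step: remove records that are identical after cleaning."""
    seen, result = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            result.append(r)
    return result


# The order matters: instance-level fixes run before duplicate handling.
WORKFLOW = [trim_strings, fix_known_misspellings, drop_exact_duplicates]


def run_workflow(records, steps):
    for step in steps:
        records = step(records)
    return records


raw = [
    {"name": " Jane Tayler ", "city": "Konstnaz"},
    {"name": "Jane Tayler", "city": "Konstanz"},
    {"name": "John Smith", "city": "Berlin "},
]
cleaned = run_workflow(raw, WORKFLOW)
```

Running the duplicate step last lets it see records that only become identical once the earlier steps have normalised them.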
9
Conflict Resolution Steps (1/2)
  Extracting values from free-form attributes
    Reordering of values
    Value extraction for attribute splitting
  Validation and correction
    Spell checking and dictionaries
    Attribute dependencies (e.g. birthday/age)
    Statistical methods (e.g. replace missing values with the mean)
  Standardization
    Conversion of date and time entries
  Schema restructuring (multi-source problems)
    Splitting, merging, folding, unfolding of attributes and tables
  Duplicate elimination
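Three of these steps are small enough to sketch in Python: attribute splitting, date standardization and mean imputation. The function names, the "Last, First" name layout and the list of accepted date formats are assumptions chosen for illustration. Month names are parsed according to the current locale, assumed here to be English.

```python
from datetime import datetime
from statistics import mean


def split_name(full_name):
    """Attribute splitting: turn 'Last, First' into two separate attributes."""
    last, first = [part.strip() for part in full_name.split(",", 1)]
    return {"first_name": first, "last_name": last}


def standardize_date(text, formats=("%d.%m.%Y", "%Y-%m-%d", "%B %d, %Y")):
    """Standardization: convert several date notations to ISO 8601.

    Returns None if no format matches, so that the value can be flagged
    for manual post-processing instead of being guessed.
    """
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def fill_missing_with_mean(values):
    """Statistical correction: replace missing values (None) with the mean."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]
```

For example, `standardize_date("20.12.2005")` and `standardize_date("2005-12-20")` both yield the same ISO string, so later steps such as duplicate elimination compare like with like.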
10
Conflict Resolution Steps (2/2)
  Duplicate elimination
    Last step in the workflow
    First step: identify records concerning the same real-world entity (instance matching)
    Second step: merge those records and remove redundancy
  Instance matching
    Easy way: there is an identifier attribute (e.g. same primary key)
    More difficult way: fuzzy matching, which calculates the degree of similarity between records
    Problem: very expensive operation; however, overhead-reducing methods exist
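Fuzzy instance matching can be sketched with the string similarity from Python's standard `difflib` module. This is one possible similarity measure, not necessarily the one the talk has in mind. The averaging over shared fields, the 0.85 threshold and the merge policy are likewise assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Average string similarity over the attributes both records share."""
    fields = a.keys() & b.keys()
    if not fields:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_matches(records, threshold=0.85):
    """Compare every pair of records; this is the expensive O(n^2) part."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]


def merge(a, b):
    """Merge two matched records: keep a's value, fall back to b's if missing."""
    return {k: a.get(k) or b.get(k) for k in a.keys() | b.keys()}


people = [
    {"name": "Jane Tayler", "city": "Konstanz"},
    {"name": "Jane Taylor", "city": "Konstanz"},
    {"name": "John Smith", "city": "Berlin"},
]
matches = find_matches(people)
```

Because every record is compared with every other record, the number of comparisons grows quadratically with the collection size, which is the cost the slide refers to.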
11
Potter's Wheel (1/3)
  Problems of conventional approaches
    Time-consuming (many iterations), long waiting periods
    Users have to write complex transformation scripts
    Separate tools for auditing and transformation
  Potter's Wheel approach:
    Interactive system, instant feedback
    Integration of both data auditing and transformation
    Intuitive user interface: a spreadsheet-like application
12
Potter's Wheel (2/3)
  Main features:
    Instead of writing complex transform specifications with regular expressions or custom programs, the user specifies transforms by example (e.g. splitting)
    Data auditing is extensible with user-defined domains
      Parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*"
      Allows easier detection of e.g. logical errors such as false airport codes
      Problem: tradeoff between overfitting and underfitting the structure
      Potter's Wheel uses the minimum description length method to balance this tradeoff and choose an appropriate structure
    Data auditing runs in the background on the fly (data streaming is also possible)
    Reorderer allows sorting on the fly
    The user only works on a view; the real data isn't changed until the user exports the set of transforms, e.g. as a C program, and runs it on the real data
    Undo without problems: just delete the unwanted transform from the sequence and redo everything else
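The benefit of user-defined domains over purely syntactic patterns can be shown with a small Python sketch. This is not Potter's Wheel code and does not implement its minimum description length structure inference. The `RECORD` pattern, the domain sets and the `audit` function are hypothetical, and the airport and class lists are illustrative subsets.

```python
import re

# Hypothetical user-defined domains, modelled here as sets of valid values.
KNOWN_AIRPORTS = {"JFK", "ORD", "SFO", "LAX"}
KNOWN_CLASSES = {"Coach", "Business", "First"}

# Purely syntactic structure: any three capital letters pass as an airport.
RECORD = re.compile(
    r"(?P<name>[A-Za-z, ]+), (?P<origin>[A-Z]{3}) to (?P<dest>[A-Z]{3}) "
    r"on (?P<date>[A-Za-z]+ [0-9]+, [0-9]+) (?P<cls>[A-Za-z]+)$"
)


def audit(line):
    """Return a list of problems found in one flight record."""
    m = RECORD.match(line)
    if m is None:
        return ["structure mismatch"]
    problems = []
    # Domain checks catch values that are well-formed but logically wrong.
    for field in ("origin", "dest"):
        if m.group(field) not in KNOWN_AIRPORTS:
            problems.append(f"unknown airport code: {m.group(field)}")
    if m.group("cls") not in KNOWN_CLASSES:
        problems.append(f"unknown class: {m.group('cls')}")
    return problems
```

A record with destination "XQZ" still matches the regular expression, so only the domain check reports it: `audit("Tayler, Jane, JFK to XQZ on April 23, 2000 Coach")` returns one problem, while the original record returns none.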
13
Potter's Wheel (3/3) + Conclusion
  Problems:
    Usability of the user interface
    How does duplicate elimination work?
    Kind of a black-box system
  General open problems of data cleaning:
    (Automatic) correction of wrong values
      Mask wrong values but keep them
      Keep several possible values at the same time (e.g. two ages, two birthdays)
      Leads to problems if other values depend on a certain alternative and this turns out to be wrong
    Maintenance of cleaned data, especially if the sources can't be cleaned
    A data cleaning framework is desirable