Presentation is loading. Please wait.

Presentation is loading. Please wait.

Konstanz, 20.12.2005Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital.

Similar presentations


Presentation on theme: "Konstanz, 20.12.2005Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital."— Presentation transcript:

1 Konstanz, 20.12.2005Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital Information Curation Jens Gerken – 20.12.2005

2 Konstanz, 20.12.2005Jens Gerken Outline  Motivation  Data Quality Problems  Data Cleaning Process  Typical Conflict Resolution Steps  Data Cleaning Tool: Potter’s Wheel  Conclusion 1 - 12

3 Konstanz, 20.12.2005Jens Gerken Motivation  Problem: How to deal with data errors and inconsistencies in large data collections such as databases  e.g. missing information, spelling errors, invalid data, etc..  Especially crucial for data warehouses – combination of several sources where errors quickly multiply and problem of redundant data arises  Objectives of Data Cleaning:  Detection of Errors  Remove Errors  In a (semi) automatic process 2 - 12

4 Konstanz, 20.12.2005Jens Gerken Data Quality Problems (1/2)  Single Source Problems  Can be schema related as well as instance related 3 - 12

5 Konstanz, 20.12.2005Jens Gerken Data Quality Problems (2/2)  Multi Source Problems  Overlapping or contradicting data  Different representations  Again both schema & instance related problems 4 - 12

6 Konstanz, 20.12.2005Jens Gerken Data Cleaning Process (1/3)  Data Auditing  Analysis of Data to find errors  Workflow Specification  Definition of appropriate transformations  Goal: Eliminate all/most errors and anomalies automatically  Workflow Execution  Post Processing/Controlling  Check results  Resolve remaining problems manually 5 - 12

7 Konstanz, 20.12.2005Jens Gerken Data Cleaning Process (2/3)  Data Auditing  Data profiling – create Metadata by analysing individual attributes on the instance level (e.g. data type, length, value range, discrete values, variance, uniqueness) 6 - 12  Data Auditing  Data profiling – create Metadata by analysing individual attributes on the instance level (e.g. data type, length, value range, discrete values, variance, uniqueness)  Data mining – discover data patterns by analysing the whole data collection (e.g. association rules like „total = quantity*unit price“)

8 Konstanz, 20.12.2005Jens Gerken Data Cleaning Process (3/3)  Workflow Specification  Sequence of operations on the data (schema related data transformations & cleaning steps)  Choose the appropriate operations to handle data errors/anomalies/etc.  Process should be as automatically as possible  Possibility to include user-written cleaning code  Cause of error important (e.g. keyboard layout can help to correct misspellings)  Early steps: correct single source instance problems  Later on: multi source problems, e.g. duplicates  Workflow has to be tested and verified 7 – 12

9 Konstanz, 20.12.2005Jens Gerken Conflict Resolution Steps (1/2)  Extracting values from free-form attributes  Reordering of values  Value extraction for attribute splitting  Validation and correction  Spell checking & dictionaries  Attribute dependencies (e.g. birthday/age)  Statistical methods (e.g. replace missing values with mean)  Standardization  Conversion of date and times entries  Schema restructuring (multi source problems)  Splitting, merging, folding, unfolding of attributes and tables  Duplicate elimination 8 - 12

10 Konstanz, 20.12.2005Jens Gerken Conflict Resolution Steps (2/2)  Duplicate Elimination  Last step in workflow  First step: identify records concerning the same real world entity (instance matching).  Second step: merge those records and remove redundancy  Instance Matching  Easy way: There is an identifier attribute (e.g. same primary key)  More difficult way: fuzzy matching which calculates degree of similarity between records  Problem: very expensive operation – however overhead reducing methods exist 9 - 12

11 Konstanz, 20.12.2005Jens Gerken Potter‘s Wheel (1/3)  Problem of conventional approaches  Time consuming (many iterations), long waiting periods  Users have to write complex transformation scripts  Separate Tools for auditing and transformation  Potter‘s Wheel approach:  Interactive system, instant feedback  Integration of both, data auditing and transformation  Intuitive User Interface – spreadsheet like application 10 - 12

12 Konstanz, 20.12.2005Jens Gerken Potter‘s Wheel (2/3)  Main Features:  Instead of complex transform specifications with regular expressions or custom programs  user specifies by example (e.g. splitting)  Data auditing extensible with user defined domains  Parse „Tayler, Jane, JFK to ORD on April 23, 2000 Coach“ as „[A-Za-z,]* to on “ instead of „[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*  Allows easier detection of e.g. logical errors like false airport codes  Problem: tradeoff between overfitting and underfitting structure  Potter‘s Wheel uses Minimun description length method to balance this tradeoff and choose appropriate structure  Data auditing in background on the fly (data streaming also possible)  Reorderer allows sorting on the fly  User only works on a view – real data isn‘t changed until user exports set of transforms e.g. as C program an runs it on the real data  Undo without problems: just delete unwanted transform from sequence and redo everything else 11 - 12

13 Konstanz, 20.12.2005Jens Gerken Potter‘s Wheel (3/3) + Conclusion  Problems:  Usability of User Interface  How does duplicate elimination work?  Kind of a black box system  General Open Problems of Data Cleaning:  (Automatic) correction of wrong values  Mask wrong values but keep them  Keep several possible values at the same time (2*age. 2*birthday)  Leeds to problems if other values depend on a certain alternative and this turns out to be wrong  Maintenance of cleaned data, especially if sources can‘t be cleaned  Data cleaning framework desireable 12 - 12


Download ppt "Konstanz, 20.12.2005Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital."

Similar presentations


Ads by Google