1 These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me Advanced Database Systems F24DS2 / F29AT2 Data Quality and Data Cleaning 2

2 Acknowledgements
I adapted this material from various sources, most notably:
- A ppt presentation called 'Data Quality and Data Cleaning: An Overview' by Tamraparni Dasu and Theodore Johnson, at AT&T Labs
- A paper called 'Data Cleaning: Problems and Current Approaches', by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany
My thanks to these researchers for making their materials freely available online.

3 What Keeps DBs Dirty
A good DBMS will have built-in tools for:
- Consistency in data types
- Consistency in field values
- Constraints and checks that deal with Null values, outliers, duplication
- Automatic timestamps
- Powerful query language (makes retrieval logic errors less likely)
… so, why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls?

4 … all this:
- Consistency constraints are often not applied, or are applied!
  - Suppose height is not allowed to go over 2 metres in a school student DB: a taller student's true height then cannot be recorded
  - My postcode problem
- The data are just too numerous, complex and ill-understood. 'Cleaning it' would cost too much!
- Undetectable problems: incorrect values, missing entries
- Metadata not maintained properly

5 Single Source vs Multiple Source; Schema Level vs Instance Level
One useful way to categorize problems, independent of how we did so in the last lecture, is according to whether they are the sort we can get with just one source of data, or whether they arise directly from trying to combine data from multiple sources.
Problems can also be schema level or instance level.

6 Single Source / Schema level examples
Scope | Problem | Unclean | Notes
attribute | Illegal values | DoB= | Values out of range
record | Violated attribute dependencies | Car-owner = No, make = Toyota | Make should clearly have a Null value here.
Record type | Uniqueness violations | Name = Jo Smith, NUS no. = 3067; Name = Ed Brown, NUS no. = 2124 | NUS no.s should be unique
Source | Referential integrity violation | Name = D Corne, Office = EM G.92 | Where is G.92?
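Two of the checks in this table can be sketched in a few lines of Python. This is only an illustration, not something from the slides: the records, the field names (`nus_no`, `car_owner`, `make`) and both helper functions are hypothetical, and a real DBMS would normally enforce these rules with a uniqueness constraint and a check constraint.

```python
from collections import Counter

# Hypothetical student records; field names and values are illustrative only.
records = [
    {"name": "Jo Smith", "nus_no": "3067", "car_owner": False, "make": None},
    {"name": "Ed Brown", "nus_no": "3067", "car_owner": False, "make": "Toyota"},
    {"name": "Al Green", "nus_no": "2124", "car_owner": True, "make": "Ford"},
]

def duplicate_keys(rows, key):
    """Return the key values that occur in more than one record."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def dependency_violations(rows):
    """Return names of records where a non-owner still has a car make."""
    return [row["name"] for row in rows
            if not row["car_owner"] and row["make"] is not None]

print(duplicate_keys(records, "nus_no"))   # uniqueness violations
print(dependency_violations(records))      # violated attribute dependencies
```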

7 Single Source / Instance level examples
Scope | Problem | Unclean | Notes
attribute | Missing values, mis-spellings, abbreviations, misfields, embedded values | Top speed = 0 mph; Title = Dark Side of the Moan; FullName = J. Smith; Colour = 160mph; Phone = "Dan Jones " | Dummy entries, values unavailable at entry time, human error
record | Violated attribute dependencies | City = Edinburgh, Postcode = EX6 |
Record type | Word transposition, duplicates, contradictions | Name = Jo Smith, Name = Carr, Jim; Name = J. Smith, Name = Joe Smith; Name = Jo Smith, DoB = 17/12/62 and Name = Jo Smith, DoB = 17/11/62 |
Source | Wrong references | Name = D Corne, Office = EM G.46 | EM G.46 exists, but is not my office.

8 Multiple Source Problems / Instance and Schema level examples
The Smiths buy books and music online from company A:
Customer ID | Name | Street | City | Sex
102 | Luke Smith | 5 Chewie Rd | Dawlish, Devon |
 | Leia Smith | Chewie St, 5 | Dawlish | 1
They also buy books and music online from company B:
Client ID | LastName | Other names | Phone | Gender
23 | Smith | Luke Michael | | Male
35 | Smith | Leia S. | +44(0) | F

9 When Companies A and B merge, various problems arise when they merge their DBs
- Combining customer fields and client fields: are they really the same things?
- How to ensure that Company A's customer 37 and Company B's client 37 get separate entries in the new DB.
- Are Luke Smith and Luke Michael Smith the same person?
- Do Luke and Leia live at the same address?
- Etc …
A forced 'fast resolution' to these problems will usually lead to errors in the new 'integrated' DB.
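The ID-collision point can be made concrete with a small sketch. One common approach, assumed here rather than taken from the slides, is to qualify each local ID with its source before merging; the function name `merge_customers`, the `A-`/`B-` prefixes and the sample records are all hypothetical.

```python
def merge_customers(company_a, company_b):
    """Merge two ID -> record mappings, prefixing each ID with its source."""
    merged = {}
    for source, table in (("A", company_a), ("B", company_b)):
        for local_id, record in table.items():
            # Copy the record and remember where it came from.
            merged[f"{source}-{local_id}"] = dict(record, source=source)
    return merged

# Both companies happen to use the ID 37 for different people.
company_a = {37: {"name": "Customer from A"}}
company_b = {37: {"name": "Client from B"}}

merged = merge_customers(company_a, company_b)
print(sorted(merged))
```

This only keeps the two entries apart. Deciding whether two entries describe the same person is a separate matching problem and still needs careful rules.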

10 A Special but Common type of Problem: Semantic Complexity
Semantic Complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent.
E.g. a local police DB keeps a record of all crimes in an area, where the key is the victim's name. When someone who was a victim moves to a different area, the police remove all records relating to that person. The local council uses this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low?

11 Semantic Complexity: Missing/Default Values
One source of semantic complexity is the different meanings that missing values can have. E.g. suppose the histogram of value types in the mobile phone no. field is:

12 What does NULL mean?
A. This record is of someone who does not have a mobile phone?
B. This record is of someone who has a mobile phone, but chose not to supply the number?
C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and recorded as NULL?
Maybe some are of type A and some are of type B and some are of type C. For some applications/analyses, we may wish to know the breakdown into types.
What about the all-zero and all-nine entries? Precisely the same can be said of them. Or, perhaps the protocols for recording the entries indicated NULL for type A, and used the all-zero and all-nine codes for types B and C.
The above relate to a quite simple form of semantic complexity, but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone?

13 Data Cleaning: Phases
Phases in DC:
- Analysis: detecting errors and inconsistencies in the DB needs detailed analysis, involving both manual inspection and automated analysis programs. This reveals where (most of) the problems are.
- Defining transformation and mapping rules: having found the problems, this next phase is concerned with defining how you are going to automate solutions to clean the data.

14 Data Cleaning: phases continued
- Verification: in this phase we test and evaluate the transformation plans we made in stage 2; without this, we may end up making the data dirtier rather than cleaner.
- Transformation: do the transformation, now that you're sure it will be done correctly.
- Backflow of cleaned data: do what we can to ensure that cleaned data percolates to various repositories that may still harbour errors.

15 Phases in DC: Data Analysis
Data Profiling: examine the instances to see how the attributes vary. E.g. automatically generate a histogram of values for that attribute.
How does the histogram help us in finding problems in this case?
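As a rough illustration of data profiling, the sketch below counts the values of one attribute and prints a text histogram. The `profile` function and the sample car records are hypothetical; the point is only that rare or odd values (a mis-spelling, a misfielded speed, a missing entry) stand out next to the common ones.

```python
from collections import Counter

def profile(rows, attribute):
    """Count how often each value of `attribute` occurs, most common first."""
    return Counter(row.get(attribute) for row in rows).most_common()

# Hypothetical records with a mis-spelling, a missing value and a misfield.
cars = [
    {"colour": "blue"}, {"colour": "blue"}, {"colour": "red"},
    {"colour": "bule"}, {"colour": None}, {"colour": "160mph"},
]

for value, count in profile(cars, "colour"):
    # One '#' per occurrence gives a crude text histogram.
    print(f"{value!r:10} {'#' * count}")
```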

16 What problems does this analysis alert us to?

17 Phases in DC: Data Mining
Data Mining is simply about more advanced forms of data analysis. We talk about that next week.

18 Phases in DC: Defining Data Transformation Rules
As a result of the analysis phase, you will find various problems that translate to a list of actions, such as:
- Remove all entries for J. Smith (duplicates of John Smith)
- Find entries with 'bule' in the colour field and change these to 'blue'.
- Output a list of all records where the Phone number field does not match the pattern (NNNNN NNNNNN) (further steps are then required to cleanse these data)
- Find all entries where the Name field contains a potential DoB string, and the DoB field is NULL, and then repair these entries.
- Etc …
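Two of the rules above might look like this in Python. This is a minimal sketch under stated assumptions: the pattern (NNNNN NNNNNN) is read as five digits, a space, then six digits; the record layout and the helper names `fix_colour` and `bad_phone_records` are hypothetical.

```python
import re

# Assumed reading of (NNNNN NNNNNN): five digits, one space, six digits.
PHONE_PATTERN = re.compile(r"\d{5} \d{6}")

def fix_colour(record):
    """Return a copy with the mis-spelling 'bule' corrected to 'blue'."""
    fixed = dict(record)
    if fixed.get("colour") == "bule":
        fixed["colour"] = "blue"
    return fixed

def bad_phone_records(rows):
    """Return records whose phone field does not fully match the pattern."""
    return [r for r in rows
            if not PHONE_PATTERN.fullmatch(r.get("phone") or "")]

# Hypothetical records.
rows = [
    {"name": "A", "colour": "bule", "phone": "01234 567890"},
    {"name": "B", "colour": "red", "phone": "567890"},
    {"name": "C", "colour": "blue", "phone": None},
]

cleaned = [fix_colour(r) for r in rows]   # originals are left untouched
flagged = bad_phone_records(cleaned)      # listed for later manual cleansing
print([r["name"] for r in flagged])
```

Working on copies rather than changing records in place keeps the original data available for the verification phase.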

19 Phases in DC: Verification
This speaks for itself! Data transformation is the main step that actually changes the data itself, so you need to be sure you will do it correctly.
So, test and examine the transformation plans very carefully. It is easy to mess the data up even more if you have a faulty transformation plan.
- I have a very thick C++ book where it says strict in all the places where it should say struct
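One simple way to verify a rule, sketched here as an assumption rather than as the method from the lecture, is to run it over a small set of cases with known correct outputs before touching the real data. The address rule and the names `naive_expand`, `careful_expand` and `verify` are hypothetical.

```python
import re

def naive_expand(address):
    """Faulty plan: replace every occurrence of 'St', even inside words."""
    return address.replace("St", "Street")

def careful_expand(address):
    """Safer plan: replace 'St' only when it is a whole word."""
    return re.sub(r"\bSt\b", "Street", address)

def verify(transform, cases):
    """Return (input, expected, actual) for every case the rule gets wrong."""
    return [(src, want, transform(src))
            for src, want in cases if transform(src) != want]

# Hand-checked test cases: input and the output we expect.
cases = [
    ("5 Chewie St", "5 Chewie Street"),
    ("12 Stoke Rd", "12 Stoke Rd"),
]

print(verify(naive_expand, cases))    # the naive rule mangles 'Stoke'
print(verify(careful_expand, cases))  # no failures on these cases
```

Passing a handful of cases does not prove a rule is safe, but a failure here is far cheaper than one discovered after the whole DB has been transformed.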

20 Phases in DC: Transformation
Go ahead and do it. For large DBs, this task is supported by a variety of tools (as is data analysis, often in the same tool). The list is growing. E.g. DATACLEANSER is a specialist tool for identifying and eliminating duplicates. TRILLIUM focuses on cleaning name/address data. Such tools use a huge built-in library of rules for dealing with the common problems.
Alternatively or additionally you can write your own code for specialised bits of cleaning (and then verify it!).

21 Phases in DC: Backflow
Once the 'master' source of data (perhaps a newly integrated DB) is 'cleaned', there is the opportunity to fix errors that may have spread beyond the DB before it was cleaned.
This will be a very different and varied process in every case, and the results of the first Analysis stage should start to provide clues about what could be done here.
Examples of such backflow can vary between:
- Refunding 1 customer 12p because he was mischarged for postage owing to a faulty postcode entry
- Removing £1,000,000,000's worth of a brand of olive oil from supermarket shelves across Europe, since a DB (and hence the label) did not correctly indicate that it contains something dangerous to those with nut allergies.

22 What this lecture was about
- Why DBs are almost always not 'clean'
- A single source/multi-source and instance level/schema level classification of errors
- Semantic Complexity
- Five Phases in a corporate Data Cleaning process

