1 These slides are at: David Corne, room EM G.39, x 3410, / any questions, feel free to contact me
Advanced Database Systems F24DS2 / F29AT2
Data Quality and Data Cleaning

2 Acknowledgements
I adapted this material from various sources, most notably:
– a PowerPoint presentation called 'Data Quality and Data Cleaning: An Overview' by Tamraparni Dasu and Theodore Johnson, at AT&T Labs;
– a paper called 'Data Cleaning: Problems and Current Approaches', by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany.
My thanks to these researchers for making their materials freely available online.

3 On Data Quality
Suppose you have a database sitting in front of you, and I ask: 'Is it a good quality database?' What is your answer? What does quality depend on?
Note: this is about the data themselves, not the system used to access them.

4 A Conventional Definition of Data Quality
Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent; meaning …

5 A Conventional Definition of Data Quality, continued …
Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table?

  Surname   First name   Age   Subject   Degree   Gender
  Barratt   John         22    Maths     BSc      Male
  Burns     Robert       24    CS        BSc      Male
  Carter    Laura        20    Physics   MSc      Female
  Davies    Michael      12    CS        BSc      Male
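A check like the one the slide invites can be automated as a simple plausibility rule. The sketch below is illustrative, not part of the original slides: the records mirror the table above, and the valid age range (16–100) is an assumed business rule.

```python
# Illustrative sketch: flag records whose age is implausible for a student.
# The age bounds are an assumed business rule, not from the slides.
records = [
    ("Barratt", "John", 22, "Maths", "BSc", "Male"),
    ("Burns", "Robert", 24, "CS", "BSc", "Male"),
    ("Carter", "Laura", 20, "Physics", "MSc", "Female"),
    ("Davies", "Michael", 12, "CS", "BSc", "Male"),
]

def implausible_ages(rows, lo=16, hi=100):
    """Return rows whose age field falls outside the plausible range."""
    return [r for r in rows if not (lo <= r[2] <= hi)]

print(implausible_ages(records))  # flags the Davies record (age 12)
```

Rules like this catch only *implausible* values; an age of 22 recorded instead of 23 is inaccurate but passes every range check, which is why accuracy is hard to measure after the fact.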

6 A Conventional Definition of Data Quality, continued …
Complete: This refers to whether or not the database really contains everything it is supposed to contain. E.g. a patient's medical records should contain references to all medication prescribed to date for that patient. The BBC Licensing DB should contain an entry for every address in the country. Does it?

7 A Conventional Definition of Data Quality, continued …
Unique: Every separate datum appears only once. How many 'Data Quality errors' can you find in the following table, and what types are they?

  Surname   Firstname   DoB        Driving test passed
  Smith     J.          17/12/85   17/12/05
  Smith     Jack        17/12/85   17/12/2005
  Smith     Jock        17/12/95   17/12/2005
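Spotting likely duplicates such as the first two rows above can be partly automated with string similarity. The sketch below is an illustration, not the slides' method: it averages a standard similarity ratio across fields, and the 0.75 threshold is an arbitrary assumption.

```python
# Illustrative sketch of near-duplicate detection via string similarity.
# The threshold is an arbitrary assumption for demonstration.
from difflib import SequenceMatcher

people = [
    ("Smith", "J.",   "17/12/85", "17/12/05"),
    ("Smith", "Jack", "17/12/85", "17/12/2005"),
    ("Smith", "Jock", "17/12/95", "17/12/2005"),
]

def similarity(a, b):
    """Mean field-by-field similarity ratio between two records."""
    return sum(SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)) / len(a)

def likely_duplicates(rows, threshold=0.75):
    """Return index pairs of records whose similarity exceeds the threshold."""
    return [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))
            if similarity(rows[i], rows[j]) > threshold]

print(likely_duplicates(people))
```

Note that on this table every pair scores highly, yet Jock (born in '95) is plausibly a different person: string similarity alone cannot decide which candidates are true duplicates, which is exactly the measurability problem discussed later.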

8 A Conventional Definition of Data Quality, continued …
Up-to-date: The data are kept up to date. The post office has just changed my postcode from EX6 8RA to EX6 8NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance? Can you think of a DB where it doesn't matter whether or not the data are kept up to date?

9 A Conventional Definition of Data Quality, continued …
Consistent: The data contain no logical errors or impossibilities; they make sense in and of themselves. Why is the following mini DB inconsistent?

  Date       Sales     Returns   Net income
  23rd Nov   £25,609   £1,003    £24,606
  24th Nov   £26,202   £1,601    £24,601
  25th Nov   £28,936   £1,178    £25,758
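A consistency rule like the implicit one here (sales minus returns should equal net income) can be checked mechanically. The sketch below is illustrative, using the slide's figures; the net-income values for the first two rows are reconstructed by that same arithmetic, as they were garbled in the source.

```python
# Illustrative sketch: check the implicit rule sales - returns == net income.
# The first two net-income figures are reconstructed by arithmetic (assumption).
rows = [
    ("23rd Nov", 25609, 1003, 24606),
    ("24th Nov", 26202, 1601, 24601),
    ("25th Nov", 28936, 1178, 25758),
]

inconsistent = [date for date, sales, returns, net in rows
                if sales - returns != net]
print(inconsistent)  # the 25th Nov row: 28936 - 1178 = 27758, not 25758
```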

10 The Several Problems with the Conventional (or any?) Definition
Fine, we can define Data Quality (DQ) in a way that makes it clear what kinds of issues we are thinking about. But is the definition any use beyond that? Can it be used:
– to measure DQ, so that we can say whether or not one DB is better than another, or so we can gauge improvements in DQ over time?
– on a wide range of different DBs, of widely different purposes, sizes, etc.?

11 No: Four problems
There are at least four specific problems with this definition if we are going to use it for any purpose other than just discussing DQ issues: measurability, rigidity, context independence, and unhelpfulness.

12 DQ Def problems: measurability
It is not clear how we might measure quality according to some of these items.
– Completeness: how will we know?
– Uniqueness: it is hard to tell whether two entries are similar, or duplicates!
– Up-to-date-ness: how do we know?
– Consistency: consistency errors can be very hard to find, especially in a very large DB.
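Some of these dimensions do admit crude, measurable proxies, even if the proxies fall well short of the real thing. The sketch below is illustrative, with invented sample data: completeness as the fraction of non-missing fields, and uniqueness as the fraction of distinct records (which, as noted above, misses near-duplicates entirely).

```python
# Illustrative sketch: crude, measurable proxies for two DQ dimensions.
# The sample table is invented for demonstration.
table = [
    ("Smith", "Jack", "17/12/85"),
    ("Smith", "Jack", "17/12/85"),   # exact duplicate
    ("Jones", None,   "01/03/90"),   # missing first name
]

def completeness(rows):
    """Fraction of fields that are not missing."""
    fields = [f for row in rows for f in row]
    return sum(f is not None for f in fields) / len(fields)

def uniqueness(rows):
    """Fraction of records that are distinct (misses near-duplicates)."""
    return len(set(rows)) / len(rows)

print(completeness(table))  # 8/9
print(uniqueness(table))    # 2/3
```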

13 DQ Def problems: rigidity
The conventional definition is rigid and static. It is easy to think of two DBs that are both high quality according to this definition, but are nevertheless of entirely different quality. E.g. one may contain much more potentially relevant information; one may have better metadata; etc.

14 DQ Def problems: context independence
The conventional definition does not take into account the context of the DB. Different things may be important for quality in different circumstances. E.g. a DB for public access must be highly accessible and interpretable; in a DB of experimental results, accuracy is more important than consistency (why?); etc.

15 DQ Def problems: unhelpfulness/vagueness
Partly because we need different DQ measures for different DBs, depending on context, the conventional definition alone is vague and unhelpful when it comes to providing clues for how to improve data quality. How do we improve accuracy? It depends on how the data are recorded. How do we improve consistency? It depends immensely on the kind of data we have, etc.

16 Towards modern definitions of DQ
An ideal definition of data quality:
– is backed up by a good understanding of how and why errors occur;
– relates to how the data are used;
– is helpful in leading to improvements in processes;
– is helpful in leading to usable measures/metrics.

17 The Data Quality Continuum
It's rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, during which errors can arise at every stage. The continuum is:
– Data gathering
– Data delivery
– Data storage
– Data integration
– Data retrieval
– Data analysis
So, if we want to monitor DQ, we need to monitor it at each of these stages.

18 DQ Continuum: Example
This is an example I am familiar with, which helps to illustrate the DQ continuum. The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It's a charity funded by various governments. Its role is to be the repository for recording all earthquake events on the planet.

19 DQ Continuum: ISC example: gathering
[Figure: ISC data-gathering centres]

20 DQ Continuum: ISC example: delivery
Raw seismograph data travel from local collection points to data-gathering centres, and on (e.g. by ftp) to the ISC; some centres provide raw data, some provide interpreted data (e.g. a centre might not send some data if it believes them to be in error in the first place).

21 DQ Continuum: ISC example: integration
The ISC's role is actually to figure out where and when the Earth tremors were (there are hundreds per month), based on reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand and brain, and record their findings on archival CDs.

22 DQ Continuum: ISC example: retrieval/analysis
You can get a CD from the ISC at any time, for the earth tremor activity on any particular day. I'm not sure whether you can get the raw data from them. Naturally, you can analyse the data and see if you can find inconsistencies or errors.

23 The ISC DQ Continuum
Where might errors occur, of:
– Accuracy?
– Completeness?
– Uniqueness?
– Timeliness?
– Consistency?
What else is important in this case?

24 Where DQ problems occur (gathering)
– Manual data entry (how can we improve this?)
– Lack of uniform standards for format and content
– Duplicates arising from parallel entry
– Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints
– Measurement errors
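One common answer to "how can we improve manual data entry?" is to validate against a uniform standard at the point of entry. The sketch below is illustrative: it assumes one agreed date format (dd/mm/yyyy) and rejects anything else, including ambiguous two-digit years and impossible calendar dates.

```python
# Illustrative sketch: validate a date of birth at entry time against one
# agreed format (dd/mm/yyyy is an assumed local standard).
import re
from datetime import datetime

def valid_dob(s):
    """Accept only dd/mm/yyyy strings that exist on the calendar."""
    if not re.fullmatch(r"\d{2}/\d{2}/\d{4}", s):
        return False
    try:
        datetime.strptime(s, "%d/%m/%Y")
        return True
    except ValueError:
        return False

print(valid_dob("17/12/1985"))  # True
print(valid_dob("17/12/85"))    # False: two-digit year is ambiguous
print(valid_dob("31/02/1985"))  # False: no such calendar date
```

Validation at gathering time is much cheaper than cleaning later, but it cannot catch values that are well-formed yet wrong, so it complements rather than replaces downstream checks.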

25 Where DQ problems occur (delivery)
– Multiple hops from source to DB: problems can happen anywhere
– Inappropriate pre-processing (e.g. removing some 'small' seismograph readings before sending them on to the ISC; rounding up or down when the destination needs more accurate data)
– Transmission problems: buffer overflows, checks (did all files arrive, and all correctly?)
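The "did all files arrive, and all correctly?" check is typically done by shipping a checksum alongside the data and recomputing it on receipt. The sketch below is illustrative; the payload format is invented, and SHA-256 is just one reasonable choice of digest.

```python
# Illustrative sketch: detect corruption or truncation in transit by
# comparing checksums. The payload format is invented for demonstration.
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest of the payload, sent alongside the data."""
    return hashlib.sha256(data).hexdigest()

sent = b"station=EDI;reading=4.7;time=2005-12-17T03:21:00"
digest = checksum(sent)              # travels with the data

received = sent[:-5]                 # simulate truncation in transit
print(checksum(received) == digest)  # False: delivery problem detected
```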

26 Where DQ problems occur (storage)
– Poor, out-of-date or inappropriate metadata
– Missing timestamps
– Conversion to storage format (e.g. to Excel files, or to higher/lower precision)

27 Where DQ problems occur (integration)
This is the business of combining datasets, e.g. from different parts of a company; from (previously) different companies following an acquisition; from different government agencies; etc.
– Different keys, different fields, different formats
– Different definitions ('customer', 'income', …)
– Sociological factors: reluctance to share!
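The "different keys, different formats" problem usually means keys must be normalised before two sources can be merged at all. The sketch below is illustrative: the key formats, company tables, and normalisation rule are all invented for demonstration.

```python
# Illustrative sketch: normalise differing key formats before merging two
# sources. Key formats and data are invented assumptions.
def normalise(key: str) -> str:
    """Strip punctuation, whitespace, and case differences from a key."""
    return "".join(ch for ch in key if ch.isalnum()).upper()

company_a = {"cust-0042": "J. Smith"}
company_b = {"CUST0042 ": "Jack Smith"}

merged = {}
for source in (company_a, company_b):
    for key, name in source.items():
        merged.setdefault(normalise(key), []).append(name)

print(merged)  # {'CUST0042': ['J. Smith', 'Jack Smith']}
```

Note that normalisation only solves the *syntactic* part of integration; whether both sources mean the same thing by 'customer' is the harder, semantic part.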

28 Where DQ problems occur (retrieval/analysis)
The problem here is usually the quality of the DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because:
– the source DB is not properly understood!
– there are straightforward mistakes in the queries that retrieve the relevant data.
E.g. a database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here?
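The trap in the gene example is that "no recorded link" can mean either "studied and found unrelated" or "never studied": a query that treats a missing value as False conflates the two. The sketch below illustrates this with an invented schema, using None for "unknown".

```python
# Illustrative sketch of the retrieval pitfall: treating an unknown
# (None/NULL) cancer-link flag as False. The schema is invented.
genes = {
    "GENE_A": True,    # known/suspected cancer link
    "GENE_B": False,   # studied, no link found
    "GENE_C": None,    # never studied: link unknown
}

# Naive query for "genes not relevant to cancer" wrongly includes GENE_C.
naive = [g for g, link in genes.items() if not link]

# Careful query: only genes positively known to have no link.
careful = [g for g, link in genes.items() if link is False]

print(naive)    # ['GENE_B', 'GENE_C']
print(careful)  # ['GENE_B']
```

The same conflation happens in SQL whenever `WHERE flag = FALSE` is written without considering NULLs; understanding the source DB's conventions is exactly the point of this slide.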

29 What this lecture was about
– A conventional definition of DQ
– Problems with that definition
– The DQ Continuum
– Where errors might occur along that continuum

