DWH-Ahsan Abdullah 1 Data Warehousing Lecture-22 DQM: Quantifying Data Quality Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.

Presentation transcript:

DWH-Ahsan Abdullah 1 Data Warehousing Lecture-22 DQM: Quantifying Data Quality Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research National University of Computers & Emerging Sciences, Islamabad

2 Background
Companies want to measure the quality of their data, which requires usable metrics. They have to deal with both subjective perceptions and objective measurements.
- Subjective data quality assessments reflect the needs and experiences of stakeholders.
- Objective assessments can be task-independent or task-dependent.
- Task-independent metrics reflect states of the data without contextual knowledge of the application.
- Task-dependent metrics include the organization's business rules, regulations, etc.
We will discuss objective assessment and validation techniques (task-dependent and task-independent) and, if time permits, briefly cover subjective assessment too.

3 More on Characteristics of Data Quality
Data Quality Dimension: Definition
Believability: The extent to which data is regarded as true and credible.
Appropriate Amount of Data: The extent to which the volume of data is appropriate for the task at hand.
Timeliness: A measure of how current or up to date the data is.
Accessibility: The extent to which data is available, or easily and quickly retrievable.
Objectivity: The extent to which data is unbiased, unprejudiced, and impartial.
Interpretability: The extent to which data is in appropriate languages, symbols, and units, and the definitions are clear.
Uniqueness: The state of being the only one of its kind, or being without an equal or parallel.

4 Data Quality Assessment Techniques
- Ratios
- Min-Max

5 Data Quality Assessment Techniques
- Simple Ratios
  - Free-of-Error
  - Completeness
    - Schema
    - Column
    - Population
  - Consistency: ratio of violations to total number of consistency checks.
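The simple-ratio metrics above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function names and sample figures are hypothetical, and it assumes the common convention of reporting 1 minus the ratio of undesirable outcomes to total outcomes, so that a higher score means better quality. The slide itself states consistency as the raw ratio of violations to checks, which is simply the complement of the score computed here.

```python
def simple_ratio(undesirable: int, total: int) -> float:
    """Return 1 - undesirable/total (assumed convention: 1.0 means no problems)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= undesirable <= total:
        raise ValueError("undesirable must lie between 0 and total")
    return 1.0 - undesirable / total


def column_completeness(values) -> float:
    """Column completeness: share of values that are neither None nor empty."""
    values = list(values)
    missing = sum(1 for v in values if v is None or v == "")
    return simple_ratio(missing, len(values))


# Hypothetical figures, for illustration only.
free_of_error = simple_ratio(undesirable=5, total=200)   # 5 records in error
consistency = simple_ratio(undesirable=3, total=60)      # 3 violated checks
completeness = column_completeness(["Lahore", None, "Karachi", "", "Multan"])

print(free_of_error, consistency, completeness)
```

Keeping every ratio on the same 0 to 1 scale makes the scores comparable across dimensions.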

6 Data Quality Assessment Techniques
- Min-Max
  - Used for multiple values, based on aggregation of normalized individual values
  - Min is conservative, while max is liberal
- Believability
  - Comparison with a standard or experience
  - Min {0.8, 0.7, 0.6} = 0.6
  - Weighted average
- Appropriate Amount of Data
  - Min {Dp/Dn, Dn/Dp}
Key: Dp: Data units provided; Dn: Data units needed
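As a rough sketch of the min-based metrics above, the following computes believability as the minimum of several normalized ratings, the weighted-average alternative, and the appropriate-amount score Min {Dp/Dn, Dn/Dp}. The function names, the weights, and the data-unit counts are hypothetical; only the 0.8, 0.7, 0.6 ratings come from the slide.

```python
def believability_min(ratings):
    """Conservative aggregate: the lowest of the normalized ratings (each in [0, 1])."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0.0 <= r <= 1.0 for r in ratings):
        raise ValueError("ratings must be normalized to [0, 1]")
    return min(ratings)


def believability_weighted(ratings, weights):
    """Weighted average of ratings; weights are assumed to sum to 1."""
    ratings, weights = list(ratings), list(weights)
    if len(ratings) != len(weights):
        raise ValueError("ratings and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(r * w for r, w in zip(ratings, weights))


def appropriate_amount(provided: float, needed: float) -> float:
    """Min {Dp/Dn, Dn/Dp}: penalizes both too little and too much data."""
    if provided <= 0 or needed <= 0:
        raise ValueError("data unit counts must be positive")
    return min(provided / needed, needed / provided)


ratings = [0.8, 0.7, 0.6]
print(believability_min(ratings))                         # the slide's example
print(believability_weighted(ratings, [0.5, 0.3, 0.2]))   # hypothetical weights
print(appropriate_amount(provided=50, needed=100))        # too little data
print(appropriate_amount(provided=200, needed=100))       # too much data
```

Taking the minimum of the two ratios means the score is 1 only when exactly the needed amount is provided, and it falls symmetrically for shortage and surplus.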

7 Data Quality Assessment Techniques
- Min-Max
- Timeliness
  - Max {0, 1 - C/V}
  - C = A + Dt - It
- Accessibility
  - Max {0, 1 - Trd/Tru}
Key:
C: Currency
V: Volatility
A: Age
Dt: Delivery time
It: Input time (received in system)
Trd: Time from request by user to delivery
Tru: Time from request by user to the point the data is no longer useful
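The timeliness and accessibility formulas above translate directly into code. This is a sketch under the assumption that all quantities are expressed in the same time unit (days here); the function names and sample values are hypothetical.

```python
def currency(age: float, delivery_time: float, input_time: float) -> float:
    """C = A + Dt - It, as defined on the slide."""
    return age + delivery_time - input_time


def timeliness(currency_value: float, volatility: float) -> float:
    """Max {0, 1 - C/V}; volatility V must be positive."""
    if volatility <= 0:
        raise ValueError("volatility must be positive")
    return max(0.0, 1.0 - currency_value / volatility)


def accessibility(request_to_delivery: float, request_to_useful: float) -> float:
    """Max {0, 1 - Trd/Tru}; Tru must be positive."""
    if request_to_useful <= 0:
        raise ValueError("useful period must be positive")
    return max(0.0, 1.0 - request_to_delivery / request_to_useful)


# Hypothetical values, all in days.
c = currency(age=2, delivery_time=10, input_time=7)   # C = 5
print(timeliness(c, volatility=20))                   # 1 - 5/20
print(accessibility(3, 12))                           # 1 - 3/12
print(accessibility(15, 12))                          # delivered too late, clipped to 0
```

The max with 0 keeps both scores in the range 0 to 1 even when the data arrives after it has stopped being useful.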

8 Data Quality Validation Techniques
- Referential Integrity (RI)
- Attribute domain
- Using Data Quality Rules
- Data Histogramming

9 Referential Integrity Validation
Example: How many outstanding payments are there in the DWH without a corresponding customer_ID in the customer table?
RI is checked every week or month, and the number of orphan records should go down with time.
RI checking of this kind is peculiar to the DWH, not to operational systems.
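The orphan-record check in the example can be sketched as follows. The record layout (dictionaries with a customer_id key) and the sample rows are assumptions made for illustration; in a real DWH this would typically be an anti-join between the payment and customer tables.

```python
def find_orphans(payments, customers):
    """Return payments whose customer_id has no match in the customer table."""
    known_ids = {c["customer_id"] for c in customers}
    return [p for p in payments if p["customer_id"] not in known_ids]


# Hypothetical sample data.
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
]
payments = [
    {"payment_id": 100, "customer_id": 1, "amount": 500.0},
    {"payment_id": 101, "customer_id": 9, "amount": 1200.0},  # orphan
    {"payment_id": 102, "customer_id": 2, "amount": 75.0},
    {"payment_id": 103, "customer_id": 7, "amount": 800.0},   # orphan
]

orphans = find_orphans(payments, customers)
print(len(orphans), "orphan payment(s)")
```

Running the same check on a weekly or monthly schedule and recording the count gives the trend the slide asks for.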

10 Business Case for RI
From a business point of view, the number of outstanding payments is not very interesting. What is interesting is the actual amount outstanding, on a per-year basis, per-region basis…
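One way to express this business view is to total the amounts of the orphan payments by year and region rather than just counting them. The sketch below is illustrative only: the field names (year, region, amount), the set of known customer IDs, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def outstanding_by_year_region(payments, known_customer_ids):
    """Sum the amounts of orphan payments, grouped by (year, region)."""
    totals = defaultdict(float)
    for p in payments:
        if p["customer_id"] not in known_customer_ids:
            totals[(p["year"], p["region"])] += p["amount"]
    return dict(totals)


# Hypothetical sample data.
payments = [
    {"customer_id": 1, "year": 2004, "region": "North", "amount": 500.0},
    {"customer_id": 9, "year": 2004, "region": "North", "amount": 1200.0},
    {"customer_id": 9, "year": 2005, "region": "North", "amount": 300.0},
    {"customer_id": 7, "year": 2004, "region": "South", "amount": 800.0},
    {"customer_id": 8, "year": 2004, "region": "North", "amount": 50.0},
]
known = {1, 2, 3}

totals = outstanding_by_year_region(payments, known)
for (year, region), amount in sorted(totals.items()):
    print(year, region, amount)
```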

11 Performance Case for RI
Cost of enforcing RI is very high for large volume DWH implementations, therefore:
- Should RI constraints be turned OFF in a data warehouse? or
- Should those records be "discarded" that violate one or more RI constraints?

12 3 steps of Attribute Domain Validation
Step-1: Capture and quantify the occurrences of each domain value within each coded attribute of the database.
Step-2: Compare actual content of attributes against set of valid values.
Step-3: Investigate exceptions to determine cause and impact of the data quality defects.
Note: Step 3 (above) applies to all defect types.
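Steps 1 and 2 above lend themselves to a short sketch: count how often each value occurs in a coded attribute, then compare the observed values against the valid set. The attribute (a gender code), the valid set, and the function names are hypothetical. Step 3, investigating the exceptions, remains a manual activity.

```python
from collections import Counter


def domain_profile(values):
    """Step 1: count the occurrences of each value in a coded attribute."""
    return Counter(values)


def domain_exceptions(values, valid_values):
    """Step 2: return the values (with counts) that fall outside the valid set."""
    profile = domain_profile(values)
    return {v: n for v, n in profile.items() if v not in valid_values}


# Hypothetical coded attribute and valid domain.
gender_codes = ["M", "F", "M", "X", "f", "F", None]
valid = {"M", "F"}

profile = domain_profile(gender_codes)
exceptions = domain_exceptions(gender_codes, valid)
print(profile)
print(exceptions)   # candidates for Step 3 investigation
```

The counts matter as much as the values: a single stray code suggests a typing slip, while thousands of the same invalid code point to a systematic source problem.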

13 Attribute Domain Validation: What next?
What to do next?
- Trace back to source cause(s).
- Quantify business impact of the defects.
- Assess cost (and time frame) to fix and proceed accordingly.

14 Data Quality Rules

15 Statistical Validation using Histogram
[Figure: histogram with an axis label of 1901, a "Spike of Centenarians (age >= 100 yrs)" annotation, and marked outliers.]
NOTE: For a certain environment, the above distribution may be perfectly normal.
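A histogram check of this kind can be approximated by counting occurrences per value and flagging values whose count is far above the typical count. The sketch below is one possible heuristic, not the lecture's method: the threshold factor, the use of the median as the "typical" count, and the sample birth years are all assumptions.

```python
from collections import Counter
from statistics import median


def find_spikes(values, factor: float = 5.0):
    """Flag values whose frequency exceeds factor times the median frequency."""
    hist = Counter(values)
    typical = median(hist.values())
    return {v: n for v, n in hist.items() if n > factor * typical}


# Hypothetical birth years: a flat distribution plus a spike at 1901.
birth_years = [y for y in range(1960, 1970) for _ in range(3)] + [1901] * 40

spikes = find_spikes(birth_years)
print(spikes)
```

A flagged value is only a candidate for investigation; as the note above says, in some environments such a distribution may be perfectly normal.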