Data Quality David Loshin Knowledge Integrity Inc.

1 Data Quality David Loshin Knowledge Integrity Inc.
Knowledge Integrity Incorporated

2 Course Structure
Overview of Data Quality; Dimensions of Data Quality; Data Ownership and Data Roles; Cost Analysis of Poor Data Quality; Dimensions of Data Quality (data models, data values, presentation); Data Analysis Techniques; Data Analysis Tools

3 Course Structure (2)
Data Quality Improvement; Metadata and Enterprise Reference Data; Domains and Mappings; Data Quality Rules Definition; Data Quality Rule Discovery

4 Course Structure (3)
Data Profiling; Using Data Quality Rules; Data Transformation; Data Cleansing; Ongoing Validation

5 Course Structure (4)
Data Correction; Data Cleansing; Scalability Issues; Data Parsing; Standardization; Linkage; Duplicate Elimination; Approximate Searching; Scalability Issues

6 Assignments
4 assignments: “Handy Tools” for data analysis; Domain Analysis; Data Parsing; Data Linkage. Assignments to be programmed using Perl or Java.

7 Some Examples
Frequent Flyer Miles and Long-Distance Service; Corporate Credit Card; Direct Marketing Event; CD Club Scam

8 What is Data?
Working definitions. Data: arbitrary values (with their own representation). Information: data within a context. Knowledge: understanding of information within its context. Metadata: data about data.

9 Data Contexts
Static flat file data; Static databases; Dynamic data flows; Message passing

10 Who Owns Data?
Important question, because the answers indicate where responsibility for data quality lies. Data quality can be difficult to effect because of complicating notions. Data processing as an “information factory”. Actors in the information factory and their roles.

11 Actors and Their Roles
Supplier; Acquirer; Creator; Processor; Packager; Delivery Agent; Consumer; Middle Manager; Senior Manager; Decision-maker

12 Ownership Responsibilities
Definition of data; Authorization and security; User support; Data packaging and delivery; Maintenance; Data quality; Management of business rules; Management of metadata; Standards management; Supplier management

13 Ownership Paradigms
Creator; Consumer; Compiler; Enterprise; Funder; Decoder; Packager; Reader; Subject; Purchaser; Everyone

14 Complicating Notions
Ownership is affected by: the value of data; privacy; turf; fear; bureaucracy

15 The Data Ownership Policy
Order of enforcement; Identify stakeholders; Identify data sets; Allocation of ownership; Ownership roles and responsibilities; Dispute resolution

16 The Data Ownership Policy (2)
Maintain a metadata database for data ownership: Parties table; Data set table; Roles and responsibilities; Policies (e.g., dispute resolution, communication)
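
One way to picture the ownership metadata database listed on this slide is a small relational schema. The sketch below is only an illustration using Python's built-in sqlite3 module; every table name, column name, and sample row is a hypothetical choice and does not come from the slides.

```python
import sqlite3

def build_ownership_db():
    """Create an in-memory database with one table per item on the slide.
    All table and column names here are assumptions made for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE party (party_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE data_set (data_set_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE ownership_role (
            party_id INTEGER NOT NULL REFERENCES party(party_id),
            data_set_id INTEGER NOT NULL REFERENCES data_set(data_set_id),
            role TEXT NOT NULL,
            responsibility TEXT NOT NULL,
            PRIMARY KEY (party_id, data_set_id, role)
        );
        CREATE TABLE policy (
            policy_id INTEGER PRIMARY KEY,
            kind TEXT NOT NULL,
            description TEXT NOT NULL
        );
    """)
    return conn

def load_sample(conn):
    """Insert a few made-up rows so the query below has something to return."""
    conn.executemany("INSERT INTO party VALUES (?, ?)",
                     [(1, "Finance"), (2, "IT Operations")])
    conn.executemany("INSERT INTO data_set VALUES (?, ?)",
                     [(1, "customer_master")])
    conn.executemany("INSERT INTO ownership_role VALUES (?, ?, ?, ?)", [
        (1, 1, "Steward", "Data quality"),
        (2, 1, "Custodian", "Maintenance"),
    ])
    conn.execute("INSERT INTO policy VALUES (?, ?, ?)",
                 (1, "dispute resolution", "Escalate to the trustee"))
    conn.commit()

def owners_of(conn, data_set_name):
    """Return (party name, role) pairs recorded for the named data set."""
    return conn.execute(
        """SELECT p.name, r.role
           FROM ownership_role r
           JOIN party p ON p.party_id = r.party_id
           JOIN data_set d ON d.data_set_id = r.data_set_id
           WHERE d.name = ?
           ORDER BY p.name, r.role""",
        (data_set_name,),
    ).fetchall()
```

Calling build_ownership_db(), then load_sample(conn), then owners_of(conn, "customer_master") lists who holds which role for that data set, which is the kind of lookup a dispute-resolution policy would need.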

17 Ownership Roles
CIO; CKO; Trustee; Policy Manager; Registrar; Steward; Custodian; Data Administrator; Security Administrator; Information Flow; Information Processing; Application Development; Data Provider; Data Consumer

18 Map the Flow of Information
Data processing can be likened to an “information factory”. Data sets from multiple sources are used as “raw input”. Final products are created in the form of business processes, information products, strategic reports, etc.

19 Stages in the Information Map
Data Supply; Data Acquisition; Data Creation; Data Processing; Data Packaging; Decision Making; Decision Implementation; Data Delivery; Data Consumption

20 Directed Information Channels
Indicates the flow of information from one processing stage to another. Example: supplier data is delivered to an acquisition stage through an information channel. “Directed” indicates the direction in which data flows. This effectively maps all points at which a data fault or nonconformance may appear.
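
An information map with directed channels can be sketched as a directed graph: stages are nodes and channels are edges. The stage names below are taken from the "Stages in the Information Map" slide, but the particular channels connecting them are assumed purely for illustration, as are the function names.

```python
from collections import deque

# Hypothetical channels (source stage, destination stage); the wiring is an assumption.
CHANNELS = [
    ("Data Supply", "Data Acquisition"),
    ("Data Acquisition", "Data Processing"),
    ("Data Creation", "Data Processing"),
    ("Data Processing", "Data Packaging"),
    ("Data Packaging", "Data Delivery"),
    ("Data Delivery", "Data Consumption"),
]

def build_map(channels):
    """Build an adjacency list: stage -> list of stages it feeds directly."""
    graph = {}
    for src, dst in channels:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph

def downstream(graph, stage):
    """Breadth-first walk along channel direction: every stage a fault
    introduced at `stage` could propagate to."""
    seen = []
    queue = deque(graph.get(stage, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(graph[node])
    return seen

def inspection_points(channels):
    """Every stage and every channel is a point where a fault may appear."""
    graph = build_map(channels)
    return sorted(graph), list(channels)
```

For example, downstream(build_map(CHANNELS), "Data Packaging") walks the channels forward from the packaging stage, while inspection_points(CHANNELS) enumerates all stages and channels that a data quality check could be attached to.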

21 Example: Credit Approval

22 Example: Hotel Reservation Process

23 Example: Catalog Sales

24 What is Data Quality?
“Fitness for Use”. Different rules for different data sets. Includes: Data profiling; Domain and cross-attribute analysis; Discovery of business rules; Data cleansing; Standardization; Deduplication and merge-purge

25 Lather, Rinse, Repeat
Data quality is a process: (1) Assess the current state of the quality of data; (2) Determine the area that needs most improvement; (3) Determine success criteria; (4) Implement the improvement; (5) Measure against success threshold; (6) If successful: goto 2

26 Data Quality is Hard to Do
No one wants to admit mistakes; Denial of responsibility; Lack of understanding; “Dirty work”; Lack of recognition

27 Steps to Data Quality
Training; Data ownership policy; Economic model of data quality; Current state assessment and requirements analysis; Project selection and implementation

28 Simple Tools
Goal: to look for simple patterns that indicate a problem that needs to be addressed. Grouping and Linking; Frequency Analysis; Pattern Analysis

29 Grouping
Try to make similar items gravitate together. Joining data instances based on business rules. Simple methods: attribute selection; sorting; hashing
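
The three simple grouping methods named on this slide can be sketched in a few lines. This is a minimal illustration rather than the course's own tool: the record layout, the function names, and the normalization rule (trim whitespace, lowercase) standing in for a "business rule" are all assumptions.

```python
from itertools import groupby

def normalize_key(record, attributes):
    """Attribute selection: build a comparison key from the chosen fields.
    Trimming and lowercasing is an assumed, deliberately simple business rule."""
    return tuple(str(record[a]).strip().lower() for a in attributes)

def group_by_hashing(records, attributes):
    """Hashing: a dict sends records with the same key to the same bucket."""
    groups = {}
    for rec in records:
        groups.setdefault(normalize_key(rec, attributes), []).append(rec)
    return groups

def group_by_sorting(records, attributes):
    """Sorting: after ordering by key, equal keys sit next to each other."""
    def keyfunc(rec):
        return normalize_key(rec, attributes)
    ordered = sorted(records, key=keyfunc)
    return [(key, list(members)) for key, members in groupby(ordered, key=keyfunc)]

# Made-up sample records for illustration only.
SAMPLE = [
    {"name": "Ann Lee", "city": "Boston"},
    {"name": "ann lee ", "city": "boston"},
    {"name": "Bob Ray", "city": "Denver"},
]
```

Both group_by_hashing(SAMPLE, ["name", "city"]) and group_by_sorting(SAMPLE, ["name", "city"]) bring the two "Ann Lee" records together; hashing avoids the sort, while sorting also leaves near-identical keys adjacent for manual review.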

30 Frequency Analysis
Look for insights in numbers. Simple methods: counting; hashing
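
Counting with a hash table is straightforward to sketch. The example below uses Python's collections.Counter (a dict subclass, so it covers both "counting" and "hashing"); the function names, the sample column, and the idea of flagging values below a chosen share of the total are assumptions for illustration.

```python
from collections import Counter

def value_frequencies(values):
    """Count how often each distinct value occurs in a column."""
    return Counter(values)

def rare_values(values, max_share):
    """Return values whose share of all occurrences is at most `max_share`.
    The threshold is a hypothetical knob, not something the slides define."""
    counts = value_frequencies(values)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total <= max_share)

# Made-up sample column: a state code with one inconsistent spelling.
STATES = ["NY", "NY", "NY", "NJ", "NJ", "N.Y."]
```

Here value_frequencies(STATES) shows how the column is distributed, and rare_values(STATES, 0.2) picks out the infrequent spelling "N.Y." as a candidate for closer inspection. A rare value is only a hint; it may still be legitimate.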

31 Pattern Analysis
Looking to distinguish between what is expected and what is not expected. Attempt to find outliers and nonconformities.
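
One common way to separate expected from unexpected values is to reduce each value to a character-class "shape" and then look at which shapes are unusual. The slides do not specify a technique, so the sketch below is an assumed approach: digits become 9, letters become A, everything else is kept, and any value that does not match the most frequent shape is reported.

```python
from collections import Counter

def shape(value):
    """Map a string to its pattern: digit -> '9', letter -> 'A', other chars kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def nonconforming(values):
    """Return (dominant shape, values whose shape differs from it).
    Treating the most frequent shape as 'expected' is an assumption."""
    shapes = Counter(shape(v) for v in values)
    dominant = shapes.most_common(1)[0][0]
    return dominant, [v for v in values if shape(v) != dominant]

# Made-up phone numbers for illustration only.
PHONES = ["212-555-0100", "646-555-0199", "2125550123", "555-0100 x12"]
```

Running nonconforming(PHONES) treats "999-999-9999" as the expected pattern and returns the two values that deviate from it. In practice the expected patterns would come from documented domain rules rather than from frequency alone.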

32 Example

