Data Cleansing Rule Based Strategy.


Without a data cleansing strategy, the DW will suffer from:
- lack of quality
- loss of trust
- a diminishing user base
- loss of business sponsorship and funding

What is data quality? Data is of high quality when:
- it is well understood
- it is accurate
- it is integrated
- it satisfies the needs of the business
- the user is satisfied with the quality of the data and the information derived from that data
- it is complete
- there are no duplicate records
- there are no data anomalies
- it is stored according to data type
- it has integrity
- it is consistent
- the database is well designed
- it is non-redundant
- it follows business rules
- it corresponds to established domains
- it is timely

Rule-based strategy
Begins with the understanding that there are really only two options for data cleansing:
- clean the source data
- clean the warehouse data

Both options follow the same sequence:
1. Define a robust and comprehensive set of data cleansing rules. Data models are the key.
2. Package the rules as executable components: objects or modules, depending on the environment in which they will execute.
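
One way to read "data models are the key" is that integrity rules can be generated from the model itself. The sketch below assumes a very small, hypothetical model description (the `Column` class, `CUSTOMER_MODEL`, and the rule names are illustrative and do not come from the slides) and turns each column definition into executable integrity tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass(frozen=True)
class Column:
    """One attribute of a hypothetical data model."""
    name: str
    py_type: type
    nullable: bool = True
    domain: Optional[frozenset] = None  # allowed values, if the model defines any

# An integrity rule is a test that returns True when a row conforms.
IntegrityRule = Callable[[dict], bool]

def rules_from_model(columns: List[Column]) -> Dict[str, IntegrityRule]:
    """Derive not-null, type, and domain tests from column definitions."""
    rules: Dict[str, IntegrityRule] = {}
    for col in columns:
        if not col.nullable:
            rules[f"{col.name}_not_null"] = (
                lambda row, c=col: row.get(c.name) is not None
            )
        # Type and domain tests ignore missing values; not-null covers those.
        rules[f"{col.name}_type"] = (
            lambda row, c=col: row.get(c.name) is None
            or isinstance(row[c.name], c.py_type)
        )
        if col.domain is not None:
            rules[f"{col.name}_domain"] = (
                lambda row, c=col: row.get(c.name) is None
                or row[c.name] in c.domain
            )
    return rules

def violations(row: dict, rules: Dict[str, IntegrityRule]) -> List[str]:
    """Return the names of the rules that the row violates."""
    return [name for name, test in rules.items() if not test(row)]

# Illustrative model: an assumed customer table with a small country domain.
CUSTOMER_MODEL = [
    Column("customer_id", int, nullable=False),
    Column("country", str, domain=frozenset({"US", "CA", "MX"})),
]
```

Calling `violations(row, rules_from_model(CUSTOMER_MODEL))` lists the rules a row breaks. Implicit business rules that the model does not record would still have to be written by hand.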

Data cleansing rules: 2 parts
- Integrity rule: tests the integrity of the data.
- Cleansing action: specifies the action to be taken when integrity violations are encountered.
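
The two-part structure can be sketched as a small executable component. This is a minimal illustration, not the presenters' implementation: the `CleansingRule` class, its field names, and the quantity example are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = dict

@dataclass(frozen=True)
class CleansingRule:
    """A cleansing rule: an integrity test plus an action for violations."""
    name: str
    integrity_test: Callable[[Row], bool]    # True when the row conforms
    action: Callable[[Row], Optional[Row]]   # invoked only on a violation

    def apply(self, row: Row) -> Optional[Row]:
        """Return the row unchanged if it passes, else the action's result."""
        if self.integrity_test(row):
            return row
        return self.action(row)

def default_quantity(row: Row) -> Row:
    """Hypothetical corrective action: reset an invalid quantity to zero."""
    return {**row, "quantity": 0}

# Illustrative rule: quantities must not be negative.
non_negative_quantity = CleansingRule(
    name="quantity_non_negative",
    integrity_test=lambda row: row["quantity"] >= 0,
    action=default_quantity,
)
```

Keeping the test and the action as separate fields means the same integrity test can be paired with a different action (report, discard, repair) without rewriting the test.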

Uses of data cleansing rules
- Audit data: report on the occurrence of integrity violations without taking any steps to restore integrity to the data.
- Filter data: detect data integrity violations and remove the offending data from the set used to populate the given target.
- Correct data: detect data integrity violations and repair the data to restore its integrity.
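
The difference between the three uses can be shown with one function that handles a violation in three ways. This is a sketch under assumed names (`Mode`, `cleanse`, and the country-code example are hypothetical); every mode reports the violations, and the modes differ only in what they pass on to the target.

```python
from enum import Enum
from typing import Callable, List, Tuple

Row = dict

class Mode(Enum):
    AUDIT = "audit"      # report violations, leave the data untouched
    FILTER = "filter"    # report violations, drop the offending rows
    CORRECT = "correct"  # report violations, repair the offending rows

def cleanse(
    rows: List[Row],
    is_valid: Callable[[Row], bool],
    repair: Callable[[Row], Row],
    mode: Mode,
) -> Tuple[List[Row], List[int]]:
    """Return (rows for the target, indexes of the rows that violated the rule)."""
    output: List[Row] = []
    report: List[int] = []
    for index, row in enumerate(rows):
        if is_valid(row):
            output.append(row)
            continue
        report.append(index)
        if mode is Mode.AUDIT:
            output.append(row)          # passed through unchanged
        elif mode is Mode.CORRECT:
            output.append(repair(row))  # integrity restored
        # Mode.FILTER: the row is simply not appended
    return output, report

# Illustrative rule: country codes must be upper case.
def country_is_upper(row: Row) -> bool:
    return row["country"].isupper()

def upper_case_country(row: Row) -> Row:
    return {**row, "country": row["country"].upper()}
```

For the input `[{"country": "US"}, {"country": "ca"}]`, all three modes report index 1; auditing returns both rows as they were, filtering returns only the first, and correction returns the second with `"CA"`.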

Integrity rules vs. Cleansing rules
- Integrity rules: define how the data must conform in order to meet business rules.
- Cleansing rules: combine the definition from the integrity rule with the action to be taken in the event of a violation.

Data Integrity Rules Classification

Rule Based Cleansing Process
Determine the action to be taken from three perspectives:
- At which data is the rule directed: source data or target data?
- What type of cleansing action is being performed: auditing, filtering, or correction?
- At what point in the warehousing process will the cleansing action occur?

Source data vs. Target data

Source Data Cleansing                   | Target Data Cleansing
Requires source data models             | Rules from target data models
Typically large number of rules         | Typically smaller number of rules
Implicit rules difficult to determine   | Implicit rules readily identified
Fixes problems at the source            | May only fix symptoms
One fix applies to many targets         | Multiple targets may mean redundant cleansing
May require operational system changes  | Avoids operational systems issues

Data Cleansing Economics
Completeness, complexity, and cost must all be considered when developing a data cleansing strategy.
- Auditing: the least costly and least complex action, but the least complete solution.
- Correction: the most complex and costly action, but the most complete solution.
- Cleansing of source data: a more complete solution than cleansing of target data, at higher cost and greater complexity.

Data Cleansing Economics

Principles of Rule Based DC
- Data quality directly affects data warehouse acceptance.
- Source data is typically “dirty” and must be made reliable for warehousing.
- Data models are a good source of data integrity rules.
- Both source and target data models yield integrity rules.
- Data cleansing rules combine integrity rules with actions to be taken when violations are detected.
- Both source data and target data may be cleansed.
- Auditing reports data integrity violations, but does not fix them.
- Filtering discards data that violates integrity rules.
- Correction repairs data that violates integrity rules.
- Choice of techniques involves severity of the data quality problem, available data models, and commitment of time and resources.
- Whatever techniques are chosen, a systematic, rule-based approach yields better results than an unstructured approach.

Bibliography
[Bischoff, Alexander, 1997] Data Warehousing; Joyce Bischoff and Ted Alexander, 1997
[Duncan, Wells, 1999] Rule Based Data Cleansing for Data Warehousing; Karolyn Duncan and David Wells
[Kelly, 1997] Data Warehousing in Action; Sean Kelly, 1997