Plan for Populating a DW

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

BY LECTURER/ AISHA DAWOOD DW Lab # 4 Overview of Extraction, Transformation, and Loading.
BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Basic guidelines for the creation of a DW Create corporate sponsors and plan thoroughly Determine a scalable architectural framework for the DW Identify.
Lecture 5 Themes in this session Building and managing the data warehouse Data extraction and transformation Technical issues.
ICS 421 Spring 2010 Data Warehousing 3 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 4/1/20101Lipyeow.
Data Warehouse IMS5024 – presented by Eder Tsang.
Distributed DBMSs A distributed database is a single logical database that is physically distributed to computers on a network. Homogeneous DDBMS has the.
IS Consulting Process (IS 6005) Masters in Business Information Systems 2009 / 2010 Fergal Carton Business Information Systems.
Designing the Data Warehouse and Data Mart Methodologies and Techniques.
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Data Staging Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of.
Components and Architecture CS 543 – Data Warehousing.
Data Warehouse success depends on metadata
Business and IS Performance (IS 6010) MBS BIS 2010 / th November 2010 Fergal Carton Accounting Finance and Information Systems.
MIS DATABASE SYSTEMS, DATA WAREHOUSES, AND DATA MARTS CHAPTER 3
Designing a Data Warehouse
Business Intelligence Instructor: Bajuna Salehe Web:
Data Conversion to a Data warehouse Presented By Sanjay Gunasekaran.
ETL By Dr. Gabriel.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Data Warehouse Operational Issues Potential Research Directions.
L/O/G/O Metadata Business Intelligence Erwin Moeyaert.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 5 th Edition, Aug 26, 2005 Buzzword List OLTP – OnLine Transaction Processing (normalized,
IMS 4212: Distributed Databases 1 Dr. Lawrence West, Management Dept., University of Central Florida Distributed Databases Business needs.
Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology Data Warehousing Lab. HyeYoung Cho.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25, Part B.
The McGraw-Hill Companies, Inc Information Technology & Management Thompson Cats-Baril Chapter 3 Content Management.
DW-1: Introduction to Data Warehousing. Overview What is Database What Is Data Warehousing Data Marts and Data Warehouses The Data Warehousing Process.
Lecture 7 Integrity & Veracity UFCE8K-15-M: Data Management.
Lecture 12 Designing Databases 12.1 COSC4406: Software Engineering.
Part Two: - The use of views. 1. Topics What is a View? Why Views are useful in Data Warehousing? Understand Materialised Views Understand View Maintenance.
Discovering Computers Fundamentals Fifth Edition Chapter 9 Database Management.
MIS DATABASE SYSTEMS, DATA WAREHOUSES, AND DATA MARTS CHAPTER 3
1 Data Warehouses BUAD/American University Data Warehouses.
Data Warehousing and OLAP Kuliah 2 Introduction Diambil dari paper An Overview of Data Warehousing and OLAP Technology Surajit Chaudhuri & Umeshwar Dayal.
Data Staging Data Loading and Cleaning Marakas pg. 25 BCIS 4660 Spring 2012.
Ing. Erick López Ch. M.R.I. Replicación Oracle. What is Replication  Replication is the process of copying and maintaining schema objects in multiple.
7 Strategies for Extracting, Transforming, and Loading.
Advanced Database Concepts
MIS 451 Building Business Intelligence Systems Data Staging.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Data Mining What is to be done before we get to Data Mining?
An Overview of Data Warehousing and OLAP Technology
Management Information Systems by Prof. Park Kyung-Hye Chapter 7 (8th Week) Databases and Data Warehouses 07.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Data Staging Data Staging Legacy System Data Warehouse SQL Server
Introduction to Data Warehouse
Data warehouse and OLAP
MongoDB Distributed Write and Read
The Client/Server Database Environment
Methodology – Physical Database Design for Relational Databases
Modern Systems Analysis and Design Third Edition
SQL: Advanced Options, Updates and Views Lecturer: Dr Pavle Mogin
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Warehouse.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
Data Warehouse and OLAP
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
Data Warehousing and OLAP
Typically data is extracted from multiple sources
MANAGING DATA RESOURCES
Data Warehousing and Decision Support
Data Warehouse.
Metadata The metadata contains
Data Warehousing Concepts
Chapter 11 Managing Databases with SQL Server 2000
Data Warehouse and OLAP
Implementing ETL solution for Incremental Data Load in Microsoft SQL Server Ganesh Lohani SR. Data Analyst Lockheed Martin
Presentation transcript:

Populating a Data Warehouse Lecturer : Dr Pavle Mogin

Plan for Populating a DW Data extracting Data cleaning Data loading Data refreshing Readings : S. Chaudhuri, U. Dayal: An Overview of Datawarehousing and OLAP Technology

Some Terminology Source data are data from operational databases and external sources Base data are Data Warehouse fact table or dimension table data Derived data is Data Warehouse data produced by materializing views and building auxiliary access structures (indexes)

Populating and Updating a DW Data Warehousing systems use a variety of software tools for: data extraction, data cleaning, DW loading, and DW refreshing All these tools have the goal to provide data of high quality for the decision making purposes ETL = Extraction, Transformation, and Loading

Data Extraction Data from operational databases and external sources are extracted using gateways A gateway is an application program interface that allows a client program to generate SQL statements to be executed at a server Common examples of gateways are: Open Database Connectivity (ODBC), Object Loading and Embedding for Databases (OLE), and Java Database Connectivity (JDBC)

Data Cleaning Since a Data Warehouse is built using data from various sources, there is a high probability of errors and anomalies in data Most frequent errors are: inconsistent field lengths, inconsistent attribute names, inconsistent value assignment, missing entries, and violated integrity constraints ETL software transforms, cleans, and discovers violation of constraints in input data

Data Cleaning Tools Data migration tools allow simple data transformation rules to be specified: Replace Surname by Last_Name Convert pound to kg Data scrubbing tools are more sophisticated, they use domain specific knowledge (business rules of the real system) to clean data from various sources Use functional dependency ProductID Prod_Name to clean product data from production and marketing databases, Convert country code number part of a telephone number into country name (e.g. 64 into New Zealand) Fill in missing Address data Data auditing tools are used to scan data and discover strange patterns (data mining) Products that have been never sold Exceptionally large attribute values (although within limits allowed)

Building Time Dimension Operational data rarely contain precise time data In OLTP databases time is not so important as in OLAP databases At most, operational data records contain the date So, to satisfy need for a unique and immutable Timeid and build a time attribute hierarchy, the time dimension is built before Data Warehouse data loading

Data Loading Before loading data some additional data preprocessing has to be done: sorting, summarization, aggregation, building indexes, and building materialized views The load utilities have to deal with very large volumes of data during small time slots (a night) Sequential loads would take weeks (or more), so pipelined and parallel loads are exploited instead

Loading Data Doing a full load has advantage of using the current version of a Data Warehouse for queries during the time the load is in progress But doing a full load can last to long To reduce the amount of data, incremental loading during refresh is used instead Only the updated operational tuples influence data to be inserted But the incremental load conflicts with ongoing queries To avoid conflicts, incremental loading is performed as a sequence of transactions that commit periodically

Data Refreshing DW refreshing is an alternative to full load Refreshing a Data Warehouse consists in propagating updates on source data (operational and external data) to corresponding updates of base and derived data (in the Data Warehouse) The Data Warehouse refresh policy has to concern two issues: Frequency, and Procedures of data refreshing

Data Refreshing Frequency Data refreshing frequency depends on user needs and OLTP traffic Usually, a DW is refreshed periodically (daily or weekly) But, if users need current data, it is necessary to propagate every relevant update from OLTP data to OLAP data Also, if the OLTP update traffic is high and the DW refreshment frequency low, data volumes during refreshment may overwhelm the refreshment utility So, OLTP update traffic also influences refreshment policy (high traffic leads to frequent updates)

Data Refreshing Procedures Generally, DW refreshing is made using one of the following two techniques: Data Shipping and Transaction shipping Both techniques suppose that the operational DBMS supports replication servers that incrementally propagate updates from a primary database to replicas If the operational database system is a legacy one, and does not support replication, extracting the whole source database can be the only choice

Data Replication To reduce data transfer cost and enhance availability, distributed databases store copies of data on every location where data is in high demand Data copies are called data replicas or snapshots There are two kinds of data replication: Synchronous and Asynchronous

Data Replication In synchronous replication, when source data is updated, all its data replicas have to be updated before the transaction commits In asynchronous replication, the source updates are propagated to replicas periodically (sometimes even long after transaction commits) A Data Warehouse is considered as an asynchronous replica

Data and Transaction Shipping In data shipping a table in the DW is treated as a remote snapshot of a table in the source database Whenever the source table changes, a mechanism called After_row trigger is used to update a snapshot log file, and A refresh procedure is set up to propagate the update to the DW (at some time) In transaction shipping a regular transaction log file is used instead of the trigger and the snapshot log file: At the source site the transaction log is checked for updates that influence the replicated tables The log records containing appropriate changes are transferred to replication servers Still, the issue is to determine the refreshing cycle

Data versus Transaction Shipping Data shipping is more appropriate when operational databases and The Data Warehouse are from different vendors, since transaction log files are not standardized Transaction shipping is more appropriate for homogeneous systems, since the problem of interpreting the contents of the transaction log file is not present Transaction shipping requires less resources of the operational database server

Maintenance of Materialized Views Making a view consistent with its (DW) base tables is called view refreshing If the cost of algorithms for view refreshing is proportional to the change of the view, they are sad to be incremental A view maintenance policy is a decision about when a view has to be refreshed

View Updates View update can be: Deferred update can be done either: Immediate (within the same transaction that updates the base tables), or Deferred (a period of time after the base tables are updated) Deferred update can be done either: During the time a view is used for a query evaluation for the first time after the update of base tables, Periodically, in regular time intervals, or Forced, after a certain number of base table updates

View Refreshing and Aggregates A special consideration is needed when aggregate views are refreshed Views containing distributive aggregates are refreshed without any problem Views containing algebraic aggregates are easily refreshed if they contain all other necessary data (most frequently it is count); Views containing holistic aggregates are hard to refresh, they are rather build every time from scratch

Summary Data extraction is done using gateways (ODBC and JDBC) Data cleaning software reconciles inconsistent: Field length, Attribute names, Value assignment, Missing entries and discovers integrity violations and suspicious data patterns Data loading can be: Either full, or Incremental

Summary (continued) Incremental load is done during data refreshing Data refreshing consist of updating Data Warehouse base and derived data by data inferred from changes made to source data Data refreshing techniques: Data shipping Transaction shipping