7 Strategies for Extracting, Transforming, and Loading.


Extraction, Transformation, and Loading Processes (ETL)
  Extract source data
  Transform and cleanse data
  Index and summarize
  Load data into warehouse
  Detect changes
  Refresh data
  [Diagram labels: Operational systems, ETL (Programs, Tools, Gateways), Warehouse]
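
The steps listed on this slide can be sketched end to end in a few lines. This is only a minimal in-memory illustration, not the course's tooling: the rows, the field names (cust_id, name, amount, day), and the function names are all assumptions made for the example, and summarizing is done after the load here purely for simplicity.

```python
from datetime import date

# Hypothetical operational rows; field names are illustrative only.
SOURCE_ROWS = [
    {"cust_id": "17", "name": "  alice SMITH ", "amount": "120.50", "day": "2024-01-05"},
    {"cust_id": "17", "name": "Alice Smith", "amount": "30.00", "day": "2024-01-05"},
    {"cust_id": "42", "name": "bob jones", "amount": "75.25", "day": "2024-01-06"},
]


def extract(rows):
    """Extract: select only the fields the warehouse needs."""
    return [{k: r[k] for k in ("cust_id", "name", "amount", "day")} for r in rows]


def transform(rows):
    """Transform and cleanse: trim and standardize names, convert types."""
    return [
        {
            "cust_id": int(r["cust_id"]),
            "name": " ".join(r["name"].split()).title(),
            "amount": float(r["amount"]),
            "day": date.fromisoformat(r["day"]),
        }
        for r in rows
    ]


def load(warehouse, rows):
    """Load: append the cleansed rows to the warehouse table."""
    warehouse.extend(rows)


def summarize(warehouse):
    """Summarize: total amount per customer."""
    totals = {}
    for r in warehouse:
        totals[r["cust_id"]] = totals.get(r["cust_id"], 0.0) + r["amount"]
    return totals


warehouse = []
load(warehouse, transform(extract(SOURCE_ROWS)))
totals = summarize(warehouse)
print(totals)
```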

Data Staging Area
  The construction site for the warehouse
  Required by most implementations
  Composed of ODS, flat files, or relational server tables
  Frequently configured as multitier staging
  [Diagram: Extract, Transform, Transport, Transform, Transport (Load) across the operational, staging, and warehouse environments]

Preferred Traditional Staging Model
  Remote staging: Data staging area in its own environment, avoiding negative impact on the warehouse environment

Extracting Data
  Routines developed to select fields from source
  Various data formats
  Rules, audit trails, error correction facilities
  Various techniques

Examining Source Systems
  Production
    – Legacy systems
    – Database systems
    – Vertical applications
  Archive
    – Historical (for initial load)
    – Used for query analysis
    – May require transformations

Mapping
  Defines which operational attributes to use
  Defines how to transform the attributes for the warehouse
  Defines where the attributes exist in the warehouse
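
One way to picture a mapping is as a small table with three columns: the operational attribute, the transformation, and the warehouse attribute. The sketch below assumes hypothetical attribute names (CUST_NM, ORD_AMT, CTRY_CD) and is not taken from any particular ETL tool.

```python
# Hypothetical mapping: each entry names the operational attribute to use,
# how to transform it, and where it lands in the warehouse.
MAPPING = [
    {"source": "CUST_NM", "transform": str.strip, "target": "customer_name"},
    {"source": "ORD_AMT", "transform": float, "target": "order_amount"},
    {"source": "CTRY_CD", "transform": str.upper, "target": "country_code"},
]


def apply_mapping(source_row, mapping):
    """Build a warehouse row from an operational row using the mapping."""
    return {m["target"]: m["transform"](source_row[m["source"]]) for m in mapping}


# INTERNAL_FLAG is not in the mapping, so it never reaches the warehouse.
operational_row = {
    "CUST_NM": " Acme Ltd ",
    "ORD_AMT": "19.99",
    "CTRY_CD": "gb",
    "INTERNAL_FLAG": "x",
}
warehouse_row = apply_mapping(operational_row, MAPPING)
print(warehouse_row)
```

Keeping the mapping as data rather than code means the same table can also serve as part of the metadata trail.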

Designing Extraction Processes
  Analysis
    – Sources, technologies
    – Data types, quality, owners
  Design options
    – Manual, custom, gateway, third-party
    – Replication, full, or delta refresh
  Design issues
    – Batch window, volumes, data currency
    – Automation, skills needed, resources
  Maintenance of metadata trail

Importance of Data Quality
  Business user confidence
  Query and reporting accuracy
  Standardization
  Data integration

Benefits of Data Quality
  Cleansed data is critical for:
    – Standardization within the warehouse
    – High-quality matching on names and addresses
    – Creation of accurate rules and constraints
    – Prediction and analysis
    – Creation of a solid infrastructure to support customer-centric business intelligence
    – Reduction of project risk
    – Reduction of long-term costs

Guidelines for Data Quality
  Operational data should not be used directly in the warehouse.
  Operational data must be cleaned for each increment.
  Operational data is not simply fixed by modifying applications.

Transformation
  Transformation eliminates operational data anomalies:
    – Cleans
    – Standardizes
    – Presents subject-oriented data
  [Diagram: Extract, Transform, Transport, Transform, Transport (Load); transformation steps labeled Restructure, Consolidate, Cleanse]

Transformation Routines
  Cleansing data
  Eliminating inconsistencies
  Adding elements
  Merging data
  Integrating data
  Transforming data before load

Why Transform?
  In-house system development
  Multipart keys
  Multiple encoding
  Multiple local standards
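
As an illustration of the "multiple encoding" problem, the sketch below assumes three hypothetical source systems (billing, crm, web) that each encode an order status differently, and translates all of them to one warehouse standard. The systems, codes, and the "unknown" fallback are assumptions for the example.

```python
# Hypothetical per-system codes for the same attribute (order status).
# Keys are stored lowercase so lookups can be case-insensitive.
ENCODINGS = {
    "billing": {"o": "open", "c": "closed"},
    "crm": {"1": "open", "0": "closed"},
    "web": {"open": "open", "closed": "closed"},
}


def standardize_status(system, raw_value):
    """Translate a system-specific code to the warehouse standard."""
    code = str(raw_value).strip().lower()
    return ENCODINGS[system].get(code, "unknown")


print(standardize_status("billing", "O"))
print(standardize_status("crm", 0))
print(standardize_status("web", " Closed "))
print(standardize_status("crm", "9"))
```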

Why Transform?
  Multiple files
  Missing values
  Duplicate values
  Element names
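
Missing and duplicate values are two of the simpler anomalies to handle in code. The following is a minimal sketch, assuming that a duplicate means a repeated key and that a missing value is None or an empty string; real cleansing rules are usually set by the business and are more involved.

```python
def clean(rows, key, defaults):
    """Drop rows whose key was already seen and fill missing values."""
    seen = set()
    cleaned = []
    for row in rows:
        if row[key] in seen:
            continue  # duplicate value: keep the first occurrence only
        seen.add(row[key])
        filled = {
            field: (defaults.get(field) if value in (None, "") else value)
            for field, value in row.items()
        }
        cleaned.append(filled)
    return cleaned


# Hypothetical source rows with one duplicate and two missing cities.
rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": ""},
    {"id": 1, "city": "Oslo"},
    {"id": 3, "city": None},
]
cleaned = clean(rows, key="id", defaults={"city": "UNKNOWN"})
print(cleaned)
```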

Why Transform?
  Element meaning
  Input format
  Referential integrity

Why Transform?
  Name and address:
    – No unique key
    – Missing data values (NULLs)
    – Personal and commercial names mixed
    – Different addresses for the same member
    – Different names and spelling for the same member
    – Many names on one line
    – One name on two lines
    – The data may be in a single field of no fixed format
    – Each component of an address is in a specific field

Integration (Match and Merge)
  [Diagram labels: Source, Target, Match and Merge, schema]
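
Match and merge can be sketched as two steps: derive a match key that ignores spelling and spacing differences, then merge the records that share it. The rule used here (normalized name plus postcode, first non-empty value wins) is an assumption chosen for brevity; production matching on names and addresses is far more elaborate.

```python
def match_key(record):
    """Normalize name and postcode so spelling variants compare equal."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return (name, record["postcode"].replace(" ", "").upper())


def match_and_merge(*sources):
    """Merge records from several sources that share a match key.

    Earlier sources win; later sources only fill fields still empty.
    """
    merged = {}
    for source in sources:
        for record in source:
            target = merged.setdefault(match_key(record), {})
            for field, value in record.items():
                if value and not target.get(field):
                    target[field] = value
    return list(merged.values())


# Two hypothetical source systems describing the same customer.
crm = [{"name": "J. Smith", "postcode": "ab1 2cd", "phone": ""}]
billing = [{"name": "j smith", "postcode": "AB12CD", "phone": "555-0100"}]
customers = match_and_merge(crm, billing)
print(customers)
```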

Transformation Techniques
  Merging data
    – Operational transactions do not usually map one-to-one with warehouse data.
    – Data for the warehouse is merged to provide information for analysis.
  Adding keys to data

Transformation Techniques
  Time

Transformation Techniques
  Adding a date stamp:
  Fact table
    – Add triggers
    – Recode applications
    – Compare tables
  Dimension table
  Time representation
    – Point in time
    – Time span
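
Of the three fact-table options, "compare tables" is the easiest to show without touching the source application. The sketch below assumes two snapshots held as dictionaries keyed by primary key and stamps each detected change with a load date; the row layout and the change labels are hypothetical.

```python
from datetime import date


def detect_changes(previous, current, load_date):
    """Compare two snapshots keyed by primary key; return stamped changes."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append({**row, "change": "insert", "load_date": load_date})
        elif row != previous[key]:
            changes.append({**row, "change": "update", "load_date": load_date})
    for key, row in previous.items():
        if key not in current:
            changes.append({**row, "change": "delete", "load_date": load_date})
    return changes


# Hypothetical snapshots taken at two consecutive extracts.
previous = {1: {"id": 1, "city": "Oslo"}, 2: {"id": 2, "city": "Rome"}}
current = {1: {"id": 1, "city": "Bergen"}, 3: {"id": 3, "city": "Lima"}}
changes = detect_changes(previous, current, date(2024, 1, 31))
for change in changes:
    print(change)
```

Here the date stamp is a point-in-time representation; a time-span representation would carry a start and an end date instead.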

Transformation Techniques
  Creating summary data:
    – During extraction on staging area
    – After loading onto the warehouse server
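
Wherever it runs, creating summary data amounts to aggregating detail rows to a coarser level. A minimal sketch, assuming hypothetical daily sales rows rolled up to product and month:

```python
from collections import defaultdict


def summarize_by_month(detail_rows):
    """Roll daily detail rows up to (product, month) totals."""
    totals = defaultdict(int)
    for row in detail_rows:
        month = row["day"][:7]  # "YYYY-MM" from an ISO date string
        totals[(row["product"], month)] += row["units"]
    return dict(totals)


# Hypothetical detail rows as they might sit in the staging area.
detail = [
    {"product": "widget", "day": "2024-01-05", "units": 10},
    {"product": "widget", "day": "2024-01-20", "units": 5},
    {"product": "widget", "day": "2024-02-02", "units": 3},
    {"product": "gadget", "day": "2024-01-07", "units": 7},
]
summary = summarize_by_month(detail)
print(summary)
```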

Transformation Techniques
  Creating artificial keys:
    – Use generalized or derived keys
    – Maintain the uniqueness of a row
    – Use an administrative process to assign the key
    – Concatenate operational key with number
    – Easy to maintain
    – Cumbersome keys
    – No clean value for retrieval
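
Both approaches on this slide can be sketched briefly. KeyRegistry stands in for an administrative process that hands out generalized keys, and concatenated_key shows the derived alternative; both names, the hyphen separator, and the three-digit padding are assumptions for illustration.

```python
from itertools import count


class KeyRegistry:
    """Assigns one generalized (surrogate) integer key per operational key."""

    def __init__(self):
        self._next = count(1)
        self._assigned = {}

    def key_for(self, source_system, operational_key):
        natural = (source_system, operational_key)
        if natural not in self._assigned:
            self._assigned[natural] = next(self._next)
        return self._assigned[natural]


def concatenated_key(operational_key, number):
    """Derived key: operational key concatenated with a number."""
    return f"{operational_key}-{number:03d}"


registry = KeyRegistry()
print(registry.key_for("crm", "C100"))      # first key assigned
print(registry.key_for("billing", "C100"))  # same code, different system
print(registry.key_for("crm", "C100"))      # repeat lookup, same key as before
print(concatenated_key("C100", 2))
```

The concatenated form is simple to produce but yields long string keys, which is one reading of the "cumbersome keys" point on the slide.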

Where to Transform?
  Choose wisely where the transformation takes place:
    – Operational platform
    – Staging area
    – Warehouse server

When to Transform?
  Choose the transformation point wisely:
    – Workload
    – Environment impact
    – CPU use
    – Disk space
    – Network bandwidth
    – Parallel execution
    – Load window time
    – User information needs

Designing Transformation Processes
  Analysis
    – Sources and target mappings, business rules
    – Key users, metadata, grain, verify integrity of data
  Design options
    – Programming, Tools
  Design issues
    – Performance
    – Size of the staging area
    – Exception handling, integrity maintenance

Loading Data into the Warehouse
  Loading moves the data into the warehouse.
  Subsequent refresh moves smaller volumes.
  Business determines the cycle.
  [Diagram: Extract, Transform, Transport, Transform, Transport (Load) across the operational, staging, and warehouse environments]

Extract versus Warehouse Processing Environment
  Extract processing builds a new database after each time interval.
  Warehouse processing adds changes to the database after each time interval.
  [Diagram labels, shown once for each approach: Operational databases; time intervals T1, T2, T3]
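
The difference between the two environments can be shown with three hypothetical snapshots at T1, T2, and T3. In this sketch, extract processing replaces the whole database each interval, while warehouse processing appends only what changed; the snapshot contents and function names are assumptions.

```python
def extract_processing(operational_snapshot):
    """Rebuild: the new database is a fresh copy of the current snapshot."""
    return dict(operational_snapshot)


def warehouse_processing(warehouse, changes, interval):
    """Append: add this interval's changes to the existing history."""
    for key, value in changes.items():
        warehouse.append({"key": key, "value": value, "interval": interval})
    return warehouse


# Hypothetical operational state at each time interval.
snapshots = {
    "T1": {"A": 10},
    "T2": {"A": 10, "B": 5},
    "T3": {"A": 12, "B": 5},
}

extract_db = {}
warehouse = []
previous = {}
for interval, snapshot in snapshots.items():
    extract_db = extract_processing(snapshot)
    changes = {k: v for k, v in snapshot.items() if previous.get(k) != v}
    warehouse_processing(warehouse, changes, interval)
    previous = snapshot

print(extract_db)  # only the latest state survives
print(warehouse)   # one row per change, history retained
```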

First-Time Load
  Single event that populates the database with historical data
  Involves a large volume of data
  Uses distinct ETL tasks
  Involves large amounts of processing after load

Refresh
  Performed according to a business cycle
  Simpler task
  Less data to load than first-time load
  Less complex ETL
  Smaller amounts of postload processing

Building the Transportation Process
  Specification:
    – Techniques and tools
    – File transfer methods
    – The load window
    – Time window for other tasks
    – First-time and refresh volumes
    – Frequency of the refresh cycle
    – Connectivity bandwidth

Building the Transportation Process
  Test the proposed technique
  Document proposed load
  Gain agreement on the process
  Monitor
  Review
  Revise

Granularity
  Important design and operational issue
  Low-level grain: Expensive, high level of processing, more disk, detail
  High-level grain: Cheaper, less processing, less disk, little detail
  Space requirements
    – Storage
    – Backup
    – Recovery
    – Partitioning
    – Load
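
The disk trade-off between grains is easy to see with back-of-the-envelope arithmetic. All figures below (rows per day, row size, retention) are made-up assumptions, and the estimate ignores indexes, compression, and backup copies.

```python
def estimated_bytes(rows_per_day, days, bytes_per_row):
    """Rough raw storage estimate for one fact table."""
    return rows_per_day * days * bytes_per_row


# Low-level grain: one row per sales transaction (hypothetical volume).
transaction_grain = estimated_bytes(rows_per_day=2_000_000, days=365, bytes_per_row=100)

# High-level grain: one row per product per store per day (hypothetical volume).
daily_grain = estimated_bytes(rows_per_day=5_000, days=365, bytes_per_row=100)

print(f"transaction grain: {transaction_grain / 1e9:.1f} GB per year")
print(f"daily grain:       {daily_grain / 1e9:.4f} GB per year")
print(f"ratio:             {transaction_grain // daily_grain}x")
```

With these numbers the transaction grain needs 400 times the raw storage, and the same factor carries through to backup, recovery, and load.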

Post-Processing of Loaded Data
  [Diagram: Extract, Transform, Transport, followed by the post-processing steps Summarize and Index]
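
The two post-processing steps can be sketched with Python's built-in sqlite3 module. The table and index names (sales_fact, sales_summary, idx_sales_product) are hypothetical, and SQLite stands in here for whatever warehouse server is actually used. Building the index after the load, rather than before, is a common choice because the load then avoids index maintenance on every insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: bulk-insert the already transformed rows into the fact table.
conn.execute("CREATE TABLE sales_fact (product TEXT, sale_day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("widget", "2024-01-05", 10),
        ("widget", "2024-01-20", 5),
        ("gadget", "2024-01-07", 7),
    ],
)

# Index: created once the data is in place.
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product)")

# Summarize: build a monthly summary table from the loaded detail.
conn.execute(
    "CREATE TABLE sales_summary AS "
    "SELECT product, substr(sale_day, 1, 7) AS month, SUM(amount) AS total "
    "FROM sales_fact GROUP BY product, substr(sale_day, 1, 7)"
)

summary = conn.execute(
    "SELECT product, month, total FROM sales_summary ORDER BY product"
).fetchall()
print(summary)
```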