
High Performance Data Warehouse Design and Construction: ETL Processing
Prepared by Stephen A. Brobst, (617) 422-0800
Copyright © 2000, 2001 Stephen A. Brobst. Do not duplicate or distribute without written permission.

Slide 2: ETL Processing
[Diagram: Operational Data flows through Data Transformation into the Enterprise Warehouse and Integrated Data Marts, then via Replication out to Dependent Data Marts or Departmental Warehouses, serving both IT users and business users.]

Slide 3: Data Acquisition from OLTP Systems
Why is it hard?
- Multiple source system technologies.
- Inconsistent data representations.
- Multiple sources for the same data element.
- Complexity of required transformations.
- Scarcity and cost of legacy cycles.
- Volume of legacy data.

Slide 4: Data Acquisition from OLTP Systems
Many possible source system technologies:
- Flat files, Excel, Model 204
- VSAM, Access, DBF format
- IMS, Oracle, RDB
- IDMS, Informix, RMS
- DB2 (many flavors), Sybase, compressed files
- Adabas, Ingres, many others...

Slide 5: Data Acquisition from OLTP Systems
Inconsistent data representation: same data, different domain values. Examples:
- Date value representations, e.g., 02/14/2000 vs. 14-FEB-2000.
- Gender value representations: M/F vs. M/F/PM/PF vs. 0/1 vs. 1/2.

Slide 6: Data Acquisition from OLTP Systems
Multiple sources for the same data element:
- Need to establish precedence between source systems on a per data element basis.
- Take the data element from the source system with the highest precedence where the element exists.
- Must sometimes establish "group precedence" rules to maintain data integrity.

Slide 7: Data Acquisition from OLTP Systems
Complexity of required transformations (see the sketch below):
- Simple scalar transformations: 0/1 => M/F
- One-to-many element transformations: 6x30 address field => street1, street2, city, state, zip
- Many-to-many element transformations: householding and individualization of customer records
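To make the first two kinds concrete, here is a minimal Python sketch; the field layout, code values, and fallback policy are illustrative assumptions rather than anything specified in the deck:

    # Scalar and one-to-many transformations (hypothetical layouts and codes).
    GENDER_MAP = {"0": "F", "1": "M"}  # simple scalar transformation: 0/1 => M/F

    def transform_gender(code: str) -> str:
        """Scalar transformation: remap one source domain value."""
        return GENDER_MAP.get(code, "U")  # "U" for unknown is an assumed policy

    def split_address(raw: str, lines: int = 6, width: int = 30) -> dict:
        """One-to-many transformation: a 6x30 fixed-width address blob
        becomes discrete fields (which line holds which field is assumed)."""
        parts = [raw[i * width:(i + 1) * width].strip() for i in range(lines)]
        return {"street1": parts[0], "street2": parts[1], "city": parts[2],
                "state": parts[3], "zip": parts[4]}

Many-to-many transformations such as householding cannot be sketched this simply: they require matching and grouping across records, which is why they are the hardest of the three.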

Slide 8: Data Acquisition from OLTP Systems
Scarcity and cost of legacy cycles:
- Generally want to off-load transformation cycles to an open systems environment.
- Often requires new skill sets.
- Need an efficient and easy way to deal with mainframe data formats such as EBCDIC and packed decimal (see the sketch below).
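As an illustration of the last point, Python's standard codecs module can decode EBCDIC text, and packed decimal (COMP-3) fields can be unpacked by hand; the code page and the worked example are assumptions for illustration:

    import codecs

    def decode_ebcdic(raw: bytes) -> str:
        """Decode an EBCDIC text field (code page 037 assumed) to a string."""
        return codecs.decode(raw, "cp037")

    def unpack_comp3(raw: bytes, scale: int = 0) -> float:
        """Unpack IBM packed decimal: two digits per byte, with the sign in
        the low nibble of the last byte (0xD negative, 0xC/0xF positive)."""
        digits = []
        for byte in raw[:-1]:
            digits.extend(((byte >> 4) & 0x0F, byte & 0x0F))
        digits.append((raw[-1] >> 4) & 0x0F)
        value = 0
        for d in digits:
            value = value * 10 + d
        if raw[-1] & 0x0F == 0x0D:
            value = -value
        return value / (10 ** scale) if scale else value

    assert unpack_comp3(bytes([0x12, 0x34, 0x5C])) == 12345  # encodes +12345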

Slide 9: Data Acquisition from OLTP Systems
Volume of legacy data:
- Need lots of processing and I/O capacity to effectively handle large data volumes.
- The 2GB file limit in older versions of UNIX is not acceptable for handling legacy data; need a full 64-bit file system.
- Need efficient interconnect bandwidth to transfer large amounts of data from legacy sources.

Slide 10: Data Acquisition from OLTP Systems
What does the solution look like?
- Meta data driven transformation architecture.
- Modular software solutions with component building blocks.
- Parallel software and hardware architectures.

Slide 11: Data Acquisition from OLTP Systems
Meta data driven transformation architecture:
- Need multiple meta data structures:
  – Source meta data
  – Target meta data
  – Transformation meta data
- Must avoid "hard coding" for maintainability.
- Automatic generation of transformations from meta data structures (see the sketch below).
- Meta data repository ideally accessible by APIs and end user tools.
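A minimal sketch of what this can look like (all field names and rules below are hypothetical): the mapping specification drives the transformation, so changing a rule means editing meta data rather than program logic:

    # Hypothetical transformation meta data: source field -> target field + rule.
    # In a real architecture this lives in a meta data repository, not the script.
    MAPPING_META = {
        "CUST_NM": {"target": "customer_name", "transform": "strip"},
        "GNDR_CD": {"target": "gender",        "transform": "gender_decode"},
        "OPEN_DT": {"target": "open_date",     "transform": "iso_date"},
    }

    TRANSFORMS = {
        "strip":         lambda v: v.strip(),
        "gender_decode": lambda v: {"0": "F", "1": "M"}.get(v, "U"),
        "iso_date":      lambda v: f"{v[4:8]}-{v[0:2]}-{v[2:4]}",  # MMDDYYYY
    }

    def apply_mapping(source_row: dict) -> dict:
        """Derive the target row entirely from the meta data specification."""
        return {spec["target"]: TRANSFORMS[spec["transform"]](source_row[src])
                for src, spec in MAPPING_META.items()}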

Slide 12: Data Acquisition from OLTP Systems
Modular software structures with component building blocks:
- Want a data flow driven transformation architecture that supports multiple processing steps (see the pipeline sketch below).
- Meta data structures should map inputs and outputs between each transformation module.
- Leverage pre-packaged tools for transformation steps wherever possible.
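One way to picture the component building blocks, sketched with simple record-at-a-time generator steps (real ETL engines stream blocks of rows and run steps in parallel):

    from typing import Callable, Iterable, Iterator

    Step = Callable[[Iterator[dict]], Iterator[dict]]

    def pipeline(source: Iterable[dict], steps: list[Step]) -> Iterator[dict]:
        """Chain modular components into one data flow; meta data would
        describe the input and output of each step."""
        stream: Iterator[dict] = iter(source)
        for step in steps:
            stream = step(stream)  # each module's output feeds the next
        return stream

    def drop_test_accounts(rows: Iterator[dict]) -> Iterator[dict]:
        return (r for r in rows if not r.get("is_test"))

    def standardize_gender(rows: Iterator[dict]) -> Iterator[dict]:
        for r in rows:
            r["gender"] = {"0": "F", "1": "M"}.get(r.get("gender"), "U")
            yield r

    extracted = [{"gender": "0", "is_test": False},
                 {"gender": "1", "is_test": True}]
    for row in pipeline(extracted, [drop_test_accounts, standardize_gender]):
        print(row)  # only the non-test row survives, with gender standardized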

Slide 13: Data Acquisition from OLTP Systems
Parallel software and hardware architectures:
- Use data parallelism (partitioning) to allow concurrent execution of multiple job streams (see the sketch below).
- Software architecture must allow efficient re-partitioning of data between steps in the transformation process.
- Want powerful parallel hardware architectures with many processors and I/O channels.
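A toy illustration of the data parallelism point, using Python's multiprocessing module purely for concreteness (a production architecture would partition across many nodes and I/O channels, and re-partition between steps):

    from multiprocessing import Pool

    def partition(rows: list, n: int, key: str) -> list:
        """Hash-partition rows so each worker owns an independent stream."""
        parts = [[] for _ in range(n)]
        for row in rows:
            parts[hash(row[key]) % n].append(row)
        return parts

    def transform_partition(rows: list) -> list:
        """Per-partition transformation; partitions execute concurrently."""
        return [{**r, "gender": {"0": "F", "1": "M"}.get(r["gender"], "U")}
                for r in rows]

    if __name__ == "__main__":
        rows = [{"cust_id": i, "gender": str(i % 2)} for i in range(1000)]
        with Pool(4) as pool:
            results = pool.map(transform_partition,
                               partition(rows, 4, "cust_id"))
        transformed = [row for part in results for row in part]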

Slide 14: A Word of Warning
The data quality in the source systems will be much worse than what you expect.
- Must allocate explicit time and resources to facilitate data clean-up.
- Data quality is a continuous improvement process; must institute a TQM program to be successful.
- Use the "house of quality" technique to prioritize and focus data quality efforts.

Slide 15: ETL Processing
It is important to look at the big picture. Data acquisition time may include:
- Extracts from source systems.
- Data movement.
- Transformations.
- Data loading.
- Index maintenance.
- Statistics collection.
- Summary data maintenance.
- Data mart construction.
- Backups.

Slide 16: Loading Strategies
Once we have transformed data, there are three primary loading strategies:
1. Full data refresh with "block slamming" into empty tables.
2. Incremental data refresh with "block slamming" into existing (populated) tables.
3. Trickle feed with continuous data acquisition using row-level insert and update operations.

Slide 17: Loading Strategies
We must also worry about rolling off "old" data as its economic value drops below the cost of storing and maintaining it.
[Diagram: new data rolls into the warehouse as old data rolls off.]

Slide 18: Loading Strategies
The choice of loading strategy depends on tradeoffs between data freshness and performance, as well as on data volatility characteristics. What is the goal?
- Increased data freshness.
- Increased data loading performance.
[Diagram: a spectrum from real-time availability with low update rates (trickle feed) to minimal load time with high update rates but delayed availability (batch loading).]

Slide 19: Loading Strategies
Should consider:
- Data storage requirements.
- Impact on query workloads.
- Ratio of existing to new data.
- Insert versus update workloads.

Slide 20: Loading Strategies
Tradeoffs in data loading with a high percentage of data changes per data block.

Slide 21: Loading Strategies
Tradeoffs in data loading with a low percentage of data changes per data block.

Slide 22: Full Refresh Strategy
Completely re-load the table on each refresh (see the sketch below):
Step 1: Load the table using block slamming.
Step 2: Build indexes.
Step 3: Collect statistics.
This is a good (simple) strategy for small tables, or when a high percentage of the rows change on each refresh (greater than 10%), e.g., reference lookup tables, or account tables where balances change on each refresh.
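A hedged sketch of the three steps through a generic Python DB-API connection; bulk_load stands in for a vendor load utility (e.g., a FastLoad wrapper), and the index column and ANALYZE syntax are dialect-specific assumptions:

    def full_refresh(conn, bulk_load, table: str, data_file: str) -> None:
        """Full refresh: block-slam into an emptied table, then rebuild
        indexes and statistics."""
        cur = conn.cursor()
        cur.execute(f"DELETE FROM {table}")  # or TRUNCATE where supported
        bulk_load(table, data_file)                                    # Step 1
        cur.execute(f"CREATE INDEX {table}_ix1 ON {table} (cust_id)")  # Step 2
        cur.execute(f"ANALYZE {table}")                                # Step 3
        conn.commit()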

Slide 23: Full Refresh Strategy
Performance hints:
- Remove referential integrity (RI) constraints from table definitions for loading operations.
  – Assume that data cleansing takes place in the transformations.
- Remove secondary index specifications from the table definition.
  – Build indices after the table has been loaded.
- Make sure target table logging is disabled during loads.

Slide 24: Full Refresh Strategy
Consider using "shadow" tables to allow the refresh to take place without impacting query workloads (see the sketch below):
1. Load the shadow table.
2. Use a replace-view operation to direct queries to the refreshed table and make the new data visible.
This trades storage for availability.
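In hedged form, the shadow-table refresh might look like this; the table and view names are illustrative, the replace-view syntax varies across RDBMS products, and on successive refreshes the shadow and live roles would alternate:

    def refresh_via_shadow(conn, bulk_load, data_file: str) -> None:
        """Load a shadow table, then repoint the query view at it so the
        new data becomes visible without blocking readers of the old table."""
        cur = conn.cursor()
        cur.execute("DELETE FROM account_shadow")
        bulk_load("account_shadow", data_file)  # block slamming
        cur.execute("CREATE INDEX account_shadow_ix1 "
                    "ON account_shadow (acct_id)")
        cur.execute("CREATE OR REPLACE VIEW account AS "
                    "SELECT * FROM account_shadow")  # the view switch
        conn.commit()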

Slide 25: Incremental Refresh Strategy
Incrementally load new data into an existing target table that has already been populated by previous loads. Two primary strategies:
1. Incremental load directly into the target table.
2. Shadow table load followed by an insert-select operation into the target table.

Slide 26: Incremental Refresh Strategy
Design considerations for incremental load directly into the target table using RDBMS utilities:
- Indices should be maintained automatically.
- Re-collect statistics if table demographics have changed significantly.
- Typically requires a table lock to be taken during the block slamming operation.
- Do you want to allow "dirty" reads?
- Logging behavior differs across RDBMS products.

Slide 27: Incremental Refresh Strategy
Design considerations for the shadow table implementation (see the insert-select sketch below):
- Use block slamming into an empty "shadow" table with a structure identical to the target table.
- Staging space is required for the shadow table.
- The insert-select operation from the shadow table to the target table will preserve indices.
- Locking will normally escalate to a table-level lock.
- Beware of log file size constraints.
- Beware of the performance overhead of logging.
- Beware of rollbacks if the operation fails for any reason.
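The insert-select step itself is one set-oriented statement; a sketch with illustrative table names:

    def incremental_merge(conn) -> None:
        """Append freshly loaded shadow rows to the target table; index
        maintenance and logging happen inside the single INSERT...SELECT."""
        cur = conn.cursor()
        cur.execute("INSERT INTO sales_target SELECT * FROM sales_shadow")
        cur.execute("DELETE FROM sales_shadow")  # empty staging for next load
        conn.commit()  # a failure before this point rolls the whole batch back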

Slide 28: Incremental Refresh Strategy
Both incremental load strategies described preserve index structures during the loading operation. However, there is a cost to maintaining indexes during loads:
- Rule of thumb: each secondary index maintained during the load costs 2-3 times the resources of the actual row insertion into the table.
- Rule of thumb: consider dropping and re-building index structures if the number of rows being incrementally loaded is more than 10% of the size of the target table.
Note: dropping and re-building secondary indices may not be acceptable given the availability requirements of the DW.

Slide 29: Trickle Feed
Acquire data on a continuous basis into the RDBMS using row-level SQL insert and update operations:
- Data is made available to the DW "immediately" rather than waiting for batch loading to complete.
- Much higher overhead for data acquisition on a per-record basis as compared to batch strategies.
- Row-level locking mechanisms allow queries to proceed during data acquisition.
- Typically relies on Enterprise Application Integration (EAI) for data delivery.

Slide 30: Trickle Feed
A tradeoff exists between data freshness and insert efficiency:
- Buffering rows for insertion allows for fewer round trips to the RDBMS...
- ...but waiting to accumulate rows into the buffer impacts data freshness.
Suggested approach: use a threshold that buffers up to M rows, but never waits more than N seconds before sending a buffer of data for insertion (see the sketch below).
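A minimal sketch of that threshold logic (simplified: the time check here runs only when a row arrives, whereas a production feed would also flush from a timer):

    import time

    class TrickleBuffer:
        """Flush after max_rows rows or max_wait seconds, whichever comes
        first; insert_rows stands in for a batched INSERT via a DB driver."""

        def __init__(self, insert_rows, max_rows: int = 500,
                     max_wait: float = 5.0):
            self.insert_rows = insert_rows
            self.max_rows = max_rows  # the "M rows" threshold
            self.max_wait = max_wait  # the "N seconds" threshold
            self.buffer = []
            self.oldest = 0.0         # arrival time of oldest buffered row

        def add(self, row: dict) -> None:
            if not self.buffer:
                self.oldest = time.monotonic()
            self.buffer.append(row)
            if (len(self.buffer) >= self.max_rows or
                    time.monotonic() - self.oldest >= self.max_wait):
                self.flush()

        def flush(self) -> None:
            if self.buffer:
                self.insert_rows(self.buffer)  # one round trip, many rows
                self.buffer = []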

Slide 31: ELT versus ETL
There are two fundamental approaches to data acquisition:
- ETL is extract, transform, load, in which transformation takes place on a transformation server using either an "engine" or generated code.
- ELT is extract, load, transform, in which data transformations take place in the relational database on the data warehouse server.
Of course, hybrids are also possible...

Slide 32: ETL Processing
ETL processing performs the transform operations prior to loading the data into the RDBMS:
1. Extract data from the source systems.
2. Transform the data into a form consistent with the target tables.
3. Load the data into the target tables (or into shadow tables).

Slide 33: ETL Processing
ETL processing is typically performed using resources on the source system platform(s) or on a dedicated transformation server.
[Diagram: Source Systems perform pre-transformations and feed a Transformation Server, which loads the Data Warehouse.]

Slide 34: ETL Processing
Perform the transformations on the source system platform if available resources exist and significant data reduction can be achieved during the transformations.
Perform the transformations on a dedicated transformation server if the source systems are highly distributed, lack capacity, or have a high cost per unit of computing.

Slide 35: ETL Processing
Two approaches for ETL processing:
1. Engine: ETL processing using an interpretive engine that applies transformation rules based on meta data specifications (e.g., Ascential, Informatica).
2. Code generation: ETL processing using code generated from the meta data specification (e.g., Ab Initio, ETI).

Slide 36: ELT Processing
- First, load the "raw" data into empty tables using RDBMS block slamming utilities.
- Next, use SQL to transform the "raw" data into a form appropriate to the target tables (see the sketch below).
  – Ideally, the SQL is generated using a meta data driven tool rather than hand coded.
- Finally, use insert-select into the target table for incremental loads, or view switching if a full refresh strategy is used.
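A hedged sketch of the transform step as one set-oriented SQL statement issued from Python; all table and column names are illustrative:

    ELT_TRANSFORM_SQL = """
        INSERT INTO customer_target (cust_id, customer_name, gender, open_date)
        SELECT cust_id,
               TRIM(cust_nm),
               CASE gndr_cd WHEN '0' THEN 'F' WHEN '1' THEN 'M' ELSE 'U' END,
               CAST(open_dt AS DATE)
        FROM   customer_raw  -- block-slammed staging table of raw rows
    """

    def elt_transform(conn) -> None:
        """Run the set-oriented transformation inside the DW RDBMS itself."""
        cur = conn.cursor()
        cur.execute(ELT_TRANSFORM_SQL)
        conn.commit()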

Slide 37: ELT Processing
The DW server is the transformation server for ELT processing.
[Diagram: Source Systems deliver files over channel and network connections, and a load utility such as Teradata FastLoad block-slams them into the Data Warehouse.]

Slide 38: ELT Processing
- ELT processing obviates the need for a separate transformation server.
  – Assumes that spare capacity exists on the DW server to support the transformation operations.
- ELT leverages the built-in scalability and manageability of the parallel RDBMS and HW platform.
- Must allocate sufficient staging area space to support the load of raw data and the execution of the transformation SQL.
- Works well only for batch-oriented transforms, because SQL is optimized for set processing.

Slide 39: Bottom Line
ETL is a significant task in any DW deployment.
Many options for data loading strategies: need to evaluate tradeoffs in performance, data freshness, and compatibility with the source systems environment.
Many options for ETL/ELT deployment: need to evaluate tradeoffs in where and how transformations should be applied.