Ahsan Abdullah 1 Data Warehousing Lecture-16 Extract Transform Load (ETL) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for.

Slides:



Advertisements
Similar presentations
IS 4420 Database Fundamentals Chapter 11: Data Warehousing Leon Chen
Advertisements

BY LECTURER/ AISHA DAWOOD DW Lab # 3 Overview of Extraction, Transformation, and Loading.
C6 Databases.
Copyright © Starsoft Inc, Data Warehouse Architecture By Slavko Stemberger.
Lecture-19 ETL Detail: Data Cleansing
Technical BI Project Lifecycle
Management Information Systems, Sixth Edition
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-5 Types & Typical Applications of DWH Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
ICS 421 Spring 2010 Data Warehousing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/18/20101Lipyeow.
Prepared by Stephen A. Brobst (617) Copyright © 2000, Stephen A. Brobst. Do not duplicate or distribute without written.
Components and Architecture CS 543 – Data Warehousing.
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) The Data Warehouse Lifecycle Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential.
Introduction to Database Management
Lecture-33 DWH Implementation: Goal Driven Approach (1)
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Leaving a Metadata Trail Chapter 14. Defining Warehouse Metadata Data about warehouse data and processing Vital to the warehouse Used by everyone Metadata.
ETL Design and Development Michael A. Fudge, Jr.
ETL By Dr. Gabriel.
Lecture-1 Introduction and Background
L/O/G/O Metadata Business Intelligence Erwin Moeyaert.
Ahsan Abdullah 1 Data Warehousing Lecture-12 Relational OLAP (ROLAP) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Ahsan Abdullah 1 Data Warehousing Lecture-17 Issues of ETL Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
ISV Innovation Presented by ISV Innovation Presented by Business Intelligence Fundamentals: Data Loading Ola Ekdahl IT Mentors 9/12/08.
The Business Intelligence Side of Blue Mountain RAM Bill Lucas, IT Systems Architect and Senior Software Engineer.
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie.
Ahsan Abdullah 1 Data Warehousing Lecture-11 Multidimensional OLAP (MOLAP) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for.
1 Data Warehousing Lecture-13 Dimensional Modeling (DM) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research.
Ahsan Abdullah 1 Data Warehousing Lecture-7De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-4 Introduction and Background Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
1 Data Warehouses BUAD/American University Data Warehouses.
Ahsan Abdullah 1 Data Warehousing Lecture-18 ETL Detail: Data Extraction & Transformation Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. &
Ahsan Abdullah 1 Data Warehousing Lecture-9 Issues of De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
ETL Extract. Design Logical before Physical Have a plan Identify Data source candidates Analyze source systems with data- profiling tools Receive walk-through.
Chapter 17 Creating a Database.
Data Warehousing 1 Lecture-28 Need for Speed: Join Techniques Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Data Warehousing.
1 Data Warehousing Lecture-14 Process of Dimensional Modeling Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Ahsan Abdullah 1 Data Warehousing Lecture-20 Data Duplication Elimination & BSN Method Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-2 Introduction and Background Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for.
MIS2502: Data Analytics The Information Architecture of an Organization.
Ahsan Abdullah 1 Data Warehousing Lecture-10 Online Analytical Processing (OLAP) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
Decision Support and Date Warehouse Jingyi Lu. Outline Decision Support System OLAP vs. OLTP What is Date Warehouse? Dimensional Modeling Extract, Transform,
Data Warehousing Lecture-31 Supervised vs. Unsupervised Learning Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Data Staging Data Loading and Cleaning Marakas pg. 25 BCIS 4660 Spring 2012.
Data resource management
Database Management System Prepared by Dr. Ahmed El-Ragal Reviewed & Presented By Mr. Mahmoud Rafeek Alfarra College Of Science & Technology- Khan younis.
1 Data Warehousing Lecture-15 Issues of Dimensional Modeling Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Data Warehousing Lecture-30 What can Data Mining do? Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-29 Brief Intro. to Data Mining Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
7 Strategies for Extracting, Transforming, and Loading.
DWH-Ahsan Abdullah 1 Data Warehousing Lecture-22 DQM: Quantifying Data Quality Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center.
Ahsan Abdullah 1 Data Warehousing Lecture-6Normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Ahsan Abdullah 1 Data Warehousing Lecture-8 De-normalization Techniques Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
The Need for Data Analysis 2 Managers track daily transactions to evaluate how the business is performing Strategies should be developed to meet organizational.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Unit 4 : Data Warehousing Technologies and Implementation Lecturer : Bijay Mishra.
C Copyright © 2007, Oracle. All rights reserved. Introduction to Data Warehousing Fundamentals.
Management Information Systems by Prof. Park Kyung-Hye Chapter 7 (8th Week) Databases and Data Warehouses 07.
Plan for Populating a DW
Lecture-3 Introduction and Background
Lecture-32 DWH Lifecycle: Methodologies
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
Lecture-38 Case Study: Agri-Data Warehouse
Lecture-35 DWH Implementation: Pitfalls, Mistakes, Keys
Data Warehouse.
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
Data Warehousing Concepts
David Gilmore & Richard Blevins Senior Consultants April 17th, 2012
Presentation transcript:

Ahsan Abdullah 1 Data Warehousing Lecture-16 Extract Transform Load (ETL) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research National University of Computers & Emerging Sciences, Islamabad

Ahsan Abdullah 2 Extract Transform Load (ETL)

Ahsan Abdullah 3 Data Warehouse Server (Tier 1) OLAP Servers (Tier 2) Clients (Tier 3) Data Warehouse Operational Data Bases Semistructured Sources MOLAP ROLAP Query/Reporting  Data MartsTools Meta Data Data sources Data (Tier 0)                IT Users   Business Users   Business Users Data Mining  Archived data Analysis  Extract Transform Load (ETL)  www data Putting the pieces together {Comment: All except ETL washed out look}

Ahsan Abdullah 4 The ETL Cycle EXTRACT The process of reading data from different sources. TRANSFORM The process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database. LOAD The process of writing the data into the target source. TRANSFORMCLEANSE LOAD Data Warehouse OLAP Temporary Data storage EXTRACT MIS Systems (Acct, HR) Legacy Systems Other indigenous applications (COBOL, VB, C++, Java)  Archived data www data

Ahsan Abdullah 5 ETL Processing Extracts Extracts from from source source systems systems Data Data Movement Movement Data Data Transfor- Transfor- mation mation Data Data Loading Loading Index Index Mainte- Mainte- nance nance Statistics Statistics Collection Collection Data Data Cleansing Cleansing ETL is independent yet interrelated steps. It is important to look at the big picture. Data acquisition time may include… Backup Back-up is a major task, its a DWH not a cube Note: Backup will come as other elements after “Statistical collection”

Ahsan Abdullah 6 Overview of Data Extraction First step of ETL, followed by many. Source system for extraction are typically OLTP systems. A very complex task due to number of reasons:  Very complex and poorly documented source system.  Data has to be extracted not once, but number of times.  The process design is dependent on:  Which extraction method to choose?  How to make available extracted data for further processing?

Ahsan Abdullah 7 Types of Data Extraction  Logical Extraction  Full Extraction  Incremental Extraction  Physical Extraction  Online Extraction  Offline Extraction  Legacy vs. OLTP

Ahsan Abdullah 8 Logical Data Extraction  Full Extraction  The data extracted completely from the source system.  No need to keep track of changes.  Source data made available as-is with any additional information.  Incremental Extraction  Data extracted after a well defined point/event in time.  Mechanism used to reflect/record the temporal changes in data (column or table).  Sometimes entire tables off-loaded from source system into the DWH.  Can have significant performance impacts on the data warehouse server.

Ahsan Abdullah 9 Physical Data Extraction…  Online Extraction  Data extracted directly from the source system.  May access source tables through an intermediate system.  Intermediate system usually similar to the source system.  Offline Extraction  Data NOT extracted directly from the source system, instead staged explicitly outside the original source system.  Data is either already structured or was created by an extraction routine.  Some of the prevalent structures are:  Flat files  Dump files  Redo and archive logs  Transportable table-spaces

Ahsan Abdullah 10 Physical Data Extraction  Legacy vs. OLTP  Data moved from the source system  Copy made of the source system data  Staging area used for performance reasons

Ahsan Abdullah 11 Data Transformation  Basic tasks 1. Selection 2. Splitting/Joining 3. Conversion 4. Summarization 5. Enrichment

Ahsan Abdullah 12 Data Transformation Basic Tasks  Selection

Ahsan Abdullah 13 Data Transformation Basic Tasks  Splitting/joining

Ahsan Abdullah 14 Data Transformation Basic Tasks  Conversion

Ahsan Abdullah 15 Data Transformation Basic Tasks: Conversion Example-1  Convert common data elements into a consistent form i.e. name and address.  Translation of dissimilar codes into a standard code. Field formatField data First-Family-titleMuhammad Ibrahim Contractor Family-title-comma-firstIbrahim Contractor, Muhammad Family-comma-first-titleIbrahim, Muhammad Contractor Natl. IDNID National IDNID F/NO-2 F-2 FL.NO.2 FL.2 FL/NO.2 FL-2 FLAT-2 FLAT# FLAT,2 FLAT-NO-2 FL-NO.2 FLAT No. 2

Ahsan Abdullah 16  Data representation change  EBCIDIC to ASCII  Operating System Change  Mainframe (MVS) to UNIX  UNIX to NT or XP  Data type change  Program (Excel to Access), database format (FoxPro to Access).  Character, numeric and date type.  Fixed and variable length. Data Transformation Basic Tasks: Conversion Example-2

Ahsan Abdullah 17 Data Transformation Basic Tasks  Summarization

Ahsan Abdullah 18 Data Transformation Basic Tasks  Enrichment

Ahsan Abdullah 19 Data Transformation Basic Tasks: Enrichment Example  Data elements are mapped from source tables and files to destination fact and dimension tables.  Default values are used in the absence of source data.  Fields are added for unique keys and time elements. Input Data HAJI MUHAMMAD IBRAHIM, GOVT. CONT. K. S. ABDULLAH & BROTHERS, MAMOOJI ROAD, ABDULLAH MANZIL RAWALPINDI, Ph Parsed Data First Name: HAJI MUHAMMAD Family Name: IBRAHIM Title: GOVT. CONT. Firm: K. S. ABDULLAH & BROTHERS Firm Location: ABDULLAH MANZIL Road: MAMOOJI ROAD Phone: City: RAWALPINDI Code:46200

Ahsan Abdullah 20 Aspects of Data Loading Strategies  Need to look at:  Data freshness  System performance  Data volatility  Data Freshness  Very fresh low update efficiency  Historical data, high update efficiency  Always trade-offs in the light of goals  System performance  Availability of staging table space  Impact on query workload  Data Volatility  Ratio of new to historical data  High percentages of data change (batch update)

Ahsan Abdullah 21 Three Loading Strategies  Once we have transformed data, there are three primary loading strategies:  Full data refresh with BLOCK INSERT or ‘block slamming’ into empty table.  Incremental data refresh with BLOCK INSERT or ‘block slamming’ into existing (populated) tables.  Trickle/continuous feed with constant data collection and loading using row level insert and update operations.