Presentation is loading. Please wait.

Presentation is loading. Please wait.

ETL By Dr. Gabriel.

Similar presentations


Presentation on theme: "ETL By Dr. Gabriel."— Presentation transcript:

1 ETL By Dr. Gabriel

2 ETL Process 4 major components: Extracting Cleaning and conforming
Gathering raw data from source systems and storing it in ETL staging environment Cleaning and conforming Processing data to improve its quality, format it, merge from multiple sources, enforce conformed dimensions Delivering Loading data into data warehouse tables Managing Management of ETL environment

3 ETL: Extracting Data profiling
Identifying data that changed since last load extraction

4 ETL: Cleaning and Conforming
Data cleansing Recording error events Audit dimensions Deduping Creating and maintaining conformed dimensions and facts

5 ETL: Delivering Implementation of SCD logic Surrogate key generation
Managing hierarchies in dimensions Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions Mini dimensions used to track changes of dimension attribute when type 2 technique is infeasible. Similar to junk dimensions Typically is used for large dimensions Combinations can be built in advance or on the fly Built from dimension table input

6 ETL: Delivering (Cont)
Small static dimensions Dimensions created by the ETL system without real source Lookup dimensions for translations of codes, etc. User maintained dimensions Master dimensions without real source system Descriptions, groupings, hierarchies created for reporting and analysis purposes.

7 ETL: Delivering (Cont)
Fact table loading Building and maintaining bridge dimension tables Handling late arriving data Management of conformed dimensions Administration of fact tables Building aggregations Building OLAP cubes Transferring DW data to other environment for specific purposes

8 ETL: Managing Management of ETL environment Goals Job scheduler
Reliability Availability Manageability Job scheduler backup system Recovery and restart system Version control system

9 ETL: Managing (Cont.) Version migration system Workflow monitor
Sorting system Analyzing dependencies and lineage Problem escalation system Parallelization Security system Compliance manager Metadata repository manager

10 ETL Process Planning High level source to target data flow diagram
Selection and implementation of ETL tool Development of default strategies for dimension management, error handling, and other processes Development data transformations diagrams by target table Development of job sequencing

11 ETL Process Developing one-time historic load
Build and test the historic dimension and fact tables load Developing incremental load process Build and test dimension and fact tables incremental load processes Build and test aggregate table loads and/or OLAP processing Design, build, and test the ETL system automation

12 ETL Tools: Build vs Buy Many off-the-shelf tools exist
Benefits are not seen right away Setup Learning curve High-end tools may not justify value for smaller warehouses

13 Off-the-shelf ETL Tools
Vendor Oracle Warehouse Builder (OWB) Oracle  Data Integrator (BODI) Business Objects IBM Information Server (Ascential) IBM SAS Data Integration Studio SAS Institute PowerCenter Informatica  Oracle Data Integrator (Sunopsis) Oracle Data Migrator Information Builders Integration Services Microsoft Talend Open Studio Talend DataFlow Group 1 Software (Sagent) Data Integrator Pervasive Transformation Server DataMirror Transformation Manager ETL Solutions Ltd. Data Manager Cognos DT/Studio Embarcadero Technologies ETL4ALL IKAN DB2 Warehouse Edition Jitterbit Pentaho Data Integration Pentaho 

14 ETL Specification Document
Can be as large as 100 pages per business process; In reality, the work starts after the high level design is documented in a few pages. Source-to-target mappings Data profiling reports Physical design decisions Default strategy for extracting from each major source system Archival strategy Data quality tracking and metadata Default strategy for managing changes to dimension attributes

15 ETL Specification Document (Cont)
System availability requirements and strategy Design of data auditing mechanism Location of staging areas Historic and incremental load strategies for each table Detailed table design Historic data load parameters (# of months) and volumes (# of rows) Incremental data volumes

16 ETL Specification Document (Cont)
Handling of late arriving data Load frequency Handling of changes in each dimension attribute (types 1,2,3) Table partitioning Overview of data sources; discussion of source-specific characteristics Extract strategy for the source data Change data capture logic for each source table Dependencies Transformation logic (diagram or pseudo code)

17 ETL Specification Document (Cont)
Preconditions to avoid error conditions Recovery and restart assumptions for each major step of the ETL pipeline Archiving assumptions for each table Cleanup steps Estimated effort Overall workflow Job sequencing Logical dependencies

18 Loading Pointers One time historic load
Disable RI constraints (FKs) and re-enable them after the load is complete Drop indexes and re-create them after the load is complete Use bulk loading techniques Not always the case

19 Loading Pointers (Cont)
Incremental load

20 Loading Pointers (Cont)
Sometimes historic and incremental load logic is the same; many times- is similar. Updating aggregations, if necessary Error handling

21 Sample: Generation of Surrogate Keys on SQL Server
As simple as: INTEGER = MAX(ID) + 1 FROM TableName But may not work with concurrent processes OR Create PROCEDURE pGetNextID VARCHAR(32), @SeedValue BIGINT OUTPUT) AS UPDATE Lookup_Seed = SeedValue = SeedValue + 1 WHERE SeedID Lookup_Seed table: SeedID varchar (32) SeedValue bigint

22 Questions ?


Download ppt "ETL By Dr. Gabriel."

Similar presentations


Ads by Google