7 Strategies for Extracting, Transforming, and Loading.


1 7 Strategies for Extracting, Transforming, and Loading

2 Extraction, Transformation, and Loading Processes (ETL)
Extract source data
Transform and cleanse data
Index and summarize
Load data into warehouse
Detect changes
Refresh data
[Diagram labels: Operational systems, ETL (Programs, Tools, Gateways), Warehouse]

3 Data Staging Area
The construction site for the warehouse
Required by most implementations
Composed of ODS, flat files, or relational server tables
Frequently configured as multitier staging
[Diagram labels: Extract, Transform, Transport, Transform, Transport (Load); Operational environment, Staging environment, Warehouse environment]

4 Preferred Traditional Staging Model
Remote staging: Data staging area in its own environment, avoiding negative impact on the warehouse environment

5 Extracting Data
Routines developed to select fields from source
Various data formats
Rules, audit trails, error correction facilities
Various techniques

6 Examining Source Systems
Production
  – Legacy systems
  – Database systems
  – Vertical applications
Archive
  – Historical (for initial load)
  – Used for query analysis
  – May require transformations

7 Mapping
Defines which operational attributes to use
Defines how to transform the attributes for the warehouse
Defines where the attributes exist in the warehouse
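
The three roles of a mapping (which attribute, how to transform it, where it lands) can be sketched as a small lookup table. This is only an illustration: the attribute names, the transforms, and the `apply_mapping` helper are all hypothetical and do not come from the slides.

```python
# Hypothetical source-to-target mapping; every name here is illustrative.
MAPPING = [
    # (operational attribute, transform, warehouse column)
    ("CUST_NM", str.strip, "customer_name"),
    ("CUST_ST", str.upper, "state_code"),
    ("ORD_AMT", float, "order_amount"),
]


def apply_mapping(source_row, mapping=MAPPING):
    """Build a warehouse row from an operational row using the mapping."""
    return {
        target: transform(source_row[source])
        for source, transform, target in mapping
    }


# Attributes not listed in the mapping (INTERNAL_FLAG) are simply not carried over.
row = apply_mapping(
    {"CUST_NM": "  Ada Lovelace ", "CUST_ST": "ca", "ORD_AMT": "19.50", "INTERNAL_FLAG": "x"}
)
```

Keeping the mapping as data rather than code is one possible design choice; it makes the mapping itself something that can be recorded as metadata.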

8 Designing Extraction Processes
Analysis
  – Sources, technologies
  – Data types, quality, owners
Design options
  – Manual, custom, gateway, third-party
  – Replication, full, or delta refresh
Design issues
  – Batch window, volumes, data currency
  – Automation, skills needed, resources
Maintenance of metadata trail
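
The difference between a full and a delta refresh can be sketched as follows. The sketch assumes each source row carries a last-modified timestamp in a hypothetical `updated_at` field; the slides do not say how changes are identified, so treat this as one possible approach.

```python
from datetime import datetime


def extract_full(rows):
    """Full refresh: take every source row, regardless of age."""
    return list(rows)


def extract_delta(rows, last_extract):
    """Delta refresh: take only rows modified after the previous extract."""
    return [row for row in rows if row["updated_at"] > last_extract]


# Illustrative source rows; the 'updated_at' column is an assumption.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]
delta = extract_delta(source, last_extract=datetime(2024, 1, 1, 23, 59))
```

A delta refresh moves less data through the batch window, at the cost of needing a reliable change marker in the source.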

9 Importance of Data Quality
Business user confidence
Query and reporting accuracy
Standardization
Data integration

10 Benefits of Data Quality
Cleansed data is critical for:
  – Standardization within the warehouse
  – High-quality matching on names and addresses
  – Creation of accurate rules and constraints
  – Prediction and analysis
  – Creation of a solid infrastructure to support customer-centric business intelligence
  – Reduction of project risk
  – Reduction of long-term costs

11 Guidelines for Data Quality Operational data should not be used directly in the warehouse. Operational data must be cleaned for each increment. Operational data is not simply fixed by modifying applications.

12 Transformation
Transformation eliminates operational data anomalies:
  – Cleans
  – Standardizes
  – Presents subject-oriented data
[Diagram labels: Extract, Transform, Transport, Transform, Transport (Load); Restructure, Consolidate, Cleanse]

13 Transformation Routines
Cleansing data
Eliminating inconsistencies
Adding elements
Merging data
Integrating data
Transforming data before load

14 Why Transform?
In-house system development
Multipart keys
Multiple encoding
Multiple local standards

15 Why Transform?
Multiple files
Missing values
Duplicate values
Element names

16 Why Transform?
Element meaning
Input format
Referential integrity

17 Why Transform?
Name and address:
  – No unique key
  – Missing data values (NULLs)
  – Personal and commercial names mixed
  – Different addresses for the same member
  – Different names and spelling for the same member
  – Many names on one line
  – One name on two lines
  – The data may be in a single field of no fixed format
  – Each component of an address is in a specific field
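
A minimal sketch of one cleansing step for the name problems listed above: collapsing stray whitespace and line breaks, unifying case, and treating blank values as NULL. The `standardize_name` helper is hypothetical, and real name and address cleansing needs far more than this (spelling variants, mixed personal and commercial names, and so on).

```python
import re


def standardize_name(raw):
    """Return an upper-cased, single-spaced name, or None for missing values."""
    if raw is None:
        return None
    # Collapse runs of spaces, tabs, and line breaks ("one name on two lines").
    cleaned = re.sub(r"\s+", " ", raw).strip()
    if not cleaned:
        return None  # treat blank strings like NULLs
    return cleaned.upper()


examples = [standardize_name(value) for value in ["  john   SMITH", "John\nSmith", "   ", None]]
```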

18 Integration (Match and Merge)
[Diagram labels: Source, Target, Match and Merge, schema]
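
One simple way to picture match and merge: records that share a normalized match key are treated as the same entity, and their fields are merged by keeping the first non-missing value. The match key (name plus postcode) and both helper functions are assumptions made for this sketch; the slide does not specify a matching rule.

```python
def match_key(record):
    """Hypothetical match rule: normalized name plus normalized postcode."""
    name = " ".join(record["name"].split()).upper()
    postcode = record["postcode"].replace(" ", "").upper()
    return (name, postcode)


def match_and_merge(records):
    """Merge records with the same match key, keeping the first non-None value per field."""
    merged = {}
    for record in records:
        target = merged.setdefault(match_key(record), {})
        for field, value in record.items():
            if target.get(field) is None:
                target[field] = value
    return list(merged.values())


# Illustrative source records from two systems; all values are made up.
sources = [
    {"name": "John Smith", "postcode": "AB1 2CD", "phone": None},
    {"name": "john  smith", "postcode": "ab12cd", "phone": "555-0100"},
    {"name": "Mary Jones", "postcode": "XY9 8ZW", "phone": None},
]
customers = match_and_merge(sources)
```

Exact-key matching like this misses spelling variants; it only shows the overall shape of the step.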

19 Transformation Techniques
Merging data
  – Operational transactions do not usually map one-to-one with warehouse data.
  – Data for the warehouse is merged to provide information for analysis.
Adding keys to data

20 Transformation Techniques Time

21 Transformation Techniques
Adding a date stamp:
Fact table
  – Add triggers
  – Recode applications
  – Compare tables
Dimension table
Time representation
  – Point in time
  – Time span
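
Of the options listed, "compare tables" is the easiest to sketch without touching the operational applications: two keyed snapshots are compared, and each difference is recorded with a date stamp. The snapshot layout and the `compare_tables` helper are assumptions made for illustration.

```python
from datetime import date


def compare_tables(previous, current, stamp):
    """Return date-stamped change records between two snapshots keyed by row id."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append({"key": key, "change": "insert", "row": row, "date_stamp": stamp})
        elif previous[key] != row:
            changes.append({"key": key, "change": "update", "row": row, "date_stamp": stamp})
    for key, row in previous.items():
        if key not in current:
            changes.append({"key": key, "change": "delete", "row": row, "date_stamp": stamp})
    return changes


# Illustrative snapshots of the same operational table on two consecutive days.
yesterday = {1: {"status": "open"}, 2: {"status": "open"}}
today = {1: {"status": "closed"}, 3: {"status": "open"}}
changes = compare_tables(yesterday, today, date(2024, 1, 2))
```

Here the stamp is a point in time (the comparison date); a time-span representation would instead store a start and end date per row version.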

22 Transformation Techniques
Creating summary data:
  – During extraction on staging area
  – After loading onto the warehouse server
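
As a rough sketch of summarizing in the staging area, detail rows can be rolled up to a coarser level before they are loaded. The field names and the choice of a monthly, per-product summary are assumptions for this example.

```python
from collections import defaultdict


def summarize_by_month(detail_rows):
    """Roll daily sales rows up to (month, product) totals."""
    totals = defaultdict(float)
    for row in detail_rows:
        month = row["sale_date"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(month, row["product"])] += row["amount"]
    return dict(totals)


# Illustrative detail rows; dates are ISO-formatted strings by assumption.
detail = [
    {"sale_date": "2024-01-03", "product": "widget", "amount": 10.0},
    {"sale_date": "2024-01-17", "product": "widget", "amount": 5.0},
    {"sale_date": "2024-02-01", "product": "widget", "amount": 7.0},
]
summary = summarize_by_month(detail)
```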

23 Transformation Techniques
Creating artificial keys:
Use generalized or derived keys
Maintain the uniqueness of a row
Use an administrative process to assign the key
Concatenate operational key with number
  – Easy to maintain
  – Cumbersome keys
  – No clean value for retrieval
[Diagram: 109908 becomes 109908 01]
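
Two of the approaches above can be sketched side by side: an administrative process that hands out generated keys, and concatenating the operational key with a number. The class and function names are hypothetical, and the two-digit suffix joined without a separator is an assumption based on the slide's 109908 and 01 example.

```python
from itertools import count


class KeyAssigner:
    """Administrative process: assign one generated key per operational key."""

    def __init__(self, start=1):
        self._next_key = count(start)
        self._assigned = {}

    def key_for(self, operational_key):
        # Reuse the existing key so that each row stays uniquely identified.
        if operational_key not in self._assigned:
            self._assigned[operational_key] = next(self._next_key)
        return self._assigned[operational_key]


def concatenated_key(operational_key, number):
    """Concatenate the operational key with a two-digit number."""
    return f"{operational_key}{number:02d}"


assigner = KeyAssigner()
generated = [assigner.key_for(k) for k in ["109908", "220014", "109908"]]
versioned = concatenated_key("109908", 1)
```

The concatenated form is easy to produce, but it shows why the slide calls such keys cumbersome: the original value has to be parsed back out of the combined string.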

24 Where to Transform?
Choose wisely where the transformation takes place:
  – Operational platform
  – Staging area
  – Warehouse server

25 When to Transform?
Choose the transformation point wisely:
  – Workload
  – Environment impact
  – CPU use
  – Disk space
  – Network bandwidth
  – Parallel execution
  – Load window time
  – User information needs

26 Designing Transformation Processes
Analysis
  – Sources and target mappings, business rules
  – Key users, metadata, grain, verify integrity of data
Design options
  – Programming, Tools
Design issues
  – Performance
  – Size of the staging area
  – Exception handling, integrity maintenance

27 Loading Data into the Warehouse
Loading moves the data into the warehouse.
Subsequent refresh moves smaller volumes.
Business determines the cycle.
[Diagram labels: Extract, Transform, Transport, Transform, Transport (Load); Operational environment, Staging environment, Warehouse environment]

28 Extract versus Warehouse Processing Environment
Extract processing builds a new database after each time interval.
Warehouse processing adds changes to the database after each time interval.
[Diagram labels, shown once per approach: T1, T2, T3; Operational databases]
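
The contrast between the two environments can be sketched with plain dictionaries and lists. Both functions are hypothetical simplifications: extract processing replaces the database with the latest snapshot, while warehouse processing appends only what is new or changed in each interval.

```python
def extract_processing(snapshot):
    """Build a brand-new database from the latest operational snapshot."""
    return dict(snapshot)


def warehouse_processing(history, latest, snapshot, interval):
    """Append only rows that are new or changed since the previous interval."""
    for key, value in snapshot.items():
        if key not in latest or latest[key] != value:
            history.append((interval, key, value))
            latest[key] = value
    return history


# Illustrative operational snapshots at intervals T1, T2, and T3.
snapshots = {
    "T1": {"a": 1, "b": 2},
    "T2": {"a": 1, "b": 3},
    "T3": {"a": 1, "b": 3, "c": 4},
}

history, latest = [], {}
for interval, snapshot in snapshots.items():
    rebuilt = extract_processing(snapshot)  # previous contents are discarded
    warehouse_processing(history, latest, snapshot, interval)
```

After the loop, `rebuilt` holds only the T3 state, whereas `history` retains the change made at each interval.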

29 First-Time Load
Single event that populates the database with historical data
Involves a large volume of data
Uses distinct ETL tasks
Involves large amounts of processing after load

30 Refresh
Performed according to a business cycle
Simpler task
Less data to load than first-time load
Less complex ETL
Smaller amounts of postload processing

31 Building the Transportation Process
Specification:
  – Techniques and tools
  – File transfer methods
  – The load window
  – Time window for other tasks
  – First-time and refresh volumes
  – Frequency of the refresh cycle
  – Connectivity bandwidth

32 Building the Transportation Process
Test the proposed technique
Document proposed load
Gain agreement on the process
Monitor
Review
Revise

33 Granularity
Important design and operational issue
Low-level grain: Expensive, high level of processing, more disk, detail
High-level grain: Cheaper, less processing, less disk, little detail
Space requirements
  – Storage
  – Backup
  – Recovery
  – Partitioning
  – Load
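
A back-of-the-envelope calculation shows why grain drives space requirements. The figures below (1,000 products, 50 stores) are made up, and the estimate is an upper bound that assumes every product sells in every store in every period.

```python
def estimated_fact_rows(products, stores, periods):
    """Upper bound on fact rows: one row per product, store, and period."""
    return products * stores * periods


# Hypothetical volumes for one year of sales data.
daily_grain = estimated_fact_rows(products=1000, stores=50, periods=365)
monthly_grain = estimated_fact_rows(products=1000, stores=50, periods=12)
ratio = daily_grain / monthly_grain
```

Under these assumptions the daily grain needs roughly 30 times as many rows as the monthly grain, and that multiplier carries through to storage, backup, recovery, and load time.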

34 Post-Processing of Loaded Data
[Diagram labels: Extract, Transform, Transport; Summarize, Index]
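
The two post-load steps named on the slide, indexing and summarizing, might look like this on a small in-memory SQLite database. The table, column, and index names are illustrative, and SQLite merely stands in for whatever warehouse server is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load step (already transformed data); the table layout is an assumption.
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        ("2024-01-03", "widget", 10.0),
        ("2024-01-17", "widget", 5.0),
        ("2024-01-20", "gadget", 7.0),
    ],
)

# Post-processing: index the loaded table, then build a summary table from it.
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product)")
conn.execute(
    "CREATE TABLE sales_summary AS "
    "SELECT product, SUM(amount) AS total_amount FROM sales_fact GROUP BY product"
)

summary_rows = dict(conn.execute("SELECT product, total_amount FROM sales_summary"))
conn.close()
```

Building indexes after the load rather than before is a common choice, since the index is then built once instead of being maintained row by row during the load.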

