1 Presentation # 506 David Stanford President Red Sky Data Inc. Design Tips for the Warehouse Architect IOUG Live! 2005.

2 Objectives
Obtain a clear understanding of data warehouse design ‘hot points’
Identify solutions and alternatives for these ‘hot points’
See how real-world solutions are implemented

3 Agenda
Top 10 Gotchya’s
Design Traps
  – Loading Dirty Fact Data
  – Surrogate Keys
  – The Staging Area
Slowly Changing Dimensions
Tracking All History
Audit Considerations
Bad & Missing Data
Administrative Fields
Other Tidbits of Advice

4 Dave’s Top 10 Gotchya’s
1. Failing to model for both a) the view of the data when the event occurred and b) the view of the data as of today’s reality
2. Limiting the number of dimensions
3. Failing to model and populate a meta data repository
4. Failing to provide sufficient audit capabilities to verify loads against source systems
5. Not using surrogate keys for everything

5 Dave’s Top 10 Gotchya’s
6. Failing to design an error correction process
7. Normalizing too much
8. Not using a staging area
9. Failing to load ALL of the fact data
10. Failing to classify incorrect data
11. Making it too complex!

6 Design Traps
Design Review
Staging Area
Surrogate Keys
Facts – Surrogates and Dirty Data

7 Data Warehouse Process
[Diagram, source: Enterprise Group. Components: Source OLTP Systems, Staging Area, Data Warehouse, Data Marts. Process stages: Design Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis, Resource Scheduling & Distribution. Supporting layers: Meta Data; System Monitoring. Data characteristics: Raw Detail, No/Minimal History; Integrated, Scrubbed, History, Summaries; Targeted, Specialized (OLAP).]

8 Where The Work Is
[Diagram, source: Enterprise Group. The Data Warehouse Process diagram (Source OLTP Systems, Staging Area, Data Warehouse, Data Marts, with the Design Mapping through Access & Analysis stages, Meta Data, and System Monitoring) carrying the callout: Over 80% of the work is here]

9 Warehouse Design
Normalized (Relational) Design
Dimensional Design – Star and Snowflake
Hybrid Design
  – In reality, the DW is more normalized but has elements of dimensional design
  – The data marts are star schemas but have elements of normalization

10 Modelling is not straightforward
[Diagram: Donation, Member, Income, Campaign, Time, Gender, Marital Status, Location, Age]

11 Should These Be Combined?
[Diagram: Donation, Member, Income, Campaign, Time, Gender, Marital Status, Location, Age]

12 Behind The Scenes
There are several aspects of a design that users don’t directly see:
  – Meta Data
  – Error Correction
  – Audit
  – Load Control (if not using a scheduling tool)
  – Transformation Tables (used for transforming the data prior to being loaded into the DW)

13 Behind The Scenes
[Diagram: Source OLTP Systems, Staging Area, Data Warehouse, Data Marts, with the behind-the-scenes components Error Correction, Meta Data, Audit, Load Control, Transform Tables]

14 A 10 Step Design Process
1. Identify major subject areas or topics
2. Declare the Grain
3. Add element of time to the tables
4. Create appropriate names for tables, columns, and views
5. Add derived fields where applicable
6. Add administrative fields
7. Consider security and privacy in design
8. Make sure data model answers the critical business questions
9. Consider meta data
10. Consider error correction
11. Performance considerations: Tune, Tune, Tune

15 Independent of Approach…
…the goal of the data model is to satisfy two primary criteria:
1. Meet Business Objectives
2. Provide Good Performance

16 Staging Area

17 Staging Area
[Diagram, source: Enterprise Group. The Data Warehouse Process diagram (Source OLTP Systems, Staging Area, Data Warehouse, Data Marts, with the Design Mapping through Access & Analysis stages, Meta Data, System Monitoring, and the Data Characteristics row), shown again for the Staging Area topic]

18 The Staging Area
Holds a mirror copy of the extract files
Allows pre-processing of the data before loading
Allows easier reloading (you WILL do this)
Keeps more control with the DW team, rather than with an external group (the extract team)

19 The Staging Area
Facilitates easier audit processes
Can facilitate error correction processes
Helps identify the Record Type (translates into easier ETL processing and logic)

20 Surrogate Keys

21 Surrogate Keys
A surrogate key is a system generated, unintelligent, single column, unique identifier for each row within a table
Always use surrogate keys for dimensions
Always use surrogate keys for the time dimension
Always use surrogate keys for transformation tables
Always use surrogate keys for EVERY table... and this includes FACT tables

22 Surrogate Keys Avoid…
Duplicate keys from different source systems
Recycling of primary keys
Use of the same key for different business rows
Lengthy composite key joins
Space in fact tables
Application changes or upgrades in source systems
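
A minimal Python sketch of the idea above, assuming an in-memory key map. The class name `SurrogateKeyMap`, the starting value 100, and the source-system labels "billing" and "claims" are hypothetical and not from the presentation. It illustrates how the same business key arriving from two source systems receives two distinct, meaningless integer keys.

```python
from itertools import count


class SurrogateKeyMap:
    """Assigns one meaningless integer per (source system, business key) pair."""

    def __init__(self, start=100):
        self._next = count(start)   # system-generated sequence
        self._keys = {}             # (source_system, business_key) -> surrogate

    def key_for(self, source_system, business_key):
        pair = (source_system, business_key)
        if pair not in self._keys:
            self._keys[pair] = next(self._next)
        return self._keys[pair]


keys = SurrogateKeyMap()
billing_member = keys.key_for("billing", 1)   # Id 1 in one source system
claims_member = keys.key_for("claims", 1)     # Id 1 reused by another system
```

Because the surrogate carries no meaning, a recycled or duplicated source key never collides inside the warehouse; a real implementation would more likely use a database sequence than a Python dictionary.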

23 Using Surrogates In Fact Tables
You will need a surrogate key on the fact table if you allow ‘unknown’ values into the fact table (which is recommended, by the way)
The Primary Key of a fact table is typically the combination of the base dimensions

24 Surrogates In Fact Tables

25 Surrogates In Fact Tables
Date Of First Service: 15-Jan-2001
Benefit Package: Family, Eye Coverage
Contract Product: ExtendaGroup
Member: David Stanford
Service Provider: Dr. Walters
Primary Diagnosis: Broken Arm
Amount: $123.34

26 Surrogates In Fact Tables
Date Of First Service: 15-Jan-2001
Benefit Package: Family, Eye Coverage
Contract Product: ExtendaGroup
Member: David Stanford
Service Provider: Dr. Walters
Primary Diagnosis: MISSING (Broken Arm)
Amount: $16,239.00

27 Surrogates In Fact Tables
Date Of First Service: 15-Jan-2001
Benefit Package: Family, Eye Coverage
Contract Product: ExtendaGroup
Member: David Stanford
Service Provider: Dr. Walters
Primary Diagnosis: MISSING (Heart Attack)
Amount: $16,
[Column shown on slide: Claim Line Key]
This results in a duplicate primary key in the table

28 Surrogates In Fact Tables
Thus the need for a surrogate primary key
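
The duplicate-key problem can be reproduced in a few lines. This is a sketch only: it uses Python's built-in `sqlite3` module as a stand-in database, and the table names, column names, key values, and amounts are hypothetical. Two different claims whose diagnosis lookup failed both receive the dummy key -99, so a primary key built from the dimension keys rejects the second row, while a surrogate Claim Line Key accepts both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Fact table keyed on the combination of its dimension keys.
conn.execute("""
    CREATE TABLE claim_fact_composite (
        service_date_key INTEGER, member_key INTEGER,
        provider_key INTEGER, diagnosis_key INTEGER, amount REAL,
        PRIMARY KEY (service_date_key, member_key, provider_key, diagnosis_key)
    )""")

# Same columns plus a surrogate primary key.
conn.execute("""
    CREATE TABLE claim_fact_surrogate (
        claim_line_key INTEGER PRIMARY KEY,
        service_date_key INTEGER, member_key INTEGER,
        provider_key INTEGER, diagnosis_key INTEGER, amount REAL
    )""")

MISSING = -99  # dummy diagnosis key used when the lookup fails
claims = [(20010115, 7, 3, MISSING, 100.0),   # illustrative amounts
          (20010115, 7, 3, MISSING, 200.0)]


def load(table, rows):
    """Insert rows one by one; return how many were rejected as duplicates."""
    rejected = 0
    for row in rows:
        try:
            conn.execute(
                f"INSERT INTO {table} (service_date_key, member_key, "
                "provider_key, diagnosis_key, amount) VALUES (?, ?, ?, ?, ?)",
                row)
        except sqlite3.IntegrityError:
            rejected += 1
    return rejected


rejected_composite = load("claim_fact_composite", claims)
rejected_surrogate = load("claim_fact_surrogate", claims)
```

With the composite key one of the two claims is lost, which means the load no longer ties out to the source system.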

29 Load “Dirty” Data Into The Fact
Ties out to source systems
Gains credibility with end users
Requires a few design resolutions:
  – Bad & Missing (BAM) Logic
  – Surrogate Keys in the Fact tables
Still 100% accurate – we don’t load the bad values, we identify the bad values for correction later
Empowers the End Users to decide if the “dirty” data will invalidate their analysis

30 Tracking History

31 Tracking History in Dimensions
Type 1 – No history
Type 2 – All history
Type 3 – Some history

32 Type 1 – No History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.; Date 01-Jan-2001

33 Type 1 – No History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.
Warehouse Transaction #2: Key 100; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.; Date 15-Mar-2001
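
A minimal sketch of the Type 1 behaviour using the Sandy Rubble records, assuming the dimension is held as a list of dictionaries. The function name `apply_type1` and the key numbering scheme are hypothetical.

```python
def apply_type1(dimension, source_row, load_date):
    """Type 1: overwrite the row for this business Id, keeping no history."""
    for row in dimension:
        if row["id"] == source_row["id"]:
            row.update(source_row)      # old values are lost
            row["date"] = load_date
            return row
    new_row = {"key": 100 + len(dimension), **source_row, "date": load_date}
    dimension.append(new_row)
    return new_row


member_dim = []
apply_type1(member_dim, {"id": 1, "name": "Sandy Rubble",
                         "address": "23 Boulder Rd", "city": "Bedrock",
                         "salutation": "Ms."}, "01-Jan-2001")
apply_type1(member_dim, {"id": 1, "name": "Sandy Rubble",
                         "address": "42 Slate Ave", "city": "GravelPit",
                         "salutation": "Mrs."}, "15-Mar-2001")
```

After both transactions the dimension still holds a single row with Key 100, now carrying the GravelPit address.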

34 Type 2 – All History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.; Date 01-Jan-2001
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.
Warehouse Transaction #2: Key 101; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.; Date 15-Mar-2001
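
A minimal sketch of the Type 2 behaviour, again assuming a list of dictionaries. The function name `apply_type2` and the key numbering are hypothetical. Every change produces a new row with a new surrogate key; an unchanged source row produces nothing.

```python
def apply_type2(dimension, source_row, load_date):
    """Type 2: append a new version row whenever any column has changed."""
    versions = [r for r in dimension if r["id"] == source_row["id"]]
    if versions:
        latest = versions[-1]
        if all(latest[col] == val for col, val in source_row.items()):
            return latest               # nothing changed, no new version
    new_row = {"key": 100 + len(dimension), **source_row, "date": load_date}
    dimension.append(new_row)
    return new_row


member_dim = []
first = {"id": 1, "name": "Sandy Rubble", "address": "23 Boulder Rd",
         "city": "Bedrock", "salutation": "Ms."}
second = {"id": 1, "name": "Sandy Rubble", "address": "42 Slate Ave",
          "city": "GravelPit", "salutation": "Mrs."}
apply_type2(member_dim, first, "01-Jan-2001")
apply_type2(member_dim, second, "15-Mar-2001")
apply_type2(member_dim, second, "16-Mar-2001")   # repeat load, no change
```

Facts loaded before 15-Mar-2001 keep pointing at Key 100 and therefore still see the Bedrock address.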

35 Type 3 – Some History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Original Salutation Ms.; Salutation Ms.; Date 01-Jan-2001

36 Type 3 – Some History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Original Salutation Ms.; Salutation Mrs.; Date 15-Mar-2001
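
A minimal sketch of the Type 3 behaviour, assuming one row per business Id held as a dictionary. The function names `insert_type3` and `update_type3` are hypothetical. The row is overwritten in place, but the first salutation is preserved in a separate column.

```python
def insert_type3(source_row, key, load_date):
    """First load: the original and the current salutation are the same."""
    return {"key": key, **source_row,
            "original_salutation": source_row["salutation"],
            "date": load_date}


def update_type3(row, source_row, load_date):
    """Later loads overwrite every column except original_salutation."""
    row.update(source_row)
    row["date"] = load_date
    return row


member = insert_type3({"id": 1, "name": "Sandy Rubble",
                       "address": "23 Boulder Rd", "city": "Bedrock",
                       "salutation": "Ms."}, 100, "01-Jan-2001")
update_type3(member, {"id": 1, "name": "Sandy Rubble",
                      "address": "42 Slate Ave", "city": "GravelPit",
                      "salutation": "Mrs."}, "15-Mar-2001")
```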

37 More Dimension Types…Combinations
Type 3 Prime – Types 1 and 2 (the most common)
Type 4 – Types 1 and 3
Type 5 – Types 2 & 3
Type 6 – Types 1, 2, and 3 (the second most common)

38 Trigger Fields
Trigger Fields are fields within a table for which you want to track history
Non-Trigger Fields are those for which you do not want to track history

39 Type 3 Prime – All and No History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Expiry Date: Null

40 Non Trigger Field Update
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Ms.; Date 15-Mar-2001

41 Trigger & Non Trigger Field Update
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.; Date 01-Jan-2001
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.
Warehouse Transaction #2: Key 101; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.; Date 15-Mar-2001

42 Changes One At A Time
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Ms.; Date 01-Jan-2001
Source Transaction #3: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.
Warehouse Transaction #2: Key 101; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.; Date 15-Mar-2001
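
The trigger / non-trigger distinction can be sketched as one routing function. The assumptions here are that salutation is the only trigger field (as in the examples above), that the dimension is a list of dictionaries, and that the non-trigger update refreshes the row date (the slides show both behaviours). The names `TRIGGER_FIELDS` and `apply_change`, the returned labels, and the February load date are hypothetical.

```python
TRIGGER_FIELDS = {"salutation"}   # assumption: only salutation keeps history


def apply_change(dimension, source_row, load_date):
    """Route a source row to a new version or an in-place overwrite."""
    versions = [r for r in dimension if r["id"] == source_row["id"]]
    if not versions:
        dimension.append({"key": 100 + len(dimension), **source_row,
                          "date": load_date})
        return "new"
    current = versions[-1]
    changed = {col for col, val in source_row.items() if current[col] != val}
    if changed & TRIGGER_FIELDS:
        # A trigger field changed: add a version row (Type 2 behaviour).
        dimension.append({"key": 100 + len(dimension), **source_row,
                          "date": load_date})
        return "trigger field modify"
    if changed:
        # Only non-trigger fields changed: overwrite (Type 1 behaviour).
        current.update(source_row)
        current["date"] = load_date
        return "non-trigger field modify"
    return "no change"


member_dim = []
base = {"id": 1, "name": "Sandy Rubble", "address": "23 Boulder Rd",
        "city": "Bedrock", "salutation": "Ms."}
moved = {**base, "address": "42 Slate Ave", "city": "GravelPit"}
married = {**moved, "salutation": "Mrs."}
results = [apply_change(member_dim, base, "01-Jan-2001"),
           apply_change(member_dim, moved, "01-Feb-2001"),
           apply_change(member_dim, married, "15-Mar-2001")]
```

Applying the address change and the salutation change one at a time leaves two rows, Key 100 with the new address and the old salutation, and Key 101 with both new values.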

43 Expect To Track Everything
Users want to view the data as it was when the transaction or event occurred AND…
Users want to view the data in the context of today’s realities
THUS, model for both!

44 Add ‘Current’ Columns
In order to provide these two views, consider adding ‘current’ columns to tables. This is a special Type 6.
These fields get updated in historical records when a trigger field changes value in the current record.
This simplifies the use of the DW by the users
It’s easier to understand than having to write complex SQL

45 Type 6 – All, Some, and No History
Source Transaction #1: Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.
Warehouse Transaction #1: Key 100; Id 1; Name Sandy Rubble; Address 23 Boulder Rd; City Bedrock; Salutation Ms.; Current Sal’n Mrs.; Date 01-Jan-2001
Source Transaction #2: Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.
Warehouse Transaction #2: Key 101; Id 1; Name Sandy Rubble; Address 42 Slate Ave; City GravelPit; Salutation Mrs.; Current Sal’n Mrs.; Date 15-Mar-2001
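
A minimal sketch of the ‘current’ column, assuming a list of dictionaries and a hypothetical column name `current_salutation`. Each new version is appended as in Type 2, and the current value is then written back onto every historical row for the same business Id.

```python
def add_version(dimension, source_row, load_date):
    """Append a version row, then refresh the 'current' column on all rows."""
    dimension.append({"key": 100 + len(dimension), **source_row,
                      "date": load_date})
    for row in dimension:
        if row["id"] == source_row["id"]:
            row["current_salutation"] = source_row["salutation"]


member_dim = []
add_version(member_dim, {"id": 1, "name": "Sandy Rubble",
                         "address": "23 Boulder Rd", "city": "Bedrock",
                         "salutation": "Ms."}, "01-Jan-2001")
add_version(member_dim, {"id": 1, "name": "Sandy Rubble",
                         "address": "42 Slate Ave", "city": "GravelPit",
                         "salutation": "Mrs."}, "15-Mar-2001")
```

A user can then group old facts by `salutation` for the view at the time of the event, or by `current_salutation` for today's view, without writing a self-join.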

46 Most Recent Flag
Tracks the Most Recent record in time (not loaded, but based on a time series)
Should be added to the dimensions as a Yes/No (1/0) field
The most recently loaded record is set to Yes, all other records are set to No
Allows user to restrict on the Most Recent Flag to get a view of the world today
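
One way to maintain the flag, sketched under the assumption that "most recent" is decided by an effective date on each row rather than by load order. The column names `effective_date` and `most_recent_flg` and the function name are hypothetical.

```python
from datetime import date


def set_most_recent_flags(dimension):
    """Set most_recent_flg to 1 on the latest row per business Id, else 0."""
    latest = {}
    for row in dimension:
        best = latest.get(row["id"])
        if best is None or row["effective_date"] > best["effective_date"]:
            latest[row["id"]] = row
    for row in dimension:
        row["most_recent_flg"] = 1 if latest[row["id"]] is row else 0
    return dimension


# Rows deliberately loaded out of date order.
member_dim = [
    {"key": 101, "id": 1, "effective_date": date(2001, 3, 15)},
    {"key": 100, "id": 1, "effective_date": date(2001, 1, 1)},
]
set_most_recent_flags(member_dim)
flags = {row["key"]: row["most_recent_flg"] for row in member_dim}
```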

47 Double Keying Type 2 Dimensions
Double surrogate key in Type 2 dimensions
1 key is unique for each individual row
1 key is unique for each individual business key
Protects against:
  – Authoritative source system changes / duplication

48 Rapidly Changing Dimensions
Rapidly Changing Dimensions (RCD’s) need to be partitioned
  – Use Oracle partitioning
  – Include the native partition key in the dimension
  – Or split into several tables

49 BAM Rules, Audit & Administrative Fields

50 Bad & Missing Fact Data
Bad and/or missing data will always be an issue
The source data is never completely clean
There are always exceptions
Recall that you need to tie back into the source systems for your audit, thus you must load this ‘incorrect’ data
Put the decisions into the hands of your users – don’t decide for them whether the data is good enough or not
Need to develop Bad & Missing (BAM) Rules

51 BAM Rules
Used in the ETL process when loading data that references other tables (e.g. loading a fact table and looking up the dimension record)
Need a series of rules to follow if the lookup fails
Create a set of ‘dummy’ records for each referenced table (for Referential Integrity purposes)
In snapshots, may need a set of dummy records per snapshot period

52 BAM Rules – Dummy Records
-99 Error/Missing
-88 Not Available
-77 Acceptable Error
-66 Temporarily Not Available
Not Applicable
A great hockey team! Gretzky, Lindros, Coffey, Lemieux, Bunny Larocque

53 Dummy Record Meanings
-99 A data element is missing or a lookup into another table cannot find a matching value (e.g. missing foreign key). The source record is still loaded and the column value is set to -99.
-88 ‘Data not available’. This data element is not available from the source record.
-77 ‘Acceptable Errors’ that will not be corrected. This data element was invalid (set to -99) during the initial load and will not be corrected or reloaded.
-66 Data is temporarily not available. Usually used in a multiple pass loading process.
‘Not Applicable’. This data element is not required in the context of the record.
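
A minimal sketch of a BAM lookup, assuming dictionaries stand in for the dimension and its lookup index. The diagnosis source codes ("BA", "HA", "ZZ"), the surrogate keys 1 and 2, the second and third amounts, and the function name `resolve_key` are hypothetical; the dummy key values are the ones listed above. Only the -99 rule is implemented here.

```python
BAM_CODES = {-99: "Error/Missing", -88: "Not Available",
             -77: "Acceptable Error", -66: "Temporarily Not Available"}

# Hypothetical diagnosis dimension: surrogate key -> description.
diagnosis_dim = {1: "Broken Arm", 2: "Heart Attack"}
diagnosis_dim.update(BAM_CODES)    # dummy rows keep referential integrity

# Lookup index from the source code to the surrogate key.
diagnosis_index = {"BA": 1, "HA": 2}


def resolve_key(index, source_value):
    """Return the surrogate key, or -99 when the value is missing or unmatched."""
    return index.get(source_value, -99)


source_claims = [("BA", 123.34), ("ZZ", 50.00), (None, 75.00)]
fact_rows = [(resolve_key(diagnosis_index, code), amount)
             for code, amount in source_claims]
```

All three source claims reach the fact table, and every fact key still points at a real dimension row, so the load ties out to the source while the bad values stay identifiable.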

54 Error Correction Process
An area that you can report from and reload from
Hold or point to the original source record and be able to recreate it (the DW has lost the original value once tagged to a BAM rule)
Can be one summary table with standard error types
For more detail, create one error table for each target table
Create a series of error flag columns in the error table indicating what went wrong
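
A sketch of the detail-mode idea, assuming one error record per failed fact row. The function names, the column names (`original_diagnosis`, `diagnosis_lookup_failed`), and the source codes are hypothetical. The error record keeps the original source value that the fact row lost when it was tagged -99, which is what makes a later reload possible.

```python
def load_claims(source_rows, diagnosis_index):
    """Load every row; record an error entry for each failed lookup."""
    facts, errors = [], []
    for line_key, src in enumerate(source_rows, start=1):
        key = diagnosis_index.get(src["diagnosis"], -99)
        facts.append({"claim_line_key": line_key, "diagnosis_key": key})
        if key == -99:
            errors.append({"claim_line_key": line_key,
                           "original_diagnosis": src["diagnosis"],
                           "diagnosis_lookup_failed": 1})
    return facts, errors


def reload_errors(facts, errors, diagnosis_index):
    """Retry failed lookups; return the errors that are still unresolved."""
    by_key = {fact["claim_line_key"]: fact for fact in facts}
    remaining = []
    for err in errors:
        key = diagnosis_index.get(err["original_diagnosis"], -99)
        if key == -99:
            remaining.append(err)
        else:
            by_key[err["claim_line_key"]]["diagnosis_key"] = key
    return remaining


index = {"BA": 1}
facts, errors = load_claims([{"diagnosis": "BA"}, {"diagnosis": "HA"}], index)
index["HA"] = 2                      # the lookup data is corrected later
still_failing = reload_errors(facts, errors, index)
```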

55 Error Correction Model – Summary Mode

56 Error Correction Model – Detail Mode
[Diagram labels: Source, Stage, Load Process, Target, Error, Reload, Exists]

57 Audit Considerations
A key area that is quite often ignored
You must match to the source systems or be able to explain the differences
Auditing data loads (when did we start a load and what is the status?)
Without proof, you will not get the credibility!
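
A minimal reconciliation sketch, assuming the audit compares row counts and a summed measure between the source extract and the loaded target. The function name `audit_load` and the result keys are hypothetical; `Decimal` is used so that currency totals compare exactly.

```python
from decimal import Decimal


def audit_load(source_rows, target_rows, measure="amount"):
    """Compare row counts and measure totals between source and target."""
    source_total = sum((r[measure] for r in source_rows), Decimal("0"))
    target_total = sum((r[measure] for r in target_rows), Decimal("0"))
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "source_total": source_total,
        "target_total": target_total,
        "ties_out": (len(source_rows) == len(target_rows)
                     and source_total == target_total),
    }


source = [{"amount": Decimal("123.34")}, {"amount": Decimal("16239.00")}]
complete = audit_load(source, list(source))
partial = audit_load(source, source[:1])    # one row was dropped by the load
```

When `ties_out` is false, the difference between the two totals is the amount that has to be explained.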

58 Audit Model

59 Pulling Audit & Error Correction Together

60 Administrative Fields
Supports the ‘behind the scenes’ aspects
  – Loading
  – Querying
Different requirements for dimensions and facts
But try to standardize across all tables, even if the fields aren’t utilized today

61 Dimension Tables
Record Type – indicates New, Trigger Field Modify, Non-Trigger Field Modify, Delete, Correction
Active Flg – indicates a business key is active
Most Recent Flg – indicates the most recent row loaded within a business key
Effective Date – for the instance of that row
End Date – for the instance of that row
Create Date
Update Date
Create User
Update User

62 Fact Tables
Record Type
Active Flg
Most Recent Flg
Row Cnt
Partition Date – store the actual date value
Create Date
Update Date
Create User
Update User

63 Administrative Field Values
Use 1’s and 0’s in flag and count fields – they’re easier to add (but it really depends on what the user can best understand)
Always fill in date fields (use dummy start and end dates in time if needed)
Use triggers to populate the create/update dates and users
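
A sketch of trigger-populated administrative dates. It uses SQLite through Python's built-in `sqlite3` module purely as a stand-in; the table, column, and trigger names and the changed name value are hypothetical, and the create/update user fields are left out because SQLite has no notion of a database user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member_dim (
        member_key  INTEGER PRIMARY KEY,
        name        TEXT,
        create_date TEXT,
        update_date TEXT
    );

    -- Stamp create_date when a row is inserted.
    CREATE TRIGGER member_dim_created AFTER INSERT ON member_dim
    BEGIN
        UPDATE member_dim SET create_date = CURRENT_TIMESTAMP
        WHERE member_key = NEW.member_key;
    END;

    -- Stamp update_date when the business column changes.
    CREATE TRIGGER member_dim_updated AFTER UPDATE OF name ON member_dim
    BEGIN
        UPDATE member_dim SET update_date = CURRENT_TIMESTAMP
        WHERE member_key = NEW.member_key;
    END;
""")

query = "SELECT create_date, update_date FROM member_dim WHERE member_key = 100"
conn.execute("INSERT INTO member_dim (member_key, name) VALUES (100, 'Sandy Rubble')")
after_insert = conn.execute(query).fetchone()
conn.execute("UPDATE member_dim SET name = 'Sandy Slate' WHERE member_key = 100")
after_update = conn.execute(query).fetchone()
```

The load code never sets the dates itself, so every table gets the same behaviour regardless of which ETL job wrote the row.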

64 Other Tips “In The Bag”

65 Random Thoughts
Ensure you secure…
  – Budget
  – Top management commitment
Have focus (scope definition)
Develop incrementally
Have a business-driven solution
Use experienced designers and implementers
Use industry tools for development

66 …More Random Thoughts
Generally, make all of your column names unique across tables
Conform fact table measures (same name)
Don’t normalize too much – jump right into a dimensional design
Avoid retroactive changes
Don’t be afraid of many dimensions

67 Too Many Dimensions?
1 Fact, 41 CONFORMED dimensions

68 12 Common DW Design Mistakes (Intelligent Enterprise: Ralph Kimball Oct 2001)
1. Place text attributes in a fact table when you want to use them as constraints and groupings
2. Limit the use of verbose descriptions in your dimensions to save space
3. Split hierarchy and hierarchy levels into multiple dimension tables
4. Delay dealing with slowly changing dimensions
5. Use smart keys to join dimension and fact tables
6. Add dimensions to fact tables before declaring the grain

69 12 Common DW Design Mistakes (Intelligent Enterprise: Ralph Kimball Oct 2001)
7. Declare that the dimensional model is based on a specific report
8. Mix different grains in one fact table
9. Leave lowest-level atomic data in non-dimensional format
10. Avoid building aggregates and use hardware for performance improvements
11. Fail to conform fact data
12. Fail to conform dimension data

70 In Summary
Avoid “Dave’s Gotchya’s”
Be careful in your design
Meet business requirements
Address the ‘behind the scenes’ issues
Remember: DW design is not a science, it is an art…so be creative!

71 Questions & Answers
David Stanford
Thank You!

