Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 TechnoLogica Ltd. 3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines) DW Concepts Dimension.

Similar presentations


Presentation on theme: "1 TechnoLogica Ltd. 3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines) DW Concepts Dimension."— Presentation transcript:

1 1 TechnoLogica Ltd. 3, Sofiisko Pole Str. tel: (+ 3592) (ten lines) DW Concepts Dimension Modeling Techniques Milena Gerova Project Manager

2 2 TechnoLogica DW Projects  Business Management System National Health Insurance Fund ( – current)  Customer Data Integration Allianz Bulgaria Holding ( – current)  Regulatory Reporting System BULBANK ( )  Information System Monetary Statistics Bulgarian National Bank (April 2003 – August 2004)  Management Information System BULBANK (January June 2002)

3 3 Agenda  DW Terminology Overview  Dimensional Modeling  Dimension Types  History and Dimensions  Hierarchy in Dimensions

4 4 The data warehouse must  Make an organization’s information easily accessible.  Present the organization’s information consistently.  Be adaptive and resilient to change  Be a secure bastion that protects our information assets.  Serve as the foundation for improved decision making  The business community must accept the data warehouse if it is to be deemed successful.

5 5 Components of a Data Warehouse

6 6 Dimensional Modeling  Dimensional modeling is a new name for an old technique for making databases simple and understandable  Dimensional modeling is quite different from third- normal-form (3NF) modeling ERM ->The Transaction Processing Model o One table per entity o Minimize data redundancy o Optimize update DM -> The data warehousing model o One fact table for a process in the organization o Maximize understandability o Optimized for retrieval o Resilient to change

7 7 Star Dimensional Modeling

8 8 Four-Step Dimensional Design Process  1. Select the business process to model.  2. Declare the grain of the business process.  3. Choose the dimensions that apply to each fact table row.  4. Identify the numeric facts that will populate each fact table row.

9 9 Dimensions  Determine these by the ways you want to slice and dice the data  Small number of rows compared to facts  Usually 5-10 dimensions surrounding a fact table  Time is almost always a dimension used by every fact  Track history  Uses Surrogate Keys  Hierarchies are usually built into them if possible

10 10 Date Dimension  The date dimension is the one dimension nearly guaranteed to be in every data mart  Date Dimension = Time Dimension before  We can build the date dimension table in advance (5-10 years -> only 3,650 rows)

11 11 Date Dimension

12 12 Date Dimension

13 13 Date Dimension Data warehouses always need an explicit date dimension table. There are many date attributes not supported by the SQL date function, including fiscal periods, seasons, holidays, and weekends. Rather than attempting to determine these nonstandard calendar calculations in a query, we should look them up in a date dimension table. select sum(f.amount_sold) from DATE_DIM d, FACT f where d.Calendar_Month = ‘January’ and d.id = f.date_dim_id;

14 14 Dimension Normalization (Denormalized dimension)

15 15 Dimension Normalization (Denormalized dimension)

16 16 Dimension Normalization (Snowflaking)

17 17 Dimension Normalization (Snowflaking)  The dimension tables should remain as flat tables physically.  Normalized, snowflaked dimension tables penalize cross-attribute browsing and prohibit the use of bit-mapped indexes.  Disk space savings gained by normalizing the dimension tables typically are less than 1 percent of the total disk space needed for the overall schema

18 18 Too Many Dimensions

19 19 Too Many Dimensions  A very large number of dimensions typically is a sign that several dimensions are not completely independent and should be combined into a single dimension.  If our design has 25 or more dimensions, we should look for ways to combine correlated dimensions into a single dimension  It is a dimensional modeling mistake to represent elements of a hierarchy as separate dimensions in the fact table.

20 20 Surrogate Keys  Every join between dimension and fact tables in the data warehouse should be based on meaningless integer surrogate keys.  You should avoid using the natural operational production codes. None of the data warehouse keys should be smart, where you can tell something about the row just by looking at the key.

21 21 Surrogate Keys Surrogate keys are like an immunization for the data warehouse  Buffer the data warehouse environment from operational changes  Performance advantages The smaller surrogate key translates into smaller fact tables, smaller fact table indices, and more fact table rows per block input-output operation  Surrogate keys are used to record dimension conditions that may not have an operational code “No Promotion in Effect”, “Date Not Applicable.”

22 22 Surrogate Keys  The date dimension is the one dimension where surrogate keys should be assigned in a meaningful, sequential order  Surrogate keys are needed to support one of the primary techniques for handling changes to dimension table attributes  Don’t use concatenated or compound keys for dimension tables

23 23 Data Warehouse Bus Architecture

24 24 Data Warehouse Bus Matrix

25 25 Conformed Dimensions  Most dimensions are defined naturally at the most granular level possible  Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension  They have consistent dimension keys, consistent attribute column names, consistent attribute definitions, and consistent attribute values  The conformed dimension may be the same physical table within the database or may be duplicated synchronously in each data mart

26 26 Conformed Dimensions  Roll-up dimensions conform to the base-level atomic dimension if they are a strict subset of that atomic dimension.

27 27 Conformed Dimensions  They should be built once in the staging area  They must be published prior to staging of the fact data  The dimension authority has responsibility for defining, maintaining, and publishing a particular dimension or its subsets to all the data mart clients who need it

28 28 Tracking History in Dimensions  Unchanging Dimensions  Changing, but Original Values are Irrelevant A phone number in a customer record  Slowly Changing Dimensions (SCD) A customer address, manager  Rapidly Changing Dimensions Income range of a customer  Continuously Changing Dimensions Customer age

29 29 Type 1: Overwrite the Value  The type 1 response is easy to implement, but:  it does not maintain any history of prior attribute values  any preexisting aggregations based on the department value will need to be rebuilt

30 30  The type 2 response is the primary technique for accurately tracking slowly changing dimension attributes. It is extremely powerful because the new dimension row automatically partitions history in the fact table.  It’s not suitable for dimension tables that already exceed a million rows Type 2: Add a Dimension Row

31 31 Type 2: Add a Dimension Row

32 32 Type 3: Add a Dimension Column  The type 3 slowly changing dimension technique allows us to see new and historical fact data by either the new or prior attribute values.

33 33 Hybrid SCD Techniques Series of Type 3 Attributes Predictable Changes with Multiple Version Overlays  Report each year’s sales using the district map for that year.  Report each year’s sales using a district map from an arbitrary different year.  Report an arbitrary span of years’ sales using a single district map from any chosen year. The most common version of this requirement would be to report the complete span of fact data using the current district map.

34 34 Hybrid SCD Techniques Type 2 with "Current" Overwrite Unpredictable Changes with Single-Version Overlay preserves historical accuracy while supporting the ability to report historical data according to the current values

35 35 Dimension Table Staging

36 36 Dimension Table Staging

37 37 Degenerate Dimension  Dimension keys without corresponding dimension tables  Operational control numbers such as order numbers, invoice numbers, and bill-oflading numbers usually give rise to empty dimensions  Degenerate dimensions are stored in the fact tables where the grain of the table is the document itself or a line item in the document

38 38 Junk Dimensions  What to do with flags and indicators  Leave the flags and indicators unchanged in the fact table row.  Make each flag and indicator into its own separate dimension  Strip out all the flags and indicators from the design.  A junk dimension is a convenient grouping of typically low-cardinality flags and indicators

39 39 Junk Dimensions  Whether to use junk dimension  5 indicators, each has 3 values -> 243 (3 5 ) rows  5 indicators, each has 100 values -> 100 million (100 5 ) rows  When to insert rows in the dimension

40 40 Multiple Currencies

41 41 Customer Dimension  Critical element for effective CRM  The most challenging dimension for any data warehouse  extremely deep (with millions of rows)  extremely wide (with dozens or even hundreds of attributes)  sometimes subject to rather rapid change

42 42 Customer Dimension Name and Address Parsing

43 43 Customer Dimension Other Common Customer Attributes  Gender  Ethnicity  Age or other life-stage classifications  Income or other lifestyle classifications  Status (for example, new, active, inactive, closed)  Referring source  Business-specific market segment  Scores characterizing the customer, such as purchase behavior, payment behavior, product preferences

44 44 Customer Dimension Aggregated Facts as Attributes  These attributes are to be used for constraining and labeling; they are not to be used in numeric calculations  Focus on those which will be used frequently  Minimize the frequency with which these attributes need to be updated  Replace metrics with more meaningful descriptive values, such as “High Spender”

45 45 Dimension Outriggers for a Low-Cardinality Attribute Set

46 46 Rapidly Changing Customer Dimensions  Challenges  It generally takes too long to constrain or browse among the relationships in such a big table  It is difficult to use previously described techniques for tracking changes in these large dimensions  One solution is to break off frequently analyzed or frequently changing attributes into a separate dimension, referred to as a minidimension

47 47 Rapidly Changing Customer Dimensions The Mini Dimension with "Current" Overwrite

48 48 Rapidly Changing Customer Dimensions  The minidimension terminology refers to when the demographics key is part of the fact table composite key  If the demographics key is a foreign key in the customer dimension, we refer to it as an outrigger

49 49 Rapidly Changing Customer Dimensions Type 2 with Natural Keys in Fact Table

50 50 Implications of Type 2 Customer Dimension Changes  Be careful to avoid overcounting because we may have multiple rows in the customer dimension for the same individual  COUNT DISTINCT  A most recent row indicator  The comparison operators depend on the business rules used to set our effective/expiration dates.

51 51  Capture the keys of the customers or products whose behavior you are tracking Customer Behavior Study Groups

52 52 Commercial Customer Hierarchies

53 53 Commercial Customer Hierarchies  Bridge tables

54 54 Commercial Customer Hierarchies

55 55 Commercial Customer Hierarchies Be aware of risk of double counting SELECT 'San Francisco', SUM(F.REVENUE) FROM FACT F, DATE D WHERE F.CUSTOMER_KEY IN (SELECT B.SUBSIDIARY_KEY FROM CUSTOMER C, BRIDGE B WHERE C.CUSTOMER_KEY = B.PARENT_KEY AND C.CUSTOMER_CITY = 'San Francisco') //to sum all SF parents AND F.DATE_KEY = D.DATE_KEY AND D.MONTH = 'January 2002‘ GROUP BY 'San Francisco'

56 56 Heterogeneous Product Schemas

57 57 Heterogeneous Product Schemas

58 58 Common Dimensional Modeling Mistakes to Avoid  Mistake 10: Place text attributes used for constraining and grouping in a fact table  Mistake 9: Limit verbose descriptive attributes in dimensions to save space  Mistake 8: Split hierarchies and hierarchy levels into multiple dimensions  Mistake 7: Ignore the need to track dimension attribute changes  Mistake 6: Solve all query performance problems by adding more hardware

59 59 Common Dimensional Modeling Mistakes to Avoid  Mistake 5: Use operational or smart keys to join dimension tables to a fact table  Mistake 4: Neglect to declare and then comply with the fact table’s grain  Mistake 3: Design the dimensional model based on a specific report  Mistake 2: Expect users to query the lowest-level atomic data in a normalized forma  Mistake 1: Fail to conform facts and dimensions across separate fact tables

60 60 Answers Questions and


Download ppt "1 TechnoLogica Ltd. 3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines) DW Concepts Dimension."

Similar presentations


Ads by Google