Dimensional Modelling


Dimensional Modelling II

Factless Fact Table
A fact table generally contains a concatenated primary key linking it to the dimension tables, plus facts or measures.
Task: keep track of student attendance. Dimensions: student, course, date, room, instructor.
What is the measurement? Attendance could be indicated with the number one, but then every fact table row would contain the number one.
If the fact table only represents events and carries no real measures, it is a factless fact table.
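A minimal sketch of the attendance example, assuming an in-memory SQLite database; the table name `attendance_fact`, the column names, and the sample keys are hypothetical. It shows that a factless fact table is queried by counting rows rather than summing a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Factless fact table: only dimension keys, no measure column.
CREATE TABLE attendance_fact (
    student_key    INTEGER,
    course_key     INTEGER,
    date_key       INTEGER,
    room_key       INTEGER,
    instructor_key INTEGER,
    PRIMARY KEY (student_key, course_key, date_key, room_key, instructor_key)
);
INSERT INTO attendance_fact VALUES
    (1, 10, 20090105, 3, 7),
    (2, 10, 20090105, 3, 7),
    (1, 10, 20090112, 3, 7);
""")

# Each row is one attendance event, so attendance is the row count.
attendance_by_day = conn.execute("""
    SELECT course_key, date_key, COUNT(*)
    FROM attendance_fact
    GROUP BY course_key, date_key
    ORDER BY date_key
""").fetchall()
```

Storing a constant 1 as a measure would give the same totals with SUM, which is why the measure can be dropped.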

Star Schema for Tracking Attendance

Data Granularity
Granularity: the level of detail in the fact table.
Lowest grain: facts or metrics are at the lowest possible level at which they could be captured from the operational systems.
What are the advantages of keeping the fact table at the lowest grain?
Users can drill down to the lowest level of detail in the data warehouse without needing to go to the operational systems themselves.
Base-level fact tables must be at the natural lowest levels of all corresponding dimensions, so that drill-down and roll-up queries can be performed efficiently.
What are the natural lowest levels of the corresponding dimensions? For a sales department with the dimensions product, date, customer, and sales representative, they are an individual product, a specific date, an individual customer, and an individual sales representative, respectively. The fact table then contains measurements at this lowest level.
Fact tables at the lowest grain facilitate “graceful” extensions: adding an attribute to a dimension table, or even adding a new dimension table, will not affect old queries.
Granular fact tables serve as natural destinations for current operational data, so fewer transformations are required.
Data mining applications need details at the lowest grain, and data warehouses feed data into data mining applications.
What is the trade-off? Increased storage and maintenance requirements, because of the large number of fact table rows.
Aim: build aggregate fact tables to support queries looking for summary numbers.
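The last point can be sketched as follows, assuming a base-grain sales fact table held as Python tuples; the row layout, the sample values, and the function name `build_monthly_aggregate` are illustrative only. The aggregate table is derived from the lowest-grain rows, never the other way round.

```python
from collections import defaultdict

# Base-grain rows: (product, date "YYYY-MM-DD", customer, sales_rep, sales_amount)
base_facts = [
    ("P1", "2009-01-05", "C1", "R1", 100.0),
    ("P1", "2009-01-20", "C2", "R1", 50.0),
    ("P2", "2009-01-20", "C1", "R2", 30.0),
    ("P1", "2009-02-03", "C1", "R1", 70.0),
]

def build_monthly_aggregate(rows):
    """Roll the base grain up to (product, month), dropping customer and rep."""
    totals = defaultdict(float)
    for product, date, _customer, _rep, amount in rows:
        month = date[:7]  # "YYYY-MM"
        totals[(product, month)] += amount
    return dict(totals)

monthly_sales = build_monthly_aggregate(base_facts)
```

Summary queries can read `monthly_sales`, while drill-down still goes to `base_facts`.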

Star Schema Keys
Each dimension table must have a primary key: each row in a dimension table is identified by a unique value of an attribute designated as the primary key of the dimension.
Can we use the primary keys of the operational system? Example: Customer already has a unique primary key in the OLTP system.
Consider a possible scenario when such keys are used as the PKs of dimension tables. The products table has a PK, the product code, made up of 8 character positions: two indicate the code of the warehouse where the product is normally stored, two other positions denote the product category, ….
Assume the product code is used as the PK of the dimension table. What happens if the product code is changed in the middle of a year because the product is now stored in a different warehouse of the company?
Remember: the data warehouse contains historic data.

Star Schema Keys

Primary Keys of Dimension Tables
Avoid PKs with built-in meanings. Run away from PKs with multiple meanings!
Avoid PKs that may be re-used, for example when the PK of an old customer is assigned to a new customer.
Use surrogate keys: system-generated “meaningless” keys.
Keep the OLTP PKs as additional attributes in the dimension tables.
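A small sketch of the surrogate-key idea, assuming an in-memory dimension; the class `CustomerDimension`, its method, and the sample customer number are hypothetical. It shows a re-used OLTP key producing two distinct dimension rows, with the OLTP key kept as an ordinary attribute.

```python
import itertools

class CustomerDimension:
    """Toy dimension table keyed by a system-generated surrogate key."""

    def __init__(self):
        self._next_key = itertools.count(1)  # meaningless, sequential keys
        self.rows = {}                       # surrogate key -> attributes

    def add_customer(self, oltp_customer_id, name):
        surrogate_key = next(self._next_key)
        # The OLTP primary key is stored as an additional attribute only.
        self.rows[surrogate_key] = {
            "oltp_customer_id": oltp_customer_id,
            "name": name,
        }
        return surrogate_key

customer_dim = CustomerDimension()
old_key = customer_dim.add_customer(oltp_customer_id=1001, name="Old Customer")
# The OLTP system later re-uses customer number 1001 for a new customer.
new_key = customer_dim.add_customer(oltp_customer_id=1001, name="New Customer")
```

Historic fact rows keep pointing at `old_key`, so the old customer's history is not attributed to the new one.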

Primary Key of Fact Table
The fact table is on the many side of 1:M relationships with the dimension tables, so it contains the PKs of the dimension tables as foreign keys.
What should be the PK of the fact table? There are three options:
1. A single compound primary key whose length is the total length of the keys of the individual dimension tables. Foreign keys must also be kept in the fact table as additional attributes, which increases the size of the fact table.
2. A concatenated primary key that is the concatenation of all the primary keys of the dimension tables. There is no need to keep the primary keys of the dimension tables as additional attributes to serve as foreign keys; the individual parts of the primary key themselves serve as the foreign keys.
3. A generated primary key independent of the keys of the dimension tables. Foreign keys must also be kept in the fact table as additional attributes.
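The concatenated-key option can be sketched with SQLite, assuming a two-dimension sales example; all table and column names here are made up for illustration. The parts of the primary key double as the foreign keys, and a second row for the same key combination is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, full_date TEXT);

-- Concatenated primary key: its parts are also the foreign keys.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    date_key     INTEGER REFERENCES date_dim(date_key),
    sales_amount REAL,
    PRIMARY KEY (product_key, date_key)
);

INSERT INTO product_dim VALUES (1, 'LeapPad');
INSERT INTO date_dim    VALUES (20090105, '2009-01-05');
INSERT INTO sales_fact  VALUES (1, 20090105, 99.0);
""")

# A second fact row with the same dimension keys violates the primary key.
try:
    conn.execute("INSERT INTO sales_fact VALUES (1, 20090105, 5.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

fact_row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```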

Advantages of the Star Schema
Easy for users to understand. Data warehouse users must understand the data structures used, because they formulate queries and reference data structures (tables). The star schema is based on business dimensions and metrics.
Optimizes navigation. The relationships used for joining tables (fact table to dimension table) are easy to understand, and the join paths are short and straightforward.
More suitable for query processing. All queries are formulated and executed in the same way: use filtering conditions to select rows from the dimension tables, then find the corresponding rows in the fact table.
Star-join and star-index. STARjoin: a single-pass, high-speed, parallelizable, multitable join. STARindex: a specialized index to increase STARjoin performance.

Star Join Optimization
Many data warehousing queries share a common pattern. They select several measures from a fact table, join the fact rows with one or more dimensions along the surrogate keys, and place filter predicates and aggregates on non-key columns of the dimension tables. You can think of a fact table as the star in the center of a solar system with a number of dimension tables orbiting like planets around the star. We refer to this pattern as the star pattern.
Star Join is a dedicated optimization designed specifically for queries on dimensionally modeled data configured into a star pattern. The first step of the optimization is to detect the star pattern using heuristics such as:
The largest table that participates in an n-ary join is considered the fact table.
To be considered a fact table, a table must be larger than a specified minimum size.
All join conditions of each of the binary joins have to be single-column equality predicates.
The joins have to be inner joins, and so on.
By using these heuristics, the query optimizer of SQL Server 2008 detects the star pattern and identifies the fact table automatically. The query optimizer then builds hash tables for each dimension table that participates. Based on these hash tables, it builds bitmap filters that are pushed down to the scan on the fact table. These filters eliminate most of the rows that would otherwise be removed by later join operations. As a result, the total number of rows that need to be processed by subsequent operators is greatly reduced, which provides a significant performance improvement.
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
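The filter push-down idea can be illustrated in plain Python. This is a conceptual sketch only, not SQL Server's implementation: Python sets stand in for the bitmap filters, and the dimension contents, fact rows, and the function name `star_join_total` are all invented for the example.

```python
# Dimension tables keyed by surrogate key.
product_dim = {1: {"category": "Toy"}, 2: {"category": "Education"}, 3: {"category": "Toy"}}
store_dim = {10: {"state": "NSW"}, 11: {"state": "VIC"}}

# Fact rows: (product_key, store_key, sales_amount)
sales_fact = [
    (1, 10, 5.0),
    (2, 10, 7.0),
    (3, 11, 2.0),
    (1, 11, 4.0),
    (3, 10, 1.0),
]

def star_join_total(fact_rows, product_filter, store_filter):
    """Filter the dimensions first, then discard fact rows during the scan."""
    # Step 1: apply predicates to the (small) dimension tables and keep
    # the qualifying keys; these sets play the role of the bitmap filters.
    product_keys = {k for k, row in product_dim.items() if product_filter(row)}
    store_keys = {k for k, row in store_dim.items() if store_filter(row)}

    # Step 2: one pass over the fact table; rows failing any filter are
    # dropped before further processing.
    total, rows_kept = 0.0, 0
    for product_key, store_key, amount in fact_rows:
        if product_key in product_keys and store_key in store_keys:
            rows_kept += 1
            total += amount
    return total, rows_kept

total, rows_kept = star_join_total(
    sales_fact,
    product_filter=lambda row: row["category"] == "Toy",
    store_filter=lambda row: row["state"] == "NSW",
)
```

Only two of the five fact rows survive the scan, so later operators see far fewer rows.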

Example Star Schema: Video Rental

Example Star Schema: Supermarket

Example Star Schema: Wireless Phone Service

Example Star Schema: Auction Company

Snowflake Schema
Snowflake schema: a variant of the star schema where each dimension can have its own dimensions.
Starflake schema: a hybrid structure that contains a mixture of star (denormalized) and snowflake (normalized) schemas. It allows dimensions to be present in both forms to cater for different query requirements.
-- Ralph Kimball, The Data Warehouse Toolkit

When to Snowflake?
‘Snowflaking’ is a method of normalizing the dimension tables in a star schema.
Fact table: Customer key, other keys, metrics
Customer dimension table: Customer key, Customer name, Address, Zip, City class key
City classification table: City class key (PK), City code, Class description, Population range, Cost of living, Pollution index, Public trans, Customer index
If a dimension is very large, the savings in storage could be substantial.
If the dimension table is normalized, users may browse the additional attributes in the new normalized table only when required.
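A sketch of the extra join that snowflaking introduces, using SQLite; the table names, the reduced column set, and the sample rows are assumptions made for the example. The city classification attributes are reached only through the customer dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city_class (
    city_class_key    INTEGER PRIMARY KEY,
    class_description TEXT
);
CREATE TABLE customer_dim (
    customer_key   INTEGER PRIMARY KEY,
    customer_name  TEXT,
    city_class_key INTEGER REFERENCES city_class(city_class_key)
);
CREATE TABLE sales_fact (
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    amount       REAL
);
INSERT INTO city_class   VALUES (1, 'Metro'), (2, 'Rural');
INSERT INTO customer_dim VALUES (100, 'A', 1), (101, 'B', 2), (102, 'C', 1);
INSERT INTO sales_fact   VALUES (100, 10.0), (101, 20.0), (102, 5.0), (100, 1.0);
""")

# The fact table joins to city_class only via customer_dim.
sales_by_city_class = conn.execute("""
    SELECT cc.class_description, SUM(f.amount)
    FROM sales_fact AS f
    JOIN customer_dim AS c  ON c.customer_key = f.customer_key
    JOIN city_class   AS cc ON cc.city_class_key = c.city_class_key
    GROUP BY cc.class_description
    ORDER BY cc.class_description
""").fetchall()
```

Queries that do not need city classification attributes skip the second join entirely.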

A Few Definitions
OLAP: “On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensions of the enterprise as understood by the user” -- DBMS Magazine, April, 1995
Multidimensional Analysis: the manipulation of data by a variety of categories or “dimensions”, facilitating analysis and an understanding of the data. Also known as “drill-around” and “slice and dice”.
Multidimensional Database: a proprietary, non-relational database that stores and manages data in a multidimensional manner, with limited dimensional information.

Updates
Updates to the fact table: frequent addition of rows; rare changes in rows (adjustments in values); rare addition of attributes (new fact or metric).
Updates to dimension tables: slow addition of rows; slow addition of attributes; new dimensions.

Updates to the Dimension Tables
Most dimensions are generally constant over time. If not constant, they change slowly over time.
The key of the source record does not change; the description and other attributes change slowly over time.
In the source OLTP systems, the new values overwrite the old values. Overwriting is not always the best option for dimension table attributes.
The way updates are made depends on the type of change.

Order Tracking

Type 1 Changes: Correction of Errors
Properties
Usually, the changes relate to the correction of errors in source systems. Sometimes the change in the source system has no significance.
The old value in the source system needs to be discarded, and the change need not be preserved in the data warehouse.
Examples: correcting a spelling mistake in a name; changing a name due to marriage. Changing marital status?

Type 1 Changes: Correction of Errors
Approach
Overwrite the attribute value in the dimension table row with the new value. The old value of the attribute is not preserved.
No other changes are made in the dimension table row. The key of this row and all other key values are not affected.
This type is the easiest to implement.
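The Type 1 approach amounts to an in-place overwrite. A minimal sketch, assuming the dimension is a dict keyed by surrogate key; the function name `apply_type1_change` and the sample row are hypothetical.

```python
# Dimension rows keyed by surrogate key (sample data is illustrative).
customer_dim = {1: {"code": "K001", "name": "Mirada Kerr", "state": "VIC"}}

def apply_type1_change(dimension, surrogate_key, attribute, new_value):
    """Overwrite the attribute in place; the old value is not preserved."""
    dimension[surrogate_key][attribute] = new_value

# Correct a spelling mistake in the name.
apply_type1_change(customer_dim, 1, "name", "Miranda Kerr")
```

The row count and the surrogate key stay the same, so existing fact rows need no change.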

Type 1 Changes

Type 2 Changes: Preservation of History
Properties
They usually relate to true changes in source systems. There is a need to preserve history in the data warehouse.
This type of change partitions the history in the data warehouse. Every change for the same attribute must be preserved.
Examples: change in marital status (track orders by marital status, state, …); change of address.

Type 2 Changes: Preservation of History
Approach
Add a new dimension table row with the new value of the changed attribute. An effective date field may be added to the dimension table.
There are no changes to the original row of the dimension table.
The new row is inserted with a new surrogate key. The key of the original row is not affected.
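A minimal sketch of this approach, assuming dimension rows are dicts with a surrogate key, a business key `code`, and an `effective_date` field; these field names and the function `apply_type2_change` are assumptions, not part of the slides.

```python
from datetime import date

customer_rows = [
    {"customer_key": 1, "code": "K001", "state": "NSW", "effective_date": date(2009, 1, 1)},
]

def apply_type2_change(rows, business_key, attribute, new_value, effective_date):
    """Append a new version of the row; the original row is left untouched."""
    versions = [r for r in rows if r["code"] == business_key]
    latest = max(versions, key=lambda r: r["effective_date"])
    new_row = {
        **latest,                        # copy the unchanged attributes
        attribute: new_value,            # the changed attribute
        "effective_date": effective_date,
        "customer_key": max(r["customer_key"] for r in rows) + 1,  # new surrogate key
    }
    rows.append(new_row)
    return new_row

new_version = apply_type2_change(customer_rows, "K001", "state", "VIC", date(2009, 2, 24))
```

Fact rows loaded after the change reference the new surrogate key, which is what partitions the history.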

Type 2 Changes: Preservation of History

Type 3 Changes: Tentative Soft Revisions
Properties
They usually apply to “soft” or tentative changes in the source systems.
There is a need to keep track of history with the old and new values of the changed attribute.
They are used to compare performance across the transition.
They provide the ability to track forward and backward.

Type 3 Changes: Tentative Soft Revisions
Approach
Add an “old” field in the dimension table for the affected attribute.
Push down the existing value of the attribute from the “current” field to the “old” field, and keep the new value of the attribute in the “current” field.
You may also add a “current” effective date field for the attribute.
The key of the row is not affected, and no new dimension row is needed.
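The push-down step can be sketched as follows, assuming one dimension row held as a dict; the `old_` prefix, the `_effective_date` suffix, and the function name are naming choices made for this example.

```python
from datetime import date

def apply_type3_change(row, attribute, new_value, effective_date=None):
    """Move the current value to an 'old' field, then store the new value."""
    row["old_" + attribute] = row[attribute]   # push down the existing value
    row[attribute] = new_value                 # keep the new value as current
    if effective_date is not None:
        row[attribute + "_effective_date"] = effective_date
    return row

sales_rep = {"rep_key": 7, "name": "Rep A", "region": "North"}
apply_type3_change(sales_rep, "region", "East", effective_date=date(2009, 7, 1))
```

Only one previous value is kept per attribute, so a second change would overwrite `old_region`.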

Type 3 Changes: Tentative Soft Revisions

Slowly Changing Dimensions (Addresses, Managers, etc.)
Type 1: Store only the current value; overwrite the previous value.
Type 2: Create a dimension record for each value (with or without date stamps).
Type 3: Create an attribute in the dimension record for the previous value.

Examples

Original
ProductKey  Description  Category   SKU
21553       LeapPad      Education  LP2105

Type 1
ProductKey  Description  Category  SKU
21553       LeapPad      Toy       LP2105

Type 2
ProductKey  Description  Category   SKU
21553       LeapPad      Education  LP2105
44631       LeapPad      Toy        LP2105

Type 3
ProductKey  Description  Category  OldCat     SKU
21553       LeapPad      Toy       Education  LP2105

Hybrid
ProductKey  Description  Category     OldCat     SKU
21335       LeapPad      Electronics  Education  LP2105
44631                                 Toy
68122

Type 1 Slowly Changing Dimension
The simplest form. Only updates existing records. Overwrites history.

Type 1 Slowly Changing Dimension
Before:
CustomerID  Code  Name          State  Gender
1           K001  Miranda Kerr  VIC    F
After:
CustomerID  Code  Name          State  Gender
1           K001  Miranda Kerr  NSW    F
There are some changes where it is valid to overwrite history. When someone gets married and changes their name, they may want to carry the history of their previous purchases over to their new name rather than see a split history.

Type 2 Slowly Changing Dimension
Allows the recording of changes of state over time.
Generates a new record each time the state changes.
Usually requires the use of effective dates when joining to facts.

Type 2 Slowly Changing Dimension
Before:
CustomerID  Code  Name          State  Gender  Start    End
1           K001  Miranda Kerr  NSW    F       1/1/09   <NULL>
After:
CustomerID  Code  Name          State  Gender  Start    End
1           K001  Miranda Kerr  NSW    F       1/1/09   23/2/09
2           K001  Miranda Kerr  VIC    F       24/2/09  <NULL>
This makes inserts into your fact table more expensive, as you always need to match on the effective dates as well as the business key. Sometimes people keep a “Current” flag. Another approach, rather than putting nulls in the End date, is to put an arbitrary date well in the future; this can make the join logic a bit simpler.
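The lookup that each fact insert has to perform can be sketched like this, using the rows from the example; the function name `lookup_customer_id` and the choice to treat a null End as `date.max` are assumptions. Dates are read as day/month/year.

```python
from datetime import date

customer_dim = [
    {"customer_id": 1, "code": "K001", "state": "NSW",
     "start": date(2009, 1, 1), "end": date(2009, 2, 23)},
    {"customer_id": 2, "code": "K001", "state": "VIC",
     "start": date(2009, 2, 24), "end": None},
]

def lookup_customer_id(rows, code, transaction_date):
    """Match on the business key and on the effective date range."""
    for row in rows:
        # A null End means the row is current; treat it as open-ended.
        end = row["end"] or date.max
        if row["code"] == code and row["start"] <= transaction_date <= end:
            return row["customer_id"]
    return None  # no version of the customer was valid on that date

key_in_february = lookup_customer_id(customer_dim, "K001", date(2009, 2, 10))
key_in_march = lookup_customer_id(customer_dim, "K001", date(2009, 3, 1))
```

Storing a far-future End date instead of null would remove the need for the open-ended special case.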

Type 3 Slowly Changing Dimension
De-normalized change tracking.
Only keeps a limited history.
Stores changes in separate columns.

Type 3 Slowly Changing Dimension
Before:
CustomerID  Code  Name          Current State  Gender  Prev State
1           K001  Miranda Kerr  NSW            F       <NULL>
After:
CustomerID  Code  Name          Current State  Gender  Prev State
1           K001  Miranda Kerr  VIC            F       NSW
This type of change tracking is more useful when there is a one-off change, such as a change in sales regions, where you want to see history re-cast into the new regions but may also want to compare the old and new regions.

Junk Dimensions
Dimensions for a DW are typically taken from operational source systems.
Source systems contain many additional attributes (such as flags, text, descriptions, etc.) that may not be useful in a DW.
What are the options?
Discard all such fields from the source systems.
Include them in the fact table.
Include all of them as dimensions.
Select some and add them to a single “junk” dimension table.
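The last option can be sketched as follows, assuming source rows are dicts; the flag names `gift_wrap` and `payment_type` and the function `build_junk_dimension` are hypothetical. Each distinct combination of the selected flags becomes one junk dimension row, and the fact table stores only its key.

```python
def build_junk_dimension(source_rows, flag_names):
    """Assign one surrogate key per distinct combination of flag values."""
    junk_dim = {}        # combination of flag values -> junk key
    fact_junk_keys = []  # junk key for each source row, in order
    for row in source_rows:
        combination = tuple(row[name] for name in flag_names)
        if combination not in junk_dim:
            junk_dim[combination] = len(junk_dim) + 1
        fact_junk_keys.append(junk_dim[combination])
    return junk_dim, fact_junk_keys

orders = [
    {"order_id": 1, "gift_wrap": "Y", "payment_type": "card"},
    {"order_id": 2, "gift_wrap": "N", "payment_type": "cash"},
    {"order_id": 3, "gift_wrap": "Y", "payment_type": "card"},
]
junk_dim, fact_junk_keys = build_junk_dimension(orders, ["gift_wrap", "payment_type"])
```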

Large Dimensions
What makes a dimension large? A large number of rows (deep) or a large number of attributes (wide).
Dimensions can become large because of frequent changes (what type?) and the need to have many attributes for analysis.
Consequence: slow and inefficient.
Solution: proper logical and physical design, indexes, and optimized algorithms.

Large Dimensions: Vertical Segmentation
Dividing a large, rapidly changing dimension table

Vertical Segmentation
Separate attributes into other tables.
The overhead of shared locks may be reduced.
Table scans can be faster.
Could cause excessive joins.

Vertical Segmentation
Separate attributes into other tables.
Original table: Branch_id (PK), School_id (PK), Month_yr, School_name, School_Address, Number_of_Graduates, Number_of_underGraduate, Semaster_Tuition
Segment 1 (Ref School Branch): Branch_id (PK), School_id (PK), Month_yr, School_name, School_Address
Segment 2: Branch_id (PK), School_id (PK), Month_yr, Number_of_Graduates, Number_of_underGraduates, Semaster_Tuition

Horizontal Segmentation
Separate subsets of data into other tables. For example, separate yearly sales data into tables containing only monthly data (Jan, Feb, ...).
Queries that span segments become multiple queries over multiple tables, combined with UNION.
Breaking up tables will speed up table scans.
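A sketch of querying across monthly segments with SQLite; the table names and rows are invented. Note one assumption that differs from the slide wording: the example uses UNION ALL, because plain UNION removes duplicate rows and would silently drop a sale that happens to repeat across segments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per month instead of one yearly table.
CREATE TABLE sales_jan (product_key INTEGER, amount REAL);
CREATE TABLE sales_feb (product_key INTEGER, amount REAL);
INSERT INTO sales_jan VALUES (1, 10.0), (2, 5.0);
INSERT INTO sales_feb VALUES (1, 10.0), (3, 2.5);
""")

# Combine the segments, then aggregate over the combined rows.
total_by_product = conn.execute("""
    SELECT product_key, SUM(amount)
    FROM (
        SELECT product_key, amount FROM sales_jan
        UNION ALL
        SELECT product_key, amount FROM sales_feb
    ) AS all_months
    GROUP BY product_key
    ORDER BY product_key
""").fetchall()
```

A query restricted to one month reads only that month's table, which is where the faster scans come from.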

Multiple Hierarchies
Multiple hierarchies in a large product dimension.
Notice that some attributes are “shared” among the multiple hierarchies.

Shared Dimension Tables
Figure: a star constellation in which the fact tables PropertySale and Advertisement share dimension tables. Tables shown: Time, Newspaper, Owner, Branch, Promotion, Property For Sale.

Roll Up (Dimension Hierarchies)
Property sales with a normalized version of the Branch dimension table: snowflake schema. (This is not an ERD.)
PropertySale (fact table): timeId key, propertyid key, branchid key, Clientid key, Promotionid key, Staffid key, Ownerid key
Branch: Branch Id (PK), Branch no, Branch type, City (FK)
City: City ID (PK), Region ID (FK)
Region: Region ID (PK)

Some Design Issues
Too few dimensions: dimensions are lacking (aggregate level).
Too many dimensions: one possibility is to combine dimensions.
Overly complex dimensions: one possibility is to split dimensions (vertically or horizontally); another possibility is the snowflake schema.
Multiple fact tables: star constellations.