Lecture 2 - Dimensional Modelling

Lecture 2 - Dimensional Modelling
Reading Directions Kimbal98 chapters 5, 6 Keywords granularity, time, semi-additive, dense fact tables, degenerate dimension, big dimensions, slowly changing dimensions, multiple hierarchies, demographic minidimensions, dirty dimensions, heterogeneous products, synonym tables, factless fact table, aggregates, derived fact tables, bitmap indexing

Some basic concepts Fact Attribute “something not known in advance”,
an observation many facts (but not all) have numerical, continuously values e.g., the price of a product, quantity Attribute “describe a characteristic of a tangible thing” “we do not measure them, we usually know them” usually text fields, with discrete values e.g., the flavour of a product, the size of a product

Some basic concepts 2 Dimension Granularity
a business perspective from which data is looked upon “a collection of text like attributes that are highly correlated” e.g. Product, Store, Time Granularity the level of detail of data contained in the data warehouse e.g. Daily item totals by product, by store

Example of a Dimensional Model

The Time Dimension

The Standard Template Query
SELECT p.brand, sum(f.dollar), sum(f.units) FROM salesfact f, product p, time t WHERE f.prductkey = p.productkey AND f.timekey = t.timekey AND t.quarter = ’1Q1995’ GROUP BY p.brand ORDER BY p.brand

An Example Answer Set Aggregated fact Row header

Facts (Perfectly) Additive Semiadditive Non-additive
a fact is additive if it is possible to add it across all the dimensions e.g., discrete numerical measures of activity, i.e., quantity sold, dollars soled Semiadditive a fact is semiadditive if it is possible to add it along only some of the dimensions e.g., numerical measures of intensity, i.e., account balance, inventory level Non-additive facts that can not be added at all e.g., Whenever possible the facts in the fact table should be chosen to be perfectly additive. That means it make sense to add the facts along all of the dimensions. Discrete numerical measures of activity, such as retail cash register sales in either dollars or units, are often perfectly additive. Numerical measures of intensity, however, are not perfectly additive. The most important measures of intensity are account valances and inventory levels. These facts are snapshots of something taken at a point on time. Measures of intensity are usually additive along all the dimensions except time. To combine account balances and inventory levels across time, the query tool or reporting tool must add these facts across time and then divide by the number of time periods to create a time-average. This is not the same as using the SQL AVG function.

Facts and the Additive Property
28/3, paper, store1, 25, 250, 20 29/3, paper, store1, 35, 350, 30 60, 600, 50 28/3, paper, store1, 25, 250, 20 28/3, paper, store2, 35, 350, 30 60, 600, 50 28/3, paper1, store1, 25, 250, 20 28/3, paper2, store1, 35, 350, 30 60, 600

Semiadditive fact - example
NB! customer_count is not additive across the product dimension 28/3, tissue paper, store1, 25, 250, 20 28/3, paper towels, store1, 35, 350, 30 ---- 50 Is the number of customers who bought either paper towels or tissue paper 50? No, the number could be anywhere between 30 and 50.

Numerical Measures of Intensity
All measures that record a static level, such as account balance and inventory level, are non-additive across time. However, these measures may be usefully aggregated across time by averaging over the number of time periods. Note that, the SQL AVG can not be used for this. What is the average daily inventory of a brand in a geographic region during a given week? Let the brand cluster 3 products, the region has 4 stores, and we have 7 days/week. Using the SQL AVG would divide the summed value into 3*4*7=84 The correct answer is to divide the summed inventory value by 7

Sparsity The matrices, represented by multidimensional models are often 99% sparse. Sparsity is dealt with by simply not creating records for the cells that are not filled in the matrix. If nothing has happened no record is created. Pre-aggregation and storage of aggregates can however lead to sparsity failure which places large demands on data storage från-konton till-konton 1000 produkter celler trans per dag ger en fyllnadsgrad på 1 på 1 miljard celler Sparsity failure: se sid 553 i bok

Skinny fact tables As the fact table contains the vast volume of records it is important that it is memory space efficient Foreign keys are usually represented in integer form and do not require much memory space Facts too are often numeric properties and can usually be represented as integers (contrast to dimensional attributes which are usually long text strings) This space efficiency is critical to the memory space consumption of the data warehouse

Keys Choice the data warehouse keys to be meaningless surrogate keys
Let a surrogate key be a simple integer 4-byte ( , , , ) can contain 232 values (> 2 billion positive integers, starting with 1)

Keys Use surrogate keys also for the Time dimension Avoid smart keys
SQL-based date key, is typically 8 bytes, so 4 bytes are wasted bypassing joins leads to embedding knowledge of the calendar in the application, rather than reading it from the time dimension it is not possible to encode a data stamp as “I do not know”, “It has not happen yet”, etc Avoid smart keys Avoid production keys Avoid Smart Key i.e., resist the temptation to make a primary dimension key out of various attributes in the dimension table. NB! One of the cases (type2) for treating slowly changing dimensions suggests exactly this as solution. If you use production defined attributes as the literal components of your key construction, sooner or later they may cause problems. 1. Production may decide to reuse keys, it they purge their version of the dimension every year, than they may have no compunction about reusing a key after two years. 2. The rules for building the production key may be changed 3. You may be handed a revised dimension record from production, where the product key was deliberately not changed.

Slowly Changing Dimensions
E.g., product, customer The assumption: the key does not change, but some of the attributes does. Type 1: Overwrite the dimension record with the new values, thereby losing history Type 2: Create a new additional dimension record using a new value of the surrogate key Type 3: Create a new field in the dimension record to store the new value of the attribute

Type 1 Overwrite the old value of an attribute with a new one e.g.
+ easy to implement - avoids the real goal, which is to accurately track history applicable, however, in the cases when an error in the original data is discovered

Type 2 Create a new additional dimension record
A generalised (surrogate) key is required (which is a responsibility of the data warehouse team) Fact table Dimension table

Type 3 Create a new field in the dimension record
applicable for ”soft” changes, i.e., in the situations when although a change has occurred it is logically possible to act as if the change had not accrued.

Rapidly Changing Dimensions
From the previous slides: What is slow? What if the changes are fast? Must a different design technique be used? Small dimensions: the same technologies as for slowly changing dimensions may be applied Large dimensions: the choice of indexing techniques and data design approaches are important find suppress duplicate entries in the dimension do not create additional records to handle the slowly changing dimension problem

Bitmap indexing An effective indexing technique for attributes with low-cardinality domains There is a distinct bit vector BV for each value V of the domain Example: the attribute sex has value M and F. A table of 100 million people needs 2 lists of 100 million bits.

Bitmap Index Base Table Region Index Rating Index SELECT Customers
Text from Ramakrishnan&Gehrke, sid 691 (men exempeln är det samma som jag hade förra året) Consider a table that describes customers: CUST(Custid:Int, Region:String, Rating:String, …) where Domain(Region) = {N,W,S,E} and Domain(Rating) = {H,M,L} Columns with few possible values are called sparse [like, Region and Rating]. We can exploit sparsity to construct a new kind of index that greatly speeds up queries on these columns. The idea is to record values for sparse columns as a sequence of bits, one for each possible value.” For example, a rating value is 100, 010, or 001, a 1 in first position denotes ‘high’, a 1 in second position denotes ‘medium’, and a 1 in third position denotes ‘low’. Similarly, for region, 1000 denotes ‘North’, 0100 denotes ‘Sought’, 0010 denotes ‘East’, and 0001 denotes ‘West’. If we consider the rating values for all rows in the Customer table, we can treat this as a collection of three bit vectors, one of which has the associated value ‘high’, the other one the associated value ‘medium’ and the third one has the associated value ‘low’. Each bit vector has one bit per row in the Customers table, indication whether the value in that row is the value associated with the bit vector. The collection of bit vectors for a column is called a bitmap index for that column. SELECT Customers FROM Base Table WHERE Region = W AND Rating = L

Bitmap Index Base Table Region Index Rating Index Region = W AND
Bitmap indexes offer two important advantages over conventional hash and tree indexes. First, they allow the use of efficient bit operations to answer queries. For example, consider the query “How many customer from the West region are rated as low”. We can take the fourth bit vector for Region and do a bit-wise and AND with the third bit vector for Rating to obtain a bit vector that has 1 for every West-living customer with rating Low. We can then count the number of 1s in this bit vector to answer the query. Second, bitmap indexes can be much more compact than a traditional B+ tree index and are very amenable to the use of compression techniques. Region = W AND Rating = L

Bitmap Index Base Table Region Index Rating Index Region = W AND
Rating = L

Join Indexes Bitmapped Join Indexes

Rapidly changing very large (monster) dimensions
Break off some of the attributes into their own separate dimension(s), a demographic dimension(s). force the attributes selected to the demographic dimension to have relatively small number of discrete values build upp the demographic dimension with all possible discrete attributes combinations construct a surrogate demographic key for this dimension NB! The demographic attributes are the one of the heavily used attributes. Their values are often compared in order to identify interesting subsets.

Demographic Minidimension
106 = tuples

Two Demographic Dimensions

Demographic Minidimension
Advantages frequent ‘snapshoting’ of customers profiles with no increase in data storage or data complexity Drawbacks the demographic attributes are clumped into banded ranges of discrete values (it is impractical to change the set of value bands at a later time) the demographic dimension itself can not be allowed to grow too large slower down the browsing NB! the customer is associated with demographics only when a fact table record is created The customer is associated with demographics only when fact table record is created. - if the fact table is sparse (e.g., the sales are so sparse) then a demographic event can be defined. It is still saved in the fact table, but it lacks the rest of the measured attributes. - another alternative is to in the customer table create a FK to the demographic table. (NB! What is then the difference between DM and ER modelling?)

Degenerate Dimension A degenerate dimension is represented by a dimension key attribute(s) with no corresponding dimension table Occurs usually in line-item oriented fact table design Many of our dimensional designs revolve around some kind of control document like an order, an invoice, a bill or a ticket. Usually these control documents are a kind of container with one ore more line items inside. The natural granularity for a fact table in this cases is the individual line item. It is no point in making a dimension table because it would contain nothing. But the generate key itself is still necessary since it may for instance be used to group together line items.

Junk Dimensions When a number of miscellaneous flags and text attributes exist, the following design alternatives should be avoided: Leaving the flags and attributes unchanged in the fact table record Making each flag and attribute into its own separate dimension Stripping out all of these flags and attributes from the design Sometimes when you are extracting from a complex production data source, after identifying source attributes for a number of obvious dimensions, you are left with a number of miscellaneous flags and text attributes. These flags and attributes do not seem to be organised in any coherent way. Often the meaning of the flags and text attributes is obscure. Many of the fields are filled sporadically, or their content seems to vary depending on the context of the record. alt1 is bad because it might cause the fact table to swell alarmingly. It would be a shame to leave several uncompressed 20 byte text fields in the fact table. It may also be extensive to build indexes for constraining and browsing on these fields as well. alt2 does not work either since a 5-dimensional design could balloon into a 25-dimensional design. alt3: shipping out this information could be a mistake if there is demonstrated business relevance to any of these fields. It is worthwhile to examine this question carefully. If the flags and the text attributes are incomprehensible, noisy, inconsistently filed in, or only of operational significance, they should be left out. A simple example of a useful junk dimension would be taking ten Yes/No indicators out of the fact table and putting them into a single dimension with. At the worst you have 210=1024 records in this dimension. A better alternative is to create a junk dimension. A junk dimension is a convenient grouping of flags and attributes to get them out of a fact table into a useful dimensional framework

Heterogeneous products
Some products have many, many distinguishing attributes and many possible permutations (usually on the basis of some customised offer). This results in immense product dimensions and bad browsing performance In order to deal with this, fact tables with accompanying product dimensions can be created for each product type - these are known as custom fact tables Primary core facts on the products types are kept in a core fact table The core facts are copied in each of the customer fact tables Cf. horisontal partitioning

Heterogeneous Products

Factless fact tables Some fact tables quite simply have no measured facts! Often used to represent many-to-many relationships The only thing they contain is a concatenated key, they do still however represent a focal event which is identified by the combination of conditions referenced in the dimension tables There are two main types of factless fact tables: event tracking tables coverage tables

Dealing with many-to-many relationships among dimensional attributes
Many to many relationships are difficult to deal with in a any database design situation. Great efforts should be taken to identify any in the data model When creating a MDM it is necessary to separate the two entities and capture their relationship in a factless fact table

Aggregation Lowest level of aggregation is determined by the granularity of the fact table. Aggregations can be created on-the-fly or by the process of pre-aggregation Pre-aggregation demands more storage space but provides better query performance Aggregation is easier when facts are all additive

Depicting processes and value chains as families of star-join schemas
There are two sides to the value chain the demand side - the steps needed to satisfy the customers’ demand for the product the supply side - the steps needed to manufacture the products from original ingredients or parts The chain consists of a sequence of inventory and flow star-join schemata joining the different star-join schemata is only possible when two sequential schemata have a common, identical dimension Sometimes the represented chain can be extended beyond the bounds of the business itself

Supply Chain Demand Chain Row material production
Ingredient purchasing Ingredient delivery Ingredient inventory Bill of materials Manufacturing process control Manufacturing costs Packaging Trans-shipping to warehouse Finished goods inventory Demand Chain Finished goods inventory Manufacturing shipments Distributor inventory Distributor shipments Retail inventory Retail sales

A family of stars

The Data Warehouse Bus Orders Production Dimensions Time Sales Rep Customer Promotion Product Plant Distr. Center Allows the parallell dvlpmt of business process data marts with ability to integrate Allows the parallell dvlpmt of business process data marts with ability to integrate

Steps in the Design Process
1 Choose a business process to model A business process is a major operational process in an organi- sation, that is supported by some kind of a legacy system(s) from which data can be collected, e.g., orders, invoices, shipments, inventory. 2 Choose the grain of the business process The grains is the level of detail at which the data is represented in the DW. Typical grains are individual transactions, individual daily (monthly) snapshots. 3 Choose the dimensions that will apply to each fact table record Typical dimensions are time, product, customer, store, etc. 4 Choose the measured facts that will populate fact table E.g., quantity sold, dollars sold

Lecture 2 - Dimensional Modelling

Similar presentations

Presentation on theme: "Lecture 2 - Dimensional Modelling"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 2 - Dimensional Modelling

Similar presentations

Presentation on theme: "Lecture 2 - Dimensional Modelling"— Presentation transcript:

Similar presentations

About project

Feedback