Dimensional modelling - Star-join schemas

[Figure: star-join schema. A central fact table, Transactions, with the measures Sum and Number of calls, surrounded by the Service, Time, Sales and Customer dimensions. Sample keys shown include C210, S1, F11 and 991011.]

When modelling, one focuses on the important business events (transactions) in the business. These may be, for example, sales events: the selling of goods or services. I will come back to other types of events, but the central task when modelling in this way is to identify the events. The events, and the facts about them (that is, the values), are collected in an entity that we call sales facts. The sales fact entity thus contains the value we saw earlier in the cube. The events can then be studied from different aspects, the dimensions. These correspond to the dimensions we saw earlier.

Modelling is therefore a matter of choosing the dimensions: for example, which services or service groups are sold, which customers or customer categories buy services, at which points in time or in which time periods the services are sold, and which salespeople or sales offices sold them. Next, one chooses the attributes of these dimensions. For service we might add service name and service group. For customer we choose customer name, postal address, region and income group. The facts go in the middle, with the dimensions around them.

The model is then converted to a database schema with tables. The entities become tables: a fact table and dimension tables. The attributes become columns. The tables are populated with instances (rows, tuples). Between each dimension table and the fact table there is a one-to-many relationship. The dimensions contain few rows. They are not normalised, so information is stored redundantly. The fact table contains a great many rows and holds the overwhelming share of the information. It should have as few and as small columns as possible. The fact table contains foreign keys to the various dimensions.

A query removes rows from the tables: it joins the dimensions to the fact table in order to filter out fact table rows, for example removing all rows that are not S1. A structure like this is easier for business managers to understand than a relational database structure. One normalises for two reasons: to make storage as efficient as possible (saving space) and to avoid redundancy and inconsistency. Here, however, inconsistency is handled by the transformation layer, and saving space is secondary. The OLAP tool can use this data structure efficiently. Normalising certain attributes gives shorter rows, but means that one has to join.
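The structure described above can be sketched as a small runnable example. This is only an illustration using Python's standard sqlite3 module: the table names (call_fact, service_dim, customer_dim), the column names, the customer and service names, and the sample rows are all assumptions loosely inspired by the keys visible on the slide, not the lecturer's actual schema.

```python
import sqlite3

# Hypothetical star-join schema: one fact table with foreign keys to two
# small, denormalised dimension tables. All names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service_dim  (service_key TEXT PRIMARY KEY, service_name TEXT, service_group TEXT);
CREATE TABLE customer_dim (customer_key TEXT PRIMARY KEY, customer_name TEXT, region TEXT);
CREATE TABLE call_fact (
    customer_key TEXT REFERENCES customer_dim(customer_key),
    service_key  TEXT REFERENCES service_dim(service_key),
    date_key     TEXT,
    minutes      INTEGER,  -- additive measure
    calls        INTEGER   -- additive measure
);
INSERT INTO service_dim VALUES
    ('S1', 'Local calls', 'Voice'),
    ('S2', 'Mobile calls', 'Voice'),
    ('S3', 'Text messages', 'Messaging');
INSERT INTO customer_dim VALUES
    ('C210', 'Alice', 'North'),
    ('C212', 'Bob', 'South');
INSERT INTO call_fact VALUES
    ('C210', 'S1', '991011', 25, 3),
    ('C210', 'S3', '991011',  5, 1),
    ('C212', 'S2', '991011', 89, 2),
    ('C212', 'S1', '991012',  8, 1);
""")


def totals_by_region(service_key):
    """Join the fact table to its dimensions and keep only one service."""
    return conn.execute(
        """
        SELECT c.region, SUM(f.minutes), SUM(f.calls)
        FROM call_fact AS f
        JOIN service_dim  AS s ON f.service_key  = s.service_key
        JOIN customer_dim AS c ON f.customer_key = c.customer_key
        WHERE s.service_key = ?
        GROUP BY c.region
        ORDER BY c.region
        """,
        (service_key,),
    ).fetchall()
```

Calling totals_by_region('S1') discards every fact row that is not S1 and sums the remaining measures per customer region, which is the "join to remove rows" step described in the notes.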

Partitioning strategy
Partitioning is done for performance-related and manageability reasons, usually to handle the fact table.
Horizontal partitioning
- speeds up queries by minimising the data to be scanned (without using an index)
- partitioning by time is most common
Vertical partitioning
- data is split vertically
- two forms: normalisation and row splitting
- consider row splitting if some columns are accessed infrequently
[Figure: a fact table surrounded by three dimension tables]
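A minimal sketch of horizontal partitioning by time follows. It assumes, purely for illustration, that fact rows carry a date_key in YYMMDD form (as the keys on the first slide appear to) and that one partition per month is wanted. The function names are hypothetical.

```python
from collections import defaultdict


def partition_by_month(fact_rows):
    """Split fact rows into one partition per month (first four digits, YYMM)."""
    partitions = defaultdict(list)
    for row in fact_rows:
        partitions[row["date_key"][:4]].append(row)
    return dict(partitions)


def total_minutes(partitions, month):
    """Scan only the partition for the requested month, not the whole table."""
    return sum(row["minutes"] for row in partitions.get(month, []))


fact_rows = [
    {"date_key": "991011", "minutes": 25},
    {"date_key": "991012", "minutes": 8},
    {"date_key": "991101", "minutes": 12},
]
partitions = partition_by_month(fact_rows)
```

A query constrained to October 1999 touches only the "9910" partition, so the amount of data scanned shrinks without any index being involved.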

A family of stars

A dimensional model for a large data warehouse consists of between 10 and 25 similar-looking star-join schemas. Each star join will have 5 to 15 dimension tables.
Conformed (shared) dimensions allow drill-across. A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined.

Value chains as families of star-join schemas
There are two sides to the value chain:
- the demand side: the steps needed to satisfy the customers’ demand for the product
- the supply side: the steps needed to manufacture the products from original ingredients or parts
The chain consists of a sequence of inventory and flow star-join schemata.
Joining the different star-join schemata is only possible when two sequential schemata have a common, identical dimension.
Sometimes the represented chain can be extended beyond the bounds of the business itself.

Value chains as families of star-join schemas
Supply chain
- Raw material production
- Ingredient purchasing
- Ingredient delivery
- Ingredient inventory
- Bill of materials
- Manufacturing process control
- Manufacturing costs
- Packaging
- Trans-shipping to warehouse
- Finished goods inventory
Demand chain
- Finished goods inventory
- Manufacturing shipments
- Distributor inventory
- Distributor shipments
- Retail inventory
- Retail sales

The Data Warehouse Bus
[Figure: the business processes Orders and Production plugged into a bus of shared dimensions: Time, Sales Rep, Customer, Promotion, Product, Plant, Distr. Center]
Allows the parallel development of business process data marts, with the ability to integrate them.

Problems of Data Warehousing
- Complexity of integration
- Hidden problems with source systems
- Data homogenisation
- Underestimation of resources for data loading
- Required data not captured
- High maintenance
- Long duration projects
Why not integrate the legacy applications (OLTP systems) instead?
It is hard to get the data out. It is hard to make the data homogeneous, so that it means the same thing everywhere. The data one needs does not exist. Then there is the maintenance of the DW.

The stove pipe problem
[Figure: the business processes Market, Purchase, Production, Shipment and Service, drawn against the IT systems (system1 to system5) that support them]

Integrating the Enterprise - ERP/ES & DW
[Figure: an enterprise system built around a central database that serves Sales & delivery, Services, Financials, Manufacturing, Inventory and Human resource, with Reporting and DW on top. The surrounding users are Managers, Customers, Suppliers, Sales force, Customer service, Back office and Employees.]
Any organisation has to interface with its customers by means of front-office functions such as sales and services, and with its suppliers by means of corresponding back-office units. The organisational functions that need to be supported are hence sales and delivery, services, financials, manufacturing and inventory. Human resources also have to be managed, and managers need to be supported by information and system functionality that help them make wise tactical and strategic decisions. In an enterprise system all of this would be supported by an integrated set of databases with no unnecessary redundancy.

Case: Integration of Telecom systems - point-to-point

Integration of Telecom systems - Message Brokers & DW

Dimensional modelling vs. ER-modelling
Entity-relationship modelling
- a logical design technique that eliminates data redundancy in order to maintain consistency and storage efficiency
- makes transactions simple and deterministic
- ER models for an enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables
Dimensional modelling
- a logical design technique that presents data in an intuitive way and allows high-performance access
- aims at modelling decision-support data
- easier for the user to navigate, with high performance

Why dimensional modelling?
- the logical model is easy to understand
- a predictable, standard framework for end-user applications
- the logical design can be done nearly independently of expected query patterns
- handles changes easily, at least the addition of new dimensional attributes
- high-performance “browsing” across the attributes, eliminating joins and making use of bit vector indexes
- a strategy for handling aggregates, e.g. summary records that are logically redundant with the base table, to enhance query performance
- the database engine can make strong assumptions about how to optimise
- strategies for handling slowly changing dimensions, heterogeneous products and event handling (“factless fact tables”)

Steps in the Design Process
1 Choose a business process to model
A business process is a major operational process in an organisation that is supported by some kind of legacy system (or systems) from which data can be collected, e.g. orders, invoices, shipments, inventory.
2 Choose the grain of the business process
The grain is the level of detail at which the data is represented in the DW. Typical grains are individual transactions and individual daily (or monthly) snapshots.
3 Choose the dimensions that will apply to each fact table record
Typical dimensions are time, product, customer, store, etc.
4 Choose the measured facts that will populate the fact table
E.g. quantity sold, dollars sold.

Consider the following questions
- How much total business did my newly remodelled stores do compared with the chain average?
- How did leather goods items costing less than $5 do with my most frequent shoppers?
- What was the ratio of total revenue on non-holiday weekend days to that on holiday weekend days?

A major point of the analysis done so far is to show that the foundation layers of all these dimensional data warehouses have a huge number of records. This large number of detailed records is needed, not because an analyst is going to need a single record at this level, but because the analyst must be able to make very precise cuts between records. If we don’t track the individual accounts, we can’t group accounts into the most interesting clumps. If we don’t track the lowest-level product subdivision, then we can’t pull out all the groups of products we might want to track.

Even though many queries will opportunistically need to retrieve clumps of base-level data, most queries will relax the constraints on one or more of the dimensions. If all we have is base-level data, then relaxing constraints on one or more dimensions means that we let a vast number of records into our query.

Consider the queries … In all three cases, we need very detailed data from one of the major dimensions, but we are either summarising or omitting detail from some of the other dimensions. In the first query, we need specific stores but we are summarising over all products. In the second query, we need specific products but we are summarising across all stores. In the third query, we need specific days on the calendar but we are summarising across both product and store. All of these questions will be expensive to process if we have only base-level data. What is needed here is a set of several different precomputed aggregates that will accelerate each of these queries. The effect on performance is not small. It is typical to expect anywhere from a 10-fold to a 1000-fold improvement in runtime by having the right aggregates available.

Aggregation
Aggregations can be created on the fly or by pre-aggregation.
An aggregate is a fact table record representing a summarisation of base-level fact table records.

The use of prestored summaries (aggregates) is the single most effective tool the data warehouse designer has to control performance. An aggregate fact table record is always associated with one or more aggregate dimension table records. For example, in a base-level fact table dimensioned by UPC (Universal Product Code) level products, individual stores and individual days, we can imagine the following aggregate records:
- category-level product aggregates by store by day
- district-level store aggregates by product by day
- monthly sales aggregates by product by store
- category-level product aggregates by store district by day
- category-level product aggregates by store district by month
Not only do we have these new fact table records, but we must also have aggregate dimension table entries describing category-level products, store districts and months.
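As an illustration of pre-aggregation, the sketch below rolls base-level facts up to "category-level product aggregates by store by day". The product names, categories, dates and dollar amounts are made up for the example, and build_category_aggregate is a hypothetical helper, not part of any tool mentioned in the lecture.

```python
from collections import defaultdict


def build_category_aggregate(base_facts, category_of):
    """Roll base-level facts (product, store, day, dollars) up to category level."""
    totals = defaultdict(int)
    for product, store, day, dollars in base_facts:
        totals[(category_of[product], store, day)] += dollars
    return dict(totals)


# Illustrative dimension mapping and base-level fact rows.
category_of = {"napkin": "paper", "tissue": "paper", "milk": "dairy"}
base_facts = [
    ("napkin", "store1", "1996-01-01", 10),
    ("tissue", "store1", "1996-01-01", 15),
    ("milk",   "store1", "1996-01-01", 7),
    ("napkin", "store2", "1996-01-01", 4),
]
category_facts = build_category_aggregate(base_facts, category_of)
```

Each key of category_facts is one aggregate fact record. The category values it refers to are exactly the aggregate dimension entries that the notes say must also exist.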

How to store aggregates
- as new Level fields in an already existing fact table
- as new fact tables

Example of a Dimensional Model

New Level field for Aggregates
One approach to storing aggregates is to use Level fields in the affected dimension tables, thereby allowing the aggregate fact records to reside in the original fact table. Applying this to the grocery store aggregate example with category totals results in the model shown. The Level field describes the aggregation level of every record in the dimension table. All of the original base-level records are encoded with Level = Base. The new category total records are encoded with Level = Category. These category product aggregate records need keys that are compatible with, and do not conflict with, the original base-level keys in the product dimension table. However, all the rest of the fields have no valid meaning and have to take on null, ‘NA’ or ‘Total’ values.
The most serious application problem with this solution is that it is possible to double count. This happens if the requesting application fails to constrain on a single value of the Level field. For instance, if the only product table constraint is Category = Paper, then both the base-level records and the category-level records will be retrieved.
NB! Constrain queries on the Level field to avoid double counting.
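The double-counting risk can be reproduced in a few lines. The following is a sketch with invented keys and dollar amounts. The Level values Base and Category and the Category = Paper constraint come from the text above, while the function name and data layout are assumptions.

```python
# Product dimension with a Level field: two base-level rows and one
# category-level aggregate row (key 900 is an arbitrary non-conflicting key).
product_dim = [
    {"product_key": 1,   "description": "Napkin", "category": "Paper", "level": "Base"},
    {"product_key": 2,   "description": "Tissue", "category": "Paper", "level": "Base"},
    {"product_key": 900, "description": "Total",  "category": "Paper", "level": "Category"},
]

# Base facts and the pre-stored category aggregate share one fact table.
sales_fact = [
    {"product_key": 1,   "dollars": 10},
    {"product_key": 2,   "dollars": 15},
    {"product_key": 900, "dollars": 25},
]


def dollars_sold(category, level=None):
    """Sum dollars for a category, optionally constrained to one Level value."""
    keys = {
        p["product_key"]
        for p in product_dim
        if p["category"] == category and (level is None or p["level"] == level)
    }
    return sum(f["dollars"] for f in sales_fact if f["product_key"] in keys)
```

Without a Level constraint, dollars_sold("Paper") returns 50, twice the true figure. Constraining to level="Base" or level="Category" gives the correct 25 either way.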

New Level field for Aggregates - an example
- $ sold, napkins per day?
- $ sold, tissue per day?
- $ sold, paper per day?
Well, is this a solution you would choose?

New Tables

New Fact Tables for Aggregates
The creation of an aggregate fact table requires the creation of:
- a derivative dimension
- an artificial key for each new derivative dimension

How to store aggregates
as new Level fields in an already existing fact table
- problems with double counting
- visible to the users
as new fact tables
+ no problems with double counting
+ invisible to the users
+ are easily introduced and/or reduced at different points in time
+ simpler metadata
+ simpler choice of key
+ the size of the field for the summarised data does not increase the size of the field for the basic data

Sparsity Failure
The planning of aggregate table sizes can be tricky because of the phenomenon called sparsity failure. This phenomenon appears when we build aggregates on sparse tables. For example, in the grocery store item movement database, only about 10% of the products in the store are actually sold in a given store on a given day. Even disregarding the promotion dimension, the database is only 10% occupied in the primary keys of product, store and time. However, when we build aggregates, the occupancy rate shoots up dramatically.
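A rough back-of-the-envelope calculation shows why occupancy rises. The sketch assumes, purely for illustration, that each product sells independently with the 10% probability quoted above, that there are 10,000 products, and that each category groups 10 of them. None of these figures beyond the 10% comes from the lecture.

```python
def aggregate_occupancy(base_occupancy, members_per_group):
    """Chance that at least one member of a group has a base-level record.

    Assumes, for illustration only, that members sell independently.
    """
    return 1 - (1 - base_occupancy) ** members_per_group


products = 10_000
products_per_category = 10
categories = products // products_per_category

# Expected rows per store per day at each level.
base_rows = products * 0.10
category_rows = categories * aggregate_occupancy(0.10, products_per_category)
```

Under these assumptions the category table is about 65% occupied, so it holds roughly 650 rows per store per day against 1,000 base rows. Rolling up by a factor of 10 shrinks the table far less than tenfold, which is the planning trap that sparsity failure describes.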

Aggregation Navigator
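[Figure: a query tool (client or application server) sends base-level SQL to the aggregate navigator and receives aggregated results. The aggregate navigator consults the aggregation metadata and sends aggregate-aware SQL to the DBMS, which holds the data and the aggregations and returns aggregated results.]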

An example of an SQL query
SELECT category_description, sum(sales_dollars)
FROM base_sales_fact, product, store, time
WHERE base_sales_fact.product_key = product.product_key
  AND base_sales_fact.store_key = store.store_key
  AND base_sales_fact.time_key = time.time_key
  AND store.city = 'Cincinnati'
  AND time.day = 'January 1, 1996'
GROUP BY category_description
On the slide, base_sales_fact and product are annotated with category_sales_fact and category_product, the aggregate tables that can take their place.

An example of an SQL query, cont.
SELECT category_description, sum(sales_dollars)
FROM category_sales_fact, category_product, store, time
WHERE category_sales_fact.product_key = category_product.product_key
  AND category_sales_fact.store_key = store.store_key
  AND category_sales_fact.time_key = time.time_key
  AND store.city = 'Cincinnati'
  AND time.day = 'January 1, 1996'
GROUP BY category_description

Aggregation Navigator
- Insulates end-user applications from the changing portfolio of aggregates
- Allows the DBA to adjust the aggregates dynamically and seamlessly for the end user, without having to roll over the application base
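A toy version of the rewriting step might look as follows. It is a sketch, not how any particular aggregate navigator works: the AGGREGATES metadata, its row counts and the attribute names brand and upc are invented, and the rewrite is a plain text substitution on the table names used in the example query.

```python
import re

# Hypothetical aggregation metadata: which product attributes each fact table
# can answer, and roughly how many rows it holds (row counts are made up).
AGGREGATES = [
    {"fact": "category_sales_fact", "product_dim": "category_product",
     "product_attrs": {"category_description"}, "rows": 1_000_000},
    {"fact": "base_sales_fact", "product_dim": "product",
     "product_attrs": {"category_description", "brand", "upc"}, "rows": 100_000_000},
]


def navigate(base_sql, product_attrs_used):
    """Rewrite base-level SQL to use the smallest table that can answer it."""
    candidates = [a for a in AGGREGATES if product_attrs_used <= a["product_attrs"]]
    best = min(candidates, key=lambda a: a["rows"])
    sql = base_sql.replace("base_sales_fact", best["fact"])
    # \b keeps column names such as product_key untouched: "_" is a word character.
    return re.sub(r"\bproduct\b", best["product_dim"], sql)


base_sql = (
    "SELECT category_description, sum(sales_dollars) "
    "FROM base_sales_fact, product, store, time "
    "WHERE base_sales_fact.product_key = product.product_key "
    "GROUP BY category_description"
)
```

The application keeps issuing base_sql. Whether it is answered from category_sales_fact or from the base table depends only on the metadata, so the DBA can add or drop aggregates without touching the application.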

Aggregations - summary
- Pre-aggregation demands more storage space but provides better query performance
- The lowest level of aggregation is determined by the granularity of the fact table
- Aggregation is easier when the facts are all additive

Bitmap indexing
An effective indexing technique for attributes with low-cardinality domains.
There is a distinct bit vector BV for each value V of the domain.
Example: the attribute sex has the values M and F. A table of 100 million people needs 2 lists of 100 million bits.

Bitmap Index
[Figure: a base table of customers alongside its Region index and Rating index]
SELECT Customers FROM Base Table WHERE Region = W AND Rating = L

Text from Ramakrishnan & Gehrke, p. 691 (but the example is the same one I used last year).
Consider a table that describes customers: CUST(Custid:Int, Region:String, Rating:String, …) where Domain(Region) = {N,W,S,E} and Domain(Rating) = {H,M,L}. Columns with few possible values are called sparse (like Region and Rating). We can exploit sparsity to construct a new kind of index that greatly speeds up queries on these columns. The idea is to record values for sparse columns as a sequence of bits, one for each possible value. For example, a rating value is 100, 010 or 001: a 1 in the first position denotes ‘high’, a 1 in the second position denotes ‘medium’, and a 1 in the third position denotes ‘low’. Similarly, for region, 1000 denotes ‘North’, 0100 denotes ‘South’, 0010 denotes ‘East’ and 0001 denotes ‘West’.
If we consider the rating values for all rows in the Customers table, we can treat them as a collection of three bit vectors: one associated with the value ‘high’, one with ‘medium’ and one with ‘low’. Each bit vector has one bit per row in the Customers table, indicating whether the value in that row is the value associated with the bit vector. The collection of bit vectors for a column is called a bitmap index for that column.

Bitmap Index
[Figure: the Region = W bit vector combined by AND with the Rating = L bit vector]
Bitmap indexes offer two important advantages over conventional hash and tree indexes. First, they allow the use of efficient bit operations to answer queries. For example, consider the query “How many customers from the West region are rated as low?” We can take the fourth bit vector for Region and do a bit-wise AND with the third bit vector for Rating to obtain a bit vector that has a 1 for every customer in the West region with rating Low. We can then count the number of 1s in this bit vector to answer the query. Second, bitmap indexes can be much more compact than a traditional B+ tree index and are very amenable to the use of compression techniques.
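The bit-wise AND described above can be sketched with Python integers standing in for bit vectors. The six customer rows are invented for the example, and build_bitmap_index and count_matching are hypothetical helper names.

```python
def build_bitmap_index(values):
    """Return {value: int}, where bit i of the int is 1 if row i holds that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


# Illustrative Customers table, one entry per row.
regions = ["N", "W", "S", "W", "E", "W"]
ratings = ["H", "L", "L", "M", "L", "L"]

region_index = build_bitmap_index(regions)
rating_index = build_bitmap_index(ratings)


def count_matching(region, rating):
    """Count rows with the given region AND rating using one bit-wise AND."""
    both = region_index.get(region, 0) & rating_index.get(rating, 0)
    return bin(both).count("1")
```

Here count_matching("W", "L") ANDs the West vector (rows 1, 3 and 5) with the Low vector (rows 1, 2, 4 and 5) and counts the two rows they share, without touching the base table.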


Multi-dimensional OLAP (MOLAP)
[Figure: MOLAP architecture. A relational DB server and/or legacy systems load data into the MOLAP server (database & application logic layer). End-user access tools (presentation layer) send data requests to the MOLAP server and receive result sets.]

Relational OLAP (ROLAP)
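[Figure: ROLAP architecture. End-user access tools (presentation layer) send data requests to the ROLAP server (application logic layer), which issues SQL to the db server (database layer). Result sets flow back from the db server through the ROLAP server to the tools.]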

DB2’s Integration Server Architecture
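[Figure: DB2 Integration Server architecture. Components shown: the Integration Server desktop with OLAP Model and OLAP Metaoutline, the Integration Server, the DB2 OLAP server, the relational data source, the OLAP Metadata Catalog, the OLAP Command Interface and the DB2 OLAP database, connected via TCP/IP and ODBC.]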
