Data Warehouse Schema Star Schema Snowflake Schema Fact constellation
Star Schema A single, large and central fact table and one table for each dimension. The star schema separates business data into facts, which hold the measurable, quantitative data about a business, and dimensions which are descriptive attributes related to fact data. Examples of fact data include sales price, sale quantity, and time, distance, speed, and weight measurements. Related dimension attribute examples include product models, product colors, product sizes, geographic locations, and salesperson names.
Star Schema (contd..) Store Key Product Key Period Key Units Price Store Dimension Time Dimension Product Dimension Fact Table Benefits: Easy to understand, easy to define hierarchies, reduces no. of physical joins. Store Key Store Name City State Region Period Key Year Quarter Month Product Key Product Desc
SnowFlake Schema Variant of star schema model. A single, large and central fact table and one or more tables for each dimension. Dimension tables are normalized i.e. split dimension table data into additional tables
SnowFlake Schema (contd..) Store Key Product Key Period Key Units Price Time Dimension Product Dimension Fact Table Store Key Store Name City Key Period Key Year Quarter Month Product Key Product Desc City Key City State Region City Dimension Store Dimension Drawbacks: Time consuming joins,report generation slow
7 Fact Constellation Multiple fact tables that share many dimension tables Booking and Checkout may share many dimension tables in the hotel industry Hotels Travel Agents Promotion Room Type Customer Booking Checkout
8 Comparison Snowflake SchemaStar Schema Ease of maintenance/change: No redundancy and hence more easy to maintain and change Has redundant data and hence less easy to maintain/change Ease of Use: More complex queries and hence less easy to understand Less complex queries and easy to understand Query Performance: More foreign keys-and hence more query execution time Less no. of foreign keys and hence lesser query execution time Normalization: Has normalized tablesHas De-normalized tables Type of Data warehouse: Good to use for datawarehouse core to simplify complex relationships (many:many) Good for datamarts with simple relationships (1:1 or 1:many) Joins: Higher number of JoinsFewer Joins Dimension table: It may have more than one dimension table for each dimension Contains only single dimension table for each dimension When to use: When dimension table is relatively big in size, snowflaking is better as it reduces space. When dimension table contains less number of rows, we can go for Star schema.
To understand a Multidimensional view of data and how it can be implemented in a relational database. ROLAP – Relational On-Line Analytical Processing. To be aware of extensions have been added to SQL to make OLAP queries easier to write; in particular: Rollup Operator Cube Operator 9 ROLAP and SQL
10 Star schema (dimensional model) for property sales of DreamHome
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of managements decision-making process (Inmon, 1993). The technique of taking large amounts of operational data and putting them into a large database, with the aim of analysing them every which way to yield useful information Scoring Points – How Tesco Continues to Win Customer Loyalty. Terry Hunt Pub. Kogan Page 11 Data Warehouse Definitions
Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration. OLAP: Complex SQL queries and views. Queries based on spreadsheet-style operations and multidimensional view of data. Interactive and online queries. Data Mining: Exploratory search for interesting trends and anomalies. (Another module/topic!) 12 Three Complementary Trends
Fact table in BCNF; dimension tables un-normalized. Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance. This kind of schema is very common in OLAP applications, and is called a star schema; computing the join of all these relations is called a star join. 13 Design Issues – Further Example pricecategorypnamepidcountrystatecitylocid saleslocidtimeidpid holiday_flagweekdatetimeidmonthquarteryear (Fact table) SALES TIMES PRODUCTSLOCATIONS
Collection of numeric measures, which depend on a set of dimensions. E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid). 14 Multidimensional Data Model timeid pid pid timeidlocid sales locid Slice locid=1 is shown:
Multidimensional data can be stored physically in a (disk- resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems. The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table. E.g., Products(pid, pname, category, price) Fact tables are much larger than dimensional tables. 15 MOLAP vs ROLAP
For each dimension, the set of values can be organized in a hierarchy: 16 Dimension Hierarchies PRODUCTTIMELOCATION category week month state pname date city year quarter country
Influenced by SQL and by spreadsheets. A common operation is to aggregate a measure over one or more dimensions. Find total sales. Find total sales for each city, or for each state. Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., Given total sales by city, we can roll-up to get sales by state. 17 OLAP Queries
Drill-down: The inverse of roll-up. E.g., Given total sales by state, can drill-down to get total sales by city. E.g., Can also drill-down on different dimensions to get total sales by product for each state. Pivoting: Aggregation on selected dimensions. E.g., Pivoting on Location and Time yields this cross- tabulation: 18 OLAP Queries WI CA Total Total
The cross-tabulation obtained by pivoting can also be computed using a collection of SQLqueries: 19 Comparison with SQL Queries SELECT T.Year, L.state, SUM(S.sales) FROM Sales S, Times T, Locations L WHERE S.timeid=T.timeid AND S.timeid=L.timeid GROUP BY T.year, L.state This query generates the entries in the body of the pivot chart on slide 12.
SELECT L.state, SUM(S.sales) FROM Sales S, Location L WHERE S.timeid=L.timeid GROUP BY L.state This query produces the summary row at the bottom of the pivot chart on slide
SELECT T.Year, SUM(S.sales) FROM Sales S, Times T WHERE S.timeid=T.timeid GROUP BY T.Year; This query produces the summary column on the right of the pivot chart on slide
The cumulative sum in the bottom-right corner of the chart is produced by this query: SELECT SUM (S.sales) FROM Sales S, Locations L WHERE S.locid=L.locid; 22
The Group by clause with the CUBE keyword is equivalent to a collection of GROUP BY statements: The result of the previous four queries can be produced with one query using the CUBE keyword. SELECT T.year, L.state, SUM (S.sales) FROM Sales S, Times T, Locations L WHERE S.timid=T.timeid AND S.locid.L.locid GROUP BY CUBE (T.year, L.state); The results of this query (on the following slide) is a tabular representation of the cross tabulation on slide The Cube Operator
T.YearL.StateSUM(S.sales) 1995WI CA NULL WI CA NULL WI CA NULL110 NULLWI176 NULLCA223 NULL
S#P#QTY S1P1300 S1P2200 S2P1300 S2P2400 S3P2200 S4P OLAP: Example of Queries using Rollup keyword The above table (called SP) shows Suppliers who supply Parts in a certain quantity. S# = Supplier No and P# = Part number.
1.Get the total shipment quantity. 2.Get total shipment quantities by supplier 3.Get total shipment quantities by part 4.Get total shipment quantities by supplier and part. 26 Queries
1 Get the total shipment quantity. Select SUM (QTY) From SP; 2 Get total shipment quantities by supplier Select S#, SUM(QTY) From SP Group by (S#); 27
3 Get total shipment quantities by part. Select P#, SUM(QTY) From SP Group by (P#); 4 Get total shipment quantities by supplier and part. Select S#, P#, SUM(QTY) From SP Group by (S#, P#); 28
Consider the following query Select S#, P#, Sum (QTY) As TOTQTY From SP Group by Rollup (S#, P#); The query is a bundled SQL formulation of Queries 1, 2 and 4. Result is on the next slide. Called Rollup because the quantities have been rolled up along the Supplier dimension. 29 Rollup
S#P#TOTQTY S1P1300 S1P2200 S2P1300 S2P2400 S3P2200 S4P2200 S1NULL500 S2NULL700 S3NULL200 S4NULL200 NULL Result of Rollup
Consider the following query Select S#, P#, Sum (QTY) As TOTQTY From SP Group by CUBE (S#, P#); The result of this query, on the next slide, is a bundle of all four of the original queries. Note C. J. Date describes this as … a table (an SQL-style table, at any rate,) but it is hardly a relation….In fact the result table in this example can be regarded as an outer union….outer union is not a respectable relational operation. An introduction to Database Systems eighth edition. C.J.Date Addison Wesley CUBE
32 Result of Cube S#P#TOTQTY S1P1300 S1P2200 S2P1300 S2P2400 S3P2200 S4P2200 S1NULL500 S2NULL700 S3NULL200 S4NULL200 NULLP1600 NULLP21000 NULL 1600
The drawbacks to the previous queries without CUBE and Rollup is obvious. Formulating several similar but distinct queries is tedious Using these additions to SQL we can attempt to try and represent several levels of aggregation in a single query. This is the motivation behind Rollup and Cube which are extra options on the GROUP BY clause in SQL 33 Several Levels of Aggregation in same query – Rollup
Multi-Dimensional views of data can be stored in a relational database. The SQL Group by clause can be used to aggregate data but queries can be tedious. The new keywords of Rollup and Cube have been added to SQL for use in Relational OnLine Analytical Processing (ROLAP) Practical exercises in the lab next week. 34 Summary