2 Data Warehouse Schema
Star schema
Snowflake schema
Fact constellation
A data warehouse needs a subject-oriented, multidimensional data model that facilitates online analysis.
3 Star Schema
A single, large, central fact table and one table for each dimension.
The star schema separates business data into facts, which hold the measurable, quantitative data about a business, and dimensions, which are descriptive attributes related to the fact data.
Examples of fact data include sales price, sale quantity, and time, distance, speed, and weight measurements.
Related dimension attribute examples include product models, product colors, product sizes, geographic locations, and salesperson names.
4 Star Schema (contd.)
Store Dimension: Store Key, Store Name, City, State, Region
Fact Table: Store Key, Product Key, Period Key, Units, Price
Time Dimension: Period Key, Year, Quarter, Month
Product Dimension: Product Key, Product Desc
Benefits: easy to understand, easy to define hierarchies, reduces the number of physical joins.
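The star layout above can be sketched as runnable SQL. This is only an illustration, assuming SQLite through Python's sqlite3 module: the table names (store, product, period, sales_fact), the snake_case column names and all rows are made up for the example, while the columns themselves follow the slide.

```python
import sqlite3

# In-memory database. Columns follow the slide; table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                      city TEXT, state TEXT, region TEXT);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE period  (period_key INTEGER PRIMARY KEY, year INTEGER,
                      quarter INTEGER, month INTEGER);
-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    store_key   INTEGER REFERENCES store(store_key),
    product_key INTEGER REFERENCES product(product_key),
    period_key  INTEGER REFERENCES period(period_key),
    units INTEGER,
    price REAL
);
INSERT INTO store VALUES (1, 'Downtown', 'Madison', 'WI', 'Midwest'),
                         (2, 'Harbor', 'Oakland', 'CA', 'West');
INSERT INTO product VALUES (10, 'Kettle'), (11, 'Toaster');
INSERT INTO period VALUES (100, 1996, 1, 1), (101, 1996, 1, 2);
INSERT INTO sales_fact VALUES (1, 10, 100, 5, 20.0),
                              (1, 11, 101, 2, 35.0),
                              (2, 10, 100, 3, 20.0);
""")

# Any dimension attribute is exactly one join away from the fact table.
units_by_region = conn.execute("""
    SELECT s.region, SUM(f.units)
    FROM sales_fact f
    JOIN store s ON f.store_key = s.store_key
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
print(units_by_region)  # [('Midwest', 7), ('West', 3)]
```

Because region is stored directly in the store dimension, grouping by it needs a single join.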
5 Snowflake Schema
A variant of the star schema model.
A single, large, central fact table and one or more tables for each dimension.
Dimension tables are normalized, i.e. dimension table data is split into additional tables.
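A minimal sketch of what normalizing a dimension means in practice, assuming SQLite through Python's sqlite3 module. The split of a region table out of the store dimension, and every name and row here, are hypothetical choices made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked store dimension: region attributes live in their own table.
CREATE TABLE region (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE store  (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT,
                     region_key INTEGER REFERENCES region(region_key));
CREATE TABLE sales_fact (store_key INTEGER REFERENCES store(store_key),
                         units INTEGER);
-- Made-up rows.
INSERT INTO region VALUES (1, 'Midwest'), (2, 'West');
INSERT INTO store VALUES (1, 'Downtown', 'Madison', 1),
                         (2, 'Harbor', 'Oakland', 2),
                         (3, 'Lakeside', 'Chicago', 1);
INSERT INTO sales_fact VALUES (1, 5), (2, 3), (3, 4);
""")

# Reaching the region name now takes two joins instead of one.
units_by_region = conn.execute("""
    SELECT r.region_name, SUM(f.units)
    FROM sales_fact f
    JOIN store s  ON f.store_key = s.store_key
    JOIN region r ON s.region_key = r.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(units_by_region)  # [('Midwest', 9), ('West', 3)]
```

The region name is stored once per region rather than once per store, at the cost of the extra join.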
7 Fact Constellation
Multiple fact tables that share many dimension tables.
Example: in the hotel industry, the Booking and Checkout fact tables may share many dimension tables.
Fact tables: Booking, Checkout
Dimension tables: Hotels, Travel Agents, Promotion, Room Type, Customer
8 Comparison
Ease of maintenance/change
  Snowflake schema: no redundancy, so easier to maintain and change.
  Star schema: has redundant data, so harder to maintain and change.
Ease of use
  Snowflake schema: more complex queries, so harder to understand.
  Star schema: less complex queries, so easier to understand.
Query performance
  Snowflake schema: more foreign keys, hence longer query execution time.
  Star schema: fewer foreign keys, hence shorter query execution time.
Normalization
  Snowflake schema: has normalized tables.
  Star schema: has de-normalized tables.
Type of data warehouse
  Snowflake schema: good for the data warehouse core, to simplify complex relationships (many:many).
  Star schema: good for data marts with simple relationships (1:1 or 1:many).
Joins
  Snowflake schema: higher number of joins.
  Star schema: fewer joins.
Dimension table
  Snowflake schema: may have more than one dimension table for each dimension.
  Star schema: contains only a single dimension table for each dimension.
When to use
  Snowflake schema: when the dimension table is relatively big, snowflaking is better as it reduces space.
  Star schema: when the dimension table contains a small number of rows, a star schema is sufficient.
9 ROLAP and SQL
To understand a multidimensional view of data and how it can be implemented in a relational database. ROLAP – Relational On-Line Analytical Processing.
To be aware of the extensions that have been added to SQL to make OLAP queries easier to write, in particular:
  Rollup operator
  Cube operator
10 Star schema (dimensional model) for property sales of DreamHome
11 Data Warehouse Definitions
A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993).
“The technique of taking large amounts of operational data and putting them into a large database, with the aim of analysing them every which way to yield useful information” (Scoring Points – How Tesco Continues to Win Customer Loyalty. Terry Hunt. Pub. Kogan Page)
12 Three Complementary Trends
Data Warehousing: Consolidate data from many sources in one large repository.
  Loading, periodic synchronization of replicas.
  Semantic integration.
OLAP:
  Complex SQL queries and views.
  Queries based on spreadsheet-style operations and a “multidimensional” view of data.
  Interactive and “online” queries.
Data Mining: Exploratory search for interesting trends and anomalies. (Another module/topic!)
13 Design Issues – Further Example
SALES (fact table): pid, timeid, locid, sales
TIMES: timeid, date, week, month, quarter, year, holiday_flag
PRODUCTS: pid, pname, category, price
LOCATIONS: locid, city, state, country
The fact table is in BCNF; the dimension tables are un-normalized.
Dimension tables are small; updates/inserts/deletes are rare. So, anomalies are less important than query performance.
This kind of schema is very common in OLAP applications, and is called a star schema; computing the join of all these relations is called a star join.
14 Multidimensional Data Model
A collection of numeric measures, which depend on a set of dimensions.
E.g., measure Sales, with dimensions Product (key: pid), Location (key: locid), and Time (key: timeid).
[Figure: the Sales measure as a three-dimensional array with axes pid, timeid and locid; the slice locid=1 is shown.]
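The idea of a measure indexed by dimensions, and of taking a slice, can be sketched in a few lines of Python. The cell values and the helper name slice_by_location are made up for illustration; only the dimension keys (pid, timeid, locid) come from the slide.

```python
# Hypothetical cells of the cube: (pid, timeid, locid) -> sales.
cube = {
    (11, 1, 1): 25, (11, 2, 1): 8, (12, 1, 1): 30,
    (11, 1, 2): 35, (12, 2, 2): 10,
}

def slice_by_location(cells, locid):
    """Fix the Location dimension and keep the remaining (pid, timeid) plane."""
    return {(pid, timeid): sales
            for (pid, timeid, loc), sales in cells.items()
            if loc == locid}

loc1_slice = slice_by_location(cube, 1)
print(loc1_slice)  # {(11, 1): 25, (11, 2): 8, (12, 1): 30}
```

Fixing one dimension to a single value leaves a two-dimensional table over the other two dimensions.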
15 MOLAP vs ROLAP
Multidimensional data can be stored physically in a (disk-resident, persistent) array; systems that do this are called MOLAP systems. Alternatively, the data can be stored as a relation; these are called ROLAP systems.
The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
  E.g., Products(pid, pname, category, price)
Fact tables are much larger than dimension tables.
16 Dimension Hierarchies
For each dimension, the set of values can be organized in a hierarchy (highest level first):
PRODUCT: category, pname
TIME: year, quarter, week / month, date
LOCATION: country, state, city
17 OLAP Queries
Influenced by SQL and by spreadsheets.
A common operation is to aggregate a measure over one or more dimensions.
  Find total sales.
  Find total sales for each city, or for each state.
Roll-up: Aggregating at different levels of a dimension hierarchy.
  E.g., given total sales by city, we can roll up to get sales by state.
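A small sketch of the roll-up from city to state, assuming SQLite through Python's sqlite3 module. The Sales and Locations tables follow the schema used elsewhere in these slides; the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE Sales (pid INTEGER, timeid INTEGER, locid INTEGER, sales INTEGER);
-- Made-up rows.
INSERT INTO Locations VALUES (1, 'Madison', 'WI', 'USA'),
                             (2, 'Milwaukee', 'WI', 'USA'),
                             (3, 'Fresno', 'CA', 'USA');
INSERT INTO Sales VALUES (11, 1, 1, 25), (11, 2, 1, 8),
                         (12, 1, 2, 15), (12, 2, 3, 30);
""")

# Lower level of the Location hierarchy: total sales by city.
by_city = conn.execute("""
    SELECT L.state, L.city, SUM(S.sales)
    FROM Sales S JOIN Locations L ON S.locid = L.locid
    GROUP BY L.state, L.city
    ORDER BY L.state, L.city
""").fetchall()

# Roll-up: combine the city totals into state totals without rereading Sales.
by_state = {}
for state, city, total in by_city:
    by_state[state] = by_state.get(state, 0) + total

# The same figures computed directly at the state level, for comparison.
direct_by_state = dict(conn.execute("""
    SELECT L.state, SUM(S.sales)
    FROM Sales S JOIN Locations L ON S.locid = L.locid
    GROUP BY L.state
""").fetchall())

print(by_city)   # [('CA', 'Fresno', 30), ('WI', 'Madison', 33), ('WI', 'Milwaukee', 15)]
print(by_state)  # {'CA': 30, 'WI': 48}
```

Because SUM is additive, the state totals can be derived from the city totals; drill-down goes the other way and needs the finer-grained data.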
18 OLAP Queries
Drill-down: The inverse of roll-up.
  E.g., given total sales by state, we can drill down to get total sales by city.
  E.g., we can also drill down on a different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions.
  E.g., pivoting on Location and Time yields a cross-tabulation with columns WI, CA and Total, and rows 1995, 1996, 1997 and Total.
19 Comparison with SQL Queries
The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.state;
This query generates the entries in the body of the pivot chart on slide 18.
20
  SELECT L.state, SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state;
This query produces the summary row at the bottom of the pivot chart on slide 18.
21
  SELECT T.year, SUM(S.sales)
  FROM Sales S, Times T
  WHERE S.timeid = T.timeid
  GROUP BY T.year;
This query produces the summary column on the right of the pivot chart on slide 18.
22
The cumulative sum in the bottom-right corner of the chart is produced by this query:
  SELECT SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid;
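The way the four queries fit together can be sketched as follows, assuming SQLite through Python's sqlite3 module. The rows are made up so that the totals are easy to check by hand; the Python variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Times     (timeid INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE Sales     (pid INTEGER, timeid INTEGER, locid INTEGER, sales INTEGER);
-- Made-up rows.
INSERT INTO Times VALUES (1, 1995), (2, 1996), (3, 1997);
INSERT INTO Locations VALUES (1, 'Madison', 'WI'), (2, 'Fresno', 'CA');
INSERT INTO Sales VALUES (11, 1, 1, 10), (11, 2, 1, 20), (11, 3, 1, 30),
                         (11, 1, 2, 40), (11, 2, 2, 50), (11, 3, 2, 60);
""")

# Body of the cross-tabulation: one total per (year, state).
body = conn.execute("""
    SELECT T.year, L.state, SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
""").fetchall()
# Summary row: one total per state.
by_state = dict(conn.execute("""
    SELECT L.state, SUM(S.sales) FROM Sales S, Locations L
    WHERE S.locid = L.locid GROUP BY L.state
""").fetchall())
# Summary column: one total per year.
by_year = dict(conn.execute("""
    SELECT T.year, SUM(S.sales) FROM Sales S, Times T
    WHERE S.timeid = T.timeid GROUP BY T.year
""").fetchall())
# Bottom-right corner: the grand total.
grand_total = conn.execute("SELECT SUM(sales) FROM Sales").fetchone()[0]

# Assemble the four results into rows of [label, WI, CA, Total].
cells = {(year, state): total for year, state, total in body}
states = ["WI", "CA"]
crosstab = [[year] + [cells[(year, s)] for s in states] + [by_year[year]]
            for year in sorted(by_year)]
crosstab.append(["Total"] + [by_state[s] for s in states] + [grand_total])
for row in crosstab:
    print(row)
# [1995, 10, 40, 50]
# [1996, 20, 50, 70]
# [1997, 30, 60, 90]
# ['Total', 60, 150, 210]
```

Each query fills a different region of the chart, which is why four separate statements are needed when only plain GROUP BY is available.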
23 The Cube Operator
A GROUP BY clause with the CUBE keyword is equivalent to a collection of GROUP BY statements: the result of the previous four queries can be produced with one query using the CUBE keyword.
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY CUBE (T.year, L.state);
The result of this query (on the following slide) is a tabular representation of the cross-tabulation on slide 18.
25 OLAP: Example of Queries using the Rollup keyword
S#  P#  QTY
S1  P1  300
S1  P2  200
S2  P1  300
S2  P2  400
S3  P2  200
S4  P2  200
The above table (called SP) shows suppliers who supply parts in certain quantities. S# = supplier number and P# = part number.
26 Queries
1. Get the total shipment quantity.
2. Get total shipment quantities by supplier.
3. Get total shipment quantities by part.
4. Get total shipment quantities by supplier and part.
27
1. Get the total shipment quantity.
  SELECT SUM(QTY)
  FROM SP;
2. Get total shipment quantities by supplier.
  SELECT S#, SUM(QTY)
  FROM SP
  GROUP BY (S#);
28
3. Get total shipment quantities by part.
  SELECT P#, SUM(QTY)
  FROM SP
  GROUP BY (P#);
4. Get total shipment quantities by supplier and part.
  SELECT S#, P#, SUM(QTY)
  FROM SP
  GROUP BY (S#, P#);
29 Rollup
Consider the following query:
  SELECT S#, P#, SUM(QTY) AS TOTQTY
  FROM SP
  GROUP BY ROLLUP (S#, P#);
The query is a bundled SQL formulation of Queries 1, 2 and 4. The result is on the next slide.
It is called Rollup because the quantities have been “rolled up” along the Supplier dimension.
30 Result of Rollup
S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    NULL  500
S2    NULL  700
S3    NULL  200
S4    NULL  200
NULL  NULL  1600
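The claim that Rollup bundles Queries 1, 2 and 4 can be checked with a sketch that spells the bundle out as a UNION ALL. It assumes SQLite through Python's sqlite3 module, which has no ROLLUP keyword, so the three groupings are written by hand. The column names sno, pno and qty stand in for S#, P# and QTY, and the six SP rows are an assumption chosen to agree with the totals on the slides (500, 700, 1600).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SP (sno TEXT, pno TEXT, qty INTEGER);
-- Assumed shipment rows, consistent with the totals quoted on the slides.
INSERT INTO SP VALUES ('S1', 'P1', 300), ('S1', 'P2', 200), ('S2', 'P1', 300),
                      ('S2', 'P2', 400), ('S3', 'P2', 200), ('S4', 'P2', 200);
""")

# Hand-written equivalent of GROUP BY ROLLUP (sno, pno).
rollup_rows = conn.execute("""
    SELECT sno, pno, SUM(qty) AS totqty FROM SP GROUP BY sno, pno  -- Query 4
    UNION ALL
    SELECT sno, NULL, SUM(qty) FROM SP GROUP BY sno                -- Query 2
    UNION ALL
    SELECT NULL, NULL, SUM(qty) FROM SP                            -- Query 1
""").fetchall()

for row in rollup_rows:
    print(row)  # NULL comes back as None in Python
```

Query 3 (totals by part) is absent: ROLLUP (S#, P#) only drops grouping columns from right to left, so it never groups by P# alone.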
31 CUBE
Consider the following query:
  SELECT S#, P#, SUM(QTY) AS TOTQTY
  FROM SP
  GROUP BY CUBE (S#, P#);
The result of this query, on the next slide, is a bundle of all four of the original queries.
Note: C. J. Date describes this as “… a table (an SQL-style table, at any rate) but it is hardly a relation … In fact the result table in this example can be regarded as an ‘outer union’ … outer union is not a respectable relational operation.”
“An Introduction to Database Systems”, eighth edition. C. J. Date. Addison Wesley, 2004.
32 Result of Cube
S#    P#    TOTQTY
S1    P1    300
S1    P2    200
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S1    NULL  500
S2    NULL  700
S3    NULL  200
S4    NULL  200
NULL  P1    600
NULL  P2    1000
NULL  NULL  1600
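What CUBE computes can also be sketched outside SQL: one aggregation for every subset of the grouping columns. The function below is a plain-Python illustration, not how any database implements it. The name cube and the six SP rows are assumptions, the rows being chosen to agree with the totals on the slides (500, 700, 600, 1000, 1600).

```python
from itertools import combinations

# Assumed shipment rows: (supplier, part, quantity).
SP = [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 300),
      ("S2", "P2", 400), ("S3", "P2", 200), ("S4", "P2", 200)]

def cube(rows, n_dims):
    """Sum the measure (column n_dims) for every subset of the first n_dims columns.

    Columns left out of a grouping are reported as None, mirroring the NULLs
    in the SQL result. Assumes the data itself contains no None values.
    """
    totals = {}
    for size in range(n_dims + 1):
        for kept in combinations(range(n_dims), size):
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in range(n_dims))
                totals[key] = totals.get(key, 0) + row[n_dims]
    return totals

totals = cube(SP, 2)
print(len(totals))           # 13
print(totals[(None, "P1")])  # 600
print(totals[(None, "P2")])  # 1000
print(totals[(None, None)])  # 1600
```

With two grouping columns there are four groupings (both, supplier only, part only, neither), which correspond to Queries 4, 2, 3 and 1.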
33 Several Levels of Aggregation in the same query – Rollup
The drawbacks of the previous queries without CUBE and Rollup are obvious.
Formulating several similar but distinct queries is tedious.
Using these additions to SQL we can represent several levels of aggregation in a single query.
This is the motivation behind “Rollup” and “Cube”, which are extra options on the GROUP BY clause in SQL.
34 Summary
Multidimensional views of data can be stored in a relational database.
The SQL GROUP BY clause can be used to aggregate data, but the queries can be tedious to write.
The new keywords Rollup and Cube have been added to SQL for use in Relational On-Line Analytical Processing (ROLAP).
Practical exercises in the lab next week.