Data Warehousing and OLAP

1 Data Warehousing and OLAP
ENGI3675 Database Systems

2 Introduction
References: Cow book chapter 25; data mining book chapter 4
- Useful to analyze historical data
  - Discover trends
  - Make more informed decisions (decision support)
- Requires comprehensive databases
  - Data from several separate sources
  - Historical data
- Data warehouse: different from a regular SQL database
  - Almost no update operations
  - Lots of statistical operations (not supported by SQL)

3 Data Warehousing
A data warehouse is a massive database
- Multidimensional
- Millions of records, terabytes of data
- Consolidates data from multiple smaller databases and sources
- Contains historical data covering long periods of time, but not necessarily immediately up-to-date data
- Useful for statistical analysis and trend detection
- Not used for operational day-to-day data management

4 Data Warehousing
[Architecture diagram: operational databases and external sources feed a data cleaning and preprocessing stage, which loads the data warehouse; the warehouse in turn supports OLAP (OnLine Analytical Processing), data mining, data analysis, and user services for decision support.]

5 Data Warehousing
Top-down creation
- Design the data warehouse, then collect data
- Pro: minimizes integration problems
- Con: resulting warehouse lacks flexibility
Bottom-up creation
- Populate the warehouse from existing sources
- Pro: uses all existing data
- Con: requires data extraction, cleaning and transformation steps

6 Metadata Repository
Databases have a catalog
- An additional table that contains data about the other tables: table name, number of entries, attribute names and order, integrity constraints, indexes, etc.
- Useful for query optimization
Data warehouses have a metadata repository
- Larger and more complex than a catalog
- Needs additional information: original source of data, load date into the warehouse, cleaning algorithms used, etc.

7 Data Cubes
Data in a warehouse is best modelled as a data cube
- Multidimensional data model
- A dimension is a "perspective" (attribute or entity) by which you can sort the data
- Data is n-dimensional (not just 3D... the word "cube" is misleading!)
- A normal database table is a 2D data cube

8 Data Cubes (example)
We have a table storing sales values for categories of items in each quarter of the year (2D), for one location and one year:

Quarter | Clothing | Toys  | Food
Q1      | $605     | $825  | $400
Q2      | $680     | $952  | $512
Q3      | $812     | $1023 | $501
Q4      | $927     | $1038 | $580

9 Data Cubes (example)
Now say we have multiple locations:

Thunder Bay:
Quarter | Clothing | Toys  | Food
Q1      | $605     | $825  | $400
Q2      | $680     | $952  | $512
Q3      | $812     | $1023 | $501
Q4      | $927     | $1038 | $580

Toronto:
Quarter | Clothing | Toys  | Food
Q1      | $354     | $867  | $786
Q2      | $876     | $456  | $783
Q3      | $879     | $4523 | $789
Q4      | $321     | $4569 | $975

New York:
Quarter | Clothing | Toys  | Food
Q1      | $654     | $543  | $456
Q2      | $325     | $879  | $684
Q3      | $789     | $876  | $397
Q4      | $354     | $3645 |

10 Data Cubes (example)
It's a 3D data cube.
[Figure: a 3D cube with axes Quarter (1-4), Category (C/T/F) and Location (TB/TO/NY); e.g. the cell values 605, 825, 400 are Thunder Bay's Q1 sales.]

11 Data Cubes (example)
Add in years and it's a 4D cube.
[Figure: two 3D cubes with axes Quarter, Category (C/T/F) and Location (TB/TO/NY), one for year 2012 and one for 2013.]

12 Data Cubes
- We can generate cubes (or cuboids) for any subset of dimensions of the data
- Taking a big-picture view, we have a lattice of cuboids, each detailing some attributes and summarizing the others
- The cuboid displaying all n dimensions is the base cuboid (lowest level of summarization)
- The cuboid displaying no dimensions is the apex cuboid (highest level of summarization)
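The lattice of cuboids can be enumerated mechanically: a minimal Python sketch, where each cuboid is simply the subset of dimensions it details (dimension names taken from the example on the next slide).

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid: one per subset of dimensions, apex first."""
    return [subset
            for k in range(len(dimensions) + 1)
            for subset in combinations(dimensions, k)]

dims = ["quarter", "category", "location", "year"]
lattice = cuboid_lattice(dims)
print(len(lattice))    # 2^4 = 16 cuboids
print(lattice[0])      # () -- the apex cuboid (no dimensions)
print(lattice[-1])     # all four dimensions -- the base cuboid
```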

13 Data Cubes (example)
Lattice of cuboids, from apex to base:
- all (apex cuboid)
- quarter; category; location; year
- quarter, category; quarter, location; quarter, year; category, location; category, year; location, year
- quarter, category, location; quarter, category, year; quarter, location, year; category, location, year
- quarter, category, location, year (base cuboid)

14 Data Model Schema
- The ER model is appropriate for relational databases
- A data warehouse requires a model that illustrates data topics and summarization levels:
  - Star schema
  - Snowflake schema
  - Fact constellation schema

15 Data Model Schema
Star schema
- Simplest, most common
- One central fact table with the data
- n dimension tables detailing the n dimensions
- Only one table per dimension! This introduces redundancy:
  - Multiple tuples with the same attribute value will duplicate that attribute (those should be moved to a separate table)
  - Prevents creation of attribute value hierarchies

16 Data Model Schema (example)
Star schema:
- Sales (fact table): Quarter_key, Category_key, Location_key, Year_key, Dollars
- Location: Location_key, City_Name, Province, Country, Number_of_stores
- Quarter: Quarter_key, Number_of_days, Number_of_holidays
- Category: Category_key, Name, Number_of_items, Number_of_sales, Supplier_name, Supplier_address
- Year: Year_key, Leap_year, Recession_year, Election_year
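As a sketch, this star schema can be written as SQL DDL; the following uses Python's sqlite3 with an in-memory database, and the column types are assumptions (the slides only give table and attribute names).

```python
import sqlite3

# In-memory sketch of the star schema above: four dimension tables plus
# the central Sales fact table holding one foreign key per dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (Location_key INTEGER PRIMARY KEY, City_Name TEXT,
                       Province TEXT, Country TEXT, Number_of_stores INTEGER);
CREATE TABLE Quarter  (Quarter_key INTEGER PRIMARY KEY,
                       Number_of_days INTEGER, Number_of_holidays INTEGER);
CREATE TABLE Category (Category_key INTEGER PRIMARY KEY, Name TEXT,
                       Number_of_items INTEGER, Number_of_sales INTEGER,
                       Supplier_name TEXT, Supplier_address TEXT);
CREATE TABLE Year     (Year_key INTEGER PRIMARY KEY, Leap_year INTEGER,
                       Recession_year INTEGER, Election_year INTEGER);
CREATE TABLE Sales (
    Quarter_key  INTEGER REFERENCES Quarter(Quarter_key),
    Category_key INTEGER REFERENCES Category(Category_key),
    Location_key INTEGER REFERENCES Location(Location_key),
    Year_key     INTEGER REFERENCES Year(Year_key),
    Dollars      REAL
);
""")
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```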

17 Data Model Schema
Snowflake schema
- A star schema that allows normalization of dimension tables
- Can be more than one table per dimension
- Reduces redundancy, but at the cost of computation time for joins
- Therefore not popular

18 Data Model Schema (example)
Snowflake schema:
- Sales (fact table): Quarter_key, Category_key, Location_key, Year_key, Dollars
- Category: Category_key, Name, Number_of_items, Number_of_sales, Supplier_key
- Supplier: Supplier_key, Name, Address
- Location: Location_key, City_Name, Province_key
- Province: Province_key, Country, Number_of_stores
- Quarter: Quarter_key, Number_of_days, Number_of_holidays
- Year: Year_key, Leap_year, Recession_year, Election_year

19 Data Model Schema
Fact constellation schema
- Larger data warehouses can have multiple independent fact tables (stars) that share common dimensions
- Stars become connected in constellations
- Also called a "galaxy schema"

20 Data Model Schema (example)
Fact constellation schema:
- Sales (fact table): Quarter_key, Category_key, Location_key, Year_key, Dollars
- Warehouse (fact table): Warehouse_key, Category_key, Location_key, Space, Address
- Category: Category_key, Name, Number_of_items, Number_of_sales, Supplier_name, Supplier_address
- Quarter: Quarter_key, Number_of_days, Number_of_holidays
- Location: Location_key, City_Name, Province, Country, Number_of_stores
- Year: Year_key, Leap_year, Recession_year, Election_year

21 Data Model Schema (exercise)
- A company uses an RFID system to track tools throughout several company buildings
- A sensor in each room of each building reads and reports all RFID tags within range every second
- They need a data warehouse to keep a history of this information
- It is also necessary to keep track of departments:
  - Each room is assigned to a department
  - Each department has a specialization
  - Each tool is assigned a specialization

22 Data Model Schema (exercise)
A possible schema:
- Tracking (fact table): RFID, Location_key, Time_key
- Tools: RFID, Name, Specialization_key, Manufacturer
- Time: Time_key, Second, Minute, Hour, Day, Month, Year
- Specialization: Specialization_key, Name, Classification
- Location: Location_key, Room, Building, Department_key
- Department: Department_key, Name, Budget, Specialization_key

23 Concept Hierarchy
Attributes can be structured in concept hierarchies
- Useful for building an order into the values
- Useful for data mining algorithms
[Figure: location hierarchy - all; Canada (Québec: Montreal, Victoriaville; Ontario: Toronto, Thunder Bay) and USA (Illinois: Chicago, Urbana; New York: NYC).]

24 Concept Hierarchy
Some attributes have a total order
- Only one possible ordering of the values from top to bottom
- Creates a clean hierarchy (e.g. location: Country > Province > City > Street)
Some attributes have a partial order
- Some values are not comparable to each other
- Creates a lattice (e.g. time: Year > Quarter > Month > Day, but also Year > Week > Day, where Week and Month are not comparable)

25 OLAP Operations
OnLine Analytical Processing
- A set of tools for analysis of multidimensional data at various levels of granularity
- Building blocks for data mining and decision support algorithms
- Operations on data cubes and concept hierarchies

26 OLAP Operations
Roll-up
- Aggregating a measure over a dimension
- Given the sales per city, we roll up into sales per province
Drill-down
- Decomposing an aggregated measure over a dimension
- Given sales per province, we drill down to sales per city
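Roll-up is just aggregation along a concept hierarchy. A minimal sketch, where the city-level sales figures and the city-to-province mapping are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical city-level sales; province_of is the concept hierarchy
# linking the city level to the province level.
sales_by_city = {"Thunder Bay": 605, "Toronto": 680, "New York": 812}
province_of = {"Thunder Bay": "Ontario", "Toronto": "Ontario",
               "New York": "New York"}

def roll_up(fine, hierarchy):
    """Roll-up: aggregate the measure one level up the concept hierarchy."""
    coarse = defaultdict(int)
    for key, value in fine.items():
        coarse[hierarchy[key]] += value
    return dict(coarse)

sales_by_province = roll_up(sales_by_city, province_of)
# {'Ontario': 1285, 'New York': 812}; drilling back down means
# re-reading the finer-grained sales_by_city data.
```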

27 OLAP Operations
Slice
- An equality query on one or more dimensions
- You're taking a slice of the cube
- Example: we slice off the sales data from Thunder Bay
Dice
- A range query on one or more dimensions
- You're dicing off a smaller cube
- Example: we make a dice of sales in Thunder Bay and Toronto for toys and food in quarters 2, 3 and 4
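Both examples can be sketched over a toy cube stored as a dictionary keyed by (quarter, category, location); the four cell values are taken from the earlier example tables.

```python
# Toy 3-D cube keyed (quarter, category, location).
cube = {
    (1, "Toys", "TB"): 825, (2, "Toys", "TB"): 952,
    (2, "Food", "TO"): 783, (3, "Food", "NY"): 397,
}

def slice_op(cube, dim, value):
    """Slice: equality query, keep cells whose dimension `dim` equals `value`."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_op(cube, allowed):
    """Dice: range/membership query; `allowed` maps dimension -> allowed values."""
    return {k: v for k, v in cube.items()
            if all(k[d] in vals for d, vals in allowed.items())}

tb = slice_op(cube, 2, "TB")                # the Thunder Bay slice
sub = dice_op(cube, {0: {2, 3, 4},          # quarters 2-4,
                     1: {"Toys", "Food"},   # toys and food,
                     2: {"TB", "TO"}})      # Thunder Bay and Toronto
```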

28 OLAP Operations
Pivot (rotate)
- Changing the visualisation angle of the data
- 2D: transpose of the data
- 3D: rotating the cube, or splitting it into a set of 2D tables
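In the 2D case the pivot is a transpose, which is a one-liner over the dictionary representation (sales values from the earlier example):

```python
def pivot(table):
    """Pivot a 2-D table: swap the row and column dimensions (transpose)."""
    return {(col, row): v for (row, col), v in table.items()}

# Quarter-by-category sales from the earlier example.
sales = {("Q1", "Clothing"): 605, ("Q1", "Toys"): 825, ("Q2", "Toys"): 952}
rotated = pivot(sales)   # now keyed (category, quarter)
```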

29 OLAP Operations
[Figure: the five operations applied to the Quarter x Category x Location sales cube - a roll-up of Location from cities (TB, TO, NY) to countries (CA, US); a drill-down of Quarter into months (J, F, M, ...); a slice; a dice keeping categories T and F, locations TB and TO, and quarters 2-4; and a pivot of the result.]

30 Efficiency Considerations
- Data warehouses contain massive amounts of data: TB in size, millions of tuples
- But queries must be answered quickly: seconds at the most
- Efficiency is an issue
- The biggest cost comes from aggregation operations, e.g. computing the different cuboids in the lattice (slide 13)

31 Efficiency Considerations: Precomputing Cuboids
- Solution: precompute all cuboids in the lattice? How many could there be?
- n dimensions = 2^n cuboids
- But wait! Some dimensions have hierarchies of attributes associated with them
- If dimension i has L_i hierarchy levels, there are ∏_{i=0}^{n-1} (L_i + 1) cuboids in total
- Clearly not a good solution
- Although we can still precompute some cuboids, if we know which ones are most likely to be used
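To see how quickly the count grows, the formula is a one-line product (the example level counts are assumptions for illustration):

```python
from math import prod

def num_cuboids(levels):
    """Total cuboids when dimension i has levels[i] hierarchy levels;
    the +1 accounts for the fully-summarized ('all') level of each dimension."""
    return prod(L + 1 for L in levels)

print(num_cuboids([1, 1, 1, 1]))   # no hierarchies: 2^4 = 16
print(num_cuboids([4, 3, 1]))      # e.g. 4-level time, 3-level location: 40
```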

32 Efficiency Considerations: Refreshing Cuboids
- When the data in the warehouse changes (it is refreshed when we load in data from an operational database), precomputed cuboids don't change; we need to refresh them
- We could simply delete a cuboid and recompute it from zero
  - Simple, requires no new algorithms
  - Can work well if the data is in another system
  - Can work well on smaller tables
  - But becomes expensive on larger relations
- We can instead refresh a cuboid incrementally
  - Only update the tuples that changed
  - Cost proportional to the change

33 Efficiency Considerations: Refreshing Cuboids
Projection cuboids
- The cuboid is a projection of attributes from a single relation
- Supports slice, dice and pivot operations
- Incremental insert: compute the projection of the new tuples, and add it to the cuboid
- Incremental delete: compute the projection of the tuples to remove, and subtract it from the cuboid
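A minimal sketch of incremental maintenance for a projection cuboid, keeping the cuboid as a multiset so that subtracting a projected tuple only removes the group when its count reaches zero (the sample rows are illustrative):

```python
from collections import Counter

def project(tuples, attrs):
    """Projection cuboid stored as a multiset of projected tuples."""
    return Counter(tuple(t[i] for i in attrs) for t in tuples)

def insert_tuples(cube, new, attrs):
    """Incremental insert: project only the new tuples and add them in."""
    cube.update(project(new, attrs))

def delete_tuples(cube, old, attrs):
    """Incremental delete: project the removed tuples and subtract them."""
    cube.subtract(project(old, attrs))
    for k in [k for k, c in cube.items() if c <= 0]:
        del cube[k]    # a projected tuple disappears when its count hits 0

rows = [("Q1", "Toys", 825), ("Q1", "Food", 400)]
cube = project(rows, [0])            # project on quarter: {('Q1',): 2}
delete_tuples(cube, [rows[0]], [0])  # one Q1 tuple remains
```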

34 Efficiency Considerations: Refreshing Cuboids
Join cuboids
- The cuboid is a join of two relations
- Supports roll-up and drill-down with additional tables for hierarchies
- Incremental insert: compute the join of the new tuples in one relation with the other relation, and add it to the cuboid
- Incremental delete: compute the join of the tuples to remove in one relation with the other relation, and subtract it from the cuboid
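The insert case can be sketched as joining only the delta against the other relation; the fact and dimension rows below are hypothetical:

```python
def join(r, s, r_key, s_key):
    """Naive nested-loop join of two lists of dicts on the given keys."""
    return [{**a, **b} for a in r for b in s if a[r_key] == b[s_key]]

def insert_into_join_cube(cube, delta_r, s, r_key, s_key):
    """Incremental insert: join only the new R-tuples with all of S."""
    cube.extend(join(delta_r, s, r_key, s_key))

# Hypothetical fact rows and a location dimension table.
sales = [{"locid": 1, "dollars": 605}]
locations = [{"locid": 1, "city": "Thunder Bay"}]
cube = join(sales, locations, "locid", "locid")
insert_into_join_cube(cube, [{"locid": 1, "dollars": 680}],
                      locations, "locid", "locid")
# cube now has two joined rows, without re-joining the original sales
```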

35 Efficiency Considerations: Refreshing Cuboids
Aggregation cuboids
- The cuboid performs an aggregate function on some attribute (roll-up operation)
- We need to maintain detailed information on the role of each row in the aggregated value of its group:
  - COUNT: maintain a counter of the number of instances counted; remove the group when counter = 0
  - SUM: maintain a counter of the number of tuples summed; remove the group when counter = 0 (not when sum = 0)
  - AVG: maintain a sum and a counter; average = sum / counter
  - MAX/MIN: maintain the value and a counter of instances of that value; easy to insert, but when the counter hits 0 we must scan the whole group to update
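The AVG case above can be sketched directly: keep (sum, counter) per group, and drop a group on counter = 0 (the group name and values are illustrative).

```python
class AvgCuboid:
    """AVG maintained per group as (sum, counter); average = sum / counter,
    and a group is removed when its counter (not its sum) reaches 0."""
    def __init__(self):
        self.state = {}                      # group -> [sum, counter]

    def insert(self, group, value):
        s = self.state.setdefault(group, [0.0, 0])
        s[0] += value
        s[1] += 1

    def delete(self, group, value):
        s = self.state[group]
        s[0] -= value
        s[1] -= 1
        if s[1] == 0:
            del self.state[group]

    def avg(self, group):
        total, count = self.state[group]
        return total / count

cube = AvgCuboid()
cube.insert("Ontario", 10.0)
cube.insert("Ontario", 20.0)
print(cube.avg("Ontario"))    # 15.0
cube.delete("Ontario", 10.0)
print(cube.avg("Ontario"))    # 20.0
```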

36 Efficiency Considerations: Refreshing Cuboids
Immediate maintenance
- Refresh the cuboid whenever the tables are updated
- Cuboid always up-to-date, but slows down update operations
Deferred maintenance
- Lazy update: update a cuboid when it is queried
- Periodic update: update a cuboid at a set time
- Forced update: update a cuboid after a given number of table updates

37 Efficiency Considerations: Indexing
- Indexing data allows faster search and retrieval
- Data warehouses make use of two specialized index types:
  - Bitmap index
  - Join index

38 Efficiency Considerations: Indexing
- Consider an attribute with few possible values (sparse) that can be represented as binary words (bit vectors)
- Each column of the resulting bit table is a bitmap index
[Figure: an employee table (EID, Name, Sex, Rating) with rows for Bob, Joe, Catherine and Dave, alongside one bit vector per value of Sex (M, F) and per value of Rating (1-5); each vector has one bit per row, set when that row holds the value.]

39 Efficiency Considerations: Indexing
Advantages of bitmap indexes:
- More compact and compressible than B+ tree and hash indexes
- More efficient: can make use of bit operations
- Example: to find which male employees have a rating of 3, compute the bitwise AND of the bitmap for Sex = M and the bitmap for Rating = 3; the set bits of the result mark the matching rows
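A minimal sketch of that example, using Python ints as bit vectors; since the slide's table values are not fully legible, the column contents below are illustrative only.

```python
def bitmap_index(column):
    """One bit vector (a Python int) per distinct value;
    bit i is set when row i holds that value."""
    index = {}
    for i, v in enumerate(column):
        index[v] = index.get(v, 0) | (1 << i)
    return index

# Hypothetical columns for five employees (rows 0-4).
sex    = ["M", "M", "F", "M", "F"]
rating = [3, 1, 3, 2, 3]
sex_idx, rating_idx = bitmap_index(sex), bitmap_index(rating)

# Which male employees have a rating of 3? A single bitwise AND.
hits = sex_idx["M"] & rating_idx[3]
rows = [i for i in range(len(sex)) if hits >> i & 1]   # matching row numbers
```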

40 Efficiency Considerations: Indexing
A join index speeds up join queries
- Indexes the results of join operations
- Allows us to skip the join altogether
- Particularly useful with star/constellation schemas
- Example: sales per location per category is a join query over the Sales fact table and the Location and Category dimension tables

41 Efficiency Considerations: Indexing
Example (continued):
- The join index stores the results of the join: Category_key, Location_key, Dollars
- Discard all unnecessary results (e.g. categories not sold at certain locations)
- Subsequent queries on sales per location will be more efficient (no unnecessary I/O)
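The idea can be sketched as precomputing the join result once; the sample rows are hypothetical, and later "sales per location per category" queries read the index directly instead of performing the join.

```python
def build_join_index(sales, locations, categories):
    """Precompute the Sales-Location-Category join as
    (Category_key, Location_key) -> total Dollars, discarding
    combinations that never occur (e.g. categories not sold somewhere)."""
    loc_keys = {l["Location_key"] for l in locations}
    cat_keys = {c["Category_key"] for c in categories}
    index = {}
    for row in sales:
        key = (row["Category_key"], row["Location_key"])
        if key[0] in cat_keys and key[1] in loc_keys:
            index[key] = index.get(key, 0) + row["Dollars"]
    return index

# Hypothetical fact rows: two sales of category 1 at location 1.
sales = [{"Category_key": 1, "Location_key": 1, "Dollars": 605},
         {"Category_key": 1, "Location_key": 1, "Dollars": 680}]
idx = build_join_index(sales, [{"Location_key": 1}], [{"Category_key": 1}])
# idx == {(1, 1): 1285}
```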

42 Finding Answers Quickly
- A recent trend, fueled mostly by the Internet, is an emphasis on queries for which a user only wants the first few, or the 'best' few, answers quickly
- A related trend is that, for complex queries, users would like to see an approximate answer quickly and then have it continually refined, rather than wait until the exact answer is available
- Both are especially important when dealing with data warehouses, where taking all the data into account takes an enormous amount of time

43 Finding Answers Quickly
- For queries with a lot of results, sometimes users only want the top-N results, rather than the complete results
- For long queries, sometimes users want preliminary approximate results right away, rather than waiting for the exact results

44 Top-N Results
SELECT P.pid, P.pname, S.sale
FROM Sales S, Products P
WHERE S.pid = P.pid AND S.locid = 1 AND S.timeid = 3
ORDER BY S.sale DESC
OPTIMIZE FOR 10 ROWS
- Without the OPTIMIZE clause, the query would compute all sales, which is wasteful
- The OPTIMIZE clause is not standard SQL
- What to do without it?

45 Top-N Results
SELECT P.pid, P.pname, S.sale
FROM Sales S, Products P
WHERE S.pid = P.pid AND S.locid = 1 AND S.timeid = 3 AND S.sale > c
ORDER BY S.sale DESC
- If you know the approximate sale value c of the top-N selling products, add it to the query as a cutoff
- Issues:
  - How do we discover c?
  - What if we get more than N results?
  - What if we get fewer than N results?
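One hypothetical client-side strategy for the "fewer than N" issue is to relax the cutoff and re-run; the "more than N" case is handled by simply truncating. A sketch (the retry policy of halving c is an assumption, not something the slides prescribe):

```python
def top_n(fetch_above, n, c):
    """Query with cutoff c via fetch_above(c); if fewer than n rows come
    back, halve c and re-run; if more come back, truncate to the top n."""
    while True:
        rows = sorted(fetch_above(c), reverse=True)
        if len(rows) >= n or c <= 0:
            return rows[:n]
        c /= 2

# Stand-in for the DBMS: sales values filtered by the cutoff.
sales = [120, 90, 75, 60, 30, 10]
fetch = lambda c: [s for s in sales if s > c]
print(top_n(fetch, 3, 100))   # [120, 90, 75] -- needed one retry at c = 50
```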

46 Online Aggregation
SELECT L.region, AVG(S.sale)
FROM Sales S, Location L
WHERE S.locid = L.locid
GROUP BY L.region
- Assume Sales and Location are massive tables, or distributed over several computers on a network
- It takes a long time to completely compute the results
- Online aggregation: compute and display approximate results, and update them as the computation goes on
- But the DBMS now needs to give confidence values for the results
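A toy sketch of the idea for a single group: maintain a running mean and report it periodically with an approximate confidence interval (the normal-approximation interval and the simulated sales stream are assumptions for illustration).

```python
import math
import random

def online_avg(stream, report_every=100, z=1.96):
    """Running AVG with an approximate 95% confidence interval
    (normal approximation over the rows processed so far)."""
    n = total = sum_sq = 0
    for x in stream:
        n += 1
        total += x
        sum_sq += x * x
        if n % report_every == 0:
            mean = total / n
            var = max(sum_sq / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n)    # half-width of the interval
            yield n, mean, half              # report: mean +/- half

random.seed(0)
reports = list(online_avg(random.gauss(5000, 500) for _ in range(1000)))
n, mean, half = reports[-1]   # the interval tightens as more rows arrive
```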

47 Online Aggregation
Status | Prioritize | Region  | AVG(Sale) | Confidence | Interval
70%    | Yes        | Ontario | 5,323.5   | 97%        | 103.4
40%    | No         | Alberta | 2,832.5   | 93%        | 132.2
90%    |            |         | 6,432.5   | 98%        | 52.3
30%    |            | Quebec  | 4,243.5   | 92%        | 152.3
- Reading the Alberta row: a probability of 93% that average sales in Alberta are $2,832.50 ± $132.20, based on 40% of the computation; the user does not prioritize that region
- The DBMS needs:
  - Statistical algorithms for the confidence intervals
  - Non-blocking algorithms

48 Summary
- A data warehouse is a massive multidimensional heterogeneous database of historical and statistical data
- Requires different ways of seeing data: data cubes and lattices; database schemas (star, snowflake, constellation)
- Requires new operations (OLAP): roll-up, drill-down, slice, dice, pivot
- Has more efficiency constraints: new indexes, top-N results, online operations

49 Exercises
Cow Book: 25.1, 25.2, 25.3, 25.6, 25.9 (use cuboids instead of views), 25.10 (use cuboids instead of views)
Data Mining Book: 4.2, 4.3, 4.4, 4.5, 4.6, 4.13

