Online Analytical Processing (OLAP): An Overview
Kian Win Ong, Nicola Onose
Mar 3rd, 2006


Overview
Motivation
Multi-Dimensional Data Model
Research Areas
Optimizations
  – Materializing multiple aggregates simultaneously
  – Materialization strategy

Motivation
Aggregation, summarization and exploration
Of historical data
To help management make informed decisions

Different Goal
Product | Branch | Time | Price
Coke (0.5 gallon) | Convoy Street | …:00:01 | $1.00
Pepsi (0.5 gallon) | UTC | …:00:01 | $1.03
Coke (1 gallon) | UTC | …:00:02 | $1.50
Altoids | Costa Verde | …:01:33 | $…
Example queries:
Find the total sales for each product and month
Find the percentage change in the total monthly sales for each product
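The two example queries can be sketched in a few lines of Python. This is only an illustration: the rows below are hypothetical (the slide's own table is incomplete), and the function names `monthly_totals` and `monthly_change` are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical sales rows (product, month, price); not data from the slides.
sales = [
    ("Coke", "2006-01", 1.00),
    ("Coke", "2006-01", 1.50),
    ("Coke", "2006-02", 3.00),
    ("Pepsi", "2006-01", 1.03),
    ("Pepsi", "2006-02", 1.03),
]


def monthly_totals(rows):
    """Total sales for each (product, month) pair."""
    totals = defaultdict(float)
    for product, month, price in rows:
        totals[(product, month)] += price
    return dict(totals)


def monthly_change(totals):
    """Percentage change of each monthly total versus the product's
    previous month that appears in the data."""
    by_product = defaultdict(list)
    for (product, month), total in sorted(totals.items()):
        by_product[product].append((month, total))
    change = {}
    for product, series in by_product.items():
        for (_, prev), (month, cur) in zip(series, series[1:]):
            change[(product, month)] = 100.0 * (cur - prev) / prev
    return change


totals = monthly_totals(sales)
change = monthly_change(totals)
print(totals)
print(change)
```

Both queries scan and consolidate many rows, which is the kind of workload the following slides contrast with transaction processing.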

Different Requirements
 | OLTP | OLAP
Tasks | Day to day operation | High level decision support
Size of database | Gigabytes | Terabytes
Time span | Recent, up-to-date | Spanning over months / years
Size of working set | Tens of records, accessed through primary keys | Consolidated data from multiple databases
Workload | Structured / repetitive | Ad-hoc, exploratory queries
Performance | Transaction throughput | Query latency
OLTP – On-Line Transaction Processing
OLAP – On-Line Analytical Processing

Query Language Extensions
In the real world, data is stored in RDBs.
How to express N-dimensional problems using 2D tables?
Can we combine OLAP and SQL queries?
Jim Gray et al.: Data Cube: A Relational Aggregation Operator, 1997

Query Language Extensions
Problems with GROUP BY:

1. Histograms: grouping by a computed value, for example
   SELECT sales, prod_name, population
   FROM sales_history
   GROUP BY Population(City, State) AS population

2. Rollup / drilldown
   Non-relational representation:
   Product Category | Product Name | Month | Sales | Sales by Cat., by Name | Sales by Cat.
   Drinks | Coke | Feb | 30.3 | |
    | | Mar | 93.9 | 124.2 |
    | Heineken | Feb | 34.8 | |
    | | Mar | 123.8 | 158.6 | 282.8
   Relational, but the rollup is huge:
   Product Category | Product Name | Month | Sales | Sales by Cat., by Name | Sales by Cat.
   Drinks | Coke | Feb | 30.3 | 124.2 | 282.8
   Drinks | Coke | Mar | 93.9 | 124.2 | 282.8
   Drinks | Heineken | Feb | 34.8 | 158.6 | 282.8
   Drinks | Heineken | Mar | 123.8 | 158.6 | 282.8

3. Cross tabulations
   Could be represented as:
   Product Category | Product Name | Month | Sales
   Drinks | Coke | Feb | 30.3
   Drinks | Coke | Mar | 93.9
   Drinks | Coke | Total | 124.2
   Drinks | Heineken | Feb | 34.8
   Drinks | Heineken | Mar | 123.8
   Drinks | Heineken | Total | 158.6
   Drinks | Total | | 282.8
   2-D aggregation is more compact and more natural:
   Drinks | Feb | Mar | Total
   Coke | 30.3 | 93.9 | 124.2
   Heineken | 34.8 | 123.8 | 158.6
   Total | 65.1 | 217.7 | 282.8

4. Complex expressions, hard to optimize: reducing an N-dimensional aggregation to 1-D aggregations (GROUP BY) needs 2^N GROUP BYs, where N is the number of dimensions.
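To make the 2^N count concrete, the sketch below enumerates every subset of a list of dimensions; each subset corresponds to one GROUP BY that a plain-SQL formulation would need. The helper name `cube_grouping_sets` is an assumption of this sketch, and the column names are taken from the SQL shown later in the slides.

```python
from itertools import combinations


def cube_grouping_sets(dimensions):
    """Return every subset of `dimensions`, largest first.
    Each subset is the column list of one GROUP BY query."""
    sets = []
    for size in range(len(dimensions), -1, -1):
        sets.extend(combinations(dimensions, size))
    return sets


dims = ("prod_category", "prod_name", "month")
grouping_sets = cube_grouping_sets(dims)
for columns in grouping_sets:
    print("GROUP BY", ", ".join(columns) if columns else "(nothing: grand total)")
```

With three dimensions this prints eight lines, from the full three-column grouping down to the grand total.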

Query Language Extensions
Reducing the number of attributes: introduce a new value, “ALL”
“ALL” = the set over which we aggregate

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
Drinks | Heineken | Feb | 34.8
Drinks | Heineken | Mar | 123.8
Drinks | Heineken | ALL | 158.6
Drinks | ALL | ALL | 282.8
Drinks | ALL | Feb | 65.1
Drinks | ALL | Mar | 217.7

Drinks | Feb | Mar | Total (ALL)
Coke | 30.3 | 93.9 | 124.2
Heineken | 34.8 | 123.8 | 158.6
Total (ALL) | 65.1 | 217.7 | 282.8

Query Language Extensions
General approach:

GROUP BY (1D)
[Figure: sales by product name (Coke, Heineken) for Feb and Mar, with SUM]

Cross Tab (2D)
Drinks | Feb | Mar | ALL
Coke | 30.3 | 93.9 | 124.2
Heineken | 34.8 | 123.8 | 158.6
ALL | 65.1 | 217.7 | 282.8
The corresponding relation:
Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
Drinks | Heineken | Feb | 34.8
Drinks | Heineken | Mar | 123.8
Drinks | Heineken | ALL | 158.6
Drinks | ALL | Feb | 65.1
Drinks | ALL | Mar | 217.7
Drinks | ALL | ALL | 282.8

Cube (3D)
Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
… | … | … | …
Snacks | Doritos | Feb | 123.8
Snacks | Doritos | Mar | 158.6
Snacks | Doritos | ALL | 65.1
… | … | … | …
ALL | … | … | …
Faces of the cube: by cat. and month; by cat. and name (does it make sense?); by month and name

Any hypercube can be represented as a relation!
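The idea that a cross tab or cube is just a relation with an extra ALL value can be sketched as follows. The input rows are the four Drinks figures from the slides; the function name `cube` and the representation of ALL as the string "ALL" are assumptions of this sketch, not part of the original proposal.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"


def cube(rows, n_dims):
    """rows: tuples of n_dims dimension values followed by one numeric measure.
    Returns a dict mapping each cube cell (with ALL where aggregated) to its sum."""
    totals = defaultdict(float)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each row contributes to 2^n cells: every dimension is either
        # kept as-is or replaced by ALL.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(ALL if hide else value for value, hide in zip(dims, mask))
            totals[cell] += measure
    return dict(totals)


sales = [
    ("Coke", "Feb", 30.3),
    ("Coke", "Mar", 93.9),
    ("Heineken", "Feb", 34.8),
    ("Heineken", "Mar", 123.8),
]
cells = cube(sales, 2)
for cell, total in sorted(cells.items()):
    print(cell, round(total, 1))
```

The result holds the four base cells, the per-product and per-month sub-totals, and the grand total, all as ordinary tuples.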

Query Language Extensions
General approach: a CUBE relation, with aggregation function f(.)
  (x_1, x_2, …, x_{n-1}, x_n, f())
  …
  (x_1, x_2, …, x_{n-1}, ALL, f())
  …
  (x_1, x_2, …, ALL, x_n, f())
  …
After ROLLUP, this reduces to a linear number of tuple patterns (n + 1 instead of 2^n):
  (x_1, x_2, …, x_{n-1}, x_n, f())
  …
  (x_1, x_2, …, x_{n-1}, ALL, f())
  …
  (x_1, x_2, …, ALL, ALL, f())
  …
  (ALL, ALL, …, ALL, ALL, f())
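As a small illustration of why ROLLUP yields only a linear number of tuple patterns, the sketch below generates them for n = 3: ALL values are introduced from the right, one position at a time. The function name `rollup_patterns` and the placeholder names x1, x2, x3 are assumptions of this sketch.

```python
def rollup_patterns(names):
    """Tuple patterns produced by ROLLUP: keep a prefix of the columns
    and replace the remaining positions by ALL."""
    n = len(names)
    return [tuple(names[:k]) + ("ALL",) * (n - k) for k in range(n, -1, -1)]


patterns = rollup_patterns(("x1", "x2", "x3"))
cube_pattern_count = 2 ** 3  # CUBE allows ALL in any subset of positions

for pattern in patterns:
    print(pattern)
print(len(patterns), "ROLLUP patterns versus", cube_pattern_count, "CUBE patterns")
```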

Query Language Extensions
The new operators: CUBE, ROLLUP
  SELECT prod_category, prod_name, month, SUM(sales) AS sales
  FROM sales_history
  GROUP BY CUBE prod_category, prod_name, month

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
… | … | … | …
Drinks | ALL | Feb | 99.8
… | … | … | …
ALL | … | … | …

Idea:
1. Group by the CUBE list.
2. Union the aggregates.
3. Introduce the ALL values.

Query Language Extensions
The new operators: CUBE, ROLLUP
  SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales
  FROM sales_history
  GROUP BY prod_category
  ROLLUP month, day
  CUBE state, prod_name

Product Category | Month | Day | State | Product Name | Sales
Drinks | Feb | 26 | CA | Coke | 12.3
 | Feb | 26 | CA | Heineken | 5.4
 | … | … | … | … | …
 | Feb | 26 | CA | ALL | 30.4
 | Feb | 26 | ALL | Coke | …
 | … | … | … | … | …
Snacks | Feb | 26 | CA | Doritos | 12.0
 | … | … | … | … | …
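One way to read the combined GROUP BY / ROLLUP / CUBE clause is as a product of grouping sets: the plain GROUP BY columns are always kept, the ROLLUP list contributes each of its prefixes, and the CUBE list contributes each of its subsets. The sketch below enumerates them under that reading. It assumes the CUBE list is (state, prod_name), matching the result columns, and the helper names are hypothetical.

```python
from itertools import combinations, product


def prefixes(cols):
    """ROLLUP: every prefix of the column list, longest first."""
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]


def subsets(cols):
    """CUBE: every subset of the column list, largest first."""
    return [c for size in range(len(cols), -1, -1) for c in combinations(cols, size)]


def grouping_sets(group_by, rollup, cube):
    """Combine always-grouped columns with each ROLLUP prefix and CUBE subset."""
    return [tuple(group_by) + r + c for r, c in product(prefixes(rollup), subsets(cube))]


sets = grouping_sets(["prod_category"], ["month", "day"], ["state", "prod_name"])
for columns in sets:
    print(columns)
```

Columns missing from a grouping set are the ones that show up as ALL in the result rows.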

Research Areas
SQL language extensions
Server architecture
Parallel processing
Index structures
Materialized views

Simultaneous Aggregates
Multi-dimensional optimization to calculate multiple aggregates simultaneously
Useful for materialization of aggregate views
Y. Zhao, P. Deshpande, J. Naughton: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, SIGMOD 1997

Multiple Aggregates
Transactional table:
Product | City | Month | Sales
Coke | San Diego | Feb 06 | 12
Pepsi | Los Angeles | Feb 06 | 13
Doritos | San Diego | Mar 06 | 72
Altoids | San Diego | Mar … | …

Aggregate on…
1. Sales by Product / City
2. Sales by Product / Month
3. Sales by Month / City
4. Sales by Product
5. Sales by City
6. Sales by Month
7. Sales (Total)

[Figures: cross tabs Month / Product, City / Product and Month / City, each with totals; products Altoids, Coke, Doritos, Heineken, Pepsi, Pringles; cities Los Angeles, San Diego; months Feb, Mar]

Is it possible to
  – make a single pass over the transactional table?
  – calculate multiple aggregates simultaneously?
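Ignoring memory limits for a moment, a single scan that maintains all seven aggregates at once can be sketched with one accumulator per group-by. This is a plain in-memory illustration, not the array-based algorithm itself; the rows are the three complete rows from the slide, and `aggregate_all` is a name made up for this sketch.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "city", "month")

# The three rows of the slide's table whose sales values are complete.
rows = [
    ("Coke", "San Diego", "Feb 06", 12),
    ("Pepsi", "Los Angeles", "Feb 06", 13),
    ("Doritos", "San Diego", "Mar 06", 72),
]


def aggregate_all(rows, dims=DIMS):
    """Compute every aggregate over a proper subset of `dims` in one pass."""
    n = len(dims)
    group_bys = [idx for size in range(n) for idx in combinations(range(n), size)]
    totals = {idx: defaultdict(int) for idx in group_bys}
    for row in rows:  # the single pass over the transactional table
        sales = row[-1]
        for idx in group_bys:  # update every aggregate simultaneously
            key = tuple(row[i] for i in idx)
            totals[idx][key] += sales
    return {tuple(dims[i] for i in idx): dict(t) for idx, t in totals.items()}


result = aggregate_all(rows)
for group_by, table in result.items():
    print(group_by or "(total)", table)
```

The catch is that all accumulators live in memory for the whole scan, which is what the chunk-based algorithm on the following slides addresses.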

Chunking
Partition transactional data into array chunks
[Figure: a three-dimensional array (dimensions A, B, C; axes Product, City, Month) divided into array chunks. Example row: Coke, San Diego, Feb 06, sales 12]
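A minimal sketch of chunking, assuming dimension values have already been mapped to integer indices and that chunks have side length 4 (both are assumptions of this sketch; the cell coordinates below are hypothetical): each cell is assigned to a chunk by integer division and to a position inside the chunk by the remainder.

```python
from collections import defaultdict

CHUNK = 4  # assumed chunk side length, identical in every dimension


def chunk_of(cell):
    """Coordinates of the chunk that contains the given cell."""
    return tuple(i // CHUNK for i in cell)


def offset_in_chunk(cell):
    """Position of the cell inside its chunk."""
    return tuple(i % CHUNK for i in cell)


def partition(cells):
    """Group a sparse {cell coordinates: value} mapping into chunks."""
    chunks = defaultdict(dict)
    for coords, value in cells.items():
        chunks[chunk_of(coords)][offset_in_chunk(coords)] = value
    return dict(chunks)


# Hypothetical cells: (product index, city index, month index) -> sales
cells = {(0, 0, 0): 12, (5, 1, 0): 13, (2, 0, 1): 72}
chunks = partition(cells)
print(chunks)
```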

Naïve Algorithm
Pivot on AB, aggregate on all C
Pivot on AC, aggregate on all B
Pivot on BC, aggregate on all A
[Figure: three-dimensional array with dimensions A, B, C]

Single Pass Algorithm
Make a single pass over data
Simultaneously maintain multiple aggregates
Write out completed aggregates
Only allocate memory that is necessary
[Figure: three-dimensional array with dimensions A, B, C and the aggregate planes AB, AC, BC]
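The four points above can be illustrated with a small memory-accounting simulation. It does not aggregate real values; it only tracks, in units of chunks, how many partial AB, AC and BC aggregates must be held while the chunks of a three-dimensional array are read once. The read order (A fastest, then B, then C) and the count of 4 chunks per dimension are assumptions of this sketch.

```python
from itertools import product

N_CHUNKS = 4  # assumed number of chunks along each dimension


def simulate(n=N_CHUNKS):
    """Read all chunks once and report the peak number of partial aggregate
    chunks held in memory for each 2-D group-by."""
    live = {"AB": set(), "AC": set(), "BC": set()}  # partial aggregates in memory
    seen = {"AB": {}, "AC": {}, "BC": {}}           # input chunks folded in so far
    peak = {"AB": 0, "AC": 0, "BC": 0}
    written = []
    # product() varies its last element fastest, so A is the fastest dimension.
    for c, b, a in product(range(n), repeat=3):
        for name, key in (("AB", (a, b)), ("AC", (a, c)), ("BC", (b, c))):
            live[name].add(key)                     # allocate only when needed
            peak[name] = max(peak[name], len(live[name]))
            seen[name][key] = seen[name].get(key, 0) + 1
            if seen[name][key] == n:                # aggregate chunk is complete
                live[name].remove(key)              # write it out, free memory
                written.append((name, key))
    return peak, written


peak, written = simulate()
print(peak)
```

With this read order the plane that collapses the fastest dimension (BC) needs a single chunk of memory, while the plane that collapses the slowest one (AB) must be held in full.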

Single Pass Algorithm
Minimum memory spanning tree
[Figure: tree over the aggregates, annotated with memory requirements]
Array chunk ABC: 4 x 4 x 4
AB: 16 x 4 x 4
AC: 4 x 4 x 4
BC: 4 x 4
A: 4 x 4
B: 4
C: 4
all: 1

Multi Pass Algorithm
Recursively aggregate
[Figure: lattice of group-bys]
ABCD
ABC, ABD, ACD, BCD
AB, AC, BC, AD, BD, CD
A, B, C, D
all
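A rough sketch of recursive aggregation over this lattice: every group-by is computed from a parent that has exactly one more dimension, by summing that dimension away. Picking the parent with the fewest cells is a heuristic assumed here for illustration, the base data is hypothetical, and the sketch leaves out the chunking and memory management of the actual algorithm.

```python
from collections import defaultdict
from itertools import combinations


def aggregate_away(table, dims, drop):
    """Sum out dimension `drop` from a group-by keyed by tuples over `dims`."""
    keep = [i for i, d in enumerate(dims) if d != drop]
    out = defaultdict(int)
    for key, value in table.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def compute_cube(base, dims):
    """Compute all 2^n group-bys level by level, each from a computed parent."""
    cuboids = {tuple(dims): base}
    for size in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, size):
            parents = [p for p in cuboids
                       if len(p) == size + 1 and set(child) <= set(p)]
            parent = min(parents, key=lambda p: len(cuboids[p]))  # assumed heuristic
            (drop,) = set(parent) - set(child)
            cuboids[child] = aggregate_away(cuboids[parent], parent, drop)
    return cuboids


# Hypothetical base cuboid over dimensions A, B, C, D.
base = {
    ("a1", "b1", "c1", "d1"): 1,
    ("a1", "b2", "c1", "d1"): 2,
    ("a2", "b1", "c2", "d1"): 3,
}
cuboids = compute_cube(base, ("A", "B", "C", "D"))
print(len(cuboids), "group-bys; grand total =", cuboids[()][()])
```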

Implementing Data Cubes
Biggest problem for data warehouses: the size
Space / time trade-off: accelerate queries by materializing the cube
The size of the relations gets even bigger!
M(ultidimensional)OLAP: good query performance, but bad scalability
R(elational)OLAP: very scalable; query performance improved by materializing (partial) results

Implementing Data Cubes
V. Harinarayan, A. Rajaraman, J.D. Ullman: Implementing Data Cubes Efficiently, SIGMOD 1996
Presents a materialization strategy for the cells of the cube.

Implementing Data Cubes
[Figure: warehouse schema. Fact table: Time Id, City Id, Product Id, Sales. Related tables: (Day, Month, Week); (City Id, City, State); (Product Id, Name, Category); (Week, Month, Year); (Category Id, Category Name)]

Implementing Data Cubes
Cast as a particular case of the rewriting-using-views problem:
which cells to materialize ⇔ which SQL views to materialize
p = product, t = time, c = city
Simple idea: Q1 depends on Q2 (Q1 ≤ Q2) if Q1 can be fully answered using the results of Q2
[Figure: lattice of views]
ptc
pt, tc, pc
t, p, c
none
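For the flat p/t/c lattice, the dependence relation reduces to a subset test on the grouping attributes: a view can be answered from any view that groups by at least the same attributes. A minimal sketch, where views are written as strings of attribute letters and the empty string stands for "none" (both conventions are assumptions of this sketch):

```python
VIEWS = ["ptc", "pt", "pc", "tc", "p", "t", "c", ""]  # "" is the view "none"


def depends_on(q1, q2):
    """True if q1 <= q2, i.e. q1 can be answered from the result of q2."""
    return set(q1) <= set(q2)


def ancestors(view):
    """All views (including the view itself) from which `view` can be answered."""
    return [v for v in VIEWS if depends_on(view, v)]


print(ancestors("p"))        # views that can answer "sales by product"
print(depends_on("pt", "pc"))
```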

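Implementing Data Cubes
But cube dimensions are usually hierarchical:
  product: product_name → product_category → none
  time: day → {week, month}; month → year → none; week → none
  city: city → state → none
The views form the direct-product lattice of the three hierarchies (product × time × city)
p = product, t = time, c = city
[Figure: part of the direct-product lattice, with nodes such as ptc, pt, tc, pc, pts, pwc, pyc, pmc, ps, p_cat t, …]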
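With hierarchies, a view picks one level per dimension, and dependence is checked component by component. The sketch below builds the direct-product lattice for the three hierarchies on the slide. The maps list, for each level, the levels derivable from it; treating year as not derivable from week is an assumption of this sketch, as are all identifier names.

```python
from itertools import product

# For each level: the set of levels that can be computed from it.
PRODUCT = {
    "name": {"name", "category", "none"},
    "category": {"category", "none"},
    "none": {"none"},
}
TIME = {
    "day": {"day", "week", "month", "year", "none"},
    "week": {"week", "none"},  # assumption: weeks do not roll up to years
    "month": {"month", "year", "none"},
    "year": {"year", "none"},
    "none": {"none"},
}
CITY = {
    "city": {"city", "state", "none"},
    "state": {"state", "none"},
    "none": {"none"},
}
HIERARCHIES = (PRODUCT, TIME, CITY)

# Every combination of one level per dimension is a node of the lattice.
views = list(product(PRODUCT, TIME, CITY))


def depends_on(q1, q2):
    """q1 <= q2 if, in every dimension, q1's level is derivable from q2's level."""
    return all(l1 in h[l2] for l1, l2, h in zip(q1, q2, HIERARCHIES))


print(len(views), "views in the direct-product lattice")
print(depends_on(("category", "year", "state"), ("name", "month", "city")))
```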

Implementing Data Cubes
Def. cost of answering Q = number of rows in the table of ancestor(Q), the materialized view used to answer Q
It can be estimated without materializing the views
Assume that all queries are identical to some view in the lattice

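Implementing Data Cubes
For a set S of views and a view v, the benefit of v relative to S is
  B(v, S) = ∑_{w ≤ v, w ∉ S} max{cost(w) − cost(v), 0}
where cost(w) is the current cost of answering w from the views in S and cost(v) is the number of rows of v.
Greedy algorithm for selecting k views to materialize from the lattice:
1. S := {top view}
2. For i = 1 to k: add to S the view v not in S that maximizes B(v, S)
The greedy algorithm is an (e−1)/e ≈ 0.63 approximation of the optimum.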
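The greedy selection can be sketched directly from the definitions above. The row counts in `SIZES` are hypothetical and chosen only for illustration. Views are strings of attribute letters with "" for "none", and the cost of answering a view is taken to be the size of its smallest already-selected ancestor, which is this sketch's reading of the cost definition.

```python
# Hypothetical row counts for the p/t/c lattice; "" is the view "none".
SIZES = {"ptc": 6000, "pt": 6000, "tc": 6000, "pc": 800,
         "p": 200, "t": 100, "c": 10, "": 1}


def depends_on(w, v):
    """w <= v: w can be answered from v."""
    return set(w) <= set(v)


def current_cost(w, selected, sizes):
    """Rows scanned to answer w from the cheapest selected ancestor."""
    return min(sizes[u] for u in selected if depends_on(w, u))


def benefit(v, selected, sizes):
    """Total cost reduction over all views w <= v if v were materialized."""
    return sum(max(current_cost(w, selected, sizes) - sizes[v], 0)
               for w in sizes if depends_on(w, v))


def greedy(sizes, k):
    """Start from the top view, then add k views by maximal benefit."""
    top = max(sizes, key=len)  # the view with all attributes
    selected = [top]
    for _ in range(k):
        candidates = [v for v in sizes if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected, sizes))
        selected.append(best)
    return selected


chosen = greedy(SIZES, 2)
print(chosen)
```

With these sizes the first pick is pc, since it cheaply answers four views, and the second is t. The views pt and tc are never attractive because they are as large as the top view.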

Discussion
Questions from the audience…