Online Analytical Processing (OLAP): An Overview
Kian Win Ong, Nicola Onose
March 3rd, 2006

Overview
Motivation
Multi-Dimensional Data Model
Research Areas
Optimizations
– Materializing multiple aggregates simultaneously
– Materialization strategy

Motivation
Aggregation, summarization and exploration of historical data, to help management make informed decisions

Different Goal
Aggregation, summarization and exploration of historical data, to help management make informed decisions

Product | Branch | Time | Price
Coke (0.5 gallon) | Convoy Street | 2006-03-01 09:00:01 | $1.00
Pepsi (0.5 gallon) | UTC | 2006-03-01 09:00:01 | $1.03
Coke (1 gallon) | UTC | 2006-03-01 09:00:02 | $1.50
Altoids | Costa Verde | 2006-03-01 09:01:33 | $0.30
...

– Find the total sales for each product and month
– Find the percentage change in the total monthly sales for each product

Different Requirements

Requirement | OLTP | OLAP
Tasks | Day to day operation | High level decision support
Size of database | Gigabytes | Terabytes
Time span | Recent, up-to-date | Spanning over months / years
Size of working set | Tens of records, accessed through primary keys | Consolidated data from multiple databases
Workload | Structured / repetitive | Ad-hoc, exploratory queries
Performance | Transaction throughput | Query latency

OLTP – On-Line Transaction Processing
OLAP – On-Line Analytical Processing

Query Language Extensions
In the real world, data is stored in relational databases (RDBs).
How to express N-dimensional problems using 2D tables?
Can we combine OLAP and SQL queries?
Jim Gray et al.: Data Cube: A Relational Aggregation Operator, 1997

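Query Language Extensions
Problems with GROUP BY
1. histograms (aggregation over computed categories, which a plain GROUP BY over columns does not express directly)

SELECT sales, prod_name, population
FROM sales_history
GROUP BY Population(City, State) AS population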

Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown

Product Category | Product Name | Month | Sales | Sales by Cat., by Name | Sales by Cat.
Drinks | Coke | Feb | 30.3 | |
 | | Mar | 93.9 | 124.2 |
 | Heineken | Feb | 34.8 | |
 | | Mar | 123.8 | 158.6 | 282.8

A non-relational representation

Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown

Product Category | Product Name | Month | Sales | Sales by Cat., by Name | Sales by Cat.
Drinks | Coke | Feb | 30.3 | 124.2 | 282.8
Drinks | Coke | Mar | 93.9 | 124.2 | 282.8
Drinks | Heineken | Feb | 34.8 | 158.6 | 282.8
Drinks | Heineken | Mar | 123.8 | 158.6 | 282.8

Relational, but the rollup is huge

Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown
3. cross tabulations

Could be represented as:

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | Total | 124.2
Drinks | Heineken | Feb | 34.8
Drinks | Heineken | Mar | 123.8
Drinks | Heineken | Total | 158.6
Drinks | Total | | 282.8

Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown
3. cross tabulations

2-D aggregation is more compact and more natural:

Drinks | Feb | Mar | Total
Coke | 30.3 | 93.9 | 124.2
Heineken | 34.8 | 123.8 | 158.6
Total | 65.1 | 217.7 | 282.8

Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown
3. cross tabulations
4. complex expressions, hard to optimize: when reducing to 1-D aggregation (GROUP BY), we need 2^N GROUP BYs, where N is the number of dimensions
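To make the 2^N count concrete, here is a minimal Python sketch that enumerates every grouping a full cross tabulation would need. The helper name all_groupings is hypothetical; the column names are borrowed from the deck's later sales_history example.

```python
from itertools import combinations

def all_groupings(dims):
    """Return every subset of the dimensions, from the finest grouping to the grand total."""
    return [combo
            for r in range(len(dims), -1, -1)
            for combo in combinations(dims, r)]

dims = ["prod_category", "prod_name", "month"]
groupings = all_groupings(dims)

# One GROUP BY per subset: 2^3 = 8 separate queries for three dimensions.
for g in groupings:
    print("GROUP BY " + ", ".join(g) if g else "(no GROUP BY: grand total)")
```

Each added dimension doubles the number of queries, which is the motivation for a single operator that produces all of them at once.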

Query Language Extensions
Reducing the number of attributes

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
Drinks | Heineken | Feb | 34.8
Drinks | Heineken | Mar | 123.8
Drinks | Heineken | ALL | 158.6
Drinks | ALL | ALL | 282.8
Drinks | ALL | Feb | 65.1
Drinks | ALL | Mar | 217.7

Query Language Extensions
Reducing the number of attributes
Introduce a new value: “ALL”
“ALL” = the set over which we aggregate

Drinks | Feb | Mar | Total (ALL)
Coke | 30.3 | 93.9 | 124.2
Heineken | 34.8 | 123.8 | 158.6
Total (ALL) | 65.1 | 217.7 | 282.8

Query Language Extensions
General approach
GROUP BY (1D)

Sales by Product Name | Feb | Mar
Coke | 30.3 | 93.9
Heineken | 34.8 | 123.8
SUM | 65.1 | 217.7

Query Language Extensions
General approach
GROUP BY (1D)
Cross Tab (2D)

Drinks | Feb | Mar | ALL
Coke | 30.3 | 93.9 | 124.2
Heineken | 34.8 | 123.8 | 158.6
ALL | 65.1 | 217.7 | 282.8

The corresponding relation:

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
Drinks | Heineken | Feb | 34.8
Drinks | Heineken | Mar | 123.8
Drinks | Heineken | ALL | 158.6
Drinks | ALL | Feb | 65.1
Drinks | ALL | Mar | 217.7
Drinks | ALL | ALL | 282.8

Query Language Extensions
General approach
GROUP BY (1D)
Cross Tab (2D)
Cube (3D)

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
… | … | … | …
Snacks | Doritos | Feb | 123.8
Snacks | Doritos | Mar | 158.6
Snacks | Doritos | ALL | 65.1
… | … | … | …
ALL | ALL | ALL | 964.0

Faces of the cube:
– By cat. and month
– By cat. and name (does it make sense?)
– By month and name

Query Language Extensions
General approach
GROUP BY (1D)
Cross Tab (2D)
Cube (3D)
Any hypercube can be represented as a relation!

Query Language Extensions
General approach
A CUBE relation, with aggregation function f(.), holds tuples with ALL in any combination of positions:
(x_1, x_2, …, x_{n-1}, x_n, f())
…
(x_1, x_2, …, x_{n-1}, ALL, f())
…
(x_1, x_2, …, ALL, x_n, f())
…
After ROLLUP, this reduces to a linear number of tuple patterns (n + 1 instead of 2^n):
(x_1, x_2, …, x_{n-1}, x_n, f())
…
(x_1, x_2, …, x_{n-1}, ALL, f())
…
(x_1, x_2, …, ALL, ALL, f())
…
(ALL, ALL, …, ALL, ALL, f())
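The ROLLUP pattern above can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's implementation: the function name rollup, the dict-based rows and the "ALL" string marker are hypothetical, and the four data rows are the Drinks figures from the earlier slides.

```python
from collections import defaultdict

ALL = "ALL"  # stand-in for the ALL value

def rollup(rows, dims, measure):
    """Sum `measure` for the n + 1 ROLLUP patterns: ALL fills the dimensions from the right."""
    out = defaultdict(float)
    for row in rows:
        for keep in range(len(dims), -1, -1):
            # Keep the first `keep` dimensions, replace the rest with ALL.
            key = tuple(row[d] if i < keep else ALL for i, d in enumerate(dims))
            out[key] += row[measure]
    return dict(out)

rows = [
    {"category": "Drinks", "name": "Coke", "month": "Feb", "sales": 30.3},
    {"category": "Drinks", "name": "Coke", "month": "Mar", "sales": 93.9},
    {"category": "Drinks", "name": "Heineken", "month": "Feb", "sales": 34.8},
    {"category": "Drinks", "name": "Heineken", "month": "Mar", "sales": 123.8},
]
result = rollup(rows, ["category", "name", "month"], "sales")
print(round(result[("Drinks", "Coke", ALL)], 1))
print(round(result[(ALL, ALL, ALL)], 1))
```

Note that a cell such as (Drinks, ALL, Feb) is absent here: ROLLUP only produces the patterns where ALL occupies a suffix of the dimension list.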

Query Language Extensions
The new operators: CUBE, ROLLUP

SELECT prod_category, prod_name, month, SUM(sales) AS sales
FROM sales_history
GROUP BY CUBE prod_category, prod_name, month

Product Category | Product Name | Month | Sales
Drinks | Coke | Feb | 30.3
Drinks | Coke | Mar | 93.9
Drinks | Coke | ALL | 124.2
… | … | … | …
Drinks | ALL | Feb | 99.8
… | … | … | …
ALL | ALL | ALL | 964.0

Idea:
– Group by the CUBE list.
– Union the aggregates.
– Introduce the ALL values.
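As a rough illustration of the idea (group by the CUBE list, union the aggregates, introduce ALL), the following Python sketch computes a cube relation in memory. The function name cube, the dict-based rows and the "ALL" marker are assumptions made for this example; only the four Drinks rows come from the slides.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # stand-in for the ALL value

def cube(rows, dims, measure):
    """Sum `measure` over all 2^N groupings; a dimension that is aggregated away shows as ALL."""
    out = defaultdict(float)
    for row in rows:
        # Each input row contributes to 2^N cells: every dimension is either kept or hidden.
        for mask in product((False, True), repeat=len(dims)):
            key = tuple(ALL if hide else row[d] for d, hide in zip(dims, mask))
            out[key] += row[measure]
    return dict(out)

rows = [
    {"category": "Drinks", "name": "Coke", "month": "Feb", "sales": 30.3},
    {"category": "Drinks", "name": "Coke", "month": "Mar", "sales": 93.9},
    {"category": "Drinks", "name": "Heineken", "month": "Feb", "sales": 34.8},
    {"category": "Drinks", "name": "Heineken", "month": "Mar", "sales": 123.8},
]
result = cube(rows, ["category", "name", "month"], "sales")
print(round(result[("Drinks", ALL, "Feb")], 1))
print(round(result[(ALL, ALL, ALL)], 1))
```

The result is an ordinary set of (key, value) pairs, which is the sense in which a hypercube can be stored as a relation.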

Query Language Extensions
The new operators: CUBE, ROLLUP

SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales
FROM sales_history
GROUP BY prod_category
ROLLUP month, day
CUBE state, prod_name

Product Category | Month | Day | State | Product Name | Sales
Drinks | Feb | 26 | CA | Coke | 12.3
 | Feb | 26 | CA | Heineken | 5.4
 | … | … | … | … | …
 | Feb | 26 | CA | ALL | 30.4
 | Feb | 26 | ALL | Coke | …
 | … | … | … | … | …
Snacks | Feb | 26 | CA | Doritos | 12.0
 | … | … | … | … | …

Research Areas
– SQL language extensions
– Server architecture
– Parallel processing
– Index structures
– Materialized views

Simultaneous Aggregates
Multi-dimensional optimization to calculate multiple aggregates simultaneously
Useful for materialization of aggregate views
Y. Zhao, P. Deshpande, J. Naughton: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, SIGMOD 1997

Multiple Aggregates

Product | City | Month | Sales
Coke | San Diego | Feb 06 | 12
Pepsi | Los Angeles | Feb 06 | 13
Doritos | San Diego | Mar 06 | 72
Altoids | San Diego | Mar 06 | 65
...

Aggregate on…

Month / Product | Feb | Mar | Total
Altoids | 36 | 131 | 167
Coke | 37 | 138 | 175
Doritos | 21 | 136 | 157
Heineken | 44 | 110 | 154
Pepsi | 31 | 122 | 153
Pringles | 37 | 126 | 164
Total | 206 | 764 | 970

City / Product | San Diego | Los Angeles | Total
Altoids | 90 | 77 | 167
Coke | 89 | 86 | 175
Doritos | 74 | 83 | 157
Heineken | 74 | 80 | 154
Pepsi | 68 | 85 | 153
Pringles | 73 | 90 | 164
Total | 469 | 501 | 970

Month / City | Feb | Mar | Total
Los Angeles | 112 | 358 | 469
San Diego | 95 | 407 | 501
Total | 206 | 764 | 970

Multiple Aggregates

Product | City | Month | Sales
Coke | San Diego | Feb 06 | 12
Pepsi | Los Angeles | Feb 06 | 13
Doritos | San Diego | Mar 06 | 72
Altoids | San Diego | Mar 06 | 65
...

Aggregate on…
1. Sales by Product / City
2. Sales by Product / Month
3. Sales by Month / City
4. Sales by Product
5. Sales by City
6. Sales by Month
7. Sales (Total)

Is it possible to:
– make a single pass over the transactional table?
– calculate multiple aggregates simultaneously?
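For small inputs the answer is clearly yes. The sketch below makes one pass over the four rows shown above and updates all seven aggregates as it goes. It uses plain hash tables rather than the array chunks of the Zhao et al. algorithm, and the function name and dict-based row layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def simultaneous_aggregates(rows, dims, measure="sales"):
    """One pass over `rows`, maintaining a running sum for every proper subset of `dims`."""
    # r ranges over 0 .. len(dims) - 1, so the seven groupings listed above for three dimensions.
    groupings = [g for r in range(len(dims)) for g in combinations(dims, r)]
    totals = {g: defaultdict(int) for g in groupings}
    for row in rows:                      # the single pass over the transactional table
        for g in groupings:               # update every aggregate with this row
            key = tuple(row[d] for d in g)
            totals[g][key] += row[measure]
    return totals

rows = [
    {"product": "Coke", "city": "San Diego", "month": "Feb 06", "sales": 12},
    {"product": "Pepsi", "city": "Los Angeles", "month": "Feb 06", "sales": 13},
    {"product": "Doritos", "city": "San Diego", "month": "Mar 06", "sales": 72},
    {"product": "Altoids", "city": "San Diego", "month": "Mar 06", "sales": 65},
]
dims = ("product", "city", "month")
totals = simultaneous_aggregates(rows, dims)
print(totals[("city",)][("San Diego",)])
print(totals[()][()])
```

This naive version keeps every aggregate fully in memory for the whole pass; the chunk-based algorithm on the following slides exists to bound that memory.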

Chunking
[Figure: a three-dimensional array with axes Dimension A, Dimension B and Dimension C (Product, City, Month), divided into numbered chunks 1 to 64; a transactional row such as (Coke, San Diego, Feb 06, 12) falls into one array chunk]
Partition transactional data into array chunks

Naïve Algorithm
[Figure: the chunked array over Dimension A, Dimension B and Dimension C]
– Pivot on AB, aggregate on all C
– Pivot on AC, aggregate on all B
– Pivot on BC, aggregate on all A

Single Pass Algorithm
[Figure: the chunked array over Dimension A, Dimension B and Dimension C, with the AB, AC and BC aggregate planes filled in as chunks are read in order]
– Make a single pass over data
– Simultaneously maintain multiple aggregates
– Write out completed aggregates
– Only allocate memory that is necessary

Single Pass Algorithm
Minimum memory spanning tree (memory needed per aggregate):
ABC (array chunk): 4 x 4 x 4
AB: 16 x 4 x 4
AC: 4 x 4 x 4
BC: 4 x 4
A: 4 x 4
B: 4
C: 4
all: 1
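The AB, AC and BC figures above are consistent with a simple prefix rule: dimensions that form a prefix of the chunk reading order need their full extent in memory, while the remaining dimensions need only one chunk's extent. The Python sketch below encodes that rule under stated assumptions: each dimension is assumed to have 16 values split into chunks of 4 (the slide shows only the products), and the function name memory_needed is hypothetical.

```python
from math import prod

def memory_needed(order, group_by, dim_size, chunk_size):
    """Cells to hold in memory for `group_by` while chunks are read in dimension order `order`.

    `group_by` must list its dimensions in the same relative order as `order`.
    """
    # Length of the common prefix between the group-by and the reading order.
    p = 0
    while p < len(group_by) and group_by[p] == order[p]:
        p += 1
    full = prod(dim_size[d] for d in group_by[:p])       # prefix dimensions: full extent
    partial = prod(chunk_size[d] for d in group_by[p:])  # other dimensions: one chunk wide
    return full * partial

# Assumed sizes: 16 values per dimension, chunks of 4 x 4 x 4.
dim_size = {"A": 16, "B": 16, "C": 16}
chunk_size = {"A": 4, "B": 4, "C": 4}

for gb in ("AB", "AC", "BC"):
    print(gb, memory_needed("ABC", gb, dim_size, chunk_size))
```

Under these assumptions AB needs 256 cells (16 x 4 x 4), AC needs 64 (4 x 4 x 4) and BC needs 16 (4 x 4), matching the slide. Choosing the reading order therefore changes how much memory the simultaneous aggregation needs.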

Multi Pass Algorithm
Lattice of group-bys:
ABCD
ABC, ABD, ACD, BCD
AB, AC, BC, AD, BD, CD
A, B, C, D
all
Recursively aggregate
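One way to read "recursively aggregate" is that each group-by is computed from an already computed parent one level up, not from the base data. The Python sketch below shows that single step; the function name, the dict representation and the sample values are hypothetical.

```python
from collections import defaultdict

def aggregate_from_parent(parent_dims, parent, child_dims):
    """Compute a child group-by by summing a parent group-by over the dropped dimensions."""
    idx = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, value in parent.items():
        child[tuple(key[i] for i in idx)] += value
    return dict(child)

# Hypothetical ABC aggregate: (a, b, c) -> summed measure.
abc = {("a1", "b1", "c1"): 5, ("a1", "b2", "c1"): 3, ("a2", "b1", "c2"): 4}

ab = aggregate_from_parent(("A", "B", "C"), abc, ("A", "B"))  # ABC -> AB
a = aggregate_from_parent(("A", "B"), ab, ("A",))             # AB  -> A
total = aggregate_from_parent(("A",), a, ())                  # A   -> all
print(ab, a, total)
```

This works for distributive functions such as SUM and COUNT; an aggregate like the median cannot be rebuilt from a parent's partial results in this way.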

Implementing Data Cubes
Biggest problem for data warehouses: the size
Space / time trade-off: accelerate queries by materializing the cube
The size of the relations gets even bigger!
M(ultidimensional)OLAP: good query performance, but bad scalability
R(elational)OLAP: very scalable; query performance improved by materializing (partial) results

Implementing Data Cubes
V. Harinarayan, A. Rajaraman, J.D. Ullman: Implementing Data Cubes Efficiently, SIGMOD 1996
Presents a materialization strategy for the cells of the cube.

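Implementing Data Cubes
[Figure: schema diagram. Fact table (Time Id, City Id, Product Id, Sales) linked to dimension tables with the attributes (Day, Month, Week), (City Id, City, State), (Product Id, Name, Category), (Week, Month, Year) and (Category Id, Category Name)]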

Implementing Data Cubes
Cast as a particular case of the rewriting-using-views problem
What cells to materialize → what SQL views to materialize
p = product, t = time, c = city
Simple idea: Q1 depends on Q2 (Q1 ≤ Q2) if Q1 can be fully answered using the results of Q2
Lattice of views:
ptc
pt, tc, pc
t, p, c
none
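For views that differ only in their grouping attributes, the dependence relation reduces to a subset test. The Python sketch below builds the p/t/c lattice and checks Q1 ≤ Q2; the function names are hypothetical and dimension hierarchies are ignored here.

```python
from itertools import combinations

def lattice(dims):
    """All 2^N views: every subset of the grouping attributes, largest first."""
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]

def depends_on(q1, q2):
    """Q1 <= Q2: Q1 groups on a subset of Q2's attributes, so it can be answered from Q2."""
    return set(q1) <= set(q2)

views = lattice("ptc")

# Every view that could be answered from a materialized "pt" view.
answerable_from_pt = sorted("".join(sorted(v)) or "none"
                            for v in views if depends_on(v, "pt"))
print(answerable_from_pt)
```

The relation is a partial order, not a total one: pt and tc, for example, are incomparable because neither can be computed from the other.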

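Implementing Data Cubes
But cube dimensions are usually hierarchical:
product_name → product_category → none
day → week → none, and day → month → year → none
city → state → none
The hierarchies combine into a direct-product lattice (product × time × city)
p = product, t = time, c = city
[Figure: part of the direct-product lattice, with nodes such as ptc, pt, tc, pc, pts, pwc, pmc, pyc, ps, p cat t, …]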

Implementing Data Cubes
Def. cost of answering Q = number of rows in the table of ancestor(Q), the materialized view used to answer Q
It can be estimated without materializing the views
Assume that all queries are identical to some view in the lattice

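Implementing Data Cubes
For a set S of materialized views and a view v, the benefit of v is
B(v, S) = ∑_{w ≤ v, w ∉ S} max{cost(w) − cost(v), 0}
where cost(w) is the current cost of answering w from S and cost(v) is the number of rows of v
Greedy algorithm for selecting k views to materialize from the lattice:
1. S := {top view}
2. For i = 1 to k: add to S the view v that maximizes B(v, S)
The greedy algorithm is an (e−1)/e ≈ 0.63 approximation of the optimum.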
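A minimal Python sketch of the greedy selection follows. The lattice, the view names a to h and the row counts are hypothetical, chosen only to exercise the code, and the function names are assumptions for this example.

```python
def descendants(children, v):
    """All views w <= v (v itself plus everything reachable below it)."""
    seen, stack = {v}, [v]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def greedy_select(rows, children, top, k):
    """Start from the top view and add, k times, the view with the largest benefit."""
    below = {v: descendants(children, v) for v in rows}
    # Current cost of answering each view: initially everything is answered from the top view.
    current = {w: rows[top] for w in rows}
    selected = [top]
    for _ in range(k):
        candidates = [v for v in rows if v not in selected]
        if not candidates:
            break
        def benefit(v):
            # Views already in S contribute 0 as long as row counts shrink down the lattice.
            return sum(max(current[w] - rows[v], 0) for w in below[v])
        best = max(candidates, key=benefit)
        selected.append(best)
        for w in below[best]:            # views under `best` may now be answered more cheaply
            current[w] = min(current[w], rows[best])
    return selected

# Hypothetical lattice: parent -> children, with estimated row counts per view.
children = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
            "d": ["g"], "e": ["g", "h"], "f": ["h"]}
rows = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

print(greedy_select(rows, children, "a", 3))
```

On this lattice the first pick is b (benefit 5 x 50 = 250), then f (70), then d (60). The benefit of a view depends on what has already been selected, which is why the benefits are recomputed in every round.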

Discussion Questions from the audience…
