
Published by Kaylee Hollis. Modified over 3 years ago.

1 CUBE: A Relational Aggregate Operator Generalizing Group By
Jim Gray, Adam Bosworth, Andrew Layman (Microsoft), Gray@Microsoft.com
Hamid Pirahesh (IBM)

2 The Data Analysis Cycle
o User extracts data from the database with a query
o Then visualizes and analyzes the data with desktop tools
[Figure: the extract, visualize, analyze cycle, shown with a spreadsheet table and two example charts over the storage hierarchy (Cache, Main, Secondary Disc, Online Tape, Nearline Tape, Offline Tape): "Size vs Speed" (Size (B) against Access Time (seconds)) and "Price vs Speed" ($/MB against Access Time (seconds))]

3 Division of Labor: Computation vs Visualization
o Relational system builds the CUBE relation
  – Aggregation is best done close to the data
  – Much filtering of data is possible
  – Cube computation may be recursive
    » (e.g., percent of total, quartile, ...)
o Visualization system displays and explores the cube

4 Relational Aggregate Operators
o SQL has several aggregate operators:
  – sum(), min(), max(), count(), avg()
o Other systems extend this with many others:
  – statistical functions, financial functions, ...
o The basic idea is:
  – Combine all values in a column into a single scalar value.
o Syntax:
    select sum(units)
    from inventory;

5 Relational Group By Operator
o Group By allows aggregates over table sub-groups
o Result is a new table
o Syntax:
    select location, sum(units)
    from inventory
    where nation = 'USA'
    group by location;
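A minimal sketch of such a group-by query, run against an in-memory SQLite database through Python's sqlite3 module. The inventory schema (location, nation, units) and the sample rows are assumptions made for illustration, since the slides do not define the table. The sketch filters on nation with a where clause before grouping, which is an assumption about the intent of the slide's query.

```python
import sqlite3

# Hypothetical schema and rows; the slides do not define the inventory table.
conn = sqlite3.connect(":memory:")
conn.execute("create table inventory (location text, nation text, units integer)")
conn.executemany(
    "insert into inventory values (?, ?, ?)",
    [
        ("Seattle", "USA", 10),
        ("Seattle", "USA", 5),
        ("Boston", "USA", 7),
        ("Toronto", "Canada", 3),
    ],
)

# One output row per location: sum() is applied within each sub-group.
rows = conn.execute(
    "select location, sum(units) from inventory "
    "where nation = 'USA' group by location order by location"
).fetchall()
print(rows)
```

The result is itself a table with one row per location, which is the property the cube operator builds on.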

6 Problems With This Design
o Users want histograms
o Users want sub-totals and totals
  – drill-down & roll-up reports
o Users want cross-tabs
o Conventional wisdom
  – These are not relational operators
  – They are in many report writers and query engines
[Figure: cross-tab over days of the week (M T W T F S S) and categories AIR, HOTEL, FOOD, MISC with a "sum" total; labels F(), G(), H()]

7 Thesis: The Data CUBE Relational Operator Generalizes Group By and Aggregates

8 The Idea: Think of the N-dimensional Cube; Each Attribute is a Dimension
o N-dimensional aggregate (sum(), max(), ...)
  – fits the relational model exactly:
    » a_1, a_2, ..., a_N, f()
o Super-aggregate over (N-1)-dimensional sub-cubes
    » ALL, a_2, ..., a_N, f()
    » a_1, ALL, a_3, ..., a_N, f()
    » ...
    » a_1, a_2, ..., ALL, f()
  – this is the (N-1)-dimensional cross-tab.
o Super-aggregate over (N-2)-dimensional sub-cubes
    » ALL, ALL, a_3, ..., a_N, f()
    » ...
    » a_1, a_2, ..., ALL, ALL, f()
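The row shapes listed on this slide can be sketched in a few lines of Python. This is a naive illustration, not the computation strategy the deck recommends: every base row is added to each of the 2^N cells it belongs to. The `cube` helper, the `ALL` string token, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # stand-in token; assumed not to occur in any dimension's domain


def cube(rows, n_dims):
    """Sum the measure (field n_dims) over every subset of collapsed dimensions."""
    totals = defaultdict(int)
    for row in rows:
        dims, value = row[:n_dims], row[n_dims]
        # Each row contributes to 2**n_dims cells, one per set of collapsed dimensions.
        for k in range(n_dims + 1):
            for collapsed in combinations(range(n_dims), k):
                key = tuple(ALL if i in collapsed else d for i, d in enumerate(dims))
                totals[key] += value
    return dict(totals)


# Hypothetical sales rows: (make, year, units)
sales = [
    ("chevy", 1990, 5),
    ("chevy", 1991, 7),
    ("ford", 1990, 3),
]
result = cube(sales, n_dims=2)
print(result[("chevy", ALL)])  # 12
print(result[(ALL, 1990)])     # 8
print(result[(ALL, ALL)])      # 15
```

Every cell, whether a core value or a super-aggregate, has the same shape (a_1, ..., a_N, f()), which is why the result fits in a single relation.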

9 An Example CUBE

10 Why the ALL Value?
o Need a new null-like value (overloads the null indicator)
o Value must not already be in the aggregated domain
o Can't use NULL, since we may aggregate on NULL itself
o Think of ALL as a token representing the set
  – {red, white, blue}, {1990, 1991, 1992}, {Chevy, Ford}
o Rules for ALL in other areas are not explored:
  – assertions
  – insertion / deletion / ...
  – referential integrity
o Follow set-of-values semantics.

11 CUBE Operator: Syntax
o Proposed syntax:
    select model, make, year, sum(units)
    from car_sales
    where model in ('chevy', 'ford')
      and year between 1990 and 1994
    group by model, make, year with cube
    having sum(units) > 0;
o Note: Group By repeats the aggregation list
  – in the select list
  – in the group by list
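One way to see what the proposed clause asks for is to emulate it as a union of ordinary group-by queries, one per subset of the grouping columns. The sketch below does this in SQLite through Python's sqlite3 module, chosen only for convenience; SQLite has no cube clause of its own. For brevity it uses two dimensions (make, year) rather than the slide's three, and the table contents and the 'ALL' string literal are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced car_sales table with two dimensions and one measure.
conn.execute("create table car_sales (make text, year integer, units integer)")
conn.executemany(
    "insert into car_sales values (?, ?, ?)",
    [("chevy", 1990, 5), ("chevy", 1991, 7), ("ford", 1990, 3)],
)

# 2 dimensions -> 2**2 = 4 group-by queries glued together with union all.
query = """
select make, year, sum(units) from car_sales group by make, year
union all
select make, 'ALL', sum(units) from car_sales group by make
union all
select 'ALL', year, sum(units) from car_sales group by year
union all
select 'ALL', 'ALL', sum(units) from car_sales
"""
cube_rows = set(conn.execute(query).fetchall())
for row in sorted(cube_rows, key=str):
    print(row)
```

With N grouping columns the emulation needs 2^N sub-queries, each repeating the select list, which suggests why a single operator is more convenient to write and to optimize.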

12 Why This Syntax?
o Abstract syntax:
    select ...
    from ...
    where ...
    group by ... [ with [ cube | roll up ] ]
    having ...
o Allows functional aggregations (e.g., sales by quarter):
    select store, quarter, sum(units)
    from sales
    where nation = 'Mexico' and year = 1994
    group by store, quarter(date) as quarter with roll up;
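A small Python sketch of the roll-up variant with a functional grouping key. It assumes roll up behaves like the standard SQL ROLLUP: only prefixes of the group-by list are kept, collapsing columns from the right, so N columns yield N+1 groupings instead of the cube's 2^N. The `rollup` and `quarter` helpers and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

ALL = "ALL"  # stand-in token for a collapsed grouping column


def quarter(d):
    """Map a date to its calendar quarter (1 to 4)."""
    return (d.month - 1) // 3 + 1


def rollup(rows, key_funcs):
    """Sum units for each prefix of the grouping keys, collapsing from the right."""
    totals = defaultdict(int)
    n = len(key_funcs)
    for row, units in rows:
        keys = [f(row) for f in key_funcs]
        for keep in range(n, -1, -1):
            totals[tuple(keys[:keep] + [ALL] * (n - keep))] += units
    return dict(totals)


# Hypothetical rows: ((store, sale_date), units)
sales = [
    (("A", date(1994, 1, 15)), 4),
    (("A", date(1994, 5, 2)), 6),
    (("B", date(1994, 2, 20)), 1),
]
result = rollup(sales, [lambda r: r[0], lambda r: quarter(r[1])])
print(result[("A", ALL)])   # 10
print(result[(ALL, ALL)])   # 11
```

Note that a cell such as (ALL, 1), the total for quarter 1 across all stores, is not produced by a roll-up; it would appear only with cube.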

13 Decorations and Abstractions
o Sometimes want to tag the cube with redundant values
  – region #, region_name, sales
  – region name is not a dimension, it is a decoration
  – Decorations are functionally dependent on dimensions
o More interesting, some dimensions are aggregations.
o Often these aggregations are not linear (they form a lattice)
o Rather than treat time as 12 dimensions
  – Recognize abstractions as one dimension (like decorations)
  – Compute efficiently (virtual functions)
[Figure: abstraction lattices for time (second, minute, hour, day, week, month, quarter, year; holidays: Xmas, Easter, Thanksgiving) and for geography (block, city, county, state, nation)]

14 Interesting Aggregate Functions
o From Red Brick Systems
  – Rank (in sorted order)
  – N-Tile (histograms)
  – Running average (cumulative functions)
  – Windowed running average
  – Percent of total
o Users want to define their own aggregate functions
  – statistics
  – domain specific

15 User Defined Aggregates
o Idea:
  – User function is called at start of each group
  – Each function instance has a scratchpad
  – Function is called at end of group
o Example: SUM
  – START: allocates a cell and sets it to zero
  – NEXT: adds next value to cell
  – END: deallocates cell and returns value
  – Simple example: MAX()
o This idea is in Illustra, IBM's DB2/CS, and the SQL standard
o Needs extension for rollup and cube
[Figure: scratchpad with start, next, and end calls]
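The start / next / end pattern maps closely onto the user-defined aggregate hook in Python's sqlite3 module, where the constructor plays the role of START, `step` of NEXT, and `finalize` of END. The sketch below implements the slide's "simple example", MAX. The `my_max` name and the inventory rows are hypothetical, and SQLite stands in for the systems named on the slide.

```python
import sqlite3


class MaxAgg:
    """User-defined MAX following the start / next / end pattern."""

    def __init__(self):
        # START: allocate the scratchpad for one group.
        self.best = None

    def step(self, value):
        # NEXT: fold one value into the scratchpad (NULLs are skipped).
        if value is not None and (self.best is None or value > self.best):
            self.best = value

    def finalize(self):
        # END: return the result; the instance is discarded afterwards.
        return self.best


conn = sqlite3.connect(":memory:")
conn.create_aggregate("my_max", 1, MaxAgg)  # name, number of arguments, class
conn.execute("create table inventory (location text, units integer)")
conn.executemany(
    "insert into inventory values (?, ?)",
    [("Seattle", 10), ("Seattle", 5), ("Boston", 7)],
)
rows = conn.execute(
    "select location, my_max(units) from inventory "
    "group by location order by location"
).fetchall()
print(rows)
```

One instance of the class is created per group, so the scratchpad never leaks state between groups.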

16 User Defined Aggregate Function Generalized For Cubes
o Aggregates have graduated difficulty
  – Distributive: can compute cube from next lower dimension values (count, min, max, ...)
  – Algebraic: can compute cube from next lower dimension scratchpads (average, ...)
  – Holistic: need base data (median, mode, rank, ...)
o Distributive and algebraic have a simple and efficient algorithm: build higher dimensions from the core
o Holistic computation seems to require multiple passes.
  – Real systems use sampling to estimate them
    » (e.g., sample to find median, quartile boundaries)
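The three classes can be contrasted on a toy example. In this sketch, two sub-groups are merged into one super-aggregate: max merges from the sub-results directly (distributive), average merges from (sum, count) scratchpads (algebraic), and median has to go back to the base values (holistic). The helper names and the data are hypothetical.

```python
from statistics import median

# Two hypothetical sub-groups of base values.
groups = [[1, 2, 3], [10]]

# Distributive: the super-aggregate is computed from the sub-aggregate values.
sub_max = [max(g) for g in groups]
total_max = max(sub_max)


# Algebraic: keep a fixed-size scratchpad (sum, count) and merge scratchpads.
def avg_scratchpad(values):
    return (sum(values), len(values))


def merge_avg(pads):
    return (sum(p[0] for p in pads), sum(p[1] for p in pads))


def avg_final(pad):
    return pad[0] / pad[1]


pads = [avg_scratchpad(g) for g in groups]
total_avg = avg_final(merge_avg(pads))

# Averaging the sub-averages instead would give the wrong answer.
naive_avg = sum(avg_final(p) for p in pads) / len(pads)

# Holistic: the sub-medians (2 and 10) do not determine the overall median,
# so the base data is needed.
total_median = median([v for g in groups for v in g])

print(total_max, total_avg, naive_avg, total_median)
```

The scratchpad is what lets average behave like a distributive function during cube computation: the merge step works on (sum, count) pairs, and the division happens only at the end.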

17 How To Compute the Cube?
o If each attribute has $N_i$ values, the CUBE has $\prod_i (N_i + 1)$ values
o Compute the N-D cube with hashing if it fits in RAM
o Compute the N-D cube with sorting if it overflows RAM
o Same comments apply to sub-cubes:
  – Compute the (N-1)-D sub-cube from the N-D cube.
  – Aggregate on the biggest domain first when more than one level deep
  – Aggregate functions need hidden variables:
    » e.g., average needs sum and count.
o Use standard techniques from query processing
  – arrays, hashing, hybrid hashing
  – fall back on sorting.
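As a worked example of the size estimate, reading the slide's count as the product of (N_i + 1) over all attributes, where the extra 1 per attribute accounts for the ALL value: the figure is an upper bound reached only when every combination of values occurs in the data. The `cube_size_bound` helper and the cardinalities below are hypothetical.

```python
from math import prod


def cube_size_bound(cardinalities):
    """Upper bound on cube cells: each attribute gains one extra value, ALL."""
    return prod(n + 1 for n in cardinalities)


# Hypothetical dimensions: 2 makes, 3 years, 3 colors.
cardinalities = [2, 3, 3]
core_cells = prod(cardinalities)             # 2 * 3 * 3 = 18 cells in the 3-D core
cube_cells = cube_size_bound(cardinalities)  # 3 * 4 * 4 = 48 cells in the full cube
print(core_cells, cube_cells)
```

Under these assumptions the super-aggregates add 30 cells on top of the 18 core cells, which gives a feel for how much the cube enlarges the plain group-by result.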

18 Example:
o Compute the 2D core of a 2 x 3 cube
o Then compute the 1D edges
o Then compute the 0D point
o Works for algebraic and distributive functions
o Saves lots of calls
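The three steps can be sketched for sum(), a distributive function, on a 2 x 3 core. Each lower-dimensional result is computed from the next-larger one rather than from the base rows. The `collapse` helper, the `ALL` token, and the cell values are hypothetical.

```python
from collections import defaultdict

ALL = "ALL"  # stand-in token for a collapsed dimension

# Hypothetical 2 x 3 core: (make, year) -> sum(units), already aggregated.
core = {
    ("chevy", 1990): 5, ("chevy", 1991): 7, ("chevy", 1992): 2,
    ("ford", 1990): 3, ("ford", 1991): 4, ("ford", 1992): 1,
}


def collapse(cells, dim):
    """Aggregate away one dimension by summing cells of the next-larger cube."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:dim] + (ALL,) + key[dim + 1:]] += value
    return dict(out)


# 1D edges, each built from the 6 core cells.
by_make = collapse(core, 1)   # collapse year: 2 cells
by_year = collapse(core, 0)   # collapse make: 3 cells

# 0D point, built from the smaller edge (2 cells) instead of the 6 core cells.
total = collapse(by_make, 0)

print(by_make)
print(by_year)
print(total)
```

Building the grand total from the two-cell edge is the "biggest domain first" idea in miniature: collapsing the larger dimension (year) first leaves fewer cells to touch in the next step.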

19 Summary
o CUBE operator generalizes relational aggregates
o Needs ALL value to denote sub-cubes
  – ALL values represent aggregation sets
o Needs generalization of user-defined aggregates
o Decorations and abstractions are interesting
o Computation has interesting optimizations
Research Topics
o Generalize the spreadsheet Pivot operator to RDBs
o Characterize algebraic/distributive/holistic functions for update
