Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross- Tab and Sub-Totals Gray et Al. Presented By: Priya Rajan.

Similar presentations


Presentation on theme: "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross- Tab and Sub-Totals Gray et Al. Presented By: Priya Rajan."— Presentation transcript:

1 Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross- Tab and Sub-Totals Gray et Al. Presented By: Priya Rajan

2 Introduction  Increasing popularity of Data Warehouses has increased the demand for querying, reporting and OLAP tools  Need to extract maximum information from data and have the ability to analyze it and aggregate it across multiple dimensions  Cross Tabs over various dimensions are particularly useful because aggregates convey a lot of information when the amount of data is large

3 Traditional Representation of N-Dimensional Data  N-Dimensional data represented as relation with N- attribute columns  Example: Here Sales Quantity is a function of Supplier,Product and Date 3P2S2 3P2S2 2P3S1 1P1S1 Date_IDProduct Supplier Sales Table Represents sales quantity by supplier S1 of product P2 on Date_ID 1& at Loc5 1 3 2 4 Loc_ID 11 39 18 10 Sales Quantity

4 Aggregation Queries  Standard SQL provides COUNT(), SUM(), MIN(), MAX(), AVG()  Example:SELECT AVG(Sales Quantity) FROM SALES  SQL also provides a GROUP-BY operator-produces 0 & 1 dimensional aggregates  Example: SELECT SUPPLIER, PRODUCT, AVG(SALES QUANTITY) AS A FROM SALES GROUP BY SUPPLIER, PRODUCT  Result: SupplierProductA S1P110 S1P318 S2P225

5 Inadequacy of GROUP BY  Cannot perform aggregation over computed categories (histograms) directly  Example: Suppose suppliers and products together define a ‘factory’ and a Factory() function maps suppliers and Loc_Ids to a factory. It would be nice to define the query to find maximum daily sales for each factory as: SELECT Date, factory, MAX(SALES) FROM SALES GROUP BY Date, Factory(suppliers,Loc_ID) as factory;  Instead we have to compute a table indirectly and then perform the aggregation (a nested query): SELECT Date, factory, MAX(SALES) FROM (SELECT Date, factory(suppliers, LOC_ID) as factory, sales FROM SALES) AS Foo GROUP BY Date, factory;

6 Inadequacy of GROUP BY  In analyzing data we often need to “roll-up” (example: go from sales/city  sales/state) or “drill- down” (vice-versa)  Problem with Group By for roll-ups and drill-downs: have to store the subtotal at each level of the aggregation

7 Rollup Representations  Need 2 N columns to represent a rollup of N elements!  Introduce ‘ALL’ value  Problem: Need to union many GROUP Bys

8 Rollup Representations SELECT Model, ALL, ALL, SUM(Sales) FROM Sales WHEREModel = 'Chevy' GROUP BYModel UNION SELECT Model, Year, ALL, SUM(Sales) FROM Sales WHEREModel = 'Chevy' GROUP BY Model, Year UNION SELECT Model, Year, Color, SUM(Sales) FROM Sales WHEREModel = 'Chevy' GROUP BY Model, Year, Color;  Note that aggregation by year is not included  To add that in we would have to Union in SELECT Model, ALL, Color, SUM(Sales) FROM Sales WHEREModel = 'Chevy' GROUP BY Model, Color;  Query is really complex-for a cross tab in 6-dimensions need 2 6 =64unions of GROUP BY statements!  This is the motivation for a new operator! To arrive at the previous table we would have to do the following complex query:

9 The CUBE Operator  The CUBE operator builds a table with all possible aggregated values constructed by grouping—it is the generalization of a simple aggregate function in N- dimensions  The 0-D cube is the aggregate of all values  Ex: Total Sales over all models,years and colors  The 1-D cube is a Group By on one dimension  Ex: Group By Color  The 2-D cube is a cross-tab  The 3-D cube is an intersection of 3 2-D cubes  This can be generalized to N-Dimensions

10 Intuition behind the CUBE 0-D cube 1-D cube 2-D cube 3-D cube

11 Creating a CUBE  The syntax for a cube is: SELECT FROM GROUP BY CUBE  To create the global cube need to generate the power set of aggregation columns {Model, Year, Color} {Model,Year} {Model,Color} {Year,Color} {Model} {Year}{Color} { }

12 Creating a CUBE  The CUBE operator first aggregates over the just as the GROUP BY would. It then Unions in the 2 N -1 super aggregates shown in the previous picture.  ‘ALL’ is substituted for the Aggregating column. Every domain has an extra ‘ALL’ value now. The cube has (C 1 +1)*(C 2 +1)…..*(C N +1) values, where C i is the size of each domain.  Idea is that Since C i is usually large, CUBE will be only a little bigger compared to corresponding GROUP BY but provides a lot more information (not really true—see Conclusion)

13 CUBE Creation: EXAMPLE

14 ROLLUP Operator  If the user only needs a roll-up report or a drill down report computing the entire cube is wasteful  The ROLLUP Operator is used to compute only the super aggregates  In the previous example the if ROLLUP were used in place of CUBE only the following will be returned (marked in the figure in red)  Regular aggregation rows that would be produced by GROUP BY without using ROLLUP  First-level subtotals aggregating across Color for each combination of Model and Year  Second-level subtotals aggregating across Year and Color for each Model value  A grand total row

15 Integrating CUBE into SQL  GROUP BY ROLLUP CUBE  Could support histograms by extending GROUP BY  Use of ‘ALL’ to indicate ‘aggregate’ could cause problems  ALL is set valued other things are not  Could use NULL instead of ALL  Use Grouping function to distinguish between NULL and ALL

16 Computing Cubes and Rollups  Aggregation Computation Techniques  Compute aggregates at lowest possible system level to minimize processing cost  Use hashing to organize aggregation columns in memory  If aggregates don’t fit in memory use hybrid-hashing to organize data by value  If data is spread over several disks exploit parallelism for each partition and then combine results

17 Computing Cubes and Rollups  Simplest Algorithm 2 N algorithm  Allocate handle for every cell in the cube  When a tuple is passed in a function Iter is called on the handle and tuple value. Iter is called 2 N times  After all tuples have been computed a function, final is called to complete the aggregation  If base table has T tuples Iter is called Tx 2 N times— can improve!

18 Types of Aggregate Functions  Distributive Functions: F(X)=G(F(X)) Where X: {Xij | i = 1,...,I; j=1,...,J}  Example: MIN(), COUNT(),SUM(), MAX()  Algorithm Idea: Can compute N-1th dimension by aggregating Nth dimension  Example: Cross Tab—Here SUM() is the Distributive Function 1 st Dimension=sum of 2 nd Dim rows Oth Dimension=sum of 1 st dimension 2-D cube

19 Types of Aggregation Functions  Algebraic Functions: F(X)=H(G(X))  Example: Average(), Standard Dev. In Average G would be the sum and the count and H would be sum/count  Algorithm Idea: A handle is maintained for each cube cell and is passed on to the N-1dimensional superaggregates. A function Iter(&handle, &handle) is used to pass the aggregates. At the end Final() is called.  Example: Cross-Tab. Here Avg() is the algebraic function

20 Aggregation Functions  Holistic Functions: Functions where the sub- aggregates cannot be described by a fixed storage bound  Example: Median(), Rank()  No known algorithm better than 2 N

21 Comments & Conclusion  The CUBE operator is useful in Multi-Dimensional data analysis. Popular in OLAP tools  Can now answer queries that were either previously impossible or very complicated in SQL (using GROUP BY)  Claim that CUBE is only slightly larger than corresponding GROUP BY for large domains may not always be valid-often a large number of data values are null, so CUBE is much larger than the GROUP BY  Next logical issue is the issue of maintaining cubes  If Cube is materialized need to consider updates. Need to only small scan for some functions but need re-computation for others.  Example: Max() for SELECT, INSERT and DELETE

22 Implementing Data Cubes Efficiently Venky Harinarayan Anand Rajaraman Jeffery D. Ullman

23 Introduction  Decision Support Systems(DSS) rely a great deal on aggregations made on data in data-warehouses  These queries can be very complex and could take a substantial amount of time.  Need queries to be computed very quickly for Decision Support Systems  Idea: Can materialize or pre-compute certain queries  Pre-compute frequently asked queries  Pre-compute infrequent queries so that the time saved can be used to compute other queries

24 Materializing the Cube  Data Warehouses represent data in a multi- dimensional cube to answer aggregation queries and enable analysis in multiple dimensions  Computing every cell on request increases the response time, making it unacceptable for DSS  Need to materialize cube. Options:  Materialize the entire cube: Although pre-computing every cell gives excellent response times but space consumption makes it infeasible  Materialize parts of the cube: This is the problem explored in this paper-pick the optimal (close to optimal) cells to materialize

25 Dependencies in the Cube  Cells in the data cube that have the value “ALL” can be computed from other cells in the cube. Those that don’t are independent—need to query raw data to compute  Can materialize dependent cells(‘views’)  Interesting Questions:  How many views to materialize for good performance?  Which views to materialize to minimize average query cost?

26 Example {Part,Supplier,Customer} {part, customer} {part, supplier} {supplier,customer} {part} {supplier} {customer} { }  We need to materialize the {Part, Supplier, Customer} because we cannot derive this from any other view. Say the cost of this is 6 M rows  Suppose we want to answer a query on supplier-customer. If the cost of this view is also 6M, there is no point in materializing it since we can derive it from the {Part, Supplier, Customer} view. On the other hand say the part view costs 0.2 M and we need to query parts. There might be some reason in materializing it. If we used the {Part, Supplier, Customer} view we would have to process 6M rows.

27 Lattice Structure & Hierarchies  We say Query1 “is dependent on” Query2 if the result set of Query2 can be used to answer Query1  For example (part) “is dependent on” (part, customer)  Can form a lattice diagram from these dependencies  In addition each dimension of the data-cube can have a set of associated values, forming a hierarchy  Example: Day Week Month Year none  Week is “dependent” on Day

28 Combining Dimension Hierarchies with Query Dependencies  Need to combine these two concepts when arriving at query dependencies. The results from one query can contain any element in the hierarchy for that dimension  Suppose we have two tuples T1 and T2 then T1 is dependent on T2 iff every element in the tuple T1 is lower in the dimension hierarchy compared to the corresponding element in T2.

29

30 Cost Model  If view V has been materialized and V can be used to process V’ then the cost of answering V is assumed to be the number of rows in V’  In reality more or less than 1 entire scan of V’ may be required. V’ may be clustered on a particular attribute or may have an index on an attribute, eliminating the need to scan it once.  This uniform cost model is adopted to simplify analysis  Experimentally showed the linear relationship between query size and running time

31 Optimizing Data Cube Lattices  There is a space-time tradeoff. Need to optimize this tradeoff  Problem:  Minimize Time taken to evaluate queries  Have to materialize k views (this can be easily converted to a space restraint and the algorithm is the same)  This problem is NP-Complete.  Greedy Algorithm used to approximate optimal solution.

32 Algorithm Let C(v) be the cost of a view v If S is the set of views chosen so far the the benefit of adding a view v to s is B(v,S)  For each w that is depended on v let Bw be defined as follows:  If u is in S and w is dependent on u(u is the least cost such view in S) & C(v)<C(u) then Bw=C(v)- C(u) otherwise Bw=0  B(v,S) is then  w dependent on v Bw  Intuitively what is being measured is the sum of the benefits other views will gain if v is included in S

33 Algorithm S={top view} Select v such that B(v,S) is maximized S is the set of ‘optimal’ views K times

34 Example

35 Greedy not Always Good

36 Performance of Greedy Algorithm Greedy guarantees at least (e-1/e) of the benefit of the optimal algorithm  The greedy algorithm is exactly optimal when the benefit of the first picked view is much greater the subsequent views  If the benefits of all the views picked are equal then the greedy algorithm is optimal

37 Performance of Greedy Algorithm

38 Performance of Greedy Algorithm & Extensions to the Algorithm  The views in the lattice should have some probability associated with it so that the algorithm uses these probabilities as weights  Have to restrict space not number of views  Now need to consider benefit/space  Problem is that a small view may have high benefit/space compared to large view and large view may be excluded because of lack of space— views picked in the future are affected by this. Treat this as a boundary case and ignore!

39 Hypercube Lattice  In a hypercube typically, grouping occurs on n attributes and aggregation on the n+1th attribute.  Example: SELECT Mode, Year, Color, SUM(Sales) AS Sales FROM SALES GROUP BY CUBE Model, Year, Color  Exploit this regularity in selecting views  If size of each domain is r & only m cells of the top-most view are filled. If we group on I attributes then: If r i >=m then use m as size for any view where If r i >m If r i< m then use r i as size for any view where If r i < m

40 Time and Space Optimality  Space is minimized when only the top query is materialized  Each query takes time m. Total cost is mx2 N  Time is minimized if we materialize every query

41 Time and Space Optimality

42 Conclusion  Materialization of parts of the cube is very important in increasing response times for DSS  Pick the ‘optimal’ views that strike a balance between the space lost due to materialization and the increase in response time due to computing on the fly.  The greedy algorithm gives a close approximation of the optimal solution  Future Work: Dynamic materialization strategies


Download ppt "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross- Tab and Sub-Totals Gray et Al. Presented By: Priya Rajan."

Similar presentations


Ads by Google