# Generating the Data Cube (Shared Disk) Andrew Rau-Chaplin Faculty of Computer Science Dalhousie University Joint Work with F. Dehne T. Eavis S. Hambrusch.


Data Cube Generation Proposed by Gray et al. in 1995. Can be generated from a relational DB, but… [Figure: the cuboid lattice over dimensions A, B, C: the base cuboid ABC (or CAB) and its aggregates AB, AC, BC, A, B, C, and ALL, annotated with example cell counts.]

As a table:

| Model | Year | Colour | Sales |
|-------|------|--------|-------|
| Chevy | 1990 | Red    | 5     |
| Chevy | 1990 | Blue   | 87    |
| Ford  | 1990 | Green  | 64    |
| Ford  | 1990 | Blue   | 99    |
| Ford  | 1991 | Red    | 8     |
| Ford  | 1991 | Blue   | 7     |

The Challenge Input data set R: |R| is typically in the millions and usually will not fit into memory. Number of dimensions d: typically 10-30, giving 2^d cuboids in the data cube. How to solve this highly data- and computation-intensive problem in parallel?
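
The exponential blow-up is easy to see concretely. A minimal Python sketch (dimension names are illustrative) that enumerates every cuboid of a d-dimensional cube:

```python
from itertools import combinations

def all_cuboids(dims):
    """Enumerate every group-by (cuboid) of a list of dimension names.

    A cube over d dimensions has 2^d cuboids: one per subset of the
    dimensions, including the empty subset (the grand total, ALL).
    """
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            yield subset

cuboids = list(all_cuboids(["A", "B", "C"]))
print(len(cuboids))  # 2^3 = 8: ALL, A, B, C, AB, AC, BC, ABC
```

At d = 30 this is over a billion cuboids, which is why choosing what to compute, from which parent, and in what order dominates the design.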

Existing Parallel Results Goil & Choudhary: a MOLAP approach that parallelizes the generation of each cuboid. Challenge: more than 2^d communication rounds.

Overview
1) Data cubes
2) Review of sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems

Optimizations based on computing multiple cuboids
Smallest-parent: compute a cuboid from the smallest previously computed cuboid.
Cache-results: cache in memory the results of a cuboid from which other cuboids are computed, to reduce disk I/O.
Amortize-scans: amortize disk reads by computing as many cuboids as possible per scan.
Share-sorts: share sorting cost across cuboids.
[Figure: the cuboid lattice for dimensions A, B, C, D, from ABCD down to ALL.]

Many Algorithms
- Pipesort [AADGNRS'96]
- PipeHash [SAG'96]
- Overlap [DANR'96]
- ArrayCube [ZDN'97]
- Bottom-up-cube [BR'99]
- Partition Cube [RS'97]
- Memory Cube [RS'97]

Approaches
- Top down: Pipesort [AADGNRS'96], PipeHash [SAG'96], Overlap [DANR'96]
- Bottom up: Bottom-up-cube [BR'99], Partition Cube [RS'97], Memory Cube [RS'97]
- Array based: ArrayCube [ZDN'97]

Our Results A framework for the parallelization of existing sequential data cube algorithms: top-down, bottom-up, and array based. The framework is architecture independent and communication efficient: it avoids irregular communication patterns, sends few large messages, and overlaps computation with communication. Today's focus: the top-down approach.

Top Down Algorithms [Figure: the cuboid lattice for dimensions A, B, C, D, from ABCD down to ALL.] Find a "least cost" spanning tree of the lattice, using estimators of cuboid size. Exploit data shrinking, pipelining, and cuts vs. sorts.

Cut vs. Sort Given the ordering ABCD: cutting ABCD -> ABC takes linear time, while sorting ABCD -> ABD takes sort time. The size of ABC may be much smaller than that of ABCD.
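
The asymmetry can be sketched in Python (rows and dimension values are made up for illustration): a prefix cuboid such as ABC falls out of ABCD-ordered data in a single linear pass with `itertools.groupby`, while a non-prefix cuboid such as ABD forces a re-sort first.

```python
from itertools import groupby

# Rows already sorted in (A, B, C, D) order; the last field is the measure.
rows = [
    ("chevy", 1990, "blue", "east", 87),
    ("chevy", 1990, "blue", "west", 10),
    ("chevy", 1990, "red",  "east", 5),
    ("ford",  1991, "blue", "west", 7),
]

def cut(sorted_rows, prefix_len):
    """Prefix cuboid in one linear pass: rows sharing the first
    `prefix_len` attributes are already adjacent, so no sort is needed."""
    return [(key, sum(r[-1] for r in grp))
            for key, grp in groupby(sorted_rows,
                                    key=lambda r: r[:prefix_len])]

def sort_then_cut(rows, attr_idx):
    """Non-prefix cuboid (e.g. ABD from ABCD-ordered data): must re-sort
    on the chosen attributes first, then aggregate adjacent runs."""
    key = lambda r: tuple(r[i] for i in attr_idx)
    return [(k, sum(r[-1] for r in grp))
            for k, grp in groupby(sorted(rows, key=key), key=key)]

abc = cut(rows, 3)                    # ABC: free ride on the ABCD order
abd = sort_then_cut(rows, (0, 1, 3))  # ABD: needs an O(n log n) sort first
```

The third point on the slide also shows up here: `abc` has fewer rows than `rows` because duplicates collapse, so later cuboids shrink.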

Pipesort [AADGNRS'96] Minimize sorting while seeking to compute each cuboid from its smallest parent. Pipeline sorts with common prefixes.

Level-by-level Optimization Minimum cost matching in a bipartite graph. [Figure: lattice with scan edges drawn solid and sort edges dashed.] Establish the dimension ordering working up the lattice.

Overview
1) Data cubes
2) Review of sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems

Top-down parallel: The Idea Cut the process tree into p "equal weight" subtrees. Each processor generates the cuboids of its subtree independently. Load balance / stripe the output. [Figure: the process tree rooted at CBAD cut into subtrees.]

The Basic Algorithm
(1) Construct a lattice housing all 2^d views.
(2) Estimate the size of each of the views in the lattice.
(3) To determine the cost of using a given view to directly compute its children, use its estimated size to calculate (a) the cost of scanning the view and (b) the cost of sorting it.
(4) Using the bipartite matching technique presented in the original IBM paper, reduce the lattice to a spanning tree that identifies the appropriate set of prefix-ordered sort paths.
(5) Add the I/O estimates to the spanning tree.
(6) Partition the tree into p sub-trees.
(7) Distribute the sub-tree lists to each of the p compute nodes.
(8) On each node, use the sequential Pipesort algorithm to build the set of local views.

Tree Partitioning What does "equal weight" mean? We want to minimize the weight of the heaviest partition! O(Rk(k + log d)+n) time: Becker, Perl and Schach '82. O(n) time: Frederickson 1990.

Tree Partitioning Min-max tree k-partitioning: given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree to obtain k+1 connected components T_1, T_2, …, T_{k+1} such that the largest total weight of a resulting subtree is minimized. O(Rk(k + log d)+n) time: Becker, Perl and Schach '82. O(n) time: Frederickson 1990.
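
As a rough illustration of the problem only (not the cited algorithms, which are substantially more involved), here is a hedged Python sketch: binary-search the optimal max component weight, with a post-order greedy feasibility check that cuts off the heaviest child subtrees whenever a component would overflow the candidate bound.

```python
def min_max_k_partition(children, weight, root, k):
    """Min-max tree k-partitioning sketch.

    children: dict node -> list of child nodes; weight: dict node -> weight.
    Deleting k edges leaves k+1 components; return the smallest achievable
    weight of the heaviest component, per the greedy check below.
    """
    def cuts_needed(bound):
        # Number of edge cuts the greedy makes so that every component
        # weighs <= bound, or None if a single node already exceeds bound.
        cuts = 0
        acc = {}  # acc[v]: weight of v's component after cuts below v

        def visit(v):
            nonlocal cuts
            if weight[v] > bound:
                return False
            total = weight[v]
            child_accs = []
            for c in children.get(v, []):
                if not visit(c):
                    return False
                child_accs.append(acc[c])
            total += sum(child_accs)
            # Cut edges to the heaviest child subtrees until v's component fits.
            for a in sorted(child_accs, reverse=True):
                if total <= bound:
                    break
                total -= a
                cuts += 1
            acc[v] = total
            return True

        return cuts if visit(root) else None

    lo, hi = max(weight.values()), sum(weight.values())
    while lo < hi:
        mid = (lo + hi) // 2
        c = cuts_needed(mid)
        if c is not None and c <= k:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Toy tree: 1 -> {2, 3}, 2 -> {4, 5}, unit weights; one cut (k = 1)
# yields two components of sizes 3 and 2, so the min-max weight is 3.
children = {1: [2, 3], 2: [4, 5]}
weight = {v: 1 for v in range(1, 6)}
best = min_max_k_partition(children, weight, root=1, k=1)
```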

Dynamic min-max [Figure: example process tree with per-node weights: raw data and the cuboids ABC, AB, BC, A.]

Over-sampling Rather than cutting the process tree directly into p subtrees, cut it into s * p subtrees, then pack those into p subsets.
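
One plausible way to realize the over-sampling idea, sketched in Python with made-up weights: cut the tree into s * p pieces, then pack the pieces onto p processors with a longest-processing-time greedy. The finer granularity lets the packing balance the p loads better than a direct p-way cut.

```python
import heapq

def pack_subtrees(subtree_weights, p):
    """Pack s*p subtree weights into p subsets, heaviest first, always
    onto the currently least-loaded processor (LPT rule).
    Returns the list of p subset weight totals."""
    heap = [(0, i) for i in range(p)]      # (current load, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(p)]
    for w in sorted(subtree_weights, reverse=True):
        load, i = heapq.heappop(heap)      # least-loaded processor
        assignment[i].append(w)
        heapq.heappush(heap, (load + w, i))
    return [sum(s) for s in assignment]

# s = 4, p = 2: eight subtree weights packed onto two processors.
loads = pack_subtrees([9, 7, 6, 5, 4, 3, 2, 2], p=2)
```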

Implementation Issues
1) Sort optimization
2) Minimizing data movement
3) Efficient aggregation operations
4) Disk optimizations

1) Sort Optimization qsort is SLOW: it may be O(n^2) when there are duplicates. When cardinality is small, the range of keys is small, so radix sort applies. Dynamically select between well-optimized radix and quick sorts.
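
A minimal Python sketch of the dynamic selection; the threshold and the counting-sort stand-in for a tuned radix sort are illustrative only:

```python
def cube_sort(keys, cardinality):
    """Pick a distribution sort when the key range is small
    (low-cardinality dimension), else fall back to a comparison sort."""
    n = len(keys)
    if cardinality <= max(256, n):
        # Counting sort: O(n + cardinality), immune to duplicate keys.
        counts = [0] * cardinality
        for k in keys:
            counts[k] += 1
        out = []
        for value, c in enumerate(counts):
            out.extend([value] * c)
        return out
    # General case: O(n log n) and, unlike a naive quicksort,
    # not quadratic when the input is full of duplicates.
    return sorted(keys)

print(cube_sort([3, 1, 3, 0, 2, 3], cardinality=4))  # [0, 1, 2, 3, 3, 3]
```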

2) Minimizing Data Movement Sort pointers to the records! Never reorder the columns
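
The point can be shown in a few lines of Python (the records are made up): sort an array of row indices instead of the records themselves.

```python
# Records stay in place; only an array of row indices is sorted. Permuting
# small integers is far less data movement than shuffling wide records,
# and the columns within each record are never reordered.
records = [("ford", 1991, 7), ("chevy", 1990, 87), ("ford", 1990, 99)]

# Sort "pointers" (indices) by the (model, year) key of the record they
# reference; `records` itself is untouched.
order = sorted(range(len(records)), key=lambda i: records[i][:2])
sorted_view = [records[i] for i in order]
print(order)  # [1, 2, 0]
```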

3) Efficient Aggregation Operations One pass for each pipeline; do lazy aggregation. [Figure: the pipeline ABCD -> ABC -> AB -> A -> ALL.]

4) Disk Optimizations Avoid OS buffering: we implemented an I/O manager that manages buffers to avoid thrashing and performs I/O in a separate process to overlap it with computation.

Speedup - Cluster

Efficiency - Cluster

Speedup - SunFire

Efficiency - SunFire

Increasing Data Size

Varying Over Sampling Factor

Varying Skew

Conclusions A new communication-efficient parallel cubing framework for top-down, bottom-up, and array-based algorithms. Easy to implement (sort of), and architecture independent.

Thank you! Questions?
