2
Generating the Data Cube (Shared Disk)
Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University
Joint work with F. Dehne, T. Eavis, and S. Hambrusch
3
Data Cube Generation
Proposed by Gray et al. in 1995.
Can be generated from a relational DB but…
[Figure: a 3-dimensional cube on dimensions A, B, C (the cuboid ABC, or CAB) and the lattice of group-bys ABC, AB, AC, BC, A, B, C, ALL.]
4
As a table

Model  Year  Colour  Sales
Chevy  1990  Red         5
Chevy  1990  Blue       87
Ford   1990  Green      64
Ford   1990  Blue       99
Ford   1991  Red         8
Ford   1991  Blue        7
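To make the 2^d blow-up concrete, here is a minimal brute-force sketch (not the authors' code) that computes every group-by of the toy relation above; the ROWS/DIMS names and the cuboid helper are illustrative only.

```python
# Brute-force enumeration of all 2^d cuboids of the toy relation.
from itertools import combinations
from collections import defaultdict

ROWS = [
    ("Chevy", 1990, "Red",    5),
    ("Chevy", 1990, "Blue",  87),
    ("Ford",  1990, "Green", 64),
    ("Ford",  1990, "Blue",  99),
    ("Ford",  1991, "Red",    8),
    ("Ford",  1991, "Blue",   7),
]
DIMS = ("Model", "Year", "Colour")   # measure: Sales

def cuboid(dims):
    """Aggregate Sales over the chosen subset of dimensions."""
    idx = [DIMS.index(d) for d in dims]
    agg = defaultdict(int)
    for *attrs, sales in ROWS:
        key = tuple(attrs[i] for i in idx)
        agg[key] += sales
    return dict(agg)

# Enumerate all 2^d cuboids, from ALL (empty key) up to (Model, Year, Colour).
for r in range(len(DIMS) + 1):
    for dims in combinations(DIMS, r):
        print(dims or ("ALL",), cuboid(dims))
```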
5
The Challenge
Input data set R: |R| typically in the millions; usually will not fit into memory.
Number of dimensions d: 10-30.
2^d cuboids in the Data Cube.
How to solve this highly data- and computation-intensive problem in parallel?
6
Existing Parallel Results
Goil & Choudhary: MOLAP approach; parallelize the generation of each cuboid.
Challenge: > 2^d communication rounds.
7
Overview
1) Data cubes
2) Review of sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems
8
Optimizations based on computing multiple cuboids
Smallest-parent: compute a cuboid from the smallest previously computed cuboid.
Cache-results: cache in memory the results of a cuboid from which other cuboids are computed, to reduce disk I/O.
Amortize-scans: amortize disk reads by computing as many cuboids as possible per scan.
Share-sorts: share sorting cost across cuboids.
[Figure: the lattice ABCD → ABC, ABD, ACD, BCD → AB, AC, AD, BC, BD, CD → A, B, C, D → All.]
9
Many Algorithms
Pipesort – [AADGNRS'96]
PipeHash – [SAG'96]
Overlap – [DANR'96]
ArrayCube – [ZDN'97]
Bottom-up-cube – [BR'99]
Partition Cube – [RS'97]
Memory Cube – [RS'97]
10
Approaches
Top down: Pipesort – [AADGNRS'96], PipeHash – [SAG'96], Overlap – [DANR'96]
Bottom up: Bottom-up-cube – [BR'99], Partition Cube – [RS'97], Memory Cube – [RS'97]
Array based: ArrayCube – [ZDN'97]
11
Our Results
A framework for parallelization of existing sequential data cube algorithms:
Top-down
Bottom-up
Array based
Architecture independent.
Communication efficient: avoids irregular communication patterns, few large messages, overlaps computation and communication.
Today's focus: the top-down approach.
12
Top Down Algorithms
[Figure: the cuboid lattice from ABCD down to All.]
Find a "least cost" spanning tree.
Use estimators of cuboid size.
Exploit data shrinking, pipelining, and cuts vs. sorts.
13
Cut vs. Sort
Starting from the ordering ABCD:
Cutting ABCD -> ABC: linear time (ABC is a prefix of ABCD, so the existing sort order can be reused in a single scan).
Sorting ABCD -> ABD: sort time (ABD is not a prefix, so a new sort is required).
The size of ABC may be much smaller than that of ABCD.
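A minimal sketch of the difference, with hypothetical helper names; rows are assumed to be (A, B, C, D, measure) tuples already held in ABCD sort order.

```python
# Cut vs. sort, sketched on (A, B, C, D, measure) tuples in ABCD order.

def cut_last_attribute(sorted_rows):
    """ABCD -> ABC: a single linear pass.  The ABC order is a prefix of the
    ABCD order, so we only merge consecutive rows that agree on (A, B, C)."""
    out = []
    for *key, _d, m in sorted_rows:
        if out and out[-1][0] == tuple(key):
            out[-1][1] += m
        else:
            out.append([tuple(key), m])
    return out

def sort_then_aggregate(rows, key_positions):
    """ABCD -> ABD (key_positions = [0, 1, 3]): ABD is not a prefix of ABCD,
    so a fresh sort is needed before the same merge-of-runs aggregation."""
    rows = sorted(rows, key=lambda r: tuple(r[i] for i in key_positions))
    out = []
    for r in rows:
        key = tuple(r[i] for i in key_positions)
        if out and out[-1][0] == key:
            out[-1][1] += r[-1]
        else:
            out.append([key, r[-1]])
    return out
```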
14
Pipesort [AADGNRS'96]
Minimize sorting while seeking to compute each cuboid from its smallest parent.
Pipeline sorts with common prefixes.
15
Level-by-level Optimization
Minimum cost matching in a bipartite graph.
[Figure: scan edges drawn solid, sort edges dashed.]
Establish the dimension ordering working up the lattice.
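A minimal sketch of the matching step, assuming per-parent scan and sort cost estimates are already available. It replicates each level-(k+1) view so that one child can be fed by a cheap scan and the rest by sorts, and it solves the assignment with SciPy's linear_sum_assignment rather than the matching routine of the original paper, so it is an approximation of the construction, not the paper's exact method.

```python
# Level-by-level matching sketch: assign each level-k view (child) to a
# level-(k+1) view (parent), paying a scan cost for at most one child per
# parent and a sort cost for the others.  Not the paper's exact construction.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_level(children, parents, scan_cost, sort_cost):
    """children/parents: cuboid names such as "AB", "ABC".  scan_cost[p] and
    sort_cost[p]: estimated costs of producing a child from parent p."""
    cols = []                                              # replicated parent slots
    for p in parents:
        cols.append((p, scan_cost[p]))                     # one cheap scan slot
        cols += [(p, sort_cost[p])] * (len(children) - 1)  # sort slots
    BIG = 1e18                                             # forbidden pairing
    cost = np.empty((len(children), len(cols)))
    for i, c in enumerate(children):
        for j, (p, edge_cost) in enumerate(cols):
            cost[i, j] = edge_cost if set(c) <= set(p) else BIG
    rows, slots = linear_sum_assignment(cost)
    return {children[i]: cols[j][0] for i, j in zip(rows, slots)}

# Example call: match_level(["AB", "AC", "BC"], ["ABC", "ABD"], scan, sort)
```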
16
Overview 1)Data cubes 2)Review sequential cubing algorithms 3)Our Top-down parallel algorithm 4)Conclusions and open problems
17
Top-down Parallel: The Idea
Cut the process tree into p "equal weight" subtrees.
Each processor generates the cuboids of its subtree independently.
Load balance / stripe the output.
[Figure: the Pipesort process tree rooted at CBAD, partitioned into subtrees.]
18
The Basic Algorithm
(1) Construct a lattice housing all 2^d views.
(2) Estimate the size of each of the views in the lattice.
(3) To determine the cost of using a given view to directly compute its children, use its estimated size to calculate (a) the cost of scanning the view and (b) the cost of sorting it.
(4) Using the bipartite matching technique presented in the original IBM paper, reduce the lattice to a spanning tree that identifies the appropriate set of prefix-ordered sort paths.
(5) Add the I/O estimates to the spanning tree.
(6) Partition the tree into p sub-trees.
(7) Distribute the sub-tree lists to each of the p compute nodes.
(8) On each node, use the sequential Pipesort algorithm to build the set of local views.
19
Tree Partitioning
What does "equal weight" mean? We want to minimize the weight of the maximum-weight partition!
O(Rk(k + log d) + n) time – Becker, Perl and Schach '82
O(n) time – Frederickson 1990
20
Tree Partitioning
Min-max tree k-partitioning: given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree to obtain k+1 connected components T_1, T_2, …, T_{k+1} such that the largest total weight of a resulting subtree is minimized.
O(Rk(k + log d) + n) time – Becker, Perl and Schach '82
O(n) time – Frederickson 1990
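As a rough illustration only (not the linear-time algorithm cited above), the min-max objective can be met by binary-searching the bound B and greedily cutting, bottom-up, whenever a component would exceed B. Integer vertex weights and a dict-of-children tree representation are assumed.

```python
# Min-max tree k-partitioning by binary search on the bound plus a greedy
# bottom-up feasibility check.  Assumes integer vertex weights.

def min_cuts_for_bound(tree, weight, root, bound):
    """Greedy bottom-up: cut an edge whenever a component would exceed `bound`.
    Returns the number of cuts, or None if a single vertex already exceeds it."""
    def walk(v):
        nonlocal cuts
        if weight[v] > bound:
            raise ValueError
        residuals = [walk(c) for c in tree.get(v, [])]
        total = weight[v] + sum(residuals)
        for r in sorted(residuals, reverse=True):   # shed heaviest children first
            if total <= bound:
                break
            total -= r
            cuts += 1
        return total
    cuts = 0
    try:
        walk(root)
    except ValueError:
        return None
    return cuts

def min_max_k_partition(tree, weight, root, k):
    """Smallest bound B such that <= k edge deletions keep every component <= B."""
    lo, hi = max(weight.values()), sum(weight.values())
    while lo < hi:
        mid = (lo + hi) // 2
        cuts = min_cuts_for_bound(tree, weight, root, mid)
        if cuts is not None and cuts <= k:
            hi = mid
        else:
            lo = mid + 1
    return lo
```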
21
Dynamic min-max
[Figure: an example process tree (Raw data, ABC, AB, BC, A) annotated with cuboid weights such as 125, 47, 15 and 8.]
22
Over-sampling
Instead of cutting the process tree into p subtrees directly, cut it into s * p subtrees and then pack them into p subsets.
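A minimal sketch of the packing half of the over-sampling idea (a hypothetical stand-in for the paper's scheme, with s the over-sampling factor): the s * p subtree weights are packed into p subsets with a simple longest-processing-time greedy, so the extra cuts give the packer room to even out per-processor work.

```python
# Pack s*p subtree weights into p subsets, heaviest first into the lightest bin.
import heapq

def pack_subtrees(subtree_weights, p):
    """subtree_weights: one weight per subtree (s*p of them).
    Returns p lists of weights, one per processor."""
    bins = [(0, i, []) for i in range(p)]            # (load, id, members)
    heapq.heapify(bins)
    for w in sorted(subtree_weights, reverse=True):  # heaviest subtrees first
        load, i, members = heapq.heappop(bins)       # currently lightest bin
        members.append(w)
        heapq.heappush(bins, (load + w, i, members))
    return [members for _, _, members in bins]
```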
23
Implementation Issues
1) Sort optimization
2) Minimizing data movement
3) Efficient aggregation operations
4) Disk optimizations
24
1) Sort Optimization
qsort is SLOW: it may be O(n^2) when there are duplicates.
When cardinality is small, the range of keys is small: use radix sort.
Dynamically select between well-optimized radix and quick sorts.
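A minimal sketch of the dynamic selection, assuming dimension values are encoded as small integers and with a purely illustrative threshold; Python's built-in sort stands in for the tuned quicksort, so this shows the selection logic rather than the actual implementation.

```python
# Choose counting/radix-style sorting when the key range is small, otherwise
# fall back to a comparison sort.  Threshold is illustrative only.

RADIX_THRESHOLD = 1 << 16   # cut-off on key cardinality

def counting_sort_rows(rows, col, cardinality):
    """Stable linear-time sort of rows on an integer column in [0, cardinality)."""
    buckets = [[] for _ in range(cardinality)]
    for r in rows:
        buckets[r[col]].append(r)
    return [r for b in buckets for r in b]

def sort_rows(rows, cols, cardinalities):
    """Sort rows on `cols` (most significant first), choosing the method per column."""
    for col in reversed(cols):                          # LSD pass over the key columns
        if cardinalities[col] <= RADIX_THRESHOLD:
            rows = counting_sort_rows(rows, col, cardinalities[col])
        else:
            rows = sorted(rows, key=lambda r: r[col])   # stable comparison sort
    return rows
```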
25
2) Minimizing Data Movement
Sort pointers to the records!
Never reorder the columns.
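A minimal sketch of pointer sorting: order an array of row indices instead of physically moving the (wide) records themselves.

```python
def sort_pointers(rows, cols):
    """Return a permutation of row indices ordering `rows` by `cols`;
    the records are never moved and their columns are never reordered."""
    return sorted(range(len(rows)), key=lambda i: tuple(rows[i][c] for c in cols))

# Usage: order = sort_pointers(rows, [0, 2, 1])   # e.g. sort by A, C, B
```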
26
3) Efficient Aggregation Operations
One pass for each pipeline.
Do lazy aggregation.
[Figure: the pipeline ABCD → ABC → AB → A → all.]
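A simplified one-pass sketch, assuming the input is already in full sort order. Unlike a real lazy implementation, it adds each row's measure directly at every prefix length rather than cascading a closed child group's total into its parent, but it shows why one sorted scan can feed the whole pipeline: a shorter prefix's group only closes when a longer prefix's group does.

```python
def pipeline(sorted_rows, d):
    """sorted_rows: (k1, ..., kd, measure) tuples in full sort order.
    Yields (prefix_length, key, total) for every group of every prefix cuboid."""
    keys = [None] * (d + 1)          # current group key per prefix length
    totals = [0] * (d + 1)
    for *key, m in sorted_rows:
        for L in range(d, -1, -1):   # longer prefixes close before shorter ones
            pk = tuple(key[:L])
            if keys[L] is None:
                keys[L] = pk
            elif keys[L] != pk:      # group closed: flush it and start a new one
                yield L, keys[L], totals[L]
                keys[L], totals[L] = pk, 0
            totals[L] += m
    for L in range(d, -1, -1):       # flush the final open group of each prefix
        if keys[L] is not None:
            yield L, keys[L], totals[L]
```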
27
4) Disk Optimizations
Avoid OS buffering.
Implemented an I/O manager that manages buffers to avoid thrashing and does I/O in a separate process to overlap it with computation.
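A minimal sketch of overlapping I/O with computation. The real I/O manager bypasses OS buffering and runs in a separate process with its own buffer pool; this sketch just uses a background thread and a two-slot queue to show the double-buffering pattern.

```python
# Double buffering: a background worker reads the next chunk while the main
# loop processes the current one.
import threading
import queue

def read_chunks(path, chunk_size, out_q):
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            out_q.put(buf)           # blocks when both buffer slots are full
            if not buf:
                return               # empty bytes object signals end of file

def process_file(path, handle_chunk, chunk_size=1 << 20):
    q = queue.Queue(maxsize=2)       # two in-flight buffers
    reader = threading.Thread(target=read_chunks, args=(path, chunk_size, q))
    reader.start()
    while True:
        buf = q.get()                # computation overlaps the next read
        if not buf:
            break
        handle_chunk(buf)
    reader.join()
```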
28
Speedup - Cluster
29
Efficiency - Cluster
30
Speedup - SunFire
31
Efficiency - SunFire
32
Increasing Data Size
33
Varying Over Sampling Factor
34
Varying Skew
35
Conclusions
New communication-efficient parallel cubing framework for:
Top-down
Bottom-up
Array based
Easy to implement (sort of), architecture independent.
36
Thank you! Questions?