2 Generating the Data Cube (Shared Disk)
Andrew Rau-Chaplin, Faculty of Computer Science, Dalhousie University.
Joint work with F. Dehne, T. Eavis, and S. Hambrusch.

3 Data Cube Generation
Proposed by Gray et al. in 1995.
Can be generated from a relational DB, but…
[Figure: a 3-D cube over dimensions A, B, C, showing the cuboid ABC (or CAB) and aggregate values for the group-bys ABC, AB, AC, BC, B, and ALL.]

4 As a table

Model  Year  Colour  Sales
Chevy  1990  Red         5
Chevy  1990  Blue       87
Ford   1990  Green      64
Ford   1990  Blue       99
Ford   1991  Red         8
Ford   1991  Blue        7

5 The Challenge
Input data set R; |R| is typically in the millions and usually will not fit into memory.
Number of dimensions d is 10-30.
2^d cuboids in the Data Cube.
How to solve this highly data- and compute-intensive problem in parallel?
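To make the 2^d count concrete, here is a minimal Python sketch (the function name is mine, not from the talk) that enumerates every cuboid of a d-dimensional cube:

```python
from itertools import combinations

def all_cuboids(dims):
    """Enumerate all 2^d group-bys (cuboids) of a d-dimensional cube."""
    for k in range(len(dims), -1, -1):       # from the base cuboid down to ALL
        for subset in combinations(dims, k):
            yield subset or ("ALL",)

# 3 dimensions -> 8 cuboids; at d = 30 the count is 2^30, over a billion.
print(list(all_cuboids(("A", "B", "C"))))
```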

6 Existing Parallel Results
Goil & Choudhary: MOLAP approach.
Parallelize the generation of each cuboid.
Challenge: more than 2^d communication rounds.

7 Overview
1) Data cubes
2) Review of sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems

8 Optimizations based on computing multiple cuboids
Smallest-parent: compute a cuboid from the smallest previously computed cuboid.
Cache-results: cache in memory the results of a cuboid from which other cuboids are computed, to reduce disk I/O.
Amortize-scans: amortize disk reads by computing as many cuboids as possible per scan.
Share-sorts: share sorting cost across cuboids.
[Figure: the cuboid lattice for dimensions A, B, C, D, from ABCD down to ALL.]

9 Many Algorithms
Pipesort [AADGNRS’96]
PipeHash [SAG’96]
Overlap [DANR’96]
ArrayCube [ZDN’97]
Bottom-Up-Cube [BR’99]
Partition Cube [RS’97]
Memory Cube [RS’97]

10 Approaches
Top-down: Pipesort [AADGNRS’96], PipeHash [SAG’96], Overlap [DANR’96]
Bottom-up: Bottom-Up-Cube [BR’99], Partition Cube [RS’97], Memory Cube [RS’97]
Array-based: ArrayCube [ZDN’97]

11 Our Results
A framework for parallelizing existing sequential data cube algorithms: top-down, bottom-up, and array-based.
Architecture independent.
Communication efficient: avoids irregular communication patterns, uses few large messages, and overlaps computation with communication.
Today’s focus: the top-down approach.

12 Top-Down Algorithms
Find a “least cost” spanning tree of the cuboid lattice.
Use estimators of cuboid size.
Exploit data shrinking, pipelining, and cuts vs. sorts.
[Figure: the cuboid lattice for dimensions A, B, C, D with a spanning tree highlighted.]

13 Cut vs. Sort
Given the ordering ABCD:
Cutting ABCD -> ABC: linear time (ABC is a prefix of ABCD, so the sort order is preserved).
Sorting ABCD -> ABD: sort time (the attribute order changes).
ABC may also be much smaller than ABCD.
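A minimal sketch of the difference, assuming each record is a (key-tuple, measure) pair already sorted in ABCD order; `aggregate_sorted`, `cut`, and `resort` are illustrative helpers, not code from the paper:

```python
def aggregate_sorted(rows):
    """One linear pass over sorted rows: merge adjacent equal keys."""
    out = []
    for key, measure in rows:
        if out and out[-1][0] == key:
            out[-1][1] += measure
        else:
            out.append([key, measure])
    return out

def cut(rows_sorted, keep):
    """ABCD -> ABC: the prefix order is preserved, so no sort is needed."""
    return aggregate_sorted((key[:keep], m) for key, m in rows_sorted)

def resort(rows_sorted, attrs):
    """ABCD -> ABD: the attribute order changes, so we must re-sort."""
    projected = [(tuple(key[i] for i in attrs), m) for key, m in rows_sorted]
    projected.sort(key=lambda r: r[0])       # the extra O(n log n) cost
    return aggregate_sorted(projected)

# cut(rows, 3) computes ABC; resort(rows, (0, 1, 3)) computes ABD.
```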

14 Pipesort [AADGNRS’96]
Minimize sorting while seeking to compute each cuboid from its smallest parent.
Pipeline sorts that share common prefixes.

15 Level-by-Level Optimization
Minimum-cost matching in a bipartite graph between adjacent levels of the lattice.
Scan edges solid; sort edges dashed.
Establishes the dimension ordering while working up the lattice.
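One way to realize this step is a minimum-cost assignment solver. The sketch below uses SciPy and assumes cuboids are attribute strings like "ABC" and that `scan_cost`/`sort_cost` come from the size estimates. Note the simplification: real Pipesort replicates each parent so it can feed several children, whereas here a parent serves at most one child.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e18   # effectively forbids matching a child to a non-parent

def match_level(children, parents, scan_cost, sort_cost):
    """Assign each level-k cuboid a level-(k+1) parent at minimum total
    cost: a cheap scan edge when the child is a prefix of the parent's
    attribute order, a sort edge otherwise."""
    cost = np.full((len(children), len(parents)), BIG)
    for i, c in enumerate(children):
        for j, p in enumerate(parents):
            if set(c) <= set(p):             # p can produce c at all
                cost[i, j] = scan_cost(p) if p.startswith(c) else sort_cost(p)
    rows, cols = linear_sum_assignment(cost)
    return {children[i]: parents[j] for i, j in zip(rows, cols)}
```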

16 Overview
1) Data cubes
2) Review of sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems

17 Top-Down Parallel: The Idea
Cut the process tree into p “equal weight” subtrees.
Each processor generates the cuboids of its subtree independently.
Load balance/stripe the output.
[Figure: the process tree over the reordered lattice, rooted at CBAD, cut into p subtrees.]

18 The Basic Algorithm
(1) Construct a lattice housing all 2^d views.
(2) Estimate the size of each of the views in the lattice.
(3) To determine the cost of using a given view to directly compute its children, use its estimated size to calculate (a) the cost of scanning the view and (b) the cost of sorting it.
(4) Using the bipartite matching technique presented in the original IBM paper, reduce the lattice to a spanning tree that identifies the appropriate set of prefix-ordered sort paths.
(5) Add the I/O estimates to the spanning tree.
(6) Partition the tree into p sub-trees.
(7) Distribute the sub-tree lists to each of the p compute nodes.
(8) On each node, use the sequential Pipesort algorithm to build the set of local views.
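A toy, runnable condensation of steps (1)-(4), with a greedy per-child choice standing in for the bipartite matching (the cost callables are whatever estimators steps (2)-(3) provide; all names are mine):

```python
from itertools import combinations

def spanning_tree(dims, est_size, scan_cost, sort_cost):
    """Build the lattice of all 2^d views and give each view the cheapest
    parent one level up (a greedy stand-in for the matching of step 4)."""
    levels = [["".join(c) for c in combinations(dims, k)]
              for k in range(len(dims) + 1)]     # level k = views with k attributes
    parent = {}
    for k in range(len(dims)):
        for child in levels[k]:
            parent[child] = min(
                (p for p in levels[k + 1] if set(child) <= set(p)),
                key=lambda p: scan_cost(est_size(p)) if p.startswith(child)
                else sort_cost(est_size(p)))
    return parent                                # view -> view it is computed from

# e.g. spanning_tree("ABC", est_size=len, scan_cost=lambda s: s, sort_cost=lambda s: 10 * s)
```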

19 Tree Partitioning
What does “equal weight” mean? We want to minimize the weight of the heaviest partition!
O(Rk(k + log d) + n) time — Becker, Perl and Schach ’82.
O(n) time — Frederickson ’90.

20 Tree Partitioning
Min-max tree k-partitioning: given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges of the tree to obtain k+1 connected components T_1, T_2, …, T_{k+1} such that the largest total weight of a resulting sub-tree is minimized.
O(Rk(k + log d) + n) time — Becker, Perl and Schach ’82.
O(n) time — Frederickson ’90.
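The cited algorithms are involved; as a rough, runnable illustration of the problem itself, here is a heuristic sketch (my construction, not Becker/Perl/Schach or Frederickson, and it may overestimate the optimum) that binary-searches the bound and greedily cuts subtrees bottom-up, assuming integer weights and a tree given as a parent-to-children dict:

```python
def can_partition(tree, weights, root, bound, k):
    """Greedy feasibility check: can <= k edge deletions leave every
    component with total weight <= bound?"""
    cuts = 0

    def weigh(v):
        nonlocal cuts
        if weights[v] > bound:
            return None                      # a single vertex is already too heavy
        child_w = []
        for c in tree.get(v, []):
            w = weigh(c)
            if w is None:
                return None
            child_w.append(w)
        total = weights[v]
        for w in sorted(child_w):            # keep light subtrees, cut the rest
            if total + w <= bound:
                total += w
            else:
                cuts += 1
        return total

    return weigh(root) is not None and cuts <= k

def min_max_bound(tree, weights, root, k):
    """Smallest bound the greedy check accepts (binary search)."""
    lo, hi = max(weights.values()), sum(weights.values())
    while lo < hi:
        mid = (lo + hi) // 2
        if can_partition(tree, weights, root, mid, k):
            hi = mid
        else:
            lo = mid + 1
    return lo
```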

21 Dynamic min-max
[Figure: a process tree over the raw data (nodes such as ABC, AB, BC, A) with per-node weights (125, 15, 8, 47, …) illustrating dynamic min-max partitioning.]

22 Over-sampling
Instead of cutting the tree into p subtrees, cut it into s × p subtrees, then pack them into p subsets of roughly equal total weight, one per processor (see the sketch below).
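Packing the s·p subtrees into p subsets is then a small scheduling problem; a minimal sketch using the classic greedy rule of giving the next-heaviest subtree to the currently lightest subset (my choice of heuristic; the slide does not fix one):

```python
import heapq

def pack(subtrees, p):
    """Pack (weight, name) subtrees into p subsets: heaviest first,
    always onto the currently lightest subset (LPT rule)."""
    bins = [(0, i, []) for i in range(p)]            # (load, index, members)
    heapq.heapify(bins)
    for w, name in sorted(subtrees, reverse=True):
        load, i, members = heapq.heappop(bins)       # lightest subset so far
        members.append(name)
        heapq.heappush(bins, (load + w, i, members))
    return [members for _, _, members in sorted(bins, key=lambda b: b[1])]

# s = 2, p = 2: pack([(125, "T1"), (47, "T2"), (15, "T3"), (8, "T4")], 2)
# -> [["T1"], ["T2", "T3", "T4"]]
```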

23 Implementation Issues
1) Sort optimization
2) Minimizing data movement
3) Efficient aggregation operations
4) Disk optimizations

24 1) Sort Optimization
Quicksort is SLOW: it may be O(n^2) when there are duplicates.
When cardinality is small, the range of keys is small ⇒ radix sort.
Dynamically select between well-optimized radix and quick sorts.
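A minimal sketch of the dynamic choice, assuming dimension values are integer-coded in [0, cardinality) and using a single-column counting sort as a stand-in for the full multi-column radix sort (the threshold is illustrative):

```python
def cube_sort(records, key_col, cardinality):
    """Counting/radix-style sort when the key range is small, otherwise a
    comparison sort (Python's Timsort here, standing in for the tuned
    quicksort of the implementation)."""
    if cardinality <= 1 << 16:               # small key range -> linear-time sort
        buckets = [[] for _ in range(cardinality)]
        for r in records:
            buckets[r[key_col]].append(r)    # stable: preserves input order
        return [r for b in buckets for r in b]
    return sorted(records, key=lambda r: r[key_col])
```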

25 2) Minimizing Data Movement
Sort pointers to the records!
Never reorder the columns.
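A minimal sketch of pointer sorting: order an array of record indices by the needed attribute order and leave the wide records in place (the helper name is mine):

```python
def sort_pointers(records, order):
    """Return record indices sorted by the given attribute order; the
    records themselves are never moved and columns are never reordered."""
    return sorted(range(len(records)),
                  key=lambda i: tuple(records[i][a] for a in order))

# traverse in sorted order without moving data:
# for i in sort_pointers(records, (0, 1, 2)): process(records[i])
```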

26 3) Efficient Aggregation Operations
One pass for each pipeline.
Do lazy aggregation.
[Figure: one pipeline ABCD → ABC → AB → A → all.]
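A minimal sketch of one pipeline with lazy aggregation, assuming records are (key-tuple, measure) pairs sorted in the pipeline's order (ABCD for the pipeline in the figure): level j aggregates on the first j attributes and flushes a group only when its prefix changes.

```python
def run_pipeline(sorted_rows, depth, emit):
    """One scan computes every prefix cuboid of a pipeline at once:
    ABCD, ABC, AB, A, all for depth = 4."""
    current = [None] * (depth + 1)   # open group key per level (level 0 = all)
    totals  = [0] * (depth + 1)
    for key, measure in sorted_rows:
        for j in range(depth + 1):
            prefix = key[:j]
            if current[j] != prefix:         # group boundary at this level
                if current[j] is not None:
                    emit(j, current[j], totals[j])
                current[j], totals[j] = prefix, 0
            totals[j] += measure
    for j in range(depth + 1):               # flush the last open groups
        if current[j] is not None:
            emit(j, current[j], totals[j])

# run_pipeline(rows, 4, lambda level, key, total: print(level, key, total))
```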

27 4) Disk Optimizations
Avoid OS buffering: implemented an I/O manager.
It manages buffers to avoid thrashing and does I/O in a separate process, overlapping it with computation.
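As a rough illustration of the overlap pattern only (the actual system uses a separate process and its own buffer manager, not Python threads), a background reader can prefetch blocks through a bounded queue while the main thread aggregates:

```python
import threading, queue

def overlapped_read(paths, process_block, block_size=1 << 20, depth=4):
    """Prefetch file blocks in a background thread so that disk time
    hides behind computation; the bounded queue caps buffer memory."""
    q = queue.Queue(maxsize=depth)

    def reader():
        for path in paths:
            with open(path, "rb") as f:
                while block := f.read(block_size):
                    q.put(block)
        q.put(None)                          # end-of-stream sentinel

    threading.Thread(target=reader, daemon=True).start()
    while (block := q.get()) is not None:
        process_block(block)
```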

28 Speedup - Cluster

29 Efficiency - Cluster

30 Speedup - SunFire

31 Efficiency - SunFire

32 Increasing Data Size

33 Varying Over Sampling Factor

34 Varying Skew

35 Conclusions
A new communication-efficient parallel cubing framework for top-down, bottom-up, and array-based methods.
Easy to implement (sort of), architecture independent.

36 Thank you! Questions?

