Published by Trystan Crom. Modified about 1 year ago.

1

2
Generating the Data Cube (Shared Disk)
Andrew Rau-Chaplin
Faculty of Computer Science, Dalhousie University
Joint work with F. Dehne, T. Eavis, S. Hambrusch

3
Data Cube Generation
Proposed by Gray et al. in 1995
Can be generated from a relational DB but…
[Figure: cube with dimensions A, B, C. The cuboid ABC (or CAB).]
Lattice of cuboids:
ABC
AB  AC  BC
A  B  C
ALL

4
As a table

Model  Year  Colour  Sales
Chevy  1990  Red         5
Chevy  1990  Blue       87
Ford   1990  Green      64
Ford   1990  Blue       99
Ford   1991  Red         8
Ford   1991  Blue        7
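To make the link between this table and the cube concrete, here is a minimal Python sketch that computes every group-by (cuboid) over the three dimensions by brute force. The function name build_cube and the use of SUM as the aggregate are assumptions for illustration; the slides do not prescribe either, and this naive approach rescans the input for every cuboid, which is exactly what the algorithms in the rest of the talk avoid.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("Model", "Year", "Colour")
ROWS = [
    ("Chevy", 1990, "Red", 5),
    ("Chevy", 1990, "Blue", 87),
    ("Ford", 1990, "Green", 64),
    ("Ford", 1990, "Blue", 99),
    ("Ford", 1991, "Red", 8),
    ("Ford", 1991, "Blue", 7),
]


def build_cube(rows, dims):
    """Return {cuboid dims: {key tuple: summed sales}} for every subset of dims."""
    cube = {}
    for k in range(len(dims), -1, -1):
        for idx in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in rows:
                # Project the row onto the chosen dimensions; last column is Sales.
                totals[tuple(row[i] for i in idx)] += row[-1]
            cube[tuple(dims[i] for i in idx)] = dict(totals)
    return cube


cube = build_cube(ROWS, DIMS)
print(len(cube))              # number of cuboids for d = 3
print(cube[("Model",)])       # sales per model
print(cube[()])               # the "ALL" cuboid
```

With d = 3 dimensions the dictionary holds 2^3 = 8 cuboids, from the full Model/Year/Colour group-by down to the single grand total.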

5
The Challenge
Input data set, R: |R| is typically in the millions and usually will not fit into memory.
Number of dimensions, d: there are 2^d cuboids in the data cube.
How can this highly data- and computation-intensive problem be solved in parallel?

6
Existing Parallel Results
Goil & Choudhary: MOLAP approach
Parallelize the generation of each cuboid
Challenge: > 2^d communication rounds

7
Overview
1) Data cubes
2) Review sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems

8
Optimizations based on computing multiple cuboids
Smallest-parent: compute a cuboid from the smallest previously computed cuboid.
Cache-results: cache in memory the results of a cuboid from which other cuboids are computed, to reduce disk I/O.
Amortize-scans: amortize disk reads by computing as many cuboids as possible per scan.
Share-sorts: share sorting cost across cuboids.
Lattice:
ABCD
ABC  ABD  ACD  BCD
AB  AC  AD  BC  BD  CD
A  B  C  D
All
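The smallest-parent rule can be sketched in a few lines of Python. The function name smallest_parents and the cuboid sizes below are hypothetical values chosen for illustration; in practice the sizes would come from estimators, and ties are broken alphabetically here only to make the result deterministic.

```python
from itertools import combinations


def smallest_parents(dims, size):
    """Map each cuboid (except the full one) to its smallest one-dimension-larger parent."""
    full = frozenset(dims)
    choice = {}
    for k in range(len(dims) - 1, -1, -1):
        for combo in combinations(dims, k):
            child = frozenset(combo)
            # A parent is the child plus exactly one missing dimension.
            parents = [child | {d} for d in full - child]
            choice[child] = min(parents, key=lambda p: (size[p], sorted(p)))
    return choice


# Hypothetical estimated sizes (rows) for the cuboids of a 3-dimensional cube.
SIZE = {
    frozenset("ABC"): 1000,
    frozenset("AB"): 400, frozenset("AC"): 100, frozenset("BC"): 300,
    frozenset("A"): 10, frozenset("B"): 40, frozenset("C"): 10,
    frozenset(): 1,
}

plan = smallest_parents("ABC", SIZE)
for child, parent in plan.items():
    print("".join(sorted(child)) or "All", "<-", "".join(sorted(parent)))
```

Here A and C are both computed from AC (100 rows) rather than from AB or BC, which is the point of the optimization: the cost of producing a cuboid is driven by the size of the parent it is read from.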

9
Many Algorithms
Pipesort [AADGNRS’96]
PipeHash [SAG’96]
Overlap [DANR’96]
ArrayCube [ZDN’97]
Bottom-up-cube [BR’99]
Partition Cube [RS’97]
Memory Cube [RS’97]

10
Approaches
Top down:
  Pipesort [AADGNRS’96]
  PipeHash [SAG’96]
  Overlap [DANR’96]
Bottom up:
  Bottom-up-cube [BR’99]
  Partition Cube [RS’97]
  Memory Cube [RS’97]
Array based:
  ArrayCube [ZDN’97]

11
Our results
A framework for parallelization of existing sequential data cube algorithms:
  Top-down
  Bottom-up
  Array based
Architecture independent
Communication efficient:
  Avoids irregular communication patterns
  Few large messages
  Overlaps computation and communication
Today’s focus: the top-down approach

12
Top Down Algorithms
Find a “least cost” spanning tree
Use estimators of cuboid size
Exploit:
  Data shrinking
  Pipelining
  Cuts vs. sorts
Lattice:
ABCD
ABC  ABD  ACD  BCD
AB  AC  AD  BC  BD  CD
A  B  C  D
All

13
Cut vs. Sort
Ordering: ABCD
Cutting: ABCD -> ABC, linear time
Sorting: ABCD -> ABD, sort time
The size of ABC may be much smaller than that of ABCD
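The difference between a cut and a sort can be seen in a small Python sketch. The helper aggregate_sorted and the sample rows are assumptions made for illustration: ABC is a prefix of the ordering ABCD, so it falls out of one linear scan, while ABD is not a prefix and needs the rows re-sorted first.

```python
from itertools import groupby


def aggregate_sorted(rows, prefix_len):
    """One linear scan over rows already sorted on their first prefix_len columns."""
    out = []
    for key, group in groupby(rows, key=lambda r: r[:prefix_len]):
        out.append(key + (sum(r[-1] for r in group),))
    return out


# Columns: A, B, C, D, measure. Sorting the tuples gives the ordering ABCD.
abcd = sorted([
    (1, 1, 1, 1, 10),
    (1, 1, 1, 2, 5),
    (1, 1, 2, 1, 7),
    (2, 1, 1, 1, 3),
])

# Cut: ABCD -> ABC needs no sort, only a scan.
abc = aggregate_sorted(abcd, 3)

# Sort: ABCD -> ABD must project to (A, B, D) and re-sort before scanning.
abd_input = sorted((a, b, d, m) for a, b, c, d, m in abcd)
abd = aggregate_sorted(abd_input, 3)

print(abc)
print(abd)
```

The scan relies on equal keys being adjacent, which is why it is only valid for prefixes of the current sort order.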

14
Pipesort [AADGNRS’96]
Minimize sorting while seeking to compute each cuboid from its smallest parent
Pipeline sorts with common prefixes

15
Level-by-level Optimization
Minimum cost matching in a bipartite graph
Scan edges are solid, sort edges are dashed
Establish the dimension ordering working up the lattice

16
Overview
1) Data cubes
2) Review sequential cubing algorithms
3) Our top-down parallel algorithm
4) Conclusions and open problems

17
Top-down parallel: The Idea
Cut the process tree into p “equal weight” subtrees
Each processor generates the cuboids of one subtree independently
Load balance/stripe the output
Process tree (with dimension orderings):
CBAD
CBA  BAD  ACD  BCD
BA  AC  AD  CB  DB  CD
A  B  C  D
All

18
The Basic Algorithm
(1) Construct a lattice housing all 2^d views.
(2) Estimate the size of each of the views in the lattice.
(3) To determine the cost of using a given view to directly compute its children, use its estimated size to calculate (a) the cost of scanning the view and (b) the cost of sorting it.
(4) Using the bipartite matching technique presented in the original IBM paper, reduce the lattice to a spanning tree that identifies the appropriate set of prefix-ordered sort paths.
(5) Add the I/O estimates to the spanning tree.
(6) Partition the tree into p sub-trees.
(7) Distribute the sub-tree lists to each of the p compute nodes.
(8) On each node, use the sequential Pipesort algorithm to build the set of local views.

19
Tree Partitioning
What does “equal weight” mean?
Want to minimize the maximum partition weight!
O(Rk(k + log d) + n) time: Becker, Perl and Schach ‘82
O(n) time: Frederickson 1990

20
Tree Partitioning
Min-max tree k-partitioning: given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree to obtain k + 1 connected components T_1, T_2, …, T_{k+1} such that the largest total weight of a resulting subtree is minimized.
O(Rk(k + log d) + n) time: Becker, Perl and Schach ‘82
O(n) time: Frederickson 1990
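The problem statement above can be illustrated with a short Python sketch. This is not the Becker, Perl and Schach algorithm nor Frederickson's linear-time one; it swaps in a simpler and slower technique, a binary search over the weight bound combined with a bottom-up greedy that cuts off the heaviest child subtrees first. It assumes positive integer weights, and the tree, the weights, and the function names are hypothetical.

```python
def min_cuts(children, weight, root, bound):
    """Fewest edge deletions so that every component weighs at most bound."""
    # Pre-order traversal; reversing it visits children before their parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    cuts = 0
    residual = {}  # weight still attached to each vertex after cutting below it
    for v in reversed(order):
        parts = sorted(residual[c] for c in children.get(v, []))
        total = weight[v] + sum(parts)
        while total > bound:
            total -= parts.pop()  # cut off the heaviest remaining child subtree
            cuts += 1
        residual[v] = total
    return cuts


def min_max_partition(children, weight, root, k):
    """Smallest achievable maximum component weight using at most k edge cuts."""
    lo, hi = max(weight.values()), sum(weight.values())
    while lo < hi:
        mid = (lo + hi) // 2
        if min_cuts(children, weight, root, mid) <= k:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical tree: r has children a and b; a has children c and d.
CHILDREN = {"r": ["a", "b"], "a": ["c", "d"]}
WEIGHT = {"r": 4, "a": 3, "b": 5, "c": 2, "d": 6}

print(min_max_partition(CHILDREN, WEIGHT, "r", 1))  # two components
print(min_max_partition(CHILDREN, WEIGHT, "r", 2))  # three components
```

For the parallel algorithm, k would be p - 1, and the vertex weights would be the estimated costs of building each view.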

21
Dynamic min-max
[Figure: raw data and the cuboids ABC, AB, BC, A]

22
Over-sampling
Instead of cutting directly into p subtrees: cut into s * p subtrees, then combine them into p subsets
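One way to picture the second half of over-sampling is as a bin-packing step. The slide does not say how the s * p subtrees are grouped into p subsets, so the Python sketch below assumes a greedy longest-processing-time heuristic: take subtrees from heaviest to lightest and give each to the currently lightest processor. The function name pack_subtrees and the weights are hypothetical.

```python
import heapq


def pack_subtrees(weights, p):
    """Greedily assign subtree weights to p subsets, heaviest subtree first."""
    heap = [(0, b) for b in range(p)]  # (current load, subset index)
    bins = [[] for _ in range(p)]
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        load, b = heapq.heappop(heap)  # lightest subset so far
        bins[b].append(idx)
        heapq.heappush(heap, (load + weights[idx], b))
    loads = [sum(weights[i] for i in members) for members in bins]
    return bins, loads


# p = 2 processors, over-sampling factor s = 3, so 6 subtrees (weights assumed).
bins, loads = pack_subtrees([9, 7, 6, 5, 4, 3], p=2)
print(bins)
print(loads)
```

Smaller pieces give the packing step more freedom to even out the loads, which is the intuition behind choosing s greater than 1.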

23
Implementation Issues
1) Sort optimization
2) Minimizing data movement
3) Efficient aggregation operations
4) Disk optimizations

24
1) Sort Optimization
qSort is SLOW
May be O(n^2) when there are duplicates
When cardinality is small, the range of keys is small, so radix sort applies
Dynamically select between well-optimized radix and quick sorts
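The idea of choosing a sort at run time can be sketched as follows in Python. This is an illustration only: a single-pass counting sort stands in for the radix sort, Python's built-in sorted stands in for the quick sort, and the selection threshold ratio is an assumed tuning parameter, not a value from the talk.

```python
def counting_sort(keys, lo, hi):
    """Sort integer keys known to lie in [lo, hi] in O(n + range) time."""
    counts = [0] * (hi - lo + 1)
    for k in keys:
        counts[k - lo] += 1
    out = []
    for offset, c in enumerate(counts):
        out.extend([lo + offset] * c)
    return out


def adaptive_sort(keys, ratio=4):
    """Use counting sort when the key range is small relative to the input size."""
    if not keys:
        return []
    lo, hi = min(keys), max(keys)
    if hi - lo + 1 <= ratio * len(keys):
        return counting_sort(keys, lo, hi)  # low cardinality, many duplicates
    return sorted(keys)                     # wide key range, comparison sort


print(adaptive_sort([3, 1, 2, 3, 1]))
print(adaptive_sort([10**9, 5, 77]))
```

Low-cardinality dimensions, which are common in data cube inputs, take the first branch, where duplicates cost nothing extra.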

25
2) Minimizing Data Movement
Sort pointers to the records!
Never reorder the columns
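In a language without raw pointers the same idea can be expressed by sorting indices. The Python sketch below, with the hypothetical helper sort_order, leaves the records and their column layout untouched and returns only a permutation for the requested dimension ordering.

```python
def sort_order(records, dim_order):
    """Return record indices sorted by the given column order; records are not moved."""
    return sorted(
        range(len(records)),
        key=lambda i: tuple(records[i][d] for d in dim_order),
    )


RECORDS = [
    ("Ford", 1991, "Red", 8),
    ("Chevy", 1990, "Blue", 87),
    ("Ford", 1990, "Green", 64),
]

by_colour = sort_order(RECORDS, (2,))         # ordering: Colour
by_year_model = sort_order(RECORDS, (1, 0))   # ordering: Year, Model

for i in by_year_model:
    print(RECORDS[i])
```

Different sort orders of the same view then share one copy of the data and differ only in their index arrays.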

26
3) Efficient Aggregation Operations
One pass for each pipeline
Do lazy aggregation
Example pipeline: ABCD -> ABC -> AB -> A -> all
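A one-pass pipeline can be sketched in Python as below. The reading of "lazy aggregation" used here is an assumption: only the finest level adds raw measures, and a coarser level receives a subtotal only when a group beneath it closes. The function name pipeline_aggregate and the sample rows are hypothetical.

```python
def pipeline_aggregate(sorted_rows, d):
    """Single scan over rows sorted on d dimensions; returns all prefix cuboids.

    out[j] holds the cuboid on the first j dimensions (out[0] is 'all').
    """
    out = [[] for _ in range(d + 1)]
    keys = [None] * (d + 1)  # currently open group key at each level
    sums = [0] * (d + 1)     # running subtotal at each level

    def close(down_to):
        # Emit open groups from the finest level up to level down_to,
        # handing each subtotal to the next coarser level.
        for lvl in range(d, down_to - 1, -1):
            out[lvl].append(keys[lvl] + (sums[lvl],))
            if lvl > 0:
                sums[lvl - 1] += sums[lvl]
            sums[lvl] = 0

    for row in sorted_rows:
        dims, measure = tuple(row[:d]), row[d]
        if keys[d] is not None:
            changed = next(
                (j for j in range(1, d + 1) if keys[j] != dims[:j]), None
            )
            if changed is not None:
                close(changed)
        for j in range(d + 1):
            keys[j] = dims[:j]
        sums[d] += measure
    if keys[d] is not None:
        close(0)
    return out


# Rows sorted on (A, B); the last column is the measure.
levels = pipeline_aggregate([(1, 1, 10), (1, 1, 5), (1, 2, 7), (2, 1, 3)], d=2)
print(levels[2])  # AB
print(levels[1])  # A
print(levels[0])  # all
```

Each input row is touched once, and each coarser cuboid is built from the already-reduced output of the level below it rather than from the raw data.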

27
4) Disk Optimizations
Avoid OS buffering
Implemented an I/O manager:
  Manages buffers to avoid thrashing
  Does I/O in a separate process to overlap with computation

28
Speedup - Cluster

29
Efficiency - Cluster

30
Speedup - SunFire

31
Efficiency - SunFire

32
Increasing Data Size

33
Varying Over Sampling Factor

34
Varying Skew

35
Conclusions
New communication-efficient parallel cubing framework for:
  Top-down
  Bottom-up
  Array based
Easy to implement (sort of), architecture independent

36
Thank you! Questions?
