Communication and Memory Efficient Parallel Decision Tree Construction

1 Communication and Memory Efficient Parallel Decision Tree Construction
Ruoming Jin and Gagan Agrawal, The Ohio State University. Hello, everyone, my name is Ruoming Jin. Today, I will present the paper "Communication and Memory Efficient Parallel Decision Tree Construction".

2 Outline
Motivation
SPIES approach
Parallelization of SPIES
Experimental Results
Related work
Conclusions and Future work

3 Motivation
Can we develop an efficient algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks?
Popular data mining algorithms share a common canonical loop:

foreach (data instance d) {
  // R is an in-memory data structure
  R ← reduce(d);
}

The motivation for our study of decision tree construction comes from our previous work on parallel data mining. We have developed a middleware called FREERIDE for rapid implementation of parallel data mining algorithms. It provides a unified data-parallel approach for both distributed and shared memory parallelization, and supports multi-pass, read-only processing of large and disk-resident datasets. Generally, a sequential data mining algorithm needs only moderate modification to be parallelized using FREERIDE. So far, we have demonstrated the FREERIDE approach by efficiently parallelizing a number of popular data mining algorithms, such as Apriori association mining, FP-tree construction, and k-means clustering, on both shared memory and distributed memory architectures. For decision tree construction, we have parallelized the RainForest RF-read algorithm on shared memory machines.
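The canonical loop is the property FREERIDE exploits: every data instance is folded into a small in-memory reduction object, and the dataset itself is never modified. A minimal Python sketch of such a loop, here simply counting class labels (the names are illustrative, not FREERIDE's actual API):

from collections import defaultdict

def build_class_counts(instances):
    # Canonical data mining loop: every instance is folded into an
    # in-memory reduction object R; instances are never written back.
    R = defaultdict(int)                 # reduction object: class -> count
    for d in instances:                  # foreach data instance d
        features, label = d
        R[label] += 1                    # R <- reduce(d)
    return R

# Example: three (features, label) instances
data = [((23, 'sales'), 'yes'), ((45, 'exec'), 'no'), ((31, 'sales'), 'yes')]
print(dict(build_class_counts(data)))    # {'yes': 2, 'no': 1}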

4 Motivation (Cont'd)
The common way of efficiently parallelizing a data mining algorithm:
A unified data parallelization approach for both distributed and shared memory parallelization
The dataset is read-only; writing back and expensive preprocessing are not required
Sufficient data parallelism and load balance for processing large datasets

5 Decision Tree Construction
Processing categorical attributes fits into the canonical loop
Handling of numerical attributes:
SLIQ, SPRINT: virtual class histograms; pre-sorting and attribute lists
RainForest: materialized class histograms; the class histograms might not fit in main memory

Decision tree construction for large datasets has been widely studied. SLIQ and SPRINT are two well-known approaches, but both of them require pre-sorting and vertical partitioning of the dataset. RainForest provides an alternative approach to scale decision tree construction without these two requirements. It constructs the decision tree in a top-down fashion: starting from the root, it computes sufficient statistics for the node being split, and chooses the splitting attribute and predicate based on those statistics. Clearly, if main memory could hold the sufficient statistics for all of the nodes in one level of the decision tree, we could split them in a single pass. However, the sufficient statistics for a single level cannot always be held in main memory. In that case, we either have to read the dataset several times to construct one level of the tree, or we partition the dataset so that each partition builds a sub-tree, and then process every partition independently. The first variant is RF-read, the second is RF-write. They can also be combined as RF-hybrid, which partitions the dataset only when necessary. RF-read bears a lot of similarity with the requirements of the FREERIDE approach, but the difficulty is the large size of the resulting sufficient statistics. So a natural question is whether it is possible to reduce the sufficient statistics. In the following, we provide a solution to this problem, and a new algorithm based on it.
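To make the memory problem concrete, here is a hedged sketch of fully materialized, RainForest-style class histograms for one tree node: a count for every (attribute, value, class) combination. For a numerical attribute the structure grows with the number of distinct values observed, which is why it may not fit in main memory (the function and variable names are my own, not the paper's):

from collections import defaultdict

def build_class_histograms(instances, attribute_names):
    # Fully materialized class histograms for one tree node:
    # hist[attribute][value][class] = count.
    # For numerical attributes with many distinct values this grows with
    # the number of distinct values -- the memory problem SPIES targets.
    hist = {a: defaultdict(lambda: defaultdict(int)) for a in attribute_names}
    for values, label in instances:
        for a, v in zip(attribute_names, values):
            hist[a][v][label] += 1
    return hist

data = [((23.5, 'sales'), 'yes'), ((45.1, 'exec'), 'no'), ((23.5, 'exec'), 'yes')]
h = build_class_histograms(data, ['age', 'job'])
print(dict(h['age'][23.5]))   # {'yes': 2}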

6 Our approach – SPIES
Statistical Pruning of Intervals for Enhanced Scalability
Partially materialize the class histograms: reduce their size without additional passes over the data
Interval-based approach:
Divide the range of each numerical attribute into intervals
Summarize the class histograms at the interval level
Materialize detailed class histograms only for intervals likely to contain the best split point (partial materialization)
Requires two passes over the dataset
Sampling approach:
Estimate the interval class histograms from a sample
Replaces the first pass over the full dataset
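A minimal sketch of the interval idea for one numerical attribute: the attribute's range is divided into equal-width intervals (the paper's default interval construction), and only one small class histogram per interval is kept during the pass, so memory depends on the number of intervals rather than on the number of distinct values (the names are mine, not the paper's):

from collections import defaultdict

def interval_class_histograms(values_and_labels, lo, hi, num_intervals):
    # Summarized class histograms: one (class -> count) map per
    # equal-width interval of [lo, hi), instead of one per distinct value.
    width = (hi - lo) / num_intervals
    hist = [defaultdict(int) for _ in range(num_intervals)]
    for v, label in values_and_labels:
        idx = min(int((v - lo) / width), num_intervals - 1)
        hist[idx][label] += 1
    return hist

data = [(12.0, 'yes'), (13.5, 'no'), (57.2, 'yes'), (88.9, 'no')]
for i, h in enumerate(interval_class_histograms(data, 0.0, 100.0, 10)):
    if h:
        print(f"interval {i}: {dict(h)}")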

7 Finding Best Split Point
[Figure: data from an IBM Quest synthetic dataset, function 0; the best split point is marked.]
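For reference, the best split point of a numerical attribute is found by scanning the candidate split values in sorted order and evaluating the gain of each left/right partition from the class counts. The sketch below uses the gini index as the impurity measure purely for illustration; the gain criterion used in the paper may differ:

from collections import Counter

def gini(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_point(values_and_labels):
    # Scan sorted candidate split values; at each candidate compute the
    # impurity reduction of splitting into (< v) and (>= v) partitions.
    data = sorted(values_and_labels)
    total = Counter(label for _, label in data)
    n, parent = len(data), gini(total)
    left = Counter()
    best_gain, best_v = -1.0, None
    for i, (v, label) in enumerate(data[:-1]):
        left[label] += 1
        right = total - left
        nl = i + 1
        g = parent - (nl / n) * gini(left) - ((n - nl) / n) * gini(right)
        if g > best_gain:
            best_gain, best_v = g, (v + data[i + 1][0]) / 2
    return best_v, best_gain

data = [(1.0, 'a'), (2.0, 'a'), (10.0, 'b'), (11.0, 'b')]
print(best_split_point(data))   # a split near 6.0 separates the classes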

8 Sampling Step Maximal gain from interval boundaries
Upper bound of gains for intervals 11/21/2018

9 Completion Step
[Figure: the best split point.]

10 Verification
[Figure: the gain of the best split point and an example of false pruning.]
An additional pass might be required if false pruning happens

11 SPIES sketch
Three steps:
Sampling step
Completion step
Verification
Meeting the design goals:
Memory reduction by maximally pruning intervals
Avoiding extra passes by reducing false pruning
Two key problems:
How do we get a good upper bound on the gain for an interval?
How can sampling help in reducing false pruning?

Now we look at how sampling is used to replace the first pass of the algorithm, and the corresponding modifications to the algorithm. If you are interested in the technical details of how the Hoeffding bound is used, please see our paper or Domingos's paper on VFDT.

12 Least Upper Bound of Gain for an Interval
[Figure: the interval [50, 54] with two possible best configurations of its class distribution, Possible Best Configuration-1 and Possible Best Configuration-2.]

13 The Basic Idea of Sampling
The difference between the estimates obtained from the sample and the true values can be bounded by statistical rules, such as the Hoeffding inequality.
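For reference, the Hoeffding bound as commonly stated in VFDT-style arguments: after n independent observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability at least 1 - delta. A small helper (my own naming) computing that epsilon:

import math

def hoeffding_epsilon(value_range, n, delta):
    # With probability at least 1 - delta, the observed mean of n samples
    # of a quantity with the given range is within epsilon of the true mean.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: gains lie in [0, 1], 100,000 samples, 0.1% error level
print(hoeffding_epsilon(1.0, 100_000, 0.001))   # ~0.0059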

14 Sampling approach
Statistical pruning:
Given a sample set, by applying a statistical bound such as Hoeffding's, the probability of false pruning is bounded by a user-defined error level
In practice, we set the error level very small, such as 0.1%, to avoid false pruning
The additional pass:
It might be needed, but this rarely happens, since the best split point is very unlikely to be in a falsely pruned interval
In practice, we combine it with the construction of the children nodes
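In code form, the pruning decision made during the sampling step could look like the following. This is a sketch in the style of the VFDT argument; the exact inequality used in the paper may differ, and the per-interval upper bounds are assumed to have been computed from the sampled histograms:

def unpruned_intervals(interval_upper_bounds, estimated_best_gain, epsilon):
    # Keep an interval only if its estimated upper bound of gain is within
    # the Hoeffding epsilon of the estimated best gain; otherwise, with high
    # probability, it cannot contain the true best split point.
    return [i for i, ub in interval_upper_bounds.items()
            if ub >= estimated_best_gain - epsilon]

bounds = {0: 0.12, 1: 0.40, 2: 0.05, 3: 0.38}
print(unpruned_intervals(bounds, estimated_best_gain=0.41, epsilon=0.04))   # [1, 3]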

15 SPIES algorithm
Sampling step:
Estimate class histograms for the intervals from the samples
Compute the estimated intermediate best gain and the upper bounds of gain for the intervals
Apply the Hoeffding bound to prune intervals
Completion step:
Materialize class histograms for the unpruned intervals
Compute the final best gain
Verification:
An additional pass might be needed if false pruning happens; it is executed together with the next completion step
SPIES always finds the best split point while only partially materializing the class histograms, with almost the same number of passes over the dataset.
SPIES fits into the common canonical loop!

16 System support for parallelization
FREERIDE (Framework for Rapid Implementation of Data-mining Engines) middleware:
Rapid implementation of parallel data mining algorithms
A unified data parallelization approach for both distributed and shared memory parallelization
Multi-pass, read-only processing of large and disk-resident datasets
Successful examples: Apriori, FP-tree, k-means, EM, kNN

17 FREERIDE Interface
Reduction object: an in-memory data structure
Distributed parallelization:
The data is distributed across the nodes
Global merge of reduction objects
Shared memory parallelization:
Several techniques are provided to avoid race conditions
Disk-resident datasets:
The dataset is organized in chunks
Asynchronous reading of chunks
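FREERIDE itself is a C++ middleware and its real interface is not reproduced here; the following is a generic Python illustration of the reduction-object pattern the slide describes: a local reduction applied chunk by chunk, plus a global merge across nodes or threads:

class ClassHistogramReduction:
    # Illustrative reduction object (not the real FREERIDE interface):
    # per-chunk local reduction plus a global merge.
    def __init__(self):
        self.counts = {}                      # in-memory reduction object

    def reduce(self, instance):
        # Fold one data instance into the local reduction object.
        _, label = instance
        self.counts[label] = self.counts.get(label, 0) + 1

    def merge(self, other):
        # Global merge: combine reduction objects from different nodes
        # (or threads); for histograms this is element-wise addition.
        for label, c in other.counts.items():
            self.counts[label] = self.counts.get(label, 0) + c

# Two chunks processed independently (e.g., on two nodes), then merged.
chunk_a = [((1.2, 'x'), 'yes'), ((3.4, 'y'), 'no')]
chunk_b = [((5.6, 'x'), 'yes')]
ra, rb = ClassHistogramReduction(), ClassHistogramReduction()
for d in chunk_a:
    ra.reduce(d)
for d in chunk_b:
    rb.reduce(d)
ra.merge(rb)
print(ra.counts)   # {'yes': 2, 'no': 1}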

18 Parallel SPIES
The dataset is organized in chunks and distributed across the nodes
Sampling step: the sampled chunks are reduced in parallel into the interval class histograms
Completion step: the chunks are reduced in parallel into the class histograms of the unpruned intervals
We can develop an efficient algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks!
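Because both SPIES passes are reductions into interval class histograms, parallelizing them amounts to a local reduction over the chunks on each node followed by a global merge of the small summaries; only the summaries travel over the network, which is why the communication volume stays low. A toy stand-in using Python's multiprocessing (the paper uses FREERIDE, not this):

from collections import Counter
from multiprocessing import Pool

def reduce_chunk(args):
    # Local reduction on one chunk: interval class histograms for one
    # numerical attribute (a stand-in for either SPIES pass).
    chunk, lo, hi, k = args
    hist = [Counter() for _ in range(k)]
    for v, label in chunk:
        idx = min(int((v - lo) / ((hi - lo) / k)), k - 1)
        hist[idx][label] += 1
    return hist

def merge(hists):
    # Global merge: element-wise addition of the per-chunk histograms.
    # Only these small summaries are communicated, never the chunks.
    out = [Counter() for _ in range(len(hists[0]))]
    for h in hists:
        for i, c in enumerate(h):
            out[i] += c
    return out

if __name__ == "__main__":
    chunks = [[(12.0, 'yes'), (40.0, 'no')], [(41.0, 'no'), (88.0, 'yes')]]
    with Pool(2) as pool:
        local = pool.map(reduce_chunk, [(c, 0.0, 100.0, 10) for c in chunks])
    print([dict(c) for c in merge(local) if c])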

19 Experimental Set-up and Datasets
SUN SMP cluster: 8 Ultra Enterprise 450s, each with four 250 MHz Ultra-II processors
Each node has 1 GB main memory, a 4 GB system disk, and an 18 GB data disk; the nodes are interconnected by Myrinet
Synthetic datasets from the IBM Quest group: 9 attributes, of which 3 are categorical and 6 numerical; functions 1, 6, and 7 are used
Two groups of datasets: 800 MB (20 million records) and 1600 MB (40 million records)
Stop point is 1,000,000; sample size is around 20%

20 Parallel Performance
[Figure: distributed memory speedup of RF-read (without intervals) and of SPIES with 1000 intervals, 800 MB datasets.]
On 8 nodes, the speedups for RF-read are only around 4, 2.25, 1.88, and 1.75, while SPIES has better sequential performance and almost linear speedup of about 8.

21 Memory Requirement
[Figure: memory requirement on the 800 MB dataset with the number of intervals set to 0, 100, 500, 1000, 5000, and 20000.]
Compared with 0 intervals (full materialization), using 100 to 1000 intervals gives about a 95% memory reduction.

22 Impact of Number of Intervals on Sequential and Parallel Performance
[Figure: performance with varying numbers of intervals on the 800 MB dataset, functions 1 and 7.]
SPIES has better or competitive sequential performance, and all configurations show almost linear speedup. A very large number of intervals does not give very good performance.

23 Scalability on Cluster of SMPs
[Figure: shared memory and distributed memory parallel performance, 800 MB dataset (function 7) and 1600 MB dataset.]
On average, we get the benefit of about 2 out of 3 threads; on a single node, the shared memory performance is I/O bound. The speedup is super-linear on the 1600 MB dataset.

24 Related work
BOAT: bootstrapping and only two passes; difficulty in handling cases where bootstrapping cannot provide a unanimous splitting condition
SLIQ and SPRINT
CLOUDS: interval-based approach, but approximate
VFDT: sampling approach applying the Hoeffding bound; targets streaming data and is approximate

25 Conclusions
The SPIES approach:
Is guaranteed to find the exact best split point
Requires no pre-sorting or writing back of datasets
The size of the in-memory data structure is very small
The communication volume is very low when the algorithm is parallelized
The number of passes over the dataset is almost the same as with completely materialized class histograms

26 Future work
More experiments, such as testing with real datasets
Study different interval construction methods besides equal-width intervals
Extend the work to the streaming data context

