Communication and Memory Efficient Parallel Decision Tree Construction

Presentation transcript:

Communication and Memory Efficient Parallel Decision Tree Construction
Ruoming Jin and Gagan Agrawal, The Ohio State University

Hello everyone, my name is Ruoming Jin. Today I will present the paper "Communication and Memory Efficient Parallel Decision Tree Construction".

Outline
- Motivation
- SPIES approach
- Parallelization of SPIES
- Experimental Results
- Related work
- Conclusions and Future work

Motivation

Can we develop an efficient algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks?

Popular data mining algorithms share a common canonical loop:

foreach (data instance d) {
    // R is an in-memory data structure
    R ← reduce(d);
}

The motivation for our study of decision tree construction comes from our previous work on parallel data mining. We have developed a middleware called FREERIDE for rapid implementation of parallel data mining algorithms. It provides a unified data-parallel approach to both distributed-memory and shared-memory parallelization, and supports multi-pass, read-only processing of large, disk-resident datasets. Generally, a sequential data mining algorithm needs only moderate modification to be parallelized with FREERIDE. So far, we have demonstrated the FREERIDE approach by efficiently parallelizing a number of popular data mining algorithms, such as Apriori association mining, FP-tree construction, and k-means clustering, on both shared-memory and distributed-memory architectures. For decision tree construction, we have parallelized the RainForest RF-read algorithm on shared-memory machines.
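To make the canonical loop concrete, here is a minimal, self-contained C++ sketch; the Instance and ClassCounts types are illustrative assumptions, not FREERIDE's actual interface:

```cpp
#include <vector>

// Hypothetical data instance: one class label plus attribute values.
struct Instance {
    int label;
    std::vector<double> attributes;
};

// R: the in-memory reduction object, here a simple per-class counter.
struct ClassCounts {
    std::vector<long> counts;
    explicit ClassCounts(int numClasses) : counts(numClasses, 0) {}
    // reduce(d): fold one instance into the reduction object.
    void reduce(const Instance& d) { counts[d.label]++; }
};

// The canonical loop: every instance is folded into R independently.
ClassCounts buildCounts(const std::vector<Instance>& data, int numClasses) {
    ClassCounts R(numClasses);
    for (const Instance& d : data) {
        R.reduce(d);
    }
    return R;
}
```

Because each instance is folded into R independently, the loop parallelizes by giving each node or thread its own copy of R and merging the copies at the end.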

Motivation (Cont'd)

The common requirements for efficiently parallelizing a data mining algorithm:
- A unified data parallelization approach for both distributed and shared memory parallelization
- The dataset is read-only; writing back and expensive preprocessing are not required
- Sufficient data parallelism and load balance for processing large datasets

Decision Tree Construction

- Processing categorical attributes fits into the canonical loop
- Handling of numerical attributes:
  - SLIQ, SPRINT: virtual class histogram; pre-sorting and attribute lists
  - RainForest: materialized class histogram; its size might not fit in main memory

Decision tree construction for large datasets has been widely studied. SLIQ and SPRINT are two well-known approaches, but both require pre-sorting and vertical partitioning of the dataset. RainForest provides an alternative approach that scales decision tree construction without these two requirements. It constructs the decision tree top-down: starting from the root, it computes sufficient statistics for the node being split, and chooses the splitting attribute and predicate based on those statistics. Clearly, if main memory could hold the sufficient statistics for all the nodes in one level of the decision tree, we could split them all in a single pass. However, the sufficient statistics for a single level cannot always be held in main memory. In that case, we either read the dataset several times to construct one level of the tree, or we partition the dataset so that each partition builds a sub-tree, and then process each partition independently. The first variant is RF-read, the second is RF-write; they can also be combined into RF-hybrid, which partitions the dataset only when necessary. RF-read bears a lot of similarity to what the FREERIDE approach requires, but the difficulty is the large size of the resulting sufficient statistics. So a natural question is whether the sufficient statistics can be reduced. In the following, we provide a solution to this problem, and a new algorithm based on it.
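A hedged sketch of why the fully materialized class histogram may not fit in memory: for a numerical attribute it keeps class counts per distinct attribute value, so its size grows with the number of distinct values. The types and names below are illustrative, not RainForest's actual data structures:

```cpp
#include <map>
#include <vector>

// Per (attribute value, class label) counts for one node and one
// numerical attribute. With many distinct values, this is the
// structure whose memory footprint blows up.
using ClassHistogram = std::map<double, std::vector<long>>;

// Fold one record into the histogram.
void addRecord(ClassHistogram& h, double value, int label, int numClasses) {
    std::vector<long>& counts = h[value];
    if (counts.empty()) counts.assign(numClasses, 0);
    counts[label]++;
}
```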

Our approach – SPIES

Statistical Pruning of Intervals for Enhanced Scalability: partially materialize the class histogram, reducing its size without additional passes over the data.

- Interval-based approach (see the sketch below):
  - Divide the range of each numerical attribute into intervals
  - Summarize the class histogram per interval
  - Materialize the full class histogram only for intervals likely to contain the best split point (partial materialization)
  - Requires two passes over the dataset
- Sampling approach:
  - Estimate the class histograms of the intervals from a sample
  - Eliminates the first full pass over the dataset
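A hedged sketch of the interval summary SPIES keeps instead of the full per-value histogram: equal-width bins over the attribute's range, each bin holding per-class counts. Names and the equal-width choice are illustrative assumptions:

```cpp
#include <vector>

// Equal-width interval summary: counts[interval][class].
struct IntervalHistogram {
    double lo, hi;
    int numIntervals;
    std::vector<std::vector<long>> counts;

    IntervalHistogram(double lo_, double hi_, int k, int numClasses)
        : lo(lo_), hi(hi_), numIntervals(k),
          counts(k, std::vector<long>(numClasses, 0)) {}

    // Fold one record into its interval's per-class counts.
    void add(double value, int label) {
        int bin = static_cast<int>((value - lo) / (hi - lo) * numIntervals);
        if (bin >= numIntervals) bin = numIntervals - 1;  // clamp right edge
        if (bin < 0) bin = 0;                             // clamp left edge
        counts[bin][label]++;
    }
};
```

Memory now scales with the number of intervals rather than the number of distinct attribute values.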

Finding Best Split Point

[Figure: gain at each candidate split point, with the best split point marked. The data comes from an IBM Quest synthetic dataset for function 0.]

Sampling Step

[Figure: the maximal gain found at interval boundaries, and the upper bound of the gain within each interval.]

Completion Step

[Figure: exact gains materialized for the unpruned intervals, with the best split point marked.]

Verification

[Figure: the gain of the best split point compared against the bounds of the pruned intervals, illustrating false pruning.]
An additional pass might be required if false pruning happens.

SPIES sketch

- Three steps:
  - Sampling step
  - Completion step
  - Verification
- Meets the design goals:
  - Memory reduction, by maximally pruning the intervals
  - No extra passes, by reducing false pruning
- Two key problems:
  - How do we get a good upper bound on the gain within an interval?
  - How can sampling help in reducing false pruning?

Least Upper Bound of Gain for an Interval

[Figure: two possible best configurations of class counts within the interval [50, 54], used to derive the least upper bound on the gain attainable inside the interval.]

The Basic Idea of Sampling

The difference between the class histograms estimated from a sample and the true class histograms can be bounded by statistical rules, such as the Hoeffding inequality.
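For reference, a standard form of the Hoeffding bound (the form used by VFDT; not quoted from this paper): with n independent observations of a quantity whose range has width R, the sample mean satisfies

\[
\Pr\bigl(|\bar{X} - \mathbb{E}[X]| > \epsilon\bigr) \le \delta,
\qquad
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} .
\]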

Sampling approach

- Statistical pruning:
  - Given a sample set, applying a statistical bound such as Hoeffding's bounds the probability of false pruning by a user-defined error level
  - In practice, we set the error level very small, such as 0.1%, to avoid false pruning
- The additional pass:
  - It might be needed, but false pruning rarely happens
  - In practice, we combine it with the construction of the children nodes
  - The best split point is very unlikely to lie in a falsely pruned interval
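As a small self-contained illustration (the function name and the normalization of the gain to [0, 1] are assumptions, not from the paper), the Hoeffding margin for a given sample size and error level can be computed as:

```cpp
#include <cmath>
#include <cstdio>

// Hoeffding margin: with n samples of a quantity whose range has
// width R, the sample mean is within eps of the true mean with
// probability at least 1 - delta.
double hoeffdingEpsilon(double R, double delta, long n) {
    return std::sqrt(R * R * std::log(1.0 / delta) / (2.0 * n));
}

int main() {
    // Example: gain normalized to [0, 1], error level 0.1%,
    // 100,000 samples gives eps ≈ 0.0059.
    std::printf("eps = %f\n", hoeffdingEpsilon(1.0, 0.001, 100000));
    return 0;
}
```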

SPIES algorithm

- Sampling step:
  - Estimate the class histograms of the intervals from the sample
  - Compute the estimated intermediate best gain and the upper bound of gain for each interval
  - Apply the Hoeffding bound to prune intervals
- Completion step:
  - Materialize the class histograms of the unpruned intervals
  - Compute the final best gain
- Verification:
  - An additional pass might be needed if false pruning happens; it is executed together with the next completion step

Now we look at how sampling is used to replace the first pass of the algorithm, and the corresponding modifications to the algorithm. If you are interested in the technical details of how the Hoeffding bound is used, please see our paper or Domingos's paper on VFDT.

SPIES always finds the exact best split point while only partially materializing the class histogram, with almost the same number of passes over the dataset. SPIES fits into the common canonical loop!
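A hedged sketch of the pruning decision in the sampling step; the exact comparison rule in the paper may differ, and the names here are illustrative:

```cpp
#include <vector>

// Keep only the intervals whose estimated upper-bound gain could still
// exceed the best estimated gain once the Hoeffding margin eps is
// accounted for; all other intervals are statistically pruned.
std::vector<int> unprunedIntervals(const std::vector<double>& upperBoundGain,
                                   double bestEstimatedGain, double eps) {
    std::vector<int> keep;
    for (int i = 0; i < static_cast<int>(upperBoundGain.size()); ++i) {
        // Each estimate may be off by at most eps, so allow slack on both sides.
        if (upperBoundGain[i] + eps >= bestEstimatedGain - eps)
            keep.push_back(i);
    }
    return keep;
}
```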

System support for parallelization

FREERIDE (Framework for Rapid Implementation of Data-mining Engines) middleware:
- Rapid implementation of parallel data mining algorithms
- A unified data parallelization approach for both distributed and shared memory parallelization
- Multi-pass, read-only processing of large and disk-resident datasets
- Successful examples: Apriori, FP-tree, k-means, EM, kNN

FREERIDE interface

- Reduction object: an in-memory data structure (see the sketch below)
- Distributed-memory parallelization: the data is distributed across the nodes, with a global merge of the local reduction objects
- Shared-memory parallelization: several techniques are provided to avoid race conditions
- Disk-resident datasets: the dataset is organized in chunks, with asynchronous reading of chunks
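A hypothetical sketch of a FREERIDE-style reduction-object interface; the real middleware API may differ. The programmer supplies a local reduction over data chunks and a global merge across nodes or threads:

```cpp
#include <cstddef>

// Abstract reduction object the middleware drives over the dataset.
class ReductionObject {
public:
    virtual ~ReductionObject() = default;
    // Fold one chunk of the disk-resident dataset into local state.
    virtual void reduce(const void* chunk, std::size_t nbytes) = 0;
    // Combine another node's (or thread's) reduction object into this one.
    virtual void merge(const ReductionObject& other) = 0;
};
```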

Parallel SPIES

- The dataset is organized in chunks and distributed across the nodes
- Sampling step: the sampled chunks are reduced in parallel into the class histograms of the intervals
- Completion step: the chunks are reduced in parallel into the class histograms of the unpruned intervals

We can develop an efficient algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks!

Experimental Set-up and Datasets

- SUN SMP cluster:
  - 8 Ultra Enterprise 450s, each with four 250 MHz Ultra-II processors
  - Each node has 1 GB of main memory, a 4 GB system disk, and an 18 GB data disk
  - Interconnected by Myrinet
- Synthetic datasets from the IBM Quest group:
  - 9 attributes: 3 categorical, 6 numerical
  - Functions 1, 6, and 7 are used
  - Two groups of datasets (800 MB / 20 million records and 1600 MB / 40 million records)
  - Stopping point is 1,000,000 records
  - Sample size is around 20%

Parallel Performance

[Figures: distributed-memory speedup of RF-read (without intervals) and of SPIES with 1000 intervals, on the 800 MB datasets. RF-read achieves only limited speedups on 8 nodes (roughly 1.75 to 4, depending on the function); SPIES has better sequential performance and almost linear speedup, about 8 on 8 nodes.]

Memory Requirement

[Figure: memory requirement on the 800 MB dataset as the number of intervals varies over 0, 100, 500, 1000, 5000, and 20000, where 0 intervals corresponds to fully materializing the class histogram. With 100 to 1000 intervals, memory requirements drop by about 95%.]

Impact of Number of Intervals on Sequential and Parallel Performance

With 100 to 5000 intervals, SPIES delivers better or competitive sequential performance, and all of these configurations show almost linear speedup. A very large number of intervals does not give good performance.

[Figures: 800 MB dataset, function 1 and function 7.]

Scalability on Cluster of SMPs

On a single node, the shared-memory version is I/O bound: on average, 3 threads deliver the performance of about 2. The speedup on the 1600 MB dataset is super-linear.

[Figures: shared-memory and distributed-memory parallel performance on the 800 MB dataset with function 7, and on the 1600 MB dataset.]

Related work

- BOAT: bootstrapping and only two passes; has difficulty when bootstrapping cannot provide a unanimous splitting condition
- SLIQ and SPRINT: require pre-sorting and vertical partitioning of the dataset
- CLOUDS: an interval-based approach, but approximate
- VFDT: a sampling approach applying the Hoeffding bound; targets streaming data and is approximate

Conclusions

The SPIES approach:
- Is guaranteed to find the exact best split point
- Requires no pre-sorting or writing back of the dataset
- Keeps the in-memory data structure very small
- Has very low communication volume when parallelized
- Makes almost the same number of passes over the dataset as completely materializing the class histogram

Future work

- More experiments, such as testing with real datasets
- Studying interval construction methods other than equal-width intervals
- Extending the work to the streaming-data context