
1
Decision Tree Induction in Hierarchic Distributed Systems With: Amir Bar-Or, Ran Wolff, Daniel Keren

2
Motivation
Large distributed computations are costly, especially data-intensive and synchronization-intensive ones, e.g., data mining
Decision tree induction:
– Collect global statistics (thousands) for every attribute (thousands) in every tree node (hundreds)
– Global statistics require global synchronization

3
Motivation: Hierarchy Helps
Simplifies synchronization
– Synchronize at each level
Simplifies communication
An “industrial strength” architecture
– The way real systems (including grids) are often organized

4
Motivation: Mining High-Dimensional Data
Thousands of sources
Central control
Examples:
– Genomically enriched healthcare data
– Text repositories

5
Objectives of the Algorithm
Exact results
– Common approaches would either collect a sample of the data, or build independent models at each site and then apply centralized meta-learning on top of them
Communication efficiency
– A naive approach that collects exact statistics for each tree node would result in gigabytes of communication

6
Decision Trees in a Teaspoon
A tree where, at each level, the learning samples are split according to one attribute’s value
A hill-climbing heuristic is used to induce the tree
– The attribute that maximizes a gain function is selected
– Gain functions: Gini index or information gain
No real need to compute the gain exactly; it suffices to identify the attribute that maximizes it
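
For concreteness, here is a minimal sketch (ours, not the authors’) of the gain computation that drives the hill-climbing step; the function names and the two-class example are illustrative assumptions.

```python
import math

def entropy(counts):
    """Shannon entropy of a vector of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Information gain of splitting a node with class histogram
    parent_counts into children with class histograms child_counts."""
    total = sum(parent_counts)
    children = sum((sum(c) / total) * entropy(c) for c in child_counts)
    return entropy(parent_counts) - children

# 14 samples (9 positive, 5 negative) split by a binary attribute.
print(information_gain([9, 5], [[6, 2], [3, 3]]))  # ~0.048
```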

7
Main Idea
Infer deterministic bounds on the gain of each attribute
Improve the bounds until the best attribute is provably better than the rest
Communication efficiency is achieved because the bounds require only limited data
– Partial statistics for promising attributes
– Rough bounds for irrelevant attributes
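
A hedged sketch of the resulting separation test: each attribute carries a [lower, upper] interval on its gain, and an attribute is declared the winner only when its lower bound exceeds every rival’s upper bound. All names here are our own illustration, not the paper’s API.

```python
def provably_best(bounds):
    """bounds maps attribute -> (lower, upper) interval on its gain.
    Returns the attribute that is provably best, or None if intervals
    still overlap (assumes at least two attributes)."""
    best, (best_lo, _) = max(bounds.items(), key=lambda kv: kv[1][0])
    rivals_hi = max(hi for attr, (_, hi) in bounds.items() if attr != best)
    return best if best_lo > rivals_hi else None

# 'a' wins: its lower bound 0.30 clears every rival's upper bound.
print(provably_best({'a': (0.30, 0.40), 'b': (0.05, 0.25), 'c': (0.00, 0.10)}))
# No winner yet: 'a' and 'b' overlap, so the bounds must be tightened.
print(provably_best({'a': (0.10, 0.40), 'b': (0.05, 0.25)}))
```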

8
Hierarchical Algorithm
At each level of the hierarchy:
– Wait for reports from all descendants; a report contains upper and lower bounds on the gain of each attribute and the number of samples from each class
– Use the descendants’ reports to compute cumulative bounds
– If there is no clear separation, request that the descendants tighten their bounds by sending more data
– In the worst case, all of the data is gathered
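
A structural sketch of one such level, under assumed interfaces (child.report, combine, separated, and tighten are hypothetical helpers we introduce for illustration):

```python
def run_level(children, combine, separated, tighten):
    """One inner node of the hierarchy (illustrative sketch).
    children:  handles to descendant nodes
    combine:   merges per-attribute gain intervals and class counts
    separated: True once one attribute is provably best
    tighten:   asks a descendant for more detailed statistics
    """
    reports = [child.report() for child in children]      # wait for all descendants
    cumulative = combine(reports)
    while not separated(cumulative):                      # intervals still overlap
        reports = [tighten(child) for child in children]  # request finer bounds
        cumulative = combine(reports)
    return cumulative  # worst case: equivalent to gathering all the raw data
```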

9
Deterministic Bounds
[Formulas for the upper and lower bounds on the gain]
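
The slide’s formulas did not survive extraction; schematically, and in our own notation rather than the authors’, the pair of bounds brackets the true gain:

```latex
\underline{G}(A) \;=\; \min_{n \in \mathcal{C}(A)} G(n)
\;\le\; G(A) \;\le\;
\max_{n \in \mathcal{C}(A)} G(n) \;=\; \overline{G}(A)
```

where G is the chosen gain function (Gini or information gain) and C(A) is the set of full count tables for attribute A consistent with the statistics reported so far; reporting more data shrinks C(A) and tightens the interval.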

10
Performance Figures
99% reduction in communication bandwidth
Out of 1,000 SNPs, only ~12 were reported to higher levels of the hierarchy
The percentage declines at higher levels of the hierarchy


12
More Performance Figures
Larger datasets require less bandwidth
Outlier noise is not a big issue
– White noise is handled even better


14
Future Work
Text mining
Incremental algorithm
Accommodation of failures
Testing on a real grid system
Is this a general framework?
