BOAI: Fast Alternating Decision Tree Induction based on Bottom-up Evaluation
Bishan Yang, Tengjiao Wang, Dongqing Yang, and Lei Chang
School of EECS, Peking University

Outline
- Motivation
- Related work
- Preliminaries
- Our approach
- Experimental results
- Summary

Motivation
Alternating Decision Tree (ADTree) is an effective decision tree algorithm based on AdaBoost:
- Highly accurate classifier
- Small tree size that is easy to interpret
- Provides measures of prediction confidence
Wide range of applications:
- Customer churn prediction
- Fraud detection
- Disease trait modeling
- ...

Limitation of existing work
- Very expensive to apply to large training sets
- Takes hours to train on large numbers of examples and attributes
- Training time grows exponentially with the size of the data

Related work
Several techniques have been developed to tackle the efficiency problem for traditional decision trees (ID3, C4.5):
- SLIQ (EDBT — Mehta et al.): introduces data structures called the attribute list and the class list
- SPRINT (VLDB — J. Shafer et al.): constructs only attribute lists
- PUBLIC (VLDB — Rastogi & Shim): integrates MDL pruning into the tree-building process
- RainForest (VLDB — Gehrke, Ramakrishnan & Ganti): introduces AVC-groups, which are sufficient for split evaluation
- BOAT (PODS — Gehrke, Ganti, Ramakrishnan & Loh): builds the tree on a subset of the data using bootstrapping
These techniques cannot be directly applied to ADTree, because they evaluate splits based only on information at the current node.

Related work
Several optimization methods exist for ADTree:
- PAKDD – Pfahringer et al.: Zpure cut-off, merging, and three heuristic mechanisms
  Cons: little improvement until a large number of iterations is reached; cannot guarantee model quality
- ILP – Vanassche et al.: caching optimization
  Cons: the additional memory consumption grows quickly with the number of boosting rounds

Preliminaries
Alternating Decision Tree (ADTree) (ICML)
Classification: the sign of the sum of the prediction values along the paths defined by the instance.
Example: for instance (age, income) = (35, 1300), sign(f(x)) = sign(0.5 - 0.5 + 0.4 + 0.3) = sign(0.7) = +1
[Figure: example ADTree with prediction nodes and decision nodes such as "Age < 40", "Income > 1000", "Income > 1200", "Age > 30"]
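
To make the classification rule concrete, here is a minimal Python sketch of ADTree scoring. The two-node tree below is a hypothetical illustration, not an exact reproduction of the slide's figure.

```python
# Minimal sketch of ADTree classification: the root contributes a constant,
# each reached decision node contributes one of two prediction values, and
# the predicted class is the sign of the total score.

def classify(instance, root_prediction, decision_nodes):
    score = root_prediction
    for precondition, condition, pred_true, pred_false in decision_nodes:
        if precondition(instance):  # only nodes on paths reached by the instance contribute
            score += pred_true if condition(instance) else pred_false
    return (1 if score >= 0 else -1), score

# Hypothetical tree: a root decision node on age, and a child node on income
# that is only reached when age < 40. The numbers are illustrative.
tree = [
    (lambda x: True,          lambda x: x["age"] < 40,      -0.5, +0.2),
    (lambda x: x["age"] < 40, lambda x: x["income"] > 1200, +0.4, -0.2),
]

label, score = classify({"age": 35, "income": 1300}, 0.5, tree)
print(label, score)  # +1, 0.4 for this toy tree
```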

Preliminaries
Algorithm: ADTree Induction
Input: weighted training instances; Pt is the set of preconditions, C is the set of base conditions (c = (A ≤ (vi + vi+1)/2) for numeric attributes or c = (A = vi) for categorical attributes).
For t = 1 to T do
  /* Evaluation phase — the complexity mainly lies in this part */
  for all c1 in Pt and all c2 in C, calculate
    Z(c1, c2) = 2( sqrt(W+(c1 ∧ c2) W-(c1 ∧ c2)) + sqrt(W+(c1 ∧ ¬c2) W-(c1 ∧ ¬c2)) ) + W(¬c1)
  Select (c1, c2) which minimize Z(c1, c2)
  /* Partition phase */
  Set Pt+1 = Pt ∪ {c1 ∧ c2, c1 ∧ ¬c2}
  Update weights: wi,t+1 = wi,t · e^(−rt(xi) yi), where rt(xi) is the prediction value assigned to instance xi
Here W+(c) (resp. W-(c)) is the total weight of the positive (resp. negative) instances satisfying condition c. Weights are increased for misclassified instances and decreased for correctly classified instances.
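
As a concrete illustration of the evaluation-phase bottleneck, here is a hedged Python sketch of computing Z(c1, c2) for a single candidate split the naive top-down way; the function and its signature are illustrative, not the paper's code.

```python
import math

def z_value(instances, weights, labels, precond, cond):
    """Z(c1, c2) for precondition c1 and base condition c2: smaller is better.
    Scans every instance for every candidate split, which is exactly the
    per-node cost that BOAI's bottom-up evaluation avoids."""
    wp_t = wm_t = wp_f = wm_f = w_not_pre = 0.0
    for x, w, y in zip(instances, weights, labels):
        if not precond(x):
            w_not_pre += w            # contributes W(not c1)
        elif cond(x):
            if y > 0: wp_t += w       # W+(c1 and c2)
            else:     wm_t += w       # W-(c1 and c2)
        else:
            if y > 0: wp_f += w       # W+(c1 and not c2)
            else:     wm_f += w       # W-(c1 and not c2)
    return 2 * (math.sqrt(wp_t * wm_t) + math.sqrt(wp_f * wm_f)) + w_not_pre
```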

Our approach – BOAI
Evaluation phase in top-down evaluation:
- instances need to be sorted on numeric attributes at each prediction node
- the weight distribution for all possible splits must be computed by scanning the instances at each prediction node
- a great deal of the sorting and computing overlaps between nodes
BOAI (Bottom-up Evaluation for ADTree Induction):
- pre-sorting technique to reduce the sorting cost
- bottom-up evaluation: evaluate splits from the leaf nodes to the root node, avoiding much of the redundant computing and sorting cost
- obtains exactly the same evaluation results as the top-down evaluation approach

Pre-sorting technique (preprocessing step)
- Sort the values of each numeric attribute, map the sorted value space x0, x1, ..., xm-1 to 0, 1, ..., m-1, and replace the attribute values in the data with their sorted indexes.
- Use the sorted indexes to speed up sorting during the split evaluation phase.
Example: the sorted value space 1500, 1600, 1800 maps to sorted indexes 0, 1, 2; the original values in the data are replaced with these indexes.
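
A minimal sketch of the pre-sorting step, assuming a simple list-based representation (the helper name is illustrative):

```python
def presort(values):
    """Map each numeric attribute value to its rank in the sorted distinct-value
    space, so later phases can work with small integer indexes instead of raw values."""
    space = sorted(set(values))                     # e.g. [1500, 1600, 1800]
    rank = {v: i for i, v in enumerate(space)}      # {1500: 0, 1600: 1, 1800: 2}
    return [rank[v] for v in values], space

indexes, space = presort([1800, 1500, 1600, 1500])
print(indexes)  # [2, 0, 1, 0] -- the data now stores sorted indexes
```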

VW-set (Attribute-Value, Class-Weight)
- Only the weight distribution (W+, W-) over distinct attribute values is needed for split evaluation, so we keep just that necessary information.
- The VW-set of attribute A at node p stores the weight distribution of each class for each distinct value of A in F(p), where F(p) denotes the instances projected onto p.
- If A is a numeric attribute, the distinct values in the VW-set must be sorted.
- The VW-group of node p is the set of all VW-sets at node p; each prediction node can be evaluated based on its VW-group.
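
A hedged sketch of this bookkeeping: representing a VW-group as one dict per attribute, mapping each distinct value to its (W+, W-) pair, is one plausible layout, not necessarily the paper's exact data structure.

```python
def build_vw_group(instances, weights, labels, attributes):
    """VW-group of a prediction node p: one VW-set per attribute, where a VW-set
    maps each distinct attribute value in F(p) to its (W+, W-) weight pair."""
    group = {a: {} for a in attributes}
    for x, w, y in zip(instances, weights, labels):
        for a in attributes:
            wp, wm = group[a].get(x[a], (0.0, 0.0))
            group[a][x[a]] = (wp + w, wm) if y > 0 else (wp, wm + w)
    return group
```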

VW-set (Attribute-Value, Class-Weight)
The size of the VW-set is determined by the distinct attribute values appearing in F(p) and is not proportional to the size of F(p).
[Example table: weight distribution per distinct value, e.g. W+(Dept.=2) and W-(Dept.=2)]

Bottom-up evaluation
The main idea: evaluate based on the VW-group.
- Evaluate splits from the leaf nodes to the root node.
- Use already computed statistics to evaluate parent nodes.
- Much computing and sorting redundancy can be avoided.

Bottom-up evaluation (cont.)
For leaf nodes, directly construct the VW-group by scanning the instances at the node:
- VW-set on a categorical attribute: use a hash table to index the distinct values and collect their weights.
- VW-set on a numeric attribute: map the weights to the corresponding indexes in the sorted value space and compress them into a VW-set, so sorting takes linear time, O(n + m).
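
For the numeric case, here is a sketch of the leaf-node construction assuming the pre-sorted indexes from the earlier step; the bucket-then-compress layout is illustrative.

```python
def build_numeric_vw_set(sorted_indexes, weights, labels, m):
    """Bucket the class weights by pre-sorted value index (0..m-1), then compress
    away empty buckets. The result is sorted by construction, so no O(n log n)
    sort is needed: the cost is O(n + m)."""
    buckets = [[0.0, 0.0] for _ in range(m)]          # index -> [W+, W-]
    for idx, w, y in zip(sorted_indexes, weights, labels):
        buckets[idx][0 if y > 0 else 1] += w
    # keep only the indexes that actually occur at this node (the VW-set)
    return [(idx, wp, wm) for idx, (wp, wm) in enumerate(buckets) if wp or wm]
```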

Bottom-up evaluation (cont.)
For internal nodes, construct the VW-group by merging the VW-groups of the two children prediction nodes. Sort cost: O(V1 + V2).
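
A sketch of the merge step for two numeric VW-sets sorted by value index, assuming the (index, W+, W-) tuples produced above; it is a standard merge-sort style merge in O(V1 + V2).

```python
def merge_vw_sets(left, right):
    """Merge two VW-sets sorted by value index into the parent's VW-set,
    summing the weight pairs for indexes that appear in both children."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        li, lp, lm = left[i]
        rj, rp, rm = right[j]
        if li == rj:
            merged.append((li, lp + rp, lm + rm)); i += 1; j += 1
        elif li < rj:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```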

Evaluation algorithm
[Algorithm figure: leaf nodes use direct construction of VW-groups, internal nodes use merge construction; categorical and numeric splits are evaluated, then the other children are evaluated recursively]

Computation analysis
For a prediction node p with |F(p)| = n:
Top-down evaluation:
- sorting cost for each numeric attribute: O(n log n)
- Z-value calculation cost for each attribute: O(n)
Bottom-up evaluation:
- sorting cost for each numeric attribute: at a leaf node, O(n + m) in most cases; at an internal node, sorting is done through merging in O(V1 + V2), where V1 and V2 are the numbers of distinct values in the two merged VW-groups and are usually much smaller than n
- Z-value calculation cost for each attribute: O(V), where V is the number of distinct values in the VW-group, usually much smaller than n

Experiments
Data sets:
- Synthetic data sets: IBM Quest data mining group, up to 500,000 instances
- Real data set: China Mobile Communication Company, 290,000 subscribers covering 92 variables
Environment: AMD 3200+ CPU running Windows XP with 768 MB main memory

Experimental results (Synthetic data)

Experimental results (Real data)

Experimental results (Real data): application to churn prediction
- Calibration set: 20,083 instances; validation set: 5,062 instances
- Class imbalance: about a 2.1% churn rate
- Re-balancing strategy: multiply the weight of each instance in the minority class by Wmaj / Wmin, where Wmaj (resp. Wmin) is the total weight of the majority (resp. minority) class instances. This loses little information and does not require more computation on average.

Models                   F-measure  G-mean  W-accuracy  Modeling time (sec)
ADT (w/o re-balancing)   56.04      65.65   44.53       75.56
Random Forests           19.21      84.04   84.71       960.00
TreeNet                  72.81      79.61   64.40       30.00
BOAI                     50.62      90.81   85.84       7.625
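
A minimal sketch of the re-balancing strategy described above; treating +1 as the minority (churn) label is an assumption of this illustration.

```python
def rebalance(weights, labels, minority_label=+1):
    """Multiply every minority-class weight by Wmaj / Wmin so that both classes
    carry the same total weight."""
    w_min = sum(w for w, y in zip(weights, labels) if y == minority_label)
    w_maj = sum(w for w, y in zip(weights, labels) if y != minority_label)
    factor = w_maj / w_min
    return [w * factor if y == minority_label else w
            for w, y in zip(weights, labels)]
```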

Summary
- We developed a novel approach for ADTree induction that speeds up training on large data sets.
- Key insight: eliminate the great redundancy of sorting and computation in tree induction by using a bottom-up evaluation approach based on VW-groups.
- Experiments on both synthetic and real data sets show that BOAI offers significant performance improvement while constructing exactly the same model.
- It is an attractive algorithm for modeling on large data sets!

Thanks!