PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning RASTOGI, Rajeev and SHIM, Kyuseok Data Mining and Knowledge Discovery, 2000, 4.4.

Presentation transcript:

PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rastogi, Rajeev and Shim, Kyuseok. Data Mining and Knowledge Discovery, 2000, 4(4). Speakers: 李宜錚, 黃聖芸, 趙怡

Outline Introduction Preliminary Building phase Pruning phase The PUBLIC Integrated Algorithm Computation of Lower Bound on Subtree Cost Experimental Results Conclusion Discussion

Introduction Classification: assign each record of the input training set to a labeled class based on its attributes. The goal is to induce a description of each class in terms of the attributes.

Introduction Classification methods: Bayesian classification, neural networks, genetic algorithms, decision trees. Reasons for using decision trees: they are easy to understand and efficient.

Introduction Decision tree classification (example).

Introduction Constructing a decision tree has two phases: a building phase and a pruning phase. Pruning follows the Minimum Description Length principle: a smaller tree tends to give higher accuracy and is more efficient.

Introduction PUBLIC (PrUning and BuiLding Integrated in Classification) integrates the pruning phase into the building phase. It builds the same decision tree as the separate two-phase approach, and its cost is never more, and often less, than that of the two-phase algorithms.

Preliminary - building phase Building phase: the SPRINT algorithm. The tree is built breadth-first, and each split is binary.

Preliminary - building phase Data structure: one attribute list per attribute; each entry holds the attribute value, the class label, and the record identifier (rid).
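A minimal sketch of the attribute-list entry described above; the field names (value, label, rid) are illustrative, not taken from the SPRINT implementation.

from dataclasses import dataclass

@dataclass
class AttributeListEntry:
    """One entry of a SPRINT-style attribute list."""
    value: object   # attribute value (numeric or categorical)
    label: int      # class label of the record
    rid: int        # record identifier, used to route the entry to a child on a split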

Preliminary - building phase Root node: holds the full attribute lists. Other nodes: attribute sub-lists obtained by partitioning the parent's lists according to one attribute's split.

Preliminary - building phase Finding the split point: entropy E(S) is used as the splitting criterion, and the chosen split is the attribute (and value) with the least weighted entropy E(S1, S2). The attribute lists are then partitioned among the child nodes by record id; see the sketch below.
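A minimal Python sketch of this split search for one numeric attribute: scan the sorted attribute list and return the split value that minimizes the weighted entropy E(S1, S2). Names are illustrative; categorical attributes and the class histograms that SPRINT maintains are omitted.

import math

def entropy(counts):
    """Entropy E(S) of a list of per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def best_numeric_split(attribute_list, num_classes):
    """attribute_list: iterable of (value, class_label, rid) entries.
    Returns (split_value, cost) for the split 'attribute <= split_value'
    with the least weighted entropy E(S1, S2)."""
    entries = sorted(attribute_list)              # sort by attribute value
    left = [0] * num_classes                      # class counts of S1
    right = [0] * num_classes                     # class counts of S2
    for _, label, _ in entries:
        right[label] += 1
    n = len(entries)
    best_value, best_cost = None, float("inf")
    for i, (value, label, _) in enumerate(entries[:-1]):
        left[label] += 1                          # move one record from S2 to S1
        right[label] -= 1
        cost = ((i + 1) / n) * entropy(left) + ((n - i - 1) / n) * entropy(right)
        if cost < best_cost:
            best_value, best_cost = value, cost
    return best_value, best_cost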

Preliminary - pruning phase Pruning phase: compute the encoding cost to decide whether a subtree should be pruned; a lower total cost yields a better tree. MDL principle: the “best” tree is the one that can be encoded using the fewest bits.

Preliminary - pruning phase Cost of encoding the tree: (1) the structure of the tree, 1 bit per node, internal node (1) or leaf (0); (2) each split, log(a) bits for the split attribute plus the bits for the split value; (3) the classes of the data records in each leaf of the tree.
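A hedged sketch of this encoding cost. The structure bit and the log(a) bits for the split attribute follow the slide; the bits for the split value are left as a parameter, and C(S), the cost of encoding the class labels in a leaf, is approximated here by |S| times the class entropy, an assumption of this sketch rather than the exact formula from the paper.

import math

def split_cost(num_attributes, value_bits):
    """Bits for one internal node: 1 structure bit + log(a) bits for the
    split attribute + the bits needed for the split value."""
    return 1 + math.log2(num_attributes) + value_bits

def leaf_cost(class_counts):
    """Bits for one leaf: 1 structure bit + (approximate) C(S)."""
    total = sum(class_counts)
    class_entropy = -sum((c / total) * math.log2(c / total) for c in class_counts if c)
    return 1 + total * class_entropy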

Preliminary - pruning phase Pruning algorithm: a leaf node computes and returns its own cost; an internal node compares the cost of pruning its subtree (keeping N as a leaf) with the cost of keeping the subtree, and chooses the smaller. The recursion stops when N is the root. A sketch follows below.
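A bottom-up sketch of this pruning pass, reusing the split_cost and leaf_cost helpers from the previous sketch; the node object with children, class_counts, num_attributes, and value_bits fields is an illustrative assumption.

def prune(node):
    """Return the minimum MDL cost of the subtree rooted at node,
    pruning children whenever keeping node as a leaf is cheaper."""
    cost_as_leaf = leaf_cost(node.class_counts)
    if not node.children:                         # leaf: return its own cost
        return cost_as_leaf
    cost_as_subtree = split_cost(node.num_attributes, node.value_bits) \
                      + sum(prune(child) for child in node.children)
    if cost_as_leaf <= cost_as_subtree:
        node.children = []                        # prune the subtree below node
        return cost_as_leaf
    return cost_as_subtree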

The PUBLIC Integrated Algorithm Most algorithms for inducing decision trees run a building phase followed by a pruning phase. The disadvantage of the two-phase approach is that effort is spent building subtrees that the pruning phase later discards. PUBLIC (PrUning and BuiLding Integrated in Classification) avoids this.

The PUBLIC Integrated Algorithm The overall procedure is similar to the build procedure, except that pruning is invoked periodically on the partially built tree.

The PUBLIC Integrated Algorithm Problem with applying the original pruning procedure: on a partially built tree, the cost of a “yet to be expanded” leaf is not yet its final subtree cost, so the original procedure could prune subtrees that the two-phase algorithm would keep.

The PUBLIC Integrated Algorithm PUBLIC’s pruning algorithm uses an under-estimation strategy. It distinguishes three kinds of leaf nodes (“yet to be expanded” leaves, leaves that cannot be expanded further, and leaves resulting from pruning), and pruned nodes are removed from the queue Q of nodes to be expanded, which ensures they are never expanded.

Computation of Lower Bound on Subtree Cost PUBLIC(1): assumes a cost of at least 1; PUBLIC(S): also accounts for the cost of splits; PUBLIC(V): additionally accounts for the cost of split values. The three variants are identical except for the value used as the “lower bound on subtree cost at N”. They use increasingly accurate cost estimates for “yet to be expanded” leaf nodes, and so result in fewer nodes being expanded during the building phase; see the sketch below.
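A sketch of how the variants plug into the pruning pass, reusing leaf_cost from the earlier sketch; the yet_to_be_expanded flag and the lower_bound callback are illustrative names. PUBLIC(1) uses the constant 1, while PUBLIC(S) and PUBLIC(V) supply the tighter estimates sketched on the next slides.

def pruning_cost(node, lower_bound=lambda node: 1):
    """Cost assigned to a leaf during PUBLIC's pruning: an ordinary leaf
    gets its exact cost, while a 'yet to be expanded' leaf gets an
    under-estimate so it is never pruned prematurely."""
    if node.yet_to_be_expanded:
        return lower_bound(node)                  # the default corresponds to PUBLIC(1)
    return leaf_cost(node.class_counts)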

Computation of Lower Bound on Subtree Cost Estimating split costs. S: the set of records at node N; k: the number of classes among the records in S; n_i: the number of records of class i in S, with n_i >= n_{i+1} for 1 <= i < k; a: the number of attributes. If node N is not split (s = 0), the minimum cost for a subtree at N is C(S) + 1. For s > 0, the cost of any subtree with s splits rooted at node N is at least 2s + 1 + s·log(a) + (n_{s+2} + ... + n_k).

Computation of Lower Bound on Subtree Cost Algorithm for Computing Lower Bound on Subtree Cost ─ PUBLIC(S)

Computation of Lower Bound on Subtree Cost PUBLIC(S) calculates a lower bound for each s = 0, ..., k-1: for s = 0 the bound is C(S) + 1; for s > 0 it is 2s + 1 + s·log(a) + (n_{s+2} + ... + n_k). It then takes the minimum of these bounds. The sums are computed by iterative addition over the sorted class counts, giving O(k log k) time; a sketch follows below.
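A minimal sketch of this PUBLIC(S) estimate, assuming the bound 2s + 1 + s·log(a) + (n_{s+2} + ... + n_k) stated on the previous slide; variable names are illustrative.

import math

def public_s_lower_bound(class_counts, num_attributes, cost_as_leaf):
    """Lower bound on the cost of any subtree rooted at a 'yet to be
    expanded' leaf node N, minimized over the number of splits s."""
    n = sorted(class_counts, reverse=True)        # n_1 >= n_2 >= ... >= n_k
    k = len(n)
    best = cost_as_leaf                           # s = 0: keep N as a leaf, cost C(S) + 1
    remaining = sum(n[2:])                        # for s = 1: records of classes 3..k
    for s in range(1, k):                         # consider s = 1 .. k-1 splits
        bound = 2 * s + 1 + s * math.log2(num_attributes) + remaining
        best = min(best, bound)
        if s + 1 < k:
            remaining -= n[s + 1]                 # next s: one more class gets its own leaf
    return best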

Computation of Lower Bound on Subtree Cost Example: Let a “yet to be expanded” leaf node N contain the following set S of data records.

Computation of Lower Bound on Subtree Cost Incorporating the costs of split values: the split value specifies the distribution of records amongst the children of the split node. PUBLIC(S) estimates the cost of each split as log(a); PUBLIC(V) estimates it as log(a) plus the cost of encoding the splitting value(s). Time complexity of PUBLIC(V): O(k·(log k + a)).

Experimental Results - Real-life Data Sets Data sets: breast cancer, car, letter, satimage, shuttle, vehicle, and yeast; for each, the table lists the number of categorical attributes, numeric attributes, classes, training records, and test records.

Experimental Results - Real-life Data Sets Execution time of SPRINT, PUBLIC(1), PUBLIC(S), and PUBLIC(V) on each data set. Max ratio per data set: breast cancer 56%, car 38%, letter 18%, satimage 43%, shuttle 0.6%, vehicle 55%, yeast 83%.

Experimental Results - Synthetic Data Sets Attributes of the synthetic data (attribute: description, value):
salary: uniformly distributed
commission: zero if salary > 75000, otherwise uniformly distributed
age: uniformly distributed from 20 to 80
elevel: education level, uniformly chosen from 0 to 4
car: make of the car, uniformly chosen from 1 to 20
zipcode: zip code of the town, uniformly chosen from 9 available zipcodes
hvalue: value of the house, uniformly distributed from 0.5·k·100000 to 1.5·k·100000, where k in {0, ..., 9} depends on zipcode
hyears: years house owned, uniformly distributed from 1 to 30
loan: total loan amount, uniformly distributed

Experimental Results - Synthetic Data Sets Execution time of SPRINT, PUBLIC(1), PUBLIC(S), and PUBLIC(V) for each predicate. Max ratio by predicate number: 1: 267%, 2: 359%, 3: 269%, 4: 246%, 5: 236%, 6: 251%, 7: 250%, 8: 279%, 9: 236%, 10: 244%.

Experimental Results- Synthetic Data Sets Execution Time

Conclusion PUBLIC(1): the simplest variant, combining building and pruning. PUBLIC(S): also considers subtrees with splits. PUBLIC(V): computes the most accurate lower bound. Experimental results on real-life and synthetic data show that PUBLIC can yield significant performance improvements.

Discussion In the building phase, using the Gini index may need less space than computing entropy but may cost more time: computing logs needs a log table, while squaring costs computation time (a small comparison is sketched below). Adding a final pruning pass back into PUBLIC could further reduce the total number of nodes and the memory cost.
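A small side-by-side of the two impurity measures discussed here: the Gini index needs only multiplications, while entropy needs log evaluations (or a precomputed log table).

import math

def gini(counts):
    """Gini index 1 - sum(p_i^2): multiplications and divisions only."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy -sum(p_i * log2(p_i)): needs log evaluations or a log table."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)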

Thank you!