
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rajeev Rastogi and Kyuseok Shim. Data Mining and Knowledge Discovery, 4(4), 2000.




1 PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rajeev Rastogi and Kyuseok Shim. Data Mining and Knowledge Discovery, 4(4), 2000. Speakers: 0016323 李宜錚, 0016035 黃聖芸, 0016074 趙怡

2 Outline
Introduction
Preliminary: building phase and pruning phase
The PUBLIC Integrated Algorithm
Computation of Lower Bound on Subtree Cost
Experimental Results
Conclusion
Discussion

3 Introduction
Classification: assign each record of the training set to one of a set of labeled classes, based on its attribute values.
The goal is to induce a description of each class in terms of the attributes.

4 Introduction
Classification methods: Bayesian classification, neural networks, genetic algorithms, decision trees.
Reasons for using decision trees: they are easy to understand and efficient.

5 Introduction
Classifying with a decision tree (figure).

6 Introduction
Constructing a decision tree has two phases: a building phase and a pruning phase.
Pruning follows the Minimum Description Length (MDL) principle: a smaller tree tends to get higher accuracy and is more efficient.

7 Introduction
PUBLIC (PrUning and BuiLding Integrated in Classification) integrates the pruning phase into the building phase.
It builds the same decision tree as the separate build-then-prune phases would, and costs no more, and usually less, than such algorithms.

8 Preliminary - building phase
The building phase follows the SPRINT algorithm: the tree is built breadth-first and each split is binary.

9 Preliminary - building phase
Data structure: one attribute list per attribute. Each entry holds the attribute value, the class label, and the record identifier.
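A minimal Python sketch of this layout (the record format, a list of dicts with a "class" key, and all field names are illustrative assumptions, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class AttributeListEntry:
    """One entry of a SPRINT-style attribute list."""
    value: object   # this record's value for the attribute
    label: str      # class label of the record
    rid: int        # record identifier, used to route entries on a split

def build_attribute_lists(records):
    """Build one attribute list per attribute; numeric lists are sorted once."""
    attributes = [k for k in records[0] if k != "class"]
    lists = {}
    for a in attributes:
        entries = [AttributeListEntry(r[a], r["class"], rid)
                   for rid, r in enumerate(records)]
        if all(isinstance(e.value, (int, float)) for e in entries):
            entries.sort(key=lambda e: e.value)  # pre-sort numeric attributes
        lists[a] = entries
    return lists
```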

10 Preliminary - building phase
The root node holds the full attribute lists; every other node holds the attribute sub-lists produced by partitioning its parent's lists on the split attribute.

11 Preliminary - building phase
Finding the split point: entropy E(S) is the split criterion, and the split chosen is the one with the least weighted entropy E(S1, S2).
The attribute lists are then partitioned between the two children by record id.
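A sketch of split selection under these definitions, continuing the attribute-list sketch above. For clarity it recomputes entropy from scratch at every candidate point; SPRINT instead maintains running class histograms while scanning the sorted list:

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i) over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_cost(left, right):
    """Weighted entropy E(S1, S2) of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def best_numeric_split(entries):
    """Scan a pre-sorted attribute list, trying midpoints between
    distinct adjacent values; return (E(S1,S2), split point)."""
    labels = [e.label for e in entries]
    best = (float("inf"), None)
    for i in range(1, len(entries)):
        if entries[i - 1].value == entries[i].value:
            continue
        cost = split_cost(labels[:i], labels[i:])
        if cost < best[0]:
            best = (cost, (entries[i - 1].value + entries[i].value) / 2)
    return best
```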

12 Preliminary - pruning phase
The pruning phase computes encoding costs to determine whether each subtree should be pruned; lowering the total cost gives a better tree.
MDL principle: the "best" tree is the one that can be encoded using the fewest number of bits.

13 Preliminary - pruning phase
Cost of encoding a tree:
the structure of the tree: 1 bit per node (internal node = 1, leaf = 0)
each split: log(a) bits to identify the split attribute, plus the bits for the attribute's split value
the classes of the data records in each leaf of the tree
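A hedged sketch of this cost computation. The tree-node fields (is_leaf, labels, left, right, split_value_bits) are hypothetical, and class_encoding_cost simplifies the paper's C(S) term to keep the sketch short:

```python
import math
from collections import Counter

def class_encoding_cost(labels):
    """Simplified C(S): bits to encode the class labels of the records in
    a leaf, sum_i n_i * log2(n / n_i). The paper's C(S) has extra terms."""
    n = len(labels)
    return sum(c * math.log2(n / c) for c in Counter(labels).values())

def tree_cost(node, n_attributes):
    """MDL cost of the subtree rooted at `node`: 1 structure bit per node,
    log2(a) bits plus value bits per split, class costs at the leaves."""
    if node.is_leaf:
        return 1 + class_encoding_cost(node.labels)
    split_bits = math.log2(n_attributes) + node.split_value_bits
    return (1 + split_bits
            + tree_cost(node.left, n_attributes)
            + tree_cost(node.right, n_attributes))
```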

14 Preliminary - pruning phase
Pruning algorithm: a leaf node computes and returns its own cost; an internal node compares the cost of pruning its subtree with the cost of keeping it, and chooses the smaller. The pass stops when N is the root.
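A sketch of that bottom-up pass, reusing class_encoding_cost and the hypothetical node fields from the previous block:

```python
def prune(node, n_attributes):
    """Bottom-up MDL pruning: return the minimum encoding cost achievable
    at `node`, collapsing its subtree into a leaf whenever that is cheaper
    than keeping the split."""
    leaf_cost = 1 + class_encoding_cost(node.labels)
    if node.is_leaf:
        return leaf_cost
    split_bits = math.log2(n_attributes) + node.split_value_bits
    keep_cost = (1 + split_bits
                 + prune(node.left, n_attributes)
                 + prune(node.right, n_attributes))
    if leaf_cost <= keep_cost:
        node.is_leaf = True              # prune: cheaper to encode as a leaf
        node.left = node.right = None
        return leaf_cost
    return keep_cost
```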

15 The PUBLIC Integrated Algorithm
Most algorithms for inducing decision trees run a building phase followed by a pruning phase.
The disadvantage of the two phases is that effort spent growing subtrees that pruning later discards is wasted.
PUBLIC (PrUning and BuiLding Integrated in Classification) addresses this.

16 The PUBLIC Integrated Algorithm
PUBLIC's build procedure is similar to SPRINT's, with pruning invoked periodically on the partially built tree.

17 The PUBLIC Integrated Algorithm
Problem with applying the original pruning procedure during building: a "yet to be expanded" leaf is charged only its cost as a leaf, so subtrees that the finished tree would keep can be pruned prematurely.

18 The PUBLIC Integrated Algorithm
PUBLIC's pruning algorithm uses an under-estimation strategy: it distinguishes three kinds of leaf nodes and charges "yet to be expanded" leaves a lower bound on their eventual subtree cost, and the queue Q ensures that pruned nodes are not expanded.
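One way to picture the three leaf kinds and the cost each is charged during building (a sketch with illustrative names; subtree_lower_bound is the function sketched on the lower-bound slides below):

```python
from enum import Enum

class LeafKind(Enum):
    YET_TO_BE_EXPANDED = 1   # still waiting on the build queue Q
    NOT_EXPANDABLE = 2       # pure node, or no further split is possible
    PRUNED = 3               # removed from Q by an earlier pruning pass

def leaf_cost_estimate(node, n_attributes):
    """Cost charged to a leaf while the tree is still being built: a lower
    bound on the final subtree cost for leaves that may yet be expanded
    (so pruning never fires too eagerly), the exact leaf cost otherwise."""
    if node.kind is LeafKind.YET_TO_BE_EXPANDED:
        return subtree_lower_bound(node.labels, n_attributes)
    return 1 + class_encoding_cost(node.labels)
```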

19 Computation of Lower Bound on Subtree Cost
PUBLIC(1): charges every "yet to be expanded" leaf a cost of at least 1.
PUBLIC(S): also accounts for the cost of splits.
PUBLIC(V): also accounts for the cost of split values.
The variants are identical except for the value of the "lower bound on subtree cost at N". They use increasingly accurate cost estimates for "yet to be expanded" leaf nodes, and so result in fewer nodes being expanded during the building phase.

20 Computation of Lower Bound on Subtree Cost
Estimating split costs:
S: the set of records at node N
k: the number of classes among the records in S
n_i: the number of records belonging to class i in S, with the classes ordered so that n_i >= n_(i+1) for 1 <= i < k
a: the number of attributes
In case node N is not split, that is, s = 0, the minimum cost for a subtree at N is C(S) + 1.
For s > 0, the cost of any subtree with s splits and rooted at node N is at least 2s + 1 + s*log(a) + sum_(i=s+2..k) n_i.

21 Computation of Lower Bound on Subtree Cost Algorithm for Computing Lower Bound on Subtree Cost ─ PUBLIC(S)

22 Computation of Lower Bound on Subtree Cost
PUBLIC(S) calculates a lower bound for each s = 0, ..., k-1:
For s = 0: C(S) + 1
For s > 0: 2s + 1 + s*log(a) + sum_(i=s+2..k) n_i
It takes the minimum of these bounds, computing the sums by iterative addition over the sorted class counts, in O(k log k) time.
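A sketch of this computation, assuming the lower-bound formula above and reusing the simplified class_encoding_cost from the pruning sketch; sorting the class counts gives the O(k log k) term:

```python
import math
from collections import Counter

def subtree_lower_bound(labels, n_attributes):
    """Lower bound on the cost of any subtree rooted at a 'yet to be
    expanded' leaf: the minimum over s = 0, ..., k-1 of
    2s + 1 + s*log2(a) + sum_(i=s+2..k) n_i, where s = 0 falls back to
    C(S) + 1 and the n_i are class counts in decreasing order."""
    counts = sorted(Counter(labels).values(), reverse=True)  # O(k log k)
    k = len(counts)
    best = 1 + class_encoding_cost(labels)  # s = 0: no split at all
    tail = len(labels) - counts[0]          # sum of n_i for i >= 2
    log_a = math.log2(n_attributes)
    for s in range(1, k):                   # s splits give s + 1 leaves
        tail -= counts[s]                   # now sum of n_i for i >= s + 2
        best = min(best, 2 * s + 1 + s * log_a + tail)
    return best
```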

23 Computation of Lower Bound on Subtree Cost Example: Let a “yet to be expanded” leaf node N contain the following set S of data records.
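The slide's actual records did not survive the transcript, so here is a hypothetical stand-in showing how the bound behaves (the class counts 40, 25, 10 and a = 4 attributes are made up for illustration):

```python
# Hypothetical set S at a "yet to be expanded" leaf N:
# three classes with counts 40, 25, 10; a = 4 attributes.
labels = ["c1"] * 40 + ["c2"] * 25 + ["c3"] * 10
print(subtree_lower_bound(labels, n_attributes=4))
# s = 0: C(S) + 1 (about 106 bits for this distribution)
# s = 1: 2*1 + 1 + 1*log2(4) + n_3 = 3 + 2 + 10 = 15
# s = 2: 2*2 + 1 + 2*log2(4) + 0   = 5 + 4      = 9
# The minimum, 9, is the lower bound; it is far below the exact leaf
# cost, so N stays on the queue instead of being pruned.
```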

24 Computation of Lower Bound on Subtree Cost
Incorporating the costs of split values: the split value is what specifies the distribution of records amongst the children of the split node.
PUBLIC(S) estimates each split as log(a) bits.
PUBLIC(V) estimates each split as log(a) bits plus the encoding of the split value(s).
Time complexity of PUBLIC(V): O(k*(log k + a)).

25 Experimental Results - Real-life Data Sets

Data Set                   breast cancer  car   letter  satimage  shuttle  vehicle  yeast
No. of categorical attrs   0              6     0       0         0        0        0
No. of numeric attrs       9              0     16      36        9        18       8
No. of classes             2              4     26      7         5        4        10
No. of records (train)     469            1161  13368   4435      43500    559      1001
No. of records (test)      214            567   6632    2000      14500    287      483

26 Experimental Results - Real-life Data Sets: Execution Time

Data Set   breast cancer  car    letter   satimage  shuttle  vehicle  yeast
SPRINT     2.49           45.8   3283.2   1471.00   457.78   151.90   253.96
PUBLIC(1)  19.009         39.49  2799.17  1288.30   458.10   112.35   179.14
PUBLIC(S)  13.78          33.93  2786.57  1036.34   455.15   99.67    144.5
PUBLIC(V)  16390          33.93  2793.32  1042.39   455.20   97.69    139.02
Max Ratio  56%            38%    18%      43%       0.6%     55%      83%

27 Experimental Results - Synthetic Data Sets

Attribute   Description           Value
salary                            uniformly distributed from 20000 to 150000
commission                        if salary > 75000 then commission is zero, else uniformly distributed from 10000 to 75000
age                               uniformly distributed from 20 to 80
elevel      education level       uniformly chosen from 0 to 4
car         make of the car       uniformly chosen from 1 to 20
zipcode     zip code of the town  uniformly chosen from 9 available zipcodes
hvalue      value of the house    uniformly distributed from 0.5k*100000 to 1.5k*100000, where k in {0, ..., 9} depends on zipcode
hyears      years house owned     uniformly distributed from 1 to 30
loan        total loan amount     uniformly distributed from 0 to 500000

28 Experimental Results - Synthetic Data Sets: Execution Time

Predicate No.  1     2     3     4     5     6     7     8     9     10
SPRINT         1531  1471  1399  1413  1454  1471  1277  1978  1336  1627
PUBLIC(1)      714   720   654   707   769   760   615   875   676   783
PUBLIC(S)      607   604   553   617   656   653   559   741   589   689
PUBLIC(V)      574   593   520   575   615   587   510   708   567   666
Max Ratio      267%  359%  269%  246%  236%  251%  250%  279%  236%  244%

29 Experimental Results - Synthetic Data Sets: Execution Time (chart)

30 Conclusion
PUBLIC(1): the simplest variant; building and pruning are integrated with a constant lower bound of 1.
PUBLIC(S): additionally considers the subtree's splits in the lower bound.
PUBLIC(V): computes the most accurate lower bound by also counting split values.
Experimental results on real-life and synthetic data show that PUBLIC can give significant performance improvements.

31 Discussion
In the building phase, using the Gini index may take less space than computing entropy but can cost more time: the logarithm can be handled with a lookup table, while squaring takes extra computation.
A pruning pass could be added back after PUBLIC to reduce the total number of nodes further.
Memory cost could also be reduced.

32 Thank you!

