1 CHAPTER 9: Decision Trees Slides corrected and new slides added by Ch. Eick

2 Tree Uses Nodes and Leaves (slide from Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press, V1.1)

3 Divide and Conquer
Internal decision nodes:
 Univariate: uses a single attribute x_i
   - Numeric x_i: binary split, x_i > w_m
   - Discrete x_i: n-way split for the n possible values
 Multivariate: uses all attributes, x
Leaves:
 Classification: class labels, or proportions
 Regression: numeric; the average of r, or a local fit
Learning is greedy; the best split is found recursively (Breiman et al., 1984; Quinlan, 1986, 1993) without backtracking (decisions taken are final).

4 Side Discussion: "Greedy Algorithms"
 Fast, and therefore attractive for solving NP-hard and other problems of high complexity.
 Later decisions are made in the context of decisions selected earlier, which dramatically reduces the size of the search space.
 They do not backtrack: if they make a bad decision (based on local criteria), they never revise it.
 They are not guaranteed to find the optimal solution, and can sometimes be deceived into finding really bad solutions.
 In spite of the above, many successful and popular algorithms in computer science are greedy algorithms; greedy algorithms are particularly popular in AI and Operations Research.
 Popular greedy algorithms: decision tree induction, …

5 Classification Trees (ID3, CART, C4.5)
For node m, N_m instances reach m, and N^i_m of them belong to class C_i, so the estimated class probability is p^i_m = N^i_m / N_m.
Node m is pure if p^i_m is 0 or 1 for every class.
The measure of impurity is entropy: I_m = Σ_i p^i_m log2(1/p^i_m).
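To make this concrete, here is a minimal Python sketch (not from the slides; the function name is mine) that computes the entropy impurity of a node from the class counts N^i_m:

```python
from math import log2

def node_entropy(class_counts):
    """Entropy impurity of node m, given the counts N_m^i of each class reaching it."""
    n_m = sum(class_counts)                      # N_m: instances reaching node m
    probs = [n_i / n_m for n_i in class_counts]  # p_m^i = N_m^i / N_m
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(node_entropy([4, 0]))   # 0.0 -> pure node
print(node_entropy([3, 3]))   # 1.0 -> maximally impure for two classes
```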

6 Best Split
If node m is pure, generate a leaf and stop; otherwise split and continue recursively.
Impurity after the split: N_mj of the N_m instances take branch j, and N^i_mj of those belong to C_i.
Find the variable and split that minimize the impurity (among all variables, and among all split positions for numeric variables). (skip)
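Continuing the sketch above (illustrative names, not the slides' code), the impurity after an n-way split weights each branch's entropy by the fraction N_mj / N_m of instances that take branch j, and the chosen split is the one minimizing this value:

```python
def split_entropy(branch_counts):
    """Total impurity after a split: branch_counts holds one list of class
    counts per branch j, e.g. [[3, 0], [1, 2]]."""
    n_m = sum(sum(branch) for branch in branch_counts)   # N_m
    return sum(sum(branch) / n_m * node_entropy(branch)  # weight N_mj / N_m
               for branch in branch_counts)

# Evaluate all candidate splits and keep the one with minimum post-split impurity.
candidates = {"split_a": [[3, 0], [1, 2]], "split_b": [[2, 1], [2, 1]]}
best = min(candidates, key=lambda name: split_entropy(candidates[name]))
print(best)   # split_a has the lower post-split entropy here
```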

7 Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
 Determine how to split the records:
   - How to specify the attribute test condition?
   - How to determine the best split?
 Determine when to stop splitting
A compact sketch of this recursive procedure is given after this list.
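As a rough illustration of the whole loop, here is a toy ID3-style sketch for discrete attributes; the representation of records as attribute-to-value dicts and all names are my assumptions, not the exact algorithm of the slides:

```python
from collections import Counter
from math import log2

def H(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(c / n * log2(n / c) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Greedy recursive induction: pick the best test, split, recurse, never backtrack."""
    if len(set(labels)) == 1 or not attrs:                 # stopping criteria
        return ("leaf", Counter(labels).most_common(1)[0][0])

    def post_split_entropy(attr):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        return sum(len(g) / len(labels) * H(g) for g in groups.values())

    best = min(attrs, key=post_split_entropy)              # the greedy choice
    branches = {}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return ("node", best, branches)

tree = build_tree([{"city": "sf"}, {"city": "sf"}, {"city": "la"}],
                  ["yes", "yes", "no"], ["city"])
```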

8 Information Gain vs. Gain Ratio
Data set (6 examples): D = (2/3, 1/3)
Split on city: city=sf → D1 = (1, 0); city=la → D2 = (1/3, 2/3)
Split on car: car=merc → D1 = (0, 1); car=taurus → D2 = (2/3, 1/3); car=van → D3 = (1, 0)
Split on age: age=22, 25, 27, 35, 40, 50 → D1 = (1, 0), D2 = (0, 1), D3 = (1, 0), D4 = (1, 0), D5 = (1, 0), D6 = (0, 1)
Gain(D, city=) = H(1/3, 2/3) − ½·H(1, 0) − ½·H(1/3, 2/3) = 0.45
Gain(D, car=) = H(1/3, 2/3) − 1/6·H(0, 1) − ½·H(2/3, 1/3) − 1/3·H(1, 0) = 0.45
Gain(D, age=) = H(1/3, 2/3) − 6·1/6·H(0, 1) = 0.90
G_Ratio_pen(city=) = H(1/2, 1/2) = 1
G_Ratio_pen(car=) = H(1/2, 1/3, 1/6) = 1.45
G_Ratio_pen(age=) = log2(6) = 2.58
Result with I_Gain_Ratio: city > age > car
Result with I_Gain: age > car = city

9 How to Determine the Best Split?
Before splitting: 10 records of class 0 and 10 records of class 1.
Before: E(1/2, 1/2)
After: 4/20·E(1/4, 3/4) + 8/20·E(1, 0) + 8/20·E(1/8, 7/8)
Gain: Before − After
Pick the test that has the highest gain!
Remark: E stands for any impurity measure: Gini, entropy (H), misclassification impurity 1 − max_c P(c), or the gain-ratio variant.
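The numbers on this slide can be checked with a few lines; as the remark says, E can be any impurity measure, so the sketch below (helper names are mine) plugs in both entropy and Gini:

```python
from math import log2

def entropy(p):
    return sum(x * log2(1 / x) for x in p if x > 0)

def gini(p):
    return 1 - sum(x * x for x in p)

for E in (entropy, gini):
    before = E((1/2, 1/2))                        # 10 records of class 0, 10 of class 1
    after = 4/20 * E((1/4, 3/4)) + 8/20 * E((1, 0)) + 8/20 * E((1/8, 7/8))
    print(E.__name__, "gain =", before - after)   # entropy: ~0.62, gini: ~0.34
```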

10 Entropy and Gain Computations
Assume we have m classes in our classification problem. A test S subdivides the examples D = (p_1, …, p_m) into n subsets D_1 = (p_11, …, p_1m), …, D_n = (p_n1, …, p_nm). The quality of S is evaluated using Gain(D, S) (ID3) or Gain_Ratio(D, S) (C5.0):
Let H(D = (p_1, …, p_m)) = Σ_{i=1..m} p_i·log2(1/p_i) (the entropy function)
Gain(D, S) = H(D) − Σ_{i=1..n} (|D_i|/|D|)·H(D_i)
Gain_Ratio(D, S) = Gain(D, S) / H(|D_1|/|D|, …, |D_n|/|D|)
Remarks:
 |D| denotes the number of elements in set D.
 D = (p_1, …, p_m) implies that p_1 + … + p_m = 1 and indicates that of the |D| examples, p_1·|D| belong to the first class, p_2·|D| belong to the second class, …, and p_m·|D| belong to the m-th (last) class.
 H(0, 1) = H(1, 0) = 0; H(1/2, 1/2) = 1; H(1/4, 1/4, 1/4, 1/4) = 2; H(1/p, …, 1/p) = log2(p).
 C5.0 selects the test S with the highest value for Gain_Ratio(D, S), whereas ID3 picks the test S for the examples in set D with the highest value for Gain(D, S).
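A short sketch (my variable names; the subset sizes are read off the weights on slide 8) that plugs the city/car/age example into these two definitions. It reproduces gains of roughly 0.46, 0.46 and 0.92 (the slide rounds them to 0.45 and 0.90) and the gain-ratio ordering city > age > car:

```python
from math import log2

def H(p):
    """Entropy H(p_1, ..., p_n) = sum_i p_i * log2(1/p_i)."""
    return sum(x * log2(1 / x) for x in p if x > 0)

def gain(D, subsets, sizes):
    """Gain(D,S) = H(D) - sum_j (|D_j|/|D|) * H(D_j)."""
    n = sum(sizes)
    return H(D) - sum(s / n * H(d) for d, s in zip(subsets, sizes))

def gain_ratio(D, subsets, sizes):
    """Gain_Ratio(D,S) = Gain(D,S) / H(|D_1|/|D|, ..., |D_n|/|D|)."""
    n = sum(sizes)
    return gain(D, subsets, sizes) / H([s / n for s in sizes])

D = (2/3, 1/3)    # the 6 examples of slide 8: 4 in one class, 2 in the other
splits = {        # (class distributions of the subsets, subset sizes)
    "city": ([(1, 0), (1/3, 2/3)], [3, 3]),
    "car":  ([(0, 1), (2/3, 1/3), (1, 0)], [1, 3, 2]),
    "age":  ([(1, 0), (0, 1), (1, 0), (1, 0), (1, 0), (0, 1)], [1] * 6),
}
for name, (subsets, sizes) in splits.items():
    print(name, round(gain(D, subsets, sizes), 2), round(gain_ratio(D, subsets, sizes), 2))
```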


12 Splitting Continuous Attributes
Different ways of handling:
 Discretization to form an ordinal categorical attribute
   - Static: discretize once at the beginning
   - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), clustering, or supervised clustering
 Binary decision: (A < v) or (A ≥ v)
   - Consider all possible splits and find the best cut v
   - Can be more compute-intensive
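A sketch of the "consider all possible splits and find the best cut" option (function name and example data are illustrative): candidate thresholds are the midpoints between consecutive distinct values of the sorted attribute.

```python
from collections import Counter
from math import log2

def H(labels):
    n = len(labels)
    return sum(c / n * log2(n / c) for c in Counter(labels).values())

def best_binary_cut(values, labels):
    """Return the threshold v minimizing the impurity of the split (A < v) vs (A >= v)."""
    pairs = sorted(zip(values, labels))
    best_v, best_imp = None, float("inf")
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue                          # no cut between identical values
        v = (a + b) / 2                       # midpoint as the candidate cut
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        imp = len(left) / len(pairs) * H(left) + len(right) / len(pairs) * H(right)
        if imp < best_imp:
            best_v, best_imp = v, imp
    return best_v, best_imp

print(best_binary_cut([1.0, 2.3, 2.9, 4.1, 5.0], ["a", "a", "b", "b", "b"]))  # cut near 2.6
```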

13 (figure-only slide from the Alpaydın lecture notes)

14 Stopping Criteria for Tree Induction
1. Grow the entire tree:
    Stop expanding a node when all of its records belong to the same class
    Stop expanding a node when all of its records have the same attribute values
2. Pre-pruning (do not grow the complete tree):
    Stop when only x examples are left
    … other pre-pruning strategies

15 How to Address Overfitting in Decision Trees
The most popular approach is post-pruning:
 Grow the decision tree to its entirety
 Trim the nodes of the decision tree in a bottom-up fashion
 If the generalization error improves after trimming, replace the sub-tree by a leaf node
 The class label of the leaf node is determined from the majority class of the instances in the sub-tree
A hedged sketch of this procedure is given below.
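In this sketch of bottom-up pruning, the tree representation (nodes carry the tested attribute, a branch dict, and the training labels that reached them) and the use of a hold-out validation set as the estimate of generalization error are my assumptions, not part of the slide:

```python
from collections import Counter

def predict(node, row):
    if node[0] == "leaf":
        return node[1]
    _, attr, branches, _ = node
    return predict(branches.get(row[attr], ("leaf", None)), row)

def prune(node, val_rows, val_labels):
    """Trim a subtree to a leaf if that does not increase the validation error."""
    if node[0] == "leaf":
        return node
    _, attr, branches, train_labels = node
    for value, child in branches.items():                  # prune children first (bottom-up)
        idx = [i for i, r in enumerate(val_rows) if r[attr] == value]
        branches[value] = prune(child, [val_rows[i] for i in idx],
                                [val_labels[i] for i in idx])
    majority = Counter(train_labels).most_common(1)[0][0]  # majority class of the sub-tree
    leaf_errors = sum(y != majority for y in val_labels)
    subtree_errors = sum(predict(node, r) != y for r, y in zip(val_rows, val_labels))
    return ("leaf", majority) if leaf_errors <= subtree_errors else node
```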

16 Advantages of Decision Tree Based Classification
 Inexpensive to construct
 Extremely fast at classifying unknown records
 Easy to interpret for small-sized trees
 Okay for noisy data
 Can handle both continuous and symbolic attributes
 Accuracy is comparable to other classification techniques for many simple data sets
 Decent average performance over many data sets
 Kind of a standard: if you want to show that your "new" classification technique really "improves the world", compare its performance against decision trees (e.g., C5.0) using 10-fold cross-validation
 Does not need distance functions; only the order of attribute values matters for classification: 0.1, 0.2, 0.3 and 0.331, 0.332, 0.333 are the same for a decision tree learner

17 Disadvantages of Decision Tree Based Classification
 Relies on rectangular approximations that might not be good for some data sets
 Selecting good learning-algorithm parameters (e.g., the degree of pruning) is non-trivial
 Ensemble techniques, support vector machines, and kNN might obtain higher accuracy for a specific data set
 More recently, forests (ensembles of decision trees) have gained some popularity

18 Error in Regression Trees
Examples (x, r): (1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)
Error before the split: 1/6·(2·0² + 2·0.01 + 2·0.04)
In general, regression trees minimize the squared error!
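The error term on this slide can be reproduced in a few lines; the grouping of the six examples into three leaves (x < 1.5, 1.5 ≤ x < 4.5, x ≥ 4.5) is my reading of the slide's figure, not something stated in the transcript:

```python
examples = [(1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)]

def leaf(x):                                   # assumed leaf boundaries
    return 0 if x < 1.5 else (1 if x < 4.5 else 2)

groups = {}
for x, r in examples:
    groups.setdefault(leaf(x), []).append(r)

# Each leaf predicts the average of its r values; the tree's error is the
# mean squared deviation over all six examples.
error = sum((r - sum(g) / len(g)) ** 2 for g in groups.values() for r in g) / len(examples)
print(error)    # 1/6 * (2*0**2 + 2*0.01 + 2*0.04) ~ 0.0167
```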

19 Regression Trees: error at node m, and error after splitting (the equations are shown on the slide; a reconstruction follows below).
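The equations themselves did not survive the transcript; the following LaTeX is a reconstruction of the standard regression-tree error in Alpaydın's notation, where b_m(x) = 1 if x reaches node m and 0 otherwise, so treat it as a best guess at what the slide shows:

```latex
% Error at node m (g_m is the average output of the instances reaching m):
\[
  g_m = \frac{\sum_t b_m(\mathbf{x}^t)\, r^t}{\sum_t b_m(\mathbf{x}^t)},
  \qquad
  E_m = \frac{1}{N_m} \sum_t \left(r^t - g_m\right)^2 b_m(\mathbf{x}^t)
\]
% After splitting node m into branches j:
\[
  E'_m = \frac{1}{N_m} \sum_j \sum_t \left(r^t - g_{mj}\right)^2 b_{mj}(\mathbf{x}^t),
  \qquad
  g_{mj} = \frac{\sum_t b_{mj}(\mathbf{x}^t)\, r^t}{\sum_t b_{mj}(\mathbf{x}^t)}
\]
```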


21 Model Selection in Trees (figure shown on the slide)

22 Summary: Regression Trees
Regression trees work like decision trees, except that:
 Leaves carry numbers that predict the output variable instead of class labels
 Instead of entropy/purity, the squared prediction error serves as the performance measure
 Tests that reduce the squared prediction error the most are selected as the node test; test conditions have the form y > threshold, where y is an independent variable of the prediction problem (see the sketch below)
 Instead of using the majority class, leaf labels are generated by averaging the values of the dependent variable over the examples associated with the particular leaf node
Like decision trees, regression trees employ rectangular tessellations with a fixed number associated with each rectangle; that number is the output for all inputs that lie inside the rectangle. In contrast to ordinary regression, which performs curve fitting, regression trees use averaging with respect to the output variable when making predictions.
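A sketch of the node-test selection described above (names and the toy call at the end are mine): every midpoint between consecutive values of an input variable is scored by the squared prediction error left after splitting, and each side predicts the average of the dependent variable.

```python
def sse(values):
    """Squared error of predicting the mean of `values` for every element."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(x, r):
    """Pick the threshold on input x that most reduces the squared error of r."""
    pairs = sorted(zip(x, r))
    best_t, best_err = None, sse(r)                # error with no split at all
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        t = (a + b) / 2
        left = [ri for xi, ri in pairs if xi <= t]
        right = [ri for xi, ri in pairs if xi > t]
        if sse(left) + sse(right) < best_err:
            best_t, best_err = t, sse(left) + sse(right)
    return best_t, best_err

# On the six examples of slide 18, the first chosen test is x > 4.6, separating
# the four small-x points from (7, 2.06) and (8, 1.66).
xs, rs = zip(*[(1, 1.37), (2, -1.45), (2.1, -1.35), (2.2, -1.25), (7, 2.06), (8, 1.66)])
print(best_regression_split(list(xs), list(rs)))
```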

23 Multivariate Trees (figure shown on the slide)

