
1 Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014

2 Example: Age, Income and Owning a flat
[Scatter plot of the training set: Monthly income (thousand rupees) vs. Age; points marked "Owns a house" / "Does not own a house", with two lines L1 and L2]
• If the training data was as above, could we define some simple rules by observation?
  – Any point above the line L1 → Owns a house
  – Any point to the right of L2 → Owns a house
  – Any other point → Does not own a house

3 Example: Age, Income and Owning a flat
[Same scatter plot, with the lines L1 and L2 corresponding to the splits below]
Root node: split at Income = 101
• Income ≥ 101: Label = Yes
• Income < 101: split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No
In general, the data will not be as clean as above.
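As a small illustration (not part of the slides), the tree just described can be written directly as nested conditions; owns_house is a made-up name, and the thresholds 101 and 54 are the ones from the slide:

```python
def owns_house(income, age):
    """Hand-built tree from the example: income in thousand rupees, age in years."""
    if income >= 101:        # root node: split at Income = 101
        return "Yes"
    elif age >= 54:          # second split at Age = 54, on the low-income branch
        return "Yes"
    else:
        return "No"

print(owns_house(income=120, age=30))  # Yes (high income)
print(owns_house(income=80,  age=60))  # Yes (lower income, but older)
print(owns_house(income=80,  age=40))  # No
```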

4 Example: Age, Income and Owning a flat
[Same scatter plot of the training set: Monthly income (thousand rupees) vs. Age]
• Approach: recursively split the data into partitions so that each partition becomes purer, till …
  – How to decide the split?
  – How to measure purity?
  – When to stop?

5 Approach for splitting
• What are the possible lines for splitting?
  – For each variable, midpoints between pairs of consecutive values of that variable
  – How many? If N = number of points in the training set and m = number of variables, about O(N × m)
• How to choose which line to use for splitting?
  – The line which reduces impurity (~ heterogeneity of composition) the most
• How to measure impurity?
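A minimal sketch of the candidate-split enumeration, assuming the training set is given as a list of numeric feature tuples; the function name and the toy data are made up for illustration:

```python
def candidate_splits(rows, num_features):
    """Candidate thresholds: for each variable, the midpoints between
    pairs of consecutive distinct values of that variable."""
    splits = []
    for j in range(num_features):
        values = sorted(set(row[j] for row in rows))
        for a, b in zip(values, values[1:]):
            splits.append((j, (a + b) / 2.0))   # (feature index, threshold)
    return splits

# Toy training set with m = 2 variables: (monthly income, age)
rows = [(60, 25), (80, 40), (105, 30), (120, 55)]
print(candidate_splits(rows, num_features=2))
# [(0, 70.0), (0, 92.5), (0, 112.5), (1, 27.5), (1, 35.0), (1, 47.5)]
```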

6 Gini Index for Measuring Impurity
• Suppose there are C classes
• Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
• Gini index: Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
• If all observations in t belong to one single class, Gini(t) = 0
• When is Gini(t) maximum?
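A minimal sketch of the Gini computation for a single node, taking the node as its list of class labels (the helper name is mine, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_i p(i|t)^2, computed from the class labels in node t."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Yes"] * 10))              # 0.0 -- all observations in one class
print(gini(["Yes"] * 5 + ["No"] * 5))  # 0.5 -- the maximum for two classes
```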

7 Entropy
• Average amount of information contained
• From another point of view: average amount of information expected, hence amount of uncertainty
  – We will study this in more detail later
• Entropy: Entropy(t) = − Σ_{i=1}^{C} p(i|t) log₂ p(i|t), where 0 log₂ 0 is defined to be 0
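A corresponding sketch for entropy over a node's labels; classes that do not occur in the node simply contribute nothing, consistent with the 0 log₂ 0 = 0 convention:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) * log2 p(i|t) over the classes present in node t."""
    n = len(labels)
    probs = (count / n for count in Counter(labels).values())
    return -sum(p * math.log2(p) for p in probs)

print(entropy(["Yes"] * 10))              # 0.0 -- no uncertainty
print(entropy(["Yes"] * 5 + ["No"] * 5))  # 1.0 -- maximum for two classes
```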

8 Classification Error
• What if we stop the tree building at a node?
  – That is, do not create any further branches for that node
  – Make that node a leaf
  – Classify the node with the most frequent class present in the node
• Classification error as a measure of impurity: Error(t) = 1 − max_i p(i|t)
  – Intuitively, the fraction of the rectangle (node) that does not belong to its most frequent class
[Figure: a rectangle (node) that is still impure]
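And the classification-error measure, again as a small sketch over a node's labels:

```python
from collections import Counter

def classification_error(labels):
    """Error(t) = 1 - max_i p(i|t): the fraction of the node that would be
    misclassified if we labelled it with its most frequent class."""
    n = len(labels)
    most_frequent_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - most_frequent_count / n

print(classification_error(["Yes"] * 7 + ["No"] * 3))  # 0.3
```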

9 The Full Blown Tree
• Recursive splitting
• Suppose we don't stop until all nodes are pure
• Result: a large decision tree with leaf nodes having very few data points
  – Does not represent the classes well
  – Overfitting
• Solution:
  – Stop earlier, or
  – Prune back the tree
[Figure: a fully grown tree – the root has 1000 points, which get split into smaller and smaller nodes (400, 600, 200, 240, 160, …) until some leaves contain only 1, 2 or 5 points; such small numbers of points are statistically not significant]
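To make the "stop earlier" option concrete, here is a compact sketch of recursive splitting with two simple stopping rules (maximum depth and minimum node size); the Gini-based split choice, the dict representation of nodes, and all names and defaults are illustrative choices, not taken from the slides:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3, min_size=5):
    """Recursive splitting with early stopping (max depth, minimum node size)
    instead of growing until every node is pure."""
    majority = Counter(labels).most_common(1)[0][0]
    if gini(labels) == 0.0 or depth >= max_depth or len(rows) < min_size:
        return {"leaf": True, "label": majority, "freq": len(labels)}
    best = None  # (weighted impurity, feature index, threshold)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    if best is None:  # all points identical: no split possible
        return {"leaf": True, "label": majority, "freq": len(labels)}
    _, j, t = best
    left_idx = [i for i, r in enumerate(rows) if r[j] < t]
    right_idx = [i for i, r in enumerate(rows) if r[j] >= t]
    return {"leaf": False, "feature": j, "threshold": t,
            "left": build_tree([rows[i] for i in left_idx],
                               [labels[i] for i in left_idx],
                               depth + 1, max_depth, min_size),
            "right": build_tree([rows[i] for i in right_idx],
                                [labels[i] for i in right_idx],
                                depth + 1, max_depth, min_size)}

# Toy training set: (monthly income, age) -> owns a house?
rows = [(60, 25), (80, 40), (80, 60), (105, 30), (120, 55), (95, 58)]
labels = ["No", "No", "Yes", "Yes", "Yes", "Yes"]
print(build_tree(rows, labels, min_size=2))
```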

10 Prune back
• Pruning step: collapse leaf nodes and make their immediate parent a leaf node
• Effect of pruning
  – Lose purity of nodes
  – But were they really pure, or was that noise?
  – Too many nodes ≈ noise
• Trade-off between loss of purity and reduction in complexity
[Figure: pruning example – a decision node (Freq = 7) with two leaves, label Y (Freq = 5) and label B (Freq = 2), is collapsed into a single leaf with label Y (Freq = 7)]
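A tiny sketch of the collapse step using the slide's numbers; the dict-based node representation follows the sketch above and is mine, not the slides':

```python
from collections import Counter

def prune_to_leaf(labels_in_node):
    """Pruning step: the subtree is replaced by a single leaf labelled with
    the most frequent class among the training points reaching that node."""
    majority = Counter(labels_in_node).most_common(1)[0][0]
    return {"leaf": True, "label": majority, "freq": len(labels_in_node)}

# The slide's example: a decision node covering 7 points, with leaves
# Y (freq = 5) and B (freq = 2), is collapsed into one leaf Y (freq = 7).
subtree = {"leaf": False,
           "left":  {"leaf": True, "label": "Y", "freq": 5},
           "right": {"leaf": True, "label": "B", "freq": 2}}
subtree = prune_to_leaf(["Y"] * 5 + ["B"] * 2)
print(subtree)  # {'leaf': True, 'label': 'Y', 'freq': 7}
```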

11 Prune back: cost complexity
• Cost complexity of a (sub)tree T: CC(T) = Err(T) + α × L(T)
  – Classification error (based on training data) plus a penalty for the size of the tree
  – Err(T) is the classification error
  – L(T) = number of leaves in T
  – Penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree
[Figure: the same pruning example as on the previous slide]
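A small numeric sketch of the trade-off, assuming the additive form CC(T) = Err(T) + α × L(T) written above; the error rates, leaf counts and α value are made up for illustration:

```python
def cost_complexity(err, num_leaves, alpha):
    """CC(T) = Err(T) + alpha * L(T)."""
    return err + alpha * num_leaves

# Hypothetical comparison: full tree vs. a pruned subtree.
full_tree   = cost_complexity(err=0.10, num_leaves=12, alpha=0.02)  # 0.34
pruned_tree = cost_complexity(err=0.15, num_leaves=4,  alpha=0.02)  # 0.23
print(full_tree, pruned_tree)  # the pruned tree has lower cost despite higher error
```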

12 Different Decision Tree Algorithms
• Chi-square Automatic Interaction Detector (CHAID)
  – Gordon Kass (1980)
  – Stop subtree creation if not statistically significant by the chi-square test
• Classification and Regression Trees (CART)
  – Breiman et al.
  – Decision tree building by the Gini index
• Iterative Dichotomizer 3 (ID3)
  – Ross Quinlan (1986)
  – Splitting by information gain (difference in entropy)
• C4.5
  – Quinlan's next algorithm, improved over ID3
  – Bottom-up pruning, both categorical and continuous variables
  – Handling of incomplete data points
• C5.0
  – Ross Quinlan's commercial version

13 Properties of Decision Trees
• Non-parametric approach
  – Does not require any prior assumptions regarding the probability distribution of the class and attributes
• Finding an optimal decision tree is an NP-complete problem
  – Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
• Fast to generate, fast to classify
• Easy to interpret or visualize
• Error propagation
  – An error at the top of the tree propagates all the way down

14 References
• Introduction to Data Mining, by Tan, Steinbach, Kumar
  – Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf

