Published by Lauren Blaxton. Modified about 1 year ago.

2
Decision Tree Classification
Prof. Navneet Goyal, BITS, Pilani
BITS C464 – Machine Learning

3
General Approach Figure taken from text book (Tan, Steinbach, Kumar)

4
Classification by Decision Tree Induction
- A decision tree is a classification scheme
- It represents a model of the different classes
- It generates a tree and a set of rules
- A node without children is a leaf node; otherwise it is an internal node
- Each internal node has an associated splitting predicate, e.g. a binary predicate
- Example predicates: Age <= 20; Profession in {student, teacher}; 5000*Age + 3*Salary - 10000 > 0

5
Decision tree
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers
Use of a decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree

6
Classification by Decision Tree Induction
Decision tree classifiers are very popular. Why?
- They do not require domain knowledge or parameter setting, and are therefore suitable for exploratory knowledge discovery
- DTs can handle high-dimensional data
- Representation of acquired knowledge in tree form is intuitive and easy for humans to assimilate
- Learning and classification steps are simple and fast
- Good accuracy

7
Classification by Decision Tree Induction
Main Algorithms: Hunt’s algorithm, ID3, C4.5, CART, SLIQ, SPRINT

8
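Example of a Decision Tree
Training data: Refund and MarSt are categorical attributes, TaxInc is continuous, and the last column is the class.
Model: decision tree with splitting attributes Refund, MarSt, TaxInc
Refund
  Yes -> NO
  No -> MarSt
    Married -> NO
    Single, Divorced -> TaxInc
      < 80K -> NO
      > 80K -> YES
Figure taken from text book (Tan, Steinbach, Kumar)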

9
Another Example of Decision Tree
MarSt
  Married -> NO
  Single, Divorced -> Refund
    Yes -> NO
    No -> TaxInc
      < 80K -> NO
      > 80K -> YES
There could be more than one tree that fits the same data!
Figure taken from text book (Tan, Steinbach, Kumar)

10
Some Questions
- Which tree is better, and why?
- How many decision trees are there?
- How do we find the optimal tree? Is it computationally feasible? (Try constructing a suboptimal tree in a reasonable amount of time: greedy algorithm)
- What should be the order of the splits?
Look for answers in the “20 questions” and “Guess Who” games!

11
Apply Model to Test Data
Start from the root of the tree.
(Figure: the decision tree on Refund, MarSt and TaxInc applied to a test record. Figure taken from text book (Tan, Steinbach, Kumar))

16
Apply Model to Test Data
Assign Cheat to “No”.
(Figure: the test record reaches a leaf of the decision tree on Refund, MarSt and TaxInc. Figure taken from text book (Tan, Steinbach, Kumar))

17
Decision Trees: Example
Training Data Set
Outlook   Temp  Humidity  Windy  Class
Sunny     79    90        True   No Play
Sunny     56    70        False  Play
Sunny     79    75        True   Play
Sunny     60    90        True   No Play
Overcast  -     88        False  Play
Overcast  63    75        True   Play
Overcast  88    95        False  Play
Rain      78    60        False  Play
Rain      66    70        False  No Play
Rain      68    60        True   No Play
(- : value missing in the source)
Numerical Attributes: Temperature, Humidity
Categorical Attributes: Outlook, Windy
Class label: Class

18
Decision Trees: Example
Sample Decision Tree
Outlook
  sunny -> Humidity
    <= 75 -> Play
    > 75 -> No Play
  overcast -> Play
  rain -> Windy
    true -> No Play
    false -> Play
Five leaf nodes: each represents a rule

19
Rules corresponding to the given tree
1. If it is a sunny day and humidity is not above 75%, then play.
2. If it is a sunny day and humidity is above 75%, then do not play.
3. If it is overcast, then play.
4. If it is rainy and not windy, then play.
5. If it is rainy and windy, then do not play.
Is it the best classification?

20
Decision Trees: Example
- Classification of a new record: outlook = rain, temp = 70, humidity = 65, windy = true. Class: “No Play”
- The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified

21
Decision Trees: Example
Test Data Set
Outlook   Temp  Humidity  Windy  Class
Sunny     79    90        True   Play
Sunny     56    70        False  Play
Sunny     79    75        True   No Play
Sunny     60    90        True   No Play
Overcast  -     88        False  No Play
Overcast  63    75        True   Play
Overcast  88    95        False  Play
Rain      78    60        False  Play
Rain      66    70        False  No Play
Rain      68    60        True   Play
(- : value missing in the source)
Rule 1: sunny, humidity <= 75: two records, one correctly classified. Accuracy = 50%
Rule 2: sunny, humidity > 75. Accuracy = 50%
Rule 3: overcast. Accuracy = 66%
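The per-rule accuracies can be checked with a short Python sketch. It assumes the test rows are read as (outlook, humidity, windy, class), leaves out temperature because the sample tree never tests it, and uses function names (`classify`, `rule_accuracy`) that are illustrative rather than part of the slides.

```python
# Test-set rows as (outlook, humidity, windy, actual class).
# Temperature is omitted: the sample tree never tests it.
TEST_SET = [
    ("sunny", 90, True, "Play"),
    ("sunny", 70, False, "Play"),
    ("sunny", 75, True, "No Play"),
    ("sunny", 90, True, "No Play"),
    ("overcast", 88, False, "No Play"),  # source shows a single number here; unused by rule 3
    ("overcast", 75, True, "Play"),
    ("overcast", 95, False, "Play"),
    ("rain", 60, False, "Play"),
    ("rain", 70, False, "No Play"),
    ("rain", 60, True, "Play"),
]


def classify(outlook, humidity, windy):
    """Return (rule number, predicted class) using the five rules of the sample tree."""
    if outlook == "sunny":
        return (1, "Play") if humidity <= 75 else (2, "No Play")
    if outlook == "overcast":
        return (3, "Play")
    return (5, "No Play") if windy else (4, "Play")


def rule_accuracy(rows):
    """Map each rule number to (correctly classified, covered) record counts."""
    stats = {}
    for outlook, humidity, windy, actual in rows:
        rule, predicted = classify(outlook, humidity, windy)
        correct, covered = stats.get(rule, (0, 0))
        stats[rule] = (correct + int(predicted == actual), covered + 1)
    return stats


if __name__ == "__main__":
    for rule, (correct, covered) in sorted(rule_accuracy(TEST_SET).items()):
        print(f"Rule {rule}: {correct}/{covered} correct")
```

The same function also covers rules 4 and 5, which the slide does not score.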

22
Practical Issues of Classification Underfitting and Overfitting Missing Values Costs of Classification

23
Overfitting the Data
A classification model commits two kinds of errors:
- Training Errors (TE), also called resubstitution or apparent errors
- Generalization Errors (GE)
A good classification model must have low TE as well as low GE. A model that fits the training data too well can have a higher GE than a model with a higher TE. This problem is known as model overfitting.

24
Underfitting and Overfitting
Underfitting: when the model is too simple, both training and test errors are large. TE and GE are large when the tree is very small. This occurs because the model has yet to learn the true structure of the data, so it performs poorly on both the training and test sets.
Figure taken from text book (Tan, Steinbach, Kumar)

25
Overfitting the Data When a decision tree is built, many of the branches may reflect anomalies in the training data due to noise or outliers. We may grow the tree just deeply enough to perfectly classify the training data set. This problem is known as overfitting the data.

26
Overfitting the Data
- TE of a model can be reduced by increasing the model complexity
- Leaf nodes of the tree can be expanded until the tree perfectly fits the training data
- TE for such a complex tree = 0
- GE can be large because the tree may accidentally fit noise points in the training set
- Overfitting and underfitting are two pathologies related to model complexity

27
Occam’s Razor Given two models of similar generalization errors, one should prefer the simpler model over the more complex model For complex models, there is a greater chance that it was fitted accidentally by errors in data Therefore, one should include model complexity when evaluating a model

28
Definition A decision tree T is said to overfit the training data if there exists some other tree T’ which is a simplification of T, such that T has smaller error than T’ over the training set but T’ has a smaller error than T over the entire distribution of the instances.

29
Problems of Overfitting Overfitting can lead to many difficulties: Overfitted models are incorrect. Require more space and more computational resources Require collection of unnecessary features They are more difficult to comprehend

30
Overfitting Overfitting can be due to: 1. Presence of Noise 2. Lack of representative samples

31
Overfitting: Example
Presence of Noise: Training Set
Name           Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Porcupine      Warm-blooded      Y            Y         Y           Y
Cat            Warm-blooded      Y            Y         N           Y
Bat            Warm-blooded      Y            N         Y           N*
Whale          Warm-blooded      Y            N         N           N*
Salamander     Cold-blooded      N            Y         Y           N
Komodo dragon  Cold-blooded      N            Y         N           N
Python         Cold-blooded      N            N         Y           N
Salmon         Cold-blooded      N            N         N           N
Eagle          Warm-blooded      N            N         N           N
Guppy          Cold-blooded      Y            N         N           N
(* marks a mislabeled record, i.e. noise)
Table taken from text book (Tan, Steinbach, Kumar)

33
Overfitting: Example
Presence of Noise: Test Set
Name            Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Human           Warm-blooded      Y            N         N           Y
Pigeon          Warm-blooded      N            N         N           N
Elephant        Warm-blooded      Y            Y         N           Y
Leopard Shark   Cold-blooded      Y            N         N           N
Turtle          Cold-blooded      N            Y         N           N
Penguin         Cold-blooded      N            N         N           N
Eel             Cold-blooded      N            N         N           N
Dolphin         Warm-blooded      Y            N         N           Y
Spiny Anteater  Warm-blooded      N            Y         Y           Y
Gila Monster    Cold-blooded      N            Y         Y           N
Table taken from text book (Tan, Steinbach, Kumar)

34
Overfitting: Example
Presence of Noise: Models
Model M1 (TE = 0%, GE = 30%. Find out why?)
Body Temp
  Warm blooded -> Gives Birth
    Yes -> 4-legged
      Yes -> Mammals
      No -> Non-mammals
    No -> Non-mammals
  Cold blooded -> Non-mammals
Model M2 (TE = 20%, GE = 10%)
Body Temp
  Warm blooded -> Gives Birth
    Yes -> Mammals
    No -> Non-mammals
  Cold blooded -> Non-mammals
Figure taken from text book (Tan, Steinbach, Kumar)
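The TE and GE figures for M1 and M2 can be reproduced with a small Python sketch. It assumes the training and test tables of this example are encoded as booleans (warm-blooded, gives birth, 4-legged, hibernates, is mammal); the function names `m1`, `m2` and `error_rate` are illustrative only.

```python
# (name, warm_blooded, gives_birth, four_legged, hibernates, is_mammal)
TRAIN = [
    ("porcupine", True, True, True, True, True),
    ("cat", True, True, True, False, True),
    ("bat", True, True, False, True, False),      # mislabeled record (noise)
    ("whale", True, True, False, False, False),   # mislabeled record (noise)
    ("salamander", False, False, True, True, False),
    ("komodo dragon", False, False, True, False, False),
    ("python", False, False, False, True, False),
    ("salmon", False, False, False, False, False),
    ("eagle", True, False, False, False, False),
    ("guppy", False, True, False, False, False),
]
TEST = [
    ("human", True, True, False, False, True),
    ("pigeon", True, False, False, False, False),
    ("elephant", True, True, True, False, True),
    ("leopard shark", False, True, False, False, False),
    ("turtle", False, False, True, False, False),
    ("penguin", False, False, False, False, False),
    ("eel", False, False, False, False, False),
    ("dolphin", True, True, False, False, True),
    ("spiny anteater", True, False, True, True, True),
    ("gila monster", False, False, True, True, False),
]


def m1(warm, birth, four_legged, hibernates):
    """Model M1: mammal only if warm-blooded, gives birth and 4-legged."""
    return warm and birth and four_legged


def m2(warm, birth, four_legged, hibernates):
    """Model M2: mammal if warm-blooded and gives birth."""
    return warm and birth


def error_rate(model, rows):
    """Fraction of rows whose predicted label differs from the recorded label."""
    wrong = sum(model(*features) != label for _, *features, label in rows)
    return wrong / len(rows)


if __name__ == "__main__":
    for name, model in [("M1", m1), ("M2", m2)]:
        print(name, "TE =", error_rate(model, TRAIN), "GE =", error_rate(model, TEST))
```

M1 reaches zero training error only by fitting the two mislabeled records through the extra 4-legged test, and that test then misclassifies human and dolphin in the test set.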

35
Overfitting: Example
Lack of representative samples: Training Set
Name        Body Temperature  Gives Birth  4-legged  Hibernates  Class Label (mammal)
Salamander  Cold-blooded      N            Y         Y           N
Eagle       Warm-blooded      N            N         N           N
Guppy       Cold-blooded      Y            N         N           N
Poorwill    Warm-blooded      N            N         Y           N
Platypus    Warm-blooded      N            Y         Y           Y
Table taken from text book (Tan, Steinbach, Kumar)

36
Overfitting: Example
Lack of representative samples: Model
Model M3 (TE = 0%, GE = 30%. Find out why?)
Body Temp
  Warm blooded -> Hibernates
    Yes -> 4-legged
      Yes -> Mammals
      No -> Non-mammals
    No -> Non-mammals
  Cold blooded -> Non-mammals
Figure taken from text book (Tan, Steinbach, Kumar)

37
Overfitting due to Noise Decision boundary is distorted by noise point Figure taken from text book (Tan, Steinbach, Kumar)

38
Overfitting due to Insufficient Examples Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region - Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task Figure taken from text book (Tan, Steinbach, Kumar)

39
How to Address Overfitting
Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully grown tree
- Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same
- More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using the $\chi^2$ test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

40
How to Address Overfitting…
Post-pruning
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If generalization error improves after trimming, replace the sub-tree by a leaf node
- The class label of the leaf node is determined from the majority class of instances in the sub-tree
- Can use MDL for post-pruning

41
Post-pruning
The post-pruning approach removes branches of a fully grown tree. Subtree replacement replaces a subtree with a single leaf node.
(Figure: a tree with an Alt node (Yes / No) above a Price node ($, $$, $$$), before and after subtree replacement)

42
Post-pruning
Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent.
(Figure: a tree with nodes Alt, Res and Price, with branches Yes / No and $, $$, $$$, before and after subtree raising)

44
Post-pruning: Techniques
- Cost complexity pruning: a pruning operation is performed if it does not increase the estimated error rate. Error on the training data is not a useful estimator here (it would result in almost no pruning).
- Minimum description length: the best tree is the one that can be encoded using the fewest bits. The challenge for the pruning phase is to find the subtree that can be encoded with the fewest bits.

45
Hunt’s Algorithm
Let $D_t$ be the set of training records that reach a node $t$, and let $y = \{y_1, y_2, \ldots, y_c\}$ be the class labels.
- Step 1: If $D_t$ contains records that all belong to the same class $y_t$, then $t$ is a leaf node labeled $y_t$. If $D_t$ is an empty set, then $t$ is a leaf node labeled with the default class $y_d$.
- Step 2: If $D_t$ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each child node.
Figure taken from text book (Tan, Steinbach, Kumar)

46
Hunt’s Algorithm
(Figure: successive steps of Hunt’s algorithm on the Cheat example: a single leaf labeled Don’t Cheat; a split on Refund (Yes / No); a further split on Marital Status (Single, Divorced / Married); a final split on Taxable Income (< 80K / >= 80K). Figure taken from text book (Tan, Steinbach, Kumar))

47
Hunt’s Algorithm
Should handle the following additional conditions:
- Child nodes created in Step 2 are empty. When can this happen? Declare the node a leaf node (majority class label of the training records of the parent node).
- In Step 2, if all the records associated with $D_t$ have identical attributes (except for the class label), then it is not possible to split these records further. Declare the node a leaf with the same class label as the majority class of the training records associated with this node.
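A minimal Python sketch of Hunt’s algorithm, including the two additional conditions above. It rests on several assumptions that are not in the slides: attributes are categorical with multiway splits, the attribute test is chosen by lowest weighted Gini, and `domains` (all values each attribute may take) is a hypothetical helper input that lets empty children arise.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]


def hunt(records, labels, attributes, domains, default=None):
    """Return a class label (leaf) or a pair (attribute, {value: subtree}).

    records: list of dicts; labels: parallel list of class labels;
    domains: attribute -> all values it may take (assumed input).
    """
    if not records:                    # empty child: majority class of the parent
        return default
    if len(set(labels)) == 1:          # Step 1: all records in one class
        return labels[0]
    # Attributes that still separate at least two records.
    usable = [a for a in attributes if len({r[a] for r in records}) > 1]
    if not usable:                     # identical attribute values: majority leaf
        return majority(labels)

    def weighted_gini(attr):
        groups = {}
        for record, label in zip(records, labels):
            groups.setdefault(record[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(usable, key=weighted_gini)   # Step 2: attribute test (assumed criterion)
    rest = [a for a in attributes if a != best]
    children = {}
    for value in domains[best]:
        idx = [i for i, r in enumerate(records) if r[best] == value]
        children[value] = hunt([records[i] for i in idx], [labels[i] for i in idx],
                               rest, domains, default=majority(labels))
    return (best, children)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the slides.
    recs = [{"outlook": "sunny", "windy": True}, {"outlook": "sunny", "windy": False},
            {"outlook": "rain", "windy": True}, {"outlook": "rain", "windy": False}]
    labs = ["no", "no", "no", "yes"]
    doms = {"outlook": ["sunny", "overcast", "rain"], "windy": [True, False]}
    print(hunt(recs, labs, ["outlook", "windy"], doms))
```

In the toy run no training record has outlook = overcast, so that child is empty and receives the majority class of its parent, which is the first additional condition.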

48
Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting

49
Hunt’s Algorithm
Design Issues of Decision Tree Induction
How should the training records be split?
- At each recursive step, an attribute test condition must be selected
- The algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition
How should the splitting procedure stop?
- A stopping condition is needed to terminate the tree-growing process. Stop when:
- all records belong to the same class
- all records have identical values
- Both conditions are sufficient to stop any DT induction algorithm; other criteria can be imposed to terminate the procedure early (do we need to do this? Think of model over-fitting!)

50
How to determine the Best Split? Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)

51
How to determine the Best Split?
- Greedy approach: nodes with homogeneous class distribution are preferred
- Need a measure of node impurity: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)

52
Measures of Node Impurity
- Based on the degree of impurity of the child nodes
- The less the impurity, the more skewed the class distribution
- A node with class distribution (1, 0) has zero impurity, whereas a node with class distribution (0.5, 0.5) has the highest impurity
- Measures: Gini Index, Entropy, Misclassification error

53
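Measures of Node Impurity
Gini Index: $\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$
Entropy: $\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$
Misclassification error: $\mathrm{Error}(t) = 1 - \max_j p(j \mid t)$
Here $p(j \mid t)$ is the relative frequency of class $j$ at node $t$.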

54
Comparison among Splitting Criteria
For a 2-class problem:
(Figure: curves of the three impurity measures; y-axis: Impurity. Figure taken from text book (Tan, Steinbach, Kumar))

55
How to find the best split?
Example:
Node  C0  C1  GINI   ENTROPY  ERROR
N1    0   6   0      0        0
N2    1   5   0.278  0.650    0.167
N3    3   3   0.5    1        0.5
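The three impurity values for N1, N2 and N3 can be recomputed from the class counts with a few lines of Python. This is a sketch using the standard definitions of the measures; the function names are illustrative, and entropy uses base-2 logarithms with 0 log 0 taken as 0.

```python
from math import log2


def class_probs(counts):
    """Relative class frequencies at a node, given per-class record counts."""
    total = sum(counts)
    return [c / total for c in counts]


def gini(counts):
    return 1.0 - sum(p * p for p in class_probs(counts))


def entropy(counts):
    # 0 * log2(0) is taken as 0, so empty classes are skipped.
    return sum(-p * log2(p) for p in class_probs(counts) if p > 0)


def classification_error(counts):
    return 1.0 - max(class_probs(counts))


if __name__ == "__main__":
    for name, counts in [("N1", (0, 6)), ("N2", (1, 5)), ("N3", (3, 3))]:
        print(name,
              round(gini(counts), 3),
              round(entropy(counts), 3),
              round(classification_error(counts), 3))
```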

56
How to find the best split?
- The 3 measures have similar characteristic curves
- Despite this, the attribute chosen as the test condition may vary depending on the choice of the impurity measure
- Need to normalize these measures!
Introducing GAIN:
$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)$$
where $I$ = impurity measure of a given node, $N$ = total no. of records at the parent node, $k$ = no. of attribute values, $N(v_j)$ = no. of records associated with the child node $v_j$
- $I(\text{parent})$ is the same for all test conditions
- When entropy is used, it is called Information Gain, $\Delta_{info}$
- The larger the gain, the better the split
- Is it the best measure?

57
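How to Find the Best Split?
Before splitting, the parent node has impurity M0.
- Split on A? (Yes / No): Node N1 and Node N2 with impurities M1 and M2, combined into M12
- Split on B? (Yes / No): Node N3 and Node N4 with impurities M3 and M4, combined into M34
Gain = M0 – M12 vs M0 – M34
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)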

58
Measure of Impurity: GINI
Gini Index for a given node $t$:
$$\mathrm{GINI}(t) = 1 - \sum_j [p(j \mid t)]^2$$
(NOTE: $p(j \mid t)$ is the relative frequency of class $j$ at node $t$.)
- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)

59
Examples for computing GINI
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = $1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0$
- P(C1) = 1/6, P(C2) = 5/6: Gini = $1 - (1/6)^2 - (5/6)^2 = 0.278$
- P(C1) = 2/6, P(C2) = 4/6: Gini = $1 - (2/6)^2 - (4/6)^2 = 0.444$

60
Splitting Based on GINI
Used in CART, SLIQ, SPRINT. When a node $p$ is split into $k$ partitions (children), the quality of the split is computed as
$$\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \mathrm{GINI}(i)$$
where $n_i$ = number of records at child $i$, and $n$ = number of records at node $p$.

61
Binary Attributes: Computing GINI Index
- Splits into two partitions
- Effect of weighting partitions: larger and purer partitions are sought
Split on A? (Yes -> Node N1, No -> Node N2)
Gini(N1) = $1 - (4/7)^2 - (3/7)^2 = 0.490$
Gini(N2) = $1 - (2/5)^2 - (3/5)^2 = 0.480$
Gini(Children) = $7/12 \times 0.490 + 5/12 \times 0.480 = 0.486$

62
Binary Attributes: Computing GINI Index
- Splits into two partitions
- Effect of weighting partitions: larger and purer partitions are sought
Split on B? (Yes -> Node N1, No -> Node N2)
Gini(N1) = $1 - (5/7)^2 - (2/7)^2 = 0.408$
Gini(N2) = $1 - (1/5)^2 - (4/5)^2 = 0.320$
Gini(Children) = $7/12 \times 0.408 + 5/12 \times 0.320 = 0.371$
Attribute B is preferred over A
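The comparison of the two binary splits can be sketched in Python as below. The child class counts are read off the fractions in the two Gini computations, (4, 3) and (2, 3) for A, (5, 2) and (1, 4) for B; which count belongs to which class is an assumption that does not affect the result. `gini_split` is an illustrative name for the weighted Gini of the children.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(children):
    """Weighted Gini of the children: sum over children of (n_i / n) * Gini(child i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


SPLIT_A = [(4, 3), (2, 3)]   # class counts in N1, N2 for attribute A
SPLIT_B = [(5, 2), (1, 4)]   # class counts in N1, N2 for attribute B

if __name__ == "__main__":
    for name, split in [("A", SPLIT_A), ("B", SPLIT_B)]:
        parent = tuple(sum(col) for col in zip(*split))   # counts before splitting
        gain = gini(parent) - gini_split(split)
        print(name, round(gini_split(split), 3), "gain:", round(gain, 3))
```

Both splits start from the same 12-record parent, so ranking them by weighted Gini or by gain gives the same order.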

63
Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions
- Multi-way split, or two-way split (find the best partition of values)
- GINI favours multiway splits!

64
Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value
- Several choices for the splitting value: number of possible splitting values = number of distinct values
- Each splitting value $v$ has a count matrix associated with it: class counts in each of the partitions, $A < v$ and $A \ge v$
- Simple method to choose the best $v$: for each $v$, scan the database to gather the count matrix and compute its Gini index
- Computationally inefficient! Repetition of work.

65
Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
- Sort the attribute on values
- Linearly scan these values, each time updating the count matrix and computing the Gini index
- Choose the split position that has the least Gini index
(Figure: sorted values with candidate split positions)
Find the time complexity in terms of # records!
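The sort-then-scan procedure can be sketched in Python as follows. The data are illustrative and not taken from the slides, the candidate thresholds are assumed to be midpoints between consecutive distinct values, and `best_split` is a hypothetical name.

```python
def gini_from_counts(counts):
    """Gini index from a dict of class -> record count (0.0 for an empty partition)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold v, weighted Gini) of the best split into A < v and A >= v."""
    pairs = sorted(zip(values, labels))       # sort the attribute once
    right = {}
    for _, label in pairs:                    # initially every record is on the right
        right[label] = right.get(label, 0) + 1
    left = {label: 0 for label in right}
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1                      # update the count matrix incrementally
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:               # no split position between equal values
            continue
        n_left = i + 1
        score = ((n_left / n) * gini_from_counts(left)
                 + ((n - n_left) / n) * gini_from_counts(right))
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best


if __name__ == "__main__":
    # Illustrative incomes (in thousands) and class labels.
    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    print(best_split(income, cheat))
```

Each record moves from the right partition to the left exactly once, so no candidate split needs a fresh scan of the data.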

66
Alternative Splitting Criteria based on INFO
Entropy at a given node $t$:
$$\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$$
(NOTE: $p(j \mid t)$ is the relative frequency of class $j$ at node $t$.)
- Measures homogeneity of a node
- Maximum ($\log_2 n_c$) when records are equally distributed among all classes, implying least information
- Minimum (0.0) when all records belong to one class, implying most information
- Entropy based computations are similar to the GINI index computations

67
Examples for computing Entropy
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = $-0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$
- P(C1) = 1/6, P(C2) = 5/6: Entropy = $-(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$
- P(C1) = 2/6, P(C2) = 4/6: Entropy = $-(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92$

68
Splitting Based on INFO...
Information Gain:
$$\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \mathrm{Entropy}(i)$$
Parent node $p$ is split into $k$ partitions; $n_i$ is the number of records in partition $i$, and $n$ is the number of records at $p$.
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
- Used in ID3 and C4.5
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure

69
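Splitting Based on INFO...
Gain Ratio:
$$\mathrm{GainRATIO}_{split} = \frac{\mathrm{GAIN}_{split}}{\mathrm{SplitINFO}}, \qquad \mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$
Parent node $p$ is split into $k$ partitions; $n_i$ is the number of records in partition $i$, and $n$ is the number of records at $p$.
- Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher entropy partitioning (large number of small partitions) is penalized!
- Used in C4.5
- Designed to overcome the disadvantage of Information Gain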
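A short Python sketch of information gain and gain ratio, assuming the usual definitions (gain = parent entropy minus weighted child entropy; SplitINFO = entropy of the partition sizes). The two example splits are hypothetical: a two-way split of a 12-record node and a 12-way split into single-record partitions, the kind of split an ID-like attribute would produce.

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a list of counts; zero counts are skipped."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


def split_info(children):
    """Entropy of the partitioning itself, computed from the partition sizes."""
    return entropy([sum(ch) for ch in children])


def gain_ratio(parent, children):
    return information_gain(parent, children) / split_info(children)


PARENT = (6, 6)
TWO_WAY = [(5, 2), (1, 4)]                 # hypothetical binary split
MANY_WAY = [(1, 0)] * 6 + [(0, 1)] * 6     # hypothetical split into 12 pure singletons

if __name__ == "__main__":
    for name, split in [("two-way", TWO_WAY), ("12-way", MANY_WAY)]:
        print(name,
              "gain:", round(information_gain(PARENT, split), 3),
              "split info:", round(split_info(split), 3),
              "gain ratio:", round(gain_ratio(PARENT, split), 3))
```

The 12-way split reaches the maximum possible gain because every partition is pure, but its SplitINFO of log2(12) divides that gain down sharply.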

70
Splitting Criteria based on Classification Error
Classification error at a node $t$:
$$\mathrm{Error}(t) = 1 - \max_i P(i \mid t)$$
- Measures the misclassification error made by a node
- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information

71
Examples for Computing Error P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 Error = 1 – max (0, 1) = 1 – 1 = 0 P(C1) = 1/6 P(C2) = 5/6 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6 P(C1) = 2/6 P(C2) = 4/6 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

72
Misclassification Error vs Gini
Split on A? (Yes -> Node N1, No -> Node N2)
Gini(N1) = $1 - (3/3)^2 - (0/3)^2 = 0$
Gini(N2) = $1 - (4/7)^2 - (3/7)^2 = 0.490$
Gini(Children) = $3/10 \times 0 + 7/10 \times 0.490 = 0.343$
Gini improves!
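The contrast between the two measures can be checked numerically. In this Python sketch the child class counts (3, 0) and (4, 3) are read from the Gini computations, and the parent counts (7, 3) are derived by adding the children; the helper names are illustrative.

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def error(counts):
    """Misclassification error: 1 minus the largest class proportion."""
    return 1.0 - max(counts) / sum(counts)


def weighted(measure, children):
    """Size-weighted average of an impurity measure over the child nodes."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


CHILDREN = [(3, 0), (4, 3)]                              # class counts in N1 and N2
PARENT = tuple(sum(col) for col in zip(*CHILDREN))       # (7, 3), derived from the children

if __name__ == "__main__":
    print("Gini : parent", round(gini(PARENT), 3),
          "children", round(weighted(gini, CHILDREN), 3))
    print("Error: parent", round(error(PARENT), 3),
          "children", round(weighted(error, CHILDREN), 3))
```

The weighted Gini of the children falls below the parent's value, while the weighted misclassification error stays where it was, so error alone would see no benefit in this split.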

73
Decision Tree Based Classification Advantages: Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets

74
Example: C4.5
- Simple depth-first construction
- Uses Information Gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large datasets: needs out-of-core sorting
You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

75
Example Web robot or crawler Based on access patterns, distinguish between human user and web robots Web Usage Mining BUILD a MODEL – use web log data Think of some more applications of classification!!

76
Summary: DT Classifiers
- Do not require any prior assumptions about the probability distributions satisfied by the classes
- Finding the optimal DT is NP-complete
- Construction of a DT is fast even for large data sets
- Testing is also fast: O(w), where w = max. depth of the tree
- Robust to noise
- Irrelevant attributes can cause problems (use feature selection)
- Data fragmentation problem (leaf nodes having very few records)
- Tree pruning has a greater impact on the final tree than the choice of impurity measure

77
Decision Boundary Border line between two neighboring regions of different classes is known as decision boundary Decision boundary is parallel to axes because test condition involves a single attribute at-a-time

78
Oblique Decision Trees
(Figure: the oblique test x + y < 1 separating Class = + from Class = -)
- Test condition may involve multiple attributes
- More expressive representation
- Finding the optimal test condition is computationally expensive

79
Tree Replication Same subtree appears in multiple branches Split using P redundant? Remove P in post pruning

80
Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.
Confusion Matrix:
                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d
- a: TP (true positive): predicted to be in YES, and is actually in it
- b: FN (false negative): predicted not to be in YES, but is actually in it
- c: FP (false positive): predicted to be in YES, but is not actually in it
- d: TN (true negative): predicted not to be in YES, and is not actually in it

81
Metrics for Performance Evaluation…
                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a (TP)     b (FN)
CLASS    Class=No    c (FP)     d (TN)
Most widely-used metric:
$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

82
Limitation of Accuracy Consider a 2-class problem Number of Class 0 examples = 9990 Number of Class 1 examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % Accuracy is misleading because model does not detect any class 1 example

83
Cost Matrix
                     PREDICTED CLASS
C(i|j)               Class=Yes   Class=No
ACTUAL   Class=Yes   C(Yes|Yes)  C(No|Yes)
CLASS    Class=No    C(Yes|No)   C(No|No)
C(i|j): cost of misclassifying a class j example as class i

84
Computing Cost of Classification
Cost matrix C(i|j) (rows: actual class, columns: predicted class)
       +     -
  +   -1   100
  -    1     0
Model M1 (rows: actual class, columns: predicted class)
       +     -
  +  150    40
  -   60   250
Accuracy = 80%, Cost = 3910
Model M2 (rows: actual class, columns: predicted class)
       +     -
  +  250    45
  -    5   200
Accuracy = 90%, Cost = 4255
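The accuracy and cost figures for M1 and M2 can be reproduced with a Python sketch. It assumes a cost matrix with C(+|+) = -1, C(-|+) = 100, C(+|-) = 1 and C(-|-) = 0, which is the reading consistent with the stated costs of 3910 and 4255; dictionaries are keyed by (actual class, predicted class), a layout chosen for this sketch.

```python
# Keys are (actual class, predicted class).
COST = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
M1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
M2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}


def accuracy(confusion):
    """Fraction of records on the diagonal of the confusion matrix."""
    correct = confusion[("+", "+")] + confusion[("-", "-")]
    return correct / sum(confusion.values())


def total_cost(confusion, cost):
    """Sum over all cells of (record count) * (cost of that outcome)."""
    return sum(confusion[cell] * cost[cell] for cell in confusion)


if __name__ == "__main__":
    for name, model in [("M1", M1), ("M2", M2)]:
        print(name, "accuracy:", accuracy(model), "cost:", total_cost(model, COST))
```

M2 is the more accurate model yet the more costly one, because its 45 false negatives are each charged 100.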

85
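Cost-Sensitive Measures
$$\text{Precision } (p) = \frac{a}{a + c}, \qquad \text{Recall } (r) = \frac{a}{a + b}, \qquad \text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$
where a = TP, b = FN, c = FP
- Precision is biased towards C(Yes|Yes) & C(Yes|No)
- Recall is biased towards C(Yes|Yes) & C(No|Yes)
- F-measure is biased towards all except C(No|No)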
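These measures can be sketched in Python using their standard definitions in terms of the confusion-matrix cells a (TP), b (FN) and c (FP). The counts below come from the confusion matrices of models M1 and M2 in the cost example; the function names are illustrative.

```python
def precision(tp, fn, fp):
    """a / (a + c): fraction of predicted positives that are actual positives."""
    return tp / (tp + fp)


def recall(tp, fn, fp):
    """a / (a + b): fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall, equal to 2a / (2a + b + c)."""
    return 2 * tp / (2 * tp + fn + fp)


# (TP, FN, FP) for the two models of the cost example.
MODELS = {"M1": (150, 40, 60), "M2": (250, 45, 5)}

if __name__ == "__main__":
    for name, cells in MODELS.items():
        print(name,
              "precision:", round(precision(*cells), 3),
              "recall:", round(recall(*cells), 3),
              "F:", round(f_measure(*cells), 3))
```

None of the three functions uses d (TN), which is why the F-measure is unaffected by C(No|No).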
