# 4/13/2017 Decision Tree Classification Prof. Navneet Goyal BITS, Pilani BITS C464 – Machine Learning Dr. Navneet Goyal, BITS,Pilani.


General Approach
[Figure: General Approach, taken from the textbook (Tan, Steinbach, Kumar)]

Classification by Decision Tree Induction
- A decision tree is a classification scheme
- It represents a model of the different classes
- It generates a tree and a set of rules
- A node without children is a leaf node; otherwise it is an internal node
- Each internal node has an associated splitting predicate, e.g. a binary predicate
- Example predicates:
  - Age <= 20
  - Profession in {student, teacher}
  - 5000*Age + 3*Salary – > 0

Classification by Decision Tree Induction
Decision tree:
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions

Decision tree generation consists of two phases:
- Tree construction
  - At the start, all the training examples are at the root
  - Examples are partitioned recursively based on selected attributes
- Tree pruning
  - Identify and remove branches that reflect noise or outliers

Use of a decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree

Classification by Decision Tree Induction
Decision tree classifiers are very popular. WHY?
- They do not require any domain knowledge or parameter setting, and are therefore suitable for exploratory knowledge discovery
- DTs can handle high-dimensional data
- Representation of acquired knowledge in tree form is intuitive and easy for humans to assimilate
- Learning and classification steps are simple and fast
- Good accuracy

Classification by Decision Tree Induction
Main Algorithms:
- Hunt’s algorithm
- ID3
- C4.5
- CART
- SLIQ, SPRINT

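Example of a Decision Tree
Training data attributes: Refund (categorical), MarSt (categorical), TaxInc (continuous), plus the class label.

Model: decision tree, with Refund, MarSt and TaxInc as the splitting attributes:
- Refund = Yes: NO
- Refund = No: test MarSt
  - Married: NO
  - Single, Divorced: test TaxInc
    - < 80K: NO
    - > 80K: YES

[Figure taken from the textbook (Tan, Steinbach, Kumar)]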

Another Example of Decision Tree
Attribute types: categorical, categorical, continuous, class.
- MarSt = Married: NO
- MarSt = Single, Divorced: test Refund
  - Yes: NO
  - No: test TaxInc
    - < 80K: NO
    - > 80K: YES

There could be more than one tree that fits the same data!
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Some Questions
- Which tree is better and why?
- How many decision trees are there?
- How do we find the optimal tree?
- Is it computationally feasible? (Try constructing a suboptimal tree in a reasonable amount of time: greedy algorithm)
- What should be the order of the splits?
- Look for answers in the “20 questions” & “Guess Who” games!

Apply Model to Test Data
- Start from the root of the tree.
- At each internal node, follow the branch that matches the test record's value for that attribute.
- In the figure, the test record reaches the Married branch of MarSt, which is a leaf labeled NO, so Cheat is assigned “No”.

Decision tree used in the figure:
- Refund = Yes: NO
- Refund = No: test MarSt
  - Married: NO
  - Single, Divorced: test TaxInc
    - < 80K: NO
    - > 80K: YES

[Figure taken from the textbook (Tan, Steinbach, Kumar)]
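The traversal from the root down to a leaf can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the dictionary keys, and the test record values are assumptions, and since the figure labels the income branches "< 80K" and "> 80K", sending exactly 80K to the upper branch is also an assumption.

```python
def classify(record):
    """Walk the Refund / MarSt / TaxInc tree and return the Cheat label."""
    if record["Refund"] == "Yes":
        return "No"                      # leaf under Refund = Yes
    if record["MarSt"] == "Married":
        return "No"                      # leaf under MarSt = Married
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K goes to the upper branch.
    return "No" if record["TaxInc"] < 80 else "Yes"


# Hypothetical test record (values assumed for illustration)
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))
```

The married record stops at the MarSt node, so TaxInc is never tested, which matches the final step on the slide.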

Decision Trees: Example
Training Data Set (from a given data set)
[Table: training records with columns Outlook, Temp, Humidity, Windy, Class]
- Numerical attributes: Temperature, Humidity
- Categorical attributes: Outlook, Windy
- Class: the class label (Play / No Play)

Decision Trees: Example
Sample Decision Tree (from a given data set)
- Outlook = sunny: test Humidity
  - <= 75: Play
  - > 75: No Play
- Outlook = overcast: Play
- Outlook = rain: test Windy
  - true: No Play
  - false: Play

Five leaf nodes; each represents a rule.

Decision Trees: Example
Rules corresponding to the given tree (from a given data set):
1. If it is a sunny day and humidity is not above 75%, then play.
2. If it is a sunny day and humidity is above 75%, then do not play.
3. If it is overcast, then play.
4. If it is rainy and not windy, then play.
5. If it is rainy and windy, then do not play.

Is it the best classification?

Decision Trees: Example
Classification of a new record (from a given data set)
- New record: outlook = rain, temp = 70, humidity = 65, windy = true
- Class: “No Play”
- The accuracy of the classifier is the percentage of the test data set that is correctly classified
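As a rough sketch, the five rules and the accuracy measure can be written directly in Python. The function names and the two sunny test records passed to `accuracy` are made up for illustration; only the new record (rain, humidity 65, windy) comes from the slide.

```python
def classify_play(outlook, humidity, windy):
    """Apply the five rules of the sample tree; temperature is never tested."""
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "No Play"
    if outlook == "overcast":
        return "Play"
    if outlook == "rain":
        return "No Play" if windy else "Play"
    raise ValueError(f"unknown outlook: {outlook}")


def accuracy(records):
    """records: list of (outlook, humidity, windy, true_label) tuples."""
    correct = sum(classify_play(o, h, w) == label for o, h, w, label in records)
    return correct / len(records)


# New record from the slide: outlook=rain, humidity=65, windy=true
print(classify_play("rain", 65, True))

# Two hypothetical sunny, low-humidity test records, one labeled "No Play"
print(accuracy([("sunny", 70, False, "Play"), ("sunny", 60, True, "No Play")]))
```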

Decision Trees: Example
Test Data Set (from a given data set)
[Table: test records with columns Outlook, Temp, Humidity, Windy, Class]
- Rule 1 (sunny and humidity <= 75): two records, one correctly classified. Accuracy = 50%
- Rule 2 (sunny and humidity > 75): Accuracy = 50%
- Rule 3 (overcast): Accuracy = 66%

Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification

Overfitting the Data
- A classification model commits two kinds of errors:
  - Training errors (TE), also called resubstitution or apparent errors
  - Generalization errors (GE)
- A good classification model must have low TE as well as low GE
- A model that fits the training data too well can have a higher GE than a model with a higher TE
- This problem is known as model overfitting

Underfitting and Overfitting
- Underfitting: when the model is too simple, both training and test errors are large
- TE and GE are both large when the tree is very small
- This occurs because the model has yet to learn the true structure of the data, so it performs poorly on both the training and test sets
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting the Data
- When a decision tree is built, many of the branches may reflect anomalies in the training data due to noise or outliers
- We may grow the tree just deeply enough to perfectly classify the training data set
- This problem is known as overfitting the data

Overfitting the Data
- The TE of a model can be reduced by increasing the model complexity
- Leaf nodes of the tree can be expanded until the tree perfectly fits the training data
- TE for such a complex tree = 0
- GE can be large because the tree may accidentally fit noise points in the training set
- Overfitting and underfitting are two pathologies related to model complexity

Occam’s Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
- For a complex model, there is a greater chance that it was fitted accidentally by errors in the data
- Therefore, one should include model complexity when evaluating a model

Definition
A decision tree T is said to overfit the training data if there exists some other tree T’ which is a simplification of T, such that T has a smaller error than T’ over the training set but T’ has a smaller error than T over the entire distribution of the instances.

Problems of Overfitting
Overfitting can lead to many difficulties:
- Overfitted models are incorrect
- They require more space and more computational resources
- They require collection of unnecessary features
- They are more difficult to comprehend

Overfitting
Overfitting can be due to:
- Presence of noise
- Lack of representative samples

Overfitting: Example
Presence of Noise: Training Set
[Table: columns Name, Body Temperature, Gives Birth, 4-legged, Hibernates, Class Label (mammal); records Porcupine, Cat, Bat, Whale, Salamander, Komodo dragon, Python, Salmon, Eagle, Guppy. Table taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting: Example
Presence of Noise: Test Set
[Table: columns Name, Body Temperature, Gives Birth, 4-legged, Hibernates, Class Label (mammal); records Human, Pigeon, Elephant, Leopard, Shark, Turtle, Penguin, Eel, Dolphin, Spiny Anteater, Gila Monster. Table taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting: Example
Presence of Noise: Models
- Model M1 (TE = 0%, GE = 30%):
  - Body Temp = Cold blooded: Non-mammals
  - Body Temp = Warm blooded: test Gives Birth
    - No: Non-mammals
    - Yes: test 4-legged
      - Yes: Mammals
      - No: Non-mammals
- Model M2 (TE = 20%, GE = 10%):
  - Body Temp = Cold blooded: Non-mammals
  - Body Temp = Warm blooded: test Gives Birth
    - Yes: Mammals
    - No: Non-mammals

Find out why!
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting: Example
Lack of representative samples: Training Set
[Table: columns Name, Body Temperature, Gives Birth, 4-legged, Hibernates, Class Label (mammal); records Salamander, Eagle, Guppy, Poorwill, Platypus. Table taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting: Example
Lack of representative samples: Model
- Model M3 (TE = 0%, GE = 30%):
  - Body Temp = Cold blooded: Non-mammals
  - Body Temp = Warm blooded: test Hibernates
    - No: Non-mammals
    - Yes: test 4-legged
      - Yes: Mammals
      - No: Non-mammals

Find out why!
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting due to Noise
The decision boundary is distorted by a noise point.
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Overfitting due to Insufficient Examples
- The lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Pre-Pruning (Early Stopping Rule)
- Stop the algorithm before it becomes a fully-grown tree
- Typical stopping conditions for a node:
  - Stop if all instances belong to the same class
  - Stop if all the attribute values are the same
- More restrictive conditions:
  - Stop if the number of instances is less than some user-specified threshold
  - Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
  - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

Post-pruning
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If the generalization error improves after trimming, replace the sub-tree by a leaf node
- The class label of the leaf node is determined from the majority class of instances in the sub-tree
- Can use MDL for post-pruning

Post-pruning
- The post-pruning approach removes branches of a fully grown tree
- Subtree replacement replaces a subtree with a single leaf node
[Figure: a tree rooted at Alt with a Price subtree (branches \$, \$\$, \$\$\$), before and after subtree replacement]

Post-pruning
- Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent
[Figure: a tree with nodes Alt, Price (branches \$, \$\$, \$\$\$) and Res, illustrating subtree raising]


Post-pruning: Techniques
- Cost complexity pruning algorithm: a pruning operation is performed if it does not increase the estimated error rate. Of course, error on the training data is not a useful estimator (it would result in almost no pruning).
- Minimum description length algorithm: states that the best tree is the one that can be encoded using the fewest bits. The challenge for the pruning phase is to find the subtree that can be encoded with the fewest bits.

Hunt’s Algorithm
Let Dt be the set of training records that reach a node t, and let y = {y1, y2, …, yc} be the class labels.
- Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled yt. If Dt is an empty set, then t is a leaf node labeled with the default class, yd.
- Step 2: If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each child node.
[Figure taken from the textbook (Tan, Steinbach, Kumar)]

Hunt’s Algorithm
[Figure: the tree grown step by step: a single leaf (Don’t Cheat); a split on Refund (Yes / No); a further split on Marital Status (Single, Divorced / Married); a further split on Taxable Income (< 80K / >= 80K). Figure taken from the textbook (Tan, Steinbach, Kumar)]

Hunt’s Algorithm
The algorithm should handle the following additional conditions:
- Child nodes created in step 2 are empty. When can this happen? Declare the node a leaf node, labeled with the majority class of the training records of the parent node.
- In step 2, if all the records associated with Dt have identical attribute values (except for the class label), then it is not possible to split these records further. Declare the node a leaf with the same class label as the majority class of the training records associated with this node.
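A minimal recursive sketch of Hunt’s algorithm, including the two additional conditions, might look as follows. Several things here are assumptions made for illustration: the names `hunt`, `majority` and `domains`, the tuple-based tree representation, and above all the split choice, which simply takes the first attribute that still separates the records instead of using an impurity measure.

```python
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def hunt(records, domains, default):
    """records: list of (attribute_dict, label); domains: attribute -> possible values.
    Returns a class label (leaf) or an (attribute, {value: subtree}) pair."""
    labels = [y for _, y in records]
    if not records:                      # empty Dt: leaf with the default class
        return default
    if len(set(labels)) == 1:            # step 1: all records in one class
        return labels[0]
    usable = [a for a in domains if len({x[a] for x, _ in records}) > 1]
    if not usable:                       # identical attribute values: majority leaf
        return majority(labels)
    attr = usable[0]                     # placeholder choice, not an impurity measure
    rest = {a: v for a, v in domains.items() if a != attr}
    parent_majority = majority(labels)   # used as default for empty children
    return (attr, {
        value: hunt([(x, y) for x, y in records if x[attr] == value],
                    rest, parent_majority)
        for value in domains[attr]
    })


# Hypothetical toy data
records = [
    ({"Refund": "Yes", "MarSt": "Single"}, "No"),
    ({"Refund": "No", "MarSt": "Married"}, "No"),
    ({"Refund": "No", "MarSt": "Single"}, "Yes"),
]
domains = {"Refund": ["Yes", "No"], "MarSt": ["Single", "Married", "Divorced"]}
print(hunt(records, domains, default="No"))
```

In this toy data no record with Refund = No is Divorced, so that child is empty and receives the majority class of its parent. This is one answer to "When can this happen?".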

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting

Hunt’s Algorithm
Design Issues of Decision Tree Induction
- How should the training records be split?
  - At each recursive step, an attribute test condition must be selected
  - The algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition
- How should the splitting procedure stop?
  - A stopping condition is needed to terminate the tree-growing process
  - Stop when all records belong to the same class
  - Stop when all records have identical attribute values
  - Both conditions are sufficient to stop any DT induction algorithm; other criteria can be imposed to terminate the procedure early (do we need to do this? Think of model overfitting!)

How to determine the Best Split?
Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)

How to determine the Best Split?
- Greedy approach: nodes with a homogeneous class distribution are preferred
- Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)

Measures of Node Impurity
- Based on the degree of impurity of the child nodes
- Less impurity ⇒ more skew in the class distribution
- A node with class distribution (1, 0) has zero impurity, whereas a node with class distribution (0.5, 0.5) has the highest impurity
- Measures: Gini index, entropy, misclassification error

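Measures of Node Impurity
With p(j | t) the relative frequency of class j at node t:

Gini index:
$$\text{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$$

Entropy:
$$\text{Entropy}(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$$

Misclassification error:
$$\text{Error}(t) = 1 - \max_{j} p(j \mid t)$$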

Comparison among Splitting Criteria
[Figure: impurity measures compared for a 2-class problem; taken from the textbook (Tan, Steinbach, Kumar)]

How to find the best split?
Example:

| Node | C0 | C1 | GINI | ENTROPY | ERROR |
|------|----|----|-------|---------|-------|
| N1 | 0 | 6 | 0 | 0 | 0 |
| N2 | 1 | 5 | 0.278 | 0.650 | 0.167 |
| N3 | 3 | 3 | 0.5 | 1 | 0.5 |
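The values for nodes N1, N2 and N3 can be checked with a short script. This is a sketch under the usual definitions of the three measures (Gini = 1 minus the sum of squared class frequencies, entropy with base-2 logarithms, error = 1 minus the largest class frequency); the function names are illustrative.

```python
from math import log2


def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    n = sum(counts)
    # p * log2(1/p) per class; empty classes contribute 0 by convention
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def error(counts):
    return 1.0 - max(counts) / sum(counts)


# Class counts (C0, C1) for the three example nodes
for name, counts in (("N1", [0, 6]), ("N2", [1, 5]), ("N3", [3, 3])):
    print(name, round(gini(counts), 3), round(entropy(counts), 3), round(error(counts), 3))
```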

How to find the best split?
- The 3 measures have similar characteristic curves
- Despite this, the attribute chosen as the test condition may vary depending on the choice of impurity measure
- Need to normalize these measures! Introducing GAIN, Δ:

$$\Delta = I(\text{parent}) - \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j)$$

where I = impurity measure of a given node, N = total number of records at the parent node, k = number of attribute values, and N(v_j) = number of records associated with the child node v_j.

- I(parent) is the same for all test conditions
- When entropy is used, Δ is called information gain, Δ_info
- The larger the gain, the better the split
- Is it the best measure?

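How to Find the Best Split?
- Before splitting, the parent node has impurity M0
- Splitting on A? (Yes / No) gives nodes N1 and N2 with impurities M1 and M2, combined as M12
- Splitting on B? (Yes / No) gives nodes N3 and N4 with impurities M3 and M4, combined as M34
- Compare Gain = M0 - M12 vs. M0 - M34
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)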

Measure of Impurity: GINI
Gini index for a given node t:

$$\text{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$$

(NOTE: p(j | t) is the relative frequency of class j at node t.)
- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
Slide taken from text book slides available at companion website (Tan, Steinbach, Kumar)

Examples for computing GINI
- $P(C_1) = 0/6 = 0$, $P(C_2) = 6/6 = 1$: $\text{Gini} = 1 - P(C_1)^2 - P(C_2)^2 = 1 - 0 - 1 = 0$
- $P(C_1) = 1/6$, $P(C_2) = 5/6$: $\text{Gini} = 1 - (1/6)^2 - (5/6)^2 = 0.278$
- $P(C_1) = 2/6$, $P(C_2) = 4/6$: $\text{Gini} = 1 - (2/6)^2 - (4/6)^2 = 0.444$

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT
- When a node p is split into k partitions (children), the quality of the split is computed as

$$\text{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \text{GINI}(i)$$

where $n_i$ = number of records at child i and n = number of records at node p.

Binary Attributes: Computing GINI Index
- Splits into two partitions
- Effect of weighing partitions: larger and purer partitions are sought

Split on A? (Yes: Node N1, No: Node N2):
- $\text{Gini}(N_1) = 1 - (4/7)^2 - (3/7)^2 = 0.490$
- $\text{Gini}(N_2) = 1 - (2/5)^2 - (3/5)^2 = 0.480$
- $\text{Gini}(\text{Children}) = 7/12 \times 0.490 + 5/12 \times 0.480 = 0.486$

Binary Attributes: Computing GINI Index
- Splits into two partitions
- Effect of weighing partitions: larger and purer partitions are sought

Split on B? (Yes: Node N1, No: Node N2):
- $\text{Gini}(N_1) = 1 - (5/7)^2 - (2/7)^2 = 0.408$
- $\text{Gini}(N_2) = 1 - (1/5)^2 - (4/5)^2 = 0.320$
- $\text{Gini}(\text{Children}) = 7/12 \times 0.408 + 5/12 \times 0.320 = 0.371$

Attribute B is preferred over A.
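The comparison of A and B can be reproduced with the weighted Gini of the children. In this sketch the class counts are read off the fractions on the two slides, and the parent counts (6, 6) are an assumption obtained by adding the children; the function names are illustrative.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(children):
    """Weighted Gini of the child nodes; children is a list of class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)


parent = [6, 6]                      # assumed: sum of the child counts
split_a = [[4, 3], [2, 3]]           # N1 and N2 for attribute A
split_b = [[5, 2], [1, 4]]           # N1 and N2 for attribute B

for name, children in (("A", split_a), ("B", split_b)):
    weighted = gini_split(children)
    gain = gini(parent) - weighted   # reduction in impurity
    print(name, round(weighted, 3), round(gain, 3))
# A 0.486 0.014
# B 0.371 0.129
```

B gives the lower weighted Gini and therefore the larger gain, which is why it is preferred.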

Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions:
  - Multi-way split
  - Two-way split (find the best partition of values)
- GINI favours multiway splits!!

Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value
- Several choices for the splitting value
  - Number of possible splitting values = number of distinct values
- Each splitting value has a count matrix associated with it
  - Class counts in each of the partitions, A < v and A ≥ v
- Simple method to choose the best v
  - For each v, scan the database to gather the count matrix and compute its Gini index
  - Computationally inefficient! Repetition of work.

Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
- Sort the attribute on values
- Linearly scan these values, each time updating the count matrix and computing the Gini index
- Choose the split position that has the least Gini index
[Figure: sorted values and candidate split positions]

Find the time complexity in terms of the number of records!
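The sort-then-scan idea can be sketched as below. The records are sorted once, and the count matrix is updated incrementally as each record moves from the right partition to the left. Using midpoints between adjacent distinct values as candidate thresholds is an assumption of this sketch, as are the function names and the sample data.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test `value < threshold`."""
    classes = sorted(set(labels))
    pairs = sorted(zip(values, labels))            # one sort
    left = {c: 0 for c in classes}                 # class counts for value < threshold
    right = {c: labels.count(c) for c in classes}  # class counts for value >= threshold
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        v_prev, y_prev = pairs[i - 1]
        left[y_prev] += 1                          # move one record across the split
        right[y_prev] -= 1
        v = pairs[i][0]
        if v == v_prev:
            continue                               # no threshold between equal values
        score = (i / n) * gini(list(left.values())) \
            + ((n - i) / n) * gini(list(right.values()))
        if score < best[1]:
            best = ((v_prev + v) / 2, score)       # midpoint as candidate threshold
    return best


# Hypothetical income values (in thousands) and class labels
values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(values, labels))
```

Each step of the scan updates the counts in constant time for a fixed number of classes, so the sort is the dominant cost.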

Alternative Splitting Criteria based on INFO
Entropy at a given node t:

$$\text{Entropy}(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$$

(NOTE: p(j | t) is the relative frequency of class j at node t.)
- Measures the homogeneity of a node
  - Maximum ($\log_2 n_c$) when records are equally distributed among all classes, implying least information
  - Minimum (0.0) when all records belong to one class, implying most information
- Entropy-based computations are similar to the GINI index computations

Examples for computing Entropy
- $P(C_1) = 0/6 = 0$, $P(C_2) = 6/6 = 1$: $\text{Entropy} = -0 \log_2 0 - 1 \log_2 1 = -0 - 0 = 0$
- $P(C_1) = 1/6$, $P(C_2) = 5/6$: $\text{Entropy} = -(1/6) \log_2 (1/6) - (5/6) \log_2 (5/6) = 0.65$
- $P(C_1) = 2/6$, $P(C_2) = 4/6$: $\text{Entropy} = -(2/6) \log_2 (2/6) - (4/6) \log_2 (4/6) = 0.92$

Splitting Based on INFO...
Information Gain:

$$\text{GAIN}_{split} = \text{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \text{Entropy}(i)$$

The parent node p, with n records, is split into k partitions; $n_i$ is the number of records in partition i.
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
- Used in ID3 and C4.5
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure

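Splitting Based on INFO...
Gain Ratio:

$$\text{GainRATIO}_{split} = \frac{\text{GAIN}_{split}}{\text{SplitINFO}}, \qquad \text{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

The parent node p, with n records, is split into k partitions; $n_i$ is the number of records in partition i.
- Adjusts information gain by the entropy of the partitioning (SplitINFO). A higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5
- Designed to overcome the disadvantage of information gain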
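A small sketch can show the disadvantage of information gain and how gain ratio counters it. The class counts below are hypothetical: a two-way split is compared against an ID-like attribute that puts every record in its own partition. Function names are illustrative, and SplitINFO is taken to be the base-2 entropy of the partition sizes.

```python
from math import log2


def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)


def split_info(children):
    # Entropy of the partition sizes n_i / n
    return entropy([sum(c) for c in children])


def gain_ratio(parent, children):
    return info_gain(parent, children) / split_info(children)


parent = [5, 5]                          # hypothetical class counts
two_way = [[5, 1], [0, 4]]               # two fairly pure partitions
id_like = [[1, 0]] * 5 + [[0, 1]] * 5    # ten pure partitions of one record each

for name, children in (("two-way", two_way), ("ID-like", id_like)):
    print(name, round(info_gain(parent, children), 3), round(gain_ratio(parent, children), 3))
```

Information gain ranks the ID-like split highest because every partition is pure, while gain ratio ranks the two-way split higher because the ID-like split has a large SplitINFO.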

Splitting Criteria based on Classification Error
Classification error at a node t:

$$\text{Error}(t) = 1 - \max_{j} p(j \mid t)$$

- Measures the misclassification error made by a node
  - Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
  - Minimum (0.0) when all records belong to one class, implying most interesting information

Examples for Computing Error
- $P(C_1) = 0/6 = 0$, $P(C_2) = 6/6 = 1$: $\text{Error} = 1 - \max(0, 1) = 1 - 1 = 0$
- $P(C_1) = 1/6$, $P(C_2) = 5/6$: $\text{Error} = 1 - \max(1/6, 5/6) = 1 - 5/6 = 1/6$
- $P(C_1) = 2/6$, $P(C_2) = 4/6$: $\text{Error} = 1 - \max(2/6, 4/6) = 1 - 4/6 = 1/3$

Misclassification Error vs Gini
Split into Node N1 (Yes) and Node N2 (No):
- $\text{Gini}(N_1) = 1 - (3/3)^2 - (0/3)^2 = 0$
- $\text{Gini}(N_2) = 1 - (4/7)^2 - (3/7)^2 = 0.490$
- $\text{Gini}(\text{Children}) = 3/10 \times 0 + 7/10 \times 0.490 = 0.343$

Gini improves!!

Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets

Example: C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory
- Unsuitable for large datasets: needs out-of-core sorting
- You can download the software from:

Example: Web robot or crawler
- Based on access patterns, distinguish between human users and web robots
- Web usage mining
- BUILD a MODEL: use web log data
- Think of some more applications of classification!!

Summary: DT Classifiers
- Do not require any prior assumptions about the probability distributions satisfied by the classes
- Finding the optimal DT is NP-complete
- Construction of a DT is fast even for large data sets
- Testing is also fast: O(w), where w = maximum depth of the tree
- Robust to noise
- Irrelevant attributes can cause problems (use feature selection)
- Data fragmentation problem (leaf nodes having very few records)
- Tree pruning has a greater impact on the final tree than the choice of impurity measure

Decision Boundary
- The border line between two neighboring regions of different classes is known as the decision boundary
- The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique Decision Trees
[Figure: an oblique test condition x + y < 1 separating the two classes]
- The test condition may involve multiple attributes
- More expressive representation
- Finding the optimal test condition is computationally expensive

Tree Replication
- The same subtree appears in multiple branches
- Split using P redundant? Remove P in post-pruning

Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

| | PREDICTED Class=Yes | PREDICTED Class=No |
|---|---|---|
| ACTUAL Class=Yes | a: TP (true positive) | b: FN (false negative) |
| ACTUAL Class=No | c: FP (false positive) | d: TN (true negative) |

- TP: predicted to be in YES, and is actually in it
- FP: predicted to be in YES, but is not actually in it
- TN: predicted not to be in YES, and is not actually in it
- FN: predicted not to be in YES, but is actually in it

Metrics for Performance Evaluation…
With the confusion-matrix entries a (TP), b (FN), c (FP) and d (TN), the most widely-used metric is accuracy:

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

Limitation of Accuracy
- Consider a 2-class problem:
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
- Accuracy is misleading because the model does not detect any class 1 example

Cost Matrix

| C(i\|j) | PREDICTED Class=Yes | PREDICTED Class=No |
|---|---|---|
| ACTUAL Class=Yes | C(Yes\|Yes) | C(No\|Yes) |
| ACTUAL Class=No | C(Yes\|No) | C(No\|No) |

C(i|j): cost of misclassifying a class j example as class i

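Computing Cost of Classification
Cost matrix:

| C(i\|j) | PREDICTED + | PREDICTED - |
|---|---|---|
| ACTUAL + | -1 | 100 |
| ACTUAL - | 1 | 0 |

Model M1 (Accuracy = 80%, Cost = 3910):

| | PREDICTED + | PREDICTED - |
|---|---|---|
| ACTUAL + | 150 | 40 |
| ACTUAL - | 60 | 250 |

Model M2 (Accuracy = 90%, Cost = 4255):

| | PREDICTED + | PREDICTED - |
|---|---|---|
| ACTUAL + | 250 | 45 |
| ACTUAL - | 5 | 200 |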
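The accuracy and cost figures for M1 and M2 can be checked with a few lines of Python. This is a sketch: the nested-dictionary layout (`matrix[actual][predicted]`) and the function names are choices made for illustration, and the zero cost for a correctly predicted negative is inferred from the stated totals.

```python
# cost[actual][predicted]; the 0 entry is inferred from the totals on the slide
cost = {"+": {"+": -1, "-": 100}, "-": {"+": 1, "-": 0}}

# confusion matrices, also indexed as matrix[actual][predicted]
m1 = {"+": {"+": 150, "-": 40}, "-": {"+": 60, "-": 250}}
m2 = {"+": {"+": 250, "-": 45}, "-": {"+": 5, "-": 200}}


def accuracy(cm):
    total = sum(sum(row.values()) for row in cm.values())
    correct = sum(cm[c][c] for c in cm)          # diagonal entries
    return correct / total


def total_cost(cm, cost):
    return sum(cm[a][p] * cost[a][p] for a in cm for p in cm[a])


for name, cm in (("M1", m1), ("M2", m2)):
    print(name, accuracy(cm), total_cost(cm, cost))
# M1 0.8 3910
# M2 0.9 4255
```

M2 is more accurate but also more costly, because it makes more of the expensive errors (actual +, predicted -).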

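Cost-Sensitive Measures
With a = TP, b = FN and c = FP:

$$\text{Precision } (p) = \frac{a}{a + c}, \qquad \text{Recall } (r) = \frac{a}{a + b}, \qquad \text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$

- Precision is biased towards C(Yes|Yes) & C(Yes|No)
- Recall is biased towards C(Yes|Yes) & C(No|Yes)
- F-measure is biased towards all except C(No|No)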
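As a sketch, the three measures can be computed from the confusion-matrix counts a (TP), b (FN) and c (FP), using the standard definitions. Returning 0.0 when a denominator is zero is a convention assumed here, and the function names are illustrative. The second call revisits the "predict everything as class 0" model from the accuracy example, treating class 1 as the positive class.

```python
def precision(a, c):
    """a = TP, c = FP."""
    return a / (a + c) if (a + c) else 0.0


def recall(a, b):
    """a = TP, b = FN."""
    return a / (a + b) if (a + b) else 0.0


def f_measure(a, b, c):
    """Harmonic mean of precision and recall, written in terms of the counts."""
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0


# Counts from a confusion matrix with a=150, b=40, c=60
print(precision(150, 60), recall(150, 40), f_measure(150, 40, 60))

# A model that never predicts the positive class: a=0, b=10, c=0
print(precision(0, 0), recall(0, 10), f_measure(0, 10, 0))
```

None of the three functions uses d (TN), which is why a 99.9% accurate model that misses every positive example still scores zero on recall and F-measure.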
