
1 Classification I: Decision Tree
AMCS/CS 340: Data Mining. Xiangliang Zhang, King Abdullah University of Science and Technology

2 Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class
- Find a model for the class attribute as a function of the values of the other attributes
- Goal: previously unseen records should be assigned a class as accurately as possible
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

3 Classification Example
A training set (with categorical and continuous attributes and a class label) is used to learn a classifier model; the model is then applied to a test set. Example task: predicting borrowers who cheat on loan payments.

4 Issues: Evaluating Classification Methods
- Accuracy: how well the class labels of test data are predicted
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency on large-scale data
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

5 Classification Techniques
- Decision tree based methods
- Rule-based methods
- Learning from neighbors
- Bayesian classification
- Neural networks
- Ensemble methods
- Support vector machines

6 Example of a Decision Tree
The training data (categorical and continuous attributes plus a class label) yield the model below; splitting attributes: Refund, MarSt, TaxInc.

Refund?
  Yes -> NO
  No  -> MarSt?
    Married -> NO
    Single, Divorced -> TaxInc?
      < 80K  -> NO
      >= 80K -> YES

7 Another Example of Decision Tree

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
    Yes -> NO
    No  -> TaxInc?
      < 80K  -> NO
      >= 80K -> YES

There could be more than one tree that fits the same data!

8 Decision Tree Classification Task

9 Apply Model to Test Data
Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value:

Refund?
  Yes -> NO
  No  -> MarSt?
    Married -> NO
    Single, Divorced -> TaxInc?
      < 80K  -> NO
      >= 80K -> YES

10-14 Apply Model to Test Data (continued)
The test record has Refund = No and Marital Status = Married, so it follows the No branch at Refund and then the Married branch at MarSt, reaching a leaf: assign Cheat to "No".

15 Decision Tree Classification Task

16 Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Examples are partitioned recursively based on selected attributes
- Attributes are categorical (continuous-valued attributes are discretized in advance)
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is used to label the leaf)
- There are no samples left

17 Decision Tree Induction
Many algorithms: Hunt's Algorithm (one of the earliest); CART; ID3, C4.5; SLIQ, SPRINT

18 General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t. General procedure (a minimal sketch follows this slide):
- If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
- If Dt is an empty set, then t is a leaf node labeled with the default class yd
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
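A minimal Python sketch of this recursive procedure, assuming categorical attributes; the `best`-attribute heuristic (weighted Gini of the children) and all names are illustrative, not from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attributes, default):
    """Grow a decision tree over dict records (Hunt's algorithm, categorical attributes)."""
    if not records:                       # empty D_t: leaf with the default class
        return default
    if len(set(labels)) == 1:             # pure D_t: leaf labeled with that class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                    # nothing left to split on: majority vote
        return majority
    def split_impurity(a):                # weighted impurity of splitting on attribute a
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[a], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attributes, key=split_impurity)
    node = {}
    for v in {r[best] for r in records}:  # one child per observed value of `best`
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == v]
        rs, ys = map(list, zip(*sub))
        node[(best, v)] = hunt(rs, ys, [a for a in attributes if a != best], majority)
    return node

tree = hunt([{"refund": "yes", "marst": "single"},
             {"refund": "no", "marst": "married"},
             {"refund": "no", "marst": "single"}],
            ["no", "no", "yes"], ["refund", "marst"], default="no")
print(tree)
```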

19 Hunt’s Algorithm Refund Refund Refund Marital Marital Status Status
Don’t Cheat Yes No Don’t Cheat Refund Don’t Cheat Yes No Marital Status Single, Divorced Married Refund Don’t Cheat Yes No Marital Status Single, Divorced Married Taxable Income < 80K >= 80K

20 Issues of Hunt's Algorithm
Determine how to split the records:
- How to specify the attribute test condition? (how many branches; partition threshold for splitting)
- How to determine the best split? (which attribute to choose?)

21 How to Specify Test Condition?
- Depends on attribute type: nominal, ordinal, continuous
- Depends on the number of ways to split: 2-way split, multi-way split

22 Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as distinct values, e.g., CarType -> {Family}, {Sports}, {Luxury}
- Binary split: divide the values into two subsets and find the optimal partitioning, e.g., CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

23 Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as distinct values, e.g., Size -> {Small}, {Medium}, {Large}
- Binary split: divide the values into two subsets and find the optimal partitioning, e.g., Size -> {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
- What about the split {Small, Large} vs. {Medium}? It violates the ordering of the attribute values.

24 Splitting Based on Continuous Attributes
Different ways of handling:
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut
- Discretization to form an ordinal categorical attribute; ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering


26 Issues of Hunt's Algorithm
Determine how to split the records:
- How to specify the attribute test condition? (how many branches; partition threshold for splitting)
- How to determine the best split? (which attribute to choose?)

27 How to determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1. Which attribute test produces the best child nodes?

28 How to determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred. This requires a measure of node impurity:
- Non-homogeneous: high degree of impurity
- Homogeneous: low degree of impurity


30 Measures of Node Impurity
- Gini index: how often a randomly chosen element from the node would be incorrectly labeled if it were labeled randomly according to the node's class distribution
- Entropy: a measure of the uncertainty associated with a random variable
- Misclassification error: the proportion of misclassified samples

31 Measures of Node Impurity
Let p(j | t) be the relative frequency of class j at node t:
- Gini index: GINI(t) = 1 − Σj [p(j | t)]²
- Entropy: Entropy(t) = − Σj p(j | t) log2 p(j | t)
- Misclassification error: Error(t) = 1 − maxj p(j | t)
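A small Python sketch of the three measures, assuming the node's labels arrive as a plain list; function names are illustrative.

```python
import math
from collections import Counter

def class_probs(labels):
    """Relative frequencies p(j|t) of each class j at a node t."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

def misclassification_error(labels):
    return 1.0 - max(class_probs(labels))

# A 50/50 node is maximally impure for two classes:
node = ["C0"] * 5 + ["C1"] * 5
print(gini(node), entropy(node), misclassification_error(node))  # 0.5 1.0 0.5
```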

32 Comparison among Measures of Node Impurity
For a 2-class problem: plot of entropy, Gini index, and misclassification error versus the class probability p; all three are zero at p = 0 and p = 1 and peak at p = 0.5.


34 Quality of Split
When a node p with n records is split into k partitions (children), where child i holds ni records, the quality of the split is computed as

  GINIsplit = Σ(i=1..k) (ni / n) × GINI(i)

or, as information gain,

  GAINsplit = Impurity(parent) − Σ(i=1..k) (ni / n) × Impurity(i)

This measures the reduction in Gini/entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN); a worked example follows on the next slide.

35 Quality of Split: binary attributes
A binary attribute splits parent node P (6 records of each class, so Gini(parent) = 0.5) into Node N1 (Yes branch: 4 and 3 records of the two classes) and Node N2 (No branch: 2 and 3 records):
Gini(N1) = 1 − (4/7)² − (3/7)² = 0.4898
Gini(N2) = 1 − (2/5)² − (3/5)² = 0.480
Ginisplit(children) = 7/12 × 0.4898 + 5/12 × 0.480 = 0.486
Gainsplit = Gini(parent) − Ginisplit(children) = 0.500 − 0.486 = 0.014
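The arithmetic can be checked with a short script; `gini_counts` and the count pairs mirror the slide's example (a sketch, not library code).

```python
def gini_counts(counts):
    """Gini impurity from a list of per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [6, 6]            # 6 records of each class at node P
n1, n2 = [4, 3], [2, 3]    # class counts in children N1 (Yes) and N2 (No)

gini_split = (sum(n1) / 12) * gini_counts(n1) + (sum(n2) / 12) * gini_counts(n2)
gain = gini_counts(parent) - gini_split
print(round(gini_split, 3), round(gain, 3))  # 0.486 0.014
```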

36 Decision Tree Induction
Many algorithms: Hunt's Algorithm (one of the earliest); CART; ID3, C4.5; SLIQ, SPRINT

37 CART
CART: Classification and Regression Trees
- constructs trees with only binary splits (simplifies the splitting criterion)
- uses the Gini index as the splitting criterion
- splits on the attribute that provides the smallest Ginisplit(p), i.e., the largest GAINsplit(p)
- needs to enumerate all the possible splitting points for each attribute (see the sketch below)
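For reference, scikit-learn's DecisionTreeClassifier implements an optimized version of CART; a minimal usage sketch, assuming scikit-learn is installed (the toy arrays are illustrative, not the slides' data):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [refund (0/1), taxable income in K]; label: cheat (0/1)
X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95], [0, 60], [1, 220], [0, 85]]
y = [0, 0, 0, 0, 1, 0, 0, 1]

clf = DecisionTreeClassifier(criterion="gini")  # binary splits, Gini index
clf.fit(X, y)
print(export_text(clf, feature_names=["refund", "income"]))  # the learned tree
print(clf.predict([[0, 90]]))                                # classify a new record
```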

38 Continuous Attributes: Computing Gini Index
For efficient computation, for each attribute:
1. Sort the records on the attribute's values
2. Set the candidate split positions as the midpoints between adjacent sorted values
3. Linearly scan these values, each time updating the count matrix and computing the Gini index
4. Choose the split position that has the least Gini index
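A sketch of the linear scan, assuming a numeric attribute with class labels; the point is the incremental count update (one pass after sorting). Names are illustrative; the example values anticipate the SLIQ age example later in the deck.

```python
def best_threshold(values, labels):
    """One pass over sorted values; returns (best midpoint, its Gini_split)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    classes = sorted(set(labels))
    below = {c: 0 for c in classes}
    above = {c: labels.count(c) for c in classes}

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

    best = (None, float("inf"))
    for i in range(n - 1):
        v, y = pairs[i]
        below[y] += 1          # move one record across the candidate boundary
        above[y] -= 1
        if pairs[i + 1][0] == v:
            continue           # no valid cut between equal values
        mid = (v + pairs[i + 1][0]) / 2
        g = ((i + 1) / n) * gini(below, i + 1) + ((n - i - 1) / n) * gini(above, n - i - 1)
        if g < best[1]:
            best = (mid, g)
    return best

print(best_threshold([17, 20, 23, 32, 43, 68],
                     ["HIGH", "HIGH", "HIGH", "LOW", "HIGH", "LOW"]))  # (27.5, 0.222...)
```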

39 Decision Tree Induction
Many algorithms: Hunt's Algorithm (one of the earliest); CART; ID3, C4.5; SLIQ, SPRINT

40 How to Find the Best Split
For a nominal attribute such as CarType, evaluate each candidate:
- Two-way splits (find the best partition of values): {Sports, Luxury} vs. {Family}; {Family, Luxury} vs. {Sports}
- Multi-way split: {Family}, {Sports}, {Luxury}
Choose the candidate with the largest Gain; here the largest Gain = 0.337.

41 Which Attribute to Split?
Candidate attributes give Gain = 0.02, Gain = 0.337, and Gain = 0.5. Is the largest always best?
Disadvantage of information gain: it tends to prefer splits that result in a large number of partitions, each small but pure:
- a unique value for each record (e.g., an ID attribute) is not predictive for unseen records
- a small number of records in each node gives unreliable predictions

42 Splitting Based on Gain Ratio
Gain Ratio, where parent node p is split into k partitions and ni is the number of records in partition i:

  GainRatio_split = GAIN_split / SplitINFO,  where SplitINFO = − Σ(i=1..k) (ni / n) log2(ni / n)

- designed to overcome the disadvantage of information gain
- adjusts information gain by the entropy of the partitioning (SplitINFO)
- a higher-entropy partitioning (a large number of small partitions) is penalized!
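A sketch of the adjustment using only the per-partition sizes; hypothetical helper names, not the slides' code.

```python
import math

def split_info(sizes):
    """Entropy of the partitioning itself: -sum (ni/n) log2(ni/n)."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

# The same gain looks worse when split into many tiny partitions:
print(gain_ratio(0.337, [4, 4]))    # 2-way split: SplitINFO = 1,  ratio = 0.337
print(gain_ratio(0.337, [1] * 8))   # 8 singleton partitions: SplitINFO = 3, ratio = 0.112
```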

43 Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Gini gain: biased toward multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions
- Information gain: biased toward multivalued attributes
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others

44 ID3 and C4.5
ID3 (Ross Quinlan, 1986) is the precursor to C4.5 (Ross Quinlan, 1993). Both grow the tree as follows:
- For each unused attribute Ai, compute the information GAIN (ID3) or GainRatio (C4.5) from splitting on Ai
- Find the best splitting attribute Abest with the highest GAIN or GainRatio
- Create a decision node that splits on Abest
- Recur on the sublists obtained by splitting on Abest, adding the resulting nodes as children of the current node

45 Improvements of C4.5 over the ID3 algorithm
- Handling both continuous and discrete attributes: to handle a continuous attribute, C4.5 creates a threshold and splits the list into records whose attribute value is above the threshold and those whose value is less than or equal to it
- Handling training data with missing attribute values: C4.5 allows attribute values to be marked as ? for missing; missing attribute values are simply not used in gain and entropy calculations
- Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes

46 Issues
- Needs the entire data set to fit in memory; unsuitable for large datasets
- Needs a lot of computation at every stage of decision tree construction

47 Decision Tree Induction
Many algorithms: Hunt's Algorithm (one of the earliest); CART; ID3, C4.5; SLIQ, SPRINT

48 SLIQ: a decision tree classifier
SLIQ, Supervised Learning In Quest (EDBT'96, Mehta et al.):
- uses a pre-sorting technique in the tree-growing phase (eliminates the need to sort data at each node): a separate list is created for each attribute of the training data, and a separate list, called the class list, is created for the class labels attached to the examples
- requires that only the class list and one attribute list be kept in memory at any time
- suitable for classification of large disk-resident datasets
- applies to both numerical and categorical attributes

49 SLIQ Methodology
Create the decision tree by partitioning records: generate an attribute list for each attribute, then sort the attribute lists of NUMERIC attributes (only numeric attributes are sorted).

Training data:
  RecId  Drivers Age  CarType  Class (risk)
  1      23           Family   HIGH
  2      17           Sports   HIGH
  3      43           Sports   HIGH
  4      68           Family   LOW
  5      32           Truck    LOW
  6      20           Family   HIGH

Sorted attribute list for Age, with the class list:
  Age  Class  RecId
  17   HIGH   2
  20   HIGH   6
  23   HIGH   1
  32   LOW    5
  43   HIGH   3
  68   LOW    4

50 Numeric attributes splitting index
Evaluate candidate partition positions in the sorted attribute list, maintaining class histograms (Cbelow / Cabove) as the scan advances:

  Position 0:  Cbelow: HIGH 0, LOW 0 | Cabove: HIGH 4, LOW 2  ->  Ginisplit = 0.44
  Position 3:  Cbelow: HIGH 3, LOW 0 | Cabove: HIGH 1, LOW 2  ->  Ginisplit = 0.22
  Position 6:  Cbelow: HIGH 4, LOW 2 | Cabove: HIGH 0, LOW 0  ->  Ginisplit = 0.44

51 Numeric attributes splitting index
The best partition is Position 3 (Ginisplit = 0.22), i.e., the cut age < 27.5 vs. age > 27.5 (the midpoint of 23 and 32), creating child nodes N1 and N2.

52 Numeric attributes splitting index
Splitting on age (< 27.5 -> N1, > 27.5 -> N2) partitions the attribute lists:

  N1:  Age  Class  CarType  RecId      N2:  Age  Class  CarType  RecId
       17   HIGH   Sports   2               32   LOW    Truck    5
       20   HIGH   Family   6               43   HIGH   Sports   3
       23   HIGH   Family   1               68   LOW    Family   4

N1 is pure (all HIGH); N2 can be split further, e.g., on CarType.

53 SPRINT
SPRINT: A Scalable Parallel Classifier for Data Mining (VLDB'96, J. Shafer et al.):
- an enhancement of SLIQ, implemented in both serial and parallel versions for good data placement and load balancing
- one-time sort of the data items
- uses two data structures, attribute lists and histograms, neither of which needs to be memory-resident, making SPRINT suitable for large data sets
- handles both continuous and categorical attributes

54 Decision Tree Issues
Issues: underfitting and overfitting; missing values

55 Underfitting and Overfitting

56 Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data
- too many branches, some of which may reflect anomalies due to noise or outliers
- poor accuracy for unseen samples
Two approaches to avoid overfitting:
- Pre-pruning: stop tree construction early; do not split a node if this would result in the gain measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Post-pruning: remove branches from a "fully grown" tree; use a set of data different from the training data to decide which is the "best pruned tree"

57 Tree Post-Pruning
Reduced error pruning:
- start from the leaves
- replace each node with its most popular class
- keep the change if the prediction accuracy is not affected
Cost complexity pruning:
- generate a series of trees T0, ..., Tm (T0 is the initial tree, Tm is the root alone)
- construct tree Ti by replacing a subtree of tree Ti-1 with a leaf node, whose class label is the majority class of the instances in the subtree
- which subtree to remove? Typically the "weakest link": the subtree whose removal increases error the least per pruned leaf; a sketch with scikit-learn follows
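For illustration, scikit-learn exposes cost-complexity (weakest-link) pruning via `ccp_alpha`; a minimal sketch on synthetic data (the data set and split are placeholders, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Enumerate the pruning sequence T0, ..., Tm via the effective alphas
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Pick the alpha whose pruned tree scores best on held-out data
best = max(path.ccp_alphas,
           key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
                         .fit(X_tr, y_tr).score(X_te, y_te))
print("chosen alpha:", best)
```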

58 Decision Tree Issues
Issues: underfitting and overfitting; missing values

59 Handling Missing Values of Attributes
Missing values affect decision tree construction in three different ways:
- how impurity measures are computed
- how instances with missing values are distributed to child nodes
- how a test instance with a missing value is classified

60 Computing Impurity Measure
Ten records, one of which has a missing Refund value; 3 of the 10 records have Class = Yes. Before splitting:
Entropy(Parent) = −0.3 log2(0.3) − 0.7 log2(0.7) = 0.8813
Split on Refund, using only the 9 records with a known Refund value:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.9183
Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.2973
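A quick check of these numbers, weighting the gain by the fraction of records whose Refund value is known (values copied from the slide; the helper is illustrative):

```python
import math

def H(ps):
    """Entropy of a probability distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

parent = H([0.3, 0.7])                            # 3 Yes / 7 No over all 10 records
children = 0.3 * H([1.0]) + 0.6 * H([2/6, 4/6])   # Refund=Yes (3 recs), Refund=No (6 recs)
gain = 0.9 * (parent - children)                  # 9 of 10 records have Refund known
print(round(parent, 4), round(children, 4), round(gain, 4))  # 0.8813 0.551 0.2973
```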

61 Distribute Instances
Among the 9 records with a known Refund value, the probability that Refund = Yes is 3/9 and the probability that Refund = No is 6/9. A record with a missing Refund value is therefore assigned to the Yes child with weight 3/9 and to the No child with weight 6/9.

62 Classify Instances
A new record with Refund = No and a missing Marital Status value reaches the MarSt node. Using the weighted class counts:

             Married  Single  Divorced  Total
  Class=No     3        1       0         4
  Class=Yes    6/9      1       1         2.67
  Total        3.67     2       1         6.67

Probability that Marital Status = Married is 3.67/6.67 = 0.55
Probability that Marital Status = {Single, Divorced} is 3/6.67 = 0.45

63 Classify Instances
The record is sent down both branches with these weights. The Married branch reaches leaf NO (weight 0.55); the Single/Divorced branch continues to TaxInc and, with Taxable Income >= 80K for this record, reaches leaf YES (weight 0.45). So:
Probability that Class = NO is 0.55
Probability that Class = YES is 0.45

64 Questions
- What is a decision tree?
- How to choose the best attribute to split on?
- How to decide the branches when splitting?
- What are CART, ID3, and C4.5? What are their differences?
- How can decision trees be used to learn from large data?
- How to avoid overfitting in decision tree learning?
- How can missing values be handled by a decision tree?

65 Classification Techniques
- Decision tree based methods
- Rule-based methods
- Learning from neighbors
- Bayesian classification
- Neural networks
- Ensemble methods
- Support vector machines

66 How can a tree be used in other ways?
Tree -> Rules

67 Rule Extraction from a Decision Tree
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
- Rules are mutually exclusive (rules are independent and each record is covered by at most one rule) and exhaustive (each record is covered by at least one rule)
- The rule set contains as much information as the tree

68 Rule Extraction from a Decision Tree
Example rule from the Married branch: (Marital Status = {Married}) -> No
Rules can be simplified.

69 Rule-Based Classifier
Classify records by using a collection of "if...then..." rules.
Rule: represent the knowledge in the form of IF-THEN rules, (Condition) -> y, where Condition is a conjunction of attribute tests (the rule antecedent) and y is the class label (the rule consequent). Examples of classification rules:
- (Blood Type = Warm) AND (Lay Eggs = Yes) -> Birds
- (Taxable Income < 50K) AND (Refund = Yes) -> Cheat = No
Classifier: a rule R covers an instance x if the attributes of the instance satisfy the condition of the rule; instance x is then labeled by the consequent of R.
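A toy sketch of rule coverage, assuming rules are (condition-function, label) pairs checked in order; all names and data are illustrative, not from the slides.

```python
rules = [
    (lambda r: r["blood"] == "warm" and r["lays_eggs"], "Birds"),
    (lambda r: r["income"] < 50 and r["refund"], "Cheat=No"),
]

def classify(record, rules, default="unknown"):
    """Return the consequent of the first rule whose condition covers the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default

print(classify({"blood": "warm", "lays_eggs": True, "income": 80, "refund": False},
               rules))  # -> Birds
```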

70 Sequential covering algorithm
Extract rules directly from training data, e.g., FOIL (Quinlan 1990), AQ (Michalski et al. 1986), CN2 (Clark & Niblett 1989), RIPPER (Cohen 1995). Steps:
1. Start from an empty rule set
2. Grow a rule for one class Ci using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps 2 and 3 until a stopping criterion is met, e.g., no more training examples remain or the quality of a returned rule falls below a user-specified threshold
Compare with decision-tree induction: a decision tree learns a set of rules simultaneously.

71 Rules for 2-class and multi-class
For a 2-class problem:
- choose one of the classes as the positive class and the other as the negative class
- learn rules for the positive class; the negative class is the default class
For a multi-class problem:
- order the classes by increasing class prevalence (the fraction of instances that belong to a particular class)
- learn the rule set for the smallest class first, treating the rest as the negative class
- repeat with the next smallest class as the positive class

72 Classification: Evaluation
AMCS/CS 340: Data Mining. Xiangliang Zhang, King Abdullah University of Science and Technology

73 Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

74 Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than speed of classification or model building, scalability, etc. The confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a (TP)      b (FN)
  CLASS    Class=No     c (FP)      d (TN)

Most widely used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

75 Limitation of Accuracy
Consider a 2-class problem with unbalanced classes: 9990 examples of class 0 and 10 examples of class 1. If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any class 1 example.

76 Other Measures
Other measures are defined from the same confusion matrix entries, e.g. (standard definitions):
- Precision = a / (a + c)
- Recall (TP rate) = a / (a + b)
- F-measure = 2 × Precision × Recall / (Precision + Recall)
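A small helper computing these measures from the four confusion-matrix counts (a sketch; the count names follow the matrix above, and the example numbers are illustrative):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# A highly unbalanced setting: high accuracy can coexist with very low recall
print(metrics(tp=1, fn=9, fp=0, tn=9990))  # (0.9991, 1.0, 0.1, 0.1818...)
```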

77 Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

78 Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The performance of a model may depend on other factors besides the learning algorithm:
- class distribution
- cost of misclassification
- size of training and test sets

79 Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation: partition the data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n (see the sketch below)
- Bootstrap: sampling with replacement
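A minimal k-fold and leave-one-out estimate with scikit-learn (assumes scikit-learn is installed; the classifier and data set are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold cross validation: train on 9 partitions, test on the remaining one
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Leave-one-out: k = n
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy: %.3f" % loo.mean())
```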

80 Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

81 ROC (Receiver Operating Characteristic)
- Developed in the 1950s in signal detection theory to analyze noisy signals; characterizes the trade-off between positive hits and false alarms
- The ROC curve plots the TP rate (y-axis) against the FP rate (x-axis)
- The performance of each classifier is represented as a point on the ROC curve; changing the algorithm's threshold or the sample distribution changes the location of the point
- Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive. At threshold t: TPR = 0.5, FPR = 0.12

82 ROC Curve
Points (FPR, TPR):
- (0,0): declare everything to be the negative class
- (1,1): declare everything to be the positive class
- (0,1): ideal
- Diagonal line: random guessing
- Below the diagonal line: prediction is the opposite of the true class

83 Using ROC for Model Comparison
No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. The area under the ROC curve (AUC) summarizes performance: ideal area = 1, random guessing area = 0.5.

84 How to construct an ROC curve
Use the posterior probability of each test instance x as a score. Sweep a threshold t over the sorted scores; at each t, count the positive and negative instances with score >= t to obtain one (FPR, TPR) point, then connect the points to form the ROC curve. A sketch follows.
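A sketch of this construction and the resulting AUC with scikit-learn's roc_curve (the labels and scores below are illustrative):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # actual classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # posterior P(+|x) per instance

fpr, tpr, thresholds = roc_curve(y_true, scores)    # one (FPR, TPR) point per threshold
for f, t, th in zip(fpr, tpr, thresholds):
    print("t >= %s -> (FPR=%.2f, TPR=%.2f)" % (th, f, t))
print("AUC =", roc_auc_score(y_true, scores))
```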

85 Confidence Interval for Accuracy
A prediction can be regarded as a Bernoulli trial: a Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong. A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions. Example: toss a fair coin 50 times; how many heads turn up? The expected number of heads is N × p = 50 × 0.5 = 25.
Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we estimate p (the true accuracy of the model)?

86 Confidence Interval for Accuracy
For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1−p)/N, so with probability 1 − α the standardized acc falls between Zα/2 and Z1−α/2. Solving for p gives the confidence interval

  p = ( 2·N·acc + Z²α/2 ± Zα/2 · sqrt(Z²α/2 + 4·N·acc − 4·N·acc²) ) / ( 2(N + Z²α/2) )

87 Confidence Interval for Accuracy
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 − α = 0.95 (95% confidence); from the standard normal table, Zα/2 = 1.96.

  1−α:  0.99  0.98  0.95  0.90
  Z:    2.58  2.33  1.96  1.65

The interval tightens as N grows:

  N:         50     100    500    1000   5000
  p(lower):  0.670  0.711  0.763  0.774  0.789
  p(upper):  0.888  0.866  0.833  0.824  0.811
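Reproducing the table with the interval formula from the previous slide (a direct transcription, assuming Z = 1.96):

```python
import math

def acc_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p given observed acc on n instances."""
    center = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_interval(0.8, n)
    print(n, round(lo, 3), round(hi, 3))   # e.g. 100 -> 0.711 0.866
```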

88 Test of Significance
Given two models: M1 with accuracy = 85%, tested on 30 instances; M2 with accuracy = 75%, tested on 5000 instances.
- Can we say M1 is better than M2?
- How much confidence can we place on the accuracy of M1 and M2?
- Can the difference in performance be explained as a result of random fluctuations in the test sets?

89 Comparing Performance of 2 Models
Given two models, say M1 and M2, which is better?
- M1 is tested on D1 (size = n1), with observed error rate e1
- M2 is tested on D2 (size = n2), with observed error rate e2
- Assume D1 and D2 are independent
- If n1 and n2 are sufficiently large, then e1 ~ N(μ1, σ1²) and e2 ~ N(μ2, σ2²)
- Approximate variance (from the binomial distribution): σ̂i² = ei(1 − ei)/ni

90 Comparing Performance of 2 Models
To test whether the performance difference is statistically significant, let d = e1 − e2 and let dt be the true difference. Since D1 and D2 are independent, their variances add up:

  σd² = σ1² + σ2² ≈ σ̂1² + σ̂2² = e1(1−e1)/n1 + e2(1−e2)/n2

At the (1 − α) confidence level, dt = d ± Zα/2 · σ̂d.

91 An Illustrative Example
Given: M1 with n1 = 30, e1 = 0.15; M2 with n2 = 5000, e2 = 0.25.
d = |e2 − e1| = 0.10 (2-sided test)
σ̂d² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 = 0.0043
At the 95% confidence level, Zα/2 = 1.96:
dt = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128
The interval contains 0, so the difference may not be statistically significant.
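Checking this with a few lines (values from the example above; e1 = 0.15 is the value that makes the interval work out):

```python
import math

n1, e1 = 30, 0.15
n2, e2 = 5000, 0.25

d = abs(e2 - e1)
var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2   # independent test sets: variances add
half_width = 1.96 * math.sqrt(var_d)              # 95% confidence level
print("d_t in [%.3f, %.3f]" % (d - half_width, d + half_width))  # contains 0
```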

92 Our Teaching Assistant
Name: Francisco Franco
Responsibilities: homework, project report, final grades, questions

