
1 AMCS/CS 340: Data Mining Classification I: Decision Tree Xiangliang Zhang King Abdullah University of Science and Technology

2 Classification: Definition Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

3 Classification Example [figure: a training set with categorical and continuous attributes plus a class label; a classifier is learned from it and then applied to a test set] The task: predict borrowers who cheat on loan payments.

4 Issues: Evaluating Classification Methods
– Accuracy: how well the class labels of test data are predicted
– Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
– Robustness: handling noise and missing values
– Scalability: efficiency on large-scale data
– Interpretability: understanding and insight provided by the model
– Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

5 Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Ensemble Methods, Support Vector Machines

6 Example of a Decision Tree [figure: training data with categorical and continuous attributes, and the induced tree] The splitting attributes: Refund (Yes leads to NO; No leads to MarSt), Marital Status (Married leads to NO; Single, Divorced leads to TaxInc), Taxable Income (< 80K leads to NO; > 80K leads to YES).

7 Another Example of a Decision Tree [figure: a different tree for the same data, splitting on MarSt first, then Refund and TaxInc] There could be more than one tree that fits the same data!

8 Decision Tree Classification Task [figure]

9 Apply Model to Test Data Start from the root of the tree. [slides 10–13 repeat the same tree while the test record is routed down it: Refund = No leads to MarSt, and MarSt = Married leads to a leaf]

14 Apply Model to Test Data Assign Cheat to "No".

15 Decision Tree Classification Task [figure]

16 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Examples are partitioned recursively based on selected attributes
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping the partitioning (see the sketch below):
– All samples at a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for labeling the leaf)
– There are no samples left
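A minimal, hypothetical Python sketch of this greedy loop, using Gini as the selection heuristic. The record layout (dicts with a "class" key plus categorical attribute keys), the Node class, and all helper names are illustrative, not code from the lecture.

```python
from collections import Counter

class Node:
    def __init__(self, label=None, attr=None, children=None):
        self.label = label        # class label if this is a leaf
        self.attr = attr          # attribute tested at an internal node
        self.children = children  # dict: attribute value -> child Node

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def gini(records):
    n = len(records)
    return 1.0 - sum((c / n) ** 2
                     for c in Counter(r["class"] for r in records).values())

def best_split(records, attributes):
    # heuristic step: pick the attribute whose multi-way split
    # minimizes the weighted Gini index of the children
    n = len(records)
    def split_gini(attr):
        return sum(len(s) / n * gini(s)
                   for v in {r[attr] for r in records}
                   for s in [[r for r in records if r[attr] == v]])
    return min(attributes, key=split_gini)

def build_tree(records, attributes):
    classes = {r["class"] for r in records}
    if len(classes) == 1:                      # stopping condition 1: pure node
        return Node(label=classes.pop())
    if not attributes:                         # stopping condition 2: majority vote
        return Node(label=majority_class(records))
    attr = best_split(records, attributes)
    # recurse only on observed values, so empty partitions never arise
    children = {v: build_tree([r for r in records if r[attr] == v],
                              [a for a in attributes if a != attr])
                for v in {r[attr] for r in records}}
    return Node(attr=attr, children=children)
```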

17 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

18 General Structure of Hunt's Algorithm Let Dt be the set of training records that reach a node t. General procedure: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt; if Dt is an empty set, then t is a leaf node labeled with the default class yd; if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

19 Hunt's Algorithm [figure: the tree grows step by step on the loan data, first a single leaf (Don't Cheat), then a split on Refund (Yes leads to Don't Cheat), then Marital Status under Refund = No (Married leads to Don't Cheat), then Taxable Income under Single/Divorced (< 80K Don't Cheat, >= 80K Cheat)]

20 Issues of Hunt's Algorithm Determine how to split the records: how to specify the attribute test condition (how many branches; the partition threshold for splitting), and how to determine the best split (which attribute to choose).

21 How to Specify the Test Condition? Depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split, multi-way split).

22 Splitting Based on Nominal Attributes Multi-way split: use as many partitions as distinct values, e.g., CarType splits into {Family | Sports | Luxury}. Binary split: divide the values into two subsets and find the optimal partitioning, e.g., CarType splits into {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

23 Splitting Based on Ordinal Attributes Multi-way split: use as many partitions as distinct values, e.g., Size splits into {Small | Medium | Large}. Binary split: divide the values into two subsets that respect the order, e.g., Size splits into {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}. What about the split {Small, Large} vs. {Medium}? It breaks the ordering of the attribute.

24 Splitting Based on Continuous Attributes Different ways of handling: Binary decision, (A < v) or (A >= v): consider all possible splits and find the best cut. Discretization to form an ordinal categorical attribute: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.

25 Splitting Based on Continuous Attributes [figure]

26 Issues of Hunt's Algorithm Determine how to split the records: how to specify the attribute test condition (how many branches; the partition threshold for splitting), and how to determine the best split (which attribute to choose).

27 How to Determine the Best Split Before splitting: 10 records of class 0 and 10 records of class 1. [figure: candidate splits on different attributes]

28 How to Determine the Best Split Greedy approach: nodes with a homogeneous class distribution are preferred. We therefore need a measure of node impurity: a non-homogeneous node has a high degree of impurity, a homogeneous node a low degree of impurity.

29 How to Determine the Best Split Before splitting: 10 records of class 0 and 10 records of class 1. [figure: the candidate splits, from non-homogeneous (high impurity) to homogeneous (low impurity)]

30 Measures of Node Impurity Gini Index: a measure of how often a randomly chosen element from a set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set. Entropy: a measure of the uncertainty associated with a random variable. Misclassification error: the proportion of misclassified samples.

31 Measures of Node Impurity Let p(j|t) be the relative frequency of class j at node t. Gini Index: GINI(t) = 1 − Σ_j [p(j|t)]². Entropy: Entropy(t) = − Σ_j p(j|t) log₂ p(j|t). Misclassification error: Error(t) = 1 − max_j p(j|t).
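The three measures are small enough to check directly. A sketch, with the class frequencies p(j|t) passed in as a plain list:

```python
import math

def gini(p):
    """Gini index of a node given class frequencies p(j|t)."""
    return 1.0 - sum(pj ** 2 for pj in p)

def entropy(p):
    """Entropy in bits; 0 * log 0 is taken as 0."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def misclassification_error(p):
    """One minus the frequency of the majority class."""
    return 1.0 - max(p)

# A pure node is minimally impure, a 50/50 node maximally so:
print(gini([1.0, 0.0]), gini([0.5, 0.5]))        # 0.0  0.5
print(entropy([1.0, 0.0]), entropy([0.5, 0.5]))  # 0.0  1.0
print(misclassification_error([0.5, 0.5]))       # 0.5
```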

32–33 Comparison among Measures of Node Impurity For a 2-class problem: [figure: Gini, entropy, and misclassification error plotted against p, the fraction of records in one class; all three curves are 0 at p = 0 and p = 1 and peak at p = 0.5]

34 Quality of Split When a node p with n records is split into k partitions (children), with n_i records at child i, the quality of the split is computed as GINI_split = Σ_{i=1..k} (n_i/n) GINI(i), or analogously with entropy. The gain, GAIN_split = GINI(p) − GINI_split (for entropy, the information gain), measures the reduction in GINI/Entropy achieved by the split: choose the split that achieves the most reduction (maximizes GAIN).

35 Quality of Split: Binary Attributes Splitting into two partitions, nodes N1 and N2: Gini(N1) = 1 − (4/7)² − (3/7)² = 0.4898; Gini(N2) = 1 − (2/5)² − (3/5)² = 0.480; Gini_split(children) = 7/12 × 0.4898 + 5/12 × 0.480 = 0.486; Gain_split = Gini(parent) − Gini_split(children) = 0.014.
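A few lines of Python reproduce the arithmetic above, assuming (consistently with the child counts) that the parent holds 6 records of each class, so Gini(parent) = 0.5:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

g1 = gini([4, 3])                     # node N1 -> 0.4898
g2 = gini([2, 3])                     # node N2 -> 0.480
gini_split = 7/12 * g1 + 5/12 * g2    # weighted by child sizes -> 0.486
gain = gini([6, 6]) - gini_split      # parent Gini 0.5 -> gain 0.014
print(round(g1, 4), round(g2, 3), round(gini_split, 3), round(gain, 3))
```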

36 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

37 CART CART: Classification and Regression Trees. Constructs trees with only binary splits (which simplifies the splitting criterion); uses the Gini Index as the splitting criterion; splits on the attribute that provides the smallest GINI_split(p), i.e., the largest GAIN_split(p); needs to enumerate all possible splitting points for each attribute.

38 Continuous Attributes: Computing the Gini Index For efficient computation, for each attribute: sort the attribute values; set the candidate split positions at the midpoints between adjacent sorted values; linearly scan these values, each time updating the class-count matrix and computing the Gini index; choose the split position that has the least Gini index (see the sketch below).
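A sketch of this linear scan; the function name and the (values, labels) interface are invented for illustration. Run on the Age/risk data from the SLIQ slides further down, it recovers the split at 27.5 with weighted Gini 0.22:

```python
def best_numeric_split(values, labels):
    """Scan candidate thresholds (midpoints of adjacent sorted values)
    and return (weighted Gini, threshold) for the best binary split."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))

    def gini(counts):
        n = sum(counts.values())
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    below = {c: 0 for c in classes}
    above = {c: 0 for c in classes}
    for _, y in pairs:
        above[y] += 1

    best = (float("inf"), None)
    n = len(pairs)
    for i in range(n - 1):
        _, y = pairs[i]
        below[y] += 1                       # move one record across the boundary
        above[y] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                        # no valid threshold between equal values
        split = (i + 1) / n * gini(below) + (n - i - 1) / n * gini(above)
        if split < best[0]:
            threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
            best = (split, threshold)
    return best

print(best_numeric_split([23, 17, 43, 68, 32, 20],
                         ["H", "H", "H", "L", "L", "H"]))  # (0.2222..., 27.5)
```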

39 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

40 How to Find the Best Split [figure: comparing the multi-way split on CarType (Family | Sports | Luxury) with the two-way splits {Sports, Luxury} vs. {Family} and {Family, Luxury} vs. {Sports}; the best partition of values yields the largest Gain = 0.337]

41 Which Attribute to Split? [figure: three candidate attributes with Gain = 0.337, Gain = 0.02, and Gain = 0.5] Which is best? Disadvantage: information gain tends to prefer splits that result in a large number of partitions, each being small but pure. A unique value for each record is not predictive, and a small number of records in each node does not give reliable predictions.

42 Splitting Based on Gain Ratio Gain Ratio: when parent node p is split into k partitions, with n_i records in partition i, GainRatio_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1..k} (n_i/n) log₂(n_i/n). Designed to overcome the disadvantage of information gain: it adjusts the information gain by the entropy of the partitioning (SplitINFO), so higher-entropy partitioning (a large number of small partitions) is penalized!
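A small sketch of the penalty at work: the same information gain looks much worse when it comes from many tiny partitions.

```python
import math

def split_info(sizes):
    """Entropy of the partition itself: -sum (n_i/n) log2 (n_i/n)."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(gain_ratio(0.3, [10, 10]))   # 2-way split: SplitINFO = 1.0 -> 0.3
print(gain_ratio(0.3, [2] * 10))   # 10-way split: SplitINFO = 3.32 -> 0.09
```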

43 Comparing Attribute Selection Measures The three measures, in general, return good results, but: Gini gain is biased toward multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both partitions; information gain is biased toward multivalued attributes; gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.

44 ID3 and C4.5 ID3 (Ross Quinlan, 1986) is the precursor to the C4.5 algorithm (Ross Quinlan, 1993); C4.5 is an extension of the earlier ID3 algorithm. For each unused attribute Ai, compute the information GAIN (ID3) or GainRatio (C4.5) from splitting on Ai; find the best splitting attribute A_best, the one with the highest GAIN or GainRatio; create a decision node that splits on A_best; recurse on the sublists obtained by splitting on A_best, and add the resulting nodes as children of the node.

45 Improvements of C4.5 over the ID3 algorithm Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. Handling training data with missing attribute values: C4.5 allows attribute values to be marked as ? for missing; missing attribute values are simply not used in the gain and entropy calculations. Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.

46 C4.5 Issues: needs the entire data set to fit in memory, so it is unsuitable for large datasets; needs a lot of computation at every stage of decision tree construction. You can download the software from: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/c4.5r8.tar.gz More information: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html

47 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

48 SLIQ, a decision tree classifier SLIQ, Supervised Learning In Quest (EDBT'96, Mehta et al.): uses a pre-sorting technique in the tree-growing phase (eliminates the need to sort the data at each node); creates a separate list for each attribute of the training data; a separate list, called the class list, is created for the class labels attached to the examples; SLIQ requires that only the class list and one attribute list be kept in memory at any time; suitable for classification of large disk-resident datasets; applies to both numerical and categorical attributes.

49 SLIQ Methodology Generate an attribute list for each attribute; sort the attribute lists for NUMERIC attributes (only NUMERIC attributes are sorted); create the decision tree by partitioning the records. Example training data (Age, CarType, Class(risk)): (23, Family, HIGH), (17, Sports, HIGH), (43, Sports, HIGH), (68, Family, LOW), (32, Truck, LOW), (20, Family, HIGH). Sorted Age list (Age, Class, RecId): (17, HIGH, 2), (20, HIGH, 6), (23, HIGH, 1), (32, LOW, 5), (43, HIGH, 3), (68, LOW, 4).

50–51 Numeric Attribute Splitting Index Using the sorted Age list above, class histograms of (HIGH, LOW) counts are maintained while scanning the partition positions:
– Position 0: C_below = (0, 0), C_above = (4, 2), Gini_split = 0.44
– Position 3: C_below = (3, 0), C_above = (1, 2), Gini_split = 0.22
– Position 6: C_below = (4, 2), C_above = (0, 0), Gini_split = 0.44
The best split is at position 3: Age < 27.5 goes to node N1, Age > 27.5 to node N2.

52 Numeric Attribute Splitting Index After the split on Age at 27.5, the attribute lists are partitioned between the children. N1 holds (Age, Class, Car type, RecId) = (17, HIGH, sport, 2), (20, HIGH, family, 6), (23, HIGH, family, 1), all HIGH; N2 holds (32, LOW, truck, 5), (43, HIGH, sport, 3), (68, LOW, family, 4), where Car type is considered next.

53 SPRINT SPRINT: A Scalable Parallel Classifier for Data Mining (VLDB'96, J. Shafer et al.): an enhancement of SLIQ, implemented in both serial and parallel versions for good data placement and load balancing; a one-time sort of the data items; uses two data structures, attribute lists and histograms, which need not be memory-resident, making SPRINT suitable for large data sets; handles both continuous and categorical attributes.

54 Decision Tree Issues Issues: Underfitting and Overfitting; Missing values

55 Underfitting and Overfitting [figure]

56 Overfitting and Tree Pruning Overfitting: an induced tree may overfit the training data; too many branches, some of which may reflect anomalies due to noise or outliers, give poor accuracy for unseen samples. Two approaches to avoid overfitting: Pre-pruning, stop tree construction early, i.e., do not split a node if this would result in the gain measure falling below a threshold (it is difficult to choose an appropriate threshold); Post-pruning, remove branches from a "fully grown" tree, using a set of data different from the training data to decide which is the "best pruned tree".

57 Tree Post-Pruning Reduced error pruning: start from the leaves; replace each node with its most popular class; keep the change if the prediction accuracy is not affected. Cost complexity pruning: generate a series of trees T0, …, Tm (T0 is the initial tree, Tm is the root alone); construct tree Ti by replacing a subtree of tree Ti−1 with a leaf node whose class label is determined by the majority class of the instances in the subtree. Which subtree to remove?

58 Decision Tree Issues Issues: Underfitting and Overfitting; Missing values

59 Handling Missing Values of Attributes Missing values affect decision tree construction in three different ways: they affect how the impurity measures are computed; they affect how an instance with a missing value is distributed to the child nodes; and they affect how a test instance with a missing value is classified.

60 Computing the Impurity Measure Before splitting (10 records, one of which has a missing Refund value): Entropy(Parent) = −0.3 log₂(0.3) − 0.7 log₂(0.7) = 0.8813. Split on Refund: Entropy(Refund=Yes) = 0; Entropy(Refund=No) = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.9183; Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551. Gain = 0.9 × (0.8813 − 0.551) = 0.2973, where 0.9 is the fraction of records with a known Refund value.
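A short check of these numbers (log base 2 throughout, as in the entropies above); the weights 0.3 and 0.6 are relative to all 10 records, since the record with a missing Refund value is excluded from the split:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

parent = entropy([3, 7])              # 10 records: 3 Yes, 7 No -> 0.8813
e_yes = entropy([3, 0])               # Refund=Yes: all one class -> 0
e_no = entropy([2, 4])                # Refund=No -> 0.9183
children = 0.3 * e_yes + 0.6 * e_no   # weighted children entropy -> 0.551
gain = 0.9 * (parent - children)      # scaled by fraction with known Refund
print(round(parent, 4), round(e_no, 4), round(children, 3), round(gain, 4))
```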

61 Distribute Instances For a record with Refund missing: the probability that Refund=Yes is 3/9 and the probability that Refund=No is 6/9, so assign the record to the left child with weight 3/9 and to the right child with weight 6/9.

62–63 Classify Instances A new record with Marital Status missing reaches the MarSt node. Weighted class counts:
– Married: Class=No 3, Class=Yes 6/9, total 3.67
– Single: Class=No 1, Class=Yes 1, total 2
– Divorced: Class=No 0, Class=Yes 1, total 1
– Total: Class=No 4, Class=Yes 2.67, total 6.67
The probability that Marital Status = Married is 3.67/6.67 = 0.55, and the probability that Marital Status = {Single, Divorced} is 3/6.67 = 0.45. Hence the probability that Class = YES is 0.45 and the probability that Class = NO is 0.55.

64 Questions What is a decision tree? How do we choose the best attribute to split on? How do we decide the branches when splitting? What are CART, ID3, and C4.5, and what are their differences? How can decision trees be used to learn from large data? How do we avoid overfitting in decision tree learning? How can missing values be handled by a decision tree?

65 Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Ensemble Methods, Support Vector Machines

66 Tree -> Rules How can a tree be used in other ways?

67 Rule Extraction from a Decision Tree One rule is created for each path from the root to a leaf. Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction. The rules are mutually exclusive (the rules are independent and each record is covered by at most one rule) and exhaustive (each record is covered by at least one rule). The rule set contains as much information as the tree.

68 Rule Extraction from a Decision Tree Rules can be simplified, e.g., (Marital Status = {Married}) ==> No. A sketch of the path-to-rule extraction follows.
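A sketch of the root-to-leaf walk over a hypothetical nested-dict encoding of the running example tree; the encoding is invented for illustration, not taken from any library or from C4.5.

```python
# Each internal node tests one attribute; each leaf holds a class label.
tree = {"attr": "Refund",
        "branches": {
            "Yes": {"label": "No"},
            "No": {"attr": "MarSt",
                   "branches": {
                       "Married": {"label": "No"},
                       "Single, Divorced": {
                           "attr": "TaxInc",
                           "branches": {"< 80K": {"label": "No"},
                                        ">= 80K": {"label": "Yes"}}}}}}}

def extract_rules(node, conditions=()):
    """Yield one rule per root-to-leaf path; the conditions on the path
    form the rule antecedent, the leaf label the consequent."""
    if "label" in node:
        yield (conditions, node["label"])
        return
    for value, child in node["branches"].items():
        yield from extract_rules(child, conditions + ((node["attr"], value),))

for antecedent, label in extract_rules(tree):
    body = " AND ".join(f"({a} = {v})" for a, v in antecedent)
    print(f"{body} ==> Cheat = {label}")
```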

69 Rule-Based Classifier Classify records by using a collection of "if…then…" rules. Rule: represent the knowledge in the form of IF-THEN rules, (Condition) -> y, where Condition is a conjunction of attribute tests (the rule antecedent) and y is the class label (the rule consequent). Examples of classification rules: (Blood Type = Warm) AND (Lay Eggs = Yes) -> Birds; (Taxable Income < 50K) AND (Refund = Yes) -> Cheat = No. Classifier: a rule R covers an instance x if the attributes of the instance satisfy the condition of the rule; instance x is then labeled with the consequent of R.

70 Sequential Covering Algorithm Extract rules directly from the training data, e.g., FOIL (Quinlan, 1990), AQ (Michalski et al., 1986), CN2 (Clark and Niblett, 1989), RIPPER (Cohen, 1995). Steps: 1. Start from an empty rule set. 2. Grow a rule for one class Ci using the Learn-One-Rule function. 3. Remove the training records covered by the rule. 4. Repeat steps 2 and 3 until a stopping criterion is met, e.g., when no training examples remain or when the quality of the rule returned falls below a user-specified threshold. Compare with decision-tree induction: a decision tree learns a set of rules simultaneously. A skeleton of the loop follows.
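A skeleton of the covering loop in Python. Both interfaces here are invented for illustration: learn_one_rule is an assumed callback returning a (rule, quality) pair or None, and rule objects are assumed to expose covers(record).

```python
def sequential_covering(records, target_class, learn_one_rule, min_quality=0.0):
    """Greedy rule induction: grow one rule, remove what it covers, repeat."""
    rules = []
    remaining = list(records)
    while remaining:
        result = learn_one_rule(remaining, target_class)  # step 2: grow a rule
        if result is None:
            break
        rule, quality = result
        if quality < min_quality:                         # stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining
                     if not rule.covers(r)]               # step 3: remove covered
    return rules
```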

71 Rules for 2-Class and Multi-Class Problems For a 2-class problem: choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class; the negative class becomes the default class. For a multi-class problem: order the classes by increasing class prevalence (the fraction of instances that belong to a particular class); learn the rule set for the smallest class first, treating the rest as the negative class; repeat with the next smallest class as the positive class.

72 AMCS/CS 340: Data Mining Classification: Evaluation Xiangliang Zhang King Abdullah University of Science and Technology

73 Model Evaluation Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

74 Metrics for Performance Evaluation Focus on the predictive capability of a model, rather than on how long it takes to classify or build models, scalability, etc. The most widely used basis is the confusion matrix:
– actual Yes, predicted Yes: a (TP, true positive)
– actual Yes, predicted No: b (FN, false negative)
– actual No, predicted Yes: c (FP, false positive)
– actual No, predicted No: d (TN, true negative)

75 Limitation of Accuracy Consider a 2-class problem with unbalanced classes: 9990 examples of class 0 and 10 examples of class 1. If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect a single class 1 example.

76 Other Measures From the confusion matrix (a = TP, b = FN, c = FP, d = TN): precision = a / (a + c); recall (true positive rate) = a / (a + b); F-measure = 2a / (2a + b + c); accuracy = (a + d) / (a + b + c + d). A sketch follows.
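A small sketch computing these measures from the four counts; note the guards, since tp can be 0 (as in the unbalanced example above):

```python
def metrics(tp, fn, fp, tn):
    """Measures derived from the confusion matrix (a=TP, b=FN, c=FP, d=TN)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return accuracy, precision, recall, f

# Predicting everything as class 0 on the unbalanced data: 99.9% accuracy,
# but zero precision and recall on class 1.
print(metrics(tp=0, fn=10, fp=0, tn=9990))   # (0.999, 0.0, 0.0, 0.0)
```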

77 Model Evaluation Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

78 Methods for Performance Evaluation How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm: the class distribution, the cost of misclassification, and the sizes of the training and test sets.

79 Methods of Estimation Holdout: reserve 2/3 of the data for training and 1/3 for testing. Random subsampling: repeated holdout. Cross-validation: partition the data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n. Bootstrap: sampling with replacement. (A k-fold sketch follows.)
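A minimal k-fold sketch; train_and_score is an assumed callback that fits a model on the training fold and returns the accuracy on the test fold.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition record indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, k, train_and_score):
    """Train on k-1 folds, test on the held-out fold, average the accuracy."""
    folds = k_fold_indices(len(records), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [records[j] for j in test_idx]
        train = [records[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```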

80 Model Evaluation Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

81 ROC (Receiver Operating Characteristic) Developed in the 1950s for signal detection theory to analyze noisy signals; characterizes the trade-off between positive hits and false alarms. The ROC curve plots the TP rate (y-axis) against the FP rate (x-axis). The performance of each classifier is represented as a point on the ROC curve: changing the algorithm's threshold or the sample distribution moves the point. Example: a 1-dimensional data set containing 2 classes (positive and negative), where any point located at x > t is classified as positive; at threshold t, TPR = 0.5 and FPR = 0.12.

82 ROC Curve Reading points as (TPR, FPR): (0, 0) declares everything to be the negative class; (1, 1) declares everything to be the positive class; (1, 0) is ideal. The diagonal line corresponds to random guessing; below the diagonal line, the prediction is the opposite of the true class.

83 Using ROC for Model Comparison No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. The area under the ROC curve summarizes performance: ideal, area = 1; random guess, area = 0.5.

84 How to Construct an ROC Curve Sort the test instances by the posterior probability the classifier assigns to the positive class. Sweep a threshold t down this ordering; at each t, count the positive and negative instances scored >= t to obtain one (FPR, TPR) point. A sketch follows.
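A sketch of the sweep (ties between equal scores are handled naively here, one instance at a time):

```python
def roc_points(scores, labels):
    """Sweep the threshold down the sorted posterior scores; at each cut,
    everything scored >= t is predicted positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

# Two positives scored above two negatives give a perfect staircase:
print(roc_points([0.9, 0.8, 0.7, 0.4], [1, 1, 0, 0]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```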

85 Confidence Interval for Accuracy Prediction can be regarded as a Bernoulli trial: a Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong. A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions. For example, if you toss a fair coin 50 times, how many heads would turn up? The expected number of heads is N × p = 50 × 0.5 = 25. Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p (the true accuracy of the model)?

86 Confidence Interval for Accuracy For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1−p)/N, so P(−Z_{α/2} ≤ (acc − p)/√(p(1−p)/N) ≤ Z_{1−α/2}) = 1 − α. Solving for p gives the confidence interval
p = (2·N·acc + Z²_{α/2} ± Z_{α/2}·√(Z²_{α/2} + 4·N·acc − 4·N·acc²)) / (2·(N + Z²_{α/2})).

87 Confidence Interval for Accuracy Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8; let 1 − α = 0.95 (95% confidence); from the standard normal table, Z_{α/2} = 1.96 (other levels: 0.99 gives 2.58, 0.98 gives 2.33, 0.90 gives 1.65). The interval tightens as N grows:
N: 50, 100, 500, 1000, 5000
p(lower): 0.670, 0.711, 0.763, 0.774, 0.789
p(upper): 0.888, 0.866, 0.833, 0.824, 0.811
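The interval formula from the previous slide reproduces this table exactly; a short sketch:

```python
import math

def acc_confidence_interval(acc, n, z):
    """Interval for the true accuracy p, given observed acc on n test instances."""
    centre = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (centre - spread) / denom, (centre + spread) / denom

# The table row for acc = 0.8 at 95% confidence (z = 1.96):
for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_confidence_interval(0.8, n, 1.96)
    print(n, round(lo, 3), round(hi, 3))
```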

88 Test of Significance Given two models: model M1 with accuracy 85%, tested on 30 instances; model M2 with accuracy 75%, tested on 5000 instances. Can we say M1 is better than M2? How much confidence can we place in the accuracies of M1 and M2? Can the difference in the performance measure be explained as a result of random fluctuations in the test set?

89 Comparing the Performance of 2 Models Given two models, say M1 and M2, which is better? M1 is tested on D1 (size = n1) with error rate e1; M2 is tested on D2 (size = n2) with error rate e2; assume D1 and D2 are independent. If n1 and n2 are sufficiently large, the error rates are approximately normally distributed, and each variance can be approximated (from the binomial distribution) by σ̂ᵢ² = eᵢ(1 − eᵢ)/nᵢ.

90 Comparing the Performance of 2 Models To test whether the performance difference is statistically significant, let d = e1 − e2 and let d_t be the true difference. Since D1 and D2 are independent, their variances add up: σ_d² = σ₁² + σ₂² ≈ e1(1 − e1)/n1 + e2(1 − e2)/n2. At the (1 − α) confidence level, d_t = d ± Z_{α/2}·σ̂_d.

91 An Illustrative Example Given M1: n1 = 30, e1 = 0.15 and M2: n2 = 5000, e2 = 0.25, so d = |e2 − e1| = 0.1 (2-sided test). σ̂_d² = 0.15 × 0.85 / 30 + 0.25 × 0.75 / 5000 = 0.0043. At the 95% confidence level, Z_{α/2} = 1.96, so d_t = 0.100 ± 1.96 × √0.0043 = 0.100 ± 0.128. The interval contains 0, so the difference may not be statistically significant.
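A few lines verify the interval:

```python
import math

def difference_interval(e1, n1, e2, n2, z):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half = z * math.sqrt(var)
    return d - half, d + half

lo, hi = difference_interval(0.15, 30, 0.25, 5000, z=1.96)
print(round(lo, 3), round(hi, 3))   # (-0.028, 0.228): straddles 0, not significant
```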

92 Our Teaching Assistant Name: Francisco Franco Email: francisco.franco@kaust.edu.sa Responsibilities: homework, project reports, final grades, questions

