
1 AMCS/CS 340: Data Mining Classification I: Decision Tree Xiangliang Zhang King Abdullah University of Science and Technology

2 Classification: Definition Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

3 Classification Example [figure: a training set with categorical and continuous attributes plus a class label; a classifier is learned from it and then applied to a test set] The task: predict borrowers who cheat on loan payments.

4 Issues: Evaluating Classification Methods
– Accuracy: how well the class labels of test data are predicted
– Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
– Robustness: handling noise and missing values
– Scalability: efficiency on large-scale data
– Interpretability: understanding and insight provided by the model
– Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

5 Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Ensemble Methods, Support Vector Machines

6 Example of a Decision Tree [figure: training data with categorical and continuous attributes, and the induced tree] The splitting attributes: Refund (Yes leads to NO; No leads to MarSt), Marital Status (Married leads to NO; Single, Divorced leads to TaxInc), Taxable Income (< 80K leads to NO; > 80K leads to YES).

7 Another Example of a Decision Tree [figure: a different tree for the same data, splitting on MarSt first, then Refund and TaxInc] There could be more than one tree that fits the same data!

8 Decision Tree Classification Task [figure]

9 Apply Model to Test Data Start from the root of the tree. [slides 10–13 repeat the same tree while the test record is routed down it: Refund = No leads to MarSt, and MarSt = Married leads to a leaf]

14 Apply Model to Test Data Assign Cheat to "No".

15 Decision Tree Classification Task [figure]

16 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm):
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Examples are partitioned recursively based on selected attributes
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping the partitioning (see the sketch below):
– All samples at a given node belong to the same class
– There are no remaining attributes for further partitioning (majority voting is employed for labeling the leaf)
– There are no samples left
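A minimal, hypothetical Python sketch of this greedy loop, using Gini as the selection heuristic. The record layout (dicts with a "class" key plus categorical attribute keys), the Node class, and all helper names are illustrative, not code from the lecture.

```python
from collections import Counter

class Node:
    def __init__(self, label=None, attr=None, children=None):
        self.label = label        # class label if this is a leaf
        self.attr = attr          # attribute tested at an internal node
        self.children = children  # dict: attribute value -> child Node

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def gini(records):
    n = len(records)
    return 1.0 - sum((c / n) ** 2
                     for c in Counter(r["class"] for r in records).values())

def best_split(records, attributes):
    # heuristic step: pick the attribute whose multi-way split
    # minimizes the weighted Gini index of the children
    n = len(records)
    def split_gini(attr):
        return sum(len(s) / n * gini(s)
                   for v in {r[attr] for r in records}
                   for s in [[r for r in records if r[attr] == v]])
    return min(attributes, key=split_gini)

def build_tree(records, attributes):
    classes = {r["class"] for r in records}
    if len(classes) == 1:                      # stopping condition 1: pure node
        return Node(label=classes.pop())
    if not attributes:                         # stopping condition 2: majority vote
        return Node(label=majority_class(records))
    attr = best_split(records, attributes)
    # recurse only on observed values, so empty partitions never arise
    children = {v: build_tree([r for r in records if r[attr] == v],
                              [a for a in attributes if a != attr])
                for v in {r[attr] for r in records}}
    return Node(attr=attr, children=children)
```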

17 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

18 General Structure of Hunt's Algorithm Let Dt be the set of training records that reach a node t. General procedure: if Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt; if Dt is an empty set, then t is a leaf node labeled with the default class yd; if Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

19 Hunt's Algorithm [figure: the tree grows step by step on the loan data, first a single leaf (Don't Cheat), then a split on Refund (Yes leads to Don't Cheat), then Marital Status under Refund = No (Married leads to Don't Cheat), then Taxable Income under Single/Divorced (< 80K Don't Cheat, >= 80K Cheat)]

20 Issues of Hunt's Algorithm Determine how to split the records: how to specify the attribute test condition (how many branches; the partition threshold for splitting), and how to determine the best split (which attribute to choose).

21 How to Specify the Test Condition? Depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split, multi-way split).

22 Splitting Based on Nominal Attributes Multi-way split: use as many partitions as distinct values, e.g., CarType splits into {Family | Sports | Luxury}. Binary split: divide the values into two subsets and find the optimal partitioning, e.g., CarType splits into {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

23 Splitting Based on Ordinal Attributes Multi-way split: use as many partitions as distinct values, e.g., Size splits into {Small | Medium | Large}. Binary split: divide the values into two subsets that respect the order, e.g., Size splits into {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}. What about the split {Small, Large} vs. {Medium}? It breaks the ordering of the attribute.

24 Splitting Based on Continuous Attributes Different ways of handling: Binary decision, (A < v) or (A >= v): consider all possible splits and find the best cut. Discretization to form an ordinal categorical attribute: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.

25 Splitting Based on Continuous Attributes [figure]

26 Issues of Hunt's Algorithm Determine how to split the records: how to specify the attribute test condition (how many branches; the partition threshold for splitting), and how to determine the best split (which attribute to choose).

27 How to Determine the Best Split Before splitting: 10 records of class 0 and 10 records of class 1. [figure: candidate splits on different attributes]

28 How to Determine the Best Split Greedy approach: nodes with a homogeneous class distribution are preferred. We therefore need a measure of node impurity: a non-homogeneous node has a high degree of impurity, a homogeneous node a low degree of impurity.

29 How to Determine the Best Split Before splitting: 10 records of class 0 and 10 records of class 1. [figure: the candidate splits, from non-homogeneous (high impurity) to homogeneous (low impurity)]

30 Measures of Node Impurity Gini Index: a measure of how often a randomly chosen element from a set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set. Entropy: a measure of the uncertainty associated with a random variable. Misclassification error: the proportion of misclassified samples.

31 Measures of Node Impurity Let p(j|t) be the relative frequency of class j at node t. Gini Index: GINI(t) = 1 − Σ_j [p(j|t)]². Entropy: Entropy(t) = − Σ_j p(j|t) log₂ p(j|t). Misclassification error: Error(t) = 1 − max_j p(j|t).
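The three measures are small enough to check directly. A sketch, with the class frequencies p(j|t) passed in as a plain list:

```python
import math

def gini(p):
    """Gini index of a node given class frequencies p(j|t)."""
    return 1.0 - sum(pj ** 2 for pj in p)

def entropy(p):
    """Entropy in bits; 0 * log 0 is taken as 0."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def misclassification_error(p):
    """One minus the frequency of the majority class."""
    return 1.0 - max(p)

# A pure node is minimally impure, a 50/50 node maximally so:
print(gini([1.0, 0.0]), gini([0.5, 0.5]))        # 0.0  0.5
print(entropy([1.0, 0.0]), entropy([0.5, 0.5]))  # 0.0  1.0
print(misclassification_error([0.5, 0.5]))       # 0.5
```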

32–33 Comparison among Measures of Node Impurity For a 2-class problem: [figure: Gini, entropy, and misclassification error plotted against p, the fraction of records in one class; all three curves are 0 at p = 0 and p = 1 and peak at p = 0.5]

34 Quality of Split When a node p with n records is split into k partitions (children), with n_i records at child i, the quality of the split is computed as GINI_split = Σ_{i=1..k} (n_i/n) GINI(i), or analogously with entropy. The gain, GAIN_split = GINI(p) − GINI_split (for entropy, the information gain), measures the reduction in GINI/Entropy achieved by the split: choose the split that achieves the most reduction (maximizes GAIN).

35 Quality of Split: Binary Attributes Splitting into two partitions, nodes N1 and N2: Gini(N1) = 1 − (4/7)² − (3/7)² = 0.4898; Gini(N2) = 1 − (2/5)² − (3/5)² = 0.480; Gini_split(children) = 7/12 × 0.4898 + 5/12 × 0.480 = 0.486; Gain_split = Gini(parent) − Gini_split(children) = 0.014.
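A few lines of Python reproduce the arithmetic above, assuming (consistently with the child counts) that the parent holds 6 records of each class, so Gini(parent) = 0.5:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

g1 = gini([4, 3])                     # node N1 -> 0.4898
g2 = gini([2, 3])                     # node N2 -> 0.480
gini_split = 7/12 * g1 + 5/12 * g2    # weighted by child sizes -> 0.486
gain = gini([6, 6]) - gini_split      # parent Gini 0.5 -> gain 0.014
print(round(g1, 4), round(g2, 3), round(gini_split, 3), round(gain, 3))
```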

36 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

37 CART CART: Classification and Regression Trees. Constructs trees with only binary splits (which simplifies the splitting criterion); uses the Gini Index as the splitting criterion; splits on the attribute that provides the smallest GINI_split(p), i.e., the largest GAIN_split(p); needs to enumerate all possible splitting points for each attribute.

38 Continuous Attributes: Computing the Gini Index For efficient computation, for each attribute: sort the attribute values; set the candidate split positions at the midpoints between adjacent sorted values; linearly scan these values, each time updating the class-count matrix and computing the Gini index; choose the split position that has the least Gini index (see the sketch below).
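A sketch of this linear scan; the function name and the (values, labels) interface are invented for illustration. Run on the Age/risk data from the SLIQ slides further down, it recovers the split at 27.5 with weighted Gini 0.22:

```python
def best_numeric_split(values, labels):
    """Scan candidate thresholds (midpoints of adjacent sorted values)
    and return (weighted Gini, threshold) for the best binary split."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))

    def gini(counts):
        n = sum(counts.values())
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    below = {c: 0 for c in classes}
    above = {c: 0 for c in classes}
    for _, y in pairs:
        above[y] += 1

    best = (float("inf"), None)
    n = len(pairs)
    for i in range(n - 1):
        _, y = pairs[i]
        below[y] += 1                       # move one record across the boundary
        above[y] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                        # no valid threshold between equal values
        split = (i + 1) / n * gini(below) + (n - i - 1) / n * gini(above)
        if split < best[0]:
            threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
            best = (split, threshold)
    return best

print(best_numeric_split([23, 17, 43, 68, 32, 20],
                         ["H", "H", "H", "L", "L", "H"]))  # (0.2222..., 27.5)
```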

39 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

40 How to Find the Best Split [figure: comparing the multi-way split on CarType (Family | Sports | Luxury) with the two-way splits {Sports, Luxury} vs. {Family} and {Family, Luxury} vs. {Sports}; the best partition of values yields the largest Gain = 0.337]

41 Which Attribute to Split? [figure: three candidate attributes with Gain = 0.337, Gain = 0.02, and Gain = 0.5] Which is best? Disadvantage: information gain tends to prefer splits that result in a large number of partitions, each being small but pure. A unique value for each record is not predictive, and a small number of records in each node does not give reliable predictions.

42 Splitting Based on Gain Ratio Gain Ratio: when parent node p is split into k partitions, with n_i records in partition i, GainRatio_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1..k} (n_i/n) log₂(n_i/n). Designed to overcome the disadvantage of information gain: it adjusts the information gain by the entropy of the partitioning (SplitINFO), so higher-entropy partitioning (a large number of small partitions) is penalized!
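A small sketch of the penalty at work: the same information gain looks much worse when it comes from many tiny partitions.

```python
import math

def split_info(sizes):
    """Entropy of the partition itself: -sum (n_i/n) log2 (n_i/n)."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(gain_ratio(0.3, [10, 10]))   # 2-way split: SplitINFO = 1.0 -> 0.3
print(gain_ratio(0.3, [2] * 10))   # 10-way split: SplitINFO = 3.32 -> 0.09
```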

43 Comparing Attribute Selection Measures The three measures, in general, return good results, but: Gini gain is biased toward multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both partitions; information gain is biased toward multivalued attributes; gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.

44 ID3 and C4.5 ID3 (Ross Quinlan, 1986) is the precursor to the C4.5 algorithm (Ross Quinlan, 1993); C4.5 is an extension of the earlier ID3 algorithm. For each unused attribute Ai, compute the information GAIN (ID3) or GainRatio (C4.5) from splitting on Ai; find the best splitting attribute A_best, the one with the highest GAIN or GainRatio; create a decision node that splits on A_best; recurse on the sublists obtained by splitting on A_best, and add the resulting nodes as children of the node.

45 Improvements of C4.5 over the ID3 algorithm Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. Handling training data with missing attribute values: C4.5 allows attribute values to be marked as ? for missing; missing attribute values are simply not used in the gain and entropy calculations. Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes.

46 C4.5 Issues: needs the entire data set to fit in memory, so it is unsuitable for large datasets; needs a lot of computation at every stage of decision tree construction. You can download the software from: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/c4.5r8.tar.gz More information: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html

47 Decision Tree Induction Many algorithms: Hunt's Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

48 SLIQ, a decision tree classifier SLIQ, Supervised Learning In Quest (EDBT'96, Mehta et al.): uses a pre-sorting technique in the tree-growing phase (eliminates the need to sort the data at each node); creates a separate list for each attribute of the training data; a separate list, called the class list, is created for the class labels attached to the examples; SLIQ requires that only the class list and one attribute list be kept in memory at any time; suitable for classification of large disk-resident datasets; applies to both numerical and categorical attributes.

49 SLIQ Methodology Generate an attribute list for each attribute; sort the attribute lists for NUMERIC attributes (only NUMERIC attributes are sorted); create the decision tree by partitioning the records. Example training data (Age, CarType, Class(risk)): (23, Family, HIGH), (17, Sports, HIGH), (43, Sports, HIGH), (68, Family, LOW), (32, Truck, LOW), (20, Family, HIGH). Sorted Age list (Age, Class, RecId): (17, HIGH, 2), (20, HIGH, 6), (23, HIGH, 1), (32, LOW, 5), (43, HIGH, 3), (68, LOW, 4).

50–51 Numeric Attribute Splitting Index Using the sorted Age list above, class histograms of (HIGH, LOW) counts are maintained while scanning the partition positions:
– Position 0: C_below = (0, 0), C_above = (4, 2), Gini_split = 0.44
– Position 3: C_below = (3, 0), C_above = (1, 2), Gini_split = 0.22
– Position 6: C_below = (4, 2), C_above = (0, 0), Gini_split = 0.44
The best split is at position 3: Age < 27.5 goes to node N1, Age > 27.5 to node N2.

52 Numeric Attribute Splitting Index After the split on Age at 27.5, the attribute lists are partitioned between the children. N1 holds (Age, Class, Car type, RecId) = (17, HIGH, sport, 2), (20, HIGH, family, 6), (23, HIGH, family, 1), all HIGH; N2 holds (32, LOW, truck, 5), (43, HIGH, sport, 3), (68, LOW, family, 4), where Car type is considered next.

53 SPRINT SPRINT: A Scalable Parallel Classifier for Data Mining (VLDB'96, J. Shafer et al.): an enhancement of SLIQ, implemented in both serial and parallel versions for good data placement and load balancing; a one-time sort of the data items; uses two data structures, attribute lists and histograms, which need not be memory-resident, making SPRINT suitable for large data sets; handles both continuous and categorical attributes.

54 Decision Tree Issues Issues: Underfitting and Overfitting; Missing values

55 Underfitting and Overfitting [figure]

56 Overfitting and Tree Pruning Overfitting: an induced tree may overfit the training data; too many branches, some of which may reflect anomalies due to noise or outliers, give poor accuracy for unseen samples. Two approaches to avoid overfitting: Pre-pruning, stop tree construction early, i.e., do not split a node if this would result in the gain measure falling below a threshold (it is difficult to choose an appropriate threshold); Post-pruning, remove branches from a "fully grown" tree, using a set of data different from the training data to decide which is the "best pruned tree".

57 Tree Post-Pruning Reduced error pruning: start from the leaves; replace each node with its most popular class; keep the change if the prediction accuracy is not affected. Cost complexity pruning: generate a series of trees T0, …, Tm (T0 is the initial tree, Tm is the root alone); construct tree Ti by replacing a subtree of tree Ti−1 with a leaf node whose class label is determined by the majority class of the instances in the subtree. Which subtree to remove?

58 Decision Tree Issues Issues: Underfitting and Overfitting; Missing values

59 Handling Missing Values of Attributes Missing values affect decision tree construction in three different ways: they affect how the impurity measures are computed; they affect how an instance with a missing value is distributed to the child nodes; and they affect how a test instance with a missing value is classified.

60 Computing the Impurity Measure Before splitting (10 records, one of which has a missing Refund value): Entropy(Parent) = −0.3 log₂(0.3) − 0.7 log₂(0.7) = 0.8813. Split on Refund: Entropy(Refund=Yes) = 0; Entropy(Refund=No) = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.9183; Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551. Gain = 0.9 × (0.8813 − 0.551) = 0.2973, where 0.9 is the fraction of records with a known Refund value.
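A short check of these numbers (log base 2 throughout, as in the entropies above); the weights 0.3 and 0.6 are relative to all 10 records, since the record with a missing Refund value is excluded from the split:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

parent = entropy([3, 7])              # 10 records: 3 Yes, 7 No -> 0.8813
e_yes = entropy([3, 0])               # Refund=Yes: all one class -> 0
e_no = entropy([2, 4])                # Refund=No -> 0.9183
children = 0.3 * e_yes + 0.6 * e_no   # weighted children entropy -> 0.551
gain = 0.9 * (parent - children)      # scaled by fraction with known Refund
print(round(parent, 4), round(e_no, 4), round(children, 3), round(gain, 4))
```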

61 Distribute Instances For a record with Refund missing: the probability that Refund=Yes is 3/9 and the probability that Refund=No is 6/9, so assign the record to the left child with weight 3/9 and to the right child with weight 6/9.

62–63 Classify Instances A new record with Marital Status missing reaches the MarSt node. Weighted class counts:
– Married: Class=No 3, Class=Yes 6/9, total 3.67
– Single: Class=No 1, Class=Yes 1, total 2
– Divorced: Class=No 0, Class=Yes 1, total 1
– Total: Class=No 4, Class=Yes 2.67, total 6.67
The probability that Marital Status = Married is 3.67/6.67 = 0.55, and the probability that Marital Status = {Single, Divorced} is 3/6.67 = 0.45. Hence the probability that Class = YES is 0.45 and the probability that Class = NO is 0.55.

64 Questions What is a decision tree? How do we choose the best attribute to split on? How do we decide the branches when splitting? What are CART, ID3, and C4.5, and what are their differences? How can decision trees be used to learn from large data? How do we avoid overfitting in decision tree learning? How can missing values be handled by a decision tree?

65 Classification Techniques: Decision Tree based Methods, Rule-based Methods, Learning from Neighbors, Bayesian Classification, Neural Networks, Ensemble Methods, Support Vector Machines

66 Tree -> Rules How can a tree be used in other ways?

67 Rule Extraction from a Decision Tree One rule is created for each path from the root to a leaf. Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction. The rules are mutually exclusive (the rules are independent and each record is covered by at most one rule) and exhaustive (each record is covered by at least one rule). The rule set contains as much information as the tree.

68 Rule Extraction from a Decision Tree Rules can be simplified, e.g., (Marital Status = {Married}) ==> No. A sketch of the path-to-rule extraction follows.
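A sketch of the root-to-leaf walk over a hypothetical nested-dict encoding of the running example tree; the encoding is invented for illustration, not taken from any library or from C4.5.

```python
# Each internal node tests one attribute; each leaf holds a class label.
tree = {"attr": "Refund",
        "branches": {
            "Yes": {"label": "No"},
            "No": {"attr": "MarSt",
                   "branches": {
                       "Married": {"label": "No"},
                       "Single, Divorced": {
                           "attr": "TaxInc",
                           "branches": {"< 80K": {"label": "No"},
                                        ">= 80K": {"label": "Yes"}}}}}}}

def extract_rules(node, conditions=()):
    """Yield one rule per root-to-leaf path; the conditions on the path
    form the rule antecedent, the leaf label the consequent."""
    if "label" in node:
        yield (conditions, node["label"])
        return
    for value, child in node["branches"].items():
        yield from extract_rules(child, conditions + ((node["attr"], value),))

for antecedent, label in extract_rules(tree):
    body = " AND ".join(f"({a} = {v})" for a, v in antecedent)
    print(f"{body} ==> Cheat = {label}")
```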

69 Rule-Based Classifier Classify records by using a collection of "if…then…" rules. Rule: represent the knowledge in the form of IF-THEN rules, (Condition) -> y, where Condition is a conjunction of attribute tests (the rule antecedent) and y is the class label (the rule consequent). Examples of classification rules: (Blood Type = Warm) AND (Lay Eggs = Yes) -> Birds; (Taxable Income < 50K) AND (Refund = Yes) -> Cheat = No. Classifier: a rule R covers an instance x if the attributes of the instance satisfy the condition of the rule; instance x is then labeled with the consequent of R.

70 Sequential Covering Algorithm Extract rules directly from the training data, e.g., FOIL (Quinlan, 1990), AQ (Michalski et al., 1986), CN2 (Clark and Niblett, 1989), RIPPER (Cohen, 1995). Steps: 1. Start from an empty rule set. 2. Grow a rule for one class Ci using the Learn-One-Rule function. 3. Remove the training records covered by the rule. 4. Repeat steps 2 and 3 until a stopping criterion is met, e.g., when no training examples remain or when the quality of the rule returned falls below a user-specified threshold. Compare with decision-tree induction: a decision tree learns a set of rules simultaneously. A skeleton of the loop follows.
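A skeleton of the covering loop in Python. Both interfaces here are invented for illustration: learn_one_rule is an assumed callback returning a (rule, quality) pair or None, and rule objects are assumed to expose covers(record).

```python
def sequential_covering(records, target_class, learn_one_rule, min_quality=0.0):
    """Greedy rule induction: grow one rule, remove what it covers, repeat."""
    rules = []
    remaining = list(records)
    while remaining:
        result = learn_one_rule(remaining, target_class)  # step 2: grow a rule
        if result is None:
            break
        rule, quality = result
        if quality < min_quality:                         # stopping criterion
            break
        rules.append(rule)
        remaining = [r for r in remaining
                     if not rule.covers(r)]               # step 3: remove covered
    return rules
```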

71 Rules for 2-Class and Multi-Class Problems For a 2-class problem: choose one of the classes as the positive class and the other as the negative class; learn rules for the positive class; the negative class becomes the default class. For a multi-class problem: order the classes by increasing class prevalence (the fraction of instances that belong to a particular class); learn the rule set for the smallest class first, treating the rest as the negative class; repeat with the next smallest class as the positive class.

72 AMCS/CS 340: Data Mining Classification: Evaluation Xiangliang Zhang King Abdullah University of Science and Technology

73 Model Evaluation Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

74 Metrics for Performance Evaluation Focus on the predictive capability of a model, rather than on how long it takes to classify or build models, scalability, etc. The most widely used basis is the confusion matrix:
– actual Yes, predicted Yes: a (TP, true positive)
– actual Yes, predicted No: b (FN, false negative)
– actual No, predicted Yes: c (FP, false positive)
– actual No, predicted No: d (TN, true negative)

75 Limitation of Accuracy Consider a 2-class problem with unbalanced classes: 9990 examples of class 0 and 10 examples of class 1. If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect a single class 1 example.

76 Other Measures From the confusion matrix (a = TP, b = FN, c = FP, d = TN): precision = a / (a + c); recall (true positive rate) = a / (a + b); F-measure = 2a / (2a + b + c); accuracy = (a + d) / (a + b + c + d). A sketch follows.
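A small sketch computing these measures from the four counts; note the guards, since tp can be 0 (as in the unbalanced example above):

```python
def metrics(tp, fn, fp, tn):
    """Measures derived from the confusion matrix (a=TP, b=FN, c=FP, d=TN)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return accuracy, precision, recall, f

# Predicting everything as class 0 on the unbalanced data: 99.9% accuracy,
# but zero precision and recall on class 1.
print(metrics(tp=0, fn=10, fp=0, tn=9990))   # (0.999, 0.0, 0.0, 0.0)
```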

77 Model Evaluation Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

78 Methods for Performance Evaluation How to obtain a reliable estimate of performance? The performance of a model may depend on factors other than the learning algorithm: the class distribution, the cost of misclassification, and the sizes of the training and test sets.

79 Methods of Estimation Holdout: reserve 2/3 of the data for training and 1/3 for testing. Random subsampling: repeated holdout. Cross-validation: partition the data into k disjoint subsets; k-fold: train on k−1 partitions, test on the remaining one; leave-one-out: k = n. Bootstrap: sampling with replacement. (A k-fold sketch follows.)
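A minimal k-fold sketch; train_and_score is an assumed callback that fits a model on the training fold and returns the accuracy on the test fold.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition record indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(records, k, train_and_score):
    """Train on k-1 folds, test on the held-out fold, average the accuracy."""
    folds = k_fold_indices(len(records), k)
    scores = []
    for i, test_idx in enumerate(folds):
        test = [records[j] for j in test_idx]
        train = [records[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```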

80 Model Evaluation Metrics for performance evaluation: how to evaluate the performance of a model? Methods for performance evaluation: how to obtain reliable estimates? Methods for model comparison: how to compare the relative performance among competing models?

81 ROC (Receiver Operating Characteristic) Developed in the 1950s for signal detection theory to analyze noisy signals; characterizes the trade-off between positive hits and false alarms. The ROC curve plots the TP rate (y-axis) against the FP rate (x-axis). The performance of each classifier is represented as a point on the ROC curve: changing the algorithm's threshold or the sample distribution moves the point. Example: a 1-dimensional data set containing 2 classes (positive and negative), where any point located at x > t is classified as positive; at threshold t, TPR = 0.5 and FPR = 0.12.

82 ROC Curve Reading points as (TPR, FPR): (0, 0) declares everything to be the negative class; (1, 1) declares everything to be the positive class; (1, 0) is ideal. The diagonal line corresponds to random guessing; below the diagonal line, the prediction is the opposite of the true class.

83 Using ROC for Model Comparison No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. The area under the ROC curve summarizes performance: ideal, area = 1; random guess, area = 0.5.

84 How to Construct an ROC Curve Sort the test instances by the posterior probability the classifier assigns to the positive class. Sweep a threshold t down this ordering; at each t, count the positive and negative instances scored >= t to obtain one (FPR, TPR) point. A sketch follows.
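A sketch of the sweep (ties between equal scores are handled naively here, one instance at a time):

```python
def roc_points(scores, labels):
    """Sweep the threshold down the sorted posterior scores; at each cut,
    everything scored >= t is predicted positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points

# Two positives scored above two negatives give a perfect staircase:
print(roc_points([0.9, 0.8, 0.7, 0.4], [1, 1, 0, 0]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```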

85 Confidence Interval for Accuracy Prediction can be regarded as a Bernoulli trial: a Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong. A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions. For example, if you toss a fair coin 50 times, how many heads would turn up? The expected number of heads is N × p = 50 × 0.5 = 25. Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p (the true accuracy of the model)?

86 Confidence Interval for Accuracy For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1−p)/N, so P(−Z_{α/2} ≤ (acc − p)/√(p(1−p)/N) ≤ Z_{1−α/2}) = 1 − α. Solving for p gives the confidence interval
p = (2·N·acc + Z²_{α/2} ± Z_{α/2}·√(Z²_{α/2} + 4·N·acc − 4·N·acc²)) / (2·(N + Z²_{α/2})).

87 Confidence Interval for Accuracy Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8; let 1 − α = 0.95 (95% confidence); from the standard normal table, Z_{α/2} = 1.96 (other levels: 0.99 gives 2.58, 0.98 gives 2.33, 0.90 gives 1.65). The interval tightens as N grows:
N: 50, 100, 500, 1000, 5000
p(lower): 0.670, 0.711, 0.763, 0.774, 0.789
p(upper): 0.888, 0.866, 0.833, 0.824, 0.811
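The interval formula from the previous slide reproduces this table exactly; a short sketch:

```python
import math

def acc_confidence_interval(acc, n, z):
    """Interval for the true accuracy p, given observed acc on n test instances."""
    centre = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (centre - spread) / denom, (centre + spread) / denom

# The table row for acc = 0.8 at 95% confidence (z = 1.96):
for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_confidence_interval(0.8, n, 1.96)
    print(n, round(lo, 3), round(hi, 3))
```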

88 Test of Significance Given two models: model M1 with accuracy 85%, tested on 30 instances; model M2 with accuracy 75%, tested on 5000 instances. Can we say M1 is better than M2? How much confidence can we place in the accuracies of M1 and M2? Can the difference in the performance measure be explained as a result of random fluctuations in the test set?

89 Comparing the Performance of 2 Models Given two models, say M1 and M2, which is better? M1 is tested on D1 (size = n1) with error rate e1; M2 is tested on D2 (size = n2) with error rate e2; assume D1 and D2 are independent. If n1 and n2 are sufficiently large, the error rates are approximately normally distributed, and each variance can be approximated (from the binomial distribution) by σ̂ᵢ² = eᵢ(1 − eᵢ)/nᵢ.

90 Comparing the Performance of 2 Models To test whether the performance difference is statistically significant, let d = e1 − e2 and let d_t be the true difference. Since D1 and D2 are independent, their variances add up: σ_d² = σ₁² + σ₂² ≈ e1(1 − e1)/n1 + e2(1 − e2)/n2. At the (1 − α) confidence level, d_t = d ± Z_{α/2}·σ̂_d.

91 An Illustrative Example Given M1: n1 = 30, e1 = 0.15 and M2: n2 = 5000, e2 = 0.25, so d = |e2 − e1| = 0.1 (2-sided test). σ̂_d² = 0.15 × 0.85 / 30 + 0.25 × 0.75 / 5000 = 0.0043. At the 95% confidence level, Z_{α/2} = 1.96, so d_t = 0.100 ± 1.96 × √0.0043 = 0.100 ± 0.128. The interval contains 0, so the difference may not be statistically significant.
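A few lines verify the interval:

```python
import math

def difference_interval(e1, n1, e2, n2, z):
    """Confidence interval for the true difference between two error rates."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half = z * math.sqrt(var)
    return d - half, d + half

lo, hi = difference_interval(0.15, 30, 0.25, 5000, z=1.96)
print(round(lo, 3), round(hi, 3))   # (-0.028, 0.228): straddles 0, not significant
```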

92 Our Teaching Assistant Name: Francisco Franco Email: francisco.franco@kaust.edu.sa Responsibilities: homework, project reports, final grades, questions

