Chapter 4 (Bab 4) Classification: Basic Concepts, Decision Trees & Model Evaluation. Part 2: Model Overfitting & Classifier Evaluation

Classification Error

Example Data Set

Decision Trees

Model Overfitting

Overfitting due to Presence of Noise
The decision boundary is distorted by the noise point.

Example: Mammal Classification

Effect of Noise

Overfitting due to Insufficient Examples
- The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- An insufficient number of training records in the region causes the decision tree to classify test examples using other training records that are irrelevant to the classification task.

Lack of Representative Samples

Effect of Multiple Comparison Procedure / 1

Effect of Multiple Comparison Procedure / 2
How does the multiple comparison procedure relate to model overfitting?
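A minimal simulation (a sketch, not from the slides) makes the connection concrete: when a learner is allowed to pick the best of many candidate attributes, even attributes that are pure noise can look predictive on the training set. The sketch scores m random binary attributes against random labels and reports the best training accuracy found.

    import random

    random.seed(42)
    n, m = 100, 500                       # records, candidate noise attributes
    labels = [random.randint(0, 1) for _ in range(n)]

    best_acc = 0.0
    for _ in range(m):
        attr = [random.randint(0, 1) for _ in range(n)]
        # accuracy of predicting the label directly from this random attribute
        acc = sum(a == y for a, y in zip(attr, labels)) / n
        best_acc = max(best_acc, acc, 1 - acc)   # a split can use either polarity

    print(f"best training accuracy over {m} noise attributes: {best_acc:.2f}")
    # typically well above 0.5, even though every attribute is independent
    # of the label: selecting the best of many comparisons overfits
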

Effect of Multiple Comparison Procedure / 3

Notes on Overfitting

Estimating Generalization Errors

Resubstitution Estimates
Note: the classification decision at each leaf node is based on the majority class.

Incorporating Model Complexity (IMC)

IMC: Pessimistic Error Estimates

Pessimistic Error Estimates: Example
With Ω = 0.5: e'(T_L) = (4 + 7 × 0.5) / 24 = 0.3125; e'(T_R) = (6 + 4 × 0.5) / 24 = 0.3333
With Ω = 1.0: e'(T_L) = (4 + 7 × 1) / 24 = 0.4583; e'(T_R) = (6 + 4 × 1) / 24 = 0.4167
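These estimates follow the textbook form e'(T) = (total training errors + Ω × number of leaves) / N, with Ω the penalty per leaf node. A minimal sketch, assuming that form and the error/leaf counts implied by the example (T_L: 4 errors, 7 leaves; T_R: 6 errors, 4 leaves; N = 24):

    def pessimistic_error(train_errors, num_leaves, n_records, omega):
        # training error inflated by a complexity penalty omega per leaf
        return (train_errors + omega * num_leaves) / n_records

    for omega in (0.5, 1.0):
        print(omega,
              round(pessimistic_error(4, 7, 24, omega), 4),   # T_L
              round(pessimistic_error(6, 4, 24, omega), 4))   # T_R
    # 0.5 -> 0.3125 0.3333 ; 1.0 -> 0.4583 0.4167
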

IMC: Minimum Description Length / 1
- Based on an information-theoretic approach.
- Example (the figure on the slide is not reproduced here): A and B are given a set of records with known attributes x.
- A knows the exact label for each record, while B knows none.
- B obtains the classification of each record by having A transmit the class labels sequentially (this requires Θ(n) bits, where n = total number of records).

IMC: Minimum Description Length / 2

Estimating Statistical Bounds
- N = total number of training records used to compute e
- α = confidence level
- Z_{α/2} = standardized value from a standard normal distribution
- e' = upper bound on the error
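The bound formula itself appeared as an image on the slide. A sketch, assuming the textbook's normal-approximation upper bound e' = (e + z²/2N + z·sqrt(e(1 - e)/N + z²/4N²)) / (1 + z²/N), with the z-value taken from the standard library's NormalDist:

    from math import sqrt
    from statistics import NormalDist

    def error_upper_bound(e, n, alpha):
        # z_{alpha/2}: standardized value from a standard normal distribution
        z = NormalDist().inv_cdf(1 - alpha / 2)
        num = e + z * z / (2 * n) + z * sqrt(e / n - e * e / n + z * z / (4 * n * n))
        return num / (1 + z * z / n)

    # example (values are illustrative): 25% training error on N = 100
    # records, at 95% confidence
    print(round(error_upper_bound(0.25, 100, 0.05), 3))
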

Using a Validation Set

Handling Overfitting in Decision Trees / 1

Handling Overfitting in Decision Trees / 2

Example of Post-Pruning / 1

Example of Post-Pruning / 2

Evaluating the Performance of a Classifier

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation / 1
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                             PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL    Class=Yes         a           b
    CLASS     Class=No          c           d

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation / 2

                             PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL    Class=Yes       a (TP)      b (FN)
    CLASS     Class=No        c (FP)      d (TN)

Most widely used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
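A minimal sketch (not from the slides) computing accuracy directly from the four cells; with the 9990/10 example from the next slide it gives 0.999:

    def accuracy(a, b, c, d):
        # (TP + TN) / total
        return (a + d) / (a + b + c + d)

    # all 10000 records predicted as the majority class
    print(accuracy(a=9990, b=0, c=10, d=0))   # 0.999
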

Limitation of Accuracy
- Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10.
- If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
- Accuracy is misleading because the model does not detect any Class 1 example.

Cost Matrix

                             PREDICTED CLASS
    C(i|j)                   Class=Yes     Class=No
    ACTUAL    Class=Yes      C(Yes|Yes)    C(No|Yes)
    CLASS     Class=No       C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i

Computing Cost of Classification

    Cost matrix C(i|j):      PREDICTED CLASS
                               +       -
    ACTUAL      +             -1     100
    CLASS       -              1       0

    Model M1:                PREDICTED CLASS
                               +       -
    ACTUAL      +            150      40
    CLASS       -             60     250
    Accuracy = 80%, Cost = 3910

    Model M2:                PREDICTED CLASS
                               +       -
    ACTUAL      +            250      45
    CLASS       -              5     200
    Accuracy = 90%, Cost = 4255
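A short check of these figures, assuming the cell counts shown above (reconstructed to be consistent with the surviving accuracy and cost totals). Dictionary keys are (actual, predicted) pairs:

    def total_cost(conf, cost):
        # sum of count * cost over the four (actual, predicted) cells
        return sum(conf[cell] * cost[cell] for cell in conf)

    cost = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
    m1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
    m2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}
    print(total_cost(m1, cost), total_cost(m2, cost))   # 3910 4255
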

Cost vs. Accuracy

    Count:                   PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL    Class=Yes         a           b
    CLASS     Class=No          c           d

    Cost:                    PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL    Class=Yes         p           q
    CLASS     Class=No          q           p

N = a + b + c + d
Accuracy = (a + d) / N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N - a - d)
     = qN - (q - p)(a + d)
     = N [q - (q - p) × Accuracy]

Cost is therefore a linear function of accuracy when
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p

Cost-Sensitive Measures
- Precision p = a / (a + c); precision is biased towards C(Yes|Yes) & C(Yes|No)
- Recall r = a / (a + b); recall is biased towards C(Yes|Yes) & C(No|Yes)
- F-measure F = 2rp / (r + p) = 2a / (2a + b + c); the F-measure is biased towards all cells except C(No|No)
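A minimal sketch of the three measures, reusing model M1's counts (a = 150, b = 40, c = 60) from the cost slide purely for illustration:

    def precision(a, b, c):
        return a / (a + c)          # TP / (TP + FP)

    def recall(a, b, c):
        return a / (a + b)          # TP / (TP + FN)

    def f_measure(a, b, c):
        # harmonic mean of precision and recall
        return 2 * a / (2 * a + b + c)

    print(round(precision(150, 40, 60), 3),   # 0.714
          round(recall(150, 40, 60), 3),      # 0.789
          round(f_measure(150, 40, 60), 3))   # 0.75
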

Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on factors other than the learning algorithm:
  - class distribution
  - cost of misclassification
  - size of training and test sets

Learning Curve
- A learning curve shows how accuracy changes with varying sample size.
- Effects of small sample size:
  - bias in the estimate
  - variance of the estimate

Methods for Estimation
- Holdout method
- Random subsampling
- Cross-validation
- Bootstrap

Holdout Method
- The original data is partitioned into two disjoint sets: a training set and a test set.
  - Reserve k% as the training set and (100 - k)% as the test set.
  - The accuracy of the classifier can be estimated from the accuracy of the induced model on the test set.
- Limitations:
  - The induced model may not be as good as when all labeled examples are used for training (the smaller the training set, the larger the variance of the model).
  - If the training set is too large, the estimated accuracy computed from the smaller test set is less reliable (i.e., a wide confidence interval).
  - The training and test sets are not independent of each other, because both are subsets of the original data.

Random Subsampling
- The holdout method repeated several times to improve the estimated accuracy.
- Overall accuracy: acc_sub = (1/k) × Σ_{i=1}^{k} acc_i, where k = total number of iterations and acc_i = model accuracy during the i-th iteration (a sketch follows below).
- Problems:
  - As with the holdout method, it does not utilize as much data as possible for training.
  - It has no control over the number of times each record is used for testing and training (some records might be used for training more often than others).
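A sketch of random subsampling; train_and_score(train_idx, test_idx) is a hypothetical callback that induces a model on the training indices and returns its accuracy on the test indices:

    import random

    def random_subsampling(n_records, k, train_frac, train_and_score):
        idx = list(range(n_records))
        accs = []
        for _ in range(k):
            random.shuffle(idx)                 # a fresh random split each time
            cut = int(train_frac * n_records)
            accs.append(train_and_score(idx[:cut], idx[cut:]))
        return sum(accs) / k                    # (1/k) * sum of acc_i
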

Cross-Validation
- Partition the data into k equal-sized disjoint subsets.
  - k-fold: train on k - 1 partitions, test on the remaining one (repeated k times so that each partition is used for testing exactly once); a sketch follows below.
  - If k = N (the size of the data set), the method is called leave-one-out (each test set contains only one record).
  - The total estimated accuracy is computed as for random subsampling.
- Advantages of leave-one-out:
  - Data is utilized as much as possible for training.
  - The test sets are mutually exclusive and effectively cover the entire data set.
- Drawbacks of leave-one-out:
  - Computationally expensive: the procedure is repeated N times.
  - The variance of the estimated performance metric tends to be high, since each test set contains only one record.
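A sketch of k-fold cross-validation, using the same hypothetical train_and_score callback as above; note that each record lands in exactly one test fold:

    def k_fold_cv(n_records, k, train_and_score):
        idx = list(range(n_records))
        folds = [idx[i::k] for i in range(k)]    # k roughly equal, disjoint folds
        accs = []
        for test in folds:
            test_set = set(test)
            train = [j for j in idx if j not in test_set]
            accs.append(train_and_score(train, test))
        return sum(accs) / k

    # k = n_records gives the leave-one-out method described above
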

Bootstrap
- Sampling with replacement:
  - All previous approaches assume that the training records are sampled without replacement.
  - In the bootstrap approach, training records are sampled with replacement.
  - If the original data contains N records, a bootstrap sample of size N contains on average about 63.2% of the records in the original data (the probability that a record is chosen is 1 - (1 - 1/N)^N ≈ 1 - e^{-1} = 0.632 as N → ∞).
  - Records not included in the bootstrap sample become part of the test set.
  - The sampling procedure is repeated b times to generate b bootstrap samples.
- .632 bootstrap (one of the most widely used approaches; a sketch follows below):
  - acc_boot = (1/b) × Σ_{i=1}^{b} (0.632 × ε_i + 0.368 × acc_s), where ε_i = accuracy on the i-th bootstrap sample and acc_s = accuracy computed from a training set containing all the labeled examples in the original data.
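A sketch of the .632 bootstrap with the same hypothetical train_and_score callback; acc_s is the accuracy of the model trained on all labeled examples:

    import random

    def dot632_bootstrap(n_records, b, acc_s, train_and_score):
        idx = list(range(n_records))
        total = 0.0
        for _ in range(b):
            train = [random.choice(idx) for _ in range(n_records)]  # with replacement
            in_sample = set(train)
            test = [j for j in idx if j not in in_sample]   # the left-out ~36.8%
            eps_i = train_and_score(train, test)
            total += 0.632 * eps_i + 0.368 * acc_s
        return total / b
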

Methods for Comparing Classifiers
- The issue of estimating the confidence interval for a given model's accuracy
- The issue of testing the statistical significance of the observed deviation

Confidence Interval for Accuracy / 1

Confidence Interval for Accuracy / 2

Confidence Interval for Accuracy / 3

Confidence Interval for Accuracy / 4

Comparing Performance of 2 Models / 1

Comparing Performance of 2 Models / 2

An Illustrative Example

Comparing Performance of 2 Classifiers

An Illustrative Example

Handling Missing Attributes
Missing values affect decision tree construction in three different ways:
- how impurity measures are computed
- how to distribute an instance with a missing value to child nodes
- how a test instance with a missing value is classified

Computing Impurity Measure

Before splitting:
Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Split on Refund (the missing value is assumed = No, the majority):
Entropy(Refund=Yes) = -(0/3) log(0/3) - (3/3) log(3/3) = 0
Entropy(Refund=No) = -(3/7) log(3/7) - (4/7) log(4/7) = 0.9852
Entropy(Children) = 0.3 (0) + 0.7 (0.9852) = 0.6896
Gain = 0.8813 - 0.6896 = 0.1917
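A minimal sketch reproducing these numbers (entropy with log base 2, missing Refund value imputed as No):

    from math import log2

    def entropy(counts):
        total = sum(counts)
        # 0 log 0 is taken as 0, so zero counts are skipped
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    parent = entropy([3, 7])        # class distribution 3 Yes / 7 No -> 0.8813
    refund_no = entropy([3, 4])     # Refund=No after imputing -> 0.9852
    children = 0.3 * 0.0 + 0.7 * refund_no      # Refund=Yes node is pure
    print(round(parent, 4), round(refund_no, 4), round(parent - children, 4))
    # 0.8813 0.9852 0.1917
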

Distributing Instances
(Two decision-tree diagrams splitting on Refund = Yes / No are not reproduced here.)
- Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
- Assign the record to the left child with weight = 3/9 and to the right child with weight = 6/9.

Classify Instances
(Decision tree from the earlier slides: Refund -> MarSt -> TaxInc.)

New record with a missing Marital Status value:

                 Married   Single   Divorced   Total
    Class=No        3         1         0        4
    Class=Yes      6/9        1         1       2.67
    Total          3.67       2         1       6.67

- Probability that Marital Status = Married is 3.67 / 6.67
- Probability that Marital Status = {Single, Divorced} is 3 / 6.67

Group Assignment (Tugas Kelompok)
Chapter 4, problems 5, 6, 7, 9, 10, 11.
In addition to the hardcopy, prepare a softcopy for discussion.
Deadline: Tuesday, 17 March 2008.