Chapter 5: Credibility

Introduction
– Performance on the training set is not a good indicator of performance on an independent set; we need to predict performance on unseen data.
– Quality training data is difficult to obtain and is not always abundant.
– Predicting performance from limited training data is a contentious issue; repeated cross-validation is the most useful technique in these situations.
– The cost of misclassification is also an important criterion.
– Statistical tests are needed to validate the conclusions.

Training and Testing
– Error rate of a classifier: a correct classification counts as a success, otherwise it is an error. If 700 of 1000 instances are classified correctly and 300 are in error, the error rate is 30%.
– Is the classifier's performance on the training data a good indicator of its performance on test data and future data? No: the error rate on the training data (the resubstitution error) is not a good indicator of the error rate on test data, because the classifier can overfit the training set.
– Training data: used by one or more learning methods to build classifiers.
– Validation data: used to optimize parameters or to select a particular classifier.
– Test data: data not used in the training phase; used to calculate the error rate of the final, optimized classifier.
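A minimal sketch, in Python with NumPy, of the three-way split and the error-rate calculation described above; the toy arrays and the 60/20/20 split proportions are illustrative assumptions rather than anything prescribed by the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                          # toy feature matrix (assumed)
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # toy labels (assumed)

    # Shuffle once, then split 60% / 20% / 20% into training, validation, and test sets.
    idx = rng.permutation(len(y))
    train_idx, val_idx, test_idx = np.split(idx, [600, 800])

    def error_rate(y_true, y_pred):
        """Fraction of instances that are misclassified."""
        return float(np.mean(y_true != y_pred))

    # e.g. 300 errors out of 1000 instances would give an error rate of 0.30
    print(error_rate(y[test_idx], np.zeros_like(y[test_idx])))   # trivial "always 0" predictor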

Predicting Performance
– Success rate = 100% − error rate.
– Confidence intervals: when the test set is not large, we report the resulting error rate (or success rate) together with a confidence interval. Since each test instance is an independent success/failure trial, the number of successes follows a binomial distribution, and for a reasonably large test set the normal approximation gives the interval.
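A hedged sketch of that normal-approximation confidence interval for a success rate; the 95% z value and the example counts are assumptions for illustration.

    import math

    def success_rate_ci(successes, n, z=1.96):
        """Approximate confidence interval for the true success rate,
        using the normal approximation to the binomial (large n)."""
        p = successes / n
        half_width = z * math.sqrt(p * (1 - p) / n)
        return p - half_width, p + half_width

    # 700 successes out of 1000 test instances -> roughly (0.67, 0.73) at 95% confidence
    print(success_rate_ci(700, 1000))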

Cross Validation
– Holdout: hold back one third of the available data for testing and use the remaining two thirds for training.
– Stratified holdout: the training data should be a good representative of the overall data, so each class should be represented in the same proportion as in the full dataset.
– Repeated holdout: repeat the random selection several times and average the resulting error rates.
– Three-fold cross-validation: divide the data into three equal partitions (folds); make three iterations, each time choosing one fold as test data and the other two as training data.
– 10-fold cross-validation: use 9 of the 10 folds to train and the remaining one for testing; the 10 error estimates are averaged to yield an overall error estimate.
– Sometimes the 10-fold cross-validation is repeated several times with different random partitions into 10 folds.
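A sketch of stratified 10-fold cross-validation using scikit-learn (one reasonable toolkit choice; the slides do not prescribe a library), with a decision tree and a bundled dataset standing in for "the learning method" and "the data" purely as examples.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)       # any labelled dataset will do
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    # One accuracy estimate per fold; their mean is the cross-validated estimate.
    scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)
    print("mean accuracy: %.3f  (error rate: %.3f)" % (scores.mean(), 1 - scores.mean()))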

Other Estimates
– Leave-one-out cross-validation: n-fold cross-validation where n is the number of instances in the dataset. Each instance in turn is left out for testing and the remaining n−1 instances are used for training; the n results are averaged to give the final error estimate.
– Bootstrap error estimation (sampling with replacement): a dataset of n instances is sampled n times, with replacement, to give a new training set of n instances. The instances that were never picked for the training set form the test set. This is referred to as the 0.632 bootstrap, because each instance has a certain probability of not being chosen for the training set, so the training set ends up containing about 63.2% of the distinct instances.
– The error estimate obtained over the test set is a pessimistic estimate of the true error rate, because the training set contains only about 63% of the overall data, whereas it covers 90% of the data in 10-fold cross-validation.
– The final error rate is therefore computed as a weighted combination: e = 0.632 × error rate over the test instances + 0.368 × error rate over the training instances.
– The bootstrap procedure is repeated several times and the error rates averaged.
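A minimal sketch of the 0.632 bootstrap estimate described above; the decision-tree learner, the bundled dataset, and the 20 repetitions are assumptions made only so the example runs end to end.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    rng = np.random.default_rng(0)
    n = len(y)

    estimates = []
    for _ in range(20):                               # repeat the bootstrap and average
        train = rng.integers(0, n, size=n)            # sample n instances with replacement
        test = np.setdiff1d(np.arange(n), train)      # instances never picked form the test set
        model = DecisionTreeClassifier(random_state=0).fit(X[train], y[train])
        e_test = np.mean(model.predict(X[test]) != y[test])
        e_train = np.mean(model.predict(X[train]) != y[train])
        estimates.append(0.632 * e_test + 0.368 * e_train)

    print("0.632 bootstrap error estimate: %.3f" % np.mean(estimates))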

Comparing Data Mining Methods
– When a new learning algorithm is proposed, its proponents must show that it improves on the state of the art for the problem at hand and that the observed improvement is not just a chance effect of the estimation process.
– A technique cannot be dismissed because it does poorly on one dataset; its average performance over different datasets must be considered.
– We want to determine whether the mean of one set of samples (cross-validation estimates for the various datasets sampled from the domain) is significantly greater than, or significantly less than, the mean of another. Student's t-test, and in particular the paired t-test, is the preferred tool.
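A hedged sketch of a paired t-test on per-fold cross-validation estimates, using SciPy; the two accuracy arrays are invented numbers included only to show the call.

    import numpy as np
    from scipy import stats

    # Per-fold accuracies of two learning schemes evaluated on the SAME 10 folds (made-up values).
    acc_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81])
    acc_b = np.array([0.78, 0.77, 0.80, 0.79, 0.81, 0.76, 0.79, 0.82, 0.78, 0.79])

    # Paired t-test: is the difference in mean accuracy significant?
    t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
    print("t = %.2f, p = %.4f" % (t_stat, p_value))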

Predicting Probabilities
– The 0–1 loss function simply counts a prediction as right or wrong; when a classification is made with a probability, the situation is not 0–1.
– Quadratic loss function: if (p_1, ..., p_k) is the predicted probability vector over the k classes and (a_1, ..., a_k) is the actual outcome vector, with 1 for the class the instance belongs to and 0 for the rest, the quadratic loss is ∑_j (p_j − a_j)^2. If i is the correct class (a_i = 1), this can be rewritten as 1 − 2p_i + ∑_j p_j^2. When the test set contains several instances, the loss function is summed over all of them.
– Informational loss function: −log_2 p_i, where i is the correct class.
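A small sketch computing both loss functions for a single instance; the example probability vector and true class are arbitrary.

    import numpy as np

    def quadratic_loss(p, true_class):
        """Sum_j (p_j - a_j)^2, where a is the 0/1 indicator vector of the true class."""
        a = np.zeros_like(p)
        a[true_class] = 1.0
        return float(np.sum((p - a) ** 2))

    def informational_loss(p, true_class):
        """-log2 of the probability assigned to the correct class."""
        return float(-np.log2(p[true_class]))

    p = np.array([0.6, 0.3, 0.1])        # predicted probabilities for 3 classes (assumed)
    print(quadratic_loss(p, 0))          # (0.6-1)^2 + 0.3^2 + 0.1^2 = 0.26
    print(informational_loss(p, 0))      # -log2(0.6) ≈ 0.737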

Counting the Cost
– What is the cost of making a wrong decision? For example, the cost of missing a threat versus the cost of a false alarm.
– Confusion matrix: true positives (TP, actual = predicted = yes) and true negatives (TN, actual = predicted = no) are the correct predictions; false positives (FP, actual = no, predicted = yes) and false negatives (FN, actual = yes, predicted = no) are the incorrect ones.
– True positive rate = TP/(TP+FN): of all actual "yes" instances, the fraction correctly predicted as "yes".
– False positive rate = FP/(FP+TN): of all actual "no" instances, the fraction incorrectly predicted as "yes".
– Overall success rate = (TP+TN)/(TP+TN+FP+FN); error rate = 1 − success rate.
– Multiclass prediction uses a confusion matrix with c rows and c columns. In Table 5.4(a) there are 100 instances of class a, 60 of b, and 40 of c, i.e. 200 in all, and 140 of them were correctly predicted: a success rate of 70%. The predictor assigned 120 instances to class a, 60 to b, and 20 to c. Is this an intelligent prediction, or could chance alone account for it?
– Consider a random predictor that assigns classes in the same proportions as the learning scheme's predictions (6:3:1 in this case). As Table 5.4(b) shows, it would get 82 instances correct by chance, as opposed to 140 for the learning technique. Is the difference significant?
– Kappa statistic: the learning scheme achieves 140 − 82 = 58 extra successes out of a possible 200 − 82 = 118, i.e. about 49.2%. The maximum value of kappa is 100%.
– The kappa statistic is still not a cost-sensitive measure.
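A sketch of the kappa calculation from a multiclass confusion matrix; the individual cell values below are assumptions chosen to be consistent with the marginal totals quoted above (100/60/40 actual, 120/60/20 predicted, 140 correct), since the slide only gives those totals.

    import numpy as np

    # Confusion matrix: rows = actual class (a, b, c), columns = predicted class.
    cm = np.array([[88, 10,  2],
                   [14, 40,  6],
                   [18, 10, 12]])

    n = cm.sum()
    observed = np.trace(cm)                                   # 140 correct predictions
    # Expected number correct for a random predictor that keeps the same
    # row (actual) and column (predicted) totals:
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n    # 82

    kappa = (observed - expected) / (n - expected)
    print("kappa = %.3f" % kappa)                             # ≈ 0.492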

Cost-sensitive classification
– Take into account the benefits of true positives and true negatives and the costs of false positives and false negatives.
– Sometimes the cost of applying the learning technique itself may also be taken into account.
– Suppose a predictor assigns a class-a instance to classes a, b, and c with probabilities p_a, p_b, and p_c; under the default (0–1) cost matrix the expected cost of predicting a is p_b + p_c, i.e. 1 − p_a. With a general cost matrix, we predict the class whose expected cost is smallest.
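A sketch of cost-sensitive prediction from class probabilities, choosing the class with the lowest expected cost; the cost matrix and probability vector are illustrative assumptions.

    import numpy as np

    # cost[i, j] = cost of predicting class j when the true class is i.
    # The diagonal is 0 (correct predictions cost nothing); off-diagonal values are assumed.
    cost = np.array([[ 0.0, 1.0, 1.0],
                     [10.0, 0.0, 1.0],
                     [ 1.0, 1.0, 0.0]])

    p = np.array([0.5, 0.3, 0.2])        # predicted class probabilities (assumed)

    expected_cost = p @ cost             # expected cost of predicting each class
    print(expected_cost)                 # [3.2, 0.7, 0.8]
    print("predict class", int(np.argmin(expected_cost)))   # class 1 minimizes expected cost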

Cost-sensitive Learning
– Take costs into consideration at training time rather than only at prediction time.
– One approach is to generate training data with a different proportion of yes and no instances. For example, if false positives are penalized 10 times as heavily as false negatives and we want to avoid errors on the no instances, we can make the number of no instances in the training set 10 times the number of yes instances.
– The proportions can be varied by duplicating instances in the training dataset, or by assigning weights to instances and building cost-sensitive trees; a sketch of the resampling idea follows.
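A minimal sketch of cost-sensitive resampling by duplication, as described above; the 10:1 ratio and the toy arrays are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                 # toy features
    y = np.array([0] * 100 + [1] * 100)           # 0 = "no", 1 = "yes" (toy labels)

    # Oversample the "no" class so it outnumbers "yes" 10:1 in the training set,
    # biasing the learner against false positives.
    no_idx = np.where(y == 0)[0]
    yes_idx = np.where(y == 1)[0]
    no_resampled = rng.choice(no_idx, size=10 * len(yes_idx), replace=True)

    train_idx = np.concatenate([no_resampled, yes_idx])
    X_train, y_train = X[train_idx], y[train_idx]
    print(np.bincount(y_train))                   # [1000  100]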

Lift Charts
– Lift factor: the increase in response rate obtained by targeting a selected group rather than the whole population. If one group yields a response rate of 1% and another group 5%, the lift factor for the second group is 5.
– Table 5.6: 150 instances, of which 50 are actually yes and 100 are actually no, giving an overall success rate of 33%. The 150 instances are sorted by the probability of yes predicted by the learning scheme: the first instance has the highest predicted probability, the next 0.93, and so on. When the actual class is no but the scheme predicts yes, the instance is a false positive.
– If we choose only the top 10 instances of the ranking, 2 of them are actually negative, so the success rate in the sample is 80%. Compared with the overall success rate of 33%, that is a lift factor of 80/33 ≈ 2.4. In general, lift = (TP_sample / N_sample) / (TP_total / N_total).
– Lift chart: the x-axis is the sample size as a proportion of the total test data and the y-axis is the number of respondents. The diagonal shows the expected number of respondents if a random sample is taken; the upper curve shows a more intelligent choice of samples.
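A sketch of computing the lift factor for the top-k instances of a ranking; the predicted probabilities and labels are randomly generated stand-ins for the table discussed above.

    import numpy as np

    rng = np.random.default_rng(1)
    p_yes = rng.random(150)                            # predicted P(yes) per instance (toy)
    actual = (rng.random(150) < p_yes).astype(int)     # toy actual labels, correlated with p_yes

    def lift_at(k, p_yes, actual):
        """Lift factor of the top-k instances ranked by descending predicted probability."""
        order = np.argsort(-p_yes)
        sample_rate = actual[order[:k]].mean()         # success rate within the sample
        overall_rate = actual.mean()                   # success rate over the whole set
        return sample_rate / overall_rate

    print("lift at 10: %.2f" % lift_at(10, p_yes, actual))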

ROC Curves
– As with lift charts, the idea is to choose samples with a high proportion of positives.
– ROC curves depict the performance of a classifier without regard to class distribution or error costs.
– "Receiver operating characteristic" comes from signal detection: how a receiver responds to a signal in the presence of noise.
– An ROC curve plots the percentage of positives in the sample relative to the total positives in the test data against the percentage of negatives in the sample relative to the total negatives in the test data.
– Generating ROC curves from cross-validation: (i) collect the predicted probabilities for all the test sets (10 sets in a 10-fold cross-validation) along with the true class label of each instance; (ii) build a single ranked list from this data; (iii) trace the ROC curve along the ranking.
– Figure 5.3 shows ROC curves for two learning methods: when do we choose A and when do we choose B? By combining the two with a weighting factor we can get the best of both, the top of the convex hull. For example, if classifier A predicts an instance to be positive with some probability p_A and classifier B predicts it with probability 0.72, then assigning a weight of 0.7 to A and 0.3 to B gives the combined probability 0.7·p_A + 0.3·0.72.
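A sketch of tracing an ROC curve from a single ranked list of predictions (true positive rate against false positive rate as the cut-off moves down the ranking); the scores and labels are toy values, and the area under the curve is added only as a convenient summary.

    import numpy as np

    rng = np.random.default_rng(2)
    scores = rng.random(100)                              # predicted P(+) per instance (toy)
    labels = (rng.random(100) < scores).astype(int)       # toy true labels

    # Walk down the ranking; each step adds one more instance to the "predicted +" set.
    order = np.argsort(-scores)
    y = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])            # true positive rate
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])  # false positive rate

    # Area under the curve via the trapezoidal rule.
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    print("AUC ≈ %.3f" % auc)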

Recall-precision curves
– Example: system A1 locates 100 documents of which 20 are relevant; system A2 locates 400 documents of which 80 are relevant. Which is better? It depends on the cost of false positives and false negatives.
– Recall = number of documents retrieved that are relevant / total number of documents that are relevant. If there are 100 relevant documents in total, A1's recall is 0.2 and A2's recall is 0.8.
– Precision = number of documents retrieved that are relevant / total number of documents retrieved. A1's precision is 20/100 = 0.2 and A2's is 80/400 = 0.2.
– Summary of measures: Table 5.7, page 172.
– Ultimate objective: choose a set of instances with a high proportion of yes instances and high coverage of the yes instances, using as few samples as possible.
– Three-point average recall: the average precision obtained at recall values of 20%, 50%, and 80%; in the worked example (see the spreadsheet) it comes to 30%. The 11-point version averages precision over 11 evenly spaced recall values; in the example it is 30.36%.
– F-measure = 2 × recall × precision / (recall + precision) = 2TP / (2TP + FP + FN). In the spreadsheet example, with TP = 15, FP = 5, and FN = 4, the F-measure is 30/(30 + 5 + 4) ≈ 77%.
– Success rate = (TP + TN)/(TP + FP + TN + FN).
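A small sketch of recall, precision, and F-measure computed from raw counts, using the TP/FP/FN values quoted in the example above.

    def recall(tp, fn):
        """Fraction of relevant items that were retrieved."""
        return tp / (tp + fn)

    def precision(tp, fp):
        """Fraction of retrieved items that were relevant."""
        return tp / (tp + fp)

    def f_measure(tp, fp, fn):
        """Harmonic mean of recall and precision: 2TP / (2TP + FP + FN)."""
        return 2 * tp / (2 * tp + fp + fn)

    tp, fp, fn = 15, 5, 4                  # counts from the example above
    print(recall(tp, fn))                  # 15/19 ≈ 0.789
    print(precision(tp, fp))               # 15/20 = 0.75
    print(f_measure(tp, fp, fn))           # 30/39 ≈ 0.769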

Cost curves
– ROC curves and related measures are useful for exploring the trade-offs among different classifiers over a range of costs, but they are not ideal for evaluating models when the error costs (the cost of false negatives and the cost of false positives) are known.
– Cost curves are suited to this purpose: each classifier corresponds to a straight line that shows how its performance varies as the class distribution changes. They work best with two classes.
– Figure 5.4(a): expected error plotted against the probability of one of the classes (+ and −). If p(+) is greater than about 0.65, always picking + is better than using classifier A.
– Figure 5.4(b) takes costs into consideration. The probability cost function is
  pc[+] = p[+]·C[−|+] / (p[+]·C[−|+] + p[−]·C[+|−]),
  where C[−|+] is the cost of a false negative (predicting − for an actual +) and C[+|−] is the cost of a false positive. The normalized expected cost is
  normalized expected cost = fn·pc[+] + fp·(1 − pc[+]),
  where fp is the false positive rate and fn is the false negative rate.
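A sketch of the probability cost function and normalized expected cost defined above; the error rates, class probability, and costs are assumed values for illustration.

    def normalized_expected_cost(fn_rate, fp_rate, p_pos, c_fn, c_fp):
        """Cost-curve value for one classifier at one operating condition.
        c_fn = cost of a false negative, c_fp = cost of a false positive."""
        pc_pos = (p_pos * c_fn) / (p_pos * c_fn + (1 - p_pos) * c_fp)
        return fn_rate * pc_pos + fp_rate * (1 - pc_pos)

    # Example: fn rate 0.2, fp rate 0.1, 30% positives, false negatives 5x as costly.
    print(normalized_expected_cost(0.2, 0.1, 0.3, c_fn=5.0, c_fp=1.0))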

Evaluating Numeric Prediction
– These measures apply to numeric prediction rather than to nominal classes.
– Metrics (Table 5.8):
– Mean squared error (MSE)
– Root mean squared error (RMSE)
– Mean absolute error (MAE)
– Relative squared error: the error relative to what a simple predictor, such as the mean of the training values, would have achieved
– Relative absolute error
– Correlation coefficient: the statistical correlation between the actual and predicted values
– Tables 5.8 and 5.9 give the formulas and example results; choose the classifier that gives the best results on the chosen metric.
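A sketch of these numeric-prediction metrics in NumPy; the actual and predicted arrays are toy values, and the simple baseline predictor is taken to be the mean of the actual values.

    import numpy as np

    actual = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
    predicted = np.array([2.5, 0.0, 2.1, 7.8, 4.0])

    mse = np.mean((predicted - actual) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(predicted - actual))

    # Relative errors compare against a trivial predictor that always outputs the mean.
    baseline = np.mean(actual)
    rse = np.sum((predicted - actual) ** 2) / np.sum((baseline - actual) ** 2)
    rae = np.sum(np.abs(predicted - actual)) / np.sum(np.abs(baseline - actual))

    corr = np.corrcoef(actual, predicted)[0, 1]   # correlation coefficient
    print(mse, rmse, mae, rse, rae, corr)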