Classification Evaluation

Estimating Future Accuracy
Given available data, how can we reliably predict accuracy on future, unseen data? Three basic approaches:
– Training set
– Hold-out set (2 variants)
– Cross-validation

Estimating with Training Set
Simplest approach:
– Build the model from the training set
– Compute accuracy on the training set
Pros and cons:
– Easy
– Likely to overestimate accuracy (think overfitting)

Estimating with Hold-out Set (1)
Method 1:
– Two distinct data sets are made available a priori
– One is used to build the model
– The other is used to test the model
Pros and cons:
– No “bias”
– Not always feasible

Estimating with Hold-out Set (2)
Method 2:
– Randomly partition the data into a training set and a test set
– The training set is used to train/build the model
– The test set is used to evaluate the model
Pros and cons:
– Easy
– Less likely to overfit
– Reduces the amount of training data
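A minimal sketch of this random hold-out split, using only numpy; the toy X and y arrays and the `holdout_split` helper are illustrative, not from the slides, and any classifier with fit/predict could be trained on the resulting training portion.

```python
import numpy as np

def holdout_split(X, y, test_fraction=1/3, seed=0):
    """Randomly partition (X, y) into training and test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle instance indices
    n_test = int(len(y) * test_fraction)     # e.g., one third for testing
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# toy data; a classifier would be fit on (X_tr, y_tr) and scored on (X_te, y_te)
X = np.arange(30).reshape(-1, 1)
y = np.array([0, 1] * 15)
X_tr, y_tr, X_te, y_te = holdout_split(X, y)
print(len(X_tr), len(X_te))                  # 20 10
```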

Holding out data
The holdout method reserves a certain amount of data for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
For “unbalanced” datasets, random samples might not be representative
– Few or no instances of some classes
Stratified sample:
– Make sure that each class is represented in approximately equal proportions in both subsets
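A hedged sketch of a stratified hold-out split using scikit-learn's train_test_split; the stratify argument preserves class proportions in both subsets, and the imbalanced toy labels are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced labels: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio roughly equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())   # both close to 0.10
```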

Repeated holdout method
The holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates from the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
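A sketch of the repeated holdout idea, assuming scikit-learn and a decision tree as a stand-in classifier on the built-in iris data; the per-iteration error rates are simply averaged.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
errors = []
for seed in range(10):                       # 10 different random subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    errors.append(np.mean(model.predict(X_te) != y_te))

print("repeated-holdout error estimate:", np.mean(errors))
```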

Cross-validation
The most popular and effective type of repeated holdout is cross-validation
Cross-validation avoids overlapping test sets
– First step: the data is split into k subsets of equal size
– Second step: each subset in turn is used for testing and the remainder for training
This is called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
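A minimal sketch of the k-fold mechanics themselves (no library CV helper): shuffle the indices once, split them into k folds, and let each fold serve as the test set in turn. The decision tree and iris data are illustrative stand-ins.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def k_fold_error(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)       # k index subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]                                   # fold i is the test set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:]) # the rest is training
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
    return np.mean(errors)

X, y = load_iris(return_X_y=True)
print("10-fold CV error:", k_fold_error(X, y))
```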

Cross-validation example (illustrative figure in the original slides)

More on cross-validation
Standard data-mining method for evaluation: stratified ten-fold cross-validation
Why ten?
– A good choice to get an accurate estimate
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
– E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the sampling variance)
The error estimate is the mean across all repetitions
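A sketch using scikit-learn's RepeatedStratifiedKFold, which implements this ten-times-repeated, stratified ten-fold scheme; the decision tree and iris data are again arbitrary stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print("mean error over 10 x 10-fold CV:", 1 - acc.mean())
```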

Leave-One-Out cross-validation
Leave-One-Out: a particular form of cross-validation:
– Set the number of folds to the number of training instances
– I.e., for n training instances, build the classifier n times
Makes the best use of the data
Involves no random subsampling
Computationally expensive, but gives a low-bias estimate
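A short sketch with scikit-learn's LeaveOneOut iterator, which is equivalent to k-fold CV with k equal to the number of instances; classifier and data are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOOCV error:", 1 - acc.mean())   # one fit per training instance (n fits)
```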

Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: a random dataset split equally into two classes
– The best any model can do is predict the majority class
– True accuracy on fresh data: 50%
– Leave-One-Out-CV estimate: 100% error! (removing one instance always makes its class the minority of the remaining data, so the majority-class prediction is always wrong)
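A tiny simulation of this pathology, using scikit-learn's DummyClassifier as the "predict the majority class" learner; the features carry no signal and the balanced labels are synthetic.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # features carry no signal
y = np.array([0] * 50 + [1] * 50)      # perfectly balanced random labels

acc = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", acc.mean())   # 0.0, i.e., 100% error
```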

Three-way Data Splits
One problem with CV is that, since the data is used jointly to fit the model and to estimate its error, the error estimate can be biased downward. If the goal is a true estimate of error (as opposed to just deciding which model is best), you may want a three-way split:
– Training set: examples used for learning
– Validation set: used to tune parameters
– Test set: never used in the model-fitting process; used at the end for an unbiased estimate of the hold-out error
The cross-validated version of this idea is nested cross-validation
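A hedged sketch of a three-way split built from two successive random splits; the 60/20/20 proportions and toy data are illustrative, not from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)
y = np.array([0, 1] * 100)

# first split off the test set, then carve a validation set out of the rest
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)  # 0.25 * 0.8 = 0.2

# tune hyperparameters on (X_val, y_val); touch (X_test, y_test) only once at the end
print(len(X_train), len(X_val), len(X_test))   # 120 40 40
```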

Issues with Accuracy
Measuring accuracy
– Is 99% accuracy good? Is 20% accuracy bad?
– Either can be excellent, good, mediocre, poor, or terrible
Why?
– Depends on problem complexity
– Depends on the base accuracy (i.e., a majority-class learner)
– Depends on the cost of an error (e.g., ICU monitoring)
Problem: accuracy assumes equal cost for all errors

Confusion Matrix
Rows: True Output (Target); Columns: Predicted Output
– True Positive (TP): hits (target positive, predicted positive)
– False Negative (FN): misses (target positive, predicted negative)
– False Positive (FP): false alarms (target negative, predicted positive)
– True Negative (TN): correct rejections (target negative, predicted negative)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
A single number: loses information
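A minimal sketch computing the four cells and accuracy directly from label arrays; y_true and y_pred are arbitrary toy values.

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

TP = np.sum((y_true == 1) & (y_pred == 1))   # hits
FN = np.sum((y_true == 1) & (y_pred == 0))   # misses
FP = np.sum((y_true == 0) & (y_pred == 1))   # false alarms
TN = np.sum((y_true == 0) & (y_pred == 0))   # correct rejections

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(TP, FN, FP, TN, accuracy)              # 3 1 1 3 0.75
```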

Precision
Precision = TP / (TP + FP)
The percentage of predicted positives that are actually positive (of those I predict true, how many are actually true?)

Recall
Recall = TP / (TP + FN)
The percentage of actual positives that are predicted as positive (of those that are true, how many do I predict are true?)
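Continuing the confusion-matrix sketch above with the same toy arrays, precision and recall are computed as follows.

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

TP = np.sum((y_true == 1) & (y_pred == 1))
FP = np.sum((y_true == 0) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))

precision = TP / (TP + FP)   # of the predicted positives, how many are truly positive
recall    = TP / (TP + FN)   # of the true positives, how many did we catch
print(precision, recall)     # 0.75 0.75
```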

P/R Trade-off (I)
ICU monitoring:
– Is precision the goal? Not so much; we would rather not miss anything
– Recall is the goal: we don’t want to miss anyone, and would rather err toward accepting some false positives (checking a patient unnecessarily) while minimizing false negatives (not checking a patient in need)
Google search:
– Is recall the goal? Not really; we never get to the millionth page, we just want a few very good results early
– Precision is the goal: we don’t want to see irrelevant documents (false positives) and can tolerate missing some relevant ones (false negatives); there are plenty of sites anyway, and we don’t need them all
Trade-off:
– Easy to maximize precision: classify only the one or few most confident candidates as positive
– Easy to maximize recall: classify everything as positive
– Neither extreme is particularly useful!

P/R Trade-off (II)
The complete P/R curve shows precision against recall over all thresholds
Breakeven point: the point where P = R
Alternatively, the F-measure combines the two:
– F = 2PR / (P + R)   (the harmonic mean of precision and recall)

Other Measures
– Sensitivity (Recall): TP / (TP + FN)
– Specificity: TN / (TN + FP)
– Positive Predictive Value (Precision): TP / (TP + FP)
– Negative Predictive Value: TN / (TN + FN)

ROC Curves
Receiver Operating Characteristic curve
– Developed in WWII to statistically model false positive and false negative detections by radar operators
A standard measure in medicine and biology
Graphs the true positive rate (sensitivity) vs. the false positive rate (1 − specificity)
Goal: maximize TPR and minimize FPR
– Max TPR: classify everything as positive
– Min FPR: classify everything as negative
– Neither is acceptable, of course!

Several Points in ROC Space
The lower-left point (0, 0) represents the strategy of never issuing a positive classification
– No false positives, but also no true positives
The upper-right corner (1, 1) represents the opposite strategy: unconditionally issuing positive classifications
The point (0, 1) represents perfect classification
– (In the original figure, classifier D sits at this point, so its performance is perfect)
Informally, one point in ROC space is better than another if it is to the northwest of it
– TP rate is higher, FP rate is lower, or both

ROC Curves and AUC (II)
Each point on the ROC curve represents a different trade-off (cost ratio) between TPR and FPR
AUC is the area under the curve: it represents performance averaged over all possible cost ratios
– A single summary number
– A perfect model has AUC = 1.0
– A random model has AUC = 0.5
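A hedged sketch computing an ROC curve and its AUC from model scores with scikit-learn; the scores and labels below are made-up toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])   # P(positive) from some model

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))                  # the ROC operating points
print("AUC:", roc_auc_score(y_true, scores))
```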

Specific Example (figure sequence)
The original slides show a histogram of test results for patients with the disease and patients without the disease. A threshold on the test result calls patients below it “negative” and patients above it “positive”; the overlapping regions of the two distributions then define the true positives, false positives, true negatives, and false negatives for that threshold.

How to Construct an ROC Curve for one Classifier
– Sort the instances according to their P(pos)
– Move a threshold over the sorted instances
– For each threshold, define a classifier and its confusion matrix
– Plot the TPR and FPR of each such classifier

Example instances (sorted by P(pos)):
P(pos)   True class
0.99     pos
0.98     pos
0.70     neg
0.60     pos
0.43     neg

Confusion matrix with the threshold between 0.60 and 0.70 (predict pos if P(pos) is at or above the threshold):
                 Predicted pos   Predicted neg
True pos              2               1
True neg              1               1
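A minimal sketch of this construction on the slide's five instances: sort by score, lower the threshold one instance at a time, and record (FPR, TPR) at each step.

```python
import numpy as np

scores = np.array([0.99, 0.98, 0.70, 0.60, 0.43])   # P(pos) from the slide
labels = np.array([1, 1, 0, 1, 0])                  # pos = 1, neg = 0

order = np.argsort(-scores)                          # sort by decreasing P(pos)
scores, labels = scores[order], labels[order]
P, N = labels.sum(), len(labels) - labels.sum()

points = [(0.0, 0.0)]                                # threshold above every score
for i in range(len(scores)):                         # lower the threshold one step at a time
    predicted_pos = labels[: i + 1]                  # everything at or above the threshold
    tpr = predicted_pos.sum() / P
    fpr = (len(predicted_pos) - predicted_pos.sum()) / N
    points.append((fpr, tpr))

print(points)   # [(0,0), (0,1/3), (0,2/3), (1/2,2/3), (1/2,1), (1,1)]
```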

ROC Properties
AUC guidelines:
– 1.0: perfect prediction
– 0.9: excellent
– 0.7: mediocre
– 0.5: random
ROC curve properties:
– If two ROC curves do not intersect, one method dominates the other
– If they do intersect, one method is better for some cost ratios and worse for others
– (In the original figure: the blue algorithm is better for precision, the yellow for recall, and the red for neither)
– You can choose the method and operating point based on your goals

Lift (I)
In some situations we are not interested in accuracy over the entire data set
– Accurate predictions for 5%, 10%, or 20% of the data
– We don’t care about the rest
Prototypical application: direct marketing
– Baseline: random targeting of the population
– Can we do better? We want to know how much better a targeted offer does on a fraction of the population

Lift (II)
Lift = [TP / (TP + FP)] / [(TP + FN) / (TP + TN + FP + FN)]
– i.e., the rate of positives among the cases the model selects, divided by the overall rate of positives
How much better the model does than random predictions

Lift (III)
Lift(t) = CR(t) / t, where CR(t) is the cumulative percentage of responders captured when the top t% of prospects (ranked by the model) is selected
E.g., Lift(25%) = CR(25) / 25 = 62 / 25 ≈ 2.5
– If we select 25% of the prospects using our model, they are about 2.5 times more likely to respond than prospects selected at random
We can vary t to make decisions (e.g., cost/benefit analysis)
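A hedged sketch of computing Lift(t) from model scores: rank prospects by score, take the top t fraction, and divide the fraction of all responders captured by t (i.e., CR(t)/t). The scores and responses here are synthetic, and the `lift_at` helper is hypothetical.

```python
import numpy as np

def lift_at(scores, responded, t):
    """Lift for the top t fraction of prospects ranked by model score."""
    order = np.argsort(-scores)                     # highest scores first
    n_top = int(np.ceil(t * len(scores)))
    top = responded[order][:n_top]
    cr = top.sum() / responded.sum()                # fraction of all responders captured
    return cr / t                                   # Lift(t) = CR(t) / t

rng = np.random.default_rng(0)
scores = rng.random(1000)
# synthetic responses: higher-scored prospects respond more often
responded = (rng.random(1000) < 0.05 + 0.3 * scores).astype(int)

print("Lift(25%):", lift_at(scores, responded, 0.25))
```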

Summary
Several measures:
– Single-value measures vs. measures over a range of thresholds
Most are restricted to binary classification
– You can always cast a problem as a set of two-class problems, but that can be inconvenient
– Accuracy handles multi-class outputs directly
Key points:
– The measure you optimize makes a difference
– The measure you report makes a difference
– Measure what you want to optimize/report (i.e., use a measure appropriate to the task/domain)