Evaluating Results of Learning Blaž Zupan www.ailab.si/blaz/predavanja/uisp.


Evaluating Results of Learning Blaž Zupan

Evaluating ML Results
Criteria
– Accuracy of induced concepts (predictive accuracy)
  accuracy = probability of correct classification
  error rate = 1 - accuracy
– Comprehensibility
Both are important
– but comprehensibility is hard to measure
– accuracy usually studied
Kinds of accuracy
– Accuracy on learning data
– Accuracy on new data (much more important)
– Major topic: estimating accuracy on new data

Usual Procedure to Estimate Accuracy
Split all available data into a learning set (training set) and a test set (holdout set). The learning system induces a classifier from the learning set; the classifier's accuracy is then measured on the test set.
Main idea: accuracy on the test data approximates accuracy on new data.
Internal validation (on a held-out part of the available data) vs. external validation (on independently collected data).
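A minimal sketch of this holdout procedure, assuming scikit-learn is available; the iris data and the decision tree are placeholders for whatever data and learning system are actually used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; the classifier never sees it during learning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on the held-out test set approximates accuracy on new data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```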

Problems
Common mistake
– estimating accuracy on new data by accuracy on the learning data (resubstitution accuracy)
Size of the data set
– hopefully the test set is representative of new data
– no problem when available data abounds
Scarce data: a major problem
– much data is needed for successful learning
– much data is needed for a reliable accuracy estimate

Estimating Accuracy from the Test Set
Consider
– the induced classifier classifies a = 75% of the test cases correctly
– so we expect accuracy on new data close to 75%
But: how close? How confident are we in this estimate? (This depends on the size of the test data set.)

Confidence Intervals
Confidence intervals can be used to assess how confident we are in our accuracy estimates.
[Figure: success rate on the test data, on a 0-100% scale, with its 95% confidence interval]
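A small sketch of how such an interval can be computed, using the normal approximation to the binomial distribution; the function name and the example numbers are illustrative only.

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for a success rate acc
    measured on n independent test cases (z = 1.96 gives roughly 95%)."""
    se = math.sqrt(acc * (1 - acc) / n)
    return max(0.0, acc - z * se), min(1.0, acc + z * se)

# The interval shrinks as the test set grows.
print(accuracy_confidence_interval(0.75, 100))    # roughly (0.67, 0.83)
print(accuracy_confidence_interval(0.75, 10000))  # roughly (0.74, 0.76)
```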

Evaluation Schemes (sampling methods)

3-Fold Cross Validation
[Figure: the dataset is reordered arbitrarily and split into three parts; in each of the three train & test iterations (#1, #2, #3) a different third is held out for testing and the remaining two thirds are used for training]
Evaluate statistics for each iteration and then compute the average.

k-Fold Cross Validation
Split the data into k subsets of approximately equal size (and class distribution, if stratified).
For i = 1 to k:
– use the i-th subset for testing and the remaining (k-1) subsets for training
Compute the average accuracy.
k-fold CV can be repeated several times, say 100.
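A sketch of stratified k-fold cross-validation, again assuming scikit-learn; the data set and classifier are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
scores = []
# Each of the 10 folds serves once as the test set, the rest for training.
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("10-fold CV accuracy: %.3f +- %.3f" % (np.mean(scores), np.std(scores)))
```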

Random Sampling (70/30)
Randomly split the data into, say,
– 70% of the data for training
– 30% of the data for testing
Learn on the training data, test on the test data.
Repeat the procedure, say, 100 times, and compute the average accuracy and its confidence interval.
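A sketch of repeated random subsampling with a 70/30 split, under the same placeholder assumptions as above; the 100 repetitions yield an average accuracy and a rough confidence interval around it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
scores = []
for rep in range(100):
    # A fresh random 70/30 split in every repetition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=rep)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

mean, se = np.mean(scores), np.std(scores) / np.sqrt(len(scores))
print("accuracy: %.3f, 95%% CI: (%.3f, %.3f)" % (mean, mean - 1.96 * se, mean + 1.96 * se))
```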

Statistics: calibration and discrimination

Calibration and Discrimination
Calibration
– how accurate are the probabilities assigned by the induced model
– classification accuracy, sensitivity, specificity, ...
Discrimination
– how well the model distinguishes between positive and negative cases
– area under the ROC curve

Test Statistics: Contingency Table of Classification Results
predicted + : true positive (TP) | false positive (FP)
predicted - : false negative (FN) | true negative (TN)
(columns: actual +, actual -)
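A minimal sketch of how such a table can be tallied from true and predicted labels; the positive and negative classes are assumed to be coded as 1 and 0.

```python
def contingency_table(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels coded as 1 (positive) / 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

print(contingency_table([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```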

Classification Accuracy
CA = (TP + TN) / N
Proportion of correctly classified examples.

Sensitivity
Sensitivity = TP / (TP + FN)
Proportion of correctly detected positive examples.
In medicine (+, -: presence and absence of a disease):
– the chance that our model correctly identifies a patient with the disease

Specificity
Specificity = TN / (FP + TN)
Proportion of correctly detected negative examples.
In medicine:
– the chance that our model correctly identifies a patient without the disease
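The three statistics above can be computed directly from the contingency table; the counts in the example call are illustrative only.

```python
def classification_statistics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    ca = (tp + tn) / n            # proportion of correct classifications
    sensitivity = tp / (tp + fn)  # correctly detected positives
    specificity = tn / (fp + tn)  # correctly detected negatives
    return ca, sensitivity, specificity

print(classification_statistics(tp=40, fp=10, fn=5, tn=45))  # (0.85, 0.889, 0.818)
```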

Other Statistics From DL Sackett et al.: Evidence-Based Medicine, Churchill-Livingstone, 2000.

ROC Curves
ROC = Receiver Operating Characteristics
Used since the 70s to evaluate medical prognostic models; recently popular within ML [rediscovery?].
[Figure: ROC plot with 1-specificity (FP rate) on the x-axis and sensitivity (TP rate) on the y-axis, contrasting a very good model with a not-so-good model]

ROC Curve
[Figure: a single ROC curve with points marked at classification thresholds T = 0, T = 0.5, and T = ∞]

ROC Curve (Recipe)
1. Draw a grid:
   – step 1/(number of negative examples) horizontally
   – step 1/(number of positive examples) vertically
2. Sort the results by descending predicted probability.
3. Start at (0, 0).
4. From the table, select the top row(s) with the highest probability.
5. If the selected rows include p positive and n negative examples, move p grid points up and n grid points to the right.
6. Remove the selected rows.
7. If any rows remain, go to step 4.
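A sketch of this recipe in code, under the assumption that the positive class is coded as 1 and that rows sharing the same predicted probability are taken together (which produces diagonal moves); the function and variable names are illustrative, not from the slides.

```python
def roc_points(y_true, p_positive):
    """Walk the ROC grid: up for positives, right for negatives."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sort cases by descending predicted probability of the positive class.
    ranked = sorted(zip(p_positive, y_true), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        # Take all remaining rows that share the current highest probability.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            j += 1
        tp += sum(label for _, label in ranked[i:j])      # p positives: move up
        fp += sum(1 - label for _, label in ranked[i:j])  # n negatives: move right
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points  # list of (FP rate, TP rate) pairs

print(roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.3]))
```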


Area Under ROC
[Figure: ROC plot, FP rate vs. TP rate, with the area under the curve shaded]
For every negative example we sum up the number of positive examples with a higher estimate, and normalize this score by the product of the numbers of positive and negative examples.
A_ROC = P[ P_+(positive example) > P_+(negative example) ]
where P_+ is the probability of the positive class assigned by the model.
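A direct translation of this definition, counting the positive-negative pairs that are ordered correctly (ties, not mentioned on the slide, are counted as one half here); names are illustrative.

```python
def auc(y_true, p_positive):
    """Probability that a random positive example gets a higher estimate
    than a random negative one, labels coded as 1 / 0."""
    positives = [p for p, t in zip(p_positive, y_true) if t == 1]
    negatives = [p for p, t in zip(p_positive, y_true) if t == 0]
    score = sum((pp > pn) + 0.5 * (pp == pn)
                for pp in positives for pn in negatives)
    return score / (len(positives) * len(negatives))

print(auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.3]))  # 0.833...
```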

Area Under ROC
Is expected to range from 0.5 to 1.0.
The score is not affected by class distributions.
Characteristic landmarks
– 0.5: random classifier
– below 0.7: poor classification
– 0.7 to 0.8: ok, reasonable classification
– 0.8 to 0.9: this is where very good predictive models start

Final Thoughts
Never test on the learning set.
Use some sampling procedure for testing.
At the end, evaluate both
– predictive performance
– semantic content (comprehensibility)
Bottom line: good models are those that are useful in practice.