Learning Algorithm Evaluation


Algorithm evaluation: Outline
- Why? Overfitting
- How? Train/test split vs. cross-validation
- What? Evaluation measures
- Who wins? Statistical significance

Introduction

Introduction: A model should perform well on unseen data drawn from the same distribution as the training data.

Classification accuracy as a performance measure
- Success: the instance's class is predicted correctly
- Error: the instance's class is predicted incorrectly
- Error rate: #errors / #instances
- Accuracy: #successes / #instances
Quiz: 50 examples, 10 classified incorrectly. What are the accuracy and error rate?
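For the quiz, the arithmetic is simple enough to check in a couple of lines of Python (the numbers are just the quiz values):

    errors, instances = 10, 50
    error_rate = errors / instances              # 10/50 = 0.20
    accuracy = (instances - errors) / instances  # 40/50 = 0.80
    print(f"accuracy = {accuracy:.0%}, error rate = {error_rate:.0%}")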

Evaluation Rule #1: Never evaluate on training data!

Train and Test Step 1: Randomly split the data into a training set and a test set (e.g. 2/3 vs. 1/3); the held-back test portion is also known as the holdout set.

Train and Test Step 2: Train model on training data

Train and Test Step 3: Evaluate model on test data
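A minimal sketch of steps 1-3 above, assuming scikit-learn; the iris data and the decision-tree learner are arbitrary illustrative choices, not part of the original slides:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    # Step 1: random 2/3 - 1/3 split into training set and holdout (test) set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=42, stratify=y)
    # Step 2: train the model on the training data only
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Step 3: evaluate the model on the held-out test data
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))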

Train and Test Quiz: Can I retry with other parameter settings?

Evaluation
Rule #1: Never evaluate on training data!
Rule #2: Never train on test data! (that includes parameter setting or feature selection)

Train and Test Step 4: Optimize parameters on a separate validation set; keep the test set for the final evaluation only.

Test data leakage
- Never use the test data in any way to create the classifier
- This can be tricky to avoid, e.g. with social network data, where instances are not independent
- The proper procedure uses three sets (a sketch follows below):
  - training set: train the models
  - validation set: optimize algorithm parameters
  - test set: evaluate the final model
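A sketch of the three-set procedure, again assuming scikit-learn; the split sizes, the tuned parameter (tree depth) and the dataset are illustrative assumptions:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    # Carve off the test set first; it is not touched until the very end
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Split the remainder into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    best_depth, best_score = None, -1.0
    for depth in (1, 2, 3, 5, None):   # optimize a parameter on the validation set only
        m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        score = accuracy_score(y_val, m.predict(X_val))
        if score > best_score:
            best_depth, best_score = depth, score
    # Only the final, tuned model is evaluated on the test set, and only once
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))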

Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Trade-off between model performance and evaluation accuracy:
  - more training data gives a better model (but returns diminish)
  - more test data gives a more accurate error estimate

Train and Test Step 5: Build final model on ALL data (more data, better model)

Cross-Validation

k-fold Cross-validation
- Split the data (stratified) into k folds
- Use k-1 folds for training and the remaining fold for testing
- Repeat k times, so that each fold serves as the test set once
- Average the k results (see the code sketch below)
[Figure: the original data split into folds 1-3, each fold used once as the test set]
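A sketch of stratified 10-fold cross-validation, assuming scikit-learn (the dataset and learner are again arbitrary illustrative choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)   # stratified, k = 10
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
    print("per-fold accuracy:", scores)      # one estimate per fold
    print("mean accuracy:", scores.mean())   # average of the k results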

Cross-validation
- Standard method: stratified ten-fold cross-validation
- Why 10 folds? Enough to reduce sampling bias, and experimentally determined to work well

Leave-One-Out Cross-validation
A particular form of cross-validation with #folds = #instances:
- With n instances, build the classifier n times, each time testing on the single held-out instance
- Makes the best use of the data, no sampling bias
- Computationally expensive
[Figure: the original data split into folds 1 ... 100, one instance per fold]
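Leave-one-out is the same idea with as many folds as instances; a sketch assuming scikit-learn (the nearest-neighbour learner is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # n folds for n instances: each instance is the test set exactly once
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
    print("LOO accuracy:", scores.mean())   # fraction of instances predicted correctly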

ROC Analysis

ROC Analysis
- Stands for "Receiver Operating Characteristic"
- From signal processing: the trade-off between hit rate and false-alarm rate over a noisy channel
- Compute the FP rate and TP rate and plot them in ROC space; every classifier is a point in ROC space
- For probabilistic algorithms, collect many points by varying the prediction threshold, or make the evaluation cost-sensitive and vary the costs (see below)

Confusion matrix (columns: actual class, rows: predicted class):

                   actual +              actual -
    predicted +    TP (true positive)    FP (false positive)
    predicted -    FN (false negative)   TN (true negative)
    column total   TP + FN               FP + TN

TP rate (sensitivity) = TP / (TP + FN), i.e. P(TP), the % of true positives
FP rate (fall-out) = FP / (FP + TN), i.e. P(FP), the % of false positives = 1 - specificity
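A sketch of computing these rates from a confusion matrix, assuming scikit-learn (the labels and predictions are toy values; scikit-learn's confusion_matrix puts actual classes in rows and predicted classes in columns):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # toy labels, 1 = positive class
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]   # toy predictions
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tp_rate = tp / (tp + fn)   # sensitivity
    fp_rate = fp / (fp + tn)   # fall-out = 1 - specificity
    print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")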

ROC space: each classifier (e.g. OneR, J48, J48 with fitted parameters) is a single point, plotted with P(FP) (1 - specificity) on the x-axis and P(TP) (sensitivity) on the y-axis. [Figure: the classifiers plotted in ROC space]

ROC curves: for a probabilistic classifier, change the prediction threshold t (predict + when P(+) > t) and plot the resulting (P(FP), P(TP)) points; the Area Under the Curve (AUC) summarizes the curve in a single number (0.75 in the example figure).
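A sketch of tracing a ROC curve and computing its AUC, assuming scikit-learn; the breast-cancer data and the logistic-regression learner are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]          # P(+) for every test instance
    fpr, tpr, thresholds = roc_curve(y_te, scores)    # one (FP rate, TP rate) point per threshold
    print("AUC =", roc_auc_score(y_te, scores))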

ROC curves: alternative construction (easier, but less intuitive)
- Rank the test instances by predicted probability of the positive class
- Start the curve at (0,0) and move down the ranked list
- For each positive instance move up, for each negative instance move right
- A single test set gives a jagged curve; a smoother curve is obtained by averaging over cross-validation folds
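The ranking construction can be written out directly; a sketch in plain Python with made-up scores (steps are 1/P upward for positives and 1/N rightward for negatives, so the curve ends at (1, 1)):

    # (true label, predicted P(+)) for ten toy test instances
    data = [(1, 0.95), (1, 0.85), (0, 0.80), (1, 0.70), (0, 0.55),
            (1, 0.45), (0, 0.40), (0, 0.30), (0, 0.20), (0, 0.10)]
    data.sort(key=lambda d: d[1], reverse=True)   # rank by predicted probability
    P = sum(label for label, _ in data)           # number of positives
    N = len(data) - P                             # number of negatives
    x, y, curve = 0.0, 0.0, [(0.0, 0.0)]
    for label, _ in data:
        if label == 1:
            y += 1 / P    # positive instance: move up
        else:
            x += 1 / N    # negative instance: move right
        curve.append((x, y))
    print(curve)          # the (FP rate, TP rate) points of the jagged ROC curve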

ROC curves: method selection
- Overall: use the method with the largest Area Under the ROC Curve (AUROC)
- If you aim to cover just 40% of the true positives in a sample: use method A
- For a large sample: use method B
- In between: choose between A and B with appropriate probabilities

ROC space and costs. [Figure: ROC space, with axes P(FP) (1 - specificity) and P(TP) (sensitivity), comparing the situation with equal costs to one with skewed costs]

Different Costs
In practice, false positive and false negative errors incur different costs. Examples:
- Medical diagnostic tests: does X have leukemia?
- Loan decisions: approve a mortgage for X?
- Promotional mailing: will X buy the product?
Add a cost matrix to the evaluation that weighs TP, FP, FN and TN; the default 0/1 matrix below simply counts errors:

                 pred +      pred -
    actual +     cTP = 0     cFN = 1
    actual -     cFP = 1     cTN = 0
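A sketch of weighing a confusion matrix by a cost matrix, assuming scikit-learn and NumPy; the skewed costs and the toy labels are made up for illustration:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
    # counts[i, j] = number of instances of actual class i predicted as class j
    counts = confusion_matrix(y_true, y_pred, labels=[1, 0])
    # cost[i, j] = cost of predicting class j when the actual class is i
    cost = np.array([[0, 10],   # actual +: correct costs 0, a missed positive costs 10
                     [1,  0]])  # actual -: a false alarm costs 1, correct costs 0
    print("total cost =", (counts * cost).sum())   # compare models by cost, not raw accuracy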

Statistical Significance

Comparing data mining schemes
Which of two learning algorithms performs better? Note: this is domain dependent!
- Obvious way: compare their 10-fold cross-validation estimates
- Problem: variance in the estimates
- Variance can be reduced by using repeated cross-validation
- However, we still don't know whether the observed difference is reliable

Significance tests
- A significance test tells us how confident we can be that there really is a difference
- Null hypothesis: there is no "real" difference
- Alternative hypothesis: there is a difference
- The test measures how much evidence there is in favor of rejecting the null hypothesis
- Example: given 10 cross-validation scores for each of two algorithms, is B really better than A?
[Figure: the distributions of the scores (perf vs. P(perf)) for algorithms A and B, with mean A and mean B marked]

Paired t-test
- Student's t-test tells us whether the means of two samples (e.g. two sets of 10 cross-validation scores) are significantly different
- Use a paired t-test when the individual samples are paired, i.e. they use the same randomization: the same CV folds are used for both algorithms
- William Gosset (born 1876 in Canterbury, died 1937 in Beaconsfield, England) worked as a chemist in the Guinness brewery in Dublin from 1899, invented the t-test to handle small samples for quality control in brewing, and wrote under the name "Student"
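A sketch of a paired t-test on two sets of fold-wise scores, assuming SciPy; the ten accuracy values per algorithm are made up, but the pairing (fold i of A against fold i of B) is the point of the test:

    from scipy.stats import ttest_rel

    scores_A = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
    scores_B = [0.84, 0.82, 0.86, 0.83, 0.85, 0.80, 0.85, 0.82, 0.88, 0.81]
    t_stat, p_value = ttest_rel(scores_A, scores_B)   # paired: same CV folds for both algorithms
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # reject the null hypothesis of "no real difference" if p is below the chosen alpha, e.g. 0.05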

Performing the test
- Fix a significance level α: a significant difference at the α% level implies a (100 - α)% chance that there really is a difference; scientific work usually uses 5% or smaller (>95% certainty)
- Divide α by two (two-tailed test) and look up the critical value z corresponding to α/2; the values below are from Student's t distribution with 9 degrees of freedom (10 folds)
- If t ≤ -z or t ≥ z, the difference is significant and the null hypothesis can be rejected

    α      z
    0.1%   4.30
    0.5%   3.25
    1%     2.82
    5%     1.83
    10%    1.38
    20%    0.88
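The same decision can be made explicitly against a critical value; a sketch assuming SciPy, reusing the made-up scores from above (10 folds give 9 degrees of freedom):

    from scipy.stats import t, ttest_rel

    scores_A = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
    scores_B = [0.84, 0.82, 0.86, 0.83, 0.85, 0.80, 0.85, 0.82, 0.88, 0.81]
    alpha = 0.05
    df = len(scores_A) - 1                  # 9 degrees of freedom for 10 folds
    z = t.ppf(1 - alpha / 2, df)            # two-tailed critical value, about 2.26 at the 5% level
    t_stat, _ = ttest_rel(scores_A, scores_B)
    if t_stat <= -z or t_stat >= z:
        print(f"t = {t_stat:.2f}: significant at the {alpha:.0%} level, reject the null hypothesis")
    else:
        print(f"t = {t_stat:.2f}: not significant at the {alpha:.0%} level")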