Published by America Holeman. Modified about 1 year ago.

1
Learning Algorithm Evaluation

2
Algorithm evaluation: Outline
- Why? Overfitting
- How? Train/Test vs. Cross-validation
- What? Evaluation measures
- Who wins? Statistical significance

3
Introduction

4
A model should perform well on unseen data drawn from the same distribution

5
Classification accuracy performance measure
- Success: instance's class is predicted correctly
- Error: instance's class is predicted incorrectly
- Error rate: #errors / #instances
- Accuracy: #successes / #instances
Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate?
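The two definitions above can be checked on the quiz numbers with a minimal sketch. The function name `accuracy_and_error_rate` is made up for this illustration and does not come from the slides.

```python
def accuracy_and_error_rate(n_instances, n_errors):
    """Return (accuracy, error_rate) for a classifier tested on n_instances."""
    n_successes = n_instances - n_errors
    accuracy = n_successes / n_instances   # #successes / #instances
    error_rate = n_errors / n_instances    # #errors / #instances
    return accuracy, error_rate


# Quiz from the slide: 50 examples, 10 classified incorrectly.
accuracy, error_rate = accuracy_and_error_rate(n_instances=50, n_errors=10)
print(accuracy, error_rate)  # 0.8 0.2
```

Accuracy and error rate always sum to 1, so reporting one of them is enough.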

6
Evaluation
Rule #1: Never evaluate on training data!

7
Train and Test
Step 1: Randomly split data into a training set and a test set (e.g. 2/3 and 1/3). The test set is also known as the holdout set.
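A random 2/3 to 1/3 holdout split might look like the sketch below. The helper name `holdout_split`, the fixed seed, and the stand-in data are assumptions made for the illustration.

```python
import random


def holdout_split(instances, test_fraction=1 / 3, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(30))  # stand-in for 30 labeled instances
train, test = holdout_split(data)
print(len(train), len(test))  # 20 10
```

Every instance lands in exactly one of the two sets, so the test set stays unseen during training.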

8
Train and Test
Step 2: Train model on training data

9
Train and Test
Step 3: Evaluate model on test data

10
Train and Test
Quiz: Can I retry with other parameter settings?

11
Evaluation
Rule #1: Never evaluate on training data!
Rule #2: Never train on test data! (that includes parameter setting or feature selection)

12
Train and Test
Step 4: Optimize parameters on a separate validation set
[Figure labels: validation, testing]

13
Test data leakage
- Never use test data to create the classifier
- Can be tricky: e.g. social network
- Proper procedure uses three sets
  - training set: train models
  - validation set: optimize algorithm parameters
  - test set: evaluate final model
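The three-set procedure can be sketched with a toy example. Everything concrete here is an assumption for illustration: the synthetic one-feature data with 10% label noise, the k-nearest-neighbor classifier, and the candidate values of k. The point is only that k is chosen on the validation set and the test set is touched once, at the end.

```python
import random


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (1-D k-NN)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    positives = sum(label for _, label in nearest)
    return positives * 2 > k  # k is odd, so there are no ties


def accuracy(train, k, instances):
    correct = sum(knn_predict(train, x, k) == label for x, label in instances)
    return correct / len(instances)


rng = random.Random(0)


def make_instance():
    """Hypothetical data: label is x > 0.5, flipped with probability 0.1."""
    x = rng.random()
    label = x > 0.5
    if rng.random() < 0.1:
        label = not label
    return x, label


data = [make_instance() for _ in range(300)]
train, validation, test = data[:150], data[150:225], data[225:]

# Optimize the parameter k on the validation set only.
best_k = max([1, 3, 5, 7, 9], key=lambda k: accuracy(train, k, validation))
# Evaluate the final model once on the untouched test set.
test_accuracy = accuracy(train, best_k, test)
print(best_k, test_accuracy)
```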

14
Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Trade-off: model performance vs. evaluation accuracy
  - More training data, better model (but returns diminish)
  - More test data, more accurate error estimate

15
Train and Test
Step 5: Build final model on ALL data (more data, better model)

16
Cross-Validation

17
k-fold Cross-validation
- Split data (stratified) into k folds
- Use (k-1) folds for training, 1 for testing
- Repeat k times
- Average results
[Figure: original data split into Fold 1, Fold 2, Fold 3, each marked as train or test]
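The steps above can be sketched as follows. The names `stratified_k_fold`, `cross_validate` and `majority_baseline` are invented for this illustration, and the data is assumed to be shuffled already. Stratification is done by dealing each class's instances round-robin over the folds.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) for each of k stratified folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    # Deal indices class by class so every fold keeps roughly the class mix.
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, test


def cross_validate(labels, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in stratified_k_fold(labels, k)]
    return sum(scores) / len(scores)


labels = ["+"] * 6 + ["-"] * 9  # hypothetical class labels


def majority_baseline(train, test):
    """Toy 'model': always predict the majority class of the training fold."""
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test) / len(test)


print(cross_validate(labels, 3, majority_baseline))
```

Each instance is used for testing exactly once and for training k-1 times.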

18
Cross-validation
- Standard method: stratified ten-fold cross-validation
- Why 10? Enough to reduce sampling bias; experimentally determined

19
Leave-One-Out Cross-validation
- A particular form of cross-validation: #folds = #instances
- n instances, build classifier n times
- Makes best use of the data, no sampling bias
- Computationally expensive
[Figure: original data with 100 instances, split into Fold 1 through Fold 100]
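A minimal sketch of leave-one-out evaluation, assuming a hypothetical 1-nearest-neighbor classifier on a single numeric feature and a small made-up data set. The loop builds the classifier's training set n times, which is where the computational cost comes from.

```python
def leave_one_out_accuracy(instances, classify):
    """Hold out each instance once, train on the other n-1, average the hits."""
    hits = 0
    for i, (x, label) in enumerate(instances):
        training = instances[:i] + instances[i + 1:]  # all but instance i
        hits += classify(training, x) == label
    return hits / len(instances)


def nearest_neighbor(training, x):
    """Hypothetical 1-nearest-neighbor classifier on one numeric feature."""
    return min(training, key=lambda point: abs(point[0] - x))[1]


# Made-up data: (feature value, class label).
data = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (3.0, "b"), (3.1, "b"), (2.0, "b")]
loo_accuracy = leave_one_out_accuracy(data, nearest_neighbor)
print(loo_accuracy)
```

In this toy data only the instance at 2.0 is misclassified when held out, because its nearest remaining neighbor belongs to class "a".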

20
ROC Analysis

21
- Stands for "Receiver Operating Characteristic"
- From signal processing: tradeoff between hit rate and false alarm rate over a noisy channel
- Compute FPR, TPR and plot them in ROC space
- Every classifier is a point in ROC space
- For probabilistic algorithms:
  - Collect many points by varying the prediction threshold
  - Or, make cost sensitive and vary costs (see below)

22
Confusion Matrix

              predicted +            predicted -
actual +      TP (true positive)     FN (false negative)
actual -      FP (false positive)    TN (true negative)

TP rate (sensitivity) = TP / (TP + FN)
FP rate (fall-out) = FP / (FP + TN)
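As a small illustration, the four cells and the two rates can be computed from lists of actual and predicted labels. The function name `confusion_rates` and the example labels are assumptions for this sketch; `True` stands for the positive class.

```python
def confusion_rates(actual, predicted):
    """Count TP, FN, FP, TN and derive the TP rate and FP rate."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif a:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "TPR": tp / (tp + fn),  # sensitivity
        "FPR": fp / (fp + tn),  # fall-out
    }


# Made-up labels: 4 actual positives followed by 6 actual negatives.
actual = [True] * 4 + [False] * 6
predicted = [True, True, True, False, True, False, False, False, False, False]
result = confusion_rates(actual, predicted)
print(result["TPR"], result["FPR"])
```

The sketch assumes at least one actual positive and one actual negative; otherwise a rate's denominator is zero.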

23
ROC space
[Figure: classifiers plotted in ROC space: J48, OneR, and J48 with fitted parameters]

24
ROC curves
- Change prediction threshold: for threshold t, predict positive if P(+) > t
- Area Under Curve (AUC) = 0.75
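Varying the threshold can be sketched as below: each threshold t turns the predicted probabilities into hard predictions, which give one (FPR, TPR) point in ROC space. The function name `roc_point`, the scores, and the labels are made up for the illustration.

```python
def roc_point(scores, actual, t):
    """Return (FPR, TPR) when predicting positive for every score P(+) > t."""
    predicted = [s > t for s in scores]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    return fp / n_neg, tp / n_pos


# Hypothetical predicted probabilities P(+) and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
actual = [True, True, False, True, False, False]

for t in (1.0, 0.75, 0.5, 0.0):
    print(t, roc_point(scores, actual, t))
```

A threshold of 1.0 predicts nothing positive and gives the point (0, 0); a threshold of 0.0 predicts everything positive and gives (1, 1).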

25
ROC curves
- Jagged curve: one set of test data
- Smooth curve: use cross-validation
Alternative method (easier, but less intuitive):
- Rank probabilities
- Start curve in (0,0), move down probability list
- If positive, move up. If negative, move right
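The ranking method above can be sketched as follows. Step sizes are normalized (1/#positives up, 1/#negatives right) so the curve ends at (1, 1), and the area is computed with the trapezoid rule. The function names and the example data are assumptions, and tied probabilities are not given any special treatment here.

```python
def roc_curve(scores, actual):
    """Walk down the ranked list: step up for a positive, right for a negative."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    x = y = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            y += 1
        else:
            x += 1
        points.append((x / n_neg, y / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive curve points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical predicted probabilities P(+) and true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
actual = [True, True, False, True, False, False]
points = roc_curve(scores, actual)
auc = area_under_curve(points)
print(points)
print(auc)
```

For this data the area is 8/9: of the 9 positive-negative pairs, 8 have the positive ranked above the negative.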

26
ROC curves: Method selection
- Overall: use method with largest Area Under ROC curve (AUROC)
- If you aim to cover just 40% of true positives in a sample: use method A
- Large sample: use method B
- In between: choose between A and B with appropriate probabilities

27
ROC Space and Costs
[Figure: two panels, one for equal costs and one for skewed costs]

28
Different Costs
- In practice, FP and FN errors incur different costs
- Examples:
  - Medical diagnostic tests: does X have leukemia?
  - Loan decisions: approve mortgage for X?
  - Promotional mailing: will X buy the product?
- Add cost matrix to evaluation that weighs TP, FP, ...

              pred +       pred -
actual +      c_TP = 0     c_FN = 1
actual -      c_FP = 1     c_TN = 0
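A cost-weighted evaluation can be sketched by looking up each (actual, predicted) pair in the cost matrix and summing. `UNIT_COSTS` mirrors the matrix on the slide. `SKEWED_COSTS`, with a false negative costing 5, is a hypothetical example and not taken from the slides.

```python
# Keys are (actual, predicted); values are the cost of that outcome.
UNIT_COSTS = {("+", "+"): 0, ("+", "-"): 1, ("-", "+"): 1, ("-", "-"): 0}
# Hypothetical skewed matrix: missing a positive is 5 times as costly.
SKEWED_COSTS = {("+", "+"): 0, ("+", "-"): 5, ("-", "+"): 1, ("-", "-"): 0}


def total_cost(actual, predicted, costs):
    """Sum the cost of every prediction according to the cost matrix."""
    return sum(costs[(a, p)] for a, p in zip(actual, predicted))


# Made-up labels containing one false negative and one false positive.
actual = ["+", "+", "-", "-", "-"]
predicted = ["+", "-", "+", "-", "-"]
print(total_cost(actual, predicted, UNIT_COSTS))    # 2
print(total_cost(actual, predicted, SKEWED_COSTS))  # 6
```

With the unit matrix the total cost is just the number of errors, so plain error rate is the special case of equal costs.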

29
Statistical Significance

30
Comparing data mining schemes
- Which of two learning algorithms performs better?
  - Note: this is domain dependent!
- Obvious way: compare 10-fold CV estimates
- Problem: variance in estimate
  - Variance can be reduced using repeated CV
  - However, we still don't know whether results are reliable

31
Significance tests
- Significance tests tell us how confident we can be that there really is a difference
  - Null hypothesis: there is no "real" difference
  - Alternative hypothesis: there is a difference
- A significance test measures how much evidence there is in favor of rejecting the null hypothesis
- E.g. 10 cross-validation scores: B better than A?
[Figure: distributions P(perf) of performance for Algorithm A and Algorithm B, with mean A and mean B marked]

32
Paired t-test
- Student's t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
- Use a paired t-test when individual samples are paired, i.e., they use the same randomization
- Same CV folds are used for both algorithms
William Gosset
- Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
- Worked as a chemist in the Guinness brewery in Dublin
- Invented the t-test to handle small samples for quality control in brewing
- Wrote under the name "Student"
[Figure: distributions P(perf) of performance for Algorithm A and Algorithm B]

33
Performing the test
1. Fix a significance level α
   - A significant difference at the α% level implies a (100-α)% chance that there really is a difference
   - Scientific work: 5% or smaller (>95% certainty)
2. Divide α by two (two-tailed test)
3. Look up the z-value corresponding to α/2
4. If t ≤ -z or t ≥ z: the difference is significant and the null hypothesis can be rejected
[Figure: distributions P(perf) of performance for Algorithm A and Algorithm B]

α        z
0.1%     …
…%       3.25
1%       2.82
5%       …
…%       …
…%       0.88
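The procedure above can be sketched for two sets of paired cross-validation scores. The t statistic is the mean of the per-fold differences divided by its standard error. The fold scores are made up, and the critical value 2.262 is taken from a standard Student t table (9 degrees of freedom, α/2 = 2.5%), not from the slides.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(k)) over the k paired differences d."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(differences)
    # stdev is the sample standard deviation (divides by k - 1).
    # It is zero if all differences are equal, which would divide by zero here.
    return mean(differences) / (stdev(differences) / math.sqrt(k))


def is_significant(t, critical_value):
    """Two-tailed test: reject the null hypothesis if t <= -z or t >= z."""
    return t <= -critical_value or t >= critical_value


# Hypothetical accuracies of algorithms A and B on the same 10 CV folds.
scores_a = [0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.80, 0.82, 0.84]
scores_b = [0.83, 0.85, 0.80, 0.86, 0.84, 0.82, 0.85, 0.83, 0.84, 0.87]

t = paired_t_statistic(scores_a, scores_b)
# Assumed critical value: 5% level, two-tailed, 9 degrees of freedom.
print(t, is_significant(t, critical_value=2.262))
```

Pairing matters: because both algorithms are scored on the same folds, the fold-to-fold variation cancels out in the differences.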
