
1 Learning Algorithm Evaluation

2 Algorithm evaluation: Outline
- Why? Overfitting
- How? Train/Test vs Cross-validation
- What? Evaluation measures
- Who wins? Statistical significance

3 Introduction

4 A model should perform well on unseen data drawn from the same distribution.

5 Classification accuracy: a performance measure
- Success: the instance's class is predicted correctly
- Error: the instance's class is predicted incorrectly
- Error rate: #errors / #instances
- Accuracy: #successes / #instances
Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate?
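A minimal Python sketch of the two measures, worked with the quiz numbers above (50 examples, 10 misclassified) as illustrative values:

```python
# Accuracy and error rate from counts (quiz numbers used as example values).
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                 # 10/50 = 0.20
accuracy = (n_instances - n_errors) / n_instances    # 40/50 = 0.80

print(f"Error rate: {error_rate:.2f}")
print(f"Accuracy:   {accuracy:.2f}")
```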

6 Evaluation Rule #1 Never evaluate on training data!

7 Train and Test Step 1: Randomly split the data into a training set and a test set (e.g. 2/3-1/3); the held-out test set is a.k.a. the holdout set

8 Train and Test Step 2: Train model on training data

9 Train and Test Step 3: Evaluate model on test data
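A minimal sketch of Steps 1-3, assuming scikit-learn; the synthetic dataset and the decision-tree classifier are illustrative choices, not part of the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative data; any labeled dataset would do.
X, y = make_classification(n_samples=300, random_state=0)

# Step 1: random 2/3-1/3 split into training and test (holdout) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Step 2: train the model on the training data only.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 3: evaluate on the held-out test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```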

10 Train and Test Quiz: Can I retry with other parameter settings?

11 Evaluation Rules
Rule #1: Never evaluate on training data!
Rule #2: Never train on test data! (That includes parameter setting and feature selection.)

12 Train and Test Step 4: Optimize parameters on a separate validation set

13 Test data leakage
- Never use test data to create the classifier
- Can be tricky: e.g. social network data
- Proper procedure uses three sets:
  - training set: train models
  - validation set: optimize algorithm parameters
  - test set: evaluate the final model
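A sketch of the three-set procedure, assuming scikit-learn; the split sizes and the parameter being tuned (tree depth) are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, random_state=0)

# Split into training, validation, and test sets (e.g. 60/20/20).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Optimize a parameter on the validation set (never on the test set).
best_depth, best_score = None, -1.0
for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

# Evaluate the chosen model once on the untouched test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("Best depth:", best_depth, "Test accuracy:", accuracy_score(y_test, final.predict(X_test)))
```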

14 Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Trade-off: model performance vs. accuracy of the evaluation
- More training data: better model (but returns diminish)
- More test data: more accurate error estimate

15 Train and Test Step 5: Build the final model on ALL data (more data, better model)

16 Cross-Validation

17 k-fold Cross-validation
- Split the data (stratified) into k folds
- Use (k-1) folds for training, 1 for testing
- Repeat k times
- Average the results
(Diagram: original data split into Fold 1, Fold 2, Fold 3; each fold serves once as the test set.)

18 Cross-validation
- Standard method: stratified ten-fold cross-validation
- Why 10? Enough to reduce sampling bias; experimentally determined
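A sketch of stratified ten-fold cross-validation, assuming scikit-learn; the dataset and classifier are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Stratified 10-fold CV: train on 9 folds, test on the remaining fold, repeat 10 times.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```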

19 Leave-One-Out Cross-validation
- A particular form of cross-validation: #folds = #instances
- n instances: build the classifier n times
- Makes the best use of the data, no sampling bias
- Computationally expensive
(Diagram: 100 instances, Fold 1 ... Fold 100, each fold holding out a single instance.)
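Leave-one-out is k-fold with k equal to the number of instances; a minimal scikit-learn sketch (dataset and classifier again illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

# One fold per instance: the classifier is built n times, each tested on a single instance.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("LOO accuracy estimate:", scores.mean())
```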

20 ROC Analysis

21 Stands for "Receiver Operating Characteristic"
- From signal processing: trade-off between hit rate and false alarm rate over a noisy channel
- Compute FPR and TPR and plot them in ROC space
- Every classifier is a point in ROC space
- For probabilistic algorithms:
  - Collect many points by varying the prediction threshold
  - Or, make the evaluation cost-sensitive and vary the costs (see below)

22 Confusion Matrix

              predicted +            predicted -
actual +      TP (true positive)     FN (false negative)
actual -      FP (false positive)    TN (true negative)

TP rate (sensitivity) = TP / (TP + FN)
FP rate (fall-out)    = FP / (FP + TN)
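The two rates follow directly from the confusion-matrix counts; a small Python sketch with made-up counts:

```python
def tp_fp_rates(tp, fn, fp, tn):
    """True-positive rate (sensitivity) and false-positive rate (fall-out)."""
    tpr = tp / (tp + fn)   # fraction of actual positives predicted positive
    fpr = fp / (fp + tn)   # fraction of actual negatives predicted positive
    return tpr, fpr

# Illustrative counts only.
print(tp_fp_rates(tp=40, fn=10, fp=5, tn=45))   # (0.8, 0.1)
```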

23 ROC space (Plot: example classifiers J48, OneR, and J48 with fitted parameters shown as points in ROC space.)

24 ROC curves: vary the prediction threshold t, predicting + when P(+) > t. (Plot: example ROC curve with Area Under Curve (AUC) = 0.75.)

25 ROC curves
- Jagged curve: one set of test data
- Smooth curve: use cross-validation
- Alternative method (easier, but less intuitive): rank by predicted probability
  - Start the curve at (0,0) and move down the probability list
  - If the instance is positive, move up; if negative, move right
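A sketch of the ranking method just described, in plain Python: sort instances by predicted probability of the positive class, start at (0,0), step up for each positive and right for each negative. The labels and probabilities below are made up for illustration.

```python
def roc_points(labels, probs):
    """ROC curve via the ranking method: one step per instance."""
    # Sort instances by predicted probability of the positive class, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos

    tpr, fpr, points = 0.0, 0.0, [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tpr += 1 / n_pos       # positive instance: move up
        else:
            fpr += 1 / n_neg       # negative instance: move right
        points.append((fpr, tpr))
    return points

# Illustrative labels (1 = positive) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
probs  = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
curve = roc_points(labels, probs)

# Area under the curve via the trapezoidal rule over the (FPR, TPR) points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
print(curve)
print("AUC:", auc)
```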

26 ROC curves: method selection
- Overall: use the method with the largest Area Under the ROC Curve (AUROC)
- If you aim to cover just 40% of the true positives in a sample: use method A
- For a large sample: use method B
- In between: choose between A and B with appropriate probabilities

27 ROC Space and Costs (Plot: ROC space with lines for equal costs and for skewed costs.)

28 Different Costs
- In practice, false positive and false negative errors incur different costs
- Examples:
  - Medical diagnostic tests: does X have leukemia?
  - Loan decisions: approve mortgage for X?
  - Promotional mailing: will X buy the product?
- Add a cost matrix to the evaluation that weighs TP, FP, FN, TN, e.g.:

              pred +      pred -
actual +      c_TP = 0    c_FN = 1
actual -      c_FP = 1    c_TN = 0
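A sketch of weighing a confusion matrix by such a cost matrix; the counts are made up, and the costs follow the 0/1 matrix above:

```python
# Confusion-matrix counts (illustrative values).
counts = {"TP": 40, "FN": 10, "FP": 5, "TN": 45}

# Cost matrix from the slide: correct predictions cost 0, both error types cost 1.
costs = {"TP": 0, "FN": 1, "FP": 1, "TN": 0}

total_cost = sum(counts[k] * costs[k] for k in counts)
avg_cost = total_cost / sum(counts.values())
print("Total cost:", total_cost)                 # 15 with these counts
print("Average cost per instance:", avg_cost)    # 0.15
```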

29 Statistical Significance

30 Comparing data mining schemes
- Which of two learning algorithms performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold CV estimates
- Problem: variance in the estimate
- Variance can be reduced using repeated CV
- However, we still don't know whether the results are reliable

31 Significance tests
- Significance tests tell us how confident we can be that there really is a difference
- Null hypothesis: there is no "real" difference
- Alternative hypothesis: there is a difference
- A significance test measures how much evidence there is in favor of rejecting the null hypothesis
- E.g. 10 cross-validation scores: is B better than A?
(Plot: performance distributions P(perf) for Algorithm A and Algorithm B with their means.)

32 Paired t-test
- Student's t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
- Use a paired t-test when the individual samples are paired, i.e., they use the same randomization (the same CV folds are used for both algorithms)
- William Gosset (born 1876 in Canterbury; died 1937 in Beaconsfield, England) worked as a chemist in the Guinness brewery in Dublin. He invented the t-test to handle small samples for quality control in brewing and wrote under the name "Student".

33 Performing the test
1. Fix a significance level α
   - A significant difference at the α% level implies a (100-α)% chance that there really is a difference
   - Scientific work: 5% or smaller (>95% certainty)
2. Divide α by two (two-tailed test)
3. Look up the z-value corresponding to α/2
4. If t ≤ -z or t ≥ z: the difference is significant, i.e. the null hypothesis can be rejected
(Slide also shows a table of z-values for a range of significance levels α and the performance distributions for the two algorithms.)
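A sketch of the whole procedure on two sets of ten cross-validation scores, assuming SciPy is available; the scores are made up, and scipy.stats.ttest_rel already returns a two-sided p-value, so it can be compared to α directly instead of looking up a critical z-value:

```python
from scipy import stats

# Ten per-fold accuracies for each algorithm, from the SAME CV folds (paired samples).
scores_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.81]
scores_b = [0.84, 0.82, 0.86, 0.83, 0.80, 0.85, 0.84, 0.83, 0.81, 0.84]

alpha = 0.05                                            # significance level
t_stat, p_value = stats.ttest_rel(scores_b, scores_a)   # paired, two-tailed t-test

print("t =", t_stat, "p =", p_value)
if p_value < alpha:
    print("Difference is significant: reject the null hypothesis.")
else:
    print("No significant difference at this level.")
```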

