4 Introduction
- A model should perform well on unseen data drawn from the same distribution
5 Classification accuracy: performance measure
- Success: instance's class is predicted correctly
- Error: instance's class is predicted incorrectly
- Error rate: #errors / #instances
- Accuracy: #successes / #instances
- Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate?
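The quiz follows directly from the two definitions; a minimal Python sketch using the quiz numbers (note that accuracy is simply 1 minus the error rate):

```python
# Quiz numbers: 50 examples, 10 classified incorrectly
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                 # #errors / #instances
accuracy = (n_instances - n_errors) / n_instances   # #successes / #instances

print(f"accuracy = {accuracy:.0%}, error rate = {error_rate:.0%}")
# -> accuracy = 80%, error rate = 20%
```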
6 Evaluation
- Rule #1: Never evaluate on training data!
7 Train and Test
- Step 1: Randomly split the data into a training and a test set (e.g. 2/3-1/3)
- The test set is a.k.a. the holdout set
8 Train and Test
- Step 2: Train the model on the training data
9 Train and Test
- Step 3: Evaluate the model on the test data
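Steps 1-3 as a short scikit-learn sketch; the iris data and the decision-tree classifier are arbitrary placeholders for illustration, not something the slides prescribe:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: random 2/3-1/3 split into training and holdout (test) set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

# Step 2: train the model on the training data only
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 3: evaluate on the held-out test data
print("test accuracy:", clf.score(X_test, y_test))
```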
10 Train and Test
- Quiz: Can I retry with other parameter settings?
11 Evaluation
- Rule #1: Never evaluate on training data!
- Rule #2: Never train on test data! (that includes parameter setting and feature selection)
12 Train and Test
- Step 4: Optimize parameters on a separate validation set
13 Test data leakage
- Never use test data to create the classifier
- Can be tricky to avoid: e.g. in social network data, links between instances can leak information across the split
- Proper procedure uses three sets:
  - training set: train models
  - validation set: optimize algorithm parameters
  - test set: evaluate the final model
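One way to realize the three-set procedure, sketched with scikit-learn; the 60/20/20 split and the max_depth grid are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 60% training, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Optimize a parameter on the validation set, never on the test set
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 5, None):
    score = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# The test set is touched exactly once, for the final estimate
final = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```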
14 Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Trade-off: model performance vs. accuracy of the error estimate
  - More training data, better model (but returns diminish)
  - More test data, more accurate error estimate
15 Train and Test
- Step 5: Build the final model on ALL data (more data, better model)
17 k-fold Cross-validation
- Split the data (stratified) into k folds
- Use k-1 folds for training, 1 for testing
- Repeat k times
- Average the results
[Figure: original data split into k folds; each fold serves once as the test set, the rest as training]
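A stratified 10-fold cross-validation sketch with scikit-learn (the classifier is again an arbitrary placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 10 stratified folds; each fold serves once as the test set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```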
18 Cross-validation
- Standard method: stratified ten-fold cross-validation
- Why 10? Enough to reduce sampling bias; experimentally determined
19 Leave-One-Out Cross-validation
- A particular form of cross-validation: #folds = #instances
- n instances: build the classifier n times
- Makes best use of the data, no sampling bias
- Computationally expensive
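The same sketch with leave-one-out in place of 10 folds; note that this trains one model per instance:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# #folds = #instances: 150 models for the 150-instance iris data
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```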
21 ROC Analysis
- Stands for "Receiver Operating Characteristic"
- From signal processing: tradeoff between hit rate and false alarm rate over a noisy channel
- Compute FPR and TPR and plot them in ROC space
- Every classifier is a point in ROC space
- For probabilistic algorithms:
  - Collect many points by varying the prediction threshold
  - Or, make the algorithm cost-sensitive and vary the costs (see below)
24 ROC curves
- Change the prediction threshold t: predict + if P(+) > t
- P(TP): % true positives: sensitivity
- P(FP): % false positives: 1 - specificity
- Area Under Curve (AUC) = 0.75 in the plotted example
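A sketch of threshold-based ROC points with scikit-learn; the synthetic data and the logistic regression model are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Predicted P(+) for the test instances
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# One (FPR, TPR) point per threshold t, i.e. per "P(+) > t" rule
fpr, tpr, thresholds = roc_curve(y_te, proba)
print("AUC:", roc_auc_score(y_te, proba))
```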
25 ROC curves
- Alternative method (easier, but less intuitive): rank the probabilities
- Start the curve in (0,0), move down the probability-ranked list
- If positive, move up; if negative, move right
- Jagged curve: one set of test data
- Smooth curve: use cross-validation
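The ranking method can be written out in a few lines of plain Python, which makes the up/right moves explicit:

```python
def roc_points(y_true, y_score):
    """ROC points via the ranking method: sort by predicted probability,
    then step up for each positive and right for each negative."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0    # start the curve in (0, 0)
    # move down the probability-ranked list
    for label, _ in sorted(zip(y_true, y_score), key=lambda p: -p[1]):
        if label == 1:
            tp += 1   # positive: move up
        else:
            fp += 1   # negative: move right
        points.append((fp / neg, tp / pos))
    return points     # jagged curve for one test set

print(roc_points([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.4]))
```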
26 ROC curves: method selection
- Overall: use the method with the largest Area Under the ROC Curve (AUROC)
- If you aim to cover just 40% of the true positives in a sample: use method A
- Large sample: use method B
- In between: choose between A and B with appropriate probabilities
27 ROC Space and Costs
- P(TP): % true positives: sensitivity
- P(FP): % false positives: 1 - specificity
[Figure: ROC space under equal costs vs. skewed costs]
28 Different Costs
- In practice, FP and FN errors incur different costs
- Examples:
  - Medical diagnostic tests: does X have leukemia?
  - Loan decisions: approve mortgage for X?
  - Promotional mailing: will X buy the product?
- Add a cost matrix to the evaluation that weighs TP, FP, FN, TN:

             pred +      pred -
  actual +   c_TP = 0    c_FN = 1
  actual -   c_FP = 1    c_TN = 0
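Weighing an evaluation with the slide's cost matrix amounts to an elementwise product with the confusion matrix; the labels below are a made-up toy example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Cost matrix in sklearn's label order (rows: actual 0/-, 1/+; cols: pred 0/-, 1/+)
costs = np.array([[0, 1],   # c_TN = 0, c_FP = 1
                  [1, 0]])  # c_FN = 1, c_TP = 0

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)     # counts per (actual, predicted) cell
print("total cost:", (cm * costs).sum())  # here: 1 FN + 1 FP = 2
```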
30 Comparing data mining schemes
- Which of two learning algorithms performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold CV estimates
- Problem: variance in the estimate
- Variance can be reduced using repeated CV
- However, we still don't know whether the results are reliable
31 Significance tests
- Significance tests tell us how confident we can be that there really is a difference
- Null hypothesis: there is no "real" difference
- Alternative hypothesis: there is a difference
- A significance test measures how much evidence there is in favor of rejecting the null hypothesis
- E.g. 10 cross-validation scores: is B better than A?
[Figure: score distributions of Algorithm A and Algorithm B along the performance axis, with the two means marked]
32 Paired t-test
- Student's t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
- Use a paired t-test when the individual samples are paired, i.e., they use the same randomization: the same CV folds are used for both algorithms
- William Gosset (born 1876 in Canterbury; died 1937 in Beaconsfield, England) worked as a chemist in the Guinness brewery in Dublin. He invented the t-test to handle small samples for quality control in brewing, and wrote under the name "Student".
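A paired t-test sketch using SciPy; the two algorithms are arbitrary stand-ins, and the key point is that both are scored on the same CV folds:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same randomization: identical folds for both algorithms
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores_a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
scores_b = cross_val_score(GaussianNB(), X, y, cv=cv)

t, p = ttest_rel(scores_a, scores_b)  # paired: fold i of A vs. fold i of B
print(f"t = {t:.3f}, p = {p:.3f}")    # small p -> reject the null hypothesis
```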
33 Performing the test
- Fix a significance level α
  - A significant difference at the α% level implies a (100-α)% chance that there really is a difference
  - Scientific work: 5% or smaller (>95% certainty)
- Divide α by two (two-tailed test)
- Look up the z-value corresponding to α/2
- If t <= -z or t >= z: the difference is significant, i.e. the null hypothesis can be rejected

  α      z
  0.1%   4.30
  0.5%   3.25
  1%     2.82
  5%     1.83
  10%    1.38
  20%    0.88
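The statistic itself is easy to compute by hand from the per-fold differences; the scores below are invented numbers, and 2.82 is simply the table's 1% entry:

```python
import numpy as np

def paired_t(scores_a, scores_b):
    """t statistic of the paired differences (k - 1 degrees of freedom)."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Hypothetical per-fold accuracies from the same 10 CV folds
scores_a = [0.82, 0.85, 0.80, 0.84, 0.86, 0.83, 0.81, 0.85, 0.84, 0.82]
scores_b = [0.78, 0.80, 0.79, 0.81, 0.82, 0.80, 0.77, 0.81, 0.80, 0.79]

t = paired_t(scores_a, scores_b)
print(t, "significant at 1%:", abs(t) >= 2.82)
```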