Introduction
- A model should perform well on unseen data drawn from the same distribution.
Classification accuracy (performance measure)
- Success: instance's class is predicted correctly
- Error: instance's class is predicted incorrectly
- Error rate: #errors / #instances
- Accuracy: #successes / #instances
- Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate?
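The quiz can be checked with a two-line computation; the numbers below come straight from the slide (50 examples, 10 misclassified):

```python
# Accuracy and error rate for the quiz: 50 examples, 10 misclassified.
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                 # #errors / #instances
accuracy = (n_instances - n_errors) / n_instances   # #successes / #instances

print(accuracy, error_rate)  # 0.8 0.2
```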
Evaluation
- Rule #1: Never evaluate on training data!
Train and Test
- Step 1: Randomly split the data into a training set and a test set (e.g. 2/3-1/3); the test set is also known as the holdout set.

Train and Test
- Step 2: Train the model on the training data.

Train and Test
- Step 3: Evaluate the model on the test data.
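Steps 1-3 can be sketched in plain Python. The 2/3-1/3 split follows the slide; the helper name `holdout_split`, the fixed seed, and the toy data are illustrative choices:

```python
import random

def holdout_split(data, train_frac=2/3, seed=42):
    """Randomly split data into a training set and a test (holdout) set."""
    items = list(data)
    random.Random(seed).shuffle(items)   # random split, reproducible via seed
    cut = round(len(items) * train_frac)
    return items[:cut], items[cut:]

# 30 instances, 2/3-1/3 split: 20 for training, 10 held out for testing.
train, test = holdout_split(range(30))
```

A model would then be fit on `train` only and scored on `test` only.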
Train and Test
- Quiz: Can I retry with other parameter settings?
Evaluation
- Rule #1: Never evaluate on training data!
- Rule #2: Never train on test data! (that includes parameter tuning and feature selection)
Train and Test
- Step 4: Optimize parameters on a separate validation set.
Test data leakage
- Never use the test data to create the classifier.
- This can be tricky in practice (e.g. with social network data).
- The proper procedure uses three sets:
  - training set: train models
  - validation set: optimize algorithm parameters
  - test set: evaluate the final model
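The three-set procedure can be sketched as follows; the 60/20/20 fractions are an illustrative choice, not something the slide prescribes:

```python
import random

def three_way_split(data, fracs=(0.6, 0.2), seed=0):
    """Split data into training, validation and test sets.

    fracs gives the training and validation fractions;
    whatever remains becomes the test set.
    """
    items = list(data)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * fracs[0])
    n_val = round(len(items) * fracs[1])
    return (items[:n_train],                  # training set: train models
            items[n_train:n_train + n_val],   # validation set: tune parameters
            items[n_train + n_val:])          # test set: final evaluation only

train, val, test = three_way_split(range(100))
```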
Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier.
- Trade-off: model performance vs. evaluation accuracy
  - More training data: better model (but returns diminish)
  - More test data: more accurate error estimate
Train and Test
- Step 5: Build the final model on ALL the data (more data, better model).
k-fold Cross-validation
- Split the data (stratified) into k folds
- Use (k-1) folds for training, 1 for testing
- Repeat k times
- Average the results
[Figure: original data split into k folds; each fold serves once as the test set]
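A minimal sketch of the k-fold index bookkeeping. Note this simple version assigns instances to folds round-robin and is NOT stratified; a stratified variant would additionally balance class proportions per fold:

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation.

    Each of the n indices appears in a test set exactly once.
    """
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]                               # 1 fold for testing
        train = [idx for j, fold in enumerate(folds)  # (k-1) folds for training
                 if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))  # 5 folds, 2 test instances each
```

In use, one would train and evaluate once per split and average the k scores.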
Cross-validation
- Standard method: stratified ten-fold cross-validation
- Why 10? Enough to reduce sampling bias; experimentally determined
Leave-One-Out Cross-validation
- A particular form of cross-validation: #folds = #instances
- With n instances, build the classifier n times
- Makes best use of the data, no sampling bias
- Computationally expensive
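Leave-one-out is simply k-fold with k = n; as a sketch:

```python
def leave_one_out(n):
    """Yield (train_indices, test_indices): #folds = #instances.

    Each split holds out exactly one instance for testing, so the
    classifier is built n times.
    """
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

splits = list(leave_one_out(4))  # 4 instances -> 4 classifiers
```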
ROC Analysis
- Stands for "Receiver Operating Characteristic"
- From signal processing: trade-off between hit rate and false alarm rate over a noisy channel
- Compute FPR and TPR and plot them in ROC space
- Every classifier is a point in ROC space
- For probabilistic algorithms:
  - Collect many points by varying the prediction threshold
  - Or make the algorithm cost-sensitive and vary the costs (see below)
ROC curves
- Vary the prediction threshold t: predict + if P(+) > t
- P(TP): % true positives (sensitivity)
- P(FP): % false positives (1 - specificity)
- Area Under Curve (AUC) = 0.75 in the example
ROC curves
- Alternative method (easier, but less intuitive): rank the predicted probabilities
- Start the curve in (0,0) and move down the ranked list
- If the instance is positive, move up; if negative, move right
- Jagged curve: one set of test data
- Smooth curve: use cross-validation
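The ranking method translates directly into code. The names `roc_points` and `auc_trapezoid` are illustrative, and labels are assumed to be 1 (positive) or 0 (negative):

```python
def roc_points(scored):
    """Build ROC curve points from (probability, label) pairs.

    Rank by predicted probability (descending), start at (0, 0),
    then move up for each positive and right for each negative.
    """
    ranked = sorted(scored, key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1   # positive: move up
        else:
            fp += 1   # negative: move right
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A classifier that ranks every positive above every negative has AUC 1.
perfect = [(0.9, 1), (0.8, 1), (0.3, 0), (0.1, 0)]
```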
ROC curves: method selection
- Overall: use the method with the largest Area Under the ROC Curve (AUROC)
- If you aim to cover just 40% of the true positives in a sample: use method A
- For a large sample: use method B
- In between: choose between A and B with appropriate probabilities
ROC Space and Costs
[Figure: ROC space under equal costs vs. skewed costs]
- P(TP): % true positives (sensitivity)
- P(FP): % false positives (1 - specificity)
Different Costs
- In practice, false positive and false negative errors incur different costs
- Examples:
  - Medical diagnostic tests: does X have leukemia?
  - Loan decisions: approve mortgage for X?
  - Promotional mailing: will X buy the product?
- Add a cost matrix to the evaluation that weighs TP, FP, FN, TN:

              pred +     pred -
  actual +    cTP = 0    cFN = 1
  actual -    cFP = 1    cTN = 0
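Applying the cost matrix to confusion-matrix counts is a dot product; `total_cost` is an illustrative helper whose defaults reproduce the slide's 0/1 matrix:

```python
def total_cost(tp, fn, fp, tn, c_tp=0, c_fn=1, c_fp=1, c_tn=0):
    """Weigh confusion-matrix counts by a cost matrix.

    With the slide's default costs (errors cost 1, correct
    predictions cost 0), the total cost equals the error count.
    """
    return tp * c_tp + fn * c_fn + fp * c_fp + tn * c_tn

# With 0/1 costs the total cost is just the number of errors ...
uniform = total_cost(tp=40, fn=10, fp=5, tn=45)          # 10 + 5 = 15
# ... but a missed diagnosis (false negative) may be far more expensive:
skewed = total_cost(tp=40, fn=10, fp=5, tn=45, c_fn=10)  # 100 + 5 = 105
```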
Comparing data mining schemes
- Which of two learning algorithms performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold CV estimates
- Problem: variance in the estimate
- Variance can be reduced using repeated CV
- However, we still don't know whether the results are reliable
Significance tests
- Significance tests tell us how confident we can be that there really is a difference
- Null hypothesis: there is no "real" difference
- Alternative hypothesis: there is a difference
- A significance test measures how much evidence there is in favor of rejecting the null hypothesis
- E.g. 10 cross-validation scores: is B better than A?
[Figure: distributions of CV scores for algorithms A and B, with their means]
Paired t-test
- Student's t-test tells whether the means of two samples (e.g. 10 cross-validation scores) are significantly different
- Use a paired t-test when the individual samples are paired, i.e. they use the same randomization: the same CV folds are used for both algorithms
- William Gosset (born 1876 in Canterbury; died 1937 in Beaconsfield, England) worked as a chemist in the Guinness brewery in Dublin, invented the t-test to handle small samples for quality control in brewing, and wrote under the name "Student"
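The paired t statistic over matched per-fold scores (same CV folds for both algorithms) can be computed directly; the two score lists below are made-up illustrative numbers:

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic: t = mean(d) / sqrt(var(d) / n),

    where d holds the per-fold differences and var is the sample
    variance (n - 1 degrees of freedom).
    """
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)
    return mean / math.sqrt(var / n)

# Ten hypothetical CV accuracies per algorithm, from the same folds:
a = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.82, 0.81, 0.80]
b = [0.75, 0.77, 0.74, 0.76, 0.73, 0.78, 0.76, 0.77, 0.75, 0.76]
t = paired_t(a, b)  # large positive t: A consistently beats B
```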
Performing the test
- Fix a significance level α
- A significant difference at the α% level implies a (100-α)% chance that there really is a difference
- Scientific work: 5% or smaller (>95% certainty)
- Divide α by two (two-tailed test)
- Look up the z-value corresponding to α/2
- If t <= -z or t >= z: the difference is significant, and the null hypothesis can be rejected

  α       z
  0.1%    4.30
  0.5%    3.25
  1%      2.82
  5%      1.83
  10%     1.38
  20%     0.88
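The lookup-and-compare step can be sketched with the table above. Note an assumption here: these tabulated values match Student's t distribution with 9 degrees of freedom (10 CV scores), and the function name `significant` is illustrative:

```python
# One-tailed probability -> critical value, as tabulated on the slide
# (assumed to be Student's t quantiles for 9 degrees of freedom).
CRITICAL = {0.001: 4.30, 0.005: 3.25, 0.01: 2.82,
            0.05: 1.83, 0.10: 1.38, 0.20: 0.88}

def significant(t, alpha_two_tailed=0.10):
    """Reject the null hypothesis if |t| exceeds the critical value.

    For a two-tailed test at level alpha, look up alpha / 2.
    """
    z = CRITICAL[alpha_two_tailed / 2]
    return abs(t) >= z
```

For example, at the 10% two-tailed level one compares |t| against 1.83 (the 5% one-tailed entry).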