Machine Learning Group University College Dublin Evaluation in Machine Learning Pádraig Cunningham
2 Outline: Student’s t-test; Test for paired data; Cross Validation; McNemar’s Test; ROC Analysis; Other Statistical Tests for Evaluation
3 William Sealy Gosset The t-statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews. "Student" was his pen name. Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset published the t test in Biometrika in 1908, but was forced to use a pen name by his employer who regarded the fact that they were using statistics as a trade secret. In fact, Gosset's identity was unknown not only to fellow statisticians but to his employer - the company insisted on the pseudonym so that it could turn a blind eye to the breach of its rules. Wikipedia
4 Student’s t-Test Scores by two rugby teams: Is B better than A?
5 What does the t-statistic mean? t = -0.485, p = 31.7%. For a given t-statistic you can look up the corresponding probability, i.e. if there were no real difference, a difference at least this large would arise by chance 31.7% of the time (according to this test).
6 Student’s t-Test More data and/or clearer difference will give statistical significance
7 Student’s t-Test (paired): Scores paired, i.e. against the same team. With paired data, statistical significance can be determined with fewer observations. We can say with 95% confidence that B are better than A.
8 Student’s t-test: Formulae. Two samples, A and B: the statistic uses the average and the variance of each sample (e.g. the average in A and the variance in A). Test for paired data: one sample (D is the difference within each pair).
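The formula images did not survive in this transcript, so the following is only a minimal sketch of the two statistics named on the slide. It assumes the unequal-variance form of the two-sample statistic (the slide may have used a pooled variance instead), and the function names and scores are hypothetical, not the rugby data from the slides.

```python
import math
from statistics import mean, variance  # variance() is the sample variance (n - 1)


def unpaired_t(a, b):
    """Two-sample t statistic, assuming the unequal-variance form."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def paired_t(a, b):
    """Paired t statistic computed on the per-pair differences D = a - b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / math.sqrt(variance(d) / len(d))


# Hypothetical scores: B beats A by a small, consistent margin against each opponent.
team_a = [10, 20, 30, 40]
team_b = [12, 23, 33, 44]

t_unpaired = unpaired_t(team_b, team_a)
t_paired = paired_t(team_b, team_a)
print(f"unpaired t = {t_unpaired:.3f}, paired t = {t_paired:.3f}")
```

On this toy data the paired statistic is far larger than the unpaired one, because pairing removes the opponent-to-opponent variation and leaves only the consistent difference.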
9 Paired t-Test example: The t-test can be used for comparing errors in regression systems. It can also be used for comparing classifiers if multiple test sets are available, and with cross validation (more later). = 5.2
10 Evaluation in Machine Learning: Supervised Learning. Typical question: which is better, Classifier A or Classifier B? Evaluate generalisation accuracy: hold back some training data to use for testing, and use performance on the test data as a proxy for performance on unseen data (i.e. generalisation). [Figure: the training data divided into Train and Test portions]
11 Problems with ‘Hold-out’ Validation: Imagine 200 samples are available for training. A 50:50 split underestimates generalisation accuracy; an 80:20 split gives an estimate based on a small sample (40). Different hold-out sets give different results. [Figure: accuracy against number of training samples, marked at 100, 160 and 200]
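As a rough illustration of the 80:20 case above, here is a small hold-out split sketch. The function name `holdout_split`, its parameters and the integer "samples" are assumptions made for the example; changing the seed gives a different hold-out set, which is the source of the variability the slide warns about.

```python
import random


def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold back a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# 200 hypothetical samples, represented here by their indices.
data = list(range(200))
train, test = holdout_split(data, test_fraction=0.2)
print(len(train), len(test))
```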
12 k-Fold Cross Validation: having your cake and eating it too… Divide the data into k folds. For each fold in turn, use that fold for testing and use the remainder of the data for training.
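The fold rotation described above can be sketched as follows. This is a minimal version that assumes the data has already been shuffled; `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment to folds
    for i, test_idx in enumerate(folds):
        # Training set: every fold except the one currently held out.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(200, 10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

With 200 samples and 10 folds, each model is trained on 180 samples and tested on 20, and all 200 samples contribute to the accuracy estimate.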
13 Comparing Two Classifiers (Salzberg, 1997): tuning is explicit
1. Divide dataset into k folds (say 10)
2. For each of the k folds
   1. Create training and test sets T & S
   2. Divide T into sets T1 and T2
   3. For each of the classifiers
      1. Use T2 to tune parameters on a model trained with T1
      2. Use these ‘good’ parameters to train a model with T
      3. Measure Accuracy on S
   4. Record 0-1 loss results for each classifier
3. Assess significance of results (e.g. McNemar’s test).
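The procedure above might look roughly like this in code. Everything concrete here is an assumption made for illustration: the toy 1-D dataset, the nearest-neighbour classifier, the two "classifiers" (which differ only in their candidate values of k) and all function names. The point of the sketch is that the test set S is never touched during tuning.

```python
import random


def knn_predict(train, x, k):
    """Majority label among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def zero_one_losses(train, test, k):
    """One 0/1 loss per test example (1 = misclassified)."""
    return [int(knn_predict(train, x, k) != y) for x, y in test]


def tune_k(t1, t2, candidates):
    """Pick the k with the fewest errors on T2 for a model built from T1."""
    return min(candidates, key=lambda k: sum(zero_one_losses(t1, t2, k)))


def compare(data, families, n_folds=10):
    """families maps a classifier name to its candidate parameter values."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    losses = {name: [] for name in families}
    for i, s in enumerate(folds):  # S: test set for this fold
        t = [p for j, f in enumerate(folds) if j != i for p in f]  # T: the rest
        half = len(t) // 2
        t1, t2 = t[:half], t[half:]  # tuning split of T
        for name, candidates in families.items():
            k = tune_k(t1, t2, candidates)            # tune on T2, model from T1
            losses[name] += zero_one_losses(t, s, k)  # retrain on T, test on S
    return losses


# Hypothetical data: label is 1 when x > 0.5, with 10% label noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 1)
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

losses = compare(data, {"small-k": [1, 3], "large-k": [5, 9, 15]})
for name, values in losses.items():
    print(name, "errors:", sum(values))
```

The two loss lists are aligned example by example, which is what a paired test such as McNemar's needs in step 3.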
14 McNemar’s test: Which is better, C1 or C2? Which is better, C2 or C3? [Figure: per-example results for C1, C2 and C3] McNemar’s test captures this notion: $n_{01}$ is the number misclassified by the 1st but not the 2nd classifier; $n_{10}$ is the number misclassified by the 2nd but not the 1st classifier. For the test to be applicable, $(n_{01} + n_{10}) > 10$. A score > 3.84 is required for statistical significance at 95%. MNS score for C2 vs C1 = 1/2. MNS score for C2 vs C1 = 1/6.
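The formula itself is missing from the transcript. The sketch below assumes the continuity-corrected form commonly used for comparing classifiers, $(|n_{01} - n_{10}| - 1)^2 / (n_{01} + n_{10})$; the function name and the loss vectors are hypothetical.

```python
def mcnemar_statistic(losses_1, losses_2):
    """Return (n01, n10, statistic) from two aligned lists of 0/1 losses.

    Assumes the continuity-corrected form of the statistic.
    """
    pairs = list(zip(losses_1, losses_2))
    n01 = sum(1 for a, b in pairs if a == 1 and b == 0)  # 1st wrong, 2nd right
    n10 = sum(1 for a, b in pairs if a == 0 and b == 1)  # 2nd wrong, 1st right
    if n01 + n10 == 0:
        raise ValueError("the classifiers never disagree; the test is undefined")
    statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return n01, n10, statistic


# Hypothetical results on 100 test examples.
losses_1 = [1] * 15 + [0] * 3 + [1] * 5 + [0] * 77
losses_2 = [0] * 15 + [1] * 3 + [1] * 5 + [0] * 77

n01, n10, statistic = mcnemar_statistic(losses_1, losses_2)
applicable = (n01 + n10) > 10      # applicability condition from the slide
significant = statistic > 3.84     # 95% threshold from the slide
print(n01, n10, round(statistic, 2), applicable, significant)
```

Examples that both classifiers get right, or both get wrong, do not enter the statistic at all; only the disagreements count.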
16 Other Tests
Dietterich’s 5x2cv paired t-test (Dietterich, 1998): 5 repetitions of 2-fold cross validation; with 2-fold there is no overlap in training data. This gives 10 pairs of error estimates from which a t statistic can be derived. (+) Flexible on choice of loss function. (-) Training sets comprise 50% of data.
Demšar’s comparisons over multiple datasets (Demšar, 2006): comparisons between classifiers are done on multiple datasets, giving a table of results. Averaging across datasets is dodgy. Demšar’s tests: Wilcoxon Signed Ranks Test to compare a pair of classifiers; Friedman’s Test to combine these scores for multiple classifiers; counts of wins, losses and ties. This methodology could become the standard.
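As a sketch of how the 5x2cv statistic is derived from the ten differences, the code below follows the definition in Dietterich (1998): the numerator is the difference from the first fold of the first replication, and the denominator pools a variance estimate from each of the five replications. The function name and the error-rate differences are hypothetical.

```python
import math


def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic.

    diffs holds 5 pairs (p1, p2): the difference in error rate between the
    two classifiers on fold 1 and fold 2 of each replication.
    """
    if len(diffs) != 5:
        raise ValueError("expected one (p1, p2) pair per replication, 5 in total")
    variances = []
    for p1, p2 in diffs:
        p_bar = (p1 + p2) / 2
        variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)
    return diffs[0][0] / math.sqrt(sum(variances) / 5)


# Hypothetical error-rate differences (classifier A minus classifier B).
diffs = [(0.04, 0.02), (0.03, 0.05), (0.02, 0.04), (0.05, 0.03), (0.04, 0.02)]
t_5x2 = five_by_two_cv_t(diffs)
print(round(t_5x2, 3))
```

The statistic is compared against a t distribution with 5 degrees of freedom, for which the two-sided 95% critical value is about 2.571.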
17 Loss Functions: how you keep the score… Regression: Quadratic Loss Function; minimize Mean Squared Error; big errors matter more. Classification: Misclassification Rate, aka the 0-1 Loss Function. Many alternatives are possible and appropriate in different circumstances, e.g. F measure.
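A minimal sketch of the three scores mentioned on this slide. The function names and the toy values are assumptions; the F measure shown is the balanced (F1) version.

```python
def mean_squared_error(y_true, y_pred):
    """Quadratic loss averaged over the examples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def misclassification_rate(y_true, y_pred):
    """Average 0-1 loss: the fraction of examples classified wrongly."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Same total absolute error (3), concentrated in one prediction vs spread out.
mse_one_big_error = mean_squared_error([1, 2, 3], [1, 2, 6])
mse_three_small_errors = mean_squared_error([1, 2, 3], [2, 3, 4])

error_rate = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
f1 = f_measure([1, 0, 1, 1], [1, 1, 1, 0])
print(mse_one_big_error, mse_three_small_errors, error_rate, round(f1, 3))
```

The two regression examples have the same total absolute error, but the single large error is penalised three times as heavily under the quadratic loss.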
18 Loss Functions: ROC Curves. Ranking classifiers: many (binary) classifiers return a numeric score between 0 and 1. Classifier bias can be controlled by adjusting a threshold. For a given test set, the ROC curve shows classifier performance over a range of thresholds/biases.
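To make the threshold sweep concrete, here is a small sketch that computes the ROC points for a test set and the area under the resulting curve. The scores, labels and function names are hypothetical, and it assumes both classes are present in the test set.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) for each threshold.

    An example is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels for six test examples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(round(auc, 3))
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1), trading more false positives for more true positives.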
19 References
Salzberg, S. (1997) On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, 1, 317–327.
Dietterich, T.G. (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation, 10, 1895–1924.
Demšar, J. (2006) Statistical Comparisons of Classifiers over Multiple Data Sets, Journal of Machine Learning Research, 7, 1–30.