© 2013 GE All Rights Reserved
Performance Evaluation of Machine Learning Algorithms
Mohak Shah (GE Software) and Nathalie Japkowicz (University of Ottawa)
ECML 2013, Prague

Performance Evaluation of Machine Learning Algorithms © 2013 Mohak Shah All Rights Reserved

"Evaluation is the key to making real progress in data mining." [Witten & Frank, 2005]

Yet… Evaluation gives us a lot of, sometimes contradictory, information. How do we make sense of it all? Evaluation standards differ from person to person and discipline to discipline. How do we decide which standards are right for us? Evaluation gives us support for one theory over another, but rarely, if ever, certainty. So where does that leave us?

Example I: Which Classifier is better? There are almost as many answers as there are performance measures! (e.g., UCI Breast Cancer)

[Table: Acc, RMSE, TPR, FPR, Prec, Rec, F, AUC and Info S for eight algorithms (NB, C, NN, Ripp, SVM, Bagg, Boost, RanF); the values did not survive extraction. Some cells are marked as acceptable contradictions, others as questionable contradictions.]

Example I: Which Classifier is better? Ranking the results.

[Table: the same eight algorithms and nine measures, with per-measure ranks in place of raw values; the ranks did not survive extraction. Again annotated with acceptable vs. questionable contradictions.]

Example II: What's wrong with our results? A new algorithm was designed based on data that had been provided by the National Institutes of Health (NIH) [Wang and Japkowicz, 2008, 2010]. According to the standard evaluation practices in machine learning, the results were found to be significantly better than the state of the art. The conference version of the paper received a best paper award and a bit of attention in the ML community (35+ citations). NIH, however, would not consider the algorithm because it was probably not significantly better (if at all) than others, as they felt the evaluation wasn't performed according to their standards.

Example III: What do our results mean? In an experiment comparing the accuracy of eight classifiers on ten domains, we concluded that "One-Way Repeated-Measure ANOVA rejects the hypothesis that the eight classifiers perform similarly on the ten domains considered at the 95% significance level" [Japkowicz & Shah, 2011]. Can we be sure of our results?
– What if we are in the 5% not vouched for by the statistical test?
– Was the ANOVA test truly applicable to our specific data? Is there any way to verify that?
– Can we have any 100% guarantee on our results?

Purpose of the tutorial: To give an appreciation for the complexity and uncertainty of evaluation. To discuss the different aspects of evaluation, present an overview of the issues that come up in their context, and suggest what the possible remedies may be. To present some of the newer, lesser-known results in classifier evaluation. Finally, to point to some available, easy-to-use resources that can help improve the robustness of evaluation in the field.

The main steps of evaluation: the Classifier Evaluation Framework
– Choice of learning algorithm(s)
– Datasets selection
– Error-estimation / sampling method
– Performance measure of interest
– Statistical test
– Perform evaluation
[Diagram: arrows indicate that knowledge of one step is necessary for another, and that feedback from later steps should be used to adjust earlier ones.]

What these steps depend on
These steps depend on the purpose of the evaluation:
– Comparison of a new algorithm to other (generic or application-specific) classifiers on a specific domain (e.g., when proposing a novel learning algorithm)
– Comparison of a new generic algorithm to other generic ones on a set of benchmark domains (e.g., to demonstrate general effectiveness of the new approach against other approaches)
– Characterization of generic classifiers on benchmark domains (e.g., to study the algorithms' behavior on general domains for subsequent use)
– Comparison of multiple classifiers on a specific domain (e.g., to find the best algorithm for a given application task)

Outline
– Performance measures
– Error estimation/resampling
– Statistical significance testing
– Data set selection and evaluation benchmark design
– Available resources

Performance Measures

Performance Measures: Outline
– Ontology
– Confusion matrix based measures
– Graphical measures: ROC analysis and AUC, cost curves; recent developments on the ROC front (smooth ROC curves, the H measure, ROC cost curves); other curves (PR curves, lift, etc.)
– Probabilistic measures
– Others: qualitative measures, combination frameworks, visualization, agreement statistics

Overview of Performance Measures
– Confusion matrix based (deterministic classifiers):
  – Multi-class focus: no chance correction (accuracy, error rate); chance correction (Cohen's kappa, Fleiss' kappa)
  – Single-class focus: TP/FP rate, precision/recall, sensitivity/specificity, F-measure, geometric mean, Dice
– Scoring classifiers: graphical measures (ROC curves, PR curves, DET curves, lift charts, cost curves) and summary statistics (AUC, H measure, area under the ROC cost curve)
– Continuous and probabilistic classifiers (reliability metrics): distance/error measures (KL divergence, K&B IR, BIR, RMSE)
– Alternate information: information-theoretic measures; interestingness, comprehensibility, multi-criteria measures
– Additional information that may be incorporated: classifier uncertainty, cost ratio, skew

Confusion Matrix-Based Performance Measures

Confusion matrix:
                    True class
Hypothesized class  Pos       Neg
  Yes               TP        FP
  No                FN        TN
                    P=TP+FN   N=FP+TN

Multi-Class Focus:
– Accuracy = (TP+TN)/(P+N)
Single-Class Focus:
– Precision = TP/(TP+FP)
– Recall = TP/P
– Fallout = FP/N
– Sensitivity = TP/(TP+FN)
– Specificity = TN/(FP+TN)
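As a concrete illustration (not part of the original slides), the measures above follow directly from the four cell counts; the function name and layout here are illustrative only:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Measures defined on a 2x2 confusion matrix (as on the slide above)."""
    p, n = tp + fn, fp + tn            # actual positives / actual negatives
    return {
        "accuracy":    (tp + tn) / (p + n),
        "precision":   tp / (tp + fp),
        "recall":      tp / p,         # = sensitivity = TPR
        "fallout":     fp / n,         # = FPR
        "specificity": tn / (fp + tn), # = TNR
    }

# Example with made-up counts:
m = confusion_metrics(tp=450, fp=1, fn=50, tn=4)
print(round(m["accuracy"], 3))  # 0.899
```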

Aliases and other measures
– Accuracy = 1 (or 100%) − Error rate
– Recall = TPR = Hit rate = Sensitivity
– Fallout = FPR = False alarm rate
– Precision = Positive Predictive Value (PPV)
– Negative Predictive Value (NPV) = TN/(TN+FN)
– Likelihood ratios: LR+ = Sensitivity/(1 − Specificity), the higher the better; LR− = (1 − Sensitivity)/Specificity, the lower the better

Pairs of Measures and Compounded Measures
– Precision / Recall
– Sensitivity / Specificity
– Likelihood ratios (LR+ and LR−)
– Positive / Negative predictive values
– F-measure: F_α = [(1 + α) × (Precision × Recall)] / [(α × Precision) + Recall], typically with α = 1, 2, or 0.5
– G-mean: two-class version G-mean = sqrt(TPR × TNR); single-class version sqrt(TPR × Precision)
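A minimal sketch (illustrative, not from the tutorial) of the F_α and G-mean formulas above:

```python
from math import sqrt

def f_measure(precision, recall, alpha=1.0):
    # F_alpha = (1 + alpha) * Precision * Recall / (alpha * Precision + Recall)
    return (1 + alpha) * precision * recall / (alpha * precision + recall)

def g_mean_two_class(tpr, tnr):
    return sqrt(tpr * tnr)          # two-class version

def g_mean_single_class(tpr, precision):
    return sqrt(tpr * precision)    # single-class version

# alpha = 1 gives the harmonic mean of precision and recall (the usual F1):
print(f_measure(0.6, 0.4))  # 0.48
```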

Skew and Cost considerations
Skew-sensitive assessments, e.g., class ratios:
– Ratio+ = (TP + FN)/(FP + TN)
– Ratio− = (FP + TN)/(TP + FN)
– TPR can be weighted by Ratio+ and TNR can be weighted by Ratio−
Asymmetric misclassification costs:
– If the costs are known, weight each non-diagonal entry of the confusion matrix by the appropriate misclassification cost, then compute the weighted version of any previously discussed evaluation measure.

Some issues with performance measures
Both classifiers obtain 60% accuracy, yet they exhibit very different behaviours:
– On the left: weak positive recognition rate / strong negative recognition rate
– On the right: strong positive recognition rate / weak negative recognition rate
[Two confusion matrices, each with P=500 and N=500; the cell counts did not survive extraction.]

Some issues with performance measures (cont'd)
The classifier on the left obtains 99.01% accuracy while the classifier on the right obtains 89.9%. Yet the classifier on the right is much more sophisticated than the classifier on the left, which just labels everything as positive and misses all the negative examples.

Left (P=500, N=5):         Right (P=500, N=5):
  Yes: TP=500  FP=5          Yes: TP=450  FP=1
  No:  FN=0    TN=0          No:  FN=50   TN=4

Some issues with performance measures (cont'd)
Both classifiers obtain the same precision and recall values of 66.7% and 40% (note: the data sets are different). They exhibit very different behaviours:
– Same positive recognition rate
– Extremely different negative recognition rate: strong on the left / nil on the right
Note: accuracy has no problem catching this!

Left (P=500, N=500):       Right (P=500, N=100):
  Yes: TP=200  FP=100        Yes: TP=200  FP=100
  No:  FN=300  TN=400        No:  FN=300  TN=0

Observations
This and similar analyses reveal that each performance measure conveys some information and hides other information. There is, therefore, an information trade-off carried through the different metrics. A practitioner has to choose, quite carefully, the quantity s/he is interested in monitoring, while keeping in mind that other values matter as well. Classifiers may rank differently depending on the metrics along which they are compared.

Graphical Measures: ROC Analysis, AUC, Cost Curves
Many researchers have now adopted the AUC (the area under the ROC curve) to evaluate their results. Principal advantages:
– more robust than accuracy in class-imbalanced situations
– takes the class distribution into consideration and therefore gives more weight to correct classification of the minority class, thus outputting fairer results
We start by understanding ROC curves.

ROC Curves
Performance measures for scoring classifiers:
– most classifiers are in fact scoring classifiers (not necessarily ranking classifiers)
– scores needn't lie in pre-defined intervals, or even be likelihoods or probabilities
ROC analysis has its origins in signal detection theory, where it was used to set an operating point for a desired signal detection rate, with the signal assumed to be corrupted by (normally distributed) noise. ROC maps FPR on the horizontal axis and TPR on the vertical axis; recall that FPR = 1 − Specificity and TPR = Sensitivity.

ROC Space

ROC Plot for discrete classifiers

ROC plot for two hypothetical scoring classifiers

Skew considerations
ROC graphs are insensitive to class imbalances, since they don't account for class distributions. They are based on the 2×2 confusion matrix (i.e., 3 degrees of freedom):
– the first 2 dimensions are TPR and FPR
– the third dimension is performance-measure dependent, typically class imbalance (accuracy/risk are sensitive to class ratios)
Classifier performance then corresponds to points in 3D space. Projections on a single TPR × FPR slice are performances w.r.t. the same class distribution; hence the implicit assumption is that the 2D plot is independent of the empirical class distribution.

Skew considerations
Class imbalances, asymmetric misclassification costs, credits for correct classification, etc. can be incorporated via a single quantity: the skew ratio r_s. Performance measures can then be re-stated to take the skew ratio into account.

Skew considerations
Accuracy: for symmetric costs, the skew represents the class ratio, and accuracy can be restated as
  Acc = (TPR + r_s (1 − FPR)) / (1 + r_s)
Precision:
  Prec = TPR / (TPR + r_s · FPR)

Isometrics
Isometrics, or isoperformance lines (or curves), are, for any performance metric, the lines or curves denoting the same metric value for a given r_s. In the case of 3D ROC graphs, isometrics form surfaces.

Isoaccuracy lines

Isoprecision lines

Isometrics
Similarly, isoperformance lines can be drawn in ROC space connecting classifiers with the same expected cost: f1 and f2 have the same performance if the slope of the line connecting them equals the skew ratio, i.e., (TPR_2 − TPR_1)/(FPR_2 − FPR_1) = r_s. Expected cost can be interpreted as a special case of skew.

ROC curve generation
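The generation procedure can be sketched as follows: sort the test examples by decreasing score and sweep a threshold over the distinct scores, emitting an (FPR, TPR) point at each. This is a minimal, illustrative version (the function name is not from the tutorial):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points by sweeping a threshold over the scores.

    labels are 0/1; a point is emitted each time the score changes, so tied
    scores produce a single point (the standard treatment).
    """
    p = sum(labels)
    n = len(labels) - p
    pts, tp, fp, prev = [], 0, 0, None
    for s, y in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if s != prev:                 # new threshold: record the current point
            pts.append((fp / n, tp / p))
            prev = s
        if y == 1:
            tp += 1
        else:
            fp += 1
    pts.append((1.0, 1.0))            # threshold below the lowest score
    return pts
```

The first emitted point is always (0, 0) (threshold above the highest score) and the last is (1, 1).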

Curves from Discrete points and Sub-optimalities

ROC Convex Hull
The set of points on the ROC curve that are not suboptimal forms the ROC convex hull.

ROCCH and model selection

Selectively optimal classifiers


AUC
ROC analysis allows a user to visualize the performance of classifiers over their operating ranges. However, it does not allow the quantification of this analysis, which would make comparisons between classifiers easier. The Area Under the ROC Curve (AUC) allows such a quantification: it represents the performance of the classifier averaged over all possible cost ratios. The AUC can also be interpreted as the probability that the classifier assigns a higher rank to a randomly chosen positive example than to a randomly chosen negative example. Using this second interpretation, the AUC can be estimated as
  AUC = ( Σ_{x_i ∈ P} rank(x_i) − |P|(|P| + 1)/2 ) / (|P| · |N|)
where P and N are the sets of positive and negative examples, respectively, in the test set, and rank(x_i) is the rank of the i-th example given by the classifier under evaluation.
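The rank-based estimate can be implemented directly (an illustrative sketch; tied scores receive averaged ranks, as in the Wilcoxon statistic):

```python
def auc_by_ranks(scores, labels):
    """AUC = (sum of positive ranks - |P|(|P|+1)/2) / (|P| * |N|)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                      # extend the block of tied scores
        for k in range(i, j + 1):       # average rank for ties (1-based)
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_by_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```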

Recent Developments I: Smooth ROC Curves (smROC) [Klement et al., ECML 2011]
ROC analysis visualizes the ranking performance of classifiers but ignores any kind of scoring information. Scores are important as they tell us whether the difference between two rankings is large or small, but they are difficult to integrate because, unlike probabilities, each scoring system is different from one classifier to the next and their integration is not obvious. smROC extends the ROC curve to include normalized scores as smoothing weights added to its line segments. smROC was shown to capture the similarity between similar classification models better than ROC, and to capture differences between classification models that ROC is barely sensitive to, if at all.

Recent Developments II: The H Measure (I)
A criticism of the AUC was given by [Hand, 2009]. The argument goes along the following lines: the misclassification cost distributions (and hence the skew-ratio distributions) used by the AUC are different for different classifiers. Therefore, we may be comparing apples and oranges, as the AUC may give more weight to misclassifying a point by classifier A than it does by classifier B. To address this problem, [Hand, 2009] proposed the H-measure. In essence, the H-measure allows the user to select a cost-weight function that is equal for all the classifiers under comparison, and thus allows for fairer comparisons.

Recent Developments II: The H Measure (II)
[Flach et al., ICML 2011] argued that:
– [Hand, 2009]'s criticism holds when the AUC is interpreted as a classification measure, but not when it is interpreted as a pure ranking measure.
– When the criticism holds, the problem occurs only if the thresholds used to construct the ROC curve are assumed to be optimal.
– When constructing the ROC curve using all the points in the data set as thresholds, rather than only optimal ones, the AUC measure can be shown to be coherent and the H-measure unnecessary, at least from a theoretical point of view.
– The H-measure is a linear transformation of the value that the area under the cost curve (discussed next) would produce.

Cost Curves
Cost curves tell us for what class probabilities one classifier is preferable over the other. ROC curves only tell us that sometimes one classifier is preferable over the other.
[Figure: an ROC plot (FPR vs. TPR) of two classifiers f1 and f2, next to the corresponding cost curve plotting error rate against P(+), the probability of an example being from the positive class; the cost-curve plot shows the positive and negative trivial classifiers, the classifiers that are always right and always wrong, and classifiers A and B at different thresholds (e.g., A1, B2).]

Recent Developments III: ROC Cost Curves
Hernandez-Orallo et al. (arXiv 2011, JMLR 2012) introduce ROC cost curves, exploring the connection between ROC space and cost space via expected loss over a range of operating conditions. They show how ROC curves can be transferred to cost space by threshold choice methods such that the proportion of positive predictions equals the operating condition (via cost or skew). Expected loss, measured by the area under the ROC cost curve, maps linearly to the AUC. They also explore the relationship between the AUC and the Brier score (MSE).

ROC cost curves (going beyond the optimal cost curve)
[Figure from Hernandez-Orallo et al. (Fig. 2)]


Other Curves
Other research communities have produced and used graphical evaluation techniques similar to ROC curves:
– Lift charts plot the number of true positives versus the overall number of examples in the data set considered for that true positive count.
– PR curves plot precision as a function of recall.
– Detection Error Trade-off (DET) curves are like ROC curves, but plot the FNR rather than the TPR on the vertical axis. They are also typically log-scaled.
– Relative superiority graphs are more akin to cost curves, as they consider costs; they map the ratios of costs into the [0,1] interval.

Probabilistic Measures I: RMSE
The root-mean-squared error (RMSE) is usually used for regression, but can also be used with probabilistic classifiers. The formula for the RMSE is
  RMSE(f) = sqrt( (1/m) Σ_{i=1..m} (f(x_i) − y_i)² )
where m is the number of test examples, f(x_i) is the classifier's probabilistic output on x_i, and y_i is the actual label. For example, for five test examples whose squared errors (f(x_i) − y_i)² sum to 0.975: RMSE(f) = sqrt(0.975/5) ≈ 0.44.
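The formula translates directly to code (a minimal sketch, not from the slides):

```python
from math import sqrt

def rmse(probs, labels):
    # RMSE(f) = sqrt((1/m) * sum_i (f(x_i) - y_i)^2)
    m = len(labels)
    return sqrt(sum((p - y) ** 2 for p, y in zip(probs, labels)) / m)

# Five squared errors summing to 0.975 reproduce the sqrt(0.975/5) example:
print(round(sqrt(0.975 / 5), 2))  # 0.44
```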

Probabilistic Measures II: Information Score
Kononenko and Bratko's information score assumes a prior P(y) on the labels, which can be estimated by the class distribution of the training data. The output (posterior probability) of a probabilistic classifier is P(y|f), where f is the classifier, and I(a) denotes the indicator function. For a positive example, IS(x) = −log P(y=1) + log P(y=1|f), and symmetrically for a negative example; the average information score I_a is the mean of IS over the test examples. For example, with P(y=1) = 3/5 = 0.6 and P(y=0) = 2/5 = 0.4, a positive example scored at 0.95 gives IS(x_1) = −log2(0.6) + log2(0.95) ≈ 0.66; averaging the five examples' scores gives I_a = 0.40.
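A sketch of the per-example score with base-2 logs (the function name and argument layout are illustrative):

```python
from math import log2

def info_score(prior_pos, posterior_pos, y):
    """Kononenko & Bratko information score for one example.

    prior_pos:     P(y=1), e.g. estimated from the training class distribution
    posterior_pos: the classifier's posterior P(y=1 | f) for this example
    y:             the true label (0 or 1)
    """
    if y == 1:
        return -log2(prior_pos) + log2(posterior_pos)
    return -log2(1 - prior_pos) + log2(1 - posterior_pos)

print(round(info_score(0.6, 0.95, 1), 2))  # the slide's example: 0.66
```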

Other Measures I: A Multi-Criteria Measure, the Efficiency Method
A framework for combining various measures, including qualitative metrics such as interestingness. It considers the positive metrics, for which higher values are desirable (e.g., accuracy), and the negative metrics, for which lower values are desirable (e.g., computational time); pm_i+ are the positive metrics and pm_j− are the negative metrics, and the efficiency is a weighted combination of the former traded off against the latter. The weights w_i and w_j have to be determined, and a solution is proposed that uses linear programming.

Other Measures II
Classifier evaluation can be viewed as a problem of analyzing high-dimensional data. The performance measures currently used are but one class of projections that can be applied to these data. [Japkowicz et al., ECML'08/ISAIM'08] introduced visualization-based measures that project classifier performance on domains into a higher-dimensional space; a distance measure in the projected space allows one to qualitatively differentiate classifiers. Note: this is dependent on the distance metric used.

Other Measures III: Agreement Statistics
The measures considered so far assume the existence of ground-truth labels (against which, for instance, accuracy is measured). In the absence of ground truth, typically (multiple) experts label the instances. Hence, one needs to account for chance agreements while evaluating under imprecise labeling. Two approaches:
– either estimate the ground truth; or
– evaluate against the group of labelings

Multiple Annotations

Two Questions of Interest
– Inter-expert agreement: overall agreement of the group
– Classifier agreement against the group

Such measurements are also desired when:
– assessing a classifier in a skewed-class setting
– assessing a classifier against expert labeling
– there are reasons to believe that bias has been introduced in the labeling process (say, due to class imbalance, asymmetric representativeness of classes in the data, …)
Agreement statistics aim to address this by accounting for chance agreements (coincidental concordances).

Coincidental concordances (natural propensities) need to be accounted for.
[Figure: overlapping label sets of six raters, with counts R1: 1526, R2: 2484, R3: 1671, R4: 816, R5: 754, R6: 1891]

General Agreement Statistic

General Agreement Statistic
  Agreement statistic = (observed agreement − chance agreement) / (maximum achievable agreement − chance agreement)
Examples: Cohen's kappa, Fleiss' kappa, Scott's pi, ICC, …

2 raters over 2 classes: Cohen's Kappa
Say we wish to measure classifier accuracy against some true process. Cohen's kappa is defined as
  κ = (P_o − P_e) / (1 − P_e)
where P_o represents the probability of overall agreement over the label assignments between the classifier and the true process, and P_e represents the chance agreement over the labels, defined as the sum, over the classes, of the proportion of examples assigned to a class times the proportion of true labels of that class in the data set.

Cohen's Kappa Measure: Example
For a 3-class (A, B, C) confusion matrix over 400 examples (predicted-class totals 120, 150 and 130; the individual cell counts were in a table lost in extraction):
  Accuracy = P_o = 250/400 = 62.5%
  P_e = 100/400 × 120/400 + …/400 × 150/400 + …/400 × 130/400 = 43.29%
  κ = (0.625 − 0.4329)/(1 − 0.4329) ≈ 0.34
Accuracy is overly optimistic in this example! (This is one of many generalizations of kappa to the multiple-class case.)
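Cohen's kappa as defined above can be computed from any k-class confusion matrix; this is an illustrative sketch with a made-up 2-class matrix as the usage example:

```python
def cohens_kappa(cm):
    """cm[i][j] = number of examples of true class i assigned to class j."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / n                  # observed agreement
    row_tot = [sum(row) for row in cm]                         # true-class totals
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted totals
    p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa([[20, 5], [10, 15]]), 2))  # p_o=0.7, p_e=0.5 -> 0.4
```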

Multiple raters over multiple classes
Various generalizations have been proposed to estimate inter-rater agreement in the general multi-class, multi-rater scenario. Some well-known formulations include Fleiss' kappa (Fleiss, 1971), ICC, Scott's pi, and Vanbelle and Albert (2009)'s measures. These formulations all rely on the marginalization argument (a fixed number of raters is assumed to be sampled from a pool of raters to label each instance), applicable, for instance, in many crowd-sourcing scenarios. For assessing agreement against a group of raters, typically a consensus argument is used: a deterministic label set is formed via majority vote of the raters in the group, and the classifier is evaluated against this label set.

Multiple raters over multiple classes
Shah (ECML 2011) highlighted the issues with both the marginalization and the consensus arguments, and with related approaches such as the Dice coefficient, for machine learning evaluation when the group of raters is fixed. Two new measures were introduced:
– κ_S, to assess inter-rater agreement (generalizing Cohen's kappa), with variance upper-bounded by its marginalization counterpart
– the S measure, to evaluate a classifier against a group of raters

Issues with the marginalization and consensus arguments: a quick review

Back to the agreement statistic: the pair-wise agreement measure. Differences between the formulations arise in obtaining the expectation (the chance-agreement term).

Traditional Approaches: Inter-rater Agreement
The marginalization argument: consider a simple 2-rater, 2-class case over 14 instances, where each rater assigns label 1 half the time (the per-instance label table was lost in extraction).
  Observed agreement: 4/7
  Chance agreement over label 1: Pr(Label=1 | random rater)² = 7/14 × 7/14 = 0.25 (likewise 0.25 for label 0, so 0.5 in total)
  Agreement = (4/7 − 0.5)/(1 − 0.5) = 0.143

Traditional Approaches: Inter-rater Agreement
The marginalization argument: now consider a scenario where the two raters never agree (again, each assigns label 1 half the time).
  Observed agreement: 0
  Chance agreement over label 1: Pr(Label=1 | random rater)² = 7/14 × 7/14 = 0.25 (0.5 in total)
  Agreement = (0 − 0.5)/(1 − 0.5) = −1

Traditional Approaches: Inter-rater Agreement
The marginalization argument: but this holds even when there is no evidence of a chance agreement.
  Observed agreement: 0
  Chance agreement over label 1: Pr(Label=1 | random rater)² = 7/14 × 7/14 = 0.25 (0.5 in total)
  Agreement = (0 − 0.5)/(1 − 0.5) = −1

Marginalization argument not applicable in the fixed-rater scenario
Marginalization ignores rater correlation and rater asymmetry. It results in loose chance-agreement estimates through optimistic estimation, and hence an overly conservative agreement estimate. Shah's kappa corrects for these while retaining the convergence properties.

Shah's kappa: variability upper-bounded by that of Fleiss' kappa

II. Agreement of a classifier against a group
– Extension of the marginalization argument: recently appeared in the statistics literature (Vanbelle and Albert, Stat. Neerl., 2009)
– Consensus based (more traditional): almost universally used in the machine learning / data mining community, e.g., medical image segmentation, tissue classification, recommendation systems, expert-modeling scenarios (e.g., market analyst combination)

Marginalization Approach and Issues in the Fixed-Experts Setting
Observed agreement: the proportion of raters with which the classifier agrees. This ignores qualitative agreement, and may even ignore group dynamics.
[Figure and table: a classifier c compared per instance against raters R1, R2, R3; the label table was lost in extraction.]

Marginalization Approach and Issues in the Fixed-Experts Setting
Observed agreement: the proportion of raters with which the classifier agrees. This ignores qualitative agreement, and may even ignore group dynamics: any random label assignment gives the same observed agreement.
[Table lost in extraction.]

Marginalization Approach and Issues in the Fixed-Experts Setting
Chance agreement: extending the marginalization argument is not informative when the raters are fixed, and it ignores rater-specific correlations. When fixed raters never agree, the chance agreement should be zero.
[Table lost in extraction.]

Consensus Approach and Issues
Approach:
– Obtain a deterministic label for each instance if at least k raters agree
– Treat this label set as ground truth and use the Dice coefficient against the classifier's labels

Problem with consensus: Ties
[Figure: the six raters' overlapping label sets again (R1: 1526, R2: 2484, R3: 1671, R4: 816, R5: 754, R6: 1891)]

Problem with consensus: Ties

Problem with consensus: Ties. Solution -> remove one rater (tie-breaker)

Problem with consensus: Ties: Remove which one?


Two consensus outputs, one classifier: C1 (liberal) vs. C2 (conservative)

Chance correction: neither done nor meaningful over consensus (even Cohen's kappa is not useful over a consensus, since the consensus is not a single classifier). C1: liberal classifier; C2: conservative classifier.

84
Problem with consensus: Consensus upper bounded by threshold-highest
[Figure: rater label counts with thresholds: R1: 1526 (t4), R2: 2484, R3: 1671 (t3), R4: 816 (t5), R5: 754 (t6), R6: 1891 (t2)]

85
Consensus Approach and Issues
Approach:
–Obtain a deterministic label for each instance if at least k >= r/2 raters agree
–Treat this label set as ground truth and use the Dice coefficient against the classifier labels
Issues:
–Threshold sensitive
–Establishing the threshold can be non-trivial
–Tie breaking not clear
–Treats estimates as deterministic
–Ignores minority raters as well as rater correlation
The S measure (Shah, ECML 2011) addresses these issues with marginalization and is independent of the consensus approach

86
Error Estimation/Resampling

87
So we decided on a performance measure. How do we estimate it in an unbiased manner?
What if we had all the data?
–Re-substitution: overly optimistic (best performance achieved with complete over-fit)
Which factors affect such estimation choices?
–Sample size/domains
–Random variations: in training set, testing set, learning algorithms, class noise
–…

88
Hold-out approach
Set aside a separate test set T. Estimate the empirical risk:
R_T(f) = (1/|T|) Σ_{(x,y) in T} I(f(x) != y)
Advantages:
–Independence from the training set
–Generalization behavior can be characterized
–Estimates can be obtained for any classifier

89
Hold-out approach
Confidence intervals can be found too (based on the binomial approximation):
R_T(f) ± z_{α/2} sqrt(R_T(f)(1 − R_T(f)) / |T|)

90
A tighter bound
Based on binomial inversion: take as the interval all risk values p for which the observed number of test errors is not too unlikely under Binomial(|T|, p). More realistic than the asymptotic Gaussian assumption (μ ± 2σ).
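The binomial inversion above can be sketched in plain Python with a bisection search; `binomial_upper_bound` is a hypothetical helper name and the 50-step bisection is an assumption of this sketch, not part of the original slides:

```python
from math import comb

def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def binomial_upper_bound(errors, n, delta=0.05):
    """Smallest risk p such that observing <= `errors` errors in n test
    points has probability at most `delta`: an exact (inversion-based)
    upper confidence bound on the true error rate."""
    lo, hi = errors / n, 1.0
    for _ in range(50):            # bisection on the binomial CDF
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) > delta:
            lo = mid               # mid still plausible: move up
        else:
            hi = mid               # mid already too unlikely: move down
    return hi
```

For example, 13 errors on a 100-point test set yields an upper bound noticeably larger than the point estimate 0.13, which is exactly why the inversion bound is more honest than the symmetric Gaussian interval on small test sets.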

91
Binomial vs. Gaussian assumptions
B_l and B_u show the binomial lower and upper bounds; CI_l and CI_u show the lower and upper confidence intervals based on the Gaussian assumption.
[Table: bounds for SVM, Adaboost, DT, DL, NB and SCM on the Bupa and Credit datasets; numeric values lost in extraction]



94
Hold-out sample size requirements
The Hoeffding bound P(|R_T(f) − R(f)| ≥ ε) ≤ 2 exp(−2|T|ε²) gives the requirement |T| ≥ ln(2/δ) / (2ε²) for confidence 1 − δ.
This sample size bound grows very quickly as ε and δ shrink.
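A one-line sketch of the requirement, assuming the Hoeffding-based form above (the slide's original formula was lost in this transcript); `holdout_size` is a hypothetical helper name:

```python
from math import ceil, log

def holdout_size(epsilon, delta):
    """Test-set size m guaranteeing P(|empirical - true risk| > epsilon) <= delta
    via the Hoeffding bound: m >= ln(2/delta) / (2 * epsilon**2)."""
    return ceil(log(2 / delta) / (2 * epsilon ** 2))
```

Tightening ε from 0.05 to 0.01 at δ = 0.05 raises the requirement from 738 to 18445 test points, which illustrates the 1/ε² growth.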

95
The need for re-sampling
Too few training examples -> learning a poor classifier
Too few test examples -> erroneous error estimates
Hence: resampling
–Allows for accurate performance estimates while allowing the algorithm to train on most data examples
–Allows experimental conditions to be duplicated more closely

96
A quick look into the relationship with bias-variance behavior
What implicitly guides re-sampling: bias-variance analysis helps understand the behavior of learning algorithms.
In the classification case:
–Bias: difference between the most frequent prediction of the algorithm and the optimal label of the examples
–Variance: degree of the algorithm's divergence from its most probable (average) prediction in response to variations in the training set
Hence, bias is independent of the training set while variance isn't.

97
A quick look into the relationship with bias-variance behavior
Having too few examples in the training set affects the bias of the algorithm by making its average prediction unreliable.
Having too few examples in the test set results in high variance.
Hence, the choice of resampling can introduce different limitations on the bias and variance estimates of an algorithm, and hence on the reliability of error estimates.

98
An ontology of error estimation techniques
–No re-sampling: re-substitution; hold-out
–Simple re-sampling: cross-validation (non-stratified k-fold, stratified k-fold, leave-one-out); random sub-sampling
–Multiple re-sampling: bootstrapping (ε0 bootstrap, e632 bootstrap); randomization (permutation test); repeated k-fold cross-validation (10x10CV, 5x2CV)

99
Simple Resampling: k-fold Cross-Validation
In cross-validation, the data set is divided into k folds and, at each iteration, a different fold is reserved for testing while all the others are used for training the classifier.
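The procedure above can be sketched with the standard library alone; `train_and_test` is a hypothetical callback standing in for any learner that returns its error on the held-out fold:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint, shuffled test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train_and_test, k=10):
    """Average error over k folds; at each iteration one fold is the
    test set and the union of the others is the training set."""
    folds = kfold_indices(len(X), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(train_and_test(train_idx, test_idx))
    return sum(errors) / k
```

Stratified k-fold CV (discussed on the next slide) would replace `kfold_indices` with a per-class split so each fold preserves the class distribution.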

100
Simple Resampling: some variations of Cross-Validation
Stratified k-fold cross-validation:
–Maintain the class distribution from the original dataset when forming the k-fold CV subsets
–Useful when the class distribution of the data is imbalanced/skewed
Leave-one-out:
–The extreme case of k-fold CV where each subset is of size 1
–Also known as the jackknife estimate

101
Observations
k-fold CV is arguably the best known and most commonly used resampling technique
–With k of reasonable size, less computer-intensive than leave-one-out
–Easy to apply
In all the variants, the testing sets are independent of one another, as required by many statistical testing methods
–But the training sets are highly overlapping. This can affect the bias of the error estimates (generally mitigated for large dataset sizes).
Results in an estimate averaged over k different classifiers

102
Observations
Leave-one-out can result in estimates with high variance, given that each testing fold contains only one example
–On the other hand, the classifier is practically unbiased, since each training fold contains almost all the data
Leave-one-out can be quite effective for moderate dataset sizes
–For small datasets the estimates have high variance, while for large datasets application becomes too computer-intensive
Leave-one-out can also be beneficial, compared to k-fold CV, when
–there is wide dispersion of the data distribution
–the data set contains extreme values

103
Limitations of Simple Resampling
Simple resampling does not estimate the risk of a classifier but rather the expected risk of the algorithm over partitions of this size
–While stability of the estimates can give insight into the robustness of the algorithm, classifier comparisons in fact compare the averaged estimates (and not individual classifiers)
Providing rigorous guarantees is quite problematic
–No confidence intervals can be established (unlike, say, the hold-out method)
–Even under a normality assumption, the standard deviation at best conveys the uncertainty in the estimates

104
Multiple Resampling: Bootstrapping
Attempts to answer: what can be done when the data is too small for application of k-fold CV or leave-one-out?
Assumes that the available sample is representative
Creates a large number of new samples by drawing with replacement

105
Multiple Resampling: ε0 Bootstrapping
Given a dataset D of size m, create k bootstrap samples B_i, each of size m, by sampling with replacement
At each iteration i:
–B_i is the training set, while the test set contains a single copy of each example in D that is not present in B_i
–Train a classifier on B_i and test on the test set to obtain the error estimate ε0_i
Average over the ε0_i's to obtain the ε0 estimate

106
Multiple Resampling: ε0 Bootstrapping
In every iteration, the probability of an example not being chosen after m samples is (1 − 1/m)^m ≈ e^{−1} ≈ 0.368
Hence, the average number of distinct examples in the training set is (1 − 0.368)m = 0.632m
The ε0 measure can hence be pessimistic, since in each run the classifier is typically trained on about 63.2% of the data

107
Multiple Resampling: e632 Bootstrapping
The e632 estimate corrects for this pessimistic bias by taking into account the optimistic bias of the resubstitution error over the remaining fraction:
e632 = 0.632 · ε0 + 0.368 · ε_resub
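A minimal sketch of the ε0/e632 procedure, assuming the weighting above; `bootstrap_e632` is a hypothetical helper name and `train_and_test` is any routine returning the error of a classifier trained on one index set and tested on another:

```python
import random

def bootstrap_e632(X, y, train_and_test, k=50, seed=0):
    """e632 bootstrap error estimate: 0.632 * eps0 + 0.368 * resubstitution."""
    rng = random.Random(seed)
    m = len(X)
    e0_runs = []
    for _ in range(k):
        boot = [rng.randrange(m) for _ in range(m)]  # sample with replacement
        in_bag = set(boot)
        out_of_bag = [i for i in range(m) if i not in in_bag]
        if out_of_bag:                               # test on the unseen copies
            e0_runs.append(train_and_test(boot, out_of_bag))
    eps0 = sum(e0_runs) / len(e0_runs)
    # resubstitution error: train and test on the full dataset
    resub = train_and_test(list(range(m)), list(range(m)))
    return 0.632 * eps0 + 0.368 * resub
```

On average each `boot` contains about 0.632·m distinct indices, matching the derivation on the previous slide.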

108
Discussion
Bootstrap can be useful when datasets are too small
–In these cases, the estimates can also have low variance owing to the (artificially) increased data size
ε0: effective in cases of very high true error rate; has lower variance than k-fold CV but is more biased
e632 (owing to the correction it makes): effective over small true error rates
An interesting observation: these error estimates can be algorithm-dependent
–e.g., bootstrap can be a poor estimator for algorithms such as nearest neighbor or FOIL that do not benefit from duplicate instances

109
Multiple Resampling: other approaches
Randomization
–Over labels: assess the dependence of the algorithm on the actual label assignment (as opposed to obtaining a similar classifier on a chance assignment)
–Over samples (permutation): assess the stability of estimates over different re-orderings of the data
Multiple trials of simple resampling
–Potentially higher replicability and more stable estimates
–Multiple runs (how many?): 5x2 CV, 10x10 CV (the main motivation is comparison of algorithms)

110
Recent Developments: Bootstrap and big data
Bootstrap has gained significant attention in the big-data case recently
Allows error bars to characterize uncertainty in predictions as well as other evaluations
Kleiner et al. (ICML 2012) devised an extension of the bootstrap to large-scale problems
–Aimed at developing a database that returns answers with error bars to all queries
Combines the ideas of the classical bootstrap and subsampling

111
Assessing the Quality of Inference
Observe data X_1, ..., X_n
Slide courtesy of Michael Jordan

112
Assessing the Quality of Inference
Observe data X_1, ..., X_n
Form a parameter estimate θ̂_n = θ(X_1, ..., X_n)

113
Assessing the Quality of Inference
Observe data X_1, ..., X_n
Form a parameter estimate θ̂_n = θ(X_1, ..., X_n)
Want to compute an assessment of the quality of our estimate θ̂_n (e.g., a confidence region)

114
The Unachievable Frequentist Ideal
Ideally, we would:
Observe many independent datasets of size n
Compute θ̂_n on each
Compute the quality assessment based on these multiple realizations of θ̂_n

115
The Unachievable Frequentist Ideal
Ideally, we would:
Observe many independent datasets of size n
Compute θ̂_n on each
Compute the quality assessment based on these multiple realizations of θ̂_n
But we only observe one dataset of size n.

116
The Bootstrap (Efron, 1979)
Use the observed data to simulate multiple datasets of size n:
Repeatedly resample n points with replacement from the original dataset of size n
Compute θ̂*_n on each resample
Compute the quality assessment based on these multiple realizations of θ̂*_n as our estimate for θ̂_n

117
The Bootstrap: Computational Issues
Seemingly a wonderful match to modern parallel and distributed computing platforms
But the expected number of distinct points in a bootstrap resample is ~0.632n
–e.g., if the original dataset has size 1 TB, then expect a resample to have size ~632 GB
Can't feasibly send resampled datasets of this size to distributed servers
Even if one could, can't compute the estimate locally on datasets this large

118
Subsampling (Politis, Romano & Wolf, 1999)

119
Subsampling
[Figure: a subsample of size b drawn from the n data points]

120
Subsampling
There are many subsets of size b < n
Choose some sample of them and apply the estimator to each
This yields fluctuations of the estimate, and thus error bars
But a key issue arises: the fact that b < n means that the error bars will be on the wrong scale (they'll be too large)
Need to analytically correct the error bars

121
Subsampling
Summary of the algorithm:
Repeatedly subsample b < n points without replacement from the original dataset of size n
Compute θ̂_b on each subsample
Compute the quality assessment based on these multiple realizations of θ̂_b
Analytically correct it to produce the final estimate for θ̂_n
The need for analytical correction makes subsampling less automatic than the bootstrap
Still, a much more favorable computational profile than the bootstrap

122
Bag of Little Bootstraps (BLB)
Combines the bootstrap and subsampling, and gets the best of both worlds
Works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms
But, like the bootstrap, it doesn't require analytical rescaling
And it's shown to be successful in practice

123
Towards the Bag of Little Bootstraps
[Figure: a subsample of size b drawn from the n data points]


125
Approximation

126
Pretend the Subsample is the Population

127
Pretend the Subsample is the Population
And bootstrap the subsample! This means resampling n times with replacement, not b times as in subsampling

128
The Bag of Little Bootstraps (BLB)
The subsample contains only b points, and so the resulting empirical distribution has its support on b points
But we can (and should!) resample it with replacement n times, not b times
Doing this repeatedly for a given subsample gives bootstrap confidence intervals on the right scale: no analytical rescaling is necessary!
Now do this (in parallel) for multiple subsamples and combine the results (e.g., by averaging)
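The procedure above can be sketched for a simple estimator (the mean) with error bars given by the standard error; `blb_stderr` is a hypothetical helper name, and drawing the n resample points one by one is a simplification: the actual BLB algorithm stores multinomial counts and uses weighted estimators so that the full size-n resample is never materialized.

```python
import random
import statistics

def blb_stderr(data, b, s=20, r=50, seed=0):
    """Bag of Little Bootstraps sketch for the std. error of the mean:
    s subsamples of size b; each bootstrapped r times at resample size n."""
    rng = random.Random(seed)
    n = len(data)
    per_subsample = []
    for _ in range(s):
        sub = rng.sample(data, b)            # subsample WITHOUT replacement
        reps = []
        for _ in range(r):
            # draw n points with replacement from the b-point subsample,
            # tracked as counts (weights) on the b support points
            counts = [0] * b
            for _ in range(n):
                counts[rng.randrange(b)] += 1
            reps.append(sum(c * x for c, x in zip(counts, sub)) / n)
        per_subsample.append(statistics.stdev(reps))  # error bar, right scale
    return sum(per_subsample) / s            # combine across subsamples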

129
The Bag of Little Bootstraps (BLB)

130
BLB: Theoretical Results
BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analysis of the bootstrap.
Theorem (asymptotic consistency): under standard assumptions (particularly that the estimator is Hadamard differentiable and the quality assessment is continuous), the output of BLB converges to the population value of the quality assessment as n and b approach infinity.

131
What to watch out for when selecting an error estimation method
Domain (e.g., data distribution, dispersion)
Dataset size
Bias-variance dependence
(Expected) true error rate
Type of learning algorithms to be evaluated (some, e.g., can be robust to the permutation test, or do not benefit from bootstrapping)
Computational complexity
"The relations between the evaluation methods, whether statistically significantly different or not, varies with data quality. Therefore, one cannot replace one test with another, and only evaluation methods appropriate for the context may be used." Reich and Barai (1999, p. 11)

132
Statistical Significance Testing

133
Statistical Significance Testing
Error estimation techniques allow us to obtain an estimate of the desired performance metric(s) for different classifiers
A difference is seen between the metric values over classifiers
Questions:
–Can the difference be attributed to real characteristics of the algorithms?
–Could the observed difference(s) be merely coincidental concordances?

134
Statistical Significance Testing
Statistical significance testing can help ascertain whether the observed results are statistically significant (within certain constraints)
We primarily focus on Null Hypothesis Significance Testing (NHST), typically w.r.t.:
–Evaluating an algorithm A of interest against other algorithms on a specific problem
–Evaluating the generic capabilities of A against other algorithms on benchmark datasets
–Evaluating the performance of multiple classifiers on benchmark datasets or a specific problem of interest

135
NHST
State a null hypothesis
–Usually the opposite of what we wish to test (for example: classifiers A and B perform equivalently)
Choose a suitable statistical test and statistic that will be used to (possibly) reject the null hypothesis
Choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected
Calculate the test statistic from the data
If the test statistic lies in the critical region: reject the null hypothesis
–If not, we fail to reject the null hypothesis, but do not accept it either
Rejecting the null hypothesis gives us some confidence in the belief that our observations did not occur merely by chance.

136
Choosing a Statistical Test
There are several aspects to consider when choosing a statistical test:
–What kind of problem is being handled?
–Do we have enough information about the underlying distributions of the classifiers' results to apply a parametric test?
–…
Statistical tests considered can be categorized by the task they address, i.e., comparing:
–2 algorithms on a single domain
–2 algorithms on several domains
–multiple algorithms on multiple domains

137
Issues with hypothesis testing
NHST never constitutes a proof that our observation is valid
Some researchers (Drummond, 2006; Demsar, 2008) have pointed out that NHST:
–overvalues the results; and
–limits the search for novel ideas (owing to excessive, often unwarranted, confidence in the results)
NHST outcomes are prone to misinterpretation
–Rejecting the null hypothesis with confidence p does not signify that the performance difference holds with probability 1-p
–1-p in fact denotes the likelihood of the evidence from the experiment being correct if the difference indeed exists
Given enough data, a statistically significant difference can always be shown

138
But NHST can still be helpful
NHST never constitutes a proof that our observation is valid (but it provides added concrete support to observations)
Some researchers (Drummond, 2006; Demsar, 2008) have pointed out that NHST:
–overvalues the results; and
–limits the search for novel ideas (owing to excessive, often unwarranted, confidence in the results)
NHST outcomes are prone to misinterpretation (to which a better sensitization is necessary)
–Rejecting the null hypothesis with confidence p does not signify that the performance difference holds with probability 1-p
–1-p in fact denotes the likelihood of the evidence from the experiment being correct if the difference indeed exists
Given enough data, a statistically significant difference can always be shown; the impossibility to reject the null hypothesis while using a reasonable effect size is telling

139
An overview of representative tests
–2 algorithms, 1 domain: two matched-sample t-test (parametric); McNemar's test, sign test (non-parametric)
–2 algorithms, multiple domains: sign test, Wilcoxon's signed-rank test for matched pairs (non-parametric)
–Multiple algorithms, multiple domains: repeated-measures one-way ANOVA with the Tukey post-hoc test (parametric); Friedman's test with the Nemenyi or Bonferroni-Dunn post-hoc tests (non-parametric)

140
Parametric vs. non-parametric tests
Parametric tests make assumptions on the distribution of the population (i.e., the error estimates); they are typically more powerful than their non-parametric counterparts.

141
Tests covered in this tutorial
The two matched-sample t-test, McNemar's test, the sign test, Wilcoxon's signed-rank test, Friedman's test, and the Nemenyi post-hoc test.

142
Comparing 2 algorithms on a single domain: t-test
Arguably one of the most widely used statistical tests
Comes in various flavors (addressing different experimental settings)
In our case: the two matched-samples t-test
–to compare the performances of two classifiers applied to the same dataset with matching randomizations and partitions
Measures whether the difference between two means (mean values of the performance measure, e.g., error rate) is meaningful
Null hypothesis: the two samples (performance measures of the two classifiers over the dataset) come from the same population

143
Comparing 2 algorithms on a single domain: t-test
The t-statistic: t = d̄ / (s_d / √n), where d̄ is the average difference in the performance measure (e.g., error rate of classifier f1 minus that of f2) over n trials

144
Comparing 2 algorithms on a single domain: t-test
The t-statistic uses s_d = sqrt((1/(n−1)) Σ_i (d_i − d̄)²), the standard deviation over the differences in performance measures, where d_i is the difference at trial i

145
Comparing 2 algorithms on a single domain: t-test: Illustration
C4.5 and Naive Bayes (NB) algorithms on the Labor dataset
10 runs of 10-fold CV (maintaining the same training and testing folds across algorithms). The result of each 10-fold CV is considered a single result. Hence, each run of 10-fold CV is a different trial. Consequently n = 10.
From these runs we compute d̄ and s_d, and hence the t-statistic.
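The computation above can be sketched directly from the formula; `paired_t` is a hypothetical helper name, and the inputs stand in for per-trial scores such as the ten 10-fold CV results of each classifier:

```python
from math import sqrt

def paired_t(scores_a, scores_b):
    """Two matched-samples t statistic over per-trial performance
    differences; returns (t, degrees of freedom)."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    d_bar = sum(d) / n                                    # mean difference
    s_d = sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))  # std. dev. of d_i
    return d_bar / (s_d / sqrt(n)), n - 1
```

The returned t value is then compared against the t-table entry for n − 1 degrees of freedom at the chosen significance level.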

146
Comparing 2 algorithms on a single domain: t-test: Illustration
Referring to the t-test table for n−1 = 9 degrees of freedom, we can reject the null hypothesis at the corresponding significance level (for which the observed t-value should be greater than 4.781)

147
Effect size
The t-test measured the effect (i.e., whether the difference in the sample means is indeed statistically significant) but not the size of this effect (i.e., the extent to which this difference is practically important)
Statistics offers many methods to determine this effect size, including Pearson's correlation coefficient, Hedges' g, etc.
In the case of the t-test, we use Cohen's d statistic: d = (mean difference) / s_pooled, where s_pooled is the pooled standard deviation estimate

148
Effect size
A typical interpretation proposed by Cohen for effect size uses the following discrete scale:
–a value of about 0.2 or 0.3: small effect, but probably meaningful
–0.5: medium effect that is noticeable
–0.8 or above: large effect
In the case of the Labor dataset example, we have:
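A sketch of the computation, assuming the common two-sample form of Cohen's d with a pooled standard deviation; `cohens_d` is a hypothetical helper name:

```python
from math import sqrt

def cohens_d(sample1, sample2):
    """Effect size: difference of means over the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)   # unbiased variances
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    s_pooled = sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled
```

The result is then read against Cohen's scale above (about 0.2 small, 0.5 medium, 0.8 large).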

149
t-test: so far so good; but what about the assumptions made by the t-test?
Comparing 2 algorithms on a single domain: assumptions
–Normality or pseudo-normality of the populations
–Randomness of the samples (representativeness)
–Equal variances of the populations
Let's see if these assumptions hold in our case of the Labor dataset:
–Normality: we may assume pseudo-normality, since each trial eventually tests the algorithms on each of the 57 instances (note, however, that we compare averaged estimates)
–Randomness: data was collected in . All the collective agreements were collected; no reason to assume this to be non-representative

150
t-test: so far so good; but what about the assumptions made by the t-test?
However, the sample variances cannot be considered equal
We also ignored the correlated measurement effect (due to overlapping training sets), resulting in a higher Type I error
Were we warranted to use the t-test? Probably not (even though the t-test is quite robust to violations of some assumptions). Alternatives:
–Other variations of the t-test (e.g., Welch's t-test)
–McNemar's test (non-parametric)

151
Comparing 2 algorithms on a single domain: McNemar's test
Non-parametric counterpart to the t-test
Computes the following contingency matrix, where 0 denotes misclassification by the concerned classifier:
–c_00 denotes the number of instances misclassified by both f1 and f2
–c_10 denotes the number of instances correctly classified by f1 but not by f2
–…

152
Comparing 2 algorithms on a single domain: McNemar's test
Null hypothesis: c_01 = c_10
McNemar's statistic χ²_McN = (|c_01 − c_10| − 1)² / (c_01 + c_10) is distributed approximately as χ² with 1 degree of freedom if c_01 + c_10 is greater than 20
To reject the null hypothesis: χ²_McN should be greater than the χ² critical value for the desired α (typically 0.05)

153
Comparing 2 algorithms on a single domain: McNemar's test
Back to the Labor dataset: comparing C4.5 and NB, we obtain a statistic greater than the critical value of 3.841, allowing us to reject the null hypothesis at 95% confidence
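The statistic is a one-liner, shown here with the continuity-corrected form used above; `mcnemar_chi2` is a hypothetical helper name, and the disagreement counts in the usage note are illustrative, not the Labor-dataset values:

```python
def mcnemar_chi2(c01, c10):
    """Continuity-corrected McNemar statistic on the disagreement counts:
    c01 = instances f2 gets right but f1 wrong, c10 = the reverse.
    Compare against 3.841 (chi-square, 1 d.o.f., alpha = 0.05)."""
    return (abs(c01 - c10) - 1) ** 2 / (c01 + c10)
```

For example, disagreement counts of 20 and 5 give (|20 − 5| − 1)² / 25 = 7.84 > 3.841, rejecting the null hypothesis at the 0.05 level.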

154
Comparing 2 algorithms on a single domain: McNemar's test
To keep in mind:
–The test is aimed at matched pairs (maintain the same training and test folds across learning algorithms)
–Using resampling instead of independent test sets (over multiple runs) can introduce some bias
–A constraint: c_01 + c_10 should be at least 20, which was not true in this case
–In such a case, the sign test (also applicable for comparing 2 algorithms on multiple domains) should be used instead

155
Comparing 2 algorithms on multiple domains
There is no clear parametric way to compare two classifiers on multiple domains:
–The t-test is not a very good alternative because it is not clear that we have commensurability of the performance measures across domains
–The normality assumption is difficult to establish
–The t-test is susceptible to outliers, which is more likely when many different domains are considered
Therefore we describe two non-parametric alternatives:
–The sign test
–Wilcoxon's signed-rank test

156
Comparing 2 algorithms on multiple domains: Sign test
Can be used to compare two classifiers on a single domain (using the results at each fold as a trial)
More commonly used to compare two classifiers on multiple domains
Calculate:
–n_f1: the number of times that f1 outperforms f2
–n_f2: the number of times that f2 outperforms f1
The null hypothesis holds if the number of wins follows a binomial distribution.
Practically speaking, a classifier should perform better on at least w_α (the critical value) datasets to reject the null hypothesis at significance level α
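The critical value w_α can be computed from the exact binomial tail under the null; `sign_test_critical` is a hypothetical helper name, and note that published sign-test tables can differ slightly from this exact computation depending on how ties between the classifiers are handled:

```python
from math import comb

def sign_test_critical(n, alpha=0.05):
    """Smallest number of wins w such that P[X >= w] <= alpha under the
    null hypothesis X ~ Binomial(n, 0.5) (one-tailed)."""
    for w in range(n + 1):
        tail = sum(comb(n, i) for i in range(w, n + 1)) / 2 ** n
        if tail <= alpha:
            return w
    return n + 1   # no rejection possible at this alpha
```

For n = 10 domains, P[X >= 8] = 56/1024 ≈ 0.055 while P[X >= 9] = 11/1024 ≈ 0.011, so the exact one-tailed threshold at α = 0.05 is 9 wins.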

157
Comparing 2 algorithms on multiple domains: Sign test: Illustration
[Table: accuracies of NB, SVM, Adaboost and Random Forest on ten datasets (Anneal, Audiology, Balance Scale, Breast Cancer, Contact Lenses, Pima Diabetes, Glass, Hepatitis, Hypothyroid, Tic-Tac-Toe); numeric values lost in extraction]
NB vs SVM: n_f1 = 4.5, n_f2 = 5.5 and w_0.05 = 8: we cannot reject the null hypothesis (1-tailed)
Ada vs RF: n_f1 = 1 and n_f2 = 8.5: we can reject the null hypothesis at level α = 0.05 (1-tailed) and conclude that RF is significantly better than Ada on these data sets at that significance level.

Comparing 2 algorithms on multiple domains: Wilcoxon's Signed-Rank test
Non-parametric, and typically more powerful than the sign test (since, unlike the sign test, it quantifies the differences in performances)
Commonly used to compare two classifiers on multiple domains
Method:
–For each domain, calculate the difference in performance of the two classifiers
–Rank the absolute differences and graft the signs in front of these ranks
–Calculate the sum of positive ranks W_s1 and the sum of negative ranks W_s2
–Calculate T_wilcox = min(W_s1, W_s2)
–Reject the null hypothesis if T_wilcox ≤ V_α (the critical value)
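The same procedure is implemented in scipy; a sketch with hypothetical per-domain performance differences (chosen with distinct absolute values so no tie-handling is needed):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-domain differences f1 - f2 on 10 domains.
d = np.array([.01, -.02, .03, .04, -.05, .06, .07, .08, -.09, .10])

# wilcoxon ranks |d|, re-signs the ranks, sums positive and negative
# ranks, and (for the two-sided alternative) reports the smaller sum
# as the statistic -- exactly T_wilcox from the slide.
stat, p = wilcoxon(d)
print(stat, p)  # negative ranks: 2 + 5 + 9 = 16
```

Since 16 exceeds the α = 0.05 critical value for ten differences (V = 8), the p-value comes out above 0.05 and the null hypothesis stands.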

Comparing 2 algorithms on multiple domains: Wilcoxon's Signed-Rank test: Illustration
W_s1 = 17 and W_s2 = 28
T_wilcox = min(17, 28) = 17
For n = 10 − 1 = 9 remaining differences (one zero difference is removed) and α = 0.05, V = 8 for the 1-sided test
Since 17 > 8, we cannot reject the null hypothesis
[Table: per-dataset accuracies of NB and SVM, the differences NB−SVM, their absolute values |NB−SVM|, and the signed ranks; the zero difference is removed]

Comparing multiple algorithms on multiple domains
Two alternatives:
–Parametric: (one-way repeated-measure) ANOVA
–Non-parametric: Friedman's Test
Such multiple-hypothesis tests are called omnibus tests
–Their null hypothesis is that all the classifiers perform similarly; its rejection signifies that there exists at least one pair of classifiers with significantly different performances
In case of rejection of this null hypothesis, the omnibus test is followed by a post-hoc test to identify the significantly different pairs of classifiers
In this tutorial, we will discuss Friedman's Test (omnibus) and the Nemenyi test (post-hoc)

Friedman's Test
Algorithms are ranked on each domain according to their performances (best performance = lowest rank)
–In case of a d-way tie at rank r: assign ((r+1)+(r+2)+…+(r+d))/d to each classifier
For each classifier j, compute R.j: the sum of its ranks over all domains
Calculate Friedman's statistic over n domains and k classifiers:
χ²_F = [12 / (n·k·(k+1))] · Σ_j R.j² − 3·n·(k+1)
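A hedged sketch of the computation, reusing the hypothetical accuracies that appear in the R session later in the tutorial; the hand-rolled formula and scipy's implementation agree:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Accuracies of three classifiers on ten domains (same hypothetical
# numbers as the friedman.test example in the R session).
A = [85.83, 85.91, 86.12, 85.82, 86.28, 86.42, 85.91, 86.10, 85.95, 86.12]
B = [75.86, 73.18, 69.08, 74.05, 74.71, 65.90, 76.25, 75.10, 70.50, 73.95]
C = [84.19, 85.90, 83.83, 85.11, 86.38, 81.20, 86.38, 86.75, 88.03, 87.18]

perf = np.column_stack([A, B, C])  # rows = domains, columns = classifiers
n, k = perf.shape

# Rank within each domain: best performance = lowest rank;
# rankdata averages tied ranks automatically.
ranks = rankdata(-perf, axis=1)
R = ranks.sum(axis=0)              # R.j: sum of ranks per classifier

# Friedman's statistic: 12/(n*k*(k+1)) * sum_j R.j^2 - 3*n*(k+1)
chi2_f = 12.0 / (n * k * (k + 1)) * np.sum(R**2) - 3 * n * (k + 1)

# scipy computes the same quantity (with a tie correction when needed).
stat, p = friedmanchisquare(A, B, C)
print(R, chi2_f, stat, p)
```

Both routes give χ²_F = 15 with rank sums (15, 30, 15), matching the chi-squared = 15 reported by R's friedman.test on the same data.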

Illustration of the Friedman test
[Table: performance of classifiers fA, fB and fC on each domain, and the corresponding per-domain ranks with the rank sums R.j]

Nemenyi Test
The omnibus test rejected the null hypothesis
Follow up with a post-hoc test: the Nemenyi test identifies the classifier pairs with a significant performance difference
Let R_ij be the rank of classifier f_j on data set S_i; compute the mean rank of f_j over all n datasets, R̄.j = (1/n) Σ_i R_ij
Between classifiers f_j1 and f_j2, calculate:
q = (R̄.j1 − R̄.j2) / √(k(k+1) / (6n))
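Equivalently, two classifiers differ significantly when their mean ranks differ by more than a critical difference. A hedged sketch with hypothetical mean ranks (the value 2.343 is the α = 0.05 studentized-range constant for k = 3, as tabulated by Demšar, 2006):

```python
import math
from itertools import combinations

# Hypothetical mean ranks of three classifiers over n = 10 domains;
# any mean ranks coming out of a Friedman test would do here.
mean_rank = {"A": 1.2, "B": 2.6, "C": 2.2}
n, k = 10, 3

# Nemenyi: f_j1 and f_j2 differ significantly when the difference of
# their mean ranks exceeds the critical difference
#     CD = q_alpha * sqrt(k * (k + 1) / (6 * n)).
q_alpha = 2.343  # studentized-range value / sqrt(2), k = 3, alpha = 0.05
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))

for f1, f2 in combinations(mean_rank, 2):
    diff = abs(mean_rank[f1] - mean_rank[f2])
    print(f1, f2, round(diff, 2), "significant" if diff > cd else "not significant")
```

With these ranks CD ≈ 1.05, so only the A–B pair is declared significantly different.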

Nemenyi Test: Illustration
Computing the q statistic of the Nemenyi test for the different classifier pairs, we obtain:
–q_AB =
–q_BC =
–q_AC = 2.22
At α = 0.05, with the critical value q_α, we can show a statistically significantly different performance between classifiers A and B, and between classifiers B and C, but not between A and C

Recent Developments
Kernel-based tests: there have been considerable advances recently in employing kernels to devise statistical tests (e.g., Gretton et al., JMLR 2012; Fromont et al., COLT 2012)
Employing f-divergences: generalizes the information-theoretic tests such as KL-divergence to the broader family of f-divergences by treating the two distributions as class-conditionals (thereby expressing them in terms of the Bayes risk w.r.t. a loss function) (Reid and Williamson, JMLR 2011)
–Also extended to the multiclass case (Garcia-Garcia and Williamson, COLT 2012)
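To make the kernel-based idea concrete, here is a minimal numpy sketch of the statistic behind the tests of Gretton et al. (JMLR 2012): the biased MMD² estimate, which compares the mean embeddings of two samples under an RBF kernel. The data and bandwidth are illustrative assumptions, not from the tutorial:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian kernel.
    d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2_biased(x, y, gamma=1.0):
    # Biased estimate of MMD^2: distance between kernel mean embeddings.
    return (rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean()
            - 2 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2_biased(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2_biased(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
print(same, shifted)  # the shifted pair yields a much larger statistic
```

In the full test, a permutation or asymptotic null distribution turns this statistic into a p-value; the sketch only shows that the statistic separates identical from shifted samples.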

© 2013 GE All Rights Reserved Data Set Selection and Evaluation Benchmark Design

Considerations to keep in mind while choosing an appropriate test bed
Wolpert's No Free Lunch Theorems: if one algorithm tends to perform better than another on a given class of problems, then the reverse will be true on a different class of problems
LaLoudouana and Tarate (2003) showed that even mediocre learning approaches can be made to look competitive by selecting the test domains carefully
The purpose of data set selection should not be to demonstrate one algorithm's superiority over another in all cases, but rather to identify the areas of strength of various algorithms with respect to domain characteristics or on specific domains of interest

Where can we get our data from?
Repository data: data repositories such as the UCI Repository and the UCI KDD Archive have been extremely popular in machine learning research
Artificial data: creating artificial data sets that exhibit the particular characteristic an algorithm was designed to deal with is also a common practice in machine learning
Web-based exchanges: could we imagine a multi-disciplinary effort conducted on the Web, where researchers in need of data analysis would lend their data to machine learning researchers in exchange for an analysis?

Pros and Cons of Repository Data
Pros:
–Very easy to use: the data is available and already processed, and the user does not need any knowledge of the underlying field
–The data is not artificial, since it was provided by labs and so on; in some sense the research is conducted in a real-world setting (albeit a limited one)
–Replication and comparisons are facilitated, since many researchers use the same data sets to validate their ideas
Cons:
–The use of data repositories does not guarantee that the results will generalize to other domains
–The data sets in the repository are not representative of the data mining process, which involves many steps other than classification
–Community experiment / multiplicity effect: since so many experiments are run on the same data sets, by chance some will yield interesting (though meaningless) results

Pros and Cons of Artificial Data
Pros:
–Data sets can be created to mimic the traits expected to be present in real data (which are unfortunately unavailable)
–The researcher has the freedom to explore various related situations that may not be readily testable with accessible data
–Since new data can be generated at will, the multiplicity effect is not a problem in this setting
Cons:
–Real data is unpredictable and does not systematically behave according to a well-defined generation model
–Classifiers that happen to model the same distribution as the one used in the data generation process have an unfair advantage

Pros and Cons of Web-Based Exchanges
Pros: would be useful for three communities:
–Domain experts, who would get interesting analyses of their problems
–Machine learning researchers, who would get their hands on interesting data, and thus encounter new problems
–Statisticians, who could study their techniques in an applied context
Cons:
–It has not been organized yet. Is it truly feasible?
–How would the quality of the studies be controlled?

Available Resources

What help is available for conducting proper evaluation?
There is no need for researchers to program all the code necessary to conduct proper evaluation, nor to do the computations by hand
Resources are readily available for every step of the process, including:
–Evaluation metrics
–Statistical tests
–Re-sampling
–And of course, data repositories
Pointers to these resources, and explanations on how to use them, are discussed in our book

Where to look for evaluation metrics?
As most people know, Weka is a great source not only for classifiers, but also for computing results according to a variety of evaluation metrics:

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
K&B Relative Info Score %
K&B Information Score bits bits/instance
Class complexity | order bits bits/instance
Class complexity | scheme bits bits/instance
Complexity improvement (Sf) bits bits/instance
Mean absolute error
Root mean squared error
Relative absolute error %
Root relative squared error %
Total Number of Instances 57

=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
bad
good
Weighted Avg

=== Confusion Matrix ===
a b <-- classified as
14 6 | a = bad
9 28 | b = good
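The per-class metrics in such a summary all derive from the confusion matrix. A quick Python check using the matrix shown above (14, 6 / 9, 28 over 57 instances), treating "bad" as the positive class:

```python
# Confusion matrix counts from the Weka output, "bad" as positive class.
tp, fn = 14, 6
fp, tn = 9, 28

accuracy = (tp + tn) / (tp + fn + fp + tn)  # correctly classified fraction
tpr = tp / (tp + fn)                        # TP rate / recall for "bad"
fpr = fp / (fp + tn)                        # FP rate for "bad"
precision = tp / (tp + fp)
f_measure = 2 * precision * tpr / (precision + tpr)

print(accuracy, tpr, fpr, precision, f_measure)
```

This gives an accuracy of 42/57 ≈ 73.7%, a TP rate of 0.70 and a precision of about 0.61 for the "bad" class.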

WEKA even performs ROC analysis and draws cost curves
Better graphical analysis packages are available in R, however, notably the ROCR package, which permits the visualization of many versions of ROC and cost curves, as well as lift curves, P-R curves, and so on

Where to look for Statistical Tests?
The R software package contains implementations of all the statistical tests discussed here, and many more. They are very simple to run:

> c45 = c(.2456, .1754, .1754, .2632, .1579, .2456, .2105, .1404, .2632, .2982)
> nb = c(.0702, .0702, .0175, .0702, .0702, .0526, .1579, .0351, .0351, .0702)
> t.test(c45, nb, paired=TRUE)
        Paired t-test
data: c45 and nb
t = , df = 9, p-value = 2.536e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates: mean of the differences

> c4510folds = c(3, 0, 2, 0, 2, 2, 2, 1, 1, 1)
> nb10folds = c(1, 0, 0, 0, 0, 1, 0, 2, 0, 0)
> wilcox.test(nb10folds, c4510folds, paired=TRUE)
        Wilcoxon signed rank test with continuity correction
data: nb10folds and c4510folds
V = 2.5, p-value =
alternative hypothesis: true location shift is not equal to 0

> classA = c(85.83, 85.91, 86.12, 85.82, 86.28, 86.42, 85.91, 86.10, 85.95, 86.12)
> classB = c(75.86, 73.18, 69.08, 74.05, 74.71, 65.90, 76.25, 75.10, 70.50, 73.95)
> classC = c(84.19, 85.90, 83.83, 85.11, 86.38, 81.20, 86.38, 86.75, 88.03, 87.18)
> t = matrix(c(classA, classB, classC), nrow=10, byrow=FALSE)
> friedman.test(t)
        Friedman rank sum test
data: t
Friedman chi-squared = 15, df = 2, p-value =

Where to look for Re-sampling methods?
In our book, we have implemented all the re-sampling schemes described in this tutorial. The actual code is also available upon request (by e-mail).

# e0 bootstrap (requires the RWeka package): train on a bootstrap sample,
# test both classifiers on the instances left out of the sample.
e0Boot = function(iter, dataSet, setSize, dimension, classifier1, classifier2) {
  classifier1e0Boot <- numeric(iter)
  classifier2e0Boot <- numeric(iter)
  for (i in 1:iter) {
    # Bootstrap sample of indices; the unsampled indices form the test set.
    Subsamp <- sample(setSize, setSize, replace=TRUE)
    Basesamp <- 1:setSize
    oneTrain <- dataSet[Subsamp, 1:dimension]
    oneTest <- dataSet[setdiff(Basesamp, Subsamp), 1:dimension]
    # Train both classifiers on the same bootstrap sample.
    classifier1model <- classifier1(class~., data=oneTrain)
    classifier2model <- classifier2(class~., data=oneTrain)
    # Extract the accuracy from the Weka evaluation summary string.
    classifier1eval <- evaluate_Weka_classifier(classifier1model, newdata=oneTest)
    classifier1acc <- as.numeric(substr(classifier1eval$string, 70, 80))
    classifier2eval <- evaluate_Weka_classifier(classifier2model, newdata=oneTest)
    classifier2acc <- as.numeric(substr(classifier2eval$string, 70, 80))
    classifier1e0Boot[i] <- classifier1acc
    classifier2e0Boot[i] <- classifier2acc
  }
  return(rbind(classifier1e0Boot, classifier2e0Boot))
}

The End: Some Concluding Remarks

Thank you for attending the tutorial! For more information, contact us at:

References
Too many to put down here, but they are collected in the reference pages of the book
If you want specific ones, come and see me at the end of the tutorial or send me an e-mail



