Performance Evaluation of Machine Learning Algorithms


1 Performance Evaluation of Machine Learning Algorithms
Mohak Shah (GE Software) and Nathalie Japkowicz (University of Ottawa), ECML 2013, Prague

2 “Evaluation is the key to making real progress in data mining”
[Witten & Frank, 2005], p. 143

3 Yet…. Evaluation gives us a lot of, sometimes contradictory, information. How do we make sense of it all? Evaluation standards differ from person to person and discipline to discipline. How do we decide which standards are right for us? Evaluation gives us support for a theory over another, but rarely, if ever, certainty. So where does that leave us?

4 Example I: Which Classifier is better?
There are almost as many answers as there are performance measures! (e.g., UCI Breast Cancer) Algo Acc RMSE TPR FPR Prec Rec F AUC Info S NB 71.7 .4534 .44 .16 .53 .48 .7 48.11 C4.5 75.5 .4324 .27 .04 .74 .4 .59 34.28 3NN 72.4 .5101 .32 .1 .56 .41 .63 43.37 Ripp 71 .4494 .37 .14 .52 .43 .6 22.34 SVM 69.6 .5515 .33 .15 .39 54.89 Bagg 67.8 .4518 .17 .23 11.30 Boost 70.3 .4329 .42 .18 .5 .46 34.48 RanF 69.23 .47 20.78 acceptable contradictions questionable contradictions

5 Example I: Which Classifier is better?
Ranking the results Algo Acc RMSE TPR FPR Prec Rec F AUC Info S NB 3 5 1 7 2 C4.5 3NN 6 4 Ripp SVM 8 Bagg Boost RanF acceptable contradictions questionable contradictions

6 Example II: What’s wrong with our results?
A new algorithm was designed based on data that had been provided by the National Institutes of Health (NIH) [Wang and Japkowicz, 2008, 2010]. According to the standard evaluation practices in machine learning, the results were found to be significantly better than the state of the art. The conference version of the paper received a best paper award and a bit of attention in the ML community (35+ citations). NIH, however, would not consider the algorithm because it was probably not significantly better (if at all) than others, as they felt the evaluation wasn't performed according to their standards.

7 Example III: What do our results mean?
In an experiment comparing the accuracy performance of eight classifiers on ten domains, we concluded that One-Way Repeated-Measure ANOVA rejects the hypothesis that the eight classifiers perform similarly on the ten domains considered at the 95% significance level [Japkowicz & Shah, 2011], p. 273. Can we be sure of our results? What if we are in the 5% not vouched for by the statistical test? Was the ANOVA test truly applicable to our specific data? Is there any way to verify that? Can we have any 100% guarantee on our results?

8 Purpose of the tutorial
To give an appreciation for the complexity and uncertainty of evaluation. To discuss the different aspects of evaluation, to present an overview of the issues that come up in their context, and to outline what the possible remedies may be. To present some of the newer, lesser-known results in classifier evaluation. Finally, to point to some available, easy-to-use resources that can help improve the robustness of evaluation in the field.

9 The main steps of evaluation
The Classifier Evaluation Framework: choice of learning algorithm(s); datasets selection; performance measure of interest; error-estimation/sampling method; statistical test; perform evaluation. In the framework diagram, one type of link indicates that knowledge of one step is necessary for another, and the other that feedback from one step should be used to adjust another.

10 What these steps depend on
These steps depend on the purpose of the evaluation: comparison of a new algorithm to other (possibly generic or application-specific) classifiers on a specific domain (e.g., when proposing a novel learning algorithm); comparison of a new generic algorithm to other generic ones on a set of benchmark domains (e.g., to demonstrate the general effectiveness of the new approach against other approaches); characterization of generic classifiers on benchmark domains (e.g., to study the algorithms' behavior on general domains for subsequent use); comparison of multiple classifiers on a specific domain (e.g., to find the best algorithm for a given application task).

11 Outline Performance measures Error Estimation/Resampling
Statistical Significance Testing Data Set Selection and Evaluation Benchmark Design Available resources

12 Performance Measures

13 Performance Measures: Outline
Ontology Confusion Matrix Based Measures Graphical Measures: ROC Analysis and AUC, Cost Curves Recent Developments on the ROC Front: Smooth ROC Curves, the H Measure, ROC Cost Curves Other Curves: PR Curves, lift charts, etc. Probabilistic Measures Others: qualitative measures, combination frameworks, visualization, agreement statistics

14 Overview of Performance Measures
All measures can be organized by the information they require: the confusion matrix alone, the confusion matrix plus additional info (classifier uncertainty, cost ratio, skew), or alternate information. For deterministic classifiers, measures have either a multi-class focus (without chance correction: Accuracy, Error Rate; with chance correction: Cohen's Kappa, Fleiss' Kappa) or a single-class focus (TP/FP Rate, Precision/Recall, Sens./Spec., F-measure, Geom. Mean, Dice). For scoring classifiers, there are graphical measures (ROC Curves, PR Curves, DET Curves, Lift Charts, Cost Curves) and summary statistics (AUC, H Measure, Area under the ROC-cost curve). For continuous and probabilistic classifiers (reliability metrics), there are distance/error measures (RMSE) and information-theoretic measures (KL divergence, K&B IR, BIR). Other measures include Interestingness, Comprehensibility, and Multi-criteria frameworks.

15 Confusion Matrix-Based Performance Measures
Confusion Matrix (columns: true class Pos/Neg; rows: hypothesized class Yes/No). The Yes row contains TP and FP; the No row contains FN and TN; P = TP + FN and N = FP + TN. Multi-Class Focus: Accuracy = (TP+TN)/(P+N). Single-Class Focus: Precision = TP/(TP+FP); Recall = TP/P; Fallout = FP/N; Sensitivity = TP/(TP+FN); Specificity = TN/(FP+TN).

16 Aliases and other measures
Accuracy = 1 (or 100%) - Error rate Recall = TPR = Hit rate = Sensitivity Fallout = FPR = False Alarm rate Precision = Positive Predictive Value (PPV) Negative Predictive Value (NPV) = TN/(TN+FN) Likelihood Ratios: LR+ = Sensitivity/(1-Specificity)  higher the better LR- = (1-Sensitivity)/Specificity  lower the better

17 Pairs of Measures and Compounded Measures
Precision / Recall; Sensitivity / Specificity; Likelihood Ratios (LR+ and LR-); Positive / Negative Predictive Values. F-Measure: Fα = [(1 + α)(Precision x Recall)] / [(α x Precision) + Recall], with α = 1, 2, 0.5. The F-measure is a weighted harmonic mean of Precision and Recall: F1 is the balanced F-measure, F2 weighs recall twice as much as precision, and F0.5 half as much. G-Mean (two-class version): Sqrt(TPR x TNR); (single-class version): Sqrt(TPR x Precision).
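As a concrete illustration of the confusion-matrix measures on the last three slides, here is a minimal Python sketch (not part of the tutorial; the counts are hypothetical) that computes accuracy, precision, recall, fallout, specificity, the Fα above, and both G-mean variants:

```python
def confusion_matrix_measures(tp, fp, fn, tn, alpha=1.0):
    """Compute the confusion-matrix measures of slides 15-17 for one 2x2 matrix."""
    p, n = tp + fn, fp + tn                      # class totals P and N
    accuracy    = (tp + tn) / (p + n)
    precision   = tp / (tp + fp)
    recall      = tp / p                         # = TPR = sensitivity = hit rate
    fallout     = fp / n                         # = FPR = false alarm rate
    specificity = tn / (fp + tn)                 # = TNR
    # F_alpha as written on the slide: ((1+a)*Prec*Rec) / (a*Prec + Rec)
    f_alpha = ((1 + alpha) * precision * recall) / (alpha * precision + recall)
    g_mean_two_class    = (recall * specificity) ** 0.5   # sqrt(TPR x TNR)
    g_mean_single_class = (recall * precision) ** 0.5     # sqrt(TPR x Precision)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "fallout": fallout, "specificity": specificity,
            f"F_{alpha}": f_alpha, "g_mean_2c": g_mean_two_class,
            "g_mean_1c": g_mean_single_class}

# Hypothetical counts for illustration:
print(confusion_matrix_measures(tp=200, fp=100, fn=300, tn=400))
```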

18 Skew and Cost considerations
Skew-Sensitive Assessments, e.g., class ratios: Ratio+ = (TP + FN)/(FP + TN) and Ratio- = (FP + TN)/(TP + FN); TPR can be weighted by Ratio+ and TNR by Ratio-. Asymmetric Misclassification Costs: if the costs are known, weigh each non-diagonal entry of the confusion matrix by the appropriate misclassification cost and compute the weighted version of any previously discussed evaluation measure.

19 Some issues with performance measures
Classifier 1: TP = 200, FP = 100, FN = 300, TN = 400 (P = 500, N = 500). Classifier 2: TP = 400, FP = 300, FN = 100, TN = 200 (P = 500, N = 500). Both classifiers obtain 60% accuracy, yet they exhibit very different behaviours: the first has a weak positive recognition rate and a strong negative recognition rate; the second has a strong positive recognition rate and a weak negative recognition rate.

20 Some issues with performance measures (cont’d)
Classifier 1: TP = 500, FP = 5, FN = 0, TN = 0 (P = 500, N = 5). Classifier 2: TP = 450, FP = 1, FN = 50, TN = 4 (P = 500, N = 5). The first classifier obtains 99.01% accuracy while the second obtains 89.9%. Yet, the second classifier is much more sophisticated than the first, which just labels everything as positive and misses all the negative examples.

21 Some issues with performance measures (cont’d)
Classifier 1: TP = 200, FP = 100, FN = 300, TN = 400 (P = 500, N = 500). Classifier 2: TP = 200, FP = 100, FN = 300, TN = 0 (P = 500, N = 100). Both classifiers obtain the same precision and recall values of 66.7% and 40% (note: the data sets are different), yet they exhibit very different behaviours: the same positive recognition rate, but extremely different negative recognition rates (strong for the first, nil for the second). Note: accuracy has no problem catching this! Recall is the TPR, and hence precision and recall do not focus on the negative class.

22 Observations This and similar analyses reveal that each performance measure conveys some information and hides other information; there is, therefore, an information trade-off carried through the different metrics. A practitioner has to choose, quite carefully, the quantity s/he is interested in monitoring, while keeping in mind that other values matter as well. Classifiers may rank differently depending on the metrics along which they are compared.

23 Graphical Measures: ROC Analysis, AUC, Cost Curves
Many researchers have now adopted the AUC (the area under the ROC Curve) to evaluate their results. Principal advantage: it is more robust than Accuracy in class-imbalanced situations; it takes the class distribution into consideration and, therefore, gives more weight to correct classification of the minority class, thus outputting fairer results. We start by understanding ROC curves.

24 ROC Curves Performance measures for scoring classifiers
most classifiers are in fact scoring classifiers (not necessarily ranking classifiers) Scores needn’t be in pre-defined intervals, or even likelihood or probabilities ROC analysis has origins in Signal detection theory to set an operating point for desired signal detection rate Signal assumed to be corrupted by noise (Normally distributed) ROC maps FPR on horizontal axis and TPR on the vertical axis; Recall that FPR = 1 – Specificity TPR = Sensitivity Ranking classifiers only in the sense of ranking positive instances generally higher than negative ones

25 ROC Space

26 ROC Plot for discrete classifiers

27 ROC plot for two hypothetical scoring classifiers

28 Skew considerations ROC graphs are insensitive to class imbalances
Since they don’t account for class distributions Based on 2x2 confusion matrix (i.e. 3 degrees of freedom) First 2 dimensions: TPR and FPR The third dimension is performance measure dependent, typically class imbalance Accuracy/risk are sensitive to class ratios Classifier performance then corresponds to points in 3D space Projections on a single TPR x FPR slice are performance w.r.t. same class distribution Hence, implicit assumption is that 2D is independent of empirical class distributions

29 Skew considerations Factors such as class imbalances, asymmetric misclassification costs, credits for correct classification, etc. can be incorporated via a single metric, the skew-ratio rs, such that rs < 1 if positive examples are deemed more important and rs > 1 if vice-versa. Performance measures can then be re-stated to take the skew ratio into account.

30 Skew considerations Accuracy Precision:
For symmetric cost, skew represents the class ratio: Precision:

31 Isometrics Isometrics or Isoperformance lines (or curves)
For any performance metric, these are lines or curves denoting the same metric value for a given rs. In the case of 3D ROC graphs, isometrics form surfaces.

32 Isoaccuracy lines Black lines: rs = 1 (balanced)
Red lines: rs = 0.5 (positives weighed more than negatives)

33 Isoprecision lines

34 Isometrics Similarly, Iso-performance lines can be drawn on the ROC space connecting classifiers with the same expected cost: f1 and f2 have same performance if: Expected cost can be interpreted as a special case of skew

35 ROC curve generation

36 Curves from Discrete points and Sub-optimalities

37 ROC Convex hull The set of points on ROC curve that are not suboptimal forms the ROC convex hull

38 ROCCH and model selection

39 Selectively optimal classifiers

40 Selectively optimal classifiers

41 AUC ROC Analysis allows a user to visualize the performance of classifiers over their operating ranges. However, it does not allow the quantification of this analysis, which would make comparisons between classifiers easier. The Area Under the ROC Curve (AUC) allows such a quantification: it represents the performance of the classifier averaged over all the possible cost ratios. The AUC can also be interpreted as the probability of the classifier assigning a higher rank to a randomly chosen positive example than to a randomly chosen negative example. Using this second interpretation, the AUC can be estimated as AUC = [ Σ_{x in P} rank(x) - |P|(|P| + 1)/2 ] / (|P| * |N|), where P and N are the sets of positive and negative examples, respectively, in the test set, and rank(x) is the rank of example x in the ranking produced by the classifier under evaluation.
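A small sketch of this rank-based estimate (not the tutorial's code; the toy scores and labels are made up): scores are ranked in increasing order, the ranks of the positive examples are summed, and the formula above is applied.

```python
from scipy.stats import rankdata   # average ranks handle tied scores

def auc_from_scores(scores, labels):
    """Estimate AUC = P(score of a random positive > score of a random negative)."""
    ranks = rankdata(scores)                       # rank 1 = lowest score
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    # Sum of positive ranks minus its minimum possible value, normalized by |P|*|N|.
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy example with hypothetical scores and labels.
print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```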

42 Recent Developments I: Smooth ROC Curves (smROC) [Klement et al., ECML 2011]
ROC Analysis visualizes the ranking performance of classifiers but ignores any kind of scoring information. Scores are important as they tell us whether the difference between two rankings is large or small, but they are difficult to integrate: unlike probabilities, each scoring system differs from one classifier to the next, and their integration is not obvious. smROC extends the ROC curve to include normalized scores as smoothing weights added to its line segments. smROC was shown to capture the similarity between similar classification models better than ROC, and to capture differences between classification models that ROC is barely sensitive to, if at all.

43 Recent Developments II: The H Measure I
A criticism of the AUC was given by [Hand, 2009]. The argument goes along the following lines: the misclassification cost distributions (and hence the skew-ratio distributions) used by the AUC are different for different classifiers. Therefore, we may be comparing apples and oranges, as the AUC may give more weight to misclassifying a point by classifier A than it does by classifier B. To address this problem, [Hand, 2009] proposed the H-Measure. In essence, the H-Measure allows the user to select a cost-weight function that is equal for all the classifiers under comparison and thus allows for fairer comparisons.

44 Recent Developments II: The H Measure II
[Flach et al., ICML 2011] argued that [Hand, 2009]'s criticism holds when the AUC is interpreted as a classification measure but not when it is interpreted as a pure ranking measure. When [Hand, 2009]'s criticism holds, the problem occurs only if the thresholds used to construct the ROC curve are assumed to be optimal. When constructing the ROC curve using all the points in the data set as thresholds, rather than only optimal ones, the AUC measure can be shown to be coherent and the H-Measure unnecessary, at least from a theoretical point of view. The H-Measure is a linear transformation of the value that the area under the Cost Curve (discussed next) would produce.

45 Cost Curves Cost curves tell us for what class probabilities one classifier is preferable over the other, whereas ROC curves only tell us that sometimes one classifier is preferable over the other. Cost curves plot the relative expected misclassification cost as a function of the proportion of positive instances in the data set, P(+). (The companion figure shows an ROC plot of True Positive Rate vs. False Positive Rate for two classifiers f1 and f2, and the corresponding cost-space plot of Error Rate vs. P(+), including the positive and negative trivial classifiers and the classifiers that are always right or always wrong.) Cost curves are point-line duals of ROC curves: in cost space, each ROC point is represented by a line, and the convex hull of the ROC space corresponds to the lower envelope created by all the classifier lines.

46 Recent Developments III: ROC Cost curves
Hernandez-Orallo et al. (arXiv 2011, JMLR 2012) introduce ROC cost curves. They explore the connection between ROC space and cost space via the expected loss over a range of operating conditions, and show how ROC curves can be transferred to cost space by threshold choice methods such that the proportion of positive predictions equals the operating condition (via cost or skew). The expected loss measured by the area under ROC cost curves maps linearly to the AUC. They also explore the relationship between the AUC and the Brier score (MSE).

47 ROC cost curves (going beyond optimal cost curve)
Figure from Hernandez-Orallo et al (Fig 2)

48 ROC cost curves (going beyond optimal cost curve)
Figure from Hernandez-Orallo et al (Fig 2)

49 Other Curves Other research communities have produced and used graphical evaluation techniques similar to ROC curves. Lift charts plot the number of true positives on the vertical axis versus the overall number of examples considered for that true positive count. PR curves plot precision as a function of recall. Detection Error Trade-off (DET) curves are like ROC curves, but plot the FNR rather than the TPR on the vertical axis; they are also typically log-scaled. Relative Superiority Graphs are more akin to cost curves as they consider costs; they map the ratios of costs into the [0,1] interval.

50 Probabilistic Measures I: RMSE
The Root-Mean-Squared Error (RMSE) is usually used for regression, but can also be used with probabilistic classifiers. The formula for the RMSE is RMSE(f) = sqrt( (1/m) Σi (f(xi) - yi)² ), where m is the number of test examples, f(xi) the classifier's probabilistic output on xi, and yi the actual label.
ID 1: f(xi) = .95, yi = 1, (f(xi) - yi)² = .0025
ID 2: f(xi) = .6, yi = 0, (f(xi) - yi)² = .36
ID 3: f(xi) = .8, yi = 1, (f(xi) - yi)² = .04
ID 4: f(xi) = .75, yi = 0, (f(xi) - yi)² = .5625
ID 5: f(xi) = .9, yi = 1, (f(xi) - yi)² = .01
RMSE(f) = sqrt(1/5 * (.0025 + .36 + .04 + .5625 + .01)) = sqrt(0.975/5) ≈ 0.44
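The worked example above can be reproduced in a few lines (a sketch using the probabilities and labels from the table):

```python
import math

# (f(xi), yi) pairs from the table above
preds_and_labels = [(0.95, 1), (0.60, 0), (0.80, 1), (0.75, 0), (0.90, 1)]

squared_errors = [(f_x - y) ** 2 for f_x, y in preds_and_labels]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print([round(e, 4) for e in squared_errors])   # [0.0025, 0.36, 0.04, 0.5625, 0.01]
print(round(rmse, 4))                          # ~0.4416, i.e. sqrt(0.975 / 5)
```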

51 Probabilistic Measures II: Information Score
Kononenko and Bratko's Information Score assumes a prior P(y) on the labels, which can be estimated by the class distribution of the training data. The output (posterior probability) of a probabilistic classifier is P(y|f), where f is the classifier, and I(a) is the indicator function. The table lists, for each example x, the posterior P(yi|f), the actual label yi and the information score IS(x): 1 .95 0.66; 2 .6; 3 .8 .42; 4 .75 .32; 5 .9 .59. P(y=1) = 3/5 = 0.6, P(y=0) = 2/5 = 0.4. I(x1) = 1 * (-log(.6) + log(.95)) + 0 * (-log(.4) + log(.05)) = 0.66. Average info score: Ia = 1/5 (0.66 + 0 + 0.42 + 0.32 + 0.59) ≈ 0.40
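A minimal sketch of the per-example information score in the indicator form used in the I(x1) expansion above (base-2 logarithms; the 0.6 prior is the one from the slide, but the labels passed in are assumptions chosen so that the scores reproduce the IS column):

```python
import math

def information_score(posterior_pos, true_label, prior_pos):
    """Per-example K&B information score, following the I(x1) expansion above."""
    if true_label == 1:
        return -math.log2(prior_pos) + math.log2(posterior_pos)
    return -math.log2(1 - prior_pos) + math.log2(1 - posterior_pos)

prior = 0.6                                   # P(y=1) estimated as on the slide
examples = [(0.95, 1), (0.60, 1), (0.80, 1), (0.75, 1), (0.90, 1)]  # hypothetical labels
scores = [information_score(p, y, prior) for p, y in examples]
print([round(s, 2) for s in scores])          # [0.66, 0.0, 0.42, 0.32, 0.58]
print(round(sum(scores) / len(scores), 2))    # average information score, ~0.4
```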

52 Other Measures I: A Multi-Criteria Measure – The Efficiency Method The Efficiency Method is a framework for combining various measures, including qualitative metrics such as interestingness. It considers positive metrics, for which higher values are desirable (e.g., accuracy), and negative metrics, for which lower values are desirable (e.g., computational time). pmi+ are the positive metrics and pmj- are the negative metrics; the weights wi and wj have to be determined, and a solution is proposed that uses linear programming.

53 Other Measures II Classifier evaluation can be viewed as a problem of analyzing high-dimensional data. The performance measures currently used are but one class of projections that can be applied to these data. Japkowicz et al ECML’08 ISAIM’08] introduced visualization- based measures Projects the classifier performance on domains to a higher dimensional space Distance measure in projected space allows to qualitatively differentiate classifiers Note: Dependent on distance metric used

54 Other Measures III: Agreement Statistics
The measures considered so far assume the existence of ground-truth labels (against which, for instance, accuracy is measured). In the absence of ground truth, typically (multiple) experts label the instances; hence, one needs to account for chance agreements while evaluating under imprecise labeling. Two approaches: either estimate the ground truth, or evaluate against the group of labelings.

55 Multiple Annotations

56 Inter-expert agreement:
Two questions of interest: inter-expert agreement, i.e., the overall agreement of the group; and classifier agreement against the group.

57 Such measurements are also desired when
Assessing a classifier in a skewed class setting; assessing a classifier against expert labeling; or when there are reasons to believe that bias has been introduced in the labeling process (say, due to class imbalance, asymmetric representativeness of classes in the data, …). Agreement statistics aim to address this by accounting for chance agreements (coincidental concordances).

58 Coincidental concordances (Natural Propensities) need to be accounted for

59 General Agreement Statistic

60 General Agreement Statistic
The general form normalizes the observed agreement by chance: (agreement measure - chance agreement) / (maximum achievable agreement - chance agreement). Examples: Cohen's kappa, Fleiss' kappa, Scott's pi, ICC, …

61 2-raters over 2 classes: Cohen’s Kappa
Say we wish to measure classifier accuracy against some “true” process. Cohen's kappa is defined as κ = (P0 - Pe) / (1 - Pe), where P0 represents the probability of overall agreement over the label assignments between the classifier and the true process, and Pe represents the chance agreement over the labels; Pe is defined as the sum, over classes, of the proportion of examples assigned to a class times the proportion of true labels of that class in the data set.

62 Cohen’s Kappa Measure: Example
One of many generalizations for the multiple-class case. Confusion matrix (rows: actual class A/B/C, columns: predicted class A/B/C), with totals:
A: 60, 50, 10 (row total 120)
B: 10, 100, 40 (row total 150)
C: 30, 10, 90 (row total 130)
Column totals: 100, 160, 140 (grand total 400)
P0 = Accuracy = (60 + 100 + 90) / 400 = 62.5%
Pe = 100/400 x 120/400 + 160/400 x 150/400 + 140/400 x 130/400 = 0.3388
κ = (0.625 - 0.3388) / (1 - 0.3388) = 43.29%, so Accuracy is overly optimistic in this example!
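The computation above can be verified with a short sketch (the confusion matrix is the one just given; rows are taken as actual and columns as predicted, though kappa is unchanged if the roles are swapped):

```python
# Confusion matrix from the slide: rows = actual (A, B, C), columns = predicted (A, B, C).
matrix = [[60, 50, 10],
          [10, 100, 40],
          [30, 10, 90]]

total = sum(sum(row) for row in matrix)                               # 400
p_observed = sum(matrix[i][i] for i in range(3)) / total              # 0.625
row_totals = [sum(row) for row in matrix]                             # actual-class counts
col_totals = [sum(matrix[i][j] for i in range(3)) for j in range(3)]  # predicted-class counts
p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2   # ~0.3388

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(p_observed, 4), round(p_chance, 4), round(kappa, 4))      # 0.625 0.3388 0.4329
```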

63 Multiple raters over multiple classes
Various generalizations have been proposed to estimate inter-rater agreement in the general multi-class, multi-rater scenario. Some well-known formulations include Fleiss' kappa (Fleiss, 1971), ICC, Scott's pi, and Vanbelle and Albert (2009)'s measures. These formulations all rely on the marginalization argument (a fixed number of raters is assumed to be sampled from a pool of raters to label each instance), applicable, for instance, in many crowd-sourcing scenarios. For assessing agreement against a group of raters, typically the consensus argument is used: a deterministic label set is formed via majority vote of the raters in the group, and the classifier is evaluated against this label set.

64 Multiple raters over multiple classes
Shah (ECML 2011) highlighted the issues with both the marginalization and the consensus arguments, and with related approaches such as the Dice coefficient, for machine learning evaluation when the raters' group is fixed. Two new measures were introduced: κS, to assess inter-rater agreement (generalizing Cohen's kappa), with variance upper-bounded by its marginalization counterpart; and the S measure, to evaluate a classifier against a group of raters.

65 Issues with marginalization and Consensus argument: A quick review

66 Back to the agreement statistic
Pair-wise agreement measure; differences arise in how the expected (chance) agreement is obtained.

67 Traditional Approaches: Inter-rater Agreement
The Marginalization Argument: consider a simple 2-rater, 2-class case (a table of 7 instances labeled by Rater 1 and Rater 2). Observed agreement: 4/7. Probability of chance agreement over label 1: Pr(Label=1 | Random rater)² = 7/14 x 7/14 = 0.25. Agreement = (4/7 - 0.5) / (1 - 0.5) = 0.143

68 Traditional Approaches: Inter-rater Agreement
The Marginalization Argument: consider another scenario (again 7 instances labeled by Rater 1 and Rater 2). Observed agreement: 0. Probability of chance agreement over label 1: Pr(Label=1 | Random rater)² = 7/14 x 7/14 = 0.25. Agreement = (0 - 0.5) / (1 - 0.5) = -1

69 Traditional Approaches: Inter-rater Agreement
The Marginalization Argument: but this holds even when there is no evidence of a chance agreement (7 instances labeled by Rater 1 and Rater 2). Observed agreement: 0. Probability of chance agreement over label 1: Pr(Label=1 | Random rater)² = 7/14 x 7/14 = 0.25. Agreement = (0 - 0.5) / (1 - 0.5) = -1

70 Marginalization argument not applicable in fixed rater scenario
Marginalization ignores rater correlation and rater asymmetry, and results in loose chance-agreement estimates through optimistic estimation; hence, it yields an overly conservative agreement estimate. Shah's kappa corrects for these while retaining the convergence properties.

71 Shah’s kappa: Variability upper-bounded by that of Fleiss kappa

72 II. Agreement of a classifier against a group
Extension of the marginalization argument: recently appeared in the statistics literature (Vanbelle and Albert, Statistica Neerlandica, 2009). Consensus based (more traditional): almost universally used in the machine learning / data mining community, e.g., medical image segmentation, tissue classification, recommendation systems, expert modeling scenarios (e.g., market analyst combination).

73 Marginalization Approach and Issues in Fixed experts setting
Observed agreement: the proportion of raters with which the classifier agrees. This ignores qualitative agreement and may even ignore group dynamics. (Table: 7 instances labeled by Raters 1-3 and by the classifier.)

74 Marginalization Approach and Issues in Fixed experts setting
Observed agreement: the proportion of raters with which the classifier agrees. This ignores qualitative agreement and may even ignore group dynamics: any random label assignment gives the same observed agreement. (Table: 7 instances labeled by Raters 1-3 and by the classifier.)

75 Marginalization Approach and Issues in Fixed experts setting
Chance agreement: extends the marginalization argument. This is not informative when the raters are fixed and ignores rater-specific correlations: when fixed raters never agree, chance agreement should be zero. (Table: 7 instances labeled by Raters 1-3 and by the classifier.)

76 Consensus Approach and Issues
Obtain a deterministic label for each instance if at least k raters agree Treat this label set as ground truth and use dice coefficient against classifier labels

77 Problem with consensus: Ties

78 Problem with consensus: Ties

79 Problem with consensus: Ties: Solution -> Remove 1 (Tie-breaker)

80 Problem with consensus: Ties: Remove which 1?

81 Problem with consensus: Ties: Remove which 1?

82 Two Consensus outputs: One Classifier
(Figure: consensus C1 is liberal, consensus C2 is conservative, compared against the same classifier.)

83 Chance Correction: neither done nor meaningful (even Cohen's kappa is not useful over a consensus, since the consensus is not a single classifier). The notions of a liberal consensus (C1) and a conservative consensus (C2) are not meaningful here, because they are defined over a consensus that does not reflect the expert group.

84 Problem with consensus: Consensus upper bounded by “threshold-highest”
A theoretical bound on the consensus: if ti is the number of voxels labeled as lesions by the i-th most liberal rater, then the threshold-highest ti is an upper bound on the number of examples labeled positive in the consensus. (Figure: per-rater lesion counts, R6: 1891 (t2), R3: 1671 (t3), R1: 1526 (t4), R4: 816 (t5), R5: 754 (t6).)

85 Consensus Approach and Issues
Obtain a deterministic label for each instance if at least k >= r/2 raters agree; treat this label set as ground truth and use the Dice coefficient against the classifier labels. Issues: it is threshold sensitive, and establishing the threshold can be non-trivial; tie breaking is not clear; it treats estimates as deterministic; and it ignores minority raters as well as rater correlation. The S measure (Shah, ECML 2011) addresses these issues with marginalization and is independent of the consensus approach.

86 Error Estimation/Resampling

87 So we decided on a performance measure.
How do we estimate it in an unbiased manner? What if we had all the data? Re-substitution: Overly optimistic (best performance achieved with complete over-fit) Which factors affect such estimation choices? Sample Size/Domains Random variations: in training set, testing set, learning algorithms, class-noise

88 Hold-out approach Set aside a separate test set T and estimate the empirical risk on it. Advantages: independence from the training set; generalization behavior can be characterized; estimates can be obtained for any classifier.

89 Hold-out approach Confidence intervals can be found too (based on binomial approximation):

90 A Tighter bound Based on binomial inversion:
More realistic than asymptotic Gaussian assumption (μ ± 2σ)
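To make the binomial-vs-Gaussian contrast concrete, here is a sketch (not the tutorial's code) that computes both interval types for a hold-out error estimate: the Gaussian approximation R ± z·sqrt(R(1-R)/m) and an exact Clopper-Pearson interval obtained by inverting the binomial distribution. The error count and test-set size are hypothetical.

```python
from scipy.stats import beta, norm

def holdout_intervals(n_errors, n_test, confidence=0.95):
    """Gaussian-approximation vs. binomial-inversion intervals for a hold-out error rate."""
    r_hat = n_errors / n_test
    # Gaussian approximation: r_hat +/- z * sqrt(r_hat * (1 - r_hat) / m)
    z = norm.ppf(1 - (1 - confidence) / 2)
    half = z * (r_hat * (1 - r_hat) / n_test) ** 0.5
    gaussian_ci = (max(0.0, r_hat - half), min(1.0, r_hat + half))
    # Exact interval by inverting the binomial CDF (expressed via the beta distribution).
    a = 1 - confidence
    lower = 0.0 if n_errors == 0 else beta.ppf(a / 2, n_errors, n_test - n_errors + 1)
    upper = 1.0 if n_errors == n_test else beta.ppf(1 - a / 2, n_errors + 1, n_test - n_errors)
    return gaussian_ci, (lower, upper)

# Hypothetical hold-out run: 35 errors on a test set of 100 examples.
print(holdout_intervals(35, 100))
```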

91 Binomial vs. Gaussian assumptions
Bl and Bu show the binomial lower and upper bounds; CIl and CIu show the lower and upper confidence intervals based on the Gaussian assumption. Data-Set A RT Bl Bu CIl CIu Bupa SVM 0.352 0.235 0.376 -0.574 1.278 Ada 0.291 0.225 0.364 -0.620 1.202 DT 0.325 0.256 0.400 -0.614 1.264 DL NB 0.4 0.326 0.476 -0.582 1.382 SCM 0.377 0.305 0.453 -0.595 1.349 Credit 0.183 0.141 0.231 -0.592 0.958 0.17 0.129 0.217 0.922 0.13 0.094 0.173 -0.543 0.803 0.193 0.0150 0.242 -0.598 0.984 0.2 0.156 0.249 -0.603 1.003 0.19 0.147 0.239 -0.596 0.976

92 Binomial vs. Gaussian assumptions
Bl and Bu show the binomial lower and upper bounds; CIl and CIu show the lower and upper confidence intervals based on the Gaussian assumption. Data-Set A RT Bl Bu CIl CIu Bupa SVM 0.352 0.235 0.376 -0.574 1.278 Ada 0.291 0.225 0.364 -0.620 1.202 DT 0.325 0.256 0.400 -0.614 1.264 DL NB 0.4 0.326 0.476 -0.582 1.382 SCM 0.377 0.305 0.453 -0.595 1.349 Credit 0.183 0.141 0.231 -0.592 0.958 0.17 0.129 0.217 0.922 0.13 0.094 0.173 -0.543 0.803 0.193 0.0150 0.242 -0.598 0.984 0.2 0.156 0.249 -0.603 1.003 0.19 0.147 0.239 -0.596 0.976

93 Hold-out sample size requirements
The bound: gives

94 Hold-out sample size requirements
The bound gives a required sample size that grows very quickly as ε and δ shrink.

95 The need for re-sampling
Too few training examples -> learning a poor classifier Too few test examples -> erroneous error estimates Hence: Resampling Allows for accurate performance estimates while allowing the algorithm to train on most data examples Allows for closer duplication conditions

96 What implicitly guides re-sampling:
A quick look into relationship with Bias-variance behavior Bias-Variance analysis helps understand the behavior of learning algorithms In the classification case: Bias: difference between the most frequent prediction of the algorithm and the optimal label of the examples Variance: Degree of algorithm’s divergence from its most probable (average) prediction in response to variations in training set Hence, bias is independent of the training set while variance isn’t

97 What implicitly guides re-sampling:
A quick look into relationship with Bias-variance behavior Having too few examples in the training set affects the bias of algorithm by making its average prediction unreliable Having too few examples in test set results in high variance Hence, the choice of resampling can introduce different limitations on the bias and variance estimates of an algorithm and hence on the reliability of error estimates

98 An ontology of error estimation techniques
All Data Regimen: No Re-sampling (Re-substitution) or Re-sampling. Re-sampling: Hold-Out, Simple Re-sampling, or Multiple Re-sampling. Simple Re-sampling: Cross-Validation (Non-Stratified k-fold Cross-Validation, Stratified k-fold Cross-Validation, Leave-One-Out) and Random Sub-Sampling. Multiple Re-sampling: Bootstrapping (ε0 Bootstrap, 0.632 Bootstrap), Randomization (Permutation Test), and Repeated k-fold Cross-Validation (5x2CV, 10x10CV).

99 Simple Resampling: K-fold Cross Validation
In Cross-Validation, the data set is divided into k folds (fold 1, …, fold k-1, fold k in the diagram, each serving in turn as the testing data subset while the remaining folds form the training data subset); at each iteration, a different fold is reserved for testing while all the others are used for training the classifiers.
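A bare-bones sketch of the k-fold CV loop just described (the train/predict pair here is a stand-in majority-class learner, included only so the sketch runs on its own; swap in any real algorithm):

```python
import random

def k_fold_cv(xs, ys, k, train_fn, predict_fn, seed=0):
    """Return the error rate on each fold and the average over the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)              # random partition into k folds
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_x = [xs[j] for j in idx if j not in test_set]
        train_y = [ys[j] for j in idx if j not in test_set]
        model = train_fn(train_x, train_y)
        n_wrong = sum(predict_fn(model, xs[j]) != ys[j] for j in test_idx)
        errors.append(n_wrong / len(test_idx))
    return errors, sum(errors) / k

# Trivial stand-in learner: always predict the majority training label.
train = lambda X, Y: max(set(Y), key=Y.count)
predict = lambda model, x: model

X = list(range(20)); Y = [1] * 12 + [0] * 8
print(k_fold_cv(X, Y, k=5, train_fn=train, predict_fn=predict))
```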

100 Simple Resampling: Some variations of Cross-Validation
Stratified k-fold Cross-Validation: maintain the class distribution from the original dataset when forming the k-fold CV subsets; useful when the class distribution of the data is imbalanced/skewed. Leave-One-Out: the extreme case of k-fold CV where each subset is of size 1; also known as the jackknife estimate.

101 Observations k-fold CV is arguably the best known and most commonly used resampling technique. With k of reasonable size, it is less computer-intensive than Leave-One-Out and easy to apply. In all the variants, the testing sets are independent of one another, as required by many statistical testing methods, but the training sets are highly overlapping. This can affect the bias of the error estimates (generally mitigated for large dataset sizes). It results in an estimate averaged over k different classifiers.

102 Observations Leave-One-Out can result in estimates with high variance, given that the testing folds contain only one example; on the other hand, the classifier is practically unbiased since each training fold contains almost all the data. Leave-One-Out can be quite effective for moderate dataset sizes: for small datasets the estimates have high variance, while for large datasets its application becomes too computer-intensive. Leave-One-Out can also be beneficial, compared to k-fold CV, when there is wide dispersion in the data distribution or the data set contains extreme values.

103 Limitations of Simple Resampling
Simple resampling does not estimate the risk of a classifier but rather the expected risk of the algorithm over partitions of this size. While the stability of the estimates can give insight into the robustness of the algorithm, classifier comparisons in fact compare the averaged estimates (and not individual classifiers). Providing rigorous guarantees is quite problematic: no confidence intervals can be established (unlike, say, the hold-out method), and even under a normality assumption the standard deviation at best conveys the uncertainty in the estimates.

104 Multiple Resampling: Bootstrapping
Attempts to answer: what can be done when the data is too small for the application of k-fold CV or Leave-One-Out? Assumes that the available sample is representative. Creates a large number of new samples by drawing with replacement.

105 Multiple Resampling: ε0 Bootstrapping
Given a data set D of size m, create k bootstrap samples Bi, each of size m, by sampling with replacement. At each iteration i, Bi represents the training set while the test set contains a single copy of each example in D that is not present in Bi. Train a classifier on Bi and test it on the test set to obtain the error estimate ε0i. Average over the ε0i's to obtain the ε0 estimate.

106 Multiple Resampling: ε0 Bootstrapping
In every iteration, the probability of an example not being chosen after m samples is (1 - 1/m)^m ≈ e^-1 ≈ 0.368. Hence, the average number of distinct examples in the training set is (1 - 0.368)m = 0.632m. The ε0 measure can hence be pessimistic, since in each run the classifier is typically trained on only about 63.2% of the data.

107 Multiple Resampling: e632 Bootstrapping
The e632 estimate corrects for this pessimistic bias by taking into account the optimistic bias of the resubstitution error over the remaining fraction: e632 = 0.632 · ε0 + 0.368 · (resubstitution error).
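A sketch of the ε0 and e632 estimates from the last two slides (again with a stand-in majority-class learner so that it runs standalone; the 0.632/0.368 weighting is the correction just described):

```python
import random

def bootstrap_estimates(xs, ys, n_boot, train_fn, predict_fn, seed=0):
    """Return the (eps0, e632) bootstrap error estimates."""
    rng = random.Random(seed)
    m = len(xs)
    eps0_runs = []
    for _ in range(n_boot):
        boot_idx = [rng.randrange(m) for _ in range(m)]   # sample m indices with replacement
        in_bag = set(boot_idx)
        out_idx = [i for i in range(m) if i not in in_bag]  # held-out examples
        if not out_idx:
            continue
        model = train_fn([xs[i] for i in boot_idx], [ys[i] for i in boot_idx])
        eps0_runs.append(sum(predict_fn(model, xs[i]) != ys[i] for i in out_idx) / len(out_idx))
    eps0 = sum(eps0_runs) / len(eps0_runs)
    # Resubstitution error: train and test on the full sample.
    full_model = train_fn(xs, ys)
    resub = sum(predict_fn(full_model, x) != y for x, y in zip(xs, ys)) / m
    e632 = 0.632 * eps0 + 0.368 * resub
    return eps0, e632

train = lambda X, Y: max(set(Y), key=Y.count)     # stand-in learner: majority class
predict = lambda model, x: model
X = list(range(30)); Y = [1] * 18 + [0] * 12
print(bootstrap_estimates(X, Y, n_boot=200, train_fn=train, predict_fn=predict))
```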

108 Discussion Bootstrap can be useful when datasets are too small
In these cases, the estimates can also have low variance owing to the (artificially) increased data size. ε0 is effective in cases of very high true error rate; it has lower variance than k-fold CV but is more biased. e632 (owing to the correction it makes) is effective for small true error rates. An interesting observation: these error estimates can be algorithm-dependent; e.g., the bootstrap can be a poor estimator for algorithms such as nearest neighbor or FOIL that do not benefit from duplicate instances.

109 Multiple Resampling: other approaches
Randomization Over labels: Assess the dependence of algorithm on actual label assignment (as opposed to obtaining similar classifier on chance assignment) Over Samples (Permutation): Assess the stability of estimates over different re-orderings of the data Multiple trials of simple resampling Potentially Higher replicability and more stable estimates Multiple runs (how many?): 5x2 cv, 10x10 cv (main motivation is comparison of algorithms)

110 Recent Developments: Bootstrap and big data
Bootstrap has gained significant attention in the big data case recently Allows for “error bars” to characterize uncertainty in predictions as well as other evaluation Kleiner et al. (ICML 2012) devised the extension of bootstrap to large scale problems Aimed at developing a database that returns answers with error bars to all queries Combines the ideas of classical bootstrap and subsampling

111 Assessing the Quality of Inference
Observe data X1, ..., Xn Slide courtesy of Michael Jordan

112 Assessing the Quality of Inference
Observe data X1, ..., Xn Form a “parameter” estimate qn = q(X1, ..., Xn) Slide courtesy of Michael Jordan

113 Assessing the Quality of Inference
Observe data X1, ..., Xn Form a “parameter” estimate qn = q(X1, ..., Xn) Want to compute an assessment x of the quality of our estimate qn (e.g., a confidence region) Slide courtesy of Michael Jordan

114 The Unachievable Frequentist Ideal
Ideally, we would Observe many independent datasets of size n. Compute qn on each. Compute x based on these multiple realizations of qn. Slide courtesy of Michael Jordan

115 The Unachievable Frequentist Ideal
Ideally, we would Observe many independent datasets of size n. Compute qn on each. Compute x based on these multiple realizations of qn. But, we only observe one dataset of size n. Slide courtesy of Michael Jordan

116 Slide courtesy of Michael Jordan
The Bootstrap (Efron, 1979) Use the observed data to simulate multiple datasets of size n: Repeatedly resample n points with replacement from the original dataset of size n. Compute q*n on each resample. Compute x based on these multiple realizations of q*n as our estimate of x for qn. Slide courtesy of Michael Jordan

117 The Bootstrap: Computational Issues
Seemingly a wonderful match to modern parallel and distributed computing platforms But the expected number of distinct points in a bootstrap resample is ~ 0.632n e.g., if original dataset has size 1 TB, then expect resample to have size ~ 632 GB Can’t feasibly send resampled datasets of this size to distributed servers Even if one could, can’t compute the estimate locally on datasets this large Slide courtesy of Michael Jordan

118 Slide courtesy of Michael Jordan
Subsampling (Politis, Romano & Wolf, 1999) n Slide courtesy of Michael Jordan

119 Slide courtesy of Michael Jordan
Subsampling n b Slide courtesy of Michael Jordan

120 Slide courtesy of Michael Jordan
Subsampling There are many subsets of size b < n Choose some sample of them and apply the estimator to each This yields fluctuations of the estimate, and thus error bars But a key issue arises: the fact that b < n means that the error bars will be on the wrong scale (they’ll be too large) Need to analytically correct the error bars Slide courtesy of Michael Jordan

121 Slide courtesy of Michael Jordan
Subsampling Summary of algorithm: Repeatedly subsample b < n points without replacement from the original dataset of size n Compute q*b on each subsample Compute x based on these multiple realizations of q*b Analytically correct to produce final estimate of x for qn The need for analytical correction makes subsampling less automatic than the bootstrap Still, much more favorable computational profile than bootstrap Slide courtesy of Michael Jordan

122 Bag of Little Bootstraps (BLB)
combines the bootstrap and subsampling, and gets the best of both worlds works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms But, like the bootstrap, it doesn’t require analytical rescaling And it’s shown to be successful in practice Slide courtesy of Michael Jordan

123 Towards the Bag of Little Bootstraps
n b Slide courtesy of Michael Jordan

124 Towards the Bag of Little Bootstraps
Slide courtesy of Michael Jordan

125 Slide courtesy of Michael Jordan
Approximation Slide courtesy of Michael Jordan

126 Pretend the Subsample is the Population
Slide courtesy of Michael Jordan

127 Pretend the Subsample is the Population
And bootstrap the subsample! This means resampling n times with replacement, not b times as in subsampling Slide courtesy of Michael Jordan

128 The Bag of Little Bootstraps (BLB)
The subsample contains only b points, and so the resulting empirical distribution has its support on b points But we can (and should!) resample it with replacement n times, not b times Doing this repeatedly for a given subsample gives bootstrap confidence intervals on the right scale---no analytical rescaling is necessary! Now do this (in parallel) for multiple subsamples and combine the results (e.g., by averaging) Slide courtesy of Michael Jordan
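A toy sketch of the BLB procedure just described, estimating error bars for the mean of a large sample: each small subsample of size b is bootstrapped by drawing multinomial counts that sum to n (so the full n-point resample is never materialized), and the per-subsample intervals are then averaged. The estimator, the percentile interval, and the choice b = n^0.6 are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np

def blb_mean_ci(data, b, n_subsamples=10, n_boot=50, alpha=0.05, seed=0):
    """Bag of Little Bootstraps confidence interval for the mean (toy estimator)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    per_subsample_cis = []
    for _ in range(n_subsamples):
        sub = rng.choice(data, size=b, replace=False)        # subsample of size b < n
        boot_means = []
        for _ in range(n_boot):
            # Resample n points from the b-point subsample via multinomial weights.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            boot_means.append(np.dot(counts, sub) / n)
        lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        per_subsample_cis.append((lo, hi))
    # Combine the subsample results by averaging the interval endpoints.
    return tuple(np.mean(per_subsample_cis, axis=0))

data = np.random.default_rng(1).normal(loc=0.3, scale=1.0, size=100_000)
print(blb_mean_ci(data, b=int(len(data) ** 0.6)))   # b = n^0.6, a common BLB choice
```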

129 The Bag of Little Bootstraps (BLB)
Slide courtesy of Michael Jordan

130 BLB: Theoretical Results
BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analysis of the bootstrap. Theorem (asymptotic consistency): Under standard assumptions (particularly that q is Hadamard differentiable and x is continuous), the output of BLB converges to the population value of x as n, b approach ∞. Slide courtesy of Michael Jordan

131 What to watch out for when selecting an error estimation method
Domain (e.g., for data distribution, dispersion) Dataset Size Bias-Variance dependence (Expected) true error rate Type of learning algorithms to be evaluated (e.g. can be robust to, say, permutation test; or do not benefit over bootstrapping) Computational Complexity The relations between the evaluation methods, whether statistically significantly different or not, varies with data quality. Therefore, one cannot replace one test with another, and only evaluation methods appropriate for the context may be used. Reich and Barai (1999, Pg 11)

132 Statistical Significance Testing

133 Statistical Significance Testing
Error estimation techniques allow us to obtain an estimate of the desired performance metric(s) for different classifiers. When a difference is seen between the metric values over classifiers, two questions arise: can the difference be attributed to real characteristics of the algorithms, or can the observed difference(s) be merely coincidental concordances?

134 Statistical Significance Testing
Statistical Significance Testing can enable ascertaining whether the observed results are statistically significant (within certain constraints) We primarily focus on Null Hypothesis Significance Testing (NHST) typically w.r.t.: Evaluate algorithm A of interest against other algorithms on a specific problem Evaluate generic capabilities of A against other algorithms on benchmark datasets Evaluating performance of multiple classifiers on benchmark datasets or a specific problem of interest

135 NHST State a null hypothesis
Usually the opposite of what we wish to test (for example, classifiers A and B perform equivalently) Choose a suitable statistical test and statistic that will be used to (possibly) reject the null hypothesis Choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected. Calculate the test statistic from the data If the test statistic lies in the critical region: reject the null hypothesis. If not, we fail to reject the null hypothesis, but do not accept it either. Rejecting the null hypothesis gives us some confidence in the belief that our observations did not occur merely by chance.

136 Choosing a Statistical Test
There are several aspects to consider when choosing a statistical test. What kind of problem is being handled? Whether we have enough information about the underlying distributions of the classifiers’ results to apply a parametric test? Statistical tests considered can be categorized by the task they address, i.e., comparing 2 algorithms on a single domain 2 algorithms on several domains multiple algorithms on multiple domains

137 Issues with hypothesis testing
NHST never constitutes a proof that our observation is valid. Some researchers (Drummond, 2006; Demsar, 2008) have pointed out that NHST overvalues the results and limits the search for novel ideas (owing to excessive, often unwarranted, confidence in the results). NHST outcomes are prone to misinterpretation: rejecting the null hypothesis with confidence p does not signify that the performance difference holds with probability 1-p; 1-p in fact denotes the likelihood of the evidence from the experiment being correct if the difference indeed exists. Given enough data, a statistically significant difference can always be shown.

138 But NHST can still be helpful
NHST never constitutes a proof that our observation is valid Some researchers (Drummond, 2006; Demsar, 2008) have pointed out that NHST: overvalues the results; and limits the search for novel ideas (owing to excessive, often unwarranted, confidence in the results) NHST outcomes are prone to misinterpretation Rejecting null hypothesis with confidence p does not signify that the performance-difference holds with probability 1-p 1-p in fact denotes the likelihood of evidence from the experiment being correct if the difference indeed exists Given enough data, a statistically significant difference can always be shown (but provides added concrete support to observations) (to which a better sensitization is necessary) The impossibility to reject the null hypothesis while using a reasonable effect size is telling

139 An overview of representative tests
All Machine Learning & Data Mining problems 2 Algorithms 1 Domain 2 Algorithms Multiple Domains Multiple Algorithms Multiple Domains Two Matched Sample t-Test Wilcoxon’s Signed Rank Test for Matched Pairs Repeated Measure One-way ANOVA Friedman’s Test Tukey Post-hoc Test Bonferroni-Dunn Post-hoc Test Nemenyi Test McNemar’s Test Sign Test Parametric Test Parametric and Non-Parametric Non-Parametric

140 Parametric vs. Non-parametric tests
All Machine Learning & Data Mining problems 2 Algorithms 1 Domain 2 Algorithms Multiple Domains Multiple Algorithms Multiple Domains Two Matched Sample t-Test Wilcoxon’s Signed Rank Test for Matched Pairs Repeated Measure One-way ANOVA Friedman’s Test Tukey Post-hoc Test Bonferroni-Dunn Post-hoc Test Nemenyi Test McNemar’s Test Sign Test Make assumptions on distribution of population (i.e., error estimates); typically more powerful than non-parametric counterparts Parametric Test Parametric and Non-Parametric Non-Parametric

141 Tests covered in this tutorial
All Machine Learning & Data Mining problems 2 Algorithms 1 Domain 2 Algorithms Multiple Domains Multiple Algorithms Multiple Domains Two Matched Sample t-Test Wilcoxon’s Signed Rank Test for Matched Pairs Repeated Measure One-way ANOVA Friedman’s Test Tukey Post-hoc Test Bonferroni-Dunn Post-hoc Test Nemenyi Test McNemar’s Test Sign Test Parametric Test Parametric and Non-Parametric Non-Parametric

142 Comparing 2 algorithms on a single domain: t-test
Arguably, one of the most widely used statistical tests Comes in various flavors (addressing different experimental settings) In our case: two matched samples t-test to compare performances of two classifiers applied to the same dataset with matching randomizations and partitions Measures if the difference between two means (mean value of performance measures, e.g., error rate) is meaningful Null hypothesis: the two samples (performance measures of two classifiers over the dataset) come from the same population

143 Comparing 2 algorithms on a single domain: t-test
The t-statistic: where number of trials Average Performance measure, e.g., error rate of classifier f1

144 Comparing 2 algorithms on a single domain: t-test
The t-statistic: where Standard Deviation over the difference in performance measures Difference in Performance measures at trial i

145 Comparing 2 algorithms on a single domain: t-test: Illustration
C4.5 and Naïve Bayes (NB) algorithms on the Labor dataset 10 runs of 10-fold CV (maintaining same training and testing folds across algorithms). The result of each 10-fold CV is considered a single result. Hence, each run of 10-fold CV is a different trial. Consequently n=10 We have: and hence:
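The same matched-pairs comparison can be run in a few lines; the per-trial error rates below are hypothetical, since the slide's actual numbers are not reproduced here.

```python
from scipy import stats

# Hypothetical error rates of two classifiers over n = 10 matched trials
# (e.g., 10 runs of 10-fold CV with identical folds).
errors_c45 = [0.14, 0.12, 0.16, 0.11, 0.15, 0.13, 0.14, 0.12, 0.15, 0.13]
errors_nb  = [0.09, 0.08, 0.11, 0.07, 0.10, 0.09, 0.08, 0.10, 0.09, 0.08]

t_stat, p_value = stats.ttest_rel(errors_c45, errors_nb)   # two matched samples t-test
print(round(t_stat, 3), round(p_value, 5))
```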

146 Comparing 2 algorithms on a single domain: t-test: Illustration
Referring to the t-test table, for n-1 = 9 degrees of freedom, we can reject the null hypothesis at the 0.001 significance level (for which the observed t-value should be greater than 4.781).

147 Effect size The t-test measured the effect (i.e., whether the difference in the sample means is indeed statistically significant) but not the size of this effect (i.e., the extent to which this difference is practically important). Statistics offers many methods to determine this effect size, including Pearson's correlation coefficient, Hedges' G, etc. In the case of the t-test, we use Cohen's d statistic: d = (pm(f1) - pm(f2)) / σp, where σp is the pooled standard deviation estimate.

148 Effect size A typical interpretation proposed by Cohen for effect size was the following discrete scale: A value of about 0.2 or 0.3 : small effect, but probably meaningful ≈0.5: medium effect that is noticeable ≥ 0.8: large effect In the case of Labor dataset example, we have:
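Continuing the sketch above, Cohen's d can be computed from the same (hypothetical) matched samples using the pooled standard deviation described on slide 147:

```python
import statistics

errors_c45 = [0.14, 0.12, 0.16, 0.11, 0.15, 0.13, 0.14, 0.12, 0.15, 0.13]
errors_nb  = [0.09, 0.08, 0.11, 0.07, 0.10, 0.09, 0.08, 0.10, 0.09, 0.08]

# Pooled standard deviation of the two equal-sized samples.
pooled_sd = ((statistics.variance(errors_c45) + statistics.variance(errors_nb)) / 2) ** 0.5
cohens_d = (statistics.mean(errors_c45) - statistics.mean(errors_nb)) / pooled_sd
print(round(cohens_d, 2))   # >= 0.8 would be read as a large effect on Cohen's scale
```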

149 Comparing 2 algorithms on a single domain
t-test: so far so good, but what about the assumptions made by the t-test? Assumptions: the normality or pseudo-normality of the populations; the randomness of the samples (representativeness); equal variances of the populations. Let's see if these assumptions hold in our case of the Labor dataset. Normality: we may assume pseudo-normality since each trial eventually tests the algorithms on each of 57 instances (note, however, that we compare averaged estimates). Randomness: all the collective agreements were collected, so there is no reason to assume the sample to be non-representative.

150 Comparing 2 algorithms on a single domain
t-test: so far so good, but what about the assumptions made by the t-test? However, the sample variances cannot be considered equal. We also ignored the correlated measurement effect (due to overlapping training sets), which results in a higher Type I error. Were we warranted to use the t-test? Probably not (even though the t-test is quite robust to violations of some assumptions). Alternatives: other variations of the t-test (e.g., Welch's t-test) or McNemar's test (non-parametric).

151 Comparing 2 algorithms on a single domain: McNemar’s test
Non-parametric counterpart to the t-test. Computes the following contingency matrix, where 0 denotes misclassification by the concerned classifier: c00 denotes the number of instances misclassified by both f1 and f2; c10 denotes the number of instances correctly classified by f1 but not by f2 (similarly, c01 counts those correctly classified by f2 but not by f1, and c11 those correctly classified by both).

152 Comparing 2 algorithms on a single domain: McNemar’s test
Null hypothesis: the two classifiers have the same error rate, i.e., c01 = c10. McNemar's statistic, χ²_McN = (|c01 - c10| - 1)² / (c01 + c10), is distributed approximately as χ² with 1 degree of freedom if c01 + c10 is greater than 20. To reject the null hypothesis, χ²_McN should be greater than the critical χ² value for the desired α (typically 0.05, for which the critical value is 3.84).

153 Comparing 2 algorithms on a single domain: McNemar’s test
Back to the Labor dataset: comparing C4.5 and NB, we obtain a statistic that is greater than the desired value of 3.84, allowing us to reject the null hypothesis at 95% confidence.
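A sketch of the McNemar computation from the two disagreement counts, using the continuity-corrected statistic of slide 152 (the counts here are hypothetical):

```python
from scipy.stats import chi2

def mcnemar_statistic(c01, c10):
    """Continuity-corrected McNemar statistic from the two disagreement counts."""
    return (abs(c01 - c10) - 1) ** 2 / (c01 + c10)

# Hypothetical disagreements: f2 right / f1 wrong 25 times, f1 right / f2 wrong 7 times.
stat = mcnemar_statistic(c01=25, c10=7)
critical = chi2.ppf(0.95, df=1)               # ~3.841 for alpha = 0.05
print(round(stat, 3), round(critical, 3), stat > critical)
```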

154 Comparing 2 algorithms on a single domain: McNemar’s test
To keep in mind: the test is aimed at matched pairs (maintain the same training and test folds across learning algorithms). Using resampling instead of independent test sets (over multiple runs) can introduce some bias. A constraint: c01 + c10 should be at least 20, which was not true in this case. In such a case, the sign test (also applicable for comparing 2 algorithms on multiple domains) should be used instead.

155 Comparing 2 algorithms on multiple domains
There is no clear parametric way to compare two classifiers on multiple domains. The t-test is not a very good alternative because it is not clear that we have commensurability of the performance measures across domains; the normality assumption is difficult to establish; and the t-test is susceptible to outliers, which is more likely when many different domains are considered. Therefore we will describe two non-parametric alternatives: the Sign Test and Wilcoxon's Signed-Rank Test.

156 Comparing 2 algorithms on multiple domains: Sign test
Can be used to compare two classifiers on a single domain (using the results at each fold as a trial), but is more commonly used to compare two classifiers on multiple domains. Calculate nf1, the number of times that f1 outperforms f2, and nf2, the number of times that f2 outperforms f1. Under the null hypothesis, the number of wins follows a binomial distribution with p = 0.5. Practically speaking, a classifier should perform better on at least wα (the critical value) datasets to reject the null hypothesis at the α significance level.

157 Comparing 2 algorithms on multiple domains: Sign test: Illustration
Dataset: NB, SVM, Adaboost, Rand Forest
Anneal: 96.43, 99.44, 83.63, 99.55
Audiology: 73.42, 81.34, 46.46, 79.15
Balance Scale: 72.30, 91.51, 72.31, 80.97
Breast Cancer: 71.70, 66.16, 70.28, 69.99
Contact Lenses: 71.67
Pima Diabetes: 74.36, 77.08, 74.35, 74.88
Glass: 70.63, 62.21, 44.91, 79.87
Hepatitis: 83.21, 80.63, 82.54, 84.58
Hypothyroid: 98.22, 93.58, 93.21, 99.39
Tic-Tac-Toe: 69.62, 99.90, 72.54, 93.94
NB vs SVM: nf1 = 4.5, nf2 = 5.5 and w0.05 = 8, so we cannot reject the null hypothesis (1-tailed). Ada vs RF: nf1 = 1 and nf2 = 8.5, so we can reject the null hypothesis at level α = 0.05 (1-tailed) and conclude that RF is significantly better than Ada on these data sets at that significance level.
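The sign test reduces to a binomial tail probability, so the decision can also be checked directly rather than via a wα table. This is a sketch with hypothetical win counts over 10 domains:

```python
from scipy.stats import binom

def sign_test_p_value(wins_f1, wins_f2):
    """One-tailed sign test: P(at least max(wins) wins out of n under a fair coin)."""
    n = wins_f1 + wins_f2
    better = max(wins_f1, wins_f2)
    return binom.sf(better - 1, n, 0.5)        # sf(k) = P(X > k), so this is P(X >= better)

print(round(sign_test_p_value(9, 1), 4))   # ~0.0107 -> significant at alpha = 0.05
print(round(sign_test_p_value(6, 4), 4))   # ~0.3770 -> not significant
```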

158 Comparing 2 algorithms on multiple domains:
Wilcoxon’s Signed-Rank test Non-parametric, and typically more powerful than sign test (since, unlike sign test, it quantifies the differences in performances) Commonly used to compare two classifiers on multiple domains Method For each domain, calculate the difference in performance of the two classifiers Rank the absolute differences and graft the signs in front of these ranks Calculate the sum of positive ranks Ws1 and sum of negative ranks Ws2 Calculate Twilcox = min (Ws1 , Ws2) Reject null hypothesis if Twilcox ≤ Vα (critical value)

159 Comparing 2 algorithms on multiple domains:
Wilcoxon's Signed-Rank test: Illustration
Data: NB, SVM, NB-SVM, |NB-SVM|, signed rank
1: .9643, .9944, -.0301, .0301, -3
2: .7342, .8134, -.0792, .0792, -6
3: .7230, .9151, -.1921, .1921, -8
4: .7170, .6616, +.0554, .0554, +5
5: .7167, (removed)
6: .7436, .7708, -.0272, .0272, -2
7: .7063, .6221, +.0842, .0842, +7
8: .8321, .8063, +.0258, .0258, +1
9: .9822, .9358, +.0464, .0464, +4
10: .6962, .9990, -.3028, .3028, -9
WS1 = 17 and WS2 = 28, so TWilcox = min(17, 28) = 17. For n = 10 - 1 degrees of freedom and α = 0.05, V = 8 for the 1-sided test. Since 17 > 8, we cannot reject the null hypothesis.
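The same computation is available off the shelf; a sketch using the NB and SVM accuracy columns from slide 157, with the removed entry excluded:

```python
from scipy.stats import wilcoxon

# Accuracy columns from slide 157 (the removed Contact Lenses entry is excluded).
nb  = [0.9643, 0.7342, 0.7230, 0.7170, 0.7436, 0.7063, 0.8321, 0.9822, 0.6962]
svm = [0.9944, 0.8134, 0.9151, 0.6616, 0.7708, 0.6221, 0.8063, 0.9358, 0.9990]

stat, p_value = wilcoxon(nb, svm)      # paired, two-sided by default
print(stat, round(p_value, 3))         # statistic 17.0 = min(W+, W-), as computed above
```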

160 Comparing multiple algorithms on multiple domains
Two alternatives: parametric, the one-way repeated-measure ANOVA; and non-parametric, Friedman's Test. Such multiple-hypothesis tests are called omnibus tests. Their null hypothesis is that all the classifiers perform similarly, and its rejection signifies that there exists at least one pair of classifiers with significantly different performances. In case of rejection of this null hypothesis, the omnibus test is followed by a post-hoc test to identify the significantly different pairs of classifiers. In this tutorial, we will discuss Friedman's Test (omnibus) and the Nemenyi test (post-hoc).

161 Friedman's Test Algorithms are ranked on each domain according to their performances (best performance = lowest rank). In case of a d-way tie at rank r, assign ((r+1)+(r+2)+…+(r+d))/d to each tied classifier. For each classifier j, compute R.j, the sum of its ranks on all domains. Then calculate Friedman's statistic over the n domains and k classifiers: χ²_F = [12 / (n·k·(k+1))] Σj R.j² - 3n(k+1).

162 Illustration of the Friedman test
Domain Classifier fA Classifier fB Classifier fC 1 85.83 75.86 84.19 2 85.91 73.18 85.90 3 86.12 69.08 83.83 4 85.82 74.05 85.11 5 86.28 74.71 86.38 6 86.42 65.90 81.20 7 76.25 8 86.10 75.10 86.75 9 85.95 70.50 88.03 19 73.95 87.18 Domain Classifier fA Classifier fB Classifier fC 1 3 2 1.5 4 5 6 7 8 9 10 R .j 15.5 30 14.5
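With per-domain results like those in the table above, the Friedman statistic can be computed from the rank sums via the formula on slide 161, or off the shelf. This sketch uses hypothetical accuracies (loosely based on the table, with its missing cells filled arbitrarily, so the numbers are illustrative only):

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracies of three classifiers on ten domains.
fa = [85.8, 85.9, 86.1, 85.8, 86.3, 86.4, 86.0, 86.1, 86.0, 85.9]
fb = [75.9, 73.2, 69.1, 74.1, 74.7, 65.9, 76.3, 75.1, 70.5, 74.0]
fc = [84.2, 85.9, 83.8, 85.1, 86.4, 81.2, 85.5, 86.8, 88.0, 87.2]

stat, p_value = friedmanchisquare(fa, fb, fc)
print(round(stat, 2), round(p_value, 5))   # large statistic -> reject "all perform similarly"
```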

163 Nemenyi Test The omnibus test rejected the null hypothesis
Follow up by post-hoc test: Nemenyi test to identify classifier-pairs with significant performance difference Let Rij : rank of classifier fj on data Si; Compute mean rank of fj on all datasets Between classifier fj1 and fj2, calculate

164 Nemenyi Test: Illustration
Computing the q statistic of Nemenyi test for different classifier pairs, we obtain: qAB = qBC = 34.44 qAC = 2.22 At α = 0.05, we have qα = 2.55. Hence, we can show a statistically significantly different performance between classifiers A and B, and between classifiers B and C, but not between A and C

165 Recent Developments Kernel-based tests: there have been considerable advances recently in employing kernels to devise statistical tests (e.g., Gretton et al., JMLR 2012; Fromont et al., COLT 2012). Employing f-divergences: generalizes information-theoretic tests such as KL to the more general f-divergences by considering the distributions as class-conditional (thereby expressing the two distributions in terms of the Bayes risk w.r.t. a loss function) (Reid and Williamson, JMLR 2011); this has also been extended to the multiclass case (Garcia-Garcia and Williamson, COLT 2012).

166 Data Set Selection and Evaluation Benchmark Design

167 Considerations to keep in mind while choosing an appropriate test bed
Wolpert’s “No Free Lunch” Theorems: if one algorithm tends to perform better than another on a given class of problems, then the reverse will be true on a different class of problems. LaLoudouana and Tarate (2003) showed that even mediocre learning approaches can be shown to be competitive by selecting the test domains carefully. The purpose of data set selection should not be to demonstrate an algorithm’s superiority to another in all cases, but rather to identify the areas of strengths of various algorithms with respect to domain characteristics or on specific domains of interest.

168 Where can we get our data from?
Repository Data: data repositories such as the UCI Repository and the UCI KDD Archive have been extremely popular in machine learning research.
Artificial Data: the creation of artificial data sets representing the particular characteristic an algorithm was designed to deal with is also a common practice in machine learning.
Web-Based Exchanges: could we imagine a multi-disciplinary effort conducted on the Web, where researchers in need of data analysis would “lend” their data to machine learning researchers in exchange for an analysis?

169 Pros and Cons of Repository Data
Pros:
- Very easy to use: the data is available, already processed, and the user does not need any knowledge of the underlying field.
- The data is not artificial, since it was provided by labs and so on; in some sense, the research is conducted in a real-world setting (albeit a limited one).
- Replication and comparisons are facilitated, since many researchers use the same data set to validate their ideas.
Cons:
- The use of data repositories does not guarantee that the results will generalize to other domains.
- The data sets in the repository are not representative of the data-mining process, which involves many steps other than classification.
- Community experiment / multiplicity effect: since so many experiments are run on the same data set, by chance, some will yield interesting (though meaningless) results.

170 Pros and Cons of Artificial Data
Pros:
- Data sets can be created to mimic the traits that are expected to be present in real data (which are, unfortunately, unavailable).
- The researcher has the freedom to explore various related situations that may not have been readily testable from accessible data.
- Since new data can be generated at will, the multiplicity effect is not a problem in this setting.
Cons:
- Real data is unpredictable and does not systematically behave according to a well-defined generation model.
- Classifiers that happen to model the same distribution as the one used in the data-generation process have an unfair advantage.

171 Pros and Cons of Web-Based Exchanges
Pros: such an exchange would be useful for three communities:
- domain experts, who would get interesting analyses of their problems;
- machine learning researchers, who would get their hands on interesting data and thus encounter new problems;
- statisticians, who could study their techniques in an applied context.
Cons:
- It has not been organized. Is it truly feasible?
- How would the quality of the studies be controlled?

172 Available Resources

173 What help is available for conducting proper evaluation?
There is no need for researchers to program all the code necessary to conduct proper evaluation, nor to do the computations by hand. Resources are readily available for every step of the process, including: evaluation metrics, statistical tests, re-sampling and, of course, data repositories. Pointers to these resources, and explanations on how to use them, are discussed in our book: “Evaluating Learning Algorithms: A Classification Perspective” by Japkowicz and Shah, Cambridge University Press, 2011.

174 Where to look for evaluation metrics?
Actually, as most people know, Weka is a great source not only for classifiers but also for the computation of results according to a variety of evaluation metrics. Its stratified cross-validation output reports:
- a summary: correctly and incorrectly classified instances, the Kappa statistic, K&B relative and absolute information scores, class complexity (order 0 and scheme), complexity improvement (Sf), mean absolute error, root mean squared error, relative absolute error, root relative squared error, and the total number of instances;
- detailed accuracy by class: TP rate, FP rate, precision, recall, F-measure and ROC area for each class (here, bad and good) and their weighted average;
- the confusion matrix.
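A similar evaluation can be produced from R through the RWeka package; the following is just an illustrative sketch (the iris data and the parameter choices are ours) running a 10-fold cross-validation of J48 with per-class and entropy-based statistics:

  library(RWeka)
  data(iris)
  j48 <- J48(Species ~ ., data = iris)
  evaluate_Weka_classifier(j48, numFolds = 10, class = TRUE,
                           complexity = TRUE, seed = 1)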

175 WEKA even performs ROC Analysis and draws Cost-Curves
Better graphical analysis packages are, however, available in R, notably the ROCR package, which permits the visualization of many versions of ROC and cost curves, as well as lift curves, precision-recall curves, and so on.
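A minimal ROCR sketch, with made-up scores and labels standing in for a classifier's output on a test set:

  library(ROCR)
  scores <- c(.9, .8, .7, .6, .55, .4, .35, .1)   # hypothetical positive-class scores
  labels <- c(1, 1, 0, 1, 0, 1, 0, 0)             # hypothetical true classes
  pred <- prediction(scores, labels)
  plot(performance(pred, "tpr", "fpr"))           # ROC curve
  plot(performance(pred, "prec", "rec"))          # precision-recall curve
  plot(performance(pred, "lift", "rpp"))          # lift chart
  performance(pred, "auc")@y.values[[1]]          # area under the ROC curve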

176 Where to look for Statistical Tests?
The R software package contains implementations of all the statistical tests discussed here, and many more; they are very simple to run. For example:

Paired t-test:
> c45 = c(.2456, .1754, .1754, .2632, .1579, .2456, .2105, .1404, .2632, .2982)
> nb  = c(.0702, .0702, .0175, .0702, .0702, .0526, .1579, .0351, .0351, .0702)
> t.test(c45, nb, paired = TRUE)
        Paired t-test
data:  c45 and nb
df = 9, p-value = 2.536e-05
alternative hypothesis: true difference in means is not equal to 0

Friedman test:
> classA = c(85.83, 85.91, 86.12, 85.82, 86.28, 86.42, 85.91, 86.10, 85.95, 86.12)
> classB = c(75.86, 73.18, 69.08, 74.05, 74.71, 65.90, 76.25, 75.10, 70.50, 73.95)
> classC = c(84.19, 85.90, 83.83, 85.11, 86.38, 81.20, 86.38, 86.75, 88.03, 87.18)
> t = matrix(c(classA, classB, classC), nrow = 10, byrow = FALSE)
> friedman.test(t)
        Friedman rank sum test
data:  t
Friedman chi-squared = 15, df = 2

Wilcoxon signed-rank test:
> c4510folds = c(3, 0, 2, 0, 2, 2, 2, 1, 1, 1)
> nb10folds  = c(1, 0, 0, 0, 0, 1, 0, 2, 0, 0)
> wilcox.test(nb10folds, c4510folds, paired = TRUE)
        Wilcoxon signed rank test with continuity correction
data:  nb10folds and c4510folds
V = 2.5
alternative hypothesis: true location shift is not equal to 0

177 Where to look for Re-sampling methods?
In our book, we have implemented all the re-sampling schemes described in this tutorial. The actual code is also available upon request (e-mail us at eval@mohakshah.com). For example, the e0 bootstrap:

e0Boot = function(iter, dataSet, setSize, dimension, classifier1, classifier2) {
  classifier1e0Boot <- numeric(iter)   # bootstrap accuracies of classifier 1
  classifier2e0Boot <- numeric(iter)   # bootstrap accuracies of classifier 2
  for (i in 1:iter) {
    # draw a bootstrap sample (with replacement) for training;
    # the instances never drawn form the test set (e0 bootstrap)
    Subsamp  <- sample(setSize, setSize, replace = TRUE)
    Basesamp <- 1:setSize
    oneTrain <- dataSet[Subsamp, 1:dimension]
    oneTest  <- dataSet[setdiff(Basesamp, Subsamp), 1:dimension]
    # train both classifiers on the same bootstrap sample
    classifier1model <- classifier1(class ~ ., data = oneTrain)
    classifier2model <- classifier2(class ~ ., data = oneTrain)
    # evaluate on the held-out instances (the accuracy is parsed out of
    # RWeka's evaluation summary string)
    classifier1eval <- evaluate_Weka_classifier(classifier1model, newdata = oneTest)
    classifier1acc  <- as.numeric(substr(classifier1eval$string, 70, 80))
    classifier2eval <- evaluate_Weka_classifier(classifier2model, newdata = oneTest)
    classifier2acc  <- as.numeric(substr(classifier2eval$string, 70, 80))
    classifier1e0Boot[i] = classifier1acc
    classifier2e0Boot[i] = classifier2acc
  }
  return(rbind(classifier1e0Boot, classifier2e0Boot))
}
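A hypothetical usage sketch (the data set, the renaming of the class column and the choice of classifiers are ours; note that the substr() positions used above to extract the accuracy depend on the exact formatting of Weka's evaluation summary):

  library(RWeka)
  NB <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")
  data(iris)
  names(iris)[5] <- "class"     # e0Boot expects the class column to be named "class"
  res <- e0Boot(iter = 10, dataSet = iris, setSize = nrow(iris),
                dimension = ncol(iris), classifier1 = J48, classifier2 = NB)
  rowMeans(res)                 # mean e0 accuracy per classifier (NA if the parsing needs adjusting)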

178 Some Concluding Remarks
The End

179 Thank you for attending the tutorial!!!
For more information, contact us at: eval@mohakshah.com

180 References Too many to list here, but see pp. 393-402 of the book.
If you want specific ones, come and see me at the end of the tutorial or send me an e-mail.

