
1 Non-Traditional Metrics: Evaluation measures from the medical diagnostic community; constructing new evaluation measures that combine metric and statistical information

2 Part I: Borrowing new performance evaluation measures from the medical diagnostic community (Marina Sokolova, Nathalie Japkowicz and Stan Szpakowicz)

3 The need to borrow new performance measures: an example. It has come to our attention that the performance measures commonly used in Machine Learning are not very good at assessing performance on problems in which the two classes are equally important. Accuracy takes both classes into account, but it does not distinguish between them. Other measures, such as Precision/Recall, F-Score and ROC Analysis, focus on only one class, without concern for performance on the other class.

4 Learning problems in which the classes are equally important. Examples of recent Machine Learning domains that require equal focus on both classes and a distinction between false positive and false negative rates are opinion/sentiment identification and classification of negotiations. An example of a traditional problem with the same requirements is medical diagnostic tests. What measures have researchers in the medical diagnostic test community used that we can borrow?

5 Performance measures in use in the Medical Diagnostic Community. Common performance measures in use in the Medical Diagnostic Community are:
- Sensitivity/Specificity (also in use in Machine Learning)
- Likelihood ratios
- Youden's Index
- Discriminant Power
[Biggerstaff, 2000; Blakeley & Oddone, 1995]

6 Sensitivity/Specificity. The sensitivity of a diagnostic test is P[+|D], i.e., the probability of obtaining a positive test result in the diseased population. The specificity of a diagnostic test is P[-|Ď], i.e., the probability of obtaining a negative test result in the disease-free population. Sensitivity and specificity are not that useful on their own, however, since in both the medical testing community and in Machine Learning one is really interested in P[D|+] (PVP: the Predictive Value of a Positive) and P[Ď|-] (PVN: the Predictive Value of a Negative). We can apply Bayes' Theorem to derive the PVP and PVN.
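As a minimal illustration (not from the original slides), sensitivity and specificity can be computed directly from confusion-matrix counts; the counts below are hypothetical.

```python
def sensitivity(tp, fn):
    """P[+|D]: probability of a positive test result in the diseased (positive) population."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """P[-|Ď]: probability of a negative test result in the disease-free (negative) population."""
    return tn / (tn + fp)

# Hypothetical counts: 90 true positives, 10 false negatives,
# 80 true negatives, 20 false positives.
print(sensitivity(90, 10))  # 0.9
print(specificity(80, 20))  # 0.8
```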

7 Deriving the PVPs and PVNs. The problem with deriving the PVP and PVN of a test is that, in order to derive them, we need to know P[D], the pre-test probability of the disease. This cannot be done directly. As usual, however, we can set ourselves in the context of the comparison of two tests (with P[D] being the same in both cases). Doing so, and using Bayes' Theorem:
P[D|+] = P[+|D] P[D] / (P[+|D] P[D] + P[+|Ď] P[Ď])
we can get the following relationships (see Biggerstaff, 2000):
P[D|+_Y] > P[D|+_X] ↔ ρ+_Y > ρ+_X
P[Ď|-_Y] > P[Ď|-_X] ↔ ρ-_Y < ρ-_X
where X and Y are two diagnostic tests, +_X and -_X stand for test X confirming the presence and the absence of the disease, respectively (and similarly for +_Y and -_Y), and ρ+ and ρ- are the likelihood ratios defined on the next slide.
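As a sketch (not from the slides), if the prevalence P[D] were known, Bayes' Theorem would give the predictive values directly; the prevalence below is hypothetical and only illustrates why a low prevalence makes the PVP small even for a good test.

```python
def pvp(sens, spec, prevalence):
    """P[D|+] = P[+|D]P[D] / (P[+|D]P[D] + P[+|Ď]P[Ď])."""
    return (sens * prevalence) / (sens * prevalence + (1 - spec) * (1 - prevalence))

def pvn(sens, spec, prevalence):
    """P[Ď|-] = P[-|Ď]P[Ď] / (P[-|Ď]P[Ď] + P[-|D]P[D])."""
    return (spec * (1 - prevalence)) / (spec * (1 - prevalence) + (1 - sens) * prevalence)

# With sensitivity 0.9, specificity 0.8 and a hypothetical 5% prevalence:
print(round(pvp(0.9, 0.8, 0.05), 2))  # ~0.19: most positive tests are false alarms
print(round(pvn(0.9, 0.8, 0.05), 3))  # ~0.993
```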

8 Likelihood Ratios. ρ+ and ρ- are actually easy to derive. The likelihood ratio of a positive test is ρ+ = P[+|D] / P[+|Ď], i.e. the ratio of the true positive rate to the false positive rate. The likelihood ratio of a negative test is ρ- = P[-|D] / P[-|Ď], i.e. the ratio of the false negative rate to the true negative rate. Note: we want to maximize ρ+ and minimize ρ-. This means that, even though we cannot calculate the PVP and PVN directly, we can get the information we need to compare two tests through the likelihood ratios.
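A small sketch of both ratios in code; plugging in the SVM sensitivity (86.8%) and specificity (65.4%) reported in the table on slide 10 reproduces that table's likelihood ratios.

```python
def pos_likelihood_ratio(sens, spec):
    return sens / (1 - spec)      # rho+ = P[+|D] / P[+|Ď]: true positive rate over false positive rate

def neg_likelihood_ratio(sens, spec):
    return (1 - sens) / spec      # rho- = P[-|D] / P[-|Ď]: false negative rate over true negative rate

print(round(pos_likelihood_ratio(0.868, 0.654), 2))  # 2.51 (to be maximized)
print(round(neg_likelihood_ratio(0.868, 0.654), 2))  # 0.2  (to be minimized)
```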

9 Youden's Index and Discriminant Power. Youden's Index measures the avoidance of failure of an algorithm, while Discriminant Power evaluates how well an algorithm distinguishes between positive and negative examples.
Youden's Index: γ = sensitivity - (1 - specificity) = P[+|D] - (1 - P[-|Ď])
Discriminant Power: DP = (√3/π)(log X + log Y), where X = sensitivity/(1 - sensitivity) and Y = specificity/(1 - specificity)
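A short sketch of both formulas, assuming the natural logarithm in DP; with the SVM sensitivity and specificity from the table on slide 10 it reproduces the Youden and DP values reported there.

```python
import math

def youden_index(sens, spec):
    return sens - (1 - spec)      # gamma = sensitivity + specificity - 1

def discriminant_power(sens, spec):
    x = sens / (1 - sens)         # odds of a positive result in the diseased group
    y = spec / (1 - spec)         # odds of a negative result in the disease-free group
    return (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y))

# SVM values from the comparison table on slide 10:
print(round(youden_index(0.868, 0.654), 3))        # 0.522
print(round(discriminant_power(0.868, 0.654), 2))  # 1.39
```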

10 Comparison of the various measures on the outcome of e-negotiation

Measure               SVM    N. Bayes
Accuracy              77.4   76.8
F-Score               81.2   78.9
Sensitivity           86.8   77.5
Specificity           65.4   75.9
AUC                   76.1   76.7
Youden                .522   .534
Pos. Likelihood       2.51   3.22
Neg. Likelihood       .2     .3
Discriminant Power    1.39   1.31

DP is below 3 → insignificant.

11 What does this all mean? Traditional ML measures:

Classifier   Overall effectiveness (Accuracy) / Predictive Power (Precision)   Effectiveness on a class, a-posteriori (sensitivity/specificity)
SVM          Superior                                                          Superior on pos examples
NB           Inferior                                                          Superior on neg examples

12 What does this all mean? New measures that are more appropriate for problems where both classes are equally important:

Classifier   Avoidance of failure (Youden)   Effectiveness on a class, a-priori (Likelihood Ratios)   Discrimination of classes (Discriminant Power)
SVM          Inferior                        Superior on neg examples                                 Limited
NB           Superior                        Superior on pos examples                                 Limited

13 Part I: Discussion. The variety of results obtained with the different measures suggests two conclusions:
1. It is very important for practitioners of Machine Learning to understand their domain deeply, to understand what it is, exactly, that they want to evaluate, and to reach their goal using appropriate measures (existing or new ones).
2. Since some of the results are very close to each other, it is important to establish reliable confidence tests to find out whether or not these results are significant.

14 Part II: Constructing new evaluation measures (William Elamzeh, Nathalie Japkowicz and Stan Matwin)

15 Motivation for our new evaluation method. ROC Analysis alone and its associated AUC measure do not assess the performance of classifiers adequately, since they omit any information regarding the confidence of these estimates. Though the identification of the significant portion of a ROC Curve is an important step towards a more useful assessment, this analysis remains biased in favour of the large class in the case of severe imbalances. We would like to combine the information provided by the ROC Curve with information regarding how balanced the classifier is with regard to the misclassification of positive and negative examples.

16 ROC's bias in the case of severe class imbalances. The ROC Curve for the positive class plots the true positive rate a/(a+b) against the false positive rate c/(c+d). When the number of positive examples is significantly lower than the number of negative examples (a+b << c+d), a/(a+b) climbs faster than c/(c+d) as we change the class probability threshold, so ROC gives the majority class (-) an unfair advantage. Ideally, a classifier should classify both classes proportionally.

Confusion Matrix:
         Pred+   Pred-   Total
Class+   a       b       a+b
Class-   c       d       c+d
Total    a+c     b+d     n
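A small sketch (with hypothetical counts, not from the slides) of how the same absolute number of errors translates into very different rates under severe imbalance:

```python
def roc_point(a, b, c, d):
    """Return (false positive rate, true positive rate) in the slide's notation."""
    tpr = a / (a + b)     # climbs quickly when a+b is small
    fpr = c / (c + d)     # climbs slowly when c+d is large
    return fpr, tpr

# Hypothetical severe imbalance: 20 positives vs 1000 negatives.
# Ten errors on each class look very different as rates: FPR 0.01 vs TPR 0.5.
print(roc_point(a=10, b=10, c=10, d=990))   # (0.01, 0.5)
```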

17 Correcting for ROC's bias in the case of severe class imbalances. Though we keep ROC as a performance evaluation measure, since rate information is useful, for confidence estimation we propose to favour classifiers that make a similar number of errors in both classes. More specifically, as in the Tango test, we favour classifiers that have a lower difference in classification errors between the two classes, (b-c)/n. This quantity (b-c)/n is interesting not just for confidence estimation, but also as an evaluation measure in its own right (confusion-matrix notation as on the previous slide).
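A minimal sketch of the normalized error difference; the counts below are hypothetical.

```python
def error_difference(b, c, n):
    """(b - c)/n: b = false negatives, c = false positives, n = total examples."""
    return (b - c) / n

# A classifier with balanced errors on both classes scores near zero:
print(error_difference(b=10, c=10, n=1030))   # 0.0
print(error_difference(b=50, c=5,  n=1030))   # ~0.044, penalized
```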

18 Proposed evaluation method for severely imbalanced data sets. Our method consists of five steps (a rough code sketch follows the list):
1. Generate a ROC Curve R for a classifier K applied to data D.
2. Apply Tango's confidence test in order to identify the confident segments of R.
3. Compute CAUC, the area under the confident ROC segment.
4. Compute AveD, the average normalized difference (b-c)/n over all points in the confident ROC segment.
5. Plot CAUC against AveD: an effective classifier shows low AveD and high CAUC.
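A rough sketch of steps 2 through 5, assuming a hypothetical placeholder tango_confident_segment() in place of Tango's actual confidence test, which is not reproduced here:

```python
def tango_confident_segment(points):
    # Hypothetical placeholder: a real implementation would apply Tango's
    # confidence test and keep only the statistically confident ROC points.
    return points

def evaluate_confident_roc(points):
    """points: list of (fpr, tpr, b, c, n) tuples sorted by increasing fpr."""
    confident = tango_confident_segment(points)                      # step 2
    # Step 3: CAUC, trapezoidal area under the confident ROC segment.
    cauc = sum((p2[0] - p1[0]) * (p1[1] + p2[1]) / 2
               for p1, p2 in zip(confident, confident[1:]))
    # Step 4: AveD, the average normalized error difference (b - c)/n.
    ave_d = sum((b - c) / n for _, _, b, c, n in confident) / len(confident)
    # Step 5: plot CAUC against AveD; high CAUC with low AveD is best.
    return cauc, ave_d
```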

19 Experiments and expected results. We considered 6 imbalanced domains from UCI. The most imbalanced one contained only 1.4% examples in the small class, while the least imbalanced one had as many as 26%. We ran 4 classifiers: Decision Stumps, Decision Trees, Random Forests and Naïve Bayes. We expected the following results:
- weak performance from the Decision Stumps,
- stronger performance from the Decision Trees,
- even stronger performance from the Random Forests (these three belong to the same family of learners),
- reasonably good performance from Naïve Bayes, but with no idea of how it would compare to the tree family of learners.

20 Results using our new method: our expectations are met. Decision Stumps perform the worst, followed by Decision Trees and then Random Forests (in most cases).
Surprise 1: Decision Trees outperform Random Forests in the two most balanced data sets.
Surprise 2: Naïve Bayes consistently outperforms Random Forests.
Note: classifiers in the top left corner (high CAUC, low AveD) outperform those in the bottom right corner.

21 AUC results. Our more informed results contradict the AUC results, which claim that:
- Decision Stumps are sometimes as good as or superior to Decision Trees (!),
- Random Forests outperform all other systems in all but one case.

22 Part II: Discussion. In order to better understand the performance of classifiers on various domains, it can be useful to consider several aspects of the evaluation simultaneously. In order to do so, it might be useful to create specific measures adapted to the purpose of the evaluation. In our case, above, our evaluation measure allowed us to study the tradeoff between the classification-error difference and the area under the confident segment of the ROC curve, thus producing more reliable results.

