Class Imbalance vs. Cost-Sensitive Learning Presenter: Hui Li University of Ottawa
Contents Introduction Making Classifiers Balanced Making Classifiers Cost-Sensitive –MetaCost –Stratification
Class Imbalance vs. Asymmetric Misclassification Costs Most algorithms assume that the data sets are balanced and that all errors have the same cost This is seldom true –In database marketing, the cost of mailing to a non-respondent is very small, but the cost of not mailing to someone who would respond is the entire profit lost Both class imbalance and the cost of misclassification should be considered
Class Imbalance vs. Asymmetric Misclassification Costs Class imbalance: one class occurs much more often than the other Asymmetric misclassification costs: the cost of misclassifying an example from one class is much larger than the cost of misclassifying an example from the other class One way to correct for imbalance: train a cost-sensitive classifier with the misclassification cost of the minority class greater than that of the majority class One way to make an algorithm cost-sensitive: intentionally imbalance the training set
Making Classifiers Cost-Sensitive Substantial work has gone into making individual algorithms cost-sensitive A better solution would be a procedure that converts a broad variety of classifiers into cost-sensitive ones –Stratification: change the frequency of classes in the training data in proportion to their cost Shortcomings –It distorts the distribution of examples –If it is done by under-sampling, it reduces the data available for learning –If it is done by over-sampling, it increases learning time –MetaCost: a general method for making classifiers cost-sensitive
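The two stratification schemes above can be sketched in a few lines (a minimal, stdlib-only illustration for the two-class case; the function name and `ratio` parameter are mine, not from the slides):

```python
import random

def stratify(examples, labels, minority, ratio, mode="under"):
    """Rebalance a two-class training set in proportion to cost.

    mode="under": keep only a 1/ratio fraction of the majority class
                  (loses training data).
    mode="over":  duplicate each minority example `ratio` (integer) times
                  (increases learning time).
    """
    minority_set = [(x, y) for x, y in zip(examples, labels) if y == minority]
    majority_set = [(x, y) for x, y in zip(examples, labels) if y != minority]
    if mode == "under":
        k = max(1, len(majority_set) // ratio)
        majority_set = random.sample(majority_set, k)
    else:
        minority_set = minority_set * ratio
    resampled = minority_set + majority_set
    random.shuffle(resampled)
    xs, ys = zip(*resampled)
    return list(xs), list(ys)
```

Either way the class frequencies end up skewed by the cost ratio, which is exactly the distortion of the example distribution the slide warns about.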
MetaCost Wraps a cost-minimizing "meta-learning" stage around the classifier Treats the underlying classifier as a black box, requiring no knowledge of its functioning or changes to it Applicable to any number of classes and to arbitrary cost matrices Always produces large cost reductions compared to the cost-blind classifier
MetaCost Conditional risk R(i|x) is the expected cost of predicting that x belongs to class i –R(i|x) = ∑_j P(j|x)C(i, j) The Bayes optimal prediction is guaranteed to achieve the lowest possible overall cost The goal of the MetaCost procedure is to relabel the training examples with their "optimal" classes Therefore, we need a way to estimate the class probabilities P(j|x) –Learn multiple classifiers and, for each example, use each class's fraction of the total vote as an estimate of its probability given the example –Reason: most modern learners are highly unstable, in that applying them to slightly different training sets tends to produce very different models, and correspondingly different predictions for the same examples, while overall accuracy remains broadly unchanged. Accuracy can be much improved by learning several models in this way and then combining their predictions, for example by voting
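The relabeling rule follows directly from the risk equation (a minimal sketch; `probs` stands in for the ensemble-vote estimates of P(j|x), and `costs[i][j]` is C(i, j), the cost of predicting class i when the true class is j):

```python
def conditional_risk(probs, costs, i):
    """R(i|x) = sum over j of P(j|x) * C(i, j)."""
    return sum(probs[j] * costs[i][j] for j in range(len(probs)))

def bayes_optimal_class(probs, costs):
    """The class minimizing conditional risk -- the 'optimal' relabel."""
    return min(range(len(probs)),
               key=lambda i: conditional_risk(probs, costs, i))

# With P = [0.8, 0.2] and C = [[0, 10], [1, 0]] (predicting class 0 when
# the truth is class 1 costs 10), class 0 loses despite its higher
# probability: R(0|x) = 2.0, R(1|x) = 0.8, so the optimal label is 1.
```

This is why cost-sensitive relabeling can flip examples away from their most probable class: a rare but expensive error dominates the expected cost.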
MetaCost procedure Form multiple bootstrap replicates of the training set Learn a classifier on each replicate Estimate each class's probability for each example as the fraction of votes it receives from the ensemble Use the conditional risk equation to relabel each training example with the estimated optimal class Reapply the classifier to the relabeled training set
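The five steps above can be sketched end to end (an illustrative, stdlib-only sketch; the `learn` callable standing in for the black-box learner, and the parameter names, are assumptions of mine, not part of the slides):

```python
import random

def metacost(X, y, learn, costs, n_resamples=10, seed=0):
    """Sketch of the MetaCost procedure. `learn(X, y)` is any black-box
    training routine returning a predict(x) callable; `costs[i][j]` is
    C(i, j), the cost of predicting class i when the true class is j."""
    rng = random.Random(seed)
    n, n_classes = len(X), len(costs)
    models = []
    for _ in range(n_resamples):
        # Steps 1-2: bootstrap replicate, then learn a classifier on it
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(learn([X[i] for i in idx], [y[i] for i in idx]))
    relabeled = []
    for x in X:
        # Step 3: fraction of ensemble votes estimates P(j|x)
        votes = [0] * n_classes
        for m in models:
            votes[m(x)] += 1
        probs = [v / n_resamples for v in votes]
        # Step 4: relabel with the class minimizing R(i|x)
        risk = lambda i: sum(probs[j] * costs[i][j] for j in range(n_classes))
        relabeled.append(min(range(n_classes), key=risk))
    # Step 5: reapply the learner to the relabeled training set
    return learn(X, relabeled)
```

Note the learner is called only through `learn(...)` and `m(x)`, which is what lets MetaCost treat it as a black box.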
Evaluation of MetaCost Does MetaCost reduce cost compared to the error-based classifier and to stratification? 27 databases from the UCI repository: 15 multiclass databases, 12 two-class databases C4.5 decision tree learner C4.5Rules post-processor Randomly select 2/3 of the examples in each database for training, using the remaining 1/3 for measuring the cost of the predictions Results are the average of 20 such runs
MultiClass Problems Experiments were conducted with two different types of cost model. 1.Fixed interval model Each C(i, i) was chosen randomly from a uniform distribution in the [0, 1000] interval Each C(i, j), i ≠ j, was chosen randomly from the fixed interval [0, 10000] Different costs were generated for each of the 20 runs conducted on each database 2.Class probability-dependent model Same C(i, i) as in model 1 Each C(i, j), i ≠ j, was chosen with uniform probability from the interval [0, 2000P(i)/P(j)], P(i) and P(j) are the probabilities of occurrence of classes i and j in the training set This means that the highest costs are for misclassifying a rare class as a frequent one, as in the database marketing domains.
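The two cost models above can be reproduced directly (a sketch; `rng` is any `random.Random` instance, and the function names are mine):

```python
import random

def fixed_interval_costs(n_classes, rng):
    """Model 1: C(i,i) ~ U[0, 1000]; C(i,j), i != j, ~ U[0, 10000]."""
    return [[rng.uniform(0, 1000) if i == j else rng.uniform(0, 10000)
             for j in range(n_classes)] for i in range(n_classes)]

def probability_dependent_costs(class_probs, rng):
    """Model 2: same diagonal; C(i,j), i != j, ~ U[0, 2000*P(i)/P(j)].
    Misclassifying a rare class j as a frequent class i draws from the
    widest interval, as in the database marketing domain."""
    n = len(class_probs)
    return [[rng.uniform(0, 1000) if i == j else
             rng.uniform(0, 2000 * class_probs[i] / class_probs[j])
             for j in range(n)] for i in range(n)]
```

With class probabilities [0.9, 0.1], for example, the interval for C(0, 1) (rare true class mistaken for the frequent one) is [0, 18000], while C(1, 0) is confined to roughly [0, 222].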
MultiClass Problems In the fixed interval case –Neither form of stratification is very effective in reducing costs –MetaCost reduces costs compared to C4.5R and under-sampling in all but one database –MetaCost reduces costs compared to over-sampling in all but three In the probability-dependent case –Both under-sampling and over-sampling reduce cost compared to C4.5R in 12 of the 15 databases –MetaCost achieves lower costs than C4.5R and both forms of stratification in all 15 databases –The average cost reduction obtained by MetaCost compared to C4.5R is approximately twice as large as that obtained by under-sampling, and five times that of over-sampling Conclusion: MetaCost is the cost-reduction method of choice for multiclass problems
Two-class Problems Cost model –Let 1 be the minority class and 2 the majority class –C(1, 1) = C(2, 2) = 0 –C(1, 2) = 1000 –C(2, 1) = 1000r, where r was set alternately to 2, 5, and 10 Result –Over-sampling is not very effective in reducing cost for any of the cost ratios –Under-sampling is effective for r = 5 and r = 10, but not for r = 2 –MetaCost reduces costs compared to C4.5R, under-sampling, and over-sampling on almost all databases, for all cost ratios Conclusion: MetaCost is the cost-reduction method of choice for two-class problems
Lesion Studies of MetaCost 1. Q: How sensitive are the results to the number of resamples used? E: using 20 and 10 resamples instead of 50 R: cost increases as the number of resamples decreases, but only gradually There is no significant difference between the costs obtained with m=50 and m=20 With m=10, MetaCost still reduces costs compared to C4.5R and both forms of stratification in almost all datasets
Lesion Studies of MetaCost 2. Q: Would it be enough to simply use the class probabilities produced by a single run of the error-based classifier on the full training set? E: relabeling the training examples using the class probabilities produced by a single run of C4.5R on all the data (labeled "C4 Probs") R: It produces worse results than MetaCost and under-sampling in almost all datasets, for all cost ratios It still outperforms over-sampling and C4.5R
Lesion Studies of MetaCost 3. Q: How well would MetaCost do if the class probabilities produced by C4.5R were ignored, and the probability of a class was estimated simply as the fraction of models that predicted it? E: ignoring the class probabilities produced by C4.5R (labeled "0-1 votes") R: Cost increases in a majority of the datasets, but the relative differences are generally minor MetaCost still outperforms the other methods in a large majority of the datasets, for all cost ratios
Lesion Studies of MetaCost 4. Q: Would MetaCost perform better if all models were used in relabeling an example, irrespective of whether the example was used to learn them or not? E: using all models in relabeling an example (labeled "all Ms") R: Cost decreases for r=10 but increases for r=5 and r=2 In all three cases the relative differences are generally minor, and the performance vs. C4.5R and stratification is generally similar
Lesion Studies of MetaCost
Problem with MetaCost MetaCost increases learning time compared to the error-based classifier Reason: MetaCost increases time by a fixed factor, which is approximately the number of resamples Solutions –Parallelize the multiple runs of the error-based classifier –Use resamples that are smaller than the original training set
Stratification Baseline method: C4.5 combined with under-sampling or over-sampling Performance analysis technique: cost curves
Cost Curve The expected cost of a classifier is represented explicitly Easy to understand Allows the experimenter to immediately see the range of costs Allows the experimenter to see where a particular classifier is the best and, quantitatively, how much better it is than other classifiers
Cost Curve X-axis: probability cost function for positive examples PCF(+) = w+ / (w+ + w-) Y-axis: expected cost normalized with respect to the cost incurred when every example is incorrectly classified NE[C] = ((1 - TP) w+ + FP w-) / (w+ + w-) Note: w+ = p(+)C(-|+) w- = p(-)C(+|-) p(a): probability of a given example being in class a C(a|b): cost incurred if an example in class b is misclassified as being in class a Interpretation: the expected cost of a classifier across all possible choices of misclassification costs and class distributions
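The two axes can be computed from a classifier's (TP, FP) rates and an operating condition (a minimal sketch; the function and parameter names are mine, with `c_fn` = C(-|+) and `c_fp` = C(+|-)):

```python
def cost_curve_point(TP, FP, p_pos, c_fn, c_fp):
    """Return (PCF(+), NE[C]) for one classifier at one operating condition."""
    w_pos = p_pos * c_fn          # w+ = p(+) C(-|+)
    w_neg = (1 - p_pos) * c_fp    # w- = p(-) C(+|-)
    pcf = w_pos / (w_pos + w_neg)
    ne = ((1 - TP) * w_pos + FP * w_neg) / (w_pos + w_neg)
    return pcf, ne
```

As a sanity check, the trivial classifier that labels everything negative (TP = FP = 0) has NE[C] = PCF(+), the rising diagonal of the cost-curve plot, for every choice of costs and class distribution.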
Comparing the Sampling Schemes Data set: Sonar data set, 208 instances; 111 mines and 97 rocks, 60 features Bold dashed curve: performance of C4.5 using under-sampling Bold continuous curve: over-sampling
Comparing the Sampling Schemes Data set: Japanese credit data set, 690 instances; 307 positive and 383 negative, 15 features Bold dashed curve: performance of C4.5 using under-sampling Bold continuous curve: over-sampling
Comparing the Sampling Schemes Data set: breast cancer data set, 286 instances; 201 non-recurrences and 85 recurrences, 9 features Bold dashed curve: performance of C4.5 using under-sampling Bold continuous curve: over-sampling
Comparing the Sampling Schemes Data set: sleep data set, 840 instances; 100 1’s and 140 2’s, 15 features Bold dashed curve: performance of C4.5 using under-sampling Bold continuous curve: over-sampling
Comparing the Sampling Schemes Under-sampling produces a cost curve that is reasonably cost sensitive Over-sampling produces a cost curve that is less sensitive; the performance varies little from that at the data set's original frequency The under-sampling scheme outperforms the over-sampling scheme
Investigating Over-sampling Curves Over-sampling prunes less and thus produces specialization, narrowing the region surrounding instances of the more common class as their number is increased; it therefore generalizes less than under-sampling For data sets where there was appreciable pruning at the original frequency, over-sampling produced some overall cost sensitivity Disabling the stopping criterion removes the small additional sensitivity shown at the ends of the curves
Turn off Pruning
Disable stopping criterion
Investigating Under-sampling Curves Disabling pruning and the early stopping criterion makes no real change in under-sampling The curve still maintains roughly the same shape, not becoming as straight as the one produced by over-sampling
Disable different features of C4.5
Turn off Pruning
Comparing Weighting and Sampling Two weighting schemes –Up-weighting, analogous to over-sampling, increases the weight of one of the classes while keeping the weight of the other class at one –Down-weighting, analogous to under-sampling, decreases the weight of one of the classes while keeping the weight of the other class at one If we represent misclassification costs and class frequencies by means of internal weights within C4.5, disabling these features does make a difference –The curve for up-weighting is very close to that for over-sampling –The curve for down-weighting is close to, but sometimes better than, that for under-sampling –Turning off pruning and then the stopping criterion produces a curve that is very straight
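The two weighting schemes amount to per-instance weights handed to the learner (an illustrative sketch of the analogy only; C4.5's internal weighting machinery is not reproduced, and the function name and `factor` parameter are mine):

```python
def class_weights(labels, target, factor, mode="up"):
    """Per-instance weights mimicking sampling inside the learner.

    mode="up":   weight class `target` by `factor`, the other class by 1
                 (analogous to over-sampling the target class).
    mode="down": weight the *other* class by 1/factor, `target` by 1
                 (analogous to under-sampling the other class).
    """
    if mode == "up":
        return [float(factor) if y == target else 1.0 for y in labels]
    return [1.0 if y == target else 1.0 / factor for y in labels]
```

The key difference the slides draw out is that down-weighted instances are still present during tree growth, whereas under-sampling removes them entirely.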
Comparing Weighting and Sampling If the performance curves of under-sampling and down-weighting are similar, and disabling these internal factors makes down-weighting similar to up-weighting, why do the factors not have the same effect when under-sampling? Explanation –Much of the cost sensitivity when under-sampling comes from the actual removal of instances –When we turned off many of the factors when down-weighting, the branch was still grown and the region still labeled –When the instances are removed from the training set, this cannot happen
Comparing Weighting and Sampling
Disabling Factors when Down-weighting
Improving Over-sampling As over-sampling tends to disable pruning and other factors, perhaps we should increase their influence Experiment –Set an over-sampling ratio; the stopping criterion is then set to 2 times this ratio, and the pruning confidence factor to 0.25 divided by the ratio –Change the default factors for over-sampling of the Sonar data set One class is over-sampled by 2.5 times The stopping criterion is then 5 The pruning confidence factor is 0.1 –Result Increasing the factors in proportion to the number of duplicates in the training set does indeed have the desired effect
Conclusion MetaCost is applicable to any number of classes and to arbitrary cost matrices MetaCost always produces large cost reductions compared to the cost-blind classifier Using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison Under-sampling produces a reasonable sensitivity to changes in misclassification costs and class distribution Over-sampling shows little sensitivity; there is often little difference in performance when misclassification costs are changed Over-sampling can be made cost-sensitive if the pruning and early stopping parameters are set in proportion to the amount of over-sampling that is done The extra computational cost of using over-sampling is unwarranted, as the performance achieved is, at best, the same as under-sampling
References  Domingos, P. MetaCost: A General Method for Making Classifiers Cost-Sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD 1999).  Drummond, C., and Holte, R. C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-sampling Beats Over-sampling. In Workshop on Learning from Imbalanced Data Sets II (2003).  Drummond, C., and Holte, R. C. Explicitly Representing Expected Cost: An Alternative to ROC Representation. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (2000).  Turney, P. Cost-Sensitive Learning Bibliography. Online bibliography, Institute for Information Technology, National Research Council of Canada, Ottawa, 1997.