Presentation on theme: "Class Imbalance vs. Cost-Sensitive Learning"— Presentation transcript:
2 Class Imbalance vs. Cost-Sensitive Learning Presenter: Hui Li, University of Ottawa
3 Contents Introduction; Making Classifiers Balanced; Making Classifiers Cost-Sensitive; MetaCost; Stratification
4 Class Imbalance vs. Asymmetric Misclassification Costs Most algorithms assume that the data set is balanced and that all errors have the same cost. This is seldom true. In database marketing, the cost of mailing to a non-respondent is very small, but the cost of not mailing to someone who would respond is the entire profit lost. Both class imbalance and the cost of misclassification should be considered.
5 Class Imbalance vs. Asymmetric Misclassification Costs Class imbalance: one class occurs much more often than the other. Asymmetric misclassification costs: the cost of misclassifying an example from one class is much larger than the cost of misclassifying an example from the other class. One way to correct for imbalance: train a cost-sensitive classifier with the misclassification cost of the minority class greater than that of the majority class. One way to make an algorithm cost-sensitive: intentionally imbalance the training set.
6 Making Classifiers Balanced Baseline methods: random over-sampling; random under-sampling. Under-sampling methods: Tomek links; Condensed Nearest Neighbor (CNN) rule; one-sided selection; CNN + Tomek links; Neighborhood Cleaning Rule. Over-sampling method: SMOTE. Combinations of over-sampling with under-sampling: SMOTE + Tomek links; SMOTE + ENN.
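The two baseline schemes on the slide above, random over- and under-sampling, can be sketched in a few lines of plain Python (the function names are illustrative, not from the slides):

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly discard examples from the larger classes until all classes
    have as many examples as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, n_min):
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

def random_oversample(X, y, seed=0):
    """Randomly duplicate examples from the smaller classes until all classes
    have as many examples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_max = max(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        Xb.extend(rows)
        yb.extend([label] * len(rows))
        for xi in rng.choices(rows, k=n_max - len(rows)):
            Xb.append(xi)
            yb.append(label)
    return Xb, yb
```

Note the asymmetry the later slides exploit: under-sampling loses information (instances are removed), while over-sampling only repeats information (instances are duplicated).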
7 Making Classifiers Cost-Sensitive Substantial work has gone into making individual algorithms cost-sensitive. A better solution would be a procedure that converts a broad variety of classifiers into cost-sensitive ones. Stratification: change the frequency of classes in the training data in proportion to their cost. Shortcomings: it distorts the distribution of examples; if done by under-sampling, it reduces the data available for learning; if done by over-sampling, it increases learning time. MetaCost: a general method for making classifiers cost-sensitive.
8 MetaCost Wraps a cost-minimizing "meta-learning" stage around the classifier. Treats the underlying classifier as a black box, requiring no knowledge of its functioning or change to it. Applicable to any number of classes and to arbitrary cost matrices. Almost always produces large cost reductions compared to the cost-blind classifier.
9 MetaCost The conditional risk R(i|x) is the expected cost of predicting that x belongs to class i: R(i|x) = ∑j P(j|x) C(i, j). The Bayes optimal prediction is guaranteed to achieve the lowest possible overall cost. The goal of the MetaCost procedure is to relabel the training examples with their "optimal" classes. Therefore, we need a way to estimate the class probabilities P(j|x): learn multiple classifiers and, for each example, use each class's fraction of the total vote as the estimate of its probability given the example. Reason: most modern learners are highly unstable, in that applying them to slightly different training sets tends to produce very different models, and correspondingly different predictions for the same examples, while the overall accuracy remains broadly unchanged. Accuracy can be much improved by learning several such models and combining their predictions, for example by voting.
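The conditional-risk rule above can be written directly; here cost[i][j] is C(i, j), the cost of predicting class i when the true class is j (a minimal sketch, not the paper's code):

```python
def conditional_risk(probs, cost):
    """R(i|x) = sum over j of P(j|x) * C(i, j), for each candidate prediction i.
    probs[j] is the estimated P(j|x); cost[i][j] is C(i, j)."""
    return [sum(p * cost[i][j] for j, p in enumerate(probs))
            for i in range(len(cost))]

def bayes_optimal(probs, cost):
    """Predict the class i that minimizes the conditional risk R(i|x)."""
    risks = conditional_risk(probs, cost)
    return min(range(len(risks)), key=risks.__getitem__)
```

For example, with P(0|x) = 0.8, P(1|x) = 0.2 and cost matrix [[0, 10], [1, 0]], the risk of predicting 0 is 0.2 * 10 = 2.0 while the risk of predicting 1 is 0.8 * 1 = 0.8, so the rare class 1 is chosen even though class 0 is more probable.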
10 MetaCost procedure 1. Form multiple bootstrap replicates of the training set. 2. Learn a classifier on each replicate. 3. Estimate each class's probability for each example by the fraction of votes it receives from the ensemble. 4. Use the conditional risk equation to relabel each training example with the estimated optimal class. 5. Reapply the classifier to the relabeled training set.
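The five steps can be sketched end to end for the two-class case. The toy threshold "stump" below merely stands in for C4.5 (MetaCost treats the learner as a black box), and all names are illustrative:

```python
import random

def train_stump(X, y):
    """Toy base learner: best 1-D threshold classifier (a stand-in for C4.5)."""
    best = None
    for t in sorted(set(X)):
        for lo, hi in ((0, 1), (1, 0)):
            acc = sum((lo if xi <= t else hi) == yi for xi, yi in zip(X, y))
            if best is None or acc > best[0]:
                best = (acc, t, lo, hi)
    _, t, lo, hi = best
    return lambda xi: lo if xi <= t else hi

def metacost(X, y, cost, n_classes=2, m=20, seed=0):
    rng = random.Random(seed)
    n = len(X)
    # Steps 1-2: form m bootstrap replicates and learn a classifier on each.
    models = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    # Steps 3-4: estimate P(j|x) by vote fractions, then relabel each example
    # with the class minimizing R(i|x) = sum_j P(j|x) C(i, j).
    relabeled = []
    for xi in X:
        probs = [0.0] * n_classes
        for mdl in models:
            probs[mdl(xi)] += 1.0 / m
        risks = [sum(p * cost[i][j] for j, p in enumerate(probs))
                 for i in range(n_classes)]
        relabeled.append(min(range(n_classes), key=risks.__getitem__))
    # Step 5: reapply the learner to the relabeled training set.
    return train_stump(X, relabeled)
```

With a symmetric cost matrix the relabeling reduces to majority voting, so the final model matches the original labels; an asymmetric matrix shifts the relabeling (and hence the learned threshold) toward the expensive class.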
10 Evaluation of MetaCost Does MetaCost reduce cost compared to the error-based classifier and to stratification? 27 databases from the UCI repository: 15 multiclass and 12 two-class. Learner: the C4.5 decision tree learner with the C4.5Rules post-processor. Randomly select 2/3 of the examples in each database for training, using the remaining 1/3 to measure the cost of the predictions. Results are the average of 20 such runs.
11 Multiclass Problems Experiments were conducted with two different cost models. Fixed interval model: each C(i, i) was chosen randomly from a uniform distribution over the interval [0, 1000]; each C(i, j), i ≠ j, was chosen randomly from the fixed interval [0, 10000]. Different costs were generated for each of the 20 runs conducted on each database. Class probability-dependent model: same C(i, i) as in model 1; each C(i, j), i ≠ j, was chosen with uniform probability from the interval [0, 2000 P(i)/P(j)], where P(i) and P(j) are the probabilities of occurrence of classes i and j in the training set. This means the highest costs are for misclassifying a rare class as a frequent one, as in the database marketing domain.
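The two cost models can be generated as follows (a sketch; the function and parameter names are mine, and cost[i][j] is C(i, j), the cost of predicting class i when the true class is j):

```python
import random

def fixed_interval_costs(k, rng):
    """Model 1: C(i, i) ~ U[0, 1000]; C(i, j) ~ U[0, 10000] for i != j."""
    return [[rng.uniform(0, 1000) if i == j else rng.uniform(0, 10000)
             for j in range(k)] for i in range(k)]

def probability_dependent_costs(k, p, rng):
    """Model 2: C(i, i) ~ U[0, 1000]; C(i, j) ~ U[0, 2000 * P(i)/P(j)] for i != j.
    When j is rare and i is frequent, P(i)/P(j) is large, so misclassifying a
    rare class as a frequent one draws from a wide (expensive) interval."""
    return [[rng.uniform(0, 1000) if i == j
             else rng.uniform(0, 2000 * p[i] / p[j])
             for j in range(k)] for i in range(k)]
```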
12 Multiclass Problems In the fixed interval case: neither form of stratification is very effective in reducing costs; MetaCost reduces costs compared to C4.5R and under-sampling in all but one database, and compared to over-sampling in all but three. In the probability-dependent case: both under-sampling and over-sampling reduce cost compared to C4.5R in 12 of the 15 databases; MetaCost achieves lower costs than C4.5R and both forms of stratification in all 15 databases. The average cost reduction obtained by MetaCost compared to C4.5R is approximately twice as large as that obtained by under-sampling, and five times that of over-sampling. Conclusion: MetaCost is the cost-reduction method of choice for multiclass problems.
14 Two-class Problems Cost model: let 1 be the minority class and 2 the majority class; C(1, 1) = C(2, 2) = 0; C(1, 2) = 1000; C(2, 1) = 1000r, where r was set alternately to 2, 5, and 10. Results: over-sampling is not very effective in reducing cost at any of the cost ratios; under-sampling is effective for r = 5 and r = 10, but not for r = 2; MetaCost reduces costs compared to C4.5R, under-sampling, and over-sampling on almost all databases, for all cost ratios. Conclusion: MetaCost is the cost-reduction method of choice for two-class problems.
16 Lesion Studies of MetaCost 1. Q: How sensitive are the results to the number of resamples used? E: use 20 and 10 resamples instead of 50. R: cost increases as the number of resamples decreases, but only gradually. There is no significant difference between the costs obtained with m = 50 and m = 20; with m = 10, MetaCost still reduces costs compared to C4.5R and both forms of stratification in almost all datasets.
17 Lesion Studies of MetaCost 2. Q: Would it be enough to simply use the class probabilities produced by a single run of the error-based classifier on the full training set? E: relabel the training examples using the class probabilities produced by a single run of C4.5R on all the data (labeled "C4 Probs"). R: this produces worse results than MetaCost and under-sampling in almost all datasets for all cost ratios, but still outperforms over-sampling and C4.5R.
18 Lesion Studies of MetaCost 3. Q: How well would MetaCost do if the class probabilities produced by C4.5R were ignored, and the probability of a class was estimated simply as the fraction of models that predicted it? E: ignore the class probabilities produced by C4.5R (labeled "0-1 votes"). R: this increases cost in a majority of the datasets, but the relative differences are generally minor; MetaCost still outperforms the other methods in a large majority of the datasets, for all cost ratios.
19 Lesion Studies of MetaCost 4. Q: Would MetaCost perform better if all models were used in relabeling an example, irrespective of whether the example was used to learn them or not? E: use all models in relabeling an example (labeled "all Ms"). R: this decreases cost for r = 10 but increases it for r = 5 and r = 2; in all three cases the relative differences are generally minor, and the performance vs. C4.5R and stratification is broadly similar.
21 Problems with MetaCost MetaCost increases learning time compared to the error-based classifier. Reason: MetaCost multiplies the time by a roughly fixed factor, approximately the number of resamples. Solutions: parallelize the multiple runs of the error-based classifier; use resamples that are smaller than the original training set.
22 Stratification Baseline method: C4.5 combined with under-sampling or over-sampling. Performance analysis technique: cost curves.
23 Cost Curves The expected cost of a classifier is represented explicitly, and the representation is easy to understand. It allows the experimenter to immediately see the range of costs, where a particular classifier is best, and quantitatively how much better it is than other classifiers.
24 Cost Curves X-axis: the probability-cost function for positive examples, PCF(+) = w+ / (w+ + w-). Y-axis: the expected cost normalized with respect to the cost incurred when every example is incorrectly classified, NE[C] = ((1 - TP) w+ + FP w-) / (w+ + w-). Here w+ = p(+) C(-|+) and w- = p(-) C(+|-), where p(a) is the probability of a given example being in class a and C(a|b) is the cost incurred if an example in class b is misclassified as being in class a. Interpretation: the curve shows the expected cost of a classifier across all possible choices of misclassification costs and class distributions.
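Both axes can be computed from a classifier's true-positive and false-positive rates; this sketch follows the slide's definitions (the function name is mine):

```python
def cost_curve_point(tp, fp, p_pos, c_fn, c_fp):
    """Return (PCF(+), NE[C]) for one operating condition.

    w+ = p(+) * C(-|+)  -- weight of positive errors (false negatives)
    w- = p(-) * C(+|-)  -- weight of negative errors (false positives)
    PCF(+) = w+ / (w+ + w-)
    NE[C]  = ((1 - TP) * w+ + FP * w-) / (w+ + w-)
    """
    w_pos = p_pos * c_fn
    w_neg = (1 - p_pos) * c_fp
    total = w_pos + w_neg
    return w_pos / total, ((1 - tp) * w_pos + fp * w_neg) / total
```

A useful sanity check: the trivial "always negative" classifier (TP = 0, FP = 0) has NE[C] = PCF(+), and the trivial "always positive" classifier (TP = 1, FP = 1) has NE[C] = 1 - PCF(+); these two diagonals bound the region where any useful cost curve must lie.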
25 Comparing the Sampling Schemes Data set: sonar, 208 instances (111 mines, 97 rocks), 60 features. Bold dashed curve: performance of C4.5 using under-sampling; bold continuous curve: over-sampling.
26 Comparing the Sampling Schemes Data set: Japanese credit, 690 instances (307 positive, 383 negative), 15 features. Bold dashed curve: performance of C4.5 using under-sampling; bold continuous curve: over-sampling.
27 Comparing the Sampling Schemes Data set: breast cancer, 286 instances (201 non-recurrences, 85 recurrences), 9 features. Bold dashed curve: performance of C4.5 using under-sampling; bold continuous curve: over-sampling.
28 Comparing the Sampling Schemes Data set: sleep, 840 instances (100 1's and 140 2's), 15 features. Bold dashed curve: performance of C4.5 using under-sampling; bold continuous curve: over-sampling.
29 Comparing the Sampling Schemes Under-sampling produces a cost curve that is reasonably cost-sensitive. Over-sampling produces a cost curve that is much less sensitive: performance varies little from that at the data set's original class frequency. Overall, the under-sampling scheme outperforms the over-sampling scheme.
30 Investigating Over-sampling Curves Over-sampling prunes less and thus produces specialization, narrowing the region surrounding instances of the more common class as their number is increased; it therefore generalizes less than under-sampling. For data sets where there was appreciable pruning at the original frequency, over-sampling produced some overall cost sensitivity. Disabling the stopping criterion removes the small additional sensitivity shown at the ends of the curves.
33 Investigating Under-sampling Curves Disabling pruning and the early stopping criterion makes no real change to under-sampling: the curve still maintains roughly the same shape, not becoming as straight as the one produced by over-sampling.
36 Comparing Weighting and Sampling Two weighting schemes: up-weighting, analogous to over-sampling, increases the weight of one class while keeping the weight of the other class at one; down-weighting, analogous to under-sampling, decreases the weight of one class while keeping the weight of the other class at one. If misclassification costs and class frequencies are represented by internal instance weights within C4.5, disabling pruning and the stopping criterion does make a difference: the curve for up-weighting is very close to that for over-sampling; the curve for down-weighting is close to, but sometimes better than, that for under-sampling; and turning off pruning and then the stopping criterion produces a curve that is very straight.
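The two weighting schemes above differ only in which class keeps weight one; for a given cost ratio they produce the same relative weighting. A minimal sketch (function names are mine):

```python
def up_weights(cost_ratio):
    """Up-weighting: raise the minority-class weight, keep the majority at 1
    (analogous to over-sampling, which duplicates minority instances)."""
    return {"minority": float(cost_ratio), "majority": 1.0}

def down_weights(cost_ratio):
    """Down-weighting: lower the majority-class weight, keep the minority at 1
    (analogous to under-sampling, which discards majority instances)."""
    return {"minority": 1.0, "majority": 1.0 / cost_ratio}
```

The point of the comparison on this slide is that although the minority/majority weight ratio is identical, down-weighting keeps all instances in the tree-growing process, while under-sampling physically removes them, and that removal turns out to matter.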
37 Comparing Weighting and Sampling If the performance curves of under-sampling and down-weighting are similar, and disabling the internal factors makes down-weighting similar to up-weighting, why does disabling them not have the same effect when under-sampling? Explanation: much of the cost sensitivity of under-sampling comes from the actual removal of instances. When the factors were turned off under down-weighting, the branch was still grown and the region still labeled; when the instances are removed from the training set, this cannot happen.
40 Improving Over-sampling As over-sampling tends to disable pruning and other factors, perhaps we should increase their influence? Experiment: given an over-sampling ratio, set the stopping criterion to 2 times this ratio and the pruning confidence factor to 0.25 divided by the ratio. Example with the sonar data set: one class is over-sampled by 2.5 times, so the stopping criterion becomes 5 and the pruning confidence factor becomes 0.1. Result: increasing the factors in proportion to the number of duplicates in the training set does indeed have the desired effect.
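The heuristic can be stated as a simple parameter rule; the values 2 and 0.25 are C4.5's default stopping criterion (minimum instances) and pruning confidence factor, as used in the slide's example:

```python
def adjusted_c45_params(oversample_ratio):
    """Scale the stopping criterion up and the pruning confidence factor down
    in proportion to the over-sampling ratio, so that duplicated instances do
    not defeat early stopping and pruning (sketch of the slide's heuristic)."""
    return {
        "min_instances": 2 * oversample_ratio,  # C4.5 default stopping criterion: 2
        "pruning_cf": 0.25 / oversample_ratio,  # C4.5 default confidence factor: 0.25
    }
```

For the sonar example above, a 2.5x over-sampling ratio yields a stopping criterion of 5 and a confidence factor of 0.1.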
42 Conclusions MetaCost is applicable to any number of classes and to arbitrary cost matrices, and almost always produces large cost reductions compared to the cost-blind classifier. Using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. Under-sampling produces reasonable sensitivity to changes in misclassification costs and class distribution. Over-sampling shows little sensitivity: there is often little difference in performance when misclassification costs are changed. Over-sampling can be made cost-sensitive if the pruning and early-stopping parameters are set in proportion to the amount of over-sampling done, but its extra computational cost is unwarranted, as the performance achieved is, at best, the same as under-sampling.
43 References
Domingos, P. MetaCost: A General Method for Making Classifiers Cost-Sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99), 1999.
Drummond, C., and Holte, R. C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling. In Workshop on Learning from Imbalanced Data Sets II, 2003.
Drummond, C., and Holte, R. C. Explicitly Representing Expected Cost: An Alternative to ROC Representation. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, 2000.
Turney, P. Cost-Sensitive Learning Bibliography. Online bibliography, Institute for Information Technology, National Research Council of Canada, Ottawa, 1997.