1  Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
   Gary Weiss, Kate McCarthy, Bibi Zabar
   Fordham University

2  Background
- Highly skewed data is common
- Typically more interested in correctly classifying the minority class examples
- Without special measures, a classifier will rarely predict the minority class
- A common approach: balance the data
  - Imposes non-uniform misclassification costs*
  - If we alter the training set class distribution from 1:1 to 2:1, we have essentially applied a cost ratio of 2:1

* C. Elkan. The foundations of cost-sensitive learning. IJCAI 2001.
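The slide's 1:1-to-2:1 example can be sketched as a small helper that computes the cost ratio a resampling step effectively imposes. This is our own illustration, not code from the paper: the function name and the odds-based formulation are assumptions; each class's odds change by the factor its examples are scaled by, so the implied cost ratio is the relative scaling of minority versus majority examples.

```python
def implied_cost_ratio(orig_minority_frac, new_minority_frac):
    """Cost ratio (C_FN : C_FP) effectively imposed by resampling the
    training set from one minority-class fraction to another.

    Illustrative sketch: the ratio is how much more the minority class
    was scaled up than the majority class.
    """
    orig_odds = orig_minority_frac / (1.0 - orig_minority_frac)
    new_odds = new_minority_frac / (1.0 - new_minority_frac)
    return new_odds / orig_odds
```

Going from a 1:1 distribution (minority fraction 0.5) to 2:1 (minority fraction 2/3) yields a ratio of 2.0, matching the slide's example.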

3  Two Competing Approaches
- Cost-sensitive learning algorithm
  - The algorithm itself handles cost-sensitivity
  - Does not throw away any data
- Sampling
  - Down-sample the majority class: discards potentially useful data
  - Up-sample the minority class: increases the amount of training data, but replicated examples may lead to overfitting
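The two sampling approaches above can be sketched in a few lines. This is a minimal illustration (function names and the seed parameter are our own), not the paper's implementation:

```python
import random

def down_sample(majority, minority, seed=0):
    """Balance classes by randomly discarding majority-class examples.
    Note: this throws away potentially useful data."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)) + list(minority)

def up_sample(majority, minority, seed=0):
    """Balance classes by replicating randomly chosen minority-class
    examples (sampling with replacement). Replicated examples can
    lead to overfitting."""
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra
```

Both return a balanced training set; down-sampling shrinks it to twice the minority size, up-sampling grows it to twice the majority size.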

4  The Question
Which method is best?
  A. Cost-sensitive learning algorithm
  B. Up-sampling
  C. Down-sampling
Most prior work compares sampling methods.

5  Experiments
- We assume that cost information is known
  - Since cost info is not really provided, we evaluate a variety of cost ratios and report all results
- Classifier performance is evaluated using total cost
- Used cost-sensitive C5.0
- Evaluated scenarios where C_FN ≠ C_FP
- All results are based on averages over 10 runs
- For cost-sensitive learning, cost info is passed in
- For sampling approaches, we altered the training data to "impose" the specified misclassification cost
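The evaluation metric above, total cost, weights each error type by its misclassification cost. A minimal sketch (the function name and binary-label convention are our assumptions):

```python
def total_cost(y_true, y_pred, c_fn, c_fp, positive=1):
    """Total misclassification cost for binary predictions:
    false negatives (positives predicted negative) cost c_fn each,
    false positives (negatives predicted positive) cost c_fp each."""
    fn = sum(t == positive and p != positive
             for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive
             for t, p in zip(y_true, y_pred))
    return c_fn * fn + c_fp * fp
```

With C_FN = 5 and C_FP = 1, one false negative and one false positive give a total cost of 6, so methods that avoid false negatives are rewarded.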

6  Fourteen Data Sets

Name            % in Minority   Total Examples
Letter-a         4%             20,000
Pendigits        8%             13,821
Connect-4       10%             11,258
Bridges1        15%                102
Letter-vowel    19%             20,000
Hepatitis       21%                155
Contraceptive   23%              1,473
Adult           24%             21,281
Blackjack       36%             15,000
Weather         40%              5,597
Sonar           47%                208
Boa1            50%             11,000
Promoters       50%                106
Coding          50%             20,000

7  Results: Letter-a Data Set (4% minority, 20,000 examples)

8  Results: Weather Data Set (40% minority, 5,597 examples)

9  Results: Coding Data Set (50% minority, 20,000 examples)

10  Results: Blackjack Data Set (36% minority, 15,000 examples)

11  Results: Contraceptive Data Set (23% minority, 1,473 examples)

12  Results: 1st/2nd/3rd Place Finishes

13  Comparison of 3 Methods

14  Discussion
- Results vary widely based on the data set
  - No method consistently outperforms the other two, or even one of the other two
- Are there any patterns based on the properties of the data sets?

15  Discussion II: Patterns
- For the four smallest data sets (size < 209)
  - Up-sampling does by far the best
  - Down-sampling does poorly since it discards data
- For the eight largest data sets (size > 10,000)
  - Cost-sensitive learning does best
  - Beats up-sampling on average by 5.5%
  - Beats down-sampling on average by 5.7%
- No clear pattern based on the degree of class imbalance

16  Discussion III
- Why might the cost-sensitive learning algorithm perform best for large data sets?
  - Perhaps this method requires accurate probability estimates in order to perform well
  - This requires many examples per classification "rule"

17  Conclusion
- No consistent winner between cost-sensitive learning and sampling methods
- Substantial differences for specific data sets
- Cost-sensitive learning may be best for large data sets
- Up-sampling appears best for small data sets

18  Follow-up Questions
- Why isn't cost-sensitive learning the best?
- Can we identify problems with cost-sensitive learners?
- Can we improve cost-sensitive learners?
- Are we better off not using a cost-sensitive learner and using sampling instead?

19  Future Work
- Use additional cost-sensitive learners
- Use larger data sets (would cost-sensitive learning then be best?)
- Include more sophisticated sampling schemes
- Don't assume known costs (ROC analysis)
- I believe more comprehensive studies are needed and are underway

