
Class Imbalance, Redux


1 Class Imbalance, Redux
Byron C Wallace,¹ ² Kevin Small,¹ Carla E Brodley² and Thomas A Trikalinos¹
¹Center for Clinical Evidence Synthesis, Tufts Medical Center
²Department of Computer Science, Tufts University

2 [figure: minority (+) and majority (−) examples]

3 The Problem of Class Imbalance
Real-world imbalance is often extreme: a minority prevalence of < 1% is common. Models induced over imbalanced data tend to have low sensitivity (but high accuracy), and accuracy is practically useless in these cases!
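The accuracy/sensitivity gap is easy to see numerically. A minimal Python sketch (our own toy numbers, not from the talk): at 1% minority prevalence, a degenerate model that always predicts the majority class scores 99% accuracy yet catches no minority examples.

```python
# Toy illustration: 1% minority prevalence.
labels = [1] * 10 + [0] * 990      # 10 minority (+), 990 majority (-)
preds = [0] * 1000                 # degenerate "always majority" classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
sensitivity = true_positives / sum(labels)  # recall on the minority class

print(accuracy)     # 0.99
print(sensitivity)  # 0.0
```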

4 Class Imbalance
Class imbalance is everywhere: disease screening, fraud detection, text classification (often), … Typically the minority class is what we care about. Named one of the ten most challenging problems in data mining [ICDM '05].

5 Isn’t this a Solved Problem?
Well, no. Most prior work on imbalance used toy (UCI) datasets with low dimensionality and modest imbalance. Many methods have been proposed; sometimes they work, often they don’t. We need theory to guide the practitioner.

6 My Goals Today Present a useful probabilistic theory of imbalance that provides insight into the problem Convince you that undersampling + bagging is the best-motivated strategy for handling imbalance

7 The Problem of Class Imbalance
[figure: minority class and majority class distributions]

8 The Problem of Class Imbalance
[figure: minority and majority class distributions with separating lines G and P]

9 The One-Dimensional Case
[figure: lines P and G] Any line separates the space into two half-spaces.

10 Bias
[figure: lines P and G] The empirical line is closer to the minority class than the ideal.

11 Formalizing Bias
def: The induced classifier (hyperplane) is biased iff the half-space it designates for the minority (+) class is smaller than the ideal half-space. Bias results in low sensitivity!

12 The Case for Undersampling
Empirically, throwing away training data from the majority class until you have an equal number of examples from both classes works well [Van Hulse et al., ICML '07]. Why should this work?
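Random undersampling can be sketched in a few lines of Python. This is an illustrative implementation (function name and interface are ours, not from the talk): keep every minority example and a same-size random draw from the majority.

```python
import random

def undersample(data, labels, seed=0):
    """Randomly discard majority-class examples until both classes
    have equal counts. Illustrative sketch, assuming 0/1 labels."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(data, labels) if y == 1]
    neg = [x for x, y in zip(data, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    min_label = 1 if len(pos) <= len(neg) else 0
    kept = rng.sample(majority, len(minority))  # throw the rest away
    balanced_data = minority + kept
    balanced_labels = [min_label] * len(minority) + [1 - min_label] * len(kept)
    return balanced_data, balanced_labels
```

For example, 5 positives among 100 examples come back as a balanced set of 10.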

13–16 The Effect of Undersampling
[animated figure: as majority examples are removed, the empirical boundary shifts toward the ideal]

17 The Effect of Undersampling
Classifiers induced over undersampled datasets tend to be less biased.

18 Bagging
Throwing away majority examples at random adds a stochastic element: variance in the induction process. This can be mitigated by bagging: creating an ensemble of classifiers induced over balanced samples.
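The bagging step can be sketched as follows (names are ours): draw several balanced subsamples, induce one classifier per subsample in a full pipeline (not shown), and combine their per-example predictions by majority vote.

```python
import random
from collections import Counter

def balanced_bags(data, labels, n_bags=11, seed=0):
    """Draw n_bags balanced subsamples: each keeps every minority
    example plus an equal-size random draw from the majority."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(data, labels) if y == 1]
    neg = [x for x, y in zip(data, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_bags)]

def majority_vote(predictions):
    """Combine one example's predictions from the ensemble by voting."""
    return Counter(predictions).most_common(1)[0][0]
```

An odd n_bags avoids ties in the binary vote.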

19 Bagging
[figure: ensemble of classifiers induced over balanced subsamples] Take a vote over these.

20 Other Approaches
Imbalance in the real world is often extreme: a minority prevalence of < 1% is common. Most prior work on imbalance used toy (UCI) datasets with low dimensionality and modest imbalance. Other approaches mostly fiddle with relative costs in the objective function; this won't help with the distributional issues!

21 Other Approaches: Cost-Based Strategies
Empirical loss over a dataset D: loss(D) = C_fp · (# false positives) + C_fn · (# false negatives). Solution: raise C_fn relative to C_fp.
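The cost-sensitive loss above is straightforward to compute. A minimal sketch matching the slide's C_fn / C_fp notation (function name is ours):

```python
def empirical_loss(preds, labels, c_fp=1.0, c_fn=1.0):
    """Cost-sensitive empirical loss over a dataset: each false
    negative costs c_fn, each false positive costs c_fp."""
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    return c_fn * fn + c_fp * fp
```

Raising c_fn makes missed minority examples dominate the loss, which is the lever cost-based strategies pull.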

22 SMOTE
[figure: synthetic examples interpolated between observed minority examples]
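The core SMOTE step pictured above is linear interpolation between two observed minority examples. A sketch of just that step (full SMOTE also chooses b among a's k nearest minority neighbors, which is omitted here):

```python
import random

def smote_point(a, b, rng):
    """Create one synthetic minority example on the line segment
    between two observed minority feature vectors a and b."""
    t = rng.random()  # random interpolation factor in [0, 1)
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
synthetic = smote_point([0.0, 0.0], [1.0, 1.0], rng)
```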

23 When Cost-Based Strategies Won’t Work…
We are already classifying the x’s correctly, so setting C_fn arbitrarily high won’t change anything.

24 … is When SMOTE Won’t Work
Adding a synthetic point between two x’s we already classify correctly won’t help, either.

25 Conjecture
Our theory suggests that bagging should outperform cost-based strategies on datasets with high dimensionality, low prevalence, and small training set size.

26 Take-Aways
Undersampling is an effective strategy for handling imbalance; bagging helps further by reducing variance. Cost-based strategies will be less effective when there is high dimensionality, small training set size, and small prevalence.

