Class Imbalance, Redux
Byron C. Wallace (1,2), Kevin Small (1), Carla E. Brodley (2), and Thomas A. Trikalinos (1)
(1) Center for Clinical Evidence Synthesis, Tufts Medical Center
(2) Department of Computer Science, Tufts University
[Figure: labeled training examples, minority (+) and majority (−) classes]
The Problem of Class Imbalance
Real-world imbalance is often extreme: minority prevalence of < 1% is common
Models induced over imbalanced data tend to have low sensitivity (but high accuracy)
Accuracy is an essentially useless metric in these cases!
Class Imbalance
Class imbalance is everywhere: disease screening, fraud detection, text classification (often), …
Typically the minority class is the one we care about
Named one of the ten most challenging problems in data mining [Yang & Wu, ICDM '05]
Isn’t this a Solved Problem?
Well, no. Most prior work on imbalance used toy (UCI) datasets: low dimensionality, modest imbalance
Many methods have been proposed; sometimes they work, often they don't
We need theory to guide the practitioner
My Goals Today
Present a useful probabilistic theory of imbalance that provides insight into the problem
Convince you that undersampling + bagging is the best-motivated strategy for handling imbalance
The Problem of Class Imbalance
[Figure: scatter of minority- and majority-class examples, with class-conditional distributions P and G]
The One-Dimensional Case
[Figure: one-dimensional class-conditional distributions P and G]
Any line separates the space into two half-spaces
Bias
[Figure: empirical vs. ideal decision lines over distributions P and G]
The empirical line is closer to the minority class than the ideal one
Formalizing Bias
Def.: The induced classifier (hyperplane) is biased iff the half-space it designates for the minority (+) class is smaller than the ideal half-space
Bias results in low sensitivity!
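In the one-dimensional picture, the definition can be written out as follows (a reconstruction in our own notation, since the slide's formula did not survive extraction):

```latex
% 1-D case (notation ours): \hat{w} is the induced threshold, w^* the
% ideal one, and the minority (+) region lies to the right of the threshold.
\mathrm{biased}(\hat{w}) \iff (\hat{w}, \infty) \subset (w^*, \infty) \iff \hat{w} > w^*
```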
The Case for Undersampling
Empirically, throwing away training data from the majority class at random until both classes have an equal number of examples works well [Van Hulse et al., ICML '07]
Why should this work?
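A minimal sketch of random undersampling in Python (the function name and interface are ours, not from the talk):

```python
import numpy as np

def undersample(X, y, minority_label=1, rng=None):
    """Randomly drop majority examples until both classes are the same size."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Keep every minority example; draw an equally sized random subset
    # of the majority class (without replacement).
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    rng.shuffle(keep)
    return X[keep], y[keep]
```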
The Effect of Undersampling
[Figure sequence: the induced decision boundary as majority examples are progressively removed]
Classifiers induced over undersampled datasets tend to be less biased
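One way to see why (a toy calculation of our own, not from the slides): suppose a 1-D learner places its boundary midway between the innermost training example of each class, and the majority examples' distances to the ideal boundary are i.i.d. Uniform(0, 1). Then

```latex
% Expected distance of the closest of n majority examples to the ideal boundary:
E\Big[\min_{i \le n} d_i\Big] = \frac{1}{n + 1}
```

so shrinking n via undersampling widens the expected gap on the majority side, and the midpoint boundary drifts back toward the majority class, reducing bias.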
Bagging
Throwing away majority examples at random adds a stochastic element: variance in the induction process
This can be mitigated by bagging: creating an ensemble of classifiers, each induced over a different balanced sample
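A sketch of such an ensemble (the class name, the 0/1-label convention, and the majority-vote rule are our assumptions; it reuses the undersample() helper sketched above):

```python
import copy
import numpy as np

class BalancedBagger:
    """Vote over classifiers, each trained on a different balanced undersample.

    Assumes 0/1 labels with 1 as the minority class, a base_learner exposing
    fit(X, y) / predict(X), and the undersample() helper defined above.
    """

    def __init__(self, base_learner, n_estimators=25, rng=None):
        self.base_learner = base_learner
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(rng)

    def fit(self, X, y):
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Each ensemble member sees all minority examples but a fresh
            # random subset of the majority class.
            X_bal, y_bal = undersample(X, y, rng=self.rng)
            clf = copy.deepcopy(self.base_learner)
            clf.fit(X_bal, y_bal)
            self.estimators_.append(clf)
        return self

    def predict(self, X):
        # Majority vote: predict 1 when at least half the members do.
        votes = np.stack([clf.predict(X) for clf in self.estimators_])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```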
Bagging
[Figure: an ensemble of classifiers, each induced over a different balanced sample; take a vote over these]
Other Approaches
Imbalance in the real world is often extreme: minority prevalence of < 1% is common
Most prior work on imbalance used toy (UCI) datasets: low dimensionality, modest imbalance
Other approaches mostly fiddle with the relative costs in the objective function
This won't help with the distributional issues!
Other Approaches: Cost-Based Strategies
Empirical loss over a dataset D weighs false negatives with cost Cfn and false positives with cost Cfp
Solution: increase Cfn relative to Cfp
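The slide's formula did not survive extraction; a standard form of the cost-sensitive empirical loss it describes is:

```latex
L(h; D) = \sum_{(x,\,y) \in D}
    C_{fn}\,\mathbf{1}[y = +,\ h(x) = -]
  + C_{fp}\,\mathbf{1}[y = -,\ h(x) = +]
```

In practice this shows up as per-class weights, e.g. the class_weight parameter in scikit-learn classifiers.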
SMOTE
[Figure: SMOTE; legend: synthetic (interpolated) example, observed example]
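SMOTE [Chawla et al., 2002] creates synthetic minority examples by interpolating between a minority point and one of its minority-class nearest neighbors. A minimal sketch (our own simplification; production implementations such as imbalanced-learn's SMOTE add more machinery):

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, rng=None):
    """Interpolate synthetic examples between minority points and their
    k nearest minority-class neighbors (Euclidean distance)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # Distances from point i to every other minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        j = rng.choice(np.argsort(d)[:k])
        lam = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```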
When Cost-Based Strategies Won’t Work…
We are already classifying the x's in the figure correctly, so their contribution to the loss is zero; setting Cfn arbitrarily high won't change anything
… is When SMOTE Won’t Work
[Figure: a synthetic point interpolated between two correctly classified x's]
Adding a point between two x's we already classify correctly won't help, either
Conjecture
Our theory suggests that bagging should outperform cost-based strategies on datasets with:
high dimensionality
low prevalence
small training set size
Take-Aways
Undersampling is an effective strategy for handling imbalance; bagging helps further by reducing variance
Cost-based strategies will be less effective when there is high dimensionality, a small training set, and low prevalence