Class Imbalance, Redux
Byron C. Wallace (1,2), Kevin Small (1), Carla E. Brodley (2), and Thomas A. Trikalinos (1)
(1) Center for Clinical Evidence Synthesis, Tufts Medical Center
(2) Department of Computer Science, Tufts University
[Figure: labeled training examples, minority (+) and majority (−) classes]
The Problem of Class Imbalance
Real-world imbalance is often extreme: minority prevalence of < 1% is common
Models induced over imbalanced data tend to have low sensitivity (but high accuracy)
Accuracy is an essentially useless metric in these cases!
Class Imbalance
Class imbalance is everywhere: disease screening, fraud detection, text classification (often), …
Typically the minority class is the one we care about
Named one of the ten most challenging problems in data mining [Yang & Wu, ICDM '05]
Isn’t this a Solved Problem?
Well, no. Most prior work on imbalance used toy (UCI) datasets: low dimensionality, modest imbalance
Many methods have been proposed; sometimes they work, often they don't
We need theory to guide the practitioner
My Goals Today
Present a useful probabilistic theory of imbalance that provides insight into the problem
Convince you that undersampling + bagging is the best-motivated strategy for handling imbalance
The Problem of Class Imbalance
[Figure: scatter of minority- and majority-class examples, with class-conditional distributions P and G]
The One-Dimensional Case
[Figure: one-dimensional class-conditional distributions P and G]
Any line separates the space into two half-spaces
Bias
[Figure: empirical vs. ideal decision lines over distributions P and G]
The empirical line is closer to the minority class than the ideal one
Formalizing Bias
Def.: The induced classifier (hyperplane) is biased iff the half-space it designates for the minority (+) class is smaller than the ideal half-space
Bias results in low sensitivity!
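In the one-dimensional picture, the definition can be written out as follows (a reconstruction in our own notation, since the slide's formula did not survive extraction):

```latex
% 1-D case (notation ours): \hat{w} is the induced threshold, w^* the
% ideal one, and the minority (+) region lies to the right of the threshold.
\mathrm{biased}(\hat{w}) \iff (\hat{w}, \infty) \subset (w^*, \infty) \iff \hat{w} > w^*
```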
The Case for Undersampling
Empirically, throwing away training data from the majority class at random until both classes have an equal number of examples works well [Van Hulse et al., ICML '07]
Why should this work?
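A minimal sketch of random undersampling in Python (the function name and interface are ours, not from the talk):

```python
import numpy as np

def undersample(X, y, minority_label=1, rng=None):
    """Randomly drop majority examples until both classes are the same size."""
    rng = np.random.default_rng(rng)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Keep every minority example; draw an equally sized random subset
    # of the majority class (without replacement).
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.concatenate([minority_idx, kept_majority])
    rng.shuffle(keep)
    return X[keep], y[keep]
```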
The Effect of Undersampling
[Figure sequence: the induced decision boundary as majority examples are progressively removed]
Classifiers induced over undersampled datasets tend to be less biased
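One way to see why (a toy calculation of our own, not from the slides): suppose a 1-D learner places its boundary midway between the innermost training example of each class, and the majority examples' distances to the ideal boundary are i.i.d. Uniform(0, 1). Then

```latex
% Expected distance of the closest of n majority examples to the ideal boundary:
E\Big[\min_{i \le n} d_i\Big] = \frac{1}{n + 1}
```

so shrinking n via undersampling widens the expected gap on the majority side, and the midpoint boundary drifts back toward the majority class, reducing bias.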
Bagging
Throwing away majority examples at random adds a stochastic element: variance in the induction process
This can be mitigated by bagging: creating an ensemble of classifiers, each induced over a different balanced sample
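A sketch of such an ensemble (the class name, the 0/1-label convention, and the majority-vote rule are our assumptions; it reuses the undersample() helper sketched above):

```python
import copy
import numpy as np

class BalancedBagger:
    """Vote over classifiers, each trained on a different balanced undersample.

    Assumes 0/1 labels with 1 as the minority class, a base_learner exposing
    fit(X, y) / predict(X), and the undersample() helper defined above.
    """

    def __init__(self, base_learner, n_estimators=25, rng=None):
        self.base_learner = base_learner
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(rng)

    def fit(self, X, y):
        self.estimators_ = []
        for _ in range(self.n_estimators):
            # Each ensemble member sees all minority examples but a fresh
            # random subset of the majority class.
            X_bal, y_bal = undersample(X, y, rng=self.rng)
            clf = copy.deepcopy(self.base_learner)
            clf.fit(X_bal, y_bal)
            self.estimators_.append(clf)
        return self

    def predict(self, X):
        # Majority vote: predict 1 when at least half the members do.
        votes = np.stack([clf.predict(X) for clf in self.estimators_])
        return (votes.mean(axis=0) >= 0.5).astype(int)
```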
Bagging
[Figure: an ensemble of classifiers, each induced over a different balanced sample; take a vote over these]
Other Approaches
Imbalance in the real world is often extreme: minority prevalence of < 1% is common
Most prior work on imbalance used toy (UCI) datasets: low dimensionality, modest imbalance
Other approaches mostly fiddle with the relative costs in the objective function
This won't help with the distributional issues!
Other Approaches: Cost-Based Strategies
Empirical loss over a dataset D weighs false negatives with cost Cfn and false positives with cost Cfp
Solution: increase Cfn relative to Cfp
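The slide's formula did not survive extraction; a standard form of the cost-sensitive empirical loss it describes is:

```latex
L(h; D) = \sum_{(x,\,y) \in D}
    C_{fn}\,\mathbf{1}[y = +,\ h(x) = -]
  + C_{fp}\,\mathbf{1}[y = -,\ h(x) = +]
```

In practice this shows up as per-class weights, e.g. the class_weight parameter in scikit-learn classifiers.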
SMOTE
[Figure: SMOTE; legend: synthetic (interpolated) example, observed example]
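SMOTE [Chawla et al., 2002] creates synthetic minority examples by interpolating between a minority point and one of its minority-class nearest neighbors. A minimal sketch (our own simplification; production implementations such as imbalanced-learn's SMOTE add more machinery):

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, rng=None):
    """Interpolate synthetic examples between minority points and their
    k nearest minority-class neighbors (Euclidean distance)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # Distances from point i to every other minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        j = rng.choice(np.argsort(d)[:k])
        lam = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```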
When Cost-Based Strategies Won’t Work…
We are already classifying the x's in the figure correctly, so their contribution to the loss is zero; setting Cfn arbitrarily high won't change anything
… is When SMOTE Won’t Work
[Figure: a synthetic point interpolated between two correctly classified x's]
Adding a point between two x's we already classify correctly won't help, either
Conjecture
Our theory suggests that bagging should outperform cost-based strategies on datasets with:
high dimensionality
low prevalence
small training set size
Take-Aways
Undersampling is an effective strategy for handling imbalance; bagging helps further by reducing variance
Cost-based strategies will be less effective when there is high dimensionality, a small training set, and low prevalence