
1 Cost-Sensitive Boosting algorithms: Do we really need them?
Professor Gavin Brown, School of Computer Science, University of Manchester

2 “The important thing in science is not so much to obtain new facts, but to discover new ways of thinking about them.” Sir William Bragg, Nobel Prize in Physics, 1915

3 Cost-sensitive Boosting algorithms: Do we really need them?
Nikolaos Nikolaou, Narayanan Edakunni, Meelis Kull, Peter Flach and Gavin Brown. Machine Learning Journal, Vol. 104, Issue 2, Sept 2016.
Nikos Nikolaou. Cost-sensitive Boosting: A Unified Approach. PhD thesis, University of Manchester, 2016.

4 The “Usual” Machine Learning Approach
data + labels → Learning Algorithm → Model. New data (no labels) → Model → predicted label.

5 The “Boosting” approach
data + labels → Model 1; Dataset 2 → Model 2; Dataset 3 → Model 3; Dataset 4 → Model 4. Each model corrects the mistakes of its predecessor.

6 The “Boosting” approach
New data → Model 1, Model 2, Model 3, Model 4 → weighted vote.

7 Boosting: A Rich Theoretical History
“Can we turn a weak learner into a strong learner?” (Kearns, 1988). A weak learner is marginally more accurate than random guessing; a strong learner reaches arbitrarily high accuracy. “Yes!” (Schapire, 1990). The AdaBoost algorithm (Freund & Schapire, 1997) earned the Gödel Prize in 2003.

8 The AdaBoost algorithm (informal)
Start with all datapoints equally important.
for t = 1 to T do
  Build a classifier, h
  Calculate the voting weight for h
  Update the distribution over the datapoints:
    - increase emphasis where h makes mistakes
    - decrease emphasis where h is already correct
end for
Calculate the answer for a test point by majority vote.

9 The AdaBoost algorithm
Voting weight for classifier t: \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}, where \epsilon_t is the weighted error of h_t.
Distribution update: D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t, with Z_t a normalizing constant.
Majority vote on test example x': H(x') = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x')\big).
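A minimal runnable sketch of these updates (discrete AdaBoost with decision stumps via scikit-learn; the function names are illustrative, not from the paper's codebase):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    """Discrete AdaBoost. Labels y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # all datapoints equally important
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)
        h.fit(X, y, sample_weight=D)           # weak learner on weighted data
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error
        if eps >= 0.5:                         # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # voting weight alpha_t
        D *= np.exp(-alpha * y * pred)         # raise weight on mistakes, lower on hits
        D /= D.sum()                           # normalise (the Z_t term)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_score(stumps, alphas, X):
    """Ensemble score F(x) = sum_t alpha_t h_t(x)."""
    return sum(a * h.predict(X) for a, h in zip(alphas, stumps))

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote: sign of the ensemble score."""
    return np.sign(adaboost_score(stumps, alphas, X))
```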

10 Cost-sensitive Problems
What happens with differing False Positive / False Negative costs? Can we minimize the expected cost (a.k.a. risk)?
Cost matrix:
             Prediction: 0   Prediction: 1
  Truth: 0        0              c_FP
  Truth: 1       c_FN             0
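In symbols (a standard decision-theoretic statement using the c_FP, c_FN notation above; background, not a formula shown on the slide):

```latex
R(\hat{y}{=}1 \mid x) = c_{\mathrm{FP}}\, p(y{=}0 \mid x), \qquad
R(\hat{y}{=}0 \mid x) = c_{\mathrm{FN}}\, p(y{=}1 \mid x)
```

Minimizing expected cost means predicting whichever class has the smaller risk.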

11 Cost-sensitive AdaBoost?
15+ boosting variants over 20 years. Some were re-invented multiple times. Most were proposed as heuristic modifications to the original AdaBoost, and many treat the FP/FN costs as hyperparameters.

12 The AdaBoost algorithm, made cost-sensitive?
Different variants intervene at different points of the algorithm: some modify the initial distribution, some the voting weights, others the distribution update, and some the final decision rule.

13 “The important thing in science is not so much to obtain new facts, but to discover new ways of thinking about them.”
Let’s take a step back… AdaBoost can be viewed through four theoretical lenses: Decision Theory (Freund & Schapire, 1997), Margin Theory (Schapire et al., 1998), Functional Gradient Descent (Mason et al., 2000) and Probabilistic Modelling (Lebanon & Lafferty, 2001).

14 So, for a cost-sensitive boosting algorithm…?
For any new Algorithm X, ask of each perspective (Functional Gradient Descent, Decision Theory, Margin Theory, Probabilistic Modelling): “Does my new algorithm still follow from each?”

15 Functional Gradient Descent
In the FGD view, the distribution update is a direction in function space, and the voting weight is a step size.
Property 1: FGD-consistency. Are the voting weights and distribution updates consistent with each other (i.e. both a consequence of FGD on a given loss)?
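For reference, the standard FGD reading (Mason et al., 2000), under which AdaBoost performs gradient descent in function space on the exponential loss:

```latex
L(F) = \sum_{i=1}^{n} \exp\big(-y_i F(x_i)\big), \qquad
F_t = F_{t-1} + \alpha_t h_t
```

Here h_t (selected via the distribution update) is the descent direction and \alpha_t the step size; Property 1 asks that both come from the same loss.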

16 Decision Theory
Ideally: assign each example to the risk-minimizing class. The Bayes optimal rule: predict class y=1 iff c_{FN}\, p(y{=}1 \mid x) > c_{FP}\, p(y{=}0 \mid x), i.e. iff p(y{=}1 \mid x) > c_{FP} / (c_{FP} + c_{FN}).
Property 2: Cost-consistency. Does the algorithm use the above rule to make decisions (assuming ‘good’ probability estimates)?
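A minimal sketch of this rule in Python (min_risk_predict is a hypothetical helper name; it assumes calibrated estimates of p(y=1|x) are already in hand):

```python
import numpy as np

def min_risk_predict(p_pos, c_fp, c_fn):
    """Bayes optimal rule: predict y=1 iff p(y=1|x) > c_FP / (c_FP + c_FN)."""
    p_pos = np.asarray(p_pos)
    threshold = c_fp / (c_fp + c_fn)
    return (p_pos > threshold).astype(int)

# Example: with FN five times costlier than FP, the threshold drops to 1/6.
print(min_risk_predict([0.1, 0.2, 0.9], c_fp=1.0, c_fn=5.0))  # -> [0 1 1]
```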

17 Margin theory
The exponential loss is a surrogate: an upper bound on the 0/1 loss. Large margins encourage small generalization error, and AdaBoost promotes large margins.
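In symbols (the margin of example (x, y) is yF(x); the standard bound, stated here for completeness):

```latex
\mathbb{1}\big[\, y F(x) \le 0 \,\big] \;\le\; \exp\big(-y F(x)\big)
```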

18 Margin theory… with costs…
Different surrogates for each class.
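One illustrative form of such class-dependent surrogates (an assumed parameterization in the c_FP/c_FN notation above; individual variants differ):

```latex
L(F) = c_{\mathrm{FN}} \sum_{i:\, y_i = +1} e^{-y_i F(x_i)}
     \; + \; c_{\mathrm{FP}} \sum_{i:\, y_i = -1} e^{-y_i F(x_i)}
```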

19 Margin theory… with costs…
We expect the costlier class to incur the larger loss at every margin value. But in some algorithms the two surrogate losses cross, swapping which class matters more.
Property 3: Asymmetry preservation. Does the loss function preserve the relative importance of each class, for all margin values?

20 Probabilistic Models
‘AdaBoost does not produce good probability estimates.’ (Niculescu-Mizil & Caruana, 2005)
‘AdaBoost is successful at [..] classification [..] but not class probabilities.’ (Mease et al., 2007)
‘This increasing tendency of [the margin] impacts the probability estimates by causing them to quickly diverge to 0 and 1.’ (Mease & Wyner, 2008)

21 Probabilistic Models
AdaBoost tends to produce probability estimates close to 0 or 1.
[Figures: histogram of AdaBoost output (score) frequencies; reliability plot of score vs. empirically observed probability]

22 Property 4: Calibrated estimates
AdaBoost tends to produce probability estimates close to 0 or 1 (figures as on the previous slide).
Property 4: Calibrated estimates. Does the algorithm generate “calibrated” probability estimates, i.e. among examples given score p, does a fraction p actually belong to the positive class?
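A quick empirical check of calibration (a sketch using scikit-learn's calibration_curve; the data here are random placeholders just to make it runnable):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Placeholder labels in {0, 1} and scores in [0, 1]; substitute real
# ensemble outputs squashed into [0, 1] in practice.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = rng.random(1000)

prob_true, prob_pred = calibration_curve(y_true, scores, n_bins=10)
# For a calibrated model, prob_true is close to prob_pred in every bin;
# AdaBoost's raw scores typically pile up near 0 and 1 and bow away
# from the diagonal of the reliability plot.
print(np.column_stack([prob_pred, prob_true]))
```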

23 The results are in…
All algorithms produce uncalibrated probability estimates! So what if we calibrate these last three?

24 Platt scaling (logistic calibration)
Training phase: reserve part of the training data (here 50%) to fit a sigmoid mapping the AdaBoost output (score) to the empirically observed probability, correcting the distortion.
Test phase: apply the sigmoid transform to the score (the output of the ensemble) to get a probability estimate.
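A minimal sketch of the two phases (reusing the hypothetical adaboost_train/adaboost_score helpers from the earlier sketch, on synthetic stand-in data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; labels mapped to {-1, +1} for the boosting sketch.
X, y01 = make_classification(n_samples=2000, random_state=0)
y = 2 * y01 - 1

# Training phase: boost on one half, fit the sigmoid on the reserved half.
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
stumps, alphas = adaboost_train(X_fit, y_fit)
scores_cal = adaboost_score(stumps, alphas, X_cal)
platt = LogisticRegression()                  # fits p = 1 / (1 + e^{-(A*s + B)})
platt.fit(scores_cal.reshape(-1, 1), y_cal)

# Test phase: sigmoid-transform the ensemble score into a probability estimate
# (demonstrated here on the calibration split, standing in for test data).
p_pos = platt.predict_proba(scores_cal.reshape(-1, 1))[:, 1]
```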

25 Experiments (lots)
15 algorithms. 18 datasets. 21 degrees of cost imbalance.

26 In summary…
[Chart: algorithms ranked by average Brier score. Annotations: AdaMEC, CGAda & AsymAda have all properties except calibrated probabilities; their calibrated versions, e.g. Calibrated AdaMEC, have all four.]
AdaMEC, CGAda & AsymAda outperform all others. Their calibrated versions outperform the uncalibrated ones.

27 In summary…
“Calibrated AdaMEC” was the algorithm ranked 2nd overall. So what is it?
1. Take original AdaBoost.
2. Calibrate it (we use Platt scaling).
3. Shift the decision threshold according to the costs (see the sketch below).
It is consistent with all theory perspectives, adds no extra hyperparameters, needs no retraining if the cost ratio changes, and is consistently top (or joint top) in empirical comparisons.
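Putting the three steps together, reusing the hypothetical helpers from the earlier sketches (an illustration of the recipe, not the authors' reference implementation, which is linked from the final slide):

```python
# 1. Original AdaBoost:            stumps, alphas = adaboost_train(...)
# 2. Calibrate with Platt scaling: p_pos, as in the previous sketch.
# 3. Shift the decision threshold according to the cost matrix:
c_fp, c_fn = 1.0, 5.0                          # example costs
y_pred = min_risk_predict(p_pos, c_fp, c_fn)   # 1 iff p > c_fp / (c_fp + c_fn)
# If the cost ratio changes, only the threshold moves: no retraining needed.
```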

28 The results are in…
All algorithms produce uncalibrated probability estimates!

29 Question: What if we calibrate all algorithms?
Answer, in theory:
… calibration will improve the probability estimates.
… if a method is not cost-sensitive, calibration will not fix it.
… if the gradient steps/direction are not consistent, calibration will not fix it.
… if class importance is swapped, calibration will not fix it.

30 Question: What if we calibrate all algorithms?
[Chart: the property table revisited after calibration. Annotations: “have all properties”, “all except calibrated”, “Original AdaBoost (1997)”.]

31 Conclusions
We analyzed the cost-sensitive boosting literature: 15+ variants over 20 years, from 4 different theoretical perspectives. “Cost-sensitive” modifications to the original AdaBoost seem unnecessary… if the scores are properly calibrated, and the decision threshold is shifted according to the cost matrix.

32 Code to play with
All software at [link], including:
- a step-by-step walkthrough of ‘Calibrated AdaMEC’, implemented in Python
- a detailed tutorial on all aspects, allowing the experiments to be reproduced

33 Results (extended)
Isotonic regression beats Platt scaling for larger datasets. We can do better than a 50%-50% train-calibration split. (Calibrated) Real AdaBoost > (Calibrated) Discrete AdaBoost…
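Isotonic calibration is a drop-in replacement for the Platt sigmoid (a sketch using scikit-learn's IsotonicRegression, reusing scores_cal/y_cal from the Platt-scaling sketch; isotonic regression needs {0, 1} targets):

```python
from sklearn.isotonic import IsotonicRegression

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores_cal, (y_cal + 1) / 2)   # map {-1, +1} labels to {0, 1}
p_pos = iso.predict(scores_cal)        # calibrated probability estimates
```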

