Ensemble Methods for Machine Learning: The Ensemble Strikes Back


1 Ensemble Methods for Machine Learning: The Ensemble Strikes Back

2 Outline
Motivations and techniques:
Bias and variance: bagging
Combining learners vs. choosing between them: bucket of models, stacking & blending
PAC-learning theory: boosting
Relation of boosting to other learning methods: optimization, SVMs, …

3 Review Of Boosting

4 Sample with replacement
Increase the weight of x_i if h_t is wrong on it; decrease the weight if h_t is right. The final classifier is a linear combination of the base hypotheses; the best weight α_t depends on the error of h_t.
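A minimal sketch of the loop described above (reweighting rather than resampling), assuming labels in {-1, +1}; the depth-1 scikit-learn trees used as base hypotheses are an illustrative choice, not from the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """Minimal AdaBoost loop; y must take values in {-1, +1}."""
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                          # example weights D_t(i)
    hyps, alphas = [], []
    for t in range(T):
        # Weak hypothesis h_t fit to the weighted sample (a decision stump here).
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = max(np.sum(D[pred != y]), 1e-12)       # weighted error of h_t
        if eps >= 0.5:                               # no edge over random guessing: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)        # best weight α_t depends on the error of h_t
        D *= np.exp(-alpha * y * pred)               # increase weight of x_i if h_t is wrong, decrease if right
        D /= D.sum()
        hyps.append(h); alphas.append(alpha)
    return hyps, alphas

def predict(hyps, alphas, X):
    """Linear combination of base hypotheses, thresholded at zero."""
    f = sum(a * h.predict(X) for h, a in zip(hyps, alphas))
    return np.sign(f)
```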

5 Boosting: A toy example
Thanks, Rob Schapire

6 Boosting: A toy example
Thanks, Rob Schapire

7 Boosting: A toy example
Thanks, Rob Schapire

8 Boosting: A toy example
Thanks, Rob Schapire

9 Boosting: A toy example
Thanks, Rob Schapire

10 Boosting improved decision trees…

11 Analysis Of Boosting

12 Theorem 1: error rate
Theorem 1: the training error of H(x) = sign(f(x)), with f(x) = Σ_t α_t h_t(x), is at most Π_t Z_t, where Z_t is the normalizer of the weight update at round t.

13 Theorem 1: error rate (proof)
Proof: H(x) = sign(f(x)), where f(x) = Σ_t α_t h_t(x). For each example, [H(x_i) ≠ y_i] ≤ exp(-y_i f(x_i)), i.e. the exponential is an upper bound on "[error on i]". Averaging over the m training examples and unrolling the weight updates gives training error ≤ (1/m) Σ_i exp(-y_i f(x_i)) = Π_t Z_t. QED!
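A worked version of the averaging-and-unrolling step, assuming the standard AdaBoost weight update D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t with D_1(i) = 1/m (the update rule itself is standard AdaBoost and is not spelled out on these slides):

```latex
% Unrolling the weight update from D_1(i) = 1/m over T rounds:
D_{T+1}(i) = \frac{\exp\!\big(-y_i \sum_t \alpha_t h_t(x_i)\big)}{m \prod_t Z_t}
           = \frac{e^{-y_i f(x_i)}}{m \prod_t Z_t}.
% The D_{T+1}(i) sum to 1, so
\frac{1}{m}\sum_i e^{-y_i f(x_i)} = \prod_t Z_t ,
% and combining with the per-example bound [\![H(x_i) \ne y_i]\!] \le e^{-y_i f(x_i)} gives
\frac{1}{m}\sum_i [\![H(x_i) \ne y_i]\!] \;\le\; \prod_t Z_t .
```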

14 Theorem 1: picking the α's
So: pick the h's and α's to minimize the Z's. Simplified notation: drop the t's and let u_i = y_i h_t(x_i); remember that u_i = +1 or -1.
Claim: exp(-α u) ≤ ((1+u)/2) exp(-α) + ((1-u)/2) exp(α). Equality holds for u = +1, -1; the inequality holds for -1 ≤ u ≤ +1 (by convexity of exp).
So: let's minimize f(α) = Σ_i D(i) [((1+u_i)/2) exp(-α) + ((1-u_i)/2) exp(α)] to pick a best α.

15 Minimize f(α)
Minimize f(α) = ((1+r)/2) exp(-α) + ((1-r)/2) exp(α), where r = Σ_i D(i) u_i. Setting f'(α) = 0 gives α = ½ ln((1+r)/(1-r)). For ±1-valued hypotheses, r = 1 - 2ε where ε is the weighted error of h, so α = ½ ln((1-ε)/ε).
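A quick numerical check of that closed form, for an arbitrary illustrative error ε = 0.2 (scipy's bounded scalar minimizer is used only to cross-check):

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.2                                                  # illustrative weighted error, not from the slides
f = lambda a: (1 - eps) * np.exp(-a) + eps * np.exp(a)     # f(α) = (1-ε)e^(-α) + ε e^(α)

alpha_closed = 0.5 * np.log((1 - eps) / eps)               # ½ ln((1-ε)/ε)
alpha_numeric = minimize_scalar(f, bounds=(0, 10), method="bounded").x

print(alpha_closed, alpha_numeric)   # both ≈ 0.693
print(f(alpha_closed))               # Z = 2·sqrt(ε(1-ε)) = 0.8
```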

16 Theorem 2: the training error drops exponentially
Theorem 1 says: pick the h's and α's to minimize the Z's.
Theorem 2: when ε_t ≤ ½ - γ for all t, then Z_t ≤ √(1 - 4γ²) ≤ exp(-2γ²), and hence the training error is bounded by exp(-2γ²T).
Comment: if h(x) = ±1, then with the α above, Z_t = 2√(ε_t(1 - ε_t)).
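A small sketch of how fast that bound shrinks, for an assumed edge γ = 0.1 over random guessing (the value is illustrative):

```python
import numpy as np

gamma, T = 0.1, np.arange(1, 201)          # assumed edge over random guessing; rounds 1..200

per_round = np.sqrt(1 - 4 * gamma**2)      # Z_t when ε_t = ½ - γ
bound = per_round ** T                     # Π_t Z_t: bound on training error after T rounds
loose = np.exp(-2 * gamma**2 * T)          # the looser exp(-2γ²T) form

print(bound[[0, 49, 199]])                 # ≈ [0.98, 0.36, 0.017] after 1, 50, 200 rounds
print(bool(np.all(bound <= loose)))        # True: exp(-2γ²T) is indeed an upper bound
```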


18 Boosting as Optimization

19 Even boosting single features worked well…
(figure: results on the Reuters newswire corpus)

20 Some background facts: coordinate descent optimization to minimize f(w), where w = <w1,…,wN>
For t = 1,…,T or till convergence:
For i = 1,…,N:
Pick w* to minimize f(<w1,…,wi-1, w*, wi+1,…,wN>)
Set wi = w*
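A minimal sketch of this loop; the grid line search for the inner "pick w*" step and the toy quadratic objective are illustrative choices, not from the slide:

```python
import numpy as np

def coordinate_descent(f, w0, T=20, candidates=np.linspace(-5, 5, 201)):
    """Cycle through coordinates, setting each w_i to the best value on a grid."""
    w = np.array(w0, dtype=float)
    for t in range(T):
        for i in range(len(w)):
            # Pick w* minimizing f with every coordinate except i held fixed.
            trials = np.tile(w, (len(candidates), 1))
            trials[:, i] = candidates
            w[i] = candidates[np.argmin([f(v) for v in trials])]
    return w

# Toy objective: minimum is at (1.6, -2.4); coordinate descent finds it (up to grid resolution).
f = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2 + 0.5 * w[0] * w[1]
print(coordinate_descent(f, [0.0, 0.0]))
```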

21 Boosting as optimization using coordinate descent
With a small number of possible h's, you can think of boosting as finding a linear combination of them: f(x) = Σ_j α_j h_j(x). So boosting is sort of like stacking: it learns one weight per base hypothesis. Boosting uses coordinate descent (adjusting one α_j per round) to minimize an upper bound on the training error rate: Σ_i exp(-y_i f(x_i)).
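A sketch of that view, assuming the pool of possible h's is small enough to pre-compute as a matrix H of ±1 predictions (e.g., every decision stump evaluated on the training set); the exact coordinate update on the exponential-loss bound turns out to be the familiar α = ½ ln((1-ε)/ε):

```python
import numpy as np

def boost_by_coordinate_descent(H, y, T=20):
    """Coordinate descent on L(α) = Σ_i exp(-y_i Σ_j α_j H[i, j]).
    H[i, j] ∈ {-1, +1} is the prediction of the j-th (fixed) hypothesis on x_i."""
    y = np.asarray(y)
    alpha = np.zeros(H.shape[1])
    wrong = (H * y[:, None]) < 0                    # wrong[i, j]: does h_j err on x_i?
    for t in range(T):
        D = np.exp(-y * (H @ alpha))
        D /= D.sum()                                # current example weights
        errs = D @ wrong                            # weighted error ε_j of every hypothesis
        j = np.argmin(errs)                         # greedily pick the best coordinate
        eps = max(errs[j], 1e-12)
        if eps >= 0.5:                              # no hypothesis has an edge: stop
            break
        alpha[j] += 0.5 * np.log((1 - eps) / eps)   # exact 1-D minimizer, same step as AdaBoost
    return alpha
```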

22 Boosting and optimization
Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000 (FHT). Compared using AdaBoost to set feature weights vs. direct optimization of feature weights to minimize log-likelihood, squared error, …

23 Boosting as Margin Learning

24 Boosting didn’t seem to overfit…(!)
(figure: train error and test error vs. rounds of boosting)

25 …because it turned out to be increasing the margin of the classifier
(figure: cumulative margin distributions after 100 and 1000 rounds of boosting)
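A minimal sketch of the normalized margin that boosting is implicitly increasing; hyps and alphas are the outputs of the earlier AdaBoost sketch (names from that sketch, not from the slides):

```python
import numpy as np

def margins(hyps, alphas, X, y):
    """Normalized margins y_i f(x_i) / Σ_t |α_t|; each lies in [-1, +1]."""
    f = sum(a * h.predict(X) for h, a in zip(hyps, alphas))
    return y * f / np.sum(np.abs(alphas))

# To reproduce the slide's figure, plot the cumulative distribution of the margins, e.g.:
# m = np.sort(margins(hyps, alphas, X_train, y_train))
# plt.plot(m, np.arange(1, len(m) + 1) / len(m))
```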

26 Boosting movie

27 Some background facts: coordinate descent optimization to minimize f(w), where w = <w1,…,wN>
For t = 1,…,T or till convergence:
For i = 1,…,N:
Pick w* to minimize f(<w1,…,wi-1, w*, wi+1,…,wN>)
Set wi = w*

28 Boosting is closely related to margin classifiers like SVM, voted perceptron, … (!)
Boosting: the “coordinates” (one per base hypothesis) are being extended by one in each round of boosting, usually, unless you happen to generate the same tree twice.

29 Boosting is closely related to margin classifiers like SVM, voted perceptron, … (!)
Boosting: the margin of example i is y_i Σ_t α_t h_t(x_i) / Σ_t |α_t|, i.e. the hypothesis weights are L1-normalized and the base hypotheses' outputs lie in [-1, +1].
Linear SVMs: the margin of example i is y_i (w · x_i) / ||w||₂, i.e. the feature weights are L2-normalized.

30 Wrapup On Boosting

31 Boosting in the real world
William's wrap-up:
Boosting is not discussed much in the ML research community any more; it's now much too well understood.
It's really useful in practice as a meta-learning method. E.g., boosted Naïve Bayes usually beats Naïve Bayes.
Boosted decision trees are almost always competitive with respect to accuracy; very robust against rescaling numeric features, extra features, non-linearities, …; somewhat slower to learn and use than many linear classifiers.
But getting probabilities out of them is a little less reliable.

