Boosting and Additive Trees (Part 1) Ch. 10 Presented by Tal Blum.


1 Boosting and Additive Trees (Part 1) Ch. 10 Presented by Tal Blum

2 Overview
- Ensemble methods and motivation
- The AdaBoost.M1 algorithm
- Showing that AdaBoost minimizes the exponential loss
- Other loss functions for classification and regression

3 Ensemble Learning – Additive Models
INTUITION: Combining the predictions of an ensemble is often more accurate than using a single classifier.
Justification (several reasons):
- It is easy to find reasonably accurate "rules of thumb", but hard to find a single highly accurate prediction rule.
- If the training examples are few and the hypothesis space is large, there are several equally accurate classifiers (model uncertainty).
- The hypothesis space may not contain the true function, but a linear combination of hypotheses might.
- Exhaustive global search of the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.
Examples: Bagging, HME, Splines

4 Boosting (explaining)

5 Example learning curve for the simulated data: Y = 1 if Σ_j X_j² > χ²_10(0.5), and −1 otherwise
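The slide itself only shows a learning-curve plot; as a point of reference, here is a minimal sketch of how this simulated dataset can be generated, assuming the setup of ESL Ch. 10 (ten standard Gaussian features, label set by whether the squared norm exceeds the chi-squared median). Function and variable names are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import chi2

def make_chisq_data(n, p=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))          # X_j ~ N(0, 1), j = 1..p
    threshold = chi2.ppf(0.5, df=p)          # chi^2_10(0.5) ~= 9.34
    y = np.where((X ** 2).sum(axis=1) > threshold, 1, -1)
    return X, y

X_train, y_train = make_chisq_data(2000)
X_test, y_test = make_chisq_data(10000, seed=1)
```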

6 AdaBoost.M1 Algorithm
W(x) is the distribution of weights over the N training points, with ∑ W(x_i) = 1.
Initially assign uniform weights W_0(x) = 1/N for all x.
At each iteration k:
- Find the best weak classifier C_k(x) using weights W_k(x)
- Compute the weighted error rate ε_k = [∑ W_k(x_i) · I(y_i ≠ C_k(x_i))] / [∑ W_k(x_i)]
- Set α_k = log((1 – ε_k)/ε_k), the weight of classifier C_k in the final hypothesis
- For each x_i, update W_{k+1}(x_i) = W_k(x_i) · exp[α_k · I(y_i ≠ C_k(x_i))]
Output: C_FINAL(x) = sign[∑ α_k C_k(x)]
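A minimal Python sketch of the AdaBoost.M1 loop above, using scikit-learn depth-1 trees (decision stumps) as the weak learner. The function names and the choice of stump learner are illustrative assumptions, not from the slides; labels are assumed coded as ±1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, n_rounds=50):
    """y must be coded as +1 / -1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # W_0(x) = 1/N
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)              # best weak classifier under W_k
        miss = (stump.predict(X) != y).astype(float)  # I(y_i != C_k(x_i))
        err = max(np.sum(w * miss) / np.sum(w), 1e-10)
        if err >= 0.5:
            break                                     # weak learner no better than chance
        alpha = np.log((1 - err) / err)               # alpha_k
        w = w * np.exp(alpha * miss)                  # up-weight misclassified points
        w /= w.sum()
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    agg = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(agg)                               # C_FINAL(x) = sign(sum alpha_k C_k(x))
```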

7 Boosting as an Additive Model
The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers. The process is iterative, and typically we try to minimize a loss function on the training examples (the expansion is written out below).
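The formulas on this slide did not survive the transcript; the following is a reconstruction of the standard additive-expansion form from ESL Ch. 10 (an assumption about what the slide showed), where b(x; γ) is a basis function (here, a weak classifier) with parameters γ and coefficient β_m:

```latex
f(x) = \sum_{m=1}^{M} \beta_m\, b(x;\gamma_m),
\qquad
\min_{\{\beta_m,\gamma_m\}_{1}^{M}} \; \sum_{i=1}^{N} L\!\Big(y_i,\; \sum_{m=1}^{M}\beta_m\, b(x_i;\gamma_m)\Big)
```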

8 Forward Stagewise Additive Modeling – algorithm
1. Initialize f_0(x) = 0
2. For m = 1 to M:
   (a) Compute (β_m, γ_m) = argmin_{β,γ} ∑_i L(y_i, f_{m-1}(x_i) + β·b(x_i; γ))
   (b) Set f_m(x) = f_{m-1}(x) + β_m·b(x; γ_m)

9 Forward Stagewise Additive Modeling
Sequentially add new basis functions without adjusting the parameters of the previously fitted functions.
Simple case: squared-error loss. Forward stagewise modeling then amounts to just fitting the residuals from the previous iteration (see the sketch below).
Squared-error loss is not robust for classification.
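A minimal sketch (an illustrative assumption, not from the slides) of forward stagewise fitting with squared-error loss: each new regression stump is fit to the residuals left by the model built so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stagewise_squared_error(X, y, n_rounds=100, shrinkage=1.0):
    f = np.zeros(len(y))                         # f_0(x) = 0
    basis = []
    for _ in range(n_rounds):
        residual = y - f                         # r_i = y_i - f_{m-1}(x_i)
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, residual)                   # new basis function fit to the residuals
        f += shrinkage * stump.predict(X)        # f_m = f_{m-1} + b(x; gamma_m)
        basis.append(stump)
    return basis
```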

10 Exponential Loss and Adaboost AdaBoost for Classification: L(y, f (x)) = exp(-y ∙ f (x)) - the exponential loss function
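The derivation images on this slide were lost in transcription; the following reconstruction of the forward stagewise step with exponential loss (following ESL Ch. 10, an assumption about what the slide showed) makes explicit how per-example weights w_i^(m) arise:

```latex
(\beta_m, G_m)
  = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big[-y_i \big(f_{m-1}(x_i) + \beta\, G(x_i)\big)\big]
  = \arg\min_{\beta, G} \sum_{i=1}^{N} w_i^{(m)} \exp\!\big[-\beta\, y_i\, G(x_i)\big],
\qquad
w_i^{(m)} = \exp\!\big[-y_i\, f_{m-1}(x_i)\big]
```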

11 Exponential Loss and AdaBoost
Assuming β > 0, the minimizing classifier is G_m = argmin_G ∑_i w_i^(m) · I(y_i ≠ G(x_i)), i.e. the weak classifier that minimizes the weighted training error.

12 Finding the best β
Plugging G_m into the criterion and rewriting:
∑_i w_i^(m) e^{-β y_i G(x_i)} = e^{-β} ∑_{y_i = G(x_i)} w_i^(m) + e^{β} ∑_{y_i ≠ G(x_i)} w_i^(m)
                              = (e^{β} − e^{-β}) ∑_i w_i^(m) I(y_i ≠ G(x_i)) + e^{-β} ∑_i w_i^(m)
Minimizing over β gives β_m = ½ log((1 − err_m)/err_m), where err_m = ∑_i w_i^(m) I(y_i ≠ G_m(x_i)) / ∑_i w_i^(m).
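As a quick sanity check (illustrative, not from the slides), the minimizer of the weighted exponential criterion can be found numerically and compared against the closed form β_m = ½ log((1 − err)/err); the weights and error indicators below are synthetic.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
w = rng.random(100); w /= w.sum()          # weights w_i^(m), normalized
miss = rng.random(100) < 0.3               # indicator I(y_i != G(x_i))
err = np.sum(w * miss)                     # weighted error rate

# e^{-beta} * sum_{correct} w + e^{beta} * sum_{miss} w
objective = lambda b: np.sum(w * np.exp(np.where(miss, b, -b)))
beta_numeric = minimize_scalar(objective, bounds=(0, 10), method="bounded").x
beta_closed = 0.5 * np.log((1 - err) / err)
print(beta_numeric, beta_closed)           # the two values agree
```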

13

14 Historical Notes
AdaBoost was first presented in machine learning theory as a way to boost a weak classifier.
At first people thought it defied the "no free lunch" theorem and did not overfit.
The connection between AdaBoost and stagewise additive modeling was only recently discovered.

15 Why Exponential Loss?
Mainly computational:
- Derivatives are easy to compute.
- The optimal weak classifier at each step simply minimizes the weighted error on the sample.
- Under mild assumptions the instance weights decrease exponentially fast.
Statistical:
- Exponential loss is not necessary for the success of boosting – "On Boosting and the Exponential Loss" (Wyner).
- We will see more in the next slides.

16 Why Exponential Loss?
Population minimizer (Friedman 2000): f*(x) = argmin_f E[e^{-Y f(x)} | x] = ½ log [Pr(Y = 1 | x) / Pr(Y = −1 | x)]
This justifies using its sign as a classification rule.
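The derivation behind this result (a reconstruction following ESL Ch. 10, not shown explicitly on the slide) takes the conditional expectation and sets its derivative in f to zero:

```latex
E\big[e^{-Y f(x)} \mid x\big] = \Pr(Y{=}1 \mid x)\, e^{-f(x)} + \Pr(Y{=}{-1} \mid x)\, e^{f(x)},
\qquad
\frac{\partial}{\partial f}\,E\big[e^{-Y f(x)} \mid x\big] = 0
\;\Longrightarrow\;
f^{*}(x) = \tfrac{1}{2} \log \frac{\Pr(Y{=}1 \mid x)}{\Pr(Y{=}{-1} \mid x)}
```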

17 Why Exponential Loss?
For the binomial deviance: interpreting f as (half) a logit transform, p(x) = 1/(1 + e^{-2f(x)}), the population maximizer of the binomial log-likelihood and the population minimizer of the exponential loss are the same.
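The two formulas referenced on this slide (reconstructed following ESL Ch. 10; the slide images did not survive) are the logit interpretation of f and the resulting deviance as a function of the margin:

```latex
p(x) = \Pr(Y{=}1 \mid x) = \frac{1}{1 + e^{-2 f(x)}},
\qquad
-\,l\big(Y, p(x)\big) = \log\big(1 + e^{-2 Y f(x)}\big)
```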

18 Loss Functions and Robustness
For a finite dataset, exponential loss and binomial deviance are not the same.
Both criteria are monotone decreasing functions of the margin y·f(x).
Examples with negative margin, y·f(x) < 0, are classified incorrectly.

19 Loss Functions and Robustness
The problem: the misclassification loss is not differentiable everywhere, and has derivative 0 wherever it is differentiable. We want a criterion that is efficient to optimize and as close as possible to the true classification loss.
Any loss criterion used for classification should give higher weight to misclassified examples; the squared-error loss does not (it also penalizes confidently correct predictions), so it is not appropriate for classification.

20 Loss Functions and Robustness
Both exponential loss and binomial deviance can be thought of as continuous approximations to the misclassification loss.
Exponential loss grows exponentially fast for instances with large negative margin, so the weights of such instances increase exponentially. This makes AdaBoost very sensitive to mislabeled examples.
The deviance generalizes to K classes; the exponential loss does not.
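As an illustration (not from the slides), the losses discussed here can all be written as functions of the margin m = y·f(x); a quick sketch comparing their behavior:

```python
# Margin-based losses as functions of m = y * f(x). Squared error is written as
# (1 - m)^2 to show that it also penalizes confidently correct predictions (m > 1).
import numpy as np

def misclassification(m):
    return (m < 0).astype(float)

def exponential(m):
    return np.exp(-m)

def binomial_deviance(m):
    return np.log(1 + np.exp(-2 * m))     # ESL convention, up to scale

def squared_error(m):
    return (1 - m) ** 2

margins = np.linspace(-2, 2, 9)
for name, loss in [("misclass", misclassification), ("exp", exponential),
                   ("deviance", binomial_deviance), ("squared", squared_error)]:
    print(name, np.round(loss(margins), 2))
```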

21 Robust Loss Functions for Regression
The relationship between squared loss and absolute loss is analogous to that between exponential loss and deviance. Their population solutions are the conditional mean and median respectively, and the absolute loss is more robust.
Using squared-error loss in the forward stagewise framework gives a boosting procedure for regression (fitting residuals).
For efficiency under Gaussian errors combined with robustness to outliers, use the Huber loss:
L(y, f(x)) = [y − f(x)]² if |y − f(x)| ≤ δ, and 2δ|y − f(x)| − δ² otherwise.
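A small sketch (an illustrative assumption, not from the slides) of the Huber criterion, which is quadratic for small residuals and linear for large ones, limiting the influence of outliers:

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    r = np.abs(y - f)
    return np.where(r <= delta, r ** 2, 2 * delta * r - delta ** 2)

y = np.array([1.0, 2.0, 3.0, 100.0])      # last point is an outlier
f = np.array([1.1, 1.9, 3.2, 3.0])
print(huber_loss(y, f, delta=1.0))        # outlier contributes linearly, not quadratically
```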

22 Sample of UCI Datasets Comparison

| Dataset | J48 | J48 + bagging (10) | AdaBoost w/ decision stumps | SVM (SMO) | BNet | NB | NN 1 | LBMA | LBMA deviance |
|---|---|---|---|---|---|---|---|---|---|
| colic | 85.1 | 82.8 | 81.08 | 78.38 | 78.37 | 79.73 | 77 | 85.1 | 82.43 |
| anneal (70%) | 96.6 | 97.4 | 84.07 | 97.04 | 92.22 | 91.8 | 94.07 | 97 | 97.04 |
| credit-a (x10) | 84.49 | 86.67 | 85.94 | 85.65 | 85.07 | 85.36 | 80 | 86.22 | 84.06 |
| iris (disc5) x10 | 93.3 | 94 | 87.3 | 94 | 93.3 | – | – | 94.67 | 94 |
| soybean-9 x2 | 84.87 | 79.83 | 27.73 | 86.83 | 83.19 | 84.59 | 80.11 | 87.36 | 87.68 |
| soybean-37 | 90.51 | 85.4 | 24.09 | 93.43 | 90.51 | 88.32 | 82.48 | 92.7 | 94.16 |
| labor (disc5) | 70.18 | 78.95 | 87.82 | 87.72 | 94.74 | 91.23 | 85.96 | 94.74 | – |
| autos (disc5) x2 | 70.73 | 64.39 | 44.88 | 73.17 | 61.95 | 61.46 | 77.07 | 65.35 | 76.1 |
| credit-g (70%) | 74.33 | 73.67 | 74.33 | 74.67 | 77 | 76.67 | 67.67 | 74.33 | 76.67 |
| glass x5 | 57.94 | 56.54 | 42.06 | 57.94 | 56.54 | 54.67 | 55.14 | 58.41 | 57.48 |
| diabetes | 68.36 | 68.49 | 71.61 | 70.18 | 70.31 | 69.92 | 64.45 | 68 | 69.4 |
| audiology | 76.55 | – | 46.46 | 80.97 | 75.22 | 71.24 | 73.45 | 79.6 | 80.09 |
| breast-cancer | 74.13 | 68.18 | 72.38 | 69.93 | 72.03 | 72.73 | 68.18 | 75.52 | 76.22 |
| heart-c-disc | 77.56 | 81.19 | 84.49 | 83.17 | 84.16 | 83.83 | 76.57 | 80.21 | 84.16 |
| vowel x5 | 71.92 | – | 17.97 | 86.46 | 63.94 | – | 90.7 | 94.04 | 93.84 |
| Average | 78.44 | 77.732 | 62.1473 | 81.379 | 77.9 | 77.74 | – | 82.22 | 83.205 |

23 Next Presentation

