# Ensemble Learning Reading: R. Schapire, A brief introduction to boosting.

## Presentation on theme: "Ensemble Learning Reading: R. Schapire, A brief introduction to boosting."— Presentation transcript:

Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Ensemble learning Training sets Hypotheses Ensemble hypothesis S 1 h 1 S 2 h 2. S N h N H

Advantages of ensemble learning Can be very effective at reducing generalization error! (E.g., by voting.) Ideal case: the h i have independent errors

Example Given three hypotheses, h 1, h 2, h 3, with h i (x)  {−1,1} Suppose each h i has 60% generalization accuracy, and assume errors are independent. Now suppose H(x) is the majority vote of h 1, h 2, and h 3. What is probability that H is correct?

h1h1 h2h2 h3h3 Hprobability CCCC.216 CCIC.144 CIII.096 CICC.144 ICCC IICI.096 ICII IIII.064 Total probability correct:.648

Again, given three hypotheses, h 1, h 2, h 3. Suppose each h i has 40% generalization accuracy, and assume errors are independent. Now suppose we classify x as the majority vote of h 1, h 2, and h 3. What is probability that the classification is correct? Another Example

h1h1 h2h2 h3h3 Hprobability CCCC.064 CCIC.096 CIII.144 CICC.096 ICCC IICI.144 ICII IIII.261 Total probability correct:.352

In general, if hypotheses h 1,..., h M all have generalization accuracy A, what is probability that a majority vote will be correct? General case

Possible problems with ensemble learning Errors are typically not independent Training time and classification time are increased by a factor of M. Hard to explain how ensemble hypothesis does classification. How to get enough data to create M separate data sets, S 1,..., S M ?

Three popular methods: –Voting: Train classifier on M different training sets S i to obtain M different classifiers h i. For a new instance x, define H(x) as: where α i is a confidence measure for classifier h i

–Bagging (Breiman, 1990s): To create S i, create “bootstrap replicates” of original training set S –Boosting (Schapire & Freund, 1990s) To create S i, reweight examples in original training set S as a function of whether or not they were misclassified on the previous round.

Adaptive Boosting (Adaboost) A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

Sketch of algorithm Given examples S and learning algorithm L, with | S | = N Initialize probability distribution over examples w 1 (i) = 1/N. Repeatedly run L on training sets S t  S to produce h 1, h 2,..., h K. –At each step, derive S t from S by choosing examples probabilistically according to probability distribution w t. Use S t to learn h t. At each step, derive w t+1 by giving more probability to examples that were misclassified at step t. The final ensemble classifier H is a weighted sum of the h t ’s, with each weight being a function of the corresponding h t ’s error on its training set.

Adaboost algorithm Given S = {(x 1, y 1 ),..., (x N, y N )} where x  X, y i  {+1, −1} Initialize w 1 (i) = 1/N. (Uniform distribution over data)

For t = 1,..., K: –Select new training set S t from S with replacement, according to w t –Train L on S t to obtain hypothesis h t –Compute the training error  t of h t on S : –Compute coefficient

–Compute new weights on data: For i = 1 to N where Z t is a normalization factor chosen so that w t+1 will be a probability distribution:

At the end of K iterations of this algorithm, we have h 1, h 2,..., h K We also have  1,  2,...,  K, where Ensemble classifier: Note that hypotheses with higher accuracy on their training sets are weighted more strongly.

A Hypothetical Example where { x 1, x 2, x 3, x 4 } are class +1 {x 5, x 6, x 7, x 8 } are class −1 t = 1 : w 1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8} S 1 = {x 1, x 2, x 2, x 5, x 5, x 6, x 7, x 8 } (notice some repeats) Train classifier on S 1 to get h 1 Run h 1 on S. Suppose classifications are: {1, −1, −1, −1, −1, −1, −1, −1} Calculate error:

Calculate  ’s: Calculate new w’s:

t = 2 w 2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102} S 2 = {x 1, x 2, x 2, x 3, x 4, x 4, x 7, x 8 } Run classifier on S 2 to get h 2 Run h 2 on S. Suppose classifications are: {1, 1, 1, 1, 1, 1, 1, 1} Calculate error:

Calculate  ’s: Calculate w’s:

t =3 w 3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125} S 3 = {x 2, x 3, x 3, x 3, x 5, x 6, x 7, x 8 } Run classifier on S 3 to get h 3 Run h 3 on S. Suppose classifications are: {1, 1, −1, 1, −1, −1, 1, −1} Calculate error:

Calculate  ’s: Ensemble classifier:

What is the accuracy of H on the training data? ExampleActual class h1h1 h2h2 h3h3 x1x1 1111 x2x2 1−111 x3x3 1 1 x4x4 1111 x5x5 1 x6x6 1 x7x7 111 x8x8 1 where { x 1, x 2, x 3, x 4 } are class +1 {x 5, x 6, x 7, x 8 } are class −1 Recall the training set:

HW 2: SVMs and Boosting Caveat: I don’t know if boosting will help. But worth a try!

Boosting implementation in HW 2 Use arrays and vectors. Let S be the training set. You can set S to DogsVsCats.train or some subset of examples. Let M = |S| be the number of training examples. Let K be the number of boosting iterations. ; Initialize: Read in S as an array of M rows and 65 columns. Create the weight vector w(1) = (1/M, 1/M, 1/M,..., 1/M)

; Run Boosting Iterations: For T = 1 to K { ; Create training set T: For i = 1 to M { Select (with replacement) an example from S using weights in w. (Use “roulette wheel” selection algorithm) Print that example to S (i.e., “S1”) ; Learn classifier T svm_learn S h ; Calculate Error : svm_classify S h.predictions (e.g., “svm_classify S h1 1.predictions”) Use.predictions to determine which examples in S were misclassified. Error = Sum of weights of examples that were misclassified.

; Calculate Alpha ; Calculate new weight vector, using Alpha and.predictions file } ; T  T + 1 At this point, you have h1, h2,..., hK [the K different classifiers] Alpha1, Alpha2,..., AlphaK

; Run Ensemble Classifier on test data (DogsVsCats.test): For T = 1 to K: svm_classify DogsVsCats.test h.model Test(T).predictions } At this point you have: Test1.predictions, Test2.predictions,...,TestK.predictions To classify test example x1: The sign of row 1 of Test1.predictions gives h1(x1) The sign of row 1 of Test2.predictions gives h2(x1) etc. H(x1) = sgn [Alpha1 * h1(x1) + Alpha2 * h2(x1) +... + AlphaK * hK(x1)] Compute H(x) for every x in the test set.

Sources of Classifier Error: Bias, Variance, and Noise Bias: –Classifier cannot learn the correct hypothesis (no matter what training data is given), and so incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets. Variance: –Training data is not representative enough of all data, so the learned classifier h varies from one training set to another. Noise: –Training data contains errors, so incorrect hypothesis h is learned. Give examples of these.

Adaboost seems to reduce both bias and variance. Adaboost does not seem to overfit for increasing K. Optional: Read about “Margin-theory” explanation of success of Boosting