Published by Marvin Ackley. Modified over 3 years ago.

1
Ensemble Learning
Reading: R. Schapire, A brief introduction to boosting

2
Ensemble learning
[Diagram: training sets $S_1, S_2, \ldots, S_N$ each yield a hypothesis $h_1, h_2, \ldots, h_N$; these are combined into the ensemble hypothesis $H$.]

3
Advantages of ensemble learning
Can be very effective at reducing generalization error (e.g., by voting).
Ideal case: the $h_i$ have independent errors.

4
Example
Given three hypotheses, $h_1$, $h_2$, $h_3$, with $h_i(x) \in \{-1, 1\}$.
Suppose each $h_i$ has 60% generalization accuracy, and assume errors are independent.
Now suppose $H(x)$ is the majority vote of $h_1$, $h_2$, and $h_3$. What is the probability that $H$ is correct?

5
(C = correct, I = incorrect)
h1   h2   h3   H    probability
C    C    C    C    .216
C    C    I    C    .144
C    I    I    I    .096
C    I    C    C    .144
I    C    C    C    .144
I    I    C    I    .096
I    C    I    I    .096
I    I    I    I    .064
Total probability correct: .648

6
Another Example
Again, given three hypotheses, $h_1$, $h_2$, $h_3$.
Suppose each $h_i$ has 40% generalization accuracy, and assume errors are independent.
Now suppose we classify $x$ as the majority vote of $h_1$, $h_2$, and $h_3$. What is the probability that the classification is correct?

7
(C = correct, I = incorrect)
h1   h2   h3   H    probability
C    C    C    C    .064
C    C    I    C    .096
C    I    I    I    .144
C    I    C    C    .096
I    C    C    C    .096
I    I    C    I    .144
I    C    I    I    .144
I    I    I    I    .216
Total probability correct: .352

8
General case
In general, if hypotheses $h_1, \ldots, h_M$ all have generalization accuracy $A$, what is the probability that a majority vote will be correct?
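One way to approach the general-case question, assuming the errors are independent and $M$ is odd (so there are no ties), is to sum the binomial probabilities that more than half of the hypotheses are correct: $\sum_{k > M/2} \binom{M}{k} A^k (1-A)^{M-k}$. The sketch below evaluates that sum; the function name `majority_vote_accuracy` is made up for illustration.

```python
from math import comb


def majority_vote_accuracy(M: int, A: float) -> float:
    """Probability that a majority of M independent hypotheses, each correct
    with probability A, votes for the correct class. Assumes M is odd."""
    if M % 2 == 0:
        raise ValueError("M must be odd so that a strict majority always exists")
    # Sum over every count k of correct hypotheses that forms a majority.
    return sum(
        comb(M, k) * A**k * (1 - A) ** (M - k)
        for k in range(M // 2 + 1, M + 1)
    )


if __name__ == "__main__":
    print(round(majority_vote_accuracy(3, 0.6), 3))  # 0.648, as in the first example
    print(round(majority_vote_accuracy(3, 0.4), 3))  # 0.352, as in the second example
```

Under these assumptions, voting helps when $A > 0.5$ and hurts when $A < 0.5$, which matches the two worked examples.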

9
Possible problems with ensemble learning
–Errors are typically not independent.
–Training time and classification time are increased by a factor of M.
–Hard to explain how the ensemble hypothesis does classification.
–How to get enough data to create M separate data sets, $S_1, \ldots, S_M$?

10
Three popular methods:
–Voting: Train a classifier on M different training sets $S_i$ to obtain M different classifiers $h_i$. For a new instance $x$, define $H(x)$ as:
$$H(x) = \mathrm{sgn}\left(\sum_{i=1}^{M} \alpha_i h_i(x)\right)$$
where $\alpha_i$ is a confidence measure for classifier $h_i$.

11
–Bagging (Breiman, 1990s): To create $S_i$, create “bootstrap replicates” of the original training set $S$.
–Boosting (Schapire & Freund, 1990s): To create $S_i$, reweight examples in the original training set $S$ as a function of whether or not they were misclassified on the previous round.
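A bootstrap replicate is usually taken to mean a sample of $|S|$ examples drawn uniformly from $S$ with replacement. The sketch below follows that reading; the function name `bootstrap_replicates` and the toy example labels are made up for illustration.

```python
import random


def bootstrap_replicates(S, M, seed=None):
    """Return M bootstrap replicates of S. Each replicate has len(S) items
    drawn uniformly from S with replacement, so repeats are expected."""
    rng = random.Random(seed)
    n = len(S)
    return [[S[rng.randrange(n)] for _ in range(n)] for _ in range(M)]


if __name__ == "__main__":
    S = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
    for i, S_i in enumerate(bootstrap_replicates(S, M=3, seed=0), start=1):
        print(f"S_{i} = {S_i}")
```

Each replicate would then be handed to the base learner to produce one $h_i$.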

12
Adaptive Boosting (Adaboost)
A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%).

13
Sketch of algorithm
Given examples $S$ and learning algorithm $L$, with $|S| = N$.
Initialize probability distribution over examples: $w_1(i) = 1/N$.
Repeatedly run $L$ on training sets $S_t \subset S$ to produce $h_1, h_2, \ldots, h_K$.
–At each step, derive $S_t$ from $S$ by choosing examples probabilistically according to probability distribution $w_t$. Use $S_t$ to learn $h_t$.
–At each step, derive $w_{t+1}$ by giving more probability to examples that were misclassified at step $t$.
The final ensemble classifier $H$ is a weighted sum of the $h_t$’s, with each weight being a function of the corresponding $h_t$’s error on its training set.

14
Adaboost algorithm
Given $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ where $x_i \in X$, $y_i \in \{+1, -1\}$.
Initialize $w_1(i) = 1/N$. (Uniform distribution over data)

15
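For $t = 1, \ldots, K$:
–Select new training set $S_t$ from $S$ with replacement, according to $w_t$.
–Train $L$ on $S_t$ to obtain hypothesis $h_t$.
–Compute the training error $\epsilon_t$ of $h_t$ on $S$:
$$\epsilon_t = \sum_{i \,:\, h_t(x_i) \neq y_i} w_t(i)$$
–Compute coefficient
$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$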

16
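–Compute new weights on data: For $i = 1$ to $N$
$$w_{t+1}(i) = \frac{w_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}$$
where $Z_t$ is a normalization factor chosen so that $w_{t+1}$ will be a probability distribution:
$$Z_t = \sum_{i=1}^{N} w_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)$$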

17
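At the end of $K$ iterations of this algorithm, we have $h_1, h_2, \ldots, h_K$.
We also have $\alpha_1, \alpha_2, \ldots, \alpha_K$, where
$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$
Ensemble classifier:
$$H(x) = \mathrm{sgn}\left(\sum_{t=1}^{K} \alpha_t h_t(x)\right)$$
Note that hypotheses with higher accuracy on their training sets are weighted more strongly.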
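The per-round bookkeeping (error, coefficient, reweighting) can be sketched as follows. This assumes the standard forms given in the Schapire reading: the error $\epsilon_t$ is the total weight of misclassified examples, $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and each weight is multiplied by $\exp(-\alpha_t y_i h_t(x_i))$ and then renormalized. The function name `adaboost_round` and the four-example data in the usage lines are made up for illustration.

```python
import math


def adaboost_round(w, y, pred):
    """One reweighting round.

    w    : current weights over the N examples (assumed to sum to 1)
    y    : true labels, each +1 or -1
    pred : predictions of h_t on the same examples, each +1 or -1
    Returns (epsilon, alpha, new_weights).
    """
    # Weighted error: total weight of the misclassified examples.
    epsilon = sum(w_i for w_i, y_i, p_i in zip(w, y, pred) if y_i != p_i)
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - epsilon) / epsilon)
    # y_i * p_i is +1 when correct (weight shrinks) and -1 when wrong (weight grows).
    unnormalized = [
        w_i * math.exp(-alpha * y_i * p_i) for w_i, y_i, p_i in zip(w, y, pred)
    ]
    Z = sum(unnormalized)  # normalization factor
    return epsilon, alpha, [u / Z for u in unnormalized]


if __name__ == "__main__":
    y = [1, 1, -1, -1]
    pred = [1, -1, -1, -1]  # the second example is misclassified
    epsilon, alpha, w_next = adaboost_round([0.25] * 4, y, pred)
    print(epsilon, round(alpha, 3), [round(v, 3) for v in w_next])
```

With this update rule the misclassified examples end up holding half of the total weight after renormalization, which is why the next hypothesis is pushed toward them.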

18
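A Hypothetical Example
Training set $S = \{x_1, \ldots, x_8\}$, where $\{x_1, x_2, x_3, x_4\}$ are class +1 and $\{x_5, x_6, x_7, x_8\}$ are class −1.
$t = 1$:
$w_1 = \{1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8\}$
$S_1 = \{x_1, x_2, x_2, x_5, x_5, x_6, x_7, x_8\}$ (notice some repeats)
Train classifier on $S_1$ to get $h_1$.
Run $h_1$ on $S$. Suppose classifications are: $\{1, -1, -1, -1, -1, -1, -1, -1\}$
Calculate error ($x_2$, $x_3$, $x_4$ are misclassified):
$$\epsilon_1 = \sum_{i \,:\, h_1(x_i) \neq y_i} w_1(i) = 3 \times \tfrac{1}{8} = 0.375$$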

19
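Calculate $\alpha$’s:
$$\alpha_1 = \frac{1}{2} \ln\left(\frac{1 - \epsilon_1}{\epsilon_1}\right) = \frac{1}{2} \ln\left(\frac{0.625}{0.375}\right) \approx 0.255$$
Calculate new $w$’s:
$$w_2(i) = \frac{w_1(i) \exp\left(-\alpha_1 y_i h_1(x_i)\right)}{Z_1}$$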

20
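$t = 2$:
$w_2 = \{0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102\}$
$S_2 = \{x_1, x_2, x_2, x_3, x_4, x_4, x_7, x_8\}$
Train classifier on $S_2$ to get $h_2$.
Run $h_2$ on $S$. Suppose classifications are: $\{1, 1, 1, 1, 1, 1, 1, 1\}$
Calculate error ($x_5$, $x_6$, $x_7$, $x_8$ are misclassified):
$$\epsilon_2 = \sum_{i \,:\, h_2(x_i) \neq y_i} w_2(i) = 4 \times 0.102 = 0.408$$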

21
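Calculate $\alpha$’s:
$$\alpha_2 = \frac{1}{2} \ln\left(\frac{1 - \epsilon_2}{\epsilon_2}\right) = \frac{1}{2} \ln\left(\frac{0.592}{0.408}\right) \approx 0.186$$
Calculate $w$’s:
$$w_3(i) = \frac{w_2(i) \exp\left(-\alpha_2 y_i h_2(x_i)\right)}{Z_2}$$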

22
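$t = 3$:
$w_3 = \{0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125\}$
$S_3 = \{x_2, x_3, x_3, x_3, x_5, x_6, x_7, x_8\}$
Train classifier on $S_3$ to get $h_3$.
Run $h_3$ on $S$. Suppose classifications are: $\{1, 1, -1, 1, -1, -1, 1, -1\}$
Calculate error ($x_3$ and $x_7$ are misclassified):
$$\epsilon_3 = \sum_{i \,:\, h_3(x_i) \neq y_i} w_3(i) = 0.139 + 0.125 = 0.264$$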

23
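Calculate $\alpha$’s:
$$\alpha_3 = \frac{1}{2} \ln\left(\frac{1 - \epsilon_3}{\epsilon_3}\right) = \frac{1}{2} \ln\left(\frac{0.736}{0.264}\right) \approx 0.513$$
Ensemble classifier:
$$H(x) = \mathrm{sgn}\left(0.255\, h_1(x) + 0.186\, h_2(x) + 0.513\, h_3(x)\right)$$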

24
What is the accuracy of $H$ on the training data?
Recall the training set: $\{x_1, x_2, x_3, x_4\}$ are class +1; $\{x_5, x_6, x_7, x_8\}$ are class −1.
Example   Actual class   h1    h2    h3
x1         1              1     1     1
x2         1             −1     1     1
x3         1             −1     1    −1
x4         1             −1     1     1
x5        −1             −1     1    −1
x6        −1             −1     1    −1
x7        −1             −1     1     1
x8        −1             −1     1    −1
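The question can be checked mechanically. The sketch below takes the labels, the three sets of classifications, and the weight vectors $w_1$, $w_2$, $w_3$ exactly as listed in the hypothetical example, and assumes the standard coefficient $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ with $\epsilon_t$ equal to the total weight of misclassified examples. The helper names are made up for illustration.

```python
import math

# Data copied from the hypothetical example (x1..x8).
y = [1, 1, 1, 1, -1, -1, -1, -1]
h = [
    [1, -1, -1, -1, -1, -1, -1, -1],  # h1 on S
    [1, 1, 1, 1, 1, 1, 1, 1],         # h2 on S
    [1, 1, -1, 1, -1, -1, 1, -1],     # h3 on S
]
w = [
    [1 / 8] * 8,                                               # w1
    [0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102],  # w2
    [0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125],  # w3
]


def weighted_error(weights, labels, preds):
    """Total weight of the examples that preds gets wrong."""
    return sum(wi for wi, yi, pi in zip(weights, labels, preds) if yi != pi)


def coefficient(epsilon):
    """alpha = 1/2 * ln((1 - epsilon) / epsilon), assumed standard form."""
    return 0.5 * math.log((1.0 - epsilon) / epsilon)


alphas = [coefficient(weighted_error(wt, y, ht)) for wt, ht in zip(w, h)]


def ensemble(i):
    """Sign of the alpha-weighted vote of h1, h2, h3 on example i."""
    score = sum(a * ht[i] for a, ht in zip(alphas, h))
    return 1 if score > 0 else -1


H = [ensemble(i) for i in range(len(y))]
accuracy = sum(p == t for p, t in zip(H, y)) / len(y)

if __name__ == "__main__":
    print([round(a, 3) for a in alphas])
    print(H)
    print(accuracy)  # 0.75
```

Under these assumptions $H$ agrees with $h_3$ on every training example, so it misclassifies $x_3$ and $x_7$.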

25
HW 2: SVMs and Boosting
Caveat: I don’t know if boosting will help. But worth a try!

26
Boosting implementation in HW 2
Use arrays and vectors.
Let S be the training set. You can set S to DogsVsCats.train or some subset of examples.
Let M = |S| be the number of training examples.
Let K be the number of boosting iterations.
; Initialize:
Read in S as an array of M rows and 65 columns.
Create the weight vector w(1) = (1/M, 1/M, 1/M, ..., 1/M)

27
; Run Boosting Iterations:
For T = 1 to K {
  ; Create training set T:
  For i = 1 to M {
    Select (with replacement) an example from S using weights in w. (Use “roulette wheel” selection algorithm)
    Print that example to S<T> (i.e., “S1”)
  }
  ; Learn classifier T
  svm_learn S<T> h<T>
  ; Calculate Error<T>:
  svm_classify S h<T> <T>.predictions (e.g., “svm_classify S h1 1.predictions”)
  Use <T>.predictions to determine which examples in S were misclassified.
  Error<T> = Sum of weights of examples that were misclassified.

28
  ; Calculate Alpha<T>
  ; Calculate new weight vector, using Alpha<T> and <T>.predictions file
} ; T ← T + 1
At this point, you have
h1, h2, ..., hK [the K different classifiers]
Alpha1, Alpha2, ..., AlphaK
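The “roulette wheel” selection step in the loop above can be sketched as follows: lay the weights end to end on a line, pick a uniform random point on it, and return the example whose segment contains that point. The function names `roulette_select` and `resample` are made up for illustration; writing the selected rows to a file for svm_learn is left out.

```python
import random
from itertools import accumulate


def roulette_select(weights, rng):
    """Return index i with probability weights[i] / sum(weights)."""
    cumulative = list(accumulate(weights))
    r = rng.random() * cumulative[-1]  # uniform point on the wheel
    for i, upper in enumerate(cumulative):
        if r < upper:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def resample(S, weights, seed=None):
    """Draw len(S) examples from S with replacement, according to weights."""
    rng = random.Random(seed)
    return [S[roulette_select(weights, rng)] for _ in range(len(S))]


if __name__ == "__main__":
    S = ["x1", "x2", "x3", "x4"]
    print(resample(S, [0.1, 0.2, 0.3, 0.4], seed=1))
```

The linear scan costs O(M) per draw; for large M a binary search over the cumulative sums would be a reasonable alternative.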

29
; Run Ensemble Classifier on test data (DogsVsCats.test):
For T = 1 to K {
  svm_classify DogsVsCats.test h<T>.model Test<T>.predictions
}
At this point you have:
Test1.predictions, Test2.predictions, ..., TestK.predictions
To classify test example x1:
The sign of row 1 of Test1.predictions gives h1(x1)
The sign of row 1 of Test2.predictions gives h2(x1)
etc.
H(x1) = sgn [Alpha1 * h1(x1) + Alpha2 * h2(x1) + ... + AlphaK * hK(x1)]
Compute H(x) for every x in the test set.

30
Sources of Classifier Error: Bias, Variance, and Noise
Bias:
–Classifier cannot learn the correct hypothesis (no matter what training data is given), and so an incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets.
Variance:
–Training data is not representative enough of all data, so the learned classifier h varies from one training set to another.
Noise:
–Training data contains errors, so an incorrect hypothesis h is learned.
Give examples of these.

31
Adaboost seems to reduce both bias and variance.
Adaboost does not seem to overfit for increasing K.
Optional: Read about the “margin theory” explanation of the success of boosting.
