Published by Conrad Kilton. Modified over 2 years ago.

1
My name is Dustin Boswell and I will be presenting: Ensemble Methods in Machine Learning, by Thomas G. Dietterich (Oregon State University, Corvallis, Oregon)

2
Classification problem through supervised learning. (Notation) Given a set S of training examples {x_1, x_2, …, x_m} with corresponding class labels {y_1, y_2, …, y_m}. Each x_i is a vector with n “features”. Each y_i is one of K class labels. A learning algorithm takes S and produces a hypothesis h.

3
Pictorial View of the Learning Process. [Diagram: the training pairs (x_1, y_1), (x_2, y_2), …, (x_m, y_m) feed into the Learning Algorithm (Neural Net, Nearest Neighbor, etc…), which produces a Hypothesis (gimme an x, I’ll give you a y) that maps x to y.]

4
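The Ensemble Method. [Diagram: the training pairs (x_1, y_1), (x_2, y_2), …, (x_m, y_m) feed into the Ensembled Learning Algorithm, which maps x to y. “Hey, what’s going on inside there?” Inside are the hypotheses H1, H2, H3, ..., HL: “Give us an x, we’ll give you a y.”]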

5
Characteristics of Ensemble Classifiers
- An ensemble classifier is often more accurate than any of its individual members.
- A necessary and sufficient condition for this is that the individual classifiers be accurate and diverse.
- A classifier is accurate when its error rate is less than 1/2 (better than random guessing on a two-class problem).
- Two classifiers are diverse when their out-of-sample errors are uncorrelated (they are independent random variables).
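To see why accuracy and diversity matter together, here is a minimal sketch that computes how often a majority vote is wrong. It assumes every member errs independently with the same error rate, which is an idealization. The function name `majority_vote_error` and the choice of 21 members are illustrative, not from the slides.

```python
from math import comb


def majority_vote_error(num_classifiers: int, error_rate: float) -> float:
    """Probability that more than half of the classifiers are wrong at once.

    Assumes (idealized) that each classifier errs independently with the
    same error_rate, so the number of wrong votes is binomial.
    """
    first_losing_count = num_classifiers // 2 + 1
    return sum(
        comb(num_classifiers, k)
        * error_rate**k
        * (1 - error_rate) ** (num_classifiers - k)
        for k in range(first_losing_count, num_classifiers + 1)
    )


if __name__ == "__main__":
    # Below 1/2 the vote helps; above 1/2 it makes things worse.
    for p in (0.3, 0.45, 0.55):
        print(p, majority_vote_error(21, p))
```

With an individual error rate below 1/2 the ensemble error falls well below that of any single member. With an error rate above 1/2 the vote amplifies the errors instead, which is why the accuracy condition is needed.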

6
Fundamental Reasons why Ensembles might perform better.
- Statistical. When the training set is too small, there are many hypotheses which satisfy it. Ensembling reduces the chance of picking a bad classifier.
- Computational. Depending on the learning algorithm, individuals might get stuck in local minima of the training error. Ensembling reduces the chance of getting stuck in a bad minimum.
- Representational. An ensemble may represent a classifier which was not possible in the original set of hypotheses.

7
Methods for obtaining an Ensemble.
Problem: Given that we only have one training set and one learning algorithm, how can we produce multiple hypotheses?
Solution: Fiddle with everything we can get our hands on!
- Manipulate the training examples
- Manipulate the input data points (their features)
- Manipulate the target output (the class labels) of the training data
- Inject randomness

8
Manipulating the training examples. Run the learning algorithm multiple times with different subsets of the training data. This works well for unstable learning algorithms, whose output changes a lot in response to small changes in the training data.
Unstable: - decision trees - neural networks - rule learning
Stable: - linear regression - nearest neighbor - linear threshold

9
Bagging (manipulating training examples)
Take multiple bootstrap replicates of the training data.
Question: If you sample N points from a batch of N (with replacement), how many of the original N do you expect to have? Poll the audience…
– On each draw, a given point has a (N-1)/N chance of being missed.
– It is completely left out after N draws with probability [(N-1)/N]^N.
– Thus it is included at least once with probability 1 - [(N-1)/N]^N = 1 - [1 - 1/N]^N, which approaches 1 - 1/e ≈ 0.63 as N gets large.
– Thus we expect about 63% of the original points to appear in the bootstrap replicate.
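The 63% figure can be checked empirically. The following is a small sketch using only the standard library. The function names and the choice of N = 1000 are assumptions made for illustration.

```python
import math
import random


def bootstrap_replicate(examples, rng):
    """Draw len(examples) items uniformly from examples, with replacement."""
    return [rng.choice(examples) for _ in examples]


def included_fraction(n, trials, rng):
    """Average fraction of the n original points that appear in a replicate."""
    original = list(range(n))
    total = 0.0
    for _ in range(trials):
        total += len(set(bootstrap_replicate(original, rng))) / n
    return total / trials


def exact_inclusion_probability(n):
    """Probability that a given point appears at least once in n draws."""
    return 1 - (1 - 1 / n) ** n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print("simulated:", included_fraction(1000, 50, rng))
    print("exact    :", exact_inclusion_probability(1000))
    print("limit    :", 1 - 1 / math.e)
```

In bagging, each replicate would be handed to the learning algorithm to produce one member of the ensemble. The sketch stops at the sampling step because that is what the question on the slide is about.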

10
AdaBoost (still manipulating the training set) - chooses a series of hypotheses, where the later ones are designed to excel on the training examples that the earlier hypotheses got wrong.
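One common way to make later hypotheses focus on earlier mistakes is to keep a weight on every training example and raise the weights of the misclassified ones after each round. The sketch below shows that idea for two classes labeled +1 and -1. The base learner (a one-dimensional threshold "stump"), the toy data, and all function names are assumptions chosen for brevity, not something taken from the slides.

```python
import math


def stump_predict(stump, x):
    """A stump is (threshold, sign): predict sign below the threshold."""
    threshold, sign = stump
    return sign if x < threshold else -sign


def best_stump(xs, ys, weights):
    """Return the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    # Candidate thresholds sit 0.5 above each point (assumes integer inputs).
    for threshold in sorted(x + 0.5 for x in set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((threshold, sign), x) != y)
            if err < best_err:
                best, best_err = (threshold, sign), err
    return best, best_err


def adaboost(xs, ys, rounds):
    """Train a weighted list of stumps; labels in ys must be +1 or -1."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples get heavier, correct ones get lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    xs = list(range(10))
    ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # no single stump fits this
    ensemble = adaboost(xs, ys, rounds=3)
    print([ensemble_predict(ensemble, x) for x in xs])
```

On this toy data no single threshold separates the classes, but the weighted vote of three stumps fits every training point, because each round concentrates on what the previous rounds missed.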

11
Manipulation of the input data
- Each input x is a vector of n features.
- Train multiple hypotheses on the same training set, but for each x_i use only a subset of the n features.
- Cherkauer (1996) used this method to train an ensemble of neural nets to identify volcanoes on Venus.
  - There were 119 input features.
  - They were grouped (by hand) into subsets of features based on different image processing operations, like PCA, Fourier, etc…
  - The resulting ensemble matched the ability of expert humans.
- Tumer and Ghosh (1996) applied this technique to sonar data and found that removing any of the input features hurt the performance.
- The technique only works when the features contain redundant data.
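The feature-subset idea can be sketched in a few lines. This is only an illustration: the base learner here is a 1-nearest-neighbor stand-in rather than the neural nets used in the studies cited above, and the function names, the toy data, and the feature subsets are all hypothetical.

```python
from collections import Counter


def project(x, feature_indices):
    """Keep only the chosen features of x."""
    return tuple(x[i] for i in feature_indices)


def train_nearest_neighbor(X, y):
    """Stand-in base learner: 1-nearest-neighbor on squared distance."""
    def hypothesis(query):
        distances = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in X]
        return y[distances.index(min(distances))]
    return hypothesis


def train_feature_subset_ensemble(X, y, subsets, train_fn=train_nearest_neighbor):
    """Train one hypothesis per feature subset, all on the same examples."""
    return [(idx, train_fn([project(x, idx) for x in X], y)) for idx in subsets]


def predict(ensemble, x):
    """Majority vote; each member only sees its own features of x."""
    votes = Counter(h(project(x, idx)) for idx, h in ensemble)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Features 0-2 are redundant copies of the class; feature 3 is noise.
    X = [(0, 0, 0, 5), (0, 0, 0, 1), (1, 1, 1, 5), (1, 1, 1, 1)]
    y = [0, 0, 1, 1]
    ensemble = train_feature_subset_ensemble(X, y, [(0, 1), (1, 2), (0, 2)])
    print(predict(ensemble, (1, 1, 1, 3)))
```

The toy data is deliberately redundant: every subset still carries the class signal, which is the condition under which this technique is expected to help.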

12
Manipulation of the output targets (of the input data)
- Each x is mapped to one of K classes (where K is large).
- Divide the set of K classes into 2 groups, A and B.
- Learn that new (and simpler) two-class problem for various partitions A and B.
- Each member of the ensemble then implicitly votes for all the classes in whichever group it picked, A or B (about K/2 of the K classes).
- Think of it like classifying cities by the state they are in, but first classifying them by region (southwest, northwest, etc).
- Benefit: Can use any 2-class classifier to solve an arbitrary K-class problem.
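The voting scheme for the A/B partitions can be sketched as follows. The base learner (a one-dimensional nearest-neighbor rule), the partitions, and the function names are hypothetical choices for illustration. Any two-class learner could be passed in as `train_fn`.

```python
def train_nearest_neighbor_1d(xs, labels):
    """Stand-in two-class learner: copy the label of the nearest point."""
    def hypothesis(query):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
        return labels[nearest]
    return hypothesis


def train_output_partition_ensemble(xs, ys, partitions,
                                    train_fn=train_nearest_neighbor_1d):
    """For each group A, relabel every example as 'in A?' and train on that."""
    ensemble = []
    for group_a in partitions:
        binary_labels = [y in group_a for y in ys]
        ensemble.append((group_a, train_fn(xs, binary_labels)))
    return ensemble


def predict_class(ensemble, x, num_classes):
    """Each member votes for every class in the group it picked (A or B)."""
    votes = [0] * num_classes
    for group_a, hypothesis in ensemble:
        picked_a = hypothesis(x)
        for k in range(num_classes):
            if (k in group_a) == picked_a:
                votes[k] += 1
    return max(range(num_classes), key=votes.__getitem__)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]   # one training point per class
    ys = [0, 1, 2, 3]
    partitions = [{0, 1}, {0, 2}, {0, 3}]  # group B is the complement
    ensemble = train_output_partition_ensemble(xs, ys, partitions)
    print(predict_class(ensemble, 2.1, num_classes=4))
```

The partitions are chosen so that every pair of classes is separated by at least one of them. When all members answer correctly, the true class is the only one that collects a vote from every member.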

13
Injection of Randomness
- Neural networks: the initial weights can be randomly chosen.
- The ensemble consists of NNs trained with different initial weights.
- In a comparison of a) 10-fold cross-validated committees, b) bagging, and c) random initial weights, the methods ranked in that order: a) was the best, c) the worst.
- Injecting randomness into the input vectors is another option.

14
Comparison of Ensemble Methods (empirical)
1) C4.5
2) C4.5 with injected randomness (in the tree-building)
3) Bagged C4.5
4) AdaBoost C4.5
- On 33 data sets with little or no noise, AdaBoost performed the best.
- On the same 33 data sets with artificial 20% class label noise, Bagging was the best (AdaBoost overfit).
- Analogy: AdaBoost tries to come up with a theory that explains everything; Bagging makes sure to know most of it.

15
Interpretation of the Methods by appealing to the Fundamental Reasons for Ensemble performance - Bagging and Randomness work by attacking the Statistical issue. - AdaBoost attacks the Representational Problem (it recognizes and uses the fact that not every hypothesis will be correct for all the training points).
