ENSEMBLE LEARNING: ADABOOST
Jianping Fan, Dept. of Computer Science, UNC-Charlotte
ENSEMBLE LEARNING
A machine learning paradigm in which multiple learners are combined to solve a problem. Previously, a single learner was trained for each problem; in an ensemble, several learners are trained and their outputs are combined. The generalization ability of the ensemble is usually significantly better than that of an individual learner. Boosting is one of the most important families of ensemble methods.
A BRIEF HISTORY
Bootstrapping: resampling for estimating a statistic.
Bagging, Boosting (Schapire 1989), AdaBoost (Schapire 1995): resampling for classifier design.
BOOTSTRAP ESTIMATION
Repeatedly draw n samples from D with replacement. For each set of samples, estimate the statistic. The bootstrap estimate is the mean of the individual estimates. Used to estimate a statistic (parameter) and its variance.
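The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the helper name `bootstrap_estimate`, the toy data, and the number of resamples are our own choices:

```python
import random
import statistics

def bootstrap_estimate(data, statistic, n_resamples=1000, seed=0):
    """Bootstrap estimate of a statistic and its variance.

    Repeatedly draws len(data) samples from data with replacement,
    computes the statistic on each resample, and returns the mean
    and variance of the individual estimates.
    """
    rng = random.Random(seed)
    estimates = [
        statistic([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples)
    ]
    return statistics.mean(estimates), statistics.variance(estimates)

# Toy data (made up for illustration): estimate the mean and its variance.
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 2.7]
est, var = bootstrap_estimate(data, statistics.mean)
```

The variance of the bootstrap estimates approximates the sampling variance of the statistic itself, which is the main point of the method.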
BAGGING (Bootstrap AGGregatING)
For i = 1..M: draw n* < n samples from D with replacement, and learn classifier C_i. The final classifier is a vote of C_1..C_M. Bagging increases classifier stability and reduces variance.
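The loop above can be sketched as follows. This is an illustrative sketch, assuming 1-D data and a simple threshold stump as the base learner; the helper names (`bag`, `learn_stump`) and the subsample ratio are hypothetical:

```python
import random
from collections import Counter

def bag(train, learn, M=10, subsample=0.8, seed=0):
    """Bagging: train M classifiers on bootstrap samples of size n* < n,
    each drawn with replacement; predict by majority vote."""
    rng = random.Random(seed)
    n_star = int(subsample * len(train))          # n* < n
    models = [learn([rng.choice(train) for _ in range(n_star)])
              for _ in range(M)]
    def predict(x):
        votes = Counter(m(x) for m in models)     # vote of C_1..C_M
        return votes.most_common(1)[0][0]
    return predict

def learn_stump(sample):
    """Base learner: threshold stump minimizing training error on sample."""
    best = min((sum(1 for x, y in sample if (1 if x > t else -1) != y), t)
               for t, _ in sample)
    thr = best[1]
    return lambda x, t=thr: 1 if x > t else -1

# Toy 1-D data: label is +1 above 0.5.
train = [(k / 20, 1 if k / 20 > 0.5 else -1) for k in range(20)]
predict = bag(train, learn_stump, M=5)
```

Each bootstrap sample yields a slightly different stump; the vote smooths over their individual instability, which is the variance-reduction effect the slide describes.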
BAGGING
[Figure: T random samples are drawn with replacement from the training data; the ML algorithm trains one classifier f_1, ..., f_T per sample, and the combined classifier is f.]
BOOSTING
[Figure: the training sample is reweighted after each round; classifiers f_1, f_2, ..., f_T are trained sequentially on the weighted samples and combined into f.]
REVISIT BAGGING
BOOSTING CLASSIFIER
BAGGING VS BOOSTING
Bagging: the construction of complementary base-learners is left to chance and to the instability of the learning method.
Boosting: actively seeks to generate complementary base-learners by training the next base-learner on the mistakes of the previous learners.
BOOSTING (SCHAPIRE 1989)
Randomly select n_1 < n samples from D without replacement to obtain D_1; train weak learner C_1.
Select n_2 < n samples from D, with half of the samples misclassified by C_1, to obtain D_2; train weak learner C_2.
Select all samples from D on which C_1 and C_2 disagree; train weak learner C_3.
The final classifier is a vote of the three weak learners.
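The three-stage procedure above can be sketched in Python. This is a rough illustration of the sampling scheme only; the helper name `boost3`, the choice n_1 = n/2, and the tie-handling details are our own assumptions, not part of Schapire's original formulation:

```python
import random

def boost3(D, train_weak, seed=0):
    """Sketch of Schapire's (1989) three-learner boosting scheme.

    D: list of (x, y) pairs with y in {-1, +1}.
    train_weak(samples) -> classifier mapping x to a label.
    """
    rng = random.Random(seed)
    # D1: n1 < n samples drawn without replacement (n1 = n/2 assumed here).
    D1 = rng.sample(D, len(D) // 2)
    C1 = train_weak(D1)
    # D2: half misclassified by C1, half classified correctly.
    wrong = [(x, y) for x, y in D if C1(x) != y]
    right = [(x, y) for x, y in D if C1(x) == y]
    k = min(len(wrong), len(right))
    C2 = train_weak(wrong[:k] + right[:k])
    # D3: all samples on which C1 and C2 disagree.
    D3 = [(x, y) for x, y in D if C1(x) != C2(x)]
    C3 = train_weak(D3) if D3 else C1
    # Final classifier: majority vote of C1, C2, C3.
    def H(x):
        return 1 if C1(x) + C2(x) + C3(x) > 0 else -1
    return H

# Toy usage with a trivially correct weak learner, just to exercise the flow.
D = [(x, 1 if x > 0 else -1) for x in [-3, -2, -1, 1, 2, 3]]
H = boost3(D, lambda samples: (lambda x: 1 if x > 0 else -1))
```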
ADABOOST (SCHAPIRE 1995)
Instead of sampling, AdaBoost re-weights the training examples so that the previous weak learner has only 50% accuracy over the new distribution. It can be used with any learner of weak classifiers. The final classification is based on a weighted vote of the weak classifiers.
ADABOOST TERMS
Learner = Hypothesis = Classifier
Weak learner: < 50% error over any distribution.
Strong classifier: thresholded linear combination of weak-learner outputs.
ADABOOST = ADAptive BOOSTing
A learning algorithm for building a strong classifier out of a lot of weaker ones.
ADABOOST CONCEPT
Weak classifiers, each only slightly better than random, are combined into a strong classifier.
WEAKER CLASSIFIERS
Each weak classifier learns by considering one simple feature; the T features most beneficial for classification should be selected. How to:
- define features?
- select beneficial features?
- train weak classifiers?
- manage (weight) training samples?
- associate a weight with each weak classifier?
THE STRONG CLASSIFIER
Weak classifiers, each slightly better than random, are combined into a strong classifier. How good will the strong classifier be?
THE ADABOOST ALGORITHM
Given: (x_1, y_1), ..., (x_m, y_m) with x_i ∈ X, y_i ∈ {−1, +1}
Initialization: D_1(i) = 1/m for i = 1, ..., m
For t = 1, ..., T:
  Find the classifier h_t : X → {−1, +1} that minimizes the error with respect to D_t, i.e., h_t = argmin_h ε_h, where ε_h = Σ_{i: h(x_i) ≠ y_i} D_t(i)
  Weight classifier: α_t = (1/2) ln((1 − ε_t)/ε_t)
  Update distribution: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor
Output final classifier: H(x) = sign(Σ_{t=1}^T α_t h_t(x))
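The algorithm above can be sketched end-to-end in Python. This is a minimal illustrative implementation, assuming 1-D inputs and decision stumps as the weak classifiers; the names (`adaboost`, the toy data) are our own, not from the slides:

```python
import math

def adaboost(X, y, T=10):
    """AdaBoost with decision stumps on 1-D data.

    X: list of floats; y: list of labels in {-1, +1}.
    Returns H(x) = sign(sum_t alpha_t * h_t(x)).
    """
    m = len(X)
    D = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble = []                          # [(alpha_t, h_t), ...]
    for _ in range(T):
        # Find the stump h_t minimizing the weighted error w.r.t. D_t.
        best = None
        for thr in X:
            for sign in (+1, -1):
                h = lambda x, t=thr, s=sign: s if x > t else -s
                err = sum(D[i] for i in range(m) if h(X[i]) != y[i])
                if best is None or err < best[0]:
                    best = (err, h)
        eps, h = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard the log
        alpha = 0.5 * math.log((1 - eps) / eps)  # classifier weight
        ensemble.append((alpha, h))
        # Update distribution: D_{t+1}(i) ∝ D_t(i) exp(-alpha y_i h_t(x_i)).
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)                               # normalization factor Z_t
        D = [d / Z for d in D]
    def H(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H

# Toy labeling that no single stump can fit, but a few boosted stumps can.
X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [1, 1, -1, -1, -1, -1, 1, 1]
H = adaboost(X, y, T=20)
```

Note that the best single stump gets at most 6 of 8 examples right on this labeling, while the boosted combination fits it.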
BOOSTING ILLUSTRATION: Weak Classifier 1
BOOSTING ILLUSTRATION: Weights Increased
THE ADABOOST ALGORITHM
Typically ε_t < 1/2, so α_t > 0. In the update D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor, the weights of incorrectly classified examples are increased, so that the base learner is forced to focus on the hard examples in the training set.
BOOSTING ILLUSTRATION: Weak Classifier 2
BOOSTING ILLUSTRATION: Weights Increased
BOOSTING ILLUSTRATION: Weak Classifier 3
BOOSTING ILLUSTRATION: The final classifier is a combination of the weak classifiers.
THE ADABOOST ALGORITHM
The same algorithm, revisited: what goal does AdaBoost want to reach? The choices of classifier weight α_t and of the distribution update are goal dependent.
GOAL
Minimize the exponential loss E = Σ_i exp(−y_i H(x_i)), where the final classifier is H(x) = sign(Σ_t α_t h_t(x)). Minimizing the exponential loss maximizes the margin yH(x).
GOAL (DERIVATION)
Final classifier: H(x) = sign(f_T(x)). Minimize the exponential loss E = Σ_i exp(−y_i f_T(x_i)).
Define f_t(x) = Σ_{s≤t} α_s h_s(x), with f_t = f_{t−1} + α_t h_t and f_0 = 0. Then
  Σ_i exp(−y_i f_t(x_i)) = Σ_i exp(−y_i f_{t−1}(x_i)) · exp(−α_t y_i h_t(x_i)).
Define D_t(i) ∝ exp(−y_i f_{t−1}(x_i)); at time 1 this gives D_1(i) = 1/m. Up to a constant, the loss at time t is
  Σ_i D_t(i) exp(−α_t y_i h_t(x_i)) = e^{−α_t} (1 − ε_t) + e^{α_t} ε_t,
where ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i). Setting the derivative with respect to α_t to 0:
  −e^{−α_t} (1 − ε_t) + e^{α_t} ε_t = 0  ⇒  α_t = (1/2) ln((1 − ε_t)/ε_t).
The reduction in loss is maximized when h_t minimizes ε_t. At time t+1, the definition of D gives the update D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, with Z_t a normalization factor.
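The closed-form weight α_t = (1/2) ln((1 − ε_t)/ε_t) can be checked numerically against the per-round loss e^{−α}(1 − ε) + e^{α}ε it is supposed to minimize. A small sketch (the grid range and ε value are arbitrary illustration choices):

```python
import math

def loss(alpha, eps):
    # Weighted exponential loss at round t: e^{-alpha}(1-eps) + e^{alpha}*eps
    return math.exp(-alpha) * (1 - eps) + math.exp(alpha) * eps

eps = 0.2
alpha_star = 0.5 * math.log((1 - eps) / eps)   # closed-form minimizer
# Brute-force grid search over alpha confirms the closed form.
grid = [a / 1000 for a in range(-3000, 3001)]
best_grid = min(grid, key=lambda a: loss(a, eps))
```

At the minimizer the loss equals 2·sqrt(ε(1 − ε)), which is the per-round factor Z_t in the training-error bound.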
PROS AND CONS OF ADABOOST
Advantages:
- Very simple to implement
- Does feature selection, resulting in a relatively simple classifier
- Fairly good generalization
Disadvantages:
- Suboptimal solution
- Sensitive to noisy data and outliers
INTUITION
Train a set of weak hypotheses h_1, ..., h_T. The combined hypothesis H is a weighted majority vote of the T weak hypotheses; each hypothesis h_t has a weight α_t. During training, focus on the examples that are misclassified. At round t, example x_i has the weight D_t(i).
BASIC SETTING
Binary classification problem. Training data: (x_1, y_1), ..., (x_m, y_m), with x_i ∈ X and y_i ∈ {−1, +1}.
D_t(i): the weight of x_i at round t; D_1(i) = 1/m.
A learner L finds a weak hypothesis h_t : X → Y given the training set and D_t.
The error of a weak hypothesis h_t: ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i] = Σ_{i: h_t(x_i) ≠ y_i} D_t(i).
THE BASIC ADABOOST ALGORITHM
For t = 1, ..., T:
  Train the weak learner using the training data and D_t
  Get h_t : X → {−1, +1} with error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]
  Choose α_t = (1/2) ln((1 − ε_t)/ε_t)
  Update D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t normalizes D_{t+1} to a distribution
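One round of the update above can be worked through on a toy case: m = 5 examples, uniform D_1, and a weak hypothesis h_1 that misclassifies exactly one example (the setup is made up for illustration):

```python
import math

m = 5
D = [1.0 / m] * m
correct = [True, True, True, True, False]   # does y_i == h_1(x_i)?

eps = sum(D[i] for i in range(m) if not correct[i])      # eps_1 = 0.2
alpha = 0.5 * math.log((1 - eps) / eps)                  # alpha_1 = 0.5*ln(4)
# D_{t+1}(i) = D_t(i) * exp(-alpha) if correct, exp(+alpha) if misclassified.
D_new = [D[i] * math.exp(-alpha if correct[i] else alpha) for i in range(m)]
Z = sum(D_new)                                           # normalizer Z_1
D_new = [d / Z for d in D_new]
```

Here exp(α_1) = 2 and exp(−α_1) = 0.5, so after normalization the single misclassified example carries half of the total weight: exactly the "focus on the hard examples" behavior.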
THE GENERAL ADABOOST ALGORITHM