Ensemble Learning Reading: R. Schapire, A brief introduction to boosting

Ensemble learning: each training set S_1, S_2, ..., S_N is used to learn a hypothesis h_1, h_2, ..., h_N, and the individual hypotheses are combined into a single ensemble hypothesis H.

Advantages of ensemble learning: Can be very effective at reducing generalization error (e.g., by voting). Ideal case: the h_i have independent errors.

Example: Given three hypotheses h_1, h_2, h_3, with h_i(x) ∈ {−1, 1}. Suppose each h_i has 60% generalization accuracy, and assume their errors are independent. Now suppose H(x) is the majority vote of h_1, h_2, and h_3. What is the probability that H is correct?

h1   h2   h3   H    probability
C    C    C    C    .216
C    C    I    C    .144
C    I    C    C    .144
C    I    I    I    .096
I    C    C    C    .144
I    C    I    I    .096
I    I    C    I    .096
I    I    I    I    .064
Total probability correct: .648

Another example: Again, given three hypotheses h_1, h_2, h_3. Suppose each h_i has only 40% generalization accuracy, and assume errors are independent. Now suppose we classify x by the majority vote of h_1, h_2, and h_3. What is the probability that the classification is correct?

h1   h2   h3   H    probability
C    C    C    C    .064
C    C    I    C    .096
C    I    C    C    .096
C    I    I    I    .144
I    C    C    C    .096
I    C    I    I    .144
I    I    C    I    .144
I    I    I    I    .216
Total probability correct: .352

General case: In general, if hypotheses h_1, ..., h_M all have generalization accuracy A and independent errors, what is the probability that a majority vote will be correct?
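As a sanity check on these numbers, here is a minimal Python sketch (not part of the original slides): under the independence assumption, the majority vote is correct exactly when more than half of the M hypotheses are correct, i.e., a Binomial(M, A) tail probability.

from math import comb

def majority_vote_accuracy(accuracy, m):
    """Probability that a majority vote of m independent hypotheses,
    each correct with probability `accuracy`, is correct (m odd, so no ties)."""
    return sum(comb(m, k) * accuracy**k * (1 - accuracy)**(m - k)
               for k in range(m // 2 + 1, m + 1))

print(majority_vote_accuracy(0.6, 3))    # ≈ 0.648, as in the first table above
print(majority_vote_accuracy(0.4, 3))    # ≈ 0.352, as in the second table
print(majority_vote_accuracy(0.6, 21))   # larger ensembles of good hypotheses do even better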

Possible problems with ensemble learning: Errors are typically not independent. Training time and classification time are increased by a factor of M. It is hard to explain how the ensemble hypothesis does its classification. How do we get enough data to create M separate data sets S_1, ..., S_M?

Three popular methods: –Voting: Train a classifier on M different training sets S_i to obtain M different classifiers h_i. For a new instance x, define H(x) = sgn( Σ_i α_i h_i(x) ), where α_i is a confidence measure for classifier h_i.

–Bagging (Breiman, 1990s): To create S_i, create “bootstrap replicates” of the original training set S, i.e., sample |S| examples from S uniformly with replacement. –Boosting (Schapire & Freund, 1990s): To create S_i, reweight examples in the original training set S as a function of whether or not they were misclassified on the previous round. (A sketch of the bagging step follows below.)
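A minimal sketch (Python, not from the original slides) of the bagging step; the training set S shown is a made-up list of (x, y) pairs. The weighted resampling used by boosting appears in the Adaboost sketch further below.

import random

def bootstrap_replicate(S):
    """Bagging: sample |S| examples from S uniformly, with replacement."""
    return [random.choice(S) for _ in range(len(S))]

# Hypothetical usage: build M bagged training sets from a list of (x, y) pairs.
S = [((0.1, 0.2), +1), ((0.5, 0.9), -1), ((0.3, 0.7), +1), ((0.8, 0.4), -1)]
M = 3
training_sets = [bootstrap_replicate(S) for _ in range(M)]
for i, S_i in enumerate(training_sets, 1):
    print(f"S_{i}:", S_i)   # each replicate typically contains repeats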

Adaptive Boosting (Adaboost) A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

Sketch of the algorithm: Given examples S and a learning algorithm L, with |S| = N. Initialize a probability distribution over the examples: w_1(i) = 1/N. Repeatedly run L on training sets S_t ⊆ S to produce h_1, h_2, ..., h_K. –At each step, derive S_t from S by choosing examples probabilistically according to the probability distribution w_t, and use S_t to learn h_t. –At each step, derive w_{t+1} by giving more probability to examples that were misclassified at step t. The final ensemble classifier H is a weighted sum of the h_t’s, with each weight being a function of the corresponding h_t’s error on its training set.

Adaboost algorithm: Given S = {(x_1, y_1), ..., (x_N, y_N)} where x_i ∈ X and y_i ∈ {+1, −1}. Initialize w_1(i) = 1/N (a uniform distribution over the data).

For t = 1, ..., K: –Select a new training set S_t from S with replacement, according to w_t. –Train L on S_t to obtain hypothesis h_t. –Compute the training error ε_t of h_t on S: ε_t = Σ_{i=1}^{N} w_t(i) · [h_t(x_i) ≠ y_i], i.e., the total weight of the examples that h_t misclassifies. –Compute the coefficient α_t = (1/2) ln( (1 − ε_t) / ε_t ).

–Compute new weights on the data: For i = 1 to N, w_{t+1}(i) = w_t(i) · exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor chosen so that w_{t+1} will be a probability distribution: Z_t = Σ_{i=1}^{N} w_t(i) · exp(−α_t y_i h_t(x_i)).

At the end of K iterations of this algorithm, we have h_1, h_2, ..., h_K. We also have α_1, α_2, ..., α_K, where α_t = (1/2) ln( (1 − ε_t) / ε_t ). Ensemble classifier: H(x) = sgn( Σ_{t=1}^{K} α_t h_t(x) ). Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
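A compact Python sketch of the algorithm above, not part of the original slides. The base learner is passed in as a hypothetical `learn` function standing in for L, and each hypothesis is represented as a callable returning +1 or −1; following the slides, it resamples S according to w_t rather than passing weights to the learner directly.

import math
import random

def adaboost(S, learn, K):
    """S: list of (x, y) pairs with y in {+1, -1}.
    learn: the base learning algorithm L; takes a training list and
           returns a hypothesis h with h(x) in {+1, -1}.
    Returns (hypotheses, alphas) defining H(x) = sgn(sum_t alpha_t * h_t(x))."""
    N = len(S)
    w = [1.0 / N] * N                      # w_1(i) = 1/N
    hypotheses, alphas = [], []
    for t in range(K):
        # Select S_t from S with replacement, according to w_t.
        S_t = random.choices(S, weights=w, k=N)
        h_t = learn(S_t)
        # Weighted training error of h_t on all of S.
        eps = sum(w[i] for i, (x, y) in enumerate(S) if h_t(x) != y)
        if eps == 0 or eps >= 0.5:         # perfect or worse-than-chance; stop in this sketch
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Reweight: misclassified examples gain weight, then normalize by Z_t.
        w = [w[i] * math.exp(-alpha * y * h_t(x)) for i, (x, y) in enumerate(S)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        hypotheses.append(h_t)
        alphas.append(alpha)
    return hypotheses, alphas

def H(x, hypotheses, alphas):
    """Ensemble classifier: weighted vote of the weak hypotheses."""
    return 1 if sum(a * h(x) for h, a in zip(hypotheses, alphas)) >= 0 else -1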

A Hypothetical Example, where {x_1, x_2, x_3, x_4} are class +1 and {x_5, x_6, x_7, x_8} are class −1. t = 1: w_1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8} S_1 = {x_1, x_2, x_2, x_5, x_5, x_6, x_7, x_8} (notice some repeats) Train classifier on S_1 to get h_1. Run h_1 on S. Suppose the classifications are: {1, −1, −1, −1, −1, −1, −1, −1} Calculate the error: x_2, x_3, and x_4 are misclassified, so ε_1 = 1/8 + 1/8 + 1/8 = 0.375.

Calculate α_1 = (1/2) ln( (1 − ε_1)/ε_1 ) = (1/2) ln(0.625/0.375) ≈ 0.255. Calculate the new weights w_2(i) = w_1(i) exp(−α_1 y_i h_1(x_i)) / Z_1, which increases the weight of the misclassified examples x_2, x_3, x_4 and decreases the weight of the others.

t = 2: w_2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102} S_2 = {x_1, x_2, x_2, x_3, x_4, x_4, x_7, x_8} Train classifier on S_2 to get h_2. Run h_2 on S. Suppose the classifications are: {1, 1, 1, 1, 1, 1, 1, 1} Calculate the error: x_5, x_6, x_7, and x_8 are misclassified, so ε_2 = 4 × 0.102 ≈ 0.408.

Calculate α_2 = (1/2) ln( (1 − ε_2)/ε_2 ) = (1/2) ln(0.592/0.408) ≈ 0.186. Calculate the new weights w_3(i) = w_2(i) exp(−α_2 y_i h_2(x_i)) / Z_2; the misclassified examples x_5, ..., x_8 gain weight.

t = 3: w_3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125} S_3 = {x_2, x_3, x_3, x_3, x_5, x_6, x_7, x_8} Train classifier on S_3 to get h_3. Run h_3 on S. Suppose the classifications are: {1, 1, −1, 1, −1, −1, 1, −1} Calculate the error: x_3 and x_7 are misclassified, so ε_3 = 0.139 + 0.125 = 0.264.

Calculate α_3 = (1/2) ln( (1 − ε_3)/ε_3 ) = (1/2) ln(0.736/0.264) ≈ 0.513. Ensemble classifier: H(x) = sgn( α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) ).

What is the accuracy of H on the training data? Recall the training set: {x_1, x_2, x_3, x_4} are class +1 and {x_5, x_6, x_7, x_8} are class −1.

Example   Actual class   h1   h2   h3
x1        +1             +1   +1   +1
x2        +1             −1   +1   +1
x3        +1             −1   +1   −1
x4        +1             −1   +1   +1
x5        −1             −1   +1   −1
x6        −1             −1   +1   −1
x7        −1             −1   +1   +1
x8        −1             −1   +1   −1
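A quick way to answer the question above (Python, not from the slides). The three classification vectors are the ones listed on the t = 1, 2, 3 slides; the α values are the approximations derived in this write-up, not numbers given in the original.

labels = [+1, +1, +1, +1, -1, -1, -1, -1]      # actual classes of x1..x8
h1     = [+1, -1, -1, -1, -1, -1, -1, -1]      # h1 run on S (from the t = 1 slide)
h2     = [+1, +1, +1, +1, +1, +1, +1, +1]      # h2 run on S (from the t = 2 slide)
h3     = [+1, +1, -1, +1, -1, -1, +1, -1]      # h3 run on S (from the t = 3 slide)
alphas = [0.255, 0.186, 0.513]                 # approximate values derived above

correct = 0
for y, a, b, c in zip(labels, h1, h2, h3):
    score = alphas[0] * a + alphas[1] * b + alphas[2] * c
    correct += (1 if score >= 0 else -1) == y
print(correct, "of", len(labels), "training examples classified correctly")

With these approximate weights, H misclassifies x_3 and x_7, so its training accuracy is 6/8 = 75%.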

HW 2: SVMs and Boosting Caveat: I don’t know if boosting will help. But worth a try!

Boosting implementation in HW 2: Use arrays and vectors. Let S be the training set. You can set S to DogsVsCats.train or some subset of examples. Let M = |S| be the number of training examples. Let K be the number of boosting iterations.

; Initialize:
Read in S as an array of M rows and 65 columns.
Create the weight vector w(1) = (1/M, 1/M, 1/M, ..., 1/M)

; Run Boosting Iterations:
For T = 1 to K {
    ; Create training set S_T:
    For i = 1 to M {
        Select (with replacement) an example from S using the weights in w.
        (Use the "roulette wheel" selection algorithm; a short sketch follows this walkthrough.)
        Print that example to S_T (i.e., "S1")
    }
    ; Learn classifier h_T:
    svm_learn S_T h_T
    ; Calculate Error:
    svm_classify S h_T T.predictions (e.g., "svm_classify S h1 1.predictions")
    Use .predictions to determine which examples in S were misclassified.
    Error = sum of the weights of the examples that were misclassified.

    ; Calculate Alpha
    ; Calculate the new weight vector, using Alpha and the .predictions file
}  ; T ← T + 1

At this point, you have h1, h2, ..., hK [the K different classifiers] and Alpha1, Alpha2, ..., AlphaK.

; Run Ensemble Classifier on test data (DogsVsCats.test):
For T = 1 to K {
    svm_classify DogsVsCats.test h_T.model Test(T).predictions
}

At this point you have: Test1.predictions, Test2.predictions, ..., TestK.predictions.
To classify test example x1:
    The sign of row 1 of Test1.predictions gives h1(x1)
    The sign of row 1 of Test2.predictions gives h2(x1)
    etc.
H(x1) = sgn[ Alpha1 * h1(x1) + Alpha2 * h2(x1) + ... + AlphaK * hK(x1) ]
Compute H(x) for every x in the test set.
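The "roulette wheel" selection named above is just weighted sampling with replacement from the cumulative weight distribution. A minimal Python sketch of that step (not part of the assignment handout):

import random
from bisect import bisect_left
from itertools import accumulate

def roulette_wheel_sample(weights, k):
    """Draw k indices with replacement; index i is chosen with
    probability proportional to weights[i]."""
    wheel = list(accumulate(weights))            # cumulative weights ("the wheel")
    total = wheel[-1]
    # For each draw, spin the wheel and take the first slot whose
    # cumulative weight reaches the random value.
    return [bisect_left(wheel, random.uniform(0.0, total)) for _ in range(k)]

# Hypothetical usage for one boosting round with 8 examples:
M = 8
w = [1.0 / M] * M                                # round 1: uniform weights
rows_for_S_T = roulette_wheel_sample(w, M)       # indices into S; repeats expected
print(rows_for_S_T)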

Sources of Classifier Error: Bias, Variance, and Noise. Bias: –The classifier cannot represent the correct hypothesis (no matter what training data it is given), so an incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets. Variance: –The training data is not representative enough of all the data, so the learned classifier h varies from one training set to another. Noise: –The training data contains errors (e.g., mislabeled examples), so an incorrect hypothesis h is learned. Give examples of these.

Adaboost seems to reduce both bias and variance. Adaboost does not seem to overfit as K increases. Optional: read about the "margin theory" explanation of the success of boosting.