Ensemble Learning
Reading: R. Schapire, A brief introduction to boosting

Ensemble learning: training sets S1, S2, ..., SN are used to learn hypotheses h1, h2, ..., hN, which are then combined into a single ensemble hypothesis H.

Advantages of ensemble learning Can be very effective at reducing generalization error! (E.g., by voting.) Ideal case: the hi have independent errors

Example. Given three hypotheses, h1, h2, h3, with hi(x) ∈ {−1, 1}. Suppose each hi has 60% generalization accuracy, and assume errors are independent. Now suppose H(x) is the majority vote of h1, h2, and h3. What is the probability that H is correct?

Enumerate the eight possible outcomes (correct/incorrect) for h1, h2, h3, the resulting majority vote H, and the probability of each:

h1  h2  h3   H   probability
C   C   C    C   .216
C   C   I    C   .144
C   I   C    C   .144
I   C   C    C   .144
C   I   I    I   .096
I   C   I    I   .096
I   I   C    I   .096
I   I   I    I   .064
Total probability correct: .216 + 3(.144) = .648

Another Example. Again, given three hypotheses, h1, h2, h3. Suppose each hi has 40% generalization accuracy, and assume errors are independent. Now suppose we classify x as the majority vote of h1, h2, and h3. What is the probability that the classification is correct?

Again, enumerate the eight possible outcomes (correct/incorrect) for h1, h2, h3, the majority vote H, and the probability of each:

h1  h2  h3   H   probability
C   C   C    C   .064
C   C   I    C   .096
C   I   C    C   .096
I   C   C    C   .096
C   I   I    I   .144
I   C   I    I   .144
I   I   C    I   .144
I   I   I    I   .216
Total probability correct: .064 + 3(.096) = .352

General case. In general, if hypotheses h1, ..., hM all have generalization accuracy A and make independent errors, what is the probability that a majority vote will be correct?
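The answer is a binomial tail probability: the vote is correct whenever more than half of the M hypotheses are correct. A minimal sketch in Python (the function name is illustrative, not from the slides):

    from math import comb

    def majority_vote_accuracy(M, A):
        """Probability that a majority vote of M independent hypotheses,
        each correct with probability A, is correct (M assumed odd)."""
        return sum(comb(M, k) * A**k * (1 - A)**(M - k)
                   for k in range(M // 2 + 1, M + 1))

    # Reproduces the two worked examples above:
    print(majority_vote_accuracy(3, 0.6))   # 0.648
    print(majority_vote_accuracy(3, 0.4))   # 0.352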

Possible problems with ensemble learning: Errors are typically not independent. Training time and classification time are increased by a factor of M. It is hard to explain how the ensemble hypothesis does classification. How do we get enough data to create M separate data sets, S1, ..., SM?

Three popular methods: Voting: Train a classifier on M different training sets Si to obtain M different classifiers hi. For a new instance x, define H(x) = sign( Σ(i=1..M) αi hi(x) ), where αi is a confidence measure for classifier hi.

Bagging (Breiman, 1990s): To create Si, create "bootstrap replicates" of the original training set S. Boosting (Schapire & Freund, 1990s): To create Si, reweight examples in the original training set S as a function of whether or not they were misclassified on the previous round.
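For concreteness, a bootstrap replicate is simply N examples drawn from S uniformly with replacement. A minimal sketch (Python with numpy; names are illustrative):

    import numpy as np

    def bootstrap_replicate(S, seed=None):
        """Return a new training set of |S| examples drawn from S with replacement."""
        rng = np.random.default_rng(seed)
        S = list(S)
        idx = rng.integers(0, len(S), size=len(S))   # sample indices with replacement
        return [S[i] for i in idx]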

Adaptive Boosting (Adaboost) A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)

Sketch of algorithm
Given examples S and learning algorithm L, with |S| = N.
Initialize a probability distribution over examples: w1(i) = 1/N.
Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hK:
At each step, derive St from S by choosing examples probabilistically according to the probability distribution wt. Use St to learn ht.
At each step, derive wt+1 by giving more probability to examples that were misclassified at step t.
The final ensemble classifier H is a weighted sum of the ht's, with each weight being a function of the corresponding ht's error on its training set.

Adaboost algorithm
Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}.
Initialize w1(i) = 1/N. (Uniform distribution over data.)

For t = 1, ..., K:
Select a new training set St from S with replacement, according to wt.
Train L on St to obtain hypothesis ht.
Compute the training error εt of ht on S:
    εt = Σ(i=1..N) wt(i) · 1[ht(xi) ≠ yi]   (the total weight of the examples that ht misclassifies)
Compute the coefficient:
    αt = ½ ln((1 − εt) / εt)

Compute new weights on the data: For i = 1 to N,
    wt+1(i) = wt(i) · exp(−αt yi ht(xi)) / Zt
where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:
    Zt = Σ(i=1..N) wt(i) · exp(−αt yi ht(xi))

At the end of K iterations of this algorithm, we have h1, h2, ..., hK.
We also have α1, α2, ..., αK, where αt = ½ ln((1 − εt) / εt).
Ensemble classifier:
    H(x) = sign( Σ(t=1..K) αt ht(x) )
Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
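A minimal sketch of this resampling version of Adaboost (Python with numpy). The weak-learner interface, make_learner returning an object with fit and predict, is an assumption for illustration, not something specified on the slides:

    import numpy as np

    def adaboost(X, y, make_learner, K, seed=0):
        """X: (N, d) array of examples; y: (N,) array of +1/-1 labels.
        make_learner(): returns a fresh weak learner with fit(X, y) and predict(X)."""
        rng = np.random.default_rng(seed)
        N = len(y)
        w = np.full(N, 1.0 / N)                     # w1(i) = 1/N
        hypotheses, alphas = [], []
        for t in range(K):
            idx = rng.choice(N, size=N, replace=True, p=w)   # draw St according to wt
            h = make_learner()
            h.fit(X[idx], y[idx])                   # train L on St
            pred = h.predict(X)                     # run ht on all of S
            eps = max(np.sum(w[pred != y]), 1e-12)  # weighted training error on S
            if eps > 0.5:                           # abandon ht (see the recap slide)
                continue
            alpha = 0.5 * np.log((1 - eps) / eps)
            w = w * np.exp(-alpha * y * pred)       # up-weight misclassified examples
            w = w / w.sum()                         # normalize by Zt
            hypotheses.append(h)
            alphas.append(alpha)

        def H(X_new):                               # ensemble classifier
            scores = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
            return np.sign(scores)
        return H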

A Hypothetical Example, where {x1, x2, x3, x4} are class +1 and {x5, x6, x7, x8} are class −1.
t = 1:
w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8}
S1 = {x1, x2, x2, x5, x5, x6, x7, x8} (notice some repeats)
Train a classifier on S1 to get h1.
Run h1 on S. Suppose the classifications are: {1, −1, −1, −1, −1, −1, −1, −1}
Calculate the error: h1 misclassifies x2, x3, and x4, so ε1 = w1(2) + w1(3) + w1(4) = 3/8 = 0.375.

Calculate the α's: α1 = ½ ln((1 − ε1) / ε1).
Calculate the new w's: w2(i) = w1(i) · exp(−α1 yi h1(xi)) / Z1, which gives the distribution w2 shown on the next slide.

t = 2:
w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102}
S2 = {x1, x2, x2, x3, x4, x4, x7, x8}
Learn a classifier on S2 to get h2.
Run h2 on S. Suppose the classifications are: {1, 1, 1, 1, 1, 1, 1, 1}
Calculate the error: h2 misclassifies x5, x6, x7, and x8, so ε2 = 4 × 0.102 = 0.408.

Calculate the α's: α2 = ½ ln((1 − ε2) / ε2).
Calculate the new w's: w3(i) = w2(i) · exp(−α2 yi h2(xi)) / Z2, which gives the distribution w3 shown on the next slide.

t = 3:
w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125}
S3 = {x2, x3, x3, x3, x5, x6, x7, x8}
Train a classifier on S3 to get h3.
Run h3 on S. Suppose the classifications are: {1, 1, −1, 1, −1, −1, 1, −1}
Calculate the error: h3 misclassifies x3 and x7, so ε3 = 0.139 + 0.125 = 0.264.

Calculate the α's: α3 = ½ ln((1 − ε3) / ε3).
Ensemble classifier: H(x) = sign( α1 h1(x) + α2 h2(x) + α3 h3(x) ).

Recall the training set, where {x1, x2, x3, x4} are class +1 and {x5, x6, x7, x8} are class −1:

Example   Actual class   h1   h2   h3
x1         +1             1    1    1
x2         +1            −1    1    1
x3         +1            −1    1   −1
x4         +1            −1    1    1
x5         −1            −1    1   −1
x6         −1            −1    1   −1
x7         −1            −1    1    1
x8         −1            −1    1   −1

What is the accuracy of H on the training data?
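The exercise can be checked in code. A small sketch (Python with numpy), using the errors ε1 = 0.375, ε2 = 0.408, ε3 = 0.264 worked out above and the coefficients αt = ½ ln((1 − εt)/εt); these numeric α values are reconstructed from the formulas, not taken from the slides:

    import numpy as np

    y  = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])   # actual classes of x1..x8
    h1 = np.array([ 1, -1, -1, -1, -1, -1, -1, -1])   # h1's predictions on S
    h2 = np.array([ 1,  1,  1,  1,  1,  1,  1,  1])   # h2's predictions on S
    h3 = np.array([ 1,  1, -1,  1, -1, -1,  1, -1])   # h3's predictions on S

    eps = [0.375, 0.408, 0.264]                        # errors from the example
    alpha = [0.5 * np.log((1 - e) / e) for e in eps]   # alpha_1, alpha_2, alpha_3

    H = np.sign(alpha[0] * h1 + alpha[1] * h2 + alpha[2] * h3)
    print(H)                    # ensemble predictions on x1..x8
    print(np.mean(H == y))      # training accuracy of H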

Adaboost seems to reduce both bias and variance. Adaboost does not seem to overfit for increasing K. Optional: read about the "margin theory" explanation of the success of boosting.

Recap of Adaboost algorithm
Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}.
Initialize w1(i) = 1/N. (Uniform distribution over data.)
For t = 1, ..., K:
1. Select a new training set St from S with replacement, according to wt.
2. Train L on St to obtain hypothesis ht.
3. Compute the training error εt of ht on S:
    εt = Σ(i=1..N) wt(i) · 1[ht(xi) ≠ yi]
   If εt > 0.5, abandon ht and go to step 1.

4. Compute the coefficient:
    αt = ½ ln((1 − εt) / εt)
5. Compute new weights on the data: for i = 1 to N,
    wt+1(i) = wt(i) · exp(−αt yi ht(xi)) / Zt
   where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:
    Zt = Σ(i=1..N) wt(i) · exp(−αt yi ht(xi))
At the end of K iterations of this algorithm, we have h1, h2, ..., hK and α1, α2, ..., αK.
Ensemble classifier:
    H(x) = sign( Σ(t=1..K) αt ht(x) )

In-class exercises

Problems with Boosting: Sensitive to noise and outliers, since boosting focuses on those examples that are hard to classify. But this can also be a way to identify outliers or noise in a dataset.

Case Study of Adaboost: Viola-Jones Face Detection Algorithm. P. Viola and M. J. Jones, Robust real-time face detection. International Journal of Computer Vision, 2004. First face-detection algorithm to work well in real time (e.g., on digital cameras).

Training Data Positive: Faces scaled and aligned to a base resolution of 24 by 24 pixels. Negative: Much larger number of non-faces.

Features. Use rectangle features at multiple sizes and locations in an image subwindow (candidate face). For each feature fj, the feature value is the sum of the pixel intensities in the light rectangles minus the sum in the dark rectangles. (Figure from http://makematics.com/research/viola-jones/.) Possible number of features per 24 x 24 pixel subwindow: > 180,000.

Detecting faces. Given a new image: Scan the image using subwindows at all locations and at different scales. For each subwindow, compute features and send them to an ensemble classifier (learned via boosting). If the classifier is positive ("face"), then detect a face at this location and scale.
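A minimal sketch of this scanning loop (Python; classify_subwindow, the scales, and the step size are illustrative placeholders, not the settings Viola and Jones used):

    def scan_image(image, classify_subwindow, base=24, scales=(1.0, 1.5, 2.0), step=4):
        """Slide a base x base window over the image at several scales;
        return (x, y, size) for every subwindow the classifier accepts."""
        H, W = len(image), len(image[0])
        detections = []
        for s in scales:
            size = int(base * s)
            for y in range(0, H - size + 1, step):
                for x in range(0, W - size + 1, step):
                    window = [row[x:x + size] for row in image[y:y + size]]
                    if classify_subwindow(window):      # ensemble says "face"
                        detections.append((x, y, size))
        return detections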

Preprocessing: Viola & Jones use a clever pre-processing step (the "integral image") that allows the rectangular features to be computed very quickly. (See their paper for a description.) They use a variant of AdaBoost to both select a small set of features and train the classifier.
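A minimal sketch of the integral-image idea (Python with numpy; the helper names and coordinates are illustrative, not from the paper):

    import numpy as np

    def integral_image(img):
        """ii[y, x] = sum of all pixels above and to the left of (y, x), inclusive."""
        return np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)

    def rect_sum(ii, top, left, h, w):
        """Sum of pixels in the h x w rectangle with top-left corner (top, left),
        using only four lookups into the integral image."""
        A = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
        B = ii[top - 1, left + w - 1] if top > 0 else 0
        C = ii[top + h - 1, left - 1] if left > 0 else 0
        D = ii[top + h - 1, left + w - 1]
        return D - B - C + A

    def two_rect_feature(ii, top, left, h, w):
        """A two-rectangle feature: left half minus right half of a region."""
        return rect_sum(ii, top, left, h, w // 2) - rect_sum(ii, top, left + w // 2, h, w // 2)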

Base ("weak") classifiers. For each feature fj, define
    hj(x) = 1 if pj fj(x) < pj θj, and −1 otherwise,
where x is a 24 x 24-pixel subwindow of an image, θj is the threshold that best separates the data using feature fj, and pj (the polarity) is either −1 or 1. Such classifiers are called "decision stumps".
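A minimal sketch of training such a stump for a single feature, in the ±1 convention used on these slides (Python with numpy; the brute-force threshold search is an illustrative simplification, not the exact procedure from the paper):

    import numpy as np

    def train_stump(f_values, y, weights):
        """Pick the threshold theta and polarity p that minimize weighted error
        for one feature. f_values: feature fj evaluated on every example."""
        best = (np.inf, 0.0, 1)                      # (error, theta, p)
        for theta in np.unique(f_values):
            for p in (-1, 1):
                pred = np.where(p * f_values < p * theta, 1, -1)
                err = np.sum(weights[pred != y])
                if err < best[0]:
                    best = (err, theta, p)
        return best                                  # weighted error, theta, p

    def stump_predict(f_values, theta, p):
        return np.where(p * f_values < p * theta, 1, -1)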

Boosting algorithm

Note that only the top T features are used.

https://www.youtube.com/watch?v=k3bJUP0ct08
https://www.youtube.com/watch?v=pZi9o-3ddq4

Viola-Jones "attentional cascade" algorithm: Increases performance while reducing computation time. Main idea: the vast majority of sub-windows are negative, so simple classifiers are used to reject the majority of sub-windows before more complex classifiers are applied (sketched below). See their paper for details.
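A minimal sketch of cascade evaluation (Python; stages is an assumed list of progressively more complex boosted classifiers, ordered from cheapest to most expensive — the names are illustrative):

    def cascade_classify(window, stages):
        """stages: list of classifier functions, ordered cheapest to most complex.
        Each returns True ("maybe a face") or False ("definitely not a face")."""
        for stage in stages:
            if not stage(window):
                return False          # rejected early; most sub-windows stop here
        return True                   # survived every stage: report a face

Because almost all sub-windows are rejected by the first one or two cheap stages, the average cost per sub-window stays small even though the full cascade is deep.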

(Figure from Wikipedia.)