Introduction to Boosting. Aristotelis Tsirigos (tsirigos@cs.nyu.edu). SCLT seminar, NYU Computer Science.

1. Learning Problem Formulation I
Unknown target function: $f: X \to Y$.
Given data sample: $S_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$, drawn i.i.d. from an unknown distribution $P$ over $X \times Y$.
Objective: predict the output $y$ for any given input $x$.

Learning Problem Formulation II
Loss function: $\ell(h(x), y)$, e.g. the 0/1 loss $\mathbf{1}[h(x) \neq y]$.
Generalization error: $L(h) = \mathbb{E}_{(x,y) \sim P}[\ell(h(x), y)]$.
Objective: find the hypothesis $h$ with minimum generalization error.
Main boosting idea: instead minimize the empirical error $\hat{L}(h) = \frac{1}{N}\sum_{n=1}^{N} \ell(h(x_n), y_n)$.

PAC Learning
Input: a hypothesis space H, a sample of size N, an accuracy parameter ε, and a confidence parameter 1-δ.
Objective: output a hypothesis h whose error is at most ε, with probability at least 1-δ.
Strong PAC learning: this guarantee holds for any given ε, δ.
Weak PAC learning: the guarantee holds only for some fixed ε, δ, i.e. the learner is merely slightly better than random guessing.
Boosting converts a weak learner into a strong one!
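A compact way to write the two guarantees, using the generalization error $L(h)$ defined earlier (a sketch only; the polynomial-runtime requirements of the formal PAC definitions are omitted):

\[
\text{strong: } \forall\, \epsilon, \delta \in (0,1):\;\; \Pr\big[L(h) \le \epsilon\big] \ge 1 - \delta,
\qquad
\text{weak: } \exists\, \gamma > 0,\ \delta_0 < 1:\;\; \Pr\big[L(h) \le \tfrac{1}{2} - \gamma\big] \ge 1 - \delta_0 .
\]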

2. Adaboost - Introduction
Idea: complex hypotheses are hard to design without overfitting, while simple hypotheses cannot explain all data points, so we combine many simple hypotheses into a complex one.
Issues: How do we generate simple hypotheses? How do we combine them?
Method: apply a weighting scheme to the examples, find a simple hypothesis for each weighted version of the examples, then compute a weight for each hypothesis and combine them linearly.

Some early algorithms
Boosting by filtering (Schapire 1990): run the weak learner on differently filtered example sets and combine the weak hypotheses; requires knowledge of the weak learner's performance.
Boosting by majority (Freund 1995): run the weak learner on weighted example sets and combine the weak hypotheses linearly.
Bagging (Breiman 1996): run the weak learner on bootstrap replicates of the training set and average the weak hypotheses; reduces variance.

Adaboost - Outline
Input: N examples $S_N = \{(x_1, y_1), \dots, (x_N, y_N)\}$ and a weak base learner $h = h(d, x)$.
Initialize: equal example weights $d_n = 1/N$ for all $n = 1..N$.
Iterate for $t = 1..T$:
  train the base learner on the weighted example set $(d^{(t)}, x)$ and obtain the hypothesis $h_t = h(d^{(t)}, x)$,
  compute the hypothesis error $\epsilon_t$,
  compute the hypothesis weight $\alpha_t$,
  update the example weights $d^{(t+1)}$ for the next iteration.
Output: the final hypothesis as a linear combination of the $h_t$. A runnable sketch of this loop is given below.
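A minimal, self-contained Python sketch of this outline (not the speaker's code): the helper `fit_stump` is an illustrative weighted decision-stump learner, labels are assumed to be in {-1, +1}, and the formulas for $\epsilon_t$, $\alpha_t$, and the weight update are the standard AdaBoost choices detailed on the "Adaboost – Details" slide.

```python
import numpy as np

def fit_stump(X, y, d):
    """Weighted decision stump: pick the (feature, threshold, sign) triple
    with the lowest weighted 0/1 error; returns a predict function."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = d[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda Z, j=j, thr=thr, sign=sign: sign * np.where(Z[:, j] <= thr, 1, -1)

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps; y must take values in {-1, +1}."""
    N = len(y)
    d = np.full(N, 1.0 / N)                    # equal initial example weights
    hypotheses, alphas = [], []
    for t in range(T):
        h = fit_stump(X, y, d)                 # train base learner on weighted examples
        pred = h(X)
        eps = np.clip(d[pred != y].sum(), 1e-12, 1 - 1e-12)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)  # hypothesis weight
        d *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        d /= d.sum()                           # renormalize to a distribution
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hypotheses)))
```

Calling `adaboost(X, y)` returns a function that evaluates the sign of the weighted vote $\sum_t \alpha_t h_t(x)$ on new inputs.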

Adaboost – Data flow diagram. [Diagram: the base learner $A(d, S)$ is applied to the sample $S_N$ under successive weightings $d^{(1)}, d^{(2)}, \dots, d^{(T)}$, producing the weighted hypotheses $\alpha_1 h_1(x), \alpha_2 h_2(x), \dots, \alpha_T h_T(x)$.]

Adaboost – Details
The loss function on the combined hypothesis $f_t = f_{t-1} + \alpha_t h_t$ at step $t$ is the exponential loss $L = \sum_{n=1}^{N} \exp(-y_n f_t(x_n))$.
In each iteration, $L$ is greedily minimized with respect to $\alpha_t$, which gives $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, where $\epsilon_t$ is the weighted error of $h_t$.
Finally, the example weights are updated: $d_n^{(t+1)} \propto d_n^{(t)} \exp(-\alpha_t\, y_n\, h_t(x_n))$, normalized to sum to one.
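For example, if the weak hypothesis at round $t$ has weighted error $\epsilon_t = 0.25$, the quantities above become

\[
\alpha_t = \tfrac{1}{2}\ln\tfrac{0.75}{0.25} = \tfrac{1}{2}\ln 3 \approx 0.549,
\qquad
e^{-\alpha_t} = \tfrac{1}{\sqrt{3}} \approx 0.577,
\qquad
e^{\alpha_t} = \sqrt{3} \approx 1.732,
\]

so before renormalization a misclassified example receives $e^{2\alpha_t} = 3$ times the relative weight of a correctly classified one.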

Adaboost – Big picture
The weak learner A induces a feature space: each weak hypothesis $h \in H$ acts as a feature $x \mapsto h(x)$.
Ideally, we want to find the combined hypothesis $f(x) = \sum_t \alpha_t h_t(x)$ whose weights jointly minimize the loss over the whole space.
However, Adaboost optimizes the weights $\alpha_t$ only locally, greedily adding one weak hypothesis at a time.

Base learners
Weak learners used in practice: decision stumps (axis-parallel splits), decision trees (e.g. C4.5, Quinlan 1996), multi-layer neural networks, radial basis function networks.
Can base learners operate on weighted examples? In many cases they can be modified to accept weights along with the examples. In general, we can instead sample the examples (with replacement) according to the distribution defined by the weights, as sketched below.
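A minimal sketch of the resampling alternative: draw a sample with replacement according to the current weight distribution d and hand it to any unweighted learner. The helper name `fit_unweighted` is illustrative and stands for an arbitrary base learner.

```python
import numpy as np

def fit_via_resampling(X, y, d, fit_unweighted, rng=np.random.default_rng(0)):
    """Train an unweighted base learner on examples resampled according to
    the boosting weight distribution d (d >= 0, sum(d) == 1)."""
    N = len(y)
    idx = rng.choice(N, size=N, replace=True, p=d)  # sample with replacement
    return fit_unweighted(X[idx], y[idx])           # learner never sees weights
```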

3. Boosting & Learning Theory
Main results: the training error converges to zero; a bound for the generalization error; a bound for the margin-based generalization error.

Training error - Definitions
Function $\phi_\theta$ on the margin $z \in [-1, 1]$: a piecewise-linear ramp that equals 1 for $z \le 0$, decreases linearly to 0 on $(0, \theta]$, and equals 0 for $z > \theta$; it upper-bounds the indicator $\mathbf{1}[z \le 0]$.
Empirical margin-based error: $\hat{L}_\theta(f) = \frac{1}{N}\sum_{n=1}^{N} \phi_\theta(y_n f(x_n))$.

Training error - Theorem
The empirical margin error of the composite hypothesis $f_T$ obeys the bound shown below. Therefore, the empirical margin error converges to zero exponentially fast as $T$ grows, provided $\theta$ is small enough relative to the edges of the weak hypotheses.
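The standard form of this bound, due to Schapire, Freund, Bartlett and Lee (1998), reads (a sketch in our notation):

\[
\hat{L}_\theta(f_T) \;\le\; \prod_{t=1}^{T} 2\,\sqrt{\epsilon_t^{\,1-\theta}\,(1-\epsilon_t)^{\,1+\theta}} .
\]

If every weak hypothesis has edge at least $\gamma$ (i.e. $\epsilon_t \le \tfrac{1}{2} - \gamma$) and $\theta < \gamma$, then every factor is strictly below 1, so the right-hand side decays exponentially in $T$.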

Generalization bounds
Theorem 2. Let F be a class of {-1,+1}-valued functions with VC dimension d. Applying standard VC-dimension bounds, with probability at least 1-δ the bound below holds simultaneously for every f in F. This is a distribution-free bound, i.e. it holds for any probability measure P.
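One standard VC-style bound of this form, where $d$ denotes the VC dimension of F (constants vary between texts; this is only a sketch):

\[
L(f) \;\le\; \hat{L}(f) \;+\; O\!\left(\sqrt{\frac{d\,\ln(N/d) + \ln(1/\delta)}{N}}\right).
\]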

Luckiness
Take advantage of data "regularities" in order to get tighter bounds, without imposing any a priori conditions on P.
Introduce a luckiness function that is based on the data. An example of a luckiness function is the margin that the hypothesis achieves on the training sample.

The Rademacher complexity
A notion of complexity related to the VC dimension, but more general in some sense.
The Rademacher complexity of a class F of [-1,+1]-valued functions is defined as
$R_N(F) = \mathbb{E}_{\sigma, x}\left[\sup_{f \in F} \frac{1}{N}\sum_{n=1}^{N} \sigma_n f(x_n)\right]$,
where the $\sigma_n$ are independently set to -1 or +1 with equal probability and the $x_n$ are drawn independently from the underlying distribution P.
Intuitively, $R_N(F)$ is the expected correlation of a random sign sequence $\sigma_n$ with the best-fitting $f$ in F on a sample $x_n$, $n = 1..N$.
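A small Monte Carlo sketch of the empirical version of this quantity for one-dimensional decision stumps (the helper names are illustrative, not from the talk): for each random sign vector σ we search the stump class for the best correlation and average the results over many draws.

```python
import numpy as np

def stump_sup_correlation(x, sigma):
    """sup over stumps f(x) = s*sign(x - thr) of (1/N) * sum_n sigma_n f(x_n)."""
    best = -np.inf
    for thr in np.concatenate(([-np.inf], np.sort(x))):
        for s in (+1, -1):
            f = s * np.where(x > thr, 1, -1)
            best = max(best, np.mean(sigma * f))
    return best

def empirical_rademacher(x, n_trials=200, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the empirical Rademacher complexity of stumps on x."""
    sigmas = rng.choice([-1, 1], size=(n_trials, len(x)))
    return np.mean([stump_sup_correlation(x, s) for s in sigmas])

# The estimate shrinks as the sample grows, as the theory predicts.
for N in (20, 80, 320):
    x = np.random.default_rng(1).normal(size=N)
    print(N, round(empirical_rademacher(x), 3))
```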

The margin-based bound
Theorem 3. Let F be a class of [-1,+1]-valued functions. Then, with probability at least 1-δ, the bound sketched below holds simultaneously for every f in F. Note that the complexity of F enters only through its Rademacher complexity $R_N(F)$, scaled by $1/\theta$.
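A standard margin bound of this type, in the style of Koltchinskii and Panchenko (the constants are indicative only):

\[
P\big[y f(x) \le 0\big]
\;\le\;
\hat{L}_\theta(f)
\;+\;
\frac{2}{\theta}\, R_N(F)
\;+\;
\sqrt{\frac{\ln(1/\delta)}{2N}} .
\]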

Application to boosting
In boosting, the considered hypothesis space is the set of normalized linear combinations of base hypotheses, i.e. the convex hull $F = \mathrm{conv}(H) = \{\sum_t \alpha_t h_t : h_t \in H,\ \alpha_t \ge 0,\ \sum_t \alpha_t = 1\}$.
The Rademacher complexity of F does not depend on T: taking the convex hull does not increase it, so $R_N(F) = R_N(H)$.
Whereas the VC dimension of F does depend on T: it grows with the number of combined hypotheses (roughly linearly in T, up to logarithmic factors).
Therefore, the Rademacher-based generalization bound does not depend on T!

4. Boosting and large margins
Input space X, feature space F: a linear separation in the feature space F corresponds to a nonlinear separation in the original input space X.
Under what conditions does boosting compute a combined hypothesis with a large margin?

Min-Max theorem
The edge of a weak hypothesis measures how much better than random guessing it is under a given example weighting; the margin of the combined hypothesis measures how confidently the weighted vote classifies each example.
Theorem 4 states the connection between the best achievable edge and the best achievable margin, sketched below.
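A sketch of the usual game-theoretic formulation (notation is ours; the connection follows von Neumann's min-max theorem as applied to boosting by Freund and Schapire):

\[
\gamma(d, h) = \sum_{n=1}^{N} d_n\, y_n\, h(x_n),
\qquad
\rho_\alpha(x_n) = \frac{y_n \sum_t \alpha_t h_t(x_n)}{\sum_t \alpha_t},
\]
\[
\gamma^* \;=\; \min_{d}\ \max_{h \in H}\ \gamma(d, h)
\;=\;
\max_{\alpha}\ \min_{n}\ \rho_\alpha(x_n) \;=\; \rho^* .
\]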

AdaBoost_r algorithm (Breiman 1997)
The hypothesis weight is chosen by a rule parameterized by a target edge r. If $\gamma^* > r$, the algorithm guarantees a lower bound on the achieved margin ρ, where $\gamma^*$ is the minimum edge of the hypotheses $h_t$.

Achieving the optimal margin bound
Arc-GV: choose $r_t$ based on the margin achieved so far by the combined hypothesis; the convergence rate to the maximum margin is not known.
Marginal Adaboost: run AdaBoost_r and measure the achieved margin ρ; if ρ < r, rerun AdaBoost_r with a decreased r, otherwise rerun with an increased r. This converges fast to the maximum margin.

Summary
Boosting takes a weak learner and converts it into a strong one.
It works by asymptotically minimizing the empirical error.
It effectively maximizes the margin of the combined hypothesis.
It obeys a "low" generalization error bound under the "luckiness" (margin-based) assumption.