Boosting Approach to ML


Boosting Approach to ML: Perceptron, Margins, Kernels. Maria-Florina Balcan, 03/18/2015.

Recap from last time: Boosting

Boosting is a general method for improving the accuracy of any given learning algorithm. It works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor. Adaboost is one of the top 10 ML algorithms: it works amazingly well in practice and is backed up by solid foundations.

Adaboost (Adaptive Boosting)

Input: $S=\{(x_1,y_1),\dots,(x_m,y_m)\}$, with $x_i \in X$, $y_i \in Y=\{-1,1\}$, and a weak learning algorithm $A$ (e.g., Naïve Bayes, decision stumps).

For $t=1,2,\dots,T$:
- Construct a distribution $D_t$ on $\{x_1,\dots,x_m\}$.
- Run $A$ on $D_t$, producing $h_t: X \to \{-1,1\}$.

Output $H_{final}(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Constructing the distributions: $D_1$ is uniform on $\{x_1,\dots,x_m\}$, i.e., $D_1(i) = 1/m$. Given $D_t$ and $h_t$, set
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t} \text{ if } y_i = h_t(x_i), \qquad D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{\alpha_t} \text{ if } y_i \neq h_t(x_i),$$
i.e., $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$, where $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} > 0$ and $Z_t$ is the normalizer.

$D_{t+1}$ puts half of its weight on the examples $x_i$ where $h_t$ is incorrect and half on the examples where $h_t$ is correct.
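For concreteness, here is a minimal sketch of this procedure in Python, using one-dimensional decision stumps as the weak learner. It is illustrative only: the helper names (`train_stump`, `adaboost`) and the early-stopping rule are my own, not from the slides.

```python
import numpy as np

def train_stump(X, y, D):
    """Weak learner: the single-feature threshold stump with the lowest D-weighted error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] >= thr, s, -s)
                err = float(D[pred != y].sum())
                if err < best[0]:
                    best = (err, lambda Z, j=j, thr=thr, s=s:
                            np.where(Z[:, j] >= thr, s, -s))
    return best                                    # (epsilon_t, h_t)

def adaboost(X, y, T=50):
    X, y = np.asarray(X, float), np.asarray(y)
    D = np.full(len(y), 1.0 / len(y))              # D_1 uniform: D_1(i) = 1/m
    alphas, hs = [], []
    for _ in range(T):
        eps, h = train_stump(X, y, D)              # run weak learner A on D_t
        if eps == 0 or eps >= 0.5:                 # perfect stump or no edge: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D *= np.exp(-alpha * y * h(X))             # D_{t+1}(i) proportional to D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D /= D.sum()                               # normalize by Z_t
        alphas.append(alpha); hs.append(h)
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    return lambda Z: np.sign(sum(a * h(np.asarray(Z, float)) for a, h in zip(alphas, hs)))
```

Usage would be something like `H = adaboost(X_train, y_train, T=50)` followed by `H(X_test)`, which returns ±1 predictions.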

Nice Features of Adaboost

- Very general: a meta-procedure that can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps).
- Very fast (a single pass through the data each round) and simple to code, with no parameters to tune.
- Grounded in rich theory.

Analyzing Training Error

Theorem: Let $\epsilon_t = 1/2 - \gamma_t$ (the error of $h_t$ over $D_t$). Then
$$err_S(H_{final}) \le \exp\!\left(-2 \sum_t \gamma_t^2\right).$$
So if $\gamma_t \ge \gamma > 0$ for all $t$, then $err_S(H_{final}) \le \exp(-2\gamma^2 T)$.

The training error drops exponentially in $T$. To get $err_S(H_{final}) \le \epsilon$, we need only $T = O\!\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\right)$ rounds.

Adaboost is adaptive: it does not need to know $\gamma$ or $T$ a priori, and it can exploit rounds where $\gamma_t \gg \gamma$.
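As a quick numeric check of this bound (not from the slides): the empirical error $err_S$ is a multiple of $1/m$, so once $\exp(-2\gamma^2 T) < 1/m$ the training error must be exactly 0.

```python
import math

gamma, m = 0.1, 1000                          # assumed weak-learner edge and sample size
T = math.ceil(math.log(m) / (2 * gamma**2))   # rounds needed to push the bound below 1/m
print(T, math.exp(-2 * gamma**2 * T))         # 346 rounds, bound ~ 0.00099 < 1/1000, so err_S = 0
```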

Generalization Guarantees

Recall the training-error bound: $err_S(H_{final}) \le \exp\!\left(-2\sum_t \gamma_t^2\right)$, where $\epsilon_t = 1/2 - \gamma_t$. How about generalization guarantees?

Original analysis [Freund & Schapire '97]: let $H$ be the space of weak hypotheses, with $d = VCdim(H)$. $H_{final}$ is a weighted vote, so the hypothesis class is $G = \{\text{all functions of the form } \mathrm{sign}(\sum_{t=1}^T \alpha_t h_t(x))\}$.

Theorem [Freund & Schapire '97]: $\forall g \in G$, $err(g) \le err_S(g) + \tilde{O}\!\left(\sqrt{Td/m}\right)$, where $T$ is the number of rounds.

Key reason: $VCdim(G) = \tilde{O}(dT)$, plus typical VC bounds.

Generalization Guarantees

Theorem [Freund & Schapire '97]: $\forall g \in G$, $err(g) \le err_S(g) + \tilde{O}\!\left(\sqrt{Td/m}\right)$, where $d = VCdim(H)$ and $T$ is the number of rounds. The left-hand side is the generalization error; on the right, the first term is the training error and the second is a complexity term that grows with $T$.

Generalization Guarantees

Experiments showed that the test error of the generated classifier usually does not increase as its size becomes very large, and that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test-set performance.

These results seem to contradict the FS'97 bound $\forall g \in G$, $err(g) \le err_S(g) + \tilde{O}\!\left(\sqrt{Td/m}\right)$, and Occam's razor (in order to achieve good test error, the classifier should be as simple as possible).

How can we explain the experiments? R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods". Key idea: training error does not tell the whole story; we also need to consider the classification confidence.

Boosting didn't seem to overfit (!) ... because it turned out to be increasing the margin of the classifier. [Figure: error curves (test error of the base classifier, and train and test error of the boosted classifier) and the margin distribution graph; plots from [SFBL98].]

Classification Margin

Let $H$ be the space of weak hypotheses. The convex hull of $H$ is
$$co(H) = \left\{ f = \sum_{t=1}^{T} \alpha_t h_t \;:\; \alpha_t \ge 0,\; \sum_{t=1}^{T} \alpha_t = 1,\; h_t \in H \right\}.$$

Let $f \in co(H)$, $f = \sum_{t=1}^{T} \alpha_t h_t$, with $\alpha_t \ge 0$ and $\sum_{t=1}^{T}\alpha_t = 1$. The majority vote rule $H_f$ given by $f$ (i.e., $H_f(x) = \mathrm{sign}(f(x))$) predicts wrongly on example $(x,y)$ iff $yf(x) \le 0$.

Definition: the margin of $H_f$ (or of $f$) on example $(x,y)$ is $yf(x)$:
$$yf(x) = y\sum_{t=1}^{T} \alpha_t h_t(x) = \sum_{t=1}^{T} y\,\alpha_t h_t(x) = \sum_{t: y = h_t(x)} \alpha_t - \sum_{t: y \neq h_t(x)} \alpha_t.$$

The margin is positive iff $y = H_f(x)$. Think of $|yf(x)| = |f(x)|$ as the strength or confidence of the vote: values near $-1$ mean high confidence but incorrect, values near $0$ mean low confidence, and values near $+1$ mean high confidence and correct.
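As a small computational illustration (a sketch; the function name and array layout are mine, not from the slides), the margins $y_i f(x_i)$ of a normalized vote can be computed directly from the weak hypotheses' predictions:

```python
import numpy as np

def voting_margins(alphas, weak_preds, y):
    """Margins y_i * f(x_i) for f = sum_t alpha_t h_t with the alphas normalized
    to sum to 1, so f lies in co(H) and every margin lies in [-1, 1].

    weak_preds: array of shape (T, m) holding each h_t's +/-1 predictions.
    """
    a = np.asarray(alphas, float)
    a = a / a.sum()                       # convex-hull normalization
    f = a @ np.asarray(weak_preds)        # f(x_i) for every example i
    return np.asarray(y) * f              # positive iff the vote H_f classifies x_i correctly
```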

Boosting and Margins

Theorem: If $VCdim(H) = d$, then with probability $\ge 1-\delta$, $\forall f \in co(H)$, $\forall \theta > 0$,
$$\Pr_{D}[yf(x) \le 0] \le \Pr_{S}[yf(x) \le \theta] + O\!\left(\sqrt{\frac{1}{m}\left(\frac{d\ln^2(m/d)}{\theta^2} + \ln\frac{1}{\delta}\right)}\right).$$
Note: the bound does not depend on $T$ (the number of rounds of boosting); it depends only on the complexity of the weak hypothesis space and on the margin.

Boosting and Margins

Theorem (restated): If $VCdim(H) = d$, then with probability $\ge 1-\delta$, $\forall f \in co(H)$, $\forall \theta > 0$,
$$\Pr_{D}[yf(x) \le 0] \le \Pr_{S}[yf(x) \le \theta] + O\!\left(\sqrt{\frac{1}{m}\left(\frac{d\ln^2(m/d)}{\theta^2} + \ln\frac{1}{\delta}\right)}\right).$$

If all training examples have large margins, then we can approximate the final classifier by a much smaller classifier. One can use this to prove that a better margin implies a smaller test error, regardless of the number of weak classifiers. One can also prove that boosting tends to increase the margin of training examples by concentrating on those of smallest margin. Although the final classifier is getting larger, the margins are likely to be increasing, so the final classifier is actually getting closer to a simpler classifier, driving down the test error.

Boosting, Adaboost Summary

Shift in mindset: the goal is now just to find classifiers a bit better than random guessing. Backed up by solid foundations. Adaboost and its variations work well in practice with many kinds of data (one of the top 10 ML algorithms). More about classic applications in Recitation. Relevant for the big-data age: it quickly focuses on the "core difficulties", so it is well suited to distributed settings where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12]. Issues: noise; weak learners.

Interestingly, the usefulness of margins has been recognized in machine learning since the late 1950s. The Perceptron [Rosenblatt'57] is analyzed via the geometric (aka $L_2$, $L_2$) margin; the original guarantee is in the online learning scenario.

The Perceptron Algorithm: Online Learning Model, Margin Analysis, Kernels

The Online Learning Model

Examples arrive sequentially; we need to make a prediction, and afterwards we observe the outcome. In phase $i$, for $i = 1, 2, \dots$: the online algorithm receives example $x_i$, makes a prediction $h(x_i)$, and then observes $c^*(x_i)$. This is the mistake-bound model: analysis-wise we make no distributional assumptions, and the goal is to minimize the number of mistakes.

The Online Learning Model: Motivation
- Email classification (the distribution of both spam and regular mail changes over time, but the target function stays fixed: last year's spam still looks like spam).
- Recommendation systems: recommending movies, etc.
- Predicting whether a user will be interested in a new news article or not.
- Ad placement in a new market.

Linear Separators

Instance space $X = \mathbb{R}^d$. Hypothesis class: linear decision surfaces in $\mathbb{R}^d$: $h(x) = w \cdot x + w_0$; if $h(x) \ge 0$, label $x$ as $+$, otherwise label it as $-$.

Claim: WLOG $w_0 = 0$.

Proof: We can simulate a non-zero threshold with a dummy input feature $x_0$ that is always set to 1:
$$x = (x_1, \dots, x_d) \;\to\; \tilde{x} = (x_1, \dots, x_d, 1),$$
and $w \cdot x + w_0 \ge 0$ iff $(w_1, \dots, w_d, w_0) \cdot \tilde{x} \ge 0$, where $w = (w_1, \dots, w_d)$.
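A tiny illustration of this reduction (a sketch, not from the slides; the function name is mine):

```python
import numpy as np

def homogenize(X, w, w0):
    """Map x -> (x, 1) and (w, w0) -> (w, w0) so that w.x + w0 >= 0 iff the
    augmented inner product is >= 0, i.e., the threshold becomes zero."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append dummy feature x_0 = 1
    w_aug = np.append(w, w0)
    return X_aug, w_aug
```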

Linear Separators: Perceptron Algorithm

Set $t=1$ and start with the all-zero vector $w_1$. Given example $x$, predict positive iff $w_t \cdot x \ge 0$. On a mistake, update as follows:
- Mistake on a positive example: $w_{t+1} \leftarrow w_t + x$.
- Mistake on a negative example: $w_{t+1} \leftarrow w_t - x$.

Note: $w_t$ is a weighted sum of the incorrectly classified examples,
$$w_t = a_{i_1} x_{i_1} + \dots + a_{i_k} x_{i_k}, \qquad w_t \cdot x = a_{i_1} (x_{i_1} \cdot x) + \dots + a_{i_k} (x_{i_k} \cdot x).$$
This will be important when we talk about kernels.

Perceptron Algorithm: Example

Examples (in order): $(-1,2)$ labeled $-$, $(1,0)$ labeled $+$, $(1,1)$ labeled $+$, $(-1,0)$ labeled $-$, $(-1,-2)$ labeled $-$, $(1,-1)$ labeled $+$.

Algorithm: set $t=1$, start with the all-zeroes weight vector $w_1$. Given example $x$, predict positive iff $w_t \cdot x \ge 0$; on a mistake on a positive example update $w_{t+1} \leftarrow w_t + x$, and on a mistake on a negative example update $w_{t+1} \leftarrow w_t - x$.

Run: $w_1 = (0,0)$; a mistake on $(-1,2)$ gives $w_2 = w_1 - (-1,2) = (1,-2)$; a mistake on $(1,1)$ gives $w_3 = w_2 + (1,1) = (2,-1)$; a mistake on $(-1,-2)$ gives $w_4 = w_3 - (-1,-2) = (3,1)$; the remaining examples are classified correctly.
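Below is a minimal sketch of the Perceptron update on exactly this example sequence (the code itself is not part of the deck); running it reproduces the weight vectors $w_2, w_3, w_4$ above.

```python
import numpy as np

def perceptron(examples):
    """Online Perceptron: predict positive iff w.x >= 0, update only on mistakes."""
    w = np.zeros(len(examples[0][0]))          # w_1 = all-zero vector
    trace = [tuple(w)]
    for x, y in examples:                      # y in {+1, -1}
        x = np.asarray(x, float)
        pred = 1 if w @ x >= 0 else -1
        if pred != y:                          # mistake: add x for positives, subtract for negatives
            w = w + y * x
            trace.append(tuple(w))
    return w, trace

data = [((-1, 2), -1), ((1, 0), +1), ((1, 1), +1),
        ((-1, 0), -1), ((-1, -2), -1), ((1, -1), +1)]
w, trace = perceptron(data)
print(trace)   # [(0,0), (1,-2), (2,-1), (3,1)]
```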

Geometric Margin

Definition: The margin of an example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative of that distance if $x$ is on the wrong side). [Figure: margins of a positive example $x_1$ and a negative example $x_2$ relative to the separator $w$.]

Geometric Margin

Definition: The margin of an example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative if on the wrong side).

Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$.

Geometric Margin

Definition: The margin of an example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative if on the wrong side).

Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$.

Definition: The margin $\gamma$ of a set of examples $S$ is the maximum $\gamma_w$ over all linear separators $w$.
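A short sketch (not from the slides; the function name is mine) for computing $\gamma_w$ for a given separator $w$; finding $\gamma$ itself would additionally require maximizing $\gamma_w$ over all separators.

```python
import numpy as np

def margin_wrt_separator(X, y, w):
    """gamma_w: the smallest signed distance y_i * (w . x_i) / ||w|| over the set S."""
    w = np.asarray(w, float)
    signed_dist = np.asarray(y) * (np.asarray(X, float) @ w) / np.linalg.norm(w)
    return float(signed_dist.min())
```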

Perceptron: Mistake Bound

Theorem: If the data has margin $\gamma$ and all points lie inside a ball of radius $R$, then the Perceptron makes at most $(R/\gamma)^2$ mistakes.

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)

Perceptron Algorithm: Analysis

Theorem: If the data has margin $\gamma$ and all points lie inside a ball of radius $R$, then the Perceptron makes at most $(R/\gamma)^2$ mistakes.

Update rule: on a mistake on a positive example, $w_{t+1} \leftarrow w_t + x$; on a mistake on a negative example, $w_{t+1} \leftarrow w_t - x$.

Proof idea: analyze $w_t \cdot w^*$ and $\|w_t\|$, where $w^*$ is the max-margin separator with $\|w^*\| = 1$.

Claim 1: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \gamma$ (because $\ell(x)\,(x \cdot w^*) \ge \gamma$, where $\ell(x)$ is the label of $x$).

Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + R^2$ ("Pythagorean Theorem": on a mistake the cross term $2\,\ell(x)\,(w_t \cdot x)$ is non-positive, and $\|x\|^2 \le R^2$).

After $M$ mistakes: $w_{M+1} \cdot w^* \ge \gamma M$ (by Claim 1) and $\|w_{M+1}\| \le R\sqrt{M}$ (by Claim 2). Since $w^*$ is unit length, $w_{M+1} \cdot w^* \le \|w_{M+1}\|$. So $\gamma M \le R\sqrt{M}$, hence $M \le (R/\gamma)^2$.
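As an empirical sanity check of this bound (a sketch under stated assumptions, not from the slides): generate separable data with margin at least $\gamma$ inside the unit ball, run the Perceptron in passes over the data until it is consistent, and confirm the total number of mistakes never exceeds $(R/\gamma)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)      # unit-length target separator
gamma = 0.1

# Sample points in the unit ball and keep only those with margin at least gamma
pts = rng.uniform(-1, 1, size=(5000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]
proj = pts @ w_star
X, y = pts[np.abs(proj) >= gamma], np.sign(proj[np.abs(proj) >= gamma])
R = np.linalg.norm(X, axis=1).max()

# Perceptron, cycling through the data until a full pass makes no mistakes
w, mistakes, changed = np.zeros(2), 0, True
while changed:
    changed = False
    for x, label in zip(X, y):
        if (1 if w @ x >= 0 else -1) != label:   # mistake
            w += label * x
            mistakes += 1
            changed = True

print(mistakes, (R / gamma) ** 2)   # the mistake count should be at most (R/gamma)^2
```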

Perceptron Extensions

- Can be used to find a consistent separator (by cycling through the data).
- One can convert the mistake-bound guarantee into a distributional guarantee too (for the case where the $x_i$'s come from a fixed distribution).
- Can be adapted to the case where there is no perfect separator, as long as the so-called hinge loss (i.e., the total distance the points would need to be moved to classify them correctly with a large margin) is small.
- Can be kernelized to handle non-linear decision boundaries!

Perceptron Discussion

A simple online algorithm for learning linear separators, with a nice guarantee that depends only on the geometric (aka $L_2$, $L_2$) margin. It can be kernelized to handle non-linear decision boundaries (see next class). Simple, but very useful in applications such as branch prediction; it also has interesting extensions to structured prediction.