BOOSTING & ADABOOST Lecturer: Yishay Mansour Itay Dangoor

OVERVIEW
Introduction to weak classifiers
Boosting the confidence
Equivalence of weak & strong learning
Boosting the accuracy - recursive construction
AdaBoost

WEAK VS. STRONG CLASSIFIERS
PAC (strong) classifier: renders classification of arbitrary accuracy.
Error rate: ε is arbitrarily small.
Confidence: 1 - δ is arbitrarily close to 100%.
Weak classifier: only slightly better than a random guess.
Error rate: ε < 50%.
Confidence: 1 - δ ≥ 50%.

WEAK VS. STRONG CLASSIFIERS
It is easier to find a hypothesis that is correct only 51 percent of the time than to find one that is correct 99 percent of the time.
Some examples:
The category of one word in a sentence
The gray level of one pixel in an image
Very simple patterns in image segments
The degree of a node in a graph

WEAK VS. STRONG CLASSIFIERS
Given the following data (two classes of labeled points in the plane):

WEAK VS. STRONG CLASSIFIERS
A threshold in one dimension will be a weak classifier.
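A minimal sketch of such a one-dimensional threshold classifier (a "decision stump") in Python with NumPy, on a hypothetical toy dataset; the helper name best_stump, the data, and the uniform example weights are illustrative assumptions, not part of the lecture:

    import numpy as np

    def best_stump(X, y, w):
        """Search for the threshold classifier h(x) = sign(s * (x[j] - theta))
        with the lowest weighted error.
        X: (m, d) features, y: labels in {-1, +1}, w: example weights summing to 1."""
        m, d = X.shape
        best = (1.0, None)  # (weighted error, (feature j, threshold theta, sign s))
        for j in range(d):
            for theta in np.unique(X[:, j]):
                for s in (-1, 1):
                    pred = np.where(s * (X[:, j] - theta) >= 0, 1, -1)
                    err = float(np.sum(w[pred != y]))
                    if err < best[0]:
                        best = (err, (j, theta, s))
        return best

    # Toy XOR-like data: no single threshold separates the classes perfectly,
    # but some stump still does better than random guessing.
    X = np.array([[1., 1.], [1., 4.], [4., 1.], [4., 4.], [2., 3.], [3., 2.]])
    y = np.array([1, -1, -1, 1, 1, -1])
    err, (j, theta, s) = best_stump(X, y, np.full(len(y), 1 / len(y)))
    print(err, j, theta, s)   # weighted error of the best stump, here about 1/3

On this toy data the best stump errs on 2 of the 6 points, i.e. its error is about 0.33 < ½, which is exactly the kind of "slightly better than random" behavior a weak classifier provides.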

WEAK VS. STRONG CLASSIFIERS
A suitable half plane might be a strong classifier.

WEAK VS. STRONG CLASSIFIERS
A combination of weak classifiers could render a good classification.

BOOST TO THE CONFIDENCE
Given an algorithm A which, for any target accuracy ε, returns with probability ≥ ½ a hypothesis h s.t. error(h, c*) ≤ ε, we can build a PAC learning algorithm from A.
Algorithm Boost-Confidence(A):
Select ε' = ε/2, and run A (with accuracy ε') k = log(2/δ) times, on new data each time, to get output hypotheses h_1 … h_k.
Draw a new sample S of size m = (2/ε²)·ln(4k/δ) and test each hypothesis h_i on it. The observed (empirical) error on S is denoted error_S(h_i).
Return h* s.t. error_S(h*) = min_i error_S(h_i).
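A sketch of Boost-Confidence as plain Python, under stated assumptions: A(X, y) is the given learner (it succeeds with probability at least ½ when trained on n_train fresh examples), draw_sample(n) returns n fresh labeled examples, and returned hypotheses are callables h(x); these names and interfaces are illustrative, not from the lecture.

    import math
    import numpy as np

    def boost_confidence(A, draw_sample, n_train, epsilon, delta):
        """Amplify a learner that succeeds only with probability >= 1/2
        into one that succeeds with probability >= 1 - delta."""
        # Stage 1: run A on k = ceil(log2(2/delta)) independent samples.
        k = max(1, math.ceil(math.log2(2 / delta)))
        hypotheses = [A(*draw_sample(n_train)) for _ in range(k)]

        # Stage 2: draw one fresh sample of size m = (2/eps^2) ln(4k/delta)
        # and keep the hypothesis with the smallest observed error on it.
        m = math.ceil((2 / epsilon ** 2) * math.log(4 * k / delta))
        X, y = draw_sample(m)
        observed = [float(np.mean([h(x) != label for x, label in zip(X, y)]))
                    for h in hypotheses]
        return hypotheses[int(np.argmin(observed))]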

BOOST TO THE CONFIDENCE - ALG. CORRECTNESS - I
After the first stage (running A k times), with probability at most (½)^k we have ∀i. error(h_i) > ε/2.
⇒ With probability at least 1 - (½)^k, ∃i. error(h_i) ≤ ε/2.
Setting k = log(2/δ) gives: with probability 1 - δ/2, ∃i. error(h_i) ≤ ε/2.

BOOST TO THE CONFIDENCE - ALG. CORRECTNESS - II
After the second stage (testing all h_i on a new sample), with probability 1 - δ/2 the output h* satisfies error(h*) ≤ ε/2 + min_i error(h_i) ≤ ε.
Proof:
Using a Chernoff bound, bound the probability of a "bad" event for a single hypothesis: Pr[|error(h_i) - error_S(h_i)| ≥ ε/2] ≤ 2e^{-2m(ε/2)²}.
Bound the probability of a "bad" event for any of the k hypotheses by δ/2 using a union bound: 2k·e^{-2m(ε/2)²} ≤ δ/2.
Now isolate m: m ≥ (2/ε²)·ln(4k/δ).
So, with probability 1 - δ/2, for a sample of size at least m: ∀i. |error(h_i) - error_S(h_i)| < ε/2.
So: error(h*) - min_i error(h_i) < ε/2.

BOOST TO THE CONFIDENCE - ALG. CORRECTNESS
From the first stage: ∃i. error(h_i) ≤ ε/2 ⇒ min_i error(h_i) ≤ ε/2.
From the second stage: error(h*) - min_i error(h_i) < ε/2.
All together: error(h*) ≤ ε/2 + min_i error(h_i) ≤ ε. Q.E.D.

WEAK LEARNING DEFINITION
Algorithm A Weak-PAC learns a concept class C with hypothesis class H if:
∃γ > 0 - this replaces ε; the accuracy requirement is only ½ - γ < ½.
∀c* ∈ C, ∀D, ∀δ - identical to PAC.
With probability 1 - δ: algorithm A outputs h ∈ H s.t. error(h) ≤ ½ - γ.

EQUIVALENCE OF WEAK & STRONG LEARNING
Theorem: If a concept class has a Weak-PAC learning algorithm, it also has a PAC learning algorithm.

EQUIVALENCE OF WEAK & STRONG LEARNING
Given:
Input sample: x_1 … x_m
Labels: c*(x_1) … c*(x_m)
Weak hypothesis class: H
Use a Regret Minimization (RM) algorithm over the m examples. For each step t:
Choose a distribution D_t over x_1 … x_m and run the weak learner on D_t to obtain h_t.
For each example that h_t classifies correctly, increment that example's loss by 1.
After T steps, MAJ(h_1(x) … h_T(x)) classifies all the samples correctly.

EQUIVALENCE OF WEAK & STRONG LEARNING - RM CORRECTNESS
The loss of the RM algorithm is at least (½ + γ)T, since at each step the weak learner returns h_t that classifies correctly at least a (½ + γ) fraction of the samples (weighted by D_t).
Suppose MAJ does not classify some x_i correctly; then the loss of x_i is at most T/2.
2√(T·log m) is the regret bound of RM
⇒ (½ + γ)T ≤ loss(RM) ≤ T/2 + 2√(T·log m)
⇒ T ≤ (4·log m)/γ²
⇒ Executing the RM algorithm for more than (4·log m)/γ² steps renders a consistent hypothesis
⇒ By Occam's Razor we can PAC learn the class.
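As a rough worked example (taking log as the natural logarithm): with m = 1000 examples and a weak-learning edge of γ = 0.1, the bound gives T ≤ 4·ln(1000)/0.01 ≈ 2760, so running the reduction for more than roughly 2760 rounds already forces the majority vote to be consistent with the sample.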

RECURSIVE CONSTRUCTION
Given a weak learning algorithm A with error probability p, we can generate a better-performing algorithm by running A multiple times.

RECURSIVE CONSTRUCTION
Step 1: Run A on the initial distribution D_1 to obtain h_1 (error ≤ ½ - γ).
Step 2: Define a new distribution D_2:
D_2(x) = D_1(x)/(2(1 - p)) if x ∈ S_c, and D_1(x)/(2p) if x ∈ S_e,
where p is the error of h_1 under D_1, S_c = {x | h_1(x) = c*(x)}, S_e = {x | h_1(x) ≠ c*(x)}.
D_2 gives the same weight to h_1's errors and non-errors: D_2(S_c) = D_2(S_e) = ½.
Run A on D_2 to obtain h_2.
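For instance, if h_1's error under D_1 is p = 0.25, every correctly classified point has its weight multiplied by 1/(2·0.75) = 2/3 and every misclassified point by 1/(2·0.25) = 2, so each side ends up carrying total weight ½.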

RECURSIVE CONSTRUCTION
Step 1: Run A on D_1 to obtain h_1.
Step 2: Run A on D_2 (D_2(S_c) = D_2(S_e) = ½) to obtain h_2.
Step 3: Define a distribution D_3 only on examples x for which h_1(x) ≠ h_2(x):
D_3(x) = D_1(x)/Z if h_1(x) ≠ h_2(x), and 0 otherwise,
where Z = Pr_{D_1}[h_1(x) ≠ h_2(x)].
Run A on D_3 to obtain h_3.

RECURSIVE CONSTRUCTION
Step 1: Run A on D_1 to obtain h_1.
Step 2: Run A on D_2 (D_2(S_c) = D_2(S_e) = ½) to obtain h_2.
Step 3: Run A on D_3 (examples that satisfy h_1(x) ≠ h_2(x)) to obtain h_3.
Return the combined hypothesis H(x) = MAJ(h_1(x), h_2(x), h_3(x)).
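A minimal sketch of one level of this construction on a finite sample, in Python with NumPy; A(X, y, D) is assumed to return a ±1-valued hypothesis callable on an array of examples, and the edge cases p = 0 and Z = 0 are ignored for brevity:

    import numpy as np

    def boost_three(A, X, y, D1):
        """One level of the recursive construction (assumes 0 < p < 1/2 and Z > 0)."""
        # Step 1: h1 on the original distribution.
        h1 = A(X, y, D1)
        wrong1 = (h1(X) != y)
        p = float(np.sum(D1[wrong1]))              # error of h1 under D1

        # Step 2: D2 puts total weight 1/2 on h1's mistakes and 1/2 on the rest.
        D2 = np.where(wrong1, D1 / (2 * p), D1 / (2 * (1 - p)))
        h2 = A(X, y, D2)

        # Step 3: D3 is D1 restricted to the disagreement region h1(x) != h2(x).
        disagree = (h1(X) != h2(X))
        Z = float(np.sum(D1[disagree]))            # Z = Pr_D1[h1(x) != h2(x)]
        D3 = np.where(disagree, D1 / Z, 0.0)
        h3 = A(X, y, D3)

        # Combined hypothesis: majority vote of the three (a sum of three +/-1
        # values is never zero, so the sign is always well defined).
        return lambda Xq: np.sign(h1(Xq) + h2(Xq) + h3(Xq))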

RECURSIVE CONSTRUCTION - ERROR RATE
Intuition: suppose h_1, h_2, h_3 err independently, each with probability p. What would be the error of MAJ(h_1, h_2, h_3)?
Error = 3p²(1 - p) + p³ = 3p² - 2p³
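For example, p = 0.2 gives 3(0.04) - 2(0.008) = 0.104, roughly half the original error, and p = 0.4 gives 0.352; in both cases the majority vote is strictly better than p.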

RECURSIVE CONSTRUCTION - ERROR RATE
Define (the first index refers to h_1, the second to h_2; c = correct, e = error):
S_cc = {x | h_1(x) = c*(x) ∧ h_2(x) = c*(x)}
S_ee = {x | h_1(x) ≠ c*(x) ∧ h_2(x) ≠ c*(x)}
S_ce = {x | h_1(x) = c*(x) ∧ h_2(x) ≠ c*(x)}
S_ec = {x | h_1(x) ≠ c*(x) ∧ h_2(x) = c*(x)}
P_cc = D_1(S_cc), P_ee = D_1(S_ee), P_ce = D_1(S_ce), P_ec = D_1(S_ec)
The error probability under D_1 is P_ee + (P_ce + P_ec)·p: MAJ errs when both h_1 and h_2 err, or when exactly one of them errs (the disagreement region, on which h_3 errs with probability p).

RECURSIVE CONSTRUCTION - ERROR RATE
Define α = D_2(S_ce).
From the definition of D_2 in terms of D_1 (S_ce ⊆ S_c is scaled by 1/(2(1 - p))): P_ce = 2(1 - p)·α.
D_2(S_*e) = p, i.e. h_2's error under D_2 is p, and therefore D_2(S_ee) = p - α ⇒ P_ee = 2p·(p - α).
Also, from the construction of D_2: D_2(S_ec) = ½ - (p - α) ⇒ P_ec = 2p·(½ - p + α).
The total error:
P_ee + (P_ce + P_ec)·p = 2p(p - α) + p·(2p(½ - p + α) + 2(1 - p)α) = 3p² - 2p³
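Sanity check: for p = 0.2 and α = 0.1 these formulas give P_ce = 0.16, P_ee = 0.04 and P_ec = 0.16, so the total error is 0.04 + (0.16 + 0.16)·0.2 = 0.104 = 3(0.2)² - 2(0.2)³, matching the intuition above.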

RECURSIVE CONSTRUCTION - RECURSION STEP
Let the initial error probability be p_0 = ½ - γ_0.
Each level improves upon the previous one:
½ - γ_{t+1} = 3(½ - γ_t)² - 2(½ - γ_t)³ ≤ ½ - γ_t·(3/2 - γ_t),
so γ_{t+1} ≥ γ_t·(3/2 - γ_t) and γ grows by a constant factor while it is small.
Termination condition: obtain an error of ε.
Once γ_t ≥ ¼ (i.e. p_t ≤ ¼), we have p_{t+1} = 3p_t² - 2p_t³ ≤ 3p_t², so the error is roughly squared at every further level and drops below ε after O(log log(1/ε)) more levels.
Recursion depth: O(log(1/γ) + log(log(1/ε)))
Number of nodes: 3^depth = poly(1/γ, log(1/ε))

ADABOOST
An iterative boosting algorithm that turns a weak learning algorithm into a strong learning algorithm.
It maintains a distribution over the input sample and increases the weight of the "harder" to classify examples, so that the algorithm focuses on them.
It is easy to implement, runs efficiently, and removes the need to know the γ parameter in advance.

ADABOOST - ALGORITHM DESCRIPTION
Input:
A set of m classified examples S = {(x_i, y_i)}, i ∈ {1…m}, y_i ∈ {-1, 1}
A weak learning algorithm A
Define D_t - the distribution at time t, and D_t(i) - the weight of example x_i at time t.
Initialize: D_1(i) = 1/m for all i ∈ {1…m}.

ADABOOST - ALGORITHM DESCRIPTION
Input: S = {(x_i, y_i)} and a weak learning algorithm A.
Maintain a distribution D_t for each step t; initialize D_1(i) = 1/m.
Step t:
Run A on D_t to obtain h_t.
Define ε_t = Pr_{D_t}[h_t(x) ≠ c*(x)].
Update D_{t+1}(i) = D_t(i)·e^{-y_i·α_t·h_t(x_i)} / Z_t, where Z_t is a normalizing factor and α_t = ½·ln((1 - ε_t)/ε_t).
Output: H(x) = sign(Σ_{t=1}^T α_t·h_t(x)).
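The update rule translates almost line for line into code. Below is a minimal Python/NumPy sketch; the weak_learner(X, y, D) interface (for example, a wrapper around the stump search sketched earlier) and the clipping of ε_t away from 0 and 1 are illustrative assumptions:

    import numpy as np

    def adaboost(X, y, weak_learner, T):
        """AdaBoost on a sample (X, y) with labels in {-1, +1}.
        weak_learner(X, y, D) must return a hypothesis callable on an array."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
        alphas, hyps = [], []
        for t in range(T):
            h = weak_learner(X, y, D)
            pred = h(X)
            eps = float(np.sum(D[pred != y]))        # eps_t = Pr_{D_t}[h_t(x) != c*(x)]
            eps = min(max(eps, 1e-10), 1 - 1e-10)    # avoid division by zero / log(0)
            alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1-eps_t)/eps_t)
            D = D * np.exp(-alpha * y * pred)        # D_{t+1}(i) ~ D_t(i) e^{-y_i alpha_t h_t(x_i)}
            D = D / D.sum()                          # dividing by the normalizer Z_t
            alphas.append(alpha)
            hyps.append(h)
        # Final hypothesis: H(x) = sign( sum_t alpha_t h_t(x) )
        return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hyps)))

Wrapping the earlier best_stump search so that it returns a callable stump gives the familiar "AdaBoost with decision stumps" learner, and its training error can then be compared against the bound on the next slide.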

ADABOOST - ERROR BOUND
Theorem: Let H be the output hypothesis of AdaBoost, and let γ_t = ½ - ε_t. Then the training error satisfies
error(H) ≤ Π_{t=1}^T 2√(ε_t(1 - ε_t)) = Π_{t=1}^T √(1 - 4γ_t²) ≤ e^{-2·Σ_{t=1}^T γ_t²}.
Notice that this means the error drops exponentially fast in T.
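For instance, if every round achieves an edge of γ_t ≥ 0.1 (i.e. ε_t ≤ 0.4), the bound gives error(H) ≤ e^{-0.02·T}, which drops below 1% after about 230 rounds.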

ADABOOST - ERROR BOUND
Proof structure:
Step 1: express D_{T+1}: D_{T+1}(i) = D_1(i)·e^{-y_i·f(x_i)} / Π_{t=1}^T Z_t, where f(x) = Σ_{t=1}^T α_t·h_t(x).
Step 2: bound the training error: error(H) ≤ Π_{t=1}^T Z_t.
Step 3: express Z_t in terms of γ_t: Z_t = 2√(ε_t(1 - ε_t)) = √(1 - 4γ_t²) ≤ e^{-2γ_t²}.
The last inequality is derived from 1 + x ≤ e^x.

ADABOOST - ERROR BOUND
Proof, Step 1:
By definition: D_{t+1}(i) = D_t(i)·e^{-y_i·α_t·h_t(x_i)} / Z_t
⇒ (unrolling the recursion) D_{T+1}(i) = D_1(i)·Π_{t=1}^T [e^{-y_i·α_t·h_t(x_i)} / Z_t]
⇒ D_{T+1}(i) = D_1(i)·e^{-y_i·Σ_{t=1}^T α_t·h_t(x_i)} / Π_{t=1}^T Z_t
Define f(x) = Σ_{t=1}^T α_t·h_t(x). To summarize:
D_{T+1}(i) = D_1(i)·e^{-y_i·f(x_i)} / Π_{t=1}^T Z_t

ADABOOST - ERROR BOUND
Proof, Step 2:
error(H) = (1/m)·Σ_{i=1}^m 1[H(x_i) ≠ y_i]
≤ (1/m)·Σ_{i=1}^m e^{-y_i·f(x_i)}    (indicator function: H(x_i) ≠ y_i means y_i·f(x_i) ≤ 0, and 1[z ≤ 0] ≤ e^{-z})
= Σ_{i=1}^m D_1(i)·e^{-y_i·f(x_i)}
= Σ_{i=1}^m D_{T+1}(i)·Π_{t=1}^T Z_t    (step 1)
= Π_{t=1}^T Z_t    (D_{T+1} is a prob. dist.)

ADABOOST - ERROR BOUND
Proof, Step 3:
By definition: Z_t = Σ_{i=1}^m D_t(i)·e^{-y_i·α_t·h_t(x_i)}
⇒ Z_t = Σ_{y_i = h_t(x_i)} D_t(i)·e^{-α_t} + Σ_{y_i ≠ h_t(x_i)} D_t(i)·e^{α_t}
From the definition of ε_t: Z_t = (1 - ε_t)·e^{-α_t} + ε_t·e^{α_t}
This expression holds for any α_t; choose α_t to minimize it (and hence the bound on error(H)):
d/dα_t [(1 - ε_t)·e^{-α_t} + ε_t·e^{α_t}] = -(1 - ε_t)·e^{-α_t} + ε_t·e^{α_t} = 0
⇒ α_t = ½·ln((1 - ε_t)/ε_t)
⇒ Z_t = 2√(ε_t(1 - ε_t))
⇒ error(H) ≤ Π_{t=1}^T 2√(ε_t(1 - ε_t))    QED
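As a worked example, a round with ε_t = 0.3 gives α_t = ½·ln(0.7/0.3) ≈ 0.42 and Z_t = 2√(0.3·0.7) ≈ 0.92, so each such round shrinks the bound on the training error by a factor of roughly 0.92.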