2 Tony Jebara, Columbia University Advanced Machine Learning & Perception. Instructor: Tony Jebara

3 Tony Jebara, Columbia University Boosting. Combining Multiple Classifiers: Voting, Boosting, AdaBoost. Based on material by Y. Freund, P. Long & R. Schapire.

4 Tony Jebara, Columbia University Combining Multiple Learners. Have many simple learners, also called base learners or weak learners, each with a classification error below 0.5. Combine or vote them to get higher accuracy. No free lunch: there is no guaranteed best approach here. Different approaches:
- Voting: combine learners with fixed weights.
- Mixture of Experts: adapt the learners and use a variable weight/gating function.
- Boosting: actively search for the next base learners and vote.
- Cascading, Stacking, Bagging, etc.

5 Tony Jebara, Columbia University Voting. Have T classifiers. Average their predictions with fixed weights. Like a mixture of experts, but each weight is constant with respect to the input. [Figure: experts each see the input x; their outputs are combined with fixed weights into the prediction y.]
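
A minimal sketch of fixed-weight voting, assuming each base classifier outputs ±1 and the weights are chosen in advance (all names and numbers here are illustrative, not from the slides):

```python
import numpy as np

def vote(classifiers, alphas, X):
    """Fixed-weight vote: sign of the alpha-weighted sum of +/-1 predictions."""
    scores = sum(a * clf(X) for a, clf in zip(alphas, classifiers))
    return np.sign(scores)

# Example: three trivial threshold "experts" on 1-D inputs.
classifiers = [lambda X: np.sign(X - 1), lambda X: np.sign(X - 2), lambda X: np.sign(X - 3)]
alphas = [0.5, 0.3, 0.2]
print(vote(classifiers, alphas, np.array([0.0, 2.5, 4.0])))  # [-1.  1.  1.]
```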

6 Tony Jebara, Columbia University Mixture of Experts. Have T classifiers (experts) and a gating function. Average their predictions with variable, input-dependent weights. But adapt the parameters of the gating function and of the experts (the total number T of experts is fixed). [Figure: a gating network and the experts all see the input; the gate's weights combine the expert outputs into the final output.]
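
By contrast, a minimal mixture-of-experts sketch with a softmax gating function whose weights depend on the input; the gate and experts below are untrained placeholders, whereas in practice their parameters are adapted jointly (e.g. by EM or gradient descent):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_predict(X, experts, gate_W, gate_b):
    """Input-dependent weights from the gate; weighted sum of expert outputs."""
    gate = softmax(X @ gate_W + gate_b)                 # (N, T) weights that vary with x
    preds = np.stack([f(X) for f in experts], axis=1)   # (N, T) expert outputs
    return (gate * preds).sum(axis=1)                   # (N,) combined prediction

# Two toy experts on 2-D inputs and a random (untrained) gate, for illustration only.
rng = np.random.default_rng(0)
experts = [lambda X: X[:, 0], lambda X: -X[:, 1]]
gate_W, gate_b = rng.normal(size=(2, 2)), np.zeros(2)
X = rng.normal(size=(5, 2))
print(mixture_predict(X, experts, gate_W, gate_b))
```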

7 Tony Jebara, Columbia University Boosting. Actively find complementary or synergistic weak learners. Train the next learner based on the mistakes of the previous ones. Average the predictions with fixed weights. Find the next learner by training on weighted versions of the data. [Figure: weights over the training data feed the weak learner; each weak rule joins the ensemble of learners, whose prediction is used to reweight the data.]

8 Tony Jebara, Columbia University AdaBoost. The most popular weighting scheme. Define the margin for point i as y_i f(x_i), where f(x) = Σ_t α_t h_t(x). Find an h_t and find a weight α_t to minimize the cost function, the sum of exponentiated negative margins: J = Σ_i exp(-y_i f(x_i)). [Figure: the same weighted-data / weak-learner / ensemble loop as on the previous slide.]

9 Tony Jebara, Columbia University AdaBoost. Choose the base learner and α_t. Recall that the weighted error ε_t of base classifier h_t must be below 1/2. For binary h, AdaBoost puts this weight on weak learners: α_t = ½ ln((1-ε_t)/ε_t) (instead of the more general rule). AdaBoost picks the following weights on the data for the next round: w_i^(t+1) = w_i^(t) exp(-α_t y_i h_t(x_i)) / Z_t (here Z_t is the normalizer so the weights sum to 1).
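
A compact sketch of these two rules (the closed-form α_t for binary weak learners and the exponential reweighting with normalizer Z_t), using depth-1 scikit-learn trees as weak learners and labels y in {-1, +1}; this is an illustration of the formulas, not the lecture's own implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost with decision stumps; y must be a numpy array of +/-1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)            # data weights, kept normalized to sum to 1
    learners, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()       # weighted error; must stay below 1/2
        if eps == 0 or eps >= 0.5:
            break                      # alpha would be infinite, zero, or negative
        alpha = 0.5 * np.log((1 - eps) / eps)   # weight on the weak learner
        w = w * np.exp(-alpha * y * pred)       # upweight mistakes, downweight correct points
        w /= w.sum()                            # divide by Z_t so weights sum to 1
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    f = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(f)
```

Training stops early if a weak learner's weighted error reaches 1/2, since its α_t would no longer be positive.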

10 Tony Jebara, Columbia University Decision Trees. [Figure: a small decision tree that tests X>3 and then Y>5, with ±1 labels at the leaves, shown next to the corresponding rectangular partition of the X-Y plane with thresholds at X=3 and Y=5.]

11 Tony Jebara, Columbia University Decision tree as a sum. [Figure: the same tree rewritten as a sum of real-valued scores (a root score of -0.2, plus +0.1/-0.1 from the X>3 test and +0.2/-0.3 from the Y>5 test); the final prediction is the sign of the accumulated sum.]

12 Tony Jebara, Columbia University An alternating decision tree. [Figure: an alternating decision tree extending the previous sum with an extra splitter node (Y<1 contributing +0.7 or 0.0); prediction nodes and splitter nodes alternate, and the output is the sign of the sum of scores along all active paths.]
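
A tiny illustration of the "tree as a sum of scores" view: each node contributes a real-valued score and the classification is the sign of the total. The scores below mirror the numbers drawn on the slides (-0.2, ±0.1, +0.2/-0.3, +0.7/0.0), but the flat structure (every splitter attached at the root) is an assumption made for brevity:

```python
import numpy as np

def adtree_score(x, y):
    """Sum the scores of all active nodes, then take the sign."""
    s = -0.2                       # root prediction node
    s += +0.1 if x > 3 else -0.1   # splitter node on X>3
    s += +0.2 if y > 5 else -0.3   # splitter node on Y>5
    s += +0.7 if y < 1 else 0.0    # extra splitter an alternating tree can attach anywhere
    return np.sign(s)

print(adtree_score(4.0, 6.0), adtree_score(1.0, 6.0))  # 1.0 -1.0
```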

13 Tony Jebara, Columbia University Example: Medical Diagnostics. Cleve dataset from the UC Irvine database. Heart disease diagnostics (+1 = healthy, -1 = sick). 13 features from tests (real-valued and discrete). 303 instances.

14 Tony Jebara, Columbia University Ad-Tree Example

15 Tony Jebara, Columbia University Cross-validated accuracy.

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%

16 Tony Jebara, Columbia University AdaBoost Convergence. Rationale? Consider a bound on the training error: the 0-1 training error is at most (1/N) Σ_i exp(-y_i f(x_i)) = Π_t Z_t, using the exponential bound on the step function, the definition of f(x), and recursive use of the normalizer Z. AdaBoost is essentially doing gradient descent on this bound. Convergence? [Figure: loss versus margin, comparing the 0-1 loss with the exponential bound and with the losses used by BrownBoost and LogitBoost; mistakes lie at negative margin, correct predictions at positive margin.]
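
Written out, the chain of steps named on the slide (the exponential bound on the step function, the definition of f(x), and recursive use of the normalizer Z) gives the standard training-error bound; a reconstruction in LaTeX:

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\big[y_i \neq \operatorname{sign}(f(x_i))\big]
  &\le \frac{1}{N}\sum_{i=1}^{N} e^{-y_i f(x_i)}
     && \text{(0-1 loss $\le$ exponential loss)}\\
  &= \frac{1}{N}\sum_{i=1}^{N} \exp\Big(-y_i\sum_{t=1}^{T}\alpha_t h_t(x_i)\Big)
     && \text{(definition of } f(x)\text{)}\\
  &= \prod_{t=1}^{T} Z_t
     && \text{(recursive use of } w_i^{(t+1)} = w_i^{(t)} e^{-\alpha_t y_i h_t(x_i)}/Z_t\text{)}
\end{align*}
```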

17 Tony Jebara, Columbia University AdaBoost Convergence. Convergence? Consider the binary h_t case, where Z_t = 2 sqrt(ε_t(1-ε_t)). If each weak learner is at least γ better than chance, i.e. ε_t ≤ 1/2 - γ, then Z_t ≤ sqrt(1-4γ²) ≤ exp(-2γ²), so the training error is at most Π_t Z_t ≤ exp(-2γ²T). So the final learner's training error converges to zero exponentially fast in T if each weak learner is at least better than gamma!
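
A quick numerical sanity check of this rate in Python, assuming every round has weighted error exactly ε_t = 1/2 - γ (the worst case the assumption allows):

```python
import math

gamma, T = 0.1, 100
eps = 0.5 - gamma
Z = 2 * math.sqrt(eps * (1 - eps))      # per-round normalizer for binary h_t
print(Z ** T)                           # product of the Z_t: ~0.13
print(math.exp(-2 * gamma**2 * T))      # bound exp(-2 gamma^2 T): ~0.135, which is >= the product
```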

18 Tony Jebara, Columbia University Curious phenomenon Boosting decision trees Using 2,000,000 parameters

19 Tony Jebara, Columbia University Explanation using margins. [Figure: 0-1 loss as a function of the margin.]

20 Tony Jebara, Columbia University Explanation using margins. [Figure: 0-1 loss as a function of the margin, with the training margin distribution overlaid.] No examples with small margins!!

21 Tony Jebara, Columbia University Experimental Evidence

22 Tony Jebara, Columbia University AdaBoost Generalization Bound. Also, a VC analysis gives a generalization bound: the test error is at most the training error plus a term of order sqrt(Td/N) (where d is the VC dimension of the base classifier and N the number of training examples). But this suggests more iterations → overfitting! A margin analysis is possible; redefine the margin as the normalized margin y_i Σ_t α_t h_t(x_i) / Σ_t α_t. Then we have a bound in terms of the fraction of training points with margin below a threshold θ plus a term of order sqrt(d/(Nθ²)), independent of the number of rounds T.
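
Hiding constants and logarithmic factors, the two bounds referred to here are usually stated as follows (the VC-style bound of Freund & Schapire and the margin bound of Schapire, Freund, Bartlett & Lee); this is a reconstruction in standard notation rather than the exact expressions on the slide:

```latex
% VC-style analysis: the complexity term grows with the number of rounds T
\Pr_{\mathrm{test}}\big[H(x)\neq y\big]
  \;\le\; \Pr_{\mathrm{train}}\big[H(x)\neq y\big]
        + \tilde{O}\!\left(\sqrt{\frac{Td}{N}}\right)

% Margin analysis: for any \theta>0, with the normalized margin
% \mathrm{margin}(x,y) = y\sum_t \alpha_t h_t(x)\,\big/\,\sum_t \alpha_t,
\Pr_{\mathrm{test}}\big[yf(x)\le 0\big]
  \;\le\; \Pr_{\mathrm{train}}\big[\mathrm{margin}(x,y)\le\theta\big]
        + \tilde{O}\!\left(\sqrt{\frac{d}{N\theta^{2}}}\right)
```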

23 Tony Jebara, Columbia University AdaBoost Generalization Bound. [Figure: the margin.] Suggests this optimization problem: maximize the minimum normalized margin over the training examples.

24 Tony Jebara, Columbia University AdaBoost Generalization Bound Proof Sketch

25 Tony Jebara, Columbia University UCI Results (% test error rates).

Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%

26 Tony Jebara, Columbia University Boosted Cascade of Stumps. Consider classifying an image as face/not-face. Use as weak learner a stump over average pixel intensities: easy to calculate, the white areas are subtracted from the black ones. A special representation of the sample, called the integral image, makes feature extraction faster.

27 Tony Jebara, Columbia University Boosted Cascade of Stumps. Summed area tables: a representation in which the sum of the values in any rectangle can be calculated with four accesses of the integral image.
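
A minimal sketch of the summed-area table (integral image) and the four-access rectangle sum, assuming a grayscale image stored as a 2-D integer array; the function names are illustrative:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img over rows < r and columns < c (with a zero row/column of padding)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of any rectangle from exactly four accesses of the integral image."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())  # both 30
```

A stump's feature value is then just a few such rectangle sums (white rectangles minus black ones), regardless of the rectangle sizes.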

28 Tony Jebara, Columbia University Boosted Cascade of Stumps Summed area tables

29 Tony Jebara, Columbia University Boosted Cascade of Stumps. The base size for a sub-window is 24 by 24 pixels. Each of the four feature types is scaled and shifted across all possible combinations. In a 24 pixel by 24 pixel sub-window there are ~160,000 possible features to be calculated.
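
A brute-force count of how many scaled and shifted rectangle features fit in a 24 x 24 window, assuming the usual base shapes (two two-rectangle, two three-rectangle, and one four-rectangle variant); the exact shape set, and hence the exact total, is an assumption, but it lands in the ~160,000 range quoted on the slide:

```python
def count_haar_features(window=24, base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count all placements: every position and every integer scaling of each base shape."""
    total = 0
    for bw, bh in base_shapes:
        for w in range(bw, window + 1, bw):       # widths that keep the sub-rectangles equal
            for h in range(bh, window + 1, bh):   # heights likewise
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_haar_features())  # 162,336 with these five base shapes
```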

30 Tony Jebara, Columbia University Boosted Cascade of Stumps. In the Viola-Jones algorithm, with K attributes (e.g., K = 160,000) we have 160,000 different decision stumps to choose from. At each stage of boosting, given the reweighted data from the previous stage:
- Train all K (160,000) single-feature perceptrons (stumps).
- Select the single best classifier at this stage.
- Combine it with the other previously selected classifiers.
- Reweight the data.
- Learn all K classifiers again, select the best, combine, reweight.
- Repeat until you have T classifiers selected.
Very computationally intensive! (A sketch of this selection loop appears below.)
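
A sketch of the selection loop above, assuming the K features have been precomputed into a matrix F (one column per feature) and that each single-feature classifier is a threshold stump; all names are illustrative, and a real implementation sorts the feature values once per feature instead of brute-forcing thresholds:

```python
import numpy as np

def best_stump(F, y, w):
    """F: (N, K) feature values, y: +/-1 labels, w: data weights. Return the lowest-error stump."""
    best = (None, None, None, np.inf)               # (feature, threshold, polarity, error)
    for k in range(F.shape[1]):                     # try all K features
        for thr in np.unique(F[:, k]):              # candidate thresholds
            for polarity in (+1, -1):
                pred = polarity * np.sign(F[:, k] - thr)
                pred[pred == 0] = polarity
                err = w[pred != y].sum()            # weighted error of this stump
                if err < best[3]:
                    best = (k, thr, polarity, err)
    return best

def boost_rounds(F, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)
    chosen = []
    for t in range(T):
        k, thr, pol, eps = best_stump(F, y, w)                  # train/evaluate all K stumps
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))       # weight on the selected stump
        pred = pol * np.sign(F[:, k] - thr)
        pred[pred == 0] = pol
        w = w * np.exp(-alpha * y * pred)                       # reweight the data
        w /= w.sum()
        chosen.append((k, thr, pol, alpha))                     # combine with previous picks
    return chosen
```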

31 Tony Jebara, Columbia University Boosted Cascade of Stumps Reduction in Error as Boosting adds Classifiers

32 Tony Jebara, Columbia University Boosted Cascade of Stumps. First (i.e., best) two features learned by boosting.

33 Tony Jebara, Columbia University Boosted Cascade of Stumps Example training data

34 Tony Jebara, Columbia University Boosted Cascade of Stumps. To find faces, scan all sub-windows at different scales → slow. Boosting finds an ordering on the weak learners (best ones first). Idea: cascade the stumps to avoid too much computation!
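
A sketch of the cascade idea: each layer is a small boosted classifier with its threshold tuned for a very high detection rate, and a sub-window is rejected as soon as any layer says "not face", so most sub-windows are discarded after only a few cheap features. The layers and numbers below are illustrative placeholders:

```python
def cascade_classify(window_features, layers):
    """layers: list of (score_fn, threshold); reject early, accept only if every layer passes."""
    for score_fn, threshold in layers:
        if score_fn(window_features) < threshold:   # cheap layers first (best stumps first)
            return -1                               # early rejection: not a face
    return +1                                       # survived every layer: face

# Illustrative usage: each score_fn stands in for a small boosted sum of stumps.
layers = [
    (lambda f: 0.6 * f[0] - 0.2 * f[1], 0.0),                 # layer 1: two features
    (lambda f: 0.4 * f[2] + 0.3 * f[3] - 0.1 * f[4], 0.1),    # layer 2: a few more features
]
print(cascade_classify([1.0, 0.5, 0.2, 0.3, 0.1], layers))    # +1
```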

35 Tony Jebara, Columbia University Boosted Cascade of Stumps. Training time = weeks (with 5k faces and 9.5k non-faces). The final detector has 38 layers in the cascade and 6060 features. On a 700 MHz processor it can process a 384 x 288 image in 0.067 seconds (in 2003, when the paper was written).

36 Tony Jebara, Columbia University Boosted Cascade of Stumps Results

