Review of: Yoav Freund and Robert E. Schapire

Presentation transcript:

Review of: Yoav Freund and Robert E. Schapire, “A Short Introduction to Boosting” (1999); Michael Collins, “Discriminative Reranking for Natural Language Parsing”, ICML 2000. By Gabor Melli (melli@sfu.ca) for CMPT-825 @ SFU, Nov 21, 2003.

Presentation Overview. First paper: Boosting (example, AdaBoost algorithm). Second paper: Natural Language Parsing (reranking technique overview, boosting-based solution).

Review of Yoav Freund and Robert E. Schapire, “A Short Introduction to Boosting” (1999). By Gabor Melli (melli@sfu.ca) for CMPT-825 @ SFU, Nov 21, 2003.

What is Boosting? A method for improving classifier accuracy. Basic idea: perform an iterative search to locate the regions/examples that are more difficult to predict. Through each iteration, reward accurate predictions on those regions. Combine the rules from each iteration. It only requires that the underlying learning algorithm be better than guessing.

Example of a Good Classifier [Figure: scatter of ten training examples, five labeled + and five labeled -.]

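Round 1 of 3: weak hypothesis $h_1$, with $\epsilon_1 = 0.300$ and $\alpha_1 = 0.424$. [Figure: $h_1$ drawn over the training points, labeled $D_2$.]

Round 2 of 3: weak hypothesis $h_2$, with $\epsilon_2 = 0.196$ and $\alpha_2 = 0.704$. [Figure: $h_2$ drawn over the training points, labeled $D_2$.]

Round 3 of 3: weak hypothesis $h_3$, with $\epsilon_3 = 0.344$ and $\alpha_3 = 0.323$; boosting stops after this round. [Figure: $h_3$ drawn over the training points.]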

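Final Hypothesis: $H_{\text{final}}(x) = \operatorname{sign}[\,0.42\,h_1(x) + 0.70\,h_2(x) + 0.32\,h_3(x)\,]$, where each $h_t(x) \in \{-1, +1\}$. [Figure: the three weighted hypotheses combined into the final + / - regions.]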
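
The weights in the three rounds follow from the round errors through the standard AdaBoost rule, weight = ½ ln((1 - error) / error). The sketch below recomputes them and forms the weighted vote; the function names are illustrative assumptions, and the recomputed round-2 weight differs from the slide value in the third decimal place.

```python
import math

def hypothesis_weight(error):
    """AdaBoost weight for a weak hypothesis with weighted error in (0, 1)."""
    return 0.5 * math.log((1.0 - error) / error)

def final_hypothesis(weights, votes):
    """Sign of the weighted vote; each vote is +1 or -1."""
    total = sum(w * v for w, v in zip(weights, votes))
    return 1 if total >= 0 else -1

# Round errors quoted on the slides for rounds 1, 2, and 3.
errors = [0.300, 0.196, 0.344]
weights = [hypothesis_weight(e) for e in errors]
```

A point on which the first two hypotheses vote +1 and the third votes -1 is labeled +1, because the first two weights together outweigh the third.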

History of Boosting. "Kearns & Valiant (1989) proved that learners performing only slightly better than random, can be combined to form an arbitrarily good ensemble hypothesis." Schapire (1990) provided the first polynomial-time boosting algorithm. Freund (1995): “Boosting a weak learning algorithm by majority”. Freund & Schapire (1995): AdaBoost, which solved many practical problems of boosting algorithms. “Ada” stands for adaptive.

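AdaBoost. Given: $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$. Initialize $D_1(i) = 1/m$. For $t = 1$ to $T$: 1. Train learner $h_t$ with minimum error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$; the goodness of $h_t$ is calculated over $D_t$ and the bad guesses. 2. Compute the hypothesis weight $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$. The weight adapts: the bigger $\epsilon_t$ becomes, the smaller $\alpha_t$ becomes. 3. For each example $i = 1$ to $m$, set $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i h_t(x_i)) / Z_t$, which boosts the example if it was incorrectly predicted; $Z_t$ is a normalization factor. Output: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, a linear combination of models.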
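
A minimal Python sketch of the loop above, assuming NumPy and using one-feature threshold rules (decision stumps) as the weak learner. The stump learner and all function names are illustrative assumptions; the slides do not prescribe a particular weak learner.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict `polarity` where X[:, feature] <= threshold, else -polarity."""
    return np.where(X[:, feature] <= threshold, polarity, -polarity)

def train_stump(X, y, D):
    """Exhaustively pick the stump with the smallest D-weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, y, T):
    """Run T rounds; return a list of (weight, feature, threshold, polarity)."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # initialize D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        err, feature, threshold, polarity = train_stump(X, y, D)
        err = np.clip(err, 1e-12, 1 - 1e-12)     # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # hypothesis weight
        pred = stump_predict(X, feature, threshold, polarity)
        D = D * np.exp(-alpha * y * pred)        # boost misclassified examples
        D /= D.sum()                             # divide by normalizer Z_t
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted sum of the weak hypotheses."""
    score = sum(a * stump_predict(X, f, thr, p) for a, f, thr, p in ensemble)
    return np.sign(score)
```

On a one-dimensional toy set with points 0 to 5 labeled (+, +, -, -, +, +), no single stump is correct on every point, but three rounds of this sketch classify all six correctly.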

AdaBoost on our Example [Table: training data with columns for Initialization, Round 1, Round 2, and Round 3.]

The Example’s Search Space: $H_{\text{final}} = 0.42\,h_1(x) + 0.65\,h_2(x) + 0.92\,h_3(x)$, where each $h_t(x) \in \{-1, +1\}$. [Figure: the + and - regions of the search space.]

AdaBoost for Text Categorization

AdaBoost & Training Error Reduction. The most basic theoretical property of AdaBoost is its ability to reduce the training error of the final hypothesis $H$ (Freund & Schapire, 1995). The better $h_t$ predicts relative to random guessing, the faster the training error drops. If the error of $h_t$ is $\epsilon_t = \frac{1}{2} - \gamma_t$, the training error is at most $\prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$, so it drops exponentially fast.
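
As a rough illustration, the sketch below evaluates the training-error bound from Freund & Schapire, the product over rounds of 2·sqrt(error·(1 - error)), together with its looser exponential form exp(-2·Σ γ²), where γ = ½ - error. The function name is an assumption made for this example.

```python
import math

def training_error_bounds(errors):
    """Return (product bound, exponential bound) for the given round errors."""
    product = 1.0
    gamma_sq_sum = 0.0
    for eps in errors:
        product *= 2.0 * math.sqrt(eps * (1.0 - eps))   # per-round factor
        gamma_sq_sum += (0.5 - eps) ** 2                # edge over guessing
    return product, math.exp(-2.0 * gamma_sq_sum)

# Round errors from the three-round example earlier in the slides.
tight, loose = training_error_bounds([0.300, 0.196, 0.344])
```

Each round with an error below ½ multiplies the bound by a factor smaller than 1, which is why the bound shrinks as rounds are added.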

No Overfitting. A curious phenomenon: boosting is expected to overfit, but does not. For the graph: “Using <10,000 training examples we fit >2,000,000 parameters”. Expected to overfit: the first bound on the generalization error rate implies that overfitting may occur as T gets large. Does not: empirical results show the generalization error rate still decreasing after the training error has reached zero. This resistance is explained by the “margin” of error, though Grove and Schuurmans (1998) showed that the margin of error cannot be the explanation.

Accuracy Change per Round

Shortcomings. The actual performance of boosting depends on the data and the weak learner. Boosting can fail to perform well given: insufficient data; overly complex weak hypotheses; weak hypotheses that are too weak. It has been empirically shown to be especially susceptible to noise.

Areas of Research. Outliers: AdaBoost can identify them; in fact, it can be hurt by them. “Gentle AdaBoost” and “BrownBoost” de-emphasize outliers. Non-binary targets. Continuous-valued predictions.

References. Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999. Http://www.boosting.org

Margins and boosting. Boosting concentrates on the examples with the smallest margins and is aggressive at increasing the margins. Margins establish a strong connection between boosting and SVMs, which are known as an explicit attempt to maximize the minimum margin. See the experimental evidence (5, 100, and 1,000 iterations).

Cumulative Distr. of Margins Cumulative distribution of margins for the training sample after 5, 100, and 1,000 iterations.
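
A small sketch of how such a curve could be computed, assuming NumPy. It uses the normalized margin from Freund & Schapire, y·Σ α_t h_t(x) / Σ α_t, which lies in [-1, 1] and is positive exactly when the example is classified correctly. The function names and array layout are assumptions made for this example.

```python
import numpy as np

def margins(alphas, votes, y):
    """Normalized margins.

    alphas: (T,) non-negative hypothesis weights
    votes:  (T, m) array of +1/-1 predictions, one row per hypothesis
    y:      (m,) array of +1/-1 labels
    """
    alphas = np.asarray(alphas, dtype=float)
    votes = np.asarray(votes, dtype=float)
    y = np.asarray(y, dtype=float)
    return y * (alphas @ votes) / alphas.sum()

def cumulative_distribution(values, thresholds):
    """Fraction of values at or below each threshold."""
    values = np.asarray(values)
    return np.array([(values <= t).mean() for t in thresholds])
```

Plotting `cumulative_distribution(margins(...), thresholds)` against the thresholds after different numbers of rounds gives curves of the kind described on the slide.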

Review of Michael Collins, “Discriminative Reranking for Natural Language Parsing”, ICML 2000. By Gabor Melli (melli@sfu.ca) for CMPT-825 @ SFU, Nov 21, 2003.

Recall the Parsing Problem. Example sentences: “Green ideas sleep furiously.” “You looking at me?” “The blue jays flew.” “Can you parse me?”

Train a Supervised Learning Algorithm [Diagram: a supervised learning algorithm produces the model G().]

Recall Parse Tree Rankings [Figure: candidate parse trees produced by G() for “Can you parse this?”, annotated with trueScore() values 0.60, 0.90, 0.01 and Q() values 0.30, 0.05, 0.65.]

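Post-Analyze the G() Parses [Figure: candidate parses produced by G() for “Can you parse this?”, annotated with rerankScore() values 0.70, 0.55, 0.10; trueScore() values 0.90, 0.60, 0.01; and Q() values 0.65, 0.30, 0.05. The reranking function F() applies adjustments of +0.4 and -0.1.]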

Indicator Functions: 1 if x contains the rule <S → NP VP>, 0 otherwise; 1 if x contains …, 0 otherwise; and so on. 500,000 weak learners! AdaBoost was not expecting this many hypotheses. Fortunately, we can precalculate membership.
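
One way to picture the precalculation is an inverted index from each rule to the parses that contain it, so that evaluating an indicator becomes a set lookup. The sketch below is a hypothetical illustration, not Collins's implementation; the parse ids, the rule strings, and the function names are made up for the example.

```python
from collections import defaultdict

def build_membership(parses):
    """Map each rule to the set of parse ids that contain it.

    parses: dict mapping parse id -> set of rule strings found in that parse.
    """
    index = defaultdict(set)
    for parse_id, rules in parses.items():
        for rule in rules:
            index[rule].add(parse_id)
    return index

def indicator(index, rule, parse_id):
    """Return 1 if the parse contains the rule, else 0."""
    return 1 if parse_id in index.get(rule, ()) else 0

# Hypothetical toy data: two candidate parses and the rules they use.
parses = {
    "p1": {"S -> NP VP", "NP -> DT NN"},
    "p2": {"S -> VP"},
}
index = build_membership(parses)
```

Because each parse contains only a small fraction of all rules, the index stays far smaller than a dense parses-by-rules table.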

Ranking Function F(). Sample calculation for one sentence. How to infer an α that improves ranking accuracy? [Figure: old rank score versus new rank score, 0.55.]

Iterative Feature/Hypothesis Selection

Which feature to update per iteration? Which k* (and δ*) to pick? Upd(α, k, δ) updates feature k by weight δ, for example Upd(α, k=3, δ=0.60). Pick the one that minimizes error: test every combination of k and δ against every sentence.
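
The exhaustive search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: each candidate parse is a hypothetical `(base_score, feature_ids)` pair, the first candidate of every sentence is taken to be the correct parse, the error is the number of sentences whose correct parse is not ranked strictly first, and the candidate weights are a small fixed grid.

```python
def score(parse, alpha):
    """Base score plus the weights of the features active in the parse."""
    base, features = parse
    return base + sum(alpha[k] for k in features)

def misranked(sentences, alpha):
    """Count sentences where some other parse ties or beats the correct one."""
    return sum(
        1
        for candidates in sentences
        if any(score(c, alpha) >= score(candidates[0], alpha)
               for c in candidates[1:])
    )

def best_update(sentences, alpha, deltas):
    """Try every (feature, delta) pair; return (error, feature, delta)."""
    best = (misranked(sentences, alpha), None, 0.0)
    for k in range(len(alpha)):
        for delta in deltas:
            trial = list(alpha)
            trial[k] += delta
            err = misranked(sentences, trial)
            if err < best[0]:
                best = (err, k, delta)
    return best

# Hypothetical toy data: two sentences, correct parse listed first.
sentences = [
    [(0.30, {0}), (0.65, {1})],
    [(0.50, {0, 2}), (0.60, {2})],
]
result = best_update(sentences, [0.0, 0.0, 0.0], [-0.6, 0.6])
```

On this toy data both sentences start out misranked, and raising the weight of feature 0 by 0.6 ranks both correct parses first. Rescoring every sentence for every pair is what makes the search expensive.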

Find the best new hypothesis. Update each example’s weights. Commit the new hypothesis to the final H.

High-Accuracy

Time consuming to traverse the entire search space: O(m,i,j). Take advantage of the data sparsity: O(m,i,j) with a smaller constant.

References. M. Collins. Discriminative Reranking for Natural Language Parsing. In Machine Learning: Proceedings of the Seventeenth International Conference, ICML, 2000. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference, ICML, 1998.

Error Definition: find the α that minimizes the misranking of the top parse.