Data Mining and Machine Learning

Presentation transcript:

Data Mining and Machine Learning Boosting, bagging and ensembles. The good of the many outweighs the good of the one

Classifier 1, Classifier 2, Classifier 3 (each shown with a table of Actual Class vs Predicted Class, over classes A and B)

Classifier 4: an 'ensemble' of classifiers 1, 2 and 3, which predicts by majority vote (shown alongside the Actual Class vs Predicted Class tables of the three base classifiers)

Combinations of Classifiers
Usually called 'ensembles'. When each classifier is a decision tree, these are called 'decision forests'.
Things to worry about:
- How exactly to combine the predictions into one?
- How many classifiers?
- How to learn the individual classifiers?
A number of standard approaches ...

Basic approaches to ensembles:
- Simply averaging the predictions (or voting)
- 'Bagging': train lots of classifiers on randomly different versions of the training data, then basically average the predictions
- 'Boosting': train a series of classifiers, each one focussing more on the instances that the previous ones got wrong, then use a weighted average of the predictions
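To make the first of these concrete, here is a minimal Python sketch of combining an ensemble by simple voting. It is not from the slides; it assumes a list of already-trained classifiers with a scikit-learn-style predict() method.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine an ensemble by unweighted majority vote.
    `classifiers` is assumed to be a list of trained models, each with a
    scikit-learn-style predict() method (an assumption of this sketch,
    not something the slides specify)."""
    votes = [clf.predict([x])[0] for clf in classifiers]
    # The most frequently predicted label wins (ties broken arbitrarily).
    return Counter(votes).most_common(1)[0][0]
```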

What comes from the basic maths: simply averaging the predictions works best when your ensemble is full of fairly accurate classifiers that nevertheless disagree a lot (i.e. when they're wrong, they tend to be wrong about different instances). Given the above, in theory you can get 100% accuracy with enough of them. But how much do you expect 'the above' to be given? And what about overfitting?
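That intuition can be checked with a short calculation. The sketch below (my own illustration, not from the slides) treats the classifiers as independent 70%-accurate voters and shows the majority vote's accuracy climbing towards 100% as the ensemble grows:

```python
from math import comb

def majority_correct_prob(n, p):
    """Probability that a majority of n independent classifiers, each
    correct with probability p, predicts the right class. Independence is
    the idealised 'they disagree a lot' assumption, rarely true in practice."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 21, 101):
    print(n, round(majority_correct_prob(n, 0.7), 4))
# prints roughly 0.7, 0.84, 0.97, 1.0: the vote gets better as n grows
```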

Bagging

Bootstrap aggregating

Bootstrap aggregating: a new version of the dataset is made by random resampling with replacement. (Example table with columns Instance, P34 level and Prostate cancer, shown alongside a bootstrapped version in which some instances appear more than once and others are left out.)

Bootstrap aggregating: generate a collection of bootstrapped versions of the original dataset ... (the same Instance / P34 level / Prostate cancer table, resampled several times)

Bootstrap aggregating: learn a classifier from each individual bootstrapped dataset

Bootstrap aggregating The ‘bagged’ classifier is the ensemble, with predictions made by voting or averaging
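A minimal sketch of the whole bagging procedure in Python (illustrative only: the function names are mine, and the base learner here is a scikit-learn decision tree, used just as an example of an unstable learner):

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # any 'unstable' base learner

def bag(X, y, n_models=25, seed=0):
    """Train a bagged ensemble: bootstrap-resample (X, y) n_models times,
    fitting one decision tree per resampled dataset."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        models.append(DecisionTreeClassifier().fit(Xb, yb))
    return models

def bagged_predict(models, x):
    """The 'bagged' classifier: predict by majority vote over the trees."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```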

BAGGING ONLY WORKS WITH ‘UNSTABLE’ CLASSIFIERS

Unstable? The decision surface can be very different each time: e.g. a neural network trained on the same data could produce any of several quite different boundaries. (Illustration: the same set of points labelled A and B, separated by two very different decision surfaces.) The same holds for decision trees, Naive Bayes, ..., but not for KNN.

Example improvements from bagging www.csd.uwo.ca/faculty/ling/cs860/papers/mlj-randomized-c4.pdf

Example improvements from bagging Bagging improves over straight C4.5 almost every time (30 out of 33 datasets in this paper)

Kinect uses bagging

Depth feature / decision trees: each tree node tests a 'depth difference feature', e.g. the branches may be θ1 < 4.5 and θ1 >= 4.5; each leaf is a distribution over body-part labels.

The classifier Kinect uses (in real time, of course) is an ensemble of (possibly 3) decision trees; each with depth ~20; each trained on a separate collection of ~1M depth images with labelled body parts; the body-part classification is made by simply averaging over the tree results and then taking the most likely body part.
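As a hedged sketch of that final averaging step (the tree structure and the helper name leaf_distribution are illustrative placeholders, not the actual Kinect code):

```python
def classify_body_part(trees, pixel_features):
    """Average the per-tree distributions over body-part labels, then take
    the most likely part. `tree.leaf_distribution(features)` stands in for
    the real depth-feature tree traversal and is a hypothetical helper."""
    n_labels = len(trees[0].leaf_distribution(pixel_features))
    avg = [0.0] * n_labels
    for tree in trees:
        for k, p in enumerate(tree.leaf_distribution(pixel_features)):
            avg[k] += p / len(trees)
    return max(range(n_labels), key=avg.__getitem__)   # most likely body part
```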

Boosting

Boosting: learn Classifier 1. (Table of Instance, Actual Class and Predicted Class for instances 1-5, with classes A and B.)

Boosting: Classifier 1 has been learned; call it C1.

Boosting: assign a weight to Classifier 1: C1, W1 = 0.69.

Boosting: construct a new dataset that gives more weight to the instances C1 misclassified last time.

Boosting: learn Classifier 2 (C2) on the reweighted dataset.

Boosting: get a weight for Classifier 2: C2, W2 = 0.35.

Boosting: construct a new dataset with more weight on the instances C2 gets wrong ...

Boosting: learn Classifier 3 (C3).

Boosting: and so on ... maybe 10 or 15 times.

The resulting ensemble classifier: C1 (W1 = 0.69), C2 (W2 = 0.35), C3 (W3 = 0.8), C4 (W4 = 0.2), C5 (W5 = 0.9).

The resulting ensemble classifier is given a new, unclassified instance: C1 (W1 = 0.69), C2 (W2 = 0.35), C3 (W3 = 0.8), C4 (W4 = 0.2), C5 (W5 = 0.9).

Each weak classifier makes a prediction for the new instance: C1 → A, C2 → A, C3 → B, C4 → A, C5 → B.

Use the weights to add up the votes: A gets 0.69 + 0.35 + 0.2 = 1.24, B gets 0.8 + 0.9 = 1.7. Predicted class: B.
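The same vote can be reproduced in a few lines of Python (the classifier names, weights and predictions are taken straight from the slides):

```python
weights     = {'C1': 0.69, 'C2': 0.35, 'C3': 0.8, 'C4': 0.2, 'C5': 0.9}
predictions = {'C1': 'A',  'C2': 'A',  'C3': 'B', 'C4': 'A', 'C5': 'B'}

totals = {}
for clf, label in predictions.items():
    totals[label] = totals.get(label, 0.0) + weights[clf]

print(totals)                       # A gets about 1.24, B gets about 1.7
print(max(totals, key=totals.get))  # predicted class: B
```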

Some notes: the individual classifiers in each round are called 'weak classifiers'. Unlike bagging or basic ensembling, boosting can work quite well with 'weak' or inaccurate classifiers. The classic (and very good) boosting algorithm is 'AdaBoost' (Adaptive Boosting).

Original AdaBoost / basic details: assumes 2-class data and calls the classes −1 and 1. Each round, it changes the weights of instances (equivalent(ish) to making different numbers of copies of different instances). The prediction is a weighted sum of the classifiers: if the weighted sum is positive, the prediction is 1, else −1.
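In this ±1 formulation the final prediction is just the sign of the weighted sum. A minimal sketch, assuming each weak classifier is a callable returning −1 or 1:

```python
def adaboost_predict(classifiers, weights, x):
    """Sign of the weighted sum of weak-classifier outputs (each -1 or 1).
    A sketch: `clf(x)` is assumed to be a callable returning -1 or 1."""
    s = sum(w * clf(x) for clf, w in zip(classifiers, weights))
    return 1 if s > 0 else -1
```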

Boosting: assign a weight to Classifier 1 (C1, W1 = 0.69), as before.

Boosting: the weight of the classifier is always ½ ln((1 - error) / error). (Assigning the weight to Classifier 1: C1, W1 = 0.69.)

AdaBoost: the weight of the classifier is always ½ ln((1 - error) / error). Here, for example, the error is 1/5 = 0.2, so W1 = ½ ln(0.8 / 0.2) ≈ 0.69.
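A quick check of that number, using the formula from the slide (the function name is mine):

```python
from math import log

def classifier_weight(error):
    """AdaBoost classifier weight: 0.5 * ln((1 - error) / error)."""
    return 0.5 * log((1 - error) / error)

print(round(classifier_weight(0.2), 2))   # 0.69, matching W1 on the slides
```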

AdaBoost: constructing the next dataset from the previous one

AdaBoost: constructing the next dataset from the previous one. Each instance i has a weight D(i, t) in round t. D(i, 1) is always 1/(number of instances), and the D(i, t) are always normalised so they add up to 1. Think of D(i, t) as a probability: in each round, you can build the new dataset by choosing instances (with replacement) according to this probability.

AdaBoost: constructing the next dataset from the previous one. D(i, t+1) depends on three things:
- D(i, t), the weight of instance i last time
- whether or not instance i was correctly classified last time
- w(t), the weight that was worked out for classifier t

AdaBoost: constructing the next dataset from the previous one. D(i, t+1) is:
- D(i, t) × e^(−w(t)) if instance i was correct last time
- D(i, t) × e^(w(t)) if it was incorrect last time
(When this is done for each i, the weights won't add up to 1, so we just normalise them.)
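Put together, one re-weighting step looks roughly like this (a sketch in the slides' notation; the function name is mine):

```python
from math import exp

def next_instance_weights(D, correct, w_t):
    """Compute D(i, t+1) from D(i, t): multiply by e^(-w_t) for instances
    classified correctly in round t, by e^(w_t) for those misclassified,
    then normalise so the new weights sum to 1."""
    new_D = [d * exp(-w_t if ok else w_t) for d, ok in zip(D, correct)]
    total = sum(new_D)
    return [d / total for d in new_D]

# Example: 5 instances with uniform weight 0.2, instance 3 misclassified,
# classifier weight w_t = 0.69 (as on the earlier slides). The misclassified
# instance ends up carrying roughly half of the total weight.
print(next_instance_weights([0.2] * 5, [True, True, False, True, True], 0.69))
```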

Why those specific formulas for the classifier weights and the instance weights?

Why those specific formulas for the classifier weights and the instance weights? Well, in brief ... Given that you have a set of classifiers with different weights, what you want to do is maximise Σ_i y_i Σ_c w(c) pred(c, i), where y_i is the actual class and pred(c, i) is the predicted class of instance i from classifier c, whose weight is w(c). Recall that classes are either −1 or 1, so when instance i is predicted correctly the contribution is always positive, and when incorrect the contribution is negative.

Why those specific formulas for the classifier weights and the instance weights? Maximising that is the same as minimising the exponential loss Σ_i exp(−y_i Σ_c w(c) pred(c, i)). Having expressed it in that particular way, some mathematical gymnastics can be done, which ends up showing that an appropriate way to change the classifier and instance weights is what we saw on the earlier slides.
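For readers who want the 'gymnastics' spelled out, here is a compressed version of the standard exponential-loss argument (a sketch of the usual textbook derivation, not reproduced from the slides):

```latex
% Write f(x_i) = \sum_c w(c)\,\mathrm{pred}(c, i) for the weighted vote.
% When choosing the weight w(t) of the newest classifier t, the loss
% L = \sum_i \exp(-y_i f(x_i)) factorises so that, up to a constant,
% only round t matters:
\begin{align*}
  L \;\propto\; \sum_i D(i,t)\,\exp\bigl(-y_i\, w(t)\,\mathrm{pred}(t, i)\bigr)
    \;=\; (1-\varepsilon_t)\,e^{-w(t)} \;+\; \varepsilon_t\,e^{w(t)},
\end{align*}
% where \varepsilon_t is the weighted error of classifier t.
% Setting dL/dw(t) = 0 gives
\begin{align*}
  w(t) \;=\; \tfrac{1}{2}\,\ln\frac{1-\varepsilon_t}{\varepsilon_t},
\end{align*}
% and the per-instance factors e^{-w(t)} (correct) and e^{w(t)} (incorrect)
% are exactly the D(i,t) update from the earlier slides.
```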

Further details: Original adaboost paper: http://www.public.asu.edu/~jye02/CLASSES/Fall-2005/PAPERS/boosting-icml.pdf A tutorial on boosting: http://www.cs.toronto.edu/~hinton/csc321/notes/boosting.pdf

How good is AdaBoost?

Usually better than bagging. Almost always better than not doing anything. Used in many real applications, e.g. the Viola-Jones face detector, which is used in many real-world surveillance applications.

Viola-Jones face detector http://www.ipol.im/pub/art/2014/104/

Viola-Jones face detector (a sequence of slides showing example face detections)

The Viola-Jones detector is a cascade of simple 'decision stumps': C1 (W1 = 0.69), C2 (W2 = 0.35), C3 (W3 = 0.8), ..., up to roughly C40, each comparing a simple feature against a threshold (e.g. < 0.7, < 0.8, > 1.4, < 0.3).
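As a rough sketch of how such a cascade is applied to one candidate image window (illustrative only: the names, thresholds and stage structure are placeholders rather than the real Viola-Jones implementation):

```python
def cascade_detect(window, stages):
    """Run a cascade of simple stage classifiers over one image window.
    `stages` is a list of (score_fn, threshold) pairs; score_fn(window)
    returns a number (e.g. a weighted sum of stump outputs). The window is
    rejected as soon as any stage scores below its threshold, so most
    non-face windows are discarded after only a stage or two."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early: not a face
    return True                   # survived every stage: report a face
```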
