Chapter 9 Given a range of classification algorithms, which is the best? Some algorithms may be preferred because of their low complexity, their ability to incorporate prior knowledge, … Principle of Occam's Razor: given two classifiers that perform equally well on the training set, the simpler classifier is asserted to do better on the test set. This chapter focuses on mathematical foundations that do not depend on a particular classifier or learning algorithm: the bias and variance dilemma, ensembles of classifiers (classifier combination), cross-validation, and resampling.

No Free Lunch Theorem Suppose we make no prior assumptions about the nature of the classification task. Can we expect any classification method to be superior or inferior overall? The No Free Lunch Theorem answers this question: NO. If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one algorithm over others. If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to that particular pattern recognition problem. For a new classification problem, what matters most is prior information, the data distribution, the size of the training set, and the cost function.

No Free Lunch Theorem It is the assumptions about the learning algorithm that are important Even popular algorithms will perform poorly on some problems, where the learning algorithm and data distribution do not match well In practice, experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems

Ugly Duckling Theorem In the absence of assumptions, there is no best feature representation for all problems. The similarity between patterns is fundamentally based on implicit assumptions about the problem domain. Consider the example where features x and y represent blind_in_right_eye and blind_in_left_eye, respectively. If we base similarity on shared features, person P1 = {1, 0} (blind only in the right eye) is maximally different from person P2 = {0, 1} (blind only in the left eye). In this scheme P1 is more similar to a totally blind person and to a normally sighted person than he is to P2! There may be situations where we want P1 to be more similar to P2; for example, both may still be able to drive an automobile!

Bias and Variance There is no "best classifier" in general, hence the necessity of exploring a variety of methods. How do we evaluate whether the learning algorithm "matches" the classification problem? Bias measures the quality of the match: high bias implies a poor match. Variance measures the specificity of the match: high variance implies a weak match. Bias and variance are not independent of each other.

Bias and Variance Given the true function F(x) and a function g(x; D) estimated from a training set D, the estimate g depends on the particular training set: each training set gives a different fit, and hence a different error. Taking the average over all training sets D of size n, the mean-squared error measures the average error that g(x; D) makes in fitting F(x), and it decomposes into a squared bias term (the difference between the expected value of the estimate and the true value) plus a variance term (the difference between the observed value and its expected value). Low bias: on average, we accurately estimate F from D. Low variance: the estimate of F does not change much with different D.
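The equation itself is not reproduced in the transcript; a reconstruction of the standard decomposition, consistent with the definitions above, is:

```latex
\mathcal{E}_D\!\left[\left(g(\mathbf{x};D)-F(\mathbf{x})\right)^2\right]
 = \underbrace{\left(\mathcal{E}_D\!\left[g(\mathbf{x};D)\right]-F(\mathbf{x})\right)^2}_{\text{bias}^2}
 + \underbrace{\mathcal{E}_D\!\left[\left(g(\mathbf{x};D)-\mathcal{E}_D\!\left[g(\mathbf{x};D)\right]\right)^2\right]}_{\text{variance}}
```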

Bias-Variance Dilemma in Regression [Figure: each row is a different dataset of 6 points; each column is a different model; histograms show the mean-squared error of each fit.] Col 1: poor fixed linear model; high bias, zero variance. Col 2: slightly better fixed linear model; lower (but still high) bias, zero variance. Col 3: learned cubic model; low bias, moderate variance. Col 4: learned linear model; intermediate bias and variance.

Bias Variance Dilemma Procedures with increased flexibility to adapt to the training data (a large number of parameters) have lower bias but higher variance: they fit the data well, giving low bias but high variance. Inflexible procedures (fewer parameters) have higher bias but lower variance: they may not fit the data well, giving high bias but low variance. A large amount of training data generally helps improve the performance of estimation, provided the model is sufficiently general to represent the target function. Bias/variance considerations recommend that we gather as much prior information about the problem as possible, to find the best match for the classifier, and as large a dataset as possible, to reduce the variance. We can virtually never get zero bias and zero variance.
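A small simulation makes the trade-off concrete. This is an illustrative sketch, not from the slides: the target function, sample sizes, and polynomial degrees are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)      # assumed true function F(x)
x_test = np.linspace(0.0, 1.0, 50)

def fit_many(degree, n_train=6, n_datasets=500, noise=0.3):
    """Fit a polynomial of the given degree to many training sets D
    and return the predictions g(x_test; D) for each D."""
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x = rng.uniform(0.0, 1.0, n_train)
        y = F(x) + rng.normal(0.0, noise, n_train)
        preds[d] = np.polyval(np.polyfit(x, y, degree), x_test)
    return preds

for degree in (1, 5):                     # rigid vs. flexible model
    g = fit_many(degree)
    bias2 = np.mean((g.mean(axis=0) - F(x_test)) ** 2)  # squared bias
    var = np.mean(g.var(axis=0))                        # variance
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

The rigid linear model shows large bias and small variance; the flexible degree-5 model shows small bias but a variance that explodes across training sets.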

Bias and Variance for Classification Low variance is more important for accurate classification than low boundary bias. Classifiers with large flexibility to adapt to the training data (more free parameters) tend to have low bias but high variance. [Figure: a 2-class problem with 2D Gaussian distributions and diagonal covariances; a small number of training samples is used to estimate the parameters of 3 different models.] For the best classification given a small training set, we need to match the model to the true distributions.

Ensemble-based Systems in Decision Making For many tasks, we often seek a second opinion before making a decision, sometimes many more: consulting different doctors before a major surgery, reading reviews before buying a product, requesting references before hiring someone. We consider the decisions of multiple experts in our daily lives, so why not follow the same strategy in automated decision making? Such systems are known as multiple classifier systems, committees of classifiers, mixtures of experts, or ensemble-based systems. Polikar R., "Ensemble Based Systems in Decision Making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21-45, 2006. Polikar R., "Bootstrap Inspired Techniques in Computational Intelligence," IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 56-72, 2007. Polikar R., "Ensemble Learning," Scholarpedia, 2008.

Ensemble-based Classifiers Ensemble-based systems provide favorable results compared to single-expert systems for a broad range of applications and under a variety of scenarios. Two key questions: (i) how to generate the individual components of the ensemble (the base classifiers), and (ii) how to combine the outputs of the individual classifiers? Popular ensemble-based algorithms: bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts. Commonly used combination rules: algebraic combination of outputs, voting methods, behavior knowledge space, and decision templates.

Why Ensemble Based Systems? Statistical reasons A set of classifiers with similar training performances may have different generalization performances Combining outputs of several classifiers reduces the risk of selecting a poorly performing classifier Large volumes of data If the amount of data to be analyzed is too large, a single classifier may not be able to handle it; train different classifiers on different partitions of data Too little data Ensemble systems can also be used when there is too little data; resampling techniques

Why Ensemble Based Systems? Divide and Conquer Divide data space into smaller & easier-to-learn partitions; each classifier learns only one of the simpler partitions

Why Ensemble Based Systems? Data Fusion Given several sets of data from various sources, where the nature of the features differs (heterogeneous features), training a single classifier may not be appropriate (e.g., MRI data, EEG recordings, blood tests, …). Applications in which data from different sources are combined are called data fusion applications, and ensembles have been used successfully for fusion. All ensemble systems must have two key components: a method for generating the component classifiers of the ensemble, and a method for combining the classifier outputs.

Brief History of Ensemble Systems Dasarathy and Sheela (1979) partitioned the feature space using two or more classifiers. Schapire (1990) proved that a strong classifier can be generated by combining weak classifiers through boosting, the predecessor of the AdaBoost algorithm. Two types of combination: classifier selection, where each classifier is trained to become an expert in some local area of the feature space and one or more local experts can be nominated to make the decision; and classifier fusion, where all classifiers are trained over the entire feature space and fusion involves merging the individual (weaker) classifiers to obtain a single (stronger) expert of superior performance.

Diversity of Ensemble Objective: create many classifiers, and combine their outputs so as to improve upon the performance of a single classifier. Intuition: if each classifier makes different errors, then a strategic combination of them can reduce the total error! We need base classifiers whose decision boundaries are adequately different from those of the others; such a set of classifiers is said to be diverse. How do we achieve classifier diversity? Use different training sets to train the individual classifiers. How do we obtain different training sets? Resampling techniques such as bootstrapping or bagging, in which training subsets are drawn randomly, usually with replacement, from the entire training set.

Sampling with Replacement Random & overlapping training sets to train three classifiers; they are combined to obtain a more accurate classification

Sampling without Replacement Jackknife or k-fold data split: the entire dataset is split into k blocks; each classifier is trained on a different subset of k − 1 blocks, as sketched below.
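A minimal sketch of this split; `train_classifier` is a hypothetical stand-in for any base learner:

```python
import numpy as np

def kfold_training_sets(X, y, k, seed=None):
    """Yield k training sets, each leaving out one of k disjoint blocks."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    blocks = np.array_split(idx, k)           # k disjoint blocks
    for held_out in range(k):
        train_idx = np.concatenate(
            [blocks[b] for b in range(k) if b != held_out])
        yield X[train_idx], y[train_idx]

# Hypothetical usage:
# ensemble = [train_classifier(Xt, yt)
#             for Xt, yt in kfold_training_sets(X, y, k=5)]
```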

Other Approaches to Achieve Diversity Use different training parameters for a classifier: a series of MLPs can be trained using different weight initializations, numbers of layers/nodes, etc. Adjusting these parameters controls the instability of such classifiers (local minima). A similar strategy can be used to generate different decision trees for the same problem. Different types of classifiers (MLPs, decision trees, nearest-neighbor classifiers, SVMs) can also be combined for added diversity. Diversity can also be achieved by using random feature subsets, the so-called random subspace method.

Creating An Ensemble Two questions: How will the individual classifiers be generated? How will they differ from each other? The answers determine the diversity of the classifiers and the performance of the fusion. We seek to improve ensemble diversity with heuristic methods.

Bagging Bagging, short for bootstrap aggregating, is one of the earliest ensemble-based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance. Bagging uses bootstrapped replicas of the training data: a large number of training subsets (say, 200) are randomly drawn, with replacement, from the entire training set. Each resampled training set is used to train a different classifier of the same type, and the individual classifiers are combined by taking a majority vote of their decisions. Bagging is especially appealing when the training set is small, since a relatively large portion of the samples is included in each subset.
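A minimal bagging sketch under stated assumptions: scikit-learn decision trees as the base classifier, and integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_classifiers=200, seed=None):
    rng = np.random.default_rng(seed)
    ensemble = []
    n = len(X)
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)       # bootstrap: with replacement
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        ensemble.append(clf)
    return ensemble

def bagging_predict(ensemble, X):
    votes = np.stack([clf.predict(X) for clf in ensemble]).astype(int)
    # majority vote over the T classifiers, column by column
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, votes)
```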

Bagging

Variations of Bagging Random forests are so called because they are constructed from decision trees. A random forest is created from individual decision trees whose training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.

Boosting Boosting raises the performance of a weak learner to the level of a strong one. It creates an ensemble of classifiers by resampling the data, and the classifiers are combined by majority voting; the resampling is strategically geared to provide the most informative training data for each consecutive classifier. Boosting creates three weak classifiers (a sketch follows below): the first classifier C1 is trained with a random subset of the available training data; the training set for the second classifier C2 is chosen as the most informative subset given C1, so that half of the training data for C2 is correctly classified by C1 and the other half is misclassified by C1; the third classifier C3 is trained on instances on which C1 and C2 disagree.
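A sketch of this three-classifier scheme under stated assumptions: `WeakLearner` is a hypothetical base learner with scikit-learn-style fit/predict, the subset sizes are arbitrary choices for illustration, and C1 is assumed to make at least some mistakes.

```python
import numpy as np

def boost_three(X, y, WeakLearner, seed=None):
    rng = np.random.default_rng(seed)
    n = len(X)
    # C1: train on a random subset of the data
    idx1 = rng.choice(n, size=n // 3, replace=False)
    c1 = WeakLearner().fit(X[idx1], y[idx1])

    # C2: half correctly classified by C1, half misclassified by C1
    pred1 = c1.predict(X)
    right = np.flatnonzero(pred1 == y)
    wrong = np.flatnonzero(pred1 != y)       # assumed non-empty
    m = min(len(right), len(wrong))
    idx2 = np.concatenate([rng.choice(right, m, replace=False),
                           rng.choice(wrong, m, replace=False)])
    c2 = WeakLearner().fit(X[idx2], y[idx2])

    # C3: instances on which C1 and C2 disagree
    idx3 = np.flatnonzero(c1.predict(X) != c2.predict(X))
    c3 = WeakLearner().fit(X[idx3], y[idx3])
    return c1, c2, c3

def boost_predict(classifiers, X):
    votes = np.stack([c.predict(X) for c in classifiers]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```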

Boosting

AdaBoost AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems. AdaBoost generates a set of hypotheses (classifiers) and combines them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are generated by training a weak classifier on samples drawn from an iteratively updated distribution over the training set. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier, so consecutive classifiers are trained on increasingly hard-to-classify samples.

AdaBoost AdaBoost maintains a weight distribution D_t(i) on the training instances x_i, i = 1, …, N, from which the training data subset S_t is drawn for each consecutive classifier (hypothesis) h_t. Let ε_t be the weighted error of h_t; a normalized error is then obtained as β_t = ε_t / (1 − ε_t), so that 0 < ε_t < 1/2 gives 0 < β_t < 1. Distribution update rule: the distribution weights of those instances that are correctly classified by the current hypothesis are reduced by a factor of β_t, whereas the weights of the misclassified instances are unchanged; after renormalization, AdaBoost thus raises the relative weights of instances misclassified by h_t and lowers the weights of correctly classified instances, focusing on increasingly difficult instances. Once trained, AdaBoost classifies unlabeled test instances using weighted majority voting, unlike bagging or boosting: 1/β_t is a measure of the performance of the t-th hypothesis, and log(1/β_t) can be used to weight the classifiers.
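A compact AdaBoost.M1 sketch following this description. Assumptions: `WeakLearner` is a hypothetical base learner whose fit() accepts per-instance sample weights (as scikit-learn estimators do), which stands in for explicitly drawing the subsets S_t from D_t.

```python
import numpy as np

def adaboost_m1(X, y, WeakLearner, T=50):
    n = len(X)
    D = np.full(n, 1.0 / n)              # weight distribution D_t(i)
    hypotheses, betas = [], []
    for t in range(T):
        h = WeakLearner().fit(X, y, sample_weight=D)
        miss = h.predict(X) != y
        eps = D[miss].sum()              # weighted error of h_t
        if eps == 0 or eps >= 0.5:       # weak-learning condition violated
            break
        beta = eps / (1.0 - eps)         # normalized error, 0 < beta < 1
        D[~miss] *= beta                 # shrink weights of correct instances
        D /= D.sum()                     # renormalize the distribution
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas

def adaboost_predict(hypotheses, betas, X, classes):
    # weighted majority voting with classifier weights log(1/beta_t)
    scores = np.zeros((len(X), len(classes)))
    for h, beta in zip(hypotheses, betas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += np.log(1.0 / beta)
    return np.asarray(classes)[scores.argmax(axis=1)]
```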

AdaBoost.M1

AdaBoost.M1 The AdaBoost algorithm is sequential: classifier C_(k-1) is created before classifier C_k.

Performance of AdaBoost In most practical cases, the ensemble error decreases very rapidly in the first few iterations, and approaches zero or stabilizes as new classifiers are added. AdaBoost does not seem to be affected by overfitting, a phenomenon explained by margin theory.

Stacked Generalization An ensemble of classifiers is first created, whose outputs are used as inputs to a second-level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes. Classifiers C1, …, CT are trained using training parameters θ1 through θT to output hypotheses h1 through hT. The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier, CT+1.
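A stacking sketch under stated assumptions: scikit-learn-style base learners, a logistic-regression meta-classifier, and out-of-fold predictions for the meta-level training pairs (a common safeguard, not specified on the slide).

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def stack_fit(X, y, base_learners, meta_learner=None):
    meta_learner = meta_learner or LogisticRegression()
    # out-of-fold outputs avoid leaking training labels to the meta level
    Z = np.column_stack([
        cross_val_predict(clf, X, y, cv=5) for clf in base_learners])
    for clf in base_learners:
        clf.fit(X, y)              # refit each base learner on all the data
    meta_learner.fit(Z, y)         # learn the mapping outputs -> true class
    return base_learners, meta_learner

def stack_predict(base_learners, meta_learner, X):
    Z = np.column_stack([clf.predict(X) for clf in base_learners])
    return meta_learner.predict(Z)
```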

Mixture-of-Experts A conceptually similar technique is the mixture-of-experts model, where a set of classifiers C1, …, CT constitutes the ensemble, followed by a second-level component that assigns weights to the combiner. The combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all. The weight distribution used by the combiner is determined by a second-level classifier, usually a neural network, called the gating network. The inputs to the gating network are the actual training data instances themselves (unlike stacked generalization, where the second level sees the outputs of the first-level classifiers). Mixture-of-experts can therefore be seen as a classifier selection algorithm: individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x.

Mixture of Experts The pooling system may use the weights in several different ways: it may choose the single classifier with the highest weight, or it may calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.
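A sketch of these two pooling rules, assuming the per-instance expert weights come from an already-trained gating network (not shown):

```python
import numpy as np

def pool_winner_takes_all(expert_outputs, weights):
    """Pick the single expert with the highest gating weight.
    expert_outputs: (T, C) class scores from T experts; weights: (T,)."""
    best_expert = np.argmax(weights)
    return int(np.argmax(expert_outputs[best_expert]))

def pool_weighted_sum(expert_outputs, weights):
    """Weighted sum of expert outputs per class; pick the top class."""
    combined = weights @ expert_outputs      # (C,) weighted class scores
    return int(np.argmax(combined))
```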

Combining Classifiers How do we combine classifiers? Combination rules can be grouped as (i) trainable vs. non-trainable. Trainable rules: the parameters of the combiner, called weights, are determined through a separate training algorithm; such weights are usually instance specific, and these rules are also called dynamic combination rules. Non-trainable rules: the combination parameters become available as the classifiers are generated; weighted majority voting is an example. (ii) Combination rules for class labels vs. class-specific continuous outputs. Combination rules that apply to class labels need only the classification decision (that is, one of ω_j, j = 1, …, C); other rules need the continuous-valued outputs of the individual classifiers.

Combining Class Labels Assume that only class labels are available from the classifier outputs. Define the decision of the t-th classifier as d_t,j ∈ {0, 1}, t = 1, …, T and j = 1, …, C, where T is the number of classifiers and C is the number of classes; if the t-th classifier chooses class ω_j, then d_t,j = 1, and 0 otherwise. Majority voting: choose the class that collects the most votes, i.e., class ω_J such that Σ_t d_t,J = max_j Σ_t d_t,j. Weighted majority voting: each classifier's vote is weighted, typically by its performance. Behavior Knowledge Space (BKS): a lookup table indexed by the combination of the classifiers' decisions. Borda count: each voter (classifier) rank-orders the candidates (classes); if there are N candidates, the first-place candidate receives N − 1 votes, the second-place candidate receives N − 2, with the candidate in i-th place receiving N − i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision. A sketch of majority voting and Borda count follows below.
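A minimal sketch of two of these rules, assuming integer class indices; for the Borda count, `rankings` would come from classifiers able to rank-order the classes (e.g., by estimated posterior):

```python
import numpy as np

def majority_vote(decisions):
    """decisions: (T, C) 0/1 matrix d[t, j]; returns winning class index."""
    return int(np.argmax(decisions.sum(axis=0)))

def borda_count(rankings, n_classes):
    """rankings: (T, C) array; rankings[t] lists class indices from first
    place to last for classifier t. First place earns C - 1 votes and the
    i-th place earns C - i votes."""
    scores = np.zeros(n_classes)
    for rank_order in rankings:
        for place, cls in enumerate(rank_order):   # place 0 = first place
            scores[cls] += n_classes - 1 - place   # i.e., C - i votes
    return int(np.argmax(scores))
```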

Combining Continuous Outputs Algebraic combiners, where d_t,j(x) is the continuous output (support) of the t-th classifier for class ω_j: Mean rule: μ_j(x) = (1/T) Σ_t d_t,j(x). Weighted average: μ_j(x) = Σ_t w_t d_t,j(x). Minimum/maximum/median rule: take the minimum, maximum, or median of d_t,j(x) over t. Product rule: μ_j(x) = Π_t d_t,j(x). Generalized mean: μ_j(x) = ((1/T) Σ_t d_t,j(x)^α)^(1/α). Many of the above rules are in fact special cases of the generalized mean: α → −∞ gives the minimum rule; α → +∞ gives the maximum rule; α → 0 gives the geometric mean; α = 1 gives the mean rule.
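A sketch of the generalized mean combiner with a tiny assumed example (three classifiers, three classes; the support values are made up for illustration):

```python
import numpy as np

def generalized_mean(outputs, alpha):
    """outputs: (T, C) matrix of supports d_{t,j}; returns (C,) scores.
    alpha -> -inf: minimum rule; alpha -> +inf: maximum rule;
    alpha -> 0: geometric mean; alpha = 1: mean rule."""
    if alpha == 0:                                 # geometric-mean limit
        return np.exp(np.log(outputs).mean(axis=0))
    return np.mean(outputs ** alpha, axis=0) ** (1.0 / alpha)

outputs = np.array([[0.7, 0.2, 0.1],    # classifier 1
                    [0.5, 0.3, 0.2],    # classifier 2
                    [0.6, 0.1, 0.3]])   # classifier 3
for alpha in (-50, 0, 1, 50):           # ~min, geometric, mean, ~max
    mu = generalized_mean(outputs, alpha)
    print(alpha, mu.round(3), "-> class", int(mu.argmax()))
```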

Combining Classifier Outputs

Conclusions Ensemble systems are useful in practice, and the diversity of the base classifiers is important. Ensemble generation techniques: bagging, AdaBoost, mixture of experts. Classifier combination strategies: algebraic combiners, voting methods, and decision templates. No single ensemble generation algorithm or combination rule is universally better than the others; effectiveness on real-world data depends on classifier diversity and on the characteristics of the data.