
1
Chapter 9
Given a range of classification algorithms, which is the best? Some algorithms may be preferred because of their low complexity, ability to incorporate prior knowledge, etc.
Principle of Occam's Razor: given two classifiers that perform equally well on the training set, the simpler classifier may do better on the test set.
This chapter focuses on mathematical foundations that do not depend on a particular classifier or learning algorithm:
– Bias and variance dilemma
– Ensembles of classifiers (classifier combination)
– Cross validation
– Resampling

2
No Free Lunch Theorem
Suppose we make no prior assumptions about the nature of the classification task. Can we expect any classification method to be superior or inferior overall?
No Free Lunch Theorem: the answer is NO.
– If the goal is to obtain good generalization performance, there is no context-independent or usage-independent reason to favor one algorithm over others.
– If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to that particular pattern recognition problem.
For a new classification problem, what matters most: prior information, data distribution, size of the training set, and the cost function.

3
No Free Lunch Theorem
It is the assumptions about the learning algorithm that are important.
Even popular algorithms will perform poorly on some problems, where the learning algorithm and the data distribution do not match well.
In practice, experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.

Ugly Duckling Theorem
In the absence of assumptions, there is no best feature representation for all problems; the similarity between patterns is fundamentally based on implicit assumptions about the problem domain.
– Consider the example where features x and y represent blind_in_right_eye and blind_in_left_eye, respectively. If we base similarity on shared features, person P1 = {1, 0} (blind only in the right eye) is maximally different from person P2 = {0, 1} (blind only in the left eye). In this scheme P1 is more similar to a totally blind person and to a normally sighted person than he is to P2!
– There may be situations where we want P1 to be more similar to P2: such persons may be able to drive an automobile!

7
Bias and Variance
There is no "best classifier" in general, hence the necessity of exploring a variety of methods. How do we evaluate whether a learning algorithm "matches" the classification problem?
– Bias measures the quality of the match: high bias implies a poor match.
– Variance measures the specificity of the match: high variance implies a weak match.
Bias and variance are not independent of each other.

8
Bias and Variance
Given the true function F(x), we estimate a function g(x; D) from a training set D of n points. The estimate g depends on the particular training set D: each training set gives a different fit, and a different error in the fit.
Taking the average over all training sets of size n, the mean-squared error decomposes as
E_D[ (g(x; D) − F(x))² ] = (E_D[g(x; D)] − F(x))² + E_D[ (g(x; D) − E_D[g(x; D)])² ]
that is, the average error that g(x; D) makes in fitting F(x) splits into
– bias²: the difference between the expected value and the true value
– variance: the difference between the observed value and the expected value
Low bias: on average, we accurately estimate F from D.
Low variance: the estimate of F does not change much with different D.
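The decomposition above can be checked empirically. The following sketch (not from the slides; the true function, sample sizes, and noise level are illustrative choices) averages polynomial fits over many training sets to estimate bias² and variance:

```python
# Estimate bias^2 and variance of a regression fit g(x; D) against a known
# true function F(x) = sin(2*pi*x), by averaging over many training sets D.
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(2 * np.pi * x)           # true function F(x)
x_test = np.linspace(0.1, 0.9, 50)

def bias2_and_variance(degree, n_sets=300, n=10, noise=0.3):
    """Fit a degree-`degree` polynomial to n_sets training sets D of size n."""
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        x = rng.uniform(0.0, 1.0, n)
        y = F(x) + rng.normal(0.0, noise, n)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)  # g(x; D)
    g_bar = preds.mean(axis=0)                 # E_D[g(x; D)]
    bias2 = np.mean((g_bar - F(x_test)) ** 2)  # (E_D[g] - F)^2, averaged over x
    var = np.mean(preds.var(axis=0))           # E_D[(g - E_D[g])^2], averaged
    return bias2, var

b_lin, v_lin = bias2_and_variance(degree=1)    # inflexible model
b_cub, v_cub = bias2_and_variance(degree=3)    # flexible model
print(f"linear: bias^2 = {b_lin:.3f}, variance = {v_lin:.3f}")
print(f"cubic : bias^2 = {b_cub:.3f}, variance = {v_cub:.3f}")
```

As the decomposition predicts, the flexible cubic model shows lower bias but higher variance than the rigid linear fit.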

9
Bias-Variance Dilemma in Regression
Each row is a different dataset of 6 points; each column is a different model; histograms show the mean-squared error of the fit.
– Column 1: poor fixed linear model; high bias, zero variance.
– Column 2: slightly better fixed linear model; lower (but still high) bias, zero variance.
– Column 3: learned cubic model; low bias, moderate variance.
– Column 4: learned linear model; intermediate bias and variance.

10
Bias-Variance Dilemma
Procedures with increased flexibility to adapt to the training data have lower bias but higher variance:
– a large number of parameters
– fit the data well: low bias, but high variance
Inflexible procedures have higher bias but lower variance:
– fewer parameters
– may not fit the data well: high bias, but low variance
A large amount of training data generally helps improve the estimate, provided the model is general enough to represent the target function.
Bias/variance considerations recommend gathering as much prior information about the problem as possible, to find the best-matching classifier, and as large a dataset as possible, to reduce the variance. We can virtually never achieve both zero bias and zero variance.

11
Bias and Variance for Classification
Low variance is more important for accurate classification than low boundary bias.
Classifiers with large flexibility to adapt to the training data (more free parameters) tend to have low bias but high variance.
Example: a 2-class problem with 2-D Gaussian class distributions with diagonal covariances, and a small number of training data to estimate the parameters of 3 different models.
For the best classification given small training data, we need to match the model to the true distributions.

12
Ensemble-based Systems in Decision Making
References:
– Polikar R., "Ensemble Based Systems in Decision Making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, 2006.
– Polikar R., "Bootstrap Inspired Techniques in Computational Intelligence," IEEE Signal Processing Magazine, vol. 24, no. 4, 2007.
– Polikar R., "Ensemble Learning," Scholarpedia.
For many tasks, we often seek a second opinion before making a decision, sometimes many more:
– consulting different doctors before a major surgery
– reading reviews before buying a product
– requesting references before hiring someone
We weigh the decisions of multiple experts in our daily lives. Why not follow the same strategy in automated decision making?
Such systems are known as multiple classifier systems, committees of classifiers, mixtures of experts, or ensemble-based systems.

13
Ensemble-based Classifiers
Ensemble-based systems provide favorable results compared to single-expert systems for a broad range of applications and under a variety of scenarios.
Two key questions: (i) how to generate the individual components of the ensemble (the base classifiers), and (ii) how to combine the outputs of the individual classifiers?
Popular ensemble-based algorithms:
– bagging, boosting, AdaBoost, stacked generalization, and hierarchical mixture of experts
Commonly used combination rules:
– algebraic combination of outputs, voting methods, behavior knowledge space, and decision templates

14
Why Ensemble-Based Systems?
Statistical reasons:
– A set of classifiers with similar training performances may have different generalization performances.
– Combining the outputs of several classifiers reduces the risk of selecting a poorly performing classifier.
Large volumes of data:
– If the amount of data to be analyzed is too large for a single classifier to handle, train different classifiers on different partitions of the data.
Too little data:
– Ensemble systems can also be used when there is too little data, via resampling techniques.

15
Why Ensemble-Based Systems?
Divide and conquer:
– Divide the data space into smaller, easier-to-learn partitions; each classifier learns only one of the simpler partitions.

16
Why Ensemble-Based Systems?
Data fusion:
– Given several sets of data from various sources, where the nature of the features differs (heterogeneous features), training a single classifier may not be appropriate (e.g., MRI data, EEG recordings, blood tests).
– Applications in which data from different sources are combined are called data fusion applications; ensembles have been used successfully for fusion.
All ensemble systems must have two key components:
– a method for generating the component classifiers of the ensemble
– a method for combining the classifier outputs

17
Brief History of Ensemble Systems
Dasarathy and Sheela (1979) partitioned the feature space using two or more classifiers.
Schapire (1990) proved that a strong classifier can be generated by combining weak classifiers through boosting; this was the predecessor of the AdaBoost algorithm.
Two types of combination:
– Classifier selection: each classifier is trained to become an expert in some local area of the feature space; one or more local experts can be nominated to make the decision.
– Classifier fusion: all classifiers are trained over the entire feature space; fusion involves merging the individual (weaker) classifiers to obtain a single (stronger) expert of superior performance.

18
Diversity of Ensembles
Objective: create many classifiers and combine their outputs so as to improve upon the performance of a single classifier.
Intuition: if each classifier makes different errors, then a strategic combination can reduce the total error!
We need base classifiers whose decision boundaries are adequately different from one another; such a set of classifiers is said to be diverse.
How to achieve classifier diversity?
– Use different training sets to train the individual classifiers.
– How to obtain different training sets? Resampling techniques (bootstrapping or bagging): training subsets are drawn randomly, usually with replacement, from the entire training set.
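A minimal sketch of the resampling step (sizes are illustrative): each base classifier receives its own bootstrap replica of the training set, drawn with replacement.

```python
# Bootstrap resampling: random, overlapping training sets for the base classifiers.
import numpy as np

rng = np.random.default_rng(1)
n = 100
indices = np.arange(n)                  # indices of the training samples

# three random, overlapping replicas, one per base classifier
replicas = [rng.choice(indices, size=n, replace=True) for _ in range(3)]

# On average a replica contains about 63.2% of the distinct original samples,
# since P(a given sample is never drawn) = (1 - 1/n)^n ~ e^(-1) ~ 0.368.
for r in replicas:
    print(len(np.unique(r)) / n)
```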

19
Sampling with Replacement
Random, overlapping training sets are used to train three classifiers, which are then combined to obtain a more accurate classification.

20
Sampling without Replacement
Jackknife or k-fold data split:
– The entire dataset is split into k blocks; each classifier is trained on a different subset of (k − 1) blocks.
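The k-fold split above can be sketched as follows (sizes are illustrative): classifier j trains on the union of every block except block j.

```python
# Jackknife / k-fold split without replacement: k disjoint blocks,
# each classifier trains on the other (k-1) blocks.
import numpy as np

def kfold_train_sets(n, k):
    blocks = np.array_split(np.arange(n), k)   # k disjoint blocks of indices
    return [np.concatenate([b for i, b in enumerate(blocks) if i != j])
            for j in range(k)]                 # leave block j out for classifier j

train_sets = kfold_train_sets(n=10, k=5)
print([len(t) for t in train_sets])            # each classifier sees 8 of 10 samples
```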

21
Other Approaches to Achieve Diversity
Use different training parameters for a classifier:
– A series of MLPs can be trained using different weight initializations, numbers of layers/nodes, etc.
– Adjusting these parameters controls the instability of such classifiers (local minima).
– A similar strategy can be used to generate different decision trees for the same problem.
Different types of classifiers (MLPs, decision trees, nearest-neighbor classifiers, SVMs) can be combined for added diversity.
Diversity can also be achieved by using random feature subsets, called the random subspace method.

22
Creating an Ensemble
Two questions:
– How will the individual classifiers be generated?
– How will they differ from each other?
The answers determine the diversity of the classifiers and the fusion performance. We seek to improve ensemble diversity by heuristic methods.

23
Bagging
Bagging, short for bootstrap aggregating, is one of the earliest ensemble-based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance.
Use bootstrapped replicas of the training data: a large number of training subsets (say, 200) are randomly drawn, with replacement, from the entire training set.
Each resampled training set is used to train a different classifier of the same type; the individual classifiers are combined by taking a majority vote of their decisions.
Bagging is appealing for small training sets, since a relatively large portion of the samples is included in each subset.
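A runnable sketch of bagging (not from the slides; the toy 1-D data, the decision-stump base learner, and T = 25 classifiers are illustrative choices):

```python
# Bagging: train decision stumps on bootstrap replicas, combine by majority vote.
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 0.3, 50), rng.normal(1.0, 0.3, 50)])
y = np.array([0] * 50 + [1] * 50)

def fit_stump(x, y):
    """Threshold classifier predicting (x > t), possibly flipped, by accuracy."""
    best = (-1.0, 0.0, False)                  # (accuracy, threshold, flip)
    for t in np.unique(x):
        pred = (x > t).astype(int)
        for flip in (False, True):
            p = 1 - pred if flip else pred
            acc = np.mean(p == y)
            if acc > best[0]:
                best = (acc, t, flip)
    return best[1], best[2]

def stump_predict(t, flip, x):
    pred = (x > t).astype(int)
    return 1 - pred if flip else pred

T = 25                                          # odd, so votes never tie
stumps = []
for _ in range(T):
    idx = rng.integers(0, len(x), len(x))       # bootstrap replica (with replacement)
    stumps.append(fit_stump(x[idx], y[idx]))

votes = np.array([stump_predict(t, f, x) for t, f in stumps])
bagged = (votes.mean(axis=0) > 0.5).astype(int) # majority vote of the decisions
print("bagged training accuracy:", np.mean(bagged == y))
```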

24
Bagging

25
Variations of Bagging
Random forests: so called because they are constructed from decision trees.
– A random forest is created from individual decision trees whose training parameters vary randomly.
– The varying parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in the random subspace method.

26
Boosting
Boosting raises the performance of a weak learner to the level of a strong one. It creates an ensemble of classifiers by resampling the data and combines them by majority voting; the resampling is strategically geared to provide the most informative training data for each consecutive classifier.
Boosting creates three weak classifiers:
– The first classifier C1 is trained with a random subset of the available training data.
– The training set for the second classifier C2 is chosen as the most informative subset given C1: half of the training data for C2 is correctly classified by C1, the other half is misclassified by C1.
– The third classifier C3 is trained on instances on which C1 and C2 disagree.

27
Boosting

28
AdaBoost
AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems.
AdaBoost generates a set of hypotheses (classifiers) and combines them through weighted majority voting of the classes predicted by the individual hypotheses.
The hypotheses are generated by training a weak classifier on samples drawn from an iteratively updated distribution over the training set. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier.
Consecutive classifiers are therefore trained on increasingly hard-to-classify samples.

29
AdaBoost
A weight distribution D_t(i) is maintained on the training instances x_i, i = 1,…,N, from which training subsets S_t are drawn for each consecutive classifier (hypothesis) h_t.
From the weighted error ε_t of h_t, a normalized error is obtained as β_t = ε_t / (1 − ε_t), such that for 0 < ε_t < 1/2 we have 0 < β_t < 1.
Distribution update rule:
– The distribution weights of instances correctly classified by the current hypothesis are reduced by a factor of β_t, whereas the weights of misclassified instances are left unchanged; after renormalization, AdaBoost thus raises the weights of instances misclassified by h_t, lowers the weights of correctly classified instances, and focuses on increasingly difficult instances.
Once trained, AdaBoost classifies unlabeled test instances by weighted majority voting, unlike bagging or boosting: 1/β_t is a measure of the performance of the t-th hypothesis, and can be used to weight the classifiers.
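A runnable sketch of the AdaBoost.M1 loop described above, with beta_t = eps_t / (1 - eps_t) and log(1/beta_t) as the voting weight. The toy data, the decision-stump weak learner, and T = 10 rounds are illustrative choices, not from the slides.

```python
# AdaBoost.M1 sketch: weighted-error stumps, beta_t update, weighted majority vote.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.3, 50), rng.normal(1.0, 0.3, 50)])
y = np.array([0] * 50 + [1] * 50)

def weak_learner(x, y, D):
    """Stump minimizing the D-weighted error: predicts (x > t), possibly flipped."""
    best = None
    for t in np.unique(x):
        for flip in (False, True):
            pred = (x > t).astype(int)
            if flip:
                pred = 1 - pred
            err = D[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, flip, pred)
    return best

def adaboost_m1(x, y, T=10):
    n = len(x)
    D = np.full(n, 1.0 / n)                  # initial distribution D_1(i) = 1/n
    hyps, alphas = [], []
    for _ in range(T):
        eps, t, flip, pred = weak_learner(x, y, D)
        if eps >= 0.5:                       # weak-learning condition violated
            break
        beta = max(eps, 1e-10) / (1 - eps)   # 0 < beta < 1 when 0 < eps < 1/2
        D[pred == y] *= beta                 # shrink weights of correct instances
        D /= D.sum()                         # renormalize: misclassified gain weight
        hyps.append((t, flip))
        alphas.append(np.log(1 / beta))      # voting weight log(1/beta)
    return hyps, alphas

def predict(hyps, alphas, x):
    """Weighted majority vote over the trained hypotheses."""
    votes = np.zeros((len(x), 2))
    for (t, flip), a in zip(hyps, alphas):
        pred = (x > t).astype(int)
        if flip:
            pred = 1 - pred
        votes[np.arange(len(x)), pred] += a
    return votes.argmax(axis=1)

hyps, alphas = adaboost_m1(x, y)
print("training accuracy:", np.mean(predict(hyps, alphas, x) == y))
```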

30
AdaBoost.M1

31
AdaBoost.M1
The AdaBoost algorithm is sequential: classifier C_(k−1) is created before classifier C_k.

32
Boosting

33
AdaBoost

34
Performance of AdaBoost
In most practical cases, the ensemble error decreases very rapidly in the first few iterations, and approaches zero or stabilizes as new classifiers are added.
AdaBoost does not seem to be affected by overfitting; this is explained by margin theory.

35
Stacked Generalization
An ensemble of classifiers is first created; their outputs are used as inputs to a second-level meta-classifier, which learns the mapping between the ensemble outputs and the actual correct classes.
– Classifiers C_1, …, C_T are trained using training parameters θ_1 through θ_T to output hypotheses h_1 through h_T.
– The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier C_(T+1).
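A minimal sketch of the two-level structure (not from the slides): the outputs of the first-level classifiers become the input features of the combiner C_(T+1). Here the first level uses three fixed threshold classifiers and the meta-model is a least-squares linear fit, standing in for the meta-classifier left unspecified above; the data and thresholds are illustrative.

```python
# Stacked generalization sketch: level-0 outputs feed a level-1 combiner.
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 0.3, 50), rng.normal(1.0, 0.3, 50)])
y = np.array([0] * 50 + [1] * 50)

# first level: three fixed threshold classifiers h_1..h_3
thresholds = [0.2, 0.5, 0.8]
H = np.stack([(x > t).astype(float) for t in thresholds], axis=1)  # shape (n, T)

# second level: train the combiner on (ensemble outputs, true class) pairs
H1 = np.hstack([H, np.ones((len(x), 1))])       # add a bias column
w, *_ = np.linalg.lstsq(H1, y, rcond=None)      # fit the linear meta-model
meta_pred = (H1 @ w > 0.5).astype(int)
print("stacked training accuracy:", np.mean(meta_pred == y))
```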

36
Mixture of Experts
A conceptually similar technique is the mixture-of-experts model, where a set of classifiers C_1, …, C_T constitutes the ensemble, followed by a second-level classifier C_(T+1) that assigns the weights used by the combiner.
– The combiner itself is usually not a classifier, but a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all.
– The weight distribution used by the combiner is determined by a second-level classifier, usually a neural network, called the gating network.
– The inputs to the gating network are the actual training data instances themselves (unlike stacked generalization, where the inputs are the outputs of the first-level classifiers).
Mixture of experts can therefore be seen as a classifier selection algorithm: individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x.

37
Mixture of Experts
The pooling system may use the weights in several different ways:
– it may choose the single classifier with the highest weight, or
– it may calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.

38
Combining Classifiers
How should classifiers be combined? Combination rules can be grouped as follows:
(i) Trainable vs. non-trainable:
– Trainable rules: the parameters of the combiner, called weights, are determined through a separate training algorithm. The weights are usually instance-specific, hence these are also called dynamic combination rules.
– Non-trainable rules: the combination parameters become available as the classifiers are generated; weighted majority voting is an example.
(ii) Rules for class labels vs. class-specific continuous outputs:
– Combination rules that apply to class labels need only the classification decision (one of ω_j, j = 1,…,C).
– Other rules need the continuous-valued outputs of the individual classifiers.

39
Combining Class Labels
Assume that only class labels are available from the classifier outputs.
Define the decision of the t-th classifier as d_(t,j) ∈ {0,1}, t = 1,…,T and j = 1,…,C, where T is the number of classifiers and C is the number of classes. If the t-th classifier chooses class ω_j, then d_(t,j) = 1, and 0 otherwise.
a) Majority voting: choose the class ω_J that receives the largest total vote, Σ_t d_(t,J) = max_j Σ_t d_(t,j).
b) Weighted majority voting: choose the class maximizing the weighted vote Σ_t w_t d_(t,j), where w_t is the weight of the t-th classifier.
c) Behavior Knowledge Space (BKS): a look-up table indexed by the joint decisions of the classifiers.
d) Borda count: each voter (classifier) rank-orders the candidates (classes). If there are N candidates, the first-place candidate receives N − 1 votes, the second-place candidate receives N − 2, with the candidate in i-th place receiving N − i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.
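The label-combination rules above can be sketched directly (the vote matrices, weights, and rankings below are illustrative toy values):

```python
# Majority voting, weighted majority voting, and Borda count over class labels.
import numpy as np

# decisions[t, j] = d_(t,j): 1 if classifier t votes for class j
# (T = 5 classifiers, C = 3 classes)
decisions = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# (a) majority voting: choose class j maximizing sum_t d_(t,j)
print(decisions.sum(axis=0).argmax())           # class 1 wins 3 of 5 votes

# (b) weighted majority voting with per-classifier weights w_t
w = np.array([0.1, 0.2, 0.2, 0.2, 0.3])
print((w @ decisions).argmax())

# (d) Borda count: each classifier rank-orders all C classes;
# first place earns C - 1 votes, second earns C - 2, ..., last earns 0
ranks = np.array([          # ranks[t] lists classes from best to worst
    [0, 1, 2],
    [1, 0, 2],
    [1, 2, 0],
    [2, 1, 0],
    [1, 0, 2],
])
C = 3
borda = np.zeros(C)
for r in ranks:
    for place, cls in enumerate(r):
        borda[cls] += C - 1 - place
print(borda.argmax())
```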

40
Combining Continuous Outputs
Algebraic combiners over the continuous class supports d_(t,j)(x):
a) Mean rule: μ_j(x) = (1/T) Σ_t d_(t,j)(x)
b) Weighted average: μ_j(x) = Σ_t w_t d_(t,j)(x)
c) Minimum/maximum/median rule: μ_j(x) = min_t / max_t / med_t d_(t,j)(x)
d) Product rule: μ_j(x) = Π_t d_(t,j)(x)
e) Generalized mean: μ_j(x) = ( (1/T) Σ_t d_(t,j)(x)^α )^(1/α)
Many of the above rules are in fact special cases of the generalized mean: α → −∞ gives the minimum rule; α → +∞ gives the maximum rule; α = 1 gives the mean rule.
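A short sketch of the generalized mean over continuous outputs, showing how the mean, minimum, and maximum rules fall out as special cases of α (the support matrix below is an illustrative toy value; large finite exponents stand in for α → ±∞):

```python
# Generalized mean combiner: mu_j(x) = ((1/T) * sum_t d_(t,j)(x)**alpha)**(1/alpha)
import numpy as np

def generalized_mean(d, alpha):
    """Combine per-classifier supports d (rows: classifiers, cols: classes)."""
    return np.mean(d ** alpha, axis=0) ** (1 / alpha)

# supports of T = 3 classifiers for C = 2 classes
d = np.array([[0.6, 0.4],
              [0.8, 0.2],
              [0.5, 0.5]])

print(generalized_mean(d, 1))      # alpha = 1      -> mean rule
print(generalized_mean(d, -50))    # alpha -> -inf  -> approximately the minimum rule
print(generalized_mean(d, 50))     # alpha -> +inf  -> approximately the maximum rule
```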

41
Combining Classifier Outputs

42
Conclusions
Ensemble systems are useful in practice, and the diversity of the base classifiers is important.
Ensemble generation techniques: bagging, AdaBoost, mixture of experts.
Classifier combination strategies: algebraic combiners, voting methods, and decision templates.
No single ensemble generation algorithm or combination rule is universally better than others; effectiveness on real-world data depends on classifier diversity and the characteristics of the data.
