Computational Intelligence: Methods and Applications Lecture 36 Meta-learning: committees, sampling and bootstrap. Włodzisław Duch Dept. of Informatics, UMK Google: W Duch

One model is not enough If the problem is difficult, an expert may call for a consultation. In real-life problems, such as predicting the glucose levels of diabetic patients, a large number of different classification algorithms have been applied, and special issues of journals and whole books have been devoted to the subject. What is the optimal way of combining the results of many systems? Create different models (experts), and let them vote. A committee is less biased, results should improve and stabilize, and in complex domains modular learning is needed, but... the reasons for the decisions taken are difficult to understand (each expert has different reasons). A collection of models is called an ensemble, or a committee, hence the terms ensemble learning and committee learning. Good models are desired, but those used in a committee should specialize; otherwise the committee will not add anything new.

Sources of model variability There are three main reasons why models of the same type differ: Different data subsets are used for training the same models: crossvalidation – systematically explore all data; bootstrap – randomly select samples with replacement. Stochastic nature of training: some learning algorithms contain stochastic elements, e.g. all gradient-based optimization methods may end up in a different local minimum due to different initialization. Different parameters are used to train the models: even if the models are completely deterministic and trained on the same data, different parameters may be used (pruning, features selected for training, discretization parameters, etc.).

Homo and hetero A committee may be composed of homogeneous or heterogeneous models. Homogeneous models (most common): contain the same type of model, for example the same type of decision tree or the same type of RBF neural mapping, but either trained on different data subsets (as in crossvalidation) or trained with different parameters (different number of nodes in the RBF mapping). Example: use all models created in crossvalidation to form an ensemble. Heterogeneous models (rarely used): different types of models are used, for example decision trees, kNN, LDA and kernel methods combined in one committee. Advantage: explores the different biases of the classifiers, taking advantage of hyperplanes and local neighborhoods in one system. Example: GhostMiner committees. Best book: L. Kuncheva, Combining Pattern Classifiers. Wiley 2004

Voting procedures Models Mk should return an estimation of P(ωi|X;Mk). Simple voting: democratic majority decision; if most Mk predict class ωi, then predict ωi. But... decisions made by crowds are not always the best. Select the most confident models - perhaps they know better? E.g. take only P(ωi|X;Mk) > 0.8 into account. Competent voting: in some regions of feature space some models are more competent than others, so let only those that work well around X vote. Use "gating" or a referee meta-model to select who has the right to vote. Linear combination of weighted votes: the final model M includes partial models Mk and weights Wk, M(X) = Σk Wk Mk(X); optimize the weights to minimize some cost measure on your data. A sketch of these voting schemes is given below.
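
To make the three voting schemes concrete, here is a minimal Python sketch (my own illustration, not part of the lecture), assuming each model returns a vector of class probabilities P(ωi|X;Mk); only numpy is required.

import numpy as np

def majority_vote(probs):
    # probs: array (n_models, n_classes) of P(class|X; M_k); pick the most voted class.
    votes = np.argmax(probs, axis=1)
    return np.bincount(votes, minlength=probs.shape[1]).argmax()

def confident_vote(probs, threshold=0.8):
    # Count only models whose top probability exceeds the threshold (e.g. 0.8).
    confident = probs[probs.max(axis=1) > threshold]
    return majority_vote(confident) if len(confident) else majority_vote(probs)

def weighted_vote(probs, weights):
    # Linear combination of votes: argmax over classes of sum_k W_k P(class|X; M_k).
    weights = np.asarray(weights, dtype=float)
    return np.argmax((weights / weights.sum()) @ probs)

# Example: three models, two classes.
probs = np.array([[0.6, 0.4], [0.9, 0.1], [0.3, 0.7]])
print(majority_vote(probs), confident_vote(probs), weighted_vote(probs, [1, 3, 1]))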

Bootstrap The best models (parameters, features), with good generalization, are selected by maximizing accuracy on the validation partition in crossvalidation. WEKA has a number of approaches to improve the final results. Analyzing a scene, our eyes jump (saccade) all over it; each second new samples are taken, but unlike in crossvalidation, the same point may be viewed many times. Bootstrap: form the training set many times, drawing data with replacement (i.e. leaving the original data in for further drawing). Bootstrap algorithm: from a dataset with n vectors randomly select n samples, without removing the data from the original set; use the data that has not been selected for testing; repeat this many times and average the results.
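
A small Python sketch of this bootstrap procedure, assuming numpy arrays and a scikit-learn-style classifier with fit and score methods; the model and data are placeholders, not the lecture's examples.

import numpy as np

def bootstrap_scores(model, X, y, n_repeats=100, seed=0):
    # Draw n samples with replacement, test on the vectors that were never drawn,
    # repeat many times and average - the procedure described in the slide above.
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_repeats):
        idx = rng.integers(0, n, size=n)          # sampling with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # never-drawn ("out-of-bag") vectors -> test set
        if len(oob) == 0:
            continue
        model.fit(X[idx], y[idx])
        scores.append(model.score(X[oob], y[oob]))
    return np.mean(scores), np.std(scores)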

0.632 bootstrap With n vectors, the probability that a given vector is not selected in a single draw is 1 - 1/n. Therefore the probability that some vector is never drawn, and hence will be used for the test, is (1 - 1/n)^n ≈ e^-1 ≈ 0.368. The training data contains the rest, 63.2% of the data; this may lead to an overly pessimistic estimation, since less than 2/3 of the data is used for training. For n > 40 the 0.632 bootstrap therefore combines the error on the training and on the test data: E = 0.632 E(Test) + 0.368 E(Train). Test and train data thus get the same weight in the estimation of errors. This type of bootstrap is recommended by statisticians. Warning: results for models that overfit the data, giving E(Train) = 0, are too optimistic; sometimes additional corrections are used.
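
A quick numeric check of the probabilities quoted above (my own illustration, not from the slides):

import numpy as np

n = 1000
p_test = (1 - 1 / n) ** n        # probability that a vector is never drawn -> test set
print(p_test, np.exp(-1))        # both ~0.368, so ~63.2% of the data ends up in training
# The 0.632 bootstrap then combines both error estimates:
# E = 0.632 * E(Test) + 0.368 * E(Train)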

Bagging (Bootstrap aggregating) The simplest way of making predictions by voting and averaging. Create Nm bootstrap subsets Di by sampling from the training data. Train the same model M(Di) on all data subsets Di. Bagging decision for a new vector X: – in classification: the majority class predicted by the Nm models, or the probability P(Ck|X) estimated by the proportion of models predicting class Ck; – in approximation: the average predicted value. Stabilizes results and reduces variance, especially on noisy data: even if individual models are unstable and have high variance, the committee result has lower variance! Ex: running 10xCV for the SSV tree using GhostMiner on the glass data gives accuracy 69±10%, while a committee of 11 trees gives 83±2%.
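
A hedged scikit-learn sketch of bagging; the SSV tree, GhostMiner and the glass data are not available here, so an ordinary decision tree and the built-in wine dataset stand in for the experiment quoted above.

from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
single = DecisionTreeClassifier(random_state=0)
# 11 trees, each trained on a bootstrap sample of the training part, voting on the class.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=11, random_state=0)
print("single tree:", cross_val_score(single, X, y, cv=10).mean())
print("bag of 11  :", cross_val_score(bagged, X, y, cv=10).mean())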

Boosting Weight the performance of each model and encourage the development of models that work well on cases where other models fail. Several variants of AdaBoost (adaptive boosting) exist, creating a sequential series of models that specialize in cases that were not handled correctly by previous models. The general algorithm is: initialize: assign an equal weight wi to each training vector X(i); – train models Mk(X), k=1..m, successively on the weighted data; – for model Mk calculate the error Ek using the weights wi; – increase the weights of vectors that were not correctly classified. Output the decision taking the average prediction of all classifiers. These types of algorithms are quite popular, and may be applied to "weak classifiers" that are only slightly better than random guessing, e.g. "decision stumps" (one-level trees); sometimes > 100 of them are used.

AdaBoost M1 version 2 classes ±1, n training vectors X(i), m classifiers Mk, true class is ω(X). 1. Initialize weights wi = 1/n, for i=1..n. 2. For k=1..m train classifier Mk(X) on the weighted data – this is done either by using the weights in the error function or by replicating samples in the training data. 3. Compute the error Ek for each Mk, counting the weights wi of misclassified vectors as errors. 4. Compute the model weights: αk = ln[(1-Ek)/Ek]. 5. Set new weights for vectors that gave errors: wi ← wi (1-Ek)/Ek, then renormalize so that the weights sum to 1. 6. After all m models are created, output the decision M(X) = sign[Σk αk Mk(X)].
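
A minimal Python sketch that follows the six steps above for the two-class ±1 case; the weak learner is a scikit-learn decision stump, which is an assumption - any classifier that accepts sample weights would do.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, m=50):
    # X, y are numpy arrays; y must contain +1/-1 labels.
    n = len(X)
    w = np.full(n, 1.0 / n)                         # 1. equal initial weights
    models, alphas = [], []
    for _ in range(m):
        stump = DecisionTreeClassifier(max_depth=1) # 2. train M_k on the weighted data
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()          # 3. weighted error E_k
        if err == 0 or err >= 0.5:                  # perfect, or no better than chance: stop
            if err == 0:
                models.append(stump)
                alphas.append(1.0)
            break
        alpha = np.log((1 - err) / err)             # 4. model weight alpha_k
        w[pred != y] *= (1 - err) / err             # 5. boost weights of misclassified vectors
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    # 6. weighted vote of all classifiers: sign of sum_k alpha_k M_k(X).
    F = sum(a * mdl.predict(X) for a, mdl in zip(alphas, models))
    return np.sign(F)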

Results of boosting AdaBoost M1 combined with the WEKA J48 tree algorithm gives in 10CV: 10 iterations: test 73.8%; 20 iterations: test 76.1%; 30 iterations: test 77.6%; 50 iterations: test 77.6% - saturated, but significantly better than a simple voting committee. AdaBoost + decision trees have been called "the best off-the-shelf classifiers in the world"! They accept all kinds of attributes and do not require data preprocessing. Use a 1-of-K scheme to apply it in the K-class case. Other boosting methods exist, for example LogitBoost, with a more sophisticated weighting scheme, but they rarely give significant improvements.
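
The WEKA J48 experiment cannot be reproduced here, but an analogous scikit-learn sketch shows the same kind of dependence on the number of boosting iterations (dataset and numbers are illustrative, not those from the slide):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for n in (10, 20, 30, 50):
    clf = AdaBoostClassifier(n_estimators=n, random_state=0)  # decision stumps by default
    print(n, "iterations:", round(cross_val_score(clf, X, y, cv=10).mean(), 3))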

Stacking Stacking: try harder! Do it again! Instead of voting, use the outputs Yi = gi(X) of the i=1..m classifiers, which estimate probabilities or just give binary values for the class labels, and train a new meta-classifier on the m-dimensional vectors Y. If not satisfied, repeat the training with more classifiers Zi = gi(Y). Several new classifiers may be applied to this new data, and meta-meta-classifiers applied; using one LDA model is equivalent to voting with linear weights. Single-level stacking is equivalent to: a committee of classifiers + a threshold function (majority voting); a linear classifier (linear weighting); or some non-linear weighting function, like a perceptron σ(W·X). Stacking may be a basis for a series of useful transformations!
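
A sketch of single-level stacking in scikit-learn (an assumption, not the lecture's setup): heterogeneous base classifiers produce the output vector Y, and a logistic-regression meta-classifier is trained on it.

from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
# Base classifiers g_i(X) produce probability outputs; the meta-classifier is trained
# on those m-dimensional vectors (internally via cross-validation).
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=10).mean())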

Remarks Such methods may improve the generalization and predictive power of your decision system in many situations, but... the algorithms lose their explanatory power: averaging the predictions of many decision trees cannot be reduced to any single tree; the decision process is difficult to comprehend, as complex decision borders are created; overfitting may easily become a problem because more parameters are added. Constructing better features may lead to more interesting results. This is rather obvious for signal processing applications, where PCA or ICA features are frequently used. Meta-learning in WEKA – more examples.

The End This was only an introduction to some computational intelligence problems, methods and inspirations. The field is vast … A drop of water in an ocean of knowledge …