Presentation on theme: "Ensemble Learning – Bagging, Boosting, and Stacking, and other topics" — Presentation transcript:
1 Ensemble Learning – Bagging, Boosting, and Stacking, and other topics
Professor Carolina Ruiz
Department of Computer Science
WPI, Worcester, Massachusetts
2 Constructing predictors/models
1. Given labeled data, use a data mining technique to train a model.
2. Given a new unlabeled data instance, use the trained model to predict its label.
(Diagram: Data -> trained model; new data -> prediction)
Techniques: decision trees, Bayesian nets, neural nets, …
Wish list:
- Good predictor: low error
- Stable: small variations in the training data => small variations in the resulting model
3 Looking for a good model
Varying the data used: subset of the attributes, subset of the data instances, …
Varying the DM technique/parameters: different parameters for a technique, different techniques, …
(Diagram: Data -> many candidate models -> many predictions)
Repeat until a "good" (low error, stable, …) model is found. But what if a good model is not found? And even if one is found, how can we improve it?
4 Approach: Ensemble of models
(Diagram: Data -> ensemble of models -> single prediction)
Form an ensemble of models and combine their predictions into a single prediction.
5 Constructing Ensembles – How?
1. Given labeled data, how to construct an ensemble of models?
2. Given a new unlabeled data instance, how to use the ensemble to predict its label?
(Diagram: Data -> ensemble; new data -> prediction)
Data: what (part of the) data to use to train each model in the ensemble?
Data mining techniques: what technique and/or what parameters to use to train each model?
Combination: how to combine the individual model predictions into a unified prediction?
6 Several Approaches
- Bagging (Bootstrap Aggregating): Breiman, UC Berkeley
- Boosting: Schapire, AT&T Research (now at Princeton U.); Friedman, Stanford U.
- Stacking: Wolpert, NASA Ames Research Center
- Model Selection Meta-learning: Floyd, Ruiz, Alvarez, WPI and Boston College
- Mixture of Experts in Neural Nets: Alvarez, Ruiz, Kawato, Kogel, Boston College and WPI
- …
7 Bagging (Bootstrap Aggregation) – Breiman, UC Berkeley
1. Create bootstrap replicates of the data (i.e., random samples of the data instances, drawn with replacement) and train a model on each replicate.
2. Given a new unlabeled data instance, input it to each model; the ensemble prediction is the (weighted) average or majority vote of the individual model predictions.
(Diagram: replicates R1, R2, …, Rn -> models; new data -> each model -> combined prediction)
Usually the same data mining technique is used to train each model. May help stabilize models.
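The two bagging steps can be written out in a few lines of Python. This is an illustrative sketch, not part of the original slides; it assumes scikit-learn decision trees as the base learner and non-negative integer class labels, and names such as bagging_fit are made up for the example.

```python
# A minimal from-scratch bagging sketch (illustrative; assumes scikit-learn).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_replicates=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_replicates):
        # Bootstrap replicate: sample n instances with replacement.
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X_new):
    # Each model votes; the ensemble prediction is the majority class.
    # Assumes non-negative integer class labels.
    votes = np.array([m.predict(X_new) for m in models])   # (n_models, n_instances)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```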
8 Boosting – Schapire, AT&T Research/Princeton U.; Friedman, Stanford U.
1. Assign equal weights to all data instances.
2. Train a model. Increase the weight of incorrectly predicted instances and decrease the weight of correctly predicted instances. Repeat step 2.
3. Given a new unlabeled data instance, run it through the merged model; the ensemble prediction is the prediction of the merged model (e.g., majority vote, weighted average, …).
(Diagram: data -> data' -> data'' -> … : a sequence of reweighted datasets and models)
Usually the same data mining technique is used for each model. May help decrease prediction error.
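As a concrete illustration (not from the slides), scikit-learn's AdaBoostClassifier implements one well-known boosting algorithm of this form; the dataset and parameter values below are arbitrary.

```python
# A minimal boosting sketch using scikit-learn's AdaBoostClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round reweights the training instances so previously misclassified
# ones count more; the merged model is a weighted vote of the round models.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("boosting accuracy:", booster.score(X_test, y_test))
```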
9 Stacking – Wolpert, NASA Ames Research Center
1. Train different models on the same data ("Level-0 models").
2. Train a new ("Level-1") model on the outputs of the Level-0 models.
3. Given a new unlabeled data instance, input it to each Level-0 model; the ensemble prediction is the Level-1 model's prediction based on the Level-0 model predictions.
(Diagram: data -> Level-0 models -> Level-1 model; new data -> prediction)
The Level-0 models use different parameters and/or different data mining techniques. May help reduce prediction error.
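A minimal stacking sketch using scikit-learn's StackingClassifier, which trains the Level-1 model on cross-validated Level-0 outputs; the particular Level-0 and Level-1 model choices here are illustrative, not from the slides.

```python
# A minimal stacking sketch: Level-0 models plus a Level-1 combiner.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0: different techniques trained on the same data.
level0 = [("tree", DecisionTreeClassifier()), ("nb", GaussianNB())]
# Level-1: a model trained on the Level-0 outputs.
stacker = StackingClassifier(estimators=level0, final_estimator=LogisticRegression())
stacker.fit(X_train, y_train)
print("stacking accuracy:", stacker.score(X_test, y_test))
```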
10 Model Selection Meta-learning – Floyd, Ruiz, Alvarez, WPI and Boston College
1. Train different Level-0 models.
2. Train a Level-1 model to predict which Level-0 model is best for a given data instance.
3. Given a new unlabeled data instance, input it to the Level-1 model; the ensemble prediction is the prediction of the Level-0 model selected by the Level-1 model for that instance.
(Diagram: data -> Level-0 models and Level-1 selector; new data -> prediction)
The Level-0 models use different parameters and/or different data mining techniques. May help determine which technique/model works best on the given data.
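A rough sketch of the selection idea, not the authors' implementation: here the Level-1 model is trained to predict, for each training instance, the index of a Level-0 model that classifies it correctly. The model choices and helper names (fit_selector, predict_with_selector) are made up for illustration.

```python
# Model selection meta-learning sketch (illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def fit_selector(X, y):
    # Level-0: different techniques trained on the same data.
    level0 = [DecisionTreeClassifier(max_depth=1).fit(X, y), GaussianNB().fit(X, y)]
    # Level-1 target: index of the first Level-0 model that predicts the
    # instance correctly (falls back to model 0 when none is correct).
    preds = np.array([m.predict(X) for m in level0])   # (n_models, n_instances)
    best = np.argmax(preds == y, axis=0)
    selector = DecisionTreeClassifier(max_depth=3).fit(X, best)  # Level-1 model
    return level0, selector

def predict_with_selector(level0, selector, X_new):
    chosen = selector.predict(X_new)                    # which model to trust, per instance
    all_preds = np.array([m.predict(X_new) for m in level0])
    return all_preds[chosen, np.arange(len(X_new))]
```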
11 Mixture of Experts Architecture – Alvarez, Ruiz, Kawato, Kogel, Boston College and WPI
1. Split the data attributes into domain-meaningful subgroups: A', A'', …
2. Create and train a mixture-of-experts feed-forward neural net (ANN layers: input, hidden, output). Note that not all connections between input and hidden nodes are included.
3. Given a new unlabeled data instance, feed it forward through the mixture of experts; the mixture-of-experts prediction is the output produced by the network.
(Diagram: attribute subgroups A', A'', A''' each feeding a separate hidden block; new data -> prediction)
May help speed up ANN training without increasing prediction error.
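A rough PyTorch sketch of a restricted-connectivity network of this kind (an assumed library; this is not the authors' original code). Each attribute subgroup feeds only its own hidden "expert" block, so not all input-to-hidden connections exist; the group sizes and layer widths are illustrative.

```python
# Mixture-of-experts style feed-forward net with per-subgroup hidden blocks.
import torch
import torch.nn as nn

class MixtureOfExpertsNet(nn.Module):
    def __init__(self, group_sizes, hidden_per_group=8, n_classes=2):
        super().__init__()
        # One "expert" hidden block per attribute subgroup (A', A'', ...).
        self.experts = nn.ModuleList(
            [nn.Linear(g, hidden_per_group) for g in group_sizes]
        )
        self.out = nn.Linear(hidden_per_group * len(group_sizes), n_classes)

    def forward(self, x_groups):
        # x_groups: list of tensors, one per attribute subgroup.
        hidden = [torch.relu(expert(x)) for expert, x in zip(self.experts, x_groups)]
        return self.out(torch.cat(hidden, dim=1))

# Example with two subgroups of 5 and 3 attributes (illustrative sizes).
net = MixtureOfExpertsNet(group_sizes=[5, 3])
x_prime = torch.randn(4, 5)    # A' attributes for 4 instances
x_double = torch.randn(4, 3)   # A'' attributes
logits = net([x_prime, x_double])
```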
12 Conclusions
Ensemble methods construct and/or combine a collection of predictors with the purpose of improving upon the properties of the individual predictors:
- stabilize models
- reduce prediction error
- aggregate individual predictors that make different errors
- be more resistant to noise
13 References
- J.F. Elder, G. Ridgeway. "Combining Estimators to Improve Performance." KDD-99 tutorial notes.
- L. Breiman. "Bagging Predictors." Machine Learning, 24(2), 1996.
- R.E. Schapire. "The Strength of Weak Learnability." Machine Learning, 5(2), 1990.
- Y. Freund, R. Schapire. "Experiments with a New Boosting Algorithm." Proc. of the 13th Intl. Conf. on Machine Learning, 1996.
- J. Friedman, T. Hastie, R. Tibshirani. "Additive Logistic Regression: A Statistical View of Boosting." Annals of Statistics, 2000.
- D.H. Wolpert. "Stacked Generalization." Neural Networks, 5(2), 1992.
- S. Floyd, C. Ruiz, S.A. Alvarez, J. Tseng, and G. Whalen. "Model Selection Meta-Learning for the Prognosis of Pancreatic Cancer." Full paper, Proc. 3rd Intl. Conf. on Health Informatics (HEALTHINF 2010).
- S.A. Alvarez, C. Ruiz, T. Kawato, and W. Kogel. "Faster Neural Networks for Combined Collaborative and Content-Based Recommendation." Journal of Computational Methods in Sciences and Engineering (JCMSE), IOS Press, Vol. 11, No. 4.
15 Bagging (Bootstrap Aggregation)
Model creation: create bootstrap replicates of the dataset and fit a model to each one.
Prediction: average/vote the predictions of the models.
Advantages: stabilizes "unstable" methods; easy to implement and parallelizable.
16 Bagging Algorithm
1. Create k bootstrap replicates of the dataset.
2. Fit a model to each of the replicates.
3. Average/vote the predictions of the k models.
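For reference, these three steps come pre-packaged as scikit-learn's BaggingClassifier (an assumed library, not part of the slides); k corresponds to n_estimators, and decision trees are the default base model.

```python
# Library version of the bagging algorithm above (illustrative parameters).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 25 bootstrap replicates, one decision tree per replicate (the default
# base estimator); prediction is a vote over the 25 trees.
bagger = BaggingClassifier(n_estimators=25, random_state=0)
bagger.fit(X_train, y_train)
print("bagging accuracy:", bagger.score(X_test, y_test))
```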
17 Boosting
Creating the model: construct a sequence of datasets and models such that each dataset in the sequence weights an instance heavily when the previous model misclassified it.
Prediction: "merge" the models in the sequence.
Advantages: improves classification accuracy.
18 Generic Boosting Algorithm
1. Equally weight all instances in the dataset.
2. For t = 1 to T:
   2.1. Fit a model to the current dataset.
   2.2. Upweight poorly predicted instances.
   2.3. Downweight well-predicted instances.
3. Merge the models in the sequence to obtain the final model.
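The generic loop can be made concrete in AdaBoost style (one specific choice of up/down-weighting and merging, not the only one). This sketch is illustrative; it assumes binary labels coded as -1/+1 and scikit-learn decision stumps as the base learner.

```python
# AdaBoost-style instantiation of the generic boosting loop (illustrative).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_fit(X, y, T=20):
    n = len(X)
    w = np.full(n, 1.0 / n)                     # 1. equal instance weights
    models, alphas = [], []
    for _ in range(T):                          # 2. for t = 1..T
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # 2.1 fit a model to the weighted data
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # model weight for the merge
        w *= np.exp(-alpha * y * pred)          # 2.2/2.3 upweight errors, downweight correct
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boost_predict(models, alphas, X_new):
    # 3. merged model: weighted vote of the round models (labels in {-1, +1}).
    scores = sum(a * m.predict(X_new) for m, a in zip(models, alphas))
    return np.sign(scores)
```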