1 Data Mining - Volinsky - 2011 - Columbia University — Topic 10: Ensemble Methods

2 Ensemble Models - Motivation
Remember this picture? We are always looking for a balance between low complexity ('good on average' but poor for prediction) and high complexity ('good for specific cases' but prone to overfitting).
By combining many different models, ensembles make it easier to hit the 'sweet spot' of modelling.
It is best for the models to draw on diverse, independent opinions – the Wisdom of Crowds.
[figure: training error and test error as a function of a model-complexity parameter]

3 Ensemble Methods - Motivation
Models are just models – usually not true! The truth is often much more complex than any single model can capture.
Combinations of simple models can be arbitrarily complex (e.g. spam/robot models, neural nets, splines).
Notion: an average of several measurements is often more accurate and stable than a single measurement.
– Accuracy: how well the model does for estimation and prediction
– Stability: small changes in inputs have little effect on outputs

4 Ensemble Methods – How They Work
The ensemble predicts a target value as an average or a vote of the predictions of several individual models:
– Each model is fit independently of the others
– The final prediction is a combination of the independent predictions of all models
For a continuous target, an ensemble averages the predictions (usually weighted).
For a categorical target (classification), an ensemble may average the probabilities of the target values, or may use 'voting'.
– Voting classifies a case into the class selected by the most individual models
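The two combination rules can be sketched in a few lines; all the numbers and class labels below are made-up illustrations, not from the slides:

```python
import numpy as np

# Hypothetical predictions from three independently fit models.
reg_preds = np.array([2.1, 1.9, 2.3])   # continuous target: average the predictions
weights = np.array([0.5, 0.3, 0.2])     # e.g. weights derived from holdout accuracy
ensemble_pred = np.average(reg_preds, weights=weights)
print(ensemble_pred)                    # weighted ensemble prediction

# Categorical target: majority vote over the predicted class labels.
clf_preds = ["spam", "ham", "spam"]
vote = max(set(clf_preds), key=clf_preds.count)
print(vote)                             # the class chosen by most models
```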

5 Ensemble Models – Why They Work
Voting example:
– 5 independent classifiers, each with 70% accuracy
– Use majority voting
– What is the probability that the ensemble model is correct? Let's simulate it.
– What about 100 classifiers?
– (Not a realistic example – why? Real models are fit to the same data, so their errors are correlated rather than independent.)
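The voting probability on this slide can also be computed exactly from the binomial distribution; a small sketch (using 101 classifiers as an odd-sized stand-in for the '100' case):

```python
from math import comb

def p_majority_correct(n, p):
    """P(majority vote is right) for n independent classifiers, each with accuracy p."""
    need = n // 2 + 1  # votes needed to win the majority (n odd)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

print(round(p_majority_correct(5, 0.7), 4))    # 0.8369 -- better than any single classifier
print(round(p_majority_correct(101, 0.7), 4))  # essentially 1.0
```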

6 Ensemble Schemes
The beauty is that you can average together models of any kind! You don't need fancy schemes – just average!
But there are fancy schemes; each has its own way of fitting many models to the same data and combining them by voting or averaging:
– Stacking (Wolpert 1992): fit many leave-one-out models
– Bagging (Breiman 1996): build models on many resampled versions of the original data
– Boosting (Freund & Schapire 1996): iteratively re-model, re-weighting the data based on the errors of the previous models
– Arcing (Breiman 1998), Bumping (Tibshirani 1997), Crumpling (Anderson & Elder 1998), Born-Again (Breiman 1998)
– Bayesian Model Averaging – near to my heart…
We'll explore BMA, bagging and boosting.

7 Ensemble Methods – Bayesian Model Averaging

8 Model Averaging
Idea: account for the inherent variance of the model selection process.
Posterior Variance = Within-Model Variance + Between-Model Variance
Data-driven model selection is risky: "part of the evidence is spent to specify the model" (Leamer, 1978).
Model-based inferences can be over-precise.

9 Model Averaging
For some quantity of interest Δ, average over all models M_k, given the data D:
P(Δ | D) = Σ_k P(Δ | M_k, D) P(M_k | D)
To calculate the first term properly, you need to integrate out the model parameters θ_k:
P(Δ | M_k, D) = ∫ P(Δ | θ_k, M_k, D) p(θ_k | M_k, D) dθ_k ≈ P(Δ | θ̂_k, M_k, D), where θ̂_k is the MLE.
For the second term, note that P(M_k | D) ∝ P(D | M_k) P(M_k), and P(D | M_k) can be approximated via BIC.

10 Bayesian Model Averaging
The approximations on the previous page let you calculate many posterior model probabilities quickly, and give you the weights to use for averaging.
But how do you know which models to average over?
– Example: regression with p candidate predictors
– Each subset of the p predictors is a 'model'
– 2^p possible models!
Idea: restrict the average to a manageable set of good models.
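As a sketch of how approximate posterior model probabilities become averaging weights, one common route uses BIC: with equal model priors, P(M_k | D) ∝ exp(−BIC_k / 2). The BIC values below are invented for illustration:

```python
import numpy as np

# Made-up BIC scores for three candidate models (lower BIC = better model).
bics = np.array([100.0, 102.5, 108.0])

w = np.exp(-0.5 * (bics - bics.min()))  # subtract the min for numerical stability
pmp = w / w.sum()                       # approximate posterior model probabilities
print(pmp.round(3))                     # weights for the model average, summing to 1
```

The BMA prediction is then just the weighted average of the individual model predictions, using `pmp` as weights.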

11 Model Averaging
But how do you find the best models without fitting all of them?
Solution: the Leaps and Bounds algorithm finds the best models without fitting every one.
– Goal: find the single best model for each model size
[figure: branch-and-bound search tree – you don't need to traverse a subtree when there is no way it can beat the best model found so far (AB)]

12 BMA - Example
PMP = Posterior Model Probability
[figure: table of the best models and their PMPs]
Score on holdout data: BMA wins.

13 Ensemble Methods - Boosting

14 Boosting
A different approach to model ensembles – mostly for classification.
Observed: when model predictions are not highly correlated, combining them does well.
Big idea: can we fit models specifically to the 'difficult' parts of the data?

15 Boosting – Algorithm (from HTF p. 339)
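The HTF algorithm (AdaBoost.M1) can be sketched from scratch with decision stumps as the weak learners; the toy 1-D dataset below is invented for illustration:

```python
import numpy as np

def stump_predict(X, j, t, s):
    """Predict +/-1 from a single split: sign s applied to (feature j > threshold t)."""
    return s * np.where(X[:, j] > t, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively find the decision stump with the lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                err = w[stump_predict(X, j, t, s) != y].sum()
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best, best_err

def adaboost(X, y, M=20):
    """AdaBoost.M1 as in HTF Algorithm 10.1."""
    w = np.full(len(y), 1.0 / len(y))        # start with equal observation weights
    models = []
    for _ in range(M):
        (j, t, s), err = fit_stump(X, y, w)  # fit a weak learner to the weighted data
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = np.log((1 - err) / err)      # model weight: accurate stumps vote louder
        w = w * np.exp(alpha * (stump_predict(X, j, t, s) != y))  # up-weight the errors
        w = w / w.sum()
        models.append((alpha, j, t, s))
    return models

def predict(models, X):
    # Final classifier: sign of the alpha-weighted vote of all stumps.
    return np.sign(sum(a * stump_predict(X, j, t, s) for a, j, t, s in models))

# Toy 1-D problem that no single stump can solve: +1 inside an interval.
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = np.where(np.abs(X[:, 0]) < 1, 1, -1)
models = adaboost(X, y, M=20)
print((predict(models, X) == y).mean())      # training accuracy of the ensemble
```

Each round re-weights the points the previous stumps got wrong, so later stumps concentrate on the 'difficult' parts of the data.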

16 Example (courtesy M. Littman)

17 Example (courtesy M. Littman)

18 Example (courtesy M. Littman)

19 Boosting - Advantages
– Fast algorithms (e.g. AdaBoost)
– Flexible: can work with any classification algorithm
– Individual models don't have to be good
– In fact, the method works best with 'bad' models (bad = only slightly better than random guessing)
– Most common base model: 'boosted stumps' (single-split trees)

20 Boosting Example (from HTF p. 302)

21 Ensemble Methods – Bagging / Stacking

22 Bagging for Combining Classifiers
Bagging = Bootstrap aggregating
Big idea: to avoid overfitting to the specific dataset, fit models to 'bootstrapped' random sets of the data.
Bootstrap:
– Random sample, with replacement, from the data set
– Size of sample = size of data
– X = (1,2,3,4,5,6,7,8,9,10)
– B1 = (1,2,3,3,4,5,6,6,7,8)
– B2 = (1,1,1,1,2,2,2,5,6,8)
– …
Bootstrap samples have (approximately) the same statistical properties as the original data.
By creating similar datasets you can see how much stability there is in your data; if stability is lacking, averaging helps.
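The bootstrap step can be sketched directly; the seed is arbitrary, so the particular samples drawn will differ from the slide's B1 and B2:

```python
import random

random.seed(0)
X = list(range(1, 11))                    # original data, as on the slide
B1 = sorted(random.choices(X, k=len(X)))  # sample WITH replacement, same size as X
B2 = sorted(random.choices(X, k=len(X)))
print(B1)  # some values repeat, some are missing entirely
print(B2)
```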

23 Bagging
Training data set of size N:
– Generate B 'bootstrap' sampled data sets, each of size N
– Build B models (e.g., trees), one for each bootstrap sample
– Intuition: the bootstrapping 'perturbs' the data enough to make the models more resistant to true variability
– Note: only ~63% of the data points appear in any given bootstrap sample – you can use the rest as an out-of-sample estimate!
For prediction, combine the predictions from the B models:
– Voting or averaging, evaluated on the 'out-of-bag' sample
– Plus: generally improves accuracy for models such as trees
– Minus: you lose interpretability
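The ~63% figure follows from the chance that a given point appears at least once in a bootstrap sample of size N, which is 1 − (1 − 1/N)^N and tends to 1 − 1/e ≈ 0.632:

```python
import math

# Probability that a given data point is included in a bootstrap sample of size N.
for N in (10, 100, 10000):
    print(N, round(1 - (1 - 1 / N)**N, 4))
print("limit:", round(1 - 1 / math.e, 4))  # 0.6321
```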

24 Bagging Example (from HTF p. 285)

25 Ensemble Methods – Random Forests

26 Random Forests
Trees are great, but:
– As we've seen, they are 'unstable'
– Trees are sensitive to the primary split, which can lead the tree in inappropriate directions
– One way to see this: fit a tree to a random sample, or a bootstrapped sample, of the data

27 Example of Tree Instability (from G. Ridgeway, 2003)

28 Random Forests
Solution: random forests – an ensemble of decision trees.
– Similar to bagging: inject randomness to overcome instability
– Each tree is built on a random (bootstrapped) subset of the training data
– At each split point, only a random subset of the predictors is considered
– Use the 'out-of-bag' holdout sample to estimate the size of each tree
– Prediction is simply a majority vote of the trees (or the mean prediction of the trees)
Randomizing the variables used is the key – it reduces correlation between the models!
Random forests have the advantages of trees, with more robustness and a smoother decision rule.
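A sketch using scikit-learn (assuming it is available); its `max_features` and `oob_score` options correspond directly to the per-split predictor subsetting and out-of-bag evaluation described on this slide:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Random forest = bagged trees + a random subset of predictors at each split.
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subset of predictors considered per split
    oob_score=True,       # use out-of-bag points as a built-in holdout estimate
    random_state=0,
)
rf.fit(X, y)
print(round(rf.oob_score_, 3))  # out-of-bag accuracy estimate
```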

29 HTF Example (p. 589)

30 Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32.

31 Random Forests – How Big A Tree?
Breiman's original algorithm said: "to keep bias low, trees are to be grown to maximum depth."
However, empirical evidence typically shows that 'stumps' do best.

32 Ensembles – Main Points
Averaging models together has been shown to be effective for prediction.
Many weird names – see the papers by Leo Breiman (e.g. "Bagging Predictors", "Arcing the Edge", and "Random Forests") for more detail.
Key points:
– Models average well if they are uncorrelated
– You can inject randomness to ensure the models are uncorrelated
– Averaging many small models works better than one large model
Ensembles can also give more insight into the variables than a single tree: variables that show up again and again must be good.

33 Visualizing Forests
Data: Wisconsin Breast Cancer (courtesy S. Urbanek)

34 [figure: forest visualization]

35 [figure: forest visualization]

36 References
– Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32.
– Hastie, Tibshirani, Friedman (HTF), The Elements of Statistical Learning – Chapters 8, 10, 15, 16

