
1 Random aggregated and bagged ensembles of SVMs: an empirical bias-variance analysis
Giorgio Valentini, e-mail: valentini@dsi.unimi.it
DSI – Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
MCS 2004 - Multiple Classifier Systems, Cagliari, 9-11 June 2004

2 Goals
- Developing methods and procedures to estimate the bias-variance decomposition of the error in ensembles of learning machines.
- A quantitative evaluation of the variance reduction property in random aggregated and bagged ensembles (Breiman, 1996).
- A characterization of the bias-variance (BV) decomposition of the error in bagged and random aggregated ensembles of SVMs, comparing the results with the BV decomposition in single SVMs (Valentini and Dietterich, 2004).
- Getting insights into the reasons why the ensemble method Lobag (Valentini and Dietterich, 2003) works.
- Getting insights into the reasons why random subsampling techniques work with large data mining problems (Breiman, 1999; Chawla et al., 2002).

3 Random aggregated ensembles
Let $D = \{(x_j, t_j)\}$, $1 \le j \le m$, be a set of $m$ samples drawn identically and independently from a population $U$ according to $P$, where $P(x, t)$ is the joint distribution of the data points in $U$.
Let $L$ be a learning algorithm, and define $f_D = L(D)$ as the predictor produced by $L$ applied to a training set $D$. The model produces a prediction $f_D(x) = y$.
Suppose that a sequence of learning sets $\{D_k\}$ is given, each drawn i.i.d. from the same underlying distribution $P$. Breiman proposed to aggregate the predictors $f_{D_k}$ trained on different samples drawn from $U$ to get a better predictor $f_A(x, P)$.
For classification problems $t_j \in S \subset \mathbb{N}$, and $f_A(x, P) = \arg\max_j |\{k \mid f_{D_k}(x) = j\}|$.
As the training sets $D_k$ are randomly drawn from $U$, we name the procedure to build $f_A$ random aggregating.
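As a small illustration (my own, not part of the original slides), the aggregation rule $f_A(x, P) = \arg\max_j |\{k \mid f_{D_k}(x) = j\}|$ is simply a majority vote over the base predictions for a given point; a minimal sketch in Python:

```python
from collections import Counter

def random_aggregate(base_predictions):
    """Majority vote over the class labels predicted by the base learners
    f_{D_k}(x) for a single input x; ties are broken arbitrarily."""
    return Counter(base_predictions).most_common(1)[0][0]

# Example: five base predictors classify the same point x
print(random_aggregate([1, 0, 1, 1, 0]))  # -> 1
```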

4 Random aggregation reduces variance
Considering regression problems, if $T$ and $X$ are random variables having joint distribution $P$, the expected squared loss $EL$ for the single predictor $f_D(X)$ is:
$EL = E_D[E_{T,X}[(T - f_D(X))^2]]$
while the expected squared loss $EL_A$ for the aggregated predictor is:
$EL_A = E_{T,X}[(T - f_A(X))^2]$
Breiman showed that $EL \ge EL_A$. This inequality depends on the instability of the predictions, that is, on how unequal the two sides of the following relation are:
$E_D[f_D(X)]^2 \le E_D[f_D^2(X)]$
There is a strict relationship between the instability and the variance of the base predictor. Indeed the variance $V(X)$ of the base predictor is:
$V(X) = E_D[(f_D(X) - E_D[f_D(X)])^2] = E_D[f_D^2(X)] - E_D[f_D(X)]^2$
Breiman also showed that in classification problems, as in regression, aggregating "good" predictors can lead to better performance, as long as the base predictor is unstable, whereas, unlike regression, aggregating poor predictors can lower performance.
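A quick numerical sketch of the relation above (my own synthetic example, not from the slides): the gap $EL - EL_A$ equals the average variance $E_X[V(X)]$ of the base predictor, so aggregation removes exactly the variance term. The polynomial base learner, sample sizes, and noise level are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n=30):
    """Draw one training set D from the population: t = sin(x) + noise."""
    x = rng.uniform(-3, 3, n)
    t = np.sin(x) + rng.normal(0, 0.3, n)
    return x, t

def fit_predictor(x, t, degree=5):
    """Unstable base learner: a degree-5 polynomial fit."""
    return np.polynomial.Polynomial.fit(x, t, degree)

# Fixed evaluation points and their noisy targets, i.e. (T, X) drawn from P
x_test = rng.uniform(-3, 3, 2000)
t_test = np.sin(x_test) + rng.normal(0, 0.3, 2000)

# Predictions of 200 base predictors, each trained on an independent D
preds = np.array([fit_predictor(*sample_training_set())(x_test) for _ in range(200)])

EL = np.mean((t_test - preds) ** 2)                 # E_D[E_{T,X}[(T - f_D(X))^2]]
EL_A = np.mean((t_test - preds.mean(axis=0)) ** 2)  # E_{T,X}[(T - f_A(X))^2]
variance = np.mean(preds.var(axis=0))               # E_X[V(X)]

print(f"EL = {EL:.4f}  EL_A = {EL_A:.4f}  EL - EL_A = {EL - EL_A:.4f}  avg. variance = {variance:.4f}")
# EL - EL_A and the average variance coincide up to floating-point error.
```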

5 How much does the variance reduction property hold for bagging too?
Breiman theoretically showed that random aggregating reduces variance. Bagging is an approximation of random aggregating, for at least two reasons:
1. Bootstrap samples are not "real" data samples: they are drawn from a data set D, which is in turn a sample from the population U. On the contrary, $f_A$ uses samples drawn directly from U.
2. Bootstrap samples are drawn from D according to a uniform probability distribution, which is only an approximation of the unknown true distribution P.
Two questions follow:
1. Does the variance reduction property hold for bagging too?
2. Can we provide a quantitative estimate of variance reduction both in random aggregating and bagging?

6 A quantitative estimate of the bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of learning machines
We developed procedures to quantitatively evaluate the bias-variance decomposition of the error according to Domingos' unified bias-variance theory (Domingos, 2000). We proposed three basic techniques (Valentini, 2003):
1. Out-of-bag or cross-validation estimates (when only small samples are available)
2. Hold-out techniques (when relatively large data sets are available)
In order to get a reliable estimate of the error we applied the second technique, evaluating the bias-variance decomposition on quite large test sets.
We summarize here the two main experimental steps to perform bias-variance analysis with resampling-based ensembles:
1. Procedures to generate data for ensemble training
2. Bias-variance decomposition of the error on a separate test set

7 Procedure to generate training samples for random aggregated ensembles. Procedure to generate training samples for bagged ensembles.
(The two procedures appear as figures on the original slides and are not reproduced in this transcript; a sketch follows below.)
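Since the original figures are not in the transcript, the following is a hedged reconstruction of what the two generation procedures typically look like (the synthetic data generator, function names, and parameters are my own assumptions): random aggregating draws each training set independently from the population U, whereas bagging draws bootstrap replicates from the single available set D.

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_from_population(m):
    """Stand-in for sampling m points (x, t) directly from the population U
    according to P -- here a synthetic two-class Gaussian problem."""
    t = rng.integers(0, 2, m)
    x = rng.normal(loc=t[:, None] * 2.0, scale=1.0, size=(m, 2))
    return x, t

def random_aggregating_samples(n_sets, m):
    """Training sets for random aggregated ensembles: each set is an
    independent sample of size m drawn from U."""
    return [draw_from_population(m) for _ in range(n_sets)]

def bagging_samples(X, y, n_sets):
    """Training sets for bagged ensembles: bootstrap replicates (sampling
    with replacement, uniform probability) of the single available set D."""
    m = len(y)
    sets = []
    for _ in range(n_sets):
        idx = rng.integers(0, m, m)  # m indices drawn with replacement
        sets.append((X[idx], y[idx]))
    return sets
```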

8 Procedure to estimate the bias-variance decomposition of the error in ensembles of learning machines
(The procedure appears as a figure on the original slide and is not reproduced in this transcript; a sketch of the core computation follows below.)
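As the original figure is missing, here is a hedged sketch of the core computation, following Domingos' decomposition for the 0/1 loss as used in Valentini and Dietterich (2004). The function name and the restriction to the noise-free two-class case are my own simplifying assumptions.

```python
import numpy as np

def bias_variance_01(predictions, t):
    """Bias-variance decomposition of the 0/1 loss on a hold-out test set
    (two-class, noise-free case, following Domingos, 2000).

    predictions: array of shape (n_models, n_test) with class labels in {0, 1}
    t:           array of shape (n_test,) with the true labels
    """
    predictions = np.asarray(predictions)
    t = np.asarray(t)
    # Main prediction: the class predicted most often for each test point
    main = (predictions.mean(axis=0) >= 0.5).astype(int)
    bias = (main != t).astype(float)                # B(x) in {0, 1}
    variance = (predictions != main).mean(axis=0)   # V(x): P_D(f_D(x) != main)
    unbiased_var = np.mean(np.where(bias == 0, variance, 0.0))
    biased_var = np.mean(np.where(bias == 1, variance, 0.0))
    error = (predictions != t).mean()               # average 0/1 loss
    # In the two-class noise-free case: error = bias + unbiased_var - biased_var
    return dict(error=error, bias=bias.mean(),
                unbiased_variance=unbiased_var, biased_variance=biased_var,
                net_variance=unbiased_var - biased_var)
```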

9 Comparison of the bias-variance decomposition of the error in random aggregated (RA) and bagged ensembles of SVMs on 7 two-class classification problems
(Figure panels: Gaussian kernels and linear kernels.) Results represent changes relative to single SVMs (e.g. zero change means no difference). Lines labeled with squares refer to random aggregated ensembles, lines labeled with triangles to bagged ensembles.
In random aggregated ensembles the error decreases from 15 to 70% w.r.t. single SVMs, while in bagged ensembles the error decreases from 0 to 15%, depending on the data set. Variance is significantly reduced in RA ensembles (about 90%), while in bagging the variance reduction is quite limited compared to the RA decrement (between 0 and 35%). No substantial bias reduction is registered.

10 Characterization of the bias-variance decomposition of the error in random aggregated ensembles of SVMs (Gaussian kernel)
(Figure: lines labeled with crosses refer to single SVMs, lines labeled with triangles to RA SVM ensembles.)

11 Lobag works when unbiased variance is relatively high
Lobag (Low bias bagging) is a variant of bagging that uses low-bias base learners selected through bias-variance analysis procedures (Valentini and Dietterich, 2003).
Our experiments with bagging show the reasons why Lobag works: bagging lowers variance, but the bias remains substantially unchanged. Hence, by selecting low-bias base learners, Lobag reduces both bias (through bias-variance analysis) and variance (through classical aggregation techniques); a sketch of this selection-then-bagging scheme is given below.
Valentini and Dietterich experimentally showed that Lobag is effective, with SVMs as base learners, when small-sized samples are used, that is, when the variance due to the reduced cardinality of the available data is relatively high. But when we have relatively large data sets, we may expect that Lobag does not outperform bagging (because in this case, on average, the unbiased variance will be relatively low).
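A minimal sketch of the Lobag idea (my own simplification, not the original algorithm): estimate the bias of each candidate base SVM from bootstrap-trained models on a validation set, pick the lowest-bias setting, then bag it. The hyperparameter grid, the validation-set bias estimate, and the use of scikit-learn are assumptions; the original Lobag uses out-of-bag bias-variance estimates.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

def lobag_sketch(X, y, X_val, y_val, gammas=(0.01, 0.1, 1.0, 10.0), n_estimators=50):
    """Select the candidate SVM with the lowest estimated bias, then bag it
    (two-class labels in {0, 1})."""
    rng = np.random.default_rng(0)
    best_gamma, best_bias = None, np.inf
    for gamma in gammas:
        preds = []
        for _ in range(30):  # bootstrap replicates for the bias estimate
            idx = rng.integers(0, len(y), len(y))
            model = SVC(kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
            preds.append(model.predict(X_val))
        main = (np.mean(preds, axis=0) >= 0.5).astype(int)  # main prediction per point
        bias = np.mean(main != y_val)                        # estimated bias under 0/1 loss
        if bias < best_bias:
            best_gamma, best_bias = gamma, bias
    # Bag the selected low-bias base learner
    return BaggingClassifier(SVC(kernel="rbf", gamma=best_gamma),
                             n_estimators=n_estimators).fit(X, y)
```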

12 Why do random subsampling techniques work with large databases?
Breiman proposed random subsampling techniques for classification in large databases, using decision trees as base learners (Breiman, 1999), and these techniques have also been successfully applied in distributed environments (Chawla et al., 2002).
Random aggregating can also be interpreted as a technique to draw small subsamples from a large population, train the base learners on them, and then aggregate them, e.g. by majority voting.
Our experiments on random aggregated ensembles show that the variance component of the error is strongly reduced, while the bias remains unchanged or is lowered, giving insights into the reasons why random subsampling techniques work with large data mining problems. In particular, our experimental analysis suggests applying SVMs trained on small subsamples when large databases are available or when the data are fragmented across distributed systems; a sketch is given below.
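A hedged sketch of the subsampling scheme suggested here (names, subsample size, and the scikit-learn SVM are illustrative assumptions): each SVM is trained on a small random subsample of the large data set and the ensemble predicts by majority vote.

```python
import numpy as np
from sklearn.svm import SVC

def subsample_vote_sketch(X, y, n_learners=25, subsample_size=500):
    """Train one SVM per small random subsample of a large data set and
    predict by majority vote (two-class labels in {0, 1})."""
    rng = np.random.default_rng(0)
    models = []
    for _ in range(n_learners):
        idx = rng.choice(len(y), size=subsample_size, replace=False)
        models.append(SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx]))

    def predict(X_new):
        votes = np.mean([m.predict(X_new) for m in models], axis=0)
        return (votes >= 0.5).astype(int)

    return predict
```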

13 Conclusions
We showed how to apply bias-variance decomposition techniques to the analysis of bagged and random aggregated ensembles of learning machines. These techniques have been applied to the analysis of bagged and random aggregated ensembles of SVMs, but can be directly applied to a large set of ensemble methods.*
The experimental analysis shows that random aggregated ensembles significantly reduce the variance component of the error w.r.t. single SVMs, but this property only partially holds for bagged ensembles.
The empirical bias-variance analysis also gives insights into the reasons why Lobag works, highlighting on the other hand some limitations of the Lobag approach.
The bias-variance analysis of random aggregated ensembles also highlights the reasons for their successful application to large-scale data mining problems.
* The C++ classes and applications to perform BV analysis are freely available at: http://homes.dsi.unimi.it/~valenti/sw/NEURObjects

14 References
Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123-140.
Breiman, L.: Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning 36 (1999) 85-103.
Chawla, N., Hall, L., Bowyer, K., Moore, T., Kegelmeyer, W.: Distributed pasting of small votes. In: MCS 2002, Cagliari, Italy. Vol. 2364 of Lecture Notes in Computer Science, Springer-Verlag (2002) 52-61.
Domingos, P.: A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In: Proc. 17th National Conference on Artificial Intelligence, Austin, TX, AAAI Press (2000) 564-569.
Valentini, G., Dietterich, T.G.: Low Bias Bagged Support Vector Machines. In: ICML 2003, Washington D.C., USA, AAAI Press (2003) 752-759.
Valentini, G.: Ensemble methods based on bias-variance analysis. PhD thesis, DISI, Università di Genova, Italy (2003), ftp://ftp.disi.unige.it/person/ValentiniG/Tesi/finalversion/vale-th-2003-04.pdf.
Valentini, G., Dietterich, T.G.: Bias-variance analysis of Support Vector Machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research (2004) (accepted for publication).

