Classification with Multiple Decision Trees

Classification with Multiple Decision Trees
CV-2003
Eran Shimonovitz, Ayelet Akselrod-Ballin

Plan
- Basic framework: query selection, impurity, stopping criteria
- Intermediate summary
- Combining multiple trees: Y. Amit & D. Geman's approach; randomization, bagging, boosting
- Applications

Introduction
A general classifier uses measurements made on an object to assign the object to a category.

Some popular classification methods
- Nearest Neighbor Rule
- Bayesian Decision Theory
- Fisher Linear Discriminant
- SVM
- Neural Network

Formulation
- x = (x1, x2, …, xd) ∈ X: a measurement vector, pre-computed for each data point.
- C = {1, …, J}: the set of J classes; Y(x) denotes the label of x.
- L = {(x1, y1), …, (xN, yN)}: the learning sample.
- Data patterns can be ordered (numerical, real-valued) or categorical (nominal lists of attributes).
- A classification rule is a function defined on X so that for every x, Ŷ(x) is equal to one of {1, …, J}.

The goal is to construct a classifier Ŷ such that the misclassification rate P(Ŷ(X) ≠ Y(X)) is as small as possible.

Basic framework
CART (Classification and Regression Trees), Breiman and colleagues, 1984. Trees are constructed by repeated splits of subsets of X into descendant subsets; a tree consists of a root, internal nodes with sub-trees, and leaves.

Split number: binary or multi-valued. Every tree can be represented using only binary decisions (Duda, Hart & Stork 2001).

Query selection & impurity
P(ωj) denotes, more precisely, the fraction of patterns at node T in category ωj. An impurity function Φ is a nonnegative function defined on the set of all J-tuples (p1, …, pJ) with pj ≥ 0 and Σj pj = 1. Example tuples: (1/6, 1/6, 1/6, 1/6, 1/6, 1/6) is maximally impure, while (1/3, 1/3, 1/3, 0, 0, 0) and (0, 0, 0, 1/3, 1/3, 1/3) are purer.

Impurity properties
- Φ is maximal when all categories are equally represented.
- Φ = 0 if all patterns that reach the node bear the same category.
- Φ is a symmetric function of pω1, …, pωJ.
Given Φ, define the impurity measure i(T) at any node T as Φ applied to the class fractions P(ω1 | T), …, P(ωJ | T).

Entropy impurity
Example: a root node with class fractions (8/16, 8/16) is split by the query X1 < 0.6. One child receives 10/16 of the points with class fractions (7/10, 3/10) and impurity i(T) = 0.88; the other receives 6/16 of the points with class fractions (1/6, 5/6) and impurity i(T) = 0.65. [Scatter plot of the training points in the (X1, X2) plane omitted.]
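For reference, a sketch of the entropy impurity in its standard form (as in Duda, Hart & Stork), which reproduces the values quoted above:

```latex
i(T) = -\sum_{j=1}^{J} P(\omega_j \mid T)\,\log_2 P(\omega_j \mid T)
% e.g. P = (7/10, 3/10) gives i(T) \approx 0.88, and P = (1/6, 5/6) gives i(T) \approx 0.65
```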

Entropy impurity (cont.)
[Figure: the decision tree obtained with entropy impurity, with splits X2 < 0.32, X1 < 0.35, X2 < 0.61, X1 < 0.69 and leaves labeled ω1 and ω2.]

Other impurity functions
- Variance impurity (two-class case): i(T) = P(ω1)P(ω2).
- Gini impurity: i(T) = Σ_{i≠j} P(ωi)P(ωj) = 1 − Σ_j P(ωj)².
- Misclassification impurity: i(T) = 1 − max_j P(ωj).
In practice the choice of impurity measure does not noticeably affect the overall performance.
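A minimal sketch of these impurity measures under their standard definitions; `p` is assumed to be the sequence of per-class fractions at a node (summing to 1):

```python
import math
from typing import Sequence

def entropy_impurity(p: Sequence[float]) -> float:
    """i(T) = -sum_j p_j log2 p_j (terms with p_j = 0 contribute 0)."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def gini_impurity(p: Sequence[float]) -> float:
    """i(T) = sum_{i != j} p_i p_j = 1 - sum_j p_j^2."""
    return 1.0 - sum(pj * pj for pj in p)

def misclassification_impurity(p: Sequence[float]) -> float:
    """i(T) = 1 - max_j p_j."""
    return 1.0 - max(p)

def variance_impurity(p: Sequence[float]) -> float:
    """Two-class variance impurity: i(T) = p_1 * p_2."""
    assert len(p) == 2, "defined here for the two-class case"
    return p[0] * p[1]

# Example: the node with class fractions (7/10, 3/10) from the entropy example above.
print(entropy_impurity([0.7, 0.3]))            # ~0.881
print(gini_impurity([0.7, 0.3]))               # 0.42
print(misclassification_impurity([0.7, 0.3]))  # 0.3
```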

Goodness of split
Defined as the decrease in impurity Δi(s, T) = i(T) − P_L·i(T_L) − P_R·i(T_R), where P_L and P_R are the fractions of patterns at node T that the split s sends to the left child T_L and the right child T_R. Select the split that maximizes Δi(s, T). This is a greedy method: local optimization at each node. With entropy impurity, maximizing Δi amounts to minimizing the conditional entropy of the class given the query.

Entropy formulation
The vector of predictors is assumed binary. For each predictor f, calculate the conditional entropy of the class given X_f, and select the query with the smallest conditional entropy.
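As a sketch in standard notation (Y denotes the class label; the outer sum runs over the two values of the binary predictor X_f):

```latex
H(Y \mid X_f) = \sum_{v \in \{0,1\}} P(X_f = v)
  \left[ -\sum_{c=1}^{J} P(Y = c \mid X_f = v)\,\log_2 P(Y = c \mid X_f = v) \right]
```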

Stopping criteria
- The best candidate split at a node reduces the impurity by less than a threshold.
- A lower bound on the number or percentage of points at a node.
- Validation and cross-validation.
- Statistical significance of the impurity reduction.
Trade-off: growing the tree fully until minimum impurity leads to overfitting, while stopping splitting too early leaves the error insufficiently low.

Recognizing overfitting
[Plot: accuracy on the training data and on the test data as a function of tree size (number of nodes).]

Assignment of leaf labels
When a leaf node has positive impurity, it is labeled by the category that has the most points at that node.

Recursive partitioning scheme
1. If the stopping criterion is met, declare the node a leaf and label it with the most common class.
2. Otherwise, select the attribute A that maximizes the impurity reduction (computed from P(j | N) and i(N)).
3. For each possible value of A, add a new branch and build a sub-tree below it recursively.
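A minimal sketch of this scheme for binary splits on numerical features, using the entropy impurity above; the candidate thresholds (midpoints of sorted feature values) and the particular stopping criteria (depth and node-size limits) are illustrative choices rather than part of the original scheme:

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence

def entropy(labels: Sequence[int]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

@dataclass
class Node:
    label: Optional[int] = None        # set for leaf nodes
    feature: Optional[int] = None      # query: x[feature] < threshold
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def grow_tree(X, y, depth=0, max_depth=8, min_points=5) -> Node:
    # Stopping criteria: pure node, too few points, or depth limit reached.
    if len(set(y)) == 1 or len(y) < min_points or depth >= max_depth:
        return Node(label=Counter(y).most_common(1)[0][0])

    parent_impurity, best = entropy(y), None
    for f in range(len(X[0])):
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for xi, yi in zip(X, y) if xi[f] < t]
            right = [yi for xi, yi in zip(X, y) if xi[f] >= t]
            p_l = len(left) / len(y)
            # Goodness of split: decrease in impurity i(T) - P_L i(T_L) - P_R i(T_R).
            gain = parent_impurity - p_l * entropy(left) - (1 - p_l) * entropy(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)

    if best is None or best[0] <= 1e-12:   # no split reduces the impurity
        return Node(label=Counter(y).most_common(1)[0][0])

    _, f, t = best
    li = [i for i, xi in enumerate(X) if xi[f] < t]
    ri = [i for i, xi in enumerate(X) if xi[f] >= t]
    return Node(feature=f, threshold=t,
                left=grow_tree([X[i] for i in li], [y[i] for i in li],
                               depth + 1, max_depth, min_points),
                right=grow_tree([X[i] for i in ri], [y[i] for i in ri],
                                depth + 1, max_depth, min_points))

def predict(tree: Node, x) -> int:
    while tree.label is None:
        tree = tree.left if x[tree.feature] < tree.threshold else tree.right
    return tree.label
```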

[Figure: an example decision tree with splits X2 < 0.83, X1 < 0.27, X1 < 0.89, X2 < 0.34, X2 < 0.56, X1 < 0.09, X1 < 0.56 and leaves labeled ω1 and ω2.]

Preprocessing - PCA
[Figure: a single linear split, −0.8·X1 + 0.6·X2 < 0.3, separating classes ω1 and ω2.]

Popular tree algorithms
- ID3, the third "interactive dichotomizer" (Quinlan 1983)
- C4.5, a descendant of ID3 (Quinlan 1993)
- C5.0

Pros & cons
Pros: interpretability and good insight into the data structure; rapid classification; handles multi-class problems; low space complexity; can be refined without reconstructing the whole tree; natural to incorporate prior expert knowledge.
Cons: instability (sensitivity to the training points, a result of the greedy process); training time; sensitivity to overtraining; difficult to understand when large.

Combining multiple classification trees
Main problem: stability. Small changes in the training set cause large changes in the classifier.
Solution: grow multiple trees instead of just one and then combine the information. The aggregation produces a significant improvement in accuracy.

Protocols for generating multiple classifiers
- Randomization of the queries at each node.
- Boosting: sequential reweighting (AdaBoost).
- Bagging: bootstrap aggregation.

Y. Amit & D. Geman's approach
Shape quantization and recognition with randomized trees, Neural Computation, 1997. Shape recognition based on shape features and tree classifiers. The goal: to select the informative shape features and build tree classifiers.

Randomization
At each node:
- Choose a random sample of predictors from the whole candidate collection.
- Estimate the optimal predictor using a random sample of data points.
The sizes of these two random samples are parameters.
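A small sketch of this idea as it could be plugged into the node-splitting step of the partitioning sketch above; the parameter names n_features and n_points are illustrative, not taken from the paper:

```python
import random

def random_node_sample(X, y, candidate_features, n_features=20, n_points=200, rng=random):
    """Restrict the split search at a node to random subsets of queries and data points."""
    feats = rng.sample(candidate_features, min(n_features, len(candidate_features)))
    idx = rng.sample(range(len(y)), min(n_points, len(y)))
    X_sub = [X[i] for i in idx]
    y_sub = [y[i] for i in idx]
    # The best split is then estimated only over `feats`, using (X_sub, y_sub).
    return feats, X_sub, y_sub
```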

Multiple classification trees
Different trees correspond to different aspects of the shapes, characterizing them from different "points of view". The trees are statistically weakly dependent due to the randomization.

Aggregation
After producing N trees T1, …, TN, classify by maximizing the average terminal distribution P: each test point is dropped down each tree to a terminal node t, where the class distribution is estimated from L_t(c), the set of training points of class c at node t.
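In symbols, a sketch consistent with this description, where t_n(x) denotes the terminal node reached by x in tree T_n:

```latex
\hat{Y}(x) = \arg\max_{c} \; \frac{1}{N} \sum_{n=1}^{N} P_{T_n}\!\bigl(c \mid t_n(x)\bigr),
\qquad
P_{T_n}(c \mid t) = \frac{|L_t(c)|}{\sum_{c'} |L_t(c')|}
```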

[Diagram: a test point is dropped down each of the trees T1, T2, …, Tn, and the terminal distributions are combined to estimate its class ω.]

Data
Classification examples: handwritten digits and LaTeX symbols. Binary images of 2D shapes, all registered to a fixed 32x32 grid, with considerable within-class variation. (Y. Amit & D. Geman)

Handwritten digits – NIST (National institute of standards and technology) 223,000 binary images of isolated digits written by more than 2000 writers. 100,000 for training and 50,000 for testing. Y. Amit & D. Geman

LATEX Symbols 32 samples per class for all 293 classes. Synthetic deformations Y. Amit & D. Geman

Shape features
Each query corresponds to a spatial arrangement of local codes, "tags". Tags are coarse descriptions (5-bit codes) of the local topography of the intensity surface in the neighborhood of a pixel. The discriminating power comes from the relative angles and distances between tags. (Y. Amit & D. Geman)

Tags
4x4 sub-images are randomly extracted and recursively partitioned based on individual pixel values. There is a tag type for each node of the resulting tree; if 5 questions are asked, there are 2 + 4 + 8 + 16 + 32 = 62 tags.

Tags (cont.)
Tag 16 is a depth-4 tag. The corresponding 4 questions are indicated by a mask over the sub-image, where 0 = background, 1 = object and n = "not asked". These neighborhoods are loosely described as "background to the lower left, object to the upper right". (Y. Amit & D. Geman)

Spatial arrangement of local features The arrangement A is a labeled hyper-graph. Vertex labels correspond to the tag types and edge labels to relations. Directional and distance constraints. Query: whether such an arrangement exists anywhere in the image. Y. Amit & D. Geman

Example of node splitting The minimal extension of an arrangement A means the addition of one relation between existing tags, or the addition of exactly one tag and one relation binding the new tag to the existing one. Y. Amit & D. Geman

The trees are grown by the scheme described … Y. Amit & D. Geman

Importance of multiple randomized trees
Graphs found in the terminal nodes of five different trees. (Y. Amit & D. Geman)

Experiment - NIST
Stopping: nodes are split as long as at least m points remain in the second-largest class. Q is the number of random queries per node; a random sample of 200 training points is used per node. 25-100 trees are produced, of depth 10 on average. (Y. Amit & D. Geman)

Results
The best error rate with a single tree is 5%; the average classification rate per tree is about 91%. By aggregating trees the classification rate climbs above 99%, giving state-of-the-art error rates. Classification rates by rejection level: with 25 trees, 99.8% / 99.5% / 99.2% at rejection rates of 3% / 2% / 1%; with 50 trees the reported rates are 99.6% and 99.3%. (Y. Amit & D. Geman)

Conclusions
- Stability & accuracy: combining multiple trees leads to a drastic decrease in error rates relative to the best individual tree.
- Efficiency: fast training and testing.
- The output of the trees lends itself to visual interpretation.
- Few parameters, and insensitivity to parameter settings.
(Y. Amit & D. Geman)

Conclusions (cont.)
The approach is not model-based and does not involve advanced geometry or extraction of boundary information. A missing aspect: features from more than one resolution. The most successful handwritten character recognition was reported by LeCun et al. 1998 (99.3%), using a multi-layer feed-forward network based on raw pixel intensities. (Y. Amit & D. Geman)

Voting tree learning algorithms
A family of protocols for producing and aggregating multiple classifiers. They improve predictive accuracy for unstable procedures by manipulating the training data in order to generate different classifiers. Methods: bagging, boosting.

Bagging
The name derives from "bootstrap aggregation". A bootstrap data set is created by randomly selecting points from the training set with replacement. Bootstrap estimation: the selection process is independently repeated, and the resulting data sets are treated as independent. (Bagging Predictors, Leo Breiman, 1996)

Bagging algorithm
1. Select a bootstrap sample L_B from L.
2. Grow a decision tree from L_B.
3. Repeat, and estimate the class of each x_n by a plurality vote over the trees; the error is the number of points whose estimated class differs from the true class.
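A minimal sketch of this procedure; grow_tree and predict stand for any tree learner and its prediction routine (for instance the recursive-partitioning sketch above), and the number of trees is an illustrative default:

```python
import random
from collections import Counter

def bagging_fit(X, y, grow_tree, n_trees=50, seed=0):
    """Grow n_trees trees, each from a bootstrap sample of (X, y) drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]   # bootstrap sample L_B
        trees.append(grow_tree([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def bagging_predict(trees, predict, x):
    """Plurality vote over the individual trees' predictions."""
    votes = Counter(predict(t, x) for t in trees)
    return votes.most_common(1)[0][0]

def error_count(trees, predict, X_test, y_test):
    """Number of test points whose estimated class differs from the true class."""
    return sum(bagging_predict(trees, predict, x) != yt for x, yt in zip(X_test, y_test))
```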

Bagging data sets: UCI Machine Learning Repository.

Results: error rates are the averages over 100 iterations.

C4.5 vs. bagging
[Scatter plot comparing the error rate of C4.5 with that of bagged C4.5 on data sets from the UCI repository of machine learning databases; from Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998.]

Boosting
Improves the accuracy of a given learning algorithm (Schapire 1989). Classifiers are produced with dependence on the previously generated classifiers, forming an ensemble whose joint decision rule has higher accuracy.

Boosting: basic procedure
[Figure: a 2-dimensional, 2-category problem; the final classification is obtained by voting of 3 component classifiers.]

Train classifier C1 with D1 Randomly select a subset of patterns from the training set. Train the first classifier with this subset. Classifier C1 is a weak classifier. Boosting- Basic procedure

Train classifier C2 with D2
Find a second training set that is the "most informative" given C1: half of it should be correctly classified by C1, and half incorrectly classified by C1. Train a second classifier, C2, with this set.

Train classifier C3 with D3 Seek a third data set which is not well classified by voting by C1 and C2. Train the third classifier, C3, with this set. Boosting- Basic procedure

Ensemble of classifiers
Classifying a test pattern is based on votes: if C1 and C2 agree on a label, use that label; if they disagree, use the label given by C3.

AdaBoost
"Adaptive Boosting" (Freund & Schapire 1995) is the most popular variation on basic boosting. It focuses on the "difficult" patterns by maintaining a weight vector on the training data.

A weak learner
Input: a weighted training set (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn), where the xi are feature vectors, the yi are binary labels, and the non-negative weights wi sum to 1. Output: a weak rule h. The weak requirement: h must do better than random guessing, i.e. its weighted error must be below 1/2. (An Introduction to Boosting, Yoav Freund)

The key idea
Start from uniform weights (x1, y1, 1/n), …, (xn, yn, 1/n) and learn a weak rule h1; reweight the examples to obtain (x1, y1, w1), …, (xn, yn, wn) and learn h2; repeat up to hT. The final rule is Sign[a1·h1 + a2·h2 + … + aT·hT].

AdaBoost (learning)
(A Brief Introduction to Boosting, Robert E. Schapire, 1999)
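A sketch of the learning loop for binary labels y_i in {−1, +1}, following the standard discrete AdaBoost update; weak_learner stands for any routine that fits a weak rule to the weighted sample and is not part of the slides:

```python
import math

def adaboost(X, y, weak_learner, T=50):
    """Discrete AdaBoost for labels y_i in {-1, +1}.

    weak_learner(X, y, w) must return a callable h with h(x) in {-1, +1}.
    Returns the ensemble as a list of (alpha_t, h_t) pairs.
    """
    n = len(y)
    w = [1.0 / n] * n                      # start from uniform weights
    ensemble = []
    for _ in range(T):
        h = weak_learner(X, y, w)
        # Weighted error of the weak rule under the current weights.
        eps = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
        if eps >= 0.5:                     # weak requirement violated: stop early
            break
        eps = max(eps, 1e-12)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Reweight: misclassified ("difficult") points get larger weights.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    """Final rule: Sign[sum_t alpha_t * h_t(x)]."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```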

C4.5 vs. boosting
[Scatter plot comparing the error rate of C4.5 with that of boosted C4.5 on data sets from the UCI repository of machine learning databases.]

Shift in mind
Instead of trying to design a learning algorithm that is accurate over the entire space, focus on finding weak learning algorithms that are only better than random.

Amit & Geman approach: generate different trees using randomization. Alternative protocol: produce multiple trees using boosting.

Randomized boosting
Reweighting is done using 1/e_n. Aggregation is either by averaging or by a weighted vote among the N trees.

Experiments (NIST)
Deep trees give a high training classification rate, and boosting performed worse than simple randomization. Setup: 5000 training points, Q (number of random queries per node) = 20, s (number of samples per node) = 15. Boosting is not applicable to pure trees, since their training error rate is 0.
Aggregate classification rates by stopping criterion (m = 1, 3, 10, 20): randomized trees 97.6%, 97.2%, 96.5%, 96.0%; boosting not applicable at m = 1, with 95.2% and 96.8% reported for the other settings.

Experiments (NIST)
Shallow trees leave data points of different classes that are hard to separate. Setup: 100,000 training points, 50,000 testing points.

Protocol                  m=50, Q=20   m=100
Boosting aggregate rate   99.07%       99.09%
Individual rate           69.3%        62.4%
Average depth             10.8         9.8

Conclusions
Randomization seems to miss some problematic points. AdaBoost does not always produce better results: it can overfit and is sensitive to small training sets. Better classification rates are obtained with randomized sequential reweighting. (Multiple Randomized Classifiers, Y. Amit & G. Blanchard, 2001)

Bagging, boosting and C4.5
Comparison of C4.5's mean error rate over 10 cross-validations with bagged C4.5 and boosted C4.5 (complete cross-validation of bagging vs. C4.5, boosting vs. C4.5, and boosting vs. bagging). The reported win counts over the data sets are 24/27, 21/27 and 20/27. (Bagging, Boosting and C4.5, J. R. Quinlan, 1996)

Boosting can fail to perform well: Given insufficient data. Overly complex weak hypotheses. Weak hypotheses which are too weak. Bagging – vs. Boosting

More results
[Plot: training error and test error for the base classifier and for the combined classifier; from Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998.]

Pruning
Horizon effect: lack of sufficient look-ahead. A stopping condition may be met "too early" for overall optimal recognition accuracy, yet smaller decision trees are desirable to avoid overfitting. Pruning avoids overfitting without the lack of look-ahead.

Impurity-based pruning
Grow the tree fully until the leaf nodes have minimum impurity. Then, starting at the leaf nodes, eliminate pairs of neighboring leaves whose split gave only a small decrease in impurity; their common antecedent is declared a leaf.

Error-complexity-based pruning
Pass the pruning set through the tree and record at each node the number of errors if the tree were terminated there. For each node, compare Ep, the number of errors if that node were made a terminal node, with Est, the number of errors at the terminal nodes of its subtree. If Ep ≤ Est, replace the subtree with that node.
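A sketch of this procedure on a simple dict-based tree; the node layout with "feature"/"threshold"/"left"/"right"/"label" keys is an illustrative choice, pruning_set is a list of (x, y) pairs, and for brevity the leaf label after collapsing is taken as the majority class of the pruning points reaching the node (in practice it would come from the training set):

```python
from collections import Counter

def errors_as_leaf(pruning_set):
    """Ep: errors on the pruning set if the node were terminated here (majority label)."""
    if not pruning_set:
        return 0
    majority = Counter(y for _, y in pruning_set).most_common(1)[0][0]
    return sum(y != majority for _, y in pruning_set)

def prune(node, pruning_set):
    """Recursively prune; returns the subtree's error count (Est) on the pruning set."""
    if "label" in node:                                    # already a terminal node
        return sum(y != node["label"] for _, y in pruning_set)
    f, t = node["feature"], node["threshold"]
    left_set = [(x, y) for x, y in pruning_set if x[f] < t]
    right_set = [(x, y) for x, y in pruning_set if x[f] >= t]
    e_subtree = prune(node["left"], left_set) + prune(node["right"], right_set)   # Est
    e_here = errors_as_leaf(pruning_set)                                          # Ep
    if e_here <= e_subtree:                                # Ep <= Est: collapse the subtree
        majority = (Counter(y for _, y in pruning_set).most_common(1)[0][0]
                    if pruning_set else 0)
        node.clear()
        node["label"] = majority
        return e_here
    return e_subtree
```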

C4.5 rules
Each leaf has an associated rule. Simplify the rules by eliminating redundant decisions that have no influence on the classifier; in this way, even conditions coming from nodes near the root can be pruned.

Pre-pruning vs. post-pruning
Pre-pruning: stop further splitting of a given node based on a stopping criterion; look-ahead methods.
Post-pruning: remove subtrees which have minimal impact on some estimated sensitivity measure, such as the error rate or the increase in impurity. Difficulties: evaluating combinations of removals is hard, so a greedy strategy is used, which may not be optimal; computation time.

Applications
- WebSeer, an image search engine for the WWW (University of Chicago).
- C5.0, by RuleQuest Research (Ross Quinlan).

An image search engine for the WWW
[Screenshots: an image search query and the results of the query.]

Separating photographs from graphics
Cues: color transitions from pixel to pixel; regions of constant color vs. texture and noise; edges (sharp vs. smooth, fast transitions); light variation and shading; highly saturated colors; number of colors. (Distinguishing Photographs and Graphics on the WWW, Vassilis Athitsos, Michael J. Swain & Charles Frankel, CBAIVL 1997)

Image metrics
- Number of distinct colors.
- Prevalent color: the percentage of pixels with the most frequently occurring color.
- Farthest neighbor: color distance d(p, p') = |r − r'| + |g − g'| + |b − b'|; the transition value of a pixel is the maximum color distance in its 4-neighborhood; for a given P, the metric is the percentage of pixels with transition value ≥ P.
- Saturation: |max(r, g, b) − min(r, g, b)|; for a given P, the metric is the percentage of pixels with saturation ≥ P.
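A sketch of these pixel-level metrics, assuming the image is a NumPy array of shape (H, W, 3) holding integer RGB values; the threshold P is the one mentioned above, and the wrap-around handling of the 4-neighborhood at image borders is a simplification:

```python
import numpy as np

def transition_values(img: np.ndarray) -> np.ndarray:
    """Per-pixel max L1 color distance |r-r'|+|g-g'|+|b-b'| over the 4-neighborhood.

    np.roll wraps at the image border, a simplification of the neighborhood at edges.
    """
    img = img.astype(np.int32)
    dist = np.zeros(img.shape[:2], dtype=np.int32)
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        shifted = np.roll(img, (dy, dx), axis=(0, 1))
        dist = np.maximum(dist, np.abs(img - shifted).sum(axis=2))
    return dist

def farthest_neighbor_metric(img: np.ndarray, P: int) -> float:
    """Fraction of pixels whose transition value is >= P."""
    return float((transition_values(img) >= P).mean())

def saturation_metric(img: np.ndarray, P: int) -> float:
    """Fraction of pixels whose saturation |max(r,g,b) - min(r,g,b)| is >= P."""
    sat = img.max(axis=2).astype(np.int32) - img.min(axis=2).astype(np.int32)
    return float((sat >= P).mean())

def prevalent_color_metric(img: np.ndarray) -> float:
    """Fraction of pixels having the single most frequently occurring color."""
    _, counts = np.unique(img.reshape(-1, 3), axis=0, return_counts=True)
    return float(counts.max() / counts.sum())
```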

Image metrics (cont.)
- Color histogram: map (r, g, b) to (r/16, g/16, b/16) in a 16x16x16 histogram and normalize to 1. Create average color histograms Hg (graphics) and Hp (photographs); for a test image with histogram Hi, define a score from the correlation of Hi with Hg and Hp.
- Farthest neighbor histogram: the histogram of transition values. Create average histograms Fg and Fp; for a test image with histogram Fi, define a score analogously.

Individual metrics
Choose a threshold T to minimize E, where Eg and Ep are the errors on graphics and photographs in the testing set and E = (Eg + Ep)/2. Classify as photograph (P) when the photograph score exceeds the graphic score (p > g) and as graphic (G) when p < g. The scores obtained from individual metrics are rarely definitive.

Combining the metric scores
Grow multiple binary decision trees (as in Amit & Geman). The test at node n is of the form Mn(Pn): Sn > Tn, and each leaf node stores a probability Pp in [0, 1] of being a photograph. Classify by the average A of the trees' results: if A < K, the image is declared a graphic. Why multiple trees? Different "points of view" give increased accuracy and identify the salient metrics.

Results WebSeer

C5.0
The software package was written by Ross Quinlan. It combines AdaBoost with C4.5 and is available commercially from www.rulequest.com.

Is C5.0 better than C4.5?
Rulesets: much faster and much less memory. Decision trees: faster and smaller. [Comparison charts of C4.5 vs. C5.0.]

Is C5.0 better than C4.5?
Boosting adds accuracy. [Comparison chart of C4.5 vs. C5.0.]

New functionality
- Variable misclassification costs: in practical applications some classification errors are more serious than others; a separate cost can be defined for each predicted/actual class pair, and the classifier minimizes the expected cost.
- C5.0 provides free source code for reading and interpreting the classifiers it generates.
Demos (data sets: breast cancer, soybean): a simple decision tree with 10-fold cross-validation, pruning, rule sets, boosting, and cross-referencing on the training and test data.

THE END

References
- Y. Amit, 2D Object Detection and Recognition: Models, Algorithms, and Networks, MIT Press, 2002.
- R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, Wiley, 2001.
- L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth Int. Group, 1984.
- Y. Amit, D. Geman and K. Wilder, Joint induction of shape features and tree classifiers, IEEE Trans. Pattern Anal. Mach. Intell., 19, 1300-1305, 1997.
- Y. Amit and D. Geman, Shape quantization and recognition with randomized trees, Neural Computation, 9, 1545-1588, 1997.

- R. E. Schapire, Y. Freund, P. Bartlett and W. Sun Lee, Boosting the margin: a new explanation for the effectiveness of voting methods, The Annals of Statistics, 26(5): 1651-1686, 1998.
- L. Breiman, Bagging predictors, Machine Learning, 24(2): 121-167, 1996.
- T. G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning, 40(2): 139-158, 2000.
- H. Drucker and C. Cortes, Boosting decision trees, Advances in Neural Information Processing Systems 8, 479-485, 1996.
- J. R. Quinlan, Bagging, boosting, and C4.5, Proc. 13th National Conf. on Artificial Intelligence, 725-730, 1996.
- L. Breiman, Arcing classifiers, The Annals of Statistics, 26(3): 801-849, 1998.
- J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
- J. R. Quinlan, Induction of decision trees, Machine Learning, 1(1): 81-106, 1986.