# Classification with Multiple Decision Trees

CV-2003 Eran Shimonovitz Ayelet Akselrod-Ballin

Plan
- Basic framework: query selection, impurity, stopping criteria
- Intermediate summary
- Combining multiple trees: Y. Amit & D. Geman's approach; randomization, bagging, boosting
- Applications

Introduction
A general classifier uses measurements made on an object to assign the object to a category.

Some popular classification methods
- Nearest Neighbor Rule
- Bayesian Decision Theory
- Fisher Linear Discriminant
- SVM
- Neural Network

Formulation
- x = (x1, x2, …, xd) ∈ X: a measurement vector, pre-computed for each data point.
- C = {1, …, J}: the set of J classes; Y(x) denotes the true label.
- L = {(x1, y1), …, (xN, yN)}: the learning sample.
- Data patterns can be ordered (numerical, real numbers) or categorical (nominal lists of attributes).
- A classification rule is a function defined on X such that for every x, Ŷ(x) equals one of {1, …, J}.

The goal is to construct a classifier Ŷ such that the misclassification error P(Ŷ(X) ≠ Y(X)) is as small as possible.

Basic framework: CART (Classification and Regression Trees)
Terminology: root, sub-tree, node, leaf.
Trees are constructed by repeated splits of subsets of X into descendant subsets (Breiman and colleagues, 1984).

Split number: binary / multi-valued.
Every tree can be represented using only binary decisions (Duda, Hart & Stork, 2001).

Query selection & Impurity
P(ωj): the fraction of patterns at node T in category ωj.
An impurity Φ is a nonnegative function defined on the set of all J-tuples (p1, …, pJ) with Σj pj = 1, for example:
(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), (1/3, 1/3, 1/3, 0, 0, 0), (0, 0, 0, 1/3, 1/3, 1/3)

Impurity properties
- Φ is maximal when all categories are equally represented.
- Φ = 0 if all patterns that reach the node bear the same category.
- Φ is a symmetric function of pω1, …, pωJ.
Given Φ, define the impurity measure i(T) at any node T as i(T) = Φ(P(ω1|T), …, P(ωJ|T)).

Entropy impurity: i(T) = −Σj P(ωj) log2 P(ωj).
Example: splitting a node with class distribution (8/16, 8/16) at X1 < 0.6 sends 10/16 of the points to one child with distribution (7/10, 3/10), i(T) = 0.88, and 6/16 to the other with distribution (1/6, 5/6), i(T) = 0.65.
[Figure: scatter of the 16 data points in the (X1, X2) plane with the split X1 < 0.6.]

Entropy impurity
[Figure: decision tree with splits X1<0.35, X2<0.32, X2<0.61, X1<0.69 separating classes w1 and w2.]

Other impurity functions
- Variance impurity (two categories): i(T) = P(ω1)P(ω2)
- Gini impurity: i(T) = Σi≠j P(ωi)P(ωj) = 1 − Σj P(ωj)²
- Misclassification impurity: i(T) = 1 − maxj P(ωj)
In practice, the choice of measure does not greatly affect the overall performance.
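The impurity measures above translate directly into code. A minimal sketch (the function names are ours, not from the slides); each function takes a tuple of class probabilities at a node:

```python
import math

def entropy_impurity(p):
    """Entropy impurity: i(T) = -sum_j p_j * log2(p_j)."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def gini_impurity(p):
    """Gini impurity: i(T) = 1 - sum_j p_j^2."""
    return 1.0 - sum(pj * pj for pj in p)

def misclassification_impurity(p):
    """Misclassification impurity: i(T) = 1 - max_j p_j."""
    return 1.0 - max(p)
```

For the example split above, `entropy_impurity([0.7, 0.3])` gives approximately 0.88 and `entropy_impurity([1/6, 5/6])` approximately 0.65, matching the slide's values.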

Goodness of split
Defined as the decrease in impurity:
Δi(s, T) = i(T) − PL·i(TL) − PR·i(TR),
where PL and PR are the fractions of points at node T sent to the left and right descendants TL and TR. Select the split s that maximizes Δi(s, T). This is a greedy method: local optimization only. With entropy impurity, maximizing Δi amounts to minimizing the conditional entropy.
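A minimal sketch of greedy split selection on a single numerical feature, using the Δi formula above (names and the exhaustive-threshold strategy are ours):

```python
import math

def entropy(labels):
    """Entropy impurity of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def impurity_decrease(labels, left_mask):
    """Delta-i(s, T) = i(T) - P_L * i(T_L) - P_R * i(T_R)."""
    left = [y for y, m in zip(labels, left_mask) if m]
    right = [y for y, m in zip(labels, left_mask) if not m]
    p_l = len(left) / len(labels)
    return entropy(labels) - p_l * entropy(left) - (1 - p_l) * entropy(right)

def best_split(xs, ys):
    """Greedy: try every threshold on the feature, keep the max Delta-i."""
    best = (None, -1.0)
    for t in sorted(set(xs)):
        mask = [x < t for x in xs]
        if not any(mask) or all(mask):
            continue  # degenerate split: one side empty
        d = impurity_decrease(ys, mask)
        if d > best[1]:
            best = (t, d)
    return best  # (threshold, impurity decrease)
```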

Entropy formulation
The vector of predictors is assumed binary. For each predictor f, calculate the conditional entropy of the class given Xf, and choose the query that minimizes it.

Stopping criteria
- The best candidate split at a node reduces the impurity by less than a threshold.
- A lower bound on the number / percentage of points at a node.
- Validation & cross-validation.
- Statistical significance of the impurity reduction.
Trade-off: growing the tree fully until minimum impurity risks overfitting, while stopping the splitting too early leaves the error insufficiently low.

Recognizing overfitting
[Figure: accuracy on training data vs. test data as a function of tree size (number of nodes).]

Assignment of leaf labels
When leaf nodes have positive impurity, each leaf is labeled by the category that has the most points.

Recursive partitioning scheme
1. If the stopping criterion is met, make the node a leaf labeled with the most common class.
2. Otherwise, select the attribute A that maximizes the impurity reduction (via P(j|N) and i(N)).
3. For each possible value of A, add a new branch.
4. Below each new branch, recursively grow a sub-tree.
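The scheme above can be sketched end-to-end. A minimal version assuming numerical attributes with binary threshold queries and a minimum-points stopping criterion (the dict-based tree representation is our choice for illustration):

```python
import math
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def build_tree(X, ys, min_points=2):
    """Recursive partitioning: stop if pure or too few points, else split greedily."""
    if len(set(ys)) == 1 or len(ys) < min_points:
        return {"label": Counter(ys).most_common(1)[0][0]}  # leaf: most common class
    best = None
    for f in range(len(X[0])):                     # each attribute
        for t in sorted(set(x[f] for x in X)):     # each candidate threshold
            left = [i for i, x in enumerate(X) if x[f] < t]
            right = [i for i, x in enumerate(X) if x[f] >= t]
            if not left or not right:
                continue
            drop = entropy(ys) \
                - (len(left) / len(ys)) * entropy([ys[i] for i in left]) \
                - (len(right) / len(ys)) * entropy([ys[i] for i in right])
            if best is None or drop > best[0]:
                best = (drop, f, t, left, right)
    if best is None or best[0] <= 0:               # no split reduces impurity
        return {"label": Counter(ys).most_common(1)[0][0]}
    _, f, t, left, right = best
    return {"feature": f, "threshold": t,
            "left": build_tree([X[i] for i in left], [ys[i] for i in left], min_points),
            "right": build_tree([X[i] for i in right], [ys[i] for i in right], min_points)}

def classify(node, x):
    while "label" not in node:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["label"]
```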

[Figure: decision tree with splits X2<0.83, X1<0.27, X1<0.89, X2<0.34, X2<0.56, X1<0.09, X1<0.56 assigning leaves to classes w1 and w2.]

Preprocessing with PCA yields oblique splits such as −0.8·X1 + 0.6·X2 < 0.3 separating w1 from w2.

Popular tree algorithms
- ID3, the 3rd "interactive dichotomizer" (Quinlan 1983)
- C4.5, a descendant of ID3 (Quinlan 1993)
- C5.0

Pros & Cons
Pros:
- Interpretability: gives good insight into the data structure.
- Rapid classification.
- Handles multi-class problems.
- Modest space complexity.
- Refinement without reconstruction: the tree can be extended further.
- Natural to incorporate prior expert knowledge.
Cons:
- Instability: sensitivity to the training points, a result of the greedy process.
- Training time.
- Overtraining sensitivity.
- Difficult to understand if large.

Combining multiple classification trees
Main problem: stability. Small changes in the training set cause large changes in the classifier.
Solution: grow multiple trees instead of just one and then combine the information. The aggregation produces a significant improvement in accuracy.

Protocols for generating multiple classifiers
- Randomization of the queries at each node.
- Boosting: sequential reweighting (AdaBoost).
- Bagging: bootstrap aggregation.

Y. Amit & D. Geman's Approach
"Shape quantization and recognition with randomized trees", Neural Computation, 1997.
Shape recognition based on shape features & tree classifiers. The goal: select the informative shape features and build tree classifiers.

Randomization
At each node:
- Choose a random sample of predictors from the whole candidate collection.
- Estimate the optimal predictor using a random sample of data points.
The sizes of these two random samples are parameters.
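A sketch of the randomized query selection described above, under the assumption of numerical features with threshold queries (the function name and parameter defaults are ours):

```python
import math
import random
from collections import Counter

def entropy(ys):
    n = len(ys)
    return -sum((c / n) * math.log2(c / n) for c in Counter(ys).values())

def randomized_split(X, ys, n_queries=5, n_samples=20, rng=random):
    """At a node: draw a random sample of candidate queries (feature, threshold)
    and a random sample of data points, and estimate the best query from those.
    The two sample sizes are the protocol's parameters."""
    idx = rng.sample(range(len(ys)), min(n_samples, len(ys)))  # data sub-sample
    Xs, yss = [X[i] for i in idx], [ys[i] for i in idx]
    best = None
    for _ in range(n_queries):                                 # query sub-sample
        f = rng.randrange(len(X[0]))
        t = rng.choice([x[f] for x in Xs])
        left = [y for x, y in zip(Xs, yss) if x[f] < t]
        right = [y for x, y in zip(Xs, yss) if x[f] >= t]
        if not left or not right:
            continue  # degenerate query
        drop = entropy(yss) - (len(left) / len(yss)) * entropy(left) \
                            - (len(right) / len(yss)) * entropy(right)
        if best is None or drop > best[0]:
            best = (drop, f, t)
    return best  # (impurity drop, feature, threshold), or None
```

Because each node sees a different random set of queries and points, independently grown trees end up only weakly dependent, which is what makes aggregation effective.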

Multiple classification trees
Different trees correspond to different aspects of the shapes, characterizing them from "different points of view". The trees are statistically weakly dependent due to the randomization.

Aggregation
After producing N trees T1, …, TN, classify by maximizing the average terminal distribution over the trees. The terminal distribution at node t assigns class c the weight |Lt(c)| / Σc' |Lt(c')|, where Lt(c) is the set of training points of class c at node t.
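The aggregation step above can be sketched in a few lines (function names are ours): estimate a class distribution from the training points at each tree's terminal node, then average over trees and take the argmax.

```python
from collections import Counter

def terminal_distribution(points_at_leaf, classes):
    """mu_t(c) = |L_t(c)| / |L_t|: fraction of training points of class c at leaf t."""
    counts = Counter(points_at_leaf)
    total = len(points_at_leaf)
    return {c: counts[c] / total for c in classes}

def aggregate(distributions, classes):
    """Classify by maximizing the average terminal distribution over the N trees."""
    avg = {c: sum(d[c] for d in distributions) / len(distributions) for c in classes}
    return max(avg, key=avg.get)
```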

[Figure: a test point dropped through trees T1, T2, …, TN; the terminal distributions are averaged to estimate the class ω.]

Data
Classification examples: handwritten digits and LATEX symbols, as binary images of 2D shapes. All images are registered to a fixed 32×32 grid, with considerable within-class variation. (Y. Amit & D. Geman)

Handwritten digits – NIST (National Institute of Standards and Technology)
223,000 binary images of isolated digits written by more than 2,000 writers; 100,000 used for training and 50,000 for testing.

LATEX symbols
32 samples per class for all 293 classes, generated by synthetic deformations.

Shape features
Each query corresponds to a spatial arrangement of local codes, "tags". Tags are coarse descriptions (5-bit codes) of the local topography of the intensity surface in the neighborhood of a pixel. The discriminating power comes from the relative angles and distances between tags.

Tags
4×4 sub-images are randomly extracted and recursively partitioned based on individual pixel values, with a tag type for each node of the resulting tree. If 5 questions are asked, this yields 62 tag types.

Tags (cont.)
Tag 16 is a depth-4 tag. The corresponding 4 questions are indicated by a mask over the sub-image, where 0 = background, 1 = object, and n = "not asked". These neighborhoods are loosely described as "background to the lower left, object to the upper right".

Spatial arrangement of local features
The arrangement A is a labeled hyper-graph: vertex labels correspond to the tag types, and edge labels to relations (directional and distance constraints). The query asks whether such an arrangement exists anywhere in the image.

Example of node splitting
The minimal extension of an arrangement A is the addition of one relation between existing tags, or the addition of exactly one tag and one relation binding the new tag to an existing one.

The trees are grown by the scheme described …

Importance of multiple randomized trees
[Figure: graphs (arrangements) found at the terminal nodes of five different trees.]

Experiment – NIST
Stopping: nodes are split while at least m points remain in the second-largest class. Q is the number of random queries per node, and a random sample of 200 training points is used per node. The trees produced have depth 10 on average.

Results
The best error rate with a single tree is 5%; the average classification rate per tree is about 91%. By aggregating trees, the classification rate climbs above 99%, giving state-of-the-art error rates:

| #T ↓ \ Rejection rate → | 3% | 2% | 1% |
|---|---|---|---|
| 25 | 99.8 | 99.5 | 99.2 |
| 50 | | 99.6 | 99.3 |

Conclusions
- Stability & accuracy: combining multiple trees leads to a drastic decrease in error rates relative to the best individual tree.
- Efficiency: fast training & testing.
- The trees' output admits visual interpretation.
- Few parameters, and insensitivity to the parameter settings.

Conclusions (cont.)
The approach is not model-based, and does not involve advanced geometry or extracting boundary information. A missing aspect: features from more than one resolution. The most successful handwritten character recognition reported is by LeCun et al. (99.3%), using a multi-layer feed-forward network based on raw pixel intensities.

Voting tree learning algorithms
A family of protocols for producing and aggregating multiple classifiers. They improve predictive accuracy for unstable procedures by manipulating the training data in order to generate different classifiers. Methods: bagging, boosting.

Bagging
The name derives from "bootstrap aggregation". A "bootstrap" data set is created by randomly selecting points from the training set with replacement. Bootstrap estimation: the selection process is independently repeated, and the data sets are treated as independent. (Bagging Predictors, Leo Breiman, 1996)

Bagging – algorithm
1. Select a bootstrap sample LB from L.
2. Grow a decision tree from LB.
3. Repeat, and estimate the class of each xn by plurality vote over the trees.
The error count is the number of points whose estimated class differs from the true class.
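The three steps above can be sketched generically. Here `fit` stands in for any unstable base learner that returns a predict function; the `fit_1nn` helper is a hypothetical stand-in for the decision-tree learner, used only to make the sketch self-contained:

```python
import random
from collections import Counter

def bootstrap_sample(L, rng):
    """Draw |L| points from L with replacement."""
    return [rng.choice(L) for _ in L]

def bagging(L, fit, n_trees=25, seed=0):
    """Grow one classifier per bootstrap sample; fit(sample) must return a
    predict(x) function (in CART terms, a decision-tree learner)."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(L, rng)) for _ in range(n_trees)]

def plurality_vote(classifiers, x):
    """Estimate the class of x by plurality vote over the bagged classifiers."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbor on 1-D inputs."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]
```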

Bagging – data
Data sets are drawn from the UCI Machine Learning Repository.

Results
Error rates are the averages over 100 iterations.

C4.5 vs. bagging
[Figure: error rates of C4.5 vs. bagged C4.5 on data sets from the UCI repository of machine learning databases.]
(Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998)

Boosting
Improves the accuracy of a given learning algorithm (Schapire 1989). Classifiers are produced with dependence on the previously generated classifiers, forming an ensemble whose joint decision rule has higher accuracy.

Boosting – basic procedure
A 2-dimensional, 2-category problem; the final classification is by voting of 3 component classifiers.

Train classifier C1 with D1
Randomly select a subset D1 of patterns from the training set and train the first classifier with it. C1 is a weak classifier.

Train classifier C2 with D2
Find a second training set that is the "most informative" given C1: half of it should be correctly classified by C1, and half incorrectly. Train a second classifier, C2, with this set.

Train classifier C3 with D3
Seek a third data set that is not well classified by the vote of C1 and C2, and train the third classifier, C3, with it.

Ensemble of classifiers
Classify a test pattern based on the votes: if C1 and C2 agree on a label, use that label; if they disagree, use the label given by C3.

AdaBoost
The most popular variation on basic boosting: "Adaptive Boosting" (Freund & Schapire 1995). It focuses on the "difficult" patterns by maintaining a weight vector on the training data.

A weak learner
- Weighted training set (x1, y1, w1), …, (xn, yn, wn): feature vectors x1, …, xn with binary labels y1, …, yn and non-negative weights that sum to 1.
- A weak rule h assigns a binary label to each x.
- The weak requirement: the weighted error of h is less than 1/2, i.e. better than random guessing.
(An Introduction to Boosting, Yoav Freund)

The key idea
Start from uniform weights (x1, y1, 1/n), …, (xn, yn, 1/n) and train h1; reweight to (x1, y1, w1), …, (xn, yn, wn) and train h2; repeat through hT.
Final rule: Sign[a1·h1(x) + a2·h2(x) + … + aT·hT(x)].
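The reweighting loop above can be sketched as follows; this is a minimal AdaBoost, with a hypothetical decision-stump weak learner added so the sketch runs on its own (labels are assumed ±1):

```python
import math

def adaboost(X, ys, weak_learn, T=10):
    """AdaBoost sketch: maintain a weight vector on the training data,
    reweight after each round, combine by Sign[sum_t a_t * h_t(x)].
    weak_learn(X, ys, w) must return a rule h(x) -> +1/-1."""
    n = len(ys)
    w = [1.0 / n] * n
    rules = []
    for _ in range(T):
        h = weak_learn(X, ys, w)
        err = sum(wi for wi, x, y in zip(w, X, ys) if h(x) != y)
        if err >= 0.5 or err == 0:       # weak requirement violated, or perfect
            if err == 0:
                rules.append((1.0, h))
            break
        a = 0.5 * math.log((1 - err) / err)
        rules.append((a, h))
        # raise weights on misclassified points, lower them on correct ones
        w = [wi * math.exp(-a * y * h(x)) for wi, x, y in zip(w, X, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in rules) >= 0 else -1

def stump_learner(X, ys, w):
    """Hypothetical weak learner: best threshold stump on 1-D inputs."""
    best = None
    for t in set(X):
        for sign in (1, -1):
            err = sum(wi for wi, x, y in zip(w, X, ys)
                      if (sign if x < t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x, t=t, sign=sign: sign if x < t else -sign
```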

A Brief Introduction to Boosting, Robert E. Schapire, 1999

C4.5 vs. boosting
[Figure: error rates of C4.5 vs. boosted C4.5 on data sets from the UCI repository of machine learning databases.]

Shift in mind …
Instead of trying to design a learning algorithm that is accurate over the entire space, focus on finding weak learning algorithms that are only better than random.

Amit & Geman's approach generates different trees using randomization; the alternative considered here is to produce multiple trees using boosting.

Randomized boosting
Reweighting is done using 1/en. Aggregation is either by averaging or by a weighted vote among the N trees.

Experiments (NIST)
Deep trees → high training classification rate → boosting performed worse than simple randomization. 5,000 training points; (Q) queries per node = 20; (s) samples per node = 15.

| Stopping criterion | m=1 | m=3 | m=10 | m=20 |
|---|---|---|---|---|
| Randomized aggregate rate | 97.6% | 97.2% | 96.5% | 96.0% |
| Boosting aggregate rate | * | 95.2% | 96.8% | |

(*) Boosting is not applicable for pure trees, since the training error rate is 0.

Experiments (NIST)
Shallow trees → data points of different classes that are hard to separate. 100,000 training points, 50,000 testing points.

| Protocol | m=50, Q=20 | m=100 |
|---|---|---|
| Boosting aggregate rate | 99.07% | 99.09% |
| Individual rate | 69.3% | 62.4% |
| Average depth | 10.8 | 9.8 |

Conclusions
- Randomization seems to miss some problematic points.
- AdaBoost does not always produce better results: overfitting, and sensitivity to small training sets.
- Better classification rates are obtained with randomized sequential reweighting.
(Multiple Randomized Classifiers, Y. Amit & G. Blanchard, 2001)

Bagging, Boosting and C4.5 (J.R. Quinlan, 1996)
C4.5's mean error rate over 10-fold cross-validation, with complete cross-validations of bagging vs. C4.5, boosting vs. C4.5, and boosting vs. bagging:
- Bagged C4.5 vs. C4.5: 24/27 data sets
- Boosted C4.5 vs. C4.5: 21/27
- Boosting vs. bagging: 20/27

Boosting can fail to perform well:
- given insufficient data;
- with overly complex weak hypotheses;
- with weak hypotheses that are too weak.

More results
[Figure: training and test error of the base classifier vs. the combined classifier.]
(Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998)

Pruning
Horizon effect: lack of sufficient look-ahead. A stopping condition may be met "too early" for overall optimal recognition accuracy. Smaller decision trees are desirable to avoid overfitting; pruning avoids overfitting without the lack of look-ahead.

Impurity-based pruning
Grow the tree fully, until the leaf nodes have minimum impurity. Then, starting at the leaves, eliminate pairs of neighboring leaf nodes whose split yields only a small decrease in impurity; their common antecedent is declared a leaf.

Error-complexity-based pruning
Pass the pruning set through the tree and record at each node the number of errors if the tree were terminated there. Then compare, for each node:
- Ep: the number of errors if that node were made a terminal node;
- Est: the number of errors at the terminal nodes of its subtree.
If Ep ≤ Est, replace the subtree with that node.
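A minimal sketch of this procedure, assuming the dict-based tree representation used in the earlier sketches and a pruning set of (x, y) pairs (the leaf label is taken as the majority class of the pruning points reaching the node, a simplifying assumption of ours):

```python
from collections import Counter

def classify_node(node, x):
    while "label" not in node:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["label"]

def prune(node, data):
    """Error-complexity pruning: pass the pruning set through the tree; if
    making a node a leaf gives E_p <= E_st errors, replace its subtree."""
    if "label" in node or not data:
        return node
    left = [(x, y) for x, y in data if x[node["feature"]] < node["threshold"]]
    right = [(x, y) for x, y in data if x[node["feature"]] >= node["threshold"]]
    node = {"feature": node["feature"], "threshold": node["threshold"],
            "left": prune(node["left"], left), "right": prune(node["right"], right)}
    label = Counter(y for _, y in data).most_common(1)[0][0]
    e_p = sum(y != label for _, y in data)                   # errors as a leaf
    e_st = sum(classify_node(node, x) != y for x, y in data)  # subtree errors
    return {"label": label} if e_p <= e_st else node
```

In the test below, the lower split (X1 < 0.8) does not help on the pruning set and is collapsed, while the useful split at X1 < 0.5 survives.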

C4.5 rules
Each leaf has an associated rule. Simplify by eliminating redundant decisions that have no influence on the classifier; in this way even nodes near the root can be pruned.

Pre-pruning vs. post-pruning
Pre-pruning: stop further splitting of a given node based on a stopping criterion; look-ahead methods.
Post-pruning: remove subtrees that have minimal impact on some estimated sensitivity measure, such as the error rate or the increase in impurity. The difficulty of evaluating combinations of removals forces a greedy strategy, which may not be optimal and costs computation time.

Applications
- WebSeer: an image search engine for the WWW (University of Chicago).
- C5.0: RuleQuest Research, by Ross Quinlan.

An Image Search Engine for the WWW
[Figure: an image search query and its results.]

Separating photographs from graphics
Cues: color transitions from pixel to pixel; regions of constant color vs. texture and noise; sharp vs. smooth (fast) edge transitions; light variation and shading; highly saturated colors; the number of colors.
(Distinguishing Photographs and Graphics on the WWW, Vassilis Athitsos, Michael J. Swain & Charles Frankel, CBAIVL 1997 — WebSeer)

Image metrics
- Number of distinct colors.
- Prevalent color: the fraction of pixels with the most frequently occurring color.
- Farthest neighbor: color distance d(p, p') = |r−r'| + |g−g'| + |b−b'|; the transition value of a pixel is the maximum color distance within its 4-neighborhood; for a given P, record the fraction of pixels with transition value ≥ P.
- Saturation: |max(r,g,b) − min(r,g,b)|; for a given P, record the fraction of pixels with saturation level ≥ P.
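The distance, transition, and saturation metrics above translate directly into code. A minimal sketch (function names are ours), representing an image as rows of (r, g, b) tuples:

```python
def color_distance(p, q):
    """d(p, p') = |r - r'| + |g - g'| + |b - b'|"""
    return sum(abs(a - b) for a, b in zip(p, q))

def transition_value(img, x, y):
    """Maximum color distance between pixel (x, y) and its 4-neighborhood."""
    h, w = len(img), len(img[0])
    return max(color_distance(img[y][x], img[ny][nx])
               for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
               if 0 <= nx < w and 0 <= ny < h)

def saturation(p):
    """|max(r,g,b) - min(r,g,b)|"""
    return max(p) - min(p)

def pct_transition_at_least(img, P):
    """Fraction of pixels whose transition value >= P; sharp transitions
    everywhere are characteristic of graphics rather than photographs."""
    h, w = len(img), len(img[0])
    hits = sum(transition_value(img, x, y) >= P
               for y in range(h) for x in range(w))
    return hits / (h * w)
```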

Image metrics (cont.)
- Color histogram: map (r,g,b) → (r/16, g/16, b/16) into 16×16×16 bins and normalize to 1. Create average color histograms Hg (graphics) and Hp (photographs); for a test image with histogram Hi, define the correlation with each.
- Farthest neighbor histogram: the histogram of transition values. Create average histograms Fg and Fp; for a test image with histogram Fi, define the correlation with each.

Individual metrics
T is a threshold chosen to minimize E = (Eg + Ep)/2, where Eg and Ep are the errors on the testing set. Decide photograph if p > g, and graphic if p < g. The scores obtained from individual metrics are rarely definitive.

Combining the metric scores
- Grow multiple binary decision trees (Amit & Geman).
- Test at node n: does metric Mn with parameter Pn give a score Sn > Tn?
- Each leaf node stores Pp ∈ [0, 1].
- Classify by the average A of the trees' results; if A < K, declare the image a graphic.
Why does it work? Different "points of view" give increased accuracy, and the salient metrics are selected.

Results
[Figure: WebSeer classification results.]

C5.0
A software package written by Ross Quinlan that combines AdaBoost with C4.5. Available commercially from RuleQuest Research.

Is C5.0 better than C4.5?
- Rulesets: much faster, with much less memory.
- Decision trees: faster and smaller.
[Figure: C4.5 vs. C5.0 comparison.]

Is C5.0 better than C4.5? (cont.)
- Boosting: adds accuracy.
[Figure: C4.5 vs. boosted C5.0 comparison.]

New functionality
- Variable misclassification costs: in practical applications some classification errors are more serious than others, so a separate cost can be defined for each predicted/actual class pair and the classifier minimizes the expected cost.
- Free source code is provided for reading and interpreting the generated classifiers.
Demos (breast cancer and soybean data sets): simple decision tree & 10-fold cross-validation; pruning; rule sets; boosting; cross-referencing on the data and on the test set.

THE END

References
- Y. Amit, 2D Object Detection and Recognition: Models, Algorithms, and Networks, MIT Press, 2002.
- R. Duda, P. Hart & D. Stork, Pattern Classification, Wiley, 2001.
- L. Breiman, J. Friedman, R. Olshen & C. Stone, Classification and Regression Trees, Wadsworth, 1984.
- Y. Amit, D. Geman & K. Wilder, Joint induction of shape features and tree classifiers, IEEE Trans. Pattern Anal. Mach. Intell., 19, 1997.
- Y. Amit & D. Geman, Shape quantization and recognition with randomized trees, Neural Computation, 9, 1997.
- R.E. Schapire, Y. Freund, P. Bartlett & W. Sun Lee, Boosting the margin: a new explanation for the effectiveness of voting methods, The Annals of Statistics, 26(5), 1998.
- L. Breiman, Bagging predictors, Machine Learning, 24(2), 1996.
- T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Machine Learning, 40(2), 2000.
- H. Drucker & C. Cortes, Boosting decision trees, Advances in Neural Information Processing Systems 8, 1996.
- J.R. Quinlan, Bagging, boosting, and C4.5, Proc. 13th National Conference on Artificial Intelligence, 1996.
- L. Breiman, Arcing classifiers, The Annals of Statistics, 26(3), 1998.
- J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
- J.R. Quinlan, Induction of decision trees, Machine Learning, 1(1), 1986.