
1 Classification with Multiple Decision Trees CV-2003 Eran Shimonovitz Ayelet Akselrod-Ballin

2 Plan  Basic framework Query selection, impurity, stopping …  Intermediate summary  Combining multiple trees –Y.Amit & D.Geman’s approach –Randomization, Bagging, Boosting  Applications

3 Introduction A general classifier: Uses measurements made on the object to assign the object to a category

4 Some popular classification methods  Nearest Neighbor Rule  Bayesian Decision Theory  Fisher Linear Discriminant  SVM  Neural Network

5 Formulation  x measurement vector (x_1, x_2, …, x_d) ∈ X, pre-computed for each data point  C = {1,…,J} set of J classes; the true class of x is labeled Y(x)  L = {(x_1, y_1),…,(x_N, y_N)} learning sample  Data patterns can be –Ordered: numerical, real numbers –Categorical: nominal list of attributes A classification rule: a function Ŷ defined on X so that for every x, Ŷ(x) is equal to one of {1,…,J}

6 The goal is to construct a classifier Ŷ such that the misclassification probability P(Ŷ(x) ≠ Y(x)) is as small as possible

7 Breiman and colleagues 1984 Basic framework CART – classification and regression trees. Trees are constructed by repeated splits of subsets of X into descendant subsets, forming a root, internal nodes, sub-trees and leaves.

8 Duda, Hart & Stork 2001 Split number: binary / multi-valued. Every tree can be represented using only binary decisions.

9 Query selection & Impurity  P(ω_j): more precisely, the fraction of patterns at node T in category ω_j  Impurity Φ is a nonnegative function defined on the set of all J-tuples (p_1,…,p_J) with p_j ≥ 0 and Σ_j p_j = 1. For example, with J = 6 the tuple (1/6, 1/6, 1/6, 1/6, 1/6, 1/6) is maximally impure, while (1/3, 1/3, 1/3, 0, 0, 0) and (0, 0, 0, 1/3, 1/3, 1/3) are purer.

10 Impurity properties 1. When all categories are equally represented, Φ is maximal. 2. If all patterns that reach the node bear the same category, Φ = 0. 3. Φ is a symmetric function of p_ω1,…,p_ωJ. Given Φ, define the impurity measure i(T) at any node T as i(T) = Φ(P(ω_1|T),…,P(ω_J|T)).

11 Entropy impurity – example. A node holding 16 points, 8/16 of class ω_1 and 8/16 of class ω_2, is split by the query X_1 < 0.6. The left child receives 10/16 of the points with class fractions (7/10, 3/10), giving i(T) = 0.88; the right child receives 6/16 with class fractions (1/6, 5/6), giving i(T) = 0.65.
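A minimal sketch of the entropy impurity computation behind the numbers on this slide (plain Python; the function name entropy_impurity is just illustrative):

```python
import math

def entropy_impurity(fractions):
    """i(T) = -sum_j p_j * log2(p_j), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in fractions if p > 0)

print(entropy_impurity([8/16, 8/16]))   # parent node: 1.0
print(entropy_impurity([7/10, 3/10]))   # left child of X1 < 0.6: ~0.88
print(entropy_impurity([1/6, 5/6]))     # right child: ~0.65
```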

12 Entropy impurity – example tree. The resulting decision tree tests thresholds on X_1 and X_2 (X_2 < 0.32, X_1 < 0.35, X_2 < 0.61, X_1 < 0.69, X_1 < 0.6, …) at its internal nodes, with leaves labeled ω_1 or ω_2.

13 Other Impurity Functions  Variance impurity (two-class case): i(T) = P(ω_1)P(ω_2)  Gini impurity: i(T) = Σ_{i≠j} P(ω_i)P(ω_j) = 1 − Σ_j P(ω_j)²  Misclassification impurity: i(T) = 1 − max_j P(ω_j) The choice of impurity measure does not greatly affect the overall performance.
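For reference, a sketch of the standard forms of these impurity functions (the variance impurity is written for the two-class case, as on the slide; function names are illustrative):

```python
def gini_impurity(p):
    """i(T) = sum_{i != j} P(wi) * P(wj) = 1 - sum_j P(wj)^2."""
    return 1.0 - sum(q * q for q in p)

def misclassification_impurity(p):
    """i(T) = 1 - max_j P(wj)."""
    return 1.0 - max(p)

def variance_impurity(p):
    """Two-class case only: i(T) = P(w1) * P(w2)."""
    assert len(p) == 2, "variance impurity is defined here for two classes"
    return p[0] * p[1]

dist = [7/10, 3/10]
print(gini_impurity(dist), misclassification_impurity(dist), variance_impurity(dist))
```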

14 Goodness of split  Defined as the decrease in impurity Δi(s,T) = i(T) − P_L·i(T_L) − P_R·i(T_R), where P_L and P_R are the fractions of the points at T sent by the split s to the left child T_L and the right child T_R.  Select the split that maximizes Δi(s,T).  Greedy method: local optimization. With entropy impurity this amounts to minimizing the conditional entropy of the class given the split.

15 Entropy formulation The vector of predictors is assumed binary. For each predictor f, calculate the conditional entropy of the class given X_f, H(Y | X_f) = Σ_{x∈{0,1}} P(X_f = x) H(Y | X_f = x), and choose the query that minimizes it.
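A sketch of this query-selection step under the slide's assumption of binary predictors: pick the feature f that maximizes the impurity decrease Δi, equivalently minimizes the conditional entropy of the class given X_f. The function names and the list-based data layout are illustrative choices, not part of the original.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy impurity of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_binary_query(points, labels, num_features):
    """points[i][f] is 0 or 1; return (best feature, impurity decrease)."""
    parent = entropy(labels)
    best_f, best_gain = None, 0.0
    for f in range(num_features):
        left = [y for x, y in zip(points, labels) if x[f] == 0]
        right = [y for x, y in zip(points, labels) if x[f] == 1]
        if not left or not right:
            continue                       # this query does not split the node
        p_left = len(left) / len(labels)
        # Delta_i(f, T) = i(T) - P_L * i(T_L) - P_R * i(T_R) = i(T) - H(Y | X_f)
        gain = parent - p_left * entropy(left) - (1 - p_left) * entropy(right)
        if gain > best_gain:
            best_f, best_gain = f, gain
    return best_f, best_gain
```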

16 Stopping Criteria  Best candidate split at a node reduces the impurity by less than a threshold.  Lower bound on the number / percentage of points at a node.  Validation & cross-validation.  Statistical significance of the impurity reduction. There is a trade-off: growing the tree fully, until the leaves reach minimum impurity, leads to overfitting, while stopping the splitting too early leaves the error insufficiently low.

17 Recognizing overfitting – a plot of accuracy against the size of the tree (number of nodes): accuracy on the training data keeps rising, while accuracy on the test data peaks and then falls off.

18 Assignments of leaf labels  When leaf nodes have positive impurity, each leaf is labeled by the category that has the most points in it.

19 Recursive partitioning scheme If the stopping criterion is met at a node, label the node with its most common class. Otherwise, select the attribute A that maximizes the impurity reduction (defined through P(j|N) and i(N)); for each possible value of A add a new branch, and below each new branch grow a sub-tree recursively.
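A compact sketch of this recursive scheme for binary queries, reusing the entropy and best_binary_query helpers from the earlier sketch; the Node class, thresholds and parameter names are illustrative assumptions.

```python
from collections import Counter

class Node:
    def __init__(self, label=None, feature=None, left=None, right=None):
        self.label, self.feature, self.left, self.right = label, feature, left, right

def grow_tree(points, labels, num_features, min_gain=1e-3, min_points=5):
    """Recursive partitioning with an impurity-reduction stopping criterion."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(labels) < min_points or len(set(labels)) == 1:
        return Node(label=majority)                   # stopping criterion met
    f, gain = best_binary_query(points, labels, num_features)
    if f is None or gain < min_gain:
        return Node(label=majority)                   # no worthwhile split
    children = []
    for value in (0, 1):                              # one branch per value of the attribute
        idx = [i for i, x in enumerate(points) if x[f] == value]
        children.append(grow_tree([points[i] for i in idx],
                                  [labels[i] for i in idx],
                                  num_features, min_gain, min_points))
    return Node(feature=f, left=children[0], right=children[1])

def classify(node, x):
    while node.label is None:
        node = node.left if x[node.feature] == 0 else node.right
    return node.label
```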

20 Example – a tree grown by the recursive scheme: internal nodes test thresholds such as X_2 < 0.34, X_1 < 0.09, X_2 < 0.56, X_1 < 0.56, X_1 < 0.27, X_1 < 0.89 and X_2 < 0.83, with leaves labeled ω_1 or ω_2.

21 Preprocessing – PCA After a principal-component transformation, a single oblique split such as −0.8·X_1 + 0.6·X_2 < 0.3 separates ω_1 from ω_2.

22 Popular tree algorithms  ID3 – the 3rd "interactive dichotomizer" (Quinlan 1983)  C4.5 – descendant of ID3 (Quinlan 1993)  C5.0

23 Pros & Cons  Interpretability, good insight into the data structure.  Rapid classification.  Multi-class.  Low space complexity.  Can be refined without reconstructing, and extended further.  Natural to incorporate prior expert knowledge.  …  Instability – sensitivity to the training points, a result of the greedy process.  Training time.  Overtraining sensitivity.  Difficult to understand if large.  …

24 Main problem – Stability Small changes in the training set cause large changes in the classifier. Solution: combine multiple classification trees – grow multiple trees instead of just one and then combine the information. The aggregation produces a significant improvement in accuracy.

25 Multiple trees Protocols for generating multiple classifiers  Randomization: of the queries at each node.  Boosting: sequential reweighting, AdaBoost.  Bagging: bootstrap aggregation.

26 Y. Amit & D. Geman's Approach Shape quantization and recognition with randomized trees, Neural Computation, 1997. Shape recognition based on shape features & tree classifiers. The goal: to select the informative shape features and build tree classifiers.

27 Multiple trees Randomization At each node:  Choose a random sample of predictors from the whole candidate collection.  Estimate the optimal predictor using a random sample of data points.  The sizes of these two random samples are parameters.
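A sketch of how this randomized node optimization could look, in the spirit of the earlier split-selection sketch. The defaults of 20 queries and 200 points echo the experiment slides later on; entropy is assumed from the earlier sketch, and all names are illustrative.

```python
import random

def randomized_best_query(points, labels, num_features, n_queries=20, n_samples=200):
    """Score only a random subset of queries on a random subsample of the node's data."""
    features = random.sample(range(num_features), min(n_queries, num_features))
    idx = random.sample(range(len(labels)), min(n_samples, len(labels)))
    sub_x = [points[i] for i in idx]
    sub_y = [labels[i] for i in idx]
    parent = entropy(sub_y)
    best_f, best_gain = None, 0.0
    for f in features:
        left = [y for x, y in zip(sub_x, sub_y) if x[f] == 0]
        right = [y for x, y in zip(sub_x, sub_y) if x[f] == 1]
        if not left or not right:
            continue
        p_left = len(left) / len(sub_y)
        gain = parent - p_left * entropy(left) - (1 - p_left) * entropy(right)
        if gain > best_gain:
            best_f, best_gain = f, gain
    return best_f
```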

28 Multiple classification trees  Different trees correspond to different aspects of the shapes, characterizing them from "different points of view".  The trees are statistically weakly dependent due to the randomization.

29 Multiple trees Aggregation After producing N trees T_1,…,T_N, a test point is dropped down each tree and classified by maximizing the average terminal distribution: Ŷ(x) = argmax_c (1/N) Σ_n P(c | t_n(x)), where the distribution at a terminal node t is estimated from the training counts, P(c | t) = |L_t(c)| / Σ_{c'} |L_t(c')|, and L_t(c) is the set of training points of class c at node t.

30 Multiple trees – diagram: a test point is passed down the trees T_1, T_2, …, T_n and their outputs are aggregated into a single class estimate ω.
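A sketch of this aggregation rule, assuming each tree is made of Node objects as in the earlier sketch and that every leaf additionally stores a dictionary counts holding the training counts |L_t(c)| (that extra field is an assumption made for this illustration):

```python
def terminal_distribution(tree, x, classes):
    """Drop x down the tree and return the estimated P(c | terminal node)."""
    node = tree
    while node.label is None:
        node = node.left if x[node.feature] == 0 else node.right
    total = sum(node.counts.get(c, 0) for c in classes)
    return {c: node.counts.get(c, 0) / total for c in classes}

def aggregate_classify(trees, x, classes):
    """Y_hat(x) = argmax_c (1/N) * sum_n P(c | t_n(x))."""
    avg = {c: 0.0 for c in classes}
    for tree in trees:
        dist = terminal_distribution(tree, x, classes)
        for c in classes:
            avg[c] += dist[c] / len(trees)
    return max(avg, key=avg.get)
```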

31 Y. Amit & D. Geman Data Classification examples:  Handwritten digits.  LaTeX symbols.  Binary images of 2D shapes.  All images are registered to a fixed grid of 32×32.  Considerable within-class variation.

32 Y. Amit & D. Geman Handwritten digits – NIST (National institute of standards and technology)  223,000 binary images of isolated digits written by more than 2000 writers.  100,000 for training and 50,000 for testing.

33 Y. Amit & D. Geman LATEX Symbols 32 samples per class for all 293 classes. Synthetic deformations

34 Y. Amit & D. Geman Shape features  Each query corresponds to a spatial arrangement of local codes, "tags".  Tags: a coarse description (5-bit codes) of the local topography of the intensity surface in the neighborhood of a pixel.  Discriminating power comes from the relative angles and distances between tags.

35 Y. Amit & D. Geman Tags  4×4 sub-images are randomly extracted & recursively partitioned based on individual pixel values.  There is a tag type for each node of the resulting tree.  If 5 questions are asked, this gives 62 tag types.

36 Y. Amit & D. Geman Tags (cont.)  Tag 16 is a depth-4 tag. The corresponding 4 questions in the sub-image are indicated by the mask
1 n n n
n n n 0
n 0 0 n
n n n n
where 0 = background, 1 = object, n = "not asked".  These neighborhoods are loosely described as background to the lower left, object to the upper right.

37 Y. Amit & D. Geman Spatial arrangement of local features  The arrangement A is a labeled hypergraph. Vertex labels correspond to the tag types and edge labels to relations.  Directional and distance constraints.  Query: whether such an arrangement exists anywhere in the image.

38 Y. Amit & D. Geman Example of node splitting The minimal extension of an arrangement A means the addition of one relation between existing tags, or the addition of exactly one tag and one relation binding the new tag to an existing one.

39 Y. Amit & D. Geman The trees are grown by the scheme described …

40 Y. Amit & D. Geman Importance of multiple randomized trees Graphs found in the terminal nodes of five different trees.

41 Y. Amit & D. Geman Experiment – NIST  Stopping: nodes are split as long as at least m points belong to the second-largest class.  Q: # of random queries per node.  Random sample of 200 training points per node.  Trees are produced, of depth 10 on average.

42 Y. Amit & D. Geman Results  The best error rate with a single tree is 5%.  The average classification rate per tree is about 91%.  By aggregating trees, the classification rate climbs above 99%.  State-of-the-art error rates. A table reports the aggregate classification rate as a function of the rejection rate (1%, 2%, 3%) and the number of trees #T.

43 Y. Amit & D. Geman Conclusions  Stability & accuracy – combining multiple trees leads to a drastic decrease in error rates relative to the best individual tree.  Efficiency – fast training & testing.  Ability to visually interpret the trees' output.  Few parameters & insensitive to the parameter settings.

44 Y. Amit & D. Geman Conclusions (cont.)  The approach is not model based and does not involve advanced geometry or extracting boundary information.  Missing aspect: features from more than one resolution.  The most successful handwritten character recognition was reported by LeCun et al. (99.3%), using a multi-layer feed-forward network based on raw pixel intensities.

45 Voting Tree Learning Algorithms A family of protocols for producing and aggregating multiple classifiers.  Improve predictive accuracy.  For unstable procedures.  Manipulate the training data in order to generate different classifiers.  Methods: Bagging, Boosting

46 Bagging A name derived from "bootstrap aggregation". A "bootstrap" data set: created by randomly selecting points from the training set, with replacement. Bootstrap estimation: the selection process is independently repeated, and the resulting data sets are treated as independent. Bagging Predictors, Leo Breiman, 1996.

47 Bagging – Algorithm  Select a bootstrap sample L_B from L.  Grow a decision tree from L_B.  Repeat, and estimate the class of x_n by plurality vote over the trees.  The error is estimated as the fraction of points whose estimated class differs from the true class.
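A minimal bagging sketch; grow_fn stands in for any tree-growing routine (for example the grow_tree sketch earlier), and the data layout is an illustrative assumption:

```python
import random
from collections import Counter

def bootstrap_sample(data):
    """Randomly select |data| points from data, with replacement."""
    return [random.choice(data) for _ in range(len(data))]

def bagging(data, grow_fn, n_trees=50):
    """Grow each tree on its own bootstrap sample L_B drawn from L."""
    return [grow_fn(bootstrap_sample(data)) for _ in range(n_trees)]

def bagged_classify(trees, classify_fn, x):
    """Estimate the class of x by plurality vote over the bagged trees."""
    votes = Counter(classify_fn(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]
```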

48 Bagging – data sets Taken from the UCI Machine Learning Repository.

49 Bagging – Results Error rates are the averages over 100 iterations.

50 C4.5 vs. bagging – a plot comparing the error rates of C4.5 and bagged C4.5 on data sets from the UCI repository of machine learning databases (from Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998).

51 Boosting Improve the accuracy of a given learning algorithm (Schapire 1989). Classifiers are produced with dependence on the previously generated classifiers. Form an ensemble whose joint decision rule has higher accuracy.

52 Boosting – Basic procedure Problem: 2-dimensional, 2-category. Final classification: voting of 3 component classifiers.

53 Train classifier C1 with D1 Randomly select a subset of patterns from the training set and train the first classifier with this subset. Classifier C1 is a weak classifier.

54 Train classifier C2 with D2 Find a second training set that is the "most informative" given C1: half of it should be correctly classified by C1 and half incorrectly classified by C1. Train a second classifier, C2, with this set.

55 Train classifier C3 with D3 Seek a third data set which is not well classified by the vote of C1 and C2. Train the third classifier, C3, with this set.

56 Ensemble of classifiers Classifying a test pattern based on votes: if C1 and C2 agree on a label, use that label; if they disagree, use the label given by C3.
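A sketch of this three-classifier procedure; train_fn and predict_fn stand in for any weak learner and its prediction routine, and the way D2 is balanced is an illustrative simplification:

```python
import random

def basic_boosting(data, train_fn, predict_fn, n1):
    """data is a list of (x, y) pairs; n1 is the size of the first random subset D1."""
    d1 = random.sample(data, n1)
    c1 = train_fn(d1)
    rest = [p for p in data if p not in d1]
    correct = [(x, y) for x, y in rest if predict_fn(c1, x) == y]
    wrong = [(x, y) for x, y in rest if predict_fn(c1, x) != y]
    k = min(len(correct), len(wrong))
    c2 = train_fn(correct[:k] + wrong[:k])            # half right, half wrong for C1
    d3 = [(x, y) for x, y in rest if predict_fn(c1, x) != predict_fn(c2, x)]
    c3 = train_fn(d3)

    def classify(x):
        y1, y2 = predict_fn(c1, x), predict_fn(c2, x)
        return y1 if y1 == y2 else predict_fn(c3, x)  # C3 breaks the disagreement
    return classify
```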

57 AdaBoost The most popular variation on basic boosting is "Adaptive Boosting" (Freund & Schapire 1995). It focuses on the "difficult" patterns by maintaining a weight vector on the training data.

58 A weak learner h The input is a weighted training set (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn): the xi are feature vectors, the yi are binary labels, and the non-negative weights wi sum to 1. The weak learner returns a rule h assigning a label to each xi. The weak requirement: h must be slightly better than random guessing on the weighted sample. (An Introduction to Boosting, Yoav Freund.)

59 The key idea Start with uniform weights, (x1, y1, 1/n), …, (xn, yn, 1/n), and train h1; reweight the examples to get (x1, y1, w1), …, (xn, yn, wn) and train h2; repeat to obtain h3, h4, …, hT. The final rule is a weighted vote, Sign[α1·h1(x) + α2·h2(x) + … + αT·hT(x)].

60 AdaBoost (Learning) – the AdaBoost learning pseudocode (from A Brief Introduction to Boosting, Robert E. Schapire, 1999).
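Since the pseudocode itself did not survive the transcript, here is a sketch of the standard AdaBoost learning loop for binary labels in {−1, +1}; weak_learner(X, y, w) is assumed to return a hypothesis h(x) → {−1, +1} trained on the weighted sample, and all names are illustrative.

```python
import math

def adaboost(X, y, weak_learner, T=50):
    """Returns the final rule Sign[sum_t alpha_t * h_t(x)]."""
    n = len(y)
    w = [1.0 / n] * n                              # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err >= 0.5:                             # weak requirement violated
            break
        if err == 0:                               # perfect hypothesis: keep it and stop
            hypotheses, alphas = [h], [1.0]
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Increase the weights of misclassified ("difficult") points, decrease the rest.
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
        hypotheses.append(h)
        alphas.append(alpha)

    def final_rule(x):
        score = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if score >= 0 else -1
    return final_rule
```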

61 C4.5 vs. boosting – a plot comparing the error rates of C4.5 and boosted C4.5 on data sets from the UCI repository of machine learning databases.

62 Shift in mind … Instead of trying to design a learning algorithm that is accurate over the entire space, focus on finding weak learning algorithms that are only slightly better than random.

63 Amit & Geman Approach So far, different trees were generated using randomization; now produce multiple trees using boosting.

64 Randomized Boosting  Reweighting is done using 1/e_n.  Aggregation is done either by averaging or by a weighted vote between the N trees.

65 Experiments (NIST) Deep trees → high training classification rate → boosting performed worse than simple randomization. (Q) # queries/node = 20, (s) # samples/node = 15. (*) Boosting is not applicable for pure trees since the training error rate is 0.
Stopping criterion:           m=1     m=3     m=10    m=20
Randomized aggregate rate:    97.6%   97.2%   96.5%   96.0%
Boosting aggregate rate:      *       95.2%   96.5%   96.8%

66 Experiments (NIST) Shallow trees → data points of different classes that are hard to separate. 100,000 training data, 50,000 testing data.
Protocol:                  m=50, Q=20   m=100, Q=20
Boosting aggregate rate:   99.07%       99.09%
Individual rate:           69.3%        62.4%
Average depth:             –            –

67 Conclusions  Randomization seems to miss some problematic points.  AdaBoost does not always produce better results – overfitting, and sensitivity to small training sets.  Better classification rates are obtained with randomized sequential reweighting. (Multiple Randomized Classifiers, Y. Amit & G. Blanchard, 2001.)

68 Bagging, Boosting and C4.5 (J.R. Quinlan, 1996) A table compares C4.5's mean error rate over 10-fold cross-validation with bagged C4.5 and boosted C4.5, and bagging against boosting, reporting for each comparison the number of the 27 data sets won (e.g. 21/27, 20/27).

69 Boosting can fail to perform well: Given insufficient data. Overly complex weak hypotheses. Weak hypotheses which are too weak. Bagging – vs. Boosting

70 More results (from Boosting the Margin, Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998): plots of the training error and test error of the base classifier and of the combined classifier.

71 Pruning Horizon effect – lack of sufficient look-ahead: a stopping condition may be met "too early" for overall optimal recognition accuracy. Smaller decision trees are desirable. Pruning avoids overfitting without the lack of look-ahead.

72 Impurity-based pruning  Grow the tree fully, until the leaf nodes have minimum impurity.  Then eliminate pairs of neighboring leaf nodes whose split gives only a small decrease in impurity; their common antecedent is declared a leaf.  Pruning starts at the leaf nodes.

73 Error-complexity-based pruning Pass the pruning set through the tree and record at each node the number of errors if the tree were terminated there.  Compare, for each node, the number of errors if that node were made a terminal node (E_p) with the number of errors at the terminal nodes of its subtree (E_st).  If E_p ≤ E_st, replace the subtree with that node.
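A bottom-up sketch of this error-based pruning, assuming the Node class from the tree-growing sketch earlier and a pruning set of (x, y) pairs (the helper names are illustrative):

```python
from collections import Counter

def tree_classify(node, x):
    while node.label is None:
        node = node.left if x[node.feature] == 0 else node.right
    return node.label

def prune(node, pruning_data):
    """Replace a subtree by a leaf whenever E_p <= E_st on the pruning set."""
    if node.label is not None or not pruning_data:
        return node
    node.left = prune(node.left, [(x, y) for x, y in pruning_data if x[node.feature] == 0])
    node.right = prune(node.right, [(x, y) for x, y in pruning_data if x[node.feature] == 1])
    e_st = sum(1 for x, y in pruning_data if tree_classify(node, x) != y)
    majority = Counter(y for _, y in pruning_data).most_common(1)[0][0]
    e_p = sum(1 for _, y in pruning_data if y != majority)
    if e_p <= e_st:
        return Node(label=majority)        # the node itself becomes a terminal node
    return node
```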

74 C4.5 Rules  Each leaf has an associated rule.  Simplify the rules by eliminating redundant decisions that have no influence on the classifier; in this way even nodes near the root can be pruned.

75 Pre-pruning vs. Post-pruning  Pre-pruning: –Stops further splitting of a given node based on a stopping criterion. –Look-ahead methods.  Post-pruning: –Remove subtrees which have minimal impact on some estimated sensitivity measure, such as the error rate or the increase in impurity. –Difficulty in evaluating combinations of removals → a greedy strategy is used → may not be optimal. –Computation time.

76 Applications  WebSeer - an image search engine for the www (University of Chicago).  C5.0 - Rulequest Research by Ross Quinlan.

77 An Image Search Engine for the WWW Screenshots of an image search query and of the results of the query.

78 Separating Photographs from Graphics  Color transitions from pixel to pixel –Regions of constant color vs. texture and noise. –Edges: sharp vs. smooth, fast transitions. –Light variation and shading.  Highly saturated colors.  Number of colors. (Distinguishing Photographs and Graphics on the WWW, Vassilis Athitsos, Michael J. Swain & Charles Frankel, CBAIVL 1997.)

79 Image Metrics  Number of distinct colors.  Prevalent color –% of pixels in the most frequently occurring color.  Farthest neighbor –Color distance: d(p, p') = |r − r'| + |g − g'| + |b − b'| –Transition value: the maximum color distance within a pixel's 4-neighborhood. –For a given P: the % of pixels with transition value ≥ P.  Saturation –|max(r,g,b) − min(r,g,b)| –For a given P: the % of pixels with saturation level ≥ P.
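A sketch of the farthest-neighbor and saturation computations described above, treating an image as a 2D list of (r, g, b) tuples; this representation and the function names are illustrative assumptions.

```python
def color_distance(p, q):
    """d(p, p') = |r - r'| + |g - g'| + |b - b'|."""
    return sum(abs(a - b) for a, b in zip(p, q))

def transition_value(img, x, y):
    """Maximum color distance between pixel (x, y) and its 4-neighborhood."""
    h, w = len(img), len(img[0])
    neighbors = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    return max(color_distance(img[y][x], img[ny][nx])
               for nx, ny in neighbors if 0 <= nx < w and 0 <= ny < h)

def saturation(pixel):
    """|max(r, g, b) - min(r, g, b)|."""
    return max(pixel) - min(pixel)

def fraction_at_least(values, threshold):
    """The 'for a given P' metrics: fraction of pixels whose value is >= the threshold."""
    return sum(v >= threshold for v in values) / len(values)
```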

80 Image Metrics  Color histogram –Map each pixel (r,g,b) → (⌊r/16⌋, ⌊g/16⌋, ⌊b/16⌋), giving a 16×16×16 histogram normalized to sum to 1. –Create average color histograms H_g (graphics) and H_p (photographs). –For a test image with histogram H_i, compute its correlation with H_g and with H_p and score the image by comparing the two.  Farthest-neighbor histogram –The histogram of transition values. –Create average histograms F_g and F_p. –For a test image F_i, compute and compare the correlations in the same way.

81 Individual Metrics  T – a threshold chosen to minimize E.  E_g, E_p – the errors on the graphics and photograph testing sets; E = (E_g + E_p)/2.  An image is classified as a photograph (P) or a graphic (G) according to which side of the threshold its metric score falls on.

82 Combining the Metric Scores  Grow multiple decision trees (in the style of Amit & Geman). –Binary decision trees. –The test at node n compares the score S_n of metric M_n (with parameter P_n) against a threshold T_n. –Each leaf node stores a probability of "photograph" P_p ∈ [0, 1]. –Classify by the average of the trees' results (A): if A exceeds a threshold, the image is declared a photograph.

83 Results

84 C5.0 The software package was written by Ross Quinlan. –Combines AdaBoost with C4.5. –Available commercially from RuleQuest Research.

85 Is C5.0 better than C4.5?  Rulesets: much faster and much less memory.  Decision trees: faster and smaller. (Charts compare C4.5 and C5.0.)

86 Is C5.0 better than C4.5?  Boosting: adds accuracy. (Chart compares C4.5 and C5.0.)

87 New functionality  Variable misclassification costs –In practical applications some classification errors are more serious than others. –A separate cost can be defined for each predicted/actual class pair, and the classifier minimizes the expected cost.  RuleQuest provides free source code for reading and interpreting the generated classifiers. Demos (data sets: breast cancer, soybean): 1. Simple decision tree & 10-fold cross-validation 2. Pruning 3. Rule sets 4. Boosting, with cross-references on the training and test data.

88

89 References  Y. Amit, 2D Object Detection and Recognition: Models, Algorithms, and Networks, MIT Press.  R. Duda, P. Hart & D. Stork, Pattern Classification, Wiley, 2001.  L. Breiman, J. Friedman, R. Olshen & C. Stone, Classification and Regression Trees, Wadsworth Int. Group, 1984.  Y. Amit, D. Geman and K. Wilder, Joint induction of shape features and tree classifiers, IEEE Trans. Pattern Anal. Mach. Intell., 19, 1997.  Y. Amit and D. Geman, Shape quantization and recognition with randomized trees, Neural Computation, 9, 1997.

90  R.E. Schapire, Y. Freund, P. Bartlett & W. Sun Lee, Boosting the margin: A new explanation for the effectiveness of voting methods, The Annals of Statistics, 26(5), 1998.  L. Breiman, Bagging predictors, Machine Learning, 24(2), 1996.  T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Machine Learning, 40(2).  H. Drucker & C. Cortes, Boosting decision trees, Advances in Neural Information Processing Systems 8, 1996.  J.R. Quinlan, Bagging, Boosting, & C4.5, Proc. 13th Conf. Artificial Intelligence, 1996.  L. Breiman, Arcing classifiers, The Annals of Statistics, 26(3), 1998.  J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.  J.R. Quinlan, Induction of decision trees, Machine Learning, 1(1).

