2Quiz 1Q: Is a tree with only pure leafs always the best classifier you can have?A: No.This tree is the best classifier on the training set, but possibly not on new and unseen data. Because of overfitting, the tree may not generalize very well.
4Pruning Goal: Prevent overfitting to noise in the data Two strategies for “pruning” the decision tree:Postpruning - take a fully-grown decision tree and discard unreliable partsPrepruning - stop growing a branch when information becomes unreliable
5Prepruning Based on statistical significance test Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular nodeMost popular test: chi-squared testID3 used chi-squared test in addition to information gainOnly statistically significant attributes were allowed to be selected by information gain procedure
6Early stoppingabclass1234Pre-pruning may stop the growth process prematurely: early stoppingClassic example: XOR/Parity-problemNo individual attribute exhibits any significant association to the classStructure is only visible in fully expanded treePre-pruning won’t expand the root nodeBut: XOR-type problems rare in practiceAnd: pre-pruning faster than post-pruning
7Post-pruning First, build full tree Then, prune it Fully-grown tree shows all attribute interactionsProblem: some subtrees might be due to chance effectsTwo pruning operations:Subtree replacementSubtree raising
8Subtree replacement Bottom-up Consider replacing a tree only after considering all its subtrees
10Estimating error rates Prune only if it reduces the estimated errorError on the training data is NOT a useful estimator Q: Why would it result in very little pruning?Use hold-out set for pruning (“reduced-error pruning”)C4.5’s methodDerive confidence interval from training dataUse a heuristic limit, derived from this, for pruningStandard Bernoulli-process-based methodShaky statistical assumptions (based on training data)
11Estimating Error Rates Q: what is the error rate on the training set?A: 0.33 (2 out of 6)Q: will the error on the test set be bigger, smaller or equal?A: bigger
12*Mean and variance Mean and variance for a Bernoulli trial: p, p (1–p) Expected success rate f=S/NMean and variance for f : p, p (1–p)/NFor large enough N, f follows a Normal distributionc% confidence interval [–z X z] for random variable with 0 mean is given by:With a symmetric distribution:
13*Confidence limitsConfidence limits for the normal distribution with 0 mean and a variance of 1:Thus:To use this we have to reduce our random variable f to have 0 mean and unit variancePr[X z]z0.1%3.090.5%2.581%2.335%1.6510%1.2820%0.8425%0.6940%0.25–
14*Transforming fTransformed value for f : (i.e. subtract the mean and divide by the standard deviation)Resulting equation:Solving for p:
15C4.5’s methodError estimate for subtree is weighted sum of error estimates for all its leavesError estimate for a node (upper bound):If c = 25% then z = 0.69 (from normal distribution)f is the error on the training dataN is the number of instances covered by the leaf
16Example f = 5/14 e = 0.46 e < 0.51 so prune! f=0.33 e=0.47 Combined using ratios 6:2:6 gives 0.51
17Summary Decision Trees No method is always superior – experiment! splits – binary, multi-waysplit criteria – information gain, gain ratio, …missing value treatmentpruningNo method is always superior – experiment!