3Divide and Conquer Internal decision nodes Univariate: Uses a single attribute, xiNumeric xi : Binary split : xi > wmDiscrete xi : n-way split for n possible valuesMultivariate: Uses all attributes, xLeavesClassification: Class labels, or proportionsRegression: Numeric; r average, or local fitLearning is greedy; find the best split recursively (Breiman et al, 1984; Quinlan, 1986, 1993) without backtracking (decision taken are final)
4Side Discussion “Greedy Algorithms” Fast and therefore attractive to solve NP-hard and other problems with high complexity. Later decisions are made in the context of decision selected early dramatically reducing the size of the search space.They do not backtrack: if they make a bad decision (based on local criteria), they never revise the decision.They are not guaranteed to find the optimal solutions, and sometimes can get deceived and find really bad solutions.In spite of what is said above, a lot successful and popular algorithms in Computer Science are greedy algorithms.Greedy algorithms are particularly popular in AI and Operations Research.Popular Greedy Algorithms: Decision Tree Induction,…
7Tree Induction Greedy strategy. Issues Split the records based on an attribute test that optimizes certain criterion.IssuesDetermine how to split the recordsHow to specify the attribute test condition?How to determine the best split?Determine when to stop splitting
8Information Gain vs. Gain Ratio Result:I_Gain_Ratio:city>age>carResult:I_Gain:age > car=cityGain(D,city=)= H(1/3,2/3) – ½ H(1,0) –½ H(1/3,2/3)=0.45D=(2/3,1/3)G_Ratio_pen(city=)=H(1/2,1/2)=1city=sfcity=laD1(1,0)D2(1/3,2/3)Gain(D,car=)= H(1/3,2/3) – 1/6 H(0,1) –½ H(2/3,1/3) – 1/3 H(1,0)=0.45G_Ratio_pen(car=)=H(1/2,1/3,1/6)=1.45D=(2/3,1/3)car=vancar=merccar=taurusD1(0,1)D2(2/3,1/3)D3(1,0)G_Ratio_pen(age=)=log2(6)=2.58Gain(D,age=)= H(1/3,2/3) – 6*1/6 H(0,1)= 0.90D=(2/3,1/3)age=22age=25age=27age=35age=40age=50D1(1,0)D2(0,1)D3(1,0)D4(1,0)D5(1,0)D6(0,1)
9How to determine the Best Split? Before Splitting: 10 records of class 0, 10 records of class 1Before: E(1/2,1/2)After: 4/20*E(1/4,3/4) + 8/20*E(1,0) + 8/20*E(1/8,7/8)Gain: Before-AfterPick Test that has the highest gain!Remark: E stands for Gini, Entropy(H), Impurity(1-maxc(P(c )), Gain-ratio
10Entropy and Gain Computations Assume we have m classes in our classification problem. A test S subdivides the examples D= (p1,…,pm) into n subsets D1 =(p11,…,p1m) ,…,Dn =(p11,…,p1m). The qualify of S is evaluated using Gain(D,S) (ID3) or GainRatio(D,S) (C5.0):Let H(D=(p1,…,pm))= Si=1 (pi log2(1/pi)) (called the entropy function)Gain(D,S)= H(D) - Si=1 (|Di|/|D|)*H(Di)Gain_Ratio(D,S)= Gain(D,S) / H(|D1|/|D|,…, |Dn|/|D|)Remarks:|D| denotes the number of elements in set D.D=(p1,…,pm) implies that p1+…+ pm =1 and indicates that of the |D| examples p1*|D| examples belong to the first class, p2*|D| examples belong to the second class,…, and pm*|D| belong the m-th (last) class.H(0,1)=H(1,0)=0; H(1/2,1/2)=1, H(1/4,1/4,1/4,1/4)=2, H(1/p,…,1/p)=log2(p).C5.0 selects the test S with the highest value for Gain_Ratio(D,S), whereas ID3 picks the test S for the examples in set D with the highest value for Gain (D,S).mn
11Information Gain vs. Gain Ratio Result:I_Gain_Ratio:city>age>carResult:I_Gain:age > car=cityGain(D,city=)= H(1/3,2/3) – ½ H(1,0) –½ H(1/3,2/3)=0.45D=(2/3,1/3)G_Ratio_pen(city=)=H(1/2,1/2)=1city=sfcity=laD1(1,0)D2(1/3,2/3)Gain(D,car=)= H(1/3,2/3) – 1/6 H(0,1) –½ H(2/3,1/3) – 1/3 H(1,0)=0.45G_Ratio_pen(car=)=H(1/2,1/3,1/6)=1.45D=(2/3,1/3)car=vancar=merccar=taurusD1(0,1)D2(2/3,1/3)D3(1,0)G_Ratio_pen(age=)=log2(6)=2.58Gain(D,age=)= H(1/3,2/3) – 6*1/6 H(0,1)= 0.90D=(2/3,1/3)age=22age=25age=27age=35age=40age=50D1(1,0)D2(0,1)D3(1,0)D4(1,0)D5(1,0)D6(0,1)
12Splitting Continuous Attributes Different ways of handlingDiscretization to form an ordinal categorical attributeStatic – discretize once at the beginningDynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), clustering, or supervisedclustering.Binary Decision: (A < v) or (A v)consider all possible splits and finds the best cut vcan be more compute intensive
14Stopping Criteria for Tree Induction Grow entire treeStop expanding a node when all the records belong to the same classStop expanding a node when all the records have the same attribute valuesPre-pruning (do not grow complete tree)Stop when only x examples are left (pre-pruning)… other pre-pruning strategies
15How to Address Overfitting in Decision Trees The most popular approach: Post-pruningGrow decision tree to its entiretyTrim the nodes of the decision tree in a bottom-up fashionIf generalization error improves after trimming, replace sub-tree by a leaf node.Class label of leaf node is determined from majority class of instances in the sub-tree
16Advantages Decision Tree Based Classification Inexpensive to constructExtremely fast at classifying unknown recordsEasy to interpret for small-sized treesOkay for noisy dataCan handle both continuous and symbolic attributesAccuracy is comparable to other classification techniques for many simple data setsDecent average performance over many datasetsKind of a standard—if you want to show that your “new” classification technique really “improves the world” compare its performance against decision trees (e.g. C 5.0) using 10-fold cross-validationDoes not need distance functions; only the order of attribute values is important for classification: 0.1,0.2,0.3 and 0.331,0.332, and is the same for a decision tree learner.
17Disadvantages Decision Tree Based Classification Relies on rectangular approximation that might not be good for some datasetSelecting good learning algorihtm parameters (e.g. degree of pruning) is non-trivialEnsemble techniques, support vector machines, and kNN might obtain higher accuracies for a specific dataset.More recently, forests (ensembles of decision trees) have gained some popularity.
18Error in Regressions Trees Examples are:(1,1.37), (2,-1.45), ( ), (2.2,-125)(7, 2.06), (8, 1.66)Error before split: 1/6(2*0**2+ 2* *0.04)In general: minimizes squared error!
22Summary Regression Trees Regression trees work as decision trees exceptLeafs carry numbers which predict the output variable instead of class labelInstead of entropy/purity the squared prediction error serves as the performance measure.Tests that reduce the squared prediction error the most are selected as node test; test conditions have the form y>threshold where y is an independent variable of the prediction problem.Instead of using the majority class, leaf labels are generated by averaging over the values of the dependent variable of the examples that are associated with the particular class nodeLike decision trees, regression tree employ rectangular tessellations with a fixed number being associated with each rectangle; the number is the output for the inputs which lie inside the rectangle.In contrast to ordinary regression that performs curve fitting, regression tries use averaging with respect to the output variable when making predictions.