Decision Tree Algorithm (C4.5)
Training Examples for PlayTennis (Mitchell 1997)
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Decision Tree for PlayTennis (Mitchell 1997): Outlook is the root. Outlook = Overcast leads to Yes; Outlook = Sunny splits on Humidity (High -> No, Normal -> Yes); Outlook = Rain splits on Wind (Strong -> No, Weak -> Yes).
Decision Tree Algorithm (C4.5)
All the equations of the C4.5 algorithm are as follows. First, calculate Info(S) (the entropy) of the class distribution in the training set S:
Info(S) = - sum_{i=1..k} ( freq(Ci, S) / |S| ) * log2( freq(Ci, S) / |S| )
where |S| is the number of cases in the training set, Ci is a class (i = 1, 2, ..., k), k is the number of classes, and freq(Ci, S) is the number of cases belonging to Ci. For the PlayTennis training set, the class distribution is 9 Yes and 5 No.
Decision Tree Algorithm (C4.5)
Calculate the expected information value Info_X(S) obtained when feature X is used to partition S:
Info_X(S) = sum_{i=1..L} ( |Si| / |S| ) * Info(Si)
where L is the number of outcomes of feature X, Si is the subset of S corresponding to the i-th outcome, and |Si| is the number of cases in subset Si.
Decision Tree Algorithm (C4.5)
Calculate the information gained by partitioning S according to feature X:
Gain(X) = Info(S) - Info_X(S)
Calculate the partition information value acquired when S is partitioned into L subsets:
SplitInfo(X) = - sum_{i=1..L} ( |Si| / |S| ) * log2( |Si| / |S| )
Decision Tree Algorithm (C4.5)
Calculate the gain ratio of Gain(X) over SplitInfo(X):
GainRatio(X) = Gain(X) / SplitInfo(X)
The gain ratio reduces the tendency to choose attributes with many values. If two attributes tie on the gain ratio, either one may be chosen at random.
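To make these definitions concrete, here is a minimal Python sketch (not part of the original lecture; the function names are illustrative) that computes Info, Info_X, SplitInfo, and GainRatio from class counts.

```python
import math

def info(counts):
    """Entropy Info(S) of a node, given its class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_x(partition):
    """Expected information Info_X(S) after splitting S on feature X.
    `partition` holds one class-count list per outcome of X."""
    total = sum(sum(subset) for subset in partition)
    return sum(sum(subset) / total * info(subset) for subset in partition)

def split_info(partition):
    """SplitInfo(X): entropy of the subset sizes themselves."""
    return info([sum(subset) for subset in partition])

def gain_ratio(parent_counts, partition):
    """GainRatio(X) = Gain(X) / SplitInfo(X)."""
    gain = info(parent_counts) - info_x(partition)
    return gain / split_info(partition)

# The 14-case PlayTennis set has 9 Yes and 5 No cases:
print(round(info([9, 5]), 3))  # 0.94
```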
Decision Tree Algorithm (C4.5)
Step 1. Decide which attribute to consider first: Outlook, Temperature, Humidity, or Wind? Use the entropy
Entropy = -P+ log2 P+ - P- log2 P-
At first we have 9 YES and 5 NO cases, so the starting entropy is -9/14 log2 9/14 - 5/14 log2 5/14 = 0.94.
Entropy (Info)
Decision Tree Algorithm (C4.5)
Step 2. Compute the gain ratio for each attribute. Information gain = (entropy before the split) - (expected entropy after the split); we want the attribute that maximizes it. Candidate splits: Outlook (Sunny, Overcast, Rain) and Humidity (High, Normal).
Decision Tree Algorithm (C4.5)
Step 2 (cont.). Compute the gain ratio for each attribute. Candidate splits: Wind (Strong, Weak) and Temperature (Hot, Mild, Cool).
Decision Tree Algorithm (C4.5)
Step 2 (cont.). Gain ratio for Outlook. The subsets are Sunny (2+, 3-), Overcast (4+, 0-), and Rain (3+, 2-).
E1 (Sunny) = -2/5 log2 2/5 - 3/5 log2 3/5 = -0.4*(-1.3219) - 0.6*(-0.737) = 0.5288 + 0.4422 = 0.971
E2 (Overcast) = -4/4 log2 4/4 - 0 = -1 log2 1 - 0 = 0
E3 (Rain) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
Info gain = 0.94 - 5/14*0.971 - 4/14*0 - 5/14*0.971 = 0.247
SplitInfo = -5/14 log2 5/14 - 4/14 log2 4/14 - 5/14 log2 5/14 = 1.577
Gain ratio = 0.247/1.577 = 0.157
Decision Tree Algorithm (C4.5)
Step 2 (cont.). Gain ratio for Humidity. The subsets are High (3+, 4-) and Normal (6+, 1-).
E1 (High) = -3/7 log2 3/7 - 4/7 log2 4/7 = 0.985
E2 (Normal) = -6/7 log2 6/7 - 1/7 log2 1/7 = 0.592
Info gain = 0.94 - 7/14*0.985 - 7/14*0.592 = 0.151
SplitInfo = -7/14 log2 7/14 - 7/14 log2 7/14 = 1
Gain ratio = 0.151/1 = 0.151
Decision Tree Algorithm (C4.5)
Step 2 (cont.). Gain ratio for Temperature. The subsets are Hot (2+, 2-), Mild (4+, 2-), and Cool (3+, 1-).
E1 (Hot) = -2/4 log2 2/4 - 2/4 log2 2/4 = 1
E2 (Mild) = -4/6 log2 4/6 - 2/6 log2 2/6 = 0.918
E3 (Cool) = -3/4 log2 3/4 - 1/4 log2 1/4 = 0.811
Info gain = 0.94 - 4/14*1 - 6/14*0.918 - 4/14*0.811 = 0.029
SplitInfo = -4/14 log2 4/14 - 6/14 log2 6/14 - 4/14 log2 4/14 = 1.557
Gain ratio = 0.029/1.557 = 0.019
Decision Tree Algorithm (C4.5)
Step 2 (cont.). Gain ratio for Wind. The subsets are Strong (3+, 3-) and Weak (6+, 2-).
E1 (Strong) = -3/6 log2 3/6 - 3/6 log2 3/6 = 1
E2 (Weak) = -6/8 log2 6/8 - 2/8 log2 2/8 = 0.811
Info gain = 0.94 - 6/14*1 - 8/14*0.811 = 0.048
SplitInfo = -6/14 log2 6/14 - 8/14 log2 8/14 = 0.985
Gain ratio = 0.048/0.985 = 0.049
Decision Tree Algorithm (C4.5)
Step 2 (cont.). Summary of gain ratios: Outlook = 0.157, Humidity = 0.151, Wind = 0.049, Temperature = 0.019. Outlook has the largest gain ratio, so it becomes the root node.
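As a cross-check of Step 2, the illustrative sketch below recomputes the four gain ratios directly from the class counts tallied on the slides (the helpers are redefined here so the snippet runs on its own).

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, partition):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(s) / n * entropy(s) for s in partition)
    return gain / entropy([sum(s) for s in partition])

parent = [9, 5]  # 9 Yes, 5 No in the whole training set
# Class counts (Yes, No) for each attribute value, as tallied above.
partitions = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rain
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Wind":        [[3, 3], [6, 2]],          # Strong, Weak
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
}
for name, part in partitions.items():
    print(name, round(gain_ratio(parent, part), 3))
# Prints roughly: Outlook 0.156, Humidity 0.152, Wind 0.049, Temperature 0.019
# (the slides round the first two to 0.157 and 0.151); Outlook is largest, so it is the root.
```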
Decision Tree Algorithm (C4.5)
Step 3. Decide which attribute to use under the root node. Temperature, Humidity, or Wind may be chosen under the Sunny and Rain branches. The Overcast branch carries no further information: all of its cases are YES, so it becomes a leaf.
Decision Tree Algorithm (C4.5)
Step 3 (cont.). Look under Outlook = Sunny (2+, 3-). Temperature, Humidity, or Wind may be chosen here.
E (Outlook = Sunny) = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971
Decision Tree Algorithm (C4.5)
Step 3 (cont.). Gain ratio for Humidity under Outlook = Sunny. The subsets are High (0+, 3-) and Normal (2+, 0-).
E1 (Humidity = High) = -0/3 log2 0/3 - 3/3 log2 3/3 = 0
E2 (Humidity = Normal) = -2/2 log2 2/2 - 0/2 log2 0/2 = 0
Info gain = 0.971 - 3/5*0 - 2/5*0 = 0.971
SplitInfo = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
Gain ratio = 0.971/0.971 = 1
Decision Tree Algorithm (C4.5)
Step 3 (cont.). Gain ratio for Temperature under Outlook = Sunny. The subsets are Hot (0+, 2-), Mild (1+, 1-), and Cool (1+, 0-).
E1 (Temperature = Hot) = -0/2 log2 0/2 - 2/2 log2 2/2 = 0
E2 (Temperature = Mild) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1
E3 (Temperature = Cool) = -1/1 log2 1/1 - 0/1 log2 0/1 = 0
Info gain = 0.971 - 2/5*0 - 2/5*1 - 1/5*0 = 0.571
SplitInfo = -2/5 log2 2/5 - 2/5 log2 2/5 - 1/5 log2 1/5 = 1.522
Gain ratio = 0.571/1.522 = 0.375
Decision Tree Algorithm (C4.5)
Step 3 (cont.). Gain ratio for Wind under Outlook = Sunny. The subsets are Strong (1+, 1-) and Weak (1+, 2-).
E1 (Wind = Strong) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1
E2 (Wind = Weak) = -1/3 log2 1/3 - 2/3 log2 2/3 = 0.918
Info gain = 0.971 - 2/5*1 - 3/5*0.918 = 0.020
SplitInfo = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971
Gain ratio = 0.020/0.971 = 0.02
Decision Tree Algorithm (C4.5)
Step 3 (cont.). Summary of gain ratios under Outlook = Sunny: Humidity = 1, Temperature = 0.375, Wind = 0.02. So Humidity is chosen. The tree so far: Outlook = Overcast -> YES (4 cases); Outlook = Sunny -> Humidity, with Normal -> YES (2 cases) and High -> No (3 cases).
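The same computation can be repeated on the Outlook = Sunny subset to reproduce the three gain ratios above; this illustrative snippet redefines the helpers so it stands alone.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, partition):
    gain = entropy(parent) - sum(sum(s) / sum(parent) * entropy(s) for s in partition)
    return gain / entropy([sum(s) for s in partition])

sunny = [2, 3]  # 2 Yes, 3 No under Outlook = Sunny
print(round(gain_ratio(sunny, [[0, 3], [2, 0]]), 3))          # Humidity (High, Normal)       -> 1.0
print(round(gain_ratio(sunny, [[0, 2], [1, 1], [1, 0]]), 3))  # Temperature (Hot, Mild, Cool) -> 0.375
print(round(gain_ratio(sunny, [[1, 1], [1, 2]]), 3))          # Wind (Strong, Weak)           -> 0.021
```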
Decision Tree Algorithm (C4.5)
Step 4. Temperature, Humidity, or Wind may be chosen under Outlook = Rain (3+, 2-).
E (Outlook = Rain) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
Gain ratio for Humidity under Outlook = Rain. The subsets are High (1+, 1-) and Normal (2+, 1-).
E1 (Humidity = High) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1
E2 (Humidity = Normal) = -2/3 log2 2/3 - 1/3 log2 1/3 = 0.918
Info gain = 0.971 - 2/5*1 - 3/5*0.918 = 0.020
SplitInfo = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971
Gain ratio = 0.020/0.971 = 0.02
Decision Tree Algorithm (C4.5)
Step 4 (cont.). Gain ratio for Temperature under Outlook = Rain. The subsets are Hot (0+, 0-), Mild (2+, 1-), and Cool (1+, 1-).
E1 (Temperature = Hot) = 0 (no cases)
E2 (Temperature = Mild) = -2/3 log2 2/3 - 1/3 log2 1/3 = 0.918
E3 (Temperature = Cool) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1
Info gain = 0.971 - 0/5*0 - 3/5*0.918 - 2/5*1 = 0.020
SplitInfo = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
Gain ratio = 0.020/0.971 = 0.02
Decision Tree Algorithm (C4.5)
Step 4 (cont.). Gain ratio for Wind under Outlook = Rain. The subsets are Strong (0+, 2-) and Weak (3+, 0-).
E1 (Wind = Strong) = -0/2 log2 0/2 - 2/2 log2 2/2 = 0
E2 (Wind = Weak) = -3/3 log2 3/3 - 0/3 log2 0/3 = 0
Info gain = 0.971 - 2/5*0 - 3/5*0 = 0.971
SplitInfo = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971
Gain ratio = 0.971/0.971 = 1
Decision Tree Algorithm (C4.5)
Step 4 (cont.). Summary of gain ratios under Outlook = Rain: Wind = 1, Humidity = 0.02, Temperature = 0.02. So Wind is chosen. The final tree: Outlook = Overcast -> YES (4 cases); Outlook = Sunny -> Humidity, with High -> No (3 cases) and Normal -> YES (2 cases); Outlook = Rain -> Wind, with Strong -> No (2 cases) and Weak -> YES (3 cases).
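Read as nested rules, the finished tree can be written as a plain function. The sketch below is only an illustration of the result (C4.5 itself prints its own tree and ruleset format); note that Temperature never appears in the tree.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Classify one day with the tree grown above."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    # outlook == "Rain"
    return "Yes" if wind == "Weak" else "No"

print(play_tennis("Sunny", "High", "Weak"))  # No
print(play_tennis("Rain", "High", "Weak"))   # Yes
```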
Decision Tree Algorithm (C4.5)
Additional situation: continuous values. Temperature is now recorded as a number:
Outlook    Temperature  Humidity  Windy  Play
sunny      85           High      false  no
sunny      80           High      true   no
overcast   83           High      false  yes
rainy      70           High      false  yes
rainy      68           Normal    false  yes
rainy      65           Normal    true   no
overcast   64           Normal    true   yes
sunny      72           High      false  no
sunny      69           Normal    false  yes
rainy      75           Normal    false  yes
sunny      75           Normal    true   yes
overcast   72           High      true   yes
overcast   81           Normal    false  yes
rainy      71           High      true   no
Decision Tree Algorithm (C4.5)
Additional situation: continuous values. The 14 temperature values are 85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71.
Step 1. Sort them and map each value to its Play label:
64  65  68  69  70  71  72  72  75  75  80  81  83  85
Y   N   Y   Y   Y   N   N   Y   Y   Y   N   Y   Y   N
Decision Tree Algorithm (C4.5)
Step 2. Decide the candidate cut points: one wherever the label changes from Y to N or from N to Y.
Cut point 64.5: E1(1+, 0-) & E2(8+, 5-)
Cut point 66.5: E1(1+, 1-) & E2(8+, 4-)
Cut point 70.5: E1(4+, 1-) & E2(5+, 4-)
Cut point 71.5: E1(4+, 2-) & E2(5+, 3-)
Cut point 73.5: E1(5+, 3-) & E2(4+, 2-)
Cut point 77.5: E1(7+, 3-) & E2(2+, 2-)
Cut point 80.5: E1(7+, 4-) & E2(2+, 1-)
Cut point 84.0: E1(9+, 4-) & E2(0+, 1-)
Example: cut point at 64.5.
E1 (<= 64.5) = -1/1 log2 1/1 - 0/1 log2 0/1 = 0
E2 (> 64.5) = -8/13 log2 8/13 - 5/13 log2 5/13 = 0.961
Info gain = 0.940 - 1/14*0 - 13/14*0.961 = 0.048
The best numeric cut point is then compared with the other, non-continuous attributes.
Decision Tree Algorithm (C4.5)
If the cut point is 64.5, gain ratio = 0.048/0.371 = 0.129 [E1(1+, 0-) & E2(8+, 5-)]
If the cut point is 66.5, gain ratio = 0.010/0.592 = 0.017 [E1(1+, 1-) & E2(8+, 4-)]
If the cut point is 70.5, gain ratio = 0.045/0.940 = 0.048 [E1(4+, 1-) & E2(5+, 4-)]
If the cut point is 71.5, gain ratio = 0.001/0.985 = 0.001 [E1(4+, 2-) & E2(5+, 3-)]
Decision Tree Algorithm (C4.5)
If the cut point is 73.5, gain ratio = 0.001/0.985 = 0.001 [E1(5+, 3-) & E2(4+, 2-)]
If the cut point is 77.5, gain ratio = 0.025/0.863 = 0.029 [E1(7+, 3-) & E2(2+, 2-)]
If the cut point is 80.5, gain ratio = 0.0005/0.750 = 0.001 [E1(7+, 4-) & E2(2+, 1-)]
Decision Tree Algorithm (C4.5)
We then need to find the maximum gain ratio.
If the cut point is 84.0, gain ratio = 0.113/0.371 = 0.305 [E1(9+, 4-) & E2(0+, 1-)]
The maximum gain ratio is 0.305 and the cut point is 84.0.
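A sketch of the threshold search under the simplifying assumptions used on these slides: sort the numeric values, place a candidate cut wherever the class labels change between adjacent distinct values, and score each candidate by gain ratio. (C4.5's own threshold handling differs in some details, so treat this as illustrative only.)

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio_for_cut(values, labels, cut):
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    gain = entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))  # entropy of the two subset sizes
    return gain / split_info

# Sorted temperatures and their Play labels.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Y", "N", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "N"]

# Candidate cuts: midpoints between adjacent distinct values whose label sets differ.
distinct = sorted(set(temps))
labels_at = {v: {l for x, l in zip(temps, labels) if x == v} for v in distinct}
cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:]) if labels_at[a] != labels_at[b]]

for cut in cuts:
    print(cut, round(gain_ratio_for_cut(temps, labels, cut), 3))
best = max(cuts, key=lambda c: gain_ratio_for_cut(temps, labels, c))
print("best cut:", best)  # 84.0 (gain ratio about 0.305)
```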
Decision Tree Algorithm (C4.5)
Parameter: minimum cases. How large should the minimum-case value be? If the total number of cases in the training set is under 1000, a value of 2 is recommended. Changing the minimum-case value changes the tree structure, the rule length, and the number of rules.
Decision Tree Algorithm (C4.5)
Parameter: minimum cases. To avoid over-fitting, a split is created only when a specified threshold (e.g. the minimum number of cases required for a split search) is met. This is the so-called minimum-case parameter.
Decision Tree Algorithm (C4.5)
Example: in the full tree, Outlook (14 cases) is split first, and under Sunny the 5-case node (2+, 3-) is split on Humidity into Normal (2+, 0-, YES) and High (0+, 3-, No). If the minimum case is set to 6, that 5-case node can no longer be split, so it becomes a leaf labelled No (2+, 3-).
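A minimal sketch of how a minimum-case threshold gates splitting (the names are illustrative, and C4.5's actual test looks at the sizes of the resulting subsets, but the effect on the 5-case Sunny node is the same).

```python
from collections import Counter

def split_allowed(labels, min_cases):
    """A node may be split only if it holds at least `min_cases` cases and is not pure;
    otherwise it becomes a leaf labelled with its majority class."""
    return len(labels) >= min_cases and len(set(labels)) > 1

sunny_node = ["No", "No", "No", "Yes", "Yes"]    # the 2+/3- Outlook = Sunny node
print(split_allowed(sunny_node, min_cases=2))    # True  -> split on Humidity
print(split_allowed(sunny_node, min_cases=6))    # False -> keep as a leaf
print(Counter(sunny_node).most_common(1)[0][0])  # 'No' (majority class of the leaf)
```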
Decision Tree Algorithm (C4.5)
Parameter: pruning confidence level. U_CF(E, N), where E is the number of errors and N is the number of training instances (e.g. U_0.25(0, 6) = 0.206 is the estimated error rate, so the expected number of errors is 6*0.206 = 1.236).
The estimated error is used to decide whether the tree built in the growth phase should be pruned at certain nodes. The true probability of error cannot be determined exactly; however, it can be bracketed by a pair of confidence limits derived from the binomial distribution. C4.5 simply equates the estimated error rate at a leaf with the upper limit, based on the argument that the tree has been constructed to minimize the observed error rate.
Decision Tree Algorithm (C4.5)
Parameter: pruning confidence level. U_CF(E, N), where E is the number of errors and N is the number of training instances (e.g. U_0.25(0, 6) = 0.206, so the expected number of errors is 6*0.206 = 1.236).
Example: node 6 has three leaves holding 6, 9, and 1 training cases. Their combined estimated error is 6*0.206 + 9*0.143 + 1*0.750 = 3.273. If the estimated error of node 6 treated as a single leaf is only 2.63, the subtree under node 6 is pruned and replaced by that leaf. The same comparison is then made higher up, e.g. at node 1, whose estimated error as a leaf is 4.21.
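The upper confidence limit U_CF(E, N) follows from the binomial distribution: it is the error rate p at which observing at most E errors in N cases has probability CF. The illustrative sketch below solves for p by bisection (C4.5 uses its own routine) and reproduces the numbers in the example.

```python
import math

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(e + 1))

def ucf(e, n, cf=0.25):
    """Upper confidence limit on the error rate: the p such that P(X <= e) = cf."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the CDF decreases as p grows
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if binom_cdf(e, n, mid) > cf else (lo, mid)
    return lo

print(round(ucf(0, 6), 3))  # 0.206
print(round(ucf(0, 9), 3))  # 0.143
print(round(ucf(0, 1), 3))  # 0.75
# Combined estimated errors of the three leaves under node 6:
print(round(6 * ucf(0, 6) + 9 * ucf(0, 9) + 1 * ucf(0, 1), 3))  # 3.273 > 2.63, so prune to a leaf
```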
Decision Tree Algorithm (C4.5) Future research: Multiple cut points of continuous values.