Published by Weston Mellis. Modified about 1 year ago.

1 Decision Tree Algorithm (C4.5)

2 Training Examples for PlayTennis (Mitchell 1997)

3 Decision Tree for PlayTennis (Mitchell 1997)

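4 Decision Tree Algorithm (C4.5)
The equations of the C4.5 algorithm are as follows.
Calculate Info(S), the entropy of the training set S, which measures the information needed to identify the class of a case in S:
$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$
where $|S|$ is the number of cases in the training set, $C_i$ is a class ($i = 1, 2, \ldots, k$), $k$ is the number of classes, and $\mathrm{freq}(C_i, S)$ is the number of cases in S that belong to $C_i$.
For PlayTennis, S contains 9 Yes and 5 No cases (9Y, 5N).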
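The entropy formula can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `info` and its list-of-counts argument are choices made here for illustration.

```python
import math

def info(class_counts):
    """Entropy Info(S) in bits, given the number of cases in each class."""
    total = sum(class_counts)
    # Empty classes are skipped, since 0 * log2(0) is taken as 0.
    return sum((c / total) * math.log2(total / c) for c in class_counts if c > 0)

# PlayTennis training set: 9 Yes and 5 No cases.
play_tennis_entropy = info([9, 5])
print(round(play_tennis_entropy, 3))  # 0.94
```

A set whose cases all belong to one class, such as `info([4, 0])`, has entropy 0.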

5 Decision Tree Algorithm (C4.5)
Calculate the expected information value $\mathrm{Info}_X(S)$ after feature X partitions S:
$$\mathrm{Info}_X(S) = \sum_{i=1}^{L} \frac{|S_i|}{|S|} \, \mathrm{Info}(S_i)$$
where $L$ is the number of outcomes of feature X, $S_i$ is the subset of S corresponding to the $i$-th outcome, and $|S_i|$ is the number of cases in subset $S_i$.

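6 Decision Tree Algorithm (C4.5)
Calculate the information gained by partitioning S according to feature X:
$$\mathrm{Gain}(X) = \mathrm{Info}(S) - \mathrm{Info}_X(S)$$
Calculate the partition (split) information value for S partitioned into L subsets:
$$\mathrm{SplitInfo}(X) = -\sum_{i=1}^{L} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$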

7 Decision Tree Algorithm (C4.5)
Calculate the gain ratio as Gain(X) over SplitInfo(X):
$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{SplitInfo}(X)}$$
The gain ratio reduces the bias toward choosing an attribute just because it has many values.
If two attributes tie on gain ratio, one of them can be chosen at random.
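To make the relation between Info, Gain, SplitInfo, and the gain ratio concrete, here is a small Python sketch. It is an illustration only: the function names and the convention of passing class counts per subset are assumptions made here, not something the slides prescribe.

```python
import math

def info(counts):
    """Entropy in bits of a list of counts (zero counts are skipped)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def gain_ratio(parent_counts, subset_counts):
    """Return (gain, split_info, gain_ratio) for one candidate feature X.

    parent_counts: class counts in S, e.g. [9, 5]
    subset_counts: class counts in each subset S_i produced by X
    """
    n = sum(parent_counts)
    sizes = [sum(s) for s in subset_counts]
    # Info_X(S): size-weighted entropy of the subsets.
    info_x = sum(size / n * info(s) for size, s in zip(sizes, subset_counts))
    gain = info(parent_counts) - info_x
    # SplitInfo(X) is the entropy of the subset sizes themselves.
    split_info = info(sizes)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, split_info, ratio

# Outlook on PlayTennis: Sunny (2+, 3-), Overcast (4+, 0-), Rain (3+, 2-).
gain, split_info, ratio = gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]])
print(f"gain={gain:.3f} split_info={split_info:.3f} ratio={ratio:.3f}")
```

Because the sketch keeps full floating-point precision, its third decimal can differ slightly from hand calculations that round intermediate values.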

8 Decision Tree Algorithm (C4.5)
Step 1: Decide which attribute to consider first: Outlook, Temperature, Humidity, or Wind?
Use entropy: $E = -P_+ \log_2 P_+ - P_- \log_2 P_-$
At first, we have 9 (Yes) and 5 (No).
Starting entropy: $-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$

9 Entropy (Info)

10 Decision Tree Algorithm (C4.5)
Step 2: Compute the gain ratio for each attribute.
Information gain = (entropy before the split) - (expected entropy after the split); we want the maximum.
[Figure: candidate splits. Humidity: High, Normal. Outlook: Sunny, Overcast, Rain.]

11 Decision Tree Algorithm (C4.5)
Step 2 (cont'd): Compute the gain ratio for each attribute.
[Figure: candidate splits. Wind: Strong, Weak. Temperature: Hot, Mild, Cool.]

12 Decision Tree Algorithm (C4.5)
Step 2 (cont'd): Compute the gain ratio for each attribute.
Outlook: Sunny (2+, 3-), Overcast (4+, 0-), Rain (3+, 2-)
$E_1 = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = -0.4 \times (-1.3219) - 0.6 \times (-0.7370) = 0.5288 + 0.4422 = 0.971$
$E_2 = -\frac{4}{4}\log_2\frac{4}{4} - 0 = -1 \log_2 1 - 0 = 0$
$E_3 = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = -0.6 \times (-0.7370) - 0.4 \times (-1.3219) = 0.971$
Info gain $= 0.9403 - \frac{5}{14} \times 0.971 - \frac{4}{14} \times 0 - \frac{5}{14} \times 0.971 = 0.9403 - 0.6936 = 0.247$
SplitInfo $= -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.577$
Gain ratio $= 0.247 / 1.577 = 0.157$

13 Decision Tree Algorithm (C4.5)
Step 2 (cont'd): Compute the gain ratio for each attribute.
Humidity: High (3+, 4-), Normal (6+, 1-)
$E_1 = -\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7} = 0.985$
$E_2 = -\frac{6}{7}\log_2\frac{6}{7} - \frac{1}{7}\log_2\frac{1}{7} = 0.592$
Info gain $= 0.94 - \frac{7}{14} \times 0.985 - \frac{7}{14} \times 0.592 = 0.151$
SplitInfo $= -\frac{7}{14}\log_2\frac{7}{14} - \frac{7}{14}\log_2\frac{7}{14} = 1$
Gain ratio $= 0.151 / 1 = 0.151$

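14 Decision Tree Algorithm (C4.5)
Step 2 (cont'd): Compute the gain ratio for each attribute.
Temperature: Hot (2+, 2-), Mild (4+, 2-), Cool (3+, 1-)
$E_1 = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$ (Hot)
$E_2 = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} = 0.9183$ (Mild)
$E_3 = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 0.8113$ (Cool)
Info gain $= 0.94 - \frac{4}{14} \times 1 - \frac{6}{14} \times 0.9183 - \frac{4}{14} \times 0.8113 = 0.0292$
SplitInfo $= -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$
Gain ratio $= 0.0292 / 1.557 = 0.0188$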

15 Decision Tree Algorithm (C4.5)
Step 2 (cont'd): Compute the gain ratio for each attribute.
Wind: Strong (3+, 3-), Weak (6+, 2-)
$E_1 = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$
$E_2 = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} = 0.8113$
Info gain $= 0.94 - \frac{6}{14} \times 1 - \frac{8}{14} \times 0.8113 = 0.048$
SplitInfo $= -\frac{6}{14}\log_2\frac{6}{14} - \frac{8}{14}\log_2\frac{8}{14} = 0.9852$
Gain ratio $= 0.048 / 0.9852 = 0.049$

16 Decision Tree Algorithm (C4.5)
Step 2 (cont'd): Summary of gain ratios
Gain ratio for Outlook = 0.157
Gain ratio for Humidity = 0.151
Gain ratio for Wind = 0.049
Gain ratio for Temperature = 0.0188
Outlook has the largest gain ratio, so the root node is Outlook.
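The root selection can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the [Yes, No] counts are copied from the per-attribute slides above, and the helper names `info` and `gain_ratio` are introduced here for illustration.

```python
import math

def info(counts):
    """Entropy in bits of a list of counts (zero counts are skipped)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def gain_ratio(parent, subsets):
    """Gain(X) / SplitInfo(X) from class counts of S and of each subset S_i."""
    n = sum(parent)
    sizes = [sum(s) for s in subsets]
    gain = info(parent) - sum(size / n * info(s) for size, s in zip(sizes, subsets))
    return gain / info(sizes)

# [Yes, No] counts for each attribute value, as listed on the slides.
splits = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],      # Sunny, Overcast, Rain
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
    "Humidity": [[3, 4], [6, 1]],             # High, Normal
    "Wind": [[3, 3], [6, 2]],                 # Strong, Weak
}

ratios = {name: gain_ratio([9, 5], subsets) for name, subsets in splits.items()}
root = max(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {ratio:.3f}")
print("root:", root)  # root: Outlook
```

The printed ratios are computed without intermediate rounding, so they can differ from the slide values in the third decimal place; the ranking is the same.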

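17 Decision Tree Algorithm (C4.5)
Step 3: Decide which attribute goes under each branch of the root node Outlook (Sunny, Overcast, Rain).
We may choose among Temperature (T), Humidity (H), and Wind (W) under Sunny and under Rain. Overcast can be ignored: all of its cases are Yes, so its entropy is 0 and there is no information left to gain.
[Figure: Outlook at the root with branches Sunny (to be decided), Overcast (Yes), and Rain (to be decided).]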

18 Decision Tree Algorithm (C4.5)
Step 3 (cont'd): Look under Outlook = Sunny (2+, 3-).
We may choose among T, H, and W under Outlook = Sunny.
$E(\text{Outlook} = \text{Sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.97$

19 Decision Tree Algorithm (C4.5)
Step 3 (cont'd): Look under Outlook = Sunny.
Humidity under Sunny: High (0+, 3-), Normal (2+, 0-)
$E_1$ (Outlook = Sunny and Humidity = High) $= -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0$
$E_2$ (Outlook = Sunny and Humidity = Normal) $= -\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2} = 0$
Info gain (Outlook = Sunny, Humidity) $= 0.971 - \frac{3}{5} \times 0 - \frac{2}{5} \times 0 = 0.971$
SplitInfo $= -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$
Gain ratio $= 0.971 / 0.971 = 1$

20 Decision Tree Algorithm (C4.5)
Step 3 (cont'd): Look under Outlook = Sunny.
Temperature under Sunny: Hot (0+, 2-), Mild (1+, 1-), Cool (1+, 0-)
$E_1$ (Outlook = Sunny and Temperature = Hot) $= -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0$
$E_2$ (Outlook = Sunny and Temperature = Mild) $= -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$E_3$ (Outlook = Sunny and Temperature = Cool) $= -\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1} = 0$
Info gain (Outlook = Sunny, Temperature) $= 0.97 - \frac{2}{5} \times 0 - \frac{2}{5} \times 1 - \frac{1}{5} \times 0 = 0.57$
SplitInfo $= -\frac{2}{5}\log_2\frac{2}{5} - \frac{2}{5}\log_2\frac{2}{5} - \frac{1}{5}\log_2\frac{1}{5} = 1.522$
Gain ratio $= 0.57 / 1.522 = 0.375$

21 Decision Tree Algorithm (C4.5)
Step 3 (cont'd): Look under Outlook = Sunny.
Wind under Sunny: Strong (1+, 1-), Weak (1+, 2-)
$E_1$ (Outlook = Sunny and Wind = Strong) $= -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$E_2$ (Outlook = Sunny and Wind = Weak) $= -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.918$
Info gain (Outlook = Sunny, Wind) $= 0.971 - \frac{2}{5} \times 1 - \frac{3}{5} \times 0.918 = 0.020$
SplitInfo $= -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$
Gain ratio $= 0.020 / 0.971 = 0.02$

22 Decision Tree Algorithm (C4.5)
Step 3 (cont'd): Summary of gain ratios under Outlook = Sunny
Humidity: 1
Temperature: 0.375
Wind: 0.02
So choose Humidity.
[Figure: partial tree. Outlook → Sunny → Humidity (High: No, 3 cases; Normal: Yes, 2 cases); Outlook → Overcast → Yes (4 cases); Outlook → Rain → to be decided.]

23 Decision Tree Algorithm (C4.5)
Step 4: Look under Outlook = Rain. We may choose among T, H, and W.
$E(\text{Rain}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$
Humidity under Rain: High (1+, 1-), Normal (2+, 1-)
$E_1$ (Outlook = Rain and Humidity = High) $= -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
$E_2$ (Outlook = Rain and Humidity = Normal) $= -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918$
Info gain (Outlook = Rain, Humidity) $= 0.971 - \frac{2}{5} \times 1 - \frac{3}{5} \times 0.918 = 0.020$
SplitInfo $= -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$
Gain ratio $= 0.020 / 0.971 = 0.02$

24 Decision Tree Algorithm (C4.5)
Step 4 (cont'd): Look under Outlook = Rain.
Temperature under Rain: Hot (0+, 0-), Mild (2+, 1-), Cool (1+, 1-)
$E_1$ (Outlook = Rain and T = Hot) $= 0$ (no cases)
$E_2$ (Outlook = Rain and T = Mild) $= -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918$
$E_3$ (Outlook = Rain and T = Cool) $= -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
Info gain (Outlook = Rain, Temperature) $= 0.971 - \frac{0}{5} \times 0 - \frac{3}{5} \times 0.918 - \frac{2}{5} \times 1 = 0.020$
SplitInfo $= -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971$
Gain ratio $= 0.020 / 0.971 = 0.02$

25 Decision Tree Algorithm (C4.5)
Step 4 (cont'd): Look under Outlook = Rain.
Wind under Rain: Strong (0+, 2-), Weak (3+, 0-)
$E_1$ (Outlook = Rain and Wind = Strong) $= -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0$
$E_2$ (Outlook = Rain and Wind = Weak) $= -\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3} = 0$
Info gain (Outlook = Rain, Wind) $= 0.971 - \frac{2}{5} \times 0 - \frac{3}{5} \times 0 = 0.971$
SplitInfo $= -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$
Gain ratio $= 0.971 / 0.971 = 1$

26 Decision Tree Algorithm (C4.5)
Step 4 (cont'd): Summary of gain ratios under Outlook = Rain
Wind: 1
Humidity: 0.02
Temperature: 0.02
So choose Wind.
[Figure: final tree. Outlook → Sunny → Humidity (High: No, 3 cases; Normal: Yes, 2 cases); Outlook → Overcast → Yes (4 cases); Outlook → Rain → Wind (Strong: No, 2 cases; Weak: Yes, 3 cases).]
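Steps 2 to 4 repeat the same procedure on smaller and smaller subsets, which suggests a recursive sketch. The Python below is a simplified illustration, not Quinlan's C4.5 implementation: it handles only nominal attributes, has no pruning or minimum-case check, and breaks ties by column order. The 14 rows are assumed to be the PlayTennis table from Mitchell (1997), which the slides reference but which is not reproduced in this transcript; the per-value counts agree with the slides.

```python
import math
from collections import Counter

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# Assumed PlayTennis rows: (Outlook, Temperature, Humidity, Wind, PlayTennis).
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def info(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def gain_ratio(rows, col):
    """Gain ratio of splitting rows on the attribute in column col."""
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    info_x = sum(len(g) / n * info(g) for g in groups.values())
    gain = info([row[-1] for row in rows]) - info_x
    split_info = sum(len(g) / n * math.log2(n / len(g)) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

def build(rows, cols):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1 or not cols:
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain_ratio(rows, c))
    rest = [c for c in cols if c != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build(subset, rest)
    return {ATTRS[best]: branches}

tree = build(ROWS, [0, 1, 2, 3])
print(tree)
```

On these rows the sketch selects Outlook at the root, Humidity under Sunny, and Wind under Rain, matching the tree derived by hand.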

27 Decision Tree Algorithm (C4.5)
Additional situation: continuous values

Outlook   Temperature  Humidity  Windy  Play
sunny     85           High      false  no
sunny     80           High      true   no
overcast  83           High      false  yes
rainy     70           High      false  yes
rainy     68           Normal    false  yes
rainy     65           Normal    true   no
overcast  64           Normal    true   yes
sunny     72           High      false  no
sunny     69           Normal    false  yes
rainy     75           Normal    false  yes
sunny     75           Normal    true   yes
overcast  72           High      true   yes
overcast  81           Normal    false  yes
rainy     71           High      true   no

28 Decision Tree Algorithm (C4.5)
Additional situation: continuous values
Temperature values in table order: 85 80 83 70 68 65 64 72 69 75 75 72 81 71
Step 1. Sort the values, keeping each one mapped to its target class (Y = yes, N = no):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N

29 Decision Tree Algorithm (C4.5)
Step 2. Decide the cut point
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N
Place a candidate cut point between adjacent values wherever the class changes from Y to N or from N to Y:
Cut point 64.5: [E_1 (1+, 0-) & E_2 (8+, 5-)]
Cut point 66.5: [E_1 (1+, 1-) & E_2 (8+, 4-)]
Cut point 70.5: [E_1 (4+, 1-) & E_2 (5+, 4-)]
Cut point 71.5: [E_1 (4+, 2-) & E_2 (5+, 3-)]
Cut point 73.5: [E_1 (5+, 3-) & E_2 (4+, 2-)]
Cut point 77.5: [E_1 (7+, 3-) & E_2 (2+, 2-)]
Cut point 80.5: [E_1 (7+, 4-) & E_2 (2+, 1-)]
Cut point 84.0: [E_1 (9+, 4-) & E_2 (0+, 1-)]
Then compare the best cut point with the other, non-continuous attributes.
Example: cut point at 64.5. The split is made on all 14 cases (9+, 5-), so the starting entropy is Info(S) = 0.940.
$E_1 (\le 64.5) = -\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1} = 0$
$E_2 (> 64.5) = -\frac{8}{13}\log_2\frac{8}{13} - \frac{5}{13}\log_2\frac{5}{13} = 0.961$
Info gain $= 0.940 - \frac{1}{14} \times 0 - \frac{13}{14} \times 0.961 = 0.048$

30 Decision Tree Algorithm (C4.5)
If the cut point is 64.5, gain ratio = 0.048 / 0.371 = 0.129 [E_1 (1+, 0-) & E_2 (8+, 5-)]
If the cut point is 66.5, gain ratio = 0.010 / 0.592 = 0.017 [E_1 (1+, 1-) & E_2 (8+, 4-)]
If the cut point is 70.5, gain ratio = 0.045 / 0.940 = 0.048 [E_1 (4+, 1-) & E_2 (5+, 4-)]
If the cut point is 71.5, gain ratio = 0.001 / 0.985 = 0.001 [E_1 (4+, 2-) & E_2 (5+, 3-)]

31 Decision Tree Algorithm (C4.5)
If the cut point is 73.5, gain ratio = 0.001 / 0.985 = 0.001 [E_1 (5+, 3-) & E_2 (4+, 2-)]
If the cut point is 77.5, gain ratio = 0.025 / 0.863 = 0.029 [E_1 (7+, 3-) & E_2 (2+, 2-)]
If the cut point is 80.5, gain ratio = 0.0005 / 0.750 = 0.001 [E_1 (7+, 4-) & E_2 (2+, 1-)]

32 Decision Tree Algorithm (C4.5)
We need to find the maximum gain ratio.
If the cut point is 84.0, gain ratio = 0.113 / 0.371 = 0.305 [E_1 (9+, 4-) & E_2 (0+, 1-)]
The maximum gain ratio is 0.305, at cut point 84.0.
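The cut-point search can be sketched in Python as follows. This is an illustration under stated assumptions: it uses the Temperature and Play columns from the table on slide 27, takes the entropy of all 14 cases (9 yes, 5 no) as the starting entropy, and evaluates every midpoint between adjacent distinct values rather than only the points where the class changes. The function names are hypothetical.

```python
import math

# Temperature and Play columns, in table order.
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
plays = ["no", "no", "yes", "yes", "yes", "no", "yes",
         "no", "yes", "yes", "yes", "yes", "yes", "no"]

def info(counts):
    """Entropy in bits of a list of counts (zero counts are skipped)."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def class_counts(labels):
    return [labels.count("yes"), labels.count("no")]

def candidate_cuts(values):
    """Midpoints between adjacent distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def gain_ratio_at(cut, values, labels):
    """Gain ratio of the binary split value <= cut versus value > cut."""
    left = [label for v, label in zip(values, labels) if v <= cut]
    right = [label for v, label in zip(values, labels) if v > cut]
    n = len(labels)
    gain = (info(class_counts(labels))
            - len(left) / n * info(class_counts(left))
            - len(right) / n * info(class_counts(right)))
    split_info = info([len(left), len(right)])
    return gain / split_info

cuts = candidate_cuts(temps)
best_cut = max(cuts, key=lambda c: gain_ratio_at(c, temps, plays))
print(best_cut)  # 84.0
```

Restricting the candidates to class-change boundaries, as the slides do, only skips cut points that score lower on this data, so the selected cut point is the same.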

33 Decision Tree Algorithm (C4.5)
Parameter: Min. Case
How large should Min. Case be? If the training set has fewer than 1000 cases, 2 is recommended.
Changing Min. Case changes the tree structure, the rule length, and the number of rules.

34 Decision Tree Algorithm (C4.5)
Parameter: Min. Case
To avoid over-fitting, a split is created only if a specified threshold is met (e.g. the minimum number of cases required before a split is searched for). This threshold is the so-called minimum case.

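35 Decision Tree Algorithm (C4.5)
If minimum case is set to 6:
[Figure: two trees. Before: Outlook (14 cases) → Sunny → Humidity (5 cases), with High: No (0+, 3-) and Normal: Yes (2+, 0-). After: Outlook (14 cases) → Sunny → leaf No (2+, 3-), because the 5 Sunny cases are fewer than the minimum of 6.]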
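The effect of the minimum-case parameter on a single node can be sketched as below. This is a simplified reading of the slide, in which a node with fewer than `min_cases` cases becomes a majority-class leaf; the function name and return format are hypothetical, and other implementations may apply the threshold to branch sizes instead.

```python
from collections import Counter

def node_action(labels, min_cases):
    """Decide whether a node is split further or turned into a leaf.

    Returns ("leaf", majority_class) if the node is pure or has fewer
    than min_cases cases, otherwise ("split", None).
    """
    majority, _ = Counter(labels).most_common(1)[0]
    if len(set(labels)) == 1 or len(labels) < min_cases:
        return ("leaf", majority)
    return ("split", None)

# The five Outlook = Sunny cases: 2 Yes, 3 No.
sunny = ["No", "No", "No", "Yes", "Yes"]
print(node_action(sunny, min_cases=6))  # ('leaf', 'No')
print(node_action(sunny, min_cases=2))  # ('split', None)
```

With a minimum of 6 the Sunny branch stops at a No leaf that misclassifies 2 of its 5 training cases, which is the smaller tree shown for this setting.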

36 Parameter: prune confidence level
$U_{CF}(E, N)$, where E is the number of errors and N is the number of training instances. Example: $U_{0.25}(0, 6) = 0.206$ (the estimated error rate), so the expected number of errors is 6 × 0.206 = 1.236.
The estimated error is used to decide whether the tree built in the growth phase should be pruned at a given node.
The probability of error cannot be determined exactly; however, it has a probability distribution (the binomial distribution) that is generally summarized by a pair of confidence limits.
C4.5 simply equates the estimated error rate at a leaf with the upper limit, on the argument that the tree has been constructed to minimize the observed error rate.

37 Decision Tree Algorithm (C4.5)
Parameter: prune confidence level
$U_{CF}(E, N)$, where E is the number of errors and N is the number of training instances (e.g. $U_{0.25}(0, 6) = 0.206$, and the expected error is 6 × 0.206 = 1.236).
[Figure: pruning example on a tree with a root node, node 1, ..., node 6, and leaves Leaf (6), Leaf (9), Leaf (1). Labels: "if the expected error is 2.63 in node 6", "if the expected error is 3.273", "if the expected error is 4.21 in node 1".]
Expected error of the three leaves: 6 × 0.206 + 9 × 0.143 + 1 × 0.750 = 3.273
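The upper limit U_CF(E, N) can be approximated numerically. The sketch below is an assumption-laden illustration: it solves the exact binomial condition P(X ≤ E) = CF for the error rate by bisection, which reproduces the E = 0 values quoted on the slides; released C4.5 code may use a different approximation when E > 0. The function names are hypothetical.

```python
import math

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(e + 1))

def upper_limit(e, n, cf=0.25, tol=1e-9):
    """Error rate p at which observing e or fewer errors in n cases has probability cf."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The CDF decreases as p grows, so move right while it is still above cf.
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Three leaves with (errors, cases) = (0, 6), (0, 9), (0, 1).
leaves = [(0, 6), (0, 9), (0, 1)]
expected_errors = sum(n * upper_limit(e, n) for e, n in leaves)

print(round(upper_limit(0, 6), 3))  # 0.206
print(round(expected_errors, 3))    # 3.273
```

For E = 0 the condition reduces to (1 − p)^N = CF, so U_0.25(0, 6) = 1 − 0.25^(1/6) ≈ 0.206, in agreement with the slide.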

38 Decision Tree Algorithm (C4.5)
Future research: multiple cut points for continuous values.


© 2017 SlidePlayer.com Inc.

All rights reserved.
