
1 SEEM4630 2015-2016 Tutorial 1
Classification: Decision Tree
Siyuan Zhang, syzhang@se.cuhk.edu.hk

2 Classification: Definition
Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes, e.g., a decision tree, naïve Bayes, or k-NN.
Goal: previously unseen records should be assigned a class as accurately as possible.

3 Decision Tree
Goal: construct a tree so that instances belonging to different classes are separated.
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive manner.
- At the start, all the training examples are at the root.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Examples are partitioned recursively based on the selected attributes.
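
A minimal sketch of this greedy procedure in Python, assuming categorical attributes and records stored as plain dicts; it selects the test attribute by weighted entropy (equivalent to information gain, defined formally on the next slide). The names entropy and build_tree are illustrative, not from the tutorial.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy of the class labels of a list of records."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Stop when the node is pure or no test attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class

    def weighted_entropy(attr):
        total = 0.0
        for v in set(r[attr] for r in rows):
            part = [r for r in rows if r[attr] == v]
            total += len(part) / len(rows) * entropy(part, target)
        return total

    # Greedy step: minimizing weighted entropy maximizes information gain.
    best = min(attributes, key=weighted_entropy)
    rest = [a for a in attributes if a != best]
    # Partition the examples and recurse, one subtree per attribute value.
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest, target)
                   for v in set(r[best] for r in rows)}}
```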

4 Attribute Selection Measure 1: Information Gain
Let p_i be the probability that a tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
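
As a sketch, the three quantities translate directly into Python; here labels is the list of class labels of D and partitions the label lists of D_1..D_v (illustrative names):

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_after_split(partitions):
    """Info_A(D): weighted entropy after splitting D into D_1..D_v."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions if p)

def gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(partitions)

# Quick check against the play-tennis data used later: S is [9+, 5-] and
# Outlook splits it into [2+,3-], [4+,0-], [3+,2-].
labels = ["Yes"] * 9 + ["No"] * 5
by_outlook = [["Yes"] * 2 + ["No"] * 3, ["Yes"] * 4, ["Yes"] * 3 + ["No"] * 2]
print(round(gain(labels, by_outlook), 2))  # 0.25
```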

5 Attribute Selection Measure 2: Gain Ratio
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
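
A sketch of the normalization, reusing the gain helper from the previous snippet (assumed in scope). SplitInfo is the entropy of the partition sizes themselves, so attributes that fragment the data into many small partitions are penalized:

```python
from math import log2

def split_info(partitions):
    """SplitInfo_A(D): entropy of the partition sizes |D_j| / |D|."""
    n = sum(len(p) for p in partitions)
    return -sum(len(p) / n * log2(len(p) / n) for p in partitions if p)

def gain_ratio(labels, partitions):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D), with gain() as sketched above."""
    return gain(labels, partitions) / split_info(partitions)

# For Outlook on the play-tennis data: SplitInfo = 1.58,
# so GainRatio = 0.25 / 1.58 = 0.16.
```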

6 Attribute Selection Measure 3: Gini Index
If a data set D contains examples from n classes, the gini index gini(D) is defined as:
  gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as:
  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
Reduction in impurity:
  \Delta gini(A) = gini(D) - gini_A(D)
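
A corresponding sketch for the gini index; d1 and d2 are the label lists of the two subsets (illustrative names):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2 over the class frequencies in labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    """gini_A(D) for a binary split of D into subsets d1 and d2."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(labels, d1, d2):
    """Reduction in impurity: gini(D) - gini_A(D)."""
    return gini(labels) - gini_split(d1, d2)

# For the [9+, 5-] play-tennis data: gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.46.
```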

7 Example

Outlook   Temperature  Humidity  Wind    Play Tennis
Sunny     >25          High      Weak    No
Sunny     >25          High      Strong  No
Overcast  >25          High      Weak    Yes
Rain      15-25        High      Weak    Yes
Rain      <15          Normal    Weak    Yes
Rain      <15          Normal    Strong  No
Overcast  <15          Normal    Strong  Yes
Sunny     15-25        High      Weak    No
Sunny     <15          Normal    Weak    Yes
Rain      15-25        Normal    Weak    Yes
Sunny     15-25        Normal    Strong  Yes
Overcast  15-25        High      Strong  Yes
Overcast  >25          Normal    Weak    Yes
Rain      15-25        High      Strong  No
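
For the verification sketches below, the same 14 records can be encoded as Python dicts (key names are illustrative):

```python
data = [
    {"Outlook": "Sunny",    "Temperature": ">25",   "Humidity": "High",   "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Temperature": ">25",   "Humidity": "High",   "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Temperature": ">25",   "Humidity": "High",   "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Temperature": "15-25", "Humidity": "High",   "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Temperature": "<15",   "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Temperature": "<15",   "Humidity": "Normal", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Temperature": "<15",   "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "15-25", "Humidity": "High",   "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Sunny",    "Temperature": "<15",   "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Temperature": "15-25", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "15-25", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Overcast", "Temperature": "15-25", "Humidity": "High",   "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Overcast", "Temperature": ">25",   "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Temperature": "15-25", "Humidity": "High",   "Wind": "Strong", "Play": "No"},
]
```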

8 Tree induction example: entropy of data S; split data by attribute Outlook

S contains [9+, 5-]:
  Info(S) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94

Splitting S by Outlook gives Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]:
  Gain(Outlook) = 0.94
                  - 5/14 [-2/5 log2(2/5) - 3/5 log2(3/5)]
                  - 4/14 [-4/4 log2(4/4) - 0/4 log2(0/4)]
                  - 5/14 [-3/5 log2(3/5) - 2/5 log2(2/5)]
                = 0.94 - 0.69 = 0.25

9 Tree induction example: split data by attribute Temperature

Splitting S [9+,5-] by Temperature gives <15 [3+,1-], 15-25 [4+,2-], >25 [2+,2-]:
  Gain(Temperature) = 0.94
                      - 4/14 [-3/4 log2(3/4) - 1/4 log2(1/4)]
                      - 6/14 [-4/6 log2(4/6) - 2/6 log2(2/6)]
                      - 4/14 [-2/4 log2(2/4) - 2/4 log2(2/4)]
                    = 0.94 - 0.91 = 0.03

10 Tree induction example: split data by attributes Humidity and Wind

Splitting S [9+,5-] by Humidity gives High [3+,4-], Normal [6+,1-]:
  Gain(Humidity) = 0.94
                   - 7/14 [-3/7 log2(3/7) - 4/7 log2(4/7)]
                   - 7/14 [-6/7 log2(6/7) - 1/7 log2(1/7)]
                 = 0.94 - 0.79 = 0.15

Splitting S [9+,5-] by Wind gives Weak [6+,2-], Strong [3+,3-]:
  Gain(Wind) = 0.94
               - 8/14 [-6/8 log2(6/8) - 2/8 log2(2/8)]
               - 6/14 [-3/6 log2(3/6) - 3/6 log2(3/6)]
             = 0.94 - 0.89 = 0.05
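
To double-check the arithmetic on slides 8-10 mechanically, the snippet below recomputes all four gains at the root, reusing data and the gain helper from the earlier sketches (assumed in scope); partitions_by is an illustrative helper:

```python
def partitions_by(rows, attr, target="Play"):
    """Group the class labels of rows by each value of attr."""
    return [[r[target] for r in rows if r[attr] == v]
            for v in sorted(set(r[attr] for r in rows))]

labels = [r["Play"] for r in data]
for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(gain(labels, partitions_by(data, attr)), 2))
# Outlook 0.25, Temperature 0.03, Humidity 0.15, Wind 0.05
```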

11 Tree induction example

Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05

Outlook gives the largest information gain, so it is chosen as the root. The Overcast branch is pure (all Yes) and becomes a leaf; the Sunny and Rain branches still mix both classes and must be split further:

  Outlook
  ├─ Sunny    → ??
  ├─ Overcast → Yes
  └─ Rain     → ??

12 Tree induction example: entropy of branch Sunny

The Sunny branch contains [2+, 3-]:
  Info(Sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.97

Split the Sunny branch by Temperature: <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]:
  Gain(Temperature) = 0.97
                      - 1/5 [-1/1 log2(1/1) - 0/1 log2(0/1)]
                      - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                      - 2/5 [-0/2 log2(0/2) - 2/2 log2(2/2)]
                    = 0.97 - 0.4 = 0.57

Split the Sunny branch by Humidity: High [0+,3-], Normal [2+,0-]:
  Gain(Humidity) = 0.97
                   - 3/5 [-0/3 log2(0/3) - 3/3 log2(3/3)]
                   - 2/5 [-2/2 log2(2/2) - 0/2 log2(0/2)]
                 = 0.97 - 0 = 0.97

Split the Sunny branch by Wind: Weak [1+,2-], Strong [1+,1-]:
  Gain(Wind) = 0.97
               - 3/5 [-1/3 log2(1/3) - 2/3 log2(2/3)]
               - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
             = 0.97 - 0.95 = 0.02
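
The same mechanical check for the Sunny branch, again reusing data, gain, and partitions_by from the earlier sketches:

```python
sunny = [r for r in data if r["Outlook"] == "Sunny"]
sunny_labels = [r["Play"] for r in sunny]
for attr in ["Temperature", "Humidity", "Wind"]:
    print(attr, round(gain(sunny_labels, partitions_by(sunny, attr)), 2))
# Temperature 0.57, Humidity 0.97, Wind 0.02
```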

13 Tree induction example

On the Sunny branch, Humidity has the largest gain (0.97), so it is tested next; both of its children are pure:

  Outlook
  ├─ Sunny    → Humidity
  │             ├─ High   → No
  │             └─ Normal → Yes
  ├─ Overcast → Yes
  └─ Rain     → ??

14 Tree induction example: entropy of branch Rain

The Rain branch contains [3+, 2-]:
  Info(Rain) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97

Split the Rain branch by Temperature: <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-] (the empty >25 partition contributes 0):
  Gain(Temperature) = 0.97
                      - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                      - 3/5 [-2/3 log2(2/3) - 1/3 log2(1/3)]
                    = 0.97 - 0.95 = 0.02

Split the Rain branch by Humidity: High [1+,1-], Normal [2+,1-]:
  Gain(Humidity) = 0.97
                   - 2/5 [-1/2 log2(1/2) - 1/2 log2(1/2)]
                   - 3/5 [-2/3 log2(2/3) - 1/3 log2(1/3)]
                 = 0.97 - 0.95 = 0.02

Split the Rain branch by Wind: Weak [3+,0-], Strong [0+,2-]:
  Gain(Wind) = 0.97
               - 3/5 [-3/3 log2(3/3) - 0/3 log2(0/3)]
               - 2/5 [-0/2 log2(0/2) - 2/2 log2(2/2)]
             = 0.97 - 0 = 0.97

15 Tree induction example

On the Rain branch, Wind has the largest gain (0.97), completing the tree:

  Outlook
  ├─ Sunny    → Humidity
  │             ├─ High   → No
  │             └─ Normal → Yes
  ├─ Overcast → Yes
  └─ Rain     → Wind
                ├─ Strong → No
                └─ Weak   → Yes
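
As a closing sketch, the finished tree can be written down as a nested dict and used to classify a new record; classify is an illustrative helper, not part of the tutorial:

```python
# Leaves are class labels; internal nodes map an attribute to its branches.
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, record):
    """Walk from the root, following the branch matching the record's value."""
    while isinstance(node, dict):
        attr = next(iter(node))            # attribute tested at this node
        node = node[attr][record[attr]]    # descend into the matching branch
    return node

print(classify(tree, {"Outlook": "Rain", "Temperature": "15-25",
                      "Humidity": "High", "Wind": "Strong"}))  # No
```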

16 Thank you & Questions?

