Classification Algorithms


1 Classification Algorithms
Decision Tree Algorithms

2 The problem Given a set of training cases (objects) and their attribute values, determine the target attribute value of new cases. Two closely related tasks: classification and prediction.

3 Why decision trees? Decision trees are powerful and popular tools for classification and prediction. A decision tree represents rules, which can be understood by humans and used directly in knowledge systems such as databases.

4 Key requirements Attribute-value description: each object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold). Predefined classes (target values): the target function has discrete output values (Boolean or multiclass). Sufficient data: enough training cases must be provided to learn the model.

5 Random split If the attribute tested at each node is chosen at random, the tree can grow huge.
Such trees are hard to understand, and larger trees are typically less accurate than smaller trees.

6 Principled criterion Selection of an attribute to test at each node: choose the attribute that is most useful for classifying examples. Information gain measures how well a given attribute separates the training examples according to their target classes. This measure is used to select among the candidate attributes at each step while growing the tree.

7 Entropy A measure of the homogeneity (purity) of a set of examples.
Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of S relative to this binary classification is E(S) = −p1 log2(p1) − p2 log2(p2), where p1 and p2 are the proportions of positive and negative examples in S.

8 Example Suppose S has 25 examples, 15 positive and 10 negative [15+, 10−]. Then the entropy of S relative to this classification is E(S) = −(15/25) log2(15/25) − (10/25) log2(10/25) ≈ 0.971.
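This two-class entropy is easy to check numerically. A minimal sketch (the function name `entropy` is ours, not from the slides):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution, given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# The [15+, 10-] example from the slide:
print(round(entropy([15, 10]), 3))  # → 0.971
```

A pure set such as [25, 0] gives entropy 0, and an even split such as [9, 9] gives the two-class maximum of 1 bit.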

9 Entropy Entropy is minimized (0) when all values of the target attribute are the same: if we know that Joe always plays centre offence, the entropy of Offence is 0. Entropy is maximized when every value of the target attribute is equally likely (i.e. the result looks random): if offence = centre in 9 instances and forward in 9 instances, entropy is maximized.

10 Information Gain Information gain measures the expected reduction in entropy, or uncertainty: Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|Sv|/|S|) E(Sv), where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v, i.e. Sv = {s ∈ S | A(s) = v}. The first term in the equation for Gain is just the entropy of the original collection S; the second term is the expected value of the entropy after S is partitioned using attribute A.

11 Information Gain It is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. Equivalently, it is the number of bits saved when encoding the target value of an arbitrary member of S, given the value of attribute A.
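The definition translates directly into code: partition S by the attribute's value, then subtract the weighted subset entropies from the entropy of S. A sketch (the names `entropy` and `information_gain` are ours):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = E(S) - sum over v of (|Sv|/|S|) * E(Sv)."""
    subsets = {}
    for ex, lab in zip(examples, labels):
        subsets.setdefault(ex[attribute], []).append(lab)  # build each Sv
    expected = sum(len(sv) / len(labels) * entropy(sv)
                   for sv in subsets.values())
    return entropy(labels) - expected
```

An attribute that splits the examples into pure subsets gets the maximum gain; an attribute whose subsets mirror the original class mix gets a gain of 0.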

12 Examples Before partitioning, the entropy is
H(10/20, 10/20) = −10/20 log2(10/20) − 10/20 log2(10/20) = 1. Using the "where" attribute, divide into 2 subsets. Entropy of the first set: H(home) = −6/12 log2(6/12) − 6/12 log2(6/12) = 1. Entropy of the second set: H(away) = −4/8 log2(4/8) − 4/8 log2(4/8) = 1. Expected entropy after partitioning: 12/20 × H(home) + 8/20 × H(away) = 1, so the information gain of "where" is 1 − 1 = 0.

13 Using the "when" attribute, divide into 3 subsets.
Entropy of the first set: H(5pm) = −1/4 log2(1/4) − 3/4 log2(3/4). Entropy of the second set: H(7pm) = −9/12 log2(9/12) − 3/12 log2(3/12). Entropy of the third set: H(9pm) = −0/4 log2(0/4) − 4/4 log2(4/4) = 0. Expected entropy after partitioning: 4/20 × H(1/4, 3/4) + 12/20 × H(9/12, 3/12) + 4/20 × H(0/4, 4/4) = 0.65. Information gain = 1 − 0.65 = 0.35.
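The arithmetic on this slide can be verified directly (the helper name `H` is ours; 0·log 0 is taken as 0):

```python
from math import log2

def H(*probs):
    """Entropy of a probability distribution; 0*log(0) is taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

after = 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4)
gain = 1.0 - after  # entropy before the split was 1 bit
print(round(after, 2), round(gain, 2))  # → 0.65 0.35
```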

14 Decision Knowing the "when" attribute values provides larger information gain than "where" (0.35 vs. 0). Therefore the "when" attribute should be tested before the "where" attribute. Similarly, we can compute the information gain for the other attributes. At each node, choose the attribute with the largest information gain.

15 Decision Tree: Example

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

The resulting tree tests Outlook at the root: Sunny → test Humidity (High → No, Normal → Yes); Overcast → Yes; Rain → test Wind (Strong → No, Weak → Yes).

16 Weather Data: Play or not Play?
The same data in nominal form: Outlook ∈ {sunny, overcast, rainy}, Temperature ∈ {hot, mild, cool}, Humidity ∈ {high, normal}, Windy ∈ {true, false}, and the class Play? ∈ {Yes, No}. Note: Outlook is the weather forecast, no relation to the Microsoft program.

17 Example Tree for "Play?"

Outlook?
├─ sunny    → Humidity? (high → No, normal → Yes)
├─ overcast → Yes
└─ rainy    → Windy? (true → No, false → Yes)

18 Which attribute to select?

19 Example: attribute "Outlook"
"Outlook" = "Sunny": info([2,3]) = entropy(2/5, 3/5) = 0.971 bits. "Outlook" = "Overcast": info([4,0]) = entropy(1, 0) = 0 bits. "Outlook" = "Rainy": info([3,2]) = entropy(3/5, 2/5) = 0.971 bits. Expected information for the attribute: info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits. Note: log(0) is not defined, but we evaluate 0 × log(0) as zero.

20 Computing the information gain
gain = (information before split) − (information after split). For "Outlook": gain("Outlook") = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits. Information gain for the attributes from the weather data: gain("Outlook") = 0.247 bits, gain("Temperature") = 0.029 bits, gain("Humidity") = 0.152 bits, gain("Windy") = 0.048 bits.
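These four gains can be reproduced from the 14-example table on slide 15 (the code layout and helper names are ours):

```python
from math import log2
from collections import Counter

# Weather data: (Outlook, Temperature, Humidity, Windy, Play)
DATA = [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[col], []).append(r[-1])
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([r[-1] for r in rows]) - after

for i, name in enumerate(ATTRS):
    print(f"gain({name}) = {gain(DATA, i):.3f}")
```

Outlook comes out on top (0.247 bits), which is why it is chosen as the root test.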

21 Continuing to split Within the sunny branch: gain("Temperature") = 0.571 bits, gain("Humidity") = 0.971 bits, gain("Windy") = 0.020 bits, so Humidity is tested next.

22 The final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when the data cannot be split any further.

23 A worked example

Weekend  Weather  Parents  Money  Decision (Category)
W1       Sunny    Yes      Rich   Cinema
W2       Sunny    No       Rich   Tennis
W3       Windy    Yes      Rich   Cinema
W4       Rainy    Yes      Poor   Cinema
W5       Rainy    No       Rich   Stay in
W6       Rainy    Yes      Poor   Cinema
W7       Windy    No       Poor   Cinema
W8       Windy    No       Rich   Shopping
W9       Windy    Yes      Rich   Cinema
W10      Sunny    No       Rich   Tennis

24 Determining the Best Attribute
Entropy(S) = −pcinema log2(pcinema) − ptennis log2(ptennis) − pshopping log2(pshopping) − pstay_in log2(pstay_in)
= −(6/10) log2(6/10) − (2/10) log2(2/10) − (1/10) log2(1/10) − (1/10) log2(1/10)
= 1.571
and we need to determine the best of:
Gain(S, weather) = Entropy(S) − (|Ssun|/10) Entropy(Ssun) − (|Swind|/10) Entropy(Swind) − (|Srain|/10) Entropy(Srain)
= 1.571 − (0.3)(0.918) − (0.4)(0.811) − (0.3)(0.918) = 0.70
Gain(S, parents) = Entropy(S) − (|Syes|/10) Entropy(Syes) − (|Sno|/10) Entropy(Sno)
= 1.571 − (0.5)(0) − (0.5)(1.922) = 0.61
Gain(S, money) = Entropy(S) − (|Srich|/10) Entropy(Srich) − (|Spoor|/10) Entropy(Spoor)
= 1.571 − (0.7)(1.842) − (0.3)(0) = 0.28
Weather has the largest gain, so it is tested at the root.
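The same computation can be run over the ten weekend examples (assuming the standard version of this dataset; helper names are ours):

```python
from math import log2
from collections import Counter

# Weekend dataset: (Weather, Parents, Money) -> Decision
ROWS = [
    ("Sunny", "Yes", "Rich", "Cinema"),    # W1
    ("Sunny", "No",  "Rich", "Tennis"),    # W2
    ("Windy", "Yes", "Rich", "Cinema"),    # W3
    ("Rainy", "Yes", "Poor", "Cinema"),    # W4
    ("Rainy", "No",  "Rich", "Stay in"),   # W5
    ("Rainy", "Yes", "Poor", "Cinema"),    # W6
    ("Windy", "No",  "Poor", "Cinema"),    # W7
    ("Windy", "No",  "Rich", "Shopping"),  # W8
    ("Windy", "Yes", "Rich", "Cinema"),    # W9
    ("Sunny", "No",  "Rich", "Tennis"),    # W10
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    subsets = {}
    for r in rows:
        subsets.setdefault(r[col], []).append(r[-1])
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([r[-1] for r in rows]) - after

for i, name in enumerate(["weather", "parents", "money"]):
    print(f"Gain(S, {name}) = {gain(ROWS, i):.2f}")
```

Weather wins with a gain of about 0.70 bits, matching the hand calculation.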

25 Now we look at the first branch
Ssunny = {W1, W2, W10}. This subset is not empty and its class labels are not all the same, so we put an internal node rather than a leaf. The same holds for the 2nd and 3rd branches, (Weather, windy) and (Weather, rainy); therefore we put a node for each of them.

26 Hence we can calculate:
Now we focus on the first branch of the tree, i.e. the data for the attribute value (Weather, Sunny), shown below:

Weekend  Weather  Parents  Money  Decision (Category)
W1       Sunny    Yes      Rich   Cinema
W2       Sunny    No       Rich   Tennis
W10      Sunny    No       Rich   Tennis

Hence we can calculate:
Gain(Ssunny, parents) = Entropy(Ssunny) − (|Syes|/|Ssunny|) Entropy(Syes) − (|Sno|/|Ssunny|) Entropy(Sno)
= 0.918 − (1/3)(0) − (2/3)(0) = 0.918
Gain(Ssunny, money) = Entropy(Ssunny) − (|Srich|/|Ssunny|) Entropy(Srich) − (|Spoor|/|Ssunny|) Entropy(Spoor)
= 0.918 − (3/3)(0.918) − 0 = 0

27 Finishing this tree off is left as a tutorial exercise.
Remembering that we replaced the set S by the set Ssunny: looking at Syes, we see that the only example is W1, so the branch for yes ends at a categorisation leaf with the category Cinema. Sno contains W2 and W10, but these are in the same category (Tennis), so the branch for no also ends at a categorisation leaf. Hence the sunny subtree is: Parents? yes → Cinema, no → Tennis. Finishing the rest of the tree is left as a tutorial exercise.
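The whole procedure walked through above — pick the attribute with the largest gain, split, recurse, and stop at pure subsets — can be sketched as a minimal ID3-style learner. All names here are ours, and the weekend data is the assumed standard version of this example; this is a sketch, not a production implementation:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """Return a decision tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:        # pure subset: categorisation leaf
        return labels[0]
    if not attrs:                    # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[a], []).append(lab)
        after = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
        return entropy(labels) - after

    best = max(attrs, key=gain)      # attribute with the largest gain
    children = {}
    for v in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == v]
        children[v] = id3([rows[i] for i in keep], [labels[i] for i in keep],
                          [a for a in attrs if a != best])
    return {best: children}

rows = [
    {"weather": "Sunny", "parents": "Yes", "money": "Rich"},
    {"weather": "Sunny", "parents": "No",  "money": "Rich"},
    {"weather": "Windy", "parents": "Yes", "money": "Rich"},
    {"weather": "Rainy", "parents": "Yes", "money": "Poor"},
    {"weather": "Rainy", "parents": "No",  "money": "Rich"},
    {"weather": "Rainy", "parents": "Yes", "money": "Poor"},
    {"weather": "Windy", "parents": "No",  "money": "Poor"},
    {"weather": "Windy", "parents": "No",  "money": "Rich"},
    {"weather": "Windy", "parents": "Yes", "money": "Rich"},
    {"weather": "Sunny", "parents": "No",  "money": "Rich"},
]
labels = ["Cinema", "Tennis", "Cinema", "Cinema", "Stay in",
          "Cinema", "Cinema", "Shopping", "Cinema", "Tennis"]
tree = id3(rows, labels, ["weather", "parents", "money"])
print(tree)
```

On this data the root test is weather, and the sunny branch reproduces the Parents test derived on the previous slides (yes → Cinema, no → Tennis).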

