
Slide 1: DSCI 4520/5240: Data Mining, Fall 2013 – Dr. Nick Evangelopoulos. Lecture 5: Decision Tree Algorithms. Material based on: Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005.

Slide 2: Decision Tree Algorithms
Decision tree algorithms were developed by the Machine Learning research community (part of Computer Science / Artificial Intelligence).

Slide 3: Agenda
- Rule evaluation criteria in Machine Learning
- Decision Tree algorithms: 1R, ID3 (entropy)
- Naïve Bayes classification (optional)

Slide 4: Decision Trees: Credit risk example
This example is about determining credit risk. We have a total of 10 people: 6 are good risks and 4 are bad. We first split the tree on employment status and find that 7 are employed and 3 are not. All 3 of the not-employed applicants are bad credit risks, so we have learned something about our data: that node cannot be split any further, since all of its data belongs to one class. This is called a pure node. The other node, however, can be split again on a different attribute (married), so we can continue to grow the tree on that side.
CORRESPONDING RULES:
IF employed = yes AND married = yes THEN risk = good
IF employed = yes AND married = no THEN risk = good
IF employed = no THEN risk = bad
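The three rules translate directly into a tiny classifier (a minimal sketch in Python; the attribute names employed/married and the labels good/bad come from the slide, everything else is illustrative):

def credit_risk(employed, married):
    # rule 3: the pure node -- every not-employed applicant in the sample was a bad risk
    if not employed:
        return "bad"
    # rules 1 and 2: employed applicants are classified as good whether married or not
    return "good"

print(credit_risk(employed=True, married=False))   # -> good
print(credit_risk(employed=False, married=True))   # -> bad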

Slide 5: Decision Tree performance
Confidence is the degree of accuracy of a rule. Support is the degree to which the rule conditions occur in the data.
EXAMPLE: If 10 customers purchased Zane Grey's The Young Pitcher and 8 of them also purchased The Short Stop, the rule {IF basket has The Young Pitcher THEN basket has The Short Stop} has confidence 0.80. If those purchases were the only 10 to cover these books out of 10,000,000 purchases, the support is only 0.000001.
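A quick arithmetic check of these two measures (figures taken from the example above; support here follows the slide's definition, i.e. how often the rule's condition occurs):

baskets_with_young_pitcher = 10
baskets_with_both_books = 8
total_baskets = 10_000_000

confidence = baskets_with_both_books / baskets_with_young_pitcher   # 8/10 = 0.80
support = baskets_with_young_pitcher / total_baskets                # 10/10,000,000 = 0.000001
print(confidence, support)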

Slide 6: Rule Interestingness
Interestingness is the idea that Data Mining discovers something unexpected. Consider the rule {IF basket has eggs THEN basket has bacon}, and suppose the confidence level is 0.90 and the support level is 0.20. This may be a useful rule, but it may not be interesting if the grocer was already aware of this association. Recall the definition of DM as the discovery of previously unknown knowledge!

Slide 7: Rule Induction algorithms
Common algorithms:
- 1R
- ID3
- C4.5/C5.0
- CART
- CHAID
- CN2
- BruteDL
- SDL
These are recursive algorithms that identify data partitions of progressively better separation with respect to the outcome. The partitions are then organized into a decision tree.

Slide 8: Illustration of two Tree algorithms
- 1R and discretization in 1R
- ID3: minimum entropy and maximum information gain

Slide 9: 1R

Slide 10: 1R: Inferring Rudimentary Rules
1R learns a 1-level decision tree; in other words, it generates a set of rules that all test one particular attribute.
Basic version (assuming nominal attributes):
- One branch for each of the attribute's values
- Each branch assigns the most frequent class
- Error rate: the proportion of instances that don't belong to the majority class of their corresponding branch
- Choose the attribute with the lowest error rate

Slide 11: Pseudo-code for 1R
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value pair
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
Let's apply 1R to the weather data: consider the first of the 4 attributes, outlook (the others are temp, humidity, windy). Consider all of its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you have all 4 sets of rules.

Slide 12: A simple example: Weather Data

Outlook   Temp  Humidity  Windy  Play?
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
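Below is a minimal Python sketch of 1R applied to this table, following the pseudocode on the previous slide (illustrative only; attribute order and class labels are taken from the table, everything else is an implementation choice):

from collections import Counter

# each row: (Outlook, Temp, Humidity, Windy, Play?)
weather = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
attributes = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r(rows, n_attributes):
    best = None
    for a in range(n_attributes):
        rules, errors = {}, 0
        for value in {row[a] for row in rows}:
            # most frequent class among rows with this attribute value
            classes = Counter(row[-1] for row in rows if row[a] == value)
            majority_class, count = classes.most_common(1)[0]
            rules[value] = majority_class
            errors += sum(classes.values()) - count
        if best is None or errors < best[1]:
            best = (a, errors, rules)
    return best

a, errors, rules = one_r(weather, len(attributes))
print(attributes[a], errors, rules)   # Outlook, 4 errors out of 14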

Slide 13: Evaluating the Weather Attributes in 1R
Rule sets and error rates for each attribute:
- Outlook: sunny → No (errors 2/5), overcast → Yes (0/4), rainy → Yes (2/5); total errors 4/14
- Temp: hot → No* (2/4), mild → Yes (2/6), cool → Yes (1/4); total errors 5/14
- Humidity: high → No (3/7), normal → Yes (1/7); total errors 4/14
- Windy: false → Yes (2/8), true → No* (3/6); total errors 5/14
(*) indicates a random choice between two equally likely outcomes

Slide 14: Decision tree for the weather data
Outlook
  sunny → Humidity
    high → no
    normal → yes
  overcast → yes
  rainy → Windy
    true → no
    false → yes

Slide 15: Discretization in 1R
Consider continuous Temperature data, sorted in ascending order, with the corresponding Play classes:
65  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
To avoid overfitting, 1R requires that each partition contain at least 3 observations of its majority class (when available), extending a partition when there is a "run" of the same class:
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No
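A small check of the resulting rule set against the sorted temperature data (values and labels copied from this slide; a sketch only):

temps = [65, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

predicted = ["Yes" if t <= 77.5 else "No" for t in temps]
errors = sum(p != actual for p, actual in zip(predicted, play))
print(errors, "errors out of", len(temps))   # 5 errors out of 14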

Slide 16: Comments on 1R
- 1R was described in a paper by Holte (1993)
- The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
- The minimum number of instances per partition was set to 6 after some experimentation
- 1R's simple rules performed not much worse than much more complex decision trees
- Simplicity first pays off!

Slide 17: Entropy and Information Gain

Slide 18: Constructing Decision Trees in ID3, C4.5, C5.0
Normal procedure: top-down, in recursive divide-and-conquer fashion.
- First: an attribute is selected for the root node and a branch is created for each possible attribute value
- Then: the instances are split into subsets (one for each branch extending from the node)
- Finally: the procedure is repeated recursively for each branch, using only the instances that reach that branch
The process stops when all instances at a node have the same class.
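This recursive procedure can be summarized in a short Python skeleton (an illustrative sketch, not the exact algorithm of any one tool; the best_attribute argument stands in for the entropy-based selection described on the following slides):

from collections import Counter

def build_tree(data, attributes, best_attribute):
    # data: list of (attribute_values, class_label) pairs; attributes: indices still available
    classes = Counter(label for _, label in data)
    # stop when the node is pure or no attributes remain; predict the majority class
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]
    a = best_attribute(data, attributes)   # e.g. the attribute with minimum weighted entropy
    subtree = {}
    for value in {values[a] for values, _ in data}:
        subset = [(values, label) for values, label in data if values[a] == value]
        remaining = [x for x in attributes if x != a]
        subtree[(a, value)] = build_tree(subset, remaining, best_attribute)
    return subtree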

Slide 19: Which attribute to select?
[Figure: the four candidate one-attribute splits, on Outlook (sunny/overcast/rainy), Temperature (hot/mild/cool), Windy (true/false), and Humidity (high/normal), each showing the yes/no class counts at its leaves.]

Slide 20: A criterion for attribute selection
Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes!
A popular impurity criterion is Information: the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes. We can then compare the tree before and after the split using Information Gain = Info(before) – Info(after). Information Gain increases with the average purity of the subsets that an attribute produces.
Strategy: choose the attribute that results in the greatest information gain. However, it is equivalent (and faster) to select the attribute with the smallest information (= total weighted entropy).

Slide 21: Computing Information
- Information is measured in bits
- Given a probability distribution, the information required to predict an event is the distribution's entropy
- Entropy gives the additional required information (i.e., the information deficit) in bits
- This can involve fractions of bits!
- The negative sign in the entropy formula is needed to convert all negative logs back to positive values
Formula for computing the entropy:
Entropy(p1, p2, …, pn) = –p1 log p1 – p2 log p2 – … – pn log pn
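A direct translation of this formula into Python (base-2 logarithms, with the convention that 0 · log 0 = 0 so that pure nodes get entropy 0; a minimal sketch):

from math import log2

def entropy(*probabilities):
    # entropy in bits of a probability distribution p1, ..., pn (terms with p = 0 contribute 0)
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy(1, 0))       # 0.0 bits: a pure node
print(entropy(3/5, 2/5))   # about 0.971 bits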

Slide 22: Weather example: attribute "outlook"
Outlook = "Sunny": Info([2,3]) = entropy(2/5, 3/5) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971 bits
Outlook = "Overcast": Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (by definition, 0 log 0 = 0)
Outlook = "Rainy": Info([3,2]) = entropy(3/5, 2/5) = –3/5 log(3/5) – 2/5 log(2/5) = 0.971 bits
Information (= total weighted entropy) for attribute Outlook:
Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits

Slide 23: Comparing Information Gains (OPTIONAL) or total weighted entropies
Information Gain = Information Before – Information After
Gain(Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
Information Gain for the attributes of the Weather Data:
Gain(Outlook) = 0.247 bits
Gain(Temperature) = 0.029 bits
Gain(Humidity) = 0.152 bits
Gain(Windy) = 0.048 bits
Outlook also has the minimum weighted entropy, at 0.693 bits.
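Reusing the entropy() function and the weather/attributes lists from the earlier sketches, these gains can be reproduced as follows (a sketch, not SAS EM output):

from collections import Counter

def info(rows):
    # entropy of the class distribution of a set of rows (class label in the last position)
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return entropy(*(c / n for c in counts.values()))

def weighted_info(rows, a):
    # total weighted entropy after splitting on attribute index a
    n, total = len(rows), 0.0
    for value in {row[a] for row in rows}:
        subset = [row for row in rows if row[a] == value]
        total += len(subset) / n * info(subset)
    return total

info_before = info(weather)   # 0.940 bits for the [9,5] class split
for i, name in enumerate(attributes):
    print(name, round(info_before - weighted_info(weather, i), 3))
# prints roughly: Outlook 0.247, Temp 0.029, Humidity 0.152, Windy 0.048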

Slide 24: Continuing to split
Within the Outlook = sunny branch, the gains for the remaining attributes are:
Gain(Temperature) = 0.571 bits
Gain(Humidity) = 0.971 bits
Gain(Windy) = 0.020 bits
[Figure: the three candidate splits under Outlook = sunny; Humidity produces pure nodes.]

Slide 25: Final Decision Tree
Not all leaves need to be pure; sometimes identical instances belong to different classes. Splitting stops when the data cannot be split any further.
Outlook
  sunny → Humidity
    high → no
    normal → yes
  overcast → yes
  rainy → Windy
    true → no
    false → yes

Slide 26: Another example: Loan Application Data
Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time.

Slide 27: Loan Example: probability calculations
All possible values of the three attributes (Age, Income, Risk) are shown. For each value, the probability that the loan is paid on time (OnTime = yes) is calculated.

Slide 28: Loan Example: Entropy calculations
Information calculations for attribute Age:
- First we calculate, for each value of Age, the probability of resulting in Yes and the probability of resulting in No.
- Then we compute the entropy for this value as E = –p(yes) log p(yes) – p(no) log p(no).
- Finally we calculate the Information (= weighted entropy) for the entire attribute: Info = E1·p1 + E2·p2 + E3·p3, where pi is the proportion of cases taking the i-th value.

Slide 29: Loan Example: The first split
The calculations continue until we have, for each competing attribute, the Information required to predict the outcome. The attribute with the lowest required information is also the attribute with the largest information gain when we compare the required information before and after the split.
[Figure: the first split is on Risk, with branches for low, average, and high.]

Slide 30: Naïve Bayes Classification (optional topic)

Slide 31: Statistical Decision Tree Modeling
1R uses one attribute at a time and chooses the one that works best. Consider the "opposite" of 1R: use all the attributes. Let's first make two assumptions: attributes are (1) equally important and (2) statistically independent. Although these assumptions are almost never correct, this scheme works well in practice!

Slide 32: Probabilities for the Weather Data
A table shows the counts and conditional probabilities (contingencies) for each attribute value given each class.
A new day: suppose the answer is Play = Yes. How likely is it to get the attribute values of this new day?

Slide 33: Bayes' Rule
Probability of event H given evidence E:
P(H|E) = P(E|H) P(H) / P(E)
"A priori" probability of H: P(H) (probability of the event before evidence has been seen)
"A posteriori" probability of H: P(H|E) (probability of the event after evidence has been seen)
WHERE: H = target value, E = input variable values

Slide 34: Naïve Bayes Classification
P(H|E) = P(E1|H) P(E2|H) … P(En|H) P(H) / P(E)
Classification learning: what is the probability of the class given an instance?
- Evidence E = the instance
- Event H = the class value for the instance
Naïve Bayes assumption: the evidence can be split into independent parts (i.e., the attributes of the instance!)

Slide 35: Naïve Bayes on the Weather Data
For a new day with evidence E = (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True):
P(Yes | E) = (P(Outlook = Sunny | Yes) × P(Temperature = Cool | Yes) × P(Humidity = High | Yes) × P(Windy = True | Yes) × P(Yes)) / P(E)
P(Yes | E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0053 / P(E)
P(No | E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0206 / P(E)
Note that P(E) will disappear when we normalize!
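These two products can be checked with a few lines (the counts are the ones quoted above; normalizing at the end removes P(E)):

# scores for the new day (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True)
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # about 0.0053
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # about 0.0206

p_yes = score_yes / (score_yes + score_no)            # P(E) cancels in the normalization
p_no  = score_no / (score_yes + score_no)
print(round(p_yes, 3), round(p_no, 3))                # about 0.205 and 0.795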

Slide 36: Comments on Naïve Bayes Classification
- Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
- Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
- However: adding too many redundant attributes will cause problems (e.g., identical attributes)

Slide 37: Suggested readings
- Verify the entropy, information, and information gain calculations we did in these slides. Hint: all logs are base 2!
- Read the SAS GSEM 5.3 text, chapter 4 (pp. 61-102).
- Read the Sarma text, chapter 4 (pp. 113-168). Pay particular attention to:
  - Entropy calculations (p. 126)
  - Profit Matrix (p. 136)
  - Expected profit calculations (p. 137)
  - How to use SAS EM to grow a decision tree (pp. 143-158)

