slide 1 DSCI 4520/5240: Data Mining Fall 2013 – Dr. Nick Evangelopoulos Lecture 5: Decision Tree Algorithms Material based on: Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005

slide 2 DSCI 4520/5240 DATA MINING Decision Tree Algorithms
- Decision Tree algorithms were developed by the Machine Learning research community (part of Computer Science / Artificial Intelligence)

slide 3 DSCI 4520/5240 DATA MINING Agenda
- Rule evaluation criteria in Machine Learning
- Decision Tree algorithms: 1R, ID3
  - Entropy
- Naïve Bayes classification (optional)

slide 4 DSCI 4520/5240 DATA MINING Decision Trees: Credit risk example
This example concerns determining credit risk. We have a total of 10 people: 6 are good risks and 4 are bad. We first split the tree on employment status and find that 7 people are employed and 3 are not. All 3 who are not employed are bad credit risks, so we have learned something about our data. Note that we cannot split that node any further, since all of its data belong to one class; this is called a pure node. The other node, however, can be split again based on a different criterion, so we can continue to grow the tree on the left-hand side.
CORRESPONDING RULES:
IF employed = yes AND married = yes THEN risk = good
IF employed = yes AND married = no THEN risk = good
IF employed = no THEN risk = bad
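The three rules can also be written directly as code. A minimal sketch (the function name and the yes/no string encoding are illustrative, not from the slides):

```python
def credit_risk(employed: str, married: str) -> str:
    """Classify an applicant using the three rules read off the example tree."""
    if employed == "no":      # pure node: all unemployed applicants were bad risks
        return "bad"
    if married == "yes":      # IF employed = yes AND married = yes THEN risk = good
        return "good"
    return "good"             # IF employed = yes AND married = no THEN risk = good

print(credit_risk("no", "yes"))   # bad
```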

slide 5 DSCI 4520/5240 DATA MINING Decision Tree performance
Confidence is the degree of accuracy of a rule. Support is the degree to which the rule conditions occur in the data.
EXAMPLE: if 10 customers purchased Zane Grey's The Young Pitcher and 8 of them also purchased The Short Stop, the rule {IF basket has The Young Pitcher THEN basket has The Short Stop} has confidence of 8/10 = 0.80. If these purchases were the only 10 to cover these books out of 10,000,000 purchases, the support is only 10/10,000,000 = 0.000001.
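The arithmetic behind the two measures, as a small sketch (the counts come from the book-purchase example above; support here follows the slide's definition, i.e., how often the rule's condition occurs):

```python
# 10 baskets contain The Young Pitcher; 8 of those also contain The Short Stop;
# there are 10,000,000 purchases in total.
antecedent_count = 10          # baskets satisfying the IF part of the rule
both_count = 8                 # baskets satisfying both the IF and the THEN part
total_purchases = 10_000_000

confidence = both_count / antecedent_count      # 0.8
support = antecedent_count / total_purchases    # 0.000001

print(f"confidence = {confidence:.2f}, support = {support:.7f}")
```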

slide 6 DSCI 4520/5240 DATA MINING Rule Interestingness
Interestingness is the idea that Data Mining discovers something unexpected. Consider the rule {IF basket has eggs THEN basket has bacon}. Suppose the confidence level is 0.90 and the support level is also high. This may be a useful rule, but it may not be interesting if the grocer was already aware of this association. Recall the definition of DM as the discovery of previously unknown knowledge!

slide 7 DSCI 4520/5240 DATA MINING Rule Induction algorithms
Common algorithms:
- 1R
- ID3
- C4.5 / C5.0
- CART
- CHAID
- CN2
- BruteDL
- SDL
These are recursive algorithms that identify data partitions with progressively better separation with respect to the outcome. The partitions are then organized into a decision tree.

slide 8 DSCI 4520/5240 DATA MINING Illustration of two Tree algorithms
- 1R, and discretization in 1R
- ID3: minimum entropy and maximum information gain

slide 9 DSCI 4520/5240 DATA MINING 1R

slide 10 DSCI 4520/5240 DATA MINING 1R: Inferring Rudimentary Rules
1R learns a 1-level decision tree. In other words, it generates a set of rules that all test one particular attribute.
Basic version (assuming nominal attributes):
- One branch for each of the attribute's values
- Each branch assigns the most frequent class
- Error rate: the proportion of instances that don't belong to the majority class of their corresponding branch
- Choose the attribute with the lowest error rate

slide 11 DSCI 4520/5240 DATA MINING Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Let's apply 1R on the weather data: consider the first (outlook) of the 4 attributes (outlook, temp, humidity, windy). Consider all its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you get all 4 sets of rules.
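A compact Python sketch of the 1R pseudo-code above, for nominal attributes (instances are assumed to be dictionaries; this is an illustration, not the original 1R implementation):

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr):
    """1R: for each attribute, predict the majority class of each of its values;
    keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        # Count how often each class appears for every value of this attribute
        counts = defaultdict(Counter)
        for row in instances:
            counts[row[attr]][row[class_attr]] += 1
        # One rule per value: assign the most frequent class
        rules = {value: cnt.most_common(1)[0][0] for value, cnt in counts.items()}
        # Error count: instances not in the majority class of their branch
        errors = sum(sum(cnt.values()) - cnt.most_common(1)[0][1]
                     for cnt in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (chosen attribute, its rules, number of errors)
```

On the 14-instance weather data shown on the next slide, outlook and humidity both make 4 errors, while temperature and windy make 5, so 1R picks one of the two best attributes arbitrarily.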

slide 12 DSCI 4520/5240 DATA MINING A simple example: Weather Data

Outlook   Temp  Humidity  Windy  Play?
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

slide 13 DSCI 4520/5240 DATA MINING Evaluating the Weather Attributes in 1R
(*) indicates a random choice between two equally likely outcomes

slide 14 DSCI 4520/5240 DATA MINING Decision tree for the weather data

Outlook:
  sunny -> Humidity:
    high -> no
    normal -> yes
  overcast -> yes
  rainy -> Windy:
    true -> no
    false -> yes

slide 15 DSCI 4520/5240 DATA MINING Discretization in 1R
Consider the continuous Temperature data, after sorting them in ascending order (class values shown underneath):
Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
To avoid overfitting, 1R requires each partition to contain at least 3 observations of the majority class (where available), and keeps extending a partition while the following observations belong to the same class (a "run"):
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No
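A small check of the final rule set, assuming the sorted temperature values listed above (taken from the standard numeric weather data); the error count is simply read off the sequence:

```python
# Sorted (temperature, play) pairs from the numeric weather data
data = [(64, "Yes"), (65, "No"), (68, "Yes"), (69, "Yes"), (70, "Yes"),
        (71, "No"), (72, "No"), (72, "Yes"), (75, "Yes"), (75, "Yes"),
        (80, "No"), (81, "Yes"), (83, "Yes"), (85, "No")]

def predict(temperature):
    # Final 1R discretization: a single breakpoint at 77.5
    return "Yes" if temperature <= 77.5 else "No"

errors = sum(1 for t, cls in data if predict(t) != cls)
print(f"{errors} errors out of {len(data)} instances")   # 5 errors out of 14 instances
```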

slide 16 DSCI 4520/5240 DATA MINING Comments on 1R
- 1R was described in a paper by Holte (1993)
- The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
- The minimum number of instances was set to 6 after some experimentation
- 1R's simple rules performed not much worse than much more complex decision trees
- Simplicity first pays off!

slide 17 DSCI 4520/5240 DATA MINING Entropy and Information Gain

slide 18 DSCI 4520/5240 DATA MINING Constructing Decision Trees in ID3, C4.5, C5.0
Normal procedure: top-down, in recursive divide-and-conquer fashion.
- First: an attribute is selected for the root node and a branch is created for each possible attribute value
- Then: the instances are split into subsets (one for each branch extending from the node)
- Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch
The process stops when all instances at a node have the same class.

slide 19 DSCI 4520/5240 DATA MINING Which attribute to select?
[Figure: four candidate one-level splits of the weather data, on Outlook (sunny / overcast / rainy), Temperature (hot / mild / cool), Windy (true / false), and Humidity (high / normal), each branch showing its mix of yes/no instances.]

slide 20 DSCI 4520/5240 DATA MINING A criterion for attribute selection
Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes!
A popular impurity criterion is Information: the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes. We can then compare the tree before and after the split using Information Gain = Info(before) – Info(after). Information Gain increases with the average purity of the subsets that an attribute produces.
Strategy: choose the attribute that results in the greatest information gain. However, it is equivalent (and faster) to select the attribute with the smallest information (= total weighted entropy).

slide 21 DSCI 4520/5240 DATA MINING Computing Information
- Information is measured in bits
- Given a probability distribution, the information required to predict an event is the distribution's entropy
- Entropy gives the additional required information (i.e., the information deficit) in bits
- This can involve fractions of bits!
- The negative signs in the entropy formula are needed to convert the negative logs back to positive values
Formula for computing the entropy:
Entropy(p1, p2, …, pn) = –p1 log p1 – p2 log p2 – … – pn log pn
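A minimal sketch of this formula in Python, using base-2 logarithms and the convention that 0 · log 0 = 0:

```python
from math import log2

def entropy(*probs):
    """Entropy in bits: -p1*log2(p1) - p2*log2(p2) - ... (terms with p = 0 contribute 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(1.0, 0.0))     # 0.0    -- a pure node needs no extra information
print(entropy(2/5, 3/5))     # 0.971  -- an impure node (2 yes, 3 no)
print(entropy(9/14, 5/14))   # 0.940  -- the full weather data (9 yes, 5 no)
```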

slide 22 DSCI 4520/5240 DATA MINING Weather example: attribute "outlook"
Outlook = "Sunny": Info([2,3]) = entropy(2/5, 3/5) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971 bits
Outlook = "Overcast": Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (0 log(0) = 0 by definition)
Outlook = "Rainy": Info([3,2]) = entropy(3/5, 2/5) = –3/5 log(3/5) – 2/5 log(2/5) = 0.971 bits
Information (= total weighted entropy) for attribute Outlook:
Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits

slide 23 DSCI 4520/5240 DATA MINING Comparing Information Gains, or total weighted entropies (OPTIONAL)
Information Gain = Information Before – Information After
Gain(Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
Information Gain for the attributes from the Weather Data:
Gain(Outlook) = 0.247 bits
Gain(Temperature) = 0.029 bits
Gain(Humidity) = 0.152 bits
Gain(Windy) = 0.048 bits
Outlook also has the minimum weighted entropy, at 0.693.
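The four gains can be reproduced directly from the weather table on slide 12. A self-contained sketch (attribute and value names follow that table):

```python
from collections import Counter, defaultdict
from math import log2

# Weather data as (outlook, temperature, humidity, windy, play) tuples (slide 12)
weather = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy_of(class_counts):
    """Entropy (in bits) of a Counter of class frequencies."""
    total = sum(class_counts.values())
    return -sum(c / total * log2(c / total) for c in class_counts.values() if c)

def info_after_split(rows, col):
    """Total weighted entropy of the subsets produced by splitting on one attribute."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[col]][row[-1]] += 1
    n = len(rows)
    return sum(sum(cnt.values()) / n * entropy_of(cnt) for cnt in groups.values())

info_before = entropy_of(Counter(row[-1] for row in weather))          # 0.940 bits
for name, col in ATTRS.items():
    print(name, round(info_before - info_after_split(weather, col), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```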

slide 24 DSCI 4520/5240 DATA MINING Continuing to split
Within the Outlook = Sunny branch:
Gain(Temperature) = 0.571 bits
Gain(Humidity) = 0.971 bits
Gain(Windy) = 0.020 bits
[Figure: the three candidate splits of the sunny subset, on Temperature, Windy, and Humidity; Humidity produces pure nodes and is selected.]

slide 25 DSCI 4520/5240 DATA MINING Final Decision Tree

Outlook:
  sunny -> Humidity:
    high -> no
    normal -> yes
  overcast -> yes
  rainy -> Windy:
    true -> no
    false -> yes

Not all leaves need to be pure: sometimes identical instances belong to different classes. Splitting stops when the data cannot be split any further.
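The recursive divide-and-conquer procedure from slide 18 can be sketched in a few lines. This continues the previous sketch (it assumes weather, ATTRS, and info_after_split from the block after slide 23 are in scope) and is an illustrative ID3-style implementation, not the code behind any particular tool:

```python
from collections import Counter

def build_tree(rows, attrs):
    """Grow an ID3-style tree: return a leaf when the node is pure (or no
    attributes remain); otherwise split on the attribute with the largest
    information gain (equivalently, the smallest weighted entropy)."""
    classes = Counter(row[-1] for row in rows)
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]            # leaf: (majority) class
    best = min(attrs, key=lambda a: info_after_split(rows, ATTRS[a]))
    branches = {}
    remaining = [a for a in attrs if a != best]
    for value in sorted({row[ATTRS[best]] for row in rows}, key=str):
        subset = [row for row in rows if row[ATTRS[best]] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)

print(build_tree(weather, list(ATTRS)))
# ('outlook', {'overcast': 'yes',
#              'rainy': ('windy', {False: 'yes', True: 'no'}),
#              'sunny': ('humidity', {'high': 'no', 'normal': 'yes'})})
```

The printed structure matches the final tree on this slide: outlook at the root, a pure "yes" leaf for overcast, and humidity and windy tests under the sunny and rainy branches.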

slide 26 DSCI 4520/5240 DATA MINING Another example: Loan Application Data
Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time.

slide 27 DSCI 4520/5240 DATA MINING Loan Example: probability calculations All possible values for the three attributes (Age, Income, Risk) are shown below. For each value, the probability for the loan to be On Time (OnTime = yes) is calculated:

slide 28 DSCI 4520/5240 DATA MINING Loan Example: Entropy calculations
Information calculations for attribute Age are shown below. First we calculate, for each value of Age, the probability that it results in Yes and the probability that it results in No. Then we compute the entropy for this value as:
E = –p(yes) log p(yes) – p(no) log p(no)
Finally, we calculate the Information (= weighted entropy) for the entire attribute:
Inform = E1·p1 + E2·p2 + E3·p3

slide 29 DSCI 4520/5240 DATA MINING Loan Example: The first split
The calculations continue until we have, for each competing attribute, the Information required to predict the outcome. The attribute with the lowest required information is also the attribute with the largest information gain, when we compare the required information before and after the split.
[Figure: the first split is on Risk, with branches for low, average, and high.]

slide 30 DSCI 4520/5240 DATA MINING Naïve Bayes Classification (optional topic)

slide 31 DSCI 4520/5240 DATA MINING Statistical Decision Tree Modeling
1R uses one attribute at a time and chooses the one that works best. Consider the "opposite" of 1R: use all the attributes. Let's first make two assumptions:
- Attributes are equally important
- Attributes are statistically independent
Although based on assumptions that are almost never correct, this scheme works well in practice!

slide 32 DSCI 4520/5240 DATA MINING Probabilities for the Weather Data
Table showing counts and conditional probabilities (contingencies).
A new day: suppose the answer is Play = Yes. How likely is it to get the attribute values of this new day?

slide 33 DSCI 4520/5240 DATA MINING Bayes' Rule
Probability of event H given evidence E:
P(H|E) = P(E|H) P(H) / P(E)
"A priori" probability of H: P(H) (probability of the event before evidence has been seen)
"A posteriori" probability of H: P(H|E) (probability of the event after evidence has been seen)
WHERE: H = target value, E = input variable values

slide 34 DSCI 4520/5240 DATA MINING Naïve Bayes Classification
P(H|E) = P(E1|H) P(E2|H) … P(En|H) P(H) / P(E)
Classification learning: what is the probability of the class given an instance?
- Evidence E = instance
- Event H = class value for the instance
Naïve Bayes assumption: the evidence can be split into independent parts (i.e., the attributes of the instance!)

slide 35 DSCI 4520/5240 DATA MINING Naïve Bayes on the Weather Data
For a new day with evidence E = (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True):
P(Yes | E) = (P(Outlook = Sunny | Yes) × P(Temperature = Cool | Yes) × P(Humidity = High | Yes) × P(Windy = True | Yes) × P(Yes)) / P(E)
P(Yes | E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0053 / P(E)
P(No | E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0206 / P(E)
Note that P(E) will disappear when we normalize!
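A small sketch of the same calculation, including the normalization step (the conditional probabilities are the fractions shown above; the variable names are illustrative):

```python
# Conditional probabilities for the new day (Sunny, Cool, High humidity, Windy = True)
p_e_given_yes = [2/9, 3/9, 3/9, 3/9]   # outlook, temperature, humidity, windy | Play = Yes
p_e_given_no  = [3/5, 1/5, 4/5, 3/5]   # the same attribute values | Play = No
prior_yes, prior_no = 9/14, 5/14

def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

score_yes = product(p_e_given_yes) * prior_yes   # ~0.0053, i.e. P(E|Yes) * P(Yes)
score_no  = product(p_e_given_no)  * prior_no    # ~0.0206, i.e. P(E|No)  * P(No)

# P(E) cancels out when we normalize the two scores
total = score_yes + score_no
print(f"P(Yes|E) = {score_yes / total:.3f}, P(No|E) = {score_no / total:.3f}")
# P(Yes|E) = 0.205, P(No|E) = 0.795
```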

slide 36 DSCI 4520/5240 DATA MINING Comments on Naïve Bayes Classification
- Naïve Bayes works surprisingly well (even when the independence assumption is clearly violated)
- Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
- However: adding too many redundant attributes will cause problems (e.g., identical attributes)

slide 37 DSCI 4520/5240 DATA MINING Suggested readings
- Verify the entropy, information, and information gain calculations we did in these slides. Hint: all logs are base 2!
- Read the SAS GSEM 5.3 text, chapter 4
- Read the Sarma text, chapter 4. Pay particular attention to:
  - Entropy calculations (p. 126)
  - Profit Matrix (p. 136)
  - Expected profit calculations (p. 137)
  - How to use SAS EM and grow a decision tree