
1 Lecture 7

2 Outline
1. Overview of Classification and Decision Tree
2. Algorithm to build Decision Tree
3. Formula to measure information
4. Weka, data preparation and visualization

3 1. Illustration of the Classification Task (courtesy of Professor David Mease for the next 10 slides)
[Figure: a training set is fed to a Learning Algorithm, which produces a Model.]

4 Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes (x) plus one additional attribute, the class (y).
- Find a model to predict the class as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually the given data set is divided into a training set, used to build the model, and a test set, used to validate it.

5 Classification Examples
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
- Predicting tumor cells as benign or malignant

6 Classification Techniques
- There are many techniques/algorithms for carrying out classification.
- In this chapter we will study only decision trees.
- In Chapter 5 we will study other techniques, including some very modern and effective ones.

7 An Example of a Decision Tree
[Figure: a training data table with categorical attributes, a continuous attribute, and the class, next to the decision tree model built from it. The splitting attributes are Refund (Yes / No), MarSt (Married / Single, Divorced), and TaxInc (< 80K / > 80K); the leaves are labeled YES and NO.]

8-12 Applying the Tree Model to Predict the Class for a New Observation
[Figure, repeated across slides 8-12: the decision tree (Refund, MarSt, TaxInc) next to a test record. Starting from the root of the tree, the record is traced down one branch at a time.]

13 Applying the Tree Model to Predict the Class for a New Observation
[Figure: the traversal reaches a leaf; assign Cheat to "No".]

14 DECISION TREE CHARACTERISTICS
- Easy to understand: similar to the human decision process
- Deals with both discrete and continuous features
- Simple, nonparametric classifier: no assumptions regarding probability distribution types
- Finding an optimal tree is NP-complete, but the greedy algorithms used in practice are computationally inexpensive
- Can represent arbitrarily complex decision boundaries
- Overfitting can be a problem

15 2. Algorithm to Build Decision Trees
- Defined recursively
- Select an attribute for the root node and, using a greedy algorithm, create a branch for each possible value of the selected attribute
- Repeat recursively until either the instances or the attributes run out, or a predefined purity threshold is reached
- Use only branches that are reached by instances
(A sketch of this procedure follows below.)
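To make the recursive, greedy procedure above concrete, here is a minimal sketch in Python. It is not Weka's implementation; the record format (a list of dicts keyed by attribute name) and the helper names entropy, best_attribute, and build_tree are assumptions made purely for illustration.

import math
from collections import Counter

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(records, attributes, target):
    """Greedy choice: the attribute whose split gives the largest information gain."""
    base = entropy([r[target] for r in records])
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in records}:
            subset = [r[target] for r in records if r[attr] == value]
            remainder += len(subset) / len(records) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(records, attributes, target):
    """Recursively build a decision tree as nested dicts; leaves are class labels."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1:                  # pure node: stop
        return labels[0]
    if not attributes:                         # ran out of attributes: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(records, attributes, target)
    tree = {attr: {}}
    # Only branches actually reached by instances are created.
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        tree[attr][value] = build_tree(subset, [a for a in attributes if a != attr], target)
    return tree

On the weather data shown on the following slides (see the Python version after slide 17), build_tree(weather, ["Outlook", "Temp", "Hum", "Wind"], "Play") should place Outlook at the root, matching the information-gain comparison on slide 26.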

16 WEATHER EXAMPLE, 9 yes/5 no

17 Outlook: Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), Rainy (3 Yes, 2 No)

Sunny branch:
Temp  Hum     Wind   Play
Hot   high    FALSE  No
Hot   high    TRUE   No
Mild  high    FALSE  No
Cool  normal  FALSE  Yes
Mild  normal  TRUE   Yes

Overcast branch:
Temp  Hum     Wind   Play
Hot   high    FALSE  Yes
Cool  normal  TRUE   Yes
Mild  high    TRUE   Yes
Hot   normal  FALSE  Yes

Rainy branch:
Temp  Hum     Wind   Play
Mild  high    FALSE  Yes
Cool  normal  FALSE  Yes
Cool  normal  TRUE   No
Mild  normal  FALSE  Yes
Mild  high    TRUE   No
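The same 14 records, written as the kind of Python structure the sketch after slide 15 could consume directly. The attribute names are a notational choice for these notes, not Weka's.

# Weather data: 14 records, 9 yes / 5 no on the class Play.
weather = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Hum": "high",   "Wind": "FALSE", "Play": "No"},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Hum": "high",   "Wind": "TRUE",  "Play": "No"},
    {"Outlook": "Sunny",    "Temp": "Mild", "Hum": "high",   "Wind": "FALSE", "Play": "No"},
    {"Outlook": "Sunny",    "Temp": "Cool", "Hum": "normal", "Wind": "FALSE", "Play": "Yes"},
    {"Outlook": "Sunny",    "Temp": "Mild", "Hum": "normal", "Wind": "TRUE",  "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Hum": "high",   "Wind": "FALSE", "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Cool", "Hum": "normal", "Wind": "TRUE",  "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Mild", "Hum": "high",   "Wind": "TRUE",  "Play": "Yes"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Hum": "normal", "Wind": "FALSE", "Play": "Yes"},
    {"Outlook": "Rainy",    "Temp": "Mild", "Hum": "high",   "Wind": "FALSE", "Play": "Yes"},
    {"Outlook": "Rainy",    "Temp": "Cool", "Hum": "normal", "Wind": "FALSE", "Play": "Yes"},
    {"Outlook": "Rainy",    "Temp": "Cool", "Hum": "normal", "Wind": "TRUE",  "Play": "No"},
    {"Outlook": "Rainy",    "Temp": "Mild", "Hum": "normal", "Wind": "FALSE", "Play": "Yes"},
    {"Outlook": "Rainy",    "Temp": "Mild", "Hum": "high",   "Wind": "TRUE",  "Play": "No"},
]

Calling build_tree(weather, ["Outlook", "Temp", "Hum", "Wind"], "Play") from the earlier sketch should reproduce the Outlook-rooted tree shown on slides 27-28.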

18 DECISION TREES: Weather Example
[Figure: a one-level split on Temperature, with branches Hot, Mild, and Cool and the Yes/No instances in each branch.]

19 DECISION TREES: Weather Example
[Figure: a one-level split on Humidity, with branches High and Normal and the Yes/No instances in each branch.]

20 DECISION TREES: Weather Example
[Figure: a one-level split on Windy, with branches True and False and the Yes/No instances in each branch.]

21 3. Information Measures
- Selecting the attribute upon which to split requires a measure of "purity"/information
- Candidate measures: entropy, the Gini index, and classification error

22 A Graphical Comparison

23 Entropy
- Measures purity, similar to the Gini index
- Used in C4.5
- After the entropy is computed in each node, the overall value of the entropy is computed as the weighted average of the entropy in each node, as with the Gini index
- The decrease in entropy is called "information gain" (page 160)
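For reference (the following slides use these formulas but only write them out for two classes): for a node whose records fall into classes with proportions p_1, ..., p_k,

    Entropy = - (p_1 log2 p_1 + ... + p_k log2 p_k),   with 0 log2 0 taken as 0,

and the information gain of a split is Entropy(parent) minus the weighted average of the children's entropies, the weights being the fractions of records sent to each child.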

24 Entropy Examples for a Single Node
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0
- P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
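A quick check of these three values; a minimal Python sketch (the function name node_entropy is just for illustration):

import math

def node_entropy(p):
    """Entropy, in bits, of a two-class node whose first class has probability p."""
    if p in (0.0, 1.0):                      # convention: 0 * log 0 = 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(node_entropy(0 / 6))                   # 0.0
print(round(node_entropy(1 / 6), 2))         # 0.65
print(round(node_entropy(2 / 6), 2))         # 0.92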

25 3. Entropy: Calculating Information
- All three measures are consistent with each other; we will use entropy as the example
- The less the purity, the more bits are needed, and the less the information
- Outlook as root: Info[2, 3] = 0.971 bits, Info[4, 0] = 0.0 bits, Info[3, 2] = 0.971 bits
- Total info = 5/14 * 0.971 + 4/14 * 0.0 + 5/14 * 0.971 = 0.693 bits

26 3. Selecting the Root Attribute
- Initial info = Info[9, 5] = 0.940 bits
- Gain(Outlook) = 0.940 - 0.693 = 0.247 bits
- Gain(Temperature) = 0.029 bits
- Gain(Humidity) = 0.152 bits
- Gain(Windy) = 0.048 bits
- So, select Outlook as the root for splitting (these numbers are reproduced in the sketch below)
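These numbers can be reproduced from the class counts alone. A minimal Python sketch, using the [yes, no] counts per Outlook branch from slide 17:

import math

def info(counts):
    """Entropy, in bits, of a list of class counts, e.g. info([9, 5])."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

branches = {"Sunny": [2, 3], "Overcast": [4, 0], "Rainy": [3, 2]}
n_total = sum(sum(c) for c in branches.values())             # 14 records

after = sum(sum(c) / n_total * info(c) for c in branches.values())
print(f"{info([9, 5]):.3f}")          # 0.940 bits before splitting
print(f"{after:.3f}")                 # 0.694 (the slide rounds this to 0.693)
print(f"{info([9, 5]) - after:.3f}")  # 0.247 bits = Gain(Outlook)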

27 [Figure: the tree after splitting on Outlook. The Sunny branch (2 Yes / 3 No) is split further on Humidity (High / Normal), the Overcast branch is a pure leaf (4 Yes), and the Rainy branch (3 Yes / 2 No) is split further on Wind (FALSE / TRUE), giving YES/NO leaves.]

28 [Figure: the same tree as on slide 27, with a highlighted group of training records (Cool normal Yes, Cool normal No, Mild high No) illustrating contradictory training examples.]

29 3. Hunt's Algorithm
Many algorithms use a version of a "top-down" or "divide-and-conquer" approach known as Hunt's algorithm (page 152). Let D_t be the set of training records that reach a node t:
- If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

30 An Example of Hunt's Algorithm
[Figure: the tree is grown step by step on the cheat data: first a single leaf (Don't Cheat), then a split on Refund (Yes / No), then a further split on Marital Status (Single, Divorced / Married) under Refund = No, and finally a split on Taxable Income (< 80K / >= 80K) under Single, Divorced.]

31 How to Apply Hunt's Algorithm
Usually it is done in a "greedy" fashion. "Greedy" means that the split that is optimal according to some criterion is chosen at each stage. This may not be optimal at the end, even for the same criterion. However, the greedy approach is computationally efficient, so it is popular.

32 Using the greedy approach we still have to decide three things:
#1) What attribute test conditions to consider
#2) What criterion to use to select the "best" split
#3) When to stop splitting
For #1 we will consider only binary splits for both numeric and categorical predictors, as discussed on the next slide. For #2 we will consider misclassification error, the Gini index, and entropy. #3 is a subtle business involving model selection; it is tricky because we don't want to overfit or underfit.

33 Misclassification Error
- Misclassification error is usually our final metric, the one we want to minimize on the test set, so there is a logical argument for using it as the split criterion
- It is simply the fraction of total cases misclassified
- 1 - misclassification error = "accuracy" (page 149)
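A minimal sketch of this metric in Python (the function name is just for illustration):

def misclassification_error(y_true, y_pred):
    """Fraction of total cases misclassified; accuracy = 1 - error."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(misclassification_error(["No", "No", "Yes", "Yes"],
                              ["No", "Yes", "Yes", "Yes"]))   # 0.25, i.e. 75% accuracy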

34 Gini Index
- Commonly used in many algorithms, such as CART and the rpart() function in R
- After the Gini index is computed in each node, the overall value of the Gini index is computed as the weighted average of the Gini index in each node
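For reference (the next slide writes this out only for two classes): for a node whose records fall into classes with proportions p_1, ..., p_k, the Gini index is

    Gini = 1 - (p_1^2 + ... + p_k^2),

and the overall value of a split is the weighted average of the children's Gini indices, the weights being the fractions of records reaching each child, exactly as for entropy.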

35 Gini Examples for a Single Node
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
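The same three values checked in Python; a minimal sketch for the two-class case:

def gini2(p):
    """Gini index of a two-class node whose first class has probability p."""
    return 1 - p ** 2 - (1 - p) ** 2

print(gini2(0 / 6))                   # 0.0
print(round(gini2(1 / 6), 3))         # 0.278
print(round(gini2(2 / 6), 3))         # 0.444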

36 Misclassification Error vs. Gini Index
[Figure: a binary split on attribute A into node N1 (Yes) and node N2 (No).]
Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.490
Gini(Children) = 3/10 * 0 + 7/10 * 0.490 = 0.343
The Gini index decreases from 0.42 to 0.343 while the misclassification error stays at 30%. This illustrates why we often want to use a surrogate loss function like the Gini index even if we really only care about misclassification.
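A sketch that reproduces the comparison. The children's class counts ([3, 0] and [4, 3]) are read off the slide; the parent's counts [7, 3] are their sum, which is what gives the slide's Gini of 0.42 and error of 30%.

def gini(counts):
    """Gini index of a node, computed from its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def error(counts):
    """Misclassification error of a node, computed from its class counts."""
    return 1 - max(counts) / sum(counts)

parent = [7, 3]
children = [[3, 0], [4, 3]]          # node N1 and node N2 after the split on A
n = sum(parent)

gini_after = sum(sum(c) / n * gini(c) for c in children)
error_after = sum(sum(c) / n * error(c) for c in children)

print(round(gini(parent), 2), round(gini_after, 3))    # 0.42 -> 0.343
print(round(error(parent), 2), round(error_after, 2))  # 0.3 -> 0.3 (unchanged)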

37 5. Discretization of Numeric Data

