1 Decision Trees. Jyh-Shing Roger Jang (張智星), CSIE Dept, National Taiwan University

2 Classification. Stages in classification: Model construction: given a collection of records (the training set), where each record has a set of attributes including the class, find a model (classifier) for predicting the class as a function of the other attributes. Model evaluation: use previously unseen records (the test set) to test the model, which should assign classes as accurately as possible. Model application: apply the model to new records directly.
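
To make the three stages concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on its bundled iris data (an illustration added for this transcript, not part of the original slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model construction: fit a classifier on the training set.
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Model evaluation: check accuracy on previously unseen records.
print("test accuracy:", clf.score(X_test, y_test))

# Model application: predict the class of a new record directly.
print("predicted class:", clf.predict(X_test[:1]))
```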

3 Stages in Classification

4 Example

5 Examples of Classification/Regression Tasks. Classification: predict the trend (up or down) of stock markets; predict tumors as benign or malignant; classify credit card transactions as legitimate or fraudulent; categorize news articles as finance, weather, entertainment, sports, etc. Regression: predict the temperature 3 hours from now; predict tomorrow’s gold/oil price; estimate the path of a typhoon.

6 Methods for Classification. There are numerous methods for classification: decision trees, minimum-distance classifiers, artificial neural networks, naïve Bayes classifiers, quadratic classifiers, Gaussian-mixture-model classifiers, support vector machines, rule-based methods, …

7 Decision Tree Induction. Again, there are many algorithms: Hunt’s algorithm (one of the earliest), CART (classification and regression trees), ID3, C4.5, SLIQ, SPRINT, …

9 General Steps in Tree Induction. Idea: we want to send all the training data down the tree until it reaches the leaves, where the data should be as “pure” as possible. Let D be the data set that reaches a node. General procedure: if D contains only records belonging to the same class y, mark the node as a leaf with class y; otherwise, use a test on an attribute to split the data set and build the subtrees recursively, as sketched below.
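
A minimal sketch of this recursive procedure in Python, assuming a hypothetical choose_split helper that performs the greedy test selection discussed on the following slides:

```python
from collections import Counter

def grow_tree(records, labels, choose_split):
    # Leaf case: all records at this node (the set D) share one class y.
    if len(set(labels)) == 1:
        return {"class": labels[0]}
    # choose_split is a hypothetical helper: it returns (test_fn, outcomes),
    # or None when no useful test exists (then fall back to a majority leaf).
    split = choose_split(records, labels)
    if split is None:
        return {"class": Counter(labels).most_common(1)[0][0]}
    test_fn, outcomes = split
    children = {}
    for outcome in outcomes:
        # Partition D by the test outcome and build each subtree recursively.
        # (Empty partitions and degenerate splits are not handled in this sketch.)
        idx = [i for i, r in enumerate(records) if test_fn(r) == outcome]
        children[outcome] = grow_tree([records[i] for i in idx],
                                      [labels[i] for i in idx],
                                      choose_split)
    return {"test": test_fn, "children": children}
```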

10 Tree Induction. Issues in tree induction: how to split the dataset at a node (split based on a greedy search that optimizes a certain criterion/test), and when to stop splitting (when the “impurity measure” falls below a threshold).

11 How to Specify the Test? It depends on the attribute type: nominal (car type: family, sports, luxury, etc.; aka a “factor”), ordinal (T-shirt size: small, medium, big, etc.), or continuous (temperature: 10.3, 25.6, 38, etc.). It also depends on the number of ways to split: binary (2-way) split or multi-way split.

12 Splitting Based on Nominal/Ordinal Attributes. Multi-way split: use as many partitions as there are distinct values, e.g., CarType → {Family}, {Sports}, {Luxury}. Binary split: divide the values into two subsets via optimal partitioning, e.g., CarType → {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}; see the sketch below.
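
As an illustration of optimal partitioning, the candidate two-subset partitions of a nominal attribute can be enumerated and each one scored; a sketch with a hypothetical binary_partitions helper:

```python
from itertools import combinations

def binary_partitions(values):
    # k distinct values yield 2**(k-1) - 1 distinct binary partitions;
    # pinning the first value to the left side avoids mirror duplicates.
    first, rest = values[0], values[1:]
    for r in range(len(rest)):          # r = how many of `rest` join `first`
        for extra in combinations(rest, r):
            yield {first, *extra}, set(rest) - set(extra)

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(left, "vs", right)
# {'Family'} vs {'Sports', 'Luxury'}
# {'Family', 'Sports'} vs {'Luxury'}
# {'Family', 'Luxury'} vs {'Sports'}   (element order within a set may vary)
```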

13 Splitting Based on Continuous Attributes. Multi-way split: discretization to form an ordinal categorical attribute. Binary split (A < v or A ≥ v): consider all possible splits and find the best one.
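
A small sketch of the discretization option, using NumPy's digitize with hypothetical cut points:

```python
import numpy as np

temps = np.array([10.3, 25.6, 38.0, 15.2, 29.9])  # continuous attribute
cuts = [15, 25, 35]                                # hypothetical bin boundaries
print(np.digitize(temps, cuts))                    # -> [0 2 3 1 2]: an ordinal attribute
```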

14 To Determine the Best Split. Goal: nodes with a homogeneous (pure) class distribution are preferred, so we need a measure of node impurity, which should be kept as low as possible during split selection. (Figure: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.)

15 Measures of Node Impurity. There are numerous measures of node impurity: the Gini index, entropy, and classification error. (Figure: the three measures plotted against class distribution for a 2-class problem.)
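
The three measures can be computed from a node's per-class record counts; a sketch with illustrative names (the 2-class demo below also shows the extreme values discussed on the next slide):

```python
import numpy as np

def impurity(class_counts, measure="gini"):
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                    # relative frequencies p(j|t)
    if measure == "gini":
        return 1.0 - float(np.sum(p ** 2))
    if measure == "entropy":
        p = p[p > 0]                   # treat 0 * log2(0) as 0
        return float(-np.sum(p * np.log2(p)) + 0.0)  # + 0.0 avoids -0.0
    if measure == "error":
        return 1.0 - float(p.max())
    raise ValueError(f"unknown measure: {measure}")

# 2-class illustration: a pure node versus a maximally mixed node.
for counts in ([6, 0], [3, 3]):
    print(counts, {m: round(impurity(counts, m), 3)
                   for m in ("gini", "entropy", "error")})
# [6, 0] {'gini': 0.0, 'entropy': 0.0, 'error': 0.0}
# [3, 3] {'gini': 0.5, 'entropy': 1.0, 'error': 0.5}
```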

16 Impurity Measure: Gini Index. The Gini index for a given node t is Gini(t) = 1 − Σ_j [p(j|t)]², where p(j|t) is the relative frequency of class j at node t. Extreme values: the minimum is 0, attained when all records belong to the same class; the maximum is 1 − 1/(# of classes), attained when the records are equally distributed among all classes. This quantity is the “confusion” in HW4.

17 Splitting Based on Gini Index. The quality of splitting a node t into k children is Gini_split = Σ_{i=1..k} (n_i / n) × Gini(t_i), where t_i is the node of child i, n_i is the number of records at t_i, and n is the number of records at node t. This weighted sum is the “total confusion” in HW4.
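
A direct transcription of this formula into Python (illustrative names; gini implements the previous slide's definition from per-class counts):

```python
def gini(counts):
    # Gini index of a node, computed from its per-class record counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children_counts):
    # Quality of a split: sum over children of (n_i / n) * Gini(t_i),
    # where children_counts holds one per-class count list per child.
    n = sum(sum(counts) for counts in children_counts)
    return sum(sum(counts) / n * gini(counts) for counts in children_counts)
```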

18 Gini Index for General Binary Split. Example of computing the Gini index for a binary split on attribute B: the “yes” branch leads to node N1, holding 5 records of one class and 2 of the other; the “no” branch leads to node N2, holding 1 and 4. Gini(N1) = 1 − (5/7)² − (2/7)² ≈ 0.408. Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320. Gini_split(B) = 7/12 × 0.408 + 5/12 × 0.320 ≈ 0.371.
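
Using the gini and gini_split sketches from the previous slide's note, the example's numbers can be reproduced:

```python
print(round(gini([5, 2]), 3))                  # 0.408
print(round(gini([1, 4]), 3))                  # 0.32
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371
```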

19 Gini Index for Nominal Attributes. For each child, obtain the counts for each class; compute the Gini index for each child; then compute the Gini index for the split. This applies to both the multi-way split and the two-way split (where we find the best partition of values).

20 Gini Index for Binary Split on Continuous Attributes. For each attribute: sort the attribute values; linearly scan these values, updating the count matrix and computing the Gini index at each new value; then choose the split that has the smallest Gini index. (Figure: the sorted values with the candidate split positions between them.) A sketch of this scan follows.
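
A sketch of this sort-then-scan procedure for two or more classes, on hypothetical income data; moving one record at a time between the two count matrices keeps the scan linear after the initial sort:

```python
import numpy as np
from collections import Counter

def best_gini_threshold(values, labels):
    # Find the binary split (A < v vs. A >= v) with the smallest weighted Gini.
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    y = np.asarray(labels)[order]
    n = len(v)
    left, right = Counter(), Counter(y)     # per-class count "matrices"

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    best_t, best_g = None, float("inf")
    for i in range(1, n):
        left[y[i - 1]] += 1                 # record i-1 crosses the split point
        right[y[i - 1]] -= 1
        if v[i] == v[i - 1]:
            continue                        # no threshold fits between equal values
        g = (i / n) * gini(left, i) + ((n - i) / n) * gini(right, n - i)
        if g < best_g:
            best_t, best_g = (v[i - 1] + v[i]) / 2, g
    return best_t, best_g

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels  = ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]
print(best_gini_threshold(incomes, labels))  # -> (97.5, 0.3)
```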

