Classification I: Decision Tree


AMCS/CS 340: Data Mining
Classification I: Decision Tree
Xiangliang Zhang, King Abdullah University of Science and Technology

Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
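
A minimal sketch of this workflow, assuming scikit-learn and its bundled Iris data (neither appears on the original slides, they simply stand in for the loan data used later):

```python
# Split data into training and test sets, fit a model, then
# estimate accuracy on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out one third of the records as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on unseen records estimates generalization performance.
print("test accuracy:", model.score(X_test, y_test))
```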

Classification Example
(figure) A training set of records with categorical and continuous attributes plus a class label is fed to a learning algorithm, which produces a classifier model; the model is then applied to a test set. Running example: predicting borrowers who cheat on loan payments.

Issues: Evaluating Classification Methods
- Accuracy: how well the class labels of test data are predicted
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in large-scale data
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Learning from Neighbors
- Bayesian Classification
- Neural Networks
- Ensemble Methods
- Support Vector Machines

Example of a Decision Tree
Training data with categorical and continuous attributes yields the model below. Splitting attributes: Refund, MarSt, TaxInc.
Refund?
- Yes -> NO
- No -> MarSt?
  - Married -> NO
  - Single, Divorced -> TaxInc?
    - < 80K -> NO
    - >= 80K -> YES

Another Example of a Decision Tree
The same training data also fits a tree that splits on MarSt first:
MarSt?
- Married -> NO
- Single, Divorced -> Refund?
  - Yes -> NO
  - No -> TaxInc?
    - < 80K -> NO
    - >= 80K -> YES
There can be more than one tree that fits the same data!

Decision Tree Classification Task
(overview figure: a tree-induction algorithm learns a model from the training set; the model is then applied to the test set)

Apply Model to Test Data
Start from the root of the tree and descend, following at each node the branch that matches the test record's attribute value. For example, a test record with Refund = No and MarSt = Married goes down the Refund = No branch, then the Married branch, reaching the NO leaf: assign Cheat to "No".

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Examples are partitioned recursively based on selected attributes
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left

Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT

General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t. General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
- If Dt is an empty set, then t is a leaf node labeled by the default class, yd
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
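
A compact sketch of this recursion (the data structures and the naive choice of splitting attribute are mine; a real implementation would pick the attribute with an impurity measure, as discussed below):

```python
from collections import Counter

def hunt(records, labels, attributes, default_class):
    """Recursive skeleton of Hunt's algorithm.

    records: list of dicts mapping attribute name -> value
    labels:  list of class labels aligned with records
    Returns a nested-dict tree; leaves are class labels."""
    # Case 2: empty node -> leaf labeled with the default class.
    if not records:
        return default_class
    # Case 1: pure node -> leaf labeled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left -> majority vote.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Case 3: choose an attribute test (here simply the first attribute,
    # a placeholder for a real impurity-based choice) and recurse.
    attr = attributes[0]
    majority = Counter(labels).most_common(1)[0][0]
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_records = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        tree[attr][value] = hunt(sub_records, sub_labels,
                                 attributes[1:], majority)
    return tree

# Tiny illustration shaped like the slides' loan example.
data = [{"Refund": "Yes", "MarSt": "Single"},
        {"Refund": "No", "MarSt": "Married"},
        {"Refund": "No", "MarSt": "Single"}]
labels = ["No", "No", "Yes"]
print(hunt(data, labels, ["Refund", "MarSt"], default_class="No"))
```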

Hunt's Algorithm (example)
(figure) Growing the tree on the loan data: start with a single leaf (Don't Cheat); split on Refund (Yes -> Don't Cheat); then split the Refund = No branch on Marital Status (Married -> Don't Cheat); finally split Single, Divorced on Taxable Income (< 80K -> Don't Cheat, >= 80K -> Cheat).

Issues of Hunt's Algorithm
Determine how to split the records:
- How to specify the attribute test condition? (how many branches; partition threshold for splitting)
- How to determine the best split? (which attribute to choose)

How to Specify the Test Condition?
Depends on the attribute type: nominal, ordinal, or continuous.
Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as distinct values, e.g., CarType -> {Family | Sports | Luxury}
- Binary split: divide the values into two subsets and find the optimal partitioning, e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as distinct values, e.g., Size -> {Small | Medium | Large}
- Binary split: divide the values into two subsets that respect the order, e.g., {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
- What about the split {Small, Large} vs. {Medium}? It violates the order of the attribute values, so it is not a valid ordinal split.

Splitting Based on Continuous Attributes
Different ways of handling:
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut
- Discretization to form an ordinal categorical attribute; ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering

How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1. Which attribute test gives the best split?

How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred. We need a measure of node impurity: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.

Measures of Node Impurity
- Gini index: measures how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset
- Entropy: a measure of the uncertainty associated with a random variable
- Misclassification error: the proportion of misclassified samples

Measures of Node Impurity
Let p(j | t) denote the relative frequency of class j at node t.
- Gini index: GINI(t) = 1 - Σ_j [p(j | t)]^2
- Entropy: Entropy(t) = - Σ_j p(j | t) log2 p(j | t)
- Misclassification error: Error(t) = 1 - max_j p(j | t)
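
A small sketch computing the three measures for a node's class counts (the function name is mine, not from the slides):

```python
import math

def impurity(counts):
    """Gini, entropy, and misclassification error for a node,
    given the list of class counts at that node."""
    n = sum(counts)
    probs = [c / n for c in counts]
    gini = 1.0 - sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    error = 1.0 - max(probs)
    return gini, entropy, error

print(impurity([10, 0]))  # pure node: all three measures are 0
print(impurity([5, 5]))   # 50/50 node: gini 0.5, entropy 1.0, error 0.5
```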

Comparison among Measures of Node Impurity
For a 2-class problem (figure: impurity as a function of the fraction p of records in one class): all three measures are 0 at p = 0 and p = 1 and peak at p = 0.5, where entropy reaches 1 while Gini and misclassification error reach 0.5.

Quality of Split
When a node p with n records is split into k partitions (children), where child i has ni records, the quality of the split is computed as
GINI_split = Σ_{i=1..k} (ni/n) GINI(i)
or, using entropy, as information gain:
GAIN_split = Entropy(p) - Σ_{i=1..k} (ni/n) Entropy(i)
This measures the reduction in GINI/entropy achieved by the split; choose the split that achieves the greatest reduction (maximizes GAIN).

Quality of Split: Binary Attributes
Splitting node P into two partitions N1 and N2 (a binary attribute test, Yes/No). The parent holds 6 records of each class; N1 receives class counts (4, 3) and N2 receives (2, 3):
Gini(N1) = 1 - (4/7)^2 - (3/7)^2 = 0.4898
Gini(N2) = 1 - (2/5)^2 - (3/5)^2 = 0.480
Gini_split(children) = 7/12 x 0.4898 + 5/12 x 0.480 = 0.486
Gain_split = Gini(parent) - Gini_split(children) = 0.5 - 0.486 = 0.014
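
A sketch reproducing this computation (helper names are mine):

```python
def gini(counts):
    """Gini index of a node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split, given per-child class counts."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [6, 6]              # class counts before the split
children = [[4, 3], [2, 3]]  # N1 and N2 from the slide
print(gini_split(children))                  # ~0.486
print(gini(parent) - gini_split(children))   # gain ~0.014
```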

CART
CART: Classification and Regression Trees
- constructs trees with only binary splits (simplifies the splitting criterion)
- uses the Gini index as the splitting criterion
- splits on the attribute that provides the smallest Gini_split(p), i.e., the largest GAIN_split(p)
- needs to enumerate all possible splitting points for each attribute

Continuous Attributes: Computing the Gini Index
For efficient computation, for each continuous attribute:
- Sort the records on the attribute's values
- Set candidate split positions as the midpoints between adjacent sorted values
- Linearly scan these values, each time updating the class-count matrix and computing the Gini index
- Choose the split position that has the least Gini index
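
A sketch of this sort-then-scan procedure (function name and data layout are mine; the counts are updated incrementally so the scan is linear after the initial sort):

```python
def best_numeric_split(values, labels):
    """Scan candidate thresholds (midpoints of adjacent sorted values)
    and return (best_threshold, best_weighted_gini)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    below = {c: 0 for c in classes}
    above = {c: 0 for c in classes}
    for _, y in pairs:
        above[y] += 1
    n = len(pairs)

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

    best_t, best_g = None, float("inf")
    for i in range(n - 1):
        _, y = pairs[i]
        below[y] += 1          # move one record across the boundary
        above[y] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue           # no valid threshold between equal values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        nb = i + 1
        g = nb / n * gini(below, nb) + (n - nb) / n * gini(above, n - nb)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

# The SLIQ Age example from later slides: the best cut is 27.5.
print(best_numeric_split([23, 17, 43, 68, 32, 20],
                         ["H", "H", "H", "L", "L", "H"]))
```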

How to Find the Best Split for a Categorical Attribute
- Two-way split: find the best partition of the values, e.g., CarType {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}
- Multi-way split: CarType -> {Family | Sports | Luxury}
Among these candidate splits of CarType, the largest gain is 0.337.

Which Attribute to Split?
Suppose three candidate splits give Gain = 0.02, Gain = 0.337, and Gain = 0.5. Is the one with the largest gain the best?
Disadvantage: information gain tends to prefer splits that result in a large number of partitions, each being small but pure.
- A unique value for each record is not predictive
- A small number of records in each node does not allow reliable prediction

Splitting Based on Gain Ratio
Gain ratio: when parent node p with n records is split into k partitions, where ni is the number of records in partition i,
GainRatio_split = GAIN_split / SplitINFO, where SplitINFO = - Σ_{i=1..k} (ni/n) log2(ni/n)
- designed to overcome the disadvantage of information gain
- adjusts information gain by the entropy of the partitioning (SplitINFO)
- higher-entropy partitioning (a large number of small partitions) is penalized!
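
A sketch of the penalty at work (helper names are mine):

```python
import math

def split_info(child_sizes):
    """Entropy of the partitioning itself (SplitINFO)."""
    n = sum(child_sizes)
    return -sum((ni / n) * math.log2(ni / n) for ni in child_sizes if ni)

def gain_ratio(gain, child_sizes):
    return gain / split_info(child_sizes)

# The same information gain is discounted more when the split
# shatters the data into many tiny partitions.
print(gain_ratio(0.5, [6, 6]))    # 2-way even split: SplitINFO = 1, ratio 0.5
print(gain_ratio(0.5, [1] * 12))  # 12 singletons: SplitINFO ~3.58, ratio ~0.14
```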

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Gini gain: biased toward multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions
- Information gain: biased toward multivalued attributes
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others

ID3 and C4.5
ID3 (Ross Quinlan, 1986) is the precursor to the C4.5 algorithm (Ross Quinlan, 1993); C4.5 is an extension of the earlier ID3 algorithm.
- For each unused attribute Ai, compute the information GAIN (ID3) or GainRatio (C4.5) from splitting on Ai
- Find the best splitting attribute Abest, the one with the highest GAIN or GainRatio
- Create a decision node that splits on Abest
- Recurse on the sublists obtained by splitting on Abest, and add those nodes as children of the node

Improvements of C4.5 over the ID3 Algorithm
- Handling both continuous and discrete attributes: to handle continuous attributes, C4.5 creates a threshold and then splits the list into the records whose attribute value is above the threshold and those that are less than or equal to it
- Handling training data with missing attribute values: C4.5 allows attribute values to be marked as ? for missing; missing attribute values are simply not used in gain and entropy calculations
- Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help, replacing them with leaf nodes

Issues of C4.5
- Needs the entire data set to fit in memory; unsuitable for large datasets
- Needs a lot of computation at every stage of decision tree construction
The software can be downloaded from http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/c4.5r8.tar.gz
More information: http://www2.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/c4.5/tutorial.html

SLIQ: A Decision Tree Classifier
SLIQ, Supervised Learning In Quest (EDBT'96, Mehta et al.)
- uses a pre-sorting technique in the tree-growing phase (eliminates the need to sort data at each node)
- creates a separate attribute list for each attribute of the training data; a separate list, called the class list, is created for the class labels attached to the examples
- requires that only the class list and one attribute list be kept in memory at any time
- suitable for classification of large disk-resident datasets
- applies to both numerical and categorical attributes

SLIQ Methodology
Create the decision tree by partitioning records: generate an attribute list for each attribute, then sort the attribute lists of NUMERIC attributes. Only NUMERIC attributes are sorted.
Example training data (drivers):
Age | CarType | Class (risk)
 23 | family  | HIGH
 17 | sports  | HIGH
 43 | sports  | HIGH
 68 | family  | LOW
 32 | truck   | LOW
 20 | family  | HIGH
Sorted attribute list for Age (Age | Class | RecId):
 17 | HIGH | 2
 20 | HIGH | 6
 23 | HIGH | 1
 32 | LOW  | 5
 43 | HIGH | 3
 68 | LOW  | 4

Numeric Attribute Splitting Index
Scan the sorted Age list, maintaining class histograms Cbelow and Cabove for each candidate partition position:
- Position 0 (nothing below): Cbelow = (HIGH 0, LOW 0), Cabove = (HIGH 4, LOW 2), Gini_split = 0.44
- Position 3: Cbelow = (HIGH 3, LOW 0), Cabove = (HIGH 1, LOW 2), Gini_split = 0.22
- Position 6 (everything below): Cbelow = (HIGH 4, LOW 2), Cabove = (HIGH 0, LOW 0), Gini_split = 0.44

The best partition position is 3, giving the split Age < 27.5 (node N1) vs. Age > 27.5 (node N2).

Numeric Attribute Splitting Index (continued)
After splitting on Age < 27.5, the attribute lists are partitioned between the children (Age | Class | CarType | RecId):
N1 (Age < 27.5): 17 | HIGH | sports | 2; 20 | HIGH | family | 6; 23 | HIGH | family | 1
N2 (Age > 27.5): 32 | LOW | truck | 5; 43 | HIGH | sports | 3; 68 | LOW | family | 4

SPRINT
SPRINT: A Scalable Parallel Classifier for Data Mining (VLDB'96, J. Shafer et al.)
- an enhancement of SLIQ, implemented in both serial and parallel versions for good data placement and load balancing
- one-time sort of the data items
- uses two data structures, attribute lists and histograms, neither of which needs to be memory-resident, making SPRINT suitable for large data sets
- handles both continuous and categorical attributes

Decision Tree Issues
- Underfitting and overfitting
- Missing values

Underfitting and Overfitting
(figure: training and test error as the tree grows)

Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples
Two approaches to avoid overfitting:
- Pre-pruning: stop tree construction early; do not split a node if this would result in the gain measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Post-pruning: remove branches from a "fully grown" tree; use a set of data different from the training data to decide which is the "best pruned tree"

Tree Post-Pruning
- Reduced-error pruning: start from the leaves; replace each node with its most popular class; keep the change if the prediction accuracy is not affected
- Cost-complexity pruning: generate a series of trees T0, ..., Tm, where T0 is the initial tree and Tm is the root alone; construct tree Ti by replacing a subtree of tree Ti-1 with a leaf node whose class label is the majority class of the instances in the subtree. Which subtree to remove? In CART-style pruning, the one whose removal increases the training error least per pruned leaf (the "weakest link"); see the sketch below.
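
A minimal sketch of post-pruning via cost-complexity pruning, assuming scikit-learn (not the tool used on the slides); a held-out validation split plays the role of the "set of data different from the training data":

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas of the pruning path: each alpha yields one tree
# in the series T0 (full tree) ... Tm (root alone).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    acc = tree.score(X_val, y_val)    # validation accuracy of this pruned tree
    if acc >= best_acc:
        best_alpha, best_acc = alpha, acc

print("best pruned tree: alpha=%.5f, val acc=%.3f" % (best_alpha, best_acc))
```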

Handling Missing Attribute Values
Missing values affect decision tree construction in three different ways:
- how impurity measures are computed
- how to distribute an instance with a missing value to child nodes
- how a test instance with a missing value is classified

Computing the Impurity Measure with a Missing Value
One of the 10 training records has a missing Refund value.
Before splitting: Entropy(Parent) = -0.3 log2(0.3) - 0.7 log2(0.7) = 0.8813
Split on Refund, using the 9 records with a known Refund value:
- Entropy(Refund=Yes) = 0
- Entropy(Refund=No) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
- Entropy(Children) = 0.3 x 0 + 0.6 x 0.9183 = 0.551
Gain = 0.9 x (0.8813 - 0.551) = 0.2973, where the factor 0.9 is the fraction of records with a known Refund value.

Distribute Instances
For the record with the missing Refund value: the probability that Refund=Yes is 3/9 and the probability that Refund=No is 6/9. Assign the record to the Refund=Yes child with weight 3/9 and to the Refund=No child with weight 6/9.
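
A sketch of this C4.5-style fractional distribution (data structures and the function name are mine):

```python
def distribute(instance, weight, split_attr, branch_counts):
    """Return [(branch_value, weight)] for one instance.

    branch_counts: {attr_value: count of training records with that value},
    used to split the weight of a missing-value instance in proportion
    to the observed branch frequencies."""
    value = instance.get(split_attr)
    if value is not None:
        return [(value, weight)]  # known value: full weight down one branch
    total = sum(branch_counts.values())
    return [(v, weight * c / total) for v, c in branch_counts.items()]

# Record with missing Refund: weight 3/9 to Yes, 6/9 to No.
print(distribute({"Refund": None}, 1.0, "Refund", {"Yes": 3, "No": 6}))
```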

Classify Instances
A new test record has a missing Marital Status value and reaches the MarSt node. The weighted class counts there are:
            Married  Single  Divorced  Total
Class=No      3        1        0       4
Class=Yes    6/9       1        1       2.67
Total        3.67      2        1       6.67
Probability that Marital Status = Married is 3.67/6.67 = 0.55; probability that Marital Status = {Single, Divorced} is 3/6.67 = 0.45.

Classify Instances (continued)
Following both branches with these weights: the Married branch leads to the NO leaf, so the probability that Class = No is 0.55; the Single, Divorced branch leads through the TaxInc test to the YES leaf for this record, so the probability that Class = Yes is 0.45.

Questions
- What is a decision tree?
- How do we choose the best attribute to split?
- How do we decide the branches when splitting?
- What are CART, ID3 and C4.5? What are their differences?
- How can decision trees be used to learn from large data?
- How do we avoid overfitting in decision tree learning?
- How can missing values be handled by a decision tree?

How Can a Tree Be Used in Other Ways?
A tree can be converted into rules: Tree -> Rules.

Rule Extraction from a Decision Tree
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
- The rules are mutually exclusive (rules are independent and each record is covered by at most one rule) and exhaustive (each record is covered by at least one rule)
- The rule set contains as much information as the tree

Rule Extraction from a Decision Tree (continued)
Rules can be simplified: for example, the path rule through Refund and Marital Status reduces to (Marital Status = {Married}) ==> No.

Rule-Based Classifier
Classify records by using a collection of "if...then..." rules.
Rule: represent the knowledge in the form of IF-THEN rules, (Condition) -> y, where the Condition (rule antecedent) is a conjunction of attribute tests and y (rule consequent) is the class label.
Examples of classification rules:
- (Blood Type = Warm) AND (Lay Eggs = Yes) -> Birds
- (Taxable Income < 50K) AND (Refund = Yes) -> Cheat = No
Classifier: a rule R covers an instance x if the attributes of the instance satisfy the condition of the rule; instance x is labeled by the consequent of R.

Sequential Covering Algorithm
Extract rules directly from the training data, e.g., FOIL (Quinlan 1990), AQ (Michalski et al. 1986), CN2 (Clark & Niblett 1989), RIPPER (Cohen 1995).
Steps:
1. Start from an empty rule
2. Grow a rule for one class Ci using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps 2 and 3 until a stopping criterion is met, e.g., when no more training examples remain or when the quality of a returned rule is below a user-specified threshold
Compare with decision-tree induction: a decision tree learns a set of rules simultaneously. A sketch of the covering loop follows.
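
A schematic sketch of the covering loop; the simplified greedy learn_one_rule below is a hypothetical stand-in for a FOIL/CN2-style rule grower, and all names are mine:

```python
def rule_covers(rule, record):
    # A rule is a dict of attribute -> required value.
    return all(record.get(a) == v for a, v in rule.items())

def rule_quality(rule, records, labels, target):
    """Fraction of covered records that belong to the target class."""
    covered = [y for r, y in zip(records, labels) if rule_covers(rule, r)]
    return (sum(y == target for y in covered) / len(covered)) if covered else 0.0

def learn_one_rule(records, labels, target):
    """Greedily add the best attribute=value condition until the rule
    is pure on the remaining records (simplified Learn-One-Rule)."""
    rule = {}
    while True:
        candidates = {(a, v) for r in records for a, v in r.items()
                      if a not in rule}
        if not candidates:
            return rule
        a, v = max(candidates,
                   key=lambda av: rule_quality({**rule, av[0]: av[1]},
                                               records, labels, target))
        rule[a] = v
        if rule_quality(rule, records, labels, target) == 1.0:
            return rule

def sequential_covering(records, labels, target, min_quality=0.8):
    rules = []
    records, labels = list(records), list(labels)
    while any(y == target for y in labels):
        rule = learn_one_rule(records, labels, target)   # grow one rule
        if rule_quality(rule, records, labels, target) < min_quality:
            break                                        # stopping criterion
        rules.append(rule)
        # Remove the training records covered by the rule.
        keep = [i for i, r in enumerate(records) if not rule_covers(rule, r)]
        records = [records[i] for i in keep]
        labels = [labels[i] for i in keep]
    return rules

data = [{"Refund": "No", "MarSt": "Married"},
        {"Refund": "Yes", "MarSt": "Single"},
        {"Refund": "No", "MarSt": "Single"}]
labels = ["No", "No", "Yes"]
print(sequential_covering(data, labels, target="Yes"))
```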

Rules for 2-Class and Multi-Class Problems
For a 2-class problem:
- choose one of the classes as the positive class and the other as the negative class
- learn rules for the positive class; the negative class is the default class
For a multi-class problem:
- order the classes by increasing class prevalence (the fraction of instances that belong to a particular class)
- learn the rule set for the smallest class first, treating the rest as the negative class
- repeat with the next smallest class as the positive class

AMCS/CS 340: Data Mining
Classification: Evaluation
Xiangliang Zhang, King Abdullah University of Science and Technology

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc. The most widely used starting point is the confusion matrix:
                    PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes    a (TP)                b (FN)
ACTUAL Class=No     c (FP)                d (TN)
TP: true positive, FP: false positive, TN: true negative, FN: false negative. Accuracy = (a + d) / (a + b + c + d).
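
A small sketch computing the confusion matrix and accuracy by hand, matching the a/b/c/d notation above (example labels are made up):

```python
def confusion(y_true, y_pred, positive="Yes"):
    """Return (TP, FN, FP, TN) counts, i.e. (a, b, c, d)."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1   # a
        elif t == positive:
            fn += 1   # b
        elif p == positive:
            fp += 1   # c
        else:
            tn += 1   # d
    return tp, fn, fp, tn

y_true = ["Yes", "Yes", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No",  "No", "Yes", "No", "Yes"]
tp, fn, fp, tn = confusion(y_true, y_pred)
print("accuracy:", (tp + tn) / (tp + fn + fp + tn))
```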

Limitation of Accuracy
Consider a 2-class problem with unbalanced classes: 9990 examples of class 0 and 10 examples of class 1. If the model predicts everything to be class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any class 1 example.

Other Measures
From the same confusion matrix, besides accuracy: Precision = a / (a + c), Recall = a / (a + b), and the F-measure = 2a / (2a + b + c), the harmonic mean of precision and recall.

Methods for Performance Evaluation
How to obtain a reliable estimate of performance? The measured performance of a model may depend on factors besides the learning algorithm:
- class distribution
- cost of misclassification
- size of the training and test sets

Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation: partition the data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
- Bootstrap: sampling with replacement
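
A minimal cross-validation sketch; the slides name the methods, while the choice of scikit-learn and the Iris data is mine:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 10-fold: train on 9 partitions, test on the remaining one, 10 times.
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", scores.mean())

# Leave-one-out is k-fold with k = n (one record per test fold).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())
```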

ROC (Receiver Operating Characteristic)
- Developed in the 1950s for signal detection theory, to analyze noisy signals
- Characterizes the trade-off between positive hits and false alarms
- The ROC curve plots the TP rate (y-axis) against the FP rate (x-axis)
- The performance of a classifier at one operating point is a single point on the ROC curve; changing the algorithm's threshold, or a change in the sample distribution, moves the point
Example: a 1-dimensional data set containing 2 classes (positive and negative), where any point located at x > t is classified as positive. At threshold t: TPR = 0.5, FPR = 0.12.

ROC Curve
Notable (FPR, TPR) points:
- (0, 0): declare everything to be the negative class
- (1, 1): declare everything to be the positive class
- (0, 1): ideal
The diagonal line corresponds to random guessing; below the diagonal, the prediction is the opposite of the true class.

Using ROC for Model Comparison
No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR. Compare the area under the ROC curve (AUC): ideal, area = 1; random guess, area = 0.5.

How to Construct an ROC Curve
Use the classifier's posterior probability for each test instance as a threshold t: for each value of t, count the positive and negative instances scored >= t to get one (FPR, TPR) point, then sweep t over all distinct scores to trace the curve.
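
A sketch of this threshold sweep in pure Python (the example scores and labels are made up for illustration):

```python
def roc_points(scores, labels):
    """labels: '+' or '-'; returns [(fpr, tpr)] from high to low threshold."""
    P = sum(1 for y in labels if y == "+")
    N = len(labels) - P
    points = [(0.0, 0.0)]  # threshold above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
        points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+",  "+",  "-",  "-",  "-",  "+",  "-",  "+",  "-",  "+"]
print(roc_points(scores, labels))
```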

Confidence Interval for Accuracy
A prediction can be regarded as a Bernoulli trial: a Bernoulli trial has 2 possible outcomes, and the possible outcomes for a prediction are correct or wrong. A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions. E.g., toss a fair coin 50 times; how many heads would turn up? The expected number of heads is N x p = 50 x 0.5 = 25.
Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p, the true accuracy of the model?

Confidence Interval for Accuracy (continued)
For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N, so
P( -Z_{α/2} <= (acc - p) / sqrt(p(1-p)/N) <= Z_{1-α/2} ) = 1 - α
Solving for p gives the confidence interval
p = ( 2N acc + Z_{α/2}^2 ± Z_{α/2} sqrt(Z_{α/2}^2 + 4N acc - 4N acc^2) ) / ( 2(N + Z_{α/2}^2) )

Confidence Interval for Accuracy (example)
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 - α = 0.95 (95% confidence); from the standard normal table, Z_{α/2} = 1.96.
Standard normal quantiles:
1-α:     0.99  0.98  0.95  0.90
Z_{α/2}: 2.58  2.33  1.96  1.65
Confidence interval for p as the test size N grows (acc = 0.8):
N:        50    100   500   1000  5000
p(lower): 0.670 0.711 0.763 0.774 0.789
p(upper): 0.888 0.866 0.833 0.824 0.811
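
A sketch reproducing this table from the interval formula above (function name is mine):

```python
import math

def acc_interval(acc, n, z):
    """Confidence interval for the true accuracy p given observed acc."""
    disc = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return ((2 * n * acc + z * z - disc) / denom,
            (2 * n * acc + z * z + disc) / denom)

for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_interval(0.8, n, z=1.96)   # 95% confidence
    print(n, round(lo, 3), round(hi, 3))    # e.g. N=100 gives roughly (0.71, 0.87)
```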

Test of Significance
Given two models: M1 with accuracy 85%, tested on 30 instances; M2 with accuracy 75%, tested on 5000 instances.
- Can we say M1 is better than M2?
- How much confidence can we place on the accuracies of M1 and M2?
- Can the difference in performance be explained as the result of random fluctuations in the test sets?

Comparing the Performance of 2 Models
Given two models M1 and M2, which is better?
- M1 is tested on D1 (size n1), with observed error rate e1
- M2 is tested on D2 (size n2), with observed error rate e2
- Assume D1 and D2 are independent
- If n1 and n2 are sufficiently large, then e1 and e2 are approximately normally distributed
- The variance of each error rate is approximated from the binomial distribution: σ_i^2 ≈ e_i(1 - e_i) / n_i

Comparing the Performance of 2 Models (continued)
To test whether the performance difference is statistically significant, consider d = e1 - e2, an estimate of the true difference dt. Since D1 and D2 are independent, their variances add up:
σ_d^2 = σ_1^2 + σ_2^2 ≈ e1(1 - e1)/n1 + e2(1 - e2)/n2
At the (1 - α) confidence level, dt = d ± Z_{α/2} σ_d. If this interval contains 0, the difference is not statistically significant at that level.

An Illustrative Example
Given: M1 with n1 = 30, e1 = 0.15, and M2 with n2 = 5000, e2 = 0.25; d = |e2 - e1| = 0.1 (2-sided test).
σ_d^2 = 0.15 x 0.85 / 30 + 0.25 x 0.75 / 5000 = 0.0043
At the 95% confidence level, Z_{α/2} = 1.96, so dt = 0.1 ± 1.96 x sqrt(0.0043) = 0.1 ± 0.128.
The interval [-0.028, 0.228] contains 0, so the difference may not be statistically significant.
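
A sketch of the same comparison (function name is mine):

```python
import math

def diff_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference in error rates."""
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    d = abs(e2 - e1)
    margin = z * math.sqrt(var_d)
    return d - margin, d + margin

lo, hi = diff_interval(0.15, 30, 0.25, 5000)
print((round(lo, 3), round(hi, 3)))            # interval contains 0
print("significant at 95%?", not (lo <= 0 <= hi))
```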

Our Teaching Assistant
Name: Francisco Franco
Email: francisco.franco@kaust.edu.sa
Responsibilities: homework, project reports, final grades, questions