Introduction to Data Mining: Classification
© Tan, Steinbach, Kumar, 4/18/2004 (presentation transcript)

Classification: Definition
• Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model.
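A minimal sketch of this train/test workflow (not part of the slides; it assumes scikit-learn and uses its bundled Iris data as a stand-in for any table of records with a class attribute):

```python
# Sketch: learn a model from a training set, then estimate its accuracy
# on a held-out test set of previously unseen records.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set to measure how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on the test set estimates accuracy on unseen records.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```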

Illustrating Classification Task
(Figure: training set → learning algorithm → model; the model is then applied to a test set.)

Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
• Decision tree-based methods
• Rule-based methods
• Memory-based reasoning
• Neural networks
• Naïve Bayes and Bayesian belief networks
• Support vector machines

Example of a Decision Tree
Training data attributes: Refund (categorical), Marital Status (categorical), Taxable Income (continuous); class: Cheat. The splitting attributes form the model, a decision tree:

Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Single, Divorced → TaxInc?
   │    ├─ < 80K → NO
   │    └─ ≥ 80K → YES
   └─ Married → NO
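Read as a program, this tree is just nested conditionals. A sketch, assuming the record is a dict with hypothetical keys Refund, MarSt, and TaxInc:

```python
def classify(record):
    """Walk the example tree above; returns the predicted Cheat class."""
    if record["Refund"] == "Yes":
        return "NO"
    # Refund == "No": next test marital status.
    if record["MarSt"] == "Married":
        return "NO"
    # Single or Divorced: next test taxable income.
    if record["TaxInc"] < 80_000:
        return "NO"
    return "YES"
```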

Another Example of a Decision Tree
A second tree over the same attributes (categorical: MarSt, Refund; continuous: TaxInc; class: Cheat):

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
   ├─ Yes → NO
   └─ No → TaxInc?
        ├─ < 80K → NO
        └─ ≥ 80K → YES

There can be more than one tree that fits the same data!

Decision Tree Classification Task
(Figure: the classification-task diagram from earlier, with a decision tree as the induced model.)

Apply Model to Test Data
• Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value.
• Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
  – Refund = No → take the No branch to the Marital Status node.
  – Marital Status = Married → reach a leaf.
  – Assign Cheat to “No”.
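Tracing that walk with the classify sketch from the earlier slide:

```python
# Refund = No, then MarSt = Married, lands on a leaf: Cheat = "NO".
record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
print(classify(record))  # -> NO
```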

Decision Tree Classification Task
(Figure: the classification-task diagram again, returning from model application to how the tree is learned.)

Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT

General Structure of Hunt's Algorithm
• Let D_t be the set of training records that reach node t.
• General procedure:
  – If D_t contains only records of the same class y_t, then t is a leaf node labeled as y_t.
  – Else, use an attribute test to split the data into smaller subsets.
• Recursively apply the procedure to each subset.
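A compact sketch of this recursion, assuming records are (attributes, label) pairs and a hypothetical choose_test helper that picks the attribute test (or returns None when no useful split exists):

```python
from collections import Counter

def hunt(records, choose_test):
    """Grow a tree from (attrs, label) pairs. choose_test(records) returns a
    function mapping attrs to a branch key, or None if no split helps."""
    labels = [label for _, label in records]
    # Base case: D_t is pure -> leaf node labeled with that class y_t.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    test = choose_test(records)
    if test is None:  # no useful split: fall back to the majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Partition D_t by the test outcome and recurse on each subset.
    branches = {}
    for attrs, label in records:
        branches.setdefault(test(attrs), []).append((attrs, label))
    return {"test": test,
            "children": {k: hunt(subset, choose_test)
                         for k, subset in branches.items()}}
```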

Hunt's Algorithm
Growing the tree on the tax-cheating data, one split at a time:
1. Start with a single leaf predicting the majority class: Don't Cheat.
2. Split on Refund: Yes → Don't Cheat; No → Don't Cheat (still impure).
3. Refine the Refund = No branch by splitting on Marital Status: Married → Don't Cheat; Single, Divorced → still impure.
4. Refine the Single, Divorced branch by splitting on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a local criterion.
• Issues:
  – Determine how to split the records:
    · How to specify the attribute test condition?
    · How to determine the best split?
  – Determine when to stop splitting

How to Specify the Test Condition?
• Depends on the attribute type:
  – Nominal (no order; e.g., country)
  – Ordinal (discrete, ordered; e.g., S, M, L, XL)
  – Continuous (ordered, continuous; e.g., temperature)
• Depends on the number of ways to split:
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g. CarType → Family | Sports | Luxury.
• Binary split: divide the values into two subsets; need to find the optimal partitioning, e.g. CarType → {Family, Luxury} | {Sports}, or CarType → {Sports, Luxury} | {Family}.
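For a nominal attribute with k distinct values there are 2^(k-1) - 1 candidate binary splits. A small sketch that enumerates them, using the slide's CarType values:

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each way to divide a set of nominal values into two non-empty
    subsets (each unordered pair is produced exactly once)."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]   # fixing one value on the left
    for r in range(len(rest) + 1):         # avoids double-counting pairs
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(values) - left
            if right:  # skip the trivial split with an empty side
                yield left, right

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# 3 splits for k = 3, matching 2**(3-1) - 1
```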

Splitting Based on Continuous Attributes
• Different ways of handling:
  – Discretization to form an ordinal categorical attribute:
    · Static: discretize once at the beginning.
    · Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  – Binary decision: (A < v) or (A ≥ v):
    · Consider all possible splits and find the best cut.
    · Can be more compute-intensive.
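A sketch of the exhaustive cut search for the binary-decision case, assuming Gini impurity (defined a few slides later) as the local criterion and made-up income/label data:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cut(values, labels):
    """Try the midpoint between each pair of adjacent sorted values as the
    threshold v for (A < v) vs (A >= v); return the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut possible between identical values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = min(best, (score, v))
    return best  # (weighted Gini, threshold)

# Made-up taxable incomes and cheat labels; the best cut lands at 80.
print(best_cut([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "No"]))
```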

Splitting Based on Continuous Attributes
(Figure: a binary split, e.g. Taxable Income > 80K?, versus a multi-way split into income ranges.)

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a local criterion.
• Issues:
  – Determine how to split the records:
    · How to specify the attribute test condition?
    · How to determine the best split?
  – Determine when to stop splitting

How to Determine the Best Split
• Before splitting: 10 records of class 0 and 10 records of class 1.
• Which test condition is the best?

How to Determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – Non-homogeneous class distribution → high degree of impurity.
  – Homogeneous class distribution → low degree of impurity.

Measures of Node Impurity
• Gini index
• Entropy
• Misclassification error

How to Find the Best Split
• Before splitting, the parent node has impurity M0.
• A candidate split on attribute A produces nodes N1 and N2 with impurities M1 and M2, combining (weighted by node size) into M12.
• A candidate split on attribute B produces nodes N3 and N4 with impurities M3 and M4, combining into M34.
• Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.

Measure of Impurity: GINI
• Gini index for a given node t:

    GINI(t) = 1 – Σ_j [p(j | t)]²

  where p(j | t) is the relative frequency of class j at node t.
• Maximum (1 – 1/n_c, for n_c classes) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples of Computing GINI
• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 – 0² – 1² = 0
• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 – (1/6)² – (5/6)² = 0.278
• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 – (2/6)² – (4/6)² = 0.444
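A sketch that reproduces these numbers from raw class counts:

```python
def gini_from_counts(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, with p taken from raw class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini_from_counts([0, 6]), 3))  # 0.0
print(round(gini_from_counts([1, 5]), 3))  # 0.278
print(round(gini_from_counts([2, 4]), 3))  # 0.444
```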

Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i and n = number of records at node p.

Binary Attributes: Computing the GINI Index
• Splits into two partitions.
• Effect of weighting partitions: larger and purer partitions are sought.
• Example: a split on B? sends 7 records (C1 = 5, C2 = 2) to node N1 and 5 records (C1 = 1, C2 = 4) to node N2:
  – Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408
  – Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320
  – Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
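A sketch verifying the weighted computation, with the class counts at N1 and N2 taken from the example:

```python
def gini_from_counts(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2 = [5, 2], [1, 4]        # class counts reaching N1 and N2
total = sum(n1) + sum(n2)      # 12 records at the parent node
g1 = gini_from_counts(n1)      # 1 - (5/7)^2 - (2/7)^2 = 0.408
g2 = gini_from_counts(n2)      # 1 - (1/5)^2 - (4/5)^2 = 0.320
weighted = (sum(n1) * g1 + sum(n2) * g2) / total
print(round(g1, 3), round(g2, 3), round(weighted, 3))  # 0.408 0.32 0.371
```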

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a local criterion.
• Issues:
  – Determine how to split the records:
    · How to specify the attribute test condition?
    · How to determine the best split?
  – Determine when to stop splitting

Stopping Criteria for Tree Induction
• Stop expanding a node when all of its records belong to the same class.
• Stop expanding a node when all of its records have similar attribute values.
• Early termination (to be discussed later).

Decision Tree Based Classification
• Advantages:
  – Inexpensive to construct.
  – Extremely fast at classifying unknown records.
  – Easy to interpret for small-sized trees.
  – Accuracy comparable to other classification techniques for many simple data sets.

Practical Issues of Classification
• Underfitting and overfitting
• Missing values
• Costs of classification

Underfitting and Overfitting
(Figure: training and test error versus model complexity, with the underfitting and overfitting regions marked.)
• Underfitting: when the model is too simple, both training and test errors are large.

Overfitting Due to Noise
• The decision boundary is distorted by a noise point.

Overfitting Due to Insufficient Examples
• A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region.
• The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary.
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
• Need new ways of estimating errors.

How to Address Overfitting
• Pre-pruning (early stopping rule):
  – Stop the algorithm before it grows a fully-grown tree.
  – Typical conditions for stopping at a node:
    · Stop if the number of instances is less than some user-specified threshold.
    · Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test; see the sketch below).
    · Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
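A sketch of the χ²-based stopping check, assuming SciPy and a made-up contingency table of class counts per branch of a candidate split:

```python
from scipy.stats import chi2_contingency

# Rows: branches of the candidate split; columns: class counts per branch.
table = [[20, 5],   # branch "Yes": 20 of class A, 5 of class B
         [18, 7]]   # branch "No":  18 of class A, 7 of class B

chi2, p_value, dof, _ = chi2_contingency(table)
# A large p-value means the class distribution looks independent of the
# feature, so pre-pruning would stop expanding this node.
if p_value > 0.05:
    print("stop splitting (p = %.3f)" % p_value)
```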

How to Address Overfitting…
• Post-pruning:
  – Grow the decision tree to its entirety.
  – Trim the nodes of the decision tree in a bottom-up fashion.
  – If the generalization error improves after trimming, replace the sub-tree with a leaf node.
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree.
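A sketch of the bottom-up idea, reusing the dict-shaped tree from the Hunt's-algorithm sketch and assuming an error(tree, records) helper that estimates generalization error (e.g., on a validation set):

```python
from collections import Counter

def prune(tree, records, error):
    """Bottom-up post-pruning. tree is {"leaf": label} or
    {"test": f, "children": {...}}; records are (attrs, label) pairs."""
    if "leaf" in tree:
        return tree
    # First prune the subtrees, routing records down each branch.
    for key, child in tree["children"].items():
        subset = [(a, l) for a, l in records if tree["test"](a) == key]
        tree["children"][key] = prune(child, subset, error)
    # Then compare the subtree against collapsing it to a majority-class leaf.
    labels = [l for _, l in records]
    if not labels:
        return tree
    leaf = {"leaf": Counter(labels).most_common(1)[0][0]}
    if error(leaf, records) <= error(tree, records):
        return leaf  # trimming does not hurt the error estimate
    return tree
```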

Model Evaluation
• Metrics for performance evaluation:
  – How to evaluate the performance of a model?
• Methods for performance evaluation:
  – How to obtain reliable estimates?
• Methods for model comparison:
  – How to compare the relative performance among competing models?

Metrics for Performance Evaluation
• Focus on the predictive capability of a model, rather than how fast it classifies or builds models, its scalability, etc.
• Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes        a           b
  CLASS    Class=No         c           d

  a: TP (true positive), b: FN (false negative),
  c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation…
• Most widely-used metric: accuracy.

    Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
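In code, with the four cells named as above (a minimal sketch with made-up counts):

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (a + d) / (a + b + c + d): correct over all predictions."""
    return (tp + tn) / (tp + fn + fp + tn)

print(accuracy(tp=50, fn=10, fp=5, tn=935))  # 985/1000 = 0.985
```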

Limitation of Accuracy
• Consider a 2-class problem:
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%.
  – Accuracy is misleading because the model does not detect any Class 1 example.

Cost-Sensitive Measures
• Precision (p) = a / (a + c) = TP / (TP + FP)
• Recall (r) = a / (a + b) = TP / (TP + FN)
• F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
• Precision is biased towards C(Yes|Yes) and C(Yes|No).
• Recall is biased towards C(Yes|Yes) and C(No|Yes).
• F-measure is biased towards all except C(No|No).
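A sketch computing all three measures from the confusion-matrix cells; note that d (TN) never appears:

```python
def precision_recall_f1(tp, fn, fp):
    """p = TP/(TP+FP), r = TP/(TP+FN), F = 2rp/(r+p) = 2TP/(2TP+FN+FP)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

# Same made-up counts as the accuracy sketch; TN is irrelevant here.
print(precision_recall_f1(tp=50, fn=10, fp=5))
```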

