M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Classification: Evaluation February 23,


1 M. Sulaiman Khan (mskhan@liv.ac.uk)‏ Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Classification: Evaluation February 23, 2009 Slide 1 COMP527: Data Mining

2 COMP527: Data Mining
- Introduction to the Course
- Introduction to Data Mining
- Introduction to Text Mining
- General Data Mining Issues
- Data Warehousing
- Classification: Challenges, Basics
- Classification: Rules
- Classification: Trees
- Classification: Trees 2
- Classification: Bayes
- Classification: Neural Networks
- Classification: SVM
- Classification: Evaluation
- Classification: Evaluation 2
- Regression, Prediction
- Input Preprocessing
- Attribute Selection
- Association Rule Mining
- ARM: A Priori and Data Structures
- ARM: Improvements
- ARM: Advanced Techniques
- Clustering: Challenges, Basics
- Clustering: Improvements
- Clustering: Advanced Algorithms
- Hybrid Approaches
- Graph Mining, Web Mining
- Text Mining: Challenges, Basics
- Text Mining: Text-as-Data
- Text Mining: Text-as-Language
- Revision for Exam

3 Today's Topics
- Evaluation
- Samples
- Cross Validation
- Bootstrap
- Confidence of Accuracy

4 Evaluation
We need some way to quantitatively evaluate the results of data mining.
- Just how accurate is the classification?
- How accurate can we expect a classifier to be?
- If we can't evaluate the classifier, how can it be improved?
- Can different types of classifier be evaluated in the same way?
- What are useful criteria for such a comparison?
- How can we evaluate clusters or association rules?
There are lots of issues to do with evaluation.

5 Evaluation
Assuming classification, the basic evaluation is how many correct predictions the classifier makes as opposed to incorrect predictions.
We can't test on the data used for training the classifier and get an accurate result. The result is "hopelessly optimistic" (Witten).
Eg: Due to over-fitting, a classifier might get 100% accuracy on the data it was trained from and 0% accuracy on other data.
The error rate measured this way is called the resubstitution error rate: the error rate when you substitute the training data back into the classifier generated from it.
So we need some new, but labeled, data to test on.

6 Validation
Most of the time we do not have enough data to have a lot for training and a lot for testing, though sometimes this is possible (eg sales data).
Some systems have two phases of training: an initial learning period and then fine tuning. For example, the growing and pruning sets for building trees. The data held back for the second phase is the validation set.
It's important not to test on the validation set either, since it has also been used to build the classifier.
Note that this reduces the amount of data that you can actually train on by a significant amount.

7 Numeric Data, Multiple Classes
Further issues to consider:
- Some classifiers produce probabilities for one or more classes. We need some way to handle the probabilities, so that a classifier can be partly correct. Also, for multi-label problems (ie an instance has 2 or more classes) we need some 'cost' function for getting an accurate subset of the classes.
- Regression/Numeric Prediction produces a numeric value. We need statistical tests to determine how accurate this is, rather than true/false as for nominal classes.

8 Hold Out Method
Obvious answer: Keep part of the data set aside for testing purposes and use the rest to train the classifier. Then use the test set to evaluate the resulting classifier in terms of accuracy.
Accuracy: Number of correctly classified instances / total number of instances to classify.
The ratio is often 2/3rds training, 1/3rd test.
How should we select the instances for each section?
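The hold-out split and the accuracy ratio above can be sketched in a few lines of Python. This is only an illustration: the function names, the fixed seed, and the toy data are assumptions, not part of the lecture.

```python
import random

def hold_out_split(instances, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Number of correctly classified instances / total number classified."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical toy data: (feature, label) pairs.
data = [(x, "pos" if x >= 50 else "neg") for x in range(90)]
train, test = hold_out_split(data)
```

With 90 instances and the default 2/3 ratio this gives 60 training and 30 test instances, selected purely at random.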

9 Samples
Easy: Randomly select instances.
But the data could be very unbalanced, eg 99% one class, 1% the other class. Then a small random sample may draw few or none of the 1% class.
Stratified: Group the instances by class and then select a proportionate number from each class.
Balanced: Randomly select a desired amount of minority class instances, and then add the same number from the majority class.

10 Samples
Stratified: Group the instances by class and then select a proportionate number from each class.

11 Samples
Balanced: Randomly select a desired amount of minority class instances, and then add the same number from the majority class.
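A minimal sketch of stratified and balanced sampling as described on the Samples slides, assuming instances are (features, label) pairs. The helper names and the 990/10 toy data set are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def group_by_class(instances):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for features, label in instances:
        groups[label].append((features, label))
    return groups

def stratified_sample(instances, fraction, seed=0):
    """Draw the same proportion of instances from every class."""
    rng = random.Random(seed)
    sample = []
    for members in group_by_class(instances).values():
        n = max(1, round(len(members) * fraction))  # keep at least one per class
        sample.extend(rng.sample(members, n))
    return sample

def balanced_sample(instances, per_class, seed=0):
    """Draw the same number of instances from every class.

    per_class must not exceed the size of the smallest class.
    """
    rng = random.Random(seed)
    sample = []
    for members in group_by_class(instances).values():
        sample.extend(rng.sample(members, per_class))
    return sample

# Hypothetical unbalanced data: 99% "common", 1% "rare".
data = [(i, "common") for i in range(990)] + [(i, "rare") for i in range(10)]
strat_counts = Counter(label for _, label in stratified_sample(data, 0.1))
bal_counts = Counter(label for _, label in balanced_sample(data, 10))
```

The stratified sample keeps the 99:1 class ratio, while the balanced sample contains equal numbers of both classes.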

12 Small Data Sets
For small data sets, removing some instances as a test set and still having a representative set to train from is hard. Solutions?
Repeat the process multiple times, selecting a different test set each time. Then find the error from each, and average across all of the iterations.
Of course there's no reason to do this only for small data sets!
Different test sets might still overlap, which might give a biased estimate of the accuracy (eg if it randomly selects good records multiple times).
Can we prevent this?

13 Cross Validation
Split the dataset up into k parts, then use each part in turn as the test set and the others as the training set.
If the parts are also stratified, we have stratified cross validation, rather than perhaps ending up with a non-representative sample in one or more parts.
Common values for k are 3 (comparable to the 2/3 to 1/3 hold out split) and 10.
Hence: stratified 10-fold cross validation.
Again, the error values are averaged after the k iterations.
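The k-fold procedure can be sketched as follows. This is a plain (non-stratified) version, and the majority-class "classifier" is a hypothetical stand-in used only to make the example runnable; any training and error function with the same shape could be passed in.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(instances, k, train_fn, error_fn, seed=0):
    """Use each fold in turn as the test set; return the mean error rate."""
    errors = []
    for test_idx in k_fold_indices(len(instances), k, seed):
        held_out = set(test_idx)
        train = [inst for i, inst in enumerate(instances) if i not in held_out]
        test = [instances[i] for i in test_idx]
        model = train_fn(train)
        errors.append(error_fn(model, test))
    return sum(errors) / k

# Hypothetical stand-in classifier: always predict the most common training label.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def error_rate(model, test):
    return sum(label != model for _, label in test) / len(test)

data = [(i, "a") for i in range(80)] + [(i, "b") for i in range(20)]
mean_error = cross_validate(data, 10, train_majority, error_rate)
```

Because every instance lands in exactly one fold, each instance is tested exactly once, which avoids the overlapping test sets of repeated hold out.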

14 Cross Validation
Why 10? Extensive testing shows it to be a good middle ground: not too much processing, not too random.
Cross validation is used extensively throughout the data mining literature. It's the simplest and easiest to understand evaluation technique, while giving a good estimate of accuracy.
There are other similar evaluation techniques, however...

15 Leave One Out
Select one instance and train on all others. Then see if the instance is correctly classified. Repeat for every instance and find the percentage of accurate results.
Ie: N-fold cross validation, where N is the number of instances in the data set.
Attractive:
- If 10 is good, surely N is better :)
- No random sampling problems
- Trains with the largest possible amount of data

16 Leave One Out
Disadvantages:
- Computationally expensive: builds N models!
- Guarantees a non-stratified, non-balanced sample, because each test set contains a single instance.
Worst case: the class distribution is exactly 50/50 and the data is so complicated that the classifier simply picks the most common class. Removing the test instance makes its class the minority in the training set, so the classifier will always pick the wrong class.
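The worst case is easy to reproduce. The sketch below assumes a "classifier" that ignores the features entirely and predicts the most common training label; the function name and the yes/no labels are illustrative.

```python
from collections import Counter

def leave_one_out_accuracy(labels):
    """Leave-one-out accuracy of a classifier that predicts the majority class."""
    correct = 0
    for i, held_out in enumerate(labels):
        rest = labels[:i] + labels[i + 1:]  # train on every other instance
        prediction = Counter(rest).most_common(1)[0][0]
        correct += prediction == held_out
    return correct / len(labels)

labels = ["yes"] * 50 + ["no"] * 50
print(leave_one_out_accuracy(labels))  # prints 0.0
```

Each held-out instance leaves 49 of its own class against 50 of the other, so the majority vote is wrong every time, even though a 50/50 guess would be right about half the time.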

17 Bootstrap
Until now, the sampling has been without replacement (ie each instance occurs once, either in the training or the test set).
However, we could put back an instance to be drawn again: sampling with replacement.
This results in the 0.632 bootstrap evaluation technique.
Draw a training set from the data set with replacement such that the number of instances in both is the same, then use the instances which are not in the training set as the test set. (So some instances will appear more than once in the training set.)
Statistically, the likelihood of an instance not being picked is about 0.368, so the training set contains about 63.2% of the distinct instances, hence the name.
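The 0.368 figure follows from a short calculation, sketched here for a data set of n instances sampled n times with replacement:

```latex
% Probability that a given instance is missed by one draw:  1 - 1/n
% Probability that it is missed by all n independent draws:
\[
  P(\text{never picked}) = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow{\,n \to \infty\,}\; e^{-1} \approx 0.368
\]
% So the expected fraction of distinct instances in the training set is
\[
  1 - e^{-1} \approx 0.632
\]
```

The limit is already a good approximation for moderate n: for n = 1000 the exact value is about 0.3677.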

18 Bootstrap
Eg: We have a dataset of 1000 instances.
We sample with replacement 1000 times, ie we randomly select an instance from all 1000 instances, 1000 times.
This should leave us with approximately 368 instances that have not been selected. We remove these and use them for the test set.
The error rate will be pessimistic: we are only training on about 63% of the data, with some repeated instances.
We compensate by combining it with the optimistic error rate from resubstitution:
error rate = 0.632 * error-on-test + 0.368 * error-on-training
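A minimal sketch of one bootstrap round on 1000 instances, working with instance indices rather than real data. The function names and the seed are assumptions for illustration.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n training indices with replacement; unpicked indices form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    picked = set(train_idx)
    test_idx = [i for i in range(n) if i not in picked]
    return train_idx, test_idx

def bootstrap_632_error(test_error, training_error):
    """Combine the pessimistic test error with the optimistic resubstitution error."""
    return 0.632 * test_error + 0.368 * training_error

train_idx, test_idx = bootstrap_split(1000)
print(len(set(train_idx)), len(test_idx))  # distinct training instances, test instances
```

The test set size varies from run to run but should stay close to 368. In practice the whole procedure is usually repeated with different samples and the combined error rates averaged.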

19 Confidence of Accuracy
What about the size of the test set? More test instances should make us more confident that the accuracy predicted is close to the true accuracy. Eg getting 75% on 10,000 samples is more likely to be close to the true accuracy than 75% on 10.
A series of events that succeed or fail is a Bernoulli process, eg coin tosses. We can count S successes from N trials, and then compute S/N... but what does that tell us about the true accuracy rate?
Statistics can then tell us the range within which the true accuracy rate should fall. Eg: with 750/1000 correct, the true accuracy is very likely to be between 73.2% and 76.7%. (Witten 147 to 149 has the full maths!)
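One common way to compute such a range is the Wilson score interval for a Bernoulli success rate. The sketch below assumes an 80% confidence level, which is not stated on the slide but reproduces its 73.2% to 76.7% figures; the function name is illustrative.

```python
import math
from statistics import NormalDist

def success_rate_interval(successes, trials, confidence=0.80):
    """Wilson score interval for the true success rate of a Bernoulli process."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    f = successes / trials  # observed success rate S/N
    centre = f + z * z / (2 * trials)
    spread = z * math.sqrt(f * (1 - f) / trials + z * z / (4 * trials ** 2))
    scale = 1 + z * z / trials
    return (centre - spread) / scale, (centre + spread) / scale

low, high = success_rate_interval(750, 1000)
print(f"{low:.1%} to {high:.1%}")
```

Calling it with the same success rate on fewer trials (eg 75 out of 100) gives a visibly wider interval, which is the point of the slide: the size of the test set determines how much the observed accuracy can be trusted.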

20 Confidence of Accuracy
We might wish to compare two classifiers of different types. We could compare the accuracy of 10-fold cross validation (TCV), but there's another method: Student's T-Test.
Method:
- We perform cross validation 10 times, ie 10 times TCV = 100 models
- Perform the same repeated TCV with the second classifier
- This gives us x1..x10 for the first, and y1..y10 for the second.
- Find the mean of the 10 cross-validation runs for each.
- Find the difference between the two means.
We want to know if the difference is statistically significant.

21 Student's T-Test
We then find 't' by:
$$ t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}} $$
where $\bar{d}$ is the difference between the means, k is the number of times the cross validation was performed, and $\sigma_d^2$ is the variance of the differences between the paired samples (variance = sum of squared differences between each difference and their mean, divided by k-1).
Then look up the value on the table for k-1 degrees of freedom. (More tables! But printed in Witten pg 155.)
If t is greater than z on the table, then the difference is statistically significant.
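The paired t statistic can be computed directly from the two lists of cross-validation results. In this sketch the accuracy figures are made up purely for illustration, and 2.262 is the standard two-tailed 5% critical value of Student's t distribution for 9 degrees of freedom; the significance level itself is an assumption, since the slide does not fix one.

```python
import math
from statistics import mean, variance

def paired_t_statistic(xs, ys):
    """t = mean difference / sqrt(variance of the differences / k)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    k = len(diffs)
    d_bar = mean(diffs)       # equals mean(xs) - mean(ys)
    var_d = variance(diffs)   # sample variance, divides by k - 1
    return d_bar / math.sqrt(var_d / k)

# Made-up accuracies from 10 repeated cross-validation runs of two classifiers.
xs = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79]
ys = [0.78, 0.77, 0.80, 0.79, 0.79, 0.77, 0.80, 0.78, 0.79, 0.78]

t = paired_t_statistic(xs, ys)
critical = 2.262  # two-tailed 5% level, k - 1 = 9 degrees of freedom
significant = abs(t) > critical
```

The absolute value is used so that the test does not depend on which classifier is listed first.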

22 Further Reading
- Introductory statistical text books, again
- Witten, 5.1-5.4
- Han, 6.2, 6.12, 6.13
- Berry and Browne, 1.4
- Devijver and Kittler, Chapter 10

