Machine Learning Tutorial-2. Recall, Precision, F-measure, Accuracy Ch. 5.

Recall and Precision Recall: The percentage of the total relevant documents in a database retrieved by your search. If you knew that there were 1000 relevant documents in a database and your search retrieved 100 of these relevant documents, your recall would be 10%. Precision: The percentage of relevant documents in relation to the number of documents retrieved. If your search retrieves 100 documents and 20 of these are relevant, your precision is 20%.
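
A minimal sketch in plain Python of how these two percentages are computed, using the numbers from the two examples above:

# Recall example: 1000 relevant documents exist, the search returns 100 of them.
relevant_in_db = 1000
relevant_retrieved = 100
recall = relevant_retrieved / relevant_in_db          # 100 / 1000 = 10%

# Precision example: the search returns 100 documents, 20 of which are relevant.
retrieved = 100
relevant_among_retrieved = 20
precision = relevant_among_retrieved / retrieved      # 20 / 100 = 20%

print(f"recall = {recall:.0%}, precision = {precision:.0%}")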

More Definitions… F-measure: the harmonic mean of precision and recall, F = 2 * Recall * Precision / (Recall + Precision). Fallout: the irrelevant material that your search brings back. If you retrieve 100 documents and only 20 are relevant, the remaining 80% of what you retrieved is junk (region C in the previous slide); more formally, the fallout rate is the fraction of all non-relevant documents that end up being retrieved. Tension between recall and precision: the two are typically inversely related, since retrieving more documents tends to raise recall but lower precision.
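
A small helper that implements the F-measure formula from this slide; the example numbers are the recall and precision from the previous slide and show how the harmonic mean punishes an imbalance between the two:

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# recall = 10%, precision = 20%: the F-measure (~0.133) sits below the
# arithmetic mean (0.15), because the lower of the two values dominates.
print(f_measure(0.20, 0.10))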

According to ML In a binary decision problem, a classifier labels examples as either positive or negative. The results can be summarised in a confusion (contingency) matrix with four entries: TP (true positive), TN (true negative), FP (false positive), FN (false negative).

Confusion Matrix
                      Actual positive   Actual negative
Predicted positive          TP                FP
Predicted negative          FN                TN
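
A sketch in plain Python (with hypothetical labels) that counts the four entries of the confusion matrix from lists of true and predicted labels:

def confusion_counts(y_true, y_pred, positive="positive"):
    # Returns (TP, FP, FN, TN) for a binary problem.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
print(confusion_counts(y_true, y_pred))   # (2, 1, 1, 1)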

Results

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

   TP Rate   FP Rate   Precision   Recall   F-Measure   Class
    0.556     0.6       0.625       0.556     0.588      yes
    0.4       0.444     0.333       0.4       0.364      no

=== Confusion Matrix ===

 a b   <-- classified as
 5 4 | a = yes
 3 2 | b = no
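
The per-class figures follow directly from the confusion matrix above (rows are the actual class, columns the predicted class); a short derivation in Python for the class yes:

tp, fn = 5, 4   # actual yes classified as yes / as no
fp, tn = 3, 2   # actual no classified as yes / as no

recall = tp / (tp + fn)                                      # 5/9  = 0.556 (= TP rate)
fp_rate = fp / (fp + tn)                                     # 3/5  = 0.600
precision = tp / (tp + fp)                                   # 5/8  = 0.625
f_measure = 2 * precision * recall / (precision + recall)    # about 0.588
accuracy = (tp + tn) / (tp + fp + fn + tn)                   # 7/14 = 50 %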

Frank Keller Precision and Recall The error rate is an inadequate measure of the performance of an algorithm: it doesn’t take into account the cost of making wrong decisions. Example: based on a chemical analysis of the water, try to detect an oil slick in the sea.
– False positive: wrongly identifying an oil slick if there is none.
– False negative: failing to identify an oil slick if there is one.
Here, false negatives (environmental disasters) are much more costly than false positives (false alarms). We have to take that into account when we evaluate our model.

Frank Keller Problem Suppose there are 1000 cases, 995 of which are negative and 5 of which are positive. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Is accuracy a good measure for a highly skewed data set?
– Use ROC curves.
– Report false positives and false negatives separately.
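
The same point as a tiny Python sketch, using the numbers from the slide and a classifier that always predicts negative:

y_true = ["neg"] * 995 + ["pos"] * 5
y_pred = ["neg"] * 1000              # classify everything as negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_pos = sum(t == "pos" and p == "pos" for t, p in zip(y_true, y_pred)) / 5

print(accuracy)      # 0.995 -- looks excellent
print(recall_pos)    # 0.0   -- every positive case is missed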

Frank Keller Training and Test Set For classification problems, we measure the performance of a model in terms of its error rate: the percentage of incorrectly classified instances in the data set. We build a model because we want to use it to classify new data. Hence we are chiefly interested in model performance on new (unseen) data. The resubstitution error (the error rate on the training set) is a bad predictor of performance on new data: the model was built to account for the training data, so it might overfit it, i.e., not generalize to unseen data.

Frank Keller Evaluation and Data Is the model able to generalize? Can it deal with unseen data, or does it overfit the data? Test on held-out data:
– split the data to be modeled into a training set and a test set;
– train the model (determine its parameters) on the training set;
– apply the model to the training set and compute the model fit;
– apply the model to the test set and compute the model fit;
– the difference between the model fit on the training and the test data measures the model’s ability to generalize.

Frank Keller Holdout If a lot of data are available, simply take two independent samples and use one for training and one for testing. The more training data, the better the model. The more test data, the more accurate the error estimate. Problem: obtaining data is often expensive and time consuming. Solution: obtain a limited data set and use a holdout procedure. Most straightforward: a random split into test and training set, typically with between 1/3 and 1/10 of the data held out for testing.
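
A minimal holdout sketch with scikit-learn (assumed available); the iris data stands in for whatever data set is being modeled, and one third is held out for testing:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy on the training set:", model.score(X_train, y_train))   # optimistic
print("accuracy on the test set:    ", model.score(X_test, y_test))     # the honest estimate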

Frank Keller Stratification Problem: the split into training and test set might be unrepresentative, e.g., a certain class is not represented in the training set, so the model will not learn to classify it. Solution: use a stratified holdout, i.e., sample in such a way that each class is represented in both sets. Example: a data set with two classes A and B. Aim: construct a 10% test set. Take a 10% sample of all instances of class A plus a 10% sample of all instances of class B. However, this procedure doesn’t work well on small data sets.
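
With scikit-learn (assumed available), a stratified holdout is the same random split with the class labels passed as the stratification key:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y samples each class separately, so the 10% test set contains
# roughly 10% of every class instead of a possibly lopsided random mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)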

Frank Keller Crossvalidation Solution: k-fold crossvalidation maximizes the use of the data.
– Divide the data randomly into k folds (subsets) of equal size.
– Train the model on k−1 folds, use one fold for testing.
– Repeat this process k times so that all folds are used for testing.
– Compute the average performance on the k test sets.
This effectively uses all the data for both training and testing. Typically k = 10 is used. Sometimes stratified k-fold crossvalidation is used.
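
A 10-fold crossvalidation sketch with scikit-learn (assumed available); for classifiers, passing cv=10 uses stratified folds, which corresponds to the stratified variant mentioned above:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("average accuracy: ", scores.mean())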

Frank Keller Crossvalidation Example: a data set with 20 instances (d1, …, d20), 5-fold crossvalidation. The data is divided into five folds of four instances each; in every round a different fold serves as the test set and the remaining 16 instances form the training set. Compute the error rate for each fold, then compute the average error rate.
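
The same 20-instance, 5-fold split, sketched with scikit-learn's KFold (assumed available) so the fold membership can be printed:

from sklearn.model_selection import KFold

instances = [f"d{i}" for i in range(1, 21)]          # d1 ... d20
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(instances), 1):
    print(f"fold {fold}: test set = {[instances[i] for i in test_idx]}")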

Frank Keller Leave-one-out crossvalidation Leave-one-out crossvalidation is simply k-fold crossvalidation with k set to n, the number of instances in the data set. The test set consists of only a single instance, which will be classified either correctly or incorrectly. Advantages: maximal use of training data, i.e., training on n−1 instances; the procedure is deterministic, no sampling involved. Disadvantages: infeasible for large data sets, since the large number of training runs required carries a high computational cost; cannot be stratified (the test set contains only one class).
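
Leave-one-out crossvalidation with scikit-learn (assumed available); on the iris data this means n = 150 training runs, each tested on a single held-out instance:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
# Each score is 1 (correct) or 0 (incorrect); the mean is the overall accuracy.
print(scores.mean())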

Frank Keller Comparing Algorithms Assume you want to compare the performance of two machine learning algorithms A and B on the same data set. You could use crossvalidation to determine the error rates of A and B and then compute the difference. Problem: sampling is involved (in getting the data and in the crossvalidation), hence there is variance in the error rates. Solution: determine whether the difference between the error rates is statistically significant. If the crossvalidations of A and B use the same random division of the data (the same folds), then a paired t-test is appropriate.

Frank Keller Comparing Algorithms Let the k samples of the error rate of algorithm A be denoted by x_1, …, x_k and the k samples of the error rate of B by y_1, …, y_k. Then the statistic for a paired t-test is

t = d̄ / sqrt(σ_d² / k)

where d̄ is the mean of the differences d_i = x_i − y_i and σ_d is the standard deviation of these differences.
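
A paired t-test sketch with SciPy (assumed available); the per-fold error rates below are illustrative numbers, not results from the slides:

from scipy import stats

x = [0.22, 0.25, 0.19, 0.24, 0.26, 0.23, 0.21, 0.25, 0.20, 0.24]   # algorithm A
y = [0.24, 0.27, 0.22, 0.25, 0.29, 0.24, 0.23, 0.28, 0.22, 0.26]   # algorithm B, same folds

t, p = stats.ttest_rel(x, y)   # paired t-test on the differences x_i - y_i
print(t, p)                    # a small p-value means the difference is significant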

Frank Keller Comparing against a Baseline An error rate in itself is not very meaningful. We have to take into account how hard the problem is. This means comparing against a baseline model and showing that our model performs significantly better than the baseline. The simplest model is the chance baseline, which assigns a classification randomly. Example: if we have two classes A and B in our data, and we classify each instance randomly as either A or B, then we will get 50% right just by chance (in the limit). In the general case, the error rate for the chance baseline is 1− 1/n for an n-way classification.

Frank Keller Comparing against a Baseline Problem: a chance baseline is not useful if the distribution of the data is skewed. We need to compare against a frequency baseline instead. A frequency baseline always assigns the most frequent class. Its error rate is 1 − f_max, where f_max is the percentage of instances in the data that belong to the most frequent class. Example: determining when a cow is in oestrus (classes: yes, no). Chance baseline: 50% error, frequency baseline: 3% error (the cow is in oestrus on 1 out of 30 days). More realistic example: assigning part-of-speech tags to English text. A tagger that assigns to each word its most frequent tag gets a 20% error rate; the current state of the art is 4%.
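
Both baselines can be sketched with scikit-learn's DummyClassifier (assumed available): strategy="uniform" is the chance baseline and strategy="most_frequent" is the frequency baseline:

from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

chance = DummyClassifier(strategy="uniform", random_state=0).fit(X_train, y_train)
frequency = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

print("chance baseline error:   ", 1 - chance.score(X_test, y_test))      # about 1 - 1/n
print("frequency baseline error:", 1 - frequency.score(X_test, y_test))   # 1 - f_max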

Frank Keller Summary No matter how the performance of the model is measured (accuracy, precision, recall), we always need to measure it on the test set, not on the training set. Performance on the training set only tells us that the model learns what it’s supposed to learn; it is not a good indicator of performance on unseen data. The test set can be obtained using an independent sample or holdout techniques (crossvalidation, leave-one-out). To meaningfully compare the performance of two algorithms on a given type of data, we need to test whether a difference in performance is significant. We also need to compare performance against a baseline (chance or frequency).