
1 Evaluating Model Performance Lantz Ch 10, Wk 5, Part 2. (Figure at right: graphing is often used to evaluate results from different variations of an algorithm; depending on how important false positives are, one or another of the points shown might be best.)

2 A truckload of tricks Lantz warns us that different domains prefer different ways of deciding which results are “best”. – This includes many types of summary statistics, graphs for visual decisions, etc. – Our goal: know where to look up and calculate any of these, if someone asks, “Oh, and what’s the precision on that?”

3 Classifiers - Main types of data involved Actual class values, predicted class values, and the estimated probability of the prediction. – If two methods are equally accurate, but one is better able to assess its own certainty, then it is the “smarter” model.

4 Let’s look at ham vs spam again

> sms_results <- read.csv("/Users/chenowet/Documents/Rstuff/sms_results.csv")
> head(sms_results)
  actual_type predict_type prob_spam
1         ham          ham   ...e-07
2         ham          ham   ...e-04
3         ham          ham   ...e-05
4         ham          ham   ...e-04
5        spam         spam   ...e+00
6         ham          ham   ...e-03

This model is very confident of its choices!

5 How about when the model was wrong?

> head(subset(sms_results, actual_type != predict_type))
    actual_type predict_type prob_spam
53         spam          ham       ...
..         spam          ham       ...
..         spam          ham       ...
..         spam          ham       ...
..         spam          ham       ...
..         spam          ham       ...

Classifier knew it had a good chance of being wrong!

6 Confusion matrices tell the tale We can judge an algorithm by the kinds of mistakes it makes, knowing which ones we care most about.
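A confusion matrix for the SMS example can be built directly in base R with table(); a minimal sketch, assuming sms_results has been loaded as on the earlier slide:

# Rows are the actual class, columns are the predicted class
table(sms_results$actual_type, sms_results$predict_type)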

7 E.g., predicting birth defects Which kind of error is worse?

8 A common, global figure from these confusion matrices An overall way to decide the “accuracy” of an algorithm: what percentage of predictions falls on the diagonal of the confusion matrix? The “error rate” is then 1 – accuracy.

9 So, for ham & spam…

> library(gmodels)
> CrossTable(sms_results$actual_type, sms_results$predict_type)

   Cell Contents
|-------------------------|
|                       N |
|-------------------------|

Total Observations in Table:  1390

                        | sms_results$predict_type
sms_results$actual_type |       ham |      spam | Row Total |
------------------------|-----------|-----------|-----------|
                    ham |      1202 |         5 |      1207 |
------------------------|-----------|-----------|-----------|
                   spam |        29 |       154 |       183 |
------------------------|-----------|-----------|-----------|
           Column Total |      1231 |       159 |      1390 |
------------------------|-----------|-----------|-----------|

Accuracy = (1202 + 154) / 1390 = 0.9755
Error rate = 1 - 0.9755 = 0.0245
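The same accuracy and error rate can be computed directly from those four counts; a minimal sketch in R:

TN <- 1202   # actual ham,  predicted ham
FP <- 5      # actual ham,  predicted spam (false positive)
FN <- 29     # actual spam, predicted ham (false negative)
TP <- 154    # actual spam, predicted spam

accuracy   <- (TP + TN) / (TP + TN + FP + FN)   # about 0.9755
error_rate <- 1 - accuracy                      # about 0.0245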

10 “Beyond accuracy” The “kappa” statistic – Adjusts accuracy by accounting for the agreement expected by chance alone. – For our spam & ham it’s about 0.89, vs. an accuracy of 0.9755. – “Good agreement” = 0.60 to 0.80 – “Very good agreement” = 0.80 to 1.00 – These cutoffs are subjective.
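A minimal sketch of the kappa calculation from the same confusion-matrix counts (packages such as vcd and irr also offer kappa functions):

total <- 1390
pr_a  <- (1202 + 154) / total                                  # observed agreement (the accuracy)
pr_e  <- (1207/total)*(1231/total) + (183/total)*(159/total)   # agreement expected by chance
kappa <- (pr_a - pr_e) / (1 - pr_e)                            # about 0.89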

11 Sensitivity and specificity Sensitivity = the “true positive rate” = TP / (TP + FN), and Specificity = the “true negative rate” = TN / (TN + FP).
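A sketch using the caret package's sensitivity() and specificity() helpers on the SMS results (both columns are assumed to be factors):

library(caret)

# Sensitivity = TP / (TP + FN): the proportion of spam that was caught
sensitivity(sms_results$predict_type, sms_results$actual_type, positive = "spam")   # about 0.84

# Specificity = TN / (TN + FP): the proportion of ham that was left alone
specificity(sms_results$predict_type, sms_results$actual_type, negative = "ham")    # about 0.996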

12 For ham and spam…

> library(caret)
> confusionMatrix(sms_results$predict_type, sms_results$actual_type, positive = "spam")
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1202   29
      spam    5  154

               Accuracy : 0.9755
                 95% CI : (0.966, 0.983)
    No Information Rate : 0.8683
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8867
 Mcnemar's Test P-Value : 7.998e-05

            Sensitivity : 0.8415
            Specificity : 0.9959
         Pos Pred Value : 0.9686
         Neg Pred Value : 0.9764
             Prevalence : 0.1317
         Detection Rate : 0.1108
   Detection Prevalence : 0.1144
      Balanced Accuracy : 0.9187

       'Positive' Class : spam

13 Precision and recall Also measures of the compromises made in classification: precision tells how often a positive prediction is correct (how much the results are diluted by meaningless noise), and recall tells how many of the true positives were actually found (the same calculation as sensitivity).
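A sketch with caret: posPredValue() gives precision, and recall is the same calculation as sensitivity():

library(caret)

# Precision = TP / (TP + FP): how many flagged messages really were spam
posPredValue(sms_results$predict_type, sms_results$actual_type, positive = "spam")   # about 0.97

# Recall = TP / (TP + FN): identical to sensitivity
sensitivity(sms_results$predict_type, sms_results$actual_type, positive = "spam")    # about 0.84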

14 The F-measure The “harmonic mean” of precision and recall. Also called the “balanced F-score.” There are also F1 and Fβ variants!
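A minimal sketch of the F-measure, using precision and recall values close to the SMS results above:

prec <- 0.9686
rec  <- 0.8415

# Harmonic mean of precision and recall
f_measure <- (2 * prec * rec) / (prec + rec)   # about 0.90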

15 Visualizing A chance to pick the operating point you want by comparing how the alternatives look. “ROC” curves – Receiver Operating Characteristic – plot the rate of detecting true positives against the rate of false positives. The area under the ROC curve (AUC) ranges from 0.5 (no predictive value) to 1.0 (perfect).
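A sketch of drawing the ROC curve and computing the AUC with the ROCR package (pROC would be an alternative), assuming sms_results is loaded as above:

library(ROCR)

pred <- prediction(predictions = sms_results$prob_spam,
                   labels = sms_results$actual_type)

# True positive rate vs. false positive rate
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for SMS spam filter", col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2)   # diagonal = a classifier with no predictive value

# Area under the ROC curve
unlist(performance(pred, measure = "auc")@y.values)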

16 For SMS data (Figure: the “pretty good” ROC curve for the SMS classifier.)

17 Further tricks Estimating future performance by calculating the “resubstitution error” (error on the training data itself). – Lantz doesn’t put much stock in this. The “holdout method.” – Lantz has used this throughout. – Except the usual advice is to hold out 25% to 50% of the data for testing, and he’s often used much less.
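A sketch of a stratified 75/25 holdout split using caret's createDataPartition(), with the SMS data frame from above standing in for whatever data set is being modeled:

library(caret)

set.seed(123)
in_train  <- createDataPartition(sms_results$actual_type, p = 0.75, list = FALSE)
sms_train <- sms_results[in_train, ]
sms_test  <- sms_results[-in_train, ]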

18 K-fold cross validation Hold out a different 10% of the data each time, 10 different times, then combine (usually average) the results of the 10 tests. “Bootstrap sampling” is similar, but builds each training set by sampling randomly with replacement, so an example can appear more than once.
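A sketch of setting up 10-fold cross-validation with caret's createFolds(); the model-fitting step is only a placeholder:

library(caret)

set.seed(123)
folds <- createFolds(sms_results$actual_type, k = 10)   # a list of 10 vectors of test-row indices

cv_results <- lapply(folds, function(test_idx) {
  sms_train <- sms_results[-test_idx, ]
  sms_test  <- sms_results[test_idx, ]
  # ... train a model on sms_train and evaluate it on sms_test ...
  nrow(sms_test)   # placeholder: just report each fold's size
})
str(cv_results)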