Evaluating Classification Performance

Evaluating Classification Performance – Data Mining

Why Evaluate? Multiple methods are available for classification. For each method, multiple choices are available for settings (e.g., the value of K for K-nearest neighbors, the size of a decision tree). To choose the best model, we need to assess each model's performance.

Basic performance measure: misclassification error. An error is classifying an example as belonging to one class when it actually belongs to another class. The error rate is the percentage of misclassified examples out of the total examples in the validation (or test) data.

Naïve Classification Rule The naïve rule classifies all examples as belonging to the most prevalent class. It is often used as a benchmark: we hope to do better than that. Exception: when the goal is to identify high-value but rare outcomes, a model may be useful even though its overall accuracy is worse than the naïve rule's (see "lift", later).
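As a concrete illustration, here is a minimal sketch of the naïve-rule baseline in Python; the label list and function name are illustrative, not from the slides.

```python
from collections import Counter

def naive_rule_predict(train_labels, n_test):
    """Classify every test example as the most prevalent class in the training labels."""
    most_common_class, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_class] * n_test

# Illustrative usage: 90% of training examples are "0", so everything is predicted "0".
train_labels = ["0"] * 90 + ["1"] * 10
print(naive_rule_predict(train_labels, n_test=5))  # ['0', '0', '0', '0', '0']
```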

Confusion Matrix
201 1's correctly classified as "1": True Positives (TP)
25 0's incorrectly classified as "1": False Positives (FP)
85 1's incorrectly classified as "0": False Negatives (FN)
2689 0's correctly classified as "0": True Negatives (TN)
We use TP to denote True Positives; similarly FP, FN and TN.

Error Rate and Accuracy From the confusion matrix above: TP = 201, FN = 85, FP = 25, TN = 2689. Error rate = (FP + FN)/(TP + FP + FN + TN). Overall error rate = (25+85)/3000 = 3.67%. Accuracy = 1 – error rate = (201+2689)/3000 = 96.33%. With multiple classes, the error rate is (sum of misclassified records)/(total records).
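A small sketch of these computations in Python, using the counts from the confusion matrix above (the function names are illustrative):

```python
def error_rate(tp, fp, fn, tn):
    """Fraction of misclassified examples: (FP + FN) / total."""
    return (fp + fn) / (tp + fp + fn + tn)

def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified examples: 1 - error rate."""
    return (tp + tn) / (tp + fp + fn + tn)

# Counts from the confusion matrix above
tp, fp, fn, tn = 201, 25, 85, 2689
print(f"Error rate: {error_rate(tp, fp, fn, tn):.4f}")  # 0.0367
print(f"Accuracy:   {accuracy(tp, fp, fn, tn):.4f}")    # 0.9633
```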

Cutoff for Classification Many classification algorithms classify via a two-step process. For each record: (1) compute a score or a probability of belonging to class "1"; (2) compare it to a cutoff value and classify accordingly. For example, with Naïve Bayes the default cutoff value is 0.5: if P(y=1|x) >= 0.5, classify as "1"; if P(y=1|x) < 0.5, classify as "0". Different cutoff values can be used; typically, the error rate is lowest for a cutoff of 0.5.

Cutoff Table – example If the cutoff is 0.50, 13 examples are classified as "1". If the cutoff is 0.75, 8 examples are classified as "1". If the cutoff is 0.25, 15 examples are classified as "1".

Confusion Matrix for Different Cutoffs (24 examples)

Cutoff probability = 0.25 (Accuracy = 19/24):
                  Predicted class 0   Predicted class 1
Actual class 1            1                  11
Actual class 0            8                   4

Cutoff probability = 0.5 (not shown): Accuracy = 21/24

Cutoff probability = 0.75 (Accuracy = 18/24):
                  Predicted class 0   Predicted class 1
Actual class 1            5                   7
Actual class 0           11                   1
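The sketch below shows how such confusion matrices can be produced for different cutoffs; the propensities and labels here are made-up placeholders, not the 24 records from the slides.

```python
def confusion_counts(actual, propensity, cutoff):
    """Return (tp, fp, fn, tn) for a given cutoff on the class-1 propensity."""
    tp = fp = fn = tn = 0
    for y, p in zip(actual, propensity):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1:
            if y == 1:
                tp += 1
            else:
                fp += 1
        else:
            if y == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

# Illustrative placeholder data (not the slide's 24 records)
actual     = [1, 1, 0, 1, 0, 0, 1, 0]
propensity = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]

for cutoff in (0.25, 0.5, 0.75):
    tp, fp, fn, tn = confusion_counts(actual, propensity, cutoff)
    acc = (tp + tn) / len(actual)
    print(f"cutoff={cutoff}: TP={tp} FP={fp} FN={fn} TN={tn} accuracy={acc:.3f}")
```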

Lift

When One Class Is More Important In many cases it is more important to identify members of one class: tax fraud, response to a promotional offer, detecting malignant tumors. In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.

Alternate Accuracy Measures

                  Predicted class 0   Predicted class 1
Actual class = 1         FN                  TP
Actual class = 0         TN                  FP

We assume that the important class is 1.
Sensitivity = % of class 1 examples correctly classified = TP / (TP + FN)
Specificity = % of class 0 examples correctly classified = TN / (TN + FP)
True positive rate = % of class 1 examples classified as class 1 = Sensitivity = TP / (TP + FN)
False positive rate = % of class 0 examples classified as class 1 = 1 – Specificity = FP / (TN + FP)
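A minimal sketch of these four measures, continuing the TP/FP/FN/TN notation (the function names are illustrative):

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual class-1 examples classified as 1."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual class-0 examples classified as 0."""
    return tn / (tn + fp)

def false_positive_rate(fp, tn):
    """Share of actual class-0 examples classified as 1 (= 1 - specificity)."""
    return fp / (tn + fp)

# Counts from the earlier confusion matrix
tp, fp, fn, tn = 201, 25, 85, 2689
print(f"Sensitivity (TPR): {sensitivity(tp, fn):.3f}")          # 201/286  ≈ 0.703
print(f"Specificity:       {specificity(tn, fp):.3f}")          # 2689/2714 ≈ 0.991
print(f"FPR:               {false_positive_rate(fp, tn):.3f}")  # 25/2714  ≈ 0.009
```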

ROC Curve Plot the true positive rate versus the false positive rate for various values of the cutoff. The diagonal is the baseline, corresponding to a random classifier. Researchers often use the area under the ROC curve (AUC) as a performance measure; by definition, AUC is between 0 and 1.
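If scikit-learn and matplotlib are available, an ROC curve and its AUC can be computed roughly as follows; the labels and scores are illustrative placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative placeholder data: true labels and class-1 propensities
y_true  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.35, 0.3, 0.25, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier (baseline)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```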

Lift Charts: Goal Useful for assessing performance in terms of identifying the most important class. Helps evaluate, for example, how many tax records to examine, how many loans to grant, or how many customers to mail an offer to.

Lift Charts – Cont. Compare the performance of the data mining model to "no model, pick randomly". Measures the ability of the model to identify the important class, relative to its average prevalence. The charts give an explicit assessment of results over a large number of cutoffs.

Lift Chart – cumulative performance (y-axis: cumulative number of positives identified; x-axis: number of cases examined). For example: after examining 10 cases (x-axis), 9 positive cases (y-axis) have been correctly identified.

Lift Charts: How to Compute Using the model's classification scores, sort examples from most likely to least likely members of the important class. Then compute lift: accumulate the correctly classified "important class" records (y-axis) and compare to the total number of records examined (x-axis).
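A sketch of the cumulative-gains computation described above, with made-up scores and labels; the variable names are illustrative.

```python
import matplotlib.pyplot as plt

# Illustrative placeholder data
actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
score  = [0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.5, 0.3, 0.85, 0.4]

# Sort examples from most to least likely members of the important class
order = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
sorted_actual = [actual[i] for i in order]

# Accumulate correctly identified class-1 records (y-axis) vs. records examined (x-axis)
cumulative_positives = []
total = 0
for y in sorted_actual:
    total += y
    cumulative_positives.append(total)

x = list(range(1, len(sorted_actual) + 1))
plt.plot(x, cumulative_positives, label="model (cumulative positives)")
plt.plot([0, len(x)], [0, sum(actual)], linestyle="--", label="no model (pick randomly)")
plt.xlabel("# cases examined")
plt.ylabel("# positives identified")
plt.legend()
plt.show()
```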

Asymmetric Costs

Misclassification Costs May Differ The cost of making a misclassification error may be higher for one class than for the other(s). Looked at another way, the benefit of making a correct classification may be higher for one class than for the other(s).

Example – Response to Promotional Offer Suppose we send an offer to 1000 people, with a 1% average response rate ("1" = response, "0" = nonresponse). The naïve rule (classify everyone as "0") has an error rate of 1% and accuracy of 99%, which seems good. Using a data mining model we can correctly classify eight 1's as 1's, but it comes at the cost of misclassifying twenty 0's as 1's and two 1's as 0's.

The Confusion Matrix

                  Predicted class 0   Predicted class 1
Actual class 1            2                   8
Actual class 0          970                  20

Error rate = (2+20)/1000 = 2.2% (higher than the naïve rate of 1%)

Introducing Costs & Benefits Suppose the profit from a "1" is $10 and the cost of sending an offer is $1. Under the naïve rule, all are classified as "0", so no offers are sent: no cost, no profit. Under the data mining model's predictions, 28 offers are sent: 8 recipients respond, with a profit of $10 each (+$80); 20 fail to respond, at a cost of $1 each (–$20); 972 receive nothing (no cost, no profit). Net profit = $80 – $20 = $60.
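The arithmetic above can be written as a small helper; the profit and cost values are those assumed on the slide, and the function name is illustrative.

```python
def net_profit(tp, fp, profit_per_response=10.0, cost_per_nonresponder=1.0):
    """Net profit as computed on the slide: each responder reached brings $10,
    each non-responder who was mailed costs $1."""
    return tp * profit_per_response - fp * cost_per_nonresponder

# 8 responders correctly targeted, 20 non-responders mailed
print(net_profit(tp=8, fp=20))  # 60.0
```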

Profit Matrix (per record, using the assumptions above)

                  Predicted class 0   Predicted class 1
Actual class 1           $0                 $10
Actual class 0           $0                 –$1

Lift (Again) Adding costs to the mix, as above, does not change the actual classifications, but it allows us to make a better decision about the threshold. Use the lift curve and change the cutoff value for "1" to maximize profit. For example, with the 24 examples shown earlier, maximum accuracy is at threshold 0.5 (accuracy = 0.875); at threshold 0.75 accuracy = 0.75, and at threshold 0.25 accuracy = 0.791. But if the profit from a TP is $10 and the cost of a FP is $2, we get a different optimal working point: the best total profit is at threshold 0.2 (profit = 112); at threshold 0.5 profit = 106, and at threshold 0.75 profit = 68.
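A sketch of that cutoff sweep, under the same $10-per-TP profit and $2-per-FP cost assumptions as the slide; the scores and labels are illustrative placeholders, not the 24 records.

```python
def profit_at_cutoff(actual, propensity, cutoff, tp_profit=10.0, fp_cost=2.0):
    """Total profit when every record with propensity >= cutoff is classified as '1'."""
    tp = sum(1 for y, p in zip(actual, propensity) if p >= cutoff and y == 1)
    fp = sum(1 for y, p in zip(actual, propensity) if p >= cutoff and y == 0)
    return tp * tp_profit - fp * fp_cost

# Illustrative placeholder data
actual     = [1, 1, 0, 1, 0, 0, 1, 0]
propensity = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]

cutoffs = [0.2, 0.25, 0.5, 0.75]
for c in cutoffs:
    print(f"cutoff={c}: profit={profit_at_cutoff(actual, propensity, c):.0f}")

best = max(cutoffs, key=lambda c: profit_at_cutoff(actual, propensity, c))
print("best cutoff:", best)
```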

Adding Cost/Benefit to the Lift Curve Sort the test examples in descending probability of success. For each case, record the cost/benefit of the actual outcome, and also the cumulative cost/benefit. Plot all records: the x-axis is the index number (1 for the 1st case, n for the nth case), and the y-axis is the cumulative cost/benefit. Draw a reference line from the origin to y_n, where y_n is the total net benefit.
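A minimal sketch of that cumulative cost/benefit curve; the per-outcome benefit values and the data are illustrative assumptions.

```python
from itertools import accumulate
import matplotlib.pyplot as plt

# Illustrative placeholder data: actual outcomes and model scores
actual = [1, 0, 1, 0, 1, 0, 0, 1]
score  = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

# Assumed cost/benefit of each actual outcome (e.g., +$10 per responder, -$1 per non-responder mailed)
benefit = {1: 10.0, 0: -1.0}

# Sort by descending probability of success and accumulate cost/benefit
order = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
cum_benefit = list(accumulate(benefit[actual[i]] for i in order))

x = list(range(1, len(cum_benefit) + 1))
plt.plot(x, cum_benefit, label="cumulative net benefit")
plt.plot([0, len(x)], [0, cum_benefit[-1]], linestyle="--", label="reference line to y_n")
plt.xlabel("case index (sorted by score)")
plt.ylabel("cumulative cost/benefit")
plt.legend()
plt.show()
```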

Lift Curve May Go Negative If the total net benefit from all cases is negative, the reference line will have a negative slope. Nonetheless, the goal is still to use the cutoff to select the point where net benefit is at a maximum.

Negative Slope to Reference Curve (Figure: cumulative net benefit curve with a negatively sloped reference line; zoomed in, the maximum profit is $60.)

Multiple Classes For m classes, the confusion matrix has m rows and m columns. Theoretically, there are m(m–1) misclassification costs, since any case could be misclassified in m–1 ways. Practically, that is too many to work with. In a decision-making context, though, such complexity rarely arises: one class is usually of primary interest.

Classification Using Triage Take into account a gray area in making classification decisions. Instead of classifying as C1 or C0, we classify as C1, C0, or "cannot say". The third category might receive special human review.
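A sketch of triage classification with two cutoffs; the threshold values (0.25 and 0.75) are assumptions chosen for illustration.

```python
def triage_classify(propensity, low=0.25, high=0.75):
    """Classify as C1, C0, or 'cannot say' using two assumed cutoffs."""
    if propensity >= high:
        return "C1"
    if propensity <= low:
        return "C0"
    return "cannot say"  # gray area: route to special human review

for p in (0.9, 0.5, 0.1):
    print(p, "->", triage_classify(p))
```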