
1 Data Analytics CMIS Short Course part II Day 1 Part 4: ROC Curves Sam Buttrey December 2015

2 Assessing a Classifier
In data sets with very few “bads,” the “naïve” model that says “everyone is good” is highly accurate
– It never pays to predict “bad”
How can we decide whether a model is getting probabilities right, or compare two models?
How can we put our model to use?
– What if we don’t like the 0.5 threshold for deciding which cases are predicted good?
One answer: the ROC curve
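To make the point concrete, here is a small illustrative R sketch (the data and numbers are made up, not from the course): with 1,000 cases of which only 20 are “bad,” the everyone-is-good model is 98% accurate, and so is any model whose scores never cross the 0.5 threshold.

# Hypothetical example: 1000 cases, only 20 "bads"
y <- c(rep("good", 980), rep("bad", 20))
naive_pred <- rep("good", 1000)          # the "everyone is good" model
mean(naive_pred == y)                    # accuracy = 0.98

# Thresholding predicted probabilities of "bad" at 0.5
p_bad <- runif(1000, 0, 0.3)             # a model whose scores never reach 0.5
pred <- ifelse(p_bad >= 0.5, "bad", "good")
mean(pred == y)                          # still 0.98: it never pays to predict "bad"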

3 ROC Plot

4 2x2 Confusion Matrix
                    Predicted
                    Pos   Neg
  Observed   Pos     a     b
             Neg     c     d
Sensitivity: true positive rate, a/(a+b); false negative rate, b/(a+b)
Specificity: true negative rate, d/(c+d); false positive rate, c/(c+d)
If the threshold t is large, few are predicted positive, so sensitivity is small and specificity is high.
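A minimal R sketch of these definitions, using made-up counts a, b, c, d (chosen so the rates match the 73% / 30% example on the ROC figure below):

# Hypothetical counts from a 2x2 confusion matrix (rows = observed, cols = predicted)
a <- 73; b <- 27    # observed positives: predicted pos / predicted neg
c <- 30; d <- 70    # observed negatives: predicted pos / predicted neg

sensitivity <- a / (a + b)   # true positive rate  = 0.73
specificity <- d / (c + d)   # true negative rate  = 0.70
fnr <- b / (a + b)           # false negative rate = 1 - sensitivity
fpr <- c / (c + d)           # false positive rate = 1 - specificity = 0.30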

5 Plotting the ROC
The ROC curve plots Sensitivity (true pos rate) against 1 – Specificity (the false pos rate) for different thresholds t
In a good test, there is a t for which both Sensitivity and Specificity are near 1
– That curve would pass near (0, 1), the top-left corner
In a bad test, the proportion classified as positive is the same regardless of the truth; we would have Sensitivity = 1 – Specificity for all t
– The 45° line
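A minimal sketch of this construction in base R, with simulated scores (all names and data here are illustrative, not part of the course material):

# Simulated scores: positives tend to score higher than negatives
set.seed(1)
score_pos <- rnorm(200, mean = 1)   # scores for observed positives
score_neg <- rnorm(800, mean = 0)   # scores for observed negatives

thresholds <- sort(c(score_pos, score_neg), decreasing = TRUE)
sens <- sapply(thresholds, function(t) mean(score_pos >= t))  # true pos rate
fpr  <- sapply(thresholds, function(t) mean(score_neg >= t))  # 1 - specificity

plot(fpr, sens, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # the 45-degree line of a useless classifier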

6 The ROC
[Figure: ROC curve with Sensitivity (true pos. rate) on the vertical axis and 1 – Specificity (false pos. rate) on the horizontal axis, both running from 0 to 1; points on the curve are marked for a high threshold t and a low threshold t.]
We give every observation a score. For this value of t, 73% of positive observations have score ≥ t, but only 30% of negatives have score ≥ t.

7 On the 45° Line
If your classifier’s ROC curve follows the 45° line, then for any t the probability of classifying a positive as positive (sensitivity) is the same as the probability of classifying a negative as positive (1 – specificity)
The area between your ROC curve and the 45° line is a measure of quality. So is the total area under the curve, the AUC.
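One simple way to estimate that area numerically is the trapezoid rule applied to the (fpr, sens) points from the sketch above (again purely illustrative, reusing those simulated vectors):

# Trapezoid-rule estimate of the area under the (fpr, sens) curve
x <- c(0, fpr, 1); y <- c(0, sens, 1)      # pin down the (0,0) and (1,1) endpoints
auc_trap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc_trap                                   # close to 0.5 for a useless classifier, near 1 for a good one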

8 The ROC
[Figure: the same ROC plot, Sensitivity (true pos. rate) vs. 1 – Specificity (false pos. rate), with the Area Under the Curve (AUC) shaded and the low- and high-threshold ends of the curve labeled.]

9 Area Under The Curve
The Area Under the Curve (AUC) is often measured or estimated
A “random” classifier has AUC = 0.5
Rule of thumb: AUC > 0.8 is good, but often only a few thresholds make sense
R draws ROC curves via the pROC and ROCR packages, among others
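For example, continuing the simulated scores above, the pROC package builds the curve and reports the AUC directly (a minimal sketch; the label and score vectors are from the earlier illustration):

library(pROC)

labels <- c(rep(1, length(score_pos)), rep(0, length(score_neg)))
scores <- c(score_pos, score_neg)

roc_obj <- roc(response = labels, predictor = scores)  # build the ROC object
plot(roc_obj)                                          # draw the curve
auc(roc_obj)                                           # area under the curve

The ROCR package offers an equivalent workflow built around its prediction() and performance() functions.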

10 Other Interpretations of AUC
Select two observations at random, one with a “hit” and one without
For a particular model, what is the probability that the predicted probability for the “hit” is the greater? Answer: it is exactly the AUC
This number also relates to the Wilcoxon two-sample non-parametric (rank-sum) test applied to the predicted scores of the two classes
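A quick illustrative check in R, reusing the simulated scores (ties ignored for simplicity; both quantities should match the AUC reported by pROC):

# P(score of a random positive > score of a random negative), over all pairs
mean(outer(score_pos, score_neg, ">"))

# The Wilcoxon rank-sum statistic W counts the same pairs,
# so W / (n_pos * n_neg) reproduces the AUC
w <- wilcox.test(score_pos, score_neg)$statistic
w / (length(score_pos) * length(score_neg))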

11 Using AUC
ROC is for binary classifiers that produce numeric scores
Produce a set of predicted scores on the test set and plot the ROCs
If one classifier’s curve is always above another’s, it dominates
Otherwise, compare by AUC, or by using some “real” threshold
Examples! Yay?
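As an illustrative comparison (assuming the labels and scores vectors built in the earlier sketch, plus a hypothetical second, noisier model):

library(pROC)
scores2 <- scores + rnorm(length(scores), sd = 2)  # hypothetical second model

roc1 <- roc(labels, scores)
roc2 <- roc(labels, scores2)
plot(roc1); lines(roc2, col = "red")   # overlay: does one curve dominate?
auc(roc1); auc(roc2)                   # otherwise compare by AUC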

