1 Lucila Ohno-Machado. An introduction to calibration and discrimination methods. HST951 Medical Decision Support, Harvard Medical School / Massachusetts Institute of Technology. Evaluation of Medical Systems

2 Main Concepts
Example of System
Discrimination – sensitivity, specificity, PPV, NPV, accuracy, ROC curves
Calibration – calibration curves; Hosmer and Lemeshow goodness of fit

3 Example

4 Comparison of Practical Prediction Models for Ambulation Following Spinal Cord Injury. Todd Rowland, M.D. Decision Systems Group, Brigham and Women's Hospital, Harvard Medical School

5 Study Rationale
Patient's most common question: "Will I walk again?"
Study was conducted to compare logistic regression, neural network, and rough sets models that predict ambulation at discharge based upon information available at admission for individuals with acute spinal cord injury.
Create simple models with good performance
762 training set, 376 test set
univariate statistics compared (means)

6 SCI Ambulation NN Design: Inputs & Output
Inputs – Admission Info (9 items): system days, injury days, age, gender, racial/ethnic group, level of neurologic fxn, ASIA impairment index, UEMS, LEMS
Output – Ambulation (1 item): yes/no

7 Evaluation Indices

8 General indices
Brier score (a.k.a. mean squared error)
Brier = (1/n) Σᵢ (eᵢ - oᵢ)²
e = estimate (e.g., 0.2)
o = observation (0 or 1)
n = number of cases
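As a quick illustration, the Brier score above can be computed in a few lines of Python (a sketch; the function name is mine, not from the lecture):

```python
def brier_score(estimates, outcomes):
    """Mean squared error between probability estimates and 0/1 outcomes."""
    n = len(estimates)
    return sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / n

# Three cases: the model said 0.2, 0.9, 0.5; the outcomes were 0, 1, 1
print(brier_score([0.2, 0.9, 0.5], [0, 1, 1]))  # (0.04 + 0.01 + 0.25)/3 ≈ 0.1
```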

9 Decompositions of Brier score
Murphy: Brier = d(1-d) + CI - DI
where d = prior probability
CI = calibration index (a penalty: smaller is better)
DI = discrimination index (a reward: larger is better)

10 Calibration and Discrimination Calibration measures how close the estimates are to a “real” probability Discrimination measures how much the system can discriminate between cases with gold standard ‘1’ and gold standard ‘0’ “If the system is good in discrimination, calibration may be fixed”

11 Discrimination Indices

12 Discrimination
The system can "somehow" differentiate between cases in different categories
Special cases:
–diagnosis (differentiate sick and healthy individuals)
–prognosis (differentiate poor and good outcomes)

13 Discrimination of Binary Outcomes Real outcome is 0 or 1, estimated outcome is usually a number between 0 and 1 (e.g., 0.34) or a rank Based on Thresholded Results (e.g., if output > 0.5 then consider “positive”) Discrimination is based on: –sensitivity –specificity –PPV, NPV –accuracy –area under ROC curve

14 Notation
D means disease present
"D" means system says disease is present
nl means disease absent
"nl" means system says disease is absent
TP means true positive
FP means false positive
TN means true negative
FN means false negative

15 [Figure: overlapping score distributions for normal and Disease cases on a 1.0–3.0 scale, with a threshold at 1.7; regions labeled True Negative (TN), FP, FN, and True Positive (TP)]

16 Prevalence = P(D) = (TP + FN)/all

         D    nl
"D"     TP    FP
"nl"    FN    TN

17 Specificity: Spec = TN/(TN + FP)

18 Positive Predictive Value: PPV = TP/(TP + FP)

19 Negative Predictive Value: NPV = TN/(TN + FN)

20 Accuracy = (TN + TP)/all

21 Sensitivity: Sens = TP/(TP + FN)
Specificity: Spec = TN/(TN + FP)
Positive Predictive Value: PPV = TP/(TP + FP)
Negative Predictive Value: NPV = TN/(TN + FN)
Accuracy = (TN + TP)/all

22
         D    nl
"D"     40    20
"nl"    10    30

Sens = TP/(TP + FN) = 40/50 = .8
Spec = TN/(TN + FP) = 30/50 = .6
PPV = TP/(TP + FP) = 40/60 = .67
NPV = TN/(TN + FN) = 30/40 = .75
Accuracy = (TN + TP)/all = 70/100 = .7
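The arithmetic on slide 22 can be reproduced with a short Python sketch (the function and dictionary keys are my own naming):

```python
def binary_metrics(tp, fp, fn, tn):
    """Discrimination indices from the four cells of a 2x2 table."""
    return {
        "sens": tp / (tp + fn),          # true positive rate
        "spec": tn / (tn + fp),          # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

m = binary_metrics(tp=40, fp=20, fn=10, tn=30)
print(m)  # sens=0.8, spec=0.6, ppv≈0.67, npv=0.75, accuracy=0.7
```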

23 Threshold 1.4 (normal vs. disease distributions on the 1.0–3.0 scale):

         D    nl
"D"     50    10
"nl"     0    40

Sensitivity = 50/50 = 1
Specificity = 40/50 = 0.8

24 Threshold 1.7:

         D    nl
"D"     40    10
"nl"    10    40

Sensitivity = 40/50 = .8
Specificity = 40/50 = .8

25 Threshold 2.0:

         D    nl
"D"     30     0
"nl"    20    50

Sensitivity = 30/50 = .6
Specificity = 1

26 ROC curve [Figure: sensitivity vs. 1 - specificity, plotting the three operating points from thresholds 1.4, 1.7, and 2.0]

27 ROC curve [Figure: the full curve obtained by sweeping all thresholds]

28 [Figure: ROC axes with the 45 degree line: no discrimination]

29 45 degree line: no discrimination. Area under ROC: 0.5

30 [Figure: ROC curve hugging the axes: perfect discrimination]

31 Perfect discrimination. Area under ROC: 1

32 ROC curve [Figure: example ROC curve with Area = 0.86]

33 What is the area under the ROC?
An estimate of the discriminatory performance of the system
–the real outcome is binary, and the system's estimates are continuous (0 to 1)
–all thresholds are considered
NOT an estimate of how many times the system will give the "right" answer

34 Interpretation of the Area
System's estimates for:
Healthy (real outcome 0): 0.3, 0.2, 0.5, 0.1, 0.7
Sick (real outcome 1): 0.8, 0.2, 0.5, 0.7, 0.9

35 All possible pairs (0,1)
Pair each healthy estimate (0.3, 0.2, 0.5, 0.1, 0.7) with each sick estimate (0.8, 0.2, 0.5, 0.7, 0.9): a pair is concordant when the healthy estimate is lower (<) than the sick estimate, discordant when it is higher.

36 All possible pairs (0,1)
A pair is a tie when the healthy and sick estimates are equal (e.g., 0.2 vs. 0.2, or 0.5 vs. 0.5).

37 C-index
Concordant: 18; Discordant: 4; Ties: 3
C-index = (Concordant + ½ Ties)/All pairs = (18 + 1.5)/25 = 0.78
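Counting the pairs from slides 34–37 directly (a minimal sketch using the ten estimates above):

```python
healthy = [0.3, 0.2, 0.5, 0.1, 0.7]  # real outcome 0
sick = [0.8, 0.2, 0.5, 0.7, 0.9]     # real outcome 1

concordant = sum(h < s for h in healthy for s in sick)
ties = sum(h == s for h in healthy for s in sick)
pairs = len(healthy) * len(sick)

c_index = (concordant + 0.5 * ties) / pairs
print(concordant, ties, c_index)  # 18 3 0.78
```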

38 All possible pairs (0,1)
Healthy: 0.3, 0.2, 0.5, 0.1, 0.7 vs. Sick: 0.8, 0.2, 0.5, 0.7, 0.9 – counting concordant and discordant pairs as before.

39 Related statistics
Mann-Whitney U: p(x < y) = U/(mn)
where m is the number of real '0's (e.g., deaths) and n is the number of real '1's (e.g., survivals)
Wilcoxon rank test

40 Calibration Indices

41 Calibration System can reliably estimate probability of –a diagnosis –a prognosis Probability is close to the “real” probability for similar cases

42 What is the problem?
Binary events are YES/NO (0/1), i.e., probabilities are 0 or 1 for a given individual
Some models produce continuous (or quasi-continuous) estimates for the binary events
Example:
–Database of patients with spinal cord injury, and a model that predicts whether a patient will ambulate or not at hospital discharge
–Event is 0: doesn't walk, or 1: walks
–Models produce a probability that patient will walk: 0.05, 0.10,...

43 How close are the estimates to the “true” probability for a patient? “True” probability can be interpreted as probability within a set of similar patients What are similar patients? –Patients who get similar scores from models –How to define boundaries for similarity? Examples out there in the real world... –APACHE III for ICU outcomes –ACI-TIPI for diagnosis of MI

44 Calibration
Sort the system's estimates, group, sum:

Estimate  Outcome
0.1       0
0.2       0
0.2       1
group sum = 0.5   outcome sum = 1

0.3       0
0.5       0
0.5       1
group sum = 1.3   outcome sum = 1

0.7       0
0.7       1
0.8       1
0.9       1
group sum = 3.1   outcome sum = 3

45 Calibration Curves [Figure: sum of real outcomes (y) vs. sum of system's estimates (x); points below the 45 degree line indicate overestimation]

46 Linear Regression and 45 degree line [Figure: regression line fitted to the calibration points, compared with the 45 degree line]

47 Goodness-of-fit
Sort the system's estimates, group, sum, compute a chi-square:

Expected (sum of estimates) vs. Observed (sum of outcomes):
group 1: 0.5 vs. 1
group 2: 1.3 vs. 1
group 3: 3.1 vs. 3

χ² = Σ [(observed - expected)²/expected]
χ² = (1 - 0.5)²/0.5 + (1 - 1.3)²/1.3 + (3 - 3.1)²/3.1

48 Hosmer-Lemeshow C-hat
Groups based on n-tiles (e.g., deciles, terciles); n - 2 d.f.

Measured groups            "Mirror" groups
Expected  Observed         Expected  Observed
0.1       0                0.9       1
0.2       0                0.8       1
0.2       1                0.8       0
sum = 0.5 sum = 1          sum = 2.5 sum = 2
0.3       0                0.7       1
0.5       0                0.5       1
0.5       1                0.5       0
sum = 1.3 sum = 1          sum = 1.7 sum = 2
0.7       0                0.3       1
0.7       1                0.3       0
0.8       1                0.2       0
0.9       1                0.1       0
sum = 3.1 sum = 3          sum = 0.9 sum = 1

χ² = (0.5-1)²/0.5 + (1.3-1)²/1.3 + (3.1-3)²/3.1 + (2.5-2)²/2.5 + (1.7-2)²/1.7 + (0.9-1)²/0.9
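The C-hat statistic on slide 48 can be reproduced from the grouped estimates and outcomes (a sketch; each "mirror" cell is one minus the corresponding measured value, so the group sums follow from the counts):

```python
# Groups of (estimates, 0/1 outcomes) as on slides 44 and 48
groups = [
    ([0.1, 0.2, 0.2], [0, 0, 1]),
    ([0.3, 0.5, 0.5], [0, 0, 1]),
    ([0.7, 0.7, 0.8, 0.9], [0, 1, 1, 1]),
]

chi2 = 0.0
for ests, obs in groups:
    exp_events, obs_events = sum(ests), sum(obs)
    # "mirror" cells: expected and observed non-events
    exp_non, obs_non = len(ests) - exp_events, len(obs) - obs_events
    chi2 += (obs_events - exp_events) ** 2 / exp_events
    chi2 += (obs_non - exp_non) ** 2 / exp_non

print(round(chi2, 3))  # ≈ 0.737
```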

49 Hosmer-Lemeshow H-hat
Groups based on n fixed thresholds (e.g., 0.3, 0.6, 0.9); n - 2 d.f.

Measured groups            "Mirror" groups
Expected  Observed         Expected  Observed
0.1       0                0.9       1
0.2       0                0.8       1
0.2       1                0.8       0
0.3       0                0.7       1
sum = 0.8 sum = 1          sum = 3.2 sum = 3
0.5       0                0.5       1
0.5       1                0.5       0
sum = 1.0 sum = 1          sum = 1.0 sum = 1
0.7       0                0.3       1
0.7       1                0.3       0
0.8       1                0.2       0
0.9       1                0.1       0
sum = 3.1 sum = 3          sum = 0.9 sum = 1

χ² = (0.8-1)²/0.8 + (1.0-1)²/1.0 + (3.1-3)²/3.1 + (3.2-3)²/3.2 + (1.0-1)²/1.0 + (0.9-1)²/0.9

50 Covariance decomposition (Arkes et al., 1995)
Brier = d(1-d) + bias² + d(1-d)·slope·(slope - 2) + scatter
where d = prior
bias is a calibration index
slope is a discrimination index
scatter is a variance index

51 Covariance Graph [Figure: estimated probability (e) on the y-axis vs. outcome index (o) on the x-axis (0 or 1); ō = .7, ē = .6, ē₁ = .7, ē₀ = .4; slope = .3, bias = -0.1, PS = .2, scatter = .1]
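Plugging slide 51's indices into slide 50's decomposition gives the implied Brier score (a quick arithmetic check; it assumes the prior d equals the mean outcome ō = .7):

```python
d, bias, slope, scatter = 0.7, -0.1, 0.3, 0.1  # indices read off slide 51

brier = d * (1 - d) + bias**2 + d * (1 - d) * slope * (slope - 2) + scatter
print(round(brier, 4))  # ≈ 0.2129
```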

52 Back to Example

53 Thresholded Results

     Sens   Spec   NPV    PPV    Accuracy
LR   0.875  0.853  0.971  0.549  0.856
NN   0.844  0.878  0.965  0.587  0.872
RS   0.875  0.862  0.971  0.566  0.864

54 Brier Scores

     Brier    Spiegelhalter's Z
LR   0.0804    0.01
NN   0.0811    1.34
RS   0.0883   -1.0

55 ROC Curves [Figure: ROC curves (sensitivity vs. 1-specificity) for the LR, NN, and RS models]

56 Areas under ROC Curves

57

58 Results: Goodness-of-fit
Logistic Regression: H-L p = 0.50
Neural Network: H-L p = 0.21
Rough Sets: H-L p < .01
p > 0.05 indicates reasonable fit

59 Summary
Always test performance on previously unseen cases
General indices are not very informative
A combination of indices describing discrimination and calibration is more informative
It is hard to conclude that one system is better than another in terms of classification performance alone: explanation, variable selection, and acceptability by clinicians are key

