Presentation on theme: "Assessing agreement for diagnostic devices"— Presentation transcript:
1. Assessing agreement for diagnostic devices
FDA/Industry Statistics Workshop, September 28-29, 2006
Bipasa Biswas, Mathematical Statistician, Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, FDA
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.
2. Outline
- Accuracy measures for diagnostic tests with a dichotomous outcome
  - Ideal world: tests with a reference standard
  - Two indices to measure accuracy: sensitivity and specificity
- Assessing agreement between two tests in the absence of a reference standard
  - Overall agreement
  - Cohen’s kappa
  - McNemar’s test
  - Proposed remedy
- Extending agreement to tests with more than two outcomes
  - Extension to the random marginal agreement coefficient (RMAC)
  - Should agreement per cell be reported?
3. Ideal world: tests with a perfect reference standard (single test)
If a perfect reference standard exists to classify patients as diseased (D+) versus not diseased (D-), then the data can be represented as:

Test | True status D+ | True status D- | Total
T+ | TP | FP | TP+FP
T- | FN | TN | FN+TN
Total | TP+FN | FP+TN | TP+FP+FN+TN

If the true disease status is known, we can estimate the sensitivity as Se = TP/(TP+FN) and the specificity as Sp = TN/(TN+FP).
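The two definitions above can be sketched in a few lines of Python. The counts below are illustrative only (a test with 80% sensitivity and 70% specificity applied to 100 diseased and 100 non-diseased subjects), and the function name is a hypothetical helper, not something from the presentation.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (Se, Sp) from the four cells of a test-versus-truth table."""
    se = tp / (tp + fn)  # share of diseased subjects the test calls positive
    sp = tn / (tn + fp)  # share of non-diseased subjects the test calls negative
    return se, sp

# Illustrative counts (assumed for this sketch).
se, sp = sensitivity_specificity(tp=80, fp=30, fn=20, tn=70)
print(f"Se = {se:.1%}, Sp = {sp:.1%}")
```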
4. Ideal world: tests with a perfect reference standard (comparing two tests)
McNemar’s test can be used to test equality of either sensitivity or specificity.

True status: disease (D+)
New test | Comparator R+ | Comparator R- | Total
T+ | a1 | b1 | a1+b1
T- | c1 | d1 | c1+d1
Total | a1+c1 | b1+d1 | a1+b1+c1+d1

True status: no disease (D-)
New test | Comparator R+ | Comparator R- | Total
T+ | a2 | b2 | a2+b2
T- | c2 | d2 | c2+d2
Total | a2+c2 | b2+d2 | a2+b2+c2+d2

McNemar chi-square:
- To check equality of the sensitivities of the two tests: (|b1-c1|-1)^2/(b1+c1)
- To check equality of the specificities of the two tests: (|c2-b2|-1)^2/(c2+b2)
5. Ideal world: tests with a perfect reference standard (comparing two tests), example

True status: disease (D+)
New test | Comparator R+ | Comparator R- | Total
T+ | 80 | 5 | 85
T- | 10 | 5 | 15
Total | 90 | 10 | 100

True status: no disease (D-)
New test | Comparator R+ | Comparator R- | Total
T+ | 85 | 20 | 105
T- | 5 | 790 | 795
Total | 90 | 810 | 900

Se(T) = 85.0% (85/100), Sp(T) = 88.3% (795/900)
Se(R) = 90.0% (90/100), Sp(R) = 90.0% (810/900)

McNemar chi-square:
- Equality of the sensitivities of the two tests: (|5-10|-1)^2/(5+10) = 1.07, p-value = 0.30; 95% CI for the difference: (-13.5%, 3.5%)
- Equality of the specificities of the two tests: (|5-20|-1)^2/(5+20) = 7.84, p-value = 0.005; 95% CI for the difference: (-2.9%, -0.5%)
6. McNemar’s test when a reference standard exists
Note, however, that McNemar’s test only checks for equality: the null hypothesis is that the two tests are equivalent and the alternative is that they differ. This is not an appropriate formulation, because a failure to find a statistically significant difference is then naively interpreted as evidence of equivalence.
The 95% confidence intervals for the differences in sensitivity and specificity give a better idea of how the two tests differ.
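As a rough check on the slide 5 example, the sketch below computes the continuity-corrected McNemar statistic and a confidence interval for the paired difference. The slides do not say how the intervals were obtained. The Wald interval widened by 1/n used here is an assumption, chosen because it gives values close to the printed ones. The function and argument names are hypothetical.

```python
import math

def mcnemar_corrected(t_only, r_only):
    """Continuity-corrected McNemar chi-square and two-sided p-value (1 df).

    t_only / r_only: discordant pairs where only T (or only R) is correct.
    """
    chi2 = (abs(t_only - r_only) - 1) ** 2 / (t_only + r_only)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square (1 df) upper tail
    return chi2, p_value

def paired_difference_ci(t_only, r_only, n, z=1.96):
    """Assumed method: Wald interval for (T - R), widened by 1/n."""
    diff = (t_only - r_only) / n
    se = math.sqrt(t_only + r_only - (t_only - r_only) ** 2 / n) / n
    half_width = z * se + 1 / n
    return diff - half_width, diff + half_width

# Sensitivity: among 100 diseased subjects, 5 are T+ only and 10 are R+ only.
se_chi2, se_p = mcnemar_corrected(5, 10)
se_ci = paired_difference_ci(5, 10, 100)

# Specificity: among 900 non-diseased subjects, 5 are T- only and 20 are R- only.
sp_chi2, sp_p = mcnemar_corrected(5, 20)
sp_ci = paired_difference_ci(5, 20, 900)

print(f"Se: chi2 = {se_chi2:.2f}, p = {se_p:.2f}, CI = ({se_ci[0]:.1%}, {se_ci[1]:.1%})")
print(f"Sp: chi2 = {sp_chi2:.2f}, p = {sp_p:.3f}, CI = ({sp_ci[0]:.1%}, {sp_ci[1]:.1%})")
```

The interval for specificity excludes zero while the one for sensitivity does not, which carries more information than the two p-values alone.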
7. Imperfect reference standard
A subject’s true disease status is seldom known with certainty. What is the effect on sensitivity and specificity when the comparator test R itself has error?

Imperfect reference test (comparator test)
New test | R+ | R- | Total
T+ | a | b | a+b
T- | c | d | c+d
Total | a+c | b+d | a+b+c+d
8. Imperfect reference standard
Example 1: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R (the comparator test) that misses 20% of the diseased subjects but never falsely indicates disease.

Against true status
Test | D+ | D- | Total
T+ | 80 | 30 | 110
T- | 20 | 70 | 90
Total | 100 | 100 | 200

Against the imperfect reference test
Test | R+ | R- | Total
T+ | 64 | 46 | 110
T- | 16 | 74 | 90
Total | 80 | 120 | 200

Se = 80.0% (80/100); Se relative to R = 80.0% (64/80)
Sp = 70.0% (70/100); Sp relative to R = 61.7% (74/120)
9. Imperfect reference standard
Example 2: Say we have a new test T with 80% sensitivity and 70% specificity, and an imperfect reference test R that misses 20% of the diseased subjects, but the error in R is related to the error in T.

Against true status
Test | D+ | D- | Total
T+ | 80 | 30 | 110
T- | 20 | 70 | 90
Total | 100 | 100 | 200

Against the imperfect reference test
Test | R+ | R- | Total
T+ | 80 | 30 | 110
T- | 0 | 90 | 90
Total | 80 | 120 | 200

Se = 80.0% (80/100); Se relative to R = 100.0% (80/80)
Sp = 70.0% (70/100); Sp relative to R = 75.0% (90/120)
10. Imperfect reference standard
Example 3: Now suppose our test is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

Against true status
Test | D+ | D- | Total
T+ | 100 | 0 | 100
T- | 0 | 100 | 100
Total | 100 | 100 | 200

Against the imperfect reference test
Test | R+ | R- | Total
T+ | 90 | 10 | 100
T- | 10 | 90 | 100
Total | 100 | 100 | 200

Se = 100.0% (100/100); Se relative to R = 90.0% (90/100)
Sp = 100.0% (100/100); Sp relative to R = 90.0% (90/100)
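The counts in Examples 1 and 3 can be reproduced with a short sketch, under the assumption that T and R make their errors independently once the true disease status is fixed. That assumption is not stated on the slides, and it deliberately excludes Example 2, where the errors are related. The function names are hypothetical.

```python
def expected_table(n_dis, n_non, se_t, sp_t, se_r, sp_r):
    """Expected T-versus-R counts (a, b, c, d), assuming T and R err
    independently given true status (an assumption of this sketch)."""
    a = n_dis * se_t * se_r + n_non * (1 - sp_t) * (1 - sp_r)        # T+, R+
    b = n_dis * se_t * (1 - se_r) + n_non * (1 - sp_t) * sp_r        # T+, R-
    c = n_dis * (1 - se_t) * se_r + n_non * sp_t * (1 - sp_r)        # T-, R+
    d = n_dis * (1 - se_t) * (1 - se_r) + n_non * sp_t * sp_r        # T-, R-
    return a, b, c, d

def accuracy_relative_to_reference(a, b, c, d):
    """Apparent Se and Sp of T when R is treated as if it were the truth."""
    return a / (a + c), d / (b + d)

# Example 1: T has Se 80%, Sp 70%; R misses 20% of diseased, no false positives.
ex1 = expected_table(100, 100, se_t=0.8, sp_t=0.7, se_r=0.8, sp_r=1.0)
ex1_se, ex1_sp = accuracy_relative_to_reference(*ex1)

# Example 3: T is perfect; R has Se 90% and Sp 90%.
ex3 = expected_table(100, 100, se_t=1.0, sp_t=1.0, se_r=0.9, sp_r=0.9)
ex3_se, ex3_sp = accuracy_relative_to_reference(*ex3)

print("Example 1:", [round(x) for x in ex1], f"Se = {ex1_se:.1%}, Sp = {ex1_sp:.1%}")
print("Example 3:", [round(x) for x in ex3], f"Se = {ex3_se:.1%}, Sp = {ex3_sp:.1%}")
```

In both cases the apparent accuracy of T relative to R differs from its true accuracy, even though only R is at fault in Example 3.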
11. Challenges in assessing agreement in the absence of a reference standard
Commonly used overall measures are:
- Overall agreement
- Cohen’s kappa
- McNemar’s test
Instead, report positive percent agreement (PPA) and negative percent agreement (NPA).
12. Estimate of agreement
The overall percent agreement can be calculated as: 100% x (a+d)/(a+b+c+d)
The overall percent agreement, however, does not differentiate between agreement on the positives and agreement on the negatives.
Instead of overall agreement, report positive percent agreement (PPA) with respect to the imperfect reference standard positives and negative percent agreement (NPA) with respect to the imperfect reference standard negatives (see Feinstein et al.).
PPA = 100% x a/(a+c)
NPA = 100% x d/(b+d)
13. Why not report just the overall percent agreement?
The overall percent agreement is insensitive to off-diagonal imbalance.

Imperfect reference test
New test | R+ | R- | Total
T+ | 70 | 15 | 85
T- | 0 | 15 | 15
Total | 70 | 30 | 100

The overall percent agreement is 85.0%, yet it does not account for the off-diagonal imbalance: the PPA is 100% and the NPA is only 50%.
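A minimal sketch of the three measures, using cell counts consistent with the figures quoted on this slide (a = 70, b = 15, c = 0, d = 15). The function name is a hypothetical helper.

```python
def agreement_measures(a, b, c, d):
    """Overall agreement, PPA and NPA for a new test (rows) versus an
    imperfect reference (columns): a = T+R+, b = T+R-, c = T-R+, d = T-R-."""
    n = a + b + c + d
    return {
        "overall": (a + d) / n,
        "ppa": a / (a + c),  # agreement among reference positives
        "npa": d / (b + d),  # agreement among reference negatives
    }

m = agreement_measures(a=70, b=15, c=0, d=15)
print(f"overall = {m['overall']:.1%}, PPA = {m['ppa']:.1%}, NPA = {m['npa']:.1%}")
```

Reporting the two conditional figures side by side exposes the imbalance that the single overall figure hides.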
14. Why report both PPA and NPA?

Table 1
New test | Imperfect reference R+ | Imperfect reference R- | Total
T+ | 5 | 5 | 10
T- | 5 | 85 | 90
Total | 10 | 90 | 100
Overall percent agreement = 90.0%
PPA = 50.0% (5/10) [95% CI: 18.7%, 81.3%]
NPA = 94.4% (85/90) [95% CI: 87.5%, 98.2%]

Table 2
New test | Imperfect reference R+ | Imperfect reference R- | Total
T+ | 35 | 5 | 40
T- | 5 | 55 | 60
Total | 40 | 60 | 100
Overall percent agreement = 90.0%
PPA = 87.5% (35/40) [95% CI: 73.2%, 95.8%]
NPA = 91.7% (55/60) [95% CI: 81.6%, 97.2%]
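The slides do not say how these confidence intervals were computed. One common choice for a single proportion is the exact (Clopper-Pearson) interval, which for 5/10 gives roughly 18.7% to 81.3%. The sketch below implements it with the standard library only, as an assumption about the method and not a statement of what the presenter used. The helper names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root_of_decreasing(f, target, iters=60):
    """Bisection on [0, 1] for f(p) = target, where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion x/n."""
    lower = 0.0 if x == 0 else _root_of_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else _root_of_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

for label, x, n in [("Table 1 PPA", 5, 10), ("Table 1 NPA", 85, 90),
                    ("Table 2 PPA", 35, 40), ("Table 2 NPA", 55, 60)]:
    lo, hi = clopper_pearson(x, n)
    print(f"{label}: {x}/{n} = {x / n:.1%}, 95% CI ({lo:.1%}, {hi:.1%})")
```

The width of the interval depends on the denominator, which is why a PPA based on only 10 reference positives is so much less precise than one based on 40.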
15. Kappa measure of agreement
Kappa is defined as the difference between observed and expected agreement, expressed as a fraction of the maximum possible difference; it ranges from -1 to 1.

Imperfect reference standard
New test | R+ | R- | Total
T+ | a | b | a+b
T- | c | d | c+d
Total | a+c | b+d | n = a+b+c+d

κ = (Io - Ie)/(1 - Ie), where Io = (a+d)/n and Ie = ((a+c)(a+b) + (b+d)(c+d))/n^2
16. Kappa measure of agreement

Imperfect reference standard
New test | R+ | R- | Total
T+ | 35 | 15 | 50
T- | 15 | 35 | 50
Total | 50 | 50 | 100

Io = 70/100 = 0.70, Ie = ((50)(50) + (50)(50))/10000 = 0.50
κ = (0.70 - 0.50)/(1 - 0.50) = 0.40 [95% CI: 0.22, 0.58]
Note that the overall percent agreement is 70.0%.
17. Is the kappa measure of agreement sensitive to the off-diagonal?

Imperfect reference test
New test | R+ | R- | Total
T+ | 35 | 30 | 65
T- | 0 | 35 | 35
Total | 35 | 65 | 100

κ = 0.45 [95% CI: 0.31, 0.59]
Although the overall agreement stayed the same (70%) and the marginal differences are much bigger than before, kappa is higher than before (0.45 versus 0.40), which suggests better agreement.
The kappa statistic is affected by the marginal totals even when the overall agreement is the same.
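The kappa formula from slide 15 can be applied to both tables in a few lines. The cell counts below are those consistent with the totals and kappa values quoted on slides 16 and 17, and the function name is a hypothetical helper.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table: a = T+R+, b = T+R-, c = T-R+, d = T-R-."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + c) * (a + b) + (b + d) * (c + d)) / n**2
    return (observed - expected) / (1 - expected)

balanced = cohen_kappa(35, 15, 15, 35)  # symmetric off-diagonal (slide 16)
skewed = cohen_kappa(35, 30, 0, 35)     # all disagreement in one cell (slide 17)
print(f"balanced: {balanced:.2f}, skewed: {skewed:.2f}")
```

Both tables have 70 agreements out of 100, so the difference in kappa comes entirely from the expected-agreement term, which is driven by the marginal totals.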
18. McNemar’s test to check for equality in the absence of a reference standard
Hypothesis: equality of the rates of positive response.

Imperfect reference test
New test | R+ | R- | Total
T+ | 37 | 30 | 67
T- | 5 | 28 | 33
Total | 42 | 58 | 100

McNemar chi-square = (|b-c|-1)^2/(b+c) = (|30-5|-1)^2/(30+5) = 16.46
Two-sided p-value < 0.0001
19. McNemar’s test (insensitivity to the main diagonal)

Imperfect reference test
New test | R+ | R- | Total
T+ | 3700 | 30 | 3730
T- | 5 | 2800 | 2805
Total | 3705 | 2830 | 6535

The p-value is the same as when a = 37 and d = 28, even though the new and the old test agree on 99.5% of individual cases.
20. McNemar’s test (insensitivity to the main diagonal)

Imperfect reference test
New test | R+ | R- | Total
T+ | 0 | 19 | 19
T- | 18 | 0 | 18
Total | 18 | 19 | 37

The two-sided p-value is 1, even though the old and new tests agree on no cases.
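The point of slides 18 to 20 can be checked directly: the McNemar statistic uses only the off-diagonal cells, so the agreeing cells have no influence on it. The sketch below uses the continuity-corrected formula from slide 18 and three sets of cell counts consistent with those slides. Assigning 19 to b and 18 to c in the last table is an assumption, and the result does not depend on it. The function name is hypothetical.

```python
import math

def mcnemar_with_agreement(a, b, c, d):
    """Return (chi-square, two-sided p-value, overall agreement) for a 2x2 table."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square (1 df) upper tail
    agreement = (a + d) / (a + b + c + d)
    return chi2, p_value, agreement

tables = {
    "slide 18": (37, 30, 5, 28),
    "slide 19": (3700, 30, 5, 2800),
    "slide 20": (0, 19, 18, 0),
}
results = {name: mcnemar_with_agreement(*cells) for name, cells in tables.items()}
for name, (chi2, p, agree) in results.items():
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.5f}, agreement = {agree:.1%}")
```

The first two tables give the same statistic despite 65% versus about 99.5% agreement, and the third gives a p-value of 1 with no agreement at all.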
21. Proposed remedy
Instead of reporting overall agreement, kappa, or the McNemar’s test p-value, report both positive percent agreement and negative percent agreement.
In the 510(k) paradigm, where a new device is compared with an already marketed device, the positive percent agreement and the negative percent agreement are relative to the comparator device, which is appropriate.
22. Agreement of tests with more than two outcomes
For example, in radiology one often compares standard film mammography with digital mammography, where radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity.
Fay (2005, Biostatistics) proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen’s kappa).
23. Comparing two tests with more than two outcomes
An advantage of RMAC is that differences between the two marginal distributions do not induce greater apparent agreement.
However, as stated in the paper, RMAC, like Cohen’s kappa with its fixed-marginal assumption, depends on the heterogeneity of the population. Thus, when the probability of responding in one category is nearly 1, the chance agreement will be large, leading to low agreement coefficients.
24. Comparing two tests with more than two outcomes
An omnibus agreement index for tests with more than two outcomes suffers from problems similar to those seen for tests with a dichotomous outcome. Also, in a regulatory setting where a new test device is compared with a predicate device, RMAC may not be appropriate because it gives equal weight to the marginals from the test and the predicate device.
Instead, report individual agreement for each category.
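One way to read "individual agreement for each category" is as a direct extension of PPA and NPA: for each category of the predicate device, report the share of those subjects that the new test places in the same category. That reading is an assumption of this sketch. The 5 x 5 counts below are made up purely for illustration, and the function name is hypothetical.

```python
def per_category_agreement(table):
    """table[i][j] = subjects scored i+1 by the new test and j+1 by the predicate.

    Returns, for each predicate category, the fraction of its subjects
    that the new test assigns to the same category."""
    k = len(table)
    agreement = []
    for j in range(k):
        column_total = sum(table[i][j] for i in range(k))
        agreement.append(table[j][j] / column_total if column_total else float("nan"))
    return agreement

# Made-up counts for scores 1 to 5 (rows: new test, columns: predicate).
counts = [
    [40, 5, 1, 0, 0],
    [6, 30, 4, 1, 0],
    [1, 5, 20, 3, 1],
    [0, 1, 4, 15, 2],
    [0, 0, 1, 2, 8],
]
by_category = per_category_agreement(counts)
for score, value in enumerate(by_category, start=1):
    print(f"predicate score {score}: agreement = {value:.1%}")
```

Each figure can be reported with its own confidence interval, in the same way as PPA and NPA in the dichotomous case.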
25. Summary
- If a perfect reference standard exists for a dichotomous test, both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed.
- If a new test is compared with an imperfect predicate test, the positive percent agreement and negative percent agreement, along with their 95% confidence intervals, are a more appropriate basis for comparison than the overall agreement, the kappa statistic, or McNemar’s test.
- For tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems if the goal of the study is to compare the new test against a predicate. A suggestion is to report agreement for each cell.
26. References
Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.
Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions (2nd ed.). John Wiley & Sons, New York.
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Clinical Chemistry, 49(1), 1–6. (Also appears in Annals of Internal Medicine (2003) 138(1), W1–12 and in British Medical Journal (2003) 329(7379), 41–44.)
27. References (continued)
Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.
Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. J. Clin. Epidemiol., Vol. 43, No. 6.
Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: II. Resolving the paradoxes. J. Clin. Epidemiol., Vol. 43, No. 6.
Fay, M.P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6.