Assessing agreement for diagnostic devices

FDA/Industry Statistics Workshop, September 28-29, 2006
Bipasa Biswas, Mathematical Statistician, Division of Biostatistics
Office of Surveillance and Biometrics, Center for Devices and Radiological Health, FDA
No official support or endorsement by the Food and Drug Administration of this presentation is intended or should be inferred.

Outline
- Accuracy measures for diagnostic tests with a dichotomous outcome
  - Ideal world: tests with a reference standard
  - Two indices to measure accuracy: sensitivity and specificity
- Assessing agreement between two tests in the absence of a reference standard
  - Overall agreement
  - Cohen's kappa
  - McNemar's test
  - Proposed remedy
- Extending agreement to tests with more than two outcomes
  - Extension to the random marginal agreement coefficient (RMAC)
  - Should agreement per cell be reported?

Ideal world: tests with a perfect reference standard (a single test)

If a perfect reference standard exists to classify patients as diseased (D+) versus not diseased (D-), the data can be represented as:

              True status
Test          D+        D-
T+            TP        FP        TP+FP
T-            FN        TN        FN+TN
Total         TP+FN     FP+TN     TP+FP+FN+TN

If the true disease status is known, we can estimate Se = TP/(TP+FN) and Sp = TN/(TN+FP).
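As a quick illustration, the following minimal Python sketch (not part of the original slides; the counts are invented) computes both indices from the four cells of the table above:

```python
def se_sp(tp, fp, fn, tn):
    """Sensitivity and specificity from the 2x2 cell counts above."""
    se = tp / (tp + fn)  # fraction of diseased (D+) that test positive
    sp = tn / (tn + fp)  # fraction of non-diseased (D-) that test negative
    return se, sp

se, sp = se_sp(tp=80, fp=30, fn=20, tn=70)  # hypothetical counts
print(f"Se = {se:.1%}, Sp = {sp:.1%}")      # Se = 80.0%, Sp = 70.0%
```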

Ideal world: tests with a perfect reference standard (comparing two tests)

McNemar's test can be used to test equality of either sensitivity or specificity. Cross-classify each subject's results on the new test T and the comparator test R, separately among the diseased and the non-diseased:

Among the diseased (D+):
              R+        R-
T+            a1        b1        a1+b1
T-            c1        d1        c1+d1
Total         a1+c1     b1+d1     a1+b1+c1+d1

Among the non-diseased (D-):
              R+        R-
T+            a2        b2        a2+b2
T-            c2        d2        c2+d2
Total         a2+c2     b2+d2     a2+b2+c2+d2

McNemar chi-square (with continuity correction):
- equality of sensitivities of the two tests: (|b1 - c1| - 1)^2 / (b1 + c1)
- equality of specificities of the two tests: (|c2 - b2| - 1)^2 / (c2 + b2)

Ideal world: tests with a perfect reference standard (comparing two tests), example

Among the diseased (D+):
              R+      R-
T+            80      5       85
T-            10      5       15
Total         90      10      100

Among the non-diseased (D-):
              R+      R-
T+            85      20      105
T-            5       790     795
Total         90      810     900

SeT = 85.0% (85/100), SpT = 88.3% (795/900); SeR = 90.0% (90/100), SpR = 90.0% (810/900).

McNemar chi-square:
- sensitivities: (|5 - 10| - 1)^2 / (5 + 10), p-value = 0.30; 95% CI for the difference: (-13.5%, 3.5%)
- specificities: (|5 - 20| - 1)^2 / (5 + 20), p-value = 0.005; 95% CI for the difference: (-2.9%, -0.5%)
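A short Python sketch (my construction, not the presenter's) reproduces the continuity-corrected McNemar statistics and p-values above; it assumes scipy is available for the chi-square tail probability:

```python
from scipy.stats import chi2

def mcnemar(b, c):
    """Continuity-corrected McNemar chi-square for discordant counts b and c."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)  # two-sided p-value, 1 df

# Discordant cells from the example: (b1, c1) among D+, (c2, b2) among D-.
for label, x, y in [("sensitivities", 5, 10), ("specificities", 20, 5)]:
    stat, p = mcnemar(x, y)
    print(f"{label}: chi-square = {stat:.2f}, p = {p:.4f}")
# sensitivities: chi-square = 1.07, p = 0.3017
# specificities: chi-square = 7.84, p = 0.0051
```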

McNemar's test when a reference standard exists

Note, however, that McNemar's test only checks for equality: the null hypothesis is equality of the two tests and the alternative is a difference. This framing is problematic because a failure to find a statistically significant difference is then naively interpreted as evidence of equivalence. The 95% confidence interval for the difference in sensitivities (or specificities) gives a better picture of how the two tests differ.

Imperfect reference standard

A subject's true disease status is seldom known with certainty. What is the effect on sensitivity and specificity when the comparator test R itself has error?

              Imperfect reference (comparator) test
              R+      R-
T+            a       b       a+b
T-            c       d       c+d
Total         a+c     b+d     a+b+c+d

Imperfect reference standard, example 1

Suppose the new test T has 80% sensitivity and 70% specificity, and the imperfect reference test R (the comparator) misses 20% of diseased subjects but never falsely indicates disease, with R's errors independent of T's.

Against the true status:
              D+      D-
T+            80      30      110
T-            20      70      90
Total         100     100     200

Against the imperfect reference R:
              R+      R-
T+            64      46      110
T-            16      74      90
Total         80      120     200

Se = 80.0% (80/100); Se relative to R = 80.0% (64/80).
Sp = 70.0% (70/100); Sp relative to R = 61.7% (74/120).

Imperfect reference standard, example 2

Again suppose the new test T has 80% sensitivity and 70% specificity, and the imperfect reference test R misses 20% of diseased subjects, but now the error in R is related to the error in T: the 80 diseased subjects R detects are exactly the 80 that T detects.

Against the true status:
              D+      D-
T+            80      30      110
T-            20      70      90
Total         100     100     200

Against the imperfect reference R:
              R+      R-
T+            80      30      110
T-            0       90      90
Total         80      120     200

Se = 80.0% (80/100); Se relative to R = 100.0% (80/80).
Sp = 70.0% (70/100); Sp relative to R = 75.0% (90/120).

Correlated errors make the new test look better relative to R than it truly is.

Imperfect reference standard, example 3

Now suppose the new test T is perfect, that is, it has 100% sensitivity and 100% specificity, but the imperfect reference test R has only 90% sensitivity and 90% specificity.

Against the true status:
              D+      D-
T+            100     0       100
T-            0       100     100
Total         100     100     200

Against the imperfect reference R:
              R+      R-
T+            90      10      100
T-            10      90      100
Total         100     100     200

Se = 100.0% (100/100); Se relative to R = 90.0% (90/100).
Sp = 100.0% (100/100); Sp relative to R = 90.0% (90/100).

A perfect new test appears imperfect simply because the reference is in error.
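Example 3 is easy to check by simulation. Below is a Monte Carlo sketch of my own (the slides contain no code), assuming a 50% prevalence, as in the table above, and R's errors independent of T:

```python
import numpy as np

rng = np.random.default_rng(0)
n, prevalence = 200_000, 0.5               # assumed to mirror the 100/100 split
disease = rng.random(n) < prevalence

t = disease.copy()                          # the new test T is perfect
r_correct = rng.random(n) < 0.9             # R is right 90% of the time,
r = np.where(r_correct, disease, ~disease)  # independently of T

print(f"Se vs truth: {t[disease].mean():.1%}")       # 100.0%
print(f"Sp vs truth: {(~t)[~disease].mean():.1%}")   # 100.0%
print(f"Se vs R:     {t[r].mean():.1%}")             # about 90%
print(f"Sp vs R:     {(~t)[~r].mean():.1%}")         # about 90%
```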

Challenges in assessing agreement in the absence of a reference standard

Commonly used overall measures include:
- the overall agreement measure
- Cohen's kappa
- McNemar's test

Instead, report the positive percent agreement (PPA) and the negative percent agreement (NPA).

Estimate of agreement

The overall percent agreement can be calculated as 100% x (a+d)/(a+b+c+d). It does not, however, differentiate between agreement on the positives and agreement on the negatives. Instead of overall agreement, report the positive percent agreement (PPA) with respect to the imperfect reference standard positives and the negative percent agreement (NPA) with respect to the imperfect reference standard negatives (see Feinstein et al.):

PPA = 100% x a/(a+c)
NPA = 100% x d/(b+d)
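A small Python sketch (mine, not from the slides) computes PPA and NPA with exact Clopper-Pearson confidence intervals, which reproduce the intervals quoted two slides below; scipy's beta distribution is assumed available:

```python
from scipy.stats import beta

def exact_ci(x, n, level=0.95):
    """Clopper-Pearson (exact binomial) confidence interval for x/n."""
    a = (1 - level) / 2
    lo = beta.ppf(a, x, n - x + 1) if x > 0 else 0.0
    hi = beta.ppf(1 - a, x + 1, n - x) if x < n else 1.0
    return lo, hi

# Cells a, b, c, d laid out as above (Table 1 of "Why report both PPA and NPA?").
a, b, c, d = 5, 5, 5, 85
lo, hi = exact_ci(a, a + c)
print(f"PPA = {a / (a + c):.1%}, 95% CI = ({lo:.1%}, {hi:.1%})")
lo, hi = exact_ci(d, b + d)
print(f"NPA = {d / (b + d):.1%}, 95% CI = ({lo:.1%}, {hi:.1%})")
# PPA = 50.0%, 95% CI = (18.7%, 81.3%)
# NPA = 94.4%, 95% CI = (87.5%, 98.2%)
```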

Why not report just the overall percent agreement?

The overall percent agreement is insensitive to the off-diagonal cells:

              Imperfect reference test
              R+      R-
New  T+       70      15      85
test T-       0       15      15
Total         70      30      100

The overall percent agreement is 85.0%, yet it does not account for the off-diagonal imbalance: the PPA is 100% (70/70) while the NPA is only 50% (15/30).

Why report both PPA and NPA?

Table 1:
              R+      R-
New  T+       5       5       10
test T-       5       85      90
Total         10      90      100

Overall percent agreement = 90.0%
PPA = 50.0% (5/10), 95% CI: (18.7%, 81.3%)
NPA = 94.4% (85/90), 95% CI: (87.5%, 98.2%)

Table 2:
              R+      R-
New  T+       35      5       40
test T-       5       55      60
Total         40      60      100

Overall percent agreement = 90.0%
PPA = 87.5% (35/40), 95% CI: (73.2%, 95.8%)
NPA = 91.7% (55/60), 95% CI: (81.6%, 97.2%)

Both tables have the same overall agreement, but PPA and NPA distinguish them.

Kappa measure of agreement

Kappa is defined as the difference between observed and expected agreement, expressed as a fraction of the maximum possible difference, and ranges from -1 to 1.

              Imperfect reference standard
              R+      R-
New  T+       a       b       a+b
test T-       c       d       c+d
Total         a+c     b+d     n = a+b+c+d

kappa = (Io - Ie)/(1 - Ie), where Io = (a+d)/n and Ie = ((a+c)(a+b) + (b+d)(c+d))/n^2

Kappa measure of agreement, example

              Imperfect reference standard
              R+      R-
New  T+       35      15      50
test T-       15      35      50
Total         50      50      100

Io = 70/100 = 0.70; Ie = ((50)(50) + (50)(50))/10000 = 0.50
kappa = (0.70 - 0.50)/(1 - 0.50) = 0.40, 95% CI: (0.22, 0.58)
(For comparison, the overall percent agreement is 70.0%.)
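The same arithmetic in a few lines of Python (my sketch, using the cell labels from the previous slide):

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table, with cells labeled as above."""
    n = a + b + c + d
    io = (a + d) / n                                       # observed agreement
    ie = ((a + c) * (a + b) + (b + d) * (c + d)) / n ** 2  # chance agreement
    return (io - ie) / (1 - ie)

print(f"{cohen_kappa(35, 15, 15, 35):.2f}")  # 0.40 (this slide)
print(f"{cohen_kappa(35, 30, 0, 35):.2f}")   # 0.45 (next slide, same 70% overall agreement)
```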

Is the kappa measure of agreement sensitive to the off-diagonal?

              Imperfect reference test
              R+      R-
New  T+       35      30      65
test T-       0       35      35
Total         35      65      100

kappa = 0.45, 95% CI: (0.31, 0.59)

The overall agreement is unchanged (70%) and the marginal differences are now much bigger than before, yet kappa has increased from 0.40 to 0.45. The kappa statistic is affected by the marginal totals even when the overall agreement is the same.

McNemar's test to check for equality in the absence of a reference standard

The hypothesis tested is equality of the rates of positive response.

              Imperfect reference test
              R+      R-
New  T+       37      30      67
test T-       5       28      33
Total         42      58      100

McNemar chi-square = (|b - c| - 1)^2/(b + c) = (|30 - 5| - 1)^2/(30 + 5) = 16.46; two-sided p-value = 0.00005.

McNemar's test (insensitivity to the main diagonal)

              Imperfect reference test
              R+      R-
New  T+       3700    30      3730
test T-       5       2800    2805
Total         3705    2830    6535

The p-value is the same as when a = 37 and d = 28, even though the new and the old test now agree on 99.5% of individual cases.

McNemar's test (insensitivity to the main diagonal, continued)

              Imperfect reference test
              R+      R-
New  T+       0       19      19
test T-       18      0       18
Total         18      19      37

The two-sided p-value is 1 even though the old and the new test agree on no cases at all.

Proposed remedy

Instead of reporting the overall agreement, kappa, or the McNemar p-value, report both the positive percent agreement and the negative percent agreement. In the 510(k) paradigm, where a new device is compared to an already marketed device, the positive and negative percent agreements are relative to the comparator device, which is appropriate.

Agreement of tests with more than two outcomes

For example, in radiology one often compares the standard film mammogram to a digital mammogram, where radiologists assign a score from 1 (negative finding) to 5 (highly suggestive of malignancy) depending on severity. Fay (2005, Biostatistics) proposes a random marginal agreement coefficient (RMAC), which uses a different adjustment for chance than the standard agreement coefficient (Cohen's kappa).

Comparing two tests with more than two outcomes

The advantage of RMAC is that differences between the two marginal distributions do not induce greater apparent agreement. However, as the paper notes, the RMAC, like Cohen's kappa with its fixed-marginal assumption, depends on the heterogeneity of the population: when the probability of responding in one category is nearly 1, the chance agreement is large, leading to low agreement coefficients.

Comparing two tests with more than two outcomes (continued)

Any omnibus agreement index for more than two outcomes is beset by the same problems seen with dichotomous outcomes. Moreover, in a regulatory setting where a new test device is compared to a predicate device, RMAC may not be appropriate because it gives equal weight to the marginals of the test and the predicate. Instead, report the individual agreement for each category, as sketched below.
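The slides do not fix a formula for per-category agreement, so the sketch below makes an assumption: it conditions on the predicate device's category, directly generalizing PPA and NPA. The counts are hypothetical.

```python
import numpy as np

def per_category_agreement(table):
    """For each category i, the fraction of the predicate's category-i
    calls that the new test also scores as i (n_ii / column total i).
    Conditioning on the predicate mirrors PPA/NPA; it is an illustrative
    choice, not a formula given in the slides."""
    table = np.asarray(table, dtype=float)
    return np.diag(table) / table.sum(axis=0)

# Hypothetical 5-point mammography scores: rows = new test, cols = predicate.
counts = [[30,  5,  2,  0,  0],
          [ 4, 20,  6,  1,  0],
          [ 1,  5, 15,  4,  1],
          [ 0,  1,  3, 10,  3],
          [ 0,  0,  1,  2,  6]]
print(np.round(per_category_agreement(counts), 2))
# [0.86 0.65 0.56 0.59 0.6 ]
```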

Summary

If a perfect reference standard exists, then for a dichotomous test both sensitivity and specificity can be estimated and appropriate hypothesis tests can be performed. If a new test is being compared to an imperfect predicate test, then the positive percent agreement and the negative percent agreement, each with its 95% confidence interval, are a more appropriate way of comparison than the overall agreement, the kappa statistic, or McNemar's test. For tests with more than two outcomes, the kappa statistic and the overall agreement have the same problems when the goal of the study is to compare the new test against a predicate; a suggestion is to report the agreement for each cell.

References

Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
FDA (2003). Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; Draft Guidance for Industry and FDA Reviewers. March 2, 2003.
Fleiss, J.L. (1981). Statistical Methods for Rates and Proportions, 2nd ed. John Wiley & Sons, New York.
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., & de Vet, H.C.W. (2003). Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clinical Chemistry, 49(1), 1-6. (Also appears in Annals of Internal Medicine, 138(1), W1-12, and British Medical Journal, 326(7379), 41-44.)

References (continued)

Dunn, G. and Everitt, B. Clinical Biostatistics: An Introduction to Evidence-Based Medicine. John Wiley & Sons, New York.
Feinstein, A.R. and Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543-549.
Cicchetti, D.V. and Feinstein, A.R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551-558.
Fay, M.P. (2005). Random marginal agreement coefficients: rethinking the adjustment for chance when measuring agreement. Biostatistics, 6(1), 171-180.