Presentation is loading. Please wait.

Presentation is loading. Please wait.

Grading the Strength of a Body of Evidence on Diagnostic Tests Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for.

Similar presentations


Presentation on theme: "Grading the Strength of a Body of Evidence on Diagnostic Tests Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for."— Presentation transcript:

1 Grading the Strength of a Body of Evidence on Diagnostic Tests Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for Medical Test Reviews Methods Guide

2 Overview of a Medical Test Review Prepare Topic Develop the Topic and Structure the Review Choose the Important Outcomes Research Sources Search for and Select Studies for Inclusion Search for Studies Extract Data From Studies Analyze and Synthesize Studies Assess Risk of Bias as a Domain of Quality Assess Applicability Grade the Body of Evidence Meta-analysis of Test Performance Evidence With a “Gold Standard” — or — Meta-analysis of Test Performance Evidence With an Imperfect Reference Standard Decision Modeling Report Medical Test Review Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

3  Explain the value of grading the bodies of evidence that influence decisionmaking.  Describe the required domains involved in assessing strength of evidence.  Understand the challenges specific to grading the strength of a body of evidence on medical test performance. Learning Objectives Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

4  Grading the strength of evidence  Challenges specific to diagnostic tests  Principles of grading the strength of a body of evidence  Illustrative examples Overview Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

5  “Grading” refers to assessment of the strength of the body of evidence supporting a given statement or conclusion.  It does not refer to the quality of an individual study.  Grading allows decisionmakers know if they are working with bodies of evidence that are of high, moderate, or low strength.  It makes their recommendations about interventions/tests more defensible.  This module focuses on grading diagnostic tests (used in diagnostic and management strategies), not prognostic tests (predict responsiveness to treatment). Grading the Strength of Evidence (1 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

6  There are two approaches to grading strength of evidence:  GRADE (Grading of Recommendations Assessment, Development, and Evaluation)  EPC (Evidence-based Practice Center), which is a derivation of GRADE  This module discusses the EPC approach while pointing out the minor differences between it and GRADE.  It is important to differentiate between assessing the quality of an individual study and the strength of an entire body of evidence.  “Strength” assessment includes the quality of an individual study plus other domains.  The EPC approach uses the term “strength of evidence” rather than the term “quality of evidence” used in the GRADE approach to prevent mixing up these ideas. Grading the Strength of Evidence (2 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at

7  Four principle domains are used to assess strength of evidence:  Risk of bias  Consistency  Directness*  Precision  Additional domains to be considered as needed:  Publication bias  Dose-response association  Existence of plausible unmeasured confounders  Strength of association (i.e. magnitude of effect) Grading the Strength of Evidence (3 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at * The GRADE approach considers “applicability” to be part of directness; the EPC approach considers it a component separate from strength of evidence.

8  In the Evidence-based Practice Center approach, strength of evidence is graded for each outcome and comparison in the Key Questions.  At least one Key Question usually relates to test performance (e.g., sensitivity and specificity).  Other key question topics that can also be evaluated using the methods outlined here:  Analytic validity  Clinical validity  Clinical utility Grading the Strength of Evidence (4 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at Samson D, Schoelles KM. Developing the topic and structuring systematic reviews of medical tests. In: Methods guide for medical test reviews. Available at

9  Challenge 1: Tools for assessing the quality of evidence for an intervention are not easily applied to evidence for a diagnostic test.  Evidence directly relates to the overarching Key Question for interventions; there is an indirect relationship for diagnostic tests.  Establishment of logical chains with analytic frameworks is needed, as is assessment of strength of evidence for each “link.”  The strength of evidence of the overarching question of whether a test improves clinical outcomes depends on:  Total body of evidence  Body of evidence for weakest link in chain  Often there is no direct linkage between diagnostic accuracy and clinical outcomes.  Example: When tests are used as triage, accuracy of risk classification is more important than accuracy of diagnosis. Challenges Specific to Diagnostic Tests (1 of 2) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

10  Challenge 2: A second challenge to reviewers is applying the strength-of-evidence domains to studies of diagnostic tests.  Example: It is difficult to judge the precision of a particular confidence interval in estimates of test performance.  This difficulty is due to the logarithmic nature of diagnostic performance measurements such as:  Sensitivity  Specificity  Likelihood ratios  Diagnostic odds ratios  Relatively wide confidence intervals (suggesting imprecision) may not translate into clinically meaningful impacts. Challenges Specific to Diagnostic Tests (2 of 2) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

11 Challenge 2 Example: Impact of the Precision of Sensitivity on Negative Predictive Value  This example shows wide confidence intervals that did not result in a clinically meaningful impact on the clinic outcome.  A 10-percent reduction in the sensitivity of various biopsy techniques (98%  88%) changes the estimated probability of having cancer after a negative test by <5 percent. Type of Biopsy Postbiopsy Probability of Having Cancer After a Negative Core-Needle Biopsy Result Results of analysis Analysis overestimated sensitivity by 1% (e.g., sensitivity 97% rather than 98%) Analysis overestimated sensitivity by 5% (e.g., sensitivity 93% rather than 98%) Analysis overestimated sensitivity by 10% (e.g., sensitivity 88% rather than 98%) Freehand automated gun 6% 8%9% Ultrasound guidance automated gun 1% 3%5% Stereotactic guidance automated gun 1% 3%5% Ultrasound guidance vacuum-assisted 2% 3%6% Stereotactic guidance vacuum-assisted 0.4%0.8%3%5% Bruening W, Schoelles K, Treadwell J, et al. Comparative Effectiveness of Core-Needle and Open Surgical Biopsy for the Diagnosis of Breast Lesions. Available at

12 1.Methods for grading intervention studies can be adapted for diagnostic tests with clinical outcomes. 2.Consider carefully what test characteristic measures are the most appropriate intermediate outcomes for assessing the impact of a test on clinical outcomes and their precision in the clinical context represented by the Key Question. 3.The principle domains of the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach can be adapted to assess a body of evidence on diagnostic test accuracy. 4.Additional GRADE domains can be adapted to assess a body of evidence on diagnostic test accuracy. 5.Multiple domains should be incorporated into an overall assessment in a transparent way. Principles of Grading the Strength of a Body of Evidence Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

13  Can assess a body of evidence evaluating diagnostic test outcomes (e.g., diagnostic thinking, therapeutic choice, clinical outcomes) in a way similar to therapeutic interventions.  Grading issues for studies measuring these diagnostic test outcomes are more straightforward than for accuracy outcomes.  Trial evidence can directly assess the effect of tests on clinical outcomes (rarely done).  In this case, the application of grading criteria should not differ significantly from the methods used for intervention evidence. Principle 1: Methods for Grading Intervention Studies Can Be Adapted for Diagnostic Tests With Clinical Outcomes (1 of 2) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

14  When direct evidence linking the test to the outcome of interest is not available:  For intervention studies, intermediate outcomes (e.g., accuracy outcomes) are “indirect” evidence and reduce the strength of the grade  Linking accuracy outcomes (e.g., false/true-positive outcomes) to clinical outcomes partly depends on factors such as benefits and harms of treatment.  No consensus exists on the approach to use with only indirect evidence.  Accuracy outcomes may be enough to determine if there is a benefit on clinical outcomes  Otherwise, discuss the benefits of including this evidence and the specific methods to be used Principle 1: Methods for Grading Intervention Studies Can Be Adapted for Diagnostic Tests With Clinical Outcomes (2 of 2) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Samson D, Schoelles KM. Developing the topic and structuring systematic reviews of medical tests. In: Methods guide for medical test reviews. Available at

15  Consider how surrogates (e.g., accuracy outcomes) will lead to changes in clinical outcomes.  Doing so is consistent with both the EPC and GRADE principles of emphasizing outcomes important to patients.  Using an analytic framework and decision models helps clarify the linkage between accuracy and clinical outcomes.  True-positive/negative and false-positive/negative outcomes are easily translated to other accuracy outcomes (e.g., sensitivity, specificity, positive/negative predictive value).  Choose accuracy outcomes that:  Are most directly related to clinical outcomes  Have the necessary precision levels Principle 2: Consider Appropriate Test Characteristic Measures and Their Precision (1 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Samson D, Schoelles KM. Developing the topic and structuring systematic reviews of medical tests. In: Methods guide for medical test reviews. Available at Trikalinos TA, Kulasingam S, Lawrence WF. Deciding whether to complement a systematic review of medical tests with decision modeling. In: Methods guide for medical test reviews. Available at

16  Occasionally, it is more important to rule out a disease with severe consequences than to diagnose what disease the patient has.  Triage tests with high sensitivity and high negative predictive value are useful.  Multiple tests are often used to ensure accuracy when high risks exist with treatment of a disease.  Tests used alone need both high sensitivity and specificity (or high positive and high negative predictive value).  If no test meets these criteria, the interest may be in benefits and harms of adding on a test.  The accuracy outcome of interest then is high specificity or high positive predictive value. Principle 2: Consider Appropriate Test Characteristic Measures and Their Precision (2 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

17  More invasive tests have greater harms, with further harms resulting from misdiagnosis.  False-positive and false-negative measurements for a test become important.  The degree of harms depends on:  False-negative results  Severity of disease (for missed diagnosis)  Risks of testing (if test is invasive and has harms itself)  False-positive results  Invasiveness of further testing/treatment  Cognitive/emotional effects of inaccurate disease labeling Principle 2: Consider Appropriate Test Characteristic Measures and Their Precision (3 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

18  A simplified example of precision in context: A test with outcomes data resulting from true-positive, true-negative, false- positive, and false-negative test results  Using a decision model, find the threshold of test sensitivity and specificity combinations where testing versus not testing is a tossup.  Net benefits are equivalent to net harms at the threshold.  There is enough precision if the confidence intervals for sensitivity and specificity from the data in the body of evidence are completely on one side of the threshold.  This example is oversimplified for many situations.  Tests are usually part of an algorithm.  The accuracy outcome of interest depends on how the test is used in an algorithm and its mechanism of action on outcomes. Principle 2: Consider Appropriate Test Characteristic Measures and Their Precision (4 of 4) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Pauker SG, Kassirer JP. N Engl J Med 1980 May 15;302(10): PMID:

19  The risk of bias, consistency, directness, and precision domains of GRADE can be adapted to assess a body of evidence on diagnostic test performance.  Risk of bias  How study type/design/conduct may have contributed to systematic bias must be considered.  Different technical methods used in diagnostic (particularly laboratory) tests can lead to heterogeneous results. For example:  Using different antibodies in immunoassays  Using standards with different values and units Principle 3: The Principle Domains of GRADE Can Be Adapted To Assess a Body of Evidence on Diagnostic Test Accuracy (1 of 5) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Samson D, Schoelles KM. Developing the topic and structuring systematic reviews of medical tests. In: Methods guide for medical test reviews. Available at

20  Consistency is homogeneity in the direction and magnitude of results across studies.  It can be applied to diagnostic test performance studies (the method of measurement may differ). For example:  Intervention studies visually assess consistency with forest plots.  Diagnostic test performance reviews use a summary receiver operating characteristic (ROC) curve.  ROC curves show sensitivity and specificity results from studies.  The consistency of diagnostic accuracy can be shown by bubble plots of true-positive versus false-positive rates in ROC space.  As in intervention studies, strength of evidence is reduced by unexplained heterogeneity (i.e., heterogeneity not explained by different study designs, methodological quality, diversity in subject characteristics, or study context). Principle 3: The Principle Domains of GRADE Can Be Adapted To Assess a Body of Evidence on Diagnostic Test Accuracy (2 of 5) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

21  Evidence is direct is when it reflects a single, direct link between the medical tests and the ultimate health outcome under consideration.  If no studies link the test to clinical outcome, then diagnostic accuracy outcomes are considered indirect.  If grading evidence of intermediate outcomes like diagnostic accuracy, then do not “downgrade” for being indirect  Directness does apply to how the use of the test compares to others. For example:  Comparison of d-dimer test and venous ultrasound as two opposing methods for diagnosis of venous thromboembolism  Comparison of interest may actually be use of d-dimer test as triage for venous ultrasound vs. ultrasound alone Principle 3: The Principle Domains of GRADE Can Be Adapted To Assess a Body of Evidence on Diagnostic Test Accuracy (3 of 5) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at

22  Schemas such as GRADE rate directness based on:  Whether the test evaluated is used in practice  If test accuracy is calculated in a population or for a use that is different than those evaluated in the review (applicability)  The Evidence-base Practice Center (EPC) approach does not include these factors when assessing directness for strength of evidence (SOE).  Reports prepared by EPC reviewers are intended for use by a broad spectrum of stakeholders.  Describing applicability of evidence separately from SOE allows a wide variety of decisionmakers to consider how the evidence relates to their particular test and population. Principle 3: The Principle Domains of GRADE Can Be Adapted To Assess a Body of Evidence on Diagnostic Test Accuracy (4 of 5) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Hartman KE, Matchar DB, Chang SM. Assessing applicability of medical test studies in systematic reviews. In: Methods guide for medical test reviews. Available at

23  Precision is affected by the width of the confidence intervals for diagnostic accuracy measurements.  It is integrally related to sample size.  It is important to consider how clinically meaningful outcomes may be impacted by imprecision.  This may involve calculation of post-test probabilities over a range of values for sensitivity and specificity (see Slide 11).  A more formal analysis with a decision model could be beneficial.  The strength-of-evidence grade for precision should not be downgraded if:  The impact of imprecision on outcomes is negligible.  There is enough precision that decisions based on the test are unaffected. Principle 3: The Principle Domains of GRADE Can Be Adapted To Assess a Body of Evidence on Diagnostic Test Accuracy (5 of 5) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at Owens DK, Lohr KN, Atkins D, et al. J Clin Epidemiol 2002 May;63(5): PMID: Trikalinos TA, Kulasingam S, Lawrence WF. Deciding whether to complement a systematic review of medical tests with decision modeling. In: Methods guide for medical test reviews. Available at

24  Additional domains can be adapted  Publication bias  Use when negative or no-difference findings appear to be unpublished/unavailable  Dose-response association  For tests with continuous outcomes/multiple cutoffs (e.g., serum prostate-specific antigen), may support underlying mechanism of detection  Existence of plausible unmeasured confounders  May be relevant to testing strategies that predict outcomes  Example: a study biased to find low diagnostic accuracy actually finds very high diagnostic accuracy  Strength of association (i.e., magnitude of effect)  Use when comparing accuracy of two different tests where one is more accurate Principle 4: Additional GRADE Domains Can Be Adapted To Assess a Body of Evidence on Diagnostic Test Accuracy Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

25 DomainDefinition and Elements Application to Evaluation of Diagnostic Test Performance Risk of BiasRisk of bias is the degree to which the included studies for a given outcome or comparison have a high likelihood of adequate protection against bias (i.e., good internal validity), as assessed through main elements:  Study design (e.g., randomized controlled trials or observational studies)  Aggregate quality of the studies under consideration from the rating of quality (good/fair/poor) done for individual studies Use one of three levels of aggregate risk of bias:  Low risk of bias  Medium risk of bias  High risk of bias Well designed and executed studies of new tests compared to an adequate criterion standard are rated as having “low risk of bias.” ConsistencyConsistency is the degree to which reported study results (e.g., sensitivity, specificity, likelihood ratios) from included studies are similar. This can be assessed through two main elements:  The range of study results is narrow.  Variability in study results is explained by differences in study design, patient population or test variability. Use one of three levels of consistency:  Consistent (i.e., no inconsistency)  Inconsistent  Unknown or not applicable (e.g., single study) Single-study evidence bases should be considered as “consistency unknown (single study).” GRADE Domains and Descriptions (1 of 2) Adapted from Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at

26 GRADE Domains and Descriptions (2 of 2) DomainDefinition and ElementsApplication to Evaluation of Diagnostic Test Performance DirectnessDirectness relates to whether the evidence links the interventions directly to outcomes. For a comparison of two diagnostic tests, directness implies head-to-head comparisons against a common criterion standard. Directness may be contingent on the outcomes of interest. Score dichotomously as one of two levels of directness:  Direct  Indirect When assessing the directness of the overarching question, if there are no studies linking the test to a clinical outcome, then evidence that only provides diagnostic accuracy outcomes would be considered indirect. If indirect, specify which of the two types of indirectness account for the rating (or both, if that is the case) — namely, use of intermediate/surrogate outcomes rather than health outcomes and use of indirect comparisons. If the decision is made to grade the strength of evidence of an intermediate outcome such as diagnostic accuracy, then the reviewer does not need to automatically “downgrade” that outcome for being indirect. PrecisionPrecision is the degree of certainty surrounding an effect estimate with respect to a given outcome (i.e., for each outcome separately). If a meta-analysis was performed, this will be the confidence interval around the summary measure(s) of test performance (e.g., sensitivity, true positive). Score dichotomously as one of two levels of precision:  Precise  Imprecise A precise estimate is an estimate that would allow a clinically useful conclusion. An imprecise estimate is one for which the confidence interval is wide enough to include clinically distinct conclusions. Adapted from Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at

27 Additional GRADE Domains and Descriptions (1 of 2) DomainDefinition and ElementsApplication to Evaluation of Diagnostic Test Performance Publication bias* Publication bias indicates that studies may have been published selectively, with the result that the estimate of test performance based on published studies does not reflect the true effect. Methods to detect publication bias for medical test studies are not robust. Evidence from small studies of new tests or asymmetry in funnel plots should raise suspicion for publication bias. Publication bias can influence ratings of consistency, precision, or magnitude of effect (and, to a lesser degree, risk of bias and directness). Reviewers should comment on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are unavailable. Dose- response association This association, either across or within studies, typically refers to a pattern of a larger effect associated with a greater level of an exposure (e.g., dose, duration, adherence). A dose-response association may support an underlying mechanism of detection and has potential relevance to some tests results that are measured on a continuous or ordinal scale (e.g., serum levels of prostate-specific antigen; ventilation/perfusion scanning). * The GRADE approach is moving toward considering publication bias a GRADE principle domain. Adapted from Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at

28 Additional GRADE Domains and Descriptions (2 of 2) DomainDefinition and ElementsApplication to Evaluation of Diagnostic Test Performance Plausible unmeasured confounding and bias that would decrease an observed effect or increase an effect if none was observed Occasionally, in an observational study, plausible confounding factors would work in the direction opposite to that of the observed effect. Had these confounders not been present, the observed effect would have larger. In such a case, the evidence can be upgraded The impact of plausible unmeasured confounders may be relevant to testing strategies that predict outcomes. A study may be biased to find low diagnostic accuracy via spectrum bias and yet despite this find very high diagnostic accuracy. Strength of association (magnitude of effect) Strength of association refers to the likelihood that the observed effect or association is large enough that it cannot have occurred solely as a result of bias from potential confounding factors. The strength of association may be relevant when comparing the accuracy of two different medical tests with one being more accurate than the other. It is possible that the accuracy of a test is better than the reference standard because of an imperfect reference standard. It is important to consider this and modify the analysis to take into consideration alternative assumptions about the best reference standard. Adapted from Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at

29  Strength-of-evidence grades reflect a global assessment of principle and associated domains.  This assessment should generate an overall summary grade (high, moderate, low, or insufficient evidence).  Focus on providing an overall grade for the relevant Key Question link in the analytical chain or in outcomes relevant to patients/decisionmakers.  These outcomes should be identified beforehand.  Consider how to incorporate multiple domains into the overall assessment. Principle 5: Multiple Domains Should Be Incorporated Into an Overall Assessment in a Transparent Way (1 of 2) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

30  There is no empirical evidence for a difference between assigning grades based on qualitative versus quantitative approaches.  GRADE approach  The ordinal rankings are high  moderate  low  very low.  These grades/overall rankings were developed using the eight GRADE domains.  EPC approach  The approach allows greater flexibility in grading.  Whichever approach is used, the rationale for weighting the domains in assigning summary grades should be described. Principle 5: Multiple Domains Should Be Incorporated Into an Overall Assessment in a Transparent Way (2 of 2) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

31 Approach to Evidence Grading Adapted from Agency for Health Care Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Available at gov/methodsguide.cfm.

32  When grading a body of evidence on diagnostic test accuracy, carefully consider which accuracy outcomes are linked to clinical outcomes.  The number of false-positive results are important when the diagnostic test is followed by an invasive procedure.  Positive predictive value, negative predictive value, and likelihood ratios also important when the test is part of a management strategy. Evidence Grading Examples: Considering Accuracy Outcomes (1 of 6) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

33 Evidence Grading Examples: Considering Accuracy Outcomes (2 of 6) Outcome Quantity and Type of EvidenceFinding Starting Grade Decrease GRADE ‡ GRADE of Evidence for Outcome Overall GRADE § RB ‡ CON ‡ DIR ‡ PRE ‡ PUB ‡ Sensitivity*1 DIAG68%High0000Moderate Specificity † 1 DIAG99%High0000Moderate PPV † 1 DIAG97%High0000Moderate NPV † 1 DIAG82%High0000Moderate *Data taken from MacCannell T, Umscheid CA, Agarwal RK, et al. Infect Control Hosp Epidemiol 2011 Oct;32(10): PMID: † These outcomes were considered the most critical by the guideline developers. Data taken from Turcios RM, Widdowson MA, Sulka AC, et al. Clin Infect Dis 2006 Apr 1;42(7): PMID: ‡ These modifiers can impact the GRADE rating by 1 or 2 points. Data taken from Turcios et al. (2011). § If applicable, consider the additional domains of strength of association, dose-response, and impact of plausible confounders. CON = consistency; DIAG = diagnostic test; DIR = directness; GRADE = Grading of Recommendations Assessment, Development, and Evaluation; NPV = negative predictive value; PPV = positive predictive value; PRE = precision; PUB = publication bias; RB = risk of bias Identifying Norovirus Infection in a Health Care Setting Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

34  This example concerns accuracy unrelated to outcomes when the test is part of a treatment versus watchful-waiting algorithm.  A review was conducted on the use of noninvasive imaging in addition to a standard workup after detection of a breast lesion.  Studies were intended to triage women into three groups: “biopsy,” “watchful waiting,”, or “return to normal screening intervals.”  Although the accuracy of the exact diagnosis was low, clinical management was the same.  This finding holds true if the post-test probability does not cross the decision management threshold.  It shows need to evaluate the surrogate measure in the context of the clinical outcome. Evidence Grading Examples: Considering Accuracy Outcomes (3 of 6) Bruening W, Uhl S, Fontanarosa J, et al. Comparative Effectiveness Review No. 47. Available at Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

35  Two reviewers should independently score the major outcomes and comparisons within each Key Question.  Each reviewer should assess each of the following domains to grade each outcome:  Directness  Precision  Consistency  Risk of bias  Publication bias (as needed)  Dose-response association (as needed)  Strength of association (as needed)  Impact of unmeasured confounders (as needed) Evidence Grading Examples: Considering Accuracy Outcomes (4 of 6) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

36  The outcomes reviewers choose as critical affect how the domains of directness, precision, and consistency are graded.  In a comparison of multisplice coronary computed tomography scans versus coronary angiography:  True-positive, true-negative, and false-positive results were considered direct evidence.  There was little uncertainty about the clinical implications of these results.  False-negative results were considered indirect evidence.  There was uncertainty about the clinical implications of these results.  The overall results were: a “low” strength-of-evidence grade for false-negative results, and a “moderate” strength-of-evidence grade for other accuracy outcomes. Evidence Grading Examples: Considering Accuracy Outcomes (5 of 6) Schunemann HJ, Oxman AD, Brozek J, et al. BMJ 2008 May 17;336(7653): PMID: Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

37  It is reasonable to consider either:  A more flexible qualitative approach to grading  A standard ordinal approach (ranging from high to insufficient strength of evidence)  Resolve differences in domain assessments and outcome grades.  Describe how the consensus score was reached (e.g., discussion, third-party adjudication).  Arrive at a single summary grade through transparent and systematic methods.  Consider the impact of various accuracy outcomes on the overall strength-of-evidence grade.  Identify which of these outcomes are “key.” Evidence Grading Examples: Considering Accuracy Outcomes (6 of 6) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

38  The greatest challenge is assessing multiple links in a chain of evidence connecting test performance to changes in clinical outcomes.  This module focused on diagnostic test performance, a crucial link in the chain.  No one system for grading the strength of evidence for diagnostic tests has been shown to be superior to any other; many are still early in development.  Reviews are made more useful to decisionmakers by:  Applying the consistent and transparent system of grading using the discussed domains  Giving an explicit rationale for the choice of grades based on these domains Summary Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

39  The GRADE (Grading of Recommendations Assessment, Development, and Evaluation) system can be used for diagnostic tests.  Clinical outcomes to consider (if available):  Effectiveness  Harm  When using intermediate accuracy outcomes, an analytic framework is necessary to:  Describe how the test is related to clinical outcomes  Delineate the individual questions that can be answered Key Messages (1 of 3) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

40  Reviewers can use these domains to grade the strength of evidence of a diagnostic test:  Risk of bias  Directness  Consistency  Precision  Publication bias  Dose-response association  Strength of association  Plausible unmeasured confounders Key Messages (2 of 3) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

41  Consider the test’s clinical context when:  Selecting accuracy outcomes (i.e., sensitivity, specificity, positive predictive value, negative predictive value, true- positive/negative outcomes, and false-positive/negative outcomes)  Determining the needed precision levels of these outcomes  Reviewers should consider explaining:  The rationale for a particular summary grade  Domains that were weighted the most in assigning the grade Key Messages (3 of 3) Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Methods guide for medical test reviews. Available at

42 1.A summary receiver operating characteristic (ROC) curve can be used to assess consistency of diagnostic accuracy among studies. a.True b.False Practice Question 1 (1 of 2)

43 Explanation for Question 1: The statement is true. The ROC curve can display the sensitivity and specificity results from various studies. This offers a helpful visual to assess consistency or homogeneity in the direction and magnitude of results across different studies. Practice Question 1 (2 of 2)

44 2.When diagnostic tests are used as a triage test to other more invasive tests, there is a direct link between the accuracy of a diagnostic test and clinical outcomes. a.True b.False Practice Question 2 (1 of 2)

45 Explanation for Question 2: The statement is false. Diagnostic tests are often used in triage or as part of a decision tree. In that case, the accuracy of risk classification is more important than the accuracy of diagnosis. The triage test is not linked directly to the clinical outcome. Practice Question 2 (2 of 2)

46 3.If the question is asking about the effect of testing on clinical outcomes, then a body of evidence that measures sensitivity and specificity as the outcomes should be considered indirect. a.True b.False Practice Question 3 (1 of 2)

47 Explanation for Question 3: The statement is true. Sensitivity and specificity are considered intermediate outcomes, or accuracy outcomes, that are used to assess diagnostic tests. They are considered indirect evidence of clinical outcomes. Practice Question 3 (2 of 2)

48 4.It is preferable to use a quantitative approach to assigning grades when feasible as opposed to using a qualitative approach. a.True b.False Practice Question 4 (1 of 2)

49 Explanation for Question 4: The statement is false. There is no empirical evidence of a difference between quantitative and qualitative approaches to assigning grades. Practice Question 4 (2 of 2)

50  This presentation was prepared by Brooke Heidenfelder, Rachael Posey, Lorraine Sease, Remy Coeytaux, Gillian Sanders, and Alex Vaz, members of the Duke University Evidence-based Practice Center.  The module is based on Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June p AHRQ Publication No. 12-EHC017. Available at Authors

51  Agency for Healthcare Research and Quality. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality; April AHRQ Publication No. 10(12)-EHC063-EF. Chapters available at  Atkins D, Fink K, Slutsky J. Better information for better health care: the Evidence-based Practice Center Program and the Agency for Healthcare Research and Quality. Ann Intern Med 2005 Jun 21;142(12 Pt 2): PMID:  Balshem H, Helfand M, Schunemann HJ, et al. GRADE guidelines: 3. Rating the quality of evidence. J Clin Epidemiol 2011 Apr;64(4): PMID: References (1 of 9)

52  Bruening W, Schoelles K, Treadwell J, et al. Comparative Effectiveness of Core-Needle and Open Surgical Biopsy for the Diagnosis of Breast Lesions. Comparative Effectiveness Review No. 19 (Prepared by the ECRI Institute Evidence-based Practice Center under Contract No ). Rockville, MD: Agency for Healthcare Research and Quality; December AHRQ Publication No. 10-EHC007-EF. Available at NBK45220/pdf/TOC.pdf.  Bruening W, Uhl S, Fontanarosa J, et al. Noninvasive Diagnostic Tests for Breast Abnormalities: Update of a 2006 Review. Comparative Effectiveness Review No. 47 (Prepared by the ECRI Institute Evidence-based Practice Center under Contract No ). Rockville, MD: Agency for Healthcare Research and Quality; February AHRQ Publication No. 12-EHC014-EF. Available at References (2 of 9)

53  Guyatt GH, Oxman AD, Vist GE, et al; GRADE Working Group. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008 Apr 26;336(7650): PMID:  Hartman KE, Matchar DB, Chang SM. Assessing applicability of medical test studies in systematic reviews. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June p AHRQ Publication No. 12-EHC017. Available at  Lohr KN, Carey TS. Assessing "best evidence": issues in grading the quality of studies for systematic reviews. Jt Comm J Qual Improv 1999 Sep;25(9): PMID: References (3 of 9)

54  Lord SJ, Irwig L, Simes RJ. When is measuring sensitivity and specificity sufficient to evaluate a diagnostic test, and when do we need randomized trials? Ann Intern Med 2006 Jun 6;144(11): PMID:  MacCannell T, Umscheid CA, Agarwal RK, et al; Healthcare Infection Control Practices Advisory Committee (HICPAC). Guideline for the prevention and control of norovirus gastroenteritis outbreaks in healthcare settings. Infect Control Hosp Epidemiol 2011 Oct;32(10): PMID:  Marchionni L, Wilson RF, Marinopoulos SS, et al. Impact of Gene Expression Profiling Tests on Breast Cancer Outcomes. Evidence Report/Technology Assessment No. 160 (Prepared by the Johns Hopkins University Evidence-based Practice Center Under Contract No ). Rockville, MD: Agency for Healthcare Research and Quality; January AHRQ Publication No. 08-E002. Available at References (4 of 9)

55  Owens DK, Lohr KN, Atkins D, et al. AHRQ series paper 5: grading the strength of a body of evidence when comparing medical interventions—Agency for Healthcare Research and Quality and the Effective Health-Care Program. J Clin Epidemiol 2010 May;63(5): PMID:  Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med 1980 May 15;302(20): PMID:  Ross SD, Allen IE, Harrison KJ, et al. Systematic Review of the Literature Regarding the Diagnosis of Sleep Apnea. Evidence Report/Technology Assessment No. 1 (Prepared by MetaWorks Inc. under Contract No ). Rockville, MD: Agency for Health Care Policy and Research. February AHCPR Publication No. 99- E002. Available at References (5 of 9)

56  Samson D, Schoelles KM. Developing the topic and structuring systematic reviews of medical tests: utility of PICOTS, analytic frameworks, decision trees, and other frameworks. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June p AHRQ Publication No. 12-EHC017. Available at  Schunemann HJ, Oxman AD, Brozek J, et al; GRADE Working Group. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ 2008 May 17;336(7653): PMID: References (6 of 9)

57  Segal JB. Choosing the important outcomes for a systematic review of a medical test. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June p AHRQ Publication No. 12-EHC017. Available at medtestsguide.cfm.  Segal JB, Brotman DJ, Emadi A, et al. Outcomes of Genetic Testing in Adults With a History of Venous Thromboembolism. Evidence Report/Technology Assessment No. 180 (Prepared by the Johns Hopkins University Evidence-based Practice Center under Contract No. HHSA I). Rockville, MD: Agency for Healthcare Research and Quality; June AHRQ Publication No. 09-E011. Available at References (7 of 9)

58  Singh S, Chang SM, Matchar DB, et al. Grading a body of evidence on diagnostic tests. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June p AHRQ Publication No. 12-EHC017. Available at medtestsguide.cfm.  Trikalinos TA, Kulasingam S, Lawrence WF. Deciding whether to complement a systematic review of medical tests with decision modeling. In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June p AHRQ Publication No. 12- EHC017. Available at medtestsguide.cfm. References (8 of 9)

59  Turcios RM, Widdowson MA, Sulka AC, et al. Reevaluation of epidemiological criteria for identifying outbreaks of acute gastroenteritis due to norovirus: United States, 1998–2000. Clin Infect Dis 2006 Apr 1;42(7): PMID: References (9 of 9)


Download ppt "Grading the Strength of a Body of Evidence on Diagnostic Tests Prepared for: The Agency for Healthcare Research and Quality (AHRQ) Training Modules for."

Similar presentations


Ads by Google