Studies of Diagnostic Tests Thomas B. Newman, MD, MPH October 11, 2012.


1 Studies of Diagnostic Tests Thomas B. Newman, MD, MPH October 11, 2012

2 Reminders/Announcements
- OK (encouraged!) to help each other, but give credit
- HW: Write down answers to as many of the problems in the book as you can (not just those assigned) and check your answers!
- Homework/exam problem due by 11/15 (preferably sooner)
- Final exam to be passed out 11/29, reviewed 12/6
- Tom and Michael away next week at meeting of the Society for Medical Decision Making
  - Screening lecture by Dr. Andi Marmor

3 Overview
- Common biases of studies of diagnostic test accuracy
- Prevalence, spectrum and nonindependence
- Meta-analyses of diagnostic tests
- Checklist & systematic approach
- Examples:
  - Pain with percussion, hopping or cough for appendicitis
  - Clinical diagnosis of pertussis

4 Bias #1 Example
- Study of BNP to diagnose congestive heart failure (CHF; Chapter 4, Problem 3)

5 Bias #1 Example
- Gold standard: determination of CHF by two cardiologists blinded to BNP
- "The best clinical predictor of congestive heart failure was an increased heart size on chest roentgenogram (accuracy, 81 percent)"*
- Is there a problem with assessing the accuracy of chest x-rays to diagnose CHF in this study?

*Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.

6 Bias #1: Incorporation bias
- Cardiologists were not blinded to the chest x-ray
- They probably used (incorporated) the chest x-ray to make the final diagnosis
- Incorporation bias affects the assessment of the chest x-ray (not of BNP)
- Biases both sensitivity and specificity upward

[Figure ©2000 British Medical Journal Publishing Group]

7 Bias #2 Example
- Visual assessment of jaundice in newborns
  - Study patients who are getting a bilirubin measurement
  - Ask clinicians to estimate extent of jaundice at time of blood draw
  - Compare with blood test

8 Visual Assessment of jaundice*: Results
- Sensitivity of jaundice below the nipple line for bilirubin ≥ 12 mg/dL = 97%
- Specificity = 19%
- What is the problem?

Editor's Note: "The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there's some other indication." --Catherine D. DeAngelis, MD

*Moyer et al., Arch Pediatr Adolesc Med 2000; 154:391

9 Bias #2: Verification Bias* - 1
- Inclusion criterion for the study: the gold standard test was done
  - in this case, a blood test for bilirubin
- Subjects with positive index tests are more likely to get the gold standard and to be included in the study
  - clinicians usually don't order a blood test for bilirubin if there is little or no jaundice
- How does this affect sensitivity and specificity?

*AKA work-up bias, referral bias, or ascertainment bias

10 Bias #2: Verification Bias*

                           TSB ≥ 12   TSB < 12
Jaundice below nipple         a          b
No jaundice below nipple     c ↓        d ↓

Sensitivity, a/(a+c), is biased ___. Specificity, d/(b+d), is biased ___.

*AKA work-up bias, referral bias, or ascertainment bias
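The direction of this bias can be worked out with a small numerical sketch (all counts and probabilities below are invented for illustration; they are not from the Moyer study). Verifying every index-test-positive baby but only a fraction of the test-negative ones depletes cells c and d, inflating sensitivity and deflating specificity:

```python
# Hypothetical complete 2x2 table: index test (jaundice below nipple) vs. TSB >= 12.
# All numbers are invented for illustration.
a, b, c, d = 80, 400, 20, 500   # rows: test +/-, columns: D+/D-

true_sens = a / (a + c)          # 0.80
true_spec = d / (b + d)          # ~0.56

# Verification bias: all index-test-positive babies get the blood test,
# but only 10% of index-test-negative babies are verified (and included).
p_verify_neg = 0.10
c_v = c * p_verify_neg           # depleted false negatives
d_v = d * p_verify_neg           # depleted true negatives

biased_sens = a / (a + c_v)      # rises toward 1
biased_spec = d_v / (b + d_v)    # falls toward 0

print(f"sensitivity: true {true_sens:.2f} -> biased {biased_sens:.2f}")
print(f"specificity: true {true_spec:.2f} -> biased {biased_spec:.2f}")
```

With these made-up numbers, sensitivity is inflated from 0.80 to about 0.98 and specificity deflated from 0.56 to about 0.11, the same pattern as the jaundice study's 97% sensitivity and 19% specificity.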

11 Visual Assessment of jaundice*: Results
- Recall the "gold standard" was bilirubin ≥ 12 mg/dL
- Specificity = 19%
- This low specificity was a clue! What does it mean?
- It means only 19% of newborns who don't have a bilirubin ≥ 12 mg/dL are free of jaundice below the nipple line
- That is, 81% of babies with bilirubin < 12 mg/dL are jaundiced below the nipple line

*Moyer et al., Arch Pediatr Adolesc Med 2000; 154:391

12 Does This Child Have Appendicitis? JAMA 2007;298:438-451
- RLQ pain: sensitivity = 96%, specificity = 5% (1 − specificity = 95%)
- Likelihood ratio = 1.0
- RLQ pain was present in 96% of those with appendicitis and in 95% of those without appendicitis
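The LR of 1.0 on this slide follows directly from the definition LR+ = sensitivity / (1 − specificity); a quick check:

```python
sens = 0.96  # RLQ pain present in 96% of children with appendicitis
spec = 0.05  # RLQ pain absent in only 5% of children without appendicitis

lr_pos = sens / (1 - spec)  # 0.96 / 0.95, just over 1
print(f"LR+ = {lr_pos:.2f}")  # a test with LR near 1 barely changes the odds of disease
```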

13 Bias #3
- Example: PIOPED study of the accuracy of the ventilation/perfusion (V/Q) scan to diagnose pulmonary embolism (blood clot in the lungs)*
- Study population: all patients presenting to the ED who received a V/Q scan
- Test: V/Q scan
- Disease: pulmonary embolism (PE)
- Gold standards:
  - 1. Pulmonary arteriogram (PA-gram) if done (more likely with a more abnormal V/Q scan)
  - 2. Clinical follow-up in other patients (more likely with a normal V/Q scan)

*PIOPED. JAMA 1990;263(20):2753-9.

14 Double Gold Standard Bias
- Also called differential verification bias
- Two different "gold standards"
  - One gold standard (usually an immediate, more invasive test, e.g., angiogram, surgery) is more likely to be applied in patients with a positive index test
  - The second gold standard (e.g., clinical follow-up) is more likely to be applied in patients with a negative index test

15 Double Gold Standard Bias
- There are some patients in whom the two "gold standards" do not give the same answer
  - Spontaneously resolving disease (positive with immediate invasive test, but not with follow-up)
  - Newly occurring or newly detectable disease (positive with follow-up but not with immediate invasive test)

16 Effect of Double Gold Standard Bias 1: Spontaneously resolving disease
- Test result will always agree with the gold standard
- Both sensitivity and specificity increase
- Example: Joe has a small pulmonary embolus (PE) that will resolve spontaneously
  - If his V/Q scan is positive, he will get an angiogram that shows the PE (true positive)
  - If his V/Q scan is negative, his PE will resolve and we will think he never had one (true negative)
- The V/Q scan can't be wrong!

17 Effect of Double Gold Standard Bias 2: Newly occurring or newly detectable disease
- Test result will always disagree with the gold standard
- Both sensitivity and specificity decrease
- Example: Jane has a nasty breast cancer, but it is currently undetectable by biopsy
  - If her mammogram is positive, she will get biopsies that will not find the tumor (mammogram will look falsely positive)
  - If her mammogram is negative, she will return in several months and we will think the tumor was initially missed (mammogram will look falsely negative)
- The mammogram can't be right!
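Case 1 (spontaneously resolving disease) can be sketched numerically; all counts and test characteristics below are invented for illustration. Test-negative patients whose disease resolves before follow-up are scored as true negatives, while test-positive patients are confirmed by the immediate invasive test, so both sensitivity and specificity come out higher than their true values:

```python
# Invented cohort: disease that persists, disease that resolves spontaneously,
# and no disease. The index test has the same true accuracy in all groups.
persist, resolve, healthy = 100, 50, 850
sens_true, spec_true = 0.80, 0.90

tp_persist = sens_true * persist   # 80 detected persistent cases
tp_resolve = sens_true * resolve   # 40 detected resolving cases
fn_persist = persist - tp_persist  # 20 missed persistent cases
fn_resolve = resolve - tp_resolve  # 10 missed resolving cases
fp = (1 - spec_true) * healthy     # 85 false positives
tn = spec_true * healthy           # 765 true negatives

# Double gold standard: test-positives get the immediate invasive test
# (which finds the disease); test-negatives get follow-up. The 10 missed
# resolving cases have disappeared by follow-up, so they are scored D-.
sens_biased = (tp_persist + tp_resolve) / (tp_persist + tp_resolve + fn_persist)
spec_biased = (tn + fn_resolve) / (tn + fn_resolve + fp)

print(f"sensitivity: true {sens_true:.3f} -> biased {sens_biased:.3f}")  # 0.800 -> 0.857
print(f"specificity: true {spec_true:.3f} -> biased {spec_biased:.3f}")  # 0.900 -> 0.901
```

Swapping resolving disease for newly detectable disease reverses the scoring (every such patient counts as a test error), pushing both numbers down instead, as in the mammography example.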

18 Spectrum of Disease, Nondisease and Test Results
- Disease is often easier to diagnose if it is severe
- "Nondisease" is easier to diagnose if the patient is well than if the patient has other diseases
- Test results will be more reproducible if ambiguous results are excluded

19 Spectrum Bias
- Sensitivity depends on the spectrum of disease in the population being tested
- Specificity depends on the spectrum of nondisease in the population being tested
- Example: absence of the nasal bone (on 13-week ultrasound) as a test for chromosomal abnormality

20 Spectrum Bias Example: Absence of Nasal Bone as a Test for Chromosomal Abnormality*

Nasal bone absent    D+     D-     Total   LR
Yes                  229    129    358     27.8
No                   104    5094   5198    0.32
Total                333    5223   5556

Sensitivity = 229/333 = 69%, BUT the D+ group only included fetuses with Trisomy 21.

*Cicero et al., Ultrasound Obstet Gynecol 2004; 23:218-23
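The LR column and the 69% sensitivity can be reproduced from the counts in the table (LR = probability of the result in D+ divided by its probability in D−):

```python
# Counts from the Cicero et al. table above
absent_dpos, absent_dneg = 229, 129     # nasal bone absent
present_dpos, present_dneg = 104, 5094  # nasal bone present
dpos, dneg = 333, 5223                  # column totals

lr_absent = (absent_dpos / dpos) / (absent_dneg / dneg)     # about 27.8
lr_present = (present_dpos / dpos) / (present_dneg / dneg)  # about 0.32
sens = absent_dpos / dpos                                   # 229/333, about 69%

print(f"LR(absent) = {lr_absent:.1f}, LR(present) = {lr_present:.2f}, sensitivity = {sens:.0%}")
```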

21 Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
- The D+ group excluded 295 fetuses with other chromosomal abnormalities (mainly Trisomy 18)
- Among these fetuses, the sensitivity of nasal bone absence was 32% (not 69%)
- What decision is this test supposed to help with?
  - If it is whether to test chromosomes using chorionic villus sampling or amniocentesis, these 295 fetuses should be included!

22 Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in the D+ group
- Sensitivity = 324/628 = 52%, vs. the 69% obtained when the D+ group only included fetuses with Trisomy 21
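The 52% figure is just the sensitivity pooled over both groups of fetuses, using 95 of 295 other-abnormality fetuses with an absent nasal bone (about 32%, the split used on slide 24):

```python
# Trisomy 21 fetuses (Cicero et al.): nasal bone absent in 229 of 333
t21_absent, t21_total = 229, 333
# Other chromosomal abnormalities: nasal bone absent in 95 of 295
other_absent, other_total = 95, 295

sens_t21 = t21_absent / t21_total                                   # about 69%
sens_other = other_absent / other_total                             # about 32%
sens_all = (t21_absent + other_absent) / (t21_total + other_total)  # 324/628, about 52%

print(f"T21 only: {sens_t21:.0%}, other abnormalities: {sens_other:.0%}, combined: {sens_all:.0%}")
```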

23 Quiz: What if we considered nasal bone absence as a test for Trisomy 21 (only)?
- Then instead of excluding subjects with other chromosomal abnormalities or including them as D+, we should count them as D-
- Compared with excluding them:
  - What would happen to sensitivity?
  - What would happen to positive predictive value?

24 Quiz: What if we considered nasal bone absence as a test for Trisomy 21?

Nasal bone absent    D+     D-
Yes                  229    129 + 95 = 224
No                   104    5094 + 200 = 5294
Total                333    5223 + 295 = 5518

Compared with excluding patients with other trisomies:
- Sensitivity unchanged.
- PPV would decrease (95 more false positives) from 64% to 51%.
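The PPV figures on this slide can be checked directly from the counts above:

```python
tp = 229  # Trisomy 21 fetuses with an absent nasal bone (true positives)

# Excluding the other-trisomy fetuses: 129 false positives
ppv_excluded = tp / (tp + 129)        # 229/358, about 64%

# Counting other trisomies as D-: their 95 absent nasal bones
# become additional false positives
ppv_included = tp / (tp + 129 + 95)   # 229/453, about 51%

print(f"PPV: {ppv_excluded:.0%} -> {ppv_included:.0%}")
```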

25 Summary of biases

Bias                  Description                                           Sensitivity is falsely …  Specificity is falsely …
Incorporation         Gold standard incorporates index test.
Spectrum              D+ only includes "sickest of the sick";
                      D- only includes "wellest of the well."
Verification          Positive index test makes gold standard more likely.
Double Gold Standard  Disease resolves spontaneously;
                      disease becomes detectable during follow-up.

26 Prevalence, spectrum and nonindependence
- Prevalence (prior probability) of disease may be related to disease severity
- One mechanism is different spectra of disease or nondisease
- Another is that whatever is causing the high prior probability is related to the same aspect of the disease as the test

27 Prior probability, spectrum and nonindependence: examples
- Diseases identified by screening or incidentally, where higher prevalence is associated with lower severity
  - Prostate cancer
  - Thyroid cancer
- Diseases where higher prevalence is associated with greater severity
  - Iron (Fe) deficiency
  - Higher prevalence of TB where HIV is more prevalent; TB is also more severe there

28 Prior probability, spectrum and nonindependence: examples
- Symptoms of disease associated with the aspect of disease being tested: urinalysis as a test for UTI in women with more and fewer symptoms (high and low prior probability)*

*EBD Table 5.3, from Lachs, Ann Intern Med 1992; 117:135-40

29 Overfitting

30
- Choosing the best cutoff based on the data (small problem)
- Choosing the best cutoffs for the best combination of multiple tests (big problem; covered in 2 weeks)

31 Meta-analyses of Diagnostic Tests
- Systematic and reproducible approach to finding studies
- Summary of the results of each study
- Investigation into heterogeneity
- Summary estimate of results, if appropriate
- Unlike other meta-analyses (risk factors, treatments), results aren't summarized with a single number (e.g., RR), but with two related numbers (sensitivity and specificity)
- These can be plotted on an ROC plane

32 MRI for the diagnosis of MS
Whiting et al. BMJ 2006;332:875-84

33 SROC: Predicting post-op MI or death in elective noncardiac surgery patients

Figure 1. Summary receiver operating characteristic (SROC) curves for the 25 stress echocardiography studies (closed diamonds) and the 50 stress nuclear scintigraphy studies (open squares).

Beattie WS et al. Anesth Analg 2006;102:8-16. ©2006 by Lippincott Williams & Wilkins

34 Dermoscopy vs Naked Eye for Diagnosis of Malignant Melanoma
Br J Dermatol. 2008 Sep;159(3):669-76
- Dermoscopy performed unequivocally better in 7 of the 9 studies.
- Can you call out the coordinates of the 2 studies for which this was not the case?

35 Studies of Diagnostic Test Accuracy: Checklist
- Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
- Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
- Was the reference standard applied regardless of the diagnostic test result?
- Was the test (or cluster of tests) validated in a second, independent group of patients?

From Sackett et al., Evidence-Based Medicine, 2nd ed. (NY: Churchill Livingstone), 2000, p. 68

36 Systematic Approach
- Authors and funding source
- Research question
- Study design
- Study subjects
- Predictor variable
- Outcome variable
- Results & analysis
- Conclusions

37 A clinical decision rule to identify children at low risk for appendicitis (Problem 5.6)*
- Study design: prospective cohort study
- Subjects
  - 4140 patients 3-18 years presenting to Boston Children's Hospital ED with abdominal pain
  - 767 (19%) received surgical consultation for possible appendicitis
    - 113 excluded (chronic diseases, recent imaging), 53 missed
  - 601 included in the study (425 in the derivation set)

*Kharbanda et al. Pediatrics 2005; 116(3):709-16

38 A clinical decision rule to identify children at low risk for appendicitis
- Predictor variables
  - Standardized assessment by pediatric ED attending
  - Focus on "pain with percussion, hopping or cough" (complete data in N=381)
- Outcome variable:
  - Pathologic diagnosis of appendicitis (or not) for those who received surgery (37%)
  - Follow-up telephone call to the family or pediatrician 2-4 weeks after the ED visit for those who did not receive surgery (63%)

Kharbanda et al. Pediatrics 116(3):709-16

39 A clinical decision rule to identify children at low risk for appendicitis
- Results: pain with percussion, hopping or cough
- 78% sensitivity and 83% NPV seem low to me. Are they valid for me in deciding whom to image?

Kharbanda et al. Pediatrics 116(3):709-16

40 Checklist
- Was there an independent, blind comparison with a reference ("gold") standard of diagnosis?
- Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
- Was the reference standard applied regardless of the diagnostic test result?
- Was the test (or cluster of tests) validated in a second, independent group of patients?

From Sackett et al., Evidence-Based Medicine, 2nd ed. (NY: Churchill Livingstone), 2000, p. 68

41 In what direction would these biases affect results?
- Sample not representative (population referred to pediatric surgery)?
- Verification bias?
- Double gold standard bias?
- Spectrum bias?

42 For children presenting with abdominal pain to SFGH 6-M
- Sensitivity probably valid (not falsely low)
  - But whether all of the kids in the study tried to hop is not clear
- Specificity probably low
- PPV is too high
- NPV is too low
- Does not address the surgical consultation decision

43 Does this coughing patient have pertussis?*
- RQ (for us): what are the LRs for coughing fits, whoop, and post-tussive vomiting in adults with persistent cough?
- Design (for one study we reviewed**): prospective cross-sectional study
- Subjects: 217 adults ≥ 18 years with cough for 7-21 days and no fever or other clear cause for cough, enrolled by 80 French GPs
  - In a subsample from 58 GPs, of 710 patients who met the inclusion criteria only 99 (14%) enrolled

*Cornia et al. JAMA 2010;304(8):890-896
**Gilberg S et al. J Infect Dis 2002;186:415-8

44 Pertussis diagnosis
- Predictor variables: "GPs interviewed patients using a standardized questionnaire."
- Outcome variable: laboratory evidence of pertussis based on any of:
  - Culture (N=1)
  - PCR (N=36)
  - ≥ 2-fold change in anti-pertussis toxin IgG (N=40)
  - Total N = 70/217 with evidence of pertussis (32%)

Gilberg S et al. J Infect Dis 2002;186:415-8

45 Results
- 89% in both groups (with and without laboratory "evidence of pertussis") met CDC criteria for pertussis

Gilberg S et al. J Infect Dis 2002;186:415-8

46 Issues
- Verification bias: only 14% of eligible subjects included
  - Subjects with more pertussis symptoms probably more likely to be included
- Questionable "gold standard"

47 What is wrong with this picture?
- Outcome variable: evidence of pertussis based on any of:
  - Culture (N=1)
  - PCR (N=36)
  - ≥ 2-fold change in anti-pertussis toxin IgG (N=40)
  - Total N = 70/217 with evidence of pertussis (32%)
- Protocol apparently included serologic tests and PCR on all, but culture only if it could be plated in < 4 hours
- Not much overlap!

48 Issues
- Correlation between the serologic and PCR pertussis tests (derived from Table 1 of Gilberg et al.*)

. tab PT_IGG_change PCR [fw=pop]

  PT_IGG_change |        PCR
                |    POS     NEG |   Total
  --------------+----------------+--------
            POS |      6      30 |      36
            NEG |     34      53 |      87
  --------------+----------------+--------
          Total |     40      83 |     123

*Gilberg S et al. J Infect Dis 2002;186:415-8

49 Issues
- Nice illustration of the difficulty of doing a systematic review!
- Important take-home message: you can't judge study quality only by looking at the methods! You need to look at the results, too!

. kap PT_IGG_change PCR [fw=pop]

                Expected
   Agreement   Agreement     Kappa   Std. Err.        Z   Prob>Z
  --------------------------------------------------------------
      47.97%      57.25%   -0.2171      0.0899    -2.41   0.9921
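The Stata `kap` result can be reproduced by hand from the 2×2 table on the previous slide, using kappa = (observed agreement − chance-expected agreement) / (1 − chance-expected agreement):

```python
# Serology (rows: POS, NEG) vs. PCR (cols: POS, NEG), from Gilberg et al.
table = [[6, 30],
         [34, 53]]
n = sum(sum(row) for row in table)  # 123

observed = (table[0][0] + table[1][1]) / n  # agreement on the diagonal
row_tot = [sum(r) for r in table]                                 # [36, 87]
col_tot = [table[0][0] + table[1][0], table[0][1] + table[1][1]]  # [40, 83]
expected = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2    # chance agreement

kappa = (observed - expected) / (1 - expected)
print(f"agreement {observed:.2%}, expected {expected:.2%}, kappa {kappa:.4f}")
```

The negative kappa (−0.22) means the two tests agreed less often than expected by chance, which is what makes the combined "gold standard" so questionable.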

50 Table 1 from the paper
Gilberg S et al. J Infect Dis 2002;186:415-8

51 Questions?

52 Additional slides

53 Double Gold Standard Bias: effect of spontaneously resolving disease

              PE +   PE -
V/Q Scan +     a      b
V/Q Scan -     c      d

Sensitivity, a/(a+c), biased __. Specificity, d/(b+d), biased __:
- Double gold standard compared with immediate invasive test for all
- Double gold standard compared with follow-up for all

54 Double Gold Standard Bias: effect of newly occurring cases

              PE +   PE -
V/Q Scan +     a      b
V/Q Scan -     c      d

Sensitivity, a/(a+c), biased __. Specificity, d/(b+d), biased __:
- Double gold standard compared with PA-gram for all
- Double gold standard compared with follow-up for all

