Thursday Afternoon
- 1:30-2:15 Studies and Systematic Review of Diagnostic Test Accuracy (Tom)
- 2:15-3:00 Prognostic and Genetic Tests (Mark)
- 3:00-3:45 Combining Tests (Michael)
- 3:45-4:00 Break
- 4:00-6:00 Small Groups
- 6:00 Meet in 6702 to head to Giants game
Studies of Diagnostic Test Accuracy
- After lunch. Tom again or Michael to start with incorporation and spectrum bias?
Checklist
- Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?
- Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
- Was the reference standard applied regardless of the diagnostic test result?
- Was the test (or cluster of tests) validated in a second, independent group of patients?
- From Sackett et al., Evidence-based Medicine, 2nd ed. (NY: Churchill Livingstone), p 68
Beyond the Checklist
- Consider not only the possibility of bias, but WHY it may occur and the DIRECTION in which it would affect results
- Incorporation bias
- Spectrum bias
- Verification bias
- Double gold standard bias
Incorporation Bias
- Occurs when the test itself can be incorporated into the gold standard
- Prevented by blinding
Example: Study of BNP as a test for congestive heart failure (CHF)*
- Gold standard: determination of CHF by two cardiologists blinded to BNP
- “The best clinical predictor of congestive heart failure was an increased heart size on chest roentgenogram (accuracy, 81 percent)”
- Is there a problem with assessing accuracy of chest x-rays to diagnose CHF in this study?
- *Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):
- Problem 4.3
Incorporation bias
- Cardiologists not blinded to chest x-ray
- Used (incorporated) chest x-ray for CHF diagnosis
- Incorporation bias for assessment of chest x-ray, not BNP
Spectrum of Disease and Nondisease
- Disease is often easier to diagnose if severe
- “Nondisease” is easier to diagnose if the patient is well than if the patient has other diseases
Spectrum Bias
- Sensitivity depends on the spectrum of disease in the population being tested.
- Specificity depends on the spectrum of non-disease in the population being tested.
- Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality
Spectrum Bias Example: Absence of Nasal Bone as a Test for Trisomy 21*
- Sensitivity = 229/333 = 69%
- Specificity = 5094/5223 = 97.5%
- BUT the D- group only included chromosomally normal fetuses
- *Cicero et al., Ultrasound Obstet Gynecol 2004; 23:
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality
- D- group excluded 295 fetuses with other chromosomal abnormalities (esp. Trisomy 18)
- Among these fetuses, 32% had absent nasal bone (not 2.5%)
- What decision is this test supposed to help with?
- If it is whether to test chromosomes using chorionic villus sampling or amniocentesis, these 295 fetuses should be included in the D+ group!
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in D+ group
- Sensitivity = 324/628 = 52%
- NOT the 69% obtained when the D+ group only included fetuses with Trisomy 21
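The arithmetic on the nasal bone slides can be re-checked with a short sketch. It uses only the counts quoted on the slides; the one assumption (labeled in the code) is that the enlarged D+ group of 628 is the 333 Trisomy 21 fetuses plus the 295 fetuses with other chromosomal abnormalities. Variable names are illustrative.

```python
# Counts quoted on the slides (Cicero et al. 2004).
t21_absent, t21_total = 229, 333            # Trisomy 21: absent nasal bone / all
normal_present, normal_total = 5094, 5223   # chromosomally normal: nasal bone present / all
all_absent, all_total = 324, 628            # any chromosomal abnormality: absent nasal bone / all

sens_t21 = t21_absent / t21_total           # D+ = Trisomy 21 only
spec_normal = normal_present / normal_total # D- = chromosomally normal only
sens_all = all_absent / all_total           # D+ = all chromosomal abnormalities

# Assumption: the enlarged D+ group is Trisomy 21 plus the other abnormalities,
# so the "other abnormality" counts can be recovered by subtraction.
other_total = all_total - t21_total
other_absent = all_absent - t21_absent
frac_other_absent = other_absent / other_total

print(f"Sensitivity, D+ = Trisomy 21 only:     {sens_t21:.1%}")
print(f"Specificity, D- = normal fetuses only: {spec_normal:.1%}")
print(f"Sensitivity, D+ = all abnormalities:   {sens_all:.1%}")
print(f"Absent nasal bone among other abnormalities: {frac_other_absent:.1%}")
```

The subtraction recovers the 295 excluded fetuses and an absent-nasal-bone rate of about 32% among them, which is why moving them into the D+ group pulls sensitivity down from 69% to 52%.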
Verification bias: Example
- Visual assessment of jaundice in newborns
- Study patients who are getting a bilirubin measurement
- Ask clinicians to estimate extent of jaundice at time of blood draw
Visual Assessment of jaundice*: Results
- Sensitivity of jaundice below the nipple line for bilirubin ≥ 12 mg/dL = 97%
- Specificity = 19%
- What is the problem?
- Editor’s Note: The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication. --Catherine D. DeAngelis, MD
- *Moyer et al., Archives Pediatr Adol Med 2000; 154:391
Verification Bias*
- Inclusion criterion for study: gold standard test was done
  - in this case, blood test for bilirubin
- Subjects with positive index tests are more likely to get the gold standard and to be included in the study
  - clinicians usually don’t order a blood test for bilirubin if there is little or no jaundice
- How does this affect sensitivity and specificity?
- *AKA Work-up, Referral Bias, or Ascertainment Bias
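One way to work through the question on this slide is a small numeric sketch. All counts and verification probabilities below are hypothetical (they are not from Moyer et al.); the sketch only assumes that index-test-positive subjects are more likely to receive the gold standard than index-test-negative subjects.

```python
def sens_spec(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical full cohort, as if everyone had received the gold standard.
full = {"tp": 80, "fn": 20, "fp": 30, "tn": 170}

# Hypothetical verification probabilities: a positive index test makes
# the gold standard much more likely to be done.
P_VERIFY_IF_POSITIVE = 0.9
P_VERIFY_IF_NEGATIVE = 0.2

# Expected counts among subjects who actually enter the study.
verified = {
    "tp": full["tp"] * P_VERIFY_IF_POSITIVE,
    "fp": full["fp"] * P_VERIFY_IF_POSITIVE,
    "fn": full["fn"] * P_VERIFY_IF_NEGATIVE,
    "tn": full["tn"] * P_VERIFY_IF_NEGATIVE,
}

true_sens, true_spec = sens_spec(**full)
obs_sens, obs_spec = sens_spec(**verified)

print(f"Full cohort:   sensitivity {true_sens:.0%}, specificity {true_spec:.0%}")
print(f"Verified only: sensitivity {obs_sens:.0%}, specificity {obs_spec:.0%}")
```

Under these assumed numbers, the verified-only sample loses most of its index-test-negative subjects (false negatives and true negatives alike), so observed sensitivity rises and observed specificity falls. That is the same pattern as the 97% / 19% result on the jaundice slide.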
Double Gold Standard Bias - 1
- Two different “gold standards”
- One gold standard (e.g., surgery, invasive test) is more likely to be applied in patients with a positive index test
- Other gold standard (e.g., clinical follow-up) is more likely to be applied in patients with a negative index test.
Double Gold Standard Bias - 2
- There are some patients in whom the two “gold standards” do not give the same answer
  - Spontaneously resolving disease (positive with immediate invasive test, but not with follow-up)
  - Newly occurring or newly detectable disease (positive with follow-up but not with immediate invasive test)
Double Gold Standard Bias, example
- Study Population: All patients presenting to the ED who received a V/Q scan
- Test: V/Q scan
- Disease: Pulmonary embolism (PE)
- Gold Standards:
  1. Pulmonary arteriogram (PA-gram) if done (more likely with more abnormal V/Q scan)
  2. Clinical follow-up in other patients (more likely with normal V/Q scan)
- What happens if some PE resolve spontaneously?
- *PIOPED. JAMA 1990;263(20):
Effect of Double Gold Standard Bias 1: Spontaneously resolving disease
- Test result will always agree with gold standard
- Both sensitivity and specificity increase
- Example: Joe has a small pulmonary embolus (PE) that will resolve spontaneously.
  - If his V/Q scan is positive, he will get an angiogram that shows the PE (true positive)
  - If his V/Q scan is negative, his PE will resolve and we will think he never had one (true negative)
- V/Q scan can’t be wrong!
Effect of Double Gold Standard Bias 2: Newly occurring or newly detectable disease
- Test result will always disagree with gold standard
- Both sensitivity and specificity decrease
- Example: Jane has or will soon get a nasty breast cancer that is currently undetectable
  - If her mammogram is positive, she will get biopsies that will not find the tumor (mammogram will look falsely positive)
  - If her mammogram is negative, she will return in several months and we will think the tumor was initially missed (mammogram will look falsely negative)
- Mammogram can’t be right!
Effect of Double Gold Standard Bias
- Newly occurring or newly detectable disease
  - Sensitivity falsely decreased
  - Specificity falsely decreased
- Spontaneously resolving disease
  - Sensitivity falsely increased
  - Specificity falsely increased
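The two directions summarized above can be illustrated with a toy cohort. All counts are hypothetical. The comparison is against a single gold standard (the immediate invasive test applied to everyone); under the double gold standard, index-test-positive patients get the invasive test and index-test-negative patients get clinical follow-up.

```python
def sens_spec(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts as (index test positive, index test negative).
persistent_pos, persistent_neg = 90, 10    # disease found by either gold standard
healthy_pos, healthy_neg = 85, 765         # no disease by either gold standard
resolving_pos, resolving_neg = 20, 30      # disease now, gone by follow-up (like Joe)
emerging_pos, emerging_neg = 10, 30        # undetectable now, found at follow-up (like Jane)

# Scenario 1: spontaneously resolving disease.
# Single standard: invasive test for all, so resolving disease counts as D+.
single_1 = sens_spec(tp=persistent_pos + resolving_pos,
                     fn=persistent_neg + resolving_neg,
                     fp=healthy_pos, tn=healthy_neg)
# Double standard: test-negative resolving cases look disease-free at follow-up.
double_1 = sens_spec(tp=persistent_pos + resolving_pos,
                     fn=persistent_neg,
                     fp=healthy_pos, tn=healthy_neg + resolving_neg)

# Scenario 2: newly occurring or newly detectable disease.
# Single standard: invasive test for all, so emerging disease counts as D-.
single_2 = sens_spec(tp=persistent_pos, fn=persistent_neg,
                     fp=healthy_pos + emerging_pos,
                     tn=healthy_neg + emerging_neg)
# Double standard: test-positive cases are still biopsy-negative (false positive),
# test-negative cases are found at follow-up (false negative).
double_2 = sens_spec(tp=persistent_pos, fn=persistent_neg + emerging_neg,
                     fp=healthy_pos + emerging_pos, tn=healthy_neg)

for label, single, double in [("Resolving disease", single_1, double_1),
                              ("Emerging disease", single_2, double_2)]:
    print(f"{label}: sensitivity {single[0]:.3f} -> {double[0]:.3f}, "
          f"specificity {single[1]:.3f} -> {double[1]:.3f}")
```

With these assumed counts, both measures rise in the resolving-disease scenario and both fall in the emerging-disease scenario. The size of the shift depends entirely on how many such patients there are.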
Sensitivity is falsely … Specificity is falsely …
Bias | Description | Sensitivity is falsely … | Specificity is falsely …
Incorporation | Gold standard incorporates index test. | |
Spectrum | D+ only includes “sickest of the sick” | |
Spectrum | D- only includes “wellest of the well” | |
Verification | Positive index test makes gold standard more likely. | |
Double Gold Standard | Disease resolves spontaneously | |
Double Gold Standard | Disease becomes detectable during follow-up | |
Systematic Reviews of Diagnostic Accuracy Studies
Meta-analyses of Diagnostic Tests
- Systematic and reproducible approach to finding studies
- Summary of results of each study
- Investigation into heterogeneity
- Summary estimate of results, if appropriate
- Unlike other meta-analyses (risk factors, treatments), results aren’t summarized with a single number (e.g., RR), but with two related numbers (sensitivity and specificity)
- These can be plotted on an ROC plane
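As a minimal sketch of what "plotted on an ROC plane" means, each study contributes one point with x = 1 - specificity and y = sensitivity. The three studies and their 2x2 counts below are invented for illustration and do not come from any review cited in these slides.

```python
def roc_point(tp, fn, fp, tn):
    """Map one study's 2x2 counts to ROC-plane coordinates (1 - specificity, sensitivity)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 1 - specificity, sensitivity

# Hypothetical studies: (TP, FN, FP, TN)
studies = {
    "Study A": (45, 5, 30, 70),
    "Study B": (80, 20, 10, 90),
    "Study C": (30, 20, 5, 95),
}

points = {name: roc_point(*counts) for name, counts in studies.items()}

for name, (x, y) in points.items():
    print(f"{name}: 1 - specificity = {x:.2f}, sensitivity = {y:.2f}")
```

Points scattered along a curve from lower left to upper right suggest studies using different positivity thresholds; points scattered in other directions suggest heterogeneity worth investigating before pooling.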
MRI for the diagnosis of MS
- Whiting et al. BMJ 2006;332:875-84
Dermoscopy vs Naked Eye for Diagnosis of Malignant Melanoma
- Dermoscopy performed unequivocally better in 7 of the 9 studies. Can you circle results for the 2 studies for which this was not the case?
- Br J Dermatol Sep;159(3):669-76
Kharbanda et al. Pediatrics 2005; 116(3): 709-16
Example: A clinical decision rule to identify children at low risk for appendicitis (Problem 5.6)
- Study design: prospective cohort study
- Subjects
  - 4140 patients 3-18 years presenting to Boston Children’s Hospital ED with abdominal pain
  - Of these, 767 (19%) received surgical consultation for possible appendicitis
  - 113 excluded (chronic diseases, recent imaging)
  - 53 missed
  - 601 included in the study (425 in derivation set)
Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis
- Predictor variable
  - Standardized assessment by pediatric ED attending
  - Focus on “Pain with percussion, hopping or cough” (complete data in N=381)
- Outcome variable:
  - Pathologic diagnosis of appendicitis (or not) for those who received surgery (37%)
  - Follow-up telephone call to family or pediatrician 2-4 weeks after the ED visit for those who did not receive surgery (63%)
Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis
- Results: Pain with percussion, hopping or cough
- 78% sensitivity and 83% NPV seem low to me. Are they valid for me in deciding whom to image?
Checklist
- Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis?
- Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)?
- Was the reference standard applied regardless of the diagnostic test result?
- Was the test (or cluster of tests) validated in a second, independent group of patients?
- From Sackett et al., Evidence-based Medicine, 2nd ed. (NY: Churchill Livingstone), p 68
In what direction would these biases affect results?
- Sample not representative (population referred to pedi surgery)?
  - Sample NOT representative. Prevalence of appendicitis is too high for a decision about imaging.
- Verification bias?
  - Verification bias is probably operating: lack of pain with hopping would make me LESS likely to seek surgical consultation. But this would bias sensitivity UP.
- Double gold standard bias?
  - Double gold standard bias COULD operate if some cases of appendicitis resolve spontaneously, but this would bias sensitivity and specificity UP.
- Spectrum bias?
  - Spectrum bias probably operates for specificity, not sensitivity. Presumably the non-appendicitis cases referred to pedi surgery looked more like appendicitis, and are therefore likely to have a higher false-positive rate for pain with hopping than those not studied.
For children presenting with abdominal pain to SFGH 6-M
- Sensitivity probably valid (not falsely low)
  - But whether all of them tried to hop is not clear
- Specificity probably low
- PPV is too high
- NPV is too low
- Does not address surgical consultation decision
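The points about PPV and NPV follow from their dependence on prevalence, which a short sketch can make concrete. The 78% sensitivity is the figure quoted on the results slide. The specificity and both prevalence values are assumptions chosen only for illustration, and sensitivity and specificity are held fixed to isolate the prevalence effect.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

SENSITIVITY = 0.78          # quoted on the results slide
ASSUMED_SPECIFICITY = 0.60  # hypothetical, not from the paper

# Hypothetical prevalences: a referred surgical-consult population
# versus all children presenting with abdominal pain.
referred = predictive_values(SENSITIVITY, ASSUMED_SPECIFICITY, prevalence=0.40)
unselected = predictive_values(SENSITIVITY, ASSUMED_SPECIFICITY, prevalence=0.10)

print(f"Referred (prevalence 40%):   PPV {referred[0]:.0%}, NPV {referred[1]:.0%}")
print(f"Unselected (prevalence 10%): PPV {unselected[0]:.0%}, NPV {unselected[1]:.0%}")
```

Under these assumptions, moving from the high-prevalence referred population to a lower-prevalence one lowers PPV and raises NPV. That is the sense in which the study's PPV is too high and its NPV too low for the broader group.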