1 Using and understanding numbers in health news and research Heejung Bang, PhD Department of Public Health Weill Medical College of Cornell University
2 A rationale for today’s talk Coffee was bad yesterday, is good today, and will be bad again tomorrow. “It's the cure of the week or the killer of the week, the danger of the week,” says B. Kramer. “I've seen so many contradictory studies with coffee that I've come to ignore them all,” says D. Berry. What to believe? For a while, you may just keep drinking coffee.
3 Hardly a day goes by without a new headline about the supposed health risks or benefits of some thing… Are these headlines justified? Often, the answer is NO.
4 R. Peto phrases the nature of the conflict this way: “Epidemiology is so beautiful and provides such an important perspective on human life and death, but an incredible amount of rubbish is published,” by which he means the results of observational studies that appear daily in the news media and often become the basis of public-health recommendations about what we should or should not do.
5 3 major reasons for coffee-like situations Confounding Multiple testing Faulty design/sample selection
6 Topics to be covered today 1. Numbers in press release 2. Lies, Damn Lies & Statistics 3. Association vs. Causation 4. Experiment (e.g., RCT) vs. Observational study 5. Replicate or Perish 6. Hierarchy of evidence and study design 7. Meta-analysis 8. Multiple testing 9. Same words, different meanings? 10. Data sharing 11. Other Take-Home messages
7 1. Numbers in press release No p-value, no odds or hazard ratio in a press release! -- Ask people on the street “what is a p-value?” -- Only we may laugh if I make a statistical joke using 0.05, 1.96 and 95%, etc.
8 What is a p-value? In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one observed, under the null hypothesis. -- If there is no hypothesis, there is no test and no p-value. In current statistical training and practice, statistical testing and p-values are overemphasized. However, a p-value (1 number, 0-1) can be useful for decision making. -- You cannot say “it depends” all the time, although it can be true.
9 Numerator & denominator Always try to check numerator and denominator (and when, how long) Try to read footnotes under * -- 100% increase can be 1 → 2 cases -- 20% event rate can be 1 out of 5 samples
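A minimal sketch of the point above, using made-up numbers (the function names and the 10,000-person denominator are assumptions for illustration only): the same "100% increase" can be a tiny change once the denominator is shown.

```python
def relative_increase(before: int, after: int) -> float:
    """Relative change in the number of cases, as a percentage."""
    return 100.0 * (after - before) / before


def absolute_risk_change(before: int, after: int, denominator: int) -> float:
    """Change in risk, in percentage points of the population at risk."""
    return 100.0 * (after - before) / denominator


# Hypothetical example: cases go from 1 to 2 among 10,000 people.
rel = relative_increase(1, 2)
abs_change = absolute_risk_change(1, 2, 10_000)

print(f"relative increase: {rel:.0f}%")
print(f"absolute increase: {abs_change:.2f} percentage points")
```

Reporting both numbers side by side is one way to keep the headline figure honest.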
10 Large Number myths With large N, one will more likely find a difference when a difference truly exists – notion of statistical power. However, many fundamental problems (e.g., bias, confounding and wrong sample selection) CANNOT be cured by large N. (more later) Combining multiple incorrect stories can create more serious problems than reporting a single incorrect story. (more later in meta) N>200,000 needed to detect 20% reduction in mortality (Mann, Science 1990) Means (and t-tests) can be very dangerous because, with large N, everything is significant -- Perhaps, for DNA and race, Watson should see the entire distribution or SD!
11 2. Lies, damned lies & statistics There are three kinds of lies --B Disraeli & M Twain --- Title speaks for itself “J Robins makes statistics tell the truth: Numbers in the service of health” (Harvard Gazette interview) If numbers/statistics are properly generated and used, they can be the best piece of empirical evidence. --- some empirical evidence is almost always good to have --- it is hard to fight with numbers (and age)!
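To illustrate the "with large N, everything is significant" remark, here is a rough sketch assuming a two-sample z-test with a known, equal SD and equal group sizes (a simplification chosen for this example; the difference of 0.02 SD is also made up). The same trivial difference goes from clearly non-significant to overwhelmingly significant as N grows.

```python
import math


def two_sided_p_for_mean_diff(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test comparing two group means (equal SD, equal n)."""
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference
    z = abs(diff) / se
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# A difference of 0.02 SD is practically negligible, whatever the p-value says.
p_values = {
    n: two_sided_p_for_mean_diff(diff=0.02, sd=1.0, n_per_group=n)
    for n in (100, 10_000, 1_000_000)
}
for n, p in p_values.items():
    print(f"n per group = {n:>9,d}   p = {p:.3g}")
```

This is why looking at the effect size and the whole distribution, not just the p-value, matters with very large samples.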
12 Some Advice No statistics is better than bad statistics. Just present your data (e.g., N=3) when statistics are not necessary. Descriptive statistics vs. inferential statistics If you use wrong stats, you can be on the news. See ‘Statistical flaw trips up study of bad stats’. Nature 2006
13 3. Association vs. Causation #1 error in health news: Association = Causation In 1748, D. Hume stated ‘we may define a cause to be an object followed by another… where, if the first object had not been, the second never had existed.’ ---this is a true cause! A more profound quote from Hume is ‘All arguments concerning existence are founded on the relation of cause and effect.’
14 Misuses and abuses of “causes” You may avoid the words ‘cause’, ‘responsible’, ‘influence’, ‘impact’ or ‘effect’ in your paper or press release (esp., title), if results are obtained from observational studies. Instead you may use ‘association’ or ‘correlation’. Often, “may/might” is not enough. The media misuse this and the public misunderstands it severely --- Every morning, we hear that new causes of some disease have been found.
15 50% risk reduction, 20% risk reduction, and so on. If you add up, by now all causes of cancer (& many other diseases) should have been identified. Almost all are association, not causation. -- there are an exceedingly large number of associated and correlated factors, compared to true causes. -- a survey of 246 suggested coronary risk factors. Hopkins & Williams (1981) -- I believe cancer has >1000 risk factors. ‘Too many don’t do’ is no better than ‘do anything’.
16 Why Association ≠ Causation? Confounders, aka third variable(s) Biggest threat to any observational study. Definition of ‘confound’: vt. Throw (things) into disorder; mix up; confuse. (Oxford Dictionary) However, confounders CANNOT be defined in terms of statistical notions alone (Pearl)
17 Confounder samplers Grey Hair vs. heart attack Stork vs. birth rate Rock & Roll vs. HIV Eating late & weight gain? Drinking (or match-carrying) & lung cancer No father’s name & infant mortality Long leg & skin cancer Vitamins/HRT, too? Any remedy? -- first thing to do is ‘Use common sense’. Think about any other (hidden) factor or alternative explanation.
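The grey hair vs. heart attack example can be sketched as a small simulation. All probabilities below are invented for illustration, and age is assumed to be the only confounder: grey hair has no effect on heart attacks in this toy world, yet a crude comparison shows a clear association that vanishes once we compare within age groups.

```python
import random


def simulate(n: int = 200_000, seed: int = 0):
    """Toy population: age drives both grey hair and heart attacks (assumed numbers)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        old = rng.random() < 0.5
        grey = rng.random() < (0.8 if old else 0.1)       # depends on age only
        attack = rng.random() < (0.10 if old else 0.01)   # depends on age only
        rows.append((old, grey, attack))
    return rows


def risk(rows, grey_value, old_value=None):
    """Heart-attack risk among people with the given hair (and optionally age) status."""
    selected = [a for (o, g, a) in rows
                if g == grey_value and (old_value is None or o == old_value)]
    return sum(selected) / len(selected)


rows = simulate()
crude_diff = risk(rows, True) - risk(rows, False)
diff_old = risk(rows, True, True) - risk(rows, False, True)
diff_young = risk(rows, True, False) - risk(rows, False, False)

print(f"crude risk difference (grey vs. not): {crude_diff:.3f}")
print(f"risk difference among the old:        {diff_old:.3f}")
print(f"risk difference among the young:      {diff_young:.3f}")
```

Stratifying works here only because the confounder is known and measured; a hidden third variable cannot be handled this way.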
18 Common sense & serendipity Common sense is the basis for most of the ideas for designing scientific investigations. --- M Davidian although we should not ignore the importance of serendipity in science
19 By the way, why are ‘causes’ so important? If causes can be removed, susceptibility ceases to matter (Rose 1985) and the outcome will not occur. Neither associated nor correlated factors have this power. Gladly, some efforts have been made: ‘Distinguishing Association from Causation: A Backgrounder for Journalists’ from American Council on Science and Health
20 Greenland’s Dictum (Science 1995) There is nothing sinful about going out and getting evidence, like asking people how much do you drink and checking breast cancer records. There’s nothing sinful about seeing if that evidence correlates. There’s nothing sinful about checking for confounding variables. The sin comes in believing a causal hypothesis is true because your study came up with a positive result, or believing the opposite because your study was negative.
21 Association to causation? In 1965, Hill proposed a set of the following causal criteria: 1. Strength 2. Consistency 3. Specificity 4. Temporality (i.e., cause before effect) 5. Biological gradient 6. Plausibility 7. Coherence 8. Experiment 9. Analogy However, Hill also said “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non”.
22 Another big problem: bias and faulty design/samples Selection bias: the distortion of a statistical analysis, due to the method of collecting samples. The easiest way to cheat (intentionally or unintentionally) -- Make group1: group2 = healthy people: sick people. -- Oftentimes, treatment is bad in observational studies, why? -- Do a survey among your friends only -- People are different from the beginning?? (e.g., vegetarians vs. meat-lover, HRT users vs. non-users) Case-control study & matching: easy to say but hard to do correctly. -- Vitamin C and cancer For any comparison: FAIRNESS is most important! -- Numerous other biases exist
23 Would you believe these p-values? (Cameron and Pauling, 1976) This famous study has failed to replicate 16 or so times! Pauling received two Nobel Prizes.
24 4. Experiment vs. Observational study Although the arguing from experiments and observations by induction be no demonstration of general conclusions, yet it is the best way of arguing which the nature of things admits of. --- I Newton Newton’s "experimental philosophy" of science: Science should not, as Descartes argued, be based on fundamental principles discovered by reason, but based on fundamental axioms shown to be true by experiments.
25 Why are clinical trials important? Randomized Controlled Trial (RCT) is the most common form of experiment on humans. ‘Average causal effects’ can be estimated from an experiment. -- To know the true effect of treatment within a person, one should be treated and untreated at the same time. Experimentation trumps observation. (power of coin-flip! Confounders disappear.) Very difficult to cheat in RCTs (due to randomization and protocol). “Causality: God knows but humans need a time machine. When God is busy and no time machine is available, an RCT would do.”
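The "power of coin-flip" can be sketched with a toy simulation (the 30% share of older people and the self-selection probabilities are assumptions made up for this example). When a coin assigns treatment, a background factor such as age ends up balanced between arms; when people choose for themselves, it does not.

```python
import random


def simulate(assign, n: int = 100_000, seed: int = 0):
    """Return (old, treated) pairs; `assign` decides who gets treated."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        old = rng.random() < 0.3
        treated = assign(rng, old)
        rows.append((old, treated))
    return rows


def coin_flip(rng, old):
    """Randomized assignment: ignores every characteristic of the person."""
    return rng.random() < 0.5


def self_selection(rng, old):
    """Observational setting (assumed): older people seek treatment more often."""
    return rng.random() < (0.8 if old else 0.2)


def share_old(rows, treated_value):
    """Proportion of older people within one arm."""
    arm = [old for (old, treated) in rows if treated == treated_value]
    return sum(arm) / len(arm)


rct = simulate(coin_flip)
obs = simulate(self_selection)
imbalance_rct = share_old(rct, True) - share_old(rct, False)
imbalance_obs = share_old(obs, True) - share_old(obs, False)

print(f"age imbalance, randomized:    {imbalance_rct:+.3f}")
print(f"age imbalance, self-selected: {imbalance_obs:+.3f}")
```

The same balancing applies to factors nobody measured, which is what statistical adjustment in an observational study cannot guarantee.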
26 Problems/issues of RCTs Restrictive settings Human subjects under experiments Can be unethical or infeasible Short terms 1-2 treatments, 1-2 doses only Limited generalizability Other issues: blinding, drop-out, compliance
27 Problems/issues of observational studies Bias & confounding Post-hoc arguments about biological plausibility must be viewed with some skepticism since the human imagination seems capable of developing a rationale for most findings, however unanticipated (Ware 2003). i.e., retrospective rationalization. We are tempted to ‘Pick & Choose’! Data-dredging, Fishing expedition, Significance-chasing (p<0.05) Observational studies can overcome some limitations of RCTs.
28 Ideal attitudes RCTs and observational studies should complement each other, rather than compete. --because real life stories can be complicated. When RCTs and observational studies conflict, generally (not always) go with RCTs. Even if you conduct an observational study, try to think in an RCT way. (e.g., a priori 1-2 hypotheses, protocol, data analysis plan, ask yourself ‘Is this result likely to replicate in an RCT?’)
29 Quotes for observational studies The data are still no more than observational, no matter how sophisticated the analytic methodology – anonymous reviewer Observational studies are not a substitute for clinical trials no matter how sophisticated the statistical adjustment may seem – D. Freedman No fancy statistical analysis is better than the quality of the data. Garbage in, garbage out, as they say. So whether the data is good enough to need this level of improvement, only time will tell. – J. Robins Remark: However, advanced statistical technique, causal inference, may help.
30 Some studies are difficult Diet/alcohol: Type/amount, How to measure? Do you remember what we ate last week? Exercise/physical activity/SES: Can we measure? Do you tell the truth? -- people tend to say ‘yes’, ‘moderately’ Long term cumulative effects Positive thinking and spirituality? Quality and value of life: How to define and measure -- priceless?
31 5. Replicate or perish Publish or perish: Old era vs. Replicate or perish: New era Replicability of scientific findings can never be overemphasized. Results being ‘significant’ or ‘predictive’ without being replicable misinform the public and needlessly expend time and resources, and they are no service to investigators and science – S. Young Given that we currently have too many findings, often with low credibility, replication and rigorous evaluation become as important as or even more important than discovery – J. Ioannidis (2006) -- Pay more attention to the 2nd study!
32 Examples of highly cited heart-disease studies that were later contradicted (Ioannidis 2005) -- The Nurses' Health Study, showing a 44% relative risk reduction in coronary disease in women receiving hormone therapy. Later refuted by the Women's Health Initiative, which found that hormone treatment significantly increases the risk of coronary events. -- Two large cohort studies, the Health Professionals Follow-Up Study and the Nurses' Health Study, and an RCT all found that vitamin E was associated with a significantly reduced risk of coronary artery disease. But larger randomized trials subsequently showed no benefit of vitamin E on coronary disease.
33 More Ioannidis Ioannidis (2005) serves as a reminder of the perils of small trials, nonrandomized trials, and those using surrogate markers. He concludes "Evidence from recent trials, no matter how impressive, should be interpreted with caution when only one trial is available. It is important to know whether other similar or larger trials are still ongoing or being planned. Therefore, transparent and thorough trial registration is of paramount importance to limit premature claims [of] efficacy."
34 More Freedman Modeling, the search for significance, the preference for novelty, and lack of interest in assumptions --- these norms are likely to generate a flood of nonreproducible results
35 6. Hierarchy of evidence and study design
36 What goes on top? ANSWER is total evidence. RCT can provide strong evidence for a causal effect, especially if its findings are replicated by other studies
37 When you read the article, you may check the study design Cross-sectional study: which is first? what is cause and what is effect? e.g., depression vs. obesity Prospective cohort studies: much better but still not causal Prospective is generally better than retrospective RCT is better than non-RCT
38 7. Meta-analysis Statistical technique for systematic literature review There are 3 things you should not watch being made: law, sausage & meta-analysis No data collection, but nothing is free. Can you find all studies in the universe, including ones in researchers’ file drawers? Or at least an unbiased subsample? Can Google or PubMed do it? NO! Publication bias (favoring positive studies) and language bias, etc. Much bigger problem in observational studies than in RCTs. Combining multiple incorrect stories is worse than one incorrect story.
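The file-drawer problem can be sketched as follows. This is a toy model under stated assumptions: the true effect is exactly zero, every study has the same standard error (0.2, made up), and only studies with a significantly positive result get published. Averaging the published studies then gives a clearly non-zero "effect".

```python
import random
import statistics

SE = 0.2  # assumed standard error of every study's estimate


def simulate_estimates(n_studies: int = 2000, true_effect: float = 0.0,
                       se: float = SE, seed: int = 0):
    """Draw one effect estimate per study around the true effect."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, se) for _ in range(n_studies)]


estimates = simulate_estimates()

# Assumed publication rule: only significantly positive results leave the file drawer.
published = [e for e in estimates if e / SE > 1.96]

mean_all = statistics.mean(estimates)
mean_published = statistics.mean(published)

print(f"studies run: {len(estimates)}, published: {len(published)}")
print(f"mean effect, all studies:       {mean_all:+.3f}")
print(f"mean effect, published studies: {mean_published:+.3f}")
```

A literature search can only see the `published` list, which is why pooling it does not recover the truth.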
39 Funny (real) titles of papers about meta-analysis Meta-analysis: apples and oranges, or fruitless Apples and oranges (and pears, oh my!): the search for moderators in meta-analysis Of apples and oranges, file drawers and garbage: why validity issues in meta-analysis will not go away Meta analysis/shmeta-analysis Meta-analysis of clinical trials: a consumer's guide. Publication bias in situ
40 8. Multiple testing Multiple testing/comparisons refers to the testing of more than one hypothesis at a time. When many hypotheses are tested, and each test has a specified Type I error probability (α), the probability that at least 1 Type I error is committed increases with the number of hypotheses. Bonferroni method: α=0.05/# of tests A thorny issue for many researchers. -- Bonferroni might be the most hated statistician in history. -- ‘Escaping the Bonferroni iron claw in ecological studies’ by García et al. (2004)
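A small sketch of the arithmetic behind this slide, assuming the tests are independent and all null hypotheses are true (the function names are illustrative, not from any library): the chance of at least one false positive grows quickly with the number of tests, and the Bonferroni threshold pulls it back to about α.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """P(at least one Type I error) across independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under the Bonferroni method."""
    return alpha / n_tests


for m in (1, 5, 20, 100):
    unadjusted = familywise_error(0.05, m)
    adjusted = familywise_error(bonferroni_alpha(0.05, m), m)
    print(f"{m:>3} tests: unadjusted {unadjusted:.3f}, with Bonferroni {adjusted:.3f}")
```

Bonferroni is conservative when tests are correlated, which is one reason it is unpopular; the sketch only covers the independent case.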
41 Two errors Type I (false positive: rejecting H0 when it is true) vs. Type II (false negative: failing to reject H0 when it is false) -- Controlling Type I is more important in stat and court. (e.g., innocent → guilty: disaster!) -- In other fields, Type II can be more important. α=p=0.05 – is this the law in science? Do you commit errors only 5% of the time in your life? α=5% seems reasonable for one research question/publication.
43 Multiple testing in different forms Subgroup analyses -- You should always do subgroup analyses but never believe them. – R. Peto -- Multiple testing adjustment and cross-validation may be solutions. Trying different cutpoints (e.g., tertiles, quintiles, etc.) -- A priori chosen cutpoints or multiple testing adjustment can be solutions. Nothing is free. To look more, you have to pay.
44 Multiple testing (underlying mechanism) Lottery tickets should not be free. In random and independent events as the lottery, the probability of having a winning number depends on the N of tickets you have purchased. When one evaluates the outcome of a scientific work, attention must be given not only to the potential interest of the ‘significant’ outcomes but also to the N of ‘lottery tickets’ the authors have ‘bought’. Those having many have a much higher chance of ‘winning a lottery prize’ than of getting a meaningful scientific result. It would be unfair not to distinguish between significant results of well-planned, powerful, sharply focused studies, and those from ‘fishing expeditions’, with a much higher probability of catching an old truck tyre than of a really big fish. --- García et al. (2004)
45 Multiple testing disaster I In the 1970s, every disease was reported to be associated with an HLA allele (schizophrenia, hypertension.... you name it!). Researchers did case-control studies with 40 antigens, so there was a very high probability of at least one significant result. This result was reported without any mention of the fact that it was the most significant of 40 tests --- R. Elston
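As a worked check of "very high probability", assume (for illustration only) that the 40 antigen tests are independent, each run at α = 0.05, and that no antigen is truly associated with the disease:

```latex
P(\text{at least one } p < 0.05) = 1 - (1 - 0.05)^{40} \approx 0.87,
\qquad
E[\text{number of false positives}] = 40 \times 0.05 = 2 .
```

So under these assumptions a "significant" antigen is the expected outcome of such a study, not a discovery.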
46 Multiple testing disaster II Association between reserpine (then a popular antihypertensive) and breast cancer. Shapiro (2004) gave the history. His team published initial results that were extensively covered by media with a huge impact on research community. When the results did not replicate, he confessed that the initial findings were chance due to thousands of comparisons involving hundreds of outcomes and hundreds of exposures. He hopes that we learned for the future from his mistake.
47 Multiple testing disaster III You are what your mother eats (Mathews et al. 2008). All over the news and the internet. Over 50,000 Google hits in the 1st week. Numerous comparisons were conducted. Sodium, calcium, potassium, etc. were significant (p<0.05), but sodium was dismissed on the claim that it is hard to measure accurately. --possible ‘pick and choose’! Other problems: lack of biological credibility, difficulty in dietary data.
48 Leaving no trace (Shaffer 2007) Usually these attempts through which the experimenter passed don't leave any traces; the public will only know the result that has been found worth pointing out; and as a consequence, someone unfamiliar with the attempts which have led to this result completely lacks a clear rule for deciding whether the result can or can not be attributed to chance.
49 If you keep testing without controlling α Everything is Dangerous – S. Young It is fairly easy to find risk factors for premature morbidity or mortality. Indeed, given a large enough study and enough measured factors and outcomes, almost any potentially interesting variable will be linked to some health outcome – Christenfeld et al. 2004. Even checking 1000 correlations can be a sin – S. Young The only thing to fear is fear itself……………………. …..………………………………and everything else
50 Multiple testing adjustment In RCTs: mandatory (by FDA) -- If not, more (interim) looks would lead to what you want In genetic/genomic studies: almost mandatory -- think about # of genes! In observational studies: almost infeasible Realistic strategies can be: 1. α=5% for one hypothesis. Adjust for multiple testing or state clearly how many tests/comparisons you conducted. 2. Think and act in RCT ways.
51 Replication, again Replication is the universal solution for multiplicity and subgroup analyses (Vandenbroucke 2008) In genome-wide analyses, it is a prerequisite for publication (Khoury et al. 2007) -- However, replication is for someone else! The data analysis strategy of splitting the data into two parts, testing and verification, can be considered.
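The split-sample strategy mentioned above might look like the sketch below. Everything here is hypothetical: 200 subjects, 50 pure-noise predictors, and a simple Pearson correlation as the screening statistic. The best-looking predictor is picked in the discovery half only, and then checked once in the verification half.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_indices(n: int, seed: int = 0):
    """Randomly split subject indices into discovery and verification halves."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]


# Hypothetical data: the outcome is unrelated to every predictor.
rng = random.Random(1)
n_subjects, n_predictors = 200, 50
outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
predictors = [[rng.gauss(0, 1) for _ in range(n_subjects)]
              for _ in range(n_predictors)]

discovery, verification = split_indices(n_subjects)


def corr_on(rows, j):
    """Correlation between predictor j and the outcome, using only `rows`."""
    return pearson([predictors[j][i] for i in rows], [outcome[i] for i in rows])


# Screen all predictors in the discovery half, then test the winner once.
best = max(range(n_predictors), key=lambda j: abs(corr_on(discovery, j)))
r_discovery = corr_on(discovery, best)
r_verification = corr_on(verification, best)

print(f"best predictor: #{best}")
print(f"correlation in discovery half:    {r_discovery:+.3f}")
print(f"correlation in verification half: {r_verification:+.3f}")
```

The key design choice is that the verification half is never used for selection; only one pre-chosen test is run on it, so α=5% keeps its usual meaning there.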
52 9. Same words, different meanings? Professionals vs. lay terms (or even among scientific disciplines): not always the same e.g., risk, hazard, odds, likelihood, rate, prevalence, incidence, valid, unbiased, consistent, cost-effective (≠cheap), efficient, SD vs. SE People on the street may not distinguish RCT from observational study. As easy and intuitive as possible but should be correct
53 10. Data sharing Although … is tax-supported, its data are not available to us. ….Policies governing data dissemination need to be reconsidered, although due regard must be paid to patient confidentiality. Only by thorough scrutiny can error be avoided. Transparency is the best assurance of scientific quality. (Freedman & Petitti 2005) Open the data to the public (after sufficient de-identification). “Alas, we don’t have the process” – D. Kenney responding to ‘File-drawer problem, revisited’ by Young & Bang (2004)
54 So, who is responsible? Authors/scientists: lack of integrity, pressure to get grant, pub, CV, want to be famous or on the news, publish to publish Editors: too many journals, reviewers are busy, 2 wks review time, shortage of stat reviewers Media: don’t think critically, to surprise or shock people Lay persons: like more shocking news, may not use common sense We are all responsible for all ---Dostoevsky (Rose’s Epi book)
55 Try to remember as scientists If false positive and false negative results continue to be produced with disturbing frequency, it might be so true that ‘we are fast becoming a nuisance to society….people don’t take us seriously anymore, and when they do… we may unintentionally do more harm than good’ --- Trichopoulos. Remember it is extremely difficult to un-shock the shocked. Any researchers who use observational studies may want to remind themselves of one question when they do research: ‘is this result likely to be reproducible in an RCT (if it will ever happen)?’ Transparency!!!! Ultimately, data to be shared.
56 Try to remember as readers RCT vs. Obs studies (remember hierarchy) RCT is the best available but not perfect Use your common sense, but don’t forget serendipity (Start from the null. Ask ‘Why’, rather than ‘Why Not’?) Check N (denom & num, large N does not fix bias) Think about Third variables (i.e., confounders) Be careful about meta-analysis Do not worship p-value (<0.05) Perhaps, by Chance? Multiple testing. How many questions/analyses? Anything hidden?
57 Today’s Quotes In God we trust; all others must bring data (protocol and SAS output). --- W. Deming (& K. Griffin) All models are wrong, but some are useful. --- G. Box Do whatever you want. But you should be responsible for what you do.
58 Useful reading (research articles) Bang, H. (2009) Introduction to Observational Studies. Young, SS. and Bang, H. (2004) The file-drawer problem, revisited. Science. Taubes, G. (1995) Epidemiology faces its limits. Science. Freedman, DA. (2008) Oasis or Mirage. Chance. Shapiro, S. (2004) Looking to the 21st century: have we learned from our mistakes, or are we doomed to compound them? Ioannidis, JPA. (2006) Evolution and translation of research findings: From bench to where? PLOS. Breslow, NE. (2003) Are statistical contributions to medicine undervalued? Biometrics. Austin, PC. (2006) Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. Begg, C. (2001) The search for cancer risk factors: when can we stop looking?
59 Useful reading (newspaper articles) Do we really know what makes us healthy? --- NYT 2007 Scientists do the numbers: Coffee is good for you -- no, it's bad. Epidemiological studies can come up with some crazy results, causing some critics to wonder if they're really worthwhile. --- LAT 2007 Women's Health Studies leave questions in place of certainty --- NYT 2006 Why so much medical research is rot --- The Economist 2007.
60 Do we really know what makes us healthy? ‘We know exactly why certain people commit suicide. We don’t know, within the ordinary concepts of causality, why certain others don't commit suicide. …. We know a great deal more about the causes of physical disease than we do about the causes of physical health.’ --- Scott Peck, MD, in the book ‘The Road Less Travelled’.