
1 Geoff Cumming: LAM, Paris 1 (Friday 11 May, 2012). The New Statistics: Why, How, and Where Next?
Many disciplines rely on null hypothesis significance testing (NHST) and p values, despite their deep flaws having been known for more than half a century. I will explain why estimation, based on effect sizes and confidence intervals, is a much more informative approach. 'The Dance of the p Values' demonstrates one dramatic shortcoming of p values: a replication experiment is likely to give a very different value of p, so p simply cannot be trusted. I refer to estimation--and its extension, meta-analysis--as The New Statistics. The techniques themselves are not new, but for most researchers it would be new, and a highly beneficial change, to switch from NHST to these techniques. I will describe practical ways to use the new statistics. I will use ESCI (Exploratory Software for Confidence Intervals), which runs under Excel, to illustrate concepts and calculate confidence intervals. ESCI is a free download from www.thenewstatistics.com. That website also gives information about my book: Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge. I gave a short radio talk that summarises the main argument for the new statistics; the podcast and transcript are available at http://tinyurl.com/geofftalk. Also a magazine article: http://tiny.cc/GeoffConversation

2 The New Statistics: Why, How, and Where Next?
Geoff Cumming, Statistical Cognition Laboratory, School of Psychological Science, La Trobe University, Melbourne, Australia 3086
G.Cumming@latrobe.edu.au www.latrobe.edu.au/psy/staff/cumming.html
LAM, Paris, Talk 1, 11 May 2012
THANKS TO: Claudia Fritz, and: Bruce Thompson, Sue Finch, Robert Maillardet, Ben Ong, Ross Day, Mary Omodei, Jim McLennan, Sheila Crewther, David Crewther, Melanie Murphy, Cathy Faulkner, Pav Kalinowski, Jerry Lai, Debra Hansen, Mary Castellani, Mark Halloran, Kavi Jayasinghe, Mitra Jazayeri, Matthew Page, Leslie Schachte, Anna Snell, Andrew Speirs-Bridge, Eva van der Brugge, Elizabeth Silver, Jacenta Abbott, Sarah Rostron, Amy Antcliffe, Lisa L. Harlow, Dennis Doverspike, Alan Reifman, Joseph S. Rossi, Frank L. Schmidt, Meng-Jia Wu, Fiona Fidler, Neil Thomason, Claire Layman, Gideon Polya, Debra Riegert, Andrea Zekus, Mimi Williams, Lindy Cumming
© G. Cumming 2012

3 A researcher arrives from Mars, with data, and consults Dr Inference at her store:

                        | Option A: NHST, p values             | Option B: Estimation (effect sizes, confidence intervals, meta-analysis)
Answers the question:   | "Is there an effect?"                | "How large is the effect?" "How precise an answer?"
Used by:                | ~97% of psych, etc                   | 3%? 5%?
Problems?               | Misunderstood, misused               | Reasonable understanding? Some misconceptions
Meta-analysis?          | Irrelevant, but can cause bias       | Integral to meta-analysis

4 RCTs for evidence-based practice in psychology: What stats? Option A or B?
Cathy Faulkner's DPsych thesis studied 104 RCTs in JCCP and BR&T (1999-2003):
• 99% used NHST (sob!)
• 5% reported CIs; fewer interpreted them
• 31% reported and interpreted an effect size
• 78% considered clinical importance
Faulkner, C., Fidler, F., & Cumming, G. (2008). The value of RCT evidence depends on the quality of statistical analysis. Behaviour Research and Therapy, 46, 270-281.

5 Ask clinical psychology researchers (RCT authors):

6 Please rate each item, 1-7:

7 Conclusions from Cathy's results
• NHST dominates (Option A!)
• The NHST thought pattern dominates: "Is there an effect?"
• For EBP, psychologists need to know the size of the effect
• And, when prompted, they know this
• Published RCTs don't provide what psychologists need to know
• An estimation approach ("How large is the effect?") would provide that (Yay for Option B!)
• It is also needed for meta-analysis, the basis for EBP

8 [no transcribed text for this slide]

9 Option B, The New Statistics: Why?
• Estimation tells us what we need to know
• It's simply more informative than NHST
In addition:
• Other disciplines (physics, chemistry…) succeed while using estimation, and rarely using NHST
• Successful psychologists have avoided NHST (Piaget, Skinner, Ebbinghaus…)
• Estimation may feel natural, if there is no p value obligation
• NHST problems (13 of them!)
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: APA. tinyurl.com/klinechap3

10 Option B, The New Statistics: Why? And further:
• Other disciplines are shifting
• Medicine has routinely reported CIs since the 1980s
• The sixth edition of the APA Publication Manual (APA, 2010) states unequivocally: "Wherever possible, base discussion and interpretation of results on point and interval estimates" (p. 34)
• It also gives a format for reporting CIs
• And numerous examples of reporting CIs

11 More on why: p values and replication
• Replication is central in science
• Our statistical analyses should tell us about replication
• Given the p value from an initial experiment, what's likely to happen if you repeat the experiment?
• If you replicate, do you get a similar p? Is p reliable?!
• But first, how do we think about p?
• A great question for statistical cognition research!
• Strangely, p has hardly been studied
Chapter 5

12 …and p has real-life consequences! (Tenure, research grant; PhD, prize, top publication; consolation prize, fair publication)

13 But: p values and replication?
• We'll simulate and ask: What p values? What distribution of p?
• Experimental group (N = 32) and Control group (N = 32)
• Assume a population difference of Cohen's δ = 0.5 (an SMD of half a SD)
• Power = .52, typical for many fields in social and behavioural science
• Dance p page of ESCI chapters 5-6
• Dance of the p values! (It's very drunken!)
• The video: tinyurl.com/danceptrial2, or the link at the book website
Chapter 5
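A minimal simulation sketch of this setup (plain Python rather than ESCI, with a made-up random seed): two independent groups of N = 32 drawn from normal populations differing by δ = 0.5, the experiment replicated many times to show how wildly p varies.

```python
# Sketch only: the 'dance of the p values' for the setup on this slide.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2012)
N, delta, reps = 32, 0.5, 10_000

p = np.empty(reps)
for i in range(reps):
    control = rng.normal(0.0, 1.0, N)        # control group: mean 0, SD 1
    treatment = rng.normal(delta, 1.0, N)    # experimental group: mean 0.5, SD 1
    p[i] = stats.ttest_ind(treatment, control).pvalue

print("first 25 p values :", np.round(p[:25], 3))   # the 'dance': wildly variable
print("proportion p < .05:", (p < .05).mean())      # empirical power, near .5 (the slide quotes .52)
```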

14 Replicate … and get a VERY different p value!?
p interval: an 80% prediction interval for one-tailed p (given a two-tailed p obt)

p obt | One-sided p interval | Two-sided p interval
.001  | (0, .018)            | (.0000002, .070)
.01   | (0, .083)            | (.000006, .22)
.05   | (0, .22)             | (.00008, .44)
.2    | (0, .46)             | (.00099, .70)

• A p value gives only extremely vague information about p next time!
• Any p value could easily have been very different, because of sampling variability!
• Only p < .001 (and perhaps p < .01) tells you anything; all other p values tell you virtually nothing at all!
• p intervals apply for any N, even very large N!
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286-300.
Lai, J., Fidler, F., & Cumming, G. (2011). Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology, 8, 51-62.
Chapter 5
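The table can be reproduced under the large-sample normal model of Cumming (2008). The sketch below is one reconstruction (my code, not ESCI's): the replication z differs from the obtained z by a normal deviate with SD √2, and that prediction interval is converted back to one-tailed p values.

```python
# Sketch: 80% prediction intervals for the one-tailed replication p, given two-tailed p_obt.
import numpy as np
from scipy import stats

def p_interval(p_obt, coverage=0.80):
    z_obt = stats.norm.ppf(1 - p_obt / 2)     # z equivalent of the initial two-tailed result
    sd_rep = np.sqrt(2)                        # SD of (z_replication - z_obt), two independent z's
    # one-sided interval (0, u): 80% of replication one-tailed p values fall below u
    u = 1 - stats.norm.cdf(z_obt - stats.norm.ppf(coverage) * sd_rep)
    # two-sided interval: central 80% of replication z values, converted to one-tailed p
    half = stats.norm.ppf(1 - (1 - coverage) / 2) * sd_rep
    lo = 1 - stats.norm.cdf(z_obt + half)
    hi = 1 - stats.norm.cdf(z_obt - half)
    return (0.0, u), (lo, hi)

for p_obt in (.001, .01, .05, .2):
    one_sided, two_sided = p_interval(p_obt)
    print(f"p_obt {p_obt}: one-sided (0, {one_sided[1]:.3f}), "
          f"two-sided ({two_sided[0]:.2g}, {two_sided[1]:.2g})")
# Output matches the table above: e.g. p_obt = .05 gives (0, .22) and (.00008, .44).
```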

15 Multiple effects in one study: The pattern!
Simulation of 4 studies, and their meta-analysis (MA) combination. Each examined 6 independent effects, each with δ = 0.5.
• Star-jumping is crazy! Yet many disciplines do it all the time…
• 'Failure to replicate'?! Be cautious. MA!
Chapter 5
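A minimal sketch of the meta-analytic combination idea, with made-up study results (simple inverse-variance, fixed-effect weighting; the slide's own simulation is done in ESCI).

```python
# Sketch: combine one effect estimated in 4 hypothetical studies by inverse-variance weighting.
import numpy as np

effects = np.array([0.62, 0.31, 0.55, 0.44])   # hypothetical per-study effect estimates
ses     = np.array([0.25, 0.26, 0.24, 0.25])   # their standard errors (also hypothetical)

w = 1 / ses**2                                  # inverse-variance weights
combined = np.sum(w * effects) / np.sum(w)      # weighted mean effect
se_combined = np.sqrt(1 / np.sum(w))            # SE of the combined estimate
lo, hi = combined - 1.96 * se_combined, combined + 1.96 * se_combined

print(f"combined effect = {combined:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# The MA estimate has a much shorter CI than any single study: the point of MAT.
```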

16 Implications of the variability of p?
• A weird inconsistency in textbooks:
  - Sampling variability of means: whole chapters
  - Sampling variability of CIs: illustrated, part of the definition
  - Sampling variability of p: not mentioned!
• Require reporting of p intervals? E.g. z = 2.05, p = .04, p interval (0, .19)
• Does it make sense to calculate p = .050 so exactly, if you know the p interval is (.00008, .44)?
• Why apply sophisticated 'corrections' to p (Bonferroni, etc.), when p could easily have been very different?
• Whenever you see a p value, bring to mind the (approximate) p interval
Chapter 5

17 In summary, The New Statistics: Why?
• Estimation tells us what we need to know
• It's simply more informative than NHST
In addition:
• Other folks rely on it, and the sun still rises!
• The APA Manual says to use estimation
• NHST problems (13 of them!)
• In particular, p is highly unreliable: you can't trust p!
NHST, the researcher's heroin: can we kick the habit?! The seductive, but illusory, 'certainty' of p!
Chapters 1, 2

18 [no transcribed text for this slide]

19 The New Statistics: How?
The 'New' Statistics are not themselves new, but using them widely in psychology would be new, and highly beneficial.
• Point estimate (e.g. the sample mean, M): our effect size (ES) estimate
• Interval estimate: the 95% CI, e.g. M = 12.2 cm, 95% CI [3.7, 20.7]
• MOE (margin of error): the length of one arm of the CI
Chapter 3
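A sketch of the calculation behind a point estimate, 95% CI, and MOE for a single mean, using made-up data (the data behind the slide's 12.2 cm example are not given here).

```python
# Sketch: point estimate, 95% CI, and MOE for one sample mean (hypothetical measurements, cm).
import numpy as np
from scipy import stats

y = np.array([8.1, 14.9, 3.9, 20.3, 12.5, 15.0, 9.7, 13.2])
N = len(y)
M = y.mean()                              # point estimate of the effect size
se = y.std(ddof=1) / np.sqrt(N)           # standard error of the mean
moe = stats.t.ppf(0.975, df=N - 1) * se   # margin of error: half-width of the 95% CI

print(f"M = {M:.1f} cm, 95% CI [{M - moe:.1f}, {M + moe:.1f}], MOE = {moe:.1f}")
```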

20 The New Statistics: How? Estimation: the six-step plan
1. Use estimation thinking. State estimation questions as: "How much…?", "To what extent…?", "How many…?" Key to a more quantitative discipline?
2. Identify the ESs that best answer the questions
3. From the data, calculate point and interval estimates (CIs) for those ESs
4. Make a picture, including CIs
5. Interpret
6. Use meta-analytic thinking at every stage
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217, 15-26.
Chapters 1, 2, 15

21 A tricognitive analysis: three ways of thinking
• DT, Dichotomous thinking (NHST): Reject or don't reject H0. Is there an effect? At most, a direction of difference. Does the therapy work? (Yes/No)
• ET, Estimation thinking (CIs): How much? How big? To what extent? How large is the effect of the therapy?
• MAT, Meta-analytic thinking (MA): Combine evidence over studies. The single study is part of "a mosaic of study effects". An over-arching, under-pinning idea!
Chapters 1, 2

22 Dichotomous thinking & NHST
• DT is incredibly deeply embedded in our psyches!
• NHST and DT are mutually reinforcing
• DT limits psychology's theorising
• Theories are ordinal, not quantitative (cf. Newton!)
• One of the worst effects of reliance on NHST?
• So let's shift to CIs and MA
• And build quantitative theories, and a quantitative discipline
Gigerenzer, G. (1998). Surrogates for theories. Theory & Psychology, 8, 195-204.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Chapter 2

23 Dichotomous thinking & NHST
In a recent issue of Psychological Science, 8 of 10 articles were based on DT:
Aim: "We predicted that playing a violent video game … would decrease the likelihood of help."
Conclusion: "Participants who played a violent game took significantly longer to help."
Aim: "We hypothesised that stressed participants would exhibit increased risky behavior …"
The usual p = .05 cutoff was relaxed for the conclusion:
Conclusion: "Participants showed a trend toward making a higher number of risky decisions under acute stress… p < .10."
Chapter 2

24 A hint of the ET future?
The other two articles from that issue of Psychological Science were based on ET:
Article 1: "The current study measured the degree to which the public's interpretation of the forecasts … matches the authors' intentions." Discussion focussed on the extent of differences.
Article 2: The aim was "estimating the financial value of pain." Discussion focussed on the price people would pay to avoid pain in various circumstances.
A big take-home message: Use estimation language!
Chapter 2

25 [no transcribed text for this slide]

26 A simple example: the old
Old style, but still extremely common in journals:
Introduction: "The study was designed to test the prediction that, following the new procedure, 6 year olds would show significantly improved reading, but 4 year olds would not."
• NHST is assumed. Only a direction is predicted. DT reigns. "Significantly" is used ambiguously.
Results: "In 6 year olds there was a highly significant improvement, but in 4 year olds the improvement was not significant (p > .05, ns)."
• NO! We need to know how large each improvement was; then we can judge whether it matters! Give us ET!
• Maybe both improved?
Chapter 2

27 A simple example: the new
New style:
Introduction: "The study was designed to estimate reading improvement, following the new procedure, in 6 and 4 year olds."
Even better: "The theory predicted a medium-to-large increase in 6 year olds, but little or no increase in 4 year olds." Better still if the predictions are quantitative.
• ET is assumed. Estimation of ESs is the focus.
Results: "The reading age of 6 year olds increased by 4.5 months, 95% CI [1.9, 7.1], which is a large and educationally substantial increase. That of 4 year olds increased by only a negligible 0.6 months [-0.8, 2.0], …"
• We are given ESs, with CIs, then the ESs are interpreted. Further comment would assess the more specific (ideally quantitative) predictions.
…The New Statistics in action
Chapters 2-6

28 Caveats
• I assume populations have a normal distribution; this is the conventional thing to do
• No mention of: robust statistics, Bayesian statistics, resampling methods, model fitting, etc.
• …even though they are all full of potential
Chapter 3

29 CIs: Interpretation 1 (of six)
• How, pragmatically, should we think about CIs?
• How should we use CIs to interpret results?
CIjumping page of ESCI chapters 1-4
Interpretation 1: One from the dance
• Our 95% CI is a randomly chosen one from the infinite dance, 95% of which include μ, but... it might be red!
Chapter 3

30 CIs: Interpretations 2 and 3
Interpretation 2: Interpret the interval
• We can be 95% confident our interval includes μ
• Values in the interval are plausible for μ; values outside the interval are relatively less plausible for μ
Interpretation 3: A prediction interval for M
• The interval signals, approximately, the bouncing around of replication means
• On average, approximately 83% of future M's lie within our CI
Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics, 3, 299-311.
Chapters 3, 5
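A quick check of the approximately-83% figure, under a simple normal model with σ treated as known (an assumption made only for this sketch, so the 95% CI has fixed width 2 × 1.96 × SE).

```python
# Sketch: how often does a replication mean land inside the original 95% CI?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 50.0, 10.0, 20, 100_000
se = sigma / np.sqrt(n)

m1 = rng.normal(mu, se, reps)             # original sample means
m2 = rng.normal(mu, se, reps)             # replication sample means
captured = np.abs(m2 - m1) < 1.96 * se    # replication mean inside the original CI?

print("simulated capture rate:", captured.mean())
print("analytic capture rate :", 2 * stats.norm.cdf(1.96 / np.sqrt(2)) - 1)   # ~0.834
```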

31 CIs: Interpretation 4 (least preferred!)
Interpretation 4: Use it for NHST
• If the null hypothesised value (e.g. zero) lies outside the 95% CI, reject the null hypothesis at the p = .05 level
• If it lies within the interval, don't reject
• But interpreting CIs in this way ignores much of the information they provide
• And it can prompt incorrect interpretation of simple and common patterns of results
Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26. tinyurl.com/cisbetter
Chapter 1

32 CIs: Interpretation 5
Interpretation 5: The cat's eye picture
• The beautiful shape of a CI!
• The best bets for μ are values near M
• Less good bets towards either CI limit
• Progressively even less beyond the limits
• Different levels of confidence (99%, 90%, …) have different cat's eye shading shapes within the interval
CIfunction page of ESCI chapters 1-4
Cumming, G. (2007). Inference by eye: Pictures of confidence intervals and thinking about levels of confidence. Teaching Statistics, 29, 89-93.
Chapter 4

33 CIs and p values: Benchmarks
• If the H0 value falls at the limit of the 95% CI, the two-tailed p value is .05
• … one third of the way out from M towards the limit, about .50
• … half the way out, about .32
• … 5/6 of the way out, about .10
• … 1/3 of the way beyond the limit, about .01
Given p and μ0, generate the CI in your mind's eye!
CI and p page of ESCI chapters 1-4
Chapter 4
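These benchmarks can be checked under a normal model. In the sketch below, f is the distance of the null value from M, expressed as a fraction of the MOE (f = 1 means the null value sits exactly at the CI limit).

```python
# Sketch: two-tailed p as a function of where the null value sits relative to a 95% CI.
from scipy import stats

def two_tailed_p(f, conf=0.95):
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)    # 1.96 for a 95% CI
    return 2 * (1 - stats.norm.cdf(f * z_crit))    # null value is f * MOE away from M

for f in (1, 1/3, 1/2, 5/6, 4/3):
    print(f"null value at {f:.2f} x MOE from M: p = {two_tailed_p(f):.3f}")
# Prints ~.05, ~.51, ~.33, ~.10, ~.009, matching the slide's benchmarks.
```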

34 34

35 Understanding The New Statistics: Contents
1. Introduction to The New Statistics
2. From Null Hypothesis Significance Testing to Effect Sizes
3. Confidence Intervals
4. Confidence Intervals, Error Bars, and p Values
5. Replication
6. Two Simple Designs
7. Meta-Analysis 1: Introduction and Forest Plots
8. Meta-Analysis 2: Models
9. Meta-Analysis 3: Larger-Scale Analyses
10. The Noncentral t Distribution
11. Cohen's d
12. Power
13. Precision for Planning
14. Correlations, Proportions, and Further Effect Size Measures
15. More Complex Designs and The New Statistics in Practice

36 The New Statistics: How? Some topics:
• Compare two conditions: independent, or paired data
• Randomised control trial (RCT)
• CI on a correlation, r
• CI on a proportion, P
• Cohen's d, and the CI on d
• Statistical power
• Precision for planning
• Meta-analysis
Chapters 6-15

37 Compare two conditions
Two independent groups:
• The difference between the means, with its CI
• Compare the CIs on the separate means (any overlap?)
Paired (repeated measure) design:
• Focus on the paired differences
• We need the CI on the mean difference
• Because the two measures are usually correlated, the two separate CIs are virtually irrelevant for assessing the ES of interest
• The IV in a figure: independent or a repeated measure?!
Several pages of ESCI chapters 5-6
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data. American Psychologist, 60, 170-180. tinyurl.com/inferencebyeye
Chapter 6
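A sketch of the independent-groups case: the difference between two means with its 95% CI, using made-up scores and the standard pooled-variance t interval.

```python
# Sketch: CI on the difference between two independent group means (hypothetical data).
import numpy as np
from scipy import stats

g1 = np.array([24., 31., 28., 22., 30., 27., 25., 29.])   # made-up scores, group 1
g2 = np.array([20., 25., 23., 18., 26., 21., 24., 19.])   # made-up scores, group 2

diff = g1.mean() - g2.mean()
n1, n2 = len(g1), len(g2)
sp2 = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)  # pooled variance
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))                                          # SE of the difference
moe = stats.t.ppf(0.975, df=n1 + n2 - 2) * se

print(f"difference = {diff:.2f}, 95% CI [{diff - moe:.2f}, {diff + moe:.2f}]")
```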

38 Two independent groups: Rules of eye
• Two 95% CIs just touching (zero overlap) indicates fairly strong evidence of a population difference (approx p = .01)
• Moderate overlap (about half the average MOE) is some evidence of a difference (approx p = .05)
• Interpretation 5: the cat's eye CI
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data. American Psychologist, 60, 170-180. tinyurl.com/inferencebyeye
Chapter 6

39 Two independent groups: Rules of eye
• Two 95% CIs just touching (zero overlap) indicates fairly strong evidence of a population difference (approx p = .01)
• Moderate overlap (about half the average MOE) is some evidence of a difference (approx p = .05)
• Interpretation 5: the cat's eye CI
Cumming, G. (2009). Inference by eye: Reading the overlap of independent confidence intervals. Statistics in Medicine, 28, 205-220.
Chapter 6
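A quick calculation showing roughly where those rule-of-eye p values come from, assuming equal MOEs, large samples, and a normal model (a simplification of the Cumming, 2009, analysis).

```python
# Sketch: translate CI overlap into an approximate two-tailed p for two independent means.
import numpy as np
from scipy import stats

def approx_p(overlap_fraction, conf=0.95):
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)   # 1.96 for 95% CIs
    gap = (2 - overlap_fraction) * z_crit         # mean difference, in units of one mean's SE
    z = gap / np.sqrt(2)                          # SE of the difference is sqrt(2) times one SE
    return 2 * (1 - stats.norm.cdf(z))

print("just touching (zero overlap):", round(approx_p(0.0), 3))   # ~.006, i.e. roughly .01
print("overlap of half the MOE     :", round(approx_p(0.5), 3))   # ~.04,  i.e. roughly .05
```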

40 [no transcribed text for this slide]

41 2. Paired (or matched) design
• A repeated measure on a single group of participants:
  - Pretest vs Posttest
  - Experimental treatment vs Control treatment (same participants)
  - A woman vs her sister (participants paired; not a repeated measure)
• Each participant is their own control, so this is often a sensitive design
• Carry-over effects? Counterbalance the order of presentation?
• The higher the correlation between measures, the more sensitive the design
• … and the shorter the CI on the mean difference
• … which must be our focus
Data paired and Simulate paired pages of ESCI chapters 5-6
Chapter 6
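A sketch of why a higher pretest-posttest correlation gives a shorter CI on the mean difference (equal SDs at the two measurements are assumed purely for illustration).

```python
# Sketch: MOE of the paired mean difference shrinks as the correlation r rises.
import numpy as np
from scipy import stats

n, sd, conf = 30, 10.0, 0.95
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)

for r in (0.0, 0.5, 0.8, 0.95):
    sd_diff = np.sqrt(sd**2 + sd**2 - 2 * r * sd * sd)   # SD of the paired differences
    moe = t_crit * sd_diff / np.sqrt(n)                  # MOE of the mean difference
    print(f"r = {r:.2f}: MOE of mean difference = {moe:.2f}")
```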

42 Two t tests
• Two independent groups
• Paired (or matched) design
• Different error term!
• SO, CIs on the separate measures are irrelevant
• No overlap rule is possible for the paired design!
Chapter 6

43 Overlap rule of eye? No! It does not apply to the paired (pretest vs posttest) design.
Compare A B page of ESCI chapters 5-6
Chapter 6

44 Internet study of researchers' understanding of CIs and SE bars
Ask journal authors: "Set two means, with error bars, to be just statistically significantly different."
• Enormous spread! Performance is all over the place!
• Few responses are accurate.
• Little distinction is made between CIs and SE bars!
• Setting the bars (CI or SE) to just touch is popular.
• Repeated measure design (pretest, posttest): only 11% of respondents identified a problem. (Slash wrists…)
Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389-396.
Chapter 6

45 Do we give up on CIs, and statistical reform? Or seek ways to improve how CIs can be understood and used?

46 Questions that should spring to mind…
• What is the DV? (The dependent variable, the measure on the vertical axis)
• What are the two conditions?
• What design? (Two independent groups? Paired?)
• What are the bars? (SE? 95% CI? SD? Some other CI?) They all look exactly the same! (Madness!!)
• Can we apply an overlap rule of eye?
Every figure must provide all this information, in the figure or the figure caption.
Cumming, G., Fidler, F., & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177, 7-11. tinyurl.com/errorbars101
Chapters 4, 6

47 Randomised control trial, in psychology
Means, with 95% CIs.
• These CIs can guide assessment of which comparisons?
• They cannot help with which comparisons?
Figure page of ESCI chapters 14-15
Fidler, F., Faulkner, S., & Cumming, G. (2008). Analyzing and presenting outcomes: Focus on effect size estimates and confidence intervals. In A. M. Nezu & C. M. Nezu (Eds.), Evidence-based outcome research: A practical guide to conducting randomized controlled trials for psychosocial interventions (pp. 315-334). New York: OUP.
Chapter 15

48 An RCT example from Cathy's tutorial on CI formats: plot the mean change score, with its CI.
Faulkner, C., Fidler, F., & Cumming, G. (2008). The value of RCT evidence depends on the quality of statistical analysis. Behaviour Research and Therapy, 46, 270-281.
Chapter 15

49 CI on a correlation, r
• Use Fisher's r to z transformation
• CIs are asymmetric, especially for r near -1 or 1
• CIs are shorter when r is near -1 or 1
• CIs may seem surprisingly wide, unless N is large
r to z and Two correlations pages of ESCI chapters 14-15; Correlations and Diff correlations pages of ESCI Effect sizes
Finch, S., & Cumming, G. (2009). Putting research in context: Understanding confidence intervals from one or more studies. Journal of Pediatric Psychology, 34, 903-916.
Chapter 14
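A sketch of the Fisher r-to-z calculation (the usual normal approximation with SE = 1/√(N − 3)); the example r and N values are made up.

```python
# Sketch: 95% CI on a correlation r via Fisher's r-to-z transformation.
import numpy as np
from scipy import stats

def ci_for_r(r, n, conf=0.95):
    z = np.arctanh(r)                           # Fisher transform of r
    se = 1 / np.sqrt(n - 3)                     # approximate SE on the z scale
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)   # back-transform the limits

print(np.round(ci_for_r(0.50, 30), 2))   # wide and asymmetric around .50
print(np.round(ci_for_r(0.95, 30), 2))   # much shorter when r is near 1
```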

50 CI on a proportion, P
The difference between two proportions (instead of χ²):
ES = 17/20 - 11/20 = .30, [.02, .53]
Diff proportions page of ESCI Effect sizes
Finch, S., & Cumming, G. (2009). Putting research in context: Understanding confidence intervals from one or more studies. Journal of Pediatric Psychology, 34, 903-916.
Chapter 14
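The slide does not say which method ESCI uses; Newcombe's method based on Wilson score intervals is one standard choice, and it reproduces the interval shown.

```python
# Sketch: CI on the difference between two independent proportions (Newcombe / Wilson-score).
import numpy as np
from scipy import stats

def wilson(k, n, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p = k / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return centre - half, centre + half

def diff_proportions_ci(k1, n1, k2, n2, conf=0.95):
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson(k1, n1, conf)
    l2, u2 = wilson(k2, n2, conf)
    d = p1 - p2
    return (d - np.sqrt((p1 - l1)**2 + (u2 - p2)**2),
            d + np.sqrt((u1 - p1)**2 + (p2 - l2)**2))

print(np.round(diff_proportions_ci(17, 20, 11, 20), 2))   # approx [0.02, 0.53], as on the slide
```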

51 The New Statistics: How? Estimation: the six-step plan
1. Use estimation thinking. State estimation questions as: "How much…?", "To what extent…?", "How many…?" Key to a more quantitative discipline?
2. Identify the ESs that best answer the questions
3. From the data, calculate point and interval estimates (CIs) for those ESs
4. Make a picture, including CIs
5. Interpret
6. Use meta-analytic thinking at every stage
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217, 15-26.
Chapters 1, 2, 15

52 The New Statistics: Where next?
• Examples and advice, for many situations and many ESs: ANOVA, multivariate, SEM, model fitting… Tabachnick & Fidell (2007) includes CIs
• New textbooks, new software
• Editors insisting
• Statistical cognition research provides the evidence for evidence-based statistical practice:
  - p values and emotions
  - Thinking about different figures: we need better graphics!
  - Estimation thinking: does it help? Practitioners' thinking?
Beyth-Marom, R., Fidler, F., & Cumming, G. (2008). Statistical cognition: Towards evidence-based practice in statistics and statistics education. Statistics Education Research Journal, 7, 20-39. tinyurl.com/statcog
Chapter 15

53 Queries or comments to: g.cumming@latrobe.edu.au
Geoff's brief radio talk: tinyurl.com/geofftalk
Geoff's short magazine article: tiny.cc/GeoffConversation
Preface, contents & sample chapter: tinyurl.com/tnschapter7
Dance of the p values: tinyurl.com/danceptrial2
Book info, and ESCI: www.thenewstatistics.com
Hug a confidence interval today!

