1 The New Statistics: Estimation and Research Integrity 1 Geoff Cumming School of Psychological Science, La Trobe University, Melbourne, Australia APS-SMEP Workshop, APS Convention, San Francisco Thursday 22 May 2014 This PowerPoint file: tiny.cc/geoffdocs Tutorial article: The New Statistics: Why and How tiny.cc/tnswhyhow THANKS TO: Alan Kraut, Kate McMahon, The Australian Research Council, Neil Thomason, Fiona Fidler, and many others © G. Cumming 2014

2 The new statistics  Effect sizes, confidence intervals, meta-analysis  …which is Estimation  The techniques are not new, but using them widely would, in many disciplines, be new Sections 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis Take-home message: Intuitions about variability—the dances 2

3 Understanding The New Statistics (New York: Routledge, 2012) 1. Introduction to The New Statistics 2. From Null Hypothesis Significance Testing to Effect Sizes 3. Confidence Intervals 4. Confidence Intervals, Error Bars, and p Values 5. Replication 6. Two Simple Designs 7. Meta-Analysis 1: Introduction and Forest Plots 8. Meta-Analysis 2: Models 9. Meta-Analysis 3: Larger-Scale Analyses 10. The Noncentral t Distribution 11. Cohen’s d 12. Power 13. Precision for Planning 14. Correlations, Proportions, and Further Effect Size Measures 15. More Complex Designs and The New Statistics in Practice 3

4 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis 4

5 The Boots anti-ageing stampede  British J. Dermatology, online, 2009: A cosmetic ‘anti-ageing’ product improves photoaged skin: A double-blind, randomized controlled trial  “…statistically significant improvement in facial wrinkles as compared to baseline assessment (p =.013), whereas vehicle-treated skin was not significantly improved (p =.11)”  Media reports: “significant clinical improvement in facial wrinkles…”  Queues at Boots for ‘No. 7 Protect & Perfect Intense Beauty Serum’ Watson, R. E. B., et al. (2009). British Journal of Dermatology, 161, Chapter 2

6 The Boots anti-ageing stampede  Concluding ‘no effect’ for placebo is accepting the null hypothesis  Statistical criticisms, then revised article: “non-significant trend…”  p values and CIs closely linked  Should assess the difference directly, but it’s a common error:  Watson, R. E. B., et al. (2009). British Journal of Dermatology, 161, p =.013 p =.11 Chapter 2

7 Comparing significance levels is everywhere “…incorrect procedure… in which researchers conclude that effects differ when one effect is significant (p < .05) but the other is not (p > .05). We reviewed 513 … articles in Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience and found that 78 used the correct procedure and 79 used the incorrect procedure.” Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E-J. (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience, 14,

8 Conclusions, so far  Presentation format can matter—a lot  Null Hypothesis Significance Testing (NHST) promotes dichotomous thinking (an effect exists, or it doesn’t)  NHST: seductive, but illusory ‘certainty’  CIs can prompt better interpretation …and are highly informative  p values and CIs are closely linked, but there are important differences 8 Chapter 2


10 Evidence? (statistical cognition—only psychology can do it)  Show authors in medical and psychology journals a figure of two studies’ results, with CIs.  Ask them to rate: “Results of the two are broadly consistent, or similar”  Ask for comments; classify these as ‘mention NHST’ or no such mention. Conclude:  Even researchers who see CIs often think in terms of NHST  Interpretation is better if they avoid NHST and think in terms of intervals  Don’t report p values as well as CIs 10 Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26, 1-9. tiny.cc/cisbetter

11 Time for a crusade? The New Statistics: Effect sizes, confidence intervals, meta-analysis …are not themselves new, but using them widely in psychology would be new, and highly beneficial 11

12 p values—how do we think about them?  The p value:  …is central to research thinking  …has hardly been studied  A great question for statistical cognition research  How do people think about p?  … talk about p?  … feel about p? 12 Chapter 5

13 …and p has real-life consequences! Tenure, research grant; PhD, prize, top publication; consolation prize, fair publication 13

14 But: p values and replication?  Given the p value from an initial experiment  What’s likely to happen if you replicate—do you get a similar p?  We’ll simulate and ask: What p values?  Experimental (N = 32) and Control (N = 32) groups  Assume population difference of Cohen’s δ = 0.5 (a medium effect?)  Power = .52, typical for many fields in social and behavioural science Dance p page of ESCI chapters 5-6 (free download)  Dance of the p values (It’s very drunken!)  The video: tiny.cc/dancepvals or tiny.cc/dancepvals2 14 Chapter 5
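A minimal simulation of this dance can be run outside ESCI; the sketch below (my own, assuming numpy and scipy are available) replays the same two-group experiment and prints the two-tailed p from each replication:

```python
# "Dance of the p values": identical replications of a two-group
# experiment (N = 32 per group, population delta = 0.5) give wildly
# different p values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, delta = 32, 0.5           # per-group N; population effect in SD units

for rep in range(10):
    control = rng.normal(0.0, 1.0, N)
    experimental = rng.normal(delta, 1.0, N)
    _, p = stats.ttest_ind(experimental, control)
    print(f"replication {rep + 1:2d}: p = {p:.3f}")
```

Across runs, p typically ranges from well below .01 to well above .05—the drunken dance the video shows.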

15 Replicate …get a VERY different p value p interval: 80% prediction interval for one-tailed p (given two-tailed p_obt)—and it is independent of N!
p_obt = .001 → p interval ( , .070)
p_obt = .01 → p interval ( , .22)
p_obt = .05 → p interval (.00008, .44)
p_obt = .2 → p interval (.00099, .70)
 Any p could easily have been very different. (That’s sampling variability)  A p value gives only extremely vague information about p next time!  Researchers severely underestimate p intervals! (Med, Psych, Stats) Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, Lai, J., Fidler, F., & Cumming, G. (2011). Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, tiny.cc/subjectivep 15 Chapter 5

16 Traditional ANOVA table 16  Star-jumping is crazy, yet common.  ‘Failure to replicate’? Be cautious. Meta-analysis.  Interpretation may be based on p values, and little else? Chapter 5

17 Implications of the variability of p ?  Weird inconsistency in textbooks:  Sampling variability of means—whole chapters  Sampling variability of CIs—illustrated, part of definition  Sampling variability of p—not mentioned!  Require reporting of p intervals? E.g. p =.04, p interval (0,.19)  See a p value, think of the (approx) p interval More generally:  Sampling variability is so large—remember the dances 17 Chapter 5

18 Reasons for NHST? Claims that significance testing is needed:  to identify which results are real and which due to chance,  to determine whether or not an effect exists,  to ensure that data analysis is objective, and  to make clear decisions, as in practice we need to do. In every case: NO, estimation does better Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. ). Mahwah, NJ: Erlbaum. 18

19 Section 1 Conclusions p can mislead, CIs inform Our p tells virtually nothing about the dance Our CI gives useful information about the dance  A p value indicates strength of evidence against a null hypothesis  A p value does not signal its unreliability, CI length does signal uncertainty  p values and CIs are closely linked  A CI estimates the size of an effect, which is what we want to know  Estimation is more informative than dichotomous NHST NEXT: Further reasons for reform of statistical and other research practices 19 Chapter 2

20 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis 20

21 A title to die for The replication crisis …from cancer research to social psychology, some published findings won’t replicate Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine 2: e124. tiny.cc/mostfalse 21

22 Why are most false? The Ioannidis argument: The imperative to achieve statistical significance explains: 1. selective publication—file drawer 2. data selection, tweaking, and p-hacking until p is sufficiently small 3. why we think any finding that once meets the criterion of statistical significance is true and doesn’t require replication Many false positives published  ==> Most published findings false  22

23 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, tiny.cc/falsepositivepsych 23

24 2. Data selection, tweaking, p hacking The Simmons et al. argument: p-hacking—it’s very easy to:  test a few extra participants  drop or add dependent variables  select which comparisons to analyze  drop some results as aberrant  try a few different statistical analysis strategies  then finally choose which of all the above to report Many degrees of freedom ==> always find statistical significance  Numerous published results are false positives   Many won’t replicate, not that we do many replications  24

25 Many false positives published?  Low-power studies (e.g., power = .50, for a medium effect); dark = statistically significant  Selective publication ==> high proportion of false positives  Selection, tweaking, p-hacking ==> turns some ns results into stat sig; red = spurious stat sig  Now an even higher proportion of false positives  If a journal prefers surprise, false positives are selectively published  Therefore an even higher proportion of false positives 25

26 Research integrity ≈ Open science The problems: 1. Selective publication 2. p-hacking, and other dubious data analytic practices 3. Lack of replication; replications fail (the crisis) Two meanings of ‘research integrity’  Completeness and validity of the published research literature  Requires solution to all three problems  Ethical, morally correct behaviour of researchers How do we achieve research integrity? 1: Make results of all competent research available, somehow 2: Avoid dubious data analytic practices: No p-hacking 1 and 2: Report everything in full, accurate detail 3: Carry out replications 26

27 Research integrity: A new-statistics perspective The new statistics should help in some ways:  Remove the imperative for statistical significance  Remove the dichotomous mindset of replication as Yes or No  Emphasise meta-analysis and thus the importance of cumulation, replication, and making ALL results available But may not solve other problems:  Career imperative to publish in best journals  Selection in data analysis and reporting  Journals’ preference for exciting, novel findings, not replications 27

28 Research integrity: What we need  Understanding of the 3 problems—no simple changes will suffice  The new statistics  Better, more informative experiments  A clear distinction:  Pilot explorations, not publishable, vs.  Planned, pre-specified experiments, results ‘published’ in full  Ways to declare research plans in advance  Ways to ‘publish’, whatever the data (but quality control)  Ethical review boards: Require pre-registration? ‘publication’?  Replications (close and not-so-close)  Tools, guidelines, training, editorial policies… 28

29 Research integrity findings, proposals, arguments… Perspectives on Psychological Science, 2012, issue 7(6); 2013, issue 8(4) Makel: Only 1% of articles are replications, most ‘successful’; rate is increasing Bakker; Francis: Many meta-analyses have too many statistically significant results Giner-Sorolla: Top journals require ‘cool’ results! Aesthetics beat truth! Klein: Demand characteristics and experimenter bias: Still alive! Frank; Grahe: Students should do replication experiments Koole: Ways to reward replications. Publish online, linked to original reports. Nosek: Scientific Utopia: Place truth before ‘cool’. Open tools, data, publication. Wagenmakers: Publish protocol in advance. Pre-specified (confirmatory) vs. exploratory Fuchs: Psychologists open to change, but wary; prefer standards to rigid rules Ioannidis: Credibility of science. Replication. Truth. Progress. Open Science Collaboration: Reproducibility Project—replications of 2008 studies Fiedler: Test many alternative hypotheses. Converging evidence. Ingenuity. Gullo: DSM5: Pathological publishing. Positive results, no null results, ‘good story’ 29

30 Research integrity: A few current projects  Open Science Framework (OSF) tiny.cc/osf  Manage workflow, declare protocols, archive results and data  Reproducibility Project tiny.cc/repproject  An open collaboration for replication, part of OSF  Replicate findings published in 2008  Registered Replication Report, in Perspectives on Psychological Science  Open, refereed & pre-specified, guaranteed publication, meta-analysis  PsychFileDrawer tiny.cc/psychfiledrawer  Archive of reports of replications in psychology  figshare tiny.cc/figshare  A repository of reports and datasets  Archives of Scientific Psychology (APA) tiny.cc/archivesscipsy  Open online journal; requires full data  rOpenSci  It’s not just psychology! (Posting of software, data, analyses, discussion…) 30

31 Research integrity findings, proposals, arguments… Perspectives on Psychological Science, 2014, issue 9(3), May Ledgerwood: Introduction. “best practices… things we can change right now” Lakens & Evers: Increasing the informational value of studies, the v statistic Sagarin et al.: Protection for data peeking, then further data collection Perugini et al.: Imprecise power estimates, ‘safeguard power’ Stanley & Spence: Replications often vary greatly. Sampling error, measurement error. “Researchers should adjust their expectations concerning replications and shift to a meta-analytic mindset.” Braver, Thoemmes, & Rosenthal: Continuously cumulating meta-analysis Maner: Implications for editors and manuscript reviewers 31

32 New journal guidelines  Psychonomic Society journals New statistical guidelines tiny.cc/psychonomicstats  Society for Personality and Social Psychology (SPSP) Task Force Funder, D. C., et al. (2014). Improving the dependability of research in personality and social psychology: Recommendations for research and educational practice. Personality and Social Psychology Review, 18,  Psychological Science New guidelines, from Jan 2014: tiny.cc/eicheditorial tiny.cc/pssubguide Editor-in-chief, Eric Eich, explains: tiny.cc/apseichinterview 32

33 Eich, E. (2014). Business not as usual. Psychological Science, 25, 3-6. tiny.cc/eicheditorial 33

34 Psychological Science guidelines  Enhanced reporting of methods  Compulsory full disclosure check-boxes: Exclusions, Manipulations, Measures, Sample Sizes  Up to three ‘open science’ badges  Embracing the new statistics  Tutorial article: tiny.cc/tnswhyhow …APS is keen for other societies and journals to do similar 34

35 OSF badges …as in Psychological Science Preregistered: The design and analysis plans for the reported research were preregistered in a public, open-access repository. Open Materials: All digitally shareable materials necessary to reproduce the reported methodology have been made available in a public, open-access repository. Open Data: All digitally shareable data necessary to reproduce the reported results have been made available in a public, open-access repository. 35

36 Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, tiny.cc/tnswhyhow 36

37 Statistical reform efforts  The full story: Fidler (2005) tiny.cc/fionasphd  The brief story: Cumming (2014, APS Observer) tiny.cc/geoffobserver  International Committee of Medical Journal Editors (ICMJE) 1988: Use CIs  Ken Rothman: 1990 founded Epidemiology “We won’t publish p values” …for 10 years there were virtually none  Geoff Loftus at Memory & Cognition (1993-7): Increased use of figures with error bars, but a decrease after he left  APA Publication Manual (2010): Recommended estimation  Numerous examples of reporting CIs  Format for CI: “mean is 275 ms, 95% CI [210, 340]” Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychological Science, 15, Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of Memory & Cognition. Behavior Research Methods, Instruments & Computers, 36,

38 Prospects for reform  For 60+ years, damning critiques of NHST, almost no replies, almost no change  Kline review: tiny.cc/klinechap3 (13 severe NHST problems)  Critical quotes: tiny.cc/nhstquotes  Recent decades: Meta-analysis arrives, no need for NHST, damage done by p-based selective publication, but still little change  Last 5-10 years: Replication crisis: Urgent!  At last, a tipping point? 38

39 39 Psychology struggles out of the p-swamp, into the beautiful garden of confidence intervals

40 In summary, The new statistics: Why? Sections 1 and 2 Conclusions  p values are unreliable, give seductive but illusory ‘certainty’  Dichotomous NHST is limiting, CIs are more informative  Estimation for a cumulative quantitative discipline (Meehl, Gigerenzer) And now:  Replicability crisis demands change: Research integrity  APA Publication Manual recommends estimation  New journal requirements, Psychological Science leads Research integrity: Pre-register, disclose fully, report fully NHST, the researcher’s heroin—can we kick the habit? —can we abandon the security blanket of ‘significance’ and p?! NEXT: Confidence intervals 40 Chapters 1, 2

41 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis 41

42 The new statistics: How? Effect sizes Effect size, the amount of something of interest  Many ES measures are very familiar  No cause need be identified An effect size (ES) can be:  A mean, or difference between means  A percentage, or percentage change  A correlation (e.g., Pearson r)  Proportion of variance (R², η², ω²…)  A standardised measure (Cohen’s d, Hedges’ g…)  A regression slope (b or β)  A measure of goodness of fit  Many other things… (but NOT a p value!) 42 Chapter 2

43 My strategy  I assume populations have a normal distribution  No mention of alternatives, all full of potential:  Bayesian statistics  robust statistics  resampling methods  model comparison and selection  etc… Why choose estimation?  Three criteria for reform that has a chance of succeeding 1. Move on from NHST 2. Move on from dichotomous thinking and decision making 3. Resources available now to make techniques accessible 43 Chapter 3

44 A 95% confidence interval (CI) Public support for Proposition C is 53%, in a poll with a 2% margin of error (MOE) 44 Chapter 4 (Figure: relative likelihood of values for the population percentage)


49 ESCI play time: CIjumping page of ESCI chapters 1-4  Dance of the means: Narrow is good; large N is gold  Mean heap: Sampling distribution of sample means  SE (standard error) is SD of mean heap (SE = σ/√N)  Central Limit Theorem (CLT): Magic: Normal distribution from thin air  μ ± 1.96 × SE contains almost all (95% of) sample means: Tram lines  95% of sample means lie within 1.96 × SE of μ  1.96 × SE is the margin of error (MOE); most errors less than MOE  M ± 1.96 × SE will capture μ for most (≈95% of) samples  For σ known, 95% CI = M ± 1.96 × SE = M ± 1.96 × σ/√N  For σ not known, 95% CI = M ± t_crit × SE = M ± t_crit × s/√N 49 Chapter 3
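The last two formulas translate directly into code; a sketch of the σ-unknown case, using hypothetical reaction-time data:

```python
# 95% CI on a mean when sigma is not known:
# M +/- t_crit * s / sqrt(N)
import numpy as np
from scipy import stats

def mean_ci(data, confidence=0.95):
    data = np.asarray(data, dtype=float)
    n = data.size
    m = data.mean()
    se = data.std(ddof=1) / np.sqrt(n)                 # s / sqrt(N)
    t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)
    moe = t_crit * se                                  # margin of error
    return m, m - moe, m + moe

m, lo, hi = mean_ci([454, 317, 430, 525, 388, 489, 414, 501])  # hypothetical
print(f"M = {m:.0f} ms, 95% CI [{lo:.0f}, {hi:.0f}]")
```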

50 CIs: Interpretation 1 (of five)  How, pragmatically, should we think about CIs?  How should we use CIs to interpret results? Interpretation 1: One from the dance  RT was 457 ms [427, 487]  Our CI is randomly chosen from an infinite dance, 95% of which include μ, but... It might be red  Of all 95% CIs we ever see, around 5% will be red, but we’ll never know which 50 Our 95% CI Chapter

51 CIs: Interpretations 2-5 Is it reasonable to interpret our CI? YES, if it’s likely to be typical of its dance, so yes, provided that:  N is not very small (say, not less than around 8)  Our CI has not been selected 51 Our 95% CI Chapter

52 Dance of the means The mean is the best point estimate, for any N 52 Chapter 3

53 Dance of the CIs 95% of CIs capture μ, for any N BUT for N very small, CI length can be misleading 53 Chapter 3

54 CIs: Interpretations 2-5 Is it reasonable to interpret our CI? YES, if likely to be typical of its dance, so yes, provided that:  N is not very small (say, not less than around 8)  For any N, the mean is the best point estimate  But for very small N, CI length can be very misleading  Our CI has not been selected  Several CIs available, but only a selected one reported? 54 Our 95% CI Chapter

55 CIs: Interpretation 2 Interpretation 2: Our interval, with cat’s eye picture  The beautiful shape of a CI: likelihood or plausibility  Interpret the point estimate, 457 ms, and CI limits  We can be 95% confident our interval includes μ  Best bets for μ are values near M,  Less good bets toward and beyond each limit  No sharp drop at the limits  It matters little whether a point is just inside or just outside the CI Cumming, G. (2007). Inference by eye: Pictures of confidence intervals and thinking about levels of confidence. Teaching Statistics, 29, 55 Our 95% CI Chapter

56 CIs: Interpretation 3 Interpretation 3: Prediction interval for next M  The CI signals, approximately, the ‘width’ of the dance  …it signals where the next mean is likely to land  On average, approx 83% of future M’s lie within our CI  On average, a 5 in 6 chance  Researchers understand this moderately well Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers’ understanding of confidence intervals and standard error bars. Understanding Statistics, 3, Cumming, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11, 217– 56 Our 95% CI Chapters 3,

57 CIs: Interpretation 4 Interpretation 4: MOE as error of estimation  MOE = 30 ms  The error of estimation is |M – μ|  MOE is the maximum likely error of estimation  MOE is our measure of precision  Large MOE, low precision  Small MOE, high precision 57 Our 95% CI Chapter 4 MOE

58 CIs: Interpretation 5 (least preferred) Interpretation 5: NHST  If null hypothesis value (e.g. zero) is outside the 95% CI, reject at the p = .05 level  If within the interval, don’t reject  Ignores much of the information CIs provide  Can prompt incorrect interpretation of results  There are links: CI length, level of confidence (C), and p Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26. tiny.cc/cisbetter 58 Our 95% CI Chapter

59 CI length and level of confidence, C 59 Chapter 4  The C% CI spans C% of the cat’s eye area  Some simple approximate relations between C and CI length: a 99% CI is one third longer than the 95% CI; a 90% CI is one sixth shorter; a 50% CI is one third as long
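For large N these relations follow from the normal critical values; a quick scipy check:

```python
# Approximate relations between level of confidence C and CI length:
# ratio of z critical values relative to the 95% interval.
from scipy import stats

z95 = stats.norm.ppf(0.975)
for c in (0.99, 0.95, 0.90, 0.50):
    z = stats.norm.ppf((1 + c) / 2)
    print(f"{c:.0%} CI length is {z / z95:.2f} x the 95% CI length")
# 99% -> 1.31 (one third longer); 90% -> 0.84 (one sixth shorter);
# 50% -> 0.34 (one third as long)
```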


61 61 CI length and level of confidence, C Chapter 4

62 Position relative to a 95% CI, and p 62 Chapter 4  Note where the null hypothesis value, μ0, falls in relation to a 95% CI  Eyeball the two-tailed p value

63 From a 95% CI to strength of evidence 63 Chapter 4  Note where a 95% CI falls in relation to the null hypothesis value, μ0  No need to think about p

64 Position relative to a 95% CI, and p  If the null hypothesis value falls at the limit of the 95% CI, the two-tailed p value is .05  … 1/3 of MOE from M: about .50  … 1/6 of MOE back from a limit: about .10  … 1/3 of MOE beyond a limit: about .01  … 2/3 of MOE beyond a limit: about .001  The eyeballed p value, or strength of evidence (figure: 95% CIs, with eyeballed answers .033, .12, <.001, .29, .70) 64 Chapter 4
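These benchmarks follow from the large-N normal approximation, where MOE = 1.96 × SE; a sketch that recovers each of them:

```python
# Distance of the null value from M, expressed in MOEs (MOE = 1.96 SE
# for a 95% CI, large N), maps to a two-tailed p value.
from scipy import stats

z95 = stats.norm.ppf(0.975)              # 1.96
for label, moes in [("1/3 MOE from M", 1/3),
                    ("1/6 MOE inside a limit", 5/6),
                    ("at a limit", 1.0),
                    ("1/3 MOE beyond a limit", 4/3),
                    ("2/3 MOE beyond a limit", 5/3)]:
    z = moes * z95
    p = 2 * stats.norm.sf(z)             # two-tailed p
    print(f"null value {label}: p = {p:.3f}")
# prints ~.51, .10, .05, .009, .001 -- the benchmarks above
```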

65 From p (and M) to the 95% CI 65 Chapter 4  Given p, μ0, and M, eyeball the 95% CI

66 Five ways to interpret a result, with CI Public support for Proposition C is 53%, in a poll with a 2% margin of error (MOE) 1. One from the dance: most likely includes μ, but might be red 2. Eyeball the cat’s eye. Values around 53 are most plausible for μ; values towards and beyond 51 or 55 are progressively less plausible. We are 95% confident μ lies in [51, 55]. Interpret the midpoint (53) and limits (51 and 55). 3. Quite likely (83% chance) a repeat survey would give a result in [51, 55]. 4. The maximum likely error of estimation is 2%. 5. Support is statistically above 50%, p < .01. … then interpret, in the research and practical context. 66 Chapters 3, 4, 5

67 Interpret ESs and their CIs  Possible results of a study seeking to reduce anxiety  For each possible result, perhaps consider:  Interpretation of 95% CI  Use any or all of five ways  p value (reject H0?) ??  Can we ‘accept H0’?  ES reference values are shown. We could mark the ES that is practically significant.  MOE: short is good 67 Chapters 3, 4

68 Interpret ESs and their CIs Knowledgeable judgment is required, in context, but that’s OK  Justify interpretations  In the research context  Including practical, theoretical, … implications  Small, large, notable, important, economically valuable, negligible…  Examples of ES reference values (mark in figures):  10 mm on the 100 mm pain line is the minimum change of clinical note  15% change in memory score is the smallest of clinical importance  Scores of 0-13, 14-19, 20-28, and 29-63 are, respectively, ‘minimal’, ‘mild’, ‘moderate’, and ‘severe’ levels of depression on the Beck Depression Inventory (BDI)  Avoid the ‘S word’ … significant (shhhh) 68 Chapters 3, 4

69 Chapters 4, 6 The tragedy of the error bar  Samples from same population  Tweaked so M, SD same for all N  SD is descriptive of sample data  95% CI length varies greatly with N  CI inferential, tells about population μ  SE length varies inversely with √N  Ratio CI / SE varies—for small N  SE neither descriptive nor inferential  Use 95% CI, not SE bars  A tragedy that bars don’t say what they mean—always define bars Cumming, G., Fidler, F., & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177, tiny.cc/errorbars101

70 Chapters 4, 6 The tragedy of the error bar  Double the length of SE bars to get the 95% CI, approximately (SE bars span roughly a 68% CI)  Doesn’t work for small N, other ESs  Use 95% CI, not SE bars …CI is inferential—what we want  Prefer 95% CIs—the most common Cumming, G., Fidler, F., & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177, tiny.cc/errorbars101

71 Questions that should spring to mind… 71  What’s the DV? (on the vertical axis)  What are the two conditions?  What design? (Two independent groups? Paired?)  What are the bars? (SE? 95% CI? SD? Some other CI?) Every figure must provide all this information, in the figure or caption. Chapters 4, 6

72 The New Statistics: How? Estimation: The eight-step plan 1. Use estimation thinking. State estimation questions as: “How much…?”, “To what extent…?”, “How many…?” The key to a more quantitative discipline 2. Identify the ESs that best answer the questions (a difference?) 3. Declare full details of the intended procedure, data analysis, … 4. Calculate point and interval estimates (CIs) for those ESs 5. Make a picture, including CIs 6. Interpret (use knowledgeable judgment, in context) 7. Use meta-analytic thinking at every stage (…cumulative discipline) 8. Make a full report publicly available (an imperative, not just a goal) Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217, Chapters 1, 2, 15

73 Estimation thinking, estimation language Introduction: “The study was designed to estimate reading improvement, following the new procedure, in 6 and 4 year olds.” Even better: “The theory predicted a medium-to-large increase in 6 year olds, but little or no increase in 4 year olds.” Better still if the predictions are quantitative.  Estimation thinking is assumed. Estimation of ESs is the focus. Results: “The reading age of 6 year olds increased by 4.5 months 95% CI [1.9, 7.1], which is a large and educationally substantial increase. That of 4 year olds increased by only a negligible 0.6 months [-0.8, 2.0], …”  We are given ESs, with CIs, then the ESs are interpreted. Further comment would assess the more specific (ideally, quantitative) predictions. 73 Chapters 2 - 6

74 The New Statistics: Actually doing it The editor says to remove CIs and just give p values. What do you DO?  Research methods best practice: Consider, decide, persist  The evidence should decide: Consider statistical cognition research  TNS reasons are compelling, TNS is the way of the future. Persist.  Explain and justify your data analytic approach  APA Publication Manual: “Wherever possible, base discussion and interpretation of results on point and interval estimates” (p. 34).  New guidelines: Psychonomic Society, Psychological Science…  I add p values if I must, but don’t mention them, nor remove CIs or ESs 74 Chapter 15

75 Section 3 Conclusions  ES: The amount of anything of interest  95% CI gives inferential information …which is what we want  Use any of five ways to think about a 95% CI  Interpret the ES and CI, in context  Ask estimation questions, use estimation thinking …and meta-analytic thinking NEXT: Examples of using the new statistics 75 Chapter 2

76 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis 76

77 The New Statistics: How? Estimation: The eight-step plan 1. Use estimation thinking. State estimation questions as: “How much…?”, “To what extent…?”, “How many…?” The key to a more quantitative discipline 2. Identify the ESs that best answer the questions (a difference?) 3. Declare full details of the intended procedure, data analysis, … 4. Calculate point and interval estimates (CIs) for those ESs 5. Make a picture, including CIs 6. Interpret (use knowledgeable judgment, in context) 7. Use meta-analytic thinking at every stage (…cumulative discipline) 8. Make a full report publicly available (an imperative, not just a goal) Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217, Chapters 1, 2, 15

78 Randomised control trial (RCT) 78 Means, with 95% CIs These CIs can guide assessment of which comparisons? (between groups) They cannot help with which comparisons? (repeated measure) Figure page of ESCI chapters 14-15 Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data. American Psychologist, 60, tiny.cc/inferencebyeye Chapter 15

79 Two basic experimental designs—with CIs 1. Two independent groups  E.g. Experimental group vs Control group  OK to use CIs on means to assess the difference  (…think independent groups t test) Data two and Simulate two pages of ESCI chapters 5-6 2. Paired design (repeated measure)  E.g. Pretest vs Posttest, single group of patients  May NOT use CIs on Pretest and Posttest to assess the difference  Need the CI on the paired differences  (…think paired t test)  Data paired and Simulate paired pages of ESCI chapters 5-6 79 Chapters 6

80 Two basic experimental designs, and t tests  Two independent groups  CIs on separate means (and on the difference) are based on the within-group SDs (the pooled SD, s_p)  Paired (or matched) design  Different error term  CI on the difference is based on s_diff, the SD of the paired differences 80 Chapters 6
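A sketch of the two CIs on the difference (scipy-based; the pretest/posttest numbers are hypothetical, for illustration only):

```python
# CI on the difference for the two basic designs. Independent groups
# use the pooled SD; the paired design uses s_diff, the SD of the
# paired differences.
import numpy as np
from scipy import stats

def ci_diff_independent(a, b, confidence=0.95):
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = a.size, b.size
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    se = np.sqrt(sp2 * (1 / na + 1 / nb))          # pooled-SD error term
    t_crit = stats.t.ppf((1 + confidence) / 2, na + nb - 2)
    d = a.mean() - b.mean()
    return d, d - t_crit * se, d + t_crit * se

def ci_diff_paired(pre, post, confidence=0.95):
    diffs = np.asarray(post, float) - np.asarray(pre, float)
    n = diffs.size
    se = diffs.std(ddof=1) / np.sqrt(n)            # s_diff / sqrt(N)
    t_crit = stats.t.ppf((1 + confidence) / 2, n - 1)
    d = diffs.mean()
    return d, d - t_crit * se, d + t_crit * se

pre = [12, 15, 11, 14, 13, 16, 12, 15]             # hypothetical scores
post = [14, 18, 13, 15, 16, 18, 13, 17]
print(ci_diff_paired(pre, post))                   # mean diff 2.0, CI ~[1.4, 2.6]
```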

81 Overlap Rule of Eye (Pretest vs Posttest figure) Compare A B page of ESCI chapters 5-6  Paired design: the CI on the difference is based on different SD information (the SD of the differences)  Independent groups: assessing the difference uses the same SD information as the two separate CIs Chapters 6

82 Two independent groups: Rules of Eye 82 Two 95% CIs just touching (zero overlap) indicates moderate evidence of a population difference (approx p = .01) Moderate overlap (about half average MOE) is some evidence of a difference (approx p = .05) When both sample sizes are at least 10, and the two MOEs do not differ by more than a factor of 2. Use the rule without reference to p Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data. American Psychologist, 60, tiny.cc/inferencebyeye Chapters 6

83 Two independent groups: Rules of Eye 83 Two 95% CIs just touching (zero overlap) indicates moderate evidence of a population difference (approx p = .01) Moderate overlap (about half average MOE) is some evidence of a difference (approx p = .05) When both sample sizes are at least 10, and the two MOEs do not differ by more than a factor of 2. Use the rule without reference to p Cumming, G. (2009). Inference by eye: Reading the overlap of independent confidence intervals. Statistics in Medicine, 28, Chapters 6

84 Randomised control trial (RCT) 84 Chapter 15 Means, with 95% CIs These CIs can guide assessment of which comparisons? (between groups) They cannot help with which comparisons? (repeated measure) Figure page of ESCI chapters 14-15 Q: How to display the within-subjects CIs as well? Fidler, F., Faulkner, S., & Cumming, G. (2008). Analyzing and presenting outcomes: Focus on effect size estimates and confidence intervals. In A. M. Nezu & C. M. Nezu (Eds.) Evidence-based outcome research (pp. ). New York: OUP.

85 85 RCT example: Plot mean change score, with its CI  One choice for planned contrasts  Interpret the 95% CIs  Use overlap rule? Figure page of ESCI chapters Chapter 15

86 Cohen’s d (A standardised effect size), SMD Cohen’s d is an ES expressed as a number of SDs (a z score) d picture page of ESCI chapters 10-13  Lots of overlap of the populations  For d = 0.5 (a medium effect?), 69% of E points higher than C mean  Cohen’s small, medium, large: 0.2, 0.5, 0.8—but arbitrary, last resort  Cohen’s d is the ES (in original units), divided by a suitable SD  Our sample d is a point estimate of the population δ  We need d to help understanding, and for meta-analysis 86 Chapter 11

87 Cohen’s d, for 2 independent groups IQ example: Control group: M_C = 110, s_C = 12; Experimental group: M_E = 120, s_E = 16 d = ES / SD … ES is in original units, we choose a value for SD 1. Which SD makes best sense as the unit of measurement? (Population SD) 2. What’s the best estimate of this SD? Three options:  Use 15, SD in reference population, d = (120 – 110) / 15 = 0.67  Use s_C = 12, estimate of Control population SD, d = (120 – 110) / 12 = 0.83  Use s_p = 14, pooled estimate from both groups, d = (120 – 110) / 14 = 0.71 Third option: Both numerator and denominator are estimates, so change on replication The rubber ruler: the SD as an elastic unit of measurement Denominator also used for t test and CI on difference between group means Caution: d = 0.5 may be 10/20 or 12/24 or 9/18 or… 87 Chapter 11
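The three options in code; the group sizes are assumptions, since the slide gives only means and SDs:

```python
# The three choices of standardiser from the IQ example on this slide.
import math

m_c, s_c, m_e, s_e = 110, 12, 120, 16
n_c, n_e = 30, 30                                 # group sizes assumed
es = m_e - m_c

d_ref = es / 15                                   # reference-population SD
d_control = es / s_c                              # control-group SD
s_p = math.sqrt(((n_c - 1) * s_c**2 + (n_e - 1) * s_e**2) / (n_c + n_e - 2))
d_pooled = es / s_p                               # pooled SD (~14)

print(f"d (reference SD 15) = {d_ref:.2f}")       # 0.67
print(f"d (control SD)      = {d_control:.2f}")   # 0.83
print(f"d (pooled SD)       = {d_pooled:.2f}")    # 0.71
```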

88 Cohen’s d, for paired design IQ example: A healthy breakfast increases IQ score by average 2 points Usual: M_usual = 110, s_usual = 12; Healthy: M_healthy = 112, s_healthy = 16 (Also s_diff = 1.2) d = ES / SD Which SD makes best sense as the unit of measurement? (Population SD?)  Use 15, population SD, d = 2 / 15 = 0.13  Use s_C = 12, Usual estimate, d = 2 / 12 = 0.17  Use s_p = 14, pooled, d = 2 / 14 = 0.14 BUT for paired t test, and CI on mean of differences, use s_diff …use s_diff as the measuring unit for d?? NO: it gives d = 2 / 1.2 = 1.7. Silly. Our choice of SD for d may differ from what we use for inference 88 Chapter 11

89 CIs for Cohen’s d Both numerator and denominator have sampling variability …so distribution of d is tricky To calculate accurate CIs on d, need the noncentral t distribution …Fairytale “How the noncentral t distribution got its hump” tiny.cc/noncentralt To calculate CIs for d use ESCI, or an excellent approximate method: Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217, Introduction to CIs on d and noncentral t: Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, Chapters 10, 11

90 Unbiased estimate of δ is d_unb Unfortunately d overestimates δ. Multiply d by an adjustment factor (a function of df) to get the unbiased estimate d_unb. Routinely use d_unb (sometimes called Hedges’ g, but terminology is a mess) Data two and Data paired pages of ESCI chapters 5-6 90 Chapter 11
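A common approximation to the adjustment factor (it may differ slightly from ESCI's exact calculation):

```python
# Approximate bias correction for Cohen's d (the common approximation
# to Hedges' correction). df = n1 + n2 - 2 for two independent groups,
# N - 1 for paired data.
def d_unbiased(d, df):
    return d * (1 - 3 / (4 * df - 1))

print(d_unbiased(0.71, 58))   # ~0.70: the correction is small unless N is tiny
```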

91 Cohen’s d, take-home messages d = ES / SD … where ES is in original units, and we choose a value for SD  d is highly valuable, especially for meta-analysis  Choose SD carefully: what makes best sense as the unit of measurement?  Use the best available estimate of this SD  Report how d was calculated—if we don’t know that, we can’t interpret  Beware the rubber ruler, interpret d values with caution  Usually use d_unb, the unbiased estimate of δ  Beware terminology (Hedges’ g, Glass’s Δ)  Use ‘Cohen’s d’ or d_unb, with explanation of how calculated  Interpret values in context, use 0.2, 0.5, 0.8 as a last resort 91 Chapter 11

92 CI on correlation, r  Use Fisher’s r to z transformation  CIs asymmetric, more so for r near -1 or 1  CIs shorter when r is near -1 or 1  CIs surprisingly wide, unless N is large  Example: r = .6, N = 30  Cohen’s benchmarks: .1, .3, .5—often inappropriate r to z and Two correlations pages of ESCI chapters 14-15 Correlations and Diff correlations pages of ESCI Effect sizes Finch, S., & Cumming, G. (2009). Putting research in context: Understanding confidence intervals from one or more studies. Journal of Pediatric Psychology, 34, Chapter 14
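A sketch of the Fisher-transformation CI, reproducing the r = .6, N = 30 example:

```python
# 95% CI on Pearson r via Fisher's r-to-z transformation.
import math
from scipy import stats

def ci_r(r, n, confidence=0.95):
    z = math.atanh(r)                        # Fisher transformation
    se = 1 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf((1 + confidence) / 2)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

print(ci_r(0.6, 30))    # roughly (.31, .79): wide, and asymmetric about .6
```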

93 CI on proportion, P Difference between two proportions (instead of OR or χ²) ES is proportion survived. Diff = 17/20 – 11/20 = .30, [.02, .53] Proportions and Diff proportions pages of ESCI Effect sizes Altman, D. G., Machin, D., Bryant, T. N., & Gardner, M. J. (2000). Statistics with confidence: Confidence intervals and statistical guidelines (2nd ed.). London: BMJ Books. 93 Chapter 14
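A sketch using Newcombe's method based on Wilson score intervals (consistent with the Altman et al. recommendations); it reproduces the interval above:

```python
# CI on the difference between two independent proportions, via
# Newcombe's method built from Wilson score intervals.
import math
from scipy import stats

def wilson(k, n, z):
    p = k / n
    centre = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return centre - half, centre + half

def ci_prop_diff(k1, n1, k2, n2, confidence=0.95):
    z = stats.norm.ppf((1 + confidence) / 2)
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson(k1, n1, z)
    l2, u2 = wilson(k2, n2, z)
    d = p1 - p2
    lo = d - math.sqrt((p1 - l1)**2 + (u2 - p2)**2)
    hi = d + math.sqrt((u1 - p1)**2 + (p2 - l2)**2)
    return d, lo, hi

print(ci_prop_diff(17, 20, 11, 20))   # diff = .30, CI ~[.02, .53], as above
```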

94 CIs for assessing model fit to data Velicer, W. F., Cumming, G., Fava, J. L., Rossi, J. S., Prochaska, J. O., & Johnson, J. (2008). Theory testing using quantitative predictions of effect size. Applied Psychology: An International Review, 57, Test the Transtheoretical Model of Behavior Change, with data from N = 3967 smokers. Calculate 95% CIs on 15 predictor variables. Dots are the quantitative predictions of the model. 94 Chapter 15

95 Section 4 Conclusions  More complex designs: Planned contrasts, with 95% CIs  Design: independent groups vs repeated measure  Cohen’s d and d unb ; how calculated? Rubber ruler  CIs on other ES measures  Correlation; difference between two independent correlations  Proportion; difference between two independent proportions  CIs to assess model fit to data NEXT: Planning good studies 95 Chapter 2

96 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis 96

97 Statistical power: I’m ambivalent  Statistical power is the chance of finding something if it is there  Statistical power = 1 – β = Prob(reject H0, IF H0 false)  Depends on NHST  If using NHST, take statistical power seriously  Instead, use precision for planning: design for target MOE  “Power” more loosely: “Goodness” of an experiment  Instead, maximise informativeness 97 Chapter 12

98 Power picture Statistical power is the chance we’ll find an effect, if there is an effect of stated size At right: Single sample, N = 18, target ES is δ = .5, α = .05, two-tailed, power = .52 Power picture page of ESCI chapters To calculate power, we need noncentral t, unless σ is known Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 530– 98 Chapter 12

99 Power example: HEAT Hot Earth Awareness Test (HEAT)  Test of climate change knowledge, attitudes, and behavior  Assume μ = 50, σ = 20 in reference population  Use α = .05, two-tailed. Choose target ES that is meaningful  Two independent groups experiment, N in each group  For target ES of δ = your choice, study variation of power with N  E.g. 8 points on HEAT is δ = 8/20 = 0.4  Target ES makes a big difference  For paired design, set ρ, the population correlation, and study power  Correlation ρ, as well as target ES, makes a big difference Power two and Power paired pages of ESCI chapters 10-13 99 Chapter 12
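For two independent groups the calculation is a few lines via scipy's noncentral t (a sketch; ESCI's Power two page and G*Power do the equivalent):

```python
# Power for a two-independent-groups, two-tailed t test, computed from
# the noncentral t distribution (sigma unknown).
import numpy as np
from scipy import stats

def power_two_groups(delta, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = delta * np.sqrt(n_per_group / 2)     # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability |t| exceeds the critical value when the effect is delta
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(round(power_two_groups(0.5, 32), 2))     # ~.51; compare the .52 quoted earlier
for n in (20, 50, 100):                        # HEAT example, delta = 8/20 = 0.4
    print(n, round(power_two_groups(0.4, n), 2))   # roughly .24, .51, .80
```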

100 Ways to increase power  Increase N  Increase target ES (!!)  Increase α (!!)  Improve the experimental design  Use better measures  Scope for fudging (Grant applications, ethics proposals…) 100 Chapter 12

101 Power recommendations  APA Manual: “Take seriously the statistical power considerations associated with the tests of hypotheses. … routinely provide evidence that the study has sufficient power to detect effects of substantive interest…” (p. 30)  BUT power values are very rarely reported in psychology journals 101 Chapter 12

102 G*power  Great software to calculate power, display power curves  Test family (play with t, F, z…)  Enter no. of tails, type of power analysis, α, etc  Enter target ES  For ANOVA, use f as the ES measure (see Cohen, 1988)  Power for comparing two r values (correlations)  Or set power, use ‘Determine’ to calculate ES value  X-Y plot, to examine sensitivity  Download from tiny.cc/gpower3 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum 102 Chapter 12

103 Statistical power: Often so low  HIGH power is gold, but many disciplines are cursed by low power  Cohen (1962): In published psychology research, median power to find a medium-sized effect is about .5  Maxwell (2004): It was still about .5 for a medium effect  Our file drawers (and journals) are crammed with Type 2 errors: Results that are ns even though there is a real effect Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 103 Chapter 12

104 Post hoc power: A bad idea  Calculated after data are obtained. Uses the obtained d as the target ES  Logical problem: Power is a prospective probability  Replicate, and see the ‘dance of post hoc power’—enormous variation Simulate two page of ESCI chapters 5-6  Devastatingly criticised as not telling us what we want to know (chance we’ll find an effect of a size chosen to be meaningful)  Merely reflects the outcome of our study. Tells us nothing new.  SPSS, etc, gives post hoc power in its printouts. Don’t use this value  Poor practice by software publishers Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 104 Chapter 12

105 Informativeness Informativeness—my general term for quality, size, sensitivity, usefulness To increase informativeness (also power & precision):  Choose experimental design to minimise error (repeated measures?)  Improve the measures, maybe measure twice and average  Statistical control? (Covariance?)  Target large effect sizes: Six therapy sessions, not two  Use large N (of course)—tho’ to halve SE, need to multiply N by 4  Use Meta-analysis (combine results over experiments) Don’t spread your eggs over too many baskets. Do one or two things well, rather than risk examining lots of things and finding ≈ nothing  The enemy is error variability: reduce it by all means possible An essential step in research planning, worth great effort: Brainstorm 105 Chapters 12, 13

106 Precision for planning (AIPE, accuracy in parameter estimation)  ESCI calculates what N is required to give:  expected MOE no more than f × σ (so f is like d, a number of SDs)  OR to have a 99% chance MOE is no more than f × σ  ‘assurance’ = 99%, expressed as γ = 99 Three Precision pages of ESCI chapters 5-6 106 Chapter 13

107 Precision for planning, HEAT example  HEAT experiment, two independent groups  Target MOE of 8, which is 0.4 × 20, or f × σ, where f = 0.4 Three Precision pages of ESCI chapters 5-6  For f = 0.4, two independent groups, need N = 50  And for assurance γ = 99, need N = 65 Alas, such large N, even with such large f  Paired experiment, with ρ = .7, need N = 17  Or N = 28, with assurance γ = 99 107 Chapter 13
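A sketch of the two-independent-groups case, treating s as known and equal to σ; it reproduces N = 50 for f = 0.4. (Assurance calculations also need the sampling distribution of s, so they are omitted here.)

```python
# Precision for planning, two independent groups: smallest N per group
# whose MOE on the difference is at most f * sigma.
from scipy import stats

def n_for_target_moe(f, confidence=0.95):
    n = 2
    while True:
        t_crit = stats.t.ppf((1 + confidence) / 2, 2 * n - 2)
        if t_crit * (2 / n) ** 0.5 <= f:   # MOE in sigma units: t * sqrt(2/N)
            return n
        n += 1

print(n_for_target_moe(0.4))   # 50, matching the slide's f = 0.4 example
```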

108 Precision for planning (AIPE, accuracy in parameter estimation)  ESCI calculates what N is required to give:  expected MOE no more than f × σ (so f is like d, a number of SDs)  OR to have a 99% chance MOE is no more than f × σ  ‘assurance’ = 99%, expressed as γ = 99  Not yet widely used, but highly recommended (No need for H0)  Target MOE replaces target ES  Should replace power calculations for funding and ethics applications 108 Chapter 13

109 Section 5 Conclusions Power, informativeness, precision  IF using NHST, take power seriously  Low power for a meaningful effect size ==> a waste of time?  Don’t use post-hoc power  Make great efforts to maximize informativeness  Precision (MOE of the CI) is more useful than power:  No need for NHST, or for any H 0  CI pictures are so revealing  And essential, if you wish to conclude no (or trivial) effect.  If it could be useful, use precision for planning NEXT: Meta-analysis and meta-analytic thinking 109 Chapters 12, 13

110 1. The new statistics: Why 2. Research integrity and the new statistics 3. Effect sizes and confidence intervals 4. The new statistics: How 5. Planning, power, and precision 6. Meta-analysis 110

111 Single studies—So many problems  Power often low  CIs often wide, precision low  CIs report accurately the uncertainty in data.  But don’t shoot the messenger—it’s a message we need to hear The solutions:  Increase informativeness of individual studies  Combine results over studies—Meta-analysis 111 Chapter 7

112 Meta-analysis: the picture  The forest plot  CIs make this picture possible; p values are irrelevant ESCI Meta-Analysis  Beginning undergraduates easily grasp the basics—via pictures  Effect sizes used in meta-analysis: Means, Cohen’s d, r, others…  A stylized cat’s eye: Cumming, G. (2006). Meta-analysis: Pictures that explain how experimental findings can be integrated. 7th International Conference on Teaching Statistics. Brazil, July. tiny.cc/teachma Hunt, M. (1997). How science takes stock. The story of meta-analysis. New York: Sage. 112 Chapter 7

113 Meta-analysis: Small  The minimum: two studies to be combined  Prior studies, or  Our own, perhaps within one project, or  Prior + our own …to get the best overall estimates, so far 113 Chapter 7

114 Meta-analysis: Large Cooper’s seven steps; informed critical judgment at every stage 1. Formulate the questions, and scope of the systematic review 2. Search and obtain literature, contact researchers, find grey literature  Establish selection criteria, read and select studies 3. Code studies, enter ES estimates and coding of study features 4. Choose what to include, and design the analyses 5. Analyse the data. Prefer random effects model. Moderators? 6. Interpret; draw empirical, theoretical, and applied conclusions. 7. Prepare critical discussion, present the review 8. Receive $1,000,000 and gold medal. Retire early. (Joke, alas.) Cooper, H. M. (2009). Research synthesis and meta-analysis: A step-by-step approach (4th ed.). Thousand Oaks, CA: Sage. 114 Chapter 9

115 Heterogeneity Forest plot variability, or dance of the ESs  Heterogeneity measures the extent of dancing  Sampling variability accounts for a certain width of dancing  Studies homogeneous => sampling variability accounts for dance  Studies heterogeneous =>  There is variability beyond that expected from sampling variability  Therefore, moderating variables may contribute  Can studies be too homogeneous? What might that imply? Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York: Wiley. 115 Chapter 8

116 Models for meta-analysis Fixed effect (FE) model  Assume homogeneity: every study estimates the same δ  Almost always unrealistic Random effects (RE) model—our routine choice  Assume Study i estimates δi, sampled from N(δ, τ)  τ = 0 means studies homogeneous, FE model applies  τ > 0 means studies heterogeneous, RE model needed  Assumptions are severe. Unrealistic? Other models?  Varying-coefficient model of Doug Bonett—watch this space Other approaches?  Bayesian  Schmidt-Hunter corrections for measurement and other biases Schmidt, F., & Hunter, J. (2014). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Sage. 116 Chapter 8

117 Measures of heterogeneity  Q …the weighted sum of squares between studies  T …the estimate of τ, with CI  Interpret T and its CI (Extends to 0? Interpret the limits)  I² …the percentage of total variance between studies that reflects variation in true effect size, rather than sampling variability If heterogeneity is considerable, consider a moderator analysis  If heterogeneity low, RE gives same result as FE  => nothing to lose by using RE 117 Chapter 8
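A sketch of the underlying computation: inverse-variance weighting with the DerSimonian-Laird random effects estimate, returning the Q, T², and I² measures described above (study values hypothetical):

```python
# Inverse-variance meta-analysis with the DerSimonian-Laird random
# effects model, computing the heterogeneity measures Q, T^2 and I^2.
import numpy as np

def random_effects_ma(effects, variances):
    e, v = np.asarray(effects, float), np.asarray(variances, float)
    w = 1 / v                                   # fixed-effect weights
    m_fe = np.sum(w * e) / np.sum(w)
    q = np.sum(w * (e - m_fe) ** 2)             # weighted SS between studies
    df = e.size - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    t2 = max(0.0, (q - df) / c)                 # T^2, estimate of tau^2
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = 1 / (v + t2)                         # random-effects weights
    m_re = np.sum(w_re * e) / np.sum(w_re)
    se_re = np.sqrt(1 / np.sum(w_re))
    return m_re, (m_re - 1.96 * se_re, m_re + 1.96 * se_re), q, t2, i2

# hypothetical study ESs (e.g. Cohen's d) and their variances
m, ci, q, t2, i2 = random_effects_ma([0.3, 0.5, 0.1, 0.6],
                                     [0.04, 0.02, 0.05, 0.03])
print(f"RE mean = {m:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], "
      f"Q = {q:.2f}, T^2 = {t2:.3f}, I^2 = {i2:.0f}%")
```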

118 But there’s more: Moderator analysis  If heterogeneity high, look for moderators  Simplest: Dichotomous moderator? (e.g., gender) Subgroups page of ESCI Meta-analysis  Identify moderator, even if no study manipulated that variable  Meta-analysis can give theoretical progress—that’s gold  Example: Peter Wilson, clumsy children, meta-analysis of 50 studies  Identify performance on complex visuospatial tasks as moderator  Study this moderator empirically 118 Chapter 9

119 Continuous moderator? Meta-regression  Fletcher & Kerr (2010): Does RTG fade with length of relationship?  Meta-regression of ES values (RTG score) against years, 13 studies  Correlation, not causality. Alternative interpretations? 119 Chapter 9

120 MA in the Publication Manual  Many mentions, esp. pp , 183.  Mainstreaming meta-analysis  MARS (Meta-Analysis Reporting Standards)  pp  A further big advantage of the sixth edition (2010) Cooper, H. (2010). Reporting research in psychology: How to meet Journal Article Reporting Standards (APA Style). Washington, DC: APA Books.  MARS and JARS 120 Chapter 9

121 CMA: Software for meta-analysis  Comprehensive Meta Analysis  Enter ES, and its variance, for each study—in 100+ formats  Choose FE or RE model  Assess heterogeneity of studies  Explore moderators (ANOVA, or meta-regression)  Forest plot Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York: Wiley. 121 Chapters 8, 9

122 Health sciences: The Cochrane Collaboration  Medicine, health sciences, health policy, practice…  Systematic reviews: meta-analyses of research studies  Publicly available in some countries  5,000+ reviews online  31,000+ people in 120+ countries  Aim to update every two years  RevMan software for meta-analysis  Includes some psychology  Campbell collaboration, for some social sciences (education, welfare…) 122 Chapter 9

123 PTSD: A Cochrane review Bisson JI, Roberts NP, Andrew M, Cooper R, Lewis C. Psychological therapies for chronic post-traumatic stress disorder (PTSD) in adults. Cochrane Database of Systematic Reviews 2013, Issue 12. Art. No.: CD  Includes 70 studies, total of 4761 participants  Update of 2005 Cochrane review, updated in 2007  Support for efficacy, for chronic PTSD in adults, of  Trauma-focused cognitive behavioral therapy (TFCBT), and  Eye movement desensitization and reprocessing (EMDR)  Non-trauma-focused psychological therapies not so effective 123 Chapter 9

124 Quality of the evidence  Many studies, but each included only small numbers of people  Some studies were poorly designed  The overall quality of the studies was very low and so findings should be interpreted with caution  There is insufficient evidence to show whether or not psychological therapy is harmful 124 Chapter 9

125 Bias? Research integrity issues 125 Chapter 9 Judgments about risks of bias, as percentages of included studies (p. 13)

126 Funnel plot: Publication bias? 126 Chapter 9 Individual therapy vs waitlist/usual care Outcome: Severity of PTSD symptoms - clinician-rated (p. 26) Large studies Small studies Favors therapy Favors control …suggests the possibility of publication bias Small studies missing? Because not statistically significant?


128 128 Chapter 9 Forest plot of Analysis 1.10 Comparison 1: Trauma-focused CBT/Exposure therapy vs waitlist/usual care Outcome 10: Depression month follow-up (p. 152)

129 129 Chapter 9 Forest plot of Analysis 3.1. Comparison 3: Trauma-focused CBT/Exposure Therapy vs other therapies Outcome 1: Severity of PTSD symptoms – clinician (p. 171)

130 Meta-analysis, in many disciplines  Particle physics  As much heterogeneity as in social sciences! Hedges, L. V. (1987). How hard is hard science, how soft is soft science? The empirical cumulativeness of research. American Psychologist, 42, 443– Chapter 9

131 Meta-analytic thinking 1. Think of past literature in meta-analytic terms 2. Think of our study as the next step in that progressively cumulating meta-analysis 3. Report results so inclusion in future meta-analysis is easy  Report all effect sizes (whether ns or not), in the best way Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, Chapters 1, 7, 9

132 Meta-analysis, more generally…  A typical study asks multiple questions, tests theory, doesn’t simply ask ‘how large is the effect of the treatment?’  A typical article includes several experiments, a number of manipulations, a number of DVs…  MA typically chooses a single ES from each study  the most important, or the average of several, or the one most often reported  or carry out more than one MA, as in Cochrane  Converging approaches, converging evidence—most persuasive  Reduces risk that findings are merely chance fluctuations  Provides some evidence of robustness, generality of findings  …all part of meta-analytic thinking 132 Chapter 7

133 Section 6 Conclusions Meta-analysis:  Any size from small to very large  Quantitative integration, even of large, messy literatures  Best (most precise) ES estimates  Moderator analysis: Practical and theoretical importance  Variety of meta-analysis techniques and models  Watch for developments  Provides the best basis for evidence-based practice  Build a cumulative quantitative discipline: Meta-analytic thinking 133 Chapters 7, 8, 9

134 The New Statistics: Actually doing it The editor says to remove CIs and just give p values. What do you DO?  Research methods best practice: Consider, decide, persist  The evidence should decide: Consider statistical cognition research  TNS reasons are compelling, TNS is the way of the future. Persist.  Explain and justify your data analytic approach  APA Publication Manual: “Wherever possible, base discussion and interpretation of results on point and interval estimates” (p. 34).  New guidelines: Psychonomic Society, Psychological Science…  I add p values if I must, but don’t mention them, nor remove CIs or ESs 134 Chapter 15

135 The New Statistics: Where next?  Examples and advice, for many situations, many ESs  ANOVA, multivariate, SEM, model fitting… Refs: tiny.cc/tnswhyhow  New textbooks, new software  Editors insisting  More guidelines, as Psychonomic Society  Statistical cognition research  Provides the evidence for evidence-based statistical practice  p values and emotions  Better graphics  Study estimation thinking  Strategies for research integrity: replication, publication, ethics…  Teach it, from the start 135 Chapter 15

136 Take-home message Estimation: The eight-step plan 1. Use estimation thinking. State estimation questions as: “How much…?”, “To what extent…?”, “How many…?” The key to a more quantitative discipline 2. Identify the ESs that best answer the questions 3. Declare full details of the intended procedure, data analysis, … 4. Calculate point and interval estimates (CIs) for those ESs 5. Make a picture, including CIs 6. Interpret (use knowledgeable judgment, in context) 7. Use meta-analytic thinking at every stage (…cumulative discipline) 8. Make a full report publicly available (an imperative, not just a goal) 136 Chapters 1, 2, 15

137 Comments to: Book information, and ESCI: …with links to: Radio talk, magazine articles Free sample chapter, dance of the p values Other videos: At YouTube, search for ‘Geoff Cumming’ Hug a confidence interval today! This PowerPoint file: tiny.cc/geoffdocs Tutorial article: tiny.cc/tnswhyhow

