The New Statistics: Estimation and Research Integrity. Geoff Cumming, School of Psychological Science, La Trobe University, Melbourne, Australia. APS-SMEP Workshop, APS Convention, San Francisco, Thursday 22 May 2014. This PowerPoint file: tiny.cc/geoffdocs Tutorial article: The New Statistics: Why and How tiny.cc/tnswhyhow THANKS TO: Alan Kraut, Kate McMahon, The Australian Research Council, Neil Thomason, Fiona Fidler, and many others. © G. Cumming 2014

The new statistics: Effect sizes, confidence intervals, meta-analysis … which is estimation. The techniques are not new, but using them widely would, in many disciplines, be new. Sections:
1. The new statistics: Why
2. Research integrity and the new statistics
3. Effect sizes and confidence intervals
4. The new statistics: How
5. Planning, power, and precision
6. Meta-analysis
Take-home message: Intuitions about variability—the dances

Understanding The New Statistics (New York: Routledge, 2012)
1. Introduction to The New Statistics
2. From Null Hypothesis Significance Testing to Effect Sizes
3. Confidence Intervals
4. Confidence Intervals, Error Bars, and p Values
5. Replication
6. Two Simple Designs
7. Meta-Analysis 1: Introduction and Forest Plots
8. Meta-Analysis 2: Models
9. Meta-Analysis 3: Larger-Scale Analyses
10. The Noncentral t Distribution
11. Cohen's d
12. Power
13. Precision for Planning
14. Correlations, Proportions, and Further Effect Size Measures
15. More Complex Designs and The New Statistics in Practice

1. The new statistics: Why
2. Research integrity and the new statistics
3. Effect sizes and confidence intervals
4. The new statistics: How
5. Planning, power, and precision
6. Meta-analysis

The Boots anti-ageing stampede. British J. Dermatology, online, 2009: "A cosmetic 'anti-ageing' product improves photoaged skin: A double-blind, randomized controlled trial." "…statistically significant improvement in facial wrinkles as compared to baseline assessment (p = .013), whereas vehicle-treated skin was not significantly improved (p = .11)." Media reports: "significant clinical improvement in facial wrinkles…" Queues at Boots for 'No. 7 Protect & Perfect Intense Beauty Serum'. Watson, R. E. B., et al. (2009). British Journal of Dermatology, 161. Chapter 2

The Boots anti-ageing stampede. Concluding 'no effect' for the placebo is accepting the null hypothesis. After statistical criticisms, a revised article: "non-significant trend…" p values and CIs are closely linked. We should assess the difference directly (p = .013 vs p = .11), but this is a common error. Watson, R. E. B., et al. (2009). British Journal of Dermatology, 161. Chapter 2

Comparing significance levels is everywhere. "…incorrect procedure… in which researchers conclude that effects differ when one effect is significant (p < .05) but the other is not (p > .05). We reviewed 513 … articles in Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience and found that 78 used the correct procedure and 79 used the incorrect procedure." Nieuwenhuis, S., Forstmann, B. U., & Wagenmakers, E-J. (2011). Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience, 14.

Conclusions, so far: Presentation format can matter—a lot. Null Hypothesis Significance Testing (NHST) promotes dichotomous thinking (an effect exists, or it doesn't). NHST: seductive, but illusory 'certainty'. CIs can prompt better interpretation … and are highly informative. p values and CIs are closely linked, but there are important differences. Chapter 2


Evidence? (Statistical cognition—only psychology can do it.) Show results from two studies to authors in medical and psychology journals. Ask them to rate: "Results of the two are broadly consistent, or similar." Ask for comments; classify these as 'mention NHST' or no such mention. Conclude: Even if authors see CIs, they often think in terms of NHST. Interpretation is better if they avoid NHST and think in terms of intervals. Don't report p values as well as CIs. Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26, 1-9. tiny.cc/cisbetter

Time for a crusade? The New Statistics: Effect sizes, confidence intervals, meta-analysis … are not themselves new, but using them widely in psychology would be new, and highly beneficial.

p values—how do we think about them? The p value: … is central to research thinking … has hardly been studied. A great question for statistical cognition research: How do people think about p? … talk about p? … feel about p? Chapter 5

…and p has real-life consequences! (Slide cartoon: p-value outcomes mapped to rewards such as tenure, a research grant, a PhD, a prize, a top publication, a consolation prize, a fair publication.)

But: p values and replication? Given the p value from an initial experiment, what's likely to happen if you replicate—do you get a similar p? We'll simulate and ask: What p values? Experimental (N = 32) and Control (N = 32) groups. Assume a population difference of Cohen's δ = 0.5 (a medium effect?). Power = .52, typical for many fields in social and behavioural science. Dance p page of ESCI chapters 5-6 (free download). Dance of the p values (it's very drunken!) The video: tiny.cc/dancepvals or tiny.cc/dancepvals2 Chapter 5
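The dance of the p values can be sketched in a few lines of Python. This is my own sketch, not ESCI itself (ESCI is an Excel workbook); `simulate_p` is a hypothetical helper, and the normal (z) approximation to the independent-groups t test is adequate at N = 32:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def simulate_p(n=32, delta=0.5, rng=random):
    """One two-group experiment: return the two-tailed p value.

    Normal (z) approximation to the independent-groups t test.
    """
    exp = [rng.gauss(delta, 1.0) for _ in range(n)]  # experimental group
    ctl = [rng.gauss(0.0, 1.0) for _ in range(n)]    # control group
    se = sqrt(stdev(exp) ** 2 / n + stdev(ctl) ** 2 / n)
    z = (mean(exp) - mean(ctl)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
ps = [simulate_p() for _ in range(1000)]
# Roughly half the replications reach p < .05 (power ~ .52), yet the
# individual p values range from far below .001 to well above .5.
```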

Replicate … and you get a VERY different p value. The p interval is an 80% prediction interval for one-tailed p (given two-tailed p_obt):
p_obt = .001, p interval ( , .070)
p_obt = .01, p interval ( , .22)
p_obt = .05, p interval (.00008, .44)
p_obt = .2, p interval (.00099, .70)
The intervals are independent of N! Any p could easily have been very different. (That's sampling variability.) A p value gives only extremely vague information about p next time! Researchers severely underestimate p intervals! (Medicine, psychology, statistics.) Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3. Lai, J., Fidler, F., & Cumming, G. (2011). Subjective p intervals: Researchers underestimate the variability of p values over replication. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8. tiny.cc/subjectivep Chapter 5
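The tabled intervals can be recomputed from the definition given above. A minimal Python sketch, assuming the replication z score is normal with mean z_obt and SD √2 (the variability of two experiments combined); `p_interval` is my own name:

```python
from math import sqrt
from statistics import NormalDist

def p_interval(p_obt, coverage=0.80):
    """80% prediction interval for the one-tailed replication p,
    given the two-tailed p value of the initial experiment."""
    nd = NormalDist()
    z_obt = nd.inv_cdf(1 - p_obt / 2)                # z for the initial result
    half = nd.inv_cdf((1 + coverage) / 2) * sqrt(2)  # replication z has SD sqrt(2)
    return (1 - nd.cdf(z_obt + half), 1 - nd.cdf(z_obt - half))

# p_obt = .05 reproduces the slide's (.00008, .44); N never enters.
```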

Traditional ANOVA table: star-jumping is crazy, yet common. 'Failure to replicate'? Be cautious. Meta-analysis. Interpretation may be based on p values, and little else? Chapter 5

Implications of the variability of p? A weird inconsistency in textbooks: Sampling variability of means—whole chapters. Sampling variability of CIs—illustrated, part of the definition. Sampling variability of p—not mentioned! Require reporting of p intervals? E.g. p = .04, p interval (0, .19). See a p value, think of the (approximate) p interval. More generally: Sampling variability is so large—remember the dances. Chapter 5

Reasons for NHST? ?? Significance testing is needed: to identify which results are real and which due to chance, to determine whether or not an effect exists, to ensure that data analysis is objective, and to make clear decisions, as in practice we need to do. In every case: NO, estimation does better. Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In Lisa L. Harlow, Stanley A. Mulaik, & James H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum.

Section 1 Conclusions: p can mislead, CIs inform. Our p tells virtually nothing about the dance; our CI gives useful information about the dance. A p value indicates strength of evidence against a null hypothesis. A p value does not signal its unreliability; CI length does signal uncertainty. p values and CIs are closely linked. A CI estimates the size of an effect, which is what we want to know. Estimation is more informative than dichotomous NHST. NEXT: Further reasons for reform of statistical and other research practices. Chapter 2

1. The new statistics: Why
2. Research integrity and the new statistics
3. Effect sizes and confidence intervals
4. The new statistics: How
5. Planning, power, and precision
6. Meta-analysis

A title to die for. The replication crisis … from cancer research to social psychology, some published findings won't replicate. Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. tiny.cc/mostfalse

Why are most false? The Ioannidis argument: The imperative to achieve statistical significance explains: 1. selective publication—file drawer; 2. data selection, tweaking, and p-hacking until p is sufficiently small; 3. why we think any finding that once meets the criterion of statistical significance is true and doesn't require replication. Many false positives published ==> most published findings false.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22. tiny.cc/falsepositivepsych

2. Data selection, tweaking, p-hacking. The Simmons et al. argument: p-hacking—it's very easy to: test a few extra participants; drop or add dependent variables; select which comparisons to analyze; drop some results as aberrant; try a few different statistical analysis strategies; then finally choose which of all the above to report. Many degrees of freedom ==> always find statistical significance. Numerous published results are false positives. Many won't replicate, not that we do many replications.

Many false positives published? Low-power studies (e.g., power = .50, for a medium effect). (In the slide figure, dark = statistically significant.) Selective publication ==> a high proportion of false positives. Selection, tweaking, p-hacking ==> turns some non-significant results into statistically significant ones. (Red = spurious statistical significance.) Now an even higher proportion of false positives. If a journal prefers surprise, false positives are selectively published. Therefore an even higher proportion of false positives.

Research integrity ≈ open science. The problems: 1. Selective publication. 2. p-hacking, and other dubious data-analytic practices. 3. Lack of replication; replications fail (the crisis). Two meanings of 'research integrity': (a) completeness and validity of the published research literature, which requires a solution to all three problems; (b) ethical, morally correct behaviour of researchers. How do we achieve research integrity? For 1: Make the results of all competent research available, somehow. For 2: Avoid dubious data-analytic practices: no p-hacking. For 1 and 2: Report everything in full, accurate detail. For 3: Carry out replications.

Research integrity: A new-statistics perspective. The new statistics should help in some ways: Remove the imperative for statistical significance. Remove the dichotomous mindset of replication as yes or no. Emphasise meta-analysis, and thus the importance of cumulation, replication, and making ALL results available. But it may not solve other problems: the career imperative to publish in the best journals; selection in data analysis and reporting; journals' preference for exciting, novel findings, not replications.

Research integrity: What we need. Understanding of the three problems—no simple changes will suffice. The new statistics. Better, more informative experiments. A clear distinction: pilot explorations (not publishable) vs. planned, pre-specified experiments (results 'published' in full). Ways to declare research plans in advance. Ways to 'publish', whatever the data (but with quality control). Ethical review boards: require pre-registration? 'Publication'? Replications (close and not-so-close). Tools, guidelines, training, editorial policies…

Research integrity findings, proposals, arguments… Perspectives on Psychological Science, 2012, issue 7(6); 2013, issue 8(4):
Makel: Only 1% of articles are replications, most 'successful'; the rate is increasing.
Bakker; Francis: Many meta-analyses have too many statistically significant results.
Giner-Sorolla: Top journals require 'cool' results! Aesthetics beat truth!
Klein: Demand characteristics and experimenter bias: still alive!
Frank; Grahe: Students should do replication experiments.
Koole: Ways to reward replications. Publish online, linked to original reports.
Nosek: Scientific Utopia: Place truth before 'cool'. Open tools, data, publication.
Wagenmakers: Publish the protocol in advance. Pre-specified (confirmatory) vs. exploratory.
Fuchs: Psychologists are open to change, but wary; they prefer standards to rigid rules.
Ioannidis: Credibility of science. Replication. Truth. Progress.
Open Science Collaboration: Reproducibility Project—replications of 2008 studies.
Fiedler: Test many alternative hypotheses. Converging evidence. Ingenuity.
Gullo: DSM5: Pathological publishing. Positive results, no null results, a 'good story'.

Research integrity: A few current projects.
Open Science Framework (OSF) tiny.cc/osf: Manage workflow, declare protocols, archive results and data.
Reproducibility Project tiny.cc/repproject: An open collaboration for replication, part of OSF. Replicate findings published in 2008.
Registered Replication Reports, in Perspectives on Psychological Science: Open, refereed & pre-specified; guaranteed publication; meta-analysis.
PsychFileDrawer tiny.cc/psychfiledrawer: Archive of reports of replications in psychology.
figshare tiny.cc/figshare: A repository of reports and datasets.
Archives of Scientific Psychology (APA) tiny.cc/archivesscipsy: Open online journal; requires full data.
rOpenSci: It's not just psychology! (Posting of software, data, analyses, discussion…)

Research integrity findings, proposals, arguments… Perspectives on Psychological Science, 2014, issue 9(3), May:
Ledgerwood: Introduction. "Best practices… things we can change right now."
Lakens & Evers: Increasing the informational value of studies; the v statistic.
Sagarin et al.: Protection for data peeking, then further data collection.
Perugini et al.: Imprecise power estimates; 'safeguard power'.
Stanley & Spence: Replications often vary greatly. Sampling error, measurement error. "Researchers should adjust their expectations concerning replications and shift to a meta-analytic mindset."
Braver, Thoemmes, & Rosenthal: Continuously cumulating meta-analysis.
Maner: Implications for editors and manuscript reviewers.

New journal guidelines.
Psychonomic Society journals: New statistical guidelines tiny.cc/psychonomicstats
Society for Personality and Social Psychology (SPSP) Task Force: Funder, D. C., et al. (2014). Improving the dependability of research in personality and social psychology: Recommendations for research and educational practice. Personality and Social Psychology Review, 18.
Psychological Science: New guidelines, from Jan 2014: tiny.cc/eicheditorial and tiny.cc/pssubguide. Editor-in-chief Eric Eich explains: tiny.cc/apseichinterview

Eich, E. (2014). Business not as usual. Psychological Science, 25, 3-6. tiny.cc/eicheditorial

Psychological Science guidelines: Enhanced reporting of methods. Compulsory full-disclosure check-boxes: exclusions, manipulations, measures, sample sizes. Up to three 'open science' badges. Embracing the new statistics. Tutorial article: tiny.cc/tnswhyhow … APS is keen for other societies and journals to do similar.

OSF badges … as in Psychological Science. Preregistered: The design and analysis plans for the reported research were preregistered in a public, open-access repository. Open Materials: All digitally shareable materials necessary to reproduce the reported methodology have been made available in a public, open-access repository. Open Data: All digitally shareable data necessary to reproduce the reported results have been made available in a public, open-access repository.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25. tiny.cc/tnswhyhow

Statistical reform efforts. The full story: Fidler (2005) tiny.cc/fionasphd. The brief story: Cumming (2014, APS Observer) tiny.cc/geoffobserver. International Committee of Medical Journal Editors (ICMJE), 1988: Use CIs. Ken Rothman, who founded Epidemiology in 1990: "We won't publish p values" … for 10 years there were virtually none. Geoff Loftus at Memory & Cognition (1993-7): Increased use of figures with error bars, but a decrease after he left. APA Publication Manual (2010): Recommended estimation; numerous examples of reporting CIs; format for a CI: "mean is 275 ms, 95% CI [210, 340]". Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can't make them think: Statistical reform lessons from medicine. Psychological Science, 15. Finch, S., Cumming, G., Williams, J., Palmer, L., Griffith, E., Alders, C., Anderson, J., & Goodman, O. (2004). Reform of statistical inference in psychology: The case of Memory & Cognition. Behavior Research Methods, Instruments & Computers, 36.

Prospects for reform. For 60+ years, damning critiques of NHST; almost no replies, almost no change. Kline's review: tiny.cc/klinechap3 (13 severe NHST problems). Critical quotes: tiny.cc/nhstquotes. More recently: Meta-analysis arrives; no need for NHST; damage done by p-based selective publication; but still little change. Last 5-10 years: Replication crisis: Urgent! At last, a tipping point?

Psychology struggles out of the p-swamp, into the beautiful garden of confidence intervals.

In summary, The new statistics: Why? Sections 1 and 2 conclusions: p values are unreliable, giving seductive but illusory 'certainty'. Dichotomous NHST is limiting; CIs are more informative. Estimation for a cumulative quantitative discipline (Meehl, Gigerenzer). And now: The replicability crisis demands change: research integrity. The APA Publication Manual recommends estimation. New journal requirements; Psychological Science leads. Research integrity: Pre-register, disclose fully, report fully. NHST, the researcher's heroin—can we kick the habit? Can we abandon the security blanket of 'significance' and p?! NEXT: Confidence intervals. Chapters 1, 2

1. The new statistics: Why
2. Research integrity and the new statistics
3. Effect sizes and confidence intervals
4. The new statistics: How
5. Planning, power, and precision
6. Meta-analysis

The new statistics: How? Effect sizes. Effect size: the amount of something of interest. Many ES measures are very familiar. No cause need be identified. An effect size (ES) can be: a mean, or a difference between means; a percentage, or percentage change; a correlation (e.g., Pearson r); a proportion of variance (R², η², ω² …); a standardised measure (Cohen's d, Hedges' g …); a regression slope (b or β); a measure of goodness of fit; many other things… (but NOT a p value!) Chapter 2

My strategy: I assume populations have a normal distribution. No mention of alternatives, all full of potential: Bayesian statistics, robust statistics, resampling methods, model comparison and selection, etc. Why choose estimation? Three criteria for reform that has a chance of succeeding: 1. Move on from NHST. 2. Move on from dichotomous thinking and decision making. 3. Resources available now to make the techniques accessible. Chapter 3

A 95% confidence interval (CI): Public support for Proposition C is 53%, in a poll with a 2% margin of error (MOE). (Slide figure: the interval pictured with a 'relative likelihood' axis.) Chapter 4
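The poll arithmetic behind this example can be sketched with the standard Wald interval for a proportion (function names are mine; the implied sample size is a back-calculation, not stated on the slide):

```python
from math import ceil, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)   # 1.96

def poll_ci(p_hat, n):
    """95% CI for a poll proportion: p_hat +/- MOE (Wald interval)."""
    moe = Z95 * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

def implied_n(p_hat, moe):
    """Roughly how many respondents give this margin of error?"""
    return ceil(Z95 ** 2 * p_hat * (1 - p_hat) / moe ** 2)

# 53% with a 2% MOE implies roughly 2400 respondents
# and a CI of about [51%, 55%].
```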


ESCI play time: CIjumping page of ESCI chapters 1-4. Dance of the means: narrow is good; large N is gold. Mean heap: the sampling distribution of sample means. SE (standard error) is the SD of the mean heap (SE = SD/√N). Central Limit Theorem (CLT): Magic: a normal distribution from thin air. μ ± 1.96 × SE contains almost all (95% of) sample means: the tram lines. 95% of sample means lie within 1.96 × SE of μ. 1.96 × SE is the margin of error (MOE); most errors are less than MOE. M ± 1.96 × SE will capture μ for most (≈95% of) samples. For σ known, 95% CI = M ± 1.96 × SE = M ± 1.96 × σ/√N. For σ not known, 95% CI = M ± t_crit × SE = M ± t_crit × s/√N. Chapter 3
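The tram lines can be checked by simulation, outside ESCI. A sketch for the σ-known case (the population values μ = 50, σ = 20 and N = 32 are arbitrary choices for illustration):

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(2)
MU, SIGMA, N = 50.0, 20.0, 32             # illustrative population and sample size
z = NormalDist().inv_cdf(0.975)           # 1.96
moe = z * SIGMA / sqrt(N)                 # sigma known: MOE = 1.96 x SE

captures = 0
for _ in range(2000):                     # 2000 samples: the dance of the CIs
    m = mean(random.gauss(MU, SIGMA) for _ in range(N))
    if m - moe <= MU <= m + moe:          # did this CI capture mu?
        captures += 1
rate = captures / 2000                    # close to 0.95
```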

CIs: Interpretation 1 (of five). How, pragmatically, should we think about CIs? How should we use CIs to interpret results? Interpretation 1: One from the dance. RT was 457 ms [427, 487]. Our CI is randomly chosen from an infinite dance, 95% of which include μ, but... it might be red. Of all the 95% CIs we ever see, around 5% will be red, but we'll never know which.

CIs: Interpretations 2-5. Is it reasonable to interpret our CI? YES, if it's likely to be typical of its dance, so yes, provided that: N is not very small (say, less than around 8), and our CI has not been selected.

Dance of the means: The mean is the best point estimate, for any N. Chapter 3

Dance of the CIs: 95% of CIs capture μ, for any N. BUT for very small N, CI length can be misleading. Chapter 3

CIs: Interpretations 2-5. Is it reasonable to interpret our CI? YES, if it's likely to be typical of its dance, so yes, provided that: (1) N is not very small (say, less than around 8). For any N, the mean is the best point estimate, but for very small N, CI length can be very misleading. (2) Our CI has not been selected. Were several CIs available, but only a selected one reported?

CIs: Interpretation 2: Our interval, with the cat's-eye picture. The beautiful shape of a CI: likelihood, or plausibility. Interpret the point estimate, 457 ms, and the CI limits. We can be 95% confident our interval includes μ. The best bets for μ are values near M; less good bets lie toward and beyond each limit. There is no sharp drop at the limits: it matters little whether a point is just inside or just outside the CI. Cumming, G. (2007). Inference by eye: Pictures of confidence intervals and thinking about levels of confidence. Teaching Statistics, 29.

CIs: Interpretation 3: Prediction interval for the next M. The CI signals, approximately, the 'width' of the dance … it signals where the next mean is likely to land. On average, approximately 83% of future M's lie within our CI. On average, a 5-in-6 chance. Researchers understand this moderately well. Cumming, G., Williams, J., & Fidler, F. (2004). Replication, and researchers' understanding of confidence intervals and standard error bars. Understanding Statistics, 3. Cumming, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11, 217–. Chapters 3
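The 83% figure can be derived directly for the σ-known case: the next mean differs from the current one by a normal error with SD √2 × SE, so the average capture probability is Φ(1.96/√2) − Φ(−1.96/√2):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
z95 = nd.inv_cdf(0.975)                   # 1.96
# Next mean minus current mean has SD sqrt(2) x SE, so the chance the
# next mean lands inside the current 95% CI is, on average:
p_capture = nd.cdf(z95 / sqrt(2)) - nd.cdf(-z95 / sqrt(2))   # ~0.834
```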

CIs: Interpretation 4: MOE as error of estimation. MOE = 30 ms. The error of estimation is |M – μ|. MOE is the maximum likely error of estimation. MOE is our measure of precision: large MOE, low precision; small MOE, high precision. Chapter 4

CIs: Interpretation 5 (least preferred): NHST. If the null hypothesis value (e.g. zero) is outside the 95% CI, reject at the p = .05 level; if within the interval, don't reject. This ignores much of the information CIs provide, and can prompt incorrect interpretation of results. There are links between CI length, level of confidence (C), and p. Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Quantitative Psychology and Measurement, 1:26. tiny.cc/cisbetter

CI length and level of confidence, C. Some simple approximate relations between C and CI length: a 99% CI is one third longer than the 95% CI; a 90% CI is one sixth shorter than the 95% CI; a 50% CI is one third as long as the 95% CI. The C% CI spans C% of the cat's-eye area. Chapter 4
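These rules of thumb are easy to verify for the normal case, where CI length is proportional to the critical z (`half_width` is my own name):

```python
from statistics import NormalDist

nd = NormalDist()

def half_width(c_percent):
    """Half-width of a C% CI, in SE units (normal case)."""
    return nd.inv_cdf((1 + c_percent / 100) / 2)

ratio_99 = half_width(99) / half_width(95)   # ~4/3: one third longer
ratio_90 = half_width(90) / half_width(95)   # ~5/6: one sixth shorter
ratio_50 = half_width(50) / half_width(95)   # ~1/3: one third as long
```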



Position relative to a 95% CI, and p. Note where the null hypothesis value, 0, falls in relation to a 95% CI; eyeball the two-tailed p value. Chapter 4

From a 95% CI to strength of evidence. Note where a 95% CI falls in relation to the null hypothesis value, 0. No need to think about p. Chapter 4

Position relative to a 95% CI, and p. If the null hypothesis value falls at the limit of the 95% CI, the two-tailed p value is .05; … 1/3 of MOE from M, about .50; … 1/6 of MOE back from a limit, about .10; … 1/3 of MOE beyond a limit, about .01; … 2/3 of MOE beyond a limit, about .001. … eyeballed p value? … or strength of evidence. (Figure exercise: eyeballed answers .033, .12, <.001, .29, .70 for the pictured 95% CIs.) Chapter 4
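The benchmarks can be reproduced with a normal approximation: the null value's distance from M, in MOE units, converts to a z score and hence a two-tailed p (`eyeball_p` is my own name):

```python
from statistics import NormalDist

nd = NormalDist()

def eyeball_p(m, moe95, null=0.0):
    """Approximate two-tailed p from where `null` sits relative to a
    95% CI of half-width moe95 around mean m (normal approximation)."""
    z = nd.inv_cdf(0.975) * abs(m - null) / moe95
    return 2 * (1 - nd.cdf(z))

# Null at the limit -> .05; 1/3 MOE from M -> ~.50;
# 1/3 MOE beyond the limit -> ~.01; 2/3 beyond -> ~.001.
```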

From p (and M) to the 95% CI. Given p, the null value 0, and M, eyeball the 95% CI. Chapter 4

Five ways to interpret a result, with a CI. Public support for Proposition C is 53%, in a poll with a 2% margin of error (MOE).
1. One from the dance: most likely includes μ, but might be red.
2. Eyeball the cat's eye. Values around 53 are most plausible for μ; values towards and beyond 51 or 55 are progressively less plausible. We are 95% confident μ lies in [51, 55]. Interpret the midpoint (53) and limits (51 and 55).
3. Quite likely (an 83% chance) a repeat survey would give a result in [51, 55].
4. The maximum likely error of estimation is 2%.
5. Support is statistically above 50%, p < .01.
… then interpret, in the research and practical context. Chapters 3, 4, 5

Interpret ESs and their CIs. Possible results of a study seeking to reduce anxiety. For each possible result, perhaps consider: interpretation of the 95% CI (use any or all of the five ways); the p value (reject H0?); ?? Can we 'accept H0'? ES reference values are shown; we could mark the ES that is practically significant. MOE: short is good. Chapters 3, 4

Interpret ESs and their CIs. Knowledgeable judgment is required, in context, but that's OK. Justify interpretations in the research context, including practical, theoretical, … implications: small, large, notable, important, economically valuable, negligible… Examples of ES reference values (mark them in figures): 10 mm on the 100 mm pain line is the minimum change of clinical note; a 15% change in memory score is the smallest of clinical importance; scores of 0-13, 14-19, 20-28, and 29-63 are, respectively, 'minimal', 'mild', 'moderate', and 'severe' levels of depression on the Beck Depression Inventory (BDI). Avoid the 'S word' … significant (shhhh). Chapters 3, 4

The tragedy of the error bar. Samples from the same population, tweaked so M and SD are the same for all N. SD is descriptive of the sample data. 95% CI length varies greatly with N; the CI is inferential, telling us about the population. SE length varies inversely with √N. The ratio CI/SE varies—for small N. SE is neither descriptive nor inferential. Use the 95% CI, not SE bars. It's a tragedy that bars don't say what they mean—always define bars. Cumming, G., Fidler, F., & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177. tiny.cc/errorbars101 Chapters 4, 6

The tragedy of the error bar. Double the length of SE bars to get the 95% CI, approximately. This doesn't work for small N, or for other ESs. Use the 95% CI, not SE bars … the CI is inferential, which is what we want. Prefer 95% CIs—the most common. (SE bars give approximately a 68% CI.) Cumming, G., Fidler, F., & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177. tiny.cc/errorbars101 Chapters 4, 6
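Why doubling fails for small N: the 95% CI half-width is t_crit × SE, and t_crit is far from 2 when df is small. A sketch with a few two-tailed .05 critical values of t hard-coded from standard tables (`ci_over_se` is my own name):

```python
# Two-tailed .05 critical values of t, from standard tables, keyed by df
T_CRIT_95 = {3: 3.182, 9: 2.262, 29: 2.045, 99: 1.984}

def ci_over_se(n):
    """Ratio of 95% CI half-width to SE for a mean based on n scores.
    'Double the SE bars' is only fair when this is close to 2."""
    return T_CRIT_95[n - 1]

# N = 4: ratio 3.18, so doubled SE bars badly understate the CI.
# N = 30 or 100: ratio ~2, so doubling is a reasonable approximation.
```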

Questions that should spring to mind… What's the DV? (On the vertical axis.) What are the two conditions? What design? (Two independent groups? Paired?) What are the bars? (SE? 95% CI? SD? Some other CI?) Every figure must provide all this information, in the figure or the caption. Chapters 4, 6

The New Statistics: How? Estimation: The eight-step plan.
1. Use estimation thinking. State estimation questions as: "How much…?", "To what extent…?", "How many…?" The key to a more quantitative discipline.
2. Identify the ESs that best answer the questions (a difference?).
3. Declare full details of the intended procedure, data analysis, …
4. Calculate point and interval estimates (CIs) for those ESs.
5. Make a picture, including CIs.
6. Interpret (use knowledgeable judgment, in context).
7. Use meta-analytic thinking at every stage (… a cumulative discipline).
8. Make a full report publicly available (an imperative, not just a goal).
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217. Chapters 1, 2, 15

Estimation thinking, estimation language. Introduction: "The study was designed to estimate reading improvement, following the new procedure, in 6 and 4 year olds." Even better: "The theory predicted a medium-to-large increase in 6 year olds, but little or no increase in 4 year olds." Better still if the predictions are quantitative. Estimation thinking is assumed; estimation of ESs is the focus. Results: "The reading age of 6 year olds increased by 4.5 months, 95% CI [1.9, 7.1], which is a large and educationally substantial increase. That of 4 year olds increased by only a negligible 0.6 months [-0.8, 2.0], …" We are given ESs, with CIs, then the ESs are interpreted. Further comment would assess the more specific (ideally, quantitative) predictions. Chapters 2-6

The New Statistics: Actually doing it. The editor says to remove CIs and just give p values. What do you DO? Research methods best practice: consider, decide, persist. The evidence should decide: consider statistical cognition research. TNS reasons are compelling; TNS is the way of the future. Persist. Explain and justify your data-analytic approach. APA Publication Manual: "Wherever possible, base discussion and interpretation of results on point and interval estimates" (p. 34). New guidelines: Psychonomic Society, Psychological Science… I add p values if I must, but I don't mention them, nor remove CIs or ESs. Chapter 15

Section 3 Conclusions. ES: the amount of anything of interest. The 95% CI gives inferential information … which is what we want. Use any of the five ways to think about a 95% CI. Interpret the ES and CI, in context. Ask estimation questions; use estimation thinking … and meta-analytic thinking. NEXT: Examples of using the new statistics. Chapter 2

1. The new statistics: Why
2. Research integrity and the new statistics
3. Effect sizes and confidence intervals
4. The new statistics: How
5. Planning, power, and precision
6. Meta-analysis

The New Statistics: How? Estimation: The eight-step plan.
1. Use estimation thinking. State estimation questions as: "How much…?", "To what extent…?", "How many…?" The key to a more quantitative discipline.
2. Identify the ESs that best answer the questions (a difference?).
3. Declare full details of the intended procedure, data analysis, …
4. Calculate point and interval estimates (CIs) for those ESs.
5. Make a picture, including CIs.
6. Interpret (use knowledgeable judgment, in context).
7. Use meta-analytic thinking at every stage (… a cumulative discipline).
8. Make a full report publicly available (an imperative, not just a goal).
Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217. Chapters 1, 2, 15

Randomised control trial (RCT). Means, with 95% CIs. These CIs can guide assessment of which comparisons? (Between groups.) They cannot help with which comparisons? (The repeated measure.) Figure page of ESCI chapters 14-15. Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data. American Psychologist, 60. tiny.cc/inferencebyeye Chapter 15

79
Two basic experimental designs—with CIs 1. Two independent groups E.g. Experimental group vs Control group OK to use CIs on means to assess the difference (…think independent groups t test) Data two and Simulate two pages of ESCI chapters 5-6 2. Paired design (repeated measure) E.g. Pretest vs Posttest, single group of patients May NOT use CIs on Pretest and Posttest to assess the difference Need the CI on the paired differences (…think paired t test) Data paired and Simulate paired pages of ESCI chapters 5-6 Chapter 6

80
Two basic experimental designs, and t tests Two independent groups CIs on separate means (and on diff) are based on the pooled SD: s_p × √(1/n1 + 1/n2) Paired (or matched) design Different error term CI on difference is based on the SD of the differences: s_diff / √n 80 Chapter 6
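The two error terms can be put in a few lines of code (a minimal sketch, not ESCI itself; it uses z ≈ 1.96 in place of the t critical value, so it is approximate for small samples):

```python
import math
from statistics import NormalDist, mean, stdev

Z95 = NormalDist().inv_cdf(0.975)  # ≈ 1.96; a t critical value is more exact for small N

def ci_diff_independent(a, b):
    """95% CI on the difference between two independent group means;
    error term: pooled SD × sqrt(1/n1 + 1/n2)."""
    na, nb = len(a), len(b)
    sp = math.sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2))
    se = sp * math.sqrt(1 / na + 1 / nb)
    d = mean(a) - mean(b)
    return d - Z95 * se, d + Z95 * se

def ci_diff_paired(pre, post):
    """95% CI on the mean of the paired differences; error term: s_diff / sqrt(n)."""
    diffs = [y - x for x, y in zip(pre, post)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    m = mean(diffs)
    return m - Z95 * se, m + Z95 * se
```

Running both on the same data makes the slide's point: when pre and post scores are correlated, s_diff is small and the paired CI is far shorter than the two separate CIs would suggest.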

81
81 [Figure: Overlap Rule of Eye — Pretest and Posttest means with 95% CIs, from the Compare A B page of ESCI chapters 5-6. The CI on the paired differences is based on different SD information (the SD of the differences); the two separate CIs are based on the same SD information.] Chapter 6

82
Two independent groups: Rules of Eye 82 Two 95% CIs just touching (zero overlap) indicates moderate evidence of a population difference (approx p =.01) Moderate overlap (about half average MOE) is some evidence of a difference (approx p =.05) When both sample sizes are at least 10, and the two MOEs do not differ by more than a factor of 2. Use the rule without reference to p Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals, and how to read pictures of data. American Psychologist, 60, tiny.cc/inferencebyeye Chapter 6
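The rule is simple arithmetic on the two intervals. A sketch (hypothetical helper names; overlap is measured in units of the average MOE, following Cumming & Finch, 2005):

```python
def ci_overlap(mean1, moe1, mean2, moe2):
    """Overlap of two independent 95% CIs, in units of the average MOE.
    Positive = the intervals overlap; negative = a gap between them."""
    (m_lo, moe_lo), (m_hi, moe_hi) = sorted([(mean1, moe1), (mean2, moe2)])
    # upper limit of the lower-mean CI minus lower limit of the higher-mean CI
    overlap = (m_lo + moe_lo) - (m_hi - moe_hi)
    return overlap / ((moe1 + moe2) / 2)

def rule_of_eye(mean1, moe1, mean2, moe2):
    """Heuristic reading, valid when both N >= 10 and the MOEs
    differ by less than a factor of 2."""
    ov = ci_overlap(mean1, moe1, mean2, moe2)
    if ov <= 0:
        return "moderate evidence of a difference (approx p <= .01)"
    if ov <= 0.5:
        return "some evidence of a difference (approx p <= .05)"
    return "little evidence from overlap alone"
```

For example, CIs [−2, 2] and [2, 6] just touch (overlap 0), the p ≈ .01 case; [−2, 2] and [1, 5] overlap by half the average MOE, the p ≈ .05 case.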

83
Two independent groups: Rules of Eye 83 Two 95% CIs just touching (zero overlap) indicates moderate evidence of a population difference (approx p =.01) Moderate overlap (about half average MOE) is some evidence of a difference (approx p =.05) When both sample sizes are at least 10, and the two MOEs do not differ by more than a factor of 2. Use the rule without reference to p Cumming, G. (2009). Inference by eye: Reading the overlap of independent confidence intervals. Statistics in Medicine, 28, Chapter 6

84
Randomised controlled trial (RCT) 84 Chapter 15 Means, with 95% CIs These CIs can guide assessment of which comparisons? (between groups) They cannot help with which comparisons? (repeated measure) Figure page of ESCI chapters 14-15 Q: How to display the within-S CIs as well? Fidler, F., Faulkner, S., & Cumming, G. (2008). Analyzing and presenting outcomes: Focus on effect size estimates and confidence intervals. In A. M. Nezu & C. M. Nezu (Eds.) Evidence-based outcome research (pp ). New York: OUP.

85
85 RCT example: Plot mean change score, with its CI One choice for planned contrasts Interpret the 95% CIs Use overlap rule? Figure page of ESCI chapters Chapter 15

86
Cohen’s d (A standardised effect size), SMD Cohen’s d is an ES expressed as a number of SDs (a z score) d picture page of ESCI chapters 10-13 Lots of overlap of the populations For d = 0.5 (a medium effect?), 69% of E points higher than C mean Cohen’s small, medium, large: 0.2, 0.5, 0.8—but arbitrary, last resort Cohen’s d is the ES (in original units), divided by a suitable SD Our sample d is a point estimate of the population value δ We need d to help understanding, and for meta-analysis 86 Chapter 11

87
Cohen’s d, for 2 independent groups IQ example: Control group: M C = 110, s C = 12; Experimental group: M E = 120, s E = 16 d = ES / SD … ES is in original units, we choose a value for SD 1. Which SD makes best sense as the unit of measurement? (Population SD) 2. What’s the best estimate of this SD? Three options: Use 15, SD in reference population, d = (120 – 110) / 15 = 0.67 Use s C = 12, estimate of Control population SD, d = (120 – 110) / 12 = 0.83 Use s p = 14, pooled estimate from both groups, d = (120 – 110) / 14 = 0.71 Third option: Both numerator and denominator are estimates, so change on replication The rubber ruler: the SD as an elastic unit of measurement Denominator also used for t test and CI on difference between group means Caution: d = 0.5 may be 10/20 or 12/24 or 9/18 or… 87 Chapter 11
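The three options are simple arithmetic. A sketch of the slide's IQ example (the group size n = 20 used for the pooled estimate is an assumption for illustration; the slide does not state n):

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled SD from two groups: the usual basis for the t-test denominator."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

def cohens_d(effect, sd):
    """d = ES / SD: the effect in original units divided by the chosen SD."""
    return effect / sd

# IQ example from the slide: M_C = 110, s_C = 12; M_E = 120, s_E = 16
es = 120 - 110
print(round(cohens_d(es, 15), 2))                          # reference-population SD -> 0.67
print(round(cohens_d(es, 12), 2))                          # control-group SD -> 0.83
print(round(cohens_d(es, pooled_sd(12, 20, 16, 20)), 2))   # pooled SD (~14.14) -> 0.71
```

The pooled SD with s = 12 and 16 and equal n is √200 ≈ 14.14, which the slide rounds to 14; the rubber-ruler caution is visible here, since each choice of denominator gives a different d for the same 10-point effect.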

88
Cohen’s d, for paired design IQ example: A healthy breakfast increases IQ score by average 2 points Usual: M usual = 110, s usual = 12; Healthy: M healthy = 112, s healthy = 16 (Also s diff = 1.2) d = ES / SD Which SD makes best sense as the unit of measurement? (Population SD?) Use 15, population SD, d = 2 / 15 = 0.13 Use s C = 12, Usual estimate, d = 2 / 12 = 0.17 Use s p = 14, pooled, d = 2 / 14 = 0.14 BUT for paired t test, and CI on mean of differences, use s diff …use s diff as the measuring unit for d?? NO: it gives d = 2 / 1.2 = 1.7. Silly. Our choice of SD for d may differ from what we use for inference 88 Chapter 11

89
CIs for Cohen’s d Both numerator and denominator have sampling variability …so distribution of d is tricky To calculate accurate CIs on d, need the noncentral t distribution …Fairytale “How the noncentral t distribution got its hump” tiny.cc/noncentralt To calculate CIs for d use ESCI, or an excellent approximate method: Cumming, G., & Fidler, F. (2009). Confidence intervals: Better answers to better questions. Zeitschrift für Psychologie / Journal of Psychology, 217, Introduction to CIs on d and noncentral t: Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, Chapters 10, 11

90
Unbiased estimate of δ is d_unb Unfortunately d overestimates δ. The unbiased estimate of δ is: Multiply d by the adjustment factor (slightly less than 1) to get d_unb. Routinely use d_unb (sometimes called Hedges’ g, but terminology a mess) Data two and Data paired pages of ESCI chapters 5-6 Chapter 11
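A widely used approximate form of the adjustment factor is (1 − 3/(4·df − 1)); ESCI uses the exact gamma-function version, but the approximation is very close for df ≥ 10. A sketch:

```python
def d_unbiased(d, df):
    """Approximate Hedges adjustment for the bias of d:
    multiply d by (1 - 3/(4*df - 1)). The exact factor uses
    gamma functions; this approximation is close for df >= 10."""
    return d * (1 - 3 / (4 * df - 1))

# Two independent groups of n = 20 (df = 38): the adjustment shrinks d by ~2%
print(round(d_unbiased(0.71, 38), 3))  # -> 0.696
```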

91
Cohen’s d, take-home messages d = ES / SD … where ES is in original units, and we choose a value for SD d is highly valuable, especially for meta-analysis Choose SD carefully: what makes best sense as the unit of measurement? Use the best available estimate of this SD Report how d was calculated—if we don’t know that, we can’t interpret Beware the rubber ruler, interpret d values with caution Usually use d_unb, the unbiased estimate of δ Beware terminology (Hedges’ g, Glass’s Δ) Use ‘Cohen’s d’ or d_unb, with explanation of how calculated Interpret values in context, use 0.2, 0.5, 0.8 as a last resort 91 Chapter 11

92
CI on correlation, r Use Fisher’s r to z transformation CIs asymmetric, more so for r near -1 or 1 CIs shorter when r is near -1 or 1 CIs surprisingly wide, unless N is large Example: r =.6, N = 30 Cohen’s benchmarks:.1,.3,.5—often inappropriate r to z and Two correlations pages of ESCI chapters 14-15 Correlations and Diff correlations pages of ESCI Effect sizes Finch, S., & Cumming, G. (2009). Putting research in context: Understanding confidence intervals from one or more studies. Journal of Pediatric Psychology, 34, Chapter 14
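The Fisher transformation is easy to sketch (a minimal illustration of the standard method, not ESCI itself): transform r to z, take a symmetric interval in z, then transform back, which is what produces the asymmetry.

```python
import math
from statistics import NormalDist

def ci_r(r, n, conf=0.95):
    """CI on Pearson r via Fisher's r-to-z transformation:
    z = atanh(r), SE = 1/sqrt(N - 3), back-transform with tanh."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

lo, hi = ci_r(0.6, 30)  # the slide's example: roughly [.31, .79], asymmetric about .6
```

Even with r = .6 and N = 30 the interval spans nearly half the possible range, which is the "surprisingly wide" point.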

93
CI on proportion, P Difference between two proportions (instead of OR or χ²) ES is proportion survived. Diff = 17/20 – 11/20 =.30, [.02,.53] Proportions and Diff proportions pages of ESCI Effect sizes Altman, D. G., Machin, D., Bryant, T. N., & Gardner, M. J. (2000). Statistics with confidence: Confidence intervals and statistical guidelines (2nd ed.). London: BMJ Books. 93 Chapter 14
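A sketch of the calculation. Note the caveat: this is the simple Wald interval; ESCI and Altman et al. use the more accurate Newcombe method, which is where the slide's [.02, .53] comes from, so the Wald answer is close but not identical.

```python
import math
from statistics import NormalDist

def ci_prop_diff(x1, n1, x2, n2, conf=0.95):
    """Approximate (Wald) CI on the difference between two independent
    proportions. The Newcombe method used by ESCI is more accurate,
    especially near 0 or 1; this sketch shows only the basic logic."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

lo, hi = ci_prop_diff(17, 20, 11, 20)  # ≈ (.03, .57), vs Newcombe's [.02, .53]
```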

94
CIs for assessing model fit to data Velicer, W. F., Cumming, G., Fava, J. L., Rossi, J. S., Prochaska, J. O., & Johnson, J. (2008). Theory testing using quantitative predictions of effect size. Applied Psychology: An International Review, 57, Test the Transtheoretical Model of Behavior Change, with data from N = 3967 smokers. Calculate 95% CIs on 15 predictor variables. Dots are the quantitative predictions of the model. 94 Chapter 15

95
Section 4 Conclusions More complex designs: Planned contrasts, with 95% CIs Design: independent groups vs repeated measure Cohen’s d and d unb ; how calculated? Rubber ruler CIs on other ES measures Correlation; difference between two independent correlations Proportion; difference between two independent proportions CIs to assess model fit to data NEXT: Planning good studies 95 Chapter 2

96
1.The new statistics: Why 2.Research integrity and the new statistics 3.Effect sizes and confidence intervals 4.The new statistics: How 5.Planning, power, and precision 6.Meta-analysis 96

97
Statistical power: I’m ambivalent Statistical power is the chance of finding something if it is there Statistical power = 1 – β = Prob (reject H0, IF H0 false) Depends on NHST If using NHST, take statistical power seriously Instead, use precision for planning: design for target MOE “Power” more loosely: “Goodness” of an experiment Instead, maximise informativeness 97 Chapter 12

98
Power picture Statistical power is the chance we’ll find an effect, if there is an effect of stated size At right: Single sample, N = 18, target ES is δ =.5, α =.05, two-tailed, power =.52 Power picture page of ESCI chapters To calculate power, we need noncentral t, unless σ is known Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 530– Chapter 12

99
Power example: HEAT Hot Earth Awareness Test (HEAT) Test of climate change knowledge, attitudes, and behavior Assume μ = 50, σ = 20 in reference population Use α =.05, two-tailed. Choose target ES that is meaningful Two independent groups experiment, N in each group For target ES of δ = your choice, study variation of power with N E.g. 8 points on HEAT is δ = 8/20 = 0.4 Target ES makes a big difference For paired design, set ρ, the population correlation, and study power Correlation ρ, as well as target ES, makes a big difference Power two and Power paired pages of ESCI chapters 10-13 99 Chapter 12
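The shape of these power curves can be explored with a normal approximation (exact values need noncentral t, as in ESCI or G*Power; the z version here runs slightly high at small N):

```python
import math
from statistics import NormalDist

def power_two_groups(delta, n_per_group, alpha=0.05):
    """Approximate power for a two-independent-groups comparison with a
    target effect of delta SD units, via the normal approximation.
    Exact power requires the noncentral t distribution."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(delta * math.sqrt(n_per_group / 2) - z_crit)
```

Varying delta and N shows the slide's two points: power climbs with N, and the chosen target ES makes a big difference to the N needed.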

100
Ways to increase power Increase N Increase target ES (!!) Increase α (!!) Improve the experimental design Use better measures Scope for fudging (Grant applications, ethics proposals…) 100 Chapter 12

101
Power recommendations APA Manual: “Take seriously the statistical power considerations associated with the tests of hypotheses. … routinely provide evidence that the study has sufficient power to detect effects of substantive interest…” (p. 30) BUT power values are very rarely reported in psychology journals 101 Chapter 12

102
G*Power Great software to calculate power, display power curves Test family (play with t, F, z…) Enter no. tails, type of power analysis, α, etc Enter target ES For ANOVA, use f as the ES measure (see Cohen, 1988) Power for comparing two r values (correlations) Or set power, use ‘Determine’ to calculate ES value X-Y plot, to examine sensitivity Download from tiny.cc/gpower3 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum 102 Chapter 12

103
Statistical power: Often so low HIGH power is gold, but many disciplines are cursed by low power Cohen (1962): In published psychology research, median power to find a medium-sized effect is about.5 Maxwell (2004): It was still about.5 for a medium effect Our file drawers (and journals) are crammed with Type 2 errors: Results that are ns even though there is a real effect Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, Chapter 12

104
Post hoc power: A bad idea Calculated after data are obtained. Use obtained d as target Logical problem: Power is a prospective probability Replicate, and see the ‘dance of post hoc power’ Enormous variation Simulate two page of ESCI chapters 5-6 Devastatingly criticised as not telling us what we want to know (chance we’ll find an effect of a size chosen to be meaningful) Merely reflects the outcome of our study. Tells us nothing new. SPSS, etc, gives post hoc power in its printouts. Don’t use this value Poor practice by software publishers Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, Chapter 12

105
Informativeness Informativeness—my general term for quality, size, sensitivity, usefulness To increase informativeness (also power & precision): Choose experimental design to minimise error (repeated measures?) Improve the measures, maybe measure twice and average Statistical control? (Covariance?) Target large effect sizes: Six therapy sessions, not two Use large N (of course)—tho’ to halve SE, need to multiply N by 4 Use Meta-analysis (combine results over experiments) Don’t spread your eggs over too many baskets. Do one or two things well, rather than risk examining lots of things and finding ≈ nothing The enemy is error variability: reduce it by all means possible An essential step in research planning, worth great effort: Brainstorm 105 Chapters 12, 13
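The "to halve SE, multiply N by 4" point is just the square root in the standard-error formula:

```python
import math

def se_of_mean(sigma, n):
    """Standard error of the mean: sigma / sqrt(N)."""
    return sigma / math.sqrt(n)

# Using the HEAT sigma of 20: halving the SE requires quadrupling N
print(se_of_mean(20, 100))  # 2.0
print(se_of_mean(20, 400))  # 1.0
```

That diminishing return is why reducing error variability by design and measurement often buys more informativeness than raw N can.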

106
Precision for planning (AIPE, accuracy in parameter estimation) ESCI calculates what N is required to give: expected MOE no more than f × σ (so f is like d, a number of SDs) OR to have a 99% chance MOE is no more than f × σ ‘assurance’ = 99%, expressed as γ = 99 Three Precision pages of ESCI chapters 5-6 Chapter 13

107
Precision for planning, HEAT example HEAT experiment, two independent groups Target MOE of 8, which is 0.4 × 20, or f × σ, where f = 0.4 Three Precision pages of ESCI chapters 5-6 For f = 0.4, two independent groups, need N = 50 And for assurance γ = 99, need N = 65 Alas, such large N, even with such large f Paired experiment, with ρ =.7, need N = 17 Or N = 28, with assurance γ = 99 Chapter 13
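A back-of-envelope version of the f-to-N calculation (z approximation; ESCI's exact method, which uses the t distribution and the expected MOE, gives the slide's N = 50 rather than 49 for f = 0.4):

```python
import math
from statistics import NormalDist

def n_for_target_moe(f, conf=0.95):
    """Per-group N for two independent groups so that expected
    MOE ≈ f × sigma, from MOE = z × sigma × sqrt(2/N).
    A z approximation: ESCI's exact t-based answer is slightly larger."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(2 * (z / f) ** 2)
```

Halving f (doubling precision) roughly quadruples the required N, which is why tight target MOEs are so expensive with independent groups.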

108
Precision for planning (AIPE, accuracy in parameter estimation) ESCI calculates what N is required to give: expected MOE no more than f × σ (so f is like d, a number of SDs) OR to have a 99% chance MOE is no more than f × σ ‘assurance’ = 99%, expressed as γ = 99 Not yet widely used, but highly recommended (No need for H0) Target MOE replaces target ES Should replace power calculations for funding and ethics applications 108 Chapter 13

109
Section 5 Conclusions Power, informativeness, precision IF using NHST, take power seriously Low power for a meaningful effect size ==> a waste of time? Don’t use post-hoc power Make great efforts to maximize informativeness Precision (MOE of the CI) is more useful than power: No need for NHST, or for any H0 CI pictures are so revealing And essential, if you wish to conclude no (or trivial) effect. If it could be useful, use precision for planning NEXT: Meta-analysis and meta-analytic thinking 109 Chapters 12, 13

110
1.The new statistics: Why 2.Research integrity and the new statistics 3.Effect sizes and confidence intervals 4.The new statistics: How 5.Planning, power, and precision 6.Meta-analysis 110

111
Single studies—So many problems Power often low CIs often wide, precision low CIs report accurately the uncertainty in data. But don’t shoot the messenger—it’s a message we need to hear The solutions: Increase informativeness of individual studies Combine results over studies—Meta-analysis 111 Chapter 7

112
Meta-analysis: the picture The forest plot CIs make this picture possible; p values are irrelevant ESCI Meta-Analysis Beginning undergraduates easily grasp the basics—via pictures Effect sizes used in meta-analysis: Means, Cohen’s d, r, others… A stylized cat’s eye: Cumming, G. (2006). Meta-analysis: Pictures that explain how experimental findings can be integrated. 7th International Conference on Teaching Statistics. Brazil, July. tiny.cc/teachma Hunt, M. (1997). How science takes stock. The story of meta-analysis. New York: Sage. 112 Chapter 7

113
Meta-analysis: Small The minimum: two studies to be combined Prior studies, or Our own, perhaps within one project, or Prior + our own …to get the best overall estimates, so far 113 Chapter 7

114
Meta-analysis: Large Cooper’s seven steps; informed critical judgment at every stage 1. Formulate the questions, and scope of the systematic review 2. Search and obtain literature, contact researchers, find grey literature Establish selection criteria, read and select studies 3. Code studies, enter ES estimates and coding of study features 4. Choose what to include, and design the analyses 5. Analyse the data. Prefer random effects model. Moderators? 6. Interpret; draw empirical, theoretical, and applied conclusions. 7. Prepare critical discussion, present the review 8. Receive $1,000,000 and gold medal. Retire early. (Joke, alas.) Cooper, H. M. (2009). Research synthesis and meta-analysis: A step-by-step approach (4th ed.). Thousand Oaks, CA: Sage. 114 Chapter 9

115
Heterogeneity Forest plot variability, or dance of the ESs Heterogeneity measures the extent of dancing Sampling variability accounts for a certain width of dancing Studies homogeneous => sampling variability accounts for dance Studies heterogeneous => There is variability beyond that expected from sampling variability Therefore, moderating variables may contribute Can studies be too homogeneous? What might that imply? Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York: Wiley. 115 Chapter 8

116
Models for meta-analysis Fixed effect (FE) model Assume homogeneity: every study estimates the same μ Almost always unrealistic Random effects (RE) model—our routine choice Assume Study i estimates μ_i, sampled from N(μ, τ) τ = 0 means studies homogeneous, FE model applies τ > 0 means studies heterogeneous, RE model needed Assumptions are severe. Unrealistic? Other models? Varying-coefficient model of Doug Bonett—watch this space Other approaches? Bayesian Schmidt-Hunter corrections for measurement and other biases Schmidt, F., & Hunter, J. (2014). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Sage. 116 Chapter 8

117
Measures of heterogeneity Q …the weighted sum of squares between studies T …the estimate of τ, with CI Interpret T and its CI (Extends to 0? Interpret the limits) I² …the percentage of total variance between studies that reflects variation in true effect size, rather than sampling variability If heterogeneity is considerable, consider a moderator analysis If heterogeneity low, RE gives same result as FE => nothing to lose by using RE 117 Chapter 8
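Q, the τ estimate, and I² all drop out of one standard random-effects computation, the DerSimonian-Laird method (a common default in software such as CMA; this is a sketch of that method, not ESCI's exact implementation):

```python
def meta_analysis(effects, variances):
    """Fixed-effect estimate, DerSimonian-Laird heterogeneity
    (Q, tau^2, I^2 as a percentage), and the random-effects estimate."""
    w = [1 / v for v in variances]
    sw = sum(w)
    fe = sum(wi * y for wi, y in zip(w, effects)) / sw          # fixed-effect mean
    q = sum(wi * (y - fe) ** 2 for wi, y in zip(w, effects))    # weighted SS between studies
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                               # between-study variance
    i2 = max(0.0, 100 * (q - df) / q) if q > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]                  # RE weights add tau^2
    re = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    return {"FE": fe, "Q": q, "tau2": tau2, "I2": i2, "RE": re}
```

With homogeneous studies tau² comes out 0 and RE equals FE, which is the slide's "nothing to lose by using RE" point; with heterogeneous studies the RE weights flatten toward equality and the CI widens.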

118
But there’s more: Moderator analysis If heterogeneity high, look for moderators Simplest: Dichotomous moderator? (e.g., gender) Subgroups page of ESCI Meta-analysisESCI Meta-analysis Identify moderator, even if no study manipulated that variable Meta-analysis can give theoretical progress—that’s gold Example: Peter Wilson, clumsy children, meta-analysis of 50 studies Identify performance on complex visuospatial tasks as moderator Study this moderator empirically 118 Chapter 9

119
Continuous moderator? Meta-regression Fletcher & Kerr (2010): Does RTG fade with length of relationship? Meta-regression of ES values (RTG score) against years, 13 studies Correlation, not causality. Alternative interpretations? 119 Chapter 9

120
MA in the Publication Manual Many mentions, esp. pp , 183. Mainstreaming meta-analysis MARS (Meta-Analysis Reporting Standards) pp A further big advantage of the sixth edition (2010) Cooper, H. (2010). Reporting research in psychology: How to meet Journal Article Reporting Standards (APA Style). Washington, DC: APA Books. MARS and JARS 120 Chapter 9

121
CMA: Software for meta-analysis Comprehensive Meta Analysis Enter ES, and its variance, for each study—in 100+ formats Choose FE or RE model Assess heterogeneity of studies Explore moderators (ANOVA, or meta-regression) Forest plot Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. New York: Wiley. 121 Chapters 8, 9

122
Health sciences: The Cochrane Collaboration Medicine, health sciences, health policy, practice… Systematic reviews: meta-analyses of research studies Publicly available in some countries 5,000+ reviews online 31,000+ people in 120+ countries Aim to update every two years RevMan software for meta-analysis Includes some psychology Campbell collaboration, for some social sciences (education, welfare…) 122 Chapter 9

123
PTSD: A Cochrane review Bisson JI, Roberts NP, Andrew M, Cooper R, Lewis C. Psychological therapies for chronic post-traumatic stress disorder (PTSD) in adults. Cochrane Database of Systematic Reviews 2013, Issue 12. Art. No.: CD Includes 70 studies, total of 4761 participants Update of 2005 Cochrane review, updated in 2007 Support for efficacy, for chronic PTSD in adults, of Trauma-focused cognitive behavioral therapy (TFCBT), and Eye movement desensitization and reprocessing (EMDR) Non-trauma-focused psychological therapies not so effective 123 Chapter 9

124
Quality of the evidence Many studies, but each included only small numbers of people Some studies were poorly designed The overall quality of the studies was very low and so findings should be interpreted with caution There is insufficient evidence to show whether or not psychological therapy is harmful 124 Chapter 9

125
Bias? Research integrity issues 125 Chapter 9 Judgments about risks of bias, as percentages of included studies (p. 13)

126
Funnel plot: Publication bias? 126 Chapter 9 Individual therapy vs waitlist/usual care Outcome: Severity of PTSD symptoms - clinician-rated (p. 26) Large studies Small studies Favors therapy Favors control …suggests the possibility of publication bias Small studies missing? Because not statistically significant?

127
Funnel plot: Publication bias? 127 Chapter 9 Large studies Small studies Favors therapy Favors control …suggests the possibility of publication bias Small studies missing? Because not statistically significant? Individual therapy vs waitlist/usual care Outcome: Severity of PTSD symptoms - clinician-rated (p. 26)

128
128 Chapter 9 Forest plot of Analysis 1.10 Comparison 1: Trauma-focused CBT/Exposure therapy vs waitlist/usual care Outcome 10: Depression month follow-up (p. 152)

129
129 Chapter 9 Forest plot of Analysis 3.1. Comparison 3: Trauma-focused CBT/Exposure Therapy vs other therapies Outcome 1: Severity of PTSD symptoms – clinician (p. 171)

130
Meta-analysis, in many disciplines Particle physics As much heterogeneity as in social sciences! Hedges, L. V. (1987). How hard is hard science, how soft is soft science? The empirical cumulativeness of research. American Psychologist, 42, 443– Chapter 9

131
Meta-analytic thinking 1. Think of past literature in meta-analytic terms 2. Think of our study as the next step in that progressively cumulating meta-analysis 3. Report results so inclusion in future meta-analysis is easy Report all effect sizes (whether ns or not), in the best way Cumming, G., & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals based on central and noncentral distributions. Educational and Psychological Measurement, 61, Chapters 1, 7, 9

132
Meta-analysis, more generally… A typical study asks multiple questions, tests theory, doesn’t simply ask ‘how large is the effect of the treatment?’ A typical article includes several experiments, a number of manipulations, a number of DVs… MA typically chooses a single ES from each study the most important, or the average of several, or the one most often reported or carry out more than one MA, as in Cochrane Converging approaches, converging evidence—most persuasive Reduces risk that findings are merely chance fluctuations Provides some evidence of robustness, generality of findings …all part of meta-analytic thinking 132 Chapter 7

133
Section 6 Conclusions Meta-analysis: Any size from small to very large Quantitative integration, even of large, messy literatures Best (most precise) ES estimates Moderator analysis: Practical and theoretical importance Variety of meta-analysis techniques and models Watch for developments Provides the best basis for evidence-based practice Build a cumulative quantitative discipline: Meta-analytic thinking 133 Chapters 7, 8, 9

134
The New Statistics: Actually doing it The editor says to remove CIs and just give p values. What do you DO? Research methods best practice: Consider, decide, persist The evidence should decide: Consider statistical cognition research TNS reasons are compelling, TNS is the way of the future. Persist. Explain and justify your data analytic approach APA Publication Manual: “Wherever possible, base discussion and interpretation of results on point and interval estimates” (p. 34). New guidelines: Psychonomic Society, Psychological Science… I add p values if I must, but don’t mention them, nor remove CIs or ESs 134 Chapter 15

135
The New Statistics: Where next? Examples and advice, for many situations, many ESs ANOVA, multivariate, SEM, model fitting… Refs: tiny.cc/tnswhyhowtiny.cc/tnswhyhow New textbooks, new software Editors insisting More guidelines, as Psychonomic Society Statistical cognition research Provides the evidence for evidence-based statistical practice p values and emotions Better graphics Study estimation thinking Strategies for research integrity: replication, publication, ethics… Teach it, from the start 135 Chapter 15

136
Take-home message Estimation: The eight-step plan 1. Use estimation thinking. State estimation questions as: “How much…?”, “To what extent…?”, “How many…?” The key to a more quantitative discipline 2. Identify the ESs that best answer the questions 3. Declare full details of the intended procedure, data analysis, … 4. Calculate point and interval estimates (CIs) for those ESs 5. Make a picture, including CIs 6. Interpret (use knowledgeable judgment, in context) 7. Use meta-analytic thinking at every stage (…cumulative discipline) 8. Make a full report publicly available (an imperative, not just a goal) 136 Chapters 1, 2, 15

137
137 Comments to: Book information, and ESCI: …with links to: Radio talk, magazine articles Free sample chapter, dance of the p values Other videos: At YouTube, search for ‘Geoff Cumming’ Hug a confidence interval today! This PowerPoint file: tiny.cc/geoffdocs Tutorial article: tiny.cc/tnswhyhow
