Slide 1: MAGNITUDE-BASED INFERENCES - An alternative to hypothesis testing
Slide 2: Lecture outline
- Limitations of null hypothesis significance testing (NHST)
- Confidence intervals
- Magnitude-based inferences (MBI)
- Smallest worthwhile effects
- Limitations of magnitude-based inferences
Slide 3: Null hypothesis significance testing
A major aim of research is to make an inference about an effect in a population based on study of a sample. Null-hypothesis testing via the P-value and statistical significance is the traditional approach to making an inference. The P-value is the probability of obtaining the observed result, or more extreme results, if the null hypothesis is true.
[Figure: the set of possible results under the null hypothesis, with the most likely observation at the peak of the probability distribution; the one-sided p-value is the tail area beyond the observed data point.]
Slide 4: Limitations of NHST
- P-values are difficult to understand!
- P ≤ 0.05 is arbitrary: "Surely God loves the 0.06 nearly as much as the 0.05." (Rosnow and Rosenthal, 1989)
- Some useful effects aren't statistically significant.
- Some statistically significant effects aren't useful.
- P > 0.05 is often interpreted as unpublishable, so good data don't get published.
- NHST ignores judgement and leads to dichotomised thinking ("Statistical Hypothesis Inference Testing"?).
In arriving at a problem statement and research question, researchers usually have good reasons to believe that effects will be different from zero. The more relevant issue is not whether there is an effect but how big it is. Unfortunately, the P value alone provides us with no information about the direction or size of the effect or, given sampling variability, the range of feasible values. Depending, inter alia, on sample size and variability, an outcome statistic with P < 0.05 could represent an effect that is clinically, practically, or mechanistically irrelevant. Conversely, a nonsignificant result (P > 0.05) does not necessarily imply that there is no worthwhile effect, because a combination of small sample size and large measurement variability can mask important effects. An overreliance on P values might therefore lead to unethical errors of interpretation.
Slide 8: Confidence intervals
A range within which we infer the true, population, or large-sample value is likely to fall. "Likely" is usually a probability of 0.95 (for 95% limits).
[Figure: probability distribution of the true value, given the observed value; the area between the lower and upper likely limits is 0.95. The limits can also be represented as a confidence interval: the likely range of the true value on the scale of the effect statistic, from negative to positive.]
Slide 9: Confidence intervals (CI)
To calculate a confidence interval you need:
- A confidence level (usually 95%, but 90% or 99% can also be used)
- A statistic (e.g. a group mean)
- The margin of error (MOE) = critical value (a z-score or t-score) × standard error of the statistic, where SE = SD/√n

CI = [M − MOE, M + MOE] = [M − critical value × SE, M + critical value × SE]

EXAMPLE: You test the vertical jump heights of 100 athletes. The mean and standard deviation of the sample are 60 ± 20 cm. The 95% CI for this mean is:
Lower CI = 60 − 1.96 × 20/√100 = 60 − 3.92 = 56.08
Upper CI = 60 + 1.96 × 20/√100 = 60 + 3.92 = 63.92
95% CI = 56 to 64 cm
Slide 10: Confidence intervals (CI)
Confidence intervals also convey the precision of an estimate: a wider confidence interval means less precision. In the previous example, a smaller sample size (e.g. n = 20) would have given less precision for our estimate:
Lower CI = 60 − 1.96 × 20/√20 = 60 − 8.77 = 51.23
Upper CI = 60 + 1.96 × 20/√20 = 60 + 8.77 = 68.77
95% CI = 51 to 69 cm
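The two worked examples above can be reproduced in a few lines of Python (a sketch using the normal critical value 1.96; the jump-height numbers are the slides' own):

```python
import math

def confidence_interval(mean, sd, n, critical=1.96):
    """CI for a sample mean: mean +/- critical value x SE, with SE = SD/sqrt(n)."""
    moe = critical * sd / math.sqrt(n)   # margin of error
    return mean - moe, mean + moe

# Slide example: 100 athletes, vertical jump 60 +/- 20 cm
lo100, hi100 = confidence_interval(60, 20, 100)   # about 56 to 64 cm

# Same data but n = 20: a wider, less precise interval
lo20, hi20 = confidence_interval(60, 20, 20)      # about 51 to 69 cm
```

Note how quadrupling the sample size roughly halves the width of the interval, since SE shrinks with √n.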
Slide 11: Recap
- P-value = the probability of obtaining the observed result, or more extreme results, if the null hypothesis is true.
- NHST has several limitations: notably, it leads to dichotomised thinking and does not tell us whether the effect is important or worthwhile.
- Confidence intervals tell us the likely range of the true (population) value.
- Confidence intervals also convey the precision of our estimate: a larger sample size and/or a more consistent response gives a smaller confidence interval and more precision.
Slide 12: Magnitude-based inferences
For magnitude-based inferences, we interpret confidence limits in relation to the smallest clinically beneficial and harmful effects. These are usually equal and opposite in sign. (Harm here is the opposite of benefit, not side effects.) They define regions of beneficial, trivial, and harmful values. All you need is these two things: the confidence interval and a sense of what is important (e.g., beneficial and harmful).
[Figure: the scale of the effect statistic, from negative to positive, divided into harmful, trivial, and beneficial regions by the smallest clinically harmful effect and the smallest clinically beneficial effect.]
Slide 13: Magnitude-based inferences
Put the confidence interval and these regions together to make a decision about clinically significant, clear, or decisive effects.
[Figure: confidence intervals plotted at various positions over the harmful, trivial, and beneficial regions, each paired with the MBI decision and whether the effect is statistically significant. Intervals lying in the beneficial region are "Use it" whether significant or not (one borderline case is "Depends"); intervals confined to trivial and/or harmful values are "Don't use it", again regardless of significance; an interval spanning both harmful and beneficial values is "Unclear: need more research" despite being non-significant. The mismatch between the two columns shows why hypothesis testing can be unethical and impractical.]
Slide 14: Magnitude-based inferences
We calculate probabilities that the true effect could be clinically beneficial, trivial, or harmful (Pbeneficial, Ptrivial, Pharmful). Spreadsheets are available at sportsci.org. These probabilities allow a more detailed call on magnitude, as follows...
[Figure: probability distribution of the true value around the observed value, divided at the smallest harmful and smallest beneficial values; in this example Pharmful = 0.05, Ptrivial = 0.15, Pbeneficial = 0.80.]
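As a sketch of what such a spreadsheet computes, the three chances can be read off a normal probability distribution of the true value centred on the observed value (the observed value, standard error, and ±1.0 thresholds below are illustrative, not taken from the slide):

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clinical_chances(observed, se, harm_threshold, benefit_threshold):
    """Chances that the true effect is harmful, trivial, or beneficial,
    treating the true value as Normal(observed, se)."""
    p_harmful = norm_cdf((harm_threshold - observed) / se)
    p_beneficial = 1.0 - norm_cdf((benefit_threshold - observed) / se)
    p_trivial = 1.0 - p_harmful - p_beneficial
    return p_harmful, p_trivial, p_beneficial

# Illustrative numbers: observed effect 1.5, SE 0.9,
# smallest harmful/beneficial effects of -1 and +1
p_h, p_t, p_b = clinical_chances(1.5, 0.9, -1.0, 1.0)
```

The three chances always sum to 1, since they partition the distribution of the true value.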
Slide 15: Magnitude-based inferences
Making a more detailed call on magnitudes using chances of benefit and harm. Chances (%) that the effect is harmful/trivial/beneficial:
- 0/0/100: most likely beneficial
- 0/7/93: likely beneficial
- 2/33/65: possibly beneficial (a risk of harm >0.5% is unacceptable, unless the chance of benefit is high enough)
- 1/59/40: mechanistic: possibly trivial; clinical: unclear
- 0/97/3: very likely trivial
- 2/94/4: likely trivial
- 28/70/2: possibly harmful
- 74/26/0: possibly harmful
- 97/3/0: very likely harmful
- 9/60/31: unclear
For a mechanistic inference the spreadsheet shows the effect as unclear if the confidence interval, which represents uncertainty about the true value, overlaps values that are substantial in both a positive and a negative sense; the effect is otherwise characterized with a statement about the chance that it is trivial, positive or negative. For a clinical inference the effect is shown as unclear if its chance of benefit is at least promising but its risk of harm is unacceptable; the effect is otherwise characterized with a statement about the chance that it is trivial, beneficial or harmful.
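The two "unclear" rules in the paragraph above can be sketched as simple predicates (a simplification: the benefit/harm odds-ratio exception described on slide 16 is omitted, and the default 0.2 threshold for a substantial effect is illustrative):

```python
def mechanistic_unclear(ci_lower, ci_upper, smallest=0.2):
    """Mechanistic inference is unclear if the confidence interval overlaps
    substantial values in both the positive and negative directions."""
    return ci_lower < -smallest and ci_upper > smallest

def clinical_unclear(p_beneficial, p_harmful,
                     min_benefit=0.25, max_harm=0.005):
    """Clinical inference is unclear if the chance of benefit is at least
    promising (>25%) but the risk of harm is unacceptable (>0.5%)."""
    return p_beneficial >= min_benefit and p_harmful > max_harm
```

Note the asymmetry: a mechanistic call looks only at the interval, while a clinical call weighs benefit against harm.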
Slide 16: Magnitude-based inferences
Use this table for the plain-language version of chances:

The effect... beneficial/trivial/harmful    Probability   Chances    Odds
is almost certainly not...                  <0.005        <0.5%      <1:199
is very unlikely to be...                   0.005-0.05    0.5-5%     1:199-1:19
is unlikely to be..., is probably not...    0.05-0.25     5-25%      1:19-1:3
is possibly (not)..., may (not) be...       0.25-0.75     25-75%     1:3-3:1
is likely to be..., is probably...          0.75-0.95     75-95%     3:1-19:1
is very likely to be...                     0.95-0.995    95-99.5%   19:1-199:1
is almost certainly...                      >0.995        >99.5%     >199:1

An effect should be almost certainly not harmful (<0.5%) and at least possibly beneficial (>25%) before you decide to use it. But you can tolerate higher chances of harm if the chances of benefit are much higher: e.g., 3% harm and 76% benefit = clearly useful. Use an odds ratio of benefit/harm of >66 in such situations.
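The table and the odds-ratio rule can be expressed as two small functions (a sketch; probabilities are given as fractions, and the scale boundaries are those in the table):

```python
def qualitative(p):
    """Plain-language descriptor for a probability, per the slide's scale."""
    scale = [(0.005, "almost certainly not"),
             (0.05, "very unlikely"),
             (0.25, "unlikely, probably not"),
             (0.75, "possibly, may be"),
             (0.95, "likely, probably"),
             (0.995, "very likely")]
    for upper, label in scale:
        if p < upper:
            return label
    return "almost certainly"

def acceptable_risk(p_benefit, p_harm, odds_ratio_threshold=66):
    """Harm must be almost certainly absent (<0.5%), unless the
    benefit/harm odds ratio exceeds the slide's threshold of 66."""
    if p_harm < 0.005:
        return True
    odds_benefit = p_benefit / (1.0 - p_benefit)
    odds_harm = p_harm / (1.0 - p_harm)
    return odds_benefit / odds_harm > odds_ratio_threshold
```

For the slide's example of 3% harm and 76% benefit, the odds ratio is (0.76/0.24)/(0.03/0.97), which is about 102, so the effect counts as clearly useful.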
Slide 17: Two examples of use of the spreadsheet for clinical chances
[Spreadsheet screenshot: both examples use a 90% confidence level, 18 degrees of freedom, and threshold values for clinical chances of −1 and 1. Example 1: P = 0.03, value of statistic 1.5, confidence limits 0.4 to 2.6. Example 2: P = 0.20, value of statistic 2.4. For each example the spreadsheet reports the chances (as percentages and odds) that the true value of the statistic is clinically positive, clinically trivial, and clinically negative, with plain-language labels such as "likely, probable" (78%, odds 3:1), "unlikely, probably not", "very unlikely", and "almost certainly not".]
Both these effects are clinically decisive, clear, or significant.
Slide 18: How to Publish Clinical Chances
Example of a table from a randomized controlled trial:

TABLE 1 - Differences in improvements in kayaking sprint speed between slow, explosive and control training groups.

Compared groups       Mean improvement (%) and 90% confidence limits   Qualitative outcome(a)
Slow - control        3.1; ±1.6                                        Almost certainly beneficial
Explosive - control   2.6; ±1.2                                        Very likely beneficial
Slow - explosive      0.5; ±1.4                                        Unclear

(a) with reference to a smallest worthwhile change of 0.5%.
Slide 19: Recap
- P-value = the probability of obtaining the observed result, or more extreme results, if the null hypothesis is true.
- NHST has several limitations: notably, it does not tell us whether the effect is important or worthwhile.
- Confidence intervals tell us the likely range of the true (population) value.
- Confidence intervals also convey the precision of our estimate: a larger sample size and/or a more consistent response gives a smaller confidence interval and more precision.
Slide 20: Recap
- For magnitude-based inferences, we interpret confidence limits in relation to the smallest clinically beneficial and harmful effects.
- Spreadsheets at sportsci.org provide the % likelihood that an effect is harmful / trivial / beneficial.
- Effects whose confidence intervals cross the thresholds for both benefit and harm are classed as unclear.
- An effect should be almost certainly not harmful (<0.5%) and at least possibly beneficial (>25%) before you decide to use it.
- But you can tolerate higher chances of harm if the chances of benefit are much higher: e.g., 3% harm and 76% benefit = clearly useful. Use an odds ratio of benefit/harm of >66 in such situations.
Slide 21: Smallest worthwhile difference?
- Problem: what's the smallest clinically important effect? "If you can't answer this question, quit the field."
- This problem also applies to hypothesis testing, because it determines the sample size you need to test the null properly.
- 0.3 of a CV gives a top athlete one extra medal every 10 races. This is the smallest important change in performance to aim for in research on, or intended for, elite athletes.
- 0.9, 1.6, 2.5, and 4.0 of a CV give an extra 3, 5, 7, and 9 medals per 10 races (thresholds for moderate, large, very large, and extremely large effects).
References: Hopkins et al., MSSE 31, 1999; Hopkins et al., MSSE 41, 3-12, 2009.
Slide 22: Smallest worthwhile difference?
- The default for most other populations and effects is Cohen's set of smallest values.
- You express the difference or change in the mean as a fraction of the between-subject standard deviation (mean/SD). It's like a z-score or a t-statistic.
- In a controlled trial, it's the SD of all subjects in the pre-test, not the SD of the change scores.
- The smallest worthwhile difference or change is 0.20.
- 0.20 is equivalent to moving from the 50th to the 58th percentile.
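A minimal sketch of the standardisation (the jump-height numbers here are hypothetical, chosen to land exactly on the smallest worthwhile value):

```python
def standardised_difference(mean_diff, pretest_sd):
    """Cohen-type standardised effect: difference or change in means divided
    by the between-subject SD. In a controlled trial, use the pre-test SD
    of all subjects, not the SD of the change scores."""
    return mean_diff / pretest_sd

# Hypothetical example: a 4 cm improvement against a 20 cm between-subject SD
d = standardised_difference(4, 20)   # 0.20, the smallest worthwhile effect
```

Using the change-score SD instead of the pre-test SD would inflate the standardised effect, which is why the slide warns against it.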
Slide 23: Example: the effect of a treatment on strength
[Figure: pre- and post-treatment strength distributions for a trivial effect (0.1 × SD) and a very large effect (3 × SD).]
Interpretation of standardised difference or change in means:

Magnitude        Cohen     Hopkins
trivial          <0.2      <0.2
small            0.2-0.5   0.2-0.6
moderate         0.5-0.8   0.6-1.2
large            >0.8      1.2-2.0
very large       -         2.0-4.0
extremely large  -         >4.0
Slide 24: Smallest worthwhile difference?
Relationship of standardised effect to difference or change in percentile:
[Figure: normal distributions of strength, showing an athlete on the 50th percentile (area = 50%) moving to the 58th percentile (area = 58%) for a standardised effect of 0.20.]

Standardised effect   Percentile change
0.20                  50 to 58
0.20                  80 to 85
0.20                  95 to 97
0.25                  50 to 60
1.00                  50 to 84
2.00                  50 to 98
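The percentile table can be reproduced from the normal distribution (a sketch using only the standard library; the bisection inverts the normal CDF to find the athlete's starting z-score):

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inv_norm_cdf(p):
    """Invert the standard normal CDF by bisection (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def percentile_after_shift(start_percentile, effect):
    """Percentile a subject moves to after a standardised change `effect`,
    assuming a normal distribution of the measure."""
    z = inv_norm_cdf(start_percentile / 100.0)
    return 100.0 * norm_cdf(z + effect)
```

For example, `percentile_after_shift(50, 0.20)` gives about 58, matching the table's first row.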
Slide 26: Limitations of magnitude-based inferences
- Problem: these new approaches are not yet mainstream.
- Confidence limits at least are coming in, so look for, and interpret the importance of, the lower and upper limits.
- You can use a spreadsheet to convert a published P value into a more meaningful magnitude-based inference. If the authors state only "P < 0.05" you can't do it properly; if they state "P > 0.05" or "NS", you can't do it at all.
- Results may be more difficult to present and discuss.
- See also 'Magnitude-based inferences under attack'.
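The P-to-MBI conversion works because an exact two-sided P value plus the effect estimate implies a standard error. A sketch of the idea (a normal approximation for simplicity, whereas the spreadsheet examples on slide 17 use degrees of freedom, i.e. a t distribution; the symmetric ±0.2 thresholds are illustrative):

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inv_norm_cdf(p):
    """Invert the standard normal CDF by bisection (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def mbi_from_p(effect, p_value, smallest=0.2):
    """Chances (harmful, trivial, beneficial) recovered from a published
    exact two-sided P value and effect estimate."""
    z = inv_norm_cdf(1.0 - p_value / 2.0)   # z-score implied by the P value
    se = abs(effect) / z                    # back-calculated standard error
    p_harmful = norm_cdf((-smallest - effect) / se)
    p_beneficial = 1.0 - norm_cdf((smallest - effect) / se)
    return p_harmful, 1.0 - p_harmful - p_beneficial, p_beneficial
```

This is why an exact P value is needed: "P < 0.05" gives no z-score to invert, so the standard error, and hence the chances, cannot be recovered.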
Slide 27: Summary
- MBIs are an alternative to traditional NHST.
- For magnitude-based inferences, we interpret confidence limits in relation to the smallest clinically beneficial and harmful effects.
- Smallest worthwhile effects may be based on variability of performance (e.g. 0.3 of a CV), or standardised effects may be used (e.g. Cohen's d).
- Spreadsheets are available at sportsci.org to carry out MBIs.
- MBI is growing in popularity, but is still not understood or accepted by many journals and academics.
- Confidence intervals convey far more information than a P-value alone, and should be presented where possible.
Slide 28: Recommended Reading
- Batterham, A. M. & Hopkins, W. G. (2006). Making meaningful inferences about magnitudes. International Journal of Sports Physiology and Performance, 1.
- Hopkins, W., Marshall, S., Batterham, A. & Hanin, J. (2009). Progressive statistics for studies in sports medicine and exercise science. Medicine and Science in Sports and Exercise, 41, 3-12.