# MAGNITUDE-BASED INFERENCES

An alternative to hypothesis testing

Lecture outline

- Limitations of null hypothesis significance testing (NHST)
- Confidence intervals
- Magnitude-based inferences (MBI)
- Smallest worthwhile effects
- Limitations of magnitude-based inferences

Null hypothesis significance testing
A major aim of research is to make an inference about an effect in a population based on study of a sample. Null-hypothesis testing via the P-value and statistical significance is the traditional approach to making an inference. The P-value is the probability of obtaining the observed result, or more extreme results, if the null hypothesis is true.

[Figure: the set of possible results under the null hypothesis, with the most likely observation at the peak of the probability distribution; the one-sided P-value is the tail area beyond the observed data point.]

Limitations of NHST

- P-values are difficult to understand!
- P ≤ 0.05 is arbitrary: “Surely God loves the 0.06 nearly as much as 0.05.” (Rosnow and Rosenthal, 1989)
- Some useful effects aren’t statistically significant; some statistically significant effects aren’t useful.
- P > 0.05 is often interpreted as unpublishable, so good data don’t get published.
- NHST ignores ‘judgement’ and leads to dichotomised thinking. (Statistical Hypothesis Inference Testing?)

In arriving at a problem statement and research question, researchers usually have good reasons to believe that effects will be different from zero. The more relevant issue is not whether there is an effect but how big it is. Unfortunately, the P value alone provides no information about the direction or size of the effect or, given sampling variability, the range of feasible values. Depending, inter alia, on sample size and variability, an outcome statistic with P < .05 could represent an effect that is clinically, practically, or mechanistically irrelevant. Conversely, a nonsignificant result (P > .05) does not necessarily imply that there is no worthwhile effect, because a combination of small sample size and large measurement variability can mask important effects. An overreliance on P values might therefore lead to unethical errors of interpretation.


Dance of the p-values: https://www.youtube.com/watch?v=ez4DgdurRPg

| P value | Label | Verdict |
|---|---|---|
| <0.001 | *** | Very highly significant!!! |
| <0.01 | ** | Highly significant!! |
| <0.05 | * | Significant (phew) |
| 0.05–0.10 | ? | “Approaching significance” |
| >0.10 | | Non-significant |

Confidence intervals
A range within which we infer the true, population, or large-sample value is likely to fall. “Likely” is usually a probability of 0.95 (for 95% limits).

[Figure: probability distribution of the true value of the effect statistic, given the observed value; the area between the lower and upper likely limits is 0.95. The limits are represented as a confidence interval: the likely range of the true value on the scale of the effect statistic.]

Confidence intervals (CI)
To calculate a confidence interval you need:

- Confidence level (usually 95%, but 90% or 99% can be used)
- Statistic (e.g. group mean)
- Margin of error (critical value × standard error of the statistic), where SE = SD/√n and the critical value is a z-score or t-score

CI = [M − MOE, M + MOE] = [M − critical value × SE, M + critical value × SE]

EXAMPLE: You test the vertical jump heights of 100 athletes. The mean and standard deviation of the sample were 60 ± 20 cm. The 95% CI for this mean:

Lower limit = 60 − 1.96 × 20/√100 ≈ 56 cm
Upper limit = 60 + 1.96 × 20/√100 ≈ 64 cm
95% CI = 56 to 64 cm

Confidence intervals (CI)
Confidence intervals also convey the precision of an estimate: a wider confidence interval means less precision. In the previous example, a smaller sample size (e.g. n = 20) would have given less precision for our estimate:

Lower limit = 60 − 1.96 × 20/√20 ≈ 51 cm
Upper limit = 60 + 1.96 × 20/√20 ≈ 69 cm
95% CI = 51 to 69 cm
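Both calculations can be reproduced with a short script; a minimal sketch in Python, using the normal critical value 1.96 as in the slides:

```python
from math import sqrt

def confidence_interval(mean, sd, n, critical=1.96):
    """95% CI for a sample mean: mean +/- critical * SD/sqrt(n)."""
    se = sd / sqrt(n)    # standard error of the mean
    moe = critical * se  # margin of error
    return mean - moe, mean + moe

# Vertical-jump example: mean 60 cm, SD 20 cm
lo100, hi100 = confidence_interval(60, 20, 100)
lo20, hi20 = confidence_interval(60, 20, 20)
print(f"n=100: {lo100:.0f} to {hi100:.0f} cm")  # 56 to 64 cm
print(f"n=20:  {lo20:.0f} to {hi20:.0f} cm")    # 51 to 69 cm
```

Note how quadrupling the sample size roughly halves the width of the interval, since the standard error shrinks with √n.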

Recap

- P-value = the probability of obtaining the observed result, or more extreme results, if the null hypothesis is true.
- NHST has several limitations: it leads to dichotomised thinking and does not tell us if the effect is important/worthwhile.
- Confidence intervals tell us the likely range of the true (population) value.
- Confidence intervals also convey the precision of our estimate: larger sample size and/or a more consistent response = smaller confidence interval and more precision.

Magnitude-based inferences
For magnitude-based inferences, we interpret confidence limits in relation to the smallest clinically beneficial and harmful effects. These are usually equal and opposite in sign. (Harm here is the opposite of benefit, not side effects.) They define regions of beneficial, trivial, and harmful values. All you need are two things: the confidence interval, and a sense of what is important (e.g. beneficial and harmful).

[Figure: the scale of the effect statistic, from negative to positive, divided into harmful, trivial, and beneficial regions by the smallest clinically harmful and smallest clinically beneficial effects.]

Magnitude-based inferences
Put the confidence interval and these regions together to make a decision about clinically significant, clear, or decisive effects. Note how the two approaches can disagree — this is why hypothesis testing can be unethical and impractical!

[Figure: ten example confidence intervals plotted against the harmful/trivial/beneficial regions, each labelled with its MBI decision and whether it is statistically significant:]

| MBI decision | Statistically significant? |
|---|---|
| Use it. | Yes |
| Use it. | Yes |
| Use it. | No |
| Depends. | No |
| Don’t use it. | Yes |
| Don’t use it. | No |
| Don’t use it. | No |
| Don’t use it. | Yes |
| Don’t use it. | Yes |
| Unclear: need more research. | No |

Magnitude-based inferences
We calculate the probabilities that the true effect could be clinically beneficial, trivial, or harmful (Pbeneficial, Ptrivial, Pharmful). Spreadsheets are available at sportsci.org. These probabilities allow a more detailed call on magnitude, as follows…

[Figure: probability distribution of the true value, centred on the observed value; the smallest harmful and smallest beneficial values divide it into Pharmful = 0.05, Ptrivial = 0.15, and Pbeneficial = 0.80.]
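The sportsci.org spreadsheets compute these probabilities from the t distribution; the idea can be sketched under a normal approximation, with hypothetical effect, SE, and threshold values (all names and numbers here are illustrative, not from the spreadsheets):

```python
from statistics import NormalDist

def mbi_chances(effect, se, harm_threshold, benefit_threshold):
    """Chances that the true effect is harmful, trivial, or beneficial,
    treating the true value as N(effect, se) (normal approximation)."""
    dist = NormalDist(mu=effect, sigma=se)
    p_harmful = dist.cdf(harm_threshold)            # area below the harm threshold
    p_beneficial = 1 - dist.cdf(benefit_threshold)  # area above the benefit threshold
    p_trivial = 1 - p_harmful - p_beneficial        # remaining area in between
    return p_harmful, p_trivial, p_beneficial

# Hypothetical example: observed effect 1.5, SE 0.9, thresholds at -1 and +1
ph, pt, pb = mbi_chances(1.5, 0.9, -1.0, 1.0)
print(f"harmful {ph:.0%} / trivial {pt:.0%} / beneficial {pb:.0%}")
```

The three areas always sum to 1: the distribution of the true value is simply cut into the three regions defined by the thresholds.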

Magnitude-based inferences
Making a more detailed call on magnitudes using the chances of benefit and harm. Risk of harm >0.5% is unacceptable, unless the chance of benefit is high enough.

| Chances (%) harmful / trivial / beneficial | Inference |
|---|---|
| 0/0/100 | Most likely beneficial |
| 0/7/93 | Likely beneficial |
| 2/33/65 | Possibly beneficial |
| 1/59/40 | Mechanistic: possibly trivial. Clinical: unclear |
| 0/97/3 | Very likely trivial |
| 2/94/4 | Likely trivial |
| 28/70/2 | Possibly harmful |
| 74/26/0 | Possibly harmful |
| 97/3/0 | Very likely harmful |
| 9/60/31 | Unclear |

For a mechanistic inference, the spreadsheet shows the effect as unclear if the confidence interval, which represents uncertainty about the true value, overlaps values that are substantial in both a positive and a negative sense; the effect is otherwise characterized with a statement about the chance that it is trivial, positive, or negative. For a clinical inference, the effect is shown as unclear if its chance of benefit is at least promising but its risk of harm is unacceptable; the effect is otherwise characterized with a statement about the chance that it is trivial, beneficial, or harmful.

Magnitude-based inferences
Use this table for the plain-language version of chances. An effect should be almost certainly not harmful (<0.5%) and at least possibly beneficial (>25%) before you decide to use it. But you can tolerate higher chances of harm if the chances of benefit are much higher: e.g., 3% harm and 76% benefit = clearly useful. Use an odds ratio of benefit/harm of >66 in such situations.

| The effect… (beneficial/trivial/harmful) | Probability | Chances | Odds |
|---|---|---|---|
| is almost certainly not… | <0.005 | <0.5% | <1:199 |
| is very unlikely to be… | 0.005–0.05 | 0.5–5% | 1:199–1:19 |
| is unlikely to be…, is probably not… | 0.05–0.25 | 5–25% | 1:19–1:3 |
| is possibly (not)…, may (not) be… | 0.25–0.75 | 25–75% | 1:3–3:1 |
| is likely to be…, is probably… | 0.75–0.95 | 75–95% | 3:1–19:1 |
| is very likely to be… | 0.95–0.995 | 95–99.5% | 19:1–199:1 |
| is almost certainly… | >0.995 | >99.5% | >199:1 |
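This plain-language scale is a mechanical lookup on the probability; a minimal Python sketch (the function name and the exact return strings are my own simplification):

```python
def describe_chance(p):
    """Plain-language descriptor for a probability, per the MBI scale."""
    if p < 0.005:
        return "almost certainly not"
    if p < 0.05:
        return "very unlikely"
    if p < 0.25:
        return "unlikely / probably not"
    if p < 0.75:
        return "possibly / may be"
    if p < 0.95:
        return "likely / probably"
    if p < 0.995:
        return "very likely"
    return "almost certainly"

print(describe_chance(0.93))   # likely / probably
print(describe_chance(0.002))  # almost certainly not
```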

Two examples of use of the spreadsheet for clinical chances, with threshold values for clinical chances of ±1:

Example 1: P value 0.03, value of statistic 1.5, confidence level 90%, 18 degrees of freedom → confidence limits 0.4 to 2.6. Chances that the true value of the statistic is clinically positive: 78% (odds 3:1; likely, probable); clinically trivial: 22% (1:3; unlikely, probably not); clinically negative: ~0% (1:2071; almost certainly not).

Example 2: P value 0.20, value of statistic 2.4, confidence level 90%, 18 degrees of freedom → confidence limits −0.7 to 5.5. Chances clinically positive: 78% (3:1; likely, probable); clinically trivial: 19% (1:4; unlikely, probably not); clinically negative: 3% (1:30; very unlikely).

Both these effects are clinically decisive, clear, or significant.

How to publish clinical chances
Example of a table from a randomized controlled trial:

TABLE 1 – Differences in improvements in kayaking sprint speed between slow, explosive, and control training groups.

| Compared groups | Mean improvement (%) and 90% confidence limits | Qualitative outcome^a |
|---|---|---|
| Slow − control | 3.1; ±1.6 | Almost certainly beneficial |
| Explosive − control | 2.6; ±1.2 | Very likely beneficial |
| Slow − explosive | 0.5; ±1.4 | Unclear |

^a with reference to a smallest worthwhile change of 0.5%.

Recap

- P-value = the probability of obtaining the observed result, or more extreme results, if the null hypothesis is true.
- NHST has several limitations: notably, it does not tell us if the effect is important/worthwhile.
- Confidence intervals tell us the likely range of the true (population) value.
- Confidence intervals also convey the precision of our estimate: larger sample size and/or a more consistent response = smaller confidence interval and more precision.

Recap

- For magnitude-based inferences, we interpret confidence limits in relation to the smallest clinically beneficial and harmful effects.
- Spreadsheets at sportsci.org provide the % likelihood that an effect is harmful / trivial / beneficial.
- Effects that cross the thresholds for benefit and harm are classed as unclear.
- An effect should be almost certainly not harmful (<0.5%) and at least possibly beneficial (>25%) before you decide to use it. But you can tolerate higher chances of harm if the chances of benefit are much higher: e.g., 3% harm and 76% benefit = clearly useful. Use an odds ratio of benefit/harm of >66 in such situations.
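This decision rule can be expressed as a small function; a sketch under the stated thresholds (the function name and return strings are mine, not from the spreadsheets):

```python
def clinical_decision(p_benefit, p_harm):
    """Apply the MBI clinical decision rule to chances of benefit and harm."""
    odds = lambda p: p / (1 - p)
    if p_harm < 0.005 and p_benefit > 0.25:
        return "use it"  # almost certainly not harmful, at least possibly beneficial
    if p_harm >= 0.005 and odds(p_benefit) / odds(p_harm) > 66:
        return "use it"  # high chance of benefit outweighs a small risk of harm
    if p_harm >= 0.005 and p_benefit > 0.25:
        return "unclear: need more research"  # promising benefit, unacceptable risk
    return "don't use it"

print(clinical_decision(0.76, 0.03))  # use it (benefit/harm odds ratio ~102 > 66)
print(clinical_decision(0.10, 0.10))  # don't use it
```

The 3%-harm / 76%-benefit example from the recap passes via the odds-ratio branch: odds of benefit 0.76/0.24 ≈ 3.2 against odds of harm 0.03/0.97 ≈ 0.031 gives a ratio of about 102.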

Smallest worthwhile difference?
Problem: what's the smallest clinically important effect? “If you can't answer this question, quit the field.” This problem also applies to hypothesis testing, because it determines the sample size you need to test the null properly.

- 0.3 of a CV (coefficient of variation) gives a top athlete one extra medal every 10 races. This is the smallest important change in performance to aim for in research on, or intended for, elite athletes.
- 0.9, 1.6, 2.5, and 4.0 of a CV give an extra 3, 5, 7, and 9 medals per 10 races (thresholds for moderate, large, very large, and extremely large effects).

References: Hopkins et al., MSSE 31, 1999; Hopkins et al., MSSE 41, 3-12, 2009.

Smallest worthwhile difference?
The default for most other populations and effects is Cohen's set of smallest values. You express the difference or change in the mean as a fraction of the between-subject standard deviation (mean/SD); it's like a z-score or a t-statistic. In a controlled trial, it's the SD of all subjects in the pre-test, not the SD of the change scores. The smallest worthwhile difference or change is 0.20, which is equivalent to moving from the 50th to the 58th percentile.

Example: the effect of a treatment on strength.

[Figure: pre–post strength shown for a trivial effect (0.1 × SD) and a very large effect (3 × SD).]

Interpretation of standardised difference or change in means:

| Magnitude | Cohen | Hopkins |
|---|---|---|
| trivial | <0.2 | <0.2 |
| small | 0.2–0.5 | 0.2–0.6 |
| moderate | 0.5–0.8 | 0.6–1.2 |
| large | >0.8 | 1.2–2.0 |
| very large | ? | 2.0–4.0 |
| extremely large | ? | >4.0 |

Smallest worthwhile difference?
Relationship of the standardised effect to a difference or change in percentile:

[Figure: an athlete on the 50th percentile (area = 50%) moves to the 58th percentile (area = 58%) with a standardised effect of 0.20.]

| Standardised effect | Percentile change |
|---|---|
| 0.20 | 50 → 58 |
| 0.20 | 80 → 85 |
| 0.20 | 95 → 97 |
| 0.25 | 50 → 60 |
| 1.00 | 50 → 84 |
| 2.00 | 50 → 98 |
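These percentile changes follow directly from the normal distribution; a quick check in Python (the function name is mine):

```python
from statistics import NormalDist

def shifted_percentile(start_percentile, effect_sd):
    """Percentile an athlete moves to after a standardised effect (in SDs)."""
    z = NormalDist().inv_cdf(start_percentile / 100)  # starting z-score
    return 100 * NormalDist().cdf(z + effect_sd)      # new percentile

print(round(shifted_percentile(50, 0.20)))  # 58
print(round(shifted_percentile(80, 0.20)))  # 85
print(round(shifted_percentile(50, 2.00)))  # 98
```

Note that the same 0.20-SD effect moves an athlete fewer percentiles the further they start from the middle of the distribution, because the tails are thinner.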

Smallest worthwhile difference?
The full scale of magnitude thresholds for other effect statistics:

| Statistic | Trivial | Small | Moderate | Large | Very large | Nearly perfect | Perfect |
|---|---|---|---|---|---|---|---|
| Correlation | 0.0 | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 | 1 |
| Diff. in means (SD) | 0.0 | 0.2 | 0.6 | 1.2 | 2.0 | 4.0 | Infinite |
| Freq. diff (%) | 0 | 10 | 30 | 50 | 70 | 90 | 100 |
| Rel. risk | 1.0 | 1.2 | 1.9 | 3.0 | 5.7 | 19 | Infinite |
| Odds ratio | 1.0 | 1.5 | 3.5 | 9.0 | 32 | 360 | Infinite |
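As an illustration, the correlation row of this scale can be encoded as a simple lookup (the function name is mine):

```python
def correlation_magnitude(r):
    """Qualitative magnitude of a correlation, per the Hopkins scale above."""
    r = abs(r)  # magnitude ignores the sign of the correlation
    for threshold, label in [(1.0, "perfect"), (0.9, "nearly perfect"),
                             (0.7, "very large"), (0.5, "large"),
                             (0.3, "moderate"), (0.1, "small")]:
        if r >= threshold:
            return label
    return "trivial"

print(correlation_magnitude(0.45))   # moderate
print(correlation_magnitude(-0.85))  # very large
```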

Limitations of magnitude-based inferences

- These new approaches are not yet mainstream. Confidence limits at least are coming in, so look for them and interpret the importance of the lower and upper limits.
- You can use a spreadsheet to convert a published P value into a more meaningful magnitude-based inference. If the authors state only “P < 0.05” you can’t do it properly; if they state “P > 0.05” or “NS”, you can’t do it at all.
- Results may be more difficult to present and discuss.
- ‘Magnitude-based inferences under attack’: the approach itself has been criticised by statisticians.

Summary

- MBIs are an alternative to traditional NHST.
- For magnitude-based inferences, we interpret confidence limits in relation to the smallest clinically beneficial and harmful effects.
- Smallest worthwhile effects may be based on variability of performance (e.g. 0.3 of a CV), or standardised effects may be used (e.g. Cohen's d).
- Spreadsheets are available at sportsci.org to carry out MBIs.
- MBI is growing in popularity, but still not understood/accepted by many journals and academics.
- Confidence intervals convey far more information than a P-value alone, and should be presented where possible.