Logic of Significance Testing

When testing to see if there is a “real” difference between the means of two groups, we test to see if there is a significant difference.

Two-tailed hypothesis:
H0: μ1 – μ2 = 0
H1: μ1 – μ2 ≠ 0

One-tailed hypothesis:
H0: μ1 – μ2 ≤ 0
H1: μ1 – μ2 > 0
We do so by deriving a sampling distribution, which describes the probability of obtaining a particular result in a particular sample (of a particular size, N, from a population with an estimated variance, σ²(est)), assuming that H0 is correct, that is, assuming that there is no difference between the means in the population...

[Figure: sampling distribution of M1 – M2 under H0: μ1 – μ2 = 0 (two-tailed test), centered at 0, with α/2 = .025 in each tail beyond the critical difference (M1-M2 crit)]
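... or, assuming that the difference between the means in the population is less than or equal to zero...

[Figure: sampling distribution of M1 – M2 under H0: μ1 – μ2 ≤ 0 (one-tailed test), centered at 0, with α = .05 in the tail beyond the critical difference (M1-M2 crit)]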
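As a rough illustration of where the critical difference (M1-M2 crit) comes from, the following Python sketch computes it under a normal approximation. The function name `critical_difference` is hypothetical, and the standard error of 2.83 is an assumption borrowed from the a priori example later in this tutorial; with small samples a t distribution would normally be used instead.

```python
from statistics import NormalDist

def critical_difference(se, alpha=0.05, two_tailed=False):
    """Smallest positive M1 - M2 that is significant under H0 (normal approximation)."""
    # A two-tailed test splits alpha between the two tails of the H0 distribution.
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area) * se

se = 2.83  # ASSUMPTION: standard error of M1 - M2, used only for illustration

print(round(critical_difference(se), 2))                   # one-tailed, alpha = .05
print(round(critical_difference(se, two_tailed=True), 2))  # two-tailed, alpha/2 = .025
```

The two-tailed critical difference is larger because only half of α sits in each tail.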
If the observed sample result is very different from what would be expected under H0 (i.e., the probability of observing such a result if H0 is correct is very low), then we are justified in rejecting H0 and accepting H1. We accept with high confidence the conclusion that H1 is correct (e.g., that μ1 > μ2 in the population)...

[Figure: H0 sampling distribution of M1 – M2, centered at 0, with the α = .05 region and the observed result marked]
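We are justified in doing so because, if H0 were in fact correct, the probability of obtaining a result this extreme, and thus of wrongly rejecting H0 (making a Type 1 statistical error), is very low (less than 5%).

H0: μ1 – μ2 ≤ 0
H1: μ1 – μ2 > 0

[Figure: H0 sampling distribution of M1 – M2, centered at 0, with the α = .05 region and the observed result marked]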
On the other hand, if the result is not so different from what we would expect if H0 is correct (i.e., the probability of obtaining such a result if H0 is correct is greater than 5%), then we do not reject H0.

H0: μ1 – μ2 ≤ 0

[Figure: H0 sampling distribution of M1 – M2, centered at 0, with the α = .05 region and the observed result marked]
But we do not “accept” H0 either, if by “accept” we mean accepting with high confidence that H0 is correct. Why not?

H0: μ1 – μ2 ≤ 0

[Figure: H0 sampling distribution of M1 – M2, centered at 0, with the α = .05 region and the observed result marked]
We cannot “accept” H0, because we cannot reject H1. That is, we cannot say with high confidence that H1 is wrong. In fact, an observed “nonsignificant” result might be quite consistent with a state of the world in which μ1 is greater than μ2.

H1: μ1 – μ2 = 5

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5, with the observed result at M1 – M2 = 7.5]
For instance, an observed difference between M1 and M2 of 7.5 points, although not significant, would be very likely (about a 60% chance of observing this difference or less in any random sample) under the assumption that this specific H1 is correct, that is, if the “true” difference between the two means in the population is 5 points.

H1: μ1 – μ2 = 5

Why is β* (a posteriori) not actually “beta”? See the note that follows.

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5; the area of the H1 distribution at or below the observed result of 7.5 is shaded, β* = .60]
Note: β* is not really beta, in the same way that the p-value (significance level) of a result with respect to H0 (signified here as α*) is not actually alpha. Beta is defined as the probability of wrongly rejecting H1 when H1 is correct, whereas β* (in the current example) is the probability that a difference of 7.5 or smaller would be obtained, assuming that this specific H1 is correct. [Similarly, alpha is defined as the probability of wrongly rejecting H0 when H0 is correct, whereas the specific p-value (α*) (e.g., p = .002, or p = .27) is the a posteriori probability that a result at least as extreme as this specific result would be obtained, assuming that H0 is correct.]

a priori probability: calculated before the empirical result (α, β)
a posteriori probability: calculated after (based on) the empirical result (α*, β*)

H1: μ1 – μ2 = 5

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5; the area of the H1 distribution at or below the observed result of 7.5 is shaded, β* = .60]
As another example: There is exactly a 50% chance of observing a difference between M1 and M2 of 7.5 points or less, under the assumption that the “true” difference between the two means in the population is 7.5 points. [Note: There will always be a 50% chance of getting a result as low as or lower than the observed result, if the true difference between the means in the population is equal to the observed result.]

H1: μ1 – μ2 = 7.5 (observed result)

[Figure: H0 distribution centered at 0 and H1 distribution centered at 7.5; the area of the H1 distribution at or below the observed result of 7.5 is shaded, β* = .50]
There is even a 30% chance of obtaining a result as low as or lower than the observed result, if the true difference between the means in the population is as large as 12 points. And so forth …

H1: μ1 – μ2 = 12

[Figure: H0 distribution centered at 0 and H1 distribution centered at 12; the area of the H1 distribution at or below the observed result of 7.5 is shaded, β* = .30]
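The a posteriori probabilities in these examples can be sketched in a few lines of Python under a normal approximation. The slides do not state the standard error of M1 – M2 for this example, so `SE = 6.4` is an assumption chosen only for illustration, and the function names are hypothetical. The printed values therefore need not match the approximate figures quoted above, except for the 50% case, which holds for any standard error.

```python
from statistics import NormalDist

OBSERVED = 7.5  # observed M1 - M2 from the example
SE = 6.4        # ASSUMPTION: standard error of M1 - M2 (not given in the slides)

def p_value_one_tailed(observed, se):
    """alpha*: P(result >= observed), assuming the true difference is 0."""
    return 1 - NormalDist(mu=0, sigma=se).cdf(observed)

def beta_star(observed, true_diff, se):
    """beta*: P(result <= observed), assuming the true difference is true_diff."""
    return NormalDist(mu=true_diff, sigma=se).cdf(observed)

print(f"alpha* = {p_value_one_tailed(OBSERVED, SE):.2f}")
for true_diff in (5, 7.5, 12, 18):
    b = beta_star(OBSERVED, true_diff, SE)
    print(f"true difference {true_diff:>4}: beta* = {b:.2f}")
```

The further the hypothesized true difference lies above the observed result, the smaller β* becomes.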
So, what can we do?

We can conclude that there is no support for H1 (in this experiment), but also that there is no support for H0: the “results are inconclusive.”

OR

We can try to provide stronger support for a weaker version of H0, which holds that there is “either no effect, or only a very small effect.”
Post-Hoc Power Analysis

The power to detect a specific (usually small) effect or greater is calculated on the basis of the obtained result. When the observed difference is not significant, if the calculated power (1 – β) to detect a small effect is high, then we can conclude with high confidence that either there is no effect, or, if there is an effect, it must be so small that it is essentially meaningless or trivial.

Here we will focus on a particular type of post-hoc (a posteriori) analysis that determines whether the hypothesis of any specific, nontrivial effect (H1) can be confidently rejected, based on the low probability of the observed result in the sampling distribution corresponding to that hypothesis.
Example: Assume that an effect (difference) of less than 5 points is theoretically or practically meaningless (trivial). Given the earlier result of 7.5, and the calculated sampling distributions, can we reject the hypothesis that the effect is 5 points or more (i.e., non-trivial)? Let’s reformulate H1 and H0, and then try to reject H1 (in the same way that we usually try to reject H0).

H1: μ1 – μ2 ≥ 5
H0: μ1 – μ2 < 5

No, we can’t reject this specific H1, because β* = .60 is much too high (i.e., the obtained result is not at all unlikely, assuming that H1 is true).

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5; the area of the H1 distribution at or below the observed result of 7.5 is shaded, β* = .60]
What is the smallest effect (H1) that can be confidently rejected? By moving the H1 distribution far enough to the right (or by using the appropriate computer program), we reach the point where the probability of our result (or a smaller one) is only 5%, assuming that the effect in the population is 18 points. Thus, we can confidently (95%) reject the hypothesis that the effect is really 18 points or greater.

H1: μ1 – μ2 ≥ 18
H0: μ1 – μ2 < 18

[Figure: H0 distribution centered at 0 and H1 distribution centered at 18; the area of the H1 distribution at or below the observed result of 7.5 is shaded, β* = .05]
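“Moving the H1 distribution to the right” until only 5% of it lies at or below the observed result has a simple closed form under a normal approximation: the observed difference plus 1.645 standard errors. The sketch below assumes a standard error of 6.4, a value the slides do not state and which is chosen here only because it is roughly consistent with the 18-point figure. The function name is hypothetical.

```python
from statistics import NormalDist

def smallest_rejectable_effect(observed, se, beta_star=0.05):
    """True difference under which P(result <= observed) equals beta_star."""
    z = NormalDist().inv_cdf(1 - beta_star)  # about 1.645 for beta_star = .05
    return observed + z * se

SE = 6.4  # ASSUMPTION: standard error of M1 - M2 (not given in the slides)

print(round(smallest_rejectable_effect(7.5, SE), 1))  # 18.0
```

This boundary is the same number as the upper limit of a one-sided 95% confidence interval for μ1 – μ2.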
However, we wanted to be able to reject any effect of 5 points or more! Based on the current sampling distributions, the power (1 – β) to detect an effect of 5 points was only 28%, and the probability that we would wrongly conclude that there is no effect when in fact the effect is 5 points (β) was 72%! Apparently, in planning our study, we did not use a sample size that was large enough to meet our research goals.

H1: μ1 – μ2 ≥ 5
H0: μ1 – μ2 < 5

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5, with the critical difference (M1-M2 crit, α = .05) and the observed result of 7.5 marked; the part of the H1 distribution below the critical difference is β = .72, and the part above it is 1 – β = .28 (power)]
BEYOND post-hoc power analysis

We see from the previous example that post-hoc power analysis may leave us without a good solution. If we cannot confidently reject an effect that is less than 18 points, then our “null” results may still be essentially inconclusive: the actual effect may be anywhere between 0 and 17 points (and, of course, may also be negative).

For this reason, a better approach is a priori power analysis, which is done before the experiment is run (during the planning stage) …
A Priori Power Analysis

1. Define the smallest effect (MINDIFF) which is meaningful (theoretically or practically) in this particular research context.
2. Define the desired power (1 – β): the probability that we will detect MINDIFF, if H1: μ1 – μ2 ≥ MINDIFF is true (put another way, the probability that we will correctly reject the null hypothesis, if in fact there is an effect of size MINDIFF or more in the population). Note that this also defines beta (β): the probability that we will wrongly accept the null hypothesis, if in fact there is an effect of size MINDIFF or more in the population.
3. Estimate the population variance, based on previous research findings (or an “educated guess”).
4. Use a software program to find the sample size N (which, together with the variance estimate from step 3, defines the sampling distributions of H0 and H1) that achieves the desired power and beta (defined in step 2) for detecting the minimum meaningful effect size (defined in step 1).
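For the simple case treated in this tutorial (two independent groups of equal size, a one-tailed test, and a normal approximation), step 4 has a closed form that shows how the four steps fit together. The notation is an assumption made here, not the tutorial’s: n is the number of subjects per group (so N = 2n), and z with a subscript p denotes the standard normal quantile at probability p.

```latex
\sigma_{M_1 - M_2} = \sqrt{\frac{2\,\sigma^2_{est}}{n}},
\qquad
n \;\ge\; \frac{2\,\sigma^2_{est}\,\bigl(z_{1-\alpha} + z_{1-\beta}\bigr)^2}{\mathit{MINDIFF}^{\,2}},
\qquad
N = 2n
```

Power-analysis software typically works with the t distribution instead, so its answer can differ slightly from this approximation.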
Example

1. Let’s say that we are testing the effectiveness of a preparation course for the psychometric exam. The scores of students taking the course will be compared with a control group (students who prepared on their own). From a practical standpoint, we decide that a benefit (difference) of less than 5 points is essentially meaningless (essentially the same as no difference at all). Thus, MINDIFF = 5.
2. Let’s further assume that we want to have a 90% chance of detecting a 5-point advantage for the preparation course. Or, put another way, we want there to be less than a 10% chance that we will wrongly conclude that the preparation course is ineffective, if in fact it has a benefit of 5 points or more. Thus, desired power (1 – β) = .90, and β = .10.
3. Based on previous (fictitious) research findings, the standard deviation in psychometric scores is known to be around 20 points. Thus, σ²(est) = 400.
4. What is the sample size (N) that is needed to achieve the desired power?
We first consider using a sample size of N = 200 (100 subjects in each group). This yields a sampling distribution for M1 – M2 with standard deviation (standard error) σ(M1-M2) = 2.83. We use this sampling distribution both for H0 and for H1 (MINDIFF = 5). The resulting power to detect a benefit of 5 points (55%) is too low (the chance that we will wrongly accept H0, 45%, is too high).

H1: μ1 – μ2 = 5

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5, for N = 200 and σ(M1-M2) = 2.83, with M1-M2 crit = 4.7 (α = .05); the part of the H1 distribution below the critical difference is β = .45, and the part above it is 1 – β = .55 (power)]
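The numbers on this slide can be checked with a short Python sketch, assuming equal group sizes and a normal approximation (power software may use the t distribution, which gives nearly the same answer at this sample size). The function names `se_diff` and `power_one_tailed` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def se_diff(sd, n_per_group):
    """Standard error of M1 - M2 for two equal groups with a common SD."""
    return sd * sqrt(2 / n_per_group)

def power_one_tailed(effect, se, alpha=0.05):
    """P(result > critical difference) when the true difference is `effect`."""
    crit = NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu=effect, sigma=se).cdf(crit)

se = se_diff(sd=20, n_per_group=100)   # N = 200
crit = NormalDist().inv_cdf(0.95) * se
pw = power_one_tailed(effect=5, se=se)

print(round(se, 2))    # 2.83
print(round(crit, 1))  # 4.7
print(round(pw, 2))    # 0.55
```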
Now, we let the software program calculate the sample size that is needed to reach the desired power of .90. It tells us that we need a sample size of N = 550 (275 subjects in each group). This yields a sampling distribution for M1 – M2 with σ(M1-M2) ≈ 1.71. The narrower sampling distribution moves the critical difference (the upper boundary of the “H0 acceptance” region) closer to 0, and places only 10% of the H1 distribution inside that region.

H1: μ1 – μ2 = 5

[Figure: H0 distribution centered at 0 and H1 distribution centered at 5, for N = 550, with M1-M2 crit = 2.8 (α = .05); the part of the H1 distribution below the critical difference is β = .10, and the part above it is 1 – β = .90 (power)]
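What the software does here can be approximated by hand. The sketch below is not the algorithm of any particular program: it assumes two equal groups, a one-tailed test, and a normal approximation, and the function name `n_per_group` is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(mindiff, sd, power=0.90, alpha=0.05):
    """Subjects needed per group to detect `mindiff` with the desired power."""
    z = NormalDist().inv_cdf
    return ceil(2 * sd**2 * (z(1 - alpha) + z(power)) ** 2 / mindiff**2)

n = n_per_group(mindiff=5, sd=20)
se = 20 * sqrt(2 / n)                   # standard error of M1 - M2
crit = NormalDist().inv_cdf(0.95) * se  # critical difference, alpha = .05

print(n, 2 * n)        # 275 550
print(round(crit, 1))  # 2.8
```

Lowering the desired power to .80 in the same function reduces the required sample size, which is the usual trade-off when 550 participants are not affordable.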
How does this solve the problem?

Assuming that we can afford to include 550 participants in our study …

If we get a null result (non-significant difference between the means of the two groups), we will be able to conclude with high confidence (90%) that the preparation course is either ineffective, or yields a benefit of less than 5 points.

If we get a significant result (in the expected direction), we will be able to conclude with high confidence (95%) that the preparatory course has some benefit.

Note, however, that unless the obtained difference is 7.8 points or more (5 points or more above the critical difference), we will not be able to say with high confidence that the preparatory course has a “meaningful” benefit (i.e., a benefit of 5 points or more). This is because although we have rejected H0: μ1 – μ2 ≤ 0, we have not rejected H0: μ1 – μ2 ≤ 5. To do this, we would have to pass the critical difference for this new null hypothesis, which is 5 points higher than the critical difference for the “usual” null hypothesis. Thus, there is still a chance (although a relatively small one) that the results will be inconclusive, namely if the observed difference is between 2.8 and 7.8 points.
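The three possible outcomes described above can be written as a small decision rule. This is only a sketch under a normal approximation; the function name `interpret` and the wording of the returned labels are made up for illustration, and the first conclusion is only warranted when the study was planned with adequate power, as in this example.

```python
from math import sqrt
from statistics import NormalDist

def interpret(observed, se, mindiff, alpha=0.05):
    """Classify an observed M1 - M2 using the two critical differences."""
    crit = NormalDist().inv_cdf(1 - alpha) * se  # for H0: difference <= 0
    if observed < crit:
        return "not significant: no benefit, or a benefit smaller than MINDIFF"
    if observed < crit + mindiff:                # for H0: difference <= MINDIFF
        return "significant: some benefit, but not shown to be MINDIFF or more"
    return "significant: benefit of MINDIFF or more"

se = 20 * sqrt(2 / 275)  # standard error of M1 - M2 in the N = 550 example

for diff in (2.0, 5.0, 8.0):
    print(diff, "->", interpret(diff, se, mindiff=5))
```

The middle category corresponds to the inconclusive band between 2.8 and 7.8 points.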
Summary – The Problem

With a significant result: We reject H0 and accept H1. We conclude with high confidence that there is an effect in the population.

With a non-significant result: We cannot reject H0, but neither can we accept H0 (reject H1). We cannot conclude with high confidence that there is no effect in the population.

Why? Because β is not known. We can calculate β only for a specific H1 (which assumes an effect of a specific size), because only then can we derive a sampling distribution for H1.
Summary – Possible Solutions

1. Acknowledge that the results are inconclusive, perhaps provisionally accepting H0 with (very) low confidence, as a “default” conclusion only.
2. Perform an a priori power analysis (before the experiment) to determine the needed sample size, so that any non-significant result that is later obtained will allow us to reject the existence of any effect larger than a pre-specified small, “meaningless” effect (MINDIFF).
3. Perform a post-hoc power analysis (after the result has been obtained) to determine what size of effect can be rejected with high confidence. This can be done just like the a priori analysis (but using the sample variance, S², to estimate σ²). Or, in the method outlined here, the observed result (sampled difference between means) is used to reject any H1 that assumes an effect larger than X. If we are lucky, X ≈ MINDIFF (an effect that we would consider meaningless, or “equivalent to zero”).

Note: Because the post-hoc observed-result approach uses the same logic that is normally used to reject H0, it makes sense that β should be set to the conventional level for α, that is, .05. When the more general type of power analysis is used (a priori or post hoc), the targeted β is often set at .10 or .20 (that is, desired power = .90 or .80).
Caveats

There are many aspects of power analysis that were either not mentioned or over-simplified (believe it or not!) in this tutorial. For instance, the focus was entirely on a one-tailed power analysis (i.e., H1 is one-tailed, and hence, we only need to reject a small positive [or negative] effect). For a two-tailed analysis, essentially we do two one-tailed analyses: first considering a positive H1, and then considering a negative H1 (i.e., where the hypothesized difference is negative). [Or, we can just specify the “two-tailed” option in the software program …]
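To give a feel for what the “two-tailed” option changes, the following sketch compares one-tailed and two-tailed power for the N = 550 example under a normal approximation. The function name `power` is hypothetical, and real power software may differ in detail (for example, by using the t distribution).

```python
from math import sqrt
from statistics import NormalDist

def power(effect, se, alpha=0.05, two_tailed=False):
    """Probability of a significant result when the true difference is `effect`."""
    h1 = NormalDist(mu=effect, sigma=se)
    if two_tailed:
        # alpha is split between the tails, so the critical difference is larger.
        crit = NormalDist().inv_cdf(1 - alpha / 2) * se
        return (1 - h1.cdf(crit)) + h1.cdf(-crit)
    crit = NormalDist().inv_cdf(1 - alpha) * se
    return 1 - h1.cdf(crit)

se = 20 * sqrt(2 / 275)  # standard error of M1 - M2 in the N = 550 example

print(round(power(5, se), 2))                   # one-tailed
print(round(power(5, se, two_tailed=True), 2))  # two-tailed
```

With the same sample size the two-tailed test has lower power, so a two-tailed design needs a somewhat larger N to reach the same target.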
Links

A good web site for doing power calculations: Russ Lenth’s site
A good but slightly more complicated program: G*Power