1 Lecture 7 POWER IN STATISTICAL TESTING
2 The caffeine experiment In the Caffeine experiment, there were two groups: 1. the CAFFEINE group; 2. the PLACEBO group.
3 Statistical summary
4 Statistics and parameters We use Roman letters to denote the statistics of the sample; we use Greek letters to denote PARAMETERS, that is, characteristics of the population.
5 The null hypothesis The null hypothesis (H0) states that, in the population, the Caffeine and Placebo means have the same value. H0: μ1 = μ2
6 The alternative hypothesis The alternative hypothesis (H1) states that the Caffeine and Placebo means are NOT equal. H1: μ1 ≠ μ2
7 The test statistic To test the null hypothesis, we shall need a TEST STATISTIC, in this case one which reflects the size of the difference between the sample means M1 and M2. The t statistic fits the bill.
8 The value of t We do not know the supposedly constant population variance σ². Our pooled variance estimate of σ² is 10.37.
9 Sampling distribution of t There are many t distributions. To test H0, we must locate our value of t in the appropriate t distribution. That distribution is specified by the DEGREES OF FREEDOM df, which is given by df = n1 + n2 – 2. In our example, df = 20 + 20 – 2 = 38.
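The arithmetic behind the pooled variance estimate, the degrees of freedom and t can be checked with a few lines of Python. This is only a sketch working from the summary statistics quoted in these slides (n = 20 per group, the two means and SDs); the variable names are mine, and the raw scores are not used.

```python
import math

# Summary statistics quoted in the lecture (n = 20 per group).
n1, m1, sd1 = 20, 11.90, 3.28   # Caffeine
n2, m2, sd2 = 20, 9.25, 3.16    # Placebo

# Degrees of freedom for the independent-samples t test.
df = n1 + n2 - 2

# Pooled variance: the df-weighted average of the two sample variances.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df

# Standard error of the difference between the means, then t.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se_diff

print(f"pooled variance = {pooled_var:.2f}, df = {df}, t = {t:.2f}")
```

With equal group sizes the weighting makes no difference, so the pooled variance is simply the mean of the two sample variances.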
10 Appearance of a t distribution A t distribution is very like the standard normal distribution. They are both centred on zero. The greater the difference between the two means, the further out in either tail your value of t will be. If the null hypothesis is true, the value of (M1 – M2) and therefore that of t will usually be small. A large value of t is taken to be evidence against the null hypothesis. (Figure: the standard normal distribution, Z ~ N(0, 1), and the t(2) distribution, both centred on 0, with an area of .95 marked.)
11 Significance To make a test of SIGNIFICANCE, we locate a CRITICAL region of values in the t distribution with 38 degrees of freedom. A SIGNIFICANCE LEVEL or ALPHA-LEVEL is a small probability α, such as 0.05 or 0.01, fixed by convention. In psychology, the 0.05 level is generally accepted. We locate the critical region in the tails of the t distribution, so that the probability of a value in EITHER one tail OR the other is α. In our case, α = 0.05. The probability of a value in one PARTICULAR tail is α/2 = 0.025.
12 The critical region We shall reject the null hypothesis if the value of t falls within EITHER tail of the t distribution on 38 degrees of freedom. To be significant beyond the .05 level, our value of t must be greater than +2.02 OR less than –2.02. (To be significant beyond the .01 level, our value of t must be greater than +2.71 or less than –2.71.)
13 The p-value The TOTAL blue area is the probability, under the null hypothesis, of getting a value of t at least as extreme as your value. If the p-value is less than .05, you are in the critical region, and your value of t is significant beyond the .05 level. (Figure: a t distribution centred on 0; the blue area in the lower tail is labelled "Pr of a value at least as small as yours".)
14 The sign of t The sign of t is IRRELEVANT. If t is negative, it's in the lower tail; if it's positive, it's in the upper tail. In either case, the p-value is the TOTAL blue area, because an extreme value in EITHER direction is evidence against the null hypothesis.
15 Direction of subtraction in the t test The value of t is based upon the difference between means M1 – M2. If the first mean is larger than the second, the difference (and t) will be positive. When, in the t-test procedure, you complete the Define Groups dialog, the mean entered second will be subtracted from the mean entered first. We entered the values in the order 1, 0, ensuring that the Placebo mean would be subtracted from the Caffeine mean.
16 Avoiding the negative sign By entering the Caffeine mean first, we ensure that the sign of t is positive. Had the Placebo mean been larger, the sign of t could still have been kept positive by entering that mean first when defining the groups in the t-test dialog. That's fine: a difference in EITHER direction can lead to rejection of the null hypothesis.
17 Result of the t test The p-value of t = 2.60 is .01 (to 2 places of decimals). Our t test has shown significance beyond the .05 level. But the p-value is greater than .01, so the result, although significant beyond the .05 level, is not significant beyond the .01 level.
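To see where a two-sided p-value of this kind comes from, here is a minimal sketch that integrates the density of the t distribution numerically. It is not how SPSS obtains the figure: the helper names `t_pdf` and `two_sided_p` are mine, and Simpson's rule is used only so that the example needs nothing beyond the Python standard library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) under H0, using Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3       # area between 0 and |t|
    return 2 * (0.5 - area_zero_to_t)    # both tails together

p = two_sided_p(2.60, 38)
print(f"two-sided p = {p:.3f}")
print("significant at .05:", p < 0.05, "| significant at .01:", p < 0.01)
```

Because the function works with |t|, the sign of t makes no difference to the result, which is the point made on slide 14.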
18 Your report The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p = .01. Here, 38 is the degrees of freedom, 2.60 is the value of t, and the p-value is expressed to two places of decimals.
19 Type I error Suppose the null hypothesis is true. If you keep sampling a large number of times, every now and again (in 5% of samples), you will get a value of t in one of the tail areas (the critical region) and reject the null hypothesis. You will have made a TYPE I ERROR. The probability of a Type I error is the significance level, which is also termed the ALPHA-LEVEL.
20 Type II error A Type II error is said to occur when a test fails to show significance when the null hypothesis is FALSE. The probability of a Type II error is symbolised as β, and is also known as the TYPE II ERROR RATE or BETA RATE.
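21 The beta rate The light-grey area is part of the critical region. Any value of t outside the tail of the H0 distribution is not significant. The dark area is the probability that the null hypothesis will be accepted, even though it is false. This is the BETA RATE. (Figure: the sampling distributions under H0, centred on 0, and under H1, centred on the real difference, with an area of α/2 marked in the tail of the H0 distribution.)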
22 Power The POWER of a statistical test is the probability that the null hypothesis will be rejected, given that it is FALSE.
23 Power The power of the statistical test is the area under the H1 distribution to the right of the dark-grey area. (Figure: the H0 and H1 distributions, as on slide 21.)
24 Power and the beta rate Since the entire area under either curve is 1, the area in the H1 distribution to the right of the dark area (the power) is 1 – β. (Figure: the H0 and H1 distributions, as on slide 21.)
25 Increasing the power By REDUCING the beta-rate, you INCREASE the power of the test, because beta-rate + power = 1.
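The relation beta-rate + power = 1 can be illustrated numerically. The sketch below is an approximation under assumptions of my own: it treats the test statistic as normally distributed (rather than following a noncentral t distribution), takes the real difference to be 0.8 SD, and uses 20 participants per group as in the caffeine experiment. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-tailed power and beta for an independent-samples test."""
    # Expected position of the test statistic under H1 (the "real difference").
    delta = d * sqrt(n_per_group / 2)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    # Beta: probability of landing between the two critical values when H1 is true.
    beta = Z.cdf(z_crit - delta) - Z.cdf(-z_crit - delta)
    return 1 - beta, beta

power, beta = approx_power(d=0.8, n_per_group=20)
print(f"beta = {beta:.2f}, power = {power:.2f}, sum = {beta + power:.2f}")
```

An exact calculation based on the t distribution gives a slightly lower power than this approximation, so the output should be read as a rough guide only.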
26 Type I and Type II errors: power Type I error: Pr = α. Type II error: Pr = β.
27 A change in emphasis Traditionally, the emphasis has been on controlling Type I errors. This is achieved by insisting upon statistical significance. Even more control would be achieved by adopting the 0.01 significance level, rather than the 0.05 level. So why not fix alpha at 0.01, rather than 0.05?
28 Significance level and the beta-rate Suppose you decide upon a smaller significance level (lower figure). The probability of a Type II error (green) increases. The power (P) decreases. (Figure: two panels, each showing the areas β and P; the lower panel uses the smaller significance level.)
29 The need to strike a balance Adopting the 0.01 level reduces the Type I error rate. But that INCREASES the Type II error rate and REDUCES the power of the test. It is now considered that, in the past, there was insufficient concern with high beta-rates and low power. The 0.05 level is thought to achieve the right BALANCE between Type I and Type II errors.
30 Power and sample size Increasing the sample size reduces the overlap of the sampling distributions under H0 and H1. The beta-rate is reduced and so the power increases. (Figure: the H0 and H1 distributions, centred on 0 and on the real difference, for small samples and for large samples.)
31 Small samples Small samples bring the distributions together, increase the size of the dark-grey area and reduce the power of the test. Large samples separate the distributions, reduce the dark-grey area and increase the power of the test. (Figure: the H0 and H1 distributions, as on slide 21.)
32 Factors affecting the Type II error rate An insufficiency of data (too few participants) means that the sampling distributions under H0 and under H1 overlap too much and most of the H1 distribution lies below the critical value for rejection of H0. A similar effect arises from unreliable data, which inflate random variation and increase the overlap. Finally, as we have seen, fixing a lower significance level (0.01 rather than 0.05) also increases the beta-rate.
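The three factors just listed can be compared side by side. This sketch again relies on a normal approximation, which is my simplification. The baseline uses the caffeine figures (a difference of 2.65 points, an SD of about 3.22, 20 per group); the halved sample size and the inflated SD of 5.00 are made-up values chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def approx_power(mean_diff, sd, n_per_group, alpha):
    """Approximate two-tailed power (normal approximation, an assumption of this sketch)."""
    delta = (mean_diff / sd) * sqrt(n_per_group / 2)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    beta = Z.cdf(z_crit - delta) - Z.cdf(-z_crit - delta)
    return 1 - beta

baseline   = approx_power(2.65, 3.22, n_per_group=20, alpha=0.05)
fewer_data = approx_power(2.65, 3.22, n_per_group=10, alpha=0.05)  # too few participants
noisier    = approx_power(2.65, 5.00, n_per_group=20, alpha=0.05)  # unreliable data (hypothetical SD)
stricter   = approx_power(2.65, 3.22, n_per_group=20, alpha=0.01)  # lower significance level

for name, value in [("baseline", baseline), ("fewer participants", fewer_data),
                    ("noisier data", noisier), ("alpha = .01", stricter)]:
    print(f"{name:20s} power = {value:.2f}, beta = {1 - value:.2f}")
```

Each change lowers the power relative to the baseline, which is the same as raising the beta-rate.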
33 How much power do we need? Cohen (1988) observed a general tendency for psychologists to be preoccupied with avoiding Type I errors and insufficiently concerned with the possibility of Type II errors. Most tests had insufficient power. Power levels as low as 0.05 were not uncommon. A MINIMUM POWER OF 0.75 IS RECOMMENDED.
34 Effect size in the t-test We obtained a difference between the Caffeine and Placebo means of (11.90 – 9.25) = 2.65 score points. If we take the spread of the scores to be the average of the Caffeine and Placebo SDs, we have an average SD of about 3.22 score points. So the means of the Caffeine and Placebo groups differ by about 0.8 SD.
35 Measuring effect size: Cohen's d statistic In our example, the value of Cohen's d is 2.65/3.22 = .8 (to one place of decimals). Is this a large difference?
36 Levels of effect size Jacob Cohen proposed that we regard a d of.2 as a SMALL effect size, a d of.5 as a MEDIUM effect size and a d of.8 as a LARGE effect size. So our experimental result is a large effect.
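As a small illustration, Cohen's d and its verbal label can be computed as follows. The sketch follows the lecture in dividing by the simple average of the two SDs (other texts divide by the pooled SD, which is almost identical here); the function names and the "below small" label for values under .2 are my own.

```python
def cohens_d(m1, m2, sd1, sd2):
    """Difference between the means in units of the average SD."""
    return (m1 - m2) / ((sd1 + sd2) / 2)

def effect_size_label(d):
    """Cohen's benchmarks: .2 small, .5 medium, .8 large."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

d = cohens_d(11.90, 9.25, 3.28, 3.16)
print(f"d = {d:.2f} ({effect_size_label(d)})")
```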
37 How many participants? That depends upon the minimum effect size that you want to pick up with your significance tests. You also want to make sure your power is at least at the 0.75 level. You can obtain the number of participants necessary by looking up tables (Cohen, 1988; Clark-Carter, 2004).
38 Books with power tables Clark-Carter, D. (2004). Quantitative psychological research: a student's handbook (2nd ed.). Hove: Psychology Press. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
39 Using the Web The Web is a very useful source of up-to-date information on all statistical topics, including power and effect size. Use the Google search engine. Its address is http://www.google.com Use the phrases "statistical power" and "effect size".
40 Software Software is available for finding the numbers of participants you will need to make a test at a specified level of power. An example is G*Power (Erdfelder et al., 1996). G*Power is available on the Web.
41 The Google window
42 Useful information on the Web
43 The G*Power opening window
44 Using G*Power You can choose the a priori option to find out how many participants you would need to achieve a power of .95 of rejecting the null hypothesis for a medium effect on a one-tail test. You would need 176 participants! That's why we usually settle for a power level of 0.75 rather than 0.95! (Screenshot: the values you fill in, and the answers G*Power returns.)
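A rough version of this a priori calculation can be done by hand. The sketch below uses the familiar normal-approximation formula n per group = 2((z_alpha + z_power)/d)², which is my substitute for G*Power's exact t-based method, so it comes out a couple of participants short of the 176 quoted above. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power, alpha=0.05, tails=1):
    """Approximate participants needed per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)   # critical value for the chosen alpha
    z_power = z.inv_cdf(power)               # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

n95 = n_per_group(d=0.5, power=0.95)   # medium effect, one-tail, power .95
n75 = n_per_group(d=0.5, power=0.75)   # same test, power .75

print(f"power .95: about {2 * n95} participants in total")
print(f"power .75: about {2 * n75} participants in total")
```

The comparison shows why the lower power level is so much cheaper: the required sample is roughly halved.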
45 Directional hypotheses The null hypothesis states simply that, in the population, the Caffeine and Placebo means are equal. H0 is refuted by a sufficiently large difference between the means in EITHER direction. But some argue that if our scientific (alternative) hypothesis is that Caffeine improves performance, we should be looking at differences in only ONE direction.
46 One-tail tests Suppose we are only interested in the possibility of a difference in ONE DIRECTION. Some argue that we should locate the critical region for our test in one tail only of the t distribution. The critical value of t is now lower. (Figure: for the two-tailed test, an area of 0.025 (2.5%) in each tail, beyond –2.02 and +2.02; for the one-tailed test, an area of 0.05 (5%) in the upper tail, beyond +1.69.)
47 Another result Suppose that the mean score of the Caffeine group had been, not 11.90, but 10.97. In other words, the Caffeine and Placebo means differ by only 1.72 points, rather than 2.65 points, as in the original example. The value of t is now 1.69, which is the critical value on a one-tail test.
48 Report of the one-tailed test The scores of the Caffeine group (M = 10.97; SD = 3.28) were significantly higher than those of the Placebo group (M = 9.25; SD = 3.16): t(38) = 1.69; p = .05 (one-tailed). Here, 38 is the degrees of freedom, 1.69 is the value of t, and the p-value is expressed to two places of decimals.
49 Advantage of the one-tailed test Our t value of 1.69 would have failed to achieve significance on the two-tailed test, since the critical value was +2.02. On the one-tailed test, however, t is in the critical region and the null hypothesis is rejected.
50 More power In locating the entire critical region in the upper tail of the H0 distribution, we increase the light-grey area and reduce the dark-grey area. In other words, we increase the POWER of the test to reject H0. (Figure: the H0 and H1 distributions, with the H1 distribution centred on the real difference.)
51 An unexpected result Now suppose that, against expectation, the Placebo group had outperformed the Caffeine group. The mean for the Caffeine group is 9.25 and that for the Placebo group is 10.20. Were we to subtract the smaller mean from the larger (as in the traditional approach), we should obtain t = +2.02. On a two-tailed test, this is in the critical region (p < .05) and we should reject the null hypothesis.
52 The null hypothesis
53 Direction of subtraction The direction of subtraction is now crucial. You MUST subtract the Placebo mean from the Caffeine mean. Only a POSITIVE value of t can falsify the directional null hypothesis.
54 One-sided p-value In the one-tailed test, the p-value is ONE-SIDED, that is, it is the probability, under the one-sided null hypothesis, that the Caffeine mean would have been at least as LARGE as the one we obtained.
55 The one-sided p-value The one-sided p-value is the entire area under the curve to the right of your value of t. This includes the black area. That area is 0.975. (Figure: a t distribution with your value of t marked at –2.02.)
56 Report of the one-tail test The scores of the Caffeine group (M = 9.25; SD = 3.16) were not significantly higher than those of the Placebo group (M = 10.20; SD = 3.28): t(38) = –2.02; p = 0.975 (one-tailed).
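The one-sided p-value is simply the area to the right of the obtained t, so its size depends heavily on the sign of t. The sketch below computes that area by numerical integration for the two cases discussed on slides 47 to 55: t = +1.69 (the expected direction) and t = –2.02 (the unexpected direction). The helper names are mine, and Simpson's rule is used only to keep the example within the standard library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail_p(t, df, steps=2000):
    """One-sided p-value P(T >= t) under H0: the area to the right of t."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                  # area between 0 and |t| (Simpson's rule)
    return 0.5 - area if t >= 0 else 0.5 + area

p_expected = upper_tail_p(1.69, 38)      # Caffeine mean higher, as predicted
p_unexpected = upper_tail_p(-2.02, 38)   # Placebo mean higher, against prediction

print(f"t = +1.69: one-sided p = {p_expected:.3f}")
print(f"t = -2.02: one-sided p = {p_unexpected:.3f}")
```

A negative t of any size gives a one-sided p-value above .5, which is why the direction of subtraction is crucial in the one-tailed test.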
57 Changing horses Oh dear! I haven't got significance! I won't get my paper published! I know: I'll make a two-tail test instead. There we are, the result is significant! Problem solved!
58 The true Type I error rate If you follow that practice, your true Type I error rate is now the black area (0.05) PLUS the area in the lower tail (0.025). This is 0.05 + 0.025 = 0.075, a level many feel is too high.
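The figure of 0.075 can be checked by simulation. The sketch below is a Monte Carlo illustration under assumptions of my own: both groups are drawn from the same normal population (so H0 is true), there are 20 scores per group, and the researcher first tries the one-tailed test (critical value 1.69) and then switches to the two-tailed test (critical value 2.02) if that fails. The seed and the number of simulated experiments are arbitrary.

```python
import math
import random

random.seed(1)
N, SIMS = 20, 10_000
T_ONE_TAIL, T_TWO_TAIL = 1.69, 2.02   # critical values for df = 38 quoted in the lecture

def t_statistic(a, b):
    """Independent-samples t with a pooled variance estimate."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

rejections = 0
for _ in range(SIMS):
    caffeine = [random.gauss(0, 1) for _ in range(N)]
    placebo = [random.gauss(0, 1) for _ in range(N)]
    t = t_statistic(caffeine, placebo)
    # Planned one-tailed test first; if it fails, "change horses" to a two-tailed test.
    if t > T_ONE_TAIL or abs(t) > T_TWO_TAIL:
        rejections += 1

rate = rejections / SIMS
print(f"Proportion of false rejections: {rate:.3f}")
```

The proportion should settle near 0.075 rather than the nominal 0.05, apart from sampling error in the simulation.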
59 One-tailed tests? I have concerns about the making of one-tailed tests. There are two reasons for this. 1. You could find yourself unable to report a surprising result. 2. Many journal editors are suspicious of one-tail tests, and with good reason.
60 A special situation Neuropsychologists often want to know whether a score is so far BELOW the norm that there is evidence for brain damage. They want to establish a cut-off point below which a score is deemed to be indicative of impairment. A one-tail test is appropriate there. But I think one-tail tests are seldom justified in RESEARCH, as opposed to clinical practice.
61 Making your choice: some guidelines
62 Advice In Chapter 1 of Kinnear & Gray (2004, 2006), there is some advice on the choosing of an appropriate statistical test. Your choice depends upon three main considerations: 1. Your RESEARCH QUESTION 2. The plan or DESIGN of your research 3. The kind of DATA you have.
63 1. Five research questions
64 The first decision Are you comparing averages on the SAME VARIABLE over different conditions or groups? OR Are you investigating a possible association between TWO measured variables?
65 The caffeine experiment
66 Making a comparison There is only ONE measured variable: the dependent variable, Performance. We want to COMPARE the AVERAGE scores of the two groups.
67 The violence study There are TWO variables, Actual violence and Exposure to screen violence. We are investigating a possible ASSOCIATION between them.
68 Comparison or association? Comparing or associating? Comparing: t-tests, ANOVA. Associating: correlation, regression.
69 2. Design of the study Two important aspects: 1. COMPLEXITY: how many conditions or groups were there? How many measured variables? 2. SAMPLING STRATEGY: was your experiment of between subjects or within subjects design?
70 Complexity Comparing. Simple (2 groups or conditions): t-test. Complex (three or more): ANOVA.
72 Sampling strategy Comparing. Between subjects: independent samples. Within subjects (repeated measures): related samples.
73 3. Your data Two aspects are important: 1. Their LEVEL OF MEASUREMENT 2. Their DISTRIBUTION
74 Levels of measurement 1. SCALE data are measures on an independent scale with units: heights, weights, IQs. 2. ORDINAL data are ranks. 3. NOMINAL data are assignments to categories: so-many males, so-many females.
75 Implications of level of measurement If you have nominal data, you can't run a t-test or an ANOVA. If you have two or more sets of ranks, you can't run a Pearson correlation. Other statistics are available for those purposes.
76 The distribution Many statistical techniques such as t-tests and ANOVA assume normality of distribution. There may be additional requirements such as homogeneity of variance. If the data violate the assumptions of the model underlying the test, the p-value of the test statistic may be inaccurate.
77 Decision charts There is no safe, automated way of choosing a statistical test correctly. The various charts in Chapter 1 of Kinnear & Gray (2004, 2006) are intended to be GUIDELINES, rather than prescriptions. There has been considerable controversy about the correct choice in some situations. Here is one such chart.
78 Decision chart for choosing a test of the difference between means, medians, or frequencies
79 Well-behaved data You have run an experiment of simple, two-group between subjects design. At the bottom of the chart you see listed the independent-samples t-test and the Mann-Whitney test, which is a nonparametric equivalent. If your data conform to t-test requirements, the t-test is preferred: it has more POWER than the Mann-Whitney test.
80 Nonconforming data But suppose your data violate the requirements of the t-test. Suppose the distributions are far from normal and there appears to be marked heterogeneity of variance. Some would say you should run a Mann-Whitney test instead of a t-test. Others, however, would urge you to clean up your data and run the version of the t-test that doesn't make the assumption of homogeneity of variance.
81 Example Type II error rate. Type I error rate. Type II error rate.
82 Appendix 1 EQUIVALENCE OF ANOVA AND THE t-TEST IN THE SPECIAL CASE OF TWO GROUPS OR CONDITIONS
83 The two-group case In the first lecture, I revised the independent-samples t-test and illustrated the procedure with the data from the caffeine experiment. Returning to the caffeine experiment, what would happen if, instead of making a t test, we were to run an ANOVA to test the null hypothesis of equality of the means?
84 The two-group case: comparison of ANOVA with the t-test Observe that F = t². Observe also that the p-value is the same for both tests. The ANOVA and the independent-samples t test are EXACTLY EQUIVALENT and produce the same decision about the null hypothesis.
85 Equivalence of the F and t tests
86 The t statistic In the present example (where n1 = n2), the pooled estimate s² of σ² is simply the mean of the variance estimates from the two samples.
87 In the two-group case … The pooled variance estimate s² in the t statistic is identical with the pooled variance estimate MS within in the one-way ANOVA. When we specify a t statistic, we put the degrees of freedom in brackets, as in t(38). In the present comparative context, this is t(df within).
88 Equivalence of F and t in the two-group case When there are only two groups, the value of F is the square of the value of t. So if t is significant, then so is F and vice versa.
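The identity F = t² can be verified directly from the summary statistics of the caffeine experiment. This sketch builds the one-way ANOVA mean squares by hand for two equal-sized groups; it assumes n = 20 per group and the means and SDs quoted earlier, and the variable names are mine.

```python
n = 20                      # participants per group
m1, sd1 = 11.90, 3.28       # Caffeine
m2, sd2 = 9.25, 3.16        # Placebo
k = 2                       # number of groups

# Between-groups mean square: spread of the group means around the grand mean.
grand_mean = (m1 + m2) / 2
ms_between = n * ((m1 - grand_mean) ** 2 + (m2 - grand_mean) ** 2) / (k - 1)

# Within-groups mean square: with equal n, the mean of the two sample variances.
ms_within = (sd1 ** 2 + sd2 ** 2) / 2

F = ms_between / ms_within

# The t statistic uses the same pooled variance estimate (MS within).
t = (m1 - m2) / (ms_within * (2 / n)) ** 0.5

print(f"F(1, {k * (n - 1)}) = {F:.2f}, t({k * (n - 1)}) = {t:.2f}, t squared = {t ** 2:.2f}")
```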
90 The related-samples t-test The same equivalence holds with the within subjects ANOVA and the related-samples t-test.