# Lecture 7 POWER IN STATISTICAL TESTING


The caffeine experiment
In the Caffeine experiment, there were two groups: the CAFFEINE group and the PLACEBO group.

Statistical summary

Statistics and parameters
We use Roman letters to denote STATISTICS, that is, characteristics of the sample; we use Greek letters to denote PARAMETERS, that is, characteristics of the population.

The null hypothesis
The null hypothesis (H0) states that, in the population, the Caffeine and Placebo means have the same value.
H0: μ1 = μ2

The alternative hypothesis
The alternative hypothesis (H1) states that, in the population, the Caffeine and Placebo means are NOT equal.
H1: μ1 ≠ μ2

The test statistic
To test the null hypothesis, we shall need a TEST STATISTIC, in this case one which reflects the size of the difference between the sample means M1 and M2. The t statistic fits the bill.

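The value of t
The t statistic is the difference between the sample means divided by an estimate of its standard error:
$$t = \frac{M_1 - M_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
We do not know the supposedly constant population variance σ². Our estimate of σ² is the pooled variance estimate:
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$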

Sampling distribution of t
There are many t distributions. To test H0, we must locate our value of t in the appropriate t distribution. That distribution is specified by the DEGREES OF FREEDOM df, which is given by df = n1 + n2 – 2. In our example, df = 20 + 20 – 2 = 38.
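The calculation of t can be sketched in a few lines of Python. The means and SDs are those reported for the caffeine experiment; the group size of 20 is inferred from df = 38 with equal groups, and the function name `pooled_t` is merely illustrative.

```python
import math

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t from summary statistics, assuming equal population variances."""
    df = n1 + n2 - 2
    # Pooled estimate of the common variance: the variances weighted by their df.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Estimated standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df, pooled_var

# Caffeine group first, Placebo group second.
t_value, df, pooled_var = pooled_t(11.90, 3.28, 20, 9.25, 3.16, 20)
print(f"pooled variance = {pooled_var:.2f}, t({df}) = {t_value:.2f}")
```

With equal group sizes, the pooled variance is simply the mean of the two sample variances.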

Appearance of a t distribution
A t distribution is very like the standard normal distribution: both are centred on zero. The greater the difference between the two means, the further out in either tail your value of t will be. If the null hypothesis is true, the value of (M1 – M2) and therefore that of t will usually be small. A large value of t is taken to be evidence against the null hypothesis.
[Figure: the standard normal distribution Z ~ N(0, 1) and the t(2) distribution, each with the central .95 area marked]

Significance
To make a test of SIGNIFICANCE, we locate a CRITICAL region of values in the t distribution with 38 degrees of freedom. A SIGNIFICANCE LEVEL or ALPHA-LEVEL is a small probability α, such as 0.05 or 0.01, fixed by convention. In psychology, the 0.05 level is generally accepted. We locate the critical region in the tails of the t distribution, so that the probability of a value in EITHER one tail OR the other is α. In our case, α = 0.05. The probability of a value in one PARTICULAR tail is α/2 = 0.025.

The critical region
We shall reject the null hypothesis if the value of t falls within EITHER tail of the t distribution on 38 degrees of freedom. To be significant beyond the .05 level, our value of t must be greater than +2.02 OR less than –2.02. (To be significant beyond the .01 level, our value of t must be greater than +2.71 or less than –2.71.)

The p-value
The TOTAL blue area is the probability, under the null hypothesis, of getting a value of t at least as extreme as your value. If the p-value is less than .05, you are in the critical region, and your value of t is significant beyond the .05 level.
[Figure: t distribution with shaded tail areas, labelled "Pr of a value at least as small as yours"]

The sign of t
The sign of t is IRRELEVANT. If t is negative, it’s in the lower tail; if it’s positive, it’s in the upper tail. In either case, the p-value is the TOTAL blue area, because an extreme value in EITHER direction is evidence against the null hypothesis.
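To see that the two-sided p-value depends only on the size of t and not on its sign, here is a minimal sketch that integrates the t density numerically, using only the Python standard library. The helper names are illustrative, and Simpson's rule is simply one convenient choice of integration method; a statistics package would normally do this for you.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T >= |t|): Simpson's rule on [0, |t|], then use the symmetry of the density."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def two_sided_p(t, df):
    """Total area in BOTH tails beyond |t|."""
    return 2 * upper_tail(t, df)

p_pos = two_sided_p(2.60, 38)
p_neg = two_sided_p(-2.60, 38)
print(f"t = +2.60: p = {p_pos:.3f};  t = -2.60: p = {p_neg:.3f}")
```

Both calls give the same p-value, which lies between .01 and .05.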

Direction of subtraction in the t test
The value of t is based upon the difference between means M1 – M2. If the first mean is larger than the second, the difference (and t) will be positive. When, in the t-test procedure, you complete the Define Groups dialog, the mean entered second will be subtracted from the mean entered first. We entered the values in the order 1, 0, ensuring that the Placebo mean would be subtracted from the Caffeine mean.

Avoiding the negative sign
By entering the Caffeine mean first, we ensure that the sign of t is positive. Had the Placebo mean been larger, the sign of t could still have been kept positive by entering that mean first when defining the groups in the t-test dialog. That’s fine – a difference in EITHER direction can lead to rejection of the null hypothesis.

Result of the t test
The p-value for t = 2.60 is .01 (to two places of decimals). Our t test has shown significance beyond the .05 level. But the unrounded p-value is greater than .01, so the result, although significant beyond the .05 level, is not significant beyond the .01 level.

Your report
“The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p = .01.”
In t(38) = 2.60, the figure in brackets is the degrees of freedom and 2.60 is the value of t. The p-value is expressed to two places of decimals.

Type I error
Suppose the null hypothesis is true. If you keep sampling a large number of times, every now and again (in 5% of samples), you will get a value of t in one of the tail areas (the critical region) and reject the null hypothesis. You will have made a TYPE I ERROR. The probability of a Type I error is the significance level, which is also termed the ALPHA-LEVEL.
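The idea of repeated sampling under a true null hypothesis can be tried out by simulation. The sketch below is only an illustration: the population mean of 10 and SD of 3 are arbitrary assumed values, the group size of 20 matches the caffeine design, and 2.024 is the tabled two-tailed .05 critical value for 38 degrees of freedom.

```python
import math
import random

def t_statistic(a, b):
    """Pooled-variance t for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def type_i_rate(reps=10000, n=20, critical=2.024, seed=1):
    """Proportion of simulated experiments that reject H0 when H0 is TRUE."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Both groups come from the SAME population, so any rejection is a Type I error.
        a = [rng.gauss(10, 3) for _ in range(n)]
        b = [rng.gauss(10, 3) for _ in range(n)]
        if abs(t_statistic(a, b)) > critical:
            rejections += 1
    return rejections / reps

rate = type_i_rate()
print(f"Proportion of Type I errors: {rate:.3f}")
```

The proportion should come out close to the alpha-level of 0.05, varying a little from one random seed to another.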

Type II error
A Type II error is said to occur when a test fails to show significance even though the null hypothesis is FALSE. The probability of a Type II error is symbolised as β, and is also known as the TYPE II ERROR RATE or BETA RATE.

The beta rate
The light-grey area is part of the critical region. Any value of t that falls outside the tails of the H0 distribution is not significant. The dark area is the probability that the null hypothesis will be accepted, even though it is false. This is the BETA RATE.
[Figure: sampling distributions under H0 and H1, separated by the real difference, with the α/2 tail marked]

Power
The POWER of a statistical test is the probability that the null hypothesis will be rejected, given that it is FALSE.

Power
The power of the statistical test is the area under the H1 distribution to the right of the dark-grey area.
[Figure: sampling distributions under H0 and H1, separated by the real difference, with the α/2 tail marked]

Power and the beta rate
Since the entire area under either curve is 1, the area in the H1 distribution to the right of the dark area (the power) is 1 – β.
[Figure: sampling distributions under H0 and H1, separated by the real difference, with the α/2 tail marked]

Increasing the power
By REDUCING the beta-rate, you INCREASE the power of the test, because beta-rate + power = 1.
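The relation between power and the beta-rate can be illustrated with a rough calculation. This sketch uses a normal approximation to the sampling distributions rather than the exact t-based calculation, and it assumes, purely for illustration, a true difference of 0.8 SD with 20 participants per group; the function name `approx_power` is hypothetical.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed, two-group test with equal group sizes."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Centre of the H1 distribution of the test statistic (the "real difference").
    shift = d * (n_per_group / 2) ** 0.5
    # Area of the H1 distribution falling in the upper and lower critical regions.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

power = approx_power(d=0.8, n_per_group=20)
beta = 1 - power
print(f"power = {power:.2f}, beta-rate = {beta:.2f}, sum = {power + beta:.2f}")
```

An exact calculation based on the t distribution would give a somewhat lower power, but the two quantities would still sum to 1.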

Type 1 and type 2 errors: power
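Type I error (rejecting H0 when it is true): Pr = α
Type II error (accepting H0 when it is false): Pr = β
Power (rejecting H0 when it is false): Pr = 1 – β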

A change in emphasis
Traditionally, the emphasis had been on controlling Type I errors. This is achieved by insisting upon statistical significance. Even more control would be achieved by adopting the 0.01 significance level, rather than the 0.05 level. So why not fix alpha at 0.01, rather than 0.05?

Significance level and the beta-rate
Suppose you decide upon a smaller significance level (lower figure). The probability of a Type II error (green) increases. The power (P) decreases.
[Figure: two panels showing β and the power P at the larger and at the smaller significance level]

The need to strike a balance
Adopting the 0.01 level reduces the Type I error rate. But that INCREASES the Type II error rate and REDUCES the power of the test. It is now considered that, in the past, there was insufficient concern with high beta-rates and low power. The 0.05 level is thought to achieve the right BALANCE between Type I and Type II errors.

Power and sample size
Increasing the sample size reduces the overlap of the sampling distributions under H0 and H1. The beta-rate is reduced and so the power increases.
[Figure: H0 and H1 distributions for small samples and for large samples, with the same real difference]

Small samples
Small samples bring the distributions together, increase the size of the dark-grey area and reduce the power of the test. Large samples separate the distributions, reduce the dark-grey area and increase the power of the test.
[Figure: sampling distributions under H0 and H1, separated by the real difference, with the α/2 tail marked]

Factors affecting the Type II error rate
An insufficiency of data (too few participants) means that the sampling distributions under H0 and under H1 overlap too much and most of the H1 distribution lies below the critical value for rejection of H0. A similar effect arises from unreliable data, which inflate random variation and increase the overlap. Finally, as we have seen, fixing a lower significance level (0.01 rather than 0.05) also increases the beta-rate.
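The three factors can be compared side by side with a rough sketch. It uses a normal approximation, not the exact t-based calculation, and treats the observed caffeine difference (2.65 points) and an SD of 3.22 as if they were the true population values, which is an assumption made only for illustration. The 'unreliable data' case simply inflates the SD by half.

```python
from statistics import NormalDist

def beta_rate(diff, sd, n_per_group, alpha):
    """Approximate Type II error rate of a two-tailed, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = (diff / sd) * (n_per_group / 2) ** 0.5
    power = (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
    return 1 - power

baseline = beta_rate(diff=2.65, sd=3.22, n_per_group=20, alpha=0.05)
fewer_participants = beta_rate(diff=2.65, sd=3.22, n_per_group=10, alpha=0.05)
noisier_scores = beta_rate(diff=2.65, sd=4.83, n_per_group=20, alpha=0.05)
stricter_alpha = beta_rate(diff=2.65, sd=3.22, n_per_group=20, alpha=0.01)

for label, value in [("baseline", baseline),
                     ("too few participants", fewer_participants),
                     ("unreliable data", noisier_scores),
                     ("alpha = 0.01", stricter_alpha)]:
    print(f"{label:22s} beta-rate = {value:.2f}")
```

Each of the three changes pushes the beta-rate above its baseline value.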

How much power do we need?
Cohen (1988) observed a general tendency for psychologists to be preoccupied with avoiding Type I errors and insufficiently concerned with the possibility of Type II errors. Most tests had insufficient power. Power levels as low as 0.05 were not uncommon. A MINIMUM POWER OF 0.75 IS RECOMMENDED.

Effect size in the t-test
We obtained a difference between the Caffeine and Placebo means of (11.90 – 9.25) = 2.65 score points. If we take the spread of the scores to be the average of the Caffeine and Placebo SDs, we have an average SD of about 3.25 score points. So the means of the Caffeine and Placebo groups differ by about 0.8 SD.

Measuring effect size: Cohen’s d statistic
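Cohen’s d expresses the difference between the means in units of standard deviation:
$$d = \frac{M_1 - M_2}{s}$$
In our example, the value of Cohen’s d is 2.65/3.25 ≈ .8. Is this a ‘large’ difference?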

Levels of effect size
Jacob Cohen proposed that we regard a d of .2 as a SMALL effect size, a d of .5 as a MEDIUM effect size and a d of .8 as a LARGE effect size. So our experimental result is a ‘large’ effect.
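As a small worked example, Cohen's d for the caffeine data can be computed and labelled as follows. The function names are illustrative, and treating Cohen's three benchmark values as cut-off points is an assumption made here for convenience; Cohen offered them as rough guides, not strict boundaries.

```python
def cohens_d(m1, m2, sd1, sd2):
    """Difference between means in SD units, taking the average of the two SDs as the spread."""
    average_sd = (sd1 + sd2) / 2
    return (m1 - m2) / average_sd

def effect_label(d):
    """Label an effect size against Cohen's benchmarks of .2, .5 and .8."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

d = cohens_d(11.90, 9.25, 3.28, 3.16)
label = effect_label(d)
print(f"d = {d:.2f} ({label})")
```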

How many participants?
That depends upon the minimum effect size that you want to pick up with your significance tests. You also want to make sure your power is at least at the 0.75 level. You can obtain the number of participants necessary by looking up tables (Cohen, 1988; Clark-Carter, 2004).

Books with power tables
Clark-Carter, D. (2004). Quantitative psychological research: a student’s handbook (2nd ed.). Hove: Psychology Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Using the Web
The Web is a very useful source of up-to-date information on all statistical topics, including power and effect size. Use the Google search engine, searching for the phrases ‘statistical power’ and ‘effect size’.

Software
Software is available for finding the numbers of participants you will need to make a test at a specified level of power. An example is G*Power (Erdfelder et al., 1996). G*Power is available on the Web.

Useful information on the Web

The G*Power opening window

Using G*Power
You can choose the ‘a priori’ option to find out how many participants you would need to achieve a power of .95 of rejecting the null hypothesis for a medium effect on a one-tail test. You fill in the required values and G*Power supplies the answers. You would need 176 participants! That’s why we usually settle for a power level of 0.75 rather than 0.95!
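A back-of-envelope version of the 'a priori' calculation is sketched below. It is NOT the algorithm used by G*Power: it relies on a normal approximation, so its answers run slightly below the exact t-based figures. The function name `n_per_group` and its parameters are illustrative only.

```python
import math
from statistics import NormalDist

def n_per_group(d, power, alpha=0.05, tails=1):
    """Approximate participants needed per group to detect effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)  # critical value for the chosen number of tails
    z_power = z.inv_cdf(power)              # distance needed beyond the critical value
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Medium effect (d = .5), one-tail test at the .05 level.
n_95 = n_per_group(d=0.5, power=0.95)
n_75 = n_per_group(d=0.5, power=0.75)
print(f"power .95: {n_95} per group ({2 * n_95} in total)")
print(f"power .75: {n_75} per group ({2 * n_75} in total)")
```

The approximation gives a total of 174 for a power of .95, close to the 176 quoted above, and only about half as many for a power of .75.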

Directional hypotheses
The null hypothesis states simply that, in the population, the Caffeine and Placebo means are equal. H0 is refuted by a sufficiently large difference between the means in EITHER direction. But some argue that if our scientific (alternative) hypothesis is that Caffeine improves performance, we should be looking at differences in only ONE direction.

One-tail tests
Suppose we are only interested in the possibility of a difference in ONE DIRECTION. Some argue that we should locate the critical region for our test in one tail only of the t distribution. The critical value of t is now lower, because the whole of the 0.05 (5%) probability lies in one tail, instead of being split into 0.025 (2.5%) in each tail.

Another result
Suppose that the mean score of the Caffeine group had been, not 11.90, but 10.97. In other words, the Caffeine and Placebo means differ by only 1.72 points, rather than 2.65 points, as in the original example. The value of t is now 1.69, which is the critical value on a one-tail test.

Report of the one-tailed test
“The scores of the Caffeine group (M = 10.97; SD = 3.28) were significantly higher than those of the Placebo group (M = 9.25; SD = 3.16): t(38) = 1.69; p = .05 (one-tailed).”
Here 38 is the degrees of freedom, 1.69 is the value of t, and the p-value is expressed to two places of decimals.

Our t value of 1.69 would have failed to achieve significance on the two-tailed test, since the critical value was 2.02. On the one-tailed test, however, t is in the critical region and the null hypothesis is rejected.

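More power
In locating the entire critical region in the upper tail of the H0 distribution, we increase the light-grey area and reduce the dark-grey area. In other words, we increase the POWER of the test to reject H0.
[Figure: sampling distributions under H0 and H1, separated by the real difference, with the whole critical region in the upper tail]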

An unexpected result Now suppose that, against expectation, the Placebo group had outperformed the Caffeine group. The mean for the Caffeine group is 9.25 and that for the Placebo group is Were we to subtract the smaller mean from the larger (as in the traditional approach), we should obtain t = On a two-tailed test, this is in the critical region (p <.05) and we should reject the null hypothesis.

The null hypothesis

Direction of subtraction
The direction of subtraction is now crucial. You MUST subtract the Placebo mean from the Caffeine mean. Only a POSITIVE value of t can falsify the directional null hypothesis.
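The difference between the two decision rules can be sketched as follows. The critical values 1.686 (one-tailed) and 2.024 (two-tailed) are table values for 38 degrees of freedom at the .05 level. The value t = 1.69 is the one discussed earlier; t = –2.60 is a hypothetical value chosen to represent a large difference in the unexpected direction. The function name is illustrative.

```python
def rejects_h0(t, critical, one_tailed):
    """One-tailed: only a large POSITIVE t counts. Two-tailed: a large t of either sign counts."""
    if one_tailed:
        return t >= critical
    return abs(t) >= critical

ONE_TAILED_CRITICAL = 1.686  # t(38), alpha = .05, all in the upper tail
TWO_TAILED_CRITICAL = 2.024  # t(38), alpha = .05, split between the tails

# t is always (Caffeine mean - Placebo mean) divided by its standard error.
for label, t in [("expected direction", 1.69), ("unexpected direction", -2.60)]:
    one = rejects_h0(t, ONE_TAILED_CRITICAL, one_tailed=True)
    two = rejects_h0(t, TWO_TAILED_CRITICAL, one_tailed=False)
    print(f"{label:22s} t = {t:+.2f}  one-tailed: {one}  two-tailed: {two}")
```

The small positive value is significant only on the one-tailed test; the large negative value is significant only on the two-tailed test.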

One-sided p-value
In the one-tailed test, the p-value is ONE-SIDED, that is, it is the probability, under the one-sided null hypothesis, that the Caffeine mean would have been at least as LARGE as the one we obtained.

The one-sided p-value The one-sided p-value is the entire area under the curve to the right of your value of t. This includes the black area. That area is -2.02 Your value of t.

Report of the one-tail test
“The scores of the Caffeine group (M = 9.25; SD = 3.16) were not significantly higher than those of the Placebo group (M = 10.20; SD = 3.28): t(38) = ; p = (one-tailed) .”

Changing horses
Oh dear! I haven’t got significance! I won’t get my paper published! I know: I’ll make a two-tail test instead. There we are, the result is significant! Problem solved!

The true Type I error rate.
If you follow that practice, your true Type I error rate is now the black area (0.05) PLUS the area in the lower tail (0.025). This is 0.05 + 0.025 = 0.075, a level many feel is too high.
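The inflated error rate can be checked by simulation. In this sketch the null hypothesis is true; the researcher first makes a one-tailed test and, if that fails, switches to a two-tailed test. The population values (mean 10, SD 3) are arbitrary assumptions, and 1.686 and 2.024 are the tabled one-tailed and two-tailed .05 critical values for 38 degrees of freedom.

```python
import math
import random

def t_statistic(a, b):
    """Pooled-variance t for two independent samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def switching_error_rate(reps=20000, n=20, one_tail=1.686, two_tail=2.024, seed=7):
    """Type I error rate when a failed one-tailed test is replaced by a two-tailed test."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [rng.gauss(10, 3) for _ in range(n)]
        b = [rng.gauss(10, 3) for _ in range(n)]
        t = t_statistic(a, b)
        # Reject on the one-tailed test, or else 'change horses' and try two tails.
        if t > one_tail or abs(t) > two_tail:
            rejections += 1
    return rejections / reps

switch_rate = switching_error_rate()
print(f"True Type I error rate: {switch_rate:.3f}")
```

The simulated rate should be close to 0.075 rather than the nominal 0.05.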

One-tailed tests?
I have concerns about the making of one-tailed tests. There are two reasons for this.
1. You could find yourself unable to report a surprising result.
2. Many journal editors are suspicious of one-tail tests – and with good reason.

A special situation
Neuropsychologists often want to know whether a score is so far BELOW the norm that there is evidence for brain damage. They want to establish a cut-off point below which a score is deemed to be indicative of impairment. A one-tail test is appropriate there. But I think one-tail tests are seldom justified in RESEARCH, as opposed to clinical practice.

Advice
In Chapter 1 of Kinnear & Gray (2004, 2006), there is some advice on the choosing of an appropriate statistical test. Your choice depends upon three main considerations:
1. Your RESEARCH QUESTION
2. The plan or DESIGN of your research
3. The kind of DATA you have.

1. Five research questions

The first decision
Are you comparing averages on the SAME VARIABLE over different conditions or groups?
OR
Are you investigating a possible association between TWO measured variables?

The caffeine experiment

Making a comparison
There is only ONE measured variable: the dependent variable, Performance. We want to COMPARE the AVERAGE scores of the two groups.

The violence study
There are TWO variables, Actual violence and Exposure to screen violence. We are investigating a possible ASSOCIATION between them.

Comparison or association?

2. Design of the study
Two important aspects:
COMPLEXITY: how many conditions or groups were there? How many measured variables?
SAMPLING STRATEGY: was your experiment of between subjects or within subjects design?

Complexity

Complexity …

Sampling strategy

3. Your data
Two aspects are important:
Their LEVEL OF MEASUREMENT
Their DISTRIBUTION

Levels of measurement
SCALE data are measures on an independent scale with units: heights, weights, IQs.
ORDINAL data are ranks.
NOMINAL data are assignments to categories: so-many males, so-many females.

Implications of level of measurement
If you have nominal data, you can’t run a t-test or an ANOVA. If you have two or more sets of ranks, you can’t run a Pearson correlation. Other statistics are available for those purposes.

The distribution
Many statistical techniques such as t-tests and ANOVA assume normality of distribution. There may be additional requirements such as homogeneity of variance. If the data violate the assumptions of the model underlying the test, the p-value of the test statistic may be inaccurate.

Decision charts
There is no safe, automated way of choosing a statistical test correctly. The various charts in Chapter 1 of Kinnear & Gray (2004, 2006) are intended to be GUIDELINES, rather than prescriptions. There has been considerable controversy about the correct choice in some situations. Here is one such chart.

Decision chart for choosing a test of the difference between means, medians, or frequencies

‘Well-behaved’ data
You have run an experiment of simple, two-group between subjects design. At the bottom of the chart you see listed the independent-samples t-test and the Mann-Whitney test, which is a nonparametric equivalent. If your data conform to t-test requirements, the t-test is preferred: it has more POWER than the Mann-Whitney test.

Nonconforming data
But suppose your data violate the requirements of the t-test. Suppose the distributions are far from normal and there appears to be marked heterogeneity of variance. Some would say you should run a Mann-Whitney test instead of a t-test. Others, however, would urge you to clean up your data and run the version of the t-test that doesn’t make the assumption of homogeneity of variance.
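The version of the t-test that does not assume homogeneity of variance is commonly known as Welch's test; that it is the version meant here is an assumption. As a minimal sketch, it can be computed from summary statistics like this (the function name is illustrative, and the caffeine figures are used only as example input):

```python
import math

def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Separate-variance (Welch) t and its adjusted degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite adjustment: df falls as the variances or group sizes diverge.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

welch_value, welch_df = welch_t(11.90, 3.28, 20, 9.25, 3.16, 20)
print(f"t({welch_df:.1f}) = {welch_value:.2f}")
```

With equal group sizes and similar SDs, as in the caffeine data, the result is almost identical to the ordinary t-test; the adjustment matters when the variances and group sizes differ markedly.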

Example
[Figure: example showing the Type I error rate and the Type II error rate]

Appendix 1 EQUIVALENCE OF ANOVA AND THE t-TEST IN THE SPECIAL CASE OF TWO GROUPS OR CONDITIONS

The two-group case
In the first lecture, I revised the independent-samples t-test and illustrated the procedure with the data from the caffeine experiment. Returning to the caffeine experiment, what would happen if, instead of making a t test, we were to run an ANOVA to test the null hypothesis of equality of the means?

The two-group case: comparison of ANOVA with the t-test
Observe that F = t². Observe also that the p-value is the same for both tests. The ANOVA and the independent-samples t test are EXACTLY EQUIVALENT and produce the same decision about the null hypothesis.

Equivalence of the F and t tests

The t statistic
In the present example (where n1 = n2), the pooled estimate s² of σ² is simply the mean of the variance estimates from the two samples.

In the two-group case …
The pooled variance estimate s² in the t statistic is identical with the pooled variance estimate MS_within in the one-way ANOVA. When we specify a t statistic, we put the degrees of freedom in brackets, as in t(38). In the present comparative context, this is t(df_within).

Equivalence of F and t in the two-group case
When there are only two groups, the value of F is the square of the value of t. So if t is significant, then so is F and vice versa.
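The equivalence can be verified numerically. The scores below are hypothetical, invented purely for illustration (they are not the caffeine data), and the function name is illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def t_and_f(a, b):
    """Independent-samples t and one-way ANOVA F for the same two groups."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    grand_mean = mean(a + b)
    ss_within = sum((x - m1) ** 2 for x in a) + sum((x - m2) ** 2 for x in b)
    ss_between = n1 * (m1 - grand_mean) ** 2 + n2 * (m2 - grand_mean) ** 2
    ms_within = ss_within / (n1 + n2 - 2)  # identical to the pooled variance estimate
    ms_between = ss_between / 1            # df between = number of groups - 1 = 1
    f = ms_between / ms_within
    t = (m1 - m2) / math.sqrt(ms_within * (1 / n1 + 1 / n2))
    return t, f

caffeine = [12, 15, 9, 14, 11, 13]  # hypothetical scores
placebo = [8, 10, 7, 12, 9, 11]     # hypothetical scores
t_demo, f_demo = t_and_f(caffeine, placebo)
print(f"t = {t_demo:.4f}, t squared = {t_demo ** 2:.4f}, F = {f_demo:.4f}")
```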

Explanation

The related-samples t-test
The same equivalence holds with the within subjects ANOVA and the related-samples t-test.
