# This afternoon’s programme


This afternoon’s programme
2.05 – A short talk. 3.00 – 3.20 A break for coffee. 3.20 – Running tests with PASW Statistics 17.

SESSION 2 Further topics in the analysis of variance

Only the starting-point
In Monday’s session, I revised the one-way ANOVA. We saw that merely obtaining a significant F and therefore rejecting the null hypothesis of equality of the means leaves many questions unanswered. Therefore, the ANOVA is just the first step in the complete analysis of a set of data.

Does ‘significant’ mean ‘substantial’?
The F test produced a significant result. The null hypothesis of equality of the five treatment means must be rejected. With large numbers of observations, however, a statistical test can have too much POWER to reject the null hypothesis. Even tiny differences among the means will result in a significant F, with a minuscule p-value. Modern Internet studies can yield millions of observations, so the possibility of having too much data is no longer remote. ‘Significant’ does not necessarily mean ‘substantial’.

Measuring effect size

Breakdown (or partition) of the total sum of squares

Eta squared
The oldest measure of effect size is suggested by the partition of the total sum of squares. In this measure, the between groups sum of squares is expressed as a PROPORTION of the total sum of squares.

Eta squared (where eta is the CORRELATION RATIO)
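In symbols (the subscript labels are my own shorthand, since the slide's formula did not survive the transcript), the partition of the total sum of squares and the definition of eta squared are:

$$
SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}, \qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
$$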

Range of eta squared
Theoretically, eta squared can take values between zero (no differences among the means) and unity (all the scores within each group have the same value, so that there is no within-groups variability). In practice, its values will always lie somewhere between these limits.

Why is eta called the ‘correlation ratio’?
Suppose that opposite each of the 50 scores in the one-way drug experiment, we were to place its group mean. The correlation between the column of scores and the column of means gives the value of eta. Let’s demonstrate this.

The Aggregate procedure
In SPSS/PASW, the Aggregate procedure places opposite each score a value (such as the mean – but other statistics can be chosen) which summarises the scores in the group. The grouping variable (in this case Drug Condition) is specified as the BREAK VARIABLE. The participant’s score (the DV) is the VARIABLE TO BE SUMMARISED.
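The Aggregate-then-correlate demonstration can be mimicked outside PASW. The following is a minimal Python sketch, not the PASW procedure itself: `group_means_column` plays the role of Aggregate (group as break variable, mean as the summary), and the scores are hypothetical, not the drug-experiment data. It checks that the squared correlation between scores and group means equals the ratio of between-groups to total sums of squares.

```python
from math import sqrt
from statistics import fmean


def eta_squared(groups):
    """Between-groups SS as a proportion of the total SS."""
    scores = [x for g in groups for x in g]
    grand = fmean(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total


def group_means_column(groups):
    """Like Aggregate: place each group's mean opposite every score in it."""
    scores, means = [], []
    for g in groups:
        m = fmean(g)
        scores.extend(g)
        means.extend([m] * len(g))
    return scores, means


def pearson_r(xs, ys):
    """Ordinary Pearson correlation between two equal-length columns."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for three small groups.
groups = [[4, 6, 5, 7], [8, 9, 7, 10], [5, 6, 6, 7]]
scores, group_mean = group_means_column(groups)
r = pearson_r(scores, group_mean)
print(f"eta = {r:.4f}, eta^2 = {r ** 2:.4f}, SS ratio = {eta_squared(groups):.4f}")
```

The correlation can never be negative here, because the covariance between the scores and their group means reduces to the between-groups sum of squares.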

Finding the Aggregate command

The Aggregate dialog

The default name Score_Mean has been changed to Group_Mean. Such changes to default names can easily be made in Variable View.

Now correlate the means with the scores …

The Bivariate Correlations dialog

We obtain the value of eta

Eta squared again

Eta is the correlation between the scores and their group means
Eta is the correlation between the scores and their group means. The square of the correlation between the scores and the group means is eta squared.

In the population …

Other measures of effect size
Several other measures of effect size have been proposed. One of these, Cohen’s f, is used as input for G*Power, a useful package for answering questions about the numbers of participants you would need in a planned study to achieve sufficient power in your statistical tests.

Cohen’s f

Error variance as a proportion
If eta squared is the proportion of the total variance that is between groups, then 1 – eta squared is the proportion that is within groups, that is, error variance.

Cohen’s f and eta squared
So if we take the square root of eta squared divided by (1 – eta squared), we have Cohen’s f: $f = \sqrt{\eta^2 / (1 - \eta^2)}$
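Converting between the two measures is simple arithmetic, which is handy when an eta squared from a previous study has to be entered into G*Power as f. A small Python sketch (the function names are my own); the inverse follows by solving f² = η²/(1 − η²) for η².

```python
import math


def cohens_f(eta_sq):
    """Cohen's f from eta squared: sqrt(eta^2 / (1 - eta^2))."""
    if not 0 <= eta_sq < 1:
        raise ValueError("eta squared must lie in [0, 1)")
    return math.sqrt(eta_sq / (1 - eta_sq))


def eta_sq_from_f(f):
    """Inverse conversion: eta^2 = f^2 / (1 + f^2)."""
    return f * f / (1 + f * f)


f = cohens_f(0.2)          # 0.2 / 0.8 = 0.25, so f is (about) 0.5
print(f, eta_sq_from_f(f))  # the round trip recovers eta^2 (about 0.2)
```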

In our example,

Positive bias
Eta squared is positively biased as an estimate of effect size. Were the experiment to be repeated many times, the long run average or EXPECTED VALUE of eta squared would be higher than the population value.
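The bias is easy to see by simulation. This Python sketch assumes normally distributed scores and a design like the drug experiment (five groups of ten); the population means are all equal, so the population eta squared is zero. The average sample eta squared nevertheless comes out well above zero, close to the value (k − 1)/(N − 1) = 4/49 that normal theory predicts under the null hypothesis. The population mean of 50 and SD of 10 are arbitrary choices.

```python
import random
from statistics import fmean


def eta_squared(groups):
    scores = [x for g in groups for x in g]
    grand = fmean(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total


def simulate_null(k=5, n=10, reps=2000, seed=1):
    """Repeat a k-group experiment in which H0 is TRUE and average eta^2."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        groups = [[rng.gauss(50, 10) for _ in range(n)] for _ in range(k)]
        values.append(eta_squared(groups))
    return fmean(values)


avg = simulate_null()
print(f"average eta^2 under H0: {avg:.3f}  (population value: 0)")
print(f"normal-theory expectation (k-1)/(N-1): {4 / 49:.3f}")
```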

Omega squared
For some ANOVA designs, a statistic called omega squared, which corrects for this positive bias, is available. But the application of omega squared is problematic in complex designs with repeated measures factors. PASW Statistics 17 does not offer omega squared.
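Because PASW does not report it, omega squared has to be computed by hand. As a sketch for the one-way between subjects case only, a commonly used formula is ω² = (SS_between − (k − 1)MS_within) / (SS_total + MS_within). The Python below applies it to hypothetical data; it is not intended for designs with repeated measures factors.

```python
from statistics import fmean


def eta_and_omega_squared(groups):
    """One-way between subjects design: return (eta^2, omega^2)."""
    k = len(groups)
    scores = [x for g in groups for x in g]
    grand = fmean(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ms_within = (ss_total - ss_between) / (len(scores) - k)
    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


groups = [[4, 6, 5, 7], [8, 9, 7, 10], [5, 6, 6, 7]]  # hypothetical data
eta_sq, omega_sq = eta_and_omega_squared(groups)
print(f"eta^2 = {eta_sq:.3f}, omega^2 = {omega_sq:.3f}")
```

The bias-corrected value is the smaller of the two, as expected.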

Interpreting values of eta squared and Cohen’s f

Multiple comparisons
When there are three or more groups, the rejection of the null hypothesis leaves many important questions unanswered. Is the mean for the Placebo group significantly different from that of the Drug D group? Is it significantly different from the Drug C group?

Planned contrasts
On Monday, I discussed the making of specific PLANNED comparisons, simple and complex, among the individual treatment means.

Unplanned or ‘post hoc’ tests
Often, however, after we have the results of an experiment, we shall want to do some DATA-SNOOPING – i.e., run unplanned statistical tests of differences among the individual treatment means. Such unplanned tests are known (solecistically) as POST HOC tests. The following points apply both to planned and unplanned comparisons.

Statistical testing again.

The critical region
A small arbitrary probability (usually .05) known as the SIGNIFICANCE LEVEL is fixed in advance. The CRITICAL REGION is a range of values such that, assuming that the null hypothesis is true, the probability that t will fall inside the range is less than or equal to the fixed significance level.

Critical region of t distribution

Type I errors
If the significance level α is set at 0.05, any p-value less than 0.05 will result in the rejection of the null hypothesis (H0). If H0 is true, it will be wrongly rejected on 5% of occasions with repeated sampling. A false rejection of the null hypothesis is known as a ‘Type I’ error, and the significance level is therefore also known as the ‘Type I’ or ‘alpha’ error rate.

Type I error rate per comparison
Suppose we gather our data and test the differences among pairs of means for significance. We make several of these tests, setting the significance level at .05 each time and rejecting the null hypothesis whenever the p-value is less than .05. This significance level (.05) is known as the Type I error rate PER COMPARISON.

An array of ten treatment means
Suppose, to make my next point more strongly, I have an array of ten treatment means and want to make comparisons between, say, the Placebo mean and each of nine drug means. I set the type I error rate per comparison at .05. Suppose the null hypothesis is true – in the population, the means all have the same value. In the population, the profile is a pancake.

The type I error rate per family
What is the probability that AT LEAST ONE test will show significance, even if the null hypothesis is true? This is known as the PER FAMILY or FAMILYWISE type I error rate. The older term EXPERIMENTWISE has a similar meaning.

The familywise type I error rate is unacceptably high!
Were I to make 9 INDEPENDENT tests of differences among the 10 means, the familywise type I error rate would be nearly .4 – FORTY PER CENT! (The probability of at least one type I error is 1 minus the probability of none.)

Capitalising upon chance
With a large array of treatment means, we might decide to make a large number of comparisons. Even if the null hypothesis is true, the familywise Type I error rate might be 0.90 or even higher! Failure to take the heightened probability of the familywise Type I error into account when making sets of comparisons is known as CAPITALISING UPON CHANCE.

Familywise type I error rate with unplanned pairwise comparisons
Suppose we decide to make every possible pairwise comparison. Assume, for simplicity, that the comparisons are independent. The number of possible pairings from ten means is 45. If the per comparison error rate is fixed at .05, the familywise type I error rate is in the region of $1 - 0.95^{45} \approx .90$.
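The familywise figures quoted above follow from one line of arithmetic: under the simplifying assumption that the tests are independent, the probability of at least one Type I error is 1 minus the probability of none, that is, 1 − (1 − α)^c for c tests. A short Python check for the nine Placebo comparisons and the 45 pairwise comparisons among ten means:

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one Type I error) among n INDEPENDENT tests when H0 is true."""
    return 1 - (1 - alpha) ** n_tests


for label, n_tests in [("Placebo vs each of 9 drugs", 9),
                       ("all pairs of 10 means", comb(10, 2))]:
    print(f"{label}: {n_tests} tests, familywise rate = "
          f"{familywise_error_rate(n_tests):.3f}")
```

In real data the comparisons share means and an error term, so they are not independent; the formula is an approximation that illustrates how quickly the familywise rate grows.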

Conservative tests
A CONSERVATIVE TEST adjusts the p-value per comparison upwards in order to control the familywise Type I error rate. This is equivalent to setting the per comparison significance level at a lower value than the traditional significance level. There are many different approaches to the making of conservative tests to avoid capitalising upon chance.

The Bonferroni correction
The Bonferroni correction was originally applied to PLANNED comparisons. You plan to make k contrasts among a set of means. You want to keep the per family Type I error rate at the .05 level approximately. You achieve this by multiplying the obtained p-value for each value of t by k.
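The correction is simple enough to sketch in a few lines of Python. The p-values below are hypothetical. Capping a corrected value at 1 is my addition (a probability cannot exceed 1), and the second function shows the equivalent approach of testing each comparison at α/k.

```python
def bonferroni(p_values, k=None):
    """Multiply each p-value by the number of comparisons k, capping at 1."""
    k = len(p_values) if k is None else k
    return [min(1.0, p * k) for p in p_values]


def per_comparison_alpha(k, familywise_alpha=0.05):
    """Equivalent approach: test each comparison at alpha / k."""
    return familywise_alpha / k


# Hypothetical uncorrected p-values for four planned contrasts.
print(bonferroni([0.30, 0.006, 0.04, 0.50]))
# Per comparison level if all 45 pairs among ten means are tested.
print(per_comparison_alpha(45))
```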

Example Returning to our drug experiment, we planned to make four simple contrasts. We want to control the familywise Type I error rate and keep it at the .05 level.

Results of the four simple contrasts
Double-click to get into the editor and use Cell properties to display more places of decimals for the second contrast.

Applying the Bonferroni correction
Multiply the given p-value by 4, the number of planned simple contrasts. Report the Bonferroni-corrected p-values rather than the values given in the table, and state that they have been Bonferroni-corrected. For the second contrast (upper half of the table), the corrected p-value is .024, so write: “t(45) = 2.88; p = .024 (Bonferroni-corrected)”.

Bonferroni correction for post hoc comparisons
You must assume that ALL POSSIBLE pairwise comparisons will be made. If you have ten treatment means, there are 45 possible pairs. So the p-value for each test must be multiplied by 45. Equivalently, the per comparison significance level must be set at .05/45 ≈ .0011. That’s a tough criterion for significance.

Selection of post hoc tests

Which one?
The Bonferroni is the most conservative of these tests. With a large array of means it’s almost impossible to get anything significant. For between subjects experiments, the Tukey test is preferred. (The Tukey B test is less conservative.) The LSD (least significant difference) test makes no correction, but the test is made only if the ANOVA F value is significant. The Dunnett test is the most powerful conservative test, but it is suitable only when you are comparing the mean of the controls with each of the other treatment means, that is, when you are making simple comparisons. The Scheffe test is good for unplanned complex comparisons.

Factorial experiments

Factorial experiments
In a FACTORIAL experiment, there are two or more treatment factors. The ANOVA really comes into its own when it is applied to the analysis of data from factorial experiments.

Types of ANOVA design
The three most common types of factorial ANOVA design are: BETWEEN SUBJECTS FACTORIAL designs, in which ALL factors are between subjects; WITHIN SUBJECTS FACTORIAL designs, in which ALL factors are within subjects; and MIXED FACTORIAL designs, in which SOME factors are between subjects and some are within subjects.

A two-factor between subjects factorial experiment
Suppose that a researcher has been commissioned to investigate the effects upon simulated driving performance of two new anti-hay fever drugs, A and B. It is suspected that at least one of the drugs may have different effects upon fresh and tired drivers, and the firm developing the drugs needs to ensure that neither drug has an adverse effect upon driving performance in any circumstances. The researcher decides to carry out a two-factor between subjects factorial experiment, in which the factors are: Drug Treatment, with levels Placebo, Drug A and Drug B; Alertness, with levels Fresh and Tired.

The experimental design

Main effects and interactions
A factor is said to have a MAIN EFFECT if, in the population, there are differences among the means at its different levels, ignoring any other factors in the design. A main effect is indicated by differences among the MARGINAL means. In factorial experiments, interest usually centres not on main effects, but on the interplay among the treatment factors, that is, upon INTERACTIONS.

Some terms for a two-way table of means

Observations
Main effects are evident in the MARGINAL TOTALS.
Not surprisingly, the Fresh participants outperformed the Tired participants. It looks as if the Alertness factor has a main effect. Performance was higher in the Drug B group, suggesting a main effect of the Drug Treatment factor as well. But the CELL means are the main focus of interest, because certain patterns in those indicate the presence of an INTERACTION.

Profile plots
You find that the F test for an interaction is significant. What does this mean? The next step is to examine the appropriate profile plots. More than one plot is possible: your choice depends upon which factor is of principal interest.

Three drug profiles

Nonparallel profiles
In neither plot are the profiles parallel.
In the first plot, the factor of Alertness seems to reverse its effect at different levels of the Drug factor. In fact, Drug A actually depressed the performance of the Fresh participants. In the second plot, the ordering of the means at the three levels of the Drug factor changes from level to level of the Alertness factor. Nonparallelism of the profiles indicates the presence of an interaction.

Simple main effects
A factor is said to have a SIMPLE MAIN EFFECT when there are differences among its means at a specific level of another factor. In the first profile plot, the Alertness factor would seem to have simple main effects at all three levels of the Drug factor.

Interaction
A two-factor INTERACTION is said to occur when the simple main effects of one factor are not homogeneous across all levels of the other factor. The simple main effects of the Alertness factor are not the same across all levels of the Drug factor. It would appear that a two-factor interaction may be present in these data.
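To make marginal means, simple main effects and nonparallelism concrete, here is a Python sketch using hypothetical cell means chosen to resemble the pattern described above (they are not the data from the slides), with equal numbers of participants per cell so that a marginal mean is just the average of its cell means. Unequal simple effects of Alertness correspond to nonparallel profiles; whether the interaction is significant still requires the F test.

```python
from statistics import fmean

# Hypothetical cell means (equal n per cell), NOT the data from the slides.
cell_means = {
    ("Placebo", "Fresh"): 20.0, ("Placebo", "Tired"): 15.0,
    ("Drug A", "Fresh"): 14.0, ("Drug A", "Tired"): 16.0,
    ("Drug B", "Fresh"): 24.0, ("Drug B", "Tired"): 17.0,
}
drugs = ["Placebo", "Drug A", "Drug B"]
states = ["Fresh", "Tired"]

# Marginal means: average the cell means over the other factor.
drug_marginals = {d: fmean(cell_means[d, s] for s in states) for d in drugs}
state_marginals = {s: fmean(cell_means[d, s] for d in drugs) for s in states}

# Simple effect of Alertness at each level of Drug: Fresh minus Tired.
simple_effects = {d: cell_means[d, "Fresh"] - cell_means[d, "Tired"]
                  for d in drugs}

print("Drug marginal means:", drug_marginals)
print("Alertness marginal means:", state_marginals)
print("Fresh - Tired at each drug level:", simple_effects)
# Equal simple effects would mean parallel profiles (no interaction pattern).
print("Profiles parallel?", len(set(simple_effects.values())) == 1)
```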

Partition of the total sum of squares in the two-way ANOVA
A and B are the two treatment factors, and AB is their interaction.

Three F tests
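For a between subjects design with a levels of factor A, b levels of factor B and n participants in every cell (the notation is mine), the partition and the three F tests can be written as follows; each F uses the within-cells (error) mean square as its denominator:

$$
SS_{\text{total}} = SS_A + SS_B + SS_{AB} + SS_{\text{within}}
$$

$$
F_A = \frac{MS_A}{MS_{\text{within}}}, \qquad
F_B = \frac{MS_B}{MS_{\text{within}}}, \qquad
F_{AB} = \frac{MS_{AB}}{MS_{\text{within}}}
$$

with degrees of freedom $a - 1$, $b - 1$ and $(a - 1)(b - 1)$ for the numerators and $ab(n - 1)$ for the error term.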

Two-way ANOVA summary table

Measuring effect size in factorial experiments

Effect size in factorial experiments
This is a controversial area. The measure known as COMPLETE ETA SQUARED expresses the contribution of a source (whether a main effect or an interaction) to the total variance in the presence of all other treatment or group sources. The measure known as PARTIAL ETA SQUARED excludes all other treatment or group sources.

Complete eta squared
Expresses the variance attributable to a source in terms of the TOTAL variance.

Example of calculation of complete eta squared

Partial eta squared
Expresses the variance of a source as a proportion of the source variance, plus error. The other sources are omitted.
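The two definitions differ only in their denominators, which the following Python sketch makes explicit. The sums of squares are hypothetical (not the slides' summary table) and assume an equal-n between subjects design, so that the four sources add up to the total.

```python
def complete_eta_squared(ss_source, ss_total):
    """Source SS as a proportion of the TOTAL SS."""
    return ss_source / ss_total


def partial_eta_squared(ss_source, ss_error):
    """Source SS as a proportion of (source SS + error SS) only."""
    return ss_source / (ss_source + ss_error)


# Hypothetical two-way summary table (sums of squares).
ss = {"A": 10.0, "B": 20.0, "AB": 8.0, "Error": 62.0}
ss_total = sum(ss.values())

for source in ("A", "B", "AB"):
    complete = complete_eta_squared(ss[source], ss_total)
    partial = partial_eta_squared(ss[source], ss["Error"])
    print(f"{source}: complete = {complete:.3f}, partial = {partial:.3f}")
```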

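The value of partial eta squared (.139) is greater than that of complete eta squared (.08). Because partial eta squared leaves the other treatment sources out of its denominator, it can never be smaller than complete eta squared.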

Coffee break

When the interaction is significant
We shall often want to ‘unpack’ a significant interaction by testing for simple main effects and making multiple comparisons among the individual treatment means.

Simple effects with PASW
Simple effects are not an option in the ANOVA dialog windows. It is easy to run simple effects in PASW, but we must use SYNTAX to do so. A small problem is that we must use, not the ANOVA syntax command, but the command for what is known as Multivariate Analysis of Variance or MANOVA. But it really is VERY EASY to do this.

Multivariate analysis of variance (MANOVA)
In the ANOVA, there is just ONE dependent variable. Multivariate Analysis of Variance (MANOVA) is a generalisation of the ANOVA to the analysis of data from experiments of ANOVA design with two or more DVs. We can, therefore, regard the ANOVA as a special case of the MANOVA.

Using MANOVA to run ANOVA
If there is only one DV, running the MANOVA procedure will run a univariate ANOVA and produce the usual ANOVA summary table. Why bother? Tests for simple effects are options in the MANOVA command. They are not available in other PASW commands.

Open the data set

Get into the Syntax Editor

The PASW syntax editor

The basic MANOVA command
This time, you will have to do some writing.

Write in the MANOVA command

Check the active dataset

The two-way ANOVA summary table

The /ERROR and /DESIGN subcommands for simple effects of Drug
You will need two subcommands: /ERROR and /DESIGN.
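The MANOVA syntax does the work in PASW, but the arithmetic behind a simple main effect can be cross-checked by hand. The following Python sketch reflects my understanding of what the simple main effects of Drug at each level of Alertness involve: a one-way F among the Drug means at that level, with the within-cells mean square pooled over the whole design as the error term (my reading of `/ERROR = WITHIN`). The function name and raw scores are hypothetical.

```python
from statistics import fmean


def simple_main_effects(cells, factor_levels, at_levels):
    """F for the simple main effect of one factor at each level of another.

    cells maps (factor_level, at_level) -> list of scores. The error term is
    the within-cells mean square pooled over ALL cells of the design.
    """
    all_cells = list(cells.values())
    ss_within = sum(sum((x - fmean(c)) ** 2 for x in c) for c in all_cells)
    df_within = sum(len(c) for c in all_cells) - len(all_cells)
    ms_within = ss_within / df_within

    results = {}
    for b in at_levels:
        groups = [cells[a, b] for a in factor_levels]
        grand = fmean([x for g in groups for x in g])
        ss_effect = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
        df_effect = len(factor_levels) - 1
        results[b] = (ss_effect / df_effect / ms_within, df_effect, df_within)
    return results


# Hypothetical raw scores, three per cell (not the slides' data).
cells = {
    ("Placebo", "Fresh"): [19, 20, 21], ("Placebo", "Tired"): [14, 15, 16],
    ("Drug A", "Fresh"): [13, 14, 15], ("Drug A", "Tired"): [15, 16, 17],
    ("Drug B", "Fresh"): [23, 24, 25], ("Drug B", "Tired"): [16, 17, 18],
}
results = simple_main_effects(cells, ["Placebo", "Drug A", "Drug B"],
                              ["Fresh", "Tired"])
for level, (f, df1, df2) in results.items():
    print(f"Drug at {level}: F({df1}, {df2}) = {f:.2f}")
```

Since there are two simple main effects here, each F would be judged against a Bonferroni-adjusted level of .025 rather than .05.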

Output for simple effects analysis

The need for multiple comparisons

The need for a smaller comparison ‘family’
An interaction is significant. We want to make unplanned or post hoc multiple comparisons among the treatment means. But there may be many cells in the design, so that the critical difference for significance may be impossibly large. In terms of the Bonferroni test, you could be multiplying the p-value by a large factor or setting the per comparison significance level at a tiny value. We need to justify making the comparisons among a smaller array of means.

First, we test for simple main effects
We might argue that if the Drug factor has a significant simple main effect at one level of Body State or Alertness (say, the Fresh level), we can define the comparison family as the means at the Fresh level only. This will produce a less conservative test. When testing for simple main effects, however, we should use the Bonferroni correction to control the familywise Type I error rate. In our example, since there are two simple main effects, the criterion for significance should be that p is less than 0.025 (that is, 0.05/2), rather than 0.05.

Reduce the data set
There is more than one way of making the multiple comparisons. You can easily run a one-way ANOVA on the scores of the Fresh participants only, then ask for a Tukey test.

Select cases

Select the data from the Fresh participants only

Choose Tukey multiple comparisons

The results

Summary
A report of an ANOVA F test should be accompanied by a measure of effect size, such as eta squared, omega squared or Cohen’s f. Follow Lisa DeBruine’s guidelines. Beware of capitalising upon chance: follow-up tests should be conservative. When unpacking significant interactions, use syntax to test for simple main effects. A significant simple main effect can justify a smaller comparison ‘family’.

Further reading
For a thorough and readable coverage of elementary (and not so elementary) statistics, I recommend … Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA: Thomson/Wadsworth.

For PASW/SPSS:
Kinnear, P. R., & Gray, C. D. (2009). PASW 17 for Windows Made Simple. Hove and New York: Psychology Press. In addition to practical advice about using PASW Statistics 17, we also offer informal explanations of many of the techniques.