4 Statistics and parameters
We use Roman letters to denote STATISTICS, that is, characteristics of the sample; we use Greek letters to denote PARAMETERS, that is, characteristics of the population.
5 The null hypothesis
The null hypothesis (H0) states that, in the population, the Caffeine and Placebo means have the same value.
H0: μ1 = μ2
6 The alternative hypothesis
The alternative hypothesis (H1) states that the Caffeine and Placebo means are NOT equal.
7 The test statistic
To test the null hypothesis, we shall need a TEST STATISTIC, in this case one which reflects the size of the difference between the sample means M1 and M2. The t statistic fits the bill.
8 The value of t
We do not know the supposedly constant population variance σ². Our estimate of σ² is the POOLED VARIANCE ESTIMATE s².
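In standard notation, the t statistic and the pooled variance estimate can be written as follows. This is the usual textbook form for the independent-samples t test, offered as a sketch; the symbols s1² and s2² for the variance estimates from the two samples are assumed here.

```latex
t = \frac{M_1 - M_2}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},
\qquad
s^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```

When n1 = n2, the second expression reduces to the simple mean of the two variance estimates.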
9 Sampling distribution of t
There are many t distributions. To test H0, we must locate our value of t in the appropriate t distribution. That distribution is specified by the DEGREES OF FREEDOM df, which is given by df = n1 + n2 – 2. In our example, df = 20 + 20 – 2 = 38.
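As a rough check, the value of t can be recomputed from the summary statistics quoted in the lecture's report of the result. This is a minimal sketch: the group sizes of 20 are an assumption, inferred from df = 38 with equal groups, and the variable names are illustrative only.

```python
import math

# Summary statistics quoted in the lecture's report of the result.
m1, sd1 = 11.90, 3.28   # Caffeine group
m2, sd2 = 9.25, 3.16    # Placebo group
n1 = n2 = 20            # ASSUMPTION: equal groups of 20, inferred from df = 38

df = n1 + n2 - 2
# Pooled variance: the two variance estimates weighted by their degrees of freedom.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
# Standard error of the difference between the two sample means.
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se_diff

print(f"df = {df}, pooled variance = {pooled_var:.3f}, t = {t:.2f}")
```

Under these assumptions the result agrees with the reported t(38) = 2.60.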
10 Appearance of a t distribution
A t distribution is very like the standard normal distribution. They are both centred on zero. The greater the difference between the two means, the further out in either tail your value of t will be. If the null hypothesis is true, the value of (M1 – M2) and therefore that of t will usually be small. A large value of t is taken to be evidence against the null hypothesis.
[Figure: the standard normal distribution, Z ~ N(0, 1), and a t distribution, t(2), each with an area of .95 marked.]
11 Significance
To make a test of SIGNIFICANCE, we locate a CRITICAL region of values in the t distribution with 38 degrees of freedom. A SIGNIFICANCE LEVEL or ALPHA-LEVEL is a small probability α, such as 0.05 or 0.01, fixed by convention. In psychology, the 0.05 level is generally accepted. We locate the critical region in the tails of the t distribution, so that the probability of a value in EITHER one tail OR the other is α. In our case, α = 0.05. The probability of a value in one PARTICULAR tail is α/2 = 0.025.
12 The critical region
We shall reject the null hypothesis if the value of t falls within EITHER tail of the t distribution on 38 degrees of freedom. To be significant beyond the .05 level, our value of t must be greater than +2.02 OR less than –2.02. (To be significant beyond the .01 level, our value of t must be greater than +2.704 or less than –2.704.)
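Critical values like these are normally read from tables or software. The sketch below finds them numerically using only the Python standard library: it integrates the t density with Simpson's rule and then searches by bisection. The function names are illustrative, not taken from any package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the interval [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_t(alpha, df, tails=2):
    """Positive t beyond which one tail holds alpha/tails, found by bisection."""
    target = alpha / tails
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > target:
            lo = mid   # tail still too large: move outwards
        else:
            hi = mid
    return (lo + hi) / 2

for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: reject H0 if |t| > {critical_t(alpha, 38):.3f}")
```

Values computed for exactly 38 degrees of freedom can differ in the second or third decimal place from printed tables, which often list only nearby rows such as df = 40.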
13 The p-value
The TOTAL blue area is the probability, under the null hypothesis, of getting a value of t at least as extreme as your value. If the p-value is less than .05, you are in the critical region, and your value of t is significant beyond the .05 level.
[Figure: t distribution with the tail areas shaded blue; label: Pr of a value at least as small as yours.]
14 The sign of t
The sign of t is IRRELEVANT. If t is negative, it’s in the lower tail; if it’s positive, it’s in the upper tail. In either case, the p-value is the TOTAL blue area, because an extreme value in EITHER direction is evidence against the null hypothesis.
15 Direction of subtraction in the t test
The value of t is based upon the difference between means M1 – M2. If the first mean is larger than the second, the difference (and t) will be positive. When, in the t-test procedure, you complete the Define Groups dialog, the mean entered second will be subtracted from the mean entered first. We entered the values in the order 1, 0, ensuring that the Placebo mean would be subtracted from the Caffeine mean.
16 Avoiding the negative sign
By entering the Caffeine mean first, we ensure that the sign of t is positive. Had the Placebo mean been larger, the sign of t could still have been kept positive by entering that mean first when defining the groups in the t-test dialog. That’s fine – a difference in EITHER direction can lead to rejection of the null hypothesis.
17 Result of the t test
The p-value of t = 2.60 is .01 (to 2 places of decimals). Our t test has shown significance beyond the .05 level. But the p-value is greater than .01, so the result, although significant beyond the .05 level, is not significant beyond the .01 level.
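The two-sided p-value for t(38) = 2.60 can be approximated by adding up the area in both tails. The following is a sketch using numerical integration from the standard library alone; in practice the statistics package reports this value directly, and the function names here are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Total area in both tails beyond |t| (Simpson's rule on [0, |t|])."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3   # area between 0 and |t|
    return 2 * (0.5 - central_half)

p = two_sided_p(2.60, 38)
print(f"t(38) = 2.60, two-sided p = {p:.4f}")
print("beyond .05 level:", p < 0.05, "| beyond .01 level:", p < 0.01)
```

Because the sign of t is dropped before integrating, `two_sided_p(-2.60, 38)` gives the same answer.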
18 Your report
“The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p = .01.”
In t(38) = 2.60, the figure in brackets is the degrees of freedom and 2.60 is the value of t. The p-value is expressed to two places of decimals.
19 Type I error
Suppose the null hypothesis is true. If you keep sampling a large number of times, every now and again (in 5% of samples), you will get a value of t in one of the tail areas (the critical region) and reject the null hypothesis. You will have made a TYPE I ERROR. The probability of a Type I error is the significance level, which is also termed the ALPHA-LEVEL.
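The idea of sampling repeatedly when the null hypothesis is true can be tried out in a small simulation. This is a sketch under stated assumptions: both groups are drawn from the same normal population (mean 10 and SD 3 are arbitrary choices), there are 20 scores per group, and the critical value 2.02 is the one given in the lecture.

```python
import math
import random
import statistics

def pooled_t(x, y):
    """Independent-samples t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * statistics.variance(x)
              + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

random.seed(1)                      # fixed seed so the run is repeatable
n_sims, n_per_group, critical = 2000, 20, 2.02
rejections = 0
for _ in range(n_sims):
    # H0 is true: both samples come from the same population.
    a = [random.gauss(10, 3) for _ in range(n_per_group)]
    b = [random.gauss(10, 3) for _ in range(n_per_group)]
    if abs(pooled_t(a, b)) > critical:
        rejections += 1             # a Type I error

type1_rate = rejections / n_sims
print(f"Proportion of false rejections: {type1_rate:.3f}")
```

The proportion should settle close to 0.05 as the number of simulated experiments grows.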
20 Type II error
A Type II error is said to occur when a test fails to show significance even though the null hypothesis is FALSE. The probability of a Type II error is symbolised as β, and is also known as the TYPE II ERROR RATE or BETA RATE.
21 The beta rate
The light-grey area is part of the critical region. Any value of t outside the tail of the H0 distribution is insignificant. The dark area is the probability that the null hypothesis will be accepted, even though it is false. This is the BETA RATE.
[Figure: H0 and H1 sampling distributions; labels: α/2, Real difference.]
22 Power
The POWER of a statistical test is the probability that the null hypothesis will be rejected, given that it is FALSE.
23 Power
The power of the statistical test is the area under the H1 distribution to the right of the dark-grey area.
[Figure: H0 and H1 sampling distributions; labels: α/2, Real difference, This is the POWER of the test.]
24 Power and the beta rate
Since the entire area under either curve is 1, the area in the H1 distribution to the right of the dark area (the power) is 1 – β.
[Figure: H0 and H1 sampling distributions; labels: α/2, Real difference.]
25 Increasing the power
By REDUCING the beta-rate, you INCREASE the power of the test, because beta-rate + power = 1.
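The relation between power and the beta-rate can be illustrated numerically. The sketch below uses a normal approximation to the sampling distributions rather than the exact (noncentral t) calculation, so its figures are rough. The standardized difference of 0.8 SD and the 20 participants per group are assumptions chosen for illustration.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-tailed power for a standardized mean difference d.

    Normal approximation: the test statistic is treated as normal with
    mean delta under H1 and mean 0 under H0.
    """
    z = NormalDist()
    delta = d * (n_per_group / 2) ** 0.5     # distance between the H0 and H1 curves
    z_crit = z.inv_cdf(1 - alpha / 2)        # edge of the critical region
    # Area of the H1 curve falling in either tail of the critical region.
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)

power = approx_power(0.8, 20)
beta = 1 - power
print(f"power = {power:.2f}, beta = {beta:.2f}, sum = {power + beta:.2f}")
```

When d is zero the two curves coincide and the function returns alpha, which is the Type I error rate.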
26 Type I and Type II errors: power
Type I error: Pr = α
Type II error: Pr = β
27 A change in emphasis
Traditionally, the emphasis had been on controlling Type I errors. This is achieved by insisting upon statistical significance. Even more control would be achieved by adopting the 0.01 significance level, rather than the 0.05 level. So why not fix alpha at 0.01, rather than 0.05?
28 Significance level and the beta-rate
Suppose you decide upon a smaller significance level (lower figure). The probability of a Type II error (green) increases. The power (P) decreases.
[Figure: two panels, each marking the beta-rate β and the power P.]
29 The need to strike a balance
Adopting the 0.01 level reduces the Type I error rate. But that INCREASES the Type II error rate and REDUCES the power of the test. It is now considered that, in the past, there was insufficient concern with high beta-rates and low power. The 0.05 level is thought to achieve the right BALANCE between Type I and Type II errors.
30 Power and sample size
Increasing the sample size reduces the overlap of the sampling distributions under H0 and H1. The beta-rate is reduced and so the power increases.
[Figure: sampling distributions for small samples and for large samples; label: Real difference.]
31 Small samples
Small samples bring the distributions together, increase the size of the dark-grey area and reduce the power of the test. Large samples separate the distributions, reduce the dark-grey area and increase the power of the test.
[Figure: H0 and H1 sampling distributions; labels: α/2, Real difference.]
32 Factors affecting the Type II error rate
An insufficiency of data (too few participants) means that the sampling distributions under H0 and under H1 overlap too much and most of the H1 distribution lies below the critical value for rejection of H0. A similar effect arises from unreliable data, which inflate random variation and increase the overlap. Finally, as we have seen, fixing a lower significance level (0.01 rather than 0.05) also increases the beta-rate.
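The three factors just listed can be seen side by side in a small table of approximate power. As before, this is a sketch based on a normal approximation, not an exact power calculation; the effect sizes and sample sizes are illustrative assumptions.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-tailed power (normal approximation)."""
    z = NormalDist()
    delta = d * (n_per_group / 2) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)

# Factors 1 and 3: number of participants and significance level.
print("n per group   alpha=.05   alpha=.01")
for n in (10, 20, 40, 80):
    print(f"{n:>11}   {approx_power(0.8, n, 0.05):9.2f}   {approx_power(0.8, n, 0.01):9.2f}")

# Factor 2: unreliable data inflate the SD, which shrinks the
# standardized difference d for the same raw difference in means.
print(f"d = 0.8, n = 20: power = {approx_power(0.8, 20):.2f}")
print(f"d = 0.5, n = 20: power = {approx_power(0.5, 20):.2f}")
```

Reading down a column shows the gain from more participants; reading across a row shows the cost of moving from the 0.05 to the 0.01 level.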
33 How much power do we need?
Cohen (1988) observed a general tendency for psychologists to be preoccupied with avoiding Type I errors and insufficiently concerned with the possibility of Type II errors. Most tests had insufficient power. Power levels as low as 0.05 were not uncommon. A MINIMUM POWER OF 0.75 IS RECOMMENDED.
34 Effect size in the t-test
We obtained a difference between the Caffeine and Placebo means of (11.90 – 9.25) = 2.65 score points. If we take the spread of the scores to be the average of the Caffeine and Placebo SDs, we have an average SD of about 3.22 score points. So the means of the Caffeine and Placebo groups differ by about 0.8 SD.
35 Measuring effect size: Cohen’s d statistic
Cohen’s d is the difference between the means divided by the standard deviation. In our example, the value of Cohen’s d is 2.65/3.22 = .82, or about .8. Is this a ‘large’ difference?
36 Levels of effect size
Jacob Cohen proposed that we regard a d of .2 as a SMALL effect size, a d of .5 as a MEDIUM effect size and a d of .8 as a LARGE effect size. So our experimental result is a ‘large’ effect.
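The effect-size arithmetic can be written out as a short sketch. The averaging of the two SDs follows the lecture's description. Treating Cohen's three benchmark values as cut-off points is an assumption made here for illustration; the function names are hypothetical.

```python
def cohens_d(m1, m2, sd1, sd2):
    """Difference between means in SD units, using the average of the two SDs."""
    return (m1 - m2) / ((sd1 + sd2) / 2)

def effect_label(d):
    """Label d against Cohen's benchmarks (.2 small, .5 medium, .8 large).

    ASSUMPTION: the benchmarks are used as lower cut-offs for each label.
    """
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

d = cohens_d(11.90, 9.25, 3.28, 3.16)   # Caffeine vs Placebo
print(f"d = {d:.2f} ({effect_label(d)})")
```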
37 How many participants?
That depends upon the minimum effect size that you want to pick up with your significance tests. You also want to make sure your power is at least at the 0.75 level. You can obtain the number of participants necessary by looking up tables (Cohen, 1988; Clark-Carter, 2004).
38 Books with power tables
Clark-Carter, D. (2004). Quantitative psychological research: a student’s handbook (2nd ed.). Hove: Psychology Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
39 Using the Web
The Web is a very useful source of up-to-date information on all statistical topics, including power and effect size. Use the Google search engine, with the phrases ‘statistical power’ and ‘effect size’.
40 Software
Software is available for finding the numbers of participants you will need to make a test at a specified level of power. An example is G*Power (Erdfelder et al., 1996). G*Power is available on the Web.
44 Using G*Power
You can choose the ‘a priori’ option to find out how many participants you would need to achieve a power of .95 of rejecting the null hypothesis for a medium effect on a one-tail test. You would need 176 participants! That’s why we usually settle for a power level of 0.75 rather than 0.95!
[Figure: G*Power screenshot, showing the values you fill in and the answers.]
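A quick approximation to this kind of a priori calculation is sketched below. It is not G*Power's method: it uses the textbook normal-approximation formula, n per group ≈ 2((z_alpha + z_power)/d)², so it tends to come out a little below the exact figure. The function name and defaults are illustrative.

```python
import math
from statistics import NormalDist

def approx_total_n(d, power, alpha=0.05, tails=1):
    """Approximate total N (two equal groups) needed to detect effect size d.

    Normal approximation only; exact software will usually ask for
    slightly more participants.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / tails)   # critical value of the test
    z_power = z.inv_cdf(power)               # quantile matching the target power
    n_per_group = math.ceil(2 * ((z_alpha + z_power) / d) ** 2)
    return 2 * n_per_group

print("medium effect, power .95:", approx_total_n(0.5, 0.95))
print("medium effect, power .75:", approx_total_n(0.5, 0.75))
```

For a medium effect (d = .5), a one-tail test and power .95, the approximation gives a total of about 174, close to the 176 quoted above; lowering the target power to .75 roughly halves the requirement.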
45 Directional hypotheses
The null hypothesis states simply that, in the population, the Caffeine and Placebo means are equal. H0 is refuted by a sufficiently large difference between the means in EITHER direction. But some argue that if our scientific (alternative) hypothesis is that Caffeine improves performance, we should be looking at differences in only ONE direction.
46 One-tail tests
Suppose we are only interested in the possibility of a difference in ONE DIRECTION. Some argue that we should locate the critical region for our test in one tail only of the t distribution. The critical value of t is now lower.
[Figure: two-tailed test with 0.025 (2.5%) in each tail; one-tailed test with 0.05 (5%) in one tail.]
47 Another result
Suppose that the mean score of the Caffeine group had been, not 11.90, but 10.97. In other words, the Caffeine and Placebo means differ by only 1.72 points, rather than 2.65 points, as in the original example. The value of t is now 1.69, which is the critical value on a one-tail test.
48 Report of the one-tailed test
“The scores of the Caffeine group (M = 10.97; SD = 3.28) were significantly higher than those of the Placebo group (M = 9.25; SD = 3.16): t(38) = 1.69; p = .05 (one-tailed).”
In t(38) = 1.69, the figure in brackets is the degrees of freedom and 1.69 is the value of t. The p-value is expressed to two places of decimals.
49 Advantage of the one-tailed test
Our t value of 1.69 would have failed to achieve significance on the two-tailed test, since the critical value was 2.02. On the one-tailed test, however, t is in the critical region and the null hypothesis is rejected.
50 More power
In locating the entire critical region in the upper tail of the H0 distribution, we increase the light-grey area and reduce the dark-grey area. In other words, we increase the POWER of the test to reject H0.
[Figure: H0 and H1 sampling distributions; label: Real difference.]
51 An unexpected result
Now suppose that, against expectation, the Placebo group had outperformed the Caffeine group. The mean for the Caffeine group is 9.25 and that for the Placebo group is …
Were we to subtract the smaller mean from the larger (as in the traditional approach), we should obtain t = …
On a two-tailed test, this is in the critical region (p < .05) and we should reject the null hypothesis.
53 Direction of subtraction
The direction of subtraction is now crucial. You MUST subtract the Placebo mean from the Caffeine mean. Only a POSITIVE value of t can falsify the directional null hypothesis.
54 One-sided p-value
In the one-tailed test, the p-value is ONE-SIDED, that is, it is the probability, under the one-sided null hypothesis, that the Caffeine mean would have been at least as LARGE as the one we obtained.
55 The one-sided p-value
The one-sided p-value is the entire area under the curve to the right of your value of t. This includes the black area. That area is …
[Figure: t distribution; labels: –2.02, Your value of t.]
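The one-sided p-value, as the area to the right of t, can be approximated numerically. The sketch below uses the value t = 1.69 from the earlier "Another result" slide and its mirror image, –1.69, as illustrative inputs; the integration is done with Simpson's rule from the standard library, and the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def one_sided_p(t, df, steps=2000):
    """P(T >= t): the whole area under the t curve to the right of t."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = 0.5 - total * h / 3          # area to the right of |t|
    # For a negative t, the area to the right includes most of the curve.
    return upper if t >= 0 else 1 - upper

p_pos = one_sided_p(1.69, 38)
p_neg = one_sided_p(-1.69, 38)
print(f"t = +1.69: one-sided p = {p_pos:.3f}")
print(f"t = -1.69: one-sided p = {p_neg:.3f}")
```

A negative t therefore gives a one-sided p-value above one half, however large the difference is in absolute terms.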
56 Report of the one-tail test
“The scores of the Caffeine group (M = 9.25; SD = 3.16) were not significantly higher than those of the Placebo group (M = 10.20; SD = 3.28): t(38) = …; p = … (one-tailed).”
57 Changing horses
Oh dear! I haven’t got significance! I won’t get my paper published! I know: I’ll make a two-tail test instead. There we are, the result is significant! Problem solved!
58 The true Type I error rate
If you follow that practice, your true Type I error rate is now the black area (0.05) PLUS the area in the lower tail (0.025). This is 0.05 + 0.025 = 0.075, a level many feel is too high.
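The inflated error rate can be checked by simulating the practice of switching tests after the event. This sketch assumes the null hypothesis is true, 20 scores per group drawn from one normal population (mean 10 and SD 3 are arbitrary), and the critical values 1.69 (one-tailed) and 2.02 (two-tailed) from the lecture.

```python
import math
import random
import statistics

def pooled_t(x, y):
    """Independent-samples t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * statistics.variance(x)
              + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

random.seed(2)                          # fixed seed so the run is repeatable
n_sims = 4000
false_alarms = 0
for _ in range(n_sims):
    caffeine = [random.gauss(10, 3) for _ in range(20)]
    placebo = [random.gauss(10, 3) for _ in range(20)]
    t = pooled_t(caffeine, placebo)
    one_tail_hit = t > 1.69             # the planned one-tailed test
    rescue_hit = abs(t) > 2.02          # the two-tailed test tried afterwards
    if one_tail_hit or rescue_hit:
        false_alarms += 1

rate = false_alarms / n_sims
print(f"Type I error rate when changing horses: {rate:.3f}")
```

The simulated rate should fall near 0.075 rather than the nominal 0.05.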
59 One-tailed tests?
I have concerns about the making of one-tailed tests. There are two reasons for this. You could find yourself unable to report a surprising result. Many journal editors are suspicious of one-tail tests – and with good reason.
60 A special situation
Neuropsychologists often want to know whether a score is so far BELOW the norm that there is evidence for brain damage. They want to establish a cut-off point below which a score is deemed to be indicative of impairment. A one-tail test is appropriate there. But I think one-tail tests are seldom justified in RESEARCH, as opposed to clinical practice.
62 Advice
In Chapter 1 of Kinnear & Gray (2004, 2006), there is some advice on the choosing of an appropriate statistical test. Your choice depends upon three main considerations:
1. Your RESEARCH QUESTION
2. The plan or DESIGN of your research
3. The kind of DATA you have.
69 2. Design of the study
Two important aspects:
COMPLEXITY: how many conditions or groups were there? How many measured variables?
SAMPLING STRATEGY: was your experiment of between subjects or within subjects design?
73 3. Your data
Two aspects are important:
Their LEVEL OF MEASUREMENT
Their DISTRIBUTION
74 Levels of measurement
SCALE data are measures on an independent scale with units: heights, weights, IQs.
ORDINAL data are ranks.
NOMINAL data are assignments to categories: so-many males, so-many females.
75 Implications of level of measurement
If you have nominal data, you can’t run a t-test or an ANOVA. If you have two or more sets of ranks, you can’t run a Pearson correlation. Other statistics are available for those purposes.
76 The distribution
Many statistical techniques such as t-tests and ANOVA assume normality of distribution. There may be additional requirements such as homogeneity of variance. If the data violate the assumptions of the model underlying the test, the p-value of the test statistic may be inaccurate.
77 Decision charts
There is no safe, automated way of choosing a statistical test correctly. The various charts in Chapter 1 of Kinnear & Gray (2004, 2006) are intended to be GUIDELINES, rather than prescriptions. There has been considerable controversy about the correct choice in some situations. Here is one such chart.
78 Decision chart for choosing a test of the difference between means, medians, or frequencies
79 ‘Well-behaved’ data
You have run an experiment of simple, two-group between subjects design. At the bottom of the chart you see listed the independent-samples t-test and the Mann-Whitney test, which is a nonparametric equivalent. If your data conform to t-test requirements, the t-test is preferred: it has more POWER than the Mann-Whitney test.
80 Nonconforming data
But suppose your data violate the requirements of the t-test. Suppose the distributions are far from normal and there appears to be marked heterogeneity of variance. Some would say you should run a Mann-Whitney test instead of a t-test. Others, however, would urge you to clean up your data and run the version of the t-test that doesn’t make the assumption of homogeneity of variance.
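The version of the t-test that drops the homogeneity-of-variance assumption is commonly known as Welch's test. A minimal sketch from summary statistics follows; it uses separate variance estimates and the Welch-Satterthwaite approximation for the degrees of freedom. The group sizes of 20 are an assumption inferred from df = 38, and the function name is illustrative.

```python
import math

def welch_t(m1, sd1, n1, m2, sd2, n2):
    """t statistic and approximate df without pooling the variances."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2        # squared standard error of each mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Caffeine vs Placebo summary statistics; n = 20 per group is ASSUMED.
t, df = welch_t(11.90, 3.28, 20, 9.25, 3.16, 20)
print(f"Welch t = {t:.2f} on about {df:.1f} degrees of freedom")
```

With equal group sizes and similar SDs, as here, the result is almost the same as the pooled test; the two versions diverge when the variances and group sizes differ markedly.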
81 Example
[Figure labels: Type II error rate; Type I error rate; Type II error rate.]
82 Appendix 1 EQUIVALENCE OF ANOVA AND THE t-TEST IN THE SPECIAL CASE OF TWO GROUPS OR CONDITIONS
83 The two-group case
In the first lecture, I revised the independent-samples t-test and illustrated the procedure with the data from the caffeine experiment. Returning to the caffeine experiment, what would happen if, instead of making a t test, we were to run an ANOVA to test the null hypothesis of equality of the means?
84 The two-group case: comparison of ANOVA with the t-test
Observe that F = t². Observe also that the p-value is the same for both tests. The ANOVA and the independent-samples t test are EXACTLY EQUIVALENT and produce the same decision about the null hypothesis.
86 The t statistic
In the present example (where n1 = n2), the pooled estimate s² of σ² is simply the mean of the variance estimates from the two samples.
87 In the two-group case …
The pooled variance estimate s² in the t statistic is identical with the pooled variance estimate MSwithin in the one-way ANOVA. When we specify a t statistic, we put the degrees of freedom in brackets, as in t(38). In the present comparative context, this is t(dfwithin).
88 Equivalence of F and t in the two-group case
When there are only two groups, the value of F is the square of the value of t. So if t is significant, then so is F and vice versa.
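The equivalence can be confirmed on any two-group data set. The sketch below uses small made-up scores (hypothetical, not the caffeine data) and computes both statistics from first principles with the standard library.

```python
import math
import statistics

# HYPOTHETICAL scores, invented purely for illustration.
caffeine = [12, 9, 14, 11, 15, 10]
placebo = [8, 10, 7, 11, 9, 6]

def pooled_t(x, y):
    """Independent-samples t statistic with a pooled variance estimate."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * statistics.variance(x)
              + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def one_way_f(groups):
    """F ratio of a one-way ANOVA: MS between groups over MS within groups."""
    scores = [s for g in groups for s in g]
    grand_mean = statistics.fmean(scores)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((s - statistics.fmean(g)) ** 2 for g in groups for s in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(scores) - len(groups))   # the pooled variance estimate
    return ms_between / ms_within

t = pooled_t(caffeine, placebo)
F = one_way_f([caffeine, placebo])
print(f"t = {t:.4f}, t squared = {t**2:.4f}, F = {F:.4f}")
```

The two printed values, t squared and F, agree to rounding error, and MS within equals the pooled variance estimate used in t.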