1 SESSION 1 Basic statistics: points and pitfalls
2 Use simple graphics Some of the fancier graphs (such as PIE CHARTS) are actually very difficult to read. A BAR GRAPH is often clearer. But even bar graphs can be over-embellished. The introduction of varying colours and textures and three-dimensional effects can make it difficult to compare the heights of the bars. KEEP IT SIMPLE! USE 2-DIMENSIONAL BAR GRAPHS. DON'T HAVE DIFFERENT FILLERS IN ONE BAR.
3 A useful bar chart This is a CLUSTERED bar chart. It shows clearly that the gender effect works in opposite directions in the Placebo and Drug groups. The error bars assure the reader that the variance is uniform across conditions.
4 A 3-D bar graph of the same data This is a CLUTTERED bar chart. The 3-D effect doesn't help at all. It's much more difficult to compare the heights of the bars. Special effects often distract attention from the important features of the data.
5 An experiment An investigator runs an experiment in which the skilled performance of four groups of participants who have ingested four different supposedly performance- enhancing drugs is compared with that of a control or Placebo group. When the data have been collected, the researcher enters them into SPSS and orders a one-way ANOVA.
6 Ordering a means plot
7 A picture of the results
8 The picture is false! The table of means shows minuscule differences among the five group means. The p-value of F is very high – unity to two places of decimals.
9 A small scale view Only a microscopically small section of the scale is shown on the vertical axis. This greatly magnifies even small differences among the group means.
10 Putting things right Double-click on the image to get into the Graph Editor. Double-click on the vertical axis to access the scale specifications. Click here
11 Putting things right … Uncheck the minimum value box and enter zero as the desired minimum point. Click Apply. Amend entry
12 The true picture!
13 The true picture … The effect is dramatic. The profile is now as flat as a pancake. The graph now accurately depicts the results. Always be suspicious of graphs that do not show the complete vertical scale.
14 Small samples Sometimes our data are scarcer than we would wish. This can create problems for the making of statistical tests such as the t-test, the chi-square test and ANOVA.
15 Nominal data A NOMINAL data set consists of records of membership of the categories making up qualitative variables, such as gender or blood group. Nominal data must be distinguished from SCALAR, CONTINUOUS or INTERVAL data, which are measurements of quantitative variables on an independent scale with units. Some people think that the analysis of nominal data is easy …
16 A small set of nominal data A medical researcher wishes to test the hypothesis that people with a certain type of body tissue (Critical) are more likely to show the presence of a potentially harmful antibody. Data are obtained on 19 people, who are classified with respect to 2 attributes: 1. Tissue Type; 2. Whether the antibody is present or absent.
17 Contingency tables When we wish to investigate whether an association exists between qualitative or categorical variables, the starting point is usually a display known as a CONTINGENCY TABLE, whose rows and columns represent the categories of the qualitative variables we are studying.
18 A contingency table Is there an association between Tissue Type and Presence of the antibody? The antibody is indeed more in evidence in the Critical tissue group. High incidence in Critical category
19 Result of the chi-square test How disappointing! It looks as if we haven't demonstrated a significant association. Under the column headed Asymp. Sig. is the p-value, which is given as .060.
20 Sampling distributions Because of sampling variability, the values of the statistics we calculate from our data would vary were the data-gathering exercise to be repeated. The distribution of a statistic is known as its SAMPLING DISTRIBUTION. Test statistics such as t, F and chi-square have known sampling distributions. You must know the sampling distribution of any statistic to produce an accurate p-value.
21 The familiar chi-square formula
22 The true definition of chi-square The familiar formula is not the defining formula for chi-square. Chi-square is NOT defined in the context of nominal data, but in terms of a normally distributed continuous variable X thus…
23 True definition of chi-square The true chi-square random variable is defined as the square of a standard normal variate Z. Normally distributed variables are CONTINUOUS, that is, there is an infinite number of values between any two points on the scale.
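The definition on this slide can be checked numerically. The sketch below is only an illustration, assuming Python's standard library; the function name chi2_1df_cdf is made up for the example. It uses the fact that Z squared is no greater than x exactly when Z lies between minus and plus the square root of x.

```python
from math import sqrt
from statistics import NormalDist


def chi2_1df_cdf(x: float) -> float:
    """P(Z**2 <= x) for a standard normal Z: the chi-square CDF on 1 df."""
    if x <= 0:
        return 0.0
    # Z**2 <= x is the same event as -sqrt(x) <= Z <= sqrt(x).
    return 2.0 * NormalDist().cdf(sqrt(x)) - 1.0


# 1.96 cuts off the central 95% of Z, so 1.96 squared (about 3.84)
# should cut off the lower 95% of chi-square on one degree of freedom.
lower_area = chi2_1df_cdf(1.96 ** 2)
print(round(lower_area, 3))
```

This is why 3.84, the familiar .05 critical value of chi-square on one degree of freedom, is simply the square of 1.96.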
24 An approximation The familiar chi-square statistic is only APPROXIMATELY distributed as chi-square. The approximation is good, provided that the expected frequencies E are adequately large.
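A minimal sketch of the familiar statistic, the sum over cells of (O – E)²/E, together with a count of cells whose expected frequency falls below 5 (the usual rule of thumb behind low-frequency warnings). The counts are hypothetical and are NOT the data from the slides; the function name is an assumption made for illustration.

```python
def pearson_chi_square(table):
    """Return (chi-square statistic, expected frequencies) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of a cell = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return stat, expected


# Hypothetical counts (not the slide data): rows are groups, columns are Present / Absent.
hypothetical = [[8, 2], [3, 7]]
stat, expected = pearson_chi_square(hypothetical)
small_cells = sum(e < 5 for row in expected for e in row)
print(round(stat, 2), expected, small_cells)
```

With only 20 cases, two of the four expected frequencies here are 4.5, so the chi-square approximation would already be in doubt for this table.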
25 The meaning of Asymptotic The term ASYMPTOTIC denotes the limiting distribution of a statistic as the sample size approaches infinity. The asymptotic p-value of a statistic is its p-value under the assumption that the statistic has the limiting distribution. That assumption may be false.
26 Goodness of the approximation… In the SPSS output, the column headed Asymp. Sig. contains a p-value calculated on the assumption that the approximate chi-square statistic behaves like the real chi-square statistic. But underneath the table there is a warning about low frequencies, indicating that the asymptotic p-value cannot be relied upon.
27 Goodness of the approximation… Fortunately, there are available EXACT TESTS, which do not make the assumption that the approximation is good. There are the Fisher exact tests, designed by R. A. Fisher many years ago; and there are modern brute force methods requiring massive computation.
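To give a feel for what an exact test does, here is a hedged sketch of a two-sided Fisher exact test for a 2 × 2 table, using only the standard library. It is a simplification: the table in the slides may have more than two tissue groups, and the counts below are hypothetical. The probability of each possible table with the same margins comes from the hypergeometric distribution, and the p-value sums the probabilities of all tables that are no more likely than the one observed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Add up every table that is no more probable than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts, not the slide data.
p_exact = fisher_exact_two_sided(8, 2, 3, 7)
print(round(p_exact, 4))
```

No appeal is made to the chi-square distribution, so small expected frequencies are not a problem for this calculation.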
28 Exact tests on SPSS In Variable View, Name three variables and assign Values to the code numbers making up the various tissue groups. Always assign value labels to make the output comprehensible.
29 In Data View The third variable, Count, contains the frequencies of occurrence and non-occurrence of the antibody in the different groups. So a single line in Data View actually combines data from several cases.
30 Weighting the cases Select Weight Cases from the Data menu. Complete the Weight Cases dialog by transferring Count to the Frequency Variable slot. Click OK to weight the cases with frequencies
31 Selecting the chi-square test
32 Ordering an exact test Ask for the Presence values to appear in columns. Ask for the tissue types to appear in rows. Ask for Exact tests.
33 Choosing the statistics We want the chi-square statistic and a measure of associative strength. The chi-square statistic will not do as a measure of strength, because its value partially reflects the size of the nominal data set. Cramér's V will do. Back in the Crosstabs dialog, click Cells… and order percentages and expected frequencies.
34 Clustered bar chart Ask for a clustered bar chart. The pattern of the results is even more obvious than it is in the table. The antibody is much more in evidence in the critical group. I recommend that you always order such a bar chart when you are analysing nominal data sets.
35 A different result The exact test has shown that we do have evidence for an association between tissue type and incidence of the antibody. The exact p-value is markedly lower than the asymptotic value.
36 Strength of the association The chi-square statistic is unsuitable as a measure of the strength of an association, because it increases with the size of the data set. Having asked for exact tests, we obtain an exact p-value for phi and Cramér's V. Interpret either statistic as the COEFFICIENT OF DETERMINATION in a regression: the proportion of variance accounted for.
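A short sketch of why Cramér's V is preferred to chi-square as a measure of strength. It assumes the standard formula V = √(χ² / (N × min(rows – 1, columns – 1))); the numbers are hypothetical and the function name is made up for illustration.

```python
from math import sqrt


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic, the sample size and the table shape."""
    return sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical 2 x 2 table with chi-square = 5.05 from N = 20 cases.
v_small = cramers_v(5.05, 20, 2, 2)
# Doubling every cell count doubles both chi-square and N ...
v_doubled = cramers_v(10.10, 40, 2, 2)
# ... but leaves V unchanged.
print(round(v_small, 3), round(v_doubled, 3))
```

Chi-square doubles when the same pattern is observed in twice as many cases, whereas V stays put, which is what a measure of strength should do.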
37 Related samples There is to be a debate on a contentious issue. The debate is attended by 100 people. Each is asked by a researcher to write down whether they support the motion before and after hearing the debate. The researcher produces the following table.
38 The researcher's argument The researcher argues that we are looking for an association between Response (Yes or No) and Time (Before, After). There is an evident association: the proportion of Yeses is markedly higher after the debate. A chi-square test seems to confirm the association.
39 An error The total frequency is 200, whereas there were only 100 participants. A requirement for a chi-square test for an association is that each participant must contribute to the tally in only ONE cell of the contingency table.
40 Essential information The researcher must keep track of each participant throughout the operation. That permits the construction of the following table, which cannot be recovered from the previous table.
41 Another chi-square test We could run a chi-square test on this table, because each participant contributes to the tally in only one cell. But if you were to do that, there would be nothing to report, because there is no evidence of an association. Can we conclude that hearing the debate had no effect?
42 Wrong question We aren't interested in whether there is an association between the way people vote before and after the debate. We want to know whether hearing the debate has tended to CHANGE people's views in one direction.
43 The McNemar test The McNEMAR TEST only uses data on those participants who CHANGED THEIR VIEWS after hearing the debate. The null hypothesis is that, in the population, as many people change positively as change negatively. We note that while 13 people changed from Yes to No, 38 changed from No to Yes. That looks promising.
44 Finding the McNemar test The McNemar test is a test for goodness-of-fit, not a test for association. It's to be found in the Nonparametric Tests menu, under 2 Related Samples….
45 Ordering a McNemar test Transfer the two related variables (Before and After) to the right-hand panel. Check the McNemar box.
46 The results The McNemar test uses an approximate chi-square test of the null hypothesis that as many change from No to Yes as vice versa. The null hypothesis is rejected. NOW WE HAVE OUR EVIDENCE! We write: χ²(1) = 11.29; p < .01. (Note: this chi-square statistic has one degree of freedom.)
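The reported value can be reproduced from the two change counts alone (13 from Yes to No, 38 from No to Yes). The sketch below is an illustration under the assumption that the statistic is the usual McNemar chi-square, (|b – c| – 1)²/(b + c) with the continuity correction; the function name is made up. The p-value uses the fact that chi-square on one degree of freedom is the square of a standard normal variate.

```python
from math import sqrt
from statistics import NormalDist


def mcnemar(yes_to_no, no_to_yes, continuity=True):
    """McNemar chi-square (1 df) and its asymptotic p-value from the two change counts."""
    diff = abs(no_to_yes - yes_to_no)
    if continuity:
        diff = max(diff - 1, 0)  # correction for continuity
    stat = diff ** 2 / (yes_to_no + no_to_yes)
    # Upper tail of chi-square(1) = both tails of Z beyond sqrt(stat).
    p_value = 2 * (1 - NormalDist().cdf(sqrt(stat)))
    return stat, p_value


corrected, p_corrected = mcnemar(13, 38)
uncorrected, p_uncorrected = mcnemar(13, 38, continuity=False)
print(round(corrected, 2), round(uncorrected, 2), p_corrected < 0.01)
```

With the correction the statistic comes to 576/51, which rounds to the 11.29 on the slide; without it the value is 625/51, or about 12.25.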
47 Correction for continuity When there are only two groups (change to Yes, change to No), the chi-square statistic can only vary in discrete jumps, rather than continuously. The CORRECTION FOR CONTINUITY (Yates) attempts to improve the approximation to a real chi-square variable, which is continuous. There has been much controversy about whether the correction for continuity is necessary.
48 Bernoulli trials Is a coin fair: that is, in a long series of tosses, is the proportion of heads ½ ? We toss the coin 100 times, obtaining 80 heads. Arguably we have a set of 100 BERNOULLI TRIALS. A Bernoulli trial is one with outcomes that can be divided into success and failure, with probabilities p and 1 – p, respectively, which remain fixed over n independent trials.
49 Binomial probability model Where you have Bernoulli trials, you can test the null hypothesis that the probability of a success is any specified proportion by applying the BINOMIAL PROBABILITY MODEL, which is the basis of the BINOMIAL TEST. In our example, 51 people changed their vote. Does that refute the null hypothesis that p = ½ ?
50 The binomial test There is no need to settle for an approximate test. Set up your data like this. Choose Data → Weight Cases and weight by Frequency. Choose Analyze → Nonparametric Tests → Binomial and complete the dialog.
51 Completing the binomial dialog Click on the Exact button
52 The results The p-value is very small. We have strong evidence that hearing the debate tended to change people's views in the Yes direction more than in the No direction.
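A hedged sketch of the exact binomial calculation, using only the standard library. It assumes the null hypothesis p = ½, under which the distribution is symmetric and the two-sided p-value is twice the upper-tail probability. The counts come from the slides (38 of the 51 changers moved to Yes; 80 heads in 100 tosses); the function name is made up for the example.

```python
from math import comb


def binomial_two_sided_p(successes, n):
    """Exact two-sided p-value for H0: p = 1/2 with the given number of successes."""
    k = max(successes, n - successes)  # distance from n/2 is what matters
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


p_debate = binomial_two_sided_p(38, 51)   # 38 of 51 changers moved to Yes
p_coin = binomial_two_sided_p(80, 100)    # 80 heads in 100 tosses
print(p_debate, p_coin)
```

Both probabilities are far below .01, in line with the very small p-value reported for the debate data.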
53 The caffeine experiment
54 Notational convention We use Roman letters to denote the statistics of the sample; we use Greek letters to denote PARAMETERS, that is, characteristics of the population.
55 The null hypothesis The null hypothesis (H0) states that, in the population, the Caffeine and Placebo means have the same value. H0: μ1 = μ2
56 The alternative hypothesis The alternative hypothesis (H1) states that the Caffeine and Placebo means are not equal. H1: μ1 ≠ μ2
57 Independent samples The Caffeine experiment yielded two sets of scores - one set from the Caffeine group, the other from the Placebo group. There is NO BASIS FOR PAIRING THE SCORES. We have INDEPENDENT SAMPLES. We shall make an INDEPENDENT-SAMPLES t test.
58 The t statistic In the present example (where n1 = n2), the pooled estimate s² of σ² is simply the mean of the variance estimates from the two samples.
59 The value of t We do not know the supposedly constant population variance σ². Our estimate of σ² is the pooled variance estimate.
60 Significance Is the value 2.6 significant? Have we evidence against the null hypothesis? To answer that question, we need to locate the obtained value of t in the appropriate SAMPLING DISTRIBUTION of t. That distribution is identified by the value of the parameter known as the DEGREES OF FREEDOM.
61 Degrees of freedom The df is the total number of scores minus two. In our example, df = 40 – 2 = 38. WHETHER A GIVEN VALUE OF t IS SIGNIFICANT WILL DEPEND UPON THE VALUE OF THE DEGREES OF FREEDOM.
62 Appearance of a t distribution A t distribution is very like the standard normal distribution. The difference is that a t distribution has thicker tails. In other words, large values of t are more likely than large values of Z. With large df, the two distributions are very similar. [Figure: curves of Z ~ N(0, 1) and t(2), centred on 0, with an area of .95 marked.]
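The thicker tails can be put into numbers. This sketch compares the probability of a value beyond ±1.96 under the standard normal distribution and under t on 2 degrees of freedom. It relies on the closed-form CDF of t(2), F(t) = ½ + t / (2√(2 + t²)), which holds only for 2 df; the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def t2_cdf(t):
    """CDF of Student's t on 2 degrees of freedom (closed form, valid for 2 df only)."""
    return 0.5 + t / (2 * sqrt(2 + t * t))


cut = 1.96
z_tails = 2 * (1 - NormalDist().cdf(cut))  # area beyond +/-1.96 under Z
t_tails = 2 * (1 - t2_cdf(cut))            # area beyond +/-1.96 under t(2)
print(round(z_tails, 3), round(t_tails, 3))
```

Under Z the two tails hold about 5% of the area; under t(2) they hold nearly 19%, which is why a larger critical value is needed when the df is small.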
63 The critical region We shall reject the null hypothesis if the value of t falls within EITHER tail of the t distribution on 38 degrees of freedom. To be significant beyond the .05 level, our value of t must be EITHER greater than +2.02 OR less than –2.02. To be significant beyond the .01 level, our value of t must be EITHER greater than +2.71 OR smaller than –2.71.
64 The two-tailed p-value The p-value is the probability, ASSUMING THAT THE NULL HYPOTHESIS IS TRUE, of obtaining a value of the test statistic at least as EXTREME (under the null hypothesis) as the one you obtained. If the p-value is less than .05, you are in the critical region, and your value of t is significant beyond the .05 level. [Figure: t distribution centred on 0, marking the probability of a value at least as small as yours.]
65 Result of the t test The p-value of t = 2.6 is .01 (to 2 places of decimals). Our t test has shown significance beyond the .05 level. But the p-value is greater than .01, so the result, although significant beyond the .05 level, is not significant beyond the .01 level.
66 Your report The scores of the Caffeine group (M = 11.90; SD = 3.28) were significantly higher than those of the Placebo group (M = 9.25; SD = 3.16): t(38) = 2.60; p = .01. Here 38 is the degrees of freedom, 2.60 is the value of t, and the p-value is expressed to two places of decimals.
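The reported t can be recovered from the summary statistics alone. The sketch below is an illustration, not the SPSS procedure itself. It assumes 20 participants per group, a figure inferred from df = 38 with equal group sizes rather than stated on the slides; the function name is made up.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t and its df from group means, SDs and sizes."""
    df = n1 + n2 - 2
    # Pooled variance: the df-weighted mean of the two variance estimates.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Caffeine vs Placebo; n = 20 per group is an assumption inferred from df = 38.
t_value, df = pooled_t(11.90, 3.28, 20, 9.25, 3.16, 20)
print(round(t_value, 2), df)
```

With equal group sizes the pooled variance is simply the mean of the two variance estimates, and the result agrees with the t(38) = 2.60 in the report.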
67 Directional hypotheses The null hypothesis states simply that, in the population, the Caffeine and Placebo means are equal. H0 is refuted by a sufficiently large difference between the means in EITHER direction. But some argue that if our scientific hypothesis is that Caffeine improves performance, we should be looking at differences in only ONE direction.
68 One-tail tests Suppose you are only interested in the possibility of a difference in ONE DIRECTION. You would locate the critical region for your test in one tail of the distribution. [Figure: a two-tailed critical region of .025 (2.5%) in each tail, compared with a one-tailed critical region of .05 (5%) in one tail.]
69 A strong case for a one-tailed test Neuropsychologists often want to know whether a score is so far below the norm that there is evidence for brain damage. Arguably, a one-tailed test is justified in that case: it seems unlikely that brain damage could actually IMPROVE performance. But note that, on a one-tail test, you only need a t-value of about 1.7 for significance, rather than a value of around 2.0 for a two-tail test. (The exact values depend upon the value of the df.) So, on a one-tail test, it's easier to get significance IN THE DIRECTION YOU EXPECT. For this reason, many researchers (and journal editors) are suspicious of one-tail tests.
70 A substantial difference? We obtained a difference between the Caffeine and Placebo means of (11.90 – 9.25) = 2.65 score points. If we take the spread of the scores to be the average of the Caffeine and Placebo SDs, we have an average SD of about 3.22 score points. So the means of the Caffeine and Placebo groups differ by about .8 SD.
71 Measuring effect size: Cohen's d statistic In our example, the value of Cohen's d is 2.65/3.22 ≈ .8. Is this a large difference?
72 Levels of effect size On the basis of scrutiny of a large number of studies, Jacob Cohen proposed that we regard a d of.2 as a SMALL effect size, a d of.5 as a MEDIUM effect size and a d of.8 as a LARGE effect size. So our experimental result is a large effect. When you report the results of a statistical test, you are now expected to provide a measure of the size of the effect you are reporting.
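A small sketch of the effect-size calculation, following the slides in dividing the difference between the means by the average of the two SDs. Treating Cohen's benchmarks (.2, .5, .8) as lower cut-offs for the labels is an assumption made for this illustration, as are the function names.

```python
def cohens_d(mean1, mean2, sd1, sd2):
    """Difference between the means in units of the average SD."""
    return (mean1 - mean2) / ((sd1 + sd2) / 2)


def cohen_label(d):
    """Label an effect using Cohen's benchmarks as lower cut-offs (an assumption)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


d = cohens_d(11.90, 9.25, 3.28, 3.16)
label = cohen_label(d)
print(round(d, 2), label)
```

For the caffeine data, d comes out a little above .8, so the effect falls in the large band.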
73 Type I error Suppose the null hypothesis is true. If you keep sampling a large number of times, every now and again (in 5% of samples), you will get a value of t in one of the tail areas (the critical region) and reject the null hypothesis. You will have made a TYPE I ERROR.
74 An objection to one-tailed tests Suppose you are prepared to report significance on a one-tailed test: a t-value of 1.7 will do. You find a difference in THE OPPOSITE DIRECTION (e.g., the Placebo mean is higher). The difference would have been significant on a two-tail test. If you would then report this result, your true Type I error rate is actually .05 + .025 = .075. If you would report an unexpected result that is significant on a one-tail test, i.e., t is less than the 5th percentile of the t distribution, your Type I error rate is 2 × .05 = .1 or 10%. Many feel that this is much too high.
75 Type II error A Type II error is said to occur when a test fails to show significance when the null hypothesis is ACTUALLY FALSE. The probability of a Type II error is symbolised as β. Since a probability is also a rate of occurrence, β is also known as the TYPE II ERROR RATE or BETA RATE.
76 The beta rate: POWER The dark area is the probability that the null hypothesis will be accepted, even though it is false. This is the beta rate. The POWER of the test is the probability that the null hypothesis will be rejected. The power is 1 – β. The lightly shaded area is the Type I (or α) error rate. This is the significance level.
77 Power The POWER of a statistical test is the probability that the null hypothesis will be rejected, given that it is FALSE. The power of a test is 1 – β, that is, one minus the Type II error rate.
78 Type I and Type II errors: Power of a one-tailed test
79 Type 1 and type 2 errors and Power
80 Factors affecting the Type II error rate An insufficiency of data (too few participants) means that the sampling distributions under H0 and under H1 are too close together and too much of the H1 distribution lies below the critical value for rejection of H0. A similar effect arises from unreliable data, which inflate random variation and therefore ERROR.
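To see how the number of participants drives power, here is a rough sketch that uses a normal approximation to the two-group, one-tailed test. It is not the exact calculation based on the noncentral t distribution, so the figures are approximate, and the function name is made up. Under H1 the test statistic is taken to be centred on d × √(n/2), where d is the effect size and n is the number of participants per group.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_one_tailed(d, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed two-group test (normal approximation)."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)   # critical value under H0
    shift = d * sqrt(n_per_group / 2)  # centre of the H1 distribution
    # Power = area of the H1 distribution lying beyond the critical value.
    return 1 - z.cdf(critical - shift)


for n in (10, 20, 40, 88):
    print(n, round(approx_power_one_tailed(0.5, n), 2))
```

With a medium effect (d = .5), small groups leave most of the H1 distribution below the critical value, and the power climbs steadily as participants are added.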
81 How much power? Cohen (1988) observed a general tendency for psychologists to be preoccupied with avoiding Type I errors and insufficiently concerned with the possibility of Type II errors. Most tests had insufficient power. A MINIMUM POWER OF 0.75 IS RECOMMENDED.
82 How many participants? That depends upon the minimum effect size that you want to pick up with your significance tests. You also want to make sure your power is at least at the 0.75 level. You can obtain the number of participants necessary by looking up tables (Cohen 1988; Clark-Carter, 2004).
83 Books with power tables Clark-Carter, D. (2004). Quantitative psychological research: a student's handbook (2nd ed.). Hove: Psychology Press. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
84 Using the Web The Web is a very useful source of up-to-date information on all statistical topics, including power and effect size. Use the Google search engine with the phrases "statistical power" and "effect size".
85 Software Software is available for finding the numbers of participants you will need to make a test at a specified level of power. An example is G*Power (Erdfelder et al., 1996). G*Power is available on the Web.
86 The Google window
87 Useful information on the Web
88 The G-power opening window
89 Using G-Power You can choose the a priori option to find out how many participants you need to achieve a power of .95 of rejecting the null hypothesis for a medium effect on a one-tail test. You need 176 participants! That's why we usually settle for a power level of 0.75! [Screenshot callouts: You fill in these values; The answers.]
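An a priori calculation of this kind can be approximated by hand. The sketch below uses the normal-approximation formula n per group = 2 × ((z for alpha + z for power) / d)², rounded up. It is not G*Power's own algorithm, and the function name is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power, alpha=0.05):
    """Approximate participants per group for a one-tailed two-group test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


n_high = n_per_group(0.5, 0.95)  # medium effect, power .95
n_low = n_per_group(0.5, 0.75)   # medium effect, power .75
print(2 * n_high, 2 * n_low)
```

For power .95 the approximation gives a total slightly below the 176 quoted on the slide, plausibly because it ignores the extra uncertainty carried by the t distribution. Dropping to power .75 roughly halves the number of participants required.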
90 Results of a one-factor, between subjects experiment raw scores grand mean
91 Statistics of the results group (cell) means group (cell) standard deviations Group (cell) variances
92 The five-group drug experiment The null hypothesis states that, in the population, all the means have the same value. In words, the null hypothesis states that none of the drugs has any effect.
93 The alternative hypothesis The alternative hypothesis is that, in the population, the means do NOT all have the same value. MANY POSSIBILITIES are implied by H1.
94 The One-way ANOVA The ANOVA of a one-factor BETWEEN SUBJECTS experiment is also known as the ONE-WAY ANOVA. The one-way ANOVA must be sharply distinguished from the one-factor WITHIN SUBJECTS (or REPEATED MEASURES) ANOVA, which is appropriate when participants are tested at every level of the treatment factor. The between subjects and within subjects ANOVAs are predicated upon different statistical models.
95 There are some large differences among the five treatment means, suggesting that the null hypothesis is false.
96 Mean square (MS) In ANOVA, the numerator of a variance estimate is known as a SUM OF SQUARES (SS). The denominator is known as the DEGREES OF FREEDOM (df). The variance estimate itself is known as a MEAN SQUARE (MS), so that MS = SS/df.
97 Accounting for variability The building block for any variance estimate is a DEVIATION of some sort. The TOTAL DEVIATION of any score from the grand mean (GM) can be divided into 2 components: 1. a BETWEEN GROUPS component; 2. a WITHIN GROUPS component. That is: total deviation (score – GM) = between groups deviation (group mean – GM) + within groups deviation (score – group mean).
98 Example of the breakdown The score, the group mean and the grand mean have been ringed in the table. This breakdown holds for each of the fifty scores in the data set. score grand mean group mean
99 Breakdown (partition) of the total sum of squares If you sum the squares of the deviations over all 50 scores, you obtain an expression which breaks down the total variability in the scores into BETWEEN GROUPS and WITHIN GROUPS components.
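The partition can be verified directly on a small data set. The scores below are hypothetical (three groups of three, not the fifty scores from the experiment), and the function name is an assumption made for the example.

```python
def partition_sums_of_squares(groups):
    """Return (SS total, SS between, SS within) for a list of groups of scores."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    means = [sum(group) / len(group) for group in groups]
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    # Each score in a group carries the same between groups deviation.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return ss_total, ss_between, ss_within


# Hypothetical scores, not the slide data.
hypothetical_groups = [[8, 10, 12], [11, 13, 15], [14, 16, 18]]
ss_total, ss_between, ss_within = partition_sums_of_squares(hypothetical_groups)
print(ss_total, ss_between, ss_within)
```

The between groups and within groups sums of squares add up to the total, as the breakdown on this slide requires.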
100 The F ratio
101 Rule for obtaining the df
102 Degrees of freedom The degrees of freedom df of a sum of squares is the number of independent values (scores, means) minus the number of parameters estimated. The SS between is calculated from 5 group means, but ONE parameter (the grand mean) has been estimated. Therefore df between = 5 – 1 = 4.
103 Degrees of freedom … The SS within is calculated from the scores of the 50 participants in the experiment; but the group mean is subtracted from each score to produce a deviation score. There are 5 group means. The df within = 50 – 5 = 45.
104 Calculating MS within In the equal-n case, we can simply take the mean of the cell variance estimates. MS within = (sum of the five cell variances)/5 = 48.36/5 = 9.67
105 Finding MS between
106 The value of MS within, since it is calculated only from the variances of the scores within groups and ignores the values of the group means, reflects ONLY RANDOM ERROR.
107 The value of the MS between, since it is calculated only from the MEANS, reflects random error, PLUS any real DIFFERENCES among the population means that there may be.
108 What F is measuring F = MS between (error + real differences) / MS within (error only). If there are differences among the population means, the numerator will be inflated and F will increase. If there are no differences, the two MSs will have similar values and F will be close to 1.
109 The ANOVA summary table F is large: about nine times larger than unity (the expected value under the null hypothesis) and well over the critical value. The p-value (Sig.) is less than .01, so F is significant beyond the .01 level. Do not write the p-value as .000. Write this result as follows: with an alpha-level of .05, F is significant: F(4, 45) = 9.09; p < .01. Notice that SS total = SS between groups + SS within groups.
110 The two-group case Returning to the caffeine experiment, what would happen if, instead of making a t test, we were to run an ANOVA to test the null hypothesis of equality of the means?
111 The two-group case: comparison of ANOVA with the t-test Observe that F = t². Observe also that the p-value is the same for both tests. The ANOVA and the independent-samples t test are EXACTLY EQUIVALENT and produce the same decision about the null hypothesis.
112 Equivalence of F and t in the two-group case When there are only two groups, the value of F is the square of the value of t. So if t is significant, then so is F and vice versa.
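The equivalence can be checked on the caffeine summary statistics. This is a sketch under the assumption of 20 participants per group (inferred from df = 38, not stated on the slides); the function names are made up. F is built the ANOVA way, from MS between and MS within, and t the usual way, from the pooled variance.

```python
from math import sqrt


def two_group_f(mean1, sd1, mean2, sd2, n):
    """One-way ANOVA F for two equal-sized groups, from summary statistics."""
    grand_mean = (mean1 + mean2) / 2
    ss_between = n * ((mean1 - grand_mean) ** 2 + (mean2 - grand_mean) ** 2)
    ms_between = ss_between / (2 - 1)       # df between = groups - 1
    ms_within = (sd1 ** 2 + sd2 ** 2) / 2   # mean of the cell variances
    return ms_between / ms_within


def two_group_t(mean1, sd1, mean2, sd2, n):
    """Independent-samples t for two equal-sized groups."""
    pooled_var = (sd1 ** 2 + sd2 ** 2) / 2
    return (mean1 - mean2) / sqrt(pooled_var * 2 / n)


# n = 20 per group is an assumption inferred from df = 38.
f_value = two_group_f(11.90, 3.28, 9.25, 3.16, 20)
t_value = two_group_t(11.90, 3.28, 9.25, 3.16, 20)
print(round(f_value, 2), round(t_value ** 2, 2))
```

Both routes give the same number, about 6.77, which is 2.60 squared.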
114 Effect size in ANOVA The greater the differences among the means, the greater will be the proportion of the total variability that is explained or accounted for by SS between. This is the basis of the oldest measure of effect size in ANOVA, which is known as ETA SQUARED (η²).
115 Eta squared Eta squared (also known as the CORRELATION RATIO) is defined as the ratio of the between groups sum of squares to the total sum of squares. Its theoretical range of variation is from zero (no differences among the means) to unity (no variance in the scores of any group, but different values in different groups). In our example, η² = .447
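Eta squared can be recovered from the reported F ratio and its degrees of freedom, because SS between = F × MS within × df between and SS within = MS within × df within, so MS within cancels. The sketch below is an illustration with a made-up function name, taking eta squared as SS between divided by SS total.

```python
def eta_squared_from_f(f, df_between, df_within):
    """Eta squared (SS between / SS total) from F and its degrees of freedom."""
    explained = f * df_between          # SS between in units of MS within
    return explained / (explained + df_within)


eta_sq = eta_squared_from_f(9.09, 4, 45)
print(round(eta_sq, 3))
```

With F(4, 45) = 9.09 this reproduces the value of .447 given for the five-group experiment.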
116 Comparison of eta squared with Cohen's d
117 Positive bias of eta squared The correlation ratio (eta squared) is positively biased as an estimator. Imagine you were to have huge numbers of participants in all the groups and calculate eta squared. This is the population value, which we shall term ρ² (rho squared). Imagine your own experiment (with the same numbers of participants) were to be repeated many times and you were to calculate all the values of eta squared. The mean value of eta squared would be higher than that of rho squared.
118 Removing the bias: omega squared The measure known as OMEGA SQUARED corrects the bias in eta squared. Omega squared achieves this by incorporating degrees of freedom terms.
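As a hedged illustration of how the degrees of freedom enter, the sketch below uses one common textbook form of omega squared for a one-way between subjects design, (SS between – df between × MS within) / (SS total + MS within). Dividing through by MS within lets it be written in terms of F, df between and the total number of participants N. The function name is made up, and other designs need other formulas.

```python
def omega_squared_from_f(f, df_between, n_total):
    """Omega squared for a one-way between subjects ANOVA, from F, df between and N."""
    numerator = df_between * (f - 1)
    return numerator / (numerator + n_total)


omega_sq = omega_squared_from_f(9.09, 4, 50)
eta_sq_for_comparison = 9.09 * 4 / (9.09 * 4 + 45)  # eta squared from the same F
print(round(omega_sq, 3), round(eta_sq_for_comparison, 3))
```

For the five-group experiment the estimate drops from about .45 to about .39, showing the size of the correction with 50 participants.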
119 Recommended reading For a thorough and readable coverage of elementary (and not so elementary) statistics, I recommend … Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA: Thomson/Wadsworth.
120 For SPSS Kinnear, P. R., & Gray, C. D. (2006). SPSS 14 for windows made simple. Hove and New York: Psychology Press. In addition to being a practical guide to the use of SPSS, this book also offers informal explanations of many of the techniques.