1 This morning’s programme: 9:05am – 10:50am The t tests. 10:50am – 11:20am A break for coffee. 11:20am – 12:30pm Approximate chi-square tests.
2 SESSION 2 From description to inference: hypothesis testing
3 INDUCTIVE REASONING: Traditional Aristotelian logic is DEDUCTIVE: it argues from the general to the particular. Statistical inference reverses this process by arguing INDUCTIVELY from the particular (the sample) to the general (the population). Statistical inference, therefore, is subject to error and inferences must be expressed in terms of probabilities.
4 Kinds of statistical inference: Estimation (of parameter values); Hypothesis testing.
5 Estimates: There are two types of estimate. POINT ESTIMATES: for example, we might use the sample mean as an estimate of the value of the population mean. INTERVAL ESTIMATES: on the basis of sample data, we can specify a range of values within which we can assume with specified levels of CONFIDENCE that the population value lies. I discuss confidence intervals in the appendix to this talk.
6 ‘Confirming’ our data: Suppose that we have found, in our results, a pattern that we would like to confirm, such as a difference between means. Could this pattern have arisen merely through sampling error? Would another research team who collect data of this type obtain a similar result? Hypothesis testing can provide an answer to questions of this sort.
7 Statistical hypotheses: A statistical hypothesis is a statement about a population, usually to the effect that a parameter, such as the mean, has a specified value, or that the means of two or more populations have the same value or different values. Here, by “population” is always meant a probability distribution, a hypothetical population of numbers.
8 Two hypotheses: In hypothesis testing, as widely practised at present by researchers, a decision is made between two complementary hypotheses, which are set up in opposition to each other: the null hypothesis (H0); the alternative hypothesis (H1).
9 The null hypothesis: The null hypothesis (H0) is the statistical equivalent of the hypothesis of NO EFFECT, the negation of the scientific hypothesis. For example, if a researcher thinks a set of scores is from a population with a different mean than a control population, the null hypothesis will state that there is NO such difference. The alternative hypothesis (H1) is that the null hypothesis is false. In traditional statistical testing, it is the null hypothesis, not the alternative hypothesis, that is tested.
10 Number of samples: Tests of some hypotheses can be made by drawing a single sample of scores. Other hypotheses, however, can only be tested by drawing two or more samples. It is easiest to consider the elements of hypothesis testing by considering one-sample tests first.
12 Situation (a): The population standard deviation is known
13 An example: Let us suppose that on the island of Erewhon, men’s heights have an approximately normal distribution with a mean of 69 inches and an SD of 3.2 inches. A researcher wonders whether there might be a tendency for those in the north of the island to be taller than the general population. A sample of 100 northerners has a mean height of 69.8 inches. Remembering that this is merely a sample from the population of northerners, do we have evidence that northerners are taller?
14 Steps in testing a hypothesis: Formulate the null and alternative HYPOTHESES. Decide upon a SIGNIFICANCE LEVEL. Decide upon an appropriate TEST STATISTIC. Decide upon the CRITICAL REGION, a range of “unlikely” values for the test statistic, that is, values whose total probability under the null hypothesis is no greater than the significance level. If the value of the test statistic falls within the critical region, the null hypothesis is rejected.
15 The null and alternative hypotheses: The null hypothesis is that, contrary to the researcher’s speculation, the height of northerners is no different from that of the general population. The alternative hypothesis is that northerners are of different height.
16 Significance level: The significance level is a small probability fixed by tradition. The significance level is commonly set at .05, but in some areas researchers insist upon a lower level, such as .01. We shall set the level at .05.
17 Revision: We are talking about a situation in which a single sample has been drawn from a population. Here the reference set is the population or probability distribution of the means of such samples, which is known as the SAMPLING DISTRIBUTION OF THE MEAN. Its SD is known as the STANDARD ERROR OF THE MEAN (σM).
19 The standard normal distribution: Questions about ranges of values in any normal distribution can always be referred to questions about corresponding values in the STANDARD NORMAL DISTRIBUTION. We do this by transforming the original values to values of z, the STANDARD NORMAL VARIATE.
20 The standard normal distribution: We transform the original value to z by subtracting the mean, then dividing by the standard deviation. In this case, we must divide by σM, not σ.
21 The test statistic: Since we know the SD σ, we can use as our test statistic z = (M – μ) / σM, where the denominator is the STANDARD ERROR OF THE MEAN, that is, the SD of the sampling distribution of the mean: σM = σ / √n.
22 The critical region: We want the total probability of a value in the critical region to be .05, that is, the significance level. We distribute that probability equally between the two tails of the distribution: .025 in each tail.
23 Calculate the value of z: z = (M – μ) / σM = (69.8 – 69) / (3.2 / √100) = 0.8 / 0.32 = 2.5. Since this value falls within the critical region (z > 1.96), the null hypothesis is rejected. We have evidence that the northerners are taller.
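As a rough sketch of the calculation on this slide (Python, standard library only; the variable names are illustrative and do not come from the talk), the z statistic for the Erewhon example could be computed like this:

```python
import math

# Values taken from the Erewhon example
mu_0 = 69.0    # population mean under H0 (inches)
sigma = 3.2    # known population SD (inches)
n = 100        # sample size
m = 69.8       # sample mean of the northerners (inches)

# Standard error of the mean: the SD of the sampling distribution of the mean
se_mean = sigma / math.sqrt(n)

# Test statistic: distance of the sample mean from mu_0 in standard-error units
z = (m - mu_0) / se_mean

# Two-tailed critical region at the .05 level: |z| > 1.96
reject_h0 = abs(z) > 1.96

print(f"SE = {se_mean:.2f}, z = {z:.2f}, reject H0: {reject_h0}")
```

Dividing by the standard error rather than by σ is the step that turns a statement about one person's height into a statement about a sample mean.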
24 The p-value: The p-value of a test statistic is the probability, assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as unlikely as the one obtained. The p-value must be clearly distinguished from the significance level (say .05): the significance level is fixed beforehand; but the p-value is determined by your own data.
25 Use of the p-value: If the p-value is less than the significance level, the value of your test statistic must have fallen within the critical region. But the p-value tells you more than this. A high p-value means that the value of the test statistic is well short of being significant; whereas a low p-value means we are well over the line.
26 The one-tailed p-value: The ONE-TAILED p-value is the probability of a value of the test statistic at least as extreme (in the same direction) as the value actually obtained.
27 The one-tailed p-value: We obtain the one-tailed p-value by subtracting the cumulative probability of z = 2.5 from 1: 1 – .9938 = .0062.
28 One-tailed and two-tailed p-values: If the region of rejection is located in both tails of the sampling distribution, as in the present example, a TWO-TAILED p-value must be calculated. We must DOUBLE the one-tailed p-value. If we didn’t do that, a value only marginally significant would seem to have a probability of only .025, not .05 as previously decided. So if the p-value in either direction is less than .025, the two-sided p-value is less than .05, and we have significance.
29 The two-tailed p-value of 2.5: We must now double the one-tailed p-value: .0062 × 2 = .0124.
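The two p-values for z = 2.5 can be checked with a short sketch. It uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

z = 2.5  # value of the test statistic from the Erewhon example

# One-tailed p-value: area under the standard normal curve to the right of z
p_one_tailed = 1 - NormalDist().cdf(z)

# Two-tailed p-value: double the one-tailed value
p_two_tailed = 2 * p_one_tailed

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

Both values are below .05, so the decision to reject the null hypothesis is the same on either test in this example.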
31 Directional hypothesis: Our researcher suspects that the northerners are TALLER, not simply that they are of DIFFERENT height. This is a DIRECTIONAL hypothesis. On this basis, it could be (and is) argued that the critical region, with a probability of .05, should be located entirely in the UPPER tail of the standard normal distribution.
33 Comparison of the critical regions: If you are only interested in the possibility of a difference in ONE direction, you might decide to locate the critical region entirely in one tail of the distribution. [Figure: two-tailed test with 0.025 (2.5%) in each tail; one-tailed test with 0.05 (5%) in one tail.]
34 Easier to get a significant result: Note that, on a one-tail test, you only need a z-value of 1.64 for significance, rather than a value of 1.96 for a two-tail test. So, on a one-tail test, it’s easier to get significance IN THE DIRECTION YOU EXPECT.
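The two critical values quoted here can be recovered from the standard normal distribution. The following is a minimal sketch using the standard library; the names `z_crit_two_tailed` and `z_crit_one_tailed` are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, SD 1

# Two-tailed test: alpha is split equally between the tails (.025 in each)
z_crit_two_tailed = std_normal.inv_cdf(1 - alpha / 2)

# One-tailed test: the whole of alpha is placed in the upper tail
z_crit_one_tailed = std_normal.inv_cdf(1 - alpha)

print(f"two-tailed: {z_crit_two_tailed:.3f}, one-tailed: {z_crit_one_tailed:.3f}")
```

The one-tailed boundary lies closer to the centre of the distribution, which is why a smaller z suffices for significance in the expected direction.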
36 Type I errors: Suppose the null hypothesis is true, but the value of z falls within the critical region. We shall reject the null hypothesis, but, in so doing, we shall have made a Type I or alpha (α) error. The probability of a Type I error is simply the chosen significance level and in our example its value is .05.
37 Probability of a Type I error: Suppose H0 is true. If the value of z falls within either tail, we shall reject H0 and make a Type I error. The probability that we shall do this is the significance level, .05. Accordingly, the significance level is also referred to as the ALPHA-LEVEL.
38 Type II (beta) errors: Suppose the null hypothesis is false. The value of the test statistic, however, does not fall within the critical region and the null hypothesis is accepted. We have made a Type II or beta (β) error.
39 Power: The POWER of a statistical test is the probability that, if the null hypothesis is false, it will be rejected by the statistical test. When the power of a test is low, a non-significant test result is impossible to interpret: there may indeed be nothing to report; but the researcher has no way of knowing this.
40 Two distributions: The following diagram shows the relationships among the Type I error rate (significance level), the Type II error rate and Power when the null hypothesis is tested against a one-sided alternative hypothesis that the mean has a higher value. (This is a one-tailed test.) The overlapping curves represent the sampling distributions of the mean under the null hypothesis (left) and the alternative hypothesis (right). In the diagram, μ0 and μ1 are the values of the population mean according to the null and alternative hypotheses, respectively.
42 Points: Any value of M to the left of the grey area will result in the acceptance of H0. If H1 is true (distribution on the right), a Type II error will have been made. Notice that the Power and Type II error rates sum to unity.
44 Factors affecting the power of a statistical test
45 Significance level and power: In the upper figure, the red area is the .05 significance level; the green area is the Type II error rate. The lower figure shows that a lower significance level (e.g. .01) reduces the probability of making a Type I error, but the probability of a Type II error (green) increases and the power (P) decreases.
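To make the trade-off concrete, here is a hedged sketch of the power of an upper-tailed z test. It reuses the Erewhon figures (μ0 = 69, σ = 3.2, n = 100) and ASSUMES, purely for illustration, that the true mean under H1 is 69.8; that value and the function name are not from the talk.

```python
import math
from statistics import NormalDist

def power_upper_tailed_z(mu_0, mu_1, sigma, n, alpha):
    """Power of an upper-tailed z test when the true mean is mu_1."""
    se_mean = sigma / math.sqrt(n)
    # Smallest sample mean that falls in the critical region under H0
    critical_mean = mu_0 + NormalDist().inv_cdf(1 - alpha) * se_mean
    # Probability of exceeding that boundary when H1 is true
    return 1 - NormalDist(mu_1, se_mean).cdf(critical_mean)

# mu_1 = 69.8 is an assumed value chosen only for illustration
power_05 = power_upper_tailed_z(69.0, 69.8, 3.2, 100, alpha=0.05)
power_01 = power_upper_tailed_z(69.0, 69.8, 3.2, 100, alpha=0.01)

# The Type II error rate (beta) and the power sum to unity
beta_05 = 1 - power_05
beta_01 = 1 - power_01

print(f"alpha = .05: power = {power_05:.3f}, beta = {beta_05:.3f}")
print(f"alpha = .01: power = {power_01:.3f}, beta = {beta_01:.3f}")
```

Lowering alpha moves the critical boundary to the right, so less of the H1 distribution lies beyond it.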
46 Size of the difference between μ1 and μ0. The greater the difference between the real population mean and the value assumed by the null hypothesis, the less the overlap between the sampling distributions. The less the overlap, the greater will be the area of the H1 (right) curve beyond the critical value under H0 and the greater the power of the test to reject the null hypothesis. The researcher has no control over this determinant of power.
49 Sample size: Now we come to another important determinant of the power of a statistical test: sample size. This is the factor over which the researcher usually has the most control.
50 Revision: The larger the sample, the smaller the standard error of the mean and the taller and narrower will be the sampling distribution if drawn to the same scale as the original distribution.
51 Effect of increasing the sample size n: Sampling distributions of the mean for n = 16 and n = 64. [Figure labels: the IQ distribution; μ; n = 16.]
52 Sample size: When there are two samples, therefore, larger samples will result in greater separation of the sampling distributions, reduction in the Type II error rate and more power.
53 Power and sample size: Increasing the sample size reduces the overlap of the sampling distributions under H0 and H1 by making them taller and narrower. The beta-rate is reduced and the power (green area) increases. [Figure panels: Small samples; Large samples.]
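The effect of sample size on power can be illustrated with a small sketch. All the numbers below are HYPOTHETICAL (an IQ-like scale with μ0 = 100, an assumed true mean of 105 and σ = 15); they are chosen only to show the trend and are not taken from the talk.

```python
import math
from statistics import NormalDist

def power_upper_tailed_z(mu_0, mu_1, sigma, n, alpha=0.05):
    """Power of an upper-tailed z test when the true mean is mu_1."""
    se_mean = sigma / math.sqrt(n)  # shrinks as n grows
    critical_mean = mu_0 + NormalDist().inv_cdf(1 - alpha) * se_mean
    return 1 - NormalDist(mu_1, se_mean).cdf(critical_mean)

# Hypothetical scenario: H0 mean 100, true mean 105, SD 15
sample_sizes = [16, 64, 256]
powers = [power_upper_tailed_z(100, 105, 15, n) for n in sample_sizes]

for n, power in zip(sample_sizes, powers):
    print(f"n = {n:3d}: SE = {15 / math.sqrt(n):.3f}, power = {power:.3f}")
```

Quadrupling n halves the standard error, so the two sampling distributions overlap less and the power rises.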
54 Reliability of measurement: Greater precision or RELIABILITY of measurement also reduces the standard error of the mean and improves the separation of the null and alternative distributions. The more separation between the sampling distributions, the greater the power of the statistical test. Jeanette Jackson will discuss the topic of reliability.
55 Situation (b): The population standard deviation is unknown
56 Very rarely do we know the population standard deviation
57 Vocabulary: A researcher suspects that a new intake to a college of further education may require extra coaching to enrich their vocabulary. The College has been using a vocabulary test, students’ performance on which, over the years, has been found to have a mean of 50. The standard deviation is not known with certainty (estimates have varied and the records are incomplete); but the population distribution seems to be approximately normal. The 36 new students have vocabulary scores with a mean of 49 and a sample standard deviation of 2.4. Is this evidence that their vocabulary scores are not from the usual College student population?
62 Distribution of t: Like the standard normal variate z, the distribution of t has a mean of zero. The t statistic, however, is not normally distributed. Although, like the normal distribution, it is symmetrical and bell-shaped, it has thicker tails: that is, large absolute values of t are more likely than large absolute values of z.
63 The family of t distributions: There is only one standard normal distribution, to which any other normal distribution can be transformed; but there is a whole family of t distributions. A normal distribution has two parameters: the mean and the SD. A t distribution has ONE parameter, known as the DEGREES OF FREEDOM (df).
64 Degrees of freedom: The term is borrowed from physics. The degrees of freedom of a system is the number of constraints that must be placed upon it to determine its state completely. By analogy, the variance of n scores is calculated from the squares of n deviations from the mean; but deviations from the mean sum to zero, so if you know the values of (n – 1) deviations, you know the nth deviation. The degrees of freedom of the variance is therefore (n – 1).
65 Degrees of freedom of t: The degrees of freedom of the one-sample t statistic is (n – 1), where n is the size of the sample. This is the degrees of freedom of the variance estimate from the sample. In our case, the degrees of freedom of the t statistic = (n – 1) = 36 – 1 = 35. As the size of n increases, the t distribution becomes more and more like the standard normal distribution.
66 Extreme values of t are more likely than extreme values of z
67 The critical region: Arguably, since the researcher’s concern is with low scores, we can justify a one-tailed test here and locate the critical region exclusively in the lower tail of the distribution of t on 35 degrees of freedom. We want the critical region to lie to the left of the 5th percentile of the distribution, in the lower tail.
69 Boundary of critical region for the t test lies further out in the tail: Notice that the boundary (–1.69) of the critical region lies further out in the lower tail than does the 5th percentile of the standard normal distribution (–1.64).
70 A significant result: Our value of t (–2.5) lies within the critical region. The null hypothesis is therefore rejected and we have evidence that our sample is from a population with a mean score of less than 50.
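The value t = –2.5 follows from the vocabulary figures given earlier. A minimal sketch of that arithmetic is shown below; the critical value –1.69 is simply copied from the slides rather than computed, and the variable names are illustrative.

```python
import math

# Values taken from the vocabulary example
mu_0 = 50.0   # mean under H0
n = 36        # sample size
m = 49.0      # sample mean
s = 2.4       # sample standard deviation (the population SD is unknown)

# Estimated standard error of the mean, using s in place of sigma
se_mean = s / math.sqrt(n)

# One-sample t statistic and its degrees of freedom
t = (m - mu_0) / se_mean
df = n - 1

# Lower-tail critical value for df = 35 at the .05 level, as quoted on the slide
t_crit_lower = -1.69
reject_h0 = t < t_crit_lower

print(f"SE = {se_mean:.2f}, t({df}) = {t:.2f}, reject H0: {reject_h0}")
```

The only change from the z test is that the sample SD replaces σ, which is why the statistic is referred to the t distribution instead of the standard normal.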
73 Is this a repeatable result? The difference between the Caffeine and Placebo means is (11.90 – 9.25) = 2.65 hits. Could this difference have arisen merely from sampling error?
74 Independent samples: The Caffeine experiment yielded two sets of scores, one set from the Caffeine group, the other from the Placebo group. There is NO BASIS FOR PAIRING THE SCORES. We have INDEPENDENT SAMPLES. We shall make an INDEPENDENT-SAMPLES t test.
75 The null hypothesis: The null hypothesis states that, in the population, the Caffeine and Placebo means have the same value.
76 The alternative hypothesis: The alternative hypothesis states that, in the population, the Caffeine and Placebo means do not have the same value.
77 Revision: We are talking about a situation in which two samples have been drawn from identical normal populations and the difference between their means M1 – M2 has been calculated. Here the reference set is the population or probability distribution of such differences, which is known as the SAMPLING DISTRIBUTION OF THE DIFFERENCE (between means). Its SD is known as the STANDARD ERROR OF THE DIFFERENCE.
79 The standard normal distribution: Questions about ranges of values in any normal distribution can always be referred to questions about corresponding values in the STANDARD NORMAL DISTRIBUTION. We do this by transforming the original values to values of z, the STANDARD NORMAL VARIATE.
88 The critical region: We shall reject the null hypothesis if the value of t falls within EITHER tail of the t distribution on 38 degrees of freedom. To be significant beyond the .05 level, our value of t must be greater than +2.02 OR less than –2.02. Since our value for t (2.60) falls within the critical region, the null hypothesis is rejected.
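The slides that derive t = 2.60 are not in this transcript, so the following is only a sketch of the usual pooled-variance calculation. It ASSUMES equal group sizes of 20 (inferred from the 38 degrees of freedom, not stated in the transcript); the means and SDs are those quoted in the report.

```python
import math

# Group sizes are an assumption: 38 df is consistent with 20 per group
n1 = n2 = 20
m1, sd1 = 11.90, 3.28   # Caffeine group
m2, sd2 = 9.25, 3.16    # Placebo group

# Degrees of freedom for the independent-samples t test
df = n1 + n2 - 2

# Pooled estimate of the common population variance
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df

# Standard error of the difference between the two means
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic: the difference in means in standard-error units
t = (m1 - m2) / se_diff

# Two-tailed critical value for df = 38 at the .05 level, as quoted on the slide
reject_h0 = abs(t) > 2.02

print(f"t({df}) = {t:.2f}, reject H0: {reject_h0}")
```

Under these assumptions the result agrees with the t(38) = 2.60 given in the talk.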
89 The p-value: The one-tailed p-value is (1 – the cumulative probability of the t value 2.60). To obtain the two-tailed p-value, we must double this value.
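The Python standard library has no t distribution, so the sketch below approximates the tail area by integrating the t density numerically with Simpson's rule. This is a stand-in technique chosen for self-containment, not the method used in the talk; the function names are hypothetical, and a statistics package would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t (for t >= 0), using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    # The distribution is symmetric, so half the area lies above zero
    return 0.5 - area_0_to_t

# One-tailed p-value for the caffeine result, then doubled for two tails
p_one_tailed = t_upper_tail(2.60, 38)
p_two_tailed = 2 * p_one_tailed

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

As a sanity check, `t_upper_tail(2.02, 38)` should come out close to .025, matching the two-tailed critical value quoted for 38 degrees of freedom.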
90 Your report: “The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p < .05 (two-tailed).” Here, 38 is the degrees of freedom.
91 Representing very small p-values: Suppose, in the caffeine experiment, that the p-value had been very small indeed. (Suppose t = 6.0.) The computer would have given your p-value as ‘.000’. NEVER write, ‘p = .000’. This is unacceptable in a scientific article. Write, ‘p < .01’, or ‘p < .001’. You would have written the present result as ‘t(38) = 6.0; p < .001’.
92 Lisa DeBruine’s guidelines: Lisa DeBruine has compiled a very useful document describing the most important of the APA guidelines for the reporting of the results of statistical tests. I strongly recommend this document, which is readily available on the Web. Sometimes the APA manual is unclear. In such cases, Lisa has opted for what seems to be the most reasonable interpretation. If you follow Lisa’s guidelines, your submitted paper won’t draw fire on account of poor presentation of your statistics!
93 A one-tailed test? The null hypothesis states simply that, in the population, the Caffeine and Placebo means are equal. H0 is refuted by a sufficiently large difference between the means in EITHER direction. But some argue that if our scientific hypothesis is that Caffeine improves performance, we should be looking at differences in only ONE direction.
94 Assumption: In what follows, we shall assume that the researcher, on the basis of sound theory, has planned to make a one-tailed test. Accordingly, the critical region is located entirely in the upper tail of the distribution of t on 38 degrees of freedom.
95 The null hypothesis again: The null and alternative hypotheses must be complementary: that is, they must exhaust the possibilities. If the alternative hypothesis says that the Caffeine mean is greater, the null hypothesis must say that it is not greater: that is, it is equal to OR LESS THAN the Placebo mean.
97 Direction of subtraction: The direction of subtraction of one sample mean from the other is now crucial. You MUST subtract the Placebo mean from the Caffeine mean. Only a POSITIVE value of t can falsify the directional null hypothesis.
98 A smaller difference between the means: Suppose that the mean score of the Caffeine group had been lower than 11.90. The cell variances are the same as before. In other words, the Caffeine and Placebo means differ by only 1.74 points, rather than 2.65 points, as in the original example.
100 The result: The value of t is now 1.71, which is greater than the critical value (1.69) on a one-tailed test. The null hypothesis that the Caffeine mean is no greater than the Placebo mean is rejected.
101 Report of the one-tailed test: “The scores of the Caffeine group (M = 10.97; SD = 3.28) were significantly higher than those of the Placebo group (M = 9.25; SD = 3.16): t(38) = 1.71; p < .05 (one-tailed).”
102 Advantage of the one-tailed test: Our t value of 1.71 would have failed to achieve significance on the two-tailed test, since the critical value there was 2.02. On the one-tailed test, however, t lies in the critical region and the null hypothesis is rejected.
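A short sketch can show how the same t value fares on the two tests. It takes the 1.74-point difference and the SDs from the slides, ASSUMES 20 scores per group (inferred from the 38 degrees of freedom), and copies both critical values from the talk rather than computing them.

```python
import math

# Group sizes are an assumption consistent with df = 38
n1 = n2 = 20
sd1, sd2 = 3.28, 3.16      # cell SDs, unchanged from the original example
mean_difference = 1.74     # Caffeine mean minus Placebo mean

# Pooled variance and standard error of the difference
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = mean_difference / se_diff

# Critical values for df = 38 at the .05 level, as quoted on the slides
significant_one_tailed = t > 1.69          # upper tail only
significant_two_tailed = abs(t) > 2.02     # either tail

print(f"t({df}) = {t:.2f}")
print(f"one-tailed: {significant_one_tailed}, two-tailed: {significant_two_tailed}")
```

The comparison only favours the one-tailed test because the difference lies in the predicted direction.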
103 More power: In locating the entire critical region in the upper tail of the H0 distribution, we increase the light-grey area and reduce the dark-grey area, the beta rate. In other words, we increase the POWER of the test to reject H0.
104 An unexpected result: Now suppose that, against expectation, the Placebo group had outperformed the Caffeine group. The mean for the Caffeine group is 9.25 and that for the Placebo group is higher. If we subtract the Placebo mean from the Caffeine mean as before, we obtain t = –2.02. On a two-tailed test, this would have been in the critical region (p < .05) and we should have rejected the null hypothesis.
105 One-sided p-value: We cannot, however, change horses and declare this unexpected result to be significant. In the one-tailed test, the null hypothesis is also one-sided. Accordingly, the p-value is also one-sided, that is, it is the probability that the (Caffeine – Placebo) difference would have been at least as LARGE in the positive direction as the one we obtained.
106 The one-sided p-value: The one-sided p-value is the entire area under the curve TO THE RIGHT of your value of t. That area is far greater than .05. You have nothing to report.
107 Correct report of the one-tailed test: “The scores of the Caffeine group (M = 9.25; SD = 3.16) were not significantly higher than those of the Placebo group (M = 10.20; SD = 3.28): t(38) = ; p = (one-tailed).”
108 Why you can’t change horses: Having decided upon a one-tailed test, you cannot change to a two-tailed test when you get a result in the opposite direction to that expected. If you do, the Type I error rate increases.
109 The true Type I error rate. If you switch to a two-tailed test, your true Type I error rate is now the black area (0.05) PLUS the green area in the lower tail (0.025). This is 0.05 + 0.025 = 0.075, a level many would feel is too high. (See the OR rule in the appendix to my first talk.)
111 A preoccupation with significance: For many years, following R. A. Fisher, the first to develop a system of testing, there was a preoccupation with significance and insufficient regard for the MAGNITUDE of the effect one was investigating. Fisher himself observed that, on a sufficiently powerful test, even the most minute difference will be statistically “significant”, however “insubstantial” it may be.
112 A ‘substantial’ difference? We obtained a difference between the Caffeine and Placebo means of (11.90 – 9.25) = 2.65 score points. This difference, as we have seen, is “significant” in the statistical sense; but is it SUBSTANTIAL, that is, worth reporting?
115 Levels of effect size: On the basis of scrutiny of a large number of studies, Jacob Cohen proposed that we regard a d of .2 as a SMALL effect size, a d of .5 as a MEDIUM effect size and a d of .8 as a LARGE effect size. So our experimental result (d = .82) is a ‘large’ effect. When you report the results of a statistical test, you are now expected to provide a measure of the size of the effect you are reporting.
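The slides defining Cohen's d are missing from this transcript. The sketch below ASSUMES the common definition, the difference between the means divided by the pooled standard deviation, and assumes equal group sizes so that the pooled variance is the simple average of the two variances. The cut-offs are those attributed to Cohen above.

```python
import math

# Means and SDs from the caffeine report
m_caffeine, sd_caffeine = 11.90, 3.28
m_placebo, sd_placebo = 9.25, 3.16

# Pooled SD for equal group sizes (an assumption): root of the mean variance
pooled_sd = math.sqrt((sd_caffeine**2 + sd_placebo**2) / 2)

# Cohen's d: the mean difference expressed in SD units
d = (m_caffeine - m_placebo) / pooled_sd

# Cohen's suggested benchmarks: .2 small, .5 medium, .8 large
if d >= 0.8:
    label = "large"
elif d >= 0.5:
    label = "medium"
elif d >= 0.2:
    label = "small"
else:
    label = "negligible"  # label for d below .2 is our own wording

print(f"Cohen's d = {d:.2f} ({label})")
```

Because d is scaled by the SD and not by the standard error, it does not grow merely because the sample is large.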
117 Complete report of your test: “The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p < .05 (two-tailed). Cohen’s d = .82, a ‘large’ effect.”
120 Nominal data: A NOMINAL data set consists of records of membership of the categories making up QUALITATIVE VARIABLES, such as gender or blood group. Nominal data must be distinguished from SCALAR, CONTINUOUS or INTERVAL data, which are measurements of QUANTITATIVE variables on an independent scale with units. Nominal data sets merely carry information about the frequencies of observations in different categories.
121 A set of nominal data: A medical researcher wishes to test the hypothesis that people with a certain type of body tissue (Critical) are more likely to have a potentially harmful antibody. Data are obtained on 79 people, who are classified with respect to 2 attributes: Tissue Type; Presence/Absence of the antibody.
122 A question of association: Do more of the people in the critical group have the antibody? We are asking whether there is an ASSOCIATION between the variables of category membership (tissue type) and presence/absence of the antibody. The SCIENTIFIC hypothesis is that there is such an association.
123 The null hypothesis: The NULL HYPOTHESIS is the negation of the scientific hypothesis. The null hypothesis states that there is NO association between tissue type and presence of the antibody.
124 Contingency tables (cross-tabulations): When we wish to investigate whether an association exists between qualitative or categorical variables, the starting point is usually a display known as a CONTINGENCY TABLE, whose rows and columns represent the categories of the qualitative variables we are studying. Contingency tables are also known as CROSS-TABULATIONS, or CROSSTABS.
125 The equivalent of a scatterplot: The contingency table is the equivalent, for use with nominal data, of the scatterplot that is used to display bivariate continuous data sets.
127 Interpretation: Is there an association between Tissue Type and Presence of the antibody? The antibody is indeed more in evidence in the ‘Critical’ tissue group. It looks as if there may be an association.
129 Observed and expected cell frequencies: Let O be the frequency of observations in a cell of the contingency table. From the marginal totals, we calculate the cell frequencies E that we should expect if there were NO ASSOCIATION between the two attributes Tissue Type and Presence/Absence of the antibody.
130 Testing the null hypothesis: We test the null hypothesis by comparing the values of O and E. Large (O – E) differences cast doubt upon the null hypothesis of no association.
131 What cell frequencies can be expected? The pattern of the OBSERVED FREQUENCIES (O) would suggest that there is a greater incidence of the antibody in the Critical tissue group. But the marginal totals showing the frequencies of the various groups in the sample also vary. What cell frequencies would we expect under the independence hypothesis?
133 Expected cell frequencies (E): According to the null hypothesis, the presence of the antibody and membership of a particular tissue type are independent events. The probability of the joint occurrence of independent events is the product of their separate probabilities. (See the appendix of the first talk.) On this basis, we find the expected frequency (E) for each cell by multiplying together the two marginal totals that intersect at that cell and dividing by the total number of observations.
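The contingency table itself did not survive in this transcript, so the sketch below uses a HYPOTHETICAL 4 × 2 table of counts (four tissue types by antibody absent/present). The numbers are invented for illustration and are not the study's data; only the rule for computing E comes from the talk.

```python
# HYPOTHETICAL observed counts: rows are tissue types, columns are (No, Yes)
observed = [
    [10, 5],
    [8, 7],
    [4, 16],   # imagined 'Critical' group
    [9, 6],
]

# Marginal totals
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell under independence:
# (row total x column total) / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for obs_row, exp_row in zip(observed, expected):
    print(obs_row, [round(e, 2) for e in exp_row])
```

The expected frequencies reproduce the marginal totals exactly; only the way the counts are spread across the cells differs from the observed table.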
136 Marked (O – E) differences: In both cells of the Critical group, there seem to be large differences between O and E: there are many fewer No’s than expected and many more Yes’s.
137 The chi-square (χ²) statistic: We need a statistic which compares the differences between the O and E, so that a large value will cast doubt upon the null hypothesis of independence. The approximate CHI-SQUARE (χ²) statistic fits the bill.
138 Formula for chi-square: χ² = Σ (O – E)² / E. Each element of this summation expresses the square of the difference between O and E as a proportion of E. Add up these proportional squared differences for all the cells in the contingency table.
139 The value of chi-square: There are 8 terms in the summation, but only the first two and the last are shown in the calculation below.
140 Degrees of freedom: To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df of the chi-square statistic. If a contingency table has R rows and C columns, the degrees of freedom is given by df = (R – 1)(C – 1). In our example, R = 4, C = 2 and so df = (4 – 1)(2 – 1) = 3.
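As an illustration of the summation and the degrees of freedom, here is a sketch on a HYPOTHETICAL 4 × 2 table. The counts are invented and are not the study's data, so the resulting chi-square value will not match the one reported in the talk.

```python
# HYPOTHETICAL observed counts: rows are tissue types, columns are (No, Yes)
observed = [
    [10, 5],
    [8, 7],
    [4, 16],
    [9, 6],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over all R x C cells
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
        chi_square += (o - e) ** 2 / e

# Degrees of freedom: (R - 1)(C - 1)
n_rows, n_cols = len(observed), len(observed[0])
df = (n_rows - 1) * (n_cols - 1)

print(f"chi-square({df}) = {chi_square:.2f}")
```

With four rows and two columns there are eight terms in the sum, but only three degrees of freedom, because the marginal totals fix the remaining cells.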
141 Significance: The p-value of a chi-square with a value of 10.66 in the chi-square distribution with three degrees of freedom is .014. We should write this result as: χ²(3) = 10.66; p = .014. Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
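The reported p-value can be checked without a statistics package, because the upper-tail probability of a chi-square distribution has a closed form when df = 3. The sketch below relies on that special case; the function name is hypothetical and the formula does NOT apply to other degrees of freedom.

```python
import math

def chi_square_upper_tail_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df.

    Closed form valid only for df = 3:
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Value of chi-square reported in the talk
p_value = chi_square_upper_tail_df3(10.66)

# For comparison: 7.815 is the standard tabled .05 critical value for df = 3
p_at_critical = chi_square_upper_tail_df3(7.815)

print(f"p = {p_value:.3f}")
```

For tables with other dimensions, a general chi-square routine from a statistics library would be needed instead.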
150 Confidence intervals: A CONFIDENCE interval is a range of values centred on the value of the sample statistic and which one can assume with a specified level of “confidence” includes the true value of the parameter.
152 Equivalent probability statement: An expression with terms such as < is known as an INEQUALITY. There are special rules for manipulating inequalities.
153 Inference about the population mean: Notice that the population mean is now at the centre of the inequality and the sample mean is in the terms denoting the lower and upper limits of the interval. We have changed a statement about the sample mean to one about the population mean.
154 The 95% confidence interval on the sample mean: M – 1.96 σM < μ < M + 1.96 σM. You can be 95% “confident” that the value of the population mean lies within this range.
155 Example: A sample of 100 people has a mean height of 69.8 inches. Suppose (very unrealistically) that we know that the population SD is 3.2 inches, but we don’t know the value of the population mean. Construct the 95% confidence interval on the mean.
156 The first step: Calculate the standard error of the mean: σM = σ / √n = 3.2 / √100 = 0.32.
157 The 95% confidence interval: 69.8 ± (1.96 × 0.32), that is, [69.17, 70.43]. You can be 95% confident that the population mean lies within this range.
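The interval for the height example can be reproduced with a few lines. This is a minimal sketch using the standard library; the variable names are illustrative, and the final check against the null value of 69 anticipates the comparison with the hypothesis test.

```python
import math
from statistics import NormalDist

# Values from the height example
m = 69.8       # sample mean (inches)
sigma = 3.2    # population SD, assumed known
n = 100        # sample size

# Step 1: standard error of the mean
se_mean = sigma / math.sqrt(n)

# Step 2: z value cutting off .025 in the upper tail (about 1.96)
z_975 = NormalDist().inv_cdf(0.975)

# Step 3: limits of the 95% confidence interval
lower = m - z_975 * se_mean
upper = m + z_975 * se_mean

# The null hypothesis value is rejected if it lies outside the interval
mu_0 = 69.0
reject_h0 = not (lower <= mu_0 <= upper)

print(f"95% CI = [{lower:.2f}, {upper:.2f}], reject H0 (mu = 69): {reject_h0}")
```

The interval and the two-tailed z test use the same standard error and the same 1.96 multiplier, which is why they lead to the same decision.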
158 Using the confidence interval to test the null hypothesis: Notice that the 95% confidence interval on the mean, that is, [69.17, 70.43], does not include the value 69. If the confidence interval does not include the value specified by the null hypothesis, the hypothesis can be rejected. The two approaches lead to exactly the same decision about the null hypothesis.
159 Interpretation of a confidence interval: The 95% confidence interval on our sample mean is [69.17, 70.43]. We cannot say, “The probability that the mean lies between 69.17 and 70.43 is .95”. A confidence interval is not a sample space. (See the appendix to my first talk.) A classical probability refers to a hypothetical future. Here, the die has already been cast and either the interval fell over the population mean or it didn’t. In view of the manner in which the interval was constructed, however, we can be “95% confident” that it fell over the true value of the population mean.