# This morning’s programme

This morning’s programme
9:05am – 10:50am: The t tests; 10:50am – 11:20am: A break for coffee; 11:20am – 12:30pm: Approximate chi-square tests.

SESSION 2 From description to inference: hypothesis testing

INDUCTIVE REASONING Traditional Aristotelian logic is DEDUCTIVE: it argues from the general to the particular. Statistical inference reverses this process by arguing INDUCTIVELY from the particular (the sample) to the general (the population). Statistical inference, therefore, is subject to error and inferences must be expressed in terms of probabilities.

Kinds of statistical inference
Estimation (of parameter values); Hypothesis testing

Estimates There are two types of estimate:
POINT ESTIMATES. For example, we might use the sample mean as an estimate of the value of the population mean. INTERVAL ESTIMATES. On the basis of sample data, we can specify a range of values within which we can assume with specified levels of CONFIDENCE that the population value lies. I discuss confidence intervals in the appendix to this talk.

‘Confirming’ our data Suppose that we have found, in our results, a pattern that we would like to confirm, such as a difference between means. Could this pattern have arisen merely through sampling error? Would another research team who collect data of this type obtain a similar result? Hypothesis testing can provide an answer to questions of this sort.

Statistical hypotheses
A statistical hypothesis is a statement about a population, usually to the effect that a parameter, such as the mean, has a specified value, or that the means of two or more populations have the same value or different values. Here, by “population” is always meant a probability distribution, a hypothetical population of numbers.

Two hypotheses In hypothesis testing, as widely practised at present by researchers, a decision is made between two complementary hypotheses, which are set up in opposition to each other: the null hypothesis (H0); the alternative hypothesis (H1).

The null hypothesis The null hypothesis (H0) is the statistical equivalent of the hypothesis of NO EFFECT, the negation of the scientific hypothesis. For example, if a researcher thinks a set of scores is from a population with a different mean than a control population, the null hypothesis will state that there is NO such difference. The alternative hypothesis (H1) is that the null hypothesis is false. In traditional statistical testing, it is the null hypothesis, not the alternative hypothesis, that is tested.

Number of samples Tests of some hypotheses can be made by drawing a single sample of scores. Other hypotheses, however, can only be tested by drawing two or more samples. It is easiest to consider the elements of hypothesis testing by considering one-sample tests first.

One-sample tests

Situation (a) The population standard deviation is known

An example Let us suppose that in the island of Erewhon, men’s heights have an approximately normal distribution with a mean of 69 inches and an SD of 3.2 inches. A researcher wonders whether there might be a tendency for those in the north of the island to be taller than the general population. A sample of 100 northerners has a mean height of 69.8 inches. Remembering that this is merely a sample from the population of northerners, do we have evidence that northerners are taller?

Steps in testing a hypothesis
Formulate the null and alternative HYPOTHESES. Decide upon a SIGNIFICANCE LEVEL. Decide upon an appropriate TEST STATISTIC. Decide upon the CRITICAL REGION, a range of “unlikely” values for the test statistic, that is, less probable than the significance level. If the value of the test statistic falls within the critical region, the null hypothesis is rejected.

The null and alternative hypotheses
The null hypothesis is that, contrary to the researcher’s speculation, the height of northerners is no different from that of the general population. The alternative hypothesis is that northerners are of different height.

Significance level The significance level is a small probability fixed by tradition. The significance level is commonly set at .05, but in some areas researchers insist upon a lower level, such as .01 . We shall set the level at .05 .

Revision We are talking about a situation in which a single sample has been drawn from a population. Here the reference set is the population or probability distribution of such samples, which is known as the SAMPLING DISTRIBUTION OF THE MEAN. Its SD is known as the STANDARD ERROR OF THE MEAN (σM).

Sampling distribution of the mean

The standard normal distribution
Questions about ranges of values in any normal distribution can always be referred to questions about corresponding values in the STANDARD NORMAL DISTRIBUTION. We do this by transforming the original values to values of z, the STANDARD NORMAL VARIATE.

The standard normal distribution
We transform the original value to z by subtracting the mean, then dividing by the standard deviation. In this case, we must divide by σM, not σ.

The test statistic Since we know the SD σ, we can use as our test statistic z, where the denominator is the STANDARD ERROR OF THE MEAN, that is, the SD of the sampling distribution of the mean.

The critical region We want the total probability of a value in the critical region to be .05, that is the significance level. We distribute that probability equally between the two tails of the distribution: .025 in each tail.

Calculate the value of z
Since this value falls within the critical region, the null hypothesis is rejected. We have evidence that the northerners are taller.
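The calculation behind this decision can be sketched in a few lines of Python. This is a minimal illustration using only the figures quoted in the Erewhon example (population mean 69, SD 3.2, n = 100, sample mean 69.8); the variable names are illustrative and not part of the talk.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 69.0    # population mean under H0 (inches)
sigma = 3.2    # known population SD (inches)
n = 100        # sample size
m = 69.8       # sample mean of the northerners (inches)

se_mean = sigma / sqrt(n)          # standard error of the mean
z = (m - mu_0) / se_mean           # test statistic

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed boundary, about 1.96
reject_h0 = abs(z) > z_crit

print(f"SE = {se_mean:.2f}, z = {z:.2f}, critical value = {z_crit:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The sample mean lies 2.5 standard errors above 69, beyond the two-tailed boundary of about 1.96.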

The p-value The p-value of a test statistic is the probability, assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as unlikely as the one obtained. The p-value must be clearly distinguished from the significance level (say .05): the significance level is fixed beforehand; but the p-value is determined by your own data.

Use of the p-value If the p-value is less than the significance level, the value of your test statistic must have fallen within the critical region. But the p-value tells you more than this. A high p-value means that the value of the test statistic is well short of being significant; whereas a low p-value means we are well over the line.

The one-tailed p-value
The ONE-TAILED p-value is the probability of a value of the test statistic at least as extreme (in the same direction) as the value actually obtained.

The one-tailed p-value
We obtain the one-tailed p-value by subtracting the cumulative probability of 2.5 from 1: = .0062

One-tailed and two-tailed p-values
If the region of rejection is located in both tails of the sampling distribution, as in the present example, a TWO-TAILED p-value must be calculated. We must DOUBLE the one-tailed p-value. If we didn’t do that, a value only marginally significant would seem to have a probability of only .025, not .05 as previously decided. So if the p-value in either direction is less than .025, the two-sided p-value is less than .05, and we have significance.

The two-tailed p-value of 2.5
We must now double the one-tailed p-value: .0062 × 2 = .0124
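As a rough check, both p-values for z = 2.5 can be obtained from the standard normal distribution in Python's standard library. This is only a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

z = 2.5
p_one_tailed = 1 - NormalDist().cdf(z)   # area in the upper tail beyond z
p_two_tailed = 2 * p_one_tailed          # equal area counted in both tails

print(f"one-tailed p = {p_one_tailed:.4f}")
print(f"two-tailed p = {p_two_tailed:.4f}")
```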

One-tailed tests

Directional hypothesis
Our researcher suspects that the northerners are TALLER, not simply that they are of DIFFERENT height. This is a DIRECTIONAL hypothesis. On this basis, it could be (and is) argued that the critical region, with a probability of .05, should be located entirely in the UPPER tail of the standard normal distribution.

Critical region for a one-tailed test

Comparison of the critical regions
If you are only interested in the possibility of a difference in ONE direction, you might decide to locate the critical region entirely in one tail of the distribution. (Figure: the two-tailed test places 0.025 (2.5%) in each tail; the one-tailed test places 0.05 (5%) in a single tail.)

Easier to get a significant result
Note that, on a one-tail test, you only need a z-value of 1.64 for significance, rather than a value of 1.96 for a two-tail test. So, on a one-tail test, it’s easier to get significance IN THE DIRECTION YOU EXPECT.
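The two critical values can be recovered from the inverse of the standard normal distribution function. The sketch below assumes an alpha-level of .05, as in the example; note that the one-tailed value is closer to 1.645, which the talk quotes as 1.64.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

one_tailed_crit = std_normal.inv_cdf(1 - alpha)       # all of alpha in the upper tail
two_tailed_crit = std_normal.inv_cdf(1 - alpha / 2)   # alpha split between the tails

print(f"one-tailed critical z = {one_tailed_crit:.3f}")
print(f"two-tailed critical z = {two_tailed_crit:.3f}")
```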

Errors in hypothesis testing

Type I errors Suppose the null hypothesis is true, but the value of z falls within the critical region. We shall reject the null hypothesis, but, in so doing, we shall have made a Type I or alpha (α) error. The probability of a Type I error is simply the chosen significance level and in our example its value is .05 .

Probability of a Type I error
Suppose H0 is true. If the value of z falls within either tail, we shall reject H0 and make a Type I error. The probability that we shall do this is the significance level, .05. Accordingly, the significance level is also referred to as the ALPHA-LEVEL.

Type II (beta) errors Suppose the null hypothesis is false.
The value of the test statistic, however, does not fall within the critical region and the null hypothesis is accepted. We have made a Type II or beta (β) error.

Power The POWER of a statistical test is the probability that, if the null hypothesis is false, it will be rejected by the statistical test. When the power of a test is low, an insignificant test result is impossible to interpret: there may indeed be nothing to report; but the researcher has no way of knowing this.

Two distributions The following diagram shows the relationships among the Type I error rate (the significance level), the Type II error rate and Power when the null hypothesis is tested against a one-sided alternative hypothesis that the mean has a higher value. (This is a one-tailed test.) The overlapping curves represent the sampling distributions of the mean under the null hypothesis (left) and the alternative hypothesis (right). In the diagram, μ0 and μ1 are the values of the population mean according to the null and alternative hypotheses, respectively.

Power and the type I and type II error rates

Points Any value of M to the left of the grey area will result in the acceptance of H0. If H1 is true (distribution on the right), a Type II error will have been made. Notice that the Power and Type II error rates sum to unity.
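To make the relationship between power and the Type II error rate concrete, here is a minimal sketch for an upper-tailed one-sample z test. The function name and the alternative mean of 69.5 are assumptions chosen purely for illustration; they do not come from the talk.

```python
from math import sqrt
from statistics import NormalDist

def power_one_tailed_z(mu_0, mu_1, sigma, n, alpha=0.05):
    """Power of an upper-tailed one-sample z test when the true mean is mu_1."""
    se = sigma / sqrt(n)
    # Smallest sample mean that falls in the critical region under H0.
    critical_mean = mu_0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability, under H1, that the sample mean exceeds that boundary.
    return 1 - NormalDist(mu_1, se).cdf(critical_mean)

# Hypothetical alternative: true mean 69.5 (assumed for illustration only).
power = power_one_tailed_z(mu_0=69.0, mu_1=69.5, sigma=3.2, n=100)
beta = 1 - power   # Type II error rate

print(f"power = {power:.3f}, beta = {beta:.3f}, sum = {power + beta:.1f}")
```

If the true mean equals the null value, the same function returns the significance level itself, which is the Type I error rate.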

Summary of errors and correct decisions

Factors affecting the power of a statistical test

Significance level and power
In the upper figure, the red area is the .05 significance level; the green area is the Type II error rate. The lower figure shows that a lower significance level (e.g. .01) reduces the probability of making a Type I error, but the probability of a type II error (green) increases and the power (P) decreases.

Size of the difference between μ1 and μ0.
The greater the difference between the real population mean and the value assumed by the null hypothesis, the less the overlap between the sampling distributions. The less the overlap, the greater will be the area of the H1 (right) curve beyond the critical value under H0 and the greater the power of the test to reject the null hypothesis. The researcher has no control over this determinant of power.

A small difference

A large difference

Sample size Now we come to another important determinant of the power of a statistical test: sample size. This is the factor over which the researcher usually has the most control.

Revision The larger the sample, the smaller the standard error of the mean and the taller and narrower will be the sampling distribution if drawn to the same scale as the original distribution.

Effect of increasing the sample size n
Sampling distributions of the mean for n = 16 and n = 64, shown against the IQ distribution and centred on μ.

Sample size When there are two samples, therefore, larger samples will result in greater separation of the sampling distributions, reduction in the Type II error rate and more power.

Power and sample size Increasing the sample size reduces the overlap of the sampling distributions under H0 and H1 by making them taller and narrower. The beta-rate is reduced and the power (green area) increases. (Panels: small samples; large samples.)
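The effect of sample size can be sketched numerically. The example below reuses the sample sizes n = 16 and n = 64 from the earlier figure; the IQ-style values (SD 15, null mean 100, true mean 105) and the function name are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tailed(mu_0, mu_1, sigma, n, alpha=0.05):
    """Power of an upper-tailed one-sample z test when the true mean is mu_1."""
    se = sigma / sqrt(n)
    critical_mean = mu_0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu_1, se).cdf(critical_mean)

# Assumed values: sigma = 15, H0 mean = 100, true mean = 105.
results = {}
for n in (16, 64):
    se = 15 / sqrt(n)                                  # standard error shrinks as n grows
    results[n] = (se, power_upper_tailed(100, 105, 15, n))
    print(f"n = {n:2d}: SE = {se:.3f}, power = {results[n][1]:.3f}")
```

Quadrupling the sample size halves the standard error, and the power rises accordingly.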

Reliability of measurement
Greater precision or RELIABILITY of measurement also reduces the standard error of the mean and improves the separation of the null and alternative distributions. The more separation between the sampling distributions, the greater the power of the statistical test. Jeanette Jackson will discuss the topic of reliability.

Situation (b) The population standard deviation is unknown

Very rarely do we know the population standard deviation

Vocabulary A researcher suspects that a new intake to a college of further education may require extra coaching to enrich their vocabulary. The College has been using a vocabulary test, students’ performance on which, over the years, has been found to have a mean of 50. The standard deviation is not known with certainty (estimates have varied and the records are incomplete); but the population distribution seems to be approximately normal. The 36 new students have vocabulary scores with a mean of 49 and a sample standard deviation of 2.4 . Is this evidence that their vocabulary scores are not from the usual College student population?

Sampling distribution of the mean

Estimating the standard error
When we don’t know σ, we must use the statistics of our sample to estimate the standard error of the mean.

The t statistic

In our example …

Distribution of t Like the standard normal variate z, the distribution of t has a mean of zero. The t statistic, however, is not normally distributed. Although, like the normal distribution, it is symmetrical and bell-shaped, it has thicker tails: that is, large absolute values of t are more likely than large absolute values of z.

The family of t distributions
There is only one standard normal distribution, to which any other normal distribution can be transformed; but there is a whole family of t distributions. A normal distribution has two parameters: the mean and the SD. A t distribution has ONE parameter, known as the DEGREES OF FREEDOM (df ).

Degrees of freedom The term is borrowed from physics.
The degrees of freedom of a system is the number of constraints that must be placed upon it to determine its state completely. By analogy, the variance of n scores is calculated from the squares of n deviations from the mean; but deviations from the mean sum to zero, so if you know the values of (n – 1) deviations, you know the nth deviation. The degrees of freedom of the variance is therefore (n – 1).

Degrees of freedom of t The degrees of freedom of the one-sample t statistic is (n – 1), where n is the size of the sample. This is the degrees of freedom of the variance estimate from the sample. In our case, the degrees of freedom of the t statistic = (n – 1) = 36 – 1 = 35. As the size of n increases, the t distribution becomes more and more like the standard normal distribution.

Extreme values of t are more likely than extreme values of z

The critical region Arguably, since the administrator’s concern is with low scores, we can justify a one-tailed test here and locate the critical region exclusively in the lower tail of the distribution of t on 35 degrees of freedom. We want the critical region to the left of the 5th percentile of the distribution in the lower tail.

Critical region for a one-tailed t test

Boundary of critical region for the t test lies further out in the tail
Notice that the boundary (–1.69) of the critical region lies further out in the lower tail than does the 5th percentile of the standard normal distribution (–1.64).

A significant result Our value of t (–2.5) lies within the critical region. The null hypothesis is therefore rejected and we have evidence that our sample is from a population with a mean score of less than 50.
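The arithmetic of the vocabulary example can be sketched as follows. The figures (mean 50, sample mean 49, sample SD 2.4, n = 36) and the critical value of –1.69 are those quoted in the talk; the variable names are illustrative.

```python
from math import sqrt

mu_0 = 50.0   # College mean under H0
m = 49.0      # mean of the new intake
s = 2.4       # sample standard deviation
n = 36        # number of new students

se_est = s / sqrt(n)        # estimated standard error of the mean
t = (m - mu_0) / se_est     # one-sample t statistic
df = n - 1                  # degrees of freedom

t_crit = -1.69              # lower 5% point of t on 35 df, as quoted in the talk
reject_h0 = t < t_crit      # one-tailed test in the lower tail

print(f"SE = {se_est:.2f}, t({df}) = {t:.2f}, reject H0: {reject_h0}")
```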

Two-sample tests

Results of the caffeine experiment

Is this a repeatable result?
The difference between the Caffeine and Placebo means is (11.90 – 9.25) = 2.65 hits. Could this difference have arisen merely from sampling error?

Independent samples The Caffeine experiment yielded two sets of scores - one set from the Caffeine group, the other from the Placebo group. There is NO BASIS FOR PAIRING THE SCORES. We have INDEPENDENT SAMPLES. We shall make an INDEPENDENT-SAMPLES t test.

The null hypothesis The null hypothesis states that, in the population, the Caffeine and Placebo means have the same value.

The alternative hypothesis
The alternative hypothesis states that, in the population, the Caffeine and Placebo means do not have the same value.

Revision We are talking about a situation in which two samples have been drawn from identical normal populations and the difference between their means M1 – M2 has been calculated. Here the reference set is the population or probability distribution of such differences, which is known as the SAMPLING DISTRIBUTION OF THE DIFFERENCE (between means). Its SD is known as the STANDARD ERROR OF THE DIFFERENCE.

Sampling distribution of the difference

The standard normal distribution
Questions about ranges of values in any normal distribution can always be referred to questions about corresponding values in the STANDARD NORMAL DISTRIBUTION. We do this by transforming the original values to values of z, the STANDARD NORMAL VARIATE.

If we knew the population standard deviation …

The standard normal distribution
We could transform the original value to z by subtracting the mean, then dividing by the standard deviation. In this case, we would divide by the standard error of the difference.

The test statistic We could have calculated z in the usual way:

But we don’t know σ! So we must estimate the standard error of the difference from the statistics of our samples:

The pooled variance estimate

Estimate of the standard error of the difference

The independent samples t statistic
The degrees of freedom of this t statistic is the sum of the dfs of the two sample variance estimates. In our example, df = n1 + n2 – 2 = 38.

The value of t

The critical region We shall reject the null hypothesis if the value of t falls within EITHER tail of the t distribution on 38 degrees of freedom. To be significant beyond the .05 level, our value of t must be greater than +2.02 OR less than –2.02. Since our value for t (2.60) falls within the critical region, the null hypothesis is rejected.
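The route from the group summaries to the value of t can be sketched in Python. The means and SDs are those reported for the caffeine experiment; the group size of 20 each is an assumption (it is consistent with df = 38 but is not stated in this transcript), and the function name is illustrative.

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t from summary statistics, using the pooled variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))   # estimated SE of the difference
    return (m1 - m2) / se_diff, df

# Assumption: 20 participants per group (not given in the transcript).
t, df = pooled_t(11.90, 3.28, 20, 9.25, 3.16, 20)

t_crit = 2.02   # two-tailed .05 critical value on 38 df, as quoted in the talk
print(f"t({df}) = {t:.2f}, reject H0: {abs(t) > t_crit}")
```

Under that assumption the sketch reproduces the reported value, t(38) = 2.60.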

The p-value The one-tailed p-value is (1 – the cumulative probability of the t value 2.60), that is, To obtain the 2-tailed p-value, we must double this value: 2 × =

Your report “The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p = (two-tailed).” The value in brackets after t, 38, is the degrees of freedom.

Representing very small p-values
Suppose, in the caffeine experiment, that the p-value had been very small indeed. (Suppose t = 6.0). The computer would have given your p-value as ‘.000’. NEVER write, ‘p = .000’. This is unacceptable in a scientific article. Write, ‘p < .01’, or ‘p < .001’. You would have written the present result as ‘ t(38) = 6.0; p < .001’ .

Lisa DeBruine’s guidelines
Lisa DeBruine has compiled a very useful document describing the most important of the APA guidelines for the reporting of the results of statistical tests. I strongly recommend this document, which is readily available on the Web. Sometimes the APA manual is unclear. In such cases, Lisa has opted for what seems to be the most reasonable interpretation. If you follow Lisa’s guidelines, your submitted paper won’t draw fire on account of poor presentation of your statistics!

A one-tailed test? The null hypothesis states simply that, in the population, the Caffeine and Placebo means are equal. H0 is refuted by a sufficiently large difference between the means in EITHER direction. But some argue that if our scientific hypothesis is that Caffeine improves performance, we should be looking at differences in only ONE direction.

Assumption In what follows, we shall assume that the researcher, on the basis of sound theory, has planned to make a one-tailed test . Accordingly, the critical region is located entirely in the upper tail of the distribution of t on 38 degrees of freedom.

The null hypothesis again
The null and alternative hypotheses must be complementary: that is, they must exhaust the possibilities. If the alternative hypothesis says that the Caffeine mean is greater, the null hypothesis must say that it is not greater: that is, it is equal to OR LESS THAN the Placebo mean.

A one-sided null hypothesis

Direction of subtraction
The direction of subtraction of one sample mean from the other is now crucial. You MUST subtract the Placebo mean from the Caffeine mean. Only a POSITIVE value of t can falsify the directional null hypothesis.

A smaller difference between the means
Suppose that the mean score of the Caffeine group had been, not 11.90, but The cell variances are the same as before. In other words, the Caffeine and Placebo means differ by only 1.74 points, rather than 2.65 points, as in the original example.

The critical region

The result The value of t is now 1.71, which is greater than the critical value (1.69) on a one-tailed test. The null hypothesis that the Caffeine mean is no greater than the Placebo mean is rejected.

Report of the one-tailed test
“The scores of the Caffeine group (M = 10.97; SD = 3.28) were significantly higher than those of the Placebo group (M = 9.25; SD = 3.16): t(38) = 1.71; p = (one-tailed).”

Our t value of 1.71 would have failed to achieve significance on the two-tailed test, since the critical value there was 2.02. On the one-tailed test, however, t lies in the critical region and the null hypothesis is rejected.

More power In locating the entire critical region in the upper tail of the H0 distribution, we increase the light-grey area and reduce the dark-grey area - the beta rate. In other words, we increase the POWER of the test to reject H0.

An unexpected result Now suppose that, against expectation, the Placebo group had outperformed the Caffeine group. The mean for the Caffeine group is 9.25 and that for the Placebo group is If we subtract the Placebo mean from the Caffeine mean as before, we obtain t = – 2.02. On a two-tailed test, this would have been in the critical region (p <.05) and we should have rejected the null hypothesis.

One-sided p-value We cannot, however, change horses and declare this unexpected result to be significant. In the one-tailed test, the null hypothesis is also one-sided. Accordingly, the p-value is also one-sided, that is, it is the probability that the (Caffeine – Placebo) difference would have been at least as LARGE in the positive direction as the one we obtained.

The one-sided p-value The one-sided p-value is the entire area under the curve TO THE RIGHT of your value of t. That area is You have nothing to report.

Correct report of the one-tail test
“The scores of the Caffeine group (M = 9.25; SD = 3.16) were not significantly higher than those of the Placebo group (M = 10.20; SD = 3.28): t(38) = ; p = (one-tailed) .”

Why you can’t change horses
Having decided upon a one-tailed test, you cannot change to a two-tailed test when you get a result in the opposite direction to that expected. If you do, the Type I error rate increases.

The true Type I error rate.
If you switch to a two-tailed test, your true Type I error rate is now the black area (0.05) PLUS the green area in the lower tail (0.025). This is 0.05 + 0.025 = 0.075, a level many would feel is too high. (See the OR rule in the appendix to my first talk.)
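A small simulation illustrates the inflated error rate. As a simplifying assumption, the standard normal distribution stands in for the t distribution on 38 degrees of freedom, and the seed and number of simulated experiments are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(1)                           # fixed seed so the run is repeatable
std_normal = NormalDist()
upper_crit = std_normal.inv_cdf(0.95)    # planned one-tailed boundary (upper 5%)
lower_crit = std_normal.inv_cdf(0.025)   # boundary borrowed from a two-tailed test

n_experiments = 100_000
false_rejections = 0
for _ in range(n_experiments):
    z = random.gauss(0, 1)               # test statistic when H0 is true
    # Reject as planned in the upper tail, OR switch horses in the lower tail.
    if z > upper_crit or z < lower_crit:
        false_rejections += 1

type_i_rate = false_rejections / n_experiments
print(f"Estimated Type I error rate: {type_i_rate:.3f}")
```

The estimate should settle close to .075 rather than the nominal .05.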

Effect size

A preoccupation with significance
For many years, following R. A. Fisher, the first to develop a system of testing, there was a preoccupation with significance and insufficient regard for the MAGNITUDE of the effect one was investigating. Fisher himself observed that, on a sufficiently powerful test, even the most minute difference will be statistically “significant”, however “insubstantial” it may be.

A ‘substantial’ difference?
We obtained a difference between the Caffeine and Placebo means of (11.90 – 9.25) = 2.65 score points. This difference, as we have seen, is “significant” in the statistical sense; but is it SUBSTANTIAL, that is, worth reporting?

Measuring effect size: Cohen’s d statistic

In our example …

Levels of effect size On the basis of scrutiny of a large number of studies, Jacob Cohen proposed that we regard a d of .2 as a SMALL effect size, a d of .5 as a MEDIUM effect size and a d of .8 as a LARGE effect size. So our experimental result is a ‘large’ effect. When you report the results of a statistical test, you are now expected to provide a measure of the size of the effect you are reporting.
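A minimal sketch of the effect-size calculation follows. It assumes that d is the difference between the means divided by the pooled standard deviation, and that the groups are of equal size; the formula slide is not reproduced in this transcript, so both are assumptions, as are the function and variable names.

```python
from math import sqrt

m_caffeine, sd_caffeine = 11.90, 3.28
m_placebo, sd_placebo = 9.25, 3.16

# With equal group sizes, the pooled variance is the mean of the two variances.
pooled_sd = sqrt((sd_caffeine**2 + sd_placebo**2) / 2)
d = (m_caffeine - m_placebo) / pooled_sd

def cohen_label(d_value):
    """Classify |d| using Cohen's benchmarks of .2, .5 and .8."""
    size = abs(d_value)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

print(f"d = {d:.2f} ({cohen_label(d)})")
```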

Cohen’s classification of effect size

“The scores of the Caffeine group (M = 11.90; SD = 3.28) were higher than those of the Placebo group (M = 9.25; SD = 3.16). With an alpha-level of 0.05, the difference is significant: t(38) = 2.60; p = (two-tailed). Cohen’s d = .82, a ‘large’ effect.”

Coffee break

The analysis of nominal data

Nominal data A NOMINAL data set consists of records of membership of the categories making up QUALITATIVE VARIABLES, such as gender or blood group. Nominal data must be distinguished from SCALAR, CONTINUOUS or INTERVAL data, which are measurements of QUANTITATIVE variables on an independent scale with units. Nominal data sets merely carry information about the frequencies of observations in different categories.

A set of nominal data A medical researcher wishes to test the hypothesis that people with a certain type of body tissue (Critical) are more likely to have a potentially harmful antibody. Data are obtained on 79 people, who are classified with respect to 2 attributes: Tissue Type; Presence/Absence of the antibody.

A question of association
Do more of the people in the critical group have the antibody? We are asking whether there is an ASSOCIATION between the variables of category membership (tissue type) and presence/absence of the antibody. The SCIENTIFIC hypothesis is that there is such an association.

The null hypothesis The NULL HYPOTHESIS is the negation of the scientific hypothesis. The null hypothesis states that there is NO association between tissue type and presence of the antibody.

Contingency tables (cross-tabulations)
When we wish to investigate whether an association exists between qualitative or categorical variables, the starting point is usually a display known as a CONTINGENCY TABLE, whose rows and columns represent the categories of the qualitative variables we are studying. Contingency tables are also known as CROSS-TABULATIONS, or CROSSTABS.

The equivalent of a scatterplot
The contingency table is the equivalent, for use with nominal data, of the scatterplot that is used to display bivariate continuous data sets.

A contingency table

Interpretation Is there an association between Tissue Type and Presence of the antibody? The antibody is indeed more in evidence in the ‘Critical’ tissue group. It looks as if there may be an association.

Some terms

Observed and expected cell frequencies
Let O be the frequency of observations in a cell of the contingency table. From the marginal totals, we calculate the cell frequencies E that we should expect if there were NO ASSOCIATION between the two attributes Tissue Type and Presence/Absence of the antibody.

Testing the null hypothesis
We test the null hypothesis by comparing the values of O and E. Large (O – E ) differences cast doubt upon the null hypothesis of no association.

What cell frequencies can be expected?
The pattern of the OBSERVED FREQUENCIES (O) would suggest that there is a greater incidence of the antibody in the Critical tissue group. But the marginal totals showing the frequencies of the various groups in the sample also vary. What cell frequencies would we expect under the independence hypothesis?

More terms

Expected cell frequencies (E)
According to the null hypothesis, the presence of the antibody and membership of a particular tissue type are independent events. The probability of the joint occurrence of independent events is the product of their separate probabilities. (See the appendix of the first talk.) On this basis, we find the expected frequencies (E) by multiplying together the marginal totals that intersect at the cells concerned and dividing by the total number of observations.
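The rule for the expected frequencies can be sketched as a short function. The counts below are hypothetical, invented only to show the arithmetic; they are not the tissue-type data, which are not reproduced in this transcript.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence: row total x column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical 2 x 2 table (rows: groups; columns: antibody No, Yes).
hypothetical_counts = [[10, 20],
                       [30, 40]]

expected = expected_frequencies(hypothetical_counts)
for row in expected:
    print(row)
```

The expected counts keep the same marginal totals as the observed counts.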

Formula for E

Example calculation of E

Marked (O – E ) differences
In both cells of the Critical group, there seem to be large differences between O and E: there are many fewer No’s than expected and many more Yes’s.

The chi-square (χ2) statistic
We need a statistic which compares the differences between the O and E, so that a large value will cast doubt upon the null hypothesis of independence. The approximate CHI-SQUARE (χ2) statistic fits the bill.

Formula for chi-square
The element of this summation expresses the square of the difference between O and E as a proportion of E. Add up these proportional squared differences for all the cells in the contingency table.

The value of chi-square
There are 8 terms in the summation, but only the first two and the last are shown in the calculation below.

Degrees of freedom To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df of the chi-square statistic. If a contingency table has R rows and C columns, the degrees of freedom is given by df = (R – 1)(C – 1). In our example, R = 4, C = 2 and so df = (4 – 1)(2 – 1) = 3.

Significance The p-value of a chi-square with a value of 10.66 in the chi-square distribution with three degrees of freedom is .014. We should write this result as: χ2(3) = 10.66; p = .014. Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
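The chi-square statistic, its degrees of freedom and the p-value can be sketched in Python. The table of counts is hypothetical and serves only to exercise the function. The p-value function uses a closed-form expression that holds for exactly three degrees of freedom; for other values of df a statistics package would be needed. All names are illustrative.

```python
from math import erfc, exp, pi, sqrt

def chi_square(observed):
    """Return (chi-square statistic, df) for a contingency table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            stat += (o - e) ** 2 / e                # squared difference as a proportion of E
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

def p_value_df3(x):
    """Upper-tail probability of a chi-square value x on exactly 3 df."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

# Hypothetical 2 x 2 counts, used only to exercise the function.
stat, df = chi_square([[10, 20], [30, 40]])
print(f"hypothetical table: chi-square({df}) = {stat:.3f}")

# The value reported in the talk, on 3 degrees of freedom.
p = p_value_df3(10.66)
print(f"chi-square(3) = 10.66; p = {p:.3f}")
```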

The odds and the odds ratio

A 2 × 2 contingency table

The odds

The odds and probability
Like the probability p, the odds is a measure of likelihood. The two measures are related according to odds = p / (1 – p).

Example of a calculation of the odds
The odds in favour of the antibody in the Critical group are

The odds ratio The ODDS RATIO (OR ) compares the odds in favour of an event between two groups.

Example Moving to the critical group multiplies the odds in favour of the antibody being present nearly five times.
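The odds and the odds ratio can be sketched as two small functions. The counts used here are hypothetical and are not the tissue-type data; the function names are illustrative.

```python
def odds(p):
    """Convert a probability into odds in favour of the event."""
    return p / (1 - p)

def odds_ratio(yes_a, no_a, yes_b, no_b):
    """Odds in favour of the event in group A divided by the odds in group B."""
    return (yes_a / no_a) / (yes_b / no_b)

# Hypothetical counts: group A has 20 Yes and 10 No; group B has 10 Yes and 20 No.
print(f"odds in group A = {20 / 10}")
print(f"odds in group B = {10 / 20}")
print(f"odds ratio      = {odds_ratio(20, 10, 10, 20)}")
print(f"a probability of .5 corresponds to odds of {odds(0.5)}")
```

An odds ratio above 1 means that the event is more likely in the first group than in the second.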

APPENDIX Confidence Intervals

Confidence intervals A CONFIDENCE interval is a range of values centred on the value of the sample statistic and which one can assume with a specified level of “confidence” includes the true value of the parameter.

Sampling distribution of the mean

Equivalent probability statement
An expression with terms such as < is known as an INEQUALITY. There are special rules for manipulating inequalities.

Notice that the population mean is now at the centre of the inequality and the sample mean is in the terms denoting the lower and upper limits of the interval. We have changed a statement about the sample mean to one about the population mean.

The 95% confidence interval on the sample mean
You can be 95% “confident” that the value of the population mean lies within this range.

Example A sample of 100 people has a mean height of 69.8 inches.
Suppose (very unrealistically) that we know that the population SD is 3.2 inches, but we don’t know the value of the population mean. Construct the 95% confidence interval on the mean.

The first step Calculate the standard error of the mean.

The 95% confidence interval
You can be 95% confident that the population mean lies within this range.

Using the confidence interval to test the null hypothesis
Notice that the 95% confidence interval on the mean, that is, [69.17, 70.43], does not include the value 69. If the confidence interval does not include the value specified by the null hypothesis, the hypothesis can be rejected. The two approaches lead to exactly the same decision about the null hypothesis.
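The interval can be reproduced with a short sketch, using the figures from the example (sample mean 69.8, population SD 3.2, n = 100); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

m = 69.8       # sample mean (inches)
sigma = 3.2    # population SD, assumed known
n = 100        # sample size
mu_0 = 69.0    # value specified by the null hypothesis

se_mean = sigma / sqrt(n)
z_crit = NormalDist().inv_cdf(0.975)     # about 1.96 for a 95% interval

lower = m - z_crit * se_mean
upper = m + z_crit * se_mean
h0_value_inside = lower <= mu_0 <= upper

print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
print(f"Null value {mu_0} inside the interval: {h0_value_inside}")
```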

Interpretation of a confidence interval
The 95% confidence interval on our sample mean is [69.17, 70.43]. We cannot say, “The probability that the mean lies between 69.17 and 70.43 is .95”. A confidence interval is not a sample space. (See the appendix to my first talk.) A classical probability refers to a hypothetical future. Here, the die has already been cast and either the interval fell over the population mean or it didn’t. In view of the manner in which the interval was constructed, however, we can be “95% confident” that it fell over the true value of the population mean.