Presentation on theme: "University of Warwick, Department of Sociology, 2012/13 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Week 4 Regression Analysis: Some Preliminaries..."— Presentation transcript:
University of Warwick, Department of Sociology, 2012/13 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Week 4 Regression Analysis: Some Preliminaries...
Lecture content Standard deviation Normal distribution z-scores t-statistics and tests Correlation (r) Variation explained (r 2 ) F-test Partial correlation
A key measure of spread Standard Deviation (SD) - and Variance This is linked with the mean, as it is based on averaging [squared] deviations from it. The variance is simply the standard deviation squared.
Why is the standard deviation so important? The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations. And it is also central to some of the statistical theory used in statistical testing/statistical inference...
An example of the calculation of a standard deviation Number of seminars attended by a sample of undergraduates: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5 Mean = 61/10 = 6.1 Variance = ((5 – 6.1) 2 + (4 – 6.1) 2 + (4 – 6.1) 2 + (7 – 6.1) 2 + (9 – 6.1) 2 + (8 – 6.1) 2 + (9 – 6.1) 2 + (4 – 6.1) 2 + (6 – 6.1) 2 + (5 – 6.1) 2 )/(10 – 1) = 36.9 /9 = 4.1 Standard deviation = Square root of variance = 2.025
Formula The standard deviation is given by the following formula Like the preceding example, the formula shows it to be the square root of the average of the squared differences of the values from the mean.
The Standard Normal (z) Distribution Total area under the curve = 1 or 100% Mean = 0 Standard deviation = 1
Question: How often are values in the standard normal distribution below a particular number, e.g (SDs)? The relative frequency of such values is equal to the area under the curve for the relevant range of values. The blue area is the relative frequency of the event z < This area appears to be approximately 85%-90%.
(Part of) a table of z-scores Z Tables showing z-scores are laid out differently. However they all will tell you the same thing. A common layout, is shown below. If we go across from 1.1 to 0.08, we find the score for It says If you wanted to know the proportion of cases that fell beyond 1.18 we can work it out as =
However … So far we’ve been working with a standard normal distribution. However this is not really that useful since most quantities do not have a mean of 0 nor a standard deviation of 1. But we can convert quantities from more specific forms of normal distributions into z-scores For instance: How likely is it for someone to get a mark of at least 75 in an exam, if the mean is 60 and the standard deviation is 10?
z-scores (e.g. z-statistics) In order to address these types of problems we use z-scores. z-scores are calculated by subtracting the mean from any value and dividing it by the standard deviation. z = x - mean s z-scores will always have a mean of 0 and a standard deviation of 1. We can quickly see that the first part of this is true, since when the value of x = the mean, the numerator will equal 0, and therefore z must = 0. It may be a little less clear that the second part of this is true. However if you think about the instance when x is one standard deviation bigger than the mean (i.e. x = mean + s) z = (mean + s) - mean= s = 1 s
Judging whether differences occur by chance… How do we judge whether it is plausible that two population means are the same and that any difference between them simply reflects sampling error? Example: Household size of minority ethnic groups (HOH = Head of household; data adapted from 1991 Census) 1.The size of the difference between the two sample means Mean Indian HOH: 3.0 Bangladeshi HOH:5.0 Mean Indian HOH: 3.0 Pakistani HOH:4.0 The first difference is more ‘convincing’
The logic of a statistical test… The statistical test that we have looked at so far: Chi-square test: Are the observed frequencies in a table sufficiently different from what one would have expected to have seen (i.e. the expected frequencies) if there was no relationship in the population for the idea that there is no relationship in the population to be implausible? The more general testing approach asks whether the difference between the actual (observed) data and what one would have expected to have seen given a particular hypothesis is sufficiently large that the hypothesis is implausible.
…e.g. this logic applies to comparing sample means. If the two samples came from populations with identical population means, then one would expect the difference between the sample means to be (close to) zero. Thus, The larger the difference between the two sample means, the more implausible is the idea that the two population means are identical. The plausibility of the idea is also affected by: Sample size The extent to which there is variation within each group (or sample). The logic of a statistical test…
2. The sample sizes of the two samples Mean Pakistani HOH: Bangladeshi HOH: Mean Pakistani HOH: Bangladeshi HOH: The second difference is more ‘convincing’ Judging whether differences occur by chance…
3. The amount of variation in each of the two groups (samples) Mean Pakistani HOH: Bangladeshi HOH: Mean Pakistani HOH: Bangladeshi HOH: The second difference is more ‘convincing’. Judging whether differences occur by chance…
Example continued: the impact of variability on the difference of means. The three graphs each show two groups with the same mean difference. However the groups in each of the three graphs have different levels of variability. Where there is lower variability there is less cross-over between the groups, and so the difference of the means reflects a more pertinent difference (virtually no one in group A has the same score as anyone in group B).
As we’ll see these three things – the size of the difference between the means, the sample sizes, and the amount of variation (measured by the standard deviations) within the samples – are critical to our determination (test) of whether the difference we observe between samples is likely to reflect no actual difference in the population between the groups being compared. But first let’s think about how single sample means behave... Judging whether differences occur by chance…
“If repeated (simple random) samples of size N are drawn from a normally distributed population, the means of such samples will be normally distributed with mean and standard error / N... if the N of each sample drawn is large, then regardless of the shape of the population distribution the sample means will tend to distribute themselves normally with mean and standard error / N”. Reminder: = population mean = population standard deviation N = number in sample A formal theorem:
Combining that with what we know about the normal distribution: 95% of cases lie within +/ standard deviations of the mean in a normal distribution. The distribution of sample means is normal. And the standard error of sample means is approximately / n Therefore 95% of sample means fall into the range: ( / n) to ( / n) FrequencyFrequency (population mean) 1.96 / n 95% of sample means 2.5% of sample means Sample mean
What do we do if the sample size is too small for us to safely assume that the sample mean has a normal distribution? When a sample is small (i.e. less than about 25) the assumption that the sample mean is normally distributed is not reasonable. In fact, regardless of sample size, the sample mean can be assumed to have a t-distribution; the precise shape of a t-distribution depends on the sample size, and for moderate to large sample sizes the t-distribution is very similar to (and, as the sample size approaches infinity, eventually converges with) the normal distribution.
The procedure for identifying a statistic which has a statistically significant value remains very similar to the one for the normal distribution. The only difference is that the ‘magic number’ 1.96 is replaced by a slightly larger number, whose magnitude gets bigger as the sample size gets smaller. Thus for a sample size of 25, 1.96 is replaced by 2.06 and for a sample size of 15 it is replaced by (you can see a t-distribution table printed in many statistics textbooks) However another problem arises with small samples. The distribution of sample means can be asymmetric; in fact, the assumption that the sample mean has a t-distribution is only reasonable for small samples if the distribution of the variable under consideration is approximately normal. Small samples continued…
What does a two sample t-test measure? Note: In the above, T = treatment group and C = control group (as in experimental research). In most discussions these will just be shown as groups 1 and 2, indicating different groups.
If we take a large number of random samples from two groups with the same population mean and calculate the difference between each pair of sample means, we will end up with a sampling distribution with the following properties: It will be a t-distribution The mean of the difference between sample means will be zero Mean M 1 - M 2 = 0 The spread of scores around this mean of zero (the standard error) will be defined by the formula: The left-hand square brackets contain the pooled variance estimate
Example Descriptive statistic Australian sample British sample Mean166 minutes187 minutes Standard deviation29 minutes30 minutes Sample size20 We want to compare the average amount of television watched per night by Australian and by British children. We are making an inference from two samples, which are being compared in terms of an interval-ratio variable – hours of TV watched. Therefore we select the two sample t-test for the equality of means as the relevant test of significance. Table 1. Descriptive statistics for the samples
t-test of independent means: formulae Where:M = mean S DM = Standard error of the difference between means N = number of subjects in group s = Standard Deviation of group df = degrees of freedom Note: = N 1 + N 2 N 1 N 2 N 1 N 2
Descriptive statistic Australian sample British sample Mean166 minutes187 minutes Standard deviation 29 minutes30 minutes Sample size20 S DM = (20-1) (20-1) = – 2 20 x 20 t sample = 166 – 187 = – Calculating the t-statistic
Obtaining a p-value for a t-statistic To obtain the p-value for this t-statistic we could consult a table of critical values for the t-distribution The number of degrees of freedom we refer to in the table in the combined sample size minus two: df = N 1 + N 2 – 2 Here that is – 2 = 38 If a table doesn’t have a row of probabilities for 38, we can refer to the row for the nearest reported number of degrees of freedom below the desired number, which might be 35. For 35 degrees of freedom and a two-tailed test, the t- statistic falls between the critical values of t for p=0.05 and p=0.01, i.e and The p-value is therefore probably (given the 35 for 38 swap) between 0.01 and 0.05, and certainly less than Therefore the p-value is statistically significant at a 0.05 (5%) level but not at a 0.01 (1%) level.
Reporting the results We can say that: The mean number of minutes of TV watched by the sample of 20 British children is 187 minutes, which is 21 minutes higher than the mean for the sample of 20 Australian children, and this difference is statistically significant at the 0.05 (5%) level (t(38)= -2.3, p = 0.03, two-tailed). Based on these results we can reject the hypothesis that British and Australian children watch the same average amount of television per night.
Moving on to correlation Measures of association quantify the (strength of the) association between two variables. The most well-known of these is the Pearson correlation coefficient, often referred to as ‘the correlation coefficient’, or even ‘the correlation’. This quantifies the closeness of the relationship between two interval-level variables (scales).
Spearman correlation coefficient If we have measures that are clearly ordinal but not necessarily interval-level, we can use the Spearman correlation coefficient instead, as this looks at the relationship between the ranks of the values rather than the values per se. This is also a non-parametric measure, which does not make assumptions about distributions when used as the basis of a test for a relationship.
Other measures of association (for cross-tabulations) In a literature review more than thirty years ago, Goodman and Kruskal identified several dozen of these, including Cramér’s V: Goodman, L.A. and Kruskal, W.H Measures of association for cross classifications. New York, Springer-Verlag. … and I added one of my own, Tog, which measures inequality (in a particular way) where both variables are ordinal… One of Tog’s (distant) relatives
The Correlation Coefficient (r) Age at First Childbirth Age at First Cohabitation This shows the strength/closeness of a relationship r = 0.5 (or perhaps less…)
r = + 1 r = -1 r = 0
Working out the correlation coefficient (Pearson’s r) Pearson’s r tells us how much one variable changes as the values of another changes – their covariation. Variation is measured with the standard deviation. This measures average variation of each variable from the mean for that variable. Covariation is measured by calculating the amount by which each value of X varies from the mean of X, and the amount by which each value of Y varies from the mean of Y and multiplying the differences together and finding the average (by dividing by n-1). Pearson’s r is calculated by dividing this by (SD of x) times (SD of y) in order to standardize it.
Example of calculating a correlation coefficient (corresponding to the last slide) X = 9, 10, , 19 Mean(X) = 15.3 Y = 55, 59, , 60 Mean(Y) = 60.1 (9-15.3)(55–60.1) + (10–15.3)(59–60.1)... etc = 85.2 (Covariation) S.D. (X) = 2.71 ; S.D. (Y) = 3.65 (85.2 / (27-1)) / (2.71 x 3.65) = 0.331
So what? The correlation coefficient varies between -1 and 1, so signifies a moderately strong (!!), positive relationship between essay mark and seminar attendance. However, how do we know whether a correlation coefficient of is convincing evidence of a genuine underlying relationship? The answer is, as usual, that we need to test how likely a correlation of would be to occur given that there was no underlying relationship (i.e. how likely it would be that a value as big as would occur simply as a consequence of sampling error).
Variation explained One can view seminar attendance as ‘explaining’, in part, the variation in the essay mark. In fact, the proportion of the variation in essay mark explained by seminar attendance is equal to the correlation coefficient squared, i.e x0.331 = A reminder that the correlation coefficient is often labelled r, hence the variation explained is often referred to as r 2, or r-squared, or, more specifically, the coefficient of determination.
Variation and its sources We can now split up the variation in essay mark of the variation in essay mark is explained by seminar attendance This is explained variation. Hence of the variation is left unexplained ( = 0.890) This is unexplained variation We have used 1 variable (seminar attendance) to explain essay mark, hence we have ‘used up’ 1 degree of freedom (source of variation). There are 27 cases, hence there are 26 degrees of freedom overall. Since we have used 1 degree of freedom when we explained some of the variation in essay mark using seminar attendance, there are 25 degrees of freedom corresponding to the unexplained variation.
Using r as the basis of a test statistic Hence the explained proportion (0.110) of the variation corresponds to 1 degree of freedom, and the unexplained proportion (0.890) of the variation corresponds to 25 degrees of freedom. Variation d.f. Variation/d.f. Ratio Explained Unexplained
...which turns out to be an F-statistic The ratio of to is If there were no underlying relationship between essay mark and seminar attendance, then we would expect the 1 degree of freedom corresponding to seminar attendance to have explained approximately its ‘fair share’ of the variation in essay mark, rather than 3.08 times its ‘fair share’! The figure of 3.08 is in fact an F-statistic, since the ratio of the variation per degree of freedom for explained variation to the variation per degree of freedom for unexplained variation has an F-distribution (assuming that there is no underlying relationship). The relevant F-distribution is the one corresponding to the degrees of freedom of the explained variation and the degrees of freedom of the unexplained variation (i.e. 1 and 25 in this example).
F-distributions F-distributions for 1 d.f. and N d.f. do not look dissimilar for values of N of 10 upwards, but the right-hand tail becomes slightly more shallow.
Doing an F-test We now need to compare the figure of 3.08 with the critical value at the 5% level for an F-statistic with 1 degree of freedom and 25 degrees of freedom, i.e (Such values can be obtained from tables of critical values of F, at the back of many stats textbooks). Since ). Hence the data for the 27 students does not provide statistically significant evidence of a relationship between essay mark and seminar attendance.
Correlation… and Regression r measures correlation in a linear way … and is connected to linear regression More precisely, it is r 2 (r-squared) that is of relevance It is the ‘variation explained’ by the regression line... Although the relationship between r-squared and the basic correlation coefficient disappears in multiple regression.
Partial correlations Partial correlations allow us to measure the association between two variables controlling for one or more others. They allow us to test whether variables are significantly correlated net of such other factors. As such they allow us to see whether one variable adds significantly to the variation explained in a focal variable of interest by other factors. More on this topic on Wednesday!