Quantitative Methods in Social Research 2010/11: Week 5 (morning) session, 11th February 2011. Descriptive Statistics
Some relevant online course extracts: Diamond and Jefferies (2001) Chapter 5 - Measures and displays of spread. Sarantakos (2007) Chapter 5 - Graphical displays. Huizingh (2007) Chapter SPSS material.
Descriptive statistics... are data summaries which provide an alternative to graphical representations of distributions of values; aim to describe key aspects of distributions of values; and are of most relevance when we are thinking about interval-level variables (scales).
Types of descriptive statistics: measures of location (averages), spread, skewness (asymmetry) and kurtosis. We typically want to know about the first two, sometimes about the third, and rarely about the fourth!
What is ‘kurtosis’ anyway? Increasing kurtosis is associated with the “movement of probability mass from the shoulders of a distribution into its center and tails.” (Balanda, K.P. and MacGillivray, H.L. (1988) ‘Kurtosis: A Critical Review’, The American Statistician 42(2): 111–119.) [Figure: distributions with kurtosis increasing from left to right.]
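To make skewness and kurtosis a little more concrete, here is a minimal sketch of the simple moment-based versions of both measures. The two data sets are invented for illustration, and the function names are our own. Packages such as SPSS report sample-adjusted versions, so their figures will differ slightly from these, especially in small samples.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


# Hypothetical data, chosen only to show the direction of the measures.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]  # one large value creates a long right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name, round(skewness(data), 2), round(excess_kurtosis(data), 2))
```

A positive skewness indicates a longer right-hand tail; a negative value indicates a longer left-hand tail.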
Measures of location: Mean (the arithmetic average of the values, i.e. the result of dividing the sum of the values by the total number of cases). Median (the middle value, when the values are ranked/ordered). Mode (the most common value).
... and measures of spread: Standard deviation (and variance): this is linked with the mean, as it is based on averaging [squared] deviations from it. The variance is simply the standard deviation squared. Interquartile range / quartile deviation: these are linked with the median, as they are also based on the values placed in order.
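The three measures of location can be sketched with Python's standard `statistics` module. The data are the seminar-attendance figures used in the standard deviation example later in this session; the variable names are our own.

```python
import statistics

# Number of seminars attended by a sample of ten undergraduates.
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

mean = statistics.mean(seminars)      # sum of the values / number of cases
median = statistics.median(seminars)  # middle value once the values are ordered
mode = statistics.mode(seminars)      # most common value

print(mean, median, mode)
```

With an even number of cases there is no single middle value, so the median is taken as the average of the two middle values (here 5 and 6).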
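The median-based measures of spread can be sketched in the same way. This assumes the usual definition of the quartile deviation as half the interquartile range. Note that there are several conventions for locating quartiles, so the exact figures from `statistics.quantiles` may differ a little from those reported by SPSS or worked out by hand.

```python
import statistics

# Seminar-attendance figures used elsewhere in this session.
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

# n=4 asks for the three cut points that split the ordered values into quarters.
q1, median, q3 = statistics.quantiles(seminars, n=4)

iqr = q3 - q1                 # interquartile range: spread of the middle half
quartile_deviation = iqr / 2  # also called the semi-interquartile range

print(q1, median, q3, iqr, quartile_deviation)
```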
Measures of location and spread: an example (household size). West Midlands: Mean = 2.94, Median = 2, Mode = 2; s.d. = 1.93, skewness = 2.10, kurtosis = 5.54. London: Mean = 2.96, Median = 3, Mode = 2; s.d. = 1.58, skewness = 1.27, kurtosis = 2.24.
Why is the standard deviation so important? The standard deviation (or, more precisely, the variance) is important because it introduces the idea of summarising variation in terms of summed, squared deviations. And it is also central to some of the statistical theory used in statistical testing/statistical inference...
An example of the calculation of a standard deviation. Number of seminars attended by a sample of undergraduates: 5, 4, 4, 7, 9, 8, 9, 4, 6, 5. Mean = 61/10 = 6.1. Variance = ((5 – 6.1)² + (4 – 6.1)² + (4 – 6.1)² + (7 – 6.1)² + (9 – 6.1)² + (8 – 6.1)² + (9 – 6.1)² + (4 – 6.1)² + (6 – 6.1)² + (5 – 6.1)²)/(10 – 1) = 36.9/9 = 4.1. Standard deviation = square root of variance = 2.025.
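The same calculation can be reproduced step by step in Python, as a check on the arithmetic. The variable names are our own; the final line compares the hand calculation with the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

n = len(seminars)
mean = sum(seminars) / n                                  # 61 / 10

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in seminars]

# Sample variance: summed squared deviations divided by (n - 1).
variance = sum(squared_deviations) / (n - 1)
standard_deviation = math.sqrt(variance)

print(round(sum(squared_deviations), 1), round(variance, 1),
      round(standard_deviation, 3))
print(math.isclose(standard_deviation, statistics.stdev(seminars)))
```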
The Empire Median Strikes Back! Comparing descriptive statistics between groups can be done graphically in a rather nice way using a form of display called a ‘boxplot’. Boxplots are based on medians and quartiles rather than on the more commonly found mean and standard deviation.
Example of a boxplot
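The numbers behind a boxplot can be sketched as a five-number summary for each group: the box runs from the lower to the upper quartile, with a line at the median. The two groups below are invented for illustration, and `five_number_summary` is our own helper. Plotting software usually also flags very extreme values as outliers rather than extending the whiskers to them, which this sketch does not attempt.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


# Hypothetical scores for two groups to be compared.
groups = {
    "group_a": [2, 3, 3, 4, 5, 6, 9],
    "group_b": [1, 2, 2, 3, 3, 4, 4],
}

summaries = {name: five_number_summary(values)
             for name, values in groups.items()}

for name, summary in summaries.items():
    print(name, summary)
```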
Moving on to bivariate ‘descriptive statistics’... These are referred to as ‘measures of association’, as they quantify the (strength of the) association between two variables. The most well-known of these is the (Pearson) correlation coefficient, often referred to as ‘the correlation coefficient’, or even ‘the correlation’.
Positive and negative relationships. Positive or direct relationships: if the points cluster around a line that runs from the lower left to the upper right of the graph area, then the relationship between the two variables is positive or direct. An increase in the value of x is more likely to be associated with an increase in the value of y. The closer the points are to the line, the stronger the relationship. Negative or inverse relationships: if the points tend to cluster around a line that runs from the upper left to the lower right of the graph, then the relationship between the two variables is negative or inverse. An increase in the value of x is more likely to be associated with a decrease in the value of y.
Working out the correlation coefficient (Pearson’s r). Pearson’s r tells us how closely changes in one variable go together with changes in another: their covariation. Variation is measured with the standard deviation, which summarises how far, on average, the values of a variable lie from the mean for that variable. Covariation is measured by calculating the amount by which each value of X differs from the mean of X and the amount by which each value of Y differs from the mean of Y, multiplying each pair of differences together, and finding the average of the products (by dividing their sum by n-1). Pearson’s r is calculated by dividing this by (SD of X) × (SD of Y) in order to standardize it.
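The steps described above can be sketched as a short Python function. The x and y values are invented for illustration and `pearson_r` is our own helper, not a library function.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariation divided by (SD of x) times (SD of y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Covariation: average product of paired deviations (dividing by n - 1).
    covariation = sum((x - mean_x) * (y - mean_y)
                      for x, y in zip(xs, ys)) / (n - 1)

    # Standard deviations, also dividing by n - 1.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))

    return covariation / (sd_x * sd_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 3))
```

For these values the covariation is 1.5 and the two standard deviations multiply to about 1.94, giving an r of about 0.77: a fairly strong positive relationship.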
Working out the correlation coefficient (Pearson’s r). Because r is standardized it will always fall between -1 and +1. A correlation of either +1 or -1 means a perfect linear association between the two variables. A correlation of 0 means that there is no linear association. Note: correlation does not mean causation. We can only investigate causation by reference to our theory. However (thinking about it the other way round) there is unlikely to be causation if there is not correlation.
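The limiting values of r can be illustrated with a few invented data sets. The sketch below uses our own compact `pearson_r` helper (the n - 1 divisors cancel, so they are left out). The last case shows that r can be 0 even when the two variables are clearly related, because r only picks up straight-line relationships.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from summed products and squares of deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]    # points lie exactly on a rising line
falling = [10 - 3 * v for v in x]  # points lie exactly on a falling line

x_centred = [-2, -1, 0, 1, 2]
curved = [v ** 2 for v in x_centred]  # perfect, but U-shaped, relationship

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_curved = pearson_r(x_centred, curved)

print(round(r_rising, 3), round(r_falling, 3), round(r_curved, 3))
```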
Another measure of association. Later in the module we will look at the calculation of a different measure of association for cross-tabulated variables, Cramér’s V. Like the Pearson correlation coefficient, it has a maximum of 1, and 0 indicates no relationship, but it cannot take negative values, and it makes no assumption of linearity.
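As a preview, here is a minimal sketch of Cramér's V, assuming the standard formula: the square root of chi-square divided by (n × the smaller of rows - 1 and columns - 1). The cross-tabulations are invented for illustration and `cramers_v` is our own helper; the calculation itself is covered properly later in the module.

```python
import math


def cramers_v(table):
    """Cramér's V for a cross-tabulation given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Chi-square: compare each observed count with the count expected
    # if the two variables were unrelated.
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected

    smaller_dimension = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square / (n * smaller_dimension))


# Hypothetical 2 x 2 cross-tabulations.
moderate = [[10, 20], [20, 10]]
perfect = [[30, 0], [0, 30]]
unrelated = [[10, 20], [20, 40]]

for name, table in [("moderate", moderate), ("perfect", perfect),
                    ("unrelated", unrelated)]:
    print(name, round(cramers_v(table), 3))
```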