Data analysis: Explore GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 9
Objectives
To define a standard set of descriptive statistics used to analyse continuous variables
To examine the Explore facility in SPSS
To introduce the analysis of a continuous variable according to values of a categorical variable, an example of bivariate analysis
To introduce further SPSS Help options
To reinforce the use of SPSS syntax
SPSS Descriptive Statistics Analyse/Descriptive Statistics/Frequencies Analyse/Descriptive Statistics/Explore Analyse/Descriptive Statistics/Descriptives
Exercise: continuous variable Generate a set of standard summary statistics for the continuous variable Age
Explore: Age
Explore: Descriptive Statistics
Descriptives for AGE (columns: Statistic, Std. Error)
Mean
95% Confidence Interval for Mean, Lower Bound: 31.16
95% Confidence Interval for Mean, Upper Bound
5% Trimmed Mean: 31.31
Median: 31.00
Variance
Std. Deviation
Minimum: 1
Maximum: 77
Range: 76
Interquartile Range: 20.00
Skewness
Kurtosis
Exercise: Help What’s This? Results Coach Case Studies
Measures of central tendency Most commonly: –Mode –Median –Mean 5 per cent trimmed mean
The mode The mode is the most frequently occurring value in a dataset Suitable for nominal data and above Example: –The mode of the first most frequently used drug is Alcohol, with 717 cases, approximately 46 per cent of valid responses
Bimodal Describes a distribution Two categories have a large number of cases Example: –The distribution of Employment is bimodal, employment and unemployment having a similar number of cases and more cases than the other categories
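The mode and a two-way tie for the mode can be illustrated with a minimal Python sketch. The responses below are hypothetical, invented for illustration, and are not the survey data behind the figures quoted above; the variable names are assumptions as well.

```python
from collections import Counter
from statistics import multimode

# Hypothetical responses, invented for illustration (not the survey data).
first_drug = ["Alcohol", "Cannabis", "Alcohol", "Heroin", "Alcohol", "Cannabis"]
employment = ["Employed"] * 4 + ["Unemployed"] * 4 + ["Student"]

# The mode only needs counting, so it works for nominal data.
counts = Counter(first_drug)
modes = multimode(first_drug)            # list of the most frequent value(s)
mode_share = 100 * counts[modes[0]] / len(first_drug)
print(modes, counts[modes[0]], mode_share)

# multimode returns every value tied for the highest count.
employment_modes = multimode(employment)
print(employment_modes)
```

Note that `multimode` reports exact ties only. A distribution is usually called bimodal when two categories have similar, not necessarily identical, counts, so in practice the full frequency table (here `Counter(employment).most_common()`) should be inspected.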
The median The middle value when the data are ordered from low to high is the median Half the data values lie below the median and half above The data have to be ordered so the median is not suitable for nominal data, but is suitable for ordinal levels of measurement and above
Example: median
Seizures of opium in Germany, 1994-1998 (Kilograms)
Source: United Nations (2000). World Drug Report 2000 (United Nations publication, Sales No. GV.E ).
Year Seizure
Sort the seizure data in ascending order
The middle value is the median; the median annual seizure of opium in Germany between 1994 and 1998 was 42 kilograms
Year Seizure Ranked
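The sort-and-pick-the-middle procedure can be sketched in Python as follows. The five values are hypothetical and are not the German seizure figures; the names are assumptions made for this example.

```python
from statistics import median

# Hypothetical annual values in kilograms (NOT the German seizure data).
seizures_kg = [35, 12, 90, 48, 20]

# Step 1: rank the values from low to high.
ranked = sorted(seizures_kg)

# Step 2: with an odd number of values, the median is the middle one.
middle = ranked[len(ranked) // 2]

print(ranked)
print(middle, median(seizures_kg))  # the library function gives the same value
```

With an even number of values there is no single middle value; `statistics.median` then returns the average of the two middle values.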
The mean
Add the values in the data set and divide by the number of values
The mean is only truly applicable to interval and ratio data, as it involves adding the values
It is sometimes applied to ordinal data or to ordinal scales constructed from a number of Likert items, but this requires the assumption that the difference between adjacent values in the scale is the same, e.g. that the difference between 1 and 2 is the same as that between 5 and 6
Example: mean
Seizures of opium in Germany, 1994-1998
Sample size = 5
Mean = 424/5 = 84.8
Year Seizure
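The same arithmetic, sum divided by count, is sketched below in Python. The values are hypothetical, chosen so that one large value shows how the mean reacts to an extreme observation while the median does not.

```python
from statistics import mean, median

# Hypothetical values (NOT the German seizure data); 100 is an extreme value.
values = [10, 20, 30, 40, 100]

total = sum(values)
sample_size = len(values)
mean_by_hand = total / sample_size   # add the values, divide by their number

print(total, sample_size, mean_by_hand)
print(mean(values), median(values))  # the mean is pulled above the median
```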
The 5 per cent trimmed mean The 5 per cent trimmed mean is the mean calculated on the data set with the top 5 per cent and bottom 5 per cent of values removed An estimator that is more resistant to outliers than the mean
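A minimal sketch of a trimmed mean in Python is given below. It assumes the simplest rule, dropping a whole number of cases from each end; statistical packages may treat fractional cases differently, so results need not match SPSS exactly. The function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(values, percent=5):
    """Mean after dropping `percent` per cent of cases from each end.

    Assumption: the number of cases dropped is rounded down to a whole number.
    """
    ordered = sorted(values)
    k = len(ordered) * percent // 100          # cases to drop at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data: the values 1 to 19 plus one outlier of 200.
data = list(range(1, 20)) + [200]

ordinary_mean = sum(data) / len(data)
resistant_mean = trimmed_mean(data)
print(ordinary_mean, resistant_mean)
```

With 20 cases, one value is removed from each end, so the outlier of 200 no longer influences the result.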
95 per cent confidence interval for the mean
An indication of the expected error (precision) when estimating the population mean with the sample mean
In repeated sampling, the interval calculated around each sample mean will contain the population mean about 95 times out of 100
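The repeated-sampling interpretation can be checked with a small simulation, sketched here in Python. Everything in it is an assumption made for illustration: the population is a hypothetical normal distribution, the helper `mean_ci` is not an SPSS function, and the interval uses a large-sample normal approximation (a t-based interval would be slightly wider for small samples).

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95 per cent
    centre = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return centre - half_width, centre + half_width

# Hypothetical population, chosen only for the simulation.
POP_MEAN, POP_SD = 31.0, 12.0
rng = random.Random(0)                          # fixed seed for repeatability

trials, hits = 1000, 0
for _ in range(trials):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(50)]
    lower, upper = mean_ci(sample)
    if lower <= POP_MEAN <= upper:
        hits += 1

coverage = hits / trials
print(coverage)   # expected to be close to 0.95
```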
Measures of dispersion The range The inter-quartile range The variance The standard deviation
The range A measure of the spread of the data Range = maximum – minimum
Quartiles
1st quartile: 25 per cent of the values lie below the value of the 1st quartile and 75 per cent above
2nd quartile: the median: 50 per cent of the values lie below and 50 per cent above
3rd quartile: 75 per cent of the values lie below and 25 per cent above
Inter-quartile range
IQR = 3rd quartile – 1st quartile
The inter-quartile range measures the spread of the middle 50 per cent of the data
Requires an ordinal level of measurement or above
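The range, the quartiles and the inter-quartile range can be computed as in the Python sketch below. The ages are hypothetical. Several quartile definitions are in use, so the values from `statistics.quantiles` (default "exclusive" method) may differ slightly from those reported by SPSS or other packages.

```python
from statistics import quantiles

# Hypothetical ages, invented for illustration.
ages = [18, 22, 25, 31, 38, 44, 60]

# Range = maximum - minimum
value_range = max(ages) - min(ages)

# n=4 cuts the ordered data into four parts and returns the three quartiles.
q1, q2, q3 = quantiles(ages, n=4)

# IQR = 3rd quartile - 1st quartile: the spread of the middle half of the data
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```

The single value 60 raises the range considerably, while the inter-quartile range depends only on the middle half of the data.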
Variance The average squared difference from the mean Measured in units squared Requires interval or ratio levels of measurement
Standard deviation The square root of the variance Returns the units to those of the original variable
Example: standard deviation and variance
Seizures of opium in Germany, 1994-1998
Columns: Year, Seizure, Deviations, Squared deviations
Total
Count: 5 (Seizure), 5 (Squared deviations)
Mean: 84.8
Variance: 10230
Standard deviation: 101
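The deviations, squared deviations, variance and standard deviation columns can be reproduced step by step, as in this Python sketch. The values are hypothetical, not the seizure data. Two divisors are shown because both occur in practice: dividing by n gives the average squared deviation, while statistical software commonly divides by n - 1 when the data are a sample.

```python
from math import sqrt
from statistics import pvariance, variance

# Hypothetical values (NOT the German seizure data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean_value = sum(values) / n
deviations = [x - mean_value for x in values]       # value minus mean
squared_deviations = [d ** 2 for d in deviations]   # in units squared
total_squared = sum(squared_deviations)

variance_n = total_squared / n               # average squared deviation
variance_n_minus_1 = total_squared / (n - 1) # sample variance
standard_deviation = sqrt(variance_n)        # back in the original units

print(mean_value, total_squared, variance_n, standard_deviation)
print(pvariance(values), variance(values))   # library equivalents
```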
Distribution or shape of the data The normal distribution Skewness: –Positive or right-hand skewed –Negative or left-hand skewed Kurtosis: –Platykurtic –Mesokurtic –Leptokurtic
The normal distribution
Symmetrical data: the mean, the median and the mode coincide
[Figure: f(X) plotted against X, with the mean, median and mode marked at the same point]
Right-hand skew (+)
Right-hand skew: the extreme large values drag the mean towards them
[Figure: f(X) plotted against X, with the mode, median and mean marked]
Left-hand skew (-)
Left-hand skew: the extreme small values drag the mean towards them
[Figure: f(X) plotted against X, with the mode, mean and median marked]
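The way extreme values drag the mean away from the median, and the sign of the skewness, can be checked numerically. The Python sketch below uses hypothetical data and a simple moment-based skewness coefficient; SPSS reports a sample-adjusted version, so its values will differ somewhat while the sign should agree.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data: one extreme large value produces a right-hand skew.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirroring the data produces the corresponding left-hand skew.
left_skewed = [-x for x in right_skewed]

for data in (right_skewed, left_skewed):
    print(mean(data), median(data), skewness(data))
```

For the right-skewed data the mean lies above the median and the coefficient is positive; for the mirrored data the mean lies below the median and the coefficient is negative.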
Bivariate analysis Continuous Dependent Variable Categorical Independent Variable
Explore
Explore: Options button
Explore: Plots button
Explore: Statistics button
Descriptives for AGE by Gender (columns: Statistic, Std. Error)
Male
Mean
95% Confidence Interval for Mean, Lower Bound: 30.76
95% Confidence Interval for Mean, Upper Bound
5% Trimmed Mean: 31.03
Median: 30.00
Variance
Std. Deviation
Minimum: 1
Maximum: 70
Range: 69
Interquartile Range: 19.00
Skewness
Kurtosis
Female
Mean
95% Confidence Interval for Mean, Lower Bound: 31.84
95% Confidence Interval for Mean, Upper Bound
5% Trimmed Mean: 32.77
Median: 33.00
Variance
Std. Deviation
Minimum: 14
Maximum: 77
Range: 63
Interquartile Range: 23.00
Skewness
Kurtosis
[Figure: separate panels for Male and Female]
Boxplot of Age vs Gender
[Figure: boxplots of Age for each Gender, with the median, the inter-quartile range and an outlier labelled]
Syntax: Explore
    EXAMINE VARIABLES=age BY gender
      /ID=id
      /PLOT BOXPLOT HISTOGRAM
      /COMPARE GROUP
      /STATISTICS DESCRIPTIVES
      /CINTERVAL 95
      /MISSING LISTWISE
      /NOTOTAL.
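For readers without SPSS to hand, the idea behind analysing a continuous variable by the values of a categorical variable can be sketched in Python. This is not a substitute for the EXAMINE command; the records and names below are hypothetical and serve only to show the grouping step.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (gender, age) records, invented for illustration.
records = [
    ("Male", 25), ("Female", 33), ("Male", 30),
    ("Female", 41), ("Male", 35), ("Female", 28),
]

# Split the continuous variable (age) by the categorical variable (gender).
ages_by_gender = defaultdict(list)
for gender, age in records:
    ages_by_gender[gender].append(age)

# Summarise each group with the same descriptive statistics.
summary = {
    gender: {
        "n": len(ages),
        "mean": mean(ages),
        "median": median(ages),
        "minimum": min(ages),
        "maximum": max(ages),
    }
    for gender, ages in ages_by_gender.items()
}

for gender, stats in summary.items():
    print(gender, stats)
```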
Summary Measures of central tendency Measures of variation Quantiles Measures of shape Bivariate analysis for a categorical independent variable and continuous dependent variable Histograms Boxplots