A bar chart of a quantitative variable with only a few categories (called a discrete variable) communicates the relative number of subjects with each of the possible responses. However, the bar chart does not graphically distinguish between quantitative and qualitative variables. Once we looked at the variable label and the values, we would realize that this is a quantitative variable, but it would take that extra work to understand it. 8/21/2014Slide 1
If the quantitative variable has a large number of categories (called a continuous variable), the bar chart provides little information beyond the fact that there are a lot of different values, and some occur more frequently than others.
Histograms are the preferred graph for quantitative variables. While the bars resemble those of a bar chart, histograms are distinguished by the absence of gaps between consecutive bars. For continuous variables, values are grouped in equally spaced intervals to convey a sense of what the distribution looks like.
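As a minimal sketch of how values are grouped into equally spaced intervals, the following Python code counts how many scores fall in each interval. The scores, the number of intervals, and the function name histogram_counts are illustrative assumptions, not values from the course data set.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall in each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last interval, not a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical scores, for illustration only.
scores = [17, 22, 25, 29, 32, 36, 36, 40, 43, 43, 47, 51, 51, 60, 66, 74, 86]
edges, counts = histogram_counts(scores, n_bins=7)

# Text version of the histogram: one row per interval, no gaps between them.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Because every interval has the same width, the height of each bar can be compared directly with its neighbors.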
While we used counts and percents to describe the distribution of a qualitative variable, we use statistical measures to describe the center, spread, and shape of a quantitative variable. Measures of central tendency identify a value in the center of the distribution. Measures of variability or dispersion summarize how the values for individual cases are spread out around the measure of central tendency.
There are two measures of the shape of the distribution: skewness and kurtosis. Many of the statistics we will use assume that the distribution of a variable is bell-shaped, i.e. the normal distribution. Skewness measures the symmetry of the distribution on both sides of the average score for the distribution. Having overlaid a blue normal curve on the distribution of this variable, we can see that the bars on either side of the red center line are similar as one moves away from the center. Kurtosis measures the degree to which the distribution is peaked or flat compared to the normal distribution. In this example, the bars at the center of the distribution are close to what would be expected for a normal distribution and the frequencies decrease as we move away from the center.
Both of these variables have a problem with skewness, caused by atypical scores at one end of the distribution. Skewness is characterized as negative or positive, depending on which side, or tail, of the distribution has the unusual scores. This is an example of negative skewness, where a few small scores have elongated the left tail of the distribution. The tail on the right is truncated. This is an example of positive skewness, where a few large scores have elongated the right tail of the distribution. The tail to the left is truncated.
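The sign of the skewness statistic can be illustrated with a short Python sketch. It uses the bias-adjusted sample skewness formula, which we believe is the form that statistical packages typically report; the function name adjusted_skewness and the three small data sets are assumptions made up for illustration.

```python
from math import sqrt


def adjusted_skewness(xs):
    """Bias-adjusted sample skewness (needs at least 3 non-identical values)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    g1 = m3 / m2 ** 1.5                        # unadjusted skewness
    return g1 * sqrt(n * (n - 1)) / (n - 2)    # small-sample adjustment


# Hypothetical data: one large score stretches the right tail.
right_tail = [1, 2, 2, 3, 3, 3, 4, 4, 20]
# Mirror image: one small score stretches the left tail.
left_tail = [-x for x in right_tail]
# Symmetric scores.
symmetric = [1, 2, 3, 4, 5]

print("right tail:", adjusted_skewness(right_tail))  # positive
print("left tail: ", adjusted_skewness(left_tail))   # negative
print("symmetric: ", adjusted_skewness(symmetric))   # zero
```

The cubed deviations keep their sign, so a few scores far out in one tail dominate the sum and determine whether the statistic is positive or negative.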
Both of these variables have a problem with kurtosis, caused by either too few cases in the center of the distribution, or too many cases in the center of the distribution. This is an example of negative kurtosis, where the scores are uniformly distributed through the range of scores. The kurtosis statistic will have a negative value. This is an example of positive kurtosis, where the scores are heavily concentrated in the center of the distribution. The kurtosis statistic will have a positive value.
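A similar sketch shows the sign of the kurtosis statistic. For simplicity it uses the unadjusted "excess kurtosis" (fourth moment ratio minus 3, so a normal distribution scores zero); statistical packages usually add a small-sample adjustment, so their values will differ somewhat. The function name and both data sets are illustrative assumptions.

```python
def excess_kurtosis(xs):
    """Unadjusted excess kurtosis: 0 for a normal curve, <0 flat, >0 peaked."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical flat distribution: every score from 1 to 20 occurs once.
flat = list(range(1, 21))
# Hypothetical peaked distribution: most scores pile up at the center.
peaked = [5] * 10 + [4, 6] * 3 + [1, 9]

print("flat:  ", excess_kurtosis(flat))    # negative
print("peaked:", excess_kurtosis(peaked))  # positive
```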
There are two measures of central tendency for quantitative variables: the mean and the median. The mean is the average score. The median is the middle score, i.e. half of the scores are higher and half are lower. When the distribution has minimal skewness and is symmetric, both the red mean line and the green median line fall in the center of the distribution. While both measures reflect the center of the distribution, the mean is the preferred measure because it uses information for all of the cases in the distribution. For each measure of centrality, there is a corresponding measure of spread. The standard deviation is used with the mean, and the interquartile range is used with the median.
When skewing is present, the red mean line moves away from the center of the distribution (identified by the green median line) in the direction of the skewness. At some level of skewness, the median becomes more effective at representing the center of the distribution. The issue is selecting a defensible rule for deciding the dividing line between acceptable skewness and problematic skewness. The rule of thumb that we will use is that skewness less than -1.0 or greater than +1.0 is problematic and indicates that the median is the preferred measure.
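The rule of thumb above can be written as a small decision function. This is only a sketch of the rule as stated; the function name preferred_measures and the example skewness values are hypothetical.

```python
def preferred_measures(skewness, cutoff=1.0):
    """Apply the rule of thumb: skewness beyond +/- cutoff favors the median."""
    if abs(skewness) > cutoff:
        return ("median", "interquartile range")
    return ("mean", "standard deviation")


# Hypothetical skewness values, for illustration only.
for value in (0.4, -0.9, 1.6, -2.3):
    center, spread = preferred_measures(value)
    print(f"skewness {value:+.1f}: report the {center} and the {spread}")
```

The cutoff is a parameter because, as the rule is only a convention, a different dividing line could be substituted without changing the logic.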
Kurtosis does not affect the location of the measure of central tendency. Kurtosis indicates that there are either more cases than expected in the middle of the distribution (positive kurtosis), or fewer cases than expected (negative kurtosis). The bars extending above the normal curve overlay indicate that there is positive kurtosis. A distribution with positive kurtosis is characterized as a “peaked distribution.” When the bars fall below the center of the normal curve overlay, the distribution has negative kurtosis, and is referred to as a “flat distribution.”
The homework problems on central tendency and variability focus on describing the distribution of quantitative variables. The counts and percents that we used for qualitative variables are not effective for quantitative variables that can have many different scores in the distribution. We describe the distribution of quantitative variables with summary statistics that try to communicate the value on which the distribution is centered, the spread of the values from the center of the distribution, the symmetry of the distribution around the center measure, and the degree to which the distribution is bell-shaped or flat.
The center, or central tendency, of the distribution is usually represented by the mean (average score) or the median (middle score) of the distribution. The standard deviation is used as the measure of spread (variability or dispersion) that is paired with the mean. It summarizes the typical distance between the mean and each of the scores in the distribution (formally, the square root of the average squared deviation from the mean). The range and interquartile range are used to measure the spread around the median. The range is the difference between the highest score and lowest score. The interquartile range is the difference between the highest and lowest score when the smallest 25% and the largest 25% of the scores are removed from the distribution.
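These summary statistics can be computed with Python's standard statistics module, as in the sketch below. The scores are hypothetical, and quartile conventions differ between programs, so the interquartile range computed here may not match SPSS output exactly for the same data.

```python
import statistics

# Hypothetical scores, for illustration only.
scores = [17, 22, 25, 29, 32, 36, 36, 40, 43, 43, 47, 51, 51, 60, 66, 74, 86]

# Center and spread paired with the mean.
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1 divisor)

# Center and spread paired with the median.
median = statistics.median(scores)
value_range = max(scores) - min(scores)
q1, _, q3 = statistics.quantiles(scores, n=4)  # 25th, 50th, 75th percentiles
iqr = q3 - q1  # spread of the middle 50% of the scores

print(f"mean = {mean:.2f}, standard deviation = {sd:.2f}")
print(f"median = {median}, range = {value_range}, interquartile range = {iqr:.2f}")
```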
Both the mean and the median can be computed for the values in the distribution of any quantitative variable. However, the degree to which one or the other is a “good” measure or indicator of the central tendency of a distribution differs with the shape of the distribution, specifically the symmetry of the distribution as measured by skewness. If the distribution is symmetric, both the mean and the median fall in the center of the distribution. The mean is the preferred measure because it uses all of the cases in the distribution in its calculation, and because it can be used in a broader range of statistical tests. If the distribution is not symmetric, the median stays in the middle of the distribution, but the mean is pulled away from the center toward one of the tails of the distribution.
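A tiny example makes the pull on the mean concrete. Both sets of scores below are hypothetical; the second differs from the first only in its largest score.

```python
import statistics

# Hypothetical symmetric scores.
symmetric = [30, 35, 40, 45, 50]
# Same scores, except the largest one is replaced by an extreme value.
skewed = [30, 35, 40, 45, 150]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    print(f"{name}: mean = {statistics.mean(data)}, median = {statistics.median(data)}")
```

In the symmetric set the mean and the median are both 40. In the skewed set the median is still 40, but the mean rises to 60, toward the long right tail.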
The degree of symmetry of a distribution of scores for a quantitative variable can vary quite widely. These six histograms show progressively increasing skewness. At what point do we choose the median over the mean?
There is no universally accepted criterion for the amount of skewness that dictates a preference for the median. Most agree that we should be concerned with substantial skewness and ignore minor departures from symmetry, but there is no agreement on what counts as substantial. One rule of thumb indicates that a distribution has a substantial skewness problem when the size of the skew statistic is at least twice its standard error (in the SPSS output). The rule of thumb that I have used and which will be used for the problems is that skewness is a problem if it is less than -1 for negatively skewed distributions or greater than +1 for positively skewed distributions.
[Figure: six histograms in a top row and a bottom row, each labeled with its skewness statistic.] By my rule of thumb, we would use the mean as the measure of central tendency for the top row, and the median for the bottom row. That the rule is arbitrary is shown by the similarity of the last chart on the top row to the first chart on the bottom row.
The introductory statement in the question indicates the data set to use (GSS200R), the statistic to use (central tendency and dispersion), and the variable to use in the analysis (occupational prestige score [prestg80]).
The first statement for us to evaluate concerns the number of valid and missing cases. To answer this question, we produce the descriptive statistics in SPSS.
To compute the measures of central tendency and dispersion in SPSS, select the Descriptive Statistics > Explore command from the Analyze menu. Measures of central tendency and variability can also be computed with the Frequencies and Descriptives commands.
Move the variable for the analysis, prestg80, to the Dependent List box. Click on the Statistics button to select optional statistics.
The check box for Descriptives is already marked by default. Mark the Percentiles check box. This will provide the upper and lower bounds for the interquartile range. Click on the Continue button to close the dialog box.
After returning to the Explore dialog box, click on the OK button to produce the output.
The 'Case Processing Summary' in the SPSS output showed the total number of valid cases to be 255 and the number of missing cases to be 15. The SPSS output provides us with the answer to the question on sample size.
The 'Case Processing Summary' in the SPSS output showed the total number of valid cases to be 255 and the number of missing cases to be 15. Click on the check box to mark the statement as correct.
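For readers working outside SPSS, counting valid and missing cases might look like the sketch below. The short list of responses is hypothetical and uses None to stand for a missing answer; only the counts 255 and 15 come from the output described above.

```python
def count_cases(responses):
    """Return (valid, missing), treating None as a missing response."""
    missing = sum(1 for r in responses if r is None)
    return len(responses) - missing, missing


# Hypothetical responses, for illustration only (not the survey data).
responses = [43, None, 51, 36, None, 60]
valid, missing = count_cases(responses)
print(f"{valid} valid, {missing} missing")

# The counts reported in the Case Processing Summary.
reported_valid, reported_missing = 255, 15
total = reported_valid + reported_missing
print(f"{reported_valid / total:.1%} valid and "
      f"{reported_missing / total:.1%} missing out of {total} cases")
```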
The next pair of statements asks us to identify the correct values for the mean and the standard deviation from the SPSS output.
In the table of descriptive statistics, the Mean row has a value of [value missing] and the Std. Deviation row shows [value missing], which rounds to [value missing].
In the table of descriptive statistics, the mean is [value missing] and the standard deviation is [value missing]. We mark the check box for the statement with the correct values.
The next pair of statements asks us to identify the correct values for the median and the interquartile range from the SPSS output.
In the table of descriptive statistics, the Median row has a value of 43 and the Interquartile Range row has a value of 18.
In the table of descriptive statistics, the median is 43 and the interquartile range is 18. We mark the check box for the statement with the correct values.
The next pair of statements asks us to identify the direction of the skewing in the distribution of the variable.
The skewness for the distribution of "occupational prestige score" [prestg80] is .401. Since this is greater than zero, we characterize it as positive skewing or skewing to the right. If it were less than zero, it would be negative skewing or skewing to the left.
The skewness for the distribution of "occupational prestige score" [prestg80] is .401. Since this is greater than zero, we characterize it as positive skewing or skewing to the right. We mark the check box for the statement with the correct response.
The final pair of statements asks us to identify which measure of center and spread should be reported for the variable.
One rule of thumb suggests that when the size of the skewness statistic is at least 2 times the value of the skewness standard error, the median is preferred. For this variable, the statistic (.401) is more than twice the standard error (.153), so the median would be preferred.
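The comparison can be checked with a few lines of Python. The sketch uses the standard formula for the standard error of skewness, which depends only on the number of valid cases; that SPSS computed its .153 this way is an assumption, although with 255 valid cases the formula gives about 0.1525, which is consistent with the reported value.

```python
from math import sqrt


def skewness_standard_error(n):
    """Standard error of sample skewness for n valid cases (n > 2)."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


n_valid = 255     # valid cases from the Case Processing Summary
skewness = 0.401  # skewness statistic from the SPSS output

se = skewness_standard_error(n_valid)
ratio = skewness / se

print(f"standard error = {se:.4f}, ratio = {ratio:.2f}")
print("median preferred by this rule" if abs(ratio) >= 2
      else "mean preferred by this rule")
```

Because the standard error shrinks as the sample grows, this rule flags even mild skewness in large samples, which is one reason the two rules of thumb can disagree.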
Another rule of thumb uses only the value of the skewness statistic. When the skewness is smaller than -1.0 or larger than +1.0, the distribution is badly skewed and the median is a better measure of central tendency. This is the rule of thumb used in our problems. The skewness of this distribution (0.40) is in the allowable range, making the mean and standard deviation the preferred measures of center and spread.
Using the rule of thumb that skewness between -1.0 and +1.0 is acceptable, the skewness of this distribution (0.40) is acceptable, making the mean and standard deviation the preferred measures of center and spread. The check box for the first statement is marked.