4/12/2015 Slide 1 We have seen that skewness affects the way we describe the central tendency and variability of a quantitative variable: if a distribution is more skewed than the threshold of -1.0 to 1.0, we report the median and interquartile range rather than the mean and standard deviation. A major cause of skewed distributions is the presence of outliers – cases that have very small or very large scores relative to the other cases in the distribution. Outliers have a larger effect on the results of statistical analysis than other cases. One extreme outlier may change our view of central tendency and variability for the entire distribution.
Slide 2 Outliers pose a dilemma for us in terms of our justification for either omitting them or retaining them in the analysis. It is easy to remove outliers that were data entry errors. It is more difficult to defend removing outliers when the scores represent accurate data. One response to the dilemma is to run the analysis with and without the outliers, and describe the difference. Sometimes it makes little difference and we can ignore the presence of the outliers. Another response to the dilemma is to re-express or transform the variable and see if the outliers are eliminated. If there are no outliers using the re-expressed data, we can run the analysis with the re-expressed data and draw our conclusions based on the results for the re-expressed variables.
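As a rough illustration of the "run it with and without the outliers" response, the sketch below compares the mean and median of a small set of scores before and after dropping one extreme value. The scores and the function name summarize are made up for illustration and are not part of the problem set.

```python
from statistics import mean, median

def summarize(scores):
    """Return (mean, median) for a list of scores."""
    return mean(scores), median(scores)

# Hypothetical scores; 95 is the single extreme value.
scores = [2, 3, 3, 4, 5, 5, 6, 95]
without_outlier = [s for s in scores if s != 95]

mean_all, median_all = summarize(scores)
mean_trim, median_trim = summarize(without_outlier)

print(f"with outlier:    mean={mean_all:.2f}, median={median_all}")
print(f"without outlier: mean={mean_trim:.2f}, median={median_trim}")
```

In this made-up example the mean moves a great deal while the median barely changes, which is the kind of difference we would describe in a report.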
Slide 3 Two downsides to the strategy of re-expressing data are: the skepticism of audiences who already think we massage the numbers to produce the results we want, and the need to convert the results back to the original scale if we need to report numerical results. In this problem set, we will use a boxplot strategy for detecting outliers and examine the use of two of the possible transformations: the square and the logarithm. The Explore procedure in SPSS provides both the boxplot and the descriptive statistics needed to solve these problems. In the boxplot, two types of outliers are identified by symbols: circles for outliers, and stars for extreme (or far) outliers.
Slide 4 A case is identified as an outlier (circle) if its value is less than or equal to the first quartile minus 1.5 times the interquartile range, or is greater than or equal to the third quartile plus 1.5 times the interquartile range. If the case has a value less than or equal to the first quartile minus 3 times the interquartile range, or greater than or equal to the third quartile plus 3 times the interquartile range, it is characterized as a far outlier (star). If outliers or far outliers are found for a variable, we will examine the behavior of the outliers when the variable is re-expressed by computing the logarithm of the values if the variable is skewed to the right. If the variable is negatively skewed, we will square the values and examine the effect on the outliers.
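The cut points described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names fences and classify and the quartile values 10 and 20 are hypothetical, and comparisons are inclusive (<= and >=).

```python
def fences(q1, q3):
    """Return the outlier and far-outlier cut points for given quartiles."""
    iqr = q3 - q1
    return {
        "outlier_low": q1 - 1.5 * iqr,
        "outlier_high": q3 + 1.5 * iqr,
        "far_low": q1 - 3.0 * iqr,
        "far_high": q3 + 3.0 * iqr,
    }

def classify(value, cuts):
    """Label a value 'far outlier', 'outlier', or 'not an outlier'."""
    # Check the far-outlier cut points first, since every far outlier
    # also satisfies the ordinary outlier rule.
    if value <= cuts["far_low"] or value >= cuts["far_high"]:
        return "far outlier"
    if value <= cuts["outlier_low"] or value >= cuts["outlier_high"]:
        return "outlier"
    return "not an outlier"

cuts = fences(q1=10.0, q3=20.0)   # hypothetical quartiles, IQR = 10
print(cuts)
print(classify(36.0, cuts), "|", classify(55.0, cuts), "|", classify(15.0, cuts))
```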
Slide 5 The script for this week positions the boxplot under the histogram. In this chart, we see a number of circles at the right end of the distribution. These are outliers, and there are no far outliers in this distribution. As we would expect, this distribution has a skewness problem (skewness = 1.19, shown in the subtitle to the chart). NOTE: the horizontal axis for the boxplot approximates the axis for the histogram, but is not exact.
Slide 6 The distribution for this variable shows one far outlier at the extreme right of the distribution.
Slide 7 Some distributions will show both outliers and far outliers. Our problems will state the number of outliers, and the number of far outliers as a subset of the total number of outliers. Note that the chart shows the presence or absence of outliers, but does not necessarily provide an exact count since the outlier symbol might represent more than one case with the same score.
Slide 8 The boxplots for some distributions will indicate that there are no outliers.
Slide 9 The boxplot for the distribution for this variable shows several outliers at the right end of the distribution. When the variable is positively skewed, the data values are re-expressed on the logarithmic scale. When we re-express the values for the variable on a logarithmic scale, the boxplot does not indicate that there are any outliers.
Slide 10 If we re-express the data values using the wrong transformation, we actually increase the problem of outliers. The distribution was positively skewed, and we squared the data values, rather than converting to a log scale, resulting in more outliers.
Slide 11 The boxplot for the distribution for this variable shows several outliers at the low end of the scale. Since this variable is skewed to the left, we will re-express the data values as squares. The boxplot of the squared values indicates that there are no outliers for the re-expressed data values.
Slide 12 If we re-express the data values using the wrong transformation, we actually increase the problem of outliers. The distribution was negatively skewed, and we applied a log transformation rather than the square transformation, resulting in more outliers.
Slide 13 Re-expressing the data values does not always remedy the problem of outliers. In the chart to the right, the logarithmic transformation appears to have had little impact.
Slide 14 Some variables have outliers at both ends of the distribution. The outliers at one end may offset the skewness in the other tail, but kurtosis will become a problem. Neither the logarithmic nor the square transformation will remedy this distribution because each re-expression works on only one tail of the distribution.
Slide 15 Re-expression changes the measuring scale for the variable by altering the distance between the values. All of the lines below represent the numbers 1 to 10, on a decimal, logarithmic, and squared scale. On our familiar decimal measuring scale, the distance between numbers is the same for all numbers. On a logarithmic scale, the distance between the numbers decreases as the numbers get larger. On a square scale, the distance between the numbers decreases as the values get smaller. All of the dots represent the same sequence of values from 1 to 10 on different measuring scales.
Slide 16 The logarithmic transformation works by stretching the scale at the left end of the distribution and compressing the scale at the right end of the distribution. As shown in the diagram below, the numbers 1 to 5 (red dots) are converted to their log equivalents (blue dots). The distance between the log points decreases as the values increase. The distance between the log of 4 and the log of 5 is less than the distance between the log of 1 and log of 2.
Slide 17 Positive skewing is reduced because the distance between consecutive numbers on the log scale decreases as the size of the decimal number increases. For example, the difference between the log of 2 and the log of 3 is 0.176, larger than the difference between the log of 4 and log of 5, which is 0.097.

decimal scale   log scale   difference between consecutive values
 1              0.000
 2              0.301       0.301
 3              0.477       0.176
 4              0.602       0.125
 5              0.699       0.097
 6              0.778       0.079
 7              0.845       0.067
 8              0.903       0.058
 9              0.954       0.051
10              1.000       0.046
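The shrinking gaps on the log scale can be reproduced with a short Python sketch. It simply recomputes base-10 logs for 1 to 10; the variable names logs and gaps are illustrative.

```python
import math

# Base-10 logs of 1..10 and the gap between each log and the previous one.
logs = [math.log10(n) for n in range(1, 11)]
gaps = [b - a for a, b in zip(logs, logs[1:])]

for n, log_n, gap in zip(range(2, 11), logs[1:], gaps):
    print(f"{n:>2}  {log_n:.3f}  {gap:.3f}")
```

Each gap is smaller than the one before it, which is why large values are pulled in toward the rest of the distribution.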
Slide 18 The square transformation works by compressing the scale at the left end of the distribution and stretching the scale at the right end of the distribution. As shown in the diagram below, the numbers 1 to 5 (red dots) are converted to their squared equivalents (blue dots). The distance between the squared points increases as the values increase. The distance between the square of 4 and the square of 5 is larger than the distance between the square of 1 and square of 2.
Slide 19 Negative skewing is reduced because the distance between consecutive numbers on the squared scale increases as the size of the decimal number increases. For example, the difference between the square of 2 and the square of 3 is 5.0, less than the difference between the square of 4 and square of 5, which is 9.0.

decimal scale   squared scale   difference between consecutive values
 1                1.000
 2                4.000          3.000
 3                9.000          5.000
 4               16.000          7.000
 5               25.000          9.000
 6               36.000         11.000
 7               49.000         13.000
 8               64.000         15.000
 9               81.000         17.000
10              100.000         19.000
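The same check for the squared scale, as a minimal Python sketch with illustrative variable names: the gap between the squares of n - 1 and n is 2n - 1, so the gaps grow steadily as the values increase.

```python
# Squares of 1..10 and the gap between each square and the previous one.
squares = [n ** 2 for n in range(1, 11)]
gaps = [b - a for a, b in zip(squares, squares[1:])]

for n, square, gap in zip(range(2, 11), squares[1:], gaps):
    print(f"{n:>2}  {square:>3}  {gap:>2}")
```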
Slide 20 As long as we can reverse the transformation and get back to the original values, the transformations are legitimate. To make certain we can get back to the original values, we must make certain the numbers on all scales are mathematically defined as real numbers. Some results are not defined, such as the logarithm of 0 and the square root of negative numbers. To avoid a transformation that we cannot reverse, we may need to add a constant to each number. If some numbers are negative, we add a constant based on the size of the smallest value, large enough to make every number greater than zero. If the smallest value in the distribution is 0, we add 1 to each score in the distribution. Since we are starting out with transformations, the problem statement will tell you if you need to add a numeric constant when doing the transformations.
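A minimal sketch of the "can we get back?" requirement, assuming a log transformation with an added constant. The function names log_transform and back_transform and the scores are hypothetical; the constant 1 follows the rule above for a distribution whose smallest value is 0.

```python
import math

def log_transform(values, constant=0.0):
    """Return log10(value + constant) for each value.

    Raises ValueError if any shifted value is not positive, because the
    logarithm would then be undefined.
    """
    shifted = [v + constant for v in values]
    if min(shifted) <= 0:
        raise ValueError("all values must be > 0 after adding the constant")
    return [math.log10(s) for s in shifted]

def back_transform(logs, constant=0.0):
    """Reverse log_transform: 10 ** log - constant."""
    return [10 ** y - constant for y in logs]

scores = [0, 3, 9, 99]                      # hypothetical; smallest value is 0
logs = log_transform(scores, constant=1)    # add 1 because the minimum is 0
restored = back_transform(logs, constant=1)

print([round(y, 3) for y in logs])
print([round(r, 6) for r in restored])
```

Calling log_transform(scores) without the constant raises an error in this sketch, which mirrors the point that the log of 0 is undefined.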
Slide 21 The introductory statement in the question indicates: The data set to use (2001WorldFactBook) The task to accomplish (checking for outliers) The SPSS procedure to use (Explore) The variable to use in the analysis: HIV-AIDS adult prevalence rate [hivaids]
Slide 22 These problems also contain a second paragraph of instructions that provides the formulas to use if our examination of outliers requires us to re-express or transform the variable.
Slide 23 The first statement concerns the number of valid and missing cases. To answer this question, we produce the descriptive statistics using the SPSS Explore procedure.
Slide 24 To compute the descriptive statistics and charts that we need to check for outliers, select the Descriptive Statistics > Explore command from the Analyze menu.
Slide 25 Move the variable for the analysis, hivaids, to the Dependent List list box. Click on the Statistics button to select optional statistics.
Slide 26 The check box for Descriptives is already marked by default. Mark the Percentiles check box. This will provide the upper and lower bounds for the interquartile range. While there is a check box for Outliers, it lists the five largest scores and the five smallest scores, but does not tell us whether or not they are really outliers. Click on the Continue button to close the dialog box.
Slide 27 Next, we click on the Plots button to obtain visual evidence of the presence of outliers in the distribution.
Slide 28 We accept the default for the Box plot, which provides us the output we need even though we are not using factor levels in this problem. We accept the default Stem-and-Leaf plot, and mark the check box for a Histogram as well. We click on the Continue button to close the Plots dialog.
Slide 29 After returning to the Explore dialog box, click on the OK button to produce the output.
Slide 30 The 'Case Processing Summary' in the SPSS output showed the total number of valid cases to be 162 and the number of missing cases to be 56. The SPSS output provides us with the answer to the question on sample size.
Slide 31 The 'Case Processing Summary' in the SPSS output showed the total number of valid cases to be 162 and the number of missing cases to be 56. Click on the check box to mark the statement as correct.
Slide 32 The next two statements focus on the median and interquartile range as the center and spread of the data. We are using the median and interquartile range because we are using the box plot strategy for identifying outliers. The median and interquartile range are key measures of box plots.
Slide 33 We use the table of descriptive statistics to obtain the value for the median: .2000 for this variable. However, we do not use the table of descriptive statistics for the value of the interquartile range because this is not the value used in the box plot. The value used in the box plot is based on “Tukey’s hinges”, which use a slightly different calculation for the first and third quartile, which may make a difference in the value for the interquartile range.
Slide 34 The value for the first quartile (the 25th percentile) is 0.050. The value for the third quartile (the 75th percentile) is 2.010. The interquartile range is the difference between the two: 1.96 (2.010 – 0.050 = 1.96). Note that the 75th percentile using the default weighted average calculation is slightly different (2.0150) from the 75th percentile calculated with Tukey’s Hinges.
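For readers curious about how hinges differ from ordinary percentiles, here is a sketch of the usual textbook definition: each hinge is the median of one half of the sorted data, with the overall median counted in both halves when the number of cases is odd. The function name tukey_hinges and the scores are hypothetical, and this is an assumption about the general method rather than a statement of SPSS's exact algorithm.

```python
from statistics import median

def tukey_hinges(values):
    """Return (lower hinge, median, upper hinge) for a list of numbers."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # size of each half; median shared if n is odd
    lower = data[:half]
    upper = data[n - half:]
    return median(lower), median(data), median(upper)

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical data
q1, med, q3 = tukey_hinges(scores)
print(q1, med, q3, "IQR =", q3 - q1)
```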
Slide 35 From the SPSS output, we obtained a value of 0.20 for the median and 1.96 for the interquartile range. We mark the first check box in the pair as the correct answer.
Slide 36 The next pair of statements asks us to identify the direction of the skewing in the distribution of the variable. Outliers almost always skew the distribution. The direction of the skewness is critical because it dictates which function we choose to re-express or transform the data.
Slide 37 The skewness for the distribution of "HIV-AIDS adult prevalence rate" [hivaids] is 3.19. Since this is greater than zero, we characterize it as positive skewing, or skewing to the right. When the distribution is skewed to the right, the text recommends re-expressing the data as logarithms, square roots, or reciprocals. We will use logarithms in these problems. When the distribution is skewed to the left, the text recommends re-expressing the data as squares.
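The decision on this slide depends only on the sign of the skewness. The sketch below computes a sample skewness with the adjusted Fisher-Pearson formula, which is assumed here to be comparable to the statistic SPSS reports, and then picks the transformation used in these problems. The function names and the data are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: n / ((n-1)(n-2)) * sum(z ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    m, s = mean(values), stdev(values)     # stdev uses the n - 1 denominator
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

def suggested_transformation(skewness):
    """Pick the re-expression used in these problems from the sign of the skew."""
    if skewness > 0:
        return "logarithm"
    if skewness < 0:
        return "square"
    return "none"

right_skewed = [1, 2, 2, 3, 3, 4, 30]    # hypothetical data with a long right tail
skew = sample_skewness(right_skewed)
print(round(skew, 2), suggested_transformation(skew))
```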
Slide 38 The skewness for the distribution of "HIV-AIDS adult prevalence rate" [hivaids] is 3.19. Since this is greater than zero, we characterize it as positive skewing or skewing to the right. We mark the check box for the first statement as the correct response.
Slide 39 The next pair of statements asks us to identify how many outliers there are in the distribution, either that there are no outliers or that there are a specific number of outliers and far outliers.
Slide 40 The box plot provides us with the first evidence of outliers. The circles and asterisks above the whiskers of the box plot attest to the presence of outliers. In the terminology of the text, the circles are outliers, and the asterisks are far outliers.
Slide 41 If the variable does not have outliers, neither circles nor asterisks will appear in the box plot. This is a box plot for the variable Population below poverty line from the same data set.
Slide 42 While the box plot makes it obvious that there are outliers in this distribution, it is not possible to obtain the exact number because the points overlap and because a single circle or star may represent more than one case with that value.
Slide 43 The presence of outliers is also seen in the histogram of the distribution. However, it also does not make it easy to determine the exact number.
Slide 44 Our first task is to compute the values that would let us determine whether or not a case is an outlier. The value for the first quartile (the 25th percentile) is 0.050. The value for the third quartile (the 75th percentile) is 2.010. The interquartile range is the difference between the two: 1.96 (2.010 – 0.050 = 1.96). To be characterized as an outlier, a case would have to have: a value less than or equal to -2.89 (Q1 - 1.5 x IQR = 0.05 - 1.5 x 1.96 = -2.89) or a value greater than or equal to 4.95 (Q3 + 1.5 x IQR = 2.01 + 1.5 x 1.96 = 4.95).
Slide 45 Our second task is to compute the values that would let us determine whether or not a case is a far outlier. The value for the first quartile (the 25th percentile) is 0.050. The value for the third quartile (the 75th percentile) is 2.010. The interquartile range is the difference between the two: 1.96 (2.010 – 0.050 = 1.96). To be characterized as a far outlier in the distribution of "HIV-AIDS adult prevalence rate" [hivaids], a case would have to have a value less than or equal to -5.83 (Q1 - 3 x IQR = 0.05 - 3 x 1.96 = -5.83) or a value greater than or equal to 7.89 (Q3 + 3 x IQR = 2.01 + 3 x 1.96 = 7.89). The calculations may produce values that do not exist in the data set, e.g. -5.83. Since there can be no outliers at that value or smaller, it does not have any impact on our solution.
Slide 46 We sort the cases in ascending order by the variable we are studying, so we can count the number of cases that fall in the outlier region. Click the right mouse button on the column header for hivaids, and select Sort Ascending from the pop-up menu.
Slide 47 The entries at the top of sorted column are missing values, indicated by the periods in the cells. The lower bounds for both outliers and far outliers were negative numbers (-2.89 and -5.83). Since all of the values for hivaids are positive numbers there are no outliers in the lower range of values.
Slide 48 The upper bound for outliers was 4.95. After locating it in the sorted column, we count the number of values greater than or equal to 4.95, as shown in the red border. There are 25 outliers. The upper bound for far outliers was 7.89, outlined with the blue border. There are 18 far outliers.
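The counting step can also be done programmatically. The sketch below uses the cut points computed for hivaids (-2.89 and 4.95 for outliers, -5.83 and 7.89 for far outliers), but the scores themselves are made up and are not the World Factbook data; count_outliers is a hypothetical helper name. Far outliers are counted as a subset of outliers, as in the problem statements.

```python
def count_outliers(values, low_cut, high_cut):
    """Count values at or beyond either cut point, skipping missing (None) values."""
    return sum(
        1 for v in values
        if v is not None and (v <= low_cut or v >= high_cut)
    )

# Hypothetical scores; None marks a missing value.
scores = [None, 0.01, 0.2, 1.5, 4.95, 6.0, 7.89, 12.3, 35.8]

outliers = count_outliers(scores, low_cut=-2.89, high_cut=4.95)
far_outliers = count_outliers(scores, low_cut=-5.83, high_cut=7.89)
print(outliers, far_outliers)
```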
Slide 49 We counted 25 outliers and 18 far outliers in the data editor for hivaids. We mark the second check box in the pair which concurs with our finding.
Slide 50 The first statement in the next pair asks about the impact of re-expressing or transforming the data as logarithms. It predicts that the logarithmic transformation will eliminate both outliers and far outliers.
Slide 51 The formula for computing the log transformation was given in the second part of the instructions for the problem. We will create a new variable called LG_hivaids based on the LG10 function in SPSS. In some problems a number (e.g. 2.14 or 6.0) will be included in the parentheses of the formula to make sure all of the values to be converted are greater than zero.
Slide 52 To compute the new variable, select the Compute command from the Transform menu.
Slide 53 In the Compute Variable dialog box, we type the name for the new variable, LG_hivaids, in the Target Variable text box. Click on the Arithmetic function group so that the list of available functions appears in the Functions and Special Variables list box.
Slide 54 First, in the list of Functions and Special Variables, highlight Lg10 which computes logarithmic values using a base of 10. Second, click on the up arrow button to paste the Lg10 function in the Numeric Expression text box.
Slide 55 Next, type the name of the variable to be transformed hivaids between the parentheses after the function name. Finally, click on the OK button to compute the transformed variable.
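For readers who want to check the transformed values outside SPSS, the following is a rough Python equivalent of computing LG_hivaids. Treating missing or non-positive values as missing is an assumption of this sketch, lg10 is a hypothetical helper name, and the list holds only a few illustrative values: 0.05 and 2.01 are the quartiles reported earlier, while 0.01 and 35.8 are made up.

```python
import math

def lg10(value):
    """Base-10 log; missing (None) or non-positive values stay missing."""
    if value is None or value <= 0:
        return None
    return math.log10(value)

hivaids = [None, 0.01, 0.05, 2.01, 35.8]    # illustrative values only
lg_hivaids = [lg10(v) for v in hivaids]

print([None if v is None else round(v, 3) for v in lg_hivaids])
```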
Slide 56 Scroll the data editor window to the right to see the transformed variable, LG_hivaids. Note that I moved the hivaids variable to the right as well. It will not appear in this position in your data editor.
Slide 57 To calculate the descriptive statistics so we can identify outliers on the transformed variable, click on the Dialog Recall tool button.
Slide 58 In the pop-up menu for Dialog Recall, select the Explore item (the second to the last command we executed in SPSS).
Slide 59 Since we want the same statistics computed in the last Explore procedure, we only need to replace the variable hivaids with LG_hivaids. Click on the OK button to produce the output.
Slide 60 The box plot for LG_hivaids shows no circles or asterisks, indicating that there are no outliers in this distribution.
Slide 61 Similarly, the histogram displays a distribution that is much less skewed.
Slide 62 We use the SPSS output in the Percentiles table to compute the cut points that would make a case an outlier. The interquartile range is Q3 - Q1 = 0.3032 - (-1.30103) = 1.604. To be characterized as an "outlier", a case would have to have a logarithmic value less than or equal to -3.707 (Q1 - 1.5 x IQR = -1.301 - 1.5 x 1.604 = -3.707) or a logarithmic value greater than or equal to 2.71 (Q3 + 1.5 x IQR = 0.303 + 1.5 x 1.604 = 2.71).
Slide 63 The interquartile range is Q3 - Q1 = 0.3032 - (-1.30103) = 1.604. To be characterized as a "far outlier" in the distribution of "HIV-AIDS adult prevalence rate" [hivaids], a case would have to have a logarithmic value less than or equal to -6.114 (Q1 - 3 x IQR = -1.301 - 3 x 1.604 = -6.114) or a logarithmic value greater than or equal to 5.116 (Q3 + 3 x IQR = 0.303 + 3 x 1.604 = 5.116). The cut points are computed from the unrounded values, so they can differ in the last digit from arithmetic on the rounded figures shown.
Slide 64 The smallest logarithmic value is -2.0, larger than the cut point for an outlier (-3.707) and larger than the cut point for a far outlier (-6.114).
Slide 65 The largest logarithmic value is 1.554, less than the cut point for an outlier (2.71) and less than the cut point for a far outlier (5.116). When re-expressed as logarithms, the number of outliers in the distribution of "HIV-AIDS adult prevalence rate" was reduced from 25 to 0 and the number of far outliers was reduced from 18 to 0.
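The log-scale cut points and the comparison with the smallest and largest values can be re-checked with a few lines of Python. The quartiles and extreme values are the ones reported on these slides; the variable names are illustrative.

```python
# Quartiles of LG_hivaids from the Percentiles table.
q1, q3 = -1.30103, 0.3032
iqr = q3 - q1

outlier_cuts = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
far_cuts = (q1 - 3 * iqr, q3 + 3 * iqr)

# Smallest and largest logarithmic values reported for the variable.
smallest, largest = -2.0, 1.554
has_outliers = smallest <= outlier_cuts[0] or largest >= outlier_cuts[1]

print([round(c, 3) for c in outlier_cuts], [round(c, 3) for c in far_cuts])
print("any outliers:", has_outliers)
```

Because both extreme values fall strictly inside the outlier cut points, no case can be an outlier or a far outlier on the log scale.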
Slide 66 The logarithmic re-expression of the distribution of hivaids had no outliers and no far outliers, effectively reducing the number of both to 0. We mark the check box for the statement.
Slide 67 The final statement asks about the impact of squaring the values of hivaids. Since squaring is a transformation for negatively skewed distributions and hivaids is positively skewed, we do not mark the check box.
Slide 68 Even though we do not need to re-express this variable as a square, an example of the commands in SPSS will be shown. The formula for the square transformation is provided in the second paragraph to the problem.
Slide 69 To compute the new variable, select the Compute command from the Transform menu.
Slide 70 In the Compute Variable dialog box, we type the name for the new variable, SQ_hivaids, in the Target Variable text box. The variable name follows the convention of prepending SQ_ (for square) to the original variable name. There is not a function to compute the square. Instead we directly type the formula into the Numeric Expression text box. Note the parentheses around the variable name. These could be optional in this problem, but when there is a constant involved, we need to make certain that the constant and the variable name are enclosed in parentheses. After we have typed in the variable name and formula, we click on the OK button to compute the new variable.
Slide 71 The square of hivaids is added to the data set. I increased the number of decimal places displayed so that the initial entries were not displayed as .000.
Slide 72 The feedback in BlackBoard shows that all of the check boxes we marked were correct.