# 1 By: Chris Hartl A statistics learning project: 0.


1 By: Chris Hartl A statistics learning project: 0

2 How the Presentation Works A sentence including a long underscore (______) denotes a question-and-answer exercise. Your next click will put an answer into the space.

3 Presentation Enhancement This presentation is enhanced by the statistical features of the TI-83 calculator. The data used in the presentation can be found in TI-program form here. The raw data is located at the end of the presentation.

4 Worksheets on This Presentation Some slides are worksheet slides, containing only open-ended questions. It is recommended that the user answer the questions with the data provided before proceeding to the answer slides. The answers to the worksheet are on the slides following the worksheet slide.

5 There are two basic types of transformations: single-variable and multi-variable transformations. This slideshow will concentrate on one-variable and two-variable transformations.

6 I. Single Variable Transformations

7 Sample Data This is a sample. While the sample mean is accurate, the sample does not appear to come from a normal (~N) population, and thus the assumptions for a mean test or a mean confidence interval are not met (assuming that n<30). [Figure: the sample data, with the sample mean and the sample median marked.]

8 Sample Data The median of the sample is robust: it is not as affected by outliers and skewness as the mean is. There are statistical median tests which can be used for small, skewed samples. However, it is common to think of a population in terms of its mean and standard deviation rather than its median, and most statistical tests involve these numbers. [Figure: the sample data, with the sample mean and the sample median marked.]
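
The robustness claim on this slide can be checked with a few lines of Python. This is a minimal sketch: the numbers are hypothetical and are not the presentation's sample. It adds a single extreme value to a small sample and compares how far the mean and the median move.

```python
from statistics import mean, median

# Hypothetical small sample (not the presentation's data).
sample = [12, 14, 15, 16, 18, 19, 21]
# The same sample with one extreme value added to the right tail.
with_outlier = sample + [150]

# How far each measure of center moves when the outlier is added.
mean_shift = mean(with_outlier) - mean(sample)
median_shift = median(with_outlier) - median(sample)

print(f"mean:   {mean(sample):.2f} -> {mean(with_outlier):.2f}")
print(f"median: {median(sample):.2f} -> {median(with_outlier):.2f}")
```

In this sketch the mean roughly doubles while the median moves by a single unit, which is the behavior the slide describes.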

9 Influence of Skewed Data on the Mean In a skewed distribution, the mean is influenced by the data found in the tail, and shifts towards the tail. The mean and standard deviation do not summarize the data as well as the five-number summary because of the skewness.

10 Transformation Can Normalize Data In a skewed distribution, the mean is biased, and moves towards the tail. The standard deviation is very high: those values do not summarize the data as well as the five-number summary due to the skew. Using a transformation will pull in the tail of the distribution, making the data more symmetric. The mean and standard deviation are then not affected by the tail as much. In other words, using a transformation makes the mean and standard deviation more robust. (There is an example of this later in the presentation.)

11 The Goals of Single-Variable Transformations Make the display of data easier to read and analyze. Reduce the effects of outliers and skewness on the mean and standard deviation of a sample.

12 Goal 1: Simplifying Data Display The complexity of the display of the word “simplicity” makes it hard to read and understand. The same thing occurs when you look at data: some data will be hard to analyze due to its display and summary. Ease of contextual analysis is based on the simplicity of the data description.

13 Transformation is changing the system of measurement of the data so that things become easier to interpret. It is often worthwhile to search for a transformation of the data to simplify its description and analysis. The main goals that can be achieved by transforming a single group of numbers are effective display of the data and symmetry of the distribution. Many standard statistical procedures require normally distributed data. Transformation can help satisfy this requirement by at least helping the distribution be symmetric instead of skewed.

14 Single Variable Worksheet #1 Transformations can help to interpret both single and multi-variable data; they change both the shape and the summaries of the data. For an example of how transformations can affect outliers, run the program “Island.” (Use the sheet that has been passed out to you.) 1) Construct a boxplot of the data. What do you notice? Do you have any ideas of how to improve the display? 2) Construct a stemplot of the data. From the display below, identify what causes the display in question one. What could we do to change it? 3) Rather than splitting the data and examining histograms, take the base 10 logarithm of the data, and store it in another list. Use the command (Log(Area) List). Construct a boxplot of this new data. What do you notice? Does transforming the data appear to have an effect on outliers?

15 1. Construct a boxplot of the data. What do you notice? Do you have any suggestions of how to improve the display? The boxplot has at least one outlier, which makes the display very hard to read. The whisker is many times wider than the IQR. I would recommend using a modified box plot. Solution: Worksheet #1(a); Question #1

16 Construct a modified boxplot of the data. What do you notice? Do you have any suggestions of what display to use for the data? The outliers on the modified boxplot are still making it hard to read. Because of the scale, very little information can be determined from this graph. My next recommendation would be to examine a histogram or stemplot of the distribution. Solution: Worksheet #1(a); Follow-up to Question

17 2) Construct a stemplot of the data. From the display below, identify what causes the display in question one. What could we do to change it? [Stemplot: stems in ten-thousands (0 through 6), leaves in thousands.] The stemplot reveals that the scale required to display all the data forces most of the data to cluster together, even though the values range from 7 to 9,000. A histogram would have the same shape as this stemplot, so my recommendation (excluding use of a transformation) would be to split the data at 10,000. Solution: Worksheet #1; Question #2

18 3) Rather than splitting the data and examining histograms, take the base 10 logarithm of the data, and store it in another list. Construct a boxplot of this new data. What do you notice? Does transforming the data appear to have an effect on outliers? Not only has the transformation of the data made the plot easier to analyze, but the outliers have, in this case, been brought within the statistical cutoff for outliers (namely 1.5 times the IQR). Transforming data seems to have a profound effect on outliers. Solution: Worksheet #1; Question #3
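
The 1.5 × IQR check described in this solution can be sketched in Python. The `areas` list is hypothetical (it is not the Island data), and the quartiles are taken as medians of the lower and upper halves, which is an assumption about how the calculator's boxplot computes them.

```python
import math
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves
    (the middle value is excluded when n is odd)."""
    s = sorted(values)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

def outliers(values):
    """Values lying more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = quartiles(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

# Hypothetical, strongly right-skewed areas (not the Island data).
areas = [7, 12, 20, 35, 60, 110, 250, 900, 4000, 9000, 60000]
log_areas = [math.log10(a) for a in areas]

print("outliers in raw data:", outliers(areas))
print("outliers after log10:", outliers(log_areas))
```

With these made-up values the largest area is flagged on the original scale and nothing is flagged after the logarithm, mirroring the pattern the worksheet reports.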

19 Single Variable Transformation Continued To examine the effect of functions on the spread of data, construct four parallel number lines (x, Log(x), x, and √x), and match up values for x with the corresponding values for Log(x) and √x. [Figure: number lines with x marked from 1 to 13.]
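
The number-line exercise can also be tabulated. The sketch below uses the values 1 through 13 from the slide and prints each x next to Log(x) and √x; the gap lists are added here only to show how quickly each function squeezes the larger values together.

```python
import math

xs = list(range(1, 14))  # 1 through 13, as on the slide's number lines

for x in xs:
    print(f"x = {x:2d}   log10(x) = {math.log10(x):.3f}   sqrt(x) = {math.sqrt(x):.3f}")

# Distance between consecutive transformed values along each number line.
log_gaps = [math.log10(b) - math.log10(a) for a, b in zip(xs, xs[1:])]
sqrt_gaps = [math.sqrt(b) - math.sqrt(a) for a, b in zip(xs, xs[1:])]

# Ratio of the last gap to the first: smaller means stronger compression.
print(f"log10 gap ratio: {log_gaps[-1] / log_gaps[0]:.3f}")
print(f"sqrt gap ratio:  {sqrt_gaps[-1] / sqrt_gaps[0]:.3f}")
```

Both functions shrink the spacing as x grows, and the logarithm shrinks it more than the square root does.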

20 Run the program CARPRI. This program will help to identify the ways in which transformations can change the shape of data, and help calculate a more meaningful mean. 1) Examine the histogram for the variable “Price,” and calculate the mean, standard deviation, and the mean ± standard deviation. If you were to use these values in a statistical test, what would be your concern? What would better values be? 2) Based on the results from the previous slide, which transformation would you recommend, a square-root function, or a logarithmic function? 3) Perform the appropriate transformation and compare its histogram to the histogram of the original data. Calculate the same values that you did in question one. What has changed? Do these values describe the data better before or after the transformation? Worksheet #2

21 1) Examine the histogram for the variable “Price,” and calculate the mean, standard deviation, and the mean ± standard deviation. If you were to use these values in a statistical test or confidence interval, what would be your concern? What would a more appropriate number summary be? The histogram of the data shows that the distribution is highly skewed. Single-variable statistics (1-Var Stats) give the mean and the standard deviation; the interval for mean ± SD is (-19160, 214644). Because the data is skewed, the mean is pulled to the right, and the standard deviation is very high. I would not use these values for a confidence interval or significance test. Since medians are robust, I would suggest a 5-number summary as a better summary of the data. Solution: Worksheet #2; Question #1
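
As an illustration of why mean ± SD can mislead for skewed data, here is a small Python sketch. The `prices` list is hypothetical and is not the CARPRI data; the quartile convention (medians of the two halves) is also an assumption.

```python
from statistics import mean, median, stdev

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (quartiles as medians of halves)."""
    s = sorted(values)
    half = len(s) // 2
    return (s[0], median(s[:half]), median(s), median(s[-half:]), s[-1])

# Hypothetical right-skewed prices (not the CARPRI data).
prices = [9000, 12000, 15000, 18000, 22000, 30000, 45000, 80000, 250000]

center, spread = mean(prices), stdev(prices)
interval = (center - spread, center + spread)
summary = five_number_summary(prices)

print(f"mean ± SD interval: ({interval[0]:.0f}, {interval[1]:.0f})")
print("five-number summary:", summary)
```

In this sketch the lower end of the mean ± SD interval is negative even though every price is positive, while the five-number summary stays inside the range of the data.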

22 2) Based on the results from the previous slide, which transformation would you recommend, a square-root function, or a logarithmic function? The activity we did on the previous slide showed that the logarithm function pulls higher values of x much closer to the other values than the square-root function does. For instance, the square root of 100,000 is approximately 316, while the base 10 logarithm of 100,000 is 5. Because the data in Question #1 has a very pronounced skew with a very long tail, a logarithm function would probably work better than a square-root function. I would recommend using the logarithm function. Solution: Worksheet #2; Question #2

23 3) Perform the appropriate transformation and compare its histogram to the histogram of the original data. Calculate the same values that you did in question one. What has changed? Do these values describe the data better before or after the transformation? After the transformation, the histogram is much less skewed and much more symmetrical. The new mean, put back into the original units, is 10^(4.783) ≈ 60,673. This is a more logical result (considering it is the mean car price). The new mean ± standard deviation no longer includes negative numbers, which is reassuring. Solution: Worksheet #2; Question #3
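
The "un-transforming" step in this solution can be sketched as follows. The `prices` list is hypothetical (not the CARPRI data). The sketch takes base 10 logarithms, computes the mean and standard deviation on the log scale, and raises 10 to those values to return to the original units; the result coincides with the geometric mean of the data.

```python
import math
from statistics import geometric_mean, mean, stdev

# Hypothetical right-skewed prices (not the CARPRI data).
prices = [9000, 12000, 15000, 18000, 22000, 30000, 45000, 80000, 250000]

logs = [math.log10(p) for p in prices]
log_mean, log_sd = mean(logs), stdev(logs)

# Put the log-scale statistics back into the original units.
back_mean = 10 ** log_mean
back_interval = (10 ** (log_mean - log_sd), 10 ** (log_mean + log_sd))

print(f"ordinary mean:         {mean(prices):.0f}")
print(f"back-transformed mean: {back_mean:.0f}")
print(f"back-transformed mean ± SD: ({back_interval[0]:.0f}, {back_interval[1]:.0f})")
```

Because 10 raised to any power is positive, the back-transformed interval cannot include negative prices.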

24 The “Issue” With Changing Data We’ve used a transformation which changes the shape of the data. As you saw in the answer to question 3, the summary statistics of the transformed data are “un-transformed” to put the statistics back into the original units and preserve context.

25 The “Issue” With Changing Data The five-number summary of the transformed data, when put back in the context of the data (un-transformed) is the same as the 5-number summary of the original data.
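
A quick way to check this claim is to compute the five-number summary before and after a logarithm and then un-transform it. The data below are hypothetical. The sketch uses n = 7 so that every summary value is an actual data point; when a quartile or the median is the average of two values, the match is only approximate.

```python
import math
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (quartiles as medians of halves)."""
    s = sorted(values)
    half = len(s) // 2
    return [s[0], median(s[:half]), median(s), median(s[-half:]), s[-1]]

# Hypothetical skewed data with n = 7 (not from the presentation).
data = [7, 40, 150, 900, 4000, 9000, 60000]
logged = [math.log10(v) for v in data]

original = five_number_summary(data)
# Summarize on the log scale, then raise 10 to each value to un-transform.
back = [10 ** v for v in five_number_summary(logged)]

print("original summary:        ", original)
print("un-transformed summary:  ", [round(v, 6) for v in back])
```

The match holds because the logarithm preserves the order of the data, so the same observations sit at the minimum, quartiles, median, and maximum on both scales.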

26 The “Issue” With Changing Data The mean and standard deviation from the transformed data, however, are not the same as the mean and standard deviation of the original data. The transformed mean is a new center of the data, which is viable for a statistical mean test.

27 The “Issue” With Changing Data In small, skewed samples, the conditions for a two-sample mean test of equality are not met when using the original means, but are met when using the transformed means. By using the transformed means rather than the original means, one can do a mean test of equality on two small, skewed samples.
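
As a rough illustration, the sketch below computes a two-sample t statistic on the original values and on their base 10 logarithms. Everything here is an assumption for demonstration: the two samples are hypothetical, and the unpooled (Welch) form of the statistic is used because the presentation does not say which form it has in mind. Only the statistic is computed; a p-value would still require the t distribution.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Unpooled two-sample t statistic for mean(x) minus mean(y)."""
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

# Hypothetical small, right-skewed samples (not from the presentation).
group_a = [10, 12, 15, 20, 30, 55, 120]
group_b = [40, 48, 60, 80, 120, 220, 480]

log_a = [math.log10(v) for v in group_a]
log_b = [math.log10(v) for v in group_b]

t_raw = welch_t(group_a, group_b)
t_log = welch_t(log_a, log_b)

print(f"t on original values:   {t_raw:.2f}")
print(f"t on log10 values:      {t_log:.2f}")
```

Note that a test on the transformed values compares the means of the logarithms, so its conclusion is about the new centers described on the previous slide rather than the original arithmetic means.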

28 Two-Variable Transformation This is a graph of two-variable data. [Figure: scatterplot of the response variable (y) against the explanatory variable, with the LSR line.]

29 Two-Variable Transformation A statistician needs to test whether the population has a linear association, by using a  test on the data. [Figure: scatterplot of the response variable (y) against the explanatory variable (x), with the LSR line.]

30 Two-Variable Transformation For this data, the conditions for the test are _____ because the data does not vary normally about the LSR line. Answer: violated. [Figure: scatterplot of the response variable (y) against the explanatory variable (x), with the LSR line.]

31 What Two-Variable Transformations Do: Linearize data so that higher statistical calculations (such as inference tests for  and  ) can be used on data which is originally not linear. For this reason, two-variable transformations are often called “linear transformations.” Two-Variable Transformation

32 Two-Variable Transformation In other words, the goal of linear transformations is to turn this into: [Figure: curved scatterplot of the response variable (y) against the explanatory variable (x).]

33 Two-Variable Transformation This. Note the change from Y to the square-root of Y. This is the transformation. [Figure: scatterplot of √y against the explanatory variable (x), with the LSR line.]

34 Linear Transformations In the demonstration, the data appeared to be modeled by a power function: it appeared to be a quadratic. Two-Variable Transformation

35 The square-root function was the best candidate to linearize the sample data because it was the inverse of the apparent functional relationship between the explanatory and response variables. Two-Variable Transformation Linear Transformations
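
To see the "inverse of the apparent relationship" idea in numbers, the sketch below builds a hypothetical, noise-free quadratic (y = 3x², chosen only for illustration) and compares the correlation of x with y against the correlation of x with √y. The `pearson_r` helper is written out by hand so the example needs no extra libraries.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 11))
ys = [3 * x ** 2 for x in xs]  # hypothetical quadratic relationship

r_raw = pearson_r(xs, ys)
# The square root undoes the squaring, leaving sqrt(3) * x, a straight line.
r_sqrt = pearson_r(xs, [math.sqrt(y) for y in ys])

print(f"r for (x, y):       {r_raw:.4f}")
print(f"r for (x, sqrt(y)): {r_sqrt:.4f}")
```

Real data would include scatter, so the transformed correlation would be high rather than perfect, and a residual plot would still be needed to judge linearity.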

36 This is true of all linearizing functions: the best candidates for functions to use are the ones which would best “undo” the non-linear relationship—the inverse of the apparent functional relationship. Linear Transformations Two-Variable Transformation

37 Two-Variable Transformation: Linear Transformation There are many different types of transformations to use on two-variable data. The three which we will examine are: the square-root function (used for parabolic or power data), the logarithmic function (used for exponential data), and the inverse function (1/x) (used for asymptotic data).

38 Run the program CARPRI again. Set up a scatterplot of Price vs Mileage. Try several different transformations, including the square-root and the logarithmic function. 1) Of the functions tried, which one renders a more linear graph? 2) Can you draw any conclusions about what shape of scatterplots the LOG function will linearize? 3) What transformation would you suggest for this graph? Worksheet #3

39 1) Of the functions tried, which one renders a more linear graph? This is the graph of the original data. [Figure: scatterplot of Price against Mileage.] Worksheet 3; Question #1

40 1) Of the functions tried, which one renders a more linear graph? This is the data after a logarithmic transformation. This is the data after a square-root transformation. [Figures: scatterplot of Log(Price) against Mileage, and scatterplot of the square-root-transformed data against Mileage.] Worksheet 3; Question #1, cont.

41 1) Of the functions tried, which one renders a more linear graph? The logarithmic transformation shows less of a downward trend than the square-root transformation, resulting in a more linear plot. Although the square-root transformation improves linearity, the downward trend near the origin of the data persists before the linear pattern becomes evident. Worksheet 3; Question #1, cont.

42 1) Of the functions tried, which one renders a more linear graph? The residuals in the residual plot for the logarithm function appear more random than those for the square-root function. The square-root transformation appears not to have eliminated the trend in the data, but only lessened the eccentricity of the relationship. [Figures: residual plots of linear regression models fit to the respective graphs.] Worksheet 3; Question #1, cont.

43 Examining the Model Since we have identified the transformation which best linearizes the data, we can treat the graph as a linear relationship between x and log(y). [Figures: scatterplot of Log(Price) against Mileage, and its residual plot.] Solution: Worksheet 3; Question #1, cont.

44 Examining the Model Here’s the big question: Is the new relationship linear? We can examine the model like any other linear model to answer this question. [Figures: scatterplot of Log(Price) against Mileage, and its residual plot.] Solution: Worksheet 3; Question #1, cont.

45 Examining the Model The values for r and r² are very high, and the residual plot seems fairly random, or at the very least, the most random residual plot obtained. The linear model appears to fit the data. My conclusion is that there is a linear association between Mileage and Log(Price). Solution: Worksheet 3; Question #1, cont.
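
The checks described here (r, r², and residuals) can be sketched for a log-transformed response. The mileage and price values below are hypothetical: an exponential decay with a small made-up wobble, not the CARPRI data. The `fit_line` helper is an illustrative least-squares fit written by hand.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope, intercept, and correlation r for y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

# Hypothetical data: mileage in thousands, price decaying exponentially.
mileage = [5, 10, 20, 30, 40, 50, 60, 80, 100, 120]
wobble = [1.03, 0.97, 1.02, 0.99, 1.01, 0.98, 1.04, 0.97, 1.00, 1.02]
price = [30000 * math.exp(-0.02 * m) * w for m, w in zip(mileage, wobble)]
log_price = [math.log10(p) for p in price]

slope, intercept, r_log = fit_line(mileage, log_price)
_, _, r_raw = fit_line(mileage, price)

# Residuals of the linear model fit to (Mileage, Log(Price)).
residuals = [y - (intercept + slope * x) for x, y in zip(mileage, log_price)]

print(f"r, r^2 for Price:      {r_raw:.4f}, {r_raw ** 2:.4f}")
print(f"r, r^2 for Log(Price): {r_log:.4f}, {r_log ** 2:.4f}")
print("residuals:", [round(e, 4) for e in residuals])
```

A high r² alone does not establish linearity, which is why the slide also looks at whether the residuals scatter randomly around zero.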

46 2) Can you draw any conclusions about the shape of scatterplots the LOG function will linearize? Since the graph of Price vs. Mileage was a horizontally asymptotic graph, I think that for most graphs which look similar to the demonstrated graph, the logarithm function (of no particular base) of the response variable will linearize the data. Worksheet 3; Question #2

47 3) What transformation would you suggest to linearize this graph: This data appears to take either a second-degree relationship, or an exponential relationship. I would suggest using either a square-root transformation on the response variable, or a normal logarithm on the response variable. Worksheet 3; Question #3

48 Guidelines for Transformations The last question of the exercise is intended to reveal that choosing a transformation to linearize data requires thought, and perhaps trial and error, especially when specific values for the data are not given (i.e. only a graph.)

49 Guidelines for Transformations Luckily, some guidelines exist which tell us which patterns to use specific transformations on. In these guidelines, the ln function is a recommendation for a logarithmic function: depending on the data, the base of the function may differ.

50 Guidelines for Transformations [Figure: sketch of the curve on x and y axes.] Contains (0,0) and appears to be a power curve, or a curve asymptotic to both horizontal and vertical axes. Suggested transformation: $(x, y) \to (\ln x, \ln y)$, with $x_i > 0$ and $y_i > 0$.
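
A short derivation shows why taking logarithms of both variables straightens a power curve. It assumes the relationship has the form $y = a x^{p}$ with $a > 0$; the symbols $a$ and $p$ are introduced here for illustration and do not appear in the presentation.

$$
y = a x^{p} \quad\Longrightarrow\quad \ln y = \ln a + p \,\ln x
$$

Under that assumption, $\ln y$ is a linear function of $\ln x$ with slope $p$ and intercept $\ln a$, so a least-squares line fit to the transformed points estimates the exponent directly.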

51 Guidelines for Transformations [Figure: sketch of the curve on x and y axes.] Contains a nonzero y-intercept and appears exponential (either growth or decay). Suggested transformation: $(x, y) \to (x, \ln y)$, with $y_i > 0$.
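
The same reasoning applies to an exponential pattern. Assuming the relationship has the form $y = a b^{x}$ with $a > 0$ and $b > 0$ (again, $a$ and $b$ are illustrative symbols, not the presentation's notation):

$$
y = a b^{x} \quad\Longrightarrow\quad \ln y = \ln a + (\ln b)\, x
$$

Here $\ln y$ is linear in $x$ itself, with slope $\ln b$: positive for growth ($b > 1$) and negative for decay ($0 < b < 1$). Only the response variable needs to be transformed.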

52 Guidelines for Transformations [Figure: sketch of the curve on x and y axes.] Contains (0,0) and appears logarithmic. Suggested transformation: $(x, y) \to (\sqrt{x}, y)$, with $x_i \ge 0$.

53 Guidelines for Transformations [Figure: sketch of the curve on x and y axes.] Contains a nonzero y-intercept and appears logarithmic. Suggested transformation: $(x, y) \to (\ln x, y)$, with $x_i > 0$.

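54 Guidelines for Transformations [Figure: sketch of the curve on x and y axes.] Has nonzero horizontal and vertical asymptotes. Suggested transformation: $(x, y) \to (1/x, 1/y)$, with $x_i \ne 0$ and $y_i \ne 0$.

55 $(x, y) \to (\ln x, \ln y)$; $(x, y) \to (x, \ln y)$; $(x, y) \to (\sqrt{x}, y)$; $(x, y) \to (\ln x, y)$; $(x, y) \to (1/x, 1/y)$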