Fixing problems with the model Transforming the data so that the simple linear regression model is okay for the transformed data.


1 Fixing problems with the model Transforming the data so that the simple linear regression model is okay for the transformed data.

2 Where does this topic fit in?
Model building
 – Model formulation
 – Model estimation
 – Model evaluation
Model use

3 Just a reminder Data analysis is an artful (subjective decisions!) science (objective tools!). Data transformation definitely requires a “trial and error” process.

4 Options for fixing problems with the model Abandon simple linear regression model and find a more appropriate – but typically more complex – model. Transform the data so that the simple linear regression model works for the transformed data.

5 Abandoning the model
If not linear: try a different function, like a quadratic or an exponential function.
If unequal error variances: use weighted least squares.
If error terms are not independent: try fitting a time series model.
If important predictor variables omitted: try fitting a multiple regression model.
If outlier: use a robust estimation procedure.

6 Choices for transforming the data Transform predictor (x) values only. Transform response (y) values only. Transform both the predictor (x) values and the response (y) values.

7 Transforming the predictor (x) values only Building the model.

8 Transforming the x values only Appropriate when non-linearity is a problem – normality and equal variance okay. It may be necessary to correct the non-linearity before you can assess the normality and equal variance assumptions. If the error terms are well-behaved, transforming the y values could change them into badly-behaved error terms.

9 Memory retention (Example 1)
Subjects were asked to memorize a list of disconnected items, then asked to recall them at various times up to a week later.
Predictor time = time, in minutes, since the list was first memorized.
Response prop = proportion of items recalled correctly.

 time  prop
    1  0.84
    5  0.71
   15  0.61
   30  0.56
   60  0.54
  120  0.47
  240  0.45
  480  0.38
  720  0.36
 1440  0.26
 2880  0.20
 5760  0.16
10080  0.08

10 Fitted line plot Example 1

11 Residual vs. fits plot Example 1

12 Normal probability plot Example 1

13 The logarithmic transformation
Most useful transformation.
Most common scale for scientific work is the natural logarithm (denoted ln or log), based on the number e = 2.71828…
General rules:
 – ln(e) = 1
 – ln(1) = 0
 – ln(e^x) = x

14 Transform the x values (Example 1)
Change (“transform”) the predictor time to ln(time).

 time  prop   lntime
    1  0.84  0.00000
    5  0.71  1.60944
   15  0.61  2.70805
   30  0.56  3.40120
   60  0.54  4.09434
  120  0.47  4.78749
  240  0.45  5.48064
  480  0.38  6.17379
  720  0.36  6.57925
 1440  0.26  7.27240
 2880  0.20  7.96555
 5760  0.16  8.65869
10080  0.08  9.21831
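The transformation on this slide can be checked with a short script. The following is a minimal sketch, not the software used for the slides: the helper `fit_line` is a hypothetical name introduced here, and it implements ordinary least squares by hand using only the standard library. It fits prop against time and against ln(time) so the two fits can be compared.

```python
import math

# Data from Example 1 (memory retention).
time = [1, 5, 15, 30, 60, 120, 240, 480, 720, 1440, 2880, 5760, 10080]
prop = [0.84, 0.71, 0.61, 0.56, 0.54, 0.47, 0.45,
        0.38, 0.36, 0.26, 0.20, 0.16, 0.08]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x (hypothetical helper).

    Returns (b0, b1, r_squared).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return b0, b1, r_squared


# Transform the predictor only: time -> ln(time).
lntime = [math.log(t) for t in time]

raw_b0, raw_b1, raw_r2 = fit_line(time, prop)
log_b0, log_b1, log_r2 = fit_line(lntime, prop)

print(f"prop vs time:     R-Sq = {raw_r2:.1%}")
print(f"prop vs ln(time): prop = {log_b0:.3f} + ({log_b1:.4f}) lntime, "
      f"R-Sq = {log_r2:.1%}")
```

R-Sq alone does not show whether the model assumptions hold, so the residual and normal probability plots are still needed after the transformation.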

15 Fitted line plot using transformed x values Example 1

16 Residuals vs. fits plot using transformed x values Example 1

17 Normal probability plot using transformed x values Example 1

18 What if we transform the y values instead, when nonlinearity is the main problem? Example 1

19 The residuals are an improvement… (although not great)… Example 1

20 …but we now have non-normal error terms. Example 1

21 Transforming the predictor (x) values only Using the model to answer your research question.

22 What is the nature of the association between time since memorized and effectiveness of recall?

23 Is there an association between time since memorized and effectiveness of recall?

The regression equation is
prop = 0.846 - 0.0792 lntime

Predictor       Coef    SE Coef       T      P
Constant     0.84642    0.01419   59.63  0.000
lntime     -0.079227   0.002416  -32.80  0.000

S = 0.02339   R-Sq = 99.0%   R-Sq(adj) = 98.9%

Analysis of Variance
Source          DF       SS       MS        F      P
Regression       1  0.58841  0.58841  1075.70  0.000
Residual Error  11  0.00602  0.00055
Total           12  0.59443

24 What proportion of words can we expect a person to recall after 1000 minutes?

Predicted Values for New Observations
New Obs      Fit   SE Fit        95.0% CI        95.0% PI
1        0.29896  0.00766  (0.282, 0.316)  (0.245, 0.353)

Values of Predictors for New Observations
New Obs  lntime
1          6.91
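The fitted value on this slide can be reproduced from the reported coefficients. This is a small sketch under the assumption that the coefficients are taken directly from the regression output; the function name `predict_prop` is hypothetical. It also shows that the slide's fit uses ln(1000) rounded to 6.91.

```python
import math

# Coefficients copied from the regression output (prop vs lntime).
b0 = 0.84642
b1 = -0.079227


def predict_prop(minutes):
    """Predicted mean proportion recalled after `minutes` (hypothetical helper)."""
    return b0 + b1 * math.log(minutes)


# Using the exact value ln(1000) = 6.9078...
fit_exact = predict_prop(1000)

# Using the rounded predictor value 6.91 shown in the output.
fit_rounded = b0 + b1 * 6.91

print(f"fit with exact ln(1000): {fit_exact:.5f}")
print(f"fit with lntime = 6.91:  {fit_rounded:.5f}")
```

The two values differ only in the fourth decimal place, so the rounding has no practical effect here.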

25 How much does expected recall change if time increases ten-fold? We can say a ten-fold increase in x is associated with a β1 × ln(10) change in the mean of y. And, we can say a two-fold increase in x is associated with a β1 × ln(2) change in the mean of y. Choose the multiple so that it makes sense for the scope of the model.

26 How much does expected recall change if time increases ten-fold?

Predictor       Coef    SE Coef       T      P
Constant     0.84642    0.01419   59.63  0.000
lntime     -0.079227   0.002416  -32.80  0.000

We expect the proportion of recalled words to change by -0.079227 × ln(10) = -0.182 for each ten-fold increase in time since memorization took place.

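27 How much does expected recall change if time increases ten-fold?

Predictor       Coef    SE Coef       T      P
Constant     0.84642    0.01419   59.63  0.000
lntime     -0.079227   0.002416  -32.80  0.000

Since the 95% confidence interval for β1 is -0.079227 ± 2.201 × 0.002416 = (-0.0845, -0.0739), where 2.201 is the t multiplier for 11 degrees of freedom, we can be 95% confident that the proportion of recalled words will change by between -0.0845 × ln(10) = -0.195 and -0.0739 × ln(10) = -0.170 for each ten-fold increase in time since memorization took place.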
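The ten-fold calculation can be sketched as follows. The slope and standard error come from the regression output; the t multiplier 2.201 (97.5th percentile, 11 degrees of freedom) is an assumption read from a standard t table, since the standard library has no t quantile function. The helper `change_per_fold` is a hypothetical name.

```python
import math

# Slope and its standard error from the regression output.
b1 = -0.079227
se_b1 = 0.002416

# Assumption: t multiplier for a 95% interval with 11 degrees of freedom,
# taken from a t table rather than computed.
t_crit = 2.201


def change_per_fold(slope, fold):
    """Change in mean y when x is multiplied by `fold` (hypothetical helper)."""
    return slope * math.log(fold)


# 95% confidence interval for the slope itself.
slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Point estimate and interval for a ten-fold increase in time.
tenfold = change_per_fold(b1, 10)
tenfold_ci = tuple(change_per_fold(s, 10) for s in slope_ci)

# The same idea for a two-fold increase.
twofold = change_per_fold(b1, 2)

print(f"slope CI:        ({slope_ci[0]:.4f}, {slope_ci[1]:.4f})")
print(f"ten-fold change: {tenfold:.3f}, "
      f"CI ({tenfold_ci[0]:.3f}, {tenfold_ci[1]:.3f})")
print(f"two-fold change: {twofold:.3f}")
```

Multiplying the interval endpoints by ln(10) works because ln(10) is a positive constant, so the order of the endpoints is preserved.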

28 Transforming the y values only Building the model.

29 Transforming the y values only You should consider transforming the y values when non-normality and/or unequal variances are problems. As an added bonus, the transformation on y may also help to “straighten out” a curved relationship.

30 Gestation time and birth weight for mammals (Example 2)
Predictor Birthwgt = birth weight, in kg, of mammal.
Response Gestation = number of days until birth.

Mammal     Birthwgt  Gestation
Goat           2.75        155
Sheep          4.00        175
Deer           0.48        190
Porcupine      1.50        210
Bear           0.37        213
Hippo         50.00        243
Horse         30.00        340
Camel         40.00        380
Zebra         40.00        390
Giraffe       98.00        457
Elephant     113.00        670

31 Fitted line plot Example 2

32 Residual vs. fits plot Example 2

33 Normal probability plot Example 2

34 Transform the y values (Example 2)
Change (“transform”) the response Gestation to ln(Gestation).

Mammal     Birthwgt  Gestation   lnGest
Goat           2.75        155  5.04343
Sheep          4.00        175  5.16479
Deer           0.48        190  5.24702
Porcupine      1.50        210  5.34711
Bear           0.37        213  5.36129
Hippo         50.00        243  5.49306
Horse         30.00        340  5.82895
Camel         40.00        380  5.94017
Zebra         40.00        390  5.96615
Giraffe       98.00        457  6.12468
Elephant     113.00        670  6.50728
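A minimal sketch of the response-only transformation, using the data on this slide and only the standard library. The helper `fit_line` is a hypothetical hand-written least squares routine, not the software that produced the slides.

```python
import math

# Data from Example 2 (birth weight in kg, gestation in days).
birthwgt = [2.75, 4.00, 0.48, 1.50, 0.37, 50.00,
            30.00, 40.00, 40.00, 98.00, 113.00]
gestation = [155, 175, 190, 210, 213, 243, 340, 380, 390, 457, 670]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x (hypothetical helper).

    Returns (b0, b1, r_squared).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, sxy ** 2 / (sxx * syy)


# Transform the response only: Gestation -> ln(Gestation).
ln_gest = [math.log(g) for g in gestation]

b0, b1, r2 = fit_line(birthwgt, ln_gest)
print(f"lnGest = {b0:.2f} + {b1:.4f} Birthwgt, R-Sq = {r2:.1%}")
```

As with the first example, the fit statistics do not replace the residual and normal probability plots, which are what show whether the error terms behave after the transformation.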

35 Fitted line plot using transformed y values Example 2

36 Residual vs. fits plot using transformed y values Example 2

37 Normal probability plot using transformed y values Example 2

38 Transforming the response (y) values only Using the model to answer your research question.

39 What is the nature of the association between birth weight and length of gestation?

40 Is there an association between birth weight and length of gestation?

The regression equation is
lnGest = 5.28 + 0.0104 Birthwgt

Predictor      Coef   SE Coef      T      P
Constant    5.27882   0.08818  59.87  0.000
Birthwgt   0.010410  0.001717   6.06  0.000

S = 0.2163   R-Sq = 80.3%   R-Sq(adj) = 78.1%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  1.7193  1.7193  36.75  0.000
Residual Error   9  0.4211  0.0468
Total           10  2.1405

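41 What is the expected gestation length of a new 50 kg mammal? Estimated regression function: lnGest = 5.28 + 0.0104 Birthwgt. Therefore, since the predicted ln(Gestation) at 50 kg is 5.27882 + 0.010410 × 50 = 5.7993, we predict the gestation length of another mammal at 50 kg to be e^5.7993 = 330.1 days. Example 2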

42 What is the expected gestation length of a new 50 kg mammal? Example 2

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
1        5.7993  0.0704  (5.6401, 5.9586)  (5.2847, 6.3139)

Values of Predictors for New Observations
New Obs  Birthwgt
1            50.0

We can be 95% confident that the gestation length for a new mammal at 50 kg will be between 197.3 and 552.2 days.
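The back-transformation from the log scale to days can be sketched as below. The coefficients and interval endpoints are copied from the output; the variable names are hypothetical. Because the model is fitted to ln(Gestation), every quantity on the log scale is exponentiated to return to days.

```python
import math

# Coefficients from the regression of lnGest on Birthwgt.
b0 = 5.27882
b1 = 0.010410

# Fitted value on the log scale for a 50 kg mammal.
ln_fit = b0 + b1 * 50

# Intervals on the log scale, copied from the output.
ci_ln = (5.6401, 5.9586)   # 95% CI
pi_ln = (5.2847, 6.3139)   # 95% PI

# Exponentiate to return to the original scale (days).
fit_days = math.exp(ln_fit)
ci_days = tuple(math.exp(v) for v in ci_ln)
pi_days = tuple(math.exp(v) for v in pi_ln)

print(f"ln fit = {ln_fit:.4f}, back-transformed = {fit_days:.1f} days")
print(f"95% CI in days: ({ci_days[0]:.1f}, {ci_days[1]:.1f})")
print(f"95% PI in days: ({pi_days[0]:.1f}, {pi_days[1]:.1f})")
```

The exponentiated prediction interval reproduces the 197.3 to 552.2 days quoted on the slide.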

43 What is the expected change in length of gestation for each one kilogram increase in birth weight? The median changes by a factor of e^β1 for each one-unit increase in the predictor x.

44 What is the expected change in length of gestation for each one kilogram increase in birth weight?

Predictor      Coef   SE Coef      T      P
Constant    5.27882   0.08818  59.87  0.000
Birthwgt   0.010410  0.001717   6.06  0.000

The estimated regression line tells us:
The median gestation for a mammal weighing 3 kg is 1.01 times the median gestation for a mammal weighing 2 kg.
The median gestation for a mammal weighing 30 kg is 1.01^10 = 1.105 times the median gestation for a mammal weighing 20 kg.

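45 What is the expected change in length of gestation for each one kilogram increase in birth weight?

Predictor      Coef   SE Coef      T      P
Constant    5.27882   0.08818  59.87  0.000
Birthwgt   0.010410  0.001717   6.06  0.000

Since the 95% confidence interval for β1 is 0.010410 ± 2.262 × 0.001717 = (0.00653, 0.01429), where 2.262 is the t multiplier for 9 degrees of freedom, we can be 95% confident that the median gestation will increase by a factor between e^0.00653 = 1.007 and e^0.01429 = 1.014 for each one kilogram increase in birth weight.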
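A short sketch of the multiplicative interpretation when only y is log-transformed. The slope and standard error come from the output; the t multiplier 2.262 (97.5th percentile, 9 degrees of freedom) is an assumption read from a t table, and the variable names are hypothetical.

```python
import math

# Slope and its standard error from the regression of lnGest on Birthwgt.
b1 = 0.010410
se_b1 = 0.001717

# Assumption: t multiplier for a 95% interval with 9 degrees of freedom,
# taken from a t table rather than computed.
t_crit = 2.262

# Multiplicative change in median gestation per 1 kg of birth weight.
factor_1kg = math.exp(b1)

# 95% CI for the slope, then exponentiated to get a CI for the factor.
slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
factor_ci = tuple(math.exp(s) for s in slope_ci)

# Multiplicative change per 10 kg, computed without intermediate rounding.
factor_10kg = math.exp(10 * b1)

print(f"factor per 1 kg:  {factor_1kg:.4f}")
print(f"95% CI for factor: ({factor_ci[0]:.4f}, {factor_ci[1]:.4f})")
print(f"factor per 10 kg: {factor_10kg:.4f}")
```

The slide's 1.105 for a 10 kg difference comes from rounding the one-kilogram factor to 1.01 before raising it to the tenth power; carrying full precision gives a value closer to 1.110.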

46 Transforming both the x and y values Building the model.

47 Transforming both the x and y values You might have to do this when everything seems wrong – the error terms are not normal, have unequal variances, and the function is not linear. Transforming the y values corrects problems with the error terms (and may help the non-linearity). Transforming the x values primarily corrects the non-linearity.

48 Diameter (inches) and volume (cu. ft.) of 70 shortleaf pines Example 3

49 Residuals vs. fits plot Example 3

50 Normal probability plot Example 3

51 Transform the x values only (Example 3)
Transform predictor diameter to ln(diameter).

Diameter  Volume   lnDiam
     4.4     2.0  1.48160
     4.6     2.2  1.52606
     5.0     3.0  1.60944
     5.1     4.3  1.62924
     5.1     3.0  1.62924
     5.2     2.9  1.64866
     5.2     3.5  1.64866
     5.5     3.4  1.70475
     5.5     5.0  1.70475
     5.6     7.2  1.72277
     5.9     6.4  1.77495
     5.9     5.6  1.77495
     7.5     7.7  2.01490
     7.6    10.3  2.02815
… and so on …

52 Fitted line plot using transformed x values Example 3

53 Residuals vs. fitted plot using transformed x values Example 3

54 Normal probability plot using transformed x values Example 3

55 Transform both the x and y values (Example 3)
Transform predictor diameter to ln(diameter).
Transform response volume to ln(volume).

Diameter  Volume   lnDiam    lnVol
     4.4     2.0  1.48160  0.69315
     4.6     2.2  1.52606  0.78846
     5.0     3.0  1.60944  1.09861
     5.1     4.3  1.62924  1.45862
     5.1     3.0  1.62924  1.09861
     5.2     2.9  1.64866  1.06471
     5.2     3.5  1.64866  1.25276
     5.5     3.4  1.70475  1.22378
     5.5     5.0  1.70475  1.60944
     5.6     7.2  1.72277  1.97408
     5.9     6.4  1.77495  1.85630
     5.9     5.6  1.77495  1.72277
     7.5     7.7  2.01490  2.04122
     7.6    10.3  2.02815  2.33214
… and so on …

56 Fitted line plot using transformed x and y values Example 3

57 Residual plot using transformed x and y values Example 3

58 Normal probability plot using transformed x and y values Example 3

59 Transforming both the x and y values Using the model to answer your research question.

60 What is the nature of the association between diameter and volume of pines?

61 Is there an association between diameter and volume of pines?

The regression equation is
lnVol = - 2.87 + 2.56 lnDiam

Predictor     Coef  SE Coef       T      P
Constant   -2.8718   0.1215  -23.63  0.000
lnDiam     2.56442  0.05120   50.09  0.000

S = 0.1703   R-Sq = 97.4%   R-Sq(adj) = 97.3%

Analysis of Variance
Source          DF      SS      MS        F      P
Regression       1  72.734  72.734  2509.00  0.000
Residual Error  68   1.971   0.029
Total           69  74.706

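62 What is the median volume of all pine trees that are 10" in diameter? Estimated regression function: lnVol = -2.87 + 2.56 lnDiam. Therefore, since the predicted ln(volume) at a diameter of 10" is -2.8718 + 2.56442 × ln(10) = 3.033, we predict the median volume of all 10" shortleaf pines to be e^3.033 = 20.8 cubic feet. Example 3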

63 What is the median volume of all pine trees that are 10" in diameter? Example 3

Predicted Values for New Observations
New Obs     Fit  SE Fit          95.0% CI          95.0% PI
1        3.0330  0.0204  (2.9922, 3.0738)  (2.6908, 3.3752)

Values of Predictors for New Observations
New Obs  lnDiam
1          2.30

We can be 95% confident that the median volume of all shortleaf pines with a 10" diameter is between 19.9 and 21.6 cubic feet.
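When both variables are log-transformed, the prediction needs ln(diameter) going in and an exponential coming out. The following sketch uses the coefficients and confidence limits from the output; the function name `median_volume` is hypothetical.

```python
import math

# Coefficients from the regression of lnVol on lnDiam.
b0 = -2.8718
b1 = 2.56442


def median_volume(diameter_in):
    """Estimated median volume (cu. ft.) at a given diameter in inches.

    Hypothetical helper: predicts on the log scale, then exponentiates.
    """
    ln_vol = b0 + b1 * math.log(diameter_in)
    return math.exp(ln_vol)


# Fitted value on the log scale and its back-transform for a 10" tree.
ln_fit = b0 + b1 * math.log(10)
volume_10in = median_volume(10)

# 95% CI on the log scale (from the output), back-transformed to cubic feet.
ci_ln = (2.9922, 3.0738)
ci_volume = tuple(math.exp(v) for v in ci_ln)

print(f"ln fit = {ln_fit:.4f}, median volume = {volume_10in:.2f} cu. ft.")
print(f"95% CI: ({ci_volume[0]:.1f}, {ci_volume[1]:.1f}) cu. ft.")
```

The back-transformed limits match the 19.9 and 21.6 cubic feet stated on the slide.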

64 What is the expected change in volume for a two-fold increase in diameter? The median changes by a factor of 2^β1 for each two-fold increase in the predictor x.

65 What is the expected change in volume for a two-fold increase in diameter?

Predictor     Coef  SE Coef       T      P
Constant   -2.8718   0.1215  -23.63  0.000
lnDiam     2.56442  0.05120   50.09  0.000

The estimated regression line tells us:
The median volume of a 20" diameter tree is estimated to be 5.92 times the median volume of a 10" diameter tree.
The median volume of a 10" diameter tree is estimated to be 5.92 times the median volume of a 5" diameter tree.

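66 What is the expected change in volume for a two-fold increase in diameter?

Predictor     Coef  SE Coef       T      P
Constant   -2.8718   0.1215  -23.63  0.000
lnDiam     2.56442  0.05120   50.09  0.000

Since the 95% confidence interval for β1 is 2.56442 ± 1.995 × 0.05120 = (2.462, 2.667), where 1.995 is the approximate t multiplier for 68 degrees of freedom, we can be 95% confident that the median volume will increase by a factor between 2^2.462 = 5.51 and 2^2.667 = 6.35 for each two-fold increase in diameter.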
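The two-fold factor and its interval can be sketched as follows. The slope and standard error come from the output; the t multiplier 1.9955 (97.5th percentile, 68 degrees of freedom) is an approximate value assumed from a t table, and the helper name `fold_factor` is hypothetical.

```python
# Slope and its standard error from the regression of lnVol on lnDiam.
b1 = 2.56442
se_b1 = 0.05120

# Assumption: approximate t multiplier for a 95% interval with
# 68 degrees of freedom, taken from a t table rather than computed.
t_crit = 1.9955


def fold_factor(slope, fold):
    """Multiplicative change in median y when x is multiplied by `fold`.

    Hypothetical helper for a model fitted to ln(y) and ln(x).
    """
    return fold ** slope


# Point estimate for doubling the diameter.
factor_2x = fold_factor(b1, 2)

# 95% CI for the slope, then transformed to a CI for the factor.
slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
factor_ci = tuple(fold_factor(s, 2) for s in slope_ci)

print(f"factor for doubling diameter: {factor_2x:.2f}")
print(f"95% CI for factor: ({factor_ci[0]:.2f}, {factor_ci[1]:.2f})")
```

Because the factor depends only on the ratio of diameters, the same 5.92 applies to going from 5" to 10" and from 10" to 20", as long as both diameters are within the scope of the model.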

