14-1 Transformations in Statistical Analysis Assumptions of linear statistical models. Types of Transformations Alternatives to Transformations Outline.

1 14-1 Transformations in Statistical Analysis
Outline:
  Assumptions of linear statistical models.
  Types of transformations.
  Alternatives to transformations.
Model assumptions:
  Effect additivity.
  Normality.
  Homoscedasticity.
  Independence.

2 14-2 Order of Importance
Experimental analysis models (ANOVA): homoscedasticity, normality, additivity, independence.
Observational analysis models (regression): additivity, homoscedasticity, normality, independence.
All four are so interrelated that which is “most” important may be immaterial!

3 14-3 Independence
When is this important?
  Measurements over time on the same individual: time series data (rainfall, temperature, etc.); repeated measures (split plots in time); growth curves.
  Measurements near each other in space: split plot designs; spatial data.
How do I know it’s a problem?
  By design: how the data were collected.
  Temporal/spatial autocorrelation analysis.
Rectifying a dependence problem:
  Modify the type of model to be fitted to the data.
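A minimal sketch of a temporal autocorrelation check on regression residuals. The data are simulated for illustration (a linear trend with AR(1) errors) and are not from the slides; the variable names are hypothetical.

```r
# Simulated data (an assumption, not from the slides):
# a linear trend in time with positively autocorrelated AR(1) errors.
set.seed(1)
n <- 200
time <- seq_len(n)
err <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y <- 2 + 0.05 * time + err

fit <- lm(y ~ time)

# Sample autocorrelations of the residuals; element 1 is lag 0 (always 1),
# so element 2 is the lag-1 autocorrelation.
r <- acf(resid(fit), lag.max = 10, plot = FALSE)
lag1 <- r$acf[2]

# Approximate 95% bound for the autocorrelations of independent residuals.
bound <- qnorm(0.975) / sqrt(n)

cat("lag-1 autocorrelation:", round(lag1, 3), "\n")
cat("approximate 95% bound:", round(bound, 3), "\n")
```

A lag-1 autocorrelation well outside the bound suggests the independence assumption is doubtful and that a model with correlated errors may be more appropriate.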

4 14-4 Homoscedasticity
How do I know I have a problem? Plot the residuals versus the predicted (fitted) values. What is the pattern of the spread in the residuals as the predicted values increase?
  Spread constant: acceptable.
  Spread increases: problem.
  Spread decreases then increases: problem.
[Figure: residual plots illustrating the three patterns.]
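The residual-versus-fitted check can be sketched as follows. The data are simulated (an assumption, not from the slides) so that the error standard deviation grows with x; the split at the median fitted value is only a crude numeric companion to the plot, not a formal test.

```r
# Simulated data (an assumption): error SD is proportional to x.
set.seed(2)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)
fitted_vals <- fitted(fit)
res <- resid(fit)

# Residuals versus fitted values: look for a fan or funnel shape.
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude summary: residual SD in the upper half of the fitted values
# relative to the lower half. A ratio near 1 suggests constant spread.
low  <- res[fitted_vals <= median(fitted_vals)]
high <- res[fitted_vals >  median(fitted_vals)]
sd_ratio <- sd(high) / sd(low)
cat("SD ratio (upper half / lower half):", round(sd_ratio, 2), "\n")
```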

5 14-5 Lack of Homogeneity in Regression
What to do if the spread of the residuals plotted versus X looks like this? Or this (need another x variable)?
[Figure: residual plots versus X.]
What to do?
  Attempt a transformation.
  Weighted regression.
  Incorporate additional covariates.
  Non-linear regression.
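A minimal sketch of the weighted regression option. It assumes, purely for illustration, that the error standard deviation is proportional to x, so the weights are taken as 1/x^2; in practice the weights have to be justified by the data or the design. The data are simulated, not from the slides.

```r
# Simulated data (an assumption): error SD is proportional to x.
set.seed(3)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

# Ordinary least squares ignores the unequal variances.
ols <- lm(y ~ x)

# Weighted least squares: weights proportional to 1 / variance.
wls <- lm(y ~ x, weights = 1 / x^2)

# Compare estimates and reported standard errors.
print(coef(summary(ols)))
print(coef(summary(wls)))
```

Both fits estimate the same line; the weighted fit gives less influence to the noisier observations at large x.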

6 14-6 Transforming the Response to achieve Linearity If a scatterplot of y versus x curves upward, proceed down on the scale to choose a transformation.

8 14-8 Handling Heterogeneity
[Flowchart.] Regression? If yes: fit linear model and plot residuals. If no (ANOVA): use group means. Test for homoscedasticity: accept, OK; reject, transform observations.
Type of transformation: Box/Cox family; power family; traditional.
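A minimal sketch of choosing a transformation from the Box/Cox family with `boxcox()` from the MASS package. The data are simulated (an assumption, not from the slides) so that the log transformation, lambda = 0, is the right one.

```r
library(MASS)

# Simulated data (an assumption): log(y) is linear in x with constant variance.
set.seed(4)
n <- 150
x <- runif(n, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(n, sd = 0.4))

# Keep the response in the fitted object so boxcox() can use it directly.
fit <- lm(y ~ x, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
cat("lambda maximizing the profile likelihood:", lambda_hat, "\n")
```

A value of lambda near 0 points to the log transformation, near 0.5 to the square root, and near 1 to no transformation.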

9 14-9 Transformations to Achieve Normality
[Flowchart.] Regression? If yes: fit linear model. If no (ANOVA): estimate group means. Residuals normal (Q-Q plot, formal tests)? If yes: OK. If no: transform, or use a different model.

10 14-10 Transformations to Achieve Normality
How can we determine if observations are normally distributed?
Graphical examination:
  Normal quantile-quantile plot (Q-Q plot).
  Histogram or boxplot.
Goodness-of-fit tests:
  Kolmogorov-Smirnov test.
  Shapiro-Wilk test.
  D’Agostino’s test.
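These checks can be sketched in base R as follows. The two residual vectors are simulated for illustration (not from the slides). D’Agostino’s test is not part of base R and is left out here.

```r
# Simulated residuals (an assumption): one normal set, one right-skewed set.
set.seed(5)
res_normal <- rnorm(100)
res_skewed <- rexp(100) - 1

# Graphical check: points should follow the reference line if normal.
qqnorm(res_skewed)
qqline(res_skewed)

# Shapiro-Wilk test of normality.
sw_normal <- shapiro.test(res_normal)
sw_skewed <- shapiro.test(res_skewed)

# Kolmogorov-Smirnov test against a normal distribution. Because the mean
# and SD are estimated from the same data, the p-value is only approximate.
ks_skewed <- ks.test(res_skewed, "pnorm", mean(res_skewed), sd(res_skewed))

cat("Shapiro-Wilk p-value (normal): ", signif(sw_normal$p.value, 3), "\n")
cat("Shapiro-Wilk p-value (skewed): ", signif(sw_skewed$p.value, 3), "\n")
cat("Kolmogorov-Smirnov p (skewed): ", signif(ks_skewed$p.value, 3), "\n")
```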

11 14-11 Non-normal! So what?
Only very skewed distributions will have a marked effect on the significance level of the F-test for the overall model or for model effects.
Often the same transformations that are used to achieve homoscedasticity will produce more normal-looking observations (residuals).
Transformations to Achieve Model Simplicity
GOAL: To provide as simple a mathematical form as possible for the relationship among response and explanatory variables.
May require transforming both response and explanatory variables.

12 14-12 Alternative Models
Ordered from high complexity to low:
  Generalized Linear Models
  Non-Linear Regression
  Non-Parametric Methods
  Weighted Least Squares
  Regular Least Squares

13 14-13 Example: Predicting brain weight from body weight in mammals via SLR
Data are average brain (Y, g) and body (X, kg) weights for 62 species of mammals (2 omitted). Source: Allison & Cicchetti (1976), Science.
  Species (common name)   Body weight (kg)   Brain weight (g)
  Arctic fox                     3.385              44.500
  Owl monkey                     0.480              15.499
  Horse                        521.000             655.000
  Kangaroo                      35.000              56.000
  Human                         62.000            1320.000
  African elephant            6654.000            5712.000
  Asian elephant              2547.000            4603.000
  …
  Chimpanzee                    52.160             440.000
  Tree shrew                     0.104               2.500   (omitted)
  Red fox                        4.235              50.400   (omitted)

14 14-14 The scatterplot of the raw data is uninformative: most species have small weights compared to the elephants. Viewing only those mammals with body weight below 300 kg suggests transforming to a log scale to linearize the relationship.

15 14-15 The scatterplot on the log scale looks linear. The fitted regression equation is: log(brain) = 2.111 + 0.755 log(body), using natural logs. Body weight is a very significant predictor of brain weight (p-value < 0.0001). Also, R^2 = 0.922.
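A minimal sketch of the log-log fit. It assumes that the `mammals` data frame shipped with the MASS package (columns `body` in kg and `brain` in g, 62 species) is the same data set, and that the two omitted species are the rows named "Tree shrew" and "Red fox"; neither assumption is confirmed by the slides.

```r
library(MASS)

# Assumption: MASS::mammals matches the slides' data; drop the two
# species that the slides hold back for prediction.
omit <- c("Tree shrew", "Red fox")
fit_data <- mammals[!(rownames(mammals) %in% omit), ]

# Simple linear regression on the natural-log scale.
fm <- lm(log(brain) ~ log(body), data = fit_data)

slope <- coef(fm)[["log(body)"]]
r2 <- summary(fm)$r.squared

print(coef(summary(fm)))
cat("R-squared:", round(r2, 3), "\n")

# Scatterplot on the log scale with the fitted line.
plot(log(fit_data$body), log(fit_data$brain),
     xlab = "log(body weight, kg)", ylab = "log(brain weight, g)")
abline(fm)
```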

16 14-16 The residual plot shows no obvious violations of the zero-mean and constant-variance assumptions. The Q-Q plot demonstrates that the normality assumption for the residuals is plausible. (Points labeled in the plots: human, opossum.)

17 14-17 Checking for influential observations (R)
  > fm <- lm(log(y)~log(x))
  > influence.measures(fm)
  Influence measures of lm(formula = log(y) ~ log(x)) :
       dfb.1.    dfb.lg..   dffit     cov.r  cook.d    hat     inf
   1   0.13501  -8.18e-03   0.14452   1.009  1.04e-02  0.0167
   2   0.27274  -1.56e-01   0.27714   0.956  3.71e-02  0.0245       (Owl Monk.)
   3  -0.04860   1.62e-02  -0.04876   1.051  1.21e-03  0.0187
   …
  14  -0.02853   3.42e-02  -0.03775   1.142  7.25e-04  0.0937   *   (Shrew)
   …
  19   0.00538   1.69e-01   0.18810   1.121  1.79e-02  0.0881   *   (Asian El.)
   …
  32   0.22151   3.51e-01   0.53207   0.788  1.24e-01  0.0295   *   (Human)
  33   0.00130  -5.11e-02  -0.05538   1.164  1.56e-03  0.1110   *   (African El.)
  34  -0.31147   1.54e-02  -0.33480   0.846  5.11e-02  0.0167   *   (Opossum)
  35   0.27033   5.36e-02   0.32472   0.861  4.85e-02  0.0171   *   (Rhesus Monk.)
   …
  40  -0.00740   8.39e-03  -0.00945   1.124  4.55e-05  0.0786   *   (Brown Bat)
   …
  60  -0.00799   2.27e-03  -0.00806   1.054  3.31e-05  0.0181
In MTB: Stat > Regression > Regression > Storage
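A short sketch of pulling out only the flagged observations rather than reading the full table. As an assumption not confirmed by the slides, it uses the `mammals` data frame from the MASS package, with the rows named "Tree shrew" and "Red fox" dropped, as a stand-in for the slides' x and y vectors.

```r
library(MASS)

# Assumption: MASS::mammals stands in for the slides' data.
fit_data <- mammals[!(rownames(mammals) %in% c("Tree shrew", "Red fox")), ]
fm <- lm(log(brain) ~ log(body), data = fit_data)

# influence.measures() marks a case with "*" when any of its
# diagnostics exceeds R's built-in cutoff; is.inf holds those flags.
im <- influence.measures(fm)
is_flagged <- apply(im$is.inf, 1, any)
flagged_species <- rownames(fit_data)[is_flagged]

# Key diagnostics for the flagged species only.
diag_tab <- data.frame(
  hat    = hatvalues(fm),
  cook.d = cooks.distance(fm),
  dffit  = dffits(fm),
  row.names = rownames(fit_data)
)
print(round(diag_tab[is_flagged, ], 4))
```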

18 14-18 Decision: Leave out man (he doesn’t really fit in with the rest of the mammals) and re-run the analysis.
  Feature          Full Model   Omit Human
  Intercept        2.111        2.090
  Slope            0.755        0.745
  SE of slope      0.029        0.027
  R^2              0.922        0.929
  Slope p-value    < 0.0001     < 0.0001
Even though the results don’t change much, we will go with this last model: log(brain) = 2.090 + 0.745 log(body).

19 14-19 Predicting the brain weights of the omitted mammals (R)
  > xh <- x[-32]; yh <- y[-32]
  > fmh <- lm(log(yh)~log(xh))
  > new <- data.frame(xh=c(.104,4.235))
  > predict(fmh, newdata=new, interval="prediction")
          fit        lwr      upr
  1 0.4038624 -0.9269029 1.734628
  2 3.1660753  1.8499283 4.482222
  > exp(predict(fmh, newdata=new, interval="prediction"))
          fit       lwr       upr
  1  1.497598 0.3957776  5.666817
  2 23.714231 6.3593633 88.430985
Exponentiate final results!
  Mammal       Predicted Brain Wt   Prediction Interval   Actual Brain Wt
  Tree Shrew         1.498            (0.396, 5.667)            2.500
  Red Fox           23.714            (6.359, 88.431)          50.400
This illustrates the idea of cross-validation in regression. It is often recommended that the data be split into two (equal?) portions: use one for model fitting and the other for model checking/verification.
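The split-sample idea can be sketched as follows: fit on a random half of the data, then check how often the back-transformed 95% prediction intervals cover the held-out observations. As an assumption not confirmed by the slides, the `mammals` data frame from the MASS package is used as the data; the random half split is only one of many possible choices.

```r
library(MASS)

# Assumption: MASS::mammals (body in kg, brain in g) stands in for the data.
set.seed(6)
n <- nrow(mammals)
train_idx <- sample(n, size = n %/% 2)
train <- mammals[train_idx, ]
test  <- mammals[-train_idx, ]

# Fit on the training half, on the natural-log scale.
fm <- lm(log(brain) ~ log(body), data = train)

# Predict the held-out half and exponentiate back to grams.
pred <- exp(predict(fm, newdata = test, interval = "prediction"))

# Fraction of held-out brain weights inside their prediction intervals.
covered <- test$brain >= pred[, "lwr"] & test$brain <= pred[, "upr"]
coverage <- mean(covered)
cat("held-out coverage of 95% prediction intervals:", round(coverage, 2), "\n")
```

Coverage far below the nominal level on the held-out half would suggest that the fitted model does not generalize well.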

20 14-20 Predicting the brain weights of the omitted mammals (MTB) Influence measures can be selected here.

21 14-21 MTB output (with man)
  The regression equation is
  lbrain = 2.11 + 0.755 lbody
  Predictor  Coef     SE Coef  T      P
  Constant   2.11091  0.09860  21.41  0.000
  lbody      0.75467  0.02889  26.12  0.000
  S = 0.696924   R-Sq = 92.2%   R-Sq(adj) = 92.0%
  Analysis of Variance
  Source          DF  SS      MS      F       P
  Regression       1  331.35  331.35  682.21  0.000
  Residual Error  58   28.17    0.49
  Total           59  359.52
  Unusual Observations
  Obs  lbody  lbrain  Fit     SE Fit  Residual  St Resid
  32   4.13   7.1854  5.2255  0.1197   1.9599    2.85R
  33   8.80   8.6503  8.7542  0.2322  -0.1039   -0.16 X
  34   1.25   1.3610  3.0563  0.0901  -1.6954   -2.45R
  35   1.92   5.1874  3.5575  0.0912   1.6298    2.36R
  R denotes an observation with a large standardized residual.
  X denotes an observation whose X value gives it large influence.
  Predicted Values for New Observations
  New Obs  Fit     SE Fit  95% CI            95% PI
  1        0.4028  0.1388  (0.1249, 0.6807)  (-1.0196, 1.8253)
  2        3.2002  0.0900  (3.0201, 3.3803)  ( 1.7936, 4.6068)
The only available influence measures are: standardized/studentized residuals; hat matrix; Cook’s distance; and DFFITS.

