# Lecture 18: Thurs., Nov. 6th Chapters 8.3.2, 8.4, 8.6.1 Outliers and Influential Observations Transformations Interpretation of log transformations (8.4)


Lecture 18: Thurs., Nov. 6th

Chapters 8.3.2, 8.4, 8.6.1

- Outliers and Influential Observations
- Transformations
- Interpretation of log transformations (8.4)
- $R^2$ (8.6.1)

Outliers and Influential Observations

- An outlier is an observation that lies outside the overall pattern of the other observations.
- A point can be an outlier in the x direction, in the y direction, or in the direction of the scatterplot. For regression, the outliers of concern are those in the x direction and in the direction of the scatterplot.
- A point that is an outlier in the direction of the scatterplot will have a large residual.
- An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential.
- The least squares method is not resistant to outliers.
- Follow the outlier examination strategy in Display 3.6 when dealing with outliers in the x direction and outliers in the direction of the scatterplot.

Outliers Example

Does the age at which a child begins to talk predict the child's score on a later test of mental ability? gesell.JMP contains data on each child's age at first word (x) and Gesell Adaptive score (y), an ability test taken at a later age.

- Child 18 is an outlier in the x direction and potentially influential.
- Child 19 is an outlier in the direction of the scatterplot.
- To assess whether a point is influential, fit the least squares line with and without the point (exclude the row to fit without it) and see how much the line changes.
- Child 18 is highly influential; child 19 is not highly influential.
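The with-and-without comparison described above can be sketched in Python. This is a minimal sketch, not the JMP workflow used in the lecture: the data are made up for illustration (they are not the gesell.JMP values), and `fit_line` and `slope_change_without` are hypothetical helper names.

```python
import numpy as np

def fit_line(x, y):
    """Least squares fit of y on x; returns (intercept, slope)."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

def slope_change_without(x, y, i):
    """Change in the fitted slope when observation i is excluded."""
    keep = np.arange(len(x)) != i
    _, slope_all = fit_line(x, y)
    _, slope_drop = fit_line(x[keep], y[keep])
    return slope_drop - slope_all

# Made-up data: eight points near y = 1 + 2x, plus one point far out in x
# whose y value does not follow that trend.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 20], dtype=float)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 10.0])

change_outlier = slope_change_without(x, y, 8)  # the point at x = 20
change_typical = slope_change_without(x, y, 3)  # a point inside the pattern

print("slope change without the x outlier:", change_outlier)
print("slope change without a typical point:", change_typical)
```

Dropping the point that is extreme in x moves the slope far more than dropping a point inside the pattern, which is what "influential" means here.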

Case Study 8.1.1

Biologists are interested in the relationship between the area of islands (X) and the number of animal and plant species (Y) living on them.

- Estimates of this relationship are useful in conservation biology for predicting species extinction rates due to diminishing habitat.
- The data in Display 8.1 are the numbers of reptile and amphibian species and the island areas for seven islands in the West Indies.

Scatterplots for Species Data

The regression function does not appear to be linear.

Case Study 8.1.2

In an industrial laboratory, batches of electrical insulating fluid were subjected to different voltages until the insulating property of the fluid broke down. Y = time to breakdown of the insulating fluid, X = voltage. The residual plot shows a "horn-shaped" pattern, indicating both nonlinearity and nonconstant variance.

Tukey's Bulging Rule

Draw a circle and divide it into four quadrants. Try transformations based on the quadrant that matches the shape of the data:

- Upper left: $\sqrt{X}$, $\log X$, $1/X$, $Y^2$
- Upper right: $X^2$, $Y^2$
- Lower left: $\sqrt{X}$, $\log X$, $1/X$, $\sqrt{Y}$, $\log Y$, $1/Y$
- Lower right: $X^2$, $\sqrt{Y}$, $\log Y$, $1/Y$

Try different transformations, draw residual plots, and see which works best. If no transformation works, use polynomial regression (Ch. 9).
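The "try several transformations and compare residuals" step can be sketched as follows. This is only an illustration under stated assumptions: the data are made up (an increasing, concave curve, the upper-left pattern), `residuals_after_transform` is a hypothetical helper, and the largest absolute residual is used as a crude stand-in for the residual plots the lecture recommends. That comparison is only meaningful here because Y is left on the same scale in every fit.

```python
import numpy as np

def residuals_after_transform(x, y, g=lambda v: v, f=lambda v: v):
    """Fit f(y) on g(x) by least squares and return the residuals."""
    gx, fy = g(x), f(y)
    slope, intercept = np.polyfit(gx, fy, 1)
    return fy - (intercept + slope * gx)

# Made-up data with an upper-left bulge: y rises quickly, then levels off.
x = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
y = 2 + 3 * np.log(x)

# Candidate X transformations suggested for the upper-left quadrant.
candidates = {
    "none": lambda v: v,
    "sqrt X": np.sqrt,
    "log X": np.log,
    "1/X": lambda v: 1.0 / v,
}

# Largest absolute residual for each candidate (Y untransformed throughout).
worst = {
    name: float(np.max(np.abs(residuals_after_transform(x, y, g=g))))
    for name, g in candidates.items()
}
for name, value in worst.items():
    print(name, value)
```

In practice one would plot the residuals against the transformed X for each candidate and look for remaining curvature or a horn shape rather than rely on a single number.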

Transformations for Voltage Data

Transformations for Species Data

Prediction After Transformation

To predict y given x (or to estimate the response at x) when y has been transformed to f(y) and x to g(x), fit the line to the transformed data and back-transform the fitted value: $\hat{y} = f^{-1}\left(\hat{\beta}_0 + \hat{\beta}_1 g(x)\right)$.

Species data, log-log transformation (Y transformed to log Y, X transformed to log X). Predicted number of species given area = 30000:

- Predicted log species given log area = log(30000) = 10.31: 1.94 + 0.25 × 10.31 = 4.52.
- Predicted number of species given area = 30000: exp(4.52) = 91.84.

Second Prediction Example

For the voltage data with the square root transformation of y, the predicted time for voltage = 30 is found in two steps:

- Predicted square root of time for voltage = 30: 61.78 − 1.70 × 30 = 10.78.
- Predicted time for voltage = 30: $10.78^2 = 116.21$.
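Both prediction examples follow the same recipe: transform x, apply the fitted line, then undo the transformation of y. A minimal Python sketch, using the coefficients quoted in the slides; `predict_after_transform` is a hypothetical helper name, not something from the lecture or from JMP.

```python
import math

def predict_after_transform(b0, b1, x, g, f_inverse):
    """Back-transform the fitted value b0 + b1 * g(x) to the original y scale."""
    return f_inverse(b0 + b1 * g(x))

# Species data: log Y on log X, intercept 1.94, slope 0.25, area = 30000.
species = predict_after_transform(1.94, 0.25, 30000, math.log, math.exp)

# Voltage data: sqrt(Y) on X, intercept 61.78, slope -1.70, voltage = 30.
time = predict_after_transform(
    61.78, -1.70, 30, lambda v: v, lambda v: v ** 2
)

print("predicted species:", species)
print("predicted time:", time)
```

The species prediction comes out slightly below the 91.84 on the slide because the slide rounds log(30000) and the fitted log value to two decimals before exponentiating.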

Testing whether Y is Associated with X

To test whether Y is associated with X, we can test whether f(Y) is associated with g(X) by testing whether the slope is zero in the transformed model. For the species data, there is strong evidence that the number of species is associated with area. Interpreting the slope and intercept is difficult except for log transformations.

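Interpreting log transformations

Case I: Response is logged, explanatory variable is not logged.

$\text{Median}\{Y \mid X\} = \exp(\beta_0 + \beta_1 X)$. Consequently, $\text{Median}\{Y \mid X+1\} / \text{Median}\{Y \mid X\} = \exp(\beta_1)$.

Interpretation:

- If $\beta_1 > 0$, as X increases by 1, the median of Y increases by $(\exp(\beta_1) - 1) \times 100\%$.
- If $\beta_1 < 0$, as X increases by 1, the median of Y decreases by $(1 - \exp(\beta_1)) \times 100\%$.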
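The ratio of medians in Case I can be written out step by step. The derivation below assumes the fitted model is linear for log Y in X, with log Y distributed symmetrically about its mean, so that the mean on the log scale back-transforms to the median of Y.

```latex
\begin{align*}
\mu\{\log Y \mid X\} &= \beta_0 + \beta_1 X
  \;\Longrightarrow\;
  \text{Median}\{Y \mid X\} = \exp(\beta_0 + \beta_1 X) \\[4pt]
\frac{\text{Median}\{Y \mid X+1\}}{\text{Median}\{Y \mid X\}}
  &= \frac{\exp\bigl(\beta_0 + \beta_1 (X+1)\bigr)}{\exp(\beta_0 + \beta_1 X)}
   = \exp(\beta_1)
\end{align*}
```

The intercept cancels, so the multiplicative change per unit of X depends only on the slope.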

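Interpretation in Voltage Study

Interpretation: With log failure time regressed on voltage and an estimated slope $\hat{\beta}_1 < 0$, it is estimated that the median failure time decreases by $(1 - \exp(\hat{\beta}_1)) \times 100\%$ with each 1 kV increase in voltage. 95% CI: apply the same formula to the endpoints of the 95% confidence interval for $\beta_1$ to get the range of percentage decreases in median failure time per 1 kV increase in voltage.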
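A small Python sketch of this conversion from a slope on the log scale to a percentage change in the median. The slope and interval endpoints below are illustrative placeholders only, not the estimates from the voltage study, and `percent_change_in_median` is a hypothetical helper name.

```python
import math

def percent_change_in_median(beta1):
    """Percent change in Median{Y|X} per one-unit increase in X
    when log(Y) has been regressed on X with slope beta1."""
    return (math.exp(beta1) - 1.0) * 100.0

# Placeholder values for illustration (NOT the voltage-study estimates).
slope_hat = -0.5
slope_ci = (-0.6, -0.4)

estimate = percent_change_in_median(slope_hat)
ci = tuple(percent_change_in_median(b) for b in slope_ci)

print("percent change in median per unit of X:", estimate)
print("interval for the percent change:", ci)
```

A negative result is a percentage decrease. Because exp is increasing, the endpoints of the slope interval map directly to the endpoints of the interval for the percent change.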

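Case II: Explanatory variable is logged

$\mu\{Y \mid \log X\} = \beta_0 + \beta_1 \log(X)$ implies $\mu\{Y \mid \log(2X)\} - \mu\{Y \mid \log X\} = \beta_1 \log(2)$.

Interpretation: A doubling of X is associated with a $\beta_1 \log(2)$ change in the mean of Y.

Species example: A doubling of area is associated with an increase in the mean number of species of 8.86 × log(2) = 6.14. 95% CI = (4.52 × log(2), 13.20 × log(2)) = (3.13, 9.15).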
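The Case II arithmetic for the species example can be checked directly. This sketch uses the slope estimate and interval endpoints quoted in the slide and assumes, as the numbers imply, that log is the natural logarithm.

```python
import math

slope_hat = 8.86           # estimated slope for species on log(area)
slope_ci = (4.52, 13.20)   # 95% CI for the slope, as quoted in the slide

# Change in mean Y associated with a doubling of X is slope * log(2).
doubling_effect = slope_hat * math.log(2)
doubling_ci = tuple(b * math.log(2) for b in slope_ci)

print(round(doubling_effect, 2))
print(tuple(round(v, 2) for v in doubling_ci))
```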

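Case III: Both response and explanatory variable logged

$\mu\{\log Y \mid \log X\} = \beta_0 + \beta_1 \log(X)$, so $\text{Median}\{Y \mid X\} = \exp(\beta_0)\, X^{\beta_1}$.

Interpretation:

- A doubling of X is associated with a multiplicative change of $2^{\beta_1}$ in the median of Y.
- A ten-fold increase in X is associated with a multiplicative change of $10^{\beta_1}$ in the median of Y.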

Case III Example

Species example: Since $2^{0.25} = 1.19$, each doubling of island area is associated with a 19% increase in the median number of species. 95% CI for the multiplicative increase = (16.4%, 21.5%).
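The multiplicative factors for the log-log fit can be computed from the slope alone. A minimal sketch using the slope 0.25 from the species log-log fit in the prediction slide; the variable names are illustrative.

```python
slope_hat = 0.25  # estimated slope of log(species) on log(area)

# Multiplicative change in the median of Y for a doubling and a ten-fold
# increase in X under the log-log model.
factor_doubling = 2 ** slope_hat
factor_tenfold = 10 ** slope_hat

percent_increase_doubling = (factor_doubling - 1) * 100

print(factor_doubling, factor_tenfold, percent_increase_doubling)
```

The doubling factor rounds to 1.19, which is the 19% increase quoted in the slide.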

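R-Squared

The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable:

$R^2 = 100\% \times \dfrac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$

- Total sum of squares $= \sum_i (y_i - \bar{y})^2$, the best sum of squared prediction errors without using x.
- Residual sum of squares $= \sum_i (y_i - \hat{y}_i)^2$, the sum of squared prediction errors from the least squares line.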

R-Squared example

$R^2$ = 86.69%. Read as "86.69 percent of the variation in neuron activity was explained by linear regression on years played."

Interpreting $R^2$

- If the residuals are all zero (a perfect fit), then $R^2$ is 100%. If the least squares line has slope 0, $R^2$ is 0%.
- $R^2$ is useful as a unitless summary of the strength of linear association, but it is not useful for assessing model adequacy (e.g., linearity) or whether there is an association.
- What counts as a good $R^2$ depends on the context. In precise laboratory work, $R^2$ values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, $R^2$ values of 50% may be considered remarkably good.
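The two boundary cases above (perfect fit and zero slope) can be verified with a short computation of R-squared from the total and residual sums of squares. This is a sketch on made-up data; `r_squared` is a hypothetical helper name.

```python
import numpy as np

def r_squared(x, y):
    """Percentage of variation in y explained by the least squares line on x."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    total_ss = np.sum((y - np.mean(y)) ** 2)   # prediction error without x
    resid_ss = np.sum((y - fitted) ** 2)       # prediction error using the line
    return 100.0 * (total_ss - resid_ss) / total_ss

x = np.array([1.0, 2.0, 3.0, 4.0])

# Made-up responses: one lies exactly on a line, one has least squares slope 0.
r2_perfect = r_squared(x, 3 + 2 * x)
r2_flat = r_squared(x, np.array([1.0, 2.0, 2.0, 1.0]))

print(r2_perfect, r2_flat)
```

The second data set has a clear curved pattern yet an R-squared of essentially zero, which illustrates why R-squared says nothing about model adequacy.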

Coverage of Second Midterm

- Transformations of the data for the two-group problem (Ch. 3.5)
- Welch t-test (Ch. 4.3.2)
- Comparisons Among Several Samples (Ch. 5.1-5.3, 5.5.1)
- Multiple Comparisons (Ch. 6.3-6.4)
- Simple Linear Regression (Ch. 7.1-7.4, 7.5.3)
- Assumptions for Simple Linear Regression and Diagnostics (Ch. 8.1-8.4, 8.6.1, 8.6.3)
