# Lecture 19: Tues., Nov. 11th R-squared (8.6.1) Review


Midterm II on Thursday in class: calculator and two double-sided pages of notes allowed. Office hours: today after class; Wednesday, 1:30-2:30; or by appointment (I will be around Wed. morning and Thurs. morning before 10:30).

## R-Squared

The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable:

R² = (Total sum of squares − Residual sum of squares) / Total sum of squares

Total sum of squares = Σ(yᵢ − ȳ)², the best sum of squared prediction errors without using x. Residual sum of squares = Σ(yᵢ − ŷᵢ)², the sum of squared prediction errors from the least squares fit.
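The definitions above can be sketched in a few lines of Python (the data values here are made up for illustration, not from the lecture):

```python
# Minimal sketch: R-squared from the total and residual sums of squares.

def least_squares(x, y):
    """Slope and intercept minimizing the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return slope, intercept

def r_squared(x, y):
    slope, intercept = least_squares(x, y)
    ybar = sum(y) / len(y)
    total_ss = sum((yi - ybar) ** 2 for yi in y)          # predicting without x
    resid_ss = sum((yi - (intercept + slope * xi)) ** 2   # predicting with x
                   for xi, yi in zip(x, y))
    return (total_ss - resid_ss) / total_ss

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(r_squared(x, y), 4))   # prints 0.9976
```

Note that R² can only decrease Total SS down to Residual SS, so it always lands between 0 and 1.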

## R-Squared example

R² = 0.8669, read as "86.69 percent of the variation in neuron activity was explained by linear regression on years played."

## Interpreting R²

R² takes on values between 0 and 1, with higher R² indicating a stronger linear association.

- If the residuals are all zero (a perfect fit), then R² is 100%.
- If the least squares line has slope 0, then R² is 0%.
- R² is useful as a unitless summary of the strength of linear association.

## Caveats about R²

R² is not useful for assessing model adequacy (e.g., linearity) or whether there is an association. What counts as a good R² depends on the context. In precise laboratory work, R² values under 90% might be too low; in social science contexts, where a single variable rarely explains a great deal of the variation in the response, an R² of 50% may be considered remarkably good.

## Coverage of Second Midterm

- Transformations of the data for the two-group problem (Ch. 3.5)
- Welch t-test (Ch )
- Comparisons Among Several Samples ( , 5.5.1)
- Multiple Comparisons ( )
- Simple Linear Regression (Ch , 7.5.3)
- Assumptions for Simple Linear Regression and Diagnostics (Ch , 8.6.1, 8.6.3)

## Transformations for two-group problem

Goal: find a transformation so that the two distributions have approximately equal spread. A log transformation might work when the distributions are skewed and the spread is greater in the distribution with the larger median.

Interpretation of the log transformation:

- For causal inference: let δ be the additive treatment effect on the log scale (log Y_treated = log Y_control + δ). Then the effect of the treatment is to multiply the control outcome by e^δ.
- For population inference: let μ₁ and μ₂ be the means of the logged values of populations 1 and 2 respectively. If the logged values of the populations are symmetric, then e^(μ₂ − μ₁) equals the ratio of the median of population 2 to the median of population 1.
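A quick numeric sketch of the multiplicative interpretation (the data are simulated, not from the lecture): an additive shift δ on the log scale multiplies outcomes by e^δ, and exponentiating the difference of mean logs recovers that factor.

```python
# Sketch: additive effect delta on the log scale = multiplicative effect
# e^delta on the original scale. Simulated, lognormal-style data.
import math
import random

random.seed(0)
delta = 0.5                                        # additive effect on log scale
control = [math.exp(random.gauss(1.0, 0.4)) for _ in range(1000)]
treated = [c * math.exp(delta) for c in control]   # multiplied by e^delta

log_diff = (sum(math.log(t) for t in treated) / len(treated)
            - sum(math.log(c) for c in control) / len(control))
print(math.exp(log_diff))   # recovers e^delta ≈ 1.6487
```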

## Review of One-way layout

Assumptions of the ideal model:

- All populations have the same standard deviation σ.
- Each population is normal.
- Observations are independent.

Planned comparisons: use the usual t-test, but use all groups to estimate σ. If there are many planned comparisons, use Bonferroni to adjust for multiple comparisons. Test of H₀: μ₁ = μ₂ = … = μ_I versus the alternative that at least two means differ: one-way ANOVA F-test. Unplanned comparisons: use the Tukey-Kramer procedure to adjust for multiple comparisons.
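The one-way ANOVA F-statistic can be sketched directly from its definition (group data below are made up; a large F favors the alternative that at least two means differ):

```python
# Sketch of the one-way ANOVA F-statistic:
# F = (between-group SS / (I - 1)) / (within-group SS / (n - I)).

def anova_f(groups):
    """F-statistic for H0: all group means are equal."""
    n = sum(len(g) for g in groups)
    I = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    between_ss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    within_ss = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (between_ss / (I - 1)) / (within_ss / (n - I))

groups = [[5.1, 4.9, 5.3], [6.2, 6.0, 6.4], [5.0, 5.2, 4.8]]
print(anova_f(groups))   # large F: group 2's mean stands apart
```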

## Regression

Goal of regression: estimate the mean response μ{Y|X=x} for subpopulations X = x. Applications: (i) description of the association between X and Y; (ii) passive prediction of Y given X; (iii) control: predict what Y will be if X is changed. Application (iii) requires the x's to be randomly assigned. Simple linear regression model: μ{Y|X=x} = β₀ + β₁x. Estimate β₀ and β₁ by least squares: choose the estimates to minimize the sum of squared residuals (prediction errors).

## Ideal Model

Assumptions of the ideal simple linear regression model:

- There is a normally distributed subpopulation of responses for each value of the explanatory variable.
- The means of the subpopulations fall on a straight-line function of the explanatory variable.
- The subpopulation standard deviations are all equal (to σ).
- The selection of an observation from any of the subpopulations is independent of the selection of any other observation.

## The standard deviation

σ is the standard deviation in each subpopulation; it measures the accuracy of predictions from the regression. If the simple linear regression model holds, then approximately 68% of the observations will fall within σ̂ (one estimated standard deviation) of the least squares line, and 95% of the observations will fall within 2σ̂ of the least squares line.
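A minimal sketch of estimating σ from a fit (made-up data; the divisor n − 2 reflects the two estimated parameters β₀ and β₁):

```python
# Sketch: sigma-hat = sqrt(residual SS / (n - 2)) after a least squares fit.
import math

x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)     # slope estimate
b0 = ybar - b1 * xbar                      # intercept estimate
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma_hat = math.sqrt(sum(r * r for r in resid) / (n - 2))

# Under the ideal model, roughly 68% of observations fall within sigma_hat
# of the line and roughly 95% within 2 * sigma_hat.
print(sigma_hat)
```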

## Inference for Simple Linear Regression

Inference is based on the ideal simple linear regression model holding, and on taking repeated random samples from the same subpopulations as in the observed data. Types of inference:

- Hypothesis tests for the intercept and slope
- Confidence intervals for the intercept and slope
- Confidence interval for the mean of Y at X = X₀
- Prediction interval for a future Y for which X = X₀

## Tools for model checking

- Scatterplot of Y vs. X (see Display 8.6)
- Scatterplot of residuals vs. fitted values (see Display 8.12): look for nonlinearity, non-constant variance, and outliers
- Normal probability plot (Section 8.6.3): for checking the normality assumption
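The normal probability plot pairs sorted residuals with standard normal quantiles; an approximately straight-line pattern supports the normality assumption. A stdlib-only sketch (the residual values are made up):

```python
# Sketch of the pairing behind a normal probability plot: sorted residuals
# against standard normal quantiles at plotting positions (i + 0.5) / n.
from statistics import NormalDist

residuals = [-1.2, -0.5, -0.1, 0.0, 0.2, 0.4, 1.1]   # illustrative values
n = len(residuals)
pairs = [
    (NormalDist().inv_cdf((i + 0.5) / n), r)          # (normal quantile, residual)
    for i, r in enumerate(sorted(residuals))
]
for q, r in pairs:
    print(f"{q:6.3f}  {r:5.2f}")   # roughly linear => normality is plausible
```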

## Outliers and Influential Observations

An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, the y direction, or the direction of the scatterplot. For regression, the outliers of concern are those in the x direction and in the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual. An observation is influential if removing it markedly changes the least squares regression line; a point that is an outlier in the x direction will often be influential. The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and in the direction of the scatterplot.
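Influence can be checked directly by refitting without the suspect point and comparing the fitted lines (the data below are made up, with one deliberate outlier in the x direction):

```python
# Sketch: a point is influential if removing it markedly changes the
# least squares line. Here x = 20 is an outlier in the x direction.

def fit(points):
    """Least squares slope and intercept for (x, y) pairs."""
    n = len(points)
    xbar = sum(px for px, _ in points) / n
    ybar = sum(py for _, py in points) / n
    b1 = sum((px - xbar) * (py - ybar) for px, py in points) / \
         sum((px - xbar) ** 2 for px, _ in points)
    return b1, ybar - b1 * xbar

data = [(1, 1.1), (2, 2.0), (3, 2.9), (4, 4.1), (20, 8.0)]
slope_all, _ = fit(data)
slope_without, _ = fit(data[:-1])
print(slope_all, slope_without)   # markedly different => the point is influential
```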

## Transformations

Goal: find transformations f(y) and g(x) such that the simple linear regression model approximately describes the relationship between f(y) and g(x). Tukey's Bulging Rule can be used to find candidate transformations. Related topics: prediction after transformation; interpreting log transformations.