1 Lecture 19: Tues., Nov. 11th
R-squared (8.6.1); Review
Midterm II on Thursday in class. Allowed: calculator, two double-sided pages of notes.
Office hours: today after class; Wednesday, 1:30-2:30; by appointment (I will be around Wed. morning and Thurs. morning before 10:30).
2 R-Squared
The R-squared statistic, also called the coefficient of determination, is the percentage of response variation explained by the explanatory variable.
Total sum of squares = best sum of squared prediction errors without using x: Total SS = Σ (y_i - ȳ)².
Residual sum of squares = sum of squared prediction errors from the least squares line: Residual SS = Σ (y_i - ŷ_i)².
R² = (Total SS - Residual SS) / Total SS.
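The definition above can be sketched directly in code. This is a minimal pure-Python illustration with made-up data (the values of y and the fitted values yhat are hypothetical, not from the lecture):

```python
# R-squared from its definition: the fraction of total variation in y
# that the regression predictions explain. y and yhat are made-up.

def r_squared(y, yhat):
    ybar = sum(y) / len(y)
    # Total SS: squared error when every y is predicted by ybar (no use of x)
    total_ss = sum((yi - ybar) ** 2 for yi in y)
    # Residual SS: squared error when y is predicted by the fitted line
    resid_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    return (total_ss - resid_ss) / total_ss

y = [2.0, 4.1, 5.9, 8.2]
yhat = [2.1, 4.0, 6.0, 8.1]  # hypothetical least squares fitted values
print(r_squared(y, yhat))
```

Note that a perfect fit (all residuals zero) gives exactly 1, matching the slide on interpreting R².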
3 R-Squared example
R² = (Total SS - Residual SS) / Total SS = 0.8669.
Read as: "86.69 percent of the variation in neuron activity was explained by linear regression on years played."
4 Interpreting R²
R² takes on values between 0 and 1, with higher R² indicating a stronger linear association.
If the residuals are all zero (a perfect fit), then R² is 100%. If the least squares line has slope 0, R² will be 0%.
R² is useful as a unitless summary of the strength of linear association.
5 Caveats about R²
R² is not useful for assessing model adequacy (e.g., linearity) or whether or not there is an association.
What counts as a good R² depends on the context. In precise laboratory work, R² values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, R² values of 50% may be considered remarkably good.
6 Coverage of Second Midterm
- Transformations of the data for the two-group problem (Ch. 3.5)
- Welch t-test (Ch. )
- Comparisons Among Several Samples (Ch. , 5.5.1)
- Multiple Comparisons (Ch. )
- Simple Linear Regression (Ch. , 7.5.3)
- Assumptions for Simple Linear Regression and Diagnostics (Ch. , 8.6.1, 8.6.3)
7 Transformations for the two-group problem
Goal: find a transformation so that the two distributions have approximately equal spread.
A log transformation might work when the distributions are skewed and the spread is greater in the distribution with the larger median.
Interpretation of the log transformation:
- For causal inference: let δ be the additive treatment effect on the log scale (log Y_treated = log Y_control + δ). Then the effect of the treatment is to multiply the control outcome by e^δ.
- For population inference: let μ1 and μ2 be the means of the logged values of populations 1 and 2 respectively. If the logged values of the populations are symmetric, then e^(μ2 - μ1) equals the ratio of the median of population 2 to the median of population 1.
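As a quick numeric illustration of back-transforming from the log scale (the values of μ1 and μ2 below are hypothetical, not from the course):

```python
import math

# Hypothetical example: suppose the mean of the natural-log outcomes is
# mu1 = 1.20 in population 1 and mu2 = 1.61 in population 2.
mu1 = 1.20
mu2 = 1.61

# Back-transforming the difference in mean logs estimates a multiplicative
# effect: the ratio (median of population 2) / (median of population 1),
# assuming the logged values are symmetric.
ratio = math.exp(mu2 - mu1)
print(round(ratio, 2))
```

So a difference of about 0.41 on the log scale corresponds to population 2's median being roughly 1.5 times population 1's.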
8 Review of the one-way layout
Assumptions of the ideal model:
- All populations have the same standard deviation σ.
- Each population is normal.
- Observations are independent.
Planned comparisons: usual t-test, but use all groups to estimate σ (the pooled standard deviation). If there are many planned comparisons, use Bonferroni to adjust for multiple comparisons.
Test of H0: μ1 = μ2 = ... = μI vs. the alternative that at least two means differ: one-way ANOVA F-test.
Unplanned comparisons: use the Tukey-Kramer procedure to adjust for multiple comparisons.
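The one-way ANOVA F statistic compares between-group to within-group variability; a minimal pure-Python sketch with made-up data (three hypothetical groups, not from the course):

```python
# One-way ANOVA F statistic: (between-group mean square) / (within-group
# mean square), with k - 1 and n - k degrees of freedom.

def anova_f(groups):
    all_obs = [y for g in groups for y in g]
    n = len(all_obs)
    k = len(groups)
    grand_mean = sum(all_obs) / n
    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum(sum((y - sum(g) / len(g)) ** 2 for y in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[5.1, 4.9, 5.3], [6.0, 6.2, 5.8], [7.1, 6.9, 7.0]]
print(round(anova_f(groups), 1))
```

A large F is evidence against H0 that all group means are equal; if the sample means were identical, F would be 0.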
9 Regression
Goal of regression: estimate the mean response Y for subpopulations X = x, μ(Y|X = x).
Applications: (i) description of the association between X and Y; (ii) passive prediction of Y given X; (iii) control: predict what Y will be if x is changed. Application (iii) requires the x's to be randomly assigned.
Simple linear regression model: μ(Y|X) = β0 + β1X.
Estimate β0 and β1 by least squares: choose the estimates that minimize the sum of squared residuals (prediction errors).
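The least squares estimates have closed-form solutions; a minimal sketch using made-up data (values are hypothetical, not from the lecture):

```python
# Least squares estimates b0, b1 for the simple linear regression model
# mu(Y|X) = beta0 + beta1 * X, from the standard closed-form formulas.

def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx         # slope: minimizes the sum of squared residuals
    b0 = ybar - b1 * xbar  # intercept: the line passes through (xbar, ybar)
    return b0, b1

b0, b1 = least_squares([0, 1, 2, 3], [1.0, 3.1, 4.9, 7.0])
print(b0, b1)
```
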
10 Ideal Model
Assumptions of the ideal simple linear regression model:
- There is a normally distributed subpopulation of responses for each value of the explanatory variable.
- The means of the subpopulations fall on a straight-line function of the explanatory variable.
- The subpopulation standard deviations are all equal (to σ).
- The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
11 The standard deviation σ
σ is the standard deviation in each subpopulation; it measures the accuracy of predictions from the regression.
If the simple linear regression model holds, then approximately:
- 68% of the observations will fall within ±σ of the least squares line;
- 95% of the observations will fall within ±2σ of the least squares line.
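A sketch of estimating σ from residuals and checking the 68%/95% rule of thumb (the residuals below are made-up, not from any course data set; σ̂ uses n - 2 degrees of freedom, the usual choice in simple linear regression):

```python
import math

# Hypothetical residuals from a fitted least squares line.
residuals = [0.2, -0.5, 1.1, -0.8, 0.4, -0.1, 0.9, -1.3, 0.3, -0.2]
n = len(residuals)

# Estimate sigma by the residual standard deviation with n - 2 df
# (two parameters, b0 and b1, were estimated).
sigma_hat = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# Fraction of observations within 1 and 2 sigma-hat of the line.
within_1 = sum(abs(r) <= sigma_hat for r in residuals) / n
within_2 = sum(abs(r) <= 2 * sigma_hat for r in residuals) / n
print(sigma_hat, within_1, within_2)
```

With only 10 points the observed fractions are rough, but they should sit near the nominal 68% and 95% when the model holds.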
12 Inference for Simple Linear Regression
Inference is based on the ideal simple linear regression model holding.
Inference is based on taking repeated random samples (of Y's) from the same subpopulations (same x values) as in the observed data.
Types of inference:
- Hypothesis tests for the intercept and slope
- Confidence intervals for the intercept and slope
- Confidence interval for the mean of Y at X = X0
- Prediction interval for a future Y for which X = X0
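A hedged sketch of one of these inferences, a confidence interval for the slope: b1 ± t* · SE(b1), with SE(b1) = σ̂ / √(Σ (x_i - x̄)²). The data are made-up, and the t multiplier is supplied by hand (2.776 is the 97.5th percentile of the t distribution with n - 2 = 4 degrees of freedom) rather than computed:

```python
import math

# Hypothetical data for a 95% confidence interval for the slope b1.
x = [1, 2, 3, 4, 5, 6]
y = [2.2, 3.9, 6.1, 8.3, 9.8, 12.1]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Least squares fit
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residual SD (n - 2 df) and standard error of the slope
resid_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(resid_ss / (n - 2))
se_b1 = sigma_hat / math.sqrt(sxx)

t_star = 2.776  # t quantile for n - 2 = 4 df, 95% confidence
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(ci)
```

The same b1 and SE(b1) also give the t-statistic b1 / SE(b1) for testing the hypothesis that the slope is zero.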
13 Tools for model checking
- Scatterplot of Y vs. X (see Display 8.6)
- Scatterplot of residuals vs. fitted values (see Display 8.12): look for nonlinearity, non-constant variance and outliers
- Normal probability plot (Section 8.6.3): for checking the normality assumption
14 Outliers and Influential Observations
An outlier is an observation that lies outside the overall pattern of the other observations. A point can be an outlier in the x direction, the y direction, or in the direction of the scatterplot. For regression, the outliers of concern are those in the x direction and in the direction of the scatterplot. A point that is an outlier in the direction of the scatterplot will have a large residual.
An observation is influential if removing it markedly changes the least squares regression line. A point that is an outlier in the x direction will often be influential.
The least squares method is not resistant to outliers. Follow the outlier examination strategy in Display 3.6 for dealing with outliers in the x direction and outliers in the direction of the scatterplot.
15 Transformations
Goal: find transformations f(y) and g(x) such that the simple linear regression model approximately describes the relationship between f(y) and g(x).
Tukey's Bulging Rule can be used to find candidate transformations.
- Prediction after a transformation
- Interpreting log transformations