
1 Topic 3 – Multiple Regression Analysis: Regression on Several Predictor Variables (Chapter 8)

2 Topic Overview
- Systolic Blood Pressure Example
- Multiple Regression Models
- SAS Output for Regression
- Multicollinearity

3 Systolic Blood Pressure Data
In this topic we will fully analyze the SBP dataset described in KKNR Problem 5.02 in the text. This dataset illustrates some excellent points regarding multiple regression. The file 03SBP.sas provides the data and all of the code used to produce the output shown in the CLG handout and lecture notes.

4 Dataset Overview (n = 32)
Response Variable: systolic blood pressure (SBP) for an individual. Note: SBP is the maximum pressure exerted when the heart contracts (the top number).
Predictor Variables:
- Age (measured in years)
- Body size (measured using the Quetelet index)
- Smoking status (0 = nonsmoker, 1 = smoker)
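A minimal, hypothetical sketch of reading the data into SAS is shown below; the file name and the variable names (sbp, age, quet, smk) are assumptions for illustration, since the actual code lives in 03SBP.sas.

/* Hypothetical sketch of reading the SBP data; the file name and   */
/* variable names are assumptions, not taken from 03SBP.sas.        */
data sbp;
   infile 'sbp.dat';           /* hypothetical raw-data file            */
   input sbp age quet smk;     /* smk coded 0 = nonsmoker, 1 = smoker   */
run;

proc print data=sbp(obs=5);    /* quick check of the first few rows     */
run;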

5 Multiple Regression Analysis
For the SBP data, our goal is to determine whether SBP can be reasonably well predicted by some combination of age, body size, and smoking status. Additionally, we may want to describe relationships and answer questions such as:
- Does SBP increase (on average) with an increase in body size?

6 Multiple Regression Analysis (2)
For starters, let's use several SLRs to help us better understand the three predictors and their individual relationships with SBP.
- Identify potential problems (e.g., outliers).
- Identify and assess the form, direction, and strength of the pairwise relationships.
Note: We wouldn't normally do this; once we learn how, we will simply run the MLR. But these SLRs will help us see what might happen to our model when we do run the MLR (see the sketch below).
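A sketch of those exploratory SLR fits in SAS might look like the following; the dataset and variable names are carried over from the hypothetical sketch above, not from 03SBP.sas.

/* One simple linear regression of SBP on each predictor separately */
proc reg data=sbp;
   model sbp = age;     /* SBP vs. age            */
   model sbp = quet;    /* SBP vs. body size      */
   model sbp = smk;     /* SBP vs. smoking status */
run;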

7 CLG Activity #1
Please discuss questions 3.1–3.6 from the handout. Additional slides will be made available on the website after we have done the activities in class.

8 Multiple Regression Analysis (3)
The first step in multiple regression analysis is to consider using more than one predictor variable in the same model. You can think of adding variables to the model in a certain order. Each variable takes up a portion of the total sum of squares.

9 Graphical View of MLR
[Figure: the rectangle represents the total SS of Y; the ovals represent the variables X1, X2, and X3; note the OVERLAP among the ovals.]

10 Some Key Points
We must take into account the relationships (correlations) among the potential predictor variables – these relationships are mostly responsible for the overlap in explained SS.
Interaction between predictors may also become a consideration. Interaction means that the effect of one predictor changes depending on the value of the other. More on this later...

11 Some Additional Concerns
Dealing with multiple predictors is much more difficult than SLR:
- It is more difficult to choose the "best" model (we now have many more choices).
- Calculation of the estimates can be problematic – we generally rely on a computer.
- Interpretation of the parameter estimates is usually less clear, and can in fact be meaningless if highly correlated variables are used in the same model.
- It is harder to visualize the multi-dimensional relationship expressed by the model.

12 Examples of Models
Suppose we have two predictors X1 and X2. Then possible models include (but are not limited to) forms such as those shown below.
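For illustration, a few standard possibilities (assumed examples, not an exhaustive list) are:

\[
\begin{aligned}
E(Y) &= \beta_0 + \beta_1 X_1 \\
E(Y) &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 \\
E(Y) &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 \\
E(Y) &= \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2
\end{aligned}
\]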

13 “Best” Model
The best model will contain “significant” variables and exclude those that are not significant.
- BUT: Because of possible overlaps, it is not always easy to determine which variables should be excluded!
Use scatter plots and residual plots to check the FORM of the model.
- In scatter plots of Y-hat vs. individual predictors, non-linear patterns indicate the need to transform a given predictor.

14 “Best” Estimates
The best estimates for an MLR model are still those that minimize the SSE (least squares). Note: we are still measuring error in the Y direction. Deviations are measured as before by the difference between the observed response and the predicted response. The X's come into play only in terms of estimating the predicted values.

15 Notation
Similar to SLR: Greek letters are parameters (β0, β1, etc.); lower-case English letters are estimates (b0, b1, etc.). Because we have multiple predictors, the formulas for the parameter estimates and their standard errors are much more complicated, so we mostly let software handle them for us.

16 Matrix Approach
The matrix approach simplifies the notation somewhat; the least-squares estimates are given below. This is discussed in Appendix B; however, we will rely on SAS to do the work for us. Our goal is to learn how to interpret the information.
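In the usual matrix notation (a standard textbook result, stated here for reference),

\[
\mathbf{b} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{Y},
\]

where X is the n × (k + 1) design matrix (a column of ones followed by the predictor columns) and Y is the vector of responses.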

17 Slope Estimates
Some important properties of the estimates include the following:
- Each slope estimate is a linear function of the Y's (the linear coefficients are based on the X's, as was the case in SLR). Thus the estimates are random variables.
- The correlation between Y and Y-hat is maximized (while the sum of squared errors is minimized).

18 Assumptions on Errors
These are the same as before! A simple statement of the assumptions is that the errors are independent and identically distributed N(0, σ²).
Normality: Given a particular set of predictors, the errors (or equivalently the Y|X's) have a normal distribution. This provides justification for the use of T and F tests.

19 Assumptions on Errors (2)
Constancy of Variance: The variance of the errors is the same regardless of the values of the predictor variables. Equivalently, the variance of Y given X = (X1, X2, ..., Xk) is constant.
Transformations: If needed, we can often transform Y to achieve normality and/or constancy of variance.

20 Assumptions on Errors (3)
Independence: The errors (and hence the responses) are statistically independent of one another. One common violation is repeated measures (measuring the same individual over time). If the relationship is linear (or transformable), then it is possible to include time in the model as a predictor variable and avoid this issue.

21 Linear Model Assumption
Linearity means:
- The mean of Y is a linear function of the X's.
- We can have "weird" terms from transformations.
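In symbols (a standard form, added for reference), the assumption is that

\[
E(Y \mid X_1, \ldots, X_k) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k,
\]

where the X's may themselves be transformed terms such as \(X_1^2\) or \(\log(X_2)\); the model only needs to be linear in the β's.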

22 Checking Assumptions
Also very much the same as SLR. Epsilon is the error component and is estimated by the residuals, e_i = Y_i − Ŷ_i. Residual plots can be used to check for violations of the assumptions. We may plot the residuals versus each individual predictor to assess constant variance and linearity with respect to that predictor (see the sketch below).
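A minimal sketch of producing such plots in SAS follows; the dataset and variable names are assumptions for illustration.

/* Save residuals and fitted values, then plot residuals vs. one predictor */
proc reg data=sbp;
   model sbp = age quet smk;
   output out=resids r=resid p=pred;   /* r= residuals, p= fitted values   */
run;

proc sgplot data=resids;
   scatter x=age y=resid;    /* look for non-constant spread or curvature  */
   refline 0 / axis=y;       /* reference line at zero residual            */
run;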

23 ANOVA Table
Similar idea to simple linear regression; the information is organized into an ANOVA table. We still have the (corrected for the mean) total sum of squares, which is split into the SS for the regression model and the SS for error.

24 Degrees of Freedom
The degrees of freedom change:
- We always have n – 1 total degrees of freedom.
- The degrees of freedom for the regression is the number of slope estimates, k (you do not count the intercept).
- The degrees of freedom for error is therefore n – 1 – k.
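For example, in the SBP data we have n = 32 and k = 3 predictors (age, body size, smoking status), so the total df is 31, the regression df is 3, and the error df is 32 – 1 – 3 = 28.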

25 Regression Sums of Squares
The SSR can be divided into parts according to the order in which variables are added to the model. When this is done, we have an "extended" ANOVA table – the model line is broken down into one line for each variable, and each variable has 1 DF associated with it. IMPORTANT: This breakdown is order dependent! In SAS we will call these the Type I sums of squares.

26 Graphical View of MLR
[Figure: the rectangle represents the total SS of Y; the ovals represent the SS explained by the different variables, labeled SS(X1), SS(X2|X1), and SS(X3|X1,X2). Note that there is overlap, and the order in which variables are added is important!]

27 Coefficient of Determination
R² is still the coefficient of determination, but it has a slightly different meaning in the context of MLR. R² can be thought of as the percentage of the variation in Y (as represented by the total SS) that is explained by the group of predictor variables in the model. R² gives no indication of the importance of any particular predictor variable.
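As in SLR, R² comes directly from the ANOVA decomposition:

\[
R^2 = \frac{\mathrm{SSR}}{\mathrm{SS\,Total}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SS\,Total}}.
\]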

28 MLR Hypothesis Testing: Basics
There are similarities to SLR, but there are also some big differences.

29 Parameter Estimates
We will have a slope estimate for each variable. Each slope estimate will also have a standard error. Note that in SAS, the variable name is used to identify each line. Confidence intervals may be requested as before by using the CLB option in PROC REG.
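A minimal sketch of that request (dataset and variable names are assumed for illustration):

/* CLB adds confidence limits for the parameter estimates */
proc reg data=sbp;
   model sbp = age quet smk / clb;
run;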

30 Parameter Estimates (2)
These are joint estimates: they assume the other variables are in the model, and they will all change somewhat if you remove or add a variable. (Note: If we are interested in interpretation, how much the estimates change in such a case is very important!) A Bonferroni adjustment may be appropriate to account for the fact that we are computing several intervals.

31 Some Questions of Interest
1. Does the group of independent variables contribute anything worthwhile to the prediction of the response variable?
2. Given the present model, which individual variable(s) add significantly to the prediction of the response (over and above the other predictors already included)?
3. Given the present model, is there any group of variables that will add significantly to the prediction of the response (over and above the other predictors already included)?

32 Answers?
We find the answers to these questions by considering:
- The F-test from the ANOVA table
- The T-tests from the parameter estimates table
- Partial F-tests for comparing different models; these are based on Type I and II SS, which we will discuss in Topic 4

33 ANOVA F-test
The null hypothesis is that ALL of the slope parameters are zero, written H0: β1 = β2 = ... = βk = 0. The alternative hypothesis is that at least one of the slope parameters is non-zero: HA: βj ≠ 0 for at least one j. Rejecting the null hypothesis says "There is something in the model that is important in explaining the variation in the response."

34 ANOVA F-test (2)
The test statistic is F = MSR / MSE. MSE estimates the variance; MSR estimates the variance under H0. Under HA, MSR will be larger than MSE; hence we reject H0 if F >> 1. Compare to an F distribution with the DF associated with the MSR and MSE; the p-value is computed by SAS.

35 ANOVA F-test (3)
Failing to reject means that there is no evidence of association between the response variable Y and the group of predictors currently in the model. Failing to reject doesn't guarantee that no variables in the model are important: we might be lacking power, or we might be missing an important variable that would reduce the total SS. Rejecting means that at least one predictor variable is important in predicting Y; it doesn't indicate which predictor(s) are important.

36 T-tests
The null hypothesis for each t-test is that the slope parameter associated with the corresponding variable is zero. The alternative is that the associated slope parameter is not zero. KEY POINT: The other predictors are considered to be already in the model, so these tests are "variable-added-last" tests.

37 T-tests (2)
Rejection means: even with all of the other variables in the model, taking up as much of the SS as they can, this variable Xi is still important! Failing to reject means: this variable does not do anything useful when added to the other variables in the model. Key Point: This may just be due to overlap; the variable Xi, if considered by itself, may still be associated with Y.

38 T-tests – General Comments
If only one of the variables has an insignificant t-test, and if the sample size is reasonable, then that variable is probably not important. Any variable that has a significant t-test is important and should remain in the model. If multiple variables have insignificant t-tests, it is possible (and even likely) that some of them may still be important!

39 Collinearity Issues
If predictors are highly associated with each other, then they will likely be redundant when we consider them in the same model.

40 Collinearity and Multi-Collinearity
These occur when there are strong relationships (high intercorrelations) among the predictor variables.
- Collinearity refers to a pair of predictors only.
- Multi-collinearity occurs when a variable is strongly related to multiple other variables.

41 Collinearity and Multi-Collinearity – Examples
- X1 is highly correlated with X2 → "X1 and X2 are collinear."
- X1 is not highly correlated with X2, X3, or X4 individually, but together they explain most of the variation in X1 → "X1 is multi-collinear with X2, X3, and X4."
Note that when squares or interactions are used, e.g., X3 = X1*X2, the new term will by default be correlated with X1 and X2. Centering can help decrease this correlation (see the sketch below).
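One hedged sketch of centering in SAS, using hypothetical predictor names x1 and x2:

/* Center x1 and x2, then build the interaction from the centered terms */
proc standard data=sbp mean=0 out=sbp_c;
   var x1 x2;                  /* subtract each variable's sample mean      */
run;

data sbp_c;
   set sbp_c;
   x1x2 = x1 * x2;             /* interaction term from centered predictors */
run;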

42 Consequences
Whenever we have collinearity or multi-collinearity, the predictor variables will "fight" to explain the same portion of the variation in the response (SS). If X1 and X2 are redundant, then the variable-added-last t-test will usually be insignificant for both, even though we probably do want one of them in the model. When collinearity is present, standard errors will be large (inflated) and the interpretation of the parameter estimates is adversely affected.

43 Pictorial Representation
X3 is highly correlated with X1 and X2, so in a regression model we are likely to use either X3 or (X1, X2), but not both.
[Figure: the rectangle represents the total SS of Y; the overlapping ovals are labeled SS(X1), SS(X2|X1), and SS(X3|X1,X2).]

44 Finding Intercollinearity
Use PROC CORR to determine the pairwise correlations; high values are problematic.
- Any single correlation > 0.9 → collinearity between just those two predictors.
- Any predictor that has several correlations between 0.5 and 0.9 with other predictors → multi-collinearity.
Variance Inflation Factors (VIFs) also help detect multi-collinearity.
- For X1, a VIF > 10 suggests X1 is multi-collinear with the other predictors in the model.

45 PROC CORR
The output contains r and a p-value.
- We are looking for high r values.
- The p-value corresponds to an SLR between those two predictors.
A sketch of the call appears below.
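A minimal sketch (dataset and variable names are assumed for illustration):

/* Pairwise correlations among the predictors */
proc corr data=sbp;
   var age quet smk;
run;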

46 Pairwise Scatterplots – Collinearity
We can look at plots to find relationships among the predictors. GPLOT can be used for individual plots; for a first look, SAS "Interactive Data Analysis" provides a convenient scatter plot matrix:
- Solutions → Analysis → Interactive Data Analysis
- Select the WORK library and the dataset for which you wish to obtain plots.
- Select the columns you want to plot (you can use the Ctrl key to select multiple columns).
- Analyze → Scatter Plot
- Clicking a point in one plot will highlight that point in every plot.
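As an alternative, a scatter plot matrix can also be produced in code; one hedged sketch uses PROC SGSCATTER (dataset and variable names are assumed for illustration):

/* Scatter plot matrix of the response and the predictors */
proc sgscatter data=sbp;
   matrix sbp age quet smk;
run;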

47 Variance Inflation Factors – Multi-Collinearity
The VIF is related to the variance of the estimated regression coefficients (you can think of the SEs as being "inflated" by intercorrelation among the predictors): VIF_i = 1 / (1 – R_i²), where R_i² is the coefficient of determination obtained by regressing X_i on all of the other predictors.

48 VIFs in SAS
VIFs are obtained by using the VIF option in the MODEL statement of PROC REG (see the sketch below). A VIF > 10 suggests multicollinearity. If this happens, a simplistic strategy is to remove the variable with the highest VIF and rerun the analysis.
- We will learn better strategies for removing variables, but VIFs are still a good indicator of multi-collinearity.
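A minimal sketch of the request (dataset and variable names are assumed for illustration):

/* VIF adds variance inflation factors to the parameter estimates table */
proc reg data=sbp;
   model sbp = age quet smk / vif;
run;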

49 CLG Activity #2
We now look at the SBP data from the MLR model viewpoint. Please discuss questions 3.7–3.9 from the handout.

50 Questions?

51 Upcoming in Topic 4...
Partial F Tests in MLR
Related Reading: Chapter 9

