Chapter 9 Multiple Linear Regression


1 Chapter 9 Multiple Linear Regression
BAE 5333 Applied Water Resources Statistics
Biosystems and Agricultural Engineering Department, Division of Agricultural Sciences and Natural Resources, Oklahoma State University
Source: Dr. Dennis R. Helsel & Dr. Edward J. Gilroy, 2006 Applied Environmental Statistics Workshop, and Statistical Methods in Water Resources

2 Multiple Linear Regression
Y = b0 + b1X1 + b2X2 + … + bkXk + e
- Y = response (dependent) variable
- b0 = intercept
- bi = slopes
- Xi = explanatory (independent) variables
- e = error term

3 Multiple Linear Regression Model
Y X1 X2 Y = 0 + 1X1 + 2X2 e (positive) e (negative) Source:

4 Multiple Linear Regression
Parametric method of fitting a surface. Same assumptions as simple linear regression:
- Linear pattern of data (in 3+ dimensions)
- Variance of residuals is constant for all X
- Normal distribution of residuals
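A minimal sketch of such a fit in Python with statsmodels; the data are synthetic and all names are illustrative, not from the course materials. Later sketches in this transcript reuse `X`, `X_design`, `y`, and `model` from this block.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 50
X = rng.normal(size=(n, 2))                       # two explanatory variables X1, X2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

X_design = sm.add_constant(X)                     # adds the intercept column b0
model = sm.OLS(y, X_design).fit()                 # ordinary least squares fit
print(model.summary())                            # slopes, t-tests, F-test, R^2
```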

5 Biggest Issue in Multiple Linear Regression: Multicollinearity
Cause: redundant variables, i.e., more than one X variable explaining the same effect.
Symptoms:
- Slope coefficients with signs that make no sense
- Two variables describing the same effect with opposite signs
- Stepwise, backward, and forward methods give different results (more later)

6 Biggest Issue in Multiple Linear Regression: Multicollinearity
Measure with the Variance Inflation Factor (VIF):
- Measures the correlation (not just pairwise) among the X variables
- Has NOTHING to do with the response variable Y
- Want all VIFs < 10
Solutions:
- Drop one or more redundant variables
- Alternate design: collect additional data
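A sketch of the VIF check, continuing from the fit above; `variance_inflation_factor` is statsmodels' built-in helper, and column 0 of the design matrix is the constant.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF_j = 1 / (1 - Rj^2), where Rj^2 comes from regressing X_j on the other X's
for j in range(1, X_design.shape[1]):             # skip column 0, the constant
    print(f"VIF for X{j}: {variance_inflation_factor(X_design, j):.2f}")
# Flag any VIF >= 10 and consider dropping a redundant variable
```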

7 Biggest Issue in Multiple Linear Regression: Multicollinearity
Issues:
- The regression equation is still VALID with multicollinearity
- But no physical meaning can be attached to the sign and magnitude of the coefficients
- You should NEVER apply a regression equation outside the range of the data used to develop it

8 Regression Model with Two Variables
Three cases:
- X1 and X2 are independent
- X1 and X2 are partially correlated, introducing multicollinearity
- X1 and X2 are perfectly correlated, and thus are redundant variables

9 Hypothesis Tests for MLR
t-test for each slope coefficient in Y = b0 + b1X1 + b2X2 + … + bkXk + e:
- Null hypothesis: slope = 0; that X has no influence on Y, so do not include it in the model.
- Alternative hypothesis: slope ≠ 0; that X influences Y, so keep the variable in the equation.

10 Partial t-test
H0: bj = 0; Ha: bj ≠ 0. Reject H0 when the p-value < α (two-sided test). Multicollinearity inflates SE(bj) and hence lowers tj.

11 Overall F-test
Null hypothesis: all slopes = 0, which implies the best estimate of Y is the mean of Y.
Alternative hypothesis: at least one slope ≠ 0; the current model is better than no model. This does not imply it is the best model.
Reject H0 when F is large (p-value < α); the F-test rejects only in the upper tail.
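Continuing the statsmodels sketch from earlier, both the overall F-test and the partial t-tests are reported on the fitted results object:

```python
print("F statistic:", model.fvalue, "p-value:", model.f_pvalue)  # overall F-test
print("t statistics:", model.tvalues)   # one per coefficient, incl. intercept
print("p-values:", model.pvalues)       # reject H0: bj = 0 when p < alpha
```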

12 How to Build a Good Regression Model
- Choose the best units for Y
- Run the regression with all variables
- Check for non-constant variance with residual plots (see the sketch below)
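A sketch of those residual checks, reusing `model` from the fit above: residuals vs. fitted values to look for non-constant variance, plus a normal probability plot of the residuals.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(model.fittedvalues, model.resid)      # fan shapes = non-constant variance
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
sm.qqplot(model.resid, line="s", ax=ax2)          # normality check of residuals
plt.show()
```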

13 How to Build a Good Regression Model
Choose the best units for each X using partial plots; you want a linear relationship.

14 Partial Plots
Shows the relationship between an explanatory variable (Xi) and the response variable (Y), given that the other independent variables are in the model. A simple plot of Y vs. Xi will not show this, because it does not account for the other X's. Want the plot to be linear; curvature indicates a transformation is required for Xi.

15 Partial Regression Plots: Added-Variable or Leverage Plots
Plots the residuals from the regression of Y on all X's except Xi against the residuals from Xi regressed on all the other X's. In other words, it plots the relationship between Y and Xi that remains when the effects of all the other X's are removed. Good for diagnosing outliers and for determining whether a variable should be included in the model. May not show the proper relationship if several variables already in the model are incorrectly specified, or if strong multicollinearity exists.
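statsmodels provides added-variable (partial regression) plots directly; a sketch reusing the fitted `model` from the first example:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# One panel per explanatory variable: residuals of y on the other X's
# plotted against residuals of X_j on the other X's
fig = sm.graphics.plot_partregress_grid(model)
fig.tight_layout()
plt.show()
```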

16 Partial Residual Plots: Adjusted Variable or Component Plots
Show the relationship between each explanatory variable (Xi) and the portion of the response variable (Y) not explained by all the other variables. Primarily used to identify violations of the linearity assumption; good for diagnosing nonlinearity.

17 Partial Residual Plots: Adjusted Variable or Component Plots
Partial residual: ej* = y − ŷ(j), where y is the observed dependent variable and ŷ(j) is the predicted y from the regression equation with xj left out of the model.
Adjusted explanatory variable: xj* = xj − x̂(j), where xj is the observed explanatory variable and x̂(j) is the predicted xj from a regression of xj on all the other explanatory variables.
The partial residual plot graphs ej* against xj*.
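As a readily available stand-in for the construction above, statsmodels ships CCPR (component-plus-residual) plots, a closely related form of the partial residual plot; a sketch, again reusing `model`:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Curvature in a panel suggests that X_j needs a transformation
fig = sm.graphics.plot_ccpr_grid(model)
fig.tight_layout()
plt.show()
```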

18 How to Build a Good Regression Model
Check for multicollinearity:
- One VIF for each X variable: VIFj = 1 / (1 − Rj²), where Rj² is the R² from regressing xj on all the other X's
- Want all Variance Inflation Factors (VIFs) < 10

19 How to Build a Good Regression Model
Choose the best model using an overall measure of quality:
- Mallows' Cp: low
- Adjusted R²: high
- Predicted R²: high
- PRESS: low
- RMSE: low
R² by itself is not adequate, since it always increases as the number of variables increases.

20 Different Types of R²
R² (coefficient of determination): the percentage of total variation explained by the model. In general, a higher R² indicates a better model fit.
Adjusted R²: accounts for the number of predictors in the model; useful for comparing models with different numbers of predictors.

21 Different Types of R²: Predicted R²
Indicates how well the model predicts responses for new observations. For each observation i, delete the ith observation from the data set, estimate the regression equation from the remaining n − 1 observations, and use the fitted regression function to obtain Ŷi. Can prevent overfitting the model, since it is calculated with observations not included in the model calculation.

22 Measures of Quality: Mallows' Cp
Used to compare a full model to a model with a subset of predictors. In general, look for models where Mallows' Cp is small and close to the number of parameters in the model (predictors plus the constant). A small Cp value indicates that the model is relatively precise (has small variance) in estimating the true regression coefficients and predicting future responses. Models with considerable lack of fit and bias have Cp values larger than the number of parameters.

23 Mallow’s Cp Test Statistic
Measures of Quality Mallow’s Cp Test Statistic σ2 = true error, usually estimated as the minimum MSE among the 2k possible models MSE = mean square error for p coefficient model n = Number of observations p = number of coefficients (explanatory variables+1) k = total number of explanatory variables

24 Measures of Quality: Mallows' Cp (alternate form)
Cp = SSEp / MSEfull + 2(p + 1) − n
- SSEp = residual sum of squares for the model with p variables
- MSEfull = residual mean square for the model with all k variables
- n = number of observations
- p = number of independent variables in the subset (of the k total)
- k = total number of independent variables
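A sketch of this alternate form on the synthetic data from the first example, scoring a one-variable subset (constant + X1) against the full two-variable model; the subset choice is purely illustrative.

```python
import statsmodels.api as sm

full = sm.OLS(y, X_design).fit()           # all k = 2 variables
sub = sm.OLS(y, X_design[:, :2]).fit()     # constant + X1 only
p = 1                                      # independent variables in the subset
cp = sub.ssr / full.mse_resid - len(y) + 2 * (p + 1)   # .ssr is SSE_p
print(f"Mallows' Cp: {cp:.2f}")            # want Cp small and close to p + 1
```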

25 Measures of Quality: PRESS (Prediction Sum of Squares)
Assesses the model's predictive ability; in general, the smaller the PRESS value, the better. Used to calculate predicted R². PRESS = Σ e(i)², where e(i) is the residual for the ith point from a model fit with the ith point deleted.
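A sketch of PRESS and predicted R², using the hat-matrix identity e(i) = ei / (1 − hii) so no explicit leave-one-out refitting loop is needed; it reuses `X_design`, `y`, and `model` from the first example.

```python
import numpy as np

h = np.diag(X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T)
press_resid = model.resid / (1 - h)        # e_(i): residual with point i deleted
press = np.sum(press_resid ** 2)
r2_pred = 1 - press / np.sum((y - y.mean()) ** 2)
print(f"PRESS = {press:.3f}, predicted R^2 = {r2_pred:.3f}")
```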

26 How to Build a Good Regression Model
Final check: compute the regression model and look for:
- A linear pattern in the data
- Constant variance for all X
- Normal distribution of residuals
- Significant t-statistics on all variables

27 Stepwise Regression
An automated tool used to identify a useful subset of predictors for building a multiple linear regression model. The procedure systematically adds the most significant variable or removes the least significant variable at each step.
Standard stepwise: adds and removes predictors as needed at each step; stops when all variables not in the model have p-values greater than α and all variables in the model have p-values less than or equal to α.
Source: MINITAB 15

28 Stepwise Regression (cont.)
Forward selection: starts with no predictors in the model and adds the most significant variable at each step; stops when all variables not in the model have p-values greater than α (a sketch follows below).
Backward elimination: starts with all predictors in the model and removes the least significant variable at each step; stops when all variables in the model have p-values less than or equal to α.
Source: MINITAB 15
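A minimal sketch of forward selection by partial-t p-value, in the spirit of the MINITAB procedure described above; the default `alpha` and the stopping rule are illustrative simplifications, and `X` and `y` come from the first example.

```python
import statsmodels.api as sm

def forward_select(y, X, alpha=0.15):
    """Greedily add the column of X whose partial t-test has the smallest p-value."""
    remaining, chosen = list(range(X.shape[1])), []
    while remaining:
        # p-value of each candidate variable when added to the current model
        pvals = {j: sm.OLS(y, sm.add_constant(X[:, chosen + [j]])).fit().pvalues[-1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:            # stop: no remaining variable is significant
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

print("Selected columns:", forward_select(y, X))
```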

29 Stepwise Regression (cont.)
Potential pitfalls:
- If two independent variables are highly correlated, only one may end up in the model, even though either may be important.
- Because the procedure fits many models, it can select ones that fit the data well due to chance alone.
- It may not end with the model having the highest possible R² value.
- Automatic procedures cannot take the analyst's knowledge of the data into account, so the selected model may not be the best from a practical point of view.
Source: MINITAB 15

30 MINITAB Laboratory 8 Reading Assignment
Chapter 11, Multiple Linear Regression (pages 295 to 322), in Statistical Methods in Water Resources by D.R. Helsel and R.M. Hirsch.

