
1 MULTIPLE REGRESSION Using more than one variable to predict another

2 Last Week
- Coefficient of determination (r²): explained variance between 2 variables
- Simple linear regression: y = mx + b
  - Predicting one variable from another, based on explained variance: if r² is large, it should be a good predictor
  - Predicting one dependent variable from one independent variable
- SEE (standard error of estimate) and residuals

3 Tonight
- Predicting one DV from one IV is simple linear regression
- Predicting one DV from multiple IVs is called multiple linear regression
- More IVs usually allow for a better prediction of the DV
- If IV A explains 20% of the variance (r² = 0.20) and IV B explains 30% of the variance (r² = 0.30), can I use both to predict the dependent variable?

4 Example: Activity Dataset
- To demonstrate, we'll use the same data as last week, from the pedometer and armband
- Goal: to predict Armband calories (real calories expended) as accurately as possible
- Let's start by trying to predict Armband calories from body weight
- Complete a simple linear regression with body weight
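
A minimal sketch of this step in Python. The slides show statistical-software output rather than code, so this snippet, the file name activity.csv, and the column names Weight and ArmbandCal are illustrative assumptions, not part of the original materials:

```python
# Simple linear regression: predict armband calories from body weight.
# "activity.csv", "Weight", and "ArmbandCal" are hypothetical names.
import pandas as pd
from scipy import stats

data = pd.read_csv("activity.csv")
fit = stats.linregress(data["Weight"], data["ArmbandCal"])

print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}")   # prediction equation
print(f"r^2 = {fit.rvalue ** 2:.3f}")                  # explained variance
```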

5 Simple Regression
- Here is the simple regression output from using Body Weight (kg) to predict Armband Calories (output not reproduced in this transcript)

6 Simple Regression
- Results using Body Weight (kg):
  - r² = 0.155
  - SEE = 400.5 calories
- Can we improve on this equation by adding in new variables?
- First, we have to determine whether other variables in the dataset might be related to Armband Calories
- Use a correlation matrix
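
Continuing the hypothetical sketch above, one way to produce that correlation matrix:

```python
# Correlate every numeric variable with every other; squaring the DV's
# column shows the variance each IV would explain on its own.
corr = data.corr(numeric_only=True)
print(corr["ArmbandCal"])        # r of each variable with the DV
print(corr["ArmbandCal"] ** 2)   # corresponding r^2 values
```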

7 Correlations
- Notice several variables have some association with armband calories:

  Variable      r      r²
  Height        0.225  0.05
  Weight        0.393  0.15
  BMI           0.378  0.14
  PedSteps      0.782  0.61
  PedCalories   0.853  0.73

8 Create new regression equation
- The simple regression equation looks like: y = mx + b
- The multiple regression equation looks like: y = m₁x₁ + m₂x₂ + b
- Subscripts are used to help organize the data
- All we are doing is adding an additional variable into our equation
- That new variable will have its own slope, m₂
- For the sake of simplicity, let's add in pedometer steps as x₂
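
A sketch of fitting y = m₁x₁ + m₂x₂ + b with statsmodels, continuing the hypothetical column names from above:

```python
# Multiple regression with two IVs: body weight (x1) and pedometer steps (x2).
import statsmodels.api as sm

X2 = sm.add_constant(data[["Weight", "PedSteps"]])   # add_constant supplies b
model2 = sm.OLS(data["ArmbandCal"], X2).fit()

print(model2.params)                    # intercept b, slopes m1 and m2
print(f"r^2 = {model2.rsquared:.3f}")
```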

9 Output… (regression output not reproduced in this transcript)

10 Multiple Regression Output (not reproduced in this transcript)

11 Simple to Multiple
- Results using Body Weight (kg):
  - r² = 0.155
  - SEE = 400.5 calories
- Results using Body Weight and Pedometer Steps:
  - r² = 0.672
  - SEE = 251.7 calories
  - r² change = 0.672 - 0.155 = 0.517
- If 2 variables are good, would 3 be even better?

12 Adding one more in…
- In addition to body weight (x₁) and pedometer steps (x₂), let's add in age (x₃)

13 Multiple Regression Output 2 (not reproduced in this transcript)

14 Simple to Multiple
- Results using Body Weight (kg):
  - r² = 0.155
  - SEE = 400.5 calories
- Results using Body Weight and Pedometer Steps:
  - r² = 0.672
  - SEE = 251.7 calories
  - r² change = 0.517
- Results using Body Weight, PedSteps, and Age:
  - r² = 0.689
  - SEE = 247.7 calories
  - r² change = 0.689 - 0.672 = 0.017
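
SEE and the r² change can be computed straight from the fitted models. A sketch reusing model2 from the earlier snippet, and assuming the dataset has an Age column:

```python
import numpy as np

def see(model):
    # Standard error of estimate: sqrt(SS_residual / df_residual),
    # where df_residual = n - k - 1 for k predictors plus an intercept.
    return float(np.sqrt(model.ssr / model.df_resid))

X3 = sm.add_constant(data[["Weight", "PedSteps", "Age"]])
model3 = sm.OLS(data["ArmbandCal"], X3).fit()

print(f"SEE = {see(model3):.1f} calories")
print(f"r^2 change = {model3.rsquared - model2.rsquared:.3f}")
print(f"p-value for Age = {model3.pvalues['Age']:.3f}")   # the 0.104 cited below
```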

15 Multiple Regression Decisions
- Should we recommend that age be used in the model?
- These decisions can be difficult
- "Model building" or "model reduction" is more of an art than a science
- Consider:
  - p-value of age in the model = 0.104
  - r² change from adding age = 0.017, or 1.7% of the variance
  - More coefficients (predictors) make the model more complicated to use and interpret
- Does it make sense to include age? Should age be related to caloric expenditure?

16 Other Regression Issues
- Sample Size
  - With too small a sample, you lack the statistical power to generalize your results to other samples or the whole population
  - You increase your risk of a Type II error (failing to reject the null hypothesis when it is false)
  - In multiple regression, the more variables you use in your model, the greater your risk of a Type II error
  - This is a complicated issue, but essentially you need large samples to use several predictors
  - Guidelines…

17 Other Regression Issues
- Sample Size
  - Tabachnick & Fidell (1996): N > 50 + 8m, where N = appropriate sample size and m = number of IVs
  - So, if you use 3 predictors (as we just did in our example): 50 + 8(3) = 74 subjects
  - You can find several different 'guess-timates'; I usually just try to have 30 subjects, plus another 30 for each variable in the model (i.e., 30 + 30m)
  - I like to play it safe…
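
Both rules of thumb are simple arithmetic; a quick sketch:

```python
def n_tabachnick_fidell(m):
    # Tabachnick & Fidell (1996): N > 50 + 8m, m = number of IVs.
    return 50 + 8 * m

def n_play_it_safe(m):
    # The more conservative classroom rule: 30 subjects plus 30 per IV.
    return 30 + 30 * m

for m in (1, 2, 3):
    print(f"{m} IVs: > {n_tabachnick_fidell(m)} (T&F) vs. {n_play_it_safe(m)} (safe)")
# With 3 predictors: more than 74 subjects vs. 120 under the safer rule.
```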

18 Other Regression Issues
- Multiple regression has the same statistical assumptions as correlation/regression
  - Check for normal distribution, outliers, etc.
- One new concern with multiple regression is the idea of collinearity
  - You have to be careful that your IVs (predictor variables) are not highly correlated with each other
  - Collinearity can cause a model to overestimate r²
  - It can also cause one new variable to eliminate another

19 Example Collinearity
- Results of MLR using Body Weight, PedSteps, and Age:
  - r² = 0.689
  - SEE = 247.7 calories
- Imagine we want to add in one other variable, Pedometer Calories
- Look at the correlation matrix first…

20 (correlation matrix not reproduced in this transcript)
- Notice that Armband Calories is highly correlated with both Pedometer Steps and Pedometer Calories
- Initially, this looks great because we might have two very good predictors to use
- But notice that Pedometer Calories is very highly correlated with Pedometer Steps
- These two variables are probably collinear: they are very similar and may not explain 'unique' variance

21 (MLR output comparison, not reproduced in this transcript)
- Here is the MLR result with Weight, Steps, and Age
- Here is the MLR result after adding Pedometer Calories to the model
- Pedometer Calories becomes the only significant predictor in the model
- In other words, the variance in the other 3 variables can be explained by Pedometer Calories; not all 4 variables add 'unique' variance to the model

22 Example Collinearity
- Results of MLR using Body Weight, PedSteps, and Age:
  - r² = 0.689
  - SEE = 247.7 calories
- Results of MLR using Body Weight, PedSteps, Age, and PedCalories:
  - r² = 0.745
  - SEE = 226.2 calories
- Results of MLR using just PedCalories (eliminating the collinearity):
  - r² = 0.727
  - SEE = 227.5 calories
- Which model is the best model? Remember, we'd like to pick the strongest prediction model with the fewest predictor variables
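
Reusing the see() helper and hypothetical column names from the earlier sketches, the single-predictor model takes one refit:

```python
# The PedCalories-only model: nearly the same r^2 with one IV instead of four.
X1 = sm.add_constant(data[["PedCalories"]])
model1 = sm.OLS(data["ArmbandCal"], X1).fit()
print(f"r^2 = {model1.rsquared:.3f}, SEE = {see(model1):.1f} calories")
```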

23 Model Building
- Collinearity makes model building more difficult:
  1) When you add in new variables, you have to look at r², r² change, and SEE, but you also have to notice what's happening to the other IVs in the model
  2) Sometimes you need to remove variables that used to be good predictors
  3) This is why the model with the most variables is not always the best model; sometimes you can do just as well with 1 or 2 variables

24 What to do about Collinearity?
- Your approach: use a correlation matrix to examine the variables BEFORE you try to build your model
  1) Check the IVs' correlations with the DV (high correlations will probably make the best predictors), but…
  2) Check the IVs' correlations with the other IVs (high correlations probably indicate collinearity)
- If you do find that two IVs are highly correlated, be aware that having them both in the model is probably not the best approach (pick the best one and keep it)
- QUESTIONS…?
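
Beyond eyeballing the correlation matrix, one common numeric check is the variance inflation factor (VIF), where values above roughly 5 to 10 are usually taken to flag collinearity. VIF is not covered in the slides; this is a supplementary sketch:

```python
# VIF for each IV in the 4-predictor model from the collinearity example.
from statsmodels.stats.outliers_influence import variance_inflation_factor

X4 = sm.add_constant(data[["Weight", "PedSteps", "Age", "PedCalories"]])
for i, name in enumerate(X4.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X4.values, i), 1))
# Expect PedSteps and PedCalories to show large VIFs if they are collinear.
```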

25 Upcoming…
- In-class activity on MLR…
- Homework (not turned in, due to the exam):
  - Cronk Section 5.4
  - OPTIONAL: Holcomb Exercises 31 and 32
    - Multiple correlation, NOT full multiple linear regression
    - Similar to MLR, but looks at the model's r instead of making a prediction equation
- Mid-Term Exam next week
- Group differences after spring break (t-test, ANOVA, etc.)

