SADC Course in Statistics Multiple Linear Regression: Introduction (Session 06)

Learning Objectives
At the end of this session, you will be able to:
- interpret results from a regression model with more than one explanatory variable
- understand the specific hypotheses being tested by t-values associated with parameter estimates
- appreciate what might be done with outliers identified via residual plots

More than one explanatory variable
In real-life examples, the following types of question may be asked:
- What factors affect child mortality?
- Can household socio-economic characteristics be identified that relate closely to household poverty levels?
- Would provision of free fertiliser and packs of seed increase crop productivity and hence improve the livelihoods of farmers?
Addressing these questions leads to fitting multiple linear regression models.

Example with 3 explanatory variables
A random sample of 45 university students was asked to decide personally which of a set of 25 acts they would consider to be a crime. The number of acts selected was recorded as the variable named crimes. Data were also collected on each student's age, years in college and parents' income.
Question: Which of the three factors (if any) have an effect on students' views on what acts constitute a crime?

$y$ = number of acts regarded as being criminal (crimes)
$x_1$ = age
$x_2$ = years in college
$x_3$ = parents' income
Start with some scatter plots.
[Figure: scatter plots of crimes against age, college years and parents' income]

Initial visual impression
- Crimes appears most strongly associated with age and income, although for age the association is not linear. There may also be one outlier.
- Crimes does not appear to be associated with years in college.
- Regression analysis procedures can test whether these observations are telling us something real about the relationship.

Aim
We aim for the simplest possible model, i.e. the one with the fewest parameters, that
- still adequately summarises the relationship between the response (here crimes) and one or more of the predictors (here age, college years and income), and
- gives information on which of the explanatory variables contribute to the variability in crimes.

Anova with all 3 variables
---------+------------------------------------------
Source   |       SS   df       MS       F     Prob
---------+------------------------------------------
Model    |  1244.02    3   414.67   51.79    0.000
Residual |   328.29   41     8.01
---------+------------------------------------------
Total    |  1572.31   44    35.73
---------+------------------------------------------
Here the F-probability of 0.000 indicates there is strong evidence that at least one of the 3 explanatory variables contributes significantly to variability in crimes.
The adjusted $R^2$ value is 77.6%.
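As a check on how the entries of the ANOVA table fit together, the following minimal sketch recomputes the mean squares, the F-ratio and the adjusted $R^2$ from the sums of squares and degrees of freedom shown above. The variable names are illustrative and not part of the original session. The only inputs are the numbers printed in the table.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_model, df_model = 1244.02, 3
ss_resid, df_resid = 328.29, 41
ss_total = ss_model + ss_resid
df_total = df_model + df_resid

ms_model = ss_model / df_model   # mean square for the model
ms_resid = ss_resid / df_resid   # residual mean square
ms_total = ss_total / df_total   # total mean square (sample variance of crimes)

f_ratio = ms_model / ms_resid              # F statistic for the overall model
r_squared = ss_model / ss_total            # proportion of variability explained
adj_r_squared = 1 - ms_resid / ms_total    # adjusted for degrees of freedom

print(f"F = {f_ratio:.2f}")
print(f"R-squared = {r_squared:.3f}")
print(f"Adjusted R-squared = {adj_r_squared:.3f}")
```

The adjusted $R^2$ uses mean squares rather than sums of squares, which is why it is slightly smaller than the unadjusted value.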

Parameter Estimates
---------+----------------------------------------
crimes   |      Coef.   Std. Err.       t   P>|t|
---------+----------------------------------------
age      |   .3403046    .2174386    1.57   0.125
college  |   .5838187    1.307748    0.45   0.658
income   |   .3255904    .0309183   10.53   0.000
const.   |  -11.18371    2.592542   -4.31   0.000
---------+----------------------------------------
Hence the equation describing the model is:
Crimes (y) = -11.18 + 0.34(age) + 0.58(college) + 0.33(income)
More generally,
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$

Interpretation of t-probabilities
---------+----------------------------------------
crimes   |      Coef.   Std. Err.       t   P>|t|
---------+----------------------------------------
age      |   .3403046    .2174386    1.57   0.125
college  |   .5838187    1.307748    0.45   0.658
income   |   .3255904    .0309183   10.53   0.000
const.   |  -11.18371    2.592542   -4.31   0.000
---------+----------------------------------------
Each t-probability tests the null hypothesis that the corresponding coefficient is zero, i.e. it indicates whether that variable contributes significantly to the model in the presence of the other two.
Thus age, added to a model already including college and income, does not explain a significant additional amount of variability in crimes.
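Each t-value in the table is simply the coefficient divided by its standard error. The short sketch below reproduces the t column from the printed estimates. The dictionary layout and names are illustrative only, and the p-values are not recomputed here because that would need the t-distribution with 41 degrees of freedom.

```python
# (coefficient, standard error) pairs copied from the parameter estimates table.
estimates = {
    "age": (0.3403046, 0.2174386),
    "college": (0.5838187, 1.307748),
    "income": (0.3255904, 0.0309183),
    "const": (-11.18371, 2.592542),
}

# Each t-value tests H0: coefficient = 0, given the other terms in the model.
t_values = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_values.items():
    print(f"{name:8s} t = {t:6.2f}")
```

A large standard error relative to the coefficient, as for college, gives a t-value close to zero and hence a large t-probability.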

Next steps: finding the best model
Since both age and college give non-significant p-values, should we drop both?
Most definitely the answer is NO! At most, we drop one variable and look at the results. Dropping college gives the following:
---------+----------------------------------------
crimes   |      Coef.   Std. Err.       t   P>|t|
---------+----------------------------------------
age      |     .42947      .08512    5.05   0.000
income   |     .31854      .02633   12.10   0.000
const.   |    -11.236      2.5651   -4.38   0.000
---------+----------------------------------------

Meaning of regression coefficients
---------+----------------------------------------
crimes   |      Coef.   Std. Err.       t   P>|t|
---------+----------------------------------------
age      |     .42947      .08512    5.05   0.000
income   |     .31854      .02633   12.10   0.000
const.   |    -11.236      2.5651   -4.38   0.000
---------+----------------------------------------
Interpret the regression coefficient 0.43 for age (or 0.32 for income) as the change in crimes for a unit change in age (or income), provided the other variable in the model is held constant.
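The "unit change with the other variable held constant" interpretation can be illustrated with a small sketch that evaluates the fitted equation. The function name and the example student (age 20, income 50) are hypothetical. The transcript does not state the units of income, so the input values are illustrative only.

```python
# Coefficients from the fitted model with age and income.
B_CONST, B_AGE, B_INCOME = -11.236, 0.42947, 0.31854


def predict_crimes(age, income):
    """Fitted number of acts regarded as crimes for a given age and income."""
    return B_CONST + B_AGE * age + B_INCOME * income


# Hypothetical student: the age and income values are illustrative only.
base = predict_crimes(age=20, income=50)
older = predict_crimes(age=21, income=50)    # one year older, same income
richer = predict_crimes(age=20, income=51)   # one unit more income, same age

print(f"effect of +1 age:    {older - base:.3f}")
print(f"effect of +1 income: {richer - base:.3f}")
```

Whatever starting values are chosen, the two differences equal the age and income coefficients, because the model is linear in both variables.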

Final steps: residual plots
[Figure: a normal probability plot of the residuals, and a plot of residuals versus fitted values]
What do you conclude from these plots?

Conclusions
(a) The normality assumption appears OK, but there is some doubt about variance homogeneity.
(b) If the assumptions are accepted as reasonable, age and parents' income contribute significantly to explaining the variability in students' responses concerning the number of acts that constitute a crime.
(c) 78.0% of the variability in crimes was explained by age and income.
(d) The equation describing the relationship is:
Crimes = -11.236 + 0.43(age) + 0.32(income)

Points to note
1. Although age appeared non-significant in the initial model with all 3 explanatory variables, dropping college gave a significant t-value for age. This emphasises the need to remember that the interpretation of t-probabilities depends on the other variables included in the model.
2. The graph of crimes versus age showed a quadratic relationship. Should we therefore consider including $(\text{age})^2$ as an additional variable in the model?

Results including $(\text{age})^2$
---------+----------------------------------------
crimes   |      Coef.   Std. Err.       t   P>|t|
---------+----------------------------------------
age      |    -.75873      .86513   -0.88   0.386
age2     |     .02319      .01680    1.38   0.175
income   |     .29973      .02940   10.20   0.000
const.   |     4.2431       11.50    0.37   0.714
---------+----------------------------------------
There is no evidence of an improvement from adding age-squared, so we return to the previous model, i.e. the model with age and income is still better.

Consider also a model with age + $(\text{age})^2$
---------+----------------------------------------
crimes   |      Coef.   Std. Err.       t   P>|t|
---------+----------------------------------------
age      |    -4.9314      1.4159   -3.48   0.000
age2     |     .10261      .02766    3.71   0.001
const.   |     71.079      17.553    4.05   0.000
---------+----------------------------------------
Without income, there is a significant quadratic relationship.
Age alone explains only 4% of the variability in crimes, but including $(\text{age})^2$ increases the adjusted $R^2$ to 26%.
However, this is much lower than the $R^2$ for the model with age and income, which remains our final choice.
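To see what the fitted quadratic implies, the sketch below evaluates the age + age-squared equation and locates its turning point. This calculation is not part of the original session and the names are illustrative. Whether the turning point lies inside the observed range of ages cannot be checked from the transcript.

```python
# Coefficients from the model with age and age squared.
b_const, b_age, b_age2 = 71.079, -4.9314, 0.10261


def fitted_crimes(age):
    """Fitted crimes under the quadratic-in-age model (income not included)."""
    return b_const + b_age * age + b_age2 * age ** 2


# A quadratic a + b*x + c*x^2 has its turning point at x = -b / (2c).
turning_age = -b_age / (2 * b_age2)

print(f"turning point at age {turning_age:.1f}")
print(f"fitted crimes there: {fitted_crimes(turning_age):.1f}")
```

Because the coefficient of age-squared is positive, the turning point is a minimum: fitted crimes falls with age up to that point and rises afterwards.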

Residual plots: model with age + $(\text{age})^2$
[Figure: residual plots for the model with age and age-squared]
Note the outlier in both plots!
This is not the chosen final model, but if it were, we would need to consider what action to take with outliers. With just one outlier, it can be removed and reported separately.

Practical work follows to ensure learning objectives are achieved…
