Multiple Linear Regression
Chung-Yi Li, PhD Dept. of Public Health, College of Med. NCKU
Topics
- Reasons for conducting MLR
- Building an MLR model
- MLR diagnostics
- Examples
- SST, SSR, SSE
- Partial correlation
- Dummy variables
Multiple Linear Regression
ANOVA extends the t-test to comparisons among multiple group means. Linear regression extends ANOVA to continuous predictor variables, and multiple linear regression extends simple linear regression to multiple predictor variables. For example:
- Systolic blood pressure predicted by body mass index, sodium intake, and race
- Body mass index predicted by caloric intake and gender
- Caloric intake predicted by a measure of stress and gender
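To make the first example concrete, here is a minimal sketch of fitting such a model in Python with statsmodels; the data frame, values, and variable names (sbp, bmi, sodium, race) are hypothetical stand-ins, not from the original slides.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data illustrating the SBP example
df = pd.DataFrame({
    "sbp":    [120, 135, 128, 142, 118, 150],        # systolic blood pressure
    "bmi":    [22.1, 28.4, 25.0, 31.2, 21.5, 33.0],  # body mass index
    "sodium": [2.1, 3.4, 2.8, 3.9, 2.0, 4.2],        # sodium intake, g/day
    "race":   ["white", "black", "white", "black", "white", "black"],
})

# C(race) expands the nominal variable into dummy variables automatically
model = smf.ols("sbp ~ bmi + sodium + C(race)", data=df).fit()
print(model.summary())
```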
Reasons for Multivariate Regression Models - I
First, to assess the effects of different factors on the subgroup means of an outcome. Because the factors have differing effects on the outcome and are interrelated among themselves, it is necessary to look at a number of predictors simultaneously, e.g., systolic blood pressure predicted by body mass index, sodium intake, and race.
We know that each predictor has an independent effect on systolic blood pressure, but how much of the information carried by each one is also carried by the others? It could be that race is a surrogate for sodium intake and that sodium intake is actually the cause of high blood pressure. We cannot find this out without a multiple regression model.
Reasons for Multivariate Regression Models - II
Second, to predict an individual's value of the outcome based on the values of the predictor variables. For example, based on body mass index, sodium intake, and race, what would be the predicted systolic blood pressure for a person? Why? To identify people whose systolic blood pressure is significantly higher than expected among similar people.
Reasons for Multivariate Regression Models - III
Third, to assess the interaction between variables. For example, a study aims to investigate the influence of family SES and race on time spent watching TV. It is possible that family SES and race have an interactive effect on TV-watching time.
A Statistical Viewpoint of Effect-Modification
Suppose
Y = time of TV watching
X1 = race (0 = White; 1 = Black)
X2 = family SES (0 = high SES; 1 = low SES)

Y = β0 + β1X1 + ε ………. (eq.1)

is a simple linear regression model, where β1 represents the "crude" association between X1 and Y.

To assess the possible role of X2 as an effect-modifier, we include the "interaction" term X1*X2 as an independent variable in the multivariate linear regression model:

Y = β0 + β1X1 + β2X2 + β3(X1*X2) + ε ………. (eq.2)

is a multivariate linear regression model, where the association between X1 and Y depends on β3.

If X2 = 1 (low family SES), then eq.2 can be re-expressed as:

Y = (β0 + β2) + (β1 + β3)X1 + ε ………. (eq.3)

If X2 = 0 (high family SES), then eq.2 can be re-expressed as:

Y = β0 + β1X1 + ε ………. (eq.4)

If β3 = 0 (i.e., interaction between X1 and X2 does not exist), the association between X1 and Y remains constant across levels of X2. Otherwise, X2 modifies the association between X1 and Y.
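A minimal sketch of fitting eq.2 in Python with statsmodels follows; the data and the column names tv_time, race, and ses are hypothetical, coded 0/1 as defined above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical 0/1-coded data: race (0=White, 1=Black), ses (0=high, 1=low)
df = pd.DataFrame({
    "tv_time": [2.0, 3.5, 2.5, 4.8, 1.8, 4.1, 2.2, 5.0],
    "race":    [0, 1, 0, 1, 0, 1, 0, 1],
    "ses":     [0, 0, 1, 1, 0, 1, 1, 1],
})

# race + ses + race:ses keeps both main effects in the model alongside
# the interaction, as the principle of hierarchy (below) requires.
model = smf.ols("tv_time ~ race + ses + race:ses", data=df).fit()
print(model.params)               # estimates of β0, β1, β2, β3 in eq.2
print(model.pvalues["race:ses"])  # test of H0: β3 = 0 (no interaction)
```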
Interpretation of Interaction Terms
If the test of significance for an interaction term is significant, it may be necessary to fit separate models based on the levels of the interacting variable. For example, if the "race × income" interaction is significant, fit separate models for white and for black participants. If no significant interaction is present, the relationship between income and TV watching is similar for black and white participants, and that relationship is represented by the income "main effects" term.
Principle of hierarchy: the corresponding main effects must remain in the model if the interaction term remains in the model. Otherwise, the interaction is impossible to interpret, because there is no reference group.
The SPSS Output for the Test of Interaction
An example using IQ, high-school score, and entrance test score. [SPSS output not reproduced here.]
Variable Selection in a Regression Model
- Standard
- Hierarchical
- Stepwise: forward solution, backward solution, stepwise solution
Standard
All the independent variables are entered at once. Each variable is evaluated in relation to the dependent variable and the other independent variables through the use of partial correlation coefficients.
Hierarchical
The researcher may want to force the order of entry of variables into the equation. Suppose you want to know whether a particular intervention would improve pregnancy outcomes. You already have some givens, such as age, socioeconomic status, and nutritional status, and you would like to know whether your intervention makes a difference over and above the factors you cannot change. You might then enter the givens first and add your intervention last.
An Example of Hierarchical Regression
A study of quality of life among people with multiple sclerosis (Gulick, 1997) entered the independent variables in three blocks:
- in the first block, the demographic variables;
- in the second, the symptom variables;
- in the third, activities of daily living (ADLs).
With demographics and symptoms already entered, two of the ADL variables, intimacy and recreation/socialization, still made significant contributions. (A code sketch of this blockwise strategy follows.)
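Here is a minimal sketch of hierarchical entry in Python with statsmodels: fit the givens first, add the focal variable last, and test the R² change with a partial F-test. The data and the names age, ses, intervention, and outcome are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "age":          rng.normal(28, 5, n),
    "ses":          rng.normal(0, 1, n),
    "intervention": rng.integers(0, 2, n),
})
df["outcome"] = 0.3 * df["age"] + 0.5 * df["ses"] + df["intervention"] + rng.normal(0, 2, n)

block1 = smf.ols("outcome ~ age + ses", data=df).fit()                 # the givens
block2 = smf.ols("outcome ~ age + ses + intervention", data=df).fit()  # + intervention

print(anova_lm(block1, block2))                         # partial F-test of the added block
print("R² change:", block2.rsquared - block1.rsquared)
```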
Stepwise: Forward Solution
The independent variable that has the highest correlation with the dependent variable is entered first. The second variable entered is the one that increases R² the most over and above what the first variable contributed.
Once the first variable is entered, partial correlations are calculated between each of the remaining independent variables and the dependent variable; thus, the effect of the first variable is removed from the correlations. Various criteria may be set for entry into the regression equation; the 0.05 level of significance is often used. Once none of the remaining independent variables can contribute significantly to R², the analysis ends. (A code sketch follows the figure below.)
[Figure: Venn diagram of forward entry. The red variable enters first, followed by blue, and finally green.]
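Here is a minimal sketch of the forward solution in Python, using the coefficient p-value as the 0.05 entry criterion; statsmodels OLS is assumed, with X a pandas DataFrame of candidate predictors and y the outcome.

```python
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {}
        for var in remaining:
            exog = sm.add_constant(X[selected + [var]])
            pvals[var] = sm.OLS(y, exog).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:   # no remaining variable contributes significantly
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```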
Stepwise: Backward Solution
In this method, we start with the overall R² generated by putting all of our independent variables in the equation. Each variable is then tested to see what would happen if it were the last one entered into the equation:
- R²y.1234 − R²y.234 is tested for variable 1
- R²y.1234 − R²y.134 is tested for variable 2
- R²y.1234 − R²y.124 is tested for variable 3
- R²y.1234 − R²y.123 is tested for variable 4
If, for any one of these variables, there is a significant drop in R², that variable is contributing significantly and will not be removed. If one or more variables are not significant, the model is refit with the other independent variables, excluding these non-significant variables.
Then each of the variables remaining in the model is tested to see whether it would contribute significantly if it were entered last. The analysis continues until all variables in the equation contribute significantly when entered last. (A code sketch follows the figure below.)
[Figure: Venn diagram of backward elimination. Removing the blue variable might reduce R² more than removing the red one.]
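Here is a minimal sketch of the backward solution, in the same hypothetical setup as the forward sketch above. It relies on the fact that the t-test p-value of a coefficient in the full model is equivalent to the partial F-test of the R² drop when that variable is removed (F = t²).

```python
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    selected = list(X.columns)
    while selected:
        exog = sm.add_constant(X[selected])
        pvals = sm.OLS(y, exog).fit().pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:  # every variable contributes if entered last
            break
        selected.remove(worst)     # drop the least significant variable and refit
    return selected
```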
Stepwise: Stepwise Solution
The stepwise solution combines the forward solution with the backward solution, thereby overcoming the difficulties associated with each.
With the forward solution, once a variable is in the equation, it is never removed; no attempt is made to reassess the contribution of a variable once other variables have been added. The backward solution remedies that problem, but the order of entry is not clear. With the stepwise solution, variables are entered as in the forward solution and are reassessed at each step, using the backward method, to determine whether their contribution is still significant given the effects of the other variables in the equation.
Summary of Methods of Entry
Selecting a method for entering variables into the equation is an important decision, because the results will differ depending on the method selected. Stepwise methods were in vogue in the 1970s, but they are less popular today: because the order of entry is based on a statistical rather than a theoretical rationale, the technique has been criticized for capitalizing on chance. The better view is that variables should not be "dumped" into an analysis and that "large samples are an absolute necessity."
Multicollinearity
Because variables collected in medical research often provide very similar information, they are often highly correlated with each other. Such high interrelatedness makes the evaluation of results problematic, e.g., SBP, body weight, and BMI in the prediction of stroke risk.
Indications of the Problem of Multicollinearity
- High correlations between variables (>0.85)
- A substantial R² but statistically insignificant coefficients
- Unstable regression coefficients (i.e., weights that change dramatically when variables are added to or dropped from the equation)
- Coefficients of unexpected size (much larger or smaller than expected)
- Coefficients with unexpected signs (positive instead of negative, or vice versa)
The "Tolerance" Measure of Collinearity
Tolerance is the proportion of the variance in a variable that is not accounted for by the other independent variables. To obtain tolerance measures, each independent variable is treated in turn as a dependent variable and is regressed on the other independent variables.
A high multiple correlation from such a regression indicates that the variable is closely related to the other independent variables. If the multiple R² equals 1, the independent variable is completely determined by the others.
Criteria for the Tolerance
Tolerance is simply 1 − R²; therefore, a tolerance of 0 (1 − 1 = 0) indicates perfect collinearity: the variable is a perfect linear combination of the other variables. The variance inflation factor (VIF) is the reciprocal of the tolerance, so variables with high tolerances have small VIFs, and vice versa.
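Here is a minimal sketch of computing tolerance and VIF directly from these definitions; X is assumed to be a pandas DataFrame of predictors.

```python
import statsmodels.api as sm

def tolerance_and_vif(X):
    out = {}
    for col in X.columns:
        # Regress each predictor on all the others
        others = sm.add_constant(X.drop(columns=col))
        r2 = sm.OLS(X[col], others).fit().rsquared
        # Tolerance = 1 - R²; VIF = 1 / tolerance (undefined if R² = 1)
        out[col] = {"tolerance": 1 - r2, "VIF": 1 / (1 - r2)}
    return out
```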
Example
[SPSS output for this example not reproduced here.]
ANOVA Table from MLR
Coefficient of multiple determination (R²)
- The ratio of the regression sum of squares (SSR) to the total sum of squares (SST)
- Can be interpreted as the proportion of the total variation in the outcome that can be explained by the regression model
Total sum of squares (SST)
- The total SS in the outcome for the entire sample, analogous to the variance: the sum of squared deviations of the observed values from the overall mean
Error Sum of Squares (SSE)
- The unexplained variation: the sum of the squared deviations between the observed and predicted values
- This is what least squares minimizes
Regression Sum of Squares (SSR)
- Used to test the global fit of the model, i.e., whether any parameters are significantly different from zero
- Represents the variability in the outcome accounted for by the regression line
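Here is a minimal sketch of the decomposition SST = SSR + SSE for a fitted OLS model, on hypothetical simulated data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(50, 2)))      # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
fit = sm.OLS(y, X).fit()

yhat = fit.fittedvalues
sst = np.sum((y - y.mean()) ** 2)     # total SS around the overall mean
ssr = np.sum((yhat - y.mean()) ** 2)  # regression (explained) SS
sse = np.sum((y - yhat) ** 2)         # error (residual) SS

assert np.isclose(sst, ssr + sse)     # the decomposition SST = SSR + SSE
print(ssr / sst, fit.rsquared)        # both equal R²
```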
Test Score for College Entrance Examination in Association with Score at High School and IQ
[Data table for 11 students not recoverable from the extraction.]
Multivariate and Partial Correlations
Multivariate correlation is an indicator of the strength of the relationship between the outcome and the set of predictor variables. It can be tested with an F-test, but the F-statistic will be the same as that from the test of the regression sum of squares. Partial correlation is an indicator of the strength and direction of the relationship between the outcome and one predictor with the effects of the other variables removed. It can be calculated from simple correlations or produced by software.
[Figure: Venn diagram of the overlapping variances of entrance, iq, and h_school, with the regions labeled A through G.]
In terms of these regions, the correlation between iq and entrance can be presented as:
- Zero-order correlation: (E+G)/(A+E+D+G)
- Partial correlation: E/(A+E)
- Part correlation: E/(A+E+D+G)
- Multivariate correlation: (E+G+D)/(A+E+D+G)
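Here is a minimal sketch of the zero-order, partial, and part (semipartial) correlations between iq and entrance, controlling for h_school, on hypothetical data; the squares of these correlations correspond to the Venn-region ratios above. The partial correlation correlates the residuals left after regressing each variable on h_school.

```python
import numpy as np

rng = np.random.default_rng(2)
h_school = rng.normal(size=100)
iq = 0.6 * h_school + rng.normal(size=100)
entrance = 0.5 * iq + 0.4 * h_school + rng.normal(size=100)

def residualize(v, control):
    # Residuals of v after regressing it on control (with an intercept)
    X = np.column_stack([np.ones_like(control), control])
    beta, *_ = np.linalg.lstsq(X, v, rcond=None)
    return v - X @ beta

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(corr(iq, entrance))                          # zero-order
print(corr(residualize(iq, h_school),
           residualize(entrance, h_school)))       # partial
print(corr(residualize(iq, h_school), entrance))   # part (semipartial)
```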
Dummy Variables
Use "dummy" (indicator) variables to represent the levels of a variable measured on the nominal scale, e.g., education, income, race, or gender, whose levels have no intrinsic numeric relationship.
Procedure: choose one level as the reference level, form n−1 dummy variables to represent the n levels of the original variable, and use the set of dummy variables in the regression model.
Example of Constructing Dummy Variables
Income has three levels: low, middle, and high. Choose low income as the reference group and form two indicator variables:
- income1: 0 = low, 1 = middle
- income2: 0 = low, 1 = high
Note that both income1 and income2 have to be in the model to properly represent the low-income group. A person from the middle-income category would have income1 = 1 and income2 = 0.
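Here is a minimal sketch of this reference cell coding with pandas; the column name income and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": ["low", "middle", "high", "middle", "low"]})

# Make "low" the first (reference) category, then drop its column
cats = pd.Categorical(df["income"], categories=["low", "middle", "high"])
dummies = pd.get_dummies(cats, prefix="income", drop_first=True)

# A "low" person has 0 in both income_middle and income_high
print(pd.concat([df, dummies], axis=1))
```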
Effect Coding

                              Design (dummy) variables
Variable   Class value   Label        1      2
Income          1        Low          1      0
                2        Medium       0      1
                3        High        -1     -1
Reference Cell Coding

                              Design (dummy) variables
Variable   Class value   Label        1      2
Income          1        Low          1      0
                2        Medium       0      1
                3        High         0      0
Reference Cell Coding

                   Dummy variables
                   X1     X2     X3
TLBW                1      0      0
PNBW                0      1      0
PLBW                0      0      1
TNBW (reference)    0      0      0