Hierarchical Binary Logistic Regression

In hierarchical binary logistic regression, we are testing a hypothesis or research question that some predictor independent variables improve our ability to predict membership in the modeled category of the dependent variable, after taking into account the relationship between some control independent variables and the dependent variable. In multiple regression, we evaluated this question by looking at R² change, the increase in R² associated with adding the predictors to the regression analysis. The analog to R² change in logistic regression is the Block Chi-square, which is the increase in Model Chi-square associated with the inclusion of the predictors. In standard binary logistic regression, we interpreted the SPSS output that compared Block 0, a model with no independent variables, to Block 1, the model that included the independent variables. In hierarchical binary logistic regression, the control variables are added in SPSS in Block 1, the predictor variables are added in Block 2, and the interpretation of the overall relationship is based on the change in the relationship from Block 1 to Block 2.

Output for Hierarchical Binary Logistic Regression after control variables are added
This output is for the sample problem worked below. In this example, the control variables do not have a statistically significant relationship to the dependent variable, but they can still serve their purpose as controls. After the controls are added, the measure of error, -2 Log Likelihood, is shown in the Model Summary table for Block 1.

Output for Hierarchical Binary Logistic Regression after predictor variables are added
The hierarchical relationship is based on the reduction in error associated with the inclusion of the predictor variables. After the predictors are added, the measure of error, -2 Log Likelihood, is shown in the Model Summary table for Block 2. The difference between the -2 log likelihood at Block 1 and the -2 log likelihood at Block 2 is the Block Chi-square (26.870), which is significant at p < .001. Model Chi-square is the cumulative reduction in -2 log likelihood for the controls and the predictors.
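The arithmetic behind the Block Chi-square can be sketched in a few lines of Python. This is only an illustration, not SPSS's own routine: the two -2 log likelihood values are hypothetical placeholders chosen so that their difference matches the reported 26.870, the 5 degrees of freedom assume one per dummy-coded predictor term (three for reliten, two for happy), and the function names are made up for the example.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def block_chi_square(neg2ll_before, neg2ll_after):
    """Reduction in -2 log likelihood when a block of variables is added."""
    return neg2ll_before - neg2ll_after

# Hypothetical -2LL values (not from the SPSS output); only their
# difference, 26.870, comes from the example problem.
neg2ll_block1 = 190.000
neg2ll_block2 = 163.130

chi_sq = block_chi_square(neg2ll_block1, neg2ll_block2)
p_value = chi2_sf(chi_sq, df=5)
print(f"Block chi-square = {chi_sq:.3f}, p = {p_value:.5f}")
```

The same function applied to the -2 log likelihood at Block 0 and at Block 2 would give the cumulative Model Chi-square.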

The Problem in Blackboard
The problem statement tells us: the variables included in the analysis; whether each variable should be treated as metric or non-metric; the type of dummy coding and reference category for non-metric variables; and the alpha for both the statistical relationships and the diagnostic tests.

The Statement about Level of Measurement
The first statement in the problem asks about level of measurement. Hierarchical binary logistic regression requires that the dependent variable be dichotomous, the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls non-metric variables “categorical.” SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided it is useful to use either the first or last category as the reference category.

Marking the Statement about Level of Measurement
Mark the check box as a correct statement because: The dependent variable "should marijuana be made legal" [grass] is dichotomous level, satisfying the requirement for the dependent variable. The independent variable "age" [age] is interval level, satisfying the requirement for independent variables. The independent variable "sex" [sex] is dichotomous level, satisfying the requirement for independent variables. The independent variable "strength of religious affiliation" [reliten] is ordinal level, which the problem instructs us to dummy-code as a non-metric variable. The independent variable "general happiness" [happy] is ordinal level, which the problem instructs us to dummy-code as a non-metric variable.

The Statement about Outliers
While we do not need to be concerned about normality, linearity, and homogeneity of variance, we need to determine whether or not outliers are substantially reducing the classification accuracy of the model. To test for outliers, we run the binary logistic regression in SPSS and check for outliers. Next, we exclude the outliers and run the logistic regression a second time. We then compare the accuracy rates of the models with and without the outliers. If the accuracy rate of the model without outliers is higher than that of the model with outliers by 2% or more, we interpret the model excluding outliers.

Running the hierarchical binary logistic regression
Select the Regression | Binary Logistic… command from the Analyze menu.

Selecting the dependent variable
First, highlight the dependent variable grass in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.

Selecting the control independent variables
First, move the control independent variables stated in the problem (age and sex) to the Covariates list box. Second, click on the Next button to start a new block and add the predictor independent variables.

Selecting the predictor independent variables
Note that the block is now labeled 2 of 2. First, move the predictor independent variables stated in the problem (reliten and happy) to the Covariates list box. Second, click on the Categorical button to specify which variables should be dummy coded.

Declare the categorical variables - 1
Move the variables sex, reliten, and happy to the Categorical Covariates list box. SPSS assigns its default method for dummy-coding, Indicator coding, to each variable, placing the name of the coding scheme in parentheses after each variable name.

Declare the categorical variables - 2
We accept the default of using the Indicator method for dummy-coding variables. We will also accept the default of using the last category as the reference category for each variable. Click on the Continue button to close the dialog box.

Specifying the method for including variables
Since the problem calls for a hierarchical binary logistic regression, we accept the default Enter method for including variables in both blocks.

Adding the values for outliers to the data set - 1
SPSS will calculate the values for standardized residuals and save them to the data set so that we can check for outliers and remove the outliers easily if we need to run a model excluding outliers. Click on the Save… button to request the statistics that we want to save.

Adding the values for outliers to the data set - 2
First, mark the checkbox for Standardized residuals in the Residuals panel. Second, click on the Continue button to complete the specifications.

Requesting the output
Click on the OK button to request the output. While optional statistical output is available, we do not need to request any optional statistics.

Detecting the presence of outliers - 1
SPSS created a new variable, ZRE_1, which contains the standardized residual. If SPSS finds that the data set already contains a ZRE_1 variable, it will create ZRE_2. I find it easier to delete the ZRE_1 variable after each analysis rather than have multiple ZRE_ variables in the data set, requiring that I remember which one goes with which analysis.

To detect outliers, we will sort the ZRE_1 column twice: first, in ascending order to identify outliers with a standardized residual of -2.58 or less; second, in descending order to identify outliers with a standardized residual of +2.58 or greater. Click the right mouse button on the column header and select Sort Ascending from the pop-up menu.

After scrolling down past the cases with missing data (. in the ZRE_1 column), we see that we have one outlier that has a standardized residual of -2.58 or less.

To check for outliers with large positive standardized residuals, click the right mouse button on the column header and select Sort Descending from the pop-up menu.

After scrolling up to the top of the data set, we see that there are no outliers that have standardized residuals of +2.58 or more. Since we found an outlier, we will run the model excluding it and compare accuracy rates to determine which model we will interpret. Had there been no outliers, we would move on to the issue of sample size.

Running the model excluding outliers - 1
We will use a Select Cases command to exclude the outliers from the analysis.

First, in the Select Cases dialog box, mark the option button If condition is satisfied. Second, click on the If button to specify the condition.

To eliminate the outliers, we request that the cases that are not outliers be selected into the analysis. The formula, abs(ZRE_1) < 2.58, specifies that we should include cases if the absolute value of the standardized residual (ZRE_1) is less than 2.58. The abs() or absolute value function tells SPSS to ignore the sign of the value. After typing in the formula, click on the Continue button to close the dialog box.
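For readers who want to see the selection rule outside SPSS, the following Python sketch applies the same condition to a handful of made-up cases. The case ids and residual values are hypothetical (only the -2.78 echoes the outlier found above), None stands in for a missing residual, and keep_case is a name invented for this example.

```python
CUTOFF = 2.58  # standardized-residual threshold used in the slides

# Hypothetical rows; None represents a case with a missing residual.
cases = [
    {"id": 1, "ZRE_1": 0.42},
    {"id": 2, "ZRE_1": -2.78},
    {"id": 3, "ZRE_1": None},
    {"id": 4, "ZRE_1": 1.95},
]

def keep_case(case, cutoff=CUTOFF):
    """Mirror the condition abs(ZRE_1) < 2.58; missing residuals are dropped."""
    z = case["ZRE_1"]
    return z is not None and abs(z) < cutoff

selected = [c for c in cases if keep_case(c)]
print([c["id"] for c in selected])
```

As in SPSS, a case with a missing residual fails the condition, so outliers and cases with missing values are excluded together.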

SPSS displays the condition we entered on the Select Cases dialog box. Click on the OK button to close the dialog box.

SPSS indicates which cases are excluded by drawing a slash across the case number. Scrolling down in the data, we see that the outliers and cases with missing values are excluded.

To run the logistic regression excluding outliers, select Logistic Regression from the Dialog Recall menu.

The only change we will make is to clear the check box for saving standardized residuals. Click on the Save button to open the dialog box.

First, clear the check box for Standardized residuals. Second, click on the Continue button to close the dialog box.

Finally, click on the OK button to request the output.

Accuracy rate of the baseline model including all cases
Navigate to the Classification Table for the logistic regression with all cases. The accuracy rate for the model with all cases is 71.3%. To distinguish the two models, I often refer to the first one as the baseline model.

Accuracy rate of the revised model excluding outliers
Navigate to the Classification Table for the logistic regression excluding outliers. The accuracy rate for the model excluding outliers is 71.1%. To distinguish the two models, I often refer to the second one as the revised model.

Marking the statement for excluding outliers
In the initial logistic regression model, 1 case had a standardized residual of +2.58 or greater or -2.58 or lower; its standardized residual was -2.78. The classification accuracy of the model that excluded outliers (71.14%) was not greater by 2% or more than the classification accuracy for the model that included all cases (71.33%). The model including all cases should be interpreted. The check box is not marked because removing outliers did not increase the accuracy of the model. All of the remaining statements will be evaluated based on the output for the model that includes all cases.
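The decision rule can be written as a one-line comparison. The sketch below assumes that "2% or more" means two percentage points, which is how the accuracy rates are compared in this example; the function name is hypothetical.

```python
def model_to_interpret(baseline_acc, revised_acc, min_gain=2.0):
    """Apply the 2% rule to two accuracy rates given in percent."""
    if revised_acc >= baseline_acc + min_gain:
        return "revised"    # model excluding outliers
    return "baseline"       # model including all cases

# Accuracy rates reported for this problem
print(model_to_interpret(baseline_acc=71.33, revised_acc=71.14))
```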

The statement about multicollinearity and other numerical problems
Multicollinearity in the logistic regression solution is detected by examining the standard errors for the b coefficients. A standard error larger than 2.0 indicates numerical problems, such as multicollinearity among the independent variables, cells with a zero count for a dummy-coded independent variable because all of the subjects have the same value for the variable, and 'complete separation', whereby the two groups in the dependent variable can be perfectly separated by scores on one of the independent variables. Analyses that indicate numerical problems should not be interpreted.

Checking for multicollinearity
The standard errors for the variables included in the analysis were: the standard error for "age" [age] was .01, the standard error for survey respondents who said that overall they were not too happy was .92, the standard error for survey respondents who said that overall they were pretty happy was .47, the standard error for survey respondents who said they had no religious affiliation was .53, the standard error for survey respondents who said they had a somewhat strong religious affiliation was .70, the standard error for survey respondents who said they had a not very strong religious affiliation was .47 and the standard error for survey respondents who were male was .39.

Marking the statement about multicollinearity and other numerical problems
Since none of the independent variables in this analysis had a standard error larger than 2.0, we mark the check box to indicate there was no evidence of multicollinearity.
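The check amounts to scanning the standard errors for any value above 2.0. A minimal Python sketch using the standard errors listed for this problem (the dictionary labels are shorthand invented for the example):

```python
SE_LIMIT = 2.0  # rule of thumb used in the slides

# Standard errors reported for the terms in the model
standard_errors = {
    "age": 0.01,
    "happy: not too happy": 0.92,
    "happy: pretty happy": 0.47,
    "reliten: no affiliation": 0.53,
    "reliten: somewhat strong": 0.70,
    "reliten: not very strong": 0.47,
    "sex: male": 0.39,
}

# Collect any term whose standard error signals a numerical problem
flagged = {name: se for name, se in standard_errors.items() if se > SE_LIMIT}
print("numerical problems:", flagged if flagged else "none")
```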

The statement about sample size
Hosmer and Lemeshow, who wrote the widely used text on logistic regression, suggest that the sample size should be 10 cases for every independent variable.

The output for sample size
We find the number of cases included in the analysis in the Case Processing Summary. The 150 cases available for the analysis satisfied the sample size of 70 (10 cases per independent variable) recommended by Hosmer and Lemeshow for logistic regression.

Marking the statement for sample size
Since we satisfy the sample size requirement, we mark the check box.
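The required sample size of 70 follows from counting seven terms in the model. The sketch below assumes the count is made per coefficient, so each dummy-coded category counts as one independent variable (age, sex, three reliten dummies, two happy dummies); the function name is illustrative.

```python
def required_sample_size(n_terms, cases_per_term=10):
    """Minimum cases under the 10-cases-per-independent-variable guideline."""
    return n_terms * cases_per_term

# age + sex + 3 dummies for reliten + 2 dummies for happy
n_terms = 1 + 1 + 3 + 2
n_cases = 150  # from the Case Processing Summary

required = required_sample_size(n_terms)
print(f"required = {required}, available = {n_cases}, ok = {n_cases >= required}")
```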

The hierarchical relationship between the dependent and independent variables
In a hierarchical logistic regression, the presence of a relationship between the dependent variable and combination of independent variables entered after the control variables have been taken into account is based on the statistical significance of the block chi-square for the second block of variables in which the predictor independent variables are included.

The output for the hierarchical relationship
In this analysis, the probability of the block chi-square was less than or equal to the alpha of 0.05 (χ²(5, N = 150) = 26.87, p < .001). The null hypothesis that there is no difference between the model with only the control variables versus the model with the predictor independent variables was rejected. The existence of the hierarchical relationship between the predictor independent variables and the dependent variable was supported.

Marking the statement for hierarchical relationship
Since the hierarchical relationship was statistically significant, we mark the check box.

The statement about the relationship between age and legalization of marijuana
Having satisfied the criteria for the hierarchical relationship, we examine the findings for individual relationships with the dependent variable. If the overall relationship were not significant, we would not interpret the individual relationships. The first statement concerns the relationship between age and legalization of marijuana.

Output for the relationship between age and legalization of marijuana
The probability of the Wald statistic for the control independent variable "age" [age] (χ²(1, N = 150) = 1.83, p = .176) was greater than the level of significance of .05. The null hypothesis that the b coefficient for "age" [age] was equal to zero was not rejected. "Age" [age] does not have an impact on the odds that survey respondents supported the legalization of marijuana. The analysis does not support the relationship that 'For each unit increase in "age", survey respondents were 1.7% less likely to support the legalization of marijuana'.
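Each Wald test in this problem has one degree of freedom, so its p-value can be reproduced with the complementary error function. The sketch below is an illustration rather than SPSS's computation; it uses the Wald values quoted in the slides, which are rounded, so a recomputed p-value may differ from the SPSS output in the third decimal place. The dictionary labels are shorthand for the example.

```python
import math

ALPHA = 0.05

def wald_p_value(wald):
    """Upper-tail chi-square probability for a Wald statistic with 1 df."""
    return math.erfc(math.sqrt(wald / 2.0))

# Wald statistics quoted for the seven terms in the model
wald_stats = {
    "age": 1.83,
    "not too happy": 13.96,
    "pretty happy": 3.42,
    "no religious affiliation": 4.39,
    "somewhat strong affiliation": 0.67,
    "not very strong affiliation": 0.24,
    "male": 0.13,
}

for name, wald in wald_stats.items():
    p = wald_p_value(wald)
    verdict = "reject" if p <= ALPHA else "do not reject"
    print(f"{name}: Wald = {wald:.2f}, p = {p:.3f}, {verdict} H0: b = 0")

significant = {n for n, w in wald_stats.items() if wald_p_value(w) <= ALPHA}
```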

Marking the statement for relationship between age and legalization of marijuana
Since the relationship was not statistically significant, we do not mark the check box for the statement.

Statement for relationship between general happiness and legalization of marijuana
The next statement concerns the relationship between the dummy-coded variable for general happiness and legalization of marijuana.

Output for relationship between general happiness and legalization of marijuana
The probability of the Wald statistic for the predictor independent variable survey respondents who said that overall they were not too happy (χ²(1, N = 150) = 13.96, p < .001) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that overall they were not too happy was equal to zero was rejected. The value of Exp(B) for this variable was approximately 31.6, which implies that the odds were multiplied by approximately 31.6. The statement that 'Survey respondents who said that overall they were not too happy were approximately 31.6 times more likely to support the legalization of marijuana compared to those who said that overall they were very happy' is correct.

Marking the relationship between general happiness and legalization of marijuana
Since the relationship was statistically significant, and the statement that survey respondents who said that overall they were not too happy were approximately 31.6 times more likely to support the legalization of marijuana compared to those who said that overall they were very happy is correct, the statement is marked.

Statement for relationship between general happiness and legalization of marijuana
The next statement concerns the relationship between the dummy-coded variable for general happiness and legalization of marijuana.

Output for relationship between general happiness and legalization of marijuana
The probability of the Wald statistic for the predictor independent variable survey respondents who said that overall they were pretty happy (χ²(1, N = 150) = 3.42, p = .064) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said that overall they were pretty happy was equal to zero was not rejected. The variable for survey respondents who said that overall they were pretty happy does not have an impact on the odds that survey respondents supported the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who said that overall they were pretty happy were approximately two and a quarter times more likely to support the legalization of marijuana compared to those who said that overall they were very happy'.

Marking the relationship between general happiness and legalization of marijuana
Since the relationship was not statistically significant, the check box is not marked.

Statement for relationship between religious affiliation and legalization of marijuana
The next statement concerns the relationship between the dummy-coded variable for religious affiliation and legalization of marijuana.

Output for relationship between religious affiliation and legalization of marijuana
The probability of the Wald statistic for the predictor independent variable survey respondents who said they had no religious affiliation (χ²(1, N = 150) = 4.39, p = .036) was less than or equal to the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said they had no religious affiliation was equal to zero was rejected. The value of Exp(B) for this variable was approximately 3, which implies that the odds increased by approximately three times. The statement that 'Survey respondents who said they had no religious affiliation were approximately three times more likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation' is correct.

Marking the relationship between religious affiliation and legalization of marijuana
Since the relationship was statistically significant, and the statement that survey respondents who said they had no religious affiliation were approximately three times more likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation is correct, the statement is marked.

The next statement concerns the relationship between the dummy-coded variable for a somewhat strong religious affiliation and legalization of marijuana.

Output for the relationship between religious affiliation and legalization of marijuana
The probability of the Wald statistic for the predictor independent variable survey respondents who said they had a somewhat strong religious affiliation (χ²(1, N = 150) = .67, p = .414) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said they had a somewhat strong religious affiliation was equal to zero was not rejected. The variable for survey respondents who said they had a somewhat strong religious affiliation does not have an impact on the odds that survey respondents support the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who said they had a somewhat strong religious affiliation were 43.7% less likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation'.

The next statement concerns the relationship between the dummy-coded variable for a not very strong religious affiliation and legalization of marijuana.

Output for the relationship between religious affiliation and legalization of marijuana
The probability of the Wald statistic for the predictor independent variable survey respondents who said they had a not very strong religious affiliation (χ²(1, N = 150) = .24, p = .626) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who said they had a not very strong religious affiliation was equal to zero was not rejected. The variable for survey respondents who said they had a not very strong religious affiliation does not have an impact on the odds that survey respondents support the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who said they had a not very strong religious affiliation were 25.8% more likely to support the legalization of marijuana compared to those who said they had a strong religious affiliation'.

Since the relationship was not statistically significant, the check box is not marked.

The statement for the relationship between sex and legalization of marijuana
The next statement concerns the relationship between sex and legalization of marijuana.

Output for the relationship between sex and legalization of marijuana
The probability of the Wald statistic for the control independent variable survey respondents who were male (χ²(1, N = 150) = .13, p = .719) was greater than the level of significance of .05. The null hypothesis that the b coefficient for survey respondents who were male was equal to zero was not rejected. The variable for survey respondents who were male does not have an impact on the odds that survey respondents support the legalization of marijuana. The analysis does not support the relationship that 'Survey respondents who were male were 13.1% less likely to support the legalization of marijuana compared to those who were female'.
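The percentages quoted in these statements come from converting Exp(B) into a percent change in the odds: (Exp(B) - 1) x 100. The sketch below illustrates the conversion. The Exp(B) values in it are back-calculated from the percentages in the statements (for example, 13.1% less corresponds to 0.869) rather than read from the SPSS output, and the function name is hypothetical.

```python
def percent_change_in_odds(exp_b):
    """Convert an odds ratio, Exp(B), to a percent change in the odds."""
    return (exp_b - 1.0) * 100.0

# Exp(B) values implied by the statements (assumed, not taken from the output)
implied_exp_b = {
    "age (per year)": 0.983,
    "somewhat strong affiliation": 0.563,
    "not very strong affiliation": 1.258,
    "male": 0.869,
}

for name, exp_b in implied_exp_b.items():
    change = percent_change_in_odds(exp_b)
    direction = "increase" if change > 0 else "decrease"
    print(f"{name}: Exp(B) = {exp_b:.3f}, {abs(change):.1f}% {direction} in odds")
```

An Exp(B) below 1 reads as a decrease in the odds and a value above 1 as an increase; for large values such as 31.6 it is more natural to report the multiple itself.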

Marking the statement for the relationship between sex and legalization of marijuana
Since the relationship was not statistically significant, the check box is not marked.

Statement about the usefulness of the model based on classification accuracy
The final statement concerns the usefulness of the logistic regression model. The independent variables could be characterized as useful predictors distinguishing survey respondents who support the legalization of marijuana from survey respondents who do not if the classification accuracy rate was substantially higher than the accuracy attainable by chance alone. Operationally, the classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.

Computing proportional by-chance accuracy rate
At Block 0 with no independent variables in the model, all of the cases are predicted to be members of the modal group, 1=Legal in this example. The proportion in the largest group is 63.3% or .633. The proportion in the other group is 1.0 – .633 = .367. The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (.633² + .367² = .536).
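The by chance accuracy rate can be reproduced from the group counts. The sketch below assumes group sizes of 95 and 55, which are inferred from 63.3% of the 150 cases and are not quoted in the slides; with the unrounded proportions the sum of squares rounds to .536, whereas squaring the rounded .633 and .367 gives a value about .001 lower. The function name is illustrative.

```python
def proportional_by_chance_accuracy(group_counts):
    """Sum of squared group proportions from the Step 0 classification table."""
    total = sum(group_counts)
    return sum((count / total) ** 2 for count in group_counts)

# Assumed counts: 95 of 150 cases in the modal group (63.3%), 55 in the other
group_counts = [95, 55]
by_chance = proportional_by_chance_accuracy(group_counts)
print(f"proportional by chance accuracy = {by_chance:.3f}")
```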

Output for the usefulness of the model based on classification accuracy
To be characterized as a useful model, the accuracy rate should be 25% higher than the by chance accuracy rate. The by chance accuracy criterion is computed by multiplying the by chance accuracy rate of .536 by 1.25, or 1.25 x .536 = .669 (66.9%). The classification accuracy rate computed by SPSS was 71.3%, which was greater than or equal to the proportional by chance accuracy criterion of 66.9%. The criterion for classification accuracy is satisfied.
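The comparison can be sketched as follows. As an assumption, the by chance rate is recomputed from group counts of 95 and 55 (inferred from 63.3% of 150 cases) so that the unrounded value is used; this appears to be why the criterion comes out at 66.9% rather than the 67.0% that multiplying the rounded .536 would give. The function names are hypothetical.

```python
def accuracy_criterion(by_chance_rate, margin=1.25):
    """Minimum accuracy rate for a model to be characterized as useful."""
    return margin * by_chance_rate

def is_useful(model_accuracy, by_chance_rate, margin=1.25):
    """True when the model's accuracy meets or exceeds the criterion."""
    return model_accuracy >= accuracy_criterion(by_chance_rate, margin)

# Unrounded by chance rate from the assumed group counts (95 and 55 of 150)
by_chance_rate = (95 / 150) ** 2 + (55 / 150) ** 2
model_accuracy = 0.713  # classification accuracy reported by SPSS

criterion = accuracy_criterion(by_chance_rate)
print(f"criterion = {criterion:.3f}, useful = {is_useful(model_accuracy, by_chance_rate)}")
```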

Marking the statement for usefulness of the model
Since the criterion for classification accuracy was satisfied, the check box is marked.

Hierarchical Binary Logistic Regression: Level of Measurement
Is the level of measurement ok? If no, do not mark the check box for level of measurement, mark "Inappropriate application of the statistic", and stop. If yes, mark the check box for level of measurement. Is an ordinal level variable treated as metric? If yes, consider the limitation in the discussion of findings. If no, continue.

Standard Binary Logistic Regression: Exclude Outliers
Run the baseline binary logistic regression, including all cases and requesting standardized residuals. Run the revised binary logistic regression, excluding outliers (standardized residuals of 2.58 or greater in absolute value). Is the accuracy rate for the revised model >= the accuracy rate for the baseline model + 2%? If yes, mark the check box for excluding outliers and interpret the revised model. If no, do not mark the check box for excluding outliers and interpret the baseline model.

Hierarchical Binary Logistic Regression: Multicollinearity and Sample Size
Are there multicollinearity or numerical problems (S.E. > 2.0)? If yes, do not mark the check box for no multicollinearity, and stop. If no, mark the check box for no multicollinearity. Is the sample size adequate (number of IVs x 10)? If no, do not mark the check box for sample size and consider the limitation in the discussion of findings. If yes, mark the check box for sample size.

Hierarchical Binary Logistic Regression: Hierarchical Relationship
Is the probability of the Block Chi-square for Block 2 ≤ α? If no, do not mark the check box for hierarchical relationship, and stop. If yes, mark the check box for hierarchical relationship. The biggest distinction between hierarchical and standard models is our focus on the contribution of the predictors in addition to the controls.

Hierarchical Binary Logistic Regression: Individual Relationships
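Is the individual relationship statistically significant (Wald Sig ≤ α)? If no, do not mark the check box for the individual relationship. If yes, is the interpretation of the direction and strength of the relationship correct? If no, do not mark the check box for the individual relationship; if yes, mark it. Are there additional individual relationships to interpret? If yes, repeat these steps for the next relationship; if no, continue.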

Hierarchical Binary Logistic Regression: Classification Accuracy
Is the classification accuracy > 1.25 x the by chance accuracy rate? If no, do not mark the check box for classification accuracy. If yes, mark the check box for classification accuracy.
