Slide 1 The SPSS Sample Problem
To demonstrate multinomial logistic regression, we will work the sample problem for multinomial logistic regression in SPSS Regression Models 10.0. The description of the problem found on page 66 states that the 1996 General Social Survey asked people who they voted for in 1992. Demographic variables from the GSS, such as sex, age, and education, can be used to identify the relationships between voter demographics and voter preference. The data for this problem are in voter.sav.
Multinomial Logistic Regression
Slide 2 Stage 1: Define the Research Problem
In this stage, the following issues are addressed:
- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables
Relationship to be analyzed
The goal of this analysis is to examine the relationship between presidential choice in 1992 and sex, age, and education.
Slide 3 Specifying the dependent and independent variables
The dependent variable is pres92 ‘Vote for Clinton, Bush, Perot’. It has three categories: 1 is a vote for Bush, 2 is a vote for Perot, and 3 is a vote for Clinton. By default, SPSS uses the highest numbered choice as the reference category, so it will solve the problem by contrasting votes for Bush to votes for Clinton, and votes for Perot to votes for Clinton.
The independent variables which we will use in this analysis are:
- AGE ‘Age of respondent’
- EDUC ‘Highest year of school completed’
- DEGREE ‘Respondent’s Highest Degree’
- SEX ‘Respondent’s Sex’
Method for including independent variables
The only method for including variables in multinomial logistic regression in SPSS is direct entry of all variables.
Slide 4 Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:
- Missing data analysis
- Minimum sample size requirement: cases per independent variable
Missing data analysis
Only 2 of the 1847 cases have any missing data. Since the number of cases with missing data is so small, it cannot produce a missing data process that is disruptive to the analysis. We will bypass any missing data analysis.
Minimum sample size requirement: cases per independent variable
The data set has 1845 complete cases and 4 independent variables, for a ratio of about 461 to 1, well in excess of the required number of cases per independent variable.
Slide 5 Stage 2: Develop the Analysis Plan: Measurement Issues
In this stage, the following issues are addressed:
- Incorporating nonmetric data with dummy variables
- Representing curvilinear effects with polynomials
- Representing interaction or moderator effects
Incorporating Nonmetric Data with Dummy Variables
It is not necessary to create dummy variables for nonmetric data since SPSS will do this automatically when we specify that a variable is a “factor” in the model.
Representing Curvilinear Effects with Polynomials
We do not have any evidence of curvilinear effects at this point in the analysis, though the SPSS text for this problem points out that there is a curvilinear relationship between education and voting preference, which led them to create the variable DEGREE ‘Respondent’s Highest Degree’. Democrats (i.e. Clinton voters) are favored both by those with little formal education and by those who have advanced degrees.
Representing Interaction or Moderator Effects
We do not have any evidence at this point in the analysis that we should add interaction or moderator variables. The SPSS procedure makes it very easy to add interaction terms.
Slide 6 Stage 3: Evaluate Underlying Assumptions
In this stage, the following issues are addressed:
- Nonmetric dependent variable with two or more groups
- Metric or nonmetric independent variables
Nonmetric dependent variable with two or more groups
The dependent variable pres92 ‘Vote for Clinton, Bush, Perot’ has three categories.
Metric or nonmetric independent variables
AGE and EDUC, as metric variables, will be entered as covariates in the model. SEX and DEGREE, as nonmetric variables, will be entered as factors.
Slide 7 Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Model Estimation
In this stage, the following issue is addressed:
- Compute the logistic regression model
Compute the logistic regression
The steps to obtain a logistic regression analysis are detailed on the following screens.
Slide 14 Stage 4: Estimation of Logistic Regression and Assessing Overall Fit: Assessing Model Fit
In this stage, the following issues are addressed:
- Significance test of the model log likelihood (change in -2LL)
- Measures analogous to R²: Cox and Snell R² and Nagelkerke R²
- Classification matrices as a measure of model accuracy
- Check for numerical problems
- Presence of outliers
Slide 15 Significance test of the model log likelihood
The initial log likelihood function (-2 log likelihood, or -2LL) is a statistical measure like the total sum of squares in regression. If our independent variables have a relationship to the dependent variable, we will improve our ability to predict the dependent variable accurately, and the -2LL measure will decrease.
The initial -2LL value is a measure of a model with no independent variables, i.e. only a constant or intercept. The final -2LL value is the measure computed after all of the independent variables have been entered into the logistic regression. The difference between these two measures is the model chi-square value that is tested for statistical significance. This test is analogous to the F-test for R² or change in R² in multiple regression, which tests whether or not the improvement in the model associated with the additional variables is statistically significant.
In this problem the model chi-square value is statistically significant, so we conclude that there is a significant relationship between the dependent variable and the set of independent variables.
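As a rough illustration of the test described on this slide, the sketch below computes the model chi-square as the drop in -2LL and then its p-value. The two -2LL values are hypothetical placeholders, not figures from the SPSS output. The 14 degrees of freedom are an assumption: 7 estimated coefficients (AGE, EDUC, four DEGREE dummies, one SEX dummy) in each of the two equations. The closed-form tail probability used here holds only for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    # P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    total = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * total


# Hypothetical -2LL values (placeholders, not taken from the SPSS output).
initial_neg2ll = 2000.0  # intercept-only model
final_neg2ll = 1900.0    # model with all independent variables entered

model_chi_square = initial_neg2ll - final_neg2ll
df = 2 * 7  # assumed: 2 equations x 7 coefficients each
p_value = chi2_sf_even_df(model_chi_square, df)

print(f"model chi-square = {model_chi_square:.1f}, df = {df}, p = {p_value:.3g}")
```

A small p-value here plays the same role as the significance reported by SPSS for the model chi-square.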
Slide 16 Measures Analogous to R²
The next SPSS outputs indicate the strength of the relationship between the dependent variable and the independent variables, analogous to the R² measures in multiple regression.
The Cox and Snell R² measure operates like R², with higher values indicating greater model fit. However, this measure is limited in that it cannot reach the maximum value of 1, so Nagelkerke proposed a modification that ranges from 0 to 1. We will rely upon Nagelkerke's measure as indicating the strength of the relationship.
If we applied our interpretive criteria to the Nagelkerke R², we would characterize the relationship as weak.
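The relationship between the two measures can be sketched as follows. The formulas are the standard definitions expressed in terms of -2LL; the -2LL values are hypothetical placeholders, and the sketch assumes they are full-sample values computed over individual cases, which may differ from the figures a given SPSS table reports.

```python
import math


def cox_snell_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell R^2 = 1 - exp(-(model chi-square) / n)."""
    return 1.0 - math.exp(-(neg2ll_null - neg2ll_model) / n)


def nagelkerke_r2(neg2ll_null, neg2ll_model, n):
    """Cox and Snell R^2 divided by its maximum attainable value."""
    max_r2 = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell_r2(neg2ll_null, neg2ll_model, n) / max_r2


# Hypothetical values (placeholders, not taken from the SPSS output).
n = 1845
neg2ll_null = 3700.0
neg2ll_model = 3550.0

cs = cox_snell_r2(neg2ll_null, neg2ll_model, n)
nk = nagelkerke_r2(neg2ll_null, neg2ll_model, n)
print(f"Cox and Snell R^2 = {cs:.3f}, Nagelkerke R^2 = {nk:.3f}")
```

Because the divisor is less than 1, the Nagelkerke value is always at least as large as the Cox and Snell value for the same model.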
Slide 17 The Classification Matrix as a Measure of Model Accuracy
The classification matrix in multinomial logistic regression is used to evaluate the accuracy of the model. If the predicted and actual group memberships are the same, i.e. 1 and 1, 2 and 2, or 3 and 3, then the prediction is accurate for that case. If predicted group membership and actual group membership are different, the model "misses" for that case.
The overall percentage of accurate predictions (49.9% in this case) is the measure of a model that I rely on most heavily, because it has a meaning that is readily communicated, i.e. the percentage of cases for which our model predicts accurately.
To evaluate the accuracy of the model, we compute the proportional by chance accuracy rate and, if appropriate, the maximum by chance accuracy rate.
The proportional by chance accuracy rate is the sum of the squared group proportions: 0.358² + 0.150² + 0.492² = 0.393. A 25% increase over the proportional by chance accuracy rate would equal 0.491. Our model accuracy rate of 49.9% meets this criterion.
Since one of our groups (voters for Clinton) contains 49.2% of the cases, we should also apply the maximum by chance criterion. A 25% increase over the largest group would equal 0.615. Our model accuracy rate of 49.9% fails to meet this criterion. The usefulness of the relationship among the demographic variables and voter preference is questionable.
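The two by-chance criteria on this slide can be checked with a few lines of arithmetic. The Bush and Clinton proportions and the model accuracy rate come from the slide; the Perot proportion is an assumption, taken as the remainder after the other two groups.

```python
# Group proportions. Bush (0.358) and Clinton (0.492) are stated on the slide;
# Perot is assumed to be the remainder: 1 - 0.358 - 0.492 = 0.150.
proportions = {"Bush": 0.358, "Perot": 0.150, "Clinton": 0.492}
model_accuracy = 0.499

# Proportional by chance accuracy: sum of squared group proportions.
proportional_chance = sum(p ** 2 for p in proportions.values())
proportional_criterion = 1.25 * proportional_chance

# Maximum by chance accuracy: proportion of cases in the largest group.
maximum_chance = max(proportions.values())
maximum_criterion = 1.25 * maximum_chance

meets_proportional = model_accuracy >= proportional_criterion
meets_maximum = model_accuracy >= maximum_criterion

print(f"proportional criterion = {proportional_criterion:.3f}, met: {meets_proportional}")
print(f"maximum criterion      = {maximum_criterion:.3f}, met: {meets_maximum}")
```

The model clears the proportional criterion only narrowly and falls well short of the maximum criterion, which is why the slide calls the usefulness of the relationship questionable.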
Slide 18 Check for Numerical Problems
There are several numerical problems that can occur in logistic regression that are not detected by SPSS or other statistical packages:
- multicollinearity among the independent variables,
- zero cells for a dummy-coded independent variable because all of the subjects have the same value for the variable, and
- "complete separation", whereby the groups in the dependent variable can be perfectly separated by scores on one of the independent variables.
All of these problems produce large standard errors (over 2) for the variables included in the analysis and very often produce very large B coefficients as well. If we encounter large standard errors for the predictor variables, we should examine frequency tables, one-way ANOVAs, and correlations for the variables involved to try to identify the source of the problem.
None of the standard errors or B coefficients are excessively large, so there is no evidence of a numerical problem with this analysis.
Slide 19 Presence of outliers
Multinomial logistic regression does not provide any output for detecting outliers. However, if we are concerned with outliers, we can identify outliers on the combination of independent variables by computing Mahalanobis distance in the SPSS regression procedure.
Slide 20 Stage 5: Interpret the Results
In this stage, the following issues are addressed:
- Identifying the statistically significant predictor variables
- Direction of relationship and contribution to dependent variable
Slide 21 Identifying the statistically significant predictor variables - 1
There are two outputs related to the statistical significance of individual predictor variables: the Likelihood Ratio Tests and the Parameter Estimates. The Likelihood Ratio Tests indicate the contribution of each variable to the overall relationship between the dependent variable and the individual independent variables. The Parameter Estimates focus on the role of each independent variable in differentiating between the groups specified by the dependent variable.
Each likelihood ratio test is a hypothesis test that the variable contributes to the reduction in error measured by the -2 log likelihood statistic. In this model, the variables AGE, DEGREE, and SEX are all significant contributors to explaining differences in voting preference.
Slide 22 Identifying the statistically significant predictor variables - 2
The two equations in the table of Parameter Estimates are labeled by the group they contrast to the reference group. The first equation is labeled "1 Bush", and the second equation is labeled "2 Perot". The coefficients for each logistic regression equation are found in the column labeled B. The hypothesis that the coefficient is not zero, i.e. that it changes the odds of the dependent variable event, is tested with the Wald statistic, instead of the t-test used for the individual B coefficients in the multiple regression equation.
The variables that have a statistically significant relationship to distinguishing voters for Bush from voters for Clinton in the first logistic regression equation were DEGREE=3 (bachelor's degree) and SEX=1 (male).
The variables that have a statistically significant relationship to distinguishing voters for Perot from voters for Clinton were AGE, DEGREE=2 (junior college degree), and SEX=1 (male).
Slide 23 Direction of relationship and contribution to dependent variable - 1
Interpretation of the independent variables is aided by the "Exp (B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows:
- Having a bachelor's degree rather than an advanced degree increased the odds that a voter would choose Bush over Clinton by about 50%.
- Being male increased the odds that a voter would choose Bush over Clinton by almost 60%.
Slide 24 Direction of relationship and contribution to dependent variable - 2
Interpretation of the independent variables is aided by the "Exp (B)" column, which contains the odds ratio for each independent variable. We can state the relationships as follows:
- Each one-year increase in age reduced the odds that a voter would choose Perot over Clinton by about 3%.
- Having a junior college degree rather than an advanced degree made the odds of choosing Perot over Clinton about 2.3 times as large.
- Being male doubled the odds that a voter would choose Perot over Clinton.
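The statements on these slides all come from the same conversion: Exp(B) is the antilog of the B coefficient, and the percent change in the odds is Exp(B) minus 1. The sketch below shows the arithmetic. The B values are hypothetical, chosen only to resemble the magnitudes described above, and are not copied from the SPSS output.

```python
import math


def odds_ratio(b):
    """Exp(B): multiplicative change in the odds for a one-unit increase."""
    return math.exp(b)


def percent_change_in_odds(b):
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (math.exp(b) - 1.0) * 100.0


# Hypothetical B coefficients (illustrative only, not from the SPSS output).
examples = {"AGE": -0.03, "SEX=1 (male)": 0.69}

for name, b in examples.items():
    print(f"{name}: Exp(B) = {odds_ratio(b):.3f}, "
          f"change in odds = {percent_change_in_odds(b):+.1f}%")
```

A negative B gives an odds ratio below 1 (a decrease in the odds), and a B near 0.69 gives an odds ratio near 2 (a doubling).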
Slide 25 Stage 6: Validate The Model
The SPSS multinomial logistic procedure does not include the ability to select a subset of cases based on the value of a variable, so we cannot use our usual strategy for conducting a validation analysis. We can, however, accomplish the same results with a step-by-step series of syntax commands, as will be shown on the following screens. We cannot run all of the syntax commands at one time because one of the steps requires us to manually type the coefficients from the SPSS output into the syntax file so that we can calculate predicted values for the logistic regression equations.
In order to understand the steps that we will follow, we need to understand how we translate scores on the logistic regression equations into classification in a group. The multinomial logistic regression problem for three groups is solved by contrasting two of the groups with a reference group. In this problem, the reference group is Clinton voters. The classification score for the reference group is 0, just as the code for any reference group for dummy coded variables is 0.
The first logistic regression equation is used to compute a logistic regression score that tests whether the subject is more likely to be a Bush voter than a Clinton voter. Similarly, the second logistic regression equation is used to test whether the subject is more likely to be a Perot voter than a Clinton voter.
Slide 26 Stage 6: Validate The Model (continued)
The classification problem, thus, involves the comparison of three scores, one associated with each of the groups. The first score (which we will label g1) is associated with voting for Bush. The second score (which we will label g2) is associated with voting for Perot. The third score (which we will label g3) is associated with voting for Clinton.
Calculating g1 and g2 requires substituting the values of the variables for each subject into the logistic regression equations. The score g3 is always 0.
The scores g1, g2, and g3 are estimates of the log odds of belonging to each group relative to the reference group. To convert the scores into a probability of group membership, we convert each score into its antilog equivalent and divide by the sum of the three antilog equivalents.
To estimate group membership, we compare the three probabilities, and estimate that the subject is a member of the group associated with the highest probability.
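The conversion described on this slide can be sketched in a few lines. The function names and the example scores are hypothetical; only the rule itself (take the antilog of each score, divide by the sum, pick the largest probability) comes from the slide.

```python
import math


def group_probabilities(g1, g2, g3=0.0):
    """Convert the three scores into probabilities of group membership."""
    antilogs = [math.exp(g) for g in (g1, g2, g3)]
    total = sum(antilogs)
    return [a / total for a in antilogs]


def predicted_group(probs):
    """Return 1, 2, or 3 for the group with the highest probability.

    In an exact tie, the lowest-numbered of the tied groups is returned.
    """
    return probs.index(max(probs)) + 1


# Hypothetical scores for one subject; g3 (the reference group) is always 0.
probs = group_probabilities(g1=-0.4, g2=-1.2)
print([round(p, 3) for p in probs], "-> predicted group", predicted_group(probs))
```

Because g3 is fixed at 0, a subject is classified as a Clinton voter whenever both g1 and g2 are negative.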
Slide 27 Computing the First Validation Analysis
The first step in our validation analysis is to create the split variable.
* Compute the split variable for the learning and validation samples.
SET SEED
COMPUTE split = uniform(1) >
EXECUTE.
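The idea behind the split variable, a reproducible random assignment of each case to one of two samples, can be sketched outside SPSS as follows. The seed value 12345 and the 0.5 cutoff are assumptions made for this illustration, as is the function name.

```python
import random


def make_split(n_cases, seed, cutoff=0.5):
    """Assign each case 0 or 1 by comparing a uniform draw to the cutoff."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    return [int(rng.random() > cutoff) for _ in range(n_cases)]


# Hypothetical seed and cutoff; 1845 is the number of cases in the analysis.
split = make_split(n_cases=1845, seed=12345)
n_split_1 = sum(split)
n_split_0 = len(split) - n_split_1
print(f"split = 0: {n_split_0} cases, split = 1: {n_split_1} cases")
```

Setting the seed matters because the second validation analysis must reuse exactly the same assignment of cases.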
Slide 28 Creating the Multinomial Logistic Regression for the First Half of the Data
Next, we run the multinomial logistic regression on the first half of the sample, where split = 0.
* Select the cases to include in the first validation analysis.
USE ALL.
COMPUTE filter_$=(split=0).
FILTER BY filter_$.
EXECUTE.
* Run the multinomial logistic regression for these cases.
NOMREG pres92 BY degree sex WITH age educ
  /CRITERIA = CIN(95) DELTA(0) MXITER(100) MXSTEP(5) LCONVERGE(0) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
  /MODEL
  /INTERCEPT = INCLUDE
  /PRINT = CLASSTABLE PARAMETER SUMMARY LRT.
Slide 29 Entering the Logistic Regression Coefficients into SPSS
To compute the classification scores for the logistic regression equations, we need to enter the B coefficients for each equation into SPSS using compute commands. For the first set of coefficients, we will use the letter A, followed by a number. For the second set of coefficients, we will use the letter B, followed by a number. The complete set of compute commands follows.
Slide 30 Create the coefficients in SPSS
* Assign the coefficients from the model just run to variables.
compute A0 =
compute A1 =
compute A2 =
compute A3 =
compute A4 =
compute A5 =
compute A6 =
compute A7 =
compute B0 =
compute B1 =
compute B2 =
compute B3 =
compute B4 =
compute B5 =
compute B6 =
compute B7 =
execute.
Slide 31 Entering the Logistic Regression Equations into SPSS
Before we can enter the logistic regression equations, we need to explicitly create the dummy coded variables which the logistic regression procedure created for the variables that we specified as factors.
* Create the dummy coded variables which SPSS created.
* Use a logical assignment to code the variables as 0 or 1.
compute degree0 = (degree = 0).
compute degree1 = (degree = 1).
compute degree2 = (degree = 2).
compute degree3 = (degree = 3).
compute degree4 = (degree = 4).
compute sex1 = (sex = 1).
execute.
The logistic regression equations can be entered as compute statements. We will also enter the zero value for the third group, g3.
compute g1 = A0 + A1 * AGE + A2 * EDUC + A3 * DEGREE0 + A4 * DEGREE1 + A5 * DEGREE2 + A6 * DEGREE3 + A7 * SEX1.
compute g2 = B0 + B1 * AGE + B2 * EDUC + B3 * DEGREE0 + B4 * DEGREE1 + B5 * DEGREE2 + B6 * DEGREE3 + B7 * SEX1.
compute g3 = 0.
execute.
When these statements are run in SPSS, the scores for g1, g2, and g3 will be added to the dataset.
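The same two steps, dummy coding and evaluating one equation, can be sketched for a single subject as follows. The coefficient values are hypothetical placeholders in the order A0 through A7, and the function names are invented for this illustration. The sketch assumes, following the SPSS default of using the highest category as the reference, that DEGREE = 4 and SEX = 2 contribute nothing to the score.

```python
def dummy_code(degree, sex):
    """Return 0/1 indicators; degree = 4 and sex = 2 are reference categories."""
    indicators = {f"degree{k}": int(degree == k) for k in range(4)}
    indicators["sex1"] = int(sex == 1)
    return indicators


def score(coefs, age, educ, degree, sex):
    """Evaluate one logistic regression equation for one subject."""
    d = dummy_code(degree, sex)
    return (coefs[0]
            + coefs[1] * age
            + coefs[2] * educ
            + coefs[3] * d["degree0"]
            + coefs[4] * d["degree1"]
            + coefs[5] * d["degree2"]
            + coefs[6] * d["degree3"]
            + coefs[7] * d["sex1"])


# Hypothetical coefficients A0..A7 (placeholders, not from the SPSS output).
A = [0.5, -0.01, 0.02, 0.1, 0.2, 0.3, 0.4, 0.45]

# A 40-year-old man with 16 years of school and a bachelor's degree.
g1 = score(A, age=40, educ=16, degree=3, sex=1)
print(f"g1 = {g1:.2f}")
```

Evaluating the same function with the B coefficients would give g2, and g3 stays at 0.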
Slide 32 Converting Classification Scores into Predicted Group Membership
We convert the three scores into odds using the EXP function. When we divide each exponentiated score by the sum of the three exponentiated scores, we end up with a probability of membership in each group.
* Compute the probabilities of membership in each group.
compute p1 = exp(g1) / (exp(g1) + exp(g2) + exp(g3)).
compute p2 = exp(g2) / (exp(g1) + exp(g2) + exp(g3)).
compute p3 = exp(g3) / (exp(g1) + exp(g2) + exp(g3)).
execute.
The following if statements compare probabilities to predict group membership.
* Translate the probabilities into predicted group membership.
if (p1 > p2 and p1 > p3) predgrp = 1.
if (p2 > p1 and p2 > p3) predgrp = 2.
if (p3 > p1 and p3 > p2) predgrp = 3.
execute.
When these statements are run in SPSS, the dataset will have both actual and predicted membership for the first validation sample.
Slide 33 The Classification Table
To produce a classification table for the validation sample, we change the filter criteria to include cases where split = 1, and create a contingency table of predicted voting versus actual voting.
USE ALL.
COMPUTE filter_$=(split=1).
FILTER BY filter_$.
EXECUTE.
CROSSTABS
  /TABLES=pres92 BY predgrp
  /FORMAT= AVALUE TABLES
  /CELLS= COUNT TOTAL.
These commands produce the classification table. The classification accuracy rate is computed by adding the total percents for the cells where predicted voting coincides with actual voting behavior; these percents (6.3% among them) sum to 48.5%. We enter this information in the validation table.
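Adding the total percents on the diagonal is equivalent to dividing the number of correctly classified cases by the number of cases in the table. A minimal sketch, using hypothetical counts rather than the counts from the SPSS output:

```python
def accuracy_rate(table):
    """table[i][j] = cases with actual group i + 1 and predicted group j + 1."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return correct / total


# Hypothetical crosstab counts (rows: actual vote, columns: predicted vote).
table = [
    [150, 10, 170],  # actual Bush
    [40, 20, 80],    # actual Perot
    [120, 10, 300],  # actual Clinton
]

print(f"classification accuracy = {accuracy_rate(table):.1%}")
```

Only the diagonal cells count as hits; every off-diagonal cell is a case the model "misses".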
Slide 34 Computing the Second Validation Analysis
The second validation analysis follows the same series of commands, except that we build the model with the cases where split = 1 and validate the model on cases where split = 0. The results from my calculations have been entered into the validation table.
Slide 35 Generalizability of the Multinomial Logistic Regression Model
We can summarize the results of the validation analyses in the following table.
Measure | Full Model | Split = 0 | Split = 1
Model Chi-Square | | |
Nagelkerke R² | | |
Accuracy Rate for Learning Sample | 49.9% | 48.8% | 50.6%
Accuracy Rate for Validation Sample | | 48.5% | 46.9%
Significant Coefficients (p < 0.05), Equation 1 | DEGREE = 3, SEX = 1 | SEX = 1 | DEGREE = 3, SEX = 1
Significant Coefficients (p < 0.05), Equation 2 | AGE, DEGREE = 2, SEX = 1 | AGE, SEX = 1 | AGE, DEGREE = 2, SEX = 1
From the validation table, we see that the original model is verified by the accuracy rates for the validation analyses. SEX and AGE would appear to be the more reliable predictors of voting behavior. However, the relationship is weak and falls short of the classification accuracy criteria for a useful model.