Simple Logistic Regression An introduction to PROC FREQ and PROC LOGISTIC
Introduction to Logistic Regression Logistic regression is used when the outcome variable of interest is categorical rather than continuous. Examples include: death vs. no death, recovery vs. no recovery, obese vs. not obese, etc. All of the examples you will see in this class have binary outcomes, meaning there are only two possible outcomes. Simple logistic regression has only one predictor variable. When that predictor is also binary, the result is one you may already know under a different name: the odds ratio.
Simple Logistic Regression: An example Imagine you are interested in investigating whether there is a relationship between race and party identification. Race (Black or White) is the independent variable, and Party Identification (Democrat or Republican) is the dependent variable. Consider the following table (example from Agresti, A., Categorical Data Analysis, 2nd ed.):
Race x Party Identification
         Democrat   Republican
Black         103           11
White         341          405
The odds ratio for being a Democrat, Black vs. White, is: OR (odds ratio) = (103/11)/(341/405) = (103x405)/(341x11) = 11.12. Blacks have 11.12 times the odds of being a Democrat compared with Whites. The odds ratio for being a Republican, Black vs. White, is: (11/103)/(405/341) = (11x341)/(405x103) = 0.09. Blacks have 91% (1 - 0.09) lower odds of being a Republican than Whites.
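The hand calculation above can be cross-checked outside SAS. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative and only the four cell counts come from the table.

```python
# Cell counts from the Race x Party Identification table
black_dem, black_rep = 103, 11
white_dem, white_rep = 341, 405

# Odds of being a Democrat within each race
odds_black = black_dem / black_rep
odds_white = white_dem / white_rep

# Odds ratio for being a Democrat, Black vs. White
or_democrat = odds_black / odds_white

# Odds ratio for being a Republican, Black vs. White (the reciprocal)
or_republican = (black_rep / black_dem) / (white_rep / white_dem)

print(round(or_democrat, 2))    # 11.12
print(round(or_republican, 2))  # 0.09
```

The two odds ratios are reciprocals of each other, which is why switching the outcome category turns 11.12 into 0.09.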
Odds Ratios in SAS Copy the following code into SAS:
Odds Ratios with PROC FREQ There are two ways to get Odds Ratios in SAS when there is one predictor and one outcome variable. The first is with PROC FREQ. Type the following code into SAS:
Notes about the SAS code: WEIGHT is a SAS statement that names a variable whose value gives the number of subjects each data line represents. When you have a table you want to enter into SAS, it is often easier to use a “count” variable rather than list each subject individually. Because the data set has 860 observations, we would have to type out 860 separate datalines if we did not use the “count” variable and the “weight count” statement.
TABLES tells SAS to construct a table with the two specified variables (in this case, race and party). The chisq option requests all Chi-Square statistics. The relrisk option gives you estimates of the odds ratio and relative risks for the two columns.
Output from PROC FREQ
Reading the Table Each cell has four numbers: count, percent, row %, and column %. There are 103 Black Democrats, which is 11.98% of the total sample. 90.35% of Blacks are Democrats. 23.20% of Democrats are Black. Compare this to 2.64% of Republicans who are Black.
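The four numbers in a cell can be reproduced from the counts alone. This is an illustrative Python sketch for the Black/Democrat cell, assuming the standard definitions of overall, row, and column percentages; the names are hypothetical and not taken from the SAS output.

```python
# Counts keyed by (race, party)
counts = {
    ("Black", "Democrat"): 103, ("Black", "Republican"): 11,
    ("White", "Democrat"): 341, ("White", "Republican"): 405,
}

total = sum(counts.values())
black_total = counts[("Black", "Democrat")] + counts[("Black", "Republican")]
dem_total = counts[("Black", "Democrat")] + counts[("White", "Democrat")]
rep_total = counts[("Black", "Republican")] + counts[("White", "Republican")]

cell = counts[("Black", "Democrat")]
pct_of_total = 100 * cell / total        # share of the whole sample
row_pct = 100 * cell / black_total       # share of Blacks who are Democrats
col_pct = 100 * cell / dem_total         # share of Democrats who are Black
rep_black_pct = 100 * counts[("Black", "Republican")] / rep_total

print(total)                    # 860
print(round(pct_of_total, 2))   # 11.98
print(round(row_pct, 2))        # 90.35
print(round(col_pct, 2))        # 23.2
print(round(rep_black_pct, 2))  # 2.64
```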
Interpreting Chi-Square Statistic The Chi-Square (Χ²) test statistic tests the null hypothesis that two variables are independent versus the alternative that they are not independent (that is, related). H₀: race and party identification are independent. Hₐ: race and party identification are associated. Χ² ≈ 78.91 (computed from the table counts), p-value < 0.0001. Reject H₀. Conclude that race and party identification are associated.
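For readers who want to see where the test statistic comes from, here is a rough Python sketch of the Pearson chi-square for a 2x2 table. It assumes the uncorrected Pearson statistic with 1 degree of freedom and uses the identity that a 1-df chi-square tail probability equals erfc(sqrt(x/2)); it is a cross-check, not the SAS computation itself.

```python
import math

# Observed counts: rows are race (Black, White), columns are party (Democrat, Republican)
observed = [[103, 11], [341, 405]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

# Upper-tail p-value for a chi-square variable with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2))  # 78.91
print(p_value < 0.0001)      # True
```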
Output of Odds Ratio
Interpreting the Odds Ratio You can find the OR in the SAS output under “Case-Control (Odds Ratio).” The odds ratio is 11.12 with a 95% Confidence Interval of [5.87, 21.05]. Because this C.I. does not contain 1, we know that the OR is statistically significant. Blacks have 11.12 times the odds of being Democratic compared with Whites.
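The interval [5.87, 21.05] can be reproduced from the cell counts. The sketch below assumes the usual large-sample (Woolf) standard error of the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), and a 1.96 multiplier; that this matches the method behind the SAS output is an assumption supported only by the numbers agreeing.

```python
import math

# Cell counts: a = Black Democrat, b = Black Republican,
#              c = White Democrat, d = White Republican
a, b, c, d = 103, 11, 341, 405

odds_ratio = (a * d) / (b * c)
log_or = math.log(odds_ratio)

# Large-sample standard error of log(OR)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Build the interval on the log scale, then exponentiate
lower = math.exp(log_or - 1.96 * se_log_or)
upper = math.exp(log_or + 1.96 * se_log_or)

print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))  # 11.12 5.87 21.05
print(lower > 1)  # True: the interval excludes 1
```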
A note about the PROC FREQ table: Notice the way the table is set up in SAS:
         Dem   Rep
Black    103    11
White    341   405
When calculating the OR in PROC FREQ, SAS will alphabetize the table, and this affects the OR it will calculate. SAS is calculating the odds of being a Democrat for Blacks versus Whites (or the odds of being Black for Democrats versus Republicans). If you wanted the odds of being Democratic for Whites versus Blacks, you would have to either calculate this by hand or use PROC LOGISTIC.
Odds Ratio with PROC LOGISTIC To simplify our data set, we will change our variables to have values of 1 and 0, rather than B/W and D/R. If someone is Black, s/he will have a value of “1” for the variable “race2.” Whites will have a value of “0.” If someone is a Democrat, s/he will have a value of “1” for “party2.” Republicans will have a value of “0.” Type the following code into SAS, which creates a new data set called “partyid2”:
PROC LOGISTIC Once you have created the new data set, do regression analysis on the data, using PROC LOGISTIC (notice the format is similar to that of linear regression, with the model statement y = x): “Descending” tells SAS to model the probability that “party2” = 1 (Democratic). If you did not include the descending option, SAS would model the probability that “party2” = 0 (Republican). All subsequent interpretations will be in terms of the odds of being Democratic, not Republican.
PROC LOGISTIC Output
Interpreting the Output From PROC LOGISTIC, we now have an equation for our log(odds): Log(odds) = β₀ + β₁x, that is, Log(odds) = -0.1720 + 2.4088x, where x = 1 if the person is Black and x = 0 if the person is White.
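To make the fitted equation less of a black box, here is a small Python sketch that fits the same one-predictor logistic model by Newton-Raphson on the grouped counts. It is an illustration under stated assumptions (maximum likelihood, starting values of zero, a fixed 25 iterations), not a description of how PROC LOGISTIC is implemented; all names are hypothetical.

```python
import math

# Grouped data: (x, number of Democrats, group size); x = 1 for Black, 0 for White
groups = [(1, 103, 114), (0, 341, 746)]

b0, b1 = 0.0, 0.0
for _ in range(25):
    g0 = g1 = 0.0            # gradient of the log-likelihood
    h00 = h01 = h11 = 0.0    # information matrix entries
    for x, dems, size in groups:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        w = size * p * (1 - p)
        g0 += dems - size * p
        g1 += x * (dems - size * p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system (information matrix) * step = gradient
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Standard error of b1 from the inverse information matrix
se_b1 = math.sqrt(h00 / det)

print(round(b0, 4), round(b1, 4), round(se_b1, 4))  # -0.172 2.4088 0.3256
```

Because there is one binary predictor, the fit reproduces the table exactly: the intercept is log(341/405) and the slope is the log of the odds ratio.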
Calculating the Odds Ratio Suppose we wanted to know the odds of being a Democrat for Blacks vs. Whites. The log(odds) of being Democratic for Blacks is: β₀ + β₁(1) = β₀ + β₁. The log(odds) of being Democratic for Whites is: β₀ + β₁(0) = β₀. To calculate the OR, take the log(odds) for Blacks minus the log(odds) for Whites: β₀ + β₁ - (β₀) = β₁. Then exponentiate this value: exp(β₁) = exp(2.4088) = 11.12. This is the same OR calculated earlier using PROC FREQ. In addition, it is given to you in the PROC LOGISTIC output under “Odds Ratio Estimates” with the 95% C.I.
Calculating the OR, cont. Suppose we wanted to know the odds of being a Democrat for Whites vs. Blacks. To calculate the OR, take the log(odds) for Whites minus the log(odds) for Blacks: β₀ - (β₀ + β₁) = -β₁. Then exponentiate this value: exp(-β₁) = exp(-2.4088) = 0.09. Whites have 91% (1 - 0.09) lower odds of being Democratic than Blacks.
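Both directions of the comparison follow from the single slope estimate. A minimal Python sketch, taking β₁ = 2.4088 from the slides as given:

```python
import math

beta1 = 2.4088  # slope estimate quoted in the slides

# Blacks vs. Whites: exponentiate the slope
or_black_vs_white = math.exp(beta1)

# Whites vs. Blacks: exponentiate the negated slope
or_white_vs_black = math.exp(-beta1)

# Percent reduction in odds for Whites relative to Blacks
pct_lower = 100 * (1 - or_white_vs_black)

print(round(or_black_vs_white, 2))  # 11.12
print(round(or_white_vs_black, 2))  # 0.09
print(round(pct_lower))             # 91
```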
Significance Testing Testing the significance of a parameter estimate can be done by constructing a confidence interval around that parameter estimate. If the C.I. for an estimate (or log(OR)) contains 0, the variable is not significantly associated with the outcome. If the C.I. for an OR contains 1, the variable is not significantly associated with the outcome.
The Wald Chi-Square statistic tests whether the parameter estimate equals zero, that is, H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0. From the output, we see that the p-value of this test is < 0.0001, so we reject H₀ and conclude that race is significantly related to party identification.
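As a rough illustration of what the Wald test does, the Python sketch below squares the ratio of the estimate to its standard error and refers it to a chi-square distribution with 1 degree of freedom. It uses the rounded values 2.4088 and 0.3256 from the slides, so the statistic may differ slightly in the last digits from the one SAS prints.

```python
import math

beta1 = 2.4088     # slope estimate from the slides
se_beta1 = 0.3256  # its standard error from the slides

# Wald statistic: (estimate / standard error) squared
z = beta1 / se_beta1
wald_chi_square = z ** 2

# Upper-tail p-value for a 1-df chi-square, equivalently a two-sided normal test
p_value = math.erfc(abs(z) / math.sqrt(2))

print(p_value < 0.0001)  # True
```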
Confidence Interval Construction Confidence interval construction is similar to what you have seen for linear regression, except that it is now on the natural log scale: 95% C.I. for β₁ = β₁ +/- 1.96*se(β₁) = 2.4088 +/- 1.96*(0.3256) = [1.77, 3.05]. This C.I. does not contain 0. Exponentiating the unrounded endpoints gives exp[1.77, 3.05] = [5.875, 21.05]. This C.I. does not contain 1. Notice that [5.875, 21.05] is also the 95% C.I. for the OR given in the SAS output.
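The construction above can be followed step by step in a few lines. This Python sketch takes the estimate 2.4088 and standard error 0.3256 from the slides and assumes the 1.96 normal multiplier used there.

```python
import math

beta1 = 2.4088
se_beta1 = 0.3256

# Interval on the log-odds scale
lower_log = beta1 - 1.96 * se_beta1
upper_log = beta1 + 1.96 * se_beta1

# Exponentiate the endpoints to get the interval for the odds ratio
lower_or = math.exp(lower_log)
upper_or = math.exp(upper_log)

print(round(lower_log, 2), round(upper_log, 2))  # 1.77 3.05
print(round(lower_or, 2), round(upper_or, 2))    # 5.87 21.05
```

Exponentiating the already rounded endpoints 1.77 and 3.05 gives slightly different values, so keep full precision until the final step.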
Calculating the Probability If you were asked to calculate the probability that someone is a Democrat, given that he is Black, you would use the following formula: Π (probability) = exp(log(odds))/[1 + exp(log(odds))]. Here log(odds) = β₀ + β₁ = -0.1720 + 2.4088 = 2.2368, so Π = exp(2.2368)/[1 + exp(2.2368)] = 0.9035. A Black person has a 90.35% chance of being a Democrat.
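The conversion from log(odds) to probability is easy to wrap in a helper. In this Python sketch the function name `probability_from_log_odds` is hypothetical, and the coefficients are the rounded values -0.1720 and 2.4088 derived from the table.

```python
import math

beta0 = -0.1720  # intercept: log(odds) of being a Democrat for Whites
beta1 = 2.4088   # slope: change in log(odds) for Blacks vs. Whites


def probability_from_log_odds(log_odds):
    """Convert a log(odds) value to a probability."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))


p_black = probability_from_log_odds(beta0 + beta1 * 1)
p_white = probability_from_log_odds(beta0 + beta1 * 0)

print(round(p_black, 4))  # 0.9035
print(round(p_white, 4))  # 0.4571
```

These match the row percentages in the original table: 103/114 for Blacks and 341/746 for Whites.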
Summary This has been an introduction to calculating odds ratios in PROC FREQ and PROC LOGISTIC. The next section will introduce you to multiple predictors in logistic regression, including interactions.