Logistic Regression.


1 Logistic Regression

2 Categorical versus Continuous Data Analysis
[Figure 05S01F01; label: Response]

3 Overview
[Figure 05S03F01]

4 Types of Logistic Regression
[Figure 05S03F02]

5 Logistic Regression
Logistic regression investigates the relationship between a response variable and one or more predictors. Minitab provides three logistic regression procedures that you can use to assess the relationship between one or more predictor variables and a categorical response variable of the following types: binary, ordinal, and nominal.

6 What Does Logistic Regression Do?
The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes. In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.

7 Logistic Regression Curve
[Figure: logistic regression curve; vertical axis: Probability (0.0 to 1.0); horizontal axis: x (1 to 21)]

8 Logit Transformation
Logistic regression models transformed probabilities called logits:
$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)$
where
$i$ indexes all cases (observations),
$p_i$ is the probability that the event (a sale, for example) occurs in the $i$th case,
log is the natural log (to the base e).
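As a small numerical illustration of the transformation on this slide, the following Python sketch converts probabilities to logits and back. The function names `logit` and `inverse_logit` are chosen here for illustration and do not come from the slides.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); requires 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map a logit back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities below 0.5 give negative logits, above 0.5 positive ones.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))
```

The logit is unbounded in both directions, which is what allows it to be modeled as a linear function of the predictors.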

9 Assumption
[Figure 05S03F03; labels: p_i and (p_i)]

10 Logistic Regression Model
logit (pi) = 0 + 1Xi where logit(pi) logit transformation of the probability of the event 0 intercept of the regression line 1 slope of the regression line.

11 Binary Logistic Regression
This demonstration illustrates fitting a binary logistic regression model in PROC LOGISTIC.

12 Odds Ratio from a Logistic Regression Model
Estimated logistic regression model: [equation not preserved in this transcript]
Estimated odds ratio (Females to Males):
$\text{odds ratio} = \frac{e^{-0.7566 + 0.4373}}{e^{-0.7566}} = e^{0.4373} = 1.549$
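The arithmetic on this slide can be checked with a few lines of Python. This sketch assumes that -0.7566 is the intercept (the log odds for Males) and 0.4373 is the coefficient for Females, which is what the ratio on the slide implies; the variable names are illustrative.

```python
import math

intercept = -0.7566      # assumed: log odds for Males (reference level)
female_coef = 0.4373     # assumed: change in log odds for Females

odds_females = math.exp(intercept + female_coef)
odds_males = math.exp(intercept)

# The intercept cancels in the ratio, leaving exp(female_coef).
odds_ratio = odds_females / odds_males
print(round(odds_ratio, 3))  # 1.549
```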

13 Multiple Logistic Regression
logit (pi) = 0 + 1X1 + 2X2 + 3X3 INTERNAL USE FIG. 05S03F10

14 Binary Logistic Regression
Use binary logistic regression to perform logistic regression on a binary response variable. A binary variable has only two possible values, such as presence or absence of a particular disease. A model with one or more predictors is fit using an iterative reweighted least squares algorithm to obtain maximum likelihood estimates of the parameters. Binary logistic regression has also been used to classify observations into one of two categories, and it may give fewer classification errors than discriminant analysis in some cases.
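For readers curious about what an iterative reweighted least squares fit looks like, here is a minimal Python sketch for a single predictor. It is not the algorithm as implemented in Minitab or SAS. The function name `fit_logistic_irls` and the toy data are invented for illustration, and the update is written as a Newton step, which for this model is algebraically the same as a weighted least squares solve on the working response.

```python
import math

def fit_logistic_irls(x, y, iterations=25):
    """Fit logit(p) = b0 + b1*x by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Current fitted probabilities and IRLS weights p * (1 - p).
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector X'(y - p).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X'WX (2 x 2, symmetric).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Solve (X'WX) * delta = X'(y - p) for the update.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Toy data, invented for illustration (not from the presentation).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_irls(x, y)
print(round(b0, 3), round(b1, 3))
```

At convergence the score vector is zero, which is the condition that defines the maximum likelihood estimates.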

15 RestingPulse Smokes Weight
Example: You are a researcher interested in the effect of smoking and weight on resting pulse rate. Because you have categorized the response, pulse rate, into low and high, a binary logistic regression analysis is appropriate to investigate the effects of smoking and weight on pulse rate.

16 SAS code
proc logistic desc;
  class Smokes / desc param=glm;
  model RestingPulse = Smokes Weight;
run;

17 Output: Logistic Regression Table
                                              Odds    95% CI
Predictor       Coef  SE Coef      Z      P  Ratio  Lower  Upper
Constant                       -1.18  0.237
Smokes
 Yes                           -2.16  0.031   0.30   0.10   0.90
Weight                          2.04  0.041   1.03   1.00   1.05

Test that all slopes are zero: G = 7.574, DF = 2, P-Value = 0.023

18 Logistic Regression Table
The table shows the estimated coefficients, the standard errors of the coefficients, z-values, p-values, odds ratios, and a 95% confidence interval for each odds ratio.
From the output, you can see that the estimated coefficients for both Smokes (z = -2.16, p = 0.031) and Weight (z = 2.04, p = 0.041) have p-values less than 0.05, indicating that there is sufficient evidence that the coefficients are not zero using an α-level of 0.05.
The estimated coefficient for Smokes represents the change in the log of P(low pulse)/P(high pulse) when the subject smokes compared to when he/she does not smoke, with the covariate Weight held constant.
The estimated coefficient for Weight is the change in the log of P(low pulse)/P(high pulse) with a 1 unit (1 pound) increase in Weight, with the factor Smokes held constant.

19 Although there is evidence that the estimated coefficient for Weight is not zero, the odds ratio is very close to one (1.03), indicating that a one-pound increase in weight has minimal effect on a person's resting pulse rate. A more meaningful difference would be found if you compared subjects with a larger weight difference. For example, if the weight unit is 10 pounds, the odds ratio becomes 1.28, indicating that the odds of a subject having a low pulse are multiplied by 1.28 with each 10-pound increase in weight.
For Smokes, the negative coefficient and the odds ratio of 0.30 indicate that subjects who smoke tend to have a higher resting pulse rate than subjects who do not smoke. Given that subjects have the same weight, the odds ratio can be interpreted as follows: the odds of smokers in the sample having a low pulse are 30% of the odds of non-smokers having a low pulse.
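The effect of changing the weight unit can be illustrated numerically. The unrounded Weight coefficient is not preserved in this transcript, so the Python sketch below assumes a value of 0.025, chosen only because it is consistent with the reported odds ratios of 1.03 per pound and 1.28 per 10 pounds. The function name is illustrative.

```python
import math

def odds_ratio_for_change(coef, units):
    """Odds ratio for a change of `units` in a predictor with log-odds coefficient `coef`."""
    return math.exp(coef * units)

weight_coef = 0.025  # assumed value, not taken from the slides

print(round(odds_ratio_for_change(weight_coef, 1), 2))   # 1.03
print(round(odds_ratio_for_change(weight_coef, 10), 2))  # 1.28
```

Raising the rounded odds ratio 1.03 to the 10th power gives about 1.34 instead, so rescaling is better done from the coefficient than from a rounded odds ratio.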

20 Ordinal Logistic Regression
Use ordinal logistic regression to perform logistic regression on an ordinal response variable. Ordinal variables are categorical variables that have three or more possible levels with a natural ordering, such as strongly disagree, disagree, neutral, agree, and strongly agree. A model with one or more predictors is fit using an iterative-reweighted least squares algorithm to obtain maximum likelihood estimates of the parameters. Parallel regression lines are assumed, and therefore, a single slope is calculated for each covariate. In situations where this assumption is not valid, nominal logistic regression, which generates separate logit functions, is more appropriate.

21 Survival Region ToxicLevel
Example: Suppose you are a field biologist and you believe that the adult population of salamanders in the Northeast has gotten smaller over the past few years. You would like to determine whether any association exists between the length of time a hatched salamander survives and the level of water toxicity, as well as whether there is a regional effect. Survival time is coded as 1 = fewer than 10 days, 2 = 10 to 30 days, and 3 = 31 to 60 days.

22 SAS code
proc logistic;
  class Region / desc param=glm;
  model Survival = Region ToxicLevel;
run;

23 Logistic Regression Table
                                              Odds    95% CI
Predictor       Coef  SE Coef      Z      P  Ratio  Lower  Upper
Const(1)                       -4.19  0.000
Const(2)                              0.017
Region
 2                              0.41  0.685   1.22   0.46   3.23
ToxicLevel                      3.56  0.000   1.13   1.06   1.21

Test that all slopes are zero: G = 14.713, DF = 2, P-Value = 0.001

24 Interpreting the results
The values labeled Const(1) and Const(2) are estimated intercepts for the logits of the cumulative probabilities of survival for <10 days and for 10 to 30 days, respectively. Because the cumulative probability for the last response value is 1, there is no need to estimate an intercept for 31 to 60 days.
The coefficient for Region is the estimated change in the logit of the cumulative survival time probability when the region is 2 compared to region 1, with the covariate ToxicLevel held constant. Because the p-value for this estimated coefficient is 0.685, there is insufficient evidence to conclude that region has an effect on survival time.
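Written out, the ordinal model has one intercept per cumulative logit and a shared set of slopes. The symbols below ($Y$ for the coded survival category, $\theta_j$ for the intercepts Const(1) and Const(2), $\beta_1$ and $\beta_2$ for the slopes) are notation assumed here, not taken from the slides. The sign convention is chosen to match the interpretation used in the presentation, where a positive coefficient shifts probability toward lower survival categories.

```latex
\operatorname{logit}\bigl(P(Y \le j)\bigr)
  = \log\frac{P(Y \le j)}{P(Y > j)}
  = \theta_j + \beta_1\,\mathrm{Region} + \beta_2\,\mathrm{ToxicLevel},
  \qquad j = 1, 2
```

Because $P(Y \le 3) = 1$, no third intercept is needed, and because the slopes do not depend on $j$, the two logit lines are parallel.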

25 Interpreting the results
There is one estimated coefficient for each covariate, which gives parallel lines for the factor levels. Here, the estimated coefficient for the single covariate, ToxicLevel, is 0.121, with a p-value reported as 0.000 in the table. The p-value indicates that for most α-levels, there is sufficient evidence to conclude that the toxic level affects survival. The positive coefficient and an odds ratio greater than one indicate that higher toxic levels tend to be associated with lower values of survival. Specifically, a one-unit increase in water toxicity results in a 13% increase in the odds that a salamander lives fewer than 10 days versus 10 days or more, and in the odds that it lives 30 days or fewer versus more than 30 days.
Next displayed is the last log-likelihood from the maximum likelihood iterations, along with the statistic G. This statistic tests the null hypothesis that all the coefficients associated with predictors equal zero versus the alternative that at least one coefficient is not zero. In this example, G = 14.713 with a p-value of 0.001, indicating that there is sufficient evidence to conclude that at least one of the estimated coefficients is different from zero.
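Two of the numbers quoted here can be reproduced from the reported output with a short Python sketch. It relies on two standard facts, not on anything specific to the slides: the odds ratio is the exponential of the coefficient, and G is compared with a chi-square distribution whose degrees of freedom equal the number of slopes. For DF = 2 the chi-square upper-tail probability has the closed form exp(-G/2), so no statistics library is needed. The helper name is illustrative.

```python
import math

def chi2_upper_tail_df2(g):
    """P(chi-square with 2 degrees of freedom > g); closed form valid for DF = 2 only."""
    return math.exp(-g / 2.0)

toxic_coef = 0.121
odds_ratio = math.exp(toxic_coef)
percent_increase = (odds_ratio - 1.0) * 100.0
print(round(odds_ratio, 2), round(percent_increase))  # 1.13 13

# G statistics as reported for the ordinal and binary examples.
print(round(chi2_upper_tail_df2(14.713), 3))  # 0.001
print(round(chi2_upper_tail_df2(7.574), 3))   # 0.023
```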

26 Nominal Logistic Regression
Use nominal logistic regression to perform logistic regression on a nominal response variable. The model is fit using an iterative-reweighted least squares algorithm to obtain maximum likelihood estimates of the parameters. Nominal variables are categorical variables that have three or more possible levels with no natural ordering. For example, the levels in a food tasting study may include crunchy, mushy, and crispy.

27 TeachingMethod Age Subject
Suppose you are a grade school curriculum director interested in what children identify as their favorite subject and how this is associated with their age or the teaching method employed. Thirty children, 10 to 13 years old, had classroom instruction in science, math, and language arts that employed either lecture or discussion techniques. At the end of the school year, they were asked to identify their favorite subject. We use nominal logistic regression because the response is categorical and possesses no implicit categorical ordering.

28 SAS code
proc logistic;
  class TeachingMethod / desc param=glm;
  model Subject = TeachingMethod Age / link=glogit;
run;

29 Logistic Regression Table
                                              Odds    95% CI
Predictor       Coef  SE Coef      Z      P  Ratio  Lower  Upper
Logit 1: (math/science)
Constant
Age
TeachingMethod
 lecture
Logit 2: (arts/science)
Constant
Age
TeachingMethod
 lecture

30 Interpreting the results
If there are k distinct response values, Minitab estimates k-1 sets of parameter estimates, here labeled as Logit(1) and Logit(2). These are the estimated differences in log odds or logits of math and language arts, respectively, compared to science as the reference event. Each set contains a constant and coefficients for the factor(s), here teaching method, and the covariate(s), here age. The TeachingMethod coefficient is the estimated change in the logit when TeachingMethod is lecture compared to discussion, with Age held constant. The Age coefficient is the estimated change in the logit with a one-year increase in age, with teaching method held constant. These sets of parameter estimates give nonparallel lines for the response values.
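To make the two logits concrete, the following Python sketch converts a pair of logits (math versus science, language arts versus science) into the three category probabilities. The coefficient values are hypothetical, since the table values are not preserved in this transcript, and the function name is illustrative.

```python
import math

def nominal_probabilities(logit_math, logit_arts):
    """Probabilities of (science, math, arts) given logits relative to science."""
    denom = 1.0 + math.exp(logit_math) + math.exp(logit_arts)
    return {
        "science": 1.0 / denom,
        "math": math.exp(logit_math) / denom,
        "arts": math.exp(logit_arts) / denom,
    }

# Hypothetical coefficients: (constant, Age, TeachingMethod = lecture).
coef_math = (-1.0, 0.1, -0.5)
coef_arts = (-10.0, 0.8, 2.5)

age, lecture = 12, 1  # a 12-year-old taught by lecture
logit_math = coef_math[0] + coef_math[1] * age + coef_math[2] * lecture
logit_arts = coef_arts[0] + coef_arts[1] * age + coef_arts[2] * lecture

probs = nominal_probabilities(logit_math, logit_arts)
print({k: round(v, 3) for k, v in probs.items()})
```

The reference event, science, has an implicit logit of zero, which is why only k-1 sets of parameters are needed for k response values.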

31 Interpreting the results
The first set of estimated logits, labeled Logit(1), contains the parameter estimates of the change in logits of math relative to the reference event, science. The p-values for TeachingMethod and Age indicate that there is insufficient evidence to conclude that a change in teaching method from discussion to lecture, or a change in age, affected the choice of math as favorite subject as compared to science.

32 Interpreting the results
The second set of estimated logits, labeled Logit(2), contains the parameter estimates of the change in logits of language arts relative to the reference event, science. The p-values for TeachingMethod and Age indicate that there is sufficient evidence, if the p-values are less than your acceptable α-level, to conclude that a change in teaching method from discussion to lecture, or a change in age, affected the choice of language arts as favorite subject compared to science. The positive coefficient for teaching method indicates that students given a lecture style of teaching tend to prefer language arts over science compared to students given a discussion style of teaching. The estimated odds ratio implies that the odds of choosing language arts over science are about 16 times higher for these students when the teaching method changes from discussion to lecture. The positive coefficient associated with age indicates that students tend to like language arts over science as they become older.

