Lecture 18: Ordinal and Polytomous Logistic Regression. BMTRY 701 Biostatistical Methods II
Categorical Outcomes. Logistic regression is appropriate for binary outcomes. What about other kinds of categorical data: more than 2 categories, or ordinal data? Standard logistic regression is not applicable unless you 'threshold' the data or collapse categories. (BMTRY 711: Analysis of Categorical Data covers this in depth; this is just an overview.)
Ordinal Logistic Regression. Examples of ordinal dependent variables: teaching experience; SES (high, middle, low); degree of agreement; ability level (e.g., literacy, reading); severity of disease/outcome; severity of toxicity. Context is important. Example: attitudes towards smoking.
Proportional Odds Model. One of several possible regression models for the analysis of ordinal data, and also the most common. The model predicts the ln(odds) of being in category j or beyond. Simplifying assumption: "proportional odds," i.e., the effect of a covariate is assumed to be invariant across splits. Example with 4 categories, the three splits are: 0 vs. 1,2,3; 0,1 vs. 2,3; 0,1,2 vs. 3. The model assumes that each of these comparisons yields the same odds ratio.
Motivating Example: YTS. The South Carolina Youth Tobacco Survey (SC YTS) is part of the National Youth Tobacco Survey program sponsored by the Centers for Disease Control and Prevention. The YTS is an annual school-based survey designed to evaluate youth-related smoking practices, including initiation and prevalence, cessation, attitudes towards smoking, media influences, and more. The SC YTS is coordinated by the SC Department of Health and Environmental Control and has been administered yearly; data for this report are based on the 2005-2007 survey years. The SC YTS uses a two-stage cluster sample design to select a representative sample of public middle (grades 6-8) and high school (grades 9-12) students.
Ordinal Outcomes (. tab output; frequencies garbled in extraction). Item 1: "Do you think smoking cigarettes makes young people look cool or fit in?" Responses: definitely yes / probably yes / probably not / definitely not; total n = 7,540. Item 2: "Do you think young people risk harming themselves if they smoke cigarettes?" Same four ordered response levels; the large majority (over 5,000) answered definitely yes.
What factors are related to these attitudes? Gender? Grade? Race? Parental education (surrogate for SES)? Year (2005, 2006, 2007)? Have tried cigarettes? School performance? Smoker in the home?
Tabulation of gender vs. "do you think smoking cigarettes makes young people look cool or fit in?":

                    gender
                    0        1      Total
definitely yes      .        .        455
probably yes        .        .        810
probably not        .        .      1,320
definitely not    2,158    2,797    4,955
Total             3,574    3,966    7,540

(gender-specific counts for the first three rows were lost in extraction; the definitely-not row total, 4,955, is the sum of its surviving cells)
Possible "breaks" (male vs. female odds ratio at each split of the 4-level outcome):
definitely yes vs. the rest: OR = 1.81
(definitely or probably) yes vs. no: OR = 1.59
the rest vs. definitely not: OR = 1.57
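The split odds ratios above come straight from collapsing the 2x4 table at each cut. A minimal Python sketch of that computation, using counts chosen to match the surviving margins of the gender table above (the within-gender splits of the first three rows are illustrative, so the ORs printed here are not the YTS values):

```python
import math

# 2 x 4 table of counts (gender x response), columns ordered
# definitely yes ... definitely not. The definitely-not cells and all
# margins match the source table; the other cells are illustrative.
counts = {
    "gender0": [250, 420, 746, 2158],   # sums to 3,574
    "gender1": [205, 390, 574, 2797],   # sums to 3,966
}

def split_or(counts, cut):
    """Odds ratio (gender0 vs gender1) for response < cut vs >= cut."""
    a = sum(counts["gender0"][:cut]); b = sum(counts["gender0"][cut:])
    c = sum(counts["gender1"][:cut]); d = sum(counts["gender1"][cut:])
    return (a / b) / (c / d)

for cut in (1, 2, 3):
    print(f"split at {cut}: OR = {split_or(counts, cut):.2f}")
```

Under proportional odds, these three ORs would all estimate the same quantity.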
Proportional Odds Assumption. How to implement this? Model 'cumulative' logits. Instead of the usual binary-logistic form, logit P(Y = 1) = beta0 + beta1*X, here we have, for each level j, logit P(Y >= j) = alpha_j + beta1*X: one equation per split, with a common slope.
The (simple) ordinal logistic model. Warning! Different packages parameterize it in different ways: Stata codes it differently than SAS and R. Notice how this differs from logistic regression: there is a 'level'-specific intercept alpha_j, but just ONE log odds ratio beta describing the association between x and y.
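A small numeric sketch of this structure, with illustrative (not fitted) cutpoints and slope, showing how the level-specific intercepts and the single slope generate the four category probabilities, and why one beta forces the same odds ratio at every cumulative split:

```python
import math

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative parameters (not the fitted YTS values): three intercepts
# for a 4-category outcome, one for each cumulative split, and a single
# slope beta shared by all splits.
alpha = [0.9, -0.8, -2.0]   # intercepts for P(Y >= 2), P(Y >= 3), P(Y >= 4)
beta = 0.5

def category_probs(x):
    """Category probabilities under logit P(Y >= j) = alpha_j + beta*x."""
    cum = [1.0] + [invlogit(a + beta * x) for a in alpha] + [0.0]
    # P(Y = j) is the difference of adjacent cumulative probabilities
    return [cum[j] - cum[j + 1] for j in range(4)]

print(category_probs(0))   # four probabilities summing to 1
```

Because the logit of every cumulative probability is linear in x with the same slope, the odds ratio comparing x+1 to x at any split is exp(beta), regardless of the cut; that is exactly the proportional odds assumption.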
Example.
. ologit lookcool gender

Ordered logistic regression          Number of obs = 7540
(iteration log, LR chi2(1), pseudo R2, the gender coefficient, and the cutpoints /cut1, /cut2, /cut3 were garbled in extraction)
R estimation: a different parameterization, which makes you think about what the model is doing!
> library(Design)
> oreg <- lrm(lookcool ~ gender, data=data)
> oreg

Logistic Regression Model
lrm(formula = lookcool ~ gender, data = data)

(lrm reports response frequencies, fit statistics such as Obs, Model L.R., C, Dxy, Gamma, Tau-a, R2, and Brier, and then the coefficients: three intercepts, y>=2, y>=3, y>=4 for a 1-4 coding, plus a single gender slope; the numeric values were garbled in extraction)
MLR.
. ologit lookcool gender evertried smokerhome grade school_perf

Ordered logistic regression          Number of obs = 2125, LR chi2(5)
(coefficients for gender, evertried, smokerhome, grade, and school performance, and the cutpoints /cut1-/cut3, were garbled in extraction)
It is a pretty strong assumption, so how can we check? A simple check is shown in the 2x2 tables (one per split). Continuous variables are harder: you need to consider the model, and there is no direct 'tabular' comparison. In multiple regression, does it hold for all covariates? Tricky! It needs to make sense, and you need to do some 'model checking' for all of your variables; it is worthwhile to check each individually.
There is another approach: a formal test of proportionality, implemented easily in Stata with an add-on package, omodel. Ho: proportionality holds; Ha: proportionality is violated. Why this direction? A violation would require more parameters, i.e., a larger model. What does a small p-value imply? Evidence against the proportional odds assumption. But be careful of sample size: large sample sizes will make it hard to 'adhere' to the proportionality assumption.
Estimation in Stata.
. omodel logit lookcool gender

Ordered logit estimates          Number of obs = 7540
(the gender coefficient and ancillary parameters _cut1-_cut3 were garbled in extraction)

Approximate likelihood-ratio test of proportionality of odds across response categories: chi2(2) = 2.43 (p of about 0.30, computed from the reported chi2; the printed value was lost)
. omodel logit lookcool grade

Ordered logit estimates          Number of obs = 7505, LR chi2(1) = 0.68
(the grade coefficient and ancillary parameters _cut1-_cut3 were garbled in extraction)

Approximate likelihood-ratio test of proportionality of odds across response categories: chi2(2) and its p-value were lost in extraction
What would the ORs be? Generate three separate binary outcome variables from the ordinal variable: lookcool1v234, lookcool12v34, lookcool123v4. Then estimate the odds ratio for each binary outcome.
Stata Code
gen lookcool1v234=1 if lookcool==2 | lookcool==3 | lookcool==4
replace lookcool1v234=0 if lookcool==1
gen lookcool12v34=1 if lookcool==3 | lookcool==4
replace lookcool12v34=0 if lookcool==1 | lookcool==2
gen lookcool123v4=1 if lookcool==4
replace lookcool123v4=0 if lookcool==2 | lookcool==3 | lookcool==1
logit lookcool1v234 grade
logit lookcool12v34 grade
logit lookcool123v4 grade
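The same recoding in Python terms (a hedged sketch; lookcool assumed coded 1 = definitely yes through 4 = definitely not, consistent with the Stata code):

```python
# Collapse the 4-level lookcool code into the three cumulative binary
# outcomes used above: 1v234 (cut=1), 12v34 (cut=2), 123v4 (cut=3).
def dichotomize(lookcool, cut):
    """1 if lookcool is above the cut (toward 'definitely not'), else 0."""
    return 1 if lookcool > cut else 0

codes = [1, 2, 3, 4]
for cut in (1, 2, 3):
    print(cut, [dichotomize(c, cut) for c in codes])
```

Each cut defines one binary outcome, and an ordinary logistic regression on each gives the split-specific odds ratios compared in the next slide.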
Results. For a one-grade difference (range = 6-12):
lookcool1v234 vs. grade: OR = (0.93)
lookcool12v34 vs. grade: OR = 1.04 (p = 0.03)
lookcool123v4 vs. grade: OR = 0.98 (p = 0.11)
Another approach: Polytomous Logistic Regression. Polytomous (aka polychotomous, or multinomial) logistic regression fits the regression model with all contrasts against a base category: ln[P(Y = j)/P(Y = base)] = beta0j + beta1j*x for each non-base category j. It can be used as an inferential model, or it can be used to estimate the odds ratios to see if they look 'ordered'. The model is different, though.
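A minimal sketch of that baseline-category computation, with 'definitely not' as the base and illustrative coefficients (not the fitted YTS estimates):

```python
import math

# (intercept_j, slope_j) for ln[P(Y=j)/P(Y=definitely not)];
# values are illustrative, one separate slope per contrast.
params = {
    "definitely yes": (-2.4, 0.8),
    "probably yes":   (-1.8, 0.6),
    "probably not":   (-1.3, 0.4),
}

def multinomial_probs(x):
    """P(Y = j | x) for all four categories via a softmax of the scores."""
    scores = {j: b0 + b1 * x for j, (b0, b1) in params.items()}
    scores["definitely not"] = 0.0        # base category score fixed at 0
    denom = sum(math.exp(s) for s in scores.values())
    return {j: math.exp(s) / denom for j, s in scores.items()}

print(multinomial_probs(1.0))
```

Unlike the proportional odds model, each contrast here carries its own slope; if the fitted slopes come out ordered (as the gender results on the next slide suggest), that is informal evidence that the more parsimonious ordinal model may be adequate.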
. mlogit lookcool gender

Multinomial logistic regression          Number of obs = 7540, LR chi2(3)
(separate equations for definitely yes, probably yes, and probably not, each with its own gender coefficient and constant; the numeric values were garbled in extraction)
(lookcool==definitely not is the base outcome)
Interpretation. For gender, notice the ordered nature of the odds ratios, which suggests that it may be appropriate to use an ordinal model. The polytomous model is more general and less restrictive, but sort of a mess to interpret.
. mlogit lookcool grade

Multinomial logistic regression          Number of obs = 7505, LR chi2(3)
(separate equations for definitely yes, probably yes, and probably not, each with its own grade coefficient and constant; the numeric values were garbled in extraction)
(lookcool==definitely not is the base outcome)
In R? The mlogit library requires a data transformation step first; alternatively, multinom in the nnet package fits the same baseline-category model.