Notes on Logistic Regression


Notes on Logistic Regression STAT 4330/8330

Introduction Previously, you learned about odds ratios (OR’s). We now transition and begin discussion of binary logistic regression. We will see that OR’s play an important role in the results of binary logistic models.

Binary Logistic Regression Binary Logistic Regression is appropriate when: The response variable is categorical w/ 2 categories (binary, dichotomous, etc.). The response categories are often generically labeled “success” or “failure”. One or more explanatory variables are involved. These can be either quantitative or categorical or a mixture of both. One is interested in assessing the relationship between the binary response and the explanatory variables and/or predicting the response category based on the value(s) of the explanatory variable(s).

The Model Equation

The Model Equation A few points: E(y) can never fall below 0 or above 1 (Remember: it is a probability!). The model is not a linear function of the β parameters. This is a type of nonlinear regression model.
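For reference, the points above describe the standard binary logistic model. A sketch in LaTeX, assuming k predictors x_1, ..., x_k and parameters β_0, ..., β_k (this notation is an assumption, chosen to match the β and x symbols used in these notes):

```latex
E(y) = P(y = 1)
     = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
            {1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}
```

Because the right side is a ratio of a positive quantity to that quantity plus 1, it always lies strictly between 0 and 1.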

The Model Function

The Model Equation Alternatively, the equation can be transformed to show that it models the natural logarithm of the odds of y = 1.

The Model Equation The left side is called the “logit”
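The transformed equation can be sketched as follows, writing π = E(y) = P(y = 1) as shorthand (the symbol π is an assumption introduced here, not taken from the slides):

```latex
\operatorname{logit}(\pi)
  = \ln\!\left(\frac{\pi}{1 - \pi}\right)
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
```

On this scale the model is linear in the β parameters, which is why the coefficients are interpreted as changes in the log-odds.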

The Model Equation In general, bi estimates the change in the log-odds when xi is increased by 1 unit, holding all other x’s in the model fixed. Therefore, exp(bi) estimates the OR of a success for each additional 1-unit increase in xi. Furthermore, (exp(bi)-1)*100 gives the percent change in the odds of a success for each 1-unit increase in xi: an increase when bi > 0 and a decrease when bi < 0.
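The step from log-odds to odds ratio can be written out as a short derivation. This is a sketch, with odds(·) used as assumed shorthand for the estimated odds of a success at a given value of x_i, all other predictors held fixed:

```latex
\frac{\operatorname{odds}(x_i + 1)}{\operatorname{odds}(x_i)}
  = \frac{\exp\bigl(b_0 + \cdots + b_i (x_i + 1) + \cdots + b_k x_k\bigr)}
         {\exp\bigl(b_0 + \cdots + b_i x_i + \cdots + b_k x_k\bigr)}
  = \exp(b_i)
```

Every term except b_i cancels, so the odds ratio for a 1-unit increase does not depend on the starting value of x_i or on the values of the other predictors.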

Example: The Outbreak Data The Outbreak data contain a sample of N = 196 persons in 2 neighborhoods (sectors) of a large city during a disease outbreak. Can we predict whether or not a person contracts the disease? We will begin with a simple binary logit model (with 1 predictor = age).

Example: The Outbreak Data. Through SAS PROC LOGISTIC, we find that b1 = .0285. Therefore, OR = exp(.0285) = 1.029, indicating that a person’s odds of contracting the disease are multiplied by 1.029 for each additional year of age.

Example: The Outbreak Data. Furthermore, we can state that the odds of contracting the disease increase by 2.89% with each additional year in age. (exp(.0285)-1)*100% = 2.89%.

Example: The Outbreak Data. We can transform these results to discuss the increase in odds in 5 & 10 year increments by the following: exp(c*bi) = the OR for a difference of c units in xi.

Example: The Outbreak Data. Therefore: (exp(5*.0285)-1)*100% = 15.32% and (exp(10*.0285)-1)*100% = 32.98%. As a result, a person’s odds of getting the disease increase by 15.32% for every additional 5 years in age and by 32.98% for every additional 10 years.
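The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are illustrative and not part of SAS, and the slope 0.0285 is the age coefficient reported in the notes.

```python
import math


def odds_ratio(b, c=1.0):
    """Estimated odds ratio for a c-unit increase in a predictor with slope b."""
    return math.exp(c * b)


def percent_change_in_odds(b, c=1.0):
    """Percent change in the odds for a c-unit increase in the predictor."""
    return (odds_ratio(b, c) - 1.0) * 100.0


b_age = 0.0285  # slope for age reported in the notes

for years in (1, 5, 10):
    print(
        years,
        round(odds_ratio(b_age, years), 3),
        round(percent_change_in_odds(b_age, years), 2),
    )
```

Note that the 10-year odds ratio is the square of the 5-year odds ratio rather than twice its percent change, because the effect is multiplicative on the odds scale.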

Model Fit We ended last session fitting a simple (1-predictor) binary logit model to the Outbreak data using SAS. We will now continue covering the SAS PROC LOGISTIC output.

Model Fit Statistics All of these statistics assess model fit, that is, how well the model explains the observed responses.

Model Fit Statistics -2 Log L The -2 Log-Likelihood is a transformation of the Likelihood function (L). L is a quantification of how well the model fits the sample data.

Model Fit Statistics Both AIC & SC are variants of the -2 Log L that penalize for model complexity (the number of predictor variables).

Model Fit Statistics AIC Akaike Information Criterion. Can be used to compare competing models, including non-nested ones. Smaller is better. AIC is only meaningful in relation to another model’s AIC value.

Model Fit Statistics SC Schwarz Criterion. Very much like AIC, but the penalty also grows with the sample size, so SC tends to favor simpler models than AIC.

Model Fit Statistics Choose either AIC or SC (not both) and use the values under the heading ‘Intercept and Covariates’ to compare to competing models.
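The two criteria can be sketched from their usual definitions: AIC = -2 Log L + 2p and SC = -2 Log L + p*ln(n), where p counts the estimated parameters (intercept included) and n is the sample size. In the sketch below the -2 Log L values are hypothetical placeholders, since the notes do not report them; only n = 196 comes from the Outbreak example.

```python
import math


def aic(neg2_log_l, n_params):
    """Akaike Information Criterion: -2 Log L plus 2 per estimated parameter."""
    return neg2_log_l + 2 * n_params


def sc(neg2_log_l, n_params, n_obs):
    """Schwarz Criterion: -2 Log L plus ln(n) per estimated parameter."""
    return neg2_log_l + n_params * math.log(n_obs)


n_obs = 196  # sample size from the Outbreak example

# HYPOTHETICAL -2 Log L values, for illustration only (not from the SAS output).
candidates = {
    "age only (intercept + 1 slope)": (230.0, 2),
    "age + one more predictor": (228.5, 3),
}

for label, (neg2_log_l, n_params) in candidates.items():
    print(label, aic(neg2_log_l, n_params), round(sc(neg2_log_l, n_params, n_obs), 2))
```

With n = 196, each extra parameter costs about 5.28 under SC versus 2 under AIC, which is why SC leans toward simpler models.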

The model equation.

Inference: The Coefficients. Instead of a t-test for the significance of a coefficient (like in linear regression), we have a Wald Chi-Squared test.

Inference: The Coefficients. Remember, typically we do not evaluate the intercept, but rather focus on the test for each predictor.

Inference: The Coefficients. In this case, age is a statistically significant predictor of disease status at the α = .05 level, χ²(1) = 11.53, p = .0007.
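The Wald statistic is the squared ratio of a coefficient to its standard error, compared against a chi-square distribution with 1 degree of freedom. The sketch below uses only the Python standard library. The notes do not report the standard error, so the value computed here is back-solved from the reported b1 and chi-square and should be treated as implied, not as SAS output.

```python
import math


def wald_chi_square(b, se):
    """Wald chi-square statistic for testing that a single coefficient is 0."""
    return (b / se) ** 2


def chi_square_p_value_df1(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For 1 df, P(chi-square > x) equals P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


b_age = 0.0285       # reported in the notes
chi_square = 11.53   # reported in the notes

# Implied (not reported) standard error, back-solved from the two values above.
implied_se = abs(b_age) / math.sqrt(chi_square)

print(round(implied_se, 4))
print(round(wald_chi_square(b_age, implied_se), 2))
print(round(chi_square_p_value_df1(chi_square), 4))
```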

Inference: The Coefficients. One can also obtain CI’s for the parameter estimates using CL option in the MODEL statement of PROC LOGISTIC.

Inference: The Coefficients. As we found in linear regression, we can conclude that a given predictor is statistically significant at the α = .05 level if the 95% CI does not include the null value of 0.

Inference: The Coefficients. Therefore, our best estimate of the change in the log-odds for each additional year of age is 0.0285, and we are 95% confident that this change lies between 0.0120 and 0.0449 in the population.

Inference: The Coefficients. Furthermore: exp(.0285) = 1.029, exp(.0120) = 1.012, and exp(.0449) = 1.046. Therefore, we estimate that a person’s odds of contracting the disease are multiplied by 1.029 for each additional year of age, and we are 95% confident that this odds ratio lies between 1.012 and 1.046 in the population.

Inference: The Coefficients. Of course, we no longer have to compute these odds ratio estimates by hand, because SAS provides them for us.

Inference: The Coefficients. Furthermore: (exp(.0285)-1)*100% = 2.89%, (exp(.0120)-1)*100% = 1.21%, and (exp(.0449)-1)*100% = 4.59%. We can state that the odds of contracting the disease increase by 2.89% with each additional year in age, and we are 95% confident that this increase lies between 1.21% and 4.59% in the population.
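Because exp() is an increasing function, a confidence interval on the log-odds scale is carried to the odds-ratio scale by exponentiating its endpoints. A minimal sketch using the estimate and limits reported in the notes (the function names are illustrative):

```python
import math


def to_odds_ratio_scale(lower, estimate, upper):
    """Exponentiate a log-odds estimate and its CI limits to get odds ratios."""
    return tuple(math.exp(v) for v in (lower, estimate, upper))


def to_percent_change(odds_ratios):
    """Express odds ratios as percent changes in the odds."""
    return tuple((ratio - 1.0) * 100.0 for ratio in odds_ratios)


# Lower limit, point estimate, upper limit for the age slope (from the notes).
or_lower, or_est, or_upper = to_odds_ratio_scale(0.0120, 0.0285, 0.0449)
pct_lower, pct_est, pct_upper = to_percent_change((or_lower, or_est, or_upper))

print(round(or_lower, 3), round(or_est, 3), round(or_upper, 3))
print(round(pct_lower, 2), round(pct_est, 2), round(pct_upper, 2))
```

On the odds-ratio scale the null value is 1 rather than 0, so significance at the .05 level corresponds to a 95% interval that excludes 1.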

Final Note: Model Fitting Realize that in order to estimate the model parameters, the data must contain a substantial number of observations in each response category. For example, one will not be able to estimate the risk of contracting a disease if the data set does not contain any individuals who have been diagnosed with the disease.

Final Note: Model Fitting Essentially, then, in order to estimate the probability of either a success or failure, the data set must contain a substantial number (> 30 is best) of observations that experienced a success and a substantial number that experienced a failure.

More about output. PROC LOGISTIC provides more information concerning how the model fits the sample data.

More about Model Fit Percent Concordant A pair of observations with different observed responses is considered concordant if the observation with the lower ordered response value has a lower predicted value than the observation with a higher ordered response value.

More about Model Fit Percent Discordant A pair is considered discordant if the observation with the lower ordered response value has a higher predicted value than the observation with the higher ordered response value.

More about Model Fit Percent Tied A pair with different responses is considered tied if it is neither concordant nor discordant.

More about Model Fit Somers’ D, Gamma, & Tau-a These are statistics that measure the strength and direction of the association between the observed responses and the predicted values, based on the concordant and discordant pairs.

More about Model Fit Somers’ D & Tau-a Like r, these vary between -1.0 (all pairs discordant) & +1.0 (all pairs are concordant). Somers’ D = the difference between the % concordant and the % discordant, divided by 100.

More about Model Fit Gamma Gamma is a similar statistic: its values also range between -1.0 & +1.0, however tied pairs are left out of its calculation. For all of these statistics, 0 = no association & ±1.0 = perfect (negative or positive) association.
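The pair counts and the statistics built on them can be sketched in Python. This assumes the response is coded 0/1 with 1 as the event and that the predicted value is the probability of a 1, so a pair is concordant when the observation with y = 1 has the higher predicted probability. The data are a made-up toy sample, and SAS may group predicted probabilities into narrow bins before counting, so its reported values can differ slightly from a direct pairwise count like this one.

```python
from itertools import product


def association_stats(y, p):
    """Count concordant/discordant/tied pairs and derive rank statistics.

    y: observed 0/1 responses; p: predicted probabilities of a 1.
    Only pairs with different observed responses are compared.
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    nc = nd = nt = 0
    for p_event, p_nonevent in product(events, nonevents):
        if p_event > p_nonevent:
            nc += 1          # concordant
        elif p_event < p_nonevent:
            nd += 1          # discordant
        else:
            nt += 1          # tied
    t = len(events) * len(nonevents)   # pairs with different responses
    n = len(y)
    return {
        "pct_concordant": 100.0 * nc / t,
        "pct_discordant": 100.0 * nd / t,
        "pct_tied": 100.0 * nt / t,
        "somers_d": (nc - nd) / t,
        "gamma": (nc - nd) / (nc + nd),
        "tau_a": (nc - nd) / (0.5 * n * (n - 1)),
    }


# Toy data for illustration only (not the Outbreak data).
y_toy = [0, 0, 0, 1, 1, 0, 1, 1]
p_toy = [0.1, 0.3, 0.5, 0.5, 0.7, 0.6, 0.4, 0.9]

stats = association_stats(y_toy, p_toy)
for name, value in stats.items():
    print(name, round(value, 4))
```

Tau-a divides by all possible pairs, including pairs with the same observed response, so it is smaller in magnitude than Somers' D for the same data.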

Predicted Values The output of a logit model is the predicted probability of a success for each observation.

Predicted Values These are obtained and stored in a separate SAS data set using the OUTPUT statement (see the following code).

Predicted Values PROC LOGISTIC outputs the predicted values and 95% CI limits to an output data set that also contains the original raw data.

Predicted Values Use the PREDPROBS = I option in order to obtain the predicted category (which is saved in the _INTO_ variable).

Predicted Values _FROM_ = The observed response category = The same value as the response variable.

Predicted Values _INTO_ = The predicted response category.

Predicted Values IP_1 = The Individual Probability of a response of 1.
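To make the output variables concrete, the sketch below mimics how a predicted probability (the analogue of IP_1) and a predicted category (the analogue of _INTO_) follow from a fitted one-predictor model. The intercept used here is a hypothetical placeholder because the notes do not report it; only the age slope 0.0285 comes from the notes. Assigning category 1 when the probability is at least 0.5 is also an assumption about how ties at exactly 0.5 are handled.

```python
import math


def predicted_probability(b0, b1, x):
    """Predicted probability of a response of 1 from a one-predictor logit model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def predicted_category(prob, cutoff=0.5):
    """Predicted response category: the more probable of the two categories."""
    return 1 if prob >= cutoff else 0


B0_HYPOTHETICAL = -1.5  # placeholder intercept, NOT from the SAS output
B1_AGE = 0.0285         # age slope reported in the notes

for age in (20, 40, 60):
    ip_1 = predicted_probability(B0_HYPOTHETICAL, B1_AGE, age)
    into = predicted_category(ip_1)
    print(age, round(ip_1, 3), into)
```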

Scoring Observations in SAS Obtaining predicted probabilities and/or predicted outcomes (categories) for new observations (i.e., scoring new observations) is done in logit modeling using the same procedure we used in scoring new observations in linear regression.

Scoring Observations in SAS (1) Create a new data set with the desired values of the x variables and the y variable set to missing. (2) Merge the new data set with the original data set. (3) Refit the final model in PROC LOGISTIC with the OUTPUT statement.

Classification Table & Rates A Classification Table is used to summarize the results of the predictions and to ultimately evaluate the fit of the model. Obtain a classification table using PROC FREQ.

Classification Table & Rates The observed (or actual) response is in rows and the predicted response is in columns.

Classification Table & Rates Correct classifications are summarized on the main diagonal.

Classification Table & Rates The total number of correct classifications (i.e., ‘hits’) is the sum of the main diagonal frequencies. O = 130+9 = 139

Classification Table & Rates The total-group hit rate is the ratio of O and N. HR = 139/196 = .709

Classification Table & Rates Individual group hit rates can also be calculated. These are essentially the row percents on the main diagonal.
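Both kinds of hit rate can be computed from a table with observed categories in rows and predicted categories in columns. In the sketch below, the total-group calculation uses the diagonal counts (130 and 9) and N = 196 given in the notes. The off-diagonal counts are not given, so the group hit rates are illustrated with a separate, made-up table.

```python
def hit_rates(table):
    """Return (total hit rate, list of per-group hit rates).

    table[i][j] = number of observations with observed category i
    and predicted category j.
    """
    n = sum(sum(row) for row in table)
    hits = sum(table[i][i] for i in range(len(table)))
    total = hits / n
    by_group = [table[i][i] / sum(table[i]) for i in range(len(table))]
    return total, by_group


# Total-group hit rate from the counts stated in the notes.
total_from_notes = (130 + 9) / 196
print(round(total_from_notes, 3))

# Made-up table for illustration only (NOT the Outbreak classification table).
toy_table = [
    [40, 10],  # observed 0: 40 predicted 0, 10 predicted 1
    [5, 45],   # observed 1: 5 predicted 0, 45 predicted 1
]
toy_total, toy_by_group = hit_rates(toy_table)
print(toy_total, toy_by_group)
```

A high total hit rate can hide a poor hit rate in the smaller group, which is why the row-wise rates are worth reporting alongside it.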