# Unit 4a: Basic Logistic (Binomial Logit) Regression Analysis

© Andrew Ho, Harvard Graduate School of Education — Unit 4a, Slide 1


http://xkcd.com/74/
http://xkcd.com/210/

## Slide 2: Course Roadmap — Unit 4a, Today's Topic Area

Today's topics:

- Exploratory data analysis with dichotomous outcome variables
- How our familiar regression model fails these data
- An initial look at logistic regression results

Within the roadmap for Multiple Regression Analysis (MRA): do your residuals meet the required assumptions? Test for residual normality, and use influence statistics to detect atypical data points.

- If your residuals are not independent: replace OLS with GLS regression analysis, use individual growth modeling, or specify a multilevel model. If time is a predictor, you need discrete-time survival analysis.
- If your outcome vs. predictor relationship is non-linear: transform the outcome or predictor, or use non-linear regression analysis.
- If your outcome is categorical, you need binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome) — today's topic area.
- If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Factor Analysis (EFA or CFA?); use Cluster Analysis.

## Slide 4: Eyes on the Data — Structure of the Dataset

| Col. # | Variable | Description | Metric/Labels |
|---|---|---|---|
| 1 | HOME | Does the married woman work in the home as a homemaker? | Dichotomous outcome: 0 = no, 1 = yes |
| 2 | HUBSAL | The husband's annual income, in 1976 Canadian dollars | $1,000s |
| 3 | CHILD | Are there children present in the home? | Dichotomous predictor: 0 = no, 1 = yes |

The first rows of the raw data (HOME HUBSAL CHILD):

```
1 15 1
1 13 1
1 45 1
1 23 1
1 19 1
1  7 1
1 15 1
0  7 1
1 15 1
1 23 1
0 13 1
1  9 1
…
```

We've already demonstrated the use of dichotomous predictors. Why not dichotomous outcome variables? We'll try it and see. What could possibly go wrong?

HOME is a categorical (dichotomous) outcome variable. What's the best way to model the relationship between a binary outcome and regular predictors like CHILD and HUBSAL?

## Slide 5: Loading the Data, Visualizing/Summarizing the Outcome Variable

```stata
*---------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and selected values.
*---------------------------------------------------------------------------
* Input the target dataset:
infile HOME HUBSAL CHILD ///
    using "C:\Users\Andrew Ho\Documents\Dropbox\S-052\Raw Data\AT_HOME.txt"

* Label the principal variables:
label variable HOME   "Is Woman a Homemaker?"
label variable HUBSAL "Husband's Annual Salary (in $1,000)"
label variable CHILD  "Are Children Present in the Home?"

* Label the values of important categorical variables:
* Dichotomous outcome HOME:
label define homelbl 0 "In Labor Force" 1 "Homemaker"
label values HOME homelbl
* Dichotomous secondary question predictor CHILD:
label define childlbl 0 "No Child" 1 "Children at Home"
label values CHILD childlbl

*---------------------------------------------------------------------------
* Obtain descriptive statistics on the sample HOME/HUBSAL relationship.
*---------------------------------------------------------------------------
* Examine the sample univariate distribution of HOME:
hist HOME, discrete percent ylabel(0(20)100) xlabel(0(1)1) name(Unit4a_g1)
summarize HOME

* Inspect the sample bivariate relationship of outcome HOME and predictor HUBSAL:
scatter HOME HUBSAL, jitter(7) msize(small) name(Unit4a_g2, replace)
graph hbox HUBSAL, over(HOME, descending) name(Unit4a_g3, replace)
```

Notes on the code:

- Standard input statements.
- As illustrated in earlier Stata code, where categorical variables are involved you can define a value label (e.g., `homelbl`) to contain the value labels, and then associate the label with the variable of interest when needed.
- The `hist` and `summarize` lines request standard univariate descriptive plots and statistics on the dichotomous outcome, HOME.
- The `scatter` and `graph hbox` lines request bivariate plots of the dichotomous outcome HOME against the continuous predictor HUBSAL.

## Slide 6: Visualizing/Summarizing the Dichotomous Outcome Variable

## Slide 7: The Bivariate Distribution of HOME on HUBSAL

A jittered scatterplot shows the sample relationship between the dichotomous outcome variable HOME and the continuous predictor, HUBSAL.

## Slide 8: The Linearity Assumption and Its Failure

High-priority conditions must be met for accurate statistical inference with linear OLS regression; most of them fall under the heading of "independent and identically normally distributed errors."

The linearity assumption: "In the population, the bivariate relationship between the outcome and each predictor must be linear."

How does failure of this assumption affect OLS regression analysis? If the modeled relationship is not linear, it will be misrepresented by the linear regression analysis, and the fundamental underpinnings of the entire analysis are at risk:

- The OLS-estimated regression slope will not represent the population relationship.
- Assumptions about the population residuals (sometimes called, simply, "errors") will be violated.
- Estimated residuals will be incorrect.
- Statistical inference will be incorrect.

## Slide 9: The Linearity Assumption

## Slide 10: Fitting a Linear Model to a Dichotomous Outcome Variable

## Slide 11: Residual Diagnostics

There is quite a healthy amount of vertical variation in the middle range of fitted values, but very little vertical variation in the extremes.

- We are often fairly forgiving of heteroscedasticity; we might resolve it with "weighted least squares," if anything.
- In the case of a dichotomous outcome variable, however, the problems (heteroscedasticity, nonlinearity) are so predictable, the implications are so atheoretical (predictions outside [0, 1], a linear fit to a nonlinear relationship), and the alternatives are so attractive and straightforward (logistic regression), that we never fit linear models to dichotomous outcomes.
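The two predictable problems above can be reproduced on toy data. This Python sketch (illustrative only — not the AT_HOME data, which the slides analyze in Stata) fits a least-squares line to a 0/1 outcome and shows that the fitted "probabilities" escape [0, 1], and that the Bernoulli residual variance p(1 − p) necessarily shrinks toward zero in the extremes:

```python
# Hypothetical illustration: a linear probability model on toy binary data.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Toy data: the outcome becomes more likely as the predictor grows.
x = list(range(1, 21))
y = [0] * 6 + [0, 1, 0, 1, 0, 1, 1, 1] + [1] * 6
b0, b1 = ols_fit(x, y)

# Linear fitted "probabilities" can escape [0, 1]:
fits = [b0 + b1 * xi for xi in x]
print(min(fits), max(fits))  # below 0 at one end, above 1 at the other

# Bernoulli residual variance p(1 - p) peaks at p = 0.5 and shrinks to 0
# at the extremes -- the "very little vertical variation in the extremes":
for p in (0.1, 0.5, 0.9):
    print(p, round(p * (1 - p), 2))
```

The heteroscedasticity is not incidental here: it follows mechanically from the outcome being Bernoulli, which is why it is "so predictable."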

## Slide 12: Residual Normality

The residuals certainly don't seem normally distributed, so it is not surprising that we reject the null hypothesis of normally distributed population residuals.

## Slide 13: Wouldn't It Be Nice…

- Wouldn't it be nice if there were some way to fit a… nonlinear… regression model to these data?
- One way to think about this nonlinear model might be as a transformed outcome variable that "stretches" extreme proportions, accounting for the smaller variance we know exists in that region…
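A short sketch of that "stretching" idea, assuming the transformation in question is the logit (log-odds), which is what the logistic model introduced next implies. The logit maps probabilities in (0, 1) onto the whole real line, expanding differences near 0 and 1:

```python
import math

def logit(p):
    """Log-odds transform of a probability in (0, 1)."""
    return math.log(p / (1 - p))

# Equal 0.04 steps in probability are NOT equal steps in log-odds:
mid_step = logit(0.50) - logit(0.46)   # modest stretch mid-range
tail_step = logit(0.99) - logit(0.95)  # much larger stretch in the tail
print(mid_step, tail_step)
```

The same 0.04 change in proportion corresponds to a log-odds change roughly ten times larger near 1 than near 0.5, which is exactly the "stretching" of extreme proportions described above.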

## Slide 14: The Logistic Regression Model

Because of the linear model's flaws, we recommend the non-linear logistic function as a credible and interpretable model for representing the population relationship between the underlying probability that a married woman is a homemaker and predictors like the husband's salary, HUBSAL.

In a logistic regression model, the outcome is specified in a way that is consistent with our intuition about the analysis of categorical outcomes: we model the underlying probability that a married woman is a homemaker.

The population logistic regression model has a non-linear functional form so that it can provide the properties we require of a hypothesized relationship between a probability and its predictors. In a logistic regression model, the hypothesized trend line:

- Cannot drop below zero (the "lower asymptote"),
- Cannot exceed unity (the "upper asymptote"),
- Makes a smooth and sensible transition between these asymptotes.

But what do these parameters represent?
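The logistic function itself appears to have been lost when the slide was extracted to text. The standard binomial logit form consistent with the surrounding discussion (parameters β₀ and β₁, an outcome probability bounded by asymptotes at 0 and 1) is:

```latex
p_i \;=\; \Pr(\mathrm{HOME}_i = 1 \mid \mathrm{HUBSAL}_i)
    \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathrm{HUBSAL}_i)}}
```

As the linear predictor β₀ + β₁·HUBSAL goes to −∞ the probability approaches 0, and as it goes to +∞ the probability approaches 1, giving the two asymptotes listed above.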

## Slide 15: Intuition from Plots

Here are some Excel plots to provide intuition into how the logistic regression model works. All logistic curves approach an upper asymptote of 1 and a lower asymptote of 0.

## Slide 16: … and a Few More

## Slide 17: The Logistic Regression Model

We consider the non-linear logistic regression model for representing the hypothesized population relationship between the dichotomous outcome, HOME, and predictors. This will be our statistical model for relating a categorical outcome to predictors, and we will fit it to data using nonlinear regression analysis.

- The outcome being modeled is the underlying probability that the value of outcome HOME equals 1.
- Parameter β₁ determines the slope of the curve but is not equal to it (in fact, the slope is different at every point on the curve).
- Parameter β₀ determines the intercept of the curve but is not equal to it.
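The claim that β₁ "determines the slope but is not equal to it" can be checked numerically. In this sketch (illustrative β values, not estimates from the AT_HOME data), the slope of the logistic curve is steepest at the midpoint, where it equals β₁/4, and flattens toward the asymptotes:

```python
import math

b0, b1 = -2.0, 0.5  # hypothetical parameter values for illustration

def p(x):
    """Logistic curve p(x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def slope(x, h=1e-6):
    """Numerical derivative of p at x; it differs at every x."""
    return (p(x + h) - p(x - h)) / (2 * h)

# The slope is steepest at the midpoint x = -b0/b1 (where p = 0.5),
# and there it equals b1/4 -- b1 governs, but does not equal, the slope.
mid = -b0 / b1
print(p(mid))           # 0.5 at the midpoint
print(slope(mid))       # approximately b1/4
print(slope(mid + 8))   # much flatter far from the midpoint
```

Analytically, dp/dx = β₁·p(x)(1 − p(x)), which is maximized when p = 0.5, giving the β₁/4 value the code recovers.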

## Slide 18: Building the Logistic Regression Model — The Unconditional Model

- We recall from multilevel modeling that we wish to maximize our likelihood ("maximum likelihood").
- Because the likelihood is a product of many, many small probabilities, we maximize the sum of log-likelihoods, an attempt at making a negative number as positive as possible.
- Later, we'll use the difference in −2 × log-likelihoods (the deviance) in a statistical test to compare models.
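The quantities named above can be computed directly for an unconditional (intercept-only) model on a toy 0/1 sample (hypothetical data, not the AT_HOME dataset). For the intercept-only model, the maximum-likelihood estimate of the probability is simply the sample proportion of 1s:

```python
import math

y = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # toy dichotomous outcome

def log_likelihood(y, p):
    # A product of many small Bernoulli probabilities becomes a sum of
    # logs -- a negative number we try to make as positive as possible.
    return sum(math.log(p) if yi == 1 else math.log(1 - p) for yi in y)

p_hat = sum(y) / len(y)          # ML estimate: the sample proportion
ll = log_likelihood(y, p_hat)    # maximized log-likelihood (negative)
deviance = -2 * ll               # -2 * log-likelihood

print(p_hat, ll, deviance)

# Any other candidate p gives a smaller log-likelihood (larger deviance):
print(log_likelihood(y, 0.5) < ll)
```

Differences in these deviances between nested models are what later feed the model-comparison tests the slide mentions.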

## Slide 19: Building the Logistic Regression Model

## Slide 20: Graphical Interpretation of the Logistic Regression Model

Comparing local polynomial, linear, and logistic fits to the data.
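To see how such a logistic fit arises, here is a minimal sketch that fits a one-predictor logistic model by Newton's method on toy data (hypothetical, not AT_HOME; the slides' Stata `logit` command performs the equivalent maximum-likelihood fit). Unlike the straight-line fit compared above, the logistic fitted values never leave (0, 1):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy predictor
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]    # toy 0/1 outcome (non-separable)

def p(b0, b1, xi):
    """Logistic curve: fitted probability that y = 1 at xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

b0 = b1 = 0.0
for _ in range(25):
    fitted = [p(b0, b1, xi) for xi in x]
    # Gradient of the Bernoulli log-likelihood...
    g0 = sum(yi - fi for yi, fi in zip(y, fitted))
    g1 = sum((yi - fi) * xi for yi, fi, xi in zip(y, fitted, x))
    # ...and the 2x2 information matrix, built from the weights p(1-p):
    w = [fi * (1 - fi) for fi in fitted]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system for the parameter update.
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

fits = [p(b0, b1, xi) for xi in x]
print(min(fits), max(fits))  # strictly inside (0, 1)
```

At convergence the gradient is (numerically) zero, the defining condition of the maximum-likelihood estimates that Stata reports.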
