# Unit 4a: Basic Logistic (Binomial Logit) Regression Analysis

© Andrew Ho, Harvard Graduate School of Education — Unit 4a, Slide 1


http://xkcd.com/74/
http://xkcd.com/210/

## Slide 2: Course Roadmap — Unit 4a, Today's Topic Area

Today's topics:

- Exploratory data analysis with dichotomous outcome variables
- How our familiar regression model fails these data
- An initial look at logistic regression results

Within the roadmap for Multiple Regression Analysis (MRA): do your residuals meet the required assumptions? Test for residual normality, and use influence statistics to detect atypical data points.

- If your residuals are not independent: replace OLS with GLS regression analysis, use individual growth modeling, or specify a multilevel model. If time is a predictor, you need discrete-time survival analysis.
- If your outcome vs. predictor relationship is non-linear: transform the outcome or predictor, or use non-linear regression analysis.
- If your outcome is categorical, you need binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome) — today's topic area.
- If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Factor Analysis (EFA or CFA?); use Cluster Analysis.

## Slide 4: Eyes on the Data — Structure of the Dataset

| Col. # | Variable | Description | Metric/Labels |
|---|---|---|---|
| 1 | HOME | Does the married woman work in the home as a homemaker? | Dichotomous outcome: 0 = no, 1 = yes |
| 2 | HUBSAL | The husband's annual income, in 1976 Canadian dollars | $1,000s |
| 3 | CHILD | Are there children present in the home? | Dichotomous predictor: 0 = no, 1 = yes |

The first rows of the raw data (HOME HUBSAL CHILD):

```
1 15 1
1 13 1
1 45 1
1 23 1
1 19 1
1  7 1
1 15 1
0  7 1
1 15 1
1 23 1
0 13 1
1  9 1
…
```

We've already demonstrated the use of dichotomous predictors. Why not dichotomous outcome variables? We'll try it and see. What could possibly go wrong?

HOME is a categorical (dichotomous) outcome variable. What's the best way to model the relationship between a binary outcome and regular predictors like CHILD and HUBSAL?

## Slide 5: Loading the Data, Visualizing/Summarizing the Outcome Variable

```stata
*---------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and selected values.
*---------------------------------------------------------------------------
* Input the target dataset:
infile HOME HUBSAL CHILD ///
    using "C:\Users\Andrew Ho\Documents\Dropbox\S-052\Raw Data\AT_HOME.txt"

* Label the principal variables:
label variable HOME   "Is Woman a Homemaker?"
label variable HUBSAL "Husband's Annual Salary (in $1,000)"
label variable CHILD  "Are Children Present in the Home?"

* Label the values of important categorical variables:
* Dichotomous outcome HOME:
label define homelbl 0 "In Labor Force" 1 "Homemaker"
label values HOME homelbl
* Dichotomous secondary question predictor CHILD:
label define childlbl 0 "No Child" 1 "Children at Home"
label values CHILD childlbl

*---------------------------------------------------------------------------
* Obtain descriptive statistics on the sample HOME/HUBSAL relationship.
*---------------------------------------------------------------------------
* Examine the sample univariate distribution of HOME:
hist HOME, discrete percent ylabel(0(20)100) xlabel(0(1)1) name(Unit4a_g1)
summarize HOME

* Inspect the sample bivariate relationship of outcome HOME and predictor HUBSAL:
scatter HOME HUBSAL, jitter(7) msize(small) name(Unit4a_g2, replace)
graph hbox HUBSAL, over(HOME, descending) name(Unit4a_g3, replace)
```

Notes on the code:

- Standard input statements.
- As illustrated in earlier Stata code, where categorical variables are involved you can define a value label (e.g., `homelbl`) to contain the value labels, and then associate the label with the variable of interest when needed.
- The `hist` and `summarize` lines request standard univariate descriptive plots and statistics on the dichotomous outcome, HOME.
- The `scatter` and `graph hbox` lines request bivariate plots of the dichotomous outcome HOME against the continuous predictor HUBSAL.

## Slide 6: Visualizing/Summarizing the Dichotomous Outcome Variable

## Slide 7: The Bivariate Distribution of HOME on HUBSAL

A jittered scatterplot shows the sample relationship between the dichotomous outcome variable HOME and the continuous predictor, HUBSAL.

## Slide 8: The Linearity Assumption and Its Failure

High-priority conditions must be met for accurate statistical inference with linear OLS regression; most of them fall under the heading of "independent and identically normally distributed errors."

The linearity assumption: "In the population, the bivariate relationship between the outcome and each predictor must be linear."

How does failure of this assumption affect OLS regression analysis? If the modeled relationship is not linear, it will be misrepresented by the linear regression analysis, and the fundamental underpinnings of the entire analysis are at risk:

- The OLS-estimated regression slope will not represent the population relationship.
- Assumptions about the population residuals (sometimes called, simply, "errors") will be violated.
- Estimated residuals will be incorrect.
- Statistical inference will be incorrect.

## Slide 9: The Linearity Assumption

## Slide 10: Fitting a Linear Model to a Dichotomous Outcome Variable

## Slide 11: Residual Diagnostics

There is quite a healthy amount of vertical variation in the middle range of fitted values, but very little vertical variation in the extremes.

- We are often fairly forgiving of heteroscedasticity; we might resolve it with "weighted least squares," if anything.
- In the case of a dichotomous outcome variable, however, the problems (heteroscedasticity, nonlinearity) are so predictable, the implications are so atheoretical (predictions outside [0, 1], a linear fit to a nonlinear relationship), and the alternatives are so attractive and straightforward (logistic regression), that we never fit linear models to dichotomous outcomes.
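The two predictable problems above can be reproduced on toy data. This Python sketch (illustrative only — not the AT_HOME data, which the slides analyze in Stata) fits a least-squares line to a 0/1 outcome and shows that the fitted "probabilities" escape [0, 1], and that the Bernoulli residual variance p(1 − p) necessarily shrinks toward zero in the extremes:

```python
# Hypothetical illustration: a linear probability model on toy binary data.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Toy data: the outcome becomes more likely as the predictor grows.
x = list(range(1, 21))
y = [0] * 6 + [0, 1, 0, 1, 0, 1, 1, 1] + [1] * 6
b0, b1 = ols_fit(x, y)

# Linear fitted "probabilities" can escape [0, 1]:
fits = [b0 + b1 * xi for xi in x]
print(min(fits), max(fits))  # below 0 at one end, above 1 at the other

# Bernoulli residual variance p(1 - p) peaks at p = 0.5 and shrinks to 0
# at the extremes -- the "very little vertical variation in the extremes":
for p in (0.1, 0.5, 0.9):
    print(p, round(p * (1 - p), 2))
```

The heteroscedasticity is not incidental here: it follows mechanically from the outcome being Bernoulli, which is why it is "so predictable."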

## Slide 12: Residual Normality

The residuals certainly don't seem normally distributed, so it is not surprising that we reject the null hypothesis of normally distributed population residuals.

## Slide 13: Wouldn't It Be Nice…

- Wouldn't it be nice if there were some way to fit a… nonlinear… regression model to these data?
- One way to think about this nonlinear model might be as a transformed outcome variable that "stretches" extreme proportions, accounting for the smaller variance we know exists in that region…
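A short sketch of that "stretching" idea, assuming the transformation in question is the logit (log-odds), which is what the logistic model introduced next implies. The logit maps probabilities in (0, 1) onto the whole real line, expanding differences near 0 and 1:

```python
import math

def logit(p):
    """Log-odds transform of a probability in (0, 1)."""
    return math.log(p / (1 - p))

# Equal 0.04 steps in probability are NOT equal steps in log-odds:
mid_step = logit(0.50) - logit(0.46)   # modest stretch mid-range
tail_step = logit(0.99) - logit(0.95)  # much larger stretch in the tail
print(mid_step, tail_step)
```

The same 0.04 change in proportion corresponds to a log-odds change roughly ten times larger near 1 than near 0.5, which is exactly the "stretching" of extreme proportions described above.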

## Slide 14: The Logistic Regression Model

Because of the linear model's flaws, we recommend the non-linear logistic function as a credible and interpretable model for representing the population relationship between the underlying probability that a married woman is a homemaker and predictors like the husband's salary, HUBSAL.

In a logistic regression model, the outcome is specified in a way that is consistent with our intuition about the analysis of categorical outcomes: we model the underlying probability that a married woman is a homemaker.

The population logistic regression model has a non-linear functional form so that it can provide the properties we require of a hypothesized relationship between a probability and its predictors. In a logistic regression model, the hypothesized trend line:

- Cannot drop below zero (the "lower asymptote"),
- Cannot exceed unity (the "upper asymptote"),
- Makes a smooth and sensible transition between these asymptotes.

But what do these parameters represent?
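The logistic function itself appears to have been lost when the slide was extracted to text. The standard binomial logit form consistent with the surrounding discussion (parameters β₀ and β₁, an outcome probability bounded by asymptotes at 0 and 1) is:

```latex
p_i \;=\; \Pr(\mathrm{HOME}_i = 1 \mid \mathrm{HUBSAL}_i)
    \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\mathrm{HUBSAL}_i)}}
```

As the linear predictor β₀ + β₁·HUBSAL goes to −∞ the probability approaches 0, and as it goes to +∞ the probability approaches 1, giving the two asymptotes listed above.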

## Slide 15: Intuition from Plots

Here are some Excel plots to provide intuition into how the logistic regression model works. All logistic curves approach an upper asymptote of 1 and a lower asymptote of 0.

## Slide 16: … and a Few More

## Slide 17: The Logistic Regression Model

We consider the non-linear logistic regression model for representing the hypothesized population relationship between the dichotomous outcome, HOME, and predictors. This will be our statistical model for relating a categorical outcome to predictors, and we will fit it to data using nonlinear regression analysis.

- The outcome being modeled is the underlying probability that the value of outcome HOME equals 1.
- Parameter β₁ determines the slope of the curve but is not equal to it (in fact, the slope is different at every point on the curve).
- Parameter β₀ determines the intercept of the curve but is not equal to it.
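The claim that β₁ "determines the slope but is not equal to it" can be checked numerically. In this sketch (illustrative β values, not estimates from the AT_HOME data), the slope of the logistic curve is steepest at the midpoint, where it equals β₁/4, and flattens toward the asymptotes:

```python
import math

b0, b1 = -2.0, 0.5  # hypothetical parameter values for illustration

def p(x):
    """Logistic curve p(x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def slope(x, h=1e-6):
    """Numerical derivative of p at x; it differs at every x."""
    return (p(x + h) - p(x - h)) / (2 * h)

# The slope is steepest at the midpoint x = -b0/b1 (where p = 0.5),
# and there it equals b1/4 -- b1 governs, but does not equal, the slope.
mid = -b0 / b1
print(p(mid))           # 0.5 at the midpoint
print(slope(mid))       # approximately b1/4
print(slope(mid + 8))   # much flatter far from the midpoint
```

Analytically, dp/dx = β₁·p(x)(1 − p(x)), which is maximized when p = 0.5, giving the β₁/4 value the code recovers.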

## Slide 18: Building the Logistic Regression Model — The Unconditional Model

- We recall from multilevel modeling that we wish to maximize our likelihood ("maximum likelihood").
- Because the likelihood is a product of many, many small probabilities, we maximize the sum of log-likelihoods, an attempt at making a negative number as positive as possible.
- Later, we'll use the difference in −2 × log-likelihoods (the deviance) in a statistical test to compare models.
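The quantities named above can be computed directly for an unconditional (intercept-only) model on a toy 0/1 sample (hypothetical data, not the AT_HOME dataset). For the intercept-only model, the maximum-likelihood estimate of the probability is simply the sample proportion of 1s:

```python
import math

y = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # toy dichotomous outcome

def log_likelihood(y, p):
    # A product of many small Bernoulli probabilities becomes a sum of
    # logs -- a negative number we try to make as positive as possible.
    return sum(math.log(p) if yi == 1 else math.log(1 - p) for yi in y)

p_hat = sum(y) / len(y)          # ML estimate: the sample proportion
ll = log_likelihood(y, p_hat)    # maximized log-likelihood (negative)
deviance = -2 * ll               # -2 * log-likelihood

print(p_hat, ll, deviance)

# Any other candidate p gives a smaller log-likelihood (larger deviance):
print(log_likelihood(y, 0.5) < ll)
```

Differences in these deviances between nested models are what later feed the model-comparison tests the slide mentions.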

## Slide 19: Building the Logistic Regression Model

## Slide 20: Graphical Interpretation of the Logistic Regression Model

Comparing local polynomial, linear, and logistic fits to the data.
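To see how such a logistic fit arises, here is a minimal sketch that fits a one-predictor logistic model by Newton's method on toy data (hypothetical, not AT_HOME; the slides' Stata `logit` command performs the equivalent maximum-likelihood fit). Unlike the straight-line fit compared above, the logistic fitted values never leave (0, 1):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy predictor
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]    # toy 0/1 outcome (non-separable)

def p(b0, b1, xi):
    """Logistic curve: fitted probability that y = 1 at xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

b0 = b1 = 0.0
for _ in range(25):
    fitted = [p(b0, b1, xi) for xi in x]
    # Gradient of the Bernoulli log-likelihood...
    g0 = sum(yi - fi for yi, fi in zip(y, fitted))
    g1 = sum((yi - fi) * xi for yi, fi, xi in zip(y, fitted, x))
    # ...and the 2x2 information matrix, built from the weights p(1-p):
    w = [fi * (1 - fi) for fi in fitted]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system for the parameter update.
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

fits = [p(b0, b1, xi) for xi in x]
print(min(fits), max(fits))  # strictly inside (0, 1)
```

At convergence the gradient is (numerically) zero, the defining condition of the maximum-likelihood estimates that Stata reports.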
