Unit 5b: The Logistic Regression Approach to Life Table Analysis. © Andrew Ho, Harvard Graduate School of Education


1 Unit 5b: The Logistic Regression Approach to Life Table Analysis. http://xkcd.com/881/

2 Course Roadmap: Unit 5b
In this unit:
• Replicating Life Table Analysis with Logistic Regression.
• Interpreting coefficients using the noconstant option.
• Fitting the Hazard Function with polynomial regression.
Course roadmap for Multiple Regression Analysis (MRA), with Today's Topic Area marked on the slide:
• Do your residuals meet the required assumptions? Test for residual normality. Use influence statistics to detect atypical datapoints.
• If your residuals are not independent, replace OLS by GLS regression analysis. Use individual growth modeling. Specify a multi-level model.
• If time is a predictor, you need discrete-time survival analysis…
• If your outcome is categorical, you need to use binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
• If you have more predictors than you can deal with, create taxonomies of fitted models and compare them. Form composites of the indicators of any common construct. Conduct a Principal Components Analysis. Use Cluster Analysis. Use Factor Analysis: EFA or CFA?
• If your outcome vs. predictor relationship is non-linear, use non-linear regression analysis or transform the outcome or predictor.

3 The Person-Period Dataset
Person-Period Dataset
ID  PERIOD  EVENT
 1     1      1
 2     1      0
 2     2      1
 3     1      1
 4     1      1
 5     1      0
 5     2      0
 5     3      0
 5     4      0
 5     5      0
 5     6      0
 5     7      0
 5     8      0
 5     9      0
 5    10      0
 5    11      0
 5    12      0
 6     1      1
 7     1      0
 7     2      0
 7     3      0
 7     4      0
 7     5      0
 7     6      0
 7     7      0
 7     8      0
 7     9      0
 7    10      0
 7    11      0
 7    12      0
Etc.
In our earlier life-table analysis in the person-period dataset:
• EVENT recorded whether the teacher experienced the event of interest (quitting teaching) in each time PERIOD.
• Conceptually, in these analyses, EVENT served as a (dichotomous) outcome and PERIOD served as a predictor.
So, why not replace life-table analysis by the logistic-regression analysis of EVENT on PERIOD in the person-period dataset?
• From a technical perspective, this turns out to be exactly the right thing to do.
• It's then called Discrete-Time Survival Analysis.
In a person-period dataset:
• Each person has one row of data in each time-period.
• Their data record continues until, and includes, the time-period in which they experience the event of interest, or are censored:
  • A person cannot be present in a time-period unless they had a value of 0 for EVENT in the previous period.
  • In other words, they must have survived the prior period.
• So, the person-period dataset has been formatted to permit each person to be present in a particular time period only if they are a legitimate member of the risk set in that period.
Notice how, in the person-period dataset, outcome EVENT has been encoded to embody the same conditionality present in the definition of the hazard probability …
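The construction rule above (one row per period at risk, with EVENT equal to 1 only in the period where the teacher quits) can be sketched in a few lines. This is a minimal Python illustration, not the conversion code from Unit5a.do; the function name and the `(id, last_period, quit)` input layout are assumptions made for the example.

```python
def to_person_period(person_level):
    """Expand (id, last_period, quit) records into (id, period, event) rows.

    Hypothetical input layout: last_period is the final period observed,
    and quit is True if the teacher quit in that period, False if censored.
    """
    rows = []
    for teacher_id, last_period, quit_flag in person_level:
        for period in range(1, last_period + 1):
            # EVENT is 1 only in the final observed period, and only if
            # the teacher actually quit there (censored teachers stay 0).
            event = 1 if (quit_flag and period == last_period) else 0
            rows.append((teacher_id, period, event))
    return rows


# Records chosen to match teachers 1, 2 and 5 in the listing on the slide.
person_level = [(1, 1, True), (2, 2, True), (5, 12, False)]
rows = to_person_period(person_level)
for row in rows[:3]:
    print(row)
```

Because rows stop at the event or censoring period, every row for a given period belongs to a member of that period's risk set, which is the conditionality the slide points to.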

4 Loading in the dataset
Here's the Stata code that kicks off Data-Analytic Handout, Unit5b.do, in which I conduct the suggested logistic regression analyses of EVENT. In Unit5a.do, recall that I provided code that allows you to convert the person-level dataset to the person-period dataset.
*--------------------------------------------------------------------------------
* Input the person-period dataset, name and label the variables in the dataset.
* Note that this is a different input dataset -- in person-period format, rather
* than the prior person-level format -- than the one that was used in the previous
* data-analytic handout on life-table analysis, in Unit5a.do
*--------------------------------------------------------------------------------
* Input the person-period dataset:
infile ID PERIOD EVENT P1-P12 ///
    using "C:\Users\Andrew Ho\Documents\Dropbox\S-052\Raw Data\SPEC_ED_PP.txt"
* Label the variables:
label variable ID "Teacher Identification Code"
label variable PERIOD "Current Time Period"
label variable EVENT "Did Teacher Quit in this Time Period?"
* Label the values of important categorical variables:
* Dichotomous event occurrence variable EVENT:
label define eventlbl 0 "No Quit" 1 "Quit"
label values EVENT eventlbl
*--------------------------------------------------------------------------------
* Inspect the structure of the new person-period dataset.
* Notice that there is one row per discrete time-period for each person.
*--------------------------------------------------------------------------------
list ID EVENT PERIOD P1-P12 in 1/40
Here I list the values of EVENT and P1 thru P12 for the few cases we inspected on the previous slide. The time-period indicators, P1 through P12, were present in the person-period dataset all along, but were input and ignored up to this point.

5 Calculating Hazard Probabilities in Person-Period Datasets
tabulate EVENT PERIOD, column
The accompanying code calculates what we see in the tabulate output: count(ID) gives us the Total in each PERIOD (NPERIOD), sum(EVENT) gives us the number who Quit in each PERIOD (NEVENT), and NEVENT/NPERIOD gives us the Hazard Probability in each PERIOD.
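The same count, sum and ratio can be traced by hand on a tiny example. The sketch below is a Python illustration of the arithmetic only, not the Stata code from the handout; the seven toy rows are made up for the example, and the dictionary names simply mirror NPERIOD and NEVENT.

```python
from collections import defaultdict


def hazard_by_period(rows):
    """Return {period: hazard} from (id, period, event) person-period rows."""
    nperiod = defaultdict(int)  # number at risk in each period, like count(ID)
    nevent = defaultdict(int)   # number of events in each period, like sum(EVENT)
    for _teacher_id, period, event in rows:
        nperiod[period] += 1
        nevent[period] += event
    # Hazard probability = events / number at risk, period by period.
    return {p: nevent[p] / nperiod[p] for p in sorted(nperiod)}


# Made-up toy data: 4 teachers at risk in period 1, 3 at risk in period 2.
toy_rows = [
    (1, 1, 1),
    (2, 1, 0), (2, 2, 1),
    (3, 1, 0), (3, 2, 0),
    (4, 1, 0), (4, 2, 0),
]
hazard = hazard_by_period(toy_rows)
print(hazard)
```

Note that the denominator shrinks from 4 to 3 between periods: teacher 1 quit in period 1 and so contributes no row to period 2.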

6 Calculating Survival Probabilities in Person-Period Datasets
Using preserve at the start and restore at the end allows us to mess with our dataset and then get the original back.
The result is our collapsed dataset with HAZARDP (collapsed) and SURVIVEP (calculated).
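One way to see how SURVIVEP follows from HAZARDP: the probability of surviving through a period is the running product of (1 - hazard) over all periods up to and including it. The Python sketch below assumes that standard life-table definition and uses the twelve hazard percentages reported in this unit's life table; the function name is illustrative.

```python
def survival_from_hazard(hazards):
    """Cumulative product of (1 - hazard): survival through each period."""
    survival = []
    running = 1.0
    for h in hazards:
        running *= 1.0 - h  # must survive this period, given survival so far
        survival.append(running)
    return survival


# Hazard probabilities for periods 1-12, from the unit's life table.
HAZARDP = [0.1157, 0.1102, 0.1158, 0.1076, 0.0891, 0.0825,
           0.0601, 0.0481, 0.0422, 0.0369, 0.0247, 0.0128]
SURVIVEP = survival_from_hazard(HAZARDP)
for period, s in enumerate(SURVIVEP, start=1):
    print(period, round(s, 4))
```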

7 The "Discrete" of DTSA: The Person-Period Dummy Variables
Col  Var     Variable Description                                        Labels
 1   ID      Teacher identification code.                                Integer
 2   PERIOD  Indicates discrete time period to which record refers.      Integer
 3   EVENT   Dummy variable indicating whether the teacher experienced   0 = no; 1 = yes
             the event of interest in this period.
 4   P1      Is this the first year of the teaching career?              0 = no; 1 = yes
 5   P2      Is this the second year of the teaching career?             0 = no; 1 = yes
 6   P3      Is this the third year of the teaching career?              0 = no; 1 = yes
 7   P4      Is this the fourth year of the teaching career?             0 = no; 1 = yes
 8   P5      Is this the fifth year of the teaching career?              0 = no; 1 = yes
 9   P6      Is this the sixth year of the teaching career?              0 = no; 1 = yes
10   P7      Is this the seventh year of the teaching career?            0 = no; 1 = yes
11   P8      Is this the eighth year of the teaching career?             0 = no; 1 = yes
12   P9      Is this the ninth year of the teaching career?              0 = no; 1 = yes
13   P10     Is this the tenth year of the teaching career?              0 = no; 1 = yes
14   P11     Is this the eleventh year of the teaching career?           0 = no; 1 = yes
15   P12     Is this the twelfth year of the teaching career?            0 = no; 1 = yes
To conduct logistic regression analyses in the person-period dataset, we must think about how we represent time PERIOD in our models. Recall that the dataset contains a vector of predictors that we have not yet used …
"General Specification of PERIOD"
• Dichotomous predictors P1 thru P12 are defined to distinguish among the discrete time periods.
• For each person in each period, each of the time period indicators, P1 thru P12, is set to 1 in the corresponding period, and 0 in other periods.
Representing PERIOD by this "vector of dummies" in our logistic regression analysis provides the most general specification possible for any potential relationship between EVENT and PERIOD.
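The coding rule for P1 thru P12 is simple enough to state as a function. This is a small Python sketch of the rule as described above; the function name and the range check are additions for illustration.

```python
def period_indicators(period, n_periods=12):
    """Return [P1, ..., Pn]: 1 in the slot matching `period`, 0 elsewhere."""
    if not 1 <= period <= n_periods:
        raise ValueError("period must be between 1 and n_periods")
    return [1 if j == period else 0 for j in range(1, n_periods + 1)]


print(period_indicators(1))   # P1 = 1, P2 thru P12 = 0
print(period_indicators(12))  # P12 = 1, P1 thru P11 = 0
```

Exactly one indicator is 1 in every row, so the twelve dummies carry the same information as PERIOD itself while placing no constraint on the shape of the relationship between EVENT and PERIOD.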

8 The "Discrete" of DTSA: Person-Period Dummies as Time Period Indicators
Here are the values of the time-period indicators for a few teachers from the person-period dataset …
     +------------------------------------------------------------------------+
     | ID  EVENT    PERIOD  P1  P2  P3  P4  P5  P6  P7  P8  P9  P10  P11  P12 |
     |------------------------------------------------------------------------|
  1. |  1  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
  2. |  2  No Quit       1   1   0   0   0   0   0   0   0   0    0    0    0 |
  3. |  2  Quit          2   0   1   0   0   0   0   0   0   0    0    0    0 |
  4. |  3  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
  5. |  4  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
     |------------------------------------------------------------------------|
  6. |  5  No Quit       1   1   0   0   0   0   0   0   0   0    0    0    0 |
  7. |  5  No Quit       2   0   1   0   0   0   0   0   0   0    0    0    0 |
  8. |  5  No Quit       3   0   0   1   0   0   0   0   0   0    0    0    0 |
  9. |  5  No Quit       4   0   0   0   1   0   0   0   0   0    0    0    0 |
 10. |  5  No Quit       5   0   0   0   0   1   0   0   0   0    0    0    0 |
     |------------------------------------------------------------------------|
 11. |  5  No Quit       6   0   0   0   0   0   1   0   0   0    0    0    0 |
 12. |  5  No Quit       7   0   0   0   0   0   0   1   0   0    0    0    0 |
 13. |  5  No Quit       8   0   0   0   0   0   0   0   1   0    0    0    0 |
 14. |  5  No Quit       9   0   0   0   0   0   0   0   0   1    0    0    0 |
 15. |  5  No Quit      10   0   0   0   0   0   0   0   0   0    1    0    0 |
     |------------------------------------------------------------------------|
 16. |  5  No Quit      11   0   0   0   0   0   0   0   0   0    0    1    0 |
 17. |  5  No Quit      12   0   0   0   0   0   0   0   0   0    0    0    1 |
 18. |  6  Quit          1   1   0   0   0   0   0   0   0   0    0    0    0 |
 …
 39. | 12  No Quit       4   0   0   0   1   0   0   0   0   0    0    0    0 |
 40. | 12  No Quit       5   0   0   0   0   1   0   0   0   0    0    0    0 |
     +------------------------------------------------------------------------+
Rows 6 through 17 are the original 12 years of data on the time periods in which Teacher #5 was present in the person-period dataset.
The time-period indicators, P1 - P12, identify each time-period in a very general way:
• In the 1st time period: P1 = 1, P2 thru P12 = 0.
• In the 2nd time period: P2 = 1, P1 & P3 thru P12 = 0.
• …
• In the 12th time period: P12 = 1, P1 thru P11 = 0.

9 The "Discrete" of DTSA: Person-Period Dummies as Time Period Indicators
[Listing of the first 40 person-period records, repeated from Slide 8.]
[Figure: Hazard Function]
tabulate EVENT PERIOD, column
You might notice that the Hazard Function shows the conditional means of the dichotomous variable, EVENT, at each value of the predictor variable, PERIOD.
If we wanted to model these means, and test the null hypothesis that all means are equal, how might we do it? In the population, are hazard probabilities different across years of teaching?

10 A Model for each of the Means
[Listing of the first 40 person-period records, repeated from Slide 8.]
[Figure: Hazard Function]
tabulate EVENT PERIOD, column
We could fit this model with the dummy variables that we have:
regress EVENT P1-P12
* OR: the "i." notation auto-creates dummy variables
regress EVENT i.PERIOD
There are two problems with this statistical model as written. What are they?

11 A Model for each of the Logits?
• A model for the log-odds (logits) of teachers exiting the system for the first time, given "survival" through a given number of years of teaching.
• We might think of PERIOD as a continuous variable, but let's start by trying to reproduce the Hazard Probabilities at each discrete period, in the same way that we would estimate probabilities for a large number of racial/ethnic groups or polychotomies/categories.

12 The Discrete-Time Hazard Model: Reproducing Life Tables
                 P1      P2      P3      P4      P5      P6      P7      P8      P9     P10     P11     P12
Percentage   11.57%  11.02%  11.58%  10.76%   8.91%   8.25%   6.01%   4.81%   4.22%   3.69%   2.47%   1.28%
Odds          0.131   0.124   0.131   0.121   0.098   0.090   0.064   0.051   0.044   0.038   0.025   0.013
Log-Odds     -2.034  -2.089  -2.033  -2.116  -2.325  -2.408  -2.749  -2.985  -3.122  -3.261  -3.676  -4.346
[Figure: Hazard Function]
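The three rows of the table are linked by two transformations: odds = p / (1 - p) and log-odds = ln(odds). The Python sketch below recomputes the Odds and Log-Odds rows from the Percentage row as a check on that arithmetic. Because the percentages are themselves rounded to two decimals, the recomputed values can differ from the slide's in the last digit.

```python
import math

# Hazard percentages and the slide's reported odds and log-odds, P1..P12.
PERCENT = [11.57, 11.02, 11.58, 10.76, 8.91, 8.25,
           6.01, 4.81, 4.22, 3.69, 2.47, 1.28]
SLIDE_ODDS = [0.131, 0.124, 0.131, 0.121, 0.098, 0.090,
              0.064, 0.051, 0.044, 0.038, 0.025, 0.013]
SLIDE_LOGITS = [-2.034, -2.089, -2.033, -2.116, -2.325, -2.408,
                -2.749, -2.985, -3.122, -3.261, -3.676, -4.346]


def odds(p):
    """Odds of the event for a probability p strictly between 0 and 1."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds (logit) of a probability p."""
    return math.log(odds(p))


probs = [pct / 100.0 for pct in PERCENT]
calc_odds = [odds(p) for p in probs]
calc_logits = [logit(p) for p in probs]
for j, (o, lg) in enumerate(zip(calc_odds, calc_logits), start=1):
    print(f"P{j}: odds={o:.3f} log-odds={lg:.3f}")
```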

13 A No-Constant (Zero-Constant) Model
[Percentage / Odds / Log-Odds table and Hazard Function figure, repeated from Slide 12.]
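A note on what the no-constant specification buys us, with a sketch. Assuming the model is fitted with all twelve indicators and no intercept (in Stata, something like logit EVENT P1-P12, noconstant), each coefficient is the sample log-odds of quitting in that period, so the Log-Odds row can be read directly off the output. With a constant and P2-P12 instead, the constant is the period-1 logit and each coefficient is a difference from period 1. The Python below only does the bookkeeping between the two codings, using the slide's log-odds; it does not fit a model, and the function name is illustrative.

```python
# Log-odds for periods 1-12, from the slide's table.
LOGITS = [-2.034, -2.089, -2.033, -2.116, -2.325, -2.408,
          -2.749, -2.985, -3.122, -3.261, -3.676, -4.346]


def to_reference_coding(logits):
    """Convert per-period logits (no-constant coding) to constant + differences.

    The constant is the period-1 logit; each remaining coefficient is that
    period's logit minus the period-1 logit.
    """
    constant = logits[0]
    differences = [value - constant for value in logits[1:]]
    return constant, differences


constant, diffs = to_reference_coding(LOGITS)
# Adding each difference back to the constant recovers the per-period logits.
recovered = [constant] + [constant + d for d in diffs]
print(constant, [round(d, 3) for d in diffs])
```

Both codings describe the same twelve fitted hazards; the no-constant version is simply easier to compare against the life table.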

14 How Logistic Models Replicate (and Extend?) Life Table Analyses
• Logistic Regression provides us a statistical model for Hazard Probabilities and allows us to ask questions about differences in Hazard Probabilities in the population.
• Does the probability of exit really decline over time in the population (conditional on survival to that point)?
• Now, we can extend this analysis by adding predictors (What about certified teachers? Age? The year that they started?).
• And, instead of modeling the logit at each PERIOD, we can use a more parsimonious model for the trajectory of Hazard Probabilities over time.

15 Instead of logit EVENT P2-P12, why not logit EVENT PERIOD?
• What is the estimated change in the Hazard Probability (in logits) per unit PERIOD?
• Is this change different from 0 in the population?
Preparing for some polynomial regression: linear, quadratic, and cubic fits to the Hazard function.

16 A linear model for the logits
When PERIOD = 0, the estimated logit of exiting the system is -1.76. Remember your logit scale: this is a fitted probability of 14.7%.
This is a linear model. Why are the fitted probabilities clearly curvilinear? And does this seem like a good fit to you?
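A short sketch of why a straight line in logits bends in probabilities. The intercept of -1.76 comes from the slide, but the slope is not reported in this transcript, so the value below is a made-up placeholder used only to show the shape. The conversion is the inverse logit, p = 1 / (1 + exp(-logit)).

```python
import math


def inv_logit(x):
    """Convert a logit (log-odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


INTERCEPT = -1.76           # estimated logit at PERIOD = 0, from the slide
HYPOTHETICAL_SLOPE = -0.15  # NOT from the slide: illustrative value only

# Fitted hazard probabilities for PERIOD = 0, 1, ..., 12.
fitted = [inv_logit(INTERCEPT + HYPOTHETICAL_SLOPE * t) for t in range(13)]
# Change in fitted probability from one period to the next.
steps = [fitted[t + 1] - fitted[t] for t in range(12)]

print(round(fitted[0], 3))  # fitted probability at PERIOD = 0
print([round(s, 4) for s in steps])
```

Each period moves the logit by the same amount, yet the probability steps shrink as the hazard approaches 0, which is the curvature visible in the fitted plot.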

17 Quadratic Fit
When PERIOD = 0, the estimated logit of exiting the system is -2.06. Remember your logit scale: this is a fitted probability of 11.3%.
Remember that coefficients from polynomial regression equations are, like coefficients from all interactions, difficult to interpret on their own. So we graph.
Is this a quadratic function? Does this seem like a better fit to you?

18 Cubic Fit
When PERIOD = 0, the estimated logit of exiting the system is -2.14. Remember your logit scale: this is a fitted probability of 10.5%.
Remember that coefficients from polynomial regression equations are, like coefficients from all interactions, difficult to interpret on their own. So we graph.
Is this a cubic function? Does this seem like a better fit to you?
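Since the polynomial coefficients are hard to read one at a time, the practical route is to evaluate the whole polynomial on the logit scale and convert to a probability. The Python sketch below does that for any degree. Only the intercepts (-2.06 for the quadratic fit, -2.14 for the cubic fit) come from the slides; every higher-order coefficient is a made-up placeholder, so only the PERIOD = 0 values are meaningful here.

```python
import math


def inv_logit(x):
    """Convert a logit (log-odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fitted_hazard(period, coefs):
    """Fitted hazard for logit = b0 + b1*PERIOD + b2*PERIOD^2 + ...

    coefs is [b0, b1, b2, ...]; the polynomial is evaluated by Horner's rule.
    """
    value = 0.0
    for b in reversed(coefs):
        value = value * period + b
    return inv_logit(value)


# Intercepts from the slides; all other coefficients are HYPOTHETICAL.
QUADRATIC = [-2.06, 0.05, -0.02]
CUBIC = [-2.14, 0.10, -0.03, 0.001]

print(round(fitted_hazard(0, QUADRATIC), 3))  # probability at PERIOD = 0
print(round(fitted_hazard(0, CUBIC), 3))      # probability at PERIOD = 0
```

At PERIOD = 0 every higher-order term vanishes, which is why the intercept alone gives the fitted probabilities quoted on the slides.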

