# An Introduction to Group-Based Trajectory Modeling and PROC TRAJ

Richard Charnigo, Professor of Statistics and Biostatistics; Director of Statistics and Psychometrics Core, CDART. RJCharn2@aol.com

Objectives

First ~80 minutes:

1. Be able to describe a group-based trajectory model and, in particular, distinguish it from a conventional regression model.
2. Be able to interpret results obtained from fitting a group-based trajectory model via PROC TRAJ.

Last ~40 minutes:

3. Be able to fit a group-based trajectory model via PROC TRAJ.

Motivating example The Excel file at {www.richardcharnigo.net/traj} contains a simulated data set: Five hundred college freshmen (“ID”) were asked to estimate how many times per month they consumed marijuana during their freshman (“Y1”), sophomore (“Y2”), junior (“Y3”), and senior (“Y4”) years of high school. Later they were asked to estimate their marijuana use during freshman year of college (“Y5”). They were also assessed on reward seeking; for ease of interpretation, we standardize this variable (“X”).

Motivating example Two possible “research questions” are: i. What are prototypical trajectories of marijuana use within the population of college students from which this sample was drawn ? ii. Is the trajectory that best describes the experience of a particular student associated with that student’s level of reward seeking ? We can develop more complicated and realistic scenarios ( e.g., with additional personality variables and/or interventions ), but this simple scenario will help us begin to understand group-based trajectory modeling and PROC TRAJ.

Exploratory data analysis Before pursuing group-based trajectory ( or any other statistical ) modeling, we are well-advised to perform exploratory data analysis. This can alert us to gross mistakes in the data set, heretofore undetected, which may otherwise threaten the validity of our results. This can also suggest an appropriate probability distribution to use with the group-based trajectory model and help us to anticipate what the results may be.

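Exploratory data analysis

Quantiles (Definition 5)

| Quantile | Estimate |
|---|---|
| 100% Max | 4 |
| 99% | 3 |
| 95% | 2 |
| 90% | 1 |
| 75% Q3 | 1 |
| 50% Median | 0 |
| 25% Q1 | 0 |
| 10% | 0 |
| 5% | 0 |
| 1% | 0 |
| 0% Min | 0 |

Basic Statistical Measures

| Location | Value | Variability | Value |
|---|---|---|---|
| Mean | 0.362000 | Std Deviation | 0.71553 |
| Median | 0.000000 | Variance | 0.51198 |
| Mode | 0.000000 | Range | 4.00000 |
| | | Interquartile Range | 1.00000 |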

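Exploratory data analysis

Quantiles (Definition 5)

| Quantile | Estimate |
|---|---|
| 100% Max | 14 |
| 99% | 12 |
| 95% | 9 |
| 90% | 7 |
| 75% Q3 | 1 |
| 50% Median | 0 |
| 25% Q1 | 0 |
| 10% | 0 |
| 5% | 0 |
| 1% | 0 |
| 0% Min | 0 |

Basic Statistical Measures

| Location | Value | Variability | Value |
|---|---|---|---|
| Mean | 1.454000 | Std Deviation | 2.91563 |
| Median | 0.000000 | Variance | 8.50089 |
| Mode | 0.000000 | Range | 14.00000 |
| | | Interquartile Range | 1.00000 |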

Exploratory data analysis The preceding slides show descriptive statistics for Y1 and Y5. ( We can similarly examine descriptive statistics for Y2, Y3, and Y4. ) Here are a few observations: As anticipated, the possible values of Y1 and Y5 are nonnegative, and they appear to have been recorded ( or rounded ) to the nearest integer. The distributions of Y1 and Y5 are right-skewed, and there are lots of 0’s. Both the mean and the variance for Y5 are greater than the corresponding quantities for Y1.

Exploratory data analysis Our observations suggest the following: Because there are lots of 0’s, there is no transformation that will bring Y1 or Y5 to approximate normality. However, because Y1 and Y5 are integer-valued, a Poisson ( or similar ) probability distribution may be applicable. Since Y5 has greater mean and variance than Y1, we anticipate some divergence between trajectories over time and at least one trajectory showing increasing marijuana use over time.
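A Poisson random variable has variance equal to its mean, so comparing the two gives a quick check on whether a single Poisson distribution could describe the whole sample. The sketch below uses only the summary statistics reported on the descriptive-statistics slides for Y1 and Y5; the function name `dispersion_ratio` is illustrative and not part of any package.

```python
def dispersion_ratio(mean: float, variance: float) -> float:
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return variance / mean


# Summary statistics copied from the descriptive-statistics slides.
summary = {
    "Y1": {"mean": 0.362, "variance": 0.51198},
    "Y5": {"mean": 1.454, "variance": 8.50089},
}

ratios = {
    name: dispersion_ratio(stats["mean"], stats["variance"])
    for name, stats in summary.items()
}

for name, ratio in ratios.items():
    print(f"{name}: variance/mean = {ratio:.2f}")
```

Ratios well above 1 are what one would expect when a sample pools subpopulations with different Poisson means, which is the situation a group-based trajectory model is meant to describe.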

A first trajectory model Let $t$ denote time in years. If we set time 0 to be high school graduation, then we have $t = -3, -2, -1, 0,$ and $1$ corresponding to Y1 through Y5. Suppose for now --- the viability of this supposition can be assessed later --- that there are three subpopulations whose mean levels of marijuana use over time ( called “trajectories” ) are defined by exponentials of linear functions $f_1(t) = \exp(a_1 + b_1 t)$, $f_2(t) = \exp(a_2 + b_2 t)$, and $f_3(t) = \exp(a_3 + b_3 t)$. The exponentials are needed because $f_1(t)$, $f_2(t)$, and $f_3(t)$ must be nonnegative.

A first trajectory model Suppose that the distribution of $Y_k$ ( $1 \le k \le 5$ ) in the first subpopulation is Poisson with mean $f_1(k-4)$, in the second is Poisson with mean $f_2(k-4)$, and in the third is Poisson with mean $f_3(k-4)$. Finally, suppose that the probability of belonging to subpopulation $j$ ( $2 \le j \le 3$ ), divided by the probability of belonging to the first subpopulation, is $\exp(c_j + d_j X)$. If $d_j > 0$, then higher levels of reward seeking increase the above ratio; if $d_j < 0$, then they decrease the above ratio.
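Written out in full, these assumptions imply the membership probabilities and the marginal distribution of one student's five responses shown below. This is a sketch in the notation of the slides, under two assumptions. The first is that the membership odds relative to the first subpopulation take the multinomial-logit form $\exp(c_j + d_j x)$. The second is that the five responses are conditionally independent given subpopulation, as is usual in group-based trajectory models. The symbol $\pi_j(x)$ is introduced here for illustration.

```latex
\pi_1(x) = \frac{1}{1 + \sum_{m=2}^{3} \exp(c_m + d_m x)}, \qquad
\pi_j(x) = \frac{\exp(c_j + d_j x)}{1 + \sum_{m=2}^{3} \exp(c_m + d_m x)}
\quad (j = 2, 3),

P(Y_1 = y_1, \ldots, Y_5 = y_5 \mid X = x)
  = \sum_{j=1}^{3} \pi_j(x) \prod_{k=1}^{5}
    \frac{e^{-f_j(k-4)} \, f_j(k-4)^{y_k}}{y_k!}.
```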

A first trajectory model A group-based trajectory model is thus distinguished from a conventional regression model in that a latent variable --- namely, the subpopulation to which one belongs --- is intermediate between what might be thought of as the independent variable (here, reward seeking) and the dependent variable (here, marijuana use). Consequently, and importantly, the difference between two trajectories is typically much greater than the difference between mean levels among persons “high” on the independent variable versus persons “low” on the independent variable.

A first trajectory model

The preceding figure shows results from fitting the group-based trajectory model via PROC TRAJ. Approximately 65.3% of persons belong to a subpopulation that is essentially abstinent from marijuana, about 19.4% to a subpopulation whose marijuana use increases and then decreases, and about 15.3% to a subpopulation whose marijuana use continually increases. Dashed lines represent estimates of $f_1(t)$, $f_2(t)$, and $f_3(t)$ when they are assumed to be exponentials of linear functions; solid lines represent estimates without such a constraint.

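A first trajectory model

| Obs | ID | Y1 | Y2 | Y3 | Y4 | Y5 | T1 | T2 | T3 | T4 | T5 | X | GRP1PRB | GRP2PRB | GRP3PRB | GROUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 0 | 0 | 1 | 0 | 0 | -3 | -2 | -1 | 0 | 1 | 0.08 | 0.995814 | 0.004186 | 0.000000 | 1 |
| 6 | 6 | 2 | 3 | 6 | 4 | 0 | -3 | -2 | -1 | 0 | 1 | 2.75 | 0.000000 | 0.243606 | 0.756394 | 3 |
| 7 | 7 | 0 | 0 | 0 | 0 | 1 | -3 | -2 | -1 | 0 | 1 | -0.97 | 0.998364 | 0.001636 | 0.000000 | 1 |
| 8 | 8 | 2 | 4 | 3 | 8 | 8 | -3 | -2 | -1 | 0 | 1 | 0.7 | 0.000000 | 0.000002 | 0.999998 | 3 |
| 9 | 9 | 1 | 0 | 1 | 4 | 5 | -3 | -2 | -1 | 0 | 1 | 2.78 | 0.000000 | 0.071390 | 0.928610 | 3 |
| 10 | | 1 | 4 | 0 | 0 | 1 | -3 | -2 | -1 | 0 | 1 | 0.53 | 0.000634 | 0.999287 | 0.000078 | 2 |

| Obs | `_MODEL_` | `_MODEL2_` | `_TYPE_` | `_NAME_` | INTERC1 | LINEAR1 | INTERC2 |
|---|---|---|---|---|---|---|---|
| 1 | ZIP | | PARMS | | -2.240945095 | -0.061055892 | 0.4881041958 |

| Obs | LINEAR2 | INTERC3 | LINEAR3 | CONST2 | X2 | CONST3 | X3 |
|---|---|---|---|---|---|---|---|
| 1 | 0.0881527887 | 1.6413614616 | 0.404393847 | -1.196677753 | 1.1816491462 | -2.400466075 | 2.4141657075 |

| Obs | `_LOGLIK_` | `_BIC1_` | `_BIC2_` | `_AIC_` | `_CONVERGE_` |
|---|---|---|---|---|---|
| 1 | -2580.343083 | -2611.416123 | -2619.463313 | -2590.343083 | 4 |

| Obs | T | AVG1 | AVG2 | AVG3 | PRED1 | PRED2 | PRED3 |
|---|---|---|---|---|---|---|---|
| 1 | -3.00000 | 0.13401 | 0.48801 | 1.17827 | 0.12774 | 1.25063 | 1.53446 |
| 2 | -2.00000 | 0.10507 | 1.68610 | 2.66516 | 0.12017 | 1.36588 | 2.29923 |
| 3 | -1.00000 | 0.13408 | 2.58710 | 3.57223 | 0.11305 | 1.49175 | 3.44515 |
| 4 | 0.00000 | 0.08391 | 1.57138 | 5.24892 | 0.10636 | 1.62922 | 5.16219 |
| 5 | 1.00000 | 0.11002 | 1.20619 | 7.52729 | 0.10006 | 1.77937 | 7.73500 |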

A first trajectory model The preceding tables display additional results. The first table shows variable values for six subjects, along with the estimated probabilities that the subjects belong to the three subpopulations. The second and third tables present estimates of $a_1$, $b_1$, $a_2$, $b_2$, $a_3$, $b_3$, $c_2$, $d_2$, $c_3$, and $d_3$. Companion output, which is displayed by PROC TRAJ on screen only, provides accompanying p-values. The fourth table provides indices of model fit, and the fifth table specifies the numbers used to construct the figure displayed earlier.
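To see how the estimates in the second and third tables connect to the membership probabilities in the first table, one can apply Bayes' rule by hand. The sketch below assumes an ordinary Poisson model without zero inflation, a multinomial-logit form for membership with the first subpopulation as reference, and conditional independence of Y1 through Y5 given subpopulation. The function names are illustrative and are not part of PROC TRAJ.

```python
import math

# Estimates copied from the PROC TRAJ output tables for the first model.
INTERCEPTS = [-2.240945095, 0.4881041958, 1.6413614616]  # INTERC1..INTERC3
SLOPES = [-0.061055892, 0.0881527887, 0.404393847]       # LINEAR1..LINEAR3
CONSTS = [0.0, -1.196677753, -2.400466075]               # group 1 is the reference
X_COEFS = [0.0, 1.1816491462, 2.4141657075]              # X2, X3
TIMES = [-3, -2, -1, 0, 1]                               # t for Y1..Y5


def trajectory(group: int, t: float) -> float:
    """Fitted mean exp(a + b * t) for group index 0, 1, or 2."""
    return math.exp(INTERCEPTS[group] + SLOPES[group] * t)


def prior_probs(x: float) -> list[float]:
    """Membership probabilities given reward seeking x (multinomial logit)."""
    weights = [math.exp(c + d * x) for c, d in zip(CONSTS, X_COEFS)]
    total = sum(weights)
    return [w / total for w in weights]


def poisson_log_pmf(y: int, mean: float) -> float:
    """Log of the Poisson probability of observing y when the mean is `mean`."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def posterior_probs(ys: list[int], x: float) -> list[float]:
    """Bayes' rule: prior times Poisson likelihood, normalised over groups."""
    log_terms = []
    for group, prior in enumerate(prior_probs(x)):
        loglik = sum(
            poisson_log_pmf(y, trajectory(group, t)) for y, t in zip(ys, TIMES)
        )
        log_terms.append(math.log(prior) + loglik)
    peak = max(log_terms)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(value - peak) for value in log_terms]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Subject 5 from the first table: Y1..Y5 = 0, 0, 1, 0, 0 and X = 0.08.
subject5 = posterior_probs([0, 0, 1, 0, 0], 0.08)
print([round(p, 4) for p in subject5])
```

Under these assumptions the result should agree closely with the GRP1PRB, GRP2PRB, and GRP3PRB values reported for subject 5, and `trajectory` should reproduce the PRED1 through PRED3 columns of the fifth table.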

A first trajectory model Visually, the estimate of $f_2(t)$ appears somewhat unsatisfactory. There are corresponding discrepancies between the “AVG2” and “PRED2” columns in the fifth table. Therefore, let us consider a second group-based trajectory model in which the trajectories are defined by exponentials of quadratic functions $f_1(t) = \exp(a_1 + b_1 t + g_1 t^2)$, $f_2(t) = \exp(a_2 + b_2 t + g_2 t^2)$, and $f_3(t) = \exp(a_3 + b_3 t + g_3 t^2)$.

A second trajectory model

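| Obs | ID | Y1 | Y2 | Y3 | Y4 | Y5 | T1 | T2 | T3 | T4 | T5 | X | GRP1PRB | GRP2PRB | GRP3PRB | GROUP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 0 | 0 | 1 | 0 | 0 | -3 | -2 | -1 | 0 | 1 | 0.08 | 0.992863 | 0.007137 | 0.000000 | 1 |
| 6 | 6 | 2 | 3 | 6 | 4 | 0 | -3 | -2 | -1 | 0 | 1 | 2.75 | 0.000000 | 0.868232 | 0.131768 | 2 |
| 7 | 7 | 0 | 0 | 0 | 0 | 1 | -3 | -2 | -1 | 0 | 1 | -0.97 | 0.999285 | 0.000715 | 0.000000 | 1 |
| 8 | 8 | 2 | 4 | 3 | 8 | 8 | -3 | -2 | -1 | 0 | 1 | 0.7 | 0.000000 | | 1.000000 | 3 |
| 9 | 9 | 1 | 0 | 1 | 4 | 5 | -3 | -2 | -1 | 0 | 1 | 2.78 | 0.000000 | 0.008870 | 0.991130 | 3 |
| 10 | | 1 | 4 | 0 | 0 | 1 | -3 | -2 | -1 | 0 | 1 | 0.53 | 0.001133 | 0.998748 | 0.000119 | 2 |

| Obs | `_MODEL_` | `_MODEL2_` | `_TYPE_` | `_NAME_` | INTERC1 | LINEAR1 | QUADRA1 |
|---|---|---|---|---|---|---|---|
| 1 | ZIP | | PARMS | | -2.311574846 | 0.0558397704 | 0.0514642791 |

| Obs | LINEAR2 | QUADRA2 | INTERC3 | LINEAR3 | QUADRA3 | CONST2 | X2 |
|---|---|---|---|---|---|---|---|
| 1 | -0.469526884 | -0.296767096 | 1.6939847836 | 0.3771947256 | -0.029055771 | -1.157274099 | 1.197230375 |

| Obs | T | AVG1 | AVG2 | AVG3 | PRED1 | PRED2 | PRED3 |
|---|---|---|---|---|---|---|---|
| 1 | -3.00000 | 0.13401 | 0.48801 | 1.17827 | 0.12774 | 1.25063 | 1.53446 |
| 2 | -2.00000 | 0.10507 | 1.68610 | 2.66516 | 0.12017 | 1.36588 | 2.29923 |
| 3 | -1.00000 | 0.13408 | 2.58710 | 3.57223 | 0.11305 | 1.49175 | 3.44515 |
| 4 | 0.00000 | 0.08391 | 1.57138 | 5.24892 | 0.10636 | 1.62922 | 5.16219 |
| 5 | 1.00000 | 0.11002 | 1.20619 | 7.52729 | 0.10006 | 1.77937 | 7.73500 |

| Obs | CONST3 | X3 | `_LOGLIK_` | `_BIC1_` | `_BIC2_` | `_AIC_` | `_CONVERGE_` |
|---|---|---|---|---|---|---|---|
| 1 | -2.356619304 | 2.313769971 | -2504.788285 | -2545.183238 | -2555.644584 | -2517.788285 | 4 |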

A second trajectory model Some comments are in order: The estimate of $f_2(t)$ looks much better now. The guess about which subpopulation subject 6 belongs to has changed ( and appears more reasonable now ). The BIC1, BIC2, and AIC have increased by approximately 66, 64, and 73 points respectively; PROC TRAJ reports these indices on a scale where larger values indicate better fit. These are overwhelming changes, suggesting that the second group-based trajectory model provides a much better fit to the data than the first group-based trajectory model.
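The reported fit indices are consistent with penalised log-likelihoods, which is why larger values are better here. The sketch below uses formulas inferred from the reported numbers rather than quoted from PROC TRAJ documentation: the log-likelihood minus half the number of parameters times the log of the sample size for BIC, and the log-likelihood minus the number of parameters for AIC. The parameter counts (10 and 13) and the function name `fit_indices` are assumptions made for illustration.

```python
import math


def fit_indices(
    loglik: float, n_params: int, n_subjects: int, n_obs: int
) -> dict[str, float]:
    """Penalised log-likelihoods on a larger-is-better scale."""
    return {
        "BIC1": loglik - 0.5 * n_params * math.log(n_subjects),
        "BIC2": loglik - 0.5 * n_params * math.log(n_obs),
        "AIC": loglik - n_params,
    }


# 500 students, each with 5 measurements.
N_SUBJECTS, N_OBS = 500, 2500

# Log-likelihoods copied from the output. Assumed parameter counts:
# 3 groups x 2 (or 3) polynomial coefficients, plus 4 membership coefficients.
linear = fit_indices(-2580.343083, n_params=10, n_subjects=N_SUBJECTS, n_obs=N_OBS)
quadratic = fit_indices(-2504.788285, n_params=13, n_subjects=N_SUBJECTS, n_obs=N_OBS)

gains = {name: quadratic[name] - linear[name] for name in linear}
for name, gain in gains.items():
    print(f"{name}: {linear[name]:.2f} -> {quadratic[name]:.2f} (gain {gain:.2f})")
```

The quadratic model carries three more parameters, so each index penalises it more heavily; the gain in log-likelihood of roughly 76 points far outweighs that penalty.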

Is that the best we can do ? Besides moving from linear functions to quadratic functions, other modifications are possible. One, for which I provide SAS code at {www.richardcharnigo.net/traj}, entails replacing the ordinary Poisson probability distribution by the zero-inflated Poisson probability distribution. The idea is that, especially in the first subpopulation, there may be too many 0’s to be compatible with the ordinary Poisson probability distribution. Accounting for this zero inflation may provide a better fit to the data.
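For readers who have not met the zero-inflated Poisson distribution, the sketch below shows its probability mass function next to the ordinary Poisson one. It is the standard textbook definition, in which an extra point mass at zero is mixed with a Poisson distribution. The parameter names `poisson_mean` and `zero_prob` and the numerical values are illustrative and are not taken from the SAS code or the data.

```python
import math


def poisson_pmf(y: int, poisson_mean: float) -> float:
    """Ordinary Poisson probability of observing the count y."""
    return math.exp(y * math.log(poisson_mean) - poisson_mean - math.lgamma(y + 1))


def zip_pmf(y: int, poisson_mean: float, zero_prob: float) -> float:
    """Zero-inflated Poisson: an extra point mass `zero_prob` at zero."""
    if not 0.0 <= zero_prob <= 1.0:
        raise ValueError("zero_prob must lie in [0, 1]")
    base = (1.0 - zero_prob) * poisson_pmf(y, poisson_mean)
    return zero_prob + base if y == 0 else base


# Illustrative values only: Poisson mean 0.5 with 30% extra zeros.
poisson_mean, zero_prob = 0.5, 0.3
print("P(Y = 0), Poisson:", round(poisson_pmf(0, poisson_mean), 4))
print("P(Y = 0), ZIP:    ", round(zip_pmf(0, poisson_mean, zero_prob), 4))
```

Setting `zero_prob` to 0 recovers the ordinary Poisson distribution, so the ordinary model is a special case of the zero-inflated one.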

Is that the best we can do ? Another possible modification is to change the quadratic functions to cubic or even quartic functions. ( With only five time points, we cannot go beyond polynomials of degree four. ) In fact, the polynomial degree need not be the same for each subpopulation. For instance, a linear function may suffice for the first and third subpopulations, while ( at least ) a quadratic function appears necessary for the second subpopulation.

Is that the best we can do ? We face the practical problem, though, of deciding which modifications to make. Rather than consider dozens ( or hundreds ) of possible competing models, a more feasible approach may be to start with the most complicated model that one is willing to entertain ( for example, with quartic polynomials for each subpopulation ) and then perform “backward elimination”.

Is that the best we can do ? To do this, remove whichever model feature has the largest p-value, while respecting the hierarchical principle that simpler features cannot be removed before more complicated features. Thus, for example, the linear term cannot be removed from a quadratic polynomial. Once all remaining model features have p-values less than 0.05 ( or are ineligible for removal ), stop and create a table of model fit indices corresponding to the various steps of the backward elimination.
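The elimination loop just described can be sketched in a few lines. This is a simplified illustration, not the procedure in the SAS code: it considers only the polynomial terms, never removes an intercept, and relies on a user-supplied `fit` function that refits the model and returns the p-value of each group's highest-degree term. All names here, including `backward_eliminate` and `toy_fit`, are hypothetical.

```python
from typing import Callable, Dict, List

Orders = Dict[int, int]      # group number -> polynomial degree
PValues = Dict[int, float]   # group number -> p-value of its highest-degree term


def backward_eliminate(
    start: Orders,
    fit: Callable[[Orders], PValues],
    alpha: float = 0.05,
) -> List[Orders]:
    """Return the sequence of models visited, from `start` to the stopping point."""
    current = dict(start)
    visited = [dict(current)]
    while True:
        p_values = fit(current)
        # Hierarchical principle: only the highest-degree term of each
        # polynomial is eligible, and intercepts (degree 0) are never removed.
        eligible = {
            group: p
            for group, p in p_values.items()
            if current[group] > 0 and p >= alpha
        }
        if not eligible:
            return visited
        worst = max(eligible, key=eligible.get)
        current[worst] -= 1
        visited.append(dict(current))


def toy_fit(orders: Orders) -> PValues:
    """Stand-in for a real model fit: pretends that group 2 needs a quadratic
    term and that groups 1 and 3 need only a linear term."""
    needed = {1: 1, 2: 2, 3: 1}
    return {
        group: 0.01 if degree <= needed[group] else 0.20 + 0.1 * degree
        for group, degree in orders.items()
    }


path = backward_eliminate({1: 4, 2: 4, 3: 4}, toy_fit)
print(path[-1])
```

Each entry of `path` corresponds to one step of the elimination, so model fit indices can be tabulated against it once a real fitting routine replaces `toy_fit`.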

Is that the best we can do ? The step in the backward elimination at which the model fit indices are optimized can be used to select a final model. ( Matters become a bit more complicated, though, if the model fit indices are not in agreement about this. ) Also, if we are unsure whether three is the best number of groups, then the above process can be repeated with, say, two groups and four groups. Model fit indices can then be used to choose among the final two-group model, the final three-group model, and the final four-group model.

Other capabilities of PROC TRAJ Worth mentioning here, though not illustrated in this presentation or in the SAS code at {www.richardcharnigo.net/traj}, are three additional capabilities of PROC TRAJ: The dependent variable need not have the (zero-inflated) Poisson probability distribution; the normal and Bernoulli probability distributions can be accommodated as well. Multiple independent variables can be accommodated.

Other capabilities of PROC TRAJ Multiple, related dependent variables can be accommodated. If there are two ( for instance, marijuana use and alcohol use ), then PROC TRAJ provides one latent variable defining subpopulations on the first dependent variable and a separate latent variable defining subpopulations on the second. Part of the output from PROC TRAJ then estimates the probabilities of membership in the subpopulations defined by the second latent variable given membership in a subpopulation defined by the first. If there are more than two, then PROC TRAJ provides a single latent variable defining subpopulations on all dependent variables simultaneously.

Trying out PROC TRAJ With this background, let us open SAS and work our way through at least some of the SAS code at {www.richardcharnigo.net/traj}. This is also an opportunity to experiment and make some changes to the SAS code. For instance, you can see what PROC TRAJ does when a quadratic function is replaced by a cubic function or when a quadratic function is retained for only one of the three subpopulations.
