REALCOM: Multilevel models for realistically complex data. Methodology and examples for: measurement errors; multilevel structural equations; multivariate responses at several levels and of different types. An ESRC research project at Bristol University.
General Format. MATLAB software: free-standing executable programs; ASCII and worksheet input and output; graphical menu-based input specification; model equation display; monitoring of MCMC chains. A training manual containing: an outline of methodology; worked-through examples.
Markov Chain Monte Carlo: a quick introduction. MCMC is a Bayesian simulation-based method that, given starting values, samples a new set of parameters at each cycle of a Markov chain. After discarding a burn-in set, this yields a final chain of, say, 5000 sets of values from the (joint) posterior distribution of the parameters. The posterior is formed by combining the likelihood based on the data with a prior distribution, which is typically diffuse. These chains are used for inference: for example, the mean of the chain for a parameter is analogous to the point estimate from a likelihood analysis, and the spread of the chain gives interval estimates.
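As a minimal sketch of how a chain is turned into estimates, the burn-in draws can be dropped and the remainder summarised by its mean and a simple 95% interval. The function name `summarise_chain` and the order-statistic interval are illustrative assumptions, not REALCOM's own code.

```python
import statistics


def summarise_chain(chain, burn_in):
    """Drop the first `burn_in` draws, then return (mean, (lower, upper)).

    The interval is a simple 95% interval read off the sorted draws.
    """
    kept = sorted(chain[burn_in:])
    n = len(kept)
    lower = kept[int(0.025 * (n - 1))]
    upper = kept[int(0.975 * (n - 1))]
    return statistics.mean(kept), (lower, upper)
```

In practice the chain for each parameter would be summarised separately, after checking that it has converged.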
Consider the simple 2-level model: $y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}$, with $u_j \sim N(0, \sigma_u^2)$ and $e_{ij} \sim N(0, \sigma_e^2)$. The parameters in this model are the fixed coefficients, the two variances and the level 2 residuals. From suitable starting values the chain eventually settles down so that sampling is from the true posterior distribution; we then need to draw enough samples to provide stable estimates, using suitable convergence criteria. All the MATLAB routines use MCMC sampling.
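The sampling cycle for a 2-level model can be sketched with a Gibbs sampler. This is an illustration under stated assumptions, not the REALCOM MATLAB code: it uses an intercept-only model (a level 1 response equal to an intercept plus a level 2 residual plus a level 1 residual), a flat prior on the intercept and inverse-gamma(eps, eps) priors on the two variances. All function and parameter names are hypothetical.

```python
import random


def simulate_two_level(n_groups, n_per_group, beta0, sd_u, sd_e, seed=0):
    """Simulate y_ij = beta0 + u_j + e_ij as a list of groups (lists of y values)."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        u_j = rng.gauss(0.0, sd_u)
        groups.append([beta0 + u_j + rng.gauss(0.0, sd_e)
                       for _ in range(n_per_group)])
    return groups


def gibbs_two_level(groups, n_iter=2000, burn_in=500, eps=0.001, seed=1):
    """Return post-burn-in draws of (beta0, var_u, var_e)."""
    rng = random.Random(seed)
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    beta0, var_u, var_e = 0.0, 1.0, 1.0   # starting values
    u = [0.0] * n_groups                  # level 2 residuals
    draws = []
    for it in range(n_iter):
        # Step 1: intercept given everything else (flat prior).
        total = sum(y - u[j] for j, g in enumerate(groups) for y in g)
        beta0 = rng.gauss(total / n_total, (var_e / n_total) ** 0.5)
        # Step 2: each level 2 residual given everything else.
        for j, g in enumerate(groups):
            precision = len(g) / var_e + 1.0 / var_u
            mean = sum(y - beta0 for y in g) / var_e / precision
            u[j] = rng.gauss(mean, precision ** -0.5)
        # Step 3: level 2 variance; 1 / Gamma(shape, scale=1/rate) is inverse-gamma.
        ss_u = sum(v * v for v in u)
        var_u = 1.0 / rng.gammavariate(eps + n_groups / 2.0,
                                       1.0 / (eps + ss_u / 2.0))
        # Step 4: level 1 variance.
        ss_e = sum((y - beta0 - u[j]) ** 2
                   for j, g in enumerate(groups) for y in g)
        var_e = 1.0 / rng.gammavariate(eps + n_total / 2.0,
                                       1.0 / (eps + ss_e / 2.0))
        if it >= burn_in:
            draws.append((beta0, var_u, var_e))
    return draws
```

For example, `gibbs_two_level(simulate_two_level(30, 10, 2.0, 1.0, 1.0))` returns 1500 retained draws whose column means can be read as point estimates of the intercept and the two variances.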
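Measurement errors. 1. Continuous variables, a simple example. The basic model is $X_{ij} = x_{ij} + m_{ij}$, where $X_{ij}$ is the observed value, $x_{ij}$ the true value and $m_{ij}$ the measurement error, with a model of interest e.g. $y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}$.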
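Some assumptions we need to make. The measurement error variance $\sigma_m^2$ is assumed known, or alternatively the reliability $R = \sigma_x^2 / (\sigma_x^2 + \sigma_m^2)$. We also need a distribution for the true value, e.g. $x_{ij} \sim N(\mu_x, \sigma_x^2)$. An important issue is the value chosen for $\sigma_m^2$, so a sensitivity analysis is useful; we can also give it a prior.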
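A small sketch of the link between a known measurement error variance and a reliability, useful when setting up a sensitivity analysis. It assumes the usual definition of reliability as true-value variance divided by observed variance, with errors independent of true values; the function names are hypothetical.

```python
def error_variance_from_reliability(var_observed, reliability):
    """Measurement error variance implied by reliability R = var_true / var_observed."""
    return (1.0 - reliability) * var_observed


def reliability_from_error_variance(var_observed, var_error):
    """Reliability implied by a known measurement error variance."""
    return (var_observed - var_error) / var_observed


def sensitivity_grid(var_observed, reliabilities):
    """(R, implied error variance) pairs for a range of assumed reliabilities."""
    return [(r, error_variance_from_reliability(var_observed, r))
            for r in reliabilities]
```

Running the model once for each entry of `sensitivity_grid(var_observed, [0.6, 0.7, 0.8, 0.9])` is one way to see how far the conclusions depend on the assumed value.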
2. Misclassification errors. Assume a binary (0,1) variable, for example whether or not a school pupil is eligible for free school meals (yes=1). The probability of observing a zero (no eligibility) given that the true value is zero is $P(0|0)$, and the probability of observing a one given that the true value is zero is $P(1|0) = 1 - P(0|0)$; likewise we have $P(1|1)$ and $P(0|1) = 1 - P(1|1)$. We now assume we know these misclassification probabilities, with a similar target model as before but with a binary predictor.
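To illustrate how known misclassification probabilities relate an observed value to the true one, the sketch below applies Bayes' theorem. The prevalence argument (the marginal probability that the true value is 1) and all names are assumptions made for illustration; this is not taken from the REALCOM routines.

```python
def prob_true_one(observed, prevalence, p11, p00):
    """P(true value = 1 | observed value), by Bayes' theorem.

    prevalence: P(true = 1)
    p11: P(observe 1 | true 1)
    p00: P(observe 0 | true 0)
    """
    if observed == 1:
        like_one, like_zero = p11, 1.0 - p00
    else:
        like_one, like_zero = 1.0 - p11, p00
    numerator = prevalence * like_one
    return numerator / (numerator + (1.0 - prevalence) * like_zero)
```

With a prevalence of 0.2, `p11 = 0.9` and `p00 = 0.95`, an observed 1 corresponds to a true 1 with probability 0.18 / 0.22, about 0.82.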
Modelling considerations. We can model multivariate continuous measurement errors, but only independent binary misclassifications. We can allow different measurement error variances and covariances for different groups, e.g. by gender. In the multivariate case we typically need non-zero correlations between measurement errors. Thus, say, if the reliability is R = 0.7 and the observed correlation is 0.8, then we require a measurement error correlation > 0.33.
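The figure of 0.33 can be reproduced under one set of assumptions: two standardised variables with the same reliability R, so that the observed correlation equals R times the true correlation plus (1 - R) times the measurement error correlation. Requiring the true correlation to be at most 1 then gives the bound. The function names are hypothetical, and this is a sketch of that reasoning rather than the author's stated derivation.

```python
def true_correlation(reliability, observed_corr, error_corr):
    """True-score correlation implied by observed and error correlations."""
    return (observed_corr - (1.0 - reliability) * error_corr) / reliability


def min_error_correlation(reliability, observed_corr):
    """Smallest error correlation that keeps the true correlation at or below 1."""
    return (observed_corr - reliability) / (1.0 - reliability)
```

Here `min_error_correlation(0.7, 0.8)` gives (0.8 - 0.7) / 0.3, about 0.33; a smaller error correlation would imply a true correlation above 1.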
An educational example. Maths test score related to prior test scores and FSM eligibility. We will look at continuous, correlated and binary measurement errors. Open measurement-error.exe and read the file classsize.
Factor analysis and structural equation models. Consider a single level factor model where we have several responses on each member of a sample: $y_{ri} = \beta_r + \lambda_r \nu_i + e_{ri}$, with $\nu_i \sim N(0, \sigma_\nu^2)$ and $e_{ri} \sim N(0, \sigma_{er}^2)$, where r indexes the response variable and i the person. This is a special kind of multivariate model where we assume the residuals are independent, and the covariance between two responses r and s is thus given by $\lambda_r \lambda_s \sigma_\nu^2$. A constraint is needed for identifiability, and the default is to choose $\sigma_\nu^2 = 1$.
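The covariance structure implied by a one-factor model, and the reason a constraint is needed, can be sketched as follows. The sketch assumes each response is an intercept plus a loading times a common factor plus an independent residual; the function name is hypothetical.

```python
def factor_model_covariance(loadings, factor_variance, residual_variances):
    """Implied covariance matrix of the responses under a one-factor model."""
    n = len(loadings)
    cov = [[loadings[r] * loadings[s] * factor_variance for s in range(n)]
           for r in range(n)]
    for r in range(n):
        cov[r][r] += residual_variances[r]   # residuals add to the diagonal only
    return cov
```

Doubling every loading while dividing the factor variance by four leaves the implied matrix unchanged, so the data alone cannot separate the two; fixing the factor variance (or one loading) removes this ambiguity.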
Extensions: further factors. We can add explanatory variables to the model (see later), or we can add further factors, e.g. $y_{ri} = \beta_r + \lambda_{1r}\nu_{1i} + \lambda_{2r}\nu_{2i} + e_{ri}$. As the number of factors increases we require further constraints, typically on loading values. A popular choice is simple structure, with each response loading on only one factor and non-zero correlations between factors.
Extensions: structural variables. We can allow the factors themselves to depend on further variables, e.g. [equation not preserved]. Or alternatively, but less commonly: [equation not preserved].
Two level factor models. Standard formulation: [equation not preserved]. Alternatively: [equation not preserved]. But we shall not consider this case.
Example: PISA data. A survey of reading performance of 15-year-olds in 32 countries by the OECD in 2000. We use one subscale of 35 items, retrieving information, and look at France and England. First we shall fit one and two level models assuming the responses are Normal; in fact they are binary and ordered, but we come to that later. Open structural-equation.exe and load pisadata.
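Binary and ordered responses. Assume a binary response z. We will use the idea of a latent Normal distribution. Consider the (factor) model for a single response: $y_i = \beta + \lambda \nu_i + e_i$, with $e_i \sim N(0, 1)$, where we observe a positive (=1) response for our binary variable z if y is positive, that is $z_i = 1$ if $y_i > 0$ and $z_i = 0$ otherwise, so that we obtain the probit model $P(z_i = 1) = \Phi(\beta + \lambda \nu_i)$.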
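A short sketch of the latent Normal idea for a binary response. It assumes the latent residual has variance 1, so that the probability of a positive response is the standard Normal distribution function of the linear predictor; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def probit_probability(beta, loading, factor_value):
    """P(z = 1) when z = 1 iff y = beta + loading * factor + e > 0, e ~ N(0, 1)."""
    return normal_cdf(beta + loading * factor_value)
```

With a zero linear predictor the two outcomes are equally likely, and the probability of a positive response rises with the factor value whenever the loading is positive.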
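Ordered data. Consider the cumulative probability of being in one of the lowest s+1 categories of a p category variable, with categories numbered from 0 upwards: s = 0, …, p-2. With latent response $y_i = \beta + \lambda \nu_i + e_i$, $e_i \sim N(0, 1)$, we extend the binary response model as $P(z_i \le s) = \Phi(\alpha_s - (\beta + \lambda \nu_i))$, where the $\alpha_s$ define a set of thresholds for the categories. So suppose we have a 3-category variable; then for observed responses: $z_i = 0$ if $y_i < \alpha_0$, $z_i = 1$ if $\alpha_0 \le y_i < \alpha_1$, and $z_i = 2$ if $y_i \ge \alpha_1$.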
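The threshold idea can be sketched by differencing cumulative probit probabilities to obtain the probability of each ordered category. The sign convention (threshold minus linear predictor, with a unit-variance latent residual) is an assumption, as are the function names.

```python
import math


def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ordered_category_probabilities(thresholds, linear_predictor):
    """Probabilities of categories 0..p-1 given p-1 increasing thresholds."""
    cumulative = [normal_cdf(a - linear_predictor) for a in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)   # P(category s) = P(<= s) - P(<= s-1)
        previous = c
    return probabilities
```

For a 3-category variable two thresholds are needed; with thresholds of -0.5 and 0.5 and a zero linear predictor the outer categories are equally likely.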
PISA data with binary/ordered responses. In fact all the responses are binary except for 4 with 3 ordered categories: C9, C14, C20 and C26. Change these responses and rerun the models. Finally, fit explanatory variables Country and Gender in the structural part of the model.
Multivariate models with responses at 2 levels. Consider first 2 Normal responses, where a superscript indicates the level. The models are linked via the level 2 covariance matrix. The MCMC algorithm handles missing response data, and categorical (binary, ordered and unordered) as well as Normal data. The first example is a repeated measures growth curve model.
Child heights and adult height. Child height is modelled as a cubic polynomial, with intercept and slope random at level 2.
Adult height prediction. Suppose we have 2 growth measures; we want a regression prediction of the form [equation not preserved]. This leads to: [equation not preserved].
Mixed response types and missing data. Normal and ordered data have already been considered in structural equation models. We now introduce unordered categorical responses. We can also have general Normalising transformations. Missing data via imputation is an important application for these models.
Unordered categorical responses. Assume p categories where an individual responds to just one, with h indexing the response category. For each category we assume an underlying latent variable $y_{hi}$ exists, with its own model [equation not preserved]. For identifiability we model p-1 categories [constraint not preserved]. The maximum indicant model: we observe category h for individual i iff $y_{hi} > y_{h'i}$ for all $h' \neq h$.
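The maximum indicant rule can be sketched as picking the category whose latent value is largest. How the reference category is handled is an assumption here: the sketch fixes its latent value at 0 and models only the other p-1 categories. Function names are hypothetical.

```python
def observed_category(latent_values):
    """Index of the largest latent value (the maximum indicant rule)."""
    best = 0
    for h, value in enumerate(latent_values):
        if value > latent_values[best]:
            best = h
    return best


def observed_category_with_reference(latent_p_minus_1):
    """Assumption: the reference category (index 0) has latent value fixed at 0."""
    return observed_category([0.0] + list(latent_p_minus_1))
```

Under this convention the reference category is observed exactly when all the modelled latent values are negative.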
Multiple imputation, briefly and simply. Consider the model of interest (MOI). We turn this into a multivariate response model and obtain residual estimates (from an MCMC chain) for the values which are missing. Use these to fill in the missing values and produce a complete data set. Do this (independently) n (e.g. = 20) times. Fit the MOI to each data set and combine according to rules to get estimates and standard errors.
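The final combining step can be sketched with Rubin's rules, which are the usual choice for multiple imputation; that these are the rules meant here is an assumption. The function takes one coefficient's estimate and squared standard error from each completed data set; its name is hypothetical.

```python
import math
import statistics


def combine_rubin(estimates, squared_ses):
    """Pool one parameter across m imputed data sets; returns (estimate, SE)."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(squared_ses)        # average sampling variance
    between = statistics.variance(estimates)     # divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)
```

The between-imputation term is what makes the pooled standard error reflect the uncertainty due to the missing values, rather than treating the filled-in data as if they had been observed.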
Class size example. Load classsize_impute. The MOI is Normalised exam score as response, regressed on pretest score, gender, FSM and class size. 50% of level 1 units have missing data. Multivariate model: [equation not preserved].
MI estimates vs listwise deletion. Fixed effects in multivariate model, with 50% of records missing completely at random (MCAR). Estimates, with standard errors in brackets:
Parameter          Listwise (SE)     MI (SE)           Complete (SE)
Post maths          0.102 (0.088)     0.134 (0.071)     0.134 (0.070)
Pre maths           0.011 (0.088)     0.032 (0.071)     0.019 (0.071)
Gender              0.096 (0.074)     0.073 (0.047)     0.069 (0.047)
FSM                -1.124 (0.159)    -1.090 (0.129)    -1.064 (0.129)
Class size (-30)   -4.030 (0.602)    -4.049 (0.597)    -4.267 (0.544)
Further extensions. Box-Cox normalising transformations. Application to survival data treated as an ordered response when divided into discrete time intervals. Combination of measurement errors, structural models and responses at >1 level into a single program. Incorporation into MLwiN.
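For the Box-Cox extension, the usual one-parameter form of the transformation can be sketched as below. That this is the exact form intended is an assumption, and the function name is hypothetical.

```python
import math


def box_cox(y, lam):
    """One-parameter Box-Cox transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)          # limiting case as lam tends to 0
    return (y ** lam - 1.0) / lam
```

A value of `lam = 1` leaves the shape of the distribution unchanged (it only shifts by 1), while `lam = 0` gives the log transform.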
General remarks. Report back welcome (firstname.lastname@example.org). A REALCOM discussion group is under consideration. Use with care!