 # Multilevel modelling short course

Multilevel modelling short course
Mark Tranmer, CCSR

What is multilevel analysis?
Many populations have a group structure of some kind, hierarchical or non-hierarchical. For example, pupils can be grouped into schools, and individuals can be grouped into areas. Pupils can also be grouped by school and by neighbourhood at the same time. Suppose we wish to assess area variations in income, possibly with respect to other factors.

What is multilevel analysis?
If we have district level data we can estimate a district level relationship, e.g. between average income and average age in each district. If we have individual level data we can estimate an individual level relationship, e.g. we can relate a person’s income to that person’s age.

What is multilevel analysis?
But how do we assess the relationships at the district level and the individual level at the same time? We can do this with a multilevel model. We can fit this kind of model with specialist software such as MLwiN, which we will use today.

The ecological fallacy
We could assume that an equation we estimate at the district level also holds at the individual level, that is, make a cross-level inference. But this is generally not sensible: individuals vary within each district with respect to the variables we wish to relate. Hence we could well make invalid inferences about the relationship at the individual level. This phenomenon is referred to as ‘the ecological fallacy’.
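A small numerical sketch may help to show how a district level relationship can differ from the individual level one. The data below are invented purely for illustration (they are not from the course data), and the helper names `mean` and `ols_slope` are hypothetical.

```python
# Hypothetical toy data: (x, y) values for individuals in three districts.
# The values are invented purely to illustrate the ecological fallacy.
districts = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# District level relationship: regress district mean y on district mean x.
district_slope = ols_slope(
    [mean(x) for x, _ in districts.values()],
    [mean(y) for _, y in districts.values()],
)

# Individual level relationship within each district.
within_slopes = {name: ols_slope(x, y) for name, (x, y) in districts.items()}

# Individual level relationship ignoring districts altogether.
all_x = [xi for x, _ in districts.values() for xi in x]
all_y = [yi for _, y in districts.values() for yi in y]
pooled_slope = ols_slope(all_x, all_y)

print("district level slope:", district_slope)
print("within-district slopes:", within_slopes)
print("pooled individual level slope:", round(pooled_slope, 3))
```

In this constructed example the district means rise together, while within every district the relationship between individuals runs in the opposite direction, so a cross-level inference from the district equation would be misleading.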

Problems of ignoring population structure
If we carry out the analysis at the individual level only, we do not recognise in our analysis that ‘similar’ individuals tend to live within the same small sub-areas of our population. That is, ‘clustering’ occurs. Ignoring this clustering may lead to biased estimates of summary statistics, especially variances, standard deviations and standard errors. Hence we might falsely attribute statistical significance (or non-significance) to results if we ignore the clustering.
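One way to get a feel for the size of this problem is the design effect used in survey sampling, 1 + (n - 1)ρ, where n is the number of individuals per cluster and ρ is the intra-class correlation. This formula is not part of the slides; it is brought in here as an illustration. It assumes equal-sized clusters and applies to the standard error of a mean. The numbers used are hypothetical.

```python
import math


def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (n - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def clustered_standard_error(naive_se, cluster_size, icc):
    """Inflate a standard error that was computed as if observations were independent."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))


# Hypothetical values: 20 individuals per area, intra-class correlation 0.05,
# and a naive standard error of 1.0 for a mean.
deff = design_effect(20, 0.05)
print("design effect:", round(deff, 3))
print("adjusted standard error:", round(clustered_standard_error(1.0, 20, 0.05), 3))
```

Even a small intra-class correlation can inflate the variance of a mean noticeably when clusters are large, which is why standard errors that ignore clustering tend to be too small.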

Examples of multilevel relationships

Some substantive multilevel examples
Schools: variations in exam score (‘school effectiveness’). Level 3: school. Level 2: class. Level 1: pupil.

Some substantive multilevel examples
Areas: variations in health. Level 3: counties. Level 2: districts. Level 1: people. People: dental data. Level 2: people’s mouths. Level 1: teeth.

Some substantive multilevel examples
Time as a level. Level 2: person. Level 1: occasion. Multivariate. Level 2: pupil. Level 1: subject of exam score.

Terminology
Nesting: level k-1 units are contained in level k units. E.g. classes at level 2 are nested in schools at level 3; classes are the level 2 units and schools are the level 3 units. Cross-classification: higher level units that are not nested within one another, e.g. school and neighbourhood both at level 2, with pupil at level 1.

Continuous and Binary Response variables
For a continuous response we use a multilevel model that is an extension of the standard multiple regression model – as we will see this morning. For a binary response we use a multilevel model that is an extension of the logistic regression model – as we will see this afternoon.

Data requirements
What are the data requirements for multilevel modelling? The standard requirement is a dataset that includes an indicator of the group to which each individual unit belongs. For example, information for a sample of pupils that includes an indicator of the school that they attend. Another example is a sample of individuals that includes an indicator of the area in which they live.

Fixed effects
What about fixed effects analysis? If we had information on pupils that attended three schools, we could carry out a fixed effects analysis to compare the three schools based on these sample data. We would do this with an analysis that includes two dummy variables, which allow us to compare the schools. We could make inferences from our results about how the three schools compare, but we would not want to make wider inferences about ‘all schools’ based on information on only 3 schools.
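The dummy variable approach for three schools can be sketched as follows. The scores are invented, the school labels are hypothetical, and the small `solve` helper is only there to keep the example free of external libraries; in practice a statistics package would fit the regression.

```python
# Hypothetical exam scores for pupils in three schools (invented data).
scores = {
    "school_1": [50.0, 54.0, 52.0],
    "school_2": [60.0, 62.0, 58.0],
    "school_3": [45.0, 47.0, 43.0],
}

# Design matrix: intercept plus two dummies, with school_1 as the reference.
X, y = [], []
for school, values in scores.items():
    for value in values:
        X.append([1.0, float(school == "school_2"), float(school == "school_3")])
        y.append(value)


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Ordinary least squares via the normal equations (X'X) b = X'y.
k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
intercept, effect_school_2, effect_school_3 = solve(xtx, xty)

print(round(intercept, 6), round(effect_school_2, 6), round(effect_school_3, 6))
```

The intercept recovers the mean of the reference school, and each dummy coefficient is the difference between that school's mean and the reference school's mean. The comparison is specific to these three schools and says nothing about schools in general.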

Multilevel modelling
For multilevel modelling we would need information on a ‘reasonable number of higher level units’. What is ‘reasonable’? Snijders and Bosker (1999) recommend at least 10 groups; 20 or more is better. We essentially assume we have a representative sample of higher level units in multilevel modelling, so 30 is a good number to have in mind.

Multilevel modelling
Suppose we had data for pupils based on 30 schools. We could carry out a fixed effects analysis on these data by using 29 dummy variables. Or we could use multilevel modelling, which assumes the schools are themselves a sample. With multilevel modelling we do not need to estimate so many model parameters, so it is desirable in this situation. Multilevel modelling also takes group size into account in estimation: estimates of residuals for groups with few members, e.g. a school with 2 pupils, are ‘shrunken’ towards the mean.
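The shrinkage described above can be sketched for the simplest (variance components) case, where a group's raw mean residual is multiplied by σ²u / (σ²u + σ²e / n_j). The function names and the variance values below are hypothetical and are not estimates from any real data set.

```python
def shrinkage_factor(n_j, sigma2_u, sigma2_e):
    """Weight applied to a group's raw mean residual: s2u / (s2u + s2e / n_j)."""
    return sigma2_u / (sigma2_u + sigma2_e / n_j)


def shrunken_residual(raw_residual, n_j, sigma2_u, sigma2_e):
    """Estimated group residual after shrinkage towards the overall mean."""
    return shrinkage_factor(n_j, sigma2_u, sigma2_e) * raw_residual


# Hypothetical variance components, chosen only for illustration.
SIGMA2_U, SIGMA2_E = 0.2, 0.8

# The same raw residual of 1.0 is shrunk less as the group gets larger.
for n_j in (2, 20, 200):
    print(n_j, round(shrunken_residual(1.0, n_j, SIGMA2_U, SIGMA2_E), 3))
```

With only 2 pupils the raw residual is pulled most of the way back towards the mean, whereas a school with 200 pupils keeps nearly all of its raw residual.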

Theory: Single level models
Suppose we have data for 4059 pupils in 65 schools. How could we model the data? Model 1: a pupil level model based on the 4059 pupils, with $\mathrm{Var}(y_i) = \sigma^2$.

Single level models Model 2: Or a school level model based on aggregate data for the 65 schools; that is, the school means.

Multilevel models: model 3 ‘variance components’ model
$\mathrm{Var}(y_{ij}) = \sigma^2_u + \sigma^2_e = \sigma^2$, where $i$ is the pupil subscript and $j$ is the school subscript. $\sigma^2_u$ measures variation between schools. $\sigma^2_e$ measures variation between pupils within schools.
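The model equation for this slide did not survive in the transcript. A standard way of writing a two-level variance components model that is consistent with the variance decomposition above is sketched below; the symbol $\beta_0$ for the overall mean is an assumption about the notation used.

```latex
y_{ij} = \beta_0 + u_j + e_{ij}, \qquad
u_j \sim N(0, \sigma^2_u), \qquad
e_{ij} \sim N(0, \sigma^2_e)
```

Because $u_j$ and $e_{ij}$ are assumed independent, their variances add, which gives $\mathrm{Var}(y_{ij}) = \sigma^2_u + \sigma^2_e$.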

Intra-‘class’ correlation
2u /2 = the intra class correlation: the proportion of the overall variation in exam score attributable to schools. i.e. how similar are exam scores within schools

Random intercepts model
Model 4: 2 level model: pupils in schools, with an explanatory variable.
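The equation for Model 4 is missing from the transcript. A standard form of a two-level random intercepts model with one explanatory variable $x_{ij}$ is sketched here; the symbols $\beta_0$, $\beta_1$ and $x_{ij}$ are assumed notation.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}, \qquad
u_j \sim N(0, \sigma^2_u), \qquad
e_{ij} \sim N(0, \sigma^2_e)
```

Each school $j$ has its own intercept $\beta_0 + u_j$, while the slope $\beta_1$ is common to all schools.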

Random slopes model
Model 5: random slopes. This model can be written in terms of a ‘random slopes’ coefficient or, alternatively but equivalently, as a single equation.
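The equations for Model 5 are missing from the transcript. A standard way of writing a two-level random slopes model is sketched below, first with a school-specific slope coefficient and then in the equivalent expanded form. The symbols $u_{0j}$, $u_{1j}$ and $\sigma_{u01}$ are assumed notation.

```latex
y_{ij} = \beta_0 + \beta_{1j} x_{ij} + u_{0j} + e_{ij}, \qquad
\beta_{1j} = \beta_1 + u_{1j}

y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}

\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma^2_{u0} & \sigma_{u01} \\ \sigma_{u01} & \sigma^2_{u1} \end{pmatrix}
\right), \qquad
e_{ij} \sim N(0, \sigma^2_e)
```

Substituting $\beta_{1j} = \beta_1 + u_{1j}$ into the first equation gives the second, so the two forms describe the same model.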

Group level variables
We can also add group level variables to the model, e.g. the type of school (mixed or single sex), or the percentage of pupils taking free school meals in the school.

Binary response variables
Many response variables are ‘binary’ (also called ‘0/1’ or ‘dichotomous’), e.g. whether or not a person is unemployed or has a limiting long term illness. Risk of unemployment may be associated with personal characteristics and/or where people live. We can use multilevel logistic models to investigate these issues.

Binary response variables
Let’s suppose we are looking at the risk of people being unemployed given some demographic characteristics, and also given some information about the area in which they live. We can look at this problem using multilevel logistic regression models.

Multilevel logistic regression models
Model 6: the basic (two level) multilevel model for a binary response is written in terms of $y_{ij}$, which takes the value 0 or 1 for each individual $i$ in group $j$ (0 = not unemployed, 1 = unemployed); $p_{ij}$, the predicted probability of unemployment for individual $i$ in area $j$; and $e_{ij}$, an individual level error.

Multilevel logistic regression models
Here $\beta_0$ is the ‘intercept’ and $\beta_1$ to $\beta_p$ are the coefficients of the $p$ explanatory variables.
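The equations for Model 6 are missing from the transcript. The quantities described on the two slides above fit together in a standard two-level logistic model, sketched here. The area level random effect $u_j$ and the explanatory variables $x_{1ij}, \dots, x_{pij}$ are assumed notation.

```latex
y_{ij} = p_{ij} + e_{ij}

\operatorname{logit}(p_{ij}) = \log\!\left(\frac{p_{ij}}{1 - p_{ij}}\right)
= \beta_0 + \beta_1 x_{1ij} + \dots + \beta_p x_{pij} + u_j, \qquad
u_j \sim N(0, \sigma^2_u)
```

The logit link keeps the predicted probability between 0 and 1, and $u_j$ allows the underlying risk of unemployment to differ between areas.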

MLwiN for binary response variables.
MLwiN could be used to fit a multilevel model based on the example of unemployment as a response variable and some demographic information as explanatory variables. For this analysis we could use 1991 UK Census data from the Samples of Anonymised Records (SARs). The MLwiN procedure for binary response variables is slightly more involved than that for continuous response variables. See chapter 9 of the MLwiN user guide.

SPSS for multilevel modelling
In SPSS version 11.5 and later it is possible to fit multilevel models for dependent variables measured on an interval scale. The syntax on the next slide shows how variance components, random intercepts and random intercepts/slopes models can be fitted for a 2-level example: pupils in schools.

SPSS for multilevel modelling
Random intercepts and slopes (on standlrt) model for pupils in schools (normexam is the continuous response; standlrt is a continuous explanatory variable). Syntax is as follows:
mixed normexam with standlrt / print = solution / fixed standlrt / random intercept standlrt | subject(school) covtype(UN).
[To access via SPSS menus: Analyze > Mixed Models.]

Variance components model only:
mixed normexam / print = solution / random intercept | subject(school) covtype(UN).
Random intercepts model only:
mixed normexam with standlrt / fixed standlrt / random intercept | subject(school) covtype(UN).