Multilevel modelling short course Mark Tranmer, CCSR
What is multilevel analysis? Many populations have a group structure of some kind, hierarchical or non-hierarchical. For example, pupils can be grouped into schools, and individuals can be grouped into areas. Pupils can also be grouped by school and by neighbourhood at the same time. Suppose we wish to assess area variations in income, possibly with respect to other factors.
What is multilevel analysis? If we have district level data we can estimate a district level relationship, e.g. between average income and average age in each district. If we have individual level data we can estimate an individual level relationship, e.g. we can relate a person's income to that person's age.
What is multilevel analysis? But how do we assess the relationships at the district level and the individual level at the same time? We can do this with a multilevel model. We can fit this kind of model with specialist software such as MLwiN, which we will use today.
The ecological fallacy We could assume that a relationship we estimate at the district level also holds at the individual level, that is, make a cross-level inference. But this is generally not sensible: individuals vary within each district with respect to the variables we wish to relate. Hence we could well make invalid inferences about the relationship at the individual level. This phenomenon is referred to as the ecological fallacy.
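A minimal sketch of the ecological fallacy, using made-up (age, income) values for three hypothetical districts. The data and the helper names `ols_slope` and `mean_point` are illustrative assumptions, not part of the course material. The district-level slope is positive even though the slope within every district is negative.

```python
# Hypothetical (age, income) pairs for people in three districts.
# Within each district income falls with age; across districts,
# average income rises with average age.
districts = {
    "A": [(20, 14), (30, 12), (40, 10)],
    "B": [(40, 24), (50, 22), (60, 20)],
    "C": [(60, 34), (70, 32), (80, 30)],
}


def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def mean_point(points):
    """Return the (mean x, mean y) pair for one district."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# District-level relationship: regress district mean income on district mean age.
district_means = [mean_point(people) for people in districts.values()]
district_slope = ols_slope(district_means)

# Individual-level relationship within each district.
within_slopes = {name: ols_slope(people) for name, people in districts.items()}

print("district-level slope:", district_slope)
print("within-district slopes:", within_slopes)
```

Inferring the individual-level relationship from the district-level slope alone would get the sign wrong in this constructed example.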
Problems of ignoring population structure If we carry out the analysis at the individual level, we do not recognise that similar individuals tend to live within the same small sub-areas of our population; that is, clustering occurs. Ignoring this clustering may lead to biased estimates of summary statistics, especially variances, standard deviations and standard errors. Hence we might falsely attribute statistical significance (or non-significance) to results if we ignore the clustering.
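One common way to gauge how much clustering matters for the standard error of a mean is the design effect, 1 + (m - 1) x ICC, for equal-sized clusters of m units, where ICC is the share of total variance lying between clusters. The formula is brought in here as an illustration and the numbers are hypothetical; it is a rough sketch rather than part of the course material.

```python
import math


def design_effect(cluster_size, icc):
    """Approximate variance inflation of a mean under equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical values: 25 people sampled per area, 10% of variance between areas.
deff = design_effect(cluster_size=25, icc=0.1)

# Standard errors computed as if observations were independent are too small
# by roughly this factor.
se_inflation = math.sqrt(deff)

print("design effect:", round(deff, 2))
print("standard error inflation:", round(se_inflation, 2))
```

Even a modest degree of clustering can therefore make naive standard errors noticeably too small, which is how significance may be falsely attributed.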
Examples of multilevel relationships
Some substantive multilevel examples Schools. Variations in exam performance. Level 3: school Level 2: class Level 1: pupils Variations in exam score. School effectiveness
Some substantive multilevel examples Areas: Variations in health. Level 3: Counties Level 2: Districts Level 1: People. People: Dental data. Level 2: People's mouths Level 1: Teeth
Some substantive multilevel examples Time as a level. Level 2: Person Level 1: Occasion Multivariate. Level 2: Pupil Level 1: subject of exam score.
Terminology Nesting. Level k-1 units are contained in level k units. E.g. classes at level 2 nested in schools at level 3: classes are the level 2 units, schools are the level 3 units. Cross classification. Higher level classifications that are not nested in one another, e.g. school and neighbourhood both at level 2, with pupils at level 1.
Continuous and Binary Response variables For a continuous response we use a multilevel model that is an extension of the standard multiple regression model – as we will see this morning. For a binary response we use a multilevel model that is an extension of the logistic regression model – as we will see this afternoon.
Data requirements What are the data requirements for multilevel modelling? The standard requirement is a dataset that includes indicators of the group to which each individual unit belongs. For example, information for a sample of pupils that includes an indicator of the school that they attend. Another example is a sample of individuals that includes an indicator of the area in which they live.
Fixed effects What about fixed effects analysis? If we had information on pupils that attended three schools, we can carry out a fixed effects analysis to compare the three schools based on these sample data. We would do this by doing an analysis that includes two dummy variables that allow us to compare the schools. We could make inferences from our results about how the three schools compare but we would not want to make wider inferences about all schools based on information on only 3 schools.
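As a small sketch of the dummy-variable coding described above: with three schools, one school is taken as the reference category and the other two each get a 0/1 indicator. The function name `dummy_code`, the school labels and the choice of reference school are all hypothetical.

```python
def dummy_code(labels, reference):
    """Return one dict of 0/1 indicators per label, omitting the reference category."""
    categories = sorted(set(labels) - {reference})
    return [
        {f"school_{c}": int(label == c) for c in categories}
        for label in labels
    ]


# Hypothetical school indicator for five pupils attending schools A, B and C.
schools = ["A", "B", "C", "A", "C"]
dummies = dummy_code(schools, reference="A")

for school, row in zip(schools, dummies):
    print(school, row)
```

Pupils in the reference school have both indicators equal to 0, so each dummy coefficient in a regression would compare one school with school A.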
Multilevel modelling For multilevel modelling we would have information on a reasonable number of higher level units. What is reasonable? Snijders and Bosker (1999) recommend at least 10 groups; 20 or more is better. We essentially assume we have a representative sample of higher level units in multilevel modelling, so 30 is a good number to have in mind.
Multilevel modelling Suppose we had data for pupils based on 30 schools. We could carry out a fixed effects analysis on these data by using 29 dummy variables. Or we could use multilevel modelling which assumes the schools are themselves a sample. Hence we do not need to estimate so many model parameters using multilevel modelling and it is desirable in this situation. Multilevel modelling also takes into account group size in estimation – estimates of residuals for groups with small populations – e.g. a school with 2 pupils – are shrunken towards the mean.
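The shrinkage of group residuals can be sketched for the simplest case, a variance components model, where the raw school residual (school mean minus overall mean) is multiplied by a factor between 0 and 1 that depends on school size. The variance values and the function name below are hypothetical, and the formula is the standard one for this simple model rather than anything specific to a particular package.

```python
def shrunken_residual(group_mean, overall_mean, n_j, var_u, var_e):
    """Shrunken school residual in a variance components model.

    var_u is the between-school variance, var_e the pupil-level variance,
    and n_j the number of pupils observed in the school.
    """
    shrinkage = var_u / (var_u + var_e / n_j)
    return shrinkage * (group_mean - overall_mean)


# Hypothetical variances; both schools have the same raw residual of +1.0.
var_u, var_e = 0.2, 0.8
small = shrunken_residual(1.0, 0.0, n_j=2, var_u=var_u, var_e=var_e)
large = shrunken_residual(1.0, 0.0, n_j=200, var_u=var_u, var_e=var_e)

print("school with 2 pupils:  ", round(small, 3))
print("school with 200 pupils:", round(large, 3))
```

The school with 2 pupils is pulled strongly towards the overall mean, while the school with 200 pupils keeps almost all of its raw residual.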
Theory: Single level models Suppose we have data for 4059 pupils in 65 schools. How could we model the data? Model 1: pupil level model based on the 4059 pupils. $\mathrm{Var}(y_i) = \sigma^2$
Single level models Model 2: Or a school level model based on aggregate data for the 65 schools; that is, the school means.
Multilevel models: model 3, variance components model. $\mathrm{Var}(y_{ij}) = \sigma^2_u + \sigma^2_e = \sigma^2$, where i is the pupil subscript and j is the school subscript. $\sigma^2_u$ measures variation between schools. $\sigma^2_e$ measures variation between pupils.
Intra-class correlation $\sigma^2_u / \sigma^2$ = the intra-class correlation: the proportion of the overall variation in exam score attributable to schools, i.e. how similar exam scores are within schools.
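A short worked example of the intra-class correlation, using hypothetical variance estimates (they are not results from the 65-school data):

```python
def intraclass_correlation(var_u, var_e):
    """Share of total variance that lies between schools."""
    return var_u / (var_u + var_e)


# Hypothetical estimates: between-school variance 0.2, pupil-level variance 0.8.
icc = intraclass_correlation(var_u=0.2, var_e=0.8)
print("intra-class correlation:", round(icc, 2))
```

With these assumed values, 20% of the overall variation in exam score would be attributable to schools.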
Random intercepts model Model 4: 2 level model: pupils in schools, with an explanatory variable.
Random slopes model Model 5: random slopes Where the random slopes coefficient is: Or alternatively, but equivalently, we can write the model as:
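In conventional multilevel notation, Models 3 to 5 can be written as below. This is a sketch under the assumption that the slides follow the usual notation; the symbols on the original slides may differ. Here $x_{ij}$ is the pupil-level explanatory variable, $u_{0j}$ and $u_{1j}$ are school-level random effects, and $e_{ij}$ is the pupil-level residual.

```latex
% Model 3: variance components
y_{ij} = \beta_0 + u_{0j} + e_{ij}, \qquad
u_{0j} \sim N(0, \sigma^2_u), \quad e_{ij} \sim N(0, \sigma^2_e)

% Model 4: random intercepts with one explanatory variable
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + e_{ij}

% Model 5: random slopes
y_{ij} = \beta_0 + \beta_{1j} x_{ij} + u_{0j} + e_{ij}, \qquad
\beta_{1j} = \beta_1 + u_{1j}

% Model 5, equivalent form after substituting for \beta_{1j}
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_{0j} + u_{1j} x_{ij} + e_{ij}
```

In Model 3 the between-school variance $\sigma^2_u$ is the variance of $u_{0j}$, which matches the variance decomposition $\sigma^2_u + \sigma^2_e$ given for the variance components model.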
Group level variables We can also add group level variables to the model, e.g. the type of school (mixed or single sex), or the percentage of pupils taking free school meals in the school.
Binary response variables Many response variables are binary (0/1, dichotomous). E.g. whether or not a person is unemployed or has a limiting long term illness. Risk of unemployment may be associated with personal characteristics and/or where people live. We can use multilevel logistic models to investigate these issues.
Binary response variables Let's suppose we are looking at the risk of people being unemployed given some demographic characteristics, and also given some information about the area in which they live. We can look at this problem using multilevel logistic regression models.
Multilevel logistic regression models Model 6: The basic (two level) multilevel model for a binary response is written as follows, where $y_{ij}$ takes the value 0 or 1 for each individual i in group j (0 = not unemployed, 1 = unemployed), $p_{ij}$ is the predicted probability of unemployment for individual i in area j, and $e_{ij}$ is an individual level error.
Multilevel logistic regression models Where $\beta_0$ is the intercept and $\beta_1$ to $\beta_p$ are the coefficients of the p explanatory variables.
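To illustrate how a two level logistic model turns coefficients into a predicted probability, the sketch below assumes the usual random intercept form with one explanatory variable, logit(p_ij) = beta0 + beta1 * x_ij + u_j, where u_j is an area-level effect. This form, the function name and all coefficient values are assumptions for illustration, not estimates from the Census data.

```python
import math


def predicted_probability(beta0, beta1, x, u_j):
    """Inverse logit of the linear predictor beta0 + beta1 * x + u_j."""
    linear_predictor = beta0 + beta1 * x + u_j
    return 1 / (1 + math.exp(-linear_predictor))


# Hypothetical coefficients; the same person placed in two different areas.
p_average_area = predicted_probability(beta0=-2.0, beta1=0.5, x=1.0, u_j=0.0)
p_high_risk_area = predicted_probability(beta0=-2.0, beta1=0.5, x=1.0, u_j=1.0)

print("area with average risk:", round(p_average_area, 3))
print("area with raised risk: ", round(p_high_risk_area, 3))
```

Two people with identical characteristics can therefore have different predicted probabilities of unemployment purely because of the area in which they live.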
MLwiN for binary response variables. MLwiN could be used to fit a multilevel model based on the example of unemployment as a response variable and some demographic information as explanatory variables. For this analysis we could use 1991 UK Census data from the Samples of Anonymised Records (SARs). The MLwiN procedure for binary response variables is slightly more involved than that for continuous response variables. See chapter 9 of the MLwiN user guide (PDF).
SPSS for multilevel modelling In SPSS version 11.5 and later it is possible to fit multilevel models for dependent variables measured on an interval scale. The syntax on the next slide shows how variance components, random intercepts and random intercepts/slopes models can be fitted for a 2-level example: pupils in schools.
SPSS for multilevel modelling [to access via SPSS menus: analyse > mixed models] Random intercepts and slopes (on standlrt) model for pupils in schools (normexam is the continuous response; standlrt is a continuous explanatory variable). Syntax is as follows.
mixed normexam with standlrt
  / print = solution
  / fixed standlrt
  / random intercept standlrt | subject(school) covtype(UN).
Variance components model only:
mixed normexam
  / print = solution
  / random intercept | subject(school) covtype(UN).
Random intercepts model only:
mixed normexam with standlrt
  / print = solution
  / fixed standlrt
  / random intercept | subject(school) covtype(UN).
Reading list Books:
Plewis, I. (1997) Statistics in Education. Edward Arnold.
Snijders, T. and Bosker, R. (1999) An Introduction to Basic and Advanced Multilevel Modelling. Sage Publications.
Goldstein, H. (1995) Multilevel Statistical Models. Edward Arnold.
Web: NB: a new version of MLwiN (2.10) has just been released: see website.