Missing Data and random effect modelling


Lecture 20: Missing Data and random effect modelling

Lecture Contents
What is missing data?
Simple ad-hoc methods.
Types of missing data (MCAR, MAR, MNAR).
Principled methods.
Multiple imputation.
Methods that respect the random effect structure.
Thanks to James Carpenter (LSHTM) for many slides!!

Dealing with missing data
Why is this necessary? Missing data are common. However, they are usually inadequately handled in both epidemiological and experimental research. For example, Wood et al. (2004) reviewed 71 recently published BMJ, JAMA, Lancet and NEJM papers. 89% had partly missing outcome data. In 37 trials with repeated outcome measures, 46% performed complete case analysis. Only 21% reported sensitivity analysis.

What do we mean by missing data?
Missing data are observations that we intended to make but did not make. For example, an individual may respond only to certain questions in a survey, or may not respond at all to a particular wave of a longitudinal survey. In the presence of missing data, our goal remains making inferences that apply to the population targeted by the complete sample - i.e. the goal remains what it would have been had we seen the complete data. However, both making inferences and performing the analysis are now more complex. We will see that we need to make assumptions in order to draw inferences, and then use an appropriate computational approach for the analysis. We will avoid computationally simple solutions (such as analysing only the complete data, or carrying forward the last observation in a longitudinal study), which generally lead to misleading inferences.

What are missing data?
In practice the data consist of (a) the observations actually made (where '?' denotes a missing observation), and (b) the pattern of missing values.
[The slide shows a small units-by-variables table of the observed data, with '?' marking the missing entries, alongside the corresponding table of the missingness pattern.]

Inferential Framework
When it comes to analysis, whether we adopt a frequentist or a Bayesian approach, the likelihood is central. In these slides, for convenience, we discuss issues from a frequentist perspective, although often we use appropriate Bayesian computational strategies to approximate frequentist analyses.

Classical Approach The actual sampling process involves the 'selection' of the missing values, as well as the units. So to complete the process of inference in a justifiable way we need to take this into account.

Bayesian Framework Posterior belief is obtained by combining prior belief with the likelihood (posterior ∝ prior × likelihood). Here:
The likelihood is a measure of comparative support for different models given the data. It requires a model for the observed data, and as with classical inference this must involve aspects of the way in which the missing data have been selected (i.e. the missingness mechanism).

What do we mean by valid inference when we have missing data?
We have already noted that missing data are observations we intended to make but did not. Thus, the sampling process now involves both the selection of the units, AND ALSO the process by which observations become missing - the missingness mechanism. It follows that for valid inference, we need to take account of the missingness mechanism. By valid inference in a frequentist framework we mean that the quantities we calculate from the data have the usual properties. In other words, estimators are consistent, confidence intervals attain nominal coverage, p-values are correct under the null hypothesis, and so on.

Assumptions We distinguish between item and unit nonresponse (missingness). For item missingness, values can be missing on response (i.e. outcome) variables and/or on explanatory (i.e. design/covariate/exposure/confounder) variables. Missing data can affect properties of estimators (for example, means, percentages, percentiles, variances, ratios, regression parameters and so on). Missing data can also affect inferences, i.e. the properties of tests and confidence intervals, and Bayesian posterior distributions. A critical determinant of these effects is the way in which the probability of an observation being missing (the missingness mechanism) depends on other variables (measured or not) and on its own value. In contrast with the sampling process, which is usually known, the missingness mechanism is usually unknown.

Assumptions The data alone cannot usually definitively tell us the sampling process. Likewise, the missingness pattern, and its relationship to the observations, cannot definitively identify the missingness mechanism. The additional assumptions needed to allow the observed data to be the basis of inferences that would have been available from the complete data can usually be expressed in terms of either
1. the relationship between selection of missing observations and the values they would have taken, or
2. the statistical behaviour of the unseen data.
These additional assumptions cannot be definitively assessed from the data under analysis.

Assumptions The issues surrounding the analysis of data sets with missing values therefore centre on assumptions. We have to:
1. decide which assumptions are reasonable and sensible in any given setting - contextual/subject matter information will be central to this;
2. ensure that the assumptions are transparent;
3. explore the sensitivity of inferences/conclusions to the assumptions; and
4. understand which assumptions are associated with particular analyses.

Getting computation out of the way
The above implies it is sensible to use approaches that make weak assumptions, and to seek computational strategies to implement them. However, often computationally simple strategies are adopted, which make strong assumptions, which are subsequently hard to justify. Classic examples are completers analysis (i.e. only including units with fully observed data in the analysis) and last observation carried forward. The latter is sometimes advocated in longitudinal studies, and replaces a unit's unseen observations at a particular wave with their last observed values, irrespective of the time that has elapsed between the two waves.

Conclusions (1) Missing data introduce an element of ambiguity into statistical analysis, which is different from the traditional sampling imprecision. While sampling imprecision can be reduced by increasing the sample size, this will usually only increase the number of missing observations! As discussed in the preceding sections, the issues surrounding the analysis of incomplete datasets turn out to centre on assumptions and computation. The assumptions concern the relationship between the reason for the missing data (i.e. the process, or mechanism, by which the data become missing) and the observations themselves (both observed and unobserved). Unlike say in regression, where we can use the residuals to check on the assumption of normality, these assumptions cannot be verified from the data at hand. Sensitivity analysis, where we explore how our conclusions change as we change the assumptions, therefore has a central role in the analysis of missing data.

Simple, ad-hoc methods and their shortcomings
In contrast to principled methods, these usually create a single 'complete' dataset, which is analysed as if it were the fully observed data. Unless certain, fairly strong, assumptions are true, the answers are invalid. We briefly review the following methods:
1. Analysis of completers only.
2. Imputation of the simple mean.
3. Imputation of the regression mean.
4. Creating an extra category.

Completers analysis The data below have one missing observation, on variable 2 for unit 10. Completers analysis deletes all units with incomplete data from the analysis (here unit 10).
Unit   Var 1   Var 2
1      3.4     5.67
2      3.9     4.81
3      2.6     4.93
4      1.9     6.21
5      2.2     6.83
6      3.3     5.61
7      1.7     5.45
8      2.4     4.94
9      2.8     5.73
10     3.6     ?

What’s wrong with completers analysis?
It is inefficient. It is problematic in regression when covariate values are missing and models with several sets of explanatory variables need to be compared. Either we keep changing the size of the data set, as we add/remove explanatory variables with missing observations, or we use the (potentially very small, and unrepresentative) subset of the data with no missing values. When the missing observations are not a completely random selection of the data, a completers analysis will give biased estimates and invalid inferences.
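Completers analysis on the 10-unit table above can be sketched in a few lines of numpy (illustrative code, not from the slides; np.nan plays the role of '?'):

```python
import numpy as np

# The 10-unit example: variable 1 is fully observed,
# variable 2 is missing for unit 10 (np.nan marks the missing value).
v1 = np.array([3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6])
v2 = np.array([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan])

# Completers analysis: drop every unit with any missing value.
complete = ~np.isnan(v1) & ~np.isnan(v2)
v1_cc, v2_cc = v1[complete], v2[complete]

print(complete.sum())          # 9 units remain after deleting unit 10
print(round(v2_cc.mean(), 2))  # 5.58, the completers' mean of variable 2
```

Note that the fully observed values of variable 1 for unit 10 are discarded along with the missing value, which is exactly the inefficiency described above.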

Simple mean imputation
We replace missing data with the arithmetic average of the observed data for that variable. In the table of 10 cases this gives 5.58 for the missing value of variable 2. Why not? This approach is clearly inappropriate for categorical variables. It does not lead to proper estimates of measures of association or regression coefficients. Rather, associations tend to be diluted. In addition, variances will be wrongly estimated (typically underestimated) if the imputed values are treated as real. Thus inferences will be wrong too.
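The variance deflation is easy to demonstrate on the example data (an illustrative sketch, not from the slides):

```python
import numpy as np

v2 = np.array([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan])

# Replace the missing value with the mean of the observed values.
mean_obs = np.nanmean(v2)
v2_imp = np.where(np.isnan(v2), mean_obs, v2)

# Treating the imputed value as real deflates the estimated variance:
var_obs = np.nanvar(v2, ddof=1)   # sample variance of the 9 observed values
var_imp = np.var(v2_imp, ddof=1)  # variance after mean imputation
print(round(mean_obs, 2))         # 5.58
print(var_imp < var_obs)          # True: the imputed point adds no spread
```

The imputed point sits exactly at the mean, so it contributes nothing to the sum of squares while inflating the denominator, and the estimated variance can only shrink.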

Regression mean imputation
Here, we use the completers to calculate the regression of the incomplete variable on the other complete variables. Then, we substitute the predicted mean for each unit with a missing value. In this way we use information from the joint distribution of the variables to make the imputation. To perform regression imputation, we first regress variable 2 on variable 1 (note, it doesn't matter which of these is the 'response' in the model of interest). In our example, we use simple linear regression: V2 = α + β V1 + e. Using units 1-9, we find that α = 6.56 and β = -0.366, so the regression relationship is: expected value of V2 = 6.56 - 0.366 × V1. For unit 10, this gives 6.56 - 0.366 × 3.6 = 5.24.
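The fitted coefficients and the imputed value for unit 10 can be reproduced directly from the completers (illustrative code, not from the slides):

```python
import numpy as np

# Completers (units 1-9): variable 1 and variable 2 from the table.
v1 = np.array([3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8])
v2 = np.array([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73])

# Least-squares regression of V2 on V1 using the completers.
beta, alpha = np.polyfit(v1, v2, 1)      # slope, intercept
print(round(alpha, 2), round(beta, 3))   # 6.56 -0.366

# Impute unit 10 (V1 = 3.6) with its predicted mean.
imputed = alpha + beta * 3.6
print(round(imputed, 2))                 # 5.24
```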

Regression mean imputation: Why/Why Not?
Regression mean imputation can generate unbiased estimates of means, associations and regression coefficients in a much wider range of settings than simple mean imputation. However, one important problem remains. The variability of the imputations is too small, so the estimated precision of regression coefficients will be wrong and inferences will be misleading.

Creating an extra category
When a categorical variable has missing values it is common practice to add an extra 'missing value' category. In the example, the missing values, denoted '?', have been given the category 3.
[The slide shows the 10-unit table again, with the missing values of categorical variable 2 recoded from '?' to category 3.]

Creating an extra category
This is bad practice because:
1. the impact of this strategy depends on how missing values are divided among the real categories, and how the probability of a value being missing depends on other variables;
2. very dissimilar classes can be lumped into one group;
3. severe bias can arise, in any direction; and
4. when used to stratify for adjustment (or correct for confounding) the completed categorical variable will not do its job properly.

Some notation
The data. We denote the data we intended to collect by Y, and we partition this into Y = {Yo, Ym}, where Yo is observed and Ym is missing. Note that some variables in Y may be outcomes/responses, some may be explanatory variables/covariates. Depending on the context these may all refer to one unit, or to an entire dataset.
Missing value indicator. Corresponding to every observation Y, there is a missing value indicator R, defined as: R = 1 if Y is observed, and R = 0 otherwise.

Missing value mechanism
The key question for analyses with missing data is, under what circumstances, if any, do the analyses we would perform if the data set were fully observed lead to valid answers? As before, 'valid' means that effects and their SE's are consistently estimated, tests have the correct size, and so on, so inferences are correct. The answer depends on the missing value mechanism. This is the probability that a set of values are missing given the values taken by the observed and missing observations, which we denote by Pr(R | Yo, Ym).

Examples of missing value mechanisms
1. The chance of non-response to questions about income usually depends on the person's income.
2. Someone may not be at home for an interview because they are at work.
3. The chance of a subject leaving a clinical trial may depend on their response to treatment.
4. A subject may be removed from a trial if their condition is insufficiently controlled.

Missing Completely at Random (MCAR)
Suppose the probability of an observation being missing does not depend on observed or unobserved measurements. In mathematical terms, we write this as Pr(R | Yo, Ym) = Pr(R) Then we say that the observation is Missing Completely At Random, which is often abbreviated to MCAR. Note that in a sample survey setting MCAR is sometimes called uniform non-response. If data are MCAR, then consistent results with missing data can be obtained by performing the analyses we would have used had there been no missing data, although there will generally be some loss of information. In practice this means that, under MCAR, the analysis of only those units with complete data gives valid inferences.

Missing At Random (MAR)
After considering MCAR, a second question naturally arises. That is, what are the most general conditions under which a valid analysis can be done using only the observed data, and no information about the missing value mechanism, Pr(R | Yo, Ym)? The answer to this is when, given the observed data, the missingness mechanism does not depend on the unobserved data. Mathematically, Pr(R | Yo, Ym) = Pr(R | Yo). This is termed Missing At Random, abbreviated MAR.

Missing Not At Random (MNAR)
When neither MCAR nor MAR hold, we say the data are Missing Not At Random, abbreviated MNAR. In the likelihood setting (see end of previous section) the missingness mechanism is termed non-ignorable. What this means is: even accounting for all the available observed information, the reason for observations being missing still depends on the unseen observations themselves. To obtain valid inference, a joint model of both Y and R is required (that is, a joint model of the data and the missingness mechanism).
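The three mechanisms can be contrasted in a small simulation (an illustrative sketch, not from the slides; the logistic probabilities and the seed are invented for the example). The complete-case mean of an outcome y stays roughly unbiased under MCAR, but is biased when missingness depends on an observed covariate x (MAR, for this marginal mean) or on y itself (MNAR):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(0, 1, n)       # fully observed covariate
y = x + rng.normal(0, 1, n)   # outcome; its true mean is 0

def cc_mean(p_observe):
    """Mean of y among the units observed with the given probabilities."""
    r = rng.random(n) < p_observe
    return y[r].mean()

mcar = cc_mean(np.full(n, 0.5))       # missingness ignores the data entirely
mar = cc_mean(1 / (1 + np.exp(-x)))   # depends only on the observed x
mnar = cc_mean(1 / (1 + np.exp(-y)))  # depends on the unseen y itself

print(abs(mcar) < 0.05)   # True: complete-case mean is ~unbiased under MCAR
print(mar > 0.3)          # True: biased upwards when high-x units respond more
print(mnar > 0.3)         # True: biased upwards when high-y units respond more
```

Note that under MAR an analysis that conditions on x (e.g. the regression of y on x) would still be valid from the completers; it is the marginal mean of y that is distorted here.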

MNAR (continued) Unfortunately
We cannot tell from the data at hand whether the missing observations are MAR or MNAR (although we can distinguish between MCAR and MAR). In the MNAR setting it is very rare to know the appropriate model for the missingness mechanism. Hence the central role of sensitivity analysis: we must explore how our inferences vary under the assumptions of MAR and MNAR, and under various models. Unfortunately, this is often easier said than done, especially under the time and budgetary constraints of many applied projects.

Principled methods These all have the following in common:
No attempt is made to replace a missing value directly; i.e. we do not pretend to 'know' the missing values. Rather, available information (from the observed data and other contextual considerations) is combined with assumptions that cannot be checked from the observed data. This is used to either generate statistical information about each missing value (e.g. distributional information: given what we have observed, the missing observation has a normal distribution with mean a and variance b, where the parameters can be estimated from the data), and/or generate information about the missing value mechanism.

Principled methods The great range of ways in which these can be done leads to the plethora of approaches to missing values. Here are some broad classes of approach:
1. Wholly model based methods.
2. Simple stochastic imputation.
3. Multiple stochastic imputation.
4. Weighted methods (not covered here).

Wholly model based methods
A full statistical model is written down for the complete data. Analysis (whether frequentist or Bayesian) is based on the likelihood. Assumptions must be made about the missing data mechanism: if it is assumed MCAR or MAR, no explicit model is needed for it; otherwise this model must be included in the overall formulation. Such likelihood analyses require some form of integration (averaging) over the missing data. Depending on the setting this can be done implicitly or explicitly, directly or indirectly, analytically or numerically. The statistical information on the missing data is contained in the model. Examples of this would be the use of linear mixed models under MAR in SAS PROC MIXED or MLwiN. We will examine this in the practical.

Simple stochastic imputation
Instead of replacing a value with a mean, a random draw is made from some suitable distribution. Provided the distribution is chosen appropriately, consistent estimators can be obtained from methods that would work with the whole data set. This is very important in the large survey setting, where draws are made from units with complete data that are 'similar' to the one with missing values (donors). There are many variations on this hot-deck approach. Implicitly they use non-parametric estimates of the distribution of the missing data, and so typically need very large samples.
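A minimal nearest-neighbour hot-deck sketch on the earlier 10-unit table (illustrative, not from the slides; the donor-pool size k and the seed are invented): each missing value is replaced by a random draw from the k completers closest on the fully observed variable.

```python
import numpy as np

rng = np.random.default_rng(7)

v1 = np.array([3.4, 3.9, 2.6, 1.9, 2.2, 3.3, 1.7, 2.4, 2.8, 3.6])
v2 = np.array([5.67, 4.81, 4.93, 6.21, 6.83, 5.61, 5.45, 4.94, 5.73, np.nan])

k = 3
missing = np.isnan(v2)
donors = np.flatnonzero(~missing)
for i in np.flatnonzero(missing):
    # The k donors most similar on v1 form the donor pool for unit i.
    nearest = donors[np.argsort(np.abs(v1[donors] - v1[i]))[:k]]
    v2[i] = v2[rng.choice(nearest)]

print(np.isnan(v2).any())   # False: unit 10 now carries a real donor's value
```

Unlike mean imputation, the imputed value is an actually observed value, so the marginal distribution of the variable is better preserved.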

Simple stochastic imputation
Although the resulting estimators can behave well, for precision (and inference) account must be taken of the source of the imputations (i.e. there is no 'extra' data). This implies that the usual complete data estimators of precision can't be used. Thus, for each particular class of estimator (e.g. mean, ratio, percentile) each type of imputation has an associated variance estimator that may be design based (i.e. using the sampling structure of the survey) or model based, or model assisted (i.e. using some additional modelling assumptions). These variance estimators can be very complicated and are not convenient for generalization.

Multiple (stochastic) imputation
This is very similar to the single stochastic imputation method, except there are many ways in which draws can be made (e.g. hot-deck non-parametric, model based). The crucial difference is that, instead of completing the data once, the imputation process is repeated a small number of times (typically 5-10). Provided the draws are done properly, variance estimation (and hence constructing valid inferences) is much more straightforward. The observed variability among the estimates from each imputed data set is used in modifying the complete data estimates of precision. In this way, valid inferences are obtained under missing at random.
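The combination step can be sketched with Rubin's rules: the pooled variance adds the average within-imputation variance to an inflated between-imputation component. This is a minimal sketch; the five toy estimates below are invented for illustration.

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool M per-imputation results with Rubin's rules.

    estimates: point estimate from each completed data set.
    variances: its squared standard error from each completed data set.
    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    w = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()               # pooled point estimate
    w_bar = w.mean()               # average within-imputation variance
    b = q.var(ddof=1)              # between-imputation variance
    t = w_bar + (1 + 1 / m) * b    # total variance
    return q_bar, t

# Toy usage: five imputed data sets give slightly different answers.
q_bar, t = rubins_rules([5.1, 5.3, 5.2, 5.4, 5.0],
                        [0.04, 0.05, 0.04, 0.05, 0.04])
print(round(q_bar, 2), round(t, 3))   # 5.2 0.074
```

The (1 + 1/m) factor is what corrects the precision for having imputed, rather than observed, the missing values; with a single imputation (m = 1) the between-imputation variance cannot even be estimated, which is precisely the weakness of simple stochastic imputation noted above.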

Why do multiple imputation?
One of the main problems with single stochastic imputation methods is the need to develop appropriate variance formulae for each different setting. Multiple imputation attempts to provide a procedure that can get the appropriate measures of precision relatively simply in (almost) any setting. It was developed by Rubin in a survey setting (where it feels very natural) but has more recently been used more widely.

Missing Data and Random effects models
In the practical we will consider two approaches:
1. Model based MCMC estimation of a multivariate response model.
2. Generating multiple imputations from this model (using MCMC) that can then be used to fit further models using any estimation method.

Information on practical
The practical introduces multivariate Normal (MVN) models in MLwiN using MCMC, with two education datasets. The first has two responses that are components of GCSE science exams; here we consider model based approaches. The second is a six-response dataset from Hungary; here we consider multiple imputation.

Other approaches to missing data
IGLS estimation of MVN models is available in MLwiN. Here the algorithm treats the MVN model as a special case of a univariate Normal model, and so there are no overheads for missing data (assuming MAR). WinBUGS has great flexibility with missing data. The MLwiN->WinBUGS interface will allow you to do the same model based approach as in the practical. It can, however, also be used to incorporate imputation models as part of the model.

Plug for www.missingdata.org.uk
James Carpenter has developed MLwiN macros that perform multiple imputation using MCMC. These build around the MCMC features in the practical but run an imputation model independent of the actual model of interest. See www.missingdata.org.uk for further details, including variants of these slides and WinBUGS practicals.