Generalized linear MIXED models

Name: Generalized linear MIXED models
Uploaded: 2017-12-17T14:44:50+00:00
Duration: PTM22S12
Channel: Muriel Tate
Description: Generalized linear MIXED models

Generalized linear MIXED models
Claudia von Brömssen Dept of economics Unit of applied statistics and mathematics

Introduction
Methods used yesterday all depend on the independence of observations. All collected data should be
- true replicates
- not clustered
- not measured several times over a time period

Independence and replicates
1. Does fish length of species A vary with land use?
river 1 (lies in forest): 25, 27, 34, 22, 26
river 2 (lies in agricultural area): 42, 36, 29, 35
river 3 (lies in a mixed area): 34, 27, 32, 41
2. Does fish length of species A vary between rivers?
river 1: 25, 27, 34, 22, 26
river 2: 42, 36, 29, 35
river 3: 34, 27, 32, 41
Why can question 2 be answered with statistical methods, but not question 1?

2. Does fish length of species A vary between rivers?
Population 1: all fish in river 1
Population 2: all fish in river 2
Population 3: all fish in river 3
Observations: individual fish of species A in each river = independent fish representing the population

1. Does fish length of species A vary with land use?
river 1: lies in forest
river 2: lies in agricultural area
river 3: lies in a mixed area
Population 1: fish in rivers in forests
Population 2: fish in rivers in agricultural areas
Population 3: fish in rivers in mixed areas
Observations: the 5 observations in river 1 represent the river, but not the population. There are no true replicates, only pseudoreplicates.

1. Does fish length of species A vary with land use?
forest area: rivers 1a, 1b and 1c
agricultural area: rivers 2a, 2b, 2c and 2d
mixed area: rivers 3a, 3b and 3c
Population 1: fish in rivers in forests
Population 2: fish in rivers in agricultural areas
Population 3: fish in rivers in mixed areas
Rivers 1a, 1b and 1c are replicates and represent population 1.
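The difference between the two questions can be sketched in R with the fish lengths from the earlier slide. This is only an illustration: the data frame name `fish` and its column names are made up here. For question 2 the fish are the replicates, so an ordinary one-way ANOVA can be fitted. For question 1 the replicate is the river, and with one river per land use there are no degrees of freedom left to estimate variation between rivers.

```r
# Fish lengths from the slide; one river per land-use class.
fish <- data.frame(
  river   = rep(c("river1", "river2", "river3"), times = c(5, 4, 4)),
  landuse = rep(c("forest", "agricultural", "mixed"), times = c(5, 4, 4)),
  length  = c(25, 27, 34, 22, 26, 42, 36, 29, 35, 34, 27, 32, 41)
)

# Question 2: rivers are the populations, individual fish are the replicates.
anova(lm(length ~ river, data = fish))

# Question 1: the replicate for land use is the river, not the fish,
# so the data are first reduced to one mean per river.
river_means <- aggregate(length ~ river + landuse, data = fish, FUN = mean)
table(river_means$landuse)   # number of rivers (replicates) per land use

# Degrees of freedom for variation between rivers within land use:
# number of rivers minus number of land-use classes.
df_landuse <- nrow(river_means) - length(unique(river_means$landuse))
df_landuse
```

With the extended design (rivers 1a to 3c) the same calculation would give 10 - 3 = 7 degrees of freedom.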

Experimental units
When conducting experiments, the experimental unit is the smallest unit that can receive an individual treatment:
- If you have cows in a box, each cow can get its own diet. To compare diets, cows are the experimental units, and several cows getting the same diet are replicates.
- If you treat plots in a forest with a special treatment, the plots are the experimental units.
- If instead each leaf can get a different treatment, the leaves are the experimental units.

Experimental units
Experimental units = independent observations are needed to quantify the variation in the data: how much variation can we expect from completely unconnected individuals/subjects/sites?

Dependent data
Often it is easier, or of special interest, to collect dependent data:
- Time series/repeated measurements: we are interested in how the treatment affects the experimental unit over time.
- Clustered/hierarchical data: it is easier, and gives a better representation, to collect several leaves from several trees within the same experimental plot.

Dependent data
If the dependence in the data is ignored in the analysis, this can lead to bias in the estimates and to an underestimation of the variation, which gives p-values that are too low. If you want to make a study that includes dependent data, plan this thoroughly before data collection.
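How large the underestimation can be is easy to work out for a balanced clustered design. The sketch below uses made-up variance components (all numbers are assumptions for illustration, not values from the course data) and compares the variance of the overall mean when all measurements are wrongly treated as independent with the variance that respects the clustering.

```r
# Assumed (hypothetical) variance components, for illustration only
sigma2_between <- 4    # variance between clusters (e.g. plots)
sigma2_within  <- 1    # variance between measurements within a cluster
k <- 10                # number of clusters
m <- 20                # measurements per cluster

# Intraclass correlation: share of the total variance that lies between clusters
rho <- sigma2_between / (sigma2_between + sigma2_within)

# Variance of the overall mean if all k * m values are (wrongly) treated as independent
var_naive <- (sigma2_between + sigma2_within) / (k * m)

# Variance of the overall mean that respects the clustering
var_true <- sigma2_between / k + sigma2_within / (k * m)

# The ratio of the two equals the design effect 1 + (m - 1) * rho
design_effect <- var_true / var_naive
c(rho = rho, var_naive = var_naive, var_true = var_true, design_effect = design_effect)
```

The more similar the measurements within a cluster are (large `rho`), and the more measurements are taken per cluster, the more the naive analysis overstates the precision.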

Dependent data
Observe that I am talking about dependence/independence of observations. Dependencies between variables are desirable for multivariate methods. Dependencies between explanatory variables in general or generalised linear models can be a problem if the correlations are very high.

Models for dependent observations - examples
If it is important to follow a treatment over time, we could make observations on the same plot several times (several days after the treatment, several months after the treatment, ...). Data for each plot then have a time series structure, and measurements on the same plot are not independent. The time series structure is incorporated in the model. We often call these models 'repeated measures models'.

Models for dependent observations - examples
To improve the estimates we could choose to take several measurements on the same plot (but at the same time point). This data structure is called clustered or hierarchical, and we can use the data to get some idea of how large the variation within the plot is.

Mixed models
Data with such structures are analysed with mixed models, where different types of random factors or random effects account for the dependencies in the data. Mixed models in R can be fitted with different functions/packages, all with some restrictions. We will use the functions glmer (package lme4) and glmmPQL (package MASS).

Examples - Lophodermium
For the Lophodermium data set there were actually 2 forests observed at each site:
sample site forest Latitud veg_period vegetation_zone status
1 Sk1G Nemoral Healthy
Th1G Nemoral Healthy
Th2G Nemoral Healthy
Bo1G Nemoral Healthy
Bo2G Nemoral Healthy
Asa1G Hemi Healthy
Asa2G Hemi Healthy

Examples - Lophodermium
Since we now have 2 forests observed at most sites, the two forests at the same site cannot really be regarded as independent of each other. The results from these two forests are probably similar because they are geographically close. We can assume a hierarchical structure. In the model this amounts to estimating variance components for the site and for the forests within each site.

Fixed and random effects
$$g(\mu_{ij}) = \mu + \beta_i + a_j + e_{ij}$$
where $\beta$ is a factor effect (e.g. healthy/sick) and $a$ is a random effect (e.g. of the forest within each site). Generally, the factor effects or fixed effects are the ones that we are interested in modelling, whereas the random effects are there to reconstruct the design of the study or the experimental design.

If we only look at the random effect, site:
$$\mu_1 = \mu + a_1, \qquad \mu_2 = \mu + a_2, \qquad \mu_3 = \mu + a_3$$
$a_j$ is random with $E(a_j) = 0$ and takes a different value for each site. The different sites are included in the experiment since they represent different conditions. The variation in the proportion of X6 between the sites is $\sigma_A^2$.
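As a sketch of the assumption behind this slide (the normal distribution is the usual mixed-model assumption and the one glmer works with; it is not stated explicitly on the slide), the site effects can be written as independent draws from one common distribution, so that a single variance describes all sites:

```latex
a_j \sim N(0, \sigma_A^2) \ \text{independently}, \qquad
\mu_j = \mu + a_j, \qquad
\operatorname{Var}(\mu_j) = \sigma_A^2
```

Here $\mu_j$ is the site level on the scale of the link function $g$, so $\sigma_A^2$ is a variance on that scale (e.g. the logit scale for a binomial model).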

It is usually not interesting to learn more about the individual levels of a random factor. If we used site as a fixed factor, we would estimate the mean proportion of species 6 for each of the sites. We would make 18 estimates, one for each site except for one. When we treat site as a random factor, we only estimate one parameter: the variance between the different sites.

The hierarchical structure
[Figure: three sites (site 1, site 2, site 3); each site is an experimental unit with several measurements made on the same unit.]

Since the forests can be affected by the common factor site we do not see them as independent. Forests within the same site can be more similar than forests from different sites. We model this effect by including the random factor site in the model.

We also make several measurements on each forest: we measure both sick and healthy needles in each of the forests, and we observe all forests in both 2006 and 2007. The fixed factors status and needle_cohort are nested within forest. In agricultural experiments this type of model is often called a split-plot model.

The factor 'site' is on the large-scale level. It coincides with latitude and to some extent with vegetation_zone. Both forests are observed on the same level of 'site'. The factors status and year are on the small-scale level. They can be observed separately for each of the forests.

Loph: Consideration regarding the factor site
In this study design, data were collected at different sites. At each site both healthy and diseased needles were collected during both 2006 and 2007. Some measurements are, however, missing. To be correct, yesterday's model would also have needed to include the site variable to adjust for local levels. In our model, however, this role was taken by the latitude variable. We could choose to replace latitude with the site variable (which gives less information), or use the site variable as a random factor and keep latitude in the model as well.

Mixed models for Lophodermium
With the type of model we use now, we can easily include the factor 'site' as a random factor. Forest is also included as a random factor. We assume that both sites and forests are randomly selected from all sites and forests available.

We now need to change to an R package that can fit mixed models. There are several of them, but we start with the glmer function (package lme4). In glmer we write the model basically the same way as in glm, but we can include random factors by putting them in parentheses:
(1|site) for a random site
(1|site/forest) for a random forest within a random site
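In lme4 the nested term `(1|site/forest)` is shorthand for `(1|site) + (1|site:forest)`. The small base-R sketch below, on a made-up design (the data frame `design` and the site labels are hypothetical), shows why the nesting matters when forests are labelled 1 and 2 at every site: the forest label alone does not identify a forest, but the site-by-forest combination does.

```r
# Hypothetical design: 10 sites with 2 forests each, forests labelled 1 and 2 at every site
design <- expand.grid(forest = factor(1:2), site = factor(paste0("S", 1:10)))

# The label "forest" alone only has 2 levels, although there are 20 different forests
nlevels(design$forest)

# site/forest means site + site:forest, so the forest term is the site-by-forest combination
design$forest_in_site <- interaction(design$site, design$forest, drop = TRUE)
nlevels(design$forest_in_site)
```

If the forests already carry unique names across all sites, `(1|site) + (1|forest)` describes the same grouping as `(1|site/forest)`.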

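library(lme4)  # provides glmer

# Binomial model: X6 reads out of the total number of reads
Model1 <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1|site/forest),
                family = binomial, data = Loph2)

# Poisson model for the X6 read counts, with log_reads as offset
Model3 <- glmer(X6_reads ~ Latitud + status + needle_cohort + (1|site/forest),
                family = poisson, offset = log_reads, data = Loph2)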
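The role of the offset in the Poisson model can be written out as follows. This is a sketch that assumes `log_reads` holds the natural logarithm of `reads`; $\eta_{ij}$ denotes the linear predictor (fixed effects plus random effects) and $\mu_{ij}$ the expected number of X6 reads.

```latex
\log \mu_{ij} = \log(\mathrm{reads}_{ij}) + \eta_{ij}
\quad\Longleftrightarrow\quad
\log \frac{\mu_{ij}}{\mathrm{reads}_{ij}} = \eta_{ij}
```

With the offset, the Poisson model therefore describes the expected share of X6 reads among all reads (on the log scale) rather than the raw count, which makes samples with different sequencing depth comparable.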

Random effects:
 Groups      Name        Variance Std.Dev.
 forest:site (Intercept)
 site        (Intercept)
Number of obs: 69, groups: forest:site, 20; site, 10
Fixed effects:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)                                   e-07 ***
Latitud                                       e-06 ***
statusHealthy                              < 2e-16 ***
needle_cohort                              < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
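The "Random effects" part of such output can also be extracted as a table. The sketch below does this on simulated data, since the Loph2 data are not reproduced here: the data frame `sim`, the effect sizes and the standard deviations are all made up, and only the structure (10 sites, 2 forests per site) mimics the example. The code runs the model only if lme4 is installed.

```r
set.seed(1)
# Hypothetical stand-in for Loph2: 10 sites, 2 forests per site, 4 samples per forest
sim <- expand.grid(sample_in_forest = 1:4, forest = 1:2, site = 1:10)
sim$status <- rep(c("Healthy", "Sick"), length.out = nrow(sim))
sim$reads  <- 200

# Assumed random effects on the logit scale (illustrative values only)
site_eff   <- rnorm(10, sd = 0.8)
forest_eff <- rnorm(20, sd = 0.4)
forest_id  <- (sim$site - 1) * 2 + sim$forest      # unique forest index 1..20
eta <- -1 + 0.5 * (sim$status == "Healthy") + site_eff[sim$site] + forest_eff[forest_id]
sim$X6_reads <- rbinom(nrow(sim), size = sim$reads, prob = plogis(eta))

sim$site   <- factor(sim$site)
sim$forest <- factor(sim$forest)

if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::glmer(cbind(X6_reads, reads - X6_reads) ~ status + (1 | site/forest),
                     family = binomial, data = sim)
  vc <- as.data.frame(lme4::VarCorr(fit))   # one row per variance component
  print(vc[, c("grp", "vcov", "sdcor")])    # group, variance, standard deviation
}
```

The rows of `vc` correspond to the lines `forest:site` and `site` in the printed summary; these two variances are the only parameters estimated for the random part of the model.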

Quasi-binomial and quasi-Poisson families do not work with glmer. Instead we need to account for overdispersion with yet another method: a random residual. The idea is to estimate a separate variance for the residuals of the model (a random effect with one level per observation) and to adjust the p-values for that.

# Random residual added as a separate term
Model1b <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1|site/forest) + (1|sample),
                 family = binomial, data = Loph2)

# Random residual written as the lowest level of the hierarchy
Model1a <- glmer(cbind(X6_reads, reads - X6_reads) ~ Latitud + status + needle_cohort + (1|site/forest/sample),
                 family = binomial, data = Loph2)

Random effects:
 Groups               Name        Variance Std.Dev.
 sample:(forest:site) (Intercept) 1.996e
 forest:site          (Intercept) 8.703e
 site                 (Intercept) 1.508e
Number of obs: 69, groups: sample:(forest:site), 69; forest:site, 20; site, 10
Fixed effects:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)                                   e-07 ***
Latitud                                       e-06 ***
statusHealthy                              < 2e-16 ***
needle_cohort
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
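Whether a random residual is needed at all can be judged from a simple dispersion statistic. The sketch below uses simulated data (all names and numbers are hypothetical) with deliberately added sample-to-sample noise, fits an ordinary binomial glm, and divides the Pearson chi-square by the residual degrees of freedom. It also shows how an observation-level identifier, like the variable `sample` used above, could be created if the data set does not contain one.

```r
set.seed(42)
n_obs  <- 60
reads  <- rep(200, n_obs)
status <- rep(c("Healthy", "Sick"), each = n_obs / 2)

# Extra sample-to-sample noise on the logit scale creates overdispersion
eta <- -1 + 0.5 * (status == "Healthy") + rnorm(n_obs, sd = 1)
X6_reads <- rbinom(n_obs, size = reads, prob = plogis(eta))

fit <- glm(cbind(X6_reads, reads - X6_reads) ~ status, family = binomial)

# Pearson chi-square divided by the residual degrees of freedom;
# values far above 1 suggest overdispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion

# One level per observation: could serve as the grouping variable of a random residual
sample_id <- factor(seq_len(n_obs))
nlevels(sample_id)
```

In the output above, the group `sample:(forest:site)` has 69 levels for 69 observations, which is exactly this one-level-per-observation structure.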

manyglm and edgeR do not seem to offer any way to account for hierarchical or other mixed structures.

Mixed models for fungi
In the second example of yesterday's computer lab we analysed fungi data at a number of sites. The sites in this example are the replicates made in the experiment, i.e. the experimental units. At each site we observe a specific combination of tree type, CO2 (yes/no) and Warmed (yes/no). At each experimental unit we make 3 observations (= the three different horizons).

The structure is similar to the Lophodermium example, where we also had several measurements at each site, but in this case the measurements also carry meaning: they represent different soil layers and therefore have a specific order.

In such cases we usually assume that there is a correlation between measurements at the same site. If a site has a high probability of species 3, this will be so at all horizons. Part of this correlation between horizons is described by the model, since we include horizon as a factor. There can, however, still be correlations in the residuals of the model, i.e. the data are not independent.

[Figure: soil layers at one site; neighbouring layers are correlated, layers further apart are also correlated, but less.]
We can assume that the observations made at the same site are correlated with each other. Observations made close to each other are more correlated than observations further apart.

Correlation between layers can be estimated and the standard errors and p-values are adjusted accordingly. The correlations are estimated on the residuals, i.e. after the model is fitted, to see if there is any remaining dependence between the layers.
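One common way to express "closer layers are more correlated" is a first-order autoregressive, AR(1), structure, in which the correlation between two horizons is `rho` raised to the number of steps between them. The slides do not say which structure is used, so AR(1) and the value of `rho` below are assumptions for illustration. In glmmPQL a structure of this kind can be requested through its `correlation` argument, for example with `corAR1` from the nlme package.

```r
rho     <- 0.6    # assumed correlation between neighbouring horizons
horizon <- 1:3    # the three soil horizons, in order

# AR(1) correlation matrix: rho to the power of the distance between horizons
R <- rho^abs(outer(horizon, horizon, "-"))
dimnames(R) <- list(paste0("h", horizon), paste0("h", horizon))
R
```

Horizons 1 and 2 get correlation `rho`, horizons 1 and 3 get `rho^2`, so only one extra parameter has to be estimated from the residuals.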

Since horizon actually is a rather important factor in the model we should also consider interactions between the other factors and horizon. The effect of tree type could be different at different soil horizons. For X3 however we will not be able to estimate this interaction.

Generalised linear models - overview
We use logistic regression or Poisson regression as base models. For DNA sequencing data or similar data, specific procedures often use the negative binomial distribution, since overdispersion is almost always observed.

Generalised linear models - overview
For these types of models you need to have the data observed as counts. If your response variable is a proportion and cannot be traced back to counts, you use general linear models with a normal distribution for the error term. Sometimes this will require a transformation of the observed data before the model can be fitted. (Look at residual plots to check if the residuals are normally distributed and have equal variances.) If normality does not hold, use nonparametric methods.

Overdispersion - overview
There are several ways to handle overdispersion in data:
- use quasi-distributions (this often does not work in mixed settings)
- use the negative binomial distribution (not available in all packages, e.g. not in glm)
- use a random residual (requires the use of mixed models even if the model itself is not mixed)

Overdispersion - overview
Always check that the design is well represented in the model. Leaving out design variables (factors that are used to define the data collection) will almost always lead to overdispersion.

Mixed models - overview
If your data are collected according to a specific experimental plan or study design, you need to account for this structure in the analysis. If you do not, you will get faulty variation estimates and therefore wrong p-values (usually p-values that are too low). Leaving out the study design variables can also lead to overdispersion.

Mixed models - overview
Typical mixed models are
- repeated measures models, where an experimental unit is observed several times (in time or space)
- hierarchical models, where several observations are made within the experimental unit (but with no specific order)
