Ex Data, Scientia


Mixed-effects models in R

Describing a complex system accurately using statistics can be challenging. In many situations, measurements are confounded by uncontrollable factors, e.g. different sampling sites, which may affect the behaviour of the system or parts thereof. Such cases call for mixed-effects models, whose appropriate design may, however, be complicated.

Mixed-effects models describe the dependency of a response variable on one or multiple explanatory variables that are of interest in an experiment, plus one or multiple variables that are not part of the experimental design but arise out of necessity. Such variables are typically categorical in a mixed-effects model; for example, one could imagine that, to achieve a sufficient number of samples, it was necessary to sample at two different locations or at two different times of the year.

Now, in the simplest case, this categorical effect could be purely additive: the response variable reacts to the driver variables in the same way at both sampling sites, but the scale of the response (i.e. the intercept of the model) differs between the sites. On the other hand, the situation could be more difficult: the reaction of the response variable to the driver variables, or to just one driver variable, could vary between the sampling sites (which would not be particularly convenient for the experiment designer, whose intention was merely to obtain more samples for a clearer insight into the driver-response relationship). There could, however, still be one overall effect of the driver (i.e. a model slope) that varies slightly between the sampling sites rather than differing entirely between them. The same cases could apply to the scaling of the response variable (the intercept of the model).
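In lme4 syntax (introduced properly below), these two cases translate roughly into the following model formulas; here y, x and site are placeholder names for a response, a driver and a grouping variable, not objects defined anywhere in this post.

# purely additive group effect: one common slope for x, group-specific intercepts
f_additive <- y ~ x + (1 | site)

# group-specific reaction to x: intercept and slope both vary around overall values
f_varying <- y ~ x + (1 + x | site)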

Finding the right model description can be a daunting task. Potential models of varying complexity can be compared via AIC values and diagnostic plots (checking, e.g., the distribution of model residuals). In the following, we will first generate (noisy) data from a complex relationship between predictor variables and a response variable; these data will serve as a "ground truth" on which we test our ability to select the appropriate statistical model (and the ability of R's optimizers to correctly estimate the parameters used to generate that ground truth).
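Because the data generation below involves random sampling, it can be convenient to fix the random seed first, so that the numbers reported further down can be reproduced exactly; the seed value used here is arbitrary.

set.seed(42)  # arbitrary seed; fixing it just makes the example reproducible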

We first set up the parameter values that the statistical model shall estimate later on. These are the overall intercept (itc_base), the group-specific summands to the intercept (two groups; itc_g1 and itc_g2), the overall slope coefficients for the independent variables (slp_1 and slp_2) and the group-specific summands to slp_2 (same groups as before; slp_g1 and slp_g2).

library('lme4')

itc_base <- 3.5

itc_g1 <- 2.3
itc_g2 <- -1.7

slp_1 <- -4.6
slp_2 <- 3.5

slp_g1 <- 0.27
slp_g2 <- -0.51

We then generate the independent (explanatory) variables x1 and x2, which are randomly sampled (n = 180; without replacement in this case) from numeric sequences running from 3 to 60 and from 1.5 to 25, respectively (the length of each sequence is set to 360). These sequences are purely arbitrary and have no foundation in any actual data set.

inds_1 <- sample(seq(1,360), 180)
inds_2 <- sample(seq(1,360), 180)

x1 <- seq(3,60,length.out=360)[inds_1]
x2 <- seq(1.5,25,length.out=360)[inds_2]

We now calculate the dependent variable y, or more specifically, the group-specific values of y, y_g1 and y_g2. To this end, we employ the formula y_gx = (itc_base + itc_gx) + slp_1 * x1_gx + (slp_2 + slp_gx) * x2_gx, where gx denotes either the first or the second group of samples and x1_gx and x2_gx are the portions of x1 and x2 assigned to that group. We add some statistical noise (observation error) drawn from a normal distribution (with mean = 0 and standard deviation = 1) in order to make the process of finding and fitting the adequate model more challenging. The formula is the ground truth, i.e. the true mechanism describing the variability in the dependent variable. Ideally, model-selection procedures should lead us to accept only this equation as the best description of the data, over simpler and more complex equations.

err <- rnorm(180, 0, 1)

y_g1 <- ((itc_base + itc_g1) + slp_1 * x1[1:90] + (slp_2 + slp_g1) * x2[1:90]) + err[1:90]
y_g2 <- ((itc_base + itc_g2) + slp_1 * x1[91:180] + (slp_2 + slp_g2) * x2[91:180]) + err[91:180]

y <- c(y_g1, y_g2)
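As an optional sanity check, we can plot the generated data and colour the points by group; the two groups should appear as point clouds with slightly different offsets and inclinations with respect to x2.

plot(x2, y, col = c(rep('blue', 90), rep('red', 90)), xlab = 'x2', ylab = 'y')  # group 1 in blue, group 2 in red
legend('topleft', legend = c('group 1', 'group 2'), col = c('blue', 'red'), pch = 1)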

Now, we start the process of setting up, fitting and quality-checking a variety of statistical models that describe the relationship between the variables y, x1 and x2 and, in some cases, the sample groups a and b. (To the latter end, we create a character vector g of "a"s and "b"s, with their numbers matching the number of samples belonging to each group.) We first fit a model that corresponds to the true equation, a mixed-effects model formulated and fitted with the package lme4. The formulation of the model equation in the package-specific syntax deviates quite markedly from the original, purely mathematical formulation: y ~ (1 + 1 | g) + x1 + x2 + (0 + x2 | g). This formulation means that the intercept is composed of an overall component (included by default in lmer formulas) and a group-specific component (the "1" in front of "| g" requests group-specific intercept deviations, and "| g" indicates the effect of the sampling groups), that x1 and x2 affect y in a group-independent manner, and that there is an additional group-specific effect of x2 on y. For more details see https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf.

g <- c(rep('a', 90), rep('b', 90))  
  
m1 <- lmer(y ~ (1 + 1 | g) + x1 + x2 + (0 + x2 | g))

Note that the formulation "x2 + (0 + x2 | g)" instead of "(x2 | g)" is important, since the latter is by default interpreted by lme4 as (1 + x2 | g), i.e. one would additionally fit group-specific intercepts that are not connected to x2, and which are of course not part of the ground-truth equation. (To the uninitiated, the formulation (1 + x2 | g) could falsely come across as the combination of a group-independent and a group-specific parameter associated with x2; in lme4, however, the "1" relates to g and not to x2.)
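To see the difference in practice, we could also fit the (x2 | g) variant and inspect its random-effects structure with VarCorr(); this alternative model only serves to illustrate the point and is not part of the comparison below. Note that with just two groups, the additional variance and correlation parameters are poorly identified, and lmer() may well report a singular fit.

m_alt <- lmer(y ~ x1 + x2 + (x2 | g))  # (x2 | g) is silently expanded to (1 + x2 | g)
VarCorr(m_alt)  # now contains a group-specific intercept term plus its correlation with the x2 slope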

We can now have a look at the fitted coefficients of the mixed-effects model (calling the function lmer() will both set up and automatically fit the model). Using the summary() function, we can first look at the fitted values of the non-group-specific, so-called fixed, effects, which are the overall intercept (itc_base), the slope coefficient for x1 (slp_1) and the overall slope coefficient for x2 (slp_2). The parameter estimates are listed under the Estimate header of the fixed-effects section. We can see that the estimates are overall fairly close to the ground truth, although itc_base is estimated closer to 3.8 than the actual 3.5, and slp_2 closer to 3.4 than the actual 3.5 (actual outcomes may vary due to the randomness of the error values drawn). The deviation between estimates and ground truth stems from the random noise added to y; when picking a larger standard deviation for the distribution from which the noise is sampled, the deviation between the parameter estimates and the ground truth becomes markedly larger.

summary(m1)
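If only the fixed-effect estimates are of interest, lme4 also offers the fixef() function, which returns them as a named vector:

fixef(m1)  # overall intercept plus the slope coefficients for x1 and x2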

We can obtain the estimates of the group-specific coefficients, the so-called random effects, by using the coef() function on the model object ("random" because we allow these parameters to be estimated freely for each group, without trying to "explain" inter-group variability mechanistically - this has stronger implications when the number of groups becomes large). The function returns a table with the groups as rows and the coefficients as columns. Note that we don't actually see the estimated group-specific coefficients here, but rather the sum of each overall coefficient (i.e. itc_base and slp_2) and the corresponding group-specific coefficient (itc_gx and slp_gx). The slope coefficient for x1 is also listed, and as we can see, it does not deviate from the summary output above and does not differ between groups, which is correct, as we specified a group-specific coefficient for x1 neither in the ground-truth equation nor in the model design. The other values correspond pretty closely to the ground truth (you can check this by adding up itc_base, itc_gx, slp_2 and slp_gx in the matching combinations yourself).

coef(m1)

print(itc_base + itc_g1)
print(itc_base + itc_g2)
print(slp_2 + slp_g1)
print(slp_2 + slp_g2)
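The group-specific deviations themselves can be extracted with ranef(). Keep in mind that these are (shrunken) deviations from the fitted fixed effects rather than from the true overall coefficients, so they will not match itc_gx and slp_gx exactly, even though the sums shown by coef() are close to the ground truth.

ranef(m1)  # conditional modes of the group-specific intercept and x2-slope deviations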

Thus, we have been able to accurately estimate the ground-truth parameters using the lmer() function of the lme4 package. As a next step, we would like to know whether this model design (which is, of course, identical to the ground-truth equation) holds up against simpler (but wrong) model designs in terms of model diagnostics. As our current model design is fairly complex, with six parameters, we will need strong evidence to justify it against simpler designs (which, in case of doubt, are often preferred, as they rely less heavily on assumptions and are therefore often more robust when it comes to making predictions outside the observed ranges of the data).

To that end, we are going to fit a suite of models of varying degrees of complexity and compare their residual distributions and trends, the relationship between predicted and true y-values, and their AIC values. Model #1 is the model we have already fitted, the underlying equation of which matches the ground-truth equation perfectly. Model #2 omits the group-independent effect of x2, i.e. only group-specific coefficients related to x2 are fitted. Model #3 omits the group-specific coefficients related to x2, i.e. only one overall coefficient related to x2 is fitted. Model #4 extends model #3 by omitting the overall intercept, i.e. only group-specific intercepts are fitted. Model #5 is not a mixed-effects model but a simple linear model (fitted using the lm() function), with a single coefficient fitted for each of x1 and x2.

par(mfrow = c(1,2))

plot(y, resid(m1), xlab = 'y', ylab = 'Residual'); abline(0,0)
plot(y,predict(m1),ylab = 'Predicted y');abline(0,1)
print(AIC(m1))

m2 <- lmer(y ~ (1 + 1 | g) + x1 + (0 + x2 | g))
plot(y, resid(m2), xlab = 'y', ylab = 'Residual'); abline(0,0)
plot(y,predict(m2),ylab = 'Predicted y');abline(0,1)
print(AIC(m2))

m3 <- lmer(y ~ (1 | g) + x1 + x2)
plot(y, resid(m3), xlab = 'y', ylab = 'Residual'); abline(0,0)
plot(y,predict(m3),ylab = 'Predicted y');abline(0,1)
print(AIC(m3))

m4 <- lmer(y ~ 0 + (1 | g) + x1 + x2)
plot(y, resid(m4), xlab = 'y', ylab = 'Residual'); abline(0,0)
plot(y,predict(m4),ylab = 'Predicted y');abline(0,1)
print(AIC(m4))

m5 <- lm(y ~ x1 + x2)
plot(y, resid(m5), xlab = 'y', ylab = 'Residual'); abline(0,0)
plot(y,predict(m5),ylab = 'Predicted y');abline(0,1)
print(AIC(m5))
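For a compact side-by-side comparison, AIC() also accepts several model objects at once and returns a small table of degrees of freedom and AIC values. One caveat: lmer() fits with REML by default, so for a strict AIC comparison of models that differ in their fixed effects, one may prefer to refit them with REML = FALSE first.

AIC(m1, m2, m3, m4, m5)  # data frame with one row per model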

Model #1 does indeed have the lowest AIC value of all models tested, indicating that this model-selection criterion can indeed lead us to select the correct model description of the data. The spread of residuals appears fairly normal, there is no apparent trend in the residuals over y, and the model predictions are very close to the ground truth (an almost perfect fit). These attributes (apart from the AIC value) are shared with model #2, however. In a real-world setting, one would have to seriously discuss the appropriateness of selecting the more complex model (model #1), given that the only supporting argument is its slightly lower AIC value. Omitting the group-specific coefficients on x2 (model #3) leads to a marked increase in AIC and a slightly worse fit, although the residuals still do not display a pattern or trend. Omitting any group effects (model #5) leads once more to a strong increase in AIC, and now the residuals separate quite clearly into two clusters.
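The visual impression of approximately normally distributed residuals can be checked a bit more formally with a quantile-quantile plot, shown here for model #1 only:

qqnorm(resid(m1)); qqline(resid(m1))  # points should lie close to the reference line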

Hence, in summary, we find that i) we can successfully estimate the ground-truth parameter values of a complex mixed-effects system (although estimation accuracy decreases with increasingly strong noise) and ii) classic model-selection criteria can successfully guide us to the appropriate model description in spite of comparatively high model complexity.