23  Mixed and Hierarchical Models

In the linear and generalized linear models so far, the regression parameters (\(\beta\)) are fixed properties of the population, and our task is to estimate them with as much precision as possible. We then interpret the parameters to make statements about the population. The parameters are fixed, so their effects are known as fixed effects.

But it is possible for the relationship between predictor and response to also depend on random quantities—on properties of the particular random sample we obtain. When this happens, some of the model parameters are random, and so their effects are known as random effects.

We can construct linear and generalized linear models with random effects, known as random effects models or mixed effects models (“mixed” because they contain both fixed and random effects). As we will see, these are useful for modeling many kinds of data with grouped or repeated structure. They can also be thought of as a way of applying penalization, as discussed in Chapter 19, to regression coefficients in a more structured way, and so can give useful estimates even when the parameters are not truly random.

23.1 Motivating examples

Many regression problems involve data that comes from a random sample from a population (or that we at least treat as a random sample). But we are concerned with problems where the effects are random. For instance, if the observations come in groups, perhaps the relationship is different for each group, and perhaps the groups can be seen as samples from a larger population.

Example 23.1 (Small area estimation) A social scientist is constructing a map of poverty across the United States as part of a project to suggest government policies that might help low-income areas. The scientist wants to estimate the average income of individuals in every county in the US, and pays a polling company to conduct a large, representative national survey. Survey respondents indicate their county of residence and their income.

Each observation is one person, but we want to estimate county averages, so it is reasonable to model respondent \(i\)’s income as \[ \text{income}_i = \alpha_{\text{county}[i]} + e_i, \] where \(\text{county}[i]\) represents the index of the county respondent \(i\) lives in. As there are about 3,200 counties, \(\alpha\) is a vector of about 3,200 values, and their estimates \(\hat \alpha\) give us the average income per county. We can treat \(\alpha\) as a fixed unknown parameter, making this a simple regression or ANOVA problem; or we can imagine the counties as a sample from the process generating county incomes every year, making them random effects. As we will see, this has some advantages.

This is an example of a small area estimation problem, in which the goal is to make inferences about individual geographic areas from survey or sample data. The data is hierarchical: the units are survey respondents, who are grouped within counties, which are all within the United States.

Example 23.2 (Sleep deprivation) Belenky et al. (2003) conducted an experiment to measure the effect of sleep deprivation on reaction time. A group of 18 experimental subjects spent 14 days in the laboratory. For the first 3 nights, they were required to be in bed for eight hours per night; for the next seven days, they were only allowed three hours per night. Each day, they completed several tests of alertness and reaction time, including one that directly measured how quickly they pressed a button after seeing a visual stimulus. Our data hence includes 10 observations per subject: three days of baseline measurements and seven days of sleep deprivation.

The goal is to measure the effect of sleep deprivation on reaction time, including how reaction time changes as the number of days of deprivation increases. We might model subject \(i\)’s reaction time on day \(d\) as \[ \text{time}_{id} = \beta_0 + \beta_1 d + e, \] but it is reasonable to expect that each subject had a different reaction time at baseline (0 days) and that each was affected differently by sleep deprivation. If this is not accounted for, the errors \(e\) will be dependent between observations of the same subject on different days.

Instead we might allow both the slope and intercept to vary by subject: \[ \text{time}_{id} = \alpha_i + \beta_i d + e. \] As subjects are drawn from a population, here \(\alpha\) and \(\beta\) are both random effects.

These examples also illustrate a difference in goals between mixed model problems. In Example 23.1, we would like to estimate the individual random effects \(\alpha\) for each county. In Example 23.2, the subject-specific slopes and intercepts are not important: the scientific question is whether \(\E[\beta_i] = 0\), which determines whether the average person experiences a change in reaction time when sleep-deprived. We will see that mixed models are better suited to answer the latter question, although obtaining individual random effect estimates is still possible.

23.2 Mixed models

From the examples, we see that a mixed model is a regression model much like any other: we have a response variable and one or more predictors, and translate the predictors into regressors to fit the regression. However, in a mixed model we split the predictors into two groups first. The fixed effects are treated just as in ordinary linear models, while the random effects have a separate parameter vector that is random.

As the variance and covariance structure of the random effects will be important, we can separately consider the mean structure of mixed models and the variance structure.

23.2.1 Mean structure

Definition 23.1 (Mixed model) A mixed model is a linear model written as \[ Y_i = \underbrace{\beta\T X_i}_{\text{fixed effects}} + \underbrace{\gamma\T Z_i}_{\text{random effects}} + e_i, \] where \(\beta \in \R^p\) is a fixed population parameter, \(X_i \in \R^p\) is a vector of known covariates, \(\gamma \in \R^q\) is a vector of random effects, and \(Z_i \in \R^q\) is another vector of known covariates. Crucially, \(\gamma\) is a random variable that is assumed to be drawn from a particular distribution. One can also write the model in matrix form as \[ \Y = \beta\T \X + \gamma\T \Z + e. \]

Generally it is assumed that \(\E[e] = 0\), \(\var(e) = \sigma^2 I\), \(\cov(\gamma, e) = 0\), and the distribution of \(\gamma\) is a parametric family with mean 0 and known or estimable covariance.

Transforming a particular model into a mixed model simply requires translating its predictors into regressors represented as columns of \(X\) and \(Z\), just as we did with ordinary linear models in Chapter 7.

Example 23.3 (Random intercepts model) In the small area estimation example (Example 23.1), there is one predictor, county. We will represent this as dummy-coded regressors in \(Z\), including a regressor for every county, so none is absorbed in the intercept. The model is \[ \text{income} = \beta_0 + \gamma\T \Z + e, \] where \(\beta_0\) represents the overall mean across counties (because \(\gamma\) has mean zero) and the entries of \(\gamma\) give the difference between each county’s mean and the grand mean.

Example 23.4 (Random slopes model) In the sleep deprivation study (Example 23.2), there are two predictors: the number of days of sleep deprivation and the subject. The number of days is a fixed effect, so \(X\) contains an intercept column and a column for the number of days; hence \(\beta \in \R^2\). Subject is a random effect, and we want this to affect both the intercept and slope, so we will enter multiple regressors in \(Z\): an indicator of subject (and hence a column for every subject) as well as columns for the interaction of subject and days. We can hence partition the random effects part of the model: \[ \gamma\T \Z = \begin{pmatrix} \gamma_1 \\ \vdots \\ \gamma_{18} \\ \gamma_{19} \\ \vdots \\ \gamma_{36} \end{pmatrix}\T \left(\begin{array}{cccc|cccc} 1 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0\\ 1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0\\ \vdots & & & & & & & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 8\\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 9 \end{array}\right). \] Here the first 18 entries of \(\gamma\) correspond to random intercepts for each of the 18 subjects, while the next 18 entries are random interactions for each subject. We then assemble them into the model \[ \text{time} = \beta_0 + \beta_1 d + \gamma\T \Z + e. \] The interaction allows the intercept and slope of the relationship to be random effects, not fixed, with the mean intercept and slope estimated by \(\beta_0\) and \(\beta_1\). For example, for subject 1 on day \(d\), \(Z = (1, 0, \dots, 0, d, 0, \dots, 0)\T\), and so the model is \[\begin{align*} \text{time} &= \beta_0 + \beta_1 d + \gamma_1 + \gamma_{19} d + e\\ &= (\beta_0 + \gamma_1) + (\beta_1 + \gamma_{19}) d + e. \end{align*}\]
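The structure of \(\Z\) in this example can be checked numerically. Here is a minimal sketch in Python with NumPy (for illustration only; it assumes 18 subjects each observed on days 0 through 9, as in the study):

```python
import numpy as np

n_subj, n_days = 18, 10                        # 18 subjects, days 0-9
subj = np.repeat(np.arange(n_subj), n_days)    # subject index of each row
days = np.tile(np.arange(n_days), n_subj)      # day for each row

indicators = np.eye(n_subj)[subj]              # dummy-coded subject columns
interactions = indicators * days[:, None]      # subject-by-days interactions
Z = np.hstack([indicators, interactions])      # one row per observation

# Subject 1 (index 0) on day d has Z = (1, 0, ..., 0, d, 0, ..., 0):
# column 0 is its intercept indicator, column 18 its days interaction.
```

With 18 subjects and 10 days each, \(\Z\) is \(180 \times 36\): 18 intercept columns followed by 18 interaction columns.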

In each case we have written the model to make the mean of the random effects zero, so their mean is estimated as a fixed effect. This is convenient because the software we use for estimation defaults to setting the mean of all random effects to zero.

23.2.2 Variance structure

Besides specifying the mean function, we must specify the covariance structure \(\Sigma = \var(\gamma)\). This is not comparable to the variance \(\var(\hat \beta)\) for the fixed effect term: that is a property of the estimator \(\hat \beta\), and it is silly to refer to \(\var(\beta)\) (without the hat) because \(\beta\) is a fixed population parameter. For the random effects, \(\gamma\) is random, so \(\Sigma\) is a population parameter to estimate.

We have several options. We could simply assume \(\Sigma\) has a particular value, such as \(\Sigma = I\). But that is rarely realistic. We could, at the other extreme, estimate the entire matrix with no assumptions, allowing any variance and covariance between terms in \(\gamma\). That may be too flexible, and also is often unrealistic, as we will see in the examples below.

Instead, we usually identify a specific structure for \(\Sigma\) and then estimate its component pieces as part of the model-fitting process. Two common structures are given in these examples.

Example 23.5 (Variance in the random intercepts model) Consider the random-intercepts model applied to small area estimation (Example 23.3), where each entry of \(\gamma\) is one county’s difference from the overall mean. If counties are drawn independently from a hypothetical county-generating process, we could assume that \[ \gamma \sim \normal(\zerovec, \sigma_\gamma^2 I). \] If we also assume that \(\var(e) = \sigma^2 I\), the model only has three parameters to estimate (\(\beta_0\), \(\sigma^2\), and \(\sigma_\gamma^2\)), despite there being over 3,200 individual counties.

However, if the same random effect predictor enters \(\Z\) as multiple regressors, it is more reasonable to assume the corresponding entries of \(\gamma\) are correlated. This is just as in ordinary linear regression: in a simple linear model \(Y = \beta_0 + \beta_1 X + e\), \(\cov(\hat \beta_0, \hat \beta_1) \neq 0\) except in special cases. (We can think of \(\beta_0\) as belonging to the relationship with \(X\), since it is the intercept of that relationship; in the same way, random intercepts for a grouping variable belong to that predictor even though they enter \(\Z\) as separate regressors.)

Example 23.6 (Variance in the random slopes model) In the random slopes model of Example 23.4, it’s not reasonable to say that the entries of \(\gamma\) are simply iid with common variance \(\sigma_\gamma^2\). The intercept and interaction effects may have different variances, and the effects for subject \(i\) are likely correlated with each other, just as the slope and intercept estimates are correlated in an ordinary linear model. A reasonable choice might instead be that \(\gamma \sim \normal(\zerovec, \Sigma)\), where \[ \Sigma_{ij} = \begin{cases} \sigma_0^2 & \text{if $i = j$ and $i \leq 18$}\\ \sigma_1^2 & \text{if $i = j$ and $i > 18$}\\ \sigma_{01} & \text{if $|i - j| = 18$}\\ 0 & \text{otherwise.} \end{cases} \] Hence the intercepts have variance \(\sigma_0^2\), the interaction slopes have variance \(\sigma_1^2\), and the slope and intercept for one subject have covariance \(\sigma_{01}\). But by setting the covariance to 0 between subjects, we have assumed the random effects for different subjects are uncorrelated. This is logical unless the subjects somehow influence each other’s responses.

The general rule, then, is to evaluate all the random effects terms. Like terms entering from the same predictor—such as all the regressors needed to enter one factor predictor—are modeled with a common variance. Related terms from the same predictor—such as slope and intercept terms—are modeled with a common covariance. Random effects are assumed uncorrelated between unrelated predictors.

23.2.3 Variance components

The effect of the random effects terms in a mixed model is to induce correlation between observations, and to partition the variance of \(Y\) into components. We can see this by expanding the variance using Definition 23.1: \[\begin{align*} \var(\Y) &= \var(\beta\T \X + \gamma\T \Z + e) \\ &= \Z \var(\gamma) \Z\T + \var(e) \\ &= \Z \Sigma \Z\T + \sigma^2 I. \end{align*}\] The total variance of the response is hence partly due to the errors \(e\) and partly due to the random effects. What this means depends on the structure we assume for \(\Sigma\). In simple random intercepts models, for instance, the variance decomposes nicely.

Example 23.7 (Variance components in the random intercepts model) In the random intercepts model for counties, suppose we assume that \(\var(\gamma) = \sigma_\gamma^2 I\), as in Example 23.5. There is one regressor per county, so this is a \(q \times q\) matrix, where \(q\) is the number of counties, and each row of \(\Z\) has exactly one 1 and \(q - 1\) zeroes (since \(\Z\) encodes the county of each observation). Thus, for survey respondent \(i\), \[ \var(Y_i) = \var(\gamma\T Z_i + e_i) = \sigma_\gamma^2 + \sigma^2. \] The variance in respondent incomes is then partly due to random noise (\(\sigma^2\)) and partly due to variation in county means (\(\sigma_\gamma^2\)). The relative size of \(\sigma^2\) and \(\sigma_\gamma^2\) determines the structure of the data: if \(\sigma_\gamma^2 \gg \sigma^2\), there is a great deal of variation between counties but very little within a county, as shown in the left panel of Figure 23.1. If \(\sigma_\gamma^2 \ll \sigma^2\), the reverse is true: the counties are very similar and most variation is individual, as shown in the right panel.

Figure 23.1: Left: County-level variation (\(\sigma_\gamma^2\)) is large relative to subject-level variation (\(\sigma^2\)). Right: The reverse. Each point is one respondent within a county.

For two respondents \(i\) and \(j\), the covariance between their responses is \[\begin{align*} \cov(Y_i, Y_j) &= \cov(\gamma\T Z_i + e_i, \gamma\T Z_j + e_j)\\ &= \cov(\gamma\T Z_i, \gamma\T Z_j) + \cov(\gamma\T Z_i, e_j) + \cov(e_i, \gamma\T Z_j) + \cov(e_i, e_j). \end{align*}\] As we said in Definition 23.1, we usually assume that \(\cov(\gamma, e) = 0\), so the center two terms are zero. The last term is zero because we typically assume the errors are uncorrelated, as in ordinary linear models. Only the first term matters. When both respondents are in the same county, this is the variance of that component of \(\gamma\); when they are not, it is the covariance between the corresponding components. Since we assumed \(\var(\gamma)\) is diagonal, \[ \cov(Y_i, Y_j) = \begin{cases} \sigma_\gamma^2 & \text{if in the same county} \\ 0 & \text{otherwise.} \end{cases} \] Thus, marginally, the random effect means the response is correlated for units within the same group: incomes for people in the same county are correlated with each other.

This decomposition is why the variances in mixed-effects models are often referred to as variance components: the total variance of \(Y\) is decomposed into components coming from the unit-level random error and the random effects. We can quantify this decomposition using the intraclass correlation coefficient.

Definition 23.2 (Intraclass correlation coefficient) The intraclass correlation coefficient (ICC) in a simple random intercepts model \[ \Y = \beta_0 + \gamma\T \Z + e, \] where the random effect is a categorical grouping variable, is \[ \operatorname{ICC} = \cor(Y_i, Y_j) \] for two units \(i\) and \(j\) in the same group.

When \(\var(\gamma) = \sigma_\gamma^2 I\) and \(\var(e) = \sigma^2 I\), \[ \operatorname{ICC} = \frac{\sigma_\gamma^2}{\sigma^2 + \sigma_\gamma^2}, \] by the definition of correlation.

The ICC is between 0 and 1. When the ICC is 0, the random effect variance is 0, and all variation is due to the unit-level random error; when the ICC is 1, the unit-level random error variance is 0, and all variation is due to the random effect. In Figure 23.1, the left panel ICC is nearly 1 and the right panel ICC is nearly 0. The ICC is thus a useful summary of the variation in a random effects model, and estimating the variances is hence a key part of the estimation process.
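The ICC’s interpretation as a within-group correlation can be verified by simulation. A minimal sketch in Python with NumPy (for illustration only; the variances are made up, chosen so the true ICC is 0.8):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_gamma, sigma = 2.0, 1.0      # random effect SD and error SD
n_groups, per_group = 2000, 2

# Each group shares one random intercept; each unit gets its own error
gamma = rng.normal(0.0, sigma_gamma, size=n_groups)
y = gamma[:, None] + rng.normal(0.0, sigma, size=(n_groups, per_group))

# Correlation between the two units in each group vs. the theoretical ICC
empirical_icc = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
theoretical_icc = sigma_gamma**2 / (sigma_gamma**2 + sigma**2)  # 0.8
```

With \(\sigma_\gamma = 2\) and \(\sigma = 1\), the ICC is \(4 / (4 + 1) = 0.8\), and the empirical correlation between the two units in each group should be close to that value.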

23.2.4 Choosing between fixed and random effects

TODO why do we use which when?

23.3 Estimating mixed models

Estimating the parameters of a mixed model is more complicated than in ordinary linear regression. Only in simple cases can we reduce the problem to simple sums of squares; more complex covariance structure, such as in Example 23.4, means standard least-squares geometry does not apply. We have to estimate multiple variance parameters. The typical approach is to assume a distribution for the errors \(e\) so that we can use maximum likelihood.

23.3.1 The likelihood

We can write a mixed model with normal errors in vector form as \[\begin{align*} \Y \mid \gamma &\sim \normal(\beta\T \X + \gamma\T \Z, \sigma^2 I)\\ \gamma &\sim \normal(\zerovec, \Sigma), \end{align*}\] where \(\Sigma\) is a covariance matrix giving the covariance of the random effects. We use bold letters \(\Y\), \(\X\), and \(\Z\) to denote the complete vectors and matrices of \(n\) observations.

The covariance matrix \(\Sigma\) is often structured, as we saw in Example 23.6. As we assume \(\cov(\gamma, e) = 0\) and \(\E[\gamma] = \zerovec\), we can also write the model marginally as \[ \Y \sim \normal(\beta\T \X, \sigma^2 I + \Z \Sigma \Z\T). \] Hence the likelihood of \(\Y\) can be written using the multivariate normal density in terms of \(\beta\), \(\sigma^2\), and \(\Sigma\). The matrix \(\Sigma\) has \(q^2\) entries, making it difficult to estimate in general, but when it is structured there may only be a few parameters to estimate.

Example 23.8 (Likelihood of the random intercepts model) The random intercepts model of Example 23.3 can be written as \[\begin{align*} \Y \mid \gamma &\sim \normal(\beta\T \X + \gamma\T \Z, \sigma^2 I)\\ \gamma &\sim \normal(\zerovec, \sigma^2_\gamma I), \end{align*}\] replacing the general covariance matrix \(\Sigma\) with a single parameter \(\sigma_\gamma^2\), because the random intercepts for different counties are assumed to be independent. We can write this marginally as \[ \Y \sim \normal(\beta\T \X, \sigma^2 I + \sigma_\gamma^2 \Z\Z\T), \] because \(\Z \Sigma \Z\T = \sigma_\gamma^2 \Z\Z\T\) when \(\Sigma = \sigma_\gamma^2 I\). Since each row of \(\Z\) contains a single 1 marking the observation’s county, \((\Z\Z\T)_{ij}\) is 1 when observations \(i\) and \(j\) share a county and 0 otherwise: each observation has marginal variance \(\sigma^2 + \sigma_\gamma^2\), and observations in the same county have covariance \(\sigma_\gamma^2\), as we found in Example 23.7.
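To see this covariance structure concretely, we can compute \(\sigma^2 I + \sigma_\gamma^2 \Z\Z\T\) for a handful of observations. A minimal sketch in Python with NumPy (for illustration only; the county assignments and variances are made up):

```python
import numpy as np

county = np.array([0, 0, 1, 1, 1, 2])   # county index of six respondents
Z = np.eye(3)[county]                   # dummy-coded county indicators
sigma2, sigma_gamma2 = 1.0, 2.0

# Marginal covariance of the six responses
V = sigma2 * np.eye(6) + sigma_gamma2 * Z @ Z.T

# Diagonal entries are sigma2 + sigma_gamma2; off-diagonal entries are
# sigma_gamma2 for respondents in the same county and 0 otherwise.
```

The result is block-diagonal, with one block per county: within a county, responses are correlated through the shared random intercept.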

The likelihood is then a multivariate normal density with this mean and covariance, and can be maximized like any normal likelihood.

23.3.2 Maximum likelihood and optimization

In principle, maximum likelihood estimation is straightforward: we maximize the marginal likelihood with respect to \(\beta\), \(\sigma^2\), and \(\Sigma\). In practice, however, maximum likelihood estimation is difficult here. The \(q \times q\) covariance matrix \(\Sigma\) has \(q^2\) parameters to estimate, and it is constrained, since covariance matrices must be positive definite; such a constraint is difficult to work with. If we assume a specific structure for \(\Sigma\), as in Example 23.5 and Example 23.6, the optimization is easier.

But suppose, for instance, we let \(\var(\gamma) = \sigma_\gamma^2 I\). It is entirely possible for the MLE to be \(\hat \sigma_\gamma^2 = 0\), which is on the boundary of the space, and the gradient of the likelihood may not be zero. That is, the likelihood may improve as we shrink \(\hat \sigma_\gamma^2\) toward zero, without ever reaching a stationary point. Thus, simply taking derivatives and setting them to zero does not suffice to find the MLE. We will likely need a numerical maximizer, and those frequently do not like finding solutions that are exactly on the boundary.

But the problem is not just optimization: The maximum likelihood estimates of the variance are biased. This is true even in ordinary linear regression: Theorem 5.2 showed that the unbiased estimator of \(\sigma^2 = \var(e)\) in a simple linear regression model is \[ S^2 = \frac{\RSS(\hat \beta)}{n - p}, \] where \(p\) is the number of regressors and \(\RSS(\hat \beta)\) is the residual sum of squares. But the maximum likelihood estimate is \[ \hat \sigma^2_\text{MLE} = \frac{\RSS(\hat \beta)}{n}. \]
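This bias is easy to see by simulation. A minimal sketch in Python with NumPy (for illustration only; an ordinary simple linear regression with \(n = 10\), \(p = 2\), and true \(\sigma^2 = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 10, 2, 5000
mle, unbiased = np.empty(reps), np.empty(reps)

for r in range(reps):
    x = rng.normal(size=n)
    y = 3.0 + 2.0 * x + rng.normal(size=n)    # true sigma^2 = 1
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    mle[r] = rss / n                # maximum likelihood estimate
    unbiased[r] = rss / (n - p)    # S^2, the unbiased estimate

# E[RSS/n] = sigma^2 (n - p) / n = 0.8, while E[RSS/(n - p)] = 1,
# so the MLE underestimates the error variance by 20% here.
```

With only 10 observations per fit, the averaged MLE lands near 0.8 rather than the true value of 1; the bias shrinks as \(n\) grows relative to \(p\), but in mixed models \(p\) is often large.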

This is a critical problem in many mixed model applications, as we often interpret the variance components and intraclass correlation (Definition 23.2) directly. The ICC is a natural way to measure the “size” of the random effects, but if the variance estimates are biased, so are the variance component and ICC estimates.

Worse, in some mixed model settings the number of parameters grows with \(n\). For example, if we have a fixed effect for a factor variable, and the number of levels of that factor grows with \(n\), the number of fixed effect parameters will grow with \(n\). In this case, the MLE is inconsistent: as \(n \to \infty\), the MLE does not converge to the true parameter values. This is called the Neyman–Scott problem.

One solution is restricted maximum likelihood estimation, known as REML. This amounts to finding clever ways to rewrite the likelihood so we can estimate the random effects variances alone, without estimating the fixed effects, and then using these to estimate the fixed effects. The details are out of the scope of this course; see Bates et al. (2015), section 3, or Christensen (2020), section 12.6.

23.3.3 Obtaining random effects estimates

Maximizing the (restricted) likelihood of a mixed effects model gives us values for \(\hat \beta\), \(\hat \sigma^2\), and \(\hat \Sigma\). But \(\gamma\) is a random variable, not a parameter, so we do not get estimates of the random effects! We may still want reasonable values for them, so we will have to be creative.

We can imagine a Bayesian setup for \(\gamma\). We have a prior distribution \(p(\gamma)\) specified in our mixed model (\(\normal(\zerovec, \Sigma)\)) and we have a likelihood \(p(y \mid \gamma)\), treating all other parameters as fixed. We can use Bayes’ theorem to obtain the posterior \(p(\gamma \mid y)\). Because the likelihood and prior are both normal, the posterior is also normal, making it straightforward to derive the posterior mean.

Exercise 23.1 (Posterior mean random intercepts) In the random intercepts model of Example 23.8, the posterior mean of \(\gamma\) can be derived simply. By Bayes’ theorem, \[ p(\gamma \mid y) \propto p(y \mid \gamma) p(\gamma), \] and both the prior and likelihood are Gaussian. If we multiply the likelihood by the prior, ignoring constant factors, we obtain \[\begin{align*} p(\gamma \mid y) &\propto \exp\left( -\frac{\|\Y - \X\beta - \Z \gamma\|_2^2}{2 \sigma^2} \right) \exp\left( - \frac{\|\gamma\|_2^2}{2 \sigma_\gamma^2} \right). \end{align*}\] The two exponential terms can be combined and rearranged to form a single Gaussian distribution for \(\gamma\), from which we can extract the posterior mean.
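The result, stated here so the exercise can be checked (a sketch, treating \(\beta_0\), \(\sigma^2\), and \(\sigma_\gamma^2\) as known): for county \(j\) with \(n_j\) respondents whose mean income is \(\bar y_j\), completing the square gives the posterior mean \[ \E[\gamma_j \mid y] = \frac{n_j / \sigma^2}{n_j / \sigma^2 + 1 / \sigma_\gamma^2} \left( \bar y_j - \beta_0 \right). \] The raw county deviation \(\bar y_j - \beta_0\) is shrunk toward zero, most strongly when \(n_j\) is small or when \(\sigma_\gamma^2\) is small relative to \(\sigma^2\).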

TODO detail? Christensen (2020), section 12.2

(In principle, a Bayesian setup should put priors on all the parameters. Here we just fix them at their maximum likelihood estimates. This is “empirical Bayes”.)

23.4 Inference

Conducting inference for mixed models is somewhat more complicated than in ordinary linear models, for the same reason that maximizing the likelihood can be difficult: MLEs for key parameters are biased, and the MLE may be on the boundary of the parameter space. (One of the regularity conditions of Theorem 12.1, which gives the asymptotic distribution of MLEs, is that the parameter not be on the boundary of the parameter space.)

23.4.1 Hypothesis tests

For the fixed effect parameters \(\beta\), it is difficult to derive the null distribution of \(\hat \beta\) in general, because it depends on \(\X\), \(\Z\), and \(\var(\gamma)\). A common approach is to condition on the estimated variance \(\hat \Sigma = \widehat{\var}(\gamma)\); if this is treated as fixed, we can derive least-squares estimates for \(\hat \beta\) and its variance, following standard least-squares theory (Pinheiro and Bates 2000, 66). These estimates, however, do not account for uncertainty in \(\hat \Sigma\). One can derive Wald tests or \(t\) tests based on them, but the null distributions will not be correct, leading to biased \(p\) values.

It is not possible to test null hypotheses involving \(\gamma\) itself, because \(\gamma\) is a random variable, not a parameter. We can only test its variance parameter \(\var(\gamma) = \Sigma\), which can be useful if we want to test the presence of a random effect: as we assume \(\gamma \sim \normal(\zerovec, \Sigma)\), if \(\Sigma\) is the zero matrix, the random effects are all zero and the random effect term can be omitted.

One reasonable approach is a likelihood ratio test (Theorem 11.2). However—and you should get used to “however” appearing frequently in inference for mixed models—the asymptotic distribution of the likelihood ratio statistic depends on certain regularity conditions, and again, one of those conditions is that the parameter not be on the boundary of the parameter space. And the variances being zero is certainly on the boundary. Consequently, even as \(n \to \infty\), the likelihood ratio statistic will not approach the \(\chi^2\) distribution given by Theorem 11.2. In some cases the null distribution can be calculated (Crainiceanu and Ruppert 2004); in most cases we just accept the usual likelihood ratio test as an approximation.
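One commonly cited refinement (stated here without derivation): when testing a single variance component, \(H_0\colon \sigma_\gamma^2 = 0\), the asymptotic null distribution of the likelihood ratio statistic is, under standard conditions, not \(\chi^2_1\) but the equal mixture \[ \tfrac{1}{2} \chi^2_0 + \tfrac{1}{2} \chi^2_1, \] where \(\chi^2_0\) denotes a point mass at zero: roughly half the time, the variance estimate lands exactly on the boundary and the statistic is zero. Comparing the statistic to a plain \(\chi^2_1\) reference distribution therefore gives conservative \(p\) values.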

23.4.2 Confidence intervals

There are, again, several approaches to produce confidence intervals in mixed models. For fixed effects, we can use the approximate Wald confidence intervals based on conditioning on \(\hat \Sigma\). But for both fixed and random effects parameters, a better alternative is profile likelihood confidence intervals (Definition 12.10). As long as the parameter values are not at or near the boundary (i.e., as long as a variance is not estimated to be zero), these can be computed and used.

23.4.3 Making predictions

TODO

23.5 Fitting mixed models in R

In R, the lme4 package is the most popular package for fitting mixed models (Bates et al. 2015). Its lmer() function supports a variety of mixed-effects models, and its glmer() function does the same for GLMs.

To indicate fixed or random effects, lme4 uses a special formula syntax. It follows R’s usual conventions (Section 6.2): a formula of the form y ~ x1 + x2 specifies y as the response variable and x1 and x2 as fixed effects. But the | character has special meaning. The right-hand side of the formula can contain random effect terms of the form (expression | group). The expression can contain multiple terms, interpreted just like in any model formula; the group is a factor variable, and the notation says the terms in the expression are random effects that vary by group.

Based on the model formula, lme4 automatically fills in the structure of \(\Z\) and \(\Sigma = \var(\gamma)\). Each random effect term creates random effect columns in \(\Z\), and all random effects associated with the term have common variance. If the term creates both random intercepts and random slopes, these are assumed to be correlated, but otherwise the random effects are uncorrelated with those introduced by other terms.

For example:

  • (1 | factor): A random intercept term. The intercept varies by level of the factor. \(\Z\) encodes the level of the factor and \(\Sigma\) is assumed to be diagonal, as in Example 23.5.
  • (1 + x | factor), also written (x | factor): A random intercept and slope. Both the slope of x and the intercept vary by level of the factor, and the slope and intercept may be correlated with each other. \(\Z\) encodes the factor levels and their interactions with \(x\); the intercepts have one variance, the slopes have another, and the intercept and slope for each level have a covariance, as in Example 23.6.
  • (1 | factor_a) + (1 | factor_b): A random intercept for factor_a and another for factor_b. These have different variances, but are uncorrelated with each other.

The lmer() and glmer() functions use this syntax to construct the random effect matrix \(\Z\). Let’s see an example.

Example 23.9 (Simulated random intercepts) Let’s simulate data from the small area estimation example (Example 23.1). For simplicity, we’ll model 26 counties, labeled A through Z. Each row of data is one survey respondent, who lives in one county. Each county has a different mean income. We’ll transform income to be normally distributed to make things easier.

library(regressinator)

counties <- LETTERS

county_means <- rnorm(26, sd = 2)
names(county_means) <- counties

county_pop <- population(
  county = predictor(rfactor, levels = counties),
  income = response(50 + by_level(county, county_means),
                    error_scale = 2)
)

respondents <- county_pop |>
  sample_x(n = 1000) |>
  sample_y()

We’ll model an intercept and a random effect for county. Here’s the mixed-effect model fit:

library(lme4)

survey_fit <- lmer(income ~ (1 | county),
                    data = respondents)
summary(survey_fit)
Linear mixed model fit by REML ['lmerMod']
Formula: income ~ (1 | county)
   Data: respondents

REML criterion at convergence: 4314.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3058 -0.6458  0.0171  0.6751  3.3866 

Random effects:
 Groups   Name        Variance Std.Dev.
 county   (Intercept) 3.907    1.977   
 Residual             3.985    1.996   
Number of obs: 1000, groups:  county, 26

Fixed effects:
            Estimate Std. Error t value
(Intercept)  49.5231     0.3928   126.1

We see that summary() prints two tables, one for random effects and one for fixed effects. For random effects, we get the estimated variances \(\hat \sigma_\gamma^2\) and \(\hat \sigma^2\). For the fixed effects, we get the estimated mean and its standard error (based on the conditional distribution of \(\hat \beta\)).
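From these estimates we can compute the estimated intraclass correlation (Definition 23.2): \[ \widehat{\operatorname{ICC}} = \frac{\hat \sigma_\gamma^2}{\hat \sigma^2 + \hat \sigma_\gamma^2} = \frac{3.907}{3.985 + 3.907} \approx 0.50, \] so about half the variance in respondent incomes is due to variation between county means. This matches the simulation, where the county and error standard deviations were both 2.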

The confint() function generates, by default, profile likelihood confidence intervals for all the parameters:

confint(survey_fit)
Computing profile confidence intervals ...
                2.5 %    97.5 %
.sig01       1.496985  2.627607
.sigma       1.910879  2.088391
(Intercept) 48.739423 50.306846

Here .sig01 is \(\sigma_\gamma\) and .sigma is \(\sigma\). We can see there is little uncertainty in estimating \(\sigma\), because we have 1,000 observations, but there is more uncertainty in estimating \(\sigma_\gamma\), because we only have 26 counties.

The ranef() function extracts the posterior mean random effects, i.e., the difference in mean for each county. We could use these and the intercept to report the estimated average income for every county.

Example 23.10 (Random intercepts and slopes) The data from the sleep deprivation study of Example 23.2 is provided in the lme4 package as sleepstudy. The data is plotted in Figure 23.2, which shows there is a great deal of variation from subject to subject both in baseline reaction time (days 0-2) and in reaction time during sleep deprivation (day 3 onward). Thus it is reasonable to assume each subject has a different intercept and also a different slope (a different reaction to sleep deprivation).

Figure 23.2: Average reaction time in the sleep study. The vertical line indicates the start of sleep deprivation (after day 2). Each line is one experimental subject.

We can do this by including a (Days | Subject) term in the model formula, which tells lmer() to include a random intercept and a random slope for Days for each subject:

sleep_fit <- lmer(Reaction ~ Days + (Days | Subject),
                  data = sleepstudy)
summary(sleep_fit)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.10   24.741       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825  36.838
Days          10.467      1.546   6.771

Correlation of Fixed Effects:
     (Intr)
Days -0.138

To read the model summary:

  • The Fixed effects table shows the average intercept and slope. The average subject has a reaction time of about 250 ms on day 0, and their reaction time increases by about 10 ms per day.
  • The Correlation of Fixed Effects table shows the correlations corresponding to the off-diagonal entries of \(\widehat{\var}(\hat \beta)\), i.e., the correlation between \(\hat \beta_0\) and \(\hat \beta_1\).
  • The Random effects table shows the variation in intercepts between subjects, which has a standard deviation of about 25 ms, and the variation in slopes between subjects, about 6 ms/day. The Corr column gives the estimated correlation between each subject’s intercept and slope, here 0.07.

The ranef() function again extracts the posterior mean random effects for each subject:
ranef(sleep_fit)
$Subject
    (Intercept)        Days
308   2.2585510   9.1989758
309 -40.3987381  -8.6196806
310 -38.9604090  -5.4488565
330  23.6906196  -4.8143503
331  22.2603126  -3.0699116
332   9.0395679  -0.2721770
333  16.8405086  -0.2236361
334  -7.2326151   1.0745816
335  -0.3336684 -10.7521652
337  34.8904868   8.6282652
349 -25.2102286   1.1734322
350 -13.0700342   6.6142178
351   4.5778642  -3.0152621
352  20.8636782   3.5360011
369   3.2754656   0.8722149
370 -25.6129993   4.8224850
371   0.8070461  -0.9881562
372  12.3145921   1.2840221

with conditional variances for "Subject" 

23.6 Random effects as shrinkage

TODO show examples of shrinkage, connect to penalization and bias/variance tradeoff

group_intercepts <- c(A = 2, B = 3, C = 1, D = 2.5)
group_slopes <- c(A = 1, B = 0.5, C = 1.2, D = 0)

shrink_pop <- population(
  group = predictor(rfactor, levels = c("A", "B", "C", "D")),
  x = predictor(rnorm),
  y = response(
    by_level(group, group_intercepts) + x * by_level(group, group_slopes),
    error_scale = 1
  )
)

shrinks <- shrink_pop |>
  sample_x(n = 100) |>
  sample_y()

shrink_fit <- lm(y ~ group * x, data = shrinks)
shrink_lmer <- lmer(y ~ (1 + x | group), data = shrinks)
boundary (singular) fit: see help('isSingular')

23.7 Diagnostics

TODO residuals?

Exercises

Exercise 23.2 (Cupping marks) TODO replicate analysis in https://stat.columbia.edu/~gelman/research/unpublished/healing.pdf