
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\RSS}{RSS} \DeclareMathOperator{\var}{Var} \DeclareMathOperator{\cdo}{do} \newcommand{\T}{^\mathsf{T}} \newcommand{\X}{\mathbf{X}} \newcommand{\D}{\mathbf{D}} \newcommand{\Y}{\mathbf{Y}} \]

In multiple regression, the design matrix \(\X\) might have \(p\) columns, but these columns might not reflect \(p\) distinct observed predictor variables. We will distinguish between *predictors* or *covariates*, which are the measured values we have for each observation, and *regressors*, which are the values that enter into the design matrix. One predictor may translate into multiple regressors, as we will see below.

To keep things straight, let’s define some notation. Previously we considered models where \(\E[Y \mid X] = \beta\T X\), where \(X \in \R^p\) and \(\beta \in \R^p\). We then defined the design matrix \(\X \in \R^{n \times p}\) as a convenient way to stack all observations.

Let us instead assume \(X \in \R^q\), where \(q \leq p\), and the matrix \(\X\) is formed from the observations of \(X\), but might have additional columns involving transformations of their values. Hence \(X\) are the predictors, but \(\X\) is a matrix of *regressors*.

The first and most obvious additional regressor is the intercept column. It is common to define the design matrix so that \[ \X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1q} \\ 1 & x_{21} & \cdots & x_{2q} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \cdots & x_{nq} \end{pmatrix}. \] The column of ones represents an intercept, so that an individual observation \(Y\) is modeled as \[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_q X_q + e. \] We often use \(\beta_0\) to refer to the coefficient multiplying the column of ones, and so \(\hat \beta_0\) is the estimated intercept. This practice can sometimes cause confusion: if we have \(q\) predictors but \(p = q + 1\) regressors, it is easy to accidentally use the wrong number in calculations that depend on the model degrees of freedom. These calculations generally depend on the number of regressors, i.e. the dimension of the vector \(\beta\), not on the number of predictors.
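As a concrete check of this bookkeeping, R’s `model.matrix()` builds the design matrix from a formula and prepends the intercept column automatically. A minimal sketch with made-up data:

```r
# With q = 2 predictors, the design matrix gains a column of ones,
# giving p = q + 1 regressors. (The data here is made up for illustration.)
d <- data.frame(x1 = c(1.2, 0.7, 3.1), x2 = c(5, 2, 4), y = c(1, 0, 2))
X <- model.matrix(y ~ x1 + x2, data = d)
X        # first column is all ones
ncol(X)  # 3 regressors from 2 predictors
```
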

The intercept \(\beta_0\) is the mean value of \(Y\) when all other regressors are 0: \[ \beta_0 = \E[Y \mid X_1 = 0, \dots, X_q = 0]. \] In some problems this may not be physically meaningful: If it is not possible for some regressors to be 0, the mean outcome when they are 0 is not relevant. For instance, if we are modeling the weight of adults as a function of their height, the intercept represents the mean weight of adults who have zero height. The intercept may be necessary for the model to fit well, but confidence intervals or hypothesis tests for it would not be of any scientific interest.

Additionally, we often use linear models because we think the relationship between \(X\) and \(Y\) may be *locally* linear—because we think a linear model will be a good enough approximation for our needs, even if the true relationship is not linear at all. If this is the case, interpreting the intercept \(\beta_0\) is unreasonable when it reflects behavior for values of \(X\) far from those we observed. Or, in other words, interpreting it may be unreasonable extrapolation. Interpreting the intercept, quoting confidence intervals, and conducting hypothesis tests for its value may only be useful when the intercept has a substantive meaning and we have observed \(X\) values nearby.

When we model individual observations as \[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_q X_q + e, \] the coefficients \(\beta_1, \beta_2, \dots, \beta_q\) are slopes. For example, \[\begin{multline*} \beta_1 = \E[Y \mid X_1 = x_1 + 1, X_2 = x_2, \dots, X_q = x_q] - {}\\ \E[Y \mid X_1 = x_1, X_2 = x_2, \dots, X_q = x_q]. \end{multline*}\] Or, in words, the coefficient \(\beta_1\) is the difference in the mean value of \(Y\) associated with a one-unit increase in \(X_1\), holding all other regressors constant. Our estimate \(\hat \beta_1\) is our best estimate of that difference.

Notice how carefully we have phrased this. Let’s unpack this interpretation in pieces:

- Difference in the mean value of \(Y\): The slope gives the relationship between \(\E[Y \mid X]\) and \(X\), not between \(Y\) and \(X\). Individual observations will be above or below the mean. If we happen to have two observations with \(X_1 = 1\) and \(X_1 = 2\), with all other regressors identical, the difference in their \(Y\) values is almost certainly going to be different from \(\beta_1\).
- Associated with a one-unit increase in \(X_1\): The meaning of the coefficient depends on the units and scale of \(X_1\). Changing the scale of the regressor will correspondingly change the scale of the coefficient. It may not make any sense to compare coefficients for regressors measured in different units: if \(X_1\) is measured in miles and \(X_2\) is measured in volts, \(\beta_1 > \beta_2\) does not mean the relationship between \(X_1\) and \(Y\) is “stronger” or “steeper”, or that \(X_1\) is “more predictive”.
- Holding all other regressors constant: The difference in means is between observations with all other regressors at the same values. But in real populations, the predictors may tend to be correlated with each other: compare units with \(X_1 = a\) and others with \(X_1 = a + 1\) and you may find systematic differences in the other regressors.

For example, suppose we want to understand what kind of physical features are associated with being good at hurdling. We obtain data from numerous hurdlers; our response variable is a score measuring how good they are, calculated from the events they have competed in, and two of the predictors are their height and weight.

Suppose that when we fit the model, our estimated coefficient for height is positive.^{1} It would be tempting to say that, in our data, taller people are, on average, better hurdlers. But this would not necessarily be true. What *is* true is that in our data, taller people *of the same weight* are, on average, better hurdlers—that is, thinner people are better hurdlers. However, in the population of hurdlers, height and weight are likely correlated, because tall people tend to weigh more. It is entirely possible that the average taller person is a worse hurdler than the average shorter person, even if a taller person would on average be better than a shorter person *of the same weight*.

To express this another way, if we have two regressors \(X_1\) and \(X_2\) that are correlated in the population, \[\begin{multline*} \E[Y \mid X_1 = x_1 + 1, X_2 = x_2] - \E[Y \mid X_1 = x_1, X_2 = x_2] \neq{}\\ \E_{X_2 \mid X_1}[\E[Y \mid X_1 = x_1 + 1]] - \E_{X_2 \mid X_1}[\E[Y \mid X_1 = x_1]], \end{multline*}\] because on the right-hand side the outer expectations average over \(X_2\)’s conditional distribution given \(X_1 = x_1 + 1\) and \(X_1 = x_1\), whereas on the left-hand side the expectations are both for a fixed value of \(X_2\).
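A small simulation (with made-up coefficients) illustrates the gap between these two quantities: when \(X_1\) and \(X_2\) are correlated, the coefficient on \(X_1\) in the full model differs from the slope of a regression on \(X_1\) alone, which averages over how \(X_2\) moves with \(X_1\).

```r
set.seed(42)
n <- 10000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)                  # x2 is correlated with x1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true partial slope for x1 is 2
coef(lm(y ~ x1 + x2))["x1"]  # near 2: difference in means holding x2 fixed
coef(lm(y ~ x1))["x1"]       # near 5: x2 shifts along with x1
```

This is the height-and-weight situation in miniature: the simple regression slope mixes the direct association with the association carried by the correlated predictor.
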

**Exercise 7.1 (Cats)** The MASS package includes a `cats` dataset that records the body weight and heart weight of adult cats. We’d like to fit a linear regression to predict heart weight from body weight.

- Fit a regression model and give a table showing the model parameters. For each coefficient, give a sentence in APA format (Chapter 25) interpreting the coefficient. Be sure to include 95% confidence intervals.
- Use the test statistics reported by R to test whether the slope of the relationship is nonzero. Report the results in APA format.

**Exercise 7.2 (Inheritance of height)** Early statisticians were inordinately fascinated with inheritance and genetics. This led to many early advances in the science of genetics, but was also used by some statisticians in the early 1900s to advocate for eugenics (i.e. improving society by selecting people from “the better stocks” and rejecting “inferior” people) and so-called “scientific racism”. In this problem we will try to make positive use of some data collected from 1893 to 1898 by Karl Pearson, who was studying the inheritance of height.

Pearson’s data has 1375 observations of heights of mothers (under the age of 65) and their adult daughters (over the age of 17). The data is available in the alr4 package as the data frame `Heights`. We are interested in the inheritance of height from the mother to the daughter, so we consider the mother’s height (`mheight`, in inches) as the predictor of the daughter’s height (`dheight`, also in inches).

- Perform an exploratory data analysis. There are only two variables, so consider each variable on its own (such as with a histogram) and both together (such as with a scatterplot). Write a few sentences commenting on the variables and their relationships, pointing out any potential problems such as outliers, bimodal distributions, etc. (A good exploratory data analysis is a key part of any data analysis, and will inform your modeling choices and conclusions.)
- Fit the regression of `dheight` on `mheight`. Report the fitted regression line and the estimated variance \(\hat \sigma^2\).
- Give a test for whether the coefficient for `mheight` is 0. Write your results and conclusion in APA format (Chapter 25).
- Write a sentence interpreting the coefficient. Include a 95% confidence interval.
- Describe what it would mean if the coefficient were exactly 1, if it were less than 1, and if it were greater than 1, in the context of the scientific problem.

Often a predictor is a *factor*: a qualitative or categorical variable that only takes on a fixed, discrete set of values, called *levels*. For example, a predictor that indicates whether a patient received the treatment or a placebo might be a binary factor; in surveys, respondents might report their occupation by picking from a fixed set of categories.

Typically we want to encode factors in the design matrix so that each level can be associated with a distinct slope or intercept in the model. To do so, we introduce *dummy variables*, which are regressors taking on only two values; a combination of several dummy variables might represent one factor. The values entered into the design matrix are also sometimes called *contrasts*.

In principle, the simplest dummy variable coding would encode a factor with \(k\) levels as \(k\) dummy variables, each either 0 or 1 depending on whether the observation has that level. An observation that has the second level of the factor would have all dummy variables 0 except the second.

For example, consider a binary factor `treatment_assignment` with levels “treatment” and “placebo”. We might be interested in the model with formula `outcome ~ treatment_assignment`. We could write the model as \[
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix} =
\begin{pmatrix}
1 & 0 & 1 \\
1 & 1 & 0 \\
\vdots & \vdots & \vdots \\
1 & 1 & 0
\end{pmatrix}
\begin{pmatrix}
\beta_0 \\
\beta_1 \\
\beta_2
\end{pmatrix}.
\] Here the first column of \(\X\) is the intercept, the second is a dummy variable representing the “treatment” level, and the third is a dummy variable indicating the “placebo” level. The first observation received the placebo, the second received the treatment, and the last received the treatment. This implies that \[\begin{align*}
\E[Y \mid \text{treatment}] &= \beta_0 + \beta_1 \\
\E[Y \mid \text{placebo}] &= \beta_0 + \beta_2.
\end{align*}\] However, this model is not identifiable: there are infinitely many values of \(\beta\) that lead to identical estimated means, and hence identical \(\RSS\). There are several equivalent ways to describe this:

- A model with \(\beta = (0, 1, 2)\T\) predicts the same means as a model with \(\beta = (100, -99, -98)\T\).
- If \(\hat \beta\) minimizes the residual sum of squares, \(\hat \beta^* = (\hat \beta_0 + c, \hat \beta_1 - c, \hat \beta_2 - c)\T\) also minimizes the residual sum of squares for any \(c \in \R\).
- The model’s design matrix is linearly dependent, and so there is no unique inverse of \(\X\T \X\).
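We can see the rank deficiency directly in R. This sketch uses simulated data (the variable names are ours): `lm()` handles the linearly dependent column by reporting an `NA` coefficient for it rather than failing outright.

```r
set.seed(1)
grp <- factor(rep(c("treatment", "placebo"), each = 10))
y <- rnorm(20)
d_treat <- as.numeric(grp == "treatment")
d_placebo <- as.numeric(grp == "placebo")  # redundant: intercept = d_treat + d_placebo
fit <- lm(y ~ d_treat + d_placebo)
coef(fit)                   # the d_placebo coefficient is NA (aliased)
qr(model.matrix(fit))$rank  # rank 2, despite the 3 columns
```
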

To avoid this problem, we usually choose a dummy coding scheme that has one fewer dummy variable. By default, R uses treatment contrasts.

**Definition 7.1 (Treatment contrasts)** For a factor with \(K\) levels, the treatment contrasts are \(K - 1\) binary dummy variables. The first level of the factor is omitted, and the remaining dummy variables \(k = 1, \dots, K - 1\) are defined by \[
\text{dummy}_k = \begin{cases}
1 & \text{if observation has factor level $k + 1$}\\
0 & \text{otherwise.}
\end{cases}
\] In machine learning, this is often known as *one-hot encoding* with the first level dropped.
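R’s built-in `contr.treatment()` generates exactly this coding. For a factor with three levels, it returns the \(K - 1 = 2\) dummy columns, with the first level receiving all zeros:

```r
contr.treatment(3)
#   2 3
# 1 0 0
# 2 1 0
# 3 0 1
```
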

In the example above, the treatment contrasts would lead to the model \[
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix} =
\begin{pmatrix}
1 & 1 \\
1 & 0 \\
\vdots & \vdots \\
1 & 0
\end{pmatrix}
\begin{pmatrix}
\beta_0 \\
\beta_1
\end{pmatrix},
\] and hence \[\begin{align*}
\E[Y \mid \text{treatment}] &= \beta_0 \\
\E[Y \mid \text{placebo}] &= \beta_0 + \beta_1.
\end{align*}\] The intercept \(\beta_0\) is hence interpreted as the mean value for treated patients, and \(\beta_1\) is the *difference* in means between treated patients and those who received the placebo. A hypothesis test for the null hypothesis \(\beta_1 = 0\) tests whether the treatment and placebo groups have identical means. The “treatment” level is hence the baseline level against which “placebo” is compared. We could also choose to keep all levels and omit the intercept from the model, which would also lead to an identifiable model.

The choice of baseline does not affect the model in the sense that it does not change the column span of \(\X\); it only changes the interpretation of \(\beta\). Other software packages may choose different contrasts; for instance, SAS omits the last level of a factor, not the first. Consequently, the main advantage of choosing different contrasts is simply convenience: for instance, if your research question is to compare a baseline level to other treatment levels, choosing contrasts so the baseline level is omitted (and part of the intercept) means tests of the coefficients for the other levels will be tests of whether those groups have different means than the baseline.

**Exercise 7.3 (Treatment contrast estimates)** Consider a regression with a response variable \(Y\) and a single predictor \(X\) that is a factor with \(K\) levels. Let \(N\) be the total sample size and \(n_k\) be the number of responses in factor level \(k\). The residual sum of squares is a sum that can be written per factor level: \[
\RSS(\beta) = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (Y_{ki} - \beta\T X_{ki})^2,
\] where \(X_{ki}\) is a vector containing the intercept and treatment contrasts for observation \(i\) in factor level \(k\).

Show that the least squares estimate of the mean within factor level \(k\) is \[ \E[Y \mid X \text{ in level } k] = \frac{1}{n_k} \sum_{i=1}^{n_k} Y_{ki}, \] the sample average of observations in factor level \(k\). (Weisberg 2014, problem 5.1)

In R, any predictor stored as a logical vector (`TRUE` or `FALSE`) or character vector will be implicitly converted into a factor. One can also use the `factor()` function to create a factor and control the order of the factor levels; factors can also have labels assigned to each level, allowing you to create human-readable labels for levels that might be encoded in the data in less readable form. The forcats package provides useful utilities for reordering and manipulating factors.
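A brief sketch of `factor()` in use (the level codes here are made up): the `levels` argument fixes the order, which determines the baseline under treatment contrasts, and `labels` attaches readable names to the stored codes.

```r
# Data stored as terse codes; levels sets the order, labels renames them
arm <- factor(c("T", "P", "P", "T"),
              levels = c("P", "T"),
              labels = c("placebo", "treatment"))
levels(arm)  # "placebo" "treatment"; placebo is now the baseline level
table(arm)
```
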

R also allows the user to select the contrasts to use for a factor.

**Example 7.1 (Contrasts in R)** The `iris` dataset indicates the species of each observation with a factor. There are three species:

```
levels(iris$Species)
```

```
[1] "setosa"     "versicolor" "virginica"
```

A model using species as a predictor will default to treatment contrasts. R can print the contrasts for us:

```
contrasts(iris$Species)
```

```
           versicolor virginica
setosa              0         0
versicolor          1         0
virginica           0         1
```

The rows are factor levels; the columns are the dummy variables and indicate the dummy variable values used for each factor level. We can see setosa is omitted, since it is the first factor level, and there is an indicator each for versicolor and virginica.

But we can choose other contrasts by using the `contrasts()` function. (See its documentation for more details.) For instance, we can use “sum to zero” contrasts:

```
contrasts(iris$Species) <- contr.sum(levels(iris$Species))
contrasts(iris$Species)
```

```
           [,1] [,2]
setosa        1    0
versicolor    0    1
virginica    -1   -1
```

These contrasts lead to different interpretations of the coefficients: with sum-to-zero contrasts, the intercept is the unweighted average of the level means, and each coefficient gives one level’s deviation from that average.
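We can check the sum-to-zero interpretation numerically with the built-in `iris` data: under `contr.sum`, the fitted intercept equals the unweighted average of the three species means, and the first coefficient is setosa’s deviation from that average.

```r
iris2 <- iris  # work on a copy so the default contrasts are untouched
contrasts(iris2$Species) <- contr.sum(3)
fit <- lm(Petal.Width ~ Species, data = iris2)
level_means <- tapply(iris2$Petal.Width, iris2$Species, mean)
coef(fit)[1] - mean(level_means)                     # ~0: intercept = average of level means
coef(fit)[2] - (level_means[1] - mean(level_means))  # ~0: deviation for setosa
```
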

**Exercise 7.4 (Sum-to-zero contrasts)** Consider modeling `Petal.Width ~ Species` in the `iris` data described in Example 7.1. Write out the model when using treatment contrasts and when using sum-to-zero contrasts, and write the mean for each species in terms of \(\beta\).

**Exercise 7.5 (Weisberg (2014), problem 4.9)** Researchers in the 1970s collected data on faculty salaries at a small Midwestern college. Using the data, they fit a regression function \[
\hat r(\text{sex}) = 24697 - 3340 \times \text{sex},
\] where \(\text{sex} = 1\) if the faculty member was female and zero if male. The response variable, salary, is measured in dollars.

Note: sex is a dichotomous variable, at least as defined by these researchers in this dataset, so the regression function is not continuous: it takes only 2 values, corresponding to \(\text{sex} = 1\) and \(\text{sex} = 0\).

- Describe the meaning of the two estimated coefficients.
- What is the mean salary for men? For women?
- Alternatively, we could include a variable for the number of years each faculty member has been employed at the college. Draw a causal diagram of a plausible causal relationship between years of service, sex, and salary, and comment on whether the previous regression could attain an unbiased estimate of the direct causal effect of sex on salary. (By “direct” we mean the arrow directly from sex to salary, and not causal effects through any other paths. *Hint:* Consider changes in the role of women in the workplace in the 1970s.)
- When the number of years of employment is added to the model, the researchers obtained the estimated regression function \[ \hat r(\text{sex}, \text{years}) = 18065 + 201 \times \text{sex} + 759 \times \text{years}. \] The coefficient for sex has changed signs. Describe the meaning of each coefficient in a sentence, and comment on how the sign could have changed, based on your answer to the previous question.
- What is the mean salary for men as a function of years? What is the mean salary for women as a function of years? Write out both mathematically.

An *interaction* allows one predictor’s association with the outcome to depend on values of another predictor.

For example, consider a model with two continuous predictors and their interaction. We represent this with three regressors, in the model \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + e.
\] The \(X_1\) and \(X_2\) regressors are referred to as *main effects*, and the \(X_1 X_2\) regressor is their interaction. We can rewrite this model in two ways, allowing us to interpret it in terms of the original regressors: \[\begin{align*}
Y &= (\beta_0 + \beta_2 X_2) + (\beta_1 + \beta_3 X_2) X_1 + e\\
Y &= \underbrace{(\beta_0 + \beta_1 X_1)}_\text{intercept} +
\underbrace{(\beta_2 + \beta_3 X_1)}_\text{slope} X_2 + e.
\end{align*}\] The first form expresses the linear association between \(Y\) and \(X_1\) as having an intercept and slope that both depend on \(X_2\); equivalently, in the second form, the slope and intercept of the association between \(Y\) and \(X_2\) depend on \(X_1\).

Similarly, consider a model with one continuous predictor and one factor predictor with three levels. Using treatment contrasts (Definition 7.1), we have an intercept, three main effect regressors, and two interaction regressors: \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_2 + \beta_5 X_1 X_3 + e, \] where \(X_1\) is the continuous predictor’s regressor and \(X_2, X_3\) are the factor regressors. Rearranging this, the relationship between \(X_1\) and \(Y\) can be written as \[ Y = \underbrace{(\beta_0 + \beta_2 X_2 + \beta_3 X_3)}_\text{intercept} + \underbrace{(\beta_1 + \beta_4 X_2 + \beta_5 X_3)}_\text{slope} X_1 + e. \] Here the intercept and slope depend on which of the three factor levels the observation has; the intercept is \(\beta_0\) in the baseline level, \(\beta_0 + \beta_2\) in the level represented by \((X_2, X_3) = (1, 0)\), and \(\beta_0 + \beta_3\) in the level represented by \((X_2, X_3) = (0, 1)\). We can similarly work out the interpretations of the slope.

When an interaction is present, the normal interpretation of coefficients as slopes (as described in Section 7.2) no longer holds for the predictors involved in the interaction. Returning again to a model with two predictors (\(X_1\) and \(X_2\)) and their interaction (\(X_1 X_2\)), the change in the mean of \(Y\) when \(X_1\) is one unit higher, holding \(X_2\) fixed, is \[\begin{multline*} \E[Y \mid X_1 = x_1 + 1, X_2 = x_2] - \E[Y \mid X_1 = x_1, X_2 = x_2] ={}\\ (\beta_1 + \beta_3 x_2) (x_1 + 1) - (\beta_1 + \beta_3 x_2) x_1 = \beta_1 + \beta_3 x_2. \end{multline*}\] Hence \(\beta_1\) only has its conventional interpretation when \(x_2 = 0\). For an interaction between a continuous predictor and a factor, this is just another way of saying that the interaction allows different slopes per factor level, but this problem is easy to overlook when interpreting interactions of continuous predictors.
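We can verify this algebra numerically with a small simulated fit: the difference in fitted means for a one-unit increase in \(x_1\) matches \(\hat\beta_1 + \hat\beta_3 x_2\) exactly, whatever value of \(x_2\) we hold fixed.

```r
set.seed(7)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1 * d$x2 + rnorm(50)
fit <- lm(y ~ x1 * x2, data = d)
b <- coef(fit)

# Difference in fitted means for a one-unit increase in x1, at a fixed x2
slope_at <- function(x2) {
  predict(fit, data.frame(x1 = 1, x2 = x2)) -
    predict(fit, data.frame(x1 = 0, x2 = x2))
}
slope_at(0)  # equals b["x1"]
slope_at(2)  # equals b["x1"] + 2 * b["x1:x2"]
```
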

**Example 7.2 (Palmer Penguins)** The Palmer Penguins dataset records size measurements for three species of penguins, collected by Dr. Kristen Gorman at the Palmer Station in Antarctica. Dr. Gorman recorded the bill length and flipper length for each penguin, among other features. The bill length and flipper length appear to be associated, as we can see in Figure 7.1, using the code below.

```
library(palmerpenguins)
library(ggplot2)
library(broom)
ggplot(penguins,
       aes(x = flipper_length_mm, y = bill_length_mm,
           color = species)) +
  geom_point() +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       color = "Species")
```

To assess if the slope of this relationship differs by group, we can fit a linear model with an interaction, using the formula syntax from Section 6.2. The coefficients are shown in Table 7.1.

```
penguin_fit <- lm(
  bill_length_mm ~ flipper_length_mm + species +
    flipper_length_mm:species,
  data = penguins
)
```

|  | (1) |
|---|---|
| (Intercept) | 13.587 [1.685, 25.489] |
| flipper_length_mm | 0.133 [0.070, 0.195] |
| speciesChinstrap | −7.994 [−28.611, 12.623] |
| speciesGentoo | −34.323 [−53.639, −15.007] |
| flipper_length_mm × speciesChinstrap | 0.088 [−0.018, 0.194] |
| flipper_length_mm × speciesGentoo | 0.182 [0.088, 0.275] |
| R2 | 0.785 |
| Num.Obs. | 342 |

We interpret this as follows. For Adelie penguins (the baseline level), the association between bill length and flipper length is 0.13 mm of bill length per millimeter of flipper length, on average. But for chinstrap penguins, the association is 0.13 + 0.09 = 0.22 mm of bill length per millimeter of flipper length, on average.

Notice that in this model, we had three predictors (flipper length, species, and their interaction) and six regressors. The species predictor corresponds to two regressors, because it is a three-level factor, while the interaction predictor corresponds to two additional regressors, because each factor regressor is multiplied by the flipper length regressor.
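This predictor-versus-regressor counting is easy to verify from the design matrix. Here is a sketch using the built-in `iris` data as a stand-in with the same structure (one continuous predictor interacted with a three-level factor):

```r
X <- model.matrix(Petal.Width ~ Petal.Length * Species, data = iris)
colnames(X)  # intercept, 1 continuous regressor, 2 factor dummies, 2 interactions
ncol(X)      # 6 columns in all
```
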

**Exercise 7.6 (Palmer Penguins models)** For this problem, use the model fit in Example 7.2.

- For each species, calculate the slope and intercept of the fitted relationship between bill length and flipper length.
- For each coefficient reported in Table 7.1, interpret its meaning in APA format (Chapter 25), being careful to interpret the main effect and interactions precisely.
- Using the coefficients printed in Table 7.1, calculate the expected bill length for a Gentoo penguin whose flipper length is 210 mm.

Interactions between factor variables, or between a factor variable and a continuous variable, can often be used to answer substantive questions. Tests for whether the interaction coefficients are zero might be scientifically meaningful, provided they are interpreted correctly.

**Example 7.3** Consider a study of a new treatment for dihydrogen monoxide poisoning. Suppose we have a binary treatment assignment (treatment or placebo) and a continuous outcome variable rating the patient’s symptoms one hour after treatment. We want to know if the efficacy of the treatment depends on whether the patient has antibodies for dihydrogen monoxide, measured by a test that indicates the presence or absence of antibodies.

We fit a model with the formula `symptoms ~ treatment * antibodies`, where both treatment and antibodies are binary factors. Let \(X_1\) be a binary indicator of treatment and \(X_2\) be a binary indicator of antibodies. If placebo and antibody absence are the baseline levels, this model would be \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + e,
\] where the coefficients are interpreted as follows:

- \(\beta_0\): Average symptom rating for patients who received the placebo and had no prior antibodies
- \(\beta_1\): Difference between the mean symptom ratings of treated patients and untreated patients, for those patients without antibodies (because of the interaction)
- \(\beta_2\): Difference between the mean symptom ratings of patients with and without antibodies, for those patients who received the placebo (because of the interaction)
- \(\beta_3\): *Additional* difference beyond that given by the treatment coefficient, for patients *with* antibodies; or, equivalently, the *additional* difference beyond that given by the antibodies coefficient, for patients who were treated

It may be easier to see this as a table of values of \(\E[Y \mid X]\):

|  | Placebo | Treatment |
|---|---|---|
| No antibodies | \(\beta_0\) | \(\beta_0 + \beta_1\) |
| Antibodies | \(\beta_0 + \beta_2\) | \(\beta_0 + \beta_1 + \beta_2 + \beta_3\) |

If the interaction coefficient \(\beta_3\) is zero, the difference in symptoms between patients receiving treatment and those receiving control is equal to \(\beta_1\), regardless of the presence or absence of antibodies. If \(\beta_3\) is nonzero, the difference between placebo and treatment depends on the presence or absence of antibodies.

**Exercise 7.7 (Confidence intervals for slopes)** Consider the `penguin_fit` model from Example 7.2. Load the data and fit the model.

- Give the slope of the relationship between flipper length and bill length for chinstrap penguins.
- Calculate a 95% confidence interval for that slope by using Theorem 5.5 to combine the necessary coefficients.

It is sometimes tempting to include an interaction regressor but *not* the main effect regressors corresponding to it. This, however, is rarely a good idea, as it produces a model with strange restrictions on the effects included. For example, in Example 7.3, omitting the treatment main effect is equivalent to enforcing that \(\beta_1 = 0\); this would only allow a difference in symptoms between treatment and placebo for those patients with antibodies. In scientific settings we rarely have a reason to believe this is true *a priori*; it may be a question to be answered using the data, but it is rarely something we know in advance and want to include in our model.

Similarly, a test of whether the treatment main effect is equal to zero is a test of whether there is no treatment effect *specifically among those with no antibodies*; it does not test whether there is no treatment effect at all.

The rule that we include main effects when we use their corresponding interactions is known as the *marginality principle*, and can be formalized (Peixoto 1990). It suggests we should be careful when adding and removing main effects and interactions from a model, and should take care to interpret tests of main effects when their interactions are present.

**Example 7.4 (Nelder (1998))** Consider measuring the response \(Y\) of cells to a drug given at dosage \(X_1\). The dose-response relationship is linear. An adjuvant drug is added at dosage \(X_2\); the adjuvant causes no response on its own, but merely enhances the effect of the drug. (For example, in vaccines, an adjuvant compound like alum can enhance the body’s immune response to the vaccine, though on its own alum cannot confer immunity to anything.) In this case, a model such as \[\begin{align*}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + e\\
&= \beta_0 + (\beta_1 + \beta_2 X_2) X_1 + e
\end{align*}\] is reasonable, since we know that when \(X_1 = 0\), there should be no association between the adjuvant dose \(X_2\) and \(Y\).

In all of our interpretations, of course, we must be careful about making causal claims. If we estimated \(\E[Y \mid X = x]\), we must be careful about claiming that changing \(X\) *causes* a change in the mean of \(Y\)—unless we have estimated \(\E[Y \mid \cdo(X = x)]\), following the causal principles in Chapter 2.

Statisticians are often tempted, or explicitly trained, to never make causal claims: to only speak of associations, and to never claim that a regression implies causality. This caution can be good, because with observational data, we *usually* can’t make causal claims. But as we discussed in Chapter 2, sometimes we *can* control for confounders, and we *can* make causal claims. In those situations, our obligation is to make clear the limitations of our claims:

- If our regression model is misspecified or otherwise incorrect, our estimates may be wrong.
- If our causal model is missing important confounders, or we have measured some confounders incompletely or inaccurately, our estimates may include some bias from confounding.
- If our data comes from a specific sample or subset of a population, the causal claims may not generalize to every situation we want to generalize them to.

TODO sensitivity

When we fit a complicated linear model, it can be hard to interpret and visualize the relationships we discover. If we have several different factor variables, each with multiple levels, plus many continuous variables, plus interactions between different combinations of each, it can be quite difficult to simply read the table of coefficients and interpret what’s going on. What direction is the association between a particular predictor and \(Y\)? It might depend on the value of several other variables, depending on the interactions.

A predictor effect plot is designed to help us visualize these relationships (Fox and Weisberg 2018). (Do not confuse “effect” with “causal effect” here!). Each predictor effect plot helps us visualize the relationship between that predictor and \(Y\), and how that relationship might depend on other predictors.

**Definition 7.2 (Predictor effect plot)** Consider a regression model with predictors \(X \in \R^q\). Select a *focal predictor* \(X_f\). We can then partition the predictors so that \(X = (X_f, X_i, X_o)\), where \(X_i\) represents the set of predictors that have interactions with \(X_f\), and \(X_o\) represents the set of remaining predictors. (\(X_i\) and \(X_o\) can be empty sets.) The *predictor effect plots* for \(X_f\) are plots of \[
\hat r(x_f, x_i, x_o^a) \; \text{versus} \; x_f,
\] where \(\hat r\) is the estimated regression function and \(x_o^a\) represents an average value of the other predictors. Multiple plots are made, each setting \(x_i\) to a specific value; for example, we might set it to multiple values across a range, or to each level of a factor, if the interaction is with a factor.

For the average value \(x_o^a\), continuous predictors may be set at their mean, while for factor variables, the expectation may be calculated by obtaining an estimate for each level of the factor and then averaging those estimates, weighted by the prevalence of each level.
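The recipe above can be sketched directly with `predict()` in base R. This illustration uses the built-in `mtcars` data as a stand-in (the variable choices are ours, not from the text): the focal predictor `hp` varies over a grid, the interacting factor `cyl` takes each of its levels, and the remaining continuous predictor `wt` is fixed at its mean.

```r
dat <- mtcars
dat$cyl <- factor(dat$cyl)
fit <- lm(mpg ~ hp * cyl + wt, data = dat)  # hp is the focal predictor

grid <- expand.grid(
  hp  = seq(min(dat$hp), max(dat$hp), length.out = 5),  # focal grid
  cyl = levels(dat$cyl),                                # one line per level of X_i
  wt  = mean(dat$wt)                                    # X_o held at its average
)
grid$fit <- predict(fit, newdata = grid)
head(grid)  # plotting fit versus hp, grouped by cyl, gives the effect plot
```
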

This definition is a little complicated, so let’s make several example plots based on Example 7.2. We will use the ggeffects package, which can automatically make effect plots for many types of models. Note that you also need to install the effects package; if it is unavailable, ggeffects will default to calculating effect plots differently from how we described them.

**Example 7.5 (Penguin effect plots)** Let’s construct an effect plot using the model we already fit in Example 7.2. We will select flipper length as the focal predictor. The set \(X_i\) contains the species predictor, because it has an interaction with the focal predictor, and the set \(X_o\) is empty. We need a plot for multiple values of \(x_i\), and since species is discrete, it’s reasonable to choose to make one plot per species. That is what the ggeffects package chooses automatically.

To make the plot, we use the `ggeffect()` function and choose our focal predictor in the `terms` argument. We must tell it explicitly which variables to include in \(X_i\) by providing additional terms, so we tell it to make a plot per species. The function returns a special data frame, and the package provides an S3 method for `plot()` (see Section 6.1) that constructs a plot using ggplot2. The code below produces the effect plot shown in Figure 7.2.

```
library(ggeffects)  # install the effects package too
ggeffect(penguin_fit,
         terms = c("flipper_length_mm", "species")) |>
  plot() +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       color = "Penguin species",
       title = "Flipper length effect plot")
```

Notice how the multiple plots have been combined into one. By default, the plot includes 95% confidence intervals for the mean. We can now see graphically that the interaction allowed each penguin species to have a different slope and intercept.

Now let’s consider a more complicated situation by adding in body mass:

```
penguin_fit_2 <- lm(
  bill_length_mm ~ flipper_length_mm + species +
    flipper_length_mm:species + body_mass_g,
  data = penguins
)
```

If we make the same effects plot as before, body mass is in \(X_o\) and hence the plot uses its average value when drawing the lines, as shown in Figure 7.3.

```
ggeffect(penguin_fit_2,
         terms = c("flipper_length_mm", "species")) |>
  plot() +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       color = "Penguin species",
       title = "Flipper length effect plot")
```

We could also ask for multiple plots at different values of body mass. The range of typical body masses seems to be from about 3,000 g to about 6,000 g, so let’s pick four reasonable values. The result is shown in Figure 7.4.

```
ggeffect(
  penguin_fit_2,
  terms = c("flipper_length_mm", "species",
            "body_mass_g [3000, 4000, 5000, 6000]")
) |>
  plot() +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       color = "Penguin species",
       title = "Flipper length effect plot",
       subtitle = "By body mass (g)")
```

In our chosen model, there is no interaction between body mass and flipper length, so the lines in every plot have identical slopes; however, their vertical positions are different, because each plot sets a different fixed value of body mass.

Having never hurdled myself, I have no idea what attributes help a hurdler hurdle, so this example is completely made up. I chose hurdling because it is the athletic event whose name sounds the strangest when you read it many times in short succession. Hurdle hurdle hurdle.