9  Regression Assumptions and Diagnostics

So far we have considered the regression model \(\Y = \X \beta + e\), with a variety of assumptions:

  1. The errors have mean 0: \(\E[e] = 0\)
  2. The error variance is constant: \(\var(e \mid \X) = \sigma^2 I\)
  3. The errors are uncorrelated: \(\var(e \mid \X) = \sigma^2 I\). (Yes, the math is the same, but we can violate it in different ways)
  4. The errors are normally distributed.

We’ve also assumed that \(\X\) is fixed, not random.

Assumptions are not magical. If I fit a regression model to data with non-normal errors and non-constant variance, I will still get an estimate \(\hat \beta\), and it may even be a reasonably good estimate. The assumptions are not necessary for fitting the model, only for some of the mathematical results we have derived about bias, variance, and tests for \(\hat \beta\). Let’s examine the consequences. We will use the regressinator package for many examples and exercises in this chapter, along with ggplot2 for plotting:

library(regressinator)
library(ggplot2)

9.1 Residual diagnostics

Because many of the regression assumptions involve the errors, plots of the observed residuals are useful for determining whether our model has fit well. There are several varieties of residuals we can use:

  • Raw residuals, i.e. \(\Y - \hat \Y\)
  • Standardized residuals
  • Studentized residuals

The standardized or Studentized residuals are usually the best choice because, as shown in Section 5.4, the raw residuals have unequal variance even when all the regression assumptions hold.
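For a fitted linear model, each of these is easy to obtain in R. A minimal sketch, assuming a fitted lm object named fit:

resid(fit)      # raw residuals, Y minus the fitted values
rstandard(fit)  # standardized residuals
rstudent(fit)   # Studentized residuals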

To determine whether the mean or variance of the residuals varies with \(\X\) in some way, many of our residual plots will plot the residuals against individual predictors or against linear combinations of them.

Consider plotting the residuals \(\hat e\) against \(U \in \R^n\), where \(U\) is one of the columns of \(\X\) or a linear combination of them (such as the fitted values \(\hat \Y = \X \hat \beta\)). When all the model assumptions above hold,

  1. The residual mean function \(\E[\hat e \mid U]\) is 0 everywhere. In a scatterplot, the residuals should be roughly equally above and below the axis, and there should not be obvious trends or curvature suggesting the mean is higher in some places than others.
  2. The standardized or Studentized residuals have constant variance, so the scatter of residuals above and below 0 should be similar for different values of \(U\).

This means the ideal residual plot looks like a “null plot”, a scatterplot with no obvious trends or patterns.

Scatterplots will not help us judge if the residuals are normally distributed, unless the scatterplot is particularly pathological (e.g. all the residuals only take on a few discrete values, or there is obvious skew). Instead, we can use a quantile-quantile plot of the standardized residuals. If the assumptions are correct, the standardized residuals should be approximately normally distributed with equal variance. To check this, if there are \(n\) residuals, we calculate the expected value of the minimum of \(n\) draws from a standard normal distribution, the expected value of the second-smallest of \(n\) draws from a standard normal distribution, and so on; call these the “theoretical quantiles”. For each residual, we plot its actual value against the expected value: for example, for the smallest observed residual, we plot it against the expected value of the smallest of \(n\) draws.

In the ideal case where every residual is equal to its expected value, this plot will be a diagonal line; but of course the residuals are random, so some deviation is to be expected.
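As a minimal sketch of this construction, assuming std_resid is a vector of standardized residuals (qnorm(ppoints(n)) is the usual approximation to the expected order statistics of a standard normal sample):

qq_data <- data.frame(
  theoretical = qnorm(ppoints(length(std_resid))),  # approximate expected order statistics
  observed = sort(std_resid)                        # smallest residual first
)

ggplot(qq_data, aes(x = theoretical, y = observed)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +  # the ideal diagonal line
  labs(x = "Theoretical quantiles", y = "Observed quantiles")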

Example 9.1 (Well-specified model) Let’s consider a well-specified model, where all assumptions hold, to see these residual plots in action.

nice_pop <- population(
  x1 = predictor("rnorm", mean = 5, sd = 4),
  x2 = predictor("runif", min = 0, max = 10),
  y = response(
    4 + 2 * x1 - 3 * x2, # relationship between X and Y
    # errors are multiplied by this scale factor:
    error_scale = 3.0
  )
)

nice_sample <- nice_pop |>
  sample_x(n = 100) |>
  sample_y()

Figure 9.1: Scatterplots of the raw data from Example 9.1.

A good first step is always to plot the original data, to be sure there are no surprises. The plots in Figure 9.1 look reasonable. Let’s now fit a simple model.

fit <- lm(y ~ x1 + x2, data = nice_sample)

To make diagnostic plots, we will use the augment() function from broom. As described in Section 6.4, this gives us a data frame containing every observation, its fitted value, and its residual. For example, we can plot the residuals against the fitted values, which is a good first check:

library(broom)

augment(fit) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted value", y = "Residual")

Notice we have added a smoothing line, which helps us check if the mean appears to deviate from 0. This plot looks okay: there’s a deviation on the left side, but it appears that’s only one observation pulling up the average, so it shouldn’t worry us.

We can also use partial residual plots to look for nonlinearity. Partial residuals are not standardized and hence have unequal variance, but when the true relationship is nonlinear, partial residual plots may suggest what kind of transformation or additional terms could better fit the data.
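As a reminder of the construction, the partial residual for predictor \(j\) and observation \(i\) is typically defined by adding that predictor’s estimated contribution back to the ordinary residual, \[ \hat e_i^{(j)} = \hat e_i + \hat \beta_j X_{ij}, \] so a partial residual plot shows the relationship between \(x_j\) and the response after removing the estimated effects of the other predictors.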

Example 9.2 (Partial residuals of a well-specified model) The regressinator provides the partial_residuals() function to obtain partial residuals. Because there is a partial residual vector for every predictor, we can make one partial residual plot per predictor. We will plot the partial residuals, a smoothed line showing their trend, and a line (or curve) for the model predictions. (This line is the predictor effect, as defined in Definition 7.2.)

partial_residuals(fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() +
  geom_line(aes(y = .predictor_effect)) +
  geom_smooth(se = FALSE) +
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor value", y = "Partial residual")

Both of these plots look quite linear, with no major deviations from a straight line, suggesting the relationships are indeed linear.

9.2 Mean function misspecification

9.2.1 Nonlinearity

If we fit a linear model, but the true relationship between some of the predictors and \(Y\) is nonlinear, plots of partial residuals (Definition 5.3) may be the most useful way to diagnose this problem.

Example 9.3 (Nonlinear relationship) In this population, x1 is quadratically related to y, while x2 is linearly related.

nonlinear_pop <- population(
  x1 = predictor("runif", min = 1, max = 8),
  x2 = predictor("runif", min = 1, max = 8),
  y = response(0.7 + x1^2 + 0.8 * x2,
               family = gaussian(),
               error_scale = 4.0)
)

nonlinear_data <- sample_x(nonlinear_pop, n = 100) |>
  sample_y()

nonlinear_fit <- lm(y ~ x1 + x2, data = nonlinear_data)

We can quickly make the partial residual plot:

partial_residuals(nonlinear_fit) |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() +
  geom_line(aes(y = .predictor_effect)) +
  geom_smooth(se = FALSE) +
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor value", y = "Partial residual")

Notice that in the left plot, for x1, the smoothed residuals (in blue) resemble a plot of \(x^2\) from \(x = 1\) to \(x = 8\), while the right plot appears roughly linear. The left plot suggests we should try a model with a quadratic term, and indeed that fit has better partial residuals:

lm(y ~ x1 + I(x1^2) + x2, data = nonlinear_data) |>
  partial_residuals() |>
  ggplot(aes(x = .predictor_value, y = .partial_resid)) +
  geom_point() +
  geom_line(aes(y = .predictor_effect)) +
  geom_smooth(se = FALSE) +
  facet_wrap(vars(.predictor_name), scales = "free") +
  labs(x = "Predictor value", y = "Partial residual")

Here the smoothed lines and fitted lines match much better, particularly on the left, where the fitted line’s curvature matches the smoothed residuals very well.

9.2.2 Missing predictors

Another way a model can be misspecified is by not including relevant predictors. We do not always want to include every possible predictor; as discussed in Chapter 2, we can use causal diagrams to determine which variables we must include to measure our effect of interest. But what if one of the relevant variables is not available to us or we do not include it in our model? There is no diagnostic that will reveal this to us directly; our only defense is a strong understanding of the data, how it was generated, and what factors may influence the outcome.

Exercise 9.1 (Correlated missing predictors) Using the mvtnorm package, we can define two predictors that are drawn from a multivariate normal distribution and are correlated with each other. (Because x is defined to be bivariate normal, the regressinator makes two predictors named x1 and x2.)

library(mvtnorm)

corr_omit_pop <- population(
  x = predictor("rmvnorm",
                mean = c(0, 0),
                sigma = matrix(c(4, 3, 3, 4), ncol = 2)),
  y = response(
    4 + 2 * x1 - 3 * x2,
    family = gaussian(),
    error_scale = 1.0
  )
)

corr_omit_sample <- corr_omit_pop |>
  sample_x(n = 100) |>
  sample_y()

In this population, \(\beta_0 = 4\), \(\beta_1 = 2\), and \(\beta_2 = -3\).

Because we typically treat \(\X\) as fixed in regression analysis, the population correlation is not as important here as the sample correlation between the predictor values, as we can see in Figure 9.2.

Figure 9.2: Correlation between \(X_1\) and \(X_2\) in Exercise 9.1.

  1. The true population slope for x1 is 2. Suppose we fit a linear model using only x1 and not x2. What do you expect will happen to the model’s estimates of the slope? If you think they will be biased, be specific about the direction of the bias.

  2. Using sampling_distribution(), fit the model omitting x2 and obtain 1000 draws from the sampling distribution of its estimated slope. Does this confirm your prediction? (A setup sketch appears after this exercise.)

  3. Define \(\beta_1^*\) to be the slope such that \(\E[Y \mid X_1] = \beta_0^* + \beta_1^* X_1\). This is the slope of the population relationship when we ignore x2.

    Draw a causal diagram with x1, x2, and y. We know \(y = f_Y(x_1, x_2)\), so certain arrows must be present; include these in the diagram. However, there could be additional arrows beyond those implied by \(y = f_Y(x_1, x_2)\).

    Suppose we want to estimate the causal effect of x1 on y. Draw the arrows so that \(\beta_1^*\) is the correct target of estimation, and we should only include x1 in the model, not x2.

  4. Draw another causal diagram, but this time draw it so that \(\beta_1\) is the correct target of estimation for the causal effect of x1 on y.
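For part 2, here is a minimal sketch of the setup. The fn and nsim argument names are assumptions based on how sampling_distribution() is described later in this chapter; check the regressinator documentation before relying on them.

# Fit the model that omits x2, then draw from the sampling distribution of its
# coefficients (argument names are assumed; see the package documentation):
omit_fit <- lm(y ~ x1, data = corr_omit_sample)

omit_draws <- sampling_distribution(omit_fit, corr_omit_sample,
                                    fn = broom::tidy, nsim = 1000)

# Compare the average estimated slope for x1 to the true value of 2:
mean(omit_draws$estimate[omit_draws$term == "x1"])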

9.2.3 Incompletely observed predictors

In some cases, we may determine that we need to include several predictors in our model in order to estimate the desired association or causal effect, but some of those predictors are incompletely observed. For instance, in a medical study we might only observe a binary indicator of whether the patient has high blood pressure, not their actual blood pressure; in an economic study, we may know if a family’s income is below the poverty threshold defined by a government assistance program, but not know the actual income value; in a survey, respondents may be asked their age in a question that provides answer choices like “25-34”, “35-44”, and so on, rather than indicating their exact age.

Dichotomizing a variable (converting it to a binary indicator) or discretizing it (breaking it into discrete buckets) can also be done deliberately, rather than only because of measurement limitations. For non-statisticians, it is often easier to interpret a result that says “Families below the poverty threshold have higher rates of illness”, rather than one giving a coefficient for the change in illness rate per unit of income. For doctors, it is often convenient to have a specific threshold to act upon—if the patient’s blood test result is above a specific number, give them a specific treatment—rather than weighing complicated factors and formulas.

But dichotomization and discretization must be used with care. If we must control for (condition on) a specific variable to estimate a causal effect, discretizing or dichotomizing that variable before adding it to the model may harm our ability to accurately control for it.

Exercise 9.2 (Dichotomized predictor) Let’s continue using corr_omit_pop from Exercise 9.1, and suppose that we want to estimate the coefficient for x1 while controlling for x2. However, we do not observe x2 directly: we only observe a binary indicator of whether x2 >= 0. We can fit a model using the dichotomized variable by including it in the model formula:

dichot_fit <- lm(y ~ x1 + I(x2 >= 0),
                 data = corr_omit_sample)

Assume that for the causal effect we want to estimate, it’s necessary to condition on (control for) x2.

  1. Predict what will happen. Will the estimated coefficient for x1 be unbiased? If it is biased, do you expect it to be too large or too small?
  2. Using sampling_distribution(), draw 1000 samples from the sampling distribution of \(\hat \beta\) and report the result. Does this match your prediction?
  3. Would you expect the same result if x1 and x2 were independent? Adjust the population and run the simulation again to test your prediction.

9.3 Error misspecification

9.3.1 Heteroskedastic errors

Suppose \(\var(e) \neq \sigma^2 I\), and instead the errors have non-constant variance, so that \(\var(e)\) is a diagonal matrix whose entries on the diagonal are not all equal.

Exercise 9.3 (Estimator properties in heteroskedastic data) Return to Theorem 5.3, which showed that the least-squares estimator for \(\hat \beta\) is unbiased and has a particular covariance matrix. Repeat each derivation without assuming \(\var(e \mid X) = \sigma^2 I\). Show that the estimator is still unbiased, but the formula for its covariance is no longer correct.

One common type of heteroskedasticity has \(\var(e)\) proportional to one or more of the predictors, or to the response. For example, consider modeling the sale price of homes using information about their size, number of rooms, previous sale price, distance to amenities, and so on. Here the error represents the difference between the actual sale price, which may vary due to buyer preferences, negotiations, local market conditions, and so on, and the price estimated using the house’s features. We might expect the typical error for less valuable homes to be smaller than the typical error for very expensive homes. Expensive homes have fewer potential buyers, more individual variation due to personal and artistic tastes, and potentially many expensive features not included in the model—after all, does your linear regression model have a term for the size of the wine cellar and the presence of a helicopter landing pad? If it doesn’t, those features will contribute to large potential valuation errors.

To detect heteroskedasticity, we can examine residual plots against the predictors and the fitted values.
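One convenient variant is to plot the absolute standardized residuals against the fitted values, which turns a variance trend into a trend in the mean that the smoother makes easier to see. A minimal sketch, assuming a fitted lm object named fit:

augment(fit) |>
  ggplot(aes(x = .fitted, y = abs(.std.resid))) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Fitted value", y = "|Standardized residual|")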

Exercise 9.4 (Diagnosing heteroskedasticity) Here is a population where the error scale grows with \(X_2\): the errors are multiplied by the factor \(0.5 + x_2 / 10\), so their standard deviation increases with \(x_2\):

heteroskedastic_pop <- population(
  x1 = predictor("rnorm", mean = 5, sd = 4),
  x2 = predictor("runif", min = 0, max = 10),
  y = response(
    4 + 2 * x1 - 3 * x2, # relationship between X and Y
    family = gaussian(), # distribution of the errors
    error_scale = 0.5 + x2 / 10 # heteroskedastic!
  )
)

We can plot y against x2 in data sampled from heteroskedastic_pop to try to observe the heteroskedasticity:

heteroskedastic_sample <- heteroskedastic_pop |>
  sample_x(n = 100) |>
  sample_y()

ggplot(heteroskedastic_sample, aes(x = x2, y = y)) +
  geom_point()

  1. Why don’t we see obvious heteroskedasticity in this plot?
  2. Fit a model of y ~ x1 + x2. Construct a lineup of residual diagnostic plots. (See Section 8.2.) Comment on how easy or difficult it is to see the heteroskedasticity in these plots.
  3. Using the sampling_distribution() function, conduct the simulations necessary to establish if the least squares estimate \(\hat \beta_2\) is biased and if the confidence intervals have the correct coverage.

9.3.2 Non-normal errors

The errors \(e_i\) are not always normally distributed. Sometimes the errors come from a distribution with heavier tails; sometimes their distribution is not symmetric; and sometimes they appear to come from a “contaminated” distribution, where most errors are normally distributed but a few are abnormally large. This contamination is often caused by occasional measurement errors or by observations that come from a different population.

Residual Q-Q plots are the most straightforward way to detect non-normal errors.

Exercise 9.5 (t-distributed errors) Suppose the errors have constant variance but are heavy-tailed. For instance, they could follow a \(t\) distribution with 3 degrees of freedom. We can model this by telling response() to use the ols_with_error family, which lets us choose the distribution the errors are drawn from.

heavy_tail_pop <- population(
  x1 = predictor("rnorm", mean = 5, sd = 4),
  x2 = predictor("runif", min = 0, max = 10),
  y = response(
    4 + 2 * x1 - 3 * x2,
    # distribution of the errors:
    family = ols_with_error(rt, df = 3),
    # errors are multiplied by this scale factor:
    error_scale = 1.0
  )
)

heavy_tail_sample <- heavy_tail_pop |>
  sample_x(n = 100) |>
  sample_y()

  1. What consequences do you expect this will have on the parameter estimates and their covariance, as derived in Theorem 5.3?

  2. Use sampling_distribution() to simulate 1000 datasets from heavy_tail_pop and test your predictions. Are the estimates unbiased, and is the variance estimate accurate?

  3. A residual Q-Q plot is the obvious choice to detect this misspecification, but Q-Q plots can be hard to read. Examine the residual plots in Figure 9.3 and try to identify the plot representing the real fit. Run the code yourself so you can check your answer.

    fit <- lm(y ~ x1 + x2, data = heavy_tail_sample)
    
    model_lineup(fit) |>
      ggplot(aes(sample = .std.resid)) +
      geom_qq() +
      geom_qq_line() +
      facet_wrap(vars(.sample)) +
      labs(x = "Theoretical quantiles",
           y = "Observed quantiles")
    decrypt("nsW7 Ykjk l3 gCPljlC3 RT")

    Figure 9.3: Q-Q plots of heavy-tailed data sampled from heavy_tail_pop.

9.3.3 Outliers and contaminated errors

Sometimes the errors are mostly well-behaved, but a few observations have very large errors. This could happen for several reasons:

  • Some people, animals, or objects are just weird, and so they do not fit the same pattern as others. The relationship \(\E[Y \mid X] = \beta\T X\) is true for most cases, but a few weirdos follow a different relationship in ways not accounted for by our current predictors, and so their observed \(Y\) is far from our estimated \(\E[Y \mid X]\).
  • Measurement is prone to error, and so sometimes the outcome variable is measured incorrectly. We observe \(Y = Y^* + \delta\), where \(Y^*\) is the “correct” value and \(\delta\) is some offset or measurement error.
  • Data is recorded and processed by humans who sometimes forget a decimal point, add an extra 0, accidentally put numbers in the wrong column of a spreadsheet, or otherwise mess things up.

The trouble is that observations with large errors can be highly influential: they can pull our \(\hat \beta\) away from \(\beta\), causing high variance in estimates and estimated regression lines that fit the data poorly. We call such observations outliers.

Exercise 9.6 (Contaminated errors) One way to model outliers is with a “contaminated” error distribution. A contaminated distribution is a mixture of two distributions, where one is the “right” distribution and the other is the contamination. For instance, we could model \[ e \sim \begin{cases} N(0, \sigma^2) & \text{most of the time} \\ N(0, 10 \sigma^2) & \text{when you least expect it.} \end{cases} \] Let’s set up a population in the regressinator that has a contaminated error distribution. To do so, we create a function that draws from the normal distribution with probability 0.95, and draws from a distribution with much higher variance with probability 0.05.

rcontaminated <- function(n) {
  # each error is contaminated independently with probability 0.05
  contaminant <- rbinom(n, 1, prob = 0.05)

  # contaminated errors come from a wide Cauchy distribution;
  # the rest are standard normal
  return(ifelse(contaminant == 1,
                rcauchy(n, scale = 20),
                rnorm(n, sd = 1)))
}

Let’s create a simple population where all assumptions hold except the errors are contaminated, again by using ols_with_error.

contaminated_pop <- population(
  x = predictor("rnorm", mean = 5, sd = 4),
  y = response(
    4 + 2 * x,
    error_scale = 3.0,
    family = ols_with_error(rcontaminated)
  )
)

  1. What consequences do you expect this will have on the parameter estimates and their covariance, as derived in Theorem 5.3?

  2. Write code to fit a linear model of y ~ x and plot

    • y versus x with the fitted line overlaid,
    • the standardized residuals versus the fitted values, and
    • Q-Q plots of the residuals.

    Run this code using a sample of \(n = 20\) observations from contaminated_pop. Run it at least 20 times, resampling \(Y\) but not \(X\) each time, and observe how the plots behave. (One way to do this is with the sampling_distribution() function: setting its fn argument to broom::augment will have it call augment() on each fit it simulates, producing a data frame with all the data and residuals from all the simulations.)

    Write a few sentences describing the behavior you observe. In particular, look at how the outliers affect the regression line, and whether their effect depends upon the outlier’s value of \(X\).

Detecting and dealing with outliers requires careful judgment. Even if \(e \sim N(0, \sigma^2)\), we will occasionally see very large errors; that does not mean those observations are “wrong”. Not all outliers have a major effect on the regression line, either, so their presence may not be a problem. We hence need a way to detect outliers and characterize their influence on the regression line, so we can then investigate the cause of problematic outliers. If we can determine the outlier is due to some kind of measurement or recording error, we can correct the error. If we cannot, we must make the difficult decision of whether to keep the outlier—and acknowledge that it may significantly influence our estimates—or to remove it and risk throwing away good information.

Let’s begin by quantifying the influence of individual observations on the regression line.

Definition 9.1 (Leverage) The leverage of observation \(i\) is defined to be \(H_{ii}\), the \(i\)th diagonal entry of the hat matrix.

This definition is reasonable because \(\hat Y = H Y\), and so \[ \hat Y_i = \sum_{j=1}^n H_{ij} Y_j = H_{ii} Y_i + \sum_{j \neq i} H_{ij} Y_j. \] One can show that because \(H\) is symmetric and idempotent, its diagonal entries must be between 0 and 1 (Christensen 2011, sec. 13.1.2). Consequently, if \(H_{ii} \approx 1\), then \(\hat Y_i \approx Y_i\). Observations with high leverage hence have the potential to pull the fitted regression line close to them. Potential is the key word here: an observation with high leverage may already fall along the trend line set by the other observations, so \(\hat Y_i \approx Y_i\) would hold even if the observation were omitted from the fit. Leverage tells us which observations could affect the fit, but not whether they actually do.

Another way to see this is to note that because \(H = \X(\X\T\X)^{-1} \X\T\), the leverage depends only on \(\X\). A point can have high leverage even if it is observed with a very small error \(e\), because \(Y\) does not enter into the leverage calculation.

Because \(H\) projects onto the column space of \(\X\), its trace equals \(p\), so the average leverage is \(p / n\). A common rule of thumb is to take a closer look at observations whose leverage is several times larger than this average.
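A quick numerical check of this fact, assuming a fitted lm object named fit:

mean(hatvalues(fit))           # average leverage: p / n for a full-rank fit
length(coef(fit)) / nobs(fit)  # the same value, computed directly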

We could instead look for points with large error \(e\) and large leverage, since those points are more likely to shift the regression line. But we do not observe \(e\) directly, only the residual \(\hat e\) after fitting to the data. If the true regression relationship is \(Y = \beta\T X + e\), but we instead observe \(\Y = \X \beta + e + \Delta\), where \(\Delta \in \R^n\) is a vector that is mostly zero but has large values for some entries and hence creates outliers, we can show that \[\begin{align*} \E[\hat e] &= \E[Y - \hat Y]\\ &= \E[Y - HY] \\ &= (I - H) \E[Y] \\ &= (I - H) (\X \beta + \Delta) \\ &= (I - H) \Delta. \end{align*}\] The last line follows because \((I - H) \X = 0\). Consequently, when \(H_{ii} \approx 1\), \(\E[\hat e_i] \approx 0\); or, in other words, a point with high leverage will pull the regression line towards itself, so its residual will be small.

This suggests an approach that scales the residual based on the leverage.

Definition 9.2 (Cook’s distance) The Cook’s distance for observation \(i\) is \[ D_i = \frac{\hat e_i^2 H_{ii}}{(1 - H_{ii})^2 p S^2} = r_i^2 \frac{H_{ii}}{p (1 - H_{ii})}, \] where \(S^2\) is the variance estimate from Theorem 5.2, \(r_i\) is the standardized residual, and \(H\) is the hat matrix.

The Cook’s distance has a particularly useful interpretation as a scaled difference between \(\hat \beta\) when the observation is included and \(\hat \beta^{(i)}\), the estimate when observation \(i\) is excluded from the fit.

Theorem 9.1 The Cook’s distance can be written as \[ D_i = \frac{(\hat \beta^{(i)} - \hat \beta)\T \X\T \X(\hat \beta^{(i)} - \hat \beta)}{p S^2}, \] where \(S^2\) is the variance estimate from Theorem 5.2 and \(\hat \beta^{(i)}\) is the least-squares fit when observation \(i\) is excluded. This can be seen as the Mahalanobis distance between \(\hat \beta^{(i)}\) and \(\hat \beta\) (Definition 3.6), which rescales the distance according to the covariance of \(\hat \beta\).

Proof. See Seber and Lee (2003), theorem 10.1 and page 309.

The vector of Cook’s distances can be obtained in R using the cooks.distance() function. Because Cook’s distances can be interpreted as distances rescaled according to \(\var(\hat \beta)\), they have a consistent scale. We can look for large values—say, \(D_i \geq 1\), though this is again a matter of judgment—to indicate that a particular observation substantially changes the regression fit.
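As a concrete illustration (a sketch, assuming a fitted lm object named fit), broom’s augment() reports each observation’s leverage (.hat), standardized residual (.std.resid), and Cook’s distance (.cooksd), and base R’s hatvalues(), rstandard(), and cooks.distance() let us verify the formula in Definition 9.2 directly:

augment(fit) |>
  ggplot(aes(x = .hat, y = .std.resid, size = .cooksd)) +
  geom_point() +
  labs(x = "Leverage", y = "Standardized residual", size = "Cook's distance")

# Verify Definition 9.2 against R's built-in calculation:
h <- hatvalues(fit)
r <- rstandard(fit)
p <- length(coef(fit))
d_manual <- r^2 * h / (p * (1 - h))
all.equal(unname(d_manual), unname(cooks.distance(fit)))  # should be TRUE

# Flag observations whose Cook's distance exceeds the rule of thumb:
which(cooks.distance(fit) >= 1)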

9.4 Random X

It almost goes without saying, but to estimate \(\E[Y \mid X]\) in the population, we require good measurements of \(Y\) and \(X\). Our typical modeling approach already allows the response to be observed with error, because we observe \(Y\) with some random error \(e\). But what if \(X\) is measured with error, so we do not measure the true underlying value but only some proxy for it? The conventional linear regression model treats \(X\) as fixed, so it does not account for measurement error in \(X\). How will estimates be affected?

Formally, the problem is that we want to estimate \(\E[Y \mid X]\), but we do so by estimating \[ \hat \beta^* = \argmin_\beta \|Y - \X^* \beta\|_2^2, \] where each row \(i\) in \(\X^*\) is \[ X_i^* = X_i + \delta, \] where \(\delta\) is an error vector with some distribution. This might be because \(X_i\) involves physical quantities that have to be measured with equipment that has inherent error; because \(X_i\) is a quantity that fluctuates over time but has only been observed at one time, and the relationship of interest is with its long-term average; because we are interested in some variable that cannot be measured and instead are observing an easy-to-measure variable that is only correlated with it; because \(X_i^*\) is obtained by surveying people who may give inaccurate or misremembered answers; or for a whole variety of other reasons.

In this setting, \(\hat \beta^*\) is not unbiased for \(\beta\): \(\E[\hat \beta^*] \neq \beta\). In fact, one can show that for a regressor measured with error, \[ \E\left[\hat \beta_j^*\right] \to 0 \text{ as } \var(\delta) \to \infty. \] This is known as regression dilution: the slopes are “diluted” by the presence of measurement error. Figure 9.4 can help give us some intuition for this phenomenon: the slope is steep in the original data, but in the diluted data, adding error to \(X\) has made the relationship with \(Y\) much less obvious and reduced the slope. If we added even more error to \(X\), the scatterplot would be an amorphous blob.

Figure 9.4: Regression dilution on a simulated dataset. The errors added to \(X\) have mean 0 and standard deviation 7. Least-squares fit lines are shown.
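To see the attenuation numerically, here is a minimal base-R sketch (the numbers are illustrative and unrelated to Figure 9.4): the same response is regressed on the true predictor and on versions of it observed with increasing measurement error.

set.seed(17)
n <- 200
x <- rnorm(n, mean = 0, sd = 3)
y <- 1 + 2 * x + rnorm(n, sd = 1)

coef(lm(y ~ x))["x"]              # slope close to the true value of 2

x_err <- x + rnorm(n, sd = 3)     # x observed with moderate error
coef(lm(y ~ x_err))["x_err"]      # slope attenuated toward 0

x_noisy <- x + rnorm(n, sd = 30)  # x observed with severe error
coef(lm(y ~ x_noisy))["x_noisy"]  # slope nearly 0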

The problem is similar to what we saw with dichotomization in Section 9.2.3: losing information about a predictor (because it is dichotomized or observed with error) biases our estimates for its coefficients, and if that predictor is correlated with other variables or is a confounding variable, our estimates for other predictors may also become biased.

Exercise 9.7 (Regression dilution) We can adapt Exercise 9.2 to create a covariate measured with error:

random_x_pop <- population(
  x = predictor("rmvnorm",
                mean = c(0, 0),
                sigma = matrix(c(4, 3, 3, 4), ncol = 2)),
  x2_err = response(x2, family = gaussian(),
                    error_scale = 2),
  y = response(
    4 + 2 * x1 - 3 * x2,
    family = gaussian(),
    error_scale = 1.0
  )
)

Because x2_err is defined as a response variable, it will have mean x2 with normally distributed error. This is like observing x2 with error. Note that the true population relationship still uses x2.

Sample data from this population and fit a model that uses x2_err instead of x2, i.e. y ~ x1 + x2_err.

  1. Predict what will happen. Will the estimated coefficient for x1 be unbiased? If it is biased, do you expect it to be too large or too small?
  2. Using sampling_distribution(), draw 1000 samples from the sampling distribution and report the result. Does this match your prediction?
  3. Would you expect the same result if x1 and x2 were independent? Adjust the population and run the simulation again to test your prediction.

Regression dilution cannot be detected from the data. (How could you tell, just from a plot, that \(X\) is observed with error?) You must understand how the data was collected. For example, if you are studying a complex social question where socioeconomic status is a confounding variable, controlling for income is like controlling for socioeconomic status measured with lots of error—income is not the same as socioeconomic status, if you could even quantify socioeconomic status fully. Controlling for income will not control for other kinds of difference in socioeconomic status.

You must also understand what your goals are. If you want to study the association between the predictor and response in the population, observing the predictor with error is a problem. But if you want to know the association between the predictor as observed and the response, that’s fine. For example, if you’re using self-reported income as a predictor in a regression, that has error compared to a person’s actual annual income as might be calculated by a forensic accountant; but if you want to know how self-reported income is correlated with self-esteem, it makes more sense to use what a person thinks their income is to know how they think about themselves. The measurement with error is actually the quantity of interest, not the “true” measurement.

Models that account for errors in the measurement of \(X\) are available, and are called errors-in-variables models. They are outside the scope of this book, but you can find detailed information about them should you ever need to deal with measurement error.

9.5 Exercises

Exercise 9.8 (Rent for agricultural land (Weisberg 2014, problem 9.17)) The alr4 package contains a dataset called landrent giving features of farmland in 1977. Each row of data represents one county in Minnesota, and the variables are:

  • Y: The average rent per acre paid by farmers using the land, among farmland used to plant alfalfa.
  • X1: The average rent paid per acre for all tillable land, not just for alfalfa.
  • X2: The number of dairy cows per square mile.
  • X3: The proportion of farmland used as pasture for cows, rather than for growing crops.
  • X4: Binary variable indicating if liming (the addition of calcium- and magnesium-rich minerals to soil to change its pH and help certain plants grow) is required to grow alfalfa in this county’s soil.

Alfalfa is often used as feed for dairy cows, so it’s plausible that in counties with more dairy cows, alfalfa is in higher demand and farmers would be willing to pay more rent to grow it; but since liming costs money, farmers may be less willing to grow alfalfa if liming is required, and hence only willing to grow it if rent is cheaper. Our goal is to test these theories.

  1. Rename the columns in the data to more descriptive names. Conduct a brief exploratory data analysis to explore the relationships between the variables and the distribution of each variable, paying particular attention to any unusual outliers that might affect your analysis and to the shape of relationships between variables, which might determine the model you choose to build.

    Present your EDA in the form of several relevant figures, with captions, and one or two paragraphs of text that refer to the figures, summarize what you have seen, and describe what models may be most useful to answer the research question, based on the results of your EDA.

  2. Fit a linear model to predict rent using the covariates. Choose the covariates to use (and any transformations or other model features) based on your EDA and on your understanding of the research question and the implied causal diagram.

    Present your model fit in the form of a table summarizing the regression coefficients (with meaningful names and column labels) and a short paragraph describing how you chose the model to fit.

  3. Produce residual diagnostics to check your model. If it’s helpful, create some lineups (Section 8.2). Determine which model assumptions appear to hold and which do not.

    Present your diagnostics by showing several figures (any diagnostic plots with unusual or problematic results, or that make important points), with captions, and one or two paragraphs describing the results of your diagnostics. Be explicit about the model assumptions that may be violated, and the effect these violations would have on how we can interpret the results.

  4. For the moment, ignore any possible assumption violations. Give an answer to the research question by presenting the relevant model estimates in APA format (Chapter 25), with brief commentary on what conclusions these estimates let us draw.

Exercise 9.9 (SAT scores) Refer back to Exercise 2.11 for a discussion of the SAT dataset. The data is available in the Sleuth3 package in the data frame case1201.

Based on your causal diagram in Exercise 2.11, construct a model that you can use to answer the research question: Which states have the best SAT performance relative to the amount of money they spend, after accounting for confounding factors?

  1. Describe the model you have chosen and present a table of its coefficients.

  2. Produce diagnostic plots to check your model. If it’s helpful, create some lineups (Section 8.2).

    Present your diagnostics by showing several figures (any diagnostic plots with unusual or problematic results, or that make important points), with captions, and one or two paragraphs describing the results of your diagnostics. Be explicit about the model assumptions that may be violated, and the effect these violations would have on how we can interpret the results.

  3. Choose how to alter your model based on these diagnostics. Present the new model and cursory diagnostics to ensure it is reasonable.

  4. Using the model, produce a table of states ranked by their performance relative to the money they spend. (To save space, you can present just the top 5 and bottom 5 states.)

Hint: If you model SAT scores as a function of expenditure on education, plus any relevant covariates you chose in Exercise 2.11, the residuals indicate performance above or below what is expected given that expenditure (and those covariates).