17  Prediction Goals and Prediction Errors

So far in this course, we have considered problems where our goal is to infer features of the true relationship \(\E[Y \mid X = x]\). We might want to test if a particular predictor is relevant, determine if a relationship is positive, or estimate a particular parameter. This goal requires fitting an appropriate model, using diagnostics to determine if it fits well, and then using inference procedures appropriate for our question of interest.

But often our goal is more straightforward. We imagine there is a true population joint distribution, so that \((X, Y) \sim F\), where \(F\) is some unknown distribution. We have a set of training observations \((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\). If we obtain a new draw \(X^*\) from \(F\), how do we best predict the matching \(Y^*\)?

Many real problems are like this. We may not care what the exact shape of the relationship is or want to learn about it—we just want to make predictions. A spam filter just needs to make the best possible classifications; a social media recommendation algorithm just needs to predict which content you’ll like; a high-frequency trading algorithm just needs to predict changes in stock prices.1 It may be interesting to do inference and learn how features are associated with the response, but it is not necessary.

Prediction raises two simple questions:

  1. How do we define the “best” predictions?
  2. How do we measure how well a model predicts, so we can pick the best one?

17.1 Risk and loss

First, we need a way to measure how good a prediction is. Suppose we have a vector of predictors \(X \in \R^p\) and a response variable \(Y\) we want to predict, drawn from a population distribution \((X, Y) \sim F\). We draw a training set \(\trainset = \{(x_1, y_1), \dots, (x_n, y_n)\}\). We use the training set to estimate a function \(\hat f(X)\), and we predict \(\hat Y = \hat f(X)\).

Definition 17.1 (Loss function) A loss function is a bivariate function \(L(Y, \hat f(X))\) whose value is interpreted as the “loss” of \(\hat f(X)\) for predicting \(Y\). A larger loss represents a worse prediction.

Some common loss functions include:

  • Squared error: \[ L\left(Y, \hat f(X)\right) = \left(Y - \hat f(X)\right)^2 \]
  • Absolute error: \[ L\left(Y, \hat f(X)\right) = \left|Y - \hat f(X)\right| \]
  • Cross-entropy loss, when \(Y \in \{0, 1\}\) and \(\hat f(X) = \widehat{\Pr}(Y = 1 \mid X)\): \[ L\left(Y, \hat f(X)\right) = -\left[Y \log(\hat f(X)) + (1 - Y) \log(1 - \hat f(X))\right] \]
  • Deviance, when \(\ell(\hat \beta)\) is the log-likelihood of the fitted model \(\hat f(X)\) given the observed data: \[ L(Y, \hat f(X)) = - 2 \ell(\hat \beta). \]

We choose loss functions depending on the response variable and our prediction goals. For continuous response variables, squared error is the popular default, but we might choose absolute error to prevent outliers from being influential, or an asymmetric loss if overestimation and underestimation have different costs. For binary outcomes, the cross-entropy loss is the negative of the log-likelihood used in logistic regression (Section 12.2), so minimizing it is equivalent to maximizing the likelihood.
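To make these concrete, here is a minimal sketch of the first three losses as R functions. (The function names are our own, not from any package.) Each takes a vector of observed responses and a vector of predictions, and returns the per-observation losses:

squared_error <- function(y, y_hat) {
  (y - y_hat)^2
}

absolute_error <- function(y, y_hat) {
  abs(y - y_hat)
}

# For binary y and predicted probabilities y_hat = Pr(Y = 1 | X)
cross_entropy <- function(y, y_hat) {
  -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}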

Once we have selected a loss function, we can define the generalization error.

Definition 17.2 (Generalization error) The generalization error, test error, or prediction risk of a model \(\hat f(X)\) trained on a training set \(\trainset\), relative to a loss function \(L(Y, \hat f(X))\), is \[ R_{\trainset} = \E[L(Y, \hat f(X)) \mid \trainset], \] where the expectation is over new draws of \((X, Y)\) from their population joint distribution.

The generalization error is exactly what we want to know: With this model, fit to the training data we have, how much error (loss) can we expect when making predictions on new data? If we have several candidate models, we could estimate the generalization error for each and choose the model with the best results.
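When we know the population, as in a simulation, we can approximate the generalization error directly by averaging the loss over a large fresh test sample. Here is a sketch with squared-error loss and a hypothetical population (draw_from_pop is our own invention, not a standard function):

set.seed(17)

# Hypothetical population: X ~ Unif(0, 1), Y = sin(2 pi X) + noise
draw_from_pop <- function(n) {
  x <- runif(n)
  y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
  data.frame(x = x, y = y)
}

# Fit a candidate model to a single training set
train <- draw_from_pop(50)
fit <- lm(y ~ poly(x, 5), data = train)

# Approximate the generalization error with a large test sample
test <- draw_from_pop(1e5)
mean((test$y - predict(fit, newdata = test))^2)

With real data, of course, we cannot draw fresh samples from \(F\) at will.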

However, we will find the generalization error to be quite hard to estimate. Instead, we’ll estimate the expected prediction error.

Definition 17.3 (Expected generalization error) The expected generalization error, expected prediction error, or expected test error of a model \(\hat f(X)\), relative to a loss function \(L(Y, \hat f(X))\), is \[ R = \E[L(Y, \hat f(X))] = \E[R_{\trainset}], \] where the outer expectation is over the training set \(\trainset\) as well as the test point \((X, Y) \sim F\).

Definition 17.3 is an abuse of notation and hides its meaning behind a deceptively simple definition. The expected generalization error \(R\) is a property of the estimation procedure for \(\hat f(X)\) and the size \(n\) of the training set: If we draw any training set \((x_1, y_1), \dots, (x_n, y_n)\) from \(F\), fit the model \(\hat f(X)\), and then draw any new observation \((X^*, Y^*) \sim F\), what is the expected loss \(L(Y^*, \hat f(X^*))\)?

This is simultaneously more revealing—it tells us which model or estimator is best-suited to the population \(F\)—and less useful, because it doesn’t tell us which model is best given the particular training set \(\trainset\) we have. A model may have a poor generalization error because of an unlucky training sample, even if it is well-suited to the population relationship and has a great expected generalization error.
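Continuing the sketch above, we can approximate the expected generalization error by averaging over many freshly drawn training sets of the same size:

# Generalization error for one freshly drawn training set
risk_one_trainset <- function() {
  train <- draw_from_pop(50)
  fit <- lm(y ~ poly(x, 5), data = train)
  test <- draw_from_pop(1e4)
  mean((test$y - predict(fit, newdata = test))^2)
}

# Expected generalization error: average over many training sets
mean(replicate(200, risk_one_trainset()))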

Also, notice the proliferation of names (test error, prediction risk, expected test error, and so on). These names are used inconsistently by different authors: sometimes “test error” refers to the expected generalization error; sometimes it refers to the error on a specific test set rather than an expectation; sometimes no distinction is made between the two kinds of error.

Finally, we can define the error on our training sample alone, without appeal to any population or expectation.

Definition 17.4 (Training error) The training error of a model \(\hat f(X)\) trained on a training set \(\trainset = \{(x_1, y_1), \dots, (x_n, y_n)\}\), relative to a loss function \(L(Y, \hat f(X))\), is \[ R_\text{train} = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat f(x_i)). \]

The training error is easy to calculate whenever we have fit a model to training data. However, as we will see in Section 17.5, the training error is a poor estimator of the generalization error.
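For the model fit in the sketch above, the training error under squared-error loss is simply the mean squared residual, and it will typically be smaller than the generalization error we estimated from a fresh test sample:

# Training error: average loss over the observations used to fit the model
mean(residuals(fit)^2)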

17.2 Metrics for classification

TODO AUC, ROC, F1, etc.

TODO proper scoring

17.3 Bias, variance, and the bias-variance decomposition

Hastie, Tibshirani, and Friedman (2009), section 7.3

The squared error loss has useful properties that will help us build intuition for the trade-offs between different prediction methods. One of these properties is so useful that it gets its own name: the bias-variance decomposition.

Consider the usual regression setting where \(Y \in \R\) and \(Y = f(X) + e\), with \(f(x) = \E[Y \mid X = x]\) and \(e\) mean-0 noise. Intuitively, we can see that the prediction error can come from several sources:

  1. Even if we estimate \(\E[Y \mid X = x]\) perfectly, if \(\var(e) > 0\), we cannot predict \(Y\) perfectly. The error due to \(e\) is called the irreducible error.
  2. Our estimate of \(\E[Y \mid X = x]\) may be biased. Perhaps we are using a linear model when the relationship is not linear, or perhaps we are using a penalized model (Chapter 19), so our estimated regression has systematic error. This is bias.
  3. Even if our estimator is unbiased, there is sampling variation: we will obtain a different model fit for every training set. This is variance.

Each of these errors can be rigorously defined, and the squared-error prediction risk can be written in terms of them. It’s easiest to start by doing so for point estimation before moving to regression.

17.3.1 In point estimation

Consider a standard point estimation problem: we obtain a sample \[ Y_1, \dots, Y_n \sim F(\theta), \] where \(F\) is some parametric distribution with unknown parameter \(\theta\). We’d like to estimate \(\theta\), so we define an estimator \(\hat \theta_n\), which is a function of \(Y_1, \dots, Y_n\).

We define the bias of \(\hat \theta_n\) as \(\bias(\hat \theta_n) = \E[\hat \theta_n] - \theta\), where the expectation averages over random samples from \(F\). The variance of \(\hat \theta_n\) is again defined across repeated samples: \(\var(\hat \theta_n) = \E\left[(\hat \theta_n - \E[\hat \theta_n])^2\right]\).

TODO diagram

Theorem 17.1 (Bias-variance decomposition for point estimates) The mean-squared error of a point estimator \(\hat \theta_n\) for a parameter \(\theta\), defined to be \[ \MSE(\hat \theta_n) = \E\left[(\hat \theta_n - \theta)^2\right], \] can be written as \[ \MSE(\hat \theta_n) = \var(\hat \theta_n) + \bias(\hat \theta_n)^2. \]

Proof. We expand out the square: \[\begin{align*} \MSE(\hat \theta_n) &= \E\left[ \left( (\hat \theta_n - \E[\hat \theta_n]) + (\E[\hat \theta_n] - \theta)\right)^2 \right]\\ &= \E\left[ (\hat \theta_n - \E[\hat \theta_n])^2 + (\E[\hat \theta_n] - \theta)^2 + 2 (\hat \theta_n - \E[\hat \theta_n]) (\E[\hat \theta_n] - \theta)\right]\\ &= \var(\hat \theta_n) + \bias(\hat \theta_n)^2 + 0, \end{align*}\] as the expectation of the third term is 0.

This decomposition allows us to better understand the tradeoffs between estimators. For instance, it’s intuitive to always use unbiased estimators, because “unbiased” sounds like a good property; but do unbiased estimators always have the lowest error? No!

Example 17.1 (Estimating the mean) Suppose we observe \(Y \sim \normal(\mu, \sigma^2)\) and we want to estimate \(\mu\).

The minimum variance unbiased estimator of \(\mu\) is \(Y\). But consider instead the estimator \(\hat \mu = \alpha Y\), where \(0 \leq \alpha \leq 1\). We have \[\begin{align*} \E[\hat \mu] - \mu &= \E[\alpha Y] - \mu\\ &= \alpha \mu - \mu\\ &= (\alpha - 1) \mu, \end{align*}\] so the estimator is biased if \(\alpha \neq 1\). For variance, we have \[\begin{align*} \var(\hat \mu) &= \var(\alpha Y)\\ &= \alpha^2 \sigma^2. \end{align*}\] Adding the squared bias and variance, we have \[ \MSE(\hat \mu) = (1 - \alpha)^2 \mu^2 + \sigma^2 \alpha^2. \] As \(\alpha \to 1\), the bias goes to 0 and the variance increases. As \(\alpha \to 0\), the bias increases and the variance decreases. The MSE is minimized at \[ \alpha = \frac{\mu^2}{\sigma^2 + \mu^2}, \] so, surprisingly, the lowest-MSE estimator is biased toward 0.
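A quick simulation confirms the calculation; this is a sketch with \(\mu = 1\) and \(\sigma = 1\) chosen arbitrarily, for which the optimal \(\alpha\) is 0.5 and the shrunken estimator has half the MSE of the unbiased one:

set.seed(42)
mu <- 1
sigma <- 1
alpha <- mu^2 / (sigma^2 + mu^2)  # optimal shrinkage; here 0.5

y <- rnorm(1e5, mean = mu, sd = sigma)

mean((y - mu)^2)          # MSE of the unbiased estimator Y: about sigma^2 = 1
mean((alpha * y - mu)^2)  # MSE of alpha * Y: about (1 - alpha)^2 mu^2 + alpha^2 sigma^2 = 0.5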

We can apply a similar decomposition to regression, and it will lead to similar realizations: in Chapter 19, we’ll consider deliberately biasing models to decrease their error.

17.3.2 In regression

To apply the bias-variance decomposition to regression, let’s simplify the problem. Regression involves estimating a function, not a point value, but if we condition on a specific value \(X = x\), then estimating \(\E[Y \mid X = x]\) is a point estimation problem. So let’s adapt the expected generalization error (Definition 17.3) to point estimation.

Definition 17.5 (Conditional generalization error) The generalization error conditional on \(X = x\) is \[ R(x) = \E\left[L(Y, \hat f(X)) \mid X = x\right], \] where the expectation is over both training sets \(\trainset\) and response values \(Y\) drawn from the conditional distribution \(F_{Y \mid X = x}(Y)\).

We can obtain the expected generalization error by marginalizing the conditional generalization error: \[ R = \E[R(x)] = \int R(x) f_X(x) \dif x, \] where \(f_X(x)\) is the marginal density of \(X\).

Now we’re ready to prove the bias-variance decomposition for regression.

Theorem 17.2 (Bias-variance decomposition for regression) In a regression problem where \(Y = f(X) + e\), the conditional generalization error for the squared-error loss can be decomposed as \[ R(x) = \var(\hat f(x)) + \bias(\hat f(x))^2 + \var(e \mid X = x), \] where the bias and variance of \(\hat f(x)\) are taken over repeated training sets.

Proof. By definition, \[\begin{align*} R(x) ={}& \E[(Y - \hat f(X))^2 \mid X = x] \\ ={}& \E\left[ (Y - f(X) + f(X) - \hat f(X))^2 \mid X = x \right]\\ ={}& \E\left[ (Y - f(X))^2 \mid X = x \right] + \E\left[ (f(X) - \hat f(X))^2 \mid X = x \right] \\ &{}+ 2 \E\left[(Y - f(X))(f(X) - \hat f(X)) \mid X = x\right]\\ ={}& \var(e \mid X = x) + \E\left[(\hat f(x) - f(x))^2\right] + 0. \end{align*}\] The cross term is zero because \(Y\) is a new observation, independent of the training set \(\trainset\) and hence of \(\hat f(x)\), so it factors as \(2 \E[Y - f(x) \mid X = x] \, \E[f(x) - \hat f(x)] = 2 \E[e \mid X = x] \E[f(x) - \hat f(x)] = 0\).

The second term is the MSE of \(\hat f(x)\) for estimating \(f(x)\). Following Theorem 17.1, it can be broken into bias and variance.

The bias-variance tradeoff implies that the best generalization error may be achieved by a model that compromises between bias and variance—by a model that is biased. We can deliberately introduce bias if it reduces the variance of the estimator.

Exercise 17.1 (The bias-variance tradeoff) Let’s explore the bias and variance in a specific concrete example, and explore how we might tune a model to achieve the right balance.

We’ve previously used the Doppler function, for instance when discussing polynomials (Figure 10.3) and fitting smoothing splines (Example 14.1). Here’s a population where the relationship between \(X\) and \(Y\) is given by the Doppler function:

library(regressinator)

doppler <- function(x) {
  sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
}
doppler_pop <- population(
  x = predictor(runif, min = 0.1, max = 1),
  y = response(
    doppler(x),
    error_scale = 0.2
  )
)

Let’s consider taking a sample of \(n = 50\) observations and fitting a smoothing spline. We’ll set the effective degrees of freedom for our spline, which is equivalent to setting the penalty parameter \(\lambda\), and try different values to see their bias, variance, and total error. This can be done with smooth.spline() and its df argument; see Example 14.1 for how smooth.spline() is used.

Write a loop to try different values of the effective degrees of freedom, from 5 to 24. For each edf,

  1. Sample \(n = 50\) observations from doppler_pop.
  2. Fit a smoothing spline with the chosen edf.
  3. Use that smoothing spline to obtain a prediction at \(x_0 = 0.2\).
  4. Store the prediction in a data frame, along with the edf used to obtain it.
  5. Repeat steps 1-4 500 times.

You should now have a data frame with \(20 \times 500\) rows, for 20 edfs and 500 simulations each.
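To get started on steps 1-3, here is a sketch of a single iteration, using regressinator’s sample_x() and sample_y() to draw from the population. (The df value here is arbitrary; your loop will vary it.)

# One iteration: sample, fit, and predict at x_0 = 0.2
sample_data <- doppler_pop |>
  sample_x(n = 50) |>
  sample_y()

fit <- smooth.spline(sample_data$x, sample_data$y, df = 10)

predict(fit, x = 0.2)$y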

For each edf, estimate the squared bias \(\left(\E[\hat f(x_0)] - f(x_0)\right)^2\); you can use doppler(0.2) to get \(f(x_0)\). Also, for each edf, estimate the variance \(\E\left[\left(\hat f(x_0) - \E[\hat f(x_0)]\right)^2\right]\).

Make a plot of squared bias versus edf and a plot of variance versus edf.

TODO mention Belkin et al. (2019), Hastie et al. (2022)

17.4 The curse of dimensionality

17.5 Optimism of the training error

Hastie, Tibshirani, and Friedman (2009), section 7.4

All this discussion of prediction error leads to an obvious question: How do we estimate it? If I have several different predictive models and want to know which generalizes the best, how do I do that?

The natural answer would be to use the training error (Definition 17.4), as it can be calculated directly from the training data. However, this is not always wise. In linear regression, for instance, we can always obtain a perfect fit (zero training error) by using as many regressors as training observations, even if the true population relationship is nonlinear. Evidently the training error can be optimistic, in that it is lower than the generalization error.
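Here is a sketch of that extreme case: with 10 observations and a degree-9 polynomial (10 coefficients, counting the intercept), the residuals are all essentially zero even though the true relationship is not polynomial at all.

set.seed(1)
x <- runif(10)
y <- exp(x) + rnorm(10)    # the true relationship is not polynomial

fit <- lm(y ~ poly(x, 9))  # 10 parameters for 10 observations
max(abs(residuals(fit)))   # essentially zero: a perfect training fit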

We can quantify this optimism. It is easiest to do so by comparing the training error to the in-sample prediction error.

Definition 17.6 (In-sample prediction error) The in-sample prediction error of a model \(\hat f(X)\) trained on a training set \(\trainset\), relative to a loss function \(L(Y, \hat f(X))\), is \[ R_\text{in} = \frac{1}{n} \sum_{i=1}^n \E[L(Y_i^*, \hat f(x_i)) \mid \trainset], \] where the expectation is over new draws \(Y_i^*\) from the conditional distribution given \(X = x_i\).

The in-sample prediction error eliminates one source of variation: we make our predictions at the same \(X\) values, but obtain new response values at those points. There is no error introduced by extrapolating to new \(X\) values. This is unsuitable for typical prediction problems (where each observation to predict is a new draw from the population), but matches the standard linear regression setting where \(X\) is considered fixed.

We can now calculate the optimism of the training error relative to the in-sample prediction error.

Theorem 17.3 (Optimism of the training error) Given a model \(\hat f(X)\) trained on a training set \(\trainset\), the optimism of the training error is \[ \text{opt} = R_\text{in} - R_\text{train}. \] The average optimism \(\omega\) averages over training sets \(\trainset\), and one can show that for squared-error loss, \[\begin{align*} \omega &= \E[\text{opt}] \\ &= \frac{2}{n} \sum_{i=1}^n \cov(\hat f(x_i), Y_i)\\ &= \frac{2 \sigma^2}{n} \edf(\hat f), \end{align*}\] where the effective degrees of freedom \(\edf(\hat f)\) is defined in Definition 14.2.

Proof. See Exercise 17.2.

In short, a more flexible model (one with more effective degrees of freedom) will have greater optimism: on average, across all training sets, its training error will be substantially less than the true in-sample prediction error. On average, then, picking the model with the best training error amounts to picking the most flexible model, not the model with the best prediction error.
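We can check Theorem 17.3 by simulation. Here is a sketch with a hypothetical population and fixed design points: for a linear regression with \(p\) coefficients, \(\edf(\hat f) = p\), so the average optimism should be \(2 \sigma^2 p / n\).

set.seed(7)
n <- 50
p <- 5                     # coefficients, including the intercept
sigma <- 1
x <- seq(0, 1, length.out = n)
f_true <- sin(2 * pi * x)  # hypothetical true mean function

optimism_once <- function() {
  y <- f_true + rnorm(n, sd = sigma)       # training responses
  fit <- lm(y ~ poly(x, p - 1))
  y_star <- f_true + rnorm(n, sd = sigma)  # new responses at the same x's
  r_in <- mean((y_star - fitted(fit))^2)   # one-draw estimate of R_in
  r_train <- mean(residuals(fit)^2)        # training error
  r_in - r_train
}

mean(replicate(2000, optimism_once()))  # should be near 2 * sigma^2 * p / n = 0.2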

We can generalize this result from squared-error loss to deviance, when we measure performance using the log-likelihood.

Theorem 17.4 (Optimism of the log-likelihood) Given a model with parameters \(\beta \in \R^p\) and log-likelihood function \(\ell(\beta)\), let \(\ell^*(\hat \beta)\) denote the log-likelihood of the fitted model evaluated on a single new independent draw from the population, and let \(\ell(\hat \beta)\) be the log-likelihood on the \(n\) training observations. As \(n \to \infty\), \[ -2 \E[\ell^*(\hat \beta)] \approx - \frac{2}{n} \E[\ell (\hat \beta)] + 2 \frac{p}{n}, \] so the training log-likelihood is optimistic by a term proportional to the number of parameters \(p\).

Proof. TODO Hastie, Tibshirani, and Friedman (2009) section 7.5

This implies that optimism is a problem for any likelihood-based model, in that it will fit better to the training data than to new, independent test data. In the next chapter, we’ll introduce alternate ways to estimate the prediction error that avoid this bias.

17.6 Exercises

Exercise 17.2 (Optimism) Prove Theorem 17.3.

Exercise 17.3 (Optimism of a polynomial regression) TODO calculate optimism of a polynomial regression depending on \(d\), show how it grows with \(d\)


  1. “Just” is doing a lot of hard work in this sentence. You will soon learn to be wary whenever a manager asks if you can “just” predict some complex human system.↩︎