16  Prediction Goals and Prediction Errors

TODO define risk, loss, prediction error, bias-variance tradeoff, optimism, training and test error, etc.

TODO mention prediction intervals somewhere

16.1 Risk and loss

Hastie, Tibshirani, and Friedman (2009), sections 7.1-7.2

16.2 Bias, variance, and the bias-variance decomposition

Hastie, Tibshirani, and Friedman (2009), section 7.3

Example 16.1 (Estimating the mean) Suppose we observe \(Y \sim \normal(\mu, \sigma^2)\) and we want to estimate \(\mu\).

The minimum variance unbiased estimator of \(\mu\) is \(Y\). But consider instead the shrunken estimator \(\hat \mu = \alpha Y\), where \(0 \leq \alpha \leq 1\). We have \[\begin{align*} \E[\hat \mu] - \mu &= \E[\alpha Y] - \mu\\ &= \alpha \mu - \mu\\ &= (\alpha - 1) \mu, \end{align*}\] so the estimator is biased whenever \(\alpha \neq 1\) and \(\mu \neq 0\). For the variance, we have \[\begin{align*} \var(\hat \mu) &= \var(\alpha Y)\\ &= \alpha^2 \sigma^2. \end{align*}\] Adding the squared bias and the variance, we have \[ \text{MSE} = (1 - \alpha)^2 \mu^2 + \alpha^2 \sigma^2. \] As \(\alpha \to 1\), the bias goes to 0 and the variance increases; as \(\alpha \to 0\), the bias increases and the variance shrinks. The MSE is minimized at \[ \alpha = \frac{\mu^2}{\sigma^2 + \mu^2} < 1, \] so, perhaps surprisingly, the minimum-MSE estimator is biased toward 0. We'll see a similar idea in action in Chapter 18.
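We can check this algebra numerically: for fixed \(\mu\) and \(\sigma\), compute the MSE over a grid of \(\alpha\) values and compare the minimizer to the closed form. (A standalone sketch; the values \(\mu = 2\) and \(\sigma = 1\) are arbitrary.)

```r
# Check the MSE-minimizing alpha numerically (mu and sigma chosen arbitrarily)
mu <- 2
sigma <- 1

alpha <- seq(0, 1, by = 0.001)

# MSE = squared bias + variance
mse <- (1 - alpha)^2 * mu^2 + alpha^2 * sigma^2

alpha[which.min(mse)]    # numerical minimizer
mu^2 / (sigma^2 + mu^2)  # closed form: 0.8
```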

Exercise 16.1 (The bias-variance tradeoff) Let's examine the bias and variance in a specific concrete example, and explore how we might tune a model to achieve the right balance.

We’ve previously used the Doppler function, for instance when discussing polynomials (Figure 10.3) and fitting smoothing splines (Example 14.1). Here’s a population where the relationship between \(X\) and \(Y\) is given by the Doppler function:

library(regressinator)

doppler <- function(x) {
  sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
}
doppler_pop <- population(
  x = predictor("runif", min = 0.1, max = 1),
  y = response(
    doppler(x),
    error_scale = 0.2
  )
)

Let’s consider taking a sample of \(N = 50\) observations and fitting a smoothing spline. We’ll set the effective degrees of freedom (edf) for our spline, which is equivalent to setting the penalty parameter \(\lambda\), and try different values to see their bias, variance, and total error. This can be done with smooth.spline() and its df argument; see Example 14.1 for a demonstration of smooth.spline().
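To make the procedure concrete, here is a single sample-and-fit step, reusing doppler_pop from above. (A minimal sketch; sample_x() and sample_y() are regressinator's sampling functions, as in Example 14.1, and df = 10 is an arbitrary choice.)

```r
# Draw one sample of N = 50 observations from the population
samp <- doppler_pop |>
  sample_x(n = 50) |>
  sample_y()

# Fit a smoothing spline with a chosen effective degrees of freedom
fit <- smooth.spline(samp$x, samp$y, df = 10)

# Prediction at x0 = 0.2
predict(fit, x = 0.2)$y
```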

Write a loop to try different values of the effective degrees of freedom, from 5 to 24. For each edf,

  1. Sample \(N = 50\) observations from doppler_pop.
  2. Fit a smoothing spline with the chosen edf.
  3. Use that smoothing spline to obtain a prediction at \(x_0 = 0.2\).
  4. Store the prediction in a data frame, along with the edf used to obtain it.
  5. Repeat steps 1-4 500 times.

You should now have a data frame with \(20 \times 500\) rows, for 20 edfs and 500 simulations each.
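One way to organize the simulation is a data frame with one row per (edf, replicate) pair, filled in by a loop. A skeleton, reusing doppler_pop from above (variable names here are just suggestions):

```r
edfs <- 5:24
n_sim <- 500

# One row per combination of edf and simulation replicate
sims <- expand.grid(edf = edfs, sim = seq_len(n_sim))
sims$pred <- NA_real_

for (i in seq_len(nrow(sims))) {
  samp <- doppler_pop |> sample_x(n = 50) |> sample_y()
  fit <- smooth.spline(samp$x, samp$y, df = sims$edf[i])
  sims$pred[i] <- predict(fit, x = 0.2)$y
}
```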

For each edf, estimate the squared bias \(\left(\E[\hat f(x_0)] - f(x_0)\right)^2\); you can use doppler(0.2) to get \(f(x_0)\). Also, for each edf, estimate the variance \(\E\left[\left(\hat f(x_0) - \E[\hat f(x_0)]\right)^2\right]\).

Make a plot of squared bias versus edf and a plot of variance versus edf.
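With the simulations collected, the squared bias and variance estimates are just group-wise summaries of the predictions. A sketch in base R, assuming your results are in a data frame sims with columns edf and pred (hypothetical names for this illustration):

```r
f_x0 <- doppler(0.2)  # the true f(x0)

# Squared bias: (mean prediction - truth)^2 within each edf
sq_bias <- tapply(sims$pred, sims$edf, function(p) (mean(p) - f_x0)^2)

# Variance: sample variance of predictions within each edf
pred_var <- tapply(sims$pred, sims$edf, var)

edfs <- as.numeric(names(sq_bias))
plot(edfs, sq_bias, type = "b", xlab = "Effective df", ylab = "Squared bias")
plot(edfs, pred_var, type = "b", xlab = "Effective df", ylab = "Variance")
```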

TODO mention Belkin et al. (2019), Hastie et al. (2022)

16.3 The curse of dimensionality

16.4 Optimism of the training error

Hastie, Tibshirani, and Friedman (2009), section 7.4