# 16  Prediction Goals and Prediction Errors

TODO define risk, loss, prediction error, bias-variance tradeoff, optimism, training and test error, etc.

TODO mention prediction intervals somewhere

## 16.1 Risk and loss

Hastie, Tibshirani, and Friedman (2009), sections 7.1-7.2

## 16.2 Bias, variance, and the bias-variance decomposition

Hastie, Tibshirani, and Friedman (2009), section 7.3

Example 16.1 (Estimating the mean) Suppose we observe $$Y \sim \normal(\mu, \sigma^2)$$ and we want to estimate $$\mu$$.

The minimum variance unbiased estimator of $$\mu$$ is $$Y$$. But consider instead the estimator $$\hat \mu = \alpha Y$$, where $$0 \leq \alpha \leq 1$$. Its bias is \begin{align*} \E[\hat \mu] - \mu &= \E[\alpha Y] - \mu\\ &= \alpha \mu - \mu\\ &= (\alpha - 1) \mu, \end{align*} so the estimator is biased whenever $$\alpha \neq 1$$ (and $$\mu \neq 0$$). Its variance is \begin{align*} \var(\hat \mu) &= \var(\alpha Y)\\ &= \alpha^2 \sigma^2. \end{align*} Adding the squared bias and the variance, we obtain $\text{MSE} = (1 - \alpha)^2 \mu^2 + \alpha^2 \sigma^2.$ As $$\alpha \to 1$$, the bias goes to 0 and the variance increases; as $$\alpha \to 0$$, the bias increases and the variance decreases. Differentiating the MSE with respect to $$\alpha$$ and setting the derivative to zero, we find it is minimized at $\alpha = \frac{\mu^2}{\sigma^2 + \mu^2},$ so, perhaps surprisingly, the lowest-MSE estimator is biased toward 0. We'll see a similar idea in action in Chapter 18.
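To see the tradeoff numerically, here is a quick Monte Carlo sketch in R. The particular values of $$\mu$$ and $$\sigma$$ are chosen arbitrarily for illustration:

```r
# Monte Carlo check of Example 16.1: MSE of the shrinkage estimator
# alpha * Y, with mu and sigma chosen arbitrarily for illustration.
set.seed(17)

mu <- 2
sigma <- 1
alphas <- seq(0, 1, by = 0.05)

# Use the same draws for every alpha so the comparison is stable
y <- rnorm(1e5, mean = mu, sd = sigma)
mse <- sapply(alphas, function(alpha) mean((alpha * y - mu)^2))

# The minimizer should be near mu^2 / (sigma^2 + mu^2) = 0.8,
# with lower MSE than the unbiased estimator (alpha = 1)
alphas[which.min(mse)]
```

With these values, the estimated MSE curve bottoms out near $$\alpha = 0.8$$, below the MSE of the unbiased estimator at $$\alpha = 1$$.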

Exercise 16.1 (The bias-variance tradeoff) Let's examine the bias and variance in a concrete example and see how we might tune a model to achieve the right balance.

We’ve previously used the Doppler function, for instance when discussing polynomials (Figure 10.3) and fitting smoothing splines (Example 14.1). Here’s a population where the relationship between $$X$$ and $$Y$$ is given by the Doppler function:

```r
library(regressinator)

doppler <- function(x) {
  sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
}

doppler_pop <- population(
  x = predictor("runif", min = 0.1, max = 1),
  y = response(
    doppler(x),
    error_scale = 0.2
  )
)
```

Let's consider taking a sample of $$N = 50$$ observations and fitting a smoothing spline. We'll set the effective degrees of freedom (edf) for our spline, which is equivalent to setting the penalty parameter $$\lambda$$, and try different values to see their bias, variance, and total error. This can be done with smooth.spline() and its df argument; see Example 14.1 for an example of how smooth.spline() is used.

Write a loop to try different values of the effective degrees of freedom, from 5 to 24. For each edf,

1. Sample $$N = 50$$ observations from doppler_pop.
2. Fit a smoothing spline with the chosen edf.
3. Use that smoothing spline to obtain a prediction at $$x_0 = 0.2$$.
4. Store the prediction in a data frame, along with the edf used to obtain it.
5. Repeat steps 1-4 500 times.

You should now have a data frame with $$20 \times 500$$ rows, for 20 edfs and 500 simulations each.

For each edf, estimate the squared bias $$\left(\E[\hat f(x_0)] - f(x_0)\right)^2$$; you can use doppler(0.2) to get $$f(x_0)$$. Also, for each edf, estimate the variance $$\E\left[\left(\hat f(x_0) - \E[\hat f(x_0)]\right)^2\right]$$.

Make a plot of squared bias versus edf and a plot of variance versus edf.
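If you want to check your work, here is one way the simulation could be structured (a sketch, not the only approach; the variable names are illustrative). Rather than regressinator's sampling functions, this version draws data directly, mimicking doppler_pop: $$x \sim \text{Uniform}(0.1, 1)$$ with Normal errors of standard deviation 0.2.

```r
# Sketch of the bias-variance simulation for the smoothing spline,
# drawing data directly to mimic doppler_pop
doppler <- function(x) {
  sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
}

set.seed(42)
n <- 50
n_sims <- 500
edfs <- 5:24
x0 <- 0.2

# One row per (edf, simulation) combination
results <- expand.grid(edf = edfs, sim = seq_len(n_sims))
results$pred <- NA_real_

for (i in seq_len(nrow(results))) {
  x <- runif(n, min = 0.1, max = 1)
  y <- doppler(x) + rnorm(n, sd = 0.2)
  fit <- smooth.spline(x, y, df = results$edf[i])
  results$pred[i] <- predict(fit, x = x0)$y
}

# Squared bias and variance of the prediction at x0, by edf
f0 <- doppler(x0)
sq_bias <- tapply(results$pred, results$edf, function(p) (mean(p) - f0)^2)
variance <- tapply(results$pred, results$edf, var)
```

Plotting sq_bias and variance against edfs should show the tradeoff: a heavily smoothed spline (low edf) has little variance but cannot track the Doppler function's oscillations near $$x_0 = 0.2$$, while a flexible spline (high edf) tracks them at the cost of higher variance.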

TODO mention Belkin et al. (2019), Hastie et al. (2022)

## 16.4 Optimism of the training error

Hastie, Tibshirani, and Friedman (2009), section 7.4