19  Kernel Regression

Hastie, Tibshirani, and Friedman (2009), chapter 6; 402 lec 1 & 7

Penalized regression allows us to adapt our classic linear regression and generalized linear models to higher-dimensional problems, and to trade off bias and variance to improve our prediction performance.

In a pure prediction problem, another approach might be to start out with a more flexible model from the beginning, instead of adapting a linear model. That might mean using an additive model (Chapter 14), but another option is kernel regression. Kernel regression is also pedagogically useful because it involves a tradeoff between bias and variance that is more direct than in the other methods we’ve considered, and because it will allow us to introduce some of the mathematical tools for studying bias, variance, and convergence that you will use in later courses.

19.1 Kernel smoothing

19.1.1 Basic definition

In Chapter 14, we introduced linear smoothers (Definition 14.1). A linear smoother is a regression estimator such that \(\hat \Y = S \Y\), for some matrix \(S\). We can also write this (by expanding out the matrix multiplication) as \[ \hat Y_i = \sum_{j=1}^N w(x_i, x_j) Y_j, \] where \(w\) is a weight function. This version suggests we can make a new smoother by choosing an appropriate weight function. One reasonable approach would be to choose a weight function that gives high weight when \(x_i \approx x_j\) and less weight otherwise. That leads to kernel regression.

Definition 19.1 (Kernel regression) Consider \(Y = r(X) + e\), where \(r(X)\) is the true unknown regression function. Given a set of training observations \((X_1, Y_1), \dots, (X_N, Y_N)\), kernel regression estimates \[ \hat r(x) = \sum_{i=1}^N w(x, x_i) Y_i, \] where \[ w(x, x_i) = \frac{K\left(\frac{x_i - x}{h}\right)}{\sum_{j=1}^N K\left(\frac{x_j - x}{h}\right)}. \] Here \(h > 0\) is the bandwidth and \(K\) is a kernel function. The kernel function is typically chosen so that \(K(x) \geq 0\) and \[\begin{align*} \int K(x) \dif x &= 1\\ \int x K(x) \dif x &= 0\\ \sigma^2_K = \int x^2 K(x) \dif x &\in (0, \infty). \end{align*}\] The kernel regression estimate \(\hat r(x)\) can be interpreted as a weighted average of the training data, where the kernel function determines the weights. Because of the conditions above, the kernel function can be interpreted as a probability density representing a distribution with mean zero and variance \(\sigma^2_K\).

This kernel regression estimator is sometimes called the Nadaraya–Watson kernel estimator.
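To make Definition 19.1 concrete, here is a minimal hand-written sketch of the Nadaraya–Watson estimator with a Gaussian kernel. The function name nw_smooth and the simulated data are hypothetical, purely for illustration.

# Nadaraya-Watson estimate of r(x0) from training data (x, y),
# with bandwidth h and kernel K (here the Gaussian density).
nw_smooth <- function(x0, x, y, h, K = dnorm) {
  w <- K((x - x0) / h)    # unnormalized kernel weights
  sum(w * y) / sum(w)     # weighted average of the responses
}

# Simulated example: estimate r on a grid of points.
set.seed(17)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
grid <- seq(0, 1, length.out = 200)
r_hat <- sapply(grid, function(x0) nw_smooth(x0, x, y, h = 0.05))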

TODO Why the three conditions are usually chosen (or maybe make exercises)

Example 19.1 (Common kernels) Often-used kernels include: \[\begin{align*} \text{boxcar} && K(x) &= \frac{1}{2} \ind(|x| \leq 1),\\ \text{Gaussian} && K(x) &= \frac{1}{\sqrt{2\pi}} e^{-x^2/2},\\ \text{Epanechnikov} && K(x) &= \frac{3}{4} \left(1 - x^2\right) \ind(|x| \leq 1),\\ \text{tricube}&& K(x) &= \frac{70}{81} \left(1- |x|^3\right)^3 \ind(|x| \leq 1). \end{align*}\] These are shown in Figure 19.1. Notice particularly that all but the Gaussian kernel have bounded support, meaning they are 0 outside a finite interval. We will see later that different kernels have different efficiencies, but the differences are small and commonly used kernels typically produce very similar fits.

Figure 19.1: The most common kernels for kernel regression.
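For experimentation, the kernels in Example 19.1 can be written directly as R functions; this is just a sketch, not code from any package.

# The kernels of Example 19.1, written as R functions (for illustration).
boxcar       <- function(x) 0.5 * (abs(x) <= 1)
gaussian     <- function(x) dnorm(x)
epanechnikov <- function(x) 0.75 * (1 - x^2) * (abs(x) <= 1)
tricube      <- function(x) (70 / 81) * (1 - abs(x)^3)^3 * (abs(x) <= 1)

# Each integrates to 1; for example:
integrate(epanechnikov, lower = -1, upper = 1)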

Exercise 19.1 (Origin of the kernel regression estimator) Show that the estimator \(\hat r(x)\) given in Definition 19.1 is the solution to the following least-squares problem: \[ \hat r(x) = \argmin_{c} \sum_{i=1}^N K\left(\frac{x - x_i}{h}\right) (c - Y_i)^2. \] That is, the kernel regression estimate \(\hat r(x)\) minimizes the weighted sum of squared errors for predicting all \(Y_i\), where the weights are given by the kernel. Observations for which \(K((x-x_i)/h)\) is large are weighted more heavily, so the kernel regression estimate can be seen as minimizing the sum of squared errors for “nearby” observations, for a definition of “nearby” given by the kernel function.
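One way to see this, as a sketch of the argument: the objective is quadratic in \(c\), so setting its derivative with respect to \(c\) to zero gives \[ \sum_{i=1}^N K\left(\frac{x - x_i}{h}\right)(c - Y_i) = 0 \quad\Longrightarrow\quad \hat c = \frac{\sum_{i=1}^N K\left(\frac{x - x_i}{h}\right) Y_i}{\sum_{j=1}^N K\left(\frac{x - x_j}{h}\right)} = \sum_{i=1}^N w(x, x_i) Y_i, \] which matches Definition 19.1.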

19.1.2 Interpreting the kernel and bandwidth

TODO examples, effect of bandwidth

Example 19.2 (Kernel regression for the Doppler function) Let’s return to the Doppler function we’ve used before, most recently in Example 14.1, to see how bandwidth and kernel choices affect fits. The data range from \(x = 0.15\) to \(x = 1\), so it is only reasonable to consider bandwidths that are a small fraction of that range. We can fit a kernel smoother using the np package. By default, it uses a Gaussian kernel, though this can be changed with the ckertype argument to npreg().

library(np)

# Fit Gaussian kernel smoothers at several fixed bandwidths
bws <- c(0.01, 0.05, 0.1)
fits <- lapply(bws, function(bw) {
  npreg(y ~ x, data = doppler_samp, bws = bw)
})
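To draw curves like those in Figure 19.2, each fit can be evaluated on a fine grid of points. This sketch assumes the usual predict() method for npreg fits and a predictor column named x in doppler_samp.

# Evaluate each fit on a grid of x values for plotting (sketch).
grid <- data.frame(x = seq(min(doppler_samp$x), max(doppler_samp$x),
                           length.out = 500))
preds <- lapply(fits, predict, newdata = grid)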

The resulting fits are shown in Figure 19.2. All the fits, even with small bandwidths, tend to “trim the peaks and fill the valleys”: at extrema, the kernel regression averages over points around the peak, which are inevitably less extreme.

Figure 19.2: Gaussian kernel fits to the Doppler function at different bandwidths. Notice how the fits tend to “trim the peaks”, and larger bandwidths heavily oversmooth.

19.1.3 Bandwidth selection

TODO mention adaptive bandwidths

19.1.4 Discrete predictors

As we have defined it, the kernel function \(K\) is appropriate only for continuous real-valued predictors. However, kernel regression can be defined for categorical data as well (Li and Racine 2003). Suppose, for instance, that \(X \in \{0, 1\}\). The estimator is again a weighted average of \(Y\): \[ \hat r(x) = \sum_{i=1}^N \frac{K(x, x_i) Y_i}{\sum_{j=1}^N K(x, x_j)}, \] and the kernel function is quite simple: \[ K(x, x_i) \propto \begin{cases} 1 - \lambda & \text{if $x_i = x$}\\ \lambda & \text{if $x_i \neq x$}. \end{cases} \] Hence if \(\lambda = 0\), we average only over cases with identical values of the discrete predictor; as we increase \(\lambda\), we also average over cases with different values of the predictor.
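A minimal sketch of this estimator for a single binary predictor, with a hypothetical function name; in practice the np package handles factor predictors for us.

# Kernel regression with one binary predictor (sketch, for illustration).
# When lambda = 0, only cases with x == x0 contribute; as lambda grows,
# cases with the other value of the predictor get more weight.
discrete_kernel_smooth <- function(x0, x, y, lambda) {
  w <- ifelse(x == x0, 1 - lambda, lambda)   # discrete kernel weights
  sum(w * y) / sum(w)
}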

19.2 Local linear and local polynomial smoothers

Kernel regression estimates are flexible and adapt to any reasonably smooth regression function, but they do have some important problems. Notably, they suffer from a boundary effect. If the training data \((X_1, Y_1), \dots, (X_N, Y_N)\) are observed on some interval \(X_i \in [a, b]\), then the estimates \(\hat r(a)\) and \(\hat r(b)\), and estimates at nearby points, will be biased.

Example 19.3 (The boundary effect) As a simple example, consider a population where \(X \in [0, 10]\) and its association with \(Y\) is linear.

library(np)

kfit <- npreg(y ~ x, data = linear_samp, bws = 0.5)

In Figure 19.3 we see the data and the kernel fit. Outside the interval the fit quickly becomes flat, because the kernel estimator can only average nearby points, not extrapolate; the slope begins to flatten even just inside the boundaries.

Figure 19.3: Kernel smoother fit to a linear trend, with the dashed line indicating the true population association. Notice that outside the range of the data, the predictions become flat.

The boundary effect arises because the kernel regression only takes averages. But Exercise 19.1 suggests an alternative: if the kernel regression estimator chooses a constant to solve a least-squares problem, why not choose a line to solve the least-squares problem instead?

Definition 19.2 (Local linear regression) At each target point \(x\), the local linear regression estimator solves the weighted least-squares problem \[ (\hat \beta_0(x), \hat \beta_1(x)) = \argmin_{\beta_0, \beta_1} \sum_{i=1}^N K\left(\frac{x - x_i}{h}\right) \left(Y_i - \beta_0 - \beta_1 (x_i - x)\right)^2 \] and sets \(\hat r(x) = \hat \beta_0(x)\), the value of the fitted line at \(x\). That is, rather than minimizing over constants, we minimize over the set of linear functions. The minimization is repeated for each value of \(x\), so \(\beta_0\) and \(\beta_1\) are estimated locally, not globally.
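To see what this means computationally, here is a hand-rolled sketch of local linear regression at a single target point, using weighted least squares; the function name local_linear is hypothetical.

# Local linear regression at a target point x0 (sketch, for illustration).
# Fit a weighted least-squares line, with kernel weights centered at x0;
# its intercept is the fitted value at x0, since the predictor is centered.
local_linear <- function(x0, x, y, h, K = dnorm) {
  w <- K((x - x0) / h)
  fit <- lm(y ~ I(x - x0), weights = w)
  unname(coef(fit)[1])
}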

Example 19.4 (Boundary effect revisited) In Example 19.3 we saw the boundary effect in kernel regression. Let’s repeat the fit, but using local linear regression. npreg() can do local linear regression by specifying regtype = "ll":

ll_fit <- npreg(y ~ x, data = linear_samp,
                bws = 0.5, regtype = "ll")

In Figure 19.4 we see the tradeoff we have made. Within the interval \([0, 10]\), the boundary effect is essentially gone, because the local linear fit can learn the slope even near a boundary. But outside the interval, the fit has high variance, because the slope used to extrapolate beyond the boundary depends on the values of the training points closest to it.

Figure 19.4: Local linear fit to a linear trend, with the dashed line indicating the true population association. Notice that outside the range of the data, the predictions can have unexpected slopes, depending on the values of training observations near the boundaries.

Of course, Definition 19.2 suggests we need not stop at local linear regression: provided our bandwidth is wide enough to ensure there are sufficient observations near any point \(x\), we could fit a polynomial of any degree, not just a line. However, the variance of these fits grows rapidly as the degree increases. Let’s examine the bias and variance of kernel fits in more detail.
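For comparison, base R’s loess() fits local polynomials of degree 1 or 2, using tricube weights over a nearest-neighbor span rather than a fixed bandwidth; a quick sketch using the Doppler data from Example 19.2:

# Local quadratic fit with loess(); span controls the fraction of the
# data used in each local fit, playing the role of the bandwidth.
lo_fit <- loess(y ~ x, data = doppler_samp, degree = 2, span = 0.1)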

19.3 Bias and variance of kernel smoothers

402 lec 7

TODO mention Epanechnikov efficiency proof

19.4 Multivariate kernel smoothers

TODO how does np do factors? “generalized product kernels” https://doi.org/10.1016/S0047-259X(02)00025-8

19.5 Exercises

402 HW1 1, HW 5 1, exercise 17.2,