18  Penalized Models

When we were using regression for inference—to learn parameters of specific scientific interest—we wanted our models to be as close to the true population relationship as possible, and we chose unbiased estimators so that our parameter estimates would, on average, equal the true population values. If the whole point of regression is to get a good \(\hat \beta\) that tells us something about the true population relationship, those seem like reasonable goals.

But if our goal is prediction, we might instead be interested in predicting with minimal error. And the bias-variance tradeoff (Section 16.2) says that total error is a combination of bias and variance—and the optimal combination might have nonzero bias. If we can reduce the variance in some meaningful way, perhaps we can get lower error despite using biased estimators for the model.

As a motivating example, consider using the model \[ Y = \beta\T X + e \] to predict \(Y \in \R\) from regressors \(X \in \R^p\). If we obtain a dataset of size \(N\), we know that we can use least squares to estimate \(\hat \beta\) and that it will be unbiased; but we also know that if \(N\) is only slightly larger than \(p\), we can expect \(\var(\hat \beta)\) to be large. (TODO we need to show this somewhere in chapter 5) And if \(\var(\hat \beta)\) is large, then a particular prediction \(\hat \beta\T x_0\) will also have high variance. How can we reduce that variance?

One approach is to eliminate some of the predictors. We could, for instance, use forward stepwise selection.

Definition 18.1 (Forward stepwise selection) Consider predicting \(Y\) from a set of possible predictors. Define the model scope to be the full set of regressors (derived from the predictors) that can be included in the selected model. Then:

  1. Start with an empty model, i.e. one with only an intercept.
  2. For each term in the scope that is not yet in the model, fit a model including that term.
  3. Compare the new model to the previous model using some criterion (such as an \(F\) test, AIC, or Mallows’ \(C_p\)).
  4. Select the term that improves the criterion the most (e.g. improves the AIC the most). Add it to the model.
  5. Return to step 2 and repeat. Stop the process if no term improves the criterion enough.

For example, the model scope could be represented by the formula

Y ~ X1 + poly(X2, 3) + X3 + X4 + X3:X4

and on the first iteration, the model would choose whether to add regressors for X1, poly(X2, 3), X3, or X4. (Often the “hierarchy principle” is followed, where interactions are not considered unless the main effects are already in the model.)
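To make the procedure concrete, here is a minimal sketch of forward selection using base R's step(), assuming a hypothetical data frame dat containing Y, X1, X2, X3, and X4. By default, step() compares models using AIC, and it only considers the interaction term once both of its main effects are in the model.

# Sketch of forward stepwise selection with base R's step(), assuming a
# hypothetical data frame `dat` with columns Y, X1, X2, X3, and X4
empty_fit <- lm(Y ~ 1, data = dat)

forward_fit <- step(
  empty_fit,
  scope = Y ~ X1 + poly(X2, 3) + X3 + X4 + X3:X4,  # the model scope
  direction = "forward"                            # only consider additions
)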

But as we saw in Example 11.3 and Exercise 11.8, procedures like forward stepwise selection have an unfortunate property: the distribution of \(\hat \beta\) is much more complicated than that of a least-squares fit with a fixed set of regressors, because each coefficient is either zero (when that predictor is not included in the model) or nonzero (when it is) depending on the particular sample obtained from the population. In other words, the inclusion or exclusion of a predictor from the model is a discrete event, resulting in a sampling distribution that is a mixture of different distributions for different combinations of included predictors. This increases the variance compared to a single model using only one particular subset of the predictors. (TODO Harrell, section 4.3)

There are other stepwise selection methods: backward stepwise selection, for example, starts with a large model and sequentially considers terms to remove. But these all have similar limitations. They are known as greedy algorithms because they seek an optimal solution (the best model) by making the locally best decision at each step (which variable to add or remove); like many greedy algorithms, they are not guaranteed to find a globally optimal solution.

18.1 Penalization

Penalization, also known as shrinkage, is another approach to reduce variance. In penalized regression, our estimator is penalized: coefficient estimates that are large in magnitude are worse than smaller ones. For instance, in linear regression, our estimator might be \[ \hat \beta = \argmin_\beta \underbrace{\|\Y - \X \beta\|_2^2}_\text{error} + \underbrace{\lambda \|\beta\|}_\text{penalty}, \] where \(\|\beta\|\) is a norm of \(\beta\). Picking different norms will result in different estimators with different useful properties, as we will see below. The parameter \(\lambda\) is a tuning parameter. Much like the penalty parameter in smoothing splines (Theorem 14.1), it controls the amount of penalization, and hence the amount of shrinkage. When \(\lambda = 0\), we have the ordinary least squares fit; when \(\lambda\) is large, \(\hat \beta\) is penalized heavily. This results in \(\hat \beta\) being shrunken toward 0.

Generalized linear models are usually fit by maximum likelihood instead of by least squares (Section 13.2). We can apply a similar penalty to the maximum likelihood problem: \[ \hat \beta = \argmin_\beta \underbrace{-\ell(\beta)}_\text{likelihood} + \underbrace{\lambda \|\beta\|}_\text{penalty}. \]

We can immediately see a few properties that will be shared by penalized regression methods:

  1. The estimator \(\hat \beta\) is biased. It is not the ordinary least-squares estimate.
  2. We will need to choose \(\lambda\). It is a tuning parameter, not something that can be estimated from the data as part of the fit. We will need additional steps, such as cross-validation, to choose the \(\lambda\) that produces the best fit for our purposes.
  3. Scaling matters. In least squares regression, rescaling a column of \(\X\) by a constant will rescale the corresponding entry of \(\beta\), but otherwise does not change the fit (see Exercise 5.1). But here, rescaling a column of \(\X\) by a constant means the corresponding entry of \(\beta\) will be penalized more or less heavily. That may change the optimal solution and the predictions made by the model, so we will need a principled way to choose the “right” scaling of \(\X\).
  4. Calculating \(\hat \beta\) from the optimization problem above may be nontrivial. We’ll see that for some choices of the norm \(\|\beta\|\), a solution can be obtained in closed form, but for others, an algorithm will be required to find the solution.
  5. Estimating \(\var(\hat \beta)\), producing confidence intervals, or conducting hypothesis tests will also be more challenging in penalized methods, though this is usually not our goal anyway. The penalty will change the sampling distribution of \(\hat \beta\), potentially very substantially, depending on our choice of \(\lambda\).

18.1.1 Regressor scaling

As mentioned above, scaling matters in penalized regression. A model can have very different parameters depending on the scaling of its regressors: \[\begin{align*} \text{bill length (mm)} &= 14 + 0.13 \times \text{flipper length (mm)} + e\\ &= 14 + 3.3 \times \text{flipper length (in)} + e \\ &= 14 + 1.3 \times 10^5 \times \text{flipper length (km)} + e. \end{align*}\] Each of these would have very different values of \(\|\beta\|\) despite identical fits. When there are multiple regressors, each would be penalized according to its scale, so some predictors may have coefficients heavily shrunken towards 0 while others have coefficients that are hardly changed from the least-squares fit.

The intercept \(\beta_0\) is typically not penalized (it is left out of the norm \(\|\beta\|\)), since the intercept can be of arbitrary size depending on the range of \(X\) and \(Y\).

When regressors are on the same scale and it makes sense to penalize them equally, we may enter them into the regression as-is. When regressors are on different scales, it is common to standardize them by dividing each by its standard deviation in the training sample, so that each has variance 1. Packages often do this automatically and then transform the estimated coefficients back to the original scale.
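For instance, a minimal sketch of standardizing a hypothetical regressor matrix x by hand, before fitting a penalized model:

# Center each column and divide by its standard deviation, so every
# regressor has mean 0 and variance 1 before being penalized
x_std <- scale(x)

# A coefficient estimated on the standardized scale corresponds to
# beta_original = beta_standardized / sd(original column)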

18.1.2 More complex regressors

In Chapter 10, we discussed allowing complex and nonlinear relationships between predictors and response variables by representing a predictor with multiple regressors. For example, we could allow a polynomial relationship between \(X_1\) and \(Y\) by using the regressors \(\{X_1, X_1^2, X_1^3\}\); or we could use a cubic spline by using the spline basis functions to generate regressors.

TODO discuss how penalized solution depends on choice of polynomial/spline/factor basis

TODO factor standardization issue (Harrell, p. 209)

18.2 Ridge regression

In ridge regression, we use the Euclidean norm (Definition 3.2) for the penalty term: \[ \hat \beta_\text{ridge} = \argmin_\beta \underbrace{\|\Y - \X \beta\|_2^2}_\text{error} + \underbrace{\lambda \|\beta\|_2^2}_\text{penalty}. \] This optimization problem can also be written in a constrained form (TODO ref to optimization book/name for this rewrite?): \[\begin{align*} \hat \beta_\text{ridge} &= \argmin_\beta \|\Y - \X \beta\|_2^2\\ \text{such that } & \|\beta\|_2^2 \leq t, \end{align*}\] where \(t\) is a tuning parameter. There is a one-to-one relationship between values of \(\lambda\) and values of \(t\) that produce identical solutions.

Crucially, this optimization problem has a unique solution even if \(p > N\). The solution is available in closed form: \[ \hat \beta_\text{ridge} = (\X\T\X + \lambda I)^{-1} \X\T \Y, \] where \(I\) is the \(p \times p\) identity matrix. The matrix \(\X\T\X + \lambda I\) is invertible even if \(\X\T\X\) does not have full rank. This is why you hear of ridge regression being used in high-dimensional problems; “high-dimensional” means \(p\) is large relative to \(N\).
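As a quick illustration, the closed form can be computed directly. This sketch assumes the columns of \(\X\) and \(\Y\) have already been centered, so no intercept is needed; note also that glmnet, used below, scales its penalty differently, so its \(\lambda\) values are not directly comparable to this one.

# Ridge coefficients from the closed form. Assumes X and Y are centered
# (no intercept) and X's columns are on comparable scales.
ridge_closed_form <- function(X, Y, lambda) {
  p <- ncol(X)
  # (X'X + lambda I)^{-1} X'Y, computed by solving the linear system
  solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))
}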

Though this closed-form solution is convenient (and it is similar to how we fit smoothing splines in Section 14.1.1), we usually do not know the right \(\lambda\) in advance and instead want to fit the model for many values of \(\lambda\) so we can evaluate which is best. Packages for fitting ridge regression therefore use clever iterative algorithms that compute the entire solution path, meaning they estimate the function \(\hat \beta_\text{ridge}(\lambda)\) over a range of values of \(\lambda\), more efficiently than repeatedly evaluating the closed-form solution.

Example 18.1 (Ridge regression with glmnet) The glmnet package implements ridge, lasso, and elastic net regression. As its name implies, it can fit penalized forms of any generalized linear model. It implements the elastic net penalty (to be discussed in Section 18.4), of which ridge regression is a special case.
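For non-Gaussian responses, glmnet() takes a family argument, so a penalized GLM can be requested the same way; a sketch, assuming a hypothetical regressor matrix x_bin and binary response vector y_bin:

# Sketch: lasso-penalized logistic regression for a binary response,
# assuming hypothetical objects `x_bin` (matrix) and `y_bin` (0/1 vector)
logit_fit <- glmnet(x_bin, y_bin, family = "binomial", alpha = 1)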

Let’s illustrate the value of penalized regression by simulating a sparse population: there are 100 predictor variables, of which only 5 actually have an association with \(Y\). We will take a sample of just 90 observations, which is clearly not enough to fit a full model.

library(regressinator)
library(mvtnorm)

sparse_pop <- population(
  x = predictor("rmvnorm",
                mean = rep(0, 100),
                sigma = diag(100)),
  y = response(4 + 0.2 * x1 - 5 * x2 + x3 + 0.4 * x4 + 8 * x5,
               error_scale = 1)
)

sparse_samp <- sparse_pop |>
  sample_x(n = 90) |>
  sample_y()

We will use the glmnet() function to fit the model. Unlike most model functions we’ve used, glmnet() requires us to provide \(X\) as a matrix and \(Y\) as a vector of responses; we cannot simply provide a data frame and a formula. One way to produce the desired \(X\) matrix is with the R function model.matrix(), though glmnet also provides makeX() to do a similar task.

x <- model.matrix(~ . - 1 - y, data = sparse_samp)

This formula means the matrix should use all variables (.), but not include an intercept (- 1) or our response variable (- y). Now we can load glmnet and fit the model, setting alpha = 0 to get the ridge regression penalty:

library(glmnet)

ridge_fit <- glmnet(x, sparse_samp$y, alpha = 0)

We did not have to standardize x, as glmnet() standardizes each column to have variance 1 automatically. Now ridge_fit represents not one fit but, by default, 100 fits with different penalty values. It has a plot() method that plots the coefficients as a function of the penalty, as shown in Figure 18.1. We can see that as the penalty parameter varies, the estimated coefficients shrink to zero.
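A plot like Figure 18.1 can be produced directly from the fit; by default, the coefficients are plotted against the \(L^1\) norm of \(\hat \beta\):

plot(ridge_fit)  # coefficient paths across the penalty sequence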

Figure 18.1: Solution paths for ridge regression. Each point on the \(X\) axis represents a ridge fit with a particular penalty parameter; here the \(X\) axis is labeled by the value of \(\|\hat \beta\|_1\) of each fit. Each line is one estimate \(\hat \beta_j\).

To pick one fit out of the many, we could use cross-validation, as described in Section 17.3. The cv.glmnet() function will automatically do \(k\)-fold cross-validation and report the penalty parameter with the smallest estimated error. By default, it does 10-fold cross-validation for every value of \(\lambda\), and returns all the estimated errors. It takes the data as arguments, as well as additional arguments passed on to glmnet() (such as alpha) to control the fits.

cv_results <- cv.glmnet(x, sparse_samp$y, alpha = 0)
cv_results$lambda.min
[1] 88.0841

We can see that cross-validation has chosen \(\lambda = 88.1\) as the value that minimizes the error. There is also a plot() method for this object, producing the plot shown in Figure 18.2. Evidently this problem wants minimal regularization.
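A plot like Figure 18.2 comes from the cross-validation object's own plot() method:

plot(cv_results)  # CV error versus log(lambda), with error bars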

Figure 18.2: Ridge regression error (estimated by cross-validation) versus the log of the penalty parameter. The error bars indicate estimated standard errors.

Finally, we can get model coefficients or predictions by using coef() or predict(), as with most other models. These both take an argument s for the value of the penalty parameter for which coefficients or predictions are desired. For example, let’s see the parameter estimates with the largest magnitudes, and see if they match the predictors we know to have nonzero coefficients:

coefs <- coef(ridge_fit, s = cv_results$lambda.min)
coefs[order(abs(coefs[, 1]), decreasing = TRUE)[1:5], ]
(Intercept)          x5          x2         x49         x67 
  4.9204981   0.8413614  -0.5743900   0.2489602   0.2357922 

The intercept is very close to the true population value (4), and the ridge regression has pulled out \(X_2\) and \(X_5\) as two covariates with the largest estimates; but the other predictors, whose associations with \(Y\) have slopes smaller than 0.5, don’t appear in the top predictors. This is likely because, with 95 other predictors that have no association, the noise in their estimates is larger than the magnitude of the true effects for \(X_1\), \(X_3\), and \(X_4\).
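Predictions work the same way; for example, a sketch assuming a hypothetical matrix new_x with the same 100 columns as x:

# Predictions at the cross-validated penalty, for hypothetical new data
# `new_x` (a matrix with the same columns, in the same order, as `x`)
preds <- predict(ridge_fit, newx = new_x, s = cv_results$lambda.min)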

18.3 The lasso

In the lasso, we use the Manhattan norm (Definition 3.3) for the penalty term: \[ \hat \beta_\text{lasso} = \argmin_\beta \underbrace{\|\Y - \X \beta\|_2^2}_\text{error} + \underbrace{\lambda \|\beta\|_1}_\text{penalty}. \] This optimization problem can again be written in a constrained form: \[\begin{align*} \hat \beta_\text{lasso} &= \argmin_\beta \|\Y - \X \beta\|_2^2\\ \text{such that } & \|\beta\|_1 \leq t, \end{align*}\] where \(t\) is a tuning parameter. There is a one-to-one relationship between values of \(\lambda\) and values of \(t\) that produce identical solutions. As with ridge regression, the lasso has a solution even when \(p > N\), provided \(\lambda > 0\), though unlike ridge regression the solution is only guaranteed to be unique under additional conditions on \(\X\).

The lasso’s most useful property is that, depending on the value of \(\lambda\), it forces many entries in \(\hat \beta_\text{lasso}\) to be exactly zero. TODO draw the constrained optimization problem and show why it induces sparsity
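One standard way to see why the \(L^1\) penalty produces exact zeros is to work out the special case of orthonormal regressors, \(\X\T\X = I\), where the least-squares estimate is \(\hat \beta_\text{OLS} = \X\T\Y\) and the lasso problem separates coordinate by coordinate. Its solution is the soft-thresholding rule \[ \hat \beta_{\text{lasso},j} = \operatorname{sign}(\hat \beta_{\text{OLS},j}) \left( |\hat \beta_{\text{OLS},j}| - \frac{\lambda}{2} \right)_+, \] where \((a)_+ = \max(a, 0)\): any coefficient whose least-squares estimate is smaller in magnitude than \(\lambda/2\) is set exactly to zero. Under the same assumption, ridge regression gives \(\hat \beta_{\text{ridge},j} = \hat \beta_{\text{OLS},j} / (1 + \lambda)\), which shrinks every coefficient toward zero but never exactly to zero.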

This property is known as sparsity, and there are many real-world populations where we expect the true \(\beta\) to be sparse. For example, a genomic study might relate the expression levels of thousands of genes to some response (such as a genetic disorder); geneticists often expect only a few of the genes to actually be associated with the response, and the rest to have a true coefficient of 0. The goal of the study is to find the few genes with a nonzero relationship.

Example 18.2 (Lasso with glmnet) Let’s reuse the sparse population from Example 18.1, but this time use the lasso. It should perform well in a situation like this: many of the predictors have true slopes of exactly 0. We again use glmnet() to do the fit, this time with alpha = 1 to select a pure \(L^1\) penalty.

lasso_fit <- glmnet(x, sparse_samp$y, alpha = 1)

Again, the plot() method for glmnet fits will show the solution path as \(\lambda\) varies, as shown in Figure 18.3.

Figure 18.3: Solution paths for the lasso. Each point on the \(X\) axis represents a lasso fit with a particular penalty parameter. Each line is one estimate \(\hat \beta_j\). On the left, the penalty is the strongest, and only a few parameters have nonzero estimates; as the penalty weakens towards the right, more and more parameters are estimated to be nonzero.

And again, we can use cv.glmnet() to choose the optimal \(\lambda\):

cv_results <- cv.glmnet(x, sparse_samp$y, alpha = 1)
cv_results$lambda.min
[1] 0.1699195

But if we plot the cross-validation results, as in Figure 18.4, we can see that many values of \(\lambda\) have very similar estimated errors—and that the errors of this fit are dramatically lower than the errors in Figure 18.2 for ridge regression.

Figure 18.4: Lasso regression error (estimated by cross-validation) versus the log of the penalty parameter. The error bars indicate estimated standard errors.

This time, most of the truly nonzero coefficients are included among the largest coefficients fit by the model:

coefs <- coef(lasso_fit, s = cv_results$lambda.min)
coefs[order(abs(coefs[, 1]), decreasing = TRUE)[1:15], ]
           x5            x2   (Intercept)            x3           x37 
 7.9180150169 -4.8304165826  3.9372057261  0.8415081098 -0.1582125730 
          x34           x41           x60           x89           x86 
 0.1351341594 -0.1048691505 -0.0889643354 -0.0872731290 -0.0252486253 
          x75            x4           x65           x80            x1 
-0.0096401794  0.0080470567 -0.0072618799 -0.0001440157  0.0000000000 

If we’re using the lasso in a population we believe is truly sparse, the choice of nonzero coefficients may be of as much interest as the actual values of the nonzero estimates. Identifying the genes associated with a disease, for example, might be just as medically interesting as knowing the actual numerical association. If that is our goal, we want a procedure with model selection consistency.

Definition 18.2 (Model selection consistency) A model selection procedure is a procedure that produces a set of regressors estimated to have nonzero coefficients, \(\{i : \hat \beta_i \neq 0\}\). Such a procedure is model selection consistent if \[ \Pr\left(\{i : \hat \beta_i \neq 0\} = \{i : \beta_i \neq 0\}\right) \to 1 \] as \(N \to \infty\).
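For the lasso fit in Example 18.2, for instance, we can read off the selected set directly, reusing the coefs object computed above (whose first entry is the intercept):

# Which regressors did the lasso select at the cross-validated penalty?
selected <- which(coefs[-1, 1] != 0)
names(selected)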

It can be shown that, under certain conditions, the lasso is model selection consistent (Zhao and Yu 2006). That is, if one knows the “right” value of \(\lambda\), model selection by the lasso is consistent. In practice, of course, we choose \(\lambda\) through cross-validation, and so this property is not guaranteed. (TODO more refs on this?)

There are many variations of the lasso; for instance, the group lasso allows groups of coefficients to be shrunk together so that either all of a group is zero or none of them are. This is useful when entering multiple regressors derived from the same predictor. See Hastie, Tibshirani, and Wainwright (2015), chapter 4, for more detail.

18.4 The elastic net

The elastic net uses a combination of ridge and lasso penalties: \[ \hat \beta_\text{elastic} = \argmin_\beta \underbrace{\|\Y - \X \beta\|_2^2}_\text{error} + \underbrace{\lambda \left(\frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1\right)}_\text{penalty}. \] We can see from this form that \(\alpha = 0\) produces ridge regression and \(\alpha = 1\) produces the lasso, but choices in between produce a combination.
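In glmnet(), values of alpha strictly between 0 and 1 give elastic net fits. A sketch, reusing x and sparse_samp from Example 18.1 and picking alpha = 0.5 arbitrarily:

# Elastic net: alpha = 0.5 mixes the ridge and lasso penalties.
# (The choice of alpha here is arbitrary; it could also be tuned.)
enet_cv <- cv.glmnet(x, sparse_samp$y, alpha = 0.5)
enet_cv$lambda.min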

TODO balance of sparsity and collinearity

18.5 Exercises

Exercise 18.1 (Choosing the penalty parameter) Suppose you observe \(N\) samples from a population with \(Y \in \R\) and \(X \in \R^p\). You decide a linear model would be appropriate, and that the lasso would be useful because you suspect the problem is sparse. You decide to set the penalty parameter \(\lambda\) and estimate the error of the final model through the following procedure:

  1. Split the data into a training set and test set of equal sizes.
  2. Fit the model to the training set using a certain \(\lambda\). Evaluate its error on the test set.
  3. Repeat step 2 with many values of \(\lambda\), using the same training and test sets each time.
  4. Pick the \(\lambda\) that gives the best test error. Report its test error as our estimate of the generalization error of the fitted model.

Will this test-set error be an accurate estimate of the model’s generalization error on new data? If yes, describe why; if not, suggest an alternate procedure to estimate the generalization error accurately.

Hint: As a simple example, consider what would happen if all values of \(\lambda\) had equal generalization error. Recall that test error calculated from real data is a random quantity.

Exercise 18.2 (Penalized regression with correlated predictors) Consider a sparse population where only two predictors are associated with the response, but they are highly correlated with each other:

library(regressinator)
library(mvtnorm)

# unit variance
sigma <- diag(5)

# but X1 and X2 are correlated
sigma[1, 2] <- 0.75
sigma[2, 1] <- 0.75

sparse_corr_pop <- population(
  x = predictor("rmvnorm",
                mean = rep(0, 5),
                sigma = sigma),
  y = response(
    7 + x1 + x2,
    error_scale = 1
  )
)

  1. Obtain 1000 draws from the sampling distribution of \(\hat \beta_1\) and \(\hat \beta_2\), estimated using ordinary least squares, on samples of size \(N = 50\) from this population. Fit each model including all predictors.

    Make a plot of the joint distribution of \(\hat \beta_1\) and \(\hat \beta_2\), such as a heatmap of 2D density estimate. (See, for instance, ggplot2’s geom_density_2d().)

  2. Repeat this procedure but with ridge regression, with \(\lambda = 1\). Make a similar plot of the joint distribution.

    Note: Unfortunately, the regressinator’s sampling_distribution() does not work with models fit by glmnet(). But you can use replicate() to do this easily, by filling in the missing pieces here:

    replicate(1000, {
      new_samp <- sparse_corr_pop |>
        sample_x(n = 50) |>
        sample_y()
    
      x <- # TODO get X matrix
      y <- # TODO get Y matrix
      fit <- glmnet(TODO, alpha = 0, lambda = TODO)
    
      coefs <- coef(fit)
    
      coefs[2:3] # 1 is the intercept
    })

    This will produce a \(2 \times 1000\) matrix containing each simulation’s values of \(\hat \beta_1\) and \(\hat \beta_2\).

  3. Finally, repeat the procedure with lasso and \(\lambda = 1\). (The true parameter vector is \(\beta = (1, 1, 0, 0, 0)\T\), so \(\|\beta\|_2^2 = 2\) and \(\|\beta\|_1 = 2\). The penalties are hence comparable.) Make a plot of the joint distribution.

  4. Based on your three plots, comment on the bias of lasso and ridge regression. Are they unbiased? How biased are they?

  5. Based on your three plots, comment on the shape of the sampling distributions. The least squares distribution should show a noticeable correlation between \(\hat \beta_1\) and \(\hat \beta_2\), which is what we expect when predictors are collinear; what about lasso and ridge? Explain why the nature of their penalty functions explains their differing behavior with collinear predictors.

This exercise illustrates why ridge regression is popular in settings with highly collinear predictors, and why it can produce estimates with low variance even in those settings. It also illustrates why the elastic net can be useful when a problem is both sparse and has collinear predictors.

Exercise 18.3 (Gene expression and autism) (This problem is based on the Data Analysis Exam given to Ph.D. students in Statistics in 2013.)

The occurrence of autism is believed to be related to gene expression, at least in part. Gene expression refers to the process by which genes are used to synthesize proteins that serve functions in the cell; gene expression levels indicate how much each gene is being expressed. Gene expression is regulated by various genetic and epigenetic mechanisms: people with the same genes may express them in different amounts. If the gene regulation mechanism operates differently, it may affect traits like autism by altering which proteins and products are produced in the brain.

The file genedat-exam.csv on Canvas contains gene expression data from 19 people with autism and 17 people without. This is measured using microarrays, which can measure the expression of many genes at the same time; in this dataset, we have selected 125 genes, but there were nearly 20,000 measured in the original study. Some of the people had expression levels measured in both their frontal cortex and temporal cortex, and these appear as separate rows in the data, making 58 rows in total. We have the following columns:

  • CaseID: ID number for each person
  • Cortexarea: frontal or temporal cortex
  • Disease: autism or control
  • RIN: RNA integrity number, a measure of the quality of the microarray measurement. Above 7 is good quality, and above 5.5 is considered acceptable.
  • Age: Age of the person measured
  • Sex: Sex of the person measured

There then follow 125 columns giving gene expression levels for different genes.

  1. Choose the appropriate type of GLM to use for this data. Why can you not fit it directly with maximum likelihood and no penalty?
  2. Build lasso and ridge GLMs to predict autism status using the gene expression data. Use cross-validation to select the tuning parameters.
  3. Write a function that uses predict() to predict autism status for each observation in a data frame, by predicting autism if the predicted probability is 0.5 or greater.
  4. Use cross-validation to evaluate the accuracy of lasso and ridge. Which performs best?
  5. Describe a more appropriate way to design the analysis so your cross-validated comparison of the two models is unbiased and gives valid error estimates.