17  Estimating Error

TODO Test/validation sets, cross-validation, model selection by error metrics

17.1 Shortcut estimators

TODO AIC, Mallows Cp

Hastie, Tibshirani, and Friedman (2009), section 7.5

17.2 Test and validation sets

17.3 Cross-validation

Hastie, Tibshirani, and Friedman (2009), section 7.10

17.3.1 Leave-one-out cross-validation

17.3.2 K-fold cross-validation

Example 17.1 (Cross-validation in R) For easy cross-validation in R, we can use the rsample package again, as we did with bootstrapping. rsample calls K-fold CV “V-fold cross-validation”, but it’s the same thing.

As in Exercise 16.1, let’s imagine fitting a smoothing spline to the Doppler function and estimating its error. We have one sample from the population, doppler_samp, and we want to fit a spline with 7 degrees of freedom:

fit <- smooth.spline(doppler_samp$x, doppler_samp$y, df = 7)

How can we estimate the error? Let’s use vfold_cv() from rsample to get 10 folds:

library(rsample)

folds <- vfold_cv(doppler_samp, v = 10)
head(folds)
# A tibble: 6 × 2
  splits          id    
  <list>          <chr> 
1 <split [90/10]> Fold01
2 <split [90/10]> Fold02
3 <split [90/10]> Fold03
4 <split [90/10]> Fold04
5 <split [90/10]> Fold05
6 <split [90/10]> Fold06

Each row of folds is one train/test split, stored in the splits column. For each split, we can fit our model on the training (“analysis”) data and evaluate it on the test (“assessment”) data:

errors <- sapply(folds$splits, function(split) {
  # Extract the training ("analysis") and test ("assessment") data
  train <- analysis(split)
  test <- assessment(split)

  # Fit the spline to the training data only
  fit <- smooth.spline(train$x, train$y, df = 7)

  # Predict on the held-out test data
  test_predictions <- predict(fit, test$x)$y

  # Mean squared error on this fold
  mean((test_predictions - test$y)^2)
})

mean(errors)
[1] 0.08011286

This estimate is random: the 10 folds are selected by randomly splitting the data, and if we repeated the process, we’d get 10 different folds and 10 different error estimates. We can gauge this variability by examining the spread of the 10 per-fold error estimates we obtained:

errors
 [1] 0.03352007 0.04935135 0.08715105 0.06603658 0.14826196 0.07439575
 [7] 0.10870238 0.09615005 0.07663481 0.06092462

The largest and smallest errors differ by a factor of 4.4. (This is partly because our sample has size \(N = 100\), and so in 10-fold cross-validation, each test set is of size 10; if the data were larger, or we used fewer folds, each error estimate would be the average error on a larger test set.)
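A quick way to summarize this spread is the standard deviation of the per-fold errors; dividing by \(\sqrt{10}\) gives a rough standard error for the overall estimate (rough because the folds share training data, so the per-fold errors are not independent; we return to this in Section 17.3.4). A minimal sketch using the errors vector above:

sd(errors)

# Rough standard error for mean(errors), treating the 10 fold
# errors as independent (they are not, since folds share training data)
sd(errors) / sqrt(length(errors))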

17.3.3 Order of operations in cross-validation

TODO Why you must CV the entire model fit process, including variable selection, not just the final step

17.3.4 Comparing models with cross-validation

TODO discuss getting a “SE” for CV estimates, and using them to compare models

Exercise 17.1 (College Scorecard) Let’s return to the College Scorecard data from Exercise 11.6. We previously used \(F\) tests to determine if polynomial terms were warranted. But suppose we were instead interested in predicting earnings after graduation, and simply wanted the fit that gave the best predictions.

Use 10-fold cross-validation to estimate the squared-error loss of each of your models from Exercise 11.6. Compare the results to what you got using \(F\) tests. Why could the results differ?
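As a starting point, one possible skeleton for the comparison follows. The data frame name scorecard and the columns earnings and sat_avg are hypothetical placeholders; substitute the variables you actually used in Exercise 11.6, and adjust the range of polynomial degrees to match your models:

library(rsample)

folds <- vfold_cv(scorecard, v = 10)  # scorecard is a hypothetical name

# CV estimate of squared-error loss for a polynomial of a given degree
cv_error <- function(degree) {
  errors <- sapply(folds$splits, function(split) {
    train <- analysis(split)
    test <- assessment(split)

    # earnings and sat_avg are placeholder column names
    fit <- lm(earnings ~ poly(sat_avg, degree), data = train)

    mean((predict(fit, newdata = test) - test$earnings)^2)
  })
  mean(errors)
}

sapply(1:4, cv_error)  # compare, e.g., degrees 1 through 4

Using the same folds for every degree makes the comparison fairer, since each model is evaluated on exactly the same train/test splits.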

Exercise 17.2 (Abalone weights) Abalone are a type of sea snail often harvested and sold for food. The data file abalonemt.csv (on Canvas) contains nine variables measured on 4173 abalones.

You might want to predict the weight of the edible part of the abalone (Shucked.weight) before shucking it (removing it from its shell), so you can determine how much to charge for the abalone you are selling. You can easily measure the exterior dimensions without shucking it, so you can use the Diameter, Length, and Height predictors.

It might be reasonable to assume the abalone’s weight is proportional to its volume, so that \[ \text{weight} \propto \text{diameter} \times \text{length} \times \text{height}. \] If so, then \[ \log(\text{weight}) \propto \log(\text{diameter}) + \log(\text{length}) + \log(\text{height}). \] Consider several models to predict shucked weight from dimensions:

  1. A linear model predicting shucked weight from diameter, length, and height.
  2. A linear model predicting \(\log(\text{shucked weight})\) from the log of each dimension.
  3. A smoothing spline model (fit with smooth.spline(), letting it choose the optimal smoothness automatically) predicting shucked weight from the product of diameter, length, and height.

If this model is to be used for pricing, what matters is that it predicts accurately, so customers do not feel ripped off. Use 10-fold cross-validation to estimate the error of each model. Comment on the results. Which model appears best?

When estimating the error of model 2, ensure you measure the squared error for shucked weight, not \(\log(\text{shucked weight})\). You will need to exponentiate its predictions so they are on the same scale as the other models.
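For example, inside one fold (with train and test obtained from analysis() and assessment(), as in Example 17.1), the error of model 2 on the original scale might be computed with a sketch like this, using the column names given above:

fit <- lm(log(Shucked.weight) ~ log(Diameter) + log(Length) + log(Height),
          data = train)

# Predictions are on the log scale, so exponentiate before
# comparing to the observed weights
test_predictions <- exp(predict(fit, newdata = test))

mean((test_predictions - test$Shucked.weight)^2)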

17.4 Metrics for classification

TODO AUC, ROC, F1, etc.

TODO proper scoring

17.5 Model selection process