Produce a lineup for a fitted model

A lineup hides the diagnostics for a fitted model among "null" diagnostics, i.e. the same diagnostics calculated from models fit to data for which all model assumptions hold. For each null diagnostic, model_lineup() simulates new responses from the model, using the covariate values from the original fit and the model's error distribution, link function, and so on. The new response values are therefore generated under ideal conditions: the fitted model is true and all assumptions hold. decrypt() reveals which diagnostics are the true diagnostics.

Usage

model_lineup(fit, fn = augment, nsim = 20, ...)

Arguments

fit: A model fit to data, such as by lm() or glm()
fn: A diagnostic function. The function's first argument should be the fitted model, and it must return a data frame. Defaults to broom::augment(), which produces a data frame containing the original data and additional columns .fitted, .resid, and so on. To see a list of model types supported by broom::augment(), and to find documentation on the columns reported for each type of model, load the broom package and use methods(augment).
nsim: Total number of diagnostics, including the true ones. For example, if nsim = 20, the diagnostics for fit are hidden among 19 null diagnostics.
...: Additional arguments passed to fn each time it is called.

Value

A data frame (tibble) with columns corresponding to the columns returned by fn. The additional column .sample indicates which set of diagnostics each row is from. For instance, if the true data is in position 5, selecting rows with .sample == 5 will retrieve the diagnostics from the original model fit.
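
As an illustration of how the returned data frame might be used, the sketch below draws one residuals-versus-fitted panel per value of .sample and then extracts the rows for a single panel. It assumes that model_lineup() is provided by the regressinator package and that ggplot2 is installed; the choice of panel 3 is arbitrary and purely illustrative.

```r
# Assumption: model_lineup() comes from the regressinator package.
library(regressinator)
library(ggplot2)

fit <- lm(dist ~ speed, data = cars)
lineup <- model_lineup(fit, nsim = 5)

# One panel per set of diagnostics; the true diagnostics are in one of them.
ggplot(lineup, aes(x = .fitted, y = .resid)) +
  geom_point() +
  facet_wrap(vars(.sample)) +
  labs(x = "Fitted value", y = "Residual")

# Retrieve the rows belonging to a single panel (position 3 is arbitrary).
panel3 <- lineup[lineup$.sample == 3, ]
```

Inspecting the panels before evaluating the printed decrypt() call keeps the comparison blind.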

Details

To generate different kinds of diagnostics, the user can provide a custom fn. The fn should take a model fit as its argument and return a data frame. For instance, the data frame might contain one row per observation and include the residuals and fitted values for each observation; or it might be a single row containing a summary statistic or test statistic.
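
For the single-row case, a diagnostic function might look like the sketch below. The function name max_cooks and the choice of statistic are hypothetical, chosen only for illustration, and the example assumes model_lineup() is provided by the regressinator package.

```r
# Assumption: model_lineup() comes from the regressinator package.
library(regressinator)

fit <- lm(dist ~ speed, data = cars)

# Hypothetical diagnostic: a one-row data frame holding the largest
# Cook's distance in the fit.
max_cooks <- function(f) {
  data.frame(max_cooks = max(cooks.distance(f)))
}

# One row for the original fit plus one row per simulated refit.
stat_lineup <- model_lineup(fit, fn = max_cooks, nsim = 5)
```

Each row of the result then carries one statistic and the .sample value identifying which fit it came from.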

fn will be called on the original fit provided. Then parametric_boot_distribution() will be used to simulate data from the model fit nsim - 1 times, refit the model to each simulated dataset, and run fn on each refit model. The null distribution is conditional on X, i.e. the covariates used will be identical, and only the response values will be simulated. The data frames are concatenated with an additional .sample column identifying which fit each row came from.
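
The procedure described above can be approximated in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: diag_fn is a hypothetical stand-in for fn, the true position is stored in a plain variable rather than encrypted, and parametric_boot_distribution() is replaced by direct calls to simulate() and update().

```r
fit <- lm(dist ~ speed, data = cars)
nsim <- 5

# Hypothetical diagnostic function: one row per observation.
diag_fn <- function(f) {
  data.frame(.fitted = fitted(f), .resid = residuals(f))
}

set.seed(42)

# Draw nsim - 1 response vectors from the fitted model. The covariates are
# held fixed; only the response is simulated.
sims <- simulate(fit, nsim = nsim - 1)

# Refit the model to each simulated dataset and compute its diagnostics.
null_diags <- lapply(seq_len(nsim - 1), function(i) {
  sim_data <- cars
  sim_data$dist <- sims[[i]]
  refit <- update(fit, data = sim_data)
  diag_fn(refit)
})

# Insert the true diagnostics at a random position among the null ones.
true_pos <- sample(nsim, 1)
all_diags <- append(null_diags, list(diag_fn(fit)), after = true_pos - 1)

# Concatenate, labelling each set with its position.
lineup <- do.call(
  rbind,
  Map(function(d, i) cbind(d, .sample = i), all_diags, seq_len(nsim))
)
```

Because update() re-evaluates the original call with a new data argument, this sketch only works when the model was fit with the data argument.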

When called, this function will print a message such as decrypt("sD0f gCdC En JP2EdEPn ZY"). Evaluating this call in the R console reveals the location of the true diagnostics among the null diagnostics by producing a string such as "True data in position 5".

Model limitations

Because this function uses S3 generic methods such as simulate() and update(), it can be used with any type of model fit for which these methods are provided. In base R, this includes lm() and glm().

The model provided as fit must be fit using the data argument to provide a data frame. For example:

fit <- lm(dist ~ speed, data = cars)

When simulating new data, this function provides the simulated data as the data argument and re-fits the model. If you instead refer directly to local variables in the model formula, this will not work. For example, suppose you fit a model this way:

# will not work
fit <- lm(cars$dist ~ cars$speed)

The model cannot be refit using simulated datasets, as that would require modifying your environment to edit cars.
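
The difference can be seen with base R alone. In this sketch (variable names are illustrative), both fits are refit via update() with a data frame whose response has been replaced by simulated values. Only the fit that used the data argument picks up the new response.

```r
fit_good <- lm(dist ~ speed, data = cars)
fit_bad <- lm(cars$dist ~ cars$speed)

# Build a dataset with the same covariates and a simulated response.
new_data <- cars
new_data$dist <- simulate(fit_good, nsim = 1, seed = 1)[[1]]

refit_good <- update(fit_good, data = new_data)
refit_bad <- update(fit_bad, data = new_data)

# The formula cars$dist ~ cars$speed still reads from the original cars
# data frame, so the "refit" reproduces the original coefficients.
isTRUE(all.equal(coef(refit_bad), coef(fit_bad)))

# The fit that used data = cars is refit to the simulated response.
isTRUE(all.equal(coef(refit_good), coef(fit_good)))
```

The first comparison shows that the supplied data frame was silently ignored, which is why such a fit cannot be used to generate null diagnostics.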

References

Buja et al. (2009). Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A, 367 (1906), pp. 4361-4383. doi:10.1098/rsta.2009.0120

Wickham et al. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics, 16 (6), pp. 973-979. doi:10.1109/TVCG.2010.161

Examples

fit <- lm(dist ~ speed, data = cars)
model_lineup(fit, nsim = 5)
#> decrypt("QUg2 qFyF Rx 8tLRyRtx Kn")
#> # A tibble: 250 × 9
#>      dist speed .fitted  .resid   .hat .sigma    .cooksd .std.resid .sample
#>     <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>      <dbl>      <dbl>   <int>
#>  1 -21.4      4   -3.27 -18.1   0.115    14.2 0.118         -1.35         1
#>  2   6.49     4   -3.27   9.76  0.115    14.4 0.0342         0.727        1
#>  3  11.1      7    8.54   2.56  0.0715   14.4 0.00134        0.186        1
#>  4  18.5      7    8.54  10.00  0.0715   14.4 0.0203         0.727        1
#>  5  20.3      8   12.5    7.79  0.0600   14.4 0.0101         0.563        1
#>  6  -4.52     9   16.4  -20.9   0.0499   14.1 0.0594        -1.50         1
#>  7  36.2     10   20.4   15.9   0.0413   14.2 0.0277         1.13         1
#>  8  16.5     10   20.4   -3.82  0.0413   14.4 0.00161       -0.273        1
#>  9  20.6     10   20.4    0.230 0.0413   14.4 0.00000583     0.0165       1
#> 10  26.3     11   24.3    2.01  0.0341   14.4 0.000361       0.143        1
#> # ℹ 240 more rows

resids_vs_speed <- function(f) {
  data.frame(resid = residuals(f),
             speed = model.frame(f)$speed)
}
model_lineup(fit, fn = resids_vs_speed, nsim = 5)
#> decrypt("QUg2 qFyF Rx 8tLRyRtx Kn")
#> # A tibble: 250 × 3
#>      resid speed .sample
#>      <dbl> <dbl>   <int>
#>  1  17.6       4       1
#>  2 -25.0       4       1
#>  3  -1.19      7       1
#>  4   4.21      7       1
#>  5  -6.33      8       1
#>  6  -0.886     9       1
#>  7  14.8      10       1
#>  8  20.2      10       1
#>  9 -11.9      10       1
#> 10   1.52     11       1
#> # ℹ 240 more rows