Simulate the sampling distribution of estimates from a population
Source:R/diagnose.R
sampling_distribution.Rd
Repeatedly refits the model to new samples from the population, calculates estimates for each fit, and compiles a data frame of the results.
Arguments
- fit
A model fit to data, such as by
lm()
orglm()
, to refit to each sample from the population.- data
Data drawn from a
population()
, usingsample_x()
and possiblysample_y()
. Thepopulation()
specification is used to draw the samples.- fn
Function to call on each new model fit to produce a data frame of estimates. Defaults to
broom::tidy()
, which produces a tidy data frame of coefficients, estimates, standard errors, and hypothesis tests.- nsim
Number of simulations to run.
- fixed_x
If
TRUE
, the default, the predictor variables are held fixed and only the response variables are redrawn from the population. IfFALSE
, the predictor and response variables are drawn jointly.- ...
Additional arguments passed to
fn
each time it is called.
Value
Data frame (tibble) of nsim + 1
simulation results, formed by
concatenating together the data frames returned by fn
. The .sample
column identifies which simulated sample each row came from. Rows with
.sample == 0
come from the original fit
.
Details
To generate sampling distributions of different quantities, the user can
provide a custom fn
. The fn
should take a model fit as its argument and
return a data frame. For instance, the data frame might contain one row per
estimated coefficient and include the coefficient and its standard error; or
it might contain only one row of model summary statistics.
Refitting is done using the S3 generic update()
, so this function can be
used with any model fit that supports update()
. In base R, this includes
lm()
and glm()
, and many other model fits.
See also
parametric_boot_distribution()
to simulate draws from a fitted
model, rather than from the population
Examples
pop <- population(
x1 = predictor("rnorm", mean = 4, sd = 10),
x2 = predictor("runif", min = 0, max = 10),
y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 4.0)
)
d <- sample_x(pop, n = 20) |>
sample_y()
fit <- lm(y ~ x1 + x2, data = d)
# using the default fn = broom::tidy(). conf.int argument is passed to
# broom::tidy()
samples <- sampling_distribution(fit, d, conf.int = TRUE)
samples
#> # A tibble: 303 × 8
#> term estimate std.error statistic p.value conf.low conf.high .sample
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 1.07 1.52 0.707 4.89e- 1 -2.13 4.27 0
#> 2 x1 2.12 0.0698 30.3 3.02e-16 1.97 2.27 0
#> 3 x2 -0.211 0.231 -0.917 3.72e- 1 -0.698 0.275 0
#> 4 (Intercept) -1.41 1.82 -0.773 4.50e- 1 -5.25 2.44 1
#> 5 x1 2.23 0.0838 26.6 2.66e-15 2.06 2.41 1
#> 6 x2 -0.166 0.277 -0.599 5.57e- 1 -0.750 0.418 1
#> 7 (Intercept) -1.52 1.82 -0.836 4.15e- 1 -5.36 2.32 2
#> 8 x1 2.20 0.0837 26.3 3.32e-15 2.02 2.38 2
#> 9 x2 0.0909 0.276 0.329 7.46e- 1 -0.492 0.674 2
#> 10 (Intercept) 4.73 2.22 2.14 4.75e- 2 0.0589 9.41 3
#> # ℹ 293 more rows
suppressMessages(library(dplyr))
# the model is correctly specified, so the estimates are unbiased:
samples |>
group_by(term) |>
summarize(mean = mean(estimate),
sd = sd(estimate))
#> # A tibble: 3 × 3
#> term mean sd
#> <chr> <dbl> <dbl>
#> 1 (Intercept) 0.479 2.03
#> 2 x1 2.19 0.0995
#> 3 x2 -0.168 0.304
# instead of coefficients, get the sampling distribution of R^2
rsquared <- function(fit) {
data.frame(r2 = summary(fit)$r.squared)
}
sampling_distribution(fit, d, rsquared, nsim = 10)
#> # A tibble: 11 × 2
#> r2 .sample
#> <dbl> <dbl>
#> 1 0.982 0
#> 2 0.975 1
#> 3 0.981 2
#> 4 0.978 3
#> 5 0.961 4
#> 6 0.950 5
#> 7 0.973 6
#> 8 0.979 7
#> 9 0.964 8
#> 10 0.970 9
#> 11 0.958 10