Simulate the sampling distribution of estimates from a population
Source:R/diagnose.R
sampling_distribution.Rd
Repeatedly refits the model to new samples from the population, calculates estimates for each fit, and compiles a data frame of the results.
Arguments
- fit
A model fit to data, such as by
lm()
orglm()
, to refit to each sample from the population.- data
Data drawn from a
population()
, usingsample_x()
and possiblysample_y()
. Thepopulation()
specification is used to draw the samples.- fn
Function to call on each new model fit to produce a data frame of estimates. Defaults to
broom::tidy()
, which produces a tidy data frame of coefficients, estimates, standard errors, and hypothesis tests.- nsim
Number of simulations to run.
- fixed_x
If
TRUE
, the default, the predictor variables are held fixed and only the response variables are redrawn from the population. IfFALSE
, the predictor and response variables are drawn jointly.- ...
Additional arguments passed to
fn
each time it is called.
Value
Data frame (tibble) of nsim + 1
simulation results, formed by
concatenating together the data frames returned by fn
. The .sample
column identifies which simulated sample each row came from. Rows with
.sample == 0
come from the original fit
.
Details
To generate sampling distributions of different quantities, the user can
provide a custom fn
. The fn
should take a model fit as its argument and
return a data frame. For instance, the data frame might contain one row per
estimated coefficient and include the coefficient and its standard error; or
it might contain only one row of model summary statistics.
Model limitations
Because this function uses S3 generic methods such as model.frame()
,
simulate()
, and update()
, it can be used with any model fit for which
methods are provided. In base R, this includes lm()
and glm()
.
The model provided as fit
must be fit using the data
argument to provide
a data frame. For example:
When simulating new data, this function provides the simulated data as the
data
argument and re-fits the model. If you instead refer directly to local
variables in the model formula, this will not work. For example, if you fit a
model this way:
It will not be possible to refit the model using simulated datasets, as that
would require modifying your environment to edit cars
.
See also
parametric_boot_distribution()
to simulate draws from a fitted
model, rather than from the population
Examples
pop <- population(
x1 = predictor("rnorm", mean = 4, sd = 10),
x2 = predictor("runif", min = 0, max = 10),
y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 4.0)
)
d <- sample_x(pop, n = 20) |>
sample_y()
fit <- lm(y ~ x1 + x2, data = d)
# using the default fn = broom::tidy(). conf.int argument is passed to
# broom::tidy()
samples <- sampling_distribution(fit, d, conf.int = TRUE)
samples
#> # A tibble: 303 × 8
#> term estimate std.error statistic p.value conf.low conf.high .sample
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -2.49 1.88 -1.32 2.03e- 1 -6.45 1.48 0
#> 2 x1 2.33 0.0688 33.8 4.87e-17 2.18 2.47 0
#> 3 x2 0.112 0.314 0.357 7.25e- 1 -0.551 0.775 0
#> 4 (Intercept) 0.795 2.76 0.288 7.77e- 1 -5.02 6.61 1
#> 5 x1 2.19 0.101 21.7 7.88e-14 1.98 2.40 1
#> 6 x2 -0.165 0.461 -0.357 7.26e- 1 -1.14 0.809 1
#> 7 (Intercept) 0.240 2.99 0.0802 9.37e- 1 -6.07 6.55 2
#> 8 x1 2.14 0.110 19.5 4.45e-13 1.91 2.37 2
#> 9 x2 -0.134 0.501 -0.268 7.92e- 1 -1.19 0.922 2
#> 10 (Intercept) 1.12 2.39 0.468 6.46e- 1 -3.92 6.15 3
#> # ℹ 293 more rows
suppressMessages(library(dplyr))
# the model is correctly specified, so the estimates are unbiased:
samples |>
group_by(term) |>
summarize(mean = mean(estimate),
sd = sd(estimate))
#> # A tibble: 3 × 3
#> term mean sd
#> <chr> <dbl> <dbl>
#> 1 (Intercept) 0.591 2.51
#> 2 x1 2.19 0.0943
#> 3 x2 -0.199 0.411
# instead of coefficients, get the sampling distribution of R^2
rsquared <- function(fit) {
data.frame(r2 = summary(fit)$r.squared)
}
sampling_distribution(fit, d, rsquared, nsim = 10)
#> # A tibble: 11 × 2
#> r2 .sample
#> <dbl> <dbl>
#> 1 0.986 0
#> 2 0.965 1
#> 3 0.975 2
#> 4 0.965 3
#> 5 0.982 4
#> 6 0.959 5
#> 7 0.978 6
#> 8 0.980 7
#> 9 0.963 8
#> 10 0.964 9
#> 11 0.973 10