# Simulate the sampling distribution of estimates from a population

Source:`R/diagnose.R`

`sampling_distribution.Rd`

Repeatedly refits the model to new samples from the population, calculates estimates for each fit, and compiles a data frame of the results.

## Arguments

- fit
A model fit to data, such as by

`lm()`

or`glm()`

, to refit to each sample from the population.- data
Data drawn from a

`population()`

, using`sample_x()`

and possibly`sample_y()`

. The`population()`

specification is used to draw the samples.- fn
Function to call on each new model fit to produce a data frame of estimates. Defaults to

`broom::tidy()`

, which produces a tidy data frame of coefficients, estimates, standard errors, and hypothesis tests.- nsim
Number of simulations to run.

- fixed_x
If

`TRUE`

, the default, the predictor variables are held fixed and only the response variables are redrawn from the population. If`FALSE`

, the predictor and response variables are drawn jointly.- ...
Additional arguments passed to

`fn`

each time it is called.

## Value

Data frame (tibble) of `nsim + 1`

simulation results, formed by
concatenating together the data frames returned by `fn`

. The `.sample`

column identifies which simulated sample each row came from. Rows with
`.sample == 0`

come from the original `fit`

.

## Details

To generate sampling distributions of different quantities, the user can
provide a custom `fn`

. The `fn`

should take a model fit as its argument and
return a data frame. For instance, the data frame might contain one row per
estimated coefficient and include the coefficient and its standard error; or
it might contain only one row of model summary statistics.

## Model limitations

Because this function uses S3 generic methods such as `model.frame()`

,
`simulate()`

, and `update()`

, it can be used with any model fit for which
methods are provided. In base R, this includes `lm()`

and `glm()`

.

The model provided as `fit`

must be fit using the `data`

argument to provide
a data frame. For example:

When simulating new data, this function provides the simulated data as the
`data`

argument and re-fits the model. If you instead refer directly to local
variables in the model formula, this will not work. For example, if you fit a
model this way:

It will not be possible to refit the model using simulated datasets, as that
would require modifying your environment to edit `cars`

.

## See also

`parametric_boot_distribution()`

to simulate draws from a fitted
model, rather than from the population

## Examples

```
pop <- population(
x1 = predictor("rnorm", mean = 4, sd = 10),
x2 = predictor("runif", min = 0, max = 10),
y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 4.0)
)
d <- sample_x(pop, n = 20) |>
sample_y()
fit <- lm(y ~ x1 + x2, data = d)
# using the default fn = broom::tidy(). conf.int argument is passed to
# broom::tidy()
samples <- sampling_distribution(fit, d, conf.int = TRUE)
samples
#> # A tibble: 303 × 8
#> term estimate std.error statistic p.value conf.low conf.high .sample
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -2.49 1.88 -1.32 2.03e- 1 -6.45 1.48 0
#> 2 x1 2.33 0.0688 33.8 4.87e-17 2.18 2.47 0
#> 3 x2 0.112 0.314 0.357 7.25e- 1 -0.551 0.775 0
#> 4 (Intercept) 0.795 2.76 0.288 7.77e- 1 -5.02 6.61 1
#> 5 x1 2.19 0.101 21.7 7.88e-14 1.98 2.40 1
#> 6 x2 -0.165 0.461 -0.357 7.26e- 1 -1.14 0.809 1
#> 7 (Intercept) 0.240 2.99 0.0802 9.37e- 1 -6.07 6.55 2
#> 8 x1 2.14 0.110 19.5 4.45e-13 1.91 2.37 2
#> 9 x2 -0.134 0.501 -0.268 7.92e- 1 -1.19 0.922 2
#> 10 (Intercept) 1.12 2.39 0.468 6.46e- 1 -3.92 6.15 3
#> # ℹ 293 more rows
suppressMessages(library(dplyr))
# the model is correctly specified, so the estimates are unbiased:
samples |>
group_by(term) |>
summarize(mean = mean(estimate),
sd = sd(estimate))
#> # A tibble: 3 × 3
#> term mean sd
#> <chr> <dbl> <dbl>
#> 1 (Intercept) 0.591 2.51
#> 2 x1 2.19 0.0943
#> 3 x2 -0.199 0.411
# instead of coefficients, get the sampling distribution of R^2
rsquared <- function(fit) {
data.frame(r2 = summary(fit)$r.squared)
}
sampling_distribution(fit, d, rsquared, nsim = 10)
#> # A tibble: 11 × 2
#> r2 .sample
#> <dbl> <dbl>
#> 1 0.986 0
#> 2 0.965 1
#> 3 0.975 2
#> 4 0.965 3
#> 5 0.982 4
#> 6 0.959 5
#> 7 0.978 6
#> 8 0.980 7
#> 9 0.963 8
#> 10 0.964 9
#> 11 0.973 10
```