Berk, R., Buja, A., Brown, L., George, E., Kuchibhotla, A. K., Su, W., & Zhao, L. (2019). Assumption lean regression. The American Statistician. doi:10.1080/00031305.2019.1592781
Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L., & Zhang, K. (2019). Models as approximations I: Consequences illustrated with linear regression. Statistical Science. https://arxiv.org/abs/1404.1578
If nothing is truly linear, what exactly are we estimating with a linear model? These papers develop “assumption-lean” linear regression, which places no parametric form on the joint distribution of the regressors and the response. The target of estimation is still well defined — the best linear approximation to the true response surface — and it still has useful properties, but we must be more careful about interpreting the model and estimating uncertainty.
This is a useful framework: it provides a single explanation for many questions about misspecified models. The American Statistician article is a more accessible version of the lengthier and more mathematically detailed Statistical Science article.
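A small simulation makes the assumption-lean target concrete. The data-generating process below is my own illustration, not one from the papers: the true mean function is quadratic, yet the OLS coefficients converge to a well-defined population quantity, the best linear approximation (here slope Cov(X, X²)/Var(X) = 1 and intercept −1/6 for X uniform on [0, 1]).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0, 1, n)
y = x**2 + rng.normal(0, 0.1, n)  # truly nonlinear mean function

# OLS fit of the misspecified linear model
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Population target: best linear approximation to E[Y|X] = X^2
# slope = Cov(X, X^2)/Var(X) = (1/12)/(1/12) = 1; intercept = 1/3 - 1/2 = -1/6
print(beta)  # ≈ [-0.167, 1.0]
```

No amount of data makes the linear model “true” here, but the estimand it converges to is perfectly well defined.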
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151–169. doi:10.1146/annurev.publhealth.23.100901.140546
How much does the assumption of normally distributed residuals matter in regression? A simulation study, with a clear answer: not much, provided you have a decent sample size.
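A quick sketch of the kind of simulation involved (my own toy setup, not Lumley et al.'s design): generate heavily skewed errors and check the coverage of the usual 95% confidence interval for a regression slope. At a moderate sample size the CLT already does its job.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, true_slope = 100, 2000, 2.0
covered = 0
for _ in range(reps):
    x = rng.uniform(0, 1, n)
    eps = rng.exponential(1.0, n) - 1.0   # skewed, mean-zero errors
    y = 1.0 + true_slope * x + eps
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    covered += abs(beta[1] - true_slope) < 1.96 * se
coverage = covered / reps
print(coverage)  # close to 0.95 despite strongly non-normal errors
```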
See also Experimental design.
Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295–318. doi:10.1214/12-aoas583
Freedman showed that in a randomized experiment, adjusting for pre-treatment covariates with linear regression can harm your treatment effect estimates. Lin shows the problem is not as bad as Freedman suggested, and reviews the behavior of regression in designed experiments: asymptotically, adjusting for covariates (with treatment-covariate interactions) improves precision even under misspecification. I’m unsure what practical implications this has, though, as typical experimental design settings have small sample sizes.
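Lin's interacted estimator can be sketched in a few lines. The simulation below is my own illustration (not Lin's): the outcome is nonlinear in the covariate, so the linear adjustment is misspecified, yet the adjusted estimator's sampling variance is much smaller than that of the simple difference in means.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, tau = 200, 2000, 1.0
unadj, adj = [], []
for _ in range(reps):
    x = rng.normal(0, 1, n)
    t = rng.permutation(np.repeat([0, 1], n // 2))   # complete randomization
    y = tau * t + x + 0.5 * x**2 + rng.normal(0, 0.5, n)  # mean nonlinear in x
    # Unadjusted estimator: difference in means
    unadj.append(y[t == 1].mean() - y[t == 0].mean())
    # Lin's estimator: regress y on t, centered x, and their interaction
    xc = x - x.mean()
    X = np.column_stack([np.ones(n), t, xc, t * xc])
    adj.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(np.var(unadj), np.var(adj))  # adjustment shrinks the sampling variance
```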
Ding, P. (2019). Two paradoxical results in linear models: The variance inflation factor and the analysis of covariance. arXiv. https://arxiv.org/abs/1903.03883
In linear model texts, the variance inflation factor (VIF) measures the increase in the estimation variance of a regression coefficient caused by adding new variables to the model, and it is always at least 1. But in an experiment, ANCOVA increases the efficiency of the treatment effect estimate (otherwise, why bother with covariates?). What’s the deal? Essentially, the VIF applies in settings where all covariates are considered fixed, including the treatment assignment; but when treatment assignment is random, as in an experiment, it is asymptotically uncorrelated with the covariates, so adjusting for them reduces estimation variance rather than inflating it.
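Both halves of the paradox can be seen in one sketch (my own numerical illustration, not Ding's): because the treatment is randomized, its VIF with respect to the covariate is essentially 1, while adding the covariate slashes the residual variance — so the variance of the treatment coefficient goes down, not up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(0, 1, n)
t = rng.permutation(np.repeat([0, 1], n // 2))   # randomized: t independent of x
y = 1.0 * t + x + rng.normal(0, 0.5, n)

# VIF for the treatment coefficient: 1 / (1 - R^2 of t on x)
r2 = np.corrcoef(t, x)[0, 1] ** 2
vif = 1 / (1 - r2)

# Residual variance without and with the covariate
X0 = np.column_stack([np.ones(n), t])
X1 = np.column_stack([np.ones(n), t, x])
s0 = np.var(y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0])
s1 = np.var(y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0])
print(vif, s0, s1)
# vif ≈ 1 (randomization makes t nearly orthogonal to x), while the residual
# variance drops from ≈ 1.25 to ≈ 0.25, so Var(tau_hat) shrinks overall.
```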