Berk, R., Buja, A., Brown, L., George, E., Kuchibhotla, A. K., Su, W., & Zhao, L. (2019). Assumption lean regression.

*The American Statistician*. doi:10.1080/00031305.2019.1592781

and

Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L., & Zhang, K. (2019). Models as approximations I: Consequences illustrated with linear regression.

*Statistical Science*. https://arxiv.org/abs/1404.1578

If nothing is truly linear, what exactly are we estimating with a linear model? These papers discuss “assumption-lean” linear regression, where we place no particular parametric form on the joint distribution of the regressors and response. We can still define what we are estimating, and it still has useful properties, but we must be more careful about interpreting the model and estimating uncertainty.
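A small numpy simulation (my sketch, not from the papers) illustrates the target of estimation: with a nonlinear truth, OLS converges to the best linear approximation to the response surface under the regressor distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonlinear truth: E[y|x] = x^2 with x ~ Uniform(0, 1). OLS still
# estimates a well-defined population quantity: the best linear
# approximation to the response surface under the distribution of x.
n = 200_000
x = rng.uniform(0, 1, n)
y = x**2 + rng.normal(0, 0.1, n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Solving the population normal equations with E[x] = 1/2,
# E[x^2] = 1/3, E[x^3] = 1/4 gives intercept -1/6 and slope 1.
print(beta)  # ≈ [-0.167, 1.0]
```

Because the “errors” now include approximation error that depends on x, classical standard errors are no longer appropriate; the papers instead discuss sandwich (heteroskedasticity-consistent) estimators and the bootstrap.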

This is a useful framework: it provides a single explanation for many questions about misspecified models. The *American Statistician* article is a more accessible version of the lengthier and more mathematically detailed *Statistical Science* article.

Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health data sets.

*Annual Review of Public Health*, *23*, 151–169. doi:10.1146/annurev.publhealth.23.100901.140546

How much does the assumption of normally distributed residuals matter in regression? A simulation study, with a clear answer: not much, provided you have a decent sample size.
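A minimal version of such a simulation (my own, not the paper’s): draw strongly skewed errors and check the coverage of the usual nominal 95% interval for the slope.

```python
import numpy as np

rng = np.random.default_rng(1)

# Residuals drawn from a highly skewed exponential distribution at
# n = 100: how often does the nominal 95% CI for the slope cover
# the true value of 2?
n, reps, cover = 100, 2000, 0
for _ in range(reps):
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.exponential(1, n) - 1  # centered, skewed errors
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    se = np.sqrt(resid @ resid / (n - 2) * np.linalg.inv(X.T @ X)[1, 1])
    cover += abs(beta[1] - 2) < 1.96 * se
print(cover / reps)  # close to 0.95 despite very non-normal errors
```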

The hierarchy principle, also known as the marginality principle, is the “principle” that one must include main effects in a linear model when including their interactions.

Peixoto, J. L. (1990). A property of well-formulated polynomial regression models.

*The American Statistician*, *44*(1), 26–30. doi:10.1080/00031305.1990.10475687

Formalizes a motivation for the hierarchy principle. Consider a matrix of predictors X and let Z(X) be the model matrix formed by adding polynomial terms, interactions, and so on. Let W be a linear transformation of X of the form W = XA + \mathbf{1} b^T, where A is diagonal and \mathbf{1} is a vector of ones; W hence represents a rescaling (given by A) and shift (given by b) of each column of X. A model that obeys the hierarchy principle is one where the column space of Z(X) is identical to the column space of Z(W), for any such W. In other words, models that do not obey the hierarchy principle change their predictions under rescaling or shifting of the covariates, so changing the units or origins of predictors in X could give different fits. Note this argument also implies one should include interaction terms when adding quadratic terms.
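The invariance is easy to check numerically; a sketch with arbitrary simulated data (my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
y = x + rng.normal(0, 1, 50)

def fitted_values(cols):
    """Least-squares fitted values for the given model columns."""
    X = np.column_stack(cols)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ beta

xs = x + 3.0  # same data, new origin

# Hierarchical model (1, x, x^2): fitted values unchanged by the
# shift, because span{1, x, x^2} = span{1, x+c, (x+c)^2}.
full = fitted_values([np.ones_like(x), x, x**2])
full_shifted = fitted_values([np.ones_like(x), xs, xs**2])
print(np.allclose(full, full_shifted))  # True

# Non-hierarchical model (1, x^2): the fit depends on the origin.
quad = fitted_values([np.ones_like(x), x**2])
quad_shifted = fitted_values([np.ones_like(x), xs**2])
print(np.allclose(quad, quad_shifted))  # False
```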

Nelder, J. A. (1998). The selection of terms in response-surface models—how strong is the weak-heredity principle?

*The American Statistician*, *52*(4), 315–318. doi:10.1080/00031305.1998.10480588

An alternate formulation of the principle, pointing out that omitting main effects (or other terms demanded by the hierarchy principle) is equivalent to enforcing in the model that zeros of specific predictors have special meanings. This is only reasonable if *a priori* knowledge is available to justify it. For instance, “suppose that x_1 is the dose of a drug, and that x_2 is the amount of a harmless synergist added to enhance its effect; suppose also that the response to x_1 is linear for each x_2. Then if the dose is zero, so that the synergist has nothing to act on, the response will be the same (not necessarily zero) and independent of x_2.” Hence omitting the main effect for x_2 would be reasonable.

See also Experimental design.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique.

*The Annals of Applied Statistics*, *7*(1), 295–318. doi:10.1214/12-aoas583

Freedman showed that in a randomized experiment, adjusting for pre-treatment covariates with linear regression can harm your treatment effect estimates. Lin shows the problem is not as bad as Freedman thought, and reviews the behavior of regression in designed experiments. Asymptotically, adjusting for covariates improves precision even under misspecification, though I’m unsure what practical implications this has, as typical experimental design settings have small sample sizes.
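A toy simulation (my sketch, not Lin’s) showing the precision gain from adjustment at a moderate sample size:

```python
import numpy as np

rng = np.random.default_rng(3)

# Randomized experiment with a prognostic covariate: compare the
# sampling variance of the unadjusted difference-in-means estimator
# with the regression-adjusted estimator over many replications.
n, reps = 200, 2000
unadj, adj = [], []
for _ in range(reps):
    z = rng.binomial(1, 0.5, n)      # random treatment assignment
    x = rng.normal(0, 1, n)          # pre-treatment covariate
    y = 1.0 * z + 2.0 * x + rng.normal(0, 1, n)
    unadj.append(y[z == 1].mean() - y[z == 0].mean())
    X = np.column_stack([np.ones(n), z, x])
    adj.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(np.var(unadj), np.var(adj))  # adjustment shrinks the variance
```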

Ding, P. (2019). Two paradoxical results in linear models: The variance inflation factor and the analysis of covariance. arXiv. https://arxiv.org/abs/1903.03883

In linear model texts, the Variance Inflation Factor measures the increase in estimation variance of regression coefficients caused by adding new variables to the model. This increase is always greater than or equal to 1. But in an experiment, ANCOVA *increases* the efficiency of the treatment effect estimate (otherwise, why bother with covariates?). What’s the deal? Essentially, the VIF applies in settings where all covariates are considered fixed, including treatment assignment; but if the treatment assignment is random, as is the case in an experiment, asymptotically the adjustment for covariates reduces estimation variance.
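The VIF side of the paradox is easy to reproduce in numpy (illustrative data, not from the paper): the coefficient variance for x_1 inflates by 1/(1 − R²) once a correlated covariate enters the model.

```python
import numpy as np

rng = np.random.default_rng(4)

# VIF for the coefficient of x1: ratio of its estimation variance
# (up to sigma^2) with a correlated x2 in the model versus alone.
n = 1000
x1 = rng.normal(0, 1, n)
x2 = 0.8 * x1 + 0.6 * rng.normal(0, 1, n)  # corr(x1, x2) ≈ 0.8

X1 = np.column_stack([np.ones(n), x1])
X12 = np.column_stack([np.ones(n), x1, x2])
var_alone = np.linalg.inv(X1.T @ X1)[1, 1]
var_both = np.linalg.inv(X12.T @ X12)[1, 1]
print(var_both / var_alone)  # VIF >= 1; about 1/(1 - 0.8^2) ≈ 2.8
```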

Cook, R. D. (1993). Exploring partial residual plots.

*Technometrics*, *35*(4), 351–362. doi:10.1080/00401706.1993.10485350

Cook, R. D., & Croos-Dabrera, R. (1998). Partial residual plots in generalized linear models.

*Journal of the American Statistical Association*, *93*(442), 730–739. doi:10.1080/01621459.1998.10473725

Partial residual plots give a more interpretable set of regression diagnostic plots for linear models and GLMs, mainly because one can read off the shape of any curvature/nonlinearity directly from the plot. They can also be easily placed on top of effects plots (below).
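For a linear model, the partial residual for covariate j is just the ordinary residual plus that covariate’s fitted component, b_j x_j; a sketch of the construction (my own, with simulated data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Fit y ~ x1 + x2 when the truth is actually quadratic in x2; the
# partial residuals for x2 recover the missed curvature.
n = 500
x1 = rng.normal(0, 1, n)
x2 = rng.uniform(-2, 2, n)
y = x1 + x2**2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Partial residual for x2: residual plus x2's fitted component.
partial = resid + beta[2] * x2
# Plotting `partial` against x2 (e.g. with matplotlib) shows the
# quadratic shape directly; a polynomial fit confirms it:
curve = np.polyfit(x2, partial, 2)
print(curve[0])  # leading coefficient near 1, the true curvature
```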

Incidentally, Gelman & Hill’s *Data Analysis Using Regression and Multilevel/Hierarchical Models* advocates for binned residual plots for GLMs, but I think these can be subsumed by smoothed partial residual plots: the partial residuals allow plotting against each covariate, and smoothing handles the problem of having discrete outcomes and hence limited residual distributions.

Fox, J., & Weisberg, S. (2018). Visualizing fit and lack of fit in complex regression models with predictor effect plots and partial residuals.

*Journal of Statistical Software*, *87*(9). doi:10.18637/jss.v087.i09

Effects plots for ordinary linear regression are trivial, but they become more useful when you have interactions and transformations – or GLMs where the link function makes reasoning about outcomes more difficult. Effects plots make it easy to see how the effect of one variable varies based on its interaction with another, by producing plots conditioning on varying values of interacting variables and marginalizing over the rest.
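A bare-bones numpy sketch of the computation behind such a plot (my own, for a linear model with one interaction): predict over a grid of one covariate while conditioning on a few values of its interaction partner.

```python
import numpy as np

rng = np.random.default_rng(6)

# Model with an interaction: the effect of x1 depends on x2.
n = 400
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1 + x1 + 0.5 * x2 - 1.5 * x1 * x2 + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Effect of x1: predictions over a grid, conditioning on x2 values.
grid = np.linspace(-2, 2, 5)
for x2_val in (-1.0, 0.0, 1.0):
    Xg = np.column_stack([np.ones_like(grid), grid,
                          np.full_like(grid, x2_val), grid * x2_val])
    print(x2_val, np.round(Xg @ beta, 2))
# The slope in x1 flips sign as x2 moves from -1 to 1, which the
# effects plot makes visible at a glance.
```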