The Regressomatic 4000

by Alex Reinhart

Once you have fit a regression model, your work is not finished. There are a variety of diagnostic plots you can make to check whether the model fits well. (There are also diagnostic tests, but plots are easier to make and give more useful information.) This page is not a textbook on diagnostics, just a brief introduction to each type of plot.

The diagnostic plots on this page are interactive. You can click and drag data points in the scatterplot to see what happens to the diagnostics, so spend some time playing with the data to see what happens.

Why diagnostics?

There are a million different ways to fit a regression model. You can add quadratic or cubic terms, transform variables, weight data points, or drop outliers. But before deciding how to improve your model, you need to know how well it fits.

The classic example of the need to visualize your model fit is Anscombe's Quartet, four scatterplots of data which have identical means, variances, regression lines and correlation coefficients:

Plot from Wikimedia Commons, by users Schutz and Avenue.

Only the top left plot looks like the regression line really fits the data. Let's use Anscombe's quartet as our examples as we look at regression diagnostics.
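You can verify the quartet's identical summary statistics yourself. Here is a minimal sketch in Python, assuming seaborn is installed (it ships a copy of the quartet as a built-in dataset):

```python
import numpy as np
import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, group in anscombe.groupby("dataset"):
    x, y = group["x"], group["y"]
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    print(f"{name}: mean(y) = {y.mean():.2f}, var(y) = {y.var():.2f}, "
          f"r = {x.corr(y):.3f}, line: y = {slope:.2f} x + {intercept:.2f}")
```

All four datasets print essentially the same numbers; only the plots reveal the difference.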

Residuals

First, a simple regression and ordinary residuals. Click and drag any data point to move it and see the change in the residuals.

[Interactive plot: Regression | Residuals]

Those look pretty reasonable. The residuals are a blob: no obvious structure, spread evenly above and below zero. But suppose instead we try the second dataset from Anscombe's quartet, which is really quadratic. Then the residuals look very fishy indeed:

[Interactive plot: Regression | Residuals]

When the residuals appear curved or otherwise have a strange shape, your model is missing something. Here we can see there's a quadratic trend left over after the linear trend is fit, so we should add a quadratic term.
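The same diagnosis can be run outside the interactive plot. Below is a minimal sketch using statsmodels (my choice of package, not an endorsement; any regression library would do): fit a line to dataset II of the quartet, inspect the residuals, then add the quadratic term.

```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm

data = sns.load_dataset("anscombe").query("dataset == 'II'")
x, y = data["x"], data["y"]

# Fit the plain linear model; the residuals trace out a clear parabola.
linear = sm.OLS(y, sm.add_constant(x)).fit()
print(np.round(linear.resid, 2))

# Adding a quadratic term absorbs the leftover curvature,
# and the residuals become essentially zero.
quadratic = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
print(np.round(quadratic.resid, 2))
```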

Standardized residuals

Next, let's standardize the residuals. We'll return to the first dataset in the quartet. Try fiddling with the data: notice that as one data point is pulled away to become an outlier, the other standardized residuals shrink, since the estimated residual variance grows and every residual is divided by a larger number.

[Interactive plot: Regression | Standardized residuals]
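Concretely, each ordinary residual is divided by an estimate of its standard deviation, which depends on the point's leverage (more on leverage below). A sketch, again assuming statsmodels:

```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm

data = sns.load_dataset("anscombe").query("dataset == 'I'")
results = sm.OLS(data["y"], sm.add_constant(data["x"])).fit()

# Standardized residual: r_i = e_i / sqrt(sigma^2 * (1 - h_ii)),
# where sigma^2 is the estimated residual variance and h_ii the leverage.
h = results.get_influence().hat_matrix_diag
r_manual = results.resid / np.sqrt(results.mse_resid * (1 - h))

# statsmodels computes the same quantity under the name
# "internally studentized residuals":
r_builtin = results.get_influence().resid_studentized_internal
print(np.allclose(r_manual, r_builtin))  # True
```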

Studentized residuals

An alternate form of standardization leads to Studentized residuals. These can be derived directly from the standardized residuals, so they do not show any new information, but their interpretation is clearer.

A Studentized residual can be treated as the test statistic for a t test of the null hypothesis that the prediction for the given data point would not change if that point were left out of the fit. If the point is an outlier, we can expect the test to reject the null.

Here are the Studentized residuals of the same dataset as before, for comparison. The dashed lines indicate the rejection region of the Bonferroni-corrected t test.

[Interactive plot: Regression | Studentized residuals]
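Here's a sketch of that test, run on the third dataset of the quartet (which has an obvious outlier), again assuming statsmodels:

```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from scipy import stats

data = sns.load_dataset("anscombe").query("dataset == 'III'")
results = sm.OLS(data["y"], sm.add_constant(data["x"])).fit()

t_resid = results.get_influence().resid_studentized_external
n = int(results.nobs)
p = int(results.df_model) + 1  # parameters, including the intercept

# Bonferroni correction: we run one test per residual, so divide
# alpha by n. Under the null, each Studentized residual follows a
# t distribution with n - p - 1 degrees of freedom.
alpha = 0.05
cutoff = stats.t.ppf(1 - alpha / (2 * n), df=n - p - 1)
print(np.flatnonzero(np.abs(t_resid) > cutoff))  # flags the outlier
```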

Cook's distance

Next, let's try Cook's distances. The Cook's distance of a data point measures how much the fitted model parameters would change if that point were removed, so it is a very intuitive measure of influence. Notice that if you vertically displace a point near the middle of the line, its distance won't pass 0.6 no matter how hard you try, while a point at either end of the line can easily have a distance over 1. Points in the middle have low leverage, a concept we'll return to below.

[Interactive plot: Regression | Cook's distances]

Cook's distance is useful when there are major outliers or influential points. For example, it highlights the outlying point here, which pulls the regression line disproportionately upwards. The other data points lie almost exactly on a line, so they individually have very little influence on the line.

[Interactive plot: Regression | Cook's distances]

The Cook's distance for the fourth plot in the quartet is even more extreme—it's essentially infinite for the outlying point. All the other points share the same X value, so the outlying point completely determines the regression line, which always passes through it. It's as if we only had two data points.
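Cook's distance can be computed from quantities we've already met. A sketch, using the standard formula and checking it against statsmodels' built-in version:

```python
import numpy as np
import seaborn as sns
import statsmodels.api as sm

data = sns.load_dataset("anscombe").query("dataset == 'III'")
results = sm.OLS(data["y"], sm.add_constant(data["x"])).fit()
infl = results.get_influence()

# D_i = (r_i^2 / p) * h_ii / (1 - h_ii), where r_i is the standardized
# residual, h_ii the leverage, and p the number of parameters.
h = infl.hat_matrix_diag
r = infl.resid_studentized_internal
p = int(results.df_model) + 1

d_manual = r**2 / p * h / (1 - h)
d_builtin, _ = infl.cooks_distance  # statsmodels also returns p-values
print(np.allclose(d_manual, d_builtin))  # True
```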

Leverage

Next, leverage. A point's leverage measures how far its x value lies from the rest of the data, and hence how strongly the point can pull the fitted line toward itself. Honestly, the plot isn't very useful on its own, but leverage is an ingredient in the standardized residuals and Cook's distances above.

[Interactive plot: Regression | Leverage]
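For the record, the leverages are the diagonal entries of the hat matrix H = X(X'X)⁻¹X', the matrix that maps the observed responses to the fitted values. A minimal sketch in plain NumPy:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X' for a design matrix X."""
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# In a simple regression, leverage grows with distance from the mean of x:
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column plus x
print(leverages(X))  # the point at x = 10 gets by far the largest leverage
```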

Quantile-quantile plots

Quantile-quantile plots are a bit difficult to get intuition about. They're designed to check if residuals are normally distributed. (The marginal distribution of the data doesn't matter—we're interested only in the distribution of the residuals.) Of course, once your dataset becomes large enough, normality isn't an important assumption—the linear regression estimators have asymptotically normal sampling distributions regardless, so you can do the standard tests on coefficients.

On the X axis of the Q-Q plot lie the expected values of the order statistics of a normal distribution. On the Y axis are the sorted standardized residuals. Hence the smallest residual is compared against the expected value of the smallest of n normal draws, the second-smallest against the second-smallest, and so on. If the points deviate from a line, the normal distribution is not a good fit for the residuals.
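To make the construction concrete, here is a sketch that builds the plot by hand, using a common plotting-position approximation (Blom's) for the expected normal order statistics. (scipy.stats.probplot or statsmodels' qqplot will do the same job for you.)

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

resid = np.random.default_rng(0).standard_normal(11)  # stand-in for standardized residuals
n = len(resid)

# Approximate expected values of the order statistics of n normal draws:
theoretical = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))

plt.scatter(theoretical, np.sort(resid))
plt.axline((0, 0), slope=1)  # reference line: perfect normality
plt.xlabel("Expected normal order statistics")
plt.ylabel("Sorted standardized residuals")
plt.show()
```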

[Interactive plot: Regression | Q-Q plot]

These residuals look fairly normal; the sample size is small, so they can't be expected to lie perfectly on the line. Try dragging data points around to see what happens.

On the other hand, these residuals are hardly normal:

[Interactive plot: Regression | Q-Q plot]