by Alex Reinhart
Once you have fit a regression model, your work is not finished. There are a variety of diagnostic plots you can make to check that the model fits well. (There is also a range of diagnostic tests, but diagnostic plots are easy to make and give more useful information.) This page is not a textbook on diagnostics, just a brief introduction to each type of plot.
The diagnostic plots on this page are interactive. You can click and drag data points in the scatterplot to see what happens to the diagnostics, so spend some time playing with the data to see what happens.
There are a million different ways to fit a regression model. You can add square or cubic terms, transform variables, weight data points, or drop outliers. But first you need to know how well your model fits.
The classic example of the need to visualize your model fit is Anscombe's Quartet, four scatterplots of data which have identical means, variances, regression lines and correlation coefficients:
Only the top left plot looks like the regression line really fits the data. Let's use Anscombe's quartet as a source of examples as we look at regression diagnostics.
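To see the "identical summaries" claim in numbers, here is a small sketch that computes the mean, variance, fitted line and correlation for each of the four datasets. The values are the ones published in Anscombe's 1973 paper; the function name `summarize` and the dictionary layout are just my choices for illustration.

```python
import numpy as np

# Anscombe's quartet. Datasets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Summary statistics and least-squares line for one dataset."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    return {
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "slope": slope,
        "intercept": intercept,
        "corr": np.corrcoef(x, y)[0, 1],
    }

summaries = {k: summarize(x, y) for k, (x, y) in quartet.items()}
for k, s in summaries.items():
    print(k, {name: round(float(v), 2) for name, v in s.items()})
```

The four rows agree to about two decimal places, which is exactly why the summaries alone cannot tell you which fit to trust.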
First, a simple regression and ordinary residuals. Click and drag any data point to move it and see the change in the residuals.
Those look pretty reasonable. The residuals are a blob: no obvious structure, spread evenly above and below zero. But suppose instead we try the second dataset from Anscombe's quartet, which is really quadratic. Then the residuals look very fishy indeed:
When the residuals appear curved or otherwise have a strange shape, your model is missing something. Here we can see there's a quadratic trend left over after the linear trend is fit, so we should add a quadratic term.
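As a rough check of that advice, the sketch below fits the second dataset twice, once with a straight line and once with an added quadratic term, and compares the leftover residuals. It uses `np.polyfit` for both fits, which is a convenience on my part rather than anything this page's plots are known to use.

```python
import numpy as np

# Second dataset from Anscombe's quartet.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Ordinary residuals: observed value minus fitted value.
linear_resid = y - np.polyval(np.polyfit(x, y, 1), x)
quad_resid = y - np.polyval(np.polyfit(x, y, 2), x)

# Print residuals in order of x so the curved pattern is visible.
order = np.argsort(x)
print("linear:   ", np.round(linear_resid[order], 2))
print("quadratic:", np.round(quad_resid[order], 4))
```

The linear residuals are negative at both ends and positive in the middle, which is the curve you see in the plot. Once the quadratic term is in the model, almost nothing is left over.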
Next, let's standardize the residuals. We'll return to the first dataset in the quartet. Try fiddling with the data: notice that as one data point is pulled to become an outlier, the other standardized residuals become smaller, since the estimated residual variance they are divided by is increasing.
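Here is a minimal sketch of that effect, assuming the common definition of a standardized residual: the ordinary residual divided by s * sqrt(1 - h), where s is the residual standard error and h is the point's leverage. Textbooks differ on naming here, so treat the definition as an assumption. The 10-unit shift applied to the first point is arbitrary.

```python
import numpy as np

def standardized_residuals(x, y):
    """Residuals divided by s * sqrt(1 - h) for a simple linear regression."""
    X = np.column_stack([np.ones_like(x), x])  # intercept and slope columns
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix X (X'X)^-1 X'.
    h = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)
    n, p = X.shape
    s2 = resid @ resid / (n - p)  # estimated residual variance
    return resid / np.sqrt(s2 * (1 - h))

# First dataset from Anscombe's quartet.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

before = standardized_residuals(x, y)

# Drag the first point 10 units upward to make it an outlier.
y_out = y.copy()
y_out[0] += 10.0
after = standardized_residuals(x, y_out)

others = np.arange(len(x)) != 0
print("mean |r| of other points before:", np.abs(before[others]).mean())
print("mean |r| of other points after: ", np.abs(after[others]).mean())
print("outlier's standardized residual:", after[0])
```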
An alternate form of standardization leads to Studentized residuals. These can be derived directly from the standardized residuals, so they do not show any new information, but their interpretation is clearer.
A Studentized residual can be seen as a t test of the null hypothesis that the prediction for the given data point would not change if the data point were left out of the fit. If the point is an outlier, we can expect the test to reject the null.
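Both views can be checked against each other. The sketch below computes Studentized residuals two ways: with the usual shortcut from the standardized residual r, t = r * sqrt((n - p - 1) / (n - p - r^2)), and by brute force, refitting without each point and forming a t statistic for that point's prediction error. The function names are hypothetical, and the definitions are the standard textbook ones rather than anything confirmed about this page's plots.

```python
import numpy as np

def design(x):
    return np.column_stack([np.ones_like(x), x])

def standardized(x, y):
    X = design(x)
    n, p = X.shape
    h = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    return resid / np.sqrt(s2 * (1 - h))

def studentized_shortcut(x, y):
    """Derive Studentized residuals directly from standardized ones."""
    r = standardized(x, y)
    n, p = len(x), 2
    return r * np.sqrt((n - p - 1) / (n - p - r**2))

def studentized_by_deletion(x, y):
    """Refit without each point, then t-test its prediction error."""
    t = np.empty(len(x))
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        Xk = design(x[keep])
        beta = np.linalg.lstsq(Xk, y[keep], rcond=None)[0]
        resid = y[keep] - Xk @ beta
        s2 = resid @ resid / (Xk.shape[0] - Xk.shape[1])
        xi = np.array([1.0, x[i]])
        # Variance of the error when predicting a new observation at x[i].
        pred_var = s2 * (1 + xi @ np.linalg.inv(Xk.T @ Xk) @ xi)
        t[i] = (y[i] - xi @ beta) / np.sqrt(pred_var)
    return t

# First dataset from Anscombe's quartet.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

shortcut = studentized_shortcut(x, y)
deletion = studentized_by_deletion(x, y)
print(np.round(shortcut, 3))
print(np.round(deletion, 3))
```

The two rows should match, which is the sense in which Studentized residuals carry no new information beyond the standardized ones.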
Here are the Studentized residuals of the same dataset as before, for comparison. The dashed lines indicate the rejection region of the Bonferroni-corrected t test.
Next, let's try Cook's distances. The Cook's distance of a data point measures how much the model parameters would change if that point were removed, so it is a very intuitive measure of influence. Notice that if you vertically displace a point near the middle of the line, the distance won't pass 0.6 no matter how hard you try, while a point at either end of the line can easily have a distance over 1.
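The middle-versus-end behaviour can be reproduced with a short sketch. It assumes the usual closed form for Cook's distance, D = e^2 * h / (p * s^2 * (1 - h)^2), with e the residual, h the leverage, p the number of coefficients and s^2 the residual variance. The 10-unit displacement and the choice of which points to move are arbitrary.

```python
import numpy as np

def cooks_distance(x, y):
    """Cook's distance for each point in a simple linear regression."""
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    h = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)  # leverages
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    return resid**2 * h / (p * s2 * (1 - h) ** 2)

# First dataset from Anscombe's quartet.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

middle = int(np.argmin(np.abs(x - x.mean())))  # point at x = 9
end = int(np.argmax(x))                        # point at x = 14

y_mid = y.copy()
y_mid[middle] += 10.0   # displace a middle point vertically
y_end = y.copy()
y_end[end] += 10.0      # displace an end point by the same amount

d_mid = cooks_distance(x, y_mid)[middle]
d_end = cooks_distance(x, y_end)[end]
print("middle point:", round(float(d_mid), 3))
print("end point:   ", round(float(d_end), 3))
```

The same vertical shift produces a much larger distance at the end of the line, because the end point has more leverage.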
Cook's distance is useful when there are major outliers or influential points. For example, it highlights the outlying point here, which pulls the regression line disproportionately upwards. The other data points lie almost exactly on a line, so they individually have very little influence on the line.
The Cook's distance for the fourth plot in the quartet is even more extreme—it's essentially infinite for the outlying point. All the other points share the same X value, so the outlying point completely determines the regression line, which always passes through it. It's as if we only had two data points.
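A quick way to see why the distance blows up is to compute the leverages for the fourth dataset. This is only a sketch: leverage here means the diagonal of the hat matrix, and the remark about division relies on the usual closed form for Cook's distance, which has (1 - h)^2 in the denominator.

```python
import numpy as np

# Fourth dataset from Anscombe's quartet.
x = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

X = np.column_stack([np.ones_like(x), x])
h = np.sum((X @ np.linalg.inv(X.T @ X)) * X, axis=1)  # leverages
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

outlier = int(np.argmax(x))  # the lone point at x = 19
print("leverages:", np.round(h, 3))
print("outlier residual:", resid[outlier])
```

The outlying point has leverage 1 and a residual of zero (up to rounding), so the closed form would divide zero by zero. That is the numerical face of "the line always passes through it."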
Next, leverage. Actually I'm not sure what it's good for.
Quantile-quantile plots are a bit difficult to get intuition about. They're designed to check if residuals are normally distributed. (The marginal distribution of the data doesn't matter—we're interested only in the distribution of the residuals.) Of course, once your dataset becomes large enough, normality isn't an important assumption—the linear regression estimators have asymptotically normal sampling distributions regardless, so you can do the standard tests on coefficients.
On the X axis of the Q-Q plot lie the expected values of the order statistics of a normal distribution. On the Y axis are the standardized residuals. Hence the smallest residual is compared against its expected value, and so on. If the points deviate from a line, the normal distribution is not a good fit for the residuals.
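The coordinates of a Q-Q plot can be sketched in a few lines. The exact expected values of normal order statistics have no simple formula, so this sketch uses Blom's approximation, the normal quantile at (i - 0.375) / (n + 0.25). That is an assumption on my part, and other plotting positions are also in common use. The input is simulated normal noise standing in for standardized residuals.

```python
from statistics import NormalDist

import numpy as np

def qq_points(std_resid):
    """Return (theoretical, observed) coordinates for a normal Q-Q plot."""
    observed = np.sort(np.asarray(std_resid, dtype=float))
    n = len(observed)
    # Blom's approximation to the expected normal order statistics.
    probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theoretical, observed

# Simulated stand-in for 11 standardized residuals (not real data).
rng = np.random.default_rng(0)
sample = rng.standard_normal(11)

theoretical, observed = qq_points(sample)
for t, o in zip(theoretical, observed):
    print(f"{t:6.2f}  {o:6.2f}")
```

Plotting `observed` against `theoretical` gives the Q-Q plot: the smallest residual is paired with the smallest expected order statistic, and so on up the list.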
This data looks fairly normal—the sample size is small, so it can't be expected to lie perfectly on the line. Try dragging data points around to see what happens.
On the other hand, these residuals are hardly normal: