6  Linear Models in R

For these examples, we’ll use the mtcars data built into R. According to the documentation, the data frame comes from Motor Trend magazine in 1974, and contains features of performance and design for 32 cars. Here are the first few rows:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Type ?mtcars in the R console to see documentation on the meaning of each variable.

We will also be using the broom package to convert model information to tidy (data frame) format.


6.1 Generic functions in R

This book does not contain an introduction to R, and I won’t describe R syntax here. (For that, see Wickham and Grolemund (2017).) But there is one lesser-known R feature, one most R users are only vaguely aware of, that is essential to understanding how modeling actually works: S3 generics.

The S3 system tries to solve a common problem: it is helpful to have functions that can work on many different types of arguments in consistent ways. It is also useful to be able to extend a function to handle a new argument type without having to modify its source code directly.

Consider a simple example: the print() function. You’ve probably used it to print various types of objects in R:

foo <- "ducks"
print(foo)
[1] "ducks"

bar <- data.frame(a = 1:2, b = 3:4)
print(bar)
  a b
1 1 3
2 2 4

Suppose I write a function that creates a new type of data: perhaps a new kind of model fit, or maybe it allows me to load social network data, or maybe it generates graphics. How can I make print() work on this without having to find the R source code and edit print()?

Fortunately, if we examine the source code of print(), we see part of the answer:

print
function (x, ...) 
UseMethod("print")
<bytecode: 0x7f77bccc5af0>
<environment: namespace:base>

print() is a generic function. It calls UseMethod("print"), and UseMethod() is part of the S3 generic system. It looks at the class of the first argument to the function, “class” meaning a specific attribute of the variable:

class(foo)
[1] "character"
class(bar)
[1] "data.frame"

It then calls a function print.[class]() according to the class. These class-specific print functions are called methods. For instance, R has a print.data.frame() method, and in fact nearly 200 other print() methods:

head(methods(print), n = 10)
 [1] "print.acf"               "print.activeConcordance"
 [3] "print.AES"               "print.all_vars"         
 [5] "print.anova"             "print.any_vars"         
 [7] "print.aov"               "print.aovlist"          
 [9] "print.ar"                "print.Arima"            
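Returning to the earlier question: to make print() work on a new kind of data, we define a class and write our own method. A minimal sketch, using a hypothetical network class like the social network example above (the class name and constructor are invented for illustration):

```r
# Constructor for a hypothetical "network" class: an ordinary list
# with a class attribute attached.
make_network <- function(n_nodes, n_edges) {
  structure(list(nodes = n_nodes, edges = n_edges), class = "network")
}

# Defining print.network() makes print() dispatch here for network objects.
print.network <- function(x, ...) {
  cat("A network with", x$nodes, "nodes and", x$edges, "edges\n")
  invisible(x)  # print methods conventionally return their argument invisibly
}

net <- make_network(10, 23)
print(net)  # dispatches to print.network()
```

No change to print() itself was needed: UseMethod() finds print.network() by matching the class attribute.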

Many functions in R are S3 generic functions, and R provides methods for many of its built-in classes. For regression modeling, this is useful because there are generic methods for getting model coefficients, making confidence intervals, making predictions, and so on, and there are methods for many kinds of regression models. You can easily write code that works with any kind of model fit just by calling generic methods, instead of having one function that works on linear models and a different function that works on logistic regression fits.
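For example, because coef() is generic, the same loop works unchanged on a least-squares fit and a logistic regression fit; only the fitting function differs. A sketch using mtcars:

```r
# Two different model types, stored together.
fits <- list(
  linear   = lm(mpg ~ disp + hp, data = mtcars),
  logistic = glm(am ~ disp + hp, data = mtcars, family = binomial)
)

# The same generic call works on both fits; S3 dispatch
# picks coef.default() or the appropriate method for each.
for (name in names(fits)) {
  cat(name, "coefficients:\n")
  print(coef(fits[[name]]))
}
```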

Generic functions often have method-specific documentation. For instance, ?residuals gives the generic documentation for residuals(), but ?residuals.lm tells you specifically what it does for linear model fits.

For more details on the S3 system and how to create your own generic functions, see Wickham (2019) chapter 13.

6.2 Fitting

The lm() function fits linear models using least squares. It takes two key arguments:

  • A model formula, which is special syntax for specifying the names of the response and predictor variables, and any transformations, interactions, or basis expansions that might be used.
  • A data frame providing the observed data, which must contain columns whose names match the terms in the model formula.

R uses the model formula and data frame to construct the design matrix; you can also use the model.matrix() function to get the design matrix R would use for a given model formula and data frame.
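For example, a sketch of inspecting the design matrix for a simple two-predictor model:

```r
# model.matrix() expands a formula into the numeric design matrix,
# including the intercept column that lm() adds by default.
X <- model.matrix(mpg ~ disp + hp, data = mtcars)
head(X)
dim(X)  # 32 rows (one per car) by 3 columns (intercept, disp, hp)
```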

The lm() function returns a fit object containing the data, fit parameters, estimates, and various other useful pieces. For example,

fit <- lm(mpg ~ disp + hp, data = mtcars)
fit

Call:
lm(formula = mpg ~ disp + hp, data = mtcars)

Coefficients:
(Intercept)         disp           hp  
   30.73590     -0.03035     -0.02484  

The variable fit is now an object with class lm.

Model formulas place the response variable to the left of the ~ and predictors to the right, separated by + signs. Formulas can contain transformations and some useful functions:

  • mpg ~ log(disp) + hp log-transforms disp.
  • mpg ~ I(disp^2) + hp takes the square of disp. I() is necessary because the ^ operator has a specific meaning in formulas (below), so I() tells R to ignore this and evaluate it as-is.
  • mpg ~ disp + hp - 1 instructs lm() not to insert an intercept automatically.
  • The : operator represents interaction terms, so mpg ~ disp + hp + disp:hp represents a model with linear main effects for each predictor and the interaction between them.
  • The * shorthand includes both main effects and the interaction, so mpg ~ disp * hp is equivalent to the previous formula.
  • The ^ calculates interactions up to a specific degree. For instance, mpg ~ (disp + hp + wt)^3 includes the three main effects, all two-way interactions, and the three-way interaction.

See the help for ?formula for more details. We’ll discuss interactions in more detail in Chapter 7.
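As a quick check of the shorthand above, the * version and the fully spelled-out version fit identical models:

```r
# Two ways to write the same model: * expands to main effects plus interaction.
fit_star  <- lm(mpg ~ disp * hp, data = mtcars)
fit_colon <- lm(mpg ~ disp + hp + disp:hp, data = mtcars)

# Both formulas expand to the same terms, so the coefficients match.
all.equal(coef(fit_star), coef(fit_colon))  # TRUE
```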

Exercise 6.1 (Model matrices) The model.matrix() function takes two arguments, a formula and a data frame, just like lm(), and produces the design matrix \(\X\) corresponding to that model formula. Use this function and the mtcars data to examine the model matrices produced by the formulas listed above.

6.3 Coefficients and fitted values

The coef() generic function fetches model coefficients as a named vector, so we can index it by name:

coef(fit)
(Intercept)        disp          hp 
30.73590425 -0.03034628 -0.02484008 
coef(fit)["disp"]
       disp 
-0.03034628 

For a tidier version as a data frame, tidy() from the broom package is very useful:

library(broom)
tidy(fit)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  30.7      1.33        23.1  3.26e-20
2 disp         -0.0303   0.00740     -4.10 3.06e- 4
3 hp           -0.0248   0.0134      -1.86 7.37e- 2

The vcov() function extracts estimates of \(\var(\hat \beta)\) as a matrix, assuming the errors are independent and have equal variance (Theorem 5.3):

vcov(fit)
             (Intercept)          disp            hp
(Intercept)  1.773068357 -0.0011510577 -0.0081943285
disp        -0.001151058  0.0000548319 -0.0000783970
hp          -0.008194329 -0.0000783970  0.0001791716

The fitted() function gives the fitted values \(\hat Y\) as a vector. If the data has row names, this will be a named vector:

head(fitted(fit))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
         23.14809          23.14809          25.14838          20.17416 
Hornet Sportabout           Valiant 
         15.46423          21.29978 

6.4 Residuals and diagnostics

The residuals() generic function fetches the residuals from a model fit. It can return different types of residuals, so see ?residuals.lm for details; the default residuals are simply \(Y - \hat Y\).

head(residuals(fit))
        Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
        -2.148091         -2.148091         -2.348379          1.225844 
Hornet Sportabout           Valiant 
         3.235770         -3.199783 

There are a variety of similar functions for different diagnostics:

  • rstandard() for standardized residuals
  • rstudent() for Studentized residuals
  • cooks.distance() for Cook’s distances

See Section 5.4 for details on each kind of residual. The augment() function from broom can put many of these diagnostics in a single data frame along with the original data. For example, Table 6.1 shows the augmented version of fit. Notice the new columns have names beginning with . so they cannot accidentally have the same name as a variable in your model. The data frame contains the fitted values, residuals, Cook’s distances, standardized residuals, and several other values; see ?augment.lm for details.
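These diagnostic functions return named vectors, so they combine easily with base R operations. For instance, a sketch that flags cars with unusually large Cook's distance, using the common (but not universal) 4/n rule of thumb:

```r
# Cook's distances, named by row (car) when the data has row names.
d <- cooks.distance(fit)

# Flag observations exceeding 4/n, one conventional cutoff for
# points that deserve a closer look.
names(d)[d > 4 / nobs(fit)]
```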

Table 6.1: The first few rows of augment(fit), showing the extra columns added by broom::augment().
.rownames          mpg  disp   hp  .fitted  .resid  .hat  .sigma  .cooksd  .std.resid
Mazda RX4          21.0   160  110     23.1    -2.1   0.0     3.2        0        -0.7
Mazda RX4 Wag      21.0   160  110     23.1    -2.1   0.0     3.2        0        -0.7
Datsun 710         22.8   108   93     25.1    -2.3   0.1     3.1        0        -0.8
Hornet 4 Drive     21.4   258  110     20.2     1.2   0.1     3.2        0         0.4
Hornet Sportabout  18.7   360  175     15.5     3.2   0.1     3.1        0         1.1
Valiant            18.1   225  105     21.3    -3.2   0.1     3.1        0        -1.1

6.5 Confidence intervals and model inference

To get 95% confidence intervals for model coefficients, we can use confint(), which uses the normality assumption and \(t\) distribution of \(\hat \beta\). This returns a matrix, not a data frame, though it can be indexed by row and column name.

confint(fit)
                  2.5 %       97.5 %
(Intercept) 28.01254573 33.459262767
disp        -0.04549091 -0.015201645
hp          -0.05221650  0.002536338
confint(fit)["disp", "2.5 %"]
[1] -0.04549091

Optional arguments allow you to specify individual parameters you want intervals for, or to set the confidence level to something other than 95%. Alternatively, broom’s tidy() can be asked to include confidence intervals by passing the conf.int = TRUE argument, as shown in Table 6.2.
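For example, confint() takes a parm argument naming the coefficients of interest and a level argument setting the confidence level:

```r
# A 99% confidence interval for the disp coefficient alone.
confint(fit, parm = "disp", level = 0.99)
```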

Table 6.2: The output from tidy(fit, conf.int = TRUE), showing the extra conf.low and conf.high columns.
term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)    30.736      1.332     23.083    0.000    28.013     33.459
disp           -0.030      0.007     -4.098    0.000    -0.045     -0.015
hp             -0.025      0.013     -1.856    0.074    -0.052      0.003

6.6 Predictions

To predict \(\hat Y\) for a new value of \(X\), we use the predict() method. This method takes a data frame of one or more rows of new observations. This data frame must have column names that match the predictors in the original model formula.

new_cars <- data.frame(disp = c(2, 400), hp = c(400, 2))
predict(fit, newdata = new_cars)
       1        2 
20.73918 18.54771 

A common mistake is to provide a newdata data frame whose columns do not exactly match those used in the original fit. This produces an error. For example, here we’ve misspelled the disp column:

predict(fit, newdata = data.frame(dips = c(2, 400),
                                  hp = c(400, 2)))
Error in eval(predvars, data, env): object 'disp' not found

Another common problem is when the original formula given to lm() refers to a specific data frame by name. For example, if we fit the model using lm(mtcars$mpg ~ mtcars$disp + mtcars$hp), predict() will look for columns named mtcars$mpg, mtcars$disp, and mtcars$hp inside the newdata data frame. Using the data = argument is always preferable, as it avoids problems like this.
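A short sketch contrasting the two styles; only the first supports prediction on genuinely new data:

```r
# Preferred: the formula names columns, and data = says where to find them,
# so predict() can look those names up in newdata.
fit_good <- lm(mpg ~ disp + hp, data = mtcars)
predict(fit_good, newdata = data.frame(disp = 200, hp = 150))

# Fragile: the formula hard-codes mtcars, so predict() will look for
# columns literally named mtcars$disp and mtcars$hp in any newdata.
fit_bad <- lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)
```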

Optionally, we can ask predict() for confidence intervals or prediction intervals, using the pointwise estimators described in Section 5.3.2 and Section 5.3.3:

predict(fit, newdata = new_cars, interval = "confidence")
       fit      lwr      upr
1 20.73918 10.77086 30.70749
2 18.54771 12.25457 24.84085
predict(fit, newdata = new_cars, interval = "prediction")
       fit      lwr      upr
1 20.73918 8.896103 32.58226
2 18.54771 9.575828 27.51960