# 6 Linear Models in R

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\var}{Var} \newcommand{\X}{\mathbf{X}} \]

For these examples, we’ll use the `mtcars` data built into R. According to the documentation, the data frame comes from *Motor Trend* magazine in 1974, and contains features of performance and design for 32 cars. Here are the first few rows:

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |

Type `?mtcars` in the R console to see documentation on the meaning of each variable.

We will also be using the broom package to convert model information to tidy (data frame) format.

`library(broom)`

## 6.1 Generic functions in R

This book does not contain an introduction to R, and I won’t describe R syntax here. (For that, see Wickham and Grolemund (2017).) But there is one lesser-known R feature that is essential to understanding how modeling actually works: S3 generics.

The S3 system tries to solve a common problem: it is helpful to have functions that can work on many different types of arguments in consistent ways. It is also useful to be able to extend a function to handle a new argument type without having to modify its source code directly.

Consider a simple example: the `print()` function. You’ve probably used it to print various types of objects in R:

```
foo <- "ducks"
print(foo)
```

`[1] "ducks"`

```
bar <- data.frame(a = 1:2, b = 3:4)
print(bar)
```

```
  a b
1 1 3
2 2 4
```

Suppose I write a function that creates a new type of data: perhaps a new kind of model fit, or maybe it allows me to load social network data, or maybe it generates graphics. How can I make `print()` work on this without having to find the R source code and edit `print()`?

Fortunately, if we examine the source code of `print()`, we see part of the answer:

`print`

```
function (x, ...)
UseMethod("print")
<bytecode: 0x7f77bccc5af0>
<environment: namespace:base>
```

`print()` is a *generic function*. It calls `UseMethod("print")`, and `UseMethod()` is part of the S3 generic system. It looks at the class of the first argument to the function, where the “class” is a specific attribute of the variable:

`class(foo)`

`[1] "character"`

`class(bar)`

`[1] "data.frame"`

It then calls a function `print.[class]()` according to the class. These class-specific `print` functions are called *methods*. For instance, R has a `print.data.frame()` method, and in fact nearly 200 other `print()` methods:

`head(methods(print), n = 10)`

```
 [1] "print.acf"               "print.activeConcordance"
 [3] "print.AES"               "print.all_vars"
 [5] "print.anova"             "print.any_vars"
 [7] "print.aov"               "print.aovlist"
 [9] "print.ar"                "print.Arima"
```

Many functions in R are S3 generic functions, and R provides methods for many of its built-in classes. For regression modeling, this is useful because there are generic methods for getting model coefficients, making confidence intervals, making predictions, and so on, and there are methods for many kinds of regression models. You can easily write code that works with any kind of model fit just by calling generic methods, instead of having one function that works on linear models and a different function that works on logistic regression fits.
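To see dispatch from the other side, here is a minimal sketch of defining a method for your own class. (The class name `duck` and its `sound` field are invented for illustration.)

```r
# Tag an ordinary list with a (made-up) class attribute
d <- structure(list(sound = "quack"), class = "duck")

# Define print.duck(); print() now dispatches here for objects of class "duck"
print.duck <- function(x, ...) {
  cat("A duck that says", x$sound, "\n")
  invisible(x)
}

print(d)
```

Because `print()` calls `UseMethod("print")`, we never had to modify `print()` itself.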

There is often separate documentation for a generic function’s individual methods. For instance, `?residuals` gives the generic documentation for `residuals()`, but `?residuals.lm` tells you specifically what it does for linear model fits.

For more details on the S3 system and how to create your own generic functions, see Wickham (2019) chapter 13.

## 6.2 Fitting

The `lm()` function fits linear models using least squares. It takes two key arguments:

- A *model formula*, which is special syntax for specifying the names of the response and predictor variables, and any transformations, interactions, or basis expansions that might be used.
- A data frame providing the observed data, which must contain columns whose names match the terms in the model formula.

R uses the model formula and data frame to construct the design matrix; you can also use the `model.matrix()` function to get the design matrix R would use for a given model formula and data frame.
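For example, to see the design matrix for a simple two-predictor formula:

```r
# Design matrix: an intercept column plus the disp and hp columns
head(model.matrix(mpg ~ disp + hp, data = mtcars))
```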

The `lm()` function returns a fit object containing the data, fit parameters, estimates, and various other useful pieces. For example,

```
fit <- lm(mpg ~ disp + hp, data = mtcars)
fit
```

```
Call:
lm(formula = mpg ~ disp + hp, data = mtcars)

Coefficients:
(Intercept)         disp           hp
   30.73590     -0.03035     -0.02484
```

The variable `fit` is now an object with class `lm`.

Model formulas place the response variable to the left of the `~` and predictors to the right, separated by `+` signs. Formulas can contain transformations and some useful functions:

- `mpg ~ log(disp) + hp` log-transforms `disp`.
- `mpg ~ I(disp^2) + hp` takes the square of `disp`. `I()` is necessary because the `^` operator has a specific meaning in formulas (below), so `I()` tells R to ignore this and evaluate it as-is.
- `mpg ~ disp + hp - 1` instructs `lm()` not to insert an intercept automatically.
- The `:` operator represents interaction terms, so `mpg ~ disp + hp + disp:hp` represents a model with linear main effects for each predictor and the interaction between them.
- The `*` shorthand includes both main effects and the interaction, so `mpg ~ disp * hp` is equivalent to the previous formula.
- The `^` operator calculates interactions up to a specific degree. For instance, `mpg ~ (disp + hp + wt)^3` includes the three main effects, all two-way interactions, and the three-way interaction.

See `?formula` for more details. We’ll discuss interactions in more detail in Chapter 7.
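The claimed equivalence of `mpg ~ disp * hp` and `mpg ~ disp + hp + disp:hp` is easy to verify:

```r
fit_star  <- lm(mpg ~ disp * hp, data = mtcars)
fit_colon <- lm(mpg ~ disp + hp + disp:hp, data = mtcars)

# Both formulas expand to the same design matrix, so the fits agree
all.equal(coef(fit_star), coef(fit_colon))
```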

**Exercise 6.1 (Model matrices)** The `model.matrix()` function takes two arguments, a formula and a data frame, just like `lm()`, and produces the design matrix \(\X\) corresponding to that model formula. Use this function and the `mtcars` data to examine the model matrices produced by the formulas listed above.

## 6.3 Coefficients and fitted values

The `coef()` generic function fetches model coefficients as a named vector, so we can index it by name:

`coef(fit)`

```
(Intercept)        disp          hp
30.73590425 -0.03034628 -0.02484008
```

`coef(fit)["disp"]`

```
       disp
-0.03034628
```

For a tidier version as a data frame, `tidy()` from the broom package is very useful:

`tidy(fit)`

```
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 30.7       1.33        23.1  3.26e-20
2 disp        -0.0303    0.00740     -4.10 3.06e- 4
3 hp          -0.0248    0.0134      -1.86 7.37e- 2
```

The `vcov()` function extracts estimates of \(\var(\hat \beta)\) as a matrix, assuming the errors are independent and have equal variance (Theorem 5.3):

`vcov(fit)`

```
             (Intercept)          disp            hp
(Intercept)  1.773068357 -0.0011510577 -0.0081943285
disp        -0.001151058  0.0000548319 -0.0000783970
hp          -0.008194329 -0.0000783970  0.0001791716
```
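Taking the square root of the diagonal of this matrix recovers the coefficient standard errors shown in the `tidy()` output above:

```r
fit <- lm(mpg ~ disp + hp, data = mtcars)

# Standard errors are the square roots of the diagonal of vcov(fit)
sqrt(diag(vcov(fit)))
```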

The `fitted()` function gives the fitted values \(\hat Y\) as a vector. If the data had row names, this will be a named vector.

`head(fitted(fit))`

```
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
23.14809 23.14809 25.14838 20.17416
Hornet Sportabout Valiant
15.46423 21.29978
```
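The fitted values are exactly \(\X \hat \beta\), which we can check by hand using the design matrix stored with the fit:

```r
fit <- lm(mpg ~ disp + hp, data = mtcars)
X <- model.matrix(fit)

# Reconstruct the fitted values from the design matrix and coefficients
all.equal(drop(X %*% coef(fit)), fitted(fit))
```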

## 6.4 Residuals and diagnostics

The `residuals()` generic function fetches the residuals from a model fit. It can return different types of residuals, so see `?residuals.lm` for details; the default residuals are simply \(Y - \hat Y\).

`head(residuals(fit))`

```
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
-2.148091 -2.148091 -2.348379 1.225844
Hornet Sportabout Valiant
3.235770 -3.199783
```
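We can confirm that these defaults are just \(Y - \hat Y\):

```r
fit <- lm(mpg ~ disp + hp, data = mtcars)

# Default residuals: observed response minus fitted values
all.equal(residuals(fit), mtcars$mpg - fitted(fit))
```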

There are a variety of similar functions for different diagnostics:

- `rstandard()` for standardized residuals
- `rstudent()` for Studentized residuals
- `cooks.distance()` for Cook’s distances

See Section 5.4 for details on each kind of residual. The `augment()` function from broom can put many of these diagnostics in a single data frame along with the original data. For example, Table 6.1 shows the augmented version of `fit`. Notice the new columns have names beginning with `.` so they cannot accidentally have the same name as a variable in your model. The data frame contains the fitted values, residuals, Cook’s distances, standardized residuals, and several other values; see `?augment.lm` for details.

| .rownames | mpg | disp | hp | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 160 | 110 | 23.1 | -2.1 | 0.0 | 3.2 | 0 | -0.7 |
| Mazda RX4 Wag | 21.0 | 160 | 110 | 23.1 | -2.1 | 0.0 | 3.2 | 0 | -0.7 |
| Datsun 710 | 22.8 | 108 | 93 | 25.1 | -2.3 | 0.1 | 3.1 | 0 | -0.8 |
| Hornet 4 Drive | 21.4 | 258 | 110 | 20.2 | 1.2 | 0.1 | 3.2 | 0 | 0.4 |
| Hornet Sportabout | 18.7 | 360 | 175 | 15.5 | 3.2 | 0.1 | 3.1 | 0 | 1.1 |
| Valiant | 18.1 | 225 | 105 | 21.3 | -3.2 | 0.1 | 3.1 | 0 | -1.1 |

## 6.5 Confidence intervals and model inference

To get 95% confidence intervals for model coefficients, we can use `confint()`, which uses the normality assumption and \(t\) distribution of \(\hat \beta\). This returns a matrix, not a data frame, though it can be indexed by row and column name.

`confint(fit)`

```
                  2.5 %       97.5 %
(Intercept) 28.01254573 33.459262767
disp        -0.04549091 -0.015201645
hp          -0.05221650  0.002536338
```

`confint(fit)["disp", "2.5 %"]`

`[1] -0.04549091`

Optional arguments allow you to specify individual parameters you want intervals for, or to set the confidence level to something other than 95%. Alternatively, broom’s `tidy()` can be asked to include confidence intervals by passing the `conf.int = TRUE` argument, as shown in Table 6.2.

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 30.736 | 1.332 | 23.083 | 0.000 | 28.013 | 33.459 |
| disp | -0.030 | 0.007 | -4.098 | 0.000 | -0.045 | -0.015 |
| hp | -0.025 | 0.013 | -1.856 | 0.074 | -0.052 | 0.003 |
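For instance, the `parm` and `level` arguments to `confint()` select specific coefficients and change the confidence level:

```r
fit <- lm(mpg ~ disp + hp, data = mtcars)

# 99% confidence interval for the disp coefficient only
confint(fit, parm = "disp", level = 0.99)
```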

## 6.6 Predictions

To predict \(\hat Y\) for a new value of \(X\), we use the `predict()` method. It takes a data frame of one or more rows of new observations, whose column names must match the predictors in the original model formula.

```
new_cars <- data.frame(disp = c(2, 400), hp = c(400, 2))
predict(fit, newdata = new_cars)
```

```
1 2
20.73918 18.54771
```

A common mistake is to provide a `newdata` whose columns do not exactly match the original fit. This produces an error. For example, here we’ve misspelled the `disp` column:

```
predict(fit, newdata = data.frame(dips = c(2, 400),
                                  hp = c(400, 2)))
```

`Error in eval(predvars, data, env): object 'disp' not found`

Another common problem is when the original formula given to `lm()` refers to a specific data frame by name. For example, if we fit the model using `lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)`, `predict()` will look for columns named `mtcars$mpg`, `mtcars$disp`, and `mtcars$hp` inside the `newdata` data frame. Using the `data =` argument is always preferable, as it avoids problems like this.
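A sketch of this failure mode: `predict()` cannot find columns named `mtcars$disp` or `mtcars$hp` in `newdata`, so it falls back to the original variables in the enclosing environment and, after a warning about mismatched row counts, returns predictions for all 32 original rows:

```r
bad_fit <- lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)

# newdata is effectively ignored (apart from a warning), because its
# column names do not match the terms in the formula
p <- predict(bad_fit, newdata = data.frame(disp = c(2, 400), hp = c(400, 2)))
length(p)  # 32, not 2
```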

Optionally, we can ask `predict()` for confidence intervals or prediction intervals, using the pointwise estimators described in Section 5.3.2 and Section 5.3.3:

`predict(fit, newdata = new_cars, interval = "confidence")`

```
       fit      lwr      upr
1 20.73918 10.77086 30.70749
2 18.54771 12.25457 24.84085
```

`predict(fit, newdata = new_cars, interval = "prediction")`

```
       fit       lwr      upr
1 20.73918  8.896103 32.58226
2 18.54771  9.575828 27.51960
```