3  Matrix and Vector Algebra

See Harville (1997), chapter 3, sections 4.1-4.4, and sections 6.1-6.2.

3.1 Vectors

TODO

3.2 Matrices

TODO

3.3 Norms and distances

Norms are functions that take vectors and produce a number that we can consider the “length” or “size” of the vector; equivalently, we can think of them as giving the distance between the vector and the origin. Norms will be useful in regression in many ways, particularly since our usual definitions of the error of a regression involve the norms of vectors.

Norms can be defined very generally for vector spaces, matrices, and more abstract mathematical objects, but we will only need a definition that works for vectors of real numbers.

Definition 3.1 (Norms) A norm is a real-valued function \(p : \R^n \to \R\) with the following properties:

  1. (Triangle inequality) \(p(x + y) \leq p(x) + p(y)\) for all \(x, y \in \R^n\).
  2. (Absolute homogeneity) \(p(sx) = |s| p(x)\) for all \(x \in \R^n\) and all scalars \(s\).
  3. (Positive definiteness) If \(p(x) = 0\), then \(x = 0\).

Exercise 3.1 (Nonnegativity of norms) Using the properties in Definition 3.1, prove that norms are nonnegative: \(p(x) \geq 0\) for all \(x \in \R^n\).

The properties given in this definition match what we expect of a length or distance measure: if we multiply a vector by a scalar, the norm is multiplied correspondingly; if the length is zero, the vector is the zero vector; and adding two vectors cannot produce a vector longer than the sum of their individual lengths.

We can then define a few norms that satisfy this definition.

Definition 3.2 (Euclidean norm) The Euclidean norm of a vector \(x \in \R^n\), denoted \(\|x\|_2\), is \[ \|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2} = \sqrt{x\T x}. \]

In other words, the Euclidean norm is the ordinary distance measure we know from basic geometry. This is the most common norm we will use, and if we refer to a norm without specifying which one, we usually mean the Euclidean norm. It has the familiar properties you know from geometry, such as the Pythagorean theorem.
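
As a concrete check, here is a brief numerical sketch (using NumPy, an assumption of these examples rather than something the text requires) that computes the Euclidean norm from the sum-of-squares formula, from \(x\T x\), and with NumPy’s built-in norm, and verifies the triangle inequality on one pair of vectors.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, -2.0])

# Euclidean norm from the definition: square root of the sum of squared entries.
norm_from_sum = np.sqrt(np.sum(x**2))

# Equivalently, sqrt(x^T x).
norm_from_inner = np.sqrt(x @ x)

# NumPy's built-in norm defaults to the Euclidean (2-)norm.
norm_builtin = np.linalg.norm(x)

print(norm_from_sum, norm_from_inner, norm_builtin)  # all equal 5.0

# Triangle inequality: ||x + y||_2 <= ||x||_2 + ||y||_2.
print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))  # True
```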

Exercise 3.2 (Pythagorean theorem) Given two vectors \(x\) and \(y\) in \(\R^n\), use Definition 3.2 to show that \[ \|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 \] if and only if \(x\T y = 0\).

Interpret this geometrically. Let the origin, \(x\), and \(y\) form the vertices of a triangle, and indicate what property of the triangle corresponds to \(x\T y = 0\).

But the Euclidean norm is not the only one.

Definition 3.3 (Manhattan norm) The Manhattan norm or taxicab norm of a vector \(x \in \R^n\), denoted \(\|x\|_1\), is \[ \|x\|_1 = \sum_{i=1}^n |x_i|. \]

The Manhattan norm’s name comes from Manhattan’s grid of streets: it measures how far one must travel along each coordinate separately, as if one were constrained to drive on a rectangular grid of streets instead of being able to move directly to the destination.
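
To illustrate the taxicab interpretation, a small sketch (again NumPy, assumed only for illustration): traveling 3 blocks along one coordinate and 4 along the other covers 7 blocks on a grid, even though the straight-line Euclidean distance is only 5.

```python
import numpy as np

x = np.array([3.0, -4.0])

# Manhattan (1-)norm: sum of the absolute values of the entries.
manhattan = np.sum(np.abs(x))                # 7.0
manhattan_builtin = np.linalg.norm(x, ord=1)

# The straight-line (Euclidean) length of the same vector is shorter.
euclidean = np.linalg.norm(x)                # 5.0

print(manhattan, manhattan_builtin, euclidean)
```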

You might wonder why we have strange notation like \(\|x\|_1\) and \(\|x\|_2\); have we numbered all the norms? Not exactly: both the Euclidean and Manhattan norms are simply special cases of a more general norm, the \(p\)-norm.

Definition 3.4 (p-norm) For \(p \geq 1\), the \(p\)-norm of a vector \(x \in \R^n\) is \[ \|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}. \] \(p\)-norms are also sometimes referred to as \(L^p\) norms.

So the Euclidean norm is the \(p\)-norm when \(p = 2\), and the Manhattan norm is the \(p\)-norm when \(p = 1\). Statisticians will often refer to them as the 2-norm and the 1-norm, or the \(L^2\) norm and the \(L^1\) norm. We will not need other \(p\)-norms in this book, but one can define a meaningful norm even in the limit as \(p \to \infty\), namely the maximum norm \(\|x\|_\infty = \max_i |x_i|\), and its properties can sometimes be useful.
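
A quick sketch of the general \(p\)-norm (NumPy’s `np.linalg.norm` takes the value of \(p\) through its `ord` argument) confirms that \(p = 1\) and \(p = 2\) recover the norms above, and that large \(p\) approaches the maximum absolute entry.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

def p_norm(x, p):
    """p-norm computed directly from Definition 3.4."""
    return np.sum(np.abs(x)**p)**(1 / p)

print(p_norm(x, 1), np.linalg.norm(x, ord=1))  # Manhattan norm: 8.0
print(p_norm(x, 2), np.linalg.norm(x, ord=2))  # Euclidean norm: ~5.10
print(p_norm(x, 100))                          # already very close to max |x_i| = 4
print(np.linalg.norm(x, ord=np.inf))           # the limiting maximum norm: 4.0
```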

We can use norms to define distances between vectors.

Definition 3.5 (Distance) Given a norm \(p(x)\) and two vectors \(x,y \in \R^n\), the distance \(\delta(x, y)\) between the vectors is \[ \delta(x, y) = p(x - y). \]

For example, in the Euclidean norm, the distance between two vectors \(x\) and \(y\) is \[ \|x - y\|_2 = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}, \] which is what we expect from basic geometry.
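
In code, a distance in the sense of Definition 3.5 is just the chosen norm applied to the difference of the two vectors; a minimal NumPy sketch for the Euclidean case:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: the 2-norm of the difference x - y.
dist = np.linalg.norm(x - y)               # 5.0
dist_manual = np.sqrt(np.sum((x - y)**2))  # same value, straight from the formula

print(dist, dist_manual)
```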

Finally, there is one additional useful distance measure. If you have a number \(x\) from a normal distribution with standard deviation \(\sigma\), you can get a “distance” between \(x\) and some chosen value \(c\) with \(|x - c| / \sigma\). This is the absolute value of the common \(z\) score, interpreted as the number of standard deviations \(x\) is from \(c\). We can extend this to multivariate data.

Definition 3.6 (Mahalanobis distance) The Mahalanobis distance between two vectors \(x, y \in \R^p\), relative to a distribution with covariance matrix \(\Sigma \in \R^{p \times p}\), is \[ \delta_M(x, y) = \sqrt{(x - y)\T \Sigma^{-1} (x - y)}. \]

When \(p = 1\), this reduces to the absolute \(z\) score above; for \(p > 1\), it scales each dimension according to its variance and accounts for the covariance between dimensions. This is particularly reasonable because the probability density function for the multivariate normal distribution with mean \(\mu \in \R^p\) and covariance \(\Sigma\) is \[ f(x) = \frac{1}{\sqrt{(2 \pi)^p |\Sigma|}} \exp\left(- \frac{(x - \mu)\T \Sigma^{-1} (x - \mu)}{2}\right). \] The argument of the exponential is \(-\delta_M(x, \mu)^2/2\). For multivariate normal data, then, a Mahalanobis distance of 1 is like being 1 standard deviation from the mean, and all points with the same Mahalanobis distance from \(\mu\) have the same density.
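
Here is a minimal sketch of Definition 3.6 in NumPy (the function name and example values are chosen purely for illustration), which also checks that in one dimension the distance reduces to the absolute \(z\) score.

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance between x and y for covariance matrix Sigma."""
    diff = x - y
    # Solve Sigma z = diff instead of forming the inverse explicitly.
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# Two correlated dimensions with different variances.
Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
x = np.array([2.0, 1.0])
mu = np.array([0.0, 0.0])
print(mahalanobis(x, mu, Sigma))

# In one dimension, the distance is |x - c| / sigma, the absolute z score.
sigma = 3.0
print(mahalanobis(np.array([7.0]), np.array([1.0]), np.array([[sigma**2]])))  # 2.0
```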

3.4 Projection

TODO

3.5 Random vectors and matrices

TODO