Overparametrized Linear Models

Posted on July 09 2017 in Statistics

Surprisingly, quite a few data scientists overlook the importance of linear regression and the problem of overparametrization. In this post, I'm going to describe mathematically what it means for a model to be overparametrized and the general strategies statisticians use to resolve the resulting non-uniqueness.

Ranks, Subspaces, and Bases

Suppose \(X = (x_1, x_2, \dots, x_p)\) is an \(n \times p\) matrix of rank \(m\), with \(m < p\). That is, the space \(\chi\) spanned by all the columns of \(X\) can also be spanned by some subset of \(m\) linearly independent columns. The space \(\chi\) is overparametrized; we don't need all \(p\) parameters to specify vectors in \(\chi\) because there is a set of \(m\) linearly independent columns that spans \(\chi\). For each \(z\) in \(\chi\) there are many different \(b\) in \(\mathbb{R}^p\) for which \(z = Xb\). The non-uniqueness of \(b\) leads to several difficulties when the columns of \(X\) are used as the predictors in a least squares problem.
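
Here is a small numerical sketch in R (with made-up data): a two-group design coded with an intercept and both group indicators has \(p = 3\) columns but rank \(m = 2\), so different coefficient vectors produce exactly the same \(Xb\).

```r
## Hypothetical two-group design: intercept plus an indicator for each group.
## The intercept column equals gA + gB, so p = 3 but the rank is only m = 2.
gA <- c(1, 0, 1, 0)                 # observations 1 and 3 are in group A
gB <- c(0, 1, 0, 1)                 # observations 2 and 4 are in group B
X  <- cbind(int = 1, gA, gB)
qr(X)$rank                          # 2, not 3

## Two different b's give exactly the same z = Xb, because
## c(1, -1, -1) lies in the null space of X.
b1 <- c(1, 2, 0)
b2 <- b1 + c(1, -1, -1)
all.equal(X %*% b1, X %*% b2)       # TRUE
```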

The \(p \times n\) matrix \(X^T = (w_1, w_2, \dots, w_n)\) also has rank \(m\). The subspace \(W\) of \(\mathbb{R}^p\) spanned by all the columns of \(X^T\) can also be spanned by some subset of \(m\) linearly independent columns, which (without loss of generality) we may suppose correspond to the first \(m\) rows of \(X\). Put another way,

\begin{equation} X^T = [W_1 W_2] \end{equation}

where \(W_1\) is a \(p \times m\) matrix and \(W_2\) is a \(p \times (n - m)\) matrix. The linearly independent columns of \(W_1\) form a basis for \(W\) and \(W_2 = W_1 A\) for some \(m \times (n - m)\) matrix \(A\).
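
In the small sketch above, the first \(m = 2\) rows of \(X\) are linearly independent and the remaining rows simply repeat them, so \(A\) comes out as the identity; the computation below (still hypothetical) recovers \(A\) from \(W_1\) and \(W_2\).

```r
## The columns of t(X) are the rows of X.  The first m = 2 rows of X are
## linearly independent and form W1; the remaining n - m = 2 rows form W2 = W1 %*% A.
X  <- cbind(int = 1, gA = c(1, 0, 1, 0), gB = c(0, 1, 0, 1))   # as above
W1 <- t(X)[, 1:2]                         # p x m
W2 <- t(X)[, 3:4]                         # p x (n - m)
A  <- solve(crossprod(W1), crossprod(W1, W2))
all.equal(W1 %*% A, W2)                   # TRUE (here A is simply the 2 x 2 identity)
```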

A vector \(z\) in \(\mathbb{R}^n\) belongs to \(\chi\) if and only if it can be written as \(Xb\) for some \(b\) in \(\mathbb{R}^p\). If we partition \(z\) into a vector \(z_1\) of length \(m\) and a vector \(z_2\) of length \(n - m\) then \(Xb = \binom{z_1}{z_2}\) if and only if \(z_1 = W_1^T b\) and \(z_2 = W_2^T b = A^T z_1\).

That is, if \(z \in \chi\) then \(z_2 = A^T z_1\) and \(Xb = z\) if and only if \(W_1^T b = z_1\). Write \(b_z\) for the orthogonal projection of such a \(b\) onto \(W\), so that \(w = b - b_z \in W^\perp\). The vector \(b_z\) is the unique member of \(W\) for which \(W_1^T b_z = z_1\); it is the same for every solution of \(W_1^T b = z_1\). The component \(w\), on the other hand, can be anything in \(W^\perp\). In summary, if \(z \in \chi\) there is a unique \(b_z\) in \(W\) for which \(W_1^T b_z = z_1\) and

\begin{equation} B_z = \{b \in \mathbb{R}^p: Xb = z\} = \{b_z + w : w \in W^\perp\} \end{equation}
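
The decomposition in (2) can be checked numerically on the running sketch: `MASS::ginv` (from the recommended MASS package shipped with standard R installations) returns the solution lying in the row space \(W\), and any other solution differs from it by a vector that \(X\) maps to zero.

```r
X <- cbind(int = 1, gA = c(1, 0, 1, 0), gB = c(0, 1, 0, 1))   # as above
b <- c(1, 2, 0)
z <- X %*% b                          # a point of the column space chi

## Minimum-norm solution of Xb = z: it lies in W, the row space of X.
b_z <- as.numeric(MASS::ginv(X) %*% z)
w   <- b - b_z                        # the leftover piece of this particular b
all.equal(as.numeric(X %*% b_z), as.numeric(z))   # TRUE: b_z also solves Xb = z
max(abs(X %*% w))                                 # ~0: w lies in W-perp = null(X)
```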

General Strategies

In statistical applications, we often think of \(y\) as a random vector whose expected value \(\mu = \mathbb{E}y\) is modeled as an unknown element of the subspace \(\chi\) of \(\mathbb{R}^n\) spanned by the columns of a given matrix \(X\). By assumption, the unknown \(\mu\) can be written as \(X\beta\) for some unknown \(\beta\) in \(\mathbb{R}^p\). The fitted vector \(\hat{y}\) is then thought of as an estimator for the unknown \(\mu\) and the \(\hat{b}\) for which \(\hat{y} = X\hat{b}\) is thought of as an estimator for \(\beta\). Non-uniqueness of \(\hat{b}\) (or of \(\beta\) itself) clearly causes some embarrassment. How can we interpret quantities that are not uniquely determined?

Statisticians use two general strategies to avoid this embarrassment.

  1. Restrict attention to linear functions \(L^T \beta\) (with \(L \in \mathbb{R}^p\)) of the unknown \(\beta\) that are uniquely determined by \(\mu\). That is, only interpret the linear combinations for which there exists some vector \(l\) in \(\mathbb{R}^n\) such that \(L^T \beta = l^T \mu\) whenever \(\mu = X\beta\). Such linear combinations are said to be estimable.

  2. Impose a set of \(p - m\) linearly independent linear constraints on \(b\), say \(D_1^T b = 0\) for a fixed \(p \times (p - m)\) matrix \(D_1\), so that for every \(z \in \chi\) there exists a unique \(b\) in \(\mathbb{R}^p\) for which both \(z = Xb\) and \(D_1^T b = 0\).

Estimable Functions

If \(L \in W\) then it can be written as a linear combination of the columns of \(X^T\), that is, \(L = X^T l\) for some \(l\) in \(\mathbb{R}^n\). If \(Xb = z\) then

\begin{equation} L^T b = l^T Xb = l^T z \end{equation}

In particular, if \(X\beta = \mu \in \chi\) then \(L^T \beta = l^T \mu\). That is, \(L^T \beta\) is an estimable function.

Conversely, suppose that \(L = L_1 + L_2\) with \(L_1 \in W\) and \(L_2 \in W^\perp\) and \(b = b_z + w \in B_z\) as in (2). Then

\begin{equation} L^T b = L_1^T b_z + L_2^T w \end{equation}

If \(L_2 \neq 0\) then we can generate many different \(L^T b\) values by varying \(w\). For example, try \(w = 0\) then \(w = L_2\).
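
Both directions are easy to check numerically in the running sketch: the group contrast \(L = (0, 1, -1)^T\) can be written as \(X^T l\) and takes the same value at every solution of \(Xb = z\), whereas \(L = (0, 1, 0)^T\) is not in \(W\) and does not.

```r
X  <- cbind(int = 1, gA = c(1, 0, 1, 0), gB = c(0, 1, 0, 1))   # as above
b1 <- c(1, 2, 0)
b2 <- b1 + c(1, -1, -1)               # another solution of Xb = z (same z)

## Estimable: L = (0, 1, -1) = t(X) %*% l for l = (.5, -.5, .5, -.5).
L_est <- c(0, 1, -1)
all.equal(as.numeric(t(X) %*% c(0.5, -0.5, 0.5, -0.5)), L_est)  # TRUE
c(sum(L_est * b1), sum(L_est * b2))   # identical values (2 and 2)

## Not estimable: L = (0, 1, 0) is not in the row space of X.
L_bad <- c(0, 1, 0)
c(sum(L_bad * b1), sum(L_bad * b2))   # different values (2 and 1)
```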

In short, the estimable functions \(L^T \beta\) of the unknown parameters are precisely those for which \(L \in W\), that is, \(L = X^T l\) for some \(l \in \mathbb{R}^n\). In that case, \(L^T \hat{b} = l^T X \hat{b} = l^T \hat{y}\) for every solution \(\hat{b}\) of the equation \(X \hat{b} = \hat{y}\). Under the linear model where \(y = \mu + \epsilon\) with \(\mathbb{E}y = \mu \in \chi\) and \(var(\epsilon) = \sigma^2 I_n\) we have

\begin{equation} \mathbb{E}L^T \hat{b} = L^T \beta \end{equation}

and, if we choose the representative \(l\) in \(\chi\) (always possible, because \(X^T\) maps \(\chi\) onto \(W\)),

\begin{equation} var(L^T \hat{b}) = \sigma^2 \|l\|^2 \end{equation}
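
A quick Monte Carlo sanity check of these two formulas on the running sketch (simulated data; the constants are my own choices): for the contrast \(L = (0, 1, -1)^T\) with its representative \(l = (\tfrac12, -\tfrac12, \tfrac12, -\tfrac12)^T \in \chi\), the mean should be \(L^T \beta\) and the variance \(\sigma^2 \|l\|^2\).

```r
set.seed(1)
X     <- cbind(int = 1, gA = c(1, 0, 1, 0), gB = c(0, 1, 0, 1))  # as above
beta  <- c(1, 2, 0); sigma <- 1
mu    <- X %*% beta
l     <- c(0.5, -0.5, 0.5, -0.5)      # representative of L = (0, 1, -1) lying in chi
P     <- X %*% MASS::ginv(X)          # orthogonal projection onto chi

sims <- replicate(20000, {
  y <- mu + rnorm(4, sd = sigma)
  sum(l * (P %*% y))                  # l' y_hat, i.e. L' b_hat for any solution b_hat
})
c(mean(sims), var(sims))              # close to L' beta = 2 and sigma^2 * ||l||^2 = 1
```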

Note that there is a role for estimability if we start with an \(X\) of full rank but lose some of the data. When the corresponding rows are removed from \(X\) we might be left with a reduced model matrix that is not of full rank.
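
For example (made-up data), the model matrix of a one-way layout with three groups is of full rank, but if the rows for one group are removed, the corresponding indicator column becomes all zeros, the reduced matrix loses rank, and a least squares fit returns `NA` for the aliased coefficient; nothing involving that level is estimable any more.

```r
## Full-rank one-way layout with three groups.
d <- data.frame(g = factor(c("A", "A", "B", "B", "C", "C")))
X_all <- model.matrix(~ g, data = d)
qr(X_all)$rank                        # 3: full column rank

## Drop every observation at level C: the gC column is now all zeros.
keep  <- d$g != "C"
X_red <- X_all[keep, ]
qr(X_red)$rank                        # 2: the reduced model matrix is rank deficient

## Fitting with the reduced matrix, the aliased coefficient comes back NA.
y <- c(1.1, 0.9, 2.0, 2.2)
coef(lm(y ~ X_red - 1))               # the gC coefficient is NA
```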

Linear Constraints

Suppose we constrain the parametrization by adding \(p - m\) more rows to the \(X\) matrix, in such a way that the augmented matrix

\begin{equation} \widetilde{X} = \begin{bmatrix} X \\ D_1^T \\ \end{bmatrix} = \begin{bmatrix} W_1^T \\ W_2^T \\ D_1^T \\ \end{bmatrix} \end{equation}

where \(X\) is an \(n \times p\) matrix, \(D_1^T\) is a \((p - m) \times p\) matrix, \(W_1^T\) an \(m \times p\) matrix, and \(W_2^T\) an \((n - m) \times p\) matrix,

has rank \(p\). The columns of \({\widetilde{X}}^T\) span \(\mathbb{R}^p\). The columns of the \(p \times p\) matrix \(M^T = [W_1 D_1]\) span the same space. Thus \(M\) has rank \(p\); it is non-singular. For each \(z\) in \(\chi\) there is now a unique \(b\) in \(\mathbb{R}^p\) for which \(Xb = z\) and \(D_1^T b = 0\):

\begin{equation} \widetilde{X}b = \begin{bmatrix} z \\ 0 \\ \end{bmatrix} \end{equation}

where \(z\) is an \(n \times 1\) vector and \(0\) is the \((p - m) \times 1\) zero vector,

if and only if

\begin{equation} \binom{W_1^T}{D_1^T} b = \binom{z_1}{0} \end{equation}

if and only if

\begin{equation} b = M^{-1} \binom{z_1}{0} \end{equation}
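
In the running sketch a natural choice is the sum-to-zero constraint on the two group effects, that is \(D_1 = (0, 1, 1)^T\); stacking \(W_1^T\) on \(D_1^T\) gives an invertible \(M\), and solving picks out one particular solution (a hypothetical worked example, not a general-purpose routine).

```r
X  <- cbind(int = 1, gA = c(1, 0, 1, 0), gB = c(0, 1, 0, 1))   # as above
z  <- X %*% c(1, 2, 0)                # some vector in chi

D1 <- matrix(c(0, 1, 1), ncol = 1)    # constraint: effect of A + effect of B = 0
W1 <- t(X)[, 1:2]                     # basis for W from the first m = 2 rows of X
M  <- rbind(t(W1), t(D1))             # the p x p matrix M = [W1' ; D1']
b  <- solve(M, c(z[1:2], 0))          # unique b with Xb = z and D1'b = 0
b                                     # (2, 1, -1)
all.equal(as.numeric(X %*% b), as.numeric(z))   # TRUE
sum(D1 * b)                                     # essentially 0: the constraint holds
```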

Here is another way to derive the solution. It corresponds to the way R actually handles the overparametrization problem for factors. The linearly independent columns of \(D_1\) span a \((p - m)\)-dimensional subspace \(D\) of \(\mathbb{R}^p\). Find a \(p \times m\) matrix \(D_2\) whose columns span \(D^\perp\), the \(m\)-dimensional subspace of all vectors in \(\mathbb{R}^p\) that are orthogonal to \(D\). The requirement \(D_1^T b = 0\) means that \(b\) should be orthogonal to \(D\). That is, \(b = D_2 a\) for some \(a\) in \(\mathbb{R}^m\). The requirement \(Xb = z\) then becomes \(XD_2 a = z\). We now have a new parametrization for the model with the \(n \times m\) model matrix \(XD_2\) and parameters \(a \in \mathbb{R}^m\). The equation \(XD_2 a = z\) has a unique solution for each \(z\) in \(\chi\).
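
This is, in spirit, what R does when it expands a factor: the full set of level indicators together with the intercept is overparametrized, and the factor's contrasts play the role of \(D_2\), giving a reduced, full-rank model matrix. A small sketch with made-up data follows; R's default treatment contrasts are one such reduction, not literally the orthogonal-complement construction above.

```r
g <- factor(c("A", "B", "A", "B"))

## Overparametrized coding: intercept plus one indicator column per level.
X_full <- cbind(int = 1, model.matrix(~ g - 1))
c(ncol(X_full), qr(X_full)$rank)      # 3 columns, rank 2

## What R fits instead: contrasts reduce the factor to a full-rank coding.
X_r <- model.matrix(~ g)              # intercept + gB under treatment contrasts
c(ncol(X_r), qr(X_r)$rank)            # 2 columns, rank 2
contrasts(g)                          # the reduction applied to the level indicators
```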