Statistical Shrinkage (4) – Covariance estimation

A common issue encountered in modern statistics involves the inversion of a matrix. For example, when your data is sick with multicollinearity, your estimates for the regression coefficients can bounce all over the place.

In finance we use the covariance matrix as an input for portfolio construction. Analogous to the fact that a variance must be non-negative, a covariance matrix must be positive semi-definite to be meaningful, and positive definite if we want to invert it. The focus of this post is on understanding the underlying issues with an unstable covariance matrix, identifying a practical solution for such instability, and connecting that solution to the all-important concept of statistical shrinkage. I present a strong link between the following three concepts: regularization of the covariance matrix, ridge regression, and measurement error bias, with some easy-to-follow math.

What seems to be the problem, officer?

A covariance matrix \Sigma_{p \times p} is positive semi-definite if and only if v^t \Sigma v \geq 0 for all possible vectors v (check this post for the why). For inversion we need the stronger property of positive definiteness: v^t \Sigma v > 0 for all nonzero v, which in turn means the determinant \det(\Sigma) > 0. And if you remember that \det(\Sigma) = \lambda_1 \times \lambda_2 \times \cdots \times \lambda_p, it directly follows that all the eigenvalues must be positive. If any of them is negative, or too close to zero, the inversion operation becomes problematic, akin to how dividing any number by almost-zero results in a disproportionately large value.
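To see why this matters in practice, here is a minimal numerical sketch (Python with numpy; the correlation value is made up for illustration) of how a near-zero eigenvalue inflates the inverse:

```python
import numpy as np

# Two variables with correlation 0.9999: one eigenvalue is pushed very close to zero.
sigma = np.array([[1.0, 0.9999],
                  [0.9999, 1.0]])

print(np.linalg.eigvalsh(sigma))   # [0.0001, 1.9999]

# The inverse is dominated by 1 / (smallest eigenvalue): its entries are around 5,000
# in absolute value, so tiny estimation errors in sigma translate into huge errors here.
print(np.linalg.inv(sigma))
print(np.linalg.cond(sigma))       # condition number ~ 20,000
```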

Diagonal loading

So what do we do if not all eigenvalues are positive? We make them positive! One way to do that is by way of a process known as diagonal loading. We add a positive value (usually small) to the diagonal elements of the matrix \Sigma.

Why does adding a constant C to the diagonal of \Sigma result in the same constant being added to each eigenvalue? Good question. Here is a short proof:

Let \Sigma be a p \times p matrix, and C a positive scalar. Consider \tilde{\Sigma} = \Sigma + C I, where I is the p \times p identity matrix. Let v_i be an eigenvector of \Sigma with corresponding eigenvalue \lambda_i. Before we start, a quick reminder of the definition: if \Sigma v = \lambda v (for v \neq 0), then v is an eigenvector of \Sigma and \lambda is its corresponding eigenvalue. Now, consider the action of \tilde{\Sigma} on v_i:

    \begin{align*}
    \tilde{\Sigma} v_i &= (\Sigma + C I) v_i \\
    &= \Sigma v_i + C I v_i \\
    &= \lambda_i v_i + C v_i \qquad \textit{from the definition of eigenvalues and eigenvectors} \\
    &= (\lambda_i + C) v_i \qquad \forall i \in \{1, \dots, p\}
    \end{align*}

So \tilde{\Sigma} has the same eigenvectors as \Sigma, and each of its eigenvalues is increased by exactly C, which concludes the proof.
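If you prefer to see it numerically rather than algebraically, here is a quick sanity check of the result (a minimal sketch with numpy; the sample covariance matrix and the loading constant are just examples):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 4))
sigma = np.cov(data, rowvar=False)        # a 4x4 sample covariance matrix

c = 0.1                                   # the diagonal loading constant
sigma_tilde = sigma + c * np.eye(4)       # add c to each diagonal element

# The loaded matrix has the same eigenvectors, and each eigenvalue is shifted by exactly c.
print(np.linalg.eigvalsh(sigma))
print(np.linalg.eigvalsh(sigma_tilde))    # identical values, each plus 0.1
```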

Negative or small eigenvalues are shifted upwards, which helps inversion. You may wonder (as I’m sure you do 🙂 ): must we push all eigenvalues up? I mean, it’s only the small or negative ones which are the culprits of instability.

The answer is no, we don’t have to increase all the eigenvalues. In fact, if you wish to remain as close as possible to the original matrix, you can choose to increase only the problematic eigenvalues. This is exactly what nearest-positive-definite types of algorithms do: you apply the minimum eigenvalue shift needed to reach positive definiteness.
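A minimal sketch of that idea (not any particular packaged routine, just an eigenvalue-clipping version in numpy, with an arbitrarily chosen floor value):

```python
import numpy as np

def clip_to_positive_definite(sigma, floor=1e-8):
    """Raise only the eigenvalues below `floor`; leave the rest of the spectrum untouched."""
    eigvals, eigvecs = np.linalg.eigh(sigma)       # sigma is assumed symmetric
    clipped = np.maximum(eigvals, floor)           # shift only the problematic eigenvalues
    return eigvecs @ np.diag(clipped) @ eigvecs.T  # rebuild the matrix
```

The well-behaved eigenvalues, and all the eigenvectors, are left alone, so the result stays close to the original matrix.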

Diagonal loading and ridge regression

Ridge regression minimizes the residual sum of squares, but with a penalty on the size of the coefficients:

    \[RSS(penalty)=(\boldsymbol{y}-{\boldsymbol{X}} \boldsymbol{\beta})^{\prime}(\boldsymbol{y}-{\boldsymbol{X}} \boldsymbol{\beta})+ penalty \times \boldsymbol{\beta}^{\prime} \boldsymbol{\beta}\]

These are the resulting RR (ridge regression) coefficients:

    \[\hat{\boldsymbol{\beta}}^{R R}=\left({\boldsymbol{X}}^{\prime} {\boldsymbol{X}}+ penalty \times I \right)^{-1} {\boldsymbol{X}}^{\prime} \boldsymbol{y}\]

This term, \boldsymbol{X}^{\prime} \boldsymbol{X} + penalty \times I, is exactly diagonal loading of the covariance matrix.

Why? Because when \boldsymbol{X} is a scaled version of the original data matrix (as it is with ridge regression), \boldsymbol{X}^{\prime} \boldsymbol{X} is the covariance matrix of the scaled data (up to division by the number of observations).
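To make the link explicit, here is a minimal sketch (numpy, simulated data, an arbitrary penalty value) computing the ridge coefficients directly from the diagonally loaded \boldsymbol{X}^{\prime} \boldsymbol{X}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # scaled data, as in ridge regression
y = X @ np.array([1.0, 2.0, 3.0]) + rng.standard_normal(n)

penalty = 10.0
XtX = X.T @ X                                  # covariance of the scaled data (times n)

beta_ols   = np.linalg.solve(XtX, X.T @ y)
beta_ridge = np.linalg.solve(XtX + penalty * np.eye(p), X.T @ y)   # diagonal loading

print(beta_ols)
print(beta_ridge)    # pulled towards zero relative to the OLS coefficients
```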

Ok, so ridge regression has something to do with diagonal loading (and so with increasing the eigenvalues of the covariance matrix), but why does this mean that we shrink the vector of regression coefficients? Another good question; let’s doctor some econometrics to help us understand this.

Diagonal loading, ridge regression and measurement error bias

In the measurement error bias post I have already shown that adding noise to the explanatory variable shrinks the coefficient, so I won’t repeat it here. The main equation from that post is:

    \[\beta_1 \frac{var(x)} { var(x) + var(noise) }\]

For our purpose here, you can simply think of it as increasing the variance of the explanatory variable by a constant (say var(noise)). That means exactly that you increase the diagonal entry (again, in the covariance matrix) which corresponds to that explanatory variable.
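A small simulation shows the shrinkage factor in action (a sketch; the true coefficient and noise variance below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.standard_normal(n)                    # var(x) = 1
y = 2.0 * x + rng.standard_normal(n)          # true beta_1 = 2

noise_var = 0.5
x_noisy = x + rng.normal(scale=np.sqrt(noise_var), size=n)   # measurement error

beta_clean = np.cov(x, y)[0, 1] / np.var(x)
beta_noisy = np.cov(x_noisy, y)[0, 1] / np.var(x_noisy)

print(beta_clean)   # ~ 2.0
print(beta_noisy)   # ~ 2.0 * var(x) / (var(x) + var(noise)) = 2 / 1.5 ≈ 1.33
```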

What is interesting to see, I think, is that while in the econometric literature measurement error is considered a serious problem, in the statistical learning literature we introduce measurement error intentionally, to our advantage. Of course, I steer clear of the never-ending inference-versus-prediction debate.

Summary

Armed with this understanding, you can use it for estimating the entries of the covariance matrix individually. The reason this approach is not widely used is primarily the numerical instabilities which are very likely to follow. But if you can handle the numerical instability effectively, there are significant benefits to estimating individual elements of the covariance matrix as opposed to estimating the entire covariance structure in one go.

For those of you old enough to remember, paraphrasing John “Hannibal” Smith from The A-Team: “I love it when optimization, econometrics and statistical learning come together.”
