Correlation and correlation structure (10) – Inverse Covariance

The covariance matrix is central to many statistical methods. It tells us how variables move together, and its diagonal entries – variances – are very much our go-to measure of uncertainty. But the real action lives in its inverse. We call the inverse covariance matrix either the precision matrix or the concentration matrix. Where did these terms come from? In this post I explain why the inverse of the covariance is named the way it is. I doubt this has kept you up at night, but I still think you’ll find it interesting.

Why is the Inverse Covariance Called Precision?

Variance is just a noisy soloist; if you want to know who really controls the music – who depends on whom – you listen to the precision \Omega. While a variable may look wiggly and wild on its own, you can often tell where it lands quite precisely, conditional on the other variables in the system. The inverse of the covariance matrix encodes the conditional dependence between any two variables after controlling for the rest. The mathematical details appear in an earlier post, and the curious reader should consult that one.

Here, the code and figures serve only to illustrate the precision terminology. Consider this little experiment:

    \[ X_2, X_3 \sim \mathcal{N}(0,1) \text{, independent and ordinary.} \]

    \[ X_1 = 2X_2 + 3X_3 + \text{small noise}.\]

Now, X_1 has a large marginal variance; look, it’s all over the place:
[Figure: X1 variance]
But but but… given the other two variables you can determine X_1 quite accurately (because it doesn’t carry much noise of its own); hence the term precision. The precision matrix captures exactly this phenomenon. Its diagonal entries are not about marginal uncertainty but about conditional uncertainty: how much variability remains when the values of the other variables are given. The reciprocal of the precision entry \Omega_{11} is the residual variance of X_1 after regressing it on the other two variables. The math behind it is found in an earlier post; for now it suffices to write:

    \[\text{For each } i=1,\dots,n: \quad  X_i = \sum_{j \neq i} \beta_{ij} X_j + \varepsilon_i,  \quad \text{with } \mathrm{Var}(\varepsilon_i) = \sigma_i^2.\]

    \[\quad \Omega_{ii} = \tfrac{1}{\sigma_i^2},  \qquad \Omega_{ij} = -\tfrac{\beta_{ij}}{\sigma_i^2}.\]
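As a quick numerical check of these relations, here is a minimal sketch (my own, not this post’s original code; Python with numpy, regressions via numpy.linalg.lstsq) that simulates the little experiment, inverts the empirical covariance, and compares the entries of \Omega with the regression-based quantities:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000

    # The little experiment: X2, X3 independent standard normals,
    # X1 = 2*X2 + 3*X3 + small noise
    x2 = rng.normal(size=n)
    x3 = rng.normal(size=n)
    x1 = 2 * x2 + 3 * x3 + rng.normal(scale=0.1, size=n)

    X = np.column_stack([x1, x2, x3])
    Sigma = np.cov(X, rowvar=False)      # empirical covariance
    Omega = np.linalg.inv(Sigma)         # precision (concentration) matrix

    # Regress X1 on X2 and X3 (zero means, so no intercept needed)
    A = np.column_stack([x2, x3])
    beta, *_ = np.linalg.lstsq(A, x1, rcond=None)
    resid_var = np.var(x1 - A @ beta)

    print("marginal Var(X1):    ", Sigma[0, 0])      # ~13, all over the place
    print("1 / Omega_11:        ", 1 / Omega[0, 0])  # ~0.01, the residual variance
    print("residual variance:   ", resid_var)        # matches 1 / Omega_11
    print("-Omega_1j / Omega_11:", -Omega[0, 1:] / Omega[0, 0])  # ~[2, 3]
    print("regression betas:    ", beta)             # ~[2, 3]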

So after accounting for the other two variables, you are left with

    \[\text{small noise} \;\longrightarrow\; \frac{1}{\text{small noise}} \;\longrightarrow\; \text{high precision}\]

which in this case looks as follows:
[Figure: X1 precise given X2, X3]

This small illustration also reveals a useful computational insight. Instead of directly inverting the covariance matrix (expensive for high dimensions), you can run the regressions of each variable on all the others in parallel, which may scale better on distributed systems.
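Here is a minimal sketch of that idea (again my own, numpy only, written as a plain loop; in practice each iteration could be dispatched to a separate worker or machine):

    import numpy as np

    def precision_via_regressions(X):
        """Estimate the precision matrix by regressing each column on all the others.

        Uses Omega_ii = 1 / sigma_i^2 and Omega_ij = -beta_ij / sigma_i^2,
        where sigma_i^2 is the residual variance of the i-th regression.
        The p regressions are independent, hence embarrassingly parallel.
        """
        X = X - X.mean(axis=0)                 # center, so no intercept is needed
        n, p = X.shape
        Omega = np.zeros((p, p))
        for i in range(p):                     # each iteration is an independent job
            others = np.delete(np.arange(p), i)
            beta, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
            resid_var = np.var(X[:, i] - X[:, others] @ beta)
            Omega[i, i] = 1.0 / resid_var
            Omega[i, others] = -beta / resid_var
        return (Omega + Omega.T) / 2           # average the two noisy estimates of each Omega_ij

    # For the simulated X above, this is approximately np.linalg.inv(np.cov(X, rowvar=False)).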

Why is the Inverse Covariance Called Concentration?

Now, what motivates the concentration terminology? What is concentrated? Let’s unwrap it, starting with the density of a single normally distributed random variable:

    \[f(x) \propto \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right).\]

So if x = \mu the exponent is zero and the exponential equals e^0 = 1; otherwise the exponent is a negative number. This negative number gets divided by the variance, or, in our context, multiplied by the precision (which is the reciprocal of the variance for a single variable). A higher precision value makes for a more negative (negativier 😀) exponent. In turn, it reduces the density the further we drift from the mean (think faster mass-drop in the tails), so we get a sharper, more peaked density whose values are tightly concentrated around the mean. A numeric sanity check: below are two cases with mean zero, one with variance 1 (so precision \tau=1) and the other with variance 1/4 (so precision \tau=4). We look at two values, one at the mean (x=0) and one farther away (x=1), and compare the density at those values for the two cases (p_{1}(0), p_{1}(1), p_{4}(0) and p_{4}(1), where the subscript denotes the precision \tau):

    \[ X\sim\mathcal N(0,\sigma^2),\quad \tau=\frac{1}{\sigma^2},\quad p_\tau(x)=\frac{\sqrt{\tau}}{\sqrt{2\pi}}\exp\!\left(-\tfrac12\tau x^{2}\right) \]

    \[ p_{1}(0)=\frac{1}{\sqrt{2\pi}}\approx 0.39,\qquad p_{4}(0)=\frac{2}{\sqrt{2\pi}}=\sqrt{\frac{2}{\pi}}\approx 0.79 \]

    \[ p_{1}(1)=\frac{1}{\sqrt{2\pi}}e^{-1/2}\approx 0.24,\qquad p_{4}(1)=\frac{2}{\sqrt{2\pi}}e^{-2}\approx 0.11 \]

    \[ \tau\uparrow\;\Rightarrow\;p(0)\uparrow,\;p(1)\downarrow \]
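The same arithmetic in code (a tiny sketch of my own, mirroring the density formula above):

    from math import sqrt, exp, pi

    def p(tau, x):
        """Density of N(0, 1/tau) at x, written in terms of the precision tau."""
        return sqrt(tau / (2 * pi)) * exp(-0.5 * tau * x ** 2)

    for tau in (1, 4):
        print(f"tau={tau}: p(0)={p(tau, 0):.3f}, p(1)={p(tau, 1):.3f}")

    # tau=1: p(0)=0.399, p(1)=0.242
    # tau=4: p(0)=0.798, p(1)=0.108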

In words: higher precision leads to lower density away from the mean and, therefore, higher density around the mean (because the density has to integrate to one, and the mass must go somewhere).

Moving to the multivariate case. Say that X_1 is also normally distributed; then the joint multivariate Gaussian density of our 3 variables is proportional to:

    \[f(\mathbf{x}) \propto \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \mathbf{\Omega} (\mathbf{x}-\boldsymbol{\mu})\right)\]

    \[\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}, \quad \mathbf{\Omega} = \begin{bmatrix} \Omega_{11} & \Omega_{12} & \Omega_{13} \\ \Omega_{21} & \Omega_{22} & \Omega_{23} \\ \Omega_{31} & \Omega_{32} & \Omega_{33} \end{bmatrix}\]

In the same fashion, \Omega directly sets the shape and orientation of the contours of the multivariate density. If there is no correlation (think a diagonal \Omega), what would you expect to see? A wide, diffuse, spread-out cloud (indicating little concentration). By way of contrast, a full \Omega weights the directions differently; it determines how much probability mass gets concentrated in each direction of the space.

Another way to see this is to remember that in the multivariate Gaussian density, \mathbf{\Omega} appears in the numerator of the exponent, so its inverse, the covariance \mathbf{\Sigma} = \mathbf{\Omega}^{-1}, effectively sits in the denominator. Larger variances (larger diagonal entries of \mathbf{\Sigma}) mean more spread, and as a result lower density values at individual points and a more diffuse multivariate distribution overall.
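A quick numerical illustration of this denominator effect (a sketch of my own, numpy only): the density of a p-dimensional Gaussian at its mean is (2\pi)^{-p/2}\det(\boldsymbol{\Sigma})^{-1/2}, so inflating the variances lowers the density at any individual point:

    import numpy as np

    def peak_density(Sigma):
        """Multivariate Gaussian density at its mean: (2*pi)^(-p/2) * det(Sigma)^(-1/2)."""
        p = Sigma.shape[0]
        return (2 * np.pi) ** (-p / 2) / np.sqrt(np.linalg.det(Sigma))

    Sigma = np.eye(3)                # unit variances, no correlation
    print(peak_density(Sigma))       # ~0.063
    print(peak_density(4 * Sigma))   # ~0.008 -- larger variances, lower peak, more diffuse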

The following two simple 3 \times 3 scenarios illustrate the concentration principle explained above. In the code below you can see that although I plot only the first 2 variables, there are actually 3 variables; the third is independent of the other two, so a high covariance between the first two remains high even after accounting for the third (I mention this so you don’t get confused by the fact that we now work with the covariance rather than with its inverse). Here are the two scenarios:

Round hill: \boldsymbol{\Sigma}_1 with correlation \rho = 0.1 creates nearly circular contours.

Elongated ridge: \boldsymbol{\Sigma}_2 with correlation \rho = 0.9 creates elliptical contours stretched along the correlation direction.

Rotate the interactive plots below to get a clearer sense of what we mean by more/less concentration. Don’t forget to check the density scale.

Round hill: diffuse, less concentrated

Elongated ridge: steep/peaky, more concentrated

Hopefully, this explanation makes the terminology for the inverse covariance clearer.

Code

For Precision
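The post’s original code is not reproduced here; the following is a minimal sketch of my own (Python with numpy and matplotlib) of the kind of simulation and plots shown above: the wide marginal distribution of X_1 next to the narrow conditional one, taken as the residual of X_1 after regressing it on X_2 and X_3:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    n = 50_000

    # X2, X3 ~ N(0,1) independent; X1 = 2*X2 + 3*X3 + small noise
    x2 = rng.normal(size=n)
    x3 = rng.normal(size=n)
    x1 = 2 * x2 + 3 * x3 + rng.normal(scale=0.1, size=n)

    # Conditional view of X1: what is left once the other two variables are given
    A = np.column_stack([x2, x3])
    beta, *_ = np.linalg.lstsq(A, x1, rcond=None)
    x1_given_rest = x1 - A @ beta

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(x1, bins=100, density=True)
    axes[0].set_title(f"X1 marginal (var = {x1.var():.2f})")
    axes[1].hist(x1_given_rest, bins=100, density=True)
    axes[1].set_title(f"X1 | X2, X3 (var = {x1_given_rest.var():.4f})")
    plt.tight_layout()
    plt.show()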

For Concentration
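Likewise, here is a static sketch of my own (numpy and matplotlib; the post’s interactive plots are not reproduced) of the two concentration scenarios. It plots the density surface of the first two variables; the third variable is independent, so it does not affect the shape of this surface:

    import numpy as np
    import matplotlib.pyplot as plt

    def mvn_pdf_2d(x, y, Sigma2):
        """Zero-mean bivariate Gaussian density on a grid, via the concentration matrix."""
        Omega = np.linalg.inv(Sigma2)
        norm_const = 1 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma2)))
        quad = Omega[0, 0] * x**2 + 2 * Omega[0, 1] * x * y + Omega[1, 1] * y**2
        return norm_const * np.exp(-0.5 * quad)

    scenarios = {"Round hill (rho = 0.1)": 0.1, "Elongated ridge (rho = 0.9)": 0.9}
    grid = np.linspace(-3, 3, 120)
    X, Y = np.meshgrid(grid, grid)

    fig = plt.figure(figsize=(12, 5))
    for k, (title, rho) in enumerate(scenarios.items(), start=1):
        # 3x3 covariance: first two variables correlated, third independent
        Sigma3 = np.array([[1.0, rho, 0.0],
                           [rho, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])
        Z = mvn_pdf_2d(X, Y, Sigma3[:2, :2])   # marginal density of the first two variables
        ax = fig.add_subplot(1, 2, k, projection="3d")
        ax.plot_surface(X, Y, Z, cmap="viridis")
        ax.set_title(f"{title}, peak = {Z.max():.2f}")  # note the density scale
    plt.tight_layout()
    plt.show()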
