Correlation and correlation structure (8) – the precision matrix

If you are reading this, you already know that the covariance matrix represents unconditional linear dependency between the variables. Far less mentioned is the bewitching fact that the elements of the inverse of the covariance matrix (i.e. the precision matrix) encode the conditional linear dependence between the variables. This post shows why that is the case. I start with the motivation to even discuss this, then the math, then some code.

Wide data matrices, where P = \#columns > N = \#rows, are widespread nowadays (pun intended). In this situation a good, stable and reliable estimate of the covariance matrix is hard to obtain. The usual sample covariance estimate when P > N is simply lousy (for reasons out of scope here). Much has been written about how best to adjust and correct that sample covariance estimate (see below for references). We start with a motivation to shift the focus from the covariance matrix to the precision matrix.

Motivation: thresholding covariance estimation

One glaring issue is that the sample estimate is always dense. I mean, if you simulate data from a population with zero covariance between the variables (represented by a diagonal covariance matrix), the sample estimates of the off-diagonal elements will never be exactly zero. This is because the estimated covariance reflects the input data, which will naturally have some variability even if the true underlying covariance is zero. So setting small off-diagonal estimated elements to zero is good thinking, and we call this thresholding estimation. Let s_{i j} be the (i,j)^{th} element of \boldsymbol{S}, the sample covariance estimate, and let thresh be some threshold value; then formally the new covariance estimate is:

    \[\widehat{\boldsymbol{\Sigma}}=\left(\widehat{\sigma}_{i j}\right)_{p \times p}, \quad \widehat{\sigma}_{i j}=\left\{\begin{array}{ll} s_{i j}, & \text { if } i=j \\ s_{i j} \mathcal{I} \left\{\left|s_{i j}\right|>thresh \right\}, & \text { if } i \neq j \end{array} .\right.\]

where \mathcal{I} is the indicator function, returning zero if the absolute value is below the thresh you set. In this way we create a sparse covariance estimate.
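A minimal sketch of this estimator, assuming simulated data whose population covariance is diagonal (the dimensions and the thresh value here are arbitrary):

    # Hard-thresholding the sample covariance of independent (zero-covariance) variables
    set.seed(1)
    N <- 50; P <- 10
    X <- matrix(rnorm(N * P), N, P)       # population covariance is the identity (diagonal)
    S <- cov(X)                           # dense sample estimate, off-diagonals never exactly zero
    thresh <- 0.1
    Sigma_hat <- S * (abs(S) > thresh)    # zero-out the small off-diagonal entries...
    diag(Sigma_hat) <- diag(S)            # ...but always keep the diagonal, as in the formula
    mean(Sigma_hat[upper.tri(Sigma_hat)] == 0)   # share of off-diagonal entries set to zero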

Now, each entry in the covariance matrix ignores the rest of the variables. The entry s_{i j} is simply cov(x_i, x_j) \quad \forall i \neq j. So for example if x_1 drives, and co-moves with, both x_2 and x_3, meaning cov(x_1, x_2) and cov(x_1, x_3) are both high, then cov(x_2, x_3) is also likely to be high. If we were able to net-out the movement in x_1 from cov(x_2, x_3), that would be good. In other words we look for cov(x_2, x_3 | x_1). Why would that be good? Because cov(x_2, x_3 | x_1) is likely to be much lower than cov(x_2, x_3), and so thresholding makes (even) better sense. We would expect many more zeros if we could find the partial covariances cov(x_i, x_j | \boldsymbol{x}_{-(i,j)}).
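Here is a tiny numerical illustration of that point, a sketch assuming x_2 and x_3 are driven only by x_1 plus independent noise:

    # cov(x2, x3) looks substantial, but netting out x1 makes it (nearly) vanish
    set.seed(2)
    n  <- 10000
    x1 <- rnorm(n)
    x2 <- 0.8 * x1 + rnorm(n)
    x3 <- 0.8 * x1 + rnorm(n)
    cov(x2, x3)                                           # high, inherited entirely from x1
    cov(residuals(lm(x2 ~ x1)), residuals(lm(x3 ~ x1)))   # roughly zero: cov(x2, x3 | x1)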

The precision matrix

The precision matrix is the inverse of the covariance matrix. It just so happens that the entries of the precision matrix encode exactly what we are looking for: the conditional (on the other variables) dependence, rather than the unconditional dependence we get from the covariance matrix. This morning I was walking with a fellow statistician, data scientist, quant, machine learning engineer, AI researcher, and he challenged me to explain this. I could do little but accept the challenge. So, why is it that by inverting the covariance matrix we end up with a precision matrix which encodes the conditional dependence between any two variables after controlling for the rest?

Some gory details

Consider a data matrix with T rows and 2+p columns, Z_{T \times (2 + p)}. Partition Z into two matrices: the first 2 columns as X and the rest as Y. For the sake of simplicity assume that

    \[\begin{aligned} & \mathbf{X} \sim \mathcal{N}_2\left(\mathbf{0}, \boldsymbol{\Sigma}_X\right) \\ & \mathbf{Y} \sim \mathcal{N}_p\left(\mathbf{0}, \boldsymbol{\Sigma}_Y\right) \end{aligned}\]

So

    \[\mathbf{Z} \sim \mathcal{N}_{2+p}(\mathbf{0}, \boldsymbol{\Sigma})\]

The covariance matrix \boldsymbol{\Sigma} and the precision matrix \boldsymbol{\Omega}=\boldsymbol{\Sigma}^{-1} are given by

    \[\boldsymbol{\Sigma}=\left[\begin{array}{ll} \boldsymbol{\Sigma}_{X X} & \boldsymbol{\Sigma}_{X Y}^{\top} \\ \boldsymbol{\Sigma}_{X Y} & \boldsymbol{\Sigma}_{Y Y} \end{array}\right]\]

and

    \[\boldsymbol{\Omega}=\left[\begin{array}{ll} \boldsymbol{\Omega}_{X X} & \boldsymbol{\Omega}_{X Y}^{\top} \\ \boldsymbol{\Omega}_{X Y} & \boldsymbol{\Omega}_{Y Y} \end{array}\right]\]

Let me remind you what we are chasing here. X is just a matrix with 2 columns, so the off-diagonal of \boldsymbol{\Omega}_{X X} is just one number. We want to show that this number captures the conditional dependence between the two columns, after accounting for all the other columns in Y.

The order of things to come:

  1. Express \boldsymbol{\Omega}_{X X} in terms of the covariance matrix \boldsymbol{\Sigma}.
  2. Show that \boldsymbol{\Omega}_{X X} is the (inverse) covariance of the conditional distribution X \mid Y.
  3. Relate the result to the more general case of linear projection.
  4. Illustrate it with some code.
  1. We start by expressing \boldsymbol{\Omega}_{X X} in terms of the covariance matrix \boldsymbol{\Sigma}:

        \[\boldsymbol{\Omega}_{X X}=\left(\boldsymbol{\Sigma}_{X X}-\boldsymbol{\Sigma}_{X Y}^{\top} \boldsymbol{\Sigma}_{Y Y}^{-1} \boldsymbol{\Sigma}_{X Y}\right)^{-1}\]

    This expression comes from tedious matrix algebra, or better yet from something we know about the inverse of this kind of block matrix, called the Schur complement (see Lu and Shiou in the references below). The identity is checked numerically in the code illustration below.

  2. (In the Gaussian case) \boldsymbol{\Omega}_{X X} is the precision, i.e. the inverse covariance, of the conditional distribution X \mid Y.

        \[\begin{aligned} f(\mathbf{x} \mid \mathbf{y}) & \propto f(\mathbf{x}, \mathbf{y})=(2 \pi)^{-(p+2) / 2}|\boldsymbol{\Omega}|^{1 / 2} \exp \left(-\frac{1}{2}\left[\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right]^{\top}\left[\begin{array}{ll} \boldsymbol{\Omega}_{X X} & \boldsymbol{\Omega}_{X Y}^{\top} \\ \boldsymbol{\Omega}_{X Y} & \boldsymbol{\Omega}_{Y Y} \end{array}\right]\left[\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right]\right) \\ & \propto \exp \left(-\frac{1}{2}\left[\begin{array}{l} \mathbf{x} \\ \mathbf{y} \end{array}\right]^{\top}\left[\begin{array}{l} \boldsymbol{\Omega}_{X X} \mathbf{x}+\boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y} \\ \boldsymbol{\Omega}_{X Y} \mathbf{x}+\boldsymbol{\Omega}_{Y Y} \mathbf{y} \end{array}\right]\right) \\ & \propto \exp \left(-\frac{1}{2}\left(\mathbf{x}^{\top} \boldsymbol{\Omega}_{X X} \mathbf{x}+\mathbf{y}^{\top} \boldsymbol{\Omega}_{X Y} \mathbf{x}+\mathbf{x}^{\top} \boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y}+\mathbf{y}^{\top} \boldsymbol{\Omega}_{Y Y} \mathbf{y}\right)\right) \end{aligned}\]

    Usually you see this written in terms of \boldsymbol{\Sigma} but recall that \operatorname{det}\left(A^{-1}\right)=\frac{1}{\operatorname{det}(A)}.

    We can ignore anything that depends only on Y (specifically \mathbf{y}^{\top} \boldsymbol{\Omega}_{Y Y} \mathbf{y}): since we condition on Y we can treat it as a constant, and keep only the terms involving X (the rest moves outside the brackets and is elegantly swept away by the \propto sign, which means "proportional to").

    Using some additional algebra (see below for some rules we need here) we can write it as:

        \[\begin{aligned} f(\mathbf{x} \mid \mathbf{y}) & \propto \exp \left(-\frac{1}{2}\left(\mathbf{x}^{\top} \boldsymbol{\Omega}_{X X} \mathbf{x}+\mathbf{y}^{\top} \boldsymbol{\Omega}_{X Y} \mathbf{x}+\mathbf{x}^{\top} \boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y}\right)\right) \\ & \propto \exp \left(-\frac{1}{2} \left( \mathbf{x}^{\top} \boldsymbol{\Omega}_{X X} \mathbf{x}+ 2\, \mathbf{y}^{\top} \boldsymbol{\Omega}_{X Y} \mathbf{x}\right) \right) \end{aligned}\]

    What we are going to do now is to make it look like the normal distribution:

        \[\begin{aligned} f(x) &= \frac{1}{\sigma \sqrt{2\pi}} e^{ -\frac{1}{2} \left(\frac{x - \mu}{\sigma} \right)^2 } \\ \end{aligned}\]

    so we recognize the covariance term out of the expression. The trick is to add and subtract the same term (preserving the value of the overall expression), completing the square, aka "completing the quadratic form". We have quite a few algebraic rules to remember moving forward: transposing, noting that the covariance matrix is symmetric, and that we can treat anything which does not depend on X as a constant, even \boldsymbol{\Omega}_{X X}, since it does not change with the realization of X. I wrote down some rules we use below, and here you can find what I use for verification.

    The term we need to add and subtract is \mathbf{y}^{\top} \boldsymbol{\Omega}_{X Y} \boldsymbol{\Omega}_{X X}^{-1} \boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y}, to get

        \[\begin{aligned} & \exp \left(-\frac{1}{2}\left(\mathbf{x}^{\top} \boldsymbol{\Omega}_{X X} \mathbf{x}+2\, \mathbf{y}^{\top} \boldsymbol{\Omega}_{X Y} \mathbf{x}\right)\right) =  \\  & \exp \left(-\frac{1}{2}\left(\mathbf{x}+\boldsymbol{\Omega}_{X X}^{-1} \boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y}\right)^{\top} \boldsymbol{\Omega}_{X X}\left(\mathbf{x}+\boldsymbol{\Omega}_{X X}^{-1} \boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y} \right) \right) \times \exp \left(Constant \right) \end{aligned}\]

    So this looks like what we aimed for:

        \[e^{ -\frac{1}{2} \left(\frac{x - \mu}{\sigma} \right)^2 }\]

    as if \mu = -\boldsymbol{\Omega}_{X X}^{-1} \boldsymbol{\Omega}_{X Y}^{\top} \mathbf{y}, and especially if you remember that \boldsymbol{\Omega}_{X X} is itself an inverse (so it actually sits "in the denominator" of that exponent, playing the role of 1/\sigma^2):

        \[\boldsymbol{\Omega}_{X X}=\left(\boldsymbol{\Sigma}_{X X}-\boldsymbol{\Sigma}_{X Y}^{\top} \boldsymbol{\Sigma}_{Y Y}^{-1} \boldsymbol{\Sigma}_{X Y}\right)^{-1}.\]

    Cool cool, but what do we have? That the covariance of the conditional distribution of X given Y is \boldsymbol{\Omega}_{X X}^{-1}, an expression which says that you need to subtract from \boldsymbol{\Sigma}_{X X}, which you can think of as the total variability, the part which is explained by Y: \boldsymbol{\Sigma}_{X Y}^{\top} \boldsymbol{\Sigma}_{Y Y}^{-1} \boldsymbol{\Sigma}_{X Y}.

  3. The link with linear projection

    You have seen this in the past:

        \[\hat{\beta}=\left(X^T X\right)^{-1} X^T y\]

    Recall that when the mean vector is zero, X^T X is simply the covariance (up to a scaling constant). Now observe that \boldsymbol{\Sigma}_{Y Y}^{-1} \boldsymbol{\Sigma}_{X Y}, which appears inside the expression above, can be thought of (up to a constant) as the coefficients from a regression which explains the columns of X using the columns of Y. Specifically, \boldsymbol{\Sigma}_{X Y} is analogous to the cross-product X^T y in that \hat{\beta} expression, capturing the covariance between the regressors and the response, and \boldsymbol{\Sigma}_{Y Y}^{-1} is analogous to (X^T X)^{-1}, capturing the covariance within the regressors, here the columns of Y. This analogy is also checked numerically in the code illustration below.

  4. Code illustration

    Here is some R code to replicate the math above. We pull some stock market data using the quantmod package (the code sketch follows below).

    We will net-out, or control for, the other variables by using the residuals \{\varepsilon_i\} from the node-wise regressions: x_i = \beta_0 + \sum_{k \notin \{i, j\}} \beta_k x_k + \varepsilon_i. So the partial linear dependence between, say, x_1 and x_2 will be estimated by cor(\varepsilon_1, \varepsilon_2).
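    Below is a minimal sketch of that comparison (the tickers and the date range are arbitrary, illustrative choices): it estimates the partial correlation between the first two stocks once directly from the precision matrix, and once from the residuals of the node-wise regressions.

        # Partial correlation two ways: from the precision matrix, and from node-wise regressions
        library(quantmod)

        tickers <- c("AAPL", "MSFT", "AMZN", "GOOGL", "JPM", "XOM")   # illustrative choice
        prices <- lapply(tickers, function(tk)
          Ad(getSymbols(tk, from = "2018-01-01", auto.assign = FALSE)))
        prices <- do.call(merge, prices)
        ret <- coredata(na.omit(diff(log(prices))))   # daily log-returns as a plain matrix
        colnames(ret) <- tickers

        S <- cov(ret)          # sample covariance
        Omega <- solve(S)      # precision matrix

        # (1) Partial correlation of the first two stocks straight from the precision matrix:
        # the scaled, sign-flipped off-diagonal element
        pcor_from_omega <- -Omega[1, 2] / sqrt(Omega[1, 1] * Omega[2, 2])

        # (2) The same quantity from node-wise regressions: regress each of the two
        # stocks on all the remaining ones and correlate the residuals
        others <- ret[, -(1:2)]
        e1 <- residuals(lm(ret[, 1] ~ others))
        e2 <- residuals(lm(ret[, 2] ~ others))
        pcor_from_residuals <- cor(e1, e2)

        c(pcor_from_omega, pcor_from_residuals)       # numerically (almost) identical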
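    A further small simulation, sketched under the zero-mean Gaussian setup from above (the dimensions are arbitrary), backs up steps 1 and 3: the top-left block of the precision matrix equals the inverse of the Schur complement, and \boldsymbol{\Sigma}_{Y Y}^{-1} \boldsymbol{\Sigma}_{X Y} reproduces the coefficients of a regression of the columns of X on the columns of Y.

        # Check the Schur complement identity (step 1) and the regression link (step 3)
        set.seed(3)
        T_ <- 500; p <- 4
        Z <- matrix(rnorm(T_ * (2 + p)), T_, 2 + p) %*% matrix(runif((2 + p)^2, -1, 1), 2 + p)
        Z <- scale(Z, center = TRUE, scale = FALSE)   # zero mean, as assumed above
        X <- Z[, 1:2]; Y <- Z[, -(1:2)]

        Sigma    <- cov(Z)
        Sigma_XX <- Sigma[1:2, 1:2]
        Sigma_YY <- Sigma[-(1:2), -(1:2)]
        Sigma_XY <- Sigma[-(1:2), 1:2]                # the bottom-left block, cov(Y, X)
        Omega    <- solve(Sigma)

        # Step 1: Omega_XX is the inverse of the Schur complement
        Omega[1:2, 1:2]
        solve(Sigma_XX - t(Sigma_XY) %*% solve(Sigma_YY) %*% Sigma_XY)

        # Step 3: Sigma_YY^{-1} Sigma_XY matches the regression of X's columns on Y
        solve(Sigma_YY) %*% Sigma_XY
        coef(lm(X ~ Y - 1))                           # no intercept, the data are de-meaned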

To cite this blog post “Eran Raviv (2024, June 6). Correlation and correlation structure (8) – the precision matrix. Retrieved on … month, year from https:…”

Footnotes and references

  • Fan, Jianqing, Yuan Liao, and Han Liu. “An overview of the estimation of large covariance and precision matrices.” The Econometrics Journal 19, no. 1 (2016): C1-C32.
  • Bickel, Peter J., and Elizaveta Levina. “Covariance regularization by thresholding.” The Annals of Statistics 36, no. 6 (2008): 2577-2604.
  • Meinshausen, Nicolai, and Peter Bühlmann. “High-dimensional graphs and variable selection with the lasso.” The Annals of Statistics 34, no. 3 (2006): 1436-1462.
  • Lu, Tzon-Tzer, and Sheng-Hua Shiou. “Inverses of 2 × 2 block matrices.” Computers & Mathematics with Applications 43, no. 1-2 (2002): 119-129.
  • Some algebra rules which were used

  • Reminder for completing the square (a quick numeric check follows these rules):

        \[\mathbf{x}^{\top} M \mathbf{x}-2 \mathbf{b}^{\top} \mathbf{x}=\left(\mathbf{x}-M^{-1} \mathbf{b}\right)^{\top} M\left(\mathbf{x}-M^{-1} \mathbf{b}\right)-\mathbf{b}^{\top} M^{-1} \mathbf{b}\]

  • (\mathbf{A B})^{\mathrm{T}}=\mathbf{B}^{\mathrm{T}} \mathbf{A}^{\mathrm{T}}
  • For \mathbf{A} invertible (e.g. positive definite): \left(\mathbf{A}^{\mathrm{T}}\right)^{-1}=\left(\mathbf{A}^{-1}\right)^{\mathrm{T}}
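  • A quick numeric check of the completing-the-square rule (a sketch, with an arbitrary positive definite M):

        # Verify: x'Mx - 2b'x equals (x - M^{-1}b)' M (x - M^{-1}b) - b' M^{-1} b
        set.seed(4)
        d <- 3
        A <- matrix(rnorm(d^2), d); M <- crossprod(A) + diag(d)   # a positive definite M
        b <- rnorm(d); x <- rnorm(d)
        lhs <- t(x) %*% M %*% x - 2 * t(b) %*% x
        m   <- solve(M, b)                                        # M^{-1} b
        rhs <- t(x - m) %*% M %*% (x - m) - t(b) %*% m
        all.equal(c(lhs), c(rhs))                                 # TRUE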