If you are reading this, you already know that the covariance matrix represents unconditional linear dependency between the variables. Far less mentioned is the bewitching fact that the elements of the inverse of the covariance matrix (i.e. the precision matrix) encode the conditional linear dependence between the variables. This post shows why that is the case. I start with the motivation to even discuss this, then the math, then some code.
Wide data matrices, where $p > n$ (more variables than observations), are widespread nowadays (pun intended). In this situation a good, stable and reliable estimate of the covariance matrix is hard to obtain. The usual sample covariance estimate when $p > n$ is simply lousy (for reasons out of scope here). Much is written about how best to adjust and correct that sample covariance estimate (see below for references). A motivation to shift the focus from the covariance matrix to the precision matrix is where we start.
Motivation: thresholding covariance estimation
One glaring issue is that the sample estimate is always dense. I mean, if you simulate data from a population with zero covariance between the variables (represented by a diagonal covariance matrix), the sample estimates of the off-diagonal elements will never be exactly zero. This is because the estimated covariance reflects the input data, which will naturally have some variability even if the true underlying covariance is zero. So setting small off-diagonal estimated elements to zero is good thinking, and we call this thresholding estimation. Let $\widehat{\sigma}_{ij}$ be the $(i,j)$ element of the sample covariance estimate and let $\lambda$ be some threshold value; then formally you have the new covariance estimate:

$$\widehat{\sigma}_{ij}^{\,thr} = \widehat{\sigma}_{ij} \cdot 1\{ |\widehat{\sigma}_{ij}| \geq \lambda \},$$

where $1\{\cdot\}$ is the indicator function that returns zero if the absolute value is below the $\lambda$ you set. Like that we can create a sparse covariance estimate.
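A minimal sketch of the idea in R, using simulated data and an arbitrarily chosen threshold value:

# Simulate from a population with a diagonal covariance matrix (true off-diagonals are zero)
set.seed(1)
n <- 50
p <- 10
dat <- matrix(rnorm(n * p), nrow = n, ncol = p)
sample_cov <- cov(dat)     # dense: off-diagonal estimates are small, but never exactly zero
lambda <- 0.1              # an arbitrary threshold, for illustration only
thresholded_cov <- sample_cov * (abs(sample_cov) >= lambda)   # hard thresholding
sum(sample_cov == 0)       # 0
sum(thresholded_cov == 0)  # many entries are now exactly zero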
Now, each entry in the covariance matrix is computed without reference to the other variables. The $(i,j)$ entry is simply $cov(x_i, x_j)$. So for example if $x_k$ drives, and co-moves with, both $x_i$ and $x_j$, meaning $cov(x_i, x_k)$ and $cov(x_j, x_k)$ are both high, then $cov(x_i, x_j)$ is also likely to be high. If we are able to net out the movement in $x_k$ from $x_i$ and $x_j$, that would be good. In other words we look for $cov(x_i, x_j \mid x_k)$. Why would that be good? Because $cov(x_i, x_j \mid x_k)$ is likely to be much lower than $cov(x_i, x_j)$, and so thresholding makes (even) better sense. We would expect many more zeros if we could find the partial covariances $cov(x_i, x_j \mid \text{rest})$.
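To make this concrete, here is a small simulated example (a sketch, with made-up coefficients): a common driver $x_k$ moves both $x_i$ and $x_j$, so their unconditional covariance is sizable, but once the movement of $x_k$ is netted out, very little covariance remains.

set.seed(2)
n <- 10000
x_k <- rnorm(n)                      # the common driver
x_i <- 0.8 * x_k + rnorm(n, sd = 0.5)
x_j <- 0.8 * x_k + rnorm(n, sd = 0.5)
cov(x_i, x_j)                        # sizable (around 0.64), driven entirely by x_k
# net out x_k from both variables and look at the leftover covariance
res_i <- resid(lm(x_i ~ x_k))
res_j <- resid(lm(x_j ~ x_k))
cov(res_i, res_j)                    # the partial covariance is close to zero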
The precision matrix
The precision matrix is the inverse of the covariance matrix. It just so happens that the entries of the precision matrix are exactly what we are looking for: the conditional (on the other variables) dependence, rather than the unconditional one as in the covariance matrix. This morning I was walking with a fellow statistician, data scientist, quant, machine learning engineer, AI researcher, statistician, and he challenged me to explain this. I could not do much but accept the challenge. So, why, by inverting the covariance matrix, do we end up with a precision matrix which encodes the conditional dependence between any two variables after controlling for the rest?
Some gory details
Consider a data matrix $Z$ with $n$ rows and $p$ columns, and covariance matrix $\Sigma$. Partition $Z$ into two matrices: the first 2 columns as $X$ and the rest as $Y$. For the sake of simplicity assume that the columns have mean zero, so the covariance is simply $Z'Z$ up to a scaling constant.

The covariance and precision matrices are given by

$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \quad \text{and} \quad \Omega = \Sigma^{-1} = \begin{pmatrix} \Omega_{XX} & \Omega_{XY} \\ \Omega_{YX} & \Omega_{YY} \end{pmatrix}.$$

I remind you what we are chasing here. $X$ is just a matrix with 2 columns, so the off-diagonal of $\Omega_{XX}$ is just one number. We want to show that this number is the conditional dependence between the two columns of $X$, after accounting for all the other columns $Y$.
The order of things to come:
- Express $\Omega_{XX}$ in terms of the covariance matrix $\Sigma$.
- Show that $\Omega_{XX}^{-1}$ is the covariance of the conditional distribution $X \mid Y$.
- Relate the result to the more general case of linear projection.
- Illustrate it with some code.
We start by expressing $\Omega_{XX}$ in terms of the covariance matrix $\Sigma$:

$$\Omega_{XX} = \left( \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \right)^{-1}.$$

This expression comes from tedious matrix algebra, or better yet from something we know about this kind of block matrix and its inverse, which is called the Schur complement.
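You can convince yourself of this numerically; below is a sketch using a simulated covariance matrix, with the first two columns playing the role of $X$ and the rest the role of $Y$:

set.seed(3)
n <- 500
p <- 6
dat <- matrix(rnorm(n * p), n, p)
Sigma <- cov(dat)
Omega <- solve(Sigma)                 # the precision matrix
Sigma_XX <- Sigma[1:2, 1:2]
Sigma_XY <- Sigma[1:2, 3:p]
Sigma_YX <- Sigma[3:p, 1:2]
Sigma_YY <- Sigma[3:p, 3:p]
Omega[1:2, 1:2]                                               # top-left block of the precision matrix
solve(Sigma_XX - Sigma_XY %*% solve(Sigma_YY) %*% Sigma_YX)   # inverse of the Schur complement: identical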
(in the Gaussian case) The precision matrix is the covariance of the conditional distribution
Usually you see this written in terms of $\Sigma^{-1}$, but recall that $\Sigma^{-1} = \Omega$. In the Gaussian case, with zero means, the joint density satisfies

$$f(x, y) \propto \exp\left\{ -\tfrac{1}{2} \begin{pmatrix} x \\ y \end{pmatrix}' \begin{pmatrix} \Omega_{XX} & \Omega_{XY} \\ \Omega_{YX} & \Omega_{YY} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \right\}.$$

We can ignore anything that depends only on $y$ (specifically the term $y' \Omega_{YY} y$) since we condition on it, so we can treat it as a constant, and keep all terms related to $x$ (the rest will go outside the brackets and be elegantly ignored by using the $\propto$ sign, which means "proportional to"):

$$f(x \mid y) \propto \exp\left\{ -\tfrac{1}{2} \left( x' \Omega_{XX} x + x' \Omega_{XY} y + y' \Omega_{YX} x \right) \right\}.$$

Using some additional algebra (see below for some rules we need here) we can write it as:

$$f(x \mid y) \propto \exp\left\{ -\tfrac{1}{2} \left( x' \Omega_{XX} x + 2\, x' \Omega_{XY} y \right) \right\}.$$

What we are going to do now is to make it look like the normal distribution:

$$f(x) \propto \exp\left\{ -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\},$$

so we can recognize the covariance term out of the expression. The trick is to add and subtract the same term (preserving the value of the overall expression) and complete the square, aka "complete the quadratic form". We have quite a few algebraic rules to remember moving forward: transposing, noting that the covariance is a symmetric matrix, and that we can consider anything which does not depend on $x$ as a constant, even $y$, since it does not change with the realization of $x$. I wrote down some rules we use below, and here you can find what I use for verification.

The term we need to add and subtract is $y' \Omega_{YX} \Omega_{XX}^{-1} \Omega_{XY} y$, to get

$$f(x \mid y) \propto \exp\left\{ -\tfrac{1}{2} \left( x + \Omega_{XX}^{-1} \Omega_{XY} y \right)' \Omega_{XX} \left( x + \Omega_{XX}^{-1} \Omega_{XY} y \right) \right\}.$$

So this looks like what we aimed for:

$$X \mid Y = y \ \sim\ N\!\left( -\Omega_{XX}^{-1} \Omega_{XY}\, y,\ \ \Omega_{XX}^{-1} \right),$$

as if $\mu = -\Omega_{XX}^{-1} \Omega_{XY}\, y$, and especially if you remember that $\Omega_{XX}^{-1}$ is itself an inverse (so it actually sits "in the denominator" of that exponent).

Cool cool, but what do we have? The variance of the conditional distribution of $X$ given $Y$ is an expression which says that you need to subtract from $\Sigma_{XX}$, which you can think of as the total variability, the part which is explained by $Y$:

$$cov(X \mid Y) = \Omega_{XX}^{-1} = \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}.$$
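Staying with simulated Gaussian data for a moment (a sketch; the variable names below are mine), we can check that the covariance of $X$ after the part explained by $Y$ is removed indeed matches $\Omega_{XX}^{-1}$:

set.seed(4)
n <- 5000
p <- 6
# simulate correlated data: two common factors plus noise, so the covariance is not diagonal
B <- matrix(rnorm(2 * p), 2, p)
Z <- matrix(rnorm(n * 2), n, 2) %*% B + matrix(rnorm(n * p), n, p)
Omega <- solve(cov(Z))
X <- Z[, 1:2]
Y <- Z[, 3:p]
E <- resid(lm(X ~ Y))       # X with the part linearly explained by Y removed
cov(E)                      # empirical conditional (partial) covariance of the two X columns
solve(Omega[1:2, 1:2])      # inverse of the top-left precision block: the same matrix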
The link with linear projection

You have seen this in the past:

$$\widehat{\beta} = (Y'Y)^{-1} Y'X.$$

Recall that when the mean vector is zero, the cross-product is simply the covariance (up to a scaling constant). Now observe that the term $\Sigma_{YY}^{-1} \Sigma_{YX}$ in the expression above can be thought of (up to a constant) as the coefficient from a regression which explains the columns of $X$ using the columns of $Y$. Specifically, $\Sigma_{YX}$ is analogous to the cross-product $Y'X$ in the regression expression, capturing the covariance between $Y$ and $X$, and $\Sigma_{YY}^{-1}$ is analogous to $(Y'Y)^{-1}$ in that expression, capturing the covariance within the columns of $Y$. So the term $\Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}$ which we subtract from $\Sigma_{XX}$ is exactly the part of the variability of $X$ that is linearly explained (projected) by $Y$.
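A sketch of that analogy with simulated, demeaned data (the $X$ and $Y$ matrices here are stand-ins generated for illustration):

set.seed(5)
n <- 1000
Y <- scale(matrix(rnorm(n * 4), n, 4), scale = FALSE)   # demeaned "conditioning" block
X <- scale(Y %*% matrix(rnorm(4 * 2), 4, 2) + matrix(rnorm(n * 2), n, 2), scale = FALSE)
coef(lm(X ~ Y - 1))           # regression coefficients of the columns of X on the columns of Y
solve(cov(Y)) %*% cov(Y, X)   # the covariance-based counterpart: identical (up to floating point)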
Code illustration

Here is some R code to replicate the math above. We start by pulling some stock market data using the quantmod package:
library(quantmod)  # citation("quantmod")
k <- 10  # how many years back?
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf = c('XLY', 'XLP', 'XLE', 'XLF', 'XLV', 'XLI', 'XLB', 'XLK', 'XLU')
l <- length(symetf)
w0 <- NULL
for (i in 1:l) {
  dat0 = getSymbols(
    symetf[i],
    src = "yahoo",
    from = start,
    to = end,
    auto.assign = F,
    warnings = FALSE,
    symbol.lookup = F
  )
  w1 <- dailyReturn(dat0)
  w0 <- cbind(w0, w1)
}
time <- as.Date(substr(index(w0), 1, 10))
w0 <- 100 * as.matrix(w0)  # convert to %
colnames(w0) <- symetf
We will net out, or control for, the other variables by using the residuals from the node-wise regressions: regress $x_i$ (and separately $x_j$) on all the remaining columns and keep the residuals $\widehat{\varepsilon}_i$ and $\widehat{\varepsilon}_j$. So the partial linear dependence between $x_i$ and $x_j$, say, will be estimated by $cor(\widehat{\varepsilon}_i, \widehat{\varepsilon}_j)$, and the corresponding precision-matrix entry recovered as $\widehat{\omega}_{ij} = -\, cor(\widehat{\varepsilon}_i, \widehat{\varepsilon}_j)\, \sqrt{\widehat{\omega}_{ii}\, \widehat{\omega}_{jj}}$.
get_precision_entry <- function(data, column_i, column_j){
  precision_mat <- solve(cov(data))
  name_i <- which(colnames(data) == column_i)
  name_j <- which(colnames(data) == column_j)
  data <- as.matrix(data)
  # node-wise regressions: each of the two columns on all the *other* columns
  lm0 <- lm(data[, name_i] ~ data[, -c(name_i, name_j)])
  lm1 <- lm(data[, name_j] ~ data[, -c(name_i, name_j)])
  # partial correlation = correlation between the two residual series
  partial_correlation <- cor(lm0$res, lm1$res)
  diag1_diag2 <- precision_mat[column_i, column_i] * precision_mat[column_j, column_j]
  precision_entry_i_j <- - partial_correlation * sqrt(diag1_diag2)
  return(precision_entry_i_j)
}

precision_mat <- solve(cov(w0))
get_precision_entry(data = w0, column_i = "XLV", column_j = "XLF")
# [1] -0.328
precision_mat["XLV", "XLF"]
# [1] -0.328
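As a further check (building on the function above), we can loop over all pairs and rebuild every off-diagonal entry of the precision matrix from the node-wise regressions; the mismatch should be numerically negligible:

tickers <- colnames(w0)
p <- length(tickers)
rebuilt <- matrix(NA, p, p, dimnames = list(tickers, tickers))
for (i in 1:(p - 1)) {
  for (j in (i + 1):p) {
    rebuilt[i, j] <- get_precision_entry(data = w0, column_i = tickers[i], column_j = tickers[j])
  }
}
# compare with the directly inverted covariance matrix
max(abs(rebuilt[upper.tri(rebuilt)] - precision_mat[upper.tri(precision_mat)]))
# essentially zero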
To cite this blog post: "Eran Raviv (2024, June 6). Correlation and correlation structure (8) – the precision matrix. Retrieved on … month, year from https:…"
Footnotes and references
Some algebra rules which were used
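For completeness, these are the standard identities leaned on above (transposition, symmetry of the covariance and hence of the precision matrix, and a scalar being equal to its own transpose):

$$(AB)' = B'A', \qquad \Sigma' = \Sigma \ \Rightarrow\ \Omega' = \Omega \ \text{and} \ \Omega_{XY}' = \Omega_{YX},$$

$$x'\Omega_{XY}\, y = \left( x'\Omega_{XY}\, y \right)' = y'\Omega_{YX}\, x \quad \text{(a scalar equals its transpose)},$$

$$\text{and any term involving only } y \text{ can be treated as a constant once we condition on } Y = y.$$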