If you are reading this, you already know that the covariance matrix represents unconditional linear dependency between the variables. Far less mentioned is the bewitching fact that the elements of the inverse of the covariance matrix (i.e. the precision matrix) encode the conditional linear dependence between the variables. This post shows why that is the case. I start with the motivation to even discuss this, then the math, then some code.
Wide data matrices, where the number of columns $p$ exceeds the number of rows $n$, are widespread nowadays (pun intended). In this situation a good, stable and reliable estimate of the covariance matrix is hard to obtain. The usual sample covariance estimate when $p > n$ is simply lousy (for reasons out of scope here). Much is written about how best to adjust and correct that sample covariance estimate (see below for references). A motivation to shift the focus from the covariance matrix to the precision matrix is how we start.
Motivation: thresholding covariance estimation
One glaring issue is that the sample estimate is always dense. I mean, if you simulate data from a population with zero covariance between the variables (represented by a diagonal covariance matrix), the sample estimates of the off-diagonal elements will never be exactly zero. This is because the estimated covariance reflects the input data, which will naturally have some variability even if the true underlying covariance is zero. So setting small off-diagonal estimated elements to zero is good thinking, and we call this thresholding estimation. Let $\widehat{\sigma}_{ij}$ be the $(i,j)$ element of the sample covariance estimate and let $\lambda$ be some threshold value; then formally you have the new covariance estimate:

$$\widehat{\sigma}_{ij}^{(\lambda)} = \widehat{\sigma}_{ij}\; \mathbb{1}\!\left( \left|\widehat{\sigma}_{ij}\right| \ge \lambda \right),$$
where $\mathbb{1}(\cdot)$ is the indicator function, which returns zero whenever the absolute value of the entry falls below the $\lambda$ you set. Like that we can create a sparse covariance estimate.
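As a minimal sketch (my own toy code, not from any particular package), thresholding a sample covariance estimate in R could look like this:

```r
# Toy illustration of hard-thresholding a sample covariance estimate
set.seed(1)
n <- 50; p <- 100                          # wide: more columns than rows
dat <- matrix(rnorm(n * p), n, p)          # true covariance is the identity
sample_cov <- cov(dat)                     # dense, despite the diagonal truth
lambda <- 0.2                              # threshold value, chosen arbitrarily here
thresh_cov <- sample_cov * (abs(sample_cov) >= lambda)
mean(thresh_cov == 0)                      # share of entries now exactly zero
```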
Now, each entry in the covariance matrix is computed without reference to the other variables. The $(i,j)$ entry is simply $\operatorname{cov}(x_i, x_j)$. So for example if $x_k$ drives, and co-moves with, both $x_i$ and $x_j$, meaning $\operatorname{cov}(x_i, x_k)$ and $\operatorname{cov}(x_j, x_k)$ are both high, then $\operatorname{cov}(x_i, x_j)$ is also likely to be high. If we were able to net out the movement in $x_k$ from $\operatorname{cov}(x_i, x_j)$ that would be good. In other words we look for $\operatorname{cov}(x_i, x_j \mid x_k)$. Why would that be good? Because $\operatorname{cov}(x_i, x_j \mid x_k)$ is likely to be much lower than $\operatorname{cov}(x_i, x_j)$, so thresholding makes (even) better sense. We would expect many more zeros if we could find the partial covariances $\operatorname{cov}(x_i, x_j \mid \text{rest})$.
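To see the point of netting out a common driver, here is a small simulation of my own (not from the post): $x_i$ and $x_j$ are unrelated except that both load on $x_k$, so their unconditional covariance is sizeable while the partial covariance, controlling for $x_k$, is close to zero.

```r
# x_i and x_j co-move only because both are driven by x_k
set.seed(42)
n  <- 10000
xk <- rnorm(n)
xi <- 0.8 * xk + rnorm(n)
xj <- 0.8 * xk + rnorm(n)
cov(xi, xj)                                   # unconditional: around 0.64
cov(resid(lm(xi ~ xk)), resid(lm(xj ~ xk)))   # controlling for x_k: near zero
```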
The precision matrix
The precision matrix is the inverse of the covariance matrix. It just so happens that the entries of the precision matrix are exactly what we are looking for: the conditional (on the other variables) dependencies, rather than the unconditional ones we get from the covariance matrix. This morning I was walking with a fellow statistician, data scientist, quant, machine learning engineer, AI researcher, statistician, and he challenged me to explain this. I could not do much but accept the challenge. So, why is it that by inverting the covariance matrix we end up with a precision matrix which encodes the conditional dependence between any two variables after controlling for the rest?
Some gory details
Consider a data matrix with $n$ rows and $p$ columns. Partition it into two blocks: the first 2 columns as $X$ and the remaining $p - 2$ columns as $Y$. For the sake of simplicity assume that the variables are jointly Gaussian with mean zero. So

$$(X, Y) \sim \mathcal{N}(0, \Sigma).$$
The covariance and precision matrices are given by

$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}$$

and

$$\Omega = \Sigma^{-1} = \begin{pmatrix} \Omega_{XX} & \Omega_{XY} \\ \Omega_{YX} & \Omega_{YY} \end{pmatrix}.$$
I remind you what we are chasing here. $X$ is just a matrix with 2 columns, so $\Omega_{XX}$ is a $2 \times 2$ block and its off-diagonal is just one number. We want to show that this number captures the conditional dependence between the two columns of $X$, after accounting for all the other columns $Y$.
The order of things to come:
- Express $\Omega_{XX}$ in terms of the blocks of the covariance matrix $\Sigma$.
- Show that $\Omega_{XX}^{-1}$ is the covariance of the conditional distribution $X \mid Y$.
- Relate the result to the more general case of linear projection.
- Illustrate it with some code.
We start by expressing $\Omega_{XX}$ in terms of the blocks of the covariance matrix $\Sigma$:

$$\Omega_{XX} = \left( \Sigma_{XX} - \Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX} \right)^{-1}.$$
This expression comes from tedious matrix algebra, or better yet from something we know about this kind of block matrix and its inverse, namely the Schur complement.
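A quick numerical check of that identity (my own sketch, simulated data, base R only):

```r
# Verify: the XX block of the precision matrix equals the inverse Schur complement
set.seed(7)
A     <- matrix(rnorm(200 * 5), 200, 5)    # 200 rows, 5 columns
Sig   <- cov(A)
Omega <- solve(Sig)
Sxx <- Sig[1:2, 1:2]; Sxy <- Sig[1:2, 3:5]
Syx <- Sig[3:5, 1:2]; Syy <- Sig[3:5, 3:5]
schur_inv <- solve(Sxx - Sxy %*% solve(Syy) %*% Syx)
all.equal(schur_inv, Omega[1:2, 1:2])      # TRUE, up to numerical error
```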
(In the Gaussian case) the precision matrix is the covariance of the conditional distribution
With mean zero and covariance $\Sigma$, the joint density of $(x, y)$ is

$$f(x, y) \propto \exp\left\{ -\tfrac{1}{2} \begin{pmatrix} x \\ y \end{pmatrix}^{\top} \Omega \begin{pmatrix} x \\ y \end{pmatrix} \right\} = \exp\left\{ -\tfrac{1}{2}\left( x^{\top}\Omega_{XX}\,x + x^{\top}\Omega_{XY}\,y + y^{\top}\Omega_{YX}\,x + y^{\top}\Omega_{YY}\,y \right) \right\}.$$

Usually you see this written in terms of $\Sigma^{-1}$, but recall that $\Omega = \Sigma^{-1}$.
We can ignore anything that depends only on $y$ (specifically $y^{\top}\Omega_{YY}\,y$) since we condition on it, so we can treat it as a constant, and keep only the terms involving $x$ (the rest goes outside the brackets and is elegantly ignored by using $\propto$, which means "proportional to"):

$$f(x \mid y) \propto \exp\left\{ -\tfrac{1}{2}\left( x^{\top}\Omega_{XX}\,x + x^{\top}\Omega_{XY}\,y + y^{\top}\Omega_{YX}\,x \right) \right\}.$$
Using some additional algebra (see below for the rules we need here; in particular $y^{\top}\Omega_{YX}\,x = x^{\top}\Omega_{XY}\,y$, because $\Omega$ is symmetric and a scalar equals its transpose) we can write it as:

$$f(x \mid y) \propto \exp\left\{ -\tfrac{1}{2}\left( x^{\top}\Omega_{XX}\,x + 2\,x^{\top}\Omega_{XY}\,y \right) \right\}.$$
What we are going to do now is make it look like the normal density:

$$f(z) \propto \exp\left\{ -\tfrac{1}{2}\,(z - \mu)^{\top}\,\Sigma^{-1}\,(z - \mu) \right\},$$
so that we can recognize the covariance term in the expression. The trick is to add and subtract the same term (preserving the value of the overall expression) and complete the square, aka "completing the quadratic form". We have quite a few algebraic rules to keep in mind moving forward: transposing, noting that the covariance matrix is symmetric, and that we can treat anything which does not depend on $x$ as a constant, even $y$, since it does not change with the realization of $x$. I wrote down the rules we use below, and here you can find what I use for verification.
The term we need to add and subtract is $y^{\top}\Omega_{YX}\,\Omega_{XX}^{-1}\,\Omega_{XY}\,y$ (it does not depend on $x$, so the subtracted copy is swallowed by the proportionality constant) to get

$$f(x \mid y) \propto \exp\left\{ -\tfrac{1}{2}\left( x + \Omega_{XX}^{-1}\Omega_{XY}\,y \right)^{\top} \Omega_{XX} \left( x + \Omega_{XX}^{-1}\Omega_{XY}\,y \right) \right\}.$$
So this looks like what we aimed for:

$$X \mid Y = y \;\sim\; \mathcal{N}\!\left( -\Omega_{XX}^{-1}\Omega_{XY}\,y,\; \Omega_{XX}^{-1} \right),$$
as if $\mu = -\Omega_{XX}^{-1}\Omega_{XY}\,y$ and $\Sigma_{X \mid Y} = \Omega_{XX}^{-1}$, especially if you remember that $\Omega_{XX}$ is itself an inverse (so it actually sits "in the denominator" of that exponent):

$$\Sigma_{X \mid Y} = \Omega_{XX}^{-1} = \Sigma_{XX} - \Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}.$$
Cool cool, but what do we have? That the variance of the conditional distribution of $X$ given $Y$ is $\Sigma_{XX} - \Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}$; an expression which says that from $\Sigma_{XX}$, which you can think of as the total variability, you need to subtract the part which is explained by $Y$: $\Sigma_{XY}\,\Sigma_{YY}^{-1}\,\Sigma_{YX}$.
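To convince yourself numerically, here is a short sketch of mine (simulated data, the names are not from the post): the covariance of $X$ after netting out $Y$ coincides with $\Omega_{XX}^{-1}$.

```r
# Conditional covariance of X given Y equals the inverse of the XX precision block
set.seed(11)
n <- 500; p <- 5
Sig_true <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)  # a valid covariance matrix
A <- matrix(rnorm(n * p), n, p) %*% chol(Sig_true)           # correlated Gaussian data
X <- A[, 1:2]; Y <- A[, 3:5]
Omega <- solve(cov(A))
cond_cov <- cov(resid(lm(X ~ Y)))            # covariance of X after netting out Y
all.equal(cond_cov, solve(Omega[1:2, 1:2]), check.attributes = FALSE)  # TRUE
```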
The link with linear projection

You have seen this in the past (standard regression notation here, not our $X$, $Y$ blocks):

$$\widehat{\beta} = \left( X^{\top} X \right)^{-1} X^{\top} y.$$
Recall that when the mean vector is zero, $X^{\top}X$ is simply the covariance (up to a scaling constant). Now observe that the expression $\Sigma_{YY}^{-1}\,\Sigma_{YX}$ above can be thought of (up to a constant) as the coefficient from a regression which explains the columns of $X$ using the columns of $Y$. Specifically, $\Sigma_{YX}$ is analogous to the cross-product $X^{\top}y$ in that formula, capturing the covariance between $Y$ and $X$, and $\Sigma_{YY}$ is analogous to $X^{\top}X$ in that expression, capturing the covariance within the columns of $Y$.
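The analogy can be checked directly; a small sketch of mine, reusing A, X and Y from the previous snippet: with demeaned data, $\Sigma_{YY}^{-1}\Sigma_{YX}$ reproduces the slope coefficients from regressing the columns of $X$ on the columns of $Y$.

```r
# Sigma_YY^{-1} Sigma_YX equals the slopes of the regression of X on Y
S <- cov(A)
beta_from_cov <- solve(S[3:5, 3:5]) %*% S[3:5, 1:2]      # Sigma_YY^{-1} Sigma_YX
beta_from_lm  <- coef(lm(X ~ Y))[-1, ]                   # drop the intercept row
all.equal(unname(beta_from_cov), unname(beta_from_lm))   # TRUE
```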
Code illustration

Here is some R code to replicate the math above. We start by pulling some stock market data using the quantmod package:
```r
library(quantmod)  # citation("quantmod")

k <- 10  # how many years back?
end   <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf <- c('XLY', 'XLP', 'XLE', 'XLF', 'XLV', 'XLI', 'XLB', 'XLK', 'XLU')
l <- length(symetf)
w0 <- NULL
for (i in 1:l) {
  dat0 <- getSymbols(
    symetf[i],
    src = "yahoo",
    from = start,
    to = end,
    auto.assign = FALSE,
    warnings = FALSE,
    symbol.lookup = FALSE
  )
  w1 <- dailyReturn(dat0)
  w0 <- cbind(w0, w1)
}
time <- as.Date(substr(index(w0), 1, 10))
w0 <- 100 * as.matrix(w0)  # convert to %
colnames(w0) <- symetf
```
We will net out, or control for, the other variables by using the residuals from the node-wise regressions: each of the two columns of interest, $x_i$ and $x_j$, is regressed on all the remaining columns, and we keep the residuals $\varepsilon_i$ and $\varepsilon_j$. So the partial linear dependence between XLV and, say, XLF will be estimated by $\operatorname{cor}(\varepsilon_{XLV}, \varepsilon_{XLF})$.
```r
get_precision_entry <- function(data, column_i, column_j) {
  # The precision matrix: inverse of the sample covariance matrix
  precision_mat <- solve(cov(data))
  data <- as.matrix(data)
  name_i <- which(colnames(data) == column_i)
  name_j <- which(colnames(data) == column_j)
  # Node-wise regressions: regress each of the two columns on all the others
  # (excluding both), and keep the residuals
  lm0 <- lm(data[, name_i] ~ data[, -c(name_i, name_j)])
  lm1 <- lm(data[, name_j] ~ data[, -c(name_i, name_j)])
  # Partial correlation between column_i and column_j, controlling for the rest
  partial_correlation <- cor(lm0$residuals, lm1$residuals)
  # Rescale: omega_ij = -partial_correlation * sqrt(omega_ii * omega_jj)
  diag1_diag2 <- precision_mat[column_i, column_i] * precision_mat[column_j, column_j]
  precision_entry_i_j <- -partial_correlation * sqrt(diag1_diag2)
  return(precision_entry_i_j)
}

precision_mat <- solve(cov(w0))
get_precision_entry(data = w0, column_i = "XLV", column_j = "XLF")
# [1] -0.328
precision_mat["XLV", "XLF"]
# [1] -0.328
```
To cite this blog post “Eran Raviv (2024, June 6). Correlation and correlation structure (8) – the precision matrix. Retrieved on … month, year from https:…”
Footnotes and references
Some algebra rules which were used
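In brief, the identities relied on in the derivation above are the usual ones:

- Transpose of a product: $(AB)^{\top} = B^{\top}A^{\top}$, and $(A^{-1})^{\top} = (A^{\top})^{-1}$.
- Both $\Sigma$ and $\Omega = \Sigma^{-1}$ are symmetric, so $\Sigma_{YX} = \Sigma_{XY}^{\top}$ and $\Omega_{YX} = \Omega_{XY}^{\top}$.
- A scalar equals its own transpose: $x^{\top}\Omega_{XY}\,y = \left(x^{\top}\Omega_{XY}\,y\right)^{\top} = y^{\top}\Omega_{YX}\,x$, which is why the two cross terms collapse into $2\,x^{\top}\Omega_{XY}\,y$.
- Anything that does not involve $x$ (for example $y^{\top}\Omega_{YY}\,y$, or the term we add and subtract) can be absorbed into the proportionality constant once we condition on $y$.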