Datasets have grown from large to massive, and so we increasingly find ourselves refactoring for readability and prioritizing computational efficiency (speed). The computing time for the ever-important sample covariance estimate of a dataset $X$, with $n$ observations and $p$ variables, is $O(np^2)$. Although a single covariance calculation for today's large datasets is still manageable, it's computationally prohibitive to use the bootstrap or related resampling methods that require very many repetitions, where each repetition demands its own covariance computation. Without fast computation, the bootstrap remains impractical for high-dimensional problems. And that, we can undoubtedly all agree, is a tragedy.
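To make the arithmetic concrete, here is a minimal sketch (the matrix size and the number of repetitions are my own illustrative choices, not benchmarks): time a single covariance call and extrapolate to $B$ bootstrap replications.

```r
# Minimal sketch: time one covariance and extrapolate to B bootstrap
# replications (sizes and B are illustrative choices, not benchmarks).
n <- 5000
p <- 1000
X <- matrix(rnorm(n * p), n, p)
one_cov <- system.time(cov(X))["elapsed"]
B <- 1000  # a typical number of bootstrap repetitions
cat("one covariance:", round(one_cov, 2), "sec;",
    "rough cost of", B, "replications:", round(B * one_cov / 60), "minutes\n")
```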
So, what can we do to restore resampling methods to our toolkit? We can reduce computing times, and appreciably so, if we compute in parallel; waiting times can drop from overnight to a matter of minutes, or even seconds. Related to this, I wrote a post about Randomized Matrix Multiplication where I offer a computationally cheaper approximation instead of the exact, but slower-to-compute, procedure.
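For flavor, here is a rough sketch of that randomized idea (my own minimal version, not necessarily the code from that post): approximate the cross-product $X_c^\top X_c$ by sampling rows with probabilities proportional to their squared norms.

```r
# Rough sketch of randomized matrix multiplication for crossprod(Xc):
# sample s of the n rows with probabilities proportional to squared row
# norms, rescale, and use the sampled cross-product as an approximation
# (it is exact in expectation).
n <- 2000; p <- 500
Xc <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
s <- round(n / 10)                      # sample 10% of the rows (illustrative)
probs <- rowSums(Xc^2) / sum(Xc^2)
idx <- sample(n, s, replace = TRUE, prob = probs)
Xs <- Xc[idx, ] / sqrt(s * probs[idx])  # rescale sampled rows
approx_XtX <- crossprod(Xs)             # approximates crossprod(Xc)
```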
This post you now read was inspired by a question from Laura Balzano (University of Michigan) who asked if we can’t get an exact solution (rather than an approximation) using parallel computing shown in that other post. I spent some time thinking about it and indeed it’s possible, and valuable. So with that context out of the way, here is the Rython (R + Python) code to calculate the sample covariance estimate in parallel, with some indication for time saved. Use it when you have large matrices and you need the sample covariance matrix or derivative thereof.
Parallelizing Matrix Computation – Rython code
Assume a matrix $X$ with $n = 10{,}000$ rows and $p = 2{,}500$ columns; calculating the sample covariance takes around 40 seconds on my machine:
```r
TT <- 10000
p  <- 2500
X  <- matrix(rnorm(TT * p), TT, p)
Xc <- scale(X, center = TRUE, scale = FALSE)
no_parallel <- cov(Xc)
system.time(cov(Xc))
#   user  system elapsed
#  41.99    0.00   42.00
```
It's worth acknowledging NumPy's impressively optimized backend relative to base R; I had to increase the matrix dimensions fivefold to get a comparable waiting time.
```python
import numpy as np
from multiprocessing import Pool, cpu_count
from itertools import product
import time

TT = 50000
p = 12500
X = np.random.randn(TT, p)
Xc = X - np.mean(X, axis=0)

start_time = time.time()
tmpp = np.cov(Xc, rowvar=False)
elapsed = time.time() - start_time
print(f"Finished in {elapsed:.2f} seconds.")
# Finished in 18.75 seconds.
```
Now let's parallelize. The trick is to break the big matrix into smaller chunks/blocks, compute the cross-products of those small chunks, and carefully rearrange the results back into the original dimensions. In the code below, block 1 has our variables indexed from $1$ through $p/5$, block 2 has indices $p/5 + 1$ through $2p/5$, and so on until block 5. There is no reason to choose exactly 5 chunks; experiment at your leisure (which I am certain you will). Next we create a grid that pairs each possible combination of these blocks. Now you can send your individual "workers" to work on those blocks separately. When done, rearrange everything back into covariance form, which shouldn't take long.
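In equation form (a small clarifying step; I write $X_i$ for the columns of the centered matrix $X_c$ that fall in block $i$, so $X_c = [X_1, \ldots, X_5]$):

$$
S = \frac{X_c^\top X_c}{n-1}
\quad\Longrightarrow\quad
S_{(i,j)} = \frac{X_i^\top X_j}{n-1},
\qquad
S_{(j,i)} = S_{(i,j)}^\top,
$$

so each worker computes one cross-product $X_i^\top X_j$ independently, and the resulting blocks tile the full covariance matrix.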
```r
library(parallel)

blocks <- list(
  1:(p/5),
  (1*(p/5)+1):(2*(p/5)),
  (2*(p/5)+1):(3*(p/5)),
  (3*(p/5)+1):(4*(p/5)),
  (4*(p/5)+1):p
)
ij <- expand.grid(i = 1:5, j = 1:5)

cl <- makeCluster(5)
clusterExport(cl, varlist = c("Xc", "ij", "blocks"))
system.time(
  out <- parLapply(cl, 1:nrow(ij), function(k) {
    i <- ij$i[k]
    j <- ij$j[k]
    crossprod(Xc[, blocks[[i]]], Xc[, blocks[[j]]])
  })
)
#   user  system elapsed
#   0.04    0.01   10.27
stopCluster(cl)

S <- matrix(0, p, p)
for (k in 1:length(out)) {
  i <- ij$i[k]
  j <- ij$j[k]
  S[blocks[[i]], blocks[[j]]] <- out[[k]]
}
S[lower.tri(S)] <- t(S)[lower.tri(S)]  # due to symmetry
S <- S / (nrow(X) - 1)  # because we used crossprod, not cov, in the loop above

# Now check there is no hiccup
all.equal(no_parallel, S)
# [1] TRUE
```
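An easy refinement, by the way (a sketch of my own, reusing the objects from above): since the covariance matrix is symmetric, you don't actually need all 25 blocks; the 15 blocks with $i \le j$ suffice, and the rest follow by transposition.

```r
# Sketch: compute only the upper-triangular blocks (i <= j) and mirror
# them, cutting the number of cross-products from 25 to 15.
ij_upper <- subset(expand.grid(i = 1:5, j = 1:5), i <= j)
cl <- makeCluster(5)
clusterExport(cl, varlist = c("Xc", "ij_upper", "blocks"))
out_upper <- parLapply(cl, 1:nrow(ij_upper), function(k) {
  i <- ij_upper$i[k]
  j <- ij_upper$j[k]
  crossprod(Xc[, blocks[[i]]], Xc[, blocks[[j]]])
})
stopCluster(cl)

S2 <- matrix(0, p, p)
for (k in 1:nrow(ij_upper)) {
  i <- ij_upper$i[k]
  j <- ij_upper$j[k]
  S2[blocks[[i]], blocks[[j]]] <- out_upper[[k]]
}
S2[lower.tri(S2)] <- t(S2)[lower.tri(S2)]  # mirror by symmetry
S2 <- S2 / (nrow(X) - 1)
all.equal(S, S2)  # should be TRUE
```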
For completeness here is the Python code, but you don’t need it as I explain directly after.
```python
# from multiprocessing import Pool, cpu_count  # as another option
import time
from joblib import Parallel, delayed

block_size = p // 5
blocks = [np.arange(i * block_size, (i + 1) * block_size) for i in range(5)]
ij = list(product(range(5), repeat=2))

def compute_block(i, j):
    return Xc[:, blocks[i]].T @ Xc[:, blocks[j]]

start_time = time.time()
out = Parallel(n_jobs=5, backend="threading")(
    delayed(compute_block)(i, j) for i, j in ij
)
elapsed = time.time() - start_time
print(f"Finished in {elapsed:.2f} seconds.")
# Finished in 38.36 seconds.

S = np.zeros((p, p))
# Fill in blocks from parallel output
for k, (i, j) in enumerate(ij):  # ij = list of (i, j) pairs
    rows = blocks[i]
    cols = blocks[j]
    S[np.ix_(rows, cols)] = out[k]

# Fill the lower triangle by symmetry
i_lower = np.tril_indices(p, -1)
S[i_lower] = S.T[i_lower]

# Normalize
S = S / (X.shape[0] - 1)

np.allclose(tmpp, S)
# True
```
As you can see it takes longer, which caught me by surprise. After some digging, making sure there are no bugs, I understood that NumPy's impressive speed is due to the fact that it's already optimized to run in parallel in the backend. You can run np.__config__.show() to see the lower-level software used behind the scenes, with some heavy-hitters like LAPACK and BLAS, which are specifically designed for matrix math on modern CPUs. Explicit parallelization then only creates additional overhead. Interesting stuff. Plus 1 for Python.
Appendix
When you refactor code, you will also do well to look under the hood at what the function is doing, and perhaps remove all kinds of checks that are not needed for your use case (class/type checks, NA checks, etc.).
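As a small illustration: once you know the input is a clean, centered numeric matrix, the bare cross-product does the same work as cov() without the dispatch and checking overhead.

```r
# Bare-bones covariance: skip cov()'s checks once the input is known to
# be a clean numeric matrix.
X  <- matrix(rnorm(10000 * 500), 10000, 500)
Xc <- scale(X, center = TRUE, scale = FALSE)
S_bare <- crossprod(Xc) / (nrow(X) - 1)
all.equal(cov(X), S_bare)  # TRUE
```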
Also, if you wish to refer to this post, you can cite it like so:
Eran Raviv (2025). “Parallelizing Matrix Computation.”