Datasets have grown from large to massive, and so we increasingly find ourselves refactoring for readability and prioritizing computational efficiency (speed). The computing time for the ever-important sample covariance estimate of a dataset $X$, with $n$ observations and $p$ variables, is $O(np^2)$. Although a single covariance calculation for today's large datasets is still manageable, it's computationally prohibitive to use the bootstrap or related resampling methods that require very many repetitions, where each repetition demands its own covariance computation. Without fast computation, the bootstrap remains impractical for high-dimensional problems. And that, we can undoubtedly all agree, is a tragedy.
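To make the arithmetic concrete, here is a minimal sketch (the matrix size and the number of repetitions are my own illustrative choices, not benchmarks): time a single covariance call and extrapolate to $B$ bootstrap replications.

```r
# Minimal sketch: time one covariance and extrapolate to B bootstrap
# replications (sizes and B are illustrative choices, not benchmarks).
n <- 5000
p <- 1000
X <- matrix(rnorm(n * p), n, p)
one_cov <- system.time(cov(X))["elapsed"]
B <- 1000  # a typical number of bootstrap repetitions
cat("one covariance:", round(one_cov, 2), "sec;",
    "rough cost of", B, "replications:", round(B * one_cov / 60), "minutes\n")
```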
So, what can we do to restore resampling methods to our toolkit? We can reduce computing times, and appreciably so, if we compute in parallel; waiting times can drop from overnight to a matter of minutes, or even seconds. Related to this, I wrote a post about Randomized Matrix Multiplication where I offer a computationally cheaper approximation instead of the exact, but slower-to-compute, procedure.
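For flavor, here is a rough sketch of that randomized idea (my own minimal version, not necessarily the code from that post): approximate the cross-product $X_c^\top X_c$ by sampling rows with probabilities proportional to their squared norms.

```r
# Rough sketch of randomized matrix multiplication for crossprod(Xc):
# sample s of the n rows with probabilities proportional to squared row
# norms, rescale, and use the sampled cross-product as an approximation
# (it is exact in expectation).
n <- 2000; p <- 500
Xc <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
s <- round(n / 10)                      # sample 10% of the rows (illustrative)
probs <- rowSums(Xc^2) / sum(Xc^2)
idx <- sample(n, s, replace = TRUE, prob = probs)
Xs <- Xc[idx, ] / sqrt(s * probs[idx])  # rescale sampled rows
approx_XtX <- crossprod(Xs)             # approximates crossprod(Xc)
```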
This post you now read was inspired by a question from Laura Balzano (University of Michigan) who asked if we can’t get an exact solution (rather than an approximation) using parallel computing shown in that other post. I spent some time thinking about it and indeed it’s possible, and valuable. So with that context out of the way, here is the Rython (R + Python) code to calculate the sample covariance estimate in parallel, with some indication for time saved. Use it when you have large matrices and you need the sample covariance matrix or derivative thereof.
Parallelizing Matrix Computation – Rython code
Assume a matrix $X$ with $n = 10{,}000$ rows and $p = 2{,}500$ columns; calculating the sample covariance takes around 40 seconds on my machine:
```r
TT <- 10000
p  <- 2500
X  <- matrix(rnorm(TT * p), TT, p)
Xc <- scale(X, center = TRUE, scale = FALSE)
no_parallel <- cov(Xc)
system.time(cov(Xc))
#   user  system elapsed
#  41.99    0.00   42.00
```
It's worth acknowledging NumPy's impressively optimized backend relative to base R; I had to increase the matrix dimensions fivefold to get a comparable waiting time.
```python
import numpy as np
from multiprocessing import Pool, cpu_count
from itertools import product
import time

TT = 50000
p = 12500
X = np.random.randn(TT, p)
Xc = X - np.mean(X, axis=0)

start_time = time.time()
tmpp = np.cov(Xc, rowvar=False)
elapsed = time.time() - start_time
print(f"Finished in {elapsed:.2f} seconds.")
# Finished in 18.75 seconds.
```
Now let's parallelize. The trick is to break the big matrix into smaller chunks/blocks, compute the cross-products of those small chunks, and carefully rearrange the results back into the original dimensions. In the code below, block 1 has our variables indexed from $1$ through $p/5$, block 2 has indices $p/5 + 1$ through $2p/5$, and so on until block 5. There is no reason to choose exactly 5 chunks; experiment at your leisure (which I am certain you will). Next we create a grid that pairs each possible combination of these blocks. Now you can send your individual "workers" to work on those blocks separately. When done, rearrange everything back into covariance form, which shouldn't take long.
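In equation form (a small clarifying step; I write $X_i$ for the columns of the centered matrix $X_c$ that fall in block $i$, so $X_c = [X_1, \ldots, X_5]$):

$$
S = \frac{X_c^\top X_c}{n-1}
\quad\Longrightarrow\quad
S_{(i,j)} = \frac{X_i^\top X_j}{n-1},
\qquad
S_{(j,i)} = S_{(i,j)}^\top,
$$

so each worker computes one cross-product $X_i^\top X_j$ independently, and the resulting blocks tile the full covariance matrix.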
```r
library(parallel)

blocks <- list(
  1:(p/5),
  (1*(p/5)+1):(2*(p/5)),
  (2*(p/5)+1):(3*(p/5)),
  (3*(p/5)+1):(4*(p/5)),
  (4*(p/5)+1):p
)
ij <- expand.grid(i = 1:5, j = 1:5)

cl <- makeCluster(5)
clusterExport(cl, varlist = c("Xc", "ij", "blocks"))
system.time(
  out <- parLapply(cl, 1:nrow(ij), function(k) {
    i <- ij$i[k]
    j <- ij$j[k]
    crossprod(Xc[, blocks[[i]]], Xc[, blocks[[j]]])
  })
)
#   user  system elapsed
#   0.04    0.01   10.27
stopCluster(cl)

S <- matrix(0, p, p)
for (k in 1:length(out)) {
  i <- ij$i[k]
  j <- ij$j[k]
  S[blocks[[i]], blocks[[j]]] <- out[[k]]
}
S[lower.tri(S)] <- t(S)[lower.tri(S)]  # due to symmetry
S <- S / (nrow(X) - 1)  # because we used crossprod, not cov, in the loop above

# Now check there is no hiccup
all.equal(no_parallel, S)
# [1] TRUE
```
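An easy refinement, by the way (a sketch of my own, reusing the objects from above): since the covariance matrix is symmetric, you don't actually need all 25 blocks; the 15 blocks with $i \le j$ suffice, and the rest follow by transposition.

```r
# Sketch: compute only the upper-triangular blocks (i <= j) and mirror
# them, cutting the number of cross-products from 25 to 15.
ij_upper <- subset(expand.grid(i = 1:5, j = 1:5), i <= j)
cl <- makeCluster(5)
clusterExport(cl, varlist = c("Xc", "ij_upper", "blocks"))
out_upper <- parLapply(cl, 1:nrow(ij_upper), function(k) {
  i <- ij_upper$i[k]
  j <- ij_upper$j[k]
  crossprod(Xc[, blocks[[i]]], Xc[, blocks[[j]]])
})
stopCluster(cl)

S2 <- matrix(0, p, p)
for (k in 1:nrow(ij_upper)) {
  i <- ij_upper$i[k]
  j <- ij_upper$j[k]
  S2[blocks[[i]], blocks[[j]]] <- out_upper[[k]]
}
S2[lower.tri(S2)] <- t(S2)[lower.tri(S2)]  # mirror by symmetry
S2 <- S2 / (nrow(X) - 1)
all.equal(S, S2)  # should be TRUE
```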
For completeness here is the Python code, but you don’t need it as I explain directly after.
```python
# from multiprocessing import Pool, cpu_count  # as another option
import time
from joblib import Parallel, delayed

block_size = p // 5
blocks = [np.arange(i * block_size, (i + 1) * block_size) for i in range(5)]
ij = list(product(range(5), repeat=2))

def compute_block(i, j):
    return Xc[:, blocks[i]].T @ Xc[:, blocks[j]]

start_time = time.time()
out = Parallel(n_jobs=5, backend="threading")(
    delayed(compute_block)(i, j) for i, j in ij
)
elapsed = time.time() - start_time
print(f"Finished in {elapsed:.2f} seconds.")
# Finished in 38.36 seconds.

S = np.zeros((p, p))
# Fill in blocks from parallel output
for k, (i, j) in enumerate(ij):  # ij = list of (i, j) pairs
    rows = blocks[i]
    cols = blocks[j]
    S[np.ix_(rows, cols)] = out[k]

# Fill the lower triangle by symmetry
i_lower = np.tril_indices(p, -1)
S[i_lower] = S.T[i_lower]

# Normalize
S = S / (X.shape[0] - 1)

np.allclose(tmpp, S)
# True
```
As you can see it takes longer, which caught me by surprise. After some digging, making sure there are no bugs, I understood that NumPy's impressive speed is due to the fact that it's already optimized to run in parallel in the backend. You can run np.__config__.show() to see the lower-level software used behind the scenes, with some heavy-hitters like LAPACK and BLAS, which are specifically designed for matrix math on modern CPUs. Explicit parallelization then only creates additional overhead. Interesting stuff. Plus 1 for Python.
Appendix
When you refactor code, you will also do well to look under the hood at what the function is doing, and perhaps remove all kinds of checks that are not needed for your use case (class/type checks, NA checks, etc.).
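As a small illustration: once you know the input is a clean, centered numeric matrix, the bare cross-product does the same work as cov() without the dispatch and checking overhead.

```r
# Bare-bones covariance: skip cov()'s checks once the input is known to
# be a clean numeric matrix.
X  <- matrix(rnorm(10000 * 500), 10000, 500)
Xc <- scale(X, center = TRUE, scale = FALSE)
S_bare <- crossprod(Xc) / (nrow(X) - 1)
all.equal(cov(X), S_bare)  # TRUE
```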
Also, if you wish to refer to this post, you can cite it like so:
Eran Raviv (2025). “Parallelizing Matrix Computation.”