Correlation and correlation structure (9) – Parallelizing Matrix Computation

Datasets have grown from large to massive, and so we increasingly find ourselves refactoring code not only for readability but for computational efficiency (speed). The computing time for the ever-important sample covariance estimate of a dataset X \in \mathbb{R}^{n \times p}, with n observations and p variables, is \mathcal{O}(n p^2). Although a single covariance calculation for today's large datasets is still manageable, it's computationally prohibitive to use the bootstrap, or related resampling methods, which require very many repetitions, each repetition demanding its own covariance computation. Without fast computation the bootstrap remains impractical for high-dimensional problems. And that, we undoubtedly all agree, is a tragedy.

So, what can we do to restore resampling methods to our toolkit? We can reduce computing times, and appreciably so, if we compute in parallel; waiting times can drop from overnight to minutes, or even seconds. Related to this, I wrote a post about Randomized Matrix Multiplication where I offer a computationally cheaper approximation instead of the exact, but slower-to-compute, procedure.

The post you are now reading was inspired by a question from Laura Balzano (University of Michigan), who asked whether we can get an exact solution (rather than an approximation) using the kind of parallel computing shown in that other post. I spent some time thinking about it, and indeed it's possible, and valuable. So with that context out of the way, here is the Rython (R + Python) code to calculate the sample covariance estimate in parallel, with some indication of the time saved. Use it when you have large matrices and need the sample covariance matrix or a derivative thereof.

Parallelizing Matrix Computation – Rython code

Assume a matrix X \in \mathbb{R}^{10^4 \times 2500}; calculating the sample covariance takes around 40 seconds on my machine:
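As a baseline, here is a minimal R sketch of that timing (simulated data; the 40-second figure is machine-dependent):

```r
# Simulate a large data matrix: n = 10^4 observations, p = 2500 variables
n <- 1e4
p <- 2500
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Baseline: time the built-in sample covariance
system.time(covx <- cov(x))
```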

It's worth acknowledging NumPy's impressively optimized backend relative to base R; I had to increase the matrix dimensions fivefold to get a comparable waiting time.
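For illustration, a NumPy sketch of that baseline; applying the fivefold scaling to the number of rows is my assumption:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5 * 10**4, 2500))  # fivefold the R example's n (assumed)

tic = time.time()
covx = np.cov(x, rowvar=False)              # columns are the variables
print(f"{time.time() - tic:.1f} seconds")   # machine-dependent
```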

Now let's parallelize. The trick is to break the big matrix into smaller chunks (blocks), compute the covariance of those small chunks, and carefully rearrange the results back into the original p \times p dimensions. In the code below, block 1 holds our variables indexed 1 through p/5, block 2 holds indices (p/5)+1 through 2(p/5), and so on up to block 5. There is no reason to choose exactly 5 chunks; experiment at your leisure (which I am certain you will). Next we create a grid that pairs each possible combination of these blocks; since the covariance matrix is symmetric, only the pairs with i \le j are needed. Now you can send your individual "workers" to work on those blocks separately. When done, rearrange the results back into covariance form, which shouldn't take long.
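Here is a minimal R sketch of that recipe using base R's parallel package; the block count, cluster size, and variable names are my own choices, not necessarily those of the original code:

```r
library(parallel)

n <- 1e4
p <- 2500
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

nblocks <- 5                                                # arbitrary; experiment
blocks  <- split(1:p, rep(1:nblocks, each = p / nblocks))   # assumes p divisible by nblocks

# Grid of block pairs; by symmetry we only need pairs with i <= j
grid  <- expand.grid(i = 1:nblocks, j = 1:nblocks)
grid  <- grid[grid$i <= grid$j, ]
pairs <- split(grid, seq_len(nrow(grid)))

# Each worker computes one (p/nblocks) x (p/nblocks) cross-covariance block
cl <- makeCluster(max(1L, detectCores() - 1L))
clusterExport(cl, c("x", "blocks"))
out <- parLapply(cl, pairs, function(ij) {
  cov(x[, blocks[[ij$i]]], x[, blocks[[ij$j]]])
})
stopCluster(cl)

# Rearrange the blocks back into the full p x p covariance matrix
covx <- matrix(0, p, p)
for (k in seq_along(pairs)) {
  i <- pairs[[k]]$i; j <- pairs[[k]]$j
  covx[blocks[[i]], blocks[[j]]] <- out[[k]]
  if (i != j) covx[blocks[[j]], blocks[[i]]] <- t(out[[k]])
}
```

You can verify the rearrangement against the serial computation with all.equal(covx, cov(x)).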

For completeness, here is the Python code, though as I explain directly after, you don't need it.
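A sketch of the same block recipe under my own assumptions (standard-library multiprocessing, the same 5 blocks, manual centering); note that on spawn-based platforms each worker re-imports the module and re-creates the data, which is part of the overhead discussed next:

```python
import numpy as np
from itertools import combinations_with_replacement
from multiprocessing import Pool

n, p, nblocks = 10_000, 2500, 5

rng = np.random.default_rng(0)
x = rng.standard_normal((n, p))
xc = x - x.mean(axis=0)                         # center once, up front
blocks = np.array_split(np.arange(p), nblocks)  # column-index blocks

def block_cov(pair):
    # One worker computes one cross-covariance block (denominator n - 1)
    i, j = pair
    return i, j, xc[:, blocks[i]].T @ xc[:, blocks[j]] / (n - 1)

if __name__ == "__main__":
    pairs = list(combinations_with_replacement(range(nblocks), 2))
    with Pool() as pool:
        results = pool.map(block_cov, pairs)

    # Rearrange the blocks back into the full p x p covariance matrix
    covx = np.empty((p, p))
    for i, j, cij in results:
        covx[np.ix_(blocks[i], blocks[j])] = cij
        if i != j:
            covx[np.ix_(blocks[j], blocks[i])] = cij.T
```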

As you can see, it takes longer, which caught me by surprise. After some digging, and making sure there are no bugs, I understood that NumPy's impressive speed comes from the fact that it is already optimized to run in parallel in the backend. You can call np.__config__.show() to see the lower-level software used behind the scenes, with some heavy-hitters like LAPACK and BLAS, which are specifically designed for matrix math on modern CPUs. Explicit parallelization then only creates additional overhead. Interesting stuff. Plus 1 for Python.
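To see this on your own machine:

```python
import numpy as np
np.__config__.show()   # prints the BLAS/LAPACK builds NumPy links against
```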

Appendix

When you refactor code, you will also do well to look under the hood at what a function is actually doing, and perhaps remove checks that are not needed in your setting (class/type checks, NA checks, etc.).
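For example, a sketch of this idea in R: cov() validates its input before doing the math, so for a clean numeric matrix you can center once and take the crossproduct directly:

```r
# Skips cov()'s input validation; only safe for clean numeric data with no NAs
xc <- scale(x, center = TRUE, scale = FALSE)
covx_fast <- crossprod(xc) / (nrow(x) - 1)   # numerically equal to cov(x)
```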

Also, if you wish to refer to this post, you can cite it like so:
Eran Raviv (2025). “Parallelizing Matrix Computation.”
