Covariance Estimation for Wide Data

My work on covariance estimation has recently been published as an Advanced Review in WIREs Computational Statistics, a highly regarded, peer-reviewed journal in the field. It feels remarkably rewarding to see a decade of my curiosity finally bound together in one place.
The writing process started about 4.5 years ago as a side project, on evenings, weekends, and holidays. But I actually wrote my first post about multivariate volatility forecasting well over a decade ago, and it turned out to be the first in a series of (currently) 11 posts on this topic. A decade of reading, coding, tinkering, and revisiting the problem from different angles. This reminds me of the quote (commonly attributed to Albert Einstein):

It’s not that I’m so smart, it’s just that I stay with problems longer.

Progress doesn’t have to be loud, and it doesn’t have to be fast. Even a few hours here and there on weekends and holidays compound nicely over time. Crossing the finish line is thoroughly gratifying, but the climb is where the actual value is. Stay with the problem, and don’t let go of your back-burner ideas.

A quick word on what’s inside the paper (eject here if covariance estimation is not relevant for you). When the number of variables far exceeds the number of observations ($p \gg n$), estimating covariance matrices becomes especially challenging: the sample covariance matrix is then singular, so it cannot be inverted. The publication brings together different approaches usually siloed across statistics, econometrics, and machine learning: factor models, linear and nonlinear shrinkage, thresholding, block averaging, graphical models, and random matrix theory. It also includes a dedicated section on ensuring a valid (e.g. invertible) covariance matrix estimate in real-world applications. After walking the reader through these diverse methodologies, I placed them all under a single umbrella: a unified view can help pinpoint the explicit assumptions behind each of the many modelling choices we face.
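
To make the $p \gg n$ problem concrete, here is a minimal sketch in plain Python. It is not taken from the paper. It simulates a small "wide" data set, shows that the sample covariance matrix fails a positive-definiteness check, and then shows that linear shrinkage towards a scaled identity passes it. The helper names (`sample_covariance`, `is_positive_definite`, `linear_shrinkage`), the dimensions, and the shrinkage intensity `alpha = 0.2` are all my own illustrative assumptions; in practice the intensity would be chosen from the data rather than fixed by hand.

```python
import random


def sample_covariance(X):
    """Sample covariance (divisor n - 1) of an n x p data matrix given as a list of rows."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def is_positive_definite(S, tol=1e-10):
    """Attempt a Cholesky factorisation; report failure if any pivot is <= tol."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:  # zero or negative pivot: not positive definite
                    return False
                L[i][j] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True


def linear_shrinkage(S, alpha):
    """Convex combination (1 - alpha) * S + alpha * mu * I, with mu = trace(S) / p."""
    p = len(S)
    mu = sum(S[i][i] for i in range(p)) / p
    return [
        [(1 - alpha) * S[i][j] + (alpha * mu if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]


random.seed(0)
n, p = 5, 20  # illustrative sizes: far fewer observations than variables
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

S = sample_covariance(X)                   # rank is at most n - 1 = 4, so S is singular
S_shrunk = linear_shrinkage(S, alpha=0.2)  # alpha is a hand-picked, hypothetical value

print(is_positive_definite(S))         # False
print(is_positive_definite(S_shrunk))  # True
```

The shrunk estimate is invertible because every eigenvalue of the singular sample covariance is lifted by at least `alpha * mu`. That is only one of the many routes to a valid estimate, and the simplest one to show in a few lines.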

If you’re working in high-dimensional dependence estimation, I hope this friendly paper (inasmuch as high-dimensional statistics allows), with its extensive collection of methods, clear taxonomy, and unified notation and framework, can serve as a useful reference for real-world applications. Happy to discuss with anyone interested.