In the classical regime, when we have plenty of observations relative to what we need to estimate, we can rely on the sample covariance matrix as a faithful representation of the underlying covariance structure. However, in the high-dimensional settings common to modern data science, where the number of attributes/features $p$ is comparable to the number of observations $n$, the sample covariance matrix is a poor estimator. It is not merely noisy; it is misleading. The eigenvalues of such matrices undergo a predictable yet deceptive dispersion, inflating into “signals” that have no basis in reality. Machinery from Random Matrix Theory offers one way to correct for the biases that high-dimensional data creates.
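To make that dispersion concrete, here is a minimal simulation sketch (my own illustration, not code from the post): the true covariance is the identity, so every population eigenvalue equals 1, yet with $p$ comparable to $n$ the sample eigenvalues spread far from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 400                      # high-dimensional: p comparable to n
X = rng.standard_normal((n, p))      # true covariance is the identity

S = np.cov(X, rowvar=False)          # sample covariance matrix
eigenvalues = np.linalg.eigvalsh(S)

# Every population eigenvalue equals 1, yet the sample eigenvalues disperse,
# roughly between (1 - sqrt(p/n))^2 and (1 + sqrt(p/n))^2 (Marchenko-Pastur).
print(eigenvalues.min().round(2), eigenvalues.max().round(2))
```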
Most popular posts – 2025
Today is the last day of 2025. Depending on where you’re reading this, the party might have already begun. I begin this end-of-year post by wishing you a safe and fun time tonight.
This blog is just a personal hobby. When I’m extra busy, as I was this year, the blog is a front-line casualty. This is why 2025 saw a weaker posting stream. As I do almost every year, I checked the analytics to see what generated the most interest. I am pleased with around 13K (active) visits this year, and an average of 37 seconds spent per visit (time with the browser tab in the foreground and active). Given our collective TikTok-wired brains, I consider those 37 seconds a compliment.
The two most popular posts this year were also, coincidentally, the two pieces I most enjoyed creating. Both dive into the all-important world of word embeddings:
Correlation and correlation structure (10) – Inverse Covariance
The covariance matrix is central to many statistical methods. It tells us how variables move together, and its diagonal entries – the variances – are very much our go-to measure of uncertainty. But the real action lives in its inverse. We call the inverse covariance matrix either the precision matrix or the concentration matrix. Where did these terms come from? This post explains the origin of these names and why the inverse of the covariance is called that way. I doubt this has kept you up at night, but I still think you’ll find it interesting.
Dot Product in the Attention Mechanism
The dot product of two embedding vectors $u$ and $v$, each of dimension $d$, is defined as $u \cdot v = \sum_{i=1}^{d} u_i v_i$.
Hardly the first thing that jumps to mind when thinking about a “similarity score”. Indeed, the result of a dot product is a single number (a scalar) with no predefined range (e.g. not between zero and one), so it’s hard to tell whether a particular score is high or low on its own. Still, the Transformer family of deep learning models relies heavily on the dot product in the attention mechanism, to weigh the importance of different parts of the input sentence. This post explains why the dot product, which seems like an odd pick for a “similarity score”, actually makes good sense.
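As a rough illustration of how those unbounded scores get used (a sketch under my own toy names `q` and `K`, not the post's code), the attention mechanism scales the dot products and passes them through a softmax, so only their relative magnitudes matter:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # embedding dimension
q = rng.standard_normal(d)               # one "query" vector
K = rng.standard_normal((5, d))          # five "key" vectors

scores = K @ q                           # raw dot products: unbounded scalars
weights = np.exp(scores / np.sqrt(d))    # scale by sqrt(d), then a softmax ...
weights /= weights.sum()                 # ... turns them into weights summing to one
print(scores.round(2), weights.round(3))
```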
Understanding Word Embeddings (2) – Geometry
I have noticed that when I use the term “coordinates” to talk about vectors, it doesn’t always click for everyone in the room. The previous post covered the algebra of word embeddings; now we explain why you should think of a word-embedding vector simply as coordinates in space. We skip the 1D and 2D cases since they are straightforward. Four dimensions are too complicated for me to gif around with, so 3D will have to suffice for our illustrations.
Understanding Word Embeddings (1) – Algebra
Some time back I took the time to explain that matrix multiplication can be viewed as a linear transformation. Having that perspective helps to grasp the inner workings of AI models across various domains (audio, images, etc.). Building on that, these next couple of posts will help you understand the inputs used in those matrix multiplication operations, specifically for those who want to understand how text-based models and LLMs function. Our focus is on the infamous one-hot encoding, as it is the key to unlocking the underlying theory. It will provide you, I hope, with the often-elusive intuition behind word embeddings.
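For a taste of why one-hot encoding is the key, here is a tiny sketch (a toy vocabulary and embedding matrix of my own, not the post's example): multiplying a one-hot vector by an embedding matrix simply selects the corresponding row, i.e. the word's embedding.

```python
import numpy as np

vocab = ["cat", "dog", "fish"]           # toy vocabulary
E = np.array([[0.2, 0.8],                # toy embedding matrix: one row per word
              [0.1, 0.9],
              [0.7, 0.3]])

one_hot = np.zeros(len(vocab))
one_hot[vocab.index("dog")] = 1.0        # one-hot vector for "dog"

# A one-hot vector times the embedding matrix just picks out a row:
print(one_hot @ E)                       # [0.1, 0.9], the embedding of "dog"
```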
Correlation and correlation structure (9) – Parallelizing Matrix Computation
Datasets have grown from large to massive, and so we increasingly find ourselves refactoring for readability and prioritizing computational efficiency (speed). The computing time for the ever-important sample covariance estimate of a dataset $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ variables, is $\mathcal{O}(n p^2)$. Although a single covariance calculation is still manageable for today’s large datasets, it becomes computationally prohibitive to use the bootstrap or related resampling methods, which require very many repetitions, each demanding its own covariance computation. Without fast computation the bootstrap remains impractical for high-dimensional problems. And that, we undoubtedly all agree, is a tragedy.
So, what can we do to restore resampling methods to our toolkit? We can reduce computing times, and appreciably so, if we compute in parallel. Waiting times can drop from overnight to minutes, or even seconds. Related to this, I wrote a post about Randomized Matrix Multiplication, where I offer a computationally cheaper approximation instead of the exact, but slower-to-compute, procedure.
The post you are now reading was inspired by a question from Laura Balzano (University of Michigan), who asked whether we can get an exact solution (rather than an approximation) using the parallel computing shown in that other post. I spent some time thinking about it, and indeed it’s possible, and valuable. So with that context out of the way, here is the Rython (R + Python) code to compute the sample covariance estimate in parallel, with some indication of the time saved. Use it when you have large matrices and need the sample covariance matrix or a derivative thereof.
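For a flavor of the idea (a minimal Python-only sketch under my own function names, not the Rython code from the post): center the data once, split the columns into blocks, and compute the block cross-products in parallel before assembling them into the full covariance matrix.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement

def parallel_cov(X, n_blocks=4, n_workers=4):
    """Sample covariance of X (n x p), assembled from column blocks computed in parallel."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                              # center each column once
    blocks = np.array_split(np.arange(p), n_blocks)      # split the columns into blocks
    S = np.empty((p, p))

    def block_pair(i, j):
        bi, bj = blocks[i], blocks[j]
        return i, j, Xc[:, bi].T @ Xc[:, bj] / (n - 1)   # cross-product of two column blocks

    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        futures = [ex.submit(block_pair, i, j)
                   for i, j in combinations_with_replacement(range(n_blocks), 2)]
        for f in futures:
            i, j, C = f.result()
            S[np.ix_(blocks[i], blocks[j])] = C          # fill the block ...
            S[np.ix_(blocks[j], blocks[i])] = C.T        # ... and its mirror image
    return S

X = np.random.randn(5000, 300)
assert np.allclose(parallel_cov(X), np.cov(X, rowvar=False))   # matches the exact estimate
```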
Named One of the Best Statistics Websites for 2025
I was recently notified that my blog is featured as one of the best statistics websites for 2025, here. You know it’s serious when you get a badge:
But more seriously, it’s quite something to be included among those other esteemed individuals, some of whom you may recognize as leading professors, accomplished scientists, and practitioners in statistics, computer science, machine learning, and data science. It gives me the sense that at least some of my contributions turn out to be meaningful. I don’t write much (on average one post every six weeks or so). I don’t write to accumulate readership (no Twitter/X or Facebook presence, for example). I don’t write about mainstream topics, nor do I write for a broad audience (though occasionally there are opinion pieces). It is therefore doubly nice to be recognized. Thank you.
Nonstandard errors?
Nonstandard errors is the title of a recently published paper in the prestigious Journal of Finance, written by more than 350 authors. At first glance the paper appears to mix apples and oranges. At second glance, it still looks that way. To be clear, the paper is mostly what you would expect from a top journal: stimulating, thought-provoking and impressive. However, my main reservation is with its conclusion and recommendations, which are off the mark, I think.
AI models are NOT biased
The issue of bias in AI has become a focal point in recent discussions, both in academia and among practitioners and policymakers. I observe a lot of confusion and diffusion in those discussions. At the risk of seeming patronizing, my advice is to engage only after understanding the specific jargon being used, and particularly how it is used in this context. Misunderstandings create confusion and blur the path forward.
Here is a negative, yet typical example:
In artificial intelligence (AI)-based predictive models, bias – defined as unfair systematic error – is a growing source of concern [1].
This post tries to direct those important discussions to the right avenues, providing some clarifications, examples of common pitfalls, and some qualified advice from experts in the field on how to approach this topic. If nothing else, I hope you find this piece thought-provoking.
Correlation and correlation structure (8) – the precision matrix
If you are reading this, you already know that the covariance matrix represents unconditional linear dependency between the variables. Far less mentioned is the bewitching fact that the elements of the inverse of the covariance matrix (i.e. the precision matrix) encode the conditional linear dependence between the variables. This post shows why that is the case. I start with the motivation to even discuss this, then the math, then some code.
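A quick numerical illustration of that fact (my own toy simulation, not the post's code): two variables that move together only through a third have a clearly non-zero covariance, yet the corresponding entry of the precision matrix, which encodes their partial correlation, is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)
x = z + rng.standard_normal(n)            # x depends on z
y = z + rng.standard_normal(n)            # y depends on z, not directly on x

S = np.cov(np.column_stack([x, y, z]), rowvar=False)   # covariance matrix
P = np.linalg.inv(S)                                    # precision matrix

partial_xy = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])      # partial correlation of x, y given z
print(S[0, 1].round(3), partial_xy.round(3))            # clearly non-zero vs. roughly zero
```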
Correlation and correlation structure (7) – Chatterjee’s rank correlation
Remarkably, considering that correlation modelling dates back to 1890, statisticians still make meaningful progress in this area. A recent step forward is given in A New Coefficient of Correlation by Sourav Chatterjee. I wrote about it shortly after it came out, and it has since garnered additional attention and follow-up results. The more I read about it, the more I am impressed with it. This post provides some additional details based on recent research.
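For reference, here is a minimal implementation sketch of the coefficient itself (my own code, ignoring ties in $Y$): sort the pairs by $X$, take the ranks of $Y$ in that order, and plug them into $\xi_n = 1 - 3\sum_{i=1}^{n-1}|r_{i+1} - r_i| / (n^2 - 1)$.

```python
import numpy as np

def xi_corr(x, y):
    """Chatterjee's rank correlation; simplified sketch assuming no ties in y."""
    n = len(x)
    order = np.argsort(x)                          # sort the pairs by x
    r = np.argsort(np.argsort(y[order])) + 1       # ranks of y in that order
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n ** 2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = x ** 2                                         # strong but non-monotonic dependence
print(xi_corr(x, y))                               # close to 1
print(np.corrcoef(x, y)[0, 1])                     # near 0: Pearson misses it
```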
On Writing
Each year I supervise several data-science master’s students, and each year I find myself repeating the same advice. The situation has worsened since students started (mis)using GPT models. I have therefore written this blog post to highlight a few important, and often overlooked, aspects of thesis writing. Many of the points apply to writing in general as well.
Matrix Multiplication as a Linear Transformation
AI algorithms are in the air. The success of those algorithms is largely attributed to dimension expansion, which makes it important for us to understand that aspect.
Matrix multiplication can be usefully viewed as a way to expand dimensions. We begin with a brief discussion of PCA. Since PCA is predominantly used for reducing dimensions, and since you are already familiar with it, it serves as a good springboard, by way of contrast, for dimension expansion. Afterwards we show some basic algebra via code, and conclude with a citation that provides the intuition for why dimension expansion is so essential.
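As a tiny preview of the algebra (a sketch with made-up numbers, not the post's code): multiplying a 2-dimensional vector by a $5 \times 2$ matrix lands you in five dimensions, whereas a PCA-style projection goes the other way.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_normal(2)            # a point in 2D
W = rng.standard_normal((5, 2))       # a 5 x 2 weight matrix

expanded = W @ x                      # linear transformation: 2D -> 5D (dimension expansion)
print(expanded.shape)                 # (5,)

# Contrast: a PCA-style projection reduces dimension instead.
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:2].T               # 5D -> 2D
print(reduced.shape)                  # (100, 2)
```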
Most popular posts – 2023
Welcome 2024.
This blog is just a personal hobby. When I’m extra busy, as I was this year, the blog is a front-line casualty. This is why 2023 saw a weaker posting stream. Nonetheless, I am pleased with just over 30K visits this year, with an average of roughly one minute per visit (engagement time, whatever Google Analytics means by that). This year I only provide the top two posts (rather than the usual three). Both posts have to do with statistical shrinkage:
