Nonstandard errors?

Nonstandard errors is the title of a recently published paper in the prestigious Journal of Finance, written by more than 350 authors. At first glance the paper appears to mix apples and oranges. At second glance, it still looks that way. To be clear, the paper is mostly what you expect from a top journal: stimulating, thought-provoking and impressive. However, my main reservation is with its conclusion and recommendations, which I think are off the mark.

I begin with a brief explanation of the paper’s content and some of its results, and then I share my own interpretation and perspective, for what it’s worth.

What are nonstandard errors?

Say you hire two research teams to test the efficacy of a drug, and you provide both with the same data. Later the two teams return with their results. Each team reports its estimated probability that the drug is effective, along with the (standard) standard error of that estimate. But since the two teams made different decisions along the way (e.g. how to normalize the data), their estimates differ. So there is additional (nonstandard) error: the estimates are not identical, despite the teams being asked the exact same question and given the exact same data. As the authors write, this “type of error can be thought of as erratic as opposed to erroneous”. It is simply extra variation stemming from the teams’ distinct analytical choices (e.g. how to treat outliers, how to impute missing values).
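
To make the idea concrete, here is a minimal, hypothetical Python simulation (the numbers and the specific analytical choices are mine, not the paper’s): two teams receive identical data, but one winsorizes extreme observations and the other does not, so they report different estimates of the same quantity.

```python
import numpy as np

rng = np.random.default_rng(42)

# One shared dataset: an effect measure with a few extreme observations
effect = np.concatenate([rng.normal(0.5, 1.0, 480), rng.normal(8.0, 1.0, 20)])

# Team A: uses the data as-is
est_a = effect.mean()
se_a = effect.std(ddof=1) / np.sqrt(len(effect))

# Team B: winsorizes at the 5th/95th percentiles before estimating
lo, hi = np.percentile(effect, [5, 95])
wins = np.clip(effect, lo, hi)
est_b = wins.mean()
se_b = wins.std(ddof=1) / np.sqrt(len(wins))

# Standard errors capture within-team uncertainty...
print(f"Team A: {est_a:.2f} (se {se_a:.2f})")
print(f"Team B: {est_b:.2f} (se {se_b:.2f})")

# ...while the nonstandard error is the across-team dispersion:
# same data, same question, different decision paths
print(f"Across-team spread: {np.std([est_a, est_b], ddof=1):.2f}")
```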

Things I love about the paper

  • Exceptional clarity, and phenomenal design-thinking.
  • The logistical orchestration of bringing together over 350 people in a structured way is not something to envy; I can only imagine the headache it gives. It also gives the paper remarkable power, both as proof that such large-scale collaboration is actually possible and, of course, for the valuable data and evidence it produces.
  • On the content side, the paper brilliantly alerts readers to the fact that the results of any research are highly dependent on the decision path chosen by the research team (e.g. which model, which optimization algorithm, which data frequency). Results and decision path go beyond basic dependency – there is a profound reliance at play. This is true for theoretical work (“under assumptions 1-5…”), you can double the force for empirical studies, and in my view triple it for empirical work in the social sciences. Below are the point estimates and the distributions around them for the 6 different hypotheses which 164 research teams were asked to test (again, using the same data). Setting aside the hypotheses’ details for now, you can see that there is sizable variation around the point estimates.
    [Figure: dispersion of estimates across research teams, by hypothesis]
    Not only is the extent of the variation eyebrow-raising, but in most cases there is not even agreement on the sign…

    The paper dives deeper. A few more insights: if we look only at top research teams (setting aside how “top” is actually determined), the situation is a bit better. Also, when the researchers were asked to estimate the across-teams variation themselves, they tended to underestimate it.

    What you see is that most research teams underestimate the actual variation (black dots below the big orange dot), and that is true for all 6 hypotheses tested. This very much echoes Daniel Kahneman’s work: “We are prone to overestimate how much we understand about the world”.

  • What is the main contributor to the dispersion of estimates? You guessed it: the statistical model chosen by the researchers.

Things I don’t like about the paper

The authors claim that the extra, decision-path-induced variation adds uncertainty, and that this is undesirable. Given that, the argument goes, a better approach would be to perfectly align on the decision path.

Six months ago I made a LinkedIn comment about the paper based on a short 2-minute video.

Yes, it took six months, but now that I have read the paper through, I feel my flat “shooting from the hip” comment is still valid (although I regret the language I chose).

In the main, any research paper is, and if not then it should be, read as a stand-alone input into our overall understanding. I think it’s clear to everyone that what they read is true conditional on what was actually done.

It’s not just that I don’t mind reading that a certain hypothesis holds if, say, checked using daily frequency but is reversed if checked using monthly frequency; I WANT to read that. Then I want to read why the researchers made the decisions they made, so I can make up my own mind and relate it to what I need in my own context.

Do we want to dictate a single procedure for each hypothesis? It is certainly appealing. We would have an easier time pursuing the truth: one work (where the decision path is decided upon) for one hypothesis, with no uncertainty and no across-researchers variation. But here is the big BUT, in the words of the authors of that same paper: “there simply is no right path in an absolute sense”. The move to a fully aligned single procedure boils down to a risk transfer. Rather than the risk of individual researchers taking wrong turns on their decision paths (or even p-hacking), we now carry another risk, a higher one in my opinion: that our aligned procedure is wrong for all researchers. So the uncertainty is still there, but now swept under the rug. That is even more worrisome than the across-researchers variation we CAN observe.

While I commend the scientific pursuit of truth, there isn’t always one truth to uncover. Everything is a process. In the past, stuttering was treated by placing pebbles in the mouth. More recently (and maybe even still), university courses in economics excluded negative interest rates on the grounds that everyone would simply hold cash. When the time came, it turned out there are not enough mattresses.

Across-researchers variation is actually something you want. If it’s small, the problem is not hard enough (everyone agrees on how to check it). So, should we just ignore across-researchers variation? Also no. Going back to my opening point, the paper brilliantly captures the scale of this variation. Just be ultra-aware that two research teams are not checking one thing (even if they work on the same data and test the same hypothesis); they are checking two things: the same hypothesis, but each based on the particular analytical choices the team made. We have it harder in that we need to consume more research outputs, but that is a small price compared to the alternative.

Footnote

While reading the paper I thought it would sometimes be good to report a trimmed standard deviation, given the sensitivity of that measure to outliers.
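
As an illustration, here is a minimal sketch of such a trimmed standard deviation in Python (toy numbers and trimming fraction of my own, not from the paper):

```python
import numpy as np

def trimmed_std(x, trim=0.05):
    """Standard deviation after dropping a fraction `trim`
    of the values from each tail."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(trim * len(x)))
    kept = x[k: len(x) - k] if k > 0 else x
    return kept.std(ddof=1)

# Ten team estimates, one of them far off
estimates = np.array([0.20, 0.30, 0.25, 0.28, 0.22, 0.27, 0.24, 0.26, 0.21, 5.00])
print(np.std(estimates, ddof=1))     # inflated by the single outlying team
print(trimmed_std(estimates, 0.10))  # closer to the dispersion of the bulk
```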

AI models are NOT biased

The issue of bias in AI has become a focal point in recent discussions, both in academia and among practitioners and policymakers. I observe a lot of confusion and diffusion in those discussions. At the risk of seeming patronizing, my advice is to engage only once you understand the specific jargon being used, and particularly how it’s used in this context. Misunderstandings create confusion and blur the path forward.

Here is a negative, yet typical example:

In artificial intelligence (AI)-based predictive models, bias – defined as unfair systematic error – is a growing source of concern.

This post tries to direct those important discussions to the right avenues, providing some clarifications, examples of common pitfalls, and some qualified advice from experts in the field on how to approach this topic. If nothing else, I hope you find this piece thought-provoking.

Continue reading

Correlation and correlation structure (8) – the precision matrix

If you are reading this, you already know that the covariance matrix represents unconditional linear dependency between the variables. Far less mentioned is the bewitching fact that the elements of the inverse of the covariance matrix (i.e. the precision matrix) encode the conditional linear dependence between the variables. This post shows why that is the case. I start with the motivation to even discuss this, then the math, then some code.
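
As a teaser of the math, here is a quick numerical sketch (my own, not the post’s code): rescaling the off-diagonal entries of the precision matrix gives the partial correlations, i.e. the linear dependence between two variables conditional on all the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 4

# Simulate data with a known covariance structure
Sigma = np.array([[1.0, 0.5, 0.3, 0.0],
                  [0.5, 1.0, 0.4, 0.2],
                  [0.3, 0.4, 1.0, 0.1],
                  [0.0, 0.2, 0.1, 1.0]])
X = rng.normal(size=(n, p)) @ np.linalg.cholesky(Sigma).T

S = np.cov(X, rowvar=False)   # covariance matrix (unconditional dependence)
P = np.linalg.inv(S)          # precision matrix (conditional dependence)

# Partial correlation between i and j given the rest: -P[i,j] / sqrt(P[i,i] * P[j,j])
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 3))
```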

Continue reading

Correlation and correlation structure (7) – Chatterjee’s rank correlation

Remarkably, considering that correlation modelling dates back to 1890, statisticians still make meaningful progress in this area. A recent step forward is given in A New Coefficient of Correlation by Sourav Chatterjee. I wrote about it shortly after it came out, and it has since garnered additional attention and follow-up results. The more I read about it, the more I am impressed with it. This post provides some additional details based on recent research.
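
As a reminder of how simple the coefficient is, here is a minimal Python sketch of Chatterjee’s coefficient for the no-ties case (toy data of my own): sort the pairs by x, rank the corresponding y values, and measure how much consecutive ranks jump around.

```python
import numpy as np

def xi_corr(x, y):
    """Chatterjee's rank correlation (no-ties case):
    xi_n = 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1),
    where r_i are the ranks of y after sorting the pairs by x."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                      # sort pairs by x
    r = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 2000)
print(round(xi_corr(x, x**2), 2))                    # strong, non-monotone dependence
print(round(xi_corr(x, rng.normal(size=2000)), 2))   # roughly zero under independence
```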

Continue reading

On Writing

Each year I supervise several data-science master’s students, and each year I find myself repeating the same advice. The situation has worsened since students started (mis)using GPT models. I have therefore written this blog post to highlight a few important, and often overlooked, aspects of thesis writing. Many of the points also apply to writing in general.

Continue reading

Matrix Multiplication as a Linear Transformation

AI algorithms are in the air. The success of those algorithms is largely attributed to dimension expansion, which makes it important for us to consider that aspect.

Matrix multiplication can be usefully viewed as a way to expand dimension. We begin with a brief discussion of PCA. Since PCA is predominantly used for reducing dimensions, and since you are already familiar with it, it serves as a good springboard, a contrasting example to dimension expansion. Afterwards we show some basic algebra via code, and conclude with a citation that provides the intuition for why dimension expansion is so essential.
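
As a tiny, self-contained illustration (mine, not the post’s code): multiplying an $n \times d$ data matrix by a $d \times D$ matrix with $D > d$ maps each observation into a higher-dimensional space, the opposite direction to PCA’s projection onto fewer components.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))      # 100 observations in 3 dimensions

# Dimension expansion: a random 3 -> 10 linear map, plus a nonlinearity,
# as in a single neural-network layer
W = rng.normal(size=(3, 10))
H = np.maximum(X @ W, 0.0)         # shape (100, 10)

# Dimension reduction, for contrast: PCA down to 2 components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                  # shape (100, 2)

print(X.shape, H.shape, Z.shape)   # (100, 3) (100, 10) (100, 2)
```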

Continue reading

Most popular posts – 2023

Welcome 2024.

This blog is just a personal hobby. When I’m extra busy, as I was this year, the blog is a front-line casualty. This is why 2023 saw a weaker posting stream. Nonetheless, I am pleased with just over 30K visits this year, with an average of roughly one minute per visit (engagement time, whatever Google Analytics means by that). This year I only provide the top two posts (rather than the usual three). Both have to do with statistical shrinkage:

Continue reading

Randomized Matrix Multiplication

Matrix multiplication is a fundamental computation in modern statistics. It’s at the heart of all current serious AI applications. The size of the matrices nowadays is gigantic. On a good system it takes around 30 seconds to estimate the covariance of a data matrix with dimensions $X_{10000 \times 2500}$, small data by today’s standards mind you. Need to do it 10000 times? Wait roughly 80 hours. Have larger data? Running time explodes. Want a more complex operation than a covariance estimate? Forget it, or get ready to dig deep into your pockets.

We mere minions, unable to splurge thousands of dollars on high-end G/TPUs, are left unable to work with large matrices because of the massive computational requirements; after all, who wants to wait a few weeks just to discover a bug?

This post offers a solution by way of approximation, using randomization. I start with the idea, follow with a short proof, and conclude with some code and a few run-time results.
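
Without spoiling the post, here is a minimal sketch of the standard column-row sampling idea (the sizes and sample count below are arbitrary choices of mine): approximate AB by sampling a few columns of A with the matching rows of B, with probabilities proportional to their norms, and rescaling.

```python
import numpy as np

def randomized_matmul(A, B, c, seed=None):
    """Approximate A @ B by sampling c column-row pairs with
    probabilities proportional to ||A[:, i]|| * ||B[i, :]||."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    scale = 1.0 / (c * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(3)
A = rng.normal(size=(500, 2000))
B = rng.normal(size=(2000, 300))

exact = A @ B
approx = randomized_matmul(A, B, c=400, seed=3)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative Frobenius error: {rel_err:.3f}")
```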

Continue reading

Statistical Shrinkage (4) – Covariance estimation

A common issue encountered in modern statistics involves the inversion of a matrix. For example, when your data is sick with multicollinearity, your estimates of the regression coefficients can bounce all over the place.

In finance we use the covariance matrix as an input for portfolio construction. Analogous to the fact that a variance must be positive, a covariance matrix must be positive definite to be meaningful. The focus of this post is on understanding the underlying issues with an unstable covariance matrix, identifying a practical solution to such instability, and connecting that solution to the all-important concept of statistical shrinkage. I present a strong link between three concepts: regularization of the covariance matrix, ridge regression, and measurement-error bias, with some easy-to-follow math.
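
A minimal sketch of the kind of stabilization discussed, shrinking the sample covariance toward a diagonal target (the shrinkage weight below is hand-picked for illustration, not the estimator derived in the post):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 60, 50                       # few observations relative to the dimension
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)         # sample covariance: nearly singular here
target = np.diag(np.diag(S))        # shrinkage target: the diagonal of S

lam = 0.3                           # hand-picked shrinkage intensity
S_shrunk = (1 - lam) * S + lam * target

# Conditioning improves dramatically, which stabilizes the inverse
print(f"cond(S)        = {np.linalg.cond(S):.1e}")
print(f"cond(S_shrunk) = {np.linalg.cond(S_shrunk):.1e}")
```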

Continue reading

Statistical Shrinkage (3)

Imagine you’re picking from 1,000 money managers. If you test just one, there’s a 5% chance you might wrongly think they’re great. But test 10, and the chance of at least one such mistake jumps to about 40%. To keep your overall error rate at 5%, you need to control the “family-wise error rate.” One method is to set a higher standard for judging a manager’s talent, using a tougher t-statistic cut-off. Instead of the usual 5% cut (t-stat = 1.65), you’d use a 0.5% cut (t-stat = 2.58).

When testing 1,000 managers or strategies, the challenge increases. You’d need a manager with an extremely high t-stat of about 4 to stay within the 5% error rate. This big jump in the t-stat threshold helps keep the error rate in check. However, it is discouragingly strict: a strategy with a t-stat of 4 is a rarity.
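
The numbers in this teaser are easy to reproduce; here is a quick sketch using the Šidák adjustment and one-sided normal cut-offs (assuming scipy is available):

```python
from scipy.stats import norm

alpha, m = 0.05, 10

# Chance of at least one false positive across m independent tests
fwer = 1 - (1 - alpha) ** m
print(f"FWER with {m} tests: {fwer:.0%}")          # ~40%

# Per-test level needed to keep the family-wise rate at 5% (Sidak),
# and the corresponding one-sided t/z cut-off
for m in (1, 10, 1000):
    alpha_adj = 1 - (1 - alpha) ** (1 / m)
    print(m, round(alpha_adj, 5), round(norm.ppf(1 - alpha_adj), 2))
# roughly: 1.65 for one test, ~2.58 for 10 tests, ~3.9 for 1,000 tests
```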

Continue reading

Rython tips and tricks – Clipboard

For whatever reason, clipboard functionalities in Rython are under-utilized. One utility function for reversing backslashes is found here. This post demonstrates how you can use the clipboard to circumvent saving and loading files. It’s convenient when you just want a quick insight or visual, rather than a full-blown replicable process.
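
On the Python side of Rython, the kind of workflow meant here can be sketched with pandas’ clipboard helpers (my illustration, not necessarily the post’s exact code):

```python
import pandas as pd

# Copy a table (e.g. from a spreadsheet or a web page), then:
df = pd.read_clipboard()        # parses the clipboard into a DataFrame
print(df.describe())            # quick insight, nothing saved to disk

# Push a result back out, ready to paste elsewhere:
df.describe().to_clipboard()
```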

Continue reading

Statistical Shrinkage (2)

During 2017 I blogged about Statistical Shrinkage. At the end of that post I mentioned the important role the signal-to-noise ratio (SNR) plays when it comes to the need for shrinkage. This post shares some recent, related empirical results from the paper Randomization as Regularization, published in the Journal of Machine Learning Research. While the paper is mainly about tree-based algorithms, the intuition undoubtedly extends to other numerical recipes as well.

Continue reading

Rython tips and tricks – Snippets

R or Python? Who cares! Which editor? Now that’s a different story.

I like RStudio for many reasons. Beyond personal preference, RStudio allows you to write both R + Python = Rython in the same script. Apart from that, the editor’s level of complexity is well balanced: not functionality overkill like some, nor too simplistic like others. This post shares how to save time with snippets (easy in RStudio). Snippets reduce the amount of typing required; it’s the most convenient way to program copy-pasting into the machine’s memory.

In addition to the useful built-in snippets provided by RStudio, like lib or fun for R and imp or def for Python, you can write your own. Below are a couple I wrote myself that you might find helpful. But first, we start with how to use snippets.
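
For flavor, here is what a couple of custom entries in RStudio’s snippets file might look like (illustrative snippets of my own, not the ones shared in the post); bodies are tab-indented, and ${n:placeholder} marks a tab stop:

```
snippet np
	import numpy as np
	import pandas as pd

snippet forr
	for ${1:item} in ${2:iterable}:
		${3:pass}
```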

Continue reading

Trees 1 – 0 Neural Networks

Tree-based methods like decision trees, and their powerful random-forest extensions, are among the most widely used machine learning algorithms. They are easy to use and provide good forecasting performance more or less off the cuff. Another darling of the machine learning community is deep learning, particularly neural networks. These are ultra-flexible algorithms with impressive forecasting performance even (and especially) in highly complex real-life environments.

This post shares:

  • Two academic references lauding the powerful performance of tree-based methods.
  • Because both neural networks and tree-based methods can capture non-linearity in the data, it’s not easy to choose between them. Those references help form an opinion as to when one should use neural networks and when tree-based methods are preferable, if you don’t have time to implement both (which is usually the case).
Continue reading