Nonstandard errors?

Nonstandard errors is the title of a recently published paper in the prestigious Journal of Finance, with more than 350 authors. At first glance the paper appears to mix apples and oranges. At second glance, it still looks that way. To be clear, the paper is mostly what you expect from a top journal: stimulating, thought-provoking and impressive. However, my main reservation is with its conclusion and recommendations, which I think are off the mark.

I begin with a brief explanation of the paper’s content and some of its results, and then I share my own interpretation and perspective, for what it’s worth.

What are nonstandard errors?

Say you hire two research teams to test the efficacy of a drug. You provide them with the same data. Later the two teams return with their results. Each team reports their estimated probability that the drug is effective, and the (standard) standard error for their estimate. But since the two teams made different decisions along the way (e.g. how to normalize the data), their estimates differ. So there is additional (nonstandard) error because their estimates are not identical, despite the teams being asked the exact same question and being given the exact same data. As the authors write, this “type of error can be thought of as erratic as opposed to erroneous”. It is simply extra variation stemming from the teams’ distinct analytical choices (e.g. how to treat outliers, how to impute missing values).
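To make the distinction concrete, here is a minimal simulation sketch, entirely my own illustration and not from the paper: three hypothetical “teams” receive the same data but make different (all defensible) analytical choices, and we compare the usual standard error with the extra dispersion across their estimates.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.standard_t(df=3, size=500) + 0.2       # one shared dataset, true effect 0.2

# Each hypothetical "team" makes a different, defensible analytical choice
def team_a(x):                                 # plain mean
    return x.mean()

def team_b(x):                                 # winsorize at the 1%/99% quantiles first
    lo, hi = np.quantile(x, [0.01, 0.99])
    return np.clip(x, lo, hi).mean()

def team_c(x):                                 # drop "outliers" beyond 3 SDs first
    return x[np.abs(x - x.mean()) < 3 * x.std()].mean()

estimates = np.array([f(y) for f in (team_a, team_b, team_c)])
standard_error = y.std(ddof=1) / np.sqrt(len(y))   # the usual (standard) error
nonstandard_error = estimates.std(ddof=1)          # dispersion across the teams

print(estimates.round(4))
print(f"standard error: {standard_error:.4f}, across-team dispersion: {nonstandard_error:.4f}")
```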

Things I love about the paper

  • Exceptional clarity, and phenomenal design-thinking.
  • The logistical orchestration of bringing together over 350 people in a structured way is not something to envy; I can only imagine the headaches it caused. But it gives the paper remarkable power, both as proof that such large-scale collaboration is actually possible and, of course, through the valuable data and evidence it produced.
  • On the content side, the paper brilliantly alerts readers that the results of any research are highly dependent on the decision path chosen by the research team (e.g. which model, which optimization algorithm, which frequency to choose). Results and decision path go beyond basic dependency; there is a profound reliance at play. This is true for theoretical work (“under assumptions 1-5…”), you can double the force for empirical studies, and in my view you can triple the force for empirical work in the social sciences. Below are the point estimates, and the distributions around them, for 6 different hypotheses which 164 research teams were asked to test (again, using the same data). Setting aside the hypotheses’ details for now, you can see that there is sizable variation around the point estimates.
    (Figure: dispersion of estimates)
    Not only is the extent of the variation eyebrow-raising, but in most cases there is not even agreement on the sign…

    The paper dives deeper. A few more insights: if we look only at top research teams (setting aside for now how “top” is actually determined), the situation is a bit better. Also, when the researchers were asked to estimate the across-teams variation themselves, they tended to underestimate it.

    What you see is that most research teams underestimate the actual variation (black dots under the big orange dot), and that is true for all 6 hypotheses tested. This very much echoes Daniel Kahneman’s work: “We are prone to overestimate how much we understand about the world”.

  • What is the main contributor to the dispersion of estimates? You guessed it: the statistical model chosen by the researchers.

Things I don’t like about the paper

The authors claim that the extra decision-path-induced variation adds uncertainty, and that this is undesirable. Because of that, they claim, a better approach would be to perfectly align on the decision path.

Six months ago I made a LinkedIn comment about the paper, based on a short two-minute video.

Yes, it took six months, but having now read the paper in full, I feel that my flat “shooting from the hip” comment is still valid (although I regret the language I chose).

In the main, any research paper is, and if not then it should be, read as a stand-alone input to our overall understanding. I think it’s clear to everyone that what they read is true conditional on the way it was done.

It’s not that I don’t mind reading that a certain hypothesis holds when checked using daily frequency but is reversed when checked using monthly frequency; I WANT to read that. Then I want to read why the authors made the decisions they made, make up my own mind, and relate it to what I need it for in my own context.

Do we want to dictate a single procedure for each hypothesis? It is certainly appealing. We would have an easier time pursuing the truth: one work (where the decision path is decided upon) per hypothesis, with no uncertainty and no across-researchers variation. But the big BUT is this, in the words of the authors of the very same paper: “there simply is no right path in an absolute sense”. The move to a fully aligned single procedure boils down to a risk transfer. Rather than the risk of researchers taking wrong turns on their decision paths (or even p-hacking), we now carry another risk, higher in my opinion: that our aligned procedure is wrong for all researchers. So the uncertainty is still there, but now swept under the rug. That is even more worrisome than the across-researchers variation we CAN observe.

While I commend the scientific pursuit of truth, there isn’t always one truth to uncover. Everything is a process. In the past, stuttering was treated by placing pebbles in the mouth. More recently (and maybe even still), university courses in economics excluded negative interest rates on the grounds that everyone would hold cash. When the time came, it turned out there were not enough mattresses.

Across-researchers variation is actually something you want. If it’s small, it means the problem is not hard enough (everyone agrees on how to check it). So, should we just ignore across-researchers variation? Also no. Going back to my opening point, the paper brilliantly captures the scale of this variation. Just be ultra-aware that two research teams are not checking one thing (even if they work on the same data and test the same hypothesis); they are checking two things: the same hypothesis, each based on the particular analytical choices they made. It is harder for us in that we need to consume more research output, but that is a small price compared to the alternative.

Footnote

While reading the paper I thought it would be good to sometimes also report a trimmed standard deviation, given how sensitive that measure is to outliers.
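For completeness, a trimmed standard deviation is essentially a one-liner; here is a minimal sketch (the 10% trimming level is an arbitrary choice of mine, not something from the paper):

```python
import numpy as np

def trimmed_std(x, trim=0.10):
    """Standard deviation after discarding the lowest and highest `trim` fraction."""
    lo, hi = np.quantile(x, [trim, 1 - trim])
    kept = x[(x >= lo) & (x <= hi)]
    return kept.std(ddof=1)

x = np.random.default_rng(1).standard_t(df=2, size=1000)  # heavy-tailed sample
print(x.std(ddof=1), trimmed_std(x))  # the plain SD is inflated by the tails
```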

Correlation and correlation structure (8) – the precision matrix

If you are reading this, you already know that the covariance matrix represents the unconditional linear dependence between variables. Far less mentioned is the bewitching fact that the elements of the inverse of the covariance matrix (i.e. the precision matrix) encode the conditional linear dependence between the variables. This post shows why that is the case. I start with the motivation for discussing this at all, then the math, then some code.
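As a teaser for the post, here is a minimal sketch of the fact itself: scaling an off-diagonal element of the precision matrix gives (minus) the partial correlation, i.e. the conditional linear dependence. The simulated variables below are my own toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)
x = z + 0.3 * rng.standard_normal(n)
y = z + 0.3 * rng.standard_normal(n)           # x and y are related only through z
data = np.column_stack([x, y, z])

cov = np.cov(data, rowvar=False)
prec = np.linalg.inv(cov)                      # the precision matrix

# Partial correlation of x and y given z, read straight off the precision matrix
partial_xy = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
print(np.corrcoef(x, y)[0, 1])                 # large unconditional correlation
print(partial_xy)                              # near zero once we condition on z
```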

Continue reading

Statistical Shrinkage (4) – Covariance estimation

A common issue encountered in modern statistics involves the inversion of a matrix. For example, when your data suffers from multicollinearity, your estimates of the regression coefficients can bounce all over the place.

In finance we use the covariance matrix as an input for portfolio construction. Analogous to the fact that a variance must be positive, a covariance matrix must be positive definite to be meaningful. The focus of this post is on understanding the underlying issues with an unstable covariance matrix, identifying a practical solution for such instability, and connecting that solution to the all-important concept of statistical shrinkage. I present a strong link between the following three concepts: regularization of the covariance matrix, ridge regression, and measurement-error bias, with some easy-to-follow math.
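To give a flavour of the fix discussed in the post, here is a minimal sketch of linear shrinkage of a sample covariance matrix toward a scaled identity; the fixed shrinkage intensity below is purely illustrative, not a recommended value.

```python
import numpy as np

rng = np.random.default_rng(42)
T, p = 60, 50                                  # few observations, many assets
returns = rng.standard_normal((T, p))

S = np.cov(returns, rowvar=False)              # sample covariance: badly conditioned here
target = np.eye(p) * np.trace(S) / p           # shrinkage target: scaled identity
lam = 0.3                                      # shrinkage intensity (illustrative only)
S_shrunk = (1 - lam) * S + lam * target

print(np.linalg.cond(S))                       # large condition number -> unstable inverse
print(np.linalg.cond(S_shrunk))                # much better behaved after shrinkage
```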

Continue reading

Beware of Spurious Factors

The word spurious refers to “outwardly similar or corresponding to something without having its genuine qualities.” Fake.

While the meanings of spurious correlation and spurious regression are common knowledge nowadays, much less is understood about spurious factors. This post draws your attention to recent, top-shelf research flagging the risks around spurious factor analysis. While formal solutions are still pending, there are a couple of heuristics we can use to detect possible problems.

Continue reading

Correlation and Correlation Structure (6) – Distance Correlation

While linear correlation (aka Pearson correlation) is by far the most common type of dependence measure, there are a few arguably better ways to characterize/estimate the degree of dependence between variables. This is a fascinating topic I keep coming back to. There is so much for a typical geek to appreciate: non-linear dependencies, should we consider the noise in the data or rather just focus on the underlying process, should we consider the whole distribution or just a few moments.

In this post, number 6 on correlation and correlation structure, I share another dependence measure called “distance correlation”. It has been around for a while now (2009, see references). I provide just the intuition, since the math has little to do with the way distance correlation is computed, but rather with the theoretical justification for its practical legitimacy.
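To complement the intuition, here is a compact sketch of how the sample distance correlation is computed (via doubly centered distance matrices); this is my own bare-bones implementation, not code from the post.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                    # pairwise distances in x
    b = np.abs(y[:, None] - y[None, :])                    # pairwise distances in y
    # Double centering: subtract row/column means, add back the grand mean
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 1000)
y = x ** 2 + 0.05 * rng.standard_normal(1000)              # non-monotonic dependence
print(np.corrcoef(x, y)[0, 1])                             # Pearson: close to zero
print(distance_correlation(x, y))                          # distance correlation: clearly positive
```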

Continue reading

Similarity and Dissimilarity Metrics – Kernel Distance

In the field of unsupervised machine learning, similarity and dissimilarity metrics (and matrices) are part and parcel. These are core components of clustering algorithms or natural language processing summarization techniques, just to name a couple.

While at first glance distance metrics look like child’s play, the fact of the matter is that when you get down to business there are a lot of decisions to make, and who likes that? To make matters worse:

  • Theoretical guidance is nowhere to be found
  • Your choices and decisions matter, in the sense that results materially change

After reading this post you will understand concepts like distance metrics, (dis)similarity metrics, and see why it’s fashionable to use kernels as similarity metrics.
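As a small preview of the kernel idea, a Gaussian (RBF) kernel turns pairwise distances into similarities; here is a minimal sketch, where the median-heuristic bandwidth is my own choice rather than anything prescribed in the post.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 3))                         # 6 observations, 3 features

# Pairwise Euclidean distances between observations
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# RBF kernel: distance 0 -> similarity 1, large distance -> similarity near 0
sigma = np.median(D[D > 0])                             # median heuristic for the bandwidth
K = np.exp(-(D ** 2) / (2 * sigma ** 2))

print(np.round(K, 2))                                   # a similarity matrix, ones on the diagonal
```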

Continue reading

A New Parameterization of Correlation Matrices

In volatility modelling, a typical challenge is to keep the covariance matrix estimate valid, meaning (1) symmetric and (2) positive semi-definite*. A new paper published in Econometrica (citing from the paper) “introduces a novel parametrization of the correlation matrix. The reparametrization facilitates modeling of correlation and covariance matrices by an unrestricted vector, where positive definiteness is an innate property” (emphasis mine). Econometrica is known to publish ground-breaking research, and you may wonder: what is the big deal in being able to reparametrize the correlation matrix?
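If memory serves, the parametrization is based on the matrix logarithm of the correlation matrix; treat that, and the sketch below, as my own assumption rather than a summary of the paper. The idea: any real vector fills the off-diagonals of a symmetric matrix, and adjusting the diagonal so that its matrix exponential has a unit diagonal yields a valid correlation matrix.

```python
import numpy as np
from scipy.linalg import expm

def corr_from_vector(z, p, iters=100):
    """Sketch only: map an unrestricted vector z to a correlation matrix."""
    A = np.zeros((p, p))
    A[np.triu_indices(p, 1)] = z               # off-diagonals come from the free vector
    A = A + A.T
    d = np.zeros(p)
    for _ in range(iters):                     # adjust the diagonal so expm(.) has unit diagonal
        C = expm(A + np.diag(d))
        d = d - np.log(np.diag(C))
    return expm(A + np.diag(d))

p = 4
z = np.random.default_rng(0).standard_normal(p * (p - 1) // 2)  # any real vector works
C = corr_from_vector(z, p)
print(np.round(np.diag(C), 6))                 # ones on the diagonal
print(np.all(np.linalg.eigvalsh(C) > 0))       # positive definite by construction
```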

Continue reading

Bayesian vs. Frequentist in Practice, part 3

This post is inspired by Leo Breiman’s opinion piece “No Bayesians in foxholes”. The saying “there are no atheists in foxholes” refers to the fact that if you are in a foxhole (being bombarded…), you pray! Leo’s paraphrase suggests that when complex, real problems are present, there are no Bayesians to be found.

Continue reading

Beta in the tails

Every form of strength is also a form of weakness*. I love statistics, but I focus too much on methodology, which is not for everyone. Some people (rightly or wrongly) ask: “wonderful sir, but what can I do with it?”.

A new paper titled “Beta in the tails” is a showcase application of why we should focus on correlation structure rather than on average correlation. The authors discuss the question: Do hedge funds hedge? The reply: No, they don’t!

The paper “Beta in the tails” was published in the Journal of Econometrics, but you can find a link to a working-paper version below. We start with a figure replicated from the paper, go through its meaning and interpretation, and then explain the methods used.

Continue reading

Correlation and correlation structure (5) – a new coefficient of correlation

This is the fifth post concerned with quantifying the dependence between variables. When talking correlation one usually thinks about linear correlation, aka Pearson’s correlation. One serious limitation of linear correlation is that it’s, well.. linear. By construction it’s not useful for detecting non-monotonic relations between variables. Here I share some recent academic research: a new way to detect associations that are not monotonic.
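My guess is that the coefficient in question is Chatterjee’s (2020) rank-based ξ; assuming that, here is a minimal sketch of the simple no-ties version, my own illustration rather than code from the post.

```python
import numpy as np

def xi_correlation(x, y):
    """Chatterjee's rank-based coefficient (simple no-ties version)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                          # sort the pairs by x
    r = np.argsort(np.argsort(y[order])) + 1       # ranks of y in that order
    return 1 - 3 * np.abs(np.diff(r)).sum() / (n ** 2 - 1)

rng = np.random.default_rng(11)
x = rng.uniform(-2, 2, 2000)
y = np.cos(3 * x) + 0.1 * rng.standard_normal(2000)   # strong but non-monotonic dependence
print(np.corrcoef(x, y)[0, 1])                         # Pearson: near zero
print(xi_correlation(x, y))                            # xi: clearly positive
```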

Continue reading

Understanding Variance Explained in PCA – Matrix Approximation

Principal component analysis (PCA from here on) is performed via linear algebra functions called eigendecomposition or singular value decomposition. Since you are actually reading this, you may well have used PCA in the past, at school or at work. There is a strong link between PCA and the usual least squares regression (previous posts here and here). More recently I explained what variance explained by the first principal component actually means.

This post offers a matrix-approximation perspective. As a by-product, we also show how to compare two matrices, to see how different they are from each other. Matrix approximation is a bit math-hairy, but we keep it simple here, I promise. I suspect this fascinating field will only rise in importance: we are constantly stretching what we can do computationally, and by using approximations rather than the actual data we can ease that burden. The price of using an approximation is a decrease in accuracy (à la “garbage in garbage out”), but with a good approximation the tradeoff between accuracy and computational time is favorable.
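As a preview of the mechanics, here is a minimal sketch: build a rank-k approximation from the leading singular values and compare it to the original matrix via the relative Frobenius norm (the data matrix is a made-up example).

```python
import numpy as np

rng = np.random.default_rng(2024)
X = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 20))  # some data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def rank_k_approx(k):
    # Keep only the k leading singular values/vectors
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

for k in (2, 5, 10):
    X_k = rank_k_approx(k)
    rel_err = np.linalg.norm(X - X_k, "fro") / np.linalg.norm(X, "fro")
    print(f"rank {k}: relative Frobenius error {rel_err:.3f}")
```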

Continue reading

Boundary corrected kernel density

Density estimation is now a trivial one-liner in all modern software. What is not so easy is to become comfortable with the result: how well is my density estimated? We rarely know. One reason is the lack of ground truth. Density estimation falls under unsupervised learning; we don’t actually observe the underlying truth. Another reason is that the theory around density estimation is seldom useful for the particular case you have at hand, which means that trial and error is a requisite.

Standard kernel density estimation is by far the most popular way to estimate a density. However, it is biased around the edges of the support. In this post I show what this bias implies and, while not the only way, a simple way to correct for it. Practically, you could present density curves that make sense, rather than apologizing (as I often did) for your estimate making less sense around the edges of the chart, as happens when you use a standard software implementation.
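One simple correction, not necessarily the one used in the post, is the reflection method; here is a minimal sketch for a density supported on [0, ∞):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(9)
x = rng.exponential(scale=1.0, size=2000)      # support is [0, inf); density is highest at 0

grid = np.linspace(0, 4, 200)

naive = gaussian_kde(x)(grid)                  # standard KDE: biased downward near the boundary

# Reflection: mirror the sample around the boundary, fit, then fold the mass back
reflected = gaussian_kde(np.concatenate([x, -x]))
corrected = 2 * reflected(grid)

print(naive[0], corrected[0], 1.0)             # true density at 0 is 1.0 for Exp(1)
```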

Continue reading

Understanding Pointwise Mutual Information in Statistics

Intro

The term mutual information is drawn from the field of information theory. Information theory is concerned with the quantification of information. For example, a central concept in this field is entropy, which we have discussed before.

If you google the term “mutual information” you will land on some page which, if you understand it, you probably had no need to google in the first place. For example:

Not limited to real-valued random variables and linear dependence like the correlation coefficient, mutual information (MI) is more general and determines how different the joint distribution of the pair (X,Y) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

which makes sense at first read only for those who don’t need to read it. That is the main motivation for this post: to provide a clear intuition behind the pointwise mutual information term and equations, for everyone. By the end of this page, you will understand what the mutual information metric actually measures, and how you should interpret it. We start with the easier concept of conditional probability and work our way to the concept of pointwise mutual information.
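As a quick numerical companion, here is a minimal sketch computing PMI and MI from a small joint probability table; the table values are made up purely for illustration.

```python
import numpy as np

# Joint distribution of two binary variables X and Y (made-up numbers)
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y

pmi = np.log2(p_xy / (p_x * p_y))       # pointwise mutual information, one value per cell
mi = (p_xy * pmi).sum()                 # mutual information = expected value of PMI

print(np.round(pmi, 3))                 # positive where the pair co-occurs more than "by chance"
print(round(mi, 3))
```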

Continue reading