Correlation and Correlation Structure (6) – Distance Correlation

While linear correlation (aka Pearson correlation) is by far the most common dependence measure, there are a few arguably better ways to characterize or estimate the degree of dependence between variables. This is a fascinating topic I keep coming back to. There is so much for a typical geek to appreciate: non-linear dependencies, whether we should consider the noise in the data or rather just focus on the underlying process, and whether we should consider the whole distribution or just a few moments.

In this post, number 6 on correlation and correlation structure, I share another dependency measure called “distance correlation”. It has been around for a while now (2009, see references). I provide just the intuition, since the math has less to do with how distance correlation is computed and more to do with the theoretical justification for its practical legitimacy.
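To make the computation concrete, here is a minimal sketch of the usual sample version of distance correlation: build the pairwise distance matrix of each variable, double-center it, and compare the two centered matrices. The function names are mine and the example data are made up; treat it as an illustration rather than a reference implementation.

```python
import math

def _centered_distances(values):
    """Pairwise absolute distances with row, column and grand means removed."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so the column means equal the row means
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_correlation(x, y):
    """Sample distance correlation of two equally long 1-D samples."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)    # squared distance covariance
    dvar_x = mean_product(a, a)   # squared distance variance of x
    dvar_y = mean_product(b, b)   # squared distance variance of y
    if dvar_x * dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# made-up example: a parabola, where Pearson correlation is zero
x = [float(v) for v in range(-10, 11)]
y_parabola = [v ** 2 for v in x]
y_linear = [2 * v + 1 for v in x]

dcor_parabola = distance_correlation(x, y_parabola)
dcor_linear = distance_correlation(x, y_linear)
```

For the linear case the measure equals one, and for the parabola it is clearly above zero even though the linear correlation vanishes.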

Continue reading

Similarity and Dissimilarity Metrics – Kernel Distance

In the field of unsupervised machine learning, similarity and dissimilarity metrics (and matrices) are part and parcel. These are core components of clustering algorithms or natural language processing summarization techniques, just to name a couple.

While at first glance distance metrics look like child’s play, the fact of the matter is that when you get down to business there are a lot of decisions to make, and who likes that? To make matters worse:

  • Theoretical guidance is nowhere to be found
  • Your choices and decisions matter, in the sense that results materially change

After reading this post you will understand concepts like distance metrics, (dis)similarity metrics, and see why it’s fashionable to use kernels as similarity metrics.
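As a small illustration of turning a distance into a similarity, the sketch below uses the Gaussian (RBF) kernel on top of the Euclidean distance. The points and the parameter name gamma are hypothetical choices of mine; changing gamma changes the similarity matrix, which is exactly the kind of decision referred to above.

```python
import math

def euclidean_distance(u, v):
    """Plain Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rbf_similarity(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * euclidean_distance(u, v) ** 2)

# made-up points: the first two are close, the third is far away
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]

# similarity matrix: 1 on the diagonal, decaying towards 0 with distance
similarity = [[rbf_similarity(p, q, gamma=0.1) for q in points] for p in points]

# the same pair of points under a different (equally arbitrary) gamma
far_pair_wide = rbf_similarity(points[0], points[2], gamma=0.01)
far_pair_narrow = rbf_similarity(points[0], points[2], gamma=1.0)
```

Unlike a distance, the kernel value is bounded between zero and one and is largest for identical points, which is part of what makes it convenient as a similarity metric.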

Continue reading

A New Parameterization of Correlation Matrices

In volatility modelling, a typical challenge is to keep the covariance matrix estimate valid, meaning (1) symmetric and (2) positive semi-definite*. A new paper published in Econometrica (citing from the paper) “introduces a novel parametrization of the correlation matrix. The reparametrization facilitates modeling of correlation and covariance matrices by an unrestricted vector, where positive definiteness is an innate property” (emphasis mine). Econometrica is known to publish ground-breaking research, and you may wonder: what is the big deal in being able to reparametrize the correlation matrix?
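To give a feel for what “modeling by an unrestricted vector” can look like, here is a sketch that assumes the mapping is built on the matrix logarithm: the vector holds the off-diagonal elements of log(C), and the way back finds the diagonal that makes the matrix exponential have ones on its diagonal. The excerpt above does not spell out the construction, so both that assumption and the simple fixed-point iteration used here should be read as illustration only. All function names are mine.

```python
import numpy as np

def _sym_fun(m, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    eigval, eigvec = np.linalg.eigh(m)
    return (eigvec * fun(eigval)) @ eigvec.T

def corr_to_vector(c):
    """Below-diagonal elements of log(C): any real values are allowed here."""
    log_c = _sym_fun(np.asarray(c, dtype=float), np.log)
    return log_c[np.tril_indices_from(log_c, k=-1)]

def vector_to_corr(gamma, dim, tol=1e-12, max_iter=200):
    """Map an arbitrary real vector back to a valid correlation matrix."""
    g = np.zeros((dim, dim))
    g[np.tril_indices(dim, k=-1)] = gamma
    g = g + g.T
    diag = np.zeros(dim)
    for _ in range(max_iter):
        np.fill_diagonal(g, diag)
        # adjust the diagonal until exp(g) has ones on its diagonal
        step = np.log(np.diag(_sym_fun(g, np.exp)))
        diag = diag - step
        if np.max(np.abs(step)) < tol:
            break
    np.fill_diagonal(g, diag)
    return _sym_fun(g, np.exp)

# round trip on a made-up correlation matrix
c = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, -0.3],
              [0.2, -0.3, 1.0]])
c_back = vector_to_corr(corr_to_vector(c), dim=3)

# any real vector yields a symmetric, positive definite matrix with unit diagonal
c_free = vector_to_corr([0.8, -1.5, 0.3], dim=3)
```

The appeal is in the last line: no constraint has to be imposed on the vector, since the matrix exponential of a symmetric matrix is always positive definite.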

Continue reading

Bayesian vs. Frequentist in Practice, part 3

This post is inspired by Leo Breiman’s opinion piece “No Bayesians in foxholes”. The saying “there are no atheists in foxholes” refers to the fact that if you are in the foxhole (being bombarded...), you pray! Leo’s paraphrase suggests that when complex, real problems are present, there are no Bayesians to be found.

Continue reading

Beta in the tails

Every form of strength is also a form of weakness*. I love statistics, but I focus too much on methodology, which is not for everyone. Some people (right or wrong) question: “wonderful sir, but what can I do with it?”.

A new paper titled “Beta in the tails” is a showcase application for why we should focus on correlation structure rather than on average correlation. The authors discuss the question: Do hedge funds hedge? The answer: No, they don’t!

The paper “Beta in the tails” was published in the Journal of Econometrics but you can find a link to a working paper version below. We start with a figure replicated from the paper, go through the meaning and interpretation of it, and explain the methods used thereafter.
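The difference between average dependence and dependence in the tail can be shown with a toy example. The sketch below is not the estimator from the paper (the excerpt does not describe it); it only contrasts an ordinary full-sample beta with the beta computed on the worst market days, using made-up returns for a hypothetical fund whose exposure increases when the market falls sharply.

```python
def beta(market, fund):
    """Least squares slope of fund returns on market returns."""
    n = len(market)
    mean_m = sum(market) / n
    mean_f = sum(fund) / n
    cov = sum((m - mean_m) * (f - mean_f) for m, f in zip(market, fund))
    var = sum((m - mean_m) ** 2 for m in market)
    return cov / var

# made-up market returns from -10% to +10%
market = [i / 200 for i in range(-20, 21)]

# hypothetical fund: beta of 0.2 in normal times, one extra unit of exposure below -5%
fund = [0.2 * m + min(m + 0.05, 0.0) for m in market]

full_beta = beta(market, fund)

# beta using only the days on which the market lost more than 5%
tail = [(m, f) for m, f in zip(market, fund) if m < -0.05]
tail_beta = beta([m for m, _ in tail], [f for _, f in tail])
```

The full-sample beta blends the two regimes, so it understates how exposed the fund is exactly when protection is needed most.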

Continue reading

Correlation and correlation structure (5) – a new coefficient of correlation

This is the fifth post concerned with quantifying the dependence between variables. When talking about correlations one usually thinks of linear correlation, aka Pearson’s correlation. One serious limitation of linear correlation is that it’s, well... linear. By construction it’s not useful for detecting non-monotonic relations between variables. Here I share some recent academic research: a new way to detect associations that are not monotonic.
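The limitation is easy to reproduce. In the sketch below Pearson’s correlation of a parabola is zero, while a rank-based coefficient picks up the dependence. As the rank-based example I use Chatterjee’s xi; that choice is an assumption, since the excerpt itself does not name the coefficient. The data are made up, and the function assumes there are no ties in x.

```python
import math

def pearson(x, y):
    """Ordinary linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def xi_coefficient(x, y):
    """Rank-based xi, using the version that allows ties in y (no ties in x)."""
    n = len(x)
    y_sorted = [b for _, b in sorted(zip(x, y))]  # y reordered by increasing x
    r = [sum(1 for v in y_sorted if v <= b) for b in y_sorted]
    l = [sum(1 for v in y_sorted if v >= b) for b in y_sorted]
    numerator = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    denominator = 2 * sum(li * (n - li) for li in l)
    return 1 - numerator / denominator

# made-up non-monotonic relation: y is fully determined by x, yet not linearly
x = list(range(-10, 11))
y = [v ** 2 for v in x]

linear_corr = pearson(x, y)
xi = xi_coefficient(x, y)
```

Here y is a deterministic function of x, so a useful dependence measure should be far from zero; the linear correlation is not.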

Continue reading

Understanding Variance Explained in PCA – Matrix Approximation

Principal component analysis (PCA from here on) is performed via linear algebra functions called eigen decomposition or singular value decomposition. Since you are actually reading this, you may well have used PCA in the past, at school or where you work. There is a strong link between PCA and the usual least squares regression (previous posts here and here). More recently I explained what variance explained by the first principal component actually means.

This post offers a matrix approximation perspective. As a by-product, we also show how to compare two matrices, to see how different they are from each other. Matrix approximation is a bit math-hairy, but we keep it simple here, I promise. I suspect this fascinating field will rise in importance. We are constantly stretching what we can do computationally, and by using approximations rather than the actual data, we can ease that burden. The price for using an approximation is a decrease in accuracy (à la “garbage in garbage out”), but with a good approximation the tradeoff between accuracy and computational time is favorable.
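A minimal sketch of the idea, assuming the approximation is the truncated singular value decomposition and that two matrices are compared with the Frobenius norm (both are my choices for illustration, and the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data: 200 observations of 5 variables driven by 2 common factors
factors = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 5))
x = factors @ loadings + 0.1 * rng.normal(size=(200, 5))

u, s, vt = np.linalg.svd(x, full_matrices=False)

def rank_k_approximation(k):
    """Keep only the k largest singular values and their vectors."""
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def frobenius_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return np.linalg.norm(a - b, ord="fro")

x2 = rank_k_approximation(2)

# how far is the approximation from the data, relative to the size of the data?
relative_error = frobenius_distance(x, x2) / np.linalg.norm(x, ord="fro")

# the approximation error equals the norm of the discarded singular values
discarded = np.sqrt(np.sum(s[2:] ** 2))
```

Storing the rank-2 version requires far fewer numbers than the full matrix, and the relative error tells you what that saving costs in accuracy.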

Continue reading

Boundary corrected kernel density

Density estimation is now a trivial one-liner script in all modern software. What is not so easy is to become comfortable with the result. How well is my density estimated? We rarely know. One reason is the lack of ground truth: density estimation falls under unsupervised learning, so we don’t actually observe the underlying truth. Another reason is that the theory around density estimation is seldom useful for the particular case you have at hand, which means that trial and error is a requisite.

Standard kernel density estimation is by far the most popular way to estimate a density. However, it is biased around the edges of the support. In this post I show what this bias implies and, while not the only way, a simple way to correct for it. Practically, you could present density curves which make sense, rather than apologizing (as I often did) for your estimate making less sense around the edges of the chart; that is, when you use a standard software implementation.
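To see the edge bias in numbers, the sketch below compares a standard Gaussian kernel estimate with a reflection-corrected one for data that live on the positive half-line. Reflection is one simple correction and I use it here as an assumption; it is not necessarily the correction discussed in the post. The sample and bandwidth are made up.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(point, sample, bandwidth):
    """Standard kernel density estimate at a single point."""
    n = len(sample)
    return sum(gaussian_kernel((point - s) / bandwidth) for s in sample) / (n * bandwidth)

def kde_reflected(point, sample, bandwidth, lower=0.0):
    """Reflection correction for a support that is bounded below at `lower`."""
    if point < lower:
        return 0.0
    mirrored = [2 * lower - s for s in sample]
    # mass that the standard estimate leaks below the boundary is folded back in
    return kde(point, sample, bandwidth) + kde(point, mirrored, bandwidth)

# made-up sample: quantiles of a standard exponential, whose true density at 0 is 1
n = 200
sample = [-math.log(1 - (i + 0.5) / n) for i in range(n)]
h = 0.3

standard_at_edge = kde(0.0, sample, h)
corrected_at_edge = kde_reflected(0.0, sample, h)
```

At the boundary the standard estimate gives away roughly half of the local mass to values that cannot occur, and the reflected estimate is exactly twice as large there.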

Continue reading

Understanding Pointwise Mutual Information in Statistics


The term mutual information is drawn from the field of information theory. Information theory is busy with the quantification of information. For example, a central concept in this field is entropy, which we have discussed before.

If you google the term “mutual information” you will land on some page which, if you understood it, you would probably not have needed to google in the first place. For example:

Not limited to real-valued random variables and linear dependence like the correlation coefficient, mutual information (MI) is more general and determines how different the joint distribution of the pair (X,Y) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

which makes sense at first read only for those who don’t need to read it. It’s the main motivation for this post: to provide a clear intuition behind the pointwise mutual information term and equations, for everyone. By the end of this page, you will understand what the mutual information metric actually measures, and how you should interpret it. We start with the easier concept of conditional probability and work our way through to the concept of pointwise mutual information.
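A tiny worked example may help before the details. The joint probabilities below are made up; the sketch computes the pointwise mutual information of each pair as the log of the joint probability over the product of the marginals, and then mutual information as the probability-weighted average of those values, in line with the quote above.

```python
import math

# made-up joint distribution of weather and whether people carry an umbrella
joint = {
    ("rain", "umbrella"): 0.30,
    ("rain", "no umbrella"): 0.10,
    ("dry", "umbrella"): 0.10,
    ("dry", "no umbrella"): 0.50,
}

def marginal(joint_probs, axis):
    """Sum the joint probabilities over the other variable."""
    out = {}
    for key, p in joint_probs.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

p_x = marginal(joint, 0)  # weather
p_y = marginal(joint, 1)  # umbrella

def pmi(x, y):
    """log2 of p(x, y) / (p(x) p(y)), which equals log2 of p(y | x) / p(y)."""
    return math.log2(joint[(x, y)] / (p_x[x] * p_y[y]))

# mutual information: the expected value of the pointwise mutual information
mutual_information = sum(p * pmi(x, y) for (x, y), p in joint.items())

pmi_rain_umbrella = pmi("rain", "umbrella")
pmi_rain_no_umbrella = pmi("rain", "no umbrella")
```

A positive value means the pair occurs more often than it would under independence, a negative value means less often, and zero means knowing one outcome tells you nothing about the other.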

Continue reading

Understanding Variance Explained in PCA

Principal component analysis (PCA) is one of the earliest multivariate techniques. Yet not only has it survived, it is arguably the most common way of reducing the dimension of multivariate data, with countless applications in almost all sciences.

Mathematically, PCA is performed via linear algebra functions called eigen decomposition or singular value decomposition. By now almost nobody cares how it is computed. Implementing PCA is as easy as pie nowadays, like many other numerical procedures really, from drag-and-drop interfaces to prcomp in R or from sklearn.decomposition import PCA in Python. So implementing PCA is not the trouble, but some vigilance is nonetheless required to understand the output.

This post is about understanding the concept of variance explained. At the risk of sounding condescending, I suspect many new-generation statisticians/data-scientists simply echo what is often cited online: “the first principal component explains the bulk of the movement in the overall data” without any deep understanding. What exactly does “explains the bulk of the movement in the overall data” mean?
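In numbers, the statement boils down to a ratio of eigenvalues. The sketch below uses simulated data with one strong common driver (my own made-up setup) and computes the share of total variance captured by each principal component directly from the eigen decomposition of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# simulated data: three variables sharing one common driver plus noise
common = rng.normal(size=(500, 1))
x = common @ np.array([[1.0, 0.8, 0.6]]) + 0.5 * rng.normal(size=(500, 3))
x = x - x.mean(axis=0)

cov = np.cov(x, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; sort so the first component is the largest
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# share of the total variance attributed to each component
explained = eigval / eigval.sum()

# the components themselves: the data projected on the eigenvectors
scores = x @ eigvec
```

The total variance is the sum of the individual variances (the trace of the covariance matrix), and the variance of each component’s scores is the corresponding eigenvalue, so “variance explained” is simply one eigenvalue divided by their sum.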

Continue reading

Adaptive Huber Regression

Many years ago, when I was still trying to beat the market, I used to pair-trade. In principle it is quite straightforward to estimate the correlation between two stocks. The estimator for beta is very important since it determines how much you should long the one and how much you should short the other, in order to remain market-neutral. In practice it is indeed very easy to estimate, but I remember I never felt genuinely comfortable with the results. Not only because of instability over time, but also because the Ordinary Least Squares (OLS from here on) estimator is theoretically justified based on a few textbook assumptions, most of which do not hold in practice. In addition, the OLS estimator is very sensitive to outliers. There are other good alternatives. I have described a couple of alternatives here and here. Below is another alternative, prompted by a recent paper titled Adaptive Huber Regression.
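The sensitivity to outliers is easy to demonstrate. The sketch below fits a line by OLS and by Huber regression using iteratively reweighted least squares, on made-up data with a single bad observation. Note that it uses a fixed, hand-picked threshold tau; it does not implement the adaptive choice of the threshold that gives the paper its name.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

def huber_line_fit(x, y, tau, n_iter=50):
    """Huber regression: residuals larger than tau are down-weighted."""
    w = [1.0] * len(x)
    for _ in range(n_iter):
        intercept, slope = weighted_line_fit(x, y, w)
        resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
        w = [1.0 if abs(r) <= tau else tau / abs(r) for r in resid]
    return weighted_line_fit(x, y, w)

# made-up data: true slope 0.5, small alternating noise, one large outlier at the end
x = [float(i) for i in range(20)]
y = [1.0 + 0.5 * xi + 0.05 * (-1) ** i for i, xi in enumerate(x)]
y[-1] += 20.0

ols_intercept, ols_slope = weighted_line_fit(x, y, [1.0] * len(x))
huber_intercept, huber_slope = huber_line_fit(x, y, tau=1.0)
```

With equal weights the single outlier drags the OLS slope well away from 0.5, while the Huber fit stays close to it because the influence of any one residual is capped at tau.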

Continue reading