Correlation and correlation structure (5) – a new coefficient of correlation

This is the fifth post in a series concerned with quantifying the dependence between variables. When talking about correlation, one usually thinks of linear correlation, aka Pearson’s correlation. One serious limitation of linear correlation is that it’s, well... linear. By construction it’s not useful for detecting non-monotonic relations between variables. Here I share some recent academic research: a new way to detect associations that are not monotonic.

A seminal paper from 2011, “Detecting Novel Associations in Large Datasets”, provides a nice peek into this thorny world of non-monotonic association measures. A recent, wonderfully written paper, “A New Coefficient of Correlation”, produces a new, assumption-free dependence measure which is simple to understand, simple to compute and theoretically sound. For the sake of brevity, references and formulas are below for those readers who wish a closer look.

Here, let’s just give this new correlation measure a whirl. Denote the new coefficient of correlation by $\xi$, and compare it to the familiar Pearson’s correlation, denoted by $\rho$. The left panel has $y$ as a noisy version of $x^2$, so a parabolic relation. The right panel has $y$ as a noisy version of $2x$, so a linear relation.

While the linear measure $\rho$ is, as expected, very low for the left panel, the new coefficient of correlation $\xi$ captures the non-linear relation well. There is more to like. As the data becomes noisier, moving from the top figures to the bottom ones, the estimated $\rho$ in the right panel decreases only slightly, while the estimated $\xi$ drops much more. A matter of taste I guess, but I myself find it easier to digest lower numbers on noisy data.
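For readers who want to try the comparison themselves, here is a minimal Python sketch. It uses the no-ties form of the estimator from the paper: sort the pairs by $x$, let $r_i$ be the rank of the $i$-th $y$ in that order, and compute $\xi_n = 1 - \frac{3\sum_{i=1}^{n-1}|r_{i+1}-r_i|}{n^2-1}$. The function name `xi_no_ties`, the sample size, the noise level and the seed are my own assumptions for illustration, not the settings behind the figures above.

```python
import numpy as np


def xi_no_ties(x, y):
    """Rank-based xi estimate, assuming continuous data with no ties."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Reorder y so that the pairs are sorted by x.
    y_by_x = y[np.argsort(x, kind="stable")]
    # Rank (1..n) of each y value, kept in the x-sorted order.
    ranks = np.argsort(np.argsort(y_by_x)) + 1
    # Large jumps between consecutive ranks signal weak dependence.
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)


rng = np.random.default_rng(0)  # arbitrary seed
n = 1000                        # arbitrary sample size
x = rng.uniform(-1, 1, n)
noise = rng.normal(0, 0.1, n)   # arbitrary noise level

data = {
    "parabola": x**2 + noise,   # y is a noisy version of x^2
    "line": 2 * x + noise,      # y is a noisy version of 2x
}

results = {}
for name, y in data.items():
    rho = np.corrcoef(x, y)[0, 1]  # Pearson's correlation
    xi = xi_no_ties(x, y)
    results[name] = (rho, xi)
    print(f"{name:9s} rho = {rho:6.3f}   xi = {xi:6.3f}")
```

With settings like these, $\rho$ should land near zero for the parabola while $\xi$ stays clearly positive, and both should be high for the line. Increasing the noise level is a quick way to see how differently the two measures react.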

A few more comments before we adjourn. The new coefficient of correlation $\xi$:

is designed for general patterns of association, and runs from zero to one. This means you can’t say much about the “direction” of the relation, as you can with the usual $\rho$

is based on ranks, so more robust to outliers in the data

is not symmetric. But that is an easy fix if you wish: just average $\xi(x,y)$ with $\xi(y,x)$, as is usually done for asymmetric measures

can be computed using the R package XICOR: Association Measurement Through Cross Rank Increments
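The asymmetry and the rank-based nature of $\xi$ can be checked with a short Python sketch. This one uses the more general form of the estimator from the paper, which allows ties in $y$: with $r_i$ the number of $y$ values at or below the $i$-th one (pairs sorted by $x$) and $l_i$ the number at or above it, $\xi_n = 1 - \frac{n\sum_{i=1}^{n-1}|r_{i+1}-r_i|}{2\sum_{i=1}^{n} l_i(n-l_i)}$. The names `xi_with_ties` and `xi_symmetric` are hypothetical, the simulated data is my own, and ties in $x$ are broken by input order here rather than at random, which is a simplification.

```python
import numpy as np


def xi_with_ties(x, y):
    """Rank-based xi estimate; ties in y are allowed, y must not be constant."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Reorder y so that the pairs are sorted by x (ties in x keep input order).
    y_by_x = y[np.argsort(x, kind="stable")]
    y_sorted = np.sort(y)
    # r_i: number of observations with y_j <= y_i.
    r = np.searchsorted(y_sorted, y_by_x, side="right")
    # l_i: number of observations with y_j >= y_i.
    l = n - np.searchsorted(y_sorted, y_by_x, side="left")
    return 1.0 - n * np.abs(np.diff(r)).sum() / (2.0 * np.sum(l * (n - l)))


def xi_symmetric(x, y):
    """Average of the two directions, as suggested in the list above."""
    return 0.5 * (xi_with_ties(x, y) + xi_with_ties(y, x))


rng = np.random.default_rng(1)            # arbitrary seed
x = rng.uniform(-1, 1, 2000)
y = x**2 + rng.normal(0, 0.05, 2000)      # y is (nearly) a function of x

xi_xy = xi_with_ties(x, y)                # how well y is explained by x
xi_yx = xi_with_ties(y, x)                # how well x is explained by y
xi_sym = xi_symmetric(x, y)
xi_monotone = xi_with_ties(x, np.exp(y))  # monotone transform keeps the ranks

print(f"xi(x, y) = {xi_xy:.3f}, xi(y, x) = {xi_yx:.3f}, average = {xi_sym:.3f}")
print(f"xi(x, exp(y)) = {xi_monotone:.3f}")
```

In the parabola case $y$ is close to a function of $x$, but $x$ is not a function of $y$ (each $y$ matches two values of $x$), so the two directions should differ noticeably. Because only ranks enter the formula, a strictly increasing transformation of $y$ leaves the estimate unchanged.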

References

The main paper which this post is based on: A new coefficient of correlation (working paper version)

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. (2011), “Detecting Novel Associations in Large Datasets”, Science, 334, 1518–1524.

Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences