Statistical Shrinkage (4) – Covariance estimation

A common issue in modern statistics involves the inversion of a matrix. For example, when your data is sick with multicollinearity, your estimates of the regression coefficients can bounce all over the place.

In finance we use the covariance matrix as an input for portfolio construction. Analogous to the fact that a variance must be positive, a covariance matrix must be positive definite to be meaningful. The focus of this post is on understanding the underlying issues with an unstable covariance matrix, identifying a practical solution to such instability, and connecting that solution to the all-important concept of statistical shrinkage. I present a strong link between the following three concepts: regularization of the covariance matrix, ridge regression, and measurement error bias, with some easy-to-follow math.
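As a taste of what regularizing the covariance matrix looks like in practice, here is a minimal sketch of linear shrinkage toward a scaled-identity target. The shrinkage intensity is fixed by hand purely for illustration; it is not the data-driven choice discussed in the post.

    set.seed(1)
    p <- 10; n <- 15                       # more variables than is comfortable for this n
    x <- matrix(rnorm(n * p), n, p)
    S <- cov(x)                            # sample covariance: ill-conditioned here
    lambda <- 0.3                          # shrinkage intensity, chosen by hand in [0, 1]
    target <- diag(mean(diag(S)), p)       # scaled-identity shrinkage target
    S_shrunk <- (1 - lambda) * S + lambda * target
    range(eigen(S)$values)                 # tiny smallest eigenvalue -> unstable inverse
    range(eigen(S_shrunk)$values)          # small eigenvalues pulled up, much like a ridge penalty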

Continue reading

Statistical Shrinkage (3)

Imagine you’re picking from 1,000 money managers. If you test just one, there’s a 5% chance you might wrongly think they’re great. But test 10, and the chance of at least one such error jumps to about 40%. To keep your error rate at 5%, you need to control the “family-wise error rate.” One method is to set higher standards for judging a manager’s talent, using a tougher t-statistic cut-off. Instead of the usual 5% cut (t-stat = 1.65), you’d use a 0.5% cut (t-stat = 2.58).

When testing 1,000 managers or strategies, the challenge increases. You’d need a manager with an extremely high t-stat of about 4 to stay within the 5% error rate. This big jump in the t-stat threshold keeps the error rate in check, but it is discouragingly strict: a strategy with a t-stat of 4 is a rarity.
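The cut-offs above are easy to reproduce with a Bonferroni-style correction for one-sided tests; a quick sketch:

    alpha <- 0.05
    1 - (1 - alpha)^10        # chance of at least one false discovery with 10 tests: ~0.40
    qnorm(1 - alpha)          # the usual 5% cut-off: ~1.65
    qnorm(1 - alpha / 10)     # 10 managers, 0.5% cut-off: ~2.58
    qnorm(1 - alpha / 1000)   # 1,000 managers: ~3.89, roughly the "t-stat of about 4"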

Continue reading

Rython tips and tricks – Clipboard

For whatever reason, the clipboard functionality available in Rython is under-utilized. One utility function for reversing backslashes is found here. This post demonstrates how you can use the clipboard to circumvent saving and loading files. It’s convenient when you just want a quick insight or visual, rather than a full-blown, replicable process.
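As a small appetizer, here is a minimal sketch on the R side using the clipr package (assumed installed; base R’s readClipboard()/writeClipboard() are a Windows-only alternative):

    library(clipr)
    dat <- read_clip_tbl()                     # paste a table copied from, say, a spreadsheet
    head(dat)                                  # quick look, no file saved anywhere
    write_clip(capture.output(summary(dat)))   # push a quick summary back onto the clipboard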

Continue reading

Statistical Shrinkage (2)

During 2017 I blogged about Statistical Shrinkage. At the end of that post I mentioned the important role the signal-to-noise ratio (SNR) plays when it comes to the need for shrinkage. This post shares some recent, related empirical results from the paper Randomization as Regularization, published in the Journal of Machine Learning Research. While the paper deals mainly with tree-based algorithms, the intuition undoubtedly extends to other numerical recipes as well.
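To make the SNR point concrete, here is a toy simulation (my own illustration, not taken from the paper): the lower the SNR, the more even a crudely shrunken coefficient estimate beats plain OLS.

    set.seed(42)
    sim_mse <- function(snr, n = 50, reps = 2000) {
      beta <- 1
      mse <- replicate(reps, {
        x <- rnorm(n)
        y <- beta * x + rnorm(n, sd = 1 / sqrt(snr))
        b_ols <- unname(coef(lm(y ~ x - 1)))             # plain OLS slope
        c(ols = (b_ols - beta)^2, shrunk = (0.5 * b_ols - beta)^2)
      })
      rowMeans(mse)                                      # estimation MSE of each rule
    }
    sim_mse(snr = 0.02)   # low SNR: halving the estimate lowers the MSE
    sim_mse(snr = 5)      # high SNR: shrinkage only hurts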

Continue reading

Rython tips and tricks – Snippets

R or Python? Who cares! Which editor? Now that’s a different story.

I like Rstudio for many reasons. Personal preferences aside, Rstudio allows you to write both R + Python = Rython in the same script. Apart from that, the editor’s level of complexity is well balanced: not functionality-overkill like some, nor too simplistic like some others. This post shares how to save time with snippets (easy in Rstudio). Snippets reduce the amount of typing required; think of them as the most convenient way to program copy-pasting into the machine’s memory.

In addition to the useful built-in snippets provided by Rstudio, like lib or fun for R and imp or def for Python, you can write your own. Below are a couple I wrote myself that you might find helpful. But first we start with how to use snippets.
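As a taste of the syntax (the snippets below are made up for illustration, not the ones from the post): definitions live in the r.snippets file, reachable via Tools > Global Options > Code > Edit Snippets, and each body line must be indented with a literal tab.

    snippet sec
    	# ${1:section title} --------------------------------------------
    snippet hdr
    	# Author : ${1:your name}
    	# Purpose: ${2:what this script does}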

Continue reading

Trees 1 – 0 Neural Networks

Tree-based methods, like decision trees and their powerful random forest extensions, are among the most widely used machine learning algorithms. They are easy to use and, more or less off the cuff, provide good forecasting performance. Another darling of the machine learning community is deep learning, particularly neural networks. These are ultra-flexible algorithms with impressive forecasting performance even (and especially) in highly complex real-life environments.

This post shares:

  • Two academic references lauding the powerful performance of tree-based methods.
  • Some guidance on choosing between the two: because both neural networks and tree-based methods can capture non-linearity in the data, the choice is not easy. The references help form an opinion on when to use neural networks and when tree-based methods are preferable, if you don’t have time to implement both (which is usually the case); a toy comparison follows below.
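The toy comparison, for what it is worth: a random forest and a small neural network fitted on a simulated non-linear classification problem, using the randomForest and nnet packages (both assumed installed). It is only a sketch; neither model is tuned.

    library(randomForest)
    library(nnet)
    set.seed(7)
    n <- 500
    x1 <- runif(n); x2 <- runif(n)
    y  <- factor(ifelse(x1 * x2 + rnorm(n, sd = 0.1) > 0.25, 1, 0))   # non-linear boundary
    dat <- data.frame(x1, x2, y)
    train <- sample(n, 350)
    rf <- randomForest(y ~ ., data = dat[train, ])
    nn <- nnet(y ~ ., data = dat[train, ], size = 5, trace = FALSE)
    mean(predict(rf, dat[-train, ]) == dat$y[-train])                  # forest accuracy
    nn_class <- ifelse(predict(nn, dat[-train, ]) > 0.5, "1", "0")
    mean(nn_class == as.character(dat$y[-train]))                      # net accuracy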
Continue reading

Most popular posts – 2022

Welcome 2023.

As per usual at this point in time, I check my blog’s traffic analytics to see which were the most popular pieces last year. Without further ado…

Continue reading

Beware of Spurious Factors

The word spurious refers to “outwardly similar or corresponding to something without having its genuine qualities.” Fake.

While the meanings of spurious correlation and spurious regression are common knowledge nowadays, much less is understood about spurious factors. This post draws your attention to recent, top-shelf research flagging the risks around spurious factor analysis. While formal solutions are still pending, there are a couple of heuristics we can use to detect possible problems.

Continue reading

Understanding Convolutional Neural Networks

Convolutional Neural Networks (CNNs from here on) triumph in the field of image processing because they are designed to effectively handle strong spatial dependencies. Simply put, adjacent pixel values are close to each other, often changing only gradually from one pixel to the next. In a picture where you wear a blue shirt, all the pixels in that area of the picture are blue. Think of a strongly autocorrelated time series, but for spatial rather than sequential data. This post explains a few important concepts related to CNNs: sparsity of connections, parameter sharing, and hierarchical feature engineering.
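Parameter sharing, for example, is easy to demonstrate by hand. A minimal sketch (my own toy illustration) slides a single 3x3 kernel over a 6x6 "image":

    img    <- matrix(runif(36), 6, 6)            # toy 6x6 image
    kernel <- matrix(c(1, 0, -1,
                       1, 0, -1,
                       1, 0, -1), 3, 3)          # a crude vertical-edge detector
    out <- matrix(NA, 4, 4)                      # "valid" convolution output
    for (i in 1:4) {
      for (j in 1:4) {
        out[i, j] <- sum(img[i:(i + 2), j:(j + 2)] * kernel)
      }
    }
    out   # every output cell reuses the same 9 weights: parameter sharing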

Continue reading

R tips and tricks – get the gist

In scientific programming, speed is important. Functions written for general public use have a lot of control-flow checks which are not necessary if you are confident enough in your code. To quicken your code execution I suggest stripping run-of-the-mill functions down to their bare bones. You can save serious wall-clock time by using only the code lines that do the actual labor. Below is a walk-through example of what I mean.
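One concrete example in the same spirit (not the post’s own walk-through): lm() carries a lot of formula bookkeeping and checking, while its bare-bones workhorse .lm.fit() does only the numerical labor.

    set.seed(1)
    n <- 1e4
    x <- cbind(1, rnorm(n))                       # design matrix with an intercept column
    y <- 2 + 3 * x[, 2] + rnorm(n)
    system.time(for (i in 1:200) lm(y ~ x[, 2]))  # full-service version
    system.time(for (i in 1:200) .lm.fit(x, y))   # bare-bones version, much faster
    coef(lm(y ~ x[, 2]))                          # same estimates either way
    .lm.fit(x, y)$coefficients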

Continue reading

Correlation and Correlation Structure (6) – Distance Correlation

While linear correlation (aka Pearson correlation) is by far the most common measure of dependence, there are a few arguably better ways to characterize/estimate the degree of dependence between variables. This is a fascinating topic I keep coming back to. There is so much for a typical geek to appreciate: non-linear dependencies, whether we should account for the noise in the data or focus only on the underlying process, whether we should consider the whole distribution or just a few moments.

In this post, number 6 on correlation and correlation structure, I share another dependence measure called “distance correlation”. It has been around for a while now (2009, see references). I provide just the intuition, since the math has little to do with the way distance correlation is computed, but rather with the theoretical justification for its practical legitimacy.
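A minimal sketch of the headline benefit, using the energy package (assumed installed): Pearson correlation misses a purely non-linear link, while distance correlation picks it up.

    library(energy)
    set.seed(1)
    x <- rnorm(1000)
    y <- x^2 + rnorm(1000, sd = 0.1)   # strong dependence, but not linear
    cor(x, y)                          # Pearson correlation: close to zero
    dcor(x, y)                         # distance correlation: clearly away from zero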

Continue reading

On Writing Math

There are plenty of skills that, despite being greatly needed, we never get any formal training for. At least nothing is built into our core educational programs. A few examples: how to read well, how to listen well, or how to develop a can-do mental attitude. Writing well, in particular writing math well, is another such example. Here I share a few pointers from my own experience of reading and writing math.

Continue reading

R Packages Download Stats

One big advantage of using open-source tools is the fantastic ecosystems that typically accompany them. Being able to tap into a massive open-source community by downloading freely available code is decidedly useful. But, yes, there are downsides to downloads.

For one, there are too many packages out there. There are imperfect duplicates, and you can easily end up downloading a package or module that is inferior to an existing alternative. Second, there is the matter of security. I myself try to refrain from downloading relatively new code that is not yet tried-and-true. How do we know if a package is solid?
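One quick, partial signal is how often a package gets downloaded. A minimal sketch using the cranlogs package (assumed installed; the counts cover the RStudio CRAN mirror only):

    library(cranlogs)
    d <- cran_downloads(packages = c("dplyr", "data.table"), when = "last-month")
    head(d)                                    # daily counts per package
    aggregate(count ~ package, data = d, sum)  # monthly totals, a rough popularity proxy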

Continue reading

Similarity and Dissimilarity Metrics – Kernel Distance

In the field of unsupervised machine learning, similarity and dissimilarity metrics (and matrices) are part and parcel. They are core components of clustering algorithms and of natural language processing summarization techniques, just to name a couple.

While at first glance distance metrics look like child’s play, the fact of the matter is that when you get down to business there are a lot of decisions to make, and who likes that? To make matters worse:

  • Theoretical guidance is nowhere to be found
  • Your choices and decisions matter, in the sense that results materially change

After reading this post you will understand concepts like distance metrics and (dis)similarity metrics, and see why it’s fashionable to use kernels as similarity metrics.
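As a preview of the kernel idea, here is a minimal sketch (my own illustration) turning Euclidean distances into similarities with a Gaussian (RBF) kernel, similarity = exp(-d^2 / (2 * sigma^2)). The bandwidth sigma is one of those decisions you have to make; here it is simply the median pairwise distance.

    set.seed(1)
    x <- matrix(rnorm(20 * 3), 20, 3)      # 20 observations, 3 features
    d <- as.matrix(dist(x))                # Euclidean distance matrix
    sigma <- median(d[upper.tri(d)])       # a common rule-of-thumb bandwidth
    K <- exp(-d^2 / (2 * sigma^2))         # kernel similarity matrix, values in (0, 1]
    K[1:4, 1:4]                            # diagonal of 1s: an item is identical to itself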

Continue reading