I have recently been reading about more modern ways to decompose a matrix. Singular value decomposition is a popular way, but there are more. I went down the rabbit whole. After a couple of “see references therein” I found something which looks to justify spending time on this. An excellent paper titled “CUR matrix decomposition for improved data analysis”. This post describes how to single-out the most important variables from the data in an unsupervised manner. Unsupervised here means without a target variable in mind.
Principal component analysis (PCA) is one of the earliest multivariate techniques. Yet not only it survived but it is arguably the most common way of reducing the dimension of multivariate data, with countless applications in almost all sciences.
Mathematically, PCA is performed via linear algebra functions called eigen decomposition or singular value decomposition. By now almost nobody cares how it is computed. Implementing PCA is as easy as pie nowadays- like many other numerical procedures really, from a drag-and-drop interfaces to
prcomp in R or
from sklearn.decomposition import PCA in Python. So implementing PCA is not the trouble, but some vigilance is nonetheless required to understand the output.
This post is about understanding the concept of variance explained. With the risk of sounding condescending, I suspect many new-generation statisticians/data-scientists simply echo what is often cited online: “the first principal component explains the bulk of the movement in the overall data” without any deep understanding. What does “explains the bulk of the movement in the overall data” mean exactly, actually?
Moving average is one of the most commonly used smoothing method, basically the go-to. It helps us detect trend in the data by smoothing out short term fluctuations. The computation is trivial: take the most recent k points and simple-average them. Here is how it looks:
The useR! 2019 held in Toulouse ended couple of days ago.
Where I work we are now hiring. We took few time-consuming actions to make sure we have a large pool of candidates to choose from. But what is the value in having a large pool of candidates? Intuitively, the more candidates you have the better the chance that you will end up with a strong prospective candidate in terms of experience, talent and skill set (call this one candidate “the maximum”). But what are we talking about? is this meaningful? If there is a big difference between 10 candidates versus 1500 candidates, but very little difference between 10 candidates versus 80 candidates it means that our publicity and screening efforts are not very fruitful\efficient. Perhaps it would be better running quickly over a small pool, few dozens candidates, and choose the best fit. Below I try to cast this question in terms of the distribution of the sample maximum (think: how much better is the best candidate as the number of candidates grow).
When you build your portfolio you must decide what is your risk profile. A pension fund’s risk profile is different than that of a hedge fund, which is different than that of a family office. Everyone’s goal is to maximize returns given the risk. Sinfully but commonly risk is defined as the variability in the portfolio, and so we feed our expected returns and expected risk to some optimization procedure in order to find the optimal portfolio weights. Risk serves as a decision variable. You choose the risk, and (hope to) get the returns.
A new paper from Kris Boudt, Dries Cornilly, Frederiek Van Hollee and Joeri Willems titled Algorithmic Portfolio Tilting to Harvest Higher Moment Gains makes good progress in terms of our definition of risk, and risk-return trade-off. They propose a quantified way in which you can adjust your portfolio to account not only for the variance, but also for higher moments, namely skewness and kurtosis. They do that in two steps. The first is to simply set your portfolio based on whichever approach you follow (e.g. minvol, equal risk contribution or other). In the second step you tilt the portfolio such that the higher moments are brought into focus and get the attention they deserve. This is done by deviating from the original optimization target so that higher moments are utility-improved: less variance, better skew and lower kurtosis.
Many years ago, when I was still trying to beat the market, I used to pair-trade. In principle it is quite straightforward to estimate the correlation between two stocks. The estimator for beta is very important since it determines how much you should long the one and how much you should short the other, in order to remain market-neutral. In practice it is indeed very easy to estimate, but I remember I never felt genuinely comfortable with the results. Not only because of instability over time, but also because the Ordinary Least Squares (OLS from here on) estimator is theoretically justified based on few text-book assumptions, most of which are improper in practice. In addition, the OLS estimator it is very sensitive to outliers. There are other good alternatives. I have described couple of alternatives here and here. Here below is another alternative, provoked by a recent paper titled Adaptive Huber Regression.
This post is about the concept of entropy in the context of information-theory. Where does the term entropy comes from? What does it actually mean? And how does it clash with the notion of robustness?
This post provides an intuitive explanation for the term Latent Variable.
In a previous post: Most popular machine learning R packages, trying to hash out what are the most frequently used machine learning packages, I simply chose few names from my own memory. However, there is a CRAN task views web page which “aims to provide some guidance which packages on CRAN are relevant for tasks related to a certain topic.” So instead of relying on my own experience, in this post I correct for the bias by simply looking at the topic
Machine Learning & Statistical Learning. There are currently around 100 of those packages on CRAN.
Broadly speaking, we can classify financial markets conditions into two categories: Bull and Bear. The first is a “todo bien” market, tranquil and generally upward sloping. The second describes a market with a downturn trend, usually more volatile. It is thought that those bull\bear terms originate from the way those animals supposedly attack. Bull thrusts its horns up while a bear swipe its paws down. At any given moment, we can only guess the state in which we are in, there is no way of telling really; simply because those two states don’t have a uniformly exact definitions. So basically we never actually observe a membership of an observation. In this post we are going to use (finite) mixture models to try and assign daily equity returns to their bull\bear subgroups. It is essentially an unsupervised clustering exercise. We will create our own recession indicator to help us quantify if the equity market is contracting or not. We use minimal inputs, nothing but equity return data. Starting with a short description of Finite Mixture Models and moving on to give a hands-on practical example.
Are returns this year actually different than what can be expected from a typical year? Is the variance actually different than what can be expected from a typical year? Those are fairly light, easy to answer questions. We can use tests for equality of means or equality of variances.
But how about the following question:
is the profile\behavior of returns this year different than what can be expected in a typical year?
This is a more general and important question, since it encompasses all moments and tail behavior. And it is not as trivial to answer.
In this post I am scratching an itch I had since I wrote Understanding Kullback – Leibler Divergence. In the Kullback – Leibler Divergence post we saw how to quantify the difference between densities, exemplified using SPY return density per year. Once I was done with that post I was thinking there must be a way to test the difference formally, rather than just quantify, visualize and eyeball. And indeed there is. This post aim is to show to formally test for equality between densities.
Orthogonality in mathematics
The word Orthogonality originates from a combination of two words in ancient Greek: orthos (upright), and gonia (angle). It has a geometrical meaning. It means two lines create a 90 degrees angle between them. So one line is perpendicular to the other line. Like so:
Even though Orthogonality is a geometrical term, it appears very often in statistics. You probably know that in a statistical context orthogonality means uncorrelated, or linearly independent. But why?
Why use a geometrical term to describe a statistical relation between random variables? By extension, why does the word angle appears in the incredibly common regression method least-angle regression (LARS)? Enough losing sleep over it (as you undoubtedly do), an extensive answer below.
At least in part, a typical data-scientist is busy with forecasting and prediction. Kaggle is a platform which hosts a slew of competitions. Those who have the time, energy and know-how to combat real-life problems, are huddling together to test their talent. I highly recommend this experience. A side effect of tackling actual problems (rather than those which appear in textbooks), is that most of the time you are not at all enjoying new wonderful insights or exploring fascinating unfamiliar, ground-breaking algos. Rather, you are handling\wrangling\manipulating data, which is.. ugly and boring, but necessary and useful.
I tried my powers few years ago, and again about 6 months ago in one of those competitions called Toxic Comment Classification Challenge. Here are my thoughts on that short experience and some insight from scraping the results of that competition.
The yearly R in Finance conference is one of my favorites: