Why complex models are data-hungry?

If you regularly read this blog then you know I am not one to jump on the “AI Bandwagon”, being quickly weary of anyone flashing the “It’s Artificial Intelligence” joker card. Don’t get me wrong, I understand it is a sexy term I, but to me it always feels a bit like a sales pitch.

If the machine does anything (artificially) intelligent it means that the model at the back is complex, and complex models need massive (massive I say) amounts of data. This is because of the infamous Curse of dimensionality.

I know it. You know it. Complex models need a lot of data. You have read this fact, even wrote it at some point. But why is it the case? “So we get a good estimate of the parameter, and a good forecast thereafter”, you reply. I accept. But.. what is it about simple models that they could suffice themselves with much less data compared to complex models? Why do I always recommend to start simple? and why the literature around shrinkage and overfitting is as prolific as it is?

More

Correlation and correlation structure (4) – asymmetric correlations of equity portfolios

Here I share a refreshing idea from the paper “Asymmetric correlations of equity portfolios” which was published in the Journal of financial Economics, a top tier journal in this field. The question is how much the observed conditional correlation on the downside (say) differs from the conditional correlation you would expect from a symmetrical distribution. You can find here an explanation for the H-statistic developed in the aforementioned paper and some code for illustration.

More

Understanding Spectral Clustering

Some problems are linear, but some problems are non-linear. I presume that you started your education discussing and solving linear problems which is a natural starting point. For non-linear problems solutions often involve an initial processing step. The aim of that initial step is to transform the problem such that it has, again, linear flavor.

A textbook example is the logistic regression, a tried-and-true recipe for getting the best linear boundary between two classes. In a standard neural network model, you will find logistic regression (or multinomial regression for multi-class output) applied on transformed data. Few preceding layers are “devoted” to transform a non-separable input-space into something which linear methods could handle, allowing the logistic regression to solve the problem with relative ease.

The same rationale holds for spectral clustering. Rather than working with the original inputs, work first with a transformed data which would make it easier to solve, and then link back to your original inputs.

Spectral clustering is an important and up-and-coming variant of some fairly standard clustering algorithms. It is a powerful tool to have in your modern statistics tool cabinet. Spectral clustering includes a processing step to help solve non-linear problems, such that they could be solved with those linear algorithms we are so fond of. For example, the undeniably popular K-means.

More

Boundary corrected kernel density

Density estimation is now a trivial one-liner script in all modern software. What is not so easy is to become comfortable with the result, how well is is my density estimated? we rarely know. One reason is the lack of ground-truth. Density estimation falls under unsupervised learning, we don’t actually observe the actual underlying truth. Another reason is that the theory around density estimation is seldom useful for the particular case you have at hand, which means that trial-and-error is a requisite.

Standard kernel density estimation is by far the most popular way for density estimation. However, it is biased around the edges of the support. In this post I show what does this bias imply, and while not the only way, a simple way to correct for this bias. Practically, you could present density curves which makes sense, rather than apologizing (as I often did) for your estimate making less sense around the edges of the chart; that is, when you use a standard software implementation.

More

Curse of Dimensionality part 4: Distance Metrics

Many machine learning algorithms rely on distances between data points as their input, sometimes the only input, especially so for clustering and ranking algorithms. The celebrated k-nearest neighbors (KNN) algorithm is our example chief, but distances are also frequently used as an input in the natural language processing domain; “You shall know a word by the company it keeps” (Firth, J. R. 1957:11); e.g. the word “jaguar” refers to the animal if words like “zoo” or “safari” are also in the neighborhood. But would refer to a mark of a car if words such as “parking” or “highway” are nearby. But (and a big one), ‘in the neighborhood’ means one thing in a low-dimension settings, and another thing in high-dimensional settings. This post emphasizes this important difference- another example of the curse of dimensionality; measuring distance in high dimension.

More

Understanding Pointwise Mutual Information in Statistics

Intro

The term mutual information is drawn from the field of information theory. Information theory is busy with the quantification of information. For example, a central concept in this field is entropy, which we have discussed before.

If you google the term “mutual information” you will land at some page which if you understand it, there would probably be no need for you to google it in the first place. For example:

Not limited to real-valued random variables and linear dependence like the correlation coefficient, mutual information (MI) is more general and determines how different the joint distribution of the pair (X,Y) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

which makes sense at first read only for those who don’t need to read it. It’s the main motivation for this post: to provide a clear intuition behind the pointwise mutual information term and equations, for everyone. At the end of this page, you would understand what mutual information metric actually measures, and how you should interpret it. We start with the easier concept of conditional probability and work our way through to the concept of pointwise mutual information.

More

CUR matrix decomposition for improved data analysis

I have recently been reading about more modern ways to decompose a matrix. Singular value decomposition is a popular way, but there are more. I went down the rabbit whole. After a couple of “see references therein” I found something which looks to justify spending time on this. An excellent paper titled “CUR matrix decomposition for improved data analysis”. This post describes how to single-out the most important variables from the data in an unsupervised manner. Unsupervised here means without a target variable in mind.

More

Forecast Combination talk

Courtesy of R Consortium, you can view my forecast combination talk (16 mins) given in France few months ago, below.

More

Understanding Variance Explained in PCA

Principal component analysis (PCA) is one of the earliest multivariate techniques. Yet not only it survived but it is arguably the most common way of reducing the dimension of multivariate data, with countless applications in almost all sciences.

Mathematically, PCA is performed via linear algebra functions called eigen decomposition or singular value decomposition. By now almost nobody cares how it is computed. Implementing PCA is as easy as pie nowadays- like many other numerical procedures really, from a drag-and-drop interfaces to prcomp in R or from sklearn.decomposition import PCA in Python. So implementing PCA is not the trouble, but some vigilance is nonetheless required to understand the output.

This post is about understanding the concept of variance explained. With the risk of sounding condescending, I suspect many new-generation statisticians/data-scientists simply echo what is often cited online: “the first principal component explains the bulk of the movement in the overall data” without any deep understanding. What does “explains the bulk of the movement in the overall data” mean exactly, actually?

More

The Distribution of the Sample Maximum

Where I work we are now hiring. We took few time-consuming actions to make sure we have a large pool of candidates to choose from. But what is the value in having a large pool of candidates? Intuitively, the more candidates you have the better the chance that you will end up with a strong prospective candidate in terms of experience, talent and skill set (call this one candidate “the maximum”). But what are we talking about? is this meaningful? If there is a big difference between 10 candidates versus 1500 candidates, but very little difference between 10 candidates versus 80 candidates it means that our publicity and screening efforts are not very fruitful\efficient. Perhaps it would be better running quickly over a small pool, few dozens candidates, and choose the best fit. Below I try to cast this question in terms of the distribution of the sample maximum (think: how much better is the best candidate as the number of candidates grow).

More

Portfolio Construction Tilting towards Higher Moments

When you build your portfolio you must decide what is your risk profile. A pension fund’s risk profile is different than that of a hedge fund, which is different than that of a family office. Everyone’s goal is to maximize returns given the risk. Sinfully but commonly risk is defined as the variability in the portfolio, and so we feed our expected returns and expected risk to some optimization procedure in order to find the optimal portfolio weights. Risk serves as a decision variable. You choose the risk, and (hope to) get the returns.

A new paper from Kris Boudt, Dries Cornilly, Frederiek Van Hollee and Joeri Willems titled Algorithmic Portfolio Tilting to Harvest Higher Moment Gains makes good progress in terms of our definition of risk, and risk-return trade-off. They propose a quantified way in which you can adjust your portfolio to account not only for the variance, but also for higher moments, namely skewness and kurtosis. They do that in two steps. The first is to simply set your portfolio based on whichever approach you follow (e.g. minvol, equal risk contribution or other). In the second step you tilt the portfolio such that the higher moments are brought into focus and get the attention they deserve. This is done by deviating from the original optimization target so that higher moments are utility-improved: less variance, better skew and lower kurtosis.

More

Adaptive Huber Regression

Many years ago, when I was still trying to beat the market, I used to pair-trade. In principle it is quite straightforward to estimate the correlation between two stocks. The estimator for beta is very important since it determines how much you should long the one and how much you should short the other, in order to remain market-neutral. In practice it is indeed very easy to estimate, but I remember I never felt genuinely comfortable with the results. Not only because of instability over time, but also because the Ordinary Least Squares (OLS from here on) estimator is theoretically justified based on few text-book assumptions, most of which are improper in practice. In addition, the OLS estimator it is very sensitive to outliers. There are other good alternatives. I have described couple of alternatives here and here. Here below is another alternative, provoked by a recent paper titled Adaptive Huber Regression.

More