First things first.

## What do we mean by *sparse estimation*?

Sparse – thinly scattered or distributed; not thick or dense.

First things first.

Sparse – thinly scattered or distributed; not thick or dense.

We often see statements like “linear regression makes the assumption that the data is normally distributed”, “Data has no or little multicollinearity”, or other such blunders (you know who you are..).

Let’s set the whole thing straight.

It has to be said. Linear regression does not even assume linearity for that matter, I argue. It is simply an estimator, a function. We don’t need to ask anything from a function.

Consider that linear regression has an additional somewhat esoteric, geometric interpretation. When we perform a linear regression you simply find the best possible, closest possible, linear projection we can. A linear combination in your X space that is as close as possible in a Euclidean sense (squared distance) to some other vector y.

That is IT! a simple geometric relation. No assumptions needed whatsoever.

You don’t ask anything from the average when you use it as an estimate for the mean do you? So why do that when you use regression? We only need to ask more **if we do something more**.

The term ‘curse of dimensionality’ is now standard in advanced statistical courses, and refers to the disproportional increase in data which is needed to allow only slightly more complex models. This is true in high-dimensional settings. Here is an illustration of the ‘Curse of dimensionality’ in action.

The top three for the year are:

Out-of-sample data snooping

Code for my yield curve forecasting paper

Review of a couple of books

I personally enjoyed the most writing a few words on ML estimation, and about those great statistical discoveries. Since the last post did not involve any code or images I initially thought it would be a breeze. I in fact spent twice the time I usually do, and it was all good fun.

In 2015 I wrote quite a bit about volatility and correlation. In 2016 I plan to learn more (so to write more) about portfolio construction.

Some time during the 18th century the biologist and geologist Louis Agassiz said: *“Every great scientific truth goes through three stages. First, people say it conflicts with the Bible. Next they say it has been discovered before. Lastly they say they always believed it”.* Nowadays I am not sure about the Bible but yeah, it happens.

I express here my long-standing and long-lasting admiration for the following triplet of present-day great discoveries. The authors of all three papers had initially struggled to advance their ideas, which echos the quote above. Here they are, in no particular order.

In multivariate volatility forecasting (4), we saw how to create a covariance matrix which is driven by few principal components, rather than a complete set of tickers. The advantages of using such factor volatility models are plentiful.

Perhaps it is the different jargon used in different disciplines, not sure. But for some reason, the terms ‘predictions’, ‘forecasts’ and ‘projections’ are frequently used interchangeably.

To be instructive, I always use very few tickers to describe how a method works (and this tutorial is no different). Most of the time is spent on methods that we can easily scale up. Even if exemplified using only say 3 tickers, a more realistic 100 or 500 is not an obstacle. But, is it really necessary to model the volatility of each ticker individually? No.

If we want to forecast the covariance matrix of all components in the Russell 2000 index we don’t leave much on the table if we model only a *few underlying factors*, much less than 2000.

Volatility factor models are one of those rare cases where the appeal is both theoretical and empirical. The idea is to create a few principal components and, under the reasonable assumption that they drive the bulk of comovement in the data, model those few components only.

Broadly speaking, complex models can achieve great predictive accuracy. Nonetheless, a winner in a kaggle competition is required only to attach a code for the replication of the winning result. She is not required to teach anyone the built-in elements of his model which gives the specific edge over other competitors. In a corporation settings your manager and his manager and so forth MUST feel comfortable with the underlying model. Mumbling something like “This artificial-neural-network is obtained by using a grid search over a range of parameters and connection weights where the architecture itself is fixed beforehand…”, forget it!

This post is about copulas and heavy tails. In a previous post we discussed the concept of correlation structure. The aim is to characterize the correlation across the distribution. Prior to the global financial crisis many investors were under the impression that they were diversified, and they were, for how things looked there and then. Alas, when things went south, correlation in the new southern regions turned out to be different\stronger than that in normal times. The hard-won diversification benefits evaporated exactly when you needed them the most. This adversity has to do with fat-tail in the joint distribution, leading to great conceptual and practical difficulties. Investors and bankers chose to swallow the blue pill, and believe they are in the nice Gaussian world, where the math is magical and elegant. Investors now take the red pill, where the math is ugly and problems abound.

Last time we showed how to estimate a CCC and DCC volatility model. Here I describe an advancement labored by Engle and Kelly (2012) bearing the name: *Dynamic equicorrelation*. The idea is nice and the paper is well written.

Departing where the previous post ended, once we have (say) the DCC estimates, instead of letting the variance-covariance matrix be, we force some structure by way of averaging correlation **across** assets. Generally speaking, correlation estimates are greasy even without any breaks in dynamics, so I think forcing some structure is for the better.

Given a constant speed, time and distance are fully correlated. Provide me with the one, and I’ll give you the other. When two variables have nothing to do with each other, we say that they are not correlated.

You wish that would be the end of it. But it is not so. As it is, things are perilously more complicated. By far the most familiar correlation concept is the **Pearson’s correlation**. Pearson’s correlation coefficient checks for linear dependence. Because of it, we say it is a parametric measure. It can return an actual zero even when the two variables are fully dependent on each other (link to cool chart).

Open source software has many virtues. Being free is not the least of which. However, open source comes with “ABSOLUTELY NO WARRANTY” and with no power comes no responsibility (I wonder..). Since no one is paying, by definition it is your sole responsibility to make sure the code does what it is supposed to be doing. Thus, looking “under the hood” of a function written by someone else is can be of service. There are more reasons to examine the actual underlying code.