A new paper titled *“Beta in the tails”* is a showcase application of why we should focus on the correlation structure rather than on the average correlation. The authors address the question: *Do hedge funds hedge?* The answer: no, they don’t!

The paper *“Beta in the tails”* was published in the *Journal of Econometrics*, but you can find a link to a working paper version below. We start with a figure replicated from the paper, go through its meaning and interpretation, and then explain the methods used.

Hedge funds don’t hedge.

The figure was generated with the `rq` function from the `quantreg` package. On the X-axis are the estimation quantiles (e.g. 50 means roughly median monthly returns). The data is taken from lab.credit-suisse.com (registration needed). The figure shows that when market returns are low (lower quantiles on the X-axis), hedge fund returns move more in tandem (so they are also low) than when market returns are at the higher end (higher quantiles on the X-axis). If hedge funds were actually serving us as a hedge, we should see exactly the reverse: when markets do poorly, hedge fund returns would “kick in” to compensate for market losses.

More interesting points from the paper:

- The figure above is typical, and holds for almost all hedge fund styles. The wording “in the tails” is there because the slope is very steep at the very low quantiles (the left tail of the return distribution). Meaning the hedge, if that is what you tell yourself you are doing, fails exactly when you need it the most; it adds tail risk rather than removing it.
- Two styles are actually not a bad hedge: *managed futures* and *equity market neutral*. It’s nice to see that an analysis I made back in 2012 about the most profitable hedge fund styles actually holds up, as 2 out of the 3 winning styles are indeed market neutral.

I followed a paper written by Dirk Baur, *The structure and degree of dependence: A quantile regression approach*, and used slightly revised code from the post *Correlation and correlation structure (1); quantile regression*. Basically you loop a quantile regression through the different quantiles and estimate the beta at each one. While we could discuss the more general notion of correlation, hedge funds clearly follow the market rather than the other way around, so we can simply look at the beta from that regression. The actual function I used is given below.

```r
library(quantreg)

corquantile <- function(seriesa, seriesb, k = 10){
  if (length(seriesa) != length(seriesb)) { stop("length(seriesa)!=length(seriesb)") }
  TT <- length(seriesa)
  cofa <- cofb <- NULL
  for (i in k:(100 - k)){
    # The workhorse:
    lm0 <- summary(rq(seriesa ~ seriesb, tau = (i/100)))
    lm1 <- summary(rq(seriesb ~ seriesa, tau = (i/100)))
    cofa[i - k + 1] <- lm0$coef[2, 1]
    cofb[i - k + 1] <- lm1$coef[2, 1]
  }
  return(list(cofa = cofa, cofb = cofb))
}
```
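To see what the loop estimates, here is a small sketch on simulated data (the series and every number below are made up for illustration only):

```r
library(quantreg)

# Simulated "market" and "fund" returns with a constant true beta of 0.6
set.seed(1)
n <- 500
market <- rnorm(n)
fund <- 0.6 * market + rnorm(n, sd = 0.5)

# Quantile-betas at the 10th, 50th and 90th percentiles
taus <- c(0.1, 0.5, 0.9)
betas <- sapply(taus, function(tau) coef(rq(fund ~ market, tau = tau))[2])
round(betas, 2)  # roughly 0.6 everywhere: simulated data has no tail asymmetry
```

In the hedge fund figure, by contrast, the analogous estimates are much higher at the low quantiles than at the high ones.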

* The first sentence is a quote by Michael Lewis.

A distinctive power of neural networks (neural nets from here on) is their ability to flex themselves in order to capture complex underlying data structure. This post shows that the expressive power of neural networks can be quite swiftly taken to the extreme, in a bad way.

What does that mean? A paper from 1989 (the universal approximation theorem, reference below) shows that any reasonable function can be approximated arbitrarily well by a fairly shallow neural net.

Speaking freely, if one wants to abuse the data, to overfit it like there is no tomorrow, then neural nets are the way to go; with neural nets you can perfectly map your fitted values to any data shape. Let’s code an example and explain what this means.

I use figure 4.7 from the book *Neural Networks and Statistical Learning* as the function to be approximated (as if the data came from a noisy version of the curve below).

```r
x <- seq(0, 2, length.out = 100)
fx <- function(x){ sin(2*pi*x) * exp(x) }
y <- fx(x)
plot(x, y)
```

Using the code below, you can generate the plotted function as a linear combination of ReLU-activated neurons. In a neural net those are the hidden units. I use 10, 20, and 50 neurons. You can see below that the more neurons you use, the more flexible the fit. With only 10 neurons the fit is so-so, respectable with 20, and wonderful with 50 (in red), capturing almost all turning points.

Nice, but what’s the point?

The point is that with great (expressive) power comes great (modelling) responsibility.

There is a stimulating paper by Leo Breiman (the father of the highly successful random forest algorithm) called “Statistical modeling: The two cultures”. Commenting on that paper, my favorite statistician writes:

“At first glance it looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way…”

Reason and care are needed for developing trust around deep learning models.

Drink and model, both responsibly (tip: sometimes better results if you do both simultaneously).

```r
relu <- function(z) ifelse(z >= 0, z, 0)

approx_fun <- function(x, a = a, b = b, c = c, d = d) {
  tmpp <- matrix(nrow = length(x), ncol = num_funs)
  for (j in 1:num_funs) {
    for (i in 1:length(x)) {
      tmpp[i, j] <- b[j] * relu(c[j] + d[j] * x[i])
    }
  }
  a + apply(tmpp, 1, sum)
}

relu_approx <- function(par) {
  a <- par[1]
  c <- par[2:(num_funs + 1)]
  b <- par[(num_funs + 2):(2*num_funs + 1)]
  d <- par[(2*num_funs + 2):(3*num_funs + 1)]
  approx <- approx_fun(x, a = a, c = c, b = b, d = d)
  er <- (approx - fx(x))^2
  er <- sum(er)
  return(er)
}

num_funss <- c(10, 20, 50)
approx_x_cache <- list()
for (i in num_funss) {
  num_funs <- i
  op1 <- optim(
    par = rnorm(i * 3 + 1),
    fn = relu_approx,
    method = "BFGS",
    control = list(trace = 1, maxit = 1000)
  )
  approx_x <- approx_fun(x,
    a = op1$par[1],
    b = op1$par[(i + 2):(2 * i + 1)],
    c = op1$par[2:(i + 1)],
    d = op1$par[(2 * i + 2):(3 * i + 1)]
  )
  approx_x_cache[[i]] <- approx_x
}
```

In 1885 it was shown (the Weierstrass approximation theorem) that any well-behaved function can be, as above, approximated arbitrarily well by a polynomial. In that regard, “..fitting NNs actually mimics Polynomial Regression, with higher and higher-degree polynomials emerging from each successive layer..” (quoted from the third reference below).
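A quick base-R illustration of that polynomial angle, fitting polynomials of growing degree to the same curve used above (the degrees are my own arbitrary picks):

```r
# Fit polynomials of growing degree to y = sin(2*pi*x) * exp(x)
x <- seq(0, 2, length.out = 100)
y <- sin(2 * pi * x) * exp(x)

degrees <- c(3, 7, 15)
rmse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))   # orthogonal polynomial regression
  sqrt(mean(residuals(fit)^2))
})
round(rmse, 3)  # in-sample error shrinks as the degree grows
```

Just like with more neurons, a higher degree buys more flexibility, and with it more overfitting potential.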

Let $f$ be a continuous function on a bounded subset of $d$-dimensional space. Then there exists a two-layer neural network $F$ with a finite number of hidden units that approximates $f$ arbitrarily well. Namely, for all $x$ in the domain of $f$, $|f(x) - F(x)| < \varepsilon$.

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). *Statistical Science*, 16(3), 199-231.


While the linear measure $\rho$, as expected, is very low for the left panel, the new coefficient of correlation manages to capture the non-linear relation well. There is more to like. While the data becomes noisier as we move from the top figures to the bottom ones, looking at the right panel the estimated $\rho$ decreases by not much, compared with the estimated $\xi$. A matter of taste I guess, but I myself find it easier to digest lower numbers on noisy data.

A few more comments before we adjourn here. The new coefficient of correlation $\xi$:

Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences

The actual formula (without ties) for the new coefficient of correlation:

$$\xi_n(X, Y) = 1 - \frac{3 \sum_{i=1}^{n-1} \left| r_{i+1} - r_i \right|}{n^2 - 1},$$

where the pairs are sorted so that $X_{(1)} \le \cdots \le X_{(n)}$, and $r_i$ is the rank of the $Y$ value paired with $X_{(i)}$.
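To make the formula concrete, here is a minimal base-R implementation, my own sketch assuming no ties in the data:

```r
# New coefficient of correlation (no ties assumed):
# sort the pairs by x, rank the y values, sum the absolute rank jumps
xi_cor <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

set.seed(1)
x <- runif(200, -1, 1)
xi_cor(x, x^2)   # close to 1: y is a deterministic function of x
cor(x, x^2)      # near 0: Pearson misses the non-linear relation
```

The contrast on a U-shaped relation is exactly the point of the coefficient: Pearson sees nothing, $\xi$ sees a near-perfect functional dependence.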

This post offers a matrix approximation perspective. As a by-product, we also show how to compare two matrices, to see how different they are from each other. Matrix approximation is a bit math-hairy, but we keep it simple here, I promise. I suspect this fascinating field will rise in importance: we are constantly stretching what we can do computationally, and by using approximations rather than the actual data we can ease that burden. The price of approximation is a decrease in accuracy (à la “garbage in, garbage out”), but with a good approximation the tradeoff between accuracy and computational time is favorable.

If you apply PCA to some matrix A, and you column-bind the first k principal components (call the result matrix B), that B matrix is the best approximation of the original matrix A you can get. You get a better and better approximation by increasing k, i.e. using more PCs, but using only k columns you cannot do any better.

Say you have 10 columns but want to work with only 4; that is your k. The 4 principal components constitute the best algebraic approximation (again, among those using only 4 columns) to the original matrix. Change a single entry in that 4-column matrix and you move away from your original A matrix. More details below, if you are interested in PCA internals.
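The claim can be checked numerically with the SVD, which is what PCA runs on under the hood (a small base-R sketch on a random matrix, all dimensions arbitrary):

```r
set.seed(1)
A <- matrix(rnorm(100 * 10), nrow = 100)   # toy 100 x 10 data matrix
s <- svd(A)

# Best rank-k approximation: keep only the k largest singular values
rank_k_approx <- function(k) {
  s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
}

# Frobenius error of the approximation for k = 1, 4 and 10 (full rank)
errs <- sapply(c(1, 4, 10), function(k) sqrt(sum((A - rank_k_approx(k))^2)))
round(errs, 2)  # error shrinks as k grows; essentially zero at full rank
```

At full rank the reconstruction recovers A exactly (up to floating-point noise), mirroring the `mat_approx4` result further down.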

Let’s expand the usual notion of distance between two points to matrices. If $a$ and $b$ are just numbers, $(a-b)^2$ is the squared euclidean distance between them. If they are vectors, say $a$ is 5 numbers and $b$ is 5 numbers, we compute the 5 quantities $(a_i - b_i)^2$ and sum them up. Matrices are no different: we simply sum over all the entries. We call this summation over rows and columns

$$\|E\|_F = \sqrt{\sum_i \sum_j e_{ij}^2}$$

the Frobenius norm of the matrix $E$. $E$ could be thought of as the error matrix $E = A - B$, the distance between $A$ (the original) and your approximation $B$. You can be happy with your approximation for $A$ if the Frobenius norm of the errors between the entries of $A$ and $B$ is small.

Coding-wise you don’t need to program this from scratch: `Matrix::norm` (R) and `np.linalg.norm` (Python) will do the trick.
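For a dense matrix the norm is also a one-liner in base R, which you can sanity-check against the built-in `norm` function (toy matrices, made up for illustration):

```r
A <- matrix(1:6, nrow = 2)   # toy "original" matrix
B <- A + 0.1                 # toy "approximation"
E <- A - B                   # error matrix

sqrt(sum(E^2))               # Frobenius norm computed by hand
norm(E, type = "F")          # same number from base R's norm()
```

Same number both ways, which is all the Frobenius norm is: root of the sum of squared entries.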

The following code pulls price data for 4 ETFs, which will be our matrix $A$, performs PCA, and binds the first few principal components (AKA scores), which will give our approximation matrix $B$.

```r
> library(magrittr)
> library(quantmod)
> sym = c('IEF','TLT','SPY','QQQ')
> l <- length(sym)
> end <- format(as.Date("2020-12-01"), "%Y-%m-%d")
> start <- format(as.Date("2007-01-01"), "%Y-%m-%d")
> dat0 <- (getSymbols(sym[4], src="yahoo", from=start, to=end, auto.assign = F))
> n <- NROW(dat0)
> dat <- array(dim = c(n, NCOL(dat0), l)) ; prlev = matrix(nrow = n, ncol = l)
> for (i in 1:l){
+   dat0 = (getSymbols(sym[i], src="yahoo", from=start, to=end, auto.assign = F))
+   dat[1:length(dat0[,i]),,i] = dat0
+   # Average price during the day
+   prlev[1:NROW(dat[,,i]),i] = as.numeric(dat[,4,i]+dat[,1,i]+dat[,2,i]+dat[,3,i])/4
+ }
> time <- index(dat0)
> x <- na.omit(prlev)
> ret <- 100*(x[2:NROW(x),] - x[1:(NROW(x)-1),]) # Move to returns
> head(ret,3)
      [,1]  [,2]  [,3] [,4]
[1,]  21.3  32.3 -37.3 36.0
[2,] -22.5 -38.2 -47.8 12.7
[3,]   7.5  18.8   0.5  7.5

pc0 <- prcomp(ret, center= F, scale.= F)
mat_approx4 <- pc0$x[,1:4] %*% t(pc0$rot[,1:4]) # using all 4 components
mat_approx3 <- pc0$x[,1:3] %*% t(pc0$rot[,1:3]) # using only 3
mat_approx2 <- pc0$x[,1:2] %*% t(pc0$rot[,1:2]) # you get the point..
mat_approx1 <- as.matrix(pc0$x[,1]) %*% t(pc0$rot[,1])
```

You can see that the approximation becomes better as k increases from 1 to 4. `ret - mat_approx` is our $E$ matrix in the math above. Using all 4 principal components we retrieve the original matrix $A$:

```r
> Matrix::norm( (ret - mat_approx1), ty= "F")
[1] 6390
> Matrix::norm( (ret - mat_approx2), ty= "F")
[1] 3220
> Matrix::norm( (ret - mat_approx3), ty= "F")
[1] 916
> Matrix::norm( (ret - mat_approx4), ty= "F")
[1] 0
```

The fewer principal components you use, the worse your approximation becomes (the norm of the $E$ matrix grows).

Coming back full circle to the start of the post, where I mentioned that PCA gives the best approximation: the Eckart-Young theorem tells us that the approximation we just made is the best we could have done (Frobenius-norm speaking; for other norms it may not be the best). If you want to know more about this topic, it falls under the apt name of sketching. A good sketch matrix B is such that computations can be performed on B rather than on A without much loss in precision; “sketch” stands here for “not the actual picture, but very economical and very clear”.

So on top of the statistical interpretation from previous PCA posts, you now also have this algebraic interpretation of PCA.

In the words of my favorite statistician, Bradley Efron:

“There is some sort of law working here, whereby statistical methodology always expands to strain the current limits of computation.”

In addition to the need for faster computation, the richness of the open-source ecosystem means you often encounter different functions doing the same thing, sometimes even under the same name. This post explains how to measure the computational efficiency of a function so you know which one to use, with a couple of actual examples for reducing computation time.

First, use the package `microbenchmark`. It provides infrastructure to measure and compare the execution time of different R expressions. In Matlab you can use the `tic` and `toc` functions, and in Python the `perf_counter` function from the `time` module. You can find some toy code here.
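If you prefer to stay in base R, `system.time` gives a rough first cut before reaching for `microbenchmark` (the loop below is a made-up, deliberately slow example):

```r
# Rough timing with base R only; microbenchmark is more precise
elapsed <- system.time({
  x <- numeric(0)
  for (i in 1:10^4) x <- c(x, i)   # deliberately slow: grows a vector in a loop
})["elapsed"]
elapsed   # wall-clock seconds spent
```

For anything shorter than a few milliseconds, though, `system.time`'s resolution is too coarse, which is exactly where `microbenchmark` earns its keep.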

**Example:** you need to extract the first value of a vector. I grew up with the function `head(vector, n= 1)`, which returns the first value of the vector. In Python’s Pandas it is `vector.head(1)`. Now, there are other functions which, perhaps, are more speedy/efficient.

Google says there is a function called `first`. Actually, a function called `first` is available from the `dplyr` package, the `xts` package and the `data.table` package. So we have 4 different functions (that I know of; probably more out there) that do the same thing. If speed is of the essence, which function should you use? Which is fastest? The following code snippet answers.

```r
library(microbenchmark)

TT <- 100
tmp_vector <- runif(TT)
bench <- microbenchmark(head(tmp_vector, 1),
                        data.table:::first(tmp_vector),
                        dplyr:::first(tmp_vector),
                        xts:::first(tmp_vector),
                        times= 10^3, unit= "ns")
```

The triple colon operator `:::` is used here to access functions within a package without attaching the actual package to the search path. Since the packages are not of the same size I use the triple colon operator, but the double colon operator `::` is better practice in general. Here are the results of the timing operation, based on 1000 executions of each of those functions. The y-axis is divided by 10000 for readability.

At the risk of over-generalization, based on this analysis/figure the recommendations are: (1) use `data.table` (as an aside, I hear a lot of other good things about it); (2) do not use `dplyr` when speed is important; (3) in this example, if the function `head` is replaced by the function `first` from the data.table package, you save around 33% of computation time. If you are still reading this, I imagine your code has many pockets where you can improve speed, so time your code if you ~~enjoy~~ need it. Doing something like this within a few pockets of your code could save appreciable time.

Eigenvalue decomposition is a common mathematical operation, and it is computationally expensive even in fairly moderate dimensions. The `eigen` function is the go-to. I found two faster alternatives. You can use the `irlba` function from the `irlba` package, which is good for very large matrices, or the `eigs_sym` function from the `RSpectra` package, which takes advantage of the fact that you often don’t need the whole vector of eigenvalues: it simply returns the first few largest values, and conserves computation time as a result. The code below illustrates the speed gain.

```r
library(microbenchmark)
library(RSpectra)

TT <- 500
tmp_mat <- lapply(rep(TT, 200), runif)
tmp_mat <- do.call(cbind, tmp_mat)
p <- 25 # keeping just the 25 largest values
bench <- microbenchmark(eigen(cov(tmp_mat))$values,
                        eigs_sym(cov(tmp_mat), k= p)$values,
                        times= 10^3)
print(bench)
```

| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| eigen | 23.9 | 24.8 | 26.8 | 26.1 | 27.9 | 45.1 | 1000 |
| eigs_sym | 16.2 | 16.8 | 18.2 | 17.6 | 18.9 | 39.8 | 1000 |

So if you are interested in, say, only the top few eigenvalues, you can save a good amount of time by using the `eigs_sym` function.

Optimize your code away, friends.