Modern statistical methods use simulations; generating different scenarios and repeating those thousands of times over. Therefore, even trivial operations burden computational speed.

In the words of my favorite statistician Bradley Efron:

“There is some sort of law working here, whereby statistical methodology always expands to strain the current limits of computation.”

In addition to the need for faster computation, the richness of open-source ecosystem means that you often encounter different functions doing the same thing, sometimes even under the same name. This post explains how to measure the computational efficacy of a function so you know which one to use, with a couple of actual examples for reducing computational time.

First, use the package `microbenchmark`

. It provides infrastructure to measure and compare the execution time of different R code. In Matlab you can use the the `tic toc`

functions and in Python you can use the `clock`

function from the `time`

module. You can find some toy code here.

**Example:** you need to extract the first value of a vector. I grew up with the function `head(vector, n= 1)`

which asks the first value of the vector. In Python’s Pandas it is `vector.head(1)`

. Now, there are other functions which, perhaps, are more speedy/efficient.

Google says that there is a function which is called `first`

. Actually, a function called `first`

is available from the `dplyr`

package, the `xts`

package and the `data.table`

package. So we have 4 different functions (that I know of, probably more out there) that do the same thing. If speed is of the essence, which function you should use? which is fastest? The following code-snippet answers.

1 2 3 4 5 6 7 8 |
TT <- 100 tmp_vector <- runif(TT) bench <- microbenchmark( head(tmp_vector,1), data.table:::first(tmp_vector), dplyr:::first(tmp_vector), xts:::first(tmp_vector), times=10^3, unit= "ns") |

The triple colon operator `:::`

is used here to access functions within the package without attaching the actual package to the search path. Since the packages are not of the same size I use the triple colon operator but the use of the double colon operator, `::`

is better practice in general. Here are the results of the timing operation based on 1000 times of executing each one of those function. The y-axis is divided by 10000 for readability.

With the risk of over-generalization, based on this analysis/figure the recommendations are: (1) use `data.table`

(as an aside, I hear a lot of other good things about it). (2) Do not use `dplyr`

when speed is important. (3) In this example, if the function `head`

is replaced by the function `first`

from the data.table package you can save around 33% of computational time. If you are still reading this, I can imagine your code has many pockets where you can improve the speed, so time your code if you ~~enjoy~~ need it. Doing something like this within few pockets of your code could save appreciable time.

### Speeding up eigenvalue decomposition

Eigenvalue decomposition is a mathematical operation which is common, and is computationally expensive even in fairly moderate dimension. The `eigen`

function is the go-to. I found two more faster solutions. You can use the `irlba`

function from the `irlba`

package which is good for very large matrices, or the `eigs_sym`

function from the `RSpectra`

package, which take advantage for when you don’t need the whole vector of eigenvalues (as is often the case) by simply replying with the first few largest values, and conserve computational time as a result. The code below illustrates the speed gain.

1 2 3 4 5 6 7 8 9 |
TT <- 500 tmp_mat <- lapply(rep(TT, 200), runif) tmp_mat <- do.call(cbind, tmp_mat) p <- 25 # Keeping just the 25 largest values bench <- microbenchmark(eigen(cov(tmp_mat))$values, eigs_sym(cov(tmp_mat), k= p )$values, times= 10^3) print(bench) |

expr | min | lq | mean | median | uq | max | neval |
---|---|---|---|---|---|---|---|

eigen | 23.9 | 24.8 | 26.8 | 26.1 | 27.9 | 45.1 | 1000 |

eigs_sym | 16.2 | 16.8 | 18.2 | 17.6 | 18.9 | 39.8 | 1000 |

So if you are interested in say the top few eigenvalues you can save good amount of time by using the `eigs_sym `

function.

Optimize your code away, friends.