The R language has improved over the years. Amidst numerous splendid augmentations, the magrittr package by Stefan Milton Bache allows us to write more readable code. It uses an ingenious piping convention, which will be explained shortly. This post talks about when to use pipes and when to avoid them in your code. I am all about readability, but I am also about speed. Use the pipe operator, but watch the tradeoff.
Piping convention example
We first get some data, say a matrix of daily returns for some sector ETFs:
library(quantmod)

symetf <- c('XLY','XLP','XLE','XLF','XLV','XLI','XLB','XLK','XLU','SPY')
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(as.Date("2010-01-01"), "%Y-%m-%d")
retd <- list()
for (i in seq_along(symetf)) {
  # NOTE: the "google" source may be unavailable in recent quantmod versions;
  # src = "yahoo" is a possible substitute (column 4 is the close there too)
  dat0 <- getSymbols(symetf[i], src = "google", from = start, to = end, auto.assign = FALSE)
  # daily simple return computed from column 4 (the closing price)
  n <- NROW(dat0)
  retd[[i]] <- as.numeric(dat0[2:n, 4]) / as.numeric(dat0[1:(n - 1), 4]) - 1
}
# create a matrix of returns, column for tickers, rows for days
timeseries <- do.call(cbind, retd)
tail(timeseries, 3)
Now we would like to do the following:
1. Remove NA’s
2. Get the rankings in each row
3. Transpose the result so that each column represents a ticker.
Here is the code for those steps without using pipes:
t( apply( apply (timeseries, 2, na.omit), 1, rank) )
And this is how the same operations look using the pipe operator %>% (make sure the magrittr library is loaded):
timeseries %>% apply(2, na.omit) %>% apply(1, rank) %>% t()
You can check that the two ways return exactly the same result using the function all.equal.
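As a quick sanity check, here is a minimal sketch of that comparison. Since downloading prices needs a network connection, it assumes a small simulated matrix (`sim`, a hypothetical stand-in for timeseries) with one all-NA row; the variable names are mine, not part of the original code.

```r
library(magrittr)

set.seed(1)
# Hypothetical stand-in for `timeseries`: 10 days x 5 tickers of simulated returns
sim <- matrix(rnorm(50, sd = 0.01), nrow = 10, ncol = 5)
sim[3, ] <- NA  # one day with missing data for every ticker

# Same three steps, written both ways
nested <- t(apply(apply(sim, 2, na.omit), 1, rank))
piped  <- sim %>% apply(2, na.omit) %>% apply(1, rank) %>% t()

all.equal(nested, piped)  # TRUE
dim(piped)                # 9 rows (days), 5 columns (tickers)
```

Note that apply(x, 2, na.omit) only returns a matrix when every column loses the same number of values, as is the case in this sketch.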
Readability-wise, using pipes makes the code much easier to understand. On the object timeseries, first do (1) remove NAs, then (2) rank, then (3) transpose.
Without pipes, the code is, well, ugly. On a first read, you really need to be quite an experienced R user to understand what is happening.
The piping convention makes a huge difference for R newcomers.
Now, what about speed? Does making your code more readable slow it down? Not in this case. The microbenchmark library conveniently times our operations:
library(microbenchmark)

# name each expression so the labels in the output are readable
microbenchmark(
  "No pipes"   = t( apply( apply (timeseries, 2, na.omit), 1, rank) ),
  "With pipes" = timeseries %>% apply(2, na.omit) %>% apply(1, rank) %>% t()
)

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
   No pipes 21.93384 23.93330 26.35171 26.06845 27.81494 85.77408   100
 With pipes 21.63650 23.79731 27.03342 27.11475 28.00256 96.77798   100
This code runs each expression 100 times and measures how long every run takes to complete. Looking at the median of those 100 runs, the differences are negligible. So by using pipes we gain readability without losing any efficiency. But that is not always the case.
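To make the "run it 100 times and time each run" idea concrete, here is a rough base-R sketch of the same procedure. It is only a conceptual illustration: microbenchmark relies on more precise timers, and the helper `time_once` and the simulated matrix `sim` are hypothetical names introduced for this example.

```r
# Time a single call of f() in seconds, using base R's elapsed-time clock
time_once <- function(f) {
  start <- proc.time()[["elapsed"]]
  f()
  proc.time()[["elapsed"]] - start
}

set.seed(1)
sim <- matrix(rnorm(5000), nrow = 1000, ncol = 5)  # hypothetical test data

# Run the same operation 100 times, recording the duration of each run
timings <- vapply(
  seq_len(100),
  function(i) time_once(function() t(apply(sim, 1, rank))),
  numeric(1)
)

median(timings)  # the summary statistic compared in the text
```

The median is used rather than the mean because a few slow runs (see the max column in the output above) would otherwise distort the comparison.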
Using pipes can considerably slow down your code
Don’t use pipes blindly if you care about speed. On many occasions they can materially slow down your code.
For example, when we created the data we converted it to a matrix. But there are many ways to mold your data into a comfortable format. Say we used the unlist function to convert from a list to a numeric vector. In that case the readability gains from using the pipe operator carry a steep cost in terms of speed. Let’s compare retd %>% lapply(unlist) with lapply(retd, unlist):
out <- microbenchmark(retd %>% lapply(unlist),
                      lapply(retd, unlist) )
boxplot(out, names = c("One with pipes", 'One with no pipes'), unit = "ms")
[Figure: boxplots of the log of the time (milliseconds), 100 repetitions]
As you can see, using pipes is decisively slower.
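If you want to reproduce this comparison without downloading any prices, a minimal sketch along the following lines should do. It assumes a simulated list `retd_sim` (a hypothetical stand-in for retd) and labels of my own choosing; the timings will vary with your machine and package versions, so no output is shown.

```r
library(magrittr)
library(microbenchmark)

set.seed(42)
# Hypothetical stand-in for `retd`: a list of 10 numeric return vectors
retd_sim <- replicate(10, rnorm(1500, sd = 0.01), simplify = FALSE)

# Both forms produce the same object, so only the timing differs
piped  <- retd_sim %>% lapply(unlist)
nested <- lapply(retd_sim, unlist)
identical(piped, nested)  # TRUE

out <- microbenchmark(
  with_pipes = retd_sim %>% lapply(unlist),
  no_pipes   = lapply(retd_sim, unlist),
  times = 100
)
print(out, unit = "us")  # report in microseconds
```

Looking at the absolute numbers alongside the relative gap helps you decide whether the slowdown matters for your own code.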
Main point
Use pipes when speed is a non-issue. When speed is important, dig deeper to see whether you need to sacrifice readability for speed.