R tips and tricks - using the piping operator

The R language has improved over the years. Amidst numerous splendid augmentations, the magrittr package by Stefan Milton Bache allows us to write more readable code. It uses an ingenious piping convention which will be explained shortly. This post talks about when to use those pipes, and when to avoid using pipes in your code. I am all about ~~that bass~~ readability, but I am also about speed. Use the pipe operator, but watch the tradeoff.

Piping convention example

We first get some data, say a matrix of daily returns for some sector ETFs:


library(quantmod)
symetf <- c('XLY','XLP','XLE','XLF','XLV','XLI','XLB','XLK','XLU','SPY')
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(as.Date("2010-01-01"), "%Y-%m-%d")
retd <- list()
for (i in seq_along(symetf)) {
  # src = "yahoo" because the Google Finance source is no longer available in quantmod
  dat0 <- getSymbols(symetf[i], src = "yahoo", from = start, to = end, auto.assign = FALSE)
  # daily simple return computed from the closing price (column 4)
  retd[[i]] <- as.numeric(dat0[2:NROW(dat0), 4]) / as.numeric(dat0[1:(NROW(dat0) - 1), 4]) - 1
}
# create a matrix of returns, column for tickers, rows for days
timeseries <- do.call(cbind, retd)
tail(timeseries, 3)

Now we would like to do the following:
1. Remove NA’s
2. Get the rankings in each row
3. Transpose the result so that each column represents a ticker.
Here is the code for those steps without using pipes:

t( apply( apply (timeseries, 2, na.omit), 1, rank) )

And this is what the same operations look like using the pipe operator %>% (make sure the magrittr library is loaded):

timeseries %>% apply(2, na.omit) %>% apply(1, rank) %>% t()

You can check that the two ways return exactly the same result using the function all.equal.
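A minimal sketch of that check, run on simulated data so it does not depend on a download. The matrix toy_returns and its dimensions are made up for illustration and merely stand in for timeseries:

```r
library(magrittr)

# Hypothetical stand-in for the returns matrix: 250 days x 10 tickers
set.seed(1)
toy_returns <- matrix(rnorm(2500, sd = 0.01), ncol = 10)

# The same three steps, written nested and written with pipes
nested <- t(apply(apply(toy_returns, 2, na.omit), 1, rank))
piped  <- toy_returns %>% apply(2, na.omit) %>% apply(1, rank) %>% t()

# TRUE when both versions produce the same matrix
all.equal(nested, piped)
```

On the real data you would simply pass the two expressions from above to all.equal in the same way.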

Readability-wise, using pipes makes the code much easier to understand. On the object timeseries, first (1) remove NAs, then (2) rank, then (3) transpose.
Without pipes, the code is, well, ugly. At first read, you need to be a fairly experienced R user to understand what is happening.
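To see what each of the three steps produces, here is a small sketch that stores every intermediate result. The matrix toy and its values are invented for illustration; the point is that apply over rows returns one column per row, which is why the final transpose is needed:

```r
# Hypothetical 3-day x 2-ticker return matrix (made-up values)
toy <- matrix(c(0.01, -0.02,  0.03,
                0.00,  0.05, -0.01), nrow = 3)

step1 <- apply(toy, 2, na.omit)  # (1) drop NAs column by column (none here)
step2 <- apply(step1, 1, rank)   # (2) rank within each day; one column per day
step3 <- t(step2)                # (3) transpose back to days x tickers

dim(step2)  # 2 3
dim(step3)  # 3 2
step3
```

The pipe version reads in the same order as these three assignments, just without the intermediate names.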

The piping convention makes a huge difference for R newcomers.

Now, what about speed? Does making your code more readable slow it down? Not in this case. The microbenchmark library conveniently times our operations:


library(magrittr)
library(microbenchmark)
microbenchmark(
  "No pipes" = t(apply(apply(timeseries, 2, na.omit), 1, rank)),
  "With pipes" = timeseries %>% apply(2, na.omit) %>% apply(1, rank) %>% t()
)
Unit: milliseconds
                 expr      min       lq     mean   median       uq      max    neval
            "No pipes" 21.93384 23.93330 26.35171 26.06845 27.81494 85.77408   100
          "With pipes" 21.63650 23.79731 27.03342 27.11475 28.00256 96.77798   100

This code runs each expression 100 times and measures how long every run takes to complete. Looking at the medians of those 100 runs, the differences are negligible. So using pipes we gain readability without losing any efficiency. But that is not always the case.

Using pipes can considerably slow down your code

Don’t use pipes blindly if you care about speed. On many occasions they can materially slow down your code.

For example, when we created the data we converted it to a matrix. But there are many ways to mold your data into a comfortable format. Say we use the unlist function to convert each list element to a numeric vector. In that case the readability gains from using the pipe operator carry a steep cost in terms of speed. Let’s compare
retd %>% lapply(unlist) with lapply(retd, unlist):


out <- microbenchmark(retd %>% lapply(unlist), lapply(retd, unlist) ) 
boxplot(out, names= c("One with pipes", 'One with no pipes'), unit= "ms")

Log of the time (milliseconds), 100 repetitions

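As you can see, using pipes is decisively slower. A plausible explanation is that the pipe adds a small, roughly fixed overhead to each call: next to the roughly 25-millisecond ranking operation above it is invisible, but when the piped operation is itself very fast, that overhead dominates the timing.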
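If you want to try this comparison without downloading data, the following is a rough sketch on a simulated list. The object toy_list is a made-up stand-in for retd, and the timings you get will depend on your machine and on your R and magrittr versions, so treat it as a way to measure rather than as a result:

```r
library(magrittr)
library(microbenchmark)

# Hypothetical stand-in for retd: a list of 10 numeric vectors
set.seed(42)
toy_list <- replicate(10, rnorm(2000, sd = 0.01), simplify = FALSE)

out <- microbenchmark(
  "With pipes" = toy_list %>% lapply(unlist),
  "No pipes"   = lapply(toy_list, unlist),
  times = 100
)

# Median time per expression, reported in microseconds
timings <- summary(out, unit = "us")
timings[, c("expr", "median")]
```

Comparing the two medians gives a quick read on how much the pipe costs for an operation this cheap.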

Main point

Use pipes when speed is a non-issue. When speed is important, dive deeper to see if you need to sacrifice readability for speed.
