R tips and tricks – get the gist

In scientific programming speed is important. Functions written for general public use have a lot of control-flow checks which are not necessary if you are confident enough with your code.To quicken your code execution I suggest to strip run-of-the-mill functions to their bare bones. You can save serious wall-clock time by using only the laborers code lines. Below is a walk-through example of what I mean.

I use the quantile function for the example. There are many ways to compute the estimate of a quantile, and all those various ways are coded into the one quantile function. The function has the default argument type = 7 which indicates the particular way we wish to estimate our quantiles. Given that R is an open-source language you can easily find the code for any function, then you can “fish out” only the lines that you actually need. While the code for the quantile function is around 90 lines (given below), the real labor is carried out mainly by lines 49 to 58 – the main workhorse (for the type=7 default).

Now, let’s write our own version of the quantile function; call it lean_quantile. Then we make sure our lean_quantile does what its meant to do, and compare the execution time.


lean_quantile <- function(x, probs = seq(0, 1, 0.25)) {
  n <- length(x)
  np <- length(probs)
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  x <- sort(x, partial = unique(c(lo, hi)))
  qs <- x[lo]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
  qs
}

lean_quantile <- function(x, probs = seq(0, 1, 0.25)) {

n <- length(x)

np <- length(probs)

index <- 1 + (n - 1) * probs

lo <- floor(index)

hi <- ceiling(index)

x <- sort(x, partial = unique(c(lo, hi)))

qs <- x[lo]

i <- which(index > lo)

h <- (index - lo)[i]

qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]

}

Check that our lean_quantile does what its meant to do:


tmpp <- rnorm(10)
all( quantile(tmpp) == lean_quantile(tmpp) )
[1] TRUE

tmpp <- rnorm(10)

all( quantile(tmpp) == lean_quantile(tmpp) )

[1] TRUE

Now we can compare the execution time (more on timing and profiling code):


library(microbenchmark)
# citation("microbenchmark")
bench <- microbenchmark(quantile(rnorm(10)), 
                        lean_quantile(rnorm(10)), 
                        times=10^4)
bench
# Unit: microseconds
#                     expr  min   lq mean median   uq  max neval
#      quantile(rnorm(10)) 79.1 84.3 96.3   86.1 89.0 3907 10000
# lean_quantile(rnorm(10)) 27.9 31.6 36.2   33.4 34.8 4741 10000

library(microbenchmark)

# citation("microbenchmark")

bench <- microbenchmark(quantile(rnorm(10)),

lean_quantile(rnorm(10)),

times=10^4)

bench

# Unit: microseconds

# expr min lq mean median uq max neval

# quantile(rnorm(10)) 79.1 84.3 96.3 86.1 89.0 3907 10000

# lean_quantile(rnorm(10)) 27.9 31.6 36.2 33.4 34.8 4741 10000

Execution time is reduced by over 60%. Also, we did not have to work very hard for it. We can do more, diving further and improve the sort function which our lean_quantile uses, but you get the idea.

Is it a free lunch? Of course not.

It takes long to master efficient programming, and the functions you find in the public domain are probably well scrutinized – before and after they go up there. When you mingle with the internals you risk making a mistake, erasing an important line or creating unintended consequences and messing up the original behavior. So meticulous checks are good to do.

While some functions are written so efficiently that you will find very little value in pulling out just the workhorse, with most functions written for the general public you will certainly be able to squeeze out some time-profit. As you can see this “get the gist” tip has excellent potential to save a lot of waiting time.


#  Part of the R package, https://www.R-project.org
#
#  Copyright (C) 1995-2014 The R Core Team
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  https://www.R-project.org/Licenses/

quantile <- function(x, ...) UseMethod("quantile")

quantile.POSIXt <- function(x, ...)
    .POSIXct(quantile(unclass(as.POSIXct(x)), ...), attr(x, "tzone"))

quantile.default <-
    function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,
             type = 7, ...)
{
    if(is.factor(x)) {
        if(!is.ordered(x) || ! type %in% c(1L, 3L))
            stop("factors are not allowed")
        lx <- levels(x)
    } else lx <- NULL
    if (na.rm)
	x <- x[!is.na(x)]
    else if (anyNA(x))
	stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
    eps <- 100*.Machine$double.eps
    if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
	stop("'probs' outside [0,1]")
    n <- length(x)
    if(na.p <- any(!p.ok)) { # set aside NA & NaN
        o.pr <- probs
        probs <- probs[p.ok]
        probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
    }
    np <- length(probs)
    if (n > 0 && np > 0) {
        if(type == 7) { # be completely back-compatible
            index <- 1 + (n - 1) * probs
            lo <- floor(index)
            hi <- ceiling(index)
            x <- sort(x, partial = unique(c(lo, hi)))
            qs <- x[lo]
	    i <- which(index > lo)
	    h <- (index - lo)[i] # > 0	by construction
##	    qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
##	    qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
	    qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
        } else {
            if (type <= 3) {
                ## Types 1, 2 and 3 are discontinuous sample qs.
                nppm <- if (type == 3) n * probs - .5 # n * probs + m; m = -0.5
                else n * probs          # m = 0
                j <- floor(nppm)
		h <- switch(type,
			    (nppm > j),		# type 1
			    ((nppm > j) + 1)/2, # type 2
			    (nppm != j) | ((j %% 2L) == 1L)) # type 3
            } else {
                ## Types 4 through 9 are continuous sample qs.
                switch(type - 3,
                   {a <- 0; b <- 1},    # type 4
                       a <- b <- 0.5,   # type 5
                       a <- b <- 0,     # type 6
                       a <- b <- 1,     # type 7 (unused here)
                       a <- b <- 1 / 3, # type 8
                       a <- b <- 3 / 8) # type 9
                ## need to watch for rounding errors here
                fuzz <- 4 * .Machine$double.eps
                nppm <- a + probs * (n + 1 - a - b) # n*probs + m
                j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
                h <- nppm - j
                if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
            }
            x <- sort(x, partial =
                      unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
                      )
            x <- c(x[1L], x[1L], x, x[n], x[n])
            ## h can be zero or one (types 1 to 3), and infinities matter
####        qs <- (1 - h) * x[j + 2] + h * x[j + 3]
            ## also h*x might be invalid ... e.g. Dates and ordered factors
            qs <- x[j+2L]
            qs[h == 1] <- x[j+3L][h == 1]
            other <- (0 < h) & (h < 1)
            if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
        }
    } else {
	qs <- rep(NA_real_, np)
    }
    if(is.character(lx))
        qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)
    if(names && np > 0L) {
	names(qs) <- format_perc(probs)
    }
    if(na.p) { # do this more elegantly (?!)
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    } else qs
}

100

101

102

103

104

105

106

107

108

109

110

111

112

# Part of the R package, https://www.R-project.org

# This program is free software; you can redistribute it and/or modify

# it under the terms of the GNU General Public License as published by

# the Free Software Foundation; either version 2 of the License, or

# (at your option) any later version.

# This program is distributed in the hope that it will be useful,

# but WITHOUT ANY WARRANTY; without even the implied warranty of

# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

# GNU General Public License for more details.

# A copy of the GNU General Public License is available at

# https://www.R-project.org/Licenses/

quantile <- function(x, ...) UseMethod("quantile")

quantile.POSIXt <- function(x, ...)

.POSIXct(quantile(unclass(as.POSIXct(x)), ...), attr(x, "tzone"))

quantile.default <-

function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE,

type = 7, ...)

{

if(is.factor(x)) {

if(!is.ordered(x) || ! type %in% c(1L, 3L))

stop("factors are not allowed")

lx <- levels(x)

} else lx <- NULL

if (na.rm)

x <- x[!is.na(x)]

else if (anyNA(x))

stop("missing values and NaN's not allowed if 'na.rm' is FALSE")

eps <- 100*.Machine$double.eps

if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))

stop("'probs' outside [0,1]")

n <- length(x)

if(na.p <- any(!p.ok)) { # set aside NA & NaN

o.pr <- probs

probs <- probs[p.ok]

probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot

}

np <- length(probs)

if (n > 0 && np > 0) {

if(type == 7) { # be completely back-compatible

index <- 1 + (n - 1) * probs

lo <- floor(index)

hi <- ceiling(index)

x <- sort(x, partial = unique(c(lo, hi)))

qs <- x[lo]

i <- which(index > lo)

h <- (index - lo)[i] # > 0 by construction

## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])

## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])

qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]

} else {

if (type <= 3) {

## Types 1, 2 and 3 are discontinuous sample qs.

nppm <- if (type == 3) n * probs - .5 # n * probs + m; m = -0.5

else n * probs # m = 0

j <- floor(nppm)

h <- switch(type,

(nppm > j), # type 1

((nppm > j) + 1)/2, # type 2

(nppm != j) | ((j %% 2L) == 1L)) # type 3

} else {

## Types 4 through 9 are continuous sample qs.

switch(type - 3,

{a <- 0; b <- 1}, # type 4

a <- b <- 0.5, # type 5

a <- b <- 0, # type 6

a <- b <- 1, # type 7 (unused here)

a <- b <- 1 / 3, # type 8

a <- b <- 3 / 8) # type 9

## need to watch for rounding errors here

fuzz <- 4 * .Machine$double.eps

nppm <- a + probs * (n + 1 - a - b) # n*probs + m

j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)

h <- nppm - j

if(any(sml <- abs(h) < fuzz)) h[sml] <- 0

}

x <- sort(x, partial =

unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))

)

x <- c(x[1L], x[1L], x, x[n], x[n])

## h can be zero or one (types 1 to 3), and infinities matter

#### qs <- (1 - h) * x[j + 2] + h * x[j + 3]

## also h*x might be invalid ... e.g. Dates and ordered factors

qs <- x[j+2L]

qs[h == 1] <- x[j+3L][h == 1]

other <- (0 < h) & (h < 1)

if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]

}

} else {

qs <- rep(NA_real_, np)

}

if(is.character(lx))

qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)

if(names && np > 0L) {

names(qs) <- format_perc(probs)

}

if(na.p) { # do this more elegantly (?!)

o.pr[p.ok] <- qs

names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names

names(o.pr)[p.ok] <- names(qs)

o.pr

} else qs

}

Footnotes

As a side note, would be nice to do that in Python also, but the source code for the numpy quantile function is heavily “decorated”. Comment if you know how to create the Python counterpart.

Coding All-in-One For Dummies

You might also like:

6 comments on “R tips and tricks – get the gist”

Pingback: R tips and tricks – get the gist | R-bloggers
Kent Johnson says:

10/02/2022 at 12:43 PM

Note that the code you have removed is mostly error-checking for incorrect arguments and out-of-range values, and correct handling of NAs. So your fast version is less robust than the library version and should only be used when you are certain that the arguments are correct.

1. Eran Raviv says:
  
  10/02/2022 at 6:43 PM
  
  I agree wholeheartedly. It is by no means a free lunch. When you mingle with the internals you risk making a mistake, erasing an important line or creating unintended consequences and messing up the original behavior. It is indeed useful only if you are, very, comfortable with your code base and coding skills.
  
Pingback: Links 02/10/2022: Debian on Firmware Policy and PostgreSQL 15 RC 1 | Techrights
Analizy statystyczne says:

10/03/2022 at 6:59 AM

Now it would be nice to get a review of methods of finding function source codes. It is not always as easy as for quantile().

1. Eran Raviv says:
  
  10/03/2022 at 8:50 AM
  
  You are right. If I encounter “primitive” functions, or functions which are sent to C++ or Fortran I usually let it go. For getting the source code for functions which are written as methods you can check here.

Footnotes

You might also like:

6 comments on “R tips and tricks – get the gist”

Leave a Reply