Introduction
In part 1 of Good coding practices we considered how best to code for someone else, be it a colleague who comes from an Excel environment and is unfamiliar with scripting, a collaborator, a client, or the future you, the you a few months from now. In this second part, I give some of my thoughts on how best to write functions: the do’s and don’ts.
Of course, everything said in the first part is relevant here as well, but there are some additional particulars specific to writing functions.
Function arguments
In R, you can easily set default values for any of the function arguments. Don’t set any defaults which are not intuitive. Take a look at the arguments of the `dm.test` function in the `forecast` package. The function compares the forecast accuracy of two forecasting models using the Diebold-Mariano test.
```r
library(forecast)
args(dm.test)
# function (e1, e2, alternative = c("two.sided", "less", "greater"),
#     h = 1, power = 2)
```
The `alternative` argument takes one of three options: `c("two.sided", "less", "greater")`. It is natural that the user would like to compare the two forecasting models without any prior preference for either, i.e. using a two-sided test, and this is indeed the default for this argument. But another option is to force the user to get more involved. We can add a line at the beginning of the function which reads something like: `if (length(alternative) > 1) stop('Please specify the alternative hypothesis as one of c("two.sided", "less", "greater")')`
That `if` statement tells you whether the user left the argument at its initial specification, which is a vector of length 3. If so, it forces the user to specify it.
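To make the idea concrete, here is a minimal sketch of such a check. The function `compare_forecasts` is a hypothetical stand-in of my own, not part of the `forecast` package, and it only returns the chosen alternative rather than running any test.

```r
# Hypothetical wrapper for illustration; not part of the forecast package
compare_forecasts <- function(e1, e2,
                              alternative = c("two.sided", "less", "greater")) {
  # An untouched default is still the full vector of length 3
  if (length(alternative) > 1) {
    stop('Please specify the alternative hypothesis as one of ',
         'c("two.sided", "less", "greater")')
  }
  # match.arg() rejects anything that is not one of the allowed options
  alternative <- match.arg(alternative)
  alternative
}

# Leaving the argument untouched raises an error, captured here as a message
res <- tryCatch(compare_forecasts(rnorm(10), rnorm(10)),
                error = function(e) conditionMessage(e))
print(res)

# An explicit choice goes through
compare_forecasts(rnorm(10), rnorm(10), alternative = "less")
```

The cost is one extra argument to type on every call, which is exactly the involvement we wanted from the user.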
When you nonetheless decide an argument should hold its own default, you can communicate it to the user in the output. See how this is done with the `dm.test` function:
```r
tmp <- dm.test(rnorm(10), rnorm(10))
tmp
#
#  Diebold-Mariano Test
#
# data:  rnorm(10)rnorm(10)
# DM = 0.63064, Forecast horizon = 1, Loss function power = 2,
# p-value = 0.544
# alternative hypothesis: two.sided
```
Notice the last line, where it is clearly stated that the alternative hypothesis is two-sided.
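In your own functions, one possible way to do something similar is to tell the user when a default has kicked in. The sketch below uses a hypothetical function `mean_loss` (my own name, for illustration only) together with base R's `missing()` and `message()`.

```r
# Hypothetical function, for illustration only
mean_loss <- function(e, power = 2) {
  # missing() is TRUE when the caller did not supply the argument
  if (missing(power)) {
    message("power not specified, using the default power = 2")
  }
  mean(abs(e)^power)
}

mean_loss(c(-1, 2, -3))             # informs the user about the default
mean_loss(c(-1, 2, -3), power = 1)  # silent, the user made a choice
```

Using `message()` rather than `print()` lets the user silence the note with `suppressMessages()` if it gets in the way.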
En garde
R can make particular choices without informing you. As an example, if we use the `mean` function on a matrix with several columns, the columns are stacked before averaging. This is a conscious behavioral choice. For the inexperienced programmer, it can be fertile ground for hidden bugs: bugs which are hard to spot, since the results are only slightly off, and which usually go unnoticed. I speak from experience here.
A toy illustration:
```r
x <- rnorm(10)
mean(x)
# [1] -0.230269
mean( x - mean(x) ) # centering
# [1] 1.248459e-17

# Now with a matrix
x <- matrix(ncol= 2, nrow= 10)
x <- apply(x, 2, rnorm ) # fill it with random normal vectors
mean(x) # R stacks the matrix and returns a single number
# [1] -0.09613194

# Now after centering:
apply( x - mean(x), 2, mean) # wrong center
# [1] -0.1208447  0.1208447

# You can create a mean function which only operates on vectors
mean_vector <- function(x){
  if (NCOL(x) != 1) stop("Make sure x is a one column vector")
  mx <- mean(x)
  return(mx)
}
mean_vector(x)
# Error in mean_vector(x): Make sure x is a one column vector
```
Whether the choices R makes are predictable or not is beside the point here. Simply try to protect new users from hidden choices R makes. Of course, we need to strike a balance between defensive programming and maintenance/runtime costs. There are many other such behavioral choices R makes for you. See the beautifully written R Inferno by Patrick Burns from portfolioprobe.
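Returning to the centering example, here is a small sketch of one way to center each column by its own mean, using base R's `colMeans` and `sweep`. The seed and the matrix dimensions are arbitrary choices of mine, made only so the example is reproducible.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- matrix(rnorm(20), ncol = 2)

# Subtract each column's own mean; MARGIN = 2 means "by column"
x_centered <- sweep(x, MARGIN = 2, STATS = colMeans(x))

# Both column means are now zero, up to floating point error
colMeans(x_centered)

# scale() with scale = FALSE should give the same centering
all.equal(x_centered, scale(x, center = TRUE, scale = FALSE),
          check.attributes = FALSE)
```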
Ill-advised
Certain actions have a very good chance of frustrating the user. Here are a couple of those.
Functions within functions
A function which looks like this:
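```r
function_1 <- function(x, y, z){
  # hidden_function is defined inside function_1, out of the user's sight
  hidden_function <- function(z){
    z * 2 # placeholder body; what it computes does not matter here
  }
  x <- hidden_function(z) # x gets its value from z
  out <- sum(x, y)
  return(out)
}
```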
Perhaps it is a good idea to help `x` get its value from `z`. That is fine, but do it “outside” `function_1`. As written here, `hidden_function` is cloaked under the function `function_1`. If the user would like to see the inputs, better understand, or (god forbid) debug `hidden_function`, she must, unavoidably and unnecessarily, first step through `function_1`. Unless you are very experienced, or you have no responsibility towards your users or co-authors, I don’t think it is a good idea.
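A sketch of the “outside” alternative could look as follows. The helper name `derive_x` and its body are placeholders of my own, since what the helper actually computes is beside the point.

```r
# Defined at the top level, so it can be inspected, tested and debugged on its own.
# The body is a placeholder chosen for illustration.
derive_x <- function(z) {
  z * 2
}

function_1 <- function(y, z) {
  x <- derive_x(z)  # x still gets its value from z
  out <- sum(x, y)
  return(out)
}

derive_x(3)               # the helper can be called directly
function_1(y = 1, z = 3)
```

With this layout, `debug(derive_x)` steps into the helper without first going through `function_1`.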
Avoid the super assignment operator
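Any object we create within a function is contained within the environment of that particular function. We can “up” the object outside that environment using the super assignment operator `<<-`. This creates (or overwrites) a variable in an enclosing environment: R searches the parent environments for an existing variable of that name and, if none is found, assigns in the global environment.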
```r
tmp_function <- function(){
  k <- 10
  print(paste("k is", k))
}
tmp_function()
# [1] "k is 10"
tryCatch(print(k), error = function(e) {print("error")} )
# [1] "error"

tmp_function <- function(){
  k <<- 10
  print(paste("k is", k))
}
tmp_function()
# [1] "k is 10"
tryCatch(print(k), error = function(e) {print("error")} )
# [1] 10
```
This is something you should avoid if you can. I think in general, writing "up" is a bad idea. But more so when you write for someone else. Simply, it is harder to realize an object has been created (or overwritten) from a local (function) environment than when it is created (or overwritten) in the usual manner.
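One common way around `<<-` is to return the value and let the caller do the assignment. The function names below (`make_k`, `make_values`) are hypothetical, chosen only for this sketch.

```r
make_k <- function() {
  k <- 10
  k  # returned to the caller; nothing outside the function is modified
}

k <- make_k()  # the assignment is visible at the call site
print(paste("k is", k))

# When several objects are needed, return them together in a named list
make_values <- function() {
  list(k = 10, label = "ten")
}
values <- make_values()
values$k
```

Anyone reading the calling script can now see where `k` comes from without opening the function.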
One operation, one name
R, being open source, enjoys massive continuous development. You can now find a few functions which, at their core, are meant to do the same thing. For example, `as.character` and `toString` are, at least for a single value, to a large extent the same. Don't switch between them unnecessarily. Pick one and stick with it.
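A quick check of the overlap, and of where it ends, using only base R:

```r
# For a single value the two functions agree
as.character(3.14)
toString(3.14)
identical(as.character(3.14), toString(3.14))

# For longer vectors they differ: toString() collapses to one string
as.character(1:3)  # a character vector of length 3
toString(1:3)      # a single string, "1, 2, 3"
```

So which one you stick with should depend on whether you ever pass it more than one value.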
Final word
Properly writing and sharing code is important and not easy, though it is way easier than it used to be. Here is a picture of Margaret Hamilton standing next to what appears to be a hard copy of the code written for the Apollo 11 space program (mid-1960s).
The code itself is now published on GitHub. Out of curiosity, I took a look at the part which guides the spacecraft into orbit around the moon. From the code itself, I admittedly did not get much. But you can see many of the elements discussed here in both parts of this post.
Don't miss those hilarious comments. ("This is fixed in Apollo 14").
Enjoy.
Couple of excellent books
Introduction to Scientific Programming
Is your problem with nested functions just in R or more generally?
Broadly speaking, functions should not have a nested structure. Independent of language, I simply don’t see the big minus in keeping them ‘outside’.