Introduction
At work, I recently spent a lot of time coding for someone else and, as with anything else you do, there is much to learn from it. It also got me thinking about scripting, and how best to go about it. It seems to me that the new working generation mostly tries to escape from working with Excel, but “let’s not kid ourselves: the most widely used piece of software for statistics is Excel” (Brian D. Ripley). This quote is almost 15 years old, but Excel still has a strong hold on the industry.
Here I discuss a few good coding practices. Coding for someone else is not to be taken literally here. ‘Someone else’ is not necessarily a colleague; it could just as easily be the “future you”, the you reading your code six months from now (if you are lucky enough to get responsive referees). Has it never happened to you that your past self was unduly cruel to your future self? That you went back to some old code snippets and dearly regretted not adding a few comments here and there? Of course it has.
The usual metric for “good” code is efficiency: good = efficient. Here the metric is different: good = friendly. They call this literate programming. There is a fairly deep discussion of this paradigm by John D. Cook (follow what he has to say if you are not yet doing so; there is something for everyone).
Keeping someone else in mind does not come naturally, especially after years of teaching yourself to be efficient by other metrics (e.g. speed), which can sometimes be at odds with being friendly. But in practice, you need to collaborate. Speedy code may frustrate someone who is coming from Excel and is new to scripting. Efficient coding may be penny wise but pound foolish. The precious seconds you save by writing the most elegant code will cost others full minutes of scratching their heads, asking themselves what (the hell) is going on. Again, the others may well be the future you.
Good coding practices
Friendly coding practices start with the obvious: use a consistent coding style. Think of script the way you think of any other language. Don’t write ugly code sentences, just as you would not write “Thinkofscriptthewayyouthinkofanyotherlanguage”.
Coding style
In R, a good place to start is the R style guide.
I don’t always follow all of the suggestions there. For example, I feel we can do with just one space, after the equals sign:
x <- matrix(nrow= 2, ncol= 2)
instead of one space before and one after
x <- matrix(nrow = 2, ncol = 2)
Simply use the style guide as a starting point and stick with the deviations you choose to make.
Name everything
This is one I had to force myself to do. I mean, why name an object if you know perfectly well what it is? No! Spend a few more seconds and name everything. Also, make your names meaningful and consistent with what may be found elsewhere. If you are working with cross-sectional time series, it is customary to use the letter T for the time dimension and a letter like P or K for the cross-section dimension (firms, people, countries). Don't be creative. If the letter T is "occupied" (in R, T is shorthand for TRUE), use a workaround like TT, tT, T0 or something like that.
Name all dimensions clearly:
    rownames_vector <- c("2016", "2017")
    TT <- length(rownames_vector)

    colnames_vector <- c("Orange", "Banana")
    P <- length(colnames_vector)

    x <- matrix(1:4, nrow= TT, ncol= P)
    colnames(x) <- colnames_vector
    rownames(x) <- rownames_vector
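As a small extension of the snippet above, the dimensions themselves can carry names too, so that printed output says what the rows and columns stand for. This is only a sketch; the labels `year` and `fruit` are my own assumption and not part of the original example.

```r
rownames_vector <- c("2016", "2017")
colnames_vector <- c("Orange", "Banana")
TT <- length(rownames_vector)  # time dimension
P  <- length(colnames_vector)  # cross-section dimension

# 'year' and 'fruit' are illustrative labels (an assumption)
x <- matrix(1:4, nrow= TT, ncol= P,
            dimnames= list(year= rownames_vector, fruit= colnames_vector))

print(x)            # the header now shows both dimension labels
names(dimnames(x))  # the labels travel with the object
```

The design choice here is simply to store the explanation inside the object rather than in a comment next to it.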
Learn to report what you are doing, as you go along
Use the following commands extensively: print, paste, View, plot, cat, str, head, tail, dim, summary. Use them often enough that the reader can, as much as possible, follow each step.
To make reporting on objects easier, you can write a helper function, something like:
    describe_it <- function(x){
      # print("Dimension:"); print(dim(x))
      # or
      cat("-------", "\n", "Dimension:", "\n", dim(x), "\n")
      # print(head(x, 2))
      cat("-------", "\n", "first line", "\n", head(x,1) , "\n")
      cat("-------", "\n", "last line", "\n", tail(x,1) , "\n")
    }

    describe_it(x)
    # Output for the 2 x 2 matrix x defined above:
    # -------
    # Dimension:
    # 2 2
    # -------
    # first line
    # 1 3
    # -------
    # last line
    # 2 4
You can use the summary function above as is, or adjust it to your needs.
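One possible adjustment is sketched here, assuming you also want to describe plain vectors, for which dim returns NULL. The name `describe_any` is made up for illustration, and the layout of the report is just one option.

```r
# describe_any is a hypothetical variant of describe_it (the name is an assumption)
describe_any <- function(x){
  # dim() is NULL for plain vectors, so fall back on length()
  size <- if (is.null(dim(x))) length(x) else dim(x)
  cat("-------\n", "Class: ", class(x)[1], "\n", sep= "")
  cat("-------\n", "Size: ", paste(size, collapse= " x "), "\n", sep= "")
  cat("-------\n", "First elements:\n", sep= "")
  print(head(x, 2))
  invisible(size)  # return the size quietly, in case the caller wants it
}

describe_any(matrix(1:4, nrow= 2))  # a matrix
describe_any(letters)               # a plain vector
```

Returning the size invisibly keeps the console clean while still letting a caller store the result.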
Tip: when (and only when) you aim to print, keep the code on the same line. It reads better on the console:
    cat("Items sold:", "\n")
    print(x)
    # Items sold:
    # print(x)
    #      Orange Banana
    # 2016      1      3
    # 2017      2      4

    cat("Items sold:", "\n") ; print(x)
    # Items sold:
    #      Orange Banana
    # 2016      1      3
    # 2017      2      4
We don't need to print print(x) in the console, so the second option is better.
Use comments
Comments don't burden the RAM, so they are a very economical way to enhance the readability of your code.
Comment your code generously
You don't need to do that for each line. But insert comments if there is a loop coming up, or if you think a specific operation is uncommon or unfamiliar to the user.
    # reading the data into an object which would be called x
    x <- list(c(1:4), c(1:4))
    ## x is now of class list
    class(x)
    # "list"

    ## we now convert x to a matrix
    ## simplify2array is a function to intelligently convert a list to an array
    x <- simplify2array(x)
    ## verify it is now a matrix
    class(x)
    # "matrix"   (R 4.0.0 and later print "matrix" "array")

    ## you can learn more about the simplify2array function by typing ?simplify2array
Comments as additional optional actions
Comments are useful not only for explaining, but also for giving the option to explore objects further. When users run the code, you may not want to overload them with object descriptions, but you can make a closer look optional. View is a function which opens a separate window for a deeper look at an object, so to avoid filling the screen with windows you make it optional:
    # View(x) # Uncomment this line to view x
    # plot(x) # Uncomment this line to plot x
Comments for objects
If an object needs more than a few words of explanation, you had better not add them directly to the script, so as not to inflate it. What you can do instead is use the comment function. I wrote about it in a different context here.
Here is an example of the comment function:
    # An object
    x <- array(dim = rep(5,3))
    # Attach a comment
    comment(x) <- "This is a 5*5*5 array. The dim1 stands for.., dim2 for.. and dim3 for.."
    # Save it
    save(x,file = "Path/x.RData")

    # Break period of 8 months and come back
    load("Path/x.RData")
    comment(x)
    # "This is a 5*5*5 array. The dim1 stands for.., dim2 for.. and dim3 for.."
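The same round trip can be tried without a real path. The sketch below assumes a throwaway file created with tempfile() in place of the placeholder path, and the name `tmp_file` is mine; everything else is base R.

```r
# A small array with an attached comment
x <- array(0, dim= rep(5, 3))
comment(x) <- "This is a 5*5*5 array."

# tmp_file is a throwaway location (an assumption standing in for a real path)
tmp_file <- tempfile(fileext= ".RData")
save(x, file= tmp_file)

# Remove the object, then load it back as if months had passed
rm(x)
load(tmp_file)
comment(x)  # the comment survived the round trip
dim(x)      # and so did the data
```

Note that the comment is not shown when the object is printed, so it stays out of the way until you ask for it with comment(x).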
Code splitting
A lot can be done with one-liners. That is why we like scripting. It is nice and efficient, but less readable. Trade elegance for readability. When possible, make your code more modular.
Split your operations
Here is an example of a find-and-replace operation: we replace the value 4 with the value 8.
    # Define again the same matrix from the first chunk:
    rownames_vector <- c("2016", "2017")
    TT <- length(rownames_vector)
    colnames_vector <- c("Orange", "Banana")
    P <- length(colnames_vector)
    x <- matrix(1:4, nrow= TT, ncol= P)
    colnames(x) <- colnames_vector
    rownames(x) <- rownames_vector

    # Now assume there is a mistake in the value of Banana - 2017
    # and we need to replace it:

    ## Bad
    # find the index and replace the value
    x[ which(x[,"Banana"]== 4), "Banana"] <- 8

    ## Better
    # find the index, tmp_index
    tmp_index <- which(x[, "Banana"]== 4)
    # Replace the value
    x[tmp_index, "Banana"] <- 8
Note how quickly your code can become ugly with those one-liners. True, we create some overhead here with a completely redundant tmp_index variable, but if you have the resources for such overhead, you should use them.
Split your code - avoid very long lines
Splitting operations as we have done above makes for understandable code. Try also to make it more readable. Remember what we wrote before: (1) name everything and (2) give it a meaningful name. Your code will quickly lengthen, and you will need to operate on elements with lengthy names, which may spread even a single operation over multiple lines. What I suggest is to define the subsets "outside" the operation. You can use temporary indexes for your subsets.
    # Write
    tmp_years <- as.character(2017:2018)
    tmp_fruits <- c("Banana", "Orange", "Mango")

    # and work with
    Store_A[tmp_years, tmp_fruits]

    # Instead of with
    Store_A[ as.character(2017:2018), c("Banana", "Orange", "Mango")]
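The snippet above does not define Store_A, so here is a runnable sketch. The matrix and its contents are made up for illustration (an assumption), including the extra year and the extra fruit, which are there only so that the subset is a proper subset.

```r
# Store_A is a made-up sales matrix (an assumption for illustration)
Store_A <- matrix(1:12, nrow= 3, ncol= 4,
                  dimnames= list(as.character(2016:2018),
                                 c("Banana", "Orange", "Mango", "Kiwi")))

# Define the subsets outside the operation
tmp_years  <- as.character(2017:2018)
tmp_fruits <- c("Banana", "Orange", "Mango")

# Both forms select the same block; the first is easier to read
short_form <- Store_A[tmp_years, tmp_fruits]
long_form  <- Store_A[ as.character(2017:2018), c("Banana", "Orange", "Mango")]
identical(short_form, long_form)
```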
I add the prefix tmp_ to variables I mean to discard or ignore later on. By convention, you can safely 'run over' or reuse any object whose name carries the tmp_ prefix. If you don't want those temporary objects to pile up in memory, you can use the rm command to remove them.
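If the temporary objects all share the tmp_ prefix, they can be removed in one go. The sketch below is one way to do it with base R; the object names are invented for the example.

```r
# Some temporary objects and one object we want to keep (illustrative names)
tmp_index <- 1:3
tmp_years <- as.character(2017:2018)
keep_me   <- "not temporary"

# Remove every object in the current environment whose name starts with "tmp_"
rm(list= ls(pattern= "^tmp_"))

ls(pattern= "^tmp_")  # nothing left with the tmp_ prefix
exists("keep_me")     # other objects are untouched
```

The pattern is anchored with ^ so that an object such as my_tmp_copy would not be removed by accident.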
Split your code - reconsider loops
I use the apply function extensively. But unless you are familiar with R, I imagine it is not very easy to follow what it does. The apply function is compact, though it is not truly vectorized: internally it runs a loop of its own, so its speed advantage over an explicit loop is usually modest. Why vectorized code is faster is beside the point here; the curious reader is referred to Noam Ross's superb explanation.
If the operation is not very heavy, you should consider the less efficient but more readable loop.
    ## less readable
    sumx <- apply(x, 2 , sum)

    ## less efficient but more readable
    # Make a container for the sums of the columns of x
    sumx <- NULL
    # Sum each column of the x matrix
    for (i in 1 : ncol(x) ){
      sumx[i] <- sum(x[, i])
    }
Or, as a compromise, you can try the fairly new, more readable piping convention from the magrittr package, loaded with library(magrittr).
Instead of
sumx <- apply(x, 2 , sum, na.rm= TRUE)
you can use
sumx <- ( x %>% apply(MARGIN= 2, FUN= sum, na.rm= TRUE) )
which is arguably more readable. The outer parentheses are not necessary, but are there as a "separator" to create some distance between the pipe operator %>% and the assignment operator <-. I have yet to adopt this piping convention, but I have heard good things about it.
In R, there are often wrappers that are both efficient and readable, and using them lets you write clearer code. Use the friendly functions colMeans, colSums, rowMeans and rowSums, which are vectorized and spare you the call to apply.
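To see that the three routes agree, here is a short sketch on the small fruit matrix used earlier. The variable names `sum_apply`, `sum_loop` and `sum_wrapper` are mine, and the loop uses a pre-allocated container, which is one common way to write it.

```r
x <- matrix(1:4, nrow= 2, ncol= 2,
            dimnames= list(c("2016", "2017"), c("Orange", "Banana")))

# 1. apply over the columns (MARGIN= 2)
sum_apply <- apply(x, 2, sum)

# 2. explicit loop, with a pre-allocated container
sum_loop <- numeric(ncol(x))
for (i in seq_len(ncol(x))) {
  sum_loop[i] <- sum(x[, i])
}
names(sum_loop) <- colnames(x)

# 3. dedicated wrapper
sum_wrapper <- colSums(x)

# all.equal ignores the integer/double difference between the results
all.equal(sum_apply, sum_loop)
all.equal(sum_apply, sum_wrapper)
```

The wrapper is both the shortest and the one whose name says what it does, which is why it is worth reaching for first.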
Summary
Why not follow those guidelines?
The typical argument put forth is that of efficiency, of speed. The speed of writing and the speed of execution. 500 lines of code swell to 1000, and 1000 swell to 3000.
Although true, this argument and others like it concern only the immediate. The 'you' who just finished writing the code is happy, but that 'you' is a minority. Speedy, elegant code can save precious seconds now, at the expense of far more costly minutes (and even hours) spent by others in the near or far future, chewing on those veiled lines. 3000 lines of readable code may turn out to be more expeditious than those 1000 smart, condensed lines you know you can write.
Also, you have a good estimate of how much time is "lost" making sure your code is friendly. You have a very poor estimate of how much time will be "lost" because your code is unfriendly (or ugly as f). How long the future you will spend understanding your code is a function of the time that has passed and of the strength of your memory. How long your coauthors/colleagues will spend deciphering unfriendly code is a function of other things as well. How many people will be using it? How proficient are they? The worst possible case is where they can do the work themselves, and realize it may be quicker and safer for them to re-code what you did. How efficient is that?
As is usual in writing (and writing code is no different), the first version is dirty and there is nothing wrong with that. Try to revise and polish the inevitable subsequent versions.
That is it for now. Functions are important enough to have their own dedicated part. Meanwhile, I wish you productive literate scripting.