Background

We want the code to be interpretable by ourselves and by other humans (e.g. during code handovers).
Clear code makes errors easier to spot.
Clear code makes for an easy review.

Code quality is usually measured by efficiency: good = efficient. Here the metric is different: good = friendly, as in the literate programming paradigm.

There is a difference between coding for research and coding for operation. This document proposes some good common coding practices in general, rather than for operational code specifically.

“Code is more often read than written.” — Guido van Rossum

“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.” - Daniele Procida

Code style

Style your code so that it is not ugly-looking. Use the following coding conventions:

- For R
- For Python, or this one.
- Follow the KISS principle (keep it simple, stupid).
Name everything

Make your names meaningful and consistent with what is found elsewhere. If you are working with cross-sectional time series, it is customary to use the letter T for the time dimension and a letter like P or K for the cross-section dimension (firms, people, countries). Don’t be creative. If the letter T is “occupied” (in R, for example, T is an alias for TRUE), use a workaround like TT, tT, T0.

Name your dimensions clearly.

If your object is a matrix, name the columns (e.g. with the variable names) and name the rows (e.g. with an index or dates).
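In Python, the same idea can be sketched with a pandas DataFrame. The values and labels below are made up for illustration, and the axis names "year" and "fruit" are assumptions rather than part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: TT time periods by P cross-sectional units.
dates = ["2016", "2017", "2018"]
fruits = ["Orange", "Banana"]
TT = len(dates)
P = len(fruits)

values = np.arange(TT * P).reshape(TT, P)

# Name the rows (dates) and the columns (variables) explicitly.
x = pd.DataFrame(values, index=dates, columns=fruits)
x.index.name = "year"      # assumed label for the time dimension
x.columns.name = "fruit"   # assumed label for the cross-section dimension

# A labelled selection documents itself, unlike x.iloc[1, 1].
banana_2017 = x.loc["2017", "Banana"]
print(banana_2017)
```

With named rows and columns, a reader does not have to remember which position holds which year or variable.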

Try to report as you go along

For example, use commands like dim in R, or shape in Python, to print the dimensions of the object to the console. Even create plots where it makes sense.
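A minimal sketch of reporting as you go in Python, assuming NumPy arrays; the data are random and the variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
x = rng.normal(size=(10, 2))
print("x shape:", x.shape)

# Column means: one value per column.
col_means = x.mean(axis=0)
print("col_means shape:", col_means.shape)

# Centering each column by its own mean keeps the original shape.
centered = x - col_means
print("centered shape:", centered.shape)
```

Printing the shape after each step makes it obvious early if an operation collapsed or transposed a dimension.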

Use comments
- Comments usually don’t burden the RAM, so they are a very economical way to enhance the readability of your code.
- You can comment out commands which are optional for the reader, e.g. printing or tentative plots.
- If you save an object to a file (a CSV or an RData file, for example) which needs to be loaded later, comment clearly on what should be returned. For example:

load("Path/x.RData") 
comment(x)  
"This is a 5*5*5 array. The dim1 stands for.., dim2 for.. and dim3 for.."

Split your code
A lot can be done with one-liners, but one-liners are less readable. Trade elegance for readability. When possible, make your code more modular.
You can use whatever convention suits you for “temporary” variables (those you only need for the sake of readability). For example, in Python you can use an underscore, _, as a prefix, or a prefix such as tmp_.
Splitting your code also helps to avoid super long and unreadable lines.

For example:

# Define a toy matrix again:
rownames_vector <- c("2016", "2017")
TT <- length(rownames_vector)
colnames_vector <-  c("Orange", "Banana")
P <- length(colnames_vector)
x <- matrix(1:4, nrow= TT, ncol= P)
colnames(x) <- colnames_vector
rownames(x) <- rownames_vector
# Now assume there is a mistake in the value of Banana - 2017
# and we need to replace it:
 
## Bad
 
# find the index and replace the value
x[ which(x[,"Banana"]== 4), "Banana"] <- 8
 
## Better
 
# find the index, tmp_index 
tmp_index <- which(x[, "Banana"]== 4) 
# Replace the value
x[tmp_index, "Banana"] <- 8

Or in Python:

import numpy as np
import pandas as pd

# Define a toy data frame:
rownames_vector = ["2016", "2017"]
TT = len(rownames_vector)
colnames_vector = ["Orange", "Banana"]
P = len(colnames_vector)
x = np.arange(TT * P).reshape(TT, P)
x = pd.DataFrame(x, columns=colnames_vector, index=rownames_vector)

## Bad:
x.loc[x.loc[x["Banana"] == 3].index.tolist(), "Banana"] = 8
x
>       Orange  Banana
> 2016       0       1
> 2017       2       8
## Better:
# Find the index, tmp_index
tmp_index = x.loc[x["Banana"] == 3].index.tolist()
# Replace the value
x.loc[tmp_index, "Banana"] = 8
x
>       Orange  Banana
> 2016       0       1
> 2017       2       8


Functions

Documentation of a function:
- What are the inputs? What is their class (or type)? What are their dimensions (if relevant)?
- What is the function doing? When algorithms are used, especially complicated ones, it can be useful to explain how the algorithm works or how it is implemented within your code. It may also be appropriate to describe why a specific algorithm was selected over another.
- What is the output? What are its dimensions (if relevant)?

In R:

#' Add together two numbers.
#' 
#' @param x A number.
#' @param y A number.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 1)
#' add(10, 1)
add <- function(x, y) {
  x + y
}

In Python:

def square(x):
    """Summary or description of the function.

    Parameters, or inputs:
    x (int): Description of x

    Returns, or output:
    int: the squared input
    """
    return x**2

Setting default arguments

Set default arguments only if it is strictly obvious what the argument should be. Otherwise force the user to explicitly specify the choice.
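One way to force an explicit choice in Python is a keyword-only argument without a default. The sketch below is only an illustration: the function moving_average and its parameter window are hypothetical names, chosen because no window length is obviously the right one.

```python
def moving_average(values, *, window):
    """Return the simple moving average of `values`.

    `window` has no default on purpose: there is no obviously correct
    window length, so the caller must state it explicitly.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# The caller has to name the choice, which also documents it at the call site.
print(moving_average([1, 2, 3, 4], window=2))  # [1.5, 2.5, 3.5]
```

Calling moving_average([1, 2, 3, 4]) without window raises a TypeError, so the choice cannot be skipped silently.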

Add checks and assertions

Try to prevent situations where the user is unaware of some particular software behaviour. E.g., if the user wants the mean of a vector, check that the input really is a vector:

In R:

x <- rnorm(10)
mean(x)
> [1] 0.6033826
mean( x - mean(x) ) # centering 
> [1] 0
# Now with a matrix
x <- matrix(ncol= 2, nrow= 10)
x <- apply(x, 2,  rnorm ) # fill it with random normal vectors
mean(x) # R stacks the matrix and returns a single number
> [1] -0.2972571
# Now after centering:
apply(x - mean(x), 2, mean) # wrong center
> [1] -0.1322593  0.1322593
# You can create a mean function which only operates on vectors
mean_vector <- function(x) {
  if (NCOL(x) != 1) stop("Make sure x is a vector")
  mx <- mean(x)
  return(mx)
}
try(mean_vector(x))
> Error in mean_vector(x) : Make sure x is a vector

In Python:

import numpy as np
from numpy.random import randn
# Generate 10 draws from a standard normal distribution
x = randn(10)
print(x)
> [ 0.0241013   0.2310887  -0.15294863 -1.2512864  -0.76701965  0.1225635
>   2.13630261  0.17481843 -2.10117088 -1.25999037]
np.mean(x)
> -0.28435413887101557
np.mean(x - np.mean(x)) # centering
> 6.661338147750939e-17
# Now with a matrix
x = np.random.normal(size=(10, 2))
np.apply_along_axis(func1d=np.mean, axis=0, arr=x)
> array([ 0.34829808, -0.32067791])
np.mean(x) # NumPy flattens the matrix and returns a single number
> 0.013810087737123828
# Subtract the overall mean from every entry
tmpp_array = x - np.mean(x)

# Now after centering we get the wrong center:
np.apply_along_axis(func1d=np.mean, axis=0, arr=tmpp_array)
> array([ 0.33448799, -0.33448799])
## You can create a mean function which only operates on vectors
def mean_vector(x):
    # np.ndim is 1 for a vector and 2 for a matrix
    assert np.ndim(x) == 1, "Make sure x is a vector"
    mx = np.mean(x)
    return mx

try:
    mean_vector(x)
except AssertionError as err:
    print(err)
> Make sure x is a vector

Avoid nested functions

Unless there is a very good reason for it, don’t create a function inside another function. It is much more complicated to understand, and to debug. DO NOT CREATE A MONSTER MOTHER FUNCTION which does everything in “one click”.
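As a rough sketch of the alternative, the steps can live in small top-level functions that are called in sequence. The function names and the toy prices below are hypothetical and only meant to show the shape of such code.

```python
def load_prices():
    """Return hypothetical toy prices (a stand-in for a real loading step)."""
    return [10.0, 11.0, 12.1]


def compute_returns(prices):
    """Return simple period-over-period returns."""
    return [later / earlier - 1 for earlier, later in zip(prices, prices[1:])]


def summarise(returns):
    """Return the mean of the returns."""
    return sum(returns) / len(returns)


# Each step can be inspected, tested, and debugged on its own.
prices = load_prices()
returns = compute_returns(prices)
print(len(returns))
print(summarise(returns))
```

Because every function is defined at the top level, each one can be called from the console with its own inputs when something goes wrong.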
