When it comes to coding, "good" is usually measured by efficiency: good = efficient. Here the metric is different: good = friendly, as in the literate programming paradigm.
There is a difference between coding for research and coding for operation. This document proposes some good common coding practices in general, rather than practices for operational code specifically.
“Code is more often read than written.” — Guido van Rossum
“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.” - Daniele Procida
Style your code so that it is not ugly. Use the following coding conventions:
Follow the KISS principle (keep it simple, stupid)
Name everything
Make your names meaningful and consistent with what may be found elsewhere. If you are working with cross-sectional time series, it is customary to use the letter T for the time dimension and a letter like P or K for the cross-section (firms, people, countries) dimension. Don’t be creative. If the letter T is “occupied”, use a workaround like TT, tT, or T0.
If your object is a matrix, name the columns (e.g. variable’s name) and name the rows (e.g. index, or dates).
For example, use commands like dim in R, or shape in Python, to print the dimensions of the object to the console. Even create plots, where it makes sense.
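The two points above (naming rows and columns, and printing dimensions) can be sketched in Python as follows. This is only an illustration: the object name `prices`, the labels, and the zero-filled values are made up for the example, and it assumes NumPy and pandas are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: rows are years (time dimension), columns are fruits (cross section)
dates = ["2016", "2017", "2018"]
fruits = ["Orange", "Banana"]
TT = len(dates)   # time dimension
P = len(fruits)   # cross-section dimension

# A placeholder table filled with zeros, with named rows and columns
prices = pd.DataFrame(np.zeros((TT, P)), index=dates, columns=fruits)
prices.index.name = "year"

# Print the dimensions and the labels to the console
print(prices.shape)
print(list(prices.index), list(prices.columns))
```

With the labels in place, a cell can be addressed by name, for example `prices.loc["2017", "Banana"]`, rather than by position.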
load("Path/x.RData")
comment(x)
"This is a 5*5*5 array. The dim1 stands for.., dim2 for.. and dim3 for.."
Name temporary objects with _ as a prefix, or prefix them with temp_. For example:
# Define a toy matrix again:
rownames_vector <- c("2016", "2017")
TT <- length(rownames_vector)
colnames_vector <- c("Orange", "Banana")
P <- length(colnames_vector)
x <- matrix(1:4, nrow= TT, ncol= P)
colnames(x) <- colnames_vector
rownames(x) <- rownames_vector
# Now assume there is a mistake in the value of Banana - 2017
# and we need to replace it:
## Bad
# find the index and replace the value
x[ which(x[,"Banana"]== 4), "Banana"] <- 8
## Better
# find the index, tmp_index
tmp_index <- which(x[, "Banana"]== 4)
# Replace the value
x[tmp_index, "Banana"] <- 8
Or in Python:
import numpy as np
import pandas as pd

rownames_vector = ["2016", "2017"]
TT = len(rownames_vector)
colnames_vector = ["Orange", "Banana"]
P = len(colnames_vector)
# Toy table with named rows and columns
x = np.arange(TT * P).reshape(TT, P)
x = pd.DataFrame(x, columns=colnames_vector, index=rownames_vector)
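## Bad:
# find the index and replace the value in one step
x.loc[ x.loc[x["Banana"] == 3].index.tolist(), "Banana" ] = 8
x
>       Orange  Banana
> 2016       0       1
> 2017       2       8
## Better (starting again from the original x):
# find the index, tmp
tmp = x.loc[x["Banana"] == 3].index.tolist()
# replace the value
x.loc[tmp, "Banana"] = 8
x
>       Orange  Banana
> 2016       0       1
> 2017       2       8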
In R:
#' Add together two numbers.
#'
#' @param x A number.
#' @param y A number.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 1)
#' add(10, 1)
add <- function(x, y) {
  x + y
}
In Python:
import sys
import numpy
def square(x):
    """Summary or description of the function.

    Parameters, or inputs:
        x (int): Description of x

    Returns, or output:
        int: the squared input
    """
    return x**2
Set default arguments only if it is strictly obvious what the argument should be. Otherwise force the user to explicitly specify the choice.
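One way to force an explicit choice in Python is a keyword-only argument without a default. The sketch below is only an illustration: the function `trimmed_mean` and its parameter `trim_fraction` are hypothetical names invented for the example, chosen because there is no obviously right amount of trimming.

```python
from statistics import mean


def trimmed_mean(values, *, trim_fraction):
    """Mean of `values` after dropping a share of the smallest and largest points.

    Parameters, or inputs:
        values (list of float): the observations.
        trim_fraction (float): share to drop from EACH tail, in [0, 0.5).
            Keyword-only and without a default, so the caller must choose it.

    Returns, or output:
        float: the trimmed mean.
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    n_drop = int(len(ordered) * trim_fraction)  # points dropped per tail
    kept = ordered[n_drop:len(ordered) - n_drop]
    return mean(kept)


data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(trimmed_mean(data, trim_fraction=0.2))  # drops 1.0 and 100.0
```

Calling `trimmed_mean(data)` without `trim_fraction` raises a `TypeError`, so the user cannot rely on a silent default.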
Try to prevent situations where the user is unaware of some particular software behaviour. For example, if the user wants the mean of a vector, make sure that the input really is a vector:
In R:
x <- rnorm(10)
mean(x)
> [1] 0.6033826
mean( x - mean(x) ) # centering
> [1] 0
# Now with a matrix
x <- matrix(ncol= 2, nrow= 10)
x <- apply(x, 2, rnorm ) # fill it with random normal vectors
mean(x) # R stacks the matrix and returns a single number
> [1] -0.2972571
# Now after centering:
apply( x - mean(x), 2, mean) # wrong center
> [1] -0.1322593 0.1322593
# You can create a mean function which only operates on vectors
mean_vector <- function(x){
  if (NCOL(x) != 1) stop("Make sure x is a vector")
  mx <- mean(x)
  return(mx)
}
try(mean_vector(x))
> Error in mean_vector(x) : Make sure x is a vector
In Python:
import numpy as np
from numpy.random import randn
# generate 10 draws from the standard normal distribution
x = randn(10)
print(x)
> [ 0.0241013 0.2310887 -0.15294863 -1.2512864 -0.76701965 0.1225635
> 2.13630261 0.17481843 -2.10117088 -1.25999037]
np.mean(x)
> -0.28435413887101557
np.mean( x - np.mean(x) ) # centering
> 6.661338147750939e-17
# Now with a matrix
x = np.random.normal(size=(10, 2))
np.apply_along_axis(func1d= np.mean, axis = 0, arr= x)
> array([ 0.34829808, -0.32067791])
np.mean(x) # NumPy flattens the matrix and returns a single number
> 0.013810087737123828
# Now after centering with the overall mean we get the wrong center:
tmpp_array = x - np.mean(x)
np.apply_along_axis(func1d= np.mean, axis = 0, arr= tmpp_array)
> array([ 0.33448799, -0.33448799])
## You can create a mean function which only operates on vectors
def mean_vector(x):
    # np.ndim is 1 for a vector and 2 for a matrix
    assert np.ndim(x) == 1, "Make sure x is a vector"
    mx = np.mean(x)
    return mx
try:
    mean_vector(x)
except AssertionError as err:
    print(err)
> Make sure x is a vector
Unless there is a very good reason for it, don’t create a function inside another function. Nested functions are much harder to understand and to debug. DO NOT CREATE A MONSTER MOTHER FUNCTION which does everything in “one click”.
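As a rough sketch of the alternative, the same work can be split into small top-level functions that are called one after the other. The function names (`parse_values`, `center`, `summarise`) and the toy input are hypothetical, made up for the illustration.

```python
from statistics import mean


def parse_values(raw_text):
    """Turn a comma-separated string into a list of floats, skipping blanks."""
    return [float(item) for item in raw_text.split(",") if item.strip()]


def center(values):
    """Subtract the mean from every value."""
    mx = mean(values)
    return [v - mx for v in values]


def summarise(values):
    """Return the minimum, maximum and mean as a dictionary."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


# Each step is a separate call, so every intermediate result can be inspected
raw = "1, 2, 3, , 6"
values = parse_values(raw)
centered = center(values)
summary = summarise(centered)
print(summary)
```

Because each function lives at the top level, it can be tested and debugged on its own, and a failure points to a single small step rather than to one large function.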