Open source software has many virtues, not least of which is being free. However, open source comes with “ABSOLUTELY NO WARRANTY”, and with no power comes no responsibility (I wonder..). Since no one is paying, it is by definition your sole responsibility to make sure the code does what it is supposed to do. Thus, looking “under the hood” of a function written by someone else can be of service. There are more reasons to examine the actual underlying code.
A particularly advantageous feature of open source is that it is also, by definition, well.. open. So we can learn from others and copy good programming habits. There are a lot of experienced, astute programmers out there. Looking “under the hood” of a function written by one of them is a good start.
Another fine byproduct of having a close look at functions is that you can help debug them. You may sometimes realize some lines are redundant, or could be rewritten more efficiently. That is fine, but there is no need to alarm the package maintainer over it. However, it also happens that something goes wrong with the function, mostly in some marginal cases, and the returned value is flatly wrong. That IS a reason to fire off a quick but orderly email to the maintainer of the package. Don’t be timid; make sure you are correct about your claim and he or she will thank you.
Yet another good reason to check the insides of a function is that it helps to understand the statistics or econometrics behind it. This is decidedly important for students who struggle with papers’ math notation. Inspecting how the method is de facto executed can help discard the “sort of” in the “I sort of understand” feeling you sometimes get after reading an unfriendly paper or textbook. Specifically, you can see the exact dimensions of the objects, how the likelihood is constructed, where the optimization happens and which algorithm is used. One can even dive in to see how a numerical algorithm arrives at its solution, i.e. whether it is gradient-based or combinatorial optimization, though I freely confess I usually shy away from that last step.
With all this solid motivation to access the actual body of functions, how do we do that?
In MATLAB, you can type edit followed by the name of the function in the console, for example edit regress.
In R there is the usual parallel, but also some oddities to be aware of. Most functions written in R can be accessed in a similar manner to MATLAB. Take for example the Diebold-Mariano test from the forecast package: all you need to do is type the name of the function, without parentheses, and the function itself is printed on your screen:
dm.test
function (e1, e2, alternative = c("two.sided", "less", "greater"),
    h = 1, power = 2)
{
    alternative <- match.arg(alternative)
    d <- c(abs(e1))^power - c(abs(e2))^power
    d.cov <- acf(d, na.action = na.omit, lag.max = h - 1, type = "covariance",
        plot = FALSE)$acf[, , 1]
    d.var <- sum(c(d.cov[1], 2 * d.cov[-1]))/length(d)
    dv <- d.var
    if (dv > 0)
        STATISTIC <- mean(d, na.rm = TRUE)/sqrt(dv)
    else stop("Variance of DM statistic is zero")
    n <- length(d)
    k <- ((n + 1 - 2 * h + (h/n) * (h - 1))/n)^(1/2)
    STATISTIC <- STATISTIC * k
    names(STATISTIC) <- "DM"
    if (alternative == "two.sided")
        PVAL <- 2 * pt(-abs(STATISTIC), df = n - 1)
    else if (alternative == "less")
        PVAL <- pt(STATISTIC, df = n - 1)
    else if (alternative == "greater")
        PVAL <- pt(STATISTIC, df = n - 1, lower.tail = FALSE)
    PARAMETER <- c(h, power)
    names(PARAMETER) <- c("Forecast horizon", "Loss function power")
    structure(list(statistic = STATISTIC, parameter = PARAMETER,
        alternative = alternative, p.value = PVAL, method = "Diebold-Mariano Test",
        data.name = c(deparse(substitute(e1)), deparse(substitute(e2)))),
        class = "htest")
}
For a better look, typing page(dm.test) or edit(dm.test), much as in MATLAB, will open the function in a separate window.
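If you prefer to stay in the console, a few base R utilities expose the same information in different forms (a small aside; these are standard functions from base and utils, not anything specific to the forecast package):

library(forecast)  # provides dm.test
page(dm.test)      # show the source in a separate pager window
body(dm.test)      # the body of the function, as a language object
args(dm.test)      # only the argument list and its defaults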
Sometimes we need to work a little harder. For example, the function vcov returns the variance-covariance matrix of model coefficients. But trying to print this function will not give us what we want:
vcov
function (object, ...)
UseMethod("vcov")
<bytecode: 0x0709ea44>
<environment: namespace:stats>
The reason is that for some functions, what gets executed depends on the kind of input provided. R is smart enough to recognize the kind of input you have provided without any additional effort on your part:
y <- rnorm(10) ; x1 <- rnorm(10) ; x2 <- rnorm(10)
lm0 <- lm(y ~ x1 + x2)
vcov(lm0)
            (Intercept)          x1          x2
(Intercept)  0.21995575  0.01725811 -0.15093467
x1           0.01725811  0.11131218 -0.02291188
x2          -0.15093467 -0.02291188  0.19955995
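This works because R looks at the class attribute of the object you pass in; a quick check on the example above (a minimal aside, using only the lm0 object already created):

class(lm0)
# [1] "lm"   -- this is what tells vcov which method to use for lm0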
You don’t need to explain that lm0 is a linear model. But since other models require different treatment, when you wish to see what vcov looks like you need to ask the right question: what does it look like for which kind of input? Above we used a linear model, but it could just as well be a different model, say an ARMA model or a generalized additive model (GAM). When a function is written differently for different inputs, we say it has different methods to deal with the different inputs. What we can do is have a look at which methods are available, and then choose to see the function for the method that is relevant for us:
library(mgcv)  # so that we can have a look at another vcov method.
methods(vcov)
 [1] vcov.Arima*       vcov.arma*        vcov.cch*         vcov.coxph*       vcov.fitdistr*    vcov.fracdiff*    vcov.gam
 [8] vcov.garch*       vcov.glm*         vcov.gls*         vcov.lm*          vcov.lme*         vcov.mlm*         vcov.multinom*
[15] vcov.negbin*      vcov.nls*         vcov.polr*        vcov.rlm*         vcov.summary.glm* vcov.summary.lm*  vcov.survreg*

   Non-visible functions are asterisked

# Now choose the one which is relevant for you, using the getAnywhere function:
getAnywhere("vcov.lm")
function (object, ...)
{
    so <- summary.lm(object)
    so$sigma^2 * so$cov.unscaled
}

# or a different method:
getAnywhere("vcov.gam")
function (object, freq = FALSE, dispersion = NULL, ...)
{
    if (freq)
        vc <- object$Ve
    else vc <- object$Vp
    if (!is.null(dispersion))
        vc <- dispersion * vc/object$sig2
    name <- names(object$edf)
    dimnames(vc) <- list(name, name)
    vc
}
You can see that the function vcov takes a model object as input. Those objects are structured differently, so vcov needs to adjust itself accordingly, which it does.
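As an aside, getAnywhere is not the only route to a method that is not directly visible. Two equivalent ways, shown here only as a sketch using standard R tools:

getS3method("vcov", "lm")  # look up the registered S3 method for class "lm"
stats:::vcov.lm            # or reach into the stats namespace with the ':::' operator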
As a concrete example of this kind of functionality, what I sometimes do when I take issue with some specific section inside a function is to change the script and redefine it. Say the dm.test function sits inside a loop, and I know I use a quadratic loss function only. The ‘power’ argument is there to accommodate different loss functions, but for me it is redundant. The line d <- c(abs(e1))^power - c(abs(e2))^power slows down the function for no reason and is better written as d <- (e1^2 - e2^2), removing the power argument altogether and some clutter with it. Redefine the modified function as mydm.test, say, and use that one within your loops, as sketched below.
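A minimal sketch of such a stripped-down version, assuming a quadratic loss and a two-sided test only (the name mydm.test is illustrative, and the body simply trims the dm.test source printed above):

# Stripped-down DM test: quadratic loss only, two-sided alternative only.
mydm.test <- function(e1, e2, h = 1) {
  d <- e1^2 - e2^2                               # quadratic loss differential, no 'power' argument
  d.cov <- acf(d, na.action = na.omit, lag.max = h - 1,
               type = "covariance", plot = FALSE)$acf[, , 1]
  d.var <- sum(c(d.cov[1], 2 * d.cov[-1])) / length(d)
  if (d.var <= 0) stop("Variance of DM statistic is zero")
  n <- length(d)
  k <- ((n + 1 - 2 * h + (h / n) * (h - 1)) / n)^(1 / 2)  # small-sample correction
  STATISTIC <- k * mean(d, na.rm = TRUE) / sqrt(d.var)
  PVAL <- 2 * pt(-abs(STATISTIC), df = n - 1)    # two-sided p-value
  list(statistic = STATISTIC, p.value = PVAL)
}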
May you find this kind of practice useful.
I, and several other StackOverflow R users, have written a very thorough answer to the question, “How do I view the source code for a function?”:
http://stackoverflow.com/q/19226816/271616
In addition to what you’ve described, we also show how to find source code for S4 methods, functions that call compiled code (in a base, recommended, or other package), and compiled code built into the R interpreter.