The good thing about using open-source software is the community around it. There are a great many R packages online, and the CRAN package download logs were recently released. This means we can look at the number of downloads for each package and get a good feel for their relative popularity. I pulled the log files from the server and checked a few packages that are known to be related to machine learning. In this post you can see which are the community favorites, and get a feel for the growth trend in R usage.
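If you want to pull the logs yourself, here is a minimal sketch for reading a single day's log file. The daily files are served by RStudio's CRAN mirror; the URL pattern and the specific date below are assumptions for illustration, not part of the original script:

# Download and read one day's worth of CRAN download logs (URL pattern is an assumption)
day <- as.Date("2016-05-01")
url <- sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz", format(day, "%Y"), day)
destfile <- file.path(tempdir(), paste0(day, ".csv.gz"))
download.file(url, destfile)
one_day <- read.table(destfile, header = TRUE, sep = ",",
                      quote = "\"", comment.char = "", as.is = TRUE)
head(one_day$package)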
Most popular machine learning R packages
Total number of downloads of selected packages (multiply x-axis by 10^4 for the actual number)
The forecast package seems to be the most widely used. It is maintained by Rob Hyndman (but written by more people) and has 6 previous versions, so I am not surprised to see this. I think a lot of work went into enhancing previous versions of the package in terms of execution speed and user friendliness. Evidently, it bears fruit.
The above figure makes me want to take a closer look at the e1071 package. I use it, but I am probably not aware of all that it can do. Network Analysis and Visualization with igraph is perhaps another item to check out (I hear you, there are only 24 hours in a day...).
We can see that randomForest is only slightly behind rpart. This is intuitive: if you learn regression trees, you may as well go for the forest.
I would expect quantmod and glmnet to rank much higher; those are extremely smart and well-developed packages. Feel free to comment on other packages you would add to this list.
Downloads over time
Using data from 2013 onward, I plot the number of downloads for each package over time:
Total number of downloads for selected packages over time
You can see the steady increase in the number of downloads for all packages. For most packages, 2016 seems to be going extremely well. You can also see the spike in popularity of the forecast package toward the end of 2015, placing it first in the rankings.
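For reference, a minimal sketch of how such a figure could be drawn, assuming the daily counts have been collected into a data frame called downloads with columns date, package and count (these names are my own, not from the original script):

# Plot the number of downloads over time, one line per package
library(ggplot2)
ggplot(downloads, aes(x = date, y = count, colour = package)) +
  geom_line() +
  labs(x = NULL, y = "Number of downloads")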
Looking ahead
We aggregate the number of downloads by month and sum them up. The aim is to get a feel for the growth in R users. We can plot the series and fit polynomial models using the time index as the only explanatory variable:
# y is the total number of downloads (monthly frequency)
tt <- 1:length(y)                       # time index as a named variable
lm0 <- lm(y ~ tt + I(tt^2))             # quadratic trend
lm1 <- lm(y ~ tt + I(tt^2) + I(tt^3))   # cubic trend
We can use the fitted coefficients to forecast the path. The main focus is the trend itself, not so much the exact number of downloads.
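A minimal sketch of how the two fitted models can be extrapolated beyond the sample, assuming the lm0 and lm1 objects above; the 20-month horizon is an arbitrary choice for illustration:

# Extrapolate both trend models h months ahead (h is arbitrary here)
h <- 20
new_tt <- (length(y) + 1):(length(y) + h)
path_quad  <- predict(lm0, newdata = data.frame(tt = new_tt))
path_cubic <- predict(lm1, newdata = data.frame(tt = new_tt))
plot(c(y, path_quad), type = "l", xlab = "Month index", ylab = "Monthly downloads")
lines(c(y, path_cubic), lty = 2)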
Second and third degree polynomial fit and prediction for the number of downloads (multiply y-axis by 10^4 for the actual number)
So we have two models, both with a reasonable fit but with different future paths. It is now May 2016. The lower of the two curves (the polynomial regression of degree 3, a.k.a. the cubic model) predicts twice as many downloads by January 2018, while the quadratic model sees the number of downloads almost triple. Look around you, don't you yourself see more (and more) R users?
As a final word, these trends rely solely on history, with the time trend as the only explanatory variable. Very recently, there have been some major leaps forward. Thanks to Hadley Wickham and RStudio, we now have Rtools (for Windows), devtools (Tools to Make Developing R Packages Easier) and roxygen2 (for a better documentation process).
“Any sufficiently advanced technology is indistinguishable from magic” (Clarke’s third law). Those advanced tools call for a structural break: more developers, better ‘winner’ packages, and much stronger growth in the future, I believe.
The code used to count the downloads for each package from the daily log files:

file_list <- list.files("CRAN logs folder", full.names = TRUE)
pck <- c("place here the vector of package names you are interested in")
npck <- list()
timee <- character(length(file_list))   # date of each daily log file
for (i in 1:length(file_list)) {
  tmp <- read.table(file_list[[i]], header = TRUE, sep = ",", quote = "\"",
                    dec = ".", fill = TRUE, comment.char = "", as.is = TRUE)
  npck[[i]] <- list()
  for (j in 1:length(pck)) {
    # count the rows (downloads) in this day's log that belong to package j
    npck[[i]][[j]] <- NROW(tmp[tmp$package == pck[j], ])
  }
  timee[i] <- head(tmp$date, 1)
  # You must remove the object, otherwise you may run into memory issues
  rm(tmp)
}
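The daily counts can then be rolled up to the monthly series y used in the "Looking ahead" section. A sketch, assuming npck and timee are filled as above (the helper names daily_total and month_id are mine):

# Roll the daily per-package counts up to monthly totals
daily_total <- sapply(npck, function(x) sum(unlist(x)))  # total across the selected packages, per day
month_id <- format(as.Date(timee), "%Y-%m")              # e.g. "2016-05"
y <- as.numeric(tapply(daily_total, month_id, sum))      # monthly totals in chronological order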
Really enjoy reading your blog!
Not sure you have seen this:
http://robjhyndman.com/hyndsight/fpp-downloads/
“So the recent spike in forecast package downloads are clearly being driven by fpp installations”
Thanks Mark.
No, I completely missed that discussion. In the comment section of the post you mentioned, there are numerous conjectures about that spike in forecast package downloads.
Nice post. I very much like your dataviz theme. I would love to see an update on this one if your time allows. Will modelr make the list? Will forecast stay on top?