How many times have you placed the legend in R plot to discover it is being overrun by some points or lines in the chart? Usually what comes next is a trial-and-error phase where you adjust the location, changing the arguments of the x and y coordinates, and re-drawing the plot again to check if the legend or text are now positioned such that they are fully readable.
At work, I recently spent a lot of time coding for someone else, and like anything else you do, there is much to learn from it. It also got me thinking about scripting, and how best to go about it. To me it seems that the new working generation mostly tries to escape from working with Excel, but “let’s not kid ourselves: the most widely used piece of software for statistics is Excel” (Brian D. Ripley). this quote is 15 years old almost, but Excel still has a strong hold on the industry.
Here I discuss few good coding practices. Coding for someone else is not to be taken literally here. ‘Someone else’ is not necessarily a colleague, it could just as easily be the “future you”, the you reading your code six months from now (if you are lucky to get responsive referees). Did it never happened to you that your past-self was unduly cruel to your future-self? that you went back to some old code snippets and dearly regretted not adding few comments here and there? Of course it did.
Unlike the usual metric on which “good” is usually measured by when it comes to coding: good = efficient, here the metric would be different: good = friendly. They call this literate programming. There is a fairly deep discussion about this paradigm by John D. cook (follow what he has to say if you are not yet doing it, there is something for everyone).
The good thing about using open-source software is the community around it. There are very many R packages online, and recently CRAN package download logs were released. This means we can have a look at the number of downloads for each package, so to get a good feel for their relative popularity. I pulled the log files from the server and checked a few packages which are known to be related to machine learning. With this post you can see which are the community favorites, and get a feel for the R-software trend growth.
There are many problems with p-values, and I too have chipped in at times. I recently sat in a presentation of an excellent paper, to be submitted to the highest ranked journal in the field. The authors did not conceal their ruthless search for those mesmerizing asterisks indicating significance. I was curious to see many in the crowd are not aware of current history in the making regarding those asterisks.
The web is now swarming with thought-provoking discussions about the recent American Statistical Association (ASA) statement on p-values. Despite their sincere efforts, there are still a lot of back-and-forth over what they actually mean. Here is how I read it.
The top three for the year are:
Out-of-sample data snooping
Code for my yield curve forecasting paper
Review of a couple of books
I personally enjoyed the most writing a few words on ML estimation, and about those great statistical discoveries. Since the last post did not involve any code or images I initially thought it would be a breeze. I in fact spent twice the time I usually do, and it was all good fun.
In 2015 I wrote quite a bit about volatility and correlation. In 2016 I plan to learn more (so to write more) about portfolio construction.
Some time during the 18th century the biologist and geologist Louis Agassiz said: “Every great scientific truth goes through three stages. First, people say it conflicts with the Bible. Next they say it has been discovered before. Lastly they say they always believed it”. Nowadays I am not sure about the Bible but yeah, it happens.
I express here my long-standing and long-lasting admiration for the following triplet of present-day great discoveries. The authors of all three papers had initially struggled to advance their ideas, which echos the quote above. Here they are, in no particular order.
Perhaps it is the different jargon used in different disciplines, not sure. But for some reason, the terms ‘predictions’, ‘forecasts’ and ‘projections’ are frequently used interchangeably.
I have recently reviewed couple of books.
This post concerns a paper I came across checking the nominations for best paper published in International Journal of Forecasting (IJF) for 2012-2013. The paper bears the annoyingly irresistible title: “The illusion of predictability: How regression statistics mislead experts”, and was written by Soyer Emre and Robin Hogarth (henceforth S&H). The paper resonates another paper published in “Psychological review” (1973), by Daniel Kahneman and Amos Tversky: “On the psychology of prediction”. Despite the fact that S&H do not cite the 1973 paper, I find it highly related.
At least for me, R by faR. MATLAB has its own way of doing things, which to be honest can probably be defended from many angles. Here are few examples for not so subtle differences between R and MATLAB:
We are all standing on the shoulders of giants. Bradley Efron is one such giant. With the invention of the bootstrap in 1979 and later with his very influential 2004 paper about the Least Angle Regression (and the accompanied software written in R).
I just finished reading An estimate of the science-wise false discovery rate and application to the top medical literature. The authors ask how many of what we read is scientific journals is actually incorrect, or false.