useR! 2019, held in Toulouse, ended a couple of days ago.
Where I work we are now hiring. We took a few time-consuming actions to make sure we have a large pool of candidates to choose from. But what is the value in having a large pool of candidates? Intuitively, the more candidates you have, the better the chance that you will end up with a strong prospective candidate in terms of experience, talent and skill set (call this one candidate “the maximum”). But how much better are we talking about? Is the difference meaningful? If there is a big difference between 10 candidates and 1500 candidates, but very little difference between 10 candidates and 80 candidates, it means that our publicity and screening efforts are not very fruitful or efficient. Perhaps it would be better to run quickly over a small pool, a few dozen candidates, and choose the best fit. Below I try to cast this question in terms of the distribution of the sample maximum (think: how much better is the best candidate as the number of candidates grows).
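To get a rough feel for the question, here is a minimal simulation sketch. It assumes, purely for illustration, that candidate quality is an independent draw from a standard normal distribution; the function name `expected_max` and the pool sizes are hypothetical choices, not taken from the analysis in the post.

```python
import random
import statistics


def expected_max(n_candidates, n_sims=1000, seed=0):
    """Monte Carlo estimate of the expected maximum of n_candidates
    i.i.d. N(0, 1) draws (an assumed model of candidate quality)."""
    rng = random.Random(seed)
    maxima = [
        max(rng.gauss(0.0, 1.0) for _ in range(n_candidates))
        for _ in range(n_sims)
    ]
    return statistics.mean(maxima)


# Pool sizes mentioned in the text: 10, 80 and 1500 candidates.
pool_sizes = [10, 80, 1500]
estimates = {n: expected_max(n) for n in pool_sizes}

for n, value in estimates.items():
    # Quality is measured in standard deviations above the average candidate.
    print(f"pool of {n:>4}: best candidate is about {value:.2f} sd above average")
```

Under this normality assumption the expected maximum grows only slowly with the pool size, so each additional candidate adds less than the one before. A heavier-tailed quality distribution would tell a different story.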
This post shares a short code snippet to make your own screen saver in R, The Matrix-style:
This post has two goals. I hope to make you think about your graphics, and think about the future of data-visualization. An example is given using some simulated time series data. A very quick read.
At least in part, a typical data scientist is busy with forecasting and prediction. Kaggle is a platform which hosts a slew of competitions. Those who have the time, energy and know-how to combat real-life problems huddle together to test their talent. I highly recommend this experience. A side effect of tackling actual problems (rather than those which appear in textbooks) is that most of the time you are not at all enjoying wonderful new insights or exploring fascinating, unfamiliar, ground-breaking algos. Rather, you are handling, wrangling and manipulating data, which is... ugly and boring, but necessary and useful.
I tried my hand a few years ago, and again about 6 months ago in one of those competitions, the Toxic Comment Classification Challenge. Here are my thoughts on that short experience and some insight from scraping the results of that competition.
The R language has some quirks compared to other languages. One thing which you need to constantly watch for when moving to or from R is that R starts its indexing at one, while almost all other languages start indexing at zero, which takes some getting used to. Another quirk is the explicit need for clarity when modifying a variable, compared with other languages.
Take Python for example, but I think it looks the same in most common languages:
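As a hedged illustration (this is not the snippet from the original post, and the variable names are made up), zero-based indexing and in-place modification in Python might look like this:

```python
values = [10, 20, 30]

# Python indexes from zero, so the first element sits at position 0.
# In R the same element would be values[1].
first = values[0]

# append() modifies the list in place; no re-assignment is needed.
# In R you would write the assignment explicitly: values <- c(values, 40)
values.append(40)

# Augmented assignment updates the variable without repeating its name.
counter = 0
counter += 1

print(first, values, counter)
```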
This year on the 4th of July I will be attending the annual useR! conference. While it is often in the US, this year the useR! conference takes place in nearby Brussels. Sweet.
The website is state-of-the-art “don’t make me think” style. The program looks amazing. Belgian beers with the R community, exciting. Registration still open.
Watch this space for highlights and afterthoughts.
It seems like a very long while since my bachelor's degree. Checking my bookshelf the other day, I thought I would flag some of those books which helped or inspired me along the way. Here they are in no particular order.
Every once in a while I play poker online. The poker site allows you to ask for your tournament history. You get an email which contains hundreds of summaries (I open several tables at once, so I have quite some history). A typical summary looks as follows:
In trading and in trading-related research one could be quickly overwhelmed with the sea of ink devoted to trading strategies and the like. It is essential that you “pick your battles” so to speak. I recently finished reading Machine Trading, by Ernest Chan. Here is what I think about the book.
How many times have you placed the legend in an R plot only to discover it is being overrun by some points or lines in the chart? Usually what comes next is a trial-and-error phase where you adjust the location, changing the x and y coordinate arguments and re-drawing the plot to check whether the legend or text is now positioned such that it is fully readable.
At work, I recently spent a lot of time coding for someone else, and like anything else you do, there is much to learn from it. It also got me thinking about scripting, and how best to go about it. To me it seems that the new working generation mostly tries to escape from working with Excel, but “let’s not kid ourselves: the most widely used piece of software for statistics is Excel” (Brian D. Ripley). This quote is almost 15 years old, but Excel still has a strong hold on the industry.
Here I discuss a few good coding practices. Coding for someone else is not to be taken literally here. ‘Someone else’ is not necessarily a colleague, it could just as easily be the “future you”, the you reading your code six months from now (if you are lucky to get responsive referees). Has it never happened to you that your past self was unduly cruel to your future self? That you went back to some old code snippets and dearly regretted not adding a few comments here and there? Of course it has.
Unlike the usual metric by which “good” is measured when it comes to coding, good = efficient, here the metric is different: good = friendly. They call this literate programming. There is a fairly deep discussion about this paradigm by John D. Cook (follow what he has to say if you are not yet doing it, there is something for everyone).
The good thing about using open-source software is the community around it. There are very many R packages online, and recently the CRAN package download logs were released. This means we can have a look at the number of downloads for each package, so as to get a good feel for their relative popularity. I pulled the log files from the server and checked a few packages which are known to be related to machine learning. With this post you can see which are the community favorites, and get a feel for the growth trend of R software.
There are many problems with p-values, and I too have chipped in at times. I recently sat in a presentation of an excellent paper, to be submitted to the highest ranked journal in the field. The authors did not conceal their ruthless search for those mesmerizing asterisks indicating significance. I found it curious that many in the crowd were not aware of the current history in the making regarding those asterisks.
The web is now swarming with thought-provoking discussions about the recent American Statistical Association (ASA) statement on p-values. Despite their sincere efforts, there is still a lot of back-and-forth over what they actually mean. Here is how I read it.
The top three for the year are:
Out-of-sample data snooping
Code for my yield curve forecasting paper
Review of a couple of books
I personally most enjoyed writing a few words on ML estimation, and about those great statistical discoveries. Since the last post did not involve any code or images, I initially thought it would be a breeze. In fact I spent twice the time I usually do, and it was all good fun.
In 2015 I wrote quite a bit about volatility and correlation. In 2016 I plan to learn more (so as to write more) about portfolio construction.