# Out-of-sample data snooping

In this day and age of parallel computing and big-data mining, I like to think about the new complications that follow this abundance. By way of analogy, Alzheimer’s dementia is an awful condition, but we are only familiar with it because medical advances allow for higher life expectancy. Better abilities allow for new predicaments. One of those new predicaments is what I call out-of-sample data snooping.

Dutiful readers know that I think variable selection is one of the most worthy research areas. When you aim to forecast and need to guess which variables are important, an intuitive approach is to leave out part of the sample and use it for validation, a kind of “semi-out-of-sample”. If the variables chosen (from whatever initial pool) are genuinely helpful, they should prove it on the validation set, and can later be used for actual forecasting. Solid reasoning, which works, but only up to a point. If you throw enough mud at the wall, some of it will stick.

Let me illustrate this concern. Generate p = 1000 random variables which have nothing to do with the response Y.
A rule of thumb is to leave out 1/3 of the sample as a validation set; this is data we use later to make sure the variables chosen on the 2/3 training set actually perform and are not overfitted. I choose the variables using a simple t-test (a blunt mistake, since it does not account for the number of tests).
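A minimal sketch of this experiment in Python (the post's original code is not shown here, so this is my own reconstruction; exact counts will vary with the random seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, 1000               # 500 observations, 1000 candidate predictors
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)     # the response is pure noise, unrelated to X

n_train = int(n * 2 / 3)       # 2/3 training, 1/3 validation
X_tr, y_tr = X[:n_train], y[:n_train]
X_va, y_va = X[n_train:], y[n_train:]

def selected(Xs, ys, alpha=0.05):
    """Indices whose univariate slope looks 'significant' at level alpha.

    The Pearson correlation test p-value equals the t-test p-value for
    the slope in a one-variable regression, so it serves as the t-test.
    """
    return {j for j in range(Xs.shape[1])
            if stats.pearsonr(Xs[:, j], ys)[1] < alpha}

train_hits = selected(X_tr, y_tr)
# re-test only the training "winners" on the held-out third
valid_hits = {j for j in train_hits
              if stats.pearsonr(X_va[:, j], y_va)[1] < 0.05}

print(len(train_hits))  # roughly 5% of 1000: dozens of spurious winners
print(len(valid_hits))  # a handful still "confirm" on pure noise
```

Even though every predictor is noise, roughly 5% clear the t-test on the training set, and roughly 5% of those clear it again on the validation set, purely by chance.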

So, 54 variables out of the 1000 were chosen using the training set (the initial 2/3 of the data). Later, 4 of those variables also did well on the validation set. All random noise. You can see it is very helpful to “chop” the sample like that, and there are many nice papers in that direction (see references below for a couple of good ones), but it is only a step in the right direction given our current computing and storage capabilities. You may wonder whether a ratio of 2 (1000 variables to only 500 observations) is exaggerated; it is not. Think about all the technical indicators people are working with: all kinds of dual and triple moving averages, with very many combinations to try. In this example we did not even allow for combinations of variables; things can get really thorny then.

I don’t participate any more in Kaggle competitions. The concept of out-of-sample data snooping is clearly evident there. Kaggle provides a public leaderboard based on a validation set, while the final winner is determined using a fresh test set (so now we have three sets: the training set, the 2/3 of the data; the validation set, the remaining 1/3 of the data; and a “fresh” test set which is actually out-of-sample). Competition winners often come from the middle of the public leaderboard, not from the top as you might expect. The top of the public leaderboard is often occupied by people who made MANY submissions, trying all kinds of things until they manage to fit even the validation set well and climb the public leaderboard. But in the endgame, their models may fail on the fresh test data. The example above illustrates how this can happen. A wise thing to do is to use cross-validation, which makes it harder (but not impossible) to create such out-of-sample snooping. The idea is to dice the data and use different portions of the sample as validation; see the reference below for a good review.

With further computing advancements, I expect this “out-of-sample data snooping” problem to gain more attention. To keep with my analogy: there are not many cases of this sort of dementia yet, but there will be.