In this day and age of parallel computing and big-data mining, I like to think about the new complications that follow this abundance. By way of analogy: Alzheimer’s dementia is an awful condition, but we are only familiar with it because medical advances allow for higher life expectancy. Better abilities create new predicaments. One of those new predicaments is what I call out-of-sample data snooping.

Dutiful readers know that I consider variable selection one of the most worthy research areas. When you aim to forecast and need to guess which variables are important, an intuitive approach is to leave out part of the sample and use it for validation, a kind of “semi-out-of-sample” check. If the variables chosen (from whatever initial pool) are genuinely helpful, they will prove it on the validation set and can later be used for actual forecasting. Solid reasoning, and it works, but only up to a point. If you throw enough mud at the wall, some of it will stick.

Let me illustrate this concern. Generate p=1000 random variables which have nothing to do with the response Y.

A rule of thumb is to leave out 1/3 of the sample as a validation set: data we use later to make sure that the variables chosen on the 2/3 training set actually perform, and are not overfitted. I choose the variables using a simple t-test (a blunt mistake, since it does not account for the number of tests performed).

```r
TT <- 500; p <- 1000
rmat <- matrix(rnorm(TT * p), nrow = TT, ncol = p)  # p pure-noise predictors
y <- rnorm(TT)                                      # response, unrelated to rmat
ins <- 1:(2 * TT / 3)                               # training indices (first 2/3)
oos <- (2 * TT / 3 + 1):TT                          # validation indices (last 1/3)

# Choose variables on the training set using a naive t-test on the slope
pv <- numeric(p)
for (i in 1:p) {
  lm0 <- lm(y[ins] ~ rmat[ins, i])
  pv[i] <- coef(summary(lm0))[2, 4]
}
chosen <- which(pv < 0.05)
cat(length(chosen))   # 54 in the run reported here

# Make sure the variables chosen using the training set (the initial 2/3)
# also perform well on the validation set
pv.oos <- numeric(length(chosen))
for (j in seq_along(chosen)) {
  lm0 <- lm(y[oos] ~ rmat[oos, chosen[j]])
  pv.oos[j] <- coef(summary(lm0))[2, 4]
}
cat("there are still", sum(pv.oos < 0.05), "remaining")   # 4 in this run
```

So, 54 variables out of the 1000 were chosen using the training set (the initial 2/3 of the data). Later, 4 variables also did well on the validation set. **All random noise**. And this is roughly what chance alone predicts: 1000 × 0.05 = 50 false discoveries in training, and 54 × 0.05 ≈ 2.7 of those surviving validation. You can see it is very helpful to “chop” the sample like that, and there are many nice papers in that direction (see references below for a couple of good ones), but it is only a step in the right direction given our current computing and storage capabilities. You may wonder whether a ratio of 2 (1000 variables to only 500 observations) is exaggerated; it is not. Think about all the technical indicators people work with: all kinds of dual and triple moving averages, with very many combinations to try. In this example we did not even allow for combinations of variables; things get really thorny then.
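The “blunt mistake” of ignoring the number of tests is easy to demonstrate. Here is a quick sketch (in Python rather than R, purely for illustration; all names are my own) of the same simulation, comparing the naive 5% cutoff with a Bonferroni-adjusted one. The slope t-test p-value in a simple regression coincides with the correlation-test p-value, so `scipy.stats.pearsonr` suffices:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
TT, p = 500, 1000
X = rng.standard_normal((TT, p))   # p pure-noise predictors
y = rng.standard_normal(TT)        # response, unrelated to X

# p-value of the slope in a simple regression = correlation-test p-value
pvals = np.array([pearsonr(X[:, i], y)[1] for i in range(p)])

raw = int((pvals < 0.05).sum())        # naive selection: ~50 false positives
bonf = int((pvals < 0.05 / p).sum())   # Bonferroni threshold: ~none survive
print(raw, bonf)
```

With 1000 pure-noise tests, the naive cutoff flags around 50 “significant” variables; the Bonferroni-adjusted threshold of 0.05/1000 almost always flags none.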

I don’t participate any more in Kaggle competitions. The concept of out-of-sample data snooping is clearly evident there. Kaggle provides a public leaderboard based on a validation set, while the final winner is determined using a fresh test set (so now we have three terms: the training set, the 2/3 of the data; the validation set, the remaining 1/3; and a “fresh” test set which is genuinely out-of-sample). Competition winners often come from the mid-range of the public leaderboard, not from the top as you might expect. The top of the public leaderboard is often occupied by people who made MANY submissions, trying all kinds of things until they managed to fit even the validation set well. But in the endgame, their models may fail on the fresh test data. The code above illustrates how this can happen. A wise thing to do is to use cross-validation, which makes it harder (but not impossible) to create such out-of-sample snooping. The idea is to dice the data and use different portions of the sample as validation; see the reference below for a good review.
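To see why dicing the data helps, here is a simplified fold-wise screening sketch (again in Python, and again a hypothetical setup of mine, not the standard cross-validated-loss procedure): a noise variable passes a single 5% check about 5% of the time, but requiring it to pass in every one of K disjoint folds drives that rate toward 0.05^K:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(11)
TT, p, K = 500, 1000, 5
X = rng.standard_normal((TT, p))   # pure-noise predictors
y = rng.standard_normal(TT)        # response, unrelated to X
# K disjoint folds over a random permutation of the rows
folds = np.array_split(rng.permutation(TT), K)

def passes_all_folds(i):
    # significant at 5% on every fold -> probability roughly 0.05**K for noise
    return all(pearsonr(X[f, i], y[f])[1] < 0.05 for f in folds)

survivors = [i for i in range(p) if passes_all_folds(i)]
print(len(survivors))   # essentially always 0 for pure noise
```

With K = 5 the per-variable pass rate is roughly 0.05^5, so even 1000 noise candidates essentially never survive, whereas the single-split check above let 4 through.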

With further computing advancements, I expect this “out-of-sample data snooping” problem to gain more attention. To keep my analogy: there are not many cases of this sort of dementia yet, but there will be.

**References to possible solutions and advances made** (sometimes linking to free working paper version):

* Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of statistics, 37(5A):2178, 2009

* Nicolai Meinshausen, Lukas Meier, and Peter Bühlmann. P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681, 2009

* Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010

Reminds me of this paper:

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2474755

and the de Prado et al. paper cited by it.

My current hobby is taking Quantopian strategies and plotting their Sharpes over time.

Thanks for the article and code to have some fun with. You were mentioned here:

http://www.quantstart.com/articles/Using-Cross-Validation-to-Optimise-a-Machine-Learning-Method-The-Regression-Setting

Based on your analysis I gather that the probability of finding an edge using one of those commercial GP engines that mine for systems is exactly 0.

See also Carhart 1997, the winner’s curse, pretend money casinos and the sorry tale of the Manek Growth Fund!

This is a good point that deserves to be made again and again. On the other hand, if you take this seriously you must quickly become very uncertain about your conclusions, yet you still need an investment strategy! Is there a way of adjusting your ranking of investment strategies/learning algorithms, mutual funds, or whatever that improves on the ranking the naive approach gives you? Bonferroni adjustments don’t usually change your rankings, just the strength of your conclusions.

People have made mileage out of portfolios of losing stocks for example but that is a pretty special case.