Some time during the 19th century the biologist and geologist Louis Agassiz said: “Every great scientific truth goes through three stages. First, people say it conflicts with the Bible. Next they say it has been discovered before. Lastly they say they always believed it”. Nowadays I am not sure about the Bible but yeah, it happens.
I express here my long-standing admiration for the following triplet of present-day great discoveries. The authors of all three papers initially struggled to advance their ideas, which echoes the quote above. Here they are, in no particular order.
- The Bootstrap (1979), by Bradley Efron.
- Comparing Predictive Accuracy (1995), by Francis X. Diebold and Roberto S. Mariano (DM henceforth)
- Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing (1995), by Yoav Benjamini and Yosef Hochberg
If you study or studied Applied Statistics, there is simply no getting around these three, and in a good way.
It is nice to know that some quantity is asymptotically distributed F. However, the word asymptotically means approaching, and sometimes we are not sure how good the approximation is, or how quickly it is approaching. Is it ‘Japanese high-speed train’ approaching? Or is it ‘my grandpa goes to the toilet’ approaching? Bootstrapping relies on resampling from the finite sample we actually have, so the number of data points is accounted for. More often than not, we simply sample from our empirical distribution, the one that we observe, or some variant thereof.
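To make this concrete, here is a minimal sketch of the basic nonparametric bootstrap: resample the observed data with replacement, recompute the statistic each time, and read a confidence interval off the resampled statistics. The data, statistic, and number of replications here are illustrative choices of mine, not from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100)  # a small, skewed sample

# Resample with replacement from the empirical distribution,
# recomputing the statistic (here, the mean) each time.
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean()
    for _ in range(5000)
])

# Percentile 95% confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
```

No normality assumption, no delta method; the finite-sample variability comes straight from the data.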
At the time of this writing, the original paper alone has been cited over 12,500 times. And you can now find a great many bootstrap techniques addressing different complications: the Bayesian bootstrap, smooth bootstrap, parametric bootstrap, residual resampling, the usual regression bootstrap, the wild bootstrap to handle heteroscedasticity, the block bootstrap to handle time dependence, and the panel bootstrap to handle panel data (some references are below; more are welcome). In this interview, Prof. Efron reveals that the paper was practically rejected on first submission.
First circulated in 1991, the paper has so far gathered 4,658 citations. I would argue this is the most common way to statistically compare the accuracy of two forecasts. Looking at two money managers, one of them is going to do better than the other. The paper helps answer the question: “is he really better?” You can see where it comes in handy. I can do no better than to quote Prof. Francis Diebold himself:
If the need for predictive accuracy tests seems obvious ex post, it was not at all obvious to empirical econometricians circa 1991. Bobby Mariano and I simply noticed the defective situation, called attention to it, and proposed a rather general yet trivially-simple approach to the testing problem. And then – boom! – use of predictive accuracy tests exploded.
The paper was submitted to the highly prestigious journal Econometrica and was rejected. A referee expressed bewilderment as to why anyone would care about the subject it addressed. “.. Lastly they say they always believed it.” Around a year later, an extension was published: “Asymptotic Inference About Predictive Ability” by K.D. West.
Eleven years after the DM paper, Econometrica published another important extension: “Tests of Conditional Predictive Ability” (2006) by Giacomini, R. and H. White (GW henceforth). For quite some time, I did not really understand the difference between the DM test and the GW test, perhaps because computationally there is no difference. In a nutshell, the biggest difference between the DM and GW tests for predictive ability is in how they arrive at the asymptotic normal distribution. The DM proof relies on population parameters, while the GW proof lets us stay with our inaccurate sample estimates of those parameters. Practically, this means that if your forecasts are based on expanding windows, the two tests are asymptotically equivalent. However, if your forecasts are based on a rolling estimation window, the DM test is not valid: with a rolling window of constant size, you can no longer assume convergence to the population parameters on which the DM proof is based. If you are using a rolling window, the computation is identical, but you are relying on the GW proof.
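Computationally, the statistic is just a t-test on the loss differential between the two forecasts, with a long-run (HAC) variance to allow for the serial correlation that multi-step forecasts induce. A sketch in Python, assuming squared-error loss and a rectangular-kernel variance with h−1 autocovariance lags (the function name and these choices are mine):

```python
import math
import numpy as np

def dm_test(e1, e2, h=1):
    """DM-type statistic for two forecast-error series under squared-error loss.
    e1, e2: forecast errors of the two competing forecasts; h: forecast horizon.
    Returns (statistic, two-sided p-value against N(0, 1))."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2   # loss differential series
    n = len(d)
    dbar = d.mean()
    # Long-run variance of d: gamma_0 plus twice the first h-1 autocovariances
    var = np.sum((d - dbar) ** 2) / n
    for k in range(1, h):
        var += 2 * np.sum((d[k:] - dbar) * (d[:-k] - dbar)) / n
    stat = dbar / math.sqrt(var / n)
    pval = math.erfc(abs(stat) / math.sqrt(2))      # 2 * (1 - Phi(|stat|))
    return stat, pval
```

A positive statistic says forecaster 2 has the smaller average loss; whether you read the normal critical value via the DM or the GW asymptotics depends, as discussed above, on your estimation scheme.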
This ingenious paper concerns multiple testing. If you have 10 possible independent regressors, none of which matter, you have a good chance of finding that at least one looks important. A good chance being about 40%:
prob(one or more looks important) = 1 − prob(none looks important) = 1 − (1 − 0.05)^10 ≈ 0.40
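A quick numeric check of that 40% figure, for 10 independent tests at the 5% level with all nulls true:

```python
# Chance that at least one of 10 independent 5%-level tests
# rejects when every null hypothesis is actually true:
p_any = 1 - (1 - 0.05) ** 10
print(round(p_any, 3))  # ≈ 0.401
```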
Not really adequate. You generally want to keep your error rate at 5% regardless of how many tests you make. So, over the years, various suggestions have been made.
A first and intuitive solution was given by Bonferroni (1936): simply use 0.05/10 = 0.005 (0.5%) as the level for each individual test. This ensures that the chance of wrongly rejecting at least one true null is less than 5%. You are now controlling what is called the ‘family-wise error rate’ (FWER). It is a very conservative solution, though. Drawing from genetics, you would never decide a gene is important, since there are so many of them (0.05/20,000 is a very small number). Benjamini and Hochberg (1995) suggest relaxing that. They come up with a different quantity to control:
FDR = E[ V / (V + S) ] (defined as 0 when nothing is rejected), where V is the number of false rejections and S is the number of correct rejections. Hence, instead of controlling the proportion of false rejections relative to the total number of tests (as in the traditional per-comparison approach), we aim to control the expected proportion of false rejections relative to the total number of rejections.
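The resulting step-up procedure is easy to implement: sort the p-values, find the largest k such that p_(k) ≤ (k/m)·q, and reject the k hypotheses with the smallest p-values. A sketch in Python (the function name and the example p-values are mine):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up procedure: boolean rejection mask at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare sorted p-values to the BH line (k/m) * q, k = 1..m
    passed = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])   # largest index on/below the line
        reject[order[: k + 1]] = True       # reject everything up to it
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals).sum())  # BH rejects 2; Bonferroni at 0.005 rejects only 1
```

Note the step-up logic: once some p_(k) is below the line, everything smaller is rejected too, even p-values that are themselves above the line.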
Why is this quantity more relevant nowadays?
I understand why you don’t want to be wrong when you check a couple of things, but these days you probably check more than a couple, simply because you can. The idea here is that you should perform, but you don’t have to be perfect. As long as you discover correct rejections (discover real effects, e.g. real market anomalies), you can make a mistake every once in a while. ‘Every once in a while’ meaning not more than, say, 1 false discovery for every 20 discoveries (so, confusingly, again 5%). Nice, eh? Evidently 29,785-citations nice. To be fair, multiple testing is applied in a great many more fields, so many of those citations come from biology, medicine and so forth. Here is the link to the friendly original paper. Also in this case, the paper was not an overnight success. Far from it, actually; you may be surprised to hear that it took 5 years and three journals for the paper to see the light of day.
As a final word: journal editors are the gate-keepers for all kinds of papers coming our way. These are only a few examples of “false negatives”, and I hope and believe that we are in good shape, with a filtering process that, albeit often painstakingly slow, works. For those of you out there who had the courage to explore some out-of-the-box original ideas, these three examples should provide some inspiration and comfort. Original ideas may encounter strong opposition.
- Peter Bühlmann. Bootstraps for time series. Statistical Science, pages 52-72, 2002.
- M.P. Clements and J.H. Kim. Bootstrap prediction intervals for autoregressive time series. Computational statistics & data analysis, 51(7):3580-3594, 2007.
- Russell Davidson and Emmanuel Flachaire. The wild bootstrap, tamed at last. Journal of Econometrics, 146(1):162-169, 2008.
- B. Efron. Model selection, estimation, and bootstrap smoothing. 2012.
- Bradley Efron. The bootstrap and Markov-chain Monte Carlo. Journal of Biopharmaceutical Statistics, 21(6):1052-1062, 2011.
- B.E. Hansen. The grid bootstrap and the autoregressive model. Review of Economics and Statistics, 81(4):594-607, 1999.
- Regina Y. Liu. Bootstrap procedures under some non-iid models. The Annals of Statistics, pages 1696 – 1708, 1988.
- D.B. Rubin. The Bayesian bootstrap. The Annals of Statistics, 9(1):130-134, 1981.
- Xiaofeng Shao. The dependent wild bootstrap. Journal of the American Statistical Association, 105(489):218-235, 2010.
- Marco Meyer and Jens-Peter Kreiss. On the vector autoregressive sieve bootstrap. Journal of Time Series Analysis, 2014
- Joseph P Romano, Azeem M Shaikh, and Michael Wolf. Control of the false discovery rate under dependence using the bootstrap and subsampling. Test, 17(3):417–442, 2008
- Francis X. Diebold. Comparing predictive accuracy, twenty years later: A personal perspective on the use and abuse of Diebold-Mariano tests. Working Paper 18391, National Bureau of Economic Research, September 2012.
- Raffaella Giacomini and Halbert White. Tests of Conditional Predictive Ability, Econometrica Vol. 74, No. 6 (Nov., 2006), pp. 1545-1578
- Diebold, F.X. and R.S. Mariano (1991). Comparing Predictive Accuracy I: An Asymptotic Test. Discussion Paper 52, Institute for Empirical Macroeconomics, Federal Reserve Bank of Minneapolis
- West, K.D. (1996), Asymptotic Inference About Predictive Ability, Econometrica, 64, 1067–1084.
- Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289–300.
- Yoav Benjamini (2010). Discovering the false discovery rate. J. R. Statist. Soc. B, 72, Part 4, pp. 405-416.