Linking backtesting with multiple testing

The other day, Campbell Harvey from Duke University gave a talk where I work. The talk, bearing the exciting name “Backtesting”, was based on a paper by the same name.

The authors tackle the important problem of data snooping: we need to account for the fact that we conduct many trials until we find a strategy (or a variable) that ‘works’. Accessible explanations can be found here and here. In this day and age, the ‘story’ behind what you are doing is more important than ever, given what you can do with a desktop/laptop.

Basically, the authors fill a new bottle (the backtesting literature) with by-now-old wine (the statistical literature). Namely, we want to control the error rate. If we test one strategy, the chance we wrongly decide it works when in reality it is rubbish is, say, 5%. But if we test 10 strategies which are all rubbish in reality, the chance that we wrongly decide at least one of them works is about 40%. Yet you want to keep that chance at 5%, even though you test 10 strategies instead of one. A first solution was given by Bonferroni (1936): simply use 5%/10 = 0.5% for each individual test. This ensures that the chance you wrongly decide that one of those 10 strategies works is below 5%. This is called controlling the ‘family-wise error rate’ (FWER), and it is a very strict solution. Strict in the sense that if you have 100 strategies and want to keep your 5%, a strategy has to perform exceptionally well for you to discover it, and you will miss out on good strategies that work very well, but not exceptionally well.

An improvement was suggested by Holm (1979), and after that Benjamini and Hochberg (1995) suggested: relax (man). Allow strategies that do not work to enter, as long as you find enough that actually do. Say 20 of the 1,000 strategies you test seem to work; what Benjamini and Hochberg are saying is that, in that case, it is less important if one of the 20 (5%) is not really working, or as they put it, “has been falsely discovered”. This is called controlling the ‘false discovery rate’ (FDR), a different quantity.
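
The arithmetic above in a minimal R sketch; the 10 tests at the 5% level are the numbers from the example:

    # chance of at least one false discovery across m independent 5%-level tests
    alpha <- 0.05
    m     <- 10
    1 - (1 - alpha)^m   # ~0.40, far above the 5% we wanted
    # Bonferroni: run each individual test at alpha / m instead
    alpha / m           # 0.005, i.e. 0.5% per test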

The two options/quantities serve different purposes. The former suits the Food and Drug Administration, where we absolutely do not want medicine that does not work to sit on pharmacy shelves. The latter is more suitable for us: we endure the cost of implementing a strategy that does not work, as long as we gain the power to discover those that do perform. In variable-selection terms, we are willing to add estimation noise to our model (a variable which is not important) as long as we also add relevant information (include more relevant variables).

The main thrust of Harvey and Liu is that since the t-stat is closely related to the Sharpe ratio, specifically

    \[\text{t-stat} = \widehat{SR} \times \sqrt{T},\]

we can translate the t-stat threshold implied by your preferred error rate (FWER or FDR) into a Sharpe ratio hurdle. The paper goes on to offer a correction for the Sharpe ratio you actually got, but it is basically a one-to-one mapping (see the formula). As a stylized example, say we have 10 strategies and we use the Bonferroni correction. We need a t-stat which is the mapping of 0.5%, so about 2.8, and not about 2 as we would need if we tested only one strategy; the corresponding SR hurdle follows from the formula.
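
Continuing the stylized example in a short R sketch; the sample size of 120 observations (ten years of monthly returns) is an assumption for illustration:

    m     <- 10    # strategies tested
    T_obs <- 120   # assumed: 10 years of monthly observations
    alpha <- 0.05
    # Bonferroni-adjusted two-sided t-stat hurdle
    t_hurdle <- qnorm(1 - (alpha / m) / 2)   # ~2.81, versus ~1.96 for a single test
    # invert t-stat = SR_hat * sqrt(T) to get the Sharpe ratio hurdle
    SR_hurdle <- t_hurdle / sqrt(T_obs)      # per-period (monthly) hurdle, ~0.26
    SR_hurdle * sqrt(12)                     # annualized, roughly 0.89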

Naturally, these corrections have long been implemented in standard computing software:
In Matlab: here.
In R: the p.adjust function in the base stats package covers all of the above.
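
A quick illustration; the p-values below are made-up numbers standing in for the backtest results of six strategies:

    # hypothetical backtest p-values for six strategies
    p <- c(0.001, 0.004, 0.012, 0.030, 0.210, 0.480)
    p.adjust(p, method = "bonferroni")  # FWER control, most conservative
    p.adjust(p, method = "holm")        # FWER control, uniformly more powerful
    p.adjust(p, method = "BH")          # FDR control (Benjamini-Hochberg)
    p.adjust(p, method = "BY")          # FDR control under dependency (Benjamini-Yekutieli)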

References:
Holm, Sture (1979). “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6 (2): 65–70.

Benjamini, Yoav; Hochberg, Yosef (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B 57 (1): 289–300.

False Discovery Rate (Wikipedia)

* When the tests are not independent, as they are in reality, an improvement is offered in:
Benjamini, Yoav; Yekutieli, Daniel (2001). “The Control of the False Discovery Rate in Multiple Testing under Dependency.” Annals of Statistics 29 (4): 1165–1188.
