How Important is Variable Selection?

Very.

If you have 10 candidate independent regressors, none of which matter, you still have a good chance of finding that at least one looks important.

A good chance being 40%: prob(one or more looks important) = 1 – prob(none looks important) =

(1)   \begin{equation*} 1 - \prod_{j = 1}^{K} \mathrm{prob}(\text{p.value}_j > 0.05) = 1 - 0.95^{10} \approx 0.40 \end{equation*}
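As a quick check of the arithmetic, here is a minimal Python sketch of the same formula. The function name `prob_false_discovery` and its parameters `k` (number of irrelevant, independent regressors) and `alpha` (significance level) are hypothetical names chosen for illustration.

```python
def prob_false_discovery(k, alpha=0.05):
    """Chance that at least one of k independent, irrelevant regressors
    shows a p-value below alpha purely by luck."""
    return 1.0 - (1.0 - alpha) ** k


# One candidate, ten candidates, one hundred candidates.
for k in (1, 10, 100):
    print(k, round(prob_false_discovery(k), 3))
```

With 10 candidates the chance is about 0.40, and with 100 candidates it is above 0.99. The sample size appears nowhere in the formula.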

So you have only around a 60% chance of reaching the correct conclusion that nothing matters. Note that more data does nothing to solve this. Imagine 10 independent trading strategies backtested. Imagine 100. A related post is do-they-really-know-what-they-are-doing, where I made the point not to get too excited about a money manager with a very high Sharpe (or information) ratio.
It is also the reason for my principal point in the R talk I gave in AmsteRdam: know where the profits (losses) stem from, and what the story is behind what the speculator does.

The following code illustrates the problem: the chance of finding “important” unimportant variables. You can play around with the parameters and double-check that increasing N does nothing to mitigate the problem.

Run the code to get the typical figure:
[Figure: Probability for finding “important” unimportant variables]
Notes
1. The function takes an argument “distrib”; you can change it to another distribution such as runif. Amazingly, passing a function as an argument just works.
2. In practice, the variables are not independent, so things are not that bad (the independent case is the worst). However, in practice we also consider way more possible variables than just 10.
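The experiment described above can be sketched in a few lines. This is a minimal Python sketch and not the post's original R code. It makes several assumptions: each regressor is tested separately with a correlation t-test rather than in one joint regression, the critical value comes from a normal approximation, and the names `normal`, `uniform`, `looks_important` and `share_with_false_hit` are hypothetical. The `distrib` parameter mirrors the argument mentioned in note 1 and accepts any function that draws one number.

```python
import math
import random
from statistics import NormalDist


def normal(rng):
    """Draw one standard normal value."""
    return rng.gauss(0.0, 1.0)


def uniform(rng):
    """Draw one value uniformly from [0, 1)."""
    return rng.random()


def looks_important(x, y, z_crit):
    """Two-sided test of zero correlation between x and y.
    Assumption: normal critical value instead of the exact t quantile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return abs(t) > z_crit


def share_with_false_hit(n=100, k=10, reps=300, distrib=normal,
                         alpha=0.05, seed=1):
    """Share of simulated data sets in which at least one of k
    irrelevant regressors passes the significance test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(reps):
        # The response is pure noise, unrelated to every regressor.
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if any(looks_important([distrib(rng) for _ in range(n)], y, z_crit)
               for _ in range(k)):
            hits += 1
    return hits / reps


# Vary the sample size, then swap the distribution of the regressors.
for n in (50, 200):
    print("normal, n =", n, share_with_false_hit(n=n))
print("uniform, n = 50", share_with_false_hit(n=50, distrib=uniform))
```

Under these assumptions the printed shares should hover around 0.4 whatever the sample size, up to simulation noise. Passing `uniform` instead of `normal` shows the function-as-argument idea from note 1.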