Along with improvements in computational power, variable selection has become one of the problems attracting the most effort. We (well.. experts) have made huge leaps in the realm of variable selection, with prediction probably the most common objective. LASSO (Least Absolute Shrinkage and Selection Operator) leads the way from the west (Stanford) with its many variations (Adaptive, Random, Relaxed, Fused, Grouped, Bayesian.. you name it), while SCAD (Smoothly Clipped Absolute Deviation) catches up from the east (Princeton). With good progress in that area, inference, which is no less important but has been given less attention, is now being worked out.

What seems to be the problem, officer?

Simply put: “simple selection methods fail to deliver the usual significance, tests are misleading”*.
“..ignoring model selection can be deceptively optimistic”**
Why is that?

First, a selection step is performed, typically using BIC, AIC or some other information criterion. For clarity, by “structure” I mean a statement like “variables number 2, 7, and 8 were chosen, the rest are 0”. Once we are comfortable with the structure that resulted from the selection step, we estimate its coefficients. But the distribution of those coefficient estimates is not the same as it would be if we had chosen the same structure beforehand, without a selection step.

We can use the following illustration. Consider:

The real impact on Y comes only from x1, but x1 and x2 are correlated, so when we apply a selection procedure we will sometimes see x2 chosen. See what the confidence intervals for the coefficient are when we ignore the selection step, and what they should be if we account for it:

The red lines do not account for the added layer of uncertainty. That layer comes from the fact that, beforehand, you were not even certain that you wanted to estimate this coefficient. Now that a selection step has helped you decide to include this variable, the red interval is too narrow: it reflects a different confidence interval, the one without the added layer of uncertainty.
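To make the setup concrete, here is a minimal simulation sketch. It is not the code behind the figure. Everything specific in it is my own assumption: the sample size, the correlation of 0.9, the true model y = x1 + noise, and a selection rule that simply keeps whichever single regressor gives the smaller residual sum of squares (for models of equal size this ranks them the same way AIC or BIC would). The function name `simulate_selection` is hypothetical.

```python
import numpy as np

def simulate_selection(n=50, rho=0.9, n_rep=2000, seed=0):
    """Repeat: draw data where only x1 drives y, select x1 or x2, record x2's coefficient."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    beta2_hat = np.zeros(n_rep)          # stays 0 whenever x2 is not selected
    naive_se = np.full(n_rep, np.nan)    # filled only when x2 is selected
    for r in range(n_rep):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = X[:, 0] + rng.normal(size=n)            # assumed true model: y = x1 + noise
        sxx = (X ** 2).sum(axis=0)
        coefs = (X.T @ y) / sxx                     # one-regressor OLS slopes (no intercept)
        rss = ((y[:, None] - X * coefs) ** 2).sum(axis=0)
        if rss[1] < rss[0]:                         # x2 wins the selection step
            beta2_hat[r] = coefs[1]
            # textbook standard error, computed as if x2 had been fixed in advance
            naive_se[r] = np.sqrt(rss[1] / (n - 1) / sxx[1])
    return beta2_hat, naive_se

beta2_hat, naive_se = simulate_selection()
chosen = beta2_hat != 0
naive_width = 2 * 1.96 * np.nanmean(naive_se)
lo, hi = np.quantile(beta2_hat, [0.025, 0.975])
print(f"x2 selected in {chosen.mean():.1%} of replications")
print(f"average naive 95% interval width: {naive_width:.2f}")
print(f"spread of the estimate across replications (zeros included): {hi - lo:.2f}")
```

The point of the comparison is only qualitative: once the replications in which x2 was not chosen are counted as zeros, the estimate varies far more than the naive standard error suggests.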

From this illustration we can already see one solution to the problem: bootstrap your data and construct the confidence intervals, assigning 0 to every replication in which the variable was not chosen. You may end up with confidence intervals that are too ‘bumpy’. The second reference below shows how to solve that, basically with a smoothed version, though even the bumpy version is better than ignoring the selection step altogether (which is a blatant mistake).
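The bootstrap recipe above can be sketched as follows. This is the plain ‘bumpy’ percentile version, not the smoothed one from the second reference. The selection rule (keep the single regressor with the smallest residual sum of squares, no intercept), the simulated data and the function names `select_and_fit` and `bootstrap_after_selection` are assumptions made for illustration. In practice you would plug in whatever selection procedure you actually used.

```python
import numpy as np

def select_and_fit(X, y):
    """Keep the single column with the smallest RSS; all other coefficients are set to 0."""
    coefs = (X.T @ y) / (X ** 2).sum(axis=0)        # one-regressor OLS slopes (no intercept)
    rss = ((y[:, None] - X * coefs) ** 2).sum(axis=0)
    best = int(np.argmin(rss))
    out = np.zeros(X.shape[1])
    out[best] = coefs[best]
    return out

def bootstrap_after_selection(X, y, n_boot=2000, level=0.95, seed=0):
    """Percentile intervals that redo the selection step inside every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample rows with replacement
        draws[b] = select_and_fit(X[idx], y[idx])   # zeros mark "not chosen"
    alpha = 1 - level
    lower = np.quantile(draws, alpha / 2, axis=0)
    upper = np.quantile(draws, 1 - alpha / 2, axis=0)
    selected_freq = (draws != 0).mean(axis=0)       # how often each variable was chosen
    return lower, upper, selected_freq

# Assumed toy data: only x1 matters, x1 and x2 are correlated.
rng = np.random.default_rng(42)
X = rng.multivariate_normal(np.zeros(2), [[1.0, 0.9], [0.9, 1.0]], size=50)
y = X[:, 0] + rng.normal(size=50)
lower, upper, freq = bootstrap_after_selection(X, y)
for j in range(X.shape[1]):
    print(f"x{j + 1}: chosen {freq[j]:.1%} of the time, interval [{lower[j]:.2f}, {upper[j]:.2f}]")
```

Because a variable's coefficient is exactly 0 in every resample where it was not chosen, the bootstrap distribution is a mixture of a spike at 0 and a continuous part, which is where the bumpiness comes from.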

* Alan J Miller. Selection of subsets of regression variables. Journal of the Royal Statistical Society. Series A (General), pages 389-425, 1984, from discussion.
** Bradley Efron. Estimation and Accuracy after Model Selection. 2013.
– Previously written on the subject of variable selection:
How Important is Variable Selection

### One comment on “Advances in post-model-selection inference”

1. Georges Nguefack-Tsague says:

Thanks for illustrating this very important issue, which is often overlooked. The following papers also deal with this problem.
1. Nguefack-Tsague, G. and Zucchini, W. (2016). A Mixture-Based Bayesian Model Averaging Method. Open Journal of Statistics; (2):220-228
https://www.statindex.org/articles/313875
http://dx.doi.org/10.4236/ojs.2016.62019

2. Nguefack-Tsague G. and Zucchini W. (2016). Effects of Bayesian model selection on frequentist performances: an alternative approach.
Applied Mathematics; 7(10):1103-1115. http://dx.doi.org/10.4236/am.2016.710098
3. Nguefack-Tsague G., Zucchini W., and Fotso S. (2016). Frequentist model averaging and applications to Bernoulli trials. Open Journal of Statistics; 6(3):545-553. http://dx.doi.org/10.4236/ojs.2016.63046

4. Nguefack-Tsague, G. (2014). Estimation of a Multivariate Mean under Model Selection Uncertainty. Pakistan Journal of Statistics and Operation Research; 10 (1):131-145
https://www.statindex.org/articles/311042
http://dx.doi.org/10.18187/pjsor.v10i1.449

5. Nguefack-Tsague G. and Bulla I. (2014). A focused Bayesian information criterion.
Advances in Statistics; Volume 2014, Article ID 504325. http://dx.doi.org/10.1155/2014/504325
6. Nguefack-Tsague G. (2014). On optimal weighting scheme in model averaging.
American Journal of Applied Mathematics and Statistics; 2(3):150-156. http://dx.doi.org/10.12691/ajams-2-3-9
7. Nguefack-Tsague G. (2013). On bootstrap and post-model selection inference.
International Journal of Mathematics and Computation; 21(4):51-64.
http://www.ams.org/mathscinet-getitem?mr=MR3062016

8. Nguefack-Tsague, G. (2013). An alternative derivation of some commons distributions functions: A post-model selection approach
International Journal of Applied Mathematics and Statistics; 42 (12) :138-147
http://www.ams.org/mathscinet-getitem?mr=MR3093313
https://www.statindex.org/articles/271149
9. Nguefack-Tsague, G. (2013). Bayesian estimation of a multivariate mean under model uncertainty. International Journal of Mathematics and Statistics; 13 (1) :83-92
http://www.ams.org/mathscinet-getitem?mr=MR3021499
https://zbmath.org/?q=an:1308.62036
https://www.statindex.org/articles/267266

10. Nguefack-Tsague, G. and Zucchini, W. (2011). Post-model selection inference and model averaging. Pakistan Journal of Statistics and Operation Research; 7 (2-Sp) :347-361
https://www.statindex.org/articles/258048
http://dx.doi.org/10.18187/pjsor.v7i2-Sp.292

11. Zucchini W., Claeskens G. and Nguefack-Tsague G. (2011). Model Selection. International Encyclopedia of Statistical Science Part 13: pp. 830-833.