Along with improvements in computational power, variable selection has become one of the problems attracting the most research effort. We (well.. experts) have made huge leaps in the realm of variable selection, with prediction probably being the most common objective. LASSO (Least Absolute Shrinkage and Selection Operator) leads the way from the west (Stanford) with its many variations (Adaptive, Random, Relaxed, Fused, Grouped, Bayesian.. you name it), while SCAD (Smoothly Clipped Absolute Deviation) is catching up from the east (Princeton). With good progress made in that area, a topic that is no less important but has been given less attention, **inference**, is now being worked out.

**What seems to be the problem, officer?**

Simply this: “simple selection methods fail to deliver the usual significance, tests are misleading”*.

“..ignoring model selection can be deceptively optimistic”**

Why is that?

First, a selection step is performed, typically using BIC, AIC, or some other information criterion. For clarity, I call *structure* a statement like “variables number 2, 7, and 8 were chosen, the rest are 0”. Once we are comfortable with the structure resulting from the selection step, we estimate the chosen coefficients. However, the distribution of those coefficients is not the same as it would be had we fixed the same structure beforehand, without a selection step.

We can use the following illustration. We consider y = x1 + ε, with a second candidate regressor x2 = x1 + noise.

The real impact on y comes only from x1, but x1 and x2 are correlated, so when we apply a selection procedure we will sometimes see x2 chosen. Compare the confidence intervals for the coefficient of x1 when we ignore the selection step with what they should be when we account for it:

```r
repp <- 1000   # number of simulated data sets
TT   <- 50     # observations per data set
tempcor <- bet0 <- bet1 <- mod <- numeric(repp)

# We choose from three options: x1 + x2, only x1, or only x2,
# according to the AIC criterion
for (i in 1:repp) {
  eps <- rnorm(TT)
  x1  <- rnorm(TT, 3, 1)
  x2  <- x1 + rnorm(TT, 0, 1)   # x2 is correlated with x1
  y   <- x1 + eps               # only x1 has a real impact on y
  tempcor[i] <- cor(x1, x2)

  lm0 <- lm(y ~ x1 + x2)
  bet0[i] <- coef(lm0)["x1"]
  lm1 <- lm(y ~ x1)
  bet1[i] <- coef(lm1)["x1"]
  lm2 <- lm(y ~ x2)

  # 1 = x1 + x2, 2 = only x1, 3 = only x2
  mod[i] <- which.min(c(AIC(lm0), AIC(lm1), AIC(lm2)))
}

# Possible results for the coefficient of x1 - selection step accounted for:
# take the estimate from whichever model was chosen, 0 if x1 was dropped
postbet <- ifelse(mod == 1, bet0, ifelse(mod == 2, bet1, 0))
dens1 <- density(postbet)
# Possible results for the coefficient of x1 - selection step NOT accounted for
dens2 <- density(bet1)  # ignoring model selection

lwd1 <- 3  # graphical parameter
plot(dens1, ylim = c(0, 5), main = "The problem of post model-selection inference",
     col = 3, xlab = "Ignoring model selection in red - too optimistic", lwd = lwd1)
abline(v = 1, lwd = lwd1, col = 4)   # actual value
lines(dens2, col = 2, lwd = lwd1)
# 10% and 90% quantiles of the simulated coefficients themselves
abline(v = quantile(postbet, c(.1, .9)), col = 3, lwd = lwd1)
abline(v = quantile(bet1, c(.1, .9)), col = 2, lwd = lwd1)
legend(x = 1.1, y = 4, "Actual value", lwd = lwd1, lty = 1, col = 4, text.col = 4, bty = "n")
```

The red lines do not account for the fact that you have an added layer of uncertainty. The added layer comes from the fact that beforehand, you were not even certain that you wanted to estimate this coefficient. Now that you have had a selection step (which helped you decide you actually want to include this variable), the red interval is too narrow. It reflects a different confidence interval, the one without the added layer of uncertainty.
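One way to put a number on "too narrow" is to check how often the usual interval, computed from whichever model AIC picked, actually covers the true value of 1. The following is only a sketch under the same data-generating assumptions as the simulation above; the names `n_rep`, `n_obs`, `level` and `covered` are my own, and counting a dropped x1 as a miss is a choice made here for illustration.

```r
set.seed(1)    # assumption: fixed seed, only for reproducibility
n_rep <- 500   # number of simulated data sets
n_obs <- 50    # observations per data set
level <- 0.8   # matches the 10%/90% lines in the plot
covered <- logical(n_rep)

for (i in seq_len(n_rep)) {
  x1 <- rnorm(n_obs, 3, 1)
  x2 <- x1 + rnorm(n_obs)
  y  <- x1 + rnorm(n_obs)   # true coefficient of x1 is 1

  # Selection step: pick the candidate with the smallest AIC
  fits <- list(lm(y ~ x1 + x2), lm(y ~ x1), lm(y ~ x2))
  best <- fits[[which.min(sapply(fits, AIC))]]

  # Naive interval: computed as if the chosen model had been fixed in advance.
  # If x1 was dropped, covered[i] stays FALSE (counted as a miss).
  if ("x1" %in% names(coef(best))) {
    ci <- confint(best, "x1", level = level)
    covered[i] <- ci[1] <= 1 && 1 <= ci[2]
  }
}

naive_coverage <- mean(covered)
naive_coverage   # compare with the nominal level
```

If the selection step were harmless, `naive_coverage` would sit close to `level`; how far it falls short depends on the design and on the selection rule.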

From this illustration we can already see one solution to the problem: bootstrap your data and construct the confidence intervals, assigning 0 to all those times the variable was not chosen. You may end up with rather ‘bumpy’ confidence intervals. The second reference below shows how to solve that, basically with a smoothed version, though even the bumpy version is better than ignoring the selection step altogether (which is an outright mistake).
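The bootstrap idea can be sketched as follows. This is a minimal illustration on one simulated data set, not the procedure from either reference: the helper `select_and_estimate`, the data frame `dat` and the number of resamples `n_boot` are hypothetical names introduced here, and the plain percentile interval is just the simplest option.

```r
set.seed(1)   # assumption: fixed seed, only for reproducibility
n_obs <- 50
x1 <- rnorm(n_obs, 3, 1)
x2 <- x1 + rnorm(n_obs)
dat <- data.frame(y = x1 + rnorm(n_obs), x1 = x1, x2 = x2)

# Hypothetical helper: rerun the AIC selection on a data set and return
# the coefficient of x1 from the chosen model, or 0 if x1 was not chosen
select_and_estimate <- function(d) {
  fits <- list(lm(y ~ x1 + x2, data = d),
               lm(y ~ x1, data = d),
               lm(y ~ x2, data = d))
  best <- fits[[which.min(sapply(fits, AIC))]]
  b <- coef(best)
  if ("x1" %in% names(b)) unname(b["x1"]) else 0
}

# Resample rows with replacement and repeat selection + estimation each time
n_boot <- 1000
boot_beta <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)
  select_and_estimate(dat[idx, ])
})

# Percentile interval that includes the selection step (can be 'bumpy')
boot_ci <- quantile(boot_beta, c(0.1, 0.9))
# Averaging over resamples gives a bagged (smoothed) point estimate
smoothed <- mean(boot_beta)
```

Because the selection is repeated inside every resample, the spread of `boot_beta` reflects both the estimation noise and the uncertainty about which model gets chosen. The average in `smoothed` is only the bagging step; it does not implement the accuracy formulas of the second reference.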

* Alan J. Miller. Selection of subsets of regression variables. Journal of the Royal Statistical Society. Series A (General), pages 389-425, 1984 (from the discussion).

** Bradley Efron. Estimation and Accuracy after Model Selection. 2013.

– Previously written on the subject of variable selection:

How Important is Variable Selection


Thanks for illustrating this very important issue, which is often overlooked. The following papers also deal with this problem.

1. Nguefack-Tsague, G. and Zucchini, W. (2016). A Mixture-Based Bayesian Model Averaging Method. Open Journal of Statistics; (2):220-228

https://www.statindex.org/articles/313875

http://dx.doi.org/10.4236/ojs.2016.62019

2. Nguefack-Tsague G. and Zucchini W. (2016). Effects of Bayesian model selection on frequentist performances: an alternative approach. Applied Mathematics; 7(10):1103-1115. http://dx.doi.org/10.4236/am.2016.710098

3. Nguefack-Tsague G., Zucchini W., and Fotso S. (2016). Frequentist model averaging and applications to Bernoulli trials. Open Journal of Statistics; 6(3):545-553. http://dx.doi.org/10.4236/ojs.2016.63046

4. Nguefack-Tsague, G. (2014). Estimation of a Multivariate Mean under Model Selection Uncertainty. Pakistan Journal of Statistics and Operation Research; 10 (1):131-145

https://www.statindex.org/articles/311042

http://dx.doi.org/10.18187/pjsor.v10i1.449

5. Nguefack-Tsague G. and Bulla I. (2014). A focused Bayesian information criterion. Advances in Statistics; Volume 2014, Article ID 504325. http://dx.doi.org/10.1155/2014/504325

6. Nguefack-Tsague G. (2014). On optimal weighting scheme in model averaging. American Journal of Applied Mathematics and Statistics; 2(3):150-156. http://dx.doi.org/10.12691/ajams-2-3-9

7. Nguefack-Tsague G. (2013). On bootstrap and post-model selection inference. International Journal of Mathematics and Computation; 21(4):51-64.

http://www.ams.org/mathscinet-getitem?mr=MR3062016

8. Nguefack-Tsague, G. (2013). An alternative derivation of some commons distributions functions: A post-model selection approach. International Journal of Applied Mathematics and Statistics; 42(12):138-147

http://www.ams.org/mathscinet-getitem?mr=MR3093313

https://www.statindex.org/articles/271149

9. Nguefack-Tsague, G. (2013). Bayesian estimation of a multivariate mean under model uncertainty. International Journal of Mathematics and Statistics; 13 (1) :83-92

http://www.ams.org/mathscinet-getitem?mr=MR3021499

https://zbmath.org/?q=an:1308.62036

https://www.statindex.org/articles/267266

10. Nguefack-Tsague, G. and Zucchini, W. (2011). Post-model selection inference and model averaging. Pakistan Journal of Statistics and Operation Research; 7 (2-Sp) :347-361

https://www.statindex.org/articles/258048

http://dx.doi.org/10.18187/pjsor.v7i2-Sp.292

11. Zucchini W., Claeskens G. and Nguefack-Tsague G. (2011). Model Selection. International Encyclopedia of Statistical Science Part 13: pp. 830-833.

http://www.springerlink.com/content/n13p3q0281322h22/

12. Nguefack-Tsague G., Zucchini W. and Fotso S. (2011). On correcting the effects of model selection on inference in linear regression. Syllabus Review (Sciences) 2(3), :122-140

http://www.ens.cm/files/syllabus_sciences/ScienceV2I3_2011_122_140.pdf

13. Nguefack-Tsague, G. (2006). Estimating and correcting the effects of model selection uncertainty. Publisher: Cuvillier Verlag.