post-model-selection inference

Along with improvements in computational power, variable selection has become one of the problems attracting the most effort. We (well.. experts) have made huge leaps in the realm of variable selection, with prediction probably being the most common objective. LASSO (Least Absolute Shrinkage and Selection Operator) leads the way from the west (Stanford) with its many variations (Adaptive, Random, Relaxed, Fused, Grouped, Bayesian.. you name it), while SCAD (Smoothly Clipped Absolute Deviation) is catching up from the east (Princeton). With good progress in that area, inference, which is no less important but has been given less attention, is now being worked out.

What seems to be the problem, officer?

Straightforwardly that: “simple selection methods fail to deliver the usual significance, tests are misleading”*.
“..ignoring model selection can be deceptively optimistic”**
Why is that?

First, a selection step is performed, typically using BIC, AIC or some other information criterion. For clarity, by "structure" I mean a statement like “variables number 2, 7, and 8 were chosen, the rest are 0”. Once we are comfortable with the structure that resulted from the selection step, the distribution of those coefficients is not the same as it would be if we had chosen the same structure beforehand, without a selection step.
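
One way to write this down, as a rough sketch (the notation $\hat{M}$ and $\hat{\beta}_j^{(m)}$ is my own and does not come from the references): let $\hat{M}$ be the structure picked by the selection step and $\hat{\beta}_j^{(m)}$ the estimate of coefficient $j$ under structure $m$, set to 0 when variable $j$ is not part of $m$. The estimate we actually report is then

$$\hat{\beta}_j = \sum_{m} \mathbf{1}\{\hat{M} = m\}\, \hat{\beta}_j^{(m)}.$$

Its distribution is a mixture over structures, and each piece is the distribution of $\hat{\beta}_j^{(m)}$ given that $m$ was selected, which is generally not the distribution $\hat{\beta}_j^{(m)}$ would have if $m$ had been fixed in advance.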

We can use the following illustration. Consider the model:

$y_t = \beta_1 x_{1,t} + \beta_2 x_{2,t} + \varepsilon_t.$

The real impact on y comes only from x1 (so $\beta_1 = 1$ and $\beta_2 = 0$ in the simulation below), but x1 and x2 are correlated, and when we apply a selection procedure we will sometimes see x2 chosen. See what the confidence intervals for the coefficient of x1 are when we ignore the selection step, and what they should be once we account for it:


repp <- 1000  # number of simulated data sets
TT <- 50      # observations per data set
tempcor <- mod <- bet0 <- bet1 <- numeric(repp)
# We choose from three options: x1+x2, only x1,
# and only x2, according to the AIC criterion
for (i in 1:repp) {
  eps <- rnorm(TT)
  x1 <- rnorm(TT, 3, 1)
  x2 <- x1 + rnorm(TT, 0, 1)  # x2 is correlated with x1
  y <- x1 + eps               # only x1 drives y
  tempcor[i] <- cor(x1, x2)
  lm0 <- lm(y ~ x1 + x2)
  lm1 <- lm(y ~ x1)
  lm2 <- lm(y ~ x2)
  bet0[i] <- coef(lm0)["x1"]  # coefficient of x1 in the full model
  bet1[i] <- coef(lm1)["x1"]  # coefficient of x1 in the x1-only model
  mod[i] <- which.min(c(AIC(lm0), AIC(lm1), AIC(lm2)))
}
# possible results for the coefficient of x1 - selection step accounted for:
# take the estimate from the model AIC chose, and 0 when x1 was not chosen
postbet <- ifelse(mod == 1, bet0, ifelse(mod == 2, bet1, 0))
dens1 <- density(postbet)
# possible results for the coefficient of x1 - selection step NOT accounted for
dens2 <- density(bet1) # Ignoring model selection
lwd1 <- 3 # Graphical parameter
plot(dens1, ylim = c(0, 5), main = "The problem of post model-selection inference", col = 3,
     xlab = "Ignoring model selection in red - too optimistic", lwd = lwd1)
abline(v = 1, lwd = lwd1, col = 4)
lines(dens2, col = 2, lwd = lwd1)
# 10% and 90% quantiles of the estimates themselves (not of the density grid)
abline(v = quantile(postbet, c(.1, .9)), col = 3, lwd = lwd1)
abline(v = quantile(bet1, c(.1, .9)), col = 2, lwd = lwd1)
legend(x = 1.1, y = 4, "Actual value", lwd = lwd1, lty = 1, col = 4, text.col = 4, bty = "n")

The red lines do not account for the added layer of uncertainty. That layer comes from the fact that, beforehand, you were not even certain you wanted to estimate this coefficient. Once a selection step has helped you decide to include the variable, the red interval is too narrow: it reflects a different confidence interval, the one without the added layer of uncertainty.

From this illustration we can already see one solution to the problem: bootstrap your data, rerun the selection step on each resample, and construct the confidence intervals assigning 0’s to all those times the variable was not chosen. You may end up with confidence intervals that are too ‘bumpy’. The second reference below shows how to solve that, basically with a smoothed version, though even the bumpy version is better than ignoring the selection step altogether (which is a blunt mistake).
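
The bumpy bootstrap recipe could look roughly like the sketch below. It is only an illustration, not the smoothed procedure from the second reference: the data are generated the same way as in the simulation, and the helper name `select_and_fit`, the number of resamples `B`, the seed and the 10%/90% cutoffs are my own assumptions.

```r
set.seed(123)                        # arbitrary seed, only for reproducibility
TT <- 50
x1 <- rnorm(TT, 3, 1)
x2 <- x1 + rnorm(TT, 0, 1)           # x2 is correlated with x1
y  <- x1 + rnorm(TT)                 # only x1 drives y
dat <- data.frame(y, x1, x2)

# Hypothetical helper: run the AIC selection step on a data set and return
# the coefficient of x1 from the chosen model (0 if x1 was not chosen)
select_and_fit <- function(d) {
  fits <- list(lm(y ~ x1 + x2, data = d),
               lm(y ~ x1, data = d),
               lm(y ~ x2, data = d))
  best <- fits[[which.min(sapply(fits, AIC))]]
  b <- coef(best)
  if ("x1" %in% names(b)) unname(b["x1"]) else 0
}

B <- 500                             # number of bootstrap resamples (assumed)
boot_bet <- replicate(B, {
  idx <- sample(TT, replace = TRUE)  # resample rows with replacement
  select_and_fit(dat[idx, ])         # selection is redone on every resample
})

# Percentile interval that carries the selection step along
ci_post <- quantile(boot_bet, c(.1, .9))
# Naive 80% interval from the x1-only model, ignoring the selection step
ci_naive <- confint(lm(y ~ x1, data = dat), "x1", level = 0.8)
ci_post
ci_naive
```

Comparing `ci_post` with `ci_naive` gives a feel for how much the selection step matters on a given data set. The bootstrap interval will often be the wider of the two, though that is not guaranteed for every sample.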

* Alan J. Miller. Selection of subsets of regression variables. Journal of the Royal Statistical Society, Series A (General), pages 389-425, 1984 (from the discussion).
** Bradley Efron. Estimation and accuracy after model selection. 2013.
– Previously written on the subject of variable selection:
How Important is Variable Selection