We all use models. We are all continuously working to improve and validate them. Constant effort goes into estimating: how good is our model, actually?
A general term for this estimate is the error rate. A low error rate is better than a high one; it means our model is more accurate.
By far the most common accuracy metric is the Mean Squared Error: how far are our predicted values from the realizations, where "how far" is measured using Euclidean distance. Here we direct special attention to an important drawback: the sample gives a too-optimistic answer to the question of how good our model actually is. Below I explain the reason for this drawback.
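To fix notation, with y_i the realized values, \hat{y}_i the predicted values and n observations, the quantity is

\mathrm{MSE} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 .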
The optimism bias of the training error rate is a very deep concept in statistics, and it is not to be confused with overfitting; it has nothing to do with overfitting. Even if, magically, we managed to guess correctly the underlying model which generates the data, and even if we somehow made perfect modelling choices (e.g. the number of parameters), we would still be too optimistic, concluding that we know more than we actually do.
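To make this concrete: for squared-error loss, the expected gap between the in-sample prediction error and the training error (the expected optimism, in the terminology of The Elements of Statistical Learning, section 7.4, referenced below) is

\omega \;=\; \frac{2}{n}\sum_{i=1}^{n}\operatorname{Cov}(\hat{y}_i, y_i),

and for a linear model with d estimated coefficients and noise variance \sigma^2 it reduces to 2 d \sigma^2 / n. Nothing in this expression requires a wrong or overly flexible model; it is positive even when the model is exactly right.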
A carefully contrived yet realistic example
You are thinking about buying a house. You have a model for how much each house should cost, based on its size, location and so on. You see a house located between two other houses which recently changed hands (and you know at which price). You have the features of the two houses (the X) and their prices (the Y). You can say something about the house sitting between them: you have a guess for its price based on what you know from the other two. Of course, there is a residual, which you are aware of. This post tells you that you overestimate how good your guess really is!
Simulation
To illustrate this optimism bias we simulate a linear model, estimate the error rate (the mean squared error), and see what happens when new data come in.
Remember, the fitted model has exactly the right form, so there is no misspecification.
# Simulate a model with say 100 observations
TT <- 100
# make containers
MSE <- PMSE <- NULL
# define the noise level
sig <- 2
# 3 explanatories
pp <- 3
# define the signal level
signall <- 10
# simulate 1000 times
for (i in 1:1000) {
  # the magnitude of the betas increases so some are strong and some weak
  beta = seq(from = 1, to = signall, length.out = 4)
  X = cbind(rep(1, TT), matrix(rnorm(TT * pp), nrow = TT))
  Y = X %*% beta + sig * rnorm(TT)
  # Now we create an out-of-sample Y:
  Ynew = X %*% beta + sig * rnorm(TT)
  # the linear model:
  mod = lm(Y ~ X - 1)
  # Yhat are the fitted values
  Yhat = predict(mod)
  # MSE is the error of the in-sample
  MSE = c(MSE, mean((Y - Yhat)^2))
  # New Y - all else is the same
  PMSE = c(PMSE, mean((Ynew - Yhat)^2))
}
What we did is simulate 1000 MSE measures, in-sample (MSE) and out-of-sample (PMSE). We plot the distribution of those 1000 estimates. If all is well, the in-sample MSE should generalize to the out-of-sample one. The noise level is the same and the explanatory variables are the same, so the two simulated distributions should align, but they don't:
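A minimal base-R sketch of how the two distributions can be compared (not the original plotting code; it reuses the MSE, PMSE and sig objects created above):

# compare the in-sample and out-of-sample error distributions
plot(density(MSE), col = "black", xlim = range(c(MSE, PMSE)),
     main = "In-sample vs. out-of-sample MSE", xlab = "MSE")
lines(density(PMSE), col = "red")
abline(v = sig^2, lty = 2)  # the true error variance (4 in this simulation)
legend("topright", legend = c("MSE (in-sample)", "PMSE (out-of-sample)"),
       col = c("black", "red"), lty = 1)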
You can see that the distribution of the in-sample MSE estimate (black) is shifted to the left, meaning a lower error rate. However, when we use new observations, we see that the model is not as accurate as the in-sample estimate made us believe: the PMSE (red), which is based on new observations, is higher. The "real" MSE is 4 in this example (the noise standard deviation sig is 2, so the irreducible error variance is sig^2 = 4), and indeed the distribution of the out-of-sample estimate looks to be centered correctly.
What is the reason for the bias? It happens, at least in part, because we evaluate the fitted values only at the particular locations of the X inputs we happened to observe. Your explanatory variables do not span the whole space. An analogy: a fair die can fall on six faces, but if you only throw it four times you will be missing at least two numbers. Think of those missing numbers as "holes" in your six-face space. Such "holes" in the X space, the fact that your X spectrum is not "full" when estimating the error rate, are part of the reason for the poor generalization to the out-of-sample. The other reason, in conjunction, is that the data were already used to fit the model. We cannot ask the data to now fill a new role: after they were used to fit the model (to estimate the coefficients, for example), we cannot also ask them to estimate the error rate of the model. That is a new, unrelated task. We already squeezed the data when estimating the model; there is not enough juice left.
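For this simulation the size of the bias can even be quantified. Using the 2 d \sigma^2 / n expression above (the linear-model result discussed in The Elements of Statistical Learning, section 7.4), a quick check in R, reusing sig, TT, MSE and PMSE from the code above:

# expected optimism for this setup: 4 estimated coefficients, TT = 100, sig^2 = 4
d <- 4
2 * d * sig^2 / TT        # theoretical gap between PMSE and MSE: 0.32
mean(PMSE) - mean(MSE)    # the simulated gap should land close to 0.32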
This is a fundamental topic in statistics, which has spurred a tremendous number of papers discussing this bias and ways in which we can counter or cancel it. This optimism of the training error rate is the chief motivation for using cross-validation techniques to evaluate the error rate.
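For completeness, here is a minimal k-fold cross-validation sketch in base R (the choice of 5 folds and the object names are illustrative; it reuses TT, Y and X from the last iteration of the simulation loop above):

# 5-fold cross-validation estimate of the error rate for one simulated data set
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = TT))     # randomly assign observations to folds
dat <- data.frame(Y = as.numeric(Y), X[, -1])  # drop the intercept column; lm adds its own
cv_errors <- numeric(k)
for (j in 1:k) {
  fit  <- lm(Y ~ ., data = dat[folds != j, ])        # fit on everything except fold j
  pred <- predict(fit, newdata = dat[folds == j, ])  # predict the held-out fold
  cv_errors[j] <- mean((dat$Y[folds == j] - pred)^2)
}
mean(cv_errors)  # held-out estimate of the error rate, roughly centered on sig^2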
An excellent discussion can be found in The Elements of Statistical Learning, section 7.4. The book can be downloaded as a PDF for free from here.
See also: Statistical Methods in the Atmospheric Sciences.
Hi, thanks for the helpful example.
Just one small remark:
I think you mean beta = seq(from = 1, to = signall, length.out = 4) in the first row of the loop. Otherwise beta could have any length depending on the value of signall.
Best regards!
Why not, thanks, now updated.
Hi, thanks for this clear and concise illustration.
But how do you know or obtain the "real" actual MSE? Why is it 4 in the simulation?
Thanks!
You don't obtain it. I just set it arbitrarily. Since it is a simulated example I can do that. In reality we can only tell there is a bias, but the magnitude needs to be estimated as well.