Bootstrap your way into robust inference. Wow, that was fun to write..
Introduction
Say you made a simple regression, now you have your . You wish to know if it is significantly different from (say) zero. In general, people look at the statistic or p.value reported by their software of choice, (heRe). Thing is, this p.value calculation relies on the distribution of your dependent variable. Your software assumes normal distribution if not told differently, how so? for example, the (95%) confidence interval is , the 1.96 comes from the normal distribution.
It is advisable not to do that, the beauty in bootstrapping* is that it is distribution untroubled, it’s valid for dependent which is Gaussian, Cauchy, or whatever. You can defend yourself against misspecification, and\or use the tool for inference when the underlying distribution is unknown.
I was not here 40 years ago but they say calculations were slow back then, not any more, take advantage of your Dell or your “meant for you” MacBook. We will see that it is better to add the non in the nonparametric, specifically in financial context, but that holds in general. You can still keep your distribution assumption but at least have a look at what happens when you relax it. The way to do that is by using bootstrap, the idea is intuitive and simple.
The idea
John Fox wrote: “The population is to the sample as the sample is to the bootstrap samples”.. Tadaaam.. but what does this mean? your estimate of which came from the sample, is suppose to estimate the “real” , the population , which is unknown. Now take a sample from the sample, we call that sample a bootstrap sample, estimate your according to this (bootstrap)sample, now this new estimate is an estimate for your original , the one coming from the original data. For clarity, say you have 3 observations, first is {x = 0.7,y = 0.6}, second is {whatever}, third is {whatever}, now, an example of sample from the sample would be the shuffled ordering: first is {whatever}, second is {x = 0.7,y = 0.6}, third is {whatever}. This “shuffling” is what we call the bootstrap sample, note that any observation can be chosen more than once or not at all, that is, we sample with replacement. Now we estimate the same statistic (in our case ) again.
Repeat this sample and estimate many times, you have many bootstrapped estimates, now you can check how do these guys behave. One thing you can do with it is to bootstrap confidence intervals (CI) for your estimate, without the need for underlying distributional assumption.
In R
In R, “boot” package will do the trick:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
library(boot) #load the package # Now we need the function we would like to estimate # In our case the beta: betfun = function(data,b,formula){ # b is the random indexes for the bootstrap sample d = data[b,] return(lm(d[,1]~d[,2], data = d)$coef[2]) # thats for the beta coefficient } # now you can bootstrap: bootbet = boot(data="your data here", statistic=betfun, R=5000) # R is how many bootstrap samples names(bootbet) plot(bootbet) hist(bootbet$t, breaks = 100) |
You can zoom in on which indices were picked at each bootstrap take, what was the exact permutation, you do that using the function boot.array:
1 2 3 4 5 6 7 8 |
zoombot = boot.array(bootbet, indices = T) dim(zoombot) # Size should be R(number of bootstrap samples) times n(NROW of your data) hist(zoombot[1,], breaks = 100) # this is the frequency of each index, for the first #bootstrap run, so a y-value of say, 3 in this histogram #means that the x-value observation was chosen 3 times in this specific bootstrap sample |
More transparent
You can better understand it if you can code it yourself, for a simple problem like bootstrapping confidence intervals it is even simpler and faster:
1 2 3 4 5 6 7 8 9 |
nb = 5000 ; bet = NULL ; n = NROW("your data here") ptm <- proc.time() # have a look at the time it takes for (i in 1:nb){ unifnum = sample(c(1:n),n,replace = T) # pick random indices bet[i] = lm(data[unifnum,1]~data[unifnum,2])$coef[2] } proc.time() - ptm # about 80 seconds on my end |
How bad are your current confidence interval?
Does it really matter? maybe it’s not worth the trouble. As an illustration I use stock returns which are known to have fat tails, meaning more observations far from the center. Take a look at the following figure:
That is the Morgan Stanley with the market. The estimate is centered at 1.87. The green vertical lines are the (95%) confidence interval reported by the the “lm” function, the red vertical lines are the equivalent nonparametric confidence intervals, the light blue curve is the normal density.
Notice how different is the interval from the nonparametric bootstrap, which is more accurate in this case. For example, it may be that the parameter is actually 2, you can see the software output rejects this possibility since it assumes normality, yet the bootstrap confidence interval indeed covers the value 2. So, an investor who thinks that “in my portfolio all beta’s are smaller than 2 with a 95% CI” is mistaken to hold Morgan Stanley as such. This check takes about 80 seconds, so I would think twice before plugging it in a double “for” loop. However, if you are social scientist or similar, beef up your standard (normal) output with this robust analysis.
Wrap up
Here you can see that when you use the bootstrapped confidence interval, when the normal distribution assumption IS valid, you are not that worse off (though you will need to wait these 80 seconds nontheless 🙂 ). I created a fake normal return distribution using same center and standard deviation as reported, and made the exact same analysis:
Again, I simulated from a normal distribution using same center and standard deviation as reported by the “lm”, you can see that the intervals are close to each other which concludes this post, using the parametric confidence interval, from the assumed normal distribution is in some sense suboptimal, since you don’t lose much even if it IS normal.
You can see the remaining code and some references below. Thanks for reading.
Comments
* Naturally we only cover the idea here, there is lots of bootstrapping literature, we only tasted nonparametric bootstrap.
Related literature
[asa onelinertpl]0521574714[/asa]
[asa onelinertpl]0412042312[/asa]
[asa onelinertpl]0817643869[/asa]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 |
library(quantmod) ; library(xtable) ; library(tseries) library(ggplot2) ; library(fields) ; library(grDevices) tckr = c('MS', 'SPY') end<- format(Sys.Date(),"%Y-%m-%d") # yyyy-mm-dd start<-format(Sys.Date() - 365*10,"%Y-%m-%d") # 10 years of data dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE)) dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE)) ret1 = as.numeric((dat1[,4] - dat1[,1])/dat1[,1] ) ret2 = as.numeric((dat2[,4] - dat2[,1])/dat2[,1]) plot(ret1) length(as.numeric(ret1)) ; length(as.numeric(ret1)) plot(as.numeric(ret1)~as.numeric(ret2)) t1 = as.data.frame(cbind(as.numeric(ret1),as.numeric(ret2))) names(t1) <- tckr lm1 = lm(MS~SPY, data = t1) ; summary(lm1) nb = 5000 ; bet = NULL ; n = NROW(t1) ptm <- proc.time() for (i in 1:nb){ unifnum = sample(c(1:n),n,replace = T) bet[i] = lm(t1[unifnum,1]~t1[unifnum,2])$coef[2] } proc.time() - ptm den = density(bet) par(mfrow = c(1,1)) bethat <- bquote(Histogram~of ~bold(widehat(beta))) h = hist(bet, breaks = 100, freq = NULL, probability = F, ylim = c(0,280), xlab = expression(bold(widehat(beta))), , cex.lab = 1.6, main = bethat ) lwd1 = 2.5 xline(mean(bet), col = 4, lwd = lwd1) ; xline(lm1$coef[2], col = 6, lwd = lwd1) xline(mean(bet)+1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1) xline(mean(bet)-1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1) xline(quantile(bet,.025), col = 2, lwd = lwd1) ; xline(quantile(bet,.975), col = 2, lwd = lwd1) xfit<-seq(min(bet),max(bet),length=length(bet)) yfit<-dnorm(xfit,mean=mean(bet),sd=sd(bet)) yfit <- yfit*diff(h$mids[1:2])*length(bet) lines(xfit, yfit, col="blue", lwd=2) yfit2<-dnorm(xfit,mean=lm1$coef[2],sd=summary(lm1)$coef[2,2]) yfit2 <- yfit2*diff(h$mids[1:2])*length(bet) lines(xfit, yfit2, col=5, lwd=2) ################################################### ### Now I generate actually from the normal distribution ################################################### eps = rnorm(n, mean(ret1), sd(ret1)) fakey = summary(lm1)$coef[1,1]+summary(lm1)$coef[2,1]*ret1 +eps # hist(fakey, offset=0.16, width=0.13, add=TRUE, col = "green", breaks = 200) lm2 = lm(fakey~ret1) fakebet = NULL ptm <- proc.time() for (i in 1:nb){ unifnum = sample(c(1:n),n,replace = T) fakebet[i] = lm(fakey[unifnum]~ret1[unifnum])$coef[2] } proc.time() - ptm fakebethat <- bquote(Histogram~of~fake~bold(widehat(beta))) h2 = hist(fakebet, breaks = 100, freq = NULL, probability = F, ylim = c(0,470), xlab = expression(bold(widehat(beta))), , cex.lab = 1.6, main = fakebethat ) xline(mean(fakebet), col = 4, lwd = lwd1) xline(lm2$coef[2], col = 6, lwd = lwd1) xline(mean(fakebet)+1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1) xline(mean(fakebet)-1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1) xline(quantile(fakebet,.025), col = 2, lwd = lwd1) ; xline(quantile(fakebet,.975), col = 2, lwd = lwd1) xfit<-seq(min(fakebet),max(fakebet),length=length(fakebet)) yfit<-dnorm(xfit,mean=mean(fakebet),sd=sd(fakebet)) yfit <- yfit*diff(h2$mids[1:2])*length(fakebet) lines(xfit, yfit, col="blue", lwd=2) yfit2<-dnorm(xfit,mean=lm2$coef[2],sd=summary(lm2)$coef[2,2]) yfit2 <- yfit2*diff(h2$mids[1:2])*length(fakebet) lines(xfit, yfit2, col=5, lwd=2) |
Nice post.
the command xlines does not work
Is it in a package I have to load? If so what package.
Did you mean lines?
since there are both nonstationary time series you may consider my
R package “meboot” explained in a paper in J statsoft and also as
a vignette
Hi Hrishikesh
I am sure there might be an easier way, but here I just used this command from “library(fields)”.
As you can see I used returns, so it IS stationary.
ret1 = as.numeric((dat1[,4] – dat1[,1])/dat1[,1] )
ret2 = as.numeric((dat2[,4] – dat2[,1])/dat2[,1])
by the way, I enjoyed your “hands-on intermediate econometric using R”, your most recent book is on my “to read” list.