Spurious Regression, Spurious Correlation

The spurious regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Let's see what the problem is and how we can fix it. I am using the Morgan Stanley (MS) symbol for illustration, over a pre-crisis time span. Take a look at the following figure, generated from a regression of MS on the S&P using the actual prices of both. When we use actual prices we call it a regression in levels, as in price levels, as opposed to log-transformed prices or returns.

Regression in levels, Morgan Stanley price level and fitted values from the regression MS~SPY.

The results from the regression are:

	Estimate	Std. Error	t.value	P.value
(Intercept)	-46.4234	2.1827	-21.27	0.0000
beta.hat	0.8534	0.0178	47.90	0.0000
R^2 = 0.76

The graph looks fine and the results seem to make sense, but the inference is utterly wrong!

The thing is, both series drift upward, so they drift together, and it seems as if they are related. As a matter of fact they are related, but what we just did is the wrong way to check it. Here are similar results from x and y that are independent random walks!


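y = cumsum(rnorm(250*10, 0.05)) # random walk: cumulative sum of normal shocks with a small (0.05) drift
x = cumsum(rnorm(250*10, 0.05)) # a second walk, independent of y by construction
lm2 = lm(y ~ x) ; summary(lm2)
plot(y, type = "l", main = "Fitted (in blue) over Actual -- Random WALK this time", xlab = "Time index")
lines(fitted(lm2), col = 4) # fitted values from the levels regression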

	Estimate	Std.Error	t.value	P.value
(Intercept)	7.0474	0.4651	15.15	0.0000
x	0.5862	0.0062	94.29	0.0000
R^2 = 0.78

Note the resemblance to the previous figure and table.
So the analysis of two random walks, which are clearly independent of each other by construction, and the analysis of two time series in levels can give the same qualitative result, as if there were a significant positive correlation. That can't be good, right?
In real life, how would I know whether what I see is an actual relation or the result of two unrelated series that just so happen to be drifting in the same direction?
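
How often does this happen? Here is a minimal simulation sketch, assuming driftless Gaussian random walks and the usual 1.96 cutoff. The names sim_once, n_sim and n_obs are mine, not from the code in this post. It counts how often two independent walks give a "significant" slope in levels, and how often they do so after first differencing:

```r
set.seed(123)   # arbitrary seed, only so the run is repeatable
n_sim <- 200    # number of simulated pairs (assumed value)
n_obs <- 500    # length of each walk (assumed value)

# Simulate one pair of independent walks and return the slope t-statistic
# from the regression in levels and from the regression in first differences.
sim_once <- function(n) {
  x <- cumsum(rnorm(n))
  y <- cumsum(rnorm(n))
  t_lev <- summary(lm(y ~ x))$coefficients["x", "t value"]
  dx <- diff(x)
  dy <- diff(y)
  t_dif <- summary(lm(dy ~ dx))$coefficients["dx", "t value"]
  c(levels = t_lev, diffs = t_dif)
}

t_stats <- replicate(n_sim, sim_once(n_obs))  # 2 x n_sim matrix
reject  <- rowMeans(abs(t_stats) > 1.96)      # share of "significant" slopes
print(reject)
```

The levels regression should reject far more often than the nominal 5%, while the differenced one should stay close to it.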

Here we step into the highly important yet amazingly boring domain of unit roots. This post is not about unit roots, and I want to keep it short so as not to lose the remaining 5% of the 100% who started reading. Abusing terminology a little, suffice it to say that we need to remove the drift in the series; check here and the references therein for more information.
Once the drift is removed, we can verify that there is indeed a real relation, meaning Morgan Stanley's stock movement is actually affected by the market's movement. Removing the drift is easy: use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
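
For readers who want the short reason differencing works, here is a sketch under the simplest assumption: a random walk with drift $\mu$ and i.i.d. shocks $\varepsilon_t$ with variance $\sigma^2$. The notation is mine, not from this post.

```latex
\begin{aligned}
y_t &= \mu + y_{t-1} + \varepsilon_t
\;\Longrightarrow\;
y_t = y_0 + \mu t + \sum_{s=1}^{t} \varepsilon_s,
\qquad \operatorname{Var}(y_t \mid y_0) = t\,\sigma^2, \\
\Delta y_t &= y_t - y_{t-1} = \mu + \varepsilon_t,
\qquad \operatorname{Var}(\Delta y_t) = \sigma^2 .
\end{aligned}
```

The level has a mean and a variance that both grow with $t$, so it is not stationary, while the first difference has a constant mean and variance.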

We can transform the data from levels to returns and re-execute the regression as follows:


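library(quantmod) ; library(xtable) ; library(tseries)
tckr = c('MS', 'SPY')
end <- "2007-01-01"
# The original computed start as format(Sys.Date() - 365*8, "%Y-%m-%d"), which only falls
# before `end` when run near the post's date. The fixed date below is an assumed stand-in.
start <- "2004-01-01"
dat1 = getSymbols(tckr[1], src = "yahoo", from = start, to = end, auto.assign = FALSE)
dat2 = getSymbols(tckr[2], src = "yahoo", from = start, to = end, auto.assign = FALSE)
ret1 = (dat1[,4] - dat1[,1])/dat1[,1]  # Convert to open-to-close returns: (Close - Open)/Open
ret2 = (dat2[,4] - dat2[,1])/dat2[,1]
lmret = lm(ret1 ~ ret2)
summary(lmret)
plot(as.numeric(ret2), as.numeric(ret1), xlab = "SPY return", ylab = "MS return") # MS returns against SPY returns
abline(lmret, col = 2, lwd = 2.5) # fitted regression line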

Now we can see that even when we analyze returns rather than levels, we still get a good fit.

You can use the “adf.test” function in the “tseries” package to check whether your series drifts (is non-stationary*) or not.


adf.test(as.numeric(dat1[,1])) # --> P.value is 0.6481 --> cannot reject a Unit Root
adf.test(as.numeric(ret1)) # --> P.value < 0.01 --> Unit Root rejected
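
To see what the test is doing without downloading anything, here is a hand-rolled sketch of the simplest Dickey-Fuller regression on simulated data. It is not exactly what adf.test computes, since that function also includes a time trend and lagged differences. df_stat is a hypothetical helper of mine, and -2.86 is the approximate large-sample 5% critical value for the version with an intercept only.

```r
set.seed(42)  # arbitrary seed for repeatability

# Dickey-Fuller t-statistic from the regression
#   diff(y)_t = a + rho * y_{t-1} + e_t
# A very negative t-statistic on rho is evidence against a unit root.
df_stat <- function(y) {
  dy   <- diff(y)
  ylag <- y[-length(y)]
  summary(lm(dy ~ ylag))$coefficients["ylag", "t value"]
}

walk  <- cumsum(rnorm(1000))  # has a unit root by construction
noise <- diff(walk)           # its first difference, stationary

stat_walk  <- df_stat(walk)
stat_noise <- df_stat(noise)
cat("walk:", stat_walk, " differenced:", stat_noise, "\n")
# A statistic below roughly -2.86 rejects the unit root at the 5% level.
```

The differenced series should give a statistic far below the cutoff, while the walk itself typically will not.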

As a final note, the fact that we cannot make any inference using price levels does not render the regression completely useless. Both the “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that the y-series may drift and the x-series may drift, but the residual from the regression will not!

See how the residuals from the regression fluctuate around zero.
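
The residual idea can be sketched on simulated data. The assumptions are all mine: x is a random walk, y_coint is built as a linear function of x plus stationary noise (so the pair is co-integrated by construction), and y_indep is an unrelated walk. df_stat is a hypothetical helper that runs a plain Dickey-Fuller regression on the residuals. Residual-based tests need their own critical values (roughly -3.34 at 5% for two series with an intercept), not the ordinary ADF ones.

```r
set.seed(7)  # arbitrary seed for repeatability
n <- 1000
x       <- cumsum(rnorm(n))         # common stochastic trend
y_coint <- 2 + 0.5 * x + rnorm(n)   # shares the trend of x: co-integrated
y_indep <- cumsum(rnorm(n))         # independent walk: not co-integrated

# Dickey-Fuller t-statistic: regress diff(u) on lagged u
df_stat <- function(u) {
  du   <- diff(u)
  ulag <- u[-length(u)]
  summary(lm(du ~ ulag))$coefficients["ulag", "t value"]
}

# Step 1: regress in levels. Step 2: test the residuals for a unit root.
res_coint <- residuals(lm(y_coint ~ x))
res_indep <- residuals(lm(y_indep ~ x))

stat_coint <- df_stat(res_coint)
stat_indep <- df_stat(res_indep)
cat("co-integrated pair:", stat_coint, " independent pair:", stat_indep, "\n")
```

For the co-integrated pair the residuals fluctuate around zero and the statistic should be strongly negative; for the independent pair the residuals wander like a random walk.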

Comments
1. * A stationary process does not only mean “no drift”; there is a weak definition and a strong definition, see here for more information.
2. According to the graph, it seems that the end of the time span I used, the start of 2007, was a good time to short MS and hedge with the market. I leave it to the reader to check what the loss on such a trade would have been.

8 comments on “Spurious Regression Illustrated”

claudio says:

03/04/2012 at 6:49 PM

Hi, nice work. But I had a strange result here. Everything was ok until one line.

Specifically, the problem appeared here:

> tckr = c(‘MS’, ‘SPY’)
> end start dat1 = (getSymbols(tckr[1], src=”yahoo”, from=start, to=end, auto.assign = FALSE))
Erro em as.POSIXlt.character(as.character(x), …) :
character string is not in a standard unambiguous format

Do you have an idea of which kind of trouble could it be? Thanks for your time and attention.

1. Eran says:
  
  03/04/2012 at 8:27 PM
  
  Hi Claudio,
  
  ” > end start dat1 = (getSymbols(tckr[1], src=”yahoo”, from=start, to=end, auto.assign = FALSE)) ”
  
  That is not what the code looks like. Run these one line at a time, pressing Enter after each:
  end <- "2007-01-01"
  start<-format(Sys.Date() - 365*8,"%Y-%m-%d") # 8 years of data
  dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
  Worked for me, hope this helps.
  
  1. claudio says:
    
    03/04/2012 at 11:53 PM
    
    Hi Eran,
    
    Thanks, but you practically did the same as I did before. It didn’t work. Anyway, thanks.
  2. Chute says:
    
    03/24/2013 at 4:00 PM
    
    Claudio, try adding index.class to your yahoo download. This helps with xts & zoo
    
    dat1 = (getSymbols(tckr[1], index.class=”POSIXct”, src=”yahoo”, from=start, to=end, auto.assign = FALSE))
    dat2 = (getSymbols(tckr[2], index.class=”POSIXct”, src=”yahoo”, from=start, to=end, auto.assign = FALSE))
claudio says:

03/05/2012 at 12:14 AM

It’s not the first time this type of “import from yahoo” action give me problems. I wonder why. Again, thanks for your answer.

1. eran says:
  
  03/05/2012 at 2:04 PM
  
  Well,
  
  1. Might be that your “end” is the wrong class.. so character as opposed to Date, you can check that, you can also try to switch the source from “yahoo” to “google”.
  
  2. I am using the most recent R version = 2.14 and the most recent quantmod package, you can try to update what you have and maybe that will solve it.
  
  Should not be very problematic.
  
  Good luck
  
Shreyes Upadhyay says:

06/04/2012 at 3:58 PM

Really nice post.
We recently had a lecture on this concept of “spurious regression in 2 I(1) series” while studying co-integration.
Very informative and illustrative, keep sharing!
~
Shreyes

Chris Wynkoop says:

07/20/2012 at 2:34 PM

Great work! Extremely accessible explanation on cointegration of financial time series data. Kudos!
