Beware of Spurious Factors

The word spurious refers to “outwardly similar or corresponding to something without having its genuine qualities.” Fake.

While the meanings of spurious correlation and spurious regression are common knowledge nowadays, much less is understood about spurious factors. This post draws your attention to recent, top-shelf research flagging the risks around spurious factor analysis. While formal solutions are still pending, there are a couple of heuristics we can use to detect possible problems.

Since you know what spurious correlation is, it’s easy to board the train of thought at this station. When two variables, think prices of two stocks, drift upwards or downwards simultaneously, this drift alone is enough to spike the correlation, regardless of the actual statistical relation between the two variables. Now, factors are often extracted from cross-sectional data using principal component analysis (PCA). The numerical procedure starts with the computation of the correlation/covariance matrix. Therefore, with spurious entries in the correlation matrix, the extracted factors would over-represent the common variation in the data, sometimes absurdly so.
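To make the drift effect concrete, here is a minimal R sketch (not from the paper) that simulates pairs of independent random walks and compares their absolute correlation in levels and in first differences. The seed, sample size, and number of replications are arbitrary choices for illustration.

```r
set.seed(1)     # assumed seed, only for reproducibility
n_obs <- 500    # length of each random walk (illustrative)
n_rep <- 1000   # number of simulated pairs (illustrative)

abs_cor <- replicate(n_rep, {
  # Two independent random walks: cumulative sums of unrelated Gaussian shocks
  walk_a <- cumsum(rnorm(n_obs))
  walk_b <- cumsum(rnorm(n_obs))
  c(levels      = abs(cor(walk_a, walk_b)),
    differences = abs(cor(diff(walk_a), diff(walk_b))))
})

# Average absolute correlation, in levels and in first differences
mean_abs_cor <- rowMeans(abs_cor)
print(round(mean_abs_cor, 3))
```

The two series are independent by construction, so any sizable correlation in levels is spurious. The average absolute correlation should come out much larger in levels than in first differences.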

The problem is serious on at least a couple of levels. First, the deception effect is substantial: the first factor extracted from random walk data without any common factors would falsely claim to explain circa 61% of the variation in the data. Below you can find code which shows how serious a problem this actually is. Second, there is presently no formal way of solving it. But there are a couple of things we can do.
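As a rough, small-scale check of that figure (separate from the paper's replication code further down), the R sketch below simulates independent random walks with no common factor and records the share of total variance attributed to the first principal component. The dimensions, seed, and number of replications are assumptions chosen to keep it fast.

```r
set.seed(123)    # assumed seed, only for reproducibility
n_obs <- 200     # length of each series (illustrative)
n_series <- 50   # number of series (illustrative)
n_rep <- 100     # number of simulated panels (illustrative)

first_share <- replicate(n_rep, {
  # Independent Gaussian shocks, cumulated column by column into random walks
  shocks <- matrix(rnorm(n_obs * n_series), n_obs, n_series)
  walks <- apply(shocks, 2, cumsum)
  # Variances of the principal components of the demeaned data
  eigenvalues <- prcomp(walks, center = TRUE, scale. = FALSE)$sdev^2
  eigenvalues[1] / sum(eigenvalues)
})

print(summary(first_share))
```

With these settings the shares should cluster in the neighborhood of the figure quoted above, even though the series share no factor at all.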

The paper Spurious Factor Analysis (see references for a working version) suggests a couple of heuristics to cope with spurious factors. The first is to always compare factors estimated from level data with factors estimated from first-differenced data. A mismatch between the two would call for more investigation. A second, less formal strategy is to eyeball a time series plot of the extracted factors and compare it with another plot of completely spurious factors. If the two plots resemble each other, then it is time to sound the alarm.
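The first heuristic can be sketched in a few lines of R. The helper `first_pc_share` is a hypothetical name introduced here, not something from the paper, and the simulated data stand in for whatever panel you are analyzing.

```r
# Share of total variance captured by the first principal component
# (hypothetical helper, for illustration only)
first_pc_share <- function(x) {
  eigenvalues <- prcomp(x, center = TRUE, scale. = FALSE)$sdev^2
  eigenvalues[1] / sum(eigenvalues)
}

set.seed(42)     # assumed seed, only for reproducibility
n_obs <- 300     # length of each series (illustrative)
n_series <- 50   # number of series (illustrative)

# Independent random walks: no genuine common factor
levels_data <- apply(matrix(rnorm(n_obs * n_series), n_obs, n_series), 2, cumsum)
diff_data <- diff(levels_data)  # first differences, taken column by column

share_levels <- first_pc_share(levels_data)
share_diffs <- first_pc_share(diff_data)
print(round(c(levels = share_levels, differences = share_diffs), 3))
```

A large gap between the two shares, as one would expect here, is the kind of mismatch that calls for a closer look. For the second heuristic, the columns of `prcomp(levels_data)$x` can be drawn with `matplot` and compared by eye with the factors extracted from your own data.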

Code

The following code replicates the Monte Carlo simulation presented in the aforementioned paper (Table A.II). It generates N independent Gaussian random walks of length T. You can change the (N, T) numbers in the code ((P, TT) in the R code below) to the data dimensions you wish to simulate. The Matlab code is taken directly from the supplementary material of the paper, and the R code is my own translation (so any bugs are my doing).
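Both versions below build the random walks by multiplying a matrix of Gaussian shocks by an upper-triangular matrix of ones. As a small sanity check (an illustration with toy dimensions, not part of the paper's code), the R sketch below confirms that this product is the same as taking cumulative sums.

```r
set.seed(3)   # assumed seed, only for reproducibility
n_obs <- 5    # toy length, for illustration

# Two series stored in rows, as in the Matlab layout (series x time)
shocks <- matrix(rnorm(2 * n_obs), nrow = 2)

# Upper-triangular matrix of ones, diagonal included
U <- matrix(1, n_obs, n_obs)
U[lower.tri(U)] <- 0

# Entry [i, t] of the product sums shocks[i, 1:t], i.e. a random walk
walks_matrix <- shocks %*% U
walks_cumsum <- t(apply(shocks, 1, cumsum))

print(all.equal(walks_matrix, walks_cumsum))
```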


% This code is directly from the Econometrica paper
%FRED-MD
p=128;
T=710;
N=p;
CNT=min([sqrt(p);sqrt(T)]);
U=toeplitz([1;zeros(T-1,1)],ones(1,T));
rmax=15;
for i=1:10000
    epsil=randn(p,T);
    DATA=epsil*U;
    DATA=DATA-(mean(DATA'))'*ones(1,T);
    if T<=p
        [UU,D]=eig(DATA'*DATA/(N*T));
    else
        [UU,D]=eig(DATA*DATA'/(N*T));
    end
    d=sort(real(diag(D)));
    V=flipdim(cumsum(d),1);
    pen1=V(2:(rmax+1),1)*(0:rmax)*(T/(4*log(log(T))))*((N+T)/(N*T))*log((N*T/(N+T)));
    pen2=V(2:(rmax+1),1)*(0:rmax)*(T/(4*log(log(T))))*((N+T)/(N*T))*log(CNT^2);
    pen3=V(2:(rmax+1),1)*((0:rmax).*(N+T-(0:rmax))/(N*T))*(T/(4*log(log(T))))*log(N*T);
    pen1=pen1';
    pen2=pen2';
    pen3=pen3';
    IPC1=V(1:rmax+1,1)*ones(1,rmax)+pen1;
    IPC2=V(1:rmax+1,1)*ones(1,rmax)+pen2;
    IPC3=V(1:rmax+1,1)*ones(1,rmax)+pen3;
    for j=1:rmax
        [min1,khat1]=min(IPC1(1:j+1,j));
        [min2,khat2]=min(IPC2(1:j+1,j));
        [min3,khat3]=min(IPC3(1:j+1,j));
        khat1=khat1-1;
        khat2=khat2-1;
        khat3=khat3-1;
        Kh1(i,j)=khat1;
        Kh2(i,j)=khat2;
        Kh3(i,j)=khat3;
    end
end
for i=1:rmax
    for t=0:rmax
        Tablek1(rmax+1-t,i)=sum(Kh1(:,i)==t)/100;
        Tablek2(rmax+1-t,i)=sum(Kh2(:,i)==t)/100;
        Tablek3(rmax+1-t,i)=sum(Kh3(:,i)==t)/100;
    end
end


# The following function determines the number of factors
# according to the IPC criteria (a translation of the Matlab code above)
ICP <- function(X, rmax) {
  X <- as.matrix(X)  # use the argument rather than the global `dat`
  TT <- nrow(X)
  P <- ncol(X)
  # Eigenvalues of X'X / (TT * P), returned in decreasing order
  d <- eigen(crossprod(X) / (TT * P), symmetric = TRUE, only.values = TRUE)$values
  # V[k + 1] is the sum of all eigenvalues except the k largest,
  # mirroring V = flipdim(cumsum(d), 1) in the Matlab code
  V <- rev(cumsum(rev(d)))
  k <- 0:rmax
  term1 <- TT / (4 * log(log(TT)))
  term2 <- (P + TT) / (P * TT)
  term3 <- log(P * TT / (P + TT))
  # Penalties: rows index k = 0..rmax, columns index the maximum number of factors 1..rmax
  pen1 <- outer(k, V[2:(rmax + 1)]) * term1 * term2 * term3
  pen2 <- outer(k, V[2:(rmax + 1)]) * term1 * term2 * log(min(TT, P))
  pen3 <- outer(k * (P + TT - k) / (TT * P), V[2:(rmax + 1)]) * term1 * log(P * TT)
  ipc1 <- replicate(rmax, V[1:(rmax + 1)]) + pen1
  ipc2 <- replicate(rmax, V[1:(rmax + 1)]) + pen2
  ipc3 <- replicate(rmax, V[1:(rmax + 1)]) + pen3
  khat1 <- khat2 <- khat3 <- numeric(rmax)
  for (j in 1:rmax) {
    # Minimize over k = 0..j; subtract 1 because row 1 corresponds to k = 0
    khat1[j] <- which.min(ipc1[1:(j + 1), j]) - 1
    khat2[j] <- which.min(ipc2[1:(j + 1), j]) - 1
    khat3[j] <- which.min(ipc3[1:(j + 1), j]) - 1
  }
  list(khat1[rmax], khat2[rmax], khat3[rmax])
}

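library(magrittr)  # provides the %>% pipe

khat <- NULL
ss <- 20     # number of simulation replications
TT <- 710    # length of each series
P <- 128     # number of series
sdd <- 1     # standard deviation of the random-walk increments
rmax <- 6    # maximum number of factors considered
# Upper-triangular matrix of ones: multiplying by it cumulates the shocks
x <- toeplitz(rep(1, TT))
x[lower.tri(x)] <- 0
for (i in 1:ss) {
  tmp <- rep(TT, P) %>% lapply(rnorm, 0, sd = sdd)
  tmp2 <- do.call(cbind, tmp)                   # TT x P matrix of shocks
  dat <- (t(tmp2) %*% x) %>% t                  # TT x P matrix of random walks
  dat <- dat - t(replicate(TT, colMeans(dat)))  # demean each series
  khat[i] <- ICP(dat, rmax = rmax)[[1]]
}
khat %>% table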