Understanding Variance Explained in PCA

Principal component analysis (PCA from here on) is performed via linear algebra functions called eigen decomposition or singular value decomposition. Since you are actually reading this, you may well have used PCA in the past, at school or where you work. There is a strong link between PCA and the usual least squares regression (previous posts here and here). More recently I explained what does variance explained by the first principal component actually means.

This post offers a matrix approximation perspective. As a by-product, we also show how to compare two matrices, to see how different they are from each other. Matrix approximation is a bit math-hairy, but we keep it simple here I promise. For this fascinating field itself I suspect a rise in importance. We are constantly stretching what we can do computationally, and by using approximations rather than the actual data, we can ease that burden. The price for using approximation is decrease in accuracy (à la “garbage in garbage out”), but with good approximation the tradeoff between the accuracy and computational time is favorable.

If you apply the PCA to some matrix A, and you column-bind the first k principal components- call it matrix B, that B matrix is the best approximation for the original matrix A you can get. You get better and better approximation by increasing the k number, i.e. using more PCs, but utilizing only k columns you can’t have any better approximation.

Say you have 10 columns but you want to work with only 4, your k. The 4 principal components constitute the best algebraic approximation (again, which uses only 4 columns) to the original matrix. Change a single entry in that 4 column matrix, and you are moving away from your original A matrix. More details below, if you are interested to know a bit more about PCA internals.

Let’s expand the usual notion of distance between two points to matrices. If $x_1$ and $x_2$ are just numbers, $(x_1-x_2)^2$ is the euclidean distance between them. If they are vectors, say $x_1$ is 5 numbers and $x_2$ is 5 numbers we compute the 5 quantities $(x_{1,1..5} - x_{2,1..5})^2$ and sum them up. Matrices are no different, we sum up all the entries simply. We call this summation over rows $i$ and columns $j$ :

$\|E\|_{F}^{2}=\sum_{i=1}^{m} \sum_{j=1}^{n} E_{i j}^{2}$

the Frobenius norm of a matrix E. $E$ could be thought of as the error matrix, the distance between the $A$ (the original) and your approximation $B$ . You can be happy with your approximation for A if the Frobenius norm of the errors between the entries A and B is small.

Coding wise you don’t need to program from scratch. Matrix::norm (R) and np.linalg.norm (Python) will do the trick.

Illustration

The following code pulls some price data for 4 ETFs which will be our $A$ matrix, perform PCA and binds the first few principal components (AKA scores) which will be our $B$ matrix.


> library(magrittr)
> library(quantmod)
> sym = c('IEF','TLT','SPY','QQQ')
> l <- length(sym)
> end <- format(as.Date("2020-12-01"), "%Y-%m-%d")
> start <- format(as.Date("2007-01-01"), "%Y-%m-%d")
> dat0 <-  (getSymbols(sym[4], src="yahoo", from=start, to=end, auto.assign = F))
> n <-  NROW(dat0)
> dat <-  array(dim = c(n, NCOL(dat0),l)) ; prlev = matrix(nrow = n, ncol = l)
> for (i in 1:l){
+  dat0 = (getSymbols(sym[i], src="yahoo", from=start, to=end, auto.assign = F))
+  dat[1:length(dat0[,i]),,i] = dat0 
+ # Average Price during the day
+   prlev[1:NROW(dat[,,i]),i] = as.numeric(dat[,4,i]+dat[,1,i]+dat[,2,i]+dat[,3,i])/4
+ }
> time <- index(dat0)
> x <- na.omit(prlev) 
> ret <- 100*(x[2:NROW(x),] - x[1:(NROW(x)-1),]) # Move to returns
> head(ret,3)
      [,1]  [,2]  [,3] [,4]
[1,]  21.3  32.3 -37.3 36.0
[2,] -22.5 -38.2 -47.8 12.7
[3,]   7.5  18.8   0.5  7.5
pc0 <- prcomp(ret, center= F, scale.=F)
mat_approx4 <- pc0$x[,1:4] %*% t(pc0$rot[,1:4]) # using all 4 components
mat_approx3 <- pc0$x[,1:3] %*% t(pc0$rot[,1:3]) # using only 3
mat_approx2 <- pc0$x[,1:2] %*% t(pc0$rot[,1:2]) # you get the point..
mat_approx1 <- as.matrix(pc0$x[,1]) %*% t(pc0$rot[,1])

> library(magrittr)

> library(quantmod)

> sym = c('IEF','TLT','SPY','QQQ')

> l <- length(sym)

> end <- format(as.Date("2020-12-01"), "%Y-%m-%d")

> start <- format(as.Date("2007-01-01"), "%Y-%m-%d")

> dat0 <- (getSymbols(sym[4], src="yahoo", from=start, to=end, auto.assign = F))

> n <- NROW(dat0)

> dat <- array(dim = c(n, NCOL(dat0),l)) ; prlev = matrix(nrow = n, ncol = l)

> for (i in 1:l){

+ dat0 = (getSymbols(sym[i], src="yahoo", from=start, to=end, auto.assign = F))

+ dat[1:length(dat0[,i]),,i] = dat0

+ # Average Price during the day

+ prlev[1:NROW(dat[,,i]),i] = as.numeric(dat[,4,i]+dat[,1,i]+dat[,2,i]+dat[,3,i])/4

+ }

> time <- index(dat0)

> x <- na.omit(prlev)

> ret <- 100*(x[2:NROW(x),] - x[1:(NROW(x)-1),]) # Move to returns

> head(ret,3)

[,1] [,2] [,3] [,4]

[1,] 21.3 32.3 -37.3 36.0

[2,] -22.5 -38.2 -47.8 12.7

[3,] 7.5 18.8 0.5 7.5

pc0 <- prcomp(ret, center= F, scale.=F)

mat_approx4 <- pc0$x[,1:4] %*% t(pc0$rot[,1:4]) # using all 4 components

mat_approx3 <- pc0$x[,1:3] %*% t(pc0$rot[,1:3]) # using only 3

mat_approx2 <- pc0$x[,1:2] %*% t(pc0$rot[,1:2]) # you get the point..

mat_approx1 <- as.matrix(pc0$x[,1]) %*% t(pc0$rot[,1])

You can see that the approximation becomes better as k increases from 1 to 4. ret - mat_approx is our matrix $E$ in the math above. Using all 4 principal components we retrieve the original matrix $A$ :


> Matrix::norm( (ret - mat_approx1), ty= "F")
[1] 6390
> Matrix::norm( (ret - mat_approx2), ty= "F")
[1] 3220
> Matrix::norm( (ret - mat_approx3), ty= "F")
[1] 916
> Matrix::norm( (ret - mat_approx4), ty= "F")
[1] 0

> Matrix::norm( (ret - mat_approx1), ty= "F")

[1] 6390

> Matrix::norm( (ret - mat_approx2), ty= "F")

[1] 3220

> Matrix::norm( (ret - mat_approx3), ty= "F")

[1] 916

> Matrix::norm( (ret - mat_approx4), ty= "F")

[1] 0

The less principal components you use, the worse your approximation becomes (the norm of the E matrix becomes larger).

Coming back full circle to the start of the post where I mentioned that using PCA is a way to get the best approximation. Eckart and Young theorem tells us that the approximation we just made, is the best (mark: Frobenius norm speaking, for other norms it may not be the best), that we could have done. If you want to know more about this topic, it falls under the apt name of sketching. A good sketch matrix B is such that computations can be performed on B rather than on A without much loss in precision; sketching stands here for “not the actual picture but very economical and very clear”.

So on top of the statistical interpretation from previous PCA posts, you now also have this algebraic interpretation of PCA.