Matrix multiplication can usefully be viewed as a way to expand dimension. We begin with a brief discussion of PCA. Since PCA is predominantly used for reducing dimensions, and since you are familiar with PCA already, it serves as a good springboard, by way of a contrasting example, for dimension expansion. Afterwards we show some basic algebra via code, and conclude with a citation that provides the intuition for why dimension expansion is so essential.

PCA is used for dimension reduction. The factors in PCA are simply linear combinations of the original variables. It is explained here. For a data matrix with $p$ columns you will have $p$ such linear combinations. Typically we use fewer than $p$, because the whole point is to reduce the dimension. The resulting factors, linear transformations, are often called rotations. They are called rotations because the original data is being “rotated” such that the factors/vectors point in the direction of maximum variance. The transforming/rotating matrix is the matrix of the eigenvectors.

So, when you have a matrix (in PCA that’s the column-bound eigenvectors) and you multiply it with another matrix (in PCA that’s the original data), you effectively transform the data, and linearly so. In PCA we do that so as to reduce the dimension, using fewer than $p$ transformations. Also, we use the eigenvector matrix as the transforming matrix (the reason is not important for us now).
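As a minimal sketch of that transformation (toy data of my own making; `prcomp` is base R):

```r
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), ncol = 3), center = TRUE, scale = FALSE)

# The "rotation" matrix: eigenvectors of the covariance matrix
V <- eigen(cov(X))$vectors

# Keep only 2 of the 3 eigenvectors -- dimension reduction
factors <- X %*% V[, 1:2]     # a 100 x 3 matrix times a 3 x 2 matrix
dim(factors)                  # 100 x 2: fewer columns than we started with

# prcomp performs the same rotation internally
all.equal(abs(unname(factors)), abs(unname(prcomp(X)$x[, 1:2])))
# TRUE, up to the sign of each eigenvector
```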

Moving to dimension expansion, let’s relax those two: (1) We will not use the eigenvector matrix, and (2) we can use a larger-dimension transforming matrix. Therefore the result will be of higher dimension than the original input.

I write ‘we can’ but in truth we must. The ability to transform the original data, to expand it and project it into higher dimensions, is a BIG part of the success of deep, large, and deep-and-large ML algorithms; especially those flush and heavy LLMs which are popping up left and right.

Look at the following super-simple piece of code:

```
> A <- 10*matrix(runif(12), nrow = 4, ncol = 3) %>% round(digits = 1)
> A
     [,1] [,2] [,3]
[1,]    9    4    7
[2,]    6    0    9
[3,]    9    1    7
[4,]    4    5    3
> B <- 10*matrix(runif(6), nrow = 3, ncol = 2) %>% round(digits = 1)
> B
     [,1] [,2]
[1,]    1    9
[2,]    8    7
[3,]   10    9
>
> transformed_B <- A %*% B
> A %*% B[,1]
     [,1]
[1,]  111
[2,]   96
[3,]   87
[4,]   74
> A %*% B[,2]
     [,1]
[1,]  172
[2,]  135
[3,]  151
[4,]   98
> transformed_B
     [,1] [,2]
[1,]  111  172
[2,]   96  135
[3,]   87  151
[4,]   74   98
```

What you see is that A transforms each vector from B separately, and the results are simply column-bound together. Therefore we continue our talk using only one vector.

Now the following piece of code reminds us what matrix multiplication actually means.

```
> transformed_vectors1 <- A %*% B[,1]
> sum(B[,1] * A[1,])
[1] 111
> sum(B[,1] * A[2,])
[1] 96
> sum(B[,1] * A[3,])
[1] 87
> sum(B[,1] * A[4,])
[1] 74
> transformed_vectors1
     [,1]
[1,]  111
[2,]   96
[3,]   87
[4,]   74
```

So each entry of the result is the sum of an element-wise product: a row of A times the vector, i.e. a dot product. The entries of the transformed vector are linear combinations of that vector, with the weights given by the rows of A.

**Note we have created more linear transformations from our original vector than its size. Geometrically speaking, we projected our vector onto a higher-dimensional space (from size 3 to size 4).**

The combinations themselves are given here, unlike in PCA, by arbitrarily chosen numbers (the entries of A). You would do well to relate this to the massive number of parameters (billions…) which drive advanced AI algorithms. Think of those entries of A as weights/parameters which are randomly initialized, and later optimized to settle on values which create useful transformations for prediction purposes.
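To make the weights analogy concrete, here is a toy sketch (sizes and names are my own choosing): a randomly filled matrix W plays the role of a neural network layer at initialization, expanding a 3-dimensional input into 16 dimensions.

```r
set.seed(42)
x <- c(9, 4, 7)                        # a single 3-dimensional input
W <- matrix(rnorm(16 * 3), nrow = 16)  # random "weights", 16 x 3

h <- W %*% x      # 16 linear combinations of x: dimension expanded from 3 to 16
length(h)         # 16

# During training the entries of W would be optimized; at initialization they
# are exactly what we used above: arbitrary numbers defining arbitrary
# linear transformations of the input.
```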

In fact, it turns out that it’s ~~useful~~ paramount to over-over-over-expand the dimension. Such a “let’s over-parametrize this problem” approach creates an enormous number of such transformations. While most transformations are totally redundant, some are useful almost from their get-go randomized initialization (and become more useful after optimization).

A paper titled What’s Hidden in a Randomly Weighted Neural Network? forcefully exemplifies that point:

> …within a sufficiently overparameterized neural network with random weights (e.g. at initialization), there exists a subnetwork that achieves competitive accuracy. Specifically, the test accuracy of the subnetwork is able to match the accuracy of a trained network with the same number of parameters.

As an illustration, take a look at the first figure in that paper:

If you introduce enough expansions, as in the middle panel, there will be a sub-network, as in the right panel, which achieves good performance similar to that of a well-trained network, as in the left panel, without ever modifying the initially randomized weight values. This figure reminds me of the saying “If you throw enough shit against a wall, some of it has gotta stick” [pardon my French].

This blog is just a personal hobby. When I’m extra busy, as I was this year, the blog is a front-line casualty. This is why 2023 saw a weaker posting stream. Nonetheless I am pleased with just over 30K visits this year, with an average of roughly one minute per visit (engagement time, whatever Google Analytics means by that). This year I only list the top two posts (rather than the usual three). Both posts have to do with statistical shrinkage:

The one is Statistical Shrinkage (2) and the other is Statistical Shrinkage (3).

On the left (scroll down) you can find the most popular posts from previous years.

To my readership: **thank you** for reading, sharing, and for your emails, corrections, comments, and good questions. A happy, **healthy**, and productive 2024!

pic credit: Morvanic Lee


We, mere minions who are unable to splurge thousands of dollars on high-end G/TPUs, are left unable to work with large matrices due to the massive computational requirements; because who wants to wait a few weeks to discover their bug?

This post offers a solution by way of approximation, using randomization. I start with the idea, follow with a short proof, and conclude with some code and a few run-time results.

Randomization has long been a cornerstone of modern statistics. In the unfortunately now-retired blog of the great mind Larry Wasserman, you can find a long list of statistical procedures where added randomness plays some role (many more are mentioned in the comments of that post). The idea here: we randomly choose a subset of rows/columns, and multiply those smaller matrices to get our approximation.

To fix notation, denote by $A_{:,i}$ the $i$th column of $A$ and by $B_{i,:}$ the $i$th row of $B$. Then the product can be written as

$$AB = \sum_{i=1}^{m} A_{:,i}\, B_{i,:},$$

where $i$ is the column index for $A$ and the row index for $B$ (and $m$ is the number of columns of $A$). We later code this so it’s clearer.

Say $A$ is $n \times m$ and $B$ is $m \times p$; now imagine both $A$ and $B$ have completely random entries. Choosing, say, only one row/column $i$, calculating the product $A_{:,i} B_{i,:}$, and multiplying that product by $m$ (as if we did it $m$ times) would provide us with an unbiased estimate for the sum of all $m$ products; even though we only chose one at random. And yes, there will be variance, but it will be an unbiased estimate. Here is a short proof of that.

With $I$ the randomly chosen index, chosen uniformly so that $\Pr(I = i) = 1/m$:

$$E\big[\, m\, A_{:,I} B_{I,:} \big] = \sum_{i=1}^{m} \frac{1}{m}\, m\, A_{:,i} B_{i,:} = \sum_{i=1}^{m} A_{:,i} B_{i,:} = AB.$$

In the second transition, when we lose the expectation operator, remember from probability that $E[X] = \sum_x x \Pr(X = x)$.

The proof shows that we need the scaling constant $m$ for the expectation to hold. It also shows that the product of the two smaller matrices (with the number of columns/rows smaller than $m$) is the same, in expectation, as the product of the original matrices, $AB$. Similar to how bootstrapping offers an unbiased estimator, this method also provides an unbiased estimate. It is a bit more involved due to its two-dimensional nature, but just as you can sum or average scalars or vectors, matrices can be summed or averaged in the same manner. The code provided below will clarify this.
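A tiny sanity check of the unbiasedness claim (toy sizes of my own choosing): sample a single index, scale the one-term product by the number of terms, and the average over many repetitions converges to AB.

```r
set.seed(7)
n <- 4; m <- 3; p <- 2
A <- matrix(rnorm(n * m), n, m)
B <- matrix(rnorm(m * p), m, p)
AB <- A %*% B                  # the exact product

reps <- 20000
acc <- matrix(0, n, p)
for (r in 1:reps) {
  i <- sample(1:m, 1)          # one column of A / row of B, chosen uniformly
  acc <- acc + m * (A[, i, drop = FALSE] %*% B[i, , drop = FALSE])
}
approx <- acc / reps           # average of the scaled one-term products

max(abs(approx - AB))          # small, and shrinking as reps grows
```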

What do we gain by doing this? We balance accuracy with computational cost. By selecting only a subset of columns/rows we inevitably sacrifice some precision, but we significantly reduce computing time; a practical compromise.

The variance of the estimate relative to the true value of AB (obtained through precise computation) can be high. But similar to the methodology used in bootstrapping, we can lessen the variance by repeatedly performing the subsampling process and averaging the results. Now you may wonder: we started doing this to save computational costs, but repeatedly subsampling and computing products of “smaller matrices” might end up being **even more** computationally demanding than directly computing AB, defeating the purpose of reducing computational costs. Enter parallel computing.

Just as independent bootstrap samples can be computed in parallel, the same principle applies in this context. Time to code it.

I start with a matrix $A$ of dimension $200 \times 50$ (`TT` rows by `p` columns in the code). I then compute the actual product $A^\top A$ because I want to see what the difference is and how good the approximation is. We base the approximation on a quarter of the number of rows, `TT/4`. We vectorize the diagonal and upper-diagonal entries (because $A^\top A$ is symmetric) and examine the difference between the actual result and the approximated result.

```r
TT <- 200
p <- 50
A <- do.call(cbind, lapply(rep(TT, p), rnorm))

mult_actual <- t(A) %*% A # Target
vec_actual <- mult_actual[upper.tri(mult_actual, diag = TRUE)]
length(vec_actual) # p*(p-1)/2 + p = 1275, off-diagonal plus the diagonal

m <- TT/4  # only a quarter of the number of rows
rr <- 100  # number of subsamples to average over

mult_approx_array <- array(dim = c(rr, dim(mult_actual)))
for (i in 1:rr) {
  tmpind <- sample(1:TT, m, replace = TRUE)
  mult_approx_array[i, , ] <- t(A)[, tmpind] %*% A[tmpind, ]
}

# Now average the matrices across rr "bootstrap samples".
# The scaling constant TT/m is there because we sample uniformly,
# so each row is chosen with probability 1/TT.
mult_approx <- (TT/m) * apply(mult_approx_array, c(2, 3), mean)

# The eventual approximation (individual entries):
vec_approx <- mult_approx[upper.tri(mult_approx, diag = TRUE)]

# Difference
vec_diff <- vec_actual - vec_approx
plot(vec_diff)
```

We have 1275 unique entries (the diagonal and off-diagonal of $A^\top A$). Each such entry has an approximated value and a true value; we plot the difference. The top panel shows the difference when we average across 100 subsamples and the bottom is based on 5000 subsamples, so of course it is more accurate.

Below is more production-ready code, for when you actually need to work with big matrices. It parallelizes the computation using the `parallel` package in R.

```
TT <- 10000
p <- 2500
A <- do.call(cbind, lapply(rep(TT, p), rnorm))
dim(A)
[1] 10000  2500
m <- TT/4

# How long to get the exact solution?
system.time( t(A) %*% A )
   user  system elapsed
     30       0      30   # 30 seconds

# Let's apply the following function, using rr = 5 samples for now
tmpfun <- function(index){
  tmpind <- sample(1:TT, m, replace = TRUE)
  (TT/m) * (t(A)[, tmpind] %*% A[tmpind, ])
}
rr <- 5
system.time( mult_approx_array <- lapply(as.list(1:rr), tmpfun) )
   user  system elapsed
  37.45    0.28   37.75   # so doing it 5 times takes 37 seconds

# Now in parallel
library(parallel)
numCores <- detectCores() / 2 # using half of the available cores
cl <- makeCluster(numCores)
clusterExport(cl, varlist = c("A", "TT", "m"))
system.time( mult_approx_array_par <- parLapply(cl, 1:rr, tmpfun) )
# The equalizer clicks his Suunto watch:
   user  system elapsed
   0.20    0.12   13.27   # 13 seconds
stopCluster(cl)
```

A couple of comments are in order. First, if you look carefully at the code, the function `tmpfun` takes an unnecessary, fictitious `index` argument which is never used. It has to do with the cluster’s internal mechanism, which needs an element to pass on as an argument. Second, the computational cost grows with the number of rows `TT` (each entry of the product is a sum over `TT` terms), so you can expect larger computational-time gains for longer matrices. Third and finally, we sampled rows uniformly (hence the scaling constant `TT/m`). This is primitive; there are better sampling schemes, not covered here, which will further reduce the variance of the result.

In finance we use the covariance matrix as an input for portfolio construction. Analogous to the fact that a variance must be positive, a covariance matrix must be positive definite to be meaningful. The focus of this post is on understanding the underlying issues with an unstable covariance matrix, identifying a practical solution for such instability, and connecting that solution to the all-important concept of statistical shrinkage. I present a strong link between the following three concepts: regularization of the covariance matrix, ridge regression, and measurement error bias, with some easy-to-follow math.

A covariance matrix $\Sigma$ is positive semi-definite if and only if $x^\top \Sigma x \geq 0$ for all possible vectors $x$ (check this post for the why), and it also means that the determinant satisfies $\det(\Sigma) \geq 0$. And if you remember that $\det(\Sigma) = \prod_i \lambda_i$, the product of the eigenvalues, it directly follows that all the eigenvalues $\lambda_i$ must be non-negative. If any of them is negative, or if any of them is too close to zero, the inversion operation becomes problematic, akin to how dividing any number by almost-zero results in a disproportionately large value.

So what do we do if not all eigenvalues are positive? We **make them** positive! One way to do that is by way of a process known as diagonal loading: we add a positive value $c$ (usually small) to the diagonal elements of the matrix, working with $\Sigma + cI$ instead of $\Sigma$.

Why does adding a constant to the diagonal of $\Sigma$ result in the same constant being added to the vector of eigenvalues? Good question. Here is a short proof:

Let $\Sigma$ be a square matrix, and $c$ a positive scalar. Consider $\Sigma + cI$, where $I$ is the identity matrix. Let $v$ be an eigenvector of $\Sigma$ with corresponding eigenvalue $\lambda$. Before we start, a quick reminder that by definition: if $\Sigma v = \lambda v$ then $v$ is an eigenvector of $\Sigma$ and $\lambda$ is its eigenvalue. Now, consider the action of $\Sigma + cI$ on $v$:

$$(\Sigma + cI)v = \Sigma v + cIv = \lambda v + cv = (\lambda + c)v.$$

So what is shown is that $\Sigma + cI$ has the same eigenvectors as $\Sigma$, and its eigenvalues are increased by exactly $c$, which concludes the proof.
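You can also verify the shift numerically (a small simulated covariance matrix; the numbers are mine):

```r
set.seed(3)
S <- cov(matrix(rnorm(50 * 4), ncol = 4))   # a 4 x 4 covariance matrix
cc <- 0.5                                   # the constant loaded on the diagonal

before <- eigen(S)$values
after  <- eigen(S + cc * diag(4))$values

all.equal(after, before + cc)   # TRUE: every eigenvalue shifted by exactly cc
```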

Negative or small eigenvalues are shifted upwards, which helps inversion. You may wonder (as I’m sure you do): must we push all the eigenvalues? I mean, it’s only the small or negative ones which are the culprits of instability.

The answer is no, we don’t have to increase all the eigenvalues. In fact, if you wish to remain as close as possible to the original matrix, you can choose to increase only the problematic eigenvalues. This is exactly what nearest-positive-definite types of algorithms do: apply the minimum eigenvalue shift needed to reach positive definiteness.
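A bare-bones sketch of that type of algorithm (a simplification of my own; real implementations such as `Matrix::nearPD` are more careful): raise only the eigenvalues that fall below a small threshold, then rebuild the matrix.

```r
make_pd <- function(S, eps = 1e-8) {
  e   <- eigen(S, symmetric = TRUE)
  lam <- pmax(e$values, eps)            # shift only the problematic eigenvalues
  e$vectors %*% diag(lam) %*% t(e$vectors)
}

# An indefinite "covariance-like" matrix:
S_bad <- matrix(c( 1.0, 0.9,  0.9,
                   0.9, 1.0,  0.9,
                   0.9, 0.9, -0.1), nrow = 3)

min(eigen(S_bad)$values)            # negative: not positive definite
min(eigen(make_pd(S_bad))$values)   # positive: now safely invertible
```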

Ridge regression minimizes the residual sum of squares, but with a penalty on the size of the coefficients:

$$\hat{\beta}_{RR} = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + c \lVert \beta \rVert^2.$$

These are the resulting RR (for ridge regression) coefficients:

$$\hat{\beta}_{RR} = (X^\top X + cI)^{-1} X^\top y.$$

The term $X^\top X + cI$ is exactly diagonal loading of the covariance matrix.

Why? Because when $X$ is a scaled version of the original data matrix (as it is with ridge regression), then $X^\top X$ is the covariance matrix of the scaled data (up to division by the number of observations).
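A quick numerical confirmation (simulated data, toy sizes of my own choosing): the ridge coefficients are an ordinary least-squares solve with a constant loaded on the diagonal of the cross-product matrix, and the resulting coefficient vector is shrunk relative to OLS.

```r
set.seed(11)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))   # scaled data matrix
y <- X %*% c(2, -1, 0.5) + rnorm(n)
cc <- 5                                  # the diagonal-loading constant

beta_ols   <- solve(t(X) %*% X)                %*% t(X) %*% y
beta_ridge <- solve(t(X) %*% X + cc * diag(p)) %*% t(X) %*% y

cbind(beta_ols, beta_ridge)           # ridge entries are pulled toward zero
sum(beta_ridge^2) < sum(beta_ols^2)   # TRUE: the coefficient vector shrank
```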

Ok, so ridge regression has something to do with diagonal loading (and so with increasing the eigenvalues of the covariance matrix); why does this mean that we shrink the vector of regression coefficients? Another good question. Let’s doctor some econometrics to help us understand this.

In the measurement error bias post I have already shown that adding noise to the explanatory variable shrinks the coefficient, so I don’t repeat it here. The main equation from that post is the attenuation result:

$$\text{plim}\; \hat{\beta} = \beta \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2},$$

where $\sigma_x^2$ is the variance of the true explanatory variable and $\sigma_u^2$ the variance of the added noise.

For our purpose here you can simply think of it as increasing the variance of the explanatory variable by a constant (say $c$). That exactly means that you increase the diagonal entry (again, of the covariance matrix) which corresponds to that explanatory variable.
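To see this shrinkage in action (simulated; the numbers are mine): adding noise with variance 1 to a regressor whose variance is also 1 should roughly halve the estimated coefficient, per the attenuation factor $\sigma_x^2/(\sigma_x^2 + \sigma_u^2) = 1/2$.

```r
set.seed(2024)
n <- 1e5
x <- rnorm(n)                # true regressor, variance 1
y <- 2 * x + rnorm(n)
x_noisy <- x + rnorm(n)      # measurement noise, variance 1

coef(lm(y ~ x))[2]           # close to the true value, 2
coef(lm(y ~ x_noisy))[2]     # close to 2 * 1/(1 + 1) = 1: shrunk by half
```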

What is interesting to see, I think, is that while in the econometric literature measurement error is considered a serious problem, in the statistical learning literature we introduce measurement error intentionally, to our advantage. Of course, I steer clear of the never-ending inference-versus-prediction debate.

Armed with this understanding, you can use it to estimate the entries of the covariance matrix individually. The reason this approach is not widely used is primarily the numerical instabilities which are very likely to follow. But if you can handle the numerical instability effectively, there are significant benefits to estimating individual elements of the covariance matrix, as opposed to estimating the entire covariance structure in one go.

For those of you old enough to remember, paraphrasing John “Hannibal” Smith from The A-Team: “I love it when optimization, econometrics and statistical learning come together.”
