Correlation and Correlation Structure (11) – Estimation using Random Matrix Theory

In the classical regime, when we have plenty of observations relative to what we need to estimate, we can rely on the sample covariance matrix as a faithful representation of the underlying covariance structure. However, in the high-dimensional settings common to modern data science – where the number of attributes/features p is comparable to the number of observations n, the sample covariance matrix is a bad estimator. It is not merely noisy; it is misleading. The eigenvalues of such matrices undergo a predictable yet deceptive dispersion, inflating into “signals” that have no basis in reality. But, using some machinery from Random Matrix Theory is one way to correct for biases that high-dimensional data creates.

Suppose we observe n independent vectors \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n from a p-dimensional distribution with covariance matrix \boldsymbol{\Sigma}. The obvious estimate is the sample covariance matrix:

    \[\mathbf{S} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top.\]

The Problem That Won’t Go Away.

When p is fixed and n \to \infty, we’re in the classical regime where \mathbf{S} \to \boldsymbol{\Sigma} and all is well. But modern datasets, say in genomics or finance.., routinely present us with p \approx n or even p > n. Here, \mathbf{S} becomes a terrible estimate of \boldsymbol{\Sigma}.

How terrible? Consider the eigenvalues. And remind yourself that

Each eigenvalue = variance along its eigenvector direction

Even when the true covariance is the identity matrix \boldsymbol{\Sigma} = \mathbf{I} (pure noise, all eigenvalues are 1 in actuality), the eigenvalues of \mathbf{S} don’t cluster near 1. Instead, they spread themselves along an interval roughly [(1-\sqrt{p/n})^2, (1+\sqrt{p/n})^2]. The largest eigenvalue can be inflated by a factor of 4 when p = n. This isn’t sampling variability we can ignore – it’s systematic, serious, distortion.

Enter Random Matrix Theory

Around 1967, mathematicians Marčenko and Pastur proved a remarkable theorem: as n, p \to \infty with p/n \to c \in (0,1], the empirical distribution of eigenvalues of \mathbf{S} converges to a specific limiting density, now called the Marčenko-Pastur law:

    \[f_{MP}(\lambda) = \frac{1}{2\pi c \lambda} \sqrt{(\lambda - \lambda_{-})(\lambda_{+} - \lambda)}, \quad \lambda \in [\lambda_{-}, \lambda_{+}]\]

where \lambda_{\pm} = (1 \pm \sqrt{c})^2 – aka the bounds of the density.

This is beautiful mathematics, but what can we do with it? The key insight came from physicists (Laloux et al., 1999) studying financial correlation matrices. They noticed that most eigenvalues of real covariance matrices pile up in a “bulk” matching the Marčenko-Pastur prediction, while a few large eigenvalues stick out above. The bulk represents noise; the outliers represent signal.

A Simple Cleaning Procedure

This suggests an obvious strategy:

  • Identify the bulk: Estimate the noise level \sigma^2 and compute the Marčenko-Pastur bounds \lambda_{\pm} = \sigma^2(1 \pm \sqrt{c})^2
  • Classify eigenvalues:
    • Signal: l_i > \lambda_+ (or l_i < \lambda_- if c < 1)
    • Noise: l_i \in [\lambda_-, \lambda_+]
  • Shrink the noise: Replace noise eigenvalues with their average \bar{l} = \frac{1}{|\text{noise}|}\sum_{i \in \text{noise}} l_i
  • Reconstruct: \tilde{\mathbf{S}} = \mathbf{U}\tilde{\mathbf{L}}\mathbf{U}^\top where \tilde{\mathbf{L}} contains the cleaned eigenvalues
  • This is simple. We’re not done yet – the signal eigenvalues themselves are inflated by noise.

    The Baik-Silverstein Correction

    In 2006, mathematicians Baik and Silverstein analyzed the “spiked covariance model” where \boldsymbol{\Sigma} has a few large eigenvalues (the “spikes”) and the rest equal to 1. They derived the precise relationship between sample and population eigenvalues. Specifically, for a signal eigenvalue \lambda > 1 + \sqrt{c}, the corresponding sample eigenvalue l satisfies (asymptotically):

    l \approx \lambda + \frac{c \lambda}{\lambda - 1}
    This can be inverted: given l, we estimate

        \[\hat{\lambda} = \frac{1}{2}\left((l - c + 1) + \sqrt{(l - c + 1)^2 - 4l}\right)\]

    And so our \hat{\lambda} can now serve as a proper estimate for the true eigenvalue in the reconstruction \tilde{\mathbf{S}} = \mathbf{U}\tilde{\mathbf{L}}\mathbf{U}^\top.

    Here is how it looks:
    RMT example
    What is plotted is the empirical distribution of sample eigenvalues (blue histogram) from a 100×200 data matrix compared to the MP density (in black). The population covariance contains three (real) signal eigenvalues, indicated by the green dash-dot lines. c = 0.5 in this example. Red dashed lines mark the MP bounds. Magenta dot lines show the expected (bias-corrected) locations of sample signal eigenvalues as estimated by the inversion formula outlined above.

    A valuable addition to your covariance estimation arsenal.

    Code

    set.seed(123)  # For reproducibility
    
    n <- 200  # Number of observations (rows)
    p <- 100  # Number of variables (columns)
    
    # Define eigenvalues
    signal_eigenvalues <- c(0.1, 3, 4)
    noise_eigenvalue <- 0.5
    n_signal <- length(signal_eigenvalues)
    
    # Create the full eigenvalue vector
    eigenvalues <- c(signal_eigenvalues, rep(noise_eigenvalue, p - n_signal))
    
    # Generate random orthogonal matrix for eigenvectors
    # Use QR decomposition of a random matrix
    random_matrix <- matrix(rnorm(p * p), p, p)
    Q <- qr.Q(qr(random_matrix))  # Orthogonal matrix (eigenvectors)
    
    # Computes the QR decomposition of a random matrix
    # Any matrix can be factored as M = QR where:
    # Q is orthogonal (columns are orthonormal)
    # R is upper triangular
    # qr.Q Extracts just the Q matrix
    # Bottom line: It's a standard trick in simulation to generate random covariance matrices with controlled eigenvalues—you specify the eigenvalues (diagonal matrix) and use a random orthogonal matrix for the directions (eigenvectors)
    
    # Construct the population covariance matrix
    Sigma <- Q %*% diag(eigenvalues) %*% t(Q)
    
    Sigma[1:10,1:10]
    # Generate data from multivariate normal with this covariance
    # Mean vector is zero
    mu <- rep(0, p)
    
    # Generate the data matrix using the Cholesky decomposition
    # X = Z * L where L is Cholesky factor of Sigma
    L <- chol(Sigma)  # Upper triangular Cholesky factor
    Z <- matrix(rnorm(n * p), n, p)  # Standard normal random variables
    X <- Z %*% L  # n x p data matrix
    
    dim(X)
    
    # Verify the structure
    cat("Data matrix dimensions:", dim(X), "\n")
    cat("Population eigenvalues (first 10):", eigenvalues[1:10], "\n\n")
    
    # Compute sample covariance matrix
    S <- cov(X)
    
    # Check sample eigenvalues
    sample_eigenvalues <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
    cat("Sample eigenvalues (first 10):", round(sample_eigenvalues[1:10], 3), "\n")
    cat("Sample eigenvalues (last 3):", round(tail(sample_eigenvalues, 3), 3), "\n\n")
    barplot(sample_eigenvalues)
    
    # Compute aspect ratio and Marchenko-Pastur bounds
    c_ratio <- p / n
    lambda_plus <- noise_eigenvalue * (1 + sqrt(c_ratio))^2
    lambda_minus <- noise_eigenvalue * (1 - sqrt(c_ratio))^2
    
    cat("Aspect ratio c = p/n:", round(c_ratio, 3), "\n")
    cat("Marchenko-Pastur bounds for noise (sigma^2 = 0.5):", 
        round(lambda_minus, 3), "to", round(lambda_plus, 3), "\n\n")
    
    # Expected locations of signal eigenvalues
    phi <- function(lambda, c, sigma_sq = 0.5) {
      lambda_scaled <- lambda / sigma_sq
      result <- sigma_sq * (lambda_scaled + c * lambda_scaled / (lambda_scaled - 1))
      return(result)
    }
    
    # this gives us the locations of the sample eigenvalues that we should expect for our signal, based on the theory.
    cat("Expected sample eigenvalue locations:\n")
    for(i in 1:n_signal) {
      expected <- phi(signal_eigenvalues[i], c_ratio)
      cat(sprintf("  Population λ = %.1f → Expected sample eigenvalue ≈ %.3f\n", 
                  signal_eigenvalues[i], expected))
    }
    
    mp_density <- function(x, c, sigma_sq) {
      lambda_plus <- sigma_sq * (1 + sqrt(c))^2
      lambda_minus <- sigma_sq * (1 - sqrt(c))^2
      density <- ifelse(x >= lambda_minus & x <= lambda_plus,
                        sqrt((lambda_plus - x) * (x - lambda_minus)) / (2 * pi * sigma_sq * c * x), 0)
      return(density)
    }
    

    References

    • Laloux, L., Cizeau, P., Bouchaud, J.-P., & Potters, M. (1999). Noise dressing of financial correlation matrices. Physical Review Letters, 83, 1467.
    • Baik, J., & Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97, 1382-1408.
    • Marčenko, V. A., & Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4), 457-483.

    Leave a Reply

    Your email address will not be published. Required fields are marked *