Here is a negative, yet typical example:
In artificial intelligence (AI)-based predictive models, bias – defined as unfair systematic error – is a growing source of concern^{1}.
This post tries to direct those important discussions to the right avenues, providing some clarifications, examples of common pitfalls, and some qualified advice from experts in the field on how to approach this topic. If nothing else, I hope you find this piece thought-provoking.
In modern statistics – AI being a subclass here – a biased prediction (for example) means that the prediction deviates from the true unknown value you wish to predict. As an easy example, say you want to estimate the average weight in the population but you are using only women in your sample. Women don’t represent the whole population (only about 50%), so if you intend to estimate the average weight for both men and women, your estimate will be biased downwards, since women weigh less than men on average. If you aim to estimate the average weight only for women, then your estimate is unbiased. Is bias always a bad thing? No. In modern statistics we often, and purposely, introduce bias in many types of machine learning algorithms as a way to stabilize volatile estimates^{2}.
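Before moving on, here is a minimal simulation sketch of this example; the weight distributions are made-up numbers chosen for illustration only:

```r
# Sampling-bias sketch: estimate the population's average weight
# from a women-only sample. Numbers are hypothetical.
set.seed(1)
n <- 1e5
women <- rnorm(n, mean = 70, sd = 10)   # hypothetical women's weights (kg)
men   <- rnorm(n, mean = 85, sd = 12)   # hypothetical men's weights (kg)
population <- c(women, men)             # a 50-50 mix

mean(women)       # biased downwards as an estimate of the population mean
mean(population)  # the quantity we actually wanted
```

The women-only average is a perfectly fine estimate of the women-only mean; it is biased only relative to the whole-population target.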
Now continuing with our easy example, say you used your data properly, i.e. your sample is indeed a 50-50 mix between men and women. Nevertheless, for whatever reason you don’t like the result; you disagree with the estimated weight, thinking it should be higher or lower. Does this make your model biased? Of course not! So what if you think the results should be different? You may even refuse to use the model unless it’s tailored to provide you with results that YOU are comfortable with. But it has nothing, nothing to do with model bias, and nothing to do with the way the model is estimated. To begin with, we choose to use AI models precisely because they are able to capture the patterns present in the data. And now, if we dislike the results, it means we take issue with the data; and of course we do. To quote Gordon Gekko: “human beings… you gotta give ’em a break. We’re all mixed bags.” So we are talking about model results that we are happy or unhappy to use. The data (cliché alert) is what it is.
Yes, we would like people to be blind to gender and race; that is work in progress. In the meantime we want to build models so that they are useful, particularly in that they are aligned with our values and preferences, as creators and users.
So we agree, I hope, that bias has nothing to do with anything in this context. The discussion should be cast in terms of usability and alignment. Simple? Sure. Easy? Far from it.
Who is to determine what is fair or not fair? Who is to determine the preferences? Consider this loaded topic: policymakers think the tax rate should be x%. Is that fair? Ask different people and you will get different answers; make of it what you will.
Now that we have our concepts lined up clearly, to create useful AI models that we are comfortable using we need, like with all things, to start at the start:
Stop all discussions around biases in AI systems (no such thing). The focus should be on how to create more data which is aligned with the users’ human values, and how to eliminate the data which is not. The AI model has nothing to do with it.
^{1} From Rising to the challenge of bias in health care AI
^{2} For example, ridge regression is often used instead of the usual OLS, because it accepts some bias in the estimates in exchange for lower variance.
^{3} Training language models to follow instructions with human feedback
Wide data matrices, where the number of columns $p$ exceeds the number of rows $n$, are widespread nowadays (pun intended). In this situation a good, stable and reliable estimate for the covariance matrix is hard to obtain. The usual sample covariance estimate when $p > n$ is simply lousy (for reasons out of scope here). Much is written about how best to adjust and correct that sample covariance estimate (see below for references). We start with a motivation for shifting the focus from the covariance matrix to the precision matrix.
One glaring issue is that the sample estimate is always dense. I mean, if you simulate data from a population with zero covariance between variables (represented by a diagonal covariance matrix), the sample estimates of the off-diagonal elements will never be exactly zero. This is because the estimated covariance reflects the input data, which will naturally have some variability even if the true underlying covariance is zero. So setting small off-diagonal estimated elements to zero is good thinking, and we call this thresholding estimation. Let $s_{ij}$ be the $(i,j)$ element of the sample covariance estimate and let $\lambda$ be some threshold value; then formally you have the new covariance estimate:

$$\hat{s}_{ij} = s_{ij} \cdot 1\{|s_{ij}| \geq \lambda\},$$

where $1\{\cdot\}$ is the indicator function, returning zero if the absolute value is below the $\lambda$ you set. Like that we can create a sparse covariance estimate.
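The thresholding estimator takes only a few lines; this is a sketch (the function name and the identity-covariance simulation are my own choices for illustration):

```r
# Thresholding estimation: zero out small off-diagonal entries of
# the sample covariance matrix, leaving the variances untouched.
threshold_cov <- function(S, lambda) {
  S_thr <- S * (abs(S) >= lambda)   # indicator: keep only |s_ij| >= lambda
  diag(S_thr) <- diag(S)            # never threshold the diagonal
  S_thr
}

set.seed(1)
X <- matrix(rnorm(50 * 10), nrow = 50)  # true covariance is the identity
S <- cov(X)
S_sparse <- threshold_cov(S, lambda = 0.1)
mean(S_sparse[upper.tri(S_sparse)] == 0)  # share of off-diagonals zeroed
```

Even though the true off-diagonal covariances are exactly zero, the raw sample estimate has none; the thresholded version recovers much of that sparsity.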
Now, each entry in the covariance matrix involves only two variables, ignoring all the others. The $(i,j)$ entry is simply $\mathrm{cov}(x_i, x_j)$. So for example if $x_k$ drives, and co-moves with, both $x_i$ and $x_j$ – meaning $\mathrm{cov}(x_i, x_k)$ and $\mathrm{cov}(x_j, x_k)$ are both high – then $\mathrm{cov}(x_i, x_j)$ is also likely to be high. If we are able to net out the movement in $x_k$ from $\mathrm{cov}(x_i, x_j)$, that would be good. In other words we look for $\mathrm{cov}(x_i, x_j \mid x_k)$. Why would that be good? Because $\mathrm{cov}(x_i, x_j \mid x_k)$ is likely to be much lower than $\mathrm{cov}(x_i, x_j)$, and so thresholding makes (even) better sense. We would expect many more zeros if we could find the partial covariances, conditional on all the other variables.
The precision matrix is the inverse of the covariance matrix. It just so happens that the entries of the precision matrix encode exactly what we are looking for: the conditional (on all other variables) dependence, rather than the unconditional one as in the covariance matrix. This morning I was walking with a fellow statistician, data scientist, quant, machine learning engineer, AI researcher, statistician, and he challenged me to explain this. I could not but accept the challenge. So, why does inverting the covariance matrix give us a precision matrix which encodes the conditional dependence between any two variables, after controlling for the rest?
Consider a data matrix $A$ with $n$ rows and $p$ columns. Partition $A$ into two matrices: the first 2 columns as $X$ and the rest as $Y$. For the sake of simplicity assume the variables are jointly normal with mean zero. So

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N(\mathbf{0}, \Sigma).$$

The covariance and precision matrices are given by

$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}$$

and

$$\Omega = \Sigma^{-1} = \begin{pmatrix} \Omega_{XX} & \Omega_{XY} \\ \Omega_{YX} & \Omega_{YY} \end{pmatrix}.$$
I remind you what we are chasing here. $X$ is just a matrix with 2 columns, so $\Omega_{XX}$ is a $2 \times 2$ block and its off-diagonal is just one number. We want to show that this number captures the conditional dependence between the two columns of $X$, after accounting for all other columns $Y$.
The order of things to come: write the conditional density of $X$ given $Y$, keep only the terms involving $X$, complete the square, and read off the conditional covariance

$$\Omega_{XX}^{-1} = \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}.$$

This expression comes from tedious matrix algebra, or better yet from something we know about this kind of block matrix and its inverse, which is called the Schur complement. Usually you see this written in terms of the blocks of $\Sigma$, but recall that $\Omega = \Sigma^{-1}$, so the same expression also pins down the top-left block of the precision matrix.
We can ignore anything that depends only on $Y$, since we condition on it, so we can treat it as a constant and keep only the terms involving $X$ (the rest goes outside the brackets and is elegantly ignored by using $\propto$, which means “proportional to”):

$$f(X \mid Y) \propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} X \\ Y \end{pmatrix}^\top \begin{pmatrix} \Omega_{XX} & \Omega_{XY} \\ \Omega_{YX} & \Omega_{YY} \end{pmatrix} \begin{pmatrix} X \\ Y \end{pmatrix} \right\}.$$

Using some additional algebra (see below for some rules we need here) we can write it as:

$$f(X \mid Y) \propto \exp\left\{ -\frac{1}{2} \left( X^\top \Omega_{XX} X + 2\, X^\top \Omega_{XY} Y \right) \right\}.$$

What we are going to do now is to make it look like the normal distribution:

$$f(z) \propto \exp\left\{ -\frac{1}{2} (z - \mu)^\top \Sigma^{-1} (z - \mu) \right\},$$

so we can recognize the covariance term out of the expression. The trick is to add and subtract the same term (preserving the overall value of the expression), completing the square, aka “completing the quadratic form”. We have quite a few algebraic rules to remember moving forward: transposing, noting that the covariance is a symmetric matrix, and that we can consider anything which does not depend on $X$ as a constant, even $Y$, since it does not change with the realization of $X$. I wrote down some rules we use below and here you can find what I use for verification.

The term we need to add and subtract is $Y^\top \Omega_{YX} \Omega_{XX}^{-1} \Omega_{XY} Y$, to get

$$f(X \mid Y) \propto \exp\left\{ -\frac{1}{2} \left( X + \Omega_{XX}^{-1} \Omega_{XY} Y \right)^\top \Omega_{XX} \left( X + \Omega_{XX}^{-1} \Omega_{XY} Y \right) \right\}.$$

So this looks like what we aimed for:

$$X \mid Y \sim N\left( -\Omega_{XX}^{-1} \Omega_{XY}\, Y,\ \ \Omega_{XX}^{-1} \right),$$

especially if you remember that the “variance” $\Omega_{XX}^{-1}$ is itself an inverse (so $\Omega_{XX}$ actually sits “in the denominator” of that exponent).
Cool cool, but what do we have? The variance of the conditional distribution of $X$ given $Y$ is $\Omega_{XX}^{-1} = \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}$, an expression which says that you need to subtract from $\Sigma_{XX}$, which you can think of as the total variability, the part which is explained by $Y$: $\Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}$.
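As a quick numerical sanity check (my addition, not from the original derivation), we can verify the Schur-complement identity on a random positive-definite covariance matrix:

```r
# Check: the top-left block of the precision matrix equals the inverse
# of Sigma_XX - Sigma_XY %*% solve(Sigma_YY) %*% Sigma_YX.
set.seed(1)
p <- 5
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A) + diag(p)   # a random positive-definite covariance
Omega <- solve(Sigma)             # the precision matrix

xx <- 1:2; yy <- 3:p              # X = first two variables, Y = the rest
cond_var <- Sigma[xx, xx] -
  Sigma[xx, yy] %*% solve(Sigma[yy, yy]) %*% Sigma[yy, xx]

max(abs(Omega[xx, xx] - solve(cond_var)))  # numerically zero
```

The discrepancy is at machine-precision level, which is the identity the algebra above promised.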
You have seen this shape in the past, in the usual regression coefficient formula:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y.$$
Recall that when the mean vector is zero, the cross-product $X^\top y$ is simply the covariance (up to a scaling constant). Now observe that the expression $\Sigma_{YY}^{-1} \Sigma_{YX}$ above can be thought of (up to a constant) as the coefficient from a regression which explains the columns of $X$ using the columns of $Y$. Specifically, $\Sigma_{YX}$ is analogous to the cross-product $X^\top y$ in that formula, capturing the covariance between $Y$ and $X$, and $\Sigma_{YY}$ is analogous to $X^\top X$, capturing the covariance within the columns of $Y$.
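This regression analogy can be checked numerically. The following sketch (my addition; the variable names are arbitrary) compares the cross-product formula with the no-intercept coefficients from `lm` on demeaned data:

```r
# With zero-mean data, solve(t(Y) %*% Y) %*% t(Y) %*% X reproduces the
# no-intercept regression coefficients of the columns of X on Y.
set.seed(1)
n <- 200
Y <- scale(matrix(rnorm(n * 3), n, 3), scale = FALSE)  # demeaned "rest"
X <- scale(Y %*% matrix(rnorm(6), 3, 2) +
             matrix(rnorm(n * 2), n, 2), scale = FALSE)

beta_formula <- solve(crossprod(Y)) %*% crossprod(Y, X)
beta_lm <- coef(lm(X ~ Y - 1))   # no intercept: data already demeaned

max(abs(beta_formula - beta_lm)) # numerically zero
```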
Here is some R code to replicate the math above. We start by pulling some stock market data using the quantmod package:
library(quantmod) # citation("quantmod")
k <- 10 # how many years back?
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf <- c('XLY', 'XLP', 'XLE', 'XLF', 'XLV', 'XLI', 'XLB', 'XLK', 'XLU')
l <- length(symetf)
w0 <- NULL
for (i in 1:l) {
  dat0 <- getSymbols(symetf[i], src = "yahoo", from = start, to = end,
                     auto.assign = F, warnings = FALSE, symbol.lookup = F)
  w1 <- dailyReturn(dat0)
  w0 <- cbind(w0, w1)
}
time <- as.Date(substr(index(w0), 1, 10))
w0 <- 100 * as.matrix(w0) # convert to %
colnames(w0) <- symetf
We will net out, or control for, the other variables by using the residuals from node-wise regressions: regress each variable of the pair on all the remaining variables (excluding the pair itself) and keep the residuals. The partial linear dependence between, say, XLV and XLF will then be estimated by the correlation between those two residual series.
get_precision_entry <- function(data, column_i, column_j) {
  data <- as.matrix(data)
  precision_mat <- solve(cov(data))
  name_i <- which(colnames(data) == column_i)
  name_j <- which(colnames(data) == column_j)
  # node-wise regressions: each variable on all others, excluding the pair
  lm0 <- lm(data[, name_i] ~ data[, -c(name_i, name_j)])
  lm1 <- lm(data[, name_j] ~ data[, -c(name_i, name_j)])
  partial_correlation <- cor(lm0$res, lm1$res)
  diag1_diag2 <- precision_mat[column_i, column_i] *
    precision_mat[column_j, column_j]
  precision_entry_i_j <- -partial_correlation * sqrt(diag1_diag2)
  return(precision_entry_i_j)
}
precision_mat <- solve(cov(w0))
get_precision_entry(data = w0, column_i = "XLV", column_j = "XLF")
[1] -0.328
precision_mat["XLV", "XLF"]
[1] -0.328
To cite this blog post “Eran Raviv (2024, June 6). Correlation and correlation structure (8) – the precision matrix. Retrieved on … month, year from https:…”
Footnotes and references
Some algebra rules which were used
The formula for Chatterjee’s rank correlation (no-ties case):

$$\xi_n(X, Y) = 1 - \frac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1},$$

where $r_i$ is the rank of $y_i$ after the data are rearranged so that $x$ is in increasing order.
Briefly: do, on average, larger values of $x$ drag a higher ranking for $y$? Yes → high correlation; no → low correlation.
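The no-ties formula takes only a few lines to implement; `xi_cor` below is a hypothetical helper name, a sketch rather than the XICOR package implementation:

```r
# Chatterjee's xi, no-ties case: sort by x, rank y, sum the absolute
# jumps between consecutive ranks.
xi_cor <- function(x, y) {
  n <- length(x)
  r <- rank(y[order(x)])              # ranks of y after sorting by x
  1 - 3 * sum(abs(diff(r))) / (n^2 - 1)
}

set.seed(1)
x <- runif(100, -5, 5)
xi_cor(x, x^2)         # strong (non-monotone) dependence: close to 1
xi_cor(x, runif(100))  # independence: close to 0
```

Note it picks up the quadratic dependence that Pearson correlation would miss entirely.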
Reasons to like:
If $x$ is a set of random draws from $[-5, 5]$, the following images depict a quadratic relation with noise, once as y ~ x and once as x ~ y. The values in the headers are the new ranking-based correlation measure and the traditional Pearson correlation measure.
Say you want just a measure for the dependency between two variables, and you don’t care what drives the dependency. No problem. Just as the symmetric Jensen-Shannon divergence is simply the average of two asymmetric Kullback-Leibler divergences, so here you can just use the average of the two directions: $\frac{1}{2}\left[\xi_n(X, Y) + \xi_n(Y, X)\right]$.
Reasons to dislike:
From a recent Biometrika paper:
… the standard bootstrap, in general, does not work for Chatterjee’s rank correlation. … Chatterjee’s rank correlation thus falls into a category of statistics that are asymptotically normal but bootstrap inconsistent.
This means that you shouldn’t use bootstrap for any inference or significance testing. The following simple code snippet reveals the problem:
# install.packages("XICOR") # citation("XICOR")
library(XICOR)
library(magrittr) # for the pipe

tt <- 100
x <- runif(tt, -5, 5)
sdd <- 1
x2 <- x^2 + rnorm(tt, sd = sdd)
x1 <- x + rnorm(tt, sd = sdd)
samplexi <- calculateXI(x1, x2)
samplexi

bootxi <- NULL
for (i in 1:tt) {
  tmpind <- sample(tt, replace = T)
  bootxi[i] <- calculateXI(x1[tmpind], x2[tmpind])
}
density(bootxi) %>% plot(xlab = "", main = "", xlim = c(0, 1),
                         ylab = "", lwd = 2, col = "darkgreen")
abline(v = samplexi, lwd = 3)
grid()
Bootstrap fails miserably here. What you see is that the density of the bootstrapped statistic is completely off the mark in that it’s not even centered around the sample-estimate. So, you can’t use bootstrap for inference, and you must also avoid any other procedures that are bootstrap-driven (e.g. bagging). This is the end of the post if you don’t care why bootstrap fails here.
10 years ago I gave this example for a bootstrap failure. Back then I did not know the reason for this, but now I know enough. It has to do with statistics which are not smooth with respect to the data.
A statistic $T$ of a dataset $x_1, \dots, x_n$ is considered smooth if it has continuous derivatives with respect to each data point. Formally, this means that for any data point $x_i$, the partial derivatives $\partial T / \partial x_i$ exist and are continuous. Bootstrap only works for smooth statistics, which covers most of what you are familiar with, e.g. the mean and the variance. But ranking changes in a stepwise fashion rather than smoothly. So when we bootstrap, rankings “jump” rather than “crawl”, which troubles the bootstrapping technique. The example I gave a decade ago spoke of the maximum. The derivative there is not continuous: either it is 0 (no change in the maximum), or the maximum jumps to a different value, and that is why we can’t use bootstrap (at least not the standard nonparametric version).
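That decade-old maximum example can be sketched in a few lines; this is my illustration, not code from the original post:

```r
# Bootstrapping the sample maximum: the bootstrap distribution is
# discrete, piling up on a handful of the largest order statistics
# (the max "jumps" rather than "crawls"), so it cannot mimic the
# true sampling distribution.
set.seed(1)
n <- 100
x <- runif(n)
boot_max <- replicate(2000, max(sample(x, replace = TRUE)))

mean(boot_max == max(x))   # roughly 1 - (1 - 1/n)^n, about 63%
length(unique(boot_max))   # only a few distinct values ever appear
```

A massive point mass at the sample maximum is nothing like the smooth, continuous sampling distribution of the true maximum.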
Matrix multiplication can be beneficially perceived as a way to expand the dimension. We begin with a brief discussion on PCA. Since PCA is predominantly used for reducing dimensions, and since you are familiar with PCA already, it serves as a good springboard by way of a contrasting example for dimension expansion. Afterwards we show some basic algebra via code, and conclude with a citation that provides the intuition for the reason dimension expansion is so essential.
PCA is used for dimension reduction. The factors in PCA are simply linear combinations of the original variables. It is explained here. For a data matrix with $p$ columns you will have $p$ such linear combinations. Typically we use fewer than $p$, because the whole point is to reduce the dimension. The resulting factors, or linear transformations, are often called rotations. They are called rotations because the original data is being “rotated” such that the factors/vectors point in the direction of maximum variance. The transforming/rotating matrix is the matrix of the eigenvectors.
So, when you have a matrix (in PCA that’s the column-bound eigenvectors) and you multiply it with another matrix (in PCA that’s the original data), you effectively transform the data, and linearly so. In PCA we do that so as to reduce the dimension, using fewer than $p$ transformations. Also, we use the eigenvector matrix as the transforming matrix (the reason is not important for us now).
Moving to dimension expansion, let’s relax those two: (1) We will not use the eigenvector matrix, and (2) we can use a larger-dimension transforming matrix. Therefore the result will be of higher dimension than the original input.
I write ‘we can’ but in truth we must. The ability to transform the original data, expand it, and project to higher dimensions is a BIG part of the success of deep, large, and deep and large ML algorithms; especially those flush and heavy LLMs which are popping out left and right.
Look at the following super-simple piece of code:
> A <- 10*matrix(runif(12), nrow = 4, ncol = 3) %>% round(digits = 1)
> A
     [,1] [,2] [,3]
[1,]    9    4    7
[2,]    6    0    9
[3,]    9    1    7
[4,]    4    5    3
> B <- 10*matrix(runif(6), nrow = 3, ncol = 2) %>% round(digits = 1)
> B
     [,1] [,2]
[1,]    1    9
[2,]    8    7
[3,]   10    9
>
> transformed_B <- A %*% B
> A %*% B[,1]
     [,1]
[1,]  111
[2,]   96
[3,]   87
[4,]   74
> A %*% B[,2]
     [,1]
[1,]  172
[2,]  135
[3,]  151
[4,]   98
> transformed_B
     [,1] [,2]
[1,]  111  172
[2,]   96  135
[3,]   87  151
[4,]   74   98
What you see is that A transforms each vector from B separately, and the results are simply column-bound. Therefore we continue our talk using only one vector.
Now the following piece of code reminds us of the meaning of matrix multiplication.
> transformed_vectors1 <- A %*% B[,1]
> sum(B[,1] * A[1,])
[1] 111
> sum(B[,1] * A[2,])
[1] 96
> sum(B[,1] * A[3,])
[1] 87
> sum(B[,1] * A[4,])
[1] 74
> transformed_vectors1
     [,1]
[1,]  111
[2,]   96
[3,]   87
[4,]   74
So each entry of the transformed vector is the sum of the element-wise product of a row of A with the vector. In other words, the entries of the transformed vector are linear combinations of that vector, with the combination weights given by the rows of A.
Note we have created more linear transformations from our original vector than its size. Geometrically speaking, we projected our vector on a higher dimensional space (from size 3 to size 4).
The combinations themselves are given here, unlike in PCA, by arbitrarily chosen numbers (the entries of A). You would do well to relate this to the massive number of parameters (billions…) which drive advanced AI algorithms. Think of those entries of A as weights/parameters which are randomly initialized and later optimized, settling on values which create useful transformations for prediction purposes.
In fact, it turns out that it’s paramount to over-over-over-expand the dimension. Such a “let’s over-parameterize this problem” approach creates an enormous number of such transformations. While most transformations are totally redundant, some are useful almost from their randomized initialization (and become more useful after optimizing).
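As a toy illustration of why random expansion helps (my own sketch, not an experiment from the paper below): project a scalar input onto many random, untrained ReLU features and fit only a linear readout on top of them:

```r
# Random-features sketch: a plain linear fit fails on a nonlinear
# target, but a linear fit on frozen random ReLU expansions succeeds.
set.seed(1)
n <- 500
x <- seq(-3, 3, length.out = n)
y <- sin(2 * x) + rnorm(n, sd = 0.1)   # a nonlinear target

# random projection to 200 dimensions; these weights are never trained
W <- rnorm(200)                        # random slopes
b <- runif(200, -3, 3)                 # random offsets (kink locations)
H <- pmax(outer(x, W) + matrix(b, n, 200, byrow = TRUE), 0)  # ReLU

r2 <- function(fit) summary(fit)$r.squared
r2(lm(y ~ x))   # linear in x: poor fit
r2(lm(y ~ H))   # linear in the random features: near-perfect fit
```

Only the readout layer is estimated; the expansion itself stays at its random initialization, which is the spirit of the over-parameterization point above.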
A paper titled What is Hidden in a Randomly Weighted Neural Network forcefully exemplifies that point:
.. within a sufficiently overparameterized neural network with random weights (e.g. at initialization), there exists a subnet-work that achieves competitive accuracy. Specifically, the test accuracy of the subnetwork is able to match the accuracy of a trained network with the same number of parameters.
As an illustration, take a look at the first figure in that paper:
If you introduce enough expansions, as in the middle panel, there will be a sub-network, as in the right panel, which achieves good performance, similar to that of a well-trained network as in the left panel, without ever modifying the initially randomized weight values. This figure reminds me of the saying: if you throw enough shit against a wall, some of it has gotta stick
[pardon my french].