
The term mutual information is drawn from the field of information theory, which is concerned with the quantification of information. For example, a central concept in this field is entropy, which we have discussed before.

If you google the term “mutual information” you will land on a page which, if you could understand it, you probably would not have needed to google in the first place. For example:

> Not limited to real-valued random variables and linear dependence like the correlation coefficient, mutual information (MI) is more general and determines how different the joint distribution of the pair (X, Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

This makes sense at first read only for those who don’t need to read it. That is the main motivation for this post: to provide a clear intuition behind the term.

If you know that the result of a fair, six-sided die roll is larger than 4, the probability that it is a 5 is 1/2, while if you don’t know that the result is larger than 4, the probability remains 1/6. So knowing that the result is larger than 4 made a big difference in this case. But how big of a difference? We want to quantify how big this difference is compared to, say, knowing that the result is larger than 2, or not knowing anything at all.

In the example above we implicitly used the conditional probability formula: P(B|A) = P(A ∩ B) / P(A), with A being the event “result is larger than 4”, B being the event “result is equal to 5”, and A ∩ B meaning that both A and B occurred simultaneously.
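The arithmetic above is trivial, but worth seeing once as code. A minimal sketch in base R; the names `A` and `B` are mine, mirroring the example:

```r
# Sample space of a fair six-sided die, with uniform probabilities
omega <- 1:6
p <- rep(1/6, 6)

A <- omega > 4          # event "result is larger than 4"
B <- omega == 5         # event "result is equal to 5"

p_A  <- sum(p[A])       # P(A) = 2/6
p_AB <- sum(p[A & B])   # P(A and B) = 1/6, since a 5 is always larger than 4
p_AB / p_A              # P(B|A) = (1/6)/(2/6) = 0.5
```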

Those “events” above are just random variables: what can happen? what is the probability of each of the possible outcomes? If we denote those random variables as X and Y, the formula for pointwise mutual information is very closely related to that of conditional probability. **The link between conditional probability and mutual information is your main engine for understanding this topic.** The formula for pointwise mutual information is

pmi(x; y) = log [ p(x, y) / ( p(x) p(y) ) ]

Forget about the log operator for a second. Let’s massage this formula:

Let’s focus on the last expression:

p(x, y) / ( p(x) p(y) ) = p(x | y) · ( 1 / p(x) )

As you can see, it’s the conditional probability of X given Y times 1/p(x). If the conditional probability is zero (the two events never co-occur), the product is zero and carries no information; but if the conditional probability is larger than zero, the product is meaningful. How “important” is the event X = x? If p(x) = 1 then the event X = x is not really important, is it? Think of a die which always rolls the same number; there is no point in considering it. But if the event is fairly rare → p(x) is relatively low → 1/p(x) is relatively high → the value of p(x | y) becomes much more important in terms of information. That is the first observation regarding the PMI formula.
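The formula itself is a one-liner. A sketch (the function name `pmi` is made up for this post; log base 2 so the result is in bits):

```r
# Pointwise mutual information in bits
# p_xy: joint probability P(X = x, Y = y); p_x and p_y: the marginals
pmi <- function(p_xy, p_x, p_y) log2(p_xy / (p_x * p_y))

# The die example: x = "result is larger than 4", y = "result is equal to 5"
# P(x) = 2/6, P(y) = 1/6 and P(x, y) = 1/6 (a 5 is always larger than 4)
pmi(p_xy= 1/6, p_x= 2/6, p_y= 1/6)   # log2(3), about 1.58: knowing x makes y more likely
```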

We are about halfway there.

In the code below we pull some ETF data from Yahoo, for TLT (US treasury bonds) and SPY (US S&P 500 stocks), and create two series of daily returns for those two tickers.

```r
library(quantmod)
library(magrittr)

k <- 10 # how many years back?
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf <- c("TLT", "SPY")
l <- length(symetf)
w0 <- NULL
for (i in 1:l) {
  dat0 <- getSymbols(symetf[i], src= "yahoo", from= start, to= end,
                     auto.assign= F, warnings= FALSE, symbol.lookup= F)
  w1 <- dailyReturn(dat0)
  w0 <- cbind(w0, w1)
}
time <- as.Date(substr(index(w0), 1, 10))
w0 <- as.matrix(w0) * 100
colnames(w0) <- symetf

> tail(w0, 3)
                 TLT         SPY
2020-01-22 0.3513343  0.01207606
2020-01-23 0.7001964  0.11468733
2020-01-24 0.8088548 -0.88930785
```

Now let’s define our random variables. X would be: “the return of TLT is below its 5% quantile”. Y would be: “the return of SPY is below its 5% quantile”. So we have two Bernoulli random variables.

Now, based on the pointwise mutual information formula we compute the PMI measure:

```r
TT <- nrow(w0) # number of observations
alpha <- 0.05
y <- (w0[, "SPY"] < quantile(w0[, "SPY"], prob= alpha)) %>% as.numeric
x <- (w0[, "TLT"] < quantile(w0[, "TLT"], prob= alpha)) %>% as.numeric
p_x <- sum(x) / TT
p_y <- sum(y) / TT
p_xy <- (x[y == 1] %>% sum) / TT # joint: both below their 5% quantile

(p_xy / p_y) / p_x
[1] 0.3167045
log2((p_xy / p_y) / p_x)
[1] -1.658791
```

The PMI measure is about -1.65. What (the hell) does that mean?

The pointwise mutual information measure is not confined to the [0, 1] range, so here is how to interpret a zero, a positive, or, as in our case, a negative number. The case PMI = 0 is trivial: it occurs when the ratio equals 1, so log(1) = 0, meaning p(x, y) = p(x) p(y), which tells us that X and Y are independent. A positive number means the two events co-occur more frequently than we would expect if they were independent. Why? Because p(x, y) / ( p(x) p(y) ), or equivalently p(x | y) / p(x), is larger than 1 (if it’s smaller than 1, the log is negative). In our case the ratio is lower than one, meaning p(x, y) < p(x) p(y); we see the two events together less often than independence would imply.

Let’s talk numbers to make it more tangible. The individual probabilities are p_x = p_y = roughly 5% (by construction here). If the events/variables were independent, we would expect to see both occur simultaneously around 0.05² = 0.25% of the time. Instead we see those events co-occur only

`> p_xy *100`

[1] 0.07952286

so only 0.08% of the time. We see this joint event roughly one third as often as we would expect if the events were independent (approximately 0.08/0.25). This 0.316 figure is what goes into the log operator and produces the negative number.

`(p_xy/p_y)/p_x`

[1] 0.3167045

→

`log2((p_xy/p_y)/p_x)`

[1] -1.658791

Practically it means that the number of times where *both* stocks and bonds have a bad day (being below their 5% quantile) is much lower compared to them having bad days *individually*. So seeing a bad day for one of those does not drag in a bad day for the other; on the contrary. Which makes sense given the textbook bonds-as-a-hedge-against-stock-market-doom argument.

The pointwise mutual information can be understood as a scaled conditional probability.

The pointwise mutual information represents a quantified measure of how much more, or less, likely we are to see the two events co-occur, **given their individual probabilities, and relative to the case where the two are completely independent.**
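And since MI is the expected value of the PMI, averaging the PMI over all four outcome pairs of two binary variables gives the full mutual information. A sketch with a made-up 2×2 joint table (the function name `mutual_info` is mine):

```r
# Mutual information (in bits) of two binary variables
# joint: 2x2 matrix of joint probabilities P(X = i, Y = j), summing to 1
mutual_info <- function(joint) {
  px <- rowSums(joint)
  py <- colSums(joint)
  mi <- 0
  for (i in 1:2) for (j in 1:2) {
    if (joint[i, j] > 0)   # the 0 * log(0) terms are taken as 0
      mi <- mi + joint[i, j] * log2(joint[i, j] / (px[i] * py[j]))
  }
  mi
}

joint <- matrix(c(0.4, 0.1,
                  0.1, 0.4), 2, 2, byrow= TRUE)
mutual_info(joint)               # about 0.28 bits; diagonal-heavy, so dependent
mutual_info(matrix(0.25, 2, 2))  # 0 bits; joint equals product of marginals
```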


Looks nice, but probably highly biased. The survey appears at the end of the html page, so only readers who were done reading actually reach the survey; if they read all the way through they are probably very interested, while if they leave halfway through they don’t fill in their replies. Still, it looks like many readers are happy with at least some content, which is nice to see. Now for the most popular posts of 2019.

By far the clear winner is Understanding Variance Explained in PCA (5:54 minutes average time on page), followed by Adaptive Huber Regression (4:22 minutes average time on page), followed by Portfolio Construction Tilting towards Higher Moments (5:43 minutes average time on page).

Own personal favorites:

Day of the week and the cross-section of returns (3:29 minutes average time on page)

The lowest *readers/effort* ratio was the post CUR matrix decomposition for improved data analysis. However, the lowest ratio this year was not as low as in previous years.

**Thank you** for reading, for sharing, for your emails, and for your corrections and comments.

Happy and productive 2020!

pic credit: Morvanic Lee

* I would like of course to dive into the individual post level. You can try polleverywhere for what looks like a decent alternative.


CUR matrix decomposition provides an alternative to more common approaches like SVD or PCA. Why do you need an alternative? Because PCA (for example) provides you with some latent factors, but those factors are not very informative or meaningful in any clear inductive way. If you want the story behind PCA, you need to sprout it from within, typically by looking at the factors’ loadings and inventing a way to interpret them. Beyond the first factor this can be quite challenging and karma-dependent.

The paper I reference here proposes an algorithm to get an *interpretable* lower-rank approximation. Their proposal is based on capturing the “influence” of a given variable/column, which is what attracted my attention: a way to measure the importance (for lack of a better word) of a particular variable, a particular column in your data matrix, in an unsupervised manner.

Their equation (3) in the paper (link below) reads

π_j = (1/k) Σ_{ξ=1}^{k} (v_j^ξ)²

where π_j is the normalized statistical leverage score for a particular column j, and the v^ξ are simply the right singular vectors from an SVD of the original data matrix.

Why would we want that? Good question. The oil price, say, is one of the variables in a matrix which drives the consumer price index. Economically speaking, that variable (column) is very important for understanding the consumer price index data (matrix). Wouldn’t it be nice to have a statistical procedure that recognizes those important columns? PCA constructs linear combinations which explain the variability in the data, but those are hard to interpret. Just tell me which variables are important, not which linear combination is important. Being able to point out the “important” variables without any target is quite an enticing proposition, I think.

Let’s examine the 94 blue-chip stocks with the highest market capitalization and see if we can tell which individual names are the most relevant for the overall movement in the data, using this statistical-leverage-scores method. Here are the results (code below). It turns out these are the most important names, in the sense that they drive most of the movement in the data:

` [1] "BBY" "QCOM" "OKE" "KR" "WBA" `

Of course, an itching question is: what is so special about those names that they were chosen? Is it because they have the largest standard deviation (SD from here on), so that they exert a large pull on the overall matrix? Let’s check:

The 5 grey vertical bars are those 5 names. We see that their SDs indeed lie above the average, but we also see that there are other names with higher SDs that are not flagged as particularly important based on their statistical leverage score.

What about the cross correlation? Each name has a correlation with all the rest. The figure below shows, on average, how much each name is correlated with all the rest:

The 5 grey vertical bars are those 5 names. We see that on average they have less correlation with the rest of the names.

So for a name to be flagged as important in the data it should have a high SD while keeping its own “independence” as much as possible. If a name has high variance but is also highly correlated with the rest, then it is not particularly important; while if a name is quite independent but not very influential for the overall movement in the data (low SD), then it is also not worthy of special attention.

The statistical method outlined in this post shows how to check the “influence” or “importance” of a particular variable in the context of a data matrix. It turns out that influential, intuitively, means one that has fairly high variability yet is quite independent from the rest, and hence deserves more attention. Use this numerical procedure to examine which individual variables, rather than which linear combination of them (as with PCA), are important for the overall movement in the data. The main, perhaps only, advantage is in terms of explainability. It is much easier to communicate which variables are important than which linear combination of the variables is important.
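As a quick sanity check on that intuition, here is a small simulation with made-up data rather than the stock matrix: three columns driven by a common factor, plus one independent column with a higher SD. The leverage scores should single out the independent, high-variance column:

```r
set.seed(1)
n <- 500
common <- rnorm(n)
# Three highly correlated columns, plus one independent column with higher SD
X <- cbind(a= common + rnorm(n, sd= 0.1),
           b= common + rnorm(n, sd= 0.1),
           c= common + rnorm(n, sd= 0.1),
           d= rnorm(n, sd= 2))
X <- scale(X, center= TRUE, scale= FALSE)

# Normalized leverage scores from the top k right singular vectors
v <- svd(X)$v
k <- 2
lev <- rowSums(v[, 1:k]^2) / k
round(lev, 2)   # column "d" gets by far the highest score
```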

Here is the R code to create the scores. It takes as input v, from an SVD, and k, the rank restriction.

```r
levscores <- function(v, k) {
  if (k == 1) {
    v[, 1]^2
  } else {
    apply(v[, 1:k]^2, 1, sum) / k
  }
}
```

Here is the R code for pulling the data and for the histograms

```r
library(quantmod)

tmpfile <- "~/posts/2019-09-Blue Chip highest market cap.csv"
nam <- read.csv(tmpfile, header= F)[, 1] %>% as.character
k <- 10 # how many years back?
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
l <- length(nam)
w0 <- w1 <- NULL
for (i in 1:l) {
  dat0 <- tryCatch(
    getSymbols(nam[i], src= "yahoo", from= start, to= end,
               auto.assign= F, warnings= FALSE, symbol.lookup= F),
    error= function(e) {
      dat0 <- NULL
      message(paste("Ticker number", i, ",", nam[i], "was not downloaded"))
    }
  )
  tryCatch(w1 <- weeklyReturn(dat0), error= function(e) { return(NULL) })
  w0 <- cbind(w0, w1)
}

svd0 <- svd(w0) # perform svd
lev_scores <- levscores(svd0$v, k= 20)
tmp_ind <- lev_scores %>% order(decreasing= T) # temp index
# check what are the top 5 most important
topp <- 5
nam[tmp_ind] %>% head(topp)
[1] "BBY" "QCOM" "OKE" "KR" "WBA"

# Histograms
apply(w0, 2, sd) %>% hist(breaks= 22, col= "darkgreen", ylab= "", main= "")
abline(v= apply(w0, 2, sd)[tmp_ind] %>% head(topp), lwd= 2)
cor_mat <- cor(w0, use= "pairwise.complete.obs")
apply(cor_mat, 1, mean) %>% hist(breaks= 22, col= "darkgreen", ylab= "", main= "")
abline(v= apply(cor_mat, 1, mean)[tmp_ind] %>% head(topp), lwd= 2)
```

CUR matrix decomposition for improved data analysis

Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition


The slides for the talk and the paper it’s based on can be found here.


Mathematically, PCA is performed via linear algebra routines called eigen decomposition or singular value decomposition. By now almost nobody cares how it is computed. Implementing PCA is as easy as pie nowadays, like many other numerical procedures really: from drag-and-drop interfaces to `prcomp` in R or `from sklearn.decomposition import PCA` in Python. So implementing PCA is not the trouble, but some vigilance is nonetheless required to understand the output.
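Just to demystify the computation once: `prcomp` is nothing but an eigen decomposition of the covariance matrix (equivalently, an SVD of the centered data). A toy check on made-up data:

```r
set.seed(42)
X <- matrix(rnorm(300), ncol= 3)   # made-up data: 100 observations, 3 variables

pca <- prcomp(X)                   # PCA the convenient way
eig <- eigen(cov(X))               # PCA "by hand"

# The PC variances are exactly the eigenvalues of the covariance matrix
all.equal(pca$sdev^2, eig$values)  # TRUE
```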

This post is about understanding the concept of *variance explained*. At the risk of sounding condescending, I suspect many new-generation statisticians/data-scientists simply echo what is often cited online, “the first principal component explains the bulk of the movement in the overall data”, without any deep understanding. What does “explains the bulk of the movement in the overall data” actually mean?

In order to properly explain the concept of “variance explained” we need some data. We will use a very small scale so that we can later visualize it with ease. We pull prices from Yahoo for the three following tickers: SPY (S&P 500), TLT (long-term US bonds) and QQQ (NASDAQ). Let’s look at the covariance matrix of the daily return series:

```r
library(quantmod)
library(magrittr)
library(xtable)
# citation("quantmod"); citation("magrittr"); citation("xtable")

k <- 10 # how many years back?
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf <- c("TLT", "SPY", "QQQ")
l <- length(symetf)
w0 <- NULL
for (i in 1:l) {
  dat0 <- getSymbols(symetf[i], src= "yahoo", from= start, to= end,
                     auto.assign= F, warnings= FALSE, symbol.lookup= F)
  w1 <- dailyReturn(dat0)
  w0 <- cbind(w0, w1)
}
dat <- as.matrix(w0) * 100 # percentage
timee <- as.Date(rownames(dat))
colnames(dat) <- symetf
print(xtable(cov(dat), digits= 2), type= "html")
```

|     | TLT   | SPY   | QQQ   |
|-----|-------|-------|-------|
| TLT | 0.77  | -0.40 | -0.39 |
| SPY | -0.40 | 0.90  | 0.96  |
| QQQ | -0.39 | 0.96  | 1.20  |

As expected, SPY and QQQ have high covariance, while TLT, being bonds, on average co-moves negatively with the other two.

We now apply PCA once on data which is highly positively correlated, and once on data which is not very positively correlated, so we can later compare the results. We apply PCA on a matrix which excludes TLT, i.e. the `c("SPY", "QQQ")` columns (call it `PCA_high_correlation`), and PCA on a matrix which only has the TLT and SPY columns (call it `PCA_low_correlation`):

```r
PCA_high_correlation <- dat[, c("SPY", "QQQ")] %>% prcomp(scale= T)
PCA_high_correlation %>% summary %>% xtable(digits= 2) %>% print(type= "html")
PCA_low_correlation <- dat[, c("SPY", "TLT")] %>% prcomp(scale= T)
PCA_low_correlation %>% summary %>% xtable(digits= 2) %>% print(type= "html")
```

`PCA_high_correlation`:

|                        | PC1  | PC2  |
|------------------------|------|------|
| Standard deviation     | 1.42 | 0.28 |
| Proportion of Variance | 0.96 | 0.04 |
| Cumulative Proportion  | 0.96 | 1.00 |

`PCA_low_correlation`:

|                        | PC1  | PC2  |
|------------------------|------|------|
| Standard deviation     | 1.11 | 0.65 |
| Proportion of Variance | 0.74 | 0.26 |
| Cumulative Proportion  | 0.74 | 1.00 |

I am guessing this detailed summary contributes to the lack of understanding among undergraduates. In essence only the first row should be given, so any user would be forced to derive the rest if they needed to.

The first step towards understanding the second row is to compute it. The first row gives the standard deviation of each principal component; square it to get the variance. The `Proportion of Variance` is how much of the total variance is explained by each of the PCs with respect to the whole (the sum). In our case, looking at the `PCA_high_correlation` table: 1.42² / (1.42² + 0.28²) = 2.02 / 2.09 ≈ 0.96. Notice we have now made the link between the variability of the principal components and how much variance is explained in the bulk of the data. Why is this link there?

The average is a linear combination of the original variables, where each of the p variables gets an equal weight of 1/p. A PC is also a linear combination, but instead of each of the original variables getting an equal weight, it gets some other weight coming from the PCA numerical procedure. We call those weights “loadings”, or “rotation”. Using those loadings we can “back out” the original variables. It is not a one-to-one mapping (so not the exact numbers of the original variables), but using all PCs we should get back numbers which are fully correlated (correlation = 1) with the original variables*. But what would the correlation be if we tried to “back out” using not all the PCs but only a subset? This is exactly where the variability of the PCs comes into play. If it is not entirely clear at this point, bear with me to the end. It will become clearer as you see the numbers.
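The “back out” claim can be verified directly: with `prcomp`, the scores are in `$x` and the loadings in `$rotation`, and using all PCs the product maps exactly back to the (centered) original variables. A toy check on made-up data:

```r
set.seed(3)
X <- matrix(rnorm(200), ncol= 2)            # made-up data
Xc <- scale(X, center= TRUE, scale= FALSE)  # centered original variables

pca <- prcomp(Xc)
# Using ALL PCs: scores %*% t(loadings) recovers the centered data exactly
back <- pca$x %*% t(pca$rotation)
all.equal(back, Xc, check.attributes= FALSE)   # TRUE
```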

Coming back to our two-variable PCA example. Take it to the extreme and imagine that the variance of the second PC is zero. This would mean that when we want to “back out” the original variables, only the first PC matters. Here is a plot to illustrate the movement of the two PCs in each of the PCAs that we did.

`PCA_high_correlation`

These are the cumulative sums of the two principal components. The shaded area is one standard deviation.

You can see that the first principal component is much more variable, so when we “backtrack” to the original variable space it is this first PC that almost completely tells us the overall movement in the original data space. By way of contrast, have a look at the two PCs from the `PCA_low_correlation`:

`PCA_low_correlation`

These are the cumulative sums of the two principal components. The shaded area is one standard deviation.

In this chart, as also seen from the third table in this post, the variability of the two PCs is much more comparable. It means that now, in order to “backtrack” to the original variable space, the first factor gives a lot of information, but we also need the second factor to genuinely map back to the original variable space.

You know what? Let’s not be lazy and actually do it. Let’s back out the original variables from the PCs and visualize how much we can tell by using just the first component, and how much we can tell by using both components. The way to back out the original variables (again, not a one-to-one mapping) is by using the rotation matrix:

```r
# First chart
back_x_using1 <- PCA_high_correlation$x[, 1] %*% t(PCA_high_correlation$rotation[, 1])
back_x_using2 <- PCA_high_correlation$x[, 1:2] %*% t(PCA_high_correlation$rotation[, 1:2])
plot(back_x_using1[, 1], dat[, "SPY"], ylab= "", main= "SPY backed out from the first PC")
plot(back_x_using1[, 2], dat[, "QQQ"], ylab= "", main= "QQQ backed out from the first PC")
plot(back_x_using2[, 1], dat[, "SPY"], ylab= "", main= "SPY backed out from both PCs")
plot(back_x_using2[, 2], dat[, "QQQ"], ylab= "", main= "QQQ backed out from both PCs")

# Second chart
back_x_using1 <- PCA_low_correlation$x[, 1] %*% t(PCA_low_correlation$rotation[, 1])
back_x_using2 <- PCA_low_correlation$x[, 1:2] %*% t(PCA_low_correlation$rotation[, 1:2])
plot(back_x_using1[, 1], dat[, "SPY"], ylab= "", main= "SPY backed out from the first PC")
plot(back_x_using1[, 2], dat[, "TLT"], ylab= "", main= "TLT backed out from the first PC")
plot(back_x_using2[, 1], dat[, "SPY"], ylab= "", main= "SPY backed out from both PCs")
plot(back_x_using2[, 2], dat[, "TLT"], ylab= "", main= "TLT backed out from both PCs")
```

`PCA_high_correlation`

**Top:** scatter plot of the original variables as backed out from the first PC over their actual values. **Bottom:** Of course, if you are using all PCs you will get back the original space.

`PCA_low_correlation`

**Top:** scatter plot of the original variables as backed out from the first PC over their actual values. **Bottom:** Of course, if you are using all PCs you will get back the original space.

Consider the four panels in each of the above charts. See how in the first chart there is a much stronger correlation between what we got using just the first PC and the actual values in the data. We almost don’t need the second factor in order to get an exact match in terms of the movement in the original space (here I hope you are glad you continued reading). Quite amazing if you think about it. You can see why this PCA business is so valuable; we have reduced the two variables to one almost without losing any information. Now think about the yield curve. Those monthly/yearly rates are highly correlated, which means we don’t need to work with so many series; we can work with one series (the first PC) without losing much information.

Back to our charts. In the second chart we still get a good feel for the data from the first PC, but it is not as strong as in the first case (upper chart). The reason is that the low correlation makes the data more difficult to summarize, if you will. We need more principal components to help us grasp the (co)movement in the original variable space.

* This is because PCA is usually centered PCA (we center the original variables, or center and scale them, which is like working with the correlation matrix instead of the covariance matrix).

Practical Guide To Principal Component Methods in R

Multivariate Statistical Analysis: A Conceptual Introduction

Principal component analysis: a review and recent developments