Understanding Pointwise Mutual Information

Intro

The term mutual information is drawn from the field of information theory. Information theory is concerned with the quantification of information. For example, a central concept in this field is entropy, which we have discussed before.

If you google the term “mutual information” you will land on a page which, if you can understand it, you probably had no need to google in the first place. For example:

Not limited to real-valued random variables and linear dependence like the correlation coefficient, mutual information (MI) is more general and determines how different the joint distribution of the pair (X,Y) is to the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).

which makes sense at first read only for those who don’t need to read it. That is the main motivation for this post: to provide a clear intuition behind the pointwise mutual information term and equations, for everyone. By the end of this page, you will understand what the mutual information metric actually measures, and how you should interpret it. We start with the easier concept of conditional probability and work our way through to the concept of pointwise mutual information.

Conditional probability

If you know the result of a fair, six-sided die is larger than 4, the probability that it is a 5 is 1/2, while if you don’t know the result is larger than 4, the probability remains 1/6. So knowing that the result is larger than 4 made a big difference for you in this case. But how big of a difference? We want to quantify how big this difference is compared to, say, knowing that the result of the die roll is larger than 2, or not knowing anything for that matter.

In the example above we implicitly used the conditional probability formula: $P(A \vert B) = \frac{P(A \cap B)}{P(B)}$, with $A$ being the event “result is equal to 5”, $B$ being the event “larger than 4”, and $A \cap B$ meaning that both A and B occurred simultaneously.
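To make the die example concrete, here is a minimal R sketch that enumerates the six outcomes and applies the conditional probability formula directly. The helper name `cond_prob` is ours, introduced only for illustration.

```r
# Enumerate the outcomes of a fair six-sided die; each has probability 1/6.
outcomes <- 1:6
prob <- rep(1 / 6, 6)

# Hypothetical helper: P(A | B) = P(A and B) / P(B),
# where a and b are logical vectors marking the outcomes in each event.
cond_prob <- function(a, b) sum(prob[a & b]) / sum(prob[b])

is_five <- outcomes == 5

p_unconditional <- sum(prob[is_five])            # no extra knowledge
p_given_gt4 <- cond_prob(is_five, outcomes > 4)  # result known to be larger than 4
p_given_gt2 <- cond_prob(is_five, outcomes > 2)  # result known to be larger than 2

print(c(p_unconditional, p_given_gt4, p_given_gt2))
```

Knowing that the result is larger than 4 moves the probability of a 5 from 1/6 to 1/2, while knowing only that it is larger than 2 moves it to 1/4.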

Pointwise mutual information

Those “events” above are just outcomes of random variables: what can happen? What is the probability of each of the possible outcomes? If we denote the outcomes of those random variables as x and y, the formula for pointwise mutual information is very closely related to that of conditional probability. The link between conditional probability and mutual information is your main engine for understanding this topic. The formula for pointwise mutual information is

$\operatorname {pmi} (x;y)\equiv \log {\frac {p(x,y)}{p(x)p(y)}}.$

Forget about the log operator for a second. Let’s massage this formula:

$\frac {p(x,y)}{p(x)p(y)} = \frac {p(x,y)}{p(x)} \times \frac{1}{p(y)} = p(y \vert x) \times \frac{1}{p(y)} = p(x \vert y) \times \frac{1}{p(x)}.$

Let’s focus on the last expression. As you can see, it’s the conditional probability of X given Y times $\frac{1}{p(x)}$. If Y and X are independent, $p(x \vert y)$ equals $p(x)$, and in that case you would have $p(x) \times \frac{1}{p(x)} = 1$ and $\operatorname{pmi} = \log(1) = 0$.

How “important” is the event $X = x$? If $P(X = x) = 1$ then the event X=x is not really important, is it? Think of a die which always rolls the same number; there is no point in considering it. But if the event $X = x$ is fairly rare → p(x) is relatively low → $\frac{1}{p(x)}$ is relatively high → the value of $p(x \vert y)$ becomes much more important in terms of information. So that is the first observation regarding the PMI formula.
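The scaling by $\frac{1}{p(x)}$ can be seen in a small R sketch. The helper `pmi_from_conditional` and the probabilities below are made up for illustration, and we use a base-2 log so the result is in bits.

```r
# Hypothetical helper: PMI in bits, written as the scaled conditional
# probability p(x | y) * 1 / p(x).
pmi_from_conditional <- function(p_x_given_y, p_x) log2(p_x_given_y / p_x)

# Independence: p(x | y) equals p(x), the ratio is 1 and the PMI is 0.
pmi_independent <- pmi_from_conditional(p_x_given_y = 0.5, p_x = 0.5)

# Same conditional probability, but now X = x is rare (p(x) = 0.05):
# 1 / p(x) is large, so the same p(x | y) carries much more information.
pmi_rare <- pmi_from_conditional(p_x_given_y = 0.5, p_x = 0.05)

print(c(pmi_independent, pmi_rare))
```

With the same conditional probability of 0.5, the PMI is 0 when p(x) is also 0.5, and log2(10), about 3.32 bits, when p(x) is 0.05.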

We are about half way.

A practical example and some additional intuition

In this code we pull some ETF data from Yahoo for TLT (US treasury bonds) and SPY (US S&P 500 stocks), and create a series of daily returns for each of the two tickers.


library(quantmod)  # getSymbols(), dailyReturn()
library(magrittr)  # the %>% pipe used further below
k <- 10  # number of years of history
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf <- c("TLT", "SPY")
l <- length(symetf)
w0 <- NULL
# Download each ticker and collect its daily returns column by column
for (i in 1:l) {
  dat0 <- getSymbols(symetf[i], src = "yahoo", from = start, to = end,
                     auto.assign = FALSE, warnings = FALSE, symbol.lookup = FALSE)
  w1 <- dailyReturn(dat0)
  w0 <- cbind(w0, w1)
}
time <- as.Date(substr(index(w0), 1, 10))
w0 <- as.matrix(w0) * 100  # returns in percent
colnames(w0) <- symetf
> tail(w0,3)
                 TLT         SPY
2020-01-22 0.3513343  0.01207606
2020-01-23 0.7001964  0.11468733
2020-01-24 0.8088548 -0.88930785

Now let’s define our random variables. X would be: “the return of TLT is below its 5% quantile”. The random variable Y would be: “the return of SPY is below its 5% quantile”, so two binary (Bernoulli) random variables.

Now, based on the pointwise mutual information formula we compute the PMI measure:


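alpha <- 0.05
TT <- NROW(w0)  # not defined in the original; assumed to be the number of daily observations
# Indicators for a "bad day": return below the series' own 5% quantile.
# The parentheses matter: %>% binds more tightly than <.
y <- (w0[,"SPY"] < quantile(w0[,"SPY"], probs = alpha)) %>% as.numeric
x <- (w0[,"TLT"] < quantile(w0[,"TLT"], probs = alpha)) %>% as.numeric
p_x <- sum(x)/TT
p_y <- sum(y)/TT
p_xy <- (x[y==1] %>% sum)/TT  # both below their quantile on the same day
> (p_xy/p_y)/p_x
[1] 0.3167045
> log2((p_xy/p_y)/p_x)
[1] -1.658791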

The PMI measure is about -1.66. What (the hell) does that mean?

The pointwise mutual information measure is not confined to the [0,1] range, so here we explain how to interpret a zero, a positive or, as in our case, a negative number. The case where PMI=0 is trivial. It occurs for log(1)=0, and it means that $p(x,y) = p(x)p(y)$, which tells us that x and y are independent. If the number is positive, the two events co-occur at a frequency higher than what we would expect if they were independent. Why? Because $p(y \vert x) \times \frac{1}{p(y)}$ (or equivalently $p(x \vert y) \times \frac{1}{p(x)}$) is larger than 1; if it is smaller than 1, the log is negative. In our case the ratio is lower than one, meaning $p(y \vert x) < p(y)$: we see Y=y less often once we know that X=x than we see it overall.

Let’s talk numbers to make it more tangible. The individual probabilities are p_x = p_y = roughly 5% (by construction here). If the events/variables are independent we would expect to see both occur simultaneously around 0.05^2 = 0.25% of the time. Instead we see those events co-occur only
> p_xy * 100
[1] 0.07952286
so only 0.08% of the time. So we see this joint event roughly one third as often as we would expect if the events were independent (approximately 0.08/0.25). This 0.317 figure is what goes into the log operator and produces the negative number:
> (p_xy/p_y)/p_x
[1] 0.3167045
> log2((p_xy/p_y)/p_x)
[1] -1.658791
Practically, it means that the number of days on which both stocks and bonds are having a bad day (being below their 5% quantile) is much lower than we would expect from how often each has a bad day individually. So seeing a bad day for one of them does not drag in a bad day for the other; on the contrary. This makes sense given the bonds-as-a-hedge-against-stock-market-doom textbook argument.
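As a quick sanity check on the arithmetic, the short R sketch below reuses only the numbers reported in the output above. The variable names are ours, and the 5% marginals are the rounded values mentioned in the text, so the back-of-the-envelope ratio is close to, but not exactly, the reported one.

```r
# Values copied from the output reported above.
p_xy_reported <- 0.0007952286  # joint frequency (0.07952286%)
ratio_reported <- 0.3167045    # p(x, y) / (p(x) * p(y))

# Under independence, with both marginals at roughly 5%:
p_xy_if_independent <- 0.05^2  # 0.0025, i.e. 0.25%

# Back-of-the-envelope ratio using the rounded 5% marginals.
rough_ratio <- p_xy_reported / p_xy_if_independent

# PMI in bits implied by the reported ratio.
pmi_bits <- log2(ratio_reported)

print(round(c(rough_ratio, pmi_bits), 3))
```

A ratio of about one third corresponds to a PMI of about -1.66 bits: the joint event shows up roughly 2^1.66, or about three, times less often than independence would suggest.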

Summary

The pointwise mutual information can be understood as a scaled conditional probability.

The pointwise mutual information is a quantified measure of how much more or less likely we are to see the two events co-occur, given their individual probabilities, relative to the case where the two are completely independent.

Information Theory: A Tutorial Introduction

6 comments on “Understanding Pointwise Mutual Information in Statistics”

Dorian says:

07/22/2020 at 1:43 PM

When you say: “If Y and X are independent, there is no meaning to the multiplication (it’s going to be zero times something).”

Do you actually mean “If Y and X are mutually exclusive” ?

1. Eran Raviv says:
  
  07/30/2020 at 5:51 PM
  
  I don’t
  
Abbas says:

09/04/2020 at 12:04 PM

In the section: Pointwise mutual information your chain rule formula has a tiny mistake:
P(x,y)/P(x) = P(y/x) and is not equal to P(x/y)

Andrea says:

09/29/2020 at 1:12 AM

Thank you for the very helpful explanation. Two corrections:

+1 to Abbas’s comment. P(y|x) and P(x|y) should be swapped in the section where you work out the PMI formula.

1. Eran Raviv says:
  
  10/02/2020 at 3:11 PM
  
  Andrea, Dorian and Abbas,
  Now updated. Sincere thanks.
  Eran
