Dot Product in the Attention Mechanism

The dot product of two embedding vectors \mathbf{x} and \mathbf{y} with dimension d is defined as

    \[\mathbf{x} \cdot \mathbf{y} = x_1y_1 + x_2y_2 + \dots + x_dy_d.\]

Hardly the first thing that jumps to mind when thinking about a “similarity score”. Indeed, the result of a dot product is a single number (a scalar) with no predefined range (e.g. it is not confined between zero and one), so it’s hard to tell whether a particular score is high or low on its own. Still, the Transformer family of deep learning models relies heavily on the dot product in the attention mechanism to weigh the importance of different parts of the input sentence. This post explains why the dot product, which seems like an odd pick for a “similarity score”, actually makes good sense.
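
To see this concretely, here is a minimal NumPy sketch (the vectors are made up purely for illustration) showing that the dot product is just one unbounded scalar:

    import numpy as np

    # Two made-up embedding vectors of dimension d = 4.
    x = np.array([0.2, -1.3, 0.7, 2.1])
    y = np.array([1.0,  0.4, -0.6, 3.5])

    # x1*y1 + x2*y2 + ... + xd*yd -- a single scalar.
    score = np.dot(x, y)
    print(score)              # ~6.61

    # Scaling either vector scales the score too: there is no fixed range.
    print(np.dot(10 * x, y))  # ten times larger, ~66.1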

Dot Product and Vector Similarity

I assume here you already understand the geometry of word embeddings.

Consider the following cases: two vectors pointing the same way (a 0° angle), where we want a maximum similarity score; two vectors pointing in opposite ways (a 180° angle), where we want a minimum (least similar) score; and two perpendicular vectors (a 90° angle), where we want a similarity score of zero, meaning no relationship.

If we just used the angle itself, the numbers would be backward: 180° (least similar) is a higher number than 0° (most similar). That’s not at all intuitive for a “similarity” scale. The solution: feed the angle to the cosine function. This gives us exactly the order we want:

\theta (degrees) \cos(\theta)
180 -1.000
150 -0.866
120 -0.500
90 0.000
60 0.500
30 0.866
0 1.000
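
(Just as a sanity check, the table above can be reproduced with a couple of lines of NumPy:)

    import numpy as np

    # Cosine maps angles to exactly the similarity ordering we want.
    for deg in [180, 150, 120, 90, 60, 30, 0]:
        print(deg, round(np.cos(np.radians(deg)), 3))
    # 180 -1.0, 150 -0.866, 120 -0.5, 90 0.0, 60 0.5, 30 0.866, 0 1.0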

The formula to find the cosine of the angle between two vectors \mathbf{x} and \mathbf{y} of size d is:

    \[\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|} = \frac{\sum\limits_{i=1}^{d} x_i y_i}{\sqrt{\sum\limits_{i=1}^{d} x_i^2}\,\sqrt{\sum\limits_{i=1}^{d} y_i^2}}.\]

Manipulating the above just a bit to get our dot product friend on the left-hand side gives us:

    \[\mathbf{x} \cdot \mathbf{y} = \|\mathbf{x}\| \left( \|\mathbf{y}\| \cos(\theta) \right).\]

Now, what is \|\mathbf{y}\|\cos(\theta)? It is the magnitude of \mathbf{y}, \|\mathbf{y}\| = \sqrt{y_1^2 + y_2^2 + \cdots + y_d^2}, scaled by how closely the two vectors point in the same direction. Put differently: how much of \mathbf{y} lies in the direction of \mathbf{x}? If I project \mathbf{y} onto \mathbf{x}, how much is “captured”? The answer is exactly \|\mathbf{y}\| \cos(\theta). We then multiply that by the magnitude of \mathbf{x}, \|\mathbf{x}\|, so as to account for its magnitude as well.
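
Here is a small numerical sketch of this projection view (again with made-up vectors), checking that the dot product equals the magnitude of \mathbf{x} times the length of \mathbf{y}’s projection onto \mathbf{x}:

    import numpy as np

    # Arbitrary vectors, purely for illustration.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.5, -1.0])

    x_norm = np.linalg.norm(x)
    y_norm = np.linalg.norm(y)

    # cos(theta) via the formula above.
    cos_theta = np.dot(x, y) / (x_norm * y_norm)

    # Length of y's projection onto x's direction: how much of y is "captured".
    proj_len = np.dot(y, x / x_norm)

    print(np.isclose(proj_len, y_norm * cos_theta))     # True
    print(np.isclose(np.dot(x, y), x_norm * proj_len))  # True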

Can a Vector Be More “Similar” to Others Than to Itself???

If you google self-attention matrix images, you will see that the diagonal is usually quite high, for example:

[Figure: an example self-attention score matrix, with a bright diagonal.]

That is because, as we mentioned, when the angle is zero the cosine is 1, and the score boils down to multiplying the magnitudes of the two embedding vectors.

That said, I got to thinking about this topic because of an observation: while a vector should theoretically be most similar to itself, that’s not always what I saw in the images from papers or the ones I generated myself (the diagonal does not always hold the highest scores, as in the figure above). So it is possible for other vectors to receive a higher similarity score with a given vector than that vector does with itself, even in self-attention matrices. This is because the embedding vectors are not necessarily normalized to a consistent length. Even if two vectors point in almost the same direction (a very small angle), the sheer “size” of one of them can make their similarity score surprisingly high. This is why I framed the dot product as a projection times a magnitude above.
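
A tiny (made-up) example makes this concrete: below, \mathbf{b} points almost exactly where \mathbf{a} points but is roughly three times longer, so \mathbf{a} \cdot \mathbf{b} beats \mathbf{a} \cdot \mathbf{a}, even though nothing is “more similar” to \mathbf{a} than \mathbf{a} itself in the angular sense.

    import numpy as np

    a = np.array([1.0, 1.0])
    b = np.array([3.0, 3.1])   # nearly the same direction as a, but much longer

    print(np.dot(a, a))  # 2.0  (the "diagonal" score)
    print(np.dot(a, b))  # 6.1  (an off-diagonal score that is higher)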

By the way, that’s also the reason for the normalizing factor \sqrt{d_k}. As d_k grows, so does the variability of the dot product; the scores can become very large simply because of the magnitudes of the vectors themselves (a larger d_k means more elements to square and sum). It’s not good to feed large numbers to the softmax, because the exponentiation e^{(\cdot)} will explode, so before we exponentiate we “stabilize” the dot product by dividing it by \sqrt{d_k}. Why \sqrt{d_k} specifically? No deep reason; you could choose another scaling term if you wished. The story is that \sqrt{d_k} is the standard deviation of the dot product if you assume the entries of the embedding vectors are independent with zero mean and unit variance, e.g. normally distributed (they are not, but this Gaussian assumption provides a good excuse for using \sqrt{d_k}).
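
Here is a quick empirical sketch of that scaling argument (random Gaussian vectors, just to mirror the assumption above): the standard deviation of the raw dot product grows like \sqrt{d_k}, which is exactly what the division undoes.

    import numpy as np

    rng = np.random.default_rng(0)

    # For entries with zero mean and unit variance, the dot product of two
    # d_k-dimensional vectors has a standard deviation of about sqrt(d_k).
    for d_k in [16, 64, 256, 1024]:
        q = rng.standard_normal((10_000, d_k))
        k = rng.standard_normal((10_000, d_k))
        scores = (q * k).sum(axis=1)
        print(d_k, round(scores.std(), 1), round(np.sqrt(d_k), 1))

    # This is why attention computes softmax(Q @ K.T / sqrt(d_k)): the division
    # keeps the scores at a comparable scale regardless of d_k.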
