Understanding Word Embeddings (1) – Algebra

Some time back I took the time to explain that matrix multiplication can be viewed as a linear transformation. Having that perspective helps to grasp the inner workings of AI models across various domains (audio, images, etc.). Building on that, the next couple of posts will help you understand the inputs used in those matrix multiplications, specifically for those who want to understand how text-based models and LLMs function. Our focus is on the infamous one-hot encoding, as it is the key to unlocking the underlying theory. It will provide, I hope, the often-elusive intuition behind word embeddings.

I use the following simple example sentences:

S1: “The dog chased the cat”
S2: “The cat watched the bird”

For simplicity I use the term ‘words’, but in practice we use the more general term ‘tokens’, which can stand for anything, for example question marks or word fragments, rather than whole words. Still, we proceed as if our tokens are just whole words. The starting point for feeding these texts to the machine is their conversion to numerical values. Text-to-numbers conversion can be done in a few different ways, which might seem a bit confusing for beginners. Here is a summary table to make sure we are all discussing the same thing.

| Method | Definition | Structure | Information Preserved | Example |
| --- | --- | --- | --- | --- |
| One-Hot Encoding | Each unique word as a sparse vector with a single “1” | Vector with length = vocabulary size | Identity only | “cat” = [0,0,0,1,0,0] |
| Sequence of One-Hot Vectors | Sentence as a sequence of one-hot vectors | Matrix (which can be flattened) | Identity and order | S1 = [1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0], [1,0,0,0,0,0], [0,0,0,1,0,0] |
| Bag-of-Words | Counts word frequencies in a single document | Vector with length = vocabulary size | Frequencies, without order | S1 = [2,1,1,1,0,0], S2 = [2,0,0,1,1,1] |
| Term-Document Matrix | Bag-of-words vectors for multiple documents or sentences concatenated as columns (or rows) | Matrix: vocabulary size × documents | Word frequencies across the corpus | See the word-by-sentence counts below |

| Word | S1 | S2 |
| --- | --- | --- |
| the | 2 | 2 |
| dog | 1 | 0 |
| chased | 1 | 0 |
| cat | 1 | 1 |
| watched | 0 | 1 |
| bird | 0 | 1 |
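
To make the table concrete, here is a minimal Python sketch (my own illustration, not tied to any particular library) that builds these representations for S1 and S2. The vocabulary order, taken as order of first appearance, is an assumption that happens to match the table above.

```python
# A minimal sketch (plain Python) of the representations in the table above.
# Assumed vocabulary order = order of first appearance: the, dog, chased, cat, watched, bird.

s1 = "the dog chased the cat".split()
s2 = "the cat watched the bird".split()

vocab = []
for word in s1 + s2:
    if word not in vocab:
        vocab.append(word)  # ['the', 'dog', 'chased', 'cat', 'watched', 'bird']

def one_hot(word):
    """One-hot encoding: a single 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def bag_of_words(sentence):
    """Bag-of-words: word counts, order discarded."""
    return [sentence.count(word) for word in vocab]

sequence_s1 = [one_hot(w) for w in s1]               # sequence of one-hot vectors (identity + order)
bow_s1, bow_s2 = bag_of_words(s1), bag_of_words(s2)  # [2,1,1,1,0,0] and [2,0,0,1,1,1]
term_document = [list(row) for row in zip(bow_s1, bow_s2)]  # vocabulary size x documents

print(one_hot("cat"))   # [0, 0, 0, 1, 0, 0]
print(term_document)    # [[2, 2], [1, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
```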

As mentioned, our focus here is on one-hot encoding (the first row in the table above). The size of the complete one-hot matrix is Total Words × Unique Words, where:

  • Total Words = The sum of all word occurrences across all documents/sentences
  • Unique Words = The number of different words in the vocabulary

In the one-hot matrix each row corresponds to a single word occurrence in the corpus, each column corresponds to a unique word in the vocabulary, and each row contains exactly one “1” with zeros elsewhere. As the corpus grows, the matrix becomes very large and increasingly sparse, which is why one-hot encoding in itself is usually a no-go for direct storage and manipulation in large-scale applications. But it’s crucial for grasping the concepts we’ll build on later. The following table illustrates how sparsity grows with dimensionality:

| Corpus Size | Total Words | Unique Words | Matrix Size | Non-Zero Elements | Non-Zero Proportion |
| --- | --- | --- | --- | --- | --- |
| Small | 10 | 7 | 10 × 7 = 70 | 10 | 14.3% |
| Medium | 1,000 | 200 | 1,000 × 200 = 200,000 | 1,000 | 0.5% |
| Large | 1,000,000 | 100,000 | 10^6 × 10^5 = 100 billion | 10^6 | 0.001% |
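
A quick aside on the last column: since each row holds exactly one non-zero entry, the non-zero proportion is Total Words / (Total Words × Unique Words) = 1 / Unique Words. The following sketch, plain Python reproducing the table's arithmetic, makes that explicit:

```python
# Reproduces the sparsity figures in the table above.
# Each row of the one-hot matrix has exactly one non-zero entry,
# so the non-zero proportion collapses to 1 / unique_words.
corpora = {
    "Small":  (10, 7),
    "Medium": (1_000, 200),
    "Large":  (1_000_000, 100_000),
}

for name, (total_words, unique_words) in corpora.items():
    matrix_size = total_words * unique_words
    non_zero = total_words                   # one "1" per row
    proportion = non_zero / matrix_size      # equivalently 1 / unique_words
    print(f"{name}: {total_words} x {unique_words} = {matrix_size:,} cells, "
          f"{proportion:.4%} non-zero")
```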

While the one-hot matrix is not very practical in itself, all word embeddings originate from it. Heads up: algebra ahead 📐!

Why is the one-hot matrix the foundation for embedding learning?

The one-hot vectors – the rows of the one-hot matrix – form a basis in the algebraic sense. I will first explain what being a basis means, and then we will see why it is relevant for word embeddings. Being a basis means that:

  1. The vectors in that matrix are linearly independent
  2. Any other vector you can think of can be expressed as a linear combination of the one-hot vectors. More formally, the one-hot vectors span the vector space.

In less abstract terms, think of the one-hot vectors as colors on a “palette.” A complete palette, capable of making any color, has two properties: completeness – it “spans” all colors – and non-redundancy – its colors are “independent.” Red, yellow, blue, white, and black are complete and non-redundant. In contrast, red, blue, and white are incomplete, missing colors like yellow and green. The set red, orange, yellow, green, blue, purple, white, and black has some redundancy; for example, purple can be created by mixing red and blue.

Returning to our usual formal perspective:

  1. Linear dependence means one vector can be written exactly as a combination of the others. Saying that two vectors are linearly dependent is like saying “one vector is entirely predictable from the other,” like a perfect correlation. When vectors are linearly independent, no vector “explains” or “duplicates” another – similar to uncorrelated variables. To show that the one-hot vectors are linearly independent we use the zero vector (the vector whose entries are all zero). The zero vector can always be created as a linear combination of any set of vectors – trivially, by setting all the weights (coefficients) to zero. If that trivial combination is the only way to obtain the zero vector, then no vector can be “cancelled” by the others, meaning there is no redundancy – or, put differently, no linear dependence.

    Example

    Take these three one-hot vectors in \mathbb{R}^3:

    • v1 = [1, 0, 0]
    • v2 = [0, 1, 0]
    • v3 = [0, 0, 1]

    And try for:

    a v1 + b v2 + c v3 = [0, 0, 0]
    a[1, 0, 0] + b[0, 1, 0] + c[0, 0, 1] = [a, b, c] = [0, 0, 0]

    For the above to hold, the only possibility is:

    • a = 0
    • b = 0
    • c = 0

    The only way to make the zero vector is by setting all coefficients to zero. So there is your linear independence; now let’s move on to spanning.

  2. For a set of vectors to span the vector space, you must show that every vector in that space can be written as a linear combination of the given set. The standard basis vectors (the one-hot vectors in our case) \{e_1, e_2, \dots, e_n\} span \mathbb{R}^n (\mathbb{R}^n is the space of real-valued vectors of length n, where n should be thought of as the size of the vocabulary) because any vector x \in \mathbb{R}^n, where x = [x_1, x_2, \dots, x_n]^T, can be written as a linear combination of the basis vectors. Here is why:

    Define a_i = x_i for each i = 1, \dots, n. Then:

        \[ x = \sum_{i=1}^{n} a_i e_i = \sum_{i=1}^{n} x_i e_i \]

    In words, you weight each basis vector e_i by the coefficient a_i = x_i; since e_i is all zeros except for a single 1 in position i, multiplying it by x_i places the value x_i exactly where it belongs. This shows that x can be expressed as a linear combination of \{e_1, \dots, e_n\}, so they span \mathbb{R}^n. Meaning: all embedding vectors, even those in a lower-dimensional space, originate as linear combinations of the one-hot vectors. (A small numerical check of both properties follows right after this list.)
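
Both properties are easy to verify numerically. Below is a minimal sketch – using numpy, which is my choice here rather than anything prescribed above – for a toy vocabulary of size n = 6: the rank test mirrors the linear-independence argument in item 1, and the reconstruction mirrors the spanning argument in item 2.

```python
import numpy as np

n = 6                  # vocabulary size of our toy corpus
E = np.eye(n)          # the one-hot (standard basis) vectors, one per row

# Linear independence: the only way to combine the rows of E into the zero
# vector is with all-zero coefficients, which is equivalent to E having full rank.
assert np.linalg.matrix_rank(E) == n

# Spanning: any vector x is rebuilt as sum_i x_i * e_i, i.e. the coefficients
# are simply the entries of x itself.
x = np.array([0.3, -1.2, 0.0, 2.5, 0.7, -0.4])   # an arbitrary "embedding-like" vector
reconstruction = sum(x[i] * E[i] for i in range(n))
assert np.allclose(reconstruction, x)
```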

While one-hot encoding isn’t practical for real-world use, understanding its algebra explains why word embeddings actually work, and why it’s theoretically legitimate to shift from that large, discrete, and sparse space to a smaller, dense, continuous, and semantically relevant space.
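
Concretely, that shift is a single matrix multiplication: multiplying a one-hot row vector by an embedding matrix W of shape vocabulary size × embedding dimension simply selects the corresponding row of W, which is exactly what an embedding lookup does. Here is a minimal sketch; W is filled with random numbers purely for illustration, and the embedding dimension of 3 is an arbitrary choice.

```python
import numpy as np

vocab = ["the", "dog", "chased", "cat", "watched", "bird"]
embedding_dim = 3                                   # illustrative; real models use hundreds
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), embedding_dim))    # embedding matrix: one dense row per word

one_hot_cat = np.zeros(len(vocab))
one_hot_cat[vocab.index("cat")] = 1.0

# Multiplying a one-hot vector by W is a linear combination of W's rows with all
# weights zero except one -- it simply picks out the embedding row for "cat".
assert np.allclose(one_hot_cat @ W, W[vocab.index("cat")])
```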

With algebra covered, the next post will explore the geometric interpretation of word embeddings.
