
Illustrate! Step by step to understand the math of Transformers

Author: AI Observation Room

The transformer architecture may seem intimidating, and you may have seen various explanations on YouTube or in blogs. In this post, I will illustrate how it works by walking through a comprehensive mathematical example. My hope is that this simplifies the understanding of the transformer architecture.

Then let's get started!

Inputs and Positional Encoding

Let's work through the initial part, where we define our inputs and calculate their positional encodings.


Step 1 (Defining the data)

The first step is to define our dataset (corpus).


In our dataset, there are 3 sentences (dialogue) taken from the Game of Thrones TV series. Although this dataset may seem small, it is enough to help us understand the mathematical formulas that follow.

Step 2 (Finding the Vocab Size)

To determine the vocabulary size, we need to count the total number of unique words in the dataset. This is essential for encoding (that is, converting data into numbers).


We will break our dataset down into a list of tokens, represented as N, where each word is a single token.


Once we have the list of tokens (denoted as N), we can apply a formula to calculate the vocabulary size.

The formula is as follows:

vocab_size = count(set(N))

Using the set operation removes duplicates; counting the remaining unique words gives the vocabulary size. Therefore, the vocabulary size is 23, because there are 23 unique words in the given list.
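Here is a minimal Python sketch of Steps 1 and 2. The article's actual corpus appears only in its figures, so the three sentences below are stand-ins (only the first one, used as the input later, is confirmed by the text), and the printed vocabulary size reflects this stand-in corpus rather than the article's 23.

```python
# A minimal sketch of Steps 1-2, assuming a stand-in corpus:
# tokenize the sentences, then use set() to count unique words.
corpus = [
    "When you play game of thrones",   # the input sentence used later
    "you win or you die",              # hypothetical second sentence
    "there is no middle ground",       # hypothetical third sentence
]

# Break the dataset down into a flat list of tokens, N.
N = [word.lower() for sentence in corpus for word in sentence.split()]

# set() removes duplicates; the number of unique tokens is the vocab size.
vocab = sorted(set(N))
vocab_size = len(vocab)
print(f"{len(N)} tokens, vocab size {vocab_size}")
```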

Step 3 (Encoding and Embedding)

Next, each unique word in the dataset is assigned a unique integer.


After encoding our entire dataset, it's time to select our inputs. We will choose a sentence from the corpus to start:

“When you play game of thrones”


Each word passed as input is represented by its encoding, and each integer value has an associated embedding vector.

  • These embeddings can be found using Google's Word2vec (vector representations of words). In our numerical example, we will assume that the embedding vector for each word is populated with random values between 0 and 1.
  • Furthermore, the original paper uses embedding vectors of dimension 512; for our numerical example we will use a very small dimension, 5.

Now each word is represented by a 5-dimensional embedding vector, with the Excel function RAND() used to fill in the values with random numbers.
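Continuing the sketch, we can assign each unique word an integer id and fill a 5-dimensional embedding table with random values between 0 and 1, mirroring the RAND() fill (vocab and vocab_size come from the previous snippet):

```python
import numpy as np

d_model = 5  # the article's toy embedding dimension (512 in the paper)

# Assign each unique word an integer id.
word_to_id = {word: i for i, word in enumerate(vocab)}

# Random embedding table with values in (0, 1), one row per vocab word.
rng = np.random.default_rng(0)
embedding_table = rng.random((vocab_size, d_model))

# Look up the embeddings for our chosen input sentence.
sentence = "when you play game of thrones".split()
ids = [word_to_id[w] for w in sentence]
E = embedding_table[ids]  # shape (6, 5): one 5-dim vector per word
```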

Step 4 (Positional Embedding)

Let's consider the first word, "when", and calculate the position embedding vector for it. There are two formulas for position embedding:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

The pos value of the first word, "when", will be zero, since it is at the starting index of the sequence. The value of i (even or odd) determines which of the two formulas is used to calculate the PE value, and d_model is the dimension of the embedding vector, which in our case is 5.
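For example, for "when" (pos = 0) every sine term is sin(0) = 0 and every cosine term is cos(0) = 1, so its 5-dimensional positional vector is:

PE("when") = [0, 1, 0, 1, 0]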


Continuing to calculate the position embedding, we will assign a pos value of 1 to the next word "you" and continue to increment the pos value for each subsequent word in the sequence.


Once we have found the positional embeddings, we can add them to the original word embeddings.


The resulting vectors are e1+p1, e2+p2, e3+p3, and so on.
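Here is a minimal sketch of Step 4 in code, computing the sinusoidal positional encodings and adding them to the embedding matrix E from the earlier snippet:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: sine at even indices, cosine at odd."""
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model):
            # Exponent 2i/d_model, where i // 2 pairs up even/odd indices.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            PE[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return PE

P = positional_encoding(seq_len=6, d_model=5)
X = E + P  # rows are e1+p1, e2+p2, ... — the encoder's input
```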


The output of this initial part of the transformer architecture will serve as the input to the encoder.

Encoder

In the encoder, we perform complex operations involving the query, key, and value matrices. These operations are critical for transforming the input data and extracting meaningful representations.


Within the multi-head attention mechanism, a single attention layer consists of several key components. These components include:


Note that the yellow box represents the single-head attention mechanism; what makes it multi-head attention is the stacking of multiple yellow boxes. For the sake of the example, we will consider only a single-head attention mechanism, as shown in the image above.

Step 1 (Performing Single Head Attention)

The attention layer has three inputs:

  • Query
  • Key
  • Value

In the figure provided above, the three input matrices (pink) represent the transposed output obtained from the previous step of adding the positional embeddings to the word embedding matrix. The linear weight matrices (yellow, blue, and red), on the other hand, represent the weights used in the attention mechanism. The columns of these matrices can have any number of dimensions, but their number of rows must match the number of columns of the input matrix they multiply. In our case, we will assume that the linear matrices (yellow, blue, and red) contain random weights. These weights are usually initialized randomly and then adjusted during training through techniques such as backpropagation and gradient descent. So let's calculate the Query, Key, and Value matrices.
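A minimal sketch of this projection, assuming random weights and a per-head dimension of 3 (matching the (6x3) head outputs mentioned below), with X being the (6, 5) encoder input from the earlier snippet:

```python
import numpy as np

d_model, d_k = 5, 3  # d_k = 3 matches the (6x3) head output below
rng = np.random.default_rng(1)

# Randomly initialized linear weights (the yellow, blue, and red matrices).
W_Q = rng.random((d_model, d_k))
W_K = rng.random((d_model, d_k))
W_V = rng.random((d_model, d_k))

Q = X @ W_Q  # shape (6, 3)
K = X @ W_K  # shape (6, 3)
V = X @ W_V  # shape (6, 3)
```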


Once we have the query, key, and value matrices in the attention mechanism, we move on to additional matrix multiplication.

First, the query matrix is multiplied by the transpose of the key matrix. The result is then scaled by the square root of the key dimension (√d_k, where d_k is the number of columns of the key matrix), and a row-wise softmax is applied to turn the scaled scores into attention weights:

softmax(Q · Kᵀ / √d_k)

Now, we multiply the result matrix by the matrix of values we calculated earlier:

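Putting these steps together, here is a minimal sketch of scaled dot-product attention, following the standard softmax(Q·Kᵀ/√d_k)·V formulation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (6, 6) scaled attention scores
    # Row-wise softmax (shifted by the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # (6, 3) single-head output

head_output = scaled_dot_product_attention(Q, K, V)
```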

If we have multiple attention heads, each producing a matrix of dimension (6x3), the next step is to concatenate these matrices together.


In the next step, we again perform a linear transformation, similar to the process used to obtain the query, key, and value matrices. This linear transformation is applied to the concatenated matrix obtained from the multiple attention heads.

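As a rough sketch of these two steps (the head count and the output weight matrix W_O here are illustrative assumptions, not values from the article):

```python
# Two hypothetical heads, each reusing the projection-plus-attention
# recipe from the earlier snippets.
num_heads = 2
heads = []
for _ in range(num_heads):
    Qh, Kh, Vh = (X @ rng.random((5, 3)) for _ in range(3))
    heads.append(scaled_dot_product_attention(Qh, Kh, Vh))

concat = np.concatenate(heads, axis=-1)  # shape (6, 3 * num_heads)
W_O = rng.random((concat.shape[-1], 5))  # final linear transformation
multi_head_output = concat @ W_O         # shape (6, 5), back to d_model
```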
