Decoding Transformers for Natural Language Processing

Since its introduction in 2017, the Transformer has become a major force in machine learning, revolutionizing the capabilities of translation and autocomplete services.

More recently, the Transformer's popularity has soared even further with the advent of large language models such as OpenAI's ChatGPT and GPT-4 and Meta's LLaMA. These models, which have generated enormous attention and excitement, are all built on the Transformer architecture. By harnessing the power of Transformers, they have achieved significant breakthroughs in natural language understanding and generation.

Although there are many good resources explaining how the Transformer works, I found myself in a situation where I understood how its mechanics worked mathematically, but found it difficult to explain visually how a Transformer works.

In this blog post [1], my goal is to provide a high-level explanation of how Transformers work without leaning heavily on code or mathematics, and to avoid confusing jargon and comparisons with previous architectures. While I'll try to keep things as simple as possible, this isn't easy, because Transformers are quite complex, but I hope it gives people a better intuition for what they do and how they do it.

What is a Transformer?

A Transformer is a neural network architecture that is well suited to tasks that involve processing sequences as input. Perhaps the most common example of a sequence in this context is a sentence, which we can think of as an ordered collection of words.

The purpose of these models is to create a numerical representation of each element in the sequence that encapsulates the essential information about that element and its surrounding context. The resulting numerical representations can then be passed to downstream networks, which can use this information to perform various tasks, including generation and classification.

By creating such rich representations, these models enable downstream networks to better understand the underlying patterns and relationships in the input sequence, enhancing their ability to produce coherent and contextually relevant outputs.

The main advantages of Transformers are their ability to handle long-range dependencies within sequences and their efficiency: they can process sequences in parallel. This is especially useful for tasks such as machine translation, sentiment analysis, and text generation.

What is attention?

Perhaps the most important mechanism used by the Transformer architecture is known as attention, which enables the network to understand which parts of the input sequence are most relevant to a given task. For each token in the sequence, the attention mechanism identifies which other tokens are important for understanding the current token in its given context. Before we explore how this is achieved in a Transformer, let's start simple and try to understand what the attention mechanism is trying to accomplish conceptually, to build our intuition.

One way to understand attention is to think of it as a method of replacing each token embedding with an embedding that contains information about its neighboring tokens, instead of using the same embedding for every token regardless of its context. If we knew which tokens were relevant to the current token, one way to capture this context would be to create a weighted average, or more generally a linear combination, of these embeddings.

Let's consider a simple example of how this might look for one of the sentences we saw earlier. Before attention is applied, the embeddings in the sequence have no context of their neighbors. Therefore, we can visualize the embedding for the word 'light' as a trivial linear combination of the sequence's embeddings.

Here, the weights are just the identity matrix: each token keeps only its own embedding. After applying our attention mechanism, we would like to learn a weight matrix such that we can express our 'light' embedding as a richer combination of its neighbors.

This time, larger weights are given to the embeddings that correspond to the most relevant parts of the sequence for our chosen token; this should ensure that the most important context is captured in the new embedding vector. Embeddings that contain information about their current context are sometimes known as contextual embeddings, and these are ultimately what we are trying to create.
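
To make this concrete, here is a tiny numeric sketch (the embeddings and weights are made up purely for illustration) showing that a contextual embedding is just a linear combination of the sequence's embeddings: first with identity weights, then with a hypothetical set of learned weights.

```python
import torch

# Toy embeddings for a 4-token sequence, embedding dimension 3 (made-up values).
embeddings = torch.tensor([
    [1.0, 0.0, 0.0],   # token 0
    [0.0, 1.0, 0.0],   # token 1 ("light" in our example)
    [0.0, 0.0, 1.0],   # token 2
    [0.5, 0.5, 0.0],   # token 3
])

# Before attention: identity weights, so each token keeps only its own embedding.
identity_weights = torch.eye(4)
print(identity_weights @ embeddings)          # unchanged embeddings

# After attention (hypothetical learned weights for token 1): a weighted average
# that mixes in the most relevant neighbors, giving a contextual embedding.
weights_for_token_1 = torch.tensor([0.10, 0.60, 0.05, 0.25])
print(weights_for_token_1 @ embeddings)       # contextual embedding for "light"
```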

How is attention calculated?

There are several types of attention, and the main differences between them lie in how the weights used to perform the linear combination are calculated. Here, we will consider scaled dot-product attention, as introduced in the original paper, since this is the most common approach. Throughout this section, assume that all of our embeddings have been positionally encoded.

Recall that our goal is to create contextual embeddings using linear combinations of our original embeddings. Let's start simple and assume that we can encode all of the necessary information into our learned embedding vectors, so that all we need to calculate are the weights.

To calculate the weights, we must first determine which tokens are relevant to each other. To achieve this, we need to establish a notion of similarity between two embeddings. One way to represent this similarity is with the dot product, where we would like to learn embeddings such that higher scores indicate that two words are more similar.

Since, for each token, we need to calculate its relevance to every other token in the sequence, we can generalize this to a matrix multiplication, which gives us our matrix of weights; these are often referred to as attention scores. To ensure that our weights sum to one, we also apply the SoftMax function. However, because matrix multiplication can produce arbitrarily large numbers, the SoftMax function can return very small gradients for large attention scores, which can lead to the vanishing gradient problem during training. To counteract this, the attention scores are multiplied by a scaling factor before the SoftMax is applied.

Now, to get our contextual embedding matrix, we can multiply our attention scores by the original embedding matrix; this is equivalent to taking linear combinations of our embeddings.
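
Putting the last few steps together, here is a minimal PyTorch sketch of this calculation, applied directly to an embedding matrix before any learned projections are introduced; the dimensions are arbitrary and purely illustrative.

```python
import torch
import torch.nn.functional as F

def simple_attention(embeddings: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention applied directly to (positionally encoded)
    embeddings of shape (seq_len, d_model), with no Q/K/V projections yet."""
    d_model = embeddings.size(-1)
    # Pairwise dot products between every token: (seq_len, seq_len) attention scores.
    scores = embeddings @ embeddings.T
    # Scale the scores so that large values do not starve SoftMax of gradient.
    scores = scores / d_model**0.5
    # SoftMax over each row so the weights for every token sum to 1.
    weights = F.softmax(scores, dim=-1)
    # Linear combination of the original embeddings -> contextual embeddings.
    return weights @ embeddings

contextual = simple_attention(torch.randn(5, 16))  # 5 tokens, d_model = 16
print(contextual.shape)                            # torch.Size([5, 16])
```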

While it is possible that the model could learn embeddings complex enough to generate both the attention scores and the subsequent contextual embeddings, this means trying to squeeze a lot of information into embedding dimensions that are usually quite small.

So, to make the learning task a little easier for the model, let's introduce some more learnable parameters! Instead of using the embedding matrix directly, we pass it through three independent linear layers (matrix multiplications); this should enable the model to 'focus' on different parts of the embeddings.

These linear projections are labeled Q, K, and V. In the original paper, they were named Query, Key, and Value, supposedly taking inspiration from information retrieval. Personally, I have never found this analogy helpful for my understanding, so I tend not to focus on it; I follow the terminology here to stay consistent with the literature and to make it clear that these three linear layers are distinct.

Now that we understand how this process works, we can think of the attention computation as a single block with three inputs, which will be passed to Q, K, and V.

When we pass the same embedding matrix to Q, K, and V, this is called self-attention.
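
As a rough sketch of what this looks like in code, here is a single-head self-attention module with three independent linear layers for Q, K, and V; the sizes are arbitrary, and batching and masking are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: the same input is projected by three
    separate linear layers to form the queries, keys and values."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model). Passing the same x to all three projections
        # is what makes this *self*-attention.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.T / k.size(-1) ** 0.5     # scaled dot-product scores
        weights = F.softmax(scores, dim=-1)
        return weights @ v                       # contextual embeddings

attn = SelfAttention(d_model=16)
print(attn(torch.randn(5, 16)).shape)            # torch.Size([5, 16])
```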

What is multi-head attention?

In practice, we often use multiple self-attention blocks in parallel to enable the Transformer to attend to different parts of the input sequence simultaneously; this is known as multi-head attention.

The idea behind multi-head attention is quite simple: the outputs of multiple independent self-attention blocks are concatenated and then passed through a linear layer. This linear layer enables the model to learn how to combine the contextual information from each attention head.

In practice, the hidden dimension used in each self-attention block is usually chosen to be the original embedding size divided by the number of attention heads; this preserves the shape of the embedding matrix.
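
Below is a hand-rolled sketch of multi-head self-attention along these lines; real implementations (for example PyTorch's built-in nn.MultiheadAttention) differ in the details, but the split-attend-concatenate-project pattern and the head_dim = d_model // num_heads convention are the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads          # hidden size per head
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)   # learns to combine the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len, d_model = x.shape

        # Project, then split the last dimension into (num_heads, head_dim).
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(seq_len, self.num_heads, self.head_dim).transpose(0, 1)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))

        # Each head attends independently: scores have shape (num_heads, seq_len, seq_len).
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                           # (num_heads, seq_len, head_dim)

        # Concatenate the heads back together and mix them with a linear layer.
        concat = heads.transpose(0, 1).reshape(seq_len, d_model)
        return self.out_proj(concat)

mha = MultiHeadSelfAttention(d_model=16, num_heads=4)
print(mha(torch.randn(5, 16)).shape)                  # torch.Size([5, 16])
```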

What else does a Transformer consist of?

Although the paper that introduced the Transformer is titled "Attention Is All You Need", this is a little misleading, because a Transformer consists of more than just attention!

The Transformer also contains the following:

  • Feed-forward neural network (FFN): a two-layer neural network applied independently to each token embedding, across the batch and the sequence. The purpose of the FFN block is to introduce additional learnable parameters into the Transformer, which are responsible for ensuring that the contextual embeddings are distinct and spread out. The original paper used a ReLU activation (many later architectures use GeLU), and the exact makeup of the FFN block can vary depending on the architecture.
  • Layer normalization: helps to stabilize the training of deep neural networks, including Transformers. It normalizes the activations for each sequence, preventing them from becoming too large or too small during training, which can cause gradient-related problems such as vanishing or exploding gradients. This stability is critical for effectively training very deep Transformer models.
  • Skip connections: as in the ResNet architecture, residual connections are used to mitigate vanishing gradients and improve training stability (a sketch combining all three of these components appears below, after the discussion of layer-norm placement).

While the Transformer architecture has remained fairly stable since its introduction, the placement of the layer normalization blocks can vary between Transformer architectures. The original architecture, now known as post-layer norm, applies layer normalization after the self-attention and FFN blocks, once the skip connection has been added.

The most common placement in more recent architectures is pre-layer norm, which places the normalization blocks before the self-attention and FFN blocks, inside the skip connections.
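
The sketch below shows one plausible pre-layer-norm Transformer block combining the components above (multi-head attention, a two-layer FFN, layer normalization, and skip connections); the sizes are arbitrary, and dropout and masking are omitted for brevity.

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """A pre-layer-norm Transformer block: LayerNorm is applied *before* the
    attention and feed-forward sub-layers, inside the skip connections."""

    def __init__(self, d_model: int, num_heads: int, ffn_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                 # two-layer FFN applied per token
            nn.Linear(d_model, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)          # self-attention: same input as Q, K, V
        x = x + attn_out                          # skip connection around attention
        x = x + self.ffn(self.norm2(x))           # skip connection around the FFN
        return x

block = PreNormTransformerBlock(d_model=16, num_heads=4, ffn_dim=64)
print(block(torch.randn(2, 5, 16)).shape)         # torch.Size([2, 5, 16])
```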

What are the different types of Transformer?

While there are many different Transformer architectures out there, most can be divided into three main types.

Encoder architecture

Encoder models are designed to produce contextual embeddings that can be used for downstream tasks such as classification or named entity recognition, because the attention mechanism is able to attend over the entire input sequence; this is the type of architecture we have explored so far in this article. The most popular family of encoder-only Transformers is BERT and its variants.

After passing our data through one or more Transformer blocks, we are left with a complex contextual embedding matrix representing an embedding for each token in the sequence. However, to use this for a downstream task such as classification, we only need to make one prediction. Traditionally, the first token is taken and passed through a classification head, which typically contains Dropout and Linear layers. The output of these layers can be passed through a SoftMax function to convert it into class probabilities. An example of this is sketched below.
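
Here is a minimal sketch of such a classification head; the hidden size, number of classes, and dropout rate are arbitrary, and in practice this would sit on top of a pretrained encoder such as BERT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Takes the contextual embedding of the first token (e.g. BERT's [CLS])
    and maps it to class probabilities via Dropout + Linear + SoftMax."""

    def __init__(self, d_model: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, contextual_embeddings: torch.Tensor) -> torch.Tensor:
        # contextual_embeddings: (batch, seq_len, d_model) from the encoder.
        first_token = contextual_embeddings[:, 0, :]     # one vector per sequence
        logits = self.classifier(self.dropout(first_token))
        return F.softmax(logits, dim=-1)                 # class probabilities

head = ClassificationHead(d_model=16, num_classes=3)
print(head(torch.randn(2, 5, 16)).shape)                 # torch.Size([2, 3])
```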

Decoder architecture

Almost identical to the encoder architecture, the main difference is that decoder architectures employ a masked (or causal) self-attention layer, so the attention mechanism can only attend to the current and previous elements of the input sequence; this means that the resulting contextual embeddings only consider the preceding context. Popular decoder-only models include the GPT family.

This is usually achieved by masking the attention scores with a binary lower-triangular matrix and setting the masked-out (future) positions to negative infinity; when these scores pass through the subsequent SoftMax operation, the attention weights for those positions become zero. We can update our earlier view of self-attention to include this, as sketched below.
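
A minimal sketch of this masking step is shown below, using a lower-triangular boolean mask and negative infinity before the SoftMax; the sequence length and scores are arbitrary.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                 # raw attention scores

# Binary lower-triangular matrix: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Set the masked (future) positions to -inf so SoftMax assigns them weight 0.
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)
print(weights)   # the upper triangle is exactly 0; each row still sums to 1
```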

Because they can only attend to the current position and those before it, decoder architectures are typically used for autoregressive tasks such as sequence generation. However, when using contextual embeddings to generate sequences, there are some additional considerations compared with using an encoder. An example of this process is described below.

Notice that, while the decoder produces a contextual embedding for each token in the input sequence, when generating text we typically use only the embedding corresponding to the final token as the input to subsequent layers.

Additionally, after applying the SoftMax function to the logits, if no filtering is applied, we receive a probability distribution over every token in the model's vocabulary, which can be very large! In general, we want to use various filtering strategies to reduce the number of potential options; some of the most common are:

  • Temperature adjustment: temperature is a parameter applied to the logits inside the SoftMax operation that affects the randomness of the generated text. It determines how creative or focused the model's output is by changing the shape of the probability distribution over output tokens; higher temperatures flatten the distribution and diversify the output.
  • Top-P sampling: this method keeps the smallest set of most likely candidates for the next token whose cumulative probability exceeds a given threshold, and redistributes the probability mass over only those candidates (a small sketch of this follows the list).
  • Top-K sampling: this method limits the number of potential candidates to the K most likely tokens, based on their logits or probability scores (depending on the implementation).
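
As an illustration of the Top-P (nucleus) idea, here is a small sketch of such a filter over a vector of logits; the threshold and vocabulary size are arbitrary.

```python
import torch
import torch.nn.functional as F

def top_p_filter(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of most likely tokens whose cumulative probability
    exceeds p, then renormalize; everything else gets zero probability."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the cumulative probability *before* it is still below p;
    # this always keeps at least the single most likely token.
    keep = (cumulative - sorted_probs) < p
    filtered = torch.zeros_like(probs)
    filtered[sorted_ids[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()          # redistribute over the kept tokens

print(top_p_filter(torch.randn(10), p=0.9))   # zeros outside the nucleus; sums to 1
```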

Once we have changed or reduced the probability distribution over the potential candidates for the next token, we can sample from it to obtain our prediction; this is simply sampling from a multinomial distribution. The predicted token is then appended to the input sequence and fed back into the model, until either the desired number of tokens has been generated or the model produces a stop token: a special token that marks the end of a sequence.
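
The sketch below shows one way this generation loop might look, combining temperature with Top-K sampling; here `model` is a hypothetical callable that maps a batch of token ids to logits of shape (batch, seq_len, vocab_size), and the parameter values are arbitrary.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 50) -> int:
    """Turn the logits for the final position into a single sampled token id."""
    logits = logits / temperature                     # higher temperature -> flatter distribution
    top_values, top_indices = torch.topk(logits, k=min(top_k, logits.size(-1)))
    probs = F.softmax(top_values, dim=-1)             # distribution over the top-k only
    choice = torch.multinomial(probs, num_samples=1)  # sample from a multinomial distribution
    return int(top_indices[choice])

def generate(model, token_ids: list, max_new_tokens: int, eos_id: int) -> list:
    """Autoregressive loop: predict a token, append it, and feed the sequence back in."""
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([token_ids]))     # (1, seq_len, vocab_size); hypothetical model
        next_id = sample_next_token(logits[0, -1])    # use only the final position's logits
        token_ids.append(next_id)
        if next_id == eos_id:                         # stop token ends generation
            break
    return token_ids
```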

Encoder-decoder architecture

Originally, the Transformer was proposed as an architecture for machine translation and used both an encoder and a decoder to achieve this; the encoder is used to create an intermediate representation before the decoder converts it into the desired output format. While encoder-decoder Transformers have become less common, architectures such as T5 demonstrate how tasks such as question answering, summarization, and classification can be framed as sequence-to-sequence problems and solved using this approach.

The key difference in the encoder-decoder architecture is that the decoder uses encoder-decoder attention, which uses both the output of the encoder (as K and V) and the input of the decoder block (as Q) during its attention calculation. This is in contrast to self-attention, where the same input embedding matrix is used for all three. Other than that, the overall generation process is very similar to using a decoder-only architecture.
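
Here is a minimal sketch of such an encoder-decoder (cross-) attention layer, in which K and V are computed from the encoder output while Q comes from the decoder input; batching and multiple heads are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Encoder-decoder attention: the queries come from the decoder, while the
    keys and values come from the encoder's output."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, decoder_x: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(decoder_x)            # (tgt_len, d_model) from the decoder
        k = self.k_proj(encoder_out)          # (src_len, d_model) from the encoder
        v = self.v_proj(encoder_out)
        scores = q @ k.T / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)   # (tgt_len, src_len)
        return weights @ v                    # (tgt_len, d_model)

cross = CrossAttention(d_model=16)
print(cross(torch.randn(3, 16), torch.randn(7, 16)).shape)  # torch.Size([3, 16])
```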

Putting everything together, we can visualize the encoder-decoder architecture much as it is drawn in the original paper; to keep things simple, I have considered the post-layer-norm variant of the Transformer, in which the layer norm blocks sit after the attention and FFN blocks.

I hope the above has helped you build an understanding of how Transformers work.

Reference

[1] Source: https://towardsdatascience.com/de-coded-transformers-explained-in-plain-english-877814ba6429
