
Exploring Large Language Models: Understanding Self-Attention | JD Logistics Technical Team

Author: JD Cloud Developer

1. Background knowledge

After ChatGPT gained global attention, learning and using large language models quickly became a hot trend. As programmers, we need to understand not only what these models do, but also how they work. What exactly gives ChatGPT such impressive Q&A performance? The clever use of the self-attention mechanism is undoubtedly one of the key factors. So what exactly is the self-attention mechanism, and how does it achieve such a remarkable effect? Let's explore the principles behind it.

Over hundreds of millions of years of evolution, humans have developed the ability to quickly focus on changing factors in their environment so as to seek benefit and avoid harm. This ability is attention. Machine learning borrows heavily from biology: like neural networks and genetic algorithms, attention mechanisms have appeared in research since the 1980s. In their early stages, attention mechanisms were often used to identify and extract key information from images. For example, in the figure below, an attention algorithm extracts the "stop sign" from the picture.

[Figure: an attention algorithm extracting the "stop sign" from an image]

As deep learning algorithms progressed, the attention mechanism also came to be widely used for processing sequence input, with the self-attention mechanism in the Transformer model being the most famous example. Compared with algorithms that treat every element of the sequence indiscriminately, the self-attention mechanism can extract the correlations between the input elements and thus achieve better results. Take part-of-speech tagging as an example: suppose a model is needed to analyze the part of speech of each word in a sentence, and the input sentence is "I saw a saw". The sentence contains two occurrences of "saw": the first is a verb meaning "to see"; the second is a noun meaning a cutting tool. If the part of speech of each word is analyzed in isolation, the model cannot tell whether the word is a noun or a verb, so the two words must be placed in the same context to get the desired result. A sliding window can be used to provide continuous context, but the length of the window is positively correlated with the cost of computation: the longer the text, the more resources the model consumes. In large models, the self-attention mechanism is introduced precisely to break this coupling between window length and text length.

The self-attention model can compute the result for the current token directly, without depending on the results for the preceding content. This allows the calculations to be carried out in parallel and removes the constraints of the window, so the efficiency gain is enormous when dealing with very long sequences.


2. Understanding the Attention Mechanism

The attention mechanism is an algorithm that simulates human attention. It can be thought of as a function that computes the degree of correlation between two vectors: the higher the correlation, the higher the attention score. As shown in the figure below, the attention of vector I on vector N is computed, where vector I comes from the word embedding of the i-th input token and vector N comes from the word embedding of the n-th input token. Vector O represents the result, i.e. the attention score of I on N. After O is processed (e.g. with a Softmax over multiple values), it can be interpreted as the weight of I on the input N. The higher the attention score between vector I and vector N, the stronger their correlation. In subsequent computations, only the vector pairs with high attention scores need to be attended to, which improves computational efficiency.

[Figure: computing the attention score O between vector I and vector N]

The attention function is a concrete function, which will be explained in detail in the next section. Put simply, attention at the algorithmic level is a calculation of the similarity between two vectors: when the similarity is high, the corresponding input vector receives a larger weight, or, more abstractly, more of the attention is tilted toward that input vector.
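To make this concrete, here is a minimal sketch (Python with NumPy, using made-up toy numbers) of attention as dot-product similarity followed by a softmax. The exact similarity function and weighting scheme vary between models; this is only one common choice.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy word-embedding vectors (values invented for illustration).
v_i = np.array([1.0, 0.5, -0.2])        # embedding of the i-th token
v_n = np.array([[0.9, 0.6, -0.1],       # embedding of token 1
                [-0.8, 0.2, 1.0],       # embedding of token 2
                [1.1, 0.4, -0.3]])      # embedding of token 3

scores = v_n @ v_i          # dot-product similarity of v_i with each token
weights = softmax(scores)   # normalized attention weights, summing to 1
print(weights)              # tokens 1 and 3 are most similar to v_i, so they get the largest weights
```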

3. Understanding Self-Attention

In the Transformer framework, a deep learning model built on the self-attention mechanism, the attention mechanism it uses is called self-attention (sometimes intra-attention). Building on the mechanism of the previous section, it adds context awareness to the sequence input, so that attention can better capture how each token correlates with the other tokens across the entire input, creating the conditions for extracting more information. It is self-attention that makes GPT models, which use the Transformer framework, so outstanding at content generation.

As shown in the figure below, the input is a set of vectors a1–a4, and the output is the vector group b1–b4 produced by the self-attention calculation. The input vector a1 may be obtained from the original input after word embedding, or it may be the output of some preceding hidden layer. In other words, the results of the self-attention calculation are often used as input to other layers, helping the later parts of the model better capture the highly correlated vector relationships within the input vector group.

[Figure: self-attention maps the input vectors a1–a4 to the output vectors b1–b4]

In the figure above, the output vector b1 is the result obtained from the input vector a1 after taking its correlations with all input vectors into account. So how is the correlation α between a1 and a(n) calculated? This is exactly the calculation mentioned in the previous section. Let's look at what the attention function actually computes.

First, a1 is multiplied by Wq to obtain the vector q, and a(n) is multiplied by Wk to obtain the vector k. Then the dot product of q and k is computed to obtain the correlation αn between a1 and a(n). The overall process is shown in the figure below and sketched in code after it. Here, Wq and Wk are two matrices learned through model training; at inference time they can be treated as fixed constants, and they represent what the attention focuses on (they can also be understood as learned knowledge).

[Figure: computing the correlation α between a1 and a(n) from q = a1·Wq and k = a(n)·Wk]
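The snippet below illustrates just this step; the dimensions and the random matrices standing in for the trained Wq and Wk are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # embedding dimension, chosen for illustration
a1 = rng.normal(size=d)        # embedding of token 1
a_n = rng.normal(size=d)       # embedding of token n
W_q = rng.normal(size=(d, d))  # learned in training; random stand-in here
W_k = rng.normal(size=(d, d))

q = a1 @ W_q        # query vector derived from a1
k = a_n @ W_k       # key vector derived from a_n
alpha_n = q @ k     # dot product = correlation score between a1 and a_n
print(alpha_n)
```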

There is no single prescribed formula for the attention function. In the Transformer, for example, a formulation called Scaled Dot-Product Attention is used. Which algorithm to use as the attention function is therefore an open question, and you can experiment with different algorithms based on your own understanding.

Now let's look at the specific optimizations the Transformer makes in self-attention. In self-attention, the attention function is abstracted as computing the correlation between a Query and a set of Key-Value pairs: the Output is a weighted combination of the Values, where the weight of each Key-Value pair is determined by the correlation between its Key and the Query. Query, Key, Value, and Output are all vectors. This function is called Scaled Dot-Product Attention.

[Figure: Scaled Dot-Product Attention]
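For reference, the formula given for Scaled Dot-Product Attention in "Attention Is All You Need" (with dk the dimension of the key vectors) is:

Attention(Q, K, V) = softmax(Q·K^T / √dk) · V

The division by √dk is the "scaling" in the name: it keeps the dot products from growing too large as the vector dimension increases, which would push the softmax into regions with very small gradients.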

Let's continue with the input a1–a4 as an example and break down the computation of Scaled Dot-Product Attention in the Transformer.

1. Multiply a1 by Wq to get the vector q1.

2. Multiply Wk by a1, a2, a3, and a4 respectively to obtain k1, k2, k3, and k4.

3. Take the dot product of q1 with each k(n) to obtain α1,n.

4. Perform a softmax calculation over α1,1, α1,2, α1,3, α1,4 to obtain α1,1′, α1,2′, α1,3′, α1,4′. The softmax here is a replaceable choice; other functions, such as ReLU, can be used instead.

5. Multiply Wv by a1, a2, a3, and a4 respectively to obtain v1, v2, v3, and v4.

6. Finally, multiply each v(n) by α1,n′ and sum the results to obtain the vector b1. The calculation of b2, b3, and b4 is the same, with the input a1 replaced by a2, a3, and a4 respectively.

The above process can be illustrated by the diagram below; the code sketch that follows it walks through the same steps.

[Figure: the step-by-step computation of b1 from a1–a4]
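A minimal NumPy walkthrough of the six steps, with random matrices standing in for the trained Wq, Wk, Wv (in a real model these come from training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4                          # embedding dimension, chosen for illustration
A = rng.normal(size=(4, d))    # rows are the input vectors a1..a4
W_q = rng.normal(size=(d, d))  # stand-ins for the trained parameter matrices
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

q1 = A[0] @ W_q                # step 1: q1 = a1 * Wq
K = A @ W_k                    # step 2: k1..k4 as rows of K
alpha = (K @ q1) / np.sqrt(d)  # step 3: dot products, with the 1/sqrt(d) scaling
alpha_prime = softmax(alpha)   # step 4: normalize the scores
V = A @ W_v                    # step 5: v1..v4 as rows of V
b1 = alpha_prime @ V           # step 6: weighted sum of the value vectors
print(b1)
```

Repeating the same computation with q2, q3, and q4 yields b2, b3, and b4.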

If a1, a2, a3, and a4 are stacked into the matrix Ainput, they can be fed into self-attention all at once, and the corresponding matrices Q, K, and V are obtained as shown in the following formula. Wq, Wk, and Wv are the constant coefficient matrices learned through model training.

Q = Ainput · Wq
K = Ainput · Wk
V = Ainput · Wv
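In this matrix form the whole input is processed at once. A minimal sketch, again with random stand-ins for the trained matrices and illustrative dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 8                        # 4 input vectors, embedding dimension 8 (illustrative)
A_input = rng.normal(size=(n, d))  # rows are a1..a4

W_q = rng.normal(size=(d, d))      # learned constants in a trained model
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q = A_input @ W_q                  # all queries at once
K = A_input @ W_k                  # all keys at once
V = A_input @ W_v                  # all values at once

# Scaled dot-product attention in matrix form: rows of B are b1..b4.
B = softmax(Q @ K.T / np.sqrt(d)) @ V
print(B.shape)  # (4, 8)
```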

Positional Encoding

The self-attention model described above takes no input describing the position of each token. But for sequence input, position also carries important information. For example, in the earlier sentence "I saw a saw", if the position information is removed, it is impossible to tell which meaning each "saw" carries. To let the Transformer model take position into account, the model designers encoded positions as vectors. This is known as Positional Encoding.

The model designers used sine and cosine functions to encode positions, as shown below, where pos denotes the current position and i denotes the dimension. The positional encoding of each dimension is a sinusoid, with wavelengths forming a geometric progression from 2π to 10000·2π. dmodel denotes the dimension of the word-embedding vector, so the generated positional encoding can be added directly to the word vector.

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))
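A common way to implement this encoding is sketched below; the shapes and the random seed are illustrative, and the function simply follows the sin/cos definition above.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding as described in the Transformer paper."""
    pos = np.arange(max_len)[:, None]           # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices (2i)
    angle = pos / np.power(10000, i / d_model)  # pos / 10000^(2i/dmodel)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions use cosine
    return pe

# The encoding is simply added to the word-embedding matrix.
d_model, seq_len = 16, 10
embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
x = embeddings + positional_encoding(seq_len, d_model)
```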

The reason sine and cosine functions, with their bounded amplitude and periodic oscillation, were chosen for positional encoding is to allow the model to handle content longer than the maximum length seen during training, thereby achieving better generalization.

The positional encoding function above is just the one used in the Transformer model. The choice of encoding function is likewise a problem with no single definitive answer; model designers can devise different positional encoding functions based on their own understanding of position. Some papers even let the model learn to generate positional encodings dynamically during training.

4. Understanding Multi-Head Attention

In order to extract features of the input vectors along different dimensions in parallel, the Transformer architecture introduces multi-head attention on top of self-attention. For example, suppose that during training the model learned features in two dimensions, part of speech and sentence meaning, denoted Ωattribute and Ωsemantic respectively, where Ω = (Wq, Wk, Wv). The model can then be understood as consisting of two heads, each extracting features under a different focus of attention. Multi-head attention can be represented as in the figure below: it consists of h heads, each computing a different set of features. Finally, the heads' outputs are merged by concatenation (Concat) and sent to the next layer for processing.

[Figure: multi-head attention with h heads whose outputs are concatenated and projected]
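A minimal sketch of the idea, assuming h heads that each project into a d_model/h-dimensional subspace and an output projection Wo that mixes the concatenated heads (random matrices stand in for trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 2           # 4 tokens, model dimension 16, 2 heads (illustrative)
d_head = d_model // h
A = rng.normal(size=(n, d_model))  # input vectors a1..a4 as rows

heads = []
for _ in range(h):
    # Each head has its own Wq, Wk, Wv, projecting into a smaller subspace.
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    W_v = rng.normal(size=(d_model, d_head))
    heads.append(attention(A @ W_q, A @ W_k, A @ W_v))

# Concatenate the per-head outputs and mix them with an output projection Wo.
W_o = rng.normal(size=(d_model, d_model))
B = np.concatenate(heads, axis=-1) @ W_o   # shape (4, 16)
```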

5. Summary

The concept of self-attention was first proposed in the landmark paper "Attention Is All You Need", marking a major breakthrough in attention mechanisms. The self-attention mechanism breaks through the performance limits of traditional attention algorithms and greatly improves the efficiency of processing large-scale datasets. Thanks to this, the training efficiency of a model on large datasets depends mainly on the hardware resources invested, and scales roughly with them. At the same time, the parallel nature of self-attention complements the parallel computing capability of GPUs, further improving training efficiency. As a result, with the passage of time and the continued investment of training resources, large language models built on the self-attention mechanism have shown steady growth in parameter scale.

This article only covers the self-attention mechanism as it relates to the Transformer and says little about the more classical attention mechanisms. Understanding how attention mechanisms developed, however, makes self-attention easier to understand, so interested readers are encouraged to read "Attention Mechanism in Neural Networks: Where it Comes and Where it Goes", a paper devoted to the history of attention mechanisms.

References:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Learning to Encode Position for Transformer with Continuous Dynamical Model

NTU Video Link:

https://b23.tv/jBK3VXe

Author: Chen Haolong, JD Logistics

Source: JD Cloud Developer Community
