
Model quantization and its application in LLMs


One

Model inference optimization

As models land in more and more scenarios, inference acceleration has long been an important part of AI engineering. In recent years, large models based on the Transformer architecture have become mainstream and achieve SoTA results on a wide range of tasks, but their high training and inference costs make deploying them at a reasonable cost increasingly important.

There are two main challenges in large model inference:

  • Huge memory (GPU memory) requirements, coming both from the model parameters themselves and from the intermediate state needed during inference.
    • For a LLaMA-30B model, loading the weights alone takes about 60 GB of GPU memory, and the KV cache of a single token during inference takes about 1.6 MB: 6656 (hidden dim) * 60 (layer num) * 2 (K & V) * 2 bytes (fp16); a request of 2048 tokens therefore needs about 3.3 GB of GPU memory (see the sketch after this list).
  • Poor parallelism: generation is a serial, step-by-step process in time, which makes decoding hard to parallelize and turns it into a computational bottleneck.
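
As a rough sanity check of those numbers, here is a minimal back-of-the-envelope sketch in Python; the dimensions are the LLaMA-30B values quoted above, and the helper name is ours:

```python
def llm_memory_estimate(n_params, hidden_dim, n_layers, n_tokens, bytes_per_elem=2):
    """Rough fp16 memory estimate: model weights plus KV cache for one request."""
    weight_bytes = n_params * bytes_per_elem                    # model weights
    kv_per_token = hidden_dim * n_layers * 2 * bytes_per_elem   # K and V for one token
    return weight_bytes, kv_per_token, kv_per_token * n_tokens  # KV cache for the request

weights, kv_tok, kv_req = llm_memory_estimate(30e9, 6656, 60, 2048)
print(f"weights ~{weights / 1e9:.0f} GB, KV/token ~{kv_tok / 1e6:.1f} MB, "
      f"KV for 2048 tokens ~{kv_req / 1e9:.1f} GB")
# weights ~60 GB, KV/token ~1.6 MB, KV for 2048 tokens ~3.3 GB
```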

Common inference optimization methods include knowledge distillation (KD), pruning, and quantization, as well as various schemes proposed for the memory optimization of LLMs (such as Flash Attention, Paged Attention, etc.).

Distillation directly builds a small model as the student model and supervises it with a combination of soft labels and the original labels so that it learns the knowledge of the original model; once the small model matches the performance of the original one, it replaces the large model to improve inference efficiency.

[Image source: Knowledge Distillation: A Survey, 2021, p2]

Pruning "slims down" the model by cutting away unimportant weights, thereby improving inference efficiency; to preserve the model's capability, pruning usually needs to be accompanied by fine-tuning on training data. Depending on the dimension along which weights are pruned, it can be divided into structured pruning and unstructured pruning.

  • Structured pruning: unimportant channels are pruned in blocks along one or more dimensions of the weight tensor, so normal matrix multiplication is preserved; however, because the pruned channels affect the layers above and below, the logical correctness of the network has to be re-checked.
  • Unstructured pruning: unimportant individual elements of the weight tensor are pruned at random positions, so the original weight structure is kept but the multiplication becomes sparse; this is not well suited to general-purpose hardware, and specialized hardware is needed to actually obtain a speedup.

At present, pruning is rarely used in LLMs. For example, the activation-aware pruning work below [1] performs unstructured pruning based on the absolute values of the weights themselves and of the input tensor, which makes the weight tensor sparse, but the resulting accuracy loss does not meet engineering requirements.

[Image source: A Simple and Effective Pruning Approach for Large Language Models, 2023, p2]

In the recent structured pruning work shown below [2], substructures of the model are found by search and the model's accuracy is maintained by retraining. The accuracy of the pruned model is still greatly reduced compared with the original model, and the method can only demonstrate its value by comparing against other smaller models with the same (post-pruning) parameter count.

[Image source: Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, 2023, p3]

[Image source: huggingface/Sheared-llama-1.3B]

Quantization has become the first choice for neural networks and LLMs because it offers the following advantages:

  • A direct reduction in GPU memory.
    • LLM weights are generally stored in FP16; after the weights are converted to int4, their size intuitively drops to about 1/4 of the original (in practice slightly more, because embeddings are usually not quantized, memory allocation overhead, and other reasons), which greatly reduces the GPU memory requirements.
  • Accelerated W4A16, W8A16, and similar operators, which improve computation speed.

Two

Introduction to quantization

Basics

The essence of quantization is to convert the model's parameters, or the entire inference computation, from floating point to integer.

Quantization parameters usually consist of two values: scale, which is a floating-point number, and zero-point, which is an integer. Let x be a tensor (it can be a weight or an intermediate activation of inference); its quantization can be expressed as follows:

x{int} = clamp(round(x/s) + z; q{min}, q{max})

For int8 quantization, for example, the range [-128, 127] can be used, i.e. q{min} = -2^(b-1) = -128 and q{max} = 2^(b-1) - 1 = 127; clamp(a; q{min}, q{max}) denotes truncating the input value a to the range [q{min}, q{max}]; x{int} is the quantized result, and s and z are the quantization parameters scale and zero-point.

[Image source: A Survey of Quantization Methods for Efficient Neural Network Inference, 2021, p5; An Introduction to Quantization of Large Language Models, p12]

The dequantization process from integer back to floating point is as follows:

x_hat = s · (x{int} - z)

As for the quantization parameters, there are many algorithms based on search, optimization, LKD (layer-by-layer distillation) and so on that compute an optimal solution so as to minimize the accuracy loss caused by quantization; the most direct way is to compute scale and zero-point from the min/max of the tensor elements:

s = (max(x) - min(x)) / (q{max} - q{min}),  z = q{min} - round(min(x) / s)

Here is a simple code example in which a tensor x is quantized from fp32 to int8 and then dequantized back to fp32, i.e. the process x -> x{int} -> x_hat:
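
A minimal sketch of this round trip, assuming NumPy and per-tensor asymmetric int8 quantization with min/max-based parameters (the function names are ours):

```python
import numpy as np

def quantize(x, n_bits=8):
    """Asymmetric per-tensor quantization: fp32 -> int8, returns (x_int, scale, zero_point)."""
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - int(np.round(x.min() / scale))
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return x_int, scale, zero_point

def dequantize(x_int, scale, zero_point):
    """Dequantization: int8 -> fp32 approximation of the original tensor."""
    return scale * (x_int.astype(np.float32) - zero_point)

x = np.random.randn(4, 4).astype(np.float32)
x_int, s, z = quantize(x)
x_hat = dequantize(x_int, s, z)
print(x)                          # x before quantization
print(x_hat)                      # x_hat after dequantization: close to x
print(np.abs(x - x_hat).max())    # the remaining quantization error
```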


Symmetric / asymmetric

Compared with asymmetric quantization, symmetric quantization maps to an integer range that is symmetric about zero, i.e. the zero-point in the formula above is 0 and q{max} = -q{min}, which makes the quantization expression simpler.

Asymmetric quantization helps make full use of the quantization range. For example, the activation tensor output by Conv+ReLU is always non-negative; if symmetric quantization is used, the floating-point values are all mapped into [0, 127], half of the integer range goes unused, and the quantization precision is lower than with asymmetric quantization.

[Image source: A Survey of Quantization Methods for Efficient Neural Network Inference, 2021, p5]

In practice, symmetric quantization is often chosen for the weight tensor and asymmetric quantization for the input tensor. The following analysis comes from Qualcomm's quantization white paper. Taking the matrix multiplication of a Linear layer as an example, when asymmetric quantization is chosen for both weights and inputs, the expression expands as follows:

W_hat · x_hat = s_W (W{int} - z_W) · s_x (x{int} - z_x) = s_W s_x (W{int} x{int} - z_W x{int} - z_x W{int} + z_W z_x)

  • The first term is the integer-integer matrix multiplication, which must be computed at inference time;
  • The third and fourth terms only involve scales, zero-points and the integer weights, which are all known in advance, so they can be precomputed and folded into the bias addition;
  • The second term depends on x{int} and must be recomputed for every inference, which costs extra compute.

Therefore, if the weight quantization is switched to symmetric quantization (z_W = 0), the expression simplifies as follows: the only thing that needs to be computed on the fly is the matrix multiplication in the first term, while the second term is a precomputed bias:

W_hat · x_hat = s_W s_x (W{int} x{int} - z_x W{int})

When both are symmetric quantizations, the expression is simplified as follows:

W_hat · x_hat = s_W s_x W{int} x{int}

Compared with the floating-point computation W·x in the original model, W{int} x{int} is a multiplication between integers, which is much faster on NVIDIA GPUs; this is where the large inference speedup of a quantized model comes from.
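
The following NumPy sketch checks this decomposition numerically for symmetric weight quantization plus asymmetric activation quantization (the variable names are ours, and the integer matmul is done in int32 purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)
x = np.abs(rng.standard_normal((16, 4))).astype(np.float32)   # e.g. post-ReLU activations

# symmetric int8 quantization of W: z_W = 0
s_W = np.abs(W).max() / 127
W_int = np.clip(np.round(W / s_W), -127, 127).astype(np.int32)

# asymmetric int8 quantization of x
s_x = (x.max() - x.min()) / 255
z_x = int(-128 - round(float(x.min()) / s_x))
x_int = np.clip(np.round(x / s_x) + z_x, -128, 127).astype(np.int32)

# integer matmul plus a precomputable "bias" term (-z_x * row sums of W_int)
y_quant = s_W * s_x * (W_int @ x_int - z_x * W_int.sum(axis=1, keepdims=True))
y_float = W @ x
print(np.abs(y_quant - y_float).max())   # small quantization error, not exact
```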

Three

Quantization of LLMs

Challenges in LLM Quantization

From the perspective of model performance, the prerequisite that any quantization scheme must address is how to maintain the accuracy of the quantized model, i.e. how to make its users feel that the quantized model keeps its original capability while improving inference efficiency.

The operations that mainly need to be quantized in a neural network are the convolutional layer Conv(x; W) and the fully connected layer Wx; following the procedure described in the previous part, this means doing weight quantization (WQ) of W and activation quantization (AQ) of x.

Unlike CNN models or small Transformer models, the activation tensors produced by the matrix multiplications of large Transformer-based models usually contain many outliers: values that lie far from the cluster formed by the majority of the distribution. These elements with large absolute values but low frequency increase the difficulty of quantization. If they are fully covered, the large quantization range shrinks the effective resolution of the quantizer; if they are clipped too aggressively, the model's results usually degrade noticeably, because these large-magnitude values have a strong influence on the model's outputs. The latter effect is especially obvious in LLM quantization.

The figure below shows the element values of an input tensor of one layer of ResNet18 and of Opt-13B, where sigma denotes the standard deviation of the respective distribution. The maximum input value of ResNet18 is about 28 sigma, and only 0.05% of values lie beyond an absolute value of 6 sigma; the maximum input value of the Opt-13B network is 325 sigma, and 0.2% of its input values lie beyond an absolute value of 6 sigma. In terms of quantization quality, the int8 accuracy of ResNet18 is essentially lossless, while the accuracy of the int8 Opt-13B model collapses.

[Image source: An Introduction to Quantization of Large Language Models, p20]
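
As an illustration, outlier statistics of this kind can be measured for any captured activation tensor with a few lines of NumPy (the 6-sigma threshold follows the comparison above; the data here is a random placeholder):

```python
import numpy as np

def outlier_stats(x, n_sigma=6.0):
    """Return (max |value| expressed in sigmas, fraction of elements beyond n_sigma * sigma)."""
    sigma = x.std()
    max_in_sigma = np.abs(x).max() / sigma
    frac_outliers = np.mean(np.abs(x) > n_sigma * sigma)
    return max_in_sigma, frac_outliers

# run on a captured activation tensor `act` from one layer:
act = np.random.standard_normal((2048, 5120)).astype(np.float32)  # placeholder data
print(outlier_stats(act))
```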

To address the difficulty of activation quantization, some schemes try to reduce it, such as the idea proposed by SmoothQuant.

[Image source: SmoothQuant, p4]

In matrix multiplication, they transform the problem from quantizing X and W to quantizing X·diag(s)^(-1) and diag(s)·W, so that, while the product of the multiplication is unchanged, the quantization difficulty of the tensor X is reduced. In actual engineering, however, the quantization error introduced by this scheme still has a fairly obvious impact on the inference quality of large models, even at int8 precision. For example, the following SmoothQuant results on Llama2-7B show a perplexity that is too poor to be usable in practice.

[Figure: SmoothQuant int8 quantization results (perplexity) for Llama2-7B]
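
To make the idea concrete, here is a minimal NumPy sketch of the smoothing transform; the per-channel scale s follows the form used in the SmoothQuant paper (activation magnitude to the power alpha over weight magnitude to the power 1 - alpha), but it should be read as an illustration rather than a faithful reimplementation:

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate quantization difficulty from activations X (N x M) to weights W (M x K)."""
    act_max = np.abs(X).max(axis=0)                                 # per-input-channel max of |X|
    w_max = np.maximum(np.abs(W).max(axis=1), 1e-8)                 # per-input-channel max of |W|
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    X_smooth = X / s                                                # X · diag(s)^(-1)
    W_smooth = W * s[:, None]                                       # diag(s) · W
    return X_smooth, W_smooth

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)); X[:, 3] *= 50.0   # one outlier channel in the activations
W = rng.standard_normal((8, 16))
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws))                 # True: the product is unchanged
print(np.abs(X).max(), np.abs(Xs).max())           # the activation outlier becomes much smaller
```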

Therefore, most practical solutions in current engineering deployments are based on weight-only quantization, i.e. quantization of the activations is abandoned.

GPTQ

GPTQ was the earliest quantization scheme to gain acceptance in engineering deployments: its W8A16 or W4A16 quantization is close to the original model in most scenarios, and the quantization process itself is very fast.

Quantization process

Taking the basic matrix-multiplication unit as an example, the following optimization objective can be written from the mean squared error of the product before and after weight-only quantization:

argmin_{W_hat} || W·X - W_hat·X ||_2^2

W is a Linear-layer weight in the Transformer and X is the corresponding input. Offline quantization proceeds module by module (Transformer block by Transformer block) and layer by layer (Q, K, V, O, Fc1, Fc2).

The parameters and data are defined as follows:

  • W ∈ R^{K×M}, X ∈ R^{M×N}, Y = W×X ∈ R^{K×N}
  • calibration set: a small amount of data is run through the model to observe the value range of each layer's input tensor, and quantization is based on it.

The specific quantization procedure is as follows (a simplified code sketch follows the list):

  • Compute the Hessian (the Hessian of the above optimization objective with respect to W_hat, not the Hessian used in backpropagation) and add a perturbation term:
H = 2 X X^T + λ I
  • act_order sorting (desc_act, so that columns with similar value ranges are quantized together): reorder the columns of W along the M dimension based on diag(H), and rearrange H in both of its dimensions accordingly.
  • Compute the inverse H^(-1) (via Cholesky decomposition).
  • Walk over W along dimension M block by block with block size B = 128 (block starting at column i); the not-yet-quantized part to its right is updated based on H^(-1) to compensate for the quantization loss.
  • (inner loop) Within each block, quantize column by column, compute the quantization error, and update the not-yet-quantized columns inside the block based on that error:
E = (W[:, j] - Q[:, j]) / [H^(-1)]_{jj}
W[:, j+1 : i+B] -= E · [H^(-1)]_{j, j+1 : i+B}
  • (outer loop) After the block is finished, update all the columns behind it:
W[:, i+B :] -= E · [H^(-1)]_{i : i+B, i+B :}
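
A heavily simplified NumPy sketch of this procedure, without act-order reordering or blocking, and with per-row scales (the g = -1 case described below); the real implementation works block by block and uses a Cholesky factorization of H^(-1) instead of a plain inverse:

```python
import numpy as np

def gptq_quantize(W, X, n_bits=4, percdamp=0.01):
    """Quantize W (K x M) column by column, compensating the remaining columns
    with the inverse-Hessian-weighted quantization error. X is the calibration input (M x N)."""
    K, M = W.shape
    W = W.astype(np.float64).copy()
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1), 1e-8) / qmax        # per-row symmetric scales (g = -1)
    H = 2.0 * X @ X.T
    H += percdamp * np.mean(np.diag(H)) * np.eye(M)               # dampening (perturbation term)
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for j in range(M):
        w_j = W[:, j]
        q_j = np.clip(np.round(w_j / scale), -qmax - 1, qmax) * scale   # quantize + dequantize column j
        Q[:, j] = q_j
        err = (w_j - q_j) / Hinv[j, j]                            # scaled quantization error
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])            # compensate the remaining columns
    return Q

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)); X = rng.standard_normal((64, 256))
Q = gptq_quantize(W, X)
print(np.linalg.norm(W @ X - Q @ X) / np.linalg.norm(W @ X))      # relative error of the layer output
```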

group_size

  • If no group size is specified, the default is g = -1: the quantization parameters are computed over all columns and each row of the weight is quantized with one set of parameters, so for W ∈ R^{K×M} the number of quantization parameter sets is K×1.
  • If a group size is specified, e.g. g = 128, the quantization parameters are computed per group of 128 columns and each row of the weight is quantized group by group, so for W ∈ R^{K×M} the number of quantization parameter sets is K×(M/g) (see the sketch below).
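
A small sketch of how the per-group scales could be computed for the two cases above (hypothetical helper; symmetric quantization assumed, and g = -1 treated as one group spanning all columns):

```python
import numpy as np

def group_scales(W, g=-1, n_bits=4):
    """Per-row, per-group symmetric scales for W of shape (K, M)."""
    K, M = W.shape
    g = M if g == -1 else g                       # g = -1: one group over all columns
    qmax = 2 ** (n_bits - 1) - 1
    groups = W.reshape(K, M // g, g)              # (K, M/g, g)
    return np.abs(groups).max(axis=-1) / qmax     # (K, M/g) scales

W = np.random.randn(16, 256)
print(group_scales(W, g=-1).shape)   # (16, 1)
print(group_scales(W, g=128).shape)  # (16, 2)
```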

Reordering (desc_act)

Based on the Hessian matrix H, W is reordered along the M dimension according to diag(H). The aim is to quantize first the weight columns corresponding to activations with larger absolute values, which are considered more important during inference, so that the quantization error on those columns is as small as possible and more of the error is shifted onto the later, less important columns.
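
In code, the reordering step amounts to a permutation of the columns of W and of both dimensions of H, sketched below (undoing the permutation after quantization is omitted):

```python
import numpy as np

def act_order_permute(W, H):
    """Sort the columns of W (K x M) by descending diag(H); rearrange H in both dimensions."""
    perm = np.argsort(-np.diag(H))             # most "active" input channels first
    return W[:, perm], H[np.ix_(perm, perm)], perm

# usage with the W, X of the previous sketch:
# H = 2 * X @ X.T
# W_perm, H_perm, perm = act_order_permute(W, H)
```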

Some experiments show that the effect of desc_act on quantization loss is an effective trick in most tasks.


Perplexity of Pygmalion-7B with GPTQ [7]

[Image source: https://huggingface.co/reeducator/vicuna-13b-free/discussions/22]

Operators

Strictly speaking, weight-only W4A16 does not bring much computational efficiency over the original W16A16, and it even adds a quant/dequant step to inference. However, as weight-only quantization has become the mainstream for LLMs and its applications keep growing, a number of open-source projects have written efficient W4A16 operators to power the inference acceleration of these quantization algorithms. For example, AutoGPTQ, the Python package for GPTQ, has integrated the open-source tool exllama, which rewrites the parallel computation of quantized multiplication in Triton and CUDA. In exllama/exllama_ext/matrix.cuh you can see the dot_product8_h implementation of out = W_hat·x = (W{int} - z)·s·x = (W{int} - z)·x·s.
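
Logically, such a weight-only W4A16 kernel computes the following; the sketch below spells out the arithmetic in NumPy, whereas the real kernels fuse int4 unpacking, dequantization, and the fp16 multiply inside the GPU kernel:

```python
import numpy as np

def w4a16_matvec(W_int, scale, zero, x):
    """out = (W_int - zero) * scale @ x, i.e. dequantize-then-multiply.
    W_int: (K, M) int4 values stored as int8; scale, zero: per-row (K, 1); x: (M,) fp16."""
    W_hat = (W_int.astype(np.float16) - zero) * scale     # dequantize the weights to fp16
    return W_hat @ x                                       # fp16 matrix-vector product

K, M = 8, 32
W_int = np.random.randint(0, 16, size=(K, M)).astype(np.int8)
scale = np.random.rand(K, 1).astype(np.float16)
zero = np.full((K, 1), 8, dtype=np.float16)
x = np.random.randn(M).astype(np.float16)
print(w4a16_matvec(W_int, scale, zero, x).shape)   # (K,)
```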


[Image source: https://github.com/turboderp/exllama/blob/3b013cd53c7d413cf99ca04c7c28dd5c95117c0d/exllama_ext/matrix.cuh#L86]

AWQ

Compared with GPTQ, which derives its solution from an optimization problem, AWQ is a search-based quantization scheme.

Let Q(·) denote the quantize-dequantize process; then before the modification the quantized computation is as follows:

Y = Q(W) · X

After the modification, a scaling is applied to W and the quantized computation becomes:

Y = Q(W · diag(s)) · (diag(s)^(-1) · X)

Search

AWQ stands for Activation-aware Weight Quantization, meaning that the quantization of the weights takes the values of the activations into account. The starting point is again that, among the channels of the weight, those whose corresponding activations have larger values are relatively important and the others relatively unimportant; this importance is expressed by multiplying the channel by a scale factor s, whose value and range are derived from the input activation tensor.

s = s_X^α,  α ∈ [0, 1]  (s_X: per-channel magnitude of the activation)

The search compares the output of the Linear layer before and after quantization, and the scale with the smallest MSE is taken as the optimal solution:

α* = argmin_α || Q(W · diag(s)) · (diag(s)^(-1) · X) - W · X ||,  s = s_X^α
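
A simplified sketch of such a search, assuming the per-channel scale is parameterized as the activation magnitude raised to a power alpha that is grid-searched for minimum output MSE (this follows the spirit of AWQ rather than its exact implementation; the quantizer is a simple per-row symmetric round-to-nearest):

```python
import numpy as np

def quant_dequant(W, n_bits=4):
    """Per-output-row symmetric round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum(np.abs(W).max(axis=1, keepdims=True), 1e-8) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def awq_search(W, X, n_grid=20):
    """Grid-search a per-input-channel scale s = act_magnitude**alpha minimizing output MSE."""
    act_mag = np.abs(X).mean(axis=1) + 1e-8                  # per-channel activation magnitude
    Y_ref = W @ X
    best = (None, np.inf)
    for alpha in np.linspace(0, 1, n_grid):
        s = act_mag ** alpha
        Y_q = quant_dequant(W * s[None, :]) @ (X / s[:, None])   # Q(W·diag(s)) · diag(s)^(-1) · X
        mse = np.mean((Y_q - Y_ref) ** 2)
        if mse < best[1]:
            best = (s, mse)
    return best

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)); X = rng.standard_normal((64, 128))
s_best, mse_best = awq_search(W, X)
print(mse_best)
```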

Effect

In terms of model performance, the comparison below from the AWQ paper shows that, measured by perplexity on both generations of Llama, AWQ's quantization results are slightly better than GPTQ and GPTQ with reordering.


[Image source: AWQ, p6]

In terms of accuracy on actual tasks, AWQ is comparable to the act_order version of GPTQ (GPTQ-R), while being faster than the latter.


[Image source: AWQ, p5]

In terms of computational performance, GPTQ has a reorder operation and its matrix multiplication is MV (matrix × vector) with non-contiguous memory access, while AWQ has no reorder operation and its matrix multiplication is MM (matrix × matrix), which is faster.

Four

Summary

The current SOTA results for LLM quantization are basically weight-only schemes, and their main contribution is reducing the GPU memory the model needs at run time.

From the perspective of model performance, some quantization loss is unavoidable, and LLMs are usually far more sensitive to quantization than traditional CNN models; so although a quantized LLM performs close to the original on many tasks, it may still be inadequate on some of them.

From the perspective of acceleration, weight-only quantization mainly drives low-level kernel work on W4A16, W3A16, W8A16 and similar multiplication operators. Judging from the theoretical figures reported in the papers, these are usually only 1.x ~ 3.x times faster than the FP16 model, the actual deployed speedup may be lower than that, and their acceleration is far behind that of traditional quantization with W4A4, W8A8 and other fully integer multiplication operators.

Generally speaking, quantization work in the LLM field is still quite preliminary. If a practical task demands very high model accuracy, it is recommended instead to improve throughput per unit of memory via the KV cache and related directions, such as Flash Attention-2 and Paged Attention.

Five

References

1. A Simple and Effective Pruning Approach for Large Language Models, 2023.

2. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning, 2023.

3. A White Paper on Neural Network Quantization, 2021.

4. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2023.

5. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2023.

6. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, 2023.

7. Some evaluation on GPTQ performance.

Author: xujiong

Source: WeChat public account "Dewu Technology"

Source: https://mp.weixin.qq.com/s/25gZPdn29YBpyEx_l-D09A
