
A summary of common scenarios for LLaMa quantized deployment


Author: Kevin Wu Jiawen @ Zhihu (authorized)

Source: https://zhuanlan.zhihu.com/p/641641929

Editor: Jishi Platform

In summary, for 7B-class LLaMa models, GPTQ quantization reaches an inference speed of 140+ tokens/s on a 4090, and about 40 tokens/s on a 3070.

LLM.int8()

From the paper: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

https://arxiv.org/pdf/2208.07339.pdf

LLM.int8() is a quantization strategy (https://huggingface.co/docs/transformers/main_classes/quantization) integrated into Hugging Face Transformers. It can be enabled by passing load_in_8bit to .from_pretrained(), and it works for almost all HF Transformers models. The general idea is that during matrix multiplication, outlier parameters are identified (by row or column); the regular parameters are quantized row/column-wise with a method similar to absolute-maximum (absmax) quantization, the outlier parameters are still computed in fp16, and the two results are added together at the end.
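A minimal sketch of enabling this via load_in_8bit (assuming bitsandbytes and accelerate are installed; the model path is just a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/models/vicuna-7b"  # placeholder path; any HF causal LM works

tokenizer = AutoTokenizer.from_pretrained(model_id)
# load_in_8bit=True routes linear layers through bitsandbytes' LLM.int8() kernels;
# device_map="auto" lets accelerate place the weights on the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))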


According to Hugging Face's blog (https://huggingface.co/blog/hf-bitsandbytes-integration), LLM.int8() lets us run LLM inference with fewer resources without much impact on model quality. However, LLM.int8() is generally slower than fp16 inference, and the blog points out that the slowdown is more pronounced for smaller models.

Combined with the experimental results in the paper, the larger the model, the more noticeable the benefit of LLM.int8(). My personal guess is that as the number of non-outlier parameters grows, more of the computation is done in int8, which offsets the extra overhead of the quantization/dequantization conversions.


GPTQ

GPTQ: ACCURATE POST-TRAINING QUANTIZATION FOR GENERATIVE PRE-TRAINED TRANSFORMERS

Unlike LLM.int8(), GPTQ requires post-training quantization of the model to obtain the quantized weights. GPTQ mainly builds on Optimal Brain Quantization (OBQ) and improves on it. Other write-ups have already covered quantization methods such as GPTQ, OBQ, and OBS, so I won't repeat them here.
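For reference, the layer-wise objective that GPTQ (like OBQ) optimizes is: find quantized weights that best preserve each layer's outputs on the calibration data,

\[
\hat{W} \;=\; \arg\min_{\hat{W}} \,\bigl\lVert W X - \hat{W} X \bigr\rVert_2^2
\]

where W are the layer's original weights and X the calibration inputs.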

Several GPTQ repositories are described below. All of the following tests were performed on a 4090, and model inference speed was measured through the UI provided by oobabooga/text-generation-webui (https://github.com/oobabooga/text-generation-webui).

GPTQ-for-LLaMa

GPTQ-for-LLaMa is a repository dedicated to GPTQ quantization of LLaMa models and is a very useful reference if you plan to deploy LLaMa models on GPUs. A large portion of TheBloke's models on http://huggingface.co are quantized with GPTQ-for-LLaMa.

Post-training quantization: GPTQ-for-LLaMa uses the C4 (https://huggingface.co/datasets/allenai/c4) dataset for calibration by default (only a portion of the English data in C4 is used for quantization, not the full 9TB+ dataset):

CUDA_VISIBLE_DEVICES=0 python llama.py /models/vicuna-7b c4 \
    --wbits 4 \
    --true-sequential \
    --groupsize 128 \
    --save_safetensors vicuna7b-gptq-4bit-128g.safetensors
           

Since GPTQ quantizes layer by layer, the quantization process needs relatively little GPU memory and RAM. In the 4090 test, peak GPU memory usage was about 7000 MiB and the whole GPTQ quantization took around 10 minutes. After quantization, a PPL test can be run; for 7B quantized without act-order, the PPL on C4 is roughly around 5-6:

CUDA_VISIBLE_DEVICES=0 python llama.py /models/vicuna-7b c4 \
    --wbits 4 \
    --groupsize 128 \
    --load vicuna7b-gptq-4bit-128g.safetensors \
    --benchmark 2048 --check
           

The quantized model was also tested on the MMLU task (https://github.com/FranxYao/chain-of-thought-hub/tree/main); the post-quantization MMLU score differs only slightly from fp16 (46.1).

Most of the LLaMa GPTQ models published by TheBloke (https://huggingface.co/TheBloke) on Hugging Face are quantized with the settings above (C4 dataset + wbits 4 + groupsize 128 + no act-order + true-sequential). Because GPTQ-for-LLaMa and transformers keep being updated, models published on Hugging Face may fail to load or show accuracy errors; in that case you can re-quantize yourself, and improve quantization accuracy by choosing a better calibration dataset and enabling act-order.

Some pitfalls of GPTQ-for-LLaMa:

  • Model loading issues: when using GPTQ-for-LLaMa, model loading can fail because of transformers version differences. For example, when loading TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ (https://huggingface.co/TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ/discussions/5), the latest GPTQ-for-LLaMa can hit cases where the weight names in the checkpoint do not match what the model expects.
  • left-padding issue: currently all branches of GPTQ-for-LLaMa (triton, old-cuda, and fastest-inference-int4) have this problem. If the model is given inputs that contain left-padding, the output is garbled, so GPTQ-for-LLaMa currently cannot do batch inference correctly.
    • After testing, the problem lies in quant.make_quant_attn(model) in llama.py. Using quant_attn greatly improves inference speed, but judging from this historical issue it seems the position_ids / inference-cache handling in the attention layer is misconfigured: left-padding issue (https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/89)
  • GPTQ-for-LLaMa changes a lot between versions; if other repositories depend on GPTQ-for-LLaMa, check the pinned version carefully. For example, oobabooga maintains a separate fork of GPTQ-for-LLaMa to support oobabooga/text-generation-webui, and the latest upstream GPTQ-for-LLaMa has bugs when used inside text-generation-webui.

AutoGPTQ

AutoGPTQ is relatively easy to use and provides quantization support for most Hugging Face LLM models, such as the LLaMa architecture family, bloom, moss, falcon, gpt_bigcode, etc. (ChatGLM family models do not appear in the support table). For details on quantizing and deploying models, refer to the official Quick Start (https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/01-Quick-Start.md) and Advanced Usage (https://github.com/PanQiWei/AutoGPTQ/blob/main/docs/tutorial/02-Advanced-Model-Loading-and-Best-Practice.md).
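As a minimal sketch of what quantization with AutoGPTQ looks like (following the Quick Start; the paths and the single calibration example are placeholders, in practice you would use a few hundred C4 snippets):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "/models/vicuna-7b"        # placeholder: fp16 model directory
quantized_dir = "/models/vicuna-7b-gptq"    # placeholder: output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)

# Same settings as the GPTQ-for-LLaMa command above: 4 bits, group size 128, no act-order
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Calibration examples (a single placeholder here; use many more in practice)
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_dir, use_safetensors=True)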

AutoGPTQ can directly load the quantization model of GPTQ-for-LLaMa:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    model_dir,     # path to the model files, containing config.json, tokenizer.json and other model config files
    model_basename="vicuna7b-gptq-4bit-128g.safetensors",
    use_safetensors=True,
    device="cuda:0",
    use_triton=True,    # enable triton for faster batch inference
    max_memory = {0: "20GIB", "cpu": "20GIB"}    # per-device memory caps for GPU 0 and CPU offload
)
           

AutoGPTQ offers more loading options, such as whether to use fused attention and how to configure CPU offload. Loading weights with AutoGPTQ avoids a lot of unnecessary trouble: for example, AutoGPTQ does not have the left-padding bug of GPTQ-for-LLaMa, and it is more compatible with other Hugging Face LLM models. Therefore, for GPTQ-INT4 batch inference, AutoGPTQ is the first choice; a left-padded batch-inference sketch is shown below.
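A minimal sketch of left-padded batch inference, assuming the model object loaded in the snippet above and using placeholder prompts:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # decoder-only models need left padding for batched generation

prompts = ["Tell me a joke.", "Briefly explain what GPTQ quantization does."]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")

# `model` is the AutoGPTQForCausalLM instance loaded in the snippet above
outputs = model.generate(**batch, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))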

However, for LLaMa-family models, AutoGPTQ is noticeably slower than GPTQ-for-LLaMa; in my 4090 tests, GPTQ-for-LLaMa's inference speed was roughly 30% faster.

exllama

exllama was written to make GPTQ LLaMa models run faster on 4090/3090 Ti cards, and it reaches an average inference speed of 140+ tokens/s. Of course, to achieve such a speedup, the exllama implementation drops most of the dependencies on the HF transformers model, which means extra adaptation work if you want to use an exllama model in your own project. text-generation-webui ships an HF adaptation of exllama (exllama_hf) that lets us use exllama like an HF model at the cost of some performance.

gptq

The official GPTQ repository. Most of the repositories above are developed on top of it; thanks to GPTQ being open source, a single card with 24 GB of VRAM can run a 33B model.

GGML

GGML is a machine-learning tensor library written in C that supports integer quantization (4-bit, 5-bit, 8-bit) and 16-bit floats, with acceleration and optimizations for several hardware architectures. The LLaMa quantization/acceleration scheme discussed in this section comes from llama.cpp, which has many derivative projects such as llama-cpp-python; below, we refer to this quantization scheme as GGML.

llama.cpp gained full GPU acceleration about a month ago (at inference time, the entire model can be placed on the GPU). According to the tests below, llama.cpp is faster than AutoGPTQ but still much slower than exllama.

GGML has several quantization types (see https://github.com/ggerganov/llama.cpp%23quantization); below, Llama-2-13b-chat-hf is quantized and tested with Q4_0.

A CUDA Docker deployment is used here. For ease of customization, first comment out the ENTRYPOINT in .devops/full-cuda.Dockerfile, then build the image:

docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
           

After the build succeeds, start the container (mapping the host model directory to /models):

docker run -it --name ggml --gpus all -p 8080:8080 -v /home/kevin/models:/models local/llama.cpp:full-cuda bash
           

Following the official documentation (https://github.com/ggerganov/llama.cpp%23prepare-data--run), convert the weights and quantize them:

# convert the HF weights to ggml fp16 format
python3 convert.py /models/Llama-2-13b-chat-hf/

# quantize to q4_0
./quantize /models/Llama-2-13b-chat-hf/ggml-model-f16.bin /models/Llama-2-13b-chat-GGML_q4_0/ggml-model-q4_0.bin q4_0
           

Once that completes, start the server for testing:

./server -m /models/Llama-2-13b-chat-GGML_q4_0/ggml-model-q4_0.bin --host 0.0.0.0 --ctx-size 2048 --n-gpu-layers 128
           

Send a test request:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Once a upon time,","n_predict": 200}'
           

When using the llama.cpp server, refer to the official documentation (https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md) for details on the parameters. The main ones are:

  • --ctx-size: The length of the context.
  • --n-gpu-layers: How many model layers to put on the GPU, we choose to put the entire model on the GPU.
  • --batch-size: The batch size when processing prompts.

Serving requests with the llama.cpp server is about as fast as with llama-cpp-python. For the example above, sending "Once a upon time," and generating 200 tokens takes about 2400 ms (roughly 80 tokens/s) with either.
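For reference, a rough llama-cpp-python equivalent of the request above (assuming llama-cpp-python is installed with CUDA support and using the quantized file produced earlier):

from llama_cpp import Llama

# n_gpu_layers=128 puts the whole model on the GPU, matching the server flags above
llm = Llama(
    model_path="/models/Llama-2-13b-chat-GGML_q4_0/ggml-model-q4_0.bin",
    n_ctx=2048,
    n_gpu_layers=128,
)

out = llm("Once a upon time,", max_tokens=200)
print(out["choices"][0]["text"])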

Inference deployment

Recall that in the BERT era, deploying a PyTorch model involved considering things like converting dynamic graphs to static graphs, exporting to ONNX, Torch JIT, mixed-precision inference, quantization, pruning, distillation, and so on, and we usually had to apply these accelerations to the trained model ourselves. In the LLaMa era, the biggest change is that open-source frameworks seem to do all of this for you: just drop in your trained weights and get inference several times faster than the plain HF model.

Here is a comparison of these inference acceleration schemes: HF float16 (baseline), vLLM, LLM.int8(), GPTQ-for-LLaMa, AutoGPTQ, exllama, and llama.cpp.

| Model name | Tool | tokens/s |
| --- | --- | --- |
| vicuna 7b | float16 | 43.27 |
| vicuna 7b | load-in-8bit (HF) | 19.21 |
| vicuna 7b | load-in-4bit (HF) | 28.25 |
| vicuna7b-gptq-4bit-128g | AutoGPTQ | 79.8 |
| vicuna7b-gptq-4bit-128g | GPTQ-for-LLaMa | 80.0 |
| vicuna7b-gptq-4bit-128g | exllama | 143.0 |
| Llama-2-7B-Chat-GGML (q4_0) | llama.cpp | 111.25 |
| Llama-2-13B-Chat-GGML (q4_0) | llama.cpp | 72.69 |
| Wizard-Vicuna-13B-GPTQ | exllama | 90 |
| Wizard-Vicuna-30B-uncensored-GPTQ | exllama | 43.1 |
| Wizard-Vicuna-30B-uncensored-GGML (q4_0) | llama.cpp | 34.03 |
| Wizard-Vicuna-30B-uncensored-GPTQ | AutoGPTQ | 31 |

All of the above tests were performed on a 4090 + Intel i9-13900K, and inference speed was measured through the UI provided by oobabooga/text-generation-webui (speeds in text-generation-webui are a little slower than an actual API deployment). For accuracy tests, see the GPTQ-for-LLaMa results (https://github.com/qwopqwop200/GPTQ-for-LLaMa%23result) and the exllama results (https://github.com/turboderp/exllama/tree/master%23new-implementation).

Some notes

  1. Model inference speed is affected most by the GPU, then by the CPU. Some netizens have pointed out (link) that, even on the same 4090, depending on the CPU, 7B LLaMa fp16 can run at 50 tokens/s at best and drop to 23 tokens/s at worst.
  2. For Stable Diffusion, Torch with CUDA 11.8 can be up to twice as fast as with CUDA 11.7, but for LLaMa there is little difference between CUDA 11.7 and 11.8.
  3. For quantized batch inference, prefer AutoGPTQ (triton): although AutoGPTQ is slower, the current version of GPTQ-for-LLaMa has the left-padding problem and cannot do batch inference. When batch size = 1, prefer exllama or GPTQ-for-LLaMa.
  4. vLLM's fp16 deployment speed is also good (80+ tokens/s), and it includes memory optimizations; if you have enough hardware resources, consider vLLM, since GPTQ still introduces a small accuracy deviation (a minimal vLLM sketch is shown after this list).
  5. Some of TheBloke's early models may fail to load in exllama; you can re-quantize a new one with the latest version of GPTQ-for-LLaMa.
  6. When the GPU cannot hold the entire model (e.g. loading llama-2-70B-chat on a single 4090), llama.cpp is faster than GPTQ (reference: https://www.reddit.com/r/LocalLLaMA/comments/147z6as/llamacpp_just_got_full_cuda_acceleration_and_now/?rdt=56220).
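For note 4 above, a minimal fp16 vLLM sketch (the model path is a placeholder):

from vllm import LLM, SamplingParams

# fp16 deployment with vLLM; PagedAttention handles KV-cache memory management
llm = LLM(model="/models/vicuna-7b")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=200)

outputs = llm.generate(["Once upon a time,"], params)
print(outputs[0].outputs[0].text)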