
Run Llama 3.1 405B on a single GPU, and let large models slim down with ease! A super compression toolkit is here

Contributed by the Model Toolchain Team

QbitAI | WeChat official account QbitAI

Llama 3.1 (405B) can now be handled on a single GPU: the latest large-model compression toolkit is here!

Llama 3.1 recently climbed to the top of the open-source rankings, but its strongest 405B version requires more than 900 GB of memory, posing a demanding resource challenge.

LLMC, a large-model compression toolkit and benchmark jointly released by teams from Beihang University, SenseTime, and Nanyang Technological University, solves this problem well.

It allows Llama 3.1 405B to be calibrated and evaluated on a single 80 GB A100, enabling ultra-low-cost quantization.

It supports a variety of compression algorithms, models, and inference backends, offering strong extensibility and comprehensive evaluation capabilities.


The research team has published usage instructions on the project's GitHub homepage, available via the link at the end of this article.

Llama 3.1: bigger, but harder to compress

Low-bit quantization is one of the standard techniques for dealing with resource constraints. To that end, the researchers used LLMC to quantize and compress Llama 3.1.

As shown in Table 1, some algorithms in LLMC, such as QuaRot and AWQ, effectively maintain quantization accuracy on the 70B and 405B parameter models. By contrast, the simplest round-to-nearest ("Naive") baseline exhibits a significant drop in accuracy on these large models, especially when activations are quantized.

△ Table 1: Quantization accuracy of different algorithms on the Llama 3.1 models
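The "Naive" baseline above is plain round-to-nearest (RTN) quantization. A minimal sketch of symmetric per-tensor RTN (an illustration, not LLMC's implementation) shows why outliers hurt: a single extreme value inflates the quantization scale, leaving fewer effective levels for all other values.

```python
import torch

def rtn_quant(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor round-to-nearest quantization; returns the dequantized tensor."""
    qmax = 2 ** (n_bits - 1) - 1                    # 127 for 8-bit
    scale = x.abs().max() / qmax                    # one scale shared by the whole tensor
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                                # back to float so the error can be measured

x = torch.randn(4096)
x[0] = 100.0                                        # inject a single outlier
print((x - rtn_quant(x)).abs().mean())              # large error: the outlier sets the scale
print((x[1:] - rtn_quant(x[1:])).abs().mean())      # far smaller error once the outlier is gone
```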

The research team found that the quantization accuracy of the Llama 3.1 series degrades because its activation tensors contain outliers that are more pronounced than in other models, and the larger the Llama 3.1 model, the more severe these outliers become. Outliers are values that deviate markedly from the rest of the data, and they are one of the key factors affecting quantization accuracy.

Using the LLMC tool, the research team visualized the input activation tensors of four layers (q_proj, o_proj, gate_proj, and down_proj) in the first block of the Llama 3.1 series models (8B, 70B, 405B), as shown in Figures 1-3. The bottom of each subplot reports the mean and standard deviation of the kurtosis of all token activations in that layer.

△ Figures 1-3: Input activation tensors of the first-block layers of Llama 3.1 8B, 70B, and 405B

As Figures 1-3 show, some channels of the activation tensors in the Llama 3.1 models contain outliers, and the phenomenon is more pronounced in the larger models.

It is therefore reasonable to infer that the Llama 3.1 405B model, while stronger, has also become more "anomalous" and harder to quantize.
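The kurtosis reported under each subplot measures how heavy-tailed, i.e. how outlier-dominated, a token's activation distribution is. The following is a minimal sketch of how such a per-token statistic can be computed (LLMC's exact implementation may differ):

```python
import torch

def token_kurtosis(acts: torch.Tensor) -> torch.Tensor:
    """Per-token kurtosis E[((x - mu) / sigma)^4] over the hidden dimension.

    acts: (num_tokens, hidden_dim) -> (num_tokens,).
    Gaussian data gives ~3; much larger values indicate heavy tails, i.e. outliers.
    """
    mu = acts.mean(dim=-1, keepdim=True)
    sigma = acts.std(dim=-1, keepdim=True)
    z = (acts - mu) / (sigma + 1e-6)
    return (z ** 4).mean(dim=-1)

acts = torch.randn(128, 4096)              # stand-in for one layer's input activations
k = token_kurtosis(acts)
print(k.mean().item(), k.std().item())     # the two numbers shown under each subplot
```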

The LLMC tool supports a series of quantization algorithms that suppress outliers in large models, including AWQ, SmoothQuant, OS+, and QuaRot. As Table 1 shows, these methods greatly improve the quantization accuracy of Llama 3.1 by effectively suppressing outliers. For example, under W8A8 quantization of the 405B model, SmoothQuant, OS+, and QuaRot achieve accuracy almost identical to that of the floating-point model.
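Taking SmoothQuant as an example, its core idea is to migrate outlier magnitude from activations into weights through a mathematically equivalent per-channel rescaling. Below is a minimal sketch of that idea (the helper name and alpha value are illustrative; LLMC's implementation differs in detail):

```python
import torch

def smooth_scales(act_absmax: torch.Tensor, w_absmax: torch.Tensor, alpha: float = 0.5):
    """Per-input-channel smoothing scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return (act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)).clamp(min=1e-5)

X = torch.randn(16, 4096)                  # calibration activations
X[:, 7] *= 50                              # one outlier channel
W = torch.randn(4096, 4096)                # linear layer weight, shape (out, in)

s = smooth_scales(X.abs().amax(dim=0), W.abs().amax(dim=0))
# X @ W.T == (X / s) @ (W * s).T up to float rounding: the output is unchanged,
# but the smoothed activations have a much flatter range and quantize far better.
print(torch.allclose(X @ W.T, (X / s) @ (W * s).T, atol=1e-2))
print(X.abs().max().item(), (X / s).abs().max().item())
```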

LLMC: a one-stop slimming toolkit for large models


△ LLMC framework diagram

Supports a variety of algorithms. LLMC supports a variety of compression algorithms, including 16 different quantization methods, covering weight-only, weight-activation, and mixed-precision quantization. This diversity allows fair comparisons and in-depth analyses of different methods. Beyond quantization, various sparsification and related algorithms are also supported.


△ Classification of some hardware-friendly compression algorithms currently supported by LLMC

Accuracy is highly aligned. The LLMC team ran alignment experiments on several established quantization algorithms, comparing LLMC against the original papers or code.

The experimental settings are identical to those in the original papers or to the default settings of their open-source code (see Table 3).

The results are summarized in Tables 4-6: LLMC's performance is almost identical to that reported for the original quantization algorithms in the literature. These experiments demonstrate that LLMC is not only effective but also reliable in reproducing the results of existing quantization methods, ensuring that the tool's contribution to LLM quantization research is credible and valuable.

△ Tables 3-6: Alignment experiment settings and results (LLMC vs. the original papers/code)

Quantization at ultra-low cost. The LLMC toolkit is designed to be resource-efficient and can handle large models with minimal hardware. Thanks to its single-block-at-a-time execution mechanism, only one 80 GB A100 is required to calibrate and evaluate Llama 3.1 405B, enabling ultra-low-cost quantization.
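Conceptually, the block-level mechanism streams the model through the GPU one transformer block at a time, so only a single block ever occupies GPU memory. The sketch below illustrates the idea only; `blocks` and `quantize_block` are hypothetical helpers, not LLMC's actual API:

```python
import torch

@torch.no_grad()
def blockwise_calibrate(blocks, hidden, quantize_block, device="cuda"):
    """Calibrate a huge model one transformer block at a time.

    blocks:         list of block modules kept on CPU (hypothetical)
    hidden:         calibration activations entering the first block
    quantize_block: runs one algorithm's calibration on a single block (hypothetical)
    """
    for block in blocks:
        block.to(device)                           # load only this block onto the GPU
        quantize_block(block, hidden)              # e.g. collect AWQ/SmoothQuant statistics
        hidden = block(hidden.to(device)).cpu()    # propagate activations to the next block
        block.cpu()                                # release GPU memory before continuing
        torch.cuda.empty_cache()
    return hidden
```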

Multi-backend compatibility. LLMC supports multiple quantization settings and model formats and is compatible with a range of backends and hardware platforms, such as LightLLM, TRT-LLM, PPL-LLM, vLLM, MLC-TVM, and llama.cpp, making it highly versatile.


High extensibility. The toolkit is highly modular and extensible, and can be easily adapted from integer quantization to floating-point quantization, from dense models to mixture-of-experts (MoE) models, from LLMs to vision-language models (VLMs), and from quantization to sparsification. This modular design ensures that users can extend and customize the toolkit to meet their needs.


Diverse evaluation. LLMC provides comprehensive evaluation of compressed models, with detailed performance metrics and analyses such as perplexity (PPL), data visualization, kurtosis, quantization error, and outlier distribution. This comprehensive evaluation capability ensures that users can make informed decisions about the best compression strategy for their models.
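Of these metrics, perplexity is the headline number: the exponentiated mean negative log-likelihood of predicting each next token. A compact sketch, assuming a Hugging Face-style causal LM whose output exposes `.logits`:

```python
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    """PPL = exp(mean NLL of predicting token t+1 from tokens up to t)."""
    logits = model(input_ids).logits                # (batch, seq, vocab)
    logprobs = logits[:, :-1].log_softmax(dim=-1)   # predictions for positions 1..seq-1
    nll = -logprobs.gather(-1, input_ids[:, 1:, None]).squeeze(-1)
    return math.exp(nll.mean().item())              # lower is better
```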


In short, the LLMC team has released LLMC, a versatile large-model compression toolkit that supports a wide range of compression algorithms, models, and inference backends, with strong extensibility and comprehensive evaluation capabilities.

The toolkit lets users compress LLMs with hundreds of billions of parameters using only a single GPU, greatly easing the adoption of LLM quantization. With this powerful toolkit, researchers and general users alike can integrate the algorithms and formats required by their target backend platform, helping democratize large-model compression.

Tool address: https://github.com/ModelTC/llmc

Paper: https://arxiv.org/abs/2405.06001

— END —
