
Large Language Model Deployment: vLLM and Quantization

Author: Not bald programmer
Summary: There are a variety of tools and methods for deploying large language models; this article uses vLLM as a best practice.

We live in an era of amazing large language models, such as ChatGPT, GPT-4, and Claude, that can perform a wide variety of breathtaking tasks.

In almost every field, from education and healthcare to the arts and business, large language models are being used to improve the efficiency of service delivery.

In particular, in the past year, many excellent open-source large language models have been released, such as Llama, Mistral, Falcon, and Gemma. These open-source LLMs are available to everyone, but deploying them can be very challenging: they can be very slow and require a lot of GPU compute to run, while also needing to be served in real time.

Currently, different tools and methods have been created to simplify the deployment of large language models.

Many tools have been created to provide faster inference serving for LLMs, such as vLLM, CTranslate2, TensorRT-LLM, and llama.cpp. Quantization techniques are also used so that very large language models can be loaded within limited GPU memory.

In this article, I'll explain how to deploy large language models using vLLM and quantization.

Latency and throughput

Some of the main factors that affect the speed performance of large language models are GPU hardware requirements and model size.

The larger the model, the more GPU computing power is required to run it. Common benchmark metrics used to measure the speed performance of large language models are latency and throughput.

  • Latency: This is the time it takes for a large language model to generate a response. It is usually measured in seconds or milliseconds.
  • Throughput: This is the number of tokens generated by a large language model per second.

Install the required packages

The following are the two packages required to run a large language model: Hugging Face Transformers and accelerate.

pip3 install transformers
pip3 install accelerate

What is Phi-2?

Phi-2 is the most advanced base model offered by Microsoft, with 2.7 billion parameters. It is pre-trained using a variety of data sources, from code to textbooks.

Benchmark LLM latency and throughput with Hugging Face Transformers
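Below is a minimal sketch of such a benchmark, assuming the microsoft/phi-2 checkpoint from the Hugging Face Hub; the dtype and generation length are assumptions and may differ from the author's exact setup.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Phi-2 and its tokenizer (device_map="auto" relies on accelerate)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompt = "Generate a python code that accepts a list of numbers and returns the sum."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate a response and measure the latency
start = time.time()
output = model.generate(**inputs, max_length=150)
latency = time.time() - start

# Throughput = total number of tokens in the output divided by the latency
throughput = len(output[0]) / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(tokenizer.decode(output[0], skip_special_tokens=True))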

Output generated:

Latency: 2.739394464492798 seconds
Throughput: 32.36171766303386 tokens/second
Generate a python code that accepts a list of numbers and returns the sum. [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

print(sum_list([1, 2, 3, 4, 5]))           

Step-by-step code decomposition

First, load the Phi-2 model and tokenize the prompt "Generate a python code that accepts a list of numbers and returns the sum."

Next, generate a response from the model and obtain the latency by timing how long the response takes to generate.

Finally, count the total number of tokens in the generated response and divide it by the latency to calculate the throughput.

The model runs on an A1000 (16GB GPU) and achieves a latency of 2.7 seconds and a throughput of 32 tokens per second.

Deploy large language models using vLLM

vLLM is an open-source LLM library for serving large language models with low latency and high throughput.

How vLLM works

The transformer is the building block of large language models. Transformer networks use a mechanism called attention, which the network uses to learn and understand the context of words. Attention involves a series of matrix computations over attention keys and values, and the memory consumed by these key-value interactions affects the speed of the model.

vLLM introduces a new attention mechanism called PagedAttention, which efficiently manages the memory allocated to the transformer's attention keys and values during token generation. vLLM's memory efficiency has proven very useful for running large language models with low latency and high throughput.

Here's a high-level explanation of how vLLM works. For more in-depth technical details, visit the vLLM documentation.
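As a simplified, toy illustration of the paging idea (not vLLM's actual implementation), the key-value cache can be carved into fixed-size blocks, with each sequence keeping a block table that maps token positions to physical blocks allocated on demand:

BLOCK_SIZE = 16  # tokens stored per physical cache block

class PagedKVCache:
    """Toy block allocator illustrating the idea behind PagedAttention."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # sequence id -> list of block ids

    def append_token(self, seq_id, position):
        """Reserve a physical block for this token position if the current one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):
            table.append(self.free_blocks.pop())    # allocate a new block only when needed
        return table[position // BLOCK_SIZE]

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                               # cache keys/values for 40 generated tokens
    cache.append_token(seq_id=0, position=pos)
print(cache.block_tables[0])                        # 40 tokens occupy only 3 blocks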

Install vLLM

pip3 install vllm==0.3.3           

Run Phi-2 with vLLM
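A minimal sketch of this run, again assuming the microsoft/phi-2 checkpoint; the sampling parameters are assumptions and may differ from the author's setup.

import time
from vllm import LLM, SamplingParams

# Load Phi-2 with vLLM and define the prompt and sampling parameters
llm = LLM(model="microsoft/phi-2", trust_remote_code=True)
prompt = "Generate a python code that accepts a list of numbers and returns the sum."
sampling_params = SamplingParams(max_tokens=150)

# Generate the response and measure the latency
start = time.time()
outputs = llm.generate([prompt], sampling_params)
latency = time.time() - start

# Throughput = number of generated tokens divided by the latency
generated = outputs[0].outputs[0]
throughput = len(generated.token_ids) / latency

print(f"Latency: {latency} seconds")
print(f"Throughput: {throughput} tokens/second")
print(generated.text)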

Output generated:

Latency: 1.218436622619629 seconds
Throughput: 63.15334836428132 tokens/second
 [1, 2, 3, 4, 5]
A: def sum_list(numbers):
    total = 0
    for num in numbers:
        total += num
    return total

numbers = [1, 2, 3, 4, 5]
print(sum_list(numbers))           

Step-by-step code decomposition

First, import the packages needed to run Phi-2 with vLLM.

Next, load Phi-2 with vLLM, define the prompt, and set the important parameters for running the model.

Then generate the model's response with llm.generate and measure the latency.

Count the total number of tokens in the generated response and divide it by the latency to get the throughput.

Finally, extract the generated text.

I ran Phi-2 with vLLM on the same prompt, "Generate a python code that accepts a list of numbers and returns the sum.", and on the same GPU (an A1000 with 16GB of memory). vLLM produced a latency of 1.2 seconds and a throughput of 63 tokens/second, while Hugging Face Transformers had a latency of about 2.7 seconds and a throughput of 32 tokens/second. Running a large language model with vLLM yields the same accurate result as Hugging Face, with much lower latency and much higher throughput.

Note: The vLLM metrics I obtained (latency and throughput) are an estimated baseline for vLLM performance. The speed at which the model generates tokens depends on many factors, such as the length of the input prompt and the size of the GPU.

According to the official vLLM report, serving LLMs with vLLM on powerful GPUs such as the A100 in production environments can achieve up to 24x higher throughput than Hugging Face Transformers.

Real-time benchmark latency and throughput

My method of calculating the latency and throughput of Phi-2 is experimental; I used it to illustrate how vLLM accelerates large language model inference. In real-world LLM use cases, such as chat-based systems where the model streams tokens as they are generated, measuring latency and throughput is more complex.

Chat-based systems stream output tokens as they are generated. Some of the main factors that affect LLM metrics include the time to first token (how long the model takes to produce the first token), the time per output token (how long each subsequent token takes to generate), the input sequence length, the expected total number of output tokens, and the model size. In a chat-based system, latency is typically the time to first token plus the time per output token multiplied by the expected total number of output tokens.
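As a back-of-the-envelope illustration with made-up numbers (not measured values):

# Rough streaming-latency estimate for a chat-style response (illustrative numbers only)
time_to_first_token = 0.2      # seconds until the first token appears
time_per_output_token = 0.03   # seconds for each subsequent token
expected_output_tokens = 200   # expected length of the response in tokens

latency = time_to_first_token + time_per_output_token * expected_output_tokens
print(f"Estimated latency: {latency:.2f} seconds")   # 6.20 seconds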

The longer the input sequence passed to the model, the slower the response. Some of the methods used when serving LLMs in real time involve batching input requests or prompts from users so that inference runs on them simultaneously, which helps improve throughput. In general, using powerful GPUs and serving LLMs with efficient tools such as vLLM improves both latency and throughput in real time.

Run a vLLM deployment on Google Colab

Quantization of large language models

Quantization is the conversion of a machine learning model from higher precision to lower precision by shrinking the model's weights into smaller bits, typically 8-bit or 4-bit. Deployment tools such as vLLM are very useful for inference serving of large language models at very low latency and high throughput.
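As a toy illustration of the idea (real libraries such as bitsandbytes use more sophisticated schemes), a float tensor can be mapped to 8-bit integers with a single scaling factor:

import torch

weights = torch.randn(4, 4)                               # original float32 weights
scale = weights.abs().max() / 127                         # map the largest magnitude onto the int8 range
q_weights = torch.round(weights / scale).to(torch.int8)   # 8-bit weights, 4x smaller than float32
dequantized = q_weights.float() * scale                   # approximate weights used at compute time

print("max quantization error:", (weights - dequantized).abs().max().item())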

We were able to conveniently run Phi-2 with Hugging Face and vLLM on Google Colab's T4 GPU because it is a smaller LLM with 2.7 billion parameters. In contrast, a 7-billion-parameter model such as Mistral 7B cannot be run on Colab with Hugging Face or vLLM.

Quantization is best suited for managing the GPU hardware requirements of large language models. When GPU availability is limited and we need to run a very large language model, quantization is the best way to load LLMs on constrained devices.

BitsandBytes

It is a Python library with custom quantization functions for shrinking a model's weights to lower precision (8-bit and 4-bit).

Install BitsandBytes:

pip3 install bitsandbytes           

Quantization of the Mistral 7B model

Mistral 7B is a 7 billion parameter model from MistralAI, one of the most advanced open-source large language models. I'll walk through running Mistral 7B with different quantization techniques that can run on Google Colab's T4 GPU.

8-bit precision quantization: This is the conversion of the model's weights to 8-bit precision. BitsandBytes integrates with Hugging Face Transformers, so a language model can be loaded with the same Hugging Face code, with only minor modifications to enable quantization.
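A minimal sketch of 8-bit loading; the checkpoint name mistralai/Mistral-7B-Instruct-v0.1 is an assumption and can be replaced with any Mistral 7B variant.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"   # assumed checkpoint name

# Quantization configuration: load the weights in 8-bit precision
quant_config = BitsAndBytesConfig(load_in_8bit=True)

# Pass the configuration when loading the model; device_map="auto" places the
# quantized weights in the available GPU memory automatically
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)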

First, import the packages required to run the model, including the BitsAndBytesConfig class.

Next, define the quantization configuration and set the parameter load_in_8bit to True so that the model's weights are loaded in 8-bit precision.

Finally, pass the quantization configuration into the function that loads the model and set the parameter device_map to "auto" so that the appropriate GPU memory is allocated automatically. The tokenizer is loaded as well.

4-bit precision quantization: This is the conversion of the weights of the machine learning model to 4-bit precision.

The code for loading Mistral 7B with 4-bit precision is similar to the 8-bit code, with a few changes (shown in the sketch after this list):

  • Change load_in_8bit to load_in_4bit.
  • A new parameter, bnb_4bit_compute_dtype, is passed to BitsAndBytesConfig to perform the model's computations in bfloat16. bfloat16 is the compute data type used for the model's calculations during inference, which makes generation faster.
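A sketch of the changed configuration (the rest of the loading code stays the same as in the 8-bit example):

import torch
from transformers import BitsAndBytesConfig

# 4-bit configuration: only the quantization config changes
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,   # run the model's computations in bfloat16
)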

NF4 (4-bit NormalFloat) and double quantization

QLoRA's NF4 (4-bit NormalFloat) is an optimized quantization approach that yields better results than standard 4-bit quantization. It is combined with double quantization, in which quantization happens twice: the quantization constants from the first stage are themselves quantized in a second stage, yielding optimal values for representing the model's weights.

According to the QLoRA paper, NF4 with double quantization shows no degradation in accuracy. More in-depth technical details about NF4 and double quantization can be found in the QLoRA paper.

Set the extra parameters in BitsAndBytesConfig:

  • load_in_4bit: set to True to load the model in 4-bit precision.
  • bnb_4bit_quant_type: set the quantization type to NF4.
  • bnb_4bit_use_double_quant: set to True to enable double quantization.
  • bnb_4bit_compute_dtype: set to bfloat16, the compute data type, for faster inference.

Then load the model's weights and the tokenizer.

The full code for model quantization
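Below is a minimal sketch of the full NF4 setup, again assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint and an assumed generation length of 256 new tokens.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"   # assumed checkpoint name

# NF4 quantization with double quantization and bfloat16 compute
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "[INST] What is Natural Language Processing? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0]))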

Output generated:

<s> [INST] What is Natural Language Processing? [/INST] Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and
computer science that deals with the interaction between computers and human language. Its main objective is to read, decipher, 
understand, and make sense of the human language in a valuable way. It can be used for various tasks such as speech recognition, 
text-to-speech synthesis, sentiment analysis, machine translation, part-of-speech tagging, name entity recognition, 
summarization, and question-answering systems. NLP technology allows machines to recognize, understand,
 and respond to human language in a more natural and intuitive way, making interactions more accessible and efficient.</s>           

Quantization is a very good way to optimize very large language models so that they run on smaller GPUs, and it can be applied to any model, such as Llama 70B, Falcon 40B, and MPT-30B.

According to the LLM.int8() paper, very large language models lose less accuracy when quantized than smaller language models.

Quantization is best applied to very large language models and is not well suited to smaller models because of the greater loss in accuracy.

Conclusion

In this article, we provide a step-by-step approach to measuring the speed performance of large language models, explain how vLLM works, and show how it can be used to improve the latency and throughput of large language models.

Finally, we explain quantization, including how to use it and how to load large language models on small GPUs.

Reference:

https://towardsdatascience.com/deploying-large-language-models-vllm-and-quantizationstep-by-step-guide-on-how-to-accelerate-becfe17396a2
