Microsoft Open-Sources DeepSpeed Chat: The Era of ChatGPT for Everyone Is Here

  Shin Ji Won reports  

Editors: Aeneas, Sleepy

Microsoft's open-source DeepSpeed Chat lets developers realize the dream of a ChatGPT in everyone's hands!

Is that dream about to come true?

Just now, Microsoft open-sourced DeepSpeed Chat, a system framework that adds the complete RLHF pipeline to model training.

In other words, high-quality ChatGPT-like models of all sizes are now within everyone's reach!

Project address: https://github.com/microsoft/DeepSpeed

Unlock ChatGPT with one click, at 15x lower cost

As we all know, because OpenAI is just not that open, the open-source community has successively launched models such as LLaMA, Alpaca, Vicuna, and Databricks-Dolly so that more people can use ChatGPT-like models.

However, for lack of an end-to-end RLHF training system that scales, training ChatGPT-like models has remained very difficult. DeepSpeed Chat fills exactly this gap.

Even better, DeepSpeed Chat drives the cost way down.

Previously, expensive multi-GPU setups were beyond the reach of many researchers, and even those with access to multi-GPU clusters could not afford to train ChatGPT models with hundreds of billions of parameters using existing methods.

Now, for $1620, you can train an OPT-66B model in 2.1 days with the hybrid engine DeepSpeed-HE.

If a multi-node, multi-GPU system is used, DeepSpeed-HE can train an OPT-13B model in 1.25 hours for $320 and an OPT-175B model in less than a day for $5120.

Elvis, a former Meta AI researcher, retweeted the news excitedly, calling it a big deal and wondering how DeepSpeed Chat compares with ColossalChat.

Let's take a look at how it works.

After DeepSpeed-Chat training, the 1.3-billion-parameter version of "ChatGPT" performs very well in Q&A. It not only grasps the context of a question, but the answers it gives are also quite convincing.

In multi-turn conversations, this 1.3-billion-parameter "ChatGPT" performs far beyond what one would expect from a model of that size.

A piece of code to generate your first ChatGPT

Of course, before trying it out, you need to set up the environment:
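Going by the DeepSpeed-Chat repository's README at the time of release, the setup looks roughly like this (a sketch; check the repo for the current steps):

    # Assumed setup steps from the DeepSpeed-Chat README
    pip install deepspeed>=0.9.0
    git clone https://github.com/microsoft/DeepSpeedExamples.git
    cd DeepSpeedExamples/applications/DeepSpeed-Chat/
    pip install -r requirements.txt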

The time for a cup of coffee, a 1.3-billion-parameter ChatGPT

If you only have about 1-2 hours for a coffee or lunch break, you can try training a "little toy" with DeepSpeed-Chat.

The team has prepared a training example for a 1.3B model that can be tested on a consumer GPU. Best of all, by the time you come back from your lunch break, everything will be ready.

Consumer NVIDIA A6000 GPU with 48GB of memory:
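Per the repository README, this case is a single command (shown as published; treat it as a sketch if the interface has since changed):

    # Train a 1.3B actor model with a 350M reward model on one consumer GPU
    python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu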

One GPU Node, 13 billion parameters in half a day

If you have only half a day and a single server node, you can use a pre-trained OPT-13B as the actor model and OPT-350M as the reward model to generate a 13-billion-parameter ChatGPT-like model:

Single DGX node with 8 NVIDIA A100-40G GPUs:
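The corresponding launch command from the README (again, as published):

    # Train a 13B actor model with a 350M reward model on one 8x A100-40G node
    python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node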

Ultra-money-saving cloud solution, training a 66-billion-parameter model

If you have access to a multi-node cluster or cloud resources and want to train a larger, higher-quality model, you only need to specify the model size (e.g., 66B) and the number of GPUs (e.g., 64) in a single line:

8 DGX nodes, each equipped with 8 NVIDIA A100-80G GPUs:
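The multi-node variant from the README looks like this:

    # Train a 66B actor model with a 350M reward model across 8 nodes / 64 A100-80G GPUs
    python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node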

Specifically, the time and cost required by the DeepSpeed-RLHF system for different scale models and hardware configurations are as follows:

(Table: training time and cost of DeepSpeed-RLHF for each model scale and hardware configuration.)

What is DeepSpeed Chat?

DeepSpeed Chat is a general system framework that enables end-to-end RLHF training of ChatGPT-like models, helping us generate our own high-quality ChatGPT-like models.

DeepSpeed Chat has the following three core features:

1. Simplified training and inference experience for ChatGPT-like models

Developers can complete all the training steps with a single script, and can then use the inference API for conversational interactive testing.
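For example, the repository ships a small serving script for chatting with the finished model. Per the README, it is invoked roughly like this (the path placeholder is left as written there):

    # Chat with your trained model in an interactive loop
    python chat.py --path ${PATH-to-your-actual-model}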

2. DeepSpeed-RLHF module

DeepSpeed-RLHF replicates the training pipeline from the InstructGPT paper and provides data abstraction and blending capabilities that let developers train with data from multiple different sources.

3. DeepSpeed-RLHF system

The team integrated DeepSpeed's training engine and inference engine into a unified hybrid engine (DeepSpeed Hybrid Engine or DeepSpeed-HE) for RLHF training. Since DeepSpeed-HE is able to seamlessly switch between inference and training modes, it can take advantage of various optimizations from DeepSpeed-Inference.

The DeepSpeed-RLHF system has unmatched efficiency in large-scale training, making complex RLHF training fast, economical, and easy to scale up:

Efficient and economical:

DeepSpeed-HE is more than 15 times faster than existing systems, making RLHF training fast and affordable.

For example, DeepSpeed-HE can train an OPT-13B model in just 9 hours on the Azure cloud and an OPT-30B model in just 18 hours. These two trainings cost less than $300 and $600, respectively.

Excellent scalability:

DeepSpeed-HE can support the training of models with hundreds of billions of parameters and show excellent scalability on multi-node and multi-GPU systems.

Therefore, even a model with 13 billion parameters can be trained in just 1.25 hours. For a model with 175 billion parameters, training with DeepSpeed-HE takes less than a day.

Democratize RLHF training:

With a single GPU, DeepSpeed-HE supports training models with more than 13 billion parameters. This enables data scientists and researchers without access to multi-GPU systems to create not only lightweight RLHF models but also large, powerful ones for different use cases.

Complete RLHF training process

To provide a seamless training experience, the researchers followed InstructGPT and included a complete end-to-end training process in DeepSpeed-Chat.

DeepSpeed-Chat's RLHF training flowchart includes a number of optional features

The process consists of three main steps:

Step 1:

Supervised fine-tuning (SFT), which uses curated human responses to fine-tune a pre-trained language model for a variety of queries.

Step 2:

Reward model fine-tuning, which uses a dataset of human-ranked answers to the same queries to train a separate reward model (RW), usually smaller than the SFT model.

Step 3:

RLHF training, in which the SFT model is further fine-tuned with reward feedback from the RW model using the Proximal Policy Optimization (PPO) algorithm.

In Step 3, the researchers provide two additional features to help improve model quality:

- Exponential Moving Average (EMA) collection, where an EMA-based checkpoint can be chosen for the final evaluation (see the sketch after this list).

- Mixed training, which blends the pre-training objective (i.e., next-word prediction) with the PPO objective to prevent performance regressions on public benchmarks such as SQuAD2.0.
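To make the EMA idea concrete, here is a minimal PyTorch-style sketch (a hypothetical helper, not DeepSpeed's actual implementation; the decay value is illustrative):

    import torch

    @torch.no_grad()
    def update_ema(ema_model, model, decay=0.992):
        # Shadow copy of the weights: ema = decay * ema + (1 - decay) * current
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1 - decay)

After each PPO update, the shadow weights are refreshed, and it is the EMA checkpoint, rather than the final raw weights, that is used for evaluation.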

EMA and mixed training are two training features often overlooked by other open-source frameworks, since omitting them does not prevent training from running.

However, according to InstructGPT, EMA checkpoints tend to yield better response quality than conventional final checkpoints, and mixed training helps the model retain its performance on pre-training benchmarks.

The researchers therefore expose both features, allowing users to fully reproduce the training experience described in InstructGPT.

In addition to being highly consistent with the InstructGPT paper, the researchers also provide features that allow developers to train their own RLHF models using a variety of data resources:

Data abstraction and blending capabilities:

DeepSpeed-Chat is equipped with (1) an abstract dataset layer to unify the format of different datasets, and (2) data splitting/blending capabilities, whereby multiple datasets are blended appropriately and then split across the three training phases.
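As an illustration, the training scripts in the repository expose flags along these lines (flag names as recalled from the repo's training scripts, and the datasets are examples; treat this as an assumption and check the scripts):

    # Blend two Hugging Face datasets and split them 2/4/4 across the three training phases
    deepspeed main.py \
        --data_path Dahoas/rm-static Dahoas/full-hh-rlhf \
        --data_split 2,4,4 \
        ...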

DeepSpeed hybrid engine

Steps 1 and 2 of the RLHF pipeline described above resemble the ordinary fine-tuning of large models; DeepSpeed training achieves scale and speed for them through a combination of ZeRO-based optimizations and flexible parallelism strategies.

Step 3 of the pipeline is the most complex part in terms of performance impact.

Each iteration must efficiently handle two phases: a) an inference phase for token/experience generation, which produces the inputs for training; and b) a training phase that updates the weights of the actor and reward models, along with the interaction and scheduling between the two phases.

This introduces two major difficulties: (1) memory cost, since several SFT and RW models must be kept running throughout the third phase; and (2) slow answer generation, which, if not properly accelerated, significantly slows down the entire third stage.

In addition, the two important features the researchers added in the third phase, exponential moving average (EMA) collection and mixed training, incur extra memory and training costs.

To address these challenges, the researchers put all of DeepSpeed's training and inference system capabilities into a unified infrastructure, the Hybrid Engine.

It uses the original DeepSpeed engine for fast training mode, while effortlessly applying the DeepSpeed inference engine for generation/evaluation mode, providing a faster training system for the third stage of RLHF training.

As shown in the figure below, the transition between the DeepSpeed training and inference engines is seamless: by enabling the typical eval and train modes for the actor model, DeepSpeed applies different optimizations when running inference and training, making the model run faster and improving the throughput of the whole system.

DeepSpeed hybrid engine design to accelerate the most time-consuming part of the RLHF process
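Conceptually, the step-3 loop alternates between the two modes. A minimal sketch, with the caveat that actor, reward_model, ppo_step, and prompt_loader are illustrative names rather than the actual DeepSpeed-Chat API:

    for prompts in prompt_loader:
        actor.eval()                          # inference mode: optimized kernels, KV cache, tensor parallelism
        experience = actor.generate(prompts)  # generate answers/experience for PPO
        actor.train()                         # training mode: ZeRO-sharded weights and optimizer states
        ppo_step(actor, reward_model, experience)  # PPO update driven by reward feedback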

During inference execution in the experience-generation stage of RLHF training, the DeepSpeed hybrid engine uses a lightweight memory-management system to handle the KV cache and intermediate results, and combines highly optimized inference CUDA kernels with tensor parallelism, achieving significantly higher throughput (tokens per second) than existing systems.

During training, the hybrid engine enables memory-optimization techniques such as DeepSpeed's ZeRO family of technologies and Low-Rank Adaptation (LoRA).

The researchers designed and implemented these system optimizations to be mutually compatible, so that they can be combined under the unified hybrid engine to deliver the highest training efficiency.

The hybrid engine can seamlessly change the model partitioning between training and inference, supporting tensor-parallel inference and ZeRO-based sharded training.

It can also reconfigure the memory system to maximize memory availability in each mode.

This avoids memory allocation bottlenecks and supports large batch sizes, greatly improving performance.

Together, these capabilities push the boundaries of modern RLHF training, delivering unparalleled scale and system efficiency for RLHF workloads.

Effectiveness evaluation

Compared with existing systems such as Colossal-AI or HuggingFace-DDP, DeepSpeed-Chat delivers more than an order of magnitude higher throughput: it can train larger actor models within the same latency budget, or train similarly sized models at lower cost.

For example, on a single GPU, DeepSpeed raises the throughput of RLHF training by more than 10x. And while CAI-Coati and HF-DDP can each run a model of at most 1.3B parameters, DeepSpeed can run a 6.5B model on the same hardware, a direct 5x increase in model size.

On multiple GPUs on a single node, DeepSpeed-Chat is 6-19 times faster than CAI-Coati in terms of system throughput and 1.4-10.5 times faster than HF-DDP.

According to the team, one of the key reasons why DeepSpeed-Chat has achieved such excellent results is the acceleration provided by the hybrid engine during the generation phase.

Resources:

https://github.com/microsoft/DeepSpeed
