
Think about it: can federated learning train large language models?

Author: Machine Heart Pro

Machine Heart Original

Written by Jiying

Editor: H4O

1. Overview

With the rapid development of large language models (LLMs), there has been growing discussion about their impact on the artificial intelligence industry. One view is that large language models have undermined the development path of AI startups: these models have enormous numbers of parameters, require massive computing power, and depend on training data at a very large scale. Big models, big parameters, and big data are concentrated in large AI companies, leaving fewer and fewer opportunities for startups. The opposite view holds that large language models have, to some extent, promoted the broad adoption of AI across many fields, for example by building vertical large language models on top of existing ones using private data, or by applying large language models directly in different business scenarios.

We will not debate these two views here. Instead, we focus on the application method mentioned in the second view: how can private data be used to train large language models, while preserving data privacy, to meet the needs of vertical domains? This is also key to whether LLMs can promote the growth of AI startups. Specifically, we ask: can federated learning be used to train large language models?

1.1 Large language model structure and required resources

Due to its excellent parallelizability and capacity, the Transformer architecture has become the de facto backbone for developing various LLMs, making it possible to scale language models to hundreds of billions of parameters. In general, the mainstream architectures of existing LLMs can be roughly divided into three types: encoder-decoder, causal decoder, and prefix decoder, as shown in Figure 1 [8].


Figure 1. Comparison of attention patterns in the three mainstream architectures. Blue, green, yellow, and gray rounded rectangles represent attention between prefix tokens, attention between prefix and target tokens, attention between target tokens, and masked attention, respectively.

  • Encoder-decoder architecture. The vanilla Transformer model is built on an encoder-decoder structure, which consists of two stacks of Transformer blocks acting as the encoder and decoder. The encoder encodes the input sequence with stacked multi-head self-attention layers to produce its latent representations, while the decoder attends to these representations via cross-attention and autoregressively generates the target sequence. Encoder-decoder models such as T5 and BART have shown effectiveness on various NLP tasks. So far, only a handful of LLMs have been built on encoder-decoder architectures, such as Flan-T5.
  • Causal decoder architecture. The causal decoder incorporates a unidirectional attention mask to ensure that each input token can attend only to past tokens and itself. Input and output tokens are processed by the decoder in the same way. As a representative language model of this architecture, the GPT series is developed on the causal decoder architecture. In particular, GPT-3 successfully demonstrated the effectiveness of this architecture and also showed the remarkable in-context learning ability of LLMs. Interestingly, GPT-1 and GPT-2 do not exhibit capabilities as strong as GPT-3's, and scaling seems to have played an important role in increasing the capacity of this model architecture. To date, causal decoders have been widely adopted as the LLM architecture by various existing LLMs, such as OPT, BLOOM, and Gopher.
  • Prefix decoder architecture. The prefix decoder architecture (also known as the non-causal decoder) modifies the masking mechanism of the causal decoder so that attention over the prefix tokens is bidirectional while attention over the generated tokens remains unidirectional. In this way, like the encoder-decoder structure, the prefix decoder can encode the prefix sequence bidirectionally and predict the output tokens autoregressively one by one, sharing the same parameters during encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train a causal decoder and then convert it into a prefix decoder to accelerate convergence; for example, U-PaLM is derived from PaLM. Existing representative LLMs based on prefix decoders include GLM-130B and U-PaLM.

However, as large language models have developed, more and more concerns have been raised, mainly focusing on the following aspects:

  • Privacy should not be compromised. State-of-the-art models are only accessible through black-box APIs. These APIs raise privacy concerns for companies that must transfer data to centralized LLM vendors; for example, Samsung reportedly leaked trade secrets via ChatGPT, highlighting the risks associated with such APIs.
  • Proprietary, trained models are important intellectual property. For businesses, the problems and datasets that stand to benefit most from AI tend to be sensitive and proprietary, which precludes fine-tuning through a public model. It is therefore better for a business to fine-tune an on-premises model or deploy the model inside its own firewall.
  • Cost. Currently, LLM training requires thousands of GPU nodes at a cost that can reach billions of dollars. Models may need to balance different trade-offs, such as slightly sacrificing performance to reduce cloud costs. Owning the model also makes it easy for customers to fine-tune and retrain it; for example, they can remove redundant model parameters to achieve high performance on domain-specific tasks while minimizing cloud costs.

This leads to the approach discussed in this article: FL+LLM, i.e., introducing federated learning to train large language models. This could offer many advantages to enterprise users, greatly improving their ability to use large models in terms of model size and performance, privacy, efficiency, cloud computing costs, and labor costs.

1.2 Review of federated learning

Federated learning (FL) is a machine learning setting in which multiple clients (such as mobile devices or entire organizations) collaborate to train a model under the coordination of a central server (such as a service provider) while the training data remains decentralized. FL embodies the principles of focused data collection and data minimization, mitigating many of the privacy risks, security risks, and costs associated with traditional centralized machine learning and data science approaches. FL is therefore an effective high-performance computing paradigm and is regarded as a distributed training method that meets data-privacy requirements. FL has already been used successfully in a number of application scenarios, including consumer devices (e.g., Gboard, keyword spotting) as well as pharmaceuticals, medical research, finance, and manufacturing. In addition, a large number of FL development tools and platforms, such as TensorFlow Federated, LEAF, PaddleFL, and PySyft, have emerged to further promote the development of FL.

In a federated learning framework, the central server initializes and holds a shareable global model. Individual clients (participants, edge devices) hold local data and train local machine learning models on that data. According to a prescribed communication mechanism, each client transmits model parameters and other metadata to the central server (raw client data is never transmitted), and the central server aggregates the data uploaded by the clients to build the global model; every client has equal identity and status in the federated learning mechanism. Federated learning thus allows two or more data-owning parties (clients) to jointly use their data without sharing it, solving the problem of data silos. In addition, provided the feature spaces of the clients' data are aligned, the global federated model can achieve a modeling effect comparable to training on the centrally pooled dataset.

Here are four very important federated learning algorithms:

FedAvg (Federated Averaging) [1]: FedAvg is the most classic federated learning algorithm, proposed by Google in 2016. The algorithm iteratively updates model parameters by gradient descent. In FedAvg, each client (for example, a phone or other device) first trains the model locally on its own data and then sends its local model weights to a central server. The central server collects the weight updates from all clients, computes a data-size-weighted average of the weights, and distributes the updated weights back to the clients. This process is repeated over multiple rounds until convergence; a minimal server-side sketch is given below.
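As a rough illustration (not the reference implementation of [1]), the server-side aggregation step of FedAvg can be sketched in Python as follows; client_states and client_sizes are hypothetical names for the weight dictionaries returned by the clients and their local dataset sizes, and parameters are assumed to be floating-point tensors:

import copy

def fedavg_aggregate(client_states, client_sizes):
    """Server-side FedAvg step: weighted average of client model weights,
    where each client's weight is proportional to its local dataset size.
    `client_states` is a list of PyTorch-style state_dicts."""
    total = float(sum(client_sizes))
    new_state = copy.deepcopy(client_states[0])
    for key in new_state:
        new_state[key] = sum(
            (n / total) * state[key] for state, n in zip(client_states, client_sizes)
        )
    return new_state

# One communication round (conceptually): broadcast global weights, let clients
# train locally, then aggregate:
# global_state = fedavg_aggregate(states_from_clients, sizes_from_clients)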

FedProx (Federated Proximal) [2]: FedProx is an improved federated learning algorithm designed to address non-independent and identically distributed (non-IID) data and device heterogeneity in federated learning, issues that can degrade training quality, especially when the training data is unevenly distributed. FedProx adds a proximal term to the local optimization objective, which encourages each client's local weights to stay close to the current global model weights during updates. This mitigates the negative impact of non-IID data and device heterogeneity and improves model performance; a sketch of the modified local step follows.
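A minimal sketch of the FedProx local objective, assuming a PyTorch-style model and illustrative argument names (our own simplification, not the authors' code):

import torch

def fedprox_local_step(model, global_params, batch, loss_fn, optimizer, mu=0.01):
    """One local training step with the FedProx proximal term added to the loss:
    L_local(w) + (mu / 2) * ||w - w_global||^2, which keeps the local model
    close to the current global model under non-IID data."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    prox = torch.zeros((), device=loss.device)
    for p, g in zip(model.parameters(), global_params):
        prox = prox + torch.sum((p - g.detach()) ** 2)
    (loss + 0.5 * mu * prox).backward()
    optimizer.step()
    return loss.item()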

SCAFFOLD (Stochastic Controlled Averaging for Federated Learning) [3]: SCAFFOLD is a communication-efficient federated learning algorithm. It maintains control variates on the server and on each client and uses them to correct the "drift" in each client's local gradient updates under heterogeneous data, so that local models move in a direction consistent with the global model. This allows SCAFFOLD to converge in fewer communication rounds, reducing communication costs without sacrificing model performance.

FedNova (Federated Normalized Averaging) [4]: FedNova's core idea is to use normalized averaging to eliminate objective inconsistency while maintaining fast error convergence. In each round, local training is performed on each client device, the accumulated local update is normalized (for example, by the number of local steps taken), and the normalized update is sent to the central server. The central server aggregates the normalized updates from all clients by weighted averaging, which removes the objective inconsistency caused by clients performing different amounts of local work, and then distributes the updated model parameters back to the clients for the next round. FedNova's advantage is its ability to eliminate objective inconsistency arising from data heterogeneity while maintaining fast error convergence, so it performs well under non-IID data and device heterogeneity; a server-side sketch follows.
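A hedged server-side sketch of the normalized-averaging idea, assuming plain local SGD and illustrative names (client_deltas, local_steps, data_weights); the paper's general formulation is more flexible than this:

def fednova_aggregate(global_state, client_deltas, local_steps, data_weights):
    """FedNova-style aggregation (local SGD case): each client's cumulative
    update is divided by its number of local steps before averaging, so clients
    that ran more local steps do not pull the global objective toward their own.
    client_deltas[i][k] = global_state[k] - local_state_i[k] after local training."""
    num_clients = len(client_deltas)
    # Normalize each client's accumulated update by its local step count.
    normalized = [
        {k: v / local_steps[i] for k, v in client_deltas[i].items()}
        for i in range(num_clients)
    ]
    # Effective step count: data-weighted average of local step counts.
    tau_eff = sum(p * t for p, t in zip(data_weights, local_steps))
    new_state = {}
    for k in global_state:
        avg_delta = sum(data_weights[i] * normalized[i][k] for i in range(num_clients))
        new_state[k] = global_state[k] - tau_eff * avg_delta
    return new_state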

2. Difficulties in training large language models with federated learning structure

In recent years, research on both LLMs and FL has made progress. However, under current technical and resource conditions, directly combining FL with LLMs still faces a series of problems.

2.1 Data

LLM training relies on very large amounts of data. Chinchilla explores how much data is needed to train an LLM [9]. In May 2020, OpenAI presented its LLM data scaling laws (also known as the Kaplan scaling laws): 300B tokens can be used to train an LLM with 175B parameters. In September 2022, DeepMind found new data scaling laws (also known as the Chinchilla or Hoffmann scaling laws) for "data-optimal" LLMs: roughly 1,400B (1.4T) tokens should be used to optimally train an LLM with 70B parameters.
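A trivial back-of-the-envelope helper based on the figures above (the 70B-parameter / 1.4T-token point implies roughly 20 training tokens per parameter; the exact Chinchilla fit differs slightly):

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Token budget implied by the Chinchilla figures above:
    roughly 20 training tokens per model parameter (70B params ~ 1.4T tokens)."""
    return 20.0 * n_params

print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f}T tokens for a 70B model")   # ~1.4T
print(f"{chinchilla_optimal_tokens(175e9) / 1e12:.1f}T tokens for a 175B model")  # ~3.5T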


Table 1. Dataset sizes required to match the Chinchilla data-optimal models [9]

From the Chinchilla figures in Table 1, it is clear that training an LLM under a federated learning structure necessarily means distributing a massive amount of data across client devices. If the number of clients is small, the amount of data stored and processed on each client remains large and the computing requirements per client are very high, which contradicts the original design intent of the federated learning architecture. If the number of clients is very large and the amount of data per client is moderate, then resource imbalance across the many clients, uneven computing power, and uneven resource allocation become very prominent, and the client incentive mechanism under such imbalance becomes particularly critical. All of this has a significant impact on the quality of the central federated model and on training efficiency.

2.2 Client Device Aspects

How can a single client hold the entire LLM in a distributed system under a federated learning architecture? On existing federated learning platforms, and especially in practical scenarios, most distributed client devices are mobile phones, tablets, and the like; these devices cannot hold an entire LLM or provide sufficient memory for model training. Excluding such devices from training entirely would again contradict the original design of the federated learning architecture. So how can an LLM be trained in parallel when clients cannot hold the entire model, yet cannot simply be excluded from training?

In an article, Microsoft analyzed the device-memory footprint of LLaMA-scale models [10]. During model training, most of the memory overhead is spent on model states (model parameters, optimizer states, and gradients). The rest comes from residual states (activations, temporary buffers, and memory fragmentation).

Taking the LLaMA 7B model as an example, the memory required for the model parameters (number of parameters × bytes per parameter) is:

  • fp32 precision: 7B × 4 bytes = 28 GB
  • fp16 precision: 7B × 2 bytes = 14 GB
  • int8 precision: 7B × 1 byte = 7 GB
  • Mixed-precision (fp16/32) training: fp16 working copy + fp32 master copy = 14 GB + 28 GB = 42 GB

Memory required for the gradients (again, number of parameters × bytes per parameter):

  • fp32 precision: 7B × 4 bytes = 28 GB
  • fp16 precision: 7B × 2 bytes = 14 GB
  • int8 precision: 7B × 1 byte = 7 GB
  • Mixed-precision (fp16/32) training: only fp16 gradients are stored = 14 GB

Memory required by the optimizer (number of parameters × bytes per parameter × 2, taking Adam as an example):

  • fp32 precision: 7B × 4 bytes × 2 = 56 GB
  • fp16 precision: 7B × 2 bytes × 2 = 28 GB
  • int8 precision: 7B × 1 byte × 2 = 14 GB
  • Mixed-precision (fp16/32) training: fp32 optimizer states = 56 GB

In total, the memory required to train the LLaMA 7B model is:

  • fp32 precision: 28 + 28 + 56 = 112 GB
  • fp16 precision: 14 + 14 + 28 = 56 GB
  • int8 precision: 7 + 7 + 14 = 28 GB
  • Mixed-precision (fp16/32) training: 42 + 14 + 56 = 112 GB
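The same arithmetic can be packaged as a small helper for any model size; this only covers model states (parameters, gradients, Adam optimizer states) and ignores activations and buffers, following the assumptions of the example above:

def training_memory_gb(params_in_billions: float, mode: str = "fp32") -> float:
    """Back-of-the-envelope memory for model states only (parameters, gradients,
    Adam optimizer states), using the per-parameter byte counts from the
    LLaMA 7B example above."""
    bytes_per = {"fp32": 4, "fp16": 2, "int8": 1}
    if mode == "mixed":                          # fp16/32 mixed-precision training
        params = params_in_billions * (2 + 4)    # fp16 working copy + fp32 master copy
        grads = params_in_billions * 2           # fp16 gradients
        optim = params_in_billions * 4 * 2       # fp32 Adam momentum + variance
    else:
        b = bytes_per[mode]
        params = params_in_billions * b
        grads = params_in_billions * b
        optim = params_in_billions * b * 2
    return params + grads + optim                # in GB, since params are in billions

for mode in ("fp32", "fp16", "int8", "mixed"):
    print(mode, training_memory_gb(7, mode), "GB")  # 112, 56, 28, 112 for LLaMA 7B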

LLaMA 7B is very small compared with current LLMs, yet even its memory consumption exceeds what many mobile phones and tablets can provide, which severely constrains the choice of client devices under an FL architecture.

2.3 Training algorithm aspects

Existing federated learning training algorithms are inefficient for LLMs. As described above, federated learning works as follows: "individual clients train local models on local data; clients transmit model parameters and other data to the central server according to a certain communication mechanism, and the central server aggregates the data uploaded by the clients to build a global model." Because the parameter count and model size of an LLM are enormous, this transmission-and-aggregation process is constrained by communication conditions, making the overall workflow very inefficient.

3. What progress has been made on FL+LLM?

Although FL+LLM still faces many problems, many researchers have begun to explore the technical possibilities of related directions and have made some progress. In this section, we describe the progress of the relevant work.

3.1 FedLLM: Build your own large language models on proprietary data

FedLLM is an MLOps-powered training pipeline that enables enterprises to build their own large language models on proprietary data [7]. The code is available at https://github.com/FedML-AI/FedML/tree/master/python/app/fedllm. FedLLM stands for "foundational ecosystem design for LLM," not just "federated learning for LLM."

FedLLM is the FL+LLM framework introduced by FedML, Inc. (https://FedML.ai), an international team led by Chinese researchers that originated at the University of Southern California and was one of the early groups worldwide working on this technology. Over the past few years, FedML has grown from a doctoral-student-led open-source research project, supporting multiple research grants and more than 50 top-tier papers from the lab. In 2023, FedML announced the completion of a $6 million seed and pre-seed funding round, led by Camford Capital with participation from investors such as Plug and Play Ventures, AimTop Ventures, Acequia Capital, and LDV Partners.


Figure 2. The training process of FedLLM

As shown in Figure 2, FedLLM implements data collaboration, compute collaboration, and model collaboration, and supports training on centralized and geographically distributed GPU clusters as well as federated learning across data silos. FedLLM is compatible with popular LLM libraries such as HuggingFace and DeepSpeed and is designed with efficiency and security/privacy in mind.

  • Data collaboration — enable LLM training on domain-specific proprietary data. General-purpose LLMs such as ChatGPT and GPT-4 are trained on large amounts of text written and annotated by humans. In many verticals, such as healthcare, fintech, legal, and automotive, ChatGPT may not work well. Enabling enterprises to train models on their proprietary data can achieve better performance while protecting privacy. In cases where datasets are scattered across data silos, FedML's federated learning training pipeline, "Train on the Edge," handles them in a secure and scalable manner.
  • Compute collaboration — leverage dispersed computing resources. Because of the high investment in computing power, only a few large tech companies (e.g., Microsoft, Google) can afford LLM pre-training. Enterprises in other verticals cannot afford to spend billions of dollars on thousands of GPU nodes. A more cost-effective way to make full use of GPU resources is to establish a sharing mechanism across organizations. FedML does this through its "Train on the Cloud" platform, which schedules training jobs onto a geographically distributed CPU/GPU network. Such compute collaboration reduces the financial burden on any single organization of purchasing a large number of GPU nodes.
  • Model collaboration — serve models in a federated manner. Serving large foundation models is also a challenge. In contrast to other MLOps offerings, FedML pioneered federated model inference over geographically distributed cloud computing resources. When an inference request is sent to an inference endpoint, the master node routes it to decentralized edge nodes hosted by GPU providers, who can share their GPUs' idle time. Such a model-serving platform can provide better service reliability and lower cloud computing costs.

For a specific siloed GPU cluster, FedLLM leverages existing open-source LLMs and popular frameworks for local training:

  • The model definition and pre-trained weights come from the 2.8B, 7B, and 12B versions of EleutherAI's Pythia, with Hugging Face transformers as the reference implementation (https://github.com/huggingface/peft?ref=blog.fedml.ai). FedLLM is also a model-agnostic framework: enterprises and developers can plug in any LLM.
  • FedLLM supports parameter-efficient training methods such as LoRA. Its reference implementation comes from Hugging Face's peft (https://huggingface.co/docs/transformers/index?ref=blog.fedml.ai); a minimal usage sketch follows this list.
  • For a standalone trainer, the training/evaluation code is based on the Trainer in the Transformers library (https://huggingface.co/docs/transformers/index?ref=blog.fedml.ai).
  • Distributed training on a given GPU cluster is handled by DeepSpeed (https://www.deepspeed.ai/?ref=blog.fedml.ai). ZeRO-3 is also enabled to reduce per-GPU memory cost. By enhancing the compatibility between FedML and DeepSpeed, FedLLM can run training jobs across clusters distributed in different physical locations.
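For concreteness, the kind of parameter-efficient setup described above can be sketched with Hugging Face transformers and peft as follows; the model name, rank, and other hyperparameters are illustrative choices, not FedLLM's actual configuration:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "EleutherAI/pythia-2.8b"            # illustrative Pythia checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # LoRA scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # attention projection in GPT-NeoX/Pythia
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices are trainable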

FedML also gives an example application of FedLLM. As shown in Figure 3, local data remains local; only model parameters or weights flow to and from the central server. This particular instance assumes a chat-style application that combines training on local data with the benefit of fine-tuning built by the central server from training on other devices. Chat applications are just one example and can be replaced by other applications that take advantage of LLMs.


Figure 3. FedLLM application examples

FedLLM makes it possible, to some extent, to train LLMs with federated learning. However, looking back at the issues discussed in Section 2, there are still many problems in training LLMs with a federated architecture that FedLLM, judging from the publicly available technical material, has not solved. The compute nodes assumed in the FedLLM architecture are GPU nodes with substantial computing power, so the problem FedLLM solves is "no need to purchase a large number of GPUs." In realistic federated learning settings, however, many client nodes are mobile phones and tablets, which FedLLM cannot accommodate. The same applies to the amount of data on client nodes: FedLLM assumes each client can hold enough data to train the local LLM. Finally, FedLLM does not discuss which aggregation algorithms are suitable for FL with LLMs, nor whether the client-side training algorithm needs to be improved. These questions need further work before FL+LLM can really be promoted and applied.

3.2 DeepSpeed: Advancing large-scale model training in scale, speed, cost, and usability, unlocking the ability to train models with 100 billion parameters

LLMs often require a large amount of memory to store parameters, weights, and intermediate activations during training; models with tens of billions of parameters cannot even be trained on a single GPU, making training very inefficient or outright impossible. Earlier work focused on multi-node distributed training, mainly using data parallelism (different replicas of the model running on different GPUs with different batches of data) and model parallelism (splitting the model across multiple GPUs).

DeepSpeed is a deep learning training optimization library developed by Microsoft to improve training speed and save resources through distributed training and mixed-precision techniques. It is an open-source Python library that runs on multiple platforms. Compared with using traditional deep learning frameworks such as TensorFlow, PyTorch, or Keras alone, DeepSpeed enables the computation of large models by partitioning model states and distributing them across GPUs, making it possible to train larger models with fewer GPUs without being limited by the memory of a single device.

Data parallelism replicates the model and optimizer across all workers, so it is not memory-efficient. DeepSpeed developed ZeRO, a family of optimizers that improve the memory efficiency of data parallelism. For model parallelism, DeepSpeed borrows NVIDIA's Megatron-LM to provide large-scale model parallelism for Transformer-based language models.
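As a minimal sketch of how ZeRO partitioning is typically enabled (a toy stand-in model and placeholder values, not a tuned configuration; a real job would be launched with the deepspeed launcher so that distributed workers are initialized):

import torch
import deepspeed

# Stand-in model; a real job would pass a Transformer model here.
model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},   # ZeRO-3: shard params, grads, optimizer states
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
}

# `config` accepts a dict or a path to a JSON config file.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)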

Specifically, DeepSpeed (https://www.deepspeed.ai) enables ChatGPT-like model training with a single click, offering a 15x speedup over SOTA RLHF systems and unprecedented cost reductions at all scales. DeepSpeed comprises three pillars:

  • DeepSpeed-Training: DeepSpeed offers a range of system innovations that make large-scale deep learning training effective and efficient, greatly improve ease of use, and redefine what scale is possible in deep learning training. Innovations such as ZeRO, 3D parallelism, DeepSpeed-MoE, and ZeRO-Infinity are at the heart of DeepSpeed-Training.
  • DeepSpeed-Inference: DeepSpeed-Inference is a comprehensive system solution for Transformer model inference. It includes: (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for dense and sparse Transformer models that fit in aggregate GPU memory; (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to GPU memory and compute, achieving high inference throughput for large models that do not fit in aggregate GPU memory.
  • DeepSpeed-Compression: To further improve inference efficiency, DeepSpeed provides researchers and practitioners with easy-to-use, flexibly composable compression techniques that compress models while delivering faster speed, smaller model size, and greatly reduced compression cost. State-of-the-art compression innovations such as ZeroQuant and XTC are also included in DeepSpeed-Compression.

Specifically, the two core components of DeepSpeed Inference are described as follows:

  • DeepSpeed Transformer: DeepSpeed Transformer is a GPU-only solution designed to minimize latency while maximizing throughput for dense and sparse Transformer models. It achieves state-of-the-art latency and throughput for Transformer models of all sizes, supporting everything from models running on a single GPU to models scaled across hundreds of GPUs. The DeepSpeed Transformer solution is a three-layer system architecture consisting of: i) a single-GPU Transformer kernel optimized for memory-bandwidth utilization at small batch sizes and for high throughput at large batch sizes; ii) multi-GPU dense Transformer layers that scale dense Transformer models across GPUs using tensor slicing and inference-optimized pipeline parallelism; iii) massive-GPU-scale sparse Transformer layers that scale MoE Transformer layers to hundreds of GPUs using a combination of parallelism techniques and communication optimization strategies, while minimizing the overhead of single-GPU sparse computation with optimized sparse kernels. With this layered approach, each layer addresses a distinct aspect of the latency challenge (batch size, scaling of dense models, scaling of sparse models), yet the layers are compatible and build on one another, making DeepSpeed Transformer a comprehensive system that achieves optimal latency and throughput for dense and sparse Transformer models at unprecedented scale, despite heterogeneity in batch size, model size, and model characteristics.
  • ZeRO-Inference: ZeRO-Inference is a solution based on heterogeneous GPU+CPU+NVMe resources, a memory-optimization technology for large-scale distributed deep learning. It addresses the memory challenge by enabling massive-model inference with minimal GPU resources: in contrast to DeepSpeed Transformer, ZeRO-Inference can run inference for models with hundreds of billions of parameters on a single GPU or a few GPUs, as long as there is enough CPU or NVMe memory to store the model parameters. It targets applications that are less latency-sensitive but resource-constrained.

Figure 4. DeepSpeed compression library

DeepSpeed Compression proposes a seamless pipeline to address the challenge of composing compression methods, as shown in Figure 4. Its core component, the compression composer, includes several important features:

1. It provides a variety of cutting-edge compression methods, including extreme quantization, head/row/channel pruning, and knowledge distillation, which can effectively reduce model size and inference cost. The list will grow as more state-of-the-art compression methods are integrated.

2. It provides an easy-to-use API that automates the complexity of assembling different compression techniques to deliver the compound benefits of multiple methods. For example, XTC requires composing lightweight layer reduction, binarization, and knowledge distillation; composing these by hand is nontrivial, but with the compression composer, applying extreme compression is as simple as adding two new API calls to enable compression and clean up the compressed model.

3. It is designed in a modular way so that users can easily add new compression schemes. For example, additional compression methods can be added through custom compression layers and, by registering them with the compression composer, the new methods can be composed with existing methods already managed by the composer.

4. It works seamlessly with the rest of the DeepSpeed library. This has two benefits. First, DeepSpeed Compression can be specified and enabled through a JSON file in the same way as DeepSpeed training and inference, and enabling different combinations of compression techniques only requires modifying a few lines in that JSON file. Second, once the compression scheme is configured, the compression composer automatically modifies the model layers and the training pipeline to enable compression, requiring no additional changes to the model structure or training code from the user.

3.3 Related algorithm basics

To cope with the cost of large-scale LLM pre-training, researchers proposed LoRA (Low-Rank Adaptation) at the algorithm level, a low-rank adaptation method for large language models and a lightweight way to fine-tune them. By freezing the pre-trained model weights and injecting trainable rank-decomposition matrices into each layer of the Transformer architecture, LoRA greatly reduces the number of trainable parameters for downstream tasks and effectively improves the fine-tuning efficiency of pre-trained models.


Project address: https://github.com/microsoft/LoRA [6]

To address the challenge of adapting large language models to different domains and tasks, several solutions exist, such as partial fine-tuning, adapters, and prompting. However, these methods have the following problems:

  • Adapters introduce additional inference latency (because they add layers to the model).
  • Prefix tuning is difficult to optimize, and the sequence reserved for the prompt occupies part of the input length available to the downstream task, which can hurt model performance.

LoRA's idea is that although the model has many parameters, it mainly relies on content in a low intrinsic dimension. It is assumed that during adaptation the weight updates also have a low "intrinsic rank." For a pre-trained weight matrix W_0∈R^(d×k), its update is constrained by representing it with a low-rank decomposition:

W_0 + ∆W = W_0 + BA, where B∈R^(d×r), A∈R^(r×k), and the rank r ≪ min(d, k).

During training, W_0 is frozen and receives no gradient updates, while A and B contain the trainable parameters. Both W_0 and ∆W = BA are multiplied by the same input, and their output vectors are summed coordinate-wise. For h = W_0x, the modified forward pass yields:

h = W_0x + ∆Wx = W_0x + BAx

Figure 5. In the reparameterization, only A and B are trained

As shown in Figure 5, A is initialized with random Gaussian values and B with zeros, so ∆W = BA is zero at the beginning of training. ∆Wx is then scaled by α/r, where α is a constant related to r. When optimizing with Adam, tuning α is roughly equivalent to tuning the learning rate if the initialization is scaled appropriately, so α can simply be set to the first r tried and left unchanged. This scaling helps reduce the need to re-tune hyperparameters when r changes.
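A minimal PyTorch sketch of this reparameterization (our own illustration, not the official microsoft/LoRA implementation):

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen linear layer:
    h = W_0 x + (alpha / r) * B A x, with A Gaussian-initialized and B
    zero-initialized so that Delta W = BA is zero at the start of training."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained W_0
            p.requires_grad = False
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.normal_(self.A, std=0.02)         # random Gaussian init for A
        self.scaling = alpha / r                  # the alpha / r scaling described above

    def forward(self, x):
        # x @ A^T has shape (..., r); a further @ B^T gives (..., out_features).
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Usage example: layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)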


Figure 6. General fine-tuning and LoRA illustration

As shown in Figure 6, r is a hyperparameter specifying the rank of the low-rank matrices used for adaptation. The smaller r is, the simpler the low-rank matrices, the fewer parameters need to be learned during adaptation, the faster training is, and the lower the computational requirements. However, a smaller r also reduces the capacity of the low-rank matrices to capture task-specific information, which can lower adaptation quality: the model may perform worse on new tasks than with a larger r. In short, choosing r in LoRA involves a trade-off between model complexity, adaptation capacity, and the risk of under- or over-fitting, so it is important to experiment with different values of r to find the right balance for the target task.

The rank r can be very low; for example, for the GPT-3 175B model, rank 1 or rank 2 can essentially match the effect of the full rank of 12,288:

Compared with fine-tuning GPT-3 175B with Adam, LoRA reduces the number of trainable parameters by a factor of 10,000 and the GPU memory requirement by a factor of 3.

On large language models such as RoBERTa, DeBERTa, GPT-2, and GPT-3, LoRA performs on par with or better than full fine-tuning in terms of model quality, despite having fewer trainable parameters, and it offers higher training throughput and, unlike adapters, no additional inference latency.

3.4 Related Hardware Basics

Finally, let's look at the hardware side.

The NVIDIA Hopper GPU is NVIDIA's ninth-generation data center GPU, designed to deliver an order-of-magnitude improvement over the previous Ampere generation for large-scale AI and HPC applications. Thread block clusters and thread block reconfiguration improve data locality in space and time and, together with a new asynchronous execution engine, keep all units busy at all times. NVIDIA Grace Hopper fuses an NVIDIA Grace CPU and an NVIDIA Hopper GPU into a single superchip via NVIDIA NVLink-C2C, a chip-to-chip interconnect with 900 GB/s of total bandwidth. NVLink-C2C memory coherency allows both the Grace CPU superchip and the Grace Hopper superchip to be programmed through a unified programming model.


Figure 7. NVIDIA Grace Hopper superchip logic overview

In June 2023, NVIDIA announced that its new GH200 Grace Hopper "superchip," a combined CPU and GPU built specifically for large-scale AI applications, had entered full production. It has 528 GPU Tensor Cores, supports up to 480 GB of CPU memory and 96 GB of GPU memory, and offers up to 4 TB/s of GPU memory bandwidth.


Figure 8. NVIDIA's GH200 "Grace Hopper" AI Super Chip [5]

The GH200 combines the "Hopper" GPU with NVIDIA's "Grace" CPU platform (both named after computer pioneer Grace Hopper) in a single package using NVIDIA's NVLink chip-to-chip (C2C) interconnect technology. NVIDIA expects the combination to dramatically accelerate training (creating models) and inference (running models) for AI and machine learning applications.

"Generative AI is rapidly transforming the enterprise, unlocking new opportunities and accelerating discovery in healthcare, finance, business services, and more industries," Ian Buck, vice president of accelerated computing at Nvidia, said in a press release. "With the full production of Grace Hopper superchips, global manufacturers will soon provide the accelerated infrastructure enterprises need to build and deploy generative AI applications that leverage their unique proprietary data." According to the company, key features of the GH200 include a new 900GB/s coherent (shared) memory interface, which is 7 times faster than PCIe Gen5. The GH200 also provides the GPU with 30 times the total bandwidth of system memory. In addition, the GH200 can run all Nvidia software platforms, including the Nvidia HPC SDK, Nvidia AI, and Nvidia Omniverse. Nvidia also announced that it will build this CPU/GPU combination chip into a new supercomputer called the DGX GH200, which can harness the combined power of 256 GH200 chips to execute as a single GPU, providing 1 exaflop of performance and 144 megabytes of shared memory, nearly 500 times more memory than the previous generation Nvidia DGX A100. The DGX GH200 will be able to train huge next-generation AI models for generative language applications, recommender systems, and data analytics. Nvidia has not disclosed the price of the DGX GH200 at this time, but if based on the DGX H100 shipped last year as a reference, an 8U GPU server cabinet with 8 sets of H100 GPUs is about $200,000, considering that the DGX GH200 has up to 256 Grace Hoppers, its price may be higher than this range. According to Anandtech, the price of a DGX GH200 computer "can easily reach the 8-figure (dollar) level."

Thanks to continued hardware advances from vendors such as NVIDIA and Cerebras, high-end cloud AI models are likely to keep becoming more capable, processing more data, and running faster over time. According to internal testing, the DGX GH200 system outperforms the DGX H100 on memory-intensive AI workloads, with an average improvement of 2-6x: for example, the DGX GH200 is 1x faster for GPT-3 model training with 1 TB of memory, up to 4x faster on deep learning recommendation models (DLRM) with 40 TB of memory, and up to 5x faster on graph neural network processing.

From the perspective of the GH200, it can support distributed LLM training to a certain extent, making FL+LLM more feasible. However, returning to the discussion at the beginning of this article, the original intent of FL is to use a large number of distributed commodity devices to jointly train a central model, which both makes effective use of decentralized client resources and satisfies each client's data-privacy needs. Requiring these clients to own a GH200 is obviously unrealistic, and the cost of such a GPU is inconsistent with FL's original intent.

In addition, some researchers believe that the GH200 itself will not have much impact on the application and adoption of large models [11]. Their analysis is that, in terms of raw compute, a single GH chip offers no advantage in FP8 floating-point performance over an H100 chip; the GH200 performed better in internal tests because of its memory system. The GPUs and CPUs inside the DGX GH200 are connected differently from those in the DGX H100, which greatly increases the amount of memory that can be accessed at high speed. However, cluster performance depends mainly on three factors: compute itself, networking, and memory/storage. Therefore, for today's mainstream large models such as GPT-3 and GPT-4, there will be no obvious difference between a DGX H100 cluster (with NVLink networking) and a DGX GH200 system.

Regarding the H100, its large-model training performance has also been analyzed [12]. In the latest results of the two MLPerf benchmarks, the NVIDIA H100 set new records in every category of the AI computing performance tests and was the only hardware platform able to run all of them. The results submitted by NVIDIA and the GPU cloud platform CoreWeave set the industry standard: with 896 Intel Xeon 8462Y+ processors and 3,584 NVIDIA H100 chips working together, the GPT-3-based large-language-model training task took only 10.94 minutes. In BERT-Large training, H100 and CoreWeave pushed the time down to an extreme 0.13 minutes, and even with only 64 cards the result was 0.89 minutes.

4. Follow-up development discussion

In this article, we discussed some approaches related to FL+LLM, including algorithmic improvements, hardware developments, and architectures for distributed training and federated learning. Among them, FedLLM is the work most aligned with FL+LLM, although it is still far from a practical and technically complete FL+LLM solution.

DeepSpeed is a model-training architecture that improves training speed and saves resources through distributed training and mixed-precision techniques, and it is oriented toward distributed-training scenarios. In general, distributed learning spreads the training data evenly across nodes; the nodes are mostly compute nodes in dedicated machine rooms, usually in the same geographic location, with good communication between them. In FL, by contrast, the amount of data a compute node holds depends on the device itself, and it is difficult to ensure that different nodes hold similar amounts of data. Moreover, FL's compute nodes may be mobile phones, tablets, and the like, which generally connect to the central server remotely, with relatively poor connection stability and high communication cost. It is therefore still very difficult to apply DeepSpeed's work directly to FL+LLM.

LoRA is currently a very popular fine-tuning method, and better fine-tuning methods continue to be proposed. These algorithmic improvements make it feasible for FL client nodes to train models on local data. As for hardware, which we also discussed above, raw hardware performance keeps improving, but in FL application scenarios it is very difficult to equip mobile phones and tablets with such hardware.

From our analysis of current work related to FL+LLM, we feel that supporting LLMs with federated learning still leaves many problems to solve. Even FedLLM does not address how to handle client devices with small storage, weak processing capability, and poor network conditions, which is precisely the most typical FL application scenario. We look forward to more researchers paying attention to the FL+LLM problem and providing more technical details and deployable solutions.

About the author

Jiying, Ph.D. in Engineering, graduated from Beijing Jiaotong University, worked as a research assistant at the Chinese University of Hong Kong and at the Hong Kong University of Science and Technology, and is currently engaged in research on new information technology in the field of e-government. Jiying's main research interests are pattern recognition and computer vision; Jiying loves scientific research and hopes to keep learning and making progress.

References:

[1] McMahan, Brendan, et al. "Communication-efficient learning of deep networks from decentralized data." Artificial intelligence and statistics. PMLR, 2017.

[2] Li, Tian, et al. "Federated optimization in heterogeneous networks." Proceedings of Machine Learning and Systems 2 (2020): 429-450.

[3] Karimireddy, Sai Praneeth, et al. "Scaffold: Stochastic controlled averaging for federated learning." International Conference on Machine Learning. PMLR, 2020.

[4] Wang, Jianyu, et al. "Tackling the objective inconsistency problem in heterogeneous federated optimization." Advances in neural information processing systems 33 (2020): 7611-7623.

[5] https://arstechnica.com/information-technology/2023/06/nvidias-new-ai-superchip-combines-cpu-and-gpu-to-train-monster-ai-systems/

[6] LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685

[7] Releasing FedLLM: Build Your Own Large Language Models on Proprietary... (https://blog.fedml.ai/releasing-fedllm-build-your-own-large-language-models-on-proprietary-data-using-the-fedml-platform/)

[8] Wayne Xin Zhao, Kun Zhou, et al. "A Survey of Large Language Models." https://arxiv.org/abs/2303.18223

[9] https://lifearchitect.ai/chinchilla/

[10] Samyam Rajbhandari, Jeff Rasley, et al. "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." https://arxiv.org/pdf/1910.02054.pdf

[11] Guotai Junan Securities, research report "AI Supercomputing Seamlessly Integrated, Optical Interconnection Status Significantly Improved"

[12] NVIDIA H100 authoritative AI performance test: GPT-3-based large-model training completed in 11 minutes. https://m.cnbeta.com.tw/view/1367739.htm
