Xie Saining's team breaks through the Gaussian splatting memory bottleneck, enabling multi-GPU training with a parallel scheme

Cressy from Aofeisi

QbitAI | WeChat official account QbitAI

The memory bottleneck of Gaussian splatting training has finally been broken, by Xie Saining's team and the NYU Systems Lab!

By designing a parallel strategy, the team built a multi-GPU training scheme for Gaussian splatting models, so training is no longer limited by the memory of a single GPU.

Training on 4 GPUs this way yields a speedup of more than 3.5x, and scaling up to 32 GPUs brings a further 6.8x speedup.

The team calls its distributed training system Grendel; the first author, Zhao Hexu, is an alumnus of Tsinghua University's Yao Class.

With multi-GPU training, not only is training faster, but the team has also broken through the memory limits of large-scene, high-resolution settings, fitting more Gaussians and producing higher-quality 3D results.

To show just how amazing this achievement is, Xie Saining himself posted a meme:

(Crying): No! You can't scale up 3D Gaussian splatting, whether it's the scene, the resolution, or the batch size; it eats too much compute and memory.

GPU: I just smiled and said nothing.

Some netizens joked that it looks like Jensen Huang's (NVIDIA's) stock is going to rise again.

Fast and high-quality 3D generation

The multi-GPU parallel approach breaks through the compute and memory limits of a single GPU, allowing Grendel to handle extremely challenging rendering tasks for large scenes with many more Gaussian particles.

For example, in two large, complex scenes, Rubble (4K resolution) and MatrixCity (1080p resolution), Grendel used up to 40 million and 24 million Gaussian particles, respectively, to produce high-fidelity renderings.

The fineness and consistency of Grendel's results can also be seen when zooming in for a closer look.

In terms of the numbers, on the Mip360 and TT&DB datasets the rendering quality (PSNR) after batched 4-GPU training shows almost no loss compared with single-GPU training, further validating the effectiveness of Grendel's multi-GPU parallelism across different scenes.

On top of preserving quality, Grendel also achieved a 3-4x speedup on both datasets.

Especially in 4K scenes, single-GPU training is not only slow but also prone to running out of memory, so parallel multi-GPU training with Grendel brings not just a quantitative improvement but a qualitative breakthrough.

In addition, by supporting larger batch sizes and dynamic load balancing, Grendel makes better use of multi-GPU resources and avoids wasting compute.

For example, on the Mip-NeRF360 dataset, larger batches plus dynamic load balancing let Grendel raise the speedup of 4-GPU parallelism from 2x to nearly 4x.

So, how exactly does Grendel do it?

Breaking down the Gaussian splatting pipeline

Before getting into how Grendel works, let's answer a question first:

If a single GPU isn't enough, using multiple GPUs seems like an obvious idea, so why haven't we seen a multi-GPU scheme before?

The answer lies in the unique computation pattern of the Gaussian splatting model: training is divided into several distinct stages, each requiring a different parallelization granularity, and the system has to switch between them.

This is a far cry from the single-granularity parallelism of most neural network models; in fact, Gaussian splatting does not use any neural network at all.

As a result, existing multi-GPU parallel schemes for neural network training (such as data parallelism and model parallelism) are hard to apply directly to 3D Gaussian splatting.

In addition, training a Gaussian splatting model requires a large amount of data communication between stages of different granularities, which further complicates any parallel scheme.

This is exactly what Grendel needs to solve in its design.

First, Grendel divides the 3D Gaussian splatting training process into three main stages: Gaussian transformation, rendering, and loss computation.

For these three stages, Grendel adopts a mixed-granularity parallelism strategy, using a different parallelization granularity in each training stage.

  • The Gaussian transformation stage uses Gaussian-wise parallelism: the Gaussian particles are distributed evenly across the GPU nodes.
  • The rendering and loss computation stages use pixel-wise parallelism: the image is split into contiguous pixel blocks that are distributed to individual GPU nodes (see the sketch after this list).
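
To make the two granularities concrete, here is a minimal sketch, not Grendel's actual code, of how the Gaussian particles and the pixel blocks might be partitioned across GPUs; the function names and the row-block pixel split are illustrative assumptions.

```python
# Minimal sketch of the two parallelism granularities (illustrative, not Grendel's code).
import torch

def partition_gaussians(gaussian_params: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Gaussian-wise parallelism: split the N Gaussian particles evenly across GPUs."""
    n = gaussian_params.shape[0]
    chunk = (n + world_size - 1) // world_size
    return gaussian_params[rank * chunk : (rank + 1) * chunk]

def partition_pixels(height: int, width: int, rank: int, world_size: int) -> tuple[int, int]:
    """Pixel-wise parallelism: give each GPU a contiguous block of pixel rows to render."""
    rows_per_gpu = (height + world_size - 1) // world_size
    row_start = rank * rows_per_gpu
    row_end = min(height, row_start + rows_per_gpu)
    return row_start, row_end  # this GPU handles image[row_start:row_end, 0:width]
```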

Between the Gaussian transformation and rendering stages, Grendel uses sparse all-to-all communication to transfer the Gaussian particles on each GPU node to the GPU nodes where they need to be rendered.

Since each pixel block depends only on the subset of Gaussian particles that cover it, Grendel exploits spatial locality to transmit only the relevant particles, reducing communication volume.
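
As a rough illustration of this selection step, the sketch below assumes hypothetical inputs (`screen_xy` for projected 2D centers, `radii` for projected footprint radii, `row_bounds` for each destination GPU's pixel rows); Grendel's real sparse all-to-all implementation may differ.

```python
# Illustrative sketch: pick only the locally owned Gaussians each destination GPU needs.
import torch
import torch.distributed as dist

def build_send_lists(local_gaussians: torch.Tensor,
                     screen_xy: torch.Tensor,
                     radii: torch.Tensor,
                     row_bounds: list) -> list:
    """For each destination GPU, keep only the Gaussians whose projected footprint
    overlaps that GPU's block of pixel rows (spatial locality)."""
    send_lists = []
    for row_start, row_end in row_bounds:             # one (start, end) per destination GPU
        y = screen_xy[:, 1]
        overlaps = (y + radii >= row_start) & (y - radii < row_end)
        send_lists.append(local_gaussians[overlaps])  # sparse subset, not all particles
    return send_lists

# The per-destination subsets would then be exchanged with an all-to-all collective,
# e.g. dist.all_to_all(recv_lists, send_lists), so that each GPU receives exactly
# the Gaussians it needs for its own pixel block.
```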

After the loss is computed, each GPU node backpropagates through the render pipeline, computing the gradient of every parameter with respect to the loss and passing the attribute gradients back to the corresponding Gaussian particles.

The system then aggregates the gradients computed on each GPU to obtain the total gradient for the batch, and updates the attribute parameters of the Gaussian particles accordingly.
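
As a sketch of this aggregation step, assuming one standard way to do it rather than Grendel's exact mechanism, the per-GPU gradients can be summed with an all-reduce before the optimizer step:

```python
# Minimal sketch: sum per-GPU gradients so every node sees the full batch gradient,
# then update the Gaussian attribute parameters (position, scale, opacity, color, ...).
import torch
import torch.distributed as dist

def aggregate_and_step(gaussian_params: torch.nn.Parameter, optimizer: torch.optim.Optimizer):
    if gaussian_params.grad is not None:
        dist.all_reduce(gaussian_params.grad, op=dist.ReduceOp.SUM)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```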

The steps from Gaussian transformation to parameter update are then repeated until the model converges or a preset number of training iterations is reached.

In addition, to deal with load imbalance during the rendering phase, Grendel has introduced a dynamic load balancing mechanism:

During training, Grendel records the rendering time of each pixel block, uses it to predict the load distribution of the current iteration, and then dynamically adjusts how pixel blocks are assigned to GPU nodes so that every node's rendering time is as close to equal as possible.
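
The sketch below illustrates one way such time-based rebalancing could look, under the simplifying assumptions that pixel blocks are contiguous row ranges and that per-row rendering costs from recent iterations are available; it is not Grendel's actual scheduler.

```python
# Illustrative sketch: re-split image rows so each GPU gets roughly equal predicted work.
import numpy as np

def rebalance_rows(per_row_time: np.ndarray, world_size: int) -> list:
    """Split image rows into contiguous blocks with near-equal predicted rendering cost."""
    target = per_row_time.sum() / world_size   # ideal per-GPU workload
    bounds, start, acc = [], 0, 0.0
    for row, t in enumerate(per_row_time):
        acc += t
        # Close the current block once it reaches the target, leaving rows for later GPUs.
        if acc >= target and len(bounds) < world_size - 1:
            bounds.append((start, row + 1))
            start, acc = row + 1, 0.0
    bounds.append((start, len(per_row_time)))  # last GPU takes the remaining rows
    return bounds
```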

To further improve GPU utilization and training throughput, Grendel supports batched training, processing multiple input images in parallel in each training iteration and dynamically adjusting the learning rate according to the batch size to keep training stable and convergent.
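
As an illustration, a batch-size-aware learning-rate rule might look like the sketch below; the square-root scaling here is only an assumed heuristic, and the exact rule Grendel uses is given in the paper and repository.

```python
# Illustrative sketch: scale the base learning rate with the batch size.
# The square-root rule is an assumption for illustration, not Grendel's confirmed formula.
def scale_lr(base_lr: float, batch_size: int, base_batch_size: int = 1) -> float:
    return base_lr * (batch_size / base_batch_size) ** 0.5
```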

About the Authors

Grendel's first author, Zhao Hexu, is a Ph.D. student in computer science at New York University and an alumnus of Tsinghua University's Yao Class (class of 2019); his main research direction is distributed machine learning.

During his time at Tsinghua, Zhao Hexu did research in Sun Maosong's group at the Tsinghua NLP Lab, under the guidance of Associate Professor Liu Zhiyuan.

He also visited Eric Xing's group, where he worked on optimizing a communication problem in distributed machine learning; that work was accepted at MLSys 2023.

Of the other three Chinese authors, Weng Haoyang is also from the Yao Class; Daohan Lu is a Ph.D. student at New York University; and Dr. Ang Li, an alumnus of Zhejiang University, is now doing research at Pacific Northwest National Laboratory (PNNL) in the United States.

Zhao's two advisors at NYU, Professor Jinyang Li and Assistant Professor Aurojit Panda, also took part in the project, along with Assistant Professor Xie Saining, the well-known NYU researcher and co-author of ResNeXt and DiT (the core architecture behind Sora).

Paper:

https://arxiv.org/abs/2406.18533

Project Homepage:

https://daohanlu.github.io/scaling-up-3dgs/

GitHub:

https://github.com/nyu-systems/Grendel-GS

— END —
