
Neural radiance fields without the "neural": training more than 100x faster with no loss in 3D quality

Report from Heart of the Machine

Editor: Qian Zhang

Without any neural network, a plain radiance field can match the quality of Neural Radiance Fields (NeRFs) while converging more than 100 times faster.

In 2020, researchers at UC Berkeley, Google, and UC San Diego proposed NeRF, a model that reconstructs 3D scenes from 2D images: given a handful of still photographs, it synthesizes photorealistic renderings from new viewpoints. An improved variant, NeRF-W (NeRF in the Wild), further handles outdoor environments with varying lighting and occlusions, turning tourist photos into 3D "sightseeing" renderings in minutes.


NeRF model demo.


NeRF-W model demo.

These stunning results come at a steep computational cost, however: rendering a single frame takes about 30 seconds, and training the model takes roughly a day on a single GPU. Several follow-up papers have reduced these costs, especially for rendering, but training has seen no comparable improvement: it still takes hours on a single GPU, which remains a major bottleneck for practical deployment.

In a new paper, researchers from UC Berkeley take aim at exactly this problem with an approach called Plenoxels. They show that, even without a neural network, a radiance field trained from scratch can match the quality of NeRF while optimizing two orders of magnitude faster.


Paper link: https://arxiv.org/pdf/2112.05131.pdf

Project homepage: https://alexyu.net/plenoxels/

Code link: https://github.com/sxyu/svox2

The researchers provide a custom CUDA implementation that exploits the model's simplicity to achieve substantial speedups. On bounded scenes, Plenoxels typically optimizes in 11 minutes on a single Titan RTX GPU versus roughly a day for NeRF, a speedup of more than 100x; on unbounded scenes, Plenoxels takes about 27 minutes versus roughly four days for NeRF++, a speedup of more than 200x. Although the implementation is not tuned for fast rendering, Plenoxels can render novel viewpoints at an interactive 15 frames per second. For faster rendering, an optimized Plenoxel model can be converted to a PlenOctree (a method proposed by author Alex Yu et al. in an ICCV 2021 paper: https://alexyu.net/plenoctrees/).


Specifically, the researchers propose an explicit, view-dependent sparse voxel grid representation that contains no neural network at all. The model renders photorealistic novel viewpoints and is optimized end to end from calibrated 2D photographs, using a differentiable rendering loss on the training views together with a total variation regularizer.

They call the model Plenoxel (plenoptic volume elements) because it consists of a sparse voxel grid in which each voxel stores opacity and spherical harmonic coefficients. Interpolating these coefficients yields a continuous model of the full plenoptic function throughout space. To reach high resolution on a single GPU, the researchers prune empty voxels and follow a coarse-to-fine optimization strategy. Although the core model is a bounded voxel grid, unbounded scenes can be handled in two ways: 1) using normalized device coordinates for forward-facing scenes, and 2) surrounding the grid with a multi-sphere image to encode the background in 360° scenes.
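To make the data structure concrete, here is a minimal sketch in NumPy (not the authors' sparse CUDA implementation) of a voxel grid that stores an opacity plus spherical harmonic coefficients at each grid corner and answers queries by trilinear interpolation. For brevity it uses a dense array rather than the paper's pruned sparse storage, and all names are illustrative:

```python
# Minimal sketch, assuming a dense grid for simplicity (the paper uses
# sparse storage with pruned empty voxels and a custom CUDA kernel).
import numpy as np

class VoxelGrid:
    def __init__(self, resolution, sh_dim=9, channels=3):
        # Per-corner features: [opacity sigma, SH coefficients for R, G, B].
        self.res = resolution
        self.data = np.zeros((resolution,) * 3 + (1 + channels * sh_dim,))

    def query(self, pts):
        """Trilinearly interpolate corner features at continuous 3D points.

        pts: (N, 3) array of coordinates in [0, res - 1]^3.
        Returns (N, 1 + 3 * sh_dim): interpolated sigma and SH coefficients.
        """
        p0 = np.floor(pts).astype(int)
        p0 = np.clip(p0, 0, self.res - 2)   # keep the 2x2x2 cell in bounds
        t = pts - p0                         # fractional offsets inside the cell
        out = 0.0
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    # Trilinear weight of this corner for every query point.
                    w = (np.where(dx, t[:, 0], 1 - t[:, 0])
                         * np.where(dy, t[:, 1], 1 - t[:, 1])
                         * np.where(dz, t[:, 2], 1 - t[:, 2]))
                    corner = self.data[p0[:, 0] + dx, p0[:, 1] + dy, p0[:, 2] + dz]
                    out = out + w[:, None] * corner
        return out
```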


Plenoxels in a forward-facing scene.


Plenoxels in a 360° scene.

The method shows that photorealistic scenes can be reconstructed with the standard tools of inverse problems: a data representation, a forward model, a regularization function, and an optimizer. Each of these components can be very simple and still achieve state-of-the-art results. The experiments suggest that the key ingredient of neural radiance fields is not the neural network but the differentiable voxel renderer.

Framework overview

A Plenoxel model is a sparse voxel grid in which each occupied voxel corner stores a scalar opacity σ and a vector of spherical harmonic coefficients for each color channel. The opacity and color at any position and viewing direction are obtained by trilinearly interpolating the values stored at the neighboring voxel corners and evaluating the spherical harmonics in that viewing direction. Given a set of calibrated images, the model is optimized directly with a rendering loss on the training rays. The architecture is shown in Figure 2 below.
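Below is a hedged sketch of how interpolated spherical harmonic coefficients can be decoded into a view-dependent RGB color, using the standard degree-2 real spherical harmonic basis (9 functions per color channel, as in the paper). The final sigmoid is one common choice for keeping colors in [0, 1] (used, e.g., in PlenOctrees), not a detail confirmed by this article:

```python
# Sketch of SH decoding; constants are the standard degree-2 real SH values.
import numpy as np

SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199
SH_C2 = (1.0925484305920792, -1.0925484305920792, 0.31539156525252005,
         -1.0925484305920792, 0.5462742152960396)

def sh_basis(dirs):
    """Evaluate the 9 degree-2 real SH basis functions for unit directions (N, 3)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack([
        SH_C0 * np.ones_like(x),
        -SH_C1 * y, SH_C1 * z, -SH_C1 * x,
        SH_C2[0] * x * y, SH_C2[1] * y * z,
        SH_C2[2] * (2.0 * z * z - x * x - y * y),
        SH_C2[3] * x * z, SH_C2[4] * (x * x - y * y),
    ], axis=-1)                                      # (N, 9)

def sh_to_rgb(sh_coeffs, dirs):
    """sh_coeffs: (N, 3, 9) per-channel coefficients; dirs: (N, 3) view directions."""
    basis = sh_basis(dirs)                           # (N, 9)
    rgb = (sh_coeffs * basis[:, None, :]).sum(-1)    # weighted sum per channel
    return 1.0 / (1.0 + np.exp(-rgb))                # sigmoid: one common squashing choice
```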


Figure 2 above gives a conceptual overview of the sparse Plenoxel framework. Given a set of images of an object or scene, the researchers reconstruct (a) a sparse voxel (Plenoxel) grid with density and spherical harmonic coefficients at each voxel. To render a ray, they (b) compute the color and opacity of each sample point via trilinear interpolation of the neighboring voxel coefficients, and (c) integrate the colors and opacities of these samples with differentiable volume rendering. The voxel coefficients are then optimized with a standard MSE reconstruction loss against the training images plus a total variation regularizer.
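The rendering step (c) and the loss can be sketched as follows. This is the standard NeRF-style volume-rendering quadrature plus a simple total variation term, written in PyTorch so gradients flow back to the voxel values; it is not the authors' fused CUDA kernels, and the squared-difference TV here is a simplification of the paper's square-root form:

```python
# Hedged sketch of differentiable volume rendering and the training loss.
import torch

def render_rays(sigmas, colors, deltas):
    """Composite per-sample values along rays into pixel colors.

    sigmas: (R, S) opacities, colors: (R, S, 3) SH-decoded colors,
    deltas: (R, S) distances between consecutive samples.
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)              # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)     # transmittance so far
    # Shift right so sample i sees the transmittance of samples before it.
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alphas * trans
    return (weights[..., None] * colors).sum(dim=1)          # (R, 3) pixel colors

def total_variation(grid):
    """TV regularizer on an (X, Y, Z, F) voxel grid: penalize neighbor differences."""
    tv_x = (grid[1:, :, :] - grid[:-1, :, :]).pow(2).mean()
    tv_y = (grid[:, 1:, :] - grid[:, :-1, :]).pow(2).mean()
    tv_z = (grid[:, :, 1:] - grid[:, :, :-1]).pow(2).mean()
    return tv_x + tv_y + tv_z

# Training objective (conceptually):
# loss = mse(render_rays(...), target_pixels) + lambda_tv * total_variation(grid)
```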

Experimental results

The researchers demonstrate the model on synthetic bounded scenes, real unbounded forward-facing scenes, and real unbounded 360° scenes. Comparing its optimization time against all previous methods, including those aimed at real-time rendering, they find the new model dramatically faster. Quantitative comparisons appear in Table 2 and visual comparisons in Figures 6, 7, and 8.


In addition, the new method yields high-quality results after even a single epoch of optimization, in under 1.5 minutes, as shown in Figure 5.


Quickly build an enterprise-grade ASR speech recognition assistant with NVIDIA Riva

NVIDIA Riva is a GPU-accelerated SDK for rapidly deploying high-performance conversational AI services and developing speech AI applications. Riva is designed to give developers easy, out-of-the-box access to conversational AI capabilities: a few simple commands and API calls are enough to stand up a high-quality speech recognition service. The service can take hundreds to thousands of audio streams as input and return transcribed text with minimal latency.
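As a hedged illustration (not taken from the webinar), transcribing an audio file against an already-deployed Riva server with the nvidia-riva-client Python package might look like the following. The endpoint address, audio file, and config fields are placeholders, and the client API has changed across Riva releases, so consult the documentation matching your version:

```python
# Sketch assuming a running Riva server and `pip install nvidia-riva-client`.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")   # placeholder gRPC endpoint
asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",                       # must match a deployed model
    max_alternatives=1,
    enable_automatic_punctuation=True,
    # Depending on your audio, encoding/sample-rate fields may also be needed.
)

with open("sample.wav", "rb") as f:              # placeholder audio file
    audio_bytes = f.read()

response = asr.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```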

On December 29, 19:30-21:00, this online session will mainly cover:

Introduction to Automatic Speech Recognition

Introduction and features of NVIDIA Riva

Rapid deployment of NVIDIA Riva

Launch the NVIDIA Riva client to quickly implement speech-to-text transcription

Use Python to quickly build an NVIDIA Riva-based automatic speech recognition service application
