Report from Heart of the Machine
Editor: Qian Zhang
Without any neural network, a radiance field can match the quality of Neural Radiance Fields (NeRF) while converging more than 100 times faster.
In 2020, researchers from UC Berkeley, Google, and UC San Diego proposed NeRF, a method that reconstructs a 3D scene representation from a handful of still 2D images and renders photorealistic views from new perspectives. Its improved successor, NeRF-W (NeRF in the Wild), also handles outdoor scenes with varying lighting and transient occlusions, turning collections of tourist photos into striking 3D fly-through videos.

NeRF model demo.
NeRF-W model demo.
These stunning results are computationally expensive, however: rendering a single frame takes about 30 seconds, and training the model takes about a day on a single GPU. Several follow-up papers have reduced this compute cost, especially for rendering, but training cost has not seen a comparable improvement; it still takes hours on a single GPU, which remains a major bottleneck for practical deployment.
In a new paper, researchers from UC Berkeley take aim at this problem with an approach called Plenoxels. The study shows that even without a neural network, a radiance field trained from scratch can match the quality of NeRF while optimizing two orders of magnitude faster.
Paper: https://arxiv.org/pdf/2112.05131.pdf
Project page: https://alexyu.net/plenoxels/
Code: https://github.com/sxyu/svox2
The researchers provide a custom CUDA implementation that exploits the model's simplicity to achieve substantial speedups. In bounded scenes, Plenoxels typically optimizes in 11 minutes on a single Titan RTX GPU, versus roughly one day for NeRF, a speedup of more than 100x; in unbounded scenes, Plenoxels optimizes in about 27 minutes, versus roughly four days for NeRF++, a speedup of more than 200x. Although the implementation is not tuned for fast rendering, Plenoxels can render novel viewpoints at an interactive 15 frames per second. For faster rendering still, an optimized Plenoxel model can be converted to a PlenOctree (a method proposed by author Alex Yu et al. in an ICCV 2021 paper: https://alexyu.net/plenoctrees/).
Specifically, the researchers propose an explicit volumetric representation based on a view-dependent sparse voxel grid, with no neural network anywhere in the pipeline. The model renders photorealistic novel viewpoints and is optimized end-to-end from calibrated 2D photos, using a differentiable rendering loss on the training views together with a total variation (TV) regularizer.
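To make the TV regularizer concrete, here is a minimal NumPy sketch of a total variation penalty on a dense grid of stored values. The function name and the dense-grid layout are illustrative assumptions; the paper's implementation operates on the sparse grid in fused CUDA kernels and regularizes both the opacities and the spherical harmonic coefficients.

```python
import numpy as np

def tv_loss(values):
    """Total variation penalty on a dense (X, Y, Z) grid of stored values
    (e.g. opacity), encouraging neighboring voxels to agree."""
    dx = values[1:, :, :] - values[:-1, :, :]  # differences along x
    dy = values[:, 1:, :] - values[:, :-1, :]  # differences along y
    dz = values[:, :, 1:] - values[:, :, :-1]  # differences along z
    return (dx ** 2).mean() + (dy ** 2).mean() + (dz ** 2).mean()
```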
They call the representation Plenoxel (plenoptic volume element) because it is a sparse voxel grid in which each voxel stores opacity and spherical harmonic coefficients. These coefficients are interpolated to model the full plenoptic function continuously in space. To reach high resolution on a single GPU, the researchers prune empty voxels and follow a coarse-to-fine optimization strategy. Although the core model is a bounded voxel grid, unbounded scenes can be modeled in two ways: (1) using normalized device coordinates (for forward-facing scenes), or (2) surrounding the grid with multi-sphere images to encode the background (for 360° scenes).
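For intuition, the sketch below evaluates a view-dependent color from degree-2 spherical harmonics, the degree used in the paper (9 coefficients per color channel). The constants are the standard real SH basis; the function names and the sigmoid squashing are illustrative assumptions, not necessarily what the official code does.

```python
import numpy as np

# Standard real spherical harmonic basis constants up to degree 2.
SH_C0 = 0.2820947917738781
SH_C1 = 0.4886025119029199
SH_C2 = (1.0925484305920792, -1.0925484305920792, 0.3153915652525200,
         -1.0925484305920792, 0.5462742152960396)

def sh_basis(d):
    """Evaluate the 9 degree-2 SH basis functions at unit view direction d."""
    x, y, z = d
    return np.array([
        SH_C0,
        -SH_C1 * y, SH_C1 * z, -SH_C1 * x,
        SH_C2[0] * x * y, SH_C2[1] * y * z,
        SH_C2[2] * (2.0 * z * z - x * x - y * y),
        SH_C2[3] * x * z, SH_C2[4] * (x * x - y * y),
    ])

def sh_to_rgb(coeffs, d):
    """coeffs: (3, 9) SH coefficients, one row per color channel.
    Returns the view-dependent RGB color for direction d; the sigmoid
    mapping into [0, 1] is one common choice among implementations."""
    return 1.0 / (1.0 + np.exp(-(coeffs @ sh_basis(d))))
```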
Plenoxel's effect in a forward-facing scene.
Plenoxel's effect in a 360° scene.
This method demonstrates that photorealistic volumetric reconstruction can be approached as an inverse problem using standard tools: a data representation, a forward model, a regularization function, and an optimizer. Each of these components can be very simple and still achieve state-of-the-art results. The experiments suggest that the key ingredient of neural radiance fields is not the neural network but the differentiable volumetric renderer.
Framework overview
Plenoxel is a sparse voxel grid in which each occupied voxel corner stores a scalar opacity σ and a vector of spherical harmonic coefficients for each color channel. The opacity and color at any position and viewing direction are determined by trilinearly interpolating the values stored at the neighboring voxel corners and evaluating the spherical harmonic coefficients in the appropriate viewing direction. Given a set of calibrated images, the model is optimized directly with a rendering loss on the training rays. The model architecture is shown in Figure 2 below.
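The trilinear interpolation step can be sketched in a few lines of NumPy. This toy version assumes a dense (X, Y, Z, C) grid of corner values and a query point strictly inside it; the real implementation works on sparse voxels in CUDA.

```python
import numpy as np

def trilinear(grid, p):
    """Trilinearly interpolate per-corner values at a continuous point.

    grid: (X, Y, Z, C) array of values stored at voxel corners (e.g. opacity
    and SH coefficients); p: query point in grid coordinates, assumed to lie
    strictly inside the grid.
    """
    i, j, k = np.floor(p).astype(int)
    tx, ty, tz = np.asarray(p) - (i, j, k)  # fractional offsets in [0, 1)
    c = grid[i:i + 2, j:j + 2, k:k + 2]     # the 8 surrounding corner values
    c = c[0] * (1 - tx) + c[1] * tx         # interpolate along x
    c = c[0] * (1 - ty) + c[1] * ty         # then along y
    return c[0] * (1 - tz) + c[1] * tz      # then along z -> (C,) values
```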
Figure 2 gives a conceptual overview of the sparse Plenoxel model. Given a set of images of an object or scene, the researchers reconstruct (a) a sparse voxel ("Plenoxel") grid with density and spherical harmonic coefficients at each voxel. To render a ray, they (b) compute the color and opacity of each sample point via trilinear interpolation of the neighboring voxel coefficients, and (c) integrate the colors and opacities of these samples using differentiable volume rendering. The voxel coefficients can then be optimized with a standard MSE reconstruction loss against the training images, plus a total variation regularizer.
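Step (c) follows the standard volume rendering model that NeRF also uses. Below is a minimal single-ray NumPy sketch of the compositing and the MSE loss; the names are illustrative, and the actual implementation evaluates this, with analytic gradients, in fused CUDA kernels.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite N samples along one ray with the volume rendering model.

    sigmas: (N,) interpolated opacities at the sample points,
    colors: (N, 3) interpolated RGB values at the sample points,
    deltas: (N,) distances between consecutive samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance T_i
    weights = trans * alphas                                        # contribution weights
    return (weights[:, None] * colors).sum(axis=0)                  # final (3,) ray color

def mse_loss(pred, gt):
    """Mean squared error between rendered and ground-truth pixel colors."""
    return ((pred - gt) ** 2).mean()
```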
Experimental results
The researchers demonstrate the model on synthetic bounded scenes, real unbounded forward-facing scenes, and real unbounded 360° scenes. Comparing optimization time against all previous methods, including real-time rendering approaches, they find the new model dramatically faster. Quantitative comparisons are shown in Table 2, and visual comparisons in Figures 6, 7, and 8.
In addition, the new method produces high-quality results after just one epoch of optimization, in under 1.5 minutes, as shown in Figure 5.