Report from Machine Heart (Synced)
Editors: Zenan, Boat
Is this really not being used to build a metaverse?
Training self-driving systems requires high-precision maps, massive amounts of data, and virtual environments, and every tech company working in this direction has its own approach: Waymo operates its own fleet of self-driving taxis, while NVIDIA has built a virtual environment for large-scale training, the NVIDIA DRIVE Sim platform. Recently, researchers from Google AI and Waymo, Alphabet's self-driving subsidiary, proposed a new idea: reconstructing the entire 3D environment of downtown San Francisco from 2.8 million Street View photos.

Using a large number of Street View images, Google's researchers built a grid of Block-NeRFs that forms the largest neural scene representation to date, rendering street-level views of San Francisco.
As soon as the study was posted to arXiv, Jeff Dean retweeted it:
Block-NeRF is a variant of neural radiance fields that can represent large-scale environments. Specifically, the study shows that when scaling NeRF to render urban scenes spanning multiple blocks, it is critical to decompose the scene into multiple individually trained NeRFs. This decomposition decouples rendering time from scene size, allows rendering to scale to environments of arbitrary size, and allows the environment to be updated block by block.
The study makes several architectural changes that make NeRF robust to data captured under different environmental conditions over several months: it adds appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and it proposes a procedure for aligning appearance between adjacent NeRFs so they can be seamlessly combined.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, a paper by UC Berkeley researchers at ECCV 2020 that was nominated for Best Paper, proposed an implicit 3D scene representation. Unlike explicit scene representations (e.g., point clouds and meshes), it renders a 2D image of the scene from a new viewpoint by solving for the color along any ray passing through the scene.
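To make the "solving for the color along a ray" idea concrete, below is a minimal sketch of NeRF's standard volume-rendering quadrature for a single camera ray. The `query_mlp` function is a hypothetical stand-in for a trained NeRF network; this is an illustrative example, not Waymo's or Berkeley's code.

```python
import numpy as np

def render_ray(query_mlp, origin, direction, t_near, t_far, n_samples=64):
    """Accumulate color along one ray with NeRF's volume-rendering quadrature.

    `query_mlp(points, view_dir)` is a hypothetical stand-in for the trained
    MLP; it returns per-sample densities (sigma) and RGB colors.
    """
    # Sample distances along the ray between the near and far planes.
    t = np.linspace(t_near, t_far, n_samples)
    points = origin + t[:, None] * direction               # (n_samples, 3)

    sigma, rgb = query_mlp(points, direction)               # (n,), (n, 3)

    # Distances between adjacent samples (last interval treated as infinite).
    deltas = np.append(t[1:] - t[:-1], 1e10)

    # alpha_i = 1 - exp(-sigma_i * delta_i): probability of stopping in bin i.
    alpha = 1.0 - np.exp(-sigma * deltas)

    # Transmittance: probability the ray reaches bin i without being absorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))

    weights = alpha * trans                                  # (n_samples,)
    color = (weights[:, None] * rgb).sum(axis=0)             # final pixel color
    return color, weights
```

Rendering a full image amounts to repeating this accumulation for every pixel's ray, which is why rendering cost grows with network size and image resolution.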
Given a set of posed camera images, NeRF enables photorealistic reconstruction and novel view synthesis. Early NeRF work tended to focus on small-scale, object-centric reconstruction. Although there are now methods that can reconstruct scenes the size of a single room or building, they remain limited in scope and cannot be extended to city-scale environments. Due to limited model capacity, applying these methods to large environments often results in noticeable artifacts and low visual fidelity.
Reconstructing large-scale environments has broad application prospects in fields such as autonomous driving and aerial surveying, for example creating large-scale, high-fidelity maps that provide prior knowledge for robot localization and navigation. In addition, autonomous driving systems are often evaluated by re-simulating previously encountered scenarios; however, any deviation from the recording can alter the vehicle's trajectory, which requires high-fidelity view rendering along the new path. Beyond basic view synthesis, a scene-conditioned NeRF can also alter environmental conditions such as camera exposure, weather, or time of day, which can be used to further augment simulated scenarios.
Paper link: https://arxiv.org/abs/2202.05263
Project link: https://waymo.com/intl/zh-cn/research/block-nerf/
As shown in the figure above, Google's Block-NeRF is a method that achieves large-scale scene reconstruction by using multiple compact NeRFs to represent the environment. At inference time, Block-NeRF seamlessly combines the renderings of the relevant NeRFs for a given area. The example above reconstructs San Francisco's Alamo Square neighborhood using data collected over a three-month period. Block-NeRF can update individual blocks of the environment without retraining the entire scene.
Reconstructing an environment at such a large scale introduces additional challenges, including the presence of transient objects (cars and pedestrians), limits on model capacity, and memory and compute constraints. Moreover, it is highly unlikely that training data for such a large environment can be collected in a single capture under consistent conditions. Instead, data for different parts of the environment may have to come from different data-collection runs, which introduces differences in both scene geometry (e.g., construction work and parked cars) and appearance (e.g., weather conditions and different times of day).
Method
The study extends NeRF with appearance embeddings and learned pose refinement to handle environmental changes and pose errors in the collected data, and also adds exposure conditioning so that exposure can be modified during inference. The researchers call the model with these changes Block-NeRF. Simply expanding a single Block-NeRF's network capacity would allow it to represent larger and larger scenes, but this approach on its own has many limitations: rendering time grows with the size of the network, the network no longer fits on a single compute device, and updating or expanding the environment requires retraining the whole network.
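The sketch below illustrates, in PyTorch, how a NeRF's color branch could be conditioned on a per-image appearance embedding and a scalar exposure value while the density branch remains appearance-free. It is an assumed, simplified architecture for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class ConditionedNeRF(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): density depends
    only on position, while color is additionally conditioned on viewing
    direction, a per-image appearance embedding, and the camera exposure."""

    def __init__(self, n_images, appearance_dim=32, hidden=256):
        super().__init__()
        # GLO-style table of per-image appearance codes, optimized with the network.
        self.appearance = nn.Embedding(n_images, appearance_dim)
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3 + appearance_dim + 1, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, xyz, view_dir, image_idx, exposure):
        h = self.trunk(xyz)
        sigma = torch.relu(self.density_head(h))       # density is appearance-free
        app = self.appearance(image_idx)                # per-image latent code
        color_in = torch.cat([h, view_dir, app, exposure], dim=-1)
        rgb = self.color_head(color_in)
        return sigma, rgb
```

Keeping appearance and exposure out of the density branch is what lets the geometry stay fixed while lighting, weather, and exposure are varied at render time.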
To address these challenges, the researchers propose dividing large environments into multiple individually trained Block-NeRFs that are rendered and combined dynamically at inference time. Modeling these Block-NeRFs independently allows maximum flexibility, scales to arbitrarily large environments, and makes it possible to update or introduce new regions piecewise without retraining the entire environment. To compute a target view, only a subset of the Block-NeRFs is rendered and then composited based on their geographic location relative to the camera. To make the compositing more seamless, Google proposes an appearance-matching technique that visually aligns different Block-NeRFs by optimizing their appearance embeddings.
Figure 2. The reconstructed scene is divided into multiple Block-NeRFs, each trained on data within a certain radius (orange dotted line) of a specific Block-NeRF origin coordinate (orange dot).
The study builds its Block-NeRF implementation on top of mip-NeRF, which mitigates the aliasing problems that hurt NeRF's performance when input images observe the scene from many different distances. The researchers also incorporate techniques from NeRF in the Wild (NeRF-W), which adds a latent code for each training image to handle inconsistent scene appearance when applying NeRF to the landmarks of the Photo Tourism dataset. Whereas NeRF-W creates a separate NeRF for each landmark from thousands of images, Google's new approach combines many NeRFs to reconstruct a coherent environment from millions of images, and it additionally incorporates learned camera pose refinement.
Figure 3. The new model is an extension of the model proposed in mip-NeRF.
Some NeRF-based approaches use segmentation data to isolate and reconstruct static and dynamic objects (such as people or cars) in video sequences. Since this study focuses primarily on reconstructing the environment itself, it simply chooses to mask out dynamic objects during training.
To dynamically select the relevant Block-NeRFs and render and composite them smoothly while traversing the scene, Google optimizes the appearance codes to match the lighting conditions and uses interpolation weights computed from each Block-NeRF's distance to the novel view.
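One simple way to realize distance-based compositing is inverse-distance weighting of the per-block renders, as in the sketch below. The exponent `gamma` and the array shapes are assumed for illustration and are not necessarily the values or interfaces used in the paper.

```python
import numpy as np

def composite_renders(camera_pos, block_origins, block_rgbs, gamma=4.0):
    """Blend per-Block-NeRF renders of the same target view.

    Weights follow an inverse-distance-to-the-power-gamma scheme between the
    camera and each Block-NeRF origin; gamma=4.0 is an assumed value."""
    d = np.linalg.norm(block_origins - camera_pos, axis=1)   # (n_blocks,)
    w = 1.0 / np.power(d, gamma)
    w = w / w.sum()                                          # normalize weights
    # block_rgbs: (n_blocks, H, W, 3) renders of the same target view
    return np.tensordot(w, block_rgbs, axes=1)               # (H, W, 3)
```

Because the weights vary smoothly with camera position, the blended image transitions smoothly as the virtual camera moves from one block's region into the next.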
Reconstruction results
Given that different parts of the data may be captured under different environmental conditions, the algorithm follows NeRF-W and uses Generative Latent Optimization (GLO) to optimize a per-image appearance embedding vector. This allows the NeRF to account for several appearance-changing conditions, such as varying weather and lighting. These appearance embeddings can also be manipulated to interpolate between different conditions observed in the training data (e.g., cloudy versus clear skies, or day versus night).
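A sketch of how such an interpolation sweep could look is given below. Here `render_fn` is a hypothetical helper that renders the target view conditioned on a given appearance code; the function names are assumptions, not the released API.

```python
import torch

def appearance_sweep(render_fn, code_a, code_b, n_steps=5):
    """Sweep renders between two observed conditions (e.g., cloudy vs. clear,
    or day vs. night) by linearly interpolating their learned GLO appearance
    codes. `render_fn(code)` is a hypothetical rendering helper."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        code = torch.lerp(code_a, code_b, alpha)   # blend the two latent codes
        frames.append(render_fn(code))
    return frames
```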
Figure 4. Appearance codes allow the model to exhibit different lighting and weather conditions.
The entire environment can be composed of any number of Block-NeRFs. For efficiency, the researchers use two filtering mechanisms so that only the blocks relevant to a given target viewpoint are rendered: only Block-NeRFs within a set radius of the target viewpoint are considered, and the system additionally computes a visibility score for each remaining candidate, discarding a Block-NeRF if its mean visibility falls below a threshold. Figure 2 provides an example of visibility filtering. Visibility can be computed quickly because its network is independent of the color network and does not need to be rendered at the target image resolution. After filtering, there are typically one to three Block-NeRFs left to merge.
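The two-stage selection can be pictured with the following sketch. The `blocks` structure, the `visibility` callable, and the hyperparameter values are assumptions made for illustration; the paper only specifies the radius and visibility-threshold filters conceptually.

```python
import numpy as np

def select_blocks(camera_pos, blocks, radius, vis_threshold=0.1):
    """Pick the Block-NeRFs worth rendering for one target camera.

    `blocks` is an assumed list of dicts with an 'origin' (3-vector) and a
    cheap `visibility(camera_pos)` callable backed by the small visibility
    network; `radius` and `vis_threshold` are illustrative hyperparameters."""
    selected = []
    for block in blocks:
        # 1) Distance filter: only blocks whose training radius covers the camera.
        if np.linalg.norm(block["origin"] - camera_pos) > radius:
            continue
        # 2) Visibility filter: drop blocks with low mean predicted visibility.
        if block["visibility"](camera_pos) < vis_threshold:
            continue
        selected.append(block)
    return selected   # typically 1-3 blocks remain, to be rendered and merged
```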
Figure 5. Google's model includes exposure conditioning, which helps explain the exposure variation present in the training data and allows users to change the appearance of the output image in a human-interpretable way during inference.
To reconstruct entire urban scenes, the researchers captured long sequences of data (over 100 seconds each) while recording Street View and repeatedly captured different sequences in specific target areas over several months. Google uses image data captured from 12 cameras that together provide a 360° view: eight cameras provide a complete surround view from the roof, while the other four are located at the front of the vehicle, pointing forward and to the sides. Each camera captures images at 10 Hz and stores a scalar exposure value. The vehicle pose is known and all cameras are calibrated.
With this information, the study computes the corresponding camera ray origins and directions in a common coordinate system, taking the cameras' rolling shutter into account.
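The sketch below shows one way posed, calibrated pixels can be turned into world-space rays, with the rolling shutter approximated by giving each image row its own interpolated vehicle pose. The variable names and the per-row pose format are assumptions for illustration, not the paper's data format.

```python
import numpy as np

def pixel_rays(K, cam_to_world_rows, width, height):
    """Turn calibrated pixels into world-space ray origins and directions.

    `K` is the 3x3 camera intrinsics; `cam_to_world_rows[r]` is a 4x4
    camera-to-world pose for image row r (one pose per row approximates the
    rolling shutter by interpolating the vehicle pose across the readout)."""
    K_inv = np.linalg.inv(K)
    origins = np.zeros((height, width, 3))
    directions = np.zeros((height, width, 3))
    for r in range(height):
        T = cam_to_world_rows[r]                 # pose at the time row r was read out
        R, t = T[:3, :3], T[:3, 3]
        u = np.stack([np.arange(width) + 0.5,
                      np.full(width, r + 0.5),
                      np.ones(width)], axis=-1)  # homogeneous pixel centers
        d_cam = u @ K_inv.T                      # ray directions in the camera frame
        d_world = d_cam @ R.T                    # rotate into the world frame
        directions[r] = d_world / np.linalg.norm(d_world, axis=-1, keepdims=True)
        origins[r] = t                           # rays in a row share that row's origin
    return origins, directions
```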
Figure 6. When rendering a scene from multiple Block-NeRFs, the algorithm uses appearance matching to obtain a consistent look across the entire scene. Given a fixed target appearance for one Block-NeRF (left), the algorithm optimizes the appearance of the adjacent Block-NeRFs to match it. In this example, appearance matching produces a consistent nighttime appearance across the Block-NeRFs.
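The core of this appearance matching can be sketched as a small optimization that freezes one block's look and tunes only the neighbor's appearance code on an overlapping region. The rendering helpers, the `overlap_rays` input, and the optimizer settings below are assumptions for illustration.

```python
import torch

def match_appearance(render_target, render_free, code_free, overlap_rays, steps=100):
    """Freeze one Block-NeRF's look and optimize only the neighbor's appearance
    code so both render the shared overlap region consistently.

    `render_target(rays)` and `render_free(rays, code)` are assumed helpers that
    render a batch of rays; only the appearance code is updated, never weights."""
    code = code_free.clone().requires_grad_(True)
    opt = torch.optim.Adam([code], lr=1e-2)
    with torch.no_grad():
        target_rgb = render_target(overlap_rays)        # fixed reference appearance
    for _ in range(steps):
        pred_rgb = render_free(overlap_rays, code)      # neighbor with current code
        loss = torch.nn.functional.mse_loss(pred_rgb, target_rgb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return code.detach()
```

Because only a low-dimensional code is optimized, this alignment is cheap and can be propagated from block to block across the scene.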
Figure 7. Model ablation results on multi-segment data. Appearance embeddings help the network avoid adding cloudy geometry to explain environmental changes such as weather and lighting. Removing exposure conditioning slightly reduces accuracy. Pose optimization helps sharpen results and removes ghosting of duplicated objects, as can be seen on the pole in the first row.
Future outlook
Google's researchers say the new method still has some unresolved issues: for example, some vehicles and shadows are not removed correctly, and vegetation becomes blurred in the virtual environment because its appearance changes with the seasons. Likewise, temporal inconsistencies in the training data, such as construction work, cannot be handled automatically, and the affected areas need to be retrained manually.
In addition, the current inability to render scenes containing dynamic objects limits Block-NeRF's applicability to closed-loop robotic simulation tasks. In the future, these problems may be addressed by learning to account for transient objects during optimization, or by modeling dynamic objects directly.