
Waymo's new study: "Rebuilding" San Francisco from 2.8 million images for autonomous driving


Throughout the life cycle of an autonomous driving algorithm, 3D models that reproduce the real environment play an important role.

Author | Cosca

Editor | Literary

The industry generally agrees that massive data, high-definition maps, and virtual simulation are essential to taking autonomous driving to the next level.

However, given the current state of the industry, there is still a long way to go before autonomous driving truly reaches deployment. So in addition to continuous testing on real roads, researchers have been exploring long-tail scenarios in simulated environments, hoping to handle the variables that rarely appear in reality.

For example, Cruise's virtual test vehicles run 200,000 hours a day in simulated scenarios; Waymo, an industry leader, logs a century of driving in its simulator every day and has accumulated more than 15 billion simulated test miles; NVIDIA has built NVIDIA DRIVE Sim, a powerful cloud computing platform that can generate datasets for training vehicle perception systems; and so on.

But as mentioned above, the algorithms' road of "leveling up by fighting monsters" is a hard one. Transient objects (cars, pedestrians, and so on), changes in weather and lighting conditions, and limits on model capacity, memory, and compute all stand in the way; even under fixed conditions, trying to train on a complex environment all at once is pure fantasy.

Researchers from UC Berkeley, Waymo, and Google AI trained a grid of Block-NeRFs on more than 2.8 million images to render the entire 3D environment of downtown San Francisco. It is the largest neural scene representation to date.

Illustration: Rendering a San Francisco neighborhood with Block-NeRFs

Judging from the video, the purely automatically generated buildings already look better than the "AI generation + manual tweaking" buildings in Microsoft Flight Simulator, and they far surpass many 3D models reconstructed by AI from real photos.

The widely admired work, Block-NeRF: Scalable Large Scene Neural View Synthesis, has been dubbed "Google Maps 2.0" or "Neural Google Maps" by many netizens.


Screenshot of Waymo's new research paper

Block-NeRF is a variant of the neural radiance field that can represent large-scale environments. Its biggest highlight is extending NeRF's range of application from small, single-scene objects all the way to the city scale. This block decomposition decouples rendering time from scene size, so when the environment changes, only the affected blocks need to be updated rather than the whole scene retrained.
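To make the idea concrete, here is a minimal sketch (the class and method names are our own illustration, not from the paper) of how such a per-block decomposition could be organized: each block owns an independently trained model keyed by its origin coordinate, and an environment change only triggers retraining of the blocks that cover the changed area.

```python
import math

class BlockNeRFGrid:
    """Illustrative registry of independently trained per-block models."""

    def __init__(self, block_radius):
        self.block_radius = block_radius   # coverage radius of one block (assumed)
        self.blocks = {}                   # origin (x, y) -> trained model

    def add_block(self, origin, model):
        self.blocks[origin] = model

    def blocks_near(self, point, max_dist=None):
        """Blocks whose coverage area contains `point` (used both for
        rendering and for deciding what to retrain after a map change)."""
        if max_dist is None:
            max_dist = self.block_radius
        return [
            (origin, model) for origin, model in self.blocks.items()
            if math.dist(origin, point) <= max_dist
        ]

    def update_region(self, changed_point, retrain_fn):
        """Only the blocks covering the changed area are retrained;
        the rest of the city-scale representation is left untouched."""
        for origin, model in self.blocks_near(changed_point):
            self.blocks[origin] = retrain_fn(model, changed_point)
```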


As shown in the figure, the reconstructed scene is divided into multiple Block-NeRFs, each trained on the data inside its target region (orange dashed line) around a specific Block-NeRF origin coordinate (orange dot).

The researchers extend NeRF with appearance embeddings and learned pose refinement to absorb the environmental changes and pose errors present in the collected data, and add exposure conditioning so that exposure can be adjusted at inference time.
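As a rough sketch of how this conditioning might enter the network (the plain-numpy layer, weight names, and shapes below are our own illustration, not the paper's architecture), the per-image appearance embedding and the exposure value simply become extra inputs to the branch that predicts color, while the geometry stays independent of them:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict_color(position_feat, view_feat, appearance_embed, exposure, weights):
    """Toy color head: positional features, viewing direction, the appearance
    embedding, and the exposure scalar are concatenated and pushed through a
    small MLP. `weights` is a dict of placeholder parameter arrays."""
    x = np.concatenate([position_feat, view_feat,
                        appearance_embed, np.atleast_1d(exposure)])
    h = relu(weights["w1"] @ x + weights["b1"])
    rgb = 1.0 / (1.0 + np.exp(-(weights["w2"] @ h + weights["b2"])))  # sigmoid
    return rgb  # color varies with weather/lighting/exposure, geometry does not
```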

NeRF (Neural Radiance Fields), published at ECCV 2020, is easily one of the most talked-about techniques of recent years. It produces stunning results on a series of challenging scenes and delivers a high level of fidelity.

The core of NeRF consists of two parts:

Model the radiance field and density of a scene with neural network weights.

Use volume rendering to synthesize novel views: given a set of input images and camera poses for an environment, composite viewpoints that were never actually observed, allowing the user to navigate the reconstructed environment with high visual fidelity (see the sketch below).
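The volume rendering step can be written out concretely. The snippet below is the standard NeRF quadrature in a simplified numpy form (array shapes and function names are illustrative, not code from the paper): sample density and color along a camera ray, accumulate transmittance, and alpha-composite the samples into a single pixel color.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Standard NeRF-style numerical quadrature along one ray.
    sigmas: (N,) densities at the sampled points
    colors: (N, 3) RGB values predicted at the sampled points
    deltas: (N,) distances between consecutive samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))    # transmittance T_i
    weights = trans * alphas                                          # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                    # composited pixel color
```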


Note: NeRF's volume rendering principle

Block-NeRF is the variant built on top of this. In principle, simply expanding NeRF's network capacity could represent ever larger scenes, but the naive approach has many limitations: rendering time grows with network size, the network no longer fits on a single compute device, and updating or expanding the environment requires retraining the whole network.

To address this, the researchers propose dividing a large environment into multiple individually trained Block-NeRFs, which are then dynamically rendered and combined at inference time.

Modeling these Block-NeRFs individually also maximizes flexibility and allows the method to scale to arbitrarily large environments. To make NeRF robust to data captured over many months under different environmental conditions, each individual NeRF receives an appearance embedding, learned pose refinement, and controllable exposure, and an appearance-alignment procedure between adjacent NeRFs is introduced so they can be combined seamlessly.

For example, one Block-NeRF is placed at each intersection and covers the intersection itself plus 75% of every street connected to it, so that any two adjacent Block-NeRFs overlap by 50%.

Note: San Francisco's Alamo Square neighborhood, reconstructed from data collected over a three-month period

Block-NeRF builds on NeRF and mip-NeRF. mip-NeRF is an anti-aliased variant of neural radiance fields: it eliminates the aliasing that degrades NeRF's quality when the input images observe the scene from many different distances. By combining a large number of NeRFs, Block-NeRF reconstructs a substantial, coherent environment from millions of photos.

Note: The new model is an extension of the model proposed in mip-NeRF
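The key ingredient mip-NeRF contributes is its integrated positional encoding, which encodes a Gaussian region of space rather than a single point, so high frequencies fade out where the sample footprint is large; that is what suppresses the aliasing mentioned above. A simplified numpy version of that formula (the argument names and frequency count below are our own choices) might look like this:

```python
import numpy as np

def integrated_positional_encoding(mu, var, num_freqs=8):
    """Mip-NeRF's integrated positional encoding: encode a Gaussian
    (mean `mu`, per-dimension variance `var`) instead of a single point.
    High frequencies are attenuated when the variance is large, which
    removes the aliasing seen with the plain NeRF encoding."""
    feats = []
    for l in range(num_freqs):
        scale = 2.0 ** l
        damp = np.exp(-0.5 * (scale ** 2) * var)   # E[sin/cos] under the Gaussian
        feats.append(np.sin(scale * mu) * damp)
        feats.append(np.cos(scale * mu) * damp)
    return np.concatenate(feats)
```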

Essentially, the research team divides the city-scale scene into many lower-capacity models, which also significantly reduces the overall computational cost. Block-NeRF deals with transient objects efficiently by filtering them out with a segmentation algorithm.
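A hedged sketch of that filtering step, assuming a per-pixel semantic segmentation map is available (the class ids and function below are hypothetical, not the paper's code): pixels labeled as movable objects simply do not contribute to the reconstruction loss, so only the static scene is learned.

```python
import numpy as np

# Hypothetical class ids for movable objects in the segmentation output.
TRANSIENT_CLASSES = {11, 12, 13}   # e.g. person, rider, car

def masked_photometric_loss(pred_rgb, gt_rgb, seg_labels):
    """Pixels covered by transient objects contribute nothing to the loss.
    pred_rgb, gt_rgb: (H, W, 3) rendered and ground-truth images
    seg_labels:       (H, W) per-pixel semantic class ids
    """
    keep = ~np.isin(seg_labels, list(TRANSIENT_CLASSES))   # static pixels only
    diff = (pred_rgb - gt_rgb) ** 2
    return diff[keep].mean()
```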

At the same time, the researchers optimize appearance codes to match lighting conditions, so the relevant Block-NeRFs can be dynamically selected for rendering and composited smoothly as the camera traverses the scene, with interpolation weights computed from each Block-NeRF's distance to the new viewpoint.
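A small sketch of what such distance-based compositing could look like (the inverse-distance exponent and the function signature are assumptions, not values from the paper): each nearby block renders the view, and the images are blended with weights that fall off with distance from the block origin.

```python
import numpy as np

def composite_blocks(renders, block_origins, cam_xy, power=4.0, eps=1e-6):
    """Blend images rendered by nearby Block-NeRFs with inverse-distance weights.
    renders:       list of (H, W, 3) images, one per nearby block
    block_origins: (num_blocks, 2) block origin coordinates
    cam_xy:        (2,) position of the novel viewpoint
    """
    d = np.linalg.norm(np.asarray(block_origins) - np.asarray(cam_xy), axis=1)
    w = 1.0 / np.maximum(d, eps) ** power   # closer blocks dominate
    w = w / w.sum()
    stacked = np.stack(renders)             # (num_blocks, H, W, 3)
    return np.tensordot(w, stacked, axes=1) # weighted average image
```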

The reconstruction results are just as impressive. Because different parts of the data were captured under different environmental conditions, the algorithm follows NeRF-W's Generative Latent Optimization and optimizes a per-image appearance embedding vector. This lets NeRF account for appearance-changing conditions such as varying weather and lighting. These appearance embeddings can also be manipulated to interpolate between observed conditions (e.g., cloudy versus clear skies, day versus night).
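For illustration, interpolating between two learned appearance codes is just a linear blend of the embedding vectors before they are fed to the model (the function and the commented usage below are a hypothetical sketch, not the authors' code):

```python
import numpy as np

def blend_appearance(code_a, code_b, t):
    """Linearly interpolate between two learned appearance embeddings,
    e.g. `code_a` optimized on a sunny image and `code_b` on an overcast one.
    Feeding the blended code to the model renders an in-between condition."""
    return (1.0 - t) * np.asarray(code_a) + t * np.asarray(code_b)

# Illustrative usage: sweep from condition A to condition B in five steps.
# frames = [render(scene, appearance=blend_appearance(a, b, t))
#           for t in np.linspace(0.0, 1.0, 5)]
```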

Note: Appearance codes allow models to represent different lighting and weather conditions


Note: The resulting model is conditioned on exposure, which helps it explain the exposure variation present in the training data


Note: Ablation results on data collected in multiple segments. Appearance embeddings keep the network from inventing geometry to explain changes in the environment; exposure input removes the accuracy loss caused by exposure differences; pose optimization sharpens the results and eliminates ghosting of repeated objects, such as the utility poles in the first row

Final thoughts

As the paper makes clear, Block-NeRF's "main battlefield" is autonomous driving, and its focus is on extending NeRF to large-scale urban scenes. The study, however, concentrates on expanding horizontal coverage and says little about the vertical dimension.

With Block-NeRFs in the toolbox, the physical-space representation of the metaverse may have found another smooth path forward. In the future, street-view maps may no longer take the form of "points plus panoramic photos", but become continuous, realistic scenes whose viewpoint can be switched at will.

The researchers say Block-NeRF is no panacea, and many problems remain to be solved. For example, some vehicles and shadows are not removed cleanly in the virtual environment, and vegetation whose appearance changes with the seasons comes out blurred. The system also cannot automatically handle temporal inconsistencies in the training data, so the affected blocks have to be retrained manually.

At present, the inability to render dynamic objects limits Block-NeRF's applicability to closed-loop robotic simulation tasks; solving this may require modeling dynamic objects directly.

Paper link: arxiv.org/abs/2202.05263


END
