Report from Machine Heart (Synced)
Editors: Zenan, Boat
Is this really not being used to build a metaverse?
Training self-driving systems requires high-precision maps, massive amounts of data, and virtual environments, and every tech company working in this direction has its own approach: Waymo operates its own fleet of self-driving taxis, while NVIDIA has built a virtual environment for large-scale training, the NVIDIA DRIVE Sim platform. Recently, researchers from Google AI and Waymo, Alphabet's self-driving subsidiary, proposed a new idea: reconstructing the entire 3D environment of downtown San Francisco from 2.8 million Street View photos.

Using a large number of Street View images, Google's researchers built a grid of Block-NeRFs that forms the largest neural scene representation to date, rendering street-level views of San Francisco.
As soon as the study was posted to arXiv, Jeff Dean retweeted it:
Block-NeRF is a variant of neural radiance fields that can represent large-scale environments. Specifically, the study shows that when scaling NeRF to render urban scenes spanning multiple blocks, it is critical to decompose the scene into multiple individually trained NeRFs. This decomposition decouples rendering time from scene size, allows rendering to scale to environments of arbitrary size, and allows the environment to be updated block by block.
The study makes several architectural changes that make NeRF robust to data captured under different environmental conditions over several months: it adds appearance embeddings, learned pose refinement, and controllable exposure to each individual NeRF, and it proposes a procedure for aligning appearance between adjacent NeRFs so they can be seamlessly combined.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, a paper by UC Berkeley researchers at ECCV 2020 that was nominated for Best Paper, proposed an implicit 3D scene representation. Unlike explicit scene representations (e.g., point clouds and meshes), it renders a 2D image of the scene from a new viewpoint by solving for the color along any ray passing through the scene.
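To make the "solving for the color along a ray" idea concrete, below is a minimal sketch of NeRF's standard volume-rendering quadrature for a single camera ray. The `query_mlp` function is a hypothetical stand-in for a trained NeRF network; this is an illustrative example, not Waymo's or Berkeley's code.

```python
import numpy as np

def render_ray(query_mlp, origin, direction, t_near, t_far, n_samples=64):
    """Accumulate color along one ray with NeRF's volume-rendering quadrature.

    `query_mlp(points, view_dir)` is a hypothetical stand-in for the trained
    MLP; it returns per-sample densities (sigma) and RGB colors.
    """
    # Sample distances along the ray between the near and far planes.
    t = np.linspace(t_near, t_far, n_samples)
    points = origin + t[:, None] * direction               # (n_samples, 3)

    sigma, rgb = query_mlp(points, direction)               # (n,), (n, 3)

    # Distances between adjacent samples (last interval treated as infinite).
    deltas = np.append(t[1:] - t[:-1], 1e10)

    # alpha_i = 1 - exp(-sigma_i * delta_i): probability of stopping in bin i.
    alpha = 1.0 - np.exp(-sigma * deltas)

    # Transmittance: probability the ray reaches bin i without being absorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1] + 1e-10))

    weights = alpha * trans                                  # (n_samples,)
    color = (weights[:, None] * rgb).sum(axis=0)             # final pixel color
    return color, weights
```

Rendering a full image amounts to repeating this accumulation for every pixel's ray, which is why rendering cost grows with network size and image resolution.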
Given a set of posed camera images, NeRF enables photorealistic reconstruction and novel view synthesis. Early NeRF work tended to focus on small-scale, object-centric reconstruction. Although there are now methods that can reconstruct scenes the size of a single room or building, they remain limited in scope and cannot be extended to city-scale environments. Due to limited model capacity, applying these methods to large environments often results in noticeable artifacts and low visual fidelity.
Reconstructing large-scale environments has broad application prospects in fields such as autonomous driving and aerial surveying, for example creating large-scale, high-fidelity maps that provide prior knowledge for robot localization and navigation. In addition, autonomous driving systems are often evaluated by re-simulating previously encountered scenarios; however, any deviation from the recording can alter the vehicle's trajectory, which requires high-fidelity view rendering along the new path. Beyond basic view synthesis, a scene-conditioned NeRF can also alter environmental conditions such as camera exposure, weather, or time of day, which can be used to further augment simulated scenarios.
Paper link: https://arxiv.org/abs/2202.05263
Project link: https://waymo.com/intl/zh-cn/research/block-nerf/
As shown in the figure above, Google's Block-NeRF is a method that achieves large-scale scene reconstruction by using multiple compact NeRFs to represent the environment. At inference time, Block-NeRF seamlessly combines the renderings of the relevant NeRFs for a given area. The example above reconstructs San Francisco's Alamo Square neighborhood using data collected over a three-month period. Block-NeRF can update individual blocks of the environment without retraining the entire scene.
Reconstructing an environment at such a large scale introduces additional challenges, including the presence of transient objects (cars and pedestrians), limits on model capacity, and memory and compute constraints. Moreover, it is highly unlikely that training data for such a large environment can be collected in a single capture under consistent conditions. Instead, data for different parts of the environment may have to come from different data-collection runs, which introduces differences in both scene geometry (e.g., construction work and parked cars) and appearance (e.g., weather conditions and different times of day).
Method
The study extends NeRF with appearance embeddings and learned pose refinement to handle environmental changes and pose errors in the collected data, and also adds exposure conditioning so that exposure can be modified during inference. The researchers call the model with these changes Block-NeRF. Simply expanding a single Block-NeRF's network capacity would allow it to represent larger and larger scenes, but this approach on its own has many limitations: rendering time grows with the size of the network, the network no longer fits on a single compute device, and updating or expanding the environment requires retraining the whole network.
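The sketch below illustrates, in PyTorch, how a NeRF's color branch could be conditioned on a per-image appearance embedding and a scalar exposure value while the density branch remains appearance-free. It is an assumed, simplified architecture for illustration, not the paper's exact network.

```python
import torch
import torch.nn as nn

class ConditionedNeRF(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): density depends
    only on position, while color is additionally conditioned on viewing
    direction, a per-image appearance embedding, and the camera exposure."""

    def __init__(self, n_images, appearance_dim=32, hidden=256):
        super().__init__()
        # GLO-style table of per-image appearance codes, optimized with the network.
        self.appearance = nn.Embedding(n_images, appearance_dim)
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3 + appearance_dim + 1, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, xyz, view_dir, image_idx, exposure):
        h = self.trunk(xyz)
        sigma = torch.relu(self.density_head(h))       # density is appearance-free
        app = self.appearance(image_idx)                # per-image latent code
        color_in = torch.cat([h, view_dir, app, exposure], dim=-1)
        rgb = self.color_head(color_in)
        return sigma, rgb
```

Keeping appearance and exposure out of the density branch is what lets the geometry stay fixed while lighting, weather, and exposure are varied at render time.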
To address these challenges, the researchers propose dividing large environments into multiple individually trained Block-NeRFs that are rendered and combined dynamically at inference time. Modeling these Block-NeRFs independently allows maximum flexibility, scales to arbitrarily large environments, and makes it possible to update or introduce new regions piecewise without retraining the entire environment. To compute a target view, only a subset of the Block-NeRFs is rendered and then composited based on their geographic location relative to the camera. To make the compositing more seamless, Google proposes an appearance-matching technique that visually aligns different Block-NeRFs by optimizing their appearance embeddings.
Figure 2. The reconstructed scene is divided into multiple Block-NeRFs, each trained on data within a certain radius (orange dotted line) of a specific Block-NeRF origin coordinate (orange dot).
The study builds its Block-NeRF implementation on top of mip-NeRF, which mitigates the aliasing problems that hurt NeRF's performance when input images observe the scene from many different distances. The researchers also incorporate techniques from NeRF in the Wild (NeRF-W), which adds a latent code for each training image to handle inconsistent scene appearance when applying NeRF to the landmarks of the Photo Tourism dataset. Whereas NeRF-W creates a separate NeRF for each landmark from thousands of images, Google's new approach combines many NeRFs to reconstruct a coherent environment from millions of images, and it additionally incorporates learned camera pose refinement.
Figure 3. The new model is an extension of the model proposed in mip-NeRF.
Some NeRF-based approaches use segmentation data to isolate and reconstruct static and dynamic objects (such as people or cars) in video sequences. Since this study focuses primarily on reconstructing the environment itself, it simply chooses to mask out dynamic objects during training.
To dynamically select the relevant Block-NeRFs and render and composite them smoothly while traversing the scene, Google optimizes the appearance codes to match the lighting conditions and uses interpolation weights computed from each Block-NeRF's distance to the novel view.
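One simple way to realize distance-based compositing is inverse-distance weighting of the per-block renders, as in the sketch below. The exponent `gamma` and the array shapes are assumed for illustration and are not necessarily the values or interfaces used in the paper.

```python
import numpy as np

def composite_renders(camera_pos, block_origins, block_rgbs, gamma=4.0):
    """Blend per-Block-NeRF renders of the same target view.

    Weights follow an inverse-distance-to-the-power-gamma scheme between the
    camera and each Block-NeRF origin; gamma=4.0 is an assumed value."""
    d = np.linalg.norm(block_origins - camera_pos, axis=1)   # (n_blocks,)
    w = 1.0 / np.power(d, gamma)
    w = w / w.sum()                                          # normalize weights
    # block_rgbs: (n_blocks, H, W, 3) renders of the same target view
    return np.tensordot(w, block_rgbs, axes=1)               # (H, W, 3)
```

Because the weights vary smoothly with camera position, the blended image transitions smoothly as the virtual camera moves from one block's region into the next.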
Reconstruction results
Given that different parts of the data may be captured under different environmental conditions, the algorithm follows NeRF-W and uses Generative Latent Optimization (GLO) to optimize a per-image appearance embedding vector. This allows the NeRF to account for several appearance-changing conditions, such as varying weather and lighting. These appearance embeddings can also be manipulated to interpolate between different conditions observed in the training data (e.g., cloudy versus clear skies, or day versus night).
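A sketch of how such an interpolation sweep could look is given below. Here `render_fn` is a hypothetical helper that renders the target view conditioned on a given appearance code; the function names are assumptions, not the released API.

```python
import torch

def appearance_sweep(render_fn, code_a, code_b, n_steps=5):
    """Sweep renders between two observed conditions (e.g., cloudy vs. clear,
    or day vs. night) by linearly interpolating their learned GLO appearance
    codes. `render_fn(code)` is a hypothetical rendering helper."""
    frames = []
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        code = torch.lerp(code_a, code_b, alpha)   # blend the two latent codes
        frames.append(render_fn(code))
    return frames
```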
Figure 4. Appearance codes allow the model to exhibit different lighting and weather conditions.
The entire environment can be composed of any number of Block-NeRFs. For efficiency, the researchers use two filtering mechanisms so that only the blocks relevant to a given target viewpoint are rendered: only Block-NeRFs within a set radius of the target viewpoint are considered, and the system additionally computes a visibility score for each remaining candidate, discarding a Block-NeRF if its mean visibility falls below a threshold. Figure 2 provides an example of visibility filtering. Visibility can be computed quickly because its network is independent of the color network and does not need to be rendered at the target image resolution. After filtering, there are typically one to three Block-NeRFs left to merge.
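The two-stage selection can be pictured with the following sketch. The `blocks` structure, the `visibility` callable, and the hyperparameter values are assumptions made for illustration; the paper only specifies the radius and visibility-threshold filters conceptually.

```python
import numpy as np

def select_blocks(camera_pos, blocks, radius, vis_threshold=0.1):
    """Pick the Block-NeRFs worth rendering for one target camera.

    `blocks` is an assumed list of dicts with an 'origin' (3-vector) and a
    cheap `visibility(camera_pos)` callable backed by the small visibility
    network; `radius` and `vis_threshold` are illustrative hyperparameters."""
    selected = []
    for block in blocks:
        # 1) Distance filter: only blocks whose training radius covers the camera.
        if np.linalg.norm(block["origin"] - camera_pos) > radius:
            continue
        # 2) Visibility filter: drop blocks with low mean predicted visibility.
        if block["visibility"](camera_pos) < vis_threshold:
            continue
        selected.append(block)
    return selected   # typically 1-3 blocks remain, to be rendered and merged
```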
Figure 5. Google's model includes exposure conditioning, which helps explain the exposure variation present in the training data and allows users to change the appearance of the output image in a human-interpretable way during inference.
To reconstruct entire urban scenes, the researchers captured long sequences of data (over 100 seconds each) while recording Street View and repeatedly captured different sequences in specific target areas over several months. Google uses image data captured from 12 cameras that together provide a 360° view: eight cameras provide a complete surround view from the roof, while the other four are located at the front of the vehicle, pointing forward and to the sides. Each camera captures images at 10 Hz and stores a scalar exposure value. The vehicle pose is known and all cameras are calibrated.
With this information, the study computes the corresponding camera ray origins and directions in a common coordinate system, taking the cameras' rolling shutter into account.
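The sketch below shows one way posed, calibrated pixels can be turned into world-space rays, with the rolling shutter approximated by giving each image row its own interpolated vehicle pose. The variable names and the per-row pose format are assumptions for illustration, not the paper's data format.

```python
import numpy as np

def pixel_rays(K, cam_to_world_rows, width, height):
    """Turn calibrated pixels into world-space ray origins and directions.

    `K` is the 3x3 camera intrinsics; `cam_to_world_rows[r]` is a 4x4
    camera-to-world pose for image row r (one pose per row approximates the
    rolling shutter by interpolating the vehicle pose across the readout)."""
    K_inv = np.linalg.inv(K)
    origins = np.zeros((height, width, 3))
    directions = np.zeros((height, width, 3))
    for r in range(height):
        T = cam_to_world_rows[r]                 # pose at the time row r was read out
        R, t = T[:3, :3], T[:3, 3]
        u = np.stack([np.arange(width) + 0.5,
                      np.full(width, r + 0.5),
                      np.ones(width)], axis=-1)  # homogeneous pixel centers
        d_cam = u @ K_inv.T                      # ray directions in the camera frame
        d_world = d_cam @ R.T                    # rotate into the world frame
        directions[r] = d_world / np.linalg.norm(d_world, axis=-1, keepdims=True)
        origins[r] = t                           # rays in a row share that row's origin
    return origins, directions
```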
Figure 6. When rendering a scene from multiple Block-NeRFs, the algorithm uses appearance matching to obtain a consistent look across the entire scene. Given a fixed target appearance for one Block-NeRF (left), the algorithm optimizes the appearance of the adjacent Block-NeRFs to match it. In this example, appearance matching produces a consistent nighttime appearance across the Block-NeRFs.
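The core of this appearance matching can be sketched as a small optimization that freezes one block's look and tunes only the neighbor's appearance code on an overlapping region. The rendering helpers, the `overlap_rays` input, and the optimizer settings below are assumptions for illustration.

```python
import torch

def match_appearance(render_target, render_free, code_free, overlap_rays, steps=100):
    """Freeze one Block-NeRF's look and optimize only the neighbor's appearance
    code so both render the shared overlap region consistently.

    `render_target(rays)` and `render_free(rays, code)` are assumed helpers that
    render a batch of rays; only the appearance code is updated, never weights."""
    code = code_free.clone().requires_grad_(True)
    opt = torch.optim.Adam([code], lr=1e-2)
    with torch.no_grad():
        target_rgb = render_target(overlap_rays)        # fixed reference appearance
    for _ in range(steps):
        pred_rgb = render_free(overlap_rays, code)      # neighbor with current code
        loss = torch.nn.functional.mse_loss(pred_rgb, target_rgb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return code.detach()
```

Because only a low-dimensional code is optimized, this alignment is cheap and can be propagated from block to block across the scene.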
Figure 7. Model ablation results on multi-segment data. Appearance embeddings help the network avoid adding cloudy geometry to explain environmental changes such as weather and lighting. Removing exposure conditioning slightly reduces accuracy. Pose optimization helps sharpen results and removes ghosting of duplicated objects, as can be seen on the pole in the first row.
Future outlook
Google's researchers say the new method still has some unresolved issues: for example, some vehicles and shadows are not removed correctly, and vegetation becomes blurred in the virtual environment because its appearance changes with the seasons. Likewise, temporal inconsistencies in the training data, such as construction work, cannot be handled automatically, and the affected areas need to be retrained manually.
In addition, the current inability to render scenes containing dynamic objects limits Block-NeRF's applicability to closed-loop robotic simulation tasks. In the future, these problems may be addressed by learning to account for transient objects during optimization, or by modeling dynamic objects directly.