
NVIDIA's new GSNeRF: how does it solve novel view synthesis for unseen scenes?

Author: 3D Vision Workshop

Source: 3D Vision Workshop


Paper title: GSNeRF: Generalizable Semantic Neural Radiance Fields

Authors: Zi-Ting Chou, Sheng-Yu Huang, et al.

Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; NVIDIA, Taiwan

Paper link: https://arxiv.org/pdf/2403.03608.pdf

This paper introduces GSNeRF, a generalizable semantic neural radiance field that incorporates image semantics into the synthesis process and can generate novel-view images together with the corresponding semantic maps for unseen scenes. GSNeRF consists of two stages: semantic geo-reasoning and depth-guided visual rendering. The former extracts semantic and geometric features from the multi-view input images, while the latter renders the image and the semantic map under the guidance of the inferred scene geometry, which improves performance. Experiments confirm the superiority of GSNeRF in both novel view synthesis and semantic segmentation, and verify the effectiveness of its sampling strategy for visual rendering.

Reader's takeaway:

The GSNeRF method proposed in this paper is both innovative and practical for generalizable novel view synthesis and semantic segmentation. By combining visual feature extraction with depth map prediction, GSNeRF can generalize to unseen scenes without retraining, which matters in practical applications. Experimental results show that GSNeRF performs well on both real-world and synthetic datasets and outperforms existing methods, indicating that it is an effective approach for applications that require novel view synthesis together with semantic segmentation.


This paper introduces GSNeRF, a generalizable semantic neural radiance field that tackles generalizable novel view synthesis and semantic segmentation at the same time. By learning the visual features, depth information, and semantic information of a scene, GSNeRF can render novel-view images of unseen scenes and generate the corresponding semantic segmentation masks. The method consists of two key learning stages: semantic geo-reasoning and depth-guided visual rendering. The former derives the visual features of the scene and aggregates the depth information of the source views to estimate the depth of the novel view, while the latter renders the RGB image and semantic segmentation map of the target view. Experiments on real-world and synthetic datasets show that GSNeRF outperforms current generalizable NeRF methods in both novel view synthesis and semantic segmentation.

Contributions of this paper:

  • GSNeRF is proposed to jointly render novel-view images of unseen scenes and generate the corresponding semantic segmentation masks.
  • The proposed semantic geo-reasoning stage learns the color, geometry, and semantic information of the input scene, which gives GSNeRF its generalization ability.
  • Based on the inferred geometry, the depth-guided visual rendering stage tailors two different sampling strategies to the predicted target-view depth map, so that both the image and the semantic map can be rendered.

The basic principles of generalizable NeRF are briefly reviewed. By learning the visual features, depth information, and semantic information of a scene, such a model can render novel-view images of an unknown scene and generate the corresponding semantic segmentation masks. GSNeRF consists of two key learning stages: semantic geo-reasoning and depth-guided visual rendering. The semantic geo-reasoning stage learns the color, geometry, and semantic information of the input scene, which provides the generalization ability of GSNeRF. The depth-guided visual rendering stage tailors two different sampling strategies to the inferred geometry so that the image and the semantic map can be rendered at the same time. The optimization goal of a generalizable NeRF is to train the model with a rendering loss that minimizes the difference between the rendered image and the ground-truth image.
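
For reference, the rendering loss mentioned above follows the standard NeRF volume rendering formulation. The equations below give the generic form (with per-sample densities \sigma_i, colors c_i, sample spacings \delta_i, and the set of target-view rays \mathcal{R}); they are not copied from the paper:

\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i,
\qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)

\mathcal{L}_{\mathrm{rgb}} = \sum_{r \in \mathcal{R}} \left\lVert \hat{C}(r) - C(r) \right\rVert_2^2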


3.1 Problem Description and Model Overview

This section describes the methodology of the paper. First, the problem setting and notation are defined, and the goals of novel view synthesis and semantic segmentation are stated for a given scene and camera pose. A Generalizable Semantic Neural Radiance Field (GSNeRF) is proposed to achieve this goal, consisting of two key learning stages: semantic geo-reasoning and depth-guided visual rendering. In the semantic geo-reasoning stage, a semantic geo-reasoner extracts 2D features, semantic features, and 3D volume features from each input source image and predicts depth. In the depth-guided visual rendering stage, a dedicated sampling strategy is applied based on the predicted depth map of the target view, and the sampled points and features are then fed to the volume renderer and the semantic renderer to synthesize the target-view image and semantic segmentation map.
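
A minimal sketch of this two-stage forward pass is given below. All component names (geo_reasoner, sampler, volume_renderer, semantic_renderer) are hypothetical placeholders for the modules described above, not identifiers from the authors' code:

import torch
from typing import Callable, Tuple

def gsnerf_forward(
    source_images: torch.Tensor,   # (K, 3, H, W) source views
    source_poses: torch.Tensor,    # (K, 4, 4) source camera poses
    target_pose: torch.Tensor,     # (4, 4) target (novel) view pose
    geo_reasoner: Callable,        # stage 1: semantic geo-reasoning module
    sampler: Callable,             # depth-guided point sampling
    volume_renderer: Callable,     # RGB renderer (R_theta in the text)
    semantic_renderer: Callable,   # semantic renderer (P_theta in the text)
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Hypothetical sketch of the two-stage GSNeRF forward pass."""
    # Stage 1: extract 2D image/semantic features, a 3D feature volume,
    # and a depth map predicted for the target view from the source images.
    feats_2d, feats_sem, volume_3d, target_depth = geo_reasoner(
        source_images, source_poses, target_pose)

    # Stage 2: sample points along each target ray near the predicted depth,
    # then decode color and semantics from the aggregated features.
    points, ray_feats = sampler(target_pose, target_depth, feats_2d, volume_3d)
    rgb = volume_renderer(points, ray_feats)
    semantics = semantic_renderer(points, ray_feats, feats_sem)
    return rgb, semantics, target_depth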

3.2 Generalizable Semantic NeRF

Generalizable semantic NeRF has two key learning stages: semantic geo-reasoning and depth-guided visual rendering. In the semantic geo-reasoning stage, the model Gθ extracts geometric cues and semantic information from K multi-view source images, including 3D volume features, depth maps, 2D image features, and semantic features, and learns to predict the depth map of the target view. In the depth-guided visual rendering stage, the traditional volumetric sampling strategy is replaced with a depth-guided one that concentrates sample points near the predicted depth values, improving sampling efficiency. Finally, with the predicted depth map guiding the sampling, the volume renderer Rθ and the semantic renderer Pθ predict the image and the semantic segmentation of the target view, respectively. After training, the whole model generalizes directly to unseen scenes without fine-tuning.
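
The depth-guided sampling itself can be illustrated in a few lines of code. This is a simple Gaussian-around-depth scheme written for illustration (the function name and the choice of distribution are assumptions; the paper's exact strategy may differ):

import torch

def depth_guided_samples(depth: torch.Tensor, n_samples: int = 16,
                         sigma: float = 0.05) -> torch.Tensor:
    """Sample depths along each ray, concentrated near a predicted depth.

    depth: (num_rays,) predicted target-view depth at each ray.
    Returns (num_rays, n_samples) sample depths, sorted along each ray.
    """
    noise = torch.randn(depth.shape[0], n_samples, device=depth.device)
    z = depth.unsqueeze(-1) + sigma * noise   # cluster samples near the depth
    z = z.clamp(min=1e-3)                     # keep samples in front of the camera
    return torch.sort(z, dim=-1).values

# Usage: 1024 rays whose predicted depth is about 2 m.
z_vals = depth_guided_samples(torch.full((1024,), 2.0))
print(z_vals.shape)  # torch.Size([1024, 16])

Compared with uniform sampling over the full near-far range, concentrating samples around the predicted surface lets far fewer points per ray cover the region that actually contributes to the rendered color.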

3.3 Training and Inference

This section introduces the training and inference procedure of GSNeRF. In the training phase, several loss functions are used to optimize the model, including an image rendering loss, a depth prediction loss, and a semantic segmentation loss. If ground-truth depth is available, it is used to supervise the depth prediction; otherwise, a self-supervised depth loss is used to optimize the depth estimate. At inference time, the model can generate novel-view images and semantic segmentation maps for unseen scenes without retraining, because it constructs the semantic neural radiance field on the fly from the features of the input scene, enabling generalizable inference on new scenes.
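
The overall training objective described above can be sketched as a weighted sum of these terms. The weights, the self-supervised depth target, and the function signature below are illustrative assumptions rather than the paper's exact formulation:

import torch
import torch.nn.functional as F

def gsnerf_losses(pred_rgb, gt_rgb, pred_sem_logits, gt_sem,
                  pred_depth, gt_depth=None, self_sup_depth=None,
                  w_sem: float = 1.0, w_depth: float = 0.1):
    """Illustrative combination of the rendering, semantic, and depth losses."""
    loss_rgb = F.mse_loss(pred_rgb, gt_rgb)               # image rendering loss
    loss_sem = F.cross_entropy(pred_sem_logits, gt_sem)   # semantic segmentation loss

    if gt_depth is not None:
        # Ground-truth depth available: supervise the depth prediction directly.
        loss_depth = F.l1_loss(pred_depth, gt_depth)
    elif self_sup_depth is not None:
        # Otherwise fall back to a self-supervised depth target, e.g. depths
        # reprojected from the source views (a stand-in for the paper's term).
        loss_depth = F.l1_loss(pred_depth, self_sup_depth)
    else:
        loss_depth = pred_depth.new_zeros(())

    return loss_rgb + w_sem * loss_sem + w_depth * loss_depth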


In the experimental part, the authors evaluate the proposed method on both real-world and synthetic datasets. For real-world data, they use the ScanNet dataset, a large-scale indoor RGB-D video dataset containing over 2.5 million views across 1,513 scenes, with semantic annotations and camera poses. They train the model on 60 scenes and test its generalization ability on 10 new, never-before-seen scenes. For synthetic data, they use the Replica dataset, an indoor dataset built from 3D reconstructions that contains 18 high-quality scenes with dense geometry, HDR textures, and semantic labels. They train on 12 video sequences from 6 different scenes and test on 4 video sequences from 2 new scenes.

In the Results and Analysis section, the authors first compare their method with several baselines, including S-Ray, MVSNeRF, GeoNeRF, GNT, and NeuRay, using metrics such as PSNR and SSIM. The experimental results show that their method performs well on unseen scenes and still outperforms the other baselines even without ground-truth depth, verifying its effectiveness and practicality. In addition, the authors provide qualitative comparisons showing the advantage of their method over S-Ray: it better captures geometric detail and the realism of the scene.
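
For context, PSNR is computed from the mean squared error between the rendered and ground-truth images; the definition below is the standard one for images scaled to [0, 1] and is independent of this paper:

import torch

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for images with values in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return -10.0 * torch.log10(mse)

# Usage with two random 256x256 RGB images.
a, b = torch.rand(3, 256, 256), torch.rand(3, 256, 256)
print(psnr(a, b))  # around 7-8 dB for unrelated random images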

The authors also conduct an ablation study to analyze the effectiveness of each designed module. Through experiments on the ScanNet dataset, they validate the contribution of each part of the model and demonstrate the effectiveness of the depth-guided sampling strategy. Finally, the authors discuss the sampling efficiency of their method, pointing out that depth-guided sampling makes the model insensitive to the number of sample points per ray, so good visual quality is maintained even when the number of samples is reduced.


In this paper, the authors propose a Generalizable Semantic Neural Radiance Field (GSNeRF) for generalizable novel view synthesis and semantic segmentation. GSNeRF is trained to extract visual features from each source view and to predict depth maps, so that the depth map of a new target view can be estimated. Given this target-view depth information, the associated RGB image and semantic segmentation are co-generated by depth-guided rendering. The experiments confirm, both quantitatively and qualitatively, that the proposed GSNeRF outperforms existing generalizable semantic-aware NeRF methods on real-world and synthetic datasets.

