
CVPR'24 Open Source | University of Bonn proposes a new solution for 3D LiDAR mapping in a dynamic environment!

Author: Xingguang Zhong | Editor: 3D Vision Workshop (计算机视觉工坊)

Add the assistant on WeChat: dddvision, with the note: direction + school/company + nickname, and you will be added to the group. The industry subdivision groups are listed at the end of the article.

0. Writer's Personal Understanding

Mapping with LiDAR or RGB-D cameras is an essential task in computer vision and robotics. We usually want accurate maps to support downstream tasks such as localization, planning, or navigation. To reconstruct an outdoor environment accurately, we must account for the dynamics caused by moving objects such as vehicles or pedestrians. Beyond that, dynamic object removal plays an important role in autonomous driving and robotics, for example when creating digital twins for realistic simulation or building high-definition maps in which a static map is combined with semantic and task-related information.

Mapping and state estimation in dynamic environments are classic problems in robotics. Simultaneous localization and mapping (SLAM) methods employ different strategies to deal with dynamics. Common approaches are: (1) filtering dynamics out of the input as a preprocessing step, which requires a semantic interpretation of the scene; (2) modeling occupancy in the map representation and retrospectively removing measurements that fall into free space, which removes dynamics implicitly; and (3) including the dynamics in the state estimation and modeling that measurements originate from both the dynamic and the static parts of the environment. The approach proposed here falls into the last category: it models the dynamics directly in the map, resulting in a spatiotemporal map representation.

Recently, implicit neural representations have attracted increasing attention in computer vision for novel view synthesis and 3D shape reconstruction. Thanks to their compactness and continuity, several approaches have explored neural representations for large-scale 3D LiDAR mapping, producing accurate maps while significantly reducing memory consumption. However, these methods usually do not address dynamics during mapping. Recent advances in dynamic NeRFs and neural reconstruction of deformable objects show that neural representations can also model dynamic scenes, which inspired the authors to tackle mapping in dynamic environments from the perspective of 4D reconstruction.

This paper proposes a novel approach to reconstruct large dynamic scenes in 4D by encoding a time-dependent truncated signed distance function (TSDF) at every point into an implicit neural scene representation. The input is a sequence of LiDAR point clouds recorded in a dynamic environment; from the learned representation, a TSDF can be queried for each time frame and a mesh extracted with Marching Cubes. The background TSDF, which does not change over the sequence, can be extracted from the 4D signal and treated as a static map, which in turn can be used to segment dynamic objects out of the original point clouds. In contrast to traditional voxel-based mapping methods, the continuous neural representation allows removing dynamic objects while retaining rich map detail.
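To make this pipeline concrete, the sketch below shows (purely as an illustration of the idea, not the authors' code) how a mesh for one time frame could be extracted from such a time-dependent TSDF field: we assume a hypothetical query function query_tsdf(points, t) that returns signed distances, sample it on a regular grid, and run Marching Cubes from scikit-image.

```python
# Minimal sketch (assumed interface, not the released implementation):
# query_tsdf(points, t) -> signed distances for an (N, 3) array of points at time t.
import numpy as np
from skimage import measure

def extract_mesh_at_time(query_tsdf, t, bounds, res=0.1, trunc=0.3):
    """Sample the 4D TSDF on a regular grid at time t and mesh it with Marching Cubes."""
    (x0, y0, z0), (x1, y1, z1) = bounds
    xs, ys, zs = (np.arange(a, b, res) for a, b in ((x0, x1), (y0, y1), (z0, z1)))
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)   # (X, Y, Z, 3)
    sdf = query_tsdf(grid.reshape(-1, 3), t).reshape(grid.shape[:3])
    sdf = np.clip(sdf, -trunc, trunc)                                  # respect truncation
    verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0, spacing=(res, res, res))
    return verts + np.array([x0, y0, z0]), faces                       # back to world coordinates
```

Running this once per timestamp yields the per-frame meshes; querying the same grid at all timestamps and keeping only locations whose TSDF never changes is, in spirit, how the static background can be separated out.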

2. Introduction

Building accurate maps is a key building block for reliable localization, planning, and navigation of autonomous vehicles. We propose a novel approach to build accurate maps of dynamic environments from a sequence of LiDAR scans. To this end, we encode the 4D scene into a novel spatiotemporal implicit neural map representation by fitting a time-dependent truncated signed distance function at every point. Using this representation, we extract the static map by filtering out the dynamic parts. The representation is based on sparse feature grids, a globally shared decoder, and time-dependent basis functions, which we jointly optimize in an unsupervised manner. To learn this representation from a sequence of LiDAR scans, we design a simple yet efficient loss function that supervises the map optimization in a piecewise manner. We evaluate our method in various scenes containing moving objects, both in terms of the reconstruction quality of the static map and the segmentation of dynamic point clouds. Experimental results show that our method outperforms several state-of-the-art methods in removing the dynamic parts of the input point clouds while simultaneously reconstructing an accurate and complete 3D map.

3. Results Showcase

Given a sequence of point clouds, as shown in (a), we optimize our 4D neural representation so that it can be queried for a value at any location and any time. Based on the estimated time-dependent TSDF values, we can extract a mesh at a specific point in time. In addition, the 4D neural representation can also be used for static mapping and dynamic object removal, shown in the remaining panels.


Reconstructed TSDF on the KITTI dataset: subplots (a) and (b) are adjacent input frames; (c) and (d) are the corresponding horizontal TSDF slices queried from our 4D map. Note that only TSDF values smaller than 0.3 m are shown.


4. Major Contributions

  • We propose a new implicit neural representation that takes sequential LiDAR scans as input to jointly reconstruct the dynamic 3D environment and maintain a static map.
  • We adopt a piecewise training-data sampling strategy and design a simple but effective loss function that keeps the supervision of static points consistent through gradient constraints (a rough sketch of such a loss follows this list).
  • We evaluate the mapping results in terms of the accuracy of dynamic object segmentation and the quality of the reconstructed static map, showing better performance than several baselines. We release the code and data of our experiments.
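As a rough illustration of what such a loss can look like (our own guess under simplifying assumptions, not the paper's exact formulation), one common recipe in neural TSDF mapping is to supervise samples along each LiDAR ray with the truncated projected distance to the measured endpoint and to add a gradient constraint on the field:

```python
# Hedged sketch of an SDF-style mapping loss; names and weighting are assumptions.
import torch
import torch.nn.functional as F

def tsdf_loss(pred_sdf, proj_sdf, pred_grad, trunc=0.3, lam=0.1):
    # pred_sdf : (N,) SDF predicted by the network at sampled points
    # proj_sdf : (N,) signed distance of each sample to the beam endpoint along the ray
    # pred_grad: (N, 3) spatial gradient of the predicted SDF at those samples
    data_term = F.l1_loss(pred_sdf.clamp(-trunc, trunc),
                          proj_sdf.clamp(-trunc, trunc))
    grad_term = ((pred_grad.norm(dim=-1) - 1.0) ** 2).mean()  # Eikonal-style constraint
    return data_term + lam * grad_term
```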

5. How Does It Work?

Our 4D TSDF representation: the image on the left shows a moving object and a query point p. The diagram on the right depicts the corresponding signed distance at p over time. At time t0, the signed distance of p is the positive truncation value. When the moving object reaches p at time t1, p lies inside the object and its signed distance is accordingly negative. At time t2, after the moving object has passed p, the signed distance of p becomes positive again.
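Written out in our own notation (consistent with the query pipeline described next, but not copied from the paper), the time-dependent signed distance at a point p is a weighted sum of temporal basis functions, with the weights decoded from the spatial feature of p:

```latex
% Assumed notation for illustration:
%   f_p   : spatial feature of p, trilinearly interpolated from the feature grid
%   D_mlp : globally shared decoder producing the K basis weights
%   phi_k : time-dependent basis functions
\[
  s(\mathbf{p}, t) = \sum_{k=1}^{K} w_k(\mathbf{p})\,\varphi_k(t),
  \qquad
  \bigl(w_1(\mathbf{p}),\dots,w_K(\mathbf{p})\bigr) = D_{\mathrm{mlp}}\bigl(\mathbf{f}_p\bigr).
\]
```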


Overview of querying TSDF values in our 4D map representation. For a query point p at times ti and ti+1, we first retrieve the features stored at the corners of the voxel of the feature grid Fl that contains p and obtain the fused feature fp by trilinear interpolation. We then feed fp into the decoder Dmlp and use its output as the weights of the different basis functions φ1(t), ..., φK(t). Finally, we compute the weighted sum of the basis function values at ti and ti+1 to obtain the respective SDF values. For simplicity, only one level of the hash feature grid is depicted.
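The query can be sketched in a few lines of PyTorch. This is a toy version under our own assumptions: a single dense feature grid stands in for the multi-level hash grids, and a learnable sinusoidal family stands in for the learned basis functions; none of the names come from the released code.

```python
# Toy sketch of one 4D TSDF query (assumptions: dense single-level feature grid,
# sinusoidal temporal basis); illustrates the pipeline, not the released code.
import torch
import torch.nn as nn

class TSDF4D(nn.Module):
    def __init__(self, grid_res=64, feat_dim=8, num_basis=4, hidden=64):
        super().__init__()
        # Features stored at voxel corners (one dense level instead of hash levels).
        self.features = nn.Parameter(0.01 * torch.randn(1, feat_dim, grid_res, grid_res, grid_res))
        # Shared decoder D_mlp: interpolated feature -> K basis weights.
        self.decoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, num_basis))
        # Stand-in for the time-dependent basis functions phi_1..phi_K.
        self.freq = nn.Parameter(torch.linspace(0.5, 4.0, num_basis))
        self.phase = nn.Parameter(torch.zeros(num_basis))

    def forward(self, p, t):
        # p: (N, 3) points scaled to [-1, 1]^3; t: (N,) timestamps scaled to [0, 1].
        grid_pts = p.view(1, -1, 1, 1, 3)
        f_p = nn.functional.grid_sample(self.features, grid_pts, align_corners=True)
        f_p = f_p.view(self.features.shape[1], -1).t()              # (N, feat_dim), trilinear interp
        w = self.decoder(f_p)                                       # (N, K) basis weights
        phi = torch.sin(self.freq * t.unsqueeze(-1) + self.phase)   # (N, K) basis values at t
        return (w * phi).sum(dim=-1)                                # weighted sum -> SDF at (p, t)
```

Evaluating the same points at two timestamps ti and ti+1 only changes the basis-function term, which is what makes it cheap to ask how the SDF at a location evolves over time.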


6. Experimental results

Quantitative results on the synthetic ToyCar3 dataset and the real-world Newer College dataset are shown in Tab. 2 and Tab. 3, respectively. In terms of accuracy, SHINE-Mapping and VDB-Fusion perform better on the noisy Newer College dataset because they filter out some of the high-frequency noise through multi-frame fusion. In contrast, our method treats every scan as accurate in order to store the 4D information, which makes it more sensitive to measurement noise. On ToyCar3, both our method and VDB-Fusion remove all moving objects. However, on Newer College, VDB-Fusion incorrectly removes static trees and part of the ground, leading to the poor completeness shown in Tab. 3. SHINE-Mapping removes the dynamic pedestrians on Newer College, but retains part of the dynamic point cloud on ToyCar3, which has a larger proportion of dynamic objects, resulting in poor accuracy in Tab. 2. NKSR is the least accurate because it cannot remove dynamic objects at all, which means that applying NKSR directly to dynamic real-world scenes is not suitable.


The quantitative results of dynamic object segmentation are shown in Tab. 4. Our method achieves the best associated accuracy (AA) on all three autonomous driving sequences (KITTI 00, KITTI 05, Argoverse 2), clearly ahead of the baselines. The supervised learning-based methods 4DMOS and MapMOS do not achieve good dynamic accuracy (DA) because of their limited generalization capability. ERASOR and Octomap tend to over-segment dynamic objects, resulting in poor static accuracy (SA). Removert and SHINE-Mapping are too conservative to detect all dynamic objects. Thanks to the continuity and large capacity of the 4D neural representation, our method achieves a better balance between preserving static background points and removing dynamic objects. It is also worth mentioning that our method does not rely on any pre- or post-processing such as ground fitting, outlier filtering, or clustering, and does not require training labels.
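For context, these benchmark metrics are usually understood as follows (our paraphrase of the common convention, not a quote from the paper): SA is the fraction of static points that are correctly preserved, DA is the fraction of dynamic points that are correctly removed, and AA combines the two as a geometric mean so that doing well on only one side is penalized:

```latex
\[
  \mathrm{AA} = \sqrt{\mathrm{SA} \times \mathrm{DA}}
\]
```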


7. Summary & Limitations

In this paper, we propose a 4D implicit neural map representation for dynamic scenes that represents the TSDF of both the static and the dynamic parts of a scene. To do so, we use a hierarchical voxel feature representation that is decoded into weights of basis functions, yielding a time-varying TSDF that can be queried at any location. To learn the representation from LiDAR scan sequences, we design an efficient data sampling strategy and loss function. Equipped with the proposed representation, we experimentally demonstrate that it can solve the challenging problems of static mapping and dynamic object segmentation. Specifically, our experiments show that the method accurately reconstructs the 3D map of the static part of the scene while completely removing moving objects.

Limitations. While our approach yields convincing results, we acknowledge that it currently relies on poses estimated by a separate SLAM method and cannot yet be applied online. However, we believe this is a promising direction for future work on joint incremental mapping and pose estimation.

8. References

[1] Zhong et al., "3D LiDAR Mapping in Dynamic Environments Using a 4D Implicit Neural Representation", CVPR 2024.

Computer Vision Workshop Exchange Group

We have currently established multiple communities covering 3D vision and related directions, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D computer vision: image classification/segmentation, object detection, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, action recognition, model quantization/pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc.

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, camera arrays, photometric stereo, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane detection, BEV perception, Occupancy, object tracking, end-to-end autonomous driving, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, COLMAP, texture mapping, etc.

UAVs: quadrotor modeling, UAV flight control, etc.

In addition, there are exchange groups for job hunting, hardware selection, vision product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with the note: research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Learning Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, hands-on point cloud processing, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGO-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, COLMAP, line and surface structured light, hardware structured-light scanners, UAVs, etc.