
New from Tsinghua! Unified open-vocabulary 3D object detection via cycle-modality propagation

Author: 3D Vision Workshop

Editor: 3DCV



Title: OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Authors: Zhenyu Wang et al.

Affiliation: Tsinghua University and other institutions

Link: https://arxiv.org/pdf/2403.19580.pdf

1. Introduction

In this paper, we propose OV-Uni3DETR, a multimodal detector for universal open-vocabulary 3D object detection. Specifically, we propose the concept of cycle-modality propagation, which transfers knowledge between the 2D and 3D modalities. Semantic knowledge from 2D detectors trained on large vocabularies guides the discovery of novel classes in the 3D domain, while 3D geometric knowledge provides localization supervision for 2D detection images. Experiments show that OV-Uni3DETR achieves state-of-the-art performance on various 3D detection tasks, improving average performance by more than 6% over existing methods. Using only RGB images, its performance is competitive with previous point-cloud-based methods. The code and pretrained models will be released at a later date.


Figure 1: Overview of OV-Uni3DETR

2. Innovation

Compared with existing 3D detectors, OV-Uni3DETR has the following features:

  • Open-Vocabulary 3D Detection: During training, it leverages a variety of available datasets, especially abundant 2D detection images, to enhance training diversity. During inference, it can detect both seen and unseen categories.
  • Modality Unification: It seamlessly adapts to input from any given modality, effectively handling differing modality combinations or missing sensor information, and thus supports modality switching at test time.
  • Scene Unification: It provides a unified multimodal model architecture for the diverse scenes collected by different sensors.

3. Method

OV-Uni3DETR is a unified open-vocabulary 3D object detector with the following key features:


Figure 2: Overview of OV-Uni3DETR. Features are extracted from point clouds and images and, once converted into the same voxel space, added to form the multimodal feature. A 3D detection transformer then performs class and box prediction. Semantic knowledge is transferred from 2D to 3D to discover novel classes; for 2D detection images, camera extrinsic parameters are predicted and geometric knowledge is propagated from 3D to 2D via a class-agnostic (CA) 3D detector.
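
To make the fusion step in Figure 2 concrete, below is a minimal PyTorch sketch of adding point-cloud and image features in a shared voxel space. The module name, channel widths, and 1x1x1 projection convolutions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Adds point-cloud and image features in a shared voxel grid (sketch)."""

    def __init__(self, c_pts: int, c_img: int, c_out: int):
        super().__init__()
        # project each modality to a common channel width before adding
        self.pts_proj = nn.Conv3d(c_pts, c_out, kernel_size=1)
        self.img_proj = nn.Conv3d(c_img, c_out, kernel_size=1)

    def forward(self, pts_voxels=None, img_voxels=None):
        # pts_voxels: (B, c_pts, X, Y, Z) voxelized point-cloud features
        # img_voxels: (B, c_img, X, Y, Z) image features lifted into the
        # same voxel grid (e.g., via camera geometry)
        feats = []
        if pts_voxels is not None:
            feats.append(self.pts_proj(pts_voxels))
        if img_voxels is not None:
            feats.append(self.img_proj(img_voxels))
        assert feats, "at least one modality must be provided"
        # elementwise addition merges the modalities; the fused volume then
        # feeds the 3D detection transformer for class and box prediction
        return torch.stack(feats, dim=0).sum(dim=0)
```

Because either branch may be absent, the same fused representation supports point-only, image-only, and multimodal inputs, which is what the modality switching described below relies on.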

Multimodal learning: During training, OV-Uni3DETR uses point clouds, 3D detection images with 3D box annotations, and 2D detection images with only 2D box annotations to enhance training diversity. Introducing 2D detection images brings substantial benefits, especially for open-vocabulary 3D detection.
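
To make the three annotation regimes concrete, here is a schematic sample layout; the field names and box formats are assumptions for illustration, not the paper's data specification.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class TrainingSample:
    points: Optional[torch.Tensor]   # (N, 3+) point cloud, if available
    image: Optional[torch.Tensor]    # (3, H, W) RGB image, if available
    boxes3d: Optional[torch.Tensor]  # (M, 7) 3D boxes; None for 2D-only data
    boxes2d: Optional[torch.Tensor]  # (M, 4) 2D boxes on the image plane

# point-cloud data:     points + boxes3d
# 3D detection images:  image  + boxes3d
# 2D detection images:  image  + boxes2d only (no 3D supervision)
```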

Modality switching during testing: After multimodal training, OV-Uni3DETR can accept input from any single modality, enabling modality switching at test time.
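
Reusing the MultimodalFusion sketch from above, hypothetical usage under the three test-time input regimes could look like this (all shapes are placeholders):

```python
import torch

fusion = MultimodalFusion(c_pts=64, c_img=256, c_out=128)
pts = torch.randn(1, 64, 40, 40, 16)   # voxelized point-cloud features
img = torch.randn(1, 256, 40, 40, 16)  # image features lifted to voxels

out_both = fusion(pts_voxels=pts, img_voxels=img)  # multimodal input
out_pts = fusion(pts_voxels=pts)                   # point cloud only
out_img = fusion(img_voxels=img)                   # RGB images only
assert out_both.shape == out_pts.shape == out_img.shape  # one head serves all
```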

Unified multimodal architecture: OV-Uni3DETR provides a unified multimodal model architecture for both indoor and outdoor scenes.

Cycle-modality propagation: OV-Uni3DETR introduces cycle-modality propagation to spread knowledge between the 2D and 3D modalities. Specifically, 2D semantic knowledge guides the discovery of novel classes in the 3D domain, while 3D geometric knowledge provides localization supervision for 2D detection images.
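
To illustrate the 2D-to-3D direction, the sketch below projects class-agnostic 3D boxes into the image and borrows the label of the best-overlapping open-vocabulary 2D detection. The pinhole projection, xyxy box format, IoU threshold, and greedy matching rule are simplifying assumptions; the paper's actual procedure may differ.

```python
import torch

def project_points(pts3d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = pts3d @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def iou_one_to_many(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """IoU between one xyxy box (4,) and a set of xyxy boxes (M, 4)."""
    lt = torch.maximum(a[:2], b[:, :2])
    rb = torch.minimum(a[2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    union = (a[2:] - a[:2]).prod() + (b[:, 2:] - b[:, :2]).prod(dim=1) - inter
    return inter / union.clamp(min=1e-6)

def transfer_semantics(boxes2d, labels2d, corners3d, K, iou_thr=0.5):
    """2D -> 3D: label class-agnostic 3D boxes (given as (B, 8, 3) corner
    sets) with the open-vocabulary 2D detection they best overlap.
    Assumes boxes2d is non-empty."""
    labels3d = []
    for corners in corners3d:
        uv = project_points(corners, K)                        # (8, 2)
        box = torch.cat([uv.min(0).values, uv.max(0).values])  # xyxy hull
        ious = iou_one_to_many(box, boxes2d)
        best = int(ious.argmax())
        labels3d.append(labels2d[best] if ious[best] > iou_thr else -1)
    return labels3d  # -1 marks boxes left unlabeled
```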

Advantages: OV-Uni3DETR performs well on open-vocabulary 3D detection, improving significantly over previous methods, and also delivers strong results on closed-vocabulary 3D detection.

In summary, OV-Uni3DETR is a unified open-vocabulary 3D object detector that achieves modality unification, scene unification, and open-vocabulary learning through multimodal learning and cycle-modality propagation, marking an important step toward universal 3D object detection.

4. Experiments

This section reports the following experiments:

Open Vocabulary 3D Object Detection:

  • With point cloud input, OV-Uni3DETR detects novel classes, improving AP by more than 6% over previous methods.
  • Evaluated on the outdoor KITTI and nuScenes datasets, OV-Uni3DETR also detects novel classes in outdoor scenes.
  • Visualizations demonstrate OV-Uni3DETR's ability to detect novel-class objects in both indoor and outdoor scenes.

Closed Vocabulary 3D Object Detection:

  • Evaluated on the indoor SUN RGB-D dataset, OV-Uni3DETR outperforms previous single-modality methods.
  • Evaluated on the outdoor KITTI dataset, OV-Uni3DETR outperforms other monocular 3D detection methods.

Ablation Study:

  • The contributions of cycle-modality propagation and multimodal learning are analyzed, verifying that both designs improve model performance.

More Quantitative Results:

  • Evaluated in the ScanNet multi-view setting, OV-Uni3DETR outperforms other methods.
  • The effect of different 2D detection image datasets on model performance is analyzed; the richer a dataset's category vocabulary, the larger the gain for open-vocabulary 3D detection.

More Visualizations:

  • Additional visualizations on the SUN RGB-D, ScanNet, and KITTI datasets demonstrate OV-Uni3DETR's ability to detect novel-class objects.

Overall, extensive experiments across datasets and settings verify the superior performance of OV-Uni3DETR.

5. Summary

In this paper, we introduce OV-Uni3DETR, a unified open-vocabulary 3D object detector. It detects 3D objects of unseen classes through multimodal learning and cycle-modality propagation. Specifically, it trains on point clouds, 3D detection images, and 2D detection images. 2D bounding boxes generated by a 2D open-vocabulary detector are projected into 3D space to propagate semantic knowledge, while class-agnostic 3D bounding boxes are generated to propagate geometric knowledge. Experiments show that OV-Uni3DETR effectively detects unseen-class 3D objects from different modality inputs in both indoor and outdoor scenes, achieving state-of-the-art results on open-vocabulary 3D detection.
