
Peking University | Combining CLIP semantics with 3DGS for real-time, accurate semantic understanding of 3D scenes

Source: 3D Vision Workshop

Author: Guibiao Liao | Editor: 3DCV



1. Introduction

This article introduces CLIP-GS, a method that combines the semantic information of the CLIP model with 3D Gaussian Splatting (3DGS) to achieve real-time, accurate semantic understanding of 3D scenes. Its key innovations build on the efficient rendering of Gaussian Splatting and introduce a Semantic Attribute Compactness (SAC) scheme and a 3D Consistent Self-Training (3DCS) strategy. Experiments show that CLIP-GS achieves state-of-the-art performance on multiple datasets, particularly in real-time rendering speed and segmentation accuracy. In short, the paper proposes an efficient and accurate 3D semantic understanding method that, by integrating semantic information with efficient rendering, marks a notable step forward for 3D scene understanding.

2. Paper information

Title: CLIP-GS: CLIP-Informed Gaussian Splatting for Real-Time and View-Consistent 3D Semantic Understanding

Authors: Guibiao Liao et al.

Affiliations: Peking University, among others

Paper: https://arxiv.org/pdf/2404.14249

3. Main contributions

The main contributions of the CLIP-GS approach include the following:

  1. Semantic Attribute Compactness (SAC): A method that exploits the intrinsically unified semantics of each object, learning one representative semantic feature per object and thereby eliminating redundant, near-identical features to enable efficient rendering.
  2. 3D Consistent Self-Training (3DCS): A strategy that uses pseudo-labels generated by the partially trained 3D Gaussian model to impose cross-view semantic consistency constraints, strengthening the view-consistent semantic learning of the Gaussians.
  3. Experimental validation: On multiple datasets, the proposed method outperforms other CLIP-based 3D semantic segmentation methods in both segmentation accuracy and rendering efficiency, and it remains robust under sparse input data.

4. CLIP-GS

[Figure: Overview of the CLIP-GS optimization pipeline]

The optimization process of CLIP-GS proceeds as shown in the figure. First, to represent the 3D scene, the method follows 3DGS and adds an extra attribute to each 3D Gaussian: a semantic embedding. These Gaussian attributes are then rendered onto the 2D plane with a differentiable rasterizer for optimization. Second, the optimization is divided into two phases. In the first phase, the Semantic Attribute Compactness (SAC) method learns a compact semantic representation for the 3D Gaussians, enabling efficient rendering. In the second phase, after CLIP-GS has been trained for several rounds, the 3D Consistent Self-Training (3DCS) method is introduced. 3DCS leverages cross-view semantics self-predicted by CLIP-GS, refined with consistency regularization, to provide the Gaussians with stronger view-consistent supervision. Note that, for simplicity, the adaptive density control and color optimization processes, which are identical to 3DGS, are omitted from the description.
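To make the rendering step concrete, below is a minimal sketch of how a per-Gaussian semantic embedding can be alpha-composited for one pixel in the same front-to-back way that 3DGS composites color. The names, shapes, and the Python loop itself are illustrative stand-ins for the paper's CUDA rasterizer:

```python
import torch

def composite_semantics(alphas: torch.Tensor, embeddings: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of semantic embeddings for one pixel.

    alphas:     (N,) opacity of each depth-sorted Gaussian covering the pixel
    embeddings: (N, D) low-dimensional semantic embedding of each Gaussian
    returns:    (D,) rendered semantic embedding for the pixel
    """
    # T_i = prod_{j<i} (1 - alpha_j): light surviving past the first i-1 Gaussians
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0
    )
    weights = alphas * transmittance            # w_i = alpha_i * T_i
    return (weights.unsqueeze(-1) * embeddings).sum(dim=0)

# Example: 5 depth-sorted Gaussians along a ray, 8-dimensional embeddings.
pixel_semantics = composite_semantics(torch.rand(5), torch.randn(5, 8))
print(pixel_semantics.shape)  # torch.Size([8])
```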

4.1. Semantic Attribute Compactness (SAC)

The idea behind SAC is to exploit the intrinsically unified semantics of the same object for an efficient representation. Specifically, region masks are obtained with the Segment Anything Model (SAM), and a weighted average of the semantic features within each region yields one unified feature representing that region. Each unified feature is then referenced by a semantic index, producing a semantic index map. In this way, the CLIP semantic features of a training view can be compactly represented as a set of unified features plus a low-dimensional semantic index map. During optimization, a low-dimensional learnable semantic embedding is attached to each 3D Gaussian; the semantic index is learned through alpha-blended rendering and used to retrieve the CLIP features. To further speed up learning, the retrieval is precomputed offline before training. By embedding compact semantic information into the 3D Gaussians, SAC achieves efficient rendering while maintaining high-quality visual results, making it central to efficient scene-semantics representation and accurate semantic segmentation.
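Below is a minimal sketch of this mask-then-pool preprocessing, assuming dense per-pixel CLIP features and SAM masks are already available. A plain mean stands in for the paper's weighted average, and all names are illustrative:

```python
import torch

def build_index_map(clip_feats: torch.Tensor, masks: list[torch.Tensor]):
    """clip_feats: (H, W, D) dense CLIP features for one training view
    masks:      list of (H, W) boolean SAM region masks
    returns:    (R, D) unified per-region features and an (H, W) index map
    """
    H, W, _ = clip_feats.shape
    unified = []
    index_map = torch.full((H, W), -1, dtype=torch.long)  # -1 = unassigned
    for r, mask in enumerate(masks):
        region = clip_feats[mask]            # (n_pixels, D) features in the region
        feat = region.mean(dim=0)            # simple mean; the paper uses a weighted average
        unified.append(feat / feat.norm())   # renormalize onto CLIP's unit sphere
        index_map[mask] = r                  # each pixel now stores a small index
    return torch.stack(unified), index_map

# Example: a 4x4 view with 16-dim CLIP features and two rectangular masks.
feats = torch.randn(4, 4, 16)
m1 = torch.zeros(4, 4, dtype=torch.bool); m1[:2] = True
m2 = ~m1
unified, idx = build_index_map(feats, [m1, m2])
print(unified.shape, idx.unique())  # torch.Size([2, 16]) tensor([0, 1])
```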

4.2. 3D Consistent Self-Training (3DCS)

The key idea of 3DCS is to enhance semantic consistency by exploiting the cross-view consistency inherent in a 3D model. Specifically, after the 3D Gaussians have been trained for a period of time, the trained model is used to render semantic maps for the training views. The region masks generated by SAM are then used to integrate semantic information from adjacent views into the semantic map of the current view, eliminating semantic ambiguities of the same object across different views. To realize this consensus regularization, a majority-voting mechanism unifies the semantics of the current view with the semantic information of its adjacent views. Through self-training on these consistent outputs of the 3D model, the 3D Gaussians receive view-consistent semantic supervision across views, which effectively improves the accuracy and consistency of 3D semantic segmentation.
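Here is a minimal sketch of that majority-voting consensus, assuming the semantic label maps rendered from neighboring views have already been aligned to the current view (the alignment step and all names here are illustrative assumptions):

```python
import torch

def fuse_pseudo_labels(labels: torch.Tensor, masks: list[torch.Tensor]) -> torch.Tensor:
    """labels: (V, H, W) semantic class ids rendered from V views and aligned
               to the current view (index 0 = the current view itself)
    masks:  list of (H, W) boolean SAM region masks for the current view
    returns: (H, W) fused pseudo-label map used as self-training supervision
    """
    fused = labels[0].clone()
    for mask in masks:
        votes = labels[:, mask].reshape(-1)   # every vote cast inside the region
        winner = torch.mode(votes).values     # majority class across all views
        fused[mask] = winner                  # one consistent label per object
    return fused

# Example: 3 views over a 4x4 image with one full-image mask.
labels = torch.randint(0, 5, (3, 4, 4))
fused = fuse_pseudo_labels(labels, [torch.ones(4, 4, dtype=torch.bool)])
print(fused.shape)  # torch.Size([4, 4])
```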

4.3. End-to-end training process

The entire model training process consists of two phases:

  1. Phase I: The Semantic Attribute Compactness (SAC) method optimizes the semantic embedding parameters of the 3D Gaussians with a 2D semantic loss (L2Ds) computed on the training views. The main goal of this phase is to learn compact, efficient semantic representations.
  2. Phase II: After the 3D Gaussians have been trained for T iterations, training moves to the second phase, where the 3D Consistent Self-Training (3DCS) method replaces L2Ds with a 3D self-training loss (L3Ds) to enhance semantic consistency. 3DCS uses cross-view semantic consistency constraints to strengthen the supervision signal and further improve segmentation accuracy. In addition, to improve rendering efficiency while maintaining high-quality scene representation, a Progressive Density Regulation (PDR) strategy is introduced: it gradually increases the image resolution and density-control frequency, effectively reducing the number of Gaussians while preserving rendering quality. A schedule sketch follows this list.
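Below is a minimal, runnable sketch of the two-phase switch, with the heavy components (rasterization, SAM, CLIP) abstracted into precomputed tensors. The losses here are placeholders that only demonstrate the schedule, and T and all sizes are arbitrary illustrative values:

```python
import torch
import torch.nn.functional as F

NUM_GAUSSIANS, DIM = 1_000, 8
T, TOTAL_ITERS = 300, 600                        # 3DCS takes over at step T

# Learnable per-Gaussian semantic embeddings (stand-in for the full 3DGS scene).
embeddings = torch.randn(NUM_GAUSSIANS, DIM, requires_grad=True)
optimizer = torch.optim.Adam([embeddings], lr=1e-2)

clip_targets = torch.randn(NUM_GAUSSIANS, DIM)   # Phase I: SAC targets (L2Ds)
fused_targets = torch.randn(NUM_GAUSSIANS, DIM)  # Phase II: 3DCS targets (L3Ds)

for step in range(TOTAL_ITERS):
    target = clip_targets if step < T else fused_targets
    loss = F.mse_loss(embeddings, target)        # placeholder for L2Ds / L3Ds
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # PDR (not shown): image resolution and the density-control frequency would
    # be increased progressively over these iterations to limit Gaussian count.
```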

5. Experiments

Quantitative comparison: The method outperforms competing methods in both rendering quality and segmentation accuracy. On the Replica and ScanNet datasets in particular, it improves mIoU over the second-best method by 17.29% and 20.81%, respectively.

[Figures: quantitative comparison results on Replica and ScanNet]

Qualitative comparison: The method produces more continuous and consistent semantic segmentation results across different views. Compared with other methods, it delivers better visual rendering quality and maintains robust reconstruction and segmentation performance under sparse input data.

Ablation studies: Ablations show that the SAC, 3DCS, and PDR strategies all contribute significantly to the final performance. Specifically, SAC improves inference efficiency and segmentation accuracy, 3DCS introduces the crucial cross-view consistent semantic constraints that raise semantic quality, and PDR improves efficiency by reducing the number of Gaussians.

[Figures: qualitative comparisons and ablation study results]

6. Conclusion

The authors conclude by summarizing CLIP-GS, their proposed method for real-time and accurate semantic understanding of 3D scenes via Gaussian Splatting. The approach consists of two key components:

  1. Semantic Attribute Compactness (SAC): embeds compact semantic information into the 3D Gaussians to represent 3D semantics efficiently, ensuring high rendering efficiency.
  2. 3D Consistent Self-Training (3DCS): enhances semantic consistency across different views, yielding accurate 3D segmentation results.

Through experiments on synthetic and real-world scenes, the authors show that the proposed method significantly outperforms existing state-of-the-art methods and retains superior performance under sparse input data, verifying its robustness in 3D semantic learning.
