
Peking University | Combining CLIP semantics with 3DGS for real-time, accurate semantic understanding of 3D scenes

Source: 3D Vision Workshop

Author: Guibiao Liao | Editor: 3DCV



1. Introduction

This article introduces CLIP-GS, a method that combines the semantic information of the CLIP model with 3D Gaussian Splatting (3DGS) to achieve real-time, accurate semantic understanding of 3D scenes. Its key innovations build on the efficient rendering of Gaussian Splatting and introduce a Semantic Attribute Compactness (SAC) scheme and a 3D Consistent Self-Training (3DCS) strategy. Experiments show that CLIP-GS achieves state-of-the-art performance on multiple datasets, particularly in real-time rendering speed and segmentation accuracy. In short, the paper proposes an efficient and accurate 3D semantic understanding method that, by integrating semantic information with efficient rendering, marks a notable step forward for 3D scene understanding.

2. Paper information

Title: CLIP-GS: CLIP-Informed Gaussian Splatting for Real-Time and View-Consistent 3D Semantic Understanding

Authors: Guibiao Liao et al.

Affiliations: Peking University, among others

Paper: https://arxiv.org/pdf/2404.14249

3. Main contributions

The main contributions of the CLIP-GS approach include the following:

  1. Semantic Attribute Compactness (SAC): A method that exploits the intrinsically unified semantics of each object, learning one representative semantic feature per object and thereby eliminating redundant, near-identical features to enable efficient rendering.
  2. 3D Consistent Self-Training (3DCS): A strategy that uses pseudo-labels generated by the partially trained 3D Gaussian model to impose cross-view semantic consistency constraints, strengthening the view-consistent semantic learning of the Gaussians.
  3. Experimental validation: On multiple datasets, the proposed method outperforms other CLIP-based 3D semantic segmentation methods in both segmentation accuracy and rendering efficiency, and it remains robust under sparse input data.

4. CLIP-GS

[Figure: Overview of the CLIP-GS optimization pipeline]

The optimization process of CLIP-GS proceeds as shown in the figure. First, to represent the 3D scene, the method follows 3DGS and adds an extra attribute to each 3D Gaussian: a semantic embedding. These Gaussian attributes are then rendered onto the 2D plane with a differentiable rasterizer for optimization. Second, the optimization is divided into two phases. In the first phase, the Semantic Attribute Compactness (SAC) method learns a compact semantic representation for the 3D Gaussians, enabling efficient rendering. In the second phase, after CLIP-GS has been trained for several rounds, the 3D Consistent Self-Training (3DCS) method is introduced. 3DCS leverages cross-view semantics self-predicted by CLIP-GS, refined with consistency regularization, to provide the Gaussians with stronger view-consistent supervision. Note that, for simplicity, the adaptive density control and color optimization processes, which are identical to 3DGS, are omitted from the description.
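To make the rendering step concrete, below is a minimal sketch of how a per-Gaussian semantic embedding can be alpha-composited for one pixel in the same front-to-back way that 3DGS composites color. The names, shapes, and the Python loop itself are illustrative stand-ins for the paper's CUDA rasterizer:

```python
import torch

def composite_semantics(alphas: torch.Tensor, embeddings: torch.Tensor) -> torch.Tensor:
    """Front-to-back alpha compositing of semantic embeddings for one pixel.

    alphas:     (N,) opacity of each depth-sorted Gaussian covering the pixel
    embeddings: (N, D) low-dimensional semantic embedding of each Gaussian
    returns:    (D,) rendered semantic embedding for the pixel
    """
    # T_i = prod_{j<i} (1 - alpha_j): light surviving past the first i-1 Gaussians
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0
    )
    weights = alphas * transmittance            # w_i = alpha_i * T_i
    return (weights.unsqueeze(-1) * embeddings).sum(dim=0)

# Example: 5 depth-sorted Gaussians along a ray, 8-dimensional embeddings.
pixel_semantics = composite_semantics(torch.rand(5), torch.randn(5, 8))
print(pixel_semantics.shape)  # torch.Size([8])
```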

4.1. Semantic Attribute Compactness (SAC)

The idea behind SAC is to exploit the intrinsically unified semantics of the same object for an efficient representation. Specifically, region masks are obtained with the Segment Anything Model (SAM), and a weighted average of the semantic features within each region yields one unified feature representing that region. Each unified feature is then referenced by a semantic index, producing a semantic index map. In this way, the CLIP semantic features of a training view can be compactly represented as a set of unified features plus a low-dimensional semantic index map. During optimization, a low-dimensional learnable semantic embedding is attached to each 3D Gaussian; the semantic index is learned through alpha-blended rendering and used to retrieve the CLIP features. To further speed up learning, the retrieval is precomputed offline before training. By embedding compact semantic information into the 3D Gaussians, SAC achieves efficient rendering while maintaining high-quality visual results, making it central to efficient scene-semantics representation and accurate semantic segmentation.
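Below is a minimal sketch of this mask-then-pool preprocessing, assuming dense per-pixel CLIP features and SAM masks are already available. A plain mean stands in for the paper's weighted average, and all names are illustrative:

```python
import torch

def build_index_map(clip_feats: torch.Tensor, masks: list[torch.Tensor]):
    """clip_feats: (H, W, D) dense CLIP features for one training view
    masks:      list of (H, W) boolean SAM region masks
    returns:    (R, D) unified per-region features and an (H, W) index map
    """
    H, W, _ = clip_feats.shape
    unified = []
    index_map = torch.full((H, W), -1, dtype=torch.long)  # -1 = unassigned
    for r, mask in enumerate(masks):
        region = clip_feats[mask]            # (n_pixels, D) features in the region
        feat = region.mean(dim=0)            # simple mean; the paper uses a weighted average
        unified.append(feat / feat.norm())   # renormalize onto CLIP's unit sphere
        index_map[mask] = r                  # each pixel now stores a small index
    return torch.stack(unified), index_map

# Example: a 4x4 view with 16-dim CLIP features and two rectangular masks.
feats = torch.randn(4, 4, 16)
m1 = torch.zeros(4, 4, dtype=torch.bool); m1[:2] = True
m2 = ~m1
unified, idx = build_index_map(feats, [m1, m2])
print(unified.shape, idx.unique())  # torch.Size([2, 16]) tensor([0, 1])
```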

4.2. 3D Consistent Self-Training (3DCS)

The key idea of 3DCS is to enhance semantic consistency by exploiting the cross-view consistency inherent in a 3D model. Specifically, after the 3D Gaussians have been trained for a period of time, the trained model is used to render semantic maps for the training views. The region masks generated by SAM are then used to integrate semantic information from adjacent views into the semantic map of the current view, eliminating semantic ambiguities of the same object across different views. To realize this consensus regularization, a majority-voting mechanism unifies the semantics of the current view with the semantic information of its adjacent views. Through self-training on these consistent outputs of the 3D model, the 3D Gaussians receive view-consistent semantic supervision across views, which effectively improves the accuracy and consistency of 3D semantic segmentation.
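Here is a minimal sketch of that majority-voting consensus, assuming the semantic label maps rendered from neighboring views have already been aligned to the current view (the alignment step and all names here are illustrative assumptions):

```python
import torch

def fuse_pseudo_labels(labels: torch.Tensor, masks: list[torch.Tensor]) -> torch.Tensor:
    """labels: (V, H, W) semantic class ids rendered from V views and aligned
               to the current view (index 0 = the current view itself)
    masks:  list of (H, W) boolean SAM region masks for the current view
    returns: (H, W) fused pseudo-label map used as self-training supervision
    """
    fused = labels[0].clone()
    for mask in masks:
        votes = labels[:, mask].reshape(-1)   # every vote cast inside the region
        winner = torch.mode(votes).values     # majority class across all views
        fused[mask] = winner                  # one consistent label per object
    return fused

# Example: 3 views over a 4x4 image with one full-image mask.
labels = torch.randint(0, 5, (3, 4, 4))
fused = fuse_pseudo_labels(labels, [torch.ones(4, 4, dtype=torch.bool)])
print(fused.shape)  # torch.Size([4, 4])
```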

4.3. End-to-end training process

The entire model training process consists of two phases:

  1. Phase I: The Semantic Attribute Compactness (SAC) method optimizes the semantic embedding parameters of the 3D Gaussians with a 2D semantic loss (L2Ds) computed on the training views. The main goal of this phase is to learn compact, efficient semantic representations.
  2. Phase II: After the 3D Gaussians have been trained for T iterations, training moves to the second phase, where the 3D Consistent Self-Training (3DCS) method replaces L2Ds with a 3D self-training loss (L3Ds) to enhance semantic consistency. 3DCS uses cross-view semantic consistency constraints to strengthen the supervision signal and further improve segmentation accuracy. In addition, to improve rendering efficiency while maintaining high-quality scene representation, a Progressive Density Regulation (PDR) strategy is introduced: it gradually increases the image resolution and density-control frequency, effectively reducing the number of Gaussians while preserving rendering quality. A schedule sketch follows this list.
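Below is a minimal, runnable sketch of the two-phase switch, with the heavy components (rasterization, SAM, CLIP) abstracted into precomputed tensors. The losses here are placeholders that only demonstrate the schedule, and T and all sizes are arbitrary illustrative values:

```python
import torch
import torch.nn.functional as F

NUM_GAUSSIANS, DIM = 1_000, 8
T, TOTAL_ITERS = 300, 600                        # 3DCS takes over at step T

# Learnable per-Gaussian semantic embeddings (stand-in for the full 3DGS scene).
embeddings = torch.randn(NUM_GAUSSIANS, DIM, requires_grad=True)
optimizer = torch.optim.Adam([embeddings], lr=1e-2)

clip_targets = torch.randn(NUM_GAUSSIANS, DIM)   # Phase I: SAC targets (L2Ds)
fused_targets = torch.randn(NUM_GAUSSIANS, DIM)  # Phase II: 3DCS targets (L3Ds)

for step in range(TOTAL_ITERS):
    target = clip_targets if step < T else fused_targets
    loss = F.mse_loss(embeddings, target)        # placeholder for L2Ds / L3Ds
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # PDR (not shown): image resolution and the density-control frequency would
    # be increased progressively over these iterations to limit Gaussian count.
```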

5. Experiments

Quantitative comparison: The method outperforms competing methods in both rendering quality and segmentation accuracy. On the Replica and ScanNet datasets in particular, it improves mIoU over the second-best method by 17.29% and 20.81%, respectively.

[Figures: quantitative comparison results on Replica and ScanNet]

Qualitative comparison: The method produces more continuous and consistent semantic segmentation results across different views. Compared with other methods, it delivers better visual rendering quality and maintains robust reconstruction and segmentation performance under sparse input data.

Ablation studies: Ablations show that the SAC, 3DCS, and PDR strategies all contribute significantly to the final performance. Specifically, SAC improves inference efficiency and segmentation accuracy, 3DCS introduces the crucial cross-view consistent semantic constraints that raise semantic quality, and PDR improves efficiency by reducing the number of Gaussians.

[Figures: qualitative comparisons and ablation study results]

6. Conclusion

The authors conclude by summarizing CLIP-GS, their proposed method for real-time and accurate semantic understanding of 3D scenes via Gaussian Splatting. The approach consists of two key components:

  1. Semantic Attribute Compactness (SAC): embeds compact semantic information into the 3D Gaussians to represent 3D semantics efficiently, ensuring high rendering efficiency.
  2. 3D Consistent Self-Training (3DCS): enhances semantic consistency across different views, yielding accurate 3D segmentation results.

Through experiments on synthetic and real-world scenes, the authors show that the proposed method significantly outperforms existing state-of-the-art methods and retains superior performance under sparse input data, verifying its robustness in 3D semantic learning.
