
Latest open source | Fast, high-quality diffusion model helps inpaint 3D Gaussian scenes

Author: 3D Vision Workshop

Editor: Computer Vision Workshop

Add the assistant on WeChat: dddvision, note "3D Gaussian", and you will be added to the group. Industry subdivision groups are listed at the end of the article.

Scan the QR code below to join the 3D Vision Knowledge Planet, which brings together many practical 3D vision problems and learning materials for each module: nearly 20 video courses (free for planet members), the latest top-conference papers, computer vision books, high-quality 3D vision algorithm source code, and more. If you want to get started with 3D vision, work on projects, or do research, you are welcome to scan the code and join!

1. Introduction

3D Gaussian splatting has recently emerged as an efficient representation for novel view synthesis. This work studies its editability, with a particular focus on the inpainting task: completing an incomplete 3D scene with additional Gaussians so that it renders well. Compared with 2D image inpainting, the key to inpainting a 3D Gaussian model is determining the attributes of the newly introduced points, and optimizing these attributes benefits greatly from good initial 3D positions. To this end, we propose to guide the initialization of points with an image-conditioned depth completion model that restores the full depth map directly from a 2D image. This design lets our model fill in depth values at a scale consistent with the original depth, while exploiting the strong prior of a large-scale diffusion model. Thanks to more accurate depth completion, our approach, called InFusion, surpasses existing alternatives in visual fidelity and efficiency (roughly 20x faster) across a variety of complex scenes, and further supports user-specified texture editing and novel object insertion.

[Teaser figure]

(a) InFusion seamlessly removes 3D objects and supports user-friendly texture editing and object insertion.

(b) InFusion substantially improves depth inpainting quality by learning depth completion from a diffusion prior.

Let's read about this work together~

2. Paper information

Title: InFusion: Inpainting 3D Gaussians via Learning Depth Completion from Diffusion Prior

Authors: Zhiheng Liu et al.

Institutions: USTC, HKUST, Ant Group, Alibaba

Project homepage: https://johanan528.github.io/Infusion/

GitHub repository: https://github.com/ali-vilab/infusion

3. Background

3D Gaussian splatting is valued as an important method for novel view synthesis, prized for producing photorealistic images at remarkable rendering speeds. Its explicit representation and real-time performance greatly improve the practicality of editing 3D scenes; for interactive downstream applications such as virtual reality (VR) and augmented reality (AR), studying how to edit 3D Gaussians is becoming increasingly important. Our research focuses on the 3D Gaussian inpainting task, which is essential for 3D scene editing: it plausibly fills in missing regions and lays the foundation for further edits such as moving objects, adding new objects, or changing textures.

Existing methods typically approach 3D Gaussian inpainting by inpainting rendered images from different viewpoints at the image level and iteratively using the repaired 2D multi-view images as new training data. However, this approach tends to produce blurry textures due to inconsistencies across the generated views, and it is slow. Notably, the training quality of a Gaussian model improves markedly when its initial points are accurately placed in the 3D scene. A practical solution is therefore to give the Gaussians in the region to be inpainted correct initial points, which simplifies the whole training process. Depth completion is thus the key to assigning these initial points: projecting the inpainted depth map back into 3D yields a seamless transition into 3D space.

We therefore introduce InFusion, an innovative 3D Gaussian inpainting method in which we train a depth completion model using a pre-trained diffusion-model prior. Our results show that InFusion determines the positions of the initial points accurately, significantly improving the fidelity and efficiency of 3D Gaussian inpainting. The model is markedly better at aligning the completed depth with the unmasked region and with the geometry of the reconstructed scene; this enhanced alignment ensures that the inpainted Gaussians composite seamlessly with the original 3D scene. In addition, for challenging cases involving large occluded areas, InFusion handles them through progressive inpainting.

4. Method

[Figure: overview of the InFusion pipeline]

As shown in the figure above, the core of InFusion is a depth completion model conditioned on an input RGB image. The model predicts and restores missing depth information from a single observed view. It builds on a latent diffusion model pre-trained on large-scale image datasets, inheriting its strong generative power and generalization.

The overall process is as follows:

  • Scene editing initialization: First, during training of the 3D Gaussian scene, a pre-labeled mask is used to remove the selected content and construct an incomplete Gaussian scene according to the editing requirements.
  • Depth completion: In general, a reference view is selected, and the single RGB image rendered from that viewpoint is inpainted with an image inpainting model such as Stable Diffusion XL Inpainting. The depth completion model then predicts the depth of the missing area from this image and produces a completed depth map. Specifically, the model takes three inputs: a depth map rendered from the 3D Gaussians, the corresponding inpainted color image, and a mask defining the region to complete. A variational autoencoder (VAE) encodes the depth map and color image into latent space; the single-channel depth map is repeated to match the VAE's input channels, and a linear normalization maps the depth values mainly into the [-1, 1] interval. The noised (near-Gaussian) latent of the encoded depth map, the encoded depth map with the masked area set to zero, the encoded RGB guidance image, and the mask are then concatenated along the channel dimension and fed to a U-Net for denoising, gradually recovering a clean depth latent from the noise. Decoding this latent with the VAE yields the completed depth map (see the first sketch after this list).
  • 3D point cloud construction: Using the completed depth map and the corresponding color image, the 2D pixels are back-projected into a 3D point cloud, which is then merged with the original set of 3D Gaussians (see the unprojection sketch after this list).
  • Gaussian model optimization: The merged Gaussians are refined through a short optimization with only a few iterations to ensure visual consistency and a smooth transition between the newly inpainted Gaussians and the original scene.
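To make the input assembly of the depth-completion step concrete, here is a minimal Python/PyTorch sketch. Everything in it is an illustrative assumption rather than the authors' code: `StubVAE` stands in for a real latent VAE (actual LDM latents have 4 channels and a learned encoder), and the noising step `z_depth + t * noise` replaces a proper diffusion noise schedule.

```python
import torch
import torch.nn.functional as F

class StubVAE:
    """Stand-in for a latent VAE encoder (hypothetical: 8x downsampling,
    latent channels equal to input channels, purely for illustration)."""
    def encode(self, x):
        return F.avg_pool2d(x, kernel_size=8)

def normalize_depth(depth):
    # Linearly rescale depth so values fall mainly within [-1, 1], as in the text.
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + 1e-8) * 2.0 - 1.0

def build_denoiser_input(vae, depth, rgb, mask, noise, t):
    """Assemble the channel-concatenated U-Net input described above.
    depth: (B,1,H,W); rgb: (B,3,H,W); mask: (B,1,H,W), 1 = region to complete."""
    depth3 = normalize_depth(depth).repeat(1, 3, 1, 1)      # 1 -> 3 channels for the VAE
    z_depth = vae.encode(depth3)                            # clean depth latent
    z_noisy = z_depth + t * noise                           # noised latent (schedule simplified)
    z_masked = vae.encode(depth3 * (1.0 - mask))            # depth latent, masked area zeroed
    z_rgb = vae.encode(rgb)                                 # RGB guidance latent
    m = F.interpolate(mask, size=z_depth.shape[-2:], mode="nearest")  # mask at latent size
    return torch.cat([z_noisy, z_masked, z_rgb, m], dim=1)  # channel-wise concatenation

# Toy usage with random tensors.
vae = StubVAE()
depth = torch.rand(1, 1, 256, 256)
rgb = torch.rand(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[..., 96:160, 96:160] = 1.0
noise = torch.randn(1, 3, 32, 32)
x = build_denoiser_input(vae, depth, rgb, mask, noise, t=0.5)
print(x.shape)  # torch.Size([1, 10, 32, 32]): 3 + 3 + 3 + 1 channels
```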
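The point cloud construction step is a standard pinhole back-projection. The NumPy sketch below uses a generic camera convention; the function name, argument layout, and axis conventions are assumptions for illustration, and real codebases differ in such details.

```python
import numpy as np

def backproject_depth(depth, rgb, K, c2w):
    """Unproject a completed depth map into a colored 3D point cloud.

    depth: (H, W) depth along the camera z-axis; rgb: (H, W, 3) colors;
    K: 3x3 pinhole intrinsics; c2w: 4x4 camera-to-world transform.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Pixel -> camera coordinates, scaled by per-pixel depth.
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Camera -> world coordinates.
    pts_world = pts_cam @ c2w[:3, :3].T + c2w[:3, 3]
    return pts_world, rgb.reshape(-1, 3)
```

The returned points and colors would then seed the new Gaussians that are merged with the original scene before the short optimization pass.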

5. Experimental results

  1. Compared with previous methods, InFusion produces crisp textures that remain 3D-consistent, whereas baseline methods often yield blurry textures, especially in complex scenes.
  2. In more challenging scenes, including those with multi-object occlusion, InFusion still produces satisfactory results compared with other methods.
  3. Comparisons with other widely used baselines, together with the corresponding point cloud visualizations, clearly show that the method recovers correct shapes aligned with the existing geometry.
  4. InFusion can iteratively inpaint complex, incomplete Gaussian scenes.
  5. Thanks to the spatial accuracy of InFusion's 3D Gaussian points, users can modify the appearance and texture of the inpainted region.
  6. By editing a single image, users can project objects into a realistic 3D scene, seamlessly integrating virtual objects into the environment and providing an intuitive tool for scene customization.

6. Conclusion

The method proposed in this paper, InFusion, provides high-quality and efficient inpainting for 3D Gaussian scenes. We further demonstrate that incorporating a diffusion prior significantly enhances our depth inpainting model; this improved depth completion has broad application prospects in 3D, especially for novel view synthesis. Our approach builds a link between latent diffusion models (LDMs) and 3D scene editing, a synergy with great potential for further development and optimization.

This article is for academic sharing only. If there is any infringement, please contact us to delete it.

Computer Vision Workshop Exchange Group

We currently run multiple communities covering 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: Image Classification/Segmentation, Object Detection/Tracking, Medical Imaging, GAN, OCR, 2D Defect Detection, Remote Sensing Mapping, Super-Resolution, Face Detection, Behavior Recognition, Model Quantization/Pruning, Transfer Learning, Human Pose Estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc.

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, BEV perception, Occupancy, object tracking, end-to-end autonomous driving, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, UAV flight control, etc.

In addition, there are exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, 3D vision industry news, and more.

Add the assistant on WeChat: dddvision, note: research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grasping, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Line and Surface Structured Light, Hardware Structured Light Scanners, Drones, etc.
