USTC Open Source | A Text-to-3D Scene Generation Method Based on 3D Gaussians and Formation Pattern Sampling

Author: 3D Vision Workshop

Author: Haoran Li | Editor: 3DCV

1. Effect display

DreamScene uses 3D Gaussians to generate high-quality, consistent, and editable 3D scenes.

This is largely because DreamScene's Formation Pattern Sampling (FPS) method can generate high-quality 3D objects.

2. Paper information

Title: DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling

Authors: Haoran Li, et al.

Institutions: University of Science and Technology of China, HKUST, The Hong Kong Polytechnic University

Paper: https://arxiv.org/abs/2404.03575

Code: https://github.com/DreamScene-Project/DreamScene

Homepage: https://dreamscene-project.github.io/

3. Abstract

Text-to-3D scene generation holds great potential in games, film, and architecture, but existing methods still struggle to maintain high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a novel 3D-Gaussian-based text-to-3D scene generation framework that addresses these three challenges through two strategies. First, DreamScene employs Formation Pattern Sampling (FPS), a multi-timestep sampling strategy guided by the formation patterns of 3D objects, to quickly form semantically rich, high-quality representations. FPS uses 3D Gaussian filtering for optimization stability and reconstruction techniques to generate plausible textures. Second, DreamScene employs a progressive three-stage camera sampling strategy, designed for both indoor and outdoor scenes, that effectively ensures the integration of objects with the environment and scene-wide 3D consistency. Finally, DreamScene enhances the flexibility of scene editing by integrating objects and environments, making targeted adjustments possible.

4. Algorithm analysis

DreamScene consists of two main parts: the FPS method and the camera sampling strategy. FPS itself comprises multi-timestep sampling, 3D Gaussian filtering, and reconstruction-based generation.

The algorithm proceeds as follows. First, the prompt is decomposed into the semantics of the objects and the environment in the scene. For each object, an initial point cloud is obtained with Point-E; camera poses are then sampled randomly for rendering, and the multi-timestep sampling strategy guides the optimization of the 3D content, which both enforces shape constraints during optimization and enriches the semantic information. However, too many 3D Gaussians can hinder optimization, so 3D Gaussian filtering removes redundant Gaussians as optimization proceeds. In the later stages of optimization, since the generated 3D content is already highly consistent, DreamScene switches to 3D reconstruction to accelerate the generation of plausible surface textures.
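
Below is a minimal, runnable sketch of this per-object loop. Every concrete component is a toy stand-in: the Point-E initialization, renderer, and diffusion guidance are mocked out, and the filtering criterion, timestep schedule, and constants are our assumptions; only the control flow (multi-timestep guidance, periodic Gaussian filtering, a late reconstruction phase) mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "3D Gaussians": xyz center + opacity per row, standing in for a
# Point-E-initialized point cloud (the real representation is far richer).
gaussians = np.hstack([rng.normal(size=(4096, 3)), rng.uniform(size=(4096, 1))])

def sample_timesteps(n=3, t_max=1000, t_min=20):
    # Multi-timestep sampling: draw several diffusion timesteps per update,
    # mixing coarse guidance (large t, overall shape) with fine guidance
    # (small t, detail). The exact schedule here is an assumption.
    return np.sort(rng.integers(t_min, t_max, size=n))[::-1]

def filter_redundant(g, min_opacity=0.05):
    # 3D Gaussian filtering: drop near-transparent Gaussians that add
    # optimization cost but contribute little (criterion is an assumption).
    return g[g[:, 3] > min_opacity]

for it in range(1500):                      # object iteration budget
    ts = sample_timesteps()                 # guide one update at several t
    # ... render from a random camera pose, apply guidance at each t ...
    if (it + 1) % 500 == 0:
        gaussians = filter_redundant(gaussians)

# Late phase (not shown): consistent multi-view renders serve as pseudo
# ground truth for fast, reconstruction-style texture refinement.
print(gaussians.shape)
```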

For the scene's environment, DreamScene uses a progressive three-stage camera sampling strategy. The environment is first initialized (as a box-shaped point cloud for indoor scenes and a hemispherical point cloud for outdoor scenes), and the already-optimized objects are combined with it. In the first stage, camera poses are sampled within a certain range of the scene center to generate a rough representation of the surroundings (indoor walls, or the distant outdoor environment). In the second stage, camera poses over specific regions are sampled to generate a rough ground, keeping the transitions between the ground and the surroundings as coherent as possible. In the third stage, all camera poses from the first two stages are used to jointly optimize all environmental elements, after which 3D reconstruction is applied for more plausible textures and details.
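
A runnable toy of the three sampling stages; the ranges, heights, and angles below are illustrative assumptions rather than the paper's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

def stage1_pose():
    # Stage 1: sample near the scene center, looking outward, to form a
    # rough surrounding environment (indoor walls / distant background).
    position = rng.uniform(-0.5, 0.5, size=3)
    yaw = rng.uniform(0.0, 2 * np.pi)
    return position, ("yaw", yaw)

def stage2_pose():
    # Stage 2: sample above specific ground regions, looking downward,
    # so the rough ground meets the surroundings coherently.
    position = np.array([rng.uniform(-3, 3), rng.uniform(-3, 3), 1.5])
    pitch = rng.uniform(-np.pi / 2, -np.pi / 6)
    return position, ("pitch", pitch)

def stage3_poses(n=8):
    # Stage 3: reuse poses from both earlier stages to refine all
    # environment elements jointly before the reconstruction pass.
    return ([stage1_pose() for _ in range(n // 2)]
            + [stage2_pose() for _ in range(n - n // 2)])

for pos, angle in stage3_poses(4):
    print(np.round(pos, 2), angle)
```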

5. Experiments

DreamScene uses GPT-4 as the LLM for scene prompt decomposition, Point-E to generate sparse point clouds as the initial representation of objects, and Stable Diffusion 2.1 as the 2D text-to-image model. The maximum numbers of iterations for objects and the environment are set to 1,500 and 2,000 rounds, respectively. The interval value m is initialized to 4 and decreases as optimization proceeds.
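
For reference, this setup can be gathered into a hypothetical configuration dict; the field names are ours, the values come from the text above.

```python
# Hypothetical config collecting the experimental setup described above;
# field names are illustrative, values are taken from the text.
config = {
    "llm": "GPT-4",                       # scene prompt decomposition
    "point_cloud_init": "Point-E",        # sparse point cloud per object
    "t2i_model": "Stable Diffusion 2.1",  # 2D text-to-image guidance
    "max_iters_object": 1500,
    "max_iters_environment": 2000,
    "interval_m_init": 4,                 # decays during optimization
}
print(config)
```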

Quality

Comparing DreamScene with existing SOTA methods in both indoor and outdoor scenes shows that Text2Room and Text2NeRF produce satisfactory results only from camera poses close to those used during generation. Compared with text-to-3D methods that generate individual objects, DreamScene's FPS can likewise generate high-quality 3D representations from text prompts in a short time.

Consistency

DreamScene's generation results ensure good 3D consistency while maintaining high generation quality.

Scene Editing

DreamScene can add or remove objects, or reposition them within the scene, by adjusting the object's affine components. After such an edit, camera poses around the object's original and new positions are resampled to re-optimize the ground and the surrounding environment. In addition, DreamScene can restyle the environment or individual objects by changing the text prompts.
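
A runnable toy of repositioning an object through an affine component: a rotation, scale, and translation applied to the object's Gaussian centers. The paper's actual parameterization is not detailed here, so the composition below is an assumption.

```python
import numpy as np

def apply_affine(centers, yaw=0.0, scale=1.0, translation=(0.0, 0.0, 0.0)):
    # Rotate about the vertical axis, then scale and translate: one
    # plausible form of a per-object affine component (an assumption).
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return scale * centers @ R.T + np.asarray(translation)

chair = np.random.default_rng(0).normal(size=(2048, 3))  # toy object Gaussians
moved = apply_affine(chair, yaw=np.pi / 2, translation=(2.0, 0.0, 0.0))
# After the move, camera poses at the old and new locations would be
# resampled to re-optimize the ground and nearby surroundings.
print(moved.mean(axis=0).round(2))
```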

Ablation

Results optimized for 30 minutes with the prompt "A DSLR photo of Iron Man". As shown in the figure, multi-timestep sampling (MTS) yields better geometry and texture than the Score Distillation Sampling (SDS) used in DreamFusion and DreamTime. FPS (Formation Pattern Sampling) builds on MTS and uses a reconstruction-based approach to create smoother, more believable textures, demonstrating the superiority of FPS.

The figure below compares reconstruction and generation results before and after compression with the 3D Gaussian filtering algorithm. In the reconstruction task, DreamScene achieves a compression rate of 73.9%, but the images become slightly blurry and some details are lost. In the generation task, the compression rate is 66.1% with no significant loss of quality.
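
As a small illustration of how such a compression rate can be computed: filter the Gaussians by a threshold and report the fraction removed (the opacity criterion and the threshold are illustrative assumptions, not the paper's).

```python
import numpy as np

rng = np.random.default_rng(0)
opacity = rng.uniform(size=100_000)        # stand-in per-Gaussian opacities
kept = opacity > 0.66                      # illustrative filtering threshold
compression_rate = 1.0 - kept.mean()       # fraction of Gaussians removed
print(f"compression rate: {compression_rate:.1%}")
```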

Quantitative Results

Generation time is measured for the environment generation stage. The left side of the table shows that DreamScene, while offering editing capability, has the shortest generation time; the right side shows a user study (scored out of 5, higher is better), in which DreamScene is far ahead of other SOTA methods in consistency and plausibility, with high generation quality as well.

6. Summary

This article introduced DreamScene, a new text-to-3D scene generation strategy. By employing FPS and the progressive camera sampling strategy, and by integrating objects and environments, DreamScene addresses the inefficiency, inconsistency, and limited editability of current text-to-3D scene generation methods. Extensive experiments demonstrate DreamScene's potential for wide use in many fields.
