
Precisely editing 3D scenes with only text or images, CustomNeRF was selected for CVPR 2024

Author: QbitAI (量子位)

The Meitu Imaging Research Institute (MT Lab), together with the Institute of Information Engineering of the Chinese Academy of Sciences, Beihang University, and Sun Yat-sen University, has proposed CustomNeRF, a 3D scene editing method that supports both text descriptions and reference images as editing prompts. The work has been accepted to CVPR 2024.

Background

Since Neural Radiance Fields (NeRF) were proposed in 2020, implicit scene representations have advanced to a new level. As one of the most cutting-edge techniques in this area, NeRF has quickly spread to computer vision, computer graphics, augmented reality, virtual reality, and other fields, and continues to attract extensive attention.

Thanks to its ease of optimization and continuous representation, NeRF is widely used in 3D scene reconstruction, and it has also spurred research on 3D scene editing, such as texture repainting and stylization of 3D objects or scenes. To further improve the flexibility of 3D scene editing, recent work has extensively explored editing methods built on pre-trained diffusion models. However, because NeRF is an implicit representation and 3D scenes impose their own geometric constraints, obtaining editing results that faithfully follow the text prompt is not straightforward.

To achieve precise control over text-driven 3D scene editing, the Meitu Imaging Research Institute (MT Lab), together with the Institute of Information Engineering of the Chinese Academy of Sciences, Beihang University, and Sun Yat-sen University, proposed the CustomNeRF framework, which unifies text descriptions and reference images into a single form of editing prompt. The work has been accepted to CVPR 2024 and the code has been open-sourced.

  • Paper link:
  • Code link:

Figure 1: CustomNeRF editing results under the text-driven (left) and image-driven (right) settings.

Two major challenges that CustomNeRF solves

At present, mainstream 3D scene editing methods based on pre-trained diffusion models fall into two categories.

The first iteratively updates the images in the dataset with an image editing model, but it fails in some editing cases because of the limitations of the image editing model itself. The second edits the scene with a score distillation sampling (SDS) loss, but because of the misalignment between the text and the scene, it is hard to apply directly to real scenes: it tends to introduce unwanted modifications in non-edited regions and often requires explicit intermediate representations such as meshes or voxels.
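
For reference, score distillation sampling, introduced in DreamFusion, optimizes the scene parameters θ by pushing rendered images toward the distribution modeled by a frozen, pre-trained diffusion model. A commonly quoted form of its gradient is

$$
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

where x is the image rendered from the scene parameters θ, x_t its noised version at diffusion timestep t, y the text prompt, ε the injected noise, ε̂_φ the noise predicted by the diffusion model, and w(t) a timestep-dependent weight.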

In addition, both types of methods focus mainly on text-driven 3D scene editing. Text descriptions often struggle to express the user's editing intent precisely and cannot customize a specific concept from an image into the 3D scene; they can only edit the original scene in a generic way, so it is difficult to obtain the results the user expects.

In fact, the key to achieving the desired editing results is to accurately identify the foreground region, which makes geometrically consistent foreground editing possible while keeping the background intact.

Therefore, to edit only the foreground region accurately, this paper proposes a Local-Global Iterative Editing (LGIE) training scheme that alternates between editing the foreground region and editing the full image. This scheme can accurately locate the foreground region and operate only on it while preserving the background.

In addition, image-driven 3D scene editing suffers from geometric inconsistencies in the editing results, caused by the fine-tuned diffusion model overfitting to the viewpoint of the reference image. To address this, the paper designs a class-guided regularization in which only the class word is used to represent the subject of the reference image during the local editing stage, so that the general class priors in the pre-trained diffusion model promote geometrically consistent editing.

The overall process of CustomNeRF

As shown in Figure 2, CustomNeRF achieves the goal of accurately editing and reconstructing 3D scenes guided by text prompts or reference images in three steps.


Figure 2: The overall pipeline of CustomNeRF.

First, when reconstructing the original 3D scene, CustomNeRF introduces an extra mask field that estimates an editing probability in addition to the usual color and density. As shown in Figure 2(a), for the set of images used to reconstruct the 3D scene, the paper first uses Grounded SAM to extract masks of the editing region from a natural-language description, and then trains a foreground-aware NeRF together with the original image set. After reconstruction, the editing probability is used to distinguish the region to be edited (the image foreground) from the irrelevant region (the image background), which enables decoupled rendering during the subsequent editing training.
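
As a rough illustration of this idea (a minimal sketch, not the authors' implementation; all class and parameter names below are hypothetical), a foreground-aware NeRF can attach an extra mask head to the usual NeRF MLP and supervise it with the Grounded SAM masks:

```python
import torch
import torch.nn as nn

class ForegroundAwareNeRF(nn.Module):
    """Toy NeRF MLP that predicts density, color, and an editing probability per 3D point."""

    def __init__(self, pos_dim: int = 63, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # volume density sigma
        self.color_head = nn.Linear(hidden, 3)    # RGB color
        self.mask_head = nn.Linear(hidden, 1)     # editing (foreground) probability

    def forward(self, encoded_xyz: torch.Tensor):
        h = self.trunk(encoded_xyz)
        sigma = torch.relu(self.density_head(h))
        rgb = torch.sigmoid(self.color_head(h))
        # Volume-rendering this per-point probability along each ray yields a 2D mask
        # that can be supervised with the Grounded SAM masks of the training images.
        edit_prob = torch.sigmoid(self.mask_head(h))
        return sigma, rgb, edit_prob
```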

Second, to unify image-driven and text-driven 3D scene editing, as shown in Figure 2(b), the paper adopts the Custom Diffusion method to fine-tune on the reference image in the image-driven setting and learn the key features of the specific subject. After training, the special token V∗ can be used like a regular word token to express the subject concept of the reference image, forming a hybrid prompt such as "a photo of a V∗ dog". In this way, CustomNeRF handles both kinds of editing prompts, images or text, in a consistent and efficient manner.
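
For intuition, Custom Diffusion fine-tunes only a small part of the text-to-image model, namely the cross-attention key/value projections plus the embedding of the new token V∗. The sketch below shows how those parameters might be selected, assuming a diffusers-style UNet whose cross-attention modules are named "attn2"; this is an illustrative assumption, not the official training code:

```python
def select_custom_diffusion_params(unet):
    """Freeze the UNet except for the cross-attention key/value projections."""
    trainable = []
    for name, param in unet.named_parameters():
        # In diffusers-style UNets, "attn2" denotes cross-attention (text conditioning),
        # and to_k / to_v are its key and value projection layers.
        is_cross_attn_kv = "attn2" in name and ("to_k" in name or "to_v" in name)
        param.requires_grad = is_cross_attn_kv
        if is_cross_attn_kv:
            trainable.append(param)
    return trainable
```

The embedding vector of the special token V∗ would be optimized alongside these weights, while everything else in the pre-trained model stays frozen.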

In the final editing stage, because NeRF is an implicit representation, optimizing the entire 3D scene with the SDS loss would noticeably change background regions that should stay identical to the original scene. As shown in Figure 2(c), the paper therefore proposes the Local-Global Iterative Editing (LGIE) scheme for decoupled SDS training, which edits the foreground region while preserving the background content.

Specifically, the paper divides NeRF's editing training into finer-grained stages. With the foreground-aware NeRF, CustomNeRF can flexibly control rendering during training: under a fixed camera view, it can choose to render the foreground only, the background only, or the regular full image containing both. During training, by iteratively rendering the foreground alone and the full scene, combined with the corresponding local or global prompts, the SDS loss edits the current NeRF scene at different granularities. Local foreground training lets the editing process focus only on the region to be edited, simplifying the task in complex scenes, while global training takes the whole scene into account and maintains coherence between foreground and background. To further keep the non-edited regions unchanged, the paper also renders background images before editing training begins and uses them to supervise the background during editing, maintaining the consistency of background pixels.
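
A minimal sketch of one such training step is shown below; the rendering modes, prompt dictionary, and SDS helper are hypothetical names used only to illustrate the alternation between local and global supervision, not the authors' API:

```python
import torch.nn.functional as F

def lgie_training_step(nerf, sds_loss, camera, prompts, cached_backgrounds, step):
    """One illustrative Local-Global Iterative Editing step."""
    if step % 2 == 0:
        # Local stage: render only the foreground (gated by the editing probability)
        # and apply the SDS loss with the local, foreground-only prompt.
        image = nerf.render(camera, mode="foreground")
        loss = sds_loss(image, prompts["local"])
    else:
        # Global stage: render the full image and apply the SDS loss with the
        # global prompt so that foreground and background remain coherent.
        image = nerf.render(camera, mode="full")
        loss = sds_loss(image, prompts["global"])

    # Background preservation: keep background pixels close to renders of the
    # original (pre-editing) scene from the same viewpoint.
    bg = nerf.render(camera, mode="background")
    loss = loss + F.mse_loss(bg, cached_backgrounds[camera])
    return loss
```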

In addition, image-driven 3D scene editing suffers from aggravated geometric inconsistencies: a diffusion model fine-tuned on the reference image tends to generate images from viewpoints similar to that of the reference image, so different views of the edited 3D scene all tend to show the subject's front view, producing geometric artifacts. To this end, the paper designs a class-guided regularization strategy that uses the special token V∗ only in the global prompt and only the class word in the local prompt, so that the class priors contained in the pre-trained diffusion model inject the new concept into the scene in a more geometrically consistent way.
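
To make this prompt assignment concrete, here is a hedged sketch of how the two stages could choose their prompts; the "<V*>" token string and the template are assumptions for illustration, and the results would populate the prompts dictionary in the previous sketch:

```python
def stage_prompt(stage: str, class_word: str, special_token: str = "<V*>") -> str:
    """Global prompts carry the learned subject token; local prompts use only the
    class word so that generic class priors guide the foreground geometry."""
    subject = f"{special_token} {class_word}" if stage == "global" else class_word
    return f"a photo of a {subject}"

print(stage_prompt("local", "dog"))   # -> "a photo of a dog"
print(stage_prompt("global", "dog"))  # -> "a photo of a <V*> dog"
```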

Experimental results

As Figures 3 and 4 show, in comparisons with baseline methods, CustomNeRF achieves good results on both image-driven and text-driven 3D scene editing tasks: the edits align well with the editing prompts while the background regions stay consistent with the original scene. Tables 1 and 2 report quantitative comparisons with the baselines under image-driven and text-driven settings, where CustomNeRF surpasses the baselines on the text alignment metric, the image alignment metric, and human evaluation.


Figure 3: Visual comparison with baseline methods under image-driven editing.


Figure 4: Visual comparison with baseline methods under text-driven editing.


Table 1: Quantitative comparison with baseline methods under image-driven editing.


Table 2: Quantitative comparison with baseline methods under text-driven editing.

Summary

This paper proposes CustomNeRF, a model that supports either text descriptions or reference images as editing prompts, and addresses two key challenges: accurately editing only the foreground region, and keeping multiple views consistent when only a single-view reference image is available. The approach combines the Local-Global Iterative Editing (LGIE) training scheme, which lets editing operations focus on the foreground while keeping the background unchanged, with class-guided regularization, which alleviates view inconsistency in image-driven editing.

Research team

The research was jointly conducted by researchers from the Meitu Imaging Research Institute (MT Lab), the Institute of Information Engineering of the Chinese Academy of Sciences, Beihang University, and Sun Yat-sen University.

MT Lab is Meitu's team dedicated to algorithm research, engineering development, and product deployment in computer vision, machine learning, augmented reality, and cloud computing. It has won more than ten championship and runner-up titles in top international computer vision competitions held at CVPR, ICCV, ECCV, and other venues, and has published 49 papers at top conferences and in leading journals in the field of artificial intelligence.

In 2023, Meitu continued to deepen its investment in AI, spending RMB 640 million on R&D (23.6% of total revenue), and officially launched its MiracleVision model in June of that year; the model was iterated to version 4.0 in less than half a year. Going forward, MT Lab will continue to build up its AI capabilities, strengthen its models on the technical side, and help build AI-native workflows.

— END —
