
Precisely editing 3D scenes with only text or images, CustomNeRF was selected for CVPR 2024

Author: QbitAI (量子位)

The Meitu Imaging Research Institute (MT Lab), together with the Institute of Information Engineering of the Chinese Academy of Sciences, Beihang University, and Sun Yat-sen University, has proposed CustomNeRF, a 3D scene editing method that supports both text descriptions and reference images as editing prompts. The work has been accepted to CVPR 2024.

Background

Since Neural Radiance Fields (NeRF) were proposed in 2020, implicit scene representations have advanced to a new level. As one of the most cutting-edge techniques in this area, NeRF has quickly spread to computer vision, computer graphics, augmented reality, virtual reality, and other fields, and continues to attract extensive attention.

Thanks to its ease of optimization and continuous representation, NeRF is widely used in 3D scene reconstruction, and it has also spurred research on 3D scene editing, such as texture repainting and stylization of 3D objects or scenes. To further improve the flexibility of 3D scene editing, recent work has extensively explored editing methods built on pre-trained diffusion models. However, because NeRF is an implicit representation and 3D scenes impose their own geometric constraints, obtaining editing results that faithfully follow the text prompt is not straightforward.

To achieve precise control over text-driven 3D scene editing, the Meitu Imaging Research Institute (MT Lab), together with the Institute of Information Engineering of the Chinese Academy of Sciences, Beihang University, and Sun Yat-sen University, proposed the CustomNeRF framework, which unifies text descriptions and reference images into a single form of editing prompt. The work has been accepted to CVPR 2024 and the code has been open-sourced.

  • Paper link:
  • Code link:

Figure 1: CustomNeRF editing results under the text-driven (left) and image-driven (right) settings.

Two major challenges that CustomNeRF solves

At present, mainstream 3D scene editing methods based on pre-trained diffusion models fall into two categories.

The first iteratively updates the images in the dataset with an image editing model, but it fails in some editing cases because of the limitations of the image editing model itself. The second edits the scene with a score distillation sampling (SDS) loss, but because of the misalignment between the text and the scene, it is hard to apply directly to real scenes: it tends to introduce unwanted modifications in non-edited regions and often requires explicit intermediate representations such as meshes or voxels.
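
For reference, score distillation sampling, introduced in DreamFusion, optimizes the scene parameters θ by pushing rendered images toward the distribution modeled by a frozen, pre-trained diffusion model. A commonly quoted form of its gradient is

$$
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

where x is the image rendered from the scene parameters θ, x_t its noised version at diffusion timestep t, y the text prompt, ε the injected noise, ε̂_φ the noise predicted by the diffusion model, and w(t) a timestep-dependent weight.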

In addition, both types of methods focus mainly on text-driven 3D scene editing. Text descriptions often struggle to express the user's editing intent precisely and cannot customize a specific concept from an image into the 3D scene; they can only edit the original scene in a generic way, so it is difficult to obtain the results the user expects.

In fact, the key to achieving the desired editing results is to accurately identify the foreground region, which makes geometrically consistent foreground editing possible while keeping the background intact.

Therefore, to edit only the foreground region accurately, this paper proposes a Local-Global Iterative Editing (LGIE) training scheme that alternates between editing the foreground region and editing the full image. This scheme can accurately locate the foreground region and operate only on it while preserving the background.

In addition, image-driven 3D scene editing suffers from geometric inconsistencies in the editing results, caused by the fine-tuned diffusion model overfitting to the viewpoint of the reference image. To address this, the paper designs a class-guided regularization in which only the class word is used to represent the subject of the reference image during the local editing stage, so that the general class priors in the pre-trained diffusion model promote geometrically consistent editing.

The overall process of CustomNeRF

As shown in Figure 2, CustomNeRF achieves the goal of accurately editing and reconstructing 3D scenes guided by text prompts or reference images in three steps.


Figure 2: The overall pipeline of CustomNeRF.

First, when reconstructing the original 3D scene, CustomNeRF introduces an extra mask field that estimates an editing probability in addition to the usual color and density. As shown in Figure 2(a), for the set of images used to reconstruct the 3D scene, the paper first uses Grounded SAM to extract masks of the editing region from a natural-language description, and then trains a foreground-aware NeRF together with the original image set. After reconstruction, the editing probability is used to distinguish the region to be edited (the image foreground) from the irrelevant region (the image background), which enables decoupled rendering during the subsequent editing training.
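
As a rough illustration of this idea (a minimal sketch, not the authors' implementation; all class and parameter names below are hypothetical), a foreground-aware NeRF can attach an extra mask head to the usual NeRF MLP and supervise it with the Grounded SAM masks:

```python
import torch
import torch.nn as nn

class ForegroundAwareNeRF(nn.Module):
    """Toy NeRF MLP that predicts density, color, and an editing probability per 3D point."""

    def __init__(self, pos_dim: int = 63, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)  # volume density sigma
        self.color_head = nn.Linear(hidden, 3)    # RGB color
        self.mask_head = nn.Linear(hidden, 1)     # editing (foreground) probability

    def forward(self, encoded_xyz: torch.Tensor):
        h = self.trunk(encoded_xyz)
        sigma = torch.relu(self.density_head(h))
        rgb = torch.sigmoid(self.color_head(h))
        # Volume-rendering this per-point probability along each ray yields a 2D mask
        # that can be supervised with the Grounded SAM masks of the training images.
        edit_prob = torch.sigmoid(self.mask_head(h))
        return sigma, rgb, edit_prob
```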

Second, to unify image-driven and text-driven 3D scene editing, as shown in Figure 2(b), the paper adopts the Custom Diffusion method to fine-tune on the reference image in the image-driven setting and learn the key features of the specific subject. After training, the special token V∗ can be used like a regular word token to express the subject concept of the reference image, forming a hybrid prompt such as "a photo of a V∗ dog". In this way, CustomNeRF handles both kinds of editing prompts, images or text, in a consistent and efficient manner.
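
For intuition, Custom Diffusion fine-tunes only a small part of the text-to-image model, namely the cross-attention key/value projections plus the embedding of the new token V∗. The sketch below shows how those parameters might be selected, assuming a diffusers-style UNet whose cross-attention modules are named "attn2"; this is an illustrative assumption, not the official training code:

```python
def select_custom_diffusion_params(unet):
    """Freeze the UNet except for the cross-attention key/value projections."""
    trainable = []
    for name, param in unet.named_parameters():
        # In diffusers-style UNets, "attn2" denotes cross-attention (text conditioning),
        # and to_k / to_v are its key and value projection layers.
        is_cross_attn_kv = "attn2" in name and ("to_k" in name or "to_v" in name)
        param.requires_grad = is_cross_attn_kv
        if is_cross_attn_kv:
            trainable.append(param)
    return trainable
```

The embedding vector of the special token V∗ would be optimized alongside these weights, while everything else in the pre-trained model stays frozen.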

In the final editing stage, because NeRF is an implicit representation, optimizing the entire 3D scene with the SDS loss would noticeably change background regions that should stay identical to the original scene. As shown in Figure 2(c), the paper therefore proposes the Local-Global Iterative Editing (LGIE) scheme for decoupled SDS training, which edits the foreground region while preserving the background content.

Specifically, the paper divides NeRF's editing training into finer-grained stages. With the foreground-aware NeRF, CustomNeRF can flexibly control rendering during training: under a fixed camera view, it can choose to render the foreground only, the background only, or the regular full image containing both. During training, by iteratively rendering the foreground alone and the full scene, combined with the corresponding local or global prompts, the SDS loss edits the current NeRF scene at different granularities. Local foreground training lets the editing process focus only on the region to be edited, simplifying the task in complex scenes, while global training takes the whole scene into account and maintains coherence between foreground and background. To further keep the non-edited regions unchanged, the paper also renders background images before editing training begins and uses them to supervise the background during editing, maintaining the consistency of background pixels.
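
A minimal sketch of one such training step is shown below; the rendering modes, prompt dictionary, and SDS helper are hypothetical names used only to illustrate the alternation between local and global supervision, not the authors' API:

```python
import torch.nn.functional as F

def lgie_training_step(nerf, sds_loss, camera, prompts, cached_backgrounds, step):
    """One illustrative Local-Global Iterative Editing step."""
    if step % 2 == 0:
        # Local stage: render only the foreground (gated by the editing probability)
        # and apply the SDS loss with the local, foreground-only prompt.
        image = nerf.render(camera, mode="foreground")
        loss = sds_loss(image, prompts["local"])
    else:
        # Global stage: render the full image and apply the SDS loss with the
        # global prompt so that foreground and background remain coherent.
        image = nerf.render(camera, mode="full")
        loss = sds_loss(image, prompts["global"])

    # Background preservation: keep background pixels close to renders of the
    # original (pre-editing) scene from the same viewpoint.
    bg = nerf.render(camera, mode="background")
    loss = loss + F.mse_loss(bg, cached_backgrounds[camera])
    return loss
```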

In addition, image-driven 3D scene editing suffers from aggravated geometric inconsistencies: a diffusion model fine-tuned on the reference image tends to generate images from viewpoints similar to that of the reference image, so different views of the edited 3D scene all tend to show the subject's front view, producing geometric artifacts. To this end, the paper designs a class-guided regularization strategy that uses the special token V∗ only in the global prompt and only the class word in the local prompt, so that the class priors contained in the pre-trained diffusion model inject the new concept into the scene in a more geometrically consistent way.
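
To make this prompt assignment concrete, here is a hedged sketch of how the two stages could choose their prompts; the "<V*>" token string and the template are assumptions for illustration, and the results would populate the prompts dictionary in the previous sketch:

```python
def stage_prompt(stage: str, class_word: str, special_token: str = "<V*>") -> str:
    """Global prompts carry the learned subject token; local prompts use only the
    class word so that generic class priors guide the foreground geometry."""
    subject = f"{special_token} {class_word}" if stage == "global" else class_word
    return f"a photo of a {subject}"

print(stage_prompt("local", "dog"))   # -> "a photo of a dog"
print(stage_prompt("global", "dog"))  # -> "a photo of a <V*> dog"
```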

Experimental results

As Figures 3 and 4 show, in comparisons with baseline methods, CustomNeRF achieves good results on both image-driven and text-driven 3D scene editing tasks: the edits align well with the editing prompts while the background regions stay consistent with the original scene. Tables 1 and 2 report quantitative comparisons with the baselines under image-driven and text-driven settings, where CustomNeRF surpasses the baselines on the text alignment metric, the image alignment metric, and human evaluation.


Figure 3: Visual comparison with baseline methods under image-driven editing.


Figure 4: Visual comparison with baseline methods under text-driven editing.


Table 1: Quantitative comparison with baseline methods under image-driven editing.


Table 2: Quantitative comparison with baseline methods under text-driven editing.

Summary

This paper proposes CustomNeRF, a model that supports either text descriptions or reference images as editing prompts, and addresses two key challenges: accurately editing only the foreground region, and keeping multiple views consistent when only a single-view reference image is available. The approach combines the Local-Global Iterative Editing (LGIE) training scheme, which lets editing operations focus on the foreground while keeping the background unchanged, with class-guided regularization, which alleviates view inconsistency in image-driven editing.

Research team

The research was jointly conducted by researchers from the Meitu Imaging Research Institute (MT Lab), the Institute of Information Engineering of the Chinese Academy of Sciences, Beihang University, and Sun Yat-sen University.

MT Lab is Meitu's team dedicated to algorithm research, engineering development, and product deployment in computer vision, machine learning, augmented reality, and cloud computing. It has won more than ten championship and runner-up titles in top international computer vision competitions held at CVPR, ICCV, ECCV, and other venues, and has published 49 papers at top conferences and in leading journals in the field of artificial intelligence.

In 2023, Meitu continued to deepen its investment in AI, spending RMB 640 million on R&D (23.6% of total revenue), and officially launched its MiracleVision model in June of that year; the model was iterated to version 4.0 in less than half a year. Going forward, MT Lab will continue to build up its AI capabilities, strengthen its models on the technical side, and help build AI-native workflows.

— END —
