
Apple open-sources the image editing tool MGIE: is it headed for the iPhone?

Author: Heart of the Machine Pro

Reported by Heart of the Machine

Editors: Egg Sauce, Chen Ping

Take a photo, type a text command, and your phone starts retouching it automatically?

This impressive capability comes from MGIE, the image editing tool Apple has just open-sourced.


Remove the person in the background


Add pizza to the table


Recently, AI has made significant progress in image editing. On one hand, building on LLMs, multimodal large language models (MLLMs) can naturally take images as input and produce visually aware responses. On the other hand, instruction-based editing techniques do not rely on detailed descriptions or region masks; instead, they let humans give instructions that directly state how to edit the image and which aspect of it to change. This approach is highly practical because such guidance better matches human intuition.

Inspired by these techniques, Apple proposed MGIE (MLLM-Guided Image Editing), which uses an MLLM to address the problem of insufficiently informative instructions.

  • Paper title: Guiding Instruction-based Image Editing via Multimodal Large Language Models
  • Paper link: https://openreview.net/pdf?id=S1RKWSyZ2Y
  • Project Homepage: https://mllm-ie.github.io/

As shown in Figure 2, MGIE consists of an MLLM and a diffusion model. The MLLM learns to derive concise expressive instructions and provide explicit, visually grounded guidance. Through end-to-end training, the diffusion model is updated jointly and performs image editing using the latent imagination of the intended target. In this way, MGIE benefits from inherent visual derivation and resolves ambiguous human instructions into sensible edits.
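To make the data flow concrete, here is a minimal structural sketch in PyTorch. It assumes an MLLM that returns both a rewritten instruction and the hidden states of a few special visual tokens, plus an edit head that projects those states into the diffusion model's conditioning space; all class names, dimensions, and call signatures are illustrative assumptions rather than Apple's released code.

```python
import torch
import torch.nn as nn

class EditHead(nn.Module):
    """Projects MLLM visual-token hidden states into the conditioning
    space of the diffusion editor (dimensions are assumptions)."""
    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, visual_token_states: torch.Tensor) -> torch.Tensor:
        # visual_token_states: (batch, n_visual_tokens, mllm_dim)
        return self.proj(visual_token_states)  # (batch, n_visual_tokens, cond_dim)

def mgie_edit(mllm, edit_head, diffusion, image, instruction: str):
    """One conceptual pass: the MLLM rewrites the terse instruction into a
    concise expressive one, and its latent visual tokens (the "imagination")
    condition the diffusion editor. The mllm/diffusion interfaces here are
    placeholders, not the actual released API."""
    expressive, visual_states = mllm(
        image=image,
        prompt=f"what will this image be like if {instruction}",
    )
    latent_imagination = edit_head(visual_states)
    edited = diffusion(image=image, text=expressive, cond=latent_imagination)
    return expressive, edited
```

This mirrors the end-to-end setup described above, in which the diffusion model and the instruction-deriving MLLM are updated jointly rather than trained in isolation.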


Guided by human instructions, MGIE can perform Photoshop-style modifications, global photo optimization, and local object edits. For example, it is hard to capture what "healthy" means without additional context, but MGIE can precisely link "vegetable toppings" to the pizza and edit it to match human expectations.


This brings to mind the "ambitions" Cook expressed during Apple's recent earnings call: "I think there's a huge opportunity for Apple in generative AI, but I don't want to go into more details." Among other things, he revealed that Apple is actively developing generative AI software features that will be made available to customers later in 2024.

Taken together with the series of generative AI research results Apple has released recently, the new AI features Apple ships next seem well worth looking forward to.

Paper details


Concise expressive instructions

Through feature alignment and instruction tuning, the MLLM can provide visually grounded responses for cross-modal perception. For image editing, the study uses the prompt "what will this image be like if [instruction]" as the language input alongside the image and derives a detailed explanation of the editing command. However, these explanations are often too verbose and even misleading. To obtain a more concise description, the study applies a pre-trained summarizer so that the MLLM learns to generate summarized output, distilling the verbose explanation into a concise expressive instruction.
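To make the distillation step concrete, here is a minimal sketch in Python. It assumes a Hugging Face summarization pipeline stands in for the pre-trained summarizer and a hard-coded string stands in for the MLLM's verbose explanation; the checkpoint and the example strings are illustrative, not the paper's exact setup.

```python
from transformers import pipeline

# A terse editing instruction and the prompt template described above.
instruction = "make it look healthier"
prompt = f"what will this image be like if {instruction}"

# In the paper, the MLLM first produces a detailed (often too verbose)
# explanation for this prompt; here a hard-coded string stands in for it.
verbose_explanation = (
    "The pizza would be covered with fresh vegetable toppings such as "
    "tomatoes, spinach and bell peppers, the colors would look brighter, "
    "and the overall dish would appear lighter and more nutritious."
)

# A pre-trained summarizer (model choice is an assumption) condenses the
# explanation into the short output the MLLM is then trained to emit.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
concise = summarizer(verbose_explanation, max_length=32, min_length=8,
                     do_sample=False)[0]["summary_text"]
print(concise)  # a short expressive instruction used as the training target
```

Once the MLLM has learned to produce such summaries directly, no separate summarizer is needed at inference time.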


Image editing via latent imagination

Through an edit head, the concise expressive instruction derived by the MLLM is transformed into a latent visual imagination of the intended edit, which conditions the diffusion model; the two components are then optimized jointly end to end so that the generated edits follow this latent guidance.

Experimental evaluation

For the same input image and the same instruction, different methods are compared; for example, the instruction in the first row is "turn day into night":


Table 1 shows zero-shot editing results, where the models are trained only on the IPr2Pr dataset. For EVR and GIER, which involve Photoshop-style modifications, the editing results are closer to the intended guidance (e.g., LGIE reaches a higher CVS of 82.0 on EVR). For global photo optimization on MA5k, InsPix2Pix struggles because relevant training triples are scarce. LGIE and MGIE can provide detailed explanations learned from LLMs, but LGIE is still limited to a single modality. By having access to the image, MGIE can derive explicit instructions, such as which regions should be brightened or which objects should be sharper, leading to significant performance gains (e.g., a higher SSIM of 66.3 and a lower photo distance of 0.3); similar results are observed on MagicBrush. MGIE also achieves the best performance thanks to precise visual imagination, modifying only the designated target (e.g., a higher DINO visual similarity of 82.2 and a higher CTS global caption alignment of 30.4).
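As a side note on the metrics, the SSIM reported for MA5k compares the edited result against the ground-truth retouched photo. A minimal way to compute it, assuming scikit-image and a grayscale comparison at a fixed resolution (the paper's exact evaluation settings may differ):

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def ssim_score(path_edit: str, path_gt: str, size=(256, 256)) -> float:
    """SSIM between an edited image and the ground-truth target,
    one of the global-optimization metrics discussed above."""
    a = np.asarray(Image.open(path_edit).convert("L").resize(size), dtype=np.float64)
    b = np.asarray(Image.open(path_gt).convert("L").resize(size), dtype=np.float64)
    return ssim(a, b, data_range=255.0)

# print(ssim_score("edited.png", "target.png"))  # hypothetical file names
```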


To investigate instruction-based image editing for specific purposes, Table 2 reports results after fine-tuning the models on each dataset. For EVR and GIER, all models improve after adapting to Photoshop-style editing tasks. MGIE consistently outperforms LGIE in every aspect of editing. This also shows that learning with expressive instructions effectively enhances image editing, and that visual perception plays a crucial role in obtaining the explicit guidance behind the largest gains.


Trade-off between α_X and α_V. Image editing has two goals: performing the action specified by the instruction and preserving the rest of the input image. Figure 3 shows the trade-off curve between instruction consistency (α_X) and input consistency (α_V). The study fixes α_X at 7.5 while varying α_V in the range [1.0, 2.2]. The larger α_V is, the more similar the edit is to the input, but the less consistent it is with the instruction. The X-axis measures CLIP directional similarity, i.e. how well the edit follows the instruction; the Y-axis is the feature similarity to the input image computed with the CLIP visual encoder. With concrete expressive instructions, the models surpass InsPix2Pix in all settings. In addition, MGIE learns from explicit, visually grounded guidance, yielding an overall improvement: whether higher input consistency or higher instruction consistency is required, MGIE provides robust gains.
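This α_X/α_V trade-off mirrors the dual classifier-free guidance used by InstructPix2Pix-style editors, where one scale weights the instruction condition and the other the input-image condition. Below is a minimal sketch of that combination rule, under the assumption that MGIE keeps this formulation; the released code may parameterize it differently.

```python
import torch

def dual_cfg_noise(eps_uncond: torch.Tensor,
                   eps_img: torch.Tensor,
                   eps_img_txt: torch.Tensor,
                   alpha_x: float = 7.5,
                   alpha_v: float = 1.5) -> torch.Tensor:
    """Combine three noise predictions from the diffusion UNet:
      eps_uncond  : neither image nor instruction condition
      eps_img     : input-image condition only
      eps_img_txt : input image + instruction condition
    alpha_x scales instruction guidance, alpha_v scales image guidance
    (the InstructPix2Pix-style formulation this sketch assumes)."""
    return (eps_uncond
            + alpha_v * (eps_img - eps_uncond)
            + alpha_x * (eps_img_txt - eps_img))

# Raising alpha_v pulls the edit toward the input image (more preservation);
# raising alpha_x pulls it toward the instruction (more aggressive editing).
eps = [torch.randn(1, 4, 64, 64) for _ in range(3)]
print(dual_cfg_noise(*eps, alpha_x=7.5, alpha_v=1.5).shape)
```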


Ablation studies

In addition, the researchers conducted ablation experiments examining how different training configurations, FZ, FT, and E2E, affect expressive instructions. The results show that MGIE consistently outperforms LGIE under FZ, FT, and E2E, suggesting that expressive instructions grounded in visual perception are advantageous across all ablation settings.


Figure 5 shows the CLIP-Score between the input or ground-truth target image and the expressive instruction. A higher CLIP-S with the input image indicates that the instruction is relevant to the editing source, while better alignment with the target image provides explicit, relevant editing guidance. As the figure shows, MGIE's expressive instructions are more aligned with the input/target, which explains why they help. With a clear description of the expected result, MGIE achieves the greatest improvement in image editing.
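CLIP-S here is the cosine similarity between CLIP embeddings of an image and of the expressive instruction. A minimal sketch with the Hugging Face CLIP implementation; the checkpoint choice and file name are assumptions, not necessarily what the paper used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_s(image_path: str, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings,
    i.e. a CLIP-Score-style alignment measure."""
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# print(clip_s("target.png", "brighten the sky and sharpen the mountains"))
```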


Human evaluation. In addition to automatic metrics, the researchers also conducted a human evaluation. Figure 6 shows the quality of the generated expressive instructions, and Figure 7 compares the image editing results of InsPix2Pix, LGIE, and MGIE in terms of instruction adherence, ground-truth relevance, and overall quality.


Inference efficiency. Although MGIE relies on an MLLM to drive image editing, it introduces only concise expressive instructions (fewer than 32 tokens), so its efficiency is on par with InsPix2Pix. Table 4 lists the inference-time costs on NVIDIA A100 GPUs. For a single input, MGIE completes an editing task in less than 10 seconds. With higher data parallelism, the time required is similar (37 seconds at a batch size of 8), and the entire process can run on a single GPU (40 GB).
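A measurement like Table 4's can be reproduced with simple wall-clock timing around the editing call; in this sketch `edit_fn` is a hypothetical stand-in for the full MGIE pipeline, not an actual released entry point.

```python
import time
import torch

def time_batch(edit_fn, images, instructions, warmup: int = 1, runs: int = 3) -> float:
    """Average wall-clock seconds per batch for an editing callable."""
    for _ in range(warmup):
        edit_fn(images, instructions)          # warm up caches / CUDA kernels
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        edit_fn(images, instructions)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Example (hypothetical pipeline): a batch of 8 edits, as in the reported 37 s figure.
# print(time_batch(my_mgie_pipeline, images[:8], instructions[:8]))
```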


Qualitative comparison. Figure 8 shows a visual comparison across all the datasets used, and Figure 9 further compares the expressive instructions of LGIE and MGIE.


On the project homepage, the researchers provide more demos (https://mllm-ie.github.io/). For further details of the study, please refer to the original paper.
