
Apple open-sources the image editing tool MGIE: is it headed for the iPhone?

Author: Heart of the Machine Pro

Reported by Heart of the Machine

Editors: Egg Sauce, Chen Ping

Take a photo, type a text command, and your phone starts retouching it automatically?

This impressive capability comes from MGIE, the image editing tool Apple has just open-sourced.


Remove the person in the background


Add pizza to the table


Recently, AI has made significant progress in image editing. On one hand, building on LLMs, multimodal large language models (MLLMs) can naturally take images as input and produce visually aware responses. On the other hand, instruction-based editing techniques do not rely on detailed descriptions or region masks; instead, they let humans give instructions that directly state how to edit the image and which aspect of it to change. This approach is highly practical because such guidance better matches human intuition.

Inspired by these techniques, Apple proposed MGIE (MLLM-Guided Image Editing), which uses an MLLM to address the problem of insufficiently informative instructions.

  • Paper title: Guiding Instruction-based Image Editing via Multimodal Large Language Models
  • Paper link: https://openreview.net/pdf?id=S1RKWSyZ2Y
  • Project Homepage: https://mllm-ie.github.io/

As shown in Figure 2, MGIE consists of an MLLM and a diffusion model. The MLLM learns to derive concise expressive instructions and provide explicit, visually grounded guidance. Through end-to-end training, the diffusion model is updated jointly and performs image editing using the latent imagination of the intended target. In this way, MGIE benefits from inherent visual derivation and resolves ambiguous human instructions into sensible edits.
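To make the data flow concrete, here is a minimal structural sketch in PyTorch. It assumes an MLLM that returns both a rewritten instruction and the hidden states of a few special visual tokens, plus an edit head that projects those states into the diffusion model's conditioning space; all class names, dimensions, and call signatures are illustrative assumptions rather than Apple's released code.

```python
import torch
import torch.nn as nn

class EditHead(nn.Module):
    """Projects MLLM visual-token hidden states into the conditioning
    space of the diffusion editor (dimensions are assumptions)."""
    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, visual_token_states: torch.Tensor) -> torch.Tensor:
        # visual_token_states: (batch, n_visual_tokens, mllm_dim)
        return self.proj(visual_token_states)  # (batch, n_visual_tokens, cond_dim)

def mgie_edit(mllm, edit_head, diffusion, image, instruction: str):
    """One conceptual pass: the MLLM rewrites the terse instruction into a
    concise expressive one, and its latent visual tokens (the "imagination")
    condition the diffusion editor. The mllm/diffusion interfaces here are
    placeholders, not the actual released API."""
    expressive, visual_states = mllm(
        image=image,
        prompt=f"what will this image be like if {instruction}",
    )
    latent_imagination = edit_head(visual_states)
    edited = diffusion(image=image, text=expressive, cond=latent_imagination)
    return expressive, edited
```

This mirrors the end-to-end setup described above, in which the diffusion model and the instruction-deriving MLLM are updated jointly rather than trained in isolation.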


Guided by human instructions, MGIE can perform Photoshop-style modifications, global photo optimization, and local object edits. For example, it is hard to capture what "healthy" means without additional context, but MGIE can precisely link "vegetable toppings" to the pizza and edit it to match human expectations.


This brings to mind the "ambitions" Cook expressed during Apple's recent earnings call: "I think there's a huge opportunity for Apple in generative AI, but I don't want to go into more details." Among other things, he revealed that Apple is actively developing generative AI software features that will be made available to customers later in 2024.

Taken together with the series of generative AI research results Apple has released recently, the new AI features Apple ships next seem well worth looking forward to.

Paper details


Concise expressive instructions

Through feature alignment and instruction tuning, the MLLM can provide visually grounded responses for cross-modal perception. For image editing, the study uses the prompt "what will this image be like if [instruction]" as the language input alongside the image and derives a detailed explanation of the editing command. However, these explanations are often too verbose and even misleading. To obtain a more concise description, the study applies a pre-trained summarizer so that the MLLM learns to generate summarized output, distilling the verbose explanation into a concise expressive instruction.
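To make the distillation step concrete, here is a minimal sketch in Python. It assumes a Hugging Face summarization pipeline stands in for the pre-trained summarizer and a hard-coded string stands in for the MLLM's verbose explanation; the checkpoint and the example strings are illustrative, not the paper's exact setup.

```python
from transformers import pipeline

# A terse editing instruction and the prompt template described above.
instruction = "make it look healthier"
prompt = f"what will this image be like if {instruction}"

# In the paper, the MLLM first produces a detailed (often too verbose)
# explanation for this prompt; here a hard-coded string stands in for it.
verbose_explanation = (
    "The pizza would be covered with fresh vegetable toppings such as "
    "tomatoes, spinach and bell peppers, the colors would look brighter, "
    "and the overall dish would appear lighter and more nutritious."
)

# A pre-trained summarizer (model choice is an assumption) condenses the
# explanation into the short output the MLLM is then trained to emit.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
concise = summarizer(verbose_explanation, max_length=32, min_length=8,
                     do_sample=False)[0]["summary_text"]
print(concise)  # a short expressive instruction used as the training target
```

Once the MLLM has learned to produce such summaries directly, no separate summarizer is needed at inference time.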


Image editing via latent imagination

Through an edit head, the concise expressive instruction derived by the MLLM is transformed into a latent visual imagination of the intended edit, which conditions the diffusion model; the two components are then optimized jointly end to end so that the generated edits follow this latent guidance.

Experimental evaluation

For the same input image and the same instruction, different methods are compared; for example, the instruction in the first row is "turn day into night":


Table 1 shows zero-shot editing results, where the models are trained only on the IPr2Pr dataset. For EVR and GIER, which involve Photoshop-style modifications, the editing results are closer to the intended guidance (e.g., LGIE reaches a higher CVS of 82.0 on EVR). For global photo optimization on MA5k, InsPix2Pix struggles because relevant training triples are scarce. LGIE and MGIE can provide detailed explanations learned from LLMs, but LGIE is still limited to a single modality. By having access to the image, MGIE can derive explicit instructions, such as which regions should be brightened or which objects should be sharper, leading to significant performance gains (e.g., a higher SSIM of 66.3 and a lower photo distance of 0.3); similar results are observed on MagicBrush. MGIE also achieves the best performance thanks to precise visual imagination, modifying only the designated target (e.g., a higher DINO visual similarity of 82.2 and a higher CTS global caption alignment of 30.4).
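As a side note on the metrics, the SSIM reported for MA5k compares the edited result against the ground-truth retouched photo. A minimal way to compute it, assuming scikit-image and a grayscale comparison at a fixed resolution (the paper's exact evaluation settings may differ):

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def ssim_score(path_edit: str, path_gt: str, size=(256, 256)) -> float:
    """SSIM between an edited image and the ground-truth target,
    one of the global-optimization metrics discussed above."""
    a = np.asarray(Image.open(path_edit).convert("L").resize(size), dtype=np.float64)
    b = np.asarray(Image.open(path_gt).convert("L").resize(size), dtype=np.float64)
    return ssim(a, b, data_range=255.0)

# print(ssim_score("edited.png", "target.png"))  # hypothetical file names
```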


To investigate instruction-based image editing for specific purposes, Table 2 reports results after fine-tuning the models on each dataset. For EVR and GIER, all models improve after adapting to Photoshop-style editing tasks. MGIE consistently outperforms LGIE in every aspect of editing. This also shows that learning with expressive instructions effectively enhances image editing, and that visual perception plays a crucial role in obtaining the explicit guidance behind the largest gains.


Trade-off between α_X and α_V. Image editing has two goals: performing the action specified by the instruction and preserving the rest of the input image. Figure 3 shows the trade-off curve between instruction consistency (α_X) and input consistency (α_V). The study fixes α_X at 7.5 while varying α_V in the range [1.0, 2.2]. The larger α_V is, the more similar the edit is to the input, but the less consistent it is with the instruction. The X-axis measures CLIP directional similarity, i.e. how well the edit follows the instruction; the Y-axis is the feature similarity to the input image computed with the CLIP visual encoder. With concrete expressive instructions, the models surpass InsPix2Pix in all settings. In addition, MGIE learns from explicit, visually grounded guidance, yielding an overall improvement: whether higher input consistency or higher instruction consistency is required, MGIE provides robust gains.
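This α_X/α_V trade-off mirrors the dual classifier-free guidance used by InstructPix2Pix-style editors, where one scale weights the instruction condition and the other the input-image condition. Below is a minimal sketch of that combination rule, under the assumption that MGIE keeps this formulation; the released code may parameterize it differently.

```python
import torch

def dual_cfg_noise(eps_uncond: torch.Tensor,
                   eps_img: torch.Tensor,
                   eps_img_txt: torch.Tensor,
                   alpha_x: float = 7.5,
                   alpha_v: float = 1.5) -> torch.Tensor:
    """Combine three noise predictions from the diffusion UNet:
      eps_uncond  : neither image nor instruction condition
      eps_img     : input-image condition only
      eps_img_txt : input image + instruction condition
    alpha_x scales instruction guidance, alpha_v scales image guidance
    (the InstructPix2Pix-style formulation this sketch assumes)."""
    return (eps_uncond
            + alpha_v * (eps_img - eps_uncond)
            + alpha_x * (eps_img_txt - eps_img))

# Raising alpha_v pulls the edit toward the input image (more preservation);
# raising alpha_x pulls it toward the instruction (more aggressive editing).
eps = [torch.randn(1, 4, 64, 64) for _ in range(3)]
print(dual_cfg_noise(*eps, alpha_x=7.5, alpha_v=1.5).shape)
```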


Ablation studies

In addition, the researchers conducted ablation experiments examining how different training configurations, FZ, FT, and E2E, affect expressive instructions. The results show that MGIE consistently outperforms LGIE under FZ, FT, and E2E, suggesting that expressive instructions grounded in visual perception are advantageous across all ablation settings.


Figure 5 shows the CLIP-Score between the input or ground-truth target image and the expressive instruction. A higher CLIP-S with the input image indicates that the instruction is relevant to the editing source, while better alignment with the target image provides explicit, relevant editing guidance. As the figure shows, MGIE's expressive instructions are more aligned with the input/target, which explains why they help. With a clear description of the expected result, MGIE achieves the greatest improvement in image editing.
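CLIP-S here is the cosine similarity between CLIP embeddings of an image and of the expressive instruction. A minimal sketch with the Hugging Face CLIP implementation; the checkpoint choice and file name are assumptions, not necessarily what the paper used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_s(image_path: str, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings,
    i.e. a CLIP-Score-style alignment measure."""
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# print(clip_s("target.png", "brighten the sky and sharpen the mountains"))
```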


Human evaluation. In addition to automatic metrics, the researchers also conducted a human evaluation. Figure 6 shows the quality of the generated expressive instructions, and Figure 7 compares the image editing results of InsPix2Pix, LGIE, and MGIE in terms of instruction adherence, ground-truth relevance, and overall quality.


Inference efficiency. Although MGIE relies on an MLLM to drive image editing, it introduces only concise expressive instructions (fewer than 32 tokens), so its efficiency is on par with InsPix2Pix. Table 4 lists the inference-time costs on NVIDIA A100 GPUs. For a single input, MGIE completes an editing task in less than 10 seconds. With higher data parallelism, the time required is similar (37 seconds at a batch size of 8), and the entire process can run on a single GPU (40 GB).
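A measurement like Table 4's can be reproduced with simple wall-clock timing around the editing call; in this sketch `edit_fn` is a hypothetical stand-in for the full MGIE pipeline, not an actual released entry point.

```python
import time
import torch

def time_batch(edit_fn, images, instructions, warmup: int = 1, runs: int = 3) -> float:
    """Average wall-clock seconds per batch for an editing callable."""
    for _ in range(warmup):
        edit_fn(images, instructions)          # warm up caches / CUDA kernels
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        edit_fn(images, instructions)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Example (hypothetical pipeline): a batch of 8 edits, as in the reported 37 s figure.
# print(time_batch(my_mgie_pipeline, images[:8], instructions[:8]))
```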


Qualitative comparison. Figure 8 shows a visual comparison across all the datasets used, and Figure 9 further compares the expressive instructions of LGIE and MGIE.


On the project homepage, the researchers provide more demos (https://mllm-ie.github.io/). For further details of the study, please refer to the original paper.
