
A unified model for multimodal controllable image generation is here, and the model parameters and inference code are fully open source

Author: Heart of the Machine Pro

Heart of the Machine column

Heart of the Machine Editorial Office

Researchers from Salesforce AI, Northeastern University, and Stanford University propose a MOE-style Adapter and a Task-aware HyperNet that give UniControl multimodal conditional generation capabilities. UniControl is trained on nine distinct condition-to-image (C2I) tasks and demonstrates strong visual generation quality and zero-shot generalization.

Paper address: https://arxiv.org/abs/2305.11147

Code address: https://github.com/salesforce/UniControl

Project Home Page: https://shorturl.at/lmMX6

Introduction: Stable Diffusion exhibits powerful visual generation capabilities, but it often falls short when images require spatial, structural, or geometric control. Efforts such as ControlNet [1] and T2I-adapter [2] enable controllable image generation for individual modalities, but accommodating a variety of visual conditions within a single unified model remains an unsolved challenge. UniControl unifies a wide range of controllable condition-to-image (C2I) tasks within a single framework. To enable UniControl to handle diverse visual conditions, the authors introduce a task-aware HyperNet that modulates the downstream conditional diffusion model so that it can adapt to different C2I tasks simultaneously. UniControl is trained on nine distinct C2I tasks and demonstrates strong visual generation quality and zero-shot generalization. The authors have open-sourced the model parameters and inference code; the dataset and training code will be released as soon as possible.


Figure 1: UniControl covers multiple pre-trained tasks and zero-shot tasks

Motivation: Existing controllable image generation models are each designed for a single modality, yet work such as Taskonomy [3] shows that features and information are shared across different visual modalities. This paper therefore argues that a unified multimodal model has great potential.

Solution: This paper proposes a MOE-style Adapter and a Task-aware HyperNet to implement multimodal conditional generation in UniControl. The authors also built a new dataset, MultiGen-20M, containing nine major tasks and more than 20 million image-condition-prompt triples, with image size ≥ 512.

Advantages: 1) A more compact model (1.4B #params, 5.78GB checkpoint) that implements multiple tasks with fewer parameters. 2) Stronger visual generation and control accuracy. 3) Zero-shot generalization to never-before-seen modalities.

1. Introduction

Generative foundation models are changing the way AI interacts with the world in areas such as natural language processing, computer vision, audio processing, and robot control. In natural language processing, generative foundation models such as InstructGPT and GPT-4 excel at a wide variety of tasks, and this multitasking ability is one of their most attractive features. Moreover, they can perform zero-shot or few-shot learning to handle unseen tasks.

In generative models for vision, however, this multitasking ability is far less prominent. While text descriptions provide a flexible way to control the content of generated images, they often fall short in providing pixel-level spatial, structural, or geometric control. Recent works such as ControlNet and T2I-adapter augment the Stable Diffusion Model (SDM) for precise control. However, unlike language prompts, which can be handled by a unified module such as CLIP, each ControlNet model can only handle the specific modality it was trained on.

To overcome the limitations of previous work, this paper proposes UniControl, a unified diffusion model that can handle both language and various visual conditions. UniControl's unified design improves training and inference efficiency as well as the quality of controllable generation. At the same time, UniControl benefits from the inherent connections between different visual conditions to enhance generation under each individual condition.

UniControl's unified controllable generation relies on two components: the "MOE-style Adapter" and the "Task-aware HyperNet". The MOE-style Adapter has about 70K parameters per modality and learns low-level feature maps from the various modalities; the Task-aware HyperNet takes the task instruction as a natural-language prompt and outputs a task embedding that is injected into the downstream network to modulate its parameters, adapting the model to inputs of different modalities.

The study pre-trained UniControl for multitask and zero-shot learning across nine distinct tasks in five categories: edges (Canny, HED, sketch), region maps (segmentation map, object bounding box), human skeleton, geometry maps (depth, surface normal), and image outpainting. UniControl was then trained on NVIDIA A100 hardware for more than 5,000 GPU hours (new models are still being trained), and it has demonstrated zero-shot adaptability to new tasks.

The contributions of the study can be summarized as follows:

The study proposes UniControl, a unified model (1.4B #params, 5.78GB checkpoint) that can handle a variety of visual conditions for controllable visual generation.

The study collected a new multi-condition visual generation dataset containing more than 20 million image-condition-prompt triples covering nine distinct tasks across five categories.

The study carried out experiments showing that the unified UniControl model outperforms single-task controllable image generation, owing to the intrinsic relationships it learns between different visual conditions.

UniControl demonstrates the ability to adapt to unseen tasks in a zero-shot fashion, showing its potential for widespread use in open environments.

2. Model design


Figure 2: Model structure. To accommodate multiple tasks, the study designed MOE-style adapters with approximately 70K parameters per task, along with a task-aware HyperNet (about 12M parameters) that modulates 7 zero-convolution layers. This structure implements multiple tasks in a single model, preserving task diversity while sharing the underlying parameters. Compared with an equivalent stack of single-task models (each with approximately 1.4B parameters), the model size is significantly reduced.

The UniControl model design ensures two properties:

1) Overcome the misalignment between low-level features from different modalities. This helps UniControl learn the necessary and unique information from each task. For example, when the model uses a segmentation map as the visual condition, 3D geometric information may need to be ignored.

2) Be able to learn meta-knowledge across tasks. This allows the model to understand the shared knowledge between tasks and the differences between them.

To provide these properties, the model introduces two novel modules: MOE-style Adapter and Task-aware HyperNet.

The MOE-style Adapter is a set of convolutional modules, one per modality, inspired by Mixture-of-Experts (MoE); it serves as UniControl's feature extractor for the various low-level visual conditions. The adapter has approximately 70K parameters per task and is extremely computationally efficient. The extracted visual features are then fed into a unified network for processing.
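To make the routing concrete, below is a minimal PyTorch-style sketch of a per-task adapter bank with hard routing by task name. The layer sizes, channel counts, and task names are illustrative assumptions of this article, not the released implementation.

import torch
import torch.nn as nn

class MOEStyleAdapter(nn.Module):
    """One small convolutional 'expert' per task/modality (sizes are illustrative)."""
    def __init__(self, tasks, in_channels=3, out_channels=32):
        super().__init__()
        self.experts = nn.ModuleDict({
            task: nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
                nn.SiLU(),
                nn.Conv2d(16, out_channels, kernel_size=3, padding=1),
            )
            for task in tasks
        })

    def forward(self, visual_condition, task):
        # Hard routing: pick the expert that matches the modality of the input condition.
        return self.experts[task](visual_condition)

adapter = MOEStyleAdapter(["canny", "depth", "segmentation", "skeleton"])
features = adapter(torch.randn(1, 3, 512, 512), task="canny")  # -> (1, 32, 512, 512)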

The Task-aware HyperNet conditions ControlNet's zero-convolution modules on the task instruction. The HyperNet first projects the task instruction into a task embedding, which the researchers then inject into ControlNet's zero-convolution layers. The task embedding matches the size of the zero-convolution kernel; similar to StyleGAN [4], the study multiplies the two directly to modulate the convolution parameters, and the modulated parameters serve as the final convolution parameters. The modulated zero-convolution parameters therefore differ for each task, ensuring the model's adaptability to each modality, while all other weights remain shared across tasks.
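As a rough illustration of the modulation idea, the sketch below scales a zero-initialized convolution kernel using a projection of the task embedding, in the spirit of StyleGAN-style modulation. The paper multiplies the task embedding and the kernel directly; the per-output-channel scaling, embedding dimension, and layer shapes here are simplifying assumptions, not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedZeroConv(nn.Module):
    """1x1 convolution initialized to zero whose kernel is scaled by a task embedding."""
    def __init__(self, channels, task_embed_dim):
        super().__init__()
        # Zero-initialized kernel, as in ControlNet's zero-convolution layers.
        self.weight = nn.Parameter(torch.zeros(channels, channels, 1, 1))
        # Projects the task embedding to one scale per output channel (assumed shape).
        self.to_scale = nn.Linear(task_embed_dim, channels)

    def forward(self, x, task_embedding):
        scale = self.to_scale(task_embedding).view(-1, 1, 1, 1)   # (C, 1, 1, 1)
        modulated_weight = self.weight * scale                    # task-specific kernel
        return F.conv2d(x, modulated_weight)

layer = ModulatedZeroConv(channels=320, task_embed_dim=768)
out = layer(torch.randn(1, 320, 64, 64), task_embedding=torch.randn(768))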

3. Model training

Unlike SDM or ControlNet, whose image generation is conditioned on a language prompt alone or on a single type of visual condition such as Canny edges, UniControl needs to handle a variety of visual conditions from different tasks as well as language prompts. UniControl's input therefore consists of four parts: noise, text prompt, visual condition, and task instruction. The task instruction is obtained naturally from the modality of the visual condition.
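Below is a minimal sketch of how the four inputs could be packed, with the task instruction derived from the condition's modality. The instruction strings and field names are assumptions, not the released prompt templates.

# Hypothetical mapping from condition modality to task instruction.
TASK_INSTRUCTIONS = {
    "canny": "canny edge to image",
    "hed": "hed edge to image",
    "depth": "depth map to image",
    "segmentation": "segmentation map to image",
    "skeleton": "human skeleton to image",
}

def build_model_input(noise, text_prompt, visual_condition, modality):
    """Pack UniControl's four inputs: noise, text prompt, visual condition, task instruction."""
    return {
        "noise": noise,
        "text_prompt": text_prompt,
        "visual_condition": visual_condition,
        "task_instruction": TASK_INSTRUCTIONS[modality],
    }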


With the training pairs generated in this way, the study uses DDPM [5] to train the model.
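To sketch what a DDPM training step looks like with these inputs, the function below noises a clean latent at a random timestep and regresses the predicted noise with an MSE loss. The denoiser signature and latent shapes are assumptions, not the released training code.

import torch
import torch.nn.functional as F

def ddpm_training_step(model, z0, text_prompt, visual_condition, task_instruction,
                       alphas_cumprod, num_timesteps=1000):
    """One DDPM step: noise a clean latent z0 at a random timestep and predict that noise."""
    b = z0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise   # forward diffusion q(z_t | z_0)
    pred = model(z_t, t, text_prompt, visual_condition, task_instruction)  # assumed signature
    return F.mse_loss(pred, noise)                            # epsilon-prediction objective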

4. Experimental results


Figure 6: Visual comparison results on the test set. Test data are from MSCOCO [6] and Laion [7].

Figure 6 shows the comparison with the official or reproduced ControlNet models; please refer to the paper for more results.

5. Zero-shot task generalization

The study tests the model's zero-shot capability in the following two scenarios:

Hybrid task generalization: The study feeds UniControl two different visual conditions at once, a mixture of a segmentation map and a human skeleton, and adds the keywords "background" and "foreground" to the text prompt. In addition, the study rewrites the task instruction as a blend of the two tasks, such as "segmentation map and human skeleton to image".
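A minimal sketch of this hybrid setup, reusing the MOEStyleAdapter sketch above: each condition passes through its own expert, the features are blended, and the instruction combines both tasks. The equal 0.5 weights and the instruction string are assumptions.

def hybrid_condition_features(adapter, seg_map, skeleton_map):
    """Blend features from two visual conditions for a hybrid zero-shot task."""
    feat_seg = adapter.experts["segmentation"](seg_map)     # "background" condition
    feat_skel = adapter.experts["skeleton"](skeleton_map)   # "foreground" condition
    blended_instruction = "segmentation map and human skeleton to image"
    return 0.5 * (feat_seg + feat_skel), blended_instruction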

New task generalization: UniControl must generate controllable images for new, unseen visual conditions. To achieve this, estimating task weights based on the relationship between the unseen task and the seen pre-training tasks is critical. The task weights can be estimated either by manual assignment or by computing similarity scores between task instructions in the embedding space. The MOE-style Adapters can then be linearly combined with the estimated task weights to extract shallow features from the new, unseen visual condition.
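The sketch below estimates task weights from the cosine similarity between the new task instruction and the seen task instructions in embedding space, then linearly combines the per-task adapter outputs. The softmax normalization is an assumption about how similarity scores are turned into weights.

import torch
import torch.nn.functional as F

def estimate_task_weights(new_instruction_emb, seen_instruction_embs):
    """Weights for an unseen task from instruction-embedding similarity to seen tasks."""
    sims = torch.stack([F.cosine_similarity(new_instruction_emb, emb, dim=0)
                        for emb in seen_instruction_embs])
    return F.softmax(sims, dim=0)

def blended_adapter_features(adapter, visual_condition, seen_tasks, weights):
    """Linearly combine the per-task MOE-style adapter outputs with the estimated weights."""
    feats = [adapter.experts[task](visual_condition) for task in seen_tasks]
    return sum(w * f for w, f in zip(weights, feats))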

The results of the visualization are shown in Figure 7, and more results can be found in the paper.


Figure 7: Visualization of UniControl on zero-shot tasks

6. Summary

Overall, UniControl provides a new foundation model for controllable visual generation through the diversity of its controls. It opens the possibility of higher levels of both autonomy and human control in image generation. The study looks forward to discussing and collaborating with more researchers to further advance this field.

More visuals


[1] Zhang, Lvmin, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." arXiv preprint arXiv:2302.05543 (2023).

[2] Mou, Chong, et al. "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models." arXiv preprint arXiv:2302.08453 (2023).

[3] Zamir, Amir R., et al. "Taskonomy: Disentangling task transfer learning." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[4] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.

[5] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.

[6] Lin, Tsung-Yi, et al. "Microsoft coco: Common objects in context." Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014.

[7] Schuhmann, Christoph, et al. "Laion-400m: Open dataset of clip-filtered 400 million image-text pairs." arXiv preprint arXiv:2111.02114 (2021).