
Xiaohongshu open-sources InstantID, a new "magic" for stylized, high-fidelity portraits in seconds

Recently, the whole web has been flooded with InstantID, a super cool AI portrait generation tool. With just one photo and no model training, it can produce a variety of strongly stylized portraits in just tens of seconds while keeping the facial features intact. I tried it with a photo of the "fairy sister", and the results blew me away!

Sam Altman, the father of ChatGPT, transforms into different characters and pulls all kinds of exaggerated expressions for you; the effect is wonderfully dramatic~

The Chinese poet Du Fu travels through time and space: InstantID lets him leap from a two-dimensional scroll painting into the three-dimensional world.

Within a week of Xiaohongshu open-sourcing it, the project earned 4,000 stars on GitHub.

Deep learning luminary Yann LeCun also gave it a nod of approval, wishing for an Iron Man suit version to go live.

InstantID currently sits at the top of the HuggingFace Spaces Trending list; you are welcome to try it online:

Online Experience: https://huggingface.co/spaces/InstantX/InstantID
Paper title: InstantID: Zero-shot Identity-Preserving Generation in Seconds
Paper link: https://arxiv.org/abs/2401.07519
Code: https://github.com/InstantID/InstantID
Project page: https://instantid.github.io
Online demo: https://huggingface.co/spaces/InstantX/InstantID

In the field of personalized image generation, traditional methods such as DreamBooth, Textual Inversion, and LoRAs usually rely on training on datasets of a specific subject, such as a person or a style. Although these methods excel at generating subject-specific images, in practice they are hard to make compatible with the pre-trained models already shared in the community, because they need to update the entire network or undergo lengthy customized training, which makes fast, low-cost deployment difficult. Meanwhile, embedding methods based on single-image features, such as FaceStudio, PhotoMaker, and IP-Adapter, avoid the need for full retraining, but they either require full-parameter training or PEFT fine-tuning of the text-to-image model, which can hurt the model's generalization ability, or fall short in preserving the high fidelity of the face. To address these technical challenges, the Xiaohongshu InstantX team proposed InstantID, which does not train the UNet of the text-to-image model, trains only pluggable modules, requires no test-time tuning at inference, and achieves high-fidelity ID preservation with little to no impact on text controllability.

InstantID is an efficient, lightweight, pluggable adapter that gives pre-trained text-to-image diffusion models the ability to preserve identity (ID). The method focuses on the following steps:

  • Step 1: Replace the weakly aligned CLIP image features with strongly semantic face features;
  • Step 2: Embed the face-image features into cross-attention as an image prompt;
  • Step 3: Introduce IdentityNet to impose semantic and weak spatial conditional control on the face, enhancing both ID fidelity and text controllability (a conceptual sketch of these steps follows below).
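
To make the data flow concrete, here is a minimal conceptual sketch of the three steps in Python. The function names and the returned structures are hypothetical illustrations of the description above, not the authors' actual implementation.

```python
# Conceptual sketch of InstantID's conditioning (hypothetical names, not the official code).

def instantid_conditioning(face_image, face_encoder, keypoint_detector):
    # Step 1: strongly semantic ID features from a pre-trained face encoder
    # (e.g. antelopev2), replacing the weakly aligned CLIP image features.
    id_embedding = face_encoder(face_image)
    # Spatial input for Step 3: a sparse 5-keypoint map (eyes, nose, mouth corners).
    keypoint_map = keypoint_detector(face_image)
    return id_embedding, keypoint_map

def unet_conditions(text_embedding, id_embedding, keypoint_map):
    # Step 2: the ID embedding is injected as an image prompt through decoupled
    # cross-attention, alongside (not fused into) the text prompt.
    # Step 3: IdentityNet consumes the keypoint map for weak spatial control and
    # the same ID embedding as its cross-attention condition.
    return {
        "text_context": text_embedding,
        "image_prompt": id_embedding,
        "identitynet": {"spatial": keypoint_map, "context": id_embedding},
    }
```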

The image below shows stylization results with InstantID; only the leftmost photo of the person is provided as input.

[Image: InstantID stylization results generated from the single reference photo on the left]

The main contributions of the article are as follows:

(1) As a new ID retention method, InstantID effectively bridges the gap between training efficiency and ID fidelity.

(2) InstantID is pluggable and fully compatible with the text-to-image base models, LoRAs, ControlNets, and other components already available in the community, and it preserves the person's ID during inference at zero extra cost. In addition, InstantID retains good text-editing capability, allowing an ID to be embedded smoothly into a wide variety of styles.

(3) Experimental results show that InstantID not only surpasses current methods based on single-image feature embedding (such as IP-Adapter-FaceID), but is also on par with ROOP, LoRAs, and other methods in specific scenarios. Its superior performance and efficiency unlock its potential in a range of real-world applications, such as novel-view synthesis, ID interpolation, multi-ID and multi-style composition, and more.

[Figure: overview of the InstantID architecture]

Given only one reference ID image, InstantID aims to generate customized images in a variety of poses or styles while maintaining high fidelity. The figure above provides a detailed overview of the approach, which consists of three key components: (1) a robust face representation, (2) decoupled cross-attention to support image prompts, and (3) IdentityNet, which introduces additional weak spatial control while encoding the complex features of the reference face image. Of particular note:

  1. Since CLIP provides only weak semantic representations and cannot be applied directly to strongly semantic content such as faces, the team instead uses a pre-trained face encoder (such as the antelopev2 model) to extract face features.
  2. As in previous work, image prompting in a pre-trained text-to-image diffusion model complements text prompting, especially for content that is hard to describe in words. The team therefore adopts the same decoupled cross-attention mechanism as IP-Adapter; the difference is that InstantID uses face features instead of CLIP representations (a minimal sketch follows after this list).
  3. IdentityNet is introduced to encode the face image. In the implementation, IdentityNet uses a residual structure consistent with ControlNet, preserving compatibility with the original model. Compared with vanilla ControlNet, IdentityNet makes two main changes:
  • Instead of fine-grained OpenPose facial keypoints, only five facial keypoints (two for the eyes, one for the nose, and two for the mouth corners) are used as the conditional input.
  • The text prompt is removed, and the ID embedding is used instead as the condition for the cross-attention layers in ControlNet.
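
As a concrete illustration of point 2, below is a minimal PyTorch sketch of decoupled cross-attention with an ID embedding in place of CLIP features. Shapes, token counts, and module names are assumptions for illustration, not the authors' code; the same idea of conditioning on the ID embedding rather than a text prompt is what IdentityNet applies inside its ControlNet-style cross-attention layers.

```python
import torch
import torch.nn as nn

class DecoupledIDCrossAttention(nn.Module):
    """Sketch: frozen text-attention behaviour plus an added ID-prompt branch."""
    def __init__(self, dim, id_dim=512, num_tokens=16):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_text = nn.Linear(dim, dim, bias=False)
        self.to_v_text = nn.Linear(dim, dim, bias=False)
        # only the ID projector and these new key/value projections need training
        self.id_proj = nn.Linear(id_dim, dim * num_tokens)
        self.to_k_id = nn.Linear(dim, dim, bias=False)
        self.to_v_id = nn.Linear(dim, dim, bias=False)
        self.dim, self.num_tokens = dim, num_tokens

    def forward(self, hidden_states, text_emb, id_emb, scale=1.0):
        q = self.to_q(hidden_states)                        # (B, N, dim)
        # text branch: unchanged behaviour of the pre-trained model
        attn_text = torch.softmax(
            q @ self.to_k_text(text_emb).transpose(-1, -2) / self.dim ** 0.5, dim=-1
        ) @ self.to_v_text(text_emb)
        # ID branch: the face embedding is expanded into a few image-prompt tokens
        id_tokens = self.id_proj(id_emb).view(-1, self.num_tokens, self.dim)
        attn_id = torch.softmax(
            q @ self.to_k_id(id_tokens).transpose(-1, -2) / self.dim ** 0.5, dim=-1
        ) @ self.to_v_id(id_tokens)
        # decoupled: the two outputs are simply summed, so text control is preserved
        return attn_text + scale * attn_id
```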

Regarding experimental results, the authors first demonstrate the robustness, editability, and compatibility of the method, corresponding to generation with an empty text prompt, an edited text prompt, and additional ControlNets, respectively. As shown, InstantID still maintains good text control and is compatible with open-source ControlNet models.

[Figure: results under an empty prompt, an edited prompt, and additional ControlNets]

At the same time, the method also supports injecting multiple reference images to further improve the result (a possible realization is sketched below).
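
One plausible way to realize multi-image injection, assuming the pipeline exposes the per-image face embeddings (this is an assumption, not necessarily the repository's exact procedure), is simply to average the ID embeddings of several reference photos of the same person before injection:

```python
import numpy as np

def merge_id_embeddings(embeddings):
    """Average several per-photo ID embeddings of the same person (assumed mechanics)."""
    merged = np.mean(np.stack(embeddings, axis=0), axis=0)
    return merged / np.linalg.norm(merged)  # re-normalize before injection
```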

InstantID is compared with three main families of approaches in the community. (1) Single-image feature injection (IP-Adapter and PhotoMaker). IP-Adapter is pluggable and compatible with community models, and its FaceID version significantly improves face fidelity, but its text controllability degrades noticeably. The recently released PhotoMaker, which trains the entire model (albeit in a LoRA fashion), alleviates the style-degradation problem, but its face fidelity is not significantly better and can even be worse than IP-Adapter-FaceID. InstantID strikes a good balance between face fidelity and text control.

(2) Character LoRAs based on fine-tuning. LoRAs rely on large amounts of high-quality data, whereas InstantID needs only a single image to achieve comparable stylization strength.

(3) Inswapper, a non-diffusion face-swapping model. In contrast, InstantID blends the face and the background more flexibly.

In addition, InstantID supports multi-view generation, ID interpolation, and multi-ID generation as potential application scenarios.

(1) Multi-view generation: features are extracted from a single image and multiple views of the subject are generated from different reference poses, creating an all-round, three-dimensional visual effect.

(2) ID interpolation: a smooth transition between two identities, for example a 50% Yang Mi + 50% Taylor Swift blend.
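
A plausible mechanism for ID interpolation (an assumption based on how the embeddings are injected, not the paper's stated procedure) is to blend the two identity embeddings before generation, for example with spherical interpolation:

```python
import numpy as np

def slerp_ids(id_a, id_b, alpha=0.5):
    """Spherical interpolation between two face embeddings; alpha=0.5 gives a 50/50 blend."""
    a = id_a / np.linalg.norm(id_a)
    b = id_b / np.linalg.norm(id_b)
    theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return a  # embeddings are (nearly) identical
    return (np.sin((1 - alpha) * theta) * a + np.sin(alpha * theta) * b) / np.sin(theta)
```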

(3) Multi-ID plus multi-style generation: several individual identities and several artistic styles are presented in the same image without clashing.

Building on its high-quality portrait injection and editing capabilities, InstantID can support many derivative applications. For example, fast, low-threshold portrait photo shoots: quick to produce and low in cost.

There is also portrait customization with exaggerated expressions and features, which is great fun to play with.

And there is hybrid customization with non-portrait elements; such a distinctive artistic image is perfect for people who keep cute pets.

At present, Xiaohongshu has open-sourced the project and released the model inference code. You are welcome to try it online or deploy it locally and experience the charm of InstantID for yourself.

Code: https://github.com/InstantID/InstantID
Project page: https://instantid.github.io
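
For local deployment, the sketch below loosely follows the repository's README at the time of writing. Checkpoint paths, the base model, and names such as StableDiffusionXLInstantIDPipeline, load_ip_adapter_instantid, and draw_kps are taken from that README and may change, so treat this as an illustration and check the repo for current usage.

```python
import cv2
import numpy as np
import torch
from diffusers.models import ControlNetModel
from diffusers.utils import load_image
from insightface.app import FaceAnalysis
from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline, draw_kps

# face encoder (antelopev2) supplies the ID embedding and the 5 facial keypoints
app = FaceAnalysis(name="antelopev2", root="./",
                   providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

# IdentityNet (ControlNet-style) plus the InstantID face adapter, downloaded beforehand
controlnet = ControlNetModel.from_pretrained("./checkpoints/ControlNetModel", torch_dtype=torch.float16)
pipe = StableDiffusionXLInstantIDPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter_instantid("./checkpoints/ip-adapter.bin")

# one reference photo -> ID embedding + keypoint conditioning image
face_image = load_image("./examples/face.jpg")
face_info = app.get(cv2.cvtColor(np.array(face_image), cv2.COLOR_RGB2BGR))[0]  # assumes one face
face_emb = face_info["embedding"]
face_kps = draw_kps(face_image, face_info["kps"])

# stylized, ID-preserving generation guided by a text prompt
image = pipe(
    "watercolor painting, portrait, highly detailed",
    image_embeds=face_emb,
    image=face_kps,
    controlnet_conditioning_scale=0.8,
).images[0]
image.save("result.jpg")
```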

Illustration From IconScout By Delesign Graphics

-The End-


New this week!

"AI Technology Stream" original submission plan

TechBeat is an AI learning community (www.techbeat.net) established by Jiangmen Ventures. The community has published 500+ talk videos and 3,000+ technical articles covering CV, NLP, ML, Robotics, and more. It holds regular online events every month, including top-conference discussions, and occasionally organizes offline gatherings and exchanges for technical practitioners. We are striving to become a high-quality, knowledge-driven exchange platform loved by AI talent, and hope to provide more professional services and experiences for AI talent to accelerate and accompany their growth.

Submission content

Interpretations of the latest technologies / systematic knowledge sharing

Commentary on cutting-edge news / first-hand experience sharing

Submission guidelines

Submissions must be original articles and must include the author's information.

We will select articles offering in-depth technical analysis or research experience that readers find most inspiring, and provide rewards for original content.

Submission method

Send an email to

[email protected]

Or add the staff member's WeChat (chemn493) to submit your manuscript and discuss the details, or follow the "Jiangmen Venture Capital" official account and reply with the word "submission" in the background to get the submission instructions.

>>> Add the editor's WeChat!

About my "door" (Jiangmen) ▼

Jiangmen is a new venture capital firm focused on core technologies for digital intelligence and a benchmark incubator in Beijing. By connecting technology and business, it is committed to discovering and nurturing technology startups with global influence, and to driving enterprise innovation and industrial upgrading.

Jiangmen was founded at the end of 2015; its founding team came from the original Microsoft Ventures team in China, which had selected and incubated 126 innovative technology startups for Microsoft.

If you are a technology startup that wants not only investment but also a series of continuous, valuable post-investment services, you are welcome to send or recommend a project to my "door":
