
Microsoft's ace framework for the "fraud industry" is truly terrifying! A photo plus an audio clip can generate a digital human

Author: New Zhiyuan

Editor: LRS

How low is the bar for making a video in which a real person appears to speak? Just one photo and one audio clip are enough to generate a fake video so realistic it is frightening. Will video evidence in court still be credible in the future?

When a person speaks, every subtle movement and expression conveys emotion and sends a silent message to the audience, and these nuances are a key factor in how authentic a generated result feels.

If vivid, realistic talking videos could be generated automatically from a specific face, it would fundamentally change how humans interact with AI systems, for example by improving how people with disabilities communicate, making AI tutoring more engaging, and supporting therapy and social interaction in healthcare scenarios.

Recently, researchers at Microsoft Research Asia dropped a bombshell: the VASA-1 framework, built around visual affective skills (VAS), generates ultra-realistic talking-face videos with precise lip sync, lifelike facial behavior, and natural head movements from just a single portrait photo and a voice audio clip.


Link to paper: https://arxiv.org/pdf/2404.10667.pdf

Project Homepage: https://www.microsoft.com/en-us/research/project/vasa-1/

After watching the demo, netizens suggested that "everyone should agree on a password with family and friends" to guard against fraud, since AI could always learn by listening through a phone's microphone.


From a legal perspective, "the value of video evidence will be greatly reduced in the future."


Some netizens pointed out, however, that the video has flaws on close inspection, such as the teeth changing size over time. But if you didn't know the video was AI-generated, would you be able to tell?

VASA-1, the first model under the VASA framework, not only produces lip movements perfectly synchronized with the audio, but also captures a wide range of facial nuances and natural head movements that contribute to the perception of realism and liveliness.

The core innovations of the framework are a diffusion-based model that generates holistic facial dynamics and head movements in a face latent space, and the use of videos to develop this expressive and disentangled face latent space.

The researchers also evaluated the model with a new set of metrics. The results show that the method significantly outperforms previous methods across all dimensions, delivering high-quality video with realistic facial and head dynamics and supporting real-time generation of 512×512 video at up to 40 FPS with negligible startup latency.

Arguably, the VASA framework paves the way for real-time interaction with photorealistic avatars that emulate human conversational behavior.

VASA framework

A good generated video should meet several key criteria: high fidelity, clarity, and realism of the image frames; precise synchronization between audio and lip movements; expressive and emotional facial dynamics; and natural head poses.


During generation, the model can accept an optional set of control signals to guide the output, including the main eye gaze direction, the head-to-camera distance, and an emotion offset.
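As a rough illustration, these optional controls might be packaged as a small structure like the one below; the field names and value conventions are assumptions for illustration, not VASA-1's actual interface.

```python
# Hypothetical container for the optional control signals; field names and
# value conventions are assumptions for illustration, not VASA-1's real API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ControlSignals:
    gaze_direction: Optional[Tuple[float, float]] = None  # main eye gaze as (theta, phi), radians
    head_distance: Optional[float] = None                  # head-to-camera distance scale
    emotion_offset: Optional[float] = None                 # offset applied to the emotion condition

# Unset fields fall back to the model defaults; set fields would be passed to the
# diffusion model as extra conditions alongside the audio features.
controls = ControlSignals(gaze_direction=(0.1, -0.05), head_distance=1.2)
```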

Overall framework

Instead of generating video frames directly, the VASA model generates holistic facial dynamics and head movements in a latent space, conditioned on audio and other signals.

Given a motion latent code, VASA takes the appearance and identity features extracted from the input image by the face encoder and generates the video frames.

The researchers first constructed a face latent space by training a face encoder and decoder on real-life face videos, and then trained a simple diffusion Transformer to model the motion distribution and generate motion latent codes conditioned on audio and other signals at test time.
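The two-stage pipeline described above can be sketched in a few lines of Python; the callables (face_encoder, motion_diffusion, face_decoder) are hypothetical placeholders rather than the released implementation.

```python
# A minimal sketch of the two-stage pipeline, assuming hypothetical callables
# (face_encoder, motion_diffusion, face_decoder) -- not the released implementation.
import torch

def generate_talking_face(image: torch.Tensor, audio_features: torch.Tensor,
                          face_encoder, motion_diffusion, face_decoder) -> torch.Tensor:
    # 1) Extract appearance and identity latents once from the single portrait photo.
    appearance, identity = face_encoder(image)

    # 2) Sample a sequence of motion latents (facial dynamics + head pose) from the
    #    diffusion Transformer, conditioned on the audio features.
    motion_latents = motion_diffusion.sample(audio_condition=audio_features)

    # 3) Decode one frame per motion latent, reusing the fixed appearance/identity.
    frames = [face_decoder(appearance, identity, motion) for motion in motion_latents]
    return torch.stack(frames)  # (T, 3, H, W) video tensor
```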

1. Expressive and Disentangled Face Latent Space Construction

Given a set of unlabeled videos of speaking faces, the researchers aim to build a highly decoupled and expressive latent space of faces.

Disentanglement allows the holistic facial behavior in a video to be generated and modeled efficiently across different subject identities, and it also enables disentangled factor control over the output; by contrast, existing methods either lack expressiveness or lack disentanglement.

Expressiveness, on the other hand, ensures that the decoder can output high-quality video with rich facial details and that the latent generator can capture subtle facial movements.

To achieve this, the VASA model builds on a 3D-aided face reenactment framework; compared to 2D feature maps, explicitly modeling 3D head and facial movements is more powerful and better characterizes appearance details in 3D.

Specifically, the researchers decompose a facial image into a canonical 3D appearance volume, an identity code, a 3D head pose, and a facial dynamics code. Each feature is extracted from the face image by a separate encoder, except for the appearance volume, which is constructed by first extracting a posed 3D volume and then warping it, rigidly and non-rigidly, to the canonical volume.

The decoder takes the above latent variables as input and reconstructs the facial image.
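A rough sketch of this four-way factorization might look like the following; the dataclass fields and the encoder/warping callables are illustrative assumptions about the interface, which the paper does not publish.

```python
# Hedged sketch of the four-way factorization; the dataclass fields and the
# encoder/warping callables are illustrative assumptions about the interface.
from dataclasses import dataclass
import torch

@dataclass
class FaceLatents:
    appearance_volume: torch.Tensor  # canonical 3D appearance volume
    identity_code: torch.Tensor      # subject identity embedding
    head_pose: torch.Tensor          # rigid 3D head pose (rotation + translation)
    dynamics_code: torch.Tensor      # non-rigid facial dynamics (expression, gaze, lips)

def encode_face(image, vol_encoder, id_encoder, pose_encoder, dyn_encoder, warp_to_canonical):
    # Pose, dynamics, and identity each come from their own encoder; the appearance
    # volume is first extracted in the posed frame, then warped (rigidly and
    # non-rigidly) into canonical space so it is free of pose and expression.
    pose = pose_encoder(image)
    dynamics = dyn_encoder(image)
    posed_volume = vol_encoder(image)
    canonical_volume = warp_to_canonical(posed_volume, pose, dynamics)
    return FaceLatents(canonical_volume, id_encoder(image), pose, dynamics)
```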

The core idea for learning the disentangled latent space is to construct image reconstruction losses by swapping latent variables between different frames of the same video. However, the loss functions in the base model can neither distinguish "facial dynamics" from "head pose" well nor separate "identity" from "motion".

The researchers therefore added pairwise head pose and facial dynamics transfer losses to improve the disentanglement.

To further strengthen the disentanglement between identity and motion, a facial identity similarity loss is added to the loss function.
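A minimal sketch of the cross-frame latent-swapping reconstruction plus the identity similarity term is given below, reusing the hypothetical FaceLatents fields from the earlier sketch; the loss weighting and the identity embedder are assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a cross-frame swap reconstruction loss plus the identity
# similarity term; the loss weighting and the identity embedder are assumptions.
import torch.nn.functional as F

def cross_reconstruction_loss(decoder, id_embedder, lat_i, lat_j, image_i):
    # Frames i and j come from the same video, so they share appearance and identity.
    # Rebuilding frame i from frame j's appearance/identity plus frame i's own pose
    # and dynamics should reproduce frame i -- which only holds if the factors are
    # genuinely disentangled.
    recon_i = decoder(lat_j.appearance_volume, lat_j.identity_code,
                      lat_i.head_pose, lat_i.dynamics_code)
    recon_loss = F.l1_loss(recon_i, image_i)

    # Identity similarity loss: keep the reconstructed face's identity embedding
    # close to that of the source image.
    id_loss = 1 - F.cosine_similarity(id_embedder(recon_i),
                                      id_embedder(image_i), dim=-1).mean()
    return recon_loss + id_loss
```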

2. Holistic Facial Dynamics Generation with Diffusion Transformer

Given the constructed face latent space and the trained encoders, facial dynamics and head movements can be extracted from real-life face videos to train the generative model.

Most importantly, the researchers consider identity-agnostic holistic facial dynamics generation (HFDG), in which the learned latent codes represent all facial movements, such as lip motion, (non-lip) expressions, eye gaze, and blinking. This contrasts with existing approaches that apply separate models to different factors using a mix of regression and generative formulations.


In addition, previous methods were often trained on a limited number of identities and could not model the wide range of motion patterns of different people, especially with an expressive motion latent space.

In this work, the researchers employ a diffusion model for audio-conditioned HFDG, trained on talking-face videos from a large number of identities, and apply a Transformer architecture to the sequence generation task.

3. Talking Face Video Generation

At inference time, given an arbitrary face image and an audio clip, the trained face encoder first extracts the 3D appearance volume and identity code. Audio features are then extracted and split into segments of equal length, and the trained diffusion Transformer generates head and facial motion sequences segment by segment in a sliding-window fashion. Finally, the trained decoder renders the output video.
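The sliding-window step might be sketched as follows; the window and stride sizes, and the idea of conditioning each window on the previously generated motion for smooth transitions, are illustrative assumptions.

```python
# Hedged sketch of sliding-window generation over the audio features; window/stride
# sizes and the previous-motion conditioning are illustrative assumptions.
import torch

def sliding_window_motion(audio_features: torch.Tensor, motion_diffusion,
                          window: int = 25, stride: int = 25) -> torch.Tensor:
    """audio_features: (T, C) per-frame audio features; returns concatenated motion latents."""
    chunks, previous = [], None
    for start in range(0, audio_features.shape[0] - window + 1, stride):
        clip = audio_features[start:start + window]
        # Each window is conditioned on its audio clip (and, as assumed here, on the
        # motion generated for the previous window to keep transitions smooth).
        motion = motion_diffusion.sample(audio_condition=clip, previous_motion=previous)
        chunks.append(motion)
        previous = motion
    return torch.cat(chunks, dim=0)
```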

Experimental results

The researchers used the publicly available VoxCeleb2 dataset, which contains talking-face videos of about 6,000 subjects, and re-processed it, discarding clips containing multiple people as well as low-quality clips.

For the motion latent generation task, an 8-layer Transformer encoder with an embedding size of 512 and 8 attention heads was used as the diffusion network.
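Based on those reported hyperparameters, a toy stand-in for the diffusion network could look like the following; the audio feature dimension, timestep embedding, and simple additive conditioning are assumptions, not the paper's design.

```python
# Toy stand-in for the reported diffusion network (8 Transformer layers, width 512,
# 8 heads). Audio feature dimension, timestep embedding, and additive conditioning
# are assumptions.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    def __init__(self, dim: int = 512, layers: int = 8, heads: int = 8, audio_dim: int = 128):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.audio_proj = nn.Linear(audio_dim, dim)   # project audio features to model width
        self.time_embed = nn.Embedding(1000, dim)     # diffusion timestep embedding
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_motion, audio_features, t):
        # noisy_motion: (B, T, dim) noised motion latents; audio_features: (B, T, audio_dim)
        x = noisy_motion + self.audio_proj(audio_features) + self.time_embed(t)[:, None, :]
        return self.head(self.backbone(x))  # denoiser output (e.g. predicted noise)

# Shape check with random tensors:
model = MotionDenoiser()
out = model(torch.randn(2, 25, 512), torch.randn(2, 25, 128), torch.randint(0, 1000, (2,)))
```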

The model was trained on VoxCeleb2 together with another collected high-resolution talking-video dataset containing about 3,500 subjects.

Qualitative assessment

Visualization results

Upon visual inspection, the method generates high-quality video frames with vivid facial emotions. It also produces human-like conversational behaviors, including occasional shifts of gaze during speech and contemplation, as well as a natural, variable blinking rhythm, among other nuances. The researchers strongly recommend viewing the video results on the project page to fully appreciate the method's capabilities and output quality.

Generation controllability

Results generated under different control signals, including the main eye gaze direction, head distance, and emotion offset, show that the generative model interprets these signals well and produces faces that closely follow the specified parameters.


Disentangled face latents

When the same motion latent sequence is applied to different subjects, the method produces consistent facial movements while preserving each subject's unique facial features, indicating its effectiveness in disentangling identity and motion.


The figure below further illustrates the effective disentanglement between head pose and facial dynamics: keeping one aspect constant while varying the other, the generated images faithfully reflect the intended head and facial movements without interference.


The model can also handle photo and audio inputs outside the training distribution, such as artistic photos, singing audio clips (the first two rows), and non-English speech (the last row); none of these data variants appear in the training dataset.


Quantitative assessment

The table below gives the results on the VoxCeleb2 and OneMin-32 benchmarks.


On both benchmarks, the method achieved the best results among all methods on every evaluation metric.

The method far outperforms the others on the audio-lip synchronization scores (SC and SD), even scoring better than real video, which is attributed to the effect of audio CFG (classifier-free guidance).

The CAPP scores show that the poses generated by the model match the audio more closely, especially on the OneMin-32 benchmark.

According to ∆P, the generated head movements also exhibit the highest intensity among the methods, though still below that of real video, and the FVD score is significantly lower than that of other models, indicating higher video quality and realism.

Resources:

https://www.microsoft.com/en-us/research/project/vasa-1/
