Microsoft's viral single-photo digital human: the same idea as Sora, and "more realistic than AI Liu Qiangdong"

Author: QbitAI

Mengchen, from Aofei Temple

QbitAI | WeChat official account QbitAI

The threshold for AI-faked live-person video has just been lowered again.

Microsoft has released VASA-1, a technology that generates a digital human from a single picture. Netizens who saw it called the effect "explosive", even more realistic than "AI Liu Qiangdong".

Without further ado, let's go straight to the one-minute demo video:

No training on a specific person is required: just upload a picture of a face and an audio clip, and the face doesn't even have to be a real person's.

For example, you can have the Mona Lisa rap, imitating Anne Hathaway's famous improvised rap complaining about the paparazzi.

Or have a sketched portrait deliver Hua Qiang's lines.

More one-minute and 15-second videos are available on the project homepage.

The digital humans span different genders, ages, and ethnicities, and speak with different accents.

According to the team's description in the paper, VASA-1 has the following characteristics:

  • Precise synchronization of lip shape and speech

This is the most basic requirement, and VASA-1 also ranks at the top in quantitative evaluations.

  • Rich and natural facial expressions

The photo doesn't just "speak": eyebrows, gaze, and micro-expressions all move in coordination, so the face doesn't look dull.

  • Human-like head movements

Well-timed nods, head shakes, and tilts while speaking make the character look more vivid and convincing.

Overall, the eyes still show some flaws on close inspection, but netizens have rated it "the best demonstration so far".

What's even scarier, though, is that the whole system's inference runs in real time.

It generates 512×512 video at up to 40 fps on a single NVIDIA RTX 4090 graphics card.

So, how did VASA-1 do this?

Three key technologies, the same idea as Sora

In a nutshell:

Instead of generating video frames directly, motion codes are generated in a latent space and then decoded back into video.

Doesn't that sound like Sora's approach?

In fact, VASA-1's model architecture is a Diffusion Transformer, the same core component as Sora's.
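
To make the idea concrete, here is a minimal PyTorch sketch of the "generate motion latents, then decode" pipeline. Everything here is an illustrative assumption, not VASA-1's actual architecture: toy linear layers stand in for the real encoder and decoder, a GRU stands in for the diffusion transformer, and frames are shrunk to 64×64 instead of 512×512.

```python
# Minimal sketch of "generate in latent space, then decode to video".
# All module names, shapes, and layers are toy stand-ins, NOT VASA-1's real design.
import torch
import torch.nn as nn

IMG = 3 * 64 * 64  # toy frame size; the real system works at 512x512

class TalkingFacePipeline(nn.Module):
    def __init__(self, latent_dim=256, audio_dim=128):
        super().__init__()
        self.face_encoder = nn.Linear(IMG, latent_dim)          # photo -> appearance latent
        self.motion_generator = nn.GRU(audio_dim, latent_dim,
                                       batch_first=True)        # audio -> motion latents
        self.frame_decoder = nn.Linear(2 * latent_dim, IMG)     # latents -> frame pixels

    def forward(self, photo, audio_features):
        appearance = self.face_encoder(photo.flatten(1))        # (B, latent_dim), static
        motion, _ = self.motion_generator(audio_features)       # (B, T, latent_dim)
        T = motion.shape[1]
        appearance = appearance[:, None, :].expand(-1, T, -1)   # repeat per frame
        frames = self.frame_decoder(torch.cat([appearance, motion], dim=-1))
        return frames.view(-1, T, 3, 64, 64)                    # decoded video frames

pipe = TalkingFacePipeline()
video = pipe(torch.randn(1, 3, 64, 64), torch.randn(1, 80, 128))
print(video.shape)  # torch.Size([1, 80, 3, 64, 64])
```

In the real system, the motion generator is the diffusion transformer described below; but the split into appearance latent, motion latents, and a decoder is the part that mirrors Sora's latent-space approach.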

According to the paper, there are three key technologies behind it:

First, face latent space learning, where the key is disentanglement.

From a large number of real talking-face videos, the team learned a well-structured latent space for faces.

In this latent space, factors such as identity, appearance, expression, and head pose are separated from one another. That way, the same motion can drive different faces, and swapping in a new face still looks natural.
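
A minimal sketch of what such disentanglement enables, assuming a toy codec whose latent splits into an identity part and a motion part (all names and dimensions are hypothetical, not the paper's):

```python
# Hypothetical sketch of a disentangled face latent space: the encoder splits a
# frame into an identity/appearance code and a motion (expression + pose) code,
# so motion from one face can drive another. Names and sizes are illustrative.
import torch
import torch.nn as nn

class DisentangledFaceCodec(nn.Module):
    def __init__(self, img_dim=3 * 64 * 64, id_dim=128, motion_dim=64):
        super().__init__()
        self.id_dim = id_dim
        self.encoder = nn.Linear(img_dim, id_dim + motion_dim)
        self.decoder = nn.Linear(id_dim + motion_dim, img_dim)

    def encode(self, frame):
        z = self.encoder(frame.flatten(1))
        return z[:, :self.id_dim], z[:, self.id_dim:]       # (identity, motion)

    def decode(self, identity, motion):
        return self.decoder(torch.cat([identity, motion], dim=-1))

codec = DisentangledFaceCodec()
id_a, _ = codec.encode(torch.randn(1, 3, 64, 64))           # appearance of face A
_, motion_b = codec.encode(torch.randn(1, 3, 64, 64))       # motion of face B
frame = codec.decode(id_a, motion_b)                        # face A doing B's motion
```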

Second, the holistic motion generation model, where the key is unification.

Unlike previous methods that model local movements separately, such as the lips, eyes, eyebrows, and head pose, VASA-1 encodes all facial dynamics together and models their probability distribution with a Diffusion Transformer, the same core component used in Sora.

This not only produces more coordinated, natural overall motion, but also captures long-range dependencies thanks to the Transformer's strong temporal modeling capabilities (a minimal sketch of such a denoiser follows the list below).

For example, given an original motion sequence (the first column in the figure below), the model can:

  • keep the original head pose while changing the facial expression (second column)
  • keep the original facial expression while changing the head pose (third column)
  • keep the original facial expression while generating an entirely new head pose (fourth column)
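
Here is a minimal sketch of that idea: a Transformer denoiser that predicts the noise over a whole sequence of holistic motion latents, conditioned on per-frame audio features and a diffusion timestep. All shapes, names, and the conditioning scheme are assumptions, not the paper's exact design.

```python
# Hypothetical Diffusion-Transformer-style denoiser over holistic motion latents:
# it predicts the noise added to a whole sequence of motion codes, conditioned on
# audio features and a diffusion timestep. Shapes and names are assumptions.
import torch
import torch.nn as nn

class MotionDiT(nn.Module):
    def __init__(self, motion_dim=64, audio_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)         # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.noise_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_features, t):
        # Fuse audio conditioning and the timestep embedding token-wise, then let
        # self-attention capture long-range dependencies across the sequence.
        h = self.motion_in(noisy_motion) + self.audio_in(audio_features)
        h = h + self.time_emb(t)[:, None, :]
        return self.noise_out(self.backbone(h))             # predicted noise (B, T, motion_dim)

model = MotionDiT()
eps = model(torch.randn(2, 80, 64),                         # noisy motion latents
            torch.randn(2, 80, 128),                        # per-frame audio features
            torch.randint(0, 1000, (2,)))                   # random timesteps
```

Because the denoiser sees the whole motion sequence at once, lips, eyes, eyebrows, and head pose come out of a single distribution instead of separately stitched models.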

Finally, efficient inference.

To achieve real-time synthesis, the team heavily optimized the diffusion model's inference process.
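
The article doesn't detail which optimizations were used, but one standard trick for speeding up diffusion sampling is to run far fewer denoising steps than the training schedule, e.g. a deterministic DDIM-style loop. A hedged sketch, reusing the hypothetical MotionDiT above (the noise-schedule constants are also illustrative):

```python
# DDIM-style sampler: 10 denoising steps instead of the full 1000, a common
# trick for real-time diffusion inference. Schedule constants are illustrative;
# `model` is the hypothetical MotionDiT sketched above, not VASA-1's real model.
import torch

@torch.no_grad()
def fast_sample(model, audio_features, motion_dim=64, steps=10, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)           # cumulative signal level
    timesteps = torch.linspace(T - 1, 0, steps).long()      # sparse step schedule

    B, L, _ = audio_features.shape
    x = torch.randn(B, L, motion_dim)                       # start from pure noise
    for i, t in enumerate(timesteps):
        eps = model(x, audio_features,
                    torch.full((B,), t.item(), dtype=torch.long))
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean motion
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    return x

motion_latents = fast_sample(MotionDiT(), torch.randn(1, 80, 128))
```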

In addition, VASA-1 accepts optional control signals from the user, such as the character's gaze direction and emotional tone, to further improve controllability.
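
One plausible way such optional signals could be wired in is as extra per-frame conditioning concatenated onto the audio features; the encodings below (yaw/pitch gaze, one-hot emotion) are purely a hypothetical illustration, not the paper's actual interface:

```python
# Hypothetical: optional control signals folded into the conditioning stream.
# Gaze and emotion encodings here are toy choices; VASA-1's actual format differs.
import torch

def build_condition(audio_features, gaze=None, emotion=None):
    """Concatenate optional control signals onto per-frame audio conditioning."""
    B, L, _ = audio_features.shape
    extras = []
    if gaze is not None:                        # e.g. (B, 2) yaw/pitch direction
        extras.append(gaze[:, None, :].expand(B, L, -1))
    if emotion is not None:                     # e.g. (B, 4) one-hot emotional tone
        extras.append(emotion[:, None, :].expand(B, L, -1))
    return torch.cat([audio_features] + extras, dim=-1)

cond = build_condition(torch.randn(1, 80, 128),
                       gaze=torch.tensor([[0.1, -0.2]]),
                       emotion=torch.eye(4)[None, 0])
print(cond.shape)  # torch.Size([1, 80, 134])
```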

The cost of AI fraud is getting lower and lower

After the initial shock at VASA-1's results, many people began to wonder: is it really appropriate to release technology that makes AI digital humans this realistic?

After all, we've seen too many examples of scams using AI to fake audio and video.

Just over two months ago, scammers impersonated a company's CFO on a video conference and defrauded the firm of 180 million yuan.

The Microsoft team is also aware of the issue and has made the following statement:

Our research focuses on generating visual affective skills for virtual AI avatars, with the aim of enabling positive applications. It is not intended to create content that misleads or deceives.

However, like other related content-generation techniques, it could still be abused to impersonate humans.

We oppose any creation of misleading or harmful content depicting real people, and we are interested in applying our technique to advance forgery detection......

So far, only the VASA-1 paper has been published; it seems no demo or open-source code will be released anytime soon.

Microsoft says videos generated by the method still contain identifiable traces, and numerical analysis shows a remaining gap in authenticity compared with real video.

Even without professional evaluation methods, the naked eye can indeed spot flaws in the current VASA-1 demo videos if you look carefully or compare them directly with real footage.

For example, the teeth occasionally deform.

And the eyes are not as expressive as a real person's. (The eyes really are the windows of the soul.)

Still, at the pace of "one day in AIGC, one year in the human world", it may not take long to fix these flaws.

And can you really stay vigilant at all times, telling real videos from fake ones?

Seeing is no longer believing. Distrusting every video by default has become the choice many people make today.

In any case, as one netizen concluded:

We can't undo what has been done; we can only embrace the future.

Paper:

https://arxiv.org/abs/2404.10667

Reference Links:

[1]https://www.microsoft.com/en-us/research/project/vasa-1/

[2]https://x.com/bindureddy/status/1780737428715950460

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
