Microsoft's viral single-photo digital human: the same idea as Sora, and "more realistic than AI Liu Qiangdong"

Author: QbitAI

Mengchen, from Aofei Temple

QbitAI | WeChat official account QbitAI

The threshold for AI-faked live-person video has just been lowered again.

Microsoft has released VASA-1, a technology that generates a digital human from a single picture. Netizens who saw it called the effect "explosive", even more realistic than "AI Liu Qiangdong".

Without further ado, let's go straight to the one-minute demo video:

No training on a specific person is required: just upload a picture of a face and an audio clip, and the face doesn't even have to be a real person's.

For example, you can have the Mona Lisa rap, imitating Anne Hathaway's famous improvised rap complaining about the paparazzi.

Or have a sketched portrait deliver Hua Qiang's lines.

More one-minute and 15-second videos are available on the project homepage.

The digital humans span different genders, ages, and ethnicities, and speak with different accents.

According to the team's description in the paper, VASA-1 has the following characteristics:

  • Precise synchronization of lip shape and speech

This is the most basic requirement, and VASA-1 also ranks at the top in quantitative evaluations.

  • Rich and natural facial expressions

The photo doesn't just "speak": eyebrows, gaze, and micro-expressions all move in coordination, so the face doesn't look dull.

  • Human-like head movements

Well-timed nods, head shakes, and tilts while speaking make the character look more vivid and convincing.

Overall, the eyes still show some flaws on close inspection, but netizens have rated it "the best demonstration so far".

What's even scarier, though, is that the whole system's inference runs in real time.

It generates 512×512 video at up to 40 fps on a single NVIDIA RTX 4090 graphics card.

So, how did VASA-1 do this?

Three key technologies, the same idea as Sora

In a nutshell:

Instead of generating video frames directly, motion codes are generated in a latent space and then decoded back into video.

Doesn't that sound like Sora's approach?

In fact, VASA-1's model architecture is a Diffusion Transformer, the same core component as Sora's.
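
To make the idea concrete, here is a minimal PyTorch sketch of the "generate motion latents, then decode" pipeline. Everything here is an illustrative assumption, not VASA-1's actual architecture: toy linear layers stand in for the real encoder and decoder, a GRU stands in for the diffusion transformer, and frames are shrunk to 64×64 instead of 512×512.

```python
# Minimal sketch of "generate in latent space, then decode to video".
# All module names, shapes, and layers are toy stand-ins, NOT VASA-1's real design.
import torch
import torch.nn as nn

IMG = 3 * 64 * 64  # toy frame size; the real system works at 512x512

class TalkingFacePipeline(nn.Module):
    def __init__(self, latent_dim=256, audio_dim=128):
        super().__init__()
        self.face_encoder = nn.Linear(IMG, latent_dim)          # photo -> appearance latent
        self.motion_generator = nn.GRU(audio_dim, latent_dim,
                                       batch_first=True)        # audio -> motion latents
        self.frame_decoder = nn.Linear(2 * latent_dim, IMG)     # latents -> frame pixels

    def forward(self, photo, audio_features):
        appearance = self.face_encoder(photo.flatten(1))        # (B, latent_dim), static
        motion, _ = self.motion_generator(audio_features)       # (B, T, latent_dim)
        T = motion.shape[1]
        appearance = appearance[:, None, :].expand(-1, T, -1)   # repeat per frame
        frames = self.frame_decoder(torch.cat([appearance, motion], dim=-1))
        return frames.view(-1, T, 3, 64, 64)                    # decoded video frames

pipe = TalkingFacePipeline()
video = pipe(torch.randn(1, 3, 64, 64), torch.randn(1, 80, 128))
print(video.shape)  # torch.Size([1, 80, 3, 64, 64])
```

In the real system, the motion generator is the diffusion transformer described below; but the split into appearance latent, motion latents, and a decoder is the part that mirrors Sora's latent-space approach.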

According to the paper, there are three key technologies behind it:

First, face latent space learning, where the key is disentanglement.

From a large number of real talking-face videos, the team learned a well-structured latent space for faces.

In this latent space, factors such as identity, appearance, expression, and head pose are separated from one another. That way, the same motion can drive different faces, and swapping in a new face still looks natural.
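
A minimal sketch of what such disentanglement enables, assuming a toy codec whose latent splits into an identity part and a motion part (all names and dimensions are hypothetical, not the paper's):

```python
# Hypothetical sketch of a disentangled face latent space: the encoder splits a
# frame into an identity/appearance code and a motion (expression + pose) code,
# so motion from one face can drive another. Names and sizes are illustrative.
import torch
import torch.nn as nn

class DisentangledFaceCodec(nn.Module):
    def __init__(self, img_dim=3 * 64 * 64, id_dim=128, motion_dim=64):
        super().__init__()
        self.id_dim = id_dim
        self.encoder = nn.Linear(img_dim, id_dim + motion_dim)
        self.decoder = nn.Linear(id_dim + motion_dim, img_dim)

    def encode(self, frame):
        z = self.encoder(frame.flatten(1))
        return z[:, :self.id_dim], z[:, self.id_dim:]       # (identity, motion)

    def decode(self, identity, motion):
        return self.decoder(torch.cat([identity, motion], dim=-1))

codec = DisentangledFaceCodec()
id_a, _ = codec.encode(torch.randn(1, 3, 64, 64))           # appearance of face A
_, motion_b = codec.encode(torch.randn(1, 3, 64, 64))       # motion of face B
frame = codec.decode(id_a, motion_b)                        # face A doing B's motion
```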

Second, the holistic motion generation model, where the key is unification.

Unlike previous methods that model local movements separately, such as the lips, eyes, eyebrows, and head pose, VASA-1 encodes all facial dynamics together and models their probability distribution with a Diffusion Transformer, the same core component used in Sora.

This not only produces more coordinated, natural overall motion, but also captures long-range dependencies thanks to the Transformer's strong temporal modeling capabilities (a minimal sketch of such a denoiser follows the list below).

For example, given an original motion sequence (the first column in the figure below), the model can:

  • keep the original head pose while changing the facial expression (second column)
  • keep the original facial expression while changing the head pose (third column)
  • keep the original facial expression while generating an entirely new head pose (fourth column)
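
Here is a minimal sketch of that idea: a Transformer denoiser that predicts the noise over a whole sequence of holistic motion latents, conditioned on per-frame audio features and a diffusion timestep. All shapes, names, and the conditioning scheme are assumptions, not the paper's exact design.

```python
# Hypothetical Diffusion-Transformer-style denoiser over holistic motion latents:
# it predicts the noise added to a whole sequence of motion codes, conditioned on
# audio features and a diffusion timestep. Shapes and names are assumptions.
import torch
import torch.nn as nn

class MotionDiT(nn.Module):
    def __init__(self, motion_dim=64, audio_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)         # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.noise_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio_features, t):
        # Fuse audio conditioning and the timestep embedding token-wise, then let
        # self-attention capture long-range dependencies across the sequence.
        h = self.motion_in(noisy_motion) + self.audio_in(audio_features)
        h = h + self.time_emb(t)[:, None, :]
        return self.noise_out(self.backbone(h))             # predicted noise (B, T, motion_dim)

model = MotionDiT()
eps = model(torch.randn(2, 80, 64),                         # noisy motion latents
            torch.randn(2, 80, 128),                        # per-frame audio features
            torch.randint(0, 1000, (2,)))                   # random timesteps
```

Because the denoiser sees the whole motion sequence at once, lips, eyes, eyebrows, and head pose come out of a single distribution instead of separately stitched models.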

Finally, efficient inference.

To achieve real-time synthesis, the team heavily optimized the diffusion model's inference process.
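
The article doesn't detail which optimizations were used, but one standard trick for speeding up diffusion sampling is to run far fewer denoising steps than the training schedule, e.g. a deterministic DDIM-style loop. A hedged sketch, reusing the hypothetical MotionDiT above (the noise-schedule constants are also illustrative):

```python
# DDIM-style sampler: 10 denoising steps instead of the full 1000, a common
# trick for real-time diffusion inference. Schedule constants are illustrative;
# `model` is the hypothetical MotionDiT sketched above, not VASA-1's real model.
import torch

@torch.no_grad()
def fast_sample(model, audio_features, motion_dim=64, steps=10, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)           # cumulative signal level
    timesteps = torch.linspace(T - 1, 0, steps).long()      # sparse step schedule

    B, L, _ = audio_features.shape
    x = torch.randn(B, L, motion_dim)                       # start from pure noise
    for i, t in enumerate(timesteps):
        eps = model(x, audio_features,
                    torch.full((B,), t.item(), dtype=torch.long))
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean motion
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    return x

motion_latents = fast_sample(MotionDiT(), torch.randn(1, 80, 128))
```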

In addition, VASA-1 accepts optional control signals from the user, such as the character's gaze direction and emotional tone, to further improve controllability.
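
One plausible way such optional signals could be wired in is as extra per-frame conditioning concatenated onto the audio features; the encodings below (yaw/pitch gaze, one-hot emotion) are purely a hypothetical illustration, not the paper's actual interface:

```python
# Hypothetical: optional control signals folded into the conditioning stream.
# Gaze and emotion encodings here are toy choices; VASA-1's actual format differs.
import torch

def build_condition(audio_features, gaze=None, emotion=None):
    """Concatenate optional control signals onto per-frame audio conditioning."""
    B, L, _ = audio_features.shape
    extras = []
    if gaze is not None:                        # e.g. (B, 2) yaw/pitch direction
        extras.append(gaze[:, None, :].expand(B, L, -1))
    if emotion is not None:                     # e.g. (B, 4) one-hot emotional tone
        extras.append(emotion[:, None, :].expand(B, L, -1))
    return torch.cat([audio_features] + extras, dim=-1)

cond = build_condition(torch.randn(1, 80, 128),
                       gaze=torch.tensor([[0.1, -0.2]]),
                       emotion=torch.eye(4)[None, 0])
print(cond.shape)  # torch.Size([1, 80, 134])
```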

The cost of AI fraud is getting lower and lower

After the initial shock at VASA-1's results, many people began to wonder: is it really appropriate to release technology that makes AI digital humans this realistic?

After all, we've seen too many examples of scams using AI to fake audio and video.

Just over two months ago, scammers impersonated a company's CFO on a video conference and defrauded the firm of 180 million yuan.

The Microsoft team is also aware of the issue and has made the following statement:

Our research focuses on generating visual affective skills for virtual AI avatars, with the aim of enabling positive applications. It is not intended to create content that misleads or deceives.

However, like other related content-generation techniques, it could still be abused to impersonate humans.

We oppose any creation of misleading or harmful content depicting real people, and we are interested in applying our technique to advance forgery detection......

So far, only the VASA-1 paper has been published; it seems no demo or open-source code will be released anytime soon.

Microsoft says videos generated by the method still contain identifiable traces, and numerical analysis shows a remaining gap in authenticity compared with real video.

Even without professional evaluation methods, the naked eye can indeed spot flaws in the current VASA-1 demo videos if you look carefully or compare them directly with real footage.

For example, the teeth occasionally deform.

And the eyes are not as expressive as a real person's. (The eyes really are the windows of the soul.)

Still, at the pace of "one day in AIGC, one year in the human world", it may not take long to fix these flaws.

And can you really stay vigilant at all times, telling real videos from fake ones?

Seeing is no longer believing. Distrusting every video by default has become the choice many people make today.

In any case, as one netizen concluded:

We can't undo what has been done; we can only embrace the future.

Paper:

https://arxiv.org/abs/2404.10667

Reference Links:

[1]https://www.microsoft.com/en-us/research/project/vasa-1/

[2]https://x.com/bindureddy/status/1780737428715950460

— END —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology trends
