
Microsoft showcases VASA-1 AI model that can turn photos into "talking faces"

Author: cnBeta

A new AI research paper from Microsoft looks to the future: upload a photo and a sample of your voice, then create a vivid avatar of a talking person. The AI model, called VASA-1, takes a single portrait photo and audio file and converts it into an ultra-realistic video of a face, including lip sync, realistic facial features, and head movements.


The model is currently only in research preview and can't be tried out by anyone outside of Microsoft's research team, but the demo video looks impressive.

Runway and NVIDIA have introduced similar lip-sync and head-movement techniques, but this technology appears to offer noticeably higher quality and realism, with fewer artifacts around the mouth. This approach to audio-driven animation is also similar to the VLOGGER AI model recently introduced by Google Research.

How does VASA-1 work?

Microsoft says it's a new framework for creating lifelike talking faces, designed specifically for animating virtual characters. All of the faces in the examples are synthetic, created with DALL-E, but if the model can animate realistic AI-generated images, it should be able to animate real photos as well.

In the demo, people speak as if they were filmed on camera; the movements are occasionally slightly jerky but otherwise look very natural. The lip sync is impressive, and the videos show none of the artifacts around the mouth found in other tools.

Perhaps the most impressive aspect of VASA-1 is that it doesn't require a front-facing portrait image to work.

The demo includes examples of faces captured from different angles. The model also offers a strong degree of control, accepting gaze direction, head distance, and even emotion as inputs to guide generation, as sketched below.
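To make those control signals concrete, here is a minimal hypothetical sketch of what conditioning an audio-driven talking-face generator might look like. VASA-1 has no public API, so `ControlSignals`, `generate_talking_face`, and every parameter name below are illustrative assumptions, not Microsoft's actual interface.

```python
# Hypothetical sketch only: VASA-1 has no public API.
# All names below are invented to illustrate the kinds of
# conditioning signals the demo describes.
from dataclasses import dataclass


@dataclass
class ControlSignals:
    gaze_direction: str = "camera"  # e.g. "camera", "left", "right"
    head_distance: float = 1.0      # relative distance from the camera
    emotion: str = "neutral"        # e.g. "neutral", "happy", "angry"


def generate_talking_face(portrait_path: str, audio_path: str,
                          controls: ControlSignals) -> str:
    """Pretend entry point: single portrait + speech audio -> video file.

    A real system would encode the face, extract audio features, and
    run a generative motion model conditioned on `controls`; this stub
    only shows the shape of the interface.
    """
    print(f"Animating {portrait_path} with {audio_path}")
    print(f"Conditioning on: {controls}")
    return "talking_face.mp4"


video = generate_talking_face(
    "portrait.png", "speech.wav",
    ControlSignals(gaze_direction="left", head_distance=0.8, emotion="happy"),
)
```

The point of the sketch is that the demo's controls behave like ordinary conditioning inputs rather than post-processing tweaks: gaze, distance, and emotion steer the generation itself.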

What is the significance of VASA-1?


One of the most obvious use cases is advanced lip sync in games. AI-driven NPCs with natural lip movements could transform in-game immersion.

It could also be used to create avatars for social media videos, a technique companies like HeyGen and Synthesia have already adopted. Another area is AI-based filmmaking: if you can make an AI singer look like they're actually singing, you can create more realistic music videos.

Still, the team says this is only a research demonstration, with no plans for a public release or even access for developers to use it in products.

How effective is VASA-1?


To the researchers' surprise, VASA-1 was able to lip-sync song lyrics flawlessly, even though no music was included in the training dataset. It can also handle different styles of images, including the Mona Lisa.

They had it generate 512x512-pixel video at 45 frames per second, which took about 2 minutes on a desktop-class NVIDIA RTX 4090 GPU.
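As a rough back-of-the-envelope check, those numbers suggest roughly real-time generation. The article does not state the clip length, so the two-minute clip below is an assumption used purely for illustration.

```python
# Back-of-envelope throughput check for the reported numbers.
# Assumption: the ~2-minute generation time refers to a ~2-minute clip;
# the article does not say so explicitly, so this is illustrative only.
fps = 45            # reported output frame rate
clip_seconds = 120  # assumed clip length (2 minutes)
gen_seconds = 120   # reported generation time (~2 minutes)

total_frames = fps * clip_seconds              # 5400 frames
generated_fps = total_frames / gen_seconds     # ~45 frames generated per second

print(f"{total_frames} frames generated at ~{generated_fps:.0f} fps")
# -> 5400 frames at ~45 fps, i.e. roughly real-time on an RTX 4090
```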

While the team says VASA-1 is research-only, it would be a shame if it never became publicly available, even just for developers. Given Microsoft's huge stake in OpenAI, it could even become part of a future Copilot integration with Sora.
