Alibaba Cloud launches Live Portrait, a digital human video generation tool that makes photos speak

On August 16, Alibaba Cloud launched Live Portrait, a digital human video generation tool. Users upload a photo together with text or a voice clip, and the tool generates a video of a digital human speaking. It can be applied to scenarios such as live streaming, chatbots, and enterprise marketing. The tool is now available in the Creative Space of the ModelScope (Modai) community, an open-source model platform.

After opening Live Portrait, users upload an image and choose one of two driving modes to make the image "move":

The first is text-driven mode: after entering text and selecting a presenter voice, users can generate a video. The tool offers 28 voices, including Mandarin, English, Cantonese, and children's voices. Users can also choose whether to enable lip and teeth repair and can adjust the blink frequency to improve the accuracy of the lip shapes in the video.

The second is audio-driven mode: users upload an audio file of up to 30 seconds, and Live Portrait recognizes the audio content, matches the lip shapes in the photo to it, and generates a video.

Since large conversational models and AI image generation models became popular, industry research on generative AI has gradually moved toward more modalities, and AI video generation is one of the most active areas. This technology converts information such as text or audio into facial motion information, which in turn drives the animation of the person in the photo, lowering the barrier to video shooting and production.

The Live Portrait tool consists of a motion module and a generation module:

Alibaba Cloud's self-developed lip shape prediction algorithm produces lip shapes with higher accuracy than traditional methods.

In the training stage, explicit pose control is added, so videos with arbitrary movements can be generated without a template video, which improves the realism of the digital human's speech.

In addition, through active eye control technology, Live Portrait can add natural movement to the eyes, bringing the generated result closer to a real person in look and feel.

Zhang Bang, head of algorithms for the tool, said: "Live Portrait integrates a number of innovative technologies developed by the team, such as generating realistic facial animation from only a single image, breaking through the limitations of traditional generative adversarial networks. As the technology iterates further, image-to-video generation has huge room for application and is expected to become a production tool that helps enterprises reduce costs and increase efficiency."

According to reports, the team's research covers digital humans, AI generation of 3D models, high-fidelity rendering, natural human-computer interaction, and other fields, and the team has published more than 50 papers at top international conferences.

Live Portrait is not the first tool to make photos of people "speak". Similar digital human video generation platforms in China include the "Photo Broadcast" feature of the Tencent Smart Shadow mini program, Laihua, and the AIGC marketing platform KreadoAI, which are used mainly in e-commerce, live streaming, and similar scenarios.

Similar tools exist abroad as well. For example, many popular "little monk" IPs have recently appeared on short video platforms, often with more than one million followers.

These "little monks" are virtual digital humans: creators generate the images of the little monks with Midjourney, pair them with copy generated by GPT, and finally use the overseas tool Studio D-ID to make the images "move".

Studio D-ID is an overseas digital human generation tool that combines face animation technology with the text generation capability of GPT-3 and the image generation capability of Stable Diffusion to generate video. Users simply upload a portrait photo and enter text; the system automatically converts the text to speech, synchronizes it with the mouth shape of the digital character, and generates a highly realistic talking video.

Tencent News creator AI Talk took a similar approach, using Midjourney to generate a virtual character and an image-to-video tool to animate it, and co-produced a video with Tencent Technology on the impact of room-temperature superconductivity on the technological revolution.