
Decoding the "Domestic Sora": It, Too, Comes from Tsinghua

Author: IT Times

Vidu generates a 16-second video with one click

Author/ IT Times reporter Shen Yibin

Editors/ Hao Junhui, Sun Yan

A white off-road vehicle kicks up a trail of dust as it speeds through a forest, the trees beside it slipping out of frame as the vehicle draws near; in a seaside cabin, sunlight pours into the room as the camera glides slowly across the balcony, looking down on boats floating on a calm sea, before the shot settles: the waves, the boats' reflections, and the blue sky and white clouds in the distance all look strikingly real; a panda sits by a lake, swaying its arms to the beat as it plays guitar......


On April 27, at the Future Artificial Intelligence Pioneer Forum of the 2024 Zhongguancun Forum Annual Conference, Vidu, a large text-to-video model jointly developed by Shengshu Technology and Tsinghua University, made its public debut. According to its creators, Vidu can simulate the real physical world, shows rich imagination, handles multi-shot cinematic language with high spatio-temporal consistency, and has a strong grasp of Chinese cultural elements.

Two months earlier, OpenAI had unveiled Sora, a new text-to-video model that arrived with overwhelming force and became the "blockbuster" of the artificial intelligence industry. Since then, many model vendors have released video-generation models, but none has managed to match Sora.

A domestic model has now broken this deadlock in its own way, and Vidu has been dubbed the "domestic Sora". Compared with Sora, however, Vidu still lags in generation length and realism, and computing power remains a huge challenge.

The "Reverse Restoration" of a Basin of Ink

In the past, short-video generation worked by first producing key frames and then expanding them into a continuous time series, essentially predicting the frames before and after each key frame. This approach made it difficult to keep long sequences of frames visually coherent.

Sora extends the diffusion model along the temporal dimension, ensuring not only that each individual frame is high quality but also that transitions between frames are smooth and coherent. Before Sora, most text-to-video models could only generate clips of a few seconds to a dozen or so seconds; Sora can generate videos up to 60 seconds long.

Zhang Xudong, head of product at Shengshu Technology, said in a media interview that a diffusion model essentially learns a probability distribution: as the model scales up, the learned distribution moves closer to reality, and the generated results become more realistic.

Judging from the demo videos released so far, Vidu already shows a strong ability to simulate the real world. Compared with Sora, however, its output leans toward an oil-painting look, details in complex scenes are less accurate, and there is still a considerable gap in clip length.


Video generated by Vidu


Video generated by Sora

Li Jiaxin (a pseudonym), an insider at Shengshu Technology, told the IT Times: "Simply put, a diffusion model is like a drop of ink falling into water and gradually spreading until the whole basin turns black." He explained that video-generation training begins with a forward diffusion process, which progressively adds noise to clean data (such as images or text) until the data becomes completely random noise.

The second stage is reverse denoising. Starting from a sample of pure Gaussian noise, the model works backwards, iteratively predicting and removing noise to gradually reconstruct the data. Each step makes the sample a little clearer and closer to a real example: turning the basin of ink back into a basin of clear water.
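The ink analogy maps directly onto the standard diffusion forward process. The sketch below is a minimal illustration of that process for a single scalar value; the linear noise schedule and its parameters are generic DDPM-style assumptions, not Vidu's actual configuration. It shows how the surviving fraction of the original signal (`a_bar`) shrinks toward zero as noising steps accumulate:

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # A linear noise schedule (an assumption; the article does not
    # describe Vidu's real schedule). beta_t is the noise added at step t.
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def forward_diffuse(x0, t, betas, rng):
    # Closed-form forward process: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps,
    # where a_bar is the product of (1 - beta) up to step t. As t grows,
    # a_bar -> 0 and the sample becomes pure Gaussian noise ("a basin of ink").
    a_bar = 1.0
    for b in betas[: t + 1]:
        a_bar *= 1.0 - b
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps, a_bar

rng = random.Random(0)
betas = make_schedule()
_, a_bar_early = forward_diffuse(1.0, 10, betas, rng)    # early: signal mostly intact
_, a_bar_late = forward_diffuse(1.0, 999, betas, rng)    # late: almost pure noise
```

After only 10 steps nearly all of the original signal survives; after 1,000 steps the signal fraction is effectively zero, which is the "whole basin turns black" state the reverse process must undo.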

Both Sora and Vidu extend this denoising process along the time dimension. During training, an objective function guides each denoising step so that the current noisy state moves closer to the original data distribution; this requires training one or more prediction models to accurately estimate and subtract the noise components in the data. Crucially, the model must account not only for the denoising of each individual frame but also for the dynamic continuity and smoothness between adjacent frames.

A Technical Route Released Before Sora

Vidu's underlying architecture is U-ViT (Uni-Vision Transformer), launched by Shengshu Technology in September 2022 and billed as the world's first architecture to fuse Diffusion (diffusion probabilistic models) with the Transformer. Two months later, scholars from UC Berkeley and New York University released the DiT (Diffusion Transformer) architecture, which is widely considered the main technical basis of Sora.

The Transformer is the core architecture of large language models such as ChatGPT and Wenxin Yiyan, with strong capabilities in parallel processing, long-sequence handling, context understanding, flexibility, and scalability, while the Diffusion architecture is at the heart of large image models and is key to high image quality. Combining the two yields U-ViT: an architecture that scales flexibly, understands context, and can generate high-definition images.
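A core idea behind feeding video into a Transformer-based diffusion backbone is to flatten the clip into a sequence of spacetime patch tokens that the Transformer can denoise jointly. The sketch below illustrates that tokenization step only; the video dimensions and patch sizes are illustrative assumptions, not U-ViT's or Vidu's actual configuration.

```python
def patchify_video(video, pt, ph, pw):
    """Flatten a video (frames x height x width) into spacetime patch tokens.

    Each token is one pt x ph x pw block of pixel values. Treating the video
    as a token sequence is what lets a Transformer model noise across both
    space and time at once. Sizes here are toy values for illustration.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                tokens.append(patch)
    return tokens

# A tiny 4-frame, 8x8-pixel "video" of zeros:
video = [[[0.0] * 8 for _ in range(8)] for _ in range(4)]
tokens = patchify_video(video, pt=2, ph=4, pw=4)
# 2 temporal x 2 vertical x 2 horizontal blocks = 8 tokens of 2*4*4 = 32 values
```

The same sequence-of-tokens view is why the architecture inherits the Transformer's long-sequence and context-understanding strengths mentioned above.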

Although the Shengshu Technology team was the first to find this technical route for text-to-video, U.S. restrictions on exporting computing hardware to China and rising computing costs led the team in 2023 to focus on models with smaller compute requirements, such as text-to-image and text-to-3D; in March of that year, it open-sourced UniDiffuser, a large model based on multimodal fusion.

In January 2024, the Vidu team achieved a breakthrough: 4-second video generation.

In February, Sora's release came as quite a jolt to the team, and over the next two months everyone held their breath and accelerated development. Today, Vidu can generate videos up to 16 seconds long.


When Will It Catch Up with Sora?

Zhu Jun, a professor at Tsinghua University and chief scientist of Shengshu Technology, acknowledged frankly that long video brings new challenges in compute consumption and in the network bandwidth of distributed training systems, which must be tackled step by step; it also requires more computing power and high-quality, well-governed training data. Algorithm principles, model architecture, data governance, and engineering implementation are likewise key to extending video duration.

To date, Shengshu Technology has completed several financing rounds, with backers including Qiming Venture Partners, Ant Group, BV Baidu Ventures, Datai Capital, Jinqiu Fund, Zhuoyuan Asia, Zhipu AI, and other well-known institutions and enterprises. With capital continuing to flow in, the vision of leading China's Sora-class models may well be within Vidu's reach.


Starting with Games, for Games

Like Sora, Vidu is not yet open to the public.

Li Jiaxin said that Vidu's main application scenarios in the future are likely to be games and film and television. Shengshu Technology can provide text-to-image, 3D model generation, and video generation, capabilities that are in the greatest demand in the gaming field.

Zhao Junbo, a researcher and doctoral supervisor under Zhejiang University's Hundred Talents Program, has speculated that Sora may have been trained on large-scale data generated by game engines. If Vidu follows the same route, games will become an important scenario for its deployment.

It starts with the game, for the game.

"For example, in the early stages of game creation, Vidu can help creators generate sketches of characters, scenes, and so on, which they can then keep refining. Traditionally, game 3D assets must be modeled by hand; with 3D model generation, all kinds of assets such as game props and player avatars can be produced automatically, and character promotional videos can be generated as well, improving the efficiency of game development. Game backgrounds, props, and character demonstrations can also be produced with video generation," Li Jiaxin said by way of example.

At present, the bottleneck for text-to-video models in game scenarios is model capability: much of the generated content offers only a fast but rough draft and still needs manual polishing afterwards.

On the day of the launch, Shengshu Technology officially announced the "Vidu Large Model Partner Program", hoping to build a cooperative ecosystem with upstream and downstream enterprises and research institutions across the industry chain. Tang Jiayu, co-founder and CEO of Shengshu Technology, said the company will continue to build a general-purpose foundation model covering text, image, video, and 3D modalities, and provide model services to business customers.
