
Zhipu AI pulls another "big move": turn any text into video in 30 seconds

After text generation and image generation, video generation has now joined the "involution" race.

On July 26, at Zhipu Open Day, Zhipu AI, one of the most active players on the large model track, officially launched its video generation model CogVideoX and unveiled two "big moves":

The first is Qingying, a video creation agent inside Zhipu Qingyan that generates 6-second high-definition videos at 1440x960 resolution from text or images.

The second is "Make Photos Move" in the Zhipu Qingyan Mini Program: upload a photo directly in the Mini Program, enter a prompt, and get an animated video.

Unlike products that are open only to a small circle or by reservation, Qingying is open to all users: enter a prompt, pick a style, including 3D cartoon, black and white, oil painting, or cinematic, add Qingying's built-in soundtrack, and you can generate imaginative short videos. Enterprises and developers can also call the API to access its text-to-video and image-to-video capabilities.
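For developers, a minimal sketch of such an API call might look like the following, based on the zhipuai Python SDK's documented interface; the method names and the "cogvideox" model id reflect the public docs at the time of writing and may change, so treat this as illustrative rather than authoritative:

```python
# A minimal sketch of calling CogVideoX through the zhipuai Python SDK.
# Method names and the "cogvideox" model id follow the SDK's public docs
# at the time of writing and may change -- illustrative, not authoritative.
import time

from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # replace with your own key

# Submit an asynchronous text-to-video task
task = client.videos.generations(
    model="cogvideox",
    prompt="Realistic style, close-up: a cheetah lying on the grass, "
           "its body rising and falling slightly as it breathes.",
)

# Poll until the task finishes, then print the video URL
while True:
    result = client.videos.retrieve_videos_result(id=task.id)
    if result.task_status == "SUCCESS":
        print(result.video_result[0].url)
        break
    if result.task_status == "FAIL":
        raise RuntimeError("video generation failed")
    time.sleep(5)
```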

All of which raises a question: video generation products are, so far, still at the merely "playable" stage, with a sizable gap to real commercial use. Does Qingying change that?

01 Qingying: faster and more controllable

After Sora set the video generation track alight, a chain reaction swept the industry: Runway, Pika, and other products took off overseas, and since April a string of text-to-video models has surfaced in China, with new products launching almost every month.

The market is getting livelier, but the user experience has run into a shared dilemma. To be precise, two problems cannot be sidestepped:

First, inference is slow: even a 4-second video takes about 10 minutes to generate, and the longer the video, the slower the generation;

Second, controllability is poor: within a narrow range of phrasings and training samples the results can be good, but once a prompt strays "out of bounds", the output collapses into a "dance of demons".

Some liken this to "pulling gacha" in a game: just try a few more times until the desired result drops. But the math is hard to hide: if a text-to-video model needs 25 attempts to yield one usable clip, and each generation takes around 10 minutes, a few seconds of video costs more than four hours, and the promised "productivity" never materializes.

After trying Qingying's text-to-video and image-to-video features in Zhipu Qingyan, we found two striking improvements: a 6-second video takes only about 30 seconds to generate, compressing inference time from minutes down to seconds; and with the prompt formula "camera language + scene setting + detailed description", two or three "pulls" are usually enough to get satisfactory video content.

Take text-to-video as an example. Given the prompt "realistic style, close-up, a cheetah lying on the ground, its body rising and falling slightly", Qingying produced an uncannily lifelike video within a minute: grass swaying in the wind in the background, the cheetah's ears twitching, its body rising and falling with each breath, every whisker rendered in detail. It could almost be mistaken for real close-up footage.

Why can Zhipu AI sidestep pain points the rest of the industry shares? Because technical problems are, in the end, solved by technical innovation.

Behind the video creation agent Qingying is CogVideoX, a video generation model developed by the Zhipu large model team. It adopts a DiT (Diffusion Transformer) architecture of the same family as Sora, able to fuse text, time, and space in one model.

Thanks to improved optimization techniques, CogVideoX's inference speed is 6 times that of the previous generation. To improve controllability, Zhipu AI built an end-to-end video understanding model that generates detailed, content-faithful descriptions for massive amounts of video data. This strengthens the model's text understanding and instruction following, so generated videos match user input more closely and the model can handle very long, complex prompts.
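As a rough illustration of that re-captioning idea (not Zhipu's actual pipeline; `VideoCaptioner` here is a hypothetical stand-in for the video understanding model), the data preparation loop might look like:

```python
# A conceptual sketch of the re-captioning idea, not Zhipu's actual
# pipeline: `VideoCaptioner` is a hypothetical stand-in for the
# end-to-end video understanding model described above.
from dataclasses import dataclass

@dataclass
class TrainingPair:
    video_path: str
    caption: str

class VideoCaptioner:
    """Hypothetical video understanding model."""

    def describe(self, video_path: str) -> str:
        # A real model would watch the clip and return a dense description
        # (subjects, motion, camera work, lighting, ...).
        return f"detailed description of {video_path}"

def build_training_set(video_paths: list[str]) -> list[TrainingPair]:
    """Pair each raw clip with a detailed machine-written caption."""
    captioner = VideoCaptioner()
    return [TrainingPair(p, captioner.describe(p)) for p in video_paths]

pairs = build_training_set(["clip_0001.mp4", "clip_0002.mp4"])
for pair in pairs:
    print(pair.video_path, "->", pair.caption)
```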

If similar products on the market are still working on being "usable", Zhipu AI, having gotten its innovations over the line first, has already moved on to "easy to use".

A direct example is the soundtrack feature Zhipu Qingyan launched in tandem, which matches the generated video with music; all the user has to do is hit publish. Whether a novice with no video production background or a professional content creator, anyone can turn imagination into productivity with Qingying.

02 The Scaling Law, verified once again

Behind every apparent surprise lies an inevitability. While similar products are either closed to the public or still in alpha, Qingying has become an AI video application anyone can use, and that is inseparable from Zhipu AI's years of deep work on video generation models.

Back in early 2021, nearly two years before ChatGPT took off, when terms like Transformer and GPT circulated only in academic circles, Zhipu AI launched CogView, a text-to-image model that could generate images from Chinese text and beat OpenAI's DALL·E on the MS COCO benchmark. CogView2 followed in 2022, tackling slow generation and low resolution.

Also in 2022, building on CogView2, Zhipu AI developed CogVideo, a video generation model that turns text input into realistic video content.


At the time, the outside world was still absorbed in conversational AI and video generation was not a trending topic, but in cutting-edge research circles CogVideo was already a "star".

For example, CogVideo's multi-frame-rate hierarchical training strategy introduced a recursive interpolation method: the video clips corresponding to each sub-description are generated stage by stage, then interpolated layer by layer into the final clip. This gives CogVideo control over the intensity of change during generation, helps align text and video semantics, and makes text-to-video conversion efficient.
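A conceptual sketch of that generate-then-interpolate hierarchy, with hypothetical stand-ins (`generate_keyframes`, `interpolate_pair`) in place of the learned models:

```python
# A conceptual sketch of hierarchical generate-then-interpolate, not
# CogVideo's implementation: `generate_keyframes` and `interpolate_pair`
# are hypothetical stand-ins for the learned models.
import numpy as np

def generate_keyframes(prompt: str, n: int) -> list[np.ndarray]:
    """Stand-in base model: produce n sparse, low-frame-rate keyframes."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return [rng.random((480, 720, 3)) for _ in range(n)]

def interpolate_pair(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stand-in interpolator: a plain average in place of a learned model."""
    return (a + b) / 2

def hierarchical_generate(prompt: str, levels: int = 2) -> list[np.ndarray]:
    frames = generate_keyframes(prompt, n=5)  # coarse keyframes first
    for _ in range(levels):                   # each level roughly doubles fps
        denser = []
        for a, b in zip(frames, frames[1:]):
            denser += [a, interpolate_pair(a, b)]
        denser.append(frames[-1])
        frames = denser
    return frames

clip = hierarchical_generate("a cheetah lying on the grass")
print(len(clip))  # 5 keyframes -> 9 -> 17 frames after two levels
```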

Meta's Make-A-Video, Google's Phenaki and MAGVIT, Microsoft's NUWA and DragNUWA, and NVIDIA's Video LDMs, among others, have all drawn on CogVideo's strategy, and the project has attracted wide attention on GitHub.

The newly upgraded CogVideoX carries many more innovations of this kind. For content coherence, Zhipu AI developed an efficient 3D variational autoencoder (3D VAE) that compresses the original video space to about 2% of its size, paired with a 3D RoPE positional encoding module; together they make it easier to capture relationships between frames along the time dimension and to establish long-range dependencies within the video.
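As a back-of-the-envelope illustration, a toy 3D encoder with 4x temporal and 8x8 spatial downsampling into a 16-channel latent lands almost exactly at that 2% figure. The numbers are chosen to match the quoted claim, not taken from a published CogVideoX spec:

```python
# A toy 3D encoder illustrating the compression arithmetic, not the
# published CogVideoX architecture: strides give 4x temporal and 8x8
# spatial downsampling into a 16-channel latent, which lands near the
# ~2% figure quoted above.
import torch
import torch.nn as nn

class Toy3DEncoder(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # Strided 3D convs; (time, height, width) strides multiply to (4, 8, 8)
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, frames, height, width)
        return self.net(video)

x = torch.randn(1, 3, 16, 256, 256)   # 16 RGB frames at 256x256
z = Toy3DEncoder()(x)                 # -> (1, 16, 4, 32, 32)
print(z.shape, f"{z.numel() / x.numel():.1%}")  # ~2.1% of the original size
```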

In other words, the arrival of the video creation agent Qingying is neither accident nor miracle, but the inevitable result of Zhipu AI's day-after-day innovation.

The large model industry has a well-known law, the Scaling Law: when no other factor is the constraint, model performance follows a power law in compute, parameter count, and data size, so increasing compute, parameters, or data can be expected to improve performance.
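For reference, the canonical power-law form of the Scaling Law (as formulated by Kaplan et al., 2020, for language models; the exponents and constants for video models are not public):

```latex
% Canonical power-law form of the Scaling Law (Kaplan et al., 2020):
% loss falls as a power of parameters N, data D, or compute C when the
% other factors are not the bottleneck.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```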

According to official information from Zhipu AI, CogVideoX was trained on a high-performance computing cluster in Yizhuang; partner Huace Film and Television took part in co-building the model, and another partner, Bilibili, participated in Qingying's technical development. By this logic, Qingying's better-than-expected generation speed and controllability once again confirm the effectiveness of the Scaling Law.

It is even foreseeable that, driven by the Scaling Law, subsequent versions of CogVideoX will generate videos at higher resolution and longer duration.

03 "Multimodality is the starting point of AGI"

One detail that is easy to overlook: Zhipu AI did not release Qingying as a standalone product, but launched it as an agent inside Zhipu Qingyan.

The reason traces back to remarks by Zhang Peng, CEO of Zhipu AI, at the ChatGLM large model launch: "2024 will be the first year of AGI, and multimodality is a starting point for AGI. To walk the road to AGI, it is not enough to stay at the level of language; we must take highly abstract cognitive ability as the core and integrate cognitive abilities across a series of modalities such as vision and hearing. That is real AGI."

At ICLR 2024 in May, the Zhipu large model team elaborated on this judgment of AGI's technical trajectory in a keynote: "Text is the key foundation for building large models, and the next step should be to mix multiple modalities such as text, image, video, and audio together for training, to build a truly native multimodal model."


Over the past year or so, large models have grown ever more popular, yet they have not escaped the limitations of a "brain in a vat", and application scenarios remain narrow. For large models to move from the virtual into the real and create value in everyday life and work, they need the ability to act: to hear and see beyond language, and through those abilities connect seamlessly with the physical world.

Seen in this light, the video generation model CogVideoX and the video creation agent Qingying point toward a rather different answer.

CogVideoX's text-to-video and image-to-video capabilities can be read as a decomposition of cognitive abilities, achieving breakthroughs in individual abilities first; and while native multimodal large models are not yet mature, users can still solve real-world problems efficiently and accurately by combining multiple agents.

The evidence is Zhipu AI's large model matrix, which now covers GLM-4/4V with vision and agent capabilities, GLM-4-Air with fast, cost-effective inference, the text-to-image model CogView-3, CharacterGLM for highly anthropomorphic custom characters, the Chinese-optimized embedding model Embedding-2, the code model CodeGeeX, the open-source GLM-4-9B, and the video generation model CogVideoX, letting customers call different models for different needs and find the optimal solution.

On the consumer side, Zhipu Qingyan now hosts more than 300,000 agents, including strong productivity tools such as mind maps, document assistants, and schedulers. Zhipu AI has also launched Qingyan Flow, a multi-agent collaboration system composed of hundreds of thousands of agents; rather than being limited to interaction with a single agent, it supports multi-round, multi-form, multi-dimensional dialogue, letting people handle highly complex tasks with concise natural language instructions.

To sum up: true AGI is still a long way off, but through "single-point breakthroughs plus capability aggregation", Zhipu AI is letting AGI shine into reality ahead of schedule, putting powerful large model capabilities to real use in people's work, study, and life.

04 Final thoughts

It must be acknowledged that current video generation models still leave much room for improvement: in understanding the laws of the physical world, in resolution, in the coherence of camera movement, and in duration.

On the road to AGI, large model vendors like Zhipu AI should not have to travel alone. As ordinary users, we can join in too: at the very least, we can turn our own wild ideas into interesting videos on Zhipu Qingyan, letting more people see the value of large models and, while using AI to raise creative efficiency, helping multimodal large models keep maturing.
