Zhang Peng, CEO of Zhipu AI (Image source: photographed and edited by Titanium Media App)
On the morning of July 26, domestic large-model unicorn Zhipu AI released "Qingying" (Ying), its AI video generation product, in Beijing. Qingying supports both text-to-video and image-to-video generation and is now fully available to all users on the "Zhipu Qingyan" app, with no reservation or waitlist required.
Qingying is built on CogVideoX, a large video generation model developed in-house by Zhipu. Through technical optimization, inference speed has been increased sixfold, cutting the theoretical generation time for a 6-second video to 30 seconds.
In terms of video parameters, Qingying currently generates 6-second AI videos at a resolution of 1440x960. Architecturally, Qingying does not strictly follow the DiT design that Sora has turned into something of a "consensus", but instead uses "a Transformer architecture developed by Zhipu that fuses the three dimensions of text, time, and space".
To address content coherence, Zhipu AI has independently developed an efficient 3D variational autoencoder (3D VAE) that compresses raw video data to about 2% of its original size, sharply reducing the cost and difficulty of training the video diffusion model. For controllability, Zhipu AI has built an end-to-end video understanding model that generates accurate, context-aware descriptions for large volumes of video data, strengthening the model's text understanding and instruction following so that generated videos better match users' input.
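As a rough illustration of where a compression figure on the order of 2% can come from (the downsampling factors and latent channel count below are assumptions for illustration, not numbers disclosed by Zhipu): a 3D VAE that compresses time by 4x and height and width by 8x each, while mapping 3 RGB channels into 16 latent channels, keeps roughly 2% of the original data volume.

```python
# Illustrative sketch only: assumed 3D VAE downsampling factors and latent
# channel count, chosen to be consistent with the reported ~2% figure.
temporal_ds = 4          # assumed temporal compression factor
spatial_ds = 8           # assumed height/width compression factor
in_channels = 3          # RGB input channels
latent_channels = 16     # assumed latent channel count

# Ratio of latent tensor elements to raw video tensor elements
ratio = latent_channels / (in_channels * temporal_ds * spatial_ds * spatial_ds)
print(f"latent / raw ≈ {ratio:.1%}")   # ≈ 2.1%
```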
The CogVideoX-based Qingying feature is now live on the PC, mobile app, and mini-program versions of Zhipu Qingyan, offering not only fast generation and strong instruction following but also better content coherence and more flexible shot composition.
Specifically, Qingying provides two modes: text-to-video and image-to-video:
- Text-to-video suits wildly imaginative scenes: puppies dancing at your fingertips, dolphins soaring into deep space, the universe flickering just for you. However complex or abstract the picture, describe the imagined scene in a sentence or two and Qingying renders it in finely detailed frames (a code sketch follows this list).
- Image-to-video unlocks more fun from an existing picture: upload an image plus a short description and the picture comes to life. You can animate the people in old photos to make memories feel more vivid and real, or have the figures in famous paintings and film stills do something wildly imaginative.
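For readers who want a concrete sense of what text-to-video on a CogVideoX-class model looks like in code, here is a minimal sketch. It assumes the CogVideoX weights are published on the Hugging Face Hub and usable through diffusers' CogVideoXPipeline; the repository ID, frame count, and sampling settings are illustrative assumptions rather than details from the Qingying launch, and the hosted Qingying feature itself is used through the Zhipu Qingyan app, not this code path.

```python
# Minimal text-to-video sketch, assuming CogVideoX weights are available via
# diffusers' CogVideoXPipeline. Repo ID and sampling settings are assumptions.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",        # assumed model repository, for illustration only
    torch_dtype=torch.float16,
).to("cuda")

prompt = "A dolphin leaps out of the ocean and glides into deep space, stars streaking past"
result = pipe(
    prompt=prompt,
    num_frames=49,               # a few seconds of footage at 8 fps
    num_inference_steps=50,
    guidance_scale=6.0,
)
export_to_video(result.frames[0], "qingying_sketch.mp4", fps=8)
```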
On pricing, Qingying is free for all users during the initial testing period. Those who want faster generation can pay 5 yuan to unlock high-speed channel access for one day (24 hours), or 199 yuan to unlock it for a full year.
At the launch event, Zhipu AI CEO Zhang Peng said that AI multimodal technology draws on how the human brain works: cognition is a complex system function accomplished through the cooperation of multiple brain regions handling text, vision, hearing, and more, so multimodal perception and understanding are closely tied to the development of human cognitive ability.
"The AI industry's exploration of multimodal models is still in its infancy, and we will continue to work hard to provide you with better models and better products." Zhang Peng said.
After the event, Zhang Peng spoke for nearly an hour with Titanium Media AGI and other outlets, covering topics including the commercialization of AI video applications, deployment scenarios, whether AI will replace the film and television industry, and competition in the large model market.
AI video generated by Qingying (Image source: Zhipu AI introduction video)
Zhang Peng said frankly that current AI video generation technology cannot fully replace the film and television industry and plays more of an assistive role, though AI is a positive force for change in the sector. For now it is probably not good enough to be used directly in audience-facing film and television production, at most supporting small-scale creation. "If AI is really going to take on something more demanding, like changing how a film is produced, it may still have a long way to go."
Zhang Peng believes AI video is currently used mainly for online e-commerce marketing and short-video self-media content. "But I believe it will certainly not be limited to these customers. What we have now is a stage in the process; which direction to develop next, and which issues become most critical for technological breakthroughs and applications, requires us to keep forming a closed loop both top-down and bottom-up."
On the commercialization of AI video generation, Zhang Peng said Qingying's monetization is still at an early stage, with most paid usage coming through APIs.
"The Qingying function is online, just like the opening to you just now, it is mainly a phased result, to say how perfect it is, it still needs to be solved in stages, to report to you about our progress, so that everyone can experience how much such a thing as video generation can be done under the premise that everyone can use it, rather than being locked in the laboratory or generating something with a small probability. From the current stage, whether it is 2C or 2B, it is still relatively early to move towards large-scale commercialization. Zhang Peng said.
Zhang Peng said that the compute and algorithm costs of video generation remain very high. "Building large models really is expensive, and there is real market demand, so you have to commercialize. We are working at different levels: fundamental technological breakthroughs and innovation consume the largest share of our resources and compute, and commercialization is advanced on that foundation."
Zhang Peng emphasized, "I believe the reason our peers have not opened this up is, to a large extent, cost: whether the system can withstand large numbers of users. That, too, is a choice."
Zhang Peng therefore argued that controllability is a prerequisite for commercializing AI-generated video and deserves serious investment, so that the creator's intent can be expressed accurately. "If the model can truly understand the deeper intent and semantics behind a few simple words, it can be made highly controllable."
On the gap with Sora, Zhang Peng acknowledged that Qingying is still a preliminary, staged result that has not yet matched the long-video quality Sora has demonstrated, and more work is needed.
"We've always been honest and admit the gap between us and OpenAI and the world's top level. However, we have to walk our own way, and we have always been following our own path. A lot of times, we're constantly catching up in our own way, for example. How to reduce the cost of video generation computing power, improve the response speed, and make it available to everyone, so we are pursuing the technical height at the same time, and at the same time pursuing the popularization and cost of technology, which are also some characteristics of our team. Zhang Peng said.
On competition and cooperation with ecosystem companies, Zhang Peng said frankly that in commercialization, serving customers is driven by Zhipu's core technology and product capabilities, while customer needs and feedback in turn drive technological innovation and breakthroughs, so the two form a tighter closed loop. The thinking is the same whether building 2C products or serving B-end enterprises. Some things are not in the direction of Zhipu's focus and may be handed to ecosystem partners or other parties, while whatever helps complete the closed loop is done in-house; that, he said, is how the company commercializes.
Looking ahead to the next stage of super-app development, Zhang Peng emphasized to Titanium Media AGI that Zhipu continues to position Qingyan as an "AI assistant" that helps users solve practical problems in work, study, and daily life, and improves productivity, efficiency, and convenience.
"We think that the so-called super app may not necessarily be 'super', and we are also gradually and imperceptibly getting people used to using this tool, which is also a good thing. So, this may not necessarily be a step-by-step change, but a subtle change. We are looking forward to using efficiency (Qingyan) tools in such an AI era to allow everyone to unconsciously change their living conditions, which is also the development direction of human-machine collaboration that we advocate. Zhang Peng said.
(This article was first published on Titanium Media App, author | Lin Zhijia, editor | Hu Runfeng)