Does Sora "kill" Jianying?

Sora's arrival is both good news and bad news for Zhang Yiming.

The AI model industry, already shaken once by OpenAI's ChatGPT, has been jolted again by the company's first video generation model, Sora.

Unlike Runway, Pika, and similar tools, which can only generate clips of less than 10 seconds with a single camera angle and heavily distorted content, Sora extends video length to 60 seconds, switches between multiple camera angles within a single video, and reproduces real-world scenes as faithfully as possible.

The scaling laws validated on ChatGPT, the "brute force works wonders" idea that increasing model size keeps improving performance, were carried over to video by OpenAI CEO Sam Altman and have proven effective there as well. The much-praised "emergent intelligence" of ChatGPT has appeared once again in Sora.

OpenAI CEO Altman

Facing this "dimensionality reduction attack" from Sora, entrepreneurs in AI video have reacted in different ways. Some, like Runway CEO Cristóbal Valenzuela, have declared "Game On". Some, like Pika founder Guo Wenjing, have begun preparing new products to match Sora. Others, like Stability AI CEO Emad Mostaque, can only marvel that "Altman is really a magician" and see Sora as the GPT-3 moment for AI video.

For ByteDance, however, this is not necessarily good news, because video generation, the track Sora is on, is exactly the AI innovation direction that ByteDance's Jianying is targeting. According to Jiemian News, Zhang Nan, the former CEO of Douyin, moved over from Douyin in February and is about to launch an AI product for images and videos.

Zhang Nan was planning to make a big splash in AI-generated video, but this internal re-entrepreneurship plan ran into Sora before its product had even launched.

OpenAI was the catalyst for Zhang Nan's decision to bet on AI and start a new venture. In 2022, the release of OpenAI's text-to-image model DALL-E 2 gave Zhang Nan a first intuitive sense of how disruptive AI image generation could be for traditional content creation. It represents not only an opportunity for ByteDance global CEO Liang Rubo to "produce a new creation platform", but also one of the reasons for Zhang Nan's move to Jianying.

Jianying's launch in 2019 helped Douyin's content ecosystem shift from PGC (professionally generated content) to UGC (user-generated content), greatly lowering the barrier to creation for users. Now, as the platform's content ecosystem moves toward a blended PUGC model, it places new demands on the cost and overall quality of user-created videos. AI video generation products make it realistically possible for every ordinary person to create videos, and Douyin and even TikTok may see a new surge in the number of content creators.

It is worth noting that AI video generation is a promising track for startups. By the end of 2023, a number of unicorns had sprung up in it: Midjourney was valued at $10 billion, Stability AI at $4 billion, and Runway at $1.5 billion. The upstart Pika, which shot to fame at the beginning of the year, has existed for less than a year and has already reached a valuation of $250 million.

But with Sora's sudden arrival, the time left for Zhang Yiming and Zhang Nan to incubate the next AI video generation unicorn is growing ever tighter.


Prior to Sora's debut, ByteDance was also developing AI video generation products internally.

In January, ByteDance researchers published a paper on arXiv describing a text-to-video model under development at ByteDance, named MagicVideo-V2. It automates text-to-video generation by integrating multiple modules: a text-to-image model, a video motion generator, a reference image embedding module, and an interpolation module.

The problems MagicVideo-V2 aims to solve are the low fidelity, unnatural motion, low resolution, and limited stylistic variety of videos generated by Runway and Pika.

The existing "image and text to video" feature in ByteDance's Jianying is plagued by the same problems when converting text to video.

While waiting for MagicVideo-V2 to mature and move from demo to production, Zhang Nan has spent the past month collecting more frustrations and expectations about AI video generation products from front-line creators in user interviews. One finding is that some creators, "in order to better express their ideas, find it almost impossible to complete all of their work with one product, and need complex editing and interaction processes across several products to complete their expression."

In August last year, the viral video "The Wandering Earth 3 Trailer", produced by the uploader (UP主) Digital Life Kazik, used a range of products in succession, including Midjourney and Runway, and took 5 days of post-production editing and splicing.

The main reason the creative process was so laborious for Digital Life Kazik is that AI software was not intelligent or convenient enough. Before Sora, the industry's default approach to text-to-video could only output short clips from a single, often static viewpoint, mostly with cyberpunk-style backgrounds.

Sora has overturned the old assumptions of text-to-video. However complex the changes of viewpoint and scene, they can be generated from a single prompt, which keeps generation convenient while keeping the output as consistent as possible with the real physical world.

Example of a Sora prompt. Source: screenshot from the official website

Sora got there first in delivering what ByteDance and Zhang Nan had planned for AI video: higher-fidelity output, sharper images, and smoother, more natural logical comprehension.

It should be noted that Sora, which has not yet been opened for external testing, still has many imperfections. According to its official statement, "it is still in the early stages of world model research and application."

Meta Chief AI Scientist Yann LeCun directly questioned Sora: "Just being able to generate realistic videos based on prompts does not mean that the system truly understands the physical world."

OpenAI also cautions on its official website that Sora may struggle to accurately simulate the physics of complex scenes and may not understand cause and effect. It may also confuse the spatial details of a prompt, such as mixing up left and right, and may have difficulty with precise descriptions of events that unfold over time, such as following a specific camera trajectory. These flaws can cause Sora to generate illogical videos, such as a person running in the wrong direction on a treadmill.

These unresolved flaws are one of the reasons OpenAI has decided not to fully open Sora for the time being. OpenAI is currently running private tests with a select group of users to assess potential harms or risks in key areas and to gather feedback for improving the model.


After the release of ChatGPT, the outside world began to see the AGI era as a real possibility, and video generation models such as Sora are undoubtedly important accelerators of AGI's arrival.

OpenAI wrote directly on its official website: "Sora provides the foundation for being able to understand and simulate real-world models, and we believe this capability will be an important milestone in achieving AGI."

OpenAI is not the only company that wants to use video generation models to advance AGI. In December, Runway announced plans to develop a General World Model that simulates the entire world using its video generation model Gen-2: "We believe that the next big advance in AI will come from systems that understand the visual world and its dynamics, which is why we are starting a new long-term research effort around the General World Model."

Understanding the physical laws of the real world has become the necessary path to AGI. Zhou Hongyi, the founder of 360, said bluntly when commenting on Sora that once AI is connected to cameras and has watched all existing videos, its ability to understand the world will far exceed what it can learn from text. "It's really not far from AGI. It's not a matter of 10 or 20 years; it may be achieved in a year or two."

It is precisely this excitement around AGI-related concepts that has sent valuations soaring for companies building vertical large models for AI image and video generation, and has produced a number of star unicorn startups such as Midjourney, Stability AI, and Runway.

At the level of ByteDance's own business, image and video generation can also serve the company's commercialization needs, for example by helping its advertisers produce videos cheaply and conveniently. A ByteDance source told LatePost that video production accounts for 10%-20% of its advertisers' total costs, and that since last year ByteDance has been developing related products to help advertisers reduce this spending.

Although ByteDance is a step behind in launching a comparable text-to-video product, for Zhang Nan this also brings an opportunity to "cross the river by feeling for the stones", with Sora as the stones.

Before the debut of ChatGPT, algorithmic shortcomings were one of the main obstacles to the industry's research and development of dialogue models. Dr. Ding Lei, an expert in artificial intelligence, explained that some large-model startups "are not so good at training large models... If the training method is wrong, no matter how many GPUs you have, it won't work."

In the course of catching up with Sora, Pika founder Guo Wenjing noted that an important limitation in the current development of generative video is the maturity of the algorithm, and that video has not previously had a good algorithm.

The release of Sora has undoubtedly once again shown the industry a workable solution, and has given entrepreneurs in the same field, such as Guo Wenjing and Zhang Nan, a mature algorithmic route to follow.


With Sora's official debut, ByteDance, which lagged behind in the last wave of conversational language models, once again finds itself in the awkward position of playing catch-up, this time in video.

After the release of ChatGPT in November 2022, Baidu, Alibaba, and other major domestic companies launched their self-developed large models Wenxin Yiyan and Tongyi Qianwen in March and April last year, but it was not until August that ByteDance unveiled its Yunque (Skylark) large model.

One consequence of this slow start is that while Wenxin Yiyan's monthly active users have exceeded 100 million, ByteDance's comparable product, Doubao, has fewer than 10 million.

At the end of January, Liang Rubo cited the slow progress on AI as an example of the company's sluggishness, saying that "the semi-annual technology review at the company level did not start to discuss GPT until 2023, while the large-model startups that have done better in the industry were founded between 2018 and 2021."

ByteDance was not actually late in paying attention to large models. According to a LatePost report, after OpenAI released GPT-3 in June 2020, ByteDance trained a large generative language model with billions of parameters.

The decision to assign a senior commander like Zhang Nan to transform Jianying with AI is now seen by outsiders as a signal that ByteDance wants to speed up its AI development.

Zhang Nan

However, with the internal reshuffle complete, the challenges facing Zhang Nan are not only the urgency of catching up but also the shortage of computing power caused by restrictions on external chip supply.

In October last year, the US export ban on five NVIDIA GPUs, the A100, A800, H100, H800, and L40S, officially took effect. For a number of domestic large-model developers, the interruption of NVIDIA chip supply has become the biggest obstacle to catching up with ChatGPT and even GPT-4.

Guided by scaling laws, Altman has proposed a "Moore's Law" for the era of large models, in which demand for chips doubles every three to four months. This undoubtedly raises the bar yet again for Zhang Nan's effort to catch up with Sora.

"For domestic manufacturers, now that this 'brute force works wonders' model has been proven, Silicon Valley will enter an even more frenzied computing-power arms race. ByteDance's computing-power shortfall will be further magnified," said Liu Fang, a researcher at China Merchants Securities.


