laitimes

Volcano Engine Plays Its Video Model Trump Card as Cloud Vendors Shift from "Price War" to Competing on Performance

Since Sora came out in February of this year, many people have been waiting for ByteDance's move. With Douyin and Jianying, two of the strongest video apps, in hand, expectations for ByteDance's video generation model have been high.

And here it is.

On September 24, ByteDance's Volcano Engine held an AI innovation tour event in Shenzhen and released several models at once, including the "Doubao Video Generation Model" and a "Music Generation Model".

Before this, a number of high-profile products of this kind had been released in China and abroad, including ByteDance's own Jimeng, Sponge Music, and new features in Jianying (and its overseas version, CapCut). The low-profile Sponge Music app is regarded by some as the music generation app best suited to Chinese-language users, a "Chinese Suno" in its own right.

Why did ByteDance choose September, when some people have grown numb to AI product launches, to release the large-model engine behind these AI apps?

In response, Tan Dai, president of Volcano Engine, told Geek Park that the release date was not a carefully planned milestone: AI models are improving by the day, and a model is released as soon as it is ready and suitable for external use.

The logic behind this is that Volcano Engine is positioned as ByteDance's ToB cloud platform and is responsible for opening models up to enterprises. Before a product is launched externally, however, it is first used internally and polished until it is enterprise-ready. The same was true of Doubao: ByteDance first launched the Doubao app, and Volcano Engine then launched the enterprise-grade Doubao large model in May this year.

He added, "It's not necessarily about being first; it's about launching mature products. Models will have a long-term impact over the next 10 to 20 years, so it pays to build up steadily, and there is nothing wrong with arriving late."

Over the next decade, Volcano Engine's goal is not to lead in one or two models, such as video generation, but to "become the world's leading cloud and AI service provider".

01 Backed by Douyin and Jianying, ByteDance's video generation model focuses on usage scenarios

The video generation model was the biggest highlight of the conference.

Tan said, "Because video is particularly difficult, we launched two models at once to fully address the various problems in video." The new members of the Doubao family, Doubao Video Generation-PixelDance and Doubao Video Generation-Seaweed, have officially opened invitation-only testing for the enterprise market.

Judging from the on-site demonstration, the Doubao video model can generate videos from text and image inputs. It is worth noting that ByteDance did not disclose the maximum length of video its models can generate, although that figure is considered a major indicator of technical capability.

The Doubao video generation model emphasizes three core capabilities needed for practical use across everyday and business scenarios.

The first is the model's understanding of complex instructions. In the video below, for example, the prompt was: "Close-up of a woman's face, a little angry, wearing a pair of sunglasses; at that moment, a man walks in from the right side of the frame and hugs her."

Given this relatively complex description, the video generated by the Doubao model shows the change in the person's mood, the correct order of actions, and the arrival of a new character who also interacts with the original one. In other words, the Doubao video model can follow sequential action instructions, generate multiple subjects, and have those subjects interact with each other.

The second feature of the Doubao video model is camera movement: the video can switch between large subject motions and different shots, and supports a range of camera language such as zooming, orbiting, panning, scaling, and target following.

The generated video can flexibly control the viewpoint, which is closer to the real-world experience | Video source: ByteDance

The third feature is multi-shot consistency. In AI-generated video, keeping the same subject consistent when cutting back and forth between shots is a common difficulty across the industry.

The video generated by Doubao under a single prompt can realize multiple camera switches while maintaining the consistency of the subject, style, and atmosphere. Source: ByteDance

Discussing the characteristics of the Doubao video generation model, Tan Dai said it has two advantages. The first is technological breakthroughs and full-stack capability: ByteDance has made a number of technical innovations in these two video models, such as an efficient DiT fusion computing unit, a newly designed diffusion model training method, and a deeply optimized Transformer structure, so that motion in the generated video is more flexible, shots are more varied, and details are richer.

The second is the understanding of video that comes from Douyin and Jianying. "Jianying's understanding of video helps the Doubao video generation model, and following instructions well is inseparable from the language model. Doubao is a full family of models, and the underlying foundation model helps it understand instructions better."

As for solutions tailored to video scenarios, the Doubao video model supports a range of styles, including black-and-white, 3D animation, 2D animation, Chinese painting, watercolor, and gouache, and aspect ratios including 1:1, 3:4, 4:3, 16:9, 9:16, and 21:9, covering commercial scenarios such as film, television, computers, and mobile phones.

The Doubao video generation model can quickly render a product in 3D and show it dynamically from multiple angles. It can also swap backgrounds and formats to match festivals such as the Mid-Autumn Festival, Qixi Festival, and Spring Festival, generate content in different sizes for publication on different platforms, and so fit into an overall marketing strategy.

For more specific scenarios, the Doubao video model also offers tailored solutions. In e-commerce marketing, for example, users can generate large volumes of video material for a product that matches marketing dates and fits the different sizes required by different media platforms.

The video launch session also included an easter egg: Volcano Engine showed a hands-on example of how Jianying and Jimeng use the video generation models internally. Zhang Nan (Kelly), who moved from Douyin to Jianying/CapCut, appeared as a digital avatar.

In this digital human video, Kelly's digital avatar moves as naturally as a real person, and the lip sync adapts to the languages of different countries.

The case also shows the new possibilities the Doubao video model opens up in scenarios such as self-media, talking-head videos, marketing, ad delivery, and corporate training: no in-person shooting is needed, and content production costs can be greatly reduced.

It is reported that the Doubao video model is not vaporware: the latest model will launch on the Volcano Engine Ark platform after the National Day holiday, and the latest beta version of Jimeng already uses Doubao Video Generation-Seaweed.

As for pricing, Tan said it has not yet been decided. "Video models and language models have different application scenarios and different pricing logic. One has to weigh the new experience against the old experience and the migration cost; whether a model is ultimately widely adopted depends on whether the productivity ROI is much higher than before."

02 From competing on price to competing on performance

Alongside the video model came a music model, a simultaneous interpretation model, and upgrades to the main Doubao model. Like the eye-catching video model, these products also show notable performance improvements.

This series of upgrades also reflects the start of Volcano Engine's shift from competing on price to competing on performance, which will be its next strategic focus. In an interview after the event, Tan Dai reiterated this position: "The application cost of large models has been largely resolved. Large models should move from competing on price to competing on performance, on better model capabilities and services."

As early as May this year, the Doubao model launched by Volcano Engine cut its price to as low as less than one cent per 1,000 tokens, triggering a price war among large-model vendors. Since then, model call volumes across vendors have increased dramatically.

According to Volcano Engine, as of September the average daily token usage of the Doubao language model exceeded 1.3 trillion, more than 10 times the level at its first release in May, and multimodal data processing reached 50 million images and 850,000 hours of voice per day.
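To put these daily totals on the same footing as the per-minute quotas discussed next, the short sketch below converts them into average rates. It is only back-of-the-envelope arithmetic: the quoted figures are treated as approximate point values (the article gives them as "exceeded" or "reached"), and the constant and function names are made up for illustration.

```python
# Figures quoted by Volcano Engine for September, treated as approximate.
DAILY_TOKENS = 1.3e12        # Doubao language model tokens per day
DAILY_IMAGES = 50e6          # images processed per day
DAILY_VOICE_HOURS = 850_000  # hours of voice processed per day

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def per_minute(daily_total: float) -> float:
    """Average rate per minute for a quantity reported per day."""
    return daily_total / MINUTES_PER_DAY


def per_second(daily_total: float) -> float:
    """Average rate per second for a quantity reported per day."""
    return daily_total / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"tokens per minute: {per_minute(DAILY_TOKENS):,.0f}")
    print(f"images per second: {per_second(DAILY_IMAGES):,.0f}")
    # Hours of audio handled per wall-clock hour, averaged over a day.
    print(f"voice hours per hour: {DAILY_VOICE_HOURS / 24:,.0f}")
```

Averaged over a day, 1.3 trillion tokens works out to roughly 900 million tokens per minute across all customers. These are averages, so actual peak load would be higher.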

However, model performance constraints have become both a bottleneck and an opportunity for further growth in model calls. Tan said that many large models in the industry currently support only 300K or even 100K TPM (tokens per minute), which is not enough to carry enterprise production traffic. For example, peak TPM is 360K in a research institution's literature translation scenario, 420K for an automotive smart cockpit, and 630K for an AI education company. For this reason, the Doubao model supports an initial TPM of 800K by default, above the industry average, and customers can expand capacity flexibly as needed.
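The TPM comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. The helper names are hypothetical, it is not a Volcano Engine API, and it assumes a workload is "served" when its peak TPM does not exceed the quota.

```python
# Peak TPM (tokens per minute) for the three workloads cited in the article.
PEAK_TPM = {
    "research literature translation": 360_000,
    "automotive smart cockpit": 420_000,
    "AI education company": 630_000,
}

# Quota levels mentioned in the article.
QUOTAS = {
    "100K": 100_000,
    "300K": 300_000,
    "800K (Doubao default)": 800_000,
}


def workloads_served(quota: int) -> list[str]:
    """Workloads whose peak TPM fits within the given quota."""
    return [name for name, peak in PEAK_TPM.items() if peak <= quota]


def headroom(quota: int, peak: int) -> float:
    """Fraction of the quota left unused at peak; negative if the peak exceeds it."""
    return (quota - peak) / quota


if __name__ == "__main__":
    for label, quota in QUOTAS.items():
        served = workloads_served(quota)
        print(f"{label}: serves {len(served)} of {len(PEAK_TPM)} workloads")
    busiest = max(PEAK_TPM.values())
    print(f"headroom at 800K for the busiest workload: {headroom(800_000, busiest):.1%}")
```

Under these assumptions, neither the 100K nor the 300K quota covers any of the three workloads, while the 800K default covers all of them and leaves about 21% headroom above the 630K peak.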

Previously, Yan Junjie, the founder of MiniMax, told Geek Park that from the perspective of technology development, it is inevitable that the cost of model inference will be reduced by 10 times or 100 times, and it is only a matter of time, and the difficulty is to improve the performance of general models.

After the significant performance jump from ChatGPT to GPT-4, the large-model field followed OpenAI's scaling law for pre-training, aiming to improve model performance with more data, more compute, and more parameters. With diminishing returns and concerns about running out of high-quality data, this route to better performance has hit a bottleneck.

Now, with the arrival of o1, large models have introduced reinforcement learning aimed at the reasoning (inference) phase, offering a clear route to further improve model performance.

At the same time, as more enterprises explore AI applications, it also brings many engineering tuning methods to improve model performance. Better model performance and better model services can open up more scenarios on the product, and this will also become the focus of AI infrastructure service providers, including Volcano Engine, in the next stage.