Volcano Engine Plays Its Video Model "Trump Card" as Cloud Vendors Shift from a "Price War" to Competing on Performance

Ever since Sora came out in February of this year, many people have been waiting for ByteDance's move. With Douyin and Jianying, the two strongest video apps, in hand, expectations for ByteDance's video generation model are high.

And here it is.

On September 24, ByteDance's Volcano Engine held an AI innovation tour in Shenzhen, and released a variety of models in one fell swoop, including the "Doubao Video Generation Model" and "Music Generation Model".

Before this, a number of breakout products of this kind had been released at home and abroad, including ByteDance's own Jimeng, Sponge Music, and new features in Jianying (and CapCut). Sponge Music in particular is regarded as the music generation app best suited to Chinese, a well-deserved "Suno of China".

Why did ByteDance choose to launch the large models behind these AI apps in September, when some people have already grown numb to AI products?

On this point, Tan Dai, president of Volcano Engine, told Geek Park that there was no carefully planned release date. AI models are advancing day by day, and a model is released as soon as it is ready and suitable for external use.

The logic behind this is that Volcano Engine is positioned as ByteDance's ToB cloud platform and is responsible for opening models up to enterprises. Before a product launches, however, it has to be used internally and polished until it is enterprise-ready. The same was true of Doubao: ByteDance first launched the Doubao app, and then in May this year Volcano Engine launched the enterprise-grade Doubao large model.

He added, "It's not necessarily about being first; it's about launching mature products. Models will have a long-term impact over the next 10 to 20 years, so it's worth building up capabilities properly, and it's fine to arrive later."

Over the next decade, Volcano Engine's goal is not to lead in just one or two models, such as video generation, but to "become the world's leading cloud and AI service provider".

01 Backed by Douyin and Jianying, ByteDance's video generation model focuses on usage scenarios

The video generation model was the biggest highlight of the whole conference.

Tan said, "Because video is particularly difficult, we launched two models at once to fully address the various problems in video." The two new members of the Doubao family, Doubao Video Generation-PixelDance and Doubao Video Generation-Seaweed, are now open to the enterprise market for invitation-only testing.

Judging from the on-site demo, the Doubao video model can generate videos from text and image inputs. It is worth noting that ByteDance did not disclose the maximum length of video its models can generate, even though that figure is widely seen as a key indicator of technical capability.

The Doubao video generation model emphasizes three core capabilities needed for practical use across everyday and business scenarios.

The first is the model's understanding of complex instructions. In the video below, for example, the prompt was: "Close-up of a woman's face, a little angry, wearing a pair of sunglasses; then a man walks in from the right side of the frame and hugs her."

Given this relatively complex description, the video generated by the Doubao model shows the change in the character's mood, the correct ordering of actions in time, and the entrance of a new character who then interacts with the original one. In other words, the Doubao video model can follow sequential action instructions, generate multiple subjects, and have those subjects interact with each other.

The second capability is camera movement. The model lets a video handle large subject motion and camera changes together, and supports camera language such as zoom, orbit, pan, scaling, and target following.

The generated video allows flexible control of the viewpoint, which is closer to a real-world experience | Video source: ByteDance

The third capability is multi-shot consistency. In AI-generated video, keeping the subject consistent while cutting back and forth between shots is a common difficulty across the industry.

From a single prompt, Doubao can generate a video with multiple shot changes while keeping the subject, style, and atmosphere consistent | Video source: ByteDance

Discussing the model's characteristics, Tan Dai said the Doubao video model has two advantages. The first is technological breakthroughs combined with full-stack capabilities. On the technical side, ByteDance made many innovations in these two video models, such as an efficient DiT fusion computing unit, a newly designed diffusion model training method, and a deeply optimized Transformer architecture, so that generated videos have more flexible motion, more varied shots, and richer detail.

The second is the understanding of video that Douyin and Jianying bring. "Jianying's understanding of video helps the Doubao video generation model. Following instructions well is also inseparable from the language model: Doubao is a full family of models, and the underlying foundation model helps it understand instructions better."

As for going deep into video scenarios, the Doubao video model supports a range of styles, including black-and-white, 3D animation, 2D animation, Chinese painting, watercolor, and gouache, and aspect ratios including 1:1, 3:4, 4:3, 16:9, 9:16, and 21:9, covering commercial display scenarios such as film, television, computers, and mobile phones.

The Doubao video generation model can quickly turn a product into a 3D presentation and show it dynamically from multiple angles. It can also swap backgrounds and styles to match occasions such as the Mid-Autumn Festival, Qixi Festival, and Spring Festival, generate content in different sizes for publication on different platforms, and so fit into an overall marketing strategy.

For more focused scenarios, the Doubao video model also offers tailored solutions. In e-commerce marketing, for example, users can generate large volumes of video material from a product to match marketing occasions and fit the different sizes required by different media platforms.

The video launch session also included an easter egg: Volcano Engine presented a hands-on example of how Jianying and Jimeng use the video generation model internally. Zhang Nan (Kelly), who moved from Douyin to Jianying and CapCut, appeared as a digital avatar.

In this digital human video, Kelly's avatar moves as naturally as a real person, and its lip sync can be matched to the languages of different countries.

The case also shows the new possibilities the Doubao video model opens up in scenarios such as self-media, spoken-word broadcasts, marketing, delivery, and corporate training: no in-person shooting is required, and content production costs can be greatly reduced.

Reportedly, the Doubao video model is not vaporware: the latest model will go live on the Volcano Engine Ark platform after the National Day holiday, and the latest beta version of Jimeng already uses Doubao Video Generation-Seaweed.

As for pricing, Tan said it has not yet been decided. "Video models and language models have different application scenarios and different pricing logic. It comes down to the new experience minus the old experience minus the migration cost; whether a model is ultimately widely adopted depends on whether the productivity ROI is much higher than before."

02 From competing on price to competing on performance

Alongside the video model, Volcano Engine released a music model and a simultaneous interpretation model, and upgraded the main Doubao model. Like the eye-catching video model, these products also show notable performance improvements.

This series of upgrades also marks the start of Volcano Engine's shift from competing on price to competing on performance, which will be its next strategic focus. In an interview after the event, Tan Dai, president of Volcano Engine, reiterated this position: "The application cost of large models has been well addressed. Large models should move from competing on price to competing on performance, on better model capabilities and services."

As early as May this year, the Doubao model launched by Volcano Engine cut its price to as low as less than one li (0.001 yuan) per 1,000 tokens, triggering a price war among large model vendors. Since then, overall model call volumes across vendors have increased dramatically.

According to Volcano Engine, as of September the Doubao language model's average daily usage exceeded 1.3 trillion tokens, more than 10 times the level at its first release in May, and its multimodal processing volume reached 50 million images and 850,000 hours of speech per day.

However, model performance limits have become both a bottleneck and an opportunity for further growth in model calls. Tan noted that many large models in the industry currently support a TPM (tokens per minute) of only 300K or even 100K, which makes it hard to carry enterprise production traffic. For example, a research institution's literature translation scenario peaks at 360K TPM, an automotive smart cockpit at 420K, and an AI education company at 630K. For this reason, the Doubao model supports an initial TPM of 800K by default, above the industry average, and customers can expand capacity flexibly as needed.
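To make the TPM comparison concrete, here is a minimal Python sketch that checks the quoted peak loads against the quoted quota levels. The figures come from the paragraph above; the names `PEAK_TPM`, `QUOTA_TPM`, `unmet_workloads`, and `headroom` are hypothetical and are not part of any Volcano Engine API.

```python
# Peak TPM (tokens per minute) per workload, as quoted in the article.
PEAK_TPM = {
    "research literature translation": 360_000,
    "automotive smart cockpit": 420_000,
    "AI education company": 630_000,
}

# Quota levels mentioned in the article; the labels are illustrative only.
QUOTA_TPM = {
    "typical low quota": 100_000,
    "typical high quota": 300_000,
    "Doubao default": 800_000,
}


def unmet_workloads(quota, peaks=PEAK_TPM):
    """Return the workloads whose peak TPM exceeds the given quota."""
    return [name for name, peak in peaks.items() if peak > quota]


def headroom(quota, peak):
    """Return quota divided by peak; values above 1.0 mean spare capacity."""
    return quota / peak


for label, quota in QUOTA_TPM.items():
    short = unmet_workloads(quota)
    status = "covers all peaks" if not short else "falls short for: " + ", ".join(short)
    print(f"{label} ({quota:,} TPM): {status}")
```

Under these figures, both the 100K and 300K quotas fall short of all three quoted peaks, while the 800K default covers the largest peak (630K) with roughly 27% headroom.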

Earlier, Yan Junjie, founder of MiniMax, told Geek Park that from a technology standpoint, a 10x or 100x reduction in model inference cost is inevitable and only a matter of time; the hard part is improving the performance of general-purpose models.

After the significant performance jump from ChatGPT to GPT-4, the large model field followed OpenAI's scaling law for pre-training, aiming to improve performance with more data, more compute, and more parameters. With diminishing returns and concerns about running out of high-quality data, this path to better performance has hit a bottleneck.

Now, with the arrival of o1, large models have introduced reinforcement learning at the inference stage, offering a clear path to further improve model performance.

At the same time, as more enterprises explore AI applications, many engineering tuning methods for improving model performance are emerging. Better model performance and better model services can open up more product scenarios, and this will be the focus of AI infrastructure providers, including Volcano Engine, in the next stage.
