
Zhu Wei of Wondershare Technology: Sora has not been commercialized so far, and video models will take time to mature

Author | Zhidongxi
GenAICon 2024

The 2024 China Generative AI Conference was held in Beijing on April 18-19. Zhu Wei, Vice President of Wondershare Technology, delivered a speech titled "Audio and Video Multimedia Large Model Market Insight and Landing Practice" at the main venue on the first day.

At present, large models are moving from the 1.0 era of text and images to a 2.0 era with audio, video, and multimedia as the carrier. For a long time, however, video-related models have made up only a very small share of all large models, and applying large models in the audio and video field faces severe challenges: a lack of datasets, the complex structure and hierarchy of video content, and high computing costs.

Zhu Wei, Vice President of Wondershare Technology, believes that the emergence of large models has brought unprecedented convenience and possibility to video creation, and that 2024 is expected to be the first year of AI video. Against this backdrop, Wondershare Technology's "Tianmu" audio and video multimedia model officially entered public beta testing on April 28.

Released in January this year, the "Tianmu" model focuses on vertical creation scenarios in digital creativity. Built on audio and video generative AI technology and on an accumulation of 1.5 billion creators and 10 billion items of localized, high-quality audio and video data, it has three major characteristics: multimedia, vertical solutions, and localization of data, computing power, and applications.

This is China's first audio and video multimedia model focused on the field of digital creativity, and it has been filed with the Cyberspace Administration of China. It supports one-click generation of 60-second videos and offers nearly 100 atomic capabilities such as video-to-video, text-to-music, and text-to-sound-effects, empowering global creators across the whole creation chain.

The following is a transcript of Zhu Wei's speech:

Our Wondershare model is positioned as an "audio and video multimedia model", and this sharing focuses on the market and on application practice. Since its establishment more than 20 years ago, the company has been deeply engaged in audio, video, and multimedia, empowering audio and video content producers. We are primarily a provider of tools and services for video content production.

Today, we have more than 20 years of accumulation in the field of audio and video multimedia, and we were very happy when large models appeared, because they provide better technical services for our industry and users. Therefore, through this sharing I hope to introduce some market trends and practical experience around audio and video multimedia models.

1. Large models have begun to enter the 2.0 era, and the entrance of audio and video modalities has not yet been fully opened

Speaking of large models, Wondershare Technology had already formed a team to conduct artificial intelligence (AI) research back when deep learning was emerging. With the advent of the large model era, we naturally followed the trend, because we believe large models will bring greater impetus to content production, especially video content production. We have long argued that large models should be like infrastructure, enabling all walks of life, and we have always believed in that.

Last year, everyone was talking about the "war of a hundred models": the entire large model field showed explosive growth, many large models came out one after another, and many have been put into commercial use. In text and images in particular, a commercial closed loop has formed, bringing value to users. We found that once some products connected to a large model, they achieved tenfold or even dozens-of-times growth, which made us believe that large models may bring extraordinary changes to many industries. In addition, ChatGPT received more than 1.7 billion visits in March, which is already a large volume.

However, we are also aware of some problems, such as the fact that the rate of growth has begun to slow down.

Why is that? Why haven't the text- and image-dominated large models seen the kind of explosive growth in usage that they saw last year, or at the end of last year? We think it is probably because the entrance to the next modality has not yet been fully opened.

We believe that large models are also entering a 2.0 era. If we define the text-and-image stage as the 1.0 era, then this year we will gradually shift to a 2.0 era with audio, video, and multimedia as the carrier.


This trend is closely related to Wondershare's business, so we have been researching and practicing technology in this field and paying attention to the needs of market users.

From the data, 80% of Internet traffic is video traffic. This partly reflects the sheer volume of video data, but it also reflects user preference: users lean toward video content. In user research on large models in particular, public information shows that video generation is what users most want from large models; it ranks among the top three user needs.


As a result, users are eager for large models to assist them in video creation. Wondershare has spent more than 20 years deeply engaged in the video creativity track, so the field of audio and video generation is one we cannot be absent from.

In the past, video creation was the preserve of Hollywood directors and editors. But with the development of technology, the popularization of smartphone cameras, and improving AI capabilities, editing video has become easier and easier, more and more people are creating videos, and demand keeps growing.

2. Sora has not been successfully commercialized so far, and the application of video models is difficult and challenging

One problem is that over the past two years there have been more models in the field of text and images but fewer in video. Although some video models have come out, they face much more serious problems than text and image models do, in data, algorithms, cost, and especially output quality.

In fact, there is still a lot of room for improvement in the output of large video models. Sora, released around Chinese New Year this year, is considered the best-performing video model at present, and the industry's second echelon has even more room for improvement relative to it.

Video models account for only a very small proportion, but we have also started to apply them; the difficulty and challenges of application, however, are greater.

So why is the video model so difficult to apply? Take Sora: it released videos around Chinese New Year this year, but it has not been commercialized, nor has it been widely opened to the public. So we think that although everyone is building video models, there is still a long way to go before commercialization.


For these reasons, we cannot help but ask: why have video and multimedia models not been adopted as widely, and as quickly, as large language models were when they first appeared?

We believe the video scenario is relatively complex: the amount of information it contains, the ways it is expressed, and the added time dimension all make representing video very complicated. On top of that, video production itself is a long process.

Our tools are primarily aimed at semi-professional users rather than professional users. Our large-scale usage data shows that a semi-professional user typically takes 1.6 hours to make a video with our tools, which shows that video production still has a certain threshold.

From the perspective of AI technology, the maturity and application of video models will definitely require a certain period. Therefore, we have always believed that 2024 may become the first year of AI video, that is, we believe that there will be more and more AI videos this year, and there will even be an explosive trend. Under this trend, as a company in the field of audio and video, especially a company that empowers audio and video creators, Wondershare Technology feels an unprecedented opportunity.

3. It is not difficult to obtain video data, but it is difficult to convert it into data that can be used in large models

At present, the company has accumulated a large number of loyal users in more than 200 countries around the world. Many users keep raising the same question: why doesn't this product have AI capabilities yet, or why doesn't it have a particular AI capability yet?

Although we began adding AI features to our products a few years ago, users' needs far exceed the speed and the capabilities we can offer. As a result, we feel both opportunity and pressure.

At the same time, I believe long-time users know our company. For more than 20 years, we have been providing users with the technical capabilities of each era: from the earliest PC era, to the mobile Internet era, and now to the AI era, we have been committed to giving users the corresponding technical empowerment. So I think users have expectations of us as well.

We have a deep understanding of video and multimedia creators around the world. We know what kind of capabilities and empowerment a video creator needs, and at what point, to make it easier for them to create videos. In addition to large model capabilities, we have accumulated many traditional algorithmic capabilities, which, combined with large models, play a very important role in empowering creators.

In the era of large models, our original capabilities have played a great role, and we call them "data production and management capabilities", that is, the ability to process data.

It is not difficult to obtain video data, but turning it into data that can be used for large model training still requires a certain amount of cost, time, and technical capability. That is exactly where our advantage lies: we have a platform that handles this well. At the same time, our investment in algorithm infrastructure, especially our self-developed training and inference platform, also provides better support for large model development.
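To make this concrete, here is a minimal, purely illustrative sketch of what such a data production pipeline might do: split raw footage into clips, pair each clip with a caption, score quality, and keep only what is fit for training. Every function, field, and threshold below is an assumption made for illustration, not Wondershare's actual system.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str             # source video file
    start_s: float        # scene start time in seconds
    end_s: float          # scene end time in seconds
    caption: str = ""     # text description paired with the clip
    quality: float = 0.0  # quality score in [0, 1]

def split_into_scenes(path: str) -> list[Clip]:
    """Hypothetical stage 1: cut raw footage into shot-level clips.
    A real system would use shot-boundary detection here."""
    return [Clip(path, 0.0, 4.0), Clip(path, 4.0, 9.5)]

def caption(clip: Clip) -> Clip:
    """Hypothetical stage 2: attach a text caption to each clip,
    e.g. from a vision-language model plus human review."""
    clip.caption = f"placeholder caption for {clip.path} [{clip.start_s}-{clip.end_s}s]"
    return clip

def score_quality(clip: Clip) -> Clip:
    """Hypothetical stage 3: score resolution, stability, aesthetics."""
    clip.quality = 0.8  # stand-in for a learned quality model
    return clip

def curate(paths: list[str], min_quality: float = 0.6) -> list[Clip]:
    """Raw videos in, training-ready (clip, caption) pairs out."""
    clips = [c for p in paths for c in split_into_scenes(p)]
    clips = [score_quality(caption(c)) for c in clips]
    return [c for c in clips if c.quality >= min_quality]

if __name__ == "__main__":
    for c in curate(["footage/demo.mp4"]):
        print(c.caption, c.quality)
```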

4. The audio and video multimedia model was released in January this year, and three major features support its commercialization

Based on years of accumulation, on user expectations, on the data, algorithms, and technology built up over the years, and on our observation of the large model era, at the end of January this year we released our own multimedia model: the audio and video multimedia model, Wondershare "Tianmu".

Let me briefly introduce the characteristics of "Tianmu".

First, from multimodality to multimedia.

Nowadays everyone is talking about multimodality, and we are not denying it, but from the perspective of applications and user understanding, the term is a bit too technical for the average video-editing user. What we want to emphasize is that multimodality refers to the combination of various elements such as text, images, and so on. Our goal is to blend these modal elements together well, ultimately allowing users to produce high-quality multimedia videos when editing.


In addition, on the video model itself we are not currently building the lowest-level L0 model; working at the L0.5 level and above, we are more committed to providing vertical solutions, and we hope our model can bring value to users and solve their actual needs.

For example, regarding the concepts of multimodality and multimedia, we emphasize that a video should integrate multiple modal elements such as titles, themes, and subtitles to form a multimedia video. What we emphasize is not multimodal processing power per se, but the ability to fuse each modality into a final video; this is the first feature our model aims to deliver.
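As a rough illustration of what fusing modal elements into one multimedia video can mean in data terms, the sketch below gathers clips, titles, subtitles, and audio into a single composition. The structure and field names are assumptions made for illustration, not Wondershare's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class TextOverlay:
    text: str
    start_s: float
    end_s: float

@dataclass
class AudioTrack:
    source: str         # e.g. generated background music or a sound effect
    start_s: float
    gain_db: float = 0.0

@dataclass
class MultimediaComposition:
    """One finished video assembled from elements of several modalities."""
    video_clips: list[str] = field(default_factory=list)   # ordered clip files
    titles: list[TextOverlay] = field(default_factory=list)
    subtitles: list[TextOverlay] = field(default_factory=list)
    audio: list[AudioTrack] = field(default_factory=list)

# Example: a short piece with a title card, one subtitle, and generated music.
comp = MultimediaComposition(
    video_clips=["scene_01.mp4", "scene_02.mp4"],
    titles=[TextOverlay("A Girl's Life", 0.0, 3.0)],
    subtitles=[TextOverlay("She grows up by the sea.", 3.0, 8.0)],
    audio=[AudioTrack("generated_bgm.wav", 0.0, gain_db=-6.0)],
)
print(len(comp.video_clips), "clips,", len(comp.audio), "audio tracks")
```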

Second, from a generic model to a vertical solution.

Generic models like ChatGPT have been around for some time, and the growth in visits has slowed. If you ask ChatGPT questions about traditional Chinese medicine (TCM), the answers you get may not be ideal. Therefore, we believe that if GPT is a base model, vertical solutions need to be built on top of it to solve users' actual problems and ultimately create business value. This must be achieved by solving specific problems for specific groups of people.

When developing the "Tianmu" model, we focused on how to provide vertical solutions that solve users' specific problems. We do not treat generic capabilities as a product in themselves; rather, we combine them with vertical scenarios to form usable capabilities or solutions. At the moment, this approach may be the better path to commercialization.


Third, localized expansion in data, computing power, and applications.

You might ask: why do this when data from all over the world is already available? That is actually something we felt in our own research. I remember in October and November last year we made a video called "A Girl's Life". After I made it, many friends asked me why the subject looked like an Asian girl at the beginning but like a Western woman in old age. I think this was probably a data problem.

We recently redid "A Girl's Life", and the character's identity consistency and identity attributes are now well maintained, which shows how important data is. That is why we say we are rooted locally while oriented toward the global market.


5. The "Skylight" large model will be tested on April 28, and it can generate 60 seconds + video with one click

Having said so much about our large model, what are its characteristics and capabilities? I will give a brief introduction through a few videos of its atomic capabilities. We will start the public beta on April 28, and I hope you will try it out and give us your guidance.

The first is the text-to-video capability, that is, generating videos of more than 60 seconds with one click. This means a short story can be turned into a video with one click. The quality of the generated video covers the storyline, character appearance, visual coherence, and so on; basically, the video can be produced according to your storyline.

In this area, we do not focus too much on comparing basic generation capabilities, such as quality and duration, with other models. We hope to use text-to-video to solve a series of problems users encounter during video creation, such as being unable to find suitable footage or needing to insert a scene.
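Purely as an illustration of that workflow, and not Wondershare's actual API (none is published in this talk), a hypothetical client call for filling a missing shot from a text prompt might look like the sketch below; every name in it is an assumption.

```python
from dataclasses import dataclass

@dataclass
class GeneratedClip:
    prompt: str
    duration_s: float
    path: str

class HypotheticalTextToVideoClient:
    """Stand-in for a text-to-video service; not a real SDK."""
    def generate(self, prompt: str, duration_s: float) -> GeneratedClip:
        # A real call would submit the prompt and poll for the rendered file.
        return GeneratedClip(prompt, duration_s, path="generated/clip_001.mp4")

def fill_missing_shot(timeline: list[str], prompt: str, at_index: int) -> list[str]:
    """Generate a clip from a text prompt and insert it where footage is missing."""
    clip = HypotheticalTextToVideoClient().generate(prompt, duration_s=6.0)
    return timeline[:at_index] + [clip.path] + timeline[at_index:]

timeline = ["intro.mp4", "interview.mp4"]
timeline = fill_missing_shot(timeline, "drone shot of a coastal town at sunrise", at_index=1)
print(timeline)
```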

The other piece is video-to-video, which is mainly oriented toward video stylization. There have been many such algorithms, but it is rare for them to actually be applied in industry products and commercialized. Our current technology is not only used in our consumer (C-end) products; on the B-end we are also communicating and cooperating with the domestic video media industry on how to empower them.

We also provide the ability to generate sound effects, that is, text-to-sound-effects. These generation capabilities save users a great deal of the time spent searching for materials during editing, so users are quite fond of them once they get these capabilities.

In addition, we provide the ability to generate music, since every video needs background music. In the past, finding background music was time-consuming and laborious for users, and there were copyright issues as well. These capabilities give our users a good solution.

Therefore, we can say we are the first enterprise in China to have an audio and video multimedia model; it has completed filing with the Cyberspace Administration of China, which effectively supports the development of our global business.

6. A number of AI products with large model capabilities have been launched, and an open "Tianmu" ecosystem will be built in the future

As I mentioned earlier, once a capability matures, we push it into our products so users can experience it there.

In particular, our company's flagship product, Wondershare Filmora, added many AI capabilities over the past year to solve users' personalized, specific problems. We have seen a huge increase in how much our users love and use these capabilities, which makes us even more determined to invest further in AI large models.

In addition, in the domestic market we have developed a new product, Wondershare Blastor, which uses presenter-style digital humans and text-to-video to make things easier for cross-border e-commerce sellers, so they can produce product showcase and introduction videos more easily. This field is being embraced by more and more people.

At this stage, we mainly use AI and model capabilities to empower our own products, passing the value of the model on to end users through products that solve their problems and generate value.

Since the beginning of this year, we have gradually opened up the model's capabilities, hoping that "Tianmu" can be used not only internally but also to empower all walks of life, especially through ecosystem empowerment.

We have an AI Lab in the Malanshan area of Changsha, a hub of Hunan's video media and cultural-creative industries. We are exploring cooperation with some enterprises, especially media companies whose work is currently concentrated in video post-production, using our technology to improve their efficiency and reduce their costs, bringing cost reduction and efficiency gains to the traditional media industry.

Overall, therefore, we anchor our positioning as "the new-generation AIGC digital creativity enabler". We believe AI will bring disruptive change to the entire video production industry, and using the power of AI and large models to reduce costs and improve efficiency is the trend of the times.

Since its release, our "Tianmu" model has mainly been used internally and tested on a small scale. On April 28 we will officially open the public beta. Our model may not be perfect yet, but it is precisely because it is imperfect that there is so much room to imagine. We will keep working to improve it, and we welcome your comments and suggestions. Thank you!

The above is a complete summary of the content of Zhu Wei's speech.
