Kunlun Wanwei's Fang Han: Achieving SOTA in vertical fields is the prerequisite for turning technology dividends into market dividends

China's R&D can achieve global SOTA in vertical fields.

The 2024 China Generative AI Conference was held in Beijing on April 18-19. At the opening ceremony at the main venue on the first day, Fang Han, chairman and CEO of Kunlun Wanwei, delivered a speech titled "SOTA Dividends from the Tiangong SkyMusic Music Model".

Fang Han emphasized the importance of "technology leadership" in AI. Unlike the Internet era, when products were driven by business models, the large-model era is driven by technology: a technical lead can translate into a market lead, attract a large number of users, and yield dividends.

OpenAI's position among AI startups essentially comes from the SOTA capability of its large text model (SOTA meaning first place on current technical metrics). For today's AI entrepreneurs, achieving SOTA in any track, such as images, video, or music, makes it possible to win a large number of users through technical advantage, and then retain those users on the platform through product and business model innovation to form a moat.

In AGI and AIGC, Kunlun Wanwei has developed Tiangong 3.0, which includes the music model SkyMusic and the world's largest open-source MoE model. Of these, SkyMusic holds the SOTA position in the music field.

SkyMusic supports output in a variety of dialects and produces recognizable, natural vocals, lowering the threshold and cost of music creation and supporting the development of the content industry. Thanks to its sound quality, naturalness, and intelligibility, even people with no musical background can use it to create music, which greatly expands the possibilities and scope of music creation.

In addition, Tiangong 3.0's abilities in mathematics, reasoning, code, and other areas have improved substantially, and it can call search and synthesis tools over multiple rounds. Fang Han said such capabilities will drive an explosion in the content industry, promote cultural equality, and break monopolies, so that everyone can better shape and express themselves.

The following is a transcript of Fang Han's speech:

SOTA is a very commonly used term in academia. The full name, "State of the Art", sounds rather awkward, but it simply means ranking first on current technical metrics. It originated as an academic term for evaluating models in machine learning, so why has it attracted so much attention?

01. Only by achieving SOTA in a vertical field can technology dividends be turned into market dividends

Looking at this wave of large-model investment and entrepreneurship, one phenomenon stands out. In the previous wave, the mobile Internet, the CEOs or leaders of startups mostly came from product or business backgrounds, because the mobile Internet was about innovation in business models and product models. In this wave of AI entrepreneurship, the CEOs of AI companies mostly come from technical backgrounds.

Why? Because in AI, technology matters far more than the product or business model, and leading technology can bring business dividends.

To give a few examples, we cannot avoid talking about the number one in this industry, OpenAI. Why is OpenAI's valuation so high now, and why does the world expect so much from it? Because, in essence, since ChatGPT its SOTA position in large text models has been very solid. Only when Claude 3 came out was there, for the first time, a large model that could approach GPT-4, so OpenAI hurriedly put out Sora, which is SOTA in video generation. In other words, in the most general field, general artificial intelligence and large text models, OpenAI's SOTA dividend is very obvious.

But look at the image generation track. DALL·E 3 came out very early, but soon after Midjourney and Stable Diffusion came out, these small and medium-sized startups gained a large number of users, far more than OpenAI's DALL·E 3. Why? Because DALL·E 3 did not reach SOTA. SOTA was taken by Midjourney and the other small and medium-sized startups behind it, and users will naturally choose the SOTA products of those companies over the products of other companies.

That is to say, in any field, for all our entrepreneurs and latecomers, as long as you can get the first place in the technical indicators in this field, you will be able to get a large number of users. After acquiring a large number of users, you can use your product model and business model to solidify these users on your platform.

In the long run, as long as no new large model or new competitor outclasses you by a full generation, you can keep relying on these users to reap dividends. Midjourney is an example: by the time V6 came out, its SOTA position was no longer very solid, but its user base remains solid.

That was some background. Chinese companies all entered the large-model track rather late, but in global competition we still insist on putting technology first: we must achieve SOTA in a vertical field in order to turn technology dividends into user and market dividends.

02. Based on the Tiangong model, six business matrices are formed

First of all, I would like to report to you when Kunlun Wanwei started to do AGI and AIGC.

When GPT-3 came out in 2020, we set up a team to pre-train text models, and in 2021 we began preliminary research on music generation models. Today, every vertical track has to be built end to end.

In December 2022, we released China's first open-source text model. At that time, the various overseas open-source models had not yet come out; it was the first 13B open-source Chinese-language text model from a Chinese company.

On April 17, 2023, we released Tiangong 1.0, and on August 23, 2023, we released China's first AI search product, "Tiangong AI Search". On April 17 this year, we released Tiangong 3.0, which includes China's first SOTA model in the music AIGC track, the SkyMusic music generation model, as well as the world's largest open-source MoE model, with 400 billion parameters.

At present, we have a business matrix of six lines: AI large models, AI search, AI music, AI video, AI social, and AI games.

Although the matrix is broad, our goal is very clear. First, we must build the base model, the Tiangong model. From the original text model to the current MoE model and on to the next generation of multimodal models, we must keep evolving the base model.

There is an obvious pattern today: the vertical models in every vertical track, whether audio, music, video, images, or 3D, depend heavily on the capability of the base text model. If the base text model is weak, the ceiling for every vertical model built on it is relatively low.

Social, music, gaming, and video all belong to the AIGC vertical track, and we believe that as long as we continue to invest in these vertical tracks, we will definitely be able to achieve SOTA and gain a leading edge in the market.

03. China's first SOTA in music AIGC, with a full music dataset of nearly 20 million songs

First, I would like to introduce the Tiangong music model SkyMusic, which is now open to all users. Search for "Tiangong" in your app store or the App Store; the "Tiangong" app has a music section that you can use right away. This is China's first SOTA model in music AIGC.

This is a case where we turned a recipe for chopped pepper fish head into a song, sung in Cantonese.

This is a collaboration between Tiangong AI Music and Pang Bo, in which we turned lyrics written by Pang Bo into a song.

Now for the technical metrics. Compared with Suno V3, SkyMusic currently comes out ahead in vocal and BGM sound quality, vocal naturalness, and pronunciation intelligibility. We believe this gap can be widened further in the next version.

The AI music model has gone through three years of research and development since 2021, and our technical route has kept evolving. In August last year, before Sora came out, we had already moved to the Diffusion Transformer architecture, because it is the architecture that scales up best.

Our training data is a full dataset of nearly 20 million songs, and after more than three years of cleaning and processing, its quality is assured. We developed the SkyMusic music model on a DiT-like architecture. This is only our first version, and more features will come in the lab version in the future.

Here's a look at some of our unique strengths.

Anyone who has used Suno knows that you create by choosing text labels and styles. With our approach, you can upload a song you like, or even a melody you recorded yourself, and we generate music based on it. This is closer to how traditional musicians work: many musicians first hum a tune in their head or out loud, write it down, and then work out the verse, chorus, and arrangement from it.

Second, we support dialect output within a single language. At present, the Chinese version supports the Sichuan, Cantonese, Beijing, Tianjin, Shanghai, and other dialects, which is very meaningful for users.

Finally, there are the more recognizable, natural vocals. The vocal quality is high, and the model generalizes very well across all kinds of voices: female, male, child, and adult.

Once technical SOTA is achieved, how do we turn it into a product dividend?

We believe that all AIGC, including large models for music creation, first and foremost greatly lowers the threshold for creation.

As I wrote in a WeChat Moments post: "Everyone can express their aspirations in song." What did it take to make a song in the past? First, you had to start learning piano, music theory, and notation at the age of four or five. It took my own children roughly seven or eight years to reach a professional level on the piano, which is a long training period. To become a composer, you might study at college for another four years, and that only covers composition. After composing comes arranging, and after arranging you have to find a singer and a recording studio. Recording one song from start to finish on the market, even with the simplest setup and the most ordinary equipment, costs about 20,000 yuan.

As a result, a single person could not really create a song alone. With the SkyMusic model, anyone who has lyrics can generate a complete song in about 1 minute, which greatly lowers the threshold for music creation and benefits the entire content industry.

In the past, a so-called soundtrack meant matching existing tunes to video content. Today the idea can be generalized much further. For example, at a conference like today's, every keynote speaker could be given their own song, or every verified influencer ("big V") on Weibo could have a song written about them. This is a great convenience for the entire content industry.

Finally, SkyMusic proves that Chinese R&D can also achieve global SOTA in vertical fields, so that Chinese companies can compete head-on with their foreign counterparts in the global AIGC market and gain our due market share.

04. Tiangong 3.0, a 400-billion-parameter open-source MoE large model, lowers the threshold for creation in all fields

Next is Tiangong 3.0, the world's largest open-source MoE model with 400 billion parameters. We opened its public beta on April 17, and its current performance has surpassed Grok-1, the 314-billion-parameter MoE large model released by xAI. This is the technical base of Tiangong 3.0, the 400B-parameter MoE large model. In the MMBench reasoning test shown here, the blue one currently leads on the technical metrics, and by a wide margin.

Tiangong 3.0 has been comprehensively upgraded and is smarter. The model's technical knowledge has improved by more than 20%, and its abilities in mathematics, reasoning, code, and cultural creativity have improved by 30%. Its content-creation abilities include search, writing, reading, chat, voice dialogue, text-to-image generation, and composing lyrics and music for you.

Let me demonstrate the ability to call search and synthesis tools over multiple rounds. The question in the picture is how to get to Chengdu Disneyland. Chengdu Disneyland is actually a meme: it is a residential community in Chengdu, not a real Disneyland. By combining search with the large model, the system determined that Chengdu Disneyland was an Internet meme, but it still produced a plan for how to get there. When you ask about the weather at Shanghai Disneyland, the model generates a weather card telling you that it is raining at Shanghai Disneyland today, and finally it calls the text-to-image tool to generate a picture for you.

The second is research mode. Doing research requires writing outlines and drawing knowledge maps and mind maps. After searching automatically, Tiangong 3.0 can not only generate tables but also quickly generate an outline and then automatically generate a mind map that can be used immediately.

Agent Square lets you build super AI agents to help you complete specific tasks. For example, you can build an agent to study the differences between Xiaomi cars and Tesla cars. After running many searches and gathering a lot of information, the model generates a comparison table with text and pictures, which is very valuable for people who do copywriting.

05. Kunlun Wanwei's new mission: achieving AGI and allowing everyone to better shape and express themselves

As a Chinese Internet company, Kunlun's current strategy is All in AGI and AIGC. In 2023, we announced our company's new mission: to achieve artificial general intelligence and allow everyone to better shape and express themselves.

Why did we revise our mission this way? Because achieving artificial general intelligence essentially means building on the text model to compress all human knowledge into our general large model, which is the only path to artificial general intelligence.

But as you know, artificial general intelligence will reach real-world deployment in some scenarios before others. Which scenarios will be easier to deploy in?

The answer is relatively simple. Some people joke that when large models first came out, everyone assumed people doing production work would be the first to lose their jobs and people doing creative and artistic work would be the safest. But after this period of development, you can see that people who do literary and artistic creation are actually more likely to lose their jobs, which means AI is easier to deploy in content generation.

The reason is simple: content generation has a very high fault tolerance, and users are very forgiving of content errors. We can tolerate a person in a generated picture or video having one finger too many or too few. But in production work, even one misplaced decimal place is a major accident. In other words, in content tracks with high fault tolerance, this wave of large models and artificial intelligence has a lot of room to play.

In artificial intelligence and AIGC, the purpose of our research is to lower the threshold for creation. Whether it is text-to-image, text-to-video, text-to-music, text-to-sound-effects, or text-to-3D-assets, the essence is to remove the long professional training that used to be required beforehand, so that anyone who can tell a story can create the corresponding content to express themselves.

First of all, we know that the cost of creating content has been greatly reduced, and the barrier to entry for creating content has been lowered. We know that as long as the barrier to creating content is lowered, the number of people who create content will explode.

For example, when I was a child, TV reporters shot video with cameras weighing dozens of kilograms, and there were very few creators. After the advent of smartphones, more than a billion people in China became videographers. Everyone can shoot video, and the result was the enormous growth of the short video industry. This wave of AI has lowered the threshold for creation in all fields, and what is the result? The entire content industry will see a huge explosion. At the same time, it brings another effect: cultural equality and the breaking of monopolies.

I spent a long time in Africa, and when I was in Nigeria, I learned that the average cost of making a movie there is between 20,000 and 200,000 US dollars. Films made this way cannot compete with "The Wandering Earth", made for 4.5 billion yuan in China, or the "Marvel" series of movies, made for 4.5 billion US dollars in the United States. However, with our next-generation AIGC technology, we believe that people from disadvantaged cultures around the world will be able to create content comparable to that of the dominant cultures of Europe and the United States at very low cost.

What is the result? Each ethnic group with a disadvantaged culture can use AIGC technology to produce culture suited to its own people and its own minority language, which is very meaningful for cultural equality in the world. This is also the second part of our mission, allowing everyone to better shape and express themselves, and it is the ultimate goal of R&D personnel working on text-to-music, text-to-video, text-to-novel, text-to-comic, and other content creation fields.

The above is a complete compilation of the content of Fang Han's speech.