From 0 to 1, unveiling China's first AI music SOTA model

The first domestic music AIGC SOTA model.

Author丨Zhang Jin

Editor丨Chen Caixian

Victor Hugo once said: "The three keys that open the treasury of human wisdom are numbers, letters, and musical notes."

Music has long been the best vehicle for human beings to express their emotions.

But creating music has a very high barrier to entry, because it is not a one-person show but a highly collaborative team process. From lyrics and composition to arrangement, mixing, and recording, each step requires professional musicians and comes at a high cost.

But imagine what would happen if one day we could create songs with just a few taps of a finger?

This idea gradually took shape in 2023 as large models took off:

In 2023, a cover craze led by "AI singers" swept the Internet. Chinese-language music stars such as Stefanie Sun, Eason Chan, and Lin Junjie all acquired AI stand-ins, and online platforms turned into the stage for "AI singer comeback concerts". Behind all this was the So-VITS-SVC singing voice conversion technology. By analyzing a small number of audio clips, it can closely simulate the target singer's timbre. Although it still falls short in capturing a singer's individual vocal characteristics, technique, and personal style, it restores timbre almost 1:1, and it set off a wave of music creation among the general public.

Since March of this year, the release of Suno V3 and Udio has reignited this music creation boom. This time, users can do more than cover a song in a particular singer's voice: by entering a few lines of lyrics and a musical style, they get two complete songs of about two minutes each. The industry regards this breakthrough as genuinely lowering the threshold for music creation and letting more people take part.

In just over a year, from So-VITS-SVC to OpenAI's MuseNet, Google's MusicLM, and Meta's MusicGen, and on to Suno V3 and Udio, large model technology has continued to reshape the field of music creation.

AI music generation technology keeps leaping forward, from "AI singers" that clone voices to Suno, which generates complete songs. Unfortunately, these products are still some way from producing high-quality, genre-rich songs. For Chinese songs in particular, there has been no AI music generation model that matches the aesthetics of Chinese music.

That changed yesterday, when Kunlun Wanwei released "Tiangong 3.0", the world's largest open-source MoE model, and built on it "Tiangong SkyMusic", the only publicly available AI music generation model in China. With a comprehensive score of 6.65 across vocal and BGM sound quality, vocal naturalness, pronunciation intelligibility, and other measures, it surpasses Suno V3 and becomes China's first music AIGC SOTA (state of the art, the best level in the field) model.

Tiangong SkyMusic's comprehensive score surpasses Suno V3

So how did Tiangong SkyMusic become China's first music AIGC SOTA model, and what is its actual experience? Let's take a look.

China's first music AIGC SOTA model

Open the Tiangong app, tap the music section, enter a song title and lyrics, select a reference song, and then tap Generate Music to get your own song. That is the entire streamlined creation process in SkyMusic.

This ability to generate reference music is also one of the highlights of "SkyMusic". Users can either upload their favorite songs as templates or select suitable reference tracks from SkyMusic's vast database, and the system will generate new compositions with similar styles and voices. This feature significantly lowers the technical threshold for music creation, so that even ordinary users who lack professional music literacy can participate in music creation and enjoy the fun of creating music.

With the help of "Tiangong SkyMusic", we have produced two songs with very different styles:

See: https://mp.weixin.qq.com/s/S4I6DyqvR7z10s5NeedOPA

We then entered the well-known English nursery rhyme "Little Star" and adapted it into a rock-style, lyrical version with male vocals, a unique take on a childhood memory:

See: https://mp.weixin.qq.com/s/S4I6DyqvR7z10s5NeedOPA

Along the way, we found that "Tiangong SkyMusic" covers a variety of genres such as rap, folk, funk, ancient-style (guofeng), and electronic. As a next step, the team also plans to let users generate songs from a melody they hum. Compared with similar overseas products such as Suno V3, songs created by "Tiangong SkyMusic" render Chinese vocals more delicately and more intelligibly, and can also use techniques such as vibrato, chanting, male-female duets, and automatic harmony.

Let's sing "No Work Tomorrow" to celebrate the upcoming Friday.

See: https://mp.weixin.qq.com/s/S4I6DyqvR7z10s5NeedOPA

This song also demonstrates a core advantage of "Tiangong SkyMusic" over Suno: the ability to generate songs in dialects. Users can have songs sung in dialects such as Sichuanese, Cantonese, and the Beijing dialect, which greatly widens the creative space.

Such an AI music generation model is remarkable because music data is more complex to process than image or video data. Music is a long time-series modality: each second contains tens of thousands of closely related sampling points, making it one of the most complex modalities. In addition, music layers together lyrics, vocals, and melody, each carrying a huge amount of information. Processing music therefore requires not only an accurate time-series model but also joint consideration of waveform shape, frequency characteristics, rhythmic structure, and many other elements.
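To give a feel for the scale behind "tens of thousands of sampling points per second", here is a rough back-of-the-envelope sketch. The 44,100 Hz and 48,000 Hz rates are common audio standards assumed for illustration; the article does not say which rate SkyMusic uses, and the helper name is hypothetical.

```python
# Back-of-the-envelope arithmetic only. The sampling rates are common audio
# standards chosen for illustration, not figures from the article.
CD_RATE_HZ = 44_100
STUDIO_RATE_HZ = 48_000


def samples_in_clip(sample_rate_hz: int, seconds: float, channels: int = 1) -> int:
    """Number of raw sample values in a clip of the given length."""
    return int(sample_rate_hz * seconds * channels)


# A two-minute song, roughly the length mentioned earlier in the article.
two_minute_mono = samples_in_clip(CD_RATE_HZ, 120)
two_minute_stereo = samples_in_clip(CD_RATE_HZ, 120, channels=2)
two_minute_studio = samples_in_clip(STUDIO_RATE_HZ, 120)

print(two_minute_mono, two_minute_stereo, two_minute_studio)
```

Under these assumptions, a single two-minute mono track already holds several million sample values that a model has to keep coherent from start to finish.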

However, as large model technology has evolved, two effective strategies have emerged for taming this complexity, and they form the two main technical routes for large AI music generation models: symbolic music generation and large-model music audio generation.

The symbolic music generation route trains the model on large amounts of annotated sheet music data. It has been widely studied in academia, but it ultimately outputs musical scores, which other programs or tools must then convert into playable music, and the results are unsatisfactory.
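To illustrate why a score still needs a separate rendering step, here is a minimal sketch that turns a symbolic note list into raw audio samples using plain sine tones. The score format, the 8,000 Hz rate, and the function names are assumptions made for this example; real renderers use instrument samples or synthesizers, and nothing here reflects how any product in the article works.

```python
import math

SAMPLE_RATE_HZ = 8_000  # deliberately low, just to keep the example small


def midi_to_hz(note: int) -> float:
    """Standard equal-temperament mapping: MIDI note 69 is A4 at 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)


def render(score, sample_rate_hz: int = SAMPLE_RATE_HZ) -> list:
    """Render (midi_note, seconds) pairs as a sequence of sine-wave samples."""
    samples = []
    for note, seconds in score:
        freq = midi_to_hz(note)
        n = int(seconds * sample_rate_hz)
        samples.extend(
            math.sin(2 * math.pi * freq * i / sample_rate_hz) for i in range(n)
        )
    return samples


# Hypothetical symbolic output: four half-second notes (C4, C4, G4, G4).
score = [(60, 0.5), (60, 0.5), (67, 0.5), (67, 0.5)]
audio = render(score)
print(len(audio))
```

The score carries only pitch and duration, so everything that makes a performance sound real (timbre, dynamics, a singing voice) has to come from the rendering tool, which is one way to read the article's remark that results on this route are unsatisfactory.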

The large-model music audio generation route generates musical elements such as instruments, vocals, melody, volume, and notes end to end, and directly outputs an audible audio file. The cost is a huge investment of R&D resources and a reliance on large-scale training datasets. Even industry giants such as Google and OpenAI have yet to make major breakthroughs.

In addition, making AI singing sound realistic is a crucial research topic. In the past, however, AI music technology focused mainly on background music (BGM) without vocals, and songs with vocals lacked an effective solution.

At the start of the "Tiangong SkyMusic" project, Kunlun Wanwei faced both of these difficult choices. In the end, the R&D team unanimously chose the large-model music audio generation route and decided to tackle songs with vocals. This meant advancing into two no-man's lands of AI music generation with almost no open-source work to learn from, and the difficulty is easy to imagine.

Schematic diagram of SkyMusic technology

After many experiments, the R&D team recognized how well the DiT structure suits AI music generation and committed firmly to that direction. It ultimately developed in-house a Sora-like model architecture for music audio, filling an industry gap in both the technical route and vocal singing. The architecture consists of three core modules: Encoder, DiT (Diffusion Transformer), and Decoder. The Large-scale Transformer is responsible for composing: it learns the contextual dependencies among Music Patches and makes the music controllable. The Diffusion Transformer is responsible for singing: it restores the Music Patches to high-quality audio through LDM.

At the same time, in order to train "Tiangong SkyMusic", Kunlun Wanwei has established the world's largest music dataset so far, including more than 20 million song samples, ensuring that "Tiangong SkyMusic" is accurate, controllable and widely applicable in terms of music style.

In this way, "Tiangong SkyMusic" lowers the entry threshold so that music creation no longer has professional barriers. It genuinely shortens the distance between music creation and the general public and pushes the AIGC industry forward. Kunlun Wanwei has also taken the initiative to publish the technical schematic of "Tiangong SkyMusic", giving the global open-source community and developers a reference case and promoting the joint building and sharing of the global AIGC technology ecosystem.

Tiangong 3.0, a model that can think independently

The success of "Tiangong SkyMusic" is inseparable from its technical foundation, "Tiangong 3.0". Fang Han, Chairman and CEO of Kunlun Wanwei, said: "The text model is a solid foundation for all AIGC. All social, gaming, and music-specific models need to be supported by large text models." At present, GPT, GLM, and Baichuan models alike combine an underlying text model with specialized vertical models.

"Tiangong 3.0", released by Kunlun Wanwei, has 400 billion parameters, surpassing the 314-billion-parameter mark. It is the world's largest open-source MoE model so far and the cornerstone of all of Kunlun Wanwei's AI application models.

Tiangong 3.0 has become the world's largest open-source MoE model

Compared with the previous generation, "Tiangong 3.0" improves markedly in semantic understanding, logical reasoning, generalization, handling of uncertain knowledge, and learning ability. Its technical knowledge ability is up by more than 20%, and its mathematics, reasoning, code, and cultural creativity abilities are up by more than 30%.

As a multimodal large model, "Tiangong 3.0" also integrates AI search, AI writing, AI long-text reading, AI image generation, AI music generation, and other functions. On the authoritative MMBench-CN benchmark, it ranked first in AR (attribute reasoning), RR (relation reasoning), FP-C (fine-grained perception, cross-instance), and CP (coarse perception), and its overall score surpassed GPT-4V, placing it first among the world's multimodal large models.

Tiangong 3.0 multimodal performance surpasses GPT-4V

Building on this all-round leap in performance, "Tiangong 3.0" has also acquired the crucial ability to think independently. This lets it offer users an unprecedented AI application experience in multi-round search, combined tool calling, chart drawing, research mode, enhanced mode, image editing and expansion, and other capabilities.

"Tiangong 3.0" has strong logical reasoning ability:

Tiangong 3.0 can also better understand and process complex semantic information in users' natural language queries, including metaphors and polysemous words. For example, when we asked the Tiangong model about the recently popular "Chengdu Disney", it not only explained this Internet meme accurately, but also followed up by offering to plan our itinerary or share feedback from recent visitors.

See: https://mp.weixin.qq.com/s/S4I6DyqvR7z10s5NeedOPA

Faced with complex needs such as industry research, product evaluation, information analysis, image generation, and chart drawing, "Tiangong 3.0" can bring several capabilities to bear at once and orchestrate the model to complete the task.

As demonstrated in the figure above, when asked to "query the per capita GDP of South Africa in 2023 and make it into a histogram", "Tiangong 3.0" first called the search function, then called the Python tool to draw the histogram, and finally interpreted and summarized the result, giving the correct answer and a comprehensive analysis.
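The search, plot, summarize sequence can be pictured with a small sketch like the one below. Every function here is a hypothetical stand-in written for this example, and the rows are placeholder values rather than real statistics; the article does not describe how "Tiangong 3.0" actually implements tool calling.

```python
from typing import Dict, List, Tuple

Rows = List[Tuple[str, float]]


def search_tool(query: str) -> Rows:
    """Stand-in for a search call. Returns placeholder rows, NOT real data."""
    return [("item A", 3.0), ("item B", 5.0)]


def chart_tool(rows: Rows) -> str:
    """Stand-in for the plotting step: one text bar per row."""
    return "\n".join(f"{label}: {'#' * int(value)}" for label, value in rows)


def summarize_tool(rows: Rows) -> str:
    """Stand-in for the final interpretation step."""
    top_label, top_value = max(rows, key=lambda row: row[1])
    return f"Highest value: {top_label} ({top_value})"


def run_task(query: str) -> Dict[str, str]:
    """Chain the three steps in a fixed order: search, then chart, then summary."""
    rows = search_tool(query)
    return {"chart": chart_tool(rows), "summary": summarize_tool(rows)}


result = run_task("placeholder query")
print(result["chart"])
print(result["summary"])
```

In this sketch the order of steps is hard-coded. The point of the behavior described in the article is that the model itself decides which tools to call and in what order.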

"Tiangong 3.0" first builds a deep understanding of the user's needs through semantic understanding, then uses logical reasoning to break the complex task into subtasks, and finally dispatches those subtasks to different models by independently planning, calling, and combining external tools and information, completing such complex requests accurately and efficiently.

Building on the previous-generation "Tiangong 2.0" large model, "Tiangong 3.0" comprehensively upgrades its content creation capabilities. Beyond AI music generation, AI voice, AI dialogue, and AI anime-style comic generation, dedicated agent training lets it generate pictures in real time during a conversation and perform real-time content analysis and chart construction based on the text request.

Let "Tiangong 3.0" analyze which car is better, Xiaomi SU7 or NIO ET5:

For a complex product comparison like the one above, "Tiangong 3.0" can analyze the content in real time and build charts to make the results clearer.

Postscript

Through the release of "Tiangong 3.0" and "Tiangong SkyMusic", we can see that Kunlun Wanwei's "All in AGI and AIGC" strategy is not just a slogan but genuinely guides its technology and business layout. Relying on the "Tiangong" model as its technical cornerstone, Kunlun Wanwei has laid out a matrix of six AI businesses (AI models, AI search, AI music, AI social networking, AI games, and AI video) and is working to integrate them into a single AI UGC platform.

"Kunlun Wanwei believes that the next generation of AI giants must be consumer-facing and free, because the successful companies of the Internet era and the mobile Internet era all adopted the free, consumer-facing model, and in the AI era we firmly believe in the same logic," Fang Han said.

Since a large model consumes inference resources every time it serves a request, Fang Han summarized three paths for the industry to make a free consumer-facing (to-C) model work: "The first is to keep optimizing until the inference cost falls below the advertising value a user creates. The second is to run inference on the device through AI phones, shifting the inference cost to the terminal hardware. The third is to establish an AI UGC platform, where 1% of users create content and 99% of users consume content."

These three paths do not contradict one another; they belong to different stages of the industry. For example, Fang Han judges that before AI terminal hardware becomes widespread, the AI UGC platform will close the commercial loop more quickly, but the end game for large models must be on-device AI.

Whether it is "Tiangong SkyMusic" or another core AI business, all follow this business logic: AI lowers the threshold for creation and keeps expanding the pool of content creators, which raises the output and richness of personalized content, meets the public's demand for high-quality content, and forms a virtuous circle of positive return on investment.

At the same time, Kunlun will also use AI technology to break the barriers of traditional content creation, so that different cultural and linguistic groups can easily convey their stories and emotions on this AI UGC platform, and promote cultural equality on a global scale.

In the process of promoting the construction of the AI UGC platform, Kunlun Wanwei adheres to the combination of technological innovation and business model innovation, and actively explores the growth path suitable for the current and future markets. Kunlun Wanwei is fully practicing "All in AGI and AIGC", striving to build an inclusive, participatory, and innovative AI content ecosystem on a global scale, leading the industry to a new era.
