China's first AI music model is stunning!

After trying out "Tiangong SkyMusic" early, our entire editorial department was blown away: its takes on Jay Chou and Phoenix Legend are simply superb. The team chose a road less traveled and won the bet: they settled on a Sora-like architecture before OpenAI released Sora, and they are the first in the industry to publish a technical architecture diagram.

The biggest shock of recent weeks has come from Suno.

The heavily upgraded Suno V3 arrived with viral hits that swept the entire internet and drove the whole world wild.

Who would have thought that music's ChatGPT moment would come like this?

Everyone in the industry is saying the same thing: this time, the music industry may be upended by AI.

The first music AI in China is here!

Sure enough, just last week, China's first AI music generation model, "Tiangong SkyMusic", officially opened its closed beta!

Various expert users have already begun showing off their creations on the homepage.

Short on inspiration? The product page even suggests topics to get you started.

After trying it out, I felt once again the amazement that Suno gave people at the beginning.

First experience: back to the 80s in one second, with vocals hard to tell from the real thing

For example, in the song "Long Ancient Rhyme", as soon as the clear female voice begins, you are instantly taken back to the 80s, with the feel of my mother's square-dance music.

The song "Love is Happiness" is even better, and made the editor's jaw drop. The melody is catchy, the lyrics are timeless and evocative, and there is a hint of Tanya Tsai when you listen closely.

In addition to the very high overall music quality of the songs, one of the biggest highlights of "Tiangong SkyMusic" is its clear and realistic vocals.

Vocal synthesis is the most important dimension of AI music generation, and the one that best reflects generation quality.

The AI vocal synthesis of "Tiangong SkyMusic" produces singing with highly proficient Chinese and clear pronunciation, with excellent audio quality and realistic delivery, reaching the SOTA level in the industry!

In this respect, "Tiangong SkyMusic" outperforms several large foreign models, whose Chinese pronunciation is simply poor.

For example, Suno's "Kung Pao Chicken" sings Chinese lyrics with the accent of a foreigner speaking Chinese.

In Suno's Cantonese "Qili Xiang", the pronunciation is also very non-standard.

Clearly, if you want to make Chinese songs, you have to look to a home-grown music model!

Controllability, a professional indicator for musicians

Next, let's look at some professional indicators.

Lyric sections

Why can a song become popular all over the Internet and all over the country?

From the perspective of pop music, it needs to have strong melodies, distinct rhythms, colorful harmonies, and passionate emotions.

Therefore, if you want to make an ear-catching pop song, the subtle emotional changes between different lyric passages are a very important point.

And "Tiangong SkyMusic" is particularly good at this:

It controls the song through the lyrics, bringing out the differences between verse and chorus, and between intro and verse.

For example, in the song "Dragon Walking", the melodious female folk singing at the beginning contrasts sharply with the stirring male and female duet, and the result is a majestic Chinese-style song.

Style

In terms of style control, it can refer to specific audio to learn a specific genre.

The song "Flying Bird" that it created sounds very much as if it had learned Xu Wei's folk style.

Automatic intro, interlude, outro

One of the problems that music producers often face is that they already have a suitable song, but they lack an intro and outro, and they can't find the right one after racking their brains.

This is where "Tiangong SkyMusic" can help. It filled out the song "Guitar" into a complete track; the lazy, casual singing is just right and sounds very soothing.

Harmony

According to the lyrics, "Tiangong SkyMusic" automatically added harmony to this "Water Tune Song Head".

The harmony of several male voices and the timbre of the lead singer are very compatible, combined with the rhythmic drum beat, a majestic national style "Water Tune Song Head" was born.

Singing techniques

Moreover, the model can also draw on the characteristics of the reference audio and learn singing techniques from it.

For example, the vibrato version of "Lost".

The operatic version of "My Skate Shoes".

Glory of Kings, Jay Chou, Phoenix Legend, you can have it all

How can today's pop-culture phenomena be blended with pop music? Find the hook that resonates with the public, and a Douyin viral hit is not hard to make.

"SkyMusic" makes all this possible.

Enter structured lyrics plus reference audio, and you can write a song about your experience of playing Glory of Kings.

I opened the Glory of Kings today to choose Zhao Yun

After the start, I was blown up everywhere I went

I was so angry that I had to hide in the grass
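To make "lyrics with structure" concrete, here is a minimal sketch of how section-tagged lyrics could be parsed before being sent to a model. The `[verse]`/`[chorus]` tag syntax, the `Section` class, and the assignment of the three lines to sections are all assumptions for illustration; the article does not document SkyMusic's actual input format.

```python
import re
from dataclasses import dataclass, field

# Hypothetical section tags; SkyMusic's real input format is not documented here.
SECTION_TAG = re.compile(r"^\[(intro|verse|chorus|bridge|outro)\]$", re.IGNORECASE)


@dataclass
class Section:
    name: str
    lines: list[str] = field(default_factory=list)


def parse_structured_lyrics(text: str) -> list[Section]:
    """Group lyric lines under the most recent section tag."""
    sections: list[Section] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SECTION_TAG.match(line)
        if match:
            sections.append(Section(match.group(1).lower()))
        elif sections:
            sections[-1].lines.append(line)
        else:
            # Lines that appear before any tag are treated as an untagged verse.
            sections.append(Section("verse", [line]))
    return sections


# The verse/chorus split below is an illustrative guess, not the original song's.
lyrics = """
[verse]
I opened the Glory of Kings today to choose Zhao Yun
After the start, I was blown up everywhere I went
[chorus]
I was so angry that I had to hide in the grass
"""

sections = parse_structured_lyrics(lyrics)
for section in sections:
    print(section.name, len(section.lines))
```

Keeping the structure explicit like this is what would let a model treat a verse and a chorus differently, which is the kind of control described in the section on lyric sections.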

Alternatively, we can create a remake based on lyrics we already have.

For example, input the lyrics of "Rainbow", then record the verse and chorus of "The Longest Movie" as reference audio, and a new song, the "offspring" of the two, is born:

You can hear that some of the melodies are quite striking.

How about pairing the lyrics of Rihanna's "Diamonds" with "See You Again", Taylor Swift's showstopping hit from the Victoria's Secret show?

The resulting English song, sung by a "hybrid" female voice, sounds like this:

The vocal control is excellent, the transitions between high and low registers are smooth, and the repeated key changes in the chorus are quite magical and worth savoring. Such an uncanny melodic combination is rarely heard from human composers, and this is the ingenuity of AI.

Amazingly, the singing voice suddenly becomes like Rihanna's rather than Taylor Swift's.

Next, let's try Phoenix Legend's "The Most Dazzling National Style". The difference is that this time both the original lyrics and the original song are fed in, so that the model "remakes" the song itself.

What comes out is a different kind of square-dance hit.

Not only that, we can even turn sudden trending events into potential viral hits within minutes.

Have a listen to the rap version of the hot meme "high-speed running machinery":

So, how did SkyMusic achieve such amazing results?

To find out, we recently talked to the leads of the founding team.

A road less traveled

Symbolic or large model?

I believe everyone has the same question: why was there no good music AI before, and why has it only appeared recently?

Because, of course, it's very hard!

Good AI music is difficult to make for two reasons. One is that the previously mainstream symbolic (MIDI) approach gives poor results. The other is that past music AI worked almost entirely on background music without vocals; songs with vocals either could not be made at all or sounded very poor.

It goes without saying how much a song's appeal differs with and without vocals.

Specifically, there are two main technical paths for AI music generation: the symbolic approach and the large-model approach. The symbolic approach is dominated by MIDI.

MIDI stands for Musical Instrument Digital Interface. A MIDI file contains no audio; it records performance instructions, such as which note is played, at what volume, and for how long.

Because a song cannot be generated directly this way, instruments, melodies, timbres, and vocals have to be added at a later stage.
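The gap between performance instructions and actual sound can be shown with a small sketch. The `NoteEvent` class and `render_sine` function below are illustrative assumptions, and the representation is simplified: real MIDI uses note-on and note-off messages timed in ticks rather than seconds. The point is only that a list of events is not audio until a separate step renders it, here with plain sine tones.

```python
import math
from dataclasses import dataclass


@dataclass
class NoteEvent:
    """A simplified MIDI-style instruction (hypothetical helper, not a real MIDI message)."""
    pitch: int       # MIDI note number, 0-127 (69 is A4)
    velocity: int    # loudness, 0-127
    start: float     # onset time in seconds
    duration: float  # length in seconds


def note_frequency(pitch: int) -> float:
    """Equal-temperament frequency with A4 (note 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


def render_sine(events: list[NoteEvent], sample_rate: int = 8000) -> list[float]:
    """Turn note instructions into a waveform; this is the 'later stage' MIDI leaves out."""
    total = max(e.start + e.duration for e in events)
    samples = [0.0] * int(round(total * sample_rate))
    for e in events:
        freq = note_frequency(e.pitch)
        amp = e.velocity / 127
        first = int(round(e.start * sample_rate))
        count = int(round(e.duration * sample_rate))
        for i in range(count):
            if first + i < len(samples):
                samples[first + i] += amp * math.sin(2 * math.pi * freq * i / sample_rate)
    return samples


# Three notes (C4, E4, G4), half a second each: instructions only, no sound yet.
melody = [
    NoteEvent(60, 100, 0.0, 0.5),
    NoteEvent(64, 100, 0.5, 0.5),
    NoteEvent(67, 100, 1.0, 0.5),
]
audio = render_sine(melody)
print(len(melody), "events ->", len(audio), "samples")
```

Three short events expand into thousands of samples, and everything that makes the result sound like an instrument or a singer has to come from the renderer, not from the events themselves.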

The second route, large-model audio generation, learns and generates audio waveforms directly; instruments, vocals, melody, volume, and notes are all generated together end to end.

There is plenty of research on MIDI, but the results are poor. The large-model audio route is extremely difficult, and very few teams have attempted it.

Which one to choose?

At the start of the project, the company faced this difficult choice. The former does not give good results, and the latter might well prove impossible, in which case the whole project would come to nothing.

In the end, the R&D team voted unanimously for the audio route. Their reasoning: better to take a big risk for the chance of making truly good AI music.

Luckily, they succeeded.

Note that the diagram shown here is priceless.

SkyMusic's three core modules: Encoder, DiT, Decoder

That is because, at present, none of the AI music model companies on the market, Suno included, has disclosed its technical approach.

After ChatGPT came out, LLMs flourished, because there were countless open-source projects to refer to.

For the audio route with vocals, however, there was no public information to draw on. Tiangong poured vast R&D resources, computing power, and algorithm work into the problem before arriving at the extremely valuable architecture diagram above.

The team has already hit the pitfalls so that others need not, and has now generously shared this reproducible approach.

Coincidentally, although the final framework is similar to Sora, Sora had not yet been released when it was developed.

It can only be said that great minds think alike.

Let's talk about music

In this Sora-like architecture, a large-scale Transformer is responsible for composing: it controls the music's structure and style by learning the contextual dependencies among Music Patches.

This is what gives control over style.

The Diffusion Transformer (DiT) is responsible for singing, that is, generating and rendering the sound. It converts Music Patches into high-quality audio through latent diffusion model (LDM) technology, giving the music clear stylistic character and good sound quality.
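How SkyMusic defines a Music Patch has not been published, so the following is only a toy sketch of the general idea behind patch-based models: a long sequence of latent frames is cut into fixed-length patches that a Transformer can treat as tokens, and the patches are stitched back together before decoding. The function names, the patch length, and the one-number-per-frame latent are assumptions for illustration.

```python
def patchify(frames: list[float], patch_len: int) -> list[list[float]]:
    """Split latent frames into fixed-length patches, zero-padding the last one."""
    padded = frames + [0.0] * (-len(frames) % patch_len)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]


def unpatchify(patches: list[list[float]], length: int) -> list[float]:
    """Concatenate patches and drop the padding to recover the original frames."""
    flat = [value for patch in patches for value in patch]
    return flat[:length]


# Ten toy latent frames (stand-ins for an encoder's output), patches of four frames.
latent_frames = [0.1 * i for i in range(10)]
patches = patchify(latent_frames, patch_len=4)
restored = unpatchify(patches, len(latent_frames))

print(len(patches), "patches of", len(patches[0]), "frames each")
print("round trip ok:", restored == latent_frames)
```

In a full system, the composing Transformer and the Diffusion Transformer would both operate on sequences of such patches, with a trained encoder and decoder on either side; none of that learned machinery is reproduced here.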

When AI starts learning emotions

And if we listen carefully to the above works, we will feel that "Tiangong SkyMusic" captures the emotions of music very delicately.

The music it produces has a rich emotional arc and dynamic variation.

It is this strengthening of emotional expression that lets it generate works with different emotional atmospheres according to the lyrics and musical elements.

Compared with earlier AGI models, which focus on improving intelligence, its "emotional AGI" route is particularly rare and valuable.

It is not only a smart AI, but also an AI that strives to understand and simulate human emotions and to express them through music.

Compared with AI tools on the market that focus on melody writing by learning from large numbers of musical passages, or that work at the level of chords, rhythm, and arrangement, the emotional dimension of "Tiangong SkyMusic" has become its differentiating highlight in the industry.

Where it beats Suno and Stable Audio 2.0

Compared with AI music tools such as Suno on the market, the AI music generation model "Tiangong SkyMusic" has unique advantages.

Behind it is "Tiangong 3.0", a multimodal super model on the order of 400 billion parameters, built on the MoE (mixture-of-experts) architecture.

Backed by its industry-leading logical reasoning, semantic understanding, and generalization capabilities, the response speed and the training and inference efficiency of "Tiangong SkyMusic" have also been greatly improved.

First of all, in Chinese, the AI vocal synthesis of "Tiangong SkyMusic" is excellent, with clear pronunciation and no audio artifacts.

In particular, thanks to in-depth optimization for Chinese, its Chinese singing is well matched to the needs of the Chinese market.

Secondly, in terms of music style, "SkyMusic" also has the edge.

It can control emotional changes through lyrics, and realize a variety of singing techniques such as vibrato, opera, chanting, etc., so that the resulting musical work is more emotionally rich and contextual.

In addition, "Tiangong SkyMusic" also supports the creation of rap, folk, funk, antique, electronic and other music styles, and users can customize the music style according to their personal preferences.

However, "Tiangong SkyMusic" and AI tools such as Suno alike are still some way from music that expert listeners could not tell from the real thing.

This is also why Kunlun Wanwei decided to make its technical architecture public, hoping that the industry will work together to advance the field.

AI won't replace musicians

In addition to "Tiangong SkyMusic" and Suno, Udio, another mysterious music model that has been dubbed a "Sora for music" in recent days, has attracted attention across the internet.

Netizens who got test access said that Udio's music generation was much stronger, and that they even felt the power of AGI.

Has AI really reached the point where it can replace human singers?

Is originality really no longer important?

Clearly, neither is true.

The rapid iteration of AI music generation technology is undoubtedly changing the way and experience of music creation.

However, this does not mean that AI will completely replace musicians, or make originality no longer important.

On the contrary, AI music generation technology and music creators can complement each other.

On the one hand, powerful AI can lower the barrier to entry for music creation.

Even non-professionals have the opportunity to get in touch with music and create music works of a certain standard.

This will greatly expand the group of music creators and stimulate diverse music forms and cross-border cooperation.

On the other hand, tools such as SkyMusic can empower music creators.

They can help musicians improve their creative efficiency by simplifying melodic prototyping, providing creative inspiration, and assisting in the production of high-quality accompaniments.

Fang Han, chairman and CEO of Kunlun Wanwei, once said in an interview:

In the content production industry, there is a rule: if the threshold for content production is halved, the number of content creators increases tenfold.

Therefore, when the threshold for music creation is lowered, more people will become "original musicians".

All in all, viewed statically, many people will think that the emergence of AI music has taken a slice of the music industry's pie.

Viewed dynamically, however, technological progress can make the music market bigger and bigger; the industry will flourish and give rise to a new content ecosystem and new music formats.

For example, new business models such as on-demand customized music services and subscriptions to online music creation tools can bring new consumption growth to the music industry.

At present, many domestic music platforms have set up a section for music made with Suno AI, unlocking a new source of traffic.

In education, AI music creation can help learners quickly grasp the principles of music creation, try out a variety of musical styles, and cultivate a new generation of talent for the music industry.

Let everyone express themselves better

On a more macro level, alongside images and video, AI music is also an important part of the road to emotional AGI.

Music is not only an art form, but also a way to communicate and express emotions.

Moreover, music can touch the depths of people's emotions and is an important medium for emotional expression.

In their research on AGI, many teams have focused on expanding and enhancing model intelligence.

The ultimate goal of real AGI, however, is to be more human-like, combining sensibility and rationality with reasoning, logical thinking, and emotional understanding.

It is precisely because of this realization that Kunlun Wanwei, which has always regarded emotional AGI as an important direction, hopes to overcome the big technical problem of music AI.

In the process of developing "SkyMusic", the research team actively explored the unique advantages of audio content, especially music, in the understanding and expression of emotions.

They not only pay attention to the technical aspects of composing, arranging and singing musical works, but also emphasize the model's ability to perceive and reproduce the emotional colors of music.

The accuracy and diversity of emotional expression, as well as the sensitive capture of emotional changes in lyrics and paragraphs, confirm that Kunlun Wanwei has made substantial progress in emotional AGI.

Of course, beyond AI music generation, Kunlun Wanwei is also exploring creative-tool applications in AI writing, painting, animation, and other fields.

Along the main line of emotional AGI, they hope to use self-developed AI technology to help creators better express and convey emotion.

In the next 30 years, more and more people will express themselves, and the self-expression of human society will increase by 1,000 times.

Kunlun Wanwei's next task is to let AI lower the threshold for human creation, so that everyone can fully express themselves.
