Liu Xiaoguang, DeepMusic: An in-depth interpretation of the principles of AIGC music creation technology

Author | GenAICon 2024

The 2024 China Generative AI Conference was held in Beijing on April 18-19. At the AIGC Applications session in the main venue on the second day, Liu Xiaoguang, CEO of DeepMusic, delivered a speech entitled "How AIGC Empowers Music Creation and Production".

Liu Xiaoguang systematically reviewed the current music business landscape, including the characteristics of different music user groups, the main products they use, and the profit models of the related music companies.

He noted that the current music business mainly serves the listening and singing consumption of casual music lovers and shallow-practice users, while the needs of more than 100 million active musicians and music practitioners are still not well met by existing products. At the same time, the music production pipeline is long and has a high barrier to entry, which is precisely where music AIGC technology can be useful.

Music creation and production has a real professional threshold, and it is hard for non-professionals to express themselves through music; the development of AIGC has opened up another possibility for music creation. Liu Xiaoguang reviewed the 40-year evolution of music production tools and its three key stages in detail, and also analyzed a number of recently popular AI music generation products, inferring the technical approaches behind them.

Liu Xiaoguang explained in depth the working principles, training data, and algorithms behind the two main types of AI music models, audio models and symbolic models, and shared the design logic of the cross-platform, one-stop AI music workstation "Chord School". "Chord School" presents the information of music creation through a more intuitive "functional score", solves the difficulty of communicating across modalities such as lyrics, melody, and accompaniment, and delivers a music creation and production experience spanning PC and mobile.

He believes that by next year the industry will be able to generate high-quality accompaniment from natural language, and that uploading just 30 seconds of vocal material will be enough to generate songs sung in your own voice. Going forward, DeepMusic will also use its accumulated, finely annotated data to achieve fine-grained control over audio models.

The following is a transcript of Liu Xiaoguang's speech:

Our company focuses on music AIGC technology, so drawing on our expertise in this field, I will discuss the following aspects: the current state of the industry, the potential impact of AIGC on the music industry, the application of AIGC in music data and technology, and future development trends.

1. The music business landscape: 800 million monthly active casual music lovers, and an extremely top-heavy industry

Let's start with the music industry as a whole.

The outermost group in the music industry is casual music lovers, who mainly experience music by listening to songs; the main products they use include Kugou Music, QQ Music, and NetEase Cloud Music. According to listed-company data, this group has about 800 million monthly active users.

Casual music lovers are the most broadly represented group in the music industry. Out of interest in music, some listeners go on to take part in music-related activities. Shallow practice, for example, means singing karaoke and attending live music performances, mainly using products such as WeSing.

In the medium-practice stage, users typically turn to products such as GarageBand, which comes preinstalled on Apple devices, and Perfect Piano from the Android app stores. Perfect Piano has been downloaded close to 100 million times on Android, but its retention is relatively low. This suggests that medium-practice users are beginning to develop creative demands around music that the products currently on the market do not fully meet.

Next come deep-practice users, mainly young people between 15 and 30 and middle-aged and elderly people over 50. About 15% of the young people have taken part in music interest clubs, and about 15% of the middle-aged and elderly have joined groups such as senior choirs. These users are gradually showing a willingness to create, and the group is estimated at around 20 million. We call these people who actively engage in music practice music practitioners.

One step further up from music practitioners are musicians. There are about 1 million musicians in China in total, mainly registered on platforms run by Tencent, NetEase, and Douyin, and mainly engaged in creation and performance. To count as a musician, one must have released at least one original work. Most musicians do not come out of traditional professional music education, which supplies little talent to the digital music industry; instead, they pick up their skills through vocational training schools.

The software musicians use is telling: Word for lyrics and a voice recorder for composing. It may seem odd to make music with such tools, and clearly a voice recorder cannot produce the high-quality works we usually listen to.

There is also a group known as music producers. They usually work their way up from being musicians, and it takes years of production experience to become competent. Their main task is to turn the demo recordings provided by musicians into finished productions. The common music production software here includes Cubase, from Yamaha's subsidiary Steinberg, and Apple's Logic Pro; these are the mainstream music production tools today and usually run on personal computers. However, the barrier to entry for this software is extremely high.

That is the picture on the user side. How does the business side operate?

In this space we see industry players such as record labels and artist-management companies. Their main job is to sign top music producers, source original songs from musicians, and then produce those songs and publish them to the major music platforms, such as Tencent Music and NetEase Cloud Music.

These platforms are the first-tier companies of the industry, with combined annual revenue of about 50 billion yuan. Roughly 35% of that comes from membership fees, that is, subscriptions paid by what is now more than 100 million users; another 55% comes from users' entertainment spending, and the remaining 10% from advertising.

Of that 50 billion yuan, about 10 billion is allocated to music creators and record labels. The labels then distribute revenue by play share, which is determined by a song's plays as a proportion of total plays across China's music audience.

The music industry is extremely top-heavy. Take Jay Chou's play share as an example: his songs account for 5.6% of China's overall music market, which means roughly one listen in twenty is a Jay Chou song.

In short, the music business of these outer circles is highly concentrated, its business model is relatively mature, and its problems have largely been solved.

2. AIGC breaks the high-cost constraints of music production, and audio models usher in the Music Production Tools 3.0 era

What AIGC mainly aims to solve are the problems of the middle tiers of the music world.

We noticed that medium-practice music users lack good products for interactive learning and growth. Deep-practice users likewise lack good software to help them improve their skills. And musicians run into problems creating with Word and voice recorders: even when they manage to create something with these tools, the music producer still has to do a great deal of repetitive work to take it further.

We believe the goal of AIGC in music is to serve music practitioners, roughly 10% of the world's population.

Consider this: although 30 to 40 percent of children in China study music in primary school, why don't they grow up to match the talent the music business needs? Because our music education emphasizes basic music theory, singing and harmony, analysis of compositions, and instrumental training, which ultimately turns students into performance machines.

However, what real musical practice, entertainment, and business environments require are skills across the production process: lyric writing, composing, arranging, recording, singing, and post-production. Lyrics and composing are easy enough to understand; arranging means creating the accompaniment.

Accompaniment refers to the sounds in a song apart from the vocals: drums, guitar, bass, and other instruments. The threshold for becoming proficient at arranging is very high. Today, if I am interested in music and want to turn an idea into a finished track, the process is difficult, expensive, and slow.

Next, I will share the evolution of music production tools over the past 40 years.

First, before 2000, in the Music Production Tools 1.0 era, almost all music production relied on hardware recording. Music from that time was full of feeling, because only the most professional musicians had the opportunity to take part in the recording process.

The second stage is the Music Production Tools 2.0 era. Apple, Yamaha, and Avid launched software known as digital audio workstations. These run on computers and have a very high barrier to entry, but they are extremely powerful: using MIDI and samplers on a computer, they can emulate the sound of traditional instruments such as pianos and guitars.

MIDI is a digital protocol that records pitch in chronological order: when I play a note, it records which pitch it was and exactly when it was played. String together a series of such events, and a computer can eventually synthesize a complete piece of music.
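To make this concrete, here is a minimal sketch using the open-source Python library mido (my choice for illustration; the speech names no specific tooling) of what MIDI actually stores: timed note events rather than sound.

```python
# A minimal illustration with the mido library: MIDI stores timed note
# events (pitch, velocity, when), not audio. A synthesizer turns these
# events into sound later.
import mido

mid = mido.MidiFile()                      # default resolution: 480 ticks per beat
track = mido.MidiTrack()
mid.tracks.append(track)

track.append(mido.MetaMessage('set_tempo', tempo=mido.bpm2tempo(120)))
# Start middle C (MIDI note 60) immediately...
track.append(mido.Message('note_on', note=60, velocity=64, time=0))
# ...and release it 480 ticks later, i.e. one quarter note at 120 BPM.
track.append(mido.Message('note_off', note=60, velocity=64, time=480))

mid.save('one_note.mid')                   # a few bytes of events, not a waveform
```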

Entering the 2.5 era, the music industry underwent a major transformation. Tencent Music Entertainment Group drove the commercialization of music entertainment, lifting the industry's revenue to 50 billion yuan and allowing musicians to really make money.

At the same time, music production tools became more and more mobile; some tools now run on both computers and phones and keep growing more powerful. AI also began to be used to generate these digital signals.

What comes next is Music Production Tools 3.0: the audio model, which is analogous to the TTS model in speech.

In music, AI generation approaches can be roughly divided into two categories: audio schemes and symbolic (music-notation) schemes.

On the audio side: our company began focusing on music AI in 2018, when audio models were not yet mature. The basic approach is to tag millions of songs and map a natural-language model onto an audio model, so that audio can be generated from prompts.

At that time, most AI companies worked on symbolic schemes, precisely because audio models were not yet mature.

The core idea of the symbolic scheme is to extract musical information from the songs we usually hear, including lyrics, melody, singing style, chord progression, instrumentation, and instrument timbre, and then annotate this information digitally. By training on these symbolic representations, new ones can be generated; finally, the generated notation is rendered into audio through the traditional music production process.
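As a rough sketch of what one such digital annotation might contain (the fields mirror the information just listed, but the schema is illustrative, not DeepMusic's actual format):

```python
# Hypothetical shape of one annotated song in a symbolic dataset;
# field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int          # MIDI pitch number, e.g. 60 = middle C
    start_beat: float   # onset position, in beats
    duration: float     # length, in beats

@dataclass
class SongAnnotation:
    lyrics: list[str]                  # time-ordered sung lines/syllables
    melody: list[Note]                 # the vocal melody
    chords: list[tuple[float, str]]    # (beat, chord symbol), e.g. (0.0, "Am7")
    sections: list[tuple[float, str]]  # (beat, label), e.g. (16.0, "chorus")
    key: str                           # e.g. "E major"
    instruments: list[str]             # instrumentation / timbre tags
```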

This process involves three main technical areas: first, music information extraction, the technology behind features such as song recognition; second, technologies for generating musical symbols, such as AI lyrics, AI composition, and AI arrangement; and third, turning the symbols into audio, which is exactly what digital audio workstations have always done.

3. Inferring the technical approaches behind popular music generation products, and building a one-stop music workstation

Recently you may have noticed products like Suno and Udio appearing everywhere, but the underlying technological breakthroughs came from MusicLM and MusicGen.

These two systems were the first that could control audio through natural language and generate it frame by frame, a disruptive development that first appeared around early last year. Later, Suno and Udio adopted the audio scheme, while products such as NetEase Tianyin and Tiangong SkyMusic used the symbolic scheme.

The audio scheme and the symbolic scheme each have their own strengths. The audio scheme is an end-to-end model, which makes the resulting music sound more realistic, more complete, and better blended. The symbolic scheme, on the other hand, allows control over every aspect of the generated content. We believe the two will converge in the future.

MusicLM and MusicGen can roughly generate background music from a natural-language prompt without a prominent foreground melody. Knowing this helps a great deal when inferring how a product is technically implemented: pieces in which vocals and accompaniment are seamlessly fused are likely built on the audio scheme, whereas audio generated by the symbolic scheme may have higher sound quality but blends accompaniment and vocals less well, and is better suited to pure BGM.

The symbolic scheme and the audio scheme use different technology stacks.

For our symbolic scheme, we use leading algorithms and process the data with our own annotation tool. Take the well-known "Qi Li Xiang" ("Seven Mile Fragrance"): in our annotation tool, the blue waveform at the top represents the audio file, and we need to annotate the key music-theory information within it.

First, these blue lines are automatically detected and aligned with the bar lines above, numbered 11, 12, 13, and so on. Next, we mark the important music-theory information, such as melody, lyrics, chords, sections, and key. Once these annotations are complete, one can generate a melody from a single modality, align melody against lyrics, or generate accompaniment and melody from input lyrics. With a large amount of such data, we can build generative AI models.
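Assuming the illustrative SongAnnotation structure sketched earlier, the annotated fields could be paired into cross-modal training examples along the directions just described; the helper below is a hypothetical sketch, not production code:

```python
# Hypothetical: assembling cross-modal training pairs from the annotations,
# matching the directions mentioned above (lyrics -> melody, melody -> chords,
# lyrics -> arrangement). Reuses the illustrative SongAnnotation from the
# earlier sketch.
def make_training_pairs(song: SongAnnotation) -> list[tuple[str, object, object]]:
    return [
        ("lyrics->melody", song.lyrics, song.melody),    # align melody to lyrics
        ("melody->chords", song.melody, song.chords),    # learn harmonization
        ("lyrics->arrangement", song.lyrics, (song.chords, song.sections)),
    ]
```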

Since the popular audio-scheme products have not disclosed their implementations, we have speculated through extensive experiments, and I will share our resulting understanding of this combination of AI and music. We believe this mode of production overturns our assumptions about intelligent technology.

Audio-model products have become popular recently. The experience they offer goes roughly like this: you type in some lyrics and a few prompt words, and the product generates the complete music.

By our inference, the algorithm may work like this. First, you need a large corpus of music with the corresponding lyrics labeled; this kind of data can be obtained directly from platforms such as QQ Music. Second, there is now a mature technique called vocal-accompaniment separation, which can split a recording into its vocals and its accompaniment.

In the separated vocals of the live demo, you could even hear the harmonies. During training, the audio is roughly sliced into segments, and the model learns to produce the final complete music from the separated BGM together with the lyrics annotated on it. That is roughly how the model works.

So what we end up seeing is this: we input a prompt, the system finds the best-matching audio clip from a BGM library, and then overlays a vocal model on top of that audio according to the input lyrics or the desired instruments. These models do not understand music the way we do; they understand it as a TTS problem, a person listening to an accompaniment and reading the lyrics aloud over it. Because the whole process is an end-to-end model, the fusion of accompaniment and vocals is excellent throughout the music.
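As a toy illustration of this speculated flow, with every component a deliberate stand-in (a hashed bag-of-words in place of a learned embedding, a three-entry dictionary in place of a BGM library, and a string in place of a real vocal render):

```python
# Toy sketch of the speculated inference flow: match the prompt against a
# BGM library by embedding similarity, then "sing" the lyrics over the
# chosen accompaniment. Real systems are end-to-end neural models; every
# component here is a deliberately crude stand-in.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in text embedding: hashed bag-of-words, L2-normalized."""
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

BGM_LIBRARY = {                       # description -> (hypothetical) audio file
    "upbeat acoustic pop guitar": "bgm_001.wav",
    "slow melancholy piano ballad": "bgm_002.wav",
    "energetic electronic dance": "bgm_003.wav",
}

def generate_song(prompt: str, lyrics: str) -> tuple[str, str]:
    # 1. Retrieve the accompaniment whose description best matches the prompt.
    q = embed(prompt)
    best = max(BGM_LIBRARY, key=lambda desc: float(embed(desc) @ q))
    # 2. Overlay a vocal "track": in the real model, a TTS-like generator
    #    conditioned jointly on the BGM and the lyric text.
    vocals = f"<vocals: lyrics sung over {BGM_LIBRARY[best]}>"
    return BGM_LIBRARY[best], vocals

print(generate_song("melancholy piano song", "First line\nSecond line"))
```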

4. "Chord School", one-stop low-threshold music creation software: solving the three challenges of controllability, compatibility, and cross-platform support

I have just given an overview of the music industry and of how audio models and symbolic models work. Now let me introduce our own product, "Chord School". It is one-stop, low-threshold music creation software for mobile devices, with AI playing a large role. We hope to solve several problems with this product.

First, we want AI to be controllable, but in music we are redefining what control means. When we talk about describing musical knowledge, the first thing most people think of is staff notation. But the staff is a product of two hundred years ago, before the phonograph existed; its purpose is to record how music should be performed, which is quite different from how today's pop music is made. We wanted a more intuitive way to control the music.

Second, in the past, when creating music we might write the lyrics in Word and record the tune with a voice recorder. We wanted to combine these capabilities into a single platform for a one-stop creation experience. Moreover, in the producer and musician world, everyone buys different sound libraries, which makes project files incompatible with one another. We wanted to solve this and make project files compatible across sound libraries.

Third, we want people to be able to make music on their phones, not only on their computers. But making music on a phone is genuinely hard; on Android, for example, there is no good audio engine to support this kind of development. So we spent a great deal of time building a cross-platform audio engine to solve the problem.

Our overall design idea is as follows; this is the functional score. First, we recognized that this product is not for all of humanity but for roughly 10% of it. The functional score is essentially everything a music lover needs to know: it includes the sections and chords, which tell the players how to play, and the melody and lyrics, which tell the singer how to sing.

Pop music is not that complicated; it usually contains just one accompaniment and one vocal part. The vocal part is written in numbered notation, such as "Do, Re, Mi, Do, Re, Mi", with lyrics to guide the singer. The sections and chords instruct all the instruments how to play. Together, these two parts make up the functional score.
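A functional score is therefore a very small amount of data. A sketch of its shape, with illustrative field names, might be:

```python
# Illustrative shape of a functional score: sections + chords for the
# players, numbered-notation melody + lyrics for the singer.
functional_score = {
    "sections": ["verse", "chorus"],
    "chords": {
        "verse":  ["C", "G", "Am", "F"],   # one chord per bar, per section
        "chorus": ["F", "G", "C", "C"],
    },
    # Numbered notation: 1=Do, 2=Re, 3=Mi..., each paired with a lyric syllable.
    "melody": [("1", "la"), ("2", "la"), ("3", "la"), ("1", "la")],
    "key": "C major",
}
```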

We create music by writing functional scores, or by presenting them in other forms. Turning ideas into a functional score is music creation; turning the functional score into music we can hear is music production. This process culminates in our product, Chord School.

In Chord School, we provide an editing page for the functional score. You can enter chords, melody, and lyrics as you like. With the AI-assisted arrangement feature you can generate an accompaniment; with AI singing synthesis, what you entered can be sung.

With the help of a large amount of aligned lyric, melody, and chord data, we can implement features such as generating chords from a melody or a melody from chords. This means you can type in some lyrics and we will generate a complete song for you; or you can hum a melody and we will match it with chords and accompaniment. It can all be done within one piece of software.
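To hint at what "melody generates chords" means at the interface level, here is a deliberately naive rule-based stand-in; the actual feature is a model trained on the aligned data described above:

```python
# Naive rule-based stand-in for "melody -> chords": pick, for each melody
# note, a diatonic chord in C major whose tones include that note. The real
# feature is learned from aligned lyric/melody/chord data, not rules.
DIATONIC_CHORDS = {          # chord name -> pitch classes of its chord tones
    "C": {0, 4, 7}, "Dm": {2, 5, 9}, "Em": {4, 7, 11},
    "F": {5, 9, 0}, "G": {7, 11, 2}, "Am": {9, 0, 4},
}

def harmonize(melody_midi_pitches: list[int]) -> list[str]:
    chords = []
    for pitch in melody_midi_pitches:
        pc = pitch % 12  # reduce to pitch class (C=0, C#=1, ...)
        match = next((name for name, tones in DIATONIC_CHORDS.items()
                      if pc in tones), "C")
        chords.append(match)
    return chords

print(harmonize([60, 62, 64, 65]))  # C D E F -> ['C', 'Dm', 'C', 'Dm']
```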

We offer different interactive experiences for different users. For medium-practice users, large language models can generate lyrics and then derive the rest of the musical information from them. Deep-practice users usually already understand the concept of chords but may not know the specifics well. Musicians who go deeper still can edit every chord, adjust pitches, and modify lyrics to quickly create the BGM they need.

We can mute the guitar track, switch it to an electric guitar, and change the playing style, so even someone who does not play guitar can create freely. Many users have made great works this way, and some of them have genuinely moved me.

Our entire product ships as a single mobile app. We are firmly committed to mobile because we believe many post-00s and post-05s kids are simply not used to computers. We foresee a future in which most of the music production process happens on phones, moving to a computer only for the final fine adjustments.

5. In 2025, upload 30 seconds of vocals and have songs sung in your own voice

Let's talk about our vision for the future of the music industry.

First of all, we do not think technologies such as AI and big data will change music consumption much. The music industry already has more supply than demand, so although AI improves production efficiency, it will not massively reshape the consumption ecosystem. On the production side, however, we believe more and more people will get involved and enjoy it.

The new audio models can generate a complete BGM from a simple prompt, and TTS-style models can generate a complete song. Next, we can foresee people making their own personalized BGM and filling in lyrics over it. Every line of the lyrics can be re-edited; if the second line is not ideal, it can be rewritten.

At the same time, adjustments such as volume will become more flexible. We are confident that more than one company will launch such a product by the end of this year. At that point, making music will become far more accessible: musicians might first pick a BGM they like, use a language model to find inspiration for the lyrics, revising and experimenting line by line, and finally follow the traditional recording and production workflow to finish and publish the work.

Next year we will probably be able to generate accompaniment from natural language, and the sound quality should be quite good. All you will have to do is upload about 30 seconds of vocal material, and the song can be sung in your own voice, at a basically usable level of quality.

At that point, we can move away from the traditional recording or "MIDI + sampler" workflow and instead use a "BGM + lyrics" input method. We need only make simple coarse-grained changes, fine-tune the music with the audio model, and then publish the work directly.

At that point, our understanding of music production tools may return to laying out a song: set up its structure, including the intro, interludes, and choruses, and then type the lyrics in. We can split the song into different regions and choose a different instrument library for each one. The user can generate the entire piece by dragging an instrument into the appropriate region and indicating how it should play.

Ultimately, this experience requires combining recording, MIDI, samplers, and audio models. China's audio models may still lag somewhat, but we firmly believe the data we have accumulated will undoubtedly be meaningful for future products for musicians and music lovers.

The above is a complete compilation of the content of Liu Xiaoguang's speech.
