Scientists develop a multimodal large model that can both understand and compose music

Author: DeepTech

"Peers think that the idea of combining music understanding and generation with large models is relatively new, and this paper is also one of the preliminary works in the field of multimodal large models.

In addition, in addition to the large model itself, the dataset we propose for model training is also of great value to the academic community. Liu Shansong, a researcher at Tencent ARC Lab, said.

Figure | Liu Shansong (Source: Liu Shansong)

Recently, his team at Tencent ARC Lab and the research group of Assistant Professor Sun Chenshuo at the National University of Singapore jointly developed M2UGen, a multimodal music understanding and generation model that meets users' needs for both understanding and generating music, filling a gap left by multimodal large models in the music domain.

Specifically, the model can not only understand music but also generate it.

The former means that the model can not only produce a descriptive caption for an input music file, but also answer the user's questions about it, such as which instruments appear in the piece.

The latter means that the model can generate music not only from user instructions, such as "a piece of music played on a guitar," but also from an image or video provided by the user.

Figure | Multimodal music understanding and generation with the M2UGen large model (Source: arXiv)

Recently, the related paper was published on the preprint platform arXiv under the title "M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models" [1].

Liu Shansong and Atin Sakkeer Hussain of the National University of Singapore are co-first authors, and Liu Shansong is a co-corresponding author together with Sun Chenshuo and Shan Ying of Tencent ARC Lab.

Figure | The related paper (Source: arXiv)

Currently, the field of large language models is booming. Practitioners can use their powerful reasoning ability to understand modalities such as text and images, or use them to interpret human intentions and generate the content users need, such as images and music.

However, most past studies based on large language models have focused on understanding, and only a few have combined understanding and generation.

Yet in practical application scenarios, users' needs for understanding and generation are often intertwined.

For example, at the end of each year many employees need to prepare a year-end summary PPT. To do this with a large language model, the model needs not only the ability to understand, so that users can get a slide template style that matches their ideas, but also the ability to generate the text and illustrations.

Therefore, it is necessary to combine understanding and generative capabilities into the same model.

Regarding this work, why did the team choose music as the entry point for their research?

According to Liu Shansong, he has been engaged in audio research since his Ph.D. and has a strong interest in music. After starting work, he found that many users have real needs around soundtracks.

"For example, if a video maker wants to quickly accumulate followers, he or she needs to create a viral video to attract traffic. Among them, it is very important to choose the right soundtrack.

However, music requires a certain level of art appreciation, and those ordinary users often face difficulties when choosing. This is where they need to have a little assistant who can help them choose the right soundtrack and improve their creative efficiency. Liu Shansong said.

It is also worth mentioning that this work is a continuation of the team's previous study, MU-LLaMA [2]. The earlier model focuses on a single music understanding task, whereas M2UGen builds on music understanding and adds music generation guided by multimodal information, so that the model can not only understand music but also create it.

"We started our research on M2Ugen after we completed the submission of MU-LLaMA in September 2023. Liu Shansong said.

After surveying the state of the field and setting the research objectives, the researchers first selected three feature encoders, MERT, ViT, and ViViT, to process music, image, and video inputs, respectively.

The outputs of these encoders are then fed into the open-source LLaMA 2 model, so that it can understand and process the multimodal inputs and make decisions for downstream tasks.
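To make this step concrete, here is a minimal sketch in PyTorch of the general pattern of feeding frozen encoder features to an LLM as soft prompts: the features are projected to the LLM's hidden size and concatenated with the instruction's token embeddings. The two-layer MLP adapter, the module name, and all dimensions below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects frozen encoder features (e.g., from MERT/ViT/ViViT) into the
    LLM's hidden size so they can be consumed as soft prompts.
    The 2-layer MLP is an illustrative choice, not the paper's exact design."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, num_frames_or_patches, enc_dim)
        return self.proj(enc_feats)  # -> (batch, seq, llm_dim)

# Toy usage with made-up dimensions: 1024-dim music features, 4096-dim LLM hidden size.
adapter = ModalityAdapter(enc_dim=1024, llm_dim=4096)
music_feats = torch.randn(1, 64, 1024)   # placeholder for frozen MERT-style features
text_embeds = torch.randn(1, 32, 4096)   # placeholder for the instruction's token embeddings
soft_prompt = adapter(music_feats)
llm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)  # passed to the LLM as inputs_embeds
print(llm_inputs.shape)  # torch.Size([1, 96, 4096])
```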

Next, the understanding and generation tasks are combined within the same large model.

Finally, by incorporating AudioLDM 2 and MusicGen as music decoders, the model gains the ability to generate music.
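For a concrete sense of the decoder side, below is a minimal, standalone text-to-music sketch using the publicly available MusicGen checkpoint in Hugging Face transformers. The facebook/musicgen-small checkpoint and the plain-text prompt are assumptions for illustration; in M2UGen the music decoder is conditioned on the LLM's output rather than on raw text.

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load a small public MusicGen checkpoint (an illustrative choice, not necessarily the paper's).
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# In M2UGen, conditioning comes from the LLM; here we condition on plain text for simplicity.
inputs = processor(text=["a calm acoustic guitar melody"], padding=True, return_tensors="pt")
audio = model.generate(**inputs, do_sample=True, max_new_tokens=256)  # roughly 5 seconds of audio

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("sample.wav", rate=sampling_rate, data=audio[0, 0].numpy())
```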

With the model architecture in place, they collected openly licensed music available today and used MU-LLaMA together with several visual foundation models to generate a text/image/video-to-music multimodal dataset for training M2UGen.
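As an illustration of what such a data pipeline might look like (a sketch only; caption_music below is a hypothetical stand-in for a music-captioning model such as MU-LLaMA, whose real interface is not shown), each openly licensed track could be paired with a generated description:

```python
import json
from pathlib import Path

def caption_music(path: Path) -> str:
    """Hypothetical placeholder for a music-captioning model such as MU-LLaMA.
    A real pipeline would load the audio and return a text description of it."""
    return f"A placeholder caption for {path.name}"

records = []
for track in sorted(Path("open_license_music").glob("*.wav")):
    records.append({
        "audio": str(track),
        "caption": caption_music(track),  # text prompt paired with the track
    })

# Store text-to-music pairs in a simple JSONL file for later training.
with open("text_to_music_pairs.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```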

It's important to note that having more high-quality, open data is key to developing generative AI.

"If we can cooperate with more professional institutions in the future to obtain more high-quality music training data, and solve the problems of copyright and annotation data quality, we can complete further iterations of the performance and performance of the model. Sun Chenshuo said.

In follow-up research, they will continue to iteratively optimize the model's performance and improve its generalization to better meet the needs of domestic users.

References:

1. S. Liu, A. S. Hussain, et al. M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models. arXiv:2311.11255. https://doi.org/10.48550/arXiv.2311.11255

2. S. Liu, A. S. Hussain, et al. Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning. arXiv:2308.11276. https://arxiv.org/abs/2308.11276

Operation/Typesetting: He Chenlong
