"Science and Technology Innovation Board Daily" on January 28 (editor Song Ziqiao) On the track of generative AI models, Google is "soaring" all the way. Following the text generation AI model Wordcraft and video generation tool Imagen Video, Google has extended the application scenarios of generative AI to the music industry.
On January 27, local time, Google released a new AI model, MusicLM, that generates high-fidelity music from text and even images: it can turn a passage of text or a picture into a song, in a wide range of musical styles.
Google showed a number of examples in the accompanying paper. For instance, given the caption "a fusion of reggae and electronic dance music with a spacey, otherworldly sound that evokes the experience of being lost in space; the music is designed to induce a feeling of wonder and awe while remaining danceable", MusicLM generated 30 seconds of electronic music.
In another example, with the famous painting "Napoleon Crossing the Alps" as the "caption", the music MusicLM generated is solemn and stately, vividly conveying the ferocity and heroism of a winter campaign. Beyond realistic oil paintings, abstract works such as "Dance", "The Scream", "Guernica" and "The Starry Night" can also serve as themes.
MusicLM can even produce a medley that mixes different musical styles in a "story mode". Even a request for 5 minutes of music poses no problem for MusicLM.
In addition, MusicLM offers powerful controls: a prompt can specify particular instruments, places, genres, eras, or a performer's skill level, and can adjust the quality of the generated audio, so that one piece of music can be rendered in multiple versions.
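All of these controls are expressed through the text prompt itself rather than through separate parameters. As a minimal illustration of the idea, here is a hypothetical prompt-building helper; `build_prompt` is an assumption for this sketch and is not part of any Google API:

```python
# Hypothetical illustration only: MusicLM-style controls (instrument,
# genre, era, performer skill) folded into a single text prompt.
# build_prompt is an assumed helper, not a real MusicLM interface.
def build_prompt(base, instrument=None, genre=None, era=None, skill=None):
    parts = [base]
    if instrument:
        parts.append(f"played on {instrument}")
    if genre:
        parts.append(f"in a {genre} style")
    if era:
        parts.append(f"evoking the {era}")
    if skill:
        parts.append(f"performed by a {skill} musician")
    return ", ".join(parts)

prompt = build_prompt("a calm melody", instrument="accordion",
                      genre="reggae", era="1980s", skill="professional")
print(prompt)
```

Varying any single attribute of such a prompt is what lets one piece of music be regenerated in multiple versions.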
MusicLM is not the first AI model to generate music. Similar products include Riffusion and Dance Diffusion; Google itself previously released AudioLM; and OpenAI, developer of the popular chatbot ChatGPT, has launched Jukebox.
What makes MusicLM unique?
It is, in essence, a hierarchical sequence-to-sequence model. According to AI scientist Keunwoo Choi, MusicLM combines several models, such as MuLan + AudioLM and MuLan + w2v-BERT + SoundStream.
Among these, AudioLM can be regarded as MusicLM's predecessor: MusicLM builds on AudioLM's multi-stage autoregressive modeling, conditioned on a text description, and can generate audio at a 24 kHz sample rate that stays coherent over several minutes.
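The hierarchical pipeline described above (text embedding, then coarse semantic tokens, then fine acoustic tokens) can be sketched roughly as follows. Every stage here is a toy stand-in, not the real MuLan, w2v-BERT, or SoundStream model, and the token rate is an assumed placeholder:

```python
# Rough sketch of MusicLM's hierarchical, multi-stage generation:
# text -> MuLan-style embedding -> semantic tokens -> acoustic tokens.
# All stage functions are toy stand-ins, NOT the real models.

SAMPLE_RATE = 24_000  # MusicLM generates audio at 24 kHz

def mulan_embed(text):
    # Stand-in for MuLan's joint text/music embedding.
    return [ord(c) % 7 for c in text][:16]

def semantic_stage(embedding, n_tokens):
    # Stand-in for the semantic model: coarse tokens that capture
    # long-term structure (melody, rhythm) over the whole clip.
    repeated = embedding * (n_tokens // len(embedding) + 1)
    return [(e * i) % 100 for i, e in enumerate(repeated)][:n_tokens]

def acoustic_stage(semantic_tokens):
    # Stand-in for the acoustic stage: each semantic token is expanded
    # into several finer-grained acoustic tokens (here, 4 per token).
    return [t * 2 + k for t in semantic_tokens for k in range(4)]

def generate(text, seconds):
    emb = mulan_embed(text)
    sem = semantic_stage(emb, n_tokens=seconds * 25)  # assumed token rate
    aco = acoustic_stage(sem)
    return aco  # a real system decodes these into a 24 kHz waveform

tokens = generate("relaxing jazz", seconds=30)
```

The point of the hierarchy is that the cheap semantic stage fixes long-range structure before the expensive acoustic stage fills in detail, which is what lets the model stay coherent over minutes of audio.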
In contrast, MusicLM was trained on far more data. To address the lack of evaluation data for this task, the research team also introduced MusicCaps, the first evaluation dataset designed specifically for text-to-music generation. MusicCaps was built by professional musicians and covers 5,500 music-text pairs.
On this basis, Google trained MusicLM on a dataset of 280,000 hours of music.
Google's experiments show that MusicLM outperforms previous models in both audio quality and adherence to the text description.
However, MusicLM also carries the risks common to all generative AI: technical imperfections, infringement on training material, ethical disputes, and so on.
On the technical side, for example, MusicLM can generate vocals when asked, but the results are poor: the lyrics come out garbled and meaningless. MusicLM is also "lazy": about 1% of the music it generates is copied directly from songs in the training set.
Moreover, does music generated by an AI system count as original? Can it be copyrighted? Can human musicians compete with "machine-made music"? There is still no consensus on these disputes.
These concerns are all reasons why Google has not released MusicLM to the public. "We acknowledge the risk of potential misappropriation of creative content associated with the model, and we emphasize that more future work is needed to address these risks of music generation," Google's paper reads.