Taylor Swift sings "Rice Fragrance": Amphion, an audio generation project from a Chinese team, goes viral

Heart of the Machine column

Heart of the Machine Editorial Department

The team of Associate Professor Wu Zhizheng at the School of Data Science, the Chinese University of Hong Kong, Shenzhen, and the OpenMMLab team at the Shanghai Artificial Intelligence Laboratory have open-sourced Amphion, a comprehensive audio generation project. The system aims to provide an open-source platform that integrates speech synthesis and conversion, singing voice synthesis and conversion, and sound effect and music generation. To date, Amphion has made the GitHub Trending Repositories list several times.

2022 is known as the first year of AIGC, when text and image applications such as ChatGPT, Stable Diffusion, and MidJourney ignited the AI field. In 2023, "AI Stefanie Sun", "AI Guo Degang", sound effect generation, and music generation became popular on social media as well.

Today, we can even hear Taylor Swift sing Jay Chou's "Rice Fragrance".

This may seem simple, but the technology behind it is very complex. Because of the domain knowledge barrier in the audio field, it is not easy for engineers to get started.

Recently, the team of Associate Professor Wu Zhizheng at the School of Data Science, the Chinese University of Hong Kong, Shenzhen, and the OpenMMLab team at the Shanghai Artificial Intelligence Laboratory open-sourced Amphion, a comprehensive audio generation project. The system aims to provide an open-source platform for researchers and engineers who have just entered, or want to enter, the field, integrating speech synthesis and conversion, singing voice synthesis and conversion, and sound effect and music generation. The project has already attracted considerable attention on overseas social platforms.

Project Address: https://github.com/open-mmlab/Amphion
Paper Address: https://arxiv.org/abs/2312.09911

OpenMMLab is the most internationally influential open-source algorithm system for computer vision, with more than 90,000 stars on GitHub and users in 140 countries and regions around the world. A sister team at the same laboratory has launched InternLM, a large language model at the 100-billion-parameter scale with leading performance, and has built the first full-chain open-source system for developing and deploying large models. The team's work also includes OpenCompass, the largest and most complete large model evaluation platform in the community, and LMDeploy, a large model inference framework with leading inference performance.

This is OpenMMLab's first foray into audio and speech, and we believe this release will open up more possibilities for multimodal generation. Even before it was publicly announced, Amphion had made the GitHub Trending Repositories list several times. It is fair to say that Amphion arrived with an aura of its own.

Amphion

Amphion is a comprehensive audio generation platform. The project covers a variety of classic audio generation tasks, such as speech synthesis, voice conversion, singing voice synthesis, singing voice conversion, sound effect generation, music generation, and speech enhancement, as well as several AIGC audio tasks, such as multimodal controlled sound effect generation and music generation. Amphion's unique visualization capabilities help junior researchers and engineers better understand the relevant models, supporting sustainable research and development in audio, music, and speech generation.

The Amphion technical report compares the performance of some of Amphion's tasks and algorithms in detail with that of the more popular open-source systems on GitHub. Overall, with a single system, Amphion matches or surpasses multiple popular GitHub systems on the corresponding tasks.

SVC: Singing Voice Conversion

For many people, the term "singing voice conversion" may be unfamiliar, but many will have heard of this year's popular "AI Stefanie Sun". The technology behind "AI Stefanie Sun" is singing voice conversion.

In layman's terms, singing voice conversion uses AI to transform the timbre of one person's singing voice into that of another person. The process usually involves signal processing, machine learning, and deep learning algorithms. The Amphion system integrates classic feature extraction models. In addition to the classic diffusion and VITS models, it also integrates the Whisper model from OpenAI. To achieve good sound quality, Amphion integrates mainstream vocoders such as BigVGAN, HiFi-GAN, and DiffWave, and its vocoder module also incorporates the latest results from CUHK-Shenzhen.

Subjective evaluations in Amphion's technical report show that Amphion surpasses the previously popular So-VITS-SVC system in both naturalness and similarity. Amphion's feature design has since been adopted by the So-VITS-SVC 5.0 system.

TTS: Speech Synthesis

Speech synthesis refers to the technology of converting text input into the corresponding speech output. At present, this module mainly uses deep learning to convert text into natural, fluent, high-fidelity speech. The technology has a wide range of applications in audiobooks, video dubbing, and similar scenarios. The Amphion system implements the classic FastSpeech2 and VITS models, among others, as well as recent popular zero-shot speech synthesis techniques, namely VALL-E and NaturalSpeech2.

Amphion's technical report shows that Amphion meets or exceeds the level of the most talked-about open source systems today, both objectively and subjectively.

TTA: Text-to-Audio Generation

Text-driven generative models have achieved remarkable results in both the image and video domains. For images, Stable Diffusion and MidJourney can already produce high-quality results, and for audio, text-to-audio generative models are set to have a positive impact on many creative industries. For example, game developers or film dubbing artists can use this technology to generate sound effects for specific needs without having to search for and edit them in a huge sound effects database, which increases productivity.

Amphion integrates the most popular text-driven audio generation architecture, which combines a VAE encoder, a VAE decoder, and a latent diffusion model. Under this architecture, the latent diffusion model takes T5-encoded text as input and generates the corresponding sound effects under the guidance of the text.

Objective metrics in Amphion's technical report show that Amphion achieves state-of-the-art results on the TTA task.
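The data flow described above (text encoding, iterative denoising in a latent space, then decoding) can be illustrated with a small, self-contained sketch. This is not Amphion's code: encode_text, predict_noise, and decode_latent are hypothetical toy stand-ins for the T5 encoder, the trained denoising network, and the VAE decoder, and the noise predictor is an "oracle" that simply treats the text embedding as the clean latent. Only the sampling loop, a deterministic DDIM-style update, reflects how a reverse diffusion process is typically run.

```python
import hashlib
import math
import random


def encode_text(prompt, dim=8):
    """Toy stand-in for a T5 text encoder: hash the prompt into `dim` floats in [0, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def predict_noise(latent, alpha_bar, text_embedding):
    """Toy oracle denoiser: pretends the clean latent equals the text embedding.

    A real system would call a trained, text-conditioned network here.
    """
    return [
        (x - math.sqrt(alpha_bar) * c) / math.sqrt(1.0 - alpha_bar)
        for x, c in zip(latent, text_embedding)
    ]


def sample_latent(text_embedding, num_steps=50, seed=0):
    """Deterministic (DDIM-style, eta = 0) reverse diffusion in latent space."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    latent = [rng.gauss(0.0, 1.0) for _ in text_embedding]  # start from noise
    for t in reversed(range(num_steps)):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = predict_noise(latent, a_t, text_embedding)
        # Estimate the clean latent, then step to the previous noise level.
        x0 = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
              for x, e in zip(latent, eps)]
        latent = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e
                  for c, e in zip(x0, eps)]
    return latent


def decode_latent(latent, upsample=4):
    """Toy stand-in for the VAE decoder: repeat each latent value `upsample` times."""
    return [value for value in latent for _ in range(upsample)]


def text_to_audio(prompt):
    """Text -> embedding -> denoised latent -> decoded output."""
    embedding = encode_text(prompt)
    return decode_latent(sample_latent(embedding))
```

Because the toy denoiser is an oracle, the sampled latent converges to the text embedding itself. In a real model the network has to learn that mapping from paired text and audio data, and the decoded latent is a spectrogram that a vocoder then turns into a waveform.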

Vocoder

The vocoder is the most important module in audio and speech generation, and it is key to the quality of the synthesized sound. Amphion integrates mainstream vocoders such as BigVGAN, HiFi-GAN, and DiffWave, as well as the latest published results from CUHK-Shenzhen.

Amphion's technical report shows that, on objective metrics, the HiFi-GAN vocoder in Amphion outperforms currently popular open-source tools.
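To make the vocoder's role concrete, the sketch below turns frame-level features into waveform samples. It is a toy sinusoidal synthesizer, not a neural vocoder such as BigVGAN or HiFi-GAN, and the feature format (one fundamental frequency and one amplitude per frame), the function name toy_vocoder, and the sample rate and hop length are all assumptions made for illustration. Real vocoders usually take mel spectrograms as input and learn the mapping from data.

```python
import math


def toy_vocoder(frames, sample_rate=16000, hop_length=160):
    """Render frame-level (f0_hz, amplitude) features into waveform samples.

    Toy sinusoidal synthesis, NOT a neural vocoder: each frame is expanded
    to `hop_length` samples, and the phase is carried across frames so the
    waveform stays continuous when the pitch changes.
    """
    samples = []
    phase = 0.0
    for f0_hz, amplitude in frames:
        step = 2.0 * math.pi * f0_hz / sample_rate  # phase advance per sample
        for _ in range(hop_length):
            samples.append(amplitude * math.sin(phase))
            phase = (phase + step) % (2.0 * math.pi)
    return samples


# Three 10 ms frames: a 220 Hz tone, a 440 Hz tone, then silence.
frames = [(220.0, 0.5), (440.0, 0.5), (0.0, 0.0)]
waveform = toy_vocoder(frames)
```

The interface is the point here: a short sequence of frames goes in, and a much longer sequence of audio samples comes out (hop_length samples per frame). Neural vocoders keep this shape but replace the hand-written oscillator with a learned generator.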

Visualization

Unlike traditional open-source tools for speech and audio, Amphion provides visualization capabilities. The Amphion team hopes that this feature will give beginners a better understanding of the principles and details of the models. Currently, the team provides a visualization of the diffusion model applied to singing voice conversion, which shows the gradual process of one singer's voice turning into another's.

The Amphion Team

Person in charge: Dr. Wu Zhizheng

Dr. Wu Zhizheng is currently an associate professor at the Chinese University of Hong Kong, Shenzhen. He has been selected as a national young talent, and has been named multiple times to Stanford University's "Top 2% Scientists in the World" list and Elsevier's "China Highly Cited Scholars" list. He received his PhD from Nanyang Technological University in 2015 and has held academic research and technical leadership positions at Meta (formerly Facebook), JD.com, Apple, the University of Edinburgh, Microsoft Research Asia, and elsewhere. Dr. Wu led the development of Merlin, an open-source speech synthesis system, initiated and organized the first international challenge on spoofing detection for speaker verification and the first international voice conversion challenge, and organized the 2019 international speech synthesis challenge (Blizzard Challenge 2019). He won the INTERSPEECH 2016 Best Student Paper Award and the Best Paper Award at the 2012 Asia-Pacific Signal and Information Processing Association Annual Summit. He is currently a member of the IEEE Speech and Language Processing Technical Committee, an Associate Editor of IEEE/ACM Transactions on Audio, Speech and Language Processing, an authoritative journal in the speech field, and the conference chair of the IEEE Spoken Language Technology Workshop 2024. He has given invited talks at authoritative academic venues such as the ISCA SPSC Workshop and the IJCAI 2023 DADA Workshop.

Core members

The core members of the Amphion team are all students at CUHK-Shenzhen, and their backgrounds are impressive. This is very much the kind of team that others envy.

Co-first author Xueyao Zhang has only just started his Ph.D., yet his papers have already been cited hundreds of times on Google Scholar, and in 2023 he was selected for the Tencent Rhino-Bird Elite Talent Program, which admits only 55 people nationwide. Co-first author Yuancheng Wang was admitted directly to the Ph.D. program at CUHK (Shenzhen) with a paper at the top conference NeurIPS. Co-first author Dr. Liumeng Xue has internship experience at Microsoft, Tencent, JD.com, and other major technology companies.

It is worth mentioning that the core members of Amphion also include two second-year undergraduates from CUHK-Shenzhen. Yicheng Gu is responsible for all of the vocoder code in Amphion. He joined the research group in the first week of his freshman year and had published at top speech conferences by the first semester of his sophomore year. Chaoren Wang, also a sophomore, single-handedly wrote all of the code for Amphion's visualization component, and his personal open-source project has received thousands of stars on GitHub.

The meaning behind the name Amphion

"Amphion" takes its name from Amphion, a legendary musician in ancient Greek mythology. According to legend, Amphion was famous for playing the lyre and used his musical talent to build the walls of Thebes, and his music was said to move trees and rocks. By invoking Amphion's musical talent and legend, the team expresses its vision for the project's research and development, and its hope of drawing a blueprint for the sustainable development of audio technology.

Amphion Online Demo Experience Links:

Text to Speech

HuggingFace Demo: https://huggingface.co/spaces/amphion/Text-to-Speech
OpenXLab Applications: https://openxlab.org.cn/apps/detail/Amphion/Text-to-Speech

Singing Voice Conversion

HuggingFace Space: https://huggingface.co/spaces/amphion/singing_voice_conversion
OpenXLab Applications: https://openxlab.org.cn/apps/detail/Amphion/singing_voice_conversion

Text to Audio

HuggingFace Demo: https://huggingface.co/spaces/amphion/Text-to-Audio
OpenXLab Applications: https://openxlab.org.cn/apps/detail/Amphion/Text-to-Audio
