In addition to AI Sun Yanzi, what exactly can generative audio bring?

Author: Titanium Media APP

Image source: @VisualChina

Text | BTmt Technology, author | Aoyama egret

Some people liken ChatGPT's effect on the technology industry to dropping mints into cola: applications of every kind gush out instantly.

This description could not be more apt. In early June, Apple released a blockbuster product, the Vision Pro headset. The industry has long regarded a VR headset as the product that will eventually succeed Apple's mobile phone business, but it was delayed again and again before its recent launch. To the surprise of outsiders, Apple connected the Vision Pro headset to an AI assistant.

What is the use of connecting a headset to AI? Suffice it to say that the applications are limited only by your imagination. For example, if you don't know how to fly a fighter jet, a headset with AI capabilities can teach you to fly step by step. Do you keep losing at mahjong? A headset with AI functions can make you "possessed by the God of Gamblers" in minutes. When you go hiking, an AI headset can turn you into a botanist or zoologist...

Not long ago, Boston Dynamics, the world-famous humanoid robot company, also announced that its robot dog had been connected to ChatGPT. It was like giving the robot dog "life": it can now talk to humans and answer all kinds of tricky questions.

And this is just the tip of the iceberg for generative AI applications. What will generative AI look like in the future? Deutsche Bank's latest research seems to provide an answer: after the generative text boom, the technology world may usher in an explosion of generative audio.

What exactly can generative audio do for us?

AI audio has come to us

Deutsche Bank's latest research report shows that from the first quarter of 2020 to the fourth quarter of 2022, the number of corporate documents mentioning "generative audio" grew more than 13-fold.

Deutsche Bank analysis points out that people can input text or images to generate audio content without the need for audio experts or computer experts. This could affect a range of areas such as gaming, communications, music, journalism, and healthcare.

The Forbes Technology column also pointed out that at present, AI models have entered the music field, and generative artificial intelligence is likely to increasingly become a valuable tool for creating songs and compositions...

Just when we thought such a scenario was still far away, generative audio had already arrived.

Who would have thought that one day the biggest star in the Chinese music industry would be an AI? Some time ago, AI Sun Yanzi (Stefanie Sun) topped the trending searches. Its covers of Jay Chou classics such as "Love Before the Century" and "Hair Like Snow" have passed one million views on Bilibili (B station), and many netizens were captivated by AI Sun Yanzi's singing. Even Sun Yanzi herself, the self-described "unpopular singer", had to post a response.

AI Sun Yanzi was only the beginning. AI swept through the music circle almost instantly, and hardly any popular singer has escaped the boom. The craze even shows signs of spreading to neighboring areas such as composing and lyric writing.

AI singers became so popular that the Bilibili music channel had to open a dedicated section for "them" in its cover area. Besides AI Sun Yanzi, the section features AI versions of popular singers such as Eason Chan, Jay Chou, Jacky Cheung, and Andy Lau. Here you can listen to "Borrow 500 Years from the Sky" sung by AI Sun Yanzi, "Heavenly Road" sung by AI Ariana Grande, "The First Snow of 2002" sung by AI Na Ying, and "Plum Sauce" sung by AI Jay Chou...

Even celebrities who are not singers can become AI singers: AI Lei Jun can sing "A Thousand Miles Away" for everyone, AI Sun Honglei can sing a tender version of "Red Beans", and AI Musk's rendition of "Good Han Song" does not sound out of place at all.

If the AI singer craze served only as entertainment, using AI to "revive" those who have died does add a little warmth to the cold world of technology. When AI Michael Jackson sang for us again in his iconic voice, one netizen wrote in the comments: "As soon as MJ's voice came out, I burst into tears..." Under an AI Leslie Cheung video, another netizen commented that AI music technology has let these late singers release "new songs" in another form, which is no small comfort for fans.

Just as ChatGPT has affected all walks of life, AI singers have stirred up great controversy. Some insiders say the biggest dispute in the industry is whether AI singers constitute infringement. Some lawyers point out that a voice simulated by AI is not protected by the Copyright Law, so simulating it does not constitute copyright infringement, but the songs being covered are copyrighted and may only be used with authorization.

Some netizens have asked: if a voice can be simulated, does that mean products such as voiceprint locks face serious risks? Others point out that generative audio will indeed put more pressure on the existing social order, bringing risks such as telecom fraud and forged instructions from senior executives.

Unfortunately, such fears have already become reality. Time magazine reported in April that a family in Arizona received what they thought was a kidnapping call. The voice on the phone, crying included, sounded exactly like their loved one, but it turned out to be a scam created entirely with AI.

Dipu, an associate professor at the School of Electrical and Data Engineering at the University of Technology Sydney in Australia, told the media that an AI model needs only a few phrases from the person being imitated to "clone" a voice identical to theirs, and that some models and algorithms need a minute or less to do so.

The application scenarios are far beyond imagination

What AI singers bring the public may be little more than a smile. Entertainment is only a very small application scenario for generative audio, and what generative audio can bring us goes far beyond imagination.

In fact, Internet companies are never absent from the forefront of an industry. According to incomplete statistics in the latest "China Artificial Intelligence Large Model Map Research Report", 79 large models with more than 1 billion parameters have been released in China, and a number of influential large models have emerged, especially in natural language understanding and multimodality.

Worldwide, the companies with the most generative audio-related patents include Sony, Amazon, Huawei, ByteDance, Adobe, Apple, and Tencent.

In early June, Alibaba Cloud disclosed the latest progress of its Tongyi large model and officially unveiled Tongyi Tingwu ("Tongyi Listening Understanding"), which focuses on audio and video AI. It became the first large model application product in China to open for public testing. Tongyi Tingwu is precisely a non-entertainment, real-world example of generative audio.

If you trace the history of Tongyi Tingwu, you will find that it grew out of the "Tingwu" product that Alibaba Cloud focused on building in 2021. Alibaba Cloud now has high hopes for it, because in addition to the comprehension and summarization capabilities of Alibaba's Tongyi Qianwen model, it also integrates Alibaba's most advanced speech semantics, multimodal algorithms, and other technologies.

What Tongyi Tingwu brings us is AI audio entering the office track. According to the current official definition, Tongyi Tingwu can both "listen" and "understand". On the listening side, it can generate meeting minutes with high accuracy and distinguish between speakers. On the understanding side, it can produce a summary, condense the full text and each speaker's views, and organize key points and to-do lists.

Alibaba Cloud is not the only one entering the office track with AI audio. Powerful service providers such as Tencent Meeting and iFLYTEK are there as well, and Douyin's Feishu Miaoji, Sogou, and NetEase Youdao are also eyeing this track.

The reason is that, compared with text input and other ways of interacting with machines, audio and video interaction is the most efficient and accurate. If speech recognition accuracy is high, audio input is much faster than typing. However, there are still some bottlenecks in converting between text and audio.

According to industry insiders, several problems remain to be solved in moving from text language models to audio language models. For example, there is no one-to-one correspondence between text and audio. The same sentence can be spoken in different tones, which humans interpret easily but which is a hard problem for AI. Google's AI engineers have previously pointed out that audio is not easy to capture in characters: "The data rate of audio is much higher. A written sentence that can be represented in a few dozen characters usually has an audio waveform containing hundreds of thousands of values."
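The gap in data rate can be made concrete with a little arithmetic. The following Python sketch is only an illustration: the sample sentence, the 16 kHz sampling rate, and the 8-second speaking time are assumptions chosen for the example, not figures from the report or from Google.

```python
def waveform_values(duration_s: float, sample_rate_hz: int = 16_000) -> int:
    """Number of amplitude values in a mono waveform of the given duration."""
    return round(duration_s * sample_rate_hz)


# Hypothetical sentence and speaking time, used purely for illustration.
sentence = "Generative audio turns a short line of text into a long waveform."
duration_s = 8.0  # assumed time needed to read the sentence aloud

chars = len(sentence)                 # size of the text representation
values = waveform_values(duration_s)  # size of the audio representation
ratio = values / chars                # waveform values per character

print(f"characters: {chars}")
print(f"waveform values: {values}")
print(f"values per character: {ratio:.0f}")
```

Under these assumptions, a sentence of a few dozen characters corresponds to well over a hundred thousand waveform values, which is the mismatch an audio language model has to bridge.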

Generative audio isn't just making its way into the office; it has also had an impact on the music industry. As mentioned above, the emergence of various AI singers has disrupted almost the entire industry. However, generative audio brings more than just "destruction". It can also help musicians break through creative bottlenecks.

In fact, speech and audio synthesis technology has existed for decades, and music synthesizers have always had the "mission" of creating sounds the world has never heard, but they must be operated by people every step of the way. Digital music came later. Although it makes creating music far more convenient, it still requires creators to have many years of study and experience.

When AI music swept the music circle, people found that creating it does not require much musical knowledge or professional skill. You only need to enter some text and a description to create music quickly. Of course, in the eyes of some musicians, such "music" cannot be called music at all. But as large models continue to be trained, I believe music created by AI will produce amazing results.

In addition, generative audio is disrupting the gaming industry. In the past, a large share of a game company's spending went to in-game sound effects, background music, and opening and ending songs. With generative audio, this spending can be greatly reduced.

Some game industry practitioners point out that game audio is divided into four main parts: music, voice, sound effects, and the sound engine. Previously, game audio development went through a lengthy process of design, production, engine logic, and audio QA. AI audio technology can now be applied to design and production, which greatly shortens development time and lowers the cost considerably.

Taken together, generative audio has arrived. Computer-generated speech can approach the expression, intonation, and emotion of human speech, which will open up new possibilities for real-time translation, audio dubbing, and automatic real-time narration. For us, the impact of generative audio is huge, but it is still not a substitute for human creativity. What the world will look like in the future remains unknown.