Customize an AI voice in 2 seconds! Wenxin Yiyan pulls off something big, and the results are surprising

Author: Lei Technology

Xiaolei often browses Bilibili and regularly sees videos from uploaders that imitate celebrities singing, with timbre and pitch that are at least 60 to 70 percent similar to the original. Some locally trained AI models can even reproduce a voice almost identical to the star's. Beyond singing, the technique is also widely used to dub different characters, and a large AI model fed enough high-quality material of sufficient duration can produce results that pass for the real thing.

Xiaolei, who is tone-deaf, has long wanted this technology but has been put off by the complexity of training a model locally, and never committed to training his own AI voice. As it happens, Baidu's Wenxin Yiyan recently launched a feature for giving an agent a custom voice, and the official claim is that users can complete the setup in just a few seconds.

Skeptical, Xiaolei tried creating his own "AI voice stand-in".

Creating an "AI voice stand-in" is efficient, but the functionality is too limited

Open the Wenxin Yiyan app and tap the "+" button at the bottom to enter the agent creation interface. In the voice options, you can choose a voice for the agent. The official voices are categorized by dialect, gender, timbre, and role, with 32 voices in total. But our goal is clear: creating a voice of our own.

Source: Produced by Lei Technology, Wenxin Yiyan page

Tap "Create My Voice", then read the text given by the system in a natural tone so that the system can capture your timbre and pitch. In our test, recognition took only 2 to 3 seconds, after which Xiaolei's "AI voice stand-in" was created. Note that the system briefly checks the ambient sound before recording and starts the actual recording only once the noise level meets its requirements.
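The ambient-sound check before recording could work roughly as follows. This is a minimal sketch, not Wenxin Yiyan's actual implementation: the function names and the -40 dBFS threshold are assumptions chosen for illustration. It measures the RMS level of a short audio buffer and allows recording only when that level is below the threshold.

```python
import math

def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    # 10 * log10 of the mean square equals 20 * log10 of the RMS amplitude
    return 10.0 * math.log10(mean_square)

def quiet_enough(samples, threshold_dbfs=-40.0):
    """Hypothetical gate: allow recording only if the ambient level is below the threshold."""
    return rms_dbfs(samples) < threshold_dbfs

# One second of a 50 Hz hum at 16 kHz: once faint (amplitude 0.001), once loud (amplitude 0.5)
quiet = [0.001 * math.sin(2 * math.pi * 50 * n / 16000) for n in range(16000)]
noisy = [0.5 * math.sin(2 * math.pi * 50 * n / 16000) for n in range(16000)]

print(quiet_enough(quiet), quiet_enough(noisy))
```

A real app would run a check like this on live microphone input and would probably also look for clipping, but the gating idea is the same.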

Beyond that, we can also customize the agent's personality traits, catchphrases, personal experiences, family and friends, hobbies, and opening remarks, all of which affect how the agent communicates later.

Source: Produced by Lei Technology, Wenxin Yiyan page

Without further ado, let's see whether an AI voice that Wenxin Yiyan creates in such a short time is satisfying. After turning on voice playback, Xiaolei asked the agent to introduce Lei Technology. Leaving the voice aside for a moment, the introduction was fairly comprehensive: apart from the somewhat outdated figure of 1.68 million followers for the official account (the actual number is over 1.7 million), the rest of the description was broadly accurate.

As for the voice, Xiaolei thinks the timbre is at least 80 percent similar. The rendering of mood and tone in particular almost made Xiaolei think he was the one talking. Perhaps so that users can hear the agent clearly, the overall speaking pace is a little slow, and users may find it hard to listen patiently to a whole answer.

Compared with a traditional text reply, the agent's spoken answer sounds more human, with more modal particles added, which is closer to how people talk in everyday conversation. Satisfied with the voice quality, Xiaolei returned to his core requirement for an AI voice stand-in: singing. Unfortunately, agents created by Wenxin Yiyan do not support this for now. Xiaolei then changed tack and asked the agent to read lyrics aloud. This worked, and the reading used his own timbre, but the result had little to do with music.

Source: Produced by Lei Technology, Wenxin Yiyan page

Xiaolei then ran further tests with the voice, such as reading aloud and poetry recitation, and the similarity remained high. You can think of it as a version of yourself whose voice is always in a stable state and which can handle a lot of basic spoken work on your behalf. The result, however, depends heavily on the emotion, style, and naturalness of your recording. Because Xiaolei is not a broadcasting professional, the AI voice was not especially good; if users provide higher-quality voice material, Wenxin Yiyan may give better results.

In general, Wenxin Yiyan's new feature did pleasantly surprise Xiaolei. Building on traditional offline local training, and with the Wenxin large model and the speech synthesis model trained on large amounts of voice data, the AI voice is satisfactory in both generation efficiency and output quality. But its positioning as a personal assistant limits its functions to some extent: the agent cannot offer other functions such as singing, and the user cannot train the AI voice further to bring it closer to their own voice.

High-quality AI voices still rely on intensive AI training

In fact, this is a problem faced by every application that creates AI voices "fast-food style". Tongyi Lab offers a similar personalized voice customization service, which requires users to record 20 sentences to customize their own AI voice. The overall effect is not much different from Wenxin Yiyan's, and it hits the same bottleneck; the key reason is that there is not enough input and training material.

Source: ModelScope (魔搭)

The places where you most often hear custom voices in daily life are probably voice navigation, text-to-speech announcements, and audiobook narration. Generally speaking, for text-to-speech to reach an acceptable standard for an AI voice, the voice talent has to record a large amount of data in a professional recording studio, and this demanding customization process keeps the vast majority of ordinary people from exploring AI voices.

As Personal TTS technology matures, a platform can quickly build a speech synthesis system for the target speaker from a small number of sound clips captured on common recording devices such as mobile phones and computers. The biggest advantage of personalized speech synthesis is that it needs only a small amount of data compared with traditional voice customization.

Both Wenxin Yiyan and Tongyi Lab need only a very small amount of data to provide users with personalized voice customization, which greatly lowers the threshold for speech synthesis and brings AI voices to ordinary users. But there is a trade-off: the same TTS technology that lowers the threshold for voice customization also caps the upper limit of the feature.

According to the product flow diagram provided by ModelScope, the TTS model goes through four stages before it finally produces our AI voice: recording detection, data processing, model training, and packaging and synthesis. Because so little data is fed in, the AI voice's language logic and intonation rely more on the data of the pre-trained model, while the material recorded by the user may act mostly on the surface of the sound. The soul of the voice is still the large-model data behind it.
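The four stages can be pictured as a simple sequential pipeline. The sketch below illustrates only the structure and is not ModelScope's code: the `VoiceJob` class, every stage body, and the `"pretrained-tts"` label are hypothetical placeholders. It does show the point made above, that the user's clips only adapt a model that already exists.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VoiceJob:
    clips: list                               # recorded clips, each a list of float samples
    log: list = field(default_factory=list)   # names of the stages completed so far
    model: Optional[dict] = None              # stand-in for the adapted voice model

def recording_detection(job):
    # Drop empty clips; a real system would also check noise and clipping.
    job.clips = [clip for clip in job.clips if clip]
    if not job.clips:
        raise ValueError("no usable recordings")
    return job

def data_processing(job):
    # Peak-normalize each clip so that loudness is comparable across clips.
    normalized = []
    for clip in job.clips:
        peak = max(abs(s) for s in clip)
        normalized.append([s / peak for s in clip] if peak > 0 else clip)
    job.clips = normalized
    return job

def model_training(job):
    # Placeholder: adapt a pretrained TTS model using the user's few clips.
    job.model = {"base": "pretrained-tts", "adapted_on_clips": len(job.clips)}
    return job

def packaging_and_synthesis(job):
    # Placeholder: export the adapted voice so it can be used for synthesis.
    job.model["packaged"] = True
    return job

STAGES = [recording_detection, data_processing, model_training, packaging_and_synthesis]

def run_pipeline(clips):
    job = VoiceJob(clips=clips)
    for stage in STAGES:
        job = stage(job)
        job.log.append(stage.__name__)
    return job

job = run_pipeline([[0.1, -0.2], [], [0.5, 0.25]])
print(job.log, job.model)
```

In this toy version the empty clip is rejected at the first stage, and the "training" step starts from a pretrained base rather than from scratch, which is why a handful of recordings is enough.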

Source: ModelScope (魔搭)

For reference, Xiaolei looked into the steps for training a voice model locally. Compared with the convenient services of Wenxin Yiyan and Tongyi Lab, a locally trained voice model has a much higher quality ceiling, but the cost rises steeply as well.

First, users have to prepare a batch of high-quality dry vocal recordings, a reasonably powerful computer, and an open-source AI voice project. After a series of steps covering data processing, feature extraction, and N rounds of training, we get the AI voice we need.

That may sound simple from the description, but in fact collecting the audio data alone is a big project, because it determines the timbre and vocal characteristics of the AI voice. Note in particular that the audio data here means the target's dry vocals, with all background sound such as accompaniment and noise removed, which is hard for users without professional equipment to achieve.

Of course, if you find all this too much trouble, you can also download a ready-trained voice model from the Model Workshop website, but it will not reproduce your own voice, so there is little sense of achievement.

Source: mxgf.cc

After extensive, intensive training, the model can finally achieve something like the "AI Yanzi" effect that was popular online a while ago. Users can also freely have the AI voice read aloud, sing, or perform in other contexts, and are no longer limited to a single form of expression.

Is linking large models the next opportunity for AI voice?

AI's impact on sound has reached many fields, from text-to-speech to music, and we have seen many interesting AI audio applications. Some time ago, Xiaolei tried Suno, a rising star in text-to-audio generation, whose efficient, high-quality music generation has given many musicians a sense of crisis. Although most AI voice models still have flaws at this stage, it is almost inevitable that AIGC will reshape the content industry.

AI voice, like AI music, is a form of self-expression for ordinary people. AI's role is mainly to lower the threshold for creation, so that ordinary people can also realize what they imagine. At present, many large AI models are still isolated "islands". In Lei Technology's view, once individual large models hit a bottleneck, the next step may be effective linkage between different types of large models.

As a simple example, a user generates the desired lyrics with ChatGPT, Suno turns the lyrics into a song and gives it a musical style, and finally the user's own AI voice is added. Once multiple large models are connected, all the user has to do is give a single command to create a song of their own.
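The linkage idea can be sketched as plain function composition. Nothing here calls a real ChatGPT, Suno, or Wenxin Yiyan API: the three `fake_*` functions, the `make_song_pipeline` helper, and the `"xiaolei-voice"` identifier are hypothetical stand-ins, used only to show how a single command could fan out across three models.

```python
def make_song_pipeline(write_lyrics, compose, apply_voice):
    """Chain three model calls behind one user-facing command."""
    def create_song(prompt, voice_id):
        lyrics = write_lyrics(prompt)          # text model: prompt -> lyrics
        track = compose(lyrics)                # music model: lyrics -> arranged track
        return apply_voice(track, voice_id)    # voice model: track -> sung in the user's voice
    return create_song

# Hypothetical stand-ins for the three models
def fake_lyrics(prompt):
    return f"La la la about {prompt}"

def fake_compose(lyrics):
    return {"lyrics": lyrics, "style": "pop"}

def fake_voice(track, voice_id):
    return {**track, "voice": voice_id}

create_song = make_song_pipeline(fake_lyrics, fake_compose, fake_voice)
song = create_song("spring rain", "xiaolei-voice")
print(song)
```

Because each stage is just a callable, any of the three could be swapped for a different provider without changing what the user types.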

Of course, AI models are still developing. Domestic large models such as Wenxin Yiyan and Tongyi Qianwen are also iterating constantly. Although the personalized voice customization that Xiaolei tried performed well in efficiency and quality, there is still huge room for improvement in functional diversity.

Perhaps in the future, Wenxin Yiyan's agents can break out of the assistant positioning and deliver results on par with locally trained large models. AI voice technology could then find more applicable scenarios and bring dramatic changes to the user experience and to audio-related industries.
