As AI voices evolve, can the human ear still tell AI from a real person?

On the podcast app Small Universe, the "Hacker News" account produced an episode voiced by "Xiaoxiao", and listeners left comments saying "I want to tip".

AI invades the podcast scene! More realistic than real dubbing? Testing Microsoft's AI voice character "Xiaoxiao"

In fact, this is not the first time that AI has invaded the podcast scene.

In October 2022, an audio recording of a conversation between Joe Rogan, the well-known American podcast host, and Steve Jobs, the late co-founder of Apple, sparked heated discussion online. In the 20-minute podcast, the two explored a variety of topics, including Jobs' college experience, his insights into computers, and his personal beliefs.

The episode was released by podcast.ai. To generate it, podcast.ai drew on Jobs' biography and all the recordings of him available on the web, and relied on extensive training with Play.ht's AI language model. The voice of the show's host, Rogan, was also AI-generated.

In July 2023, the Chinese podcast "Vulgar Xiaoya" released an episode whose storyline and voices were entirely AI-generated. It received more than 5,000 listens on Small Universe, and some listeners commented that they had mistaken the unnatural quality of the AI voices for the two hosts being "in a bad emotional state".

From the Steve Jobs "resurrection" podcast to the AI experiment by "Vulgar Xiaoya", one of the main criticisms of AI-generated voices is that they lack the intonation and emotion of real ones: they sound monotonous and mechanical, with unnatural rhythm and pitch. These problems hold back wider use of AI voice technology in audio content creation.

Now that Microsoft's "Xiaoxiao" is officially launched and available, can it become a new voice generation tool for Chinese-language creators? What new approaches does AI bring to audio content creation? "Number One AI Player" did some exploring.

Testing Microsoft's "Xiaoxiao": more realistic than real dubbing?

"Xiaoxiao" is a female voice character in the TTS (text-to-speech) voice library of Azure, a Microsoft cloud service platform. There are currently two versions:

The first version is the Chinese version of "Xiaoxiao", which supports 21 different speaking styles and is suitable for scenarios such as audiobooks, news, AI customer service, and multi-emotional expression.

In the "multi-emotional expression" demo, she switched emotions across several lines and matched the tone and intonation to each, and the overall delivery was natural and fluent.

The second version is a multilingual version of Xiaoxiao, which supports text-to-speech conversion in 91 languages, but only offers default speaking style options.

Both versions of "Xiaoxiao" can currently be tried for free on the official Azure website (the link is given at the end of the article).

Because applying for a Microsoft Azure account and deploying the Speech service is somewhat involved, the detailed steps are listed here for reference:

First, go to the official website of Microsoft Azure and create a free account.

New users get 12 months of free service after registering, and once that period ends they still receive a free quota of 500,000 characters per month, which is more than enough for most creators.
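To put the 500,000-character monthly quota in perspective, the rough arithmetic below estimates how many scripts of a given length fit into it. The 3,000-character script length is an illustrative assumption, not a figure from our tests, and the exact billing rules (for example, whether markup counts toward the total) should be checked against Azure's pricing page.

```python
FREE_CHARS_PER_MONTH = 500_000  # free monthly quota quoted in the article


def scripts_per_month(chars_per_script: int, quota: int = FREE_CHARS_PER_MONTH) -> int:
    """Return how many whole scripts of the given length fit in the quota."""
    if chars_per_script <= 0:
        raise ValueError("chars_per_script must be positive")
    return quota // chars_per_script


# Hypothetical example: one 3,000-character script per episode.
episodes = scripts_per_month(3_000)
print(episodes)  # 500000 // 3000 = 166
```

Under that assumption, the free tier would cover well over one episode per day.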

The whole registration and usage process requires no VPN, but a credit or debit card (VISA, MasterCard, etc.) is required for verification. In our test, we completed registration with a VISA credit card issued by a Chinese bank and a Chinese mobile phone number.

After verification succeeds, go to the Azure homepage, open the console, find "Speech" under the "AI + Machine Learning" category, and click "Create" to deploy the Speech service.

On the Create Speech service page, select "Free F0" as the pricing tier and choose a region that supports the TTS voice you want to use. Since we were testing "Xiaoxiao", we selected "East Asia".

Finally, click "Review & Create" at the bottom of the page to complete the deployment.

Then go to the "Audio Content Creation" page, where you can enter text and have the AI generate the speech.

The interface layout mainly includes the text operation area in the middle and the tuning and editing toolbar on the right. Users can edit the entire text at once or fine-tune individual sentences or words.

The editing functions include switching the reading voice, inserting pauses, adjusting reading rules, and controlling intonation and speed, all of which users can customize as needed.

For example, in the following text, we set the narration to the "news" style of the Chinese version of Xiaoxiao and switched voices within the same sentence to create the feel of dialogue in a novel.

[Audio: Test 1, Number One AI Player, 59 seconds]
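The same kind of setup can also be written as SSML, the markup accepted by Azure's Speech service. The sketch below is not the project we used in the test: the sample lines are placeholders, and the voice names (zh-CN-XiaoxiaoNeural, zh-CN-YunxiNeural), the "newscast" style, and the 300 ms pause are assumptions chosen for illustration. It only builds the markup and checks that it is well-formed XML; it does not call Azure.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def narration(text: str, voice: str = "zh-CN-XiaoxiaoNeural", style: str = "newscast") -> str:
    """Wrap narration text in a voice with a speaking style."""
    return (
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{escape(text)}</mstts:express-as>'
        f"</voice>"
    )


def line(text: str, voice: str, pause_ms: int = 300) -> str:
    """Wrap one line of dialogue in its own voice, preceded by a short pause."""
    return f'<voice name="{voice}"><break time="{pause_ms}ms"/>{escape(text)}</voice>'


def build_ssml(segments: list[str], lang: str = "zh-CN") -> str:
    """Join the segments into a single SSML document."""
    body = "".join(segments)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xmlns:mstts="{MSTTS_NS}" '
        f'xml:lang="{lang}">{body}</speak>'
    )


# Placeholder script: one narrated sentence followed by two speakers.
segments = [
    narration("夜深了，两个人还在争论。"),
    line("你真的听不出来吗？", voice="zh-CN-XiaoxiaoNeural"),
    line("听不出来，这就是真人的声音。", voice="zh-CN-YunxiNeural"),
]
ssml = build_ssml(segments)

# Parsing raises ParseError if the markup is malformed.
root = ET.fromstring(ssml)
print([v.get("name") for v in root.iter(f"{{{SSML_NS}}}voice")])
```

Keeping each speaker in its own voice element is what produces the back-and-forth effect; the narration style applies only to the first segment.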

In a comparison test, we had the voices deliver the same sentence with different emotions. The versions were clearly distinguishable, and the result was surprisingly good.

[Audio: Test 2, Number One AI Player, 12 seconds]
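A comparison like this can be scripted by generating one SSML document per emotion for the same sentence. The sketch below is an illustration under assumptions: the sentence is a placeholder, and the style names ("cheerful", "sad", "angry"), the styledegree attribute, and the voice name zh-CN-XiaoxiaoNeural are taken to be supported by the service rather than confirmed by our test. It only builds and validates the markup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def styled_ssml(sentence: str, style: str,
                voice: str = "zh-CN-XiaoxiaoNeural", degree: float = 1.0) -> str:
    """Build an SSML document that reads one sentence in one speaking style."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">'
        f"<voice name={quoteattr(voice)}>"
        f'<mstts:express-as style={quoteattr(style)} styledegree="{degree}">'
        f"{escape(sentence)}"
        "</mstts:express-as></voice></speak>"
    )


STYLES = ["cheerful", "sad", "angry"]  # assumed style names
SENTENCE = "你怎么现在才来？"          # placeholder sentence

# One document per emotion, all with identical text.
variants = {style: styled_ssml(SENTENCE, style) for style in STYLES}

for style, doc in variants.items():
    ET.fromstring(doc)  # raises ParseError if the markup is malformed
    print(style, "ok")
```

Because only the style attribute changes between documents, any audible difference in the results comes from the emotion setting rather than the text.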

Although the multilingual version of "Xiaoxiao" only supports the default speaking style, the speech it generates is natural and smooth, and it handles the modal particles in the text well, so on first listen it is hard to tell from a real voice.

[Audio: Xiaoxiao Multilingual Version Test 1, Number One AI Player, 8 seconds]

However, in our test, she only matched the quality of the demo Microsoft released earlier when the language was set to "Chinese Mandarin", in which case the generated speech was natural and realistic.

If you choose another language or regional variety, such as Cantonese or Taiwanese Mandarin, the output goes back to sounding unmistakably like AI.

[Audio: Xiaoxiao Multilingual Version Test 2, Number One AI Player, 10 seconds]

According to Microsoft's official introduction, nine voice characters, including the multilingual version of Xiaoxiao, were trained with the help of large language models such as OpenAI's GPT service on the Azure cloud, which makes them especially good at spoken conversation, casual chat, and other scenarios that demand natural, expressive speech.

In addition to the prebuilt voices in the TTS voice library, Microsoft Azure offers a voice customization service that lets brands or individuals create custom voices using audio samples they hold the rights to as training data.

AI + audio content: entering an era where real and fake are hard to tell apart

From audiobooks and short-video dubbing to the ubiquitous text-to-speech features in hardware and software, AI-generated audio has become one of the AI technologies ordinary people encounter most often in daily life.

Take Microsoft's voice character "Yunxi": anyone who has watched short videos will almost certainly recognize his voice.

Free of the mechanical pronunciation and flat tone of earlier AI voices, "Yunxi" became widely used for film and TV commentary dubbing and quickly spread across the Internet. It is also widely used for audiobooks: many users rely on software and APIs connected to Microsoft's TTS service to have "Yunxi" read text aloud for a better listening experience.
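For readers curious what "connecting to Microsoft's TTS service" might look like in code, here is a minimal sketch that prepares a request for the Speech service's REST text-to-speech endpoint. The region identifier ("eastasia"), the voice name (zh-CN-YunxiNeural), the output format, the User-Agent value, and the key placeholder are all assumptions for illustration; the request is only built, not sent, so the sketch runs without credentials.

```python
import urllib.request

# Placeholder SSML; the voice name is an assumption.
SSML = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="zh-CN">'
    '<voice name="zh-CN-YunxiNeural">大家好，欢迎收听本期节目。</voice>'
    "</speak>"
)


def build_tts_request(ssml: str, key: str, region: str = "eastasia",
                      output_format: str = "audio-24khz-48kbitrate-mono-mp3",
                      ) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying SSML to the TTS endpoint."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,          # key of the deployed Speech resource
        "Content-Type": "application/ssml+xml",    # body is SSML
        "X-Microsoft-OutputFormat": output_format,  # requested audio format
        "User-Agent": "tts-sketch",                # hypothetical client name
    }
    return urllib.request.Request(url, data=ssml.encode("utf-8"),
                                  headers=headers, method="POST")


req = build_tts_request(SSML, key="YOUR_SPEECH_KEY")
print(req.get_method(), req.full_url)
```

With a valid key and region, passing the request to urllib.request.urlopen should return the audio bytes, which can then be written to an .mp3 file; that step is left out here on purpose.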

As the technology iterates rapidly, more and more convenient, easy-to-use products are appearing on the market. For example, Himalaya's "Audio Scissors" supports one-stop AI audio creation, making audiobook production more efficient and less costly.

In the AI text-to-speech space specifically, popular overseas products such as ElevenLabs offer limited support for Chinese, while Chinese products such as MiniMax and Volcano Engine can generate fairly fluent Chinese read-aloud audio but still fall short of the naturalness and emotional expression that podcast production requires.

For example, in the AI podcast experiment by "Vulgar Xiaoya", many listeners reported that the AI-generated voices obviously sounded as if they were "reading from a script".

Compared with short videos and audiobooks, the application of AI voice technology in podcast scenarios is still very limited.

"Number One AI Player" has learned from multiple sources that podcast creators currently use text generation models such as ChatGPT to speed up text work such as pre-production planning, content outlines, and episode summaries (show notes).

When it comes to voice generation, however, podcast production needs more than fluent reading. More importantly, the voice must convey emotion to deepen listeners' immersion and emotional resonance.

In addition, the strong IP attributes of podcast sound content and highly personalized expression are also issues that creators need to carefully consider when using AI-generated voices. These characteristics require AI to not only convey information accurately, but also be able to mimic human emotions and intonation to create an emotional connection with the audience.

For creators who are comfortable expressing their opinions through spoken language, a unique accent or intonation can be a differentiator and help shape the creator's personal style.

As AI-generated and cloned voices become more and more realistic, some content creators have begun using AI to produce frequently updated news-style audio broadcasts.

For example, the host of the podcast "Crossroads" once revealed in an episode that for "Kuaidao Radio Station", an AI news program by co-founder Kuaidao Tsing Yi, only the scripts are written; the voice is generated entirely by AI, and the result sounds quite natural.

The development of AI voice technology has undoubtedly provided new tools and possibilities for content creators.

In particular, Microsoft's recent launch of the multilingual version of Xiaoxiao further demonstrates the potential of AI voice technology in podcast production. It is foreseeable that as technology lowers the barrier to content creation, competition among creators will become fiercer than ever, and how to create differentiated content is a question every creator needs to think about.
