Microsoft Neural TTS's new model presents a human-like emotional AI voice

2022-01-27 16:07:22

IT Home, January 27: Making AI voices effectively imitate the rich dynamics and emotions of human conversation has become a common challenge for researchers around the world. According to Microsoft, Azure Neural TTS (neural text-to-speech) recently launched a new-generation model, "Uni-TTS v4", which marks a milestone breakthrough in this field. In the "Blizzard Challenge 2021" test, speech generated by Uni-TTS v4 was almost indistinguishable from natural human speech on a common dataset, showing that it is strong enough to rival real human speakers in dialogue.


Hearing is believing: Microsoft also released several comparisons between TTS output and real human recordings, so readers can judge the realistic voice performance of the new model for themselves.

English: The visualizations of the vocal quality continue in a quartet and octet.

Human recording:

Uni-TTS v4:

English: Like other visitors, he is a believer.

Chinese: In addition, it is necessary to avoid current geopolitical risks and wait for the right time to step in.

Users can create new demos with their own text in the Azure TTS online service. Uni-TTS v4 currently supports 8 voices in 7 languages in the TTS language library. The R&D team will continue to use the latest models to improve the other languages supported by Neural TTS and custom neural voices, so that users can directly get better TTS voices through the Azure TTS API, Microsoft Office, and the Edge browser.
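For readers who want to try their own text, requests to neural voices are commonly expressed in SSML (Speech Synthesis Markup Language). The following is a minimal sketch that only builds such a document. The helper name `build_ssml` is hypothetical, and the voice name `en-US-JennyNeural` is an assumed placeholder: the article does not say which voices run on Uni-TTS v4. Sending the document to the service requires an Azure subscription and is not shown here.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document for one voice.

    The default voice name is an assumption for illustration only.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        # escape() protects characters such as & and < in user text
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    print(build_ssml("Like other visitors, he is a believer."))
```

Escaping the text matters because self-written demo sentences may contain characters that would otherwise break the XML.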

According to Microsoft, to improve TTS in these two respects, Uni-TTS v4 introduces two important updates in acoustic modeling. First, the research team adopted a new architecture that combines transformers and convolutions to better model local and global dependencies in the acoustic model. Second, it systematically models variation in speech from both explicit perspectives (speaker ID, language ID, pitch, speaking speed) and implicit perspectives (utterance-level and phoneme-level prosody). The explicit and implicit perspectives are learned with supervised and unsupervised learning, respectively, so that the end-to-end audio is both natural and expressive.
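The idea of pairing attention (global dependencies) with convolution (local dependencies) can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions, not the Uni-TTS v4 architecture: there are no learned projections, a single attention head, and one convolution kernel shared across all feature dimensions. The function names are hypothetical.

```python
import math


def self_attention(frames):
    """Global mixing: every frame attends to every frame in the sequence."""
    dim = len(frames[0])
    out = []
    for query in frames:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * value[d] for w, value in zip(weights, frames)) / total
            for d in range(dim)
        ])
    return out


def local_conv(frames, kernel):
    """Local mixing: each frame becomes a weighted sum of its neighbours.

    `kernel` must have odd length; positions outside the sequence count as zero.
    """
    half = len(kernel) // 2
    n, dim = len(frames), len(frames[0])
    out = []
    for t in range(n):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            src = t + j - half
            if 0 <= src < n:
                for d in range(dim):
                    acc[d] += w * frames[src][d]
        out.append(acc)
    return out


def mixed_block(frames, kernel):
    """Attention, then convolution, each wrapped in a residual connection."""
    attended = self_attention(frames)
    x = [[a + b for a, b in zip(f, g)] for f, g in zip(frames, attended)]
    conv = local_conv(x, kernel)
    return [[a + b for a, b in zip(f, g)] for f, g in zip(x, conv)]


if __name__ == "__main__":
    toy_frames = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4]]
    print(mixed_block(toy_frames, kernel=[0.25, 0.5, 0.25]))
```

The attention step lets distant frames influence each other, which suits sentence-level prosody, while the short kernel only blends adjacent frames, which suits fine-grained acoustic detail.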

IT Home understands that Neural TTS, a powerful speech synthesis capability in Microsoft Azure Cognitive Services, helps developers convert text into realistic, natural-sounding speech. It is commonly used for voice assistants, text read-aloud features, and accessibility tools. It has been integrated into Microsoft flagship products such as Edge Read Aloud, Immersive Reader, and Word Read Aloud, and has been adopted by AT&T, Duolingo, Progressive, and many other customers. Neural TTS offers more than 330 voices and supports nearly 130 languages and dialects from different countries and regions. Users and businesses can search for "Azure TTS" to reach the product website, where they can try Neural TTS's rich set of preset voices or record and upload their own samples to create unique custom voices.
