
When your childhood male god learns to switch between multiple languages seamlessly

ByteDance AI Lab's Speech & Audio team can now provide voices comparable to real human speech in more than 17 languages, 13 dialects, and 100+ styles, and its audio generation capabilities are gradually opening to the market through Volcano Engine.

If you often watch videos on Douyin or have used Jianying (CapCut) to create short videos, you will be familiar with the voices in the video below:

After hearing dubbing in different timbres and languages, listen to SpongeBob switch seamlessly among Chinese, English, and Japanese:

Whether it is rich multilingual dubbing or cross-language synthesis, these striking voice effects come from speech synthesis technology. The AI Lab Speech & Audio team (hereinafter, the SA team), which provides the technical support behind these capabilities, has recently deployed its latest upgraded multi-language and cross-language synthesis technology on the video creation tools Jianying and CapCut. Enterprise users can access the same audio technology through Volcano Engine.

Voices that are "intelligible", "authentic", and "versatile" are generated this way

As CapCut expanded into different countries and regions, the ByteDance SA team provided it with the ability to synthesize the languages used locally. Offering a rich and diverse set of voices that fit local culture and local creators' preferences poses great challenges in the number of languages, richness of timbre, authenticity of pronunciation, expressiveness of style, and speed of production.

In traditional TTS (text-to-speech) production, a voice talent who speaks the language natively records a large amount of high-quality speech data; a team with professional training in that language annotates the data; and finally the corresponding voice is trained with synthesis technology and deployed online. When the target is multilingual synthesis, however, the traditional approach faces the following problems:

Difficult data acquisition: laws in different countries place different restrictions on deep synthesis technology, and outside countries and regions with well-developed dubbing industries such as China, the United States, and Japan, professionally trained, high-quality voice talents are scarce, so the pool of candidate speakers is limited.

High professional requirements: the recorded audio must be annotated by professionals who understand the language, and for some minority languages such professionals are extremely hard to find.

Difficult training: under the traditional framework, it is hard to model the prosody of different languages and styles at a fine-grained level, so the expressiveness of synthesized voices struggles to meet creators' rising expectations.

High cost: compared with Chinese, multilingual production incurs higher costs in voice talent, professional staffing, and production workflow.

To solve these four problems, the ByteDance SA team proposed a multi-language and cross-language synthesis scheme that produces "intelligible", "authentic", and "versatile" voices in batches, at low cost and high efficiency.

"Intelligible" means pronunciation that is accurate, clear, and easy to understand.

"Authentic" means an accent that conforms to native-speaker habits.

"Versatile" means a monolingual voice talent can acquire multiple languages and accents.

The scheme makes breakthroughs in two directions: fine-grained prosody modeling and cross-language transfer.

Fine-grained prosody modeling: building a matrix of voices across languages, accents, and styles

Different languages, dialects, and styles each have their own prosodic characteristics, with different patterns of speaking rate, intonation, and stress. These fine-grained prosodic features significantly affect the accuracy and authenticity of pronunciation, especially in languages such as English whose intonation is organized around pitch accents, and the traditional end-to-end neural network framework struggles to implicitly model and control such fine-grained prosodic variation.

To solve the problem of fine-grained prosody modeling, the ByteDance SA team developed an acoustic model architecture for phoneme-level prosody modeling ("Fine-grained prosody modeling in neural speech synthesis using ToBI representation", Yuxiang Zou et al., Interspeech 2021). By introducing phoneme-level ToBI prosody features (pitch accents, phrase accents, and boundary tones), combined with a variance adaptor built from phoneme-level pitch and energy, the model can control intonation and stress-pattern changes at the syllable, phrase, and sentence levels respectively. Compared with traditional implicit prosody learning, this scheme achieves more accurate and authentic pronunciation, reaching the goals of "intelligible" and "authentic" within a single language.
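As a rough illustration of the idea (not the team's actual implementation), a FastSpeech 2-style variance adaptor can be extended with phoneme-level ToBI conditioning. Everything below is a simplified assumption: the label inventories are heavily reduced, the embeddings are random stand-ins for learned tables, and the pitch/energy projections are toy scalars.

```python
import numpy as np

# Illustrative (heavily reduced) ToBI label inventories.
PITCH_ACCENTS = ["NONE", "H*", "L*", "L+H*"]
PHRASE_ACCENTS = ["NONE", "H-", "L-"]
BOUNDARY_TONES = ["NONE", "H%", "L%"]

DIM = 8  # toy hidden size
rng = np.random.default_rng(0)

# One embedding table per ToBI feature stream (randomly initialized here;
# in a real model these would be learned).
emb_pa = rng.normal(size=(len(PITCH_ACCENTS), DIM))
emb_ph = rng.normal(size=(len(PHRASE_ACCENTS), DIM))
emb_bt = rng.normal(size=(len(BOUNDARY_TONES), DIM))

def variance_adaptor(phoneme_hidden, pa_ids, ph_ids, bt_ids, pitch, energy):
    """Add phoneme-level ToBI, pitch, and energy conditioning onto the
    encoder hidden states, in the additive style of FastSpeech 2."""
    tobi = emb_pa[pa_ids] + emb_ph[ph_ids] + emb_bt[bt_ids]
    # Stand-in for learned pitch/energy projections: broadcast the scalars.
    variance = 0.1 * pitch[:, None] + 0.1 * energy[:, None]
    return phoneme_hidden + tobi + variance

# A toy 3-phoneme sequence with an accent on the first and last phoneme.
hidden = np.zeros((3, DIM))
out = variance_adaptor(
    hidden,
    pa_ids=np.array([1, 0, 3]),   # H*, NONE, L+H*
    ph_ids=np.array([0, 0, 2]),   # phrase accent L- on the last phoneme
    bt_ids=np.array([0, 0, 2]),   # boundary tone L% on the last phoneme
    pitch=np.array([0.5, 0.2, 0.9]),
    energy=np.array([1.0, 0.8, 1.2]),
)
print(out.shape)  # → (3, 8)
```

The point of the explicit ToBI stream is controllability: changing a single pitch-accent label shifts the conditioning for that phoneme alone, which implicit end-to-end prosody learning does not expose.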


Audio sample: Was there a lot of music?

With fine-grained prosody modeling: Was there a lot of music?

Cross-language transfer: breaking the resource bottleneck so the same voice can speak many languages

Although fine-grained prosody modeling can achieve more accurate and authentic synthesis, it still requires the voice talent to have the corresponding language ability, and it requires a certain amount of data. This greatly limits the scalability of TTS, making it hard to keep up with the pace of business expansion and the rapid turnover of video-creation trends and viral voices.

So how can a voice talent break through this limitation? Achieving the goal of "versatile" is the key to scaling up speech synthesis capacity.

The ByteDance SA team applied transfer learning to speech synthesis and, combined with unsupervised representation learning, developed an acoustic model framework for cross-language transfer. Its core is to solve feature-space decoupling and distribution mapping: through SCRN and unsupervised representations, speaker, prosody, style, and other features are decoupled, and different languages are mapped into the same pronunciation space. With cross-language transfer, a non-native speaker's voice can be made to speak at native-speaker level, achieving "authentic" and "versatile" across languages.
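A minimal sketch of the decoupling idea, under the assumption that speaker timbre and language identity are represented as separate conditioning vectors that the synthesizer consumes independently. The names, shapes, and lookup tables below are illustrative, not the SA team's actual SCRN-based model:

```python
import numpy as np

rng = np.random.default_rng(7)
DIM = 4  # toy embedding size

# Decoupled lookup tables: speaker timbre and language identity live in
# separate spaces, so any speaker can be paired with any language.
speakers = {"cn_female_01": rng.normal(size=DIM),
            "en_male_02": rng.normal(size=DIM)}
languages = {"zh": rng.normal(size=DIM),
             "en": rng.normal(size=DIM),
             "id": rng.normal(size=DIM)}

def condition(text_hidden, speaker, language):
    """Attach speaker and language conditioning to every frame of the
    text encoder output; a decoder would consume the result."""
    frames = text_hidden.shape[0]
    spk = np.tile(speakers[speaker], (frames, 1))
    lang = np.tile(languages[language], (frames, 1))
    return np.concatenate([text_hidden, spk, lang], axis=1)

hidden = np.zeros((5, DIM))  # 5 phoneme frames of encoder output

# Cross-language transfer: keep the speaker code, swap the language code,
# and the same timbre "speaks" a language its owner never recorded.
native = condition(hidden, "cn_female_01", "zh")
transferred = condition(hidden, "cn_female_01", "id")
```

The hard part, which this sketch omits entirely, is making the decoupling hold during training, i.e. preventing speaker identity from leaking into the language or prosody representations; that is what the unsupervised representation learning in the article addresses.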


English original: Would you like to pay in cash or credit cards?

Indonesian transfer: It has been registered in the POM so that its quality is guaranteed.

Brazilian Portuguese transfer: Compre sua máquina de cartão de crédito e débito.

To improve labeling efficiency, the researchers also developed automatic slicing and annotation tools; with an automated labeling pipeline in place, data annotation is no longer a bottleneck.
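The article does not describe the tools' internals, but the slicing step can be sketched as simple energy-based voice-activity segmentation: cut the recording wherever the signal stays quiet for long enough. The thresholds, frame sizes, and the synthetic test signal below are all illustrative assumptions.

```python
import numpy as np

def auto_slice(samples, sr, frame_ms=20, threshold=0.01, min_gap_frames=5):
    """Split a recording into utterance segments wherever the frame energy
    stays below `threshold` for at least `min_gap_frames` frames."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame
    energy = np.array([np.mean(samples[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    voiced = energy > threshold
    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i              # a new utterance begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:  # pause long enough: close the segment
                segments.append((start * frame, (i - gap + 1) * frame))
                start, gap = None, 0
    if start is not None:              # audio ended while still voiced
        segments.append((start * frame, n_frames * frame))
    return segments

# Synthetic example: two tone bursts separated by silence (sr = 1000 Hz).
sr = 1000
tone = lambda n: 0.5 * np.sin(2 * np.pi * 50 * np.arange(n) / sr)
audio = np.concatenate([np.zeros(200), tone(300),
                        np.zeros(300), tone(200), np.zeros(100)])
segments = auto_slice(audio, sr)
print(segments)  # → [(200, 500), (800, 1000)]
```

A production pipeline would typically replace the energy threshold with a trained voice-activity detector and pair the slicer with forced alignment for the annotation side, but the segment bookkeeping looks much like this.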


Through continuous technical exploration and iteration, and active adaptation to the needs of users in different countries and regions, the SA team can now provide voices comparable to real human speech in more than 17 languages, 13 dialects, and 100+ styles. It has also made breakthroughs in cross-language transfer, successfully applied to video dubbing scenarios, providing better localized dubbing for creators at home and abroad and earning wide praise from users in many countries and regions.

Take a look at the voices of real users:


Translation: CapCut's text-reading function is really powerful. "Fang ちゃん" (萌華) has such a cute doll-like voice, so natural... Is there a handsome-uncle voice too? twitter@mikisandayo_


Translation: CapCut's new text-to-speech voices are so varied and super kawaii! Which voice do you like?

As its technical capabilities are continually validated in the business, the real voices of users grow louder. The SA team's audio generation capabilities are also gradually opening to the market through Volcano Engine, providing leading audio technology to partners across industries: rich dubbing options that spark creativity for interactive-entertainment users; an immersive listening experience and high-quality AI narrators for novel readers; and assistant voices that cut costs and raise efficiency for intelligent-interaction companies and hardware manufacturers. The team has also reached cooperation with leading customers in video editing, audiobooks, automobiles, e-commerce, and other industries, successfully extending these capabilities across many sectors.

About the ByteDance AI Lab Speech & Audio team

ByteDance AI Lab's Speech & Audio team is committed to providing AI capabilities and solutions such as audio understanding, audio synthesis, conversational interaction, music retrieval, and intelligent tutoring for the company's businesses. Since its formation in 2017, the team has focused on developing industry-leading intelligent speech technology and on exploring combinations of AI with business scenarios to create greater user value. It provides AI solutions for ByteDance's flagship products such as Toutiao, Douyin, Jianying, Xigua Video, Tomato Novel, the Feishu office suite, and the Dali smart education lamp, and has served hundreds of business partners to date. With the rapid growth of ByteDance's business, the SA team's speech recognition and speech synthesis now cover multiple languages and dialects. Going forward, the team plans to expand to 70+ languages and 20+ dialects to meet the needs of content creation and communication platforms. The team has had 17 papers accepted at top AI conferences, 8 of them in the direction of audio generation.
