Taylor Swift speaks fluent Chinese, Guo Degang performs crosstalk in English... You have probably scrolled past these videos in recent days. So what core technologies lie behind these recently popular clips? Don't worry, Curious Baby will walk you through these core technologies in four easy steps.
Let's take a look at Guo Degang's English crosstalk first.
Guo Degang performs crosstalk in fluent English
Taylor Swift speaks fluent Chinese
Taylor Swift speaks fluent Chinese (AI-synthesized version)
The original version actually looked like this.
Taylor Swift speaks English (original version)
Now take a look at Watson speaking Chinese, and pay attention to how the Chinese mouth shapes differ from the English ones.
Watson speaks Chinese (AI-synthesized version)
Watson speaks English (original version)
And if you think that's the end of it, Trump and Mr. Bean are fluent in Chinese too.
Mr. Bean speaks English (original version)
Mr. Bean speaks Chinese (AI-synthesized version)
So what are the core technologies behind these videos?
In fact, the technologies involved can be broken down into four steps: 1. speech-to-text (speech recognition), 2. machine translation, 3. voice style conversion (voice cloning), and 4. lip-sync matching.
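The four steps above form a simple pipeline: each stage's output is the next stage's input. Here is a minimal sketch of that structure. The stage functions and their signatures are hypothetical stand-ins, not the article's actual implementation: a real system would call an ASR model, a machine-translation model, a voice-cloning model, and a lip-sync model (such as GeneFace++) at these four points.

```python
# Hypothetical four-stage dubbing pipeline. Each function body is a placeholder;
# in a real system each stage would invoke a trained model.

def speech_to_text(audio: bytes) -> str:
    """Stage 1: transcribe the original (Chinese) speech."""
    return "大家好"  # placeholder transcript

def translate(text: str) -> str:
    """Stage 2: translate the transcript into the target language."""
    return {"大家好": "Hello everyone"}.get(text, text)  # toy lookup

def synthesize_voice(text: str) -> bytes:
    """Stage 3: speak the translated text in the original speaker's voice."""
    return text.encode("utf-8")  # placeholder waveform

def lip_sync(video: bytes, new_audio: bytes) -> bytes:
    """Stage 4: regenerate mouth movements to match the new audio."""
    return video + new_audio  # placeholder composite

def dub_video(video: bytes, audio: bytes) -> bytes:
    """Chain the four stages: ASR -> translation -> voice cloning -> lip sync."""
    transcript = speech_to_text(audio)
    translated = translate(transcript)
    new_audio = synthesize_voice(translated)
    return lip_sync(video, new_audio)
```

The point of the sketch is only the data flow: audio becomes text, text becomes translated text, translated text becomes new audio, and the video is regenerated to match that audio.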
Let's take Mr. Cai Ming's English talk show as an example to illustrate this process:
Mr. Cai Ming's English talk show (AI-synthesized version)
Mr. Cai Ming's talk show (original version)
Step 1: Speech-to-text:
In the original video, the sketch Mr. Cai Ming performs is in Chinese, so to make him speak English, we first need the machine to understand what he said. Speech-to-text technology does exactly that: input a segment of speech, and the technology converts it into text. The process is just like WeChat's familiar speech-to-text feature, and it is a relatively mature technology in its own right.
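The article does not name a particular recognizer, but nearly every speech-to-text front-end begins the same way: the waveform is sliced into short overlapping frames before acoustic features are extracted. A small numpy illustration of that first step, with frame sizes typical of ASR systems (the exact values here are illustrative, not taken from the article):

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Slice a waveform into overlapping frames. Features such as mel
    spectrograms are then computed per frame and fed to the acoustic model."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# One second of 16 kHz audio, 25 ms frames with a 10 ms hop --
# settings commonly used in speech recognition.
audio = np.zeros(16000)
frames = frame_signal(audio, frame_len=400, hop=160)
print(frames.shape)  # (98, 400)
```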
Step 2: Intelligent Translation:
With the speech-to-text technology above, we can easily obtain the text of what Mr. Cai Ming said, but to make him speak English we also need to translate that Chinese text into English. This process works just like the translation software we use every day: Chinese goes in, English comes out automatically.
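Real machine translation is done by neural sequence-to-sequence models; the toy phrase-table lookup below is only meant to show the input/output contract of this stage (Chinese string in, English string out), and every phrase in it is invented for illustration.

```python
# A toy phrase-table "translator". Real systems use neural models;
# this lookup only demonstrates the stage's interface.
PHRASE_TABLE = {
    "大家好": "hello everyone",
    "谢谢": "thank you",
}

def translate_zh_to_en(sentence: str) -> str:
    # Fall back to the original text for unknown phrases -- something
    # a real translation model would never need to do.
    return PHRASE_TABLE.get(sentence, sentence)

print(translate_zh_to_en("大家好"))  # hello everyone
```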
Step 3: Voice Style Conversion:
You can hear in the video that Mr. Cai Ming's voice and intonation carry a native English accent, quite different from his original Chinese voice. So to make Mr. Cai Ming speak natural English, we also need to convert his voice into an English-accented one. This is where voice cloning, also called style conversion, comes in: it transforms content from one style into another. You may remember the clip of "Huang Rong played by Yang Mi" that went viral a while ago; it actually converted Zhu Yin's appearance into Yang Mi's. Here we perform the same kind of style conversion, only on the voice rather than the image.
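Modern voice cloning conditions a speech synthesizer on a speaker embedding, which is far beyond a short snippet. As a crude intuition for what "changing how a voice sounds" means at the signal level, the sketch below shifts the pitch of a tone by resampling it (which also changes its duration; real voice conversion does not work this way, this is purely illustrative):

```python
import numpy as np

def shift_pitch(signal: np.ndarray, factor: float) -> np.ndarray:
    """Crudely raise pitch by resampling (also shortens the signal).
    Real voice conversion uses neural models conditioned on a speaker
    embedding; this is only an intuition for altering voice characteristics."""
    idx = np.arange(0, len(signal), factor)
    return np.interp(idx, np.arange(len(signal)), signal)

def dominant_hz(x: np.ndarray, sr: int) -> float:
    """Return the strongest frequency component via an FFT."""
    spectrum = np.abs(np.fft.rfft(x))
    return float(np.fft.rfftfreq(len(x), 1 / sr)[spectrum.argmax()])

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)   # a 220 Hz tone
higher = shift_pitch(tone, 2.0)      # resampled: one octave higher (440 Hz)
```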
Huang Rong and Guo Jing - AI Synthesis Version
Huang Rong and Guo Jing - Original Video Version
Step 4: Lip Sync:
Besides giving the voice an English accent, the mouth shapes also have to match the spoken content. To achieve this, we need lip-sync technology. The AI generation techniques used here, such as GeneFace++, have become popular in recent years; readers who want the details can read that paper. In short, all we need to know is that such tools can match lip movements to new audio. If you look carefully at the video of Trump speaking Chinese, the Chinese mouth shapes do differ from those in the English version; the key point is that they match the Chinese audio.
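Audio-driven talking-face models such as GeneFace++ generate each video frame conditioned on the audio features around the corresponding instant, so at minimum the audio timeline must be aligned to video frames. A tiny sketch of that alignment step (the word timings and frame rate below are hypothetical values for illustration):

```python
def audio_time_to_frame(t_seconds: float, fps: float = 25.0) -> int:
    """Map an audio timestamp to the index of the video frame whose mouth
    shape must match it. Lip-sync models condition each frame on the audio
    features near this instant; the alignment itself is just time * fps."""
    return int(t_seconds * fps)

# Hypothetical word onsets (in seconds) from the synthesized English audio.
word_times = [("hello", 0.0), ("everyone", 0.48)]
frames = {word: audio_time_to_frame(t) for word, t in word_times}
print(frames)  # {'hello': 0, 'everyone': 12}
```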
GeneFace inference process
After reading these four steps, I believe you now understand the core technologies. What other interesting applications do you think they could enable? And do these AI creations worry you?